Ragas Evaluator
What are Ragas Evaluators?
Ragas Evaluators are specialized tools designed to evaluate the performance of retrieval-augmented generation (RAG) workflows. They focus on metrics like context relevance, faithfulness, recall, and robustness, ensuring that outputs derived from external knowledge bases or retrieval systems are accurate and reliable.
Why use Ragas Evaluators?
If your system retrieves information from external sources, these evaluators are essential. They ensure that responses are factually consistent, include all necessary details, and stay focused on relevant context. For applications like customer support or document summarization, Ragas Evaluators help guarantee the integrity and quality of your AI’s outputs.
Ragas Evaluator Response
Ragas Evaluators return a score between 0 and 1. Whatever the dimension being measured (relevance, faithfulness, etc.), the closer the score is to 1, the better the response performs on that dimension.
Example: when measuring the relevance of a response, a highly relevant answer will score close to 1.
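As a quick sketch of how these scores might be consumed downstream (the metric names and the 0.7 threshold below are illustrative choices, not Ragas defaults):

```python
# Hypothetical scores from a Ragas evaluator run; each value lies in [0, 1].
scores = {"context_precision": 0.91, "response_relevancy": 0.88, "faithfulness": 0.42}

THRESHOLD = 0.7  # illustrative cut-off, not a Ragas default

for metric, value in scores.items():
    # Closer to 1 means the response did better on that dimension.
    status = "ok" if value >= THRESHOLD else "needs review"
    print(f"{metric}: {value:.2f} ({status})")
```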
Example
Imagine a customer asks a chatbot, “What’s included in my insurance policy?” and the system retrieves chunks of information from a knowledge base. A Ragas Evaluator can verify if the retrieved chunks focus on the user’s question (e.g., home insurance details) and exclude irrelevant details (e.g., unrelated auto insurance policies). This ensures the response is accurate and useful.
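As a rough sketch, this kind of check could be run with the open-source ragas package along these lines (the API shown follows recent 0.2.x releases and shifts between versions; the evaluator model choice and the example data are assumptions for illustration):

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextPrecisionWithoutReference

# An evaluator LLM is required; gpt-4o-mini is an arbitrary choice here.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
metric = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="What's included in my insurance policy?",
    response="Your home insurance covers fire, theft, and water damage.",
    retrieved_contexts=[
        "Home insurance: covers fire, theft, and water damage.",   # relevant
        "Auto insurance: covers collision and liability claims.",  # irrelevant
    ],
)

# Scores closer to 1 indicate the retrieved chunks focus on the question.
score = asyncio.run(metric.single_turn_ascore(sample))
print(score)
```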
Ragas Evaluators can be found in the Hub, where many ready-to-use Evaluators are already available.
List of Ragas evaluators
Below you can find the list of Evaluators ready to be added to your project from within the Hub.
Evaluator | Description | Example |
---|---|---|
Ragas Context precision | Context Precision assesses how well the retrieved information aligns with the user’s query, focusing on accuracy and relevance in each retrieved chunk. This is particularly valuable in applications like customer support, where drawing on relevant context to provide precise answers is essential. Without a reference answer provided, the LLM compares each retrieved context chunk directly with the user’s query to evaluate relevance. With a reference, the comparison is made between the reference answer and the context chunks. | Imagine a customer asks an AI support bot, “What is covered under my home insurance policy?” The system retrieves information chunks, some related to home insurance and others about auto insurance. Without a reference, the evaluator checks relevance based on the user’s question, prioritizing home insurance chunks. With a reference, it compares the retrieved chunks to an ideal response about home insurance, filtering out irrelevant auto insurance details to ensure a precise answer. |
Ragas Response relevancy | Response Relevancy evaluates how well the generated answer directly responds to the original question, ensuring relevance and conciseness. The metric calculates relevancy by comparing the similarity between the user’s question and rephrased questions generated from the answer, with higher similarity indicating stronger alignment. This is particularly valuable in applications like customer support or Q&A, where clear and focused answers improve user satisfaction. | Suppose a customer asks a bank's chatbot, “What are the fees for international transfers?” and the AI responds, “Our international transfer fee is $15 per transaction.” The response relevancy evaluator would generate similar questions from the answer and compare them to the original question. Since the response directly addresses the question without adding irrelevant details, it would score highly. If the response instead included unrelated information about domestic transfer fees, the score would be lower, reflecting the importance of focused answers in customer service. |
Ragas Faithfulness | Faithfulness evaluates the factual consistency of a generated answer against the provided context, ensuring that all claims in the response can be directly supported by the given information. A high faithfulness score indicates that the response accurately reflects the context without introducing unsupported or incorrect details. This metric is especially useful in customer support or documentation systems, where providing reliable and accurate answers according to the context is essential. | Imagine a user asks an HR chatbot, “What is the company’s policy on remote work?” and the AI responds, “Employees can work remotely up to three days a week.” The faithfulness evaluator would cross-check each claim (in this case, “three days a week”) against the company’s official policy document. If the retrieved context confirms this information, the answer would score highly for faithfulness. However, if the policy actually allows only two days, the score would be lower, indicating that the response inaccurately represented the context. |
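To see how the three metrics above fit together, here is a minimal sketch using the open-source ragas package (imports follow the 0.1.x API, where Response Relevancy is exposed as answer_relevancy; names differ in later releases, and the insurance data is made up for illustration):

```python
from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One illustrative RAG interaction; evaluate() expects these column names.
data = {
    "question": ["What is covered under my home insurance policy?"],
    "answer": ["Your home insurance covers fire, theft, and water damage."],
    "contexts": [[
        "Home insurance policy: covers fire, theft, and water damage.",
        "Auto insurance policy: covers collision and liability.",  # noise
    ]],
    "ground_truth": ["Home insurance covers fire, theft, and water damage."],
}

# Requires an evaluator LLM to be configured (an OpenAI key by default).
result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, answer_relevancy, faithfulness],
)
print(result)  # each metric is reported as a score between 0 and 1
```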