Ragas Evaluators

What are Ragas Evaluators?

Ragas Evaluators are specialized tools designed to evaluate the performance of retrieval-augmented generation (RAG) workflows. They focus on metrics like context relevance, faithfulness, recall, and robustness, ensuring that outputs derived from external knowledge bases or retrieval systems are accurate and reliable.


Why use Ragas Evaluators?

If your system retrieves information from external sources, these evaluators are essential. They ensure that responses are factually consistent, include all necessary details, and stay focused on relevant context. For applications like customer support or document summarization, Ragas Evaluators help guarantee the integrity and quality of your AI’s outputs.


Ragas Evaluator Response

Ragas Evaluators return a numeric score between 0 and 1. The better the output performs on the dimension being measured (relevance, faithfulness, etc.), the closer the score is to 1.

Example: When measuring the relevance of a response, a highly relevant answer receives a score close to 1.
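
For illustration, here is a minimal sketch of scoring a single sample with the open-source ragas Python package (a 0.2-style API is assumed, with an OpenAI model wrapped via LangChain; the model name and sample texts are placeholders):

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

# Wrap any chat model as the evaluator LLM (requires OPENAI_API_KEY).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

sample = SingleTurnSample(
    user_input="What's included in my insurance policy?",
    response="Your home policy covers fire, theft, and water damage.",
    retrieved_contexts=[
        "The Standard Home policy covers fire, theft, and water damage."
    ],
)

score = asyncio.run(Faithfulness(llm=evaluator_llm).single_turn_ascore(sample))
print(score)  # a float between 0.0 and 1.0; closer to 1.0 means more faithful
```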


Example

Imagine a customer asks a chatbot, “What’s included in my insurance policy?” and the system retrieves chunks of information from a knowledge base. A Ragas Evaluator can verify if the retrieved chunks focus on the user’s question (e.g., home insurance details) and exclude irrelevant details (e.g., unrelated auto insurance policies). This ensures the response is accurate and useful.


List of Ragas evaluators

Context precision

Context Precision assesses how well the retrieved information aligns with the user’s query, focusing on accuracy and relevance in each retrieved chunk. This is particularly valuable in applications like customer support, where precise, on-topic answers are essential. Without a reference answer provided, the LLM compares each retrieved context chunk directly with the user’s query to evaluate relevance. With a reference, the comparison is made between the reference answer and the context chunks.

Example: Imagine a customer asks an AI support bot, “What is covered under my home insurance policy?” The system retrieves information chunks, some related to home insurance and others about auto insurance. Without a reference, the evaluator checks relevance based on the user’s question, prioritizing home insurance chunks. With a reference, it compares the retrieved chunks to an ideal response about home insurance, filtering out irrelevant auto insurance details to ensure a precise answer.
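
As a rough sketch, both modes can be scored directly with the open-source ragas package (the 0.2-style class names LLMContextPrecisionWithoutReference and LLMContextPrecisionWithReference are assumed, and the chunk texts are illustrative):

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextPrecisionWithoutReference,
    LLMContextPrecisionWithReference,
)

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
contexts = [
    "Home insurance covers fire, theft, and water damage.",      # relevant
    "Auto insurance covers collision and roadside assistance.",  # irrelevant
]

# Without a reference: chunks are judged against the user's question/response.
no_ref = SingleTurnSample(
    user_input="What is covered under my home insurance policy?",
    response="Your home insurance covers fire, theft, and water damage.",
    retrieved_contexts=contexts,
)
print(asyncio.run(
    LLMContextPrecisionWithoutReference(llm=llm).single_turn_ascore(no_ref)))

# With a reference: chunks are judged against the ideal answer instead.
with_ref = SingleTurnSample(
    user_input="What is covered under my home insurance policy?",
    reference="Home insurance covers fire, theft, and water damage.",
    retrieved_contexts=contexts,
)
print(asyncio.run(
    LLMContextPrecisionWithReference(llm=llm).single_turn_ascore(with_ref)))
```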
Response relevancy

Response Relevancy evaluates how well the generated answer directly responds to the original question, ensuring relevance and conciseness. The metric calculates relevancy by comparing the similarity between the user’s question and rephrased questions generated from the answer, with higher similarity indicating stronger alignment. This is particularly valuable in applications like customer support or Q&A, where clear and focused answers improve user satisfaction.

Example: Suppose a customer asks a bank's chatbot, “What are the fees for international transfers?” and the AI responds, “Our international transfer fee is $15 per transaction.” The response relevancy evaluator would generate similar questions from the answer and compare them to the original question. Since the response directly addresses the question without adding irrelevant details, it would score highly. If the response instead included unrelated information about domestic transfer fees, the score would be lower, reflecting the importance of focused answers in customer service.
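
A sketch of this scenario with ragas is below (assuming the 0.2-style ResponseRelevancy metric, which needs both an LLM and an embedding model because it compares question embeddings; all names are illustrative):

```python
import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ResponseRelevancy

# The metric regenerates questions from the answer and compares their
# embeddings to the original question, so it needs an LLM and embeddings.
metric = ResponseRelevancy(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
)

sample = SingleTurnSample(
    user_input="What are the fees for international transfers?",
    response="Our international transfer fee is $15 per transaction.",
)
# A focused answer like this should score close to 1.0.
print(asyncio.run(metric.single_turn_ascore(sample)))
```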
Faithfulness

Faithfulness evaluates the factual consistency of a generated answer against the provided context, ensuring that all claims in the response can be directly supported by the given information. A high faithfulness score indicates that the response accurately reflects the context without introducing unsupported or incorrect details. This metric is especially useful in customer support or documentation systems, where providing reliable and accurate answers according to the context is essential.

Example: Imagine a user asks an HR chatbot, “What is the company’s policy on remote work?” and the AI responds, “Employees can work remotely up to three days a week.” The faithfulness evaluator would cross-check each claim (in this case, “three days a week”) against the company’s official policy document. If the retrieved context confirms this information, the answer would score highly for faithfulness. However, if the policy actually allows only two days, the score would be lower, indicating that the response inaccurately represented the context.
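
The HR example as a minimal sketch, assuming ragas 0.2's Faithfulness metric and the same LangChain/OpenAI wrapper setup as in the earlier snippets:

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# Every claim in the response is cross-checked against the retrieved context.
sample = SingleTurnSample(
    user_input="What is the company's policy on remote work?",
    response="Employees can work remotely up to three days a week.",
    retrieved_contexts=[
        "Remote work policy: employees may work remotely up to two days per week."
    ],
)
# The "three days" claim contradicts the context, so expect a low score.
print(asyncio.run(Faithfulness(llm=llm).single_turn_ascore(sample)))
```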
Context entity recall

Context Entity Recall measures how effectively the system retrieves essential entities (e.g., people, places, events) present in the reference answer, checking that no critical entities are overlooked in the retrieved content. This is particularly useful in fact-based applications like tourism information, historical databases, or customer support, where specific entities must be included for an accurate response.

Example: Imagine a user asks, “Tell me about the Taj Mahal,” and the system retrieves several pieces of context. The reference answer mentions entities like “Shah Jahan,” “Mumtaz Mahal,” “UNESCO World Heritage Site,” and “Agra.” The context entity recall evaluator checks if these entities are also present in the retrieved context. A high recall score confirms that all essential entities are included, ensuring a comprehensive response about the Taj Mahal.
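
A sketch of the Taj Mahal example, assuming ragas 0.2 exposes this metric as ContextEntityRecall (it compares entities in the reference against the retrieved contexts; the texts are illustrative):

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ContextEntityRecall

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# Entities from the reference (Shah Jahan, Mumtaz Mahal, Agra, ...) must
# appear somewhere in the retrieved contexts for a high recall score.
sample = SingleTurnSample(
    reference=(
        "The Taj Mahal in Agra was built by Shah Jahan for Mumtaz Mahal "
        "and is a UNESCO World Heritage Site."
    ),
    retrieved_contexts=[
        "The Taj Mahal, located in Agra, was commissioned by Shah Jahan "
        "in memory of his wife Mumtaz Mahal.",
        "It was designated a UNESCO World Heritage Site in 1983.",
    ],
)
print(asyncio.run(ContextEntityRecall(llm=llm).single_turn_ascore(sample)))
```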
Context recall

Context Recall measures how well a system retrieves all necessary information from a context by comparing it against a reference answer, ensuring no critical details are missed. A high context recall score indicates that the system has included all relevant information needed to address the question accurately. This is particularly valuable in applications like research assistance or document retrieval, where completeness of information is essential.

Example: Suppose a user asks, “What are the main benefits of product X?” and the system retrieves various details about the product. If the reference answer lists benefits such as “improved efficiency,” “cost savings,” and “ease of use,” the context recall evaluator would check if all these points are covered in the retrieved context. A high recall score means that the system successfully retrieved all these key benefits, providing a comprehensive response.
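
A sketch using the LLM-based variant of this metric (assumed to be LLMContextRecall in ragas 0.2); the product texts are made up for illustration:

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# Each claim in the reference answer must be attributable to the retrieved
# context; missing claims lower the recall score.
sample = SingleTurnSample(
    user_input="What are the main benefits of product X?",
    response="Product X improves efficiency, saves costs, and is easy to use.",
    reference=(
        "Product X offers improved efficiency, cost savings, and ease of use."
    ),
    retrieved_contexts=[
        "Product X improves efficiency and reduces operating costs.",
        "Users praise product X for its ease of use.",
    ],
)
print(asyncio.run(LLMContextRecall(llm=llm).single_turn_ascore(sample)))
```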
Noise sensitivity

Noise Sensitivity assesses how well a system maintains accurate responses when exposed to irrelevant or distracting information in the retrieved context. A lower noise sensitivity score indicates that the system focuses on relevant details without being misled by unrelated content. This is especially useful in applications that involve information retrieval, like customer support systems and search engines.

Example: Suppose a user asks, “What is the Life Insurance Corporation of India (LIC) known for?” and the system retrieves both relevant details (like “LIC is the largest insurance company in India”) and irrelevant ones (like “The Indian economy is growing fast”). The noise sensitivity evaluator checks if the system’s response stays focused on LIC’s attributes, ignoring unrelated economic information. A low score indicates that the system successfully filters out irrelevant data, providing an accurate response about LIC.
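
Finally, a hedged sketch of the LIC example, assuming ragas 0.2's NoiseSensitivity metric; check your installed version's docs for the exact class name and required fields:

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import NoiseSensitivity

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# The retrieved contexts mix relevant and irrelevant chunks; the metric
# measures how often the response is led astray by the noise (lower is better).
sample = SingleTurnSample(
    user_input="What is the Life Insurance Corporation of India (LIC) known for?",
    response="LIC is the largest insurance company in India.",
    reference="LIC is the largest insurance company in India.",
    retrieved_contexts=[
        "LIC is the largest insurance company in India.",   # relevant
        "The Indian economy is growing fast.",              # irrelevant noise
    ],
)
print(asyncio.run(NoiseSensitivity(llm=llm).single_turn_ascore(sample)))
```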