Ragas Evaluator

What are Ragas Evaluators?

Ragas Evaluators are specialized tools designed to evaluate the performance of retrieval-augmented generation (RAG) workflows. They focus on metrics like context relevance, faithfulness, recall, and robustness, ensuring that outputs derived from external knowledge bases or retrieval systems are accurate and reliable.

Why use Ragas Evaluators?

If your system retrieves information from external sources, these evaluators are essential. They ensure that responses are factually consistent, include all necessary details, and stay focused on relevant context. For applications like customer support or document summarization, Ragas Evaluators help guarantee the integrity and quality of your AI’s outputs.

Entities and Parameters

Before diving into specific evaluators, it's essential to understand the key entities involved in RAG evaluation:

| Entity | API Parameter | Description | Example |
| --- | --- | --- | --- |
| User Query | query | The original question or request from the user | "What are the benefits of our premium insurance plan?" |
| Knowledge Base Retrievals | retrievals | Array of document chunks retrieved from your knowledge base | ["Premium plan includes 24/7 support...", "Coverage extends to international travel..."] |
| Generated Response | output | The AI's answer based on the retrieved context | "Our premium plan offers comprehensive coverage including..." |
| Reference Answer | reference | A high-quality answer to compare against (may be ground truth or expert-written) | Human-written ideal response for the query |
| Model | model | The AI model used for evaluation | "openai/gpt-4o" |
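
To make the mapping concrete, the sketch below assembles these parameters into a single evaluation payload. The field names mirror the API parameters in the table; the commented-out run_ragas_evaluator call and the evaluator name are hypothetical placeholders, not a documented API.

```python
# Illustrative sketch: assembling the RAG evaluation entities into one payload.
# The field names mirror the API parameters listed above; `run_ragas_evaluator`
# is a hypothetical helper, not part of a documented SDK.

payload = {
    "query": "What are the benefits of our premium insurance plan?",
    "retrievals": [
        "Premium plan includes 24/7 support...",
        "Coverage extends to international travel...",
    ],
    "output": "Our premium plan offers comprehensive coverage including...",
    "reference": "A human-written ideal response for the query.",
    "model": "openai/gpt-4o",
}

# Hypothetical call; replace with however your platform exposes Ragas evaluators.
# score = run_ragas_evaluator("ragas-faithfulness", **payload)
```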

Ragas Evaluator Response

Ragas evaluators return a numeric score between 0 and 1. The closer the score is to 1, the more strongly the response exhibits the quality being measured (relevance, faithfulness, etc.).

Example: when measuring the relevance of a response, a highly relevant answer will score close to 1.
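
As an illustration, the snippet below shows one way you might act on these scores in application code. The score values and the 0.8 threshold are assumptions made for the example, not recommended settings.

```python
# Illustrative only: flag evaluator scores that fall below a chosen threshold.
# The scores and the 0.8 cut-off are example values, not recommendations.

scores = {
    "ragas_faithfulness": 0.92,
    "ragas_response_relevancy": 0.74,
}

THRESHOLD = 0.8  # assumed quality bar for this example

for metric, score in scores.items():
    status = "pass" if score >= THRESHOLD else "review"
    print(f"{metric}: {score:.2f} -> {status}")
```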

Example

Imagine a customer asks a chatbot, “What’s included in my insurance policy?” and the system retrieves chunks of information from a knowledge base. A Ragas Evaluator can verify if the retrieved chunks focus on the user’s question (e.g., home insurance details) and exclude irrelevant details (e.g., unrelated auto insurance policies). This ensures the response is accurate and useful.
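
To make the intuition behind this check concrete, here is a toy sketch of the ratio a context-precision style evaluation is built on. The chunks and the simple keyword test are invented for illustration; the real evaluator uses the configured model to judge relevance rather than keyword matching.

```python
# Toy illustration of the idea behind context precision: what fraction of the
# retrieved chunks are actually about the user's question? The keyword check
# below is only a stand-in for the model-based judgment the evaluator performs.

retrieved_chunks = [
    "Your home insurance policy covers fire and water damage.",
    "Home insurance includes theft protection up to $20,000.",
    "Auto insurance premiums are due on the 1st of each month.",  # irrelevant
]

relevant = [c for c in retrieved_chunks if "home insurance" in c.lower()]
precision = len(relevant) / len(retrieved_chunks)
print(f"Relevant chunks: {len(relevant)}/{len(retrieved_chunks)} -> precision {precision:.2f}")
```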

Ragas Evaluators can be found in the Hub, where many ready-to-use evaluators are already available.

List of Ragas evaluators

Below is the complete list of evaluators, with their parameter requirements and detailed descriptions:

| Name | Required Parameters | Optional Parameters | Description | Example |
| --- | --- | --- | --- | --- |
| Ragas Coherence | query, output, model | reference | Checks if the generated response presents ideas in a logical, organized manner. | ✅ Good: "First, log into your account. Then, navigate to settings. Finally, click 'Change Password'." ❌ Poor: "Click settings. Your account has security features. Navigate first to login. Change password option exists." |
| Ragas Conciseness | query, output, model | reference | Evaluates if the response conveys information clearly and efficiently, without unnecessary details. | ✅ Concise: "The meeting is at 2 PM." ❌ Verbose: "The meeting, which we scheduled earlier, is at 2 PM in the afternoon today." |
| Ragas Context Entities Recall | query, output, model | reference, retrievals | Measures how well your retrieval system captures important entities (people, places, things) mentioned in the ideal answer. | Ground truth mentions "John Smith, Sarah Jones, New York office" but retrieved documents only mention "John Smith, Sarah Jones" = 67% recall. |
| Ragas Context Precision | query, output, model | reference, retrievals | Measures what proportion of retrieved documents are actually relevant to the user's question. | User asks about "project deadlines" and 7 out of 10 retrieved documents discuss deadlines = 70% precision. |
| Ragas Context Recall | model, reference | query, output, retrievals | Measures if the retrieved documents contain all the information needed to answer the question properly. | Ideal answer has 4 key facts, but retrieved context only contains 3 of them = 75% recall. |
| Ragas Correctness | query, output, model | reference | Directly compares the AI's answer against the known correct answer for factual accuracy. | Generated: "The deadline is Friday" vs. Ground truth: "The deadline is Monday" = low correctness. |
| Ragas Faithfulness | query, output, model | retrievals | Ensures the AI's answer is factually consistent with the source documents it was given. | Context: "Budget increased 10%" but Answer: "Budget doubled" = low faithfulness. |
| Ragas Harmfulness | query, output, model | retrievals | Detects if the response could potentially cause harm to individuals, groups, or society. | A response containing discriminatory language or dangerous instructions would score high on harmfulness. |
| Ragas Maliciousness | query, output, model | retrievals | Identifies responses that might be trying to deceive, manipulate, or exploit users. | A response trying to trick someone into sharing passwords or personal information. |
| Ragas Noise Sensitivity | query, output, model | retrievals | Tests if the AI can maintain accuracy even when retrieved documents contain irrelevant information. | Correctly answering "What time is the meeting?" even when documents also contain unrelated budget information. |
| Ragas Response Relevancy | query, output, model | retrievals | Assesses how well the AI's answer addresses the specific question asked. | Question: "How do I reset my password?" Relevant answer gives reset steps vs. irrelevant answer about email settings. |
| Ragas Summarization | query, output, model | reference, retrievals | Evaluates how well a summary captures the important information from the source documents. | Summarizing a 20-page report by including all main points vs. missing key conclusions or adding irrelevant details. |
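
The table above can also be read as a parameter contract for each evaluator. The sketch below encodes a few of its rows and checks a payload before evaluation; the PARAMETER_REQUIREMENTS mapping and the validate_payload helper are illustrative assumptions, not part of the product API.

```python
# Illustrative parameter check based on the table above. The mapping and the
# validate_payload helper are examples only, not part of any documented SDK.

PARAMETER_REQUIREMENTS = {
    "Ragas Faithfulness": {"required": {"query", "output", "model"}, "optional": {"retrievals"}},
    "Ragas Context Precision": {"required": {"query", "output", "model"}, "optional": {"reference", "retrievals"}},
    "Ragas Context Recall": {"required": {"model", "reference"}, "optional": {"query", "output", "retrievals"}},
}

def validate_payload(evaluator: str, payload: dict) -> list[str]:
    """Return the required parameters that are missing from the payload."""
    spec = PARAMETER_REQUIREMENTS[evaluator]
    return sorted(spec["required"] - payload.keys())

payload = {
    "query": "What's included in my insurance policy?",
    "output": "Your policy covers...",
    "model": "openai/gpt-4o",
}

print(validate_payload("Ragas Faithfulness", payload))   # [] -> all required parameters present
print(validate_payload("Ragas Context Recall", payload)) # ['reference'] -> reference is missing
```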