Function Evaluators are ideal when you need clear, binary outcomes, such as verifying that a response includes a required phrase, adheres to a length limit, or contains valid links. Use them to ensure compliance, automate simple text validations, and establish robust guardrails for text generation.
Imagine you have a chatbot that generates responses for customer support. You can use a Function Evaluator to check if every response includes a specific term like “return policy” or to ensure that the response doesn’t exceed 200 characters. These evaluators act as a quality gate to keep outputs on track. Function Evaluators can be found in the Hub, which already offers many ready-to-use evaluators.
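To make this concrete, here is a minimal sketch of such a quality gate in Python. The function name and thresholds are illustrative only; they are not the implementation behind the Hub’s evaluators.

```python
# Hypothetical quality gate: require a phrase and enforce a length limit.
def passes_quality_gate(response: str,
                        required_phrase: str = "return policy",
                        max_length: int = 200) -> bool:
    """Return True when the response mentions the phrase and respects the length limit."""
    has_phrase = required_phrase.lower() in response.lower()
    within_limit = len(response) <= max_length
    return has_phrase and within_limit

print(passes_quality_gate("Our return policy allows returns within 30 days."))  # True
```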
BERT Score
Description
BERT Score checks how similar the generated text is to the reference answer by analyzing the meaning of each word in context, rather than just matching exact words. It uses embeddings - numerical representations of meaning - from the BERT model to capture deeper meaning, allowing it to identify similarities even when the wording differs. This makes BERT Score particularly useful for tasks like summarization, paraphrasing, and question answering, where capturing the intended meaning matters more than exact wording.
Example
Imagine an AI answers a question about a return policy with, “You can return items within 30 days.” BERT Score compares this to a reference like “Our return window is 30 days,” focusing on the meaning of words like “return” and “30 days.” This gives a high score since the sentences convey similar meanings, even though the wording is different.
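For illustration, the open-source bert-score package computes this metric directly; the snippet below is a sketch of the metric itself, not of the Hub evaluator’s internals.

```python
# pip install bert-score
from bert_score import score

candidates = ["You can return items within 30 days."]
references = ["Our return window is 30 days."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0]:.3f}")  # high for semantically similar sentences
```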
BLEU Score
Description
BLEU is a popular metric for evaluating the quality of machine-translated text by comparing it to one or more reference translations. It assesses precision, focusing on how many n-grams (short sequences of words) in the AI-generated translation match those in the reference text. BLEU also applies a brevity penalty to avoid high scores for overly short translations that may technically match but lack meaningful content.
Example
Imagine the AI translates “Je suis fatigué” as “I am tired.” BLEU compares this output to reference translations, such as “I’m tired,” and calculates the overlap in n-grams, like “I am” and “am tired.” With a strong overlap, BLEU would assign a high score, reflecting close alignment with the reference translation.
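As an illustration of the metric itself (not this platform’s implementation), NLTK’s sentence-level BLEU can be computed as below; for short sentences with imperfect matches you would normally also pass a smoothing function.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu

references = [["i", "am", "very", "tired", "today"]]  # tokenized reference translation(s)
candidate = ["i", "am", "very", "tired", "today"]     # tokenized AI translation

print(f"BLEU: {sentence_bleu(references, candidate):.3f}")  # 1.0 for a perfect n-gram match
```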
Contains
Description
The Contains evaluator checks if a specific word or phrase appears within a text, making it a simple, efficient tool for direct text matching. It doesn’t analyze context or meaning; it only confirms the presence of specific terms. This makes it ideal for binary tasks like keyword validation, content filtering, or ensuring compliance with required phrases.
Example
If an AI response needs to include the phrase “return policy,” the Contains evaluator scans for this exact term. For example, if the response says, “Our return policy allows…,” it would pass since “return policy” is detected. This evaluator is especially useful for quick compliance checks.
Contains All
Description
The Contains All evaluator checks if a text includes all required words or phrases, ensuring that each specified term is present. It doesn’t consider context, order, or meaning—only that every target substring appears somewhere in the text. This makes it ideal for tasks that require verifying multiple key terms, like ensuring all necessary points are mentioned in a response.
Example
Suppose an AI response about returns must include both “return policy” and “30 days” to be complete. The Contains All evaluator would check for both terms in the response. If the response says, “Our return policy allows returns within 30 days,” it would pass because both phrases are present, meeting the requirement for completeness.
Contains Any
Description
The Contains Any evaluator checks if a text includes at least one word or phrase from a specified list, confirming that any one of the target terms is present. It doesn’t consider context or exact meaning—only that one of the specified terms appears somewhere in the text. This makes it useful for tasks where the presence of any relevant keyword suffices, like detecting mentions of a topic or validating partial information.
Example
Suppose an AI response needs to mention at least one of the terms “refund,” “return policy,” or “exchange” to cover a policy-related query. The Contains Any evaluator would scan for any of these terms. If the response reads, “Our exchange policy allows…,” it would pass because “exchange” is present, meeting the requirement with just one of the specified terms.
Contains None
Description
The Contains None evaluator ensures that a text does not contain any of the specified words or phrases, confirming the absence of restricted terms. It’s often used in content moderation or quality control tasks where specific terms must be avoided.
Example
Suppose a review platform wants to restrict certain terms like “refund” or “return policy” in user reviews to avoid specific content. The Contains None evaluator would scan each review for these terms. If a review contains “I asked for a refund,” it would be flagged since the term “refund” appears.
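The Contains family of checks (Contains, Contains All, Contains Any, Contains None) reduces to simple substring logic. The sketch below is illustrative; the helper names are hypothetical, not the Hub’s actual functions.

```python
# Hypothetical helpers mirroring the Contains All / Any / None checks.
def contains_all(text: str, terms: list[str]) -> bool:
    return all(term.lower() in text.lower() for term in terms)

def contains_any(text: str, terms: list[str]) -> bool:
    return any(term.lower() in text.lower() for term in terms)

def contains_none(text: str, terms: list[str]) -> bool:
    return not contains_any(text, terms)

response = "Our return policy allows returns within 30 days."
print(contains_all(response, ["return policy", "30 days"]))  # True
print(contains_any(response, ["refund", "exchange"]))        # False
print(contains_none(response, ["refund", "exchange"]))       # True
```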
Contains Valid Link
Description
The Contains Valid Link evaluator checks if a text includes a valid, correctly structured URL. It goes beyond simple link detection by verifying that the URL format is correct, making it useful for tasks where valid links are necessary, such as confirming resource citations or verifying external references.
Example
Imagine an AI is generating responses that must include functional links to resources. The Contains Valid Link evaluator would check each response for a well-formed link. If the response says, “You can read more at http://example.com/resource,” it would pass if the URL is correctly structured.
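One reasonable way to approximate this check with the Python standard library is sketched below; the actual evaluator may apply different rules for what counts as a valid URL.

```python
import re
from urllib.parse import urlparse

def contains_valid_link(text: str) -> bool:
    """Return True if the text contains at least one structurally valid http(s) URL."""
    candidates = re.findall(r"https?://\S+", text)
    return any(urlparse(url).scheme in ("http", "https") and urlparse(url).netloc
               for url in candidates)

print(contains_valid_link("You can read more at http://example.com/resource"))  # True
print(contains_valid_link("You can read more on our website"))                  # False
```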
Cosine Similarity
Description
Cosine Similarity is a metric that measures the semantic similarity between generated and reference texts by comparing their vector embeddings—numerical representations of meaning. A higher score indicates stronger alignment in meaning. This is particularly useful for tasks like summarization, translation, and text generation, where capturing intent and meaning is more important than exact wording.
Example
In a paraphrasing task, Cosine Similarity can evaluate whether “The cat sat on the mat” and “A feline rested on a rug” convey the same meaning, despite different wording. By comparing the angle between their vector embeddings, Cosine Similarity assigns a score from 0 to 1, with higher scores indicating closer alignment in meaning.
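The snippet below sketches the computation using the sentence-transformers library for embeddings; the specific embedding model is an assumption for illustration, not necessarily the one used by the evaluator.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model (assumption)
emb_a, emb_b = model.encode(["The cat sat on the mat", "A feline rested on a rug"])

# Cosine similarity: dot product of the vectors divided by the product of their norms.
similarity = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"Cosine similarity: {similarity:.3f}")  # closer to 1 means closer in meaning
```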
Ends With
Description
The Ends With evaluator checks if a text concludes with a specified word or phrase, ensuring it finishes as required. It doesn’t analyze the content leading up to the end, focusing only on the final word or phrase. This is particularly useful for formatting tasks or validating that responses conclude with specific information or phrases.
Example
Suppose an automated email response system requires all messages to end with “Thank you for your time.” The Ends With evaluator would check each response to confirm it concludes with this phrase. If a response ends with something different, it would be flagged, ensuring consistency in the closing statement.
Exact Match
Description
Exact Match is a straightforward, binary metric that checks if the generated text matches the reference text exactly, character for character. It’s useful for highly structured or template-based tasks, or for simple fact-based responses where precise wording is required.
Example
Suppose an LLM is set up to respond with a specific closing phrase, “Thank you for your inquiry. We’ll get back to you within 24 hours.” The Exact Match evaluator would check if the response matches this exact phrase. If any word deviates, such as “24 hours” being written as “a day,” it would fail the check, ensuring consistency in fixed templates.
Length Between
Description
The Length Between evaluator checks if the text length falls within a specified range, ensuring it meets both minimum and maximum length requirements. It doesn’t analyze content, focusing solely on character or word count to confirm the text fits within set bounds. This is particularly useful for tasks where a specific range of information density is required, such as summary limits or form responses.
Example
Suppose a customer review must be between 50 and 200 characters to be accepted. The Length Between evaluator would check the review’s character count. If a review is 120 characters long, it would pass, as it falls within the specified range.
Length Greater Than
Description
The Length Greater Than evaluator checks if the text length exceeds a specified minimum, ensuring the content is long enough. It’s often used to avoid overly brief responses in contexts where depth or detail is expected, like essays or explanations.
Example
Imagine an AI-generated answer for a Q&A platform must be at least 100 characters long. The Length Greater Than evaluator would check the character count of each answer. If an answer is 150 characters, it would pass, as it meets the minimum length requirement.
Length Less Than
Description
The Length Less Than evaluator verifies that the text length is below a specified maximum, ensuring the content is concise. This can be helpful in contexts where brevity is important, such as social media posts or SMS messages.
Example
Suppose a message in a notification system must be under 160 characters to fit in an SMS. The Length Less Than evaluator would check the character count of each message. If a message is 140 characters long, it would pass, as it stays within the maximum length limit.
Levenshtein Distance
Description
Levenshtein Distance calculates the number of single-character edits (insertions, deletions, or substitutions) needed to turn the text into a reference text, measuring how similar or different two texts are. It’s particularly useful for tasks where exact phrases are required, such as detecting typos, validating data, or assessing minor variations in spelling. This metric is ideal for error detection in tasks needing precision, like spell-checking or validating data entry fields.
Example
Suppose the AI outputs “recieve” instead of “receive.” The Levenshtein distance here would be 2 (the swapped “i” and “e” each require a substitution), indicating a minor error. This helps catch small mistakes that could impact quality in tasks like proofreading or structured data validation.
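The metric itself is straightforward to compute with dynamic programming; this standalone sketch is for illustration (libraries such as python-Levenshtein or rapidfuzz provide optimized versions).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (0 if equal)
        prev = curr
    return prev[-1]

print(levenshtein("recieve", "receive"))  # 2: the swapped "i" and "e" need two substitutions
```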
METEOR Score
Description
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a scoring system used to evaluate the quality of machine-translated text by comparing it to a reference translation. Unlike simpler metrics, METEOR takes into account synonym matches, stemming, and word order, making it sensitive to language nuances and capable of recognizing semantically similar, but not identical, translations. This makes it highly effective for evaluating translation tasks that need to capture subtle linguistic variations.
Example
If the AI translates “Je suis fatigué” as “I’m feeling tired,” METEOR would compare this with a reference like “I am tired,” crediting stem and synonym matches and similar word order rather than requiring identical wording. This would result in a high METEOR score, as it captures both meaning and language nuances accurately.
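For illustration, NLTK ships a METEOR implementation (it relies on the WordNet corpus, and recent versions expect pre-tokenized input); this shows the metric itself, not the Hub evaluator’s internals.

```python
# pip install nltk; also run nltk.download("wordnet") once.
from nltk.translate.meteor_score import meteor_score

reference = ["i", "am", "tired"]
hypothesis = ["i", "am", "feeling", "tired"]

print(f"METEOR: {meteor_score([reference], hypothesis):.3f}")
```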
OpenAI Moderations API
Description
The OpenAI Moderations API, using a model built on GPT-4o, evaluates text to ensure it meets safety and appropriateness standards. It checks for content categories such as hate speech, violence, self-harm, and illegal activities, providing a reliable and nuanced assessment. This makes it invaluable for applications where user safety and adherence to guidelines are essential.
Example
Imagine an AI-generated response that includes language encouraging self-harm. The OpenAI Moderations tool would analyze this text, detect the mention of self-harm, and flag the response as unsafe. At the same time, it would confirm that other harmful categories, like hate speech, are not present, providing a comprehensive assessment of the content’s safety.
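A minimal call through the OpenAI Python SDK looks roughly like this; the model name and response fields follow the public API at the time of writing, so verify against the current documentation before relying on them.

```python
# pip install openai; expects OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
result = client.moderations.create(
    model="omni-moderation-latest",
    input="I want to hurt myself.",
)

print("Flagged:", result.results[0].flagged)
print(result.results[0].categories)  # per-category booleans (hate, self-harm, violence, ...)
```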
ROUGE-N
Description
ROUGE-N is a metric used to evaluate text summarization by measuring the overlap of n-grams (word sequences) between a generated summary and a reference summary. Specifically, ROUGE-N counts n-grams of a set length (e.g., ROUGE-1 for single words, ROUGE-2 for two-word sequences) to check how much key information from the reference is included in the generated summary. Unlike BLEU, ROUGE emphasizes recall, assessing how well the generated summary captures important details.
Example
Suppose an AI summarizes a news article, and the reference summary includes, “The election results were announced on Monday.” If the AI-generated summary includes parts like “results were announced,” ROUGE-N would calculate the score by measuring overlaps in n-grams, such as “results were” and “were announced,” to assess how closely the AI’s summary matches the reference.
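As an illustration of the metric, Google’s rouge-score package computes ROUGE-N directly; this is a sketch of the metric, not the Hub evaluator’s code.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
reference = "The election results were announced on Monday."
generated = "Results were announced Monday."

scores = scorer.score(reference, generated)  # score(target, prediction)
print(f"ROUGE-1 recall: {scores['rouge1'].recall:.2f}, ROUGE-2 recall: {scores['rouge2'].recall:.2f}")
```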
Valid JSON
Description
The Valid JSON evaluator checks if a text is in valid JSON format, ensuring it follows proper JSON syntax with correctly paired brackets, commas, and key-value structures. This is essential for applications that rely on structured data input, as it prevents errors in data processing, API requests, or database interactions.
Example
Suppose an API endpoint requires input in JSON format. The Valid JSON evaluator would verify the text adheres to JSON structure. If the input is malformed, it would be flagged as invalid JSON, allowing the user to correct it before it reaches the API.
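This check maps directly onto the standard library; a minimal sketch:

```python
import json

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"status": "ok", "items": [1, 2, 3]}'))  # True
print(is_valid_json('{"status": "ok",}'))                      # False (trailing comma)
```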
LLM Evaluators are perfect for scenarios where nuance matters: ensuring a text’s tone aligns with your brand voice, analyzing sentiment in customer feedback, or checking the grammar of a formal email. By leveraging advanced AI, these evaluators provide insights that go beyond simple rule-based checks, helping refine and elevate your content.
LLM Evaluators work like this: LLM 1 generates a response, and LLM 2 evaluates it. For instance, if LLM 1 writes an email, you can use LLM 2 to judge whether the email’s tone is professional and polite. These evaluators are like having an expert editor review your content automatically.
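The pattern looks roughly like the sketch below: one model produces the output and a second model scores it against a rubric. The prompt, model name, and JSON contract here are illustrative assumptions, not a fixed API of this platform.

```python
# pip install openai; expects OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
email_draft = "Hey, pay the invoice already. Thanks."  # output from LLM 1

judge_prompt = (
    "You are an evaluator. Decide whether the email below is professional and polite. "
    'Reply with JSON: {"score": 0 or 1, "explanation": "..."}\n\n' + email_draft
)

# LLM 2 acts as the judge.
judgement = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(judgement.choices[0].message.content)
```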
Age Appropriate
Description
The Age-Appropriate evaluator determines whether the generated text is appropriate for a specified age group. It is especially useful for tasks such as content moderation, educational material review, or ensuring that text is suitable for specific audiences based on age. This evaluator helps ensure that content aligns with age-appropriate language, tone, and themes.
Example
If tasked with evaluating a summarization of news for children under 8, the Age-Appropriate evaluator checks if the language is simple, the tone is gentle, and if complex or inappropriate themes are avoided. Based on these factors, it assigns a score of 1 if the text is appropriate or 0 if it is not, flagging any necessary adjustments for improvement.
Bot Detection
Description
The Bot Detection evaluator determines whether the provided text was likely generated by an AI. It is particularly useful in tasks involving content validation, academic integrity checks, or identifying automated text in various contexts. This evaluator helps distinguish between AI-generated and human-written content based on stylistic and structural clues.
Example
If a piece of text starts with phrases like “As an AI assistant” or shows repetitive patterns, the Bot Detection evaluator may flag it as likely AI-generated. It examines factors such as starting phrases, repetition, unnatural formality, and logical flow. Based on these characteristics, it assigns a binary score: 1 for AI-generated text and 0 for human-written, indicating whether further review is needed.
Fact Checking Knowledge Base
Description
The Fact-Checking evaluator assesses the truthfulness of a statement by referencing an internal knowledge base and widely accepted facts. It is especially useful for tasks like verifying claims in news, public discussions, or research contexts where factual accuracy is crucial. This evaluator helps ensure statements are aligned with verified information.
Example
If tasked with verifying the statement “Lionel Messi has won more Ballon d’Or awards than any other footballer,” the Fact-Checking evaluator checks the knowledge base for sports records to confirm or refute the claim. It assigns a score according to the PolitiFact scale: from 0 (pants on fire false) to 5 (true), or -1 if uncertain, reflecting the truthfulness of the statement. Along with the score, it provides a detailed explanation referencing trusted sources to justify the evaluation.
Grammar
Description
The Grammar evaluator checks whether the provided text is grammatically correct, focusing on grammar, punctuation, and overall clarity. This tool is especially useful for tasks like content editing, proofreading, or ensuring the accuracy of formal writing. It identifies issues that may affect readability and professionalism in written content.
Example
Suppose the text reads, “The company are planning to expand their operations next year, its going to be a big investment for them.” The Grammar evaluator would identify errors such as faulty subject-verb agreement (“The company are”) and the use of “its” where the contraction “it’s” is needed, and provide corrections. It assigns a binary score of 1 if the text is correct, or 0 if it contains errors, along with a corrected version when needed.
Localization
Description
The Localization evaluator assesses the quality of localized content, focusing on accuracy, grammar, cultural appropriateness, and overall user experience. It is particularly useful for tasks involving content adaptation across languages and regions, ensuring the localized text maintains both the intended meaning and appropriateness for the target audience. This evaluation is essential for creating effective, culturally relevant content for diverse audiences.
Example
Imagine the content “Join us for the Fourth of July sale” is localized as “7月4日のセール” for a Japanese audience. The Localization evaluator would examine whether the translation appropriately conveys the significance of the Fourth of July for a Japanese context, possibly suggesting a different approach if the cultural relevance is unclear. It assigns a score from 1 to 10, with a higher score indicating better localization quality and a detailed explanation highlighting any issues and suggestions for improvement.
PII
Description
The PII Anonymization evaluator checks whether personally identifiable information (PII) has been correctly removed or anonymized in the output, following general anonymization guidelines. It is especially useful in contexts requiring protection of sensitive information, such as legal, healthcare, and financial data. This evaluator ensures that all identifiable details are appropriately masked to maintain privacy.
Example
Suppose the input text contains “John Doe” and “123 Main Street,” and the output replaces these with placeholders like “[NAME_1]” and “[STREET_1].” The PII Anonymization evaluator would confirm that all PII was correctly anonymized. It assigns a binary score of 1 if all PII is anonymized, or 0 if any identifying information remains, along with an explanation pointing out any missed details or errors in the anonymization process.
Sentiment Classification
Description
The Sentiment Classification evaluator checks if the sentiment of the provided text - positive, negative, or neutral - has been correctly classified based on its content. It is particularly useful for tasks like customer feedback analysis or tracking sentiment related to companies, products, or services. This evaluator helps ensure that sentiment labels align accurately with the intended tone of the text.
Example
If the text reads, “The customer support team resolved my issue quickly and was very helpful throughout the process” and is classified as “positive,” the Sentiment Classification evaluator confirms this classification. It verifies the sentiment by analyzing factors like word choice and tone, assigning a score of 1 if the classification is correct or 0 if it is not, along with an explanation detailing any discrepancies.
Summarization
Description
The Summarization evaluator assesses the accuracy, completeness, and conciseness of a summary in relation to the original text. It is particularly useful for tasks like content summarization or abstract creation, where achieving a balance between brevity and detail is essential. This evaluator ensures that summaries capture the core points of the original content without extraneous information.
Example
If a text discusses the recent launch of a new smartphone by a major tech company, and the summary mentions key features, launch date, and pricing, the Summarization evaluator will check whether it accurately and completely represents the original information. The summary is scored from 1 to 10, with a higher score reflecting a more precise, complete, and concise summary, accompanied by an explanation highlighting any omissions or excessive details.
Tone of Voice
Description
The Tone of Voice evaluator checks whether the provided output aligns with the desired tone and writing style specified in the input. This is especially useful for tasks involving business communication, customer service, or social media, where tone consistency is crucial to conveying the right message. This evaluator ensures that responses match the intended style and tone guidelines.
Example
Suppose the input specifies a professional and respectful tone for an email regarding a delayed payment, and the output is polite, formal, and directly addresses the issue. The Tone of Voice evaluator would confirm that the tone aligns with the guidelines, assigning a score of 1 if it meets the requirement, or 0 if it does not. The evaluation includes feedback and suggestions for improvement if the tone needs adjustment.
Translation
Description
The Translation evaluator assesses whether the provided translation accurately conveys the meaning, tone, and style of the original text. It is especially useful for tasks like adapting marketing content, user manuals, or legal documents into different languages while maintaining the intended message and cultural nuances. This evaluator ensures that the translation is not just accurate but also culturally appropriate.
Example
If the original text says, “The early bird catches the worm,” and the Spanish translation is “El pájaro temprano atrapa el gusano,” the evaluator will check if the literal translation is fitting or if a culturally relevant phrase like “A quien madruga, Dios le ayuda” would better convey the intended meaning. The evaluator assigns a score from 1 to 10 based on accuracy, completeness, and cultural adaptation, providing a detailed explanation and suggestions for improvements.
Ragas Evaluators are specialized tools designed to evaluate the performance of retrieval-augmented generation (RAG) workflows. They focus on metrics like context relevance, faithfulness, recall, and robustness, ensuring that outputs derived from external knowledge bases or retrieval systems are accurate and reliable.
If your system retrieves information from external sources, these evaluators are essential. They ensure that responses are factually consistent, include all necessary details, and stay focused on relevant context. For applications like customer support or document summarization, Ragas Evaluators help guarantee the integrity and quality of your AI’s outputs.
Ragas Evaluators return a score between 0 and 1; the closer the score is to 1, the better the output performs on the dimension being measured (relevance, faithfulness, etc.). Example: when measuring the relevance of a response, a highly relevant answer returns a score close to 1.
Imagine a customer asks a chatbot, “What’s included in my insurance policy?” and the system retrieves chunks of information from a knowledge base. A Ragas Evaluator can verify if the retrieved chunks focus on the user’s question (e.g., home insurance details) and exclude irrelevant details (e.g., unrelated auto insurance policies). This ensures the response is accurate and useful.
Ragas Evaluators can be found in the Hub, which already offers many ready-to-use evaluators.
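For orientation, the open-source ragas library exposes the same family of metrics. The sketch below uses the ragas 0.1-style API (the column names and the `evaluate` call may differ in newer releases); the Hub evaluators handle the mapping to query, output, retrievals, and reference for you.

```python
# pip install ragas datasets  (API shown for ragas 0.1.x; newer versions differ)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What's included in my insurance policy?"],
    "answer": ["Your home insurance covers fire and theft."],
    "contexts": [["Home insurance policy: covers fire, theft, and water damage."]],
    "ground_truth": ["Home insurance covers fire, theft, and water damage."],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)  # each metric returns a score between 0 and 1
```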
Ragas Coherence
Required Parameters: query, output, model
Optional Parameters: reference
Description
Checks if the generated response presents ideas in a logical, organized manner.
Example
✅ Good: “First, log into your account. Then, navigate to settings. Finally, click ‘Change Password’.”
❌ Poor: “Click settings. Your account has security features. Navigate first to login. Change password option exists.”
Ragas Conciseness
Required Parameters: query, output, model
Optional Parameters: reference
Description
Evaluates if the response conveys information clearly and efficiently, without unnecessary details.
Example
✅ Concise: “The meeting is at 2 PM.”
❌ Verbose: “The meeting, which we scheduled earlier, is at 2 PM in the afternoon today.”
Ragas Context Entities Recall
Required Parameters: query, output, model
Optional Parameters: reference, retrievals
Description
Measures how well your retrieval system captures important entities (people, places, things) mentioned in the ideal answer.
Example
Ground truth mentions “John Smith, Sarah Jones, New York office” but retrieved documents only mention “John Smith, Sarah Jones” = 67% recall.
Ragas Context Precision
Required Parameters: query, output, model
Optional Parameters: reference, retrievals
Description
Measures what proportion of retrieved documents are actually relevant to the user’s question.
Example
User asks about “project deadlines” and 7 out of 10 retrieved documents discuss deadlines = 70% precision.
Ragas Context Recall
Required Parameters: model, reference
Optional Parameters: query, output, retrievals
Description
Measures if the retrieved documents contain all the information needed to answer the question properly.
Example
Ideal answer has 4 key facts, but retrieved context only contains 3 of them = 75% recall.
Ragas Correctness
Required Parameters: query, output, model
Optional Parameters: reference
Description
Directly compares the AI’s answer against the known correct answer for factual accuracy.
Example
Generated: “The deadline is Friday” vs. Ground truth: “The deadline is Monday” = low correctness.
Ragas Faithfulness
Required Parameters: query, output, model
Optional Parameters: retrievals
Description
Ensures the AI’s answer is factually consistent with the source documents it was given.
Example
Context: “Budget increased 10%” but Answer: “Budget doubled” = low faithfulness.
Ragas Harmfulness
Required Parameters: query, output, model
Optional Parameters: retrievals
Description
Detects if the response could potentially cause harm to individuals, groups, or society.
Example
A response containing discriminatory language or dangerous instructions would score high on harmfulness.
Ragas Maliciousness
Required Parameters: query, output, model
Optional Parameters: retrievals
Description
Identifies responses that might be trying to deceive, manipulate, or exploit users.
Example
A response trying to trick someone into sharing passwords or personal information.
Ragas Noise Sensitivity
Required Parameters: query, output, model
Optional Parameters: retrievals
Description
Tests if the AI can maintain accuracy even when retrieved documents contain irrelevant information.
Example
Correctly answering “What time is the meeting?” even when documents also contain unrelated budget information.
Ragas Response Relevancy
Required Parameters: query, output, model
Optional Parameters: retrievals
Description
Assesses how well the AI’s answer addresses the specific question asked.
Example
Question: “How do I reset my password?” Relevant answer gives reset steps vs. irrelevant answer about email settings.
Ragas Summarization
Required Parameters: query, output, model
Optional Parameters: reference, retrievals
Description
Evaluates how well a summary captures the important information from the source documents.
Example
Summarizing a 20-page report by including all main points vs. missing key conclusions or adding irrelevant details.