Function Evaluators

What are Function Evaluators?

Function Evaluators are rule-based tools designed to assess specific, measurable aspects of text, such as length, formatting, or keyword presence. They work like precise checklists—deterministic, reliable, and straightforward. These evaluators don’t interpret meaning; instead, they focus on specific patterns or structures.


Why use Function Evaluators?

They are ideal when you need clear, binary outcomes, such as verifying that a response includes a required phrase, adheres to a length limit, or contains valid links. Use them to ensure compliance, automate simple text validations, and establish robust guardrails for text generation.


Example

Imagine you have a chatbot that generates responses for customer support. You can use a Function Evaluator to check if every response includes a specific term like “return policy” or to ensure that the response doesn’t exceed 200 characters. These evaluators act as a quality gate to keep outputs on track.
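As a rough illustration, that kind of quality gate can be approximated in plain Python. The function below is an illustrative stand-in for the hosted evaluators, not their actual implementation.

```python
# Minimal sketch of the quality gate described above (illustrative only).
def passes_quality_gate(response: str) -> bool:
    mentions_policy = "return policy" in response.lower()  # Contains-style check
    short_enough = len(response) <= 200                    # length-style check
    return mentions_policy and short_enough

print(passes_quality_gate("Our return policy allows returns within 30 days."))  # True
```
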


List of Function Evaluators

Each evaluator below is listed with a description and an example.

Bert score
BERT Score checks how similar the text is to the reference answer by analyzing the meaning of each word in context, rather than just matching exact words. It uses embeddings (numerical representations of meaning) from the BERT model to understand deeper meaning, allowing it to identify similarities even when wording differs. This makes BERT Score particularly useful for tasks like summarization, paraphrasing, and question answering, where capturing the intended meaning matters more than exact wording.
Example: Imagine an AI answers a question about a return policy with, “You can return items within 30 days.” BERT Score compares this to a reference like “Our return window is 30 days,” focusing on the meaning of words like “return” and “30 days.” This gives a high score since the sentences convey similar meanings, even though the wording is different.

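A minimal sketch of computing a BERT Score locally, assuming the open-source bert-score package is installed (it downloads a model on first use); the hosted evaluator may use a different model or configuration.

```python
# pip install bert-score
from bert_score import score

candidates = ["You can return items within 30 days."]
references = ["Our return window is 30 days."]

# Returns precision, recall, and F1 tensors; F1 is the usual headline number.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```
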
Bleu Score
BLEU is a popular metric for evaluating the quality of machine-translated text by comparing it to one or more reference translations. It assesses precision, focusing on how many n-grams (short sequences of words) in the AI-generated translation match those in the reference text. BLEU also applies a brevity penalty to avoid high scores for overly short translations that may technically match but lack meaningful content.
Example: Imagine the AI translates "Je suis fatigué" as "I am tired." BLEU compares this output to reference translations, such as "I’m tired," and calculates the overlap in n-grams, like "I am" and "am tired." With a strong overlap, BLEU would assign a high score, reflecting close alignment with the reference translation.

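A minimal sketch using NLTK's sentence-level BLEU, assuming the nltk package is installed; the tokenization and bigram weights are illustrative choices.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu

references = ["I'm tired".lower().split(), "I am tired".lower().split()]
candidate = "I am tired".lower().split()

# Bigram weights keep the score meaningful for very short sentences.
bleu = sentence_bleu(references, candidate, weights=(0.5, 0.5))
print(f"BLEU: {bleu:.3f}")  # 1.000 here, since the candidate matches one reference exactly
```
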
Contains
The Contains evaluator checks if a specific word or phrase appears within a text, making it a simple, efficient tool for direct text matching. It doesn’t analyze context or meaning; it only confirms the presence of specific terms. This makes it ideal for binary tasks like keyword validation, content filtering, or ensuring compliance with required phrases.
Example: If an AI response needs to include the phrase "return policy," the Contains evaluator scans for this exact term. For example, if the response says, "Our return policy allows...," it would pass since "return policy" is detected. This evaluator is especially useful for quick compliance checks.

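In essence the check boils down to a substring test; the sketch below, with an illustrative case-insensitivity option, approximates it rather than reproducing the hosted evaluator's exact behavior.

```python
def contains(text: str, term: str, case_sensitive: bool = False) -> bool:
    """Pass if the term appears anywhere in the text."""
    if not case_sensitive:
        text, term = text.lower(), term.lower()
    return term in text

print(contains("Our return policy allows...", "return policy"))  # True
```
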
Contains all
The Contains All evaluator checks if a text includes all required words or phrases, ensuring that each specified term is present. It doesn’t consider context, order, or meaning—only that every target substring appears somewhere in the text. This makes it ideal for tasks that require verifying multiple key terms, like ensuring all necessary points are mentioned in a response.
Example: Suppose an AI response about returns must include both "return policy" and "30 days" to be complete. The Contains All evaluator would check for both terms in the response. If the response says, "Our return policy allows returns within 30 days," it would pass because both phrases are present, meeting the requirement for completeness.

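A sketch of the same idea extended to multiple required terms; the case handling is an assumption.

```python
def contains_all(text: str, terms: list[str]) -> bool:
    """Pass only if every required term appears somewhere in the text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

print(contains_all("Our return policy allows returns within 30 days",
                   ["return policy", "30 days"]))  # True
```
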
Contains any
The Contains Any evaluator checks if a text includes at least one word or phrase from a specified list, confirming that any one of the target terms is present. It doesn’t consider context or exact meaning—only that one of the specified terms appears somewhere in the text. This makes it useful for tasks where the presence of any relevant keyword suffices, like detecting mentions of a topic or validating partial information.
Example: Suppose an AI response needs to mention at least one of the terms "refund," "return policy," or "exchange" to cover a policy-related query. The Contains Any evaluator would scan for any of these terms. If the response reads, "Our exchange policy allows...," it would pass because "exchange" is present, meeting the requirement with just one of the specified terms.

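The any-of variant differs only in how the matches are aggregated; a sketch:

```python
def contains_any(text: str, terms: list[str]) -> bool:
    """Pass if at least one of the terms appears in the text."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)

print(contains_any("Our exchange policy allows...",
                   ["refund", "return policy", "exchange"]))  # True
```
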
Contains email
The Contains Email evaluator detects if a text includes an email address by scanning for typical email patterns. It doesn’t validate the email content, only confirming the presence of an email-like structure (e.g., "name@example.com"). This is useful for tasks where identifying contact information is necessary, like filtering or organizing messages.
Example: Suppose an AI response must not include any personal contact details. The Contains Email evaluator would check if an email address is present in the text. If the response contains "For inquiries, contact support@example.com," it would be flagged, as it includes an email address.

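A sketch of the kind of pattern scan involved; the regular expression below is a deliberately simple approximation of "email-like," not the exact pattern the evaluator uses.

```python
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def contains_email(text: str) -> bool:
    """True if anything shaped like an email address appears in the text."""
    return bool(EMAIL_PATTERN.search(text))

print(contains_email("For inquiries, contact support@example.com"))  # True -> flagged
```
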
Contains link
The Contains Link evaluator identifies if a text contains any hyperlink, regardless of its validity. It simply scans for URL patterns, making it ideal for detecting mentions of websites or external resources in text. This can be useful in moderation tasks to flag responses that include links.
Example: Imagine a forum where users are asked not to include links in their posts. The Contains Link evaluator would scan each post for any URL. If a post includes "Check out this article: http://example.com," it would be flagged for containing a link.

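A sketch of a URL scan; again, the pattern is an illustrative approximation.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")  # anything that looks like an http(s) URL

def contains_link(text: str) -> bool:
    return bool(URL_PATTERN.search(text))

print(contains_link("Check out this article: http://example.com"))  # True -> flagged
```
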
Contains none
The Contains None evaluator ensures that a text does not contain any of the specified words or phrases, confirming the absence of restricted terms. It’s often used in content moderation or quality control tasks where specific terms must be avoided.
Example: Suppose a review platform wants to restrict certain terms like "refund" or "return policy" in user reviews to avoid specific content. The Contains None evaluator would scan each review for these terms. If a review contains "I asked for a refund," it would be flagged since the term "refund" appears.

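A sketch of the negative check, mirroring Contains Any:

```python
def contains_none(text: str, banned_terms: list[str]) -> bool:
    """Pass only if none of the banned terms appear in the text."""
    lowered = text.lower()
    return not any(term.lower() in lowered for term in banned_terms)

print(contains_none("I asked for a refund", ["refund", "return policy"]))  # False -> flagged
```
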
Contains valid link
The Contains Valid Link evaluator checks if a text includes a valid, correctly structured URL. It goes beyond simple link detection by verifying that the URL format is correct, making it useful for tasks where valid links are necessary, such as confirming resource citations or verifying external references.
Example: Imagine an AI is generating responses that must include functional links to resources. The Contains Valid Link evaluator would check each response for a well-formed link. If the response says, "You can read more at http://example.com/resource," it would pass if the URL is correctly structured.

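One way to sketch a structural URL check with the standard library; here "well-formed" means the link parses into an http(s) scheme and a host, which is an assumption about what the evaluator requires.

```python
import re
from urllib.parse import urlparse

def contains_valid_link(text: str) -> bool:
    """Pass if at least one token parses into an http(s) URL with a host."""
    for token in re.findall(r"\S+", text):
        parsed = urlparse(token)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return True
    return False

print(contains_valid_link("You can read more at http://example.com/resource"))  # True
```
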
Cosine similarity
Cosine Similarity is a metric that measures the semantic similarity between generated and reference texts by comparing their vector embeddings—numerical representations of meaning. A higher score indicates a stronger alignment in meaning. This is particularly useful for tasks like summarization, translation, and text generation, where capturing intent and meaning is more important than exact wording.
Example: In a paraphrasing task, Cosine Similarity can evaluate whether "The cat sat on the mat" and "A feline rested on a rug" convey the same meaning, despite different wording. By comparing the angle between their vector embeddings, Cosine Similarity assigns a score from 0 to 1, with higher scores indicating closer alignment in meaning.

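A minimal sketch of cosine similarity over sentence embeddings, assuming the sentence-transformers package; the model name is an illustrative choice, not necessarily what the evaluator uses.

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
a, b = model.encode(["The cat sat on the mat", "A feline rested on a rug"])

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine:.3f}")
```
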
Ends with
The Ends With evaluator checks if a text concludes with a specified word or phrase, ensuring it finishes as required. It doesn’t analyze the content leading up to the end, focusing only on the final word or phrase. This is particularly useful for formatting tasks or validating that responses conclude with specific information or phrases.
Example: Suppose an automated email response system requires all messages to end with “Thank you for your time.” The Ends With evaluator would check each response to confirm it concludes with this phrase. If a response ends with something different, it would be flagged, ensuring consistency in the closing statement.

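A sketch of the suffix check; trimming trailing whitespace first is an assumption about how lenient the evaluator is.

```python
def ends_with(text: str, suffix: str) -> bool:
    """Pass if the text (ignoring trailing whitespace) ends with the required phrase."""
    return text.rstrip().endswith(suffix)

print(ends_with("Best regards. Thank you for your time.", "Thank you for your time."))  # True
```
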
Exact match
Exact Match is a straightforward, binary metric that checks if the generated text matches the reference text exactly, character for character. It’s useful for highly structured or template-based tasks, or for simple fact-based responses where precise wording is required.
Example: Suppose an LLM is set up to respond with a specific closing phrase, “Thank you for your inquiry. We’ll get back to you within 24 hours.” The Exact Match evaluator would check that the response matches this phrase exactly. If any word deviates, such as “24 hours” being written as “a day,” it would fail the check, ensuring consistency in fixed templates.

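The underlying comparison is a plain string equality; a sketch:

```python
def exact_match(text: str, reference: str) -> bool:
    """Pass only if the two strings are identical, character for character."""
    return text == reference

reference = "Thank you for your inquiry. We'll get back to you within 24 hours."
print(exact_match("Thank you for your inquiry. We'll get back to you within 24 hours.", reference))  # True
print(exact_match("Thank you for your inquiry. We'll get back to you within a day.", reference))     # False
```
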
Length between
The Length Between evaluator checks if the text length falls within a specified range, ensuring it meets both minimum and maximum length requirements. It doesn’t analyze content, focusing solely on character or word count to confirm the text fits within set bounds. This is particularly useful for tasks where a specific range of information density is required, such as summary limits or form responses.
Example: Suppose a customer review must be between 50 and 200 characters to be accepted. The Length Between evaluator would check the review's character count. If a review is 120 characters long, it would pass, as it falls within the specified range.

Length greater than
The Length Greater Than evaluator checks if the text length exceeds a specified minimum, ensuring the content is long enough. It’s often used to avoid overly brief responses in contexts where depth or detail is expected, like essays or explanations.
Example: Imagine an AI-generated answer for a Q&A platform must be at least 100 characters long. The Length Greater Than evaluator would check the character count of each answer. If an answer is 150 characters, it would pass, as it meets the minimum length requirement.

Length less than
The Length Less Than evaluator verifies that the text length is below a specified maximum, ensuring the content is concise. This can be helpful in contexts where brevity is important, such as social media posts or SMS messages.
Example: Suppose a message in a notification system must be under 160 characters to fit in an SMS. The Length Less Than evaluator would check the character count of each message. If a message is 140 characters long, it would pass, as it stays within the maximum length limit.

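A combined sketch of the three length evaluators above, assuming limits are expressed in characters; word-based limits would swap len(text) for len(text.split()).

```python
def length_between(text: str, minimum: int, maximum: int) -> bool:
    return minimum <= len(text) <= maximum

def length_greater_than(text: str, minimum: int) -> bool:
    return len(text) > minimum

def length_less_than(text: str, maximum: int) -> bool:
    return len(text) < maximum

review = "Great product, fast delivery, and the support team resolved my issue within a day."
print(length_between(review, 50, 200))   # True: inside the 50-200 character range
print(length_greater_than(review, 100))  # False: shorter than 100 characters
print(length_less_than(review, 160))     # True: short enough for an SMS
```
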
Levenshtein distance
Levenshtein Distance calculates the number of single-character edits (insertions, deletions, or substitutions) needed to turn the text into a reference text, measuring how similar or different two texts are. It’s particularly useful for tasks needing precision, such as spell-checking, detecting typos, validating data entry fields, or assessing minor variations in spelling.
Example: Suppose the AI outputs "recieve" instead of "receive." The Levenshtein distance here is 2 (the transposed "i" and "e" count as two single-character edits), indicating a minor error. This helps catch small mistakes that could impact quality in tasks like proofreading or structured data validation.

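A self-contained sketch of the classic dynamic-programming edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("recieve", "receive"))  # 2: the swapped "i" and "e" cost two edits
```
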
Meteor score
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a scoring system used to evaluate the quality of machine-translated text by comparing it to a reference translation. Unlike simpler metrics, METEOR takes into account synonym matches, stemming, and word order, making it sensitive to language nuances and capable of recognizing semantically similar, but not identical, translations. This makes it highly effective for evaluating translation tasks that need to capture subtle linguistic variations.
Example: If the AI translates “Je suis fatigué” as “I’m feeling tired,” METEOR would compare this with a reference like “I am tired,” crediting exact, stemmed, and synonym matches as well as the similar word order. This would result in a high METEOR score, as it captures both meaning and language nuances accurately.

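A minimal sketch using NLTK's METEOR implementation, assuming the nltk package and its WordNet data are available; recent NLTK versions expect pre-tokenized inputs.

```python
# pip install nltk
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # lexical resources used for synonym matching
nltk.download("omw-1.4", quiet=True)

reference = "I am tired".split()
hypothesis = "I'm feeling tired".split()

print(f"METEOR: {meteor_score([reference], hypothesis):.3f}")
```
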
Moderations openai
The OpenAI Moderations API, using a model built on GPT-4o, evaluates text to ensure it meets safety and appropriateness standards. It checks for content categories such as hate speech, violence, self-harm, and illegal activities, providing a reliable and nuanced assessment. This makes it invaluable for applications where user safety and adherence to guidelines are essential.
Example: Imagine an AI-generated response that includes language encouraging self-harm. The OpenAI Moderations tool would analyze this text, detect the mention of self-harm, and flag the response as unsafe. At the same time, it would confirm that other harmful categories, like hate speech, are not present, providing a comprehensive assessment of the content’s safety.

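A sketch of calling the OpenAI Moderations API directly with the official Python SDK, assuming OPENAI_API_KEY is set in the environment; the model name reflects OpenAI's current omni moderation model and may differ from what the hosted evaluator uses.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
result = client.moderations.create(
    model="omni-moderation-latest",
    input="Text to check for unsafe content.",
)

moderation = result.results[0]
print("Flagged:", moderation.flagged)  # overall verdict
print(moderation.categories)           # per-category booleans (self-harm, hate, violence, ...)
```
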
Moderations Google
The Google Moderations tool utilizes the Vertex AI Moderations API to evaluate AI-generated text for safety and appropriateness. It checks for categories like hate speech, violence, self-harm, and illegal activities, providing a thorough assessment to ensure compliance with content standards. This makes it a valuable tool for applications prioritizing user safety and adherence to guidelines.
Example: Imagine an AI-generated response that includes language suggesting self-harm. The Google Moderations tool would analyze this content, detect the reference to self-harm, and flag the response as unsafe. It would also indicate that other harmful categories, such as hate speech, are not present, offering a detailed review to help maintain a safe user experience.

One line
The One Line evaluator checks if the text is a single continuous line, ensuring it doesn’t contain any line breaks. It doesn’t evaluate the content itself, only confirming the absence of line breaks, which can be useful for formatting requirements in specific applications.
Example: Suppose a username field only allows single-line input without any line breaks. The One Line evaluator would check the text, ensuring there are no unintended line breaks. If the input is “UserName123\n”, it would be flagged, as the newline character \n indicates a second line.

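A sketch of the newline check; whether carriage returns also count as line breaks is an assumption.

```python
def is_one_line(text: str) -> bool:
    """Pass if the text contains no line breaks."""
    return "\n" not in text and "\r" not in text

print(is_one_line("UserName123"))    # True
print(is_one_line("UserName123\n"))  # False -> flagged
```
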
Regex
The Regex evaluator checks if the text follows a specific pattern defined by a regular expression (a pattern-matching tool for text), ensuring it meets a particular format. This is helpful for fields that require structured input, like email addresses, phone numbers, or specific ID formats.
Example: Suppose a form requires a 5-digit ZIP code, defined by the regular expression ^\d{5}$. The Regex evaluator would check if the input matches this 5-digit format. If the input is “12345,” it would pass, but “1234A” or “123456” would be flagged as they don’t fit the 5-digit numeric pattern.

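A sketch of the ZIP-code example with Python's re module:

```python
import re

ZIP_PATTERN = re.compile(r"^\d{5}$")

def matches_pattern(text: str, pattern: re.Pattern) -> bool:
    return bool(pattern.match(text))

print(matches_pattern("12345", ZIP_PATTERN))   # True
print(matches_pattern("1234A", ZIP_PATTERN))   # False
print(matches_pattern("123456", ZIP_PATTERN))  # False
```
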
Rouge-n
ROUGE-N is a metric used to evaluate text summarization by measuring the overlap of n-grams (word sequences) between a generated summary and a reference summary. Specifically, ROUGE-N counts n-grams of a set length (e.g., ROUGE-1 for single words, ROUGE-2 for two-word sequences) to check how much key information from the reference is included in the generated summary. Unlike BLEU, ROUGE emphasizes recall, assessing how well the generated summary captures important details.
Example: Suppose an AI summarizes a news article, and the reference summary includes, "The election results were announced on Monday." If the AI-generated summary includes parts like "results were announced," ROUGE-N would calculate the score by measuring overlaps in n-grams, such as "results were" and "were announced," to assess how closely the AI's summary matches the reference.

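A simplified, recall-only sketch of ROUGE-N; production implementations (such as the rouge-score package) also compute precision and F1 and apply stemming, so treat this as an approximation.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of the reference's n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(1 for gram in ref if gram in cand)
    return overlap / len(ref)

reference = "The election results were announced on Monday."
candidate = "Results were announced after the vote on Monday."
print(f"ROUGE-2 recall: {rouge_n_recall(candidate, reference):.2f}")
```
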
Start with
The Start With evaluator checks if a text begins with a specific word or phrase, making sure it starts as required. It’s particularly useful in scenarios where uniform responses are necessary, such as automated replies, formal letters, or specific formatting tasks.
Example: Imagine a helpdesk application that requires all customer service responses to start with "Thank you for contacting us." The Start With evaluator would scan each response to confirm it begins with this phrase. If a response starts with “We appreciate your message,” it would be flagged, helping the team maintain consistent communication standards.

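The mirror image of Ends With; a sketch, with leading-whitespace handling as an assumption:

```python
def starts_with(text: str, prefix: str) -> bool:
    """Pass if the text (ignoring leading whitespace) begins with the required phrase."""
    return text.lstrip().startswith(prefix)

print(starts_with("Thank you for contacting us. Your ticket is open.", "Thank you for contacting us"))  # True
print(starts_with("We appreciate your message.", "Thank you for contacting us"))  # False -> flagged
```
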
Valid JSON
The Valid JSON evaluator checks if a text is in valid JSON format, ensuring it follows proper JSON syntax with correctly paired brackets, commas, and key-value structures. This is essential for applications that rely on structured data input, as it prevents errors in data processing, API requests, or database interactions.
Example: Suppose an API endpoint requires input in JSON format, like {"name": "John", "age": 30}. The Valid JSON evaluator would verify the text adheres to JSON structure. If the input is malformed, such as {"name": "John", "age": 30, it would be flagged as invalid JSON, allowing the user to correct it before it reaches the API.

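A sketch of the JSON check using the standard library:

```python
import json

def is_valid_json(text: str) -> bool:
    """Pass if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"name": "John", "age": 30}'))  # True
print(is_valid_json('{"name": "John", "age": 30'))   # False: missing closing brace
```
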