Standard Evaluators

This page explains which evaluators you can use to improve your LLM output and when to use them, followed by a short description of every evaluator we currently provide on our platform.

Standard Evaluators Overview

Here is the overview of available standard Evaluators:

| Evaluator | Best use cases | Requires reference? |
| --- | --- | --- |
| Valid JSON | Data exchange and API communication with language models | No |
| Valid JSON Schema | JSON document validation and structure enforcement | No |
| Exact Match | Question-answering systems where responses must match 100% of the expected output | Yes |
| Cosine Similarity | Document similarity and clustering | Yes |
| BERT | Contextual text analysis in NLP | Yes |
| BLEU | Machine translation quality assessment | Yes |
| Levenshtein Distance | Spell checking and plagiarism detection | Yes |
| METEOR | Advanced machine translation evaluation | Yes |
| ROUGE-N | Text summarization quality assessment | Yes |


For some Evaluators listed above, a reference is required to run correctly and compute a result within an Experiment.

For example, if we're looking to evaluate the similarity of two pieces of text, we need a reference text to base our comparison on.

References are configured when setting up Evaluators in Experiments.

Standard Evaluators Details

Valid JSON

Developers use 'Valid JSON' to ensure that the data exchanged with large language models is correctly formatted, easily parseable, and interoperable across different systems. JSON (JavaScript Object Notation) is a lightweight data-interchange format that's human-readable and machine-parseable, facilitating seamless data exchange and storage.
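In practice, this check amounts to verifying that the model's raw output parses as JSON. A minimal sketch (the function name is illustrative, not our platform's API):

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

For example, `is_valid_json('{"name": "Ada"}')` passes, while `is_valid_json('{name: Ada}')` fails because JSON requires quoted keys.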

Valid JSON Schema

Valid JSON Schema, on the other hand, serves as a blueprint for JSON data, defining the structure, constraints, and types of data allowed. It's used to validate JSON documents, ensuring they adhere to a predefined structure and set of rules. This is crucial for maintaining data integrity, enforcing data validation rules, and automating error detection when interacting with large language models, enhancing reliability and efficiency in data processing and API communication.
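To make the idea concrete, here is a deliberately simplified, standard-library-only sketch that checks two common JSON Schema rules (required keys and property types); real evaluators typically use a full JSON Schema implementation such as the `jsonschema` package:

```python
import json

# Map JSON Schema type names to the corresponding Python types.
TYPE_MAP = {"string": str, "number": (int, float), "integer": int,
            "boolean": bool, "object": dict, "array": list}

def matches_schema(text: str, schema: dict) -> bool:
    """Simplified check: valid JSON object, required keys present,
    declared property types respected."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # Every required key must be present...
    for key in schema.get("required", []):
        if key not in data:
            return False
    # ...and every declared property must have the declared type.
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            return False
    return True
```

So for a schema requiring a string `name`, the output `{"name": 5}` would fail validation even though it is syntactically valid JSON.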

Exact Match

Exact match in large language model evaluation means comparing the model's output word-for-word with a predefined correct answer. If the model's response is the same as the correct answer, it's considered a match. It assesses accuracy, especially for tasks with clear and specific answers.
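At its simplest this is a string comparison; the whitespace and case normalisation shown here is a common variant, not necessarily what every exact-match evaluator does:

```python
def exact_match(output: str, reference: str) -> bool:
    """Compare model output against the reference answer.
    Trimming and lowercasing are an assumed normalisation step;
    a strict variant would use `output == reference` directly."""
    return output.strip().lower() == reference.strip().lower()
```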

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, providing a score from -1 to 1 to indicate similarity. It's favored for evaluating language models as it focuses on textual similarity, unaffected by text length, ensuring outputs align contextually with inputs, which is crucial for AI text generation.
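As a minimal sketch, the texts below are turned into term-frequency vectors; production evaluators usually compare embedding vectors instead, but the cosine formula is the same:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two texts using simple term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the shared vocabulary.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical texts score 1.0, texts with no shared words score 0.0, and because the vectors are normalised, repeating a text does not change its score.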


BERT

BERT stands for Bidirectional Encoder Representations from Transformers. Quite a mouthful, but let's break it down. BERT is a revolutionary model in natural language processing (NLP) that understands language in a way that considers the full context of words—both what comes before and after—much like how we humans understand language. The BERT score uses this model to evaluate the quality of text generated by other language models, by comparing it to a reference text. Example: It understands that "bright" can refer to both light and intelligence, depending on the sentence. This depth of understanding leads to more accurate evaluations.


BLEU

The BLEU (Bilingual Evaluation Understudy) score is a clever metric used to evaluate the quality of text generated by large language models, especially in translation tasks. It works by comparing the model's output with a set of high-quality reference translations. At its core, BLEU assesses the match of phrases between the generated text and the references, rewarding precision and incorporating a penalty for overly short translations. This method is popular because it provides a quantifiable way to gauge the model's linguistic prowess, offering a standardized benchmark to measure and improve upon. It's like a scorecard for a language model's fluency and accuracy in replicating human-like translations.
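The mechanics above can be sketched in a simplified sentence-level form; full BLEU uses n-grams up to 4, smoothing, multiple references, and corpus-level aggregation, all of which are omitted here:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """BLEU sketch: clipped n-gram precisions up to `max_n`,
    geometric mean, and a brevity penalty (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ng, ref_ng = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ng.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ng[g]) for g, c in cand_ng.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a candidate sharing no n-grams with the reference scores 0.0.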

Levenshtein Distance

Levenshtein distance measures the dissimilarity between two strings. It counts the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. By comparing the model-generated text to a reference, Levenshtein distance quantifies the effort needed to match the model's output to the gold standard.
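The standard dynamic-programming formulation, keeping only one previous row for memory efficiency:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```

The classic example: `levenshtein("kitten", "sitting")` is 3 (substitute k→s, substitute e→i, insert g).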


METEOR

The METEOR (Metric for Evaluation of Translation with Explicit ORdering) score is a sophisticated evaluation tool designed for assessing the performance of language models, particularly in machine translation. Unlike simpler metrics, METEOR goes beyond mere word-to-word comparisons by incorporating synonyms and paraphrase recognition, ensuring a more nuanced analysis of linguistic quality and meaning preservation. Using alignment techniques to consider word order and sentence structure further refines accuracy. This holistic approach makes METEOR a preferred choice for developers and researchers aiming to fine-tune large language models, ensuring translations are accurate, contextually appropriate, and grammatically coherent.
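A greatly simplified sketch of METEOR's core: exact unigram matching combined with its recall-weighted harmonic mean, F = 10PR / (R + 9P). The stemming, synonym matching, and fragmentation penalty that distinguish the real metric are deliberately omitted:

```python
from collections import Counter

def simple_meteor(candidate: str, reference: str) -> float:
    """METEOR sketch: exact unigram matches only, scored with the
    recall-weighted harmonic mean F = 10PR / (R + 9P).
    The full metric also matches stems and synonyms and applies
    a penalty for fragmented (out-of-order) matches."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matches = sum((cand & ref).values())  # clipped unigram overlap
    if matches == 0:
        return 0.0
    p = matches / sum(cand.values())  # precision
    r = matches / sum(ref.values())   # recall
    return 10 * p * r / (r + 9 * p)
```

The heavy weighting toward recall reflects METEOR's design: covering the reference's content matters more than avoiding extra words.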

ROUGE-N

The ROUGE-N score is a recall-oriented metric in natural language processing, measuring the overlap of N-grams (contiguous sequences of 'N' items from a given sample of text or speech) between generated text and reference text. It's widely used for tasks like summarization, where capturing key information concisely is crucial. ROUGE-N helps quantify how well an LLM can replicate or reference important pieces of information, essentially measuring the model's ability to "echo" relevant content.
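The recall computation can be sketched as follows; this covers the core overlap count only, whereas reporting tools often also emit precision and F1 variants:

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N recall: clipped n-gram overlap divided by the
    number of n-grams in the reference."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand_ng, ref_ng = grams(candidate), grams(reference)
    total = sum(ref_ng.values())
    if total == 0:
        return 0.0
    overlap = sum((cand_ng & ref_ng).values())
    return overlap / total
```

For instance, with the reference "the cat sat on the mat", the summary "the cat sat" recovers 3 of the 6 reference unigrams, so ROUGE-1 recall is 0.5.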