Evaluators are automated tools that assess model outputs within Experiments, Deployments, and Agents. They verify outputs against reference data, enforce compliance criteria, and power Guardrails that block non-compliant generations before they reach users. Two evaluator types are available:

LLM Evaluator

Use a model to judge outputs against any criteria you define in a prompt.

Python Evaluator

Write custom Python code for full flexibility. Use for statistical scoring, regex checks, length validation, or any custom evaluation logic.
HTTP and JSON evaluators are deprecated. Existing HTTP and JSON evaluators continue to work, but cannot be duplicated. Use Python evaluators instead: the requests package is now available for HTTP calls, and pydantic is available for JSON schema validation.

Use Cases

Score model outputs on dimensions like tone, accuracy, or relevance without manual review. Use LLM-as-a-Judge evaluators with custom rubrics, or import pre-built scoring functions from the Hub.
Verify that outputs meet specific format, content, or structural requirements. Use Python evaluators for custom logic such as regex checks, length validation, or structural assertions.
Attach evaluators as guardrails to block generations that fail a pass condition. Input guardrails run before the model; output guardrails run after. A failed guardrail returns HTTP 422 to the caller.
Run evaluators across a full dataset in an Experiment to track quality over time. Compare evaluator scores across runs and prompt variants to catch regressions before deploying changes.

Pre-built Evaluators

Before building one from scratch, browse the Hub for ready-to-use evaluators. Add any of them to a Project with the Add to project button, then use them in Experiments, Deployments, and Agents. The Hub groups its evaluators into three categories:
  • Function Evaluators: deterministic checks such as Contains, Valid JSON, Length Between, and BLEU Score.
  • LLM Evaluators: model-judged checks such as Tone of Voice, Grammar, PII, and Sentiment Classification.
  • RAGAS Evaluators: retrieval-augmented generation metrics such as Faithfulness, Context Precision, and Response Relevancy.

LLM Evaluator

LLM Evaluators use a model to judge outputs against any criteria you define in a prompt.
In a Project or folder, click the button and select LLM Evaluator. Select the model to use for evaluation. It must be enabled in the AI Gateway.
[Image: LLM Evaluator settings panel with prompt area, model dropdown set to DeepSeek V3.1, output type tabs, guardrail pass condition, and a playground panel with Editor and Dataset tabs.]

Configure Prompt

Your prompt has access to the following string variables:
  • {{log.input}}: the last message sent to the model
  • {{log.output}}: the output response generated by the evaluated model
  • {{log.messages}}: all messages sent to the model, excluding the last message
  • {{log.retrievals}}: Knowledge Base retrievals
  • {{log.reference}}: the reference used to compare output
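To preview how a prompt reads once these placeholders are filled in, the substitution can be approximated locally. The sketch below is only an illustration, not Orq.ai's implementation: `render_prompt` is a hypothetical helper, and joining list values line by line is an assumption about how list fields are turned into strings.

```python
import re

# Hypothetical helper: fills {{log.<field>}} placeholders from a dict.
# It only approximates the substitution locally; it is not Orq.ai's code.
PLACEHOLDER = re.compile(r"\{\{\s*log\.(\w+)\s*\}\}")

def render_prompt(template, log):
    def fill(match):
        # Unknown fields resolve to an empty string in this sketch.
        value = log.get(match.group(1), "")
        # Assumption: list fields (messages, retrievals) are joined line by line.
        if isinstance(value, list):
            return "\n".join(str(item) for item in value)
        return str(value)
    return PLACEHOLDER.sub(fill, template)

prompt = render_prompt(
    "[INPUT] {{log.input}}\n[OUTPUT] {{log.output}}",
    {"input": "What is 2 + 2?", "output": "4"},
)
print(prompt)
```

Rendering a prompt this way before saving it makes it easier to spot a misspelled variable name, which in this sketch shows up as an empty value.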

Model Parameters

The Model field selects which model acts as judge. Any model enabled in the AI Gateway is available. The model choice affects evaluation quality, cost, and latency.

Output and Guardrail Configuration

Select the output type that matches the evaluation criteria. The Guardrail configuration panel is visible directly in the evaluator settings. Set the pass condition for each type:
Boolean: the model returns a True or False response. Use for binary pass/fail checks. Guardrail: select True or False. The guardrail passes when the model returns the selected value.
Once configured, the evaluator is available as a guardrail in any Deployment or Agent without any additional toggle.

Examples

Example: formality rating (number, 1 to 5):
Rate the formality of the following output on a scale of 1 to 5:
- 1: Very casual/informal
- 5: Very formal/professional

Only output the number.

[OUTPUT] {{log.output}}

Example: accuracy score (number, 0 to 100):
Evaluate how accurate the response [OUTPUT] is compared to the query [INPUT].

Score from 0 to 100, where:
- 0: Completely inaccurate or irrelevant
- 50: Partially accurate
- 100: Perfectly accurate and complete

Only output the score as a number.

[INPUT] {{log.input}}
[OUTPUT] {{log.output}}

Example: answer adequacy (1 or 0):
Evaluate if the response adequately answers the user's question.

Return 1 if the response is satisfactory, 0 if it is not.

[QUESTION] {{log.input}}
[RESPONSE] {{log.output}}

Example: conversation consistency (1 or 0):
Review the full prior conversation and the latest response.

Return 1 if the response stays consistent with what was already discussed, 0 if it contradicts earlier messages.

[CONVERSATION] {{log.messages}}
[RESPONSE] {{log.output}}

Example: reference comparison (1 or 0):
Compare the response [OUTPUT] against the reference answer [REFERENCE].

Return 1 if the response conveys the same meaning as the reference, 0 if it does not.

[OUTPUT] {{log.output}}
[REFERENCE] {{log.reference}}

Example: tool call check (1 or 0):
Review the tool calls made during the run.

Return 1 if the correct tool was called with valid arguments for the user's request, 0 otherwise.

[REQUEST] {{log.input}}
[TOOL CALLS] {{log.tool_calls}}

Testing

Fill the payload manually. Enter values for messages, input, output, retrievals, and reference. All prompt variables resolve against what you enter.
[Image: Studio Playground panel for configuring the LLM payload sent to an LLM evaluator.]
Click Run to execute the evaluator. The result appears in the Response field.
[Image: Response field showing the result of an LLM evaluator test run.]
Once created, this evaluator is available as a guardrail in Deployments and Agents. See Evaluators and Guardrails in Deployments and Evaluators and Guardrails in Agents to learn more.

Python Evaluator

Python Evaluators let you write custom Python code for maximum flexibility: from simple validations (regex, length checks) to complex analyses (statistical scoring, custom algorithms).
In a Project or folder, click the button and select Python Evaluator. You are taken to the code editor. Your evaluation function has access to the following fields from the evaluated model’s log:
  • log["input"] <str>: the last message sent to generate the output
  • log["output"] <str>: the generated response from the model
  • log["reference"] <str>: the reference used to compare the output
  • log["messages"] list<str>: all previous messages sent to the model
  • log["retrievals"] list<str>: all Knowledge Base retrievals
The evaluator can return two response types:
  • Number: return a numeric score
  • Boolean: return a true/false value
Example: compare output size with the reference:
Python
def evaluate(log):
    output_size = len(log["output"])
    reference_size = len(log["reference"])
    return abs(output_size - reference_size)
You can define multiple functions in the code editor. The last function defined is the entry point that runs when the Evaluator is invoked.
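As an illustration of helper functions combined with a final entry point, here is a minimal sketch of a Boolean evaluator built from a regex check and a length check. The helper names, the email pattern, and the 1 to 500 character limit are assumptions chosen for the example, not platform defaults.

```python
import re

# Helper: True when the text contains something that looks like an email address.
def contains_email(text):
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text) is not None

# Helper: True when the text length is within the inclusive bounds.
def within_length(text, minimum, maximum):
    return minimum <= len(text) <= maximum

# Entry point: defined last, so it is the function the evaluator runs.
def evaluate(log):
    output = log["output"]
    # Pass only when the output is 1 to 500 characters and contains no email address.
    return within_length(output, 1, 500) and not contains_email(output)
```

Keeping each check in its own function makes it easy to reorder or extend the checks, as long as `evaluate` stays at the bottom of the editor.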

Environment and Libraries

The Python Evaluator runs in Python 3.12 with the following preloaded libraries:
numpy==2.4.4
nltk==3.9.4
requests
pydantic
json
re

Guardrail Configuration

Within a Deployment or Agent, use the Python Evaluator as a Guardrail to block generations that don’t meet the custom evaluation logic.
Use the Pass condition to define when the guardrail passes:
  • Boolean evaluators: select True or False. The guardrail passes when your function returns the selected value.
  • Number evaluators: enter a score threshold. The guardrail passes when your function’s return value is greater than or equal to the threshold.
Any evaluator created in Orq.ai, whether LLM or Python, can be attached as a guardrail in a Deployment or Agent. Only the Pass condition needs to be set.
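To make the Number pass condition concrete, the sketch below scores an output by the fraction of expected terms it mentions and then applies a threshold locally. `REQUIRED_TERMS` and `guardrail_passes` are hypothetical names used only for illustration; in the platform the comparison is done by the Pass condition setting, not by your code.

```python
# Hypothetical list of terms the answer is expected to mention.
REQUIRED_TERMS = ["refund", "order number", "14 days"]

# Local illustration only: mirrors the rule that a Number guardrail passes
# when the returned score is greater than or equal to the threshold.
def guardrail_passes(score, threshold):
    return score >= threshold

# Entry point (kept last): returns the fraction of required terms present,
# a number between 0.0 and 1.0.
def evaluate(log):
    output = log["output"].lower()
    hits = sum(1 for term in REQUIRED_TERMS if term in output)
    return hits / len(REQUIRED_TERMS)
```

With a threshold of 0.5, an output that mentions two of the three terms would pass in this sketch, while a threshold of 0.7 would block it.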

Examples

Use the requests package to send the output to an external endpoint and confirm it comes back unchanged. Return True only when the call succeeds and the echoed text matches the output.
Python
def evaluate(log):
    import requests
    try:
        # Echo the output through an external endpoint.
        r = requests.post(
            "https://httpbin.org/post",
            json={"text": log["output"]},
            timeout=10,
        )
        r.raise_for_status()  # treat non-2xx responses as a failed call
        echoed = r.json()["json"]["text"]
        return echoed == log["output"]   # round-trip succeeded
    except (requests.RequestException, KeyError, TypeError, ValueError):
        return False
Use pydantic to validate that the output is JSON matching the expected damage report schema, including a confidence score between 0 and 1. Return True when it parses and validates, False otherwise.
Python
def evaluate(log):
    import json
    from typing import List
    from pydantic import BaseModel, field_validator, ValidationError

    class Damage(BaseModel):
        type: str
        location: str
        severity: str
        confidence: float
        description: str

        @field_validator("confidence")
        @classmethod
        def confidence_in_range(cls, v):
            if not 0.0 <= v <= 1.0:
                raise ValueError("confidence must be between 0 and 1")
            return v

    class Assessment(BaseModel):
        damages: List[Damage]
        assetType: str
        observations: List[str]
        overallAssessment: str

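    try:
        # model_validate also rejects JSON that is not an object (e.g. a list),
        # which Assessment(**data) would surface as an uncaught TypeError.
        Assessment.model_validate(json.loads(log["output"]))
        return True
    except (ValidationError, json.JSONDecodeError, KeyError, TypeError):
        return False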

Testing

Fill the payload manually in the Editor. Enter values for input, output, reference, messages, and retrievals. All log fields resolve against what you enter.
[Image: Studio Playground panel for configuring the payload sent to a Python evaluator.]
Click Run to execute the evaluator. The result appears in the Response field.
[Image: Response field showing the result of a Python evaluator test run.]

Versions

When you are done editing, click Publish to save your changes. You will be prompted to write a commit message and choose a version bump:
  • Patch (e.g. v1.0.0 to v1.0.1): small fixes, no behaviour change
  • Minor (e.g. v1.0.0 to v1.1.0): new functionality, backwards compatible
  • Major (e.g. v1.0.0 to v2.0.0): breaking change or significant rework
The Versions tab shows the full history with author and publish timestamp for each version.
Each published version has three action buttons:
  • Compare: open a diff view to see what changed between versions
  • Code: load a code snippet to invoke the evaluator at this exact version
  • Environment: tag the version with an Environment (e.g. production, staging)
Reference a specific version by appending @ and the version number: my-evaluator@1.0.1. Reference an environment tag directly: my-evaluator@production. Without a suffix, the latest published version is used.
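The bump rules and the reference suffixes can be sketched in a few lines of Python. Both helpers below are hypothetical and are not part of any Orq.ai SDK; they assume three-part version numbers like the examples in this section and treat any other suffix as an environment tag.

```python
import re

# Matches a three-part version number such as 1.0.1.
VERSION = re.compile(r"^\d+\.\d+\.\d+$")

# Hypothetical helper: applies a patch, minor, or major bump to "v1.0.0"-style strings.
def bump_version(version, bump):
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if bump == "major":
        return f"v{major + 1}.0.0"
    if bump == "minor":
        return f"v{major}.{minor + 1}.0"
    if bump == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {bump}")

# Hypothetical helper: splits "name@suffix" into a name plus either a
# version number, an environment tag, or "latest" when no suffix is given.
def parse_evaluator_ref(ref):
    name, sep, suffix = ref.partition("@")
    if not sep:
        return {"name": name, "kind": "latest", "value": None}
    kind = "version" if VERSION.match(suffix) else "environment"
    return {"name": name, "kind": kind, "value": suffix}
```

For example, `bump_version("v1.0.0", "minor")` gives `"v1.1.0"`, and `parse_evaluator_ref("my-evaluator@production")` is classified as an environment reference.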

List Evaluators

Use the List Evaluators API:
curl --request GET \
     --url https://api.orq.ai/v2/evaluators \
     --header 'accept: application/json' \
     --header 'authorization: Bearer ORQ_API_KEY'

Invoke an Evaluator

Fetch the evaluator ID from the List Evaluators API, then invoke it. Use the View Code button on your evaluator page in the AI Studio to get a pre-filled snippet.
[Image: Invoke an Evaluator dialog in AI Studio with Node, Python, and cURL tabs. The Python tab shows an orq.evals.invoke call with id, query, output, reference, messages, and retrievals arguments.]
curl 'https://api.orq.ai/v2/evaluators/<evaluator_id>/invoke' \
-H 'Authorization: Bearer ORQ_API_KEY' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
--data-raw '{
    "query": "Your input text",
    "output": "Your output text",
    "reference": "Optional reference text",
    "messages": [{"role": "user", "content": "Your message"}],
    "retrievals": ["Your retrieval content"]
}'
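The same request can be assembled in Python with only the standard library. This is a minimal sketch that mirrors the cURL call above rather than the official SDK: `build_invoke_request` is a hypothetical helper, `"your-evaluator-id"` is a placeholder, and the sketch assumes the API key is stored in an `ORQ_API_KEY` environment variable. The send step is left commented out so the code runs without network access.

```python
import json
import os
import urllib.request

API_BASE = "https://api.orq.ai/v2/evaluators"

# Hypothetical helper: builds (but does not send) the POST request
# for the invoke endpoint shown in the cURL example.
def build_invoke_request(evaluator_id, api_key, payload):
    return urllib.request.Request(
        url=f"{API_BASE}/{evaluator_id}/invoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )

payload = {
    "query": "Your input text",
    "output": "Your output text",
    "reference": "Optional reference text",
    "messages": [{"role": "user", "content": "Your message"}],
    "retrievals": ["Your retrieval content"],
}

# "your-evaluator-id" is a placeholder; use an ID from the List Evaluators API.
request = build_invoke_request(
    "your-evaluator-id", os.environ.get("ORQ_API_KEY", ""), payload
)

# Uncomment to send the request:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.loads(response.read())
```

Separating request construction from sending makes the payload easy to inspect or unit test before any call is made.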

Guardrail Error Response

When a guardrail evaluation fails, Orq.ai returns an HTTP 422 Unprocessable Entity. The response body lists every guardrail that did not pass.
{
  "code": 422,
  "error": "Validation failed: Not all guardrails were met while validating the response.",
  "message": "Validation failed: Not all guardrails were met while validating the response.",
  "source": "system",
  "guardrails": [
    {
      "id": "01KMR75R90XDA80020YT8MHP2W",
      "status": "completed",
      "started_at": "2026-03-27T17:58:55.330Z",
      "finished_at": "2026-03-27T17:58:55.364Z",
      "related_entities": [
        {
          "type": "evaluator",
          "evaluator_id": "01KK9D8Z0JCEC1ASQJH8R28B57",
          "evaluator_metric_name": "python_evaluator"
        }
      ],
      "passed": false,
      "reason": null,
      "evaluator_type": "output_guardrail",
      "type": "boolean",
      "value": false
    }
  ]
}
Each entry in guardrails contains the following fields:
  • id <string>: Internal ID of the guardrail result.
  • status <string>: Execution status: "completed" or "failed".
  • started_at <string>: ISO 8601 timestamp when the guardrail evaluation started.
  • finished_at <string>: ISO 8601 timestamp when the guardrail evaluation finished.
  • related_entities <array>: References to the evaluator that ran. Each entry contains type, evaluator_id, and evaluator_metric_name.
  • passed <boolean>: false for every entry in this error response.
  • reason <string or null>: Explanation of the failure, when provided by the evaluator.
  • evaluator_type <string>: "input_guardrail" if the guardrail ran before the model. "output_guardrail" if the guardrail ran after generation.
  • type <string>: The value type returned by the evaluator: "boolean", "number", or "categorical".
  • value <boolean, number, or string>: The raw value returned by the evaluator.
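On the caller side, a 422 body can be reduced to a short summary of what failed. The sketch below is one possible way to do that, assuming the body has already been parsed into a dict with the fields documented above; `failed_guardrails` is a hypothetical helper, not an SDK function.

```python
import json

# Hypothetical helper: returns a short summary for each guardrail that did not pass.
def failed_guardrails(body):
    summaries = []
    for guardrail in body.get("guardrails", []):
        if guardrail.get("passed"):
            continue
        # Collect the evaluator IDs referenced by this guardrail result.
        evaluator_ids = [
            entity.get("evaluator_id")
            for entity in guardrail.get("related_entities", [])
            if entity.get("type") == "evaluator"
        ]
        summaries.append({
            "stage": guardrail.get("evaluator_type"),
            "evaluator_ids": evaluator_ids,
            "value": guardrail.get("value"),
            "reason": guardrail.get("reason"),
        })
    return summaries

# Trimmed copy of the sample response body from this section.
sample = json.loads("""
{
  "code": 422,
  "guardrails": [
    {
      "id": "01KMR75R90XDA80020YT8MHP2W",
      "status": "completed",
      "related_entities": [
        {
          "type": "evaluator",
          "evaluator_id": "01KK9D8Z0JCEC1ASQJH8R28B57",
          "evaluator_metric_name": "python_evaluator"
        }
      ],
      "passed": false,
      "reason": null,
      "evaluator_type": "output_guardrail",
      "type": "boolean",
      "value": false
    }
  ]
}
""")

print(failed_guardrails(sample))
```

A summary like this is convenient for logging or for deciding whether to retry with a modified input (input guardrail) or to fall back to a canned reply (output guardrail).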
When the evaluator fails to execute: If the evaluator itself fails to run (for example, a network error or timeout), the guardrail is silently skipped and the generation proceeds. Monitor skipped guardrail executions through Traces.
When an LLM guardrail’s underlying model fails: If the model powering an LLM guardrail is unavailable, Orq.ai fails the entire request for safety. Since the guardrail could not run, there is no way to know whether it would have blocked the generation.

Evaluatorq

Evaluatorq is a dedicated SDK for running evaluations programmatically. It supports parallel job execution, flexible data sources (inline, CSV, Orq datasets), and syncs results to the Orq.ai AI Studio.
Install:
npm install @orq-ai/evaluatorq
Usage example:
import { evaluatorq, job } from "@orq-ai/evaluatorq";

const textAnalyzer = job("text-analyzer", async (data) => {
    const text = data.inputs.text;
    return {
        length: text.length,
        wordCount: text.split(" ").length,
        uppercase: text.toUpperCase(),
    };
});

await evaluatorq("text-analysis", {
    data: [
        { inputs: { text: "Hello world" } },
        { inputs: { text: "Testing evaluation" } },
    ],
    jobs: [textAnalyzer],
    evaluators: [
        {
            name: "length-check",
            scorer: async ({ output }) => {
                const passesCheck = output.length > 10;
                return {
                    value: passesCheck ? 1 : 0,
                    explanation: passesCheck
                        ? "Output length is sufficient"
                        : `Output too short (${output.length} chars, need >10)`,
                };
            },
        },
    ],
});
See the Python Evaluatorq and TypeScript Evaluatorq repositories for more.

Cookbook: Running evaluations in parallel with Evaluatorq

Step-by-step walkthrough comparing agent variants with parallel evaluators, including DeepEval and RAGAS integration.