Orq.ai Documentation - AI Gateway & LLM Collaboration Platform

Evaluators are automated tools that assess model outputs within Experiments, Deployments, and Agents. They verify outputs against reference data, enforce compliance criteria, and power Guardrails that block non-compliant generations before they reach users. Four evaluator types are available.

LLM Evaluator

Use a model to judge outputs against any criteria you define in a prompt.

Python Evaluator

Write custom Python code for full flexibility. Use for statistical scoring, regex checks, length validation, or any custom evaluation logic.

HTTP Evaluator

Call an external API to evaluate outputs. Use for business-specific compliance checks, custom scoring services, or domain-specific validations.

JSON Evaluator

Validate model outputs against a JSON Schema. Use to enforce correct payload structure for incoming or outgoing model responses.

Use Cases

Automated quality scoring

Score model outputs on dimensions like tone, accuracy, or relevance without manual review. Use LLM-as-a-Judge evaluators with custom rubrics, or import pre-built scoring functions from the Hub.

Output compliance checks

Verify that outputs meet specific format, content, or structural requirements. Use JSON evaluators for schema validation, Python evaluators for custom logic, or HTTP evaluators to call your own compliance APIs.

Guardrails in Deployments and Agents

Attach evaluators as guardrails to block generations that fail a pass condition. Input guardrails run before the model; output guardrails run after. A failed guardrail returns HTTP 422 to the caller.

Regression testing in Experiments

Run evaluators across a full dataset in an Experiment to track quality over time. Compare evaluator scores across runs and prompt variants to catch regressions before deploying changes.

LLM Evaluator

LLM Evaluators use a model to judge outputs against any criteria you define in a prompt.

AI Studio
API & SDK
MCP

In a Project or folder, click the + button and select LLM Evaluator. Select the model to use for evaluation. It must be enabled in the AI Router.

Use the Create an Evaluator API.

JSON

{
  "type": "llm_eval",
  "prompt": "Give a number response from 0 to 1, 0 for inappropriate, 1 for perfectly appropriate {{log.output}}",
  "path": "Default/evaluators",
  "model": "openai/gpt-4o",
  "key": "myKey",
  "guardrail_config": {
    "enabled": true,
    "type": "number",
    "value": 0.7,
    "operator": "gte"
  }
}

Retrieve an evaluator’s configuration:

Show me the current configuration for the "tone-scorer" evaluator

The assistant uses search_entities to resolve the evaluator ID, then get_llm_eval to retrieve the full configuration including prompt, model, and output type.

Create an LLM evaluator:

Create an LLM-as-a-Judge evaluator that scores responses on tone: professional, neutral, or aggressive

The assistant uses create_llm_eval with a categorical scoring rubric and confirms the evaluator ID.

Update an existing LLM evaluator:

Update the "tone-scorer" evaluator to also check for formal language and return a boolean instead of a number

The assistant uses search_entities to find the evaluator, then update_llm_eval with the updated prompt and output_type: "boolean".

Configure Prompt

AI Studio

Your prompt has access to the following string variables:

{{log.input}}: the last message sent to the model
{{log.output}}: the output response generated by the evaluated model
{{log.messages}}: all messages sent to the model, excluding the last message
{{input.all_messages}}: the full conversation, including the last user message
{{log.retrievals}}: Knowledge Base retrievals
{{log.reference}}: the reference used to compare output
{{output.tools_called}}: a numbered, human-readable summary of each tool call made during the run, with arguments and responses
{{log.tool_calls}}: alias of {{output.tools_called}}

Nested indexing such as {{log.tool_calls[0].tool_name}} resolves only on the Python evaluator runtime. On the Go evaluator runtime the value is the rendered string above.

Output Types

AI Studio

Choose which type of output your model evaluation will provide. The output type also determines how the evaluator can be used as a Guardrail.

Boolean
Number
Categorical
String

The model returns a True or False response. Use this for binary pass/fail checks.Guardrail: Select True or False. The guardrail passes when the model returns the selected value.

Examples

AI Studio

Evaluating formality on a 1-5 scale

Rate the formality of the following output on a scale of 1 to 5:
- 1: Very casual/informal
- 5: Very formal/professional

Only output the number.

[OUTPUT] {{log.output}}

Evaluating accuracy on a 0-100 scale

Evaluate how accurate the response [OUTPUT] is compared to the query [INPUT].

Score from 0 to 100, where:
- 0: Completely inaccurate or irrelevant
- 50: Partially accurate
- 100: Perfectly accurate and complete

Only output the score as a number.

[INPUT] {{log.input}}
[OUTPUT] {{log.output}}

Binary pass/fail with numeric output

Evaluate if the response adequately answers the user's question.

Return 1 if the response is satisfactory, 0 if it is not.

[QUESTION] {{log.input}}
[RESPONSE] {{log.output}}

Testing

AI Studio

Configure the LLM payload in the Studio Playground:

Studio Playground panel for configuring the LLM payload sent to an LLM evaluator.

Click Run to execute the evaluator. The result appears in the Response field.

Response field showing the result of an LLM evaluator test run.

Once created, this evaluator is available as a guardrail in Deployments and Agents. See Evaluators and Guardrails in Deployments and Evaluators and Guardrails in Agents to learn more.

Python Evaluator

Python Evaluators let you write custom Python code for maximum flexibility: from simple validations (regex, length checks) to complex analyses (statistical scoring, custom algorithms).

AI Studio
API & SDK
MCP

In a Project or folder, click the + button and select Python Evaluator. You are taken to the code editor. Your evaluation function has access to the following fields from the evaluated model’s log:

log["input"] <str>: the last message sent to generate the output
log["output"] <str>: the generated response from the model
log["reference"] <str>: the reference used to compare the output
log["messages"] list<str>: all previous messages sent to the model
log["retrievals"] list<str>: all Knowledge Base retrievals

The evaluator can return two response types:

Number: return a numeric score
Boolean: return a true/false value

Example: compare output size with the reference:

Python

def evaluate(log):
    output_size = len(log["output"])
    reference_size = len(log["reference"])
    return abs(output_size - reference_size)

You can define multiple methods within the code editor. The last method is the entry-point for the Evaluator when run.

Use the Create an Evaluator API. Use \n to indicate newlines in code.

JSON

{
  "type": "python_eval",
  "path": "Default/Evaluators",
  "key": "MyEvaluator",
  "code": "def evaluate(log):\n  output_size = len(log[\"output\"])\n  reference_size = len(log[\"reference\"])\n  return abs(output_size - reference_size)\n",
  "guardrail_config": {
    "enabled": true,
    "type": "number",
    "value": 10,
    "operator": "lte"
  }
}

Retrieve a Python evaluator’s configuration:

Show me the current configuration for the "json-validator" evaluator

The assistant uses search_entities to resolve the evaluator ID, then get_python_eval to retrieve the full configuration including code and output type.

Create a Python evaluator:

Create a Python evaluator that checks whether the response contains a valid JSON object

The assistant writes a Python snippet that parses the response and validates JSON structure, then uses create_python_eval to register it in your workspace.

Update a Python evaluator:

Update the "json-validator" evaluator to also check that the JSON contains a "status" field

The assistant uses search_entities to find the evaluator, then update_python_eval with the updated code.

Environment and Libraries

AI Studio

The Python Evaluator runs in Python 3.12 with the following preloaded libraries:

numpy==1.26.4
nltk==3.9.1
json
re

Guardrail Configuration

AI Studio

Within a Deployment or Agent, use your Python Evaluator as a Guardrail to block generations that don’t meet your custom evaluation logic.Use the Pass condition to define when the guardrail passes:

Boolean evaluators: select True or False. The guardrail passes when your function returns the selected value.
Number evaluators: enter a score threshold. The guardrail passes when your function’s return value is greater than or equal to the threshold.

Testing

AI Studio

Configure the Python payload in the Studio Playground:

Studio Playground panel for configuring the payload sent to a Python evaluator.

Click Run to execute the evaluator. The result appears in the Response field.

Response field showing the result of a Python evaluator test run.

HTTP Evaluator

HTTP evaluators call an external API to perform evaluation, enabling flexible assessments using your own or third-party endpoints. Use them for business-specific compliance checks, custom quality scoring, or domain-specific validations.

AI Studio
API & SDK

In a Project or folder, click the + button and select HTTP Evaluator. Define the following:

Field	Description
URL	The API endpoint.
Headers	Key-value pairs for HTTP headers sent during evaluation.
Payload	Key-value pairs for the HTTP body sent during evaluation.

Use the Create an Evaluator API.

JSON

{
  "type": "http_eval",
  "method": "POST",
  "headers": {
    "header-key": "header-value"
  },
  "payload": {
    "body-key": "body-value"
  },
  "url": "https://myevaluatorendpoint.com/api",
  "path": "Default/Evaluators",
  "key": "MyEvaluator",
  "guardrail_config": {
    "enabled": true,
    "type": "number",
    "value": 5,
    "operator": "gte"
  }
}

Payload Detail The following variables are accessible in the payload sent to your endpoint:

{
  "query": "",           // last message sent to the model
  "response": "",        // assistant-generated response
  "expected_output": "", // dataset reference for the evaluation
  "retrieved_context": [] // knowledge base retrievals
}

Expected Response Payload For an HTTP Evaluator to be valid, Orq.ai expects a response payload in one of the following formats. If none is returned, the evaluator is ignored during processing.

{
    "type": "boolean",
    "value": true
}

Guardrail Configuration

AI Studio

Within a Deployment or Agent, you can use your HTTP Evaluator as a Guardrail to block responses based on the value returned by your endpoint.Use the Pass condition to set a numeric threshold. The guardrail passes when the value returned by your endpoint is greater than or equal to the threshold.

Testing

AI Studio

A Playground is available in the Studio to test your evaluator against any output before using it in an Experiment or Deployment.Configure the request fields:

Studio Playground panel for configuring the request fields sent to an HTTP evaluator.

Click Run to execute the evaluator. The result appears in the Response field.

Response field showing the result of an HTTP evaluator test run.

JSON Evaluator

JSON Evaluators validate model outputs against a JSON Schema, ensuring correct payload structure for incoming or outgoing model responses.

AI Studio
API & SDK

In a Project or folder, click the + button and select JSON Evaluator. Specify a JSON Schema that defines which fields are required and their types. For example:

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The post title"
    },
    "length": {
      "type": "integer",
      "description": "The post length"
    }
  },
  "required": [ "title", "length" ]
}

Use the Create an Evaluator API. The schema field takes the JSON Schema as a serialized string. Quote characters must be escaped as \".

JSON

{
  "guardrail_config": {
    "enabled": true,
    "type": "boolean",
    "value": true
  },
  "type": "json_schema",
  "schema": "{   \"$schema\": \"http://json-schema.org/draft-07/schema#\",   \"$id\": \"https://example.com/person.schema.json\",   \"title\": \"Person\",   \"description\": \"A person object\",   \"type\": \"object\",   \"properties\": {     \"firstName\": {       \"type\": \"string\",       \"description\": \"The person's first name\"     },     \"lastName\": {       \"type\": \"string\",       \"description\": \"The person's last name\"     },     \"age\": {       \"type\": \"integer\",       \"minimum\": 0,       \"maximum\": 150,       \"description\": \"Age in years\"     },     \"email\": {       \"type\": \"string\",       \"format\": \"email\",       \"description\": \"Email address\"     }   },   \"required\": [\"firstName\", \"lastName\", \"email\"],   \"additionalProperties\": false }"
}

Guardrail Configuration

AI Studio

Within a Deployment or Agent, use your JSON Evaluator as a Guardrail to block payloads that don’t validate the given JSON Schema. Enabling the Guardrail toggle will block non-conforming payloads.

Testing

AI Studio

Configure the JSON payload in the Studio Playground:

Studio Playground panel for configuring the JSON payload sent to a JSON evaluator.

Click Run to execute the evaluator. The result appears in the Response field.

Response field showing the result of a JSON evaluator test run.

Versions

AI Studio

When you are done editing, click Publish to save your changes. You will be prompted to write a commit message and choose a version bump:

Patch (e.g. v1.0.0 to v1.0.1): small fixes, no behaviour change
Minor (e.g. v1.0.0 to v1.1.0): new functionality, backwards compatible
Major (e.g. v1.0.0 to v2.0.0): breaking change or significant rework

The Versions tab shows the full history with author and publish timestamp for each version.

Each published version has three action buttons:

Action	Icon	Description
Compare		Open a diff view to see what changed between versions
Code		Load a code snippet to invoke the evaluator at this exact version
Environment		Tag the version with an Environment (e.g. production, staging)

Reference a specific version by appending @ and the version number: my-evaluator@1.0.1. Reference an environment tag directly: my-evaluator@production. Without a suffix, the latest published version is used.

List Evaluators

API & SDK

Use the List Evaluators API:

curl --request GET \
     --url https://api.orq.ai/v2/evaluators \
     --header 'accept: application/json' \
     --header 'authorization: Bearer ORQ_API_KEY'

Invoke an Evaluator

API & SDK

Call a library evaluator (for example, the Tone of Voice evaluator):

curl --request POST \
     --url https://api.orq.ai/v2/evaluators/tone_of_voice \
     --header 'accept: application/json' \
     --header 'authorization: Bearer ORQ_API_KEY' \
     --header 'content-type: application/json' \
     --data '{
  "query": "Validate the tone of voice if it is professional.",
  "output": "Hello, how are you ??",
  "model": "openai/gpt-4o"
}'

Call a custom evaluator: fetch the evaluator ID from the List Evaluators API, then invoke it. Use the View Code button on your evaluator page in the AI Studio to get a pre-filled snippet.

Evaluator page with the View Code button exposing a pre-filled invocation snippet.

curl 'https://api.orq.ai/v2/evaluators/<evaluator_id>/invoke' \
-H 'Authorization: Bearer ORQ_API_KEY' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
--data-raw '{
    "query": "Your input text",
    "output": "Your output text",
    "reference": "Optional reference text",
    "messages": [{"role": "user", "content": "Your message"}],
    "retrievals": ["Your retrieval content"]
}'

Guardrail Error Response

When a guardrail evaluation fails, Orq.ai returns an HTTP 422 Unprocessable Entity. The response body lists every guardrail that did not pass.

Deployments
Agents

{
  "code": 422,
  "error": "Validation failed: Not all guardrails were met while validating the response.",
  "message": "Validation failed: Not all guardrails were met while validating the response.",
  "source": "system",
  "guardrails": [
    {
      "id": "01KMR75R90XDA80020YT8MHP2W",
      "status": "completed",
      "started_at": "2026-03-27T17:58:55.330Z",
      "finished_at": "2026-03-27T17:58:55.364Z",
      "related_entities": [
        {
          "type": "evaluator",
          "evaluator_id": "01KK9D8Z0JCEC1ASQJH8R28B57",
          "evaluator_metric_name": "python_evaluator"
        }
      ],
      "passed": false,
      "reason": null,
      "evaluator_type": "output_guardrail",
      "type": "boolean",
      "value": false
    }
  ]
}

Field	Type	Description
`id`	string	Internal ID of the guardrail result.
`status`	string	Execution status: `"completed"` or `"failed"`.
`started_at`	string	ISO 8601 timestamp when the guardrail evaluation started.
`finished_at`	string	ISO 8601 timestamp when the guardrail evaluation finished.
`related_entities`	array	References to the evaluator that ran. Each entry contains `type`, `evaluator_id`, and `evaluator_metric_name`.
`passed`	boolean	`false` for every entry in this error response.
`reason`	string or null	Explanation of the failure, when provided by the evaluator.
`evaluator_type`	string	`"input_guardrail"` if the guardrail ran before the model. `"output_guardrail"` if the guardrail ran after generation.
`type`	string	The value type returned by the evaluator: `"boolean"`, `"number"`, or `"categorical"`.
`value`	boolean, number, or string	The raw value returned by the evaluator.

{
  "code": 422,
  "error": "Validation failed: Not all guardrails were met while validating the messages.",
  "message": "Validation failed: Not all guardrails were met while validating the messages.",
  "source": "system"
}

The guardrails array is not included in Agent responses. Use Traces in the Orq.ai Studio to identify which guardrail failed.

When the evaluator fails to execute: If the evaluator itself fails to run (for example, a network error or timeout), the guardrail is silently skipped and the generation proceeds. Monitor skipped guardrail executions through Traces.When an LLM guardrail’s underlying model fails: If the model powering an LLM guardrail is unavailable, Orq.ai fails the entire request for safety. Since the guardrail could not run, there is no way to know whether it would have blocked the generation.

Evaluatorq

Evaluatorq is a dedicated SDK for running evaluations programmatically. It supports parallel job execution, flexible data sources (inline, CSV, Orq datasets), and syncs results to the Orq.ai AI Studio.

API & SDK

Install:

npm install @orq-ai/evaluatorq

Usage example:

import { evaluatorq, job } from "@orq-ai/evaluatorq";

const textAnalyzer = job("text-analyzer", async (data) => {
    const text = data.inputs.text;
    return {
        length: text.length,
        wordCount: text.split(" ").length,
        uppercase: text.toUpperCase(),
    };
});

await evaluatorq("text-analysis", {
    data: [
        { inputs: { text: "Hello world" } },
        { inputs: { text: "Testing evaluation" } },
    ],
    jobs: [textAnalyzer],
    evaluators: [
        {
            name: "length-check",
            scorer: async ({ output }) => {
                const passesCheck = output.length > 10;
                return {
                    value: passesCheck ? 1 : 0,
                    explanation: passesCheck
                        ? "Output length is sufficient"
                        : `Output too short (${output.length} chars, need >10)`,
                };
            },
        },
    ],
});

See the Python Evaluatorq and TypeScript Evaluatorq repositories for more.

Cookbook: Running evaluations in parallel with Evaluatorq

Step-by-step walkthrough comparing agent variants with parallel evaluators, including DeepEval and RAGAS integration.

Documentation Index

LLM Evaluator

Python Evaluator

HTTP Evaluator

JSON Evaluator

​Use Cases

​LLM Evaluator

​Configure Prompt

​Output Types

​Examples

​Testing

​Python Evaluator

​Environment and Libraries

​Guardrail Configuration

​Testing

​HTTP Evaluator

​Guardrail Configuration

​Testing

​JSON Evaluator

​Guardrail Configuration

​Testing

​Versions

​List Evaluators

​Invoke an Evaluator

​Guardrail Error Response

​Evaluatorq

Cookbook: Running evaluations in parallel with Evaluatorq

Use Cases

LLM Evaluator

Configure Prompt

Output Types

Examples

Testing

Python Evaluator

Environment and Libraries

Guardrail Configuration

Testing

HTTP Evaluator

Guardrail Configuration

Testing

JSON Evaluator

Guardrail Configuration

Testing

Versions

List Evaluators

Invoke an Evaluator

Guardrail Error Response

Evaluatorq