Orq.ai Documentation - AI Gateway & LLM Collaboration Platform

To start creating an evaluator, use the ’+’ button in your Project or Folder. The following menu opens.

Here you can choose the type of Evaluator you want to create, see hereafter the specific documentation for each type.

Once your evaluator is configured, click Publish to save a new version. Changes remain as a draft until published. See Versions for the full version history and available actions.

HTTP Evaluator

HTTP evaluators allow users to set up a custom evaluation by calling an external API, enabling flexible and tailored assessments of generated responses. This approach lets users leverage their own or third-party APIs to perform specific checks, such as custom quality scoring, compliance verification, or domain-specific validations that align with their unique requirements. When creating an HTTP Evaluator, define the following details of your Evaluator:

Field	Description
URL	The API Endpoint.
Headers	Key-value pairs for HTTP Headers sent during evaluation.
Payload	Key-value pairs for HTTP Body sent during evaluation.

Payload Detail

Here are the payload variables accessible to perform evaluation

{
  "query": "", // string containing the last message sent to the model
  "response": "", // string containing the assistant generated response
  "expected_output": "", // string containing the dataset reference for the evaluation
  "retrieved_context": [] // array of strings containing the  knowledge base retrievals
}

Expected Response Payload

For an HTTP Evaluator to be valid, orq.ai is expecting a certain response payload returning the evaluation result.

Boolean Response

You can decide to return a Boolean response to the evaluation, the following is the expected payload:

{
    "type": "boolean",
    "value": true
}

Number Response

You can decide to return a Number response to the evaluation, the following is the expected payload:

{
    "type": "number",
    "value": 1
}

If you fail to return one of the two payloads shown above, the Evaluator will be ignored during processing.

String Response

You can decide to return a String response to the evaluation, the following is the expected payload:

{
  "type": "string",
  "value": "This response passed all compliance checks"
}

Example for HTTP Evaluators

HTTP Evaluators can be useful to implement business or industry-specific checks from within your applications. You can build an Evaluator using an API on your systems that will perform a compliance check for instance. This HTTP Evaluator has agency over calls routed through orq.ai while keeping business intelligence and logic within your environments. This ensures that generated content adheres to your organization’s specific regulatory guidelines. For example, in case the content is not adhering to regulatory guidelines, the HTTP call could return the following, failing the evaluator along the way.

{
    "type": "boolean",
    "value": true
}

Guardrail Configuration

Within a Deployment or Agent, you can use your HTTP Evaluator as a Guardrail, effectively preventing them from responding to a user depending on the Input or Output. Use the Pass condition to set a numeric threshold. The guardrail passes when the value returned by your endpoint is greater than or equal to the threshold. Once created the Evaluator will be available to use in Experiments, Deployments, and Agents, to learn more, see Using Evaluator in Experiment.

Testing an HTTP Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

JSON Evaluator

JSON Evaluators allow users to validate JSON payloads against JSON Schemas, ensuring correct incoming or outgoing payload for your model. When creating an evaluator, you can specify a JSON Schema that will be used.

A JSON Schema lets you define which fields you want to find in the evaluated payload Here is an example defining two mandatory fields: title and length

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The post title"
    },
    "length": {
      "type": "integer",
      "description": "The post length"
    }
  },
  "required": [ "title", "length" ]
}

Testing a JSON Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

Guardrail Configuration

Within a Deployment or Agent, you can use your JSON Evaluator as a Guardrail, effectively permitting a JSON Validation on input and output. Enabling the Guardrail toggle will block payloads that don’t validate the given JSON Schema. Once created the Evaluator will be available to use in Deployments and Agents, to learn more see Evaluators & Guardrails in Deployments.

LLM Evaluator

Unlike Function Evaluators, LLM Evaluators assess the context and provide human-like judgments on the quality or appropriateness of content. When creating your LLM evaluator, select the model you would like to use to evaluate the output (the model needs to be enabled in the AI Router). Choose which type of output your model evaluation will provide:

Boolean, if the evaluation generates a True/False response.
Number, if the evaluation generates a Score.

Configure Prompt

Your prompt has access to the following string variables:

{{log.input}} contains the last message sent to the model
{{log.output}} contains the output response generated by the evaluated model
{{log.messages}} contains the messages sent to the model, without the last message
{{log.retrievals}} contains Knowledge Base retrievals.
{{log.reference}} contains the reference used to compare output

Custom Rating Scales

When using a Number output type, you can define any rating scale that fits your use case. The evaluator will return whatever numeric value the LLM outputs.

You are not limited to a 1-10 scale. Use any range that makes sense for your evaluation criteria (e.g., 1-5, 0-100, or custom scales).

Examples

Evaluating formality on a 1-5 scale

Rate the formality of the following output on a scale of 1 to 5:
- 1: Very casual/informal
- 5: Very formal/professional

Only output the number.

[OUTPUT] {{log.output}}

Evaluating accuracy on a 0-100 scale

Evaluate how accurate the response [OUTPUT] is compared to the query [INPUT].

Score from 0 to 100, where:
- 0: Completely inaccurate or irrelevant
- 50: Partially accurate
- 100: Perfectly accurate and complete

Only output the score as a number.

[INPUT] {{log.input}}
[OUTPUT] {{log.output}}

Binary pass/fail with numeric output

Evaluate if the response adequately answers the user's question.

Return 1 if the response is satisfactory, 0 if it is not.

[QUESTION] {{log.input}}
[RESPONSE] {{log.output}}

Testing an LLM Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

Guardrail Configuration

Within a Deployment or Agent, you can use your LLM Evaluator as a Guardrail, effectively permitting a validation on input and output. The Pass condition depends on the output type you selected when creating the evaluator:

Boolean: select True or False. The guardrail passes when the model returns the selected value.
Number: enter a score threshold. The guardrail passes when the model’s score is greater than or equal to the threshold.
String: guardrail mode is not available for string output types.

Once created the Evaluator will be available to use in Deployments and Agents, to learn more, see Evaluators & Guardrails in Deployments.

Python Evaluator

Python Evaluators enable users to write custom Python code to create tailored evaluations, offering maximum flexibility for assessing text or data. From simple validations (e.g. regex patterns, data formatting) to complex analyses (e.g. statistical checks, custom scoring algorithms), they execute user-defined logic to measure specific criteria. When creating a Python Evaluator, You are taken to the code editor to configure your Python evaluation. To perform an evaluation, you have access to the log of the Evaluated Model, which contains the following three fields:

log["input"] <str> The last message sent to generate the output.
log["output"] <str> The generated response from the model.
log["reference"] <str> The reference used to compare the output.
log["messages"] list<str> All previous messages sent to the model.
log["retrievals"] list<str> All Knowledge Base retrievals.

The evaluator can be configured with two different response types:

Number to return a score
Boolean to return a true/false value

The following example compares the output size with the given reference.

def evaluate(log):
    output_size = len(log["output"])
    reference_size = len(log["reference"])
    return abs(output_size - reference_size)

You can define multiple methods within the code editor, the last method will be the entry-point for the Evaluator when run.

Environment and Libraries

The Python Evaluator runs in the following environment: python 3.12 The environment comes preloaded with the following libraries:

numpy==1.26.4
nltk==3.9.1
json
re

Testing a Python Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

Guardrail Configuration

Within a Deployment or Agent, you can use your Python Evaluator as a Guardrail, blocking generations that don’t meet your custom evaluation logic. Use the Pass condition to define when the guardrail passes:

Boolean evaluators: select True or False. The guardrail passes when your function returns the selected value.
Number evaluators: enter a score threshold. The guardrail passes when your function’s return value is greater than or equal to the threshold.

Once created the Evaluator will be available to use in Deployments and Agents, to learn more see Evaluators & Guardrails in Deployments.

Guardrail Error Response

When a guardrail evaluation fails, Orq.ai returns an HTTP 422 Unprocessable Entity to the caller. The response body lists every guardrail that did not pass so your application can decide how to handle or surface the failure.

Deployments
Agents

{
  "code": 422,
  "error": "Validation failed: Not all guardrails were met while validating the response.",
  "message": "Validation failed: Not all guardrails were met while validating the response.",
  "source": "system",
  "guardrails": [
    {
      "id": "01KMR75R90XDA80020YT8MHP2W",
      "status": "completed",
      "started_at": "2026-03-27T17:58:55.330Z",
      "finished_at": "2026-03-27T17:58:55.364Z",
      "related_entities": [
        {
          "type": "evaluator",
          "evaluator_id": "01KK9D8Z0JCEC1ASQJH8R28B57",
          "evaluator_metric_name": "python_evaluator"
        }
      ],
      "passed": false,
      "reason": null,
      "evaluator_type": "output_guardrail",
      "type": "boolean",
      "value": false
    }
  ]
}

Field	Type	Description
`id`	string	Internal ID of the guardrail result.
`status`	string	Execution status of the guardrail: `"completed"` or `"failed"`.
`started_at`	string	ISO 8601 timestamp when the guardrail evaluation started.
`finished_at`	string	ISO 8601 timestamp when the guardrail evaluation finished.
`related_entities`	array	References to the evaluator that ran as this guardrail. Each entry contains `type`, `evaluator_id`, and `evaluator_metric_name`.
`passed`	boolean	`false` for every entry in this error response, as the guardrail’s condition was not met. For example: a boolean guardrail configured to pass on `true` returns `passed: false` when the evaluator returns `false`.
`reason`	string or null	Explanation of the failure, when provided by the evaluator.
`evaluator_type`	string	`"input_guardrail"` if the guardrail ran before the model (request rejected before generation). `"output_guardrail"` if the guardrail ran after generation (response withheld).
`type`	string	The value type returned by the evaluator: `"boolean"` or `"number"`.
`value`	boolean or number	The raw value returned by the evaluator.

{
  "code": 422,
  "error": "Validation failed: Not all guardrails were met while validating the messages.",
  "message": "Validation failed: Not all guardrails were met while validating the messages.",
  "source": "system"
}

The guardrails array is not included in Agent responses. Use Traces in the Orq.ai Studio to identify which guardrail failed.

When the evaluator fails to execute: If the evaluator itself fails to run (for example, a network error contacting an external HTTP evaluator or a timeout), the guardrail is silently skipped and the generation proceeds. A broken evaluator does not block your users. Monitor skipped guardrail executions through Traces in the Orq.ai Studio.When an LLM guardrail’s underlying model fails: If the model powering an LLM guardrail is unavailable, Orq.ai fails the entire request for safety. Since the guardrail could not run, there is no way to know whether it would have blocked the generation, so Orq.ai errs on the side of caution.

Versions

When you are done editing, click Publish to save your changes. You will be prompted to write a commit message and choose a version bump: major, minor, or patch.

Patch (e.g. v1.0.0 to v1.0.1): small fixes, no behavior change
Minor (e.g. v1.0.0 to v1.1.0): new functionality, backwards compatible
Major (e.g. v1.0.0 to v2.0.0): breaking change or significant rework

Every time you publish, a new version of the evaluator is created. The Versions tab shows the full history. Versions are numbered (e.g. v1.0.0, v1.1.0) and each entry shows the author and publish timestamp.

Each published version has three action buttons:

Action	Icon	Description
Compare		Open a diff view to see what changed between versions
Code		Load a code snippet to invoke the evaluator at this exact version
Environment		Tag the version with an Environment (e.g. production, staging)

Reference a specific version by appending @ and the version number: my-evaluator@1.0.1. You can also reference an environment tag directly: my-evaluator@production. Without a suffix, the latest published version is used.

Getting Started

Reference

Create Evaluators | LLM Output Assessment

HTTP Evaluator

Payload Detail

Expected Response Payload

Boolean Response

Number Response

String Response

Example for HTTP Evaluators

Guardrail Configuration

Testing an HTTP Evaluator

JSON Evaluator

Testing a JSON Evaluator

Guardrail Configuration

LLM Evaluator

Configure Prompt

Custom Rating Scales

Examples

Testing an LLM Evaluator

Guardrail Configuration

Python Evaluator

Environment and Libraries

Testing a Python Evaluator

Guardrail Configuration

Guardrail Error Response

Versions

Getting Started

Reference

​HTTP Evaluator

​Payload Detail

​Expected Response Payload

​Boolean Response

​Number Response

​String Response

​Example for HTTP Evaluators

​Guardrail Configuration

​Testing an HTTP Evaluator

​JSON Evaluator

​Testing a JSON Evaluator

​Guardrail Configuration

​LLM Evaluator

​Configure Prompt

​Custom Rating Scales

​Examples

​Testing an LLM Evaluator

​Guardrail Configuration

​Python Evaluator

​Environment and Libraries

​Testing a Python Evaluator

​Guardrail Configuration

​Guardrail Error Response

​Versions

HTTP Evaluator

Payload Detail

Expected Response Payload

Boolean Response

Number Response

String Response

Example for HTTP Evaluators

Guardrail Configuration

Testing an HTTP Evaluator

JSON Evaluator

Testing a JSON Evaluator

Guardrail Configuration

LLM Evaluator

Configure Prompt

Custom Rating Scales

Examples

Testing an LLM Evaluator

Guardrail Configuration

Python Evaluator

Environment and Libraries

Testing a Python Evaluator

Guardrail Configuration

Guardrail Error Response

Versions