Orq.ai Documentation - AI Gateway & LLM Collaboration Platform

To start creating an evaluator, use the ’+’ button in your Project or Folder. The following menu opens.

Here you can choose the type of Evaluator you want to create, see hereafter the specific documentation for each type.

Once your evaluator is configured, click Publish to save a new version. Changes remain as a draft until published. See Versions for the full version history and available actions.

HTTP Evaluator

HTTP evaluators allow users to set up a custom evaluation by calling an external API, enabling flexible and tailored assessments of generated responses. This approach lets users leverage their own or third-party APIs to perform specific checks, such as custom quality scoring, compliance verification, or domain-specific validations that align with their unique requirements. When creating an HTTP Evaluator, define the following details of your Evaluator:

Field	Description
URL	The API Endpoint.
Headers	Key-value pairs for HTTP Headers sent during evaluation.
Payload	Key-value pairs for HTTP Body sent during evaluation.

Payload Detail

Here are the payload variables accessible to perform evaluation

{
  "query": "", // string containing the last message sent to the model
  "response": "", // string containing the assistant generated response
  "expected_output": "", // string containing the dataset reference for the evaluation
  "retrieved_context": [] // array of strings containing the  knowledge base retrievals
}

Expected Response Payload

For an HTTP Evaluator to be valid, orq.ai is expecting a certain response payload returning the evaluation result.

Boolean Response

You can decide to return a Boolean response to the evaluation, the following is the expected payload:

{
    "type": "boolean",
    "value": true
}

Number Response

You can decide to return a Number response to the evaluation, the following is the expected payload:

{
    "type": "number",
    "value": 1
}

If you fail to return one of the two payloads shown above, the Evaluator will be ignored during processing.

String Response

You can decide to return a String response to the evaluation, the following is the expected payload:

{
  "type": "string",
  "value": "This response passed all compliance checks"
}

Example for HTTP Evaluators

HTTP Evaluators can be useful to implement business or industry-specific checks from within your applications. You can build an Evaluator using an API on your systems that will perform a compliance check for instance. This HTTP Evaluator has agency over calls routed through orq.ai while keeping business intelligence and logic within your environments. This ensures that generated content adheres to your organization’s specific regulatory guidelines. For example, in case the content is not adhering to regulatory guidelines, the HTTP call could return the following, failing the evaluator along the way.

{
    "type": "boolean",
    "value": true
}

Guardrail Configuration

Within a Deployment, you can use your HTTP Evaluator as a Guardrail, effectively preventing deployments to respond to a user depending on the Input or Output. Here you can define the guardrail condition:

if the HTTP Evaluator returns a value higher than the defined value, the call is accepted.
If the HTTP Evaluator returns a value lower or equal to the defined value, the call is then denied.

Once created the Evaluator will be available to use in Experiments, Deployments, and Agents, to learn more, see Using Evaluator in Experiment.

Testing an HTTP Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

JSON Evaluator

JSON Evaluators allow users to validate JSON payloads against JSON Schemas, ensuring correct incoming or outgoing payload for your model. When creating an evaluator, you can specify a JSON Schema that will be used.

A JSON Schema lets you define which fields you want to find in the evaluated payload Here is an example defining two mandatory fields: title and length

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The post title"
    },
    "length": {
      "type": "integer",
      "description": "The post length"
    }
  },
  "required": [ "title", "length" ]
}

Testing a JSON Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

Guardrail Configuration

Within a Deployment, you can use your JSON Evaluator as a Guardrail, effectively permitting a JSON Validation on input and output for a deployment generation. Enabling the Guardrail toggle will block payloads that don’t validate the given JSON Schema. Once created the Evaluator will be available to use in Deployments, to learn more see Evaluators & Guardrails in Deployments.

LLM Evaluator

Unlike Function Evaluators, LLM Evaluators assess the context and provide human-like judgments on the quality or appropriateness of content. When creating your LLM evaluator, select the model you would like to use to evaluate the output (the model needs to be enabled in your Model Garden). Choose which type of output your model evaluation will provide:

Boolean, if the evaluation generates a True/False response.
Number, if the evaluation generates a Score.

Configure Prompt

Your prompt has access to the following string variables:

{{log.input}} contains the last message sent to the model
{{log.output}} contains the output response generated by the evaluated model
{{log.messages}} contains the messages sent to the model, without the last message
{{log.retrievals}} contains Knowledge Base retrievals.
{{log.reference}} contains the reference used to compare output

Custom Rating Scales

When using a Number output type, you can define any rating scale that fits your use case. The evaluator will return whatever numeric value the LLM outputs.

You are not limited to a 1-10 scale. Use any range that makes sense for your evaluation criteria (e.g., 1-5, 0-100, or custom scales).

Examples

Evaluating formality on a 1-5 scale

Rate the formality of the following output on a scale of 1 to 5:
- 1: Very casual/informal
- 5: Very formal/professional

Only output the number.

[OUTPUT] {{log.output}}

Evaluating accuracy on a 0-100 scale

Evaluate how accurate the response [OUTPUT] is compared to the query [INPUT].

Score from 0 to 100, where:
- 0: Completely inaccurate or irrelevant
- 50: Partially accurate
- 100: Perfectly accurate and complete

Only output the score as a number.

[INPUT] {{log.input}}
[OUTPUT] {{log.output}}

Binary pass/fail with numeric output

Evaluate if the response adequately answers the user's question.

Return 1 if the response is satisfactory, 0 if it is not.

[QUESTION] {{log.input}}
[RESPONSE] {{log.output}}

Testing an LLM Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

Guardrail Configuration

Within a Deployment, you can use your LLM Evaluator as a Guardrail, effectively permitting a validation on input and output for a deployment generation. Enabling the Guardrail toggle will block payloads that don’t meet a score or expected boolean response. Once created the Evaluator will be available to use in Deployments, to learn more, see Evaluators & Guardrails in Deployments.

Python Evaluator

Python Evaluators enable users to write custom Python code to create tailored evaluations, offering maximum flexibility for assessing text or data. From simple validations (e.g. regex patterns, data formatting) to complex analyses (e.g. statistical checks, custom scoring algorithms), they execute user-defined logic to measure specific criteria. When creating a Python Evaluator, You are taken to the code editor to configure your Python evaluation. To perform an evaluation, you have access to the log of the Evaluated Model, which contains the following three fields:

log["input"] <str> The last message sent to generate the output.
log["output"] <str> The generated response from the model.
log["reference"] <str> The reference used to compare the output.
log["messages"] list<str> All previous messages sent to the model.
log["retrievals"] list<str> All Knowledge Base retrievals.

The evaluator can be configured with two different response types:

Number to return a score
Boolean to return a true/false value

The following example compares the output size with the given reference.

def evaluate(log):
    output_size = len(log["output"])
    reference_size = len(log["reference"])
    return abs(output_size - reference_size)

You can define multiple methods within the code editor, the last method will be the entry-point for the Evaluator when run.

Environment and Libraries

The Python Evaluator runs in the following environment: python 3.12 The environment comes preloaded with the following libraries:

numpy==1.26.4
nltk==3.9.1
json
re

Testing a Python Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps validate quickly that an evaluator is behaving correctly To do so, first configure the request:

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

Guardrail Configuration

Within a Deployment, you can use your Python Evaluator as a Guardrail, blocking potential calls to Enabling the Guardrail toggle will block payloads that don’t validate the given JSON Schema. Once created the Evaluator will be available to use in Deployments, to learn more see Evaluators & Guardrails in Deployments.

Versions

When you are done editing, click Publish to save your changes. You will be prompted to write a commit message and choose a version bump: major, minor, or patch.

Patch (e.g. v1.0.0 to v1.0.1): small fixes, no behavior change
Minor (e.g. v1.0.0 to v1.1.0): new functionality, backwards compatible
Major (e.g. v1.0.0 to v2.0.0): breaking change or significant rework

Every time you publish, a new version of the evaluator is created. The Versions tab shows the full history. Versions are numbered (e.g. v1.0.0, v1.1.0) and each entry shows the author and publish timestamp.

Each published version has three action buttons:

Action	Icon	Description
Compare		Open a diff view to see what changed between versions
Code		Load a code snippet to invoke the evaluator at this exact version
Environment		Tag the version with an Environment (e.g. production, staging)

Reference a specific version by appending @ and the version number: my-evaluator@1.0.1. You can also reference an environment tag directly: my-evaluator@production. Without a suffix, the latest published version is used.

Getting Started

Reference

Organization

Create Evaluators | LLM Output Assessment

HTTP Evaluator

Payload Detail

Expected Response Payload

Boolean Response

Number Response

String Response

Example for HTTP Evaluators

Guardrail Configuration

Testing an HTTP Evaluator

JSON Evaluator

Testing a JSON Evaluator

Guardrail Configuration

LLM Evaluator

Configure Prompt

Custom Rating Scales

Examples

Testing an LLM Evaluator

Guardrail Configuration

Python Evaluator

Environment and Libraries

Testing a Python Evaluator

Guardrail Configuration

Versions

Getting Started

Reference

Organization

​HTTP Evaluator

​Payload Detail

​Expected Response Payload

​Boolean Response

​Number Response

​String Response

​Example for HTTP Evaluators

​Guardrail Configuration

​Testing an HTTP Evaluator

​JSON Evaluator

​Testing a JSON Evaluator

​Guardrail Configuration

​LLM Evaluator

​Configure Prompt

​Custom Rating Scales

​Examples

​Testing an LLM Evaluator

​Guardrail Configuration

​Python Evaluator

​Environment and Libraries

​Testing a Python Evaluator

​Guardrail Configuration

​Versions

HTTP Evaluator

Payload Detail

Expected Response Payload

Boolean Response

Number Response

String Response

Example for HTTP Evaluators

Guardrail Configuration

Testing an HTTP Evaluator

JSON Evaluator

Testing a JSON Evaluator

Guardrail Configuration

LLM Evaluator

Configure Prompt

Custom Rating Scales

Examples

Testing an LLM Evaluator

Guardrail Configuration

Python Evaluator

Environment and Libraries

Testing a Python Evaluator

Guardrail Configuration

Versions