To start creating an evaluator, use the ’+’ button in your Project or Folder. The following menu opens.

Create a new Evaluator.

Here you can choose the type of Evaluator you want to create. The sections below document each type.

HTTP Evaluator

HTTP Evaluators allow users to set up a custom evaluation by calling an external API, enabling flexible and tailored assessments of generated responses. This approach lets users leverage their own or third-party APIs to perform specific checks, such as custom quality scoring, compliance verification, or domain-specific validations that align with their unique requirements. When creating an HTTP Evaluator, define the following details:
  • URL: The API endpoint.
  • Headers: Key-value pairs for HTTP headers sent during evaluation.
  • Payload: Key-value pairs for the HTTP body sent during evaluation.

Payload Detail

Here are the payload variables accessible to perform the evaluation:
{
  "query": "", // string containing the last message sent to the model
  "response": "", // string containing the assistant generated response
  "expected_output": "", // string containing the dataset reference for the evaluation
  "retrieved_context": [] // array of strings containing the  knowledge base retrievals
}
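For illustration only, the evaluation request is roughly equivalent to the following sketch, assuming a POST call made with the requests package; the URL, header, and example values below are hypothetical placeholders for your own Evaluator configuration:
import requests

# Example payload following the structure documented above (values are made up).
payload = {
    "query": "What is the refund policy?",
    "response": "Refunds are accepted within 30 days of purchase.",
    "expected_output": "Refunds are possible within 30 days.",
    "retrieved_context": ["Refund policy: 30-day window after purchase."],
}

result = requests.post(
    "https://api.example.com/evaluate",           # URL field of the Evaluator
    headers={"Authorization": "Bearer <token>"},  # Headers field of the Evaluator
    json=payload,                                 # Payload sent as the HTTP body
    timeout=10,
)
print(result.json())  # expected to match one of the response payloads below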

Expected Response Payload

For an HTTP Evaluator to be valid, orq.ai expects the endpoint to return a specific response payload containing the evaluation result.

Boolean Response

You can decide to return a Boolean response to the evaluation. The expected payload is:
{
    "type": "boolean",
    "value": true
}

Number Response

You can decide to return a Number response to the evaluation. The expected payload is:
{
    "type": "number",
    "value": 1
}
If you fail to return one of the expected response payloads shown on this page, the Evaluator will be ignored during processing.

String Response

You can decide to return a String response to the evaluation. The expected payload is:
{
  "type": "string",
  "value": "This response passed all compliance checks"
}
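As an illustration, here is a minimal sketch of an endpoint that could back an HTTP Evaluator, assuming FastAPI and Pydantic; the route path and the containment check are hypothetical, only the request and response shapes follow the formats documented above:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvaluationRequest(BaseModel):
    # Mirrors the payload variables sent during evaluation.
    query: str = ""
    response: str = ""
    expected_output: str = ""
    retrieved_context: list[str] = []

@app.post("/evaluate")
def evaluate(payload: EvaluationRequest) -> dict:
    # Hypothetical check: pass when the expected output is contained in the response.
    passed = payload.expected_output.lower() in payload.response.lower()
    # Return one of the expected response payloads (Boolean in this case).
    return {"type": "boolean", "value": passed}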

Example for HTTP Evaluators

HTTP Evaluators can be useful to implement business- or industry-specific checks from within your applications. For instance, you can build an Evaluator backed by an API on your systems that performs a compliance check. This HTTP Evaluator acts on calls routed through orq.ai while keeping business intelligence and logic within your environments, ensuring that generated content adheres to your organization’s specific regulatory guidelines. For example, if the content does not adhere to the regulatory guidelines, the HTTP call could return the following payload, failing the Evaluator along the way.
{
    "type": "boolean",
    "value": false
}

Guardrail Configuration

Within a Deployment, you can use your HTTP Evaluator as a Guardrail, effectively preventing a deployment from responding to a user depending on the input or output. Here you can define the guardrail condition:
  • If the HTTP Evaluator returns a value higher than the defined value, the call is accepted.
  • If the HTTP Evaluator returns a value lower than or equal to the defined value, the call is denied.
Once created, the Evaluator will be available to use in Experiments, Deployments, and Agents. To learn more, see Using Evaluator in Experiment.
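For illustration, the guardrail decision described above amounts to a simple comparison; here is a minimal sketch, assuming a Number response and a configured threshold value:
def guardrail_accepts(evaluator_value: float, threshold: float) -> bool:
    # Higher than the defined value: the call is accepted.
    # Lower than or equal to the defined value: the call is denied.
    return evaluator_value > threshold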

Testing an HTTP Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps you quickly validate that an evaluator is behaving correctly. To do so, first configure the request:

Here you can configure all fields that will be sent to an evaluator.

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

An HTTP test response.

JSON Evaluator

JSON Evaluators allow users to validate JSON payloads against JSON Schemas, ensuring that incoming or outgoing payloads for your model are correctly structured. When creating an evaluator, you can specify a JSON Schema that will be used.
A JSON Schema lets you define which fields you want to find in the evaluated payload. Here is an example defining two mandatory fields, title and length:
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The post title"
    },
    "length": {
      "type": "integer",
      "description": "The post length"
    }
  },
  "required": [ "title", "length" ]
}
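For illustration only, the following sketch shows what this schema accepts and rejects, validated locally with the jsonschema package (which is not required by orq.ai; the JSON Evaluator performs an equivalent check for you):
from jsonschema import ValidationError, validate

# The same schema as above.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "The post title"},
        "length": {"type": "integer", "description": "The post length"},
    },
    "required": ["title", "length"],
}

validate({"title": "Hello world", "length": 120}, schema)  # passes silently

try:
    validate({"title": "Hello world"}, schema)  # missing the mandatory "length"
except ValidationError as err:
    print(f"Validation failed: {err.message}")  # 'length' is a required property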

Testing a JSON Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps you quickly validate that an evaluator is behaving correctly. To do so, first configure the request:

Here you can configure the JSON payload that will be sent to an evaluator.

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

A JSON test response.

Guardrail Configuration

Within a Deployment, you can use your JSON Evaluator as a Guardrail, effectively permitting a JSON validation on the input and output of a deployment generation. Enabling the Guardrail toggle will block payloads that don’t validate against the given JSON Schema. Once created, the Evaluator will be available to use in Deployments. To learn more, see Evaluators & Guardrails in Deployments.

LLM Evaluator

Unlike Function Evaluators, LLM Evaluators assess the context and provide human-like judgments on the quality or appropriateness of content. When creating your LLM evaluator, select the model you would like to use to evaluate the output (the model needs to be enabled in your Model Garden). Choose which type of output your model evaluation will provide:
  • Boolean, if the evaluation generates a True/False response.
  • Number, if the evaluation generates a Score.

Configure Prompt

Your prompt has access to the following string variables:
  • {{log.input}} contains the last message sent to the model.
  • {{log.output}} contains the output response generated by the evaluated model.
  • {{log.messages}} contains the messages sent to the model, without the last message.
  • {{log.retrievals}} contains Knowledge Base retrievals.
  • {{log.reference}} contains the reference used to compare the output.

Example

Evaluating the Familiarity of an output
Evaluate the familiarity of the [OUTPUT], give a score between 1 and 10, 1 being very formal, 10 being very familiar. Only output the score.

[OUTPUT] {{log.output}}
Evaluating the accuracy of a response
Evaluate how accurate a response [OUTPUT] is compared to the query [INPUT]. Give a score between 1 and 10, 1 being not accurate at all, 10 being perfectly accurate. Only output the score.

[INPUT] {{log.input}}
[OUTPUT] {{log.output}}

Testing an LLM Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps you quickly validate that an evaluator is behaving correctly. To do so, first configure the request:

Here you can configure the LLM payload that will be sent to an evaluator.

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

An LLM Evaluator test response.

Guardrail Configuration

Within a Deployment, you can use your LLM Evaluator as a Guardrail, effectively permitting a validation on the input and output of a deployment generation. Enabling the Guardrail toggle will block payloads that don’t meet the defined score or expected Boolean response. Once created, the Evaluator will be available to use in Deployments. To learn more, see Evaluators & Guardrails in Deployments.

Python Evaluator

Python Evaluators enable users to write custom Python code to create tailored evaluations, offering maximum flexibility for assessing text or data. From simple validations (e.g. regex patterns, data formatting) to complex analyses (e.g. statistical checks, custom scoring algorithms), they execute user-defined logic to measure specific criteria. When creating a Python Evaluator, you are taken to the code editor to configure your Python evaluation. To perform an evaluation, you have access to the log of the Evaluated Model, which contains the following fields:
  • log["input"] <str> The last message sent to generate the output.
  • log["output"] <str> The generated response from the model.
  • log["reference"] <str> The reference used to compare the output.
  • log["messages"] list<str> All previous messages sent to the model.
  • log["retrievals"] list<str> All Knowledge Base retrievals.
The evaluator can be configured with two different response types:
  • Number to return a score
  • Boolean to return a true/false value
The following example compares the length of the output with the length of the given reference and returns the difference as a Number response.
def evaluate(log):
    output_size = len(log["output"])
    reference_size = len(log["reference"])
    return abs(output_size - reference_size)
You can define multiple methods within the code editor; the last method will be the entry point for the Evaluator when run.
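For comparison, here is a minimal sketch of a Boolean Python Evaluator using Python's built-in re module (listed among the preloaded libraries below); the email pattern and the PII-style rule are purely illustrative:
import re

def evaluate(log):
    # Boolean response: True when the output contains no email address.
    email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
    return re.search(email_pattern, log["output"]) is None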

Environment and Libraries

The Python Evaluator runs on Python 3.12. The environment comes preloaded with the following libraries:
numpy==1.26.4
nltk==3.9.1
json
re
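As a further illustration of these libraries, the following sketch returns a Number score based on the normalized edit distance between the output and the reference; the normalization scheme is an assumption for demonstration, not a recommended metric:
import nltk
import numpy as np

def evaluate(log):
    # Number response between 0.0 and 1.0: 1.0 means the output matches the
    # reference exactly, 0.0 means it is entirely different.
    output = log["output"]
    reference = log["reference"]
    distance = nltk.edit_distance(output, reference)
    max_len = max(len(output), len(reference), 1)
    return float(np.clip(1.0 - distance / max_len, 0.0, 1.0))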

Testing a Python Evaluator

Within the Studio, a Playground is available to test an evaluator against any output. This helps you quickly validate that an evaluator is behaving correctly. To do so, first configure the request:

Here you can configure the payload that will be sent to a Python evaluator.

Use the Run button to execute your evaluator with the request payload. The result will be displayed in the Response field.

A Python test response.

Guardrail Configuration

Within a Deployment, you can use your Python Evaluator as a Guardrail, blocking calls whose input or output does not pass the evaluation. Enabling the Guardrail toggle will block payloads that don’t meet the defined score or expected Boolean response. Once created, the Evaluator will be available to use in Deployments. To learn more, see Evaluators & Guardrails in Deployments.