# Create Evaluators

> Build LLM-as-a-Judge and Python evaluators in Orq.ai to automatically score model outputs from AI Studio, the API, or the Orq MCP server.

**Evaluators** are automated tools that assess model outputs within [Experiments](/docs/experiments/overview), [Deployments](/docs/deployments/overview), and [Agents](/docs/agents/build). They verify outputs against reference data, enforce compliance criteria, and power **Guardrails** that block non-compliant generations before they reach users.

Two evaluator types are available:

<CardGroup cols={2}>
  <Card title="LLM Evaluator" icon="robot" href="#llm-evaluator">
    Use a model to judge outputs against any criteria you define in a prompt.
  </Card>

  <Card title="Python Evaluator" icon="python" href="#python-evaluator">
    Write custom Python code for full flexibility. Use for statistical scoring, regex checks, length validation, or any custom evaluation logic.
  </Card>
</CardGroup>

<Note>
  **HTTP and JSON evaluators are deprecated.** Existing HTTP and JSON evaluators continue to work, but cannot be duplicated. Use Python evaluators instead: the `requests` package is now available for HTTP calls, and `pydantic` is available for JSON schema validation.
</Note>

## Use Cases

<AccordionGroup>
  <Accordion title="Automated quality scoring" icon="star">
    Score model outputs on dimensions like tone, accuracy, or relevance without manual review. Use LLM-as-a-Judge evaluators with custom rubrics, or import pre-built scoring functions from the [Hub](/docs/hub/overview).
  </Accordion>

  <Accordion title="Output compliance checks" icon="shield-check">
    Verify that outputs meet specific format, content, or structural requirements. Use Python evaluators for custom logic such as regex checks, length validation, or structural assertions.
  </Accordion>

  <Accordion title="Guardrails in Deployments and Agents" icon="lock">
    Attach evaluators as guardrails to block generations that fail a pass condition. Input guardrails run before the model; output guardrails run after. A failed guardrail returns HTTP 422 to the caller.
  </Accordion>

  <Accordion title="Regression testing in Experiments" icon="flask">
    Run evaluators across a full dataset in an Experiment to track quality over time. Compare evaluator scores across runs and prompt variants to catch regressions before deploying changes.
  </Accordion>
</AccordionGroup>

## Pre-built Evaluators

Before building one from scratch, browse the [Hub](/docs/hub/overview) for ready-to-use evaluators. Add any of them to a [Project](/docs/projects/overview) with the **Add to project** button, then use them in [Experiments](/docs/experiments/build), [Deployments](/docs/deployments/creating), and [Agents](/docs/agents/build).

The Hub groups its evaluators into three categories:

* [Function Evaluators](/docs/hub/overview#function-evaluators): deterministic checks such as **Contains**, **Valid JSON**, **Length Between**, and **BLEU Score**.
* [LLM Evaluators](/docs/hub/overview#llm-evaluators): model-judged checks such as **Tone of Voice**, **Grammar**, **PII**, and **Sentiment Classification**.
* [RAGAS Evaluators](/docs/hub/overview#ragas-evaluators): retrieval-augmented generation metrics such as **Faithfulness**, **Context Precision**, and **Response Relevancy**.

## LLM Evaluator

LLM Evaluators use a model to judge outputs against any criteria you define in a prompt.

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    In a [Project](/docs/projects/overview) or folder, click the <kbd><Icon icon="plus" /></kbd> button and select **LLM Evaluator**. Select the model to use for evaluation. It must be enabled in the [AI Gateway](/docs/router/using-the-router).

    <Frame caption="The LLM Evaluator settings panel showing the prompt editor, model selector, output types, guardrail configuration, and playground.">
      <img src="https://mintcdn.com/orqai/MT6hA9WdZfQ3qEfV/images/evaluator-studio-410.png?fit=max&auto=format&n=MT6hA9WdZfQ3qEfV&q=85&s=c1a8fca33c548e1a935d09a51d9ef299" alt="LLM Evaluator settings panel with prompt area, model dropdown set to DeepSeek V3.1, output type tabs, guardrail pass condition, and a playground panel with Editor and Dataset tabs." width="1541" height="1227" data-path="images/evaluator-studio-410.png" />
    </Frame>
  </Tab>

  <Tab title="API & SDK" icon="code">
    Use the [Create an Evaluator API](/reference/evaluators/create-an-evaluator).

    ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
    curl --request POST \
         --url https://api.orq.ai/v2/evaluators \
         --header 'accept: application/json' \
         --header 'authorization: Bearer ORQ_API_KEY' \
         --header 'content-type: application/json' \
         --data '{
      "type": "llm_eval",
      "prompt": "Rate how appropriate the following output is on a scale from 0 to 1, where 0 is inappropriate and 1 is perfectly appropriate. Only output the number.\n\n[OUTPUT] {{log.output}}",
      "path": "Default/evaluators",
      "model": "openai/gpt-4o",
      "key": "myKey",
      "guardrail_config": {
        "enabled": true,
        "type": "number",
        "value": 0.7,
        "operator": "gte"
      }
    }'
    ```
  </Tab>

  <Tab title="MCP" icon="https://mintcdn.com/orqai/i7ZhKI7LFRfXU7ox/images/logos/mcp.svg?fit=max&auto=format&n=i7ZhKI7LFRfXU7ox&q=85&s=cef7916eb5fe1f6bb97541398d3f7639" width="16" height="16" data-path="images/logos/mcp.svg">
    **Retrieve an evaluator's configuration:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Show me the current configuration for the "tone-scorer" evaluator
    ```

    The assistant uses `search_entities` to resolve the evaluator ID, then `get_llm_eval` to retrieve the full configuration including prompt, model, and output type.

    ***

    **Create an LLM evaluator:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create an LLM-as-a-Judge evaluator that scores responses on tone: professional, neutral, or aggressive
    ```

    The assistant uses `create_llm_eval` with a categorical scoring rubric and confirms the evaluator ID.

    ***

    **Update an existing LLM evaluator:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Update the "tone-scorer" evaluator to also check for formal language and return a boolean instead of a number
    ```

    The assistant uses `search_entities` to find the evaluator, then `update_llm_eval` with the updated `prompt` and `output_type: "boolean"`.
  </Tab>
</Tabs>

### Configure Prompt

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Your prompt has access to the following **string** variables:

    * `{{log.input}}`: the last message sent to the model
    * `{{log.output}}`: the output response generated by the evaluated model
    * `{{log.messages}}`: all messages sent to the model, excluding the last message
    * `{{log.retrievals}}`: [Knowledge Base](/docs/knowledge/overview) retrievals
    * `{{log.reference}}`: the reference used to compare output
  </Tab>
</Tabs>
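To make the substitution concrete, here is a minimal sketch of how `{{log.*}}` placeholders could be filled in before the prompt reaches the judge model. The `render_prompt` helper, its regex, and the choice to leave unknown placeholders untouched are illustrative assumptions, not Orq.ai's actual templating implementation; only the variable names come from the list above.

```python
import re

# Hypothetical helper: illustrates placeholder substitution only.
# Orq.ai performs this step internally; this is not its implementation.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, variables: dict) -> str:
    """Replace each {{name}} with its string value; leave unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


template = "Rate the formality from 1 to 5.\n\n[OUTPUT] {{log.output}}"
variables = {
    "log.input": "Can you help me reset my password?",
    "log.output": "Sure thing! Just hit the reset link.",
}
rendered = render_prompt(template, variables)
print(rendered)
```

Because every variable is a string, structured values such as messages or retrievals would need to be serialized before they are inserted into the prompt.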

### Model Parameters

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    The **Model** field selects which model acts as judge. Any model enabled in the [AI Gateway](/docs/router/using-the-router) is available. The model choice affects evaluation quality, cost, and latency.
  </Tab>
</Tabs>

### Output and Guardrail Configuration

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Select the output type that matches the evaluation criteria. The **Guardrail configuration** panel is visible directly in the evaluator settings. Set the pass condition for each type:

    <Tabs>
      <Tab title="Boolean" icon="bars">
        The model returns a **True** or **False** response. Use for binary pass/fail checks.

        **Guardrail**: Select **True** or **False**. The guardrail passes when the model returns the selected value.
      </Tab>

      <Tab title="Number" icon="hashtag">
        The model returns a numeric score. Use any scale that fits the use case (e.g. 1-5, 0-100).

        **Guardrail**: Enter a threshold in **Pass if greater or equal than**. The guardrail passes when the score meets or exceeds the threshold.
      </Tab>

      <Tab title="Categorical" icon="grid-2">
        The model classifies the output into one of the predefined labels.

        When **Categorical** is selected, a label editor appears below the output type selector. Add one label per row: enter a **Value** (the exact string the model must return) and an optional **Description** to guide the model. At least one label is required.

        **Guardrail**: Select one or more values in **Pass if output is one of**. The guardrail passes when the model's output matches any of the selected labels.

        <Frame caption="Configure which categorical labels must match for the guardrail to pass.">
          <img src="https://mintcdn.com/orqai/7yBnkUrxNQ0b0A6G/images/guardrail-categorical.png?fit=max&auto=format&n=7yBnkUrxNQ0b0A6G&q=85&s=84746abccab181820e86aa02c7b658aa" alt="Categorical guardrail configuration" width="415" height="441" data-path="images/guardrail-categorical.png" />
        </Frame>
      </Tab>

      <Tab title="String" icon="font">
        The model returns a free-form string response. Not available as a guardrail.
      </Tab>
    </Tabs>

    Once configured, the evaluator is available as a guardrail in any [Deployment](/docs/deployments/creating#evaluators-and-guardrails) or [Agent](/docs/agents/build#configure-evaluators-and-guardrails) without any additional toggle.
  </Tab>
</Tabs>
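The pass conditions described for each output type can be summarized in a few lines of Python. This is a sketch of the rules as stated in the tabs above, not Orq.ai's implementation; the `guardrail_passes` function and its argument names are hypothetical.

```python
def guardrail_passes(output_type, result, condition):
    """Hypothetical mirror of the AI Studio pass conditions."""
    if output_type == "boolean":
        # Passes when the judge returns the selected value (True or False).
        return result == condition
    if output_type == "number":
        # Passes when the score meets or exceeds the threshold.
        return float(result) >= float(condition)
    if output_type == "categorical":
        # Passes when the label is one of the selected values.
        return result in condition
    # String outputs are not available as guardrails.
    raise ValueError(f"{output_type!r} cannot be used as a guardrail")


print(guardrail_passes("boolean", True, True))
print(guardrail_passes("number", 4, 3))
print(guardrail_passes("categorical", "aggressive", {"professional", "neutral"}))
```

In this sketch a categorical condition is a set of accepted labels, which matches selecting one or more values in **Pass if output is one of**.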

### Examples

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    <AccordionGroup>
      <Accordion title="Evaluating formality on a 1-5 scale" icon="sliders">
        ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
        Rate the formality of the following output on a scale of 1 to 5:
        - 1: Very casual/informal
        - 5: Very formal/professional

        Only output the number.

        [OUTPUT] {{log.output}}
        ```
      </Accordion>

      <Accordion title="Evaluating accuracy on a 0-100 scale" icon="bullseye">
        ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
        Evaluate how accurate the response [OUTPUT] is compared to the query [INPUT].

        Score from 0 to 100, where:
        - 0: Completely inaccurate or irrelevant
        - 50: Partially accurate
        - 100: Perfectly accurate and complete

        Only output the score as a number.

        [INPUT] {{log.input}}
        [OUTPUT] {{log.output}}
        ```
      </Accordion>

      <Accordion title="Binary pass/fail with numeric output" icon="circle-check">
        ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
        Evaluate if the response adequately answers the user's question.

        Return 1 if the response is satisfactory, 0 if it is not.

        [QUESTION] {{log.input}}
        [RESPONSE] {{log.output}}
        ```
      </Accordion>

      <Accordion title="Consistency with the prior conversation" icon="comments">
        ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
        Review the full prior conversation and the latest response.

        Return 1 if the response stays consistent with what was already discussed, 0 if it contradicts earlier messages.

        [CONVERSATION] {{log.messages}}
        [RESPONSE] {{log.output}}
        ```
      </Accordion>

      <Accordion title="Comparing output against a reference" icon="equals">
        ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
        Compare the response [OUTPUT] against the reference answer [REFERENCE].

        Return 1 if the response conveys the same meaning as the reference, 0 if it does not.

        [OUTPUT] {{log.output}}
        [REFERENCE] {{log.reference}}
        ```
      </Accordion>

      <Accordion title="Validating tool usage" icon="wrench">
        ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
        Review the tool calls made during the run.

        Return 1 if the correct tool was called with valid arguments for the user's request, 0 otherwise.

        [REQUEST] {{log.input}}
        [TOOL CALLS] {{log.tool_calls}}
        ```
      </Accordion>
    </AccordionGroup>
  </Tab>
</Tabs>

### Testing

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    <Tabs>
      <Tab title="Editor" icon="pen-to-square">
        Fill in the payload manually by entering values for `messages`, `input`, `output`, `retrievals`, and `reference`. Every prompt variable resolves against the values you enter.

        <Frame caption="Configure the LLM payload that will be sent to the evaluator.">
          <img src="https://mintcdn.com/orqai/XbJWQ7lqn4sIVHea/images/docs/d04c8a4424879b761bdbb59bb58193a0cb562cf0abc1d01cd7cc526af5c3b431-Screenshot_2025-06-27_at_11.12.01.png?fit=max&auto=format&n=XbJWQ7lqn4sIVHea&q=85&s=4555fbee055923f79fbfd15778d24a51" alt="Studio Playground panel for configuring the LLM payload sent to an LLM evaluator." width="806" height="558" data-path="images/docs/d04c8a4424879b761bdbb59bb58193a0cb562cf0abc1d01cd7cc526af5c3b431-Screenshot_2025-06-27_at_11.12.01.png" />
        </Frame>

        Click **Run** to execute the evaluator. The result appears in the **Response** field.

        <Frame caption="An LLM Evaluator test response.">
          <img src="https://mintcdn.com/orqai/ep9iJPTKd6tE7QFF/images/docs/a2c8694931c114e9305160eac7f7aedd285d3f07e6789090e84c226bf0ea090c-Screenshot_2025-06-27_at_11.23.38.png?fit=max&auto=format&n=ep9iJPTKd6tE7QFF&q=85&s=12ce3f6d102258f6dd7de0021709ad91" alt="Response field showing the result of an LLM evaluator test run." width="820" height="494" data-path="images/docs/a2c8694931c114e9305160eac7f7aedd285d3f07e6789090e84c226bf0ea090c-Screenshot_2025-06-27_at_11.23.38.png" />
        </Frame>
      </Tab>

      <Tab title="Dataset" icon="database">
        Select a dataset from the dropdown. Use the row pagination controls to navigate between rows. The selected row's data is shown in the tree view.

        The following variables are available in the evaluator prompt when testing with a Dataset:

        | Source              | Prompt variable     | Description                                                                                            |
        | ------------------- | ------------------- | ------------------------------------------------------------------------------------------------------ |
        | `inputs.field_name` | `{{field_name}}`    | Custom input fields from the dataset row, referenced directly by field name (e.g. `{{product_input}}`) |
        | `messages`          | `{{log.messages}}`  | The messages used to generate the output, without the last user message                                |
        | `reference`         | `{{log.reference}}` | The reference used to compare the output                                                               |

        <Warning>
          `{{log.input}}` and `{{log.output}}` are **not** available when testing with a Dataset. They are populated only at execution time when an actual model call has been made.
        </Warning>

        Click **Run test** to execute the evaluator against the selected row. The result appears in the **Response** field.
      </Tab>
    </Tabs>

    <Info>
      Once created, this evaluator is available as a guardrail in **Deployments** and **Agents**. See [Evaluators and Guardrails in Deployments](/docs/deployments/creating#evaluators-and-guardrails) and [Evaluators and Guardrails in Agents](/docs/agents/build#configure-evaluators-and-guardrails) to learn more.
    </Info>
  </Tab>
</Tabs>
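For Dataset testing, the variable mapping in the table above can be pictured as a small transformation from a dataset row to prompt variable names. The sketch below assumes a row is a dictionary with `inputs`, `messages`, and `reference` keys, as the **Source** column suggests; the `dataset_prompt_variables` helper and the sample row are hypothetical.

```python
def dataset_prompt_variables(row):
    """Hypothetical mapping from a dataset row to prompt variable names."""
    # Custom input fields are referenced directly by name: {{field_name}}.
    variables = dict(row.get("inputs", {}))
    variables["log.messages"] = row.get("messages", [])
    variables["log.reference"] = row.get("reference", "")
    # log.input and log.output are intentionally absent:
    # no model call has been made when testing against a dataset row.
    return variables


row = {
    "inputs": {"product_input": "Trail running shoes"},
    "messages": [{"role": "system", "content": "You write product copy."}],
    "reference": "Lightweight shoes built for rough terrain.",
}
variables = dataset_prompt_variables(row)
print(sorted(variables))
```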

## Python Evaluator

Python Evaluators let you write custom **Python code** for maximum flexibility: from simple validations (regex, length checks) to complex analyses (statistical scoring, custom algorithms).

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    In a [Project](/docs/projects/overview) or folder, click the <kbd><Icon icon="plus" /></kbd> button and select **Python Evaluator**. You are taken to the code editor. Your evaluation function has access to the following fields from the evaluated model's log:

    * `log["input"]` `<str>`: the last message sent to generate the output
    * `log["output"]` `<str>`: the generated response from the model
    * `log["reference"]` `<str>`: the reference used to compare the output
    * `log["messages"]` `list<str>`: all previous messages sent to the model
    * `log["retrievals"]` `list<str>`: all [Knowledge Base](/docs/knowledge/overview) retrievals

    The evaluator can return two response types:

    * **Number**: return a numeric score
    * **Boolean**: return a true/false value

    For example, the following evaluator returns the absolute difference in length between the output and the reference:

    ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
    def evaluate(log):
        output_size = len(log["output"])
        reference_size = len(log["reference"])
        return abs(output_size - reference_size)
    ```

    <Info>
      You can define multiple functions in the code editor. The last function defined is the entry point when the evaluator runs.
    </Info>
  </Tab>

  <Tab title="API & SDK" icon="code">
    Use the [Create an Evaluator API](/reference/evaluators/create-an-evaluator). The code is sent as a JSON string, so encode newlines as `\n` and escape double quotes as `\"`.

    ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
    curl --request POST \
         --url https://api.orq.ai/v2/evaluators \
         --header 'accept: application/json' \
         --header 'authorization: Bearer ORQ_API_KEY' \
         --header 'content-type: application/json' \
         --data '{
      "type": "python_eval",
      "path": "Default/Evaluators",
      "key": "MyEvaluator",
      "code": "def evaluate(log):\n  output_size = len(log[\"output\"])\n  reference_size = len(log[\"reference\"])\n  return abs(output_size - reference_size)\n",
      "guardrail_config": {
        "enabled": true,
        "type": "number",
        "value": 10,
        "operator": "lte"
      }
    }'
    ```
  </Tab>

  <Tab title="MCP" icon="https://mintcdn.com/orqai/i7ZhKI7LFRfXU7ox/images/logos/mcp.svg?fit=max&auto=format&n=i7ZhKI7LFRfXU7ox&q=85&s=cef7916eb5fe1f6bb97541398d3f7639" width="16" height="16" data-path="images/logos/mcp.svg">
    **Retrieve a Python evaluator's configuration:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Show me the current configuration for the "json-validator" evaluator
    ```

    The assistant uses `search_entities` to resolve the evaluator ID, then `get_python_eval` to retrieve the full configuration including code and output type.

    ***

    **Create a Python evaluator:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create a Python evaluator that checks whether the response contains a valid JSON object
    ```

    The assistant writes a Python snippet that parses the response and validates JSON structure, then uses `create_python_eval` to register it in your workspace.

    ***

    **Update a Python evaluator:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Update the "json-validator" evaluator to also check that the JSON contains a "status" field
    ```

    The assistant uses `search_entities` to find the evaluator, then `update_python_eval` with the updated `code`.
  </Tab>
</Tabs>
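Because a Python evaluator is a plain function over a `log` dictionary, it can be exercised locally before it is pasted into the code editor. The sketch below assumes the field names listed in the AI Studio tab; the `word_count` helper, the 20% tolerance, and the sample `log` values are made up for illustration. The entry-point function is defined last, in line with the note about multiple functions.

```python
def word_count(text):
    # Helper: count whitespace-separated words.
    return len(text.split())


def evaluate(log):
    # Entry point (defined last): True when the output length is within
    # 20% of the reference length, measured in words.
    reference_words = word_count(log["reference"])
    if reference_words == 0:
        return False
    difference = abs(word_count(log["output"]) - reference_words)
    return difference / reference_words <= 0.2


# Hand-built log for a local check; the values are illustrative.
sample_log = {
    "input": "Summarize the return policy.",
    "output": "Items can be returned within 30 days with a receipt.",
    "reference": "Returns are accepted within 30 days if you have a receipt.",
    "messages": [],
    "retrievals": [],
}
print(evaluate(sample_log))
```

Returning a boolean here means the guardrail pass condition would be a choice between **True** and **False** rather than a numeric threshold.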

### Environment and Libraries

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    The Python Evaluator runs in **Python 3.12** with the following preloaded libraries:

    ```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
    numpy==2.4.4
    nltk==3.9.4
    requests
    pydantic
    json
    re
    ```
  </Tab>
</Tabs>
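As an example of working within this environment, the sketch below uses only `re` from the list above to implement a regex check. The email-like pattern is deliberately simplified and is an assumption for illustration, not a complete PII detector.

```python
def evaluate(log):
    import re  # imported inside the function, as in the other examples on this page

    # Simplified email-like pattern, for illustration only.
    email_like = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    # Boolean evaluator: True when no email-like string appears in the output.
    return email_like.search(log["output"]) is None


print(evaluate({"output": "Contact support through the help portal."}))
print(evaluate({"output": "Write to jane.doe@example.com for a refund."}))
```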

### Guardrail Configuration

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Within a [Deployment](/docs/deployments/overview) or [Agent](/docs/agents/overview), use the Python Evaluator as a Guardrail to block generations that fail your custom evaluation logic.

    Use the **Pass condition** to define when the guardrail passes:

    * **Boolean evaluators**: select **True** or **False**. The guardrail passes when your function returns the selected value.
    * **Number evaluators**: enter a score threshold. The guardrail passes when your function's return value is greater than or equal to the threshold.

    Any evaluator created in **Orq.ai**, whether LLM or Python, can be attached as a guardrail in a [Deployment](/docs/deployments/creating#evaluators-and-guardrails) or [Agent](/docs/agents/build#configure-evaluators-and-guardrails). Only the **Pass condition** needs to be set.
  </Tab>
</Tabs>

### Examples

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    <AccordionGroup>
      <Accordion title="Checking an output survives an external API round-trip" icon="globe">
        Use the `requests` package to send the output to an external endpoint and confirm it comes back unchanged. Return `True` only when the call succeeds and the echoed text matches the output.

        ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
        def evaluate(log):
            import requests
            try:
                r = requests.post(
                    "https://httpbin.org/post",
                    json={"text": log["output"]},
                    timeout=10,
                )
                r.raise_for_status()  # treat non-2xx responses as failures
                echoed = r.json()["json"]["text"]
                return echoed == log["output"]   # round-trip succeeded
            except (requests.RequestException, KeyError, TypeError, ValueError):
                return False
        ```
      </Accordion>

      <Accordion title="Validating an insurance damage assessment report against a schema" icon="brackets-curly">
        Use `pydantic` to validate that the output is JSON matching the expected damage report schema, including a confidence score between 0 and 1. Return `True` when it parses and validates, `False` otherwise.

        ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
        def evaluate(log):
            import json
            from typing import List
            from pydantic import BaseModel, field_validator, ValidationError

            class Damage(BaseModel):
                type: str
                location: str
                severity: str
                confidence: float
                description: str

                @field_validator("confidence")
                @classmethod
                def confidence_in_range(cls, v):
                    if not 0.0 <= v <= 1.0:
                        raise ValueError("confidence must be between 0 and 1")
                    return v

            class Assessment(BaseModel):
                damages: List[Damage]
                assetType: str
                observations: List[str]
                overallAssessment: str

            try:
                Assessment(**json.loads(log["output"]))
                return True
            # TypeError covers valid JSON that is not an object (e.g. a list or string)
            except (ValidationError, json.JSONDecodeError, KeyError, TypeError):
                return False
        ```
      </Accordion>
    </AccordionGroup>
  </Tab>
</Tabs>

### Testing

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Fill in the payload manually in the **Editor** by entering values for `input`, `output`, `reference`, `messages`, and `retrievals`. Every `log` field resolves against the values you enter.

    <Frame caption="Configure the payload that will be sent to the Python evaluator.">
      <img src="https://mintcdn.com/orqai/EqUGDI2og-dnTmDI/images/docs/654f199a1a1ae3287e719db4c52f61b1c1725eed4bd92fb64368ed351d8daa51-Screenshot_2025-06-27_at_11.31.08.png?fit=max&auto=format&n=EqUGDI2og-dnTmDI&q=85&s=9a06182982b2b5a27ef630ef5c2f80f6" alt="Studio Playground panel for configuring the payload sent to a Python evaluator." width="830" height="546" data-path="images/docs/654f199a1a1ae3287e719db4c52f61b1c1725eed4bd92fb64368ed351d8daa51-Screenshot_2025-06-27_at_11.31.08.png" />
    </Frame>

    Click **Run** to execute the evaluator. The result appears in the **Response** field.

    <Frame caption="A Python Evaluator test response.">
      <img src="https://mintcdn.com/orqai/8ublVIDMeb653NWy/images/docs/2892c6189ecf781ba25353fac32d5bba1d7a03ea9b04b1fd437301b80c5c2c6a-Screenshot_2025-06-27_at_11.31.10.png?fit=max&auto=format&n=8ublVIDMeb653NWy&q=85&s=9e8571c1b6449522eb9e6c464d79ab25" alt="Response field showing the result of a Python evaluator test run." width="820" height="480" data-path="images/docs/2892c6189ecf781ba25353fac32d5bba1d7a03ea9b04b1fd437301b80c5c2c6a-Screenshot_2025-06-27_at_11.31.10.png" />
    </Frame>
  </Tab>
</Tabs>

## Versions

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    When you are done editing, click <kbd className="key">Publish</kbd> to save your changes. You will be prompted to write a commit message and choose a version bump:

    <Frame caption="Publish a new version of your Evaluator.">
      <img src="https://mintcdn.com/orqai/4EPXiu89-sAKjNI7/images/evaluator-publish.png?fit=max&auto=format&n=4EPXiu89-sAKjNI7&q=85&s=dd165b40b98d2b38da312ece11fc2bea" alt="Evaluator publish" width="516" height="366" data-path="images/evaluator-publish.png" />
    </Frame>

    * **Patch** (e.g. `v1.0.0` to `v1.0.1`): small fixes, no behaviour change
    * **Minor** (e.g. `v1.0.0` to `v1.1.0`): new functionality, backwards compatible
    * **Major** (e.g. `v1.0.0` to `v2.0.0`): breaking change or significant rework

    The **Versions** tab shows the full history with author and publish timestamp for each version.

    <Frame caption="Evaluator versions.">
      <img src="https://mintcdn.com/orqai/4EPXiu89-sAKjNI7/images/evaluators-versions.png?fit=max&auto=format&n=4EPXiu89-sAKjNI7&q=85&s=29095d162b428bac42c76bc00c1ed120" alt="Evaluator versions" width="577" height="411" data-path="images/evaluators-versions.png" />
    </Frame>

    Each published version has three action buttons:

    | Action      | Icon                        | Description                                                                                     |
    | ----------- | --------------------------- | ----------------------------------------------------------------------------------------------- |
    | Compare     | <Icon icon="right-left" />  | Open a diff view to see what changed between versions                                           |
    | Code        | <Icon icon="code" />        | Load a code snippet to invoke the evaluator at this exact version                               |
    | Environment | <Icon icon="layer-group" /> | Tag the version with an [Environment](/docs/administer/environments) (e.g. production, staging) |

    <Tip>
      Reference a specific version by appending `@` and the version number: `my-evaluator@1.0.1`. Reference an environment tag directly: `my-evaluator@production`. Without a suffix, the latest published version is used.
    </Tip>
  </Tab>
</Tabs>
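The reference format described in the tip is easy to split programmatically, for example when logging which version or environment a call targeted. The `parse_evaluator_ref` helper below is hypothetical and only illustrates the `key@suffix` convention; it does not call any Orq.ai API.

```python
from typing import NamedTuple, Optional


class EvaluatorRef(NamedTuple):
    key: str
    suffix: Optional[str]  # version number or environment tag, None if absent


def parse_evaluator_ref(reference: str) -> EvaluatorRef:
    """Split 'key@suffix' into its parts (hypothetical helper)."""
    key, separator, suffix = reference.partition("@")
    # Without a suffix, the latest published version is used.
    return EvaluatorRef(key, suffix if separator else None)


print(parse_evaluator_ref("my-evaluator@1.0.1"))
print(parse_evaluator_ref("my-evaluator@production"))
print(parse_evaluator_ref("my-evaluator"))
```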

## List Evaluators

<Tabs>
  <Tab title="API & SDK" icon="code">
    Use the [List Evaluators API](/reference/evaluators/get-all-evaluators):

    <CodeGroup>
      ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
      curl --request GET \
           --url https://api.orq.ai/v2/evaluators \
           --header 'accept: application/json' \
           --header 'authorization: Bearer ORQ_API_KEY'
      ```

      ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import { Orq } from "@orq-ai/node";

      const orq = new Orq({ apiKey: process.env["ORQ_API_KEY"] ?? "" });

      const result = await orq.evals.all({});
      console.log(result);
      ```

      ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      from orq_ai_sdk import Orq
      import os

      with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
          res = orq.evals.all(limit=10)
          print(res)
      ```
    </CodeGroup>
  </Tab>
</Tabs>

## Invoke an Evaluator

<Tabs>
  <Tab title="API & SDK" icon="code">
    Fetch the evaluator ID from the [List Evaluators API](/reference/evaluators/get-all-evaluators), then pass it to the invoke endpoint. The **View Code** button on the evaluator page in AI Studio provides a pre-filled snippet.

    <Frame caption="The Invoke an Evaluator dialog provides ready-to-copy Node, Python, and cURL snippets.">
      <img src="https://mintcdn.com/orqai/apdBV0S0bHg71CI1/images/invoke-evaluator-410.png?fit=max&auto=format&n=apdBV0S0bHg71CI1&q=85&s=9f115cdc9180e34b0eabf56947f83cf8" alt="Invoke an Evaluator dialog in AI Studio with Node, Python, and cURL tabs. The Python tab shows an orq.evals.invoke call with id, query, output, reference, messages, and retrievals arguments." width="1066" height="906" data-path="images/invoke-evaluator-410.png" />
    </Frame>

    <CodeGroup>
      ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
      curl 'https://api.orq.ai/v2/evaluators/<evaluator_id>/invoke' \
      -H 'Authorization: Bearer ORQ_API_KEY' \
      -H 'Content-Type: application/json' \
      -H 'Accept: application/json' \
      --data-raw '{
          "query": "Your input text",
          "output": "Your output text",
          "reference": "Optional reference text",
          "messages": [{"role": "user", "content": "Your message"}],
          "retrievals": ["Your retrieval content"]
      }'
      ```

      ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import { Orq } from "@orq-ai/node";

      const orq = new Orq({ apiKey: process.env["ORQ_API_KEY"] ?? "" });

      const evaluation = await orq.evals.invoke({
          id: "01JN5J8W4J5JP8ZSD0TADK11GJ",
          requestBody: {
              query: "Your input text",
              output: "Your output text",
              reference: "Optional reference text",
              messages: [{ role: "user", content: "Your message" }],
              retrievals: ["Your retrieval content"]
          }
      });
      console.log(evaluation);
      ```

      ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      from orq_ai_sdk import Orq
      import os

      with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
          evaluation = orq.evals.invoke(
              id="01JN5J8W4J5JP8ZSD0TADK11GJ",
              query="Your input text",
              output="Your output text",
              reference="Optional reference text",
              messages=[{"role": "user", "content": "Your message"}],
              retrievals=["Your retrieval content"]
          )
          print(evaluation)
      ```
    </CodeGroup>
  </Tab>
</Tabs>
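
Where the SDK is not installed, the cURL request above can be reproduced with Python's standard library. The sketch below only builds the request, mirroring the URL, headers, and body fields from the cURL example. `build_invoke_request` is a hypothetical helper name, not part of the Orq.ai SDK, and the evaluator ID is the placeholder used in the SDK snippets.

```python
import json
import os
import urllib.request


def build_invoke_request(
    evaluator_id: str, api_key: str, payload: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the invoke endpoint.

    Hypothetical helper that mirrors the cURL example above.
    """
    return urllib.request.Request(
        url=f"https://api.orq.ai/v2/evaluators/{evaluator_id}/invoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_invoke_request(
    evaluator_id="01JN5J8W4J5JP8ZSD0TADK11GJ",  # placeholder ID
    api_key=os.getenv("ORQ_API_KEY", ""),
    payload={
        "query": "Your input text",
        "output": "Your output text",
        "reference": "Optional reference text",
        "messages": [{"role": "user", "content": "Your message"}],
        "retrievals": ["Your retrieval content"],
    },
)
print(request.get_method(), request.full_url)

# To send the request, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test before any network call is made.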

## Guardrail Error Response

When a guardrail evaluation fails, **Orq.ai** returns an HTTP `422 Unprocessable Entity` response. For Deployments, the response body lists every guardrail that did not pass. Agent responses omit this detail.

<Tabs>
  <Tab title="Deployments">
    ```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
    {
      "code": 422,
      "error": "Validation failed: Not all guardrails were met while validating the response.",
      "message": "Validation failed: Not all guardrails were met while validating the response.",
      "source": "system",
      "guardrails": [
        {
          "id": "01KMR75R90XDA80020YT8MHP2W",
          "status": "completed",
          "started_at": "2026-03-27T17:58:55.330Z",
          "finished_at": "2026-03-27T17:58:55.364Z",
          "related_entities": [
            {
              "type": "evaluator",
              "evaluator_id": "01KK9D8Z0JCEC1ASQJH8R28B57",
              "evaluator_metric_name": "python_evaluator"
            }
          ],
          "passed": false,
          "reason": null,
          "evaluator_type": "output_guardrail",
          "type": "boolean",
          "value": false
        }
      ]
    }
    ```

    | Field              | Type                       | Description                                                                                                            |
    | ------------------ | -------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
    | `id`               | string                     | Internal ID of the guardrail result.                                                                                   |
    | `status`           | string                     | Execution status: `"completed"` or `"failed"`.                                                                         |
    | `started_at`       | string                     | ISO 8601 timestamp when the guardrail evaluation started.                                                              |
    | `finished_at`      | string                     | ISO 8601 timestamp when the guardrail evaluation finished.                                                             |
    | `related_entities` | array                      | References to the evaluator that ran. Each entry contains `type`, `evaluator_id`, and `evaluator_metric_name`.         |
    | `passed`           | boolean                    | `false` for every entry in this error response.                                                                        |
    | `reason`           | string or null             | Explanation of the failure, when provided by the evaluator.                                                            |
    | `evaluator_type`   | string                     | `"input_guardrail"` if the guardrail ran before the model. `"output_guardrail"` if the guardrail ran after generation. |
    | `type`             | string                     | The value type returned by the evaluator: `"boolean"`, `"number"`, or `"categorical"`.                                 |
    | `value`            | boolean, number, or string | The raw value returned by the evaluator.                                                                               |
  </Tab>

  <Tab title="Agents">
    ```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
    {
      "code": 422,
      "error": "Validation failed: Not all guardrails were met while validating the messages.",
      "message": "Validation failed: Not all guardrails were met while validating the messages.",
      "source": "system"
    }
    ```

    The `guardrails` array is not included in Agent responses. Use [Traces](/docs/observability/traces) in the AI Studio to identify which guardrail failed.
  </Tab>
</Tabs>
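
As a minimal sketch of handling this error on the client side, the snippet below parses a `422` response body and collects the guardrails that did not pass. It reads only the fields documented above. The helper name `failed_guardrails` and the trimmed sample payload are illustrative assumptions, not part of the Orq.ai SDK. Because Agent responses omit the `guardrails` array, the helper returns an empty list in that case.

```python
import json
from typing import Any


def failed_guardrails(body: str) -> list[dict[str, Any]]:
    """Summarize each guardrail that did not pass.

    Hypothetical helper: it reads only fields documented in the table above.
    """
    payload = json.loads(body)
    summaries = []
    # Agent responses omit "guardrails", so fall back to an empty list.
    for guardrail in payload.get("guardrails", []):
        if guardrail.get("passed"):
            continue
        evaluator_ids = [
            entity.get("evaluator_id")
            for entity in guardrail.get("related_entities", [])
            if entity.get("type") == "evaluator"
        ]
        summaries.append(
            {
                "evaluator_ids": evaluator_ids,
                "stage": guardrail.get("evaluator_type"),
                "value": guardrail.get("value"),
                "reason": guardrail.get("reason"),
            }
        )
    return summaries


# Sample payload trimmed from the Deployments example above.
sample_body = json.dumps(
    {
        "code": 422,
        "source": "system",
        "guardrails": [
            {
                "id": "01KMR75R90XDA80020YT8MHP2W",
                "status": "completed",
                "related_entities": [
                    {
                        "type": "evaluator",
                        "evaluator_id": "01KK9D8Z0JCEC1ASQJH8R28B57",
                        "evaluator_metric_name": "python_evaluator",
                    }
                ],
                "passed": False,
                "reason": None,
                "evaluator_type": "output_guardrail",
                "type": "boolean",
                "value": False,
            }
        ],
    }
)

for failure in failed_guardrails(sample_body):
    print(failure)
```

The `stage` value distinguishes input guardrails from output guardrails, which can help decide whether to revise the request or retry the generation.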

<Info>
  **When the evaluator fails to execute:** If the evaluator itself fails to run (for example, a network error or timeout), the guardrail is silently skipped and the generation proceeds. Monitor skipped guardrail executions through [Traces](/docs/observability/traces).

  **When an LLM guardrail's underlying model fails:** If the model powering an LLM guardrail is unavailable, **Orq.ai** fails the entire request for safety. Because the guardrail could not run, **Orq.ai** cannot determine whether it would have blocked the generation.
</Info>

## Evaluatorq

**Evaluatorq** is a dedicated SDK for running evaluations programmatically. It supports parallel job execution, flexible data sources (inline, CSV, Orq datasets), and syncs results to the **Orq.ai** AI Studio.

<Tabs>
  <Tab title="API & SDK" icon="code">
    **Install:**

    <CodeGroup>
      ```bash Node.js theme={"theme":{"light":"github-light","dark":"github-dark"}}
      npm install @orq-ai/evaluatorq
      ```

      ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      pip install evaluatorq
      ```
    </CodeGroup>

    **Usage example:**

    <CodeGroup>
      ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import { evaluatorq, job } from "@orq-ai/evaluatorq";

      const textAnalyzer = job("text-analyzer", async (data) => {
          const text = data.inputs.text;
          return {
              length: text.length,
              // Split on any whitespace run, matching Python's str.split().
              wordCount: text.trim().split(/\s+/).length,
              uppercase: text.toUpperCase(),
          };
      });

      await evaluatorq("text-analysis", {
          data: [
              { inputs: { text: "Hello world" } },
              { inputs: { text: "Testing evaluation" } },
          ],
          jobs: [textAnalyzer],
          evaluators: [
              {
                  name: "length-check",
                  scorer: async ({ output }) => {
                      const passesCheck = output.length > 10;
                      return {
                          value: passesCheck ? 1 : 0,
                          explanation: passesCheck
                              ? "Output length is sufficient"
                              : `Output too short (${output.length} chars, need >10)`,
                      };
                  },
              },
          ],
      });
      ```

      ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import asyncio
      from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult

      @job("text-analyzer")
      async def text_analyzer(data: DataPoint, row: int):
          text = data.inputs["text"]
          return {
              "length": len(text),
              "word_count": len(text.split()),
              "uppercase": text.upper(),
          }

      async def length_check_scorer(params):
          output = params["output"]
          passes_check = output["length"] > 10
          return EvaluationResult(
              value=1 if passes_check else 0,
              explanation=(
                  "Output length is sufficient"
                  if passes_check
                  else f"Output too short ({output['length']} chars, need >10)"
              )
          )

      async def main():
          await evaluatorq(
              "text-analysis",
              data=[
                  DataPoint(inputs={"text": "Hello world"}),
                  DataPoint(inputs={"text": "Testing evaluation"}),
              ],
              jobs=[text_analyzer],
              evaluators=[{"name": "length-check", "scorer": length_check_scorer}],
          )

      if __name__ == "__main__":
          asyncio.run(main())
      ```
    </CodeGroup>

    <Info>
      See the [Python Evaluatorq](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq-py) and [TypeScript Evaluatorq](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq) repositories for more details.
    </Info>

    <Card title="Cookbook: Running evaluations in parallel with Evaluatorq" icon="flask" href="/docs/tutorials/evaluator-q">
      Step-by-step walkthrough comparing agent variants with parallel evaluators, including DeepEval and RAGAS integration.
    </Card>
  </Tab>
</Tabs>
