Orq.ai Documentation - AI Gateway & LLM Collaboration Platform

Experiments run model generations across a Dataset and record Latency, Cost, and Time to First Token for each generation. Results can be reviewed manually or scored automatically with Evaluators and Human Reviews. For code-driven experiments, Orq.ai provides the evaluatorq framework to define jobs, evaluators, and data sources programmatically and sync results back to the AI Studio.

Use Cases

Compare models side by side

Run the same dataset through multiple models to compare output quality, cost, and latency. Works for newly released models, fine-tuned models, and private models added to the AI Gateway.

Optimise prompts

Test multiple prompt variants on the same dataset. Use evaluators like Cosine Similarity to quantitatively assess which version produces the best results.

Pre-deployment and regression testing

Run experiments against your current prompt configuration before shipping changes. Use historical datasets to verify that updates haven’t degraded performance in any area.

Security and red teaming

Test how your model responds to jailbreak attempts and adversarial inputs in a controlled environment before putting it into production.

Prerequisites

Dataset

A Dataset with Inputs, Messages, and/or Expected Outputs

AI Gateway

Models added to the AI Gateway

API Key

An API Key (API and MCP only)

Create an Experiment

AI Studio
API & SDK
MCP

In the AI Studio, choose a Project and folder, click the button, and select Experiment.Select a Dataset and one or more models, then click Create. Use the search field to find datasets quickly.You are taken to the Experiment Studio where you configure data entries and tasks before running.

Use the evaluatorq framework to run experiments from code.Install:

pip install orq-ai-sdk
pip install evaluatorq

Configure environment:

export ORQ_API_KEY="your-api-key"
export ORQ_ENV="production"
export ORQ_EVALUATOR_ID="your-evaluator-ulid"  # optional

ORQ_API_KEY is required to invoke Deployments and Agents, run Evaluators, and sync results to the Orq.ai UI. Without it, experiments run locally only.

Define your data. Choose one of three approaches:

Reference an existing Dataset (recommended)

from evaluatorq import DatasetIdInput

dataset_id = "01ARZ3NDEKTSV4RRFFQ69G5FAV"
# Pass DatasetIdInput directly to evaluatorq in the Run step

Load from CSV or JSON

import csv, json
from evaluatorq import DataPoint

with open("test_data.csv", "r") as f:
    test_data = [DataPoint(inputs=row) for row in csv.DictReader(f)]

# or from JSON
with open("test_data.json", "r") as f:
    test_data = [DataPoint(inputs=item) for item in json.load(f)]

Define inline

from evaluatorq import DataPoint

test_data = [
    DataPoint(inputs={"text": "Cinderella tells the story of a kind young woman..."}),
    DataPoint(inputs={"text": "Little Red Riding Hood follows a girl traveling..."}),
]

See the evaluatorq Tutorial for advanced patterns including third-party framework integration and CI/CD setup.

Create an experiment from an existing dataset:

Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 using the "user-queries" dataset

The assistant uses search_entities to find the dataset, then create_experiment with two model configurations and auto_run enabled.

Compare two prompt strategies:

Create an experiment using the "customer-feedback" dataset with two prompts: one focused on empathy and one on brevity. Run it and summarize the results.

The assistant uses create_experiment with two prompt variants and auto_run enabled, then get_experiment_run to retrieve and summarise the evaluation metrics.

Configure Tasks

AI Studio
API & SDK

The left side of the Experiment table shows the loaded Dataset entries. Each row runs separately against each configured task.Add new test rows with the Add Row button. Edit Inputs, Messages, and Expected Outputs by selecting any cell.

Columns can be reorganised and hidden using the menu.

CS_demo experiment grid in Draft state showing Inputs, Messages, Expected Output, and Response columns with gpt-4o and claude-3-5-sonnet variants and 10 dataset rows.

To add a task, open the sidebar and select +Task:

Configure a Model

Select a model to open the Prompt panel. Configure the prompt template using:

The Messages column from the dataset.
A configured Prompt.
A combination of both.

Experiment view with the Prompt panel open on the right, showing model settings for gpt-4.2 including temperature, max tokens, and messaging column configuration.

To learn more about Prompt Template configuration, see Creating a Prompt.

Configure an Agent

Choose an Agent from the +Task menu. Its configuration is automatically loaded as a new column.The agent prompt can use:

Instructions + Messages only.
Instructions + Dataset Messages column.

Experiment view with the Agent panel open on the right, showing the bank_creditcard_agent_gpt_4.2 agent with instructions for Dutch Royal Bank Credit Card Support.

To learn more about Agent configuration, see Build Agents.

Define jobs using the @job decorator (Python) or job() function (TypeScript). Each job defines one variant to test.

import asyncio, os
from evaluatorq import job, DataPoint
from orq_ai_sdk import Orq

orq_client = Orq(
    api_key=os.getenv("ORQ_API_KEY"),
    server_url=os.getenv("ORQ_SERVER_URL", "https://my.orq.ai")
)

def extract_response_text(response):
    if hasattr(response, "output") and response.output:
        if isinstance(response.output, list) and len(response.output) > 0:
            part = response.output[0]
            if hasattr(part, "parts") and part.parts:
                return part.parts[0].text if hasattr(part.parts[0], "text") else str(part.parts[0])
    if hasattr(response, "content"):
        if isinstance(response.content, list):
            return " ".join(part.text if hasattr(part, "text") else str(part) for part in response.content)
        return str(response.content)
    return str(response)

@job("summarize-variant-a")
async def summarize_variant_a(data: DataPoint, row: int):
    response = await asyncio.to_thread(
        orq_client.deployments.invoke,
        key="summarization_v2",
        context={"environments": [], "reasoning": ["minimal"]},
        inputs={"text": data.inputs["text"]},
    )
    return {"variant": "variant-a", "input": data.inputs["text"], "summary": extract_response_text(response)}

@job("summarize-variant-b")
async def summarize_variant_b(data: DataPoint, row: int):
    response = await asyncio.to_thread(
        orq_client.deployments.invoke,
        key="summarization_v2",
        context={"environments": [], "reasoning": ["medium"]},
        inputs={"text": data.inputs["text"]},
    )
    return {"variant": "variant-b", "input": data.inputs["text"], "summary": extract_response_text(response)}

Jobs can invoke Deployments, Agents, or Prompts. Third-party frameworks (LangGraph, CrewAI, LlamaIndex, AutoGen) can be integrated to compare against Orq features side-by-side.

Variables and Prompt Templating

AI Studio

Reference dataset inputs in your prompt using {{variable_name}}. Values come from the Inputs column and are substituted per row when the experiment runs.Select the Template Engine from the Prompt Settings panel:

Text (default): {{double_braces}} syntax.
Jinja: conditionals, loops, filters, and more.
Mustache: logic-less templating with sections.

Engine dropdown in the Prompt panel with Jinja selected and options for Text, Jinja, and Mustache.

Jinja
Mustache

Prompt template

You are a support assistant for {{company_name}}.

{% if user_tier == "premium" %}
{{customer_name}} is a premium customer. Greet them by name with priority support and a 2-hour SLA.
{% else %}
{{customer_name}} is on the free plan. Standard response time is 24 hours.
{% endif %}

Dataset inputs

{ "company_name": "Acme", "customer_name": "Sarah", "user_tier": "premium" }

Rendered prompt

You are a support assistant for Acme.

Sarah is a premium customer. Greet them by name with priority support and a 2-hour SLA.

Prompt template

You are a support assistant for {{company_name}}.

{{# is_premium}}
{{customer_name}} is a premium customer. Priority support with a 2-hour SLA.
{{/ is_premium}}
{{^ is_premium}}
{{customer_name}} is on the free plan. Standard response time is 24 hours.
{{/ is_premium}}

Dataset inputs

{ "company_name": "Acme", "customer_name": "Sarah", "is_premium": true }

Rendered prompt

You are a support assistant for Acme.

Sarah is a premium customer. Priority support with a 2-hour SLA.

For a complete reference of template features, see Prompt Templating.

Tool Calls for Agents

AI Studio

When using agents, attach executable tools that run in real-time during the experiment. These perform actual operations (HTTP requests, Python code, MCP calls).

Open the agent configuration panel.
Select Add Tool in the Tools section.
Choose from available tools in your project.

See Build Agents for full tool configuration options.

Tool Calls for Prompts (Historical Testing)

AI Studio

Add a historical Tool Call chain to a model’s execution to test how it handles specific tool payloads or error scenarios.

These tool calls are simulated and do not execute. They provide historical context to test function calling behaviour. For real executable tools, use Tool Calls for Agents above.

Use the button to add a tool call to any message. Configure:

Function Name: which tool was called.
Input: the payload sent to the tool.
Output: the response the tool returned.

Configure Evaluators

AI Studio
API & SDK
MCP

To add an Evaluator, go to the right of the Experiment table and select Add new Column > Evaluator.The panel shows all Evaluators available in the current Project. Enable the toggle to add an Evaluator as a new column.

Evaluators selection panel showing available evaluators including Contains Any, Contains None, Context Recall, Cosine Similarity, demo-evaluator, demo-json, Fact Checking Knowledge Base, and Factchecker with toggle controls.

To add Evaluators to your project, see Evaluators. Import from the Hub or create a custom LLM Evaluator.

Define evaluators as async functions that return an EvaluationResult with a score (0.0 to 1.0) and an explanation.

Local evaluator

from evaluatorq import EvaluationResult

async def word_count_scorer(params):
    word_count = len(params["output"].get("summary", "").split())
    if word_count >= 10:
        return EvaluationResult(value=1.0, explanation=f"Sufficient ({word_count} words)")
    elif word_count >= 5:
        return EvaluationResult(value=0.5, explanation=f"Partial ({word_count} words)")
    else:
        return EvaluationResult(value=0.0, explanation=f"Too short ({word_count} words)")

Orq Evaluator

import asyncio, os
from evaluatorq import EvaluationResult

EVAL_ID = os.environ.get("ORQ_EVALUATOR_ID", "your-evaluator-id")

async def summarization_quality_scorer(params):
    data, output = params["data"], params["output"]
    source_text = (data.inputs.get("text") or "").strip()
    summary = (output.get("summary") or "").strip()
    if not summary or not source_text:
        return EvaluationResult(value=0.0, explanation="Missing source or summary")
    evaluation = await asyncio.to_thread(
        orq_client.evals.invoke,
        id=EVAL_ID, query=source_text, output=summary,
        reference=None, messages=[], retrievals=[],
    )
    return EvaluationResult(value=float(evaluation.value.value), explanation=str(evaluation.value.explanation or ""))

Third-party evaluator (DeepEval)

from evaluatorq import EvaluationResult

async def deepeval_relevancy_scorer(params):
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase
    source_text = (params["data"].inputs.get("text") or "").strip()
    summary = (params["output"].get("summary") or "").strip()
    if not summary or not source_text:
        return EvaluationResult(value=0.0, explanation="Missing source or summary")
    metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(input=source_text, actual_output=summary)
    result = await asyncio.to_thread(metric.measure, test_case)
    return EvaluationResult(value=float(result.score), explanation=f"DeepEval relevancy: {result.score:.2f}")

See the evaluatorq Tutorial for more evaluator patterns including Ragas and other frameworks.

Create an experiment with an evaluator:

Create an experiment from the "qa-dataset" dataset with the "tone-scorer" evaluator attached

The assistant uses search_entities to find the dataset and evaluator, then create_experiment with both the dataset ID and evaluator ID, with auto_run enabled.

Create an evaluator first, then run an experiment:

Create an LLM-as-a-Judge evaluator that scores responses on tone, then run an experiment on the "customer-feedback" dataset using that evaluator

The assistant uses create_llm_eval to create the evaluator, then create_experiment with the returned evaluator key.

Human Reviews

AI Studio

To add a Human Review column, find the Human Review panel and select Add Human Review.

Experiment grid with a Select Feedback dialog open showing Good and Bad options with an explanation field, and Bad selected with the note Could've offered a link to relevant documentation.

To learn more, see Human Reviews.

Run an Experiment

AI Studio
API & SDK
MCP

Click the Run button to start the experiment. Depending on the dataset size, all generations may take a few minutes to complete. The status changes to Completed when done.

To start a new iteration with different prompts or data, use the New Run button. A new Experiment Run is created in Draft state.

Pass your data, jobs, and evaluators to evaluatorq():

import asyncio
from evaluatorq import evaluatorq, DatasetIdInput

async def main():
    await evaluatorq(
        "compare-summarization-variants",
        data=DatasetIdInput(dataset_id="01ARZ3NDEKTSV4RRFFQ69G5FAV"),
        jobs=[summarize_variant_a, summarize_variant_b],
        evaluators=[
            {"name": "word-count", "scorer": word_count_scorer},
            {"name": "quality", "scorer": summarization_quality_scorer},
        ],
    )

if __name__ == "__main__":
    asyncio.run(main())

Once complete, evaluatorq prints a summary table in the terminal and a URL to the results in the Orq.ai AI Studio.

Terminal output showing an evaluation completed summary with 3 total data points, quality and word-count evaluator scores, and a link to view results in Orq.ai.

Add evaluators from the UI after a code run:Once the experiment completes, attach evaluators and re-run evaluations directly in the AI Studio without touching code. Use the Evaluator button to attach any evaluator and trigger a new evaluation pass.

vercel-multi-agent-eval experiment Run view showing Tasks (research-agent, math-agent) and Evaluators (city-relevance, correctness, quality-rubric, tool-usage, length_less_than_uqrv, llm_evaluator_tmmq) in the left panel.

evaluatorq Tutorial

Advanced patterns: comparing Deployments and Agents, third-party framework integration, multi-job workflows, CI/CD integration.

Red Teaming LLMs with evaluatorq

Probing LLM deployments and agents for security vulnerabilities using the evaluatorq red teaming CLI.

Run an experiment with auto-run enabled:

Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 on the "user-queries" dataset and run it automatically

The assistant uses create_experiment with auto_run: true and returns the experiment ID once both configurations have run.

List recent runs:

Show me the latest experiment runs in my workspace

The assistant uses list_experiment_runs with cursor pagination to retrieve recent runs.

Evaluation-Only Mode

AI Studio

To score existing responses in your dataset without generating new outputs:

Set up the experiment with a dataset that already contains responses in the Messages column.
Do not select a prompt during setup.
Add your evaluators.
Run the experiment.

Run a Single Prompt

AI Studio

To run one task against the existing dataset without re-running everything, click next to the task and choose Run.

Context menu on the gpt-5-mini column header showing options: Run, Settings, Duplicate, Hide Column, and Delete.

Partial Runs

AI Studio

Hover on a single cell and click to re-run that row only.

Select Partial Run from the Run menu to re-run all cells that are in Error or have not been run yet.

Add Evaluators After Running

AI Studio

Add extra Evaluators or Human Reviews to an already-completed run. Use the drop-down on the Evaluator column to run only the newly added evaluations without re-running model generations.

View Results

AI Studio
API & SDK
MCP

Once the experiment status changes to Completed, open the Review tab.The Review tab has two views:

Review: inspect each model output individually.
Compare: view multiple model outputs side by side.

Results sync to the Orq.ai AI Studio automatically when ORQ_API_KEY is set. The framework prints the experiment URL at the end of the run.

compare-summarization experiment Run #5 showing Tasks (summarize-variant-a, summarize-variant-b) with input texts and quality evaluator scores for each row.

LangGraph and Vercel AI SDK agent executions are fully visualised in the UI, including individual steps and tool invocations.

vercel-multi-agent-eval experiment Review showing Job 1 of 8 for research-agent with a knowledgeBase tool call using topic population of Tokyo 2023, and evaluator scores in the Feedback panel.

Export results:

Export the latest experiment run as CSV

The assistant uses list_experiment_runs to find the most recent run, then get_experiment_run with CSV export format and returns a signed download URL.

Get results for a specific run:

Show me the results for experiment run ID "01ARZ3NDEKTSV4RRFFQ69G5FAV"

The assistant uses get_experiment_run to retrieve the full run including all evaluation scores.

Column Result Overview

AI Studio

Each response column shows an aggregated summary at the top: average evaluator score, latency, and cost across all rows.

Experiment results grid showing gpt-4o and basic_translator variant columns with a tooltip over gpt-4o showing Pass Rate 33%, Avg. Latency 2,354ms, Avg. Cost $0.00218, Input Tokens 2,376, and Total Tokens 3,090.

Compare Mode

AI Studio

Visualise multiple model executions side by side. Variables and Expected Outputs are shown on the left. Evaluator scores appear at the bottom. Human Reviews can be applied here too, from the controls at the bottom of the screen.

Side-by-side experiment comparison with two model columns, openai-gpt-5 and parent-agent, each showing System, User, and Assistant messages, and the Question and Expected output on the left.

Tool Call History

AI Studio

When reviewing a model execution, see the step-by-step tool call history including payloads sent and responses received.

See the model interpretation and reasoning around each tool call.

Multiple Runs

AI Studio

Use the Runs tab to see all previous runs for an experiment and compare Evaluator results across runs at a glance.

Runs tab for a New experiment showing a table with Status, Prompt, Cosine Similarity, JSON Schema Evaluator, Run, Creator, and Added columns, listing two Completed runs using gpt-4.1.

Export Results

AI Studio

Experiment context menu showing Edit, Duplicate, Share, Export with CSV, JSON, and JSON Lines options, Move to, and Delete.

The exported file contains: datasets, model configuration, responses, metrics (including Time to First Token), and Human Reviews.

CSV export table showing experiment log rows with timestamp, status, model, template, context, reference, and llm_response columns for gpt-3.5-turbo and meta-llama models answering questions about historical figures.

Review Results

Once an experiment completes, open the Review tab in the experiment top nav. It offers two views.

Review Mode
Compare Mode

Open Review to step through every response one at a time.

Experiment Review screen showing Response 1 of 20 for product-orchestrator-A. Left panel: Inputs (prompt, should_trigger, scenario, expectations), Expected output, and Metrics (Latency 79,231 ms, TTFT 1,032 ms, Cost $0.13991, token counts). Center panel: System instructions, User input, and Assistant output with function calls. Right panel: an Annotations comment and good/bad rating above an Evaluators section listing a json_check evaluator marked No.

The screen is divided into three panels:

Left: Inputs, Expected output, and Metrics (Latency, , Cost, token counts).
Center: the full conversation for the selected entry: the User message, the Assistant response, and the tool calls made by the agent.
Right: the Annotations panel with Human Review controls for manual annotation, above the Evaluator scores.

Use / or J / K to step through responses with the keyboard.

Duplicate an Experiment

AI Studio

To duplicate an experiment with all its configuration (dataset, prompts, evaluators):

Open the experiment.
Click in the top-right corner.
Select Duplicate.
Provide a new name and click Confirm.

​Use Cases

​Prerequisites

Dataset

AI Gateway

API Key

​Create an Experiment

​Configure Tasks

​Variables and Prompt Templating

​Tool Calls for Agents

​Tool Calls for Prompts (Historical Testing)

​Configure Evaluators

​Human Reviews

​Run an Experiment

evaluatorq Tutorial

Red Teaming LLMs with evaluatorq

​Evaluation-Only Mode

​Run a Single Prompt

​Partial Runs

​Add Evaluators After Running

​View Results

​Column Result Overview

​Compare Mode

​Tool Call History

​Multiple Runs

​Export Results

​Review Results

​Duplicate an Experiment

Use Cases

Prerequisites

Create an Experiment

Configure Tasks

Variables and Prompt Templating

Tool Calls for Agents

Tool Calls for Prompts (Historical Testing)

Configure Evaluators

Human Reviews

Run an Experiment

Evaluation-Only Mode

Run a Single Prompt

Partial Runs

Add Evaluators After Running

View Results

Column Result Overview

Compare Mode

Tool Call History

Multiple Runs

Export Results

Review Results

Duplicate an Experiment