

Experiments run model generations across a Dataset and record Latency, Cost, and Time to First Token for each generation. Results can be reviewed manually or scored automatically with Evaluators and Human Reviews. For code-driven experiments, Orq.ai provides the evaluatorq framework to define jobs, evaluators, and data sources programmatically and sync results back to the AI Studio.

Use Cases

  • Run the same dataset through multiple models to compare output quality, cost, and latency. This works for newly released models, fine-tuned models, and private models added to the AI Router.
  • Test multiple prompt variants on the same dataset. Use evaluators like Cosine Similarity to quantitatively assess which version produces the best results.
  • Run experiments against your current prompt configuration before shipping changes. Use historical datasets to verify that updates haven’t degraded performance in any area.
  • Test how your model responds to jailbreak attempts and adversarial inputs in a controlled environment before putting it into production.

Prerequisites

  • Dataset: a Dataset with Inputs, Messages, and/or Expected Outputs.
  • AI Router: models added to the AI Router.
  • API Key: an API Key (API and MCP only).

Create an Experiment

  1. In the AI Studio, choose a Project and folder, click the + button, and select Experiment.
  2. Select a Dataset and one or more models, then click Create. Use the search field to find datasets quickly.
  3. You are taken to the Experiment Studio, where you configure data entries and tasks before running.

Configure Tasks

The left side of the Experiment table shows the loaded Dataset entries. Each row runs separately against each configured task.
Add new test rows with the Add Row button. Edit Inputs, Messages, and Expected Outputs by selecting any cell.
Columns can be reorganised and hidden using the menu.
To add a task, open the sidebar and select +Task:
Select a model to open the Prompt panel. Configure the prompt template using:
  • The Messages column from the dataset.
  • A configured Prompt.
  • A combination of both.
To learn more about Prompt Template configuration, see Creating a Prompt.
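For illustration, the combined case could produce a final message list like the following for a single row (a hypothetical composition; the actual merge is configured in the Prompt panel):

# Hypothetical composition of one task's messages for a single dataset row:
# the system message comes from the configured Prompt, the remaining messages
# come from the dataset's Messages column.
task_messages = [
    {"role": "system", "content": "You are a support assistant for {{company_name}}."},  # configured Prompt
    {"role": "user", "content": "Hi, I still have not received my invoice."},            # dataset Messages column
]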
Choose an Agent from the +Task menu. Its configuration is automatically loaded as a new column.
The agent prompt can use:
  • Instructions + Messages only.
  • Instructions + Dataset Messages column.
To learn more about Agent configuration, see Build Agents.

Variables and Prompt Templating

Reference dataset inputs in your prompt using {{variable_name}}. Values come from the Inputs column and are substituted per row when the experiment runs.
Select the Template Engine from the Prompt Settings panel:
  • Text (default): {{double_braces}} syntax.
  • Jinja: conditionals, loops, filters, and more.
  • Mustache: logic-less templating with sections.
For example, with the Jinja engine selected:

  1. Prompt template

You are a support assistant for {{company_name}}.

{% if user_tier == "premium" %}
{{customer_name}} is a premium customer. Greet them by name with priority support and a 2-hour SLA.
{% else %}
{{customer_name}} is on the free plan. Standard response time is 24 hours.
{% endif %}

  2. Dataset inputs

{ "company_name": "Acme", "customer_name": "Sarah", "user_tier": "premium" }

  3. Rendered prompt

You are a support assistant for Acme.

Sarah is a premium customer. Greet them by name with priority support and a 2-hour SLA.
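To preview how a Jinja template renders outside the Studio, the same substitution can be reproduced with the jinja2 Python package (a local sketch for convenience; Orq.ai renders templates server-side, and this assumes its Jinja engine follows standard Jinja semantics):

# Local preview of the prompt template above using standard Jinja (pip install jinja2).
from jinja2 import Template

prompt_template = """You are a support assistant for {{company_name}}.

{% if user_tier == "premium" %}
{{customer_name}} is a premium customer. Greet them by name with priority support and a 2-hour SLA.
{% else %}
{{customer_name}} is on the free plan. Standard response time is 24 hours.
{% endif %}"""

row_inputs = {"company_name": "Acme", "customer_name": "Sarah", "user_tier": "premium"}

# Substitutes the row's Inputs into the template, mirroring what happens per dataset row.
print(Template(prompt_template).render(**row_inputs))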
For a complete reference of template features, see Prompt Templating.

Tool Calls for Agents

When using agents, attach executable tools that run in real time during the experiment. These perform actual operations (HTTP requests, Python code, MCP calls).
  1. Open the agent configuration panel.
  2. Select Add Tool in the Tools section.
  3. Choose from available tools in your project.
See Build Agents for full tool configuration options.
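To give a sense of what such a tool does at runtime, here is a minimal HTTP-request tool written as a plain Python function (a hypothetical sketch; the endpoint and function name are placeholders, and the real tool is configured in the agent panel as described above):

import json
import urllib.request

def get_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order's status via an internal HTTP API."""
    url = f"https://api.example.com/orders/{order_id}"  # placeholder endpoint
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # Return a compact JSON string so the agent can incorporate it into its reply.
    return json.dumps({"order_id": order_id, "status": payload.get("status")})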

Tool Calls for Prompts (Historical Testing)

Add a historical Tool Call chain to a model’s execution to test how it handles specific tool payloads or error scenarios.
These tool calls are simulated and do not execute. They provide historical context to test function calling behaviour. For real executable tools, use Tool Calls for Agents above.
Use the button to add a tool call to any message. Configure the following (a sketch of one complete entry follows this list):
  • Function Name: which tool was called.
  • Input: the payload sent to the tool.
  • Output: the response the tool returned.
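Conceptually, one simulated tool call carries those three pieces of information. A sketch of such an entry (the field names are illustrative, not an export schema):

# Illustrative shape of one historical (simulated) tool call configured on a message.
simulated_tool_call = {
    "function_name": "get_order_status",               # which tool was called
    "input": {"order_id": "ORD-1042"},                 # payload sent to the tool
    "output": {"status": "shipped", "eta": "2 days"},  # response the tool returned
}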

Configure Evaluators

To add an Evaluator, go to the right of the Experiment table and select Add new Column > Evaluator.
The panel shows all Evaluators available in the current Project. Enable the toggle to add an Evaluator as a new column.
To add Evaluators to your project, see Evaluators. Import from the Hub or create a custom LLM Evaluator.
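As a rough illustration of what a similarity-style evaluator scores, the core of a bag-of-words cosine similarity fits in a few lines of Python (a toy sketch only; this is not Orq.ai's evaluator interface, and the built-in Cosine Similarity Evaluator typically compares embeddings rather than word counts):

# Toy cosine similarity between an expected output and a generated output.
from collections import Counter
from math import sqrt

def cosine_similarity(expected: str, generated: str) -> float:
    a, b = Counter(expected.lower().split()), Counter(generated.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("The order has shipped", "Your order has been shipped"))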

Human Reviews

To add a Human Review column, find the Human Review panel and select Add Human Review.
To learn more, see Human Reviews.

Run an Experiment

Click the Run button to start the experiment. Depending on the dataset size, all generations may take a few minutes to complete. The status changes to Completed when done.
To start a new iteration with different prompts or data, use the New Run button. A new Experiment Run is created in Draft state.

Evaluation-Only Mode

To score existing responses in your dataset without generating new outputs (an example row is sketched after these steps):
  1. Set up the experiment with a dataset that already contains responses in the Messages column.
  2. Do not select a prompt during setup.
  3. Add your evaluators.
  4. Run the experiment.
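For example, a dataset entry that already carries the response to be scored might look like this (a hypothetical row; the exact structure follows your dataset's columns):

# Illustrative dataset row for evaluation-only mode: the assistant message is the
# pre-existing response that the evaluators will score; no new generation runs.
existing_row = {
    "inputs": {"company_name": "Acme"},
    "messages": [
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Your order shipped yesterday and should arrive within 2 days."},
    ],
    "expected_output": "Order has shipped; delivery within 2 days.",
}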

Run a Single Prompt

To run one task against the existing dataset without re-running everything, click next to the task and choose Run.

Partial Runs

Hover on a single cell and click to re-run that row only.
Select Partial Run from the Run menu to re-run all cells that are in Error or have not been run yet.

Add Evaluators After Running

Add extra Evaluators or Human Reviews to an already-completed run. Use the drop-down on the Evaluator column to run only the newly added evaluations without re-running model generations.

View Results

Once the experiment status changes to Completed, open the Review tab.
The Review tab has two views:
  • Review: inspect each model output individually.
  • Compare: view multiple model outputs side by side.

Column Result Overview

Each response column shows an aggregated summary at the top: average evaluator score, latency, and cost across all rows.

Review Mode

The Review mode shows each output individually with:
  • Inputs and Outputs: full conversation context with system prompts, user messages, and model responses.
  • Metrics: latency, TTFT, token usage breakdown, cost, model details, streaming status.
  • Human Review and Feedback: rate and annotate outputs.
  • Defects and Evaluators: automated evaluation results.
Use / or J/K to navigate between responses.
Annotations and Human Reviews can only be added in Review mode; Compare mode is read-only.

Compare Mode

Visualise multiple model executions side by side. Variables and Expected Outputs are shown on the left. Evaluator scores appear at the bottom.

Tool Call History

When reviewing a model execution, you can see the step-by-step tool call history, including the payloads sent and the responses received.

Multiple Runs

Use the Runs tab to see all previous runs for an experiment and compare Evaluator results across runs at a glance.

Export Results

The exported file contains: datasets, model configuration, responses, metrics (including Time to First Token), and Human Reviews.

Duplicate an Experiment

To duplicate an experiment with all its configuration (dataset, prompts, evaluators):
  1. Open the experiment.
  2. Click in the top-right corner.
  3. Select Duplicate.
  4. Provide a new name and click Confirm.