Documentation Index

Fetch the complete documentation index at: https://docs.orq.ai/llms.txt

Use this file to discover all available pages before exploring further.

Experiments run model generations across a Dataset and record Latency, Cost, and Time to First Token for each generation. Results can be reviewed manually or scored automatically with Evaluators and Human Reviews. For code-driven experiments, Orq.ai provides the evaluatorq framework to define jobs, evaluators, and data sources programmatically and sync results back to the AI Studio.

Use Cases

  • Run the same dataset through multiple models to compare output quality, cost, and latency. Works for newly released models, fine-tuned models, and private models added to the AI Router.
  • Test multiple prompt variants on the same dataset. Use evaluators like Cosine Similarity to quantitatively assess which version produces the best results.
  • Run experiments against your current prompt configuration before shipping changes. Use historical datasets to verify that updates haven’t degraded performance in any area.
  • Test how your model responds to jailbreak attempts and adversarial inputs in a controlled environment before putting it into production.
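To make the Cosine Similarity use case concrete, here is a minimal sketch of what such an evaluator measures, assuming the response and expected output have already been embedded as vectors (the embedding step is omitted, and the example vectors are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a model response and an expected output.
response_vec = [0.8, 0.1, 0.6]
expected_vec = [0.7, 0.2, 0.7]
score = cosine_similarity(response_vec, expected_vec)  # close to 1.0
```

A higher score means the generated output is semantically closer to the Expected Output for that row.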

Prerequisites

Dataset

A Dataset with Inputs, Messages, and/or Expected Outputs

AI Router

Models added to the AI Router

API Key

An API Key (API and MCP only)

Create an Experiment

In the AI Studio, choose a Project and folder, click the + button, and select Experiment. Select a Dataset and one or more models, then click Create. Use the search field to find datasets quickly. You are taken to the Experiment Studio, where you configure data entries and tasks before running.

Configure Tasks

The left side of the Experiment table shows the loaded Dataset entries. Each row runs separately against each configured task. Add new test rows with the Add Row button. Edit Inputs, Messages, and Expected Outputs by selecting any cell.
Columns can be reorganised and hidden using the menu.
[Screenshot: CS_demo experiment grid in Draft state showing Inputs, Messages, Expected Output, and Response columns with gpt-4o and claude-3-5-sonnet variants and 10 dataset rows.]
To add a task, open the sidebar and select +Task:
Select a model to open the Prompt panel. Configure the prompt template using:
  • The Messages column from the dataset.
  • A configured Prompt.
  • A combination of both.
[Screenshot: Experiment view with the Prompt panel open on the right, showing model settings for gpt-4.2 including temperature, max tokens, and messaging column configuration.]
To learn more about Prompt Template configuration, see Creating a Prompt.
Choose an Agent from the +Task menu. Its configuration is automatically loaded as a new column. The agent prompt can use:
  • Instructions + Messages only.
  • Instructions + Dataset Messages column.
[Screenshot: Experiment view with the Agent panel open on the right, showing the bank_creditcard_agent_gpt_4.2 agent with instructions for Dutch Royal Bank Credit Card Support.]
To learn more about Agent configuration, see Build Agents.

Variables and Prompt Templating

Reference dataset inputs in your prompt using {{variable_name}}. Values come from the Inputs column and are substituted per row when the experiment runs. Select the Template Engine from the Prompt Settings panel:
  • Text (default): {{double_braces}} syntax.
  • Jinja: conditionals, loops, filters, and more.
  • Mustache: logic-less templating with sections.
[Screenshot: Engine dropdown in the Prompt panel with Jinja selected and options for Text, Jinja, and Mustache.]
1. Prompt template

You are a support assistant for {{company_name}}.

{% if user_tier == "premium" %}
{{customer_name}} is a premium customer. Greet them by name with priority support and a 2-hour SLA.
{% else %}
{{customer_name}} is on the free plan. Standard response time is 24 hours.
{% endif %}
2. Dataset inputs

{ "company_name": "Acme", "customer_name": "Sarah", "user_tier": "premium" }
3. Rendered prompt

You are a support assistant for Acme.

Sarah is a premium customer. Greet them by name with priority support and a 2-hour SLA.
For a complete reference of template features, see Prompt Templating.
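The default Text engine's {{double_braces}} substitution can be sketched in plain Python (an illustrative approximation, not Orq.ai's actual implementation):

```python
import re

def render_text(template, inputs):
    # Replace each {{variable_name}} with the matching value from the
    # row's Inputs; unknown variables are left untouched.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(inputs.get(m.group(1), m.group(0))),
        template,
    )

row = {"company_name": "Acme", "customer_name": "Sarah"}
prompt = render_text(
    "You are a support assistant for {{company_name}}. Help {{customer_name}}.",
    row,
)
# prompt == "You are a support assistant for Acme. Help Sarah."
```

The Jinja and Mustache engines layer conditionals, loops, and sections on top of the same per-row substitution idea.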

Tool Calls for Agents

When using agents, attach executable tools that run in real-time during the experiment. These perform actual operations (HTTP requests, Python code, MCP calls).
  1. Open the agent configuration panel.
  2. Select Add Tool in the Tools section.
  3. Choose from available tools in your project.
See Build Agents for full tool configuration options.

Tool Calls for Prompts (Historical Testing)

Add a historical Tool Call chain to a model’s execution to test how it handles specific tool payloads or error scenarios.
These tool calls are simulated and do not execute. They provide historical context to test function calling behaviour. For real executable tools, use Tool Calls for Agents above.
Use the button to add a tool call to any message. Configure:
  • Function Name: which tool was called.
  • Input: the payload sent to the tool.
  • Output: the response the tool returned.
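The Function Name / Input / Output triple corresponds to a tool-call message chain in the common chat-completions format. A hypothetical example (the get_weather tool and its payloads are invented for illustration and, as noted above, never actually executed):

```python
# Hypothetical historical tool-call chain in OpenAI-style chat format;
# field names are illustrative, not Orq.ai's exact schema.
messages = [
    {"role": "user", "content": "What's the weather in Amsterdam?"},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "get_weather",                 # Function Name
                "arguments": '{"city": "Amsterdam"}',  # Input
            },
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": '{"temp_c": 12, "condition": "rain"}',  # Output
    },
]
```

Feeding a chain like this to the model lets you observe how it summarises, recovers from, or escalates on a given tool result without ever calling the tool.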

Configure Evaluators

To add an Evaluator, go to the right of the Experiment table and select Add new Column > Evaluator. The panel shows all Evaluators available in the current Project. Enable the toggle to add an Evaluator as a new column.
[Screenshot: Evaluators selection panel showing available evaluators including Contains Any, Contains None, Context Recall, Cosine Similarity, demo-evaluator, demo-json, Fact Checking Knowledge Base, and Factchecker with toggle controls.]
To add Evaluators to your project, see Evaluators. Import from the Hub or create a custom LLM Evaluator.

Human Reviews

To add a Human Review column, find the Human Review panel and select Add Human Review.
[Screenshot: Experiment grid with a Select Feedback dialog open showing Good and Bad options with an explanation field; Bad is selected with the note "Could've offered a link to relevant documentation".]
To learn more, see Human Reviews.

Run an Experiment

Click the Run button to start the experiment. Depending on the dataset size, all generations may take a few minutes to complete. The status changes to Completed when done.
To start a new iteration with different prompts or data, use the New Run button. A new Experiment Run is created in Draft state.

Evaluation-Only Mode

To score existing responses in your dataset without generating new outputs:
  1. Set up the experiment with a dataset that already contains responses in the Messages column.
  2. Do not select a prompt during setup.
  3. Add your evaluators.
  4. Run the experiment.

Run a Single Prompt

To run one task against the existing dataset without re-running everything, click next to the task and choose Run.
[Screenshot: Context menu on the gpt-5-mini column header showing options: Run, Settings, Duplicate, Hide Column, and Delete.]

Partial Runs

Hover over a single cell and click to re-run that row only.
Select Partial Run from the Run menu to re-run all cells that are in Error or have not been run yet.

Add Evaluators After Running

Add extra Evaluators or Human Reviews to an already-completed run. Use the drop-down on the Evaluator column to run only the newly added evaluations without re-running model generations.

View Results

Once the experiment status changes to Completed, open the Review tab.
[Screenshot: Review tab for demo-experiment showing response 1 of 48 with a gpt-5 prompt, Feedback quality slider, HumanReview sentiment buttons, and BERT evaluator scores.]
The Review tab has two views:
  • Review: inspect each model output individually.
  • Compare: view multiple model outputs side by side.

Column Result Overview

Each response column shows an aggregated summary at the top: average evaluator score, latency, and cost across all rows.
[Screenshot: Experiment results grid showing gpt-4o and basic_translator variant columns, with a tooltip over gpt-4o showing Pass Rate 33%, Avg. Latency 2,354 ms, Avg. Cost $0.00218, Input Tokens 2,376, and Total Tokens 3,090.]
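The column summary is a straightforward aggregation over the rows in that column. A sketch with hypothetical per-row results (Orq.ai computes these figures for you):

```python
# Hypothetical per-row results for a single response column.
rows = [
    {"passed": True,  "latency_ms": 2100, "cost_usd": 0.0021},
    {"passed": False, "latency_ms": 2600, "cost_usd": 0.0019},
    {"passed": False, "latency_ms": 2350, "cost_usd": 0.0025},
]

pass_rate = sum(r["passed"] for r in rows) / len(rows)        # 1/3 ≈ 33%
avg_latency = sum(r["latency_ms"] for r in rows) / len(rows)  # 2350.0 ms
avg_cost = sum(r["cost_usd"] for r in rows) / len(rows)       # USD per row
```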

Review Mode

The Review mode shows each output individually with:
  • Inputs and Outputs: full conversation context with system prompts, user messages, and model responses.
  • Metrics: latency, TTFT, token usage breakdown, cost, model details, streaming status.
  • Human Review and Feedback: rate and annotate outputs.
  • Defects and Evaluators: automated evaluation results.
Use / or J/K to navigate between responses.
Annotations and Human Reviews can only be added in the Review tab. Compare mode is read-only.

Compare Mode

Visualise multiple model executions side by side. Variables and Expected Outputs are shown on the left. Evaluator scores appear at the bottom.

Tool Call History

When reviewing a model execution, see the step-by-step tool call history including payloads sent and responses received.
See the model interpretation and reasoning around each tool call.

Multiple Runs

Use the Runs tab to see all previous runs for an experiment and compare Evaluator results across runs at a glance.
[Screenshot: Runs tab for a New experiment showing a table with Status, Prompt, Cosine Similarity, JSON Schema Evaluator, Run, Creator, and Added columns, listing two Completed runs using gpt-4.1.]

Export Results

Open the experiment's context menu and select Export, then choose CSV, JSON, or JSON Lines.
[Screenshot: Experiment context menu showing Edit, Duplicate, Share, Export (with CSV, JSON, and JSON Lines options), Move to, and Delete.]
The exported file contains: datasets, model configuration, responses, metrics (including Time to First Token), and Human Reviews.
[Screenshot: CSV export table showing experiment log rows with timestamp, status, model, template, context, reference, and llm_response columns for gpt-3.5-turbo and meta-llama models answering questions about historical figures.]
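A CSV export can be post-processed with the standard library. The excerpt below is hypothetical and uses only a subset of the columns shown in the export screenshot; check the headers of your actual export before relying on them:

```python
import csv
import io

# Hypothetical excerpt of an exported CSV; real exports contain more
# columns (metrics, Time to First Token, Human Reviews, and so on).
export = io.StringIO(
    "model,reference,llm_response\n"
    "gpt-3.5-turbo,Ada Lovelace,Ada Lovelace wrote the first algorithm.\n"
)

rows = list(csv.DictReader(export))
models = {r["model"] for r in rows}
```

From here the rows can be grouped by model, diffed against references, or loaded into a dataframe for further analysis.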

Duplicate an Experiment

To duplicate an experiment with all its configuration (dataset, prompts, evaluators):
  1. Open the experiment.
  2. Click in the top-right corner.
  3. Select Duplicate.
  4. Provide a new name and click Confirm.