
Prerequisites

Before creating an Experiment, you need a Dataset. This dataset contains the Inputs, Messages, and Expected Outputs used for running an Experiment.
  • Inputs – Variables that can be used in the prompt message, e.g. {{firstname}}.
  • Messages – The prompt template, structured with system, user, and assistant roles.
  • Expected Outputs – Reference responses that evaluators use to compare against newly generated outputs.
To get experiments ready, make sure you have models available by adding them to your Model Garden.
You don’t need to include all three entities when uploading a dataset. Depending on your experiment, you can choose to include only inputs, messages, or expected outputs as needed. For example, you can create a dataset with just inputs.
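For illustration, the sketch below shows what a single dataset entry conceptually looks like. It is a simplified Python-style sketch; the field names (inputs, messages, expected_output) are assumptions chosen for readability, not the exact orq.ai import schema.

    # Hypothetical shape of one dataset entry (illustration only,
    # not the exact orq.ai import schema).
    entry = {
        "inputs": {"firstname": "Ada"},  # variables referenced as {{firstname}} in the prompt
        "messages": [  # the prompt template, structured by role
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short greeting for {{firstname}}."},
        ],
        "expected_output": "Hello Ada, welcome aboard!",  # reference answer for Evaluators
    }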

Creating an Experiment

To create an Experiment, head to the orq.ai Studio:
  • Choose a Project and Folder and select the + button.
  • Choose Experiment
Afterwards, select a Dataset to use as a base for your Experiment and choose one or multiple models. Use the search field to find datasets faster. You’ll then be taken to the Studio, where you can configure Dataset entries and models before running the experiment.

Configuring an Experiment

Data Entry Configuration

The left side of the table shows the loaded Dataset entries. Each entry is shown as a row and will be executed separately with each configured prompt. You can add new entries to be tested during the experiment by using the Add Row button. Each entry’s Inputs, Messages, and Expected Outputs can be edited independently by selecting a cell.
Columns can be reorganized and hidden at will; use the column configuration icon to adjust which columns are shown.

Task, Prompt and Agent Configuration

Your chosen prompts are displayed as separate columns within the Response section. Prompts are assigned a corresponding letter (A and B in the example above) to identify their performance and Evaluator results. To add a new Prompt, open the sidebar and choose +Task. Here you have two options for finding the right model configuration for your Prompt.
Select the Model you would like to use; the Prompt panel opens, where you can configure the Prompt Template.
There are 3 ways to configure your prompt:
  • Using the Messages column in the dataset.
  • Using the configured Prompt.
  • Using a combination of the configured Prompt and the Messages column.
To learn more about Prompt Template Configuration, see Creating a Prompt.
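As a rough illustration of the third option, combining a configured Prompt with the dataset’s Messages column can be pictured as appending the row’s messages to the configured template and substituting the {{input}} variables. The sketch below is an assumption about how such a merge could work conceptually, not the exact behavior of orq.ai.

    # Conceptual sketch: merge a configured Prompt template with a dataset
    # row's Messages column and substitute {{input}} variables.
    # This is an illustrative assumption, not orq.ai's actual merge logic.
    def render_request(template_messages, row):
        merged = template_messages + row["messages"]
        rendered = []
        for message in merged:
            content = message["content"]
            for name, value in row.get("inputs", {}).items():
                content = content.replace("{{" + name + "}}", str(value))
            rendered.append({"role": message["role"], "content": content})
        return rendered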
As an alternative to Prompts, your configured Agents can also be used in Experiments. Choose your Agent from the +Task menu; its configuration will be automatically loaded as a new column for the experiment.
Similar to a Model configuration, your Agent prompt can be configured:
  • Using the Instructions + Messages only.
  • Using the Instructions + Dataset Messages Column.
To learn more about Agent Prompt configuration, see our Agent Configuration Guide.

Tool Calls for Agents

When using agents in experiments, you can attach executable tools that actually run during experiment execution. Unlike historical tool calls for prompts (described below), these tools perform real operations like fetching current time, making HTTP requests, calling MCP servers, or executing Python code.
You can add tools directly from the agent experiment configuration screen by:
  1. Opening the agent configuration panel in your experiment
  2. Selecting Add Tool in the Tools section
  3. Choosing from available tools in your project
These tools execute in real-time during the experiment, providing dynamic data to your agent.
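For context, an executable tool is usually described by a name, a description, and a parameter schema, plus the code that actually runs when the agent calls it. The snippet below is a generic sketch of such a tool (a current-time lookup); the structure is an assumption and not the exact format of orq.ai project tools.

    # Generic sketch of an executable tool (illustrative assumption,
    # not the exact orq.ai project tool format).
    from datetime import datetime, timezone

    get_current_time_tool = {
        "name": "get_current_time",
        "description": "Return the current UTC time as an ISO 8601 string.",
        "parameters": {"type": "object", "properties": {}},  # no arguments
    }

    def get_current_time() -> str:
        # Unlike historical tool calls for prompts, this actually executes
        # during the experiment and returns live data to the agent.
        return datetime.now(timezone.utc).isoformat()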
To learn more about Agent Prompt configuration, see our Agent Configuration Guide.

Tool Calls for Prompts (Historical Testing)

You can add a specific historical Tool Call chain to a model’s execution to test its behavior when running into a specific tool, payload, or response.
These tool calls are simulated and do not execute. They serve as historical context to test how models handle function calling scenarios. For executable tools that run during experiments, see Tool Calls for Agents above.
These tools can be configured at will at any step of the conversation, which lets you test for the following model use-cases:
  1. Recognizes its own mistakes - Can it identify that the previous tool call had incorrect parameters?
  2. Self-corrects in context - Does it adjust its behavior when shown the wrong result?
  3. Understands conversation flow - Does adding that failure to the message history change how it reasons about the problem?
To add a Tool Call to a message, use the Add Tool Call button.
The following can be configured:
  • Tool Function Name to check whether the correct tool was called and to plan the behavioral response to errors.
  • Tool Input to simulate a correct or incorrect translation from input to payload.
  • Tool Output to verify correct handling of any tool feedback.
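To make these three fields concrete, the sketch below shows a simulated tool-call step as it might appear in the message history: an assistant message that calls the tool with a deliberately wrong payload, followed by a tool message carrying the result the model has to handle. The structure follows the common OpenAI-style message format and is an assumption, not necessarily the exact shape orq.ai stores.

    # Sketch of a historical (simulated) tool call in the conversation.
    # The structure mirrors the common OpenAI-style format and is an
    # assumption, not the exact orq.ai schema.
    simulated_tool_call = [
        {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",            # Tool Function Name
                    "arguments": '{"city": "Pari"}',  # Tool Input (intentionally incorrect)
                },
            }],
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "content": '{"error": "unknown city: Pari"}',  # Tool Output the model must react to
        },
    ]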

Configuring Evaluators

Adding Evaluators to an Experiment allows for quantitative evaluation of the model-generated outputs. Using standard scientific methods or custom LLM-based evaluations, you can automate the scoring of models to quickly detect whether they fit a predefined hypothesis and how they compare to one another. Within an Experiment, Evaluators offer a quick way to validate the behavior of multiple models on a large Dataset. Evaluators can assess both newly generated outputs and existing responses already stored in your dataset.
To learn more about experiment use cases and benefits, see Experiments Overview.
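To give a sense of what an Evaluator computes, the function below sketches a simple deterministic check that scores a generated output against the Expected Output from the dataset. It is a generic illustration of the concept, not one of orq.ai’s built-in Evaluators.

    # Generic illustration of an evaluator: score a generated output
    # against the expected output (not an orq.ai built-in).
    def exact_match(generated: str, expected: str) -> float:
        """Return 1.0 when the normalized strings match, otherwise 0.0."""
        return float(generated.strip().lower() == expected.strip().lower())

An LLM-based Evaluator would replace this deterministic comparison with a model-graded judgment.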

Adding Evaluators to an Experiment

To add an Evaluator to an experiment, head to the right of the table and select Add new Column > Evaluator. The panel that opens shows all Evaluators available in your current Project. To add an Evaluator, enable its toggle; it will appear as a new column in the Studio table. You can also see an Evaluator’s details by selecting the View button.
To add more Evaluators to your Projects, see Evaluators. You can choose to import Evaluators from our Hub or create your own LLM Evaluator.

Viewing Evaluator Results

Once an Experiment has been run, you can view the Evaluator results on the Review page. Evaluators will be shown as columns next to the Cost and Latency results. Evaluators display results depending on their configuration.

Configuring Human Reviews

Human Reviews are manual reviews of generated texts that help you classify and rate outputs according to your own criteria. They can be added to your experiment to extend the automated evaluation. To add a new Human Review, find the Human Review panel, choose Add Human Review, and then add an existing Human Review to the experiment.
To learn more about Human Review and how to create them, see Human Reviews.

Using Vision in Experiments

You can also use images in combination with vision models to run an Experiment. Make sure to use the image message block and URLs in your dataset. In the example screenshot below, the image block points to the {{image_url}} input, which will iterate through the URLs in the dataset.
For detailed instructions on creating datasets with images, see Creating an Image Dataset.
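As a quick illustration of the pattern above, each dataset row supplies a different URL through the {{image_url}} input, and the image message block resolves it at run time. The shapes below follow a common multimodal message layout and are assumptions, not the exact orq.ai schema.

    # Hypothetical vision setup: rows provide image URLs, and the image
    # message block references the {{image_url}} input (illustration only,
    # not the exact orq.ai schema).
    rows = [
        {"inputs": {"image_url": "https://example.com/cat.png"}},
        {"inputs": {"image_url": "https://example.com/dog.png"}},
    ]
    image_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "{{image_url}}"}},
        ],
    }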

Running an Experiment

Once configured, you can run the Experiment using the Run button. Depending on the Dataset size, it may take a few minutes to run all prompt generations. Once successful, your Experiment Run status will change to Completed. You can then see the Experiment Results.

Only Evaluating Existing Dataset Outputs

If you want to test evaluators on datasets that already contain generated responses, you can run an evaluation-only experiment:
  1. Set up your experiment with the dataset containing existing outputs in the “messages” column
  2. Do not select a prompt during experiment setup
  3. Add your desired evaluators
  4. Run the experiment
This mode will evaluate the existing responses without generating new outputs, allowing you to retroactively score historical responses and conversation chains that are already stored in your dataset.
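For clarity, an evaluation-only run works because each dataset row already carries the generated answer in its Messages column, so no prompt is needed to produce new outputs. The row below is a hedged sketch of such an entry; the field names are assumptions, not the exact orq.ai schema.

    # Sketch of a dataset row that already contains a generated response,
    # suitable for an evaluation-only run (field names are assumptions).
    row = {
        "messages": [
            {"role": "user", "content": "Summarize our refund policy."},
            {"role": "assistant", "content": "Refunds are issued within 14 days."},  # existing output
        ],
        "expected_output": "Refunds are available within 14 days of purchase.",
    }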
To run another iteration of the Experiment, with different prompts or data, use the New Run button. A new Experiment Run will be created in Draft state.

Running a Single Prompt

It is often useful to add an extra prompt after running an experiment, to tweak a configuration or try a different version. Once a new Prompt is added, select it and choose Run to run it on the existing Dataset.

Partial Runs

By hovering over a single cell, you can use the re-run icon to re-run a single prompt over a specific Dataset row. When an experiment has only been partially run, choose the Partial Run option from the Run Experiment menu to run all cells that are in Error or haven’t been run yet.

Running Extra Evaluators and Human Reviews

After an Experiment has run, it is possible to add extra Evaluators or Human Reviews. These newly added columns can then be run separately from the main experiment run, which lets you review the previously executed model generations easily.
Using Partial Run on the Experiment will also execute the newly added Evaluators in your Run.

Seeing Experiment Results

Once an Experiment is run, its status will change from Running to Completed.
The Review tab has two views; use the following buttons to choose between them:
  • Review.
  • Compare.

Review a Model Execution

The Review mode displays responses individually, allowing you to inspect each model output in detail, and see the following:
  • Inputs & Outputs: Full conversation context with system prompts, user messages, and model responses
  • Metrics:
    • Latency and TTFT (Time To First Token)
    • Detailed token usage breakdown: Input tokens, Output tokens, Reasoning tokens, and Total tokens
    • Cost information
    • Model and provider details
    • Streaming status
  • Human Review and Feedback: Rate and provide feedback on model outputs
  • Defects & Evaluators: View automated evaluation results and identify quality issues
Here you can annotate responses (similar to Annotation Queues) individually.
Use the Next/Previous buttons or the J/K keys to quickly switch between responses.
Annotations and human reviews can only be added in the Review tab. The Comparison mode is read-only and designed for viewing model outputs side-by-side.

Comparing Model Performance

Using the Compare tab, you can visualize multiple model executions side by side.
The variables and expected outputs are now displayed on the left for better context, especially when working with large inputs or detailed test cases. At the bottom of the screen, the evaluators section provides scores and feedback for each result, helping you assess model quality and performance at a glance.
The Compare screen is read-only. To annotate responses or add Human Reviews, use the Review tab.

Viewing Tool Call History

When viewing a model execution, you can see the step-by-step execution of the model and its tool calls. In these threads you can see the details of each tool call, including which tool was fetched and the payloads sent to and received from the call.
To learn more about configuring tool calls in your Experiment, see Tool Calls for Prompts.
This history lets you verify the model’s behavior when finding the right tool to call. It also lets you validate that the model reacts correctly to unexpected payloads or tool calls.

Viewing Multiple Experiment Runs

Within the Runs tab, visualize all previous runs for an Experiment. Through this view, all Evaluator results are visible at a glance, making it easy to compare results and see progress between multiple Runs.

Duplicating an Experiment

To duplicate an existing Experiment with all its configurations (dataset, prompts, evaluators, etc.):
  1. Open the Experiment you want to duplicate
  2. Click the menu in the top-right corner
  3. Select Duplicate
  4. Provide a new name for the duplicated Experiment
  5. Click Confirm to create the duplicate
This helps you iterate on experiments while keeping different versions organized.

Export

Once downloaded, all information held within the experiment is included in the exported document:
  • Datasets
  • Model configuration
  • Responses
  • Metrics (Time to First Token)
  • Human Reviews