Prerequisites
Before creating an Experiment, you need a Dataset. This dataset contains the Inputs, Messages, and Expected Outputs used for running an Experiment.
- Inputs – Variables that can be used in the prompt message, e.g. {{firstname}}.
- Messages – The prompt template, structured with system, user, and assistant roles.
- Expected Outputs – Reference responses that evaluators use to compare against newly generated outputs.
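For illustration, a single Dataset entry might look like the sketch below. The field names (inputs, messages, expected_output) are illustrative only, not orq.ai's exact schema; refer to your Dataset in the Studio for the real structure.

```python
# A hypothetical Dataset entry, shown as a plain Python dict.
# Field names are illustrative; check your Dataset in the orq.ai Studio
# for the exact structure.
dataset_entry = {
    "inputs": {
        # Variables referenced in the prompt template, e.g. {{firstname}}
        "firstname": "Ada",
    },
    "messages": [
        {"role": "system", "content": "You are a helpful onboarding assistant."},
        {"role": "user", "content": "Write a short welcome message for {{firstname}}."},
    ],
    "expected_output": "Welcome aboard, Ada! We're glad to have you with us.",
}
```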
Creating an Experiment
To create an Experiment, head to the orq.ai Studio:
- Choose a Project and Folder and select the + button.
- Choose Experiment.
Configuring Experiment
Data Entry Configuration

Add new data entries using the Add Row button.
Each entry’s Inputs, Messages and Expected Outputs can be edited independently by selecting a cell.
Prompt Configuration
Your chosen prompts are displayed as separate columns within the Response section. Prompts are assigned a corresponding letter (see A and B above) to identify their performance and Evaluator results.
To add a new Prompt, open the sidebar and choose +Prompt.
Select the Prompt Name to open the Prompt panel and configure the Prompt Template. Messages can be sent to the evaluated model in three ways:
- Using the Messages column in the dataset.
- Using the configured Prompt.
- Using a combination of the configured Prompt and the Messages column.

Open the Prompt panel by selecting its name in the left panel. Use the drop-down (blue) to choose which messages are sent to the evaluated model, for example the configured Dataset messages.
Adding Tool Calls to a Prompt
You can add a specific historical Tool Call chain to a model's execution to test its behavior when it runs into a specific tool, payload, or response. These tool calls can be configured at any step of the conversation, which lets you test whether the model:
- Recognizes its own mistakes – Can it identify that the previous tool call had incorrect parameters?
- Self-corrects in context – Does it adjust its behavior when shown the wrong result?
- Understands conversation flow – Does adding that failure to the message history change how it reasons about the problem?

Configuring a tool call input and output.
For each tool call in the chain, you can configure the:
- Tool Function Name – to check whether the correct tool was called and plan the behavioral response to errors.
- Tool Input – to simulate a correct or incorrect translation from input to payload.
- Tool Output – to verify correct handling of any tool feedback.
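As a rough illustration, a historical tool call step injected into the conversation could be represented like this. This is a sketch only, using a generic chat-message shape; the function name, arguments, and result are made-up examples, not orq.ai's exact configuration fields.

```python
# Hypothetical tool call step inserted into the message history.
# Names, arguments, and the simulated failure below are invented examples.
tool_call_step = [
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_1",
                "function": {
                    "name": "get_order_status",             # Tool Function Name
                    "arguments": '{"order_id": "A-1042"}',  # Tool Input (payload)
                },
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": '{"error": "order not found"}',          # Tool Output (simulated failure)
    },
]
```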
Configuring Evaluators
Adding Evaluators to an Experiment allows for quantitative evaluation of the model-generated outputs. Using standard scientific methods or custom LLM-based evaluations, you can automate the scoring of models to quickly detect whether they fit a predefined hypothesis and how they compare to one another. Within an Experiment, Evaluators offer a quick way to validate the behavior of multiple models on a large Dataset. Evaluators can assess both newly generated outputs and existing responses already stored in your dataset.
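Conceptually, an Evaluator takes a generated output (and, where relevant, the Expected Output from the Dataset) and returns a score. The sketch below illustrates the idea with a toy keyword-overlap metric; it is not orq.ai's Evaluator API, just an example of the kind of scoring that gets automated.

```python
def keyword_overlap_score(generated: str, expected: str) -> float:
    """Toy evaluator: fraction of expected keywords found in the generated text."""
    expected_words = {w.lower() for w in expected.split()}
    generated_words = {w.lower() for w in generated.split()}
    if not expected_words:
        return 0.0
    return len(expected_words & generated_words) / len(expected_words)

# Example: score a generated response against the Dataset's Expected Output.
score = keyword_overlap_score(
    "Welcome aboard, Ada! Glad to have you with us.",
    "Welcome aboard, Ada! We're glad to have you with us.",
)
print(f"score = {score:.2f}")  # higher means closer to the expected output
```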
Adding Evaluators to an Experiment
To add an Evaluator to an Experiment, head to the right of the table and select Add new Column > Evaluator. The following panel opens, showing all Evaluators available in your current Project.
Viewing Evaluator Results
Once an Experiment has been run, you can view the Evaluator results on the Report page. Evaluators will be shown as columns next to the Cost and Latency results. Evaluators display results depending on their configuration.
Cells will be colored depending on score, to help identify outliers in results at a glance.
Configuring Human Reviews
Human Reviews are manual reviews of generated texts that help you classify and rate outputs against your own criteria. They can be added to your Experiment to extend its evaluations. To add a new Human Review, find the Human Review panel and choose Add Human Review; you can then attach an existing Human Review to the Experiment.
Human Reviews appear as a new column, and each output can be reviewed individually. Here, an output is rated.
Using Vision in Experiments
You can also use images in combination with vision models to run an Experiment. Make sure to use the image message block and URLs in your dataset. In the example screenshot below, the image_block points to {{image_url}} inputs, which will iterate through the URLs in the dataset.

Setting up an Experiment with Images and a Vision model
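As a sketch, a Dataset message using an image block could look roughly like the following, with the image URL supplied through an input variable. The content format follows the common chat-message shape and is illustrative, not orq.ai's exact schema.

```python
# Hypothetical message entry for a vision Experiment.
# The {{image_url}} input is filled from each Dataset row, so the same
# prompt iterates over every image URL in the Dataset.
vision_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe what you see in this image."},
        {"type": "image_url", "image_url": {"url": "{{image_url}}"}},
    ],
}
```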
Running an Experiment
Once configured, you can run the Experiment using the Run button. Depending on the Dataset size, it may take a few minutes to run all prompt generations. Once successful, your Experiment Run Status will change to Completed, and you can then see the Experiment Results.
Only Evaluating Existing Dataset Outputs
If you want to test Evaluators on Datasets that already contain generated responses, you can run an evaluation-only Experiment:
- Set up your Experiment with the Dataset containing existing outputs in the "messages" column.
- Do not select a prompt during Experiment setup.
- Add your desired Evaluators.
- Run the Experiment.
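For illustration, such a Dataset entry already carries the generated response as an assistant message, so Evaluators can score it without a new generation. This is a sketch with illustrative field names, not orq.ai's exact schema.

```python
# Hypothetical entry for an evaluation-only Experiment: the assistant message
# holds the previously generated output that the Evaluators will score.
existing_output_entry = {
    "messages": [
        {"role": "user", "content": "Summarize our refund policy in one sentence."},
        {"role": "assistant", "content": "Refunds are issued within 14 days of purchase."},
    ],
    "expected_output": "Customers can request a refund up to 14 days after purchase.",
}
```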
Running a single Prompt
It is often useful to add an extra prompt after running an Experiment, to tweak a configuration or try a different version. Once a new Prompt is added, select it and choose Run to run it on the existing Dataset.
Partial Runs
By hovering over a single cell, you can use the icon to re-run a single prompt on a specific Dataset row.

Running Extra Evaluators and Human Reviews
After an Experiment has run, it is possible to add extra Evaluators or Human Reviews. These newly added columns can then be run separately from the main Experiment run, which lets you review the previously executed model generations easily.
Use the drop-down on your Evaluator column to run the newly added Evaluations.
Seeing Experiment Results
Report
Once an Experiment has run, its status will change from Running to Completed.
The total cost and runtime for the Experiment will be displayed.

The results for Prompts A and B are broken down by:
- Latency.
- Costs.
- Evaluators.
Comparing Model Performance
Using the Compare tab, visualize multiple model executions.
View multiple model generations side-by-side.

Feedback and Human Reviews are available at a click.
Viewing Tool Call History
When viewing a model run log, you can see the step-by-step execution of the model and its tool calls. In these threads, you can see the details of each tool call, including the tool that was fetched and the payloads sent and received.
See the model's interpretation and reasoning around the tool call.
Viewing Multiple Experiment Runs
Within the Runs tab, visualize all previous runs for an Experiment. Through this view, all Evaluator results are visible at a glance, making it easy to compare results and see progress between multiple Runs.
See at a glance how results evolved between two experiment runs.
Logs
Switching to the Logs tab lets you see the details of each call. Within the logs, you can process Feedback and build a Curated Dataset. By hovering over a cell, you can also directly access the related log using the See log button.
Export

Exports are available after an Experiment has run successfully. The export includes:
- Datasets
- Model configuration
- Responses
- Metrics (Time to First Token)
- Human Reviews

An example CSV download for an Experiment: each column holds data entries and generated responses.
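If you want to post-process an export programmatically, the CSV can be loaded like any other file. The file name and column names below are placeholders, since the actual headers depend on your Dataset, prompts, Evaluators, and Human Reviews.

```python
import csv

# Load an Experiment export and inspect a few columns.
# "experiment_export.csv" and the column names are placeholders.
with open("experiment_export.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row.get("input"), row.get("response"), row.get("time_to_first_token"))
```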