
Prerequisites

Before creating an Experiment, you need a Dataset. This dataset contains the Inputs, Messages, and Expected Outputs used for running an Experiment.
  • Inputs – Variables that can be used in the prompt message, e.g. {{firstname}}.
  • Messages – The prompt template, structured with system, user, and assistant roles.
  • Expected Outputs – Reference responses that evaluators use to compare against newly generated outputs.
To get experiments ready, make sure you have models available by adding them to your Model Garden.
You don’t need to include all three entities when uploading a dataset. Depending on your experiment, you can choose to include only inputs, messages, or expected outputs as needed. For example, you can create a dataset with just inputs.
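To make the dataset structure more concrete, here is a minimal sketch of a single entry and of how an Input such as {{firstname}} resolves inside a Message template. The field names and the render helper are illustrative only and do not reflect the exact upload schema or any orq.ai API.

```python
import re

# Illustrative sketch of a single dataset entry (field names are simplified;
# the exact upload schema may differ).
entry = {
    # Inputs: variables substituted into {{placeholders}} in the messages.
    "inputs": {"firstname": "Ada"},
    # Messages: the prompt template, structured with system/user/assistant roles.
    "messages": [
        {"role": "system", "content": "You are a friendly onboarding assistant."},
        {"role": "user", "content": "Write a short greeting for {{firstname}}."},
    ],
    # Expected Output: the reference response evaluators compare against.
    "expected_output": "Hello Ada, welcome aboard!",
}

def render(template: str, inputs: dict) -> str:
    """Resolve {{variable}} placeholders with values from the entry's inputs."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(inputs[m.group(1)]), template)

# render(entry["messages"][1]["content"], entry["inputs"])
# -> "Write a short greeting for Ada."
```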

Creating an Experiment

To create an Experiment, head to the orq.ai Studio:
  • Choose a Project and Folder and select the + button.
  • Choose Experiment
Afterwards, select a Dataset to use as the base for your Experiment and choose one or more models. You’ll then be taken to the Report view, where you can configure Dataset entries and models before running the experiment.

Configuring an Experiment

Data Entry Configuration

The left side of the table shows the loaded Dataset entries. Each entry is shown as a row and will be executed separately with each configured prompt. You can add new entries to be tested during the experiment by using the Add Row button. Each entry’s Inputs, Messages, and Expected Outputs can be edited independently by selecting a cell.
Columns can be reorganized and hidden at will; look for the icon to configure your columns.

Prompt Configuration

Your chosen prompts are displayed as separate columns within the Response section. Prompts are assigned a corresponding letter (see A and B above) to identify their performance and Evaluator results. To add a new Prompt, open the sidebar and choose + Prompt. Select the Prompt Name to open the Prompt panel and configure the Prompt Template.
There are 3 ways to configure your prompt:
  • Using the Messages column in the dataset.
  • Using the configured Prompt.
  • Using a combination of the configured Prompt and the Messages column.

Open the Prompt panel by selecting the prompt’s name in the left panel. Use the drop-down (shown in blue) to choose which messages are sent to the evaluated model, for example the Messages from the configured Dataset.

To learn more about Prompt Template Configuration, see Creating a Prompt.

Adding Tool Calls to a Prompt

You can add a specific historical Tool Call chain to a model’s execution to test its behavior when running into a specific tool, payload, or response. These tool calls can be configured at any step of the conversation, which lets you test whether the model:
  1. Recognizes its own mistakes - Can it identify that the previous tool call had incorrect parameters?
  2. Self-corrects in context - Does it adjust its behavior when shown the wrong result?
  3. Understands conversation flow - Does adding that failure to the message history change how it reasons about the problem?
To add a Tool call to a message, use the button.

Configuring a tool call input and output.

The following can be configured:
  • Tool Function Name – to check whether the correct tool was called and to plan the behavioral response to errors.
  • Tool Input – to simulate correct or incorrect translation from input to payload.
  • Tool Output – to ensure correct handling of any tool feedback.
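As a rough illustration, a historical tool-call exchange inserted into the message history might look like the sketch below. The structure shown is OpenAI-style and the tool name, identifier, and payloads are hypothetical; the exact format used in the Studio may differ.

```python
# Illustrative sketch of a historical tool-call exchange placed in the
# message history (OpenAI-style structure for illustration only).
tool_call_history = [
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",                          # hypothetical identifier
                "type": "function",
                "function": {
                    "name": "get_order_status",          # Tool Function Name
                    "arguments": '{"order_id": "A-123"}' # Tool Input (payload)
                },
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        # Tool Output: deliberately return an error to test whether the model
        # recognizes the failure and self-corrects in the next turn.
        "content": '{"error": "order A-123 not found"}',
    },
]
```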

Configuring Evaluators

Adding Evaluators to an Experiment allows for quantitative evaluation of the model-generated outputs. Using standard scientific methods or custom LLM-based evaluations, you can automate the scoring of models to quickly detect whether they fit a predefined hypothesis and how they stand out from one another. Within an Experiment, Evaluators offer a quick way to validate the behavior of multiple models on a large Dataset. Evaluators can assess both newly generated outputs and existing responses already stored in your dataset.
To learn more about experiment use cases and benefits, see Experiments Overview.
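Conceptually, an Evaluator is a function that assigns a score to a generated output, often by comparing it against the Expected Output from the dataset. The sketch below is only a conceptual illustration of that idea, not the orq.ai Evaluator API.

```python
def exact_match(generated: str, expected: str) -> float:
    """Return 1.0 when the generated output matches the expected output exactly."""
    return 1.0 if generated.strip() == expected.strip() else 0.0

def keyword_coverage(generated: str, keywords: list[str]) -> float:
    """Return the fraction of required keywords present in the generated output."""
    if not keywords:
        return 0.0
    hits = sum(1 for keyword in keywords if keyword.lower() in generated.lower())
    return hits / len(keywords)

# keyword_coverage("Refunds are available within 30 days.", ["refund", "30 days"])
# -> 1.0
```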

Adding Evaluators to Experiment

To add an Evaluator to an experiment, head to the right of the table and select Add new Column > Evaluator. The following panel opens, showing all Evaluators available in your current Project. To add an Evaluator, enable its toggle; it will appear as a new column in the Report table. You can also see the Evaluator details by selecting the View button.
To add more Evaluators to your Projects, see Evaluators. You can choose to import Evaluators from our Hub or create your own LLM Evaluator.

Viewing Evaluator Results

Once an Experiment has been run, you can view the Evaluator results on the Report page. Evaluators will be shown as columns next to the Cost and Latency results. Evaluators display results depending on their configuration.

Cells will be colored depending on score, to help identify outliers in results at a glance.

Configuring Human Reviews

Human Reviews are manual reviews of generated outputs that let you classify and rate them against your own criteria. They can be added to your experiment to extend automated evaluations. To add a new Human Review, open the Human Review panel and choose Add Human Review; you can then attach an existing Human Review to the experiment.
To learn more about Human Review and how to create them, see Human Reviews.

Human Reviews appear as a new column; each output can be reviewed individually. Here, an output is rated.

Using Vision in Experiments

You can also use images in combination with vision models to run an Experiment. Make sure to use the image message block and URLs in your dataset. In the example screenshot below, the image_block points to the {{image_url}} input, which will iterate through the URLs in the dataset.

Setting up an Experiment with Images and a Vision model
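For reference, a user message carrying an image block templated from the dataset’s {{image_url}} input might look roughly like the sketch below. The content-block structure shown is OpenAI-style and only illustrative; the exact block format in the Studio may differ.

```python
# Illustrative sketch: a user message with an image block whose URL is
# templated from the dataset's {{image_url}} input (OpenAI-style blocks
# shown for illustration; the Studio's format may differ).
vision_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe what you see in this image."},
        {"type": "image_url", "image_url": {"url": "{{image_url}}"}},
    ],
}
```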

Running an Experiment

Once configured, you can run the Experiment using the Run button. Depending on the Dataset size, it may take a few minutes to run all prompt generations for each model. Once successful, the Experiment Run status will change to Completed, and you can then see the Experiment Results.

Only Evaluating Existing Dataset Outputs

If you want to test evaluators on datasets that already contain generated responses, you can run an evaluation-only experiment:
  1. Set up your experiment with the dataset containing existing outputs in the “messages” column
  2. Do not select a prompt during experiment setup
  3. Add your desired evaluators
  4. Run the experiment
This mode will evaluate the existing responses without generating new outputs, allowing you to retroactively score historical responses and conversation chains that are already stored in your dataset.
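For an evaluation-only run, the dataset’s Messages column already ends with the stored response to be scored. The sketch below is a hypothetical example of such an entry; the conversation content is illustrative only.

```python
# Illustrative sketch of a "messages" column that already contains a generated
# response (the final assistant message). With no prompt selected, evaluators
# score this stored response instead of generating a new one.
existing_conversation = [
    {"role": "user", "content": "Summarize our refund policy in one sentence."},
    {"role": "assistant", "content": "Refunds are available within 30 days of purchase."},
]
```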
To run another iteration of the Experiment, with different prompts or data, use the New Run button. A new Experiment Run will be created in Draft state.

Running a single Prompt

It is often useful to add an extra prompt after running an experiment, to tweak a configuration or try a different version. Once a new Prompt is added, select it and choose Run to run it on the existing Dataset.

Partial Runs

By hovering over a single cell, you can use the icon to re-run a single prompt over a specific Dataset row. When an experiment has only been partially run, choose the Partial Run option from the Run Experiment menu to run all cells that are in Error or haven’t been run yet.

Running Extra Evaluators and Human Reviews

After an Experiment has run, it is possible to add extra Evaluators or Human Reviews. These newly added columns can then be run separately from the main experiment run, which lets you easily review the previously executed model generations.

Use the drop-down on your Evaluator column to run the newly added Evaluations.

Using the Partial Run on the Experiment will also execute the newly added Evaluators in your Run.

Seeing Experiment Results

Report

Once an Experiment has run, its status will change from Running to Completed.

The total cost and runtime for the Experiment will be displayed.

The right side of the table will be filled with results.

The results for Prompts A and B

Under the column corresponding to each Prompt, you can see results for:
  • Latency.
  • Costs.
  • Evaluators.

Comparing Model Performance

Using the Compare tab, visualize multiple model executions.

View multiple model generations side-by-side.

The variables and expected outputs are now displayed on the left for better context, especially when working with large inputs or detailed test cases. At the bottom of the screen, the Evaluators section provides scores and feedback for each result, helping you assess model quality and performance at a glance. You can use this screen to easily apply Feedback and Human Reviews to each output, letting you evaluate and review Experiment results efficiently.

Feedback and Human Reviews are available at a click.

Viewing Tool Call History

When viewing a model run log, you can see the step-by-step execution of the model and its tool calls.
Use the button to view a single execution Log.
In these threads you can see the details of the tool calls, including the tool that was fetched and the payloads sent and received during the call.
To learn more about configuring tool calls in your Experiment, see Adding Tool Calls to a Prompt.
This history lets you verify the model’s behavior when finding the right tool to call. It also lets you validate that it reacts correctly to unexpected payloads or tool calls.

See the model’s interpretation and reasoning around the tool call.

Viewing Multiple Experiment Runs

Within the Runs tab, visualize all previous runs for an Experiment. Through this view, all Evaluator results are visible at a glance, making it easy to compare results and see progress between multiple Runs.

See at a glance how results evolved between two experiment runs.

Logs

Switching to the Logs tab lets you see the details of each call. Within the logs you can process Feedback and build a Curated Dataset. By hovering over a cell, you can also directly access the related log using the See log button.

Export

Exports are available after an Experiment has run successfully.

Once downloaded, the document contains all information held within the experiment:
  • Datasets
  • Model configuration
  • Responses
  • Metrics (Time to First Token)
  • Human Reviews

An example CSV download for an Experiment: each column holds data entries and generated responses.
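If you want to work with the export programmatically, the exported CSV can be loaded with standard tooling. The file name below is hypothetical and the column headers depend on your experiment, so inspect them rather than assuming a fixed layout.

```python
import pandas as pd

# Load the exported Experiment CSV. The file name is illustrative; inspect
# df.columns to see the actual headers in your export (dataset entries,
# responses, metrics, human reviews, ...).
df = pd.read_csv("experiment_export.csv")

print(df.columns.tolist())  # discover which columns your export contains
print(df.head())            # preview the first few rows
```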