Prerequisite
First, configure your Experiment; see Creating an Experiment.
Running an Experiment
Once configured, you can run the Experiment using the Run button. Depending on the Dataset size, it may take a few minutes to complete all prompt generations. Once successful, your Experiment Run status will change to Completed and you can view the Experiment Results.
Only Evaluating Existing Dataset Outputs
If you want to test evaluators on a dataset that already contains generated responses, you can run an evaluation-only experiment:
- Set up your experiment with the dataset containing the existing outputs in the “messages” column (a rough sketch of this shape follows the list)
- Do not select a prompt during experiment setup
- Add your desired evaluators
- Run the experiment
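As a loose illustration of what such a dataset might contain (the column names other than “messages” are assumptions here, not the platform's actual schema), the rows could look roughly like this:

```python
# Hypothetical sketch of dataset rows for an evaluation-only experiment.
# "input" and "expected_output" are illustrative column names; only the
# "messages" column is named in the documentation above.
dataset_rows = [
    {
        "input": "Summarize the refund policy in one sentence.",
        "expected_output": "Refunds are available within 30 days of purchase.",
        # Previously generated responses live in the "messages" column,
        # so no prompt needs to run during the experiment.
        "messages": [
            {"role": "assistant", "content": "You can request a refund up to 30 days after buying."}
        ],
    },
    {
        "input": "List the supported payment methods.",
        "expected_output": "Credit card, PayPal, and bank transfer.",
        "messages": [
            {"role": "assistant", "content": "We accept credit cards, PayPal, and bank transfers."}
        ],
    },
]
```

Because no prompt is selected, the experiment only runs your evaluators over these existing outputs.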
To run another iteration of the Experiment with different prompts or data, use the New Run button. A new Experiment Run will be created in Draft state.
Seeing Experiment Results
Report
Once an Experiment has run, its status will change from Running to Completed.
The total cost and runtime for the Experiment will be displayed.

The results for Prompts A and B include:
- Latency
- Costs
- Evaluator scores
Viewing Multiple Experiment Runs
Within the Runs tab, you can view all previous runs for an Experiment. In this view, all Evaluator results are visible at a glance, making it easy to compare results and track progress across multiple Runs.
See at a glance how results evolved between two experiment runs.
Logs
Switching to the Logs tab lets you see the details of each call. Within Logs, you can process Feedback and build a Curated Dataset. By hovering over a cell, you can also directly access the related log using the See log button.
Comparison View
For easy comparison between models and prompts, click the Show Comparison button. This opens the redesigned comparison view, where you can analyze outputs side by side across multiple models or configurations. The variables and expected outputs are displayed on the left for better context, especially when working with large inputs or detailed test cases.
At the bottom of the screen, the evaluators section provides scores and feedback for each result, helping you assess model quality and performance at a glance.

Comparing Prompt versions A and B across GPT-4o and Claude 3.5 Sonnet