Experiments v2 - Evaluate your LLM Config
15 days ago by Cormick Marskamp
We’re excited to introduce Experiments V2, a major upgrade to our Experiments module that makes testing, evaluating, and benchmarking models and prompts more intuitive and flexible than ever before.
Why Experiments Matter
The Experiments module allows users to analyze model performance systematically without manually reviewing logs one by one. It leverages evaluators, such as JSON schema validators, LLM-as-a-judge, and functions like cosine similarity, to automate scoring and assess output quality, so users can confidently measure effectiveness and make data-driven improvements.
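Under the hood, an evaluator typically boils down to a scoring function applied to a model output (and, when available, a reference answer). The snippet below is a minimal sketch of a cosine-similarity evaluator, assuming the texts have already been turned into embeddings; the 0.8 pass threshold and the returned fields are illustrative assumptions, not the platform's actual implementation.

```python
import math

# Minimal sketch of a cosine-similarity evaluator: score how close a model
# output is to a reference answer in embedding space. The embeddings are
# assumed to come from any sentence-embedding model of your choice.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def evaluate(output_embedding: list[float],
             reference_embedding: list[float],
             threshold: float = 0.8) -> dict:
    """Return a score and a pass/fail flag, the shape most evaluators share."""
    score = cosine_similarity(output_embedding, reference_embedding)
    return {"score": score, "passed": score >= threshold}
```

Running `evaluate()` on each row of an experiment produces the per-row scores that the module aggregates for you, instead of you eyeballing individual logs.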
Additionally, Experiments enable benchmarking through the following approaches (see the sketch after this list):
- A/B testing – Compare different prompts or model configurations to determine what works best.
- Regression testing – Detect unintended changes in outputs after modifying prompts or configurations.
- Backtesting – Assess how new setups would have performed on historical data.
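All three approaches ultimately compare aggregate evaluator scores across configurations. The sketch below is a hypothetical illustration of that comparison, assuming per-row scores have already been produced by an evaluator; the `compare_variants` helper and the regression tolerance are made up for the example.

```python
from statistics import mean

# Hypothetical illustration of what A/B and regression comparisons boil down
# to: aggregate evaluator scores per prompt variant and compare them.

def compare_variants(scores_a: list[float], scores_b: list[float],
                     regression_tolerance: float = 0.02) -> dict:
    """Summarize an A/B run and flag a regression if variant B drops
    more than `regression_tolerance` below variant A."""
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    return {
        "variant_a_mean": round(mean_a, 3),
        "variant_b_mean": round(mean_b, 3),
        "winner": "A" if mean_a > mean_b else "B",
        "regression": mean_b < mean_a - regression_tolerance,
    }

# Example: per-row evaluator scores for two prompt versions on the same data set.
print(compare_variants([0.82, 0.91, 0.77], [0.88, 0.93, 0.85]))
```

Backtesting follows the same pattern, except the scores come from replaying a new configuration against historical rows rather than fresh traffic.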
What’s New in Experiments V2?
We listened to your feedback and made significant improvements:
- Compare Prompts Side by Side – Users can now directly compare multiple prompts within an experiment, making it easier to test variations and find the most effective approach.
- Intuitive and Flexible UI – The new interface is streamlined for ease of use, making configuration and analysis smoother than ever.
- Merged Data Sets and Variable Collections – Previously separate, these have been combined to simplify workflows. Based on your feedback, the setup is now clearer: when configuring an experiment, users upload a single data set containing input variables, messages (prompts), and expected outputs (when an evaluator needs a reference), as in the example below.
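To make the merged format concrete, here is a hypothetical example of what a single data set row could look like when written as JSONL. The field names (`inputs`, `messages`, `expected_output`) and the `{{country}}` templating syntax are illustrative assumptions, not the platform's exact schema.

```python
import json

# Illustrative merged data set: each row carries its input variables, the
# prompt messages, and an expected output for reference-based evaluators.
dataset = [
    {
        "inputs": {"country": "France"},
        "messages": [
            {"role": "system", "content": "You are a concise geography assistant."},
            {"role": "user", "content": "What is the capital of {{country}}?"},
        ],
        "expected_output": "Paris",
    },
]

# Write as JSONL so each experiment row stays on its own line.
with open("experiment_dataset.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```

Rows without an `expected_output` would still work for evaluators that need no reference, such as a JSON schema validator or an LLM-as-a-judge rubric.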