Experiments

While the Playground allows you to test prompts and model setups one generation at a time, the Experiments module lets you run thousands of scenarios to rigorously evaluate your use case at scale.

Experiments let you analyze the results and performance of model generations, using Evaluators to score outputs automatically and assess their quality.
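As a rough illustration of the idea, the sketch below runs two configurations over the same small dataset and scores each output with an evaluator. It is a conceptual sketch only, not the Experiments module's API: `DATASET`, `config_a`, `config_b`, `exact_match`, and `run_experiment` are hypothetical names, and the "models" are plain functions standing in for real generations.

```python
from statistics import mean
from typing import Callable

# Hypothetical dataset: each row is one scenario with an input and a reference answer.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
    {"input": "10 - 7", "expected": "3"},
]


def config_a(prompt: str) -> str:
    """Stand-in for one prompt/model configuration: computes the arithmetic correctly."""
    left, op, right = prompt.split()
    a, b = int(left), int(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])


def config_b(prompt: str) -> str:
    """Stand-in for a weaker configuration: treats every operator as addition."""
    left, _, right = prompt.split()
    return str(int(left) + int(right))


def exact_match(output: str, expected: str) -> float:
    """Hypothetical evaluator: 1.0 if the output equals the reference, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(
    configs: dict[str, Callable[[str], str]],
    dataset: list[dict[str, str]],
    evaluator: Callable[[str, str], float],
) -> dict[str, float]:
    """Run every configuration on every row and return the mean score per configuration."""
    results = {}
    for name, generate in configs.items():
        row_scores = [evaluator(generate(row["input"]), row["expected"]) for row in dataset]
        results[name] = mean(row_scores)
    return results


scores = run_experiment({"config_a": config_a, "config_b": config_b}, DATASET, exact_match)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Comparing the mean scores side by side is the same pattern, at toy scale, as benchmarking or A/B testing configurations against a shared dataset.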

Why should I use the Experiments module?

  • Benchmarking - Which prompt or model configuration performs best?
  • A/B Testing - Which variation produces better results?
  • Regression Testing - Has my new prompt or model introduced unintended changes?
  • Backtesting - How would my new setup have performed on past data?

See what running an Experiment looks like in the presentation below.