Use cases for Experiments

Comparing different models side by side

Testing newly available models

Whenever a new model is released (for example, Claude 3), you might wonder how it performs against the one you're currently using: does it generate better output?

Running an experiment with the new model in parallel to your current model will answer this question.

Test fine-tuned models

How does your newly fine-tuned model compare to its previous version or base model?

Use the exact same configuration and compare the differences side-by-side.

Test private models

After you add your private model to the model garden, you can run experiments with it. As with other models, you can test the output in a side-by-side comparison.

Comparing model metrics

Comparing cost

You might not need the most sophisticated or fastest model for every use case. For relatively simple tasks that run at high frequency, a cheaper model might be the best solution.

Run an experiment to find out which model offers the best value for money.
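Once you know the token counts and prices involved, the value-for-money comparison is simple arithmetic. The following is a minimal sketch, assuming per-million-token pricing. The model names and prices are made-up placeholders, not real price lists, and `Pricing`, `cost_per_request`, and `monthly_cost` are hypothetical helpers rather than part of Experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. The values used below are illustrative only."""
    input_per_million: float
    output_per_million: float


def cost_per_request(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request with the given token counts."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def monthly_cost(
    pricing: Pricing, input_tokens: int, output_tokens: int, requests_per_month: int
) -> float:
    """Scale the per-request cost to a monthly request volume."""
    return cost_per_request(pricing, input_tokens, output_tokens) * requests_per_month


# Hypothetical models and prices, chosen only to show the calculation.
PRICING = {
    "large-model": Pricing(input_per_million=10.0, output_per_million=30.0),
    "small-model": Pricing(input_per_million=0.5, output_per_million=1.5),
}

if __name__ == "__main__":
    for name, pricing in PRICING.items():
        total = monthly_cost(pricing, input_tokens=1_000, output_tokens=500,
                             requests_per_month=100_000)
        print(f"{name}: ${total:,.2f} per month")
```

Cost alone does not settle the question: pair the numbers with the output quality you observe in the experiment before choosing the cheaper model.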

Comparing latency

When your use case depends on low latency, comparing the latency of different models and providers is essential. After you run the experiment, the heatmap shows how the models compare against one another on the relevant metrics.

On top of that, you can test the same model on different providers, which yields similar output but can yield very different latencies (example: Llama-70b on Azure vs Llama-70b on Groq).
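If you want to sanity-check latency outside the heatmap, a small timing harness is enough. This is a rough sketch, not how Experiments measures latency: `call_model` is a hypothetical callable that wraps whichever provider client you use, and the summary keys are names chosen for this example.

```python
import statistics
import time
from typing import Callable


def measure_latency(
    call_model: Callable[[str], str], prompt: str, runs: int = 5
) -> dict[str, float]:
    """Time `runs` sequential calls and summarize the wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)  # hypothetical wrapper around a provider client
        samples.append(time.perf_counter() - start)
    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

Running several calls and reporting the median reduces the effect of a single slow request, which matters when comparing the same model across two providers.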

Prompt optimization

Prompt finetuning

Given twenty slightly different prompts, which one generates the best output? Experiment and find out.

For a more quantitative approach, see Evaluator in Experiment to learn how Evaluators such as Cosine Similarity can help you assess which prompt version works best.
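To make the idea behind a cosine similarity score concrete, the sketch below compares a generated output with a reference answer. It rests on an assumption: the texts are turned into simple word-count vectors as a stand-in, because this page does not describe how the Cosine Similarity Evaluator builds its vectors. `cosine_similarity` and `text_similarity` are hypothetical helper names.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(output: str, reference: str) -> float:
    """Score an output against a reference using word counts (a crude stand-in)."""
    return cosine_similarity(
        Counter(output.lower().split()),
        Counter(reference.lower().split()),
    )
```

A score near 1 means the output uses almost the same words as the reference, and a score of 0 means they share none. Scoring every prompt variant against the same reference gives you a number to rank them by.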

Comparing different parameters

Parameters are more important than you might initially think.

Within Experiments, you can test the same datasets with the same models and slightly different parameters. Read more about what each parameter does in our Model Parameters documentation.
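When planning which parameter variations to compare, it can help to enumerate the combinations up front. The following is a minimal sketch: `parameter_grid` is a hypothetical helper, and `temperature` and `top_p` are used only as common examples of parameters, so check the Model Parameters documentation for what your model supports.

```python
from itertools import product
from typing import Any


def parameter_grid(options: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Return one configuration dict per combination of the given parameter values."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in product(*(options[name] for name in names))
    ]


# Example values only; pick ranges that make sense for your use case.
grid = parameter_grid({"temperature": [0.0, 0.7, 1.0], "top_p": [0.9, 1.0]})
```

Here `grid` holds six configurations, one per temperature and top_p pair. Changing a single parameter at a time makes it easier to attribute differences in output to that parameter.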

Pre-Deployment Testing

Testing before deploying

Before making changes in production, you can run the exact same prompt configuration in an Experiment to test your hypothesis in a safe and secure environment.

Backtesting

Backtesting is a method used to evaluate the performance of a model by testing it on historical data.

The idea is to see how well the model would have performed if it had been used during a past period. This process helps in assessing the effectiveness, reliability, and robustness of the model before deploying it in real-world scenarios.

Regression testing

Whenever you improve the model's performance in one aspect, regression testing helps ensure that its performance in other areas hasn't degraded. Using a Dataset is the best way to make sure that the model's outputs are still as expected when you release a new version.
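The regression check described above can be sketched as a loop over a dataset of inputs and expected outputs. This is an illustration under assumptions, not the Experiments implementation: `generate` is a hypothetical callable for the new model version, the dataset is a plain list of dicts with `input` and `expected` keys, and outputs are compared by exact match after normalization, which only suits tasks with a single correct answer.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes don't count."""
    return " ".join(text.lower().split())


def find_regressions(
    dataset: list[dict[str, str]], generate: Callable[[str], str]
) -> list[dict[str, str]]:
    """Return the rows where the new output no longer matches the expected output."""
    failures = []
    for row in dataset:
        actual = generate(row["input"])  # hypothetical call to the new model version
        if normalize(actual) != normalize(row["expected"]):
            failures.append(
                {"input": row["input"], "expected": row["expected"], "actual": actual}
            )
    return failures
```

An empty result means no row regressed under this check. For free-form outputs, a similarity score or another Evaluator is a better comparison than exact match.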

Security testing

Jailbreak mitigation

Jailbreaking refers to the practice of bypassing the restrictions or safety protocols of a model to make it produce unauthorized or unintended responses.

Jailbreak mitigation involves strengthening the model to prevent such exploitative behavior, ensuring compliance with intended use cases.

Within Experiments, you can test how your model responds to jailbreak attempts in a safe and secure way.

Testing against known attacks and datasets of jailbreak prompts shows you what might happen before you put the model into production.
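One simple way to summarize such a test run is the share of jailbreak prompts the model refuses. The sketch below is deliberately crude and rests on assumptions: `call_model` is a hypothetical callable for the model under test, and refusals are detected by a short list of example phrases, which will miss many refusals and can misclassify some compliant answers. Treat the result as a first signal and review the outputs manually.

```python
from typing import Callable

# Example phrases only; extend or replace them for your own model and language.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Heuristic: the response contains one of the known refusal phrases."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(prompts: list[str], call_model: Callable[[str], str]) -> float:
    """Fraction of prompts whose response looks like a refusal (0.0 for no prompts)."""
    if not prompts:
        return 0.0
    refused = sum(is_refusal(call_model(prompt)) for prompt in prompts)
    return refused / len(prompts)
```

Comparing the refusal rate before and after a prompt or model change shows whether the change made the model easier or harder to jailbreak on the same set of prompts.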