Use cases for Experiments
11 reasons why you should use Experiments
Comparing different models side by side
Testing newly available models
Whenever a new model has been released (example: Claude 3), you might wonder how that new model would perform against the one you're currently using: Does it generate better output?.
Running an experiment with the new model in parallel to your current model will answer this question.
Test fine-tuned models
How does your newly fine-tuned model compare to its previous version or base model?
Use the exact same configuration and compare the differences side-by-side.
Test private models
After you add your private model to the model garden, you can run experiments with it. Same as with other models, you can test the output in a side-by-side comparison.
Comparing model metrics
Comparing cost
You might not need the most sophisticated or fastest model for every use case. Especially when doing relatively simple tasks in high frequency, a cheaper model might be the best solution.
Run an experiment to find out what model has the best value for money.
Comparing latency
When your use case is dependent on low latency, comparing the latency of different models and providers is very important. After running the experiment, the heatmap with the necessary metrics will show you how the models compare against one another.
On top of that, it is possible to test out the same model on different providers, which results in a similar output but completely different latencies (example: Llama-70b on Azure vs Llama-70b on Groq).
Prompt optimization
Prompt finetuning
Twenty slightly different prompts, which one generates the best output? Experiment and find out.
Also, for a more quantitative approach, read how Evaluators like cosine similarity can help you assess which prompt version works best in Evaluators in Experiments.
Comparing different parameters
parameters are more important than you initially might think.
Within Experiments, you can test the same datasets with the same models and slightly different parameters. Read more about what each parameter does here.
Pre-Deployment Testing
Testing before deploying
Before making changes in production, you can run the exact same prompt configuration in an Experiment to test your hypothesis in a safe and secure environment.
Backtesting
Backtesting is a method used to evaluate the performance of a model by testing it on historical data.
The idea is to see how well the model would have performed if it had been used during a past period. This process helps in assessing the effectiveness, reliability, and robustness of the model before deploying it in real-world scenarios.
Regression testing
Whenever you want to improve the model's performance in one aspect, regression testing helps ensure that the model's performance in other areas hasn't degraded. Using Datasets in Experiments is the best way to make sure that models outputs are still working as expected when releasing a new version.
Security testing
Jailbreak mitigation
Jailbreaking refers to the practice of bypassing the restrictions or safety protocols of a model to make it produce unauthorized or unintended responses.
Jailbreak mitigation involves strengthening the model to prevent such exploitative behavior, ensuring compliance with intended use cases.
Within Experiments, you can test how your model responds to jailbreak attempts in a safe and secure way.
Testing against known attacks and datasets with jailbreak prompts allows you to test what might happen before putting it into production.
Updated 4 months ago