Use an LLM as a Reference in Experiments

With this new feature, you can use the output of a large language model such as GPT-4 as the reference for other models such as Gemma-7b and Mistral-large (see image).

Most evaluators need a reference. An evaluator such as cosine similarity compares two things: the newly generated text and the reference text.
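To make the role of the reference concrete, here is a minimal sketch of a cosine similarity evaluator. It is an illustration only, not the product's implementation: it assumes a simple word-count representation of each text, whereas production evaluators commonly compare embedding vectors. The function name `cosine_similarity` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    # Assumption: lowercase word counts stand in for real embeddings.
    a = Counter(re.findall(r"\w+", generated.lower()))
    b = Counter(re.findall(r"\w+", reference.lower()))

    # Dot product over the words that both texts share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))

    # An empty text has no direction, so treat it as completely dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts: the reference output and a newly generated output.
reference_text = "Paris is the capital of France."
generated_text = "The capital city of France is Paris."
score = cosine_similarity(generated_text, reference_text)
print(f"similarity: {score:.3f}")
```

Without the reference text there is nothing to score against, which is why the evaluator takes both inputs. A score close to 1 means the two texts use nearly the same words; a score of 0 means they share none.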

Example use case: A new model has been released that is faster and less expensive than the one you currently use. Your current model performs well, but you want to see how the new model compares. To compare the two, select your current model (GPT-4) as the reference model in the configuration, so that it serves as the benchmark for the new model's performance. When the experiment runs, the reference model is run first, and its output is then used as the reference for the other models. To measure how similar the newer models' output is to the reference model's output, you can use an evaluator such as cosine similarity.
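The order of operations in this use case can be sketched as follows. This is a hedged illustration under stated assumptions, not the product's API: `run_experiment`, `word_cosine`, and the lambda "models" are hypothetical stand-ins, and the canned outputs replace real model calls.

```python
import math
from collections import Counter
from typing import Callable, Dict

# Hypothetical stand-in for a model call: takes a prompt, returns a completion.
Model = Callable[[str], str]


def word_cosine(generated: str, reference: str) -> float:
    """Word-count cosine similarity (assumed stand-in for the real evaluator)."""
    a, b = Counter(generated.lower().split()), Counter(reference.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def run_experiment(
    prompt: str,
    models: Dict[str, Model],
    reference_name: str,
    evaluator: Callable[[str, str], float],
) -> Dict[str, float]:
    # Step 1: run the reference model first to produce the reference text.
    reference_output = models[reference_name](prompt)

    # Step 2: run every other model and score its output against the reference.
    scores = {}
    for name, model in models.items():
        if name == reference_name:
            continue
        scores[name] = evaluator(model(prompt), reference_output)
    return scores


# Canned outputs in place of real model calls (illustrative only).
models = {
    "gpt-4": lambda prompt: "Paris is the capital of France",
    "gemma-7b": lambda prompt: "The capital of France is Paris",
    "mistral-large": lambda prompt: "France has many large cities",
}
scores = run_experiment("What is the capital of France?", models, "gpt-4", word_cosine)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The reference model itself receives no score, since it defines the benchmark. Each remaining model gets one similarity value that shows how closely its output tracks the reference output.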