You can now create your own custom LLM evaluator. This lets you go beyond standard evaluators such as BLEU and Valid JSON.

In the example below, we made an evaluator that compares the tone of voice of the generated text against the desired tone in the reference.
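As a rough illustration of what such an evaluator can do under the hood, the sketch below builds a grading prompt for a judge model and parses its reply into a score. The function names and the 1-5 scale are our own illustration, not the platform's API:

```typescript
// Hypothetical sketch of a custom LLM evaluator: build a grading prompt
// for a judge model and parse its reply. The function names and the
// 1-5 scale are illustrative, not the platform's API.
function buildToneEvalPrompt(generated: string, reference: string): string {
  return [
    'You are an evaluator. Compare the tone of voice of the generated',
    'text against the desired tone shown in the reference.',
    'Answer with a single score from 1 (completely different tone)',
    'to 5 (identical tone).',
    '',
    `Reference: ${reference}`,
    `Generated: ${generated}`,
  ].join('\n');
}

// Parse the judge model's reply into a numeric score.
function parseToneScore(reply: string): number {
  const match = reply.match(/[1-5]/);
  if (!match) throw new Error(`Unparseable evaluator reply: ${reply}`);
  return Number(match[0]);
}
```

The grading prompt is sent to a judge model of your choice; the parsed score can then be logged alongside the run.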

Read more about Experiments and Evaluators.

With this new feature, you can use the output of a large language model like GPT-4 as the reference for other models like Gemma-7b and Mistral Large (see image).

Most evaluators need a reference. This is because an eval like cosine similarity needs two texts to compare against each other: the newly generated text and the reference text.
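For intuition, here is a minimal cosine similarity over two embedding vectors, the kind of comparison such an evaluator performs. This is an illustrative sketch, not the platform's implementation:

```typescript
// Illustrative cosine similarity between two embedding vectors: the kind
// of comparison a similarity evaluator performs between the generated
// text's embedding and the reference's embedding.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have equal length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A score near 1 means the generated text's embedding points in nearly the same direction as the reference's; a score near 0 means they are unrelated.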

Example use case: a new model has been released that is faster and less expensive than the model you currently use. Although your current model performs well, you want to compare the new model's performance against it. To do so, select your current model (GPT-4) as the reference model in the configuration; it serves as the benchmark. When the experiment runs, the reference model completes first, and its output is then used as the reference for the other models. To measure how similar the newer models' output is to the reference model's, use an evaluator such as cosine similarity.

Check out the newly added models in the model garden.


The table above gives an overview of all the newly added models, categorized by provider.

The Llama models were previously only available through Anyscale, but now Azure provides them as well. This is great news for users who work solely with Azure-hosted models.

Perplexity models

We're excited to introduce Perplexity models to Orquesta. Perplexity is unique in a few different ways:

  • Freshness - Unlike most models, which can't access the internet, Pplx-7b-online and Pplx-70b-online can. This allows them to provide information that is up to date.
  • Hallucinations - To catch inaccurate statements, Perplexity's online models can be used to check whether LLM output matches the latest information online.

Mistral's new flagship model: Mistral Large

Some key strengths and reasons why you should try out Mistral Large:

  • Fluency and cultural awareness - It can fluently communicate in English, French, Spanish, German, and Italian. It possesses a sophisticated understanding of grammar and cultural context.
  • 32k context window - With its 32k tokens context window, it can accurately recall information from extensive documents.
  • JSON format and function calling - It has an inbuilt function calling feature and constrained output mode which makes it ideal for app development.
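As a sketch of JSON mode, the request body below asks Mistral Large for constrained JSON output via response_format. The body shape follows Mistral's chat completions API, but the HTTP call and API key handling are omitted, so treat it as an illustration only:

```typescript
// Sketch of a Mistral Large request using JSON mode. The body shape
// follows Mistral's chat completions API; the HTTP call and API key
// handling are omitted, so treat this as an illustration only.
const requestBody = {
  model: 'mistral-large-latest',
  response_format: { type: 'json_object' }, // constrain output to valid JSON
  messages: [
    {
      role: 'user',
      content:
        'Extract the city and country from: "Our office is in Amsterdam, the Netherlands." Reply as JSON.',
    },
  ],
};
```

With json_object set, the model is constrained to emit parseable JSON, which removes a whole class of output-parsing bugs in app code.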

Google's Gemma 7b model

Gemma might not be as 'good' as Gemini, but it stands out in other ways.

  • Open source - Developers now have access to a transparent model that allows for customization.
  • Speed and cost - Because Gemma has only 7 billion parameters to Gemini's 60 billion, it is much cheaper and faster. You can even run it on a laptop or other consumer devices where low latency is key.

Instead of manually adding your data sets or uploading them through a CSV file in experiments, you can now import them from the files you stored in Resources. This allows you to store and access your files in a quick and efficient manner.

Resource management

by Cormick Marskamp

Save your data sets, variables, and evaluators in the newly added resources tab.

With this new feature, you don't have to constantly re-upload your resources anymore.

Example: After fine-tuning your prompt, simply save it in the Resources tab so you can easily access it during your next workflow.

The resources are divided into 3 categories:

Data sets - This is where you can store all your prompts and references. Read more about Data sets.

Variables - The dynamic elements you put in between {{curly_brackets}} are listed here. Read more about how to use Variables.

Evaluators - The standard and custom evals can be stored here. Read more about why you should use evals and when to use which one in Evaluators.
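To illustrate how such variables work, here is a minimal templating sketch that fills {{curly_brackets}} placeholders from a map of values. It is our own illustration, not the platform's actual templating engine:

```typescript
// Minimal illustration of how {{curly_bracket}} variables are filled in;
// not the platform's actual templating engine. Unknown variables are
// left untouched so missing values are easy to spot.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? vars[name] : match,
  );
}
```

For example, rendering 'Hello {{firstname}} from {{city}}' with { firstname: 'John', city: 'New York' } yields 'Hello John from New York'.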


by Cormick Marskamp

We have added Evaluators to our platform. With a wide range of industry-standard metrics and other relevant evaluators, you can check whether or not the output of your LLM is accurate, reliable, and contextually relevant.

Read more about each evaluator in Evaluators.

In our latest SDK update, we're thrilled to share a series of enhancements that significantly boost the performance and capabilities of our platform.

Support for messages in the invoke and stream functionalities

You can now provide a new messages property in the invoke and stream methods of both SDKs to combine your own messages with the prompt configured in Orquesta.

const deployment = await client.deployments.invoke({
  key: 'customer_service',
  messages: [
    {
      role: 'user',
      content:
        'A customer is asking about the latest software update features. Generate a detailed and informative response highlighting the key new features and improvements in the latest update.',
    },
  ],
  context: { environments: 'production', country: 'NLD' },
  inputs: { firstname: 'John', city: 'New York' },
  metadata: { customer_id: 'Qwtqwty90281' },
});

The introduction of the new property significantly enhances various aspects of interaction:

  1. Enhanced Contextual Clarity: The use of chat history empowers the model to preserve the context throughout a conversation. This feature ensures that each response is not only coherent but also directly relevant, as the model can draw upon past dialogues to fully grasp the nuances of the current inquiry or discussion topic.
  2. Streamlined Conversation Flow: Chat history is instrumental in maintaining a consistent and logical flow in conversations. This prevents the occurrence of repetitive or conflicting responses, mirroring the natural progression of human dialogues and maintaining conversational integrity.
  3. Tailored User Interactions: With access to previous interactions, chat history allows the model to customize its responses according to individual user preferences and historical queries. This level of personalization significantly boosts user engagement and satisfaction, leading to more effective and enjoyable communication experiences.
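For example, a multi-turn history passed via the messages property might look like this (the conversation content is illustrative):

```typescript
// Illustrative multi-turn history passed via the messages property.
// Because the earlier turns are included, the model can resolve "it"
// in the final question to dark mode.
const chatHistory = [
  { role: 'user', content: 'Does the latest update include dark mode?' },
  { role: 'assistant', content: 'Yes, dark mode ships in the latest update.' },
  { role: 'user', content: 'How do I enable it?' },
];
```

Without the first two turns, the final question would be ambiguous; with them, the model has the context to answer directly.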

Support for messages and choices to add metrics

In Orquesta, you can add custom metrics after every request. In addition to the metrics we already support, we added two new properties: messages and choices. The messages property logs the communication exchange between the user and the model, capturing the entire conversation history for context and analysis. The choices property records the different response options that the model considers or generates before presenting the final output, providing insight into the model's decision-making process.

  messages: [
    {
      role: 'user',
      content:
        'A customer is asking about the latest software update features. Generate a detailed and informative response highlighting the key new features and improvements in the latest update.',
    },
  ],
  choices: [
    {
      index: 0,
      finish_reason: 'stop',
      message: {
        role: 'assistant',
        content:
          "Dear customer: Thank you for your interest in our latest software update! We're excited to share with you the new features and improvements we've rolled out. Here's what you can look forward to in this update",
      },
    },
  ],

Support for OpenAI system fingerprint

To help track changes to its models or serving environment, OpenAI exposes a system_fingerprint parameter. If this value changes between requests, outputs may differ even for identical inputs, because the OpenAI backend configuration has changed.

In the new version of the SDK, if you are using an OpenAI model and the invoke method, the system_fingerprint will be exposed in the deployment properties.
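A small helper like the sketch below can flag when the fingerprint changes between runs; the function name and comparison logic are our own illustration, not part of the SDK:

```typescript
// Hypothetical helper that flags when the system_fingerprint returned by
// an OpenAI model changes between runs; a change signals that backend
// updates may produce different outputs for identical requests.
function fingerprintChanged(previous: string | null, current: string): boolean {
  return previous !== null && previous !== current;
}
```

You could store the fingerprint from each deployment's properties and compare it on the next invocation to detect when outputs may have drifted.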

You could already select models such as Llama from Meta and Mixtral from Mistral in the model garden. With this release, you can now connect your own API key for Anyscale, so you can use your own account and rate limits without relying on a shared key. Soon you'll also be able to use your own private models and fine-tuning on Anyscale.

Mass Experimentation

by Cormick Marskamp

You could already test out different models and configurations using our playground. However, with the introduction of Experiments, you are able to do this on a much larger scale.

Simply import your prompts, expected outputs, and variables, configure your model, and you're able to do hundreds of simultaneous runs. This allows you to do a side-by-side comparison in large batches.

Check out the interactive walkthrough.