improved

Experiments V2 - Evaluate your LLM Config

We’re excited to introduce Experiments V2, a major upgrade to our Experiments module that makes testing, evaluating, and benchmarking models and prompts more intuitive and flexible than ever before.

added

DeepSeek Models Now Available: 67B Chat, R1, and V3

We are excited to announce the integration of DeepSeek’s latest AI models—67B Chat, R1, and V3—into our platform.
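
For reference, here is a minimal sketch of reaching the same models through DeepSeek's own OpenAI-compatible API; inside Orq you simply select them in your Deployment or Experiment. The base URL and model IDs (deepseek-chat for V3, deepseek-reasoner for R1) are assumptions based on DeepSeek's public naming and may differ from the labels shown in Orq.

```python
# Minimal sketch (not Orq-specific): DeepSeek exposes an OpenAI-compatible API,
# so the standard openai client works with a different base_url.
# Assumed model IDs from DeepSeek's public naming: "deepseek-chat" (V3), "deepseek-reasoner" (R1).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # provider key for direct access, not an Orq key
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",            # switch to "deepseek-chat" for V3
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation in two sentences."}],
)
print(response.choices[0].message.content)
```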

added

OpenAI’s Latest Small Reasoning Model – o3-mini

Start using OpenAI’s newest and most advanced ‘small’ reasoning model: o3-mini.
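
For reference, a minimal sketch of calling o3-mini directly with the openai Python SDK; inside Orq you select the model in your Deployment or Experiment instead. The reasoning_effort setting shown here is an OpenAI API parameter, and the value chosen is just an example.

```python
# Minimal sketch (not Orq-specific): calling o3-mini via the openai SDK.
# o3-mini accepts a reasoning_effort setting ("low" / "medium" / "high") that trades
# latency and cost against reasoning depth.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"}],
)
print(response.choices[0].message.content)
```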

added

Llama 3.3 70B & Llama Guard 3 are now available through Together AI

Experience the power of the latest Llama 3.3 70B and Llama Guard 3 models on Orq, integrated via Together AI.
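
For reference, a minimal sketch of reaching the same two models through Together AI's Python SDK, with Llama Guard 3 used as a safety pre-check in front of Llama 3.3 70B; inside Orq you select both from the model picker. The catalog model IDs in the sketch are assumptions and may differ from the labels shown in Orq.

```python
# Minimal sketch (not Orq-specific): the Together AI Python SDK mirrors the OpenAI
# chat-completions interface. Model IDs below are assumed catalog names.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

question = "How do I reset my account password?"

# 1) Screen the user input with Llama Guard 3 (a safety classifier that replies "safe" or "unsafe").
guard = client.chat.completions.create(
    model="meta-llama/Meta-Llama-Guard-3-8B",          # assumed catalog ID
    messages=[{"role": "user", "content": question}],
)

# 2) If the input is judged safe, answer it with Llama 3.3 70B Instruct.
if guard.choices[0].message.content.strip().lower().startswith("safe"):
    answer = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # assumed catalog ID
        messages=[{"role": "user", "content": question}],
    )
    print(answer.choices[0].message.content)
```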

improved

New Layout with Project Structure

We’re introducing a new project structure UI to help you organize and manage your resources more effectively. With projects, you can group your work by use case, environment, or any logical structure that suits your needs.

added

Online Guardrails in Live Deployments

Once you have added Guardrails to your Library, you can configure them directly in Deployments > Settings for both input and output, giving you full control over Deployment responses.
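
Conceptually, an input guardrail screens what goes into a Deployment and an output guardrail screens what comes back before it reaches the caller. The sketch below only illustrates that flow; the function names and checks are made up for the example, since in Orq the guardrails are configured in the UI rather than written in code.

```python
# Conceptual sketch of what input/output guardrails do around a deployment call.
# All names here are illustrative, not part of Orq's API.
from typing import Callable

Guardrail = Callable[[str], bool]  # returns True if the text passes the check

def no_email_addresses(text: str) -> bool:
    """Toy input guardrail: block obvious e-mail addresses."""
    return "@" not in text

def within_length(text: str) -> bool:
    """Toy output guardrail: keep responses under 2,000 characters."""
    return len(text) <= 2000

def guarded_call(prompt: str, model_call: Callable[[str], str],
                 input_guards: list[Guardrail], output_guards: list[Guardrail]) -> str:
    if not all(check(prompt) for check in input_guards):
        return "Request blocked by input guardrail."
    response = model_call(prompt)
    if not all(check(response) for check in output_guards):
        return "Response withheld by output guardrail."
    return response

# Example: plug in any model call
print(guarded_call("What's our refund policy?",
                   lambda p: "Refunds are processed within 5 business days.",
                   [no_email_addresses], [within_length]))
```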

added

HTTP and JSON Evaluators and Guardrails

You can now create HTTP and JSON Evaluators and Guardrails under the Evaluator tab and attach them to your Deployments or Experiments.
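
The two types work differently: a JSON Evaluator checks the model output against structural expectations, while an HTTP Evaluator calls an endpoint you host and uses the score it returns. The sketch below illustrates both ideas in Python; the payload and response shapes are assumptions made for illustration, not Orq's actual webhook contract.

```python
# Conceptual sketch of the two evaluator types; payload and response shapes are
# assumptions for illustration, not Orq's actual webhook contract.
import json
from flask import Flask, request, jsonify

# JSON evaluator idea: pass only if the model output parses as valid JSON
# (optionally also checking for required keys).
def json_evaluator(model_output: str, required_keys: tuple[str, ...] = ()) -> bool:
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return all(key in parsed for key in required_keys)

# HTTP evaluator idea: the platform calls an endpoint you host and uses the returned score.
app = Flask(__name__)

@app.post("/evaluate")
def evaluate():
    payload = request.get_json()          # assumed shape, e.g. {"input": "...", "output": "..."}
    score = 1.0 if "sorry" not in payload["output"].lower() else 0.0
    return jsonify({"score": score})      # assumed response shape
```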

added

Master Your RAG with RAGAS Evals

The Ragas Evaluators are now available, providing specialized tools to evaluate retrieval-augmented generation (RAG) workflows. These evaluators make it easy to set up quality checks when integrating a Knowledge Base into a RAG system and can be used in Experiments and Deployments to ensure responses are accurate, relevant, and safe.
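
For a sense of what these checks measure, here is a minimal sketch using the open-source ragas package that the metrics come from; within Orq the equivalent evaluators are configured in the UI instead. Column names follow ragas 0.1.x and differ in newer versions, and the metrics themselves call an LLM under the hood, so an API key is required.

```python
# Minimal sketch using the open-source ragas package (column names per ragas 0.1.x).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

samples = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Customers can request a refund within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of the purchase date."]],
})

result = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores, e.g. faithfulness and answer relevancy between 0 and 1
```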

added

Evaluator Library: 50+ Ready-to-Use and Tailorable Evaluators

Introducing the new Evaluator Library: more than 50 ready-to-use evaluators that you can apply as-is or tailor to your own use case.

improved

Improved LLM-as-a-Judge

We've significantly improved our existing LLM Evaluator feature to provide more robust evaluation capabilities and enforce type-safe outputs.
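
As an illustration of what type-safe judge output means in practice, the generic sketch below (not Orq's implementation) forces the judge's verdict into a fixed schema via the openai SDK's structured-output parsing, so downstream code receives typed fields instead of free-form text. The model name and schema are assumptions chosen for the example.

```python
# Generic sketch of a type-safe LLM judge, not Orq's implementation:
# the verdict is constrained to a fixed schema instead of free-form text.
from pydantic import BaseModel
from openai import OpenAI

class Verdict(BaseModel):
    score: int        # 1 (unusable) to 5 (excellent)
    passed: bool
    reasoning: str

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # assumed judge model for the example
    messages=[
        {"role": "system", "content": "You grade answers for factual accuracy."},
        {"role": "user", "content": "Question: What is the capital of France?\nAnswer: Lyon"},
    ],
    response_format=Verdict,
)

verdict = completion.choices[0].message.parsed  # a Verdict instance, not raw text
print(verdict.score, verdict.passed, verdict.reasoning)
```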