August 26th, 2025

Introducing Evaluatorq

Today, we are proud to introduce EvaluatorqAn evaluation framework library that provides a flexible way to run parallel evaluations and optionally integrates with the Orq AI platform.Features

Parallel Execution: Run multiple evaluation jobs concurrently with progress tracking
Flexible Data Sources: Support for inline data, promises, and Orq platform datasets
Type-safe: Fully written in TypeScript
Orq Platform Integration: Seamlessly fetch and evaluate datasets from Orq AI (optional)

Installationnpm install @orq-ai/evaluatorqBasic Usage

Typescript

import { evaluatorq, job } from "@orq-ai/evaluatorq";

const textAnalyzer = job("text-analyzer", async (data) => {
  const text = data.inputs.text;
  const analysis = {
    length: text.length,
    wordCount: text.split(" ").length,
    uppercase: text.toUpperCase(),
  };

  return analysis;
});

await evaluatorq("text-analysis", {
  data: [
    { inputs: { text: "Hello world" } },
    { inputs: { text: "Testing evaluation" } },
  ],
  jobs: [textAnalyzer],
  evaluators: [
    {
      name: "length-check",
      scorer: async ({ output }) => {
        const passesCheck = output.length > 10;
        return {
          value: passesCheck ? 1 : 0,
          explanation: passesCheck
            ? "Output length is sufficient"
            : `Output too short (${output.length} chars, need >10)`,
        };
      },
    },
  ],
});

Repository: https://github.com/orq-ai/orqkit

August 20th, 2025

Native support for Gemini models - URL Context Tool

The new URL context tool enables models to access content directly from provided URLs, enhancing responses with external information for tasks like extracting data, comparing documents, synthesizing content, and analyzing code or documentation.Key Features

Two-step retrieval: internal index cache first, live fetch if unavailable.
Returns url_context_metadata with retrieval details and safety checks.
Supports up to 20 URLs per request, max 34MB per URL.
Retrieved content counts toward input tokens (tracked in usage_metadata).

Supported Models

gemini-2.5-pro
gemini-2.5-flash
gemini-2.5-flash-lite

Best practices

Use direct, specific, fully qualified URLs (https://…).
Avoid login-restricted or paywalled content.
Combine with Google Search grounding for broader workflows.

August 18th, 2025

OpenTelemetry Tracing Support

Key Features

Native OpenTelemetry (OTel) export for traces.
Send spans directly to our OTLP/HTTP collector at: https://api.orq.ai/v2/otel.
Works with standard OTel SDKs and exporters (Python, Node.js).
Supports custom headers (e.g., bearer tokens) and resource attributes (e.g., service.name).

Quick StartEnvironment based

# Common env vars (both Python & Node.js)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.orq.ai/v2/otel"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <ORQ_API_KEY>"
export OTEL_RESOURCE_ATTRIBUTES="service.name=my-service,service.version=1.0.0"

Python

# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http opentelemetry-instrumentation

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "my-python-service",
    "service.version": "1.0.0"
})

provider = TracerProvider(resource=resource)
trace.set_tracer_provider(provider)

exporter = OTLPSpanExporter(
    endpoint="https://api.orq.ai/v2/otel",
    # Optional if your setup requires auth
    headers={"Authorization": "Bearer <ORQ_API_KEY>"}
)

provider.add_span_processor(BatchSpanProcessor(exporter))

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("demo-span"):
    # your app logic here
    pass

NodeJS

TypeScript

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

const traceExporter = new OTLPTraceExporter({
  url: 'https://api.orq.ai/v2/otel',
  // Optional if your setup requires auth
  headers: { Authorization: 'Bearer <ORQ_API_KEY>' },
});

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'my-node-service',
    'service.version': '1.0.0',
  }),
  traceExporter,
});

sdk.start();

// Example span
import { context, trace } from '@opentelemetry/api';
const tracer = trace.getTracer('example');

const span = tracer.startSpan('demo-span');
try {
  // your app logic here
} finally {
  span.end();
}

August 18th, 2025

Knowledge API: New Parameters

New search_type field with options:

vector_search: Semantic similarity search
keyword_search: Exact text matching
hybrid_search: Combined approach (default)
Separated Top-K Controls:
- top_k: Controls initial document retrieval (1-100)
- rerank_config.top_k: Controls final result count after reranking (1-100)
New agentic_rag_config parameter to customize AI-driven search behavior per request

Benefits

Flexible Search Strategy: Choose optimal search method per query
Fine-tuned Relevance: Retrieve broadly, return precisely with independent top-k settings
Performance Optimization: Better control over retrieval vs. quality trade-offs

Impact

Users can now choose specific search strategies instead of relying on knowledge base defaults
Better control over the retrieval-rerank pipeline allows for more relevant results by retrieving more documents initially but returning fewer, higher-quality results

You can find more info here reference/searchknowledge

August 18th, 2025

Support for OpenAI’s image edit and variation APIs

Image Edit and Image Variation are now supported in our AI Proxy. This comes with complete trace monitoring and cost tracking.Also, all the images are stored in our Orq with the same expirations as the retention period of your workspace.

cURL

curl --location 'https://api.orq.ai/v2/proxy/images/edits' \
--header 'Authorization: Bearer ORQ_API_KEY' \
--form 'model="openai/gpt-image-1"' \
--form 'image=@"/path/to/file"' \
--form 'image=@"/path/to/file"' \
--form 'prompt="Add the style of card with the gradient style of the first image"'

Python

import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.orq.ai/v2/proxy/image/edit", api_key="ORQ_API_KEY")

prompt = """
Generate a photorealistic image of a gift basket on a white background
labeled 'Relax & Unwind' with a ribbon and handwriting-like font,
containing all the items in the reference pictures.
"""

result = client.images.edit(
    model="gpt-image-1",
    image=[
        open("body-lotion.png", "rb"),
        open("bath-bomb.png", "rb"),
        open("incense-kit.png", "rb"),
        open("soap.png", "rb"),
    ],
    prompt=prompt,
)

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("gift-basket.png", "wb") as f:
    f.write(image_bytes)

Typescript

import fs from "fs";
import OpenAI, { toFile } from "openai";

const client = new OpenAI({
  baseURL: "https://api.orq.ai/v2/proxy/image/edit",
  apiKey: process.env.ORQ_API_KEY
});

const imageFiles = [
    "bath-bomb.png",
    "body-lotion.png",
    "incense-kit.png",
    "soap.png",
];

const images = await Promise.all(
    imageFiles.map(async (file) =>
        await toFile(fs.createReadStream(file), null, {
            type: "image/png",
        })
    ),
);

const rsp = await client.images.edit({
    model: "gpt-image-1",
    image: images,
    prompt: "Create a lovely gift basket with these four items in it",
});

// Save the image to a file
const image_base64 = rsp.data[0].b64_json;
const image_bytes = Buffer.from(image_base64, "base64");
fs.writeFileSync("basket.png", image_bytes);

August 18th, 2025

File Uploads for Anthropic Models

PDF support is currently supported via direct API access and Google Vertex AI on:

Claude Opus 4.1 (claude-opus-4-1-20250805)
Claude Opus 4 (claude-opus-4-20250514)
Claude Sonnet 4 (claude-sonnet-4-20250514)
Claude Sonnet 3.7 (claude-3-7-sonnet-20250219)
Claude Haiku 3.5 (claude-3-5-haiku-20241022)

You can find more information here docs/attaching-files

August 18th, 2025

Comprehensive Budget Controls & Cost Management

Take full control of your AI spending with granular budget management across your entire workspace. Set limits, track consumption, and prevent cost overruns with real-time enforcement.

Configure budgets with daily, weekly, monthly, or yearly reset cycles
Automatic period boundary calculation with UTC timezone standardization
Period-aligned cache TTL for optimal performance
New Budgets API to create budgets programatically

Use Cases:

Prevent individual users from exceeding allocated AI budgets
Set department-level spending limits for enterprise teams
Control costs across multiple AI models and services

To learn more about using Contacts, see Creating a Contact.

August 8th, 2025

GPT-5 available in the Model Garden

Today we’re introducing GPT‑5⁠, OpenAI’s smartest, fastest, most useful model yet in the Model Garden.

July 10th, 2025

Preview Model Support

We will no longer support experimental and preview models by default in the model garden.Going forward, access to preview or experimental models will only be granted upon request. Many of these models are unreliable in production, they regularly fail or become unavailable without notice, which leads to a poor experience and extra maintenance overhead.If you have a use case that specifically requires an experimental or preview model, please reach out to support@orq.ai

July 9th, 2025

Three new RAGAS Evaluators for enhanced RAG Experiments

We’ve added three new RAGAS evaluators to the Orq.ai platform, giving you deeper insights into your retrieval-augmented generation (RAG) workflows. These evaluators help you understand not just if the right information was retrieved, but how well your system handles noisy data and captures key context.Here’s what’s new:

1. Context Recall

Measure how much of the relevant reference information your retrieval pipeline actually brings into context. This evaluator compares the retrieved text to the reference for each user query, helping you identify if important facts are missing. Use context recall to improve your retrieval strategy and ensure your LLM responses are always well-supported.Entities used:

Reference (ground truth)
Retrieved text

When to use:

When you want to know if your retrieval step is actually surfacing all necessary information for the LLM to answer correctly.

Learn more

2. Noise Sensitivity

Understand how your RAG system performs when irrelevant or “noisy” information is mixed into the retrieved context. This evaluator tests the robustness of your model by introducing noise and measuring the impact on your outputs. It’s a great way to benchmark reliability in real-world scenarios where retrieval isn’t always perfect.Entities used:

User message
Reference (ground truth)
Retrieved text (with added noise)

When to use:

When you want to see if your LLM can still provide accurate answers, even when extra, unrelated information appears in the context.

Learn more

3. Context Entities Recall

Check whether the most important entities (names, places, organizations, etc.) present in your reference are also present in the retrieved context. This evaluator goes beyond text overlap and focuses on the coverage of key information units, helping you catch subtle gaps in your retrieval pipeline.Entities used:

Reference (ground truth)
Retrieved text

When to use:

When you want to be sure your retrieval is capturing and surfacing all the essential entities needed to answer user queries correctly.

Learn more

Use these new RAGAS evaluators to systematically assess and improve the quality, completeness, and reliability of your RAG pipelines right from within Orq.ai.

July 9th, 2025

Flexbile Prompt Configuration for Experiments

Configuring Experiments just got more flexible.Previously, when setting up an Experiment, you could only use the prompt from your dataset (the message column), or you had to set up your prompt manually in the prompt template. With this update, you can now combine both approaches, giving you far more control over your Experiments.This is especially useful if your application logs real user input from your AI features in production. Those logs often contain valuable user messages you want to test against different system prompts. With this update, you can easily combine actual user messages from your dataset with new or imported system prompts directly in your prompt template configuration. This allows you to experiment and see which prompt produces the best response, without needing to restructure your data.This improved flexibility makes it much easier to:

Systematically validate regression, improvements, and guardrail compliance
Compare outputs for cost, latency, and consistency
Run more realistic and effective experiments based on actual user input

Start combining user messages and system prompts in your Experiments to get more actionable results and deeper insights.

July 4th, 2025

Chunking API

The Chunking API is now live - the final building block for fully programmatic knowledge base workflows. With this new API, you can automatically split (“chunk”) your text data in the way that best fits your needs, all without leaving your workflow.The Chunking API supports six different chunking strategies, each suited to a different kind of content:

Token chunker: Splits your text based on token count. Great for keeping chunks small enough for LLMs and for consistent embedding sizes.
Sentence chunker: Breaks your text at sentence boundaries, so each chunk stays readable and sentences remain intact.
Recursive chunker: Chunks text by working down a hierarchy (paragraphs, then sentences, then words) to maintain document structure.
Semantic chunker: Groups together sentences that are topically related, so each chunk makes sense on its own.
SDPM chunker: Uses advanced skip-gram patterns to find natural split points, especially helpful for technical or structured documents.
Agentic chunker: Uses an LLM to determine the best split points, ideal for complex documents that need intelligent segmentation.

With this release, our RAG-as-a-service stack is complete. You can create knowledge bases, upload and chunk your data (or send pre-chunked data), search and filter content with chunk-level metadata, turn on agentic RAG, and apply reranker models. The knowledge base tester in the UI lets you experiment with search settings, chunk sizes, and thresholds to optimize retrieval for your use case.

📘 Try the Chunking API out for yourself here: Chunking

June 19th, 2025

Evaluators API

We’ve expanded evaluator support with two new APIs, enabling much more flexible, code-driven workflows:

Evaluators Hub API: Run any evaluator on demand via API, without needing to attach it to an experiment or deployment. Ideal for integrating evaluator runs into your own pipelines or automated tools.
Evaluators API: Create, retrieve, update, and delete evaluators directly via API. This allows you to fully manage your evaluators programmatically, instead of only through the UI.

You can use both APIs together to build and manage evaluators entirely through code, combining on-demand runs with dynamic evaluator management. This enables teams to automate their evaluation workflows, integrate with CI/CD, and adapt evaluators to changing requirements—all without leaving their own codebase.Key benefits:

Decoupled Evaluation: Run and manage evaluators without relying on experiments or deployments.
Full API Control: Create, update, and remove evaluators as your evaluation needs evolve.
Pipeline Integration: Integrate both evaluator management and execution into your own data pipelines or CI/CD workflows.
Faster Iteration: Quickly test, refine, and iterate on evaluators using code.
Automation: Scale evaluations and management as needed for batch operations or continuous quality monitoring.

See an example on how the new Evaluator API works with RAGAS faithfulness below

curl --request POST \
     --url https://api.orq.ai/v2/evaluators/ragas_faithfulness \
     --header 'accept: application/json' \
     --header 'authorization: Bearer $ORQ_API_KEY' \
     --header 'content-type: application/json' \
     --data '
{
  "model": "openai/gpt-4.1",
  "query": "How large can my carry on bag be?",
  "output": "Your carry-on bag must not exceed 22 x 14 x 9 inches (56 x 36 x 23 cm), including handles and wheels. There is no weight limit for carry-on bags, but you must be able to lift it into the overhead bin without assistance.",
  "retrievals": [
    "{ \"question\": \"What are the size and weight limits for carry-on bags?\", \"answer\": \"Your carry-on bag must not exceed 22 x 14 x 9 inches (56 x 36 x 23 cm), including handles and wheels. There is no weight limit for carry-on bags, but you must be able to lift it into the overhead bin without assistance.\" }",
    "{ \"question\": \"Are there bag or seasonal bag limitations?\", \"answer\": \"Learn about checked bag limitations on our website.\" }",
    "{ \"question\": \"What are your liability limitations for lost bags?\", \"answer\": \"Learn about our liability limitations and see items excluded from liability on our website.\" }",
    "{ \"question\": \"What should I do about delayed or damaged bags?\", \"answer\": \"Find out what to do about delayed or damaged bags on our website.\" }"
  ]
}
'

June 19th, 2025

Connect LiteLLM models to Orq

You can now connect your LiteLLM-managed models directly to Orq.ai, making it even easier for teams already using LiteLLM to get started with Orq—without having to change your LLM gateway.Who is this for?
This integration is ideal for teams who have standardized on LiteLLM for unified API access, flexible model management, or cost control—but want to benefit from Orq’s full platform for everything beyond just model access.How to get started:

Go to the Integrations tab, click LiteLLM, and enter your base URL and API keys.
In Model Garden, click Add Model, select Import from LiteLLM, and choose your model(s).

Why use this?

No migration pain: Instantly connect your existing LiteLLM models, no rework needed.
Keep your infra, add new superpowers: Keep LiteLLM as your gateway while gaining everything Orq offers for GenAI ops.
Best of both worlds: For teams already invested in LiteLLM, this means you get prompt management, deployments, evaluation, and observability—on top of your existing setup.

Orq’s own gateway remains the best option for teams wanting a deeply integrated, managed experience. But for teams committed to LiteLLM, you can now take full advantage of Orq’s broader capabilities—no compromises.

June 19th, 2025

Self onboard Vertex AI models

You can now onboard your own VertexAI models, including fine-tuned and private models.Go to Model garden > Add model > Vertex AI

June 19th, 2025

Custom Human Review

Today we’re introducing Custom human review, a major update that gives teams full flexibility in collecting and structuring human feedback in Orq.ai.With this release, you’re no longer limited to predefined review types. Now, you can create your own feedback forms tailored to your workflow and use case. Choose from three flexible input types:

Categorical: Create button options with your own labels, such as good/bad or saved/deleted.
Range: Set up a custom scoring slider (for example, a scale from 0 to 100).
Open field: Allow users to enter free-form feedback, perfect for more detailed comments.

How to collect feedback

There are two ways to collect feedback:

Orq Studio: Custom human review templates can be used directly within the Orq.ai platform, where colleagues and domain experts review AI outputs and provide structured feedback.
API: Alternatively, developers can leverage the Orq API to expose these custom feedback forms in their own applications or products, allowing end users to submit feedback seamlessly.

Why this matters

Collecting high-quality, relevant human review is crucial for improving AI features. Teams can structure feedback in a way that makes sense for them, whether for quality control, prompt iteration, or continuous evaluation. All feedback is attached to the log, including prompts and outputs, making it easy for prompt engineers to refine configurations based on real-world usage.

How to create custom Human Review

June 19th, 2025

Evaluator playground

We’ve added a new way to quickly test your evaluators directly within the evaluator configuration screen.Previously, testing an evaluator meant attaching it to a deployment or running an experiment, often more work than needed for a quick check. With the new Test Your Evaluator playground, you can now try out your evaluator on the spot, using custom queries and responses.This makes it easier to iterate, validate, and fine-tune your evaluators, without leaving the configuration screen or disrupting your workflows.

June 2nd, 2025

Attach files directly to the model

With our latest update, you can now attach PDF files directly to Gemini and OpenAI models. Simply include the encoded file in your deployment invocation, and the model will natively process the PDF - no more manual file uploads or bloated prompts.Why this matters

Better File Understanding: Models can directly analyze the content and structure of your PDFs—including text, images, tables, and diagrams - improving extraction quality and structured outputs.
Cleaner Prompts: File data is no longer dumped into your system or user message, resulting in faster, less error-prone requests.

The file_data should be using the standard data URI scheme (data:content/type;base64). See an example API call below:

cURL

curl --request POST \
     --url https://api.orq.ai/v2/deployments/invoke \
     --header 'accept: application/json' \
     --header 'authorization: Bearer <your_orq_key>' \
     --header 'content-type: application/json' \
     --data '
{
  "key": "key",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "prompt"
        },
        {
          "type": "file",
          "file": {
						"filename" : "<filename>",
            "file_data": "data:application/pdf;base64,<base64-encoded-data>"
          }
        }
      ]
    }
  ]
}
'

Python

generation = client.deployments.invoke(
    key="deployment_key",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "prompt"
                },
                {
                    "type": "file",
                    "file": {
                        "file_data": "data:application/pdf;base64,<base64-encoded-data>",
                        "filename": "filename"
                    }
                }
            ]
        }
    ],
    metadata={
        "user_id": "123",
        "session_id": "456",
    }
)

Typescript

const generation = await client.deployments.invoke({
  key: 'deployment_key',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'prompt'
        },
        {
          type: 'file',
          file: {
            fileData: 'data:application/pdf;base64,<base64-encoded-data>',
            filename: 'filename.pdf'
          }
        }
      ]
    }
  ],
  metadata: {
    user_id: '123',
    session_id: '456'
  }
});

June 2nd, 2025

Evaluator runs without new generation

You can now evaluate existing LLM outputs without generating new responses. Previously, running an evaluator required creating a new output every time. With this update, you can retroactively score any response already stored in your dataset.Why this matters

Quality: Evaluate historical responses to ensure you’re assessing the true source of truth, not just new outputs.
Flexibility: Apply evaluations to both** single responses and full conversation chains**, adapting to your specific review needs.

How it works

Prepare your dataset including LLM outputs in the messages column.
Set up an experiment and select one or more evaluators - do not select a prompt.
The evaluator will analyze the responses already present in the “messages” column of your dataset.

June 2nd, 2025

Reasoning field now included in the API response

With the latest update to our reasoning models, API responses now include the reasoning and reasoning_signature fields. These additions provide more transparency into how the model arrives at its answers.See an example below:

{
    "id": "01JWQZ1JB6PSJ1F8KSWE5771GR",
    "created": "2025-06-02T09:19:59.308Z",
    "object": "vision",
    "model": "claude-sonnet-4-20250514",
    "provider": "anthropic",
    "is_final": true,
    "finalized": "2025-06-02T09:20:05.615Z",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "type": "function",
                        "id": "toolu_01A57N7TgYAqcFmiMY9Q7TT4",
                        "index": 0,
                        "function": {
                            "name": "AA_tool",
                            "arguments": "{\"question\":\"How large can my carry on bag be?\",\"answer\":\"Your carry-on bag must not exceed 22 x 14 x 9 inches (56 x 36 x 23 cm), including handles and wheels. There is no weight limit for carry-on bags, but you must be able to lift it into the overhead bin without assistance.\"}"
                        }
                    }
                ],
                "reasoning": "The user is asking about carry-on bag size limits for American Airlines. Let me check the knowledge base for relevant information.\n\nLooking at the knowledge base, I can see there's a question that directly addresses this: \"What are the size and weight limits for carry-on bags?\" with the answer: \"Your carry-on bag must not exceed 22 x 14 x 9 inches (56 x 36 x 23 cm), including handles and wheels. There is no weight limit for carry-on bags, but you must be able to lift it into the overhead bin without assistance.\"\n\nThis question is clearly related to American Airlines travel and I have the relevant information in the knowledge base to answer it. I should use the AA_tool to structure my response.",
                "reasoning_signature": "Euxxx",
                "type": "tool_calls"
            },
            "finish_reason": "tool_use"
        }
    ],
    "provider_response": {
        "id": "msg_01VDbNW9N2khGgx16x5texpt",
        "type": "message",
        "role": "assistant",
        "model": "claude-sonnet-4-20250514",
        "content": [
            {
                "type": "thinking",
                "thinking": "The user is asking about carry-on bag size limits for American Airlines. Let me check the knowledge base for relevant information.\n\nLooking at the knowledge base, I can see there's a question that directly addresses this: \"What are the size and weight limits for carry-on bags?\" with the answer: \"Your carry-on bag must not exceed 22 x 14 x 9 inches (56 x 36 x 23 cm), including handles and wheels. There is no weight limit for carry-on bags, but you must be able to lift it into the overhead bin without assistance.\"\n\nThis question is clearly related to American Airlines travel and I have the relevant information in the knowledge base to answer it. I should use the AA_tool to structure my response.",
                "signature": "Euxxx"
            },
            {
                "type": "tool_use",
                "id": "toolu_01A57N7TgYAqcFmiMY9Q7TT4",
                "name": "AA_tool",
                "input": {
                    "question": "How large can my carry on bag be?",
                    "answer": "Your carry-on bag must not exceed 22 x 14 x 9 inches (56 x 36 x 23 cm), including handles and wheels. There is no weight limit for carry-on bags, but you must be able to lift it into the overhead bin without assistance."
                }
            }
        ],
        "stop_reason": "tool_use",
        "stop_sequence": null,
        "usage": {
            "input_tokens": 984,
            "cache_creation_input_tokens": 0,
            "cache_read_input_tokens": 0,
            "output_tokens": 312,
            "service_tier": "standard"
        }
    }
}

May 30th, 2025

Experiment Runs overview

You can now compare different Experiment runs in a single overview using the new Runs tab.Previously, reviewing the results of multiple experiment runs required clicking into each run individually and manually comparing scores. This made it harder to spot improvements or regressions at a glance.With the new Runs overview, you get a clear, side-by-side comparison of all your experiment runs, including all your evaluator scores. This makes it much easier to see if recent changes to your prompt or model setup have led to measurable improvements, or unexpected regressions.Example:
In the screenshot below, you can see that both Experiment Run #1 and #2 achieved a perfect score on the JSON Schema Evaluator (100), while Cosine Similarity improved from 0.65 to 0.69.No more clicking back and forth between runs, get instant insight into your progress in one place.

May 30th, 2025

Improved Experiment configuration panel

Configuring your experiments just got a lot easier. The new side panel gives you a clear and organized overview of everything in your experiment, making setup and adjustments faster and more intuitive.With the improved configuration panel, you can instantly see:
• Which prompts, models, and evaluators are included in your experiment
• The status and details of each prompt and evaluator
• A simple way to add or edit prompts and evaluators directly from the side panelThis new layout means you no longer have to dig through multiple screens or remember which components are active. Everything you need is visible at a glance, helping you stay organized and making it easier to experiment with different configurations.

May 28th, 2025

Metadata filter on Knowledge Base Chunks

Knowledge Base chunks can now be enriched with metadata during creation via the Chunks API. When searching your Knowledge Base, you can filter results by metadata to target only the most relevant chunks—improving both accuracy and efficiency in retrieval.Highlights:

Add custom metadata to each Knowledge Base chunk at creation.
Search using metadata filters to refine results.

Why this matters

As your Knowledge Base grows, searching through every chunk for every query can slow things down and introduce noise in responses. By tagging each chunk with metadata—such as document type, source, department, date, or custom tags—you gain precise control over what information gets retrieved. This ensures users and AI models only access the most relevant information, leading to faster, higher-quality answers and improved user experience.

Example

If your company stores support articles in the Knowledge Base for multiple products, you can now instantly filter retrieval to only include articles for a specific product, version, or language—removing irrelevant matches and streamlining customer support.

cURL

curl --location 'http://my.orq.ai/v2/knowledge/{knowledge_id}/search' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $ORQ_API_KEY' \
--data '{
    "query": "What we the top editions of science fiction books",
    "filter": {
        "edition": {
            "gte": 2020
        }
    },
    "search_options":{
        "include_metadata": true,
        "include_vectors": true,
        "include_scores": true
    }
}'

Python

from orq_ai_sdk import Orq
import os

client = Orq(api_key=os.getenv("ORQ_API_KEY"))

client.knowledge.search(
    knowledge_id="unique_knowledge_id",
    query={"edition": {"$gte": 2020}},
    search_options={
        "include_metadata": True,
        "include_vectors": True,
        "include_scores": True,
    },
)

Typescript

import { Orq } from '@orq-ai/node';

const orq = new Orq({
  apiKey: 'ORQ_API_KEY',
});

orq.knowledge.search({
  knowledgeId: 'unique_knowledge_id',
  query: { edition: { $gte: 2020 } },
  searchOptions: {
    includeMetadata: true,
    includeVectors: true,
    includeScores: true,
  },
});

📘 To learn more about search, see Searching a knowledge base and the related API.

May 27th, 2025

Claude 4 Sonnet & Opus

With the launch of Claude 4 Sonnet and Opus, Anthropic has introduced its most advanced AI models yet.Both models share the same large 200k token context window and up-to-date training data (March 2025), but they are designed for slightly different needs.

Feature	Claude 4 Sonnet	Claude 4 Opus
Context Window	200k tokens	200k tokens
Training Cutoff	March 2025	March 2025
Max Output Tokens	64k	32k
Input Price	$3 / million tokens	$15 / million tokens
Output Price	$15 / million tokens	$75 / million tokens
Performance Summary	Improved reasoning and intelligence vs Sonnet 3.7	Most capable model with superior reasoning
Latency	Moderate	Slow
When to use?	Best price/ value and capable enough for 95% use cases	Best for complex tasks needing deep analysis where speed is not important

Claude 4 Sonnet is the best choice for most users. It offers excellent reasoning, strong tool use, and great overall performance at a much lower price. For 95% of typical use cases—including content generation, Q&A, code review, and day-to-day tasks—Sonnet provides the best balance of speed, capability, and cost.Claude 4 Opus is Anthropic’s top-tier model. It delivers the highest level of reasoning and intelligence and outperforms other models on complex or nuanced tasks. Opus is ideal for scenarios where accuracy and depth matter more than speed or cost, such as advanced research, detailed data analysis, and orchestrating multi-step workflows. However, it is slower and significantly more expensive.Both models represent a significant leap over previous versions, with better reasoning, improved tool use, and more robust handling of longer or more complex inputs. Choose Sonnet for value and everyday performance, and Opus for your most demanding AI workloads.

May 8th, 2025

Tool response in Playground

We’ve enhanced the Playground experience by introducing interactive tool responses. When you’re using tools (also known as function calls) within your workflow, you can now seamlessly provide the tool response back to the model, enabling continuous and efficient testing of your workflows.How it works:Invoke Tools: When your model calls a tool (e.g., get_weather), it provides the tool name and relevant arguments.Interactive Response: Previously, your workflow paused here, requiring manual intervention. Now, you can directly give the tool response back to the model or use the handy Tab shortcut to auto-generate a suggested response.Continue Workflow: With the tool response provided, the model can continue, generating a final, context-aware response.This new capability simplifies debugging and allows you to rapidly prototype sophisticated workflows without interruption. See an example below.

April 18th, 2025

OpenAI’s new o3, o4-mini, GPT-4.1 models

OpenAI’s latest models can be found in the model garden.

When to use what model

o-series models are purpose-built for reasoning and tool-use. Choose o3 when raw capability beats everything else; use o4-mini when you still need strong reasoning but want 10× cheaper tokens and snappier replies.
GPT-4.1 family is the evolution of the “generalist” GPT line. The full model gives you 1 M-token context with strong instruction following; mini trims cost/latency for mainstream workloads; nano pushes pricing and speed to the floor for simple or massive-scale tasks.

Feature	GPT-4.1	GPT-4.1 mini	GPT-4.1 nano	o3	o4 mini
Training cutoff date	May 2024	May 2024	May 2024	May 2024	May 2024
Context window	1M	1M	1M	200k	200k
Max output tokens	32.768	32.768	32.768	100k	100k
Input price	$2/M Tokens	$0.40/M Tokens	$0.10/M Tokens	$10/M tokens	$1.10/M tokens
Output price	$8/M Tokens	$1.60/M Tokens	$0.40/M Tokens	$40/M tokens	$4.40/M tokens
Latency	Moderate	Fast	Very fast	Slow	Moderate
When to use?	Long-context chat, knowledge-work, multimodal apps needing top GPT quality	Everyday product features, prototyping & chat where speed and price count	Real-time, latency-critical or large-scale batch jobs on a tight budget	Deep research, complex multi-step reasoning, high-stakes coding/science tasks	Reasoning workloads where cost & throughput matter; API agents, math & data-science

April 18th, 2025

Rate Limits

To ensure fair usage and consistent performance, we’ve introduced rate limits for all Orq.ai APIs on a per-account basis. This helps prevent server overload, reduces risk of abuse, and keeps costs manageable.If your account exceeds its rate limit, you’ll receive a 429 Too Many Requests response.Rate limits vary by subscription tier:

Subscription	Rate Limit for Deployment API calls	Rate Limit for other API calls	Log retention
Sandbox	100 API calls/minute	20 API calls/minute	3 days
Team (Legacy)	1000 API calls/minute	50 API calls/minute	14 days
Pro	2000 API calls/minute	100 API calls/minute	30 days
Enterprise	Custom	Custom	Custom

For more details, refer to our pricing page.

April 18th, 2025

Agentic RAG

You can now enable Agentic RAG in your deployments to improve the relevance of retrieved context and the overall output quality.Once toggled on, simply select a model to act as the agent. The agent will automatically check if the retrieved knowledge chunks are relevant to the user query. If they aren’t, it rewrites the query — preserving the original intent — to improve retrieval results.This iterative refinement loop increases the chance of surfacing useful context from your Knowledge Base, giving the language model better grounding to generate high-quality, reliable responses. The setup includes two key components:

Document Grading Agent – determines if relevant chunks were retrieved.
Query Refinement Agent – rewrites the query if needed.

See the screenshot below on how the input query gets refined. Input query: ‘is my suitcase too big?’ is reformulated to > ‘luggage size requirements and restrictions for carry-on and checked baggage’.

How to enable this? Just toggle the Agentic RAG feature on and select your model.

This feature is part of our ongoing effort to help you ship more robust AI features, let us know if you have any feedback!

April 18th, 2025

Threads

We’ve added a new Threads view to help you make sense of multi-step conversations at a glance.Each thread captures the full back-and-forth between the user and the assistant, showing just the inputs and outputs per step—without the technical breakdowns like embeddings, rerankers, or guardrails. It’s ideal for reviewing how the conversation unfolds across multiple steps.The Thread Overview includes:

Start time and last update
Total duration
Number of traces (steps)
Total cost and tokens used
Project name and session ID
Custom tags to help categorize threads

You can customize which columns are visible to tailor the overview to your needs.To narrow things down, you can filter threads by:

Project
Tags
Start date

This view complements the existing Traces tab, which remains the place for inspecting each individual LLM call in detail.

📘 To start using the Threads function add a thread id during the deployment invocation, to learn more, see Threads.

April 9th, 2025

Jina AI - new embedding and rerank models

We’ve added Jina AI as a new provider, bringing a suite of high-performance models to the platform:

Multilingual Embedding Models: Use Jina AI’s embedding models—like jina-embeddings-v3, supporting 89 languages and sequences up to 8,192 tokens.
Efficient Rerank Models: Choose from three reranker variations:
- Tiny – optimized for speed and low memory use.
- Turbo – a balance between speed and accuracy.
- Base – best-in-class accuracy for critical use cases.

These models strengthen both multilingual coverage and reranking performance across a wide range of applications.

April 9th, 2025

Knowledge Base API

Previously, creating and managing a knowledge base required using the Orq UI. Users had to manually create a knowledge base, upload one file at a time, and define a chunking strategy per datasource. While simple, this process was time-consuming, especially for developers looking to scale or automate their workflows.With the new Knowledge Base API, all major operations are now available programmatically. You can create and manage knowledge bases, upload datasources, generate or manage chunks, and perform retrieval-related actions, all through code.This opens up much more flexibility, especially for teams working on complex chunks. You can now:
• Perform chunking on your own side, tailored to your data structure
• Bring your own chunks—and even include your own embeddings
• Let Orq embed the chunks if embeddings aren’t provided, using the model specified in your API callNote: Attaching custom metadata to chunks isn’t supported yet but will be added soon.Available operations:
• Knowledge bases: create, list, retrieve, update, delete
• Datasources: create, list, retrieve, update, delete
• Chunks: create, list, retrieve, update, delete
• Search: inspect what’s being retrievedWhether you’re working with a single file or a dynamic content pipeline, this update makes the knowledge base workflow faster, more flexible, and developer-friendly.

📘 Read more on the Knowledge API in our docs or in the SDKs

April 7th, 2025

Hub

We’ve added a new feature called the Hub — a library where you can browse and reuse common evaluators and prompts for your projects.Instead of starting from a blank screen, you can now add ready-made components with one click. Some can be fully customized after adding them to your project (like prompts or classification evaluators), while others — such as traditional NLP metrics like cosine similarity — are fixed by design.The goal: make it easier to start with off-the-shelf evals and prompts.At launch, the Hub includes prompts and evaluators. Soon, you’ll also find datasets, tools, and other entities here to further speed up your workflow.You can find the Hub in the left-hand menu of your workspace.

March 18th, 2025

Tracing

AI workflows can feel like a black box—when something goes wrong, it’s hard to know why. Tracing changes that by giving you full visibility into every step of your workflow. Instead of guessing why an LLM output is wrong, you can quickly check every step in the workflow—saving time and reducing frustration.With this release, you can inspect all events that occur within a trace, including:

Retrieval – See which knowledge chunks were fetched.
Embedding & Reranking – Understand how inputs are processed and prioritized.
LLM Calls – Track prompts, responses, and latency.
Evaluation & Guardrails – Ensure quality control in real time.
Cache Usage – Spot inefficiencies in repeated queries.
Fallbacks & Retries – Detect when your system auto-recovers from failures.

This level of observability helps teams debug faster, optimize workflows, and make data-driven improvements.

Billing impact - Event count

With the introduction of Traces, all the events seen in the overview will count towards the number of events. This has direct impact on the billing.An example: A chat request with 2 evaluators was historically counted as 1 request but will now be counted as 3 events.

March 18th, 2025

Python Evaluators

When building AI features, ensuring high-quality and reliable outputs is crucial. Orq.ai allows you to implement custom evaluators in Python, giving you full control over how AI-generated content is assessed and validated.

Benefits of Using Python for Evaluators

Flexibility & Customization

Python enables you to define evaluation logic that precisely matches your needs, whether it’s scoring relevance, detecting biases, or enforcing style guidelines.

Seamless Integration

Orq.ai supports direct integration of Python-based evaluators, allowing you to run checks on AI outputs within your workflow without extra tooling.

Preloaded with NumPy

Your Python evaluators can leverage NumPy (v1.26.4) for numerical computations, making it easier to apply statistical methods, calculate custom scores, or analyze AI responses efficiently.

Automated & Scalable

Python evaluators run automatically on AI responses, ensuring continuous quality control and reducing manual review efforts.

📘 Read here how to set up a custom Python Evaluator, see our docs

March 18th, 2025

Move entities

Previously, once an entity was created, it was locked in place—you couldn’t move it to another project or directory. Now, you finally can.With this new feature, users can seamlessly move entities and directories to different projects and directories, ensuring better organization and flexibility in managing data.

Key Details:

Move entities and directories freely between projects and directories.
Maintain project-level permissions—entities will only be visible to the selected teams assigned to that project.
Improve workflow efficiency by keeping related entities together where they belong.

How It Works

This update gives you more control over your workspace, making it easier to structure your data the way you need it.Try it out and optimize your project structure today!

February 28th, 2025

GPT 4.5

OpenAI has unveiled GPT-4.5, its latest and most advanced AI language model to date. Building upon the foundation of GPT-4o, GPT-4.5 offers enhanced pattern recognition, deeper world knowledge, and a more refined conversational experience. This release aims to provide users with a more intuitive and reliable AI assistant for a variety of applications.

Why It Matters

GPT-4.5 represents a significant step forward in AI-human interaction. By reducing instances of AI “hallucinations” and improving emotional intelligence, users can expect more accurate and contextually appropriate responses. These enhancements are particularly beneficial for tasks requiring creativity, empathy, and nuanced understanding, such as content creation, customer support, and personal assistance.

What’s New?

Enhanced Knowledge: GPT-4.5 has been trained on a broader dataset, resulting in a more comprehensive understanding of various topics and improved factual accuracy.Improved Conversational Abilities: The model now exhibits a more natural and emotionally nuanced conversational style, making interactions feel more like engaging with a thoughtful human.Reduced Hallucinations: Through advanced training techniques, GPT-4.5 significantly lowers the occurrence of generating incorrect or fabricated information, enhancing user trust.

Additional Enhancements

Training Innovations: GPT-4.5 combines traditional supervised fine-tuning with scalable alignment, where smaller models generate high-quality training data for larger models. This approach accelerates training and improves the model’s ability to follow nuanced instructions.
Multilingual and Multimodal Support: The model shows moderate gains in understanding multiple languages and processing various data types, making it versatile for global applications.

Performance & Analysis Tools

Benchmark Evaluations: GPT-4.5 demonstrates solid improvements over GPT-4o, especially in math (+27.4%) and science (+17.8%), making it more reliable for factual reasoning.
User Feedback Integration: OpenAI encourages users to engage with GPT-4.5 and provide feedback to further refine its capabilities and address any limitations.
Complex reasoning: Even though GPT-4.5 is an improvement over GPT-4.o, for more complex reasoning tasks, you should use o3-mini.

We invite all users to explore GPT-4.5 and experience its enhanced features firsthand. Your interactions and feedback are invaluable as we continue to evolve and improve our AI models. Dive in and discover how GPT-4.5 can assist you in achieving your goals more effectively.

February 25th, 2025

Claude 3.7 Sonnet

Claude 3.7 Sonnet is Anthropic’s most intelligent AI model yet, introducing hybrid reasoning, fine-tuned for business use cases, and an extended 200K context window.

🚧 We don’t return the thinking message yet, only the final output will be shown

📘 The model is available through Anthropic and Vertex AI

🚧 In AWS, the model is only available in the US region

Thinking - Budget tokens Parameter

With the introduction of the new model, we also introduce a new parameter with it: Thinking - budget tokensUsing this parameter users can enable/ disable the models reasoning capabilities. Additionally, you can set how many tokens should be dedicated for its ‘thinking’ step. We do recommend a minimum of 1024 budget tokens. A value of 0 means the thinking is disabled. The budget tokens should never exceed the Max Tokens parameter because it still needs tokens to generate a final answer.

Key highlights

🧠 Hybrid Reasoning Model

First-of-its-kind AI that can provide instant responses or extended, step-by-step reasoning based on user preference.
API users can control “thinking time” to balance speed, accuracy, and cost.

📊 Advanced Visual Data Extraction

Improved ability to analyze charts, graphs, and complex diagrams, making it ideal for data science, analytics, and business intelligence.

🗣️ More Human-Like Chatbots

Enhanced reasoning and natural conversation flow make it ideal for interactive AI assistants that connect data, make decisions, and take actions.

🤖 Robotic Process Automation

Automate complex workflows and repetitive tasks with industry-leading instruction following and tool integration.

🏗️ Expanded Capabilities for Developers

Supports a 200K context window for deep comprehension of long documents, knowledge bases, and large datasets.
128K output tokens (beta)—15x longer responses, enabling extensive code generation, planning, and detailed analysis.

Benchmarks and performance

Claude 3.7 Sonnet is designed for developers, businesses, and AI enthusiasts looking for an all-in-one model that balances speed, reasoning, and precision.

February 12th, 2025

Increased fallbacks

We’ve enhanced our fallback and retry system to provide even greater reliability and flexibility in production use cases. Previously, Orq.ai allowed users to define a single fallback model if the primary model failed to generate a satisfactory output. Now, we’ve increased the number of fallback models to five, giving users even more control over model orchestration without additional coding.

What’s New?

Up to Five Fallback Models – Previously limited to one fallback model, users can now configure up to five fallback models for increased reliability.
Flexible Fallback Configuration – The ability to define multiple fallback models allows for more advanced use cases, such as distributing API requests across different endpoints. For example:
- Azure provides different regional deployments for models like GPT-4o. If a user is making a large number of API calls through the Sweden Central region, they can configure their first fallback to West Europe, and subsequent fallbacks to other locations if rate limits are hit.
- If a primary model encounters rate limits, Orq can seamlessly transition requests to alternative endpoints, ensuring uninterrupted performance.

This update doesn’t change the core retry and fallback functionality that users already love—it simply expands its capabilities, making Orq.ai’s automated failover system even more powerful.Try out the new expanded fallbacks and let us know how they enhance your workflow!

February 12th, 2025

Importing prompts

We’ve introduced a new feature that makes it easier than ever to reuse your prompts within Orq.ai. With the import functionality, you can seamlessly integrate your already created, version-controlled prompts into your Playgrounds, Experiments, and Deployments.

What’s New?

Seamless Importing – Users can now import their saved prompts directly into their workspace.
Version Control – Select the specific version of a prompt to use.
Model Configuration – You can import your prompt including the optional model config.

This feature eliminates the need to manually recreate prompts or copy paste them, making it easier to experiment and deploy with confidence.For a quick walkthrough, check out the GIF below, which visually demonstrates how to import and use your prompts.

February 12th, 2025

Datasets v2

We’ve made a significant update to how datasets work in Orq.ai with the release of Datasets v2. This update merges variable collections into datasets, streamlining the structure and eliminating confusion between the two concepts.

What’s Changed?

Previously, variable collections only contained input variables, while Datasets v1 stored message arrays and optional references. With Datasets v2, we now combine everything into a single entity. This provides a single source of truth, making it clearer what constitutes a dataset and simplifying how datasets are used.

Why This Change?

Eliminates confusion between datasets and variable collections.
Creates a unified structure for better dataset management.
Ensures consistency for experiments and historical logging.

Why use Datasets v2?

Upload a dataset to evaluate your use case in an Experiment
Save human-curated logs into curated datasets, which you can use to fine-tune your model.
Save bad-rated logs, crucial for prompt engineers looking to refine prompts and improve output quality.

See an example of how to create a dataset below:

February 7th, 2025

Experiments v2 - Evaluate your LLM Config

We’re excited to introduce Experiments V2, a major upgrade to our Experiments module that makes testing, evaluating, and benchmarking models and prompts more intuitive and flexible than ever before.

Why Experiments Matter

The Experiments module allows users to analyze model performance systematically without manually reviewing logs one by one. It leverages evaluators—such as JSON schema validators, LLM-as-a-judge, and functions like cosine similarity —to automate the scoring process and assess output quality. This ensures users can confidently measure effectiveness and make data-driven improvements.Additionally, Experiments enable benchmarking through:

A/B testing – Compare different prompts or model configurations to determine what works best.
Regression testing – Detect unintended changes in outputs after modifying prompts or configurations.
Backtesting – Assess how new setups would have performed on historical data.

What’s New in Experiments V2?

We listened to your feedback and made significant improvements:

Compare Prompts Side by Side – Users can now directly compare multiple prompts within an experiment, making it easier to test variations and find the most effective approach.
Intuitive and Flexible UI – The new interface is streamlined for ease of use, making configuration and analysis smoother than ever.
Merged Data Sets and Variable Collections – Previously separate, these have now been combined to simplify workflows. We appreciate all the feedback on this and have made it clearer: when setting up an experiment, users now upload a data set containing input variables, messages (prompts), and expected outputs (if evaluators require a reference).

February 6th, 2025

DeepSeek Models Now Available: 67B Chat, R1, and V3

We are excited to announce the integration of DeepSeek’s latest AI models—67B Chat, R1, and V3—into our platform.

What model should I use?

Choose 67B Chat for fast everyday conversations
Pick R1 when you need deeper thinking and logic for complex tasks
Use V3 as a starting point to create your own specialized AI model

February 5th, 2025

OpenAI’s Latest Small Reasoning Model – o3-mini

Start using OpenAI’s newest and most advanced ‘small’ reasoning model: o3-mini.

January 16th, 2025

Llama 3.3 70b & Llama Guard 3 are now available through Together AI

Experience the power of the latest Llama 3.3 70b and Llama Guard 3 models on Orq, integrated via Together AI.

Together AI

With the deprecation of Anyscale, we’ve been looking for a good alternative, and that’s where Together AI comes in. Together AI allows you to start using the latest Llama models with more open source models to come.

Llama Guard 8b

In addition to our latest big release featuring Evaluators & Guardrails, we also added the Llama Guard model to make your AI feature even more secure.Llama Guard 3 is built to protect AI systems from common prompt hijacking attacks and misuse of LLMs.

How it works: It evaluates inputs and outputs for safety, returning a simple safe or unsafe verdict.
Key feature: When flagged as unsafe, it identifies the violated content categories, offering clear insights.
Why use it: Ideal for organizations exposing AI features to external users, it prevents misuse and ensures secure, compliant interactions.

This model is an essential tool for maintaining the integrity of your AI systems.

Llama 3.3 70b

For users building with Llama 3.1, you can now switch to the improved Llama 3.3 70b. You can use the same prompt as with Llama 3.1 since the prompt format remains the same. Also, the 3.3 70b model is similar in performance to the 3.1 405b model, meaning you can save cost and time by switching to the newer model.

January 15th, 2025

New Layout with Project Structure

We’re introducing a new project structure UI to help you organize and manage your resources more effectively. With projects, you can group your work by use case, environment, or any logical structure that suits your needs.

What are the benefits of the new Project Structure UI?

Easier grouping of resources that you’re working on as a team
Less clutter when working on multiple use cases
Better incorporation of teams and roles within a project

Where are my old resources?

Your old resources are automatically grouped per entity of your project. For example: all your deployments within the same project are moved to the deployments folder of that project.

Example structure

In the example image below, you can see that you’ll have the ability to group multiple entities within the same folder. This allows you to have a production folder of a specific use case, and keep your other projects separate.

Can I move my entities to a different project or directory?

Unfortunately not yet. We’re working on this feature, but until then, make sure you are working in the right place.

Search your entities

To find the right entity you want to work on, we added a global search function.You can filter on project and creator, as seen in the gif below.

📘 To read more about our new project structure, please visit our documentation.

December 16th, 2024

Online Guardrails in Live Deployments

You can now configure Guardrails after you have added them to your Library directly in Deployments > Settings for both input and output, giving you full control over Deployment responses

Guardrails: Block unacceptable inputs or outputs based on predefined criteria. If a guardrail is triggered, it can either retry the request or activate a fallback model, if configured.

What Are Guardrails?

Guardrails enforce boundaries by setting thresholds to block inputs or outputs that fail to meet predefined criteria. When triggered, guardrails can stop processing or activate fallback mechanisms like retries or alternative models.

Threshold Configuration: Set customizable thresholds based on your specific use case to block inputs or outputs that fail to meet predefined criteria.
Input and Output Guardrails: Apply rules to both incoming data and generated outputs, ensuring comprehensive quality control at every stage.
Synchronous Operation: Guardrails run synchronously and do not introduce additional latency to your workflows.
Fallback Mechanisms: When triggered, guardrails can stop processing or activate fallback options such as retries or alternative models to maintain continuity.

December 13th, 2024

Why Are Evals and Guardrails Important?

With this release, you can gain full control over your AI applications, enhancing the reliability, safety, and quality of AI-generated outputs with evaluators and guardrails. While LLMs are incredibly powerful, they are also prone to errors such as incorrect information, biased content, or misaligned outputs—a risk famously illustrated by a chatbot that mistakenly agreed to sell a car for $1.The value of evaluators is twofold:

First, they address the inherent risks of LLMs by monitoring and ensuring output quality and safety.
Second, they help teams overcome one of AI development’s biggest challenges: understanding how updates impact performance. Without clear insights, the development loop can break, turning iteration into guesswork rather than a structured engineering process. Evaluators help you understand whether an update is an improvement or a regression and quickly drill down into good / bad examples.

What Are Evaluators?

Evaluators assess and score outputs either against a reference or directly as they are generated by the model. They play a critical role in monitoring and improving AI quality, enabling you to evaluate performance continuously without disrupting operations. Evaluators come in several forms:Classic Evaluators: These rely on references and established metrics, such as cosine similarity or BLEU scores, to compare outputs to an expected result. They are ideal for structured, deterministic tasks like translation or summarization.

Example: Use cosine similarity to measure how closely an LLM-generated summary matches the reference text.

LLM-as-a-Judge: This approach uses language models themselves to evaluate outputs in real time, without requiring reference texts. By leveraging tailored prompts, LLMs can assess dimensions such as tone, accuracy, and task relevance, offering more flexibility for dynamic and creative use cases.

Example: Evaluate whether a chatbot’s response aligns with the intended tone and emotional context without comparing it to a specific reference.

We provide four specialized types of evaluators, each tailored to specific evaluation needs:RAGAS Evaluators
Designed to evaluate retrieval-augmented generation (RAG) workflows, these evaluators measure context relevance, faithfulness, recall, and robustness. They ensure responses derived from external knowledge bases are accurate and focused.

Example: Verify that a chatbot retrieving insurance policy details provides relevant and accurate information while excluding unrelated content.

Function Evaluators
Rule-based tools for assessing measurable aspects of text, such as formatting, length, or keyword presence. These provide binary, deterministic results for compliance and validation.

Example: Check if a response includes “return policy” or stays within a 200-character limit for customer service use cases.

HTTP Evaluators
Allow external API calls to evaluate outputs using custom or third-party systems, enabling tailored checks for domain-specific requirements.

Example: Call a compliance-check API to validate that AI-generated responses meet industry-specific guidelines.

JSON Evaluators
Validate JSON outputs against predefined schemas, ensuring all required fields, correct data types, and structures are present.

Example: Confirm a JSON file for user registration includes fields like “username,” “age,” and “email” in the correct format.

What Are Guardrails?

Guardrails, enforce strict boundaries to block unacceptable inputs or outputs entirely. These constraints ensure the system adheres to rules like regulatory compliance, ethical standards, or content safety. The motives to implement guardrails can be summarized in four categories:

Robustness and Security. Guardrails are of huge importance in the aspect of security and robustness of an LLM-based application. This includes vulnerabilities such as prompt injections, jail-breaking, data leakage, handling illegible or obfuscated content.
Information and Evidence. Guardrails can be put in place for fact checking, detecting hallucinations, making sure the response does not contain incorrect or misrepresented information, does not consider irrelevant information, the response is supported by evidence.
Ethics and Safety. Making sure the response does not lead to harmful consequences either directly or indirectly. A response should not exhibit problematic social biases or share protected or sensitive information, take into account copyright and plagiarism law, use 3rd party content by appropriately citing it, etc.
Tool-specific functionalities. Making sure the responses are on-topic, of proper extensiveness, written in the suitable tone, use specified terminology, etc.

December 13th, 2024

HTTP and JSON Evaluators and Guardrails

You can now add HTTP and JSON Evaluators and Guardrails under the Evaluator tab and add them to your Deployment or Experiment.

HTTP Evaluators and Guardrails

Purpose: Enable evaluation using external APIs to perform checks like compliance, custom logic, or business-specific validations.
How It Works: You define the URL, headers, and payload for HTTP Requests, and the evaluator expects either a boolean or numerical response in the format:

{ "type": "boolean", "value": true }

{ "type": "number", "value": 1 }

Use Case Example: An HTTP Evaluator can validate content against regulatory guidelines by routing a compliance check to your API, returning a boolean to accept or block the output. If you want to integrate an HTTP Eval into your set up, you can find more information in HTTP Evaluators.
Guardrails: Configurable to accept or deny calls based on the returned value, providing dynamic control in experiments and deployments.

JSON Evaluators and Guardrails

Purpose: Validate input or output payloads during Experiments or Deployments using a predefined JSON Schema.
How It Works: Enter a JSON Schema specifying required fields and their expected types. For example:

{
  "name": "ReceiptDataExtraction",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "Date": {
        "type": "string",
        "description": "Transaction date (YYYY-MM-DD)."
      },
      "Vendor": {
        "type": "string",
        "description": "Name of the vendor."
      },
      "Amount": {
        "type": "number",
        "description": "Total transaction amount."
      }
    },
    "required": ["Date", "Vendor", "Amount"],
    "additionalProperties": false
  }
}

Use Case Example: A JSON Evaluator can ensure that generated content includes mandatory fields like “title” and “length,” blocking invalid payloads. For a more extensive explanation on how to set this up in your Deployment, you can go here.
Guardrails: Toggle guardrails to block outputs that don’t adhere to the schema, ensuring compliance during Deployments or Experiments.

December 13th, 2024

Master Your RAG with RAGAS Evals

The Ragas Evaluators are now available, providing specialized tools to evaluate retrieval-augmented generation (RAG) workflows. These evaluators make it easy to set up quality checks when integrating a Knowledge Base into a RAG system and can be used in Experiments and Deployment to ensure responses are accurate, relevant, and safe.Key Features

Out-of-the-Box Functionality: RAGAS Evaluators are ready to use and cannot be reconfigured, offering a consistent evaluation framework.
Reference-Based Scoring: Some evaluators require a reference to calculate metrics like accuracy or faithfulness.
Scoring Scale: Evaluations return a score between 0 and 1, with higher scores indicating better performance (e.g., higher relevance or faithfulness).

Included Evaluators

Context Precision: Assesses how well retrieved chunks align with the user’s query.
- Example: Ensures chunks about “home insurance” are prioritized when the query asks about coverage, filtering out irrelevant topics like auto insurance.
Response Relevancy: Evaluates how directly the generated answer addresses the query.
- Example: For “What are the fees for international transfers?” it ensures the answer is concise and focused on the fees without unrelated details.
Faithfulness: Ensures the response is factually consistent with the retrieved context.
- Example: For “What is the company’s remote work policy?” it checks if claims (e.g., “three days remote”) match the policy document.
Context Entity Recall: Verifies that critical entities from the reference answer are included in the retrieved content.
- Example: For “Tell me about the Taj Mahal,” ensures entities like “Shah Jahan” and “Agra” are retrieved.
Context Recall: Measures if all necessary details from a reference are retrieved.
- Example: For “What are the main benefits of product X?” it ensures all benefits like “cost savings” and “improved efficiency” are included.
Noise Sensitivity: Checks if the system ignores irrelevant information in the retrieved context.
- Example: For “What is LIC known for?” it ensures the response focuses on LIC’s attributes, filtering out unrelated economic data.

December 13th, 2024

Evaluator Library: 50+ Ready-to-Use and Tailorable Evaluators

Introducing the new Evaluator Library:Key Features:1. Gallery of Preconfigured Evaluators:

The library includes 50+ preconfigured evaluation templates tailored for a wide range of use cases, from quality checks to compliance and contextual alignment.
These templates are organized into categories such as LLM Evaluator Ragas Evaluator and Function Evaluator checks to help you find the right tool for your needs quickly.

2. Customizable and Configurable Evaluators:

Each preconfigured evaluator can be customized and configured to fit specific project requirements.
Once tailored, evaluators can be added to your personal Library for reuse in Experiment and Deployment.

3. Seamless Integration:

Evaluators are designed to work effortlessly in both experiments and deployments, enabling consistent and reliable quality control across development stages.

December 13th, 2024

Improved LLMs as a Judge

LLM-as-a-Judge Enhancements:
We’ve significantly improved our existing LLM Evaluator feature to provide more robust evaluation capabilities and enforce type-safe outputs.1. Type-Safe Output:

When you configure the output type as a number, boolean, or other types, the LLM ensures the result adheres strictly to the defined format.
Additionally, each output includes a clear, concise explanation to support transparency and reliability in the evaluation process.

2. Extended Functionality:

The LLM-as-a-Judge now seamlessly integrates into live deployments, enabling asynchronous quality evaluation with a configurable sample rate. This approach provides actionable insights without disrupting workflows, ensuring consistent quality and minimizing drift.
The LLM-as-a-Judge also supports Guardrail Mode, actively enforcing quality standards by blocking inputs or outputs that fail to meet defined criteria.

November 30th, 2024

Online Evaluators in Live Deployments

iYou can now configure Evaluators after you have added them to your Library directly in Deployments > Settings for both input and output, giving you full control over Deployments.

Why Are Evaluators Important?

Quality and Safety: Evaluators mitigate LLM risks by monitoring and ensuring output quality and safety.
Performance Insights: They provide clear feedback on updates, helping teams identify improvements or regressions and refine iterations effectively.

What Are Evaluators?

Evaluators are tools designed to assess the quality, relevance, and safety of AI inputs or outputs, ensuring reliable and effective system performance. Off the shelf, we offer several options, including LLM Evaluator, Ragas Evaluator, Function Evaluator and API Call Evaluators. Additionally, the evaluators allow for the following configuration:

**Sample Rate Option:**Run evaluations on a subset of calls (e.g., 10%) to balance coverage and efficiency.
**Cost Management:**Reduces additional expenses associated with running evaluators, especially LLM-based ones.
Latency Control: Minimizes added response time by limiting the number of calls evaluated.

If you want to set up the evals in your own deployment, you can find more information here.

November 25th, 2024

Cache your LLM response

Cache and reuse your LLM outputs for near-instant responses, reduced costs, and consistent results.What is Caching?With this release, you can now enable caching for your LLM calls. Caching works by storing both the input and the generated output for a configurable period (TTL – Time-to-Live). When an exact request is made during this time, the cached output is returned instantly, bypassing the need for the LLM to generate a new response.Why Enable Caching?

Cost Savings

By reusing previously generated responses, caching reduces the number of LLM calls, which can significantly cut down on usage costs.

Time Savings

Cached responses are nearly instant, offering a much faster response time compared to generating a new output from the LLM.

Improved Consistency

Caching mitigates the inherent variability of LLM responses by ensuring that identical inputs produce the same outputs during the cache period.How to set up Caching:Go to your Deployment > settings > enable caching > configure TTLHow to check the Cache status:Logs > select specific log > request tab > caching statusIn here you will see if the cache was hit or not and the configured TTL (time-to-leave) in seconds.

November 22nd, 2024

Structured Outputs on Azure

Ensure the model’s responses always stick to your JSON Schema. No more missing fields or incorrect values, making your AI applications more reliable and robust.In addition to the Open AI provided models, you can now use Structured Outputs on Azure hosted models.Using Structured Outputs is the go-to method if you want to ensure that the model is adhering to your JSON Schema.

📘 Read more about Structured Outputs in Using Response Format.

November 7th, 2024

Claude 3.5 Haiku and Sonnet (new)

Introducing the upgraded Claude 3.5 Sonnet and the new Claude 3.5 Haiku. The enhanced Claude 3.5 Sonnet offers comprehensive improvements, especially in coding and tool use, while maintaining the speed and pricing of the previous model. These upgrades make it an excellent choice for complex, multi-step development and planning tasks.For those prioritizing speed, Claude 3.5 Haiku combines quick response times with strong overall performance. It’s ideal for user-facing applications and specialized tasks, managing data-rich inputs like personalized recommendations and real-time decision-making with ease.Note: Claude 3.5 Haiku doesn’t support vision just yet.

November 4th, 2024

Role Based Access Control improvements

With the improvements to role-based access control (RBAC), it is much easier to assign and change workspace roles in Orq.Orq now offers three distinct workspace roles:

Admin: Complete access across all platform features, including user management, workspace settings, model configurations, and billing.
Developer: Provides technical access to all essential features, including the ability to deploy changes within Deployments. Developers cannot manage models, billing, or users.
Researcher: Offers access to Playgrounds, Experiments, and Deployments for research purposes. Researchers can review and annotate logs but cannot modify or deploy changes.

📘 Read more on how to set up permissions in our documentation

October 14th, 2024

Attach files to a Deployment

One common request we hear is whether it’s possible to attach files to an LLM — for example, to extract data from a PDF or engage with its content. With this update, you can now do just that.

📘 We support pdf, txt, docx, csv, xls - 10mb max

The content of the files will be incorporated into the initial system or user message. We recommend adjusting your prompt as you would when using variables in a prompt. The added content will be highlighted in green. See the example below:

🚧 Be aware that we only extract the raw data, no images or objects. Also, encrypted files are not supported.

When should I attach a file and not use a Knowledge Base?

The need for full context understanding

RAG retrieves relevant chunks, which works for focused queries but falls short for tasks like summarization that require full-document understanding. Attaching files gives the LLM access to the entire document, ensuring it has the complete context.For example, when summarizing reports, legal cases, or research papers, the LLM needs to process the full document to capture key details and connections that partial text retrieval can’t provide. Full context access leads to better comprehension and more accurate outputs, particularly for tasks requiring a holistic view, such as summarization and detailed analysis.

Dynamic document context

Unlike a static knowledge base, attached files can provide ad-hoc, context-specific documents for one-time or immediate use without the need for integration into a broader knowledge repository.When a user is dealing with unique documents—such as one-off reports, meeting notes, or specific contracts—they can attach these files directly to a deployment. The LLM can instantly use these documents to provide answers or insights. This feature is especially useful for situations where time-sensitive or project-specific documents need to be used on the fly, giving flexibility to quickly incorporate new, temporary knowledge without modifying or updating the knowledge base.

Private or sensitive data

Due to privacy concerns, confidential or sensitive files (e.g., contracts and medical records) may not be suitable for a general knowledge base. Attaching files directly allows secure, temporary interaction with this data.

How to use it?

To attach one or more files to a Deployment, follow these steps:

Upload your files

curl --location 'https://my.orq.ai/v2/files' \
--header 'Authorization: Bearer xxxxxx' \
--form 'purpose="retrieval"' \
--form 'file=@"/Users/cormick/Downloads/filename.pdf"'

Receive the _id

{
    "_id": "file_01JA5D27ZVW2N702Z0D3B1G8EK",
    "object_name": "files-api/workspaces/e747f6ac-19b0-47cd-8e79-0e1bf72b2a3e/retrieval/file_01JA5D27ZVW2N702Z0D3B1G8EK.vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "purpose": "retrieval",
    "file_name": "file_01JA5D27ZVW2N702Z0D3B1G8EK.vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "bytes": 5295,
    "created": "2024-10-14T11:36:54.189Z"
}

Send the file _id in addition to your normal Orq API call

curl --location 'https://my.orq.ai/v2/deployments/invoke' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer xxxxx' \
--data '{
    "key": "deployment_key",
    "messages": [
        {
            "role": "user",
            "content": ""
        }
    ],
    "file_ids": [
        "file_01JA5D27ZVW2N702Z0D3B1G8EK"
    ]
}'

October 11th, 2024

Log Filtering

The ‘Log Filters’ feature is now available, following high demand from our users. Monitoring logs is essential for tracking performance and identifying areas for improvement, and with this new feature, it’s now easier to find the exact data you need to make informed decisions.With Log Filters, you can quickly sort through large sets of logs and focus on the most relevant information for your workflow. Here are some key use cases:

Optimize Prompts: Filter logs with low ratings to pinpoint where your prompts need improvement and refine them for better outcomes.
Troubleshoot Issues: Filter logs with error statuses to identify what went wrong and address any issues in your setup.
Monitor Fallbacks: Filter logs that triggered a fallback model to see where the primary model didn’t perform as expected.

📘 Read more on our documentation page: Filters

October 11th, 2024

Prompt Snippets

Introducing Prompt Snippets, a feature designed to boost your efficiency by allowing you to reuse key parts of your prompt across multiple variants.When working with multiple deployment variants that share common prompt elements, you no longer need to update each one individually. Instead, edit the snippet once, and all related variants will automatically reflect the change, saving you time and reducing manual effort.

Using a Prompt Snippet

When you type {{snippet., a list of all available snippets will appear. You can then select the snippet you’d like to use.

Configuring a Prompt Snippet

Head to the Prompt Library to create and configure your Prompt Snippet.

Version control

Prompt Snippets are fully version-controlled, making it easier to track changes and correct mistakes, especially in collaborative environments. Be aware that updates to snippets will automatically apply to all deployments using them, so ensure any changes are intentional and properly tested.

❗️ Please note that if you modify the snippet, all deployments containing that snippet will be automatically updated

📘 Read more on our documentation: Prompt.

October 11th, 2024

Output Masking

We’ve added a new Output Masking feature to further strengthen security for the most sensitive use cases. With this feature, you can now mask the entire model’s output in a Deployment, providing an additional layer of data protection.In addition to the existing capability to mask input variables—such as marking patient_information as PII to prevent sensitive data from being stored on the Orq.ai platform—Output Masking extends this protection to the model’s complete output as well, ensuring that sensitive information remains secure throughout the entire process.

Enabling Output Masking

Go to your Deployment > Settings > toggle Output Masking on.

📘 Read more on how to set up Output Masking in our Deployment Security and Privacy documentation

October 11th, 2024

Workspace usage metrics

We’ve added a new feature to the Billing page called Workspace Usage. This tool provides users with a clear overview of how many API calls they’ve used out of their available limit within their current billing cycle. The usage tracking starts from the beginning of your subscription.It’s important to note that exceeding your API usage limit does not stop you from making further calls. However, once you surpass the free API call limit, additional charges will apply for any extra usage.

September 16th, 2024

Structured outputs

Structured Outputs is a feature that makes sure the model’s responses always follow your provided JSON Schema. This means you don’t have to worry about the model leaving out any required parts or making up incorrect values.

Why should you use Structured Outputs?

Reliable format: You don’t have to check or fix responses using the wrong format anymore.
Clear refusals: When the model refuses something for safety reasons, it’s now easily detectable programmatically.
Easier prompts: You don’t need to prompt the model to adhere to the schema anymore.

Using Structured Outputs

There are two ways of using the Structured Outputs functionality:

reponse_format (recommended) - In the dropdown menu (see GIF below), you can select the JSON pill and specify your schema.
Function Calling - In the tool, you can toggle the functionality of the structured output on/off.

Note: Structured Outputs are only available when using the following models:

gpt-4o 2024-08-06 and later
gpt-4o-mini 2024-07-18 and later

🚧 Reminder: To use Structured Outputs, all fields must be specified as required.

If you want to read more about Structured Outputs, please visit OpenAI’s documentation here

September 16th, 2024

Vertex AI custom integration

Set up your custom API keys with the new Vertex AI integration.

September 16th, 2024

Improved authentication

🚧 We removed the email and password login combination to enhance the security of our platform.

To improve the security of our platform, you will be able to log in using the following methods:

Magic Link Sign-In: Enter your email address, and we’ll send you a login link with a code for easy access. This means that you can also set up Two-Factor Authentication (2FA) or Multi-Factor Authentication (MFA) for added security.
Google Single Sign-On (SSO): Seamlessly log in using your Google account.
Microsoft Single Sign-On (SSO): Use your Microsoft account for quick and secure access.

These changes aim to provide a more secure and convenient login experience.

August 28th, 2024

New GoogleAI provider and Gemini models

With the introduction of Google AI as a new provider on the platform, the previous ‘Google’ provider is renamed to Vertex AI.If you were using Google as a provider before, you don’t have to do anything, you will only see a different provider name. The model will remain exactly the same.

📘 Google AI allows you to set up your own API keys, whereas VertexAI does not give you this option.

Why should I use Google AI instead of Vertex AI?

In addition to Google AI allowing you to set up your own API keys, they also offer their latest and experimental models. So, if you want to test out their newest models, Google AI is the way to go.The new models that are available in Google AI:

Gemini-1.5-flash-8b-exp-0827
Gemini-1.5-flash-exp-0827
Gemini-1.5-pro-exp-0801
Gemini-1.5-pro-exp-0827

August 28th, 2024

Export Experiment Results

To give data scientists greater flexibility, we now allow you to export your Experiment results directly from Orq.ai. This update provides the freedom to extend your analysis beyond the platform, enabling you to work with your data in your preferred tools and environments.How it works:

Choose your file type: Select from CSV, JSON, or JSONL formats based on your specific requirements.
Receive your data: After selecting your preferred format, your Experiment results will be compiled and sent to you via email, ready for further analysis.

Use cases

Leverage the exported data for custom visualizations, integrations, or further analysis in your preferred tools.
Utilize the output of LLM generations in various formats for downstream tasks or additional finetuning.

August 15th, 2024

Knowledge Bases & RAG

Start building even more reliable AI applications by equipping LLMs with accurate and relevant knowledge. This will greatly improve their knowledge and thus reduce the likelihood of hallucinations.

Why use a Knowledge Base?

The primary purpose of a knowledge base and a Retrieval-Augmented Generation (RAG) is to provide a reliable source of information that an LLM can access. By querying a knowledge base, an LLM can retrieve relevant data to answer questions or solve problems more accurately. This integration ensures that the information provided is both relevant and precise, enhancing the overall effectiveness of the model.

Use cases

Reduced Hallucination: By relying on a well-structured knowledge base, the likelihood of the LLM generating incorrect or fabricated information (hallucinations) is significantly decreased.
Specific Context: A knowledge base allows for the inclusion of domain-specific or company-specific information, ensuring that the responses generated by the LLM are more aligned with the intended context.
Up-to-Date Information: Unlike static models, a knowledge base can be continuously updated with the latest information, providing the LLM with current and accurate data.

📘 Read more about Knowledge Bases in our Knowledge Base docs

August 1st, 2024

Webhooks

A much-requested feature, webhooks, is now added to the Orq.ai platform.Webhooks allow you to automate a lot of things in real-time, including:

Automated data ingestion and processing: feeding real-time data from sources into Orq.
Automated workflow orchestration: triggering specific deployments based on events.
Logging: the ability to store the requests (input and output) back into your system.
Monitoring and alerting: set up webhooks to alert relevant stakeholders about changes in model performance.

In summary, with the new webhooks, you’ll get much more flexibility in building your Gen AI application.

📘 Read more about webhooks in our docs.

August 1st, 2024

Contact analytics

To gain more insights and control over your LLM consumption, we added contact analytics.A contact is an entity that can be anything you’d like to track, such as an individual, team, project, or client.Some use cases include:

Track Individual Users: Monitor costs and error rates for timely support.
Track Internal Projects: Ensure projects stay within budget and meet goals.
Track Client Projects: Get clear metrics per client projects.
Track Specific Teams: Analyze team performance and optimize resource allocation.

📘 Read more on how to set up contact analytics in our docs

July 25th, 2024

Onboard private models on Azure

You can now add your own private and/or finetuned models to the Orq.ai model garden.As of now, you can only add models hosted on Azure. Soon, we’ll also support the self-onboarding of private models through OpenAI and Vertex AI.Use the interactive walkthrough below to see how the onboarding process works.

July 25th, 2024

Debug more effectively

The Debug tab is back. Did your request fail, and do you want to look into the payload? Navigate towards the debug tab and find out what has been happening.

July 25th, 2024

Provider_response added in the SDK

We did add the raw provider_response to our Python and Node SDK. This will give you the flexibility of retrieving model-specific output that we don’t provide through our unified API.See an example response below:

{
  "id": "01J3MZHS6KFEWE7WBSAG0JP25X",
  "created": "2024-07-25T12:58:41.235Z",
  "object": "image",
  "model": "leonard-vision-xl",
  "provider": "leonardoai",
  "is_final": true,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "url": "https://cdn.leonardo.ai/users/3bf1a64b-57f6-499b-907d-8b1286c1aa95/generations/dee000c8-7b6d-4d8c-82b4-d1992f8adce2/Default_a_tree_0.jpg"
      },
      "finish_reason": "stop"
    }
  ],
  "provider_response": {
    "generations_by_pk": {
      "generated_images": [
        {
          "url": "https://cdn.leonardo.ai/users/3bf1a64b-57f6-499b-907d-8b1286c1aa95/generations/dee000c8-7b6d-4d8c-82b4-d1992f8adce2/Default_a_tree_0.jpg",
          "nsfw": false,
          "id": "f47939ed-0841-48fa-b763-50e2e5b02e9a",
          "likeCount": 0,
          "motionMP4URL": null,
          "generated_image_variation_generics": []
        }
      ],
      "modelId": "5c232a9e-9061-4777-980a-ddc8e65647c6",
      "motion": null,
      "motionModel": null,
      "motionStrength": null,
      "prompt": "a tree",
      "negativePrompt": "",
      "imageHeight": 512,
      "imageToVideo": null,
      "imageWidth": 512,
      "inferenceSteps": 15,
      "seed": 4135326353,
      "public": false,
      "scheduler": "EULER_DISCRETE",
      "sdVersion": "SDXL_0_9",
      "status": "COMPLETE",
      "presetStyle": "CINEMATIC",
      "initStrength": null,
      "guidanceScale": null,
      "id": "dee000c8-7b6d-4d8c-82b4-d1992f8adce2",
      "createdAt": "2024-07-25T12:58:43.034",
      "promptMagic": false,
      "promptMagicVersion": null,
      "promptMagicStrength": null,
      "photoReal": false,
      "photoRealStrength": null,
      "fantasyAvatar": null,
      "generation_elements": []
    }
  }
}

July 24th, 2024

GPT-4o mini

OpenAI has introduced GPT-4o mini, a smaller, more efficient, and cheaper model replacing GPT-3.5 Turbo. This follows similar moves by Anthropic (Claude 3 Haiku) and Google (Gemini 1.5 Flash).Key features:

Significantly smarter than GPT-3.5 Turbo, but not as capable as GPT-4o (see benchmark).
60% cheaper compared to GPT-3.5 Turbo.
It’s a vision model, meaning it can interpret images.
Context window is as big as GPT-4o (128k).

Why a “weaker” model when there is a better one? Quite simply: Not every task needs the best model.Recommendation: For simple tasks such as summarizing texts or improving formulations, 4o-mini is perfectly sufficient. When more knowledge and complex “thinking” is required, you resort to 4o.

June 21st, 2024

Claude 3.5 Sonnet

Two hours after Claude 3.5 Sonnet was released, it became available on Orq.ai. Toggle it on in the model garden and try it out yourself!A little about the new model. First off, Claude 3.5 Sonnet can handle both text and images. It’s also Anthropic’s top-performing model to date, outshining the previous Claude 3 Sonnet and Claude 3 Opus in various AI benchmarks (see image).One area where Claude 3.5 Sonnet does seem to excel is in speed. Anthropic claims it’s about twice as fast as Claude 3 Opus, which could be a game-changer for developers building apps that need quick responses, like customer service chatbots.The model’s vision capabilities have also seen a significant boost. It’s better at interpreting charts, graphs, and even reading text from less-than-perfect images. This could open up some interesting possibilities for real-world applications.However, it’s worth noting that Claude 3.5 Sonnet still has the same context window of 200,000 tokens as its predecessor. While this is a decent amount (about 150,000 words), it’s not a step up from what we’ve seen before.

June 12th, 2024

Prompt library

The new Prompt library lets you create prompt templates that you can reuse throughout the whole platform.With the filtering capabilities, you can easily find the right prompt for your use case and effectively manage multiple prompts.Check the new feature out in your workspace or in this interactive walkthrough below.

June 4th, 2024

Llama 3 on Perplexity

Start using the best open source model, hosted on perplexity, in Orq.You could already use the Llama 3 models hosted on Anyscale and Groq. However, being able to use Llama-3 on Perplexity opens up new possibilities and use cases.Because the model is able to go online, it is able to:

Generate up-to-date responses
Retrieve dynamic data about the latest news etc.
Understand real-world context better

The example below showcases that only Llama-3 on perplexity is able to generate the current temperature in Amsterdam.

May 28th, 2024

Multi API key selector

You can now select which API key to use in your Playgrounds, Experiments, and Deployments.Select which API key to use with the new integration selector pill.This only works after you have set up more than one API key. Note: This feature does not apply on Azure and AWS.In the example below, I use the exact same model configuration as my fallback model. However, I will use my second API key in case the first one doesn’t work.

May 27th, 2024

Deployment test run

Preview your LLM call in the Deployment studio.This new feature allows you to quickly do a test run. You still have the option to open the same configuration in the Playground for further testing.

May 7th, 2024

Groq custom integration

Set up your custom API keys with the new Groq integration.

May 7th, 2024

Keyboard shortcuts

For the real prompt engineers in Orq, we introduced some keyboard shortcuts in the Playground module. If your cursor is located within the prompt template, you can use the hotkeys below to generate an output.cmd/ctrl + enter runs the current block.shift + cmd/ctrl + enter runs all the blocks.Feel free to contact us for other shortcut suggestions.

May 7th, 2024

New version pop-up

With the introduction of the new version pop-up, you can now choose when you’d like to do the update. This way, you won’t get interrupted in the middle of your workflow.

May 7th, 2024

Tools workflow

Quickly import a Tool into a Deployment.Also, when creating a Tool in a Deployment, you can save it by clicking Add to resources.

Create, save, and find all your Tools in the Resources tab.

April 26th, 2024

New domain for ORQ.ai

🚧 If you’re using my.orquesta.dev please start using my.orq.ai

April 26th, 2024

Version Controlled Deployments

With this new feature, you can view the changes made to a Deployment and restore to the previous version if needed.

On top of that, when deploying, users get the option to write a quick description about what they changed and if it’s a minor or major change.

April 24th, 2024

Deployment Routing

When opening a Deployment, you now immediately enter the Prompt Studio.To switch to other Variants, you can click on the variant name, which opens a dropdown menu with all the other variants that you can choose to switch to. In the same dropdown menu, you can also choose to add another variant.Under the new Routing tab, you can still see the Business Rules Engine and configure it like you’re used to.

April 24th, 2024

Prompt generator

With this new feature, we can help you turn your prompt into a more detailed and complete prompt.You can also use it to create a template which you can iterate on.

April 23rd, 2024

Tools in Experiments

Start using Tools, also known as Function Calling, in an Experiment.Once you have set up your Tools in the Resource Tab, you can add them in the configuration tab in your Experiment.A strong feature is to use Tools in combination with the Valid JSON and Valid JSON Schema Evaluators.

April 23rd, 2024

With breadcrumbs, you’ll always know in which module you’re working. On top of that, breadcrumbs allow for quick navigation within the module. See an example below.

April 23rd, 2024

Experiment aggregate metrics

With the new Experiment aggregate metrics, you can do a much more accurate comparison between models and/or configurations.These filters are crucial when doing large Experiments. The heatmap still provides you with a quick glance at what the differences are, but the filter allows you to make a more precise comparison.The filter includes the sum, average, median, min, max, range, and standard deviation.

April 23rd, 2024

Set up your Tools in Resources

Set up your Tools in the Resources tab.This update allows you to reuse your Tools across the platform. Also, if you want to use Tools in an Experiment, you have to set them up within the Resources tab first in order to select them.

April 11th, 2024

Google Gemini Pro 1.5

Start using Google’s Gemini Pro 1.5 across all three modules on Orq. The newest model from Google is not only more advanced than its predecessor but also boasts several key differentiating features.Top reasons why you should start testing Gemini Pro 1.5 today:

1 million context window! - Allowing it to understand a much greater context than before.
Complex multimodality reasoning - It’s able to understand video, audio, code, image, and text.

April 8th, 2024

Vision models are now available

Start using vision models to unlock the even greater potential of Generative AI.Vision models and image models are different. While image models create images, vision models interpret them.We currently support the following vision models:

Claude-3-haiku
Claude-3-sonnet
Claude-3-opus
GPT-4-turbo-vision
GPT-4-vision-preview
Gemini-pro-vision

April 8th, 2024

New Cohere models - Command R & Command R+

We now support the latest models from Cohere: Command R & Command R+.Command R is a highly advanced language model designed to facilitate smooth and natural conversations and perform complex tasks requiring long context. It belongs to the “scalable” category of models that provide both high performance and accuracy.The most important key strengths of Command R:

High precision on retrieval augmented generation (RAG) and tool use tasks (function calling)
Low latency and high throughput
Large context window 128,000-tokens
Multilingual capabilities across 10 key languages, with pre-training data for 13 additional languages

April 4th, 2024

Prompt Studio V2

Today we released the new Prompt Studio V2. Similar to the updated Playground V2, the Prompt Studio is designed for intuitive prompt engineering and consistency across the entire platform.The Prompt Studio can be found when clicking on a variant in the business rules engine. Within the Prompt Studio, you can configure your LLM call. It supports function calling (tools), variable inputs, retries, and fallbacks.Check out the interactive walkthrough below to test it out yourself.

April 4th, 2024

Anthropic Custom Integration

Start using your own API keys with the new Anthropic integration.

April 4th, 2024

AWS Bedrock Custom Integration

Start using your own API keys with the new AWS Bedrock integration.

March 16th, 2024

New SDK’s with breaking changes!

🚧 We recently transitioned our SDK's from@orquesta to @orq-ai across all platforms to align with our branding and focus on AI. This includes some breaking changes!

Python SDK

Installation

Before

pip install orquesta-sdk

After

pip install orq-ai-sdk

Usage

Before

Python

from orquesta_sdk import Orquesta, OrquestaClientOptions

api_key = os.environ.get("ORQUESTA_API_KEY", "__API_KEY__")

options = OrquestaClientOptions(
    api_key=api_key,
    environment="production"
)

client = Orquesta(options)

After

Python

import os

from orq_ai_sdk import OrqAI

client = OrqAI(
   api_key=os.environ.get("ORQ_API_KEY", "__API_KEY__"),
   environment="production"
)

Node SDK

Installation

Before

npm install @orquesta/node

yarn add @orquesta/node

After

npm install @orq-ai/node

yarn add @orq-ai/node

Usage

Before

Typescript

import { createClient } from '@orquesta/node';

const client = createClient({
  apiKey: 'orquesta-api-key',
  environment: 'production',
});

After

Typescript

import { createClient } from '@orq-ai/node';

const client = createClient({
  apiKey: 'orquesta-api-key',
  environment: 'production',
});

March 14th, 2024

Playground V2

Today, we launched our new Playground V2. This updated layout is designed for prompt engineering. By keeping the prompt template separate from the chat messages, it’s easier to fine-tune your prompt. When you clear the chat messages, the prompt template will still be there.Check out our new Playground V2 in the interactive walkthrough below!

March 14th, 2024

Claude 3 - Haiku, Sonnet and Opus are now available

With the launch of Claude 3, Anthropic is challenging the status quo. Its most advanced model ‘Opus’ is supposedly better than GPT-4. However, it’s also around 2.5 times more expensive than GPT-4. Haiku, the cheapest and least capable model, is around the same price as GPT-3.5 and is really fast.On another note, none of the Claude models support image and video generation. However, they do have the ability to interpret and analyze images. Moreover, they do have a default context window of 200k tokens, making it useful for larger data sets and variables.The Sonnet and Haiku models are available on Anthropic as well as on AWS Bedrock. Whereas Opus is currently only available on Anthropic.You can toggle them on in the Model Garden and try them out yourself!

March 14th, 2024

Hyperlinking

With the new Hyperlinking feature, you are able to take your use case from one module to the other. Switching between the Playground, Experiments, and Deployments allows you to make quick iterations throughout the whole platform. Whether you want to take your Playground setup to Experiments or your Deployment to Playground, it’s all possible.Example: You are running a Deployment but want to quickly test out what would happen if you change your prompt. You don’t want to change the prompt within your Deployment because it’s in production. With the new Hyperlinking feature, you can open the exact same configuration used in one module in the other. Follow these steps to try it out yourself:

Navigate towards the Logs tab
Click on the log that has the configuration that you want to iterate on
Click on the tree-dotted menu in the upper right corner
Click open in Playground
Click on ‘View Playground’
You’ll be redirected the Playground and see the exact same configuration you are using in your Deployment

March 6th, 2024

Use Groq as your LLM provider

You can now use Groq as your LLM provider. It currently hosts the following models: llama2-70b-chat, mixtral-8x7b-32768, and Gemma-7b-it.What is Groq and why should I try them out?Groq has recently gained a lot of attention for being one of the top companies providing fast LLM inference solutions. There’s a growing need among Gen AI application builders to decrease response times, as lowering latency is key for real-time AI applications. The field is getting crowded, with many companies competing in LLM inference. Groq being one of them, claims to offer inference speeds that are 18 times faster than some well-known cloud-based services. The models hosted by Groq utilize the Groq LPU™ Inference Engine, which is powered by their specialized hardware called Language Processing Units (LPUs) to enhance LLM performance.

Tokens per second

March 1st, 2024

Make your own custom LLM Evaluator

You can now create your own custom LLM evalutator. This allows you to go beyond the standard evaluators like BLEU and Valid JSON.In the example below, we made an evaluator that compares the tone of voice of the generated text against the desired tone in the reference.Read more about Experiments and Evaluators

February 28th, 2024

Use LLM as a Reference in Experiments

With this new feature, you’re able to use the output of a large language model like GPT-4 as the reference for another model like Gemma-7b and Mistal-large (see image).For most of the evaluators, you need a reference. This is because an eval like cosine similarity needs two things to compare to each other (the newly generated text and the reference text).Example use case: A new model has been released which is faster and less expensive than the model you are currently using. Although your current model is performing well, you are interested in comparing the performance of the new model. To compare the two models, you have selected your current model (GPT-4) as the reference model in the configuration. This will serve as a benchmark for the new model’s performance. When running the experiment, the reference model will be completed first, and then the output of that model will be used as a reference for the other models. To measure the similarity between the output of the newer models and the reference model, you can use an evaluator such as cosine similarity.

February 28th, 2024

New Providers & Models: Mistral large, Perplexity, Gemma 7b and other models are added

February 28th, 2024

Import from Resources in Experiments

Instead of manually adding your data sets or uploading them through a CSV file in experiments, you can now import them from the files you stored in Resources. This allows you to store and access your files in a quick and efficient manner.

February 28th, 2024

Resource management

Save your data sets, variables, and evaluators in the newly added resources tab.With this new feature, you don’t have to constantly re-upload your resources anymore.Example: After finetuning your prompt, simply save the prompt in the resources tab so you can easily access it during your next workflow.The resources are divided into 3 categories:Data sets - This is where you can store all your prompts and references. Read more about Dataset.Variables - The dynamic elements you put in between curly_brackets are listed here.Evaluators - The standard and custom evals can be stored here. Read more about why you should use evals and when to use which one in Evaluator.

February 13th, 2024

Evaluators are available on our Platform

We have added Evaluators to our platform. With a wide range of industry-standard metrics and other relevant evaluators, you can check whether or not the output of your LLM is accurate, reliable, and contextually relevant.Read more about each evaluator in Evaluators.

February 13th, 2024

Python and Node SDK’s improvements

In our latest SDK update, we’re thrilled to share a series of enhancements that significantly boost the performance and capabilities of our platform.

Support for messages in the invoke and stream functionalities

Now you can provide a new property in the invoke and the invoke with stream methods of both SDKs to combine the messages with the prompt configured in Orquesta.

Typescript

const deployment = await client.deployments.invoke({
  key: 'customer_service',
  messages: [
    {
      role: 'user',
      message:
        'A customer is asking about the latest software update features. Generate a detailed and informative response highlighting the key new features and improvements in the latest update.',
    },
  ],
  context: { environments: 'production', country: 'NLD' },
  inputs: { firstname: 'John', city: 'New York' },
  metadata: { customer_id: 'Qwtqwty90281' },
});

console.log(deployment?.choices[0].message.content);

The introduction of the new property significantly enhances various aspects of interaction:

Enhanced Contextual Clarity: The use of chat history empowers the model to preserve the context throughout a conversation. This feature ensures that each response is not only coherent but also directly relevant, as the model can draw upon past dialogues to fully grasp the nuances of the current inquiry or discussion topic.
Streamlined Conversation Flow: Chat history is instrumental in maintaining a consistent and logical flow in conversations. This prevents the occurrence of repetitive or conflicting responses, mirroring the natural progression of human dialogues and maintaining conversational integrity.
Tailored User Interactions: With access to previous interactions, chat history allows the model to customize its responses according to individual user preferences and historical queries. This level of personalization significantly boosts user engagement and satisfaction, leading to more effective and enjoyable communication experiences.

Support for messages and choices to add metrics

In Orquesta, after every request, you can add custom metrics. On top of the superset that we already support, we also added two new properties, messages and choices. The messages property is designed to log the communication exchange between the user and the model, capturing the entire conversation history for context and analysis. Meanwhile, the choices property records the different response options that the model considers or generates before presenting the final output, providing insights into the model’s decision-making process.

Typescript

deployment.addMetrics({
  messages: [
    {
      role: 'user',
      message:
        'A customer is asking about the latest software update features. Generate a detailed and informative response highlighting the key new features and improvements in the latest update.',
    },
  ],
  choices: [
    {
      index: 0,
      finish_reason: 'stop',
      message: {
        role: 'assistant',
        content:
          "Dear customer: Thank you for your interest in our latest software update! We're excited to share with you the new features and improvements we've rolled out. Here's what you can look forward to in this update",
      },
    },
  ],
});

Support for OpenAI system fingerprint

To help track OpenAI environmental or model changes, OpenAI is exposing a system_fingerprint parameter and value. If this value changes, different outputs will be generated due to changes to the OpenAI environment.In the new version of the SDK, if you are using an OpenAI model and the invoke method, the system_fingerprint will be exposed in the deployment properties.

January 25th, 2024

Setup your own API key with the new AnyScale integration.

You could already select models such as Llama from Meta and Mixtral from Mistral in the model garden. But with this release, it is now possible to connect your own API key for AnyScale. This way you can use your own account and rate limits without having to rely on a shared key. Soon you’ll be able to use your own private models and finetuning on AnyScale.

January 25th, 2024

Mass Experimentation

You could already test out different models and configurations using our playground. However, with the introduction of Experiments, you are able to do this on a much larger scale.Simply import your prompts, expected outputs, and variables, configure your model, and you’re able to do hundreds of simultaneous runs. This allows you to do a side-by-side comparison in large batches.Check out the interactive walkthrough.

January 3rd, 2024

Python and Node SDK’s now support multimodality

We improved our SDKs to support more types of models. Now, it’s possible to use completion, chat, and image models through our unified API.We also improved our API reference here to offer a better DX.Python SDKhttps://pypi.org/project/orquesta-sdk/Node SDKhttps://www.npmjs.com/package/@orquesta/node

January 3rd, 2024

Function calling supported in the Playground

For a better UX on Orquesta, we improve how to handle function calling in the Playground and the Prompt Studio.Now you can simply enter your function name, description, and parameters with the JSON schema format.

January 3rd, 2024

Billable transactions are now shown in the log details

Now, it is possible to see in the log details if the request to the LLM was executed with Orquesta API keys or your API keys.

January 3rd, 2024

Gemini, Google’s most capable model, now available on Orquesta

Now, using Gemini Pro on Orquesta through the Playground and Deployments is possible. This model also supports function calling.

January 1st, 2024

Log and metrics enhancement

We introduced a new log detail drawer for a better and more segmented data structure. This will allow our users to consume the information more easily.

January 1st, 2024

Chat history now available in the playground

If you are using the Playground, you can now utilize Playground memory. This feature enables you to add historical context when comparing different models simultaneously.For completion models, we strive to incorporate historical context, but this may occasionally lead to undesired effects in the final completions. If you encounter such issues, we recommend disabling Playground memory.

January 1st, 2024

Support for Dall-E models

Added support for Dall-E models in the Playground and Deployments for OpenAI and Azure.

January 1st, 2024

orq.ai Integrations

We added a new Integrations section to allow the use of your API keys. At the moment, we support custom API keys from the following LLM providers

OpenAI
Azure
Cohere

More providers and integrations are coming soon.

January 1st, 2024

Reset models default parameter in a Deployment

Now, it is possible to reset the parameters of the primary model and the fallback model in a Deployment.

December 4th, 2023

AWS Bedrock models available

Recently, we’ve expanded our platform’s capabilities by adding AWS Bedrock models. You can now explore these models in the Playground or integrate them into your deployments.Models

anthropic/claude-2
anthropic/claude
anthropic/claude-instant
a21labs/jurassic-2-ultra
a21labs/jurassic-2-mid
amazon/titan-text-lite
amazon/titan-text-express
cohere/command
anthropic/claude-2.1
cohere/command-light

Release Notes

​Introducing Evaluatorq

​Native support for Gemini models - URL Context Tool

​OpenTelemetry Tracing Support

​Knowledge API: New Parameters

​Support for OpenAI’s image edit and variation APIs

​File Uploads for Anthropic Models

​Comprehensive Budget Controls & Cost Management

​GPT-5 available in the Model Garden

​Preview Model Support

​Three new RAGAS Evaluators for enhanced RAG Experiments

​1. Context Recall

​2. Noise Sensitivity

​3. Context Entities Recall

​Flexbile Prompt Configuration for Experiments

​Chunking API

​Evaluators API

​Connect LiteLLM models to Orq

​Self onboard Vertex AI models

​Custom Human Review

​How to collect feedback

​Why this matters

​How to create custom Human Review

​Evaluator playground

​Attach files directly to the model

​Evaluator runs without new generation

​Reasoning field now included in the API response

​Experiment Runs overview

​Improved Experiment configuration panel

​Metadata filter on Knowledge Base Chunks

​Why this matters

​Example

​Claude 4 Sonnet & Opus

​Tool response in Playground

​OpenAI’s new o3, o4-mini, GPT-4.1 models

​When to use what model

​Rate Limits

​Agentic RAG

​Threads

​Jina AI - new embedding and rerank models

​Knowledge Base API

​Hub

​Tracing

​Billing impact - Event count

​Python Evaluators

​Benefits of Using Python for Evaluators

​Move entities

​Key Details:

​How It Works

​GPT 4.5

​Why It Matters

​What’s New?

​Additional Enhancements

​Performance & Analysis Tools

​Claude 3.7 Sonnet

​Thinking - Budget tokens Parameter

​Key highlights

​Benchmarks and performance

​Increased fallbacks

​What’s New?

​Importing prompts

​What’s New?

​Datasets v2

​What’s Changed?

​Why This Change?

​Why use Datasets v2?

​Experiments v2 - Evaluate your LLM Config

​Why Experiments Matter

​What’s New in Experiments V2?

​DeepSeek Models Now Available: 67B Chat, R1, and V3

​What model should I use?

​OpenAI’s Latest Small Reasoning Model – o3-mini

​Llama 3.3 70b & Llama Guard 3 are now available through Together AI

​Together AI

​Llama Guard 8b

​Llama 3.3 70b

​New Layout with Project Structure

​What are the benefits of the new Project Structure UI?

​Where are my old resources?

​Example structure

Introducing Evaluatorq

Native support for Gemini models - URL Context Tool

OpenTelemetry Tracing Support

Knowledge API: New Parameters

Support for OpenAI’s image edit and variation APIs

File Uploads for Anthropic Models

Comprehensive Budget Controls & Cost Management

GPT-5 available in the Model Garden

Preview Model Support

Three new RAGAS Evaluators for enhanced RAG Experiments

1. Context Recall

2. Noise Sensitivity

3. Context Entities Recall

Flexbile Prompt Configuration for Experiments

Chunking API

Evaluators API

Connect LiteLLM models to Orq

Self onboard Vertex AI models

Custom Human Review

How to collect feedback

Why this matters

How to create custom Human Review

Evaluator playground

Attach files directly to the model

Evaluator runs without new generation

Reasoning field now included in the API response

Experiment Runs overview

Improved Experiment configuration panel

Metadata filter on Knowledge Base Chunks

Why this matters

Example

Claude 4 Sonnet & Opus

Tool response in Playground

OpenAI’s new o3, o4-mini, GPT-4.1 models

When to use what model

Rate Limits

Agentic RAG

Threads

Jina AI - new embedding and rerank models

Knowledge Base API

Hub

Tracing

Billing impact - Event count

Python Evaluators

Benefits of Using Python for Evaluators

Move entities

Key Details:

How It Works

GPT 4.5

Why It Matters

What’s New?

Additional Enhancements

Performance & Analysis Tools

Claude 3.7 Sonnet

Thinking - Budget tokens Parameter

Key highlights

Benchmarks and performance

Increased fallbacks

What’s New?

Importing prompts

What’s New?

Datasets v2

What’s Changed?

Why This Change?

Why use Datasets v2?

Experiments v2 - Evaluate your LLM Config

Why Experiments Matter

What’s New in Experiments V2?

DeepSeek Models Now Available: 67B Chat, R1, and V3

What model should I use?

OpenAI’s Latest Small Reasoning Model – o3-mini

Llama 3.3 70b & Llama Guard 3 are now available through Together AI

Together AI

Llama Guard 8b

Llama 3.3 70b

New Layout with Project Structure

What are the benefits of the new Project Structure UI?

Where are my old resources?

Example structure

Can I move my entities to a different project or directory?