What is the Orq MCP?

The Orq Model Context Protocol (MCP) server provides AI code assistants with direct access to your Orq.ai workspace. With 24 specialized tools, you can manage experiments, create datasets, configure evaluators, and analyze traces without leaving your IDE.

Quickstart

Point your assistant at the MCP server and authenticate with your API key:
Endpoint: https://my.orq.ai/v2/mcp
Auth Header: Authorization: Bearer YOUR_ORQ_API_KEY
For assistant-specific setup, select your assistant in the Code Assistants section below.
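If you want to verify connectivity outside an assistant, here is a minimal sketch of what the transport looks like, assuming the server speaks the standard MCP streamable-HTTP JSON-RPC transport (your assistant normally handles this handshake for you):

```python
import json
import os
import urllib.request

# Standard MCP JSON-RPC request asking the server to list its tools.
payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
headers = {
    "Authorization": f"Bearer {os.environ.get('ORQ_API_KEY', 'YOUR_ORQ_API_KEY')}",
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}

req = urllib.request.Request(
    "https://my.orq.ai/v2/mcp",
    data=json.dumps(payload).encode(),
    headers=headers,
    method="POST",
)
# Uncomment to actually send the request (requires a valid ORQ_API_KEY):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
print(req.get_method(), req.full_url)
```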

Key Capabilities

Agent Creation

Create and configure agents with custom instructions, tools, models, evaluators, and guardrails directly from the conversation

Experiment Management

Create and run experiments, configure task columns with prompts or agents, and export results in multiple formats

Dataset Operations

Create synthetic datasets, reshape local data, manage datapoints, and map data to experiments

Analytics & Insights

Query workspace analytics, track performance metrics, and ask natural language questions about your traces

Evaluator & Guardrail Configuration

Create LLM-as-a-Judge evaluators, Python code evaluators, and attach guardrails for automated quality assessment and runtime safety

Available Tools

The Orq MCP provides 24 tools across 9 categories:
| Category | Tool | Description |
| --- | --- | --- |
| Agents | get_agent | Retrieve agent configuration and details |
| Agents | create_agent | Create a new agent with instructions, tools, models, evaluators, and guardrails |
| Agents | update_agent | Update an existing agent's configuration (instructions, model, tools, evaluators, guardrails) |
| Analytics | get_analytics_overview | Get a workspace snapshot (requests, cost, tokens, errors, latency, top models) |
| Analytics | query_analytics | Flexible drill-down with filtering and grouping |
| Dataset | create_dataset | Create a new dataset |
| Dataset | list_datapoints | List datapoints in a dataset |
| Dataset | create_datapoints | Create datapoints (max 100 per call) |
| Dataset | update_datapoint | Update a datapoint |
| Dataset | delete_datapoints | Delete datapoints (max 100 per call) |
| Dataset | delete_dataset | Delete a dataset and all its datapoints |
| Evaluator | create_llm_eval | Create an LLM-as-a-Judge evaluator |
| Evaluator | create_python_eval | Create a Python code evaluator |
| Experiment | list_experiment_runs | List runs with cursor pagination |
| Experiment | get_experiment_run | Export a run (JSON/JSONL/CSV) |
| Experiment | create_experiment | Create an experiment from a dataset, with optional auto-run |
| Models | list_models | List all available AI models |
| Registry | list_registry_keys | List available attribute keys for filtering traces |
| Registry | list_registry_values | List top values for a specific attribute |
| Search | search_entities | Search projects, datasets, prompts, or experiments |
| Search | search_directories | List directories within a project |
| Traces | list_traces | List traces with filtering and sorting |
| Traces | get_span | Retrieve a single span (compact or full mode) |
| Traces | list_spans | List all spans in a trace |

Examples

Find errors from the last 24 hours
Show me all traces with errors from the last 24 hours
The assistant will:
  1. Calculate the unix timestamp for 24 hours ago
  2. Use list_traces with filter status:=ERROR && timestamp:>TIMESTAMP and sort by timestamp:desc
  3. Display trace IDs, names, durations, and timestamps
  4. Summarize the most common error types and their frequency
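The timestamp arithmetic behind steps 1 and 2 can be sketched as follows (the filter grammar mirrors the example above; treat it as illustrative rather than a formal spec):

```python
import time

# Unix timestamp for 24 hours ago.
cutoff = int(time.time()) - 24 * 60 * 60

# Filter and sort arguments as they would be passed to list_traces.
filter_expr = f"status:=ERROR && timestamp:>{cutoff}"
sort_expr = "timestamp:desc"
print(filter_expr, sort_expr)
```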

Detect regressions after a model switch
After switching models yesterday, has latency increased or stabilized?
The assistant will:
  1. Use query_analytics with metric: "latency" and group_by: ["model"] for the period before the switch
  2. Repeat for the period after the switch
  3. Compare average latency per model across both windows and surface any regressions
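Step 3 boils down to comparing per-model averages across the two windows; a minimal sketch (the model names and per-model averages here are made up, not the exact query_analytics response shape):

```python
# Hypothetical per-model average latency (ms) for each window.
before = {"gpt-5.2": 820.0, "claude-sonnet": 910.0}
after = {"gpt-5.2": 835.0, "claude-sonnet": 1240.0}

# Flag models whose average latency grew by more than 10%.
regressions = {
    model: (before[model], latency)
    for model, latency in after.items()
    if model in before and latency > before[model] * 1.10
}
print(regressions)  # only claude-sonnet crosses the 10% threshold
```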

Find the slowest traces
Find the 5 slowest traces from today and show me their span details
The assistant will:
  1. Use list_traces sorted by duration_ms:desc, filtered to today, limit 5
  2. Use list_spans with each trace_id to retrieve the full span tree
  3. Surface bottlenecks and latency outliers

Compare two models on an existing dataset
Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 using the "user-queries" dataset
The assistant will:
  1. Search for the “user-queries” dataset using search_entities
  2. Use create_experiment with two model configurations and auto_run enabled
  3. Return the experiment ID once both configurations have run

Compare two prompt strategies
Create an experiment using the "customer-feedback" dataset with two prompts: one focused on empathy and one on brevity. Run it and summarize the results.
The assistant will:
  1. Search for the dataset using search_entities
  2. Use create_experiment with two prompt variants and auto_run enabled
  3. Use get_experiment_run to retrieve evaluation metrics
  4. Compare the variants and summarize which performed better

Export experiment results
Export the latest experiment run as CSV
The assistant will:
  1. Use list_experiment_runs to find the most recent run
  2. Use get_experiment_run with CSV export format
  3. Return a signed download URL for the CSV file

Create a synthetic dataset
Generate 50 realistic customer support questions about a SaaS product and create a dataset called "Support Training Data"
The assistant will:
  1. Generate 50 synthetic question/answer pairs
  2. Use create_dataset to create the dataset
  3. Use create_datapoints to add all entries in bulk, each formatted as { inputs: { question: "..." }, expected_output: "..." }
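The datapoint shape and the 100-item limit from step 3 can be sketched like this (field names follow the format shown above; the sample content is illustrative):

```python
# Build datapoints in the { inputs, expected_output } shape used above.
pairs = [
    (f"How do I reset my password? (variant {i})", "Go to Settings > Security.")
    for i in range(50)
]
datapoints = [
    {"inputs": {"question": q}, "expected_output": a}
    for q, a in pairs
]

# create_datapoints accepts at most 100 entries per call, so chunk the list.
batches = [datapoints[i:i + 100] for i in range(0, len(datapoints), 100)]
print(len(datapoints), [len(b) for b in batches])
```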

Import data from code
Create a dataset from the JSON array above and add it to my workspace
The assistant will:
  1. Parse the JSON from your selection or context
  2. Use create_dataset with an appropriate name
  3. Use create_datapoints to add each entry as a datapoint

Update or clean up a dataset
Delete all datapoints in the "staging-tests" dataset that have an empty expected_output field
The assistant will:
  1. Use search_entities to find the “staging-tests” dataset and retrieve its ID
  2. Use list_datapoints to retrieve all entries
  3. Filter for datapoints with empty expected_output
  4. Use delete_datapoints to remove them in batches
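Steps 3 and 4 amount to filtering on an empty field and chunking the deletes; a sketch, assuming datapoints carry an id and the expected_output field shown earlier:

```python
# Hypothetical datapoints as returned by list_datapoints.
datapoints = [
    {"id": "dp-1", "expected_output": "A valid answer"},
    {"id": "dp-2", "expected_output": ""},
    {"id": "dp-3", "expected_output": None},
]

# Collect IDs whose expected_output is empty or missing.
empty_ids = [dp["id"] for dp in datapoints if not dp.get("expected_output")]

# delete_datapoints removes at most 100 IDs per call, so chunk them.
delete_batches = [empty_ids[i:i + 100] for i in range(0, len(empty_ids), 100)]
print(empty_ids, delete_batches)
```
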
Create an LLM-as-a-Judge evaluator
Create an LLM-as-a-Judge evaluator that scores responses on tone: professional, neutral, or aggressive
The assistant will:
  1. Use create_llm_eval with a scoring rubric for tone classification
  2. Confirm the evaluator ID and configuration

Create a Python evaluator
Create a Python evaluator that checks whether the response contains a valid JSON object
The assistant will:
  1. Write a Python snippet that parses the response and validates JSON structure
  2. Use create_python_eval to register it in your workspace
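The core of such an evaluator is only a few lines of Python; a sketch of the validation logic (the exact function signature Orq expects for Python evaluators may differ, so treat this as illustrative):

```python
import json

def is_valid_json_object(response: str) -> bool:
    """Return True if the response parses as a JSON object (not a list or scalar)."""
    try:
        return isinstance(json.loads(response), dict)
    except (json.JSONDecodeError, TypeError):
        return False

print(is_valid_json_object('{"answer": 42}'))  # True
print(is_valid_json_object("not json"))        # False
print(is_valid_json_object("[1, 2, 3]"))       # a list, so False
```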

Create an experiment with evaluators
Create an experiment from the "qa-dataset" dataset with the "tone-scorer" evaluator attached
The assistant will:
  1. Search for the dataset using search_entities
  2. Resolve the evaluator ID: copy it from the Orq.ai UI, or reuse the ID returned by create_llm_eval / create_python_eval if the evaluator was created in the same session
  3. Use create_experiment with both the dataset ID and evaluator ID, with auto_run enabled

Get a workspace snapshot
Give me an overview of my workspace metrics for the last 7 days
The assistant will:
  1. Use get_analytics_overview with a 7-day range
  2. Return total requests, cost, tokens, error rate, latency, and top models

Drill into a specific model’s performance
How has gpt-5.2 performed this week? Focus on error rate and cost.
The assistant will:
  1. Use query_analytics with metric: "errors", filtered by model and a 7-day range
  2. Use query_analytics with metric: "cost", filtered by model and a 7-day range
  3. Surface error rate trends and cost breakdown side by side

Identify your most expensive models
Which models are costing the most this month?
The assistant will:
  1. Use query_analytics with metric: "cost", group_by: ["model"], and a 30-day range
  2. Aggregate cost per model across all time buckets and rank them by total spend
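The aggregation in step 2 is straightforward to sketch, assuming query_analytics returns one cost row per time bucket per model (the row shape and figures here are made up):

```python
from collections import defaultdict

# Hypothetical time-bucketed cost rows grouped by model.
rows = [
    {"model": "gpt-5.2", "cost": 12.40},
    {"model": "claude-sonnet", "cost": 9.10},
    {"model": "gpt-5.2", "cost": 15.25},
    {"model": "claude-sonnet", "cost": 4.05},
]

# Sum cost per model across all buckets, then rank by total spend.
totals = defaultdict(float)
for row in rows:
    totals[row["model"]] += row["cost"]

ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # gpt-5.2 ranks first
```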

Code Assistants