What is the Orq MCP?
The Orq Model Context Protocol (MCP) server gives AI code assistants direct access to the Orq.ai workspace. Its 38 specialized tools let you manage experiments, create datasets, configure evaluators, and analyze traces without leaving the IDE.
Installation
Point the assistant at the MCP server and authenticate with an API key:

| Setting | Value |
|---|---|
| Endpoint | https://my.orq.ai/v2/mcp |
| Auth Header | Authorization: Bearer YOUR_ORQ_API_KEY |
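
For readers who want to see what travels over the wire, here is a minimal sketch that builds (but does not send) a JSON-RPC 2.0 request for the endpoint and header above. The `tools/list` method name comes from the MCP specification. The `Accept` header and the omission of the `initialize` handshake are simplifying assumptions, and the helper name `build_mcp_request` is hypothetical. In normal use the code assistant's MCP client handles all of this.

```python
import json
import urllib.request
from typing import Optional

ORQ_MCP_URL = "https://my.orq.ai/v2/mcp"  # endpoint stated above


def build_mcp_request(api_key: str, method: str, params: Optional[dict] = None,
                      request_id: int = 1) -> urllib.request.Request:
    """Build (without sending) a JSON-RPC 2.0 POST for the Orq MCP endpoint."""
    body = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        body["params"] = params
    return urllib.request.Request(
        ORQ_MCP_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Assumption: the server speaks MCP's streamable HTTP transport.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


request = build_mcp_request("YOUR_ORQ_API_KEY", "tools/list")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Nothing is sent until the request is passed to `urllib.request.urlopen`, which keeps the sketch safe to run without a real key.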
Code Assistants
See detailed documentation for the following code assistants:
Claude Code
Official Anthropic CLI for Claude with MCP integration
Claude Desktop
Use Orq MCP in Claude’s desktop application
Codex
AI coding assistant with MCP protocol support
Cursor
AI-first code editor with native MCP support
VS Code
AI-powered editor with GitHub Copilot and native MCP support
Warp
AI-powered terminal with native MCP support
Key Capabilities
Agent Creation
Create, update, and configure agents with instructions, tools, models, evaluators, and guardrails
Experiment Management
Run experiments, compare prompts or models side-by-side, and export results
Dataset Operations
Create datasets, add or edit datapoints, and generate synthetic test data
Analytics & Insights
Query usage, cost, latency, and error metrics across the workspace
Evaluator & Guardrail Configuration
Create and update LLM-as-a-Judge and Python evaluators, and attach guardrails to agents
Docs Exploration
Search the Orq.ai documentation without leaving your IDE
Available Tools
The Orq MCP provides 38 tools across 11 categories:

| Category | Tool | Description |
|---|---|---|
| Agents | get_agent | Retrieve agent configuration and details |
| Agents | create_agent | Create a new agent with instructions, tools, models, evaluators, and guardrails |
| Agents | update_agent | Update an existing agent’s configuration and publish a new semantic version. Requires versionIncrement (major, minor, or patch) and versionDescription with every update |
| Agents | invoke_agent | Invoke an agent via the Responses API. Supports multi-turn via previous_response_id, variables, and background mode |
| Agents | retrieve_agent_response | Retrieve a previously created agent response by ID |
| Analytics | get_analytics_overview | Get workspace snapshot (requests, cost, tokens, errors, error rate, latency, top models) |
| Analytics | query_analytics | Flexible drill-down with filtering and grouping |
| Dataset | create_dataset | Create a new dataset |
| Dataset | list_datapoints | List datapoints in a dataset |
| Dataset | create_datapoints | Create datapoints (max 100) |
| Dataset | update_datapoint | Update a datapoint |
| Dataset | delete_datapoints | Delete datapoints (max 100) |
| Dataset | delete_dataset | Delete a dataset and all datapoints |
| Deployments | create_deployment | Create a deployment |
| Deployments | get_deployment | Retrieve a deployment by key |
| Evaluator | get_llm_eval | Retrieve an LLM-as-a-Judge evaluator configuration |
| Evaluator | get_python_eval | Retrieve a Python code evaluator configuration |
| Evaluator | create_llm_eval | Create LLM-as-a-Judge evaluator |
| Evaluator | create_python_eval | Create Python code evaluator |
| Evaluator | update_llm_eval | Update an existing LLM-as-a-Judge evaluator (prompt, model, output type) |
| Evaluator | update_python_eval | Update an existing Python code evaluator (code, output type) |
| Experiment | list_experiment_runs | List runs with cursor pagination |
| Experiment | get_experiment_run | Export run (JSON/JSONL/CSV) |
| Experiment | create_experiment | Create experiment from dataset with optional auto-run |
| Models | list_models | List available AI models by type (chat, embedding, image, tts, stt, and more) |
| Models | invoke_model | Invoke any model directly via the Responses API. Supports reasoning effort control and response content inclusion |
| Search | search_entities | Search any entity type: project, dataset, prompt, experiment, agent, evaluator, knowledge, memory store, or deployment (supports cursor pagination) |
| Search | search_directories | List directories within a project |
| Search | search_docs | Query the Orq.ai documentation for feature guidance and API reference |
| Skills | create_skill | Create a reusable skill |
| Skills | update_skill | Update an existing skill |
| Skills | get_skill | Retrieve a skill by key |
| Skills | list_skills | List all skills in the workspace |
| Skills | delete_skill | Delete a skill |
| Traces | list_traces | List traces with filtering by model, type, project, thread ID, time range, and more |
| Traces | get_span | Retrieve a single span (compact or full mode) |
| Traces | list_spans | List all spans in a trace |
| Workspace | delete_entity | Delete any entity by type and ID. Supported types: agent, prompt, experiment, evaluator, knowledge, memory_store, prompt_snippet (Skills), sheet, tool. Use delete_dataset to delete a dataset along with all its datapoints |
Examples
Building an Agent
Create an agent from scratch
The assistant will:
- Use `create_agent` with the name, instructions, and model (openai/gpt-4.1)
- Return the agent key and configuration summary

Review and update agent instructions
The assistant will:
- Use `get_agent` to retrieve the current configuration
- Display the existing instructions
- Use `update_agent` with the revised `instructions` field, `versionIncrement`, and `versionDescription`
- Confirm the update and new version
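
The update step can be sketched as a small argument builder that refuses to omit the two required version fields. The names `instructions`, `versionIncrement`, and `versionDescription` come from the tool description. The `key` argument name and both helper functions are hypothetical, and `bump_version` only illustrates standard semantic-versioning rules; how Orq.ai computes the new version server-side is not specified here.

```python
ALLOWED_INCREMENTS = ("major", "minor", "patch")


def bump_version(version: str, increment: str) -> str:
    """Apply a semantic-version bump, e.g. 1.4.2 with "minor" gives 1.5.0."""
    if increment not in ALLOWED_INCREMENTS:
        raise ValueError(f"versionIncrement must be one of {ALLOWED_INCREMENTS}")
    major, minor, patch = (int(part) for part in version.split("."))
    if increment == "major":
        return f"{major + 1}.0.0"
    if increment == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


def update_agent_arguments(agent_key: str, instructions: str,
                           increment: str, description: str) -> dict:
    """Assemble update_agent arguments, rejecting missing version fields."""
    if increment not in ALLOWED_INCREMENTS:
        raise ValueError(f"versionIncrement must be one of {ALLOWED_INCREMENTS}")
    if not description.strip():
        raise ValueError("versionDescription is required with every update")
    return {
        "key": agent_key,  # hypothetical name for the agent identifier
        "instructions": instructions,
        "versionIncrement": increment,
        "versionDescription": description,
    }


print(bump_version("1.4.2", "minor"))  # 1.5.0
print(update_agent_arguments("support-bot", "Be concise.", "patch", "Tighten tone"))
```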
Invoking a Model
Use `invoke_model` to call any model directly via the Responses API.
Parameters

| Parameter | Type | Description |
|---|---|---|
| `model` | string | Model ID in provider/model format (e.g. openai/gpt-5, openai/o3) |
| `reasoning` | object | Reasoning configuration. Supported on OpenAI GPT-5 and o-series models only. effort: none, low, medium, high, or xhigh. summary: auto, concise, or detailed |
| `include` | array | Response content to include: reasoning.encrypted_content, message.output_text.logprobs |

Call an o-series model with reasoning
The assistant will:
- Use `invoke_model` with `model: "openai/o3"` and `reasoning: { effort: "medium", summary: "concise" }`
- Return the model response along with the reasoning summary

Include encrypted reasoning content
The assistant will:
- Use `invoke_model` with `model: "openai/gpt-5"` and `include: ["reasoning.encrypted_content"]`
- Return the response with the encrypted reasoning block attached
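
As an illustration of the parameter table, the sketch below assembles `invoke_model` arguments and validates them against the values listed there. The `model`, `reasoning`, and `include` names and their allowed values are taken from the table. The `input` field for the prompt text and the helper name are assumptions, since the prompt parameter is not listed here.

```python
EFFORTS = {"none", "low", "medium", "high", "xhigh"}
SUMMARIES = {"auto", "concise", "detailed"}
INCLUDES = {"reasoning.encrypted_content", "message.output_text.logprobs"}


def invoke_model_arguments(model, prompt, effort=None, summary=None, include=()):
    """Build an invoke_model argument dict, checking values from the table."""
    if "/" not in model:
        raise ValueError("model must use provider/model format, e.g. openai/o3")
    args = {"model": model, "input": prompt}  # "input" is an assumed name
    reasoning = {}
    if effort is not None:
        if effort not in EFFORTS:
            raise ValueError(f"effort must be one of {sorted(EFFORTS)}")
        reasoning["effort"] = effort
    if summary is not None:
        if summary not in SUMMARIES:
            raise ValueError(f"summary must be one of {sorted(SUMMARIES)}")
        reasoning["summary"] = summary
    if reasoning:
        args["reasoning"] = reasoning
    unknown = set(include) - INCLUDES
    if unknown:
        raise ValueError(f"unsupported include values: {sorted(unknown)}")
    if include:
        args["include"] = list(include)
    return args


print(invoke_model_arguments("openai/o3", "Summarize the incident report.",
                             effort="medium", summary="concise"))
print(invoke_model_arguments("openai/gpt-5", "Summarize the incident report.",
                             include=["reasoning.encrypted_content"]))
```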
Investigating Traces
Find errors from the last 24 hours
The assistant will:
- Calculate the Unix timestamp for 24 hours ago
- Use `list_traces` with filter `status:=ERROR && timestamp:>TIMESTAMP` and sort by `timestamp:desc`
- Display trace IDs, names, durations, and timestamps
- Summarize the most common error types and their frequency

Detect regressions after a model switch
The assistant will:
- Use `query_analytics` with `metric: "latency"` and `group_by: ["model"]` for the period before the switch
- Repeat for the period after the switch
- Compare average latency per model across both windows and surface any regressions

Find the slowest traces
The assistant will:
- Use `list_traces` sorted by `duration_ms:desc`, filtered to today, limit 5
- Use `list_spans` with each `trace_id` to retrieve the full span tree
- Surface bottlenecks and latency outliers

Filter traces by thread ID
The assistant will:
- Use `list_traces` with `thread_id: "thread_abc123"`
- Return all traces associated with that conversation thread
- Surface turn count, total cost, and any errors across the session
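
The first scenario's timestamp arithmetic can be sketched as follows. The filter and sort expressions are the ones quoted in the steps. Whether `list_traces` expects seconds or milliseconds is not stated, so seconds are assumed here, and the `filter` and `sort_by` argument names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def error_filter(now: datetime, hours: int = 24) -> str:
    """Return the list_traces filter for errors in the last `hours` hours."""
    # Assumption: the timestamp is expressed in whole seconds since the epoch.
    cutoff = int((now - timedelta(hours=hours)).timestamp())
    return f"status:=ERROR && timestamp:>{cutoff}"


def list_traces_arguments(now: datetime) -> dict:
    """Pair the filter with the descending timestamp sort from the steps."""
    return {
        "filter": error_filter(now),   # hypothetical argument name
        "sort_by": "timestamp:desc",   # hypothetical argument name
    }


now = datetime(2025, 1, 2, 0, 0, tzinfo=timezone.utc)
print(error_filter(now))  # status:=ERROR && timestamp:>1735689600
```

Passing `datetime.now(timezone.utc)` instead of the fixed value gives the live 24-hour window.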
Running Experiments
Compare two models on an existing dataset
The assistant will:
- Search for the “user-queries” dataset using `search_entities`
- Use `create_experiment` with two model configurations and `auto_run` enabled
- Return the experiment ID once both configurations have run

Compare two prompt strategies
The assistant will:
- Search for the dataset using `search_entities`
- Use `create_experiment` with two prompt variants and `auto_run` enabled
- Use `get_experiment_run` to retrieve evaluation metrics
- Compare the variants and summarize which performed better

Export experiment results
The assistant will:
- Use `list_experiment_runs` to find the most recent run
- Use `get_experiment_run` with CSV export format
- Return a signed download URL for the CSV file
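
Because `list_experiment_runs` uses cursor pagination, finding the most recent run may take more than one call. The sketch below shows the general cursor-following pattern with an offline stand-in for the tool. The response shape (`items`, `next_cursor`, `created_at`) and the sample pages are invented for illustration and are not the documented schema.

```python
from typing import Callable, Iterator, Optional


def iter_runs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every run by following cursors until none is returned."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Stand-in for list_experiment_runs so the sketch runs offline (made-up data).
FAKE_PAGES = {
    None: {"items": [{"id": "run_1", "created_at": "2025-01-01T10:00:00Z"}],
           "next_cursor": "c2"},
    "c2": {"items": [{"id": "run_2", "created_at": "2025-01-03T09:30:00Z"}],
           "next_cursor": None},
}

runs = list(iter_runs(lambda cursor: FAKE_PAGES[cursor]))
# ISO 8601 timestamps in one format and time zone sort correctly as strings.
latest = max(runs, key=lambda run: run["created_at"])
print(latest["id"])  # run_2
```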
Managing Datasets
Create a synthetic dataset
The assistant will:
- Generate 50 synthetic question/answer pairs
- Use `create_dataset` to create the dataset
- Use `create_datapoints` to add all entries in bulk, each formatted as `{ inputs: { question: "..." }, expected_output: "..." }`

Import data from code
The assistant will:
- Parse the JSON from the selection or context
- Use `create_dataset` with an appropriate name
- Use `create_datapoints` to add each entry as a datapoint

Update or clean up a dataset
The assistant will:
- Use `search_entities` to find the “staging-tests” dataset and retrieve its ID
- Use `list_datapoints` to retrieve all entries
- Filter for datapoints with empty `expected_output`
- Use `delete_datapoints` to remove them in batches
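
Both `create_datapoints` and `delete_datapoints` are listed with a maximum of 100 items per call, so bulk operations need batching. A minimal sketch of the datapoint shape, the batching, and the cleanup filter follows. The `id` field on a listed datapoint and the helper names are assumptions.

```python
from typing import Iterator

MAX_BATCH = 100  # per-call limit listed for create_datapoints / delete_datapoints


def to_datapoint(question: str, answer: str) -> dict:
    """Format one entry in the shape quoted in the steps above."""
    return {"inputs": {"question": question}, "expected_output": answer}


def batches(items: list, size: int = MAX_BATCH) -> Iterator[list]:
    """Split items into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ids_missing_expected_output(datapoints: list) -> list:
    """Collect IDs of datapoints whose expected_output is absent or blank."""
    return [
        dp["id"]  # "id" is an assumed field name
        for dp in datapoints
        if not (dp.get("expected_output") or "").strip()
    ]


points = [to_datapoint(f"Question {i}?", f"Answer {i}") for i in range(250)]
print([len(chunk) for chunk in batches(points)])  # [100, 100, 50]
```

Each chunk would then be passed to one `create_datapoints` or `delete_datapoints` call.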
Evaluators
Retrieve an evaluator’s configuration
The assistant will:
- Search for the evaluator using `search_entities` to resolve its ID
- Use `get_llm_eval` or `get_python_eval` to retrieve the full configuration
- Display the prompt, model, output type, and other settings

Create an LLM-as-a-Judge evaluator
The assistant will:
- Use `create_llm_eval` with a scoring rubric for tone classification
- Confirm the evaluator ID and configuration

Create a Python evaluator
The assistant will:
- Write a Python snippet that parses the response and validates JSON structure
- Use `create_python_eval` to register it in the workspace

Create an experiment with evaluators
The assistant will:
- Search for the dataset using `search_entities`
- Use `search_entities` to find the evaluator and get its key, or use the key returned by `create_llm_eval` / `create_python_eval` if created in the same session
- Use `create_experiment` with both the dataset ID and evaluator ID, with `auto_run` enabled

Update an existing evaluator
The assistant will:
- Search for the evaluator using `search_entities`
- Use `update_llm_eval` with the evaluator ID, updated `prompt`, and `output_type: "boolean"`
- Confirm the new configuration
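
The "validates JSON structure" snippet from the Python evaluator scenario might look like the sketch below. It covers only the validation logic with a boolean result. The entry-point name, the way Orq.ai passes the model response into a Python evaluator, and the `required_keys` default are assumptions and would need to match the evaluator contract in the Orq.ai documentation.

```python
import json


def validate_json_response(response, required_keys=("answer",)) -> bool:
    """Return True if the response parses as a JSON object with the given keys."""
    try:
        parsed = json.loads(response)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


print(validate_json_response('{"answer": "42"}'))   # True
print(validate_json_response("not json at all"))    # False
print(validate_json_response('["answer"]'))         # False
```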
Managing Entities
Delete a workspace entity
The assistant will:
- Search for the experiment using `search_entities`
- Use `delete_entity` with `type: "experiment"` and the resolved ID
- Confirm deletion

Supported `type` values: agent, prompt, experiment, evaluator, knowledge, memory_store, prompt_snippet (Skills), sheet, tool. Use `delete_dataset` to delete a dataset along with all its datapoints.
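
The routing rule in the note above (datasets go through `delete_dataset`, everything else through `delete_entity`) can be sketched as a small planner. The supported type list and the `type` argument are taken from the text. The `id` argument name and the `plan_delete` helper are hypothetical.

```python
DELETE_ENTITY_TYPES = {
    "agent", "prompt", "experiment", "evaluator", "knowledge",
    "memory_store", "prompt_snippet", "sheet", "tool",
}


def plan_delete(entity_type: str, entity_id: str) -> dict:
    """Pick the tool and arguments for deleting one entity."""
    if entity_type == "dataset":
        # Datasets are removed together with their datapoints by delete_dataset.
        return {"tool": "delete_dataset", "arguments": {"id": entity_id}}
    if entity_type not in DELETE_ENTITY_TYPES:
        raise ValueError(f"unsupported type for delete_entity: {entity_type!r}")
    return {
        "tool": "delete_entity",
        "arguments": {"type": entity_type, "id": entity_id},  # "id" name assumed
    }


print(plan_delete("experiment", "exp_123"))
print(plan_delete("dataset", "ds_456"))
```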
Documentation Search
Look up a feature in the Orq.ai docs
The assistant will:
- Use `search_docs` with a relevant query
- Return matching documentation sections with guidance and examples
- Summarize the answer in context

Get started with a specific product area
The assistant will:
- Use `search_docs` to find Router onboarding content
- Return setup steps, configuration options, and quick-start examples
Analytics
Get a workspace snapshot
The assistant will:
- Use `get_analytics_overview` with a 7-day range
- Return total requests, cost, tokens, error rate, latency, and top models

Drill into a specific model’s performance
The assistant will:
- Use `query_analytics` with `metric: "errors"`, filtered by model and a 7-day range
- Use `query_analytics` with `metric: "cost"`, filtered by model and a 7-day range
- Surface error rate trends and cost breakdown side by side

Identify the most expensive models
The assistant will:
- Use `query_analytics` with `metric: "cost"`, `group_by: ["model"]`, and a 30-day range
- Aggregate cost per model across all time buckets and rank them by total spend
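
The final aggregation step is plain arithmetic over whatever `query_analytics` returns. The sketch below assumes each time bucket arrives as a row with `model` and `cost` fields. That row shape and the sample values are invented for illustration and are not the documented response format.

```python
from collections import defaultdict


def rank_models_by_cost(buckets: list) -> list:
    """Sum cost per model across all time buckets, highest spend first."""
    totals = defaultdict(float)
    for bucket in buckets:
        totals[bucket["model"]] += bucket["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up rows standing in for a grouped query_analytics result.
sample = [
    {"model": "openai/gpt-5", "cost": 1.5},
    {"model": "openai/o3", "cost": 4.0},
    {"model": "openai/gpt-5", "cost": 0.5},
]
print(rank_models_by_cost(sample))  # [('openai/o3', 4.0), ('openai/gpt-5', 2.0)]
```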
Skills
Orq Skills layer pre-built multi-step workflows on top of these MCP tools: build agents, run experiments, analyze trace failures, and more with a single command.
Orq Skills
Pre-built workflows and slash commands for the full Build, Evaluate, Optimize lifecycle