What is the Orq MCP?

The Orq Model Context Protocol (MCP) server provides AI code assistants with direct access to your Orq.ai workspace. With 30 specialized tools, you can manage experiments, create datasets, configure evaluators, and analyze traces without leaving your IDE.

Key Capabilities

Agent Creation

Create, update, and configure agents with instructions, tools, models, evaluators, and guardrails

Experiment Management

Run experiments, compare prompts or models side-by-side, and export results

Dataset Operations

Create datasets, add or edit datapoints, and generate synthetic test data

Analytics & Insights

Query usage, cost, latency, and error metrics across your workspace

Evaluator & Guardrail Configuration

Create and update LLM-as-a-Judge and Python evaluators, and attach guardrails to agents

Docs Exploration

Search the Orq.ai documentation without leaving your IDE

Quickstart

Point your assistant at the MCP server and authenticate with your API key:
Endpoint: https://my.orq.ai/v2/mcp
Auth header: Authorization: Bearer YOUR_ORQ_API_KEY
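The exact setup steps differ per assistant (see the pages below), but most MCP clients accept a config entry along these lines. This is a hedged sketch: the "mcpServers" key follows the Claude Desktop-style convention, the exact file location and schema vary by client, and YOUR_ORQ_API_KEY is a placeholder.

```python
import json

# Hedged sketch of a generic MCP client config entry for the Orq server.
# "mcpServers" follows the Claude Desktop-style convention; the exact file
# location and schema vary by assistant. YOUR_ORQ_API_KEY is a placeholder.
config = {
    "mcpServers": {
        "orq": {
            "url": "https://my.orq.ai/v2/mcp",
            "headers": {"Authorization": "Bearer YOUR_ORQ_API_KEY"},
        }
    }
}
print(json.dumps(config["mcpServers"], indent=2))
```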

Code Assistants

See detailed documentation for the following code assistants:

Claude Code

Official Anthropic CLI for Claude with MCP integration

Claude Desktop

Use Orq MCP in Claude’s desktop application

Codex

AI coding assistant with MCP protocol support

Cursor

AI-first code editor with native MCP support

Warp

AI-powered terminal with native MCP support

Available Tools

The Orq MCP provides 30 tools across 10 categories:
| Category | Tool | Description |
| --- | --- | --- |
| Agents | get_agent | Retrieve agent configuration and details |
| Agents | create_agent | Create a new agent with instructions, tools, models, evaluators, and guardrails |
| Agents | update_agent | Update an existing agent’s configuration (instructions, model, tools, evaluators, guardrails) |
| Analytics | get_analytics_overview | Get a workspace snapshot (requests, cost, tokens, errors, error rate, latency, top models) |
| Analytics | query_analytics | Flexible drill-down with filtering and grouping |
| Dataset | create_dataset | Create a new dataset |
| Dataset | list_datapoints | List datapoints in a dataset |
| Dataset | create_datapoints | Create datapoints (max 100) |
| Dataset | update_datapoint | Update a datapoint |
| Dataset | delete_datapoints | Delete datapoints (max 100) |
| Dataset | delete_dataset | Delete a dataset and all its datapoints |
| Evaluator | get_llm_eval | Retrieve an LLM-as-a-Judge evaluator configuration |
| Evaluator | get_python_eval | Retrieve a Python code evaluator configuration |
| Evaluator | create_llm_eval | Create an LLM-as-a-Judge evaluator |
| Evaluator | create_python_eval | Create a Python code evaluator |
| Evaluator | update_llm_eval | Update an existing LLM-as-a-Judge evaluator (prompt, model, output type) |
| Evaluator | update_python_eval | Update an existing Python code evaluator (code, output type) |
| Experiment | list_experiment_runs | List runs with cursor pagination |
| Experiment | get_experiment_run | Export a run (JSON/JSONL/CSV) |
| Experiment | create_experiment | Create an experiment from a dataset, with optional auto-run |
| Models | list_models | List available AI models by type (chat, embedding, image, tts, stt, and more) |
| Registry | list_registry_keys | List available attribute keys for filtering traces |
| Registry | list_registry_values | List top values for a specific attribute |
| Search | search_entities | Search any entity type: project, dataset, prompt, experiment, agent, evaluator, knowledge, memory store, or deployment (supports cursor pagination) |
| Search | search_directories | List directories within a project |
| Search | search_docs | Query the Orq.ai documentation for feature guidance and API reference |
| Traces | list_traces | List traces with filtering by model, type, project, thread ID, time range, and more |
| Traces | get_span | Retrieve a single span (compact or full mode) |
| Traces | list_spans | List all spans in a trace |
| Workspace | delete_entity | Delete any entity by type and ID. Supported types: agent, prompt, experiment, evaluator, knowledge, memory_store, prompt_snippet, sheet, tool. Use delete_dataset to delete a dataset along with all its datapoints |

Examples

Find errors from the last 24 hours
Show me all traces with errors from the last 24 hours
The assistant will:
  1. Calculate the unix timestamp for 24 hours ago
  2. Use list_traces with filter status:=ERROR && timestamp:>TIMESTAMP and sort by timestamp:desc
  3. Display trace IDs, names, durations, and timestamps
  4. Summarize the most common error types and their frequency
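The first two steps above can be sketched in Python. The filter syntax is taken verbatim from the steps; the list_traces call itself is made by the assistant and is not shown here.

```python
import time

# Unix timestamp for 24 hours ago, then the list_traces filter string from
# the steps above (status:=ERROR && timestamp:>TIMESTAMP), sorted descending.
cutoff = int(time.time()) - 24 * 60 * 60
trace_filter = f"status:=ERROR && timestamp:>{cutoff}"
sort_order = "timestamp:desc"
print(trace_filter, sort_order)
```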

Detect regressions after a model switch
After switching models yesterday, has latency increased or stabilized?
The assistant will:
  1. Use query_analytics with metric: "latency" and group_by: ["model"] for the period before the switch
  2. Repeat for the period after the switch
  3. Compare average latency per model across both windows and surface any regressions
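The comparison in step 3 amounts to averaging latency per window. The {model: [latency per bucket]} shape and the numbers below are illustrative assumptions, not the actual query_analytics response format.

```python
# Illustrative sketch: average latency before vs. after a model switch.
# The {model: [latency_ms per time bucket]} shape is an assumption about
# how the assistant would tabulate query_analytics results.
def avg(values):
    return sum(values) / len(values)

before = {"model-a": [820, 790, 840]}    # window before the switch (ms)
after = {"model-b": [1150, 1230, 1180]}  # window after the switch (ms)

baseline = avg(list(before.values())[0])
for model, latencies in after.items():
    delta = avg(latencies) - baseline
    verdict = "possible regression" if delta > 0 else "stable"
    print(f"{model}: {avg(latencies):.0f} ms ({delta:+.0f} ms, {verdict})")
```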

Find the slowest traces
Find the 5 slowest traces from today and show me their span details
The assistant will:
  1. Use list_traces sorted by duration_ms:desc, filtered to today, limit 5
  2. Use list_spans with each trace_id to retrieve the full span tree
  3. Surface bottlenecks and latency outliers
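Step 1's sort-and-limit is equivalent to the snippet below. The trace data is made up for illustration; in practice list_traces applies duration_ms:desc server-side.

```python
# Illustrative data showing what the duration_ms:desc sort with limit 5
# selects. In practice list_traces sorts server-side; this mirrors it locally.
traces = [{"id": f"tr_{i}", "duration_ms": ms}
          for i, ms in enumerate([420, 9800, 150, 7300, 2500, 12000, 610])]

slowest = sorted(traces, key=lambda t: t["duration_ms"], reverse=True)[:5]
print([t["id"] for t in slowest])
```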

Filter traces by thread ID
Show me all traces for thread ID thread_abc123
The assistant will:
  1. Use list_traces with thread_id: "thread_abc123"
  2. Return all traces associated with that conversation thread
  3. Surface turn count, total cost, and any errors across the session

Compare two models on an existing dataset
Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 using the "user-queries" dataset
The assistant will:
  1. Search for the “user-queries” dataset using search_entities
  2. Use create_experiment with two model configurations and auto_run enabled
  3. Return the experiment ID once both configurations have run

Compare two prompt strategies
Create an experiment using the "customer-feedback" dataset with two prompts: one focused on empathy and one on brevity. Run it and summarize the results.
The assistant will:
  1. Search for the dataset using search_entities
  2. Use create_experiment with two prompt variants and auto_run enabled
  3. Use get_experiment_run to retrieve evaluation metrics
  4. Compare the variants and summarize which performed better

Export experiment results
Export the latest experiment run as CSV
The assistant will:
  1. Use list_experiment_runs to find the most recent run
  2. Use get_experiment_run with CSV export format
  3. Return a signed download URL for the CSV file

Create a synthetic dataset
Generate 50 realistic customer support questions about a SaaS product and create a dataset called "Support Training Data"
The assistant will:
  1. Generate 50 synthetic question/answer pairs
  2. Use create_dataset to create the dataset
  3. Use create_datapoints to add all entries in bulk, each formatted as { inputs: { question: "..." }, expected_output: "..." }
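Because create_datapoints accepts at most 100 datapoints per call, the bulk insert in step 3 batches the payloads roughly like this. The question/answer text is placeholder data.

```python
# Format synthetic entries as datapoints and batch them for create_datapoints,
# which accepts at most 100 datapoints per call. Text here is placeholder data.
pairs = [(f"How do I reset my password? (variant {i})",
          "Go to Settings > Security and choose Reset Password.")
         for i in range(50)]
datapoints = [{"inputs": {"question": q}, "expected_output": a} for q, a in pairs]

MAX_BATCH = 100
batches = [datapoints[i:i + MAX_BATCH] for i in range(0, len(datapoints), MAX_BATCH)]
print(len(batches), len(batches[0]))  # 50 entries fit in a single batch
```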

Import data from code
Create a dataset from the JSON array above and add it to my workspace
The assistant will:
  1. Parse the JSON from your selection or context
  2. Use create_dataset with an appropriate name
  3. Use create_datapoints to add each entry as a datapoint

Update or clean up a dataset
Delete all datapoints in the "staging-tests" dataset that have an empty expected_output field
The assistant will:
  1. Use search_entities to find the “staging-tests” dataset and retrieve its ID
  2. Use list_datapoints to retrieve all entries
  3. Filter for datapoints with empty expected_output
  4. Use delete_datapoints to remove them in batches
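The filtering in step 3 and the batching in step 4 look roughly like this. The datapoint shape ({"id", "expected_output"}) is an assumption about what list_datapoints returns, and delete_datapoints accepts at most 100 IDs per call.

```python
# Keep IDs of datapoints whose expected_output is empty, then split them into
# delete_datapoints batches of at most 100. The datapoint shape is assumed.
datapoints = [
    {"id": "dp_1", "expected_output": "A valid answer"},
    {"id": "dp_2", "expected_output": ""},
    {"id": "dp_3", "expected_output": None},
]

empty_ids = [d["id"] for d in datapoints if not d.get("expected_output")]
batches = [empty_ids[i:i + 100] for i in range(0, len(empty_ids), 100)]
print(empty_ids)
```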

Retrieve an evaluator’s configuration
Show me the current configuration for the "tone-scorer" evaluator
The assistant will:
  1. Search for the evaluator using search_entities to resolve its ID
  2. Use get_llm_eval or get_python_eval to retrieve the full configuration
  3. Display the prompt, model, output type, and other settings

Create an LLM-as-a-Judge evaluator
Create an LLM-as-a-Judge evaluator that scores responses on tone: professional, neutral, or aggressive
The assistant will:
  1. Use create_llm_eval with a scoring rubric for tone classification
  2. Confirm the evaluator ID and configuration

Create a Python evaluator
Create a Python evaluator that checks whether the response contains a valid JSON object
The assistant will:
  1. Write a Python snippet that parses the response and validates JSON structure
  2. Use create_python_eval to register it in your workspace
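The snippet in step 1 might look like the sketch below. The evaluate() name and signature are assumptions for illustration; match them to Orq's Python-evaluator contract when registering the code.

```python
import json

# Sketch of an evaluator body checking that the response is a valid JSON
# object (not a bare array or string). The evaluate() signature is an
# assumption; align it with Orq's Python-evaluator contract.
def evaluate(response: str) -> bool:
    try:
        return isinstance(json.loads(response), dict)
    except (json.JSONDecodeError, TypeError):
        return False

print(evaluate('{"status": "ok"}'), evaluate("not json"), evaluate("[1, 2]"))
```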

Create an experiment with evaluators
Create an experiment from the "qa-dataset" dataset with the "tone-scorer" evaluator attached
The assistant will:
  1. Search for the dataset using search_entities
  2. Use search_entities to find the evaluator and get its key, or use the key returned by create_llm_eval / create_python_eval if created in the same session
  3. Use create_experiment with the dataset ID and evaluator key, with auto_run enabled

Update an existing evaluator
Update the "tone-scorer" evaluator to also check for formal language and return a boolean instead of a number
The assistant will:
  1. Search for the evaluator using search_entities
  2. Use update_llm_eval with the evaluator ID, updated prompt, and output_type: "boolean"
  3. Confirm the new configuration

Delete a workspace entity
Delete the experiment named "GPT-5 Test Run" from my workspace
The assistant will:
  1. Search for the experiment using search_entities
  2. Use delete_entity with type: "experiment" and the resolved ID
  3. Confirm deletion
Supported type values: agent, prompt, experiment, evaluator, knowledge, memory_store, prompt_snippet, sheet, tool. Use delete_dataset to delete a dataset along with all its datapoints.

Get a workspace snapshot
Give me an overview of my workspace metrics for the last 7 days
The assistant will:
  1. Use get_analytics_overview with a 7-day range
  2. Return total requests, cost, tokens, error rate, latency, and top models

Drill into a specific model’s performance
How has gpt-5.2 performed this week? Focus on error rate and cost.
The assistant will:
  1. Use query_analytics with metric: "errors", filtered by model and a 7-day range
  2. Use query_analytics with metric: "cost", filtered by model and a 7-day range
  3. Surface error rate trends and cost breakdown side by side

Identify your most expensive models
Which models are costing the most this month?
The assistant will:
  1. Use query_analytics with metric: "cost", group_by: ["model"], and a 30-day range
  2. Aggregate cost per model across all time buckets and rank them by total spend
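The aggregation in step 2 reduces to summing per-model cost buckets and sorting by total. The {model: [cost per bucket]} shape and the dollar figures are illustrative assumptions about how the assistant would tabulate query_analytics results.

```python
# Sum query_analytics cost buckets per model and rank by total spend.
# The {model: [cost per time bucket]} shape and figures are illustrative.
buckets = {
    "model-a": [12.40, 15.10, 9.80],
    "model-b": [22.75, 18.30, 25.95],
}

totals = {model: round(sum(costs), 2) for model, costs in buckets.items()}
for model, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: ${total:.2f}")
```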