What is the Orq MCP?

The Orq Model Context Protocol (MCP) server provides AI code assistants with direct access to your Orq.ai workspace. With 24 specialized tools, you can manage experiments, create datasets, configure evaluators, and analyze traces without leaving your IDE.

Quickstart

Point your assistant at the MCP server and authenticate with your API key:
Endpoint: https://my.orq.ai/v2/mcp
Auth Header: Authorization: Bearer YOUR_ORQ_API_KEY
For assistant-specific setup, select your assistant in the Code Assistants section below.
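If you want to verify connectivity outside an assistant, here is a minimal sketch of what the transport looks like, assuming the server speaks the standard MCP streamable-HTTP JSON-RPC transport (your assistant normally handles this handshake for you):

```python
import json
import os
import urllib.request

# Standard MCP JSON-RPC request asking the server to list its tools.
payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
headers = {
    "Authorization": f"Bearer {os.environ.get('ORQ_API_KEY', 'YOUR_ORQ_API_KEY')}",
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}

req = urllib.request.Request(
    "https://my.orq.ai/v2/mcp",
    data=json.dumps(payload).encode(),
    headers=headers,
    method="POST",
)
# Uncomment to actually send the request (requires a valid ORQ_API_KEY):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
print(req.get_method(), req.full_url)
```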

Key Capabilities

Agent Creation

Create and configure agents with custom instructions, tools, models, evaluators, and guardrails directly from the conversation

Experiment Management

Create and run experiments, configure task columns with prompts or agents, and export results in multiple formats

Dataset Operations

Create synthetic datasets, reshape local data, manage datapoints, and map data to experiments

Analytics & Insights

Query workspace analytics, track performance metrics, and ask natural language questions about your traces

Evaluator & Guardrail Configuration

Create LLM-as-a-Judge evaluators, Python code evaluators, and attach guardrails for automated quality assessment and runtime safety

Available Tools

The Orq MCP provides 24 tools across 9 categories:
| Category | Tool | Description |
| --- | --- | --- |
| Agents | get_agent | Retrieve agent configuration and details |
| Agents | create_agent | Create a new agent with instructions, tools, models, evaluators, and guardrails |
| Agents | update_agent | Update an existing agent's configuration (instructions, model, tools, evaluators, guardrails) |
| Analytics | get_analytics_overview | Get a workspace snapshot (requests, cost, tokens, errors, latency, top models) |
| Analytics | query_analytics | Flexible drill-down with filtering and grouping |
| Dataset | create_dataset | Create a new dataset |
| Dataset | list_datapoints | List datapoints in a dataset |
| Dataset | create_datapoints | Create datapoints (max 100 per call) |
| Dataset | update_datapoint | Update a datapoint |
| Dataset | delete_datapoints | Delete datapoints (max 100 per call) |
| Dataset | delete_dataset | Delete a dataset and all its datapoints |
| Evaluator | create_llm_eval | Create an LLM-as-a-Judge evaluator |
| Evaluator | create_python_eval | Create a Python code evaluator |
| Experiment | list_experiment_runs | List runs with cursor pagination |
| Experiment | get_experiment_run | Export a run (JSON/JSONL/CSV) |
| Experiment | create_experiment | Create an experiment from a dataset, with optional auto-run |
| Models | list_models | List all available AI models |
| Registry | list_registry_keys | List available attribute keys for filtering traces |
| Registry | list_registry_values | List top values for a specific attribute |
| Search | search_entities | Search projects, datasets, prompts, or experiments |
| Search | search_directories | List directories within a project |
| Traces | list_traces | List traces with filtering and sorting |
| Traces | get_span | Retrieve a single span (compact or full mode) |
| Traces | list_spans | List all spans in a trace |

Examples

Find errors from the last 24 hours
Show me all traces with errors from the last 24 hours
The assistant will:
  1. Calculate the unix timestamp for 24 hours ago
  2. Use list_traces with filter status:=ERROR && timestamp:>TIMESTAMP and sort by timestamp:desc
  3. Display trace IDs, names, durations, and timestamps
  4. Summarize the most common error types and their frequency
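The timestamp arithmetic behind steps 1 and 2 can be sketched as follows (the filter grammar mirrors the example above; treat it as illustrative rather than a formal spec):

```python
import time

# Unix timestamp for 24 hours ago.
cutoff = int(time.time()) - 24 * 60 * 60

# Filter and sort arguments as they would be passed to list_traces.
filter_expr = f"status:=ERROR && timestamp:>{cutoff}"
sort_expr = "timestamp:desc"
print(filter_expr, sort_expr)
```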

Detect regressions after a model switch
After switching models yesterday, has latency increased or stabilized?
The assistant will:
  1. Use query_analytics with metric: "latency" and group_by: ["model"] for the period before the switch
  2. Repeat for the period after the switch
  3. Compare average latency per model across both windows and surface any regressions
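Step 3 boils down to comparing per-model averages across the two windows; a minimal sketch (the model names and per-model averages here are made up, not the exact query_analytics response shape):

```python
# Hypothetical per-model average latency (ms) for each window.
before = {"gpt-5.2": 820.0, "claude-sonnet": 910.0}
after = {"gpt-5.2": 835.0, "claude-sonnet": 1240.0}

# Flag models whose average latency grew by more than 10%.
regressions = {
    model: (before[model], latency)
    for model, latency in after.items()
    if model in before and latency > before[model] * 1.10
}
print(regressions)  # only claude-sonnet crosses the 10% threshold
```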

Find the slowest traces
Find the 5 slowest traces from today and show me their span details
The assistant will:
  1. Use list_traces sorted by duration_ms:desc, filtered to today, limit 5
  2. Use list_spans with each trace_id to retrieve the full span tree
  3. Surface bottlenecks and latency outliers

Compare two models on an existing dataset
Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 using the "user-queries" dataset
The assistant will:
  1. Search for the “user-queries” dataset using search_entities
  2. Use create_experiment with two model configurations and auto_run enabled
  3. Return the experiment ID once both configurations have run

Compare two prompt strategies
Create an experiment using the "customer-feedback" dataset with two prompts: one focused on empathy and one on brevity. Run it and summarize the results.
The assistant will:
  1. Search for the dataset using search_entities
  2. Use create_experiment with two prompt variants and auto_run enabled
  3. Use get_experiment_run to retrieve evaluation metrics
  4. Compare the variants and summarize which performed better

Export experiment results
Export the latest experiment run as CSV
The assistant will:
  1. Use list_experiment_runs to find the most recent run
  2. Use get_experiment_run with CSV export format
  3. Return a signed download URL for the CSV file

Create a synthetic dataset
Generate 50 realistic customer support questions about a SaaS product and create a dataset called "Support Training Data"
The assistant will:
  1. Generate 50 synthetic question/answer pairs
  2. Use create_dataset to create the dataset
  3. Use create_datapoints to add all entries in bulk, each formatted as { inputs: { question: "..." }, expected_output: "..." }
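The datapoint shape and the 100-item limit from step 3 can be sketched like this (field names follow the format shown above; the sample content is illustrative):

```python
# Build datapoints in the { inputs, expected_output } shape used above.
pairs = [
    (f"How do I reset my password? (variant {i})", "Go to Settings > Security.")
    for i in range(50)
]
datapoints = [
    {"inputs": {"question": q}, "expected_output": a}
    for q, a in pairs
]

# create_datapoints accepts at most 100 entries per call, so chunk the list.
batches = [datapoints[i:i + 100] for i in range(0, len(datapoints), 100)]
print(len(datapoints), [len(b) for b in batches])
```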

Import data from code
Create a dataset from the JSON array above and add it to my workspace
The assistant will:
  1. Parse the JSON from your selection or context
  2. Use create_dataset with an appropriate name
  3. Use create_datapoints to add each entry as a datapoint

Update or clean up a dataset
Delete all datapoints in the "staging-tests" dataset that have an empty expected_output field
The assistant will:
  1. Use search_entities to find the “staging-tests” dataset and retrieve its ID
  2. Use list_datapoints to retrieve all entries
  3. Filter for datapoints with empty expected_output
  4. Use delete_datapoints to remove them in batches
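Steps 3 and 4 amount to filtering on an empty field and chunking the deletes; a sketch, assuming datapoints carry an id and the expected_output field shown earlier:

```python
# Hypothetical datapoints as returned by list_datapoints.
datapoints = [
    {"id": "dp-1", "expected_output": "A valid answer"},
    {"id": "dp-2", "expected_output": ""},
    {"id": "dp-3", "expected_output": None},
]

# Collect IDs whose expected_output is empty or missing.
empty_ids = [dp["id"] for dp in datapoints if not dp.get("expected_output")]

# delete_datapoints removes at most 100 IDs per call, so chunk them.
delete_batches = [empty_ids[i:i + 100] for i in range(0, len(empty_ids), 100)]
print(empty_ids, delete_batches)
```
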
Create an LLM-as-a-Judge evaluator
Create an LLM-as-a-Judge evaluator that scores responses on tone: professional, neutral, or aggressive
The assistant will:
  1. Use create_llm_eval with a scoring rubric for tone classification
  2. Confirm the evaluator ID and configuration

Create a Python evaluator
Create a Python evaluator that checks whether the response contains a valid JSON object
The assistant will:
  1. Write a Python snippet that parses the response and validates JSON structure
  2. Use create_python_eval to register it in your workspace
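The core of such an evaluator is only a few lines of Python; a sketch of the validation logic (the exact function signature Orq expects for Python evaluators may differ, so treat this as illustrative):

```python
import json

def is_valid_json_object(response: str) -> bool:
    """Return True if the response parses as a JSON object (not a list or scalar)."""
    try:
        return isinstance(json.loads(response), dict)
    except (json.JSONDecodeError, TypeError):
        return False

print(is_valid_json_object('{"answer": 42}'))  # True
print(is_valid_json_object("not json"))        # False
print(is_valid_json_object("[1, 2, 3]"))       # a list, so False
```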

Create an experiment with evaluators
Create an experiment from the "qa-dataset" dataset with the "tone-scorer" evaluator attached
The assistant will:
  1. Search for the dataset using search_entities
  2. Resolve the evaluator ID: copy it from the Orq.ai UI, or reuse the ID returned by create_llm_eval / create_python_eval if the evaluator was created in the same session
  3. Use create_experiment with both the dataset ID and evaluator ID, with auto_run enabled

Get a workspace snapshot
Give me an overview of my workspace metrics for the last 7 days
The assistant will:
  1. Use get_analytics_overview with a 7-day range
  2. Return total requests, cost, tokens, error rate, latency, and top models

Drill into a specific model’s performance
How has gpt-5.2 performed this week? Focus on error rate and cost.
The assistant will:
  1. Use query_analytics with metric: "errors", filtered by model and a 7-day range
  2. Use query_analytics with metric: "cost", filtered by model and a 7-day range
  3. Surface error rate trends and cost breakdown side by side

Identify your most expensive models
Which models are costing the most this month?
The assistant will:
  1. Use query_analytics with metric: "cost", group_by: ["model"], and a 30-day range
  2. Aggregate cost per model across all time buckets and rank them by total spend
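The aggregation in step 2 is straightforward to sketch, assuming query_analytics returns one cost row per time bucket per model (the row shape and figures here are made up):

```python
from collections import defaultdict

# Hypothetical time-bucketed cost rows grouped by model.
rows = [
    {"model": "gpt-5.2", "cost": 12.40},
    {"model": "claude-sonnet", "cost": 9.10},
    {"model": "gpt-5.2", "cost": 15.25},
    {"model": "claude-sonnet", "cost": 4.05},
]

# Sum cost per model across all buckets, then rank by total spend.
totals = defaultdict(float)
for row in rows:
    totals[row["model"]] += row["cost"]

ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # gpt-5.2 ranks first
```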

Code Assistants