What is the Orq MCP?
The Orq Model Context Protocol (MCP) server provides AI code assistants with direct access to your Orq.ai workspace. With 24 specialized tools, you can manage experiments, create datasets, configure evaluators, and analyze traces without leaving your IDE.
Quickstart
Point your assistant at the MCP server and authenticate with your API key:

| Setting | Value |
|---|---|
| Endpoint | https://my.orq.ai/v2/mcp |
| Auth Header | Authorization: Bearer YOUR_ORQ_API_KEY |
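MCP clients speak JSON-RPC 2.0 over HTTP, so the connection details above are all a client needs. Below is a minimal Python sketch of the request an assistant sends to enumerate the server's tools; the endpoint and header come from the table, `YOUR_ORQ_API_KEY` is a placeholder, and the response handling is commented out because it requires a live key.

```python
import json
import urllib.request

# Endpoint and auth header from the Quickstart table above.
MCP_ENDPOINT = "https://my.orq.ai/v2/mcp"
API_KEY = "YOUR_ORQ_API_KEY"  # placeholder, substitute your real key

# MCP is JSON-RPC 2.0; "tools/list" asks the server to enumerate its tools.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

request = urllib.request.Request(
    MCP_ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Actually sending it requires a live API key:
# with urllib.request.urlopen(request) as resp:
#     tools = json.loads(resp.read())["result"]["tools"]
```

In practice your IDE's MCP client performs this handshake for you; the sketch only shows what the endpoint and auth header are used for.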
Key Capabilities
Agent Creation
Create and configure agents with custom instructions, tools, models, evaluators, and guardrails directly from conversation
Experiment Management
Create and run experiments, configure task columns with prompts or agents, and export results in multiple formats
Dataset Operations
Create synthetic datasets, reshape local data, manage datapoints, and map data to experiments
Analytics & Insights
Query workspace analytics, track performance metrics, and ask natural language questions about your traces
Evaluator & Guardrail Configuration
Create LLM-as-a-Judge evaluators, Python code evaluators, and attach guardrails for automated quality assessment and runtime safety
Available Tools
The Orq MCP provides 24 tools across 9 categories:

| Category | Tool | Description |
|---|---|---|
| Agents | get_agent | Retrieve agent configuration and details |
| Agents | create_agent | Create a new agent with instructions, tools, models, evaluators, and guardrails |
| Agents | update_agent | Update an existing agent’s configuration (instructions, model, tools, evaluators, guardrails) |
| Analytics | get_analytics_overview | Get workspace snapshot (requests, cost, tokens, errors, latency, top models) |
| Analytics | query_analytics | Flexible drill-down with filtering and grouping |
| Dataset | create_dataset | Create a new dataset |
| Dataset | list_datapoints | List datapoints in a dataset |
| Dataset | create_datapoints | Create datapoints (max 100) |
| Dataset | update_datapoint | Update a datapoint |
| Dataset | delete_datapoints | Delete datapoints (max 100) |
| Dataset | delete_dataset | Delete a dataset and all datapoints |
| Evaluator | create_llm_eval | Create LLM-as-a-Judge evaluator |
| Evaluator | create_python_eval | Create Python code evaluator |
| Experiment | list_experiment_runs | List runs with cursor pagination |
| Experiment | get_experiment_run | Export run (JSON/JSONL/CSV) |
| Experiment | create_experiment | Create experiment from dataset with optional auto-run |
| Models | list_models | List all available AI models |
| Registry | list_registry_keys | List available attribute keys for filtering traces |
| Registry | list_registry_values | List top values for a specific attribute |
| Search | search_entities | Search projects, datasets, prompts, or experiments |
| Search | search_directories | List directories within a project |
| Traces | list_traces | List traces with filtering and sorting |
| Traces | get_span | Retrieve a single span (compact or full mode) |
| Traces | list_spans | List all spans in a trace |
Examples
Investigating Traces

Find errors from the last 24 hours

The assistant will:
- Calculate the unix timestamp for 24 hours ago
- Use `list_traces` with filter `status:=ERROR && timestamp:>TIMESTAMP` and sort by `timestamp:desc`
- Display trace IDs, names, durations, and timestamps
- Summarize the most common error types and their frequency
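The timestamp arithmetic and filter string from the first two steps can be sketched in Python; the filter and sort syntax is taken verbatim from the steps above, while the variable names are only illustrative.

```python
import time

# Unix timestamp for 24 hours ago (step 1 above).
cutoff = int(time.time()) - 24 * 60 * 60

# Filter and sort expressions for list_traces, in the syntax quoted above.
filter_expr = f"status:=ERROR && timestamp:>{cutoff}"
sort_expr = "timestamp:desc"
```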
Detect regressions after a model switch

The assistant will:
- Use `query_analytics` with `metric: "latency"` and `group_by: ["model"]` for the period before the switch
- Repeat for the period after the switch
- Compare average latency per model across both windows and surface any regressions
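The comparison in the final step might look like the following sketch; the per-model latency dictionaries stand in for whatever `query_analytics` actually returns (the real response schema is not shown here), and the 10% threshold is an arbitrary choice.

```python
# Hypothetical per-model average latencies (ms) for each window; the real
# query_analytics response shape will differ.
before = {"gpt-4o": 820.0, "claude-sonnet": 640.0}
after = {"gpt-4o": 815.0, "claude-sonnet": 910.0}

def regressions(before, after, threshold_pct=10.0):
    """Return models whose average latency grew by more than threshold_pct."""
    out = {}
    for model, new in after.items():
        old = before.get(model)
        if old and (new - old) / old * 100 > threshold_pct:
            out[model] = round((new - old) / old * 100, 1)
    return out

print(regressions(before, after))  # claude-sonnet regressed in this sample
```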
Find the slowest traces

The assistant will:
- Use `list_traces` sorted by `duration_ms:desc`, filtered to today, limit 5
- Use `list_spans` with each `trace_id` to retrieve the full span tree
- Surface bottlenecks and latency outliers
Running Experiments

Compare two models on an existing dataset

The assistant will:
- Search for the “user-queries” dataset using `search_entities`
- Use `create_experiment` with two model configurations and `auto_run` enabled
- Return the experiment ID once both configurations have run
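A sketch of the `tools/call` payload for those steps; the argument names (`dataset_id`, `task_columns`, `auto_run`) are assumptions pieced together from the tool descriptions above, not a documented schema.

```python
# Assumed argument shape for create_experiment; the real parameter names may
# differ. "Task columns" comes from the Experiment Management overview above.
experiment_args = {
    "name": "gpt-4o vs claude-sonnet",
    "dataset_id": "DATASET_ID_FROM_SEARCH",  # returned by search_entities
    "task_columns": [
        {"name": "gpt-4o", "model": "openai/gpt-4o"},
        {"name": "claude", "model": "anthropic/claude-sonnet"},
    ],
    "auto_run": True,  # run both configurations immediately
}

# Wrapped as an MCP JSON-RPC tools/call request.
tool_call = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "create_experiment", "arguments": experiment_args},
}
```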
Compare two prompt strategies

The assistant will:
- Search for the dataset using `search_entities`
- Use `create_experiment` with two prompt variants and `auto_run` enabled
- Use `get_experiment_run` to retrieve evaluation metrics
- Compare the variants and summarize which performed better
Export experiment results

The assistant will:
- Use `list_experiment_runs` to find the most recent run
- Use `get_experiment_run` with CSV export format
- Return a signed download URL for the CSV file
Managing Datasets

Create a synthetic dataset

The assistant will:
- Generate 50 synthetic question/answer pairs
- Use `create_dataset` to create the dataset
- Use `create_datapoints` to add all entries in bulk, each formatted as `{ inputs: { question: "..." }, expected_output: "..." }`
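Building those datapoints in the shape quoted above might look like this; the question/answer strings are placeholder data, and the only hard constraint taken from the tool table is the 100-datapoint cap on `create_datapoints`.

```python
# Placeholder synthetic Q&A pairs; in practice the assistant generates these.
pairs = [(f"Question {i}?", f"Answer {i}") for i in range(50)]

# The datapoint shape quoted in the step above.
datapoints = [
    {"inputs": {"question": q}, "expected_output": a}
    for q, a in pairs
]

# create_datapoints accepts at most 100 datapoints per call, so chunk them.
BATCH = 100
batches = [datapoints[i:i + BATCH] for i in range(0, len(datapoints), BATCH)]
```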
Import data from code

The assistant will:
- Parse the JSON from your selection or context
- Use `create_dataset` with an appropriate name
- Use `create_datapoints` to add each entry as a datapoint
Update or clean up a dataset

The assistant will:
- Use `search_entities` to find the “staging-tests” dataset and retrieve its ID
- Use `list_datapoints` to retrieve all entries
- Filter for datapoints with empty `expected_output`
- Use `delete_datapoints` to remove them in batches
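The filtering and batching in the last two steps can be sketched as follows; the datapoint records are hypothetical stand-ins for whatever `list_datapoints` returns, and only the 100-id cap on `delete_datapoints` comes from the tool table above.

```python
# Hypothetical datapoints as list_datapoints might return them; only the
# "id" and "expected_output" fields matter for this cleanup.
datapoints = [
    {"id": "dp_1", "expected_output": "valid answer"},
    {"id": "dp_2", "expected_output": ""},
    {"id": "dp_3", "expected_output": None},
]

# Keep only the ids of datapoints whose expected_output is empty or missing.
empty_ids = [dp["id"] for dp in datapoints if not dp.get("expected_output")]

# delete_datapoints caps at 100 ids per call, so split into batches.
delete_batches = [empty_ids[i:i + 100] for i in range(0, len(empty_ids), 100)]
```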
Creating Evaluators

Create an LLM-as-a-Judge evaluator

The assistant will:
- Use `create_llm_eval` with a scoring rubric for tone classification
- Confirm the evaluator ID and configuration
Create a Python evaluator

The assistant will:
- Write a Python snippet that parses the response and validates JSON structure
- Use `create_python_eval` to register it in your workspace
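A snippet of the kind the assistant might write for registration via `create_python_eval`; the `evaluate` function name, its signature, the 0.0/1.0 return convention, and the required keys are all illustrative assumptions rather than Orq's documented evaluator contract.

```python
import json

def evaluate(response: str) -> float:
    """Score 1.0 if the response parses as a JSON object with required keys.

    The function name, signature, and score convention are assumptions for
    illustration, not Orq's documented evaluator interface.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    required = {"answer", "confidence"}  # hypothetical response schema
    if isinstance(parsed, dict) and required <= parsed.keys():
        return 1.0
    return 0.0
```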
Create an experiment with evaluators

The assistant will:
- Search for the dataset using `search_entities`
- Use the evaluator ID — copy it from the Orq.ai UI, or use the ID returned by `create_llm_eval` / `create_python_eval` if created in the same session
- Use `create_experiment` with both the dataset ID and evaluator ID, with `auto_run` enabled
Analytics

Get a workspace snapshot

The assistant will:
- Use `get_analytics_overview` with a 7-day range
- Return total requests, cost, tokens, error rate, latency, and top models
Drill into a specific model’s performance

The assistant will:
- Use `query_analytics` with `metric: "errors"`, filtered by model and a 7-day range
- Use `query_analytics` with `metric: "cost"`, filtered by model and a 7-day range
- Surface error rate trends and cost breakdown side by side
Identify your most expensive models

The assistant will:
- Use `query_analytics` with `metric: "cost"`, `group_by: ["model"]`, and a 30-day range
- Aggregate cost per model across all time buckets and rank them by total spend