Overview
Run experiments directly from code using the evaluatorq framework. Compare Deployments and Agents side by side, including Orq Agents against third-party agents, and inspect tool usage and parameters for every execution. View results in your terminal or in Orq's AI Studio.
Use Cases
Use the evaluatorq framework when:
- Comparing Deployments: Test two versions of a Prompt/model configuration to ensure performance hasn't degraded after changes.
- A/B testing Agents: Compare agentic approaches on the same Dataset.
- Development validation: Run experiments locally before pushing changes to production.
- CI/CD integration: Automatically run experiments as part of your testing pipeline.
- Quick iteration: Kick off experiments from code and get immediate feedback on model performance.
Prerequisites
Install the necessary libraries to get started.
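For example, with pip (the package names below are an assumption; check Orq's installation docs for the exact ones):

```bash
# Package names are assumptions; verify them against Orq's installation docs.
pip install orq-ai-sdk evaluatorq
```

Basic Workflow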
Define Your Data
Create a list of test cases (DataPoints) with inputs that will be passed to your Deployments or Agents. Choose one of three approaches:
Reference Existing Dataset (Recommended)
Use a Dataset you’ve already created in Orq. This is the recommended approach for production experiments.
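A minimal sketch of this approach, assuming the Dataset is referenced by the ID shown on its page in AI Studio; how evaluatorq accepts that reference is an assumption, so check the evaluatorq docs for the exact parameter.

```python
# Placeholder Dataset reference: copy the real ID from the Dataset page in
# Orq's AI Studio. Passing it to the experiment (e.g. as a dataset_id argument)
# is an assumption about the evaluatorq API.
ORQ_DATASET_ID = "01ARZ3NDEKTSV4RRFFQ69G5FAV"
```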
Load from File (CSV or JSON)
Parse a CSV or JSON file to create DataPoints. Useful for local development and testing.
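A sketch of turning a local file into DataPoints, assuming a DataPoint is a dict with an "inputs" mapping (the exact shape evaluatorq expects may differ):

```python
import csv
import json
from pathlib import Path

def datapoints_from_csv(path: str) -> list[dict]:
    # Each CSV row becomes one DataPoint; column names become input keys.
    with open(path, newline="", encoding="utf-8") as f:
        return [{"inputs": dict(row)} for row in csv.DictReader(f)]

def datapoints_from_json(path: str) -> list[dict]:
    # Expects a JSON array of objects, e.g. [{"question": "..."}, ...].
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return [{"inputs": record} for record in records]

data_points = datapoints_from_csv("test_cases.csv")
```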
Define Inline (Quick Testing)
Define DataPoints directly in code for quick local experiments and testing.
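For quick experiments, the list can simply be written out in code (again assuming the dict-based DataPoint shape used above):

```python
# Inline DataPoints for quick local testing; the "expected" field is optional
# reference data your evaluators can compare against.
data_points = [
    {"inputs": {"question": "What is the capital of France?"}, "expected": "Paris"},
    {"inputs": {"question": "Who wrote Dune?"}, "expected": "Frank Herbert"},
]
```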
Define Your Jobs
Jobs define the work to be done on each test case. Create a job for each variant you want to test. In the example below, each job invokes an Orq Deployment using orq_client.deployments.invoke(). This calls your deployed prompt/model with specific inputs and parameters. For each DataPoint, the job:
- Invokes the Deployment with configuration options (e.g., the reasoning parameter)
- Receives a response object from the model
- Parses and structures the response for evaluation
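A sketch of two such jobs, one per Deployment variant. The job contract (an async callable that receives a DataPoint and returns the text to evaluate), the Deployment keys, and the response parsing are assumptions; adjust them to your SDK version and Deployment setup.

```python
import os
from orq_ai_sdk import Orq  # client construction may differ across SDK versions

orq_client = Orq(api_key=os.environ["ORQ_API_KEY"])

async def baseline_job(data_point: dict) -> str:
    # Invoke the baseline Deployment with this DataPoint's inputs.
    response = orq_client.deployments.invoke(
        key="support-assistant-baseline",   # hypothetical Deployment key
        inputs=data_point["inputs"],
    )
    # Response parsing assumes an OpenAI-style shape; adjust it to what your
    # Deployment actually returns.
    return response.choices[0].message.content

async def candidate_job(data_point: dict) -> str:
    # Same inputs, but against the candidate Deployment variant.
    response = orq_client.deployments.invoke(
        key="support-assistant-candidate",  # hypothetical Deployment key
        inputs=data_point["inputs"],
    )
    return response.choices[0].message.content
```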
Define Your Evaluators
Evaluators score the outputs from each job. Each evaluator is an async function that receives the job output and returns an EvaluationResult with a score (0.0-1.0) and a human-readable explanation. Evaluators can run locally, call Orq's LLM-as-judge API, or integrate external frameworks.
You can use Orq's built-in Evaluators, create custom Evaluators, or integrate third-party frameworks. Popular options include Ragas and DeepEval.
Local Evaluators
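A minimal local evaluator sketch. The exact signature and the EvaluationResult import path are assumptions, so a placeholder class is defined here just to keep the example self-contained; in practice, import it from the evaluatorq package.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    # Placeholder standing in for evaluatorq's EvaluationResult; import the
    # real class from the evaluatorq package in your project.
    score: float        # 0.0-1.0
    explanation: str    # human-readable reasoning behind the score

async def mentions_refund_policy(output: str) -> EvaluationResult:
    # Simple deterministic check that runs entirely locally.
    found = "refund" in output.lower()
    return EvaluationResult(
        score=1.0 if found else 0.0,
        explanation="Output references the refund policy." if found
        else "Output never mentions the refund policy.",
    )
```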
Orq Evaluators
Use Orq's built-in Evaluators, such as LLM-as-judge, to have scoring run through Orq's API rather than in your local evaluator code.
Third-Party Evaluators
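As a sketch of the third-party route, a DeepEval metric can be wrapped in the same evaluator shape (reusing the EvaluationResult placeholder from the local example above). Passing the DataPoint to the evaluator alongside the output is an assumption, and DeepEval itself must be installed and configured with an LLM provider key.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

async def answer_relevancy(output: str, data_point: dict) -> EvaluationResult:
    # Score how relevant the job's output is to the original question.
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input=data_point["inputs"]["question"],
        actual_output=output,
    )
    metric.measure(test_case)  # runs DeepEval's LLM-as-judge scoring
    return EvaluationResult(score=metric.score, explanation=metric.reason)
```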
Understanding Results
When you run an experiment via the API, the evaluatorq framework will:
- Execute each job on every DataPoint in your dataset
- Run all Evaluators against each job’s output
- Display results in your terminal after all DataPoints complete:
  - A summary table showing all DataPoints processed
  - Output text generated by each job variant
  - Evaluator name, score (0.0-1.0), and explanation
  - Pass/fail status for each variant
- Sync to Orq.ai if you provided your ORQ_API_KEY:
  - Results are automatically uploaded for storage and sharing
  - The framework prints an experiment URL at the end (format: https://my.orq.ai/experiments/01ARZ3NDEKTSV4RRFFQ69G5FAV)
  - Open this URL to access and share results via the Orq.ai UI
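Tying the pieces together might look roughly like this. The entry point name and its parameters are assumptions based on the workflow above, not a confirmed evaluatorq API; adapt it to the real signature.

```python
import asyncio
# from evaluatorq import evaluatorq  # import path is an assumption

async def main():
    # Hypothetical entry point wiring up the data, jobs, and evaluators
    # defined in the sections above.
    results = await evaluatorq(
        name="deployment-comparison",
        data=data_points,
        jobs={"baseline": baseline_job, "candidate": candidate_job},
        evaluators=[mentions_refund_policy],
    )
    # With ORQ_API_KEY set, the run also prints an experiment URL of the form
    # https://my.orq.ai/experiments/<id> for viewing the results in AI Studio.
    print(results)

asyncio.run(main())
```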
Terminal Output
After your experiment completes, your terminal will display a summary table with results:
- Summary metrics: Total DataPoints processed, success rate, job execution statistics
- Detailed results: Evaluator scores for each job variant (0.0-1.0 scale)
- Evaluation progress: Success indicators as processing completes
Orq.ai AI Studio
If you provided your ORQ_API_KEY, open the experiment URL printed at the end of the run to view, inspect, and share the same results in Orq's AI Studio.
Advanced Use Cases
evaluatorq Tutorial
Detailed walkthroughs and code examples for advanced evaluatorq patterns:
- Comparing Deployments and Agents
- Third-party framework integration (LangGraph, CrewAI, LlamaIndex, AutoGen)
- Multi-job workflows and custom data sources
- CI/CD integration strategies