TL;DR
- Run experiments from code to compare any AI system against your evaluation criteria, whether it’s Orq-native or built with LangGraph, CrewAI, or your own custom framework
- Results are rendered in Orq's UI, so when experiments complete, prompt engineers can drill into failure points, identify why a version underperforms, and iterate on tool descriptions, agent instructions, or prompts directly in the platform
- Choose your evaluators using Orq’s native evaluation suite or plug in third-party tools like RAGAS and DeepEval
What is Evaluatorq?
Evaluatorq is an evaluation framework for running experiments programmatically, available in both Python and TypeScript; this cookbook focuses on Python. It features the following capabilities:
- Define jobs: These are functions that run your model over inputs and produce outputs.
- Parallel evaluations: Run multiple jobs (model configurations, deployments, or agents) simultaneously against the same test dataset, then compare their results side by side and decide which configuration will perform best in production.
- Flexible Data Sources: Apply jobs and evaluators over datasets. These could be inline arrays, async sources, or even datasets managed in the Orq.ai platform.
- Type-safe: Built with Python type hints for better IDE support
- Access to experiments from code: Test Orq deployments, Orq agents, or any third-party framework, execute them over datasets, and evaluate results without leaving your IDE. For examples and common patterns, check out the Evaluatorq repository. A plain-Python sketch of the job/evaluator pattern follows this list.
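To make the job/evaluator vocabulary concrete before touching the SDK, here is a minimal plain-Python sketch of the pattern: a job maps a dataset row to an output, evaluators score that output, and several jobs run in parallel over the same data. The names and structure are illustrative only and are not the Evaluatorq API.

```python
import asyncio

# Illustrative only: a "job" maps a dataset row to a model output, and an
# "evaluator" scores that output. This mirrors the Evaluatorq concepts above
# in plain Python; it does not use the actual SDK.
DATASET = [
    {"input": "How do I reduce S3 storage costs?"},
    {"input": "When should I use spot instances?"},
]

async def job_variant_a(row: dict) -> str:
    # In a real experiment this would call a model, deployment, or agent.
    return f"[variant-a answer] {row['input']}"

async def job_variant_b(row: dict) -> str:
    return f"[variant-b answer] {row['input']}"

async def length_evaluator(output: str) -> float:
    # Toy evaluator: reward concise answers.
    return 1.0 if len(output) < 200 else 0.0

async def run_experiment(job) -> list[float]:
    # Run the job over every row, then score every output.
    outputs = await asyncio.gather(*(job(row) for row in DATASET))
    return list(await asyncio.gather(*(length_evaluator(o) for o in outputs)))

async def main() -> None:
    # Both "variants" run in parallel against the same dataset for a
    # side-by-side comparison of their scores.
    scores_a, scores_b = await asyncio.gather(
        run_experiment(job_variant_a), run_experiment(job_variant_b)
    )
    print("variant A:", scores_a, "variant B:", scores_b)

asyncio.run(main())
```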
What will we build?
We will build two separate Orq.ai-native Agents using different models that act as cloud engineering consultants, evaluate their performance, and challenge them against a LangGraph Agent on the following task:
Pre-requisites
1. Getting started
Install the required packages
2. Set up the Agents
Before we run any evaluations, we need to set up the two Agents we want to compare. To do so:
- Create a new Project in AI Studio
- Add Agents to the Project to evaluate
Next, in Python we create two Agent variants to evaluate:
- Variant A with gpt-4o
- Variant B with gpt-4o-mini
Key Agent variables (an illustrative configuration sketch follows below):
- key: Unique name of the Agent
- path: Path to the Project
- description: Detailed instructions for how the Agent should behave
- model: Foundation model that we will evaluate
Agent Variant A (gpt-4o)
Agent Variant B (gpt-4o-mini)
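The actual agent-creation code lives in the cookbook's code cells. Purely to illustrate how the four variables above fit together, a variant configuration could be written along these lines; create_agent and the path value are hypothetical placeholders, not the real Orq SDK call.

```python
# Hypothetical sketch of the Agent variables described above.
# `create_agent` and the `path` value are placeholders, not the Orq SDK call.
VARIANT_A = {
    "key": "cloud-consultant-variant-a",        # unique name of the Agent
    "path": "cloud-engineering-eval",           # path to the Project (hypothetical)
    "description": (
        "You are a cloud engineering consultant. Give concise, actionable "
        "advice on cost, reliability, and security trade-offs."
    ),                                          # instructions for the Agent
    "model": "gpt-4o",                          # foundation model under evaluation
}
VARIANT_B = {**VARIANT_A, "key": "cloud-consultant-variant-b", "model": "gpt-4o-mini"}

def create_agent(config: dict) -> None:
    """Placeholder for the real agent-creation call shown in the cookbook cells."""
    print(f"Would create Agent '{config['key']}' with model {config['model']}")

create_agent(VARIANT_A)
create_agent(VARIANT_B)
```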
3. Assessing Agent performance with parallel evaluators
Once we have the Agent variants set up, we're ready to run parallel evaluations using Evaluatorq. In the Evaluatorq evaluation framework, you'll notice the following syntax:
- @job decorator: a wrapper that identifies and names the function as a job
- async def your_evaluator: evaluators are defined as (async) functions
The experiment scores both Agent variants with four evaluators:
- Evaluator 1: LLM-as-a-judge
- Evaluator 2: DeepEval Faithfulness
- Evaluator 3: DeepEval Answer Relevancy
- Evaluator 4: Response Length (example of a custom Python script; sketched below)
Expected output
Here you can see that the two Agent variants were evaluated and scored in parallel using 4 different evaluators. Based on this feedback, you can optimize your Agent setup using the evaluation metric that is most important for your use case.
Learn more about custom evaluators with Evaluatorq:
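Before the domain-specific examples below, here is the simplest flavor of custom evaluator, a response-length check like Evaluator 4 above, written as a plain async function. The exact signature Evaluatorq expects may differ, so treat this as a sketch of the idea rather than the SDK contract.

```python
import asyncio

# Sketch of a custom "Response Length" evaluator; the signature Evaluatorq
# actually expects may differ from this plain-Python version.
async def response_length_evaluator(output: str, min_chars: int = 50, max_chars: int = 1500) -> dict:
    """Score 1.0 when the response length falls inside the target window."""
    length = len(output.strip())
    in_range = min_chars <= length <= max_chars
    return {
        "name": "response_length",
        "score": 1.0 if in_range else 0.0,
        "explanation": f"Response has {length} characters (target {min_chars}-{max_chars}).",
    }

print(asyncio.run(response_length_evaluator("Use S3 lifecycle policies to tier cold data to Glacier.")))
```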

Domain-specific validations
Domain-specific evaluators enforce business rules and quality standards unique to your use case, catching issues that generic validators would miss and ensuring outputs meet your exact requirements.
This code demonstrates a parallel evaluation system for validating e-commerce product data. It defines a product_validator job that extracts SKU and price information from product inputs, then runs two concurrent evaluators (both sketched below):
- SKU Format Validator: Checks that product SKUs match the required format (3 uppercase letters, hyphen, 5 digits: ABC-12345)
- Price Range Validator: Ensures prices fall within acceptable business limits ($0.01 - $10,000.00)
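A minimal plain-Python sketch of those two checks, assuming the product_validator job hands each evaluator a dict with sku and price fields (the field names and the Evaluatorq wiring are assumptions):

```python
import re

SKU_PATTERN = re.compile(r"^[A-Z]{3}-\d{5}$")  # e.g. ABC-12345

def sku_format_validator(product: dict) -> dict:
    """Check that the extracted SKU matches the 3-letters-hyphen-5-digits format."""
    sku = str(product.get("sku", ""))
    valid = bool(SKU_PATTERN.match(sku))
    return {"name": "sku_format", "score": 1.0 if valid else 0.0, "value": sku}

def price_range_validator(product: dict, low: float = 0.01, high: float = 10_000.00) -> dict:
    """Ensure the extracted price falls inside the acceptable business limits."""
    price = float(product.get("price", -1))
    valid = low <= price <= high
    return {"name": "price_range", "score": 1.0 if valid else 0.0, "value": price}

# Example row the (hypothetical) product_validator job might have produced.
extracted = {"sku": "ABC-12345", "price": 49.99}
print(sku_format_validator(extracted))
print(price_range_validator(extracted))
```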
Statistical checks
Statistical checks prevent flawed data from leading to incorrect conclusions, wasted resources, and poor business decisions.
This code demonstrates a parallel evaluation system for validating numerical dataset quality. It defines a data_analyzer job that computes descriptive statistics (mean, median, standard deviation, count) from numerical inputs, then runs two concurrent evaluators (both sketched below):
- Outlier Detection: Identifies data points that fall outside acceptable ranges using the Interquartile Range (IQR) method, flagging values beyond 1.5x IQR from the Q1/Q3 quartiles
- Normal Distribution Checker: Validates whether data approximates a normal distribution by calculating the coefficient of variation and ensuring it falls within the expected 10-30% range
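A plain-Python sketch of these two statistical evaluators, assuming the data_analyzer job passes along the raw list of numbers; the thresholds follow the description above, and the Evaluatorq wiring is omitted:

```python
import statistics

def outlier_detector(values: list[float]) -> dict:
    """Flag points beyond 1.5x IQR from the first/third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"name": "outlier_detection", "score": 1.0 if not outliers else 0.0, "outliers": outliers}

def normal_distribution_checker(values: list[float]) -> dict:
    """Approximate normality check: coefficient of variation within the 10-30% range."""
    mean = statistics.mean(values)
    cv = statistics.stdev(values) / mean if mean else float("inf")
    ok = 0.10 <= cv <= 0.30
    return {"name": "normal_distribution", "score": 1.0 if ok else 0.0, "cv": round(cv, 3)}

sample = [40, 55, 62, 48, 51, 58, 45, 60, 52, 49]
print(outlier_detector(sample))
print(normal_distribution_checker(sample))
```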
String match
Pattern matching validates that extracted data conforms to expected formats, preventing invalid information from entering your systems and workflows.
This code demonstrates a parallel evaluation system for validating extracted text patterns. It defines a text_extractor job that uses regex to extract emails, phone numbers, URLs, and dates from text inputs, then runs two concurrent evaluators (both sketched below):
- Email Validation: Applies strict pattern matching to verify that extracted email addresses follow the proper format
- Phone Format Consistency: Checks that all phone numbers use consistent formatting (dashes, dots, or no separators) to ensure data uniformity across records
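A sketch of the two pattern checks as plain Python functions; the regexes are illustrative, and the text_extractor job plus the Evaluatorq wiring are omitted:

```python
import re

EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def email_validator(emails: list[str]) -> dict:
    """Strict pattern check on every extracted email address."""
    invalid = [e for e in emails if not EMAIL_RE.match(e)]
    return {"name": "email_validation", "score": 1.0 if not invalid else 0.0, "invalid": invalid}

def phone_format_consistency(phones: list[str]) -> dict:
    """Check that all phone numbers use the same separator style (dashes, dots, or none)."""
    def style(p: str) -> str:
        if "-" in p:
            return "dashes"
        if "." in p:
            return "dots"
        return "none"
    styles = {style(p) for p in phones}
    return {"name": "phone_format_consistency", "score": 1.0 if len(styles) <= 1 else 0.0, "styles": sorted(styles)}

extracted = {"emails": ["ops@example.com"], "phones": ["555-010-1234", "555-010-9876"]}
print(email_validator(extracted["emails"]))
print(phone_format_consistency(extracted["phones"]))
```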
4. Third-party evaluators
RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is a research-backed evaluation framework specifically designed for RAG systems. It provides both reference-free and reference-based metrics that assess retrieval quality and generation quality using LLM-as-a-judge.
Reference-Free Metrics (No Ground Truth Needed):
- Faithfulness: Checks if the response is grounded in the retrieved context
- Answer Relevancy: Checks if the response addresses the query
Reference-Based Metrics (Require Ground Truth):
- Context Precision: Measures if retrieved contexts are relevant to the ground truth
- Context Recall: Measures whether all the relevant contexts were retrieved, compared against the ground truth
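For orientation, a minimal RAGAS run might look roughly like the snippet below. The library's API has changed across versions (newer releases use sample and metric classes), so check the ragas docs for the exact imports in your version; an OPENAI_API_KEY is assumed for the LLM-as-a-judge step.

```python
# Rough sketch of a classic ragas evaluation run; imports and signatures vary
# by ragas version, so treat this as orientation rather than a drop-in script.
from datasets import Dataset                     # pip install datasets
from ragas import evaluate                       # pip install ragas
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["How can I cut S3 storage costs?"],
    "answer": ["Use lifecycle policies to move cold objects to Glacier tiers."],
    "contexts": [["S3 lifecycle rules can transition objects to cheaper storage classes."]],
})

# Reference-free metrics only; context_precision / context_recall would also
# require a ground_truth column.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```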
DeepEval
DeepEval is a comprehensive open-source LLM evaluation framework that treats AI testing like software unit testing. Built with pytest integration, it provides 15+ evaluation metrics covering RAG systems, chatbots, AI agents, and general LLM outputs.
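As orientation for how DeepEval metrics are typically invoked standalone (an OPENAI_API_KEY is assumed for the judge model, and signatures may shift between DeepEval versions):

```python
# Sketch of a standalone DeepEval check; exact signatures may differ slightly
# across DeepEval versions.
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When should I use spot instances?",
    actual_output="Use spot instances for fault-tolerant, interruptible workloads.",
    retrieval_context=["Spot instances can be reclaimed with short notice."],
)

for metric in (FaithfulnessMetric(threshold=0.7), AnswerRelevancyMetric(threshold=0.7)):
    metric.measure(test_case)          # runs the LLM-as-a-judge evaluation
    print(type(metric).__name__, metric.score, metric.reason)
```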
Orq.ai vs LangGraph Agent
Orq.ai allows you to process third-party agent traces. This evaluation compares two AI agent implementations using the GPT-4o model. Both agents act as Cloud Engineering Assistants and are tested on cloud infrastructure questions.
Agents tested:
- LangChain Agent: Direct implementation using LangChain's ChatOpenAI with custom system prompts
- Orq Native Agent: Agent deployed through the Orq.ai platform with an equivalent configuration
Evaluators used:
- DeepEval Faithfulness: Measures how well responses align with the provided context
- Cloud Engineering Relevance: Keyword-based scoring for cloud-specific terminology (sketched below)
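Conceptually, the keyword-based Cloud Engineering Relevance evaluator could look like the sketch below; the keyword list and weighting are illustrative, not the cookbook's exact implementation.

```python
# Illustrative keyword-based relevance scorer; the cookbook's actual keyword
# list and scoring scheme may differ.
CLOUD_KEYWORDS = {
    "vpc", "subnet", "s3", "iam", "kubernetes", "autoscaling",
    "load balancer", "terraform", "lambda", "encryption", "spot instance",
}

def cloud_engineering_relevance(response: str) -> float:
    """Fraction of known cloud terms mentioned, capped so about 5 hits gives a full score."""
    text = response.lower()
    hits = sum(1 for kw in CLOUD_KEYWORDS if kw in text)
    return min(hits / 5, 1.0)

answer = "Put the workload in a private subnet, front it with a load balancer, and enforce IAM least privilege."
print(cloud_engineering_relevance(answer))  # 3 hits -> 0.6
```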
1. Set up LangGraph traces in Orq.ai
Follow along with the LangGraph vs Orq.ai Agent cell in Google Colab. Variables need to be configured under the Step 1 section:
- ORQ_API_KEY - For Orq agent access and telemetry export
- OPENAI_API_KEY - For the LangChain agent and DeepEval metrics
2. Run the evaluators
In this step, we set up equivalent configurations of the LangChain and Orq agents and run the two evaluators by following these steps in the notebook:
- Step 2 - Install and Import LangChain
- Step 3 - Install and Import DeepEval
- Step 4 - Create LangChain Agent (Matching Orq Setup)
- Step 5 - Call the Orq.ai-native Agent
- Step 6 - Run DeepEval and Relevance evals
Expected Results:
3. Preview the results in Agent Studio
You can see the results directly in the AI Studio by clicking on the generated link that shows up after you run the agent evaluators:
