TL;DR
  • Run experiments from code to compare any AI system against your evaluation criteria, whether it’s Orq-native or built with LangGraph, CrewAI, or your own custom framework
  • Results are rendered in Orq’s UI: when experiments complete, prompt engineers can drill into failure points, identify why a version underperforms, and iterate on tool descriptions, agent instructions, or prompts directly in the platform
  • Choose your evaluators using Orq’s native evaluation suite or plug in third-party tools like RAGAS and DeepEval

What is Evaluatorq?

Evaluatorq is an evaluation framework for running experiments programmatically, available in both Python and TypeScript; this cookbook focuses on Python. It offers the following capabilities (a minimal usage sketch follows this list):
  • Define jobs: These are functions that run your model over inputs and produce outputs.
  • Parallel evaluations: Run multiple jobs (model configurations, deployments, or agents) simultaneously against the same test dataset, then compare their results side by side and decide which configuration will perform best in production.
  • Flexible Data Sources: Apply jobs and evaluators over datasets. These could be inline arrays, async sources, or even datasets managed in the Orq.ai platform.
  • Type-safe: Built with Python type hints for better IDE support
  • Access to experiments from code: Test Orq deployments, Orq agents, or any third-party framework, execute them over datasets, and evaluate results without leaving your IDE. For examples and common patterns, check out the Evaluatorq repository
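To make these capabilities concrete, here is a minimal, self-contained sketch based on the patterns used throughout this cookbook. The job simply echoes its input instead of calling a model, and the evaluator checks that the output is non-empty.

import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult

@job("echo")
async def echo_job(data: DataPoint, row: int):
    # A job receives one DataPoint and returns the output to be evaluated
    return {"response": f"Echo: {data.inputs['query']}"}

async def non_empty_evaluator(params):
    # An evaluator receives the DataPoint and the job output and returns a score
    output = params["output"]
    ok = bool(output["response"].strip())
    return EvaluationResult(
        value=1.0 if ok else 0.0,
        explanation="Non-empty response" if ok else "Empty response",
    )

async def main():
    await evaluatorq(
        "minimal-example",
        data=[DataPoint(inputs={"query": "What is a container?"})],
        jobs=[echo_job],
        evaluators=[{"name": "non-empty", "scorer": non_empty_evaluator}],
    )

asyncio.run(main())  # In a notebook, use `await main()` instead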

What will we build?

We will build two separate Orq.ai-native Agents that use different models and act as cloud engineering consultants, evaluate their performance, and challenge them against a LangGraph Agent on the following task:
"I'm preparing a technical presentation on microservices architecture. 

Can you help me create an outline covering the key benefits, challenges, 
and best practices in cloud computing?"
We will test the Agent configurations by running multiple evaluations in parallel using Evaluatorq. You will learn how to access readily available Orq.ai evaluators as well as external frameworks like DeepEval. The evaluation stack we will build consists of: an LLM-as-a-judge, DeepEval Faithfulness, DeepEval Answer Relevancy, and an example of a custom Python evaluator. You can follow along with the build in the Google Colab workbook.

Prerequisites

1

Getting started

Install the required packages
# Install Evaluatorq
!pip install evaluatorq

# Install Orq SDK
!pip install orq-ai-sdk

# Optional: Third-party evaluators
!pip install ragas deepeval
2

Set up the Agents

Before we run any evaluations, we need to set up two Agents for comparison. To do so:
  1. Create a new Project in AI Studio.
  2. Add the Agents to evaluate to the Project. Next, in Python, we create two Agent variants: VariantA with gpt-5-mini and VariantB with claude-sonnet-4.5.
    Key Agent variables:
      • key: Unique name of the Agent
      • path: Path to the Project
      • description: Description of the Agent’s purpose
      • instructions: Instructions for how the Agent should behave
      • model: Foundation model that we will evaluate

    Agent Variant A (gpt-5-mini)

    from orq_ai_sdk import Orq
    import os
    
    with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
        agent = orq.agents.create(
            key="VariantA",
            role="Cloud Engineering Assistant",
            description="A helpful assistant for cloud engineering tasks",
            instructions="Be helpful and concise",
            path="Evaluatorq",
            model={"id": "openai/gpt-5-mini"},
            settings={
                "max_iterations": 3,
                "max_execution_time": 300,
                "tools": [
                    {
                        "type": "current_date"
                    }
                ]
            }
        )
    
        print(f"Agent created: {agent.key}")
    

    Agent Variant B (claude-sonnet-4.5)

    from orq_ai_sdk import Orq
    import os
    
    with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
        agent = orq.agents.create(
            key="VariantB",
            role="Cloud Engineering Assistant",
            description="A helpful assistant for cloud engineering tasks with cost-efficient model",
            instructions="Be helpful and concise. Provide clear, practical answers.",
            path="Evaluatorq",
            model={"id": "anthropic/claude-sonnet-4-5-20250929"},
            settings={
                "max_iterations": 3,
                "max_execution_time": 300,
                "tools": [
                    {
                        "type": "current_date"
                    }
                ]
            }
        )
    
        print(f"✓ Agent created: {agent.key}")
    
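Optionally, before running any evaluations, you can sanity-check each Agent with a single request. Below is a minimal sketch that reuses the same agents.responses.create call as the evaluation jobs later in this cookbook; the exact response structure may vary by SDK version, so printing the raw object is a reasonable first step.

from orq_ai_sdk import Orq
import os

# Smoke test: send one message to VariantA and inspect the raw response
with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
    response = orq.agents.responses.create(
        agent_key="VariantA",
        background=False,
        message={
            "role": "user",
            "parts": [{"kind": "text", "text": "Name one benefit of microservices."}]
        }
    )
    print(response)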
3

Assessing Agent performance with parallel evaluators

Once we have the Agent variants set up, we’re ready to run parallel evaluations using Evaluatorq. In the Evaluatorq evaluation framework, you’ll notice the following syntax:
  • The @job decorator wraps a function, registering and naming it as a job
  • Evaluators are defined as async functions (for example, async def your_evaluator(params)) that return an EvaluationResult
Before running evaluations: You must first create an Orq LLM-as-a-judge Evaluator via the UI or API. Once created, retrieve the Evaluator ID from the URL (e.g., https://my.orq.ai/project/evaluators/01KECJTD1GWGF90DMGSP1D8XZN) or via the Get All Evaluators API. Orq also supports custom Python evaluators, JSON-based evaluators, and HTTP evaluators, all invoked via their unique Evaluator ID.
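Once you have the Evaluator ID, you can sanity-check it with a direct invocation before wiring it into Evaluatorq. This is a minimal sketch using the same evals.invoke call and response fields as the full example below; the ID is a placeholder.

import os
from orq_ai_sdk import Orq

LLM_JUDGE_EVAL_ID = "$YOUR_LLM_AS_A_JUDGE_ID"  # Placeholder: paste your Evaluator ID

with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
    evaluation = orq.evals.invoke(
        id=LLM_JUDGE_EVAL_ID,
        query="What are the best practices for microservices architecture?",
        output="Keep services loosely coupled, design APIs first, and build in fault tolerance.",
    )
    # The evaluator returns a value and an explanation, as used in the full example below
    print(evaluation.value.value, evaluation.value.explanation)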
In the example below we will run four evaluators in parallel:
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
from orq_ai_sdk import Orq
import os

# ============================================
# CONFIGURATION
# ============================================
ORQ_API_KEY = os.getenv("ORQ_API_KEY", "")
if not ORQ_API_KEY:
    raise ValueError("ORQ_API_KEY environment variable must be set")

# ============================================
# CRITICAL: Set OpenAI API Key for DeepEval
# ============================================
# DeepEval uses OpenAI's API internally for evaluation
# You MUST set this before importing DeepEval
if not os.getenv("OPENAI_API_KEY"):
    print("CRITICAL: OPENAI_API_KEY not set!")
    print("Add this cell BEFORE running evaluation:")
    print("  import os")
    print('  os.environ["OPENAI_API_KEY"] = "sk-your-openai-key"')
    print()

# DeepEval library imports
try:
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
    from deepeval.test_case import LLMTestCase
    DEEPEVAL_AVAILABLE = True
    print("✓ DeepEval loaded")
except ImportError:
    DEEPEVAL_AVAILABLE = False
    print("DeepEval not installed. Run: pip install deepeval")

# Initialize Orq client
orq_client = Orq(api_key=ORQ_API_KEY)

# Replace with your Orq LLM-as-a-judge Evaluator ID
# Get it from: https://my.orq.ai/project/evaluators/<YOUR_ID>
# Or via API: https://docs.orq.ai/reference/evaluators/get-all-evaluators
LLM_JUDGE_EVAL_ID = "$YOUR_LLM_AS_A_JUDGE_ID"

# ============================================
# HELPER: Extract Response Text
# ============================================
def extract_response_text(response):
    """Helper function to extract text from Orq agent response."""
    if hasattr(response, 'content'):
        if isinstance(response.content, list):
            return " ".join([
                part.text if hasattr(part, 'text') else str(part)
                for part in response.content
            ])
        return str(response.content)
    return str(response)

# ============================================
# JOB 1: VariantA Agent (gpt-5-mini)
# ============================================
@job("VariantA")
async def variant_a_agent(data: DataPoint, row: int):
    """VariantA agent using gpt-5-mini."""
    with Orq(api_key=ORQ_API_KEY) as orq:
        response = orq.agents.responses.create(
            agent_key="VariantA",
            background=False,
            message={
                "role": "user",
                "parts": [{"kind": "text", "text": data.inputs["query"]}]
            }
        )

        return {
            "agent": "VariantA",
            "query": data.inputs["query"],
            "response": extract_response_text(response),
            "context": data.inputs.get("context", "")
        }

# ============================================
# JOB 2: VariantB Agent (claude-sonnet-4.5)
# ============================================
@job("VariantB")
async def variant_b_agent(data: DataPoint, row: int):
    """VariantB agent using claude-sonnet-4.5."""
    with Orq(api_key=ORQ_API_KEY) as orq:
        response = orq.agents.responses.create(
            agent_key="VariantB",
            background=False,
            message={
                "role": "user",
                "parts": [{"kind": "text", "text": data.inputs["query"]}]
            }
        )

        return {
            "agent": "VariantB",
            "query": data.inputs["query"],
            "response": extract_response_text(response),
            "context": data.inputs.get("context", "")
        }

# ============================================
# EVALUATOR 1: Orq Response Quality (LLM-as-a-judge)
# ============================================
async def orq_response_quality_evaluator(params):
    """Uses Orq's LLM-as-a-judge to assess response quality and coherence."""
    data: DataPoint = params["data"]
    output = params["output"]

    query = data.inputs.get("query", "").strip()
    response = output.get("response", "").strip()

    if not response or not query:
        return EvaluationResult(value=0.0, explanation="Missing data")

    try:
        evaluation = await asyncio.to_thread(
            orq_client.evals.invoke,
            id=LLM_JUDGE_EVAL_ID,
            query=query,
            output=response,
        )

        raw_score = float(evaluation.value.value)
        score = raw_score / 10.0 if raw_score > 1.0 else raw_score
        explanation = str(evaluation.value.explanation or "")[:80]

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: {explanation}"
        )
    except Exception as e:
        return EvaluationResult(value=0.0, explanation=f"Orq error: {str(e)[:50]}")

# ============================================
# EVALUATOR 2: DeepEval Faithfulness
# ============================================
async def deepeval_faithfulness_evaluator(params):
    """Uses DeepEval's faithfulness metric (requires OPENAI_API_KEY)."""
    if not DEEPEVAL_AVAILABLE:
        return EvaluationResult(value=0.0, explanation="DeepEval not installed")

    if not os.getenv("OPENAI_API_KEY"):
        return EvaluationResult(value=0.0, explanation="OPENAI_API_KEY not set")

    output = params["output"]
    query = output.get("query", "").strip()
    response = output.get("response", "").strip()
    context = output.get("context", "").strip()

    if not response or not context:
        return EvaluationResult(value=0.0, explanation="Missing response or context")

    try:
        # Create test case
        test_case = LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=[context],
        )

        # Initialize metric
        metric = FaithfulnessMetric(
            threshold=0.5,
            model="gpt-4o-mini",  # Use gpt-4o-mini to save costs
            include_reason=False,
        )

        # Measure (synchronous call in thread)
        def measure_sync():
            metric.measure(test_case)
            return float(metric.score) if metric.score is not None else 0.0

        score = await asyncio.to_thread(measure_sync)

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: Faithfulness {score:.2f}"
        )

    except Exception as e:
        return EvaluationResult(
            value=0.0,
            explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}"
        )

# ============================================
# EVALUATOR 3: DeepEval Answer Relevancy
# ============================================
async def deepeval_answer_relevancy_evaluator(params):
    """Uses DeepEval's answer relevancy metric (requires OPENAI_API_KEY)."""
    if not DEEPEVAL_AVAILABLE:
        return EvaluationResult(value=0.0, explanation="DeepEval not installed")

    if not os.getenv("OPENAI_API_KEY"):
        return EvaluationResult(value=0.0, explanation="OPENAI_API_KEY not set")

    output = params["output"]
    query = output.get("query", "").strip()
    response = output.get("response", "").strip()

    if not response or not query:
        return EvaluationResult(value=0.0, explanation="Missing query or response")

    try:
        # Create test case
        test_case = LLMTestCase(
            input=query,
            actual_output=response,
        )

        # Initialize metric
        metric = AnswerRelevancyMetric(
            threshold=0.5,
            model="gpt-4o-mini",  # Use gpt-4o-mini to save costs
            include_reason=False,
        )

        # Measure (synchronous call in thread)
        def measure_sync():
            metric.measure(test_case)
            return float(metric.score) if metric.score is not None else 0.0

        score = await asyncio.to_thread(measure_sync)

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: Relevancy {score:.2f}"
        )

    except Exception as e:
        return EvaluationResult(
            value=0.0,
            explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}"
        )

# ============================================
# EVALUATOR 4: Response Length
# ============================================
async def response_length_evaluator(params):
    """Checks if response length is appropriate."""
    output = params["output"]
    word_count = len(output["response"].split())

    if 50 <= word_count <= 300:
        score, verdict = 1.0, "Good"
    elif word_count < 50:
        score, verdict = word_count / 50, "Too short"
    else:
        score, verdict = 0.5, "Too long"

    return EvaluationResult(
        value=score,
        explanation=f"{output['agent']}: {word_count}w - {verdict}"
    )

# ============================================
# RUN EVALUATION
# ============================================
async def main():
    print("=" * 70)
    print("Comparing Agents: VariantA (gpt-5-mini) vs VariantB (claude-sonnet-4.5)")
    print("=" * 70)
    print()

    # Check configuration
    print("Configuration Check:")
    print(f"  ORQ_API_KEY: {'✓' if ORQ_API_KEY else '✗'}")
    print(f"  OPENAI_API_KEY: {'✓' if os.getenv('OPENAI_API_KEY') else '✗ REQUIRED FOR DEEPEVAL'}")
    print(f"  DeepEval: {'✓' if DEEPEVAL_AVAILABLE else '✗'}")
    print()

    if not os.getenv("OPENAI_API_KEY"):
        print("WARNING: DeepEval evaluators will return 0.00 without OPENAI_API_KEY")
        print("Add this in a cell before running:")
        print('os.environ["OPENAI_API_KEY"] = "sk-your-key"')
        print()

    await evaluatorq(
        "variant-comparison",
        data=[
            DataPoint(inputs={
                "query": "What are the best practices for microservices architecture?",
                "context": "Microservices architecture is a design pattern where applications are built as collections of loosely coupled services. Best practices include service independence, API-first design, and fault tolerance."
            }),
            DataPoint(inputs={
                "query": "How do I implement API rate limiting in a production system?",
                "context": "API rate limiting controls the number of requests a client can make to prevent abuse and ensure fair resource allocation. Common strategies include token bucket, leaky bucket, and fixed window algorithms."
            }),
            DataPoint(inputs={
                "query": "How does Kubernetes handle container orchestration?",
                "context": "Kubernetes orchestrates containers through a master-worker architecture. The control plane manages the cluster state, while worker nodes run containerized applications in pods. It handles scheduling, scaling, and self-healing automatically."
            }),
            DataPoint(inputs={
                "query": "What are best practices for CI/CD pipelines in cloud environments?",
                "context": "Best practices for cloud CI/CD include: automating testing at all stages, using infrastructure as code, implementing proper secrets management, maintaining separate environments (dev/staging/prod), and ensuring fast feedback loops."
            }),
            DataPoint(inputs={
                "query": "Explain the difference between AWS ECS and EKS",
                "context": "AWS ECS (Elastic Container Service) is Amazon's proprietary container orchestration platform, while EKS (Elastic Kubernetes Service) runs managed Kubernetes. ECS is simpler and AWS-specific, while EKS offers Kubernetes portability."
            }),
            DataPoint(inputs={
                "query": "How do I secure secrets in a cloud-native application?",
                "context": "Cloud-native secret management involves using services like AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault. Best practices include encryption at rest and in transit, role-based access control, and regular rotation."
            }),
            DataPoint(inputs={
                "query": "What is the purpose of a service mesh like Istio?",
                "context": "A service mesh provides infrastructure layer for handling service-to-service communication. Istio manages traffic routing, load balancing, encryption, authentication, and observability without requiring application code changes."
            }),
            DataPoint(inputs={
                "query": "How does auto-scaling work in AWS?",
                "context": "AWS Auto Scaling monitors applications and automatically adjusts capacity based on CloudWatch metrics. It uses scaling policies to add or remove EC2 instances based on CPU utilization, request counts, or custom metrics."
            }),
            DataPoint(inputs={
                "query": "What are the benefits of using Infrastructure as Code?",
                "context": "Infrastructure as Code (IaC) allows version-controlled, repeatable infrastructure provisioning using tools like Terraform or CloudFormation. Benefits include consistency, auditability, disaster recovery, and reduced manual errors."
            }),
            DataPoint(inputs={
                "query": "How do I implement zero-downtime deployments?",
                "context": "Zero-downtime deployments use strategies like blue-green deployments, rolling updates, or canary releases. Load balancers gradually shift traffic to new versions while monitoring health checks and rollback capabilities."
            }),
        ],
        jobs=[variant_a_agent, variant_b_agent],
        evaluators=[
            {"name": "orq-response-quality", "scorer": orq_response_quality_evaluator},
            {"name": "deepeval-faithfulness", "scorer": deepeval_faithfulness_evaluator},
            {"name": "deepeval-relevancy", "scorer": deepeval_answer_relevancy_evaluator},
            {"name": "length", "scorer": response_length_evaluator},
        ],
    )

    print("\n" + "=" * 70)
    print("✓ Evaluation Complete!")
    print("=" * 70)

if __name__ == "__main__":
    # Top-level await works in a notebook such as Colab; in a plain script, use asyncio.run(main()) instead
    await main()
Alternative Data Sources: Instead of defining DataPoints inline, you can load data from a CSV file or use Orq-managed Datasets. This is especially useful for running experiments over large evaluation sets.
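As an example, here is a minimal sketch of building DataPoints from a CSV file with query and context columns; the file name and column names are illustrative, and the Orq-managed Dataset option is not shown here.

import csv
from evaluatorq import DataPoint

def datapoints_from_csv(path: str) -> list[DataPoint]:
    """Build DataPoints from a CSV file with 'query' and 'context' columns."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            DataPoint(inputs={"query": row["query"], "context": row.get("context", "")})
            for row in csv.DictReader(f)
        ]

# Pass the result as the `data` argument to evaluatorq(...)
# data = datapoints_from_csv("evaluation_set.csv")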
Interpreting the results: The table shows evaluation scores (0.0 to 1.0) for each agent variant across all four evaluators. Higher scores indicate better performance: a score of 0.75 or above suggests the agent meets quality standards, while scores below 0.50 may indicate the agent needs refinement. Compare scores across variants to identify which model configuration performs best for your specific use case.
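If you want to apply these thresholds programmatically, a small helper like the sketch below (not part of Evaluatorq) can turn a score into a rough verdict.

def interpret_score(score: float) -> str:
    """Map an evaluator score (0.0-1.0) to a rough verdict using the thresholds above."""
    if score >= 0.75:
        return "meets quality standards"
    if score >= 0.50:
        return "borderline: review the failure points"
    return "needs refinement"

# Example: interpret_score(0.82) -> "meets quality standards"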
4

Third-party evaluators

RAGAS (Retrieval Augmented Generation Assessment) is a research-backed evaluation framework specifically designed for RAG systems. It provides both reference-free and reference-based metrics that assess retrieval quality and generation quality using LLM-as-a-judge.

Reference-Free Metrics (No Ground Truth Needed):

  • Faithfulness: Checks if the response is grounded in the retrieved context
  • Answer Relevancy: Checks if the response addresses the query

Reference-Based Metrics (Require Ground Truth):

  • Context Precision: Measures whether the retrieved contexts are relevant to the ground truth
  • Context Recall: Measures whether the retrieved contexts cover the information in the ground truth (a sketch using these reference-based metrics follows the code example below)
Before running this example: You must first create an Orq Deployment with a Knowledge Base enabled. Once created, replace "rag-knowledge-assistant" with your deployment key.
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
from orq_ai_sdk import Orq
import os

# RAGAS library imports
try:
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy
    from datasets import Dataset
    RAGAS_AVAILABLE = True
except ImportError:
    RAGAS_AVAILABLE = False
    print("RAGAS not installed. Install with: pip install ragas datasets")

ORQ_API_KEY = os.getenv("ORQ_API_KEY", "your-api-key-here")

# ============================================
# JOB: RAG-Powered Q&A System
# ============================================
@job("rag-qa-system")
async def rag_qa_system(data: DataPoint, row: int):
    """
    RAG system that answers questions using knowledge base.
    This is what we're evaluating - an Orq deployment with RAG.
    """
    with Orq(api_key=ORQ_API_KEY) as orq:
        response = orq.deployments.invoke(
            key="rag-knowledge-assistant",  # Your RAG-enabled deployment
            context={
                "knowledge_base_id": "your-kb-id"  # Optional: specific KB
            },
            inputs={"question": data.inputs["question"]},
            messages=[{
                "role": "user",
                "content": data.inputs["question"]
            }]
        )
        
        answer = response.choices[0].message.content
        
        # Extract contexts from RAG response (if available in metadata)
        # Adjust based on your actual Orq response structure
        contexts = getattr(response, 'contexts', data.inputs.get("contexts", []))
        if not contexts:
            contexts = ["Retrieved context from knowledge base"]
        
        return {
            "query": data.inputs["question"],
            "response": answer,
            "contexts": contexts,
            "ground_truth": data.inputs.get("ground_truth", "")
        }

# ============================================
# EVALUATOR 1: RAGAS Faithfulness
# ============================================
async def ragas_faithfulness_scorer(params):
    """Evaluate faithfulness using RAGAS metric - checks if response is grounded in context."""
    if not RAGAS_AVAILABLE:
        return EvaluationResult(
            value=0,
            explanation="RAGAS library not available. Install with: pip install ragas datasets",
        )
    
    output = params["output"]
    
    try:
        # Prepare dataset for RAGAS evaluation
        dataset = Dataset.from_dict({
            "question": [output["query"]],
            "answer": [output["response"]],
            "contexts": [output["contexts"]],
        })
        
        # Evaluate using RAGAS faithfulness metric
        result = evaluate(dataset, metrics=[faithfulness])
        score = result["faithfulness"]
        
        return EvaluationResult(
            value=score,
            explanation=(
                f"Faithfulness score: {score:.2f} - Response is grounded in provided context"
                if score >= 0.7
                else f"Faithfulness score: {score:.2f} - Response contains unsupported claims"
            ),
        )
    except Exception as e:
        return EvaluationResult(
            value=0,
            explanation=f"Error evaluating faithfulness: {str(e)}",
        )

# ============================================
# EVALUATOR 2: RAGAS Answer Relevancy
# ============================================
async def ragas_answer_relevancy_scorer(params):
    """Evaluate answer relevancy using RAGAS metric - checks if response addresses the query."""
    if not RAGAS_AVAILABLE:
        return EvaluationResult(
            value=0,
            explanation="RAGAS library not available. Install with: pip install ragas datasets",
        )
    
    output = params["output"]
    
    try:
        # Prepare dataset for RAGAS evaluation
        dataset = Dataset.from_dict({
            "question": [output["query"]],
            "answer": [output["response"]],
            "contexts": [output["contexts"]],
        })
        
        # Evaluate using RAGAS answer relevancy metric
        result = evaluate(dataset, metrics=[answer_relevancy])
        score = result["answer_relevancy"]
        
        return EvaluationResult(
            value=score,
            explanation=(
                f"Answer relevancy score: {score:.2f} - Response directly addresses the query"
                if score >= 0.7
                else f"Answer relevancy score: {score:.2f} - Response is off-topic or incomplete"
            ),
        )
    except Exception as e:
        return EvaluationResult(
            value=0,
            explanation=f"Error evaluating answer relevancy: {str(e)}",
        )

# ============================================
# RUN EVALUATION
# ============================================
async def main():
    await evaluatorq(
        "rag-system-evaluation",
        data=[
            DataPoint(inputs={
                "question": "What is machine learning?",
                "contexts": ["Machine learning is a branch of AI focused on building systems that learn from data."],
                "ground_truth": "Machine learning is a type of AI that allows systems to learn from data."
            }),
            DataPoint(inputs={
                "question": "How does photosynthesis work?",
                "contexts": ["Plants use chlorophyll to capture light energy and convert CO2 and water into glucose."],
                "ground_truth": "Photosynthesis converts light energy into chemical energy in plants."
            }),
            DataPoint(inputs={
                "question": "What are the benefits of cloud computing?",
                "contexts": ["Cloud computing provides scalability, cost efficiency, and flexibility for businesses."],
                "ground_truth": "Cloud computing offers scalability and cost savings."
            }),
        ],
        jobs=[rag_qa_system],
        evaluators=[
            {"name": "ragas-faithfulness", "scorer": ragas_faithfulness_scorer},
            {"name": "ragas-answer-relevancy", "scorer": ragas_answer_relevancy_scorer},
        ],
    )

if __name__ == "__main__":
    await main()
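The example above uses only the reference-free metrics. As a sketch, the reference-based metrics can be computed by adding a ground-truth column to the dataset and passing context_precision and context_recall; the column names follow the RAGAS version used in the example above and may differ in newer releases.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

def score_reference_based(output: dict) -> dict:
    """Score context precision/recall for a single job output that includes a ground truth."""
    dataset = Dataset.from_dict({
        "question": [output["query"]],
        "answer": [output["response"]],
        "contexts": [output["contexts"]],
        "ground_truth": [output["ground_truth"]],  # Required for reference-based metrics
    })
    result = evaluate(dataset, metrics=[context_precision, context_recall])
    return {
        "context_precision": result["context_precision"],
        "context_recall": result["context_recall"],
    }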
DeepEval is a comprehensive open-source LLM evaluation framework that treats AI testing like software unit testing. Built with pytest integration, it provides 15+ evaluation metrics covering RAG systems, chatbots, AI agents, and general LLM outputs.
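As a sketch of that pytest-style workflow (assuming DeepEval's assert_test helper and an OPENAI_API_KEY in the environment), a metric can be asserted against a test case like this:

# test_agent_quality.py -- run with: deepeval test run test_agent_quality.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What are the benefits of Infrastructure as Code?",
        actual_output="IaC makes provisioning repeatable, auditable, and version-controlled.",
    )
    # Fails the test if answer relevancy falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.5, model="gpt-4o-mini")])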
Dependencies
pip install deepeval

export OPENAI_API_KEY="your-api-key"
DeepEval implementation 
import asyncio
import os
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
from orq_ai_sdk import Orq

# DeepEval library imports
try:
    from deepeval.metrics import (
        AnswerRelevancyMetric,
        FaithfulnessMetric,
        HallucinationMetric,
    )
    from deepeval.test_case import LLMTestCase
    DEEPEVAL_AVAILABLE = True
except ImportError:
    DEEPEVAL_AVAILABLE = False
    print("DeepEval not installed. Install with: pip install deepeval")

# ============================================
# CONFIGURATION
# ============================================
ORQ_API_KEY = os.getenv("ORQ_API_KEY", "")

# Helper function to extract response text
def extract_response_text(response):
    """Helper function to extract text from Orq agent response."""
    if hasattr(response, 'content'):
        if isinstance(response.content, list):
            return " ".join([
                part.text if hasattr(part, 'text') else str(part)
                for part in response.content
            ])
        return str(response.content)
    return str(response)

@job("VariantA")
async def variant_a_agent(data: DataPoint, row: int):
    """VariantA agent using gpt-5-mini."""
    with Orq(api_key=ORQ_API_KEY) as orq:
        response = orq.agents.responses.create(
            agent_key="VariantA",
            background=False,
            message={
                "role": "user",
                "parts": [{"kind": "text", "text": data.inputs["query"]}]
            }
        )

        return {
            "agent": "VariantA",
            "query": data.inputs["query"],
            "response": extract_response_text(response),
            "context": data.inputs.get("context", "")
        }

async def deepeval_faithfulness_scorer(params):
    """Evaluate faithfulness using DeepEval metric."""
    if not DEEPEVAL_AVAILABLE:
        return EvaluationResult(
            value=0,
            explanation="DeepEval library not available. Install with: pip install deepeval",
        )

    output = params["output"]
    query = output.get("query", "").strip()
    response = output.get("response", "").strip()
    context = output.get("context", "").strip()

    if not response or not context:
        return EvaluationResult(value=0.0, explanation="Missing response or context")

    try:
        # Create test case for DeepEval evaluation
        test_case = LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=[context],
        )

        # Initialize DeepEval Faithfulness metric
        faithfulness_metric = FaithfulnessMetric(
            threshold=0.7,
            model="gpt-4o-mini",
            include_reason=False,
        )

        # Measure faithfulness (synchronous call in thread)
        def measure_sync():
            faithfulness_metric.measure(test_case)
            return float(faithfulness_metric.score) if faithfulness_metric.score is not None else 0.0

        score = await asyncio.to_thread(measure_sync)

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: Faithfulness {score:.2f}"
        )
    except Exception as e:
        return EvaluationResult(
            value=0,
            explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}",
        )

async def deepeval_hallucination_scorer(params):
    """Evaluate hallucination using DeepEval metric."""
    if not DEEPEVAL_AVAILABLE:
        return EvaluationResult(
            value=0,
            explanation="DeepEval library not available. Install with: pip install deepeval",
        )

    output = params["output"]
    query = output.get("query", "").strip()
    response = output.get("response", "").strip()
    context = output.get("context", "").strip()

    if not response or not context:
        return EvaluationResult(value=0.0, explanation="Missing response or context")

    try:
        # Create test case for DeepEval evaluation
        test_case = LLMTestCase(
            input=query,
            actual_output=response,
            context=[context],
        )

        # Initialize DeepEval Hallucination metric
        hallucination_metric = HallucinationMetric(
            threshold=0.5,
            model="gpt-4o-mini",
            include_reason=False,
        )

        # Measure hallucination (synchronous call in thread)
        def measure_sync():
            hallucination_metric.measure(test_case)
            return float(hallucination_metric.score) if hallucination_metric.score is not None else 0.0

        score = await asyncio.to_thread(measure_sync)

        # Invert score so higher is better (1 - hallucination_score)
        inverted_score = 1 - score

        return EvaluationResult(
            value=inverted_score,
            explanation=f"{output['agent']}: Hallucination {score:.2f} (inverted: {inverted_score:.2f})"
        )
    except Exception as e:
        return EvaluationResult(
            value=0,
            explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}",
        )

async def main():
    await evaluatorq(
        "variant-a-deepeval",
        data=[
            DataPoint(inputs={
                "query": "What are the best practices for microservices architecture?",
                "context": "Microservices architecture is a design pattern where applications are built as collections of loosely coupled services. Best practices include service independence, API-first design, and fault tolerance."
            }),
            DataPoint(inputs={
                "query": "How do I implement API rate limiting in a production system?",
                "context": "API rate limiting controls the number of requests a client can make to prevent abuse and ensure fair resource allocation. Common strategies include token bucket, leaky bucket, and fixed window algorithms."
            }),
        ],
        jobs=[variant_a_agent],
        evaluators=[
            {"name": "deepeval-faithfulness", "scorer": deepeval_faithfulness_scorer},
            {"name": "deepeval-hallucination", "scorer": deepeval_hallucination_scorer},
        ],
    )

if __name__ == "__main__":
    await main()

Orq.ai vs LangGraph Agent

Orq.ai allows you to process third-party agent traces. This evaluation compares two AI agent implementations, both using the GPT-4o model. Both agents act as Cloud Engineering Assistants and are tested on cloud infrastructure questions. Agents tested:
  • LangChain Agent: Direct implementation using LangChain’s ChatOpenAI with custom system prompts
  • Orq Native Agent: Agent deployed through Orq.ai platform with equivalent configuration
Evaluation metrics:
  • DeepEval Faithfulness: Measures how well responses align with provided context
  • Cloud Engineering Relevance: Keyword-based scoring for cloud-specific terminology
1

Set up LangGraph traces in Orq.ai

Follow along with the LangGraph vs Orq.ai Agent cell in Google Colab. The following variables need to be configured under the Step 1 section:
# ============================================
# STEP 1: Configure Environment Variables
# ============================================
ORQ_API_KEY - For Orq agent access and telemetry export
OPENAI_API_KEY - For LangChain agent and DeepEval metrics
2

Run the evaluators

In this step we set up equivalent configurations of the LangChain and Orq.ai Agents and run two evaluators, following these steps:
Step 2 - Install and Import LangChain
Step 3 - Install and Import DeepEval
Step 4 - Create LangChain Agent (Matching Orq Setup)
Step 5 - Call the Orq.ai-native Agent
Step 6 - Run DeepEval and Relevance evals
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
from orq_ai_sdk import Orq
import os

# ============================================
# STEP 1: Configure Environment Variables
# ============================================
# Orq.ai OpenTelemetry exporter for LangGraph traces
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://api.orq.ai/v2/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Bearer {os.getenv('ORQ_API_KEY')}"

# Enable LangSmith tracing in OTEL-only mode
os.environ["LANGSMITH_OTEL_ENABLED"] = "true"
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_OTEL_ONLY"] = "true"

# ============================================
# STEP 2: Install and Import LangChain
# ============================================
try:
    from langchain_openai import ChatOpenAI
    LANGCHAIN_AVAILABLE = True
    print("✓ LangChain loaded")
except ImportError:
    LANGCHAIN_AVAILABLE = False
    print(" LangChain not installed. Run: pip install langchain-openai")

# ============================================
# STEP 3: Install and Import DeepEval
# ============================================
try:
    from deepeval.metrics import FaithfulnessMetric
    from deepeval.test_case import LLMTestCase
    DEEPEVAL_AVAILABLE = True
    print("✓ DeepEval loaded")
except ImportError:
    DEEPEVAL_AVAILABLE = False
    print(" DeepEval not installed. Run: pip install deepeval")

# ============================================
# CONFIGURATION
# ============================================
ORQ_API_KEY = os.getenv("ORQ_API_KEY")
orq_client = Orq(api_key=ORQ_API_KEY)

# Agent keys
ORQ_AGENT_KEY = "VariantA"  # Your existing Orq agent

# ============================================
# STEP 4: Create LangChain Agent (Matching Orq Setup)
# ============================================
if LANGCHAIN_AVAILABLE:
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.7,
        max_tokens=None
    )

    system_message = """You are a Cloud Engineering Assistant.

Role: Cloud Engineering Assistant
Description: A helpful assistant for cloud engineering tasks
Instructions: Be helpful and concise

Please assist the user with their cloud engineering questions."""

# ============================================
# JOB 1: LangChain Agent
# ============================================
@job("LangChain-Agent-GPT4o")
async def langchain_agent_job(data: DataPoint, row: int):
    """LangChain agent using GPT-4o (matching Orq setup)."""
    if not LANGCHAIN_AVAILABLE:
        return {
            "agent": "LangChain-GPT4o",
            "query": data.inputs["query"],
            "response": "LangChain not available",
            "context": data.inputs.get("context", ""),
            "error": True
        }

    try:
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": data.inputs["query"]}
        ]

        result = await asyncio.to_thread(llm.invoke, messages)
        response = result.content if hasattr(result, 'content') else str(result)

        print(f"✓ LangChain response: {response[:80]}...")

        return {
            "agent": "LangChain-GPT4o",
            "query": data.inputs["query"],
            "response": response,
            "context": data.inputs.get("context", ""),
            "error": False
        }
    except Exception as e:
        print(f"✗ LangChain error: {e}")
        return {
            "agent": "LangChain-GPT4o",
            "query": data.inputs["query"],
            "response": f"Error: {str(e)}",
            "context": data.inputs.get("context", ""),
            "error": True
        }

# ============================================
# JOB 2: Orq Native Agent (Your Existing Agent)
# ============================================
@job("VariantA")
async def orq_native_agent_job(data: DataPoint, row: int):
    """Orq native agent - VariantA."""
    try:
        with Orq(api_key=ORQ_API_KEY) as orq:
            response = orq.agents.responses.create(
                agent_key=ORQ_AGENT_KEY,
                background=False,
                message={
                    "role": "user",
                    "parts": [{"kind": "text", "text": data.inputs["query"]}]
                }
            )

            # Extract response text
            response_text = ""
            if hasattr(response, 'message'):
                if hasattr(response.message, 'content'):
                    response_text = response.message.content
                elif hasattr(response.message, 'parts'):
                    response_text = " ".join([
                        part.text if hasattr(part, 'text') else str(part)
                        for part in response.message.parts
                    ])
            elif hasattr(response, 'content'):
                response_text = response.content
            else:
                response_text = str(response)

            print(f"✓ Orq response: {response_text[:80]}...")

            return {
                "agent": "Orq-Native-GPT4o",
                "query": data.inputs["query"],
                "response": response_text,
                "context": data.inputs.get("context", ""),
                "error": False
            }
    except Exception as e:
        print(f"✗ Orq agent error: {e}")
        return {
            "agent": "Orq-Native-GPT4o",
            "query": data.inputs["query"],
            "response": f"Error: {str(e)}",
            "context": data.inputs.get("context", ""),
            "error": True
        }

# ============================================
# EVALUATOR 1: DeepEval Faithfulness
# ============================================
async def deepeval_faithfulness_evaluator(params):
    """Uses DeepEval's faithfulness metric (requires OPENAI_API_KEY)."""
    if not DEEPEVAL_AVAILABLE:
        return EvaluationResult(value=0.0, explanation="DeepEval not installed")

    if not os.getenv("OPENAI_API_KEY"):
        return EvaluationResult(value=0.0, explanation="OPENAI_API_KEY not set")

    output = params["output"]

    if output.get("error"):
        return EvaluationResult(value=0.0, explanation=f"{output['agent']}: Job error")

    query = output.get("query", "").strip()
    response = output.get("response", "").strip()
    context = output.get("context", "").strip()

    if not response or not context:
        return EvaluationResult(value=0.0, explanation="Missing response or context")

    try:
        # Create test case
        test_case = LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=[context],
        )

        # Initialize metric
        metric = FaithfulnessMetric(
            threshold=0.5,
            model="gpt-4o-mini",  # Use gpt-4o-mini to save costs
            include_reason=False,
        )

        # Measure (synchronous call in thread)
        def measure_sync():
            metric.measure(test_case)
            return float(metric.score) if metric.score is not None else 0.0

        score = await asyncio.to_thread(measure_sync)

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: Faithfulness {score:.2f}"
        )

    except Exception as e:
        print(f"✗ DeepEval error: {e}")
        return EvaluationResult(
            value=0.0,
            explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}"
        )

# ============================================
# EVALUATOR 2: Cloud Engineering Relevance
# ============================================
async def cloud_engineering_relevance_evaluator(params):
    """Checks if response is relevant to cloud engineering."""
    output = params["output"]
    response = output.get("response", "").lower()

    if output.get("error"):
        return EvaluationResult(value=0.0, explanation=f"{output['agent']}: Job error")

    # Cloud engineering keywords
    cloud_keywords = [
        "aws", "azure", "gcp", "google cloud", "cloud",
        "kubernetes", "k8s", "docker", "container",
        "serverless", "lambda", "ec2", "s3", "rds",
        "deployment", "infrastructure", "devops",
        "ci/cd", "cicd", "pipeline", "terraform",
        "ansible", "microservices", "api", "rest",
        "scalability", "availability", "region",
        "zone", "load balancer", "auto scaling",
        "vpc", "subnet", "security group", "iam"
    ]

    keyword_count = sum(1 for keyword in cloud_keywords if keyword in response)

    if keyword_count >= 4:
        score = 1.0
        verdict = "Highly relevant"
    elif keyword_count >= 2:
        score = 0.7
        verdict = "Relevant"
    elif keyword_count >= 1:
        score = 0.4
        verdict = "Somewhat relevant"
    else:
        score = 0.1
        verdict = "Not cloud-specific"

    return EvaluationResult(
        value=score,
        explanation=f"{output['agent']}: {verdict} ({keyword_count} keywords)"
    )

# ============================================
# RUN EVALUATION
# ============================================
async def main():
    print("=" * 70)
    print("Comparing LangChain vs Orq Native Agent")
    print("Both agents: GPT-4o | Cloud Engineering Assistant")
    print("=" * 70)
    print()

    print("Configuration:")
    print(f"  ORQ_API_KEY: {'✓' if ORQ_API_KEY else '✗'}")
    print(f"  OPENAI_API_KEY: {'✓' if os.getenv('OPENAI_API_KEY') else '✗ REQUIRED FOR DEEPEVAL'}")
    print(f"  LangChain: {'✓' if LANGCHAIN_AVAILABLE else '✗'}")
    print(f"  DeepEval: {'✓' if DEEPEVAL_AVAILABLE else '✗'}")
    print(f"  Orq Agent Key: {ORQ_AGENT_KEY}")
    print()

    if not os.getenv("OPENAI_API_KEY"):
        print("WARNING: DeepEval requires OPENAI_API_KEY")
        print("Set it with: os.environ['OPENAI_API_KEY'] = 'sk-your-key'")
        print()

    await evaluatorq(
        "langchain-vs-orq-comparison",
        data=[
            DataPoint(inputs={
                "query": "What are the best practices for microservices architecture?",
                "context": "Microservices architecture is a design pattern where applications are built as collections of loosely coupled services. Best practices include service independence, API-first design, and fault tolerance."
            }),
            DataPoint(inputs={
                "query": "How do I implement API rate limiting in a production system?",
                "context": "API rate limiting controls the number of requests a client can make to prevent abuse and ensure fair resource allocation. Common strategies include token bucket, leaky bucket, and fixed window algorithms."
            }),
            DataPoint(inputs={
                "query": "How does Kubernetes handle container orchestration?",
                "context": "Kubernetes orchestrates containers through a master-worker architecture. The control plane manages the cluster state, while worker nodes run containerized applications in pods. It handles scheduling, scaling, and self-healing automatically."
            }),
            DataPoint(inputs={
                "query": "What are best practices for CI/CD pipelines in cloud environments?",
                "context": "Best practices for cloud CI/CD include: automating testing at all stages, using infrastructure as code, implementing proper secrets management, maintaining separate environments (dev/staging/prod), and ensuring fast feedback loops."
            }),
            DataPoint(inputs={
                "query": "Explain the difference between AWS ECS and EKS",
                "context": "AWS ECS (Elastic Container Service) is Amazon's proprietary container orchestration platform, while EKS (Elastic Kubernetes Service) runs managed Kubernetes. ECS is simpler and AWS-specific, while EKS offers Kubernetes portability."
            }),
            DataPoint(inputs={
                "query": "How do I secure secrets in a cloud-native application?",
                "context": "Cloud-native secret management involves using services like AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault. Best practices include encryption at rest and in transit, role-based access control, and regular rotation."
            }),
            DataPoint(inputs={
                "query": "What is the purpose of a service mesh like Istio?",
                "context": "A service mesh provides infrastructure layer for handling service-to-service communication. Istio manages traffic routing, load balancing, encryption, authentication, and observability without requiring application code changes."
            }),
            DataPoint(inputs={
                "query": "How does auto-scaling work in AWS?",
                "context": "AWS Auto Scaling monitors applications and automatically adjusts capacity based on CloudWatch metrics. It uses scaling policies to add or remove EC2 instances based on CPU utilization, request counts, or custom metrics."
            }),
            DataPoint(inputs={
                "query": "What are the benefits of using Infrastructure as Code?",
                "context": "Infrastructure as Code (IaC) allows version-controlled, repeatable infrastructure provisioning using tools like Terraform or CloudFormation. Benefits include consistency, auditability, disaster recovery, and reduced manual errors."
            }),
            DataPoint(inputs={
                "query": "How do I implement zero-downtime deployments?",
                "context": "Zero-downtime deployments use strategies like blue-green deployments, rolling updates, or canary releases. Load balancers gradually shift traffic to new versions while monitoring health checks and rollback capabilities."
            }),
        ],
        jobs=[langchain_agent_job, orq_native_agent_job],
        evaluators=[
            {"name": "deepeval-faithfulness", "scorer": deepeval_faithfulness_evaluator},
            {"name": "cloud-relevance", "scorer": cloud_engineering_relevance_evaluator},
        ],
    )

    print("\n" + "=" * 70)
    print("✓ Evaluation Complete!")
    print("Check Orq.ai workspace for results and LangChain traces")
    print("=" * 70)

if __name__ == "__main__":
    await main()
Expected results: a score table comparing both agents across the two evaluators.
3

Preview the results in Agent Studio

You can see the results directly in AI Studio by clicking the generated link that appears after you run the agent evaluators.

Key Takeaways

You can kick off experiments from code every time you make a significant update to your AI system, running them against your golden (ground-truth) dataset to ensure changes improve rather than degrade performance. The real power of Evaluatorq lies in its ability to catch performance dips before they reach users, validate that new model versions maintain quality standards, and provide the confidence needed to iterate quickly on AI systems. Whether you’re optimizing prompt configurations, testing agent decision-making logic, or validating RAG system faithfulness, Evaluatorq gives you the evaluation infrastructure to build reliable, production-ready AI applications at scale.