Overview

Run experiments directly from code using the evaluatorq framework. Compare Deployments and Agents side-by-side—including Orq agents against third-party agents—and inspect tool usage and parameters for every execution. View results in your Terminal or in Orq’s AI Studio.

Use-cases

Use the evaluatorq framework when:
  • Comparing Deployments: Test two versions of a Prompt/model configuration to ensure performance hasn’t degraded after changes.
  • A/B testing Agents: Compare agentic approaches on the same Dataset.
  • Development validation: Run experiments locally before pushing changes to production.
  • CI/CD integration: Automatically run experiments as part of your testing pipeline.
  • Quick iteration: Kick off experiments from code and get immediate feedback on model performance.

Prerequisites

Install the necessary libraries to get started:
pip install orq-ai-sdk
pip install evaluatorq
Configure your environment variables before running experiments:
export ORQ_API_KEY="your-api-key"
export ORQ_ENV="production"  # or "staging" (default: production)
export ORQ_EVALUATOR_ID="01ARZ3NDEKTSV4RRFFQ69G5FAV"  # optional: ULID of your Evaluator
ORQ_API_KEY is required to:
  • Invoke Deployments and Agents
  • Invoke Evaluators
  • Sync results to Orq.ai UI (see “Sharing Results” section below)
Without your API key, experiments run locally only and results won’t render in the dashboard.
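If you want your script to flag this early, you can add a small optional guard before the experiment starts. This check is not part of the framework; it is just a convenience sketch:
import os

# Optional pre-flight check: warn if results will stay local because the API key is missing
if not os.getenv("ORQ_API_KEY"):
    print("ORQ_API_KEY is not set: results will not sync to the Orq.ai dashboard.")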

Basic Workflow

Define Your Data

Create a list of test cases (DataPoints) with inputs that will be passed to your Deployments or Agents. Choose one of three approaches: reference an existing Orq Dataset by its ID, parse a CSV or JSON file, or define DataPoints directly in code.
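If your test cases already live in an Orq Dataset, reference it by ID and pass that reference to evaluatorq in the Run the Experiment step below (Option A there uses it). A minimal sketch, assuming you substitute your own Dataset ULID:
from evaluatorq import DatasetIdInput

# Reference an existing Orq Dataset by its ULID; pass this as `data=` when running the experiment
dataset = DatasetIdInput(dataset_id="01ARZ3NDEKTSV4RRFFQ69G5FAV")  # replace with your Dataset ID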
Alternatively, parse a CSV or JSON file to create DataPoints. This is useful for local development and testing.
import csv
import json
from evaluatorq import DataPoint

# Load from CSV
test_data = []
with open("test_data.csv", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        test_data.append(DataPoint(inputs=row))

# Load from JSON
with open("test_data.json", "r") as f:
    data = json.load(f)
    test_data = [DataPoint(inputs=item) for item in data]
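The CSV column names (or JSON object keys) become the keys of each DataPoint's inputs. The jobs and evaluators below read inputs["text"] and, optionally, inputs["expected_output"], so a file for this walkthrough needs at least a text column. An illustrative test_data.csv, with the optional expected_output column shown only as a placeholder:
text,expected_output
"Cinderella tells the story of a kind young woman forced into servitude by her cruel stepfamily...","<optional reference summary>"
"Little Red Riding Hood follows a girl traveling through the forest to visit her grandmother...","<optional reference summary>"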
Define DataPoints directly in code for quick local experiments and testing.
In this example, each DataPoint includes a “text” entry within its inputs. This entry is then used by jobs and evaluators for processing and assessment.
from evaluatorq import DataPoint

test_data = [
    DataPoint(inputs={
        "text": "Cinderella tells the story of a kind young woman forced into servitude by her cruel stepfamily, whose life changes through a touch of magic and her own quiet resilience."
    }),
    DataPoint(inputs={
        "text": "Little Red Riding Hood follows a girl traveling through the forest to visit her grandmother, unaware of the danger posed by a cunning wolf."
    }),
]

Define Your Jobs

Jobs define the work to be done on each test case. Create a job for each variant you want to test.
Jobs can invoke Deployments, Agents, or Prompts. Beyond Orq, you can integrate third-party frameworks such as LangGraph, CrewAI, LlamaIndex, or AutoGen, which lets you compare Orq features against third-party systems side by side within the same evaluatorq framework (a minimal third-party job sketch follows the Deployment examples below).
In the example below, each job invokes an Orq Deployment using orq_client.deployments.invoke(). This calls your deployed prompt/model with specific inputs and parameters. For each DataPoint, the job:
  1. Invokes the Deployment with configuration options (e.g., reasoning parameter)
  2. Receives a response object from the model
  3. Parses and structures the response for evaluation
import asyncio
import os
from evaluatorq import job, DataPoint
from orq_ai_sdk import Orq

# Initialize Orq client once at module level
orq_client = Orq(
    api_key=os.getenv("ORQ_API_KEY"),
    server_url=os.getenv("ORQ_SERVER_URL", "https://my.orq.ai")
)

# Helper function to extract response text
def extract_response_text(response):
    """Extract text from Orq agent response."""
    if hasattr(response, "output") and response.output:
        if isinstance(response.output, list) and len(response.output) > 0:
            part = response.output[0]
            if hasattr(part, "parts") and part.parts:
                return part.parts[0].text if hasattr(part.parts[0], "text") else str(part.parts[0])
    if hasattr(response, "content"):
        if isinstance(response.content, list):
            return " ".join(
                part.text if hasattr(part, "text") else str(part)
                for part in response.content
            )
        return str(response.content)
    return str(response)


@job("summarize-variant-a")
async def summarize_variant_a(data: DataPoint, row: int):
    text = data.inputs["text"]

    response = await asyncio.to_thread(
        orq_client.deployments.invoke,
        key="summarization_v2",
        context={"environments": [], "reasoning": ["minimal"]},
        inputs={"text": text},
        metadata={"variant": "v1", "row": row},
    )

    return {
        "variant": "variant-a",
        "input": data.inputs["text"],
        "summary": extract_response_text(response),
        "reference": data.inputs.get("expected_output", ""),
    }


@job("summarize-variant-b")
async def summarize_variant_b(data: DataPoint, row: int):
    text = data.inputs["text"]

    response = await asyncio.to_thread(
        orq_client.deployments.invoke,
        key="summarization_v2",
        context={"environments": [], "reasoning": ["medium"]},
        inputs={"text": text},
        metadata={"variant": "v2", "row": row},
    )

    return {
        "variant": "variant-b",
        "input": data.inputs["text"],
        "summary": extract_response_text(response),
        "reference": data.inputs.get("expected_output", ""),
    }
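
Jobs are not limited to Orq Deployments. The same pattern wraps a third-party agent: receive a DataPoint, run the external system, and return a dict for the Evaluators. A minimal sketch, where run_my_agent is a hypothetical async wrapper around your framework of choice (for example a compiled LangGraph graph or a CrewAI crew); only @job and DataPoint come from evaluatorq:
from evaluatorq import job, DataPoint

async def run_my_agent(text: str) -> str:
    """Hypothetical wrapper around a third-party agent (LangGraph, CrewAI, etc.)."""
    # Replace this placeholder with your framework's invocation and return its text output.
    return text[:100]

@job("summarize-third-party")
async def summarize_third_party(data: DataPoint, row: int):
    text = data.inputs["text"]
    summary = await run_my_agent(text)  # assumption: the wrapper returns plain text

    return {
        "variant": "third-party",
        "input": text,
        "summary": summary,
        "reference": data.inputs.get("expected_output", ""),
    }
Add summarize_third_party to the jobs list in the Run the Experiment step to compare it against the Deployment variants on the same DataPoints.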

Define Your Evaluators

Evaluators score the outputs from each job.
You can use Orq’s built-in Evaluators, create custom Evaluators, or integrate third-party frameworks. Popular options include Ragas and DeepEval.
Each evaluator is an async function that receives the job output and returns an EvaluationResult with a score (0.0-1.0) and human-readable explanation. Evaluators can run locally, call Orq’s LLM-as-judge API, or integrate external frameworks.
from evaluatorq import EvaluationResult

# Local evaluator - no external calls
async def word_count_scorer(params):
    """Check if summary has sufficient word count."""
    output = params["output"]
    word_count = len(output.get("summary", "").split())

    if word_count >= 10:
        return EvaluationResult(value=1.0, explanation=f"Sufficient ({word_count} words)")
    elif word_count >= 5:
        return EvaluationResult(value=0.5, explanation=f"Partial ({word_count} words)")
    else:
        return EvaluationResult(value=0.0, explanation=f"Too short ({word_count} words)")
import asyncio
import os
from evaluatorq import EvaluationResult

# Note: this scorer reuses the orq_client initialized at module level in the Jobs step.
SUMMARIZATION_EVAL_ID = os.environ.get("ORQ_EVALUATOR_ID", "your-evaluator-id")

async def summarization_quality_scorer(params):
    """Use Orq Evaluator to assess summary quality."""
    data = params["data"]
    output = params["output"]

    source_text = (data.inputs.get("text") or "").strip()
    summary = (output.get("summary") or "").strip()

    if not summary:
        return EvaluationResult(value=0.0, explanation="No summary found in job output")

    if not source_text:
        return EvaluationResult(value=0.0, explanation="No source text provided in DataPoint.inputs['text']")

    try:
        evaluation = await asyncio.to_thread(
            orq_client.evals.invoke,
            id=SUMMARIZATION_EVAL_ID,
            query=source_text,
            output=summary,
            reference=None,
            messages=[],
            retrievals=[],
        )
        score = float(evaluation.value.value)
        explanation = str(evaluation.value.explanation or "")
        return EvaluationResult(value=score, explanation=explanation)

    except Exception as e:
        return EvaluationResult(
            value=0.0,
            explanation=f"Eval error: {str(e)[:60]}"
        )
import asyncio
from evaluatorq import EvaluationResult

# DeepEval Evaluator
async def deepeval_relevancy_scorer(params):
    """Use DeepEval to score relevancy of summary."""
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    data = params["data"]
    output = params["output"]

    source_text = (data.inputs.get("text") or "").strip()
    summary = (output.get("summary") or "").strip()

    if not summary or not source_text:
        return EvaluationResult(value=0.0, explanation="Missing source or summary")

    try:
        metric = AnswerRelevancyMetric(threshold=0.5)
        test_case = LLMTestCase(input=source_text, actual_output=summary)
        # measure() populates metric.score and metric.reason on the metric instance
        await asyncio.to_thread(metric.measure, test_case)
        return EvaluationResult(
            value=float(metric.score),
            explanation=f"DeepEval relevancy: {metric.score:.2f}",
        )
    except Exception as e:
        return EvaluationResult(value=0.0, explanation=f"DeepEval evaluation failed: {e}")

Run the Experiment

Execute your experiment by passing your data, jobs, and evaluators to the evaluatorq function:
import asyncio
from evaluatorq import evaluatorq, DatasetIdInput

async def main():
    # Option A: Use Dataset ID (Approach 1 - Recommended)
    await evaluatorq(
        "compare-summarization-variants",
        data=DatasetIdInput(dataset_id="01ARZ3NDEKTSV4RRFFQ69G5FAV"),  # From Step 1 Approach 1
        jobs=[summarize_variant_a, summarize_variant_b],  # From Step 2
        evaluators=[
            {"name": "word-count", "scorer": word_count_scorer},  # From Step 3
            {"name": "quality", "scorer": summarization_quality_scorer},  # From Step 3
        ],
    )

    # Option B: Use inline data (Approach 3)
    # await evaluatorq(
    #     "compare-summarization-variants",
    #     data=test_data,  # From Step 1 Approach 3
    #     jobs=[summarize_variant_a, summarize_variant_b],
    #     evaluators=[...],
    # )

if __name__ == "__main__":
    asyncio.run(main())

Understanding Results

When you run an experiment from code, the evaluatorq framework will:
  1. Execute each job on every DataPoint in your dataset
  2. Run all Evaluators against each job’s output
  3. Display results in your terminal after all DataPoints complete:
    • A summary table showing all DataPoints processed
    • Output text generated by each job variant
    • Evaluator name, score (0.0-1.0), and explanation for each Evaluator
    • Pass/fail status for each variant
  4. Sync to Orq.ai if you provided your ORQ_API_KEY:
    • Results are automatically uploaded for storage and sharing
    • Framework prints an experiment URL at the end (format: https://my.orq.ai/experiments/01ARZ3NDEKTSV4RRFFQ69G5FAV)
    • Open this URL to access and share results via the Orq.ai UI

Terminal Output

After your experiment completes, your terminal displays a summary table with the results. The table shows:
  • Summary metrics: Total DataPoints processed, success rate, job execution statistics
  • Detailed Results: Evaluator scores for each job variant (0.0-1.0 scale)
  • Evaluation progress: Success indicators as processing completes

Orq.ai AI Studio

Critical: Results only render in the Orq.ai UI if you provide your ORQ_API_KEY. Without it, experiments run locally and results appear only in your terminal/IDE—no UI access or shareable URLs.
Open the URL to access the Experiments UI in the AI Studio. The UI displays all DataPoints, job outputs, Evaluator scores, and side-by-side comparisons. See Experiments Overview for more on navigating results.

Advanced Use Cases

evaluatorq Tutorial

Detailed walkthroughs and code examples for advanced evaluatorq patterns:
  • Comparing Deployments and Agents
  • Third-party framework integration (LangGraph, CrewAI, LlamaIndex, AutoGen)
  • Multi-job workflows and custom data sources
  • CI/CD integration strategies