
# Running evaluations in parallel with Evaluatorq

> Run AI experiments from code using Evaluatorq. Compare deployments and agents side-by-side with custom evaluators across any framework.

<Info>
  TL;DR

  * **Run experiments from code** to compare any AI system against your evaluation criteria, whether it's Orq-native or built with LangGraph, CrewAI, or your own custom framework
  * **Results rendered in Orq's UI** so when experiments complete, prompt engineers can drill into failure points, identify why a version underperforms, and iterate on tool descriptions, agent instructions, or prompts directly in the platform
  * **Choose your evaluators** using Orq's native evaluation suite or plug in third-party tools like RAGAS and DeepEval
</Info>

## What is Evaluatorq?

**Evaluatorq** is an evaluation framework for running experiments programmatically, available in both [Python](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq-py) and [TypeScript](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq). This cookbook focuses on Python.
Evaluatorq offers the following capabilities:

* **Define jobs**: These are functions that run your model over inputs and produce outputs.
* **Parallel evaluations**: Run multiple jobs (model configurations, deployments, or agents) simultaneously against the same test dataset, then compare their results side-by-side to decide which configuration is likely to perform best in production.
* **Flexible Data Sources**: Apply jobs and evaluators over datasets. These could be inline arrays, async sources, or even datasets managed in the [Orq.ai](http://orq.ai/) platform.
* **Type-safe**: Built with Python type hints for better IDE support.
* **Access to experiments from code**: Test Orq deployments, Orq agents, or any third-party framework, execute them over datasets, and evaluate results without leaving your IDE. For examples and common patterns, check out the [**Evaluatorq repository**](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq).
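
To make the parallel-evaluation idea concrete, here is a minimal sketch in plain `asyncio`. It is not Evaluatorq's implementation: `job_a`, `job_b`, `length_score`, and `run_all` are hypothetical names used only to show several jobs running concurrently over the same dataset, with their scores collected side by side.

```python
import asyncio

# Hypothetical stand-ins for two model configurations under test.
async def job_a(query: str) -> str:
    await asyncio.sleep(0)  # placeholder for a real model call
    return f"A: {query}"

async def job_b(query: str) -> str:
    await asyncio.sleep(0)  # placeholder for a real model call
    return f"B says a bit more about: {query}"

def length_score(output: str) -> int:
    """Toy evaluator: number of words in the output."""
    return len(output.split())

async def run_all(dataset, jobs):
    """Run every job on every row concurrently and score each output."""
    names = list(jobs)
    tasks = [jobs[name](row) for row in dataset for name in names]
    outputs = await asyncio.gather(*tasks)  # results keep the task order
    results = {name: [] for name in names}
    for i, output in enumerate(outputs):
        results[names[i % len(names)]].append(length_score(output))
    return results

dataset = ["What is a pod?", "Explain blue-green deployments"]
scores = asyncio.run(run_all(dataset, {"job_a": job_a, "job_b": job_b}))
print(scores)
```

In a notebook, where an event loop is already running, `await run_all(...)` replaces `asyncio.run(...)`.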

## What will we build?

We will build two [Orq.ai](http://Orq.ai)-native Agents that act as cloud engineering consultants, each using a different model, evaluate their performance, and compare them against a [LangGraph Agent](https://docs.orq.ai/docs/proxy/frameworks/langchain#langchain-langgraph) on the following task:

```text theme={"theme":{"light":"github-light","dark":"github-dark"}}
"I'm preparing a technical presentation on microservices architecture. 

Can you help me create an outline covering the key benefits, challenges, 
and best practices in cloud computing?"
```

We will test the Agent configurations by running multiple evaluations in parallel using **Evaluatorq**. You will learn how to use the built-in [Orq.ai](http://Orq.ai) evaluators alongside external frameworks like DeepEval. The evaluation stack consists of [LLM-as-a-judge](/docs/evaluators/build#llm-evaluator), [DeepEval Faithfulness](https://deepeval.com/docs/metrics-faithfulness), [DeepEval Answer Relevancy](https://deepeval.com/docs/metrics-answer-relevancy), and an example of a custom Python evaluator.

You can follow along with the build in this [Google Colab notebook](https://colab.research.google.com/drive/1Jv1J_tQAFYrRjUXXrD37MkyH588CI7mm?usp=sharing).

## Prerequisites

<Steps>
  <Step title="Getting started">
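    Install the required packages. The `!` prefix runs a shell command from a notebook cell such as Colab; drop it when installing from a terminal.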

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    # Install Evaluatorq
    !pip install evaluatorq

    # Install Orq SDK
    !pip install orq-ai-sdk

    # Optional: Third-party evaluators
    !pip install ragas deepeval
    ```
  </Step>

  <Step title="Set up the Agents">
    Before we run any evaluations, we need to set up two Agents to compare:

    1. [Create a new Project](https://docs.orq.ai/docs/projects/overview) in AI Studio

           <img src="https://mintcdn.com/orqai/0jcpCWU0uC9byhYW/images/Screenshot2026-01-07at12.15.57.png?fit=max&auto=format&n=0jcpCWU0uC9byhYW&q=85&s=469c9c0d3a5d2eb925cbe2ce887877c9" alt="Screenshot2026 01 07at12 15 57" title="Screenshot2026 01 07at12 15 57" style={{ width:"64%" }} width="916" height="568" data-path="images/Screenshot2026-01-07at12.15.57.png" />
    2. [Add Agents](https://docs.orq.ai/docs/agents/getting-started) to the Project to evaluate

       Next, in Python we create two Agent variants to evaluate:

       `VariantA` with gpt-5-mini

       `VariantB` with claude-sonnet-4.5

           <Info>
             Key Agent variables:

             `key`: Unique name of the Agent.

             `path`: Path to the Project.

             `description`: Detailed instructions on how the Agent should behave.

             `model`: Foundation model that we will evaluate.
           </Info>

       **Agent Variant A (gpt-5-mini)**

       ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
       from orq_ai_sdk import Orq
       import os

       with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
           agent = orq.agents.create(
               key="VariantA",
               role="Cloud Engineering Assistant",
               description="A helpful assistant for cloud engineering tasks",
               instructions="Be helpful and concise",
               path="Evaluatorq",
               model={"id": "openai/gpt-5-mini"},
               settings={
                   "max_iterations": 3,
                   "max_execution_time": 300,
                   "tools": [
                       {
                           "type": "current_date"
                       }
                   ]
               }
           )

           print(f"Agent created: {agent.key}")
       ```

       **Agent Variant B (claude-sonnet-4.5)**

       ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
       from orq_ai_sdk import Orq
       import os

       with Orq(api_key=os.getenv("ORQ_API_KEY", "")) as orq:
           agent = orq.agents.create(
               key="VariantB",
               role="Cloud Engineering Assistant",
               description="A helpful assistant for cloud engineering tasks with cost-efficient model",
               instructions="Be helpful and concise. Provide clear, practical answers.",
               path="Evaluatorq",
               model={"id": "anthropic/claude-sonnet-4-5-20250929"},
               settings={
                   "max_iterations": 3,
                   "max_execution_time": 300,
                   "tools": [
                       {
                           "type": "current_date"
                       }
                   ]
               }
           )

           print(f"✓ Agent created: {agent.key}")
       ```
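
       The two `create` calls above share the same `role`, `path`, and `settings`. A minimal sketch of collecting those shared values in one helper follows. `build_agent_config` is a hypothetical function introduced here, not part of the Orq SDK, and it only assembles the keyword arguments already shown above.

       ```python
       def build_agent_config(key: str, model_id: str, description: str, instructions: str) -> dict:
           """Return keyword arguments mirroring the create calls shown above."""
           return {
               "key": key,
               "role": "Cloud Engineering Assistant",
               "description": description,
               "instructions": instructions,
               "path": "Evaluatorq",
               "model": {"id": model_id},
               "settings": {
                   "max_iterations": 3,
                   "max_execution_time": 300,
                   "tools": [{"type": "current_date"}],
               },
           }

       variants = [
           build_agent_config(
               "VariantA",
               "openai/gpt-5-mini",
               "A helpful assistant for cloud engineering tasks",
               "Be helpful and concise",
           ),
           build_agent_config(
               "VariantB",
               "anthropic/claude-sonnet-4-5-20250929",
               "A helpful assistant for cloud engineering tasks with cost-efficient model",
               "Be helpful and concise. Provide clear, practical answers.",
           ),
       ]

       for config in variants:
           print(config["key"], config["model"]["id"])

       # With a client open, each config could then be passed on as:
       #     orq.agents.create(**config)
       ```

       Keeping the shared settings in one place makes it harder for the variants to drift apart in anything other than the model and prompt under test.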
  </Step>

  <Step title="Assessing Agent performance with parallel evaluators">
    Once we have the Agent variants set up, we're ready to run parallel evaluations using **Evaluatorq**. In the Evaluatorq evaluation framework, you'll notice the following syntax:

    * The `@job` decorator marks a function as a job and gives it a name.
    * Evaluators are plain `async def` functions that receive the job output and return an `EvaluationResult`.
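
    To illustrate the two bullets above, here is a minimal sketch of a naming decorator and an evaluator-shaped function in plain Python. `named_job`, `Result`, `echo_job`, and `non_empty_evaluator` are hypothetical stand-ins, not Evaluatorq's `@job` implementation or its `EvaluationResult` class. The only detail taken from this guide is that an evaluator receives a `params` mapping with `data` and `output` and returns a value with an explanation.

    ```python
    import asyncio
    from dataclasses import dataclass

    def named_job(name: str):
        """Hypothetical decorator: attach a display name to an async function."""
        def wrap(fn):
            fn.job_name = name
            return fn
        return wrap

    @dataclass
    class Result:
        """Hypothetical stand-in for an evaluation result."""
        value: float
        explanation: str

    @named_job("echo")
    async def echo_job(inputs: dict) -> dict:
        # A real job would call a model here; this one just upper-cases the query.
        return {"response": inputs["query"].upper()}

    async def non_empty_evaluator(params: dict) -> Result:
        """Score 1.0 when the job produced a non-empty response."""
        response = params["output"].get("response", "")
        ok = bool(response.strip())
        return Result(value=1.0 if ok else 0.0, explanation="non-empty" if ok else "empty")

    async def demo() -> Result:
        inputs = {"query": "hello"}
        output = await echo_job(inputs)
        return await non_empty_evaluator({"data": inputs, "output": output})

    result = asyncio.run(demo())
    print(echo_job.job_name, result)
    ```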

    <Info>
      **Before running evaluations**: You must first create an [Orq LLM-as-a-judge Evaluator](https://docs.orq.ai/docs/evaluators/build#llm-evaluator) via the UI or [API](https://docs.orq.ai/docs/evaluators/build#llm-evaluator). Once created, retrieve the Evaluator ID from the URL (e.g., `https://my.orq.ai/project/evaluators/01KECJTD1GWGF90DMGSP1D8XZN`) or via the [Get All Evaluators API](/reference/evaluators/get-all-evaluators).

      Orq also supports [custom Python evaluators](https://docs.orq.ai/docs/evaluators/build#python-evaluator), [JSON-based evaluators](https://docs.orq.ai/docs/evaluators/build#json-evaluator), and [HTTP evaluators](https://docs.orq.ai/docs/evaluators/build#http-evaluator), all invoked via their unique Evaluator ID.
    </Info>
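
    If you copy the Evaluator URL from the browser, the ID is the last path segment. A small sketch of extracting it follows, assuming the URL has the same shape as the example above. `evaluator_id_from_url` is a hypothetical helper, not an Orq SDK function.

    ```python
    from urllib.parse import urlparse

    def evaluator_id_from_url(url: str) -> str:
        """Return the ID that follows the /evaluators/ path segment."""
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) < 2 or segments[-2] != "evaluators":
            raise ValueError(f"Not an evaluator URL: {url}")
        return segments[-1]

    url = "https://my.orq.ai/project/evaluators/01KECJTD1GWGF90DMGSP1D8XZN"
    LLM_JUDGE_EVAL_ID = evaluator_id_from_url(url)
    print(LLM_JUDGE_EVAL_ID)
    ```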

    In the example below we will run four evaluators in parallel:

    * **Evaluator 1**: [Orq LLM-as-a-judge](https://docs.orq.ai/docs/evaluators/build#llm-evaluator) (checks response quality and coherence)
    * **Evaluator 2**: [DeepEval Faithfulness](https://deepeval.com/docs/metrics-faithfulness)
    * **Evaluator 3**: [DeepEval Answer Relevancy](https://deepeval.com/docs/metrics-answer-relevancy)
    * **Evaluator 4**: Response Length (example of a custom Python script)

    ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
    import asyncio
    from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
    from orq_ai_sdk import Orq
    import os

    # ============================================
    # CONFIGURATION
    # ============================================
    ORQ_API_KEY = os.getenv("ORQ_API_KEY", "")
    if not ORQ_API_KEY:
        raise ValueError("ORQ_API_KEY environment variable must be set")

    # ============================================
    # CRITICAL: Set OpenAI API Key for DeepEval
    # ============================================
    # DeepEval uses OpenAI's API internally for evaluation
    # You MUST set this before importing DeepEval
    if not os.getenv("OPENAI_API_KEY"):
        print("CRITICAL: OPENAI_API_KEY not set!")
        print("Add this cell BEFORE running evaluation:")
        print("  import os")
        print('  os.environ["OPENAI_API_KEY"] = "sk-your-openai-key"')
        print()

    # DeepEval library imports
    try:
        from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
        from deepeval.test_case import LLMTestCase
        DEEPEVAL_AVAILABLE = True
        print("✓ DeepEval loaded")
    except ImportError:
        DEEPEVAL_AVAILABLE = False
        print("DeepEval not installed. Run: pip install deepeval")

    # Initialize Orq client
    orq_client = Orq(api_key=ORQ_API_KEY)

    # Replace with your Orq LLM-as-a-judge Evaluator ID
    # Get it from: https://my.orq.ai/project/evaluators/<YOUR_ID>
    # Or via API: https://docs.orq.ai/reference/evaluators/get-all-evaluators
    LLM_JUDGE_EVAL_ID = "$YOUR_LLM_AS_A_JUDGE_ID"

    # ============================================
    # HELPER: Extract Response Text
    # ============================================
    def extract_response_text(response):
        """Helper function to extract text from Orq agent response."""
        if hasattr(response, 'content'):
            if isinstance(response.content, list):
                return " ".join([
                    part.text if hasattr(part, 'text') else str(part)
                    for part in response.content
                ])
            return str(response.content)
        return str(response)

    # ============================================
    # JOB 1: VariantA Agent (gpt-5-mini)
    # ============================================
    @job("VariantA")
    async def variant_a_agent(data: DataPoint, row: int):
        """VariantA agent using gpt-5-mini."""
        with Orq(api_key=ORQ_API_KEY) as orq:
            response = orq.agents.responses.create(
                agent_key="VariantA",
                background=False,
                message={
                    "role": "user",
                    "parts": [{"kind": "text", "text": data.inputs["query"]}]
                }
            )

            return {
                "agent": "VariantA",
                "query": data.inputs["query"],
                "response": extract_response_text(response),
                "context": data.inputs.get("context", "")
            }

    # ============================================
    # JOB 2: VariantB Agent (claude-sonnet-4.5)
    # ============================================
    @job("VariantB")
    async def variant_b_agent(data: DataPoint, row: int):
        """VariantB agent using claude-sonnet-4.5."""
        with Orq(api_key=ORQ_API_KEY) as orq:
            response = orq.agents.responses.create(
                agent_key="VariantB",
                background=False,
                message={
                    "role": "user",
                    "parts": [{"kind": "text", "text": data.inputs["query"]}]
                }
            )

            return {
                "agent": "VariantB",
                "query": data.inputs["query"],
                "response": extract_response_text(response),
                "context": data.inputs.get("context", "")
            }

    # ============================================
    # EVALUATOR 1: Orq Response Quality (LLM-as-a-judge)
    # ============================================
    async def orq_response_quality_evaluator(params):
        """Uses Orq's LLM-as-a-judge to assess response quality and coherence."""
        data: DataPoint = params["data"]
        output = params["output"]

        query = data.inputs.get("query", "").strip()
        response = output.get("response", "").strip()

        if not response or not query:
            return EvaluationResult(value=0.0, explanation="Missing data")

        try:
            evaluation = await asyncio.to_thread(
                orq_client.evals.invoke,
                id=LLM_JUDGE_EVAL_ID,
                query=query,
                output=response,
            )

            raw_score = float(evaluation.value.value)
            score = raw_score / 10.0 if raw_score > 1.0 else raw_score
            explanation = str(evaluation.value.explanation or "")[:80]

            return EvaluationResult(
                value=score,
                explanation=f"{output['agent']}: {explanation}"
            )
        except Exception as e:
            return EvaluationResult(value=0.0, explanation=f"Orq error: {str(e)[:50]}")

    # ============================================
    # EVALUATOR 2: DeepEval Faithfulness
    # ============================================
    async def deepeval_faithfulness_evaluator(params):
        """Uses DeepEval's faithfulness metric (requires OPENAI_API_KEY)."""
        if not DEEPEVAL_AVAILABLE:
            return EvaluationResult(value=0.0, explanation="DeepEval not installed")

        if not os.getenv("OPENAI_API_KEY"):
            return EvaluationResult(value=0.0, explanation="OPENAI_API_KEY not set")

        output = params["output"]
        query = output.get("query", "").strip()
        response = output.get("response", "").strip()
        context = output.get("context", "").strip()

        if not response or not context:
            return EvaluationResult(value=0.0, explanation="Missing response or context")

        try:
            # Create test case
            test_case = LLMTestCase(
                input=query,
                actual_output=response,
                retrieval_context=[context],
            )

            # Initialize metric
            metric = FaithfulnessMetric(
                threshold=0.5,
                model="gpt-4o-mini",  # Use gpt-4o-mini to save costs
                include_reason=False,
            )

            # Measure (synchronous call in thread)
            def measure_sync():
                metric.measure(test_case)
                return float(metric.score) if metric.score is not None else 0.0

            score = await asyncio.to_thread(measure_sync)

            return EvaluationResult(
                value=score,
                explanation=f"{output['agent']}: Faithfulness {score:.2f}"
            )

        except Exception as e:
            return EvaluationResult(
                value=0.0,
                explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}"
            )

    # ============================================
    # EVALUATOR 3: DeepEval Answer Relevancy
    # ============================================
    async def deepeval_answer_relevancy_evaluator(params):
        """Uses DeepEval's answer relevancy metric (requires OPENAI_API_KEY)."""
        if not DEEPEVAL_AVAILABLE:
            return EvaluationResult(value=0.0, explanation="DeepEval not installed")

        if not os.getenv("OPENAI_API_KEY"):
            return EvaluationResult(value=0.0, explanation="OPENAI_API_KEY not set")

        output = params["output"]
        query = output.get("query", "").strip()
        response = output.get("response", "").strip()

        if not response or not query:
            return EvaluationResult(value=0.0, explanation="Missing query or response")

        try:
            # Create test case
            test_case = LLMTestCase(
                input=query,
                actual_output=response,
            )

            # Initialize metric
            metric = AnswerRelevancyMetric(
                threshold=0.5,
                model="gpt-4o-mini",  # Use gpt-4o-mini to save costs
                include_reason=False,
            )

            # Measure (synchronous call in thread)
            def measure_sync():
                metric.measure(test_case)
                return float(metric.score) if metric.score is not None else 0.0

            score = await asyncio.to_thread(measure_sync)

            return EvaluationResult(
                value=score,
                explanation=f"{output['agent']}: Relevancy {score:.2f}"
            )

        except Exception as e:
            return EvaluationResult(
                value=0.0,
                explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}"
            )

    # ============================================
    # EVALUATOR 4: Response Length
    # ============================================
    async def response_length_evaluator(params):
        """Checks if response length is appropriate."""
        output = params["output"]
        word_count = len(output["response"].split())

        if 50 <= word_count <= 300:
            score, verdict = 1.0, "Good"
        elif word_count < 50:
            score, verdict = word_count / 50, "Too short"
        else:
            score, verdict = 0.5, "Too long"

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: {word_count}w - {verdict}"
        )

    # ============================================
    # RUN EVALUATION
    # ============================================
    async def main():
        print("=" * 70)
        print("Comparing Agents: VariantA (gpt-5-mini) vs VariantB (claude-sonnet-4.5)")
        print("=" * 70)
        print()

        # Check configuration
        print("Configuration Check:")
        print(f"  ORQ_API_KEY: {'✓' if ORQ_API_KEY else '✗'}")
        print(f"  OPENAI_API_KEY: {'✓' if os.getenv('OPENAI_API_KEY') else '✗ REQUIRED FOR DEEPEVAL'}")
        print(f"  DeepEval: {'✓' if DEEPEVAL_AVAILABLE else '✗'}")
        print()

        if not os.getenv("OPENAI_API_KEY"):
            print("WARNING: DeepEval evaluators will return 0.00 without OPENAI_API_KEY")
            print("Add this in a cell before running:")
            print('os.environ["OPENAI_API_KEY"] = "sk-your-key"')
            print()

        await evaluatorq(
            "variant-comparison",
            data=[
                DataPoint(inputs={
                    "query": "What are the best practices for microservices architecture?",
                    "context": "Microservices architecture is a design pattern where applications are built as collections of loosely coupled services. Best practices include service independence, API-first design, and fault tolerance."
                }),
                DataPoint(inputs={
                    "query": "How do I implement API rate limiting in a production system?",
                    "context": "API rate limiting controls the number of requests a client can make to prevent abuse and ensure fair resource allocation. Common strategies include token bucket, leaky bucket, and fixed window algorithms."
                }),
                DataPoint(inputs={
                    "query": "How does Kubernetes handle container orchestration?",
                    "context": "Kubernetes orchestrates containers through a master-worker architecture. The control plane manages the cluster state, while worker nodes run containerized applications in pods. It handles scheduling, scaling, and self-healing automatically."
                }),
                DataPoint(inputs={
                    "query": "What are best practices for CI/CD pipelines in cloud environments?",
                    "context": "Best practices for cloud CI/CD include: automating testing at all stages, using infrastructure as code, implementing proper secrets management, maintaining separate environments (dev/staging/prod), and ensuring fast feedback loops."
                }),
                DataPoint(inputs={
                    "query": "Explain the difference between AWS ECS and EKS",
                    "context": "AWS ECS (Elastic Container Service) is Amazon's proprietary container orchestration platform, while EKS (Elastic Kubernetes Service) runs managed Kubernetes. ECS is simpler and AWS-specific, while EKS offers Kubernetes portability."
                }),
                DataPoint(inputs={
                    "query": "How do I secure secrets in a cloud-native application?",
                    "context": "Cloud-native secret management involves using services like AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault. Best practices include encryption at rest and in transit, role-based access control, and regular rotation."
                }),
                DataPoint(inputs={
                    "query": "What is the purpose of a service mesh like Istio?",
                    "context": "A service mesh provides infrastructure layer for handling service-to-service communication. Istio manages traffic routing, load balancing, encryption, authentication, and observability without requiring application code changes."
                }),
                DataPoint(inputs={
                    "query": "How does auto-scaling work in AWS?",
                    "context": "AWS Auto Scaling monitors applications and automatically adjusts capacity based on CloudWatch metrics. It uses scaling policies to add or remove EC2 instances based on CPU utilization, request counts, or custom metrics."
                }),
                DataPoint(inputs={
                    "query": "What are the benefits of using Infrastructure as Code?",
                    "context": "Infrastructure as Code (IaC) allows version-controlled, repeatable infrastructure provisioning using tools like Terraform or CloudFormation. Benefits include consistency, auditability, disaster recovery, and reduced manual errors."
                }),
                DataPoint(inputs={
                    "query": "How do I implement zero-downtime deployments?",
                    "context": "Zero-downtime deployments use strategies like blue-green deployments, rolling updates, or canary releases. Load balancers gradually shift traffic to new versions while monitoring health checks and rollback capabilities."
                }),
            ],
            jobs=[variant_a_agent, variant_b_agent],
            evaluators=[
                {"name": "orq-response-quality", "scorer": orq_response_quality_evaluator},
                {"name": "deepeval-faithfulness", "scorer": deepeval_faithfulness_evaluator},
                {"name": "deepeval-relevancy", "scorer": deepeval_answer_relevancy_evaluator},
                {"name": "length", "scorer": response_length_evaluator},
            ],
        )

        print("\n" + "=" * 70)
        print("✓ Evaluation Complete!")
        print("=" * 70)

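    # Top-level await works in notebooks such as Colab.
    # In a standalone script, use asyncio.run(main()) instead.
    await main()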
    ```

    <Info>
      **Alternative Data Sources**: Instead of defining DataPoints inline, you can load data from a CSV file or use [Orq-managed Datasets](/docs/datasets/overview). This is especially useful for running experiments over large evaluation sets.
    </Info>
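
    As a sketch of the CSV route, the snippet below reads rows with the standard `csv` module. The column names `query` and `context` are an assumption chosen to match the inline example above, and `load_rows` is a hypothetical helper. The CSV text is embedded so that the snippet runs on its own; in practice you would pass an open file handle.

    ```python
    import csv
    import io

    # Hypothetical CSV content standing in for a file on disk.
    CSV_TEXT = (
        "query,context\n"
        "How does auto-scaling work in AWS?,AWS Auto Scaling adjusts capacity based on CloudWatch metrics.\n"
        "What is the purpose of a service mesh like Istio?,A service mesh handles service-to-service communication.\n"
    )

    def load_rows(handle) -> list[dict]:
        """Read query/context pairs, skipping rows without a query."""
        rows = []
        for row in csv.DictReader(handle):
            query = (row.get("query") or "").strip()
            if not query:
                continue
            rows.append({"query": query, "context": (row.get("context") or "").strip()})
        return rows

    rows = load_rows(io.StringIO(CSV_TEXT))
    print(len(rows), rows[0]["query"])

    # Each dict can then be wrapped the same way as the inline data above:
    #     data = [DataPoint(inputs=row) for row in rows]
    ```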

    **Expected output**

    <img src="https://mintcdn.com/orqai/qS-8vEwIQ3cNHJtw/evaluators.png?fit=max&auto=format&n=qS-8vEwIQ3cNHJtw&q=85&s=69869a2018a4561068f99d7809e9d721" alt="Evaluators" title="Evaluators" style={{ width:"79%" }} width="1314" height="1284" data-path="evaluators.png" />

    **Interpreting the Results**: The table shows evaluation scores (0.0-1.0) for each agent variant across all four evaluators. Higher scores indicate better performance. A score of 0.75+ suggests the agent meets quality standards, while scores below 0.50 may indicate the agent needs refinement. Compare scores across variants to identify which model configuration performs best for your specific use case.
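
    A minimal sketch of applying those thresholds in code follows. The scores are made-up illustrative numbers, not results from a real run, and the `verdict` helper and its "borderline" label for the 0.50 to 0.75 range are assumptions added for illustration.

    ```python
    from statistics import mean

    # Illustrative, made-up scores per variant and evaluator (not real results).
    scores = {
        "VariantA": {"orq-response-quality": [0.8, 0.9], "length": [1.0, 0.5]},
        "VariantB": {"orq-response-quality": [0.6, 0.3], "length": [0.5, 0.7]},
    }

    def verdict(score: float) -> str:
        """Map an average score to the bands described above."""
        if score >= 0.75:
            return "meets quality bar"
        if score < 0.50:
            return "needs refinement"
        return "borderline"  # assumed label for scores between the two thresholds

    # Average each evaluator's scores per variant.
    summary = {
        variant: {name: round(mean(values), 2) for name, values in by_eval.items()}
        for variant, by_eval in scores.items()
    }

    for variant, by_eval in summary.items():
        for name, avg in by_eval.items():
            print(f"{variant:9} {name:22} {avg:.2f}  {verdict(avg)}")
    ```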
  </Step>

  <Step title="Third-party evaluators">
    <Accordion title="RAGAS" icon="plug">
      <Check>
        [**<u>RAGAS (Retrieval Augmented Generation Assessment)</u>**](https://docs.ragas.io/en/stable/) is a research-backed evaluation framework specifically designed for RAG systems. It provides both reference-free and reference-based metrics that assess retrieval quality and generation quality using LLM-as-a-judge.

        **Reference-Free Metrics (No Ground Truth Needed):**

        * **Faithfulness**: Checks if the response is grounded in the retrieved context
        * **Answer Relevancy**: Checks if the response addresses the query

        **Reference-Based Metrics (Require Ground Truth):**

        * **Context Precision**: Measures whether the retrieved contexts are relevant to the ground truth
        * **Context Recall**: Measures whether all the relevant contexts were retrieved, compared with the ground truth
      </Check>
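
      As rough intuition for Faithfulness: RAGAS breaks the response into individual claims and reports the fraction of them that the retrieved context supports. The sketch below hard-codes the per-claim verdicts that RAGAS would obtain from an LLM judge. `faithfulness_ratio` and the sample claims are hypothetical and only illustrate the arithmetic.

      ```python
      def faithfulness_ratio(verdicts: list[bool]) -> float:
          """Fraction of claims marked as supported by the context."""
          if not verdicts:
              return 0.0
          return sum(verdicts) / len(verdicts)

      # Hypothetical claims extracted from a response, with hand-written verdicts.
      claims = {
          "Machine learning is a branch of AI.": True,
          "It builds systems that learn from data.": True,
          "It was invented in 2015.": False,  # not supported by the context
      }

      score = faithfulness_ratio(list(claims.values()))
      print(f"{score:.2f}")
      ```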

      <Info>
        **Before running this example**: You must first [create an Orq Deployment](/docs/deployments/creating) with a [Knowledge Base enabled](/docs/deployments/creating#knowledge-base). Once created, replace `"rag-knowledge-assistant"` with your deployment key.
      </Info>

      ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import asyncio
      from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
      from orq_ai_sdk import Orq
      import os

      # RAGAS library imports
      try:
          from ragas import evaluate
          from ragas.metrics import faithfulness, answer_relevancy
          from datasets import Dataset
          RAGAS_AVAILABLE = True
      except ImportError:
          RAGAS_AVAILABLE = False
          print("RAGAS not installed. Install with: pip install ragas datasets")

      ORQ_API_KEY = os.getenv("ORQ_API_KEY", "your-api-key-here")

      # ============================================
      # JOB: RAG-Powered Q&A System
      # ============================================
      @job("rag-qa-system")
      async def rag_qa_system(data: DataPoint, row: int):
          """
          RAG system that answers questions using knowledge base.
          This is what we're evaluating - an Orq deployment with RAG.
          """
          with Orq(api_key=ORQ_API_KEY) as orq:
              response = orq.deployments.invoke(
                  key="rag-knowledge-assistant",  # Your RAG-enabled deployment
                  context={
                      "knowledge_base_id": "your-kb-id"  # Optional: specific KB
                  },
                  inputs={"question": data.inputs["question"]},
                  messages=[{
                      "role": "user",
                      "content": data.inputs["question"]
                  }]
              )
              
              answer = response.choices[0].message.content
              
              # Extract contexts from RAG response (if available in metadata)
              # Adjust based on your actual Orq response structure
              contexts = getattr(response, 'contexts', data.inputs.get("contexts", []))
              if not contexts:
                  contexts = ["Retrieved context from knowledge base"]
              
              return {
                  "query": data.inputs["question"],
                  "response": answer,
                  "contexts": contexts,
                  "ground_truth": data.inputs.get("ground_truth", "")
              }

      # ============================================
      # EVALUATOR 1: RAGAS Faithfulness
      # ============================================
      async def ragas_faithfulness_scorer(params):
          """Evaluate faithfulness using RAGAS metric - checks if response is grounded in context."""
          if not RAGAS_AVAILABLE:
              return EvaluationResult(
                  value=0,
                  explanation="RAGAS library not available. Install with: pip install ragas datasets",
              )
          
          output = params["output"]
          
          try:
              # Prepare dataset for RAGAS evaluation
              dataset = Dataset.from_dict({
                  "question": [output["query"]],
                  "answer": [output["response"]],
                  "contexts": [output["contexts"]],
              })
              
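              # Evaluate using RAGAS faithfulness metric
              result = evaluate(dataset, metrics=[faithfulness])
              # Depending on the RAGAS version, indexing the result yields
              # a float or a list of per-row scores; normalize to a float.
              raw = result["faithfulness"]
              score = float(raw[0]) if isinstance(raw, (list, tuple)) else float(raw)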
              
              return EvaluationResult(
                  value=score,
                  explanation=(
                      f"Faithfulness score: {score:.2f} - Response is grounded in provided context"
                      if score >= 0.7
                      else f"Faithfulness score: {score:.2f} - Response contains unsupported claims"
                  ),
              )
          except Exception as e:
              return EvaluationResult(
                  value=0,
                  explanation=f"Error evaluating faithfulness: {str(e)}",
              )

      # ============================================
      # EVALUATOR 2: RAGAS Answer Relevancy
      # ============================================
      async def ragas_answer_relevancy_scorer(params):
          """Evaluate answer relevancy using RAGAS metric - checks if response addresses the query."""
          if not RAGAS_AVAILABLE:
              return EvaluationResult(
                  value=0,
                  explanation="RAGAS library not available. Install with: pip install ragas datasets",
              )
          
          output = params["output"]
          
          try:
              # Prepare dataset for RAGAS evaluation
              dataset = Dataset.from_dict({
                  "question": [output["query"]],
                  "answer": [output["response"]],
                  "contexts": [output["contexts"]],
              })
              
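              # Evaluate using RAGAS answer relevancy metric
              result = evaluate(dataset, metrics=[answer_relevancy])
              # Depending on the RAGAS version, indexing the result yields
              # a float or a list of per-row scores; normalize to a float.
              raw = result["answer_relevancy"]
              score = float(raw[0]) if isinstance(raw, (list, tuple)) else float(raw)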
              
              return EvaluationResult(
                  value=score,
                  explanation=(
                      f"Answer relevancy score: {score:.2f} - Response directly addresses the query"
                      if score >= 0.7
                      else f"Answer relevancy score: {score:.2f} - Response is off-topic or incomplete"
                  ),
              )
          except Exception as e:
              return EvaluationResult(
                  value=0,
                  explanation=f"Error evaluating answer relevancy: {str(e)}",
              )

      # ============================================
      # RUN EVALUATION
      # ============================================
      async def main():
          await evaluatorq(
              "rag-system-evaluation",
              data=[
                  DataPoint(inputs={
                      "question": "What is machine learning?",
                      "contexts": ["Machine learning is a branch of AI focused on building systems that learn from data."],
                      "ground_truth": "Machine learning is a type of AI that allows systems to learn from data."
                  }),
                  DataPoint(inputs={
                      "question": "How does photosynthesis work?",
                      "contexts": ["Plants use chlorophyll to capture light energy and convert CO2 and water into glucose."],
                      "ground_truth": "Photosynthesis converts light energy into chemical energy in plants."
                  }),
                  DataPoint(inputs={
                      "question": "What are the benefits of cloud computing?",
                      "contexts": ["Cloud computing provides scalability, cost efficiency, and flexibility for businesses."],
                      "ground_truth": "Cloud computing offers scalability and cost savings."
                  }),
              ],
              jobs=[rag_qa_system],
              evaluators=[
                  {"name": "ragas-faithfulness", "scorer": ragas_faithfulness_scorer},
                  {"name": "ragas-answer-relevancy", "scorer": ragas_answer_relevancy_scorer},
              ],
          )

      # Top-level await works in notebooks; in a script, import asyncio and call asyncio.run(main())
      await main()
      ```
    </Accordion>

    <Accordion title="DeepEval" icon="plug">
      <Check>
        [**DeepEval**](https://deepeval.com/) is a comprehensive open-source LLM evaluation framework that treats AI testing like software unit testing. Built with pytest integration, it provides 15+ evaluation metrics covering RAG systems, chatbots, AI agents, and general LLM outputs.
      </Check>

      Dependencies

      ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
      pip install deepeval

      export OPENAI_API_KEY="your-api-key"
      ```

      DeepEval implementation 

      ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import asyncio
      import os
      from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
      from orq_ai_sdk import Orq

      # DeepEval library imports
      try:
          from deepeval.metrics import (
              AnswerRelevancyMetric,
              FaithfulnessMetric,
              HallucinationMetric,
          )
          from deepeval.test_case import LLMTestCase
          DEEPEVAL_AVAILABLE = True
      except ImportError:
          DEEPEVAL_AVAILABLE = False
          print("DeepEval not installed. Install with: pip install deepeval")

      # ============================================
      # CONFIGURATION
      # ============================================
      ORQ_API_KEY = os.getenv("ORQ_API_KEY", "")

      # Helper function to extract response text
      def extract_response_text(response):
          """Helper function to extract text from Orq agent response."""
          if hasattr(response, 'content'):
              if isinstance(response.content, list):
                  return " ".join([
                      part.text if hasattr(part, 'text') else str(part)
                      for part in response.content
                  ])
              return str(response.content)
          return str(response)

      @job("VariantA")
      async def variant_a_agent(data: DataPoint, row: int):
          """VariantA agent using gpt-5-mini."""
          with Orq(api_key=ORQ_API_KEY) as orq:
              response = orq.agents.responses.create(
                  agent_key="VariantA",
                  background=False,
                  message={
                      "role": "user",
                      "parts": [{"kind": "text", "text": data.inputs["query"]}]
                  }
              )

              return {
                  "agent": "VariantA",
                  "query": data.inputs["query"],
                  "response": extract_response_text(response),
                  "context": data.inputs.get("context", "")
              }

      async def deepeval_faithfulness_scorer(params):
          """Evaluate faithfulness using DeepEval metric."""
          if not DEEPEVAL_AVAILABLE:
              return EvaluationResult(
                  value=0,
                  explanation="DeepEval library not available. Install with: pip install deepeval",
              )

          output = params["output"]
          query = output.get("query", "").strip()
          response = output.get("response", "").strip()
          context = output.get("context", "").strip()

          if not response or not context:
              return EvaluationResult(value=0.0, explanation="Missing response or context")

          try:
              # Create test case for DeepEval evaluation
              test_case = LLMTestCase(
                  input=query,
                  actual_output=response,
                  retrieval_context=[context],
              )

              # Initialize DeepEval Faithfulness metric
              faithfulness_metric = FaithfulnessMetric(
                  threshold=0.7,
                  model="gpt-4o-mini",
                  include_reason=False,
              )

              # Measure faithfulness (synchronous call in thread)
              def measure_sync():
                  faithfulness_metric.measure(test_case)
                  return float(faithfulness_metric.score) if faithfulness_metric.score is not None else 0.0

              score = await asyncio.to_thread(measure_sync)

              return EvaluationResult(
                  value=score,
                  explanation=f"{output['agent']}: Faithfulness {score:.2f}"
              )
          except Exception as e:
              return EvaluationResult(
                  value=0,
                  explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}",
              )

      async def deepeval_hallucination_scorer(params):
          """Evaluate hallucination using DeepEval metric."""
          if not DEEPEVAL_AVAILABLE:
              return EvaluationResult(
                  value=0,
                  explanation="DeepEval library not available. Install with: pip install deepeval",
              )

          output = params["output"]
          query = output.get("query", "").strip()
          response = output.get("response", "").strip()
          context = output.get("context", "").strip()

          if not response or not context:
              return EvaluationResult(value=0.0, explanation="Missing response or context")

          try:
              # Create test case for DeepEval evaluation
              test_case = LLMTestCase(
                  input=query,
                  actual_output=response,
                  context=[context],
              )

              # Initialize DeepEval Hallucination metric
              hallucination_metric = HallucinationMetric(
                  threshold=0.5,
                  model="gpt-4o-mini",
                  include_reason=False,
              )

              # Measure hallucination (synchronous call in thread)
              def measure_sync():
                  hallucination_metric.measure(test_case)
                  return float(hallucination_metric.score) if hallucination_metric.score is not None else 0.0

              score = await asyncio.to_thread(measure_sync)

              # Invert score so higher is better (1 - hallucination_score)
              inverted_score = 1 - score

              return EvaluationResult(
                  value=inverted_score,
                  explanation=f"{output['agent']}: Hallucination {score:.2f} (inverted: {inverted_score:.2f})"
              )
          except Exception as e:
              return EvaluationResult(
                  value=0,
                  explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}",
              )

      async def main():
          await evaluatorq(
              "variant-a-deepeval",
              data=[
                  DataPoint(inputs={
                      "query": "What are the best practices for microservices architecture?",
                      "context": "Microservices architecture is a design pattern where applications are built as collections of loosely coupled services. Best practices include service independence, API-first design, and fault tolerance."
                  }),
                  DataPoint(inputs={
                      "query": "How do I implement API rate limiting in a production system?",
                      "context": "API rate limiting controls the number of requests a client can make to prevent abuse and ensure fair resource allocation. Common strategies include token bucket, leaky bucket, and fixed window algorithms."
                  }),
              ],
              jobs=[variant_a_agent],
              evaluators=[
                  {"name": "deepeval-faithfulness", "scorer": deepeval_faithfulness_scorer},
                  {"name": "deepeval-hallucination", "scorer": deepeval_hallucination_scorer},
              ],
          )

      # Top-level await works in notebooks; in a script, use asyncio.run(main()) instead
      await main()
      ```
    </Accordion>
  </Step>
</Steps>
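
Both DeepEval scorers above run the blocking `measure` call through `asyncio.to_thread`, so a slow metric does not stall the event loop while other evaluations are in flight. The sketch below isolates that pattern with a stand-in metric. `FakeMetric` and `score_in_thread` are hypothetical names used only for illustration; they are not part of DeepEval or Evaluatorq.

```python
import asyncio
import time


class FakeMetric:
    """Stand-in for a metric object with a blocking measure() method (hypothetical)."""

    def __init__(self, fixed_score):
        self.fixed_score = fixed_score
        self.score = None

    def measure(self, test_case):
        time.sleep(0.05)  # simulate a blocking network call
        self.score = self.fixed_score


async def score_in_thread(metric, test_case, invert=False):
    """Run metric.measure() in a worker thread and return a float score."""

    def measure_sync():
        metric.measure(test_case)
        return float(metric.score) if metric.score is not None else 0.0

    score = await asyncio.to_thread(measure_sync)
    # For "lower is better" metrics such as hallucination, flip the scale.
    return 1.0 - score if invert else score


async def demo():
    # Each blocking call runs in its own worker thread, so the two overlap.
    return await asyncio.gather(
        score_in_thread(FakeMetric(0.9), test_case=None),
        score_in_thread(FakeMetric(0.25), test_case=None, invert=True),
    )


faithfulness, hallucination_inverted = asyncio.run(demo())
```

In a notebook, replace `asyncio.run(demo())` with `await demo()`, since an event loop is already running there.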

## Orq.ai vs LangGraph Agent

[Orq.ai](http://Orq.ai) allows you to process third-party agent traces. This evaluation compares two AI agent implementations that both use the GPT-4o model. Both agents act as Cloud Engineering Assistants and are tested on cloud infrastructure questions.

**Agents tested:**

* `LangChain Agent`: Direct implementation using LangChain's ChatOpenAI with custom system prompts
* `Orq Native Agent`: Agent deployed through the [Orq.ai](http://Orq.ai) platform with equivalent configuration

**Evaluation metrics:**

* `DeepEval Faithfulness`: Measures how well responses align with provided context
* `Cloud Engineering Relevance`: Keyword-based scoring for cloud-specific terminology
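
The keyword-based metric can be sketched on its own as follows. This is an illustrative variation, not the evaluator used in the notebook: it matches whole words with a regular expression so that a short keyword such as `api` does not fire inside `rapid`, and it uses a shortened keyword list. The function name `cloud_relevance_score` and the keyword subset are assumptions; the score tiers follow the ones in the full script further down.

```python
import re

# Shortened, illustrative keyword list (assumption: extend it for real use).
CLOUD_KEYWORDS = ["aws", "kubernetes", "docker", "terraform", "api", "load balancer", "iam"]

# One word-boundary pattern per keyword, so "api" does not match inside "rapid".
_PATTERNS = {
    keyword: re.compile(r"\b" + re.escape(keyword) + r"\b")
    for keyword in CLOUD_KEYWORDS
}


def cloud_relevance_score(response):
    """Return (score, matched_keywords) using whole-word keyword matches."""
    text = response.lower()
    matched = [kw for kw, pattern in _PATTERNS.items() if pattern.search(text)]
    count = len(matched)
    if count >= 4:
        score = 1.0
    elif count >= 2:
        score = 0.7
    elif count >= 1:
        score = 0.4
    else:
        score = 0.1
    return score, matched


score, matched = cloud_relevance_score(
    "Deploy the API on Kubernetes with Terraform behind a load balancer."
)
off_topic_score, off_topic_matched = cloud_relevance_score("A rapid diamond heist.")
```

Plain substring checks are simpler, but they can inflate scores for unrelated text; whole-word matching trades a little setup for fewer false positives.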

<Steps>
  <Step title="Set up LangGraph traces in Orq.ai ">
    Follow along with the [LangGraph vs Orq.ai Agent](https://colab.research.google.com/drive/1Jv1J_tQAFYrRjUXXrD37MkyH588CI7mm?usp=sharing) notebook in Google Colab. Configure the following variables under the `Step 1` section:

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    # ============================================
    # STEP 1: Configure Environment Variables
    # ============================================
    ```

    `ORQ_API_KEY` - For Orq agent access and telemetry export

    `OPENAI_API_KEY` - For LangChain agent and DeepEval metrics
  </Step>

  <Step title="Run the evaluators">
    In this step, we set up equivalent configurations of the LangChain and Orq.ai agents and run two evaluators against them. The notebook walks through the following steps:

    `Step 2` - Install and Import LangChain

    `Step 3` - Install and Import DeepEval

    `Step 4` - Create LangChain Agent (Matching Orq Setup)

    `Step 5` - Call the [Orq.ai](http://Orq.ai)-native Agent

    `Step 6` - Run DeepEval and Relevance evals

    ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
    import asyncio
    from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
    from orq_ai_sdk import Orq
    import os

    # ============================================
    # STEP 1: Configure Environment Variables
    # ============================================
    # Orq.ai OpenTelemetry exporter for LangGraph traces
    os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://api.orq.ai/v2/otel"
    os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Bearer {os.getenv('ORQ_API_KEY')}"

    # Enable LangSmith tracing in OTEL-only mode
    os.environ["LANGSMITH_OTEL_ENABLED"] = "true"
    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_OTEL_ONLY"] = "true"

    # ============================================
    # STEP 2: Install and Import LangChain
    # ============================================
    try:
        from langchain_openai import ChatOpenAI
        LANGCHAIN_AVAILABLE = True
        print("✓ LangChain loaded")
    except ImportError:
        LANGCHAIN_AVAILABLE = False
        print("✗ LangChain not installed. Run: pip install langchain-openai")

    # ============================================
    # STEP 3: Install and Import DeepEval
    # ============================================
    try:
        from deepeval.metrics import FaithfulnessMetric
        from deepeval.test_case import LLMTestCase
        DEEPEVAL_AVAILABLE = True
        print("✓ DeepEval loaded")
    except ImportError:
        DEEPEVAL_AVAILABLE = False
        print("✗ DeepEval not installed. Run: pip install deepeval")

    # ============================================
    # CONFIGURATION
    # ============================================
    ORQ_API_KEY = os.getenv("ORQ_API_KEY")
    orq_client = Orq(api_key=ORQ_API_KEY)

    # Agent keys
    ORQ_AGENT_KEY = "VariantA"  # Your existing Orq agent

    # ============================================
    # STEP 4: Create LangChain Agent (Matching Orq Setup)
    # ============================================
    if LANGCHAIN_AVAILABLE:
        llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0.7,
            max_tokens=None
        )

        system_message = """You are a Cloud Engineering Assistant.

    Role: Cloud Engineering Assistant
    Description: A helpful assistant for cloud engineering tasks
    Instructions: Be helpful and concise

    Please assist the user with their cloud engineering questions."""

    # ============================================
    # JOB 1: LangChain Agent
    # ============================================
    @job("LangChain-Agent-GPT4o")
    async def langchain_agent_job(data: DataPoint, row: int):
        """LangChain agent using GPT-4o (matching Orq setup)."""
        if not LANGCHAIN_AVAILABLE:
            return {
                "agent": "LangChain-GPT4o",
                "query": data.inputs["query"],
                "response": "LangChain not available",
                "context": data.inputs.get("context", ""),
                "error": True
            }

        try:
            messages = [
                {"role": "system", "content": system_message},
                {"role": "user", "content": data.inputs["query"]}
            ]

            result = await asyncio.to_thread(llm.invoke, messages)
            response = result.content if hasattr(result, 'content') else str(result)

            print(f"✓ LangChain response: {response[:80]}...")

            return {
                "agent": "LangChain-GPT4o",
                "query": data.inputs["query"],
                "response": response,
                "context": data.inputs.get("context", ""),
                "error": False
            }
        except Exception as e:
            print(f"✗ LangChain error: {e}")
            return {
                "agent": "LangChain-GPT4o",
                "query": data.inputs["query"],
                "response": f"Error: {str(e)}",
                "context": data.inputs.get("context", ""),
                "error": True
            }

    # ============================================
    # JOB 2: Orq Native Agent (Your Existing Agent)
    # ============================================
    @job("VariantA")
    async def orq_native_agent_job(data: DataPoint, row: int):
        """Orq native agent - VariantA."""
        try:
            with Orq(api_key=ORQ_API_KEY) as orq:
                response = orq.agents.responses.create(
                    agent_key=ORQ_AGENT_KEY,
                    background=False,
                    message={
                        "role": "user",
                        "parts": [{"kind": "text", "text": data.inputs["query"]}]
                    }
                )

                # Extract response text
                response_text = ""
                if hasattr(response, 'message'):
                    if hasattr(response.message, 'content'):
                        response_text = response.message.content
                    elif hasattr(response.message, 'parts'):
                        response_text = " ".join([
                            part.text if hasattr(part, 'text') else str(part)
                            for part in response.message.parts
                        ])
                elif hasattr(response, 'content'):
                    response_text = response.content
                else:
                    response_text = str(response)

                print(f"✓ Orq response: {response_text[:80]}...")

                return {
                    "agent": "Orq-Native-GPT4o",
                    "query": data.inputs["query"],
                    "response": response_text,
                    "context": data.inputs.get("context", ""),
                    "error": False
                }
        except Exception as e:
            print(f"✗ Orq agent error: {e}")
            return {
                "agent": "Orq-Native-GPT4o",
                "query": data.inputs["query"],
                "response": f"Error: {str(e)}",
                "context": data.inputs.get("context", ""),
                "error": True
            }

    # ============================================
    # EVALUATOR 1: DeepEval Faithfulness
    # ============================================
    async def deepeval_faithfulness_evaluator(params):
        """Uses DeepEval's faithfulness metric (requires OPENAI_API_KEY)."""
        if not DEEPEVAL_AVAILABLE:
            return EvaluationResult(value=0.0, explanation="DeepEval not installed")

        if not os.getenv("OPENAI_API_KEY"):
            return EvaluationResult(value=0.0, explanation="OPENAI_API_KEY not set")

        output = params["output"]

        if output.get("error"):
            return EvaluationResult(value=0.0, explanation=f"{output['agent']}: Job error")

        query = output.get("query", "").strip()
        response = output.get("response", "").strip()
        context = output.get("context", "").strip()

        if not response or not context:
            return EvaluationResult(value=0.0, explanation="Missing response or context")

        try:
            # Create test case
            test_case = LLMTestCase(
                input=query,
                actual_output=response,
                retrieval_context=[context],
            )

            # Initialize metric
            metric = FaithfulnessMetric(
                threshold=0.5,
                model="gpt-4o-mini",  # Use gpt-4o-mini to save costs
                include_reason=False,
            )

            # Measure (synchronous call in thread)
            def measure_sync():
                metric.measure(test_case)
                return float(metric.score) if metric.score is not None else 0.0

            score = await asyncio.to_thread(measure_sync)

            return EvaluationResult(
                value=score,
                explanation=f"{output['agent']}: Faithfulness {score:.2f}"
            )

        except Exception as e:
            print(f"✗ DeepEval error: {e}")
            return EvaluationResult(
                value=0.0,
                explanation=f"{output['agent']}: DeepEval error - {str(e)[:50]}"
            )

    # ============================================
    # EVALUATOR 2: Cloud Engineering Relevance
    # ============================================
    async def cloud_engineering_relevance_evaluator(params):
        """Checks if response is relevant to cloud engineering."""
        output = params["output"]
        response = output.get("response", "").lower()

        if output.get("error"):
            return EvaluationResult(value=0.0, explanation=f"{output['agent']}: Job error")

        # Cloud engineering keywords
        cloud_keywords = [
            "aws", "azure", "gcp", "google cloud", "cloud",
            "kubernetes", "k8s", "docker", "container",
            "serverless", "lambda", "ec2", "s3", "rds",
            "deployment", "infrastructure", "devops",
            "ci/cd", "cicd", "pipeline", "terraform",
            "ansible", "microservices", "api", "rest",
            "scalability", "availability", "region",
            "zone", "load balancer", "auto scaling",
            "vpc", "subnet", "security group", "iam"
        ]

        keyword_count = sum(1 for keyword in cloud_keywords if keyword in response)

        if keyword_count >= 4:
            score = 1.0
            verdict = "Highly relevant"
        elif keyword_count >= 2:
            score = 0.7
            verdict = "Relevant"
        elif keyword_count >= 1:
            score = 0.4
            verdict = "Somewhat relevant"
        else:
            score = 0.1
            verdict = "Not cloud-specific"

        return EvaluationResult(
            value=score,
            explanation=f"{output['agent']}: {verdict} ({keyword_count} keywords)"
        )

    # ============================================
    # RUN EVALUATION
    # ============================================
    async def main():
        print("=" * 70)
        print("Comparing LangChain vs Orq Native Agent")
        print("Both agents: GPT-4o | Cloud Engineering Assistant")
        print("=" * 70)
        print()

        print("Configuration:")
        print(f"  ORQ_API_KEY: {'✓' if ORQ_API_KEY else '✗'}")
        print(f"  OPENAI_API_KEY: {'✓' if os.getenv('OPENAI_API_KEY') else '✗ REQUIRED FOR DEEPEVAL'}")
        print(f"  LangChain: {'✓' if LANGCHAIN_AVAILABLE else '✗'}")
        print(f"  DeepEval: {'✓' if DEEPEVAL_AVAILABLE else '✗'}")
        print(f"  Orq Agent Key: {ORQ_AGENT_KEY}")
        print()

        if not os.getenv("OPENAI_API_KEY"):
            print("WARNING: DeepEval requires OPENAI_API_KEY")
            print("Set it with: os.environ['OPENAI_API_KEY'] = 'sk-your-key'")
            print()

        await evaluatorq(
            "langchain-vs-orq-comparison",
            data=[
                DataPoint(inputs={
                    "query": "What are the best practices for microservices architecture?",
                    "context": "Microservices architecture is a design pattern where applications are built as collections of loosely coupled services. Best practices include service independence, API-first design, and fault tolerance."
                }),
                DataPoint(inputs={
                    "query": "How do I implement API rate limiting in a production system?",
                    "context": "API rate limiting controls the number of requests a client can make to prevent abuse and ensure fair resource allocation. Common strategies include token bucket, leaky bucket, and fixed window algorithms."
                }),
                DataPoint(inputs={
                    "query": "How does Kubernetes handle container orchestration?",
                    "context": "Kubernetes orchestrates containers through a master-worker architecture. The control plane manages the cluster state, while worker nodes run containerized applications in pods. It handles scheduling, scaling, and self-healing automatically."
                }),
                DataPoint(inputs={
                    "query": "What are best practices for CI/CD pipelines in cloud environments?",
                    "context": "Best practices for cloud CI/CD include: automating testing at all stages, using infrastructure as code, implementing proper secrets management, maintaining separate environments (dev/staging/prod), and ensuring fast feedback loops."
                }),
                DataPoint(inputs={
                    "query": "Explain the difference between AWS ECS and EKS",
                    "context": "AWS ECS (Elastic Container Service) is Amazon's proprietary container orchestration platform, while EKS (Elastic Kubernetes Service) runs managed Kubernetes. ECS is simpler and AWS-specific, while EKS offers Kubernetes portability."
                }),
                DataPoint(inputs={
                    "query": "How do I secure secrets in a cloud-native application?",
                    "context": "Cloud-native secret management involves using services like AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault. Best practices include encryption at rest and in transit, role-based access control, and regular rotation."
                }),
                DataPoint(inputs={
                    "query": "What is the purpose of a service mesh like Istio?",
                    "context": "A service mesh provides infrastructure layer for handling service-to-service communication. Istio manages traffic routing, load balancing, encryption, authentication, and observability without requiring application code changes."
                }),
                DataPoint(inputs={
                    "query": "How does auto-scaling work in AWS?",
                    "context": "AWS Auto Scaling monitors applications and automatically adjusts capacity based on CloudWatch metrics. It uses scaling policies to add or remove EC2 instances based on CPU utilization, request counts, or custom metrics."
                }),
                DataPoint(inputs={
                    "query": "What are the benefits of using Infrastructure as Code?",
                    "context": "Infrastructure as Code (IaC) allows version-controlled, repeatable infrastructure provisioning using tools like Terraform or CloudFormation. Benefits include consistency, auditability, disaster recovery, and reduced manual errors."
                }),
                DataPoint(inputs={
                    "query": "How do I implement zero-downtime deployments?",
                    "context": "Zero-downtime deployments use strategies like blue-green deployments, rolling updates, or canary releases. Load balancers gradually shift traffic to new versions while monitoring health checks and rollback capabilities."
                }),
            ],
            jobs=[langchain_agent_job, orq_native_agent_job],
            evaluators=[
                {"name": "deepeval-faithfulness", "scorer": deepeval_faithfulness_evaluator},
                {"name": "cloud-relevance", "scorer": cloud_engineering_relevance_evaluator},
            ],
        )

        print("\n" + "=" * 70)
        print("✓ Evaluation Complete!")
        print("Check Orq.ai workspace for results and LangChain traces")
        print("=" * 70)

    # Top-level await works in notebooks such as Colab; in a script, use asyncio.run(main()) instead
    await main()
    ```

    Expected Results:

    <img src="https://mintcdn.com/orqai/qS-8vEwIQ3cNHJtw/images/2.png?fit=max&auto=format&n=qS-8vEwIQ3cNHJtw&q=85&s=21ae5b2d2b3db4d8290197cb182cebde" alt="2" width="1916" height="1406" data-path="images/2.png" />
  </Step>

  <Step title="Preview the results in AI Studio">
    You can see the results directly in the AI Studio by clicking on the generated link that shows up after you run the agent evaluators:

    <img src="https://mintcdn.com/orqai/7GeWctBW8LNNtOtd/images/evaluatorq.gif?s=f977b8e955f9cc3bb503fd0f30dda46e" alt="Evaluatorq" width="2680" height="1080" data-path="images/evaluatorq.gif" />
  </Step>
</Steps>

## Key Takeaways

You can kick off experiments from code every time you make a significant change to your AI system and run them against your golden dataset to confirm that the change improves performance rather than degrading it. Evaluatorq helps you catch performance dips before they reach users, validate that new model versions maintain quality standards, and iterate quickly with confidence. Whether you're optimizing prompt configurations, testing agent decision-making logic, or validating RAG system faithfulness, Evaluatorq gives you the evaluation infrastructure to build reliable, production-ready AI applications at scale.
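
One way to act on this is a regression gate that compares a candidate's evaluator scores with a baseline collected on the same dataset. The sketch below is a hypothetical helper, not part of Evaluatorq: `check_regression`, the score dictionaries, and the `tolerance` value are all assumptions, and it expects you to have gathered the per-evaluator score lists yourself.

```python
from statistics import mean


def check_regression(baseline, candidate, tolerance=0.02):
    """Return evaluator names whose mean score dropped by more than `tolerance`.

    Both arguments map an evaluator name to a list of per-datapoint scores.
    """
    regressions = {}
    for name, baseline_scores in baseline.items():
        candidate_scores = candidate.get(name)
        if not candidate_scores:
            continue  # nothing to compare for this evaluator
        drop = mean(baseline_scores) - mean(candidate_scores)
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Illustrative numbers only, not real experiment results.
baseline = {"deepeval-faithfulness": [0.9, 0.8, 1.0], "cloud-relevance": [0.7, 1.0, 0.7]}
candidate = {"deepeval-faithfulness": [0.7, 0.6, 0.8], "cloud-relevance": [0.7, 1.0, 0.7]}

regressions = check_regression(baseline, candidate)
```

A CI job could fail the build whenever `regressions` is non-empty, which keeps a drop in faithfulness or relevance from shipping unnoticed.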
