> ## Documentation Index
> Fetch the complete documentation index at: https://docs.orq.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Build Experiments

> Test prompts and models at scale. Compare performance metrics, evaluate outputs, and iterate on configurations via the AI Studio, API, or Orq MCP.

**Experiments** run model generations across a [Dataset](/docs/ai-studio/optimize/datasets) and record **Latency**, **Cost**, and **Time to First Token** for each generation. Results can be reviewed manually or scored automatically with [Evaluators](/docs/ai-studio/optimize/evaluators) and Human Reviews. For code-driven experiments, **Orq.ai** provides the **evaluatorq** framework — available as separate packages for [Python](https://github.com/orq-ai/evaluatorq) and [TypeScript](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq) — to define jobs, evaluators, and data sources programmatically and sync results back to the AI Studio.

<div className="max-w-md">
  <iframe src="https://www.youtube.com/embed/LNPyde8c_0Q" title="YouTube video player" frameborder="0" className="w-full aspect-video rounded-xl" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen />
</div>

## Use Cases

<AccordionGroup>
  <Accordion title="Compare models side by side" icon="sparkles">
    Run the same dataset through multiple models to compare output quality, cost, and latency. Works for newly released models, fine-tuned models, and private models added to the [AI Gateway](/docs/ai-studio/ai-gateway/add-models).
  </Accordion>

  <Accordion title="Optimise prompts" icon="pen-to-square">
    Test multiple prompt variants on the same dataset. Use evaluators like Cosine Similarity to quantitatively assess which version produces the best results.
  </Accordion>

  <Accordion title="Pre-deployment and regression testing" icon="flask">
    Run experiments against your current prompt configuration before shipping changes. Use historical datasets to verify that updates haven't degraded performance in any area.
  </Accordion>

  <Accordion title="Security and red teaming" icon="shield-halved">
    Test how your model responds to jailbreak attempts and adversarial inputs in a controlled environment before putting it into production.
  </Accordion>
</AccordionGroup>

## Prerequisites

<CardGroup cols={3}>
  <Card title="Dataset" icon="database" href="/docs/ai-studio/optimize/datasets">
    A Dataset with Inputs, Messages, and/or Expected Outputs
  </Card>

  <Card title="AI Gateway" icon="code-fork" href="/docs/ai-studio/ai-gateway/add-models">
    Models added to the AI Gateway
  </Card>

  <Card title="API Key" icon="key" href="/docs/ai-studio/organization/api-keys">
    An API Key (API and MCP only)
  </Card>
</CardGroup>

## Create an Experiment

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    In the AI Studio, choose a [Project](/docs/ai-studio/get-started/projects) and folder, click the <kbd><Icon icon="plus" /></kbd> button, and select **Experiment**.

    Select a [Dataset](/docs/ai-studio/optimize/datasets) and one or more models, then click <kbd className="key">Create</kbd>. Use the search field to find datasets quickly.

    You are taken to the Experiment Studio where you configure data entries and tasks before running.
  </Tab>

  <Tab title="API & SDK" icon="code">
    Use the **evaluatorq framework** to run experiments from code — available as separate packages for [Python](https://github.com/orq-ai/evaluatorq) and [TypeScript](https://github.com/orq-ai/orqkit/tree/main/packages/evaluatorq).

    **Install:**

    <CodeGroup>
      ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      pip install orq-ai-sdk
      pip install evaluatorq
      ```

      ```bash Node.js theme={"theme":{"light":"github-light","dark":"github-dark"}}
      npm install @orq-ai/evaluatorq
      npm install @orq-ai/node
      ```
    </CodeGroup>

    **Configure environment:**

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    export ORQ_API_KEY="your-api-key"
    export ORQ_ENV="production"
    export ORQ_EVALUATOR_ID="your-evaluator-ulid"  # optional
    ```

    <Warning>
      `ORQ_API_KEY` is required to invoke Deployments and Agents, run Evaluators, and sync results to the **Orq.ai** UI. Without it, experiments run locally only.
    </Warning>

    **Define your data.** Choose one of three approaches:

    <AccordionGroup>
      <Accordion title="Reference an existing Dataset (recommended)">
        <CodeGroup>
          ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
          from evaluatorq import DatasetIdInput

          dataset_id = "01ARZ3NDEKTSV4RRFFQ69G5FAV"
          # Pass DatasetIdInput directly to evaluatorq in the Run step
          ```

          ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
          import { DatasetIdInput } from "@orq-ai/evaluatorq";

          const datasetId = "01ARZ3NDEKTSV4RRFFQ69G5FAV";
          // Pass DatasetIdInput directly to evaluatorq in the Run step
          ```
        </CodeGroup>
      </Accordion>

      <Accordion title="Load from CSV or JSON">
        <CodeGroup>
          ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
          import csv, json
          from evaluatorq import DataPoint

          with open("test_data.csv", "r") as f:
              test_data = [DataPoint(inputs=row) for row in csv.DictReader(f)]

          # or from JSON
          with open("test_data.json", "r") as f:
              test_data = [DataPoint(inputs=item) for item in json.load(f)]
          ```

          ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
          import { DataPoint } from "@orq-ai/evaluatorq";
          import * as fs from "fs";
          import csv from "csv-parser";

          const data = JSON.parse(fs.readFileSync("test_data.json", "utf-8"));
          const testData = data.map((item: any) => ({ inputs: item }));
          ```
        </CodeGroup>
      </Accordion>

      <Accordion title="Define inline">
        <CodeGroup>
          ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
          from evaluatorq import DataPoint

          test_data = [
              DataPoint(inputs={"text": "Cinderella tells the story of a kind young woman..."}),
              DataPoint(inputs={"text": "Little Red Riding Hood follows a girl traveling..."}),
          ]
          ```

          ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
          import { DataPoint } from "@orq-ai/evaluatorq";

          const testData: DataPoint[] = [
              { inputs: { text: "Cinderella tells the story of a kind young woman..." } },
              { inputs: { text: "Little Red Riding Hood follows a girl traveling..." } },
          ];
          ```
        </CodeGroup>
      </Accordion>
    </AccordionGroup>

    <Tip>See the [evaluatorq Tutorial](/docs/tutorials/evaluator-q) for advanced patterns including third-party framework integration and CI/CD setup.</Tip>
  </Tab>

  <Tab title="MCP" icon="https://mintcdn.com/orqai/i7ZhKI7LFRfXU7ox/images/logos/mcp.svg?fit=max&auto=format&n=i7ZhKI7LFRfXU7ox&q=85&s=cef7916eb5fe1f6bb97541398d3f7639" width="16" height="16" data-path="images/logos/mcp.svg">
    **Create an experiment from an existing dataset:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 using the "user-queries" dataset
    ```

    The assistant uses `search_entities` to find the dataset, then `create_experiment` with two model configurations and `auto_run` enabled.

    ***

    **Compare two prompt strategies:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create an experiment using the "customer-feedback" dataset with two prompts: one focused on empathy and one on brevity. Run it and summarize the results.
    ```

    The assistant uses `create_experiment` with two prompt variants and `auto_run` enabled, then `get_experiment_run` to retrieve and summarise the evaluation metrics.
  </Tab>
</Tabs>

### Configure Tasks

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    The left side of the Experiment table shows the loaded Dataset entries. Each row runs separately against each configured task.

    Add new test rows with the **Add Row** button. Edit Inputs, Messages, and Expected Outputs by selecting any cell.

    <Tip>
      Columns can be reorganised and hidden using the <kbd><Icon icon="ellipsis" /></kbd> menu.
    </Tip>

    <img src="https://mintcdn.com/orqai/E8L3R46ivX7g9-QI/images/docs/d22119f2c29008b097fa145ed1f71d86af4a37dc69acb0725774aa3f7b912673-iScreen_Shoter_-_Google_Chrome_-_250210113359.jpg?fit=max&auto=format&n=E8L3R46ivX7g9-QI&q=85&s=45278776ebcc821500e8971d89f72c23" alt="CS_demo experiment grid in Draft state showing Inputs, Messages, Expected Output, and Response columns with gpt-4o and claude-3-5-sonnet variants and 10 dataset rows." width="3542" height="1295" data-path="images/docs/d22119f2c29008b097fa145ed1f71d86af4a37dc69acb0725774aa3f7b912673-iScreen_Shoter_-_Google_Chrome_-_250210113359.jpg" />

    To add a task, open the sidebar and select **+Task**:

    <AccordionGroup>
      <Accordion title="Configure a Model" icon="cubes">
        Select a model to open the Prompt panel. Configure the prompt template using:

        * The **Messages** column from the dataset.
        * A configured **Prompt**.
        * A combination of both.

        <Frame caption="Open the Prompt panel by selecting the model name on the left panel.">
          <img src="https://mintcdn.com/orqai/kym08_pOTNRFhXF_/images/experiment-prompt-model.png?fit=max&auto=format&n=kym08_pOTNRFhXF_&q=85&s=a232f223f23af3981843d5166e9fb806" alt="Experiment view with the Prompt panel open on the right, showing model settings for gpt-4.2 including temperature, max tokens, and messaging column configuration." width="1629" height="960" data-path="images/experiment-prompt-model.png" />
        </Frame>

        <Info>
          To learn more about Prompt Template configuration, see [Creating a Prompt](/docs/ai-studio/prompts/prompts).
        </Info>
      </Accordion>

      <Accordion title="Configure an Agent" icon="robot">
        Choose an Agent from the **+Task** menu. Its configuration is automatically loaded as a new column.

        The agent prompt can use:

        * **Instructions + Messages** only.
        * **Instructions + Dataset Messages** column.

        <Frame caption="Open the Prompt panel by selecting the Agent name on the left panel.">
          <img src="https://mintcdn.com/orqai/kym08_pOTNRFhXF_/images/experiment-agent.png?fit=max&auto=format&n=kym08_pOTNRFhXF_&q=85&s=4bd216afead03e625a6757fa34897cf9" alt="Experiment view with the Agent panel open on the right, showing the bank_creditcard_agent_gpt_4.2 agent with instructions for Dutch Royal Bank Credit Card Support." width="1632" height="955" data-path="images/experiment-agent.png" />
        </Frame>

        <Info>
          To learn more about Agent configuration, see [Build Agents](/docs/ai-studio/ai-engineering/build-agents).
        </Info>
      </Accordion>
    </AccordionGroup>
  </Tab>

  <Tab title="API & SDK" icon="code">
    Define jobs using the `@job` decorator (Python) or `job()` function (TypeScript). Each job defines one variant to test.

    <CodeGroup>
      ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import asyncio, os
      from evaluatorq import job, DataPoint
      from orq_ai_sdk import Orq

      orq_client = Orq(
          api_key=os.getenv("ORQ_API_KEY"),
          server_url=os.getenv("ORQ_SERVER_URL", "https://my.orq.ai")
      )

      def extract_response_text(response):
          if hasattr(response, "output") and response.output:
              if isinstance(response.output, list) and len(response.output) > 0:
                  part = response.output[0]
                  if hasattr(part, "parts") and part.parts:
                      return part.parts[0].text if hasattr(part.parts[0], "text") else str(part.parts[0])
          if hasattr(response, "content"):
              if isinstance(response.content, list):
                  return " ".join(part.text if hasattr(part, "text") else str(part) for part in response.content)
              return str(response.content)
          return str(response)

      @job("summarize-variant-a")
      async def summarize_variant_a(data: DataPoint, row: int):
          response = await asyncio.to_thread(
              orq_client.deployments.invoke,
              key="summarization_v2",
              context={"environments": [], "reasoning": ["minimal"]},
              inputs={"text": data.inputs["text"]},
          )
          return {"variant": "variant-a", "input": data.inputs["text"], "summary": extract_response_text(response)}

      @job("summarize-variant-b")
      async def summarize_variant_b(data: DataPoint, row: int):
          response = await asyncio.to_thread(
              orq_client.deployments.invoke,
              key="summarization_v2",
              context={"environments": [], "reasoning": ["medium"]},
              inputs={"text": data.inputs["text"]},
          )
          return {"variant": "variant-b", "input": data.inputs["text"], "summary": extract_response_text(response)}
      ```

      ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import { job, DataPoint } from "@orq-ai/evaluatorq";
      import { Orq } from "@orq-ai/node";

      const orqClient = new Orq({
          apiKey: process.env.ORQ_API_KEY,
          serverUrl: process.env.ORQ_SERVER_URL || "https://my.orq.ai",
      });

      function extractResponseText(response: any): string {
          if (response?.output?.[0]?.parts?.[0]?.text) return response.output[0].parts[0].text;
          if (Array.isArray(response?.content)) return response.content.map((p: any) => p.text || String(p)).join(" ");
          if (response?.content) return String(response.content);
          return String(response);
      }

      const summarizeVariantA = job("summarize-variant-a", async (data: DataPoint) => {
          const response = await orqClient.deployments.invoke({
              key: "summarization_v2",
              context: { environments: [], reasoning: ["minimal"] },
              inputs: { text: data.inputs.text as string },
          });
          return { variant: "variant-a", input: data.inputs.text, summary: extractResponseText(response) };
      });

      const summarizeVariantB = job("summarize-variant-b", async (data: DataPoint) => {
          const response = await orqClient.deployments.invoke({
              key: "summarization_v2",
              context: { environments: [], reasoning: ["medium"] },
              inputs: { text: data.inputs.text as string },
          });
          return { variant: "variant-b", input: data.inputs.text, summary: extractResponseText(response) };
      });
      ```
    </CodeGroup>

    <Tip>Jobs can invoke [Deployments](/docs/deployments/overview), [Agents](/docs/ai-studio/ai-engineering/build-agents), or [Prompts](/docs/ai-studio/prompts/prompts). Third-party frameworks (LangGraph, CrewAI, LlamaIndex, AutoGen) can be integrated to compare against Orq features side-by-side.</Tip>
  </Tab>
</Tabs>

#### Variables and Prompt Templating

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Reference dataset inputs in your prompt using `{{variable_name}}`. Values come from the **Inputs** column and are substituted per row when the experiment runs.

    Select the **Template Engine** from the Prompt Settings panel:

    * **Text** (default): `{{double_braces}}` syntax.
    * **Jinja**: conditionals, loops, filters, and more.
    * **Mustache**: logic-less templating with sections.

    <Frame caption="Select a Template Engine in the Prompt Settings panel.">
      <img src="https://mintcdn.com/orqai/HVm7-3vBg7cwVv2-/images/experiment-engine.png?fit=max&auto=format&n=HVm7-3vBg7cwVv2-&q=85&s=52c933788aa84e0eed529e5f66fbfe54" alt="Engine dropdown in the Prompt panel with Jinja selected and options for Text, Jinja, and Mustache." width="553" height="243" data-path="images/experiment-engine.png" />
    </Frame>

    <Tabs>
      <Tab title="Jinja">
        <Steps>
          <Step title="Prompt template">
            ```jinja theme={"theme":{"light":"github-light","dark":"github-dark"}}
            You are a support assistant for {{company_name}}.

            {% if user_tier == "premium" %}
            {{customer_name}} is a premium customer. Greet them by name with priority support and a 2-hour SLA.
            {% else %}
            {{customer_name}} is on the free plan. Standard response time is 24 hours.
            {% endif %}
            ```
          </Step>

          <Step title="Dataset inputs">
            ```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
            { "company_name": "Acme", "customer_name": "Sarah", "user_tier": "premium" }
            ```
          </Step>

          <Step title="Rendered prompt">
            ```text wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
            You are a support assistant for Acme.

            Sarah is a premium customer. Greet them by name with priority support and a 2-hour SLA.
            ```
          </Step>
        </Steps>
      </Tab>

      <Tab title="Mustache">
        <Steps>
          <Step title="Prompt template">
            ```handlebars theme={"theme":{"light":"github-light","dark":"github-dark"}}
            You are a support assistant for {{company_name}}.

            {{# is_premium}}
            {{customer_name}} is a premium customer. Priority support with a 2-hour SLA.
            {{/ is_premium}}
            {{^ is_premium}}
            {{customer_name}} is on the free plan. Standard response time is 24 hours.
            {{/ is_premium}}
            ```
          </Step>

          <Step title="Dataset inputs">
            ```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
            { "company_name": "Acme", "customer_name": "Sarah", "is_premium": true }
            ```
          </Step>

          <Step title="Rendered prompt">
            ```text wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
            You are a support assistant for Acme.

            Sarah is a premium customer. Priority support with a 2-hour SLA.
            ```
          </Step>
        </Steps>
      </Tab>
    </Tabs>

    <Info>
      For a complete reference of template features, see [Prompt Templating](/docs/ai-studio/prompts/prompt-templating).
    </Info>
  </Tab>
</Tabs>

#### Tool Calls for Agents

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    When using agents, attach **executable tools** that run in real-time during the experiment. These perform actual operations (HTTP requests, Python code, MCP calls).

    1. Open the agent configuration panel.
    2. Select **Add Tool** in the **Tools** section.
    3. Choose from available tools in your project.

    <Info>
      See [Build Agents](/docs/ai-studio/ai-engineering/build-agents) for full tool configuration options.
    </Info>
  </Tab>
</Tabs>

#### Tool Calls for Prompts (Historical Testing)

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Add a **historical Tool Call** chain to a model's execution to test how it handles specific tool payloads or error scenarios.

    <Warning>
      These tool calls are **simulated and do not execute**. They provide historical context to test function calling behaviour. For real executable tools, use [Tool Calls for Agents](#tool-calls-for-agents) above.
    </Warning>

    Use the <kbd><Icon icon="wrench" /></kbd> button to add a tool call to any message. Configure:

    * **Function Name**: which tool was called.
    * **Input**: the payload sent to the tool.
    * **Output**: the response the tool returned.

    <Frame caption="Configuring a tool call input and output.">
      <img src="https://mintcdn.com/orqai/598O1ftLlq3U7tj-/images/add-tool-call-experiment.png?fit=max&auto=format&n=598O1ftLlq3U7tj-&q=85&s=ee6c44fc7842df5a6cbcc89d8906085e" alt="Add Tool Call Experiment" className="mx-auto" style={{width:"79%"}} width="501" height="760" data-path="images/add-tool-call-experiment.png" />
    </Frame>
  </Tab>
</Tabs>

### Configure Evaluators

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    To add an Evaluator, go to the right of the Experiment table and select **Add new Column > Evaluator**.

    The panel shows all Evaluators available in the current [Project](/docs/ai-studio/get-started/projects). Enable the toggle to add an Evaluator as a new column.

    <img src="https://mintcdn.com/orqai/x_6IXnot9ETOc_0g/images/docs/51205d5b61a55af182a21c5c3e85f2e86ad55e31736f56963d6481ba50689285-Screenshot_2025-02-10_at_11.47.17.png?fit=max&auto=format&n=x_6IXnot9ETOc_0g&q=85&s=90a01b49ad854f7054610f5bba07c059" alt="Evaluators selection panel showing available evaluators including Contains Any, Contains None, Context Recall, Cosine Similarity, demo-evaluator, demo-json, Fact Checking Knowledge Base, and Factchecker with toggle controls." width="2486" height="1706" data-path="images/docs/51205d5b61a55af182a21c5c3e85f2e86ad55e31736f56963d6481ba50689285-Screenshot_2025-02-10_at_11.47.17.png" />

    <Info>
      To add Evaluators to your project, see [Evaluators](/docs/ai-studio/optimize/evaluators). Import from the [Hub](/docs/ai-studio/optimize/hub#evaluators) or create a custom [LLM Evaluator](/docs/ai-studio/optimize/evaluators#llm-evaluator).
    </Info>
  </Tab>

  <Tab title="API & SDK" icon="code">
    Define evaluators as async functions that return an `EvaluationResult` with a score (0.0 to 1.0) and an explanation.

    <AccordionGroup>
      <Accordion title="Local evaluator" icon="code">
        <CodeGroup>
          ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
          from evaluatorq import EvaluationResult

          async def word_count_scorer(params):
              word_count = len(params["output"].get("summary", "").split())
              if word_count >= 10:
                  return EvaluationResult(value=1.0, explanation=f"Sufficient ({word_count} words)")
              elif word_count >= 5:
                  return EvaluationResult(value=0.5, explanation=f"Partial ({word_count} words)")
              else:
                  return EvaluationResult(value=0.0, explanation=f"Too short ({word_count} words)")
          ```

          ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
          const wordCountScorer = async (params: any) => {
              const wordCount = (params?.output?.summary || "").split(" ").filter((w: string) => w.length > 0).length;
              return {
                  value: wordCount >= 10 ? 1.0 : wordCount >= 5 ? 0.5 : 0.0,
                  explanation: `Word count: ${wordCount}`,
              };
          };
          ```
        </CodeGroup>
      </Accordion>

      <Accordion title="Orq Evaluator" icon="brain">
        <CodeGroup>
          ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
          import asyncio, os
          from evaluatorq import EvaluationResult

          EVAL_ID = os.environ.get("ORQ_EVALUATOR_ID", "your-evaluator-id")

          async def summarization_quality_scorer(params):
              data, output = params["data"], params["output"]
              source_text = (data.inputs.get("text") or "").strip()
              summary = (output.get("summary") or "").strip()
              if not summary or not source_text:
                  return EvaluationResult(value=0.0, explanation="Missing source or summary")
              evaluation = await asyncio.to_thread(
                  orq_client.evals.invoke,
                  id=EVAL_ID, query=source_text, output=summary,
                  reference=None, messages=[], retrievals=[],
              )
              return EvaluationResult(value=float(evaluation.value.value), explanation=str(evaluation.value.explanation or ""))
          ```

          ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
          const EVAL_ID = process.env.ORQ_EVALUATOR_ID || "your-evaluator-id";

          const summarizationQualityScorer = async (params: any) => {
              const sourceText = (params.data?.inputs?.text || "").trim();
              const summary = (params.output?.summary || "").trim();
              if (!summary || !sourceText) return { value: 0.0, explanation: "Missing source or summary" };
              const evaluation = await orqClient.evals.invoke({
                  id: EVAL_ID, query: sourceText, output: summary,
                  reference: undefined, messages: [], retrievals: [],
              });
              return { value: parseFloat(evaluation.value.value), explanation: evaluation.value.explanation || "" };
          };
          ```
        </CodeGroup>
      </Accordion>

      <Accordion title="Third-party evaluator (DeepEval)" icon="puzzle">
        <CodeGroup>
          ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
          from evaluatorq import EvaluationResult

          async def deepeval_relevancy_scorer(params):
              from deepeval.metrics import AnswerRelevancyMetric
              from deepeval.test_case import LLMTestCase
              source_text = (params["data"].inputs.get("text") or "").strip()
              summary = (params["output"].get("summary") or "").strip()
              if not summary or not source_text:
                  return EvaluationResult(value=0.0, explanation="Missing source or summary")
              metric = AnswerRelevancyMetric(threshold=0.5)
              test_case = LLMTestCase(input=source_text, actual_output=summary)
              result = await asyncio.to_thread(metric.measure, test_case)
              return EvaluationResult(value=float(result.score), explanation=f"DeepEval relevancy: {result.score:.2f}")
          ```
        </CodeGroup>
      </Accordion>
    </AccordionGroup>

    <Tip>See the [evaluatorq Tutorial](/docs/tutorials/evaluator-q) for more evaluator patterns including Ragas and other frameworks.</Tip>
  </Tab>

  <Tab title="MCP" icon="https://mintcdn.com/orqai/i7ZhKI7LFRfXU7ox/images/logos/mcp.svg?fit=max&auto=format&n=i7ZhKI7LFRfXU7ox&q=85&s=cef7916eb5fe1f6bb97541398d3f7639" width="16" height="16" data-path="images/logos/mcp.svg">
    **Create an experiment with an evaluator:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create an experiment from the "qa-dataset" dataset with the "tone-scorer" evaluator attached
    ```

    The assistant uses `search_entities` to find the dataset and evaluator, then `create_experiment` with both the dataset ID and evaluator ID, with `auto_run` enabled.

    ***

    **Create an evaluator first, then run an experiment:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create an LLM-as-a-Judge evaluator that scores responses on tone, then run an experiment on the "customer-feedback" dataset using that evaluator
    ```

    The assistant uses `create_llm_eval` to create the evaluator, then `create_experiment` with the returned evaluator key.
  </Tab>
</Tabs>

#### Human Reviews

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    To add a Human Review column, find the **Human Review** panel and select **Add Human Review**.

    <Frame caption="Human Reviews appear as a new column. Each output can be reviewed individually.">
      <img src="https://mintcdn.com/orqai/3nt6UkYDp2QEiEBs/images/human-review-experiment.png?fit=max&auto=format&n=3nt6UkYDp2QEiEBs&q=85&s=cb26a4593588ca1bea7b3ba0fec797a4" alt="Experiment grid with a Select Feedback dialog open showing Good and Bad options with an explanation field, and Bad selected with the note Could've offered a link to relevant documentation." width="1118" height="668" data-path="images/human-review-experiment.png" />
    </Frame>

    <Info>
      To learn more, see [Human Reviews](/docs/ai-studio/observability/annotation-queues).
    </Info>
  </Tab>
</Tabs>

## Run an Experiment

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Click the **Run** button to start the experiment. Depending on the dataset size, all generations may take a few minutes to complete. The status changes to **Completed** when done.

    <Info>
      To start a new iteration with different prompts or data, use the **New Run** button. A new Experiment Run is created in **Draft** state.
    </Info>
  </Tab>

  <Tab title="API & SDK" icon="code">
    Pass your data, jobs, and evaluators to `evaluatorq()`:

    <CodeGroup>
      ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import asyncio
      from evaluatorq import evaluatorq, DatasetIdInput

      async def main():
          await evaluatorq(
              "compare-summarization-variants",
              data=DatasetIdInput(dataset_id="01ARZ3NDEKTSV4RRFFQ69G5FAV"),
              jobs=[summarize_variant_a, summarize_variant_b],
              evaluators=[
                  {"name": "word-count", "scorer": word_count_scorer},
                  {"name": "quality", "scorer": summarization_quality_scorer},
              ],
          )

      if __name__ == "__main__":
          asyncio.run(main())
      ```

      ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
      import { evaluatorq } from "@orq-ai/evaluatorq";

      await evaluatorq("compare-summarization-variants", {
          data: { datasetId: "01ARZ3NDEKTSV4RRFFQ69G5FAV" },
          jobs: [summarizeVariantA, summarizeVariantB],
          evaluators: [
              { name: "word-count", scorer: wordCountScorer },
              { name: "quality", scorer: summarizationQualityScorer },
          ],
      });
      ```
    </CodeGroup>

    Once complete, `evaluatorq` prints a summary table in the terminal and a URL to the results in the **Orq.ai** AI Studio.

    <Frame caption="Terminal output after an experiment run.">
      <img src="https://mintcdn.com/orqai/UyqHKZasjtJIMOwi/images/terminal-evaluatorq.png?fit=max&auto=format&n=UyqHKZasjtJIMOwi&q=85&s=e5663a845a87ced172a406edb5831c95" alt="Terminal output showing an evaluation completed summary with 3 total data points, quality and word-count evaluator scores, and a link to view results in Orq.ai." width="1385" height="704" data-path="images/terminal-evaluatorq.png" />
    </Frame>

    **Add evaluators from the UI after a code run:**

    Once the experiment completes, attach evaluators and re-run evaluations directly in the AI Studio without touching code. Use the <kbd className="key"><Icon icon="circle-plus" color="#fff" /> Evaluator</kbd> button to attach any evaluator and trigger a new evaluation pass.

    <Frame caption="Use the + Evaluator button to attach evaluators to an evaluatorq experiment run.">
      <img src="https://mintcdn.com/orqai/6kvJGT17Rfyyilmw/images/evaluatorq-evaluators-ui.png?fit=max&auto=format&n=6kvJGT17Rfyyilmw&q=85&s=4d25f778efe3b5dd2762a757ded2cef8" alt="vercel-multi-agent-eval experiment Run view showing Tasks (research-agent, math-agent) and Evaluators (city-relevance, correctness, quality-rubric, tool-usage, length_less_than_uqrv, llm_evaluator_tmmq) in the left panel." width="1724" height="944" data-path="images/evaluatorq-evaluators-ui.png" />
    </Frame>

    <Card title="evaluatorq Tutorial" icon="book-open" href="/docs/tutorials/evaluator-q" arrow="true">
      Advanced patterns: comparing Deployments and Agents, third-party framework integration, multi-job workflows, CI/CD integration.
    </Card>

    <Card title="Red Teaming LLMs with evaluatorq" icon="shield-halved" href="/docs/tutorials/red-teaming" arrow="true">
      Probing LLM deployments and agents for security vulnerabilities using the evaluatorq red teaming CLI.
    </Card>
  </Tab>

  <Tab title="MCP" icon="https://mintcdn.com/orqai/i7ZhKI7LFRfXU7ox/images/logos/mcp.svg?fit=max&auto=format&n=i7ZhKI7LFRfXU7ox&q=85&s=cef7916eb5fe1f6bb97541398d3f7639" width="16" height="16" data-path="images/logos/mcp.svg">
    **Run an experiment with auto-run enabled:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Create an experiment comparing GPT-5.2 and Claude Sonnet 4.6 on the "user-queries" dataset and run it automatically
    ```

    The assistant uses `create_experiment` with `auto_run: true` and returns the experiment ID once both configurations have run.

    ***

    **List recent runs:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Show me the latest experiment runs in my workspace
    ```

    The assistant uses `list_experiment_runs` with cursor pagination to retrieve recent runs.
  </Tab>
</Tabs>

### Evaluation-Only Mode

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    To score existing responses in your dataset without generating new outputs:

    1. Set up the experiment with a dataset that already contains responses in the **Messages** column.
    2. Do not select a prompt during setup.
    3. Add your evaluators.
    4. Run the experiment.
  </Tab>
</Tabs>

### Run a Single Prompt

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    To run one task against the existing dataset without re-running everything, click <kbd><Icon icon="angle-down" /></kbd> next to the task and choose **Run**.

    <Frame>
      <img src="https://mintcdn.com/orqai/7Dru9SOm-qTNSU3m/images/contextual-menu-for-experiment-model-run.png?fit=max&auto=format&n=7Dru9SOm-qTNSU3m&q=85&s=bb909234906320056e81df8d86e23e4d" alt="Context menu on the gpt-5-mini column header showing options: Run, Settings, Duplicate, Hide Column, and Delete." width="390" height="293" data-path="images/contextual-menu-for-experiment-model-run.png" />
    </Frame>
  </Tab>
</Tabs>

### Partial Runs

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Hover on a single cell and click <kbd><Icon icon="arrows-rotate-reverse" /></kbd> to re-run that row only.

    <img src="https://mintcdn.com/orqai/pyKgLAXgUb0ooMkd/images/re-run-prompt.png?fit=max&auto=format&n=pyKgLAXgUb0ooMkd&q=85&s=bbad88effac57b992000d893d1cb5d6a" alt="Re Run Prompt" className="mx-auto" style={{width:"61%"}} width="325" height="200" data-path="images/re-run-prompt.png" />

    Select **Partial Run** from the Run menu to re-run all cells that are in Error or have not been run yet.

    <img src="https://mintcdn.com/orqai/pyKgLAXgUb0ooMkd/images/partial-run-experiment.png?fit=max&auto=format&n=pyKgLAXgUb0ooMkd&q=85&s=cd3602a9b84e106b35e16e0280532b5f" alt="Partial Run" className="mx-auto" style={{width:"63%"}} width="553" height="278" data-path="images/partial-run-experiment.png" />
  </Tab>
</Tabs>

### Add Evaluators After Running

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Add extra Evaluators or Human Reviews to an already-completed run. Use the drop-down on the Evaluator column to run only the newly added evaluations without re-running model generations.

    <Frame caption="Use the drop-down on your Evaluator column to run newly added Evaluations.">
      <img src="https://mintcdn.com/orqai/MIQvMD51vcgugI2x/images/experiment-extra-evaluator.png?fit=max&auto=format&n=MIQvMD51vcgugI2x&q=85&s=f1ea64cc8011821903aff127991c56c9" alt="Experiment Extra Evaluator" className="mx-auto" style={{width:"67%"}} width="708" height="474" data-path="images/experiment-extra-evaluator.png" />
    </Frame>
  </Tab>
</Tabs>

## View Results

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Once the experiment status changes to **Completed**, open the **Review** tab.

    The Review tab has two views:

    * <kbd className="key"><Icon icon="eye" color="#fff" /> Review</kbd>: inspect each model output individually.
    * <kbd className="key"><Icon icon="columns" color="#fff" /> Compare</kbd>: view multiple model outputs side by side.
  </Tab>

  <Tab title="API & SDK" icon="code">
    Results sync to the **Orq.ai** AI Studio automatically when `ORQ_API_KEY` is set. The framework prints the experiment URL at the end of the run.

    <Frame caption="The Orq.ai UI after a code-triggered experiment run.">
      <img src="https://mintcdn.com/orqai/UyqHKZasjtJIMOwi/images/ui-evaluatorq.png?fit=max&auto=format&n=UyqHKZasjtJIMOwi&q=85&s=5303ed03daa833feb0b459014da6f3d0" alt="compare-summarization experiment Run #5 showing Tasks (summarize-variant-a, summarize-variant-b) with input texts and quality evaluator scores for each row." width="1799" height="570" data-path="images/ui-evaluatorq.png" />
    </Frame>

    [LangGraph](/docs/ai-studio/integrations/frameworks/langgraph) and [Vercel AI SDK](/docs/ai-studio/integrations/frameworks/vercel-ai) agent executions are fully visualised in the UI, including individual steps and tool invocations.

    <Frame caption="Vercel AI SDK execution trace with all agent steps and tool invocations visible in Orq.ai.">
      <img src="https://mintcdn.com/orqai/6kvJGT17Rfyyilmw/images/evaluatorq-vercel.png?fit=max&auto=format&n=6kvJGT17Rfyyilmw&q=85&s=4044de000baf76ddb26d19b0d68e75b7" alt="vercel-multi-agent-eval experiment Review showing Job 1 of 8 for research-agent with a knowledgeBase tool call using topic population of Tokyo 2023, and evaluator scores in the Feedback panel." width="1724" height="816" data-path="images/evaluatorq-vercel.png" />
    </Frame>
  </Tab>

  <Tab title="MCP" icon="https://mintcdn.com/orqai/i7ZhKI7LFRfXU7ox/images/logos/mcp.svg?fit=max&auto=format&n=i7ZhKI7LFRfXU7ox&q=85&s=cef7916eb5fe1f6bb97541398d3f7639" width="16" height="16" data-path="images/logos/mcp.svg">
    **Export results:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Export the latest experiment run as CSV
    ```

    The assistant uses `list_experiment_runs` to find the most recent run, then `get_experiment_run` with CSV export format and returns a signed download URL.

    ***

    **Get results for a specific run:**

    ```prompt wrap theme={"theme":{"light":"github-light","dark":"github-dark"}}
    Show me the results for experiment run ID "01ARZ3NDEKTSV4RRFFQ69G5FAV"
    ```

    The assistant uses `get_experiment_run` to retrieve the full run including all evaluation scores.
  </Tab>
</Tabs>

### Column Result Overview

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Each response column shows an aggregated summary at the top: average evaluator score, latency, and cost across all rows.

    <Frame caption="Each response column shows an aggregated result overview: evaluator score, latency, and cost.">
      <img src="https://mintcdn.com/orqai/1LTUSDrjrmE49Lpa/images/experiments-result-overview.png?fit=max&auto=format&n=1LTUSDrjrmE49Lpa&q=85&s=38837b45d077149d2481bc19fb15099b" alt="Experiment results grid showing gpt-4o and basic_translator variant columns with a tooltip over gpt-4o showing Pass Rate 33%, Avg. Latency 2,354ms, Avg. Cost $0.00218, Input Tokens 2,376, and Total Tokens 3,090." width="1097" height="340" data-path="images/experiments-result-overview.png" />
    </Frame>
  </Tab>
</Tabs>

### Compare Mode

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Visualise multiple model executions side by side. Variables and Expected Outputs are shown on the left. Evaluator scores appear at the bottom. Human Reviews can be applied here too, from the controls at the bottom of the screen.

    <Frame caption="View multiple model generations side by side.">
      <img src="https://mintcdn.com/orqai/kym08_pOTNRFhXF_/images/models-comparison-experiment.png?fit=max&auto=format&n=kym08_pOTNRFhXF_&q=85&s=4e8a2c14d4617a45060b4db169ef706a" style={{width:"100%"}} alt="Side-by-side experiment comparison with two model columns, openai-gpt-5 and parent-agent, each showing System, User, and Assistant messages, and the Question and Expected output on the left." width="1625" height="904" data-path="images/models-comparison-experiment.png" />
    </Frame>
  </Tab>
</Tabs>

### Tool Call History

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    When reviewing a model execution, see the step-by-step tool call history including payloads sent and responses received.

    <Frame caption="See the model interpretation and reasoning around each tool call.">
      <img src="https://mintcdn.com/orqai/-GOD-4cxQAoeO49V/images/experiment-tool-history.png?fit=max&auto=format&n=-GOD-4cxQAoeO49V&q=85&s=0713306205e65f9995fe019cfefcdcd1" className="mx-auto" style={{width:"59%"}} alt="See the model interpretation and reasoning around each tool call." width="562" height="1221" data-path="images/experiment-tool-history.png" />
    </Frame>
  </Tab>
</Tabs>

### Multiple Runs

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    Use the **Runs** tab to see all previous runs for an experiment and compare Evaluator results across runs at a glance.

    <Frame caption="See at a glance how results evolved between two experiment runs.">
      <img src="https://mintcdn.com/orqai/x_6IXnot9ETOc_0g/images/docs/1c014acf1a2cbb955d4e565534869748d12433e5badf78eb667036dd2b216dda-image.png?fit=max&auto=format&n=x_6IXnot9ETOc_0g&q=85&s=1f49462a49498c48fdb684cc4b9b336c" alt="Runs tab for a New experiment showing a table with Status, Prompt, Cosine Similarity, JSON Schema Evaluator, Run, Creator, and Added columns, listing two Completed runs using gpt-4.1." width="2306" height="688" data-path="images/docs/1c014acf1a2cbb955d4e565534869748d12433e5badf78eb667036dd2b216dda-image.png" />
    </Frame>
  </Tab>
</Tabs>

### Export Results

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    <Frame caption="Exports are available after an experiment runs successfully.">
      <img src="https://mintcdn.com/orqai/kym08_pOTNRFhXF_/images/experiment-export.png?fit=max&auto=format&n=kym08_pOTNRFhXF_&q=85&s=7af0843af89d7f0866ce573f29ec170a" alt="Experiment context menu showing Edit, Duplicate, Share, Export with CSV, JSON, and JSON Lines options, Move to, and Delete." width="524" height="312" data-path="images/experiment-export.png" />
    </Frame>

    The exported file contains: datasets, model configuration, responses, metrics (including Time to First Token), and Human Reviews.

    <Frame caption="Example CSV export: each column holds data entries and generated responses.">
      <img src="https://mintcdn.com/orqai/E8L3R46ivX7g9-QI/images/docs/df7d94697a5462ba1c1a6aa7e882abadb21456231209220ed5931f1944dc81a1-Screenshot_2025-03-25_at_14.16.29.png?fit=max&auto=format&n=E8L3R46ivX7g9-QI&q=85&s=482ad112bbbb743cd52cbcc9102d8c87" alt="CSV export table showing experiment log rows with timestamp, status, model, template, context, reference, and llm_response columns for gpt-3.5-turbo and meta-llama models answering questions about historical figures." width="3134" height="1150" data-path="images/docs/df7d94697a5462ba1c1a6aa7e882abadb21456231209220ed5931f1944dc81a1-Screenshot_2025-03-25_at_14.16.29.png" />
    </Frame>
  </Tab>
</Tabs>

## Review Results

Once an experiment completes, open the **Review** tab in the experiment top nav. It offers two views.

<Tabs>
  <Tab title="Review Mode">
    Open <Icon icon="eye" /> **Review** to step through every response one at a time.

    <Frame caption="The Review tab shows inputs, metrics, the full conversation, and the Annotations panel with Human Reviews and Evaluator results for each response.">
      <img src="https://mintcdn.com/orqai/MT6hA9WdZfQ3qEfV/images/experiment-review.png?fit=max&auto=format&n=MT6hA9WdZfQ3qEfV&q=85&s=5a49d61a00ee6cb4ce44f57508d5ae53" alt="Experiment Review screen showing Response 1 of 20 for product-orchestrator-A. Left panel: Inputs (prompt, should_trigger, scenario, expectations), Expected output, and Metrics (Latency 79,231 ms, TTFT 1,032 ms, Cost $0.13991, token counts). Center panel: System instructions, User input, and Assistant output with function calls. Right panel: an Annotations comment and good/bad rating above an Evaluators section listing a json_check evaluator marked No." width="3494" height="1740" data-path="images/experiment-review.png" />
    </Frame>

    The screen is divided into three panels:

    * **Left**: Inputs, Expected output, and Metrics (Latency, <Tooltip tip="Time to First Token">TTFT</Tooltip>, Cost, token counts).
    * **Center**: the full conversation for the selected entry: the User message, the Assistant response, and the tool calls made by the agent.
    * **Right**: the **Annotations** panel with [Human Review](/docs/ai-studio/observability/annotation-queues) controls for manual annotation, above the Evaluator scores.

    Use <kbd><Icon icon="chevron-down" /></kbd> / <kbd><Icon icon="chevron-up" /></kbd> or <kbd>J</kbd> / <kbd>K</kbd> to step through responses with the keyboard.
  </Tab>

  <Tab title="Compare Mode">
    Open <Icon icon="columns" /> **Compare** to view multiple model executions side by side. Variables and Expected Outputs are shown on the left, and Evaluator scores appear at the bottom of each column.

    Human Reviews can be applied here too: annotate each output from the controls at the bottom of the screen.

    <Frame caption="Compare two model executions side by side and annotate each output at the bottom of the screen.">
      <img src="https://mintcdn.com/orqai/MT6hA9WdZfQ3qEfV/images/experiment-compare-review.png?fit=max&auto=format&n=MT6hA9WdZfQ3qEfV&q=85&s=93a74911fe1bdb3ead321e82d2ba6b48" style={{width:"100%"}} alt="Experiment Compare review showing Row 1 of 10 with two side-by-side columns, product-orchestrator-A and product-orchestrator-B, each showing Instructions, User input, Assistant responses, and tool calls. A comment popup reads too many tool calls to get to result, expensive execution with a Bad rating at the bottom left, and the json_check and fruitBool evaluators are listed." width="3490" height="1978" data-path="images/experiment-compare-review.png" />
    </Frame>
  </Tab>
</Tabs>

### Duplicate an Experiment

<Tabs>
  <Tab title="AI Studio" icon="https://mintcdn.com/orqai/My16MDKJXrKALEHC/images/logos/ai-studio-round.svg?fit=max&auto=format&n=My16MDKJXrKALEHC&q=85&s=ac04dd509320d58ab9701cb6d6137733" width="100" height="100" data-path="images/logos/ai-studio-round.svg">
    To duplicate an experiment with all its configuration (dataset, prompts, evaluators):

    1. Open the experiment.
    2. Click <kbd><Icon icon="ellipsis" /></kbd> in the top-right corner.
    3. Select **Duplicate**.
    4. Provide a new name and click **Confirm**.
  </Tab>
</Tabs>
