
# Build and compare insurance claims agents with MCP

> Build single-agent and multi-agent insurance claims systems via orq.ai MCP, then compare them with evaluators and a 15-case dataset.

<Info>
  TL;DR

  * **Build two architectures side by side** from a coding assistant: one monolithic claims-assessor and one multi-agent orchestrator with document extraction, FAQ, and claim calculation sub-agents
  * **Score them with two evaluators**, an LLM-as-a-judge for claim accuracy and a Python evaluator for format compliance
  * **Run one experiment over 15 test cases** covering covered / not-covered / edge / incomplete scenarios, and compare accuracy, cost, and latency in a single pass
  * **Pick a winner and ship it** via AI Chat, Python, TypeScript, or curl once the experiment identifies which architecture earns its complexity
</Info>

## What you'll build

A working auto insurance claims system, built twice: once as a single end-to-end agent and once as a multi-agent orchestrator. Both are scored with the same evaluators against the same 15-case dataset to identify the winner. The insurance domain is just the vehicle; the focus is the **architecture comparison loop** for deciding whether a multi-agent system earns its orchestration overhead.

## What you'll learn

This guide covers how to:

* **Create and configure agents and sub-agents** in orq.ai via MCP, directly from a coding assistant
* **Wire sub-agents into an orchestrator** and give it the tools it needs to route messages
* **Build evaluators** (LLM-as-a-judge and Python) that score agent outputs against an expected answer
* **Create a dataset and run one experiment** that scores both architectures with both evaluators in a single run
* **Invoke the winning agent** from AI Chat or programmatically via the Python, TypeScript, or REST API

<Tip>
  **Core takeaway:** don't default to multi-agent. Build both, score both, and let the experiment reveal whether the orchestration complexity pays off. Simpler architectures ship faster and fail less often; reach for the orchestrator when the evaluator shows it earns more accuracy.
</Tip>

## Pre-requisites

* An [orq.ai](https://my.orq.ai) workspace and API key
* A project named `00-insurance-claims` in the orq.ai dashboard (**Projects → New Project**). Every agent, evaluator, and dataset in this cookbook lives under this project.
* A coding assistant with the orq.ai MCP server connected (Claude Code, Cursor, or any MCP-compatible assistant)

<Note>
  Need to set up MCP? See the [MCP integration guide](/docs/ai-studio/integrations/code-assistants/claude-code) first.
</Note>

* **Time:** \~20 minutes setup plus 2-5 minutes experiment execution.
* **Region:** Netherlands auto insurance rules (€ currency, WA / WA+ / allrisk tiers).
* **Cost:** Roughly \$0.50 to \$2.00 per full experiment run.

## Architecture overview

This guide builds two architectures and compares them head to head.

**Architecture A: single agent**

```mermaid theme={"theme":{"light":"github-light","dark":"github-dark"}}
graph LR
    A[Customer Message] --> B[Claims Assessor]
    B --> C[Decision & Payout]

    style B fill:#e1f5ff,stroke:#0288d1
```

**Architecture B: multi-agent system**

```mermaid theme={"theme":{"light":"github-light","dark":"github-dark"}}
graph TD
    A[Customer Message] --> B[Claims Orchestrator]
    B --> C[Document Extractor]
    B --> D[FAQ Assistant]
    B --> E[Claim Calculator]
    C --> B
    D --> B
    E --> B
    B --> F[Decision & Payout]

    style B fill:#e1f5ff,stroke:#0288d1
    style C fill:#fff4e6,stroke:#f57c00
    style D fill:#fff4e6,stroke:#f57c00
    style E fill:#fff4e6,stroke:#f57c00
```

**Testing and measurement**

```mermaid theme={"theme":{"light":"github-light","dark":"github-dark"}}
graph LR
    A[Dataset: 15 test cases] --> B[Experiment]
    B --> C[Single Agent]
    B --> D[Orchestrator]
    C --> E[Evaluators]
    D --> E
    E --> F[Results]

    style A fill:#e1f5ff,stroke:#0288d1
    style E fill:#f3e5f5,stroke:#7b1fa2
    style F fill:#e8f5e9,stroke:#388e3c
```

<Note>
  **When to use a workflow instead**

  This cookbook uses an **orchestrator agent** that dynamically decides which sub-agent to call. That's ideal when the conversation is open-ended: the policyholder might ask questions, provide information in any order, or need follow-ups.

  If the execution order is already known (e.g. a UI collects all claim data upfront in a form), skip the orchestrator entirely and **chain deployments and agents in code**:

  ```
  Form data → Document Extractor → Claim Calculator → Decision template
  ```

  In that pattern, invoke each deployment or agent sequentially via the SDK, passing the output of one as input to the next. This gives deterministic execution order, full control over the flow, and easier error handling at each step.

  **When to choose which:**

  |                     | Orchestrator (this cookbook)                                                    | Workflow in code                                                                           |
  | ------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
  | **Best for**        | Chat interfaces, conversational UX                                              | Forms, APIs, batch processing                                                              |
  | **Execution order** | Dynamic, agent decides                                                          | Fixed, explicitly controlled                                                               |
  | **Flexibility**     | Handles unexpected inputs                                                       | Deterministic, fewer failure modes                                                         |
  | **Reliability**     | Testable, but harder to reach high reliability with complex multi-agent routing | Faster to production-grade reliability. Each step is isolated and independently verifiable |

  See the [Chaining Deployments tutorial](/docs/tutorials/chaining-deployments) for a step-by-step example of the workflow approach.
</Note>
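The fixed-order workflow from the note above can be sketched in plain Python. This is a minimal illustration, not orq.ai SDK code: `run_pipeline` and the three stand-in steps (`extract`, `calculate`, `render_decision`) are hypothetical names, and in a real workflow each step would invoke a deployment or agent instead of a local function.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_pipeline(form_data: dict, steps: list[tuple[str, Step]]) -> dict:
    """Run steps in a fixed order, passing each output to the next step."""
    payload = form_data
    for name, step in steps:
        try:
            payload = step(payload)
        except Exception as exc:
            # Name the failing step so errors are easy to localize.
            raise RuntimeError(f"step '{name}' failed") from exc
    return payload


# Hypothetical stand-ins for the three stages of the workflow.
def extract(form: dict) -> dict:
    return {**form, "missing_fields": [k for k, v in form.items() if v is None]}


def calculate(claim: dict) -> dict:
    if claim["missing_fields"]:
        raise ValueError("cannot calculate with missing fields")
    # Placeholder math only: echoes the repair cost as the payout.
    return {**claim, "payout": claim["repair_cost"]}


def render_decision(result: dict) -> dict:
    return {"message": f"Payout: €{result['payout']}"}


STEPS = [("extract", extract), ("calculate", calculate), ("decision", render_decision)]
print(run_pipeline({"repair_cost": 3200}, STEPS))
```

Because the order is explicit, a failure surfaces at the step that caused it, which is the "easier error handling at each step" trade-off described above.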

<Note>
  **Create folders in the UI first.** MCP can create agents, evaluators, and datasets, but it can't create the folders they live in. To land them in specific folders like `00-insurance-claims/single-agent`, `00-insurance-claims/multi-agent`, `00-insurance-claims/evaluators`, and `00-insurance-claims/datasets`, create those folders in the orq.ai dashboard before running Step 1. Otherwise, drop the `path` from the prompts, and the assistant will create the resources under Default or another folder of your choice.
</Note>

## Step 1: Create the single agent

The single agent handles the entire claims workflow end to end: incident intake, document processing, coverage verification, payout calculation, and decision communication. It runs on GPT-5.2 with conservative sampling settings (temperature 0.2, `top_p: 0`) so the financial calculations stay consistent.

Paste the following prompt into a coding assistant:

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a single agent `claims-assessor` under `00-insurance-claims/single-agent`
running on openai/gpt-5.2 with conservative sampling (temperature 0.2, top_p 0).
It should be an end-to-end Dutch auto insurance claims assistant that handles
intake, document collection, coverage check, payout calculation, and decision
communication, with Netherlands rules (WA/WA+/allrisk tiers, €150 / €300
deductibles, 75% total-loss threshold, 10% depreciation on 5+ year vehicles).
```

**Expected outcome**

The MCP tool returns a confirmation with the agent's unique ID (a long string like `01KJQ8...`). Reference this agent by its `key` (`claims-assessor`) in later steps.

**Key terms used in the instructions:**

* **WA (Wettelijke Aansprakelijkheid):** liability-only, the mandatory minimum coverage in the Netherlands
* **WA+ / Collision:** adds own-vehicle collision damage to liability coverage
* **Allrisk / Comprehensive:** full coverage including theft, fire, storm, and vandalism
* **Total loss:** when repair cost exceeds 75% of the vehicle's market value
* **Temperature / top\_p:** sampling knobs that control randomness. Lower means more predictable, which matters for financial calculations.
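
To make these terms concrete, here is a minimal Python sketch of how the rules named in the prompt could combine into a payout. It is not how the agent works internally (the agent reasons from its instructions). The order of operations, the deductible waiver when the other party is at fault, the mapping of €300 to drivers under 24, and every name in the code are assumptions for illustration.

```python
from dataclasses import dataclass

TOTAL_LOSS_THRESHOLD = 0.75    # repair cost above 75% of market value
DEPRECIATION_RATE = 0.10       # applied to vehicles 5+ years old
STANDARD_DEDUCTIBLE = 150      # euros
YOUNG_DRIVER_DEDUCTIBLE = 300  # euros; assumed to apply under age 24


@dataclass
class Claim:
    policy_type: str  # "WA", "WA+", or "allrisk"
    own_fault: bool
    repair_cost: float
    market_value: float
    vehicle_age_years: int
    driver_age: int


def assess(claim: Claim) -> dict:
    # Assumption: WA (liability only) never pays for damage to the own vehicle.
    if claim.policy_type == "WA":
        return {"decision": "not covered", "payout": 0, "deductible": 0, "total_loss": False}

    total_loss = claim.repair_cost > TOTAL_LOSS_THRESHOLD * claim.market_value
    # Assumption: a total loss pays out market value instead of repair cost.
    base = claim.market_value if total_loss else claim.repair_cost

    # Assumption: depreciation reduces the base once, for vehicles 5+ years old.
    if claim.vehicle_age_years >= 5:
        base *= 1 - DEPRECIATION_RATE

    # Assumption: the deductible is waived when the other party is at fault.
    if claim.own_fault:
        deductible = YOUNG_DRIVER_DEDUCTIBLE if claim.driver_age < 24 else STANDARD_DEDUCTIBLE
    else:
        deductible = 0

    payout = max(0, round(base - deductible))  # whole euros
    return {"decision": "covered", "payout": payout, "deductible": deductible, "total_loss": total_loss}


# Allrisk policy, other party at fault, €3,200 repair on a €24,000 car.
print(assess(Claim("allrisk", False, 3200, 24000, 2, 35)))
```

A reference implementation like this is also a handy way to derive the `expected_output` values for the dataset in Step 5, provided its assumptions match the rules written into the agent instructions.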

## Step 2: Create the sub-agents

The multi-agent architecture splits work across three specialized agents. These three calls are independent and can be fired in parallel if the assistant supports it.

### 2a: Document extractor

Parses policyholder messages and extracts structured claim data.

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a `claims-document-extractor` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5-mini. Its job is to read policyholder messages and pull out
structured claim data (policy number, incident details, repair cost, market
value, policy type, fault, etc.), leaving any missing fields as null and
listing them under `missing_fields`.
```
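
The output contract in this prompt (missing fields left as null and listed under `missing_fields`) can be illustrated with a short Python sketch. The snake_case field names are hypothetical, since the prompt only names the concepts to extract, and `normalize_extraction` is an illustrative helper, not part of orq.ai.

```python
from typing import Any

# Hypothetical field names; the prompt only lists the concepts to extract.
CLAIM_FIELDS = (
    "policy_number",
    "incident_details",
    "repair_cost",
    "market_value",
    "policy_type",
    "fault",
)


def normalize_extraction(raw: dict[str, Any]) -> dict[str, Any]:
    """Fill absent fields with None and list them under `missing_fields`."""
    record = {field: raw.get(field) for field in CLAIM_FIELDS}
    record["missing_fields"] = [f for f in CLAIM_FIELDS if record[f] is None]
    return record


partial = normalize_extraction({"policy_number": "NL-2024-88431", "repair_cost": 3200})
print(partial["missing_fields"])
```

An explicit `missing_fields` list gives the orchestrator a simple signal for when to ask a follow-up question instead of running the calculator.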

### 2b: FAQ assistant

Handles policyholder questions about the claims process and coverage.

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a `claims-faq-assistant` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5-mini. It answers policyholder questions about Dutch auto
insurance (WA / WA+ / allrisk coverage, deductibles, claim process, required
documents, standard exclusions) in plain language and never gives specific
legal advice.
```

### 2c: Claim calculator

The calculation engine. Performs all payout math on structured data.

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a `claims-calculator` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5.2. It takes structured claim data and applies the Dutch rules
(coverage check, fault handling, €150/€300 deductible, 75% total-loss threshold,
10% depreciation on 5+ year vehicles) to return a precise payout in whole euros
with a covered / not covered / partial decision.
```

**Expected outcome**

Each call returns a confirmation with the sub-agent's ID. Save the `key` values; they are needed in Step 3.

**Why different models?** The document extractor and FAQ assistant use GPT-5 Mini because it is cheaper and fast enough for focused tasks. The claim calculator uses GPT-5.2 because financial calculations are where precision earns its keep.

## Step 3: Create the multi-agent system

The orchestrator coordinates the three sub-agents, deciding which one to call based on the policyholder's message.

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a `claims-orchestrator` agent under `00-insurance-claims/multi-agent`
on openai/gpt-5.2 that coordinates the three sub-agents from Step 2. Give it
the `call_sub_agent` and `retrieve_agents` tools and add all three sub-agents
to its team. It should route questions to the FAQ assistant, new info to the
document extractor, and run the calculator once extraction is complete, never
calculating or guessing itself.
```

**Expected outcome**

The orchestrator is created with the three sub-agents wired to it. It will automatically route messages to the appropriate sub-agent based on conversation context. Both complete systems are now ready to test.
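
The routing policy stated in the prompt can be written down as a small decision function. This is only a sketch of the intended behavior: the orchestrator decides through the model and its instructions, not through code like this, and the boolean inputs and the `ask-follow-up` label are assumptions for illustration.

```python
FAQ = "claims-faq-assistant"
EXTRACTOR = "claims-document-extractor"
CALCULATOR = "claims-calculator"
ASK_FOLLOW_UP = "ask-follow-up"  # hypothetical label, not a sub-agent


def next_step(is_question: bool, has_new_info: bool, missing_fields: list[str]) -> str:
    """Mirror the prompt: questions -> FAQ, new info -> extractor, complete -> calculator."""
    if is_question:
        return FAQ
    if has_new_info:
        return EXTRACTOR
    if not missing_fields:
        return CALCULATOR
    # Extraction is incomplete and nothing new arrived: ask instead of guessing.
    return ASK_FOLLOW_UP


print(next_step(is_question=False, has_new_info=False, missing_fields=[]))
```

Writing the policy out this way makes it easier to spot routing cases the orchestrator instructions leave ambiguous, such as a message that both asks a question and supplies new claim details.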

## Step 4: Create the evaluators

Evaluators automatically score agent responses. Create two: one LLM-as-a-judge and one deterministic Python check.

### 4a: LLM evaluator, claim accuracy

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a boolean LLM-as-a-judge evaluator `claim-accuracy` under
`00-insurance-claims/evaluators` on anthropic/claude-sonnet-4-5 that compares
the agent's response (`{{log.output}}`) against the expected output
(`{{log.reference}}`) for the original message (`{{log.messages}}`) and returns
true only if coverage, fault, deductible, total-loss, depreciation, and final
payout are all correct.
```

The template variables are filled in automatically: `{{log.messages}}` with the policyholder's message, `{{log.output}}` with the agent's response, and `{{log.reference}}` with the expected answer from the dataset.

### 4b: Python evaluator, format compliance

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a boolean Python evaluator `format-compliance` under
`00-insurance-claims/evaluators` that does a deterministic check for four
required elements in the agent's response: a clear decision, a euro payout
amount, the deductible (or "waived"), and next-step guidance.
```
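
For orientation, the deterministic check could look roughly like the sketch below. It is a standalone function and does not use the orq.ai Python evaluator entry point, which this guide does not show. The keyword lists are illustrative assumptions, and `check_format` is a hypothetical name.

```python
import re

# Keyword lists are illustrative assumptions, not the evaluator's actual rules.
DECISION = re.compile(r"\b(not covered|covered|approved|denied|partial)\b", re.IGNORECASE)
EURO_AMOUNT = re.compile(r"€\s?\d[\d.,]*")
DEDUCTIBLE = re.compile(r"\b(deductible|waived)\b", re.IGNORECASE)
NEXT_STEPS = re.compile(r"\b(next steps?|please (send|provide|submit)|we will)\b", re.IGNORECASE)


def check_format(response: str) -> bool:
    """Return True only if all four required elements appear in the response."""
    checks = (DECISION, EURO_AMOUNT, DEDUCTIBLE, NEXT_STEPS)
    return all(pattern.search(response) for pattern in checks)


sample = (
    "Decision: covered. Payout: €3,200. Deductible: waived. "
    "Next steps: we will transfer the amount within 5 working days."
)
print(check_format(sample))
```

A keyword check like this is cheap and repeatable, but it only verifies that the elements are present. Whether the amounts are correct is the job of the `claim-accuracy` evaluator.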

**Expected outcome**

Each evaluator returns a confirmation with its unique ID. Save these IDs; they are needed in Step 6.

## Step 5: Create the dataset and add test cases

The test dataset covers 15 claims across four scenarios:

| Category    | Count | Tests                                          |
| ----------- | ----- | ---------------------------------------------- |
| Covered     | 5     | Agent correctly approves and calculates payout |
| Not covered | 5     | Agent identifies correct denial reason         |
| Edge cases  | 3     | Handles borderline or unusual scenarios        |
| Incomplete  | 2     | Asks follow-up questions instead of guessing   |

### 5a: Create the dataset

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create a dataset called "Insurance Claims Test Cases" under
`00-insurance-claims/datasets`.
```

**Expected outcome:** returns the dataset's unique ID. Copy this ID; it is needed in the next step and in Step 6.

### 5b: Add test cases

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Add 15 datapoints to that dataset: 5 covered, 5 not covered, 3 edge cases,
2 incomplete. Each row needs a `user_input` column (the policyholder's
message, not `input` since that's reserved) and an `expected_output` column
with the correct assessment. Make sure the cases exercise every rule:
WA / WA+ / allrisk, own-fault vs other-party, standard vs under-24
deductible, total loss, and depreciation.
```
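
Before running the experiment, it can help to sanity-check the rows the assistant generated. The sketch below assumes the rows are available as a list of dictionaries with the `user_input` and `expected_output` columns from the prompt. The `category` field and `validate_rows` helper are hypothetical bookkeeping additions, not part of the dataset schema.

```python
from collections import Counter

# Category mix from the table in Step 5; the label spellings are assumptions.
EXPECTED_MIX = {"covered": 5, "not_covered": 5, "edge": 3, "incomplete": 2}


def validate_rows(rows: list[dict]) -> None:
    """Raise ValueError if rows are malformed or the category mix is off."""
    for i, row in enumerate(rows):
        for column in ("user_input", "expected_output"):
            value = row.get(column)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"row {i}: missing {column}")
        if "input" in row:
            raise ValueError(f"row {i}: `input` is reserved, use `user_input`")
    mix = dict(Counter(row.get("category") for row in rows))
    if mix != EXPECTED_MIX:
        raise ValueError(f"category mix {mix} != {EXPECTED_MIX}")


rows = [
    {"user_input": f"case {n}", "expected_output": "assessment", "category": category}
    for category, count in EXPECTED_MIX.items()
    for n in range(count)
]
validate_rows(rows)  # passes silently for a well-formed 15-row set
```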

## Step 6: Create and run the experiment

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Create an experiment that runs both `claims-assessor` and `claims-orchestrator`
against the dataset from Step 5, scored by both evaluators from Step 4, and
auto-run it.
```

The experiment runs each of the 15 test cases through both agents and scores the outputs with both evaluators, producing a side-by-side comparison in a single pass.

**Expected outcome:** returns the experiment ID and a run ID. Typical runtime is 2-5 minutes.

<Tip>
  Want to compare models directly instead of agents? Create experiments with `task.type: "prompt"` and a `models` array to test different models against the same instructions.
</Tip>

## Step 7: Get the experiment results

```text Claude Code theme={"theme":{"light":"github-light","dark":"github-dark"}}
Fetch the results for that experiment run.
```

**Expected outcome:** a download URL (valid for 1 hour) pointing to a JSON/JSONL file with each agent's response, evaluator scores, cost, and latency for every test case.

**Compare the architectures on:**

* Accuracy rate (% of correct claim decisions)
* Format compliance (% with all required elements)
* Average cost per call
* Average response time
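
The four metrics above can be computed from the downloaded file with a few lines of Python. The sketch assumes a JSONL export with one record per agent and test case. The field names (`agent`, `claim_accuracy`, `format_compliance`, `cost`, `latency_ms`) are hypothetical, so adjust them to the actual export schema.

```python
import json
from collections import defaultdict


def summarize(jsonl_text: str) -> dict[str, dict[str, float]]:
    """Aggregate per-agent accuracy, format compliance, cost, and latency."""
    by_agent = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            by_agent[record["agent"]].append(record)

    summary = {}
    for agent, records in by_agent.items():
        n = len(records)
        summary[agent] = {
            "accuracy_pct": 100 * sum(bool(r["claim_accuracy"]) for r in records) / n,
            "format_pct": 100 * sum(bool(r["format_compliance"]) for r in records) / n,
            "avg_cost": sum(r["cost"] for r in records) / n,
            "avg_latency_ms": sum(r["latency_ms"] for r in records) / n,
        }
    return summary


# Made-up records that only demonstrate the shape, not real results.
demo = "\n".join(
    json.dumps(r)
    for r in [
        {"agent": "a", "claim_accuracy": True, "format_compliance": True, "cost": 0.01, "latency_ms": 1000},
        {"agent": "a", "claim_accuracy": False, "format_compliance": True, "cost": 0.03, "latency_ms": 3000},
    ]
)
print(summarize(demo))
```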

## Step 8: Invoke the winner in production

After identifying the best-performing agent, test it conversationally in [AI Chat](/docs/ai-studio/ai-chat/using-the-ai-chat) or integrate it programmatically via the Python SDK, TypeScript SDK, or REST API.

<CodeGroup>
  ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import os

  from orq_ai_sdk import Orq

  client = Orq(api_key=os.environ.get("ORQ_API_KEY", ""))

  response = client.responses.create(
      model="agent/claims-assessor",
      input="Policy NL-2024-88431. Jan de Vries, allrisk policy. Rear-ended at traffic light on A2, 12 March. Other driver admitted fault. Repair estimate €3,200. Car is 2022 VW Golf, market value €24,000. Police report filed. Age 35.",
  )

  print(response.output[0]["content"][0]["text"] if response.output else None)
  ```

  ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import { Orq } from "@orq-ai/node";

  const client = new Orq({ apiKey: process.env.ORQ_API_KEY ?? "" });

  const response = await client.responses.create({
    model: "agent/claims-assessor",
    input: "Policy NL-2024-88431. Jan de Vries, allrisk policy. Rear-ended at traffic light on A2, 12 March. Other driver admitted fault. Repair estimate €3,200. Car is 2022 VW Golf, market value €24,000. Police report filed. Age 35.",
  });

  console.log(response.output?.[0]?.content?.[0]?.text);
  ```

  ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl -X POST https://api.orq.ai/v3/router/responses \
    -H "Authorization: Bearer $ORQ_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "agent/claims-assessor",
      "input": "Policy NL-2024-88431. Jan de Vries, allrisk policy. Rear-ended at traffic light on A2, 12 March. Other driver admitted fault. Repair estimate €3,200. Car is 2022 VW Golf, market value €24,000. Police report filed. Age 35."
    }'
  ```
</CodeGroup>

Find the API key in **orq.ai dashboard → Settings → API Keys**.

## The architecture comparison loop

This is a repeatable pattern for any "should this be multi-agent?" decision: **build simple first → build the orchestrated version → score both with the same evaluators on the same dataset → compare accuracy, cost, and latency in one experiment → ship the winner**. The orchestration overhead of a multi-agent system is real (more prompts to tune, more places to fail, more latency per turn), so pay it only when the evaluator shows a meaningful accuracy gain in return.

## Next steps

* [Multi-agent HR system](/docs/tutorials/agents-API): a deeper dive into building multi-agent orchestrators with memory and knowledge bases
* [Chaining deployments](/docs/tutorials/chaining-deployments): the deterministic alternative when the execution order is already known
* [Automate evals with Claude Code](/docs/tutorials/automate-evals-and-observability-with-claude-code): close the loop by optimizing evaluators the same way agents were optimized
* [Red teaming](/docs/tutorials/red-teaming): probe the claims agent for safety and policy-bypass failures before shipping