TL;DR
  • Build two architectures side by side from your coding assistant: one monolithic claims-assessor and one multi-agent orchestrator with document extraction, FAQ, and claim calculation sub-agents
  • Score them with two evaluators, an LLM-as-a-judge for claim accuracy and a Python evaluator for format compliance
  • Run one experiment over 15 test cases covering covered / not-covered / edge / incomplete scenarios, and compare accuracy, cost, and latency in a single pass
  • Pick a winner and ship it via AI Chat, Python, TypeScript, or curl once the experiment tells you which architecture earns its complexity

What you’ll build

A working auto insurance claims system, built twice: once as a single end-to-end agent and once as a multi-agent orchestrator. You’ll score both with the same evaluators against the same 15-case dataset, then ship the winner. The insurance domain is just the vehicle; the focus is the architecture comparison loop for deciding whether a multi-agent system earns its orchestration overhead.

What you’ll learn

In this guide we walk through how to:
  • Create and configure agents and sub-agents in orq.ai via MCP, directly from your coding assistant
  • Wire sub-agents into an orchestrator and give it the tools it needs to route messages
  • Build evaluators (LLM-as-a-judge and Python) that score agent outputs against an expected answer
  • Create a dataset and run one experiment that scores both architectures with both evaluators in a single run
  • Invoke the winning agent from AI Chat or programmatically via the Python, TypeScript, or REST API
Core takeaway: don’t default to multi-agent. Build both, score both, and let the experiment tell you whether the orchestration complexity actually earns you anything. Simpler architectures ship faster and fail less often; reach for the orchestrator only when the evaluator says you need it.

Prerequisites

  • An orq.ai workspace and API key
  • A project named 00-insurance-claims in the orq.ai dashboard (Projects → New Project). Every agent, evaluator, and dataset in this cookbook lives under this project.
  • A coding assistant with the orq.ai MCP server connected (Claude Code, Cursor, or any MCP-compatible assistant)
Need to set up MCP? See the MCP integration guide first.
Time: ~20 minutes setup plus 2-5 minutes experiment execution. Region: Netherlands auto insurance rules (€ currency, WA / WA+ / allrisk tiers). Cost: roughly $0.50 to $2.00 per full experiment run.

Architecture overview

You’ll build two architectures and compare them head to head:
  • Architecture A: a single end-to-end agent (claims-assessor)
  • Architecture B: a multi-agent system (an orchestrator coordinating three sub-agents)
Both go through the same testing and measurement loop: one dataset, two evaluators, one experiment.
When to use a workflow instead

This cookbook uses an orchestrator agent that dynamically decides which sub-agent to call. That’s ideal when the conversation is open-ended: the policyholder might ask questions, provide info in any order, or need follow-ups.

If you already know the execution order (e.g. your UI collects all claim data upfront in a form), you can skip the orchestrator entirely and chain deployments and agents in code:

Form data → Document Extractor → Claim Calculator → Decision template

In that pattern you invoke each deployment or agent sequentially via the SDK, passing the output of one as input to the next. You get deterministic execution order, full control over the flow, and easier error handling at each step.

When to choose which:

| | Orchestrator (this cookbook) | Workflow in code |
| --- | --- | --- |
| Best for | Chat interfaces, conversational UX | Forms, APIs, batch processing |
| Execution order | Dynamic, agent decides | Fixed, you control |
| Flexibility | Handles unexpected inputs | Deterministic, fewer failure modes |
| Reliability | Testable, but harder to reach high reliability with complex multi-agent routing | Faster to production-grade reliability; each step is isolated and independently verifiable |
See the Chaining Deployments tutorial for a step-by-step example of the workflow approach.
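The workflow pattern is just sequential invocation, which you can sketch generically. The step functions below are stand-ins; in a real chain each would wrap an SDK call like client.agents.invoke:

```python
def run_chain(steps, payload):
    # Pass the payload through each step in order; each step's
    # return value becomes the next step's input.
    for step in steps:
        payload = step(payload)
    return payload

# Stand-in steps for illustration. In practice each would call the SDK,
# e.g. lambda data: client.agents.invoke(key="claims-calculator", messages=[...])
extract = lambda form: {**form, "repair_cost": 3200, "policy": "allrisk"}
calculate = lambda claim: {**claim, "decision": "covered", "payout": 3200}

result = run_chain([extract, calculate], {"policy_number": "NL-2024-88431"})
```

Because the order is fixed in code, you can wrap each step in its own error handling and retry logic, which is what makes this pattern easier to harden than dynamic routing.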
Create folders in the UI first. MCP can create agents, evaluators, and datasets, but it can’t create the folders they live in. If you want these to land in specific folders like 00-insurance-claims/single-agent, 00-insurance-claims/multi-agent, 00-insurance-claims/evaluators, and 00-insurance-claims/datasets, create those folders in the orq.ai dashboard before running Step 1. Otherwise, drop the path from the prompts and the assistant will create them under Default or whichever folder you prefer.

Step 1: Create the single agent

The single agent handles the entire claims workflow end to end: incident intake, document processing, coverage verification, payout calculation, and decision communication. It runs on GPT-5.2 with conservative sampling settings (low temperature, top_p: 0) so the financial calculations stay consistent. Ask your coding assistant:
Claude Code
Create a single agent `claims-assessor` under `00-insurance-claims/single-agent`
running on openai/gpt-5.2 with conservative sampling (temperature 0.2, top_p 0).
It should be an end-to-end Dutch auto insurance claims assistant that handles
intake, document collection, coverage check, payout calculation, and decision
communication, with Netherlands rules (WA/WA+/allrisk tiers, €150 / €300
deductibles, 75% total-loss threshold, 10% depreciation on 5+ year vehicles).
Expected outcome: The MCP tool returns a confirmation with the agent’s unique ID (a long string like 01KJQ8...). You’ll reference this agent by its key (claims-assessor) in later steps.

Key terms used in the instructions:
  • WA (Wettelijke Aansprakelijkheid): liability-only, the mandatory minimum coverage in the Netherlands
  • WA+ / Collision: adds own-vehicle collision damage to liability coverage
  • Allrisk / Comprehensive: full coverage including theft, fire, storm, and vandalism
  • Total loss: when repair cost exceeds 75% of the vehicle’s market value
  • Temperature / top_p: sampling knobs that control randomness. Lower means more predictable, which matters for financial calculations.
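To make the math concrete, the Dutch rules above can be sketched as a single function. This is an illustrative reading of the rules, not orq.ai code; in particular, applying depreciation to the payout base and waiving the deductible when the other party is at fault are assumptions:

```python
def calculate_payout(policy, repair_cost, market_value, vehicle_age, own_fault, driver_age):
    """Sketch of the Dutch payout rules for an own-vehicle damage claim."""
    if policy == "WA":
        # WA is liability-only: own-vehicle damage is never covered.
        return {"decision": "not covered", "payout": 0, "deductible": 0}
    # Total loss when repair cost exceeds 75% of market value.
    total_loss = repair_cost > 0.75 * market_value
    base = market_value if total_loss else repair_cost
    if vehicle_age >= 5:
        base *= 0.90  # assumed: 10% depreciation applied to the payout base
    # Assumed: deductible waived when the other party is at fault,
    # €300 for drivers under 24, €150 otherwise.
    deductible = 0 if not own_fault else (300 if driver_age < 24 else 150)
    payout = max(0, round(base - deductible))
    decision = "covered (total loss)" if total_loss else "covered"
    return {"decision": decision, "payout": payout, "deductible": deductible}
```

For example, an allrisk claim with a €3,200 repair on a €24,000 two-year-old car where the other driver is at fault pays out the full €3,200, while a WA+ own-fault claim by a 22-year-old loses the €300 deductible.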

Step 2: Create the sub-agents

The multi-agent architecture splits work across three specialized agents. These three calls are independent, so you can fire them in parallel if your assistant supports it.

2a: Document extractor

Parses policyholder messages and extracts structured claim data.
Claude Code
Create a `claims-document-extractor` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5-mini. Its job is to read policyholder messages and pull out
structured claim data (policy number, incident details, repair cost, market
value, policy type, fault, etc.), leaving any missing fields as null and
listing them under `missing_fields`.

2b: FAQ assistant

Handles policyholder questions about the claims process and coverage.
Claude Code
Create a `claims-faq-assistant` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5-mini. It answers policyholder questions about Dutch auto
insurance (WA / WA+ / allrisk coverage, deductibles, claim process, required
documents, standard exclusions) in plain language and never gives specific
legal advice.

2c: Claim calculator

The calculation engine. Performs all payout math on structured data.
Claude Code
Create a `claims-calculator` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5.2. It takes structured claim data and applies the Dutch rules
(coverage check, fault handling, €150/€300 deductible, 75% total-loss threshold,
10% depreciation on 5+ year vehicles) to return a precise payout in whole euros
with a covered / not covered / partial decision.
Expected outcome: Each call returns a confirmation with the sub-agent’s ID. Save the key values; you’ll need them in Step 3.

Why different models? The document extractor and FAQ assistant use GPT-5 Mini because they’re cheaper and fast enough for focused tasks. The claim calculator uses GPT-5.2 because financial calculation is where precision earns its keep.

Step 3: Create the multi-agent system

The orchestrator coordinates the three sub-agents, deciding which one to call based on the policyholder’s message.
Claude Code
Create a `claims-orchestrator` agent under `00-insurance-claims/multi-agent`
on openai/gpt-5.2 that coordinates the three sub-agents from Step 2. Give it
the `call_sub_agent` and `retrieve_agents` tools and add all three sub-agents
to its team. It should route questions to the FAQ assistant, new info to the
document extractor, and run the calculator once extraction is complete, never
calculating or guessing itself.
Expected outcome: The orchestrator is created with the three sub-agents wired to it. It will automatically route messages to the appropriate sub-agent based on conversation context. You now have two complete systems ready to test.

Step 4: Create the evaluators

Evaluators automatically score agent responses. You’ll create two: one LLM-as-a-judge and one deterministic Python check.

4a: LLM evaluator, claim accuracy

Claude Code
Create a boolean LLM-as-a-judge evaluator `claim-accuracy` under
`00-insurance-claims/evaluators` on anthropic/claude-sonnet-4-5 that compares
the agent's response (`{{log.output}}`) against the expected output
(`{{log.reference}}`) for the original message (`{{log.messages}}`) and returns
true only if coverage, fault, deductible, total-loss, depreciation, and final
payout are all correct.
Template variables are filled in automatically with the policyholder’s message, the agent’s response, and the expected answer from the dataset.

4b: Python evaluator, format compliance

Claude Code
Create a boolean Python evaluator `format-compliance` under
`00-insurance-claims/evaluators` that does a deterministic check for four
required elements in the agent's response: a clear decision, a euro payout
amount, the deductible (or "waived"), and next-step guidance.
Expected outcome: Each evaluator returns a confirmation with its unique ID. Save these IDs; you’ll need them in Step 6.
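The deterministic check the prompt describes might look like this in plain Python. The regex patterns and the function name are illustrative assumptions; orq.ai’s Python evaluator harness may expect a specific entry-point signature:

```python
import re

# One pattern per required element of a compliant claims response.
REQUIRED = {
    "decision": r"\b(covered|not covered|approved|denied|partial)\b",
    "amount": r"€\s?\d[\d.,]*",
    "deductible": r"\b(deductible|waived)\b",
    "next_steps": r"\b(next step|please (send|provide|upload)|we will)\b",
}

def format_compliance(output: str) -> bool:
    # True only when all four required elements appear in the response.
    return all(re.search(p, output, re.IGNORECASE) for p in REQUIRED.values())

sample = ("Decision: covered. Payout: €3,050 after the €150 deductible. "
          "Next step: we will transfer the amount within 5 working days.")
```

A deterministic evaluator like this is cheap to run on every datapoint and never disagrees with itself, which is why it complements rather than duplicates the LLM judge.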

Step 5: Create the dataset and add test cases

The test dataset covers 15 claims across four scenarios:
| Category | Count | Tests |
| --- | --- | --- |
| Covered | 5 | Agent correctly approves and calculates payout |
| Not covered | 5 | Agent identifies correct denial reason |
| Edge cases | 3 | Handles borderline or unusual scenarios |
| Incomplete | 2 | Asks follow-up questions instead of guessing |

5a: Create the dataset

Claude Code
Create a dataset called "Insurance Claims Test Cases" under
`00-insurance-claims/datasets`.
Expected outcome: returns the dataset’s unique ID. Copy this ID; you’ll need it in the next step and in Step 6.

5b: Add test cases

Claude Code
Add 15 datapoints to that dataset: 5 covered, 5 not covered, 3 edge cases,
2 incomplete. Each row needs a `user_input` column (the policyholder's
message, not `input` since that's reserved) and an `expected_output` column
with the correct assessment. Make sure the cases exercise every rule:
WA / WA+ / allrisk, own-fault vs other-party, standard vs under-24
deductible, total loss, and depreciation.
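Two illustrative rows, shown as Python dicts, give a feel for the column shape the prompt asks for. The scenario wording and expected outputs here are examples, not the full dataset:

```python
rows = [
    {   # Covered scenario: other party at fault, allrisk policy
        "user_input": "Policy NL-2024-88431, allrisk. Rear-ended on the A2, "
                      "other driver at fault. Repair estimate €3,200, car worth €24,000.",
        "expected_output": "Covered. Other party at fault, deductible waived. Payout €3,200.",
    },
    {   # Incomplete scenario: the agent should ask follow-up questions
        "user_input": "My car got damaged yesterday, what do I do?",
        "expected_output": "Ask for policy number, policy type, incident details, "
                           "and a repair estimate before assessing.",
    },
]
```

Note that every row uses the user_input column (not input, which is reserved) and pairs it with an expected_output the evaluators can score against.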

Step 6: Create and run the experiment

Claude Code
Create an experiment that runs both `claims-assessor` and `claims-orchestrator`
against the dataset from Step 5, scored by both evaluators from Step 4, and
auto-run it.
The experiment runs each of the 15 test cases through both agents and scores the outputs with both evaluators, giving you a side-by-side comparison in a single pass. Expected outcome: returns the experiment ID and a run ID. Typical runtime is 2-5 minutes.
Want to compare models directly instead of agents? Create experiments with task.type: "prompt" and a models array to test different models against the same instructions.

Step 7: Get the experiment results

Claude Code
Fetch the results for that experiment run.
Expected outcome: a download URL (valid for 1 hour) pointing to a JSON/JSONL file with each agent’s response, evaluator scores, cost, and latency for every test case. Compare the architectures on:
  • Accuracy rate (% of correct claim decisions)
  • Format compliance (% with all required elements)
  • Average cost per call
  • Average response time
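Once you’ve downloaded the JSONL file, a short script can roll these metrics up per agent. The field names (agent_key, claim_accuracy, cost, latency_ms) are assumptions about the export schema; adjust them to match the file you actually receive:

```python
import json
from statistics import mean

def summarize(jsonl_lines):
    """Group result rows by agent and average the comparison metrics."""
    by_agent = {}
    for line in jsonl_lines:
        row = json.loads(line)
        by_agent.setdefault(row["agent_key"], []).append(row)
    return {
        agent: {
            "accuracy": mean(r["claim_accuracy"] for r in rows),
            "avg_cost": round(mean(r["cost"] for r in rows), 4),
            "avg_latency_ms": mean(r["latency_ms"] for r in rows),
        }
        for agent, rows in by_agent.items()
    }

# Hypothetical result rows, for illustration only.
sample = [
    '{"agent_key": "claims-assessor", "claim_accuracy": 1, "cost": 0.012, "latency_ms": 2400}',
    '{"agent_key": "claims-assessor", "claim_accuracy": 0, "cost": 0.010, "latency_ms": 2100}',
    '{"agent_key": "claims-orchestrator", "claim_accuracy": 1, "cost": 0.031, "latency_ms": 5200}',
]
summary = summarize(sample)
```

Comparing the two summary rows side by side is usually enough to see whether the orchestrator’s extra cost and latency bought any accuracy.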

Step 8: Invoke the winner in production

Once you’ve picked the best-performing agent, you can test it conversationally in AI Chat or integrate it programmatically via the Python SDK, TypeScript SDK, or REST API.
from orq_ai_sdk import OrqAI

client = OrqAI(api_key="your-api-key")

response = client.agents.invoke(
    key="claims-assessor",
    messages=[{
        "role": "user",
        "content": "Policy NL-2024-88431. Jan de Vries, allrisk policy. Rear-ended at traffic light on A2, 12 March. Other driver admitted fault. Repair estimate €3,200. Car is 2022 VW Golf, market value €24,000. Police report filed. Age 35."
    }]
)

print(response.choices[0].message.content)
Find your API key in orq.ai dashboard → Settings → API Keys.

The architecture comparison loop

You now have a repeatable pattern for any “should this be multi-agent?” decision: build simple first → build the orchestrated version → score both with the same evaluators on the same dataset → compare accuracy, cost, and latency in one experiment → ship the winner. The orchestration overhead of a multi-agent system is real (more prompts to tune, more places to fail, more latency per turn), so you should only pay it when the evaluator says you’re getting more accuracy in return.

Troubleshooting

Agent not found when invoking
  • Verify the key matches exactly (case-sensitive)
  • Check that you’re using the agent key (e.g. claims-assessor), not the ID
  • Confirm the agent exists in your workspace via the orq.ai dashboard

Experiment stuck or slow
  • Check experiment status in the orq.ai dashboard (Experiments section)
  • Experiments typically take 2-5 minutes for 15 datapoints
  • If stuck past 10 minutes, create a new experiment run
  • Check your workspace API rate limits in Settings

Evaluator scores look wrong
  • Verify the dataset expected_output column is populated for every row
  • Check that the evaluator IDs in the experiment config are correct
  • Review evaluator prompts or code for syntax errors
  • Test evaluators individually in the orq.ai dashboard first

Orchestrator not routing to sub-agents
  • Confirm team_of_agents keys match the sub-agent keys exactly
  • Verify settings.tools includes call_sub_agent and retrieve_agents
  • Check that orchestrator instructions clearly specify when to use each sub-agent
  • Review agent traces in the orq.ai dashboard to see the execution flow

Cost or latency higher than expected
  • Consider cheaper models (e.g. GPT-5 Mini) for non-critical sub-agents
  • Reduce max_tokens if responses are longer than needed
  • Check whether agents are making unnecessary tool calls (review traces)
  • Use streaming for better perceived latency

MCP tools not available
  • Verify the MCP server is running and connected
  • Check your coding assistant’s MCP configuration
  • Restart your coding assistant to reload MCP connections
  • See the MCP setup guide for configuration details

Next steps