TL;DR
  • Build two architectures side by side from your coding assistant: one monolithic claims-assessor and one multi-agent orchestrator with document extraction, FAQ, and claim calculation sub-agents
  • Score them with two evaluators, an LLM-as-a-judge for claim accuracy and a Python evaluator for format compliance
  • Run one experiment over 15 test cases covering covered / not-covered / edge / incomplete scenarios, and compare accuracy, cost, and latency in a single pass
  • Pick a winner and ship it via AI Chat, Python, TypeScript, or curl once the experiment tells you which architecture earns its complexity

What you’ll build

A working auto insurance claims system, built twice: once as a single end-to-end agent and once as a multi-agent orchestrator. You’ll score both with the same evaluators against the same 15-case dataset, then ship the winner. The insurance domain is just the vehicle; the focus is the architecture comparison loop for deciding whether a multi-agent system earns its orchestration overhead.

What you’ll learn

In this guide we walk through how to:
  • Create and configure agents and sub-agents in orq.ai via MCP, directly from your coding assistant
  • Wire sub-agents into an orchestrator and give it the tools it needs to route messages
  • Build evaluators (LLM-as-a-judge and Python) that score agent outputs against an expected answer
  • Create a dataset and run one experiment that scores both architectures with both evaluators in a single run
  • Invoke the winning agent from AI Chat or programmatically via the Python, TypeScript, or REST API
Core takeaway: don’t default to multi-agent. Build both, score both, and let the experiment tell you whether the orchestration complexity actually earns you anything. Simpler architectures ship faster and fail less often; reach for the orchestrator only when the evaluator says you need it.

Prerequisites

  • An orq.ai workspace and API key
  • A project named 00-insurance-claims in the orq.ai dashboard (Projects → New Project). Every agent, evaluator, and dataset in this cookbook lives under this project.
  • A coding assistant with the orq.ai MCP server connected (Claude Code, Cursor, or any MCP-compatible assistant)
Need to set up MCP? See the MCP integration guide first.
Time: ~20 minutes setup plus 2-5 minutes experiment execution. Region: Netherlands auto insurance rules (€ currency, WA / WA+ / allrisk tiers). Cost: roughly $0.50 to $2.00 per full experiment run.

Architecture overview

You’ll build two architectures and compare them head to head:
  • Architecture A: a single end-to-end agent (claims-assessor)
  • Architecture B: a multi-agent system (an orchestrator coordinating three sub-agents)
Both go through the same testing and measurement loop: one dataset, two evaluators, one experiment.
When to use a workflow instead

This cookbook uses an orchestrator agent that dynamically decides which sub-agent to call. That’s ideal when the conversation is open-ended: the policyholder might ask questions, provide info in any order, or need follow-ups.

If you already know the execution order (e.g. your UI collects all claim data upfront in a form), you can skip the orchestrator entirely and chain deployments and agents in code:

Form data → Document Extractor → Claim Calculator → Decision template

In that pattern you invoke each deployment or agent sequentially via the SDK, passing the output of one as input to the next. You get deterministic execution order, full control over the flow, and easier error handling at each step.

When to choose which:

| | Orchestrator (this cookbook) | Workflow in code |
| --- | --- | --- |
| Best for | Chat interfaces, conversational UX | Forms, APIs, batch processing |
| Execution order | Dynamic, agent decides | Fixed, you control |
| Flexibility | Handles unexpected inputs | Deterministic, fewer failure modes |
| Reliability | Testable, but harder to reach high reliability with complex multi-agent routing | Faster to production-grade reliability; each step is isolated and independently verifiable |
See the Chaining Deployments tutorial for a step-by-step example of the workflow approach.
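The workflow pattern is just sequential invocation, which you can sketch generically. The step functions below are stand-ins; in a real chain each would wrap an SDK call like client.agents.invoke:

```python
def run_chain(steps, payload):
    # Pass the payload through each step in order; each step's
    # return value becomes the next step's input.
    for step in steps:
        payload = step(payload)
    return payload

# Stand-in steps for illustration. In practice each would call the SDK,
# e.g. lambda data: client.agents.invoke(key="claims-calculator", messages=[...])
extract = lambda form: {**form, "repair_cost": 3200, "policy": "allrisk"}
calculate = lambda claim: {**claim, "decision": "covered", "payout": 3200}

result = run_chain([extract, calculate], {"policy_number": "NL-2024-88431"})
```

Because the order is fixed in code, you can wrap each step in its own error handling and retry logic, which is what makes this pattern easier to harden than dynamic routing.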
Create folders in the UI first. MCP can create agents, evaluators, and datasets, but it can’t create the folders they live in. If you want these to land in specific folders like 00-insurance-claims/single-agent, 00-insurance-claims/multi-agent, 00-insurance-claims/evaluators, and 00-insurance-claims/datasets, create those folders in the orq.ai dashboard before running Step 1. Otherwise, drop the path from the prompts and the assistant will create them under Default or whichever folder you prefer.

Step 1: Create the single agent

The single agent handles the entire claims workflow end to end: incident intake, document processing, coverage verification, payout calculation, and decision communication. It runs on GPT-5.2 with conservative sampling settings (low temperature, top_p: 0) so the financial calculations stay consistent. Ask your coding assistant:
Claude Code
Create a single agent `claims-assessor` under `00-insurance-claims/single-agent`
running on openai/gpt-5.2 with conservative sampling (temperature 0.2, top_p 0).
It should be an end-to-end Dutch auto insurance claims assistant that handles
intake, document collection, coverage check, payout calculation, and decision
communication, with Netherlands rules (WA/WA+/allrisk tiers, €150 / €300
deductibles, 75% total-loss threshold, 10% depreciation on 5+ year vehicles).
Expected outcome: The MCP tool returns a confirmation with the agent’s unique ID (a long string like 01KJQ8...). You’ll reference this agent by its key (claims-assessor) in later steps.

Key terms used in the instructions:
  • WA (Wettelijke Aansprakelijkheid): liability-only, the mandatory minimum coverage in the Netherlands
  • WA+ / Collision: adds own-vehicle collision damage to liability coverage
  • Allrisk / Comprehensive: full coverage including theft, fire, storm, and vandalism
  • Total loss: when repair cost exceeds 75% of the vehicle’s market value
  • Temperature / top_p: sampling knobs that control randomness. Lower means more predictable, which matters for financial calculations.
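To make the math concrete, the Dutch rules above can be sketched as a single function. This is an illustrative reading of the rules, not orq.ai code; in particular, applying depreciation to the payout base and waiving the deductible when the other party is at fault are assumptions:

```python
def calculate_payout(policy, repair_cost, market_value, vehicle_age, own_fault, driver_age):
    """Sketch of the Dutch payout rules for an own-vehicle damage claim."""
    if policy == "WA":
        # WA is liability-only: own-vehicle damage is never covered.
        return {"decision": "not covered", "payout": 0, "deductible": 0}
    # Total loss when repair cost exceeds 75% of market value.
    total_loss = repair_cost > 0.75 * market_value
    base = market_value if total_loss else repair_cost
    if vehicle_age >= 5:
        base *= 0.90  # assumed: 10% depreciation applied to the payout base
    # Assumed: deductible waived when the other party is at fault,
    # €300 for drivers under 24, €150 otherwise.
    deductible = 0 if not own_fault else (300 if driver_age < 24 else 150)
    payout = max(0, round(base - deductible))
    decision = "covered (total loss)" if total_loss else "covered"
    return {"decision": decision, "payout": payout, "deductible": deductible}
```

For example, an allrisk claim with a €3,200 repair on a €24,000 two-year-old car where the other driver is at fault pays out the full €3,200, while a WA+ own-fault claim by a 22-year-old loses the €300 deductible.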

Step 2: Create the sub-agents

The multi-agent architecture splits work across three specialized agents. These three calls are independent, so you can fire them in parallel if your assistant supports it.

2a: Document extractor

Parses policyholder messages and extracts structured claim data.
Claude Code
Create a `claims-document-extractor` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5-mini. Its job is to read policyholder messages and pull out
structured claim data (policy number, incident details, repair cost, market
value, policy type, fault, etc.), leaving any missing fields as null and
listing them under `missing_fields`.

2b: FAQ assistant

Handles policyholder questions about the claims process and coverage.
Claude Code
Create a `claims-faq-assistant` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5-mini. It answers policyholder questions about Dutch auto
insurance (WA / WA+ / allrisk coverage, deductibles, claim process, required
documents, standard exclusions) in plain language and never gives specific
legal advice.

2c: Claim calculator

The calculation engine. Performs all payout math on structured data.
Claude Code
Create a `claims-calculator` sub-agent under `00-insurance-claims/multi-agent`
on openai/gpt-5.2. It takes structured claim data and applies the Dutch rules
(coverage check, fault handling, €150/€300 deductible, 75% total-loss threshold,
10% depreciation on 5+ year vehicles) to return a precise payout in whole euros
with a covered / not covered / partial decision.
Expected outcome: Each call returns a confirmation with the sub-agent’s ID. Save the key values; you’ll need them in Step 3.

Why different models? The document extractor and FAQ assistant use GPT-5 Mini because they’re cheaper and fast enough for focused tasks. The claim calculator uses GPT-5.2 because financial calculation is where precision earns its keep.

Step 3: Create the multi-agent system

The orchestrator coordinates the three sub-agents, deciding which one to call based on the policyholder’s message.
Claude Code
Create a `claims-orchestrator` agent under `00-insurance-claims/multi-agent`
on openai/gpt-5.2 that coordinates the three sub-agents from Step 2. Give it
the `call_sub_agent` and `retrieve_agents` tools and add all three sub-agents
to its team. It should route questions to the FAQ assistant, new info to the
document extractor, and run the calculator once extraction is complete, never
calculating or guessing itself.
Expected outcome: The orchestrator is created with the three sub-agents wired to it. It will automatically route messages to the appropriate sub-agent based on conversation context. You now have two complete systems ready to test.

Step 4: Create the evaluators

Evaluators automatically score agent responses. You’ll create two: one LLM-as-a-judge and one deterministic Python check.

4a: LLM evaluator, claim accuracy

Claude Code
Create a boolean LLM-as-a-judge evaluator `claim-accuracy` under
`00-insurance-claims/evaluators` on anthropic/claude-sonnet-4-5 that compares
the agent's response (`{{log.output}}`) against the expected output
(`{{log.reference}}`) for the original message (`{{log.messages}}`) and returns
true only if coverage, fault, deductible, total-loss, depreciation, and final
payout are all correct.
Template variables are filled in automatically with the policyholder’s message, the agent’s response, and the expected answer from the dataset.

4b: Python evaluator, format compliance

Claude Code
Create a boolean Python evaluator `format-compliance` under
`00-insurance-claims/evaluators` that does a deterministic check for four
required elements in the agent's response: a clear decision, a euro payout
amount, the deductible (or "waived"), and next-step guidance.
Expected outcome: Each evaluator returns a confirmation with its unique ID. Save these IDs; you’ll need them in Step 6.
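The deterministic check the prompt describes might look like this in plain Python. The regex patterns and the function name are illustrative assumptions; orq.ai’s Python evaluator harness may expect a specific entry-point signature:

```python
import re

# One pattern per required element of a compliant claims response.
REQUIRED = {
    "decision": r"\b(covered|not covered|approved|denied|partial)\b",
    "amount": r"€\s?\d[\d.,]*",
    "deductible": r"\b(deductible|waived)\b",
    "next_steps": r"\b(next step|please (send|provide|upload)|we will)\b",
}

def format_compliance(output: str) -> bool:
    # True only when all four required elements appear in the response.
    return all(re.search(p, output, re.IGNORECASE) for p in REQUIRED.values())

sample = ("Decision: covered. Payout: €3,050 after the €150 deductible. "
          "Next step: we will transfer the amount within 5 working days.")
```

A deterministic evaluator like this is cheap to run on every datapoint and never disagrees with itself, which is why it complements rather than duplicates the LLM judge.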

Step 5: Create the dataset and add test cases

The test dataset covers 15 claims across four scenarios:
| Category | Count | Tests |
| --- | --- | --- |
| Covered | 5 | Agent correctly approves and calculates payout |
| Not covered | 5 | Agent identifies correct denial reason |
| Edge cases | 3 | Handles borderline or unusual scenarios |
| Incomplete | 2 | Asks follow-up questions instead of guessing |

5a: Create the dataset

Claude Code
Create a dataset called "Insurance Claims Test Cases" under
`00-insurance-claims/datasets`.
Expected outcome: returns the dataset’s unique ID. Copy this ID; you’ll need it in the next step and in Step 6.

5b: Add test cases

Claude Code
Add 15 datapoints to that dataset: 5 covered, 5 not covered, 3 edge cases,
2 incomplete. Each row needs a `user_input` column (the policyholder's
message, not `input` since that's reserved) and an `expected_output` column
with the correct assessment. Make sure the cases exercise every rule:
WA / WA+ / allrisk, own-fault vs other-party, standard vs under-24
deductible, total loss, and depreciation.
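Two illustrative rows, shown as Python dicts, give a feel for the column shape the prompt asks for. The scenario wording and expected outputs here are examples, not the full dataset:

```python
rows = [
    {   # Covered scenario: other party at fault, allrisk policy
        "user_input": "Policy NL-2024-88431, allrisk. Rear-ended on the A2, "
                      "other driver at fault. Repair estimate €3,200, car worth €24,000.",
        "expected_output": "Covered. Other party at fault, deductible waived. Payout €3,200.",
    },
    {   # Incomplete scenario: the agent should ask follow-up questions
        "user_input": "My car got damaged yesterday, what do I do?",
        "expected_output": "Ask for policy number, policy type, incident details, "
                           "and a repair estimate before assessing.",
    },
]
```

Note that every row uses the user_input column (not input, which is reserved) and pairs it with an expected_output the evaluators can score against.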

Step 6: Create and run the experiment

Claude Code
Create an experiment that runs both `claims-assessor` and `claims-orchestrator`
against the dataset from Step 5, scored by both evaluators from Step 4, and
auto-run it.
The experiment runs each of the 15 test cases through both agents and scores the outputs with both evaluators, giving you a side-by-side comparison in a single pass. Expected outcome: returns the experiment ID and a run ID. Typical runtime is 2-5 minutes.
Want to compare models directly instead of agents? Create experiments with task.type: "prompt" and a models array to test different models against the same instructions.

Step 7: Get the experiment results

Claude Code
Fetch the results for that experiment run.
Expected outcome: a download URL (valid for 1 hour) pointing to a JSON/JSONL file with each agent’s response, evaluator scores, cost, and latency for every test case. Compare the architectures on:
  • Accuracy rate (% of correct claim decisions)
  • Format compliance (% with all required elements)
  • Average cost per call
  • Average response time
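Once you’ve downloaded the JSONL file, a short script can roll these metrics up per agent. The field names (agent_key, claim_accuracy, cost, latency_ms) are assumptions about the export schema; adjust them to match the file you actually receive:

```python
import json
from statistics import mean

def summarize(jsonl_lines):
    """Group result rows by agent and average the comparison metrics."""
    by_agent = {}
    for line in jsonl_lines:
        row = json.loads(line)
        by_agent.setdefault(row["agent_key"], []).append(row)
    return {
        agent: {
            "accuracy": mean(r["claim_accuracy"] for r in rows),
            "avg_cost": round(mean(r["cost"] for r in rows), 4),
            "avg_latency_ms": mean(r["latency_ms"] for r in rows),
        }
        for agent, rows in by_agent.items()
    }

# Hypothetical result rows, for illustration only.
sample = [
    '{"agent_key": "claims-assessor", "claim_accuracy": 1, "cost": 0.012, "latency_ms": 2400}',
    '{"agent_key": "claims-assessor", "claim_accuracy": 0, "cost": 0.010, "latency_ms": 2100}',
    '{"agent_key": "claims-orchestrator", "claim_accuracy": 1, "cost": 0.031, "latency_ms": 5200}',
]
summary = summarize(sample)
```

Comparing the two summary rows side by side is usually enough to see whether the orchestrator’s extra cost and latency bought any accuracy.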

Step 8: Invoke the winner in production

Once you’ve picked the best-performing agent, you can test it conversationally in AI Chat or integrate it programmatically via the Python SDK, TypeScript SDK, or REST API.
from orq_ai_sdk import OrqAI

client = OrqAI(api_key="your-api-key")

response = client.agents.invoke(
    key="claims-assessor",
    messages=[{
        "role": "user",
        "content": "Policy NL-2024-88431. Jan de Vries, allrisk policy. Rear-ended at traffic light on A2, 12 March. Other driver admitted fault. Repair estimate €3,200. Car is 2022 VW Golf, market value €24,000. Police report filed. Age 35."
    }]
)

print(response.choices[0].message.content)
Find your API key in orq.ai dashboard → Settings → API Keys.

The architecture comparison loop

You now have a repeatable pattern for any “should this be multi-agent?” decision: build simple first → build the orchestrated version → score both with the same evaluators on the same dataset → compare accuracy, cost, and latency in one experiment → ship the winner. The orchestration overhead of a multi-agent system is real (more prompts to tune, more places to fail, more latency per turn), so you should only pay it when the evaluator says you’re getting more accuracy in return.

Troubleshooting

Agent not found when invoking
  • Verify the key matches exactly (case-sensitive)
  • Check that you’re using the agent key (e.g. claims-assessor), not the ID
  • Confirm the agent exists in your workspace via the orq.ai dashboard

Experiment stuck or slow
  • Check experiment status in the orq.ai dashboard (Experiments section)
  • Experiments typically take 2-5 minutes for 15 datapoints
  • If stuck past 10 minutes, create a new experiment run
  • Check your workspace API rate limits in Settings

Evaluator scores look wrong
  • Verify the dataset expected_output column is populated for every row
  • Check that the evaluator IDs in the experiment config are correct
  • Review evaluator prompts or code for syntax errors
  • Test evaluators individually in the orq.ai dashboard first

Orchestrator not routing to sub-agents
  • Confirm team_of_agents keys match the sub-agent keys exactly
  • Verify settings.tools includes call_sub_agent and retrieve_agents
  • Check that orchestrator instructions clearly specify when to use each sub-agent
  • Review agent traces in the orq.ai dashboard to see the execution flow

Cost or latency higher than expected
  • Consider cheaper models (e.g. GPT-5 Mini) for non-critical sub-agents
  • Reduce max_tokens if responses are longer than needed
  • Check whether agents are making unnecessary tool calls (review traces)
  • Use streaming for better perceived latency

MCP tools not available
  • Verify the MCP server is running and connected
  • Check your coding assistant’s MCP configuration
  • Restart your coding assistant to reload MCP connections
  • See the MCP setup guide for configuration details

Next steps