TL;DR
- Build two architectures side by side from your coding assistant: one monolithic claims-assessor and one multi-agent orchestrator with document extraction, FAQ, and claim calculation sub-agents
- Score them with two evaluators, an LLM-as-a-judge for claim accuracy and a Python evaluator for format compliance
- Run one experiment over 15 test cases covering covered / not-covered / edge / incomplete scenarios, and compare accuracy, cost, and latency in a single pass
- Pick a winner and ship it via AI Chat, Python, TypeScript, or curl once the experiment tells you which architecture earns its complexity
What you’ll build
A working auto insurance claims system, built twice: once as a single end-to-end agent and once as a multi-agent orchestrator. You’ll score both with the same evaluators against the same 15-case dataset, then ship the winner. The insurance domain is just the vehicle; the focus is the architecture comparison loop for deciding whether a multi-agent system earns its orchestration overhead.

What you’ll learn
In this guide we walk through how to:

- Create and configure agents and sub-agents in orq.ai via MCP, directly from your coding assistant
- Wire sub-agents into an orchestrator and give it the tools it needs to route messages
- Build evaluators (LLM-as-a-judge and Python) that score agent outputs against an expected answer
- Create a dataset and run one experiment that scores both architectures with both evaluators in a single run
- Invoke the winning agent from AI Chat or programmatically via the Python, TypeScript, or REST API
Prerequisites
- An orq.ai workspace and API key
- A project named `00-insurance-claims` in the orq.ai dashboard (Projects → New Project). Every agent, evaluator, and dataset in this cookbook lives under this project.
- A coding assistant with the orq.ai MCP server connected (Claude Code, Cursor, or any MCP-compatible assistant)
Need to set up MCP? See the MCP integration guide first.
Architecture overview
You’ll build two architectures and compare them head to head:

- Architecture A: a single agent that handles the entire claims workflow
- Architecture B: a multi-agent system with an orchestrator and specialized sub-agents

Both are scored by the same evaluators on the same dataset, so testing and measurement stay identical across the two.

When to use a workflow instead

This cookbook uses an orchestrator agent that dynamically decides which sub-agent to call. That’s ideal when the conversation is open-ended: the policyholder might ask questions, provide info in any order, or need follow-ups.

If you already know the execution order (e.g. your UI collects all claim data upfront in a form), you can skip the orchestrator entirely and chain deployments and agents in code. In that pattern you invoke each deployment or agent sequentially via the SDK, passing the output of one as input to the next. You get deterministic execution order, full control over the flow, and easier error handling at each step.

When to choose which:
See the Chaining Deployments tutorial for a step-by-step example of the workflow approach.
| | Orchestrator (this cookbook) | Workflow in code |
|---|---|---|
| Best for | Chat interfaces, conversational UX | Forms, APIs, batch processing |
| Execution order | Dynamic, agent decides | Fixed, you control |
| Flexibility | Handles unexpected inputs | Deterministic, fewer failure modes |
| Reliability | Testable, but harder to reach high reliability with complex multi-agent routing | Faster to production-grade reliability. Each step is isolated and independently verifiable |
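The workflow-in-code pattern from the table’s right column can be sketched as below. Each function stands in for one deployment or agent invocation; the function names, coverage rule, and data shapes are illustrative placeholders, not part of the orq.ai SDK.

```python
# Sketch of chaining steps in code: fixed order, no orchestrator.
# Every function here is a stand-in for an SDK call to a deployment/agent.

def extract_claim_data(message: str) -> dict:
    # In practice: invoke the document-extraction step here.
    return {"policy_type": "Allrisk", "peril": "storm", "damage_eur": 1200}

def verify_coverage(claim: dict) -> dict:
    # In practice: invoke a coverage-verification step here.
    # Simplified rule: only Allrisk covers storm damage (see key terms in Step 1).
    claim["covered"] = claim["policy_type"] == "Allrisk"
    return claim

def calculate_payout(claim: dict) -> dict:
    # In practice: invoke the claim-calculator step here.
    claim["payout_eur"] = claim["damage_eur"] if claim["covered"] else 0
    return claim

def process_claim(message: str) -> dict:
    # Deterministic pipeline: each step's output feeds the next.
    return calculate_payout(verify_coverage(extract_claim_data(message)))

result = process_claim("A storm damaged my car; repair quote is 1200 euros.")
```

Error handling slots in naturally between steps: each intermediate dict can be validated before the next call, which is what makes this pattern faster to harden than dynamic routing.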
Create folders in the UI first. MCP can create agents, evaluators, and datasets, but it can’t create the folders they live in. If you want these to land in specific folders like `00-insurance-claims/single-agent`, `00-insurance-claims/multi-agent`, `00-insurance-claims/evaluators`, and `00-insurance-claims/datasets`, create those folders in the orq.ai dashboard before running Step 1. Otherwise, drop the path from the prompts and the assistant will create them under Default or whichever folder you prefer.

Step 1: Create the single agent
The single agent handles the entire claims workflow end to end: incident intake, document processing, coverage verification, payout calculation, and decision communication. It runs on GPT-5.2 with conservative sampling settings (low temperature, `top_p: 0`) so the financial calculations stay consistent.
Ask your coding assistant:
Claude Code
The assistant creates the agent and returns its ID (e.g. 01KJQ8...). You’ll reference this agent by its key (`claims-assessor`) in later steps.
Key terms used in the instructions:
- WA (Wettelijke Aansprakelijkheid): liability-only, the mandatory minimum coverage in the Netherlands
- WA+ / Collision: adds own-vehicle collision damage to liability coverage
- Allrisk / Comprehensive: full coverage including theft, fire, storm, and vandalism
- Total loss: when repair cost exceeds 75% of the vehicle’s market value
- Temperature / top_p: sampling knobs that control randomness. Lower means more predictable, which matters for financial calculations.
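The total-loss rule above can be made concrete with a small sketch. The 75% threshold comes straight from the definition; the deductible handling is an illustrative assumption, not taken from the agent’s actual instructions.

```python
# Sketch of the total-loss rule from the key terms above.
# 75% threshold per the definition; deductible handling is an assumption.
TOTAL_LOSS_THRESHOLD = 0.75

def assess_damage(repair_cost: float, market_value: float,
                  deductible: float = 0.0) -> dict:
    """Return whether the vehicle is a total loss and the resulting payout."""
    total_loss = repair_cost > TOTAL_LOSS_THRESHOLD * market_value
    # Total loss pays out the market value; otherwise the repair cost.
    # Deductible subtracted in both cases (illustrative simplification).
    base = market_value if total_loss else repair_cost
    return {"total_loss": total_loss, "payout": max(base - deductible, 0.0)}

assess_damage(repair_cost=8000, market_value=10000)
# 8000 > 7500, so this claim is a total loss
```

This is exactly the kind of arithmetic the conservative sampling settings in Step 1 are meant to keep deterministic.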
Step 2: Create the sub-agents
The multi-agent architecture splits work across three specialized agents. These three calls are independent, so you can fire them in parallel if your assistant supports it.

2a: Document extractor

Parses policyholder messages and extracts structured claim data.

Claude Code
2b: FAQ assistant
Handles policyholder questions about the claims process and coverage.

Claude Code
2c: Claim calculator
The calculation engine. Performs all payout math on structured data.

Claude Code
Note all three `key` values; you’ll need them in Step 3.
Why different models? The document extractor and FAQ assistant use GPT-5 Mini because they’re cheaper and fast enough for focused tasks. The claim calculator uses GPT-5.2 because financial calculations are where precision earns its keep.
Step 3: Create the multi-agent system
The orchestrator coordinates the three sub-agents, deciding which one to call based on the policyholder’s message.

Claude Code
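Based on the fields referenced later in Troubleshooting (`team_of_agents`, `settings.tools` with `call_sub_agent` and `retrieve_agents`), the orchestrator’s configuration looks roughly like the sketch below. The overall layout and the sub-agent keys are assumptions, not the exact orq.ai schema; the prompt your assistant generates is authoritative.

```python
# Rough shape of the orchestrator config. The field names team_of_agents,
# call_sub_agent, and retrieve_agents come from the Troubleshooting section;
# everything else (layout, keys) is an illustrative assumption.
orchestrator_config = {
    "key": "claims-orchestrator",   # hypothetical orchestrator key
    "team_of_agents": [
        "document-extractor",       # hypothetical sub-agent keys; these must
        "faq-assistant",            # match the keys created in Step 2 exactly
        "claim-calculator",
    ],
    "settings": {
        "tools": ["call_sub_agent", "retrieve_agents"],
    },
}
```

If routing misbehaves later, this is the config to check first (see "Sub-agents not being called by orchestrator" under Troubleshooting).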
Step 4: Create the evaluators
Evaluators automatically score agent responses. You’ll create two: one LLM-as-a-judge and one deterministic Python check.

4a: LLM evaluator, claim accuracy
Claude Code
4b: Python evaluator, format compliance
Claude Code
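A deterministic format check might look like the sketch below. The required elements it checks for (an explicit decision keyword and a euro amount) are illustrative assumptions; your actual evaluator should verify whatever output format the agent instructions mandate.

```python
import re

# Sketch of a format-compliance evaluator. The required elements checked here
# (a decision keyword and a euro amount) are illustrative assumptions.
def evaluate(output: str) -> dict:
    checks = {
        "has_decision": bool(
            re.search(r"\b(approved|denied|need more information)\b", output, re.I)
        ),
        "has_amount": bool(re.search(r"€\s?\d", output))
        or "not applicable" in output.lower(),
    }
    # Binary score: 1.0 only when every required element is present.
    return {"score": 1.0 if all(checks.values()) else 0.0, "checks": checks}

evaluate("Claim approved. Payout: €1,200 after deductible.")
```

Because it is pure code, this evaluator is deterministic and cheap, a useful complement to the LLM judge, which handles the fuzzier question of whether the decision itself is correct.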
Step 5: Create the dataset and add test cases
The test dataset covers 15 claims across four scenarios:

| Category | Count | Tests |
|---|---|---|
| Covered | 5 | Agent correctly approves and calculates payout |
| Not covered | 5 | Agent identifies correct denial reason |
| Edge cases | 3 | Handles borderline or unusual scenarios |
| Incomplete | 2 | Asks follow-up questions instead of guessing |
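Two illustrative rows, one each for the first two categories, might look like this. The `expected_output` column name matches the field referenced later in Troubleshooting; the row contents themselves are made up for illustration.

```python
# Two illustrative dataset rows. The expected_output field name is referenced
# in Troubleshooting; the claim details here are invented for illustration.
test_cases = [
    {
        "category": "covered",
        "input": "I have Allrisk coverage. A storm dented my roof; "
                 "the repair quote is €1,200.",
        "expected_output": "Approved. Storm damage is covered under Allrisk; "
                           "payout €1,200 minus deductible.",
    },
    {
        "category": "not_covered",
        "input": "I have WA-only coverage and hit a pole. Repairs are €900.",
        "expected_output": "Denied. WA covers liability only, "
                           "not own-vehicle damage.",
    },
]
```

Every row needs a populated `expected_output`, since both evaluators score against it.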
5a: Create the dataset
Claude Code
5b: Add test cases
Claude Code
Step 6: Create and run the experiment
Claude Code
Step 7: Get the experiment results
Claude Code
- Accuracy rate (% of correct claim decisions)
- Format compliance (% with all required elements)
- Average cost per call
- Average response time
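If you pull the raw experiment rows via the API, the four comparison metrics above can be aggregated per architecture like this. The row field names (`correct`, `format_ok`, `cost_usd`, `latency_ms`) are illustrative, not the orq.ai response schema.

```python
# Aggregate the four comparison metrics from raw experiment rows.
# Field names (correct, format_ok, cost_usd, latency_ms) are illustrative.
def summarize(rows: list[dict]) -> dict:
    n = len(rows)
    return {
        "accuracy_rate": sum(r["correct"] for r in rows) / n,
        "format_compliance": sum(r["format_ok"] for r in rows) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in rows) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in rows) / n,
    }

rows = [
    {"correct": 1, "format_ok": 1, "cost_usd": 0.002, "latency_ms": 1800},
    {"correct": 0, "format_ok": 1, "cost_usd": 0.004, "latency_ms": 2600},
]
summarize(rows)
```

Run it once per architecture and the winner is whichever summary buys the accuracy you need at a cost and latency you can live with.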
Step 8: Invoke the winner in production
Once you’ve picked the best-performing agent, you can test it conversationally in AI Chat or integrate it programmatically via the Python SDK, TypeScript SDK, or REST API.

The architecture comparison loop
You now have a repeatable pattern for any “should this be multi-agent?” decision: build simple first → build the orchestrated version → score both with the same evaluators on the same dataset → compare accuracy, cost, and latency in one experiment → ship the winner. The orchestration overhead of a multi-agent system is real (more prompts to tune, more places to fail, more latency per turn), so you should only pay it when the evaluator says you’re getting more accuracy in return.

Troubleshooting
Agent not found error
- Verify the `key` matches exactly (case-sensitive)
- Check that you’re using the agent key (e.g. `claims-assessor`), not the ID
- Confirm the agent exists in your workspace via the orq.ai dashboard
Experiment stuck or taking too long
- Check experiment status in the orq.ai dashboard (Experiments section)
- Experiments typically take 2-5 minutes for 15 datapoints
- If stuck past 10 minutes, create a new experiment run
- Check your workspace API rate limits in Settings
Evaluator returns null or unexpected results
- Verify the dataset `expected_output` column is populated for every row
- Check that the evaluator IDs in the experiment config are correct
- Review evaluator prompts or code for syntax errors
- Test evaluators individually in the orq.ai dashboard first
Sub-agents not being called by orchestrator
- Confirm `team_of_agents` keys match the sub-agent keys exactly
- Verify `settings.tools` includes `call_sub_agent` and `retrieve_agents`
- Check that orchestrator instructions clearly specify when to use each sub-agent
- Review agent traces in the orq.ai dashboard to see the execution flow
High costs or slow performance
- Consider cheaper models (e.g. GPT-5 Mini) for non-critical sub-agents
- Reduce `max_tokens` if responses are longer than needed
- Check whether agents are making unnecessary tool calls (review traces)
- Use streaming for better perceived latency
MCP tools not available in coding assistant
- Verify the MCP server is running and connected
- Check your coding assistant’s MCP configuration
- Restart your coding assistant to reload MCP connections
- See the MCP setup guide for configuration details
Next steps
- Multi-agent HR system, a deeper dive into building multi-agent orchestrators with memory and knowledge bases
- Chaining deployments, the deterministic alternative when you already know the execution order
- Automate evals with Claude Code, close the loop by optimizing your evaluators the same way you optimized your agents
- Red teaming, probe your claims agent for safety and policy-bypass failures before shipping