Access Claude models (Claude 4.6 Opus, Sonnet, and Claude 4.5 Haiku) through the AI Router with advanced message APIs, tool use capabilities, and intelligent model routing. All Claude models are available with consistent formatting and pricing across multiple providers.
Claude models use the provider slug format `anthropic/model-name`. For example: `anthropic/claude-sonnet-4-6`
Ensure that prompt caching is compatible with the chosen model. For a full guide, see Prompt Caching.

Cache frequently used context (system prompts, large documents, codebases) to reduce costs by up to 90% and latency by up to 85%.
```bash
curl -X POST https://api.orq.ai/v2/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are an expert Python developer with deep knowledge of best practices.",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      { "role": "user", "content": "Write a function to parse JSON" }
    ],
    "max_tokens": 1024
  }'
```
How It Works

Prompt caching stores frequently used content blocks on Anthropic’s servers for reuse across requests:
- Mark content for caching: Add `cache_control: { type: "ephemeral" }` to text blocks
- First request: Content is processed normally and cached (cache write)
- Subsequent requests: Cached content is reused (cache read)
- Cache lifetime: 5 minutes from last use (automatically managed)
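The write/read lifecycle above can be sketched client-side. The `markForCaching` helper below is hypothetical (it is not part of any Orq or Anthropic SDK); it simply attaches the `cache_control` flag to a text content block before the request is sent:

```typescript
// Hypothetical helper, not an SDK function: attaches the ephemeral
// cache_control flag to a text content block.
type CacheableTextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl?: "5m" | "1h" };
};

function markForCaching(text: string, ttl?: "5m" | "1h"): CacheableTextBlock {
  const block: CacheableTextBlock = {
    type: "text",
    text,
    cache_control: { type: "ephemeral" },
  };
  if (ttl !== undefined) block.cache_control!.ttl = ttl;
  return block;
}

// The first request containing this block writes it to the cache;
// requests within the TTL that repeat the same prefix read from it.
const systemBlock = markForCaching(
  "You are an expert Python developer with deep knowledge of best practices."
);
```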
Configuration

Mark content blocks for caching by adding the `cache_control` parameter:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | `"ephemeral"` | Yes | Only supported cache type |
| `ttl` | `"5m"` \| `"1h"` | No | Cache duration (default: `"5m"`) |
Cache TTL Options

The `ttl` parameter controls how long cached content persists:

- `"5m"` (5 minutes) - Default cache duration
- `"1h"` (1 hour) - Extended cache duration for longer-running workflows
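As a small illustration of the default behavior, here is a sketch (a hypothetical helper written for this page, not an SDK call) that resolves the effective TTL:

```typescript
// Resolves the effective cache TTL; per the options above,
// an omitted ttl falls back to the "5m" default.
function effectiveTtl(
  cacheControl: { type: "ephemeral"; ttl?: "5m" | "1h" }
): "5m" | "1h" {
  return cacheControl.ttl ?? "5m";
}
```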
- Add `cache_control` to the last message or content block you want cached
- Everything up to that point is included in the cache
- Maximum: 4 cache breakpoints per request
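These rules can be checked client-side before sending a request. The helpers below are illustrative only (not SDK functions) and assume the message shape used in the examples on this page:

```typescript
type ContentBlock = { type: string; text: string; cache_control?: { type: string } };
type ChatMessage = { role: string; content: string | ContentBlock[] };

// Counts how many content blocks carry a cache_control breakpoint.
function countBreakpoints(messages: ChatMessage[]): number {
  let count = 0;
  for (const message of messages) {
    if (Array.isArray(message.content)) {
      for (const block of message.content) {
        if (block.cache_control !== undefined) count++;
      }
    }
  }
  return count;
}

// Enforces the documented maximum of 4 cache breakpoints per request.
function assertBreakpointLimit(messages: ChatMessage[]): void {
  if (countBreakpoints(messages) > 4) {
    throw new Error("At most 4 cache breakpoints are allowed per request");
  }
}
```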
Minimum Token Thresholds

Caching only activates once the marked content meets the model’s minimum. Requests below the threshold are processed normally at full cost.
| Model | Minimum tokens |
| --- | --- |
| Claude Opus 4.6, Opus 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 | 1,024 |
| Claude Haiku 4.5 | 4,096 |
| Claude Haiku 3.5, Haiku 3 | 2,048 |
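A rough pre-flight estimate can flag content that is unlikely to reach the threshold. This sketch uses the common ~4-characters-per-token heuristic (the real tokenizer decides, so treat the result as an estimate); the slugs other than `anthropic/claude-sonnet-4-6` are assumed for illustration:

```typescript
// Minimum cacheable tokens per model, copied from the table above.
// Slugs other than claude-sonnet-4-6 are assumed for illustration.
const minCacheTokens: Record<string, number> = {
  "anthropic/claude-sonnet-4-6": 2048,
  "anthropic/claude-opus-4-6": 4096,
  "anthropic/claude-haiku-4-5": 4096,
};

// Rough estimate only: ~4 characters per token. Content below the model's
// minimum is processed normally at full cost, so caching it gains nothing.
function likelyMeetsCacheMinimum(model: string, text: string): boolean {
  const minimum = minCacheTokens[model];
  if (minimum === undefined) return false;
  return text.length / 4 >= minimum;
}
```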
Use Cases
Static System Prompts
Cache role definitions and instructions that don’t change.
```bash
curl -X POST https://api.orq.ai/v2/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are an expert software engineer specializing in Python.\nYour responses should be:\n- Clear and concise\n- Include code examples\n- Follow PEP 8 style guidelines\n- Include error handling",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      { "role": "user", "content": "How do I read a CSV file?" }
    ],
    "max_tokens": 1024
  }'
```
Large Document Context
Cache documents, codebases, or knowledge bases for reuse across multiple queries.
```bash
curl -X POST https://api.orq.ai/v2/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Here is our API documentation:\n\n[Large documentation content here...]",
            "cache_control": { "type": "ephemeral" }
          },
          { "type": "text", "text": "How do I authenticate with the API?" }
        ]
      }
    ],
    "max_tokens": 1024
  }'
```
Multi-turn Conversations
Cache conversation history for long interactions to reduce processing time and costs on subsequent messages.
```bash
curl -X POST https://api.orq.ai/v2/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      { "role": "user", "content": "What is Python?" },
      { "role": "assistant", "content": "Python is a high-level programming language..." },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What are its main features?",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      { "role": "assistant", "content": "Python’s main features include..." },
      { "role": "user", "content": "Can you give me a code example?" }
    ],
    "max_tokens": 1024
  }'
```
RAG with Document Collections
Cache retrieved documents for multiple queries in retrieval-augmented generation scenarios.
```bash
curl -X POST https://api.orq.ai/v2/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant that answers based on provided context."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Context:\n[Retrieved document content here...]",
            "cache_control": { "type": "ephemeral" }
          },
          {
            "type": "text",
            "text": "Question: What is the main topic of these documents?"
          }
        ]
      }
    ],
    "max_tokens": 1024
  }'
```
Important: Always include the `signature` field when passing reasoning content back to the API. The signature cryptographically verifies the reasoning was generated by the model and is required for multi-turn conversations.
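A hedged sketch of what passing reasoning content back looks like. The block shape mirrors the thinking blocks returned by the model, but treat the exact wire format as an assumption here, and note that the `signature` value below is a placeholder, not a real signature:

```typescript
// Assistant turn being replayed into a follow-up request. The thinking block
// is passed back verbatim, including its signature.
const assistantTurn = {
  role: "assistant",
  content: [
    {
      type: "thinking",
      thinking: "[model reasoning from the previous response...]",
      signature: "sig_placeholder", // opaque value returned by the model; never modify it
    },
    { type: "text", text: "Here is the proposed design..." },
  ],
};

// Illustrative guard (not an SDK function): verifies a replayed
// assistant turn still carries a non-empty signature.
function hasThinkingSignature(turn: { role: string; content: any[] }): boolean {
  return turn.content.some(
    (block) =>
      block.type === "thinking" &&
      typeof block.signature === "string" &&
      block.signature.length > 0
  );
}
```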
Combine with prompt caching for repeated contexts
Cache system prompts and context to reduce costs and latency when using extended thinking:
```typescript
const response = await openai.chat.completions.create({
  model: "anthropic/claude-opus-4-6",
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: "You are a system architect...",
          cache_control: { type: "ephemeral" }, // cache this block
        },
      ],
    },
    { role: "user", content: "Design a notification system" },
  ],
  thinking: { type: "enabled", budget_tokens: 8000 },
});
```
Configuration & Best Practices
| Aspect | Guidance | Details |
| --- | --- | --- |
| `thinking.type` | Set to `"enabled"` | Enables extended thinking with manual budget |
| `thinking.budget_tokens` | Set based on complexity | Min: 1024, must be < `max_tokens`. Billed as output tokens. |
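The two budget constraints can be validated before sending a request. This guard is a sketch written for this page, not an SDK call:

```typescript
// Validates the documented constraints on thinking.budget_tokens:
// at least 1024, and strictly less than max_tokens.
function validateThinkingBudget(budgetTokens: number, maxTokens: number): void {
  if (budgetTokens < 1024) {
    throw new Error("thinking.budget_tokens must be at least 1024");
  }
  if (budgetTokens >= maxTokens) {
    throw new Error("thinking.budget_tokens must be less than max_tokens");
  }
}
```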
Supported Models: Extended thinking with `budget_tokens` is available on Claude Opus 4.5, Sonnet 4.5, and newer models. For Claude Opus 4.6 and Sonnet 4.6, consider using adaptive thinking instead (see below). Available through the `anthropic/`, `aws/`, and `google/` providers.
Adaptive thinking is the recommended way to use extended thinking with Claude Opus 4.6 and Sonnet 4.6. Instead of manually setting a thinking token budget, adaptive thinking lets Claude dynamically determine when and how much to think based on the complexity of each request.
| Mode | Syntax | When to use |
| --- | --- | --- |
| Adaptive | | Recommended for Claude 4.6 models. Claude determines thinking depth automatically. |
| Manual | `thinking: { type: "enabled", budget_tokens: N }` | When you need precise control over thinking token spend. Supported on all thinking-capable models. |
| Disabled | Omit `thinking` parameter | When you don’t need extended thinking and want the lowest latency. |
Supported Models: Adaptive thinking is available on Claude Opus 4.6 and Claude Sonnet 4.6 only. Older models (Opus 4.5, Sonnet 4.5, etc.) require type: "enabled" with budget_tokens.
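One way to encode the model-support rule above in client code. The slug list is an assumption for illustration, derived from the model names on this page:

```typescript
// Models documented above as supporting adaptive thinking (slugs assumed).
const adaptiveThinkingModels = [
  "anthropic/claude-opus-4-6",
  "anthropic/claude-sonnet-4-6",
];

// Picks the thinking mode this page recommends for a given model slug;
// older models fall back to manual budgets.
function recommendedThinkingMode(model: string): "adaptive" | "manual" {
  return adaptiveThinkingModels.includes(model) ? "adaptive" : "manual";
}
```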
Note: `max_tokens` is required for Anthropic models. Typical values: 1024 for responses, 4096+ for long content.
Do not use `temperature` and `top_p` together on newer Anthropic models. Using both parameters simultaneously will result in an API error. Choose one or the other.
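A minimal client-side guard for the sampling-parameter rule (a sketch, not part of any SDK):

```typescript
// Rejects request bodies that set temperature and top_p together,
// which newer Anthropic models return an API error for.
function checkSamplingParams(params: { temperature?: number; top_p?: number }): void {
  if (params.temperature !== undefined && params.top_p !== undefined) {
    throw new Error("Set either temperature or top_p, not both");
  }
}
```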