Overview

Prompt Caching is a provider-level feature that caches the processed prefix of a prompt, so repeated requests that reuse it are charged at a reduced rate. It is most effective when your requests share a large, stable prefix:
  • a long system prompt
  • a reference document
  • a tool definition list
Unlike Response Caching, which serves a stored response for identical requests, Prompt Caching still calls the model on every request, at a reduced cost. Both can be used together.

Supported Models

Prompt Caching is available on Anthropic models: Claude Haiku, Sonnet, and Opus.

How to Enable Prompt Caching

Add a cache_control object to any message part you want to mark as cacheable:
{
  "cache_control": { "type": "ephemeral" }
}
"ephemeral" is the only supported type. You can place it on:
  • System message text parts
  • User message text parts
  • User message images, documents, and files (including PDFs)
  • Tool result content
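For instance, a messages array might mark both a long system text part and a large tool result as cacheable. This is a sketch only: the exact wire shape of the tool message, the `tool_call_id`, and the placeholder texts are illustrative assumptions, not taken from the API reference.

```python
import json

# Sketch: a messages payload with cache_control on two cacheable parts.
# The tool message shape and tool_call_id are illustrative placeholders.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a support agent. <long reference document here>",
                # Mark the stable system prefix as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",  # hypothetical id
        "content": [
            {
                "type": "text",
                "text": "<large tool output>",
                # Tool result content can also carry the marker.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
]

print(json.dumps(messages, indent=2))
```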

Minimum Token Thresholds

Caching only activates once the marked content exceeds a minimum token count. Requests below the threshold are processed normally at full cost.
  • Claude Opus 4.6, Opus 4.5: 4,096 tokens
  • Claude Sonnet 4.6: 2,048 tokens
  • Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7: 1,024 tokens
  • Claude Haiku 4.5: 4,096 tokens
  • Claude Haiku 3.5, Haiku 3: 2,048 tokens
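As a sanity check, you can compare a rough token estimate against these thresholds before relying on a cache_control marker. The sketch below hardcodes the table above; the model identifier strings are assumed slugs, and the 4-characters-per-token heuristic is a crude approximation, not the model's real tokenizer.

```python
# Minimum cacheable token counts, taken from the table above.
# Model id strings are assumed slugs for illustration.
MIN_CACHE_TOKENS = {
    "claude-opus-4-6": 4096,
    "claude-opus-4-5": 4096,
    "claude-sonnet-4-6": 2048,
    "claude-sonnet-4-5": 1024,
    "claude-opus-4-1": 1024,
    "claude-opus-4": 1024,
    "claude-sonnet-4": 1024,
    "claude-sonnet-3-7": 1024,
    "claude-haiku-4-5": 4096,
    "claude-haiku-3-5": 2048,
    "claude-haiku-3": 2048,
}


def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return len(text) // 4


def will_cache(model: str, text: str) -> bool:
    """Return True if the marked content likely meets the model's minimum."""
    threshold = MIN_CACHE_TOKENS.get(model)
    if threshold is None:
        return False  # unknown model: assume no caching
    return estimate_tokens(text) >= threshold


print(will_cache("claude-sonnet-4-6", "x" * 10000))  # ~2,500 estimated tokens
```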

Cache TTL

The ttl parameter controls how long cached content persists before expiring.
  • "5m" (default): 5 minutes from last use
  • "1h": 1 hour
To extend the cache lifetime, set ttl alongside the type:
{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "1h"
  }
}

Example

The request below caches a long system prompt; only the short user message changes between calls:
curl -X POST https://api.orq.ai/v3/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a senior legal assistant. The following is our complete contract template library...",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      {
        "role": "user",
        "content": "Summarize clause 7 of the NDA template."
      }
    ]
  }'
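The same request can be sketched in Python with the standard library. The payload mirrors the curl call above; the request is only sent when ORQ_API_KEY is present in the environment.

```python
import json
import os
import urllib.request

# Payload mirroring the curl example above.
payload = {
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a senior legal assistant. The following is our complete contract template library...",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize clause 7 of the NDA template."},
    ],
}


def send(payload: dict) -> dict:
    """POST the payload to the router and return the parsed JSON response."""
    req = urllib.request.Request(
        "https://api.orq.ai/v3/router/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['ORQ_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


if os.environ.get("ORQ_API_KEY"):
    print(send(payload))
```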

Usage in the response

When a cache hit occurs, the response usage object reflects it under prompt_tokens_details.cached_tokens:
{
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 180,
    "total_tokens": 1380,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}
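Cached tokens are typically billed at a fraction of the normal input rate, so the usage object above can be turned into a billed-equivalent token count. The 10% read multiplier below is an illustrative assumption, not the provider's actual rate; check your provider's pricing.

```python
def effective_input_tokens(usage: dict, read_multiplier: float = 0.1) -> float:
    """Compute billed-equivalent input tokens from a usage object.

    Assumes cached (read) tokens cost `read_multiplier` times a normal
    input token -- an illustrative figure, not an actual published rate.
    """
    prompt = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    uncached = prompt - cached
    return uncached + cached * read_multiplier


# The usage object from the example response above.
usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 180,
    "total_tokens": 1380,
    "prompt_tokens_details": {"cached_tokens": 1024},
}

print(effective_input_tokens(usage))  # uncached tokens plus discounted cached tokens
```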