Prompt caching for reduced token costs

Use Cases

Reusing long system prompts across many requests to cut input token costs.
Referencing large documents or codebases without re-sending them every call.
Multi-turn conversations with a large, stable context that doesn’t change between turns.
RAG pipelines where the same retrieved context is shared across many user queries.

Overview

Prompt Caching is a provider-level feature that caches prompts so that repeated requests are charged at a reduced rate. This is most effective when your requests share a large, stable prefix:

A long system prompt.
A reference document.
A tool definition list.

Unlike Response Caching, which serves a stored response for identical requests, Prompt Caching still calls the model on every request, at a reduced cost. Both can be used together.

How caching is enabled and what gets cached varies by provider. See the provider sections below.

Anthropic

Prompt caching on Anthropic models requires explicit opt-in via cache_control markers on individual message parts.

Supported models

Claude Haiku, Sonnet, and Opus across all current versions.

Enabling caching

Add a cache_control object to any message part you want to mark as cacheable:

{
  "cache_control": { "type": "ephemeral" }
}

"ephemeral" is the only supported type. You can place it on:

System message text parts.
User message text parts.
User message images, documents, and files (including PDFs).
Tool result content.

Minimum token thresholds

Caching only activates once the marked content reaches a minimum token count, which depends on the model. Requests below the threshold are processed normally at full cost.

Model	Minimum tokens
Claude Opus 4.6, Opus 4.5	4,096
Claude Sonnet 4.6	2,048
Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7	1,024
Claude Haiku 4.5	4,096
Claude Haiku 3.5, Haiku 3	2,048
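
The thresholds in the table can be kept in a small lookup, so a client can skip adding cache_control when the content is too short to be cached. The following is a minimal sketch, not part of the Gateway API. The dictionary keys and the helper name `meets_cache_minimum` are hypothetical labels rather than real model IDs. The token count is assumed to come from your own tokenizer, and the threshold is treated as inclusive, which is also an assumption.

```python
# Minimum cacheable token counts, copied from the table above.
# The keys are illustrative labels, not official model identifiers.
MIN_CACHE_TOKENS = {
    "opus-4.6": 4096,
    "opus-4.5": 4096,
    "sonnet-4.6": 2048,
    "sonnet-4.5": 1024,
    "opus-4.1": 1024,
    "opus-4": 1024,
    "sonnet-4": 1024,
    "sonnet-3.7": 1024,
    "haiku-4.5": 4096,
    "haiku-3.5": 2048,
    "haiku-3": 2048,
}


def meets_cache_minimum(model_label: str, token_count: int) -> bool:
    """Return True if token_count reaches the model's minimum (hypothetical helper)."""
    # Assumption: the threshold is inclusive (token_count == minimum qualifies).
    return token_count >= MIN_CACHE_TOKENS[model_label]


# The same 1,500-token prompt qualifies on one model but not on another.
print(meets_cache_minimum("sonnet-4.6", 1500))
print(meets_cache_minimum("sonnet-4.5", 1500))
```

A check like this only avoids sending markers that would have no effect. The provider still decides whether a cache entry is actually written.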

Cache TTL

The ttl parameter controls how long cached content persists before expiring.

Value	Duration
`"5m"` (default)	5 minutes from last use
`"1h"`	1 hour

{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "1h"
  }
}
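
A client might build the cache_control object with a small helper so that only the two documented ttl values can be sent. This is a sketch under that assumption. The helper name `make_cache_control` and the client-side validation are illustrative and not part of any SDK.

```python
from typing import Optional

# The two ttl values listed in the table above.
ALLOWED_TTLS = {"5m", "1h"}


def make_cache_control(ttl: Optional[str] = None) -> dict:
    """Build a cache_control object; omit ttl to rely on the 5-minute default."""
    control = {"type": "ephemeral"}
    if ttl is not None:
        if ttl not in ALLOWED_TTLS:
            # Hypothetical client-side guard; the provider may validate differently.
            raise ValueError(f"unsupported ttl: {ttl!r}")
        control["ttl"] = ttl
    return control


print(make_cache_control())
print(make_cache_control("1h"))
```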

Example

curl -X POST https://api.orq.ai/v3/router/responses \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "input": [
      {
        "role": "system",
        "content": [
          {
            "type": "input_text",
            "text": "You are a senior legal assistant. The following is our complete contract template library...",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      {
        "role": "user",
        "content": "Summarize clause 7 of the NDA template."
      }
    ]
  }'
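
The same request can be assembled in Python. The sketch below copies the endpoint, model name, and payload shape from the curl example and uses only the standard library. The function name `build_cached_request` is hypothetical, and the function builds the request without sending it.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
ROUTER_URL = "https://api.orq.ai/v3/router/responses"


def build_cached_request(system_text: str, user_text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request whose system prompt is marked cacheable."""
    payload = {
        "model": "anthropic/claude-sonnet-4-6",
        "input": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "input_text",
                        "text": system_text,
                        # Marks this text part as cacheable.
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            },
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` with the key read from the ORQ_API_KEY environment variable. Keep the system text byte-for-byte identical between calls, since caching depends on a stable prefix.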

OpenAI

Prompt caching on OpenAI models is fully automatic. No cache_control markers or other request changes are required. The AI Gateway forwards requests normally; OpenAI caches the prompt prefix on its side and applies the discount transparently.

Caching activates on prompts of 1,024 tokens or longer, and the cached portion grows in 128-token increments from that threshold. The API caches the longest matching prefix from prior requests handled on the same machine.

Cache retention duration is model-dependent and determined by OpenAI. Refer to OpenAI’s prompt caching documentation for the current retention policy per model.

Cache hits are reflected in the response usage object the same way as for Anthropic. See Usage in the response below.
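
To make the 128-token increments concrete, the sketch below estimates how many tokens of a matching prefix could be served from cache. It is an interpretation of the rule described above, not OpenAI's implementation. The helper name `estimate_cached_tokens` is hypothetical, a prefix of exactly 1,024 tokens is assumed to be eligible, and the provider's reported counts may differ.

```python
CACHE_MIN_TOKENS = 1024   # threshold stated above
CACHE_STEP_TOKENS = 128   # increment stated above


def estimate_cached_tokens(matching_prefix_tokens: int) -> int:
    """Round a matching prefix length down to the nearest cacheable size."""
    if matching_prefix_tokens < CACHE_MIN_TOKENS:
        return 0  # below the threshold nothing is cached
    extra_steps = (matching_prefix_tokens - CACHE_MIN_TOKENS) // CACHE_STEP_TOKENS
    return CACHE_MIN_TOKENS + extra_steps * CACHE_STEP_TOKENS


for prefix in (1000, 1024, 1200, 2000):
    print(prefix, estimate_cached_tokens(prefix))
```

Under these assumptions, padding a prompt slightly past a 128-token boundary does not increase the cached portion until the next boundary is reached.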

OpenAI

Set up your OpenAI API key to use GPT models with automatic prompt caching.

Google Gemini

Google Gemini offers two caching modes. Which one applies depends on which model generation you use.

Implicit caching is enabled by default on Gemini 2.0 and newer models. No request changes are needed. The AI Gateway forwards requests normally and Google applies the cache discount automatically when a matching prefix exists. Cached tokens are discounted by up to 90% on Gemini 2.5 models and up to 75% on Gemini 2.0 models. Verify current rates in Google’s pricing documentation. Implicit caching activates at a minimum of 2,048 tokens.

Explicit caching (creating a named cache object and referencing it by ID in subsequent requests) is a separate Google API that is not currently exposed through the AI Gateway.
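
The effect of a cached-token discount on input cost can be worked out with simple arithmetic. The sketch below expresses the billed input as an equivalent number of full-price tokens. The helper name `billed_input_equivalent` is hypothetical, and the discount passed in is the maximum quoted above, so real savings may be lower.

```python
def billed_input_equivalent(prompt_tokens: int, cached_tokens: int, discount: float) -> float:
    """Return input tokens expressed as full-price token equivalents."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    # Cached tokens are charged at (1 - discount) of the normal rate.
    return uncached_tokens + cached_tokens * (1.0 - discount)


# 4,000 prompt tokens, 2,048 of them cached, at the 75% maximum for Gemini 2.0.
print(billed_input_equivalent(4000, 2048, 0.75))
```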

Google AI

Set up your Google AI API key to use Gemini models with implicit prompt caching.

Usage in the response

When a cache hit occurs, the response usage object reflects it under prompt_tokens_details.cached_tokens:

{
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 180,
    "total_tokens": 1380,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}
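
To track how much of each prompt is served from cache, a client could read the field shown above. This is a minimal sketch: the helper name `cached_fraction` is hypothetical, and it assumes that a missing prompt_tokens_details object means no cache hit.

```python
def cached_fraction(usage: dict) -> float:
    """Return the share of prompt tokens that were read from cache."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # Assumption: absent details are treated as zero cached tokens.
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# The usage object from the example response above.
sample_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 180,
    "total_tokens": 1380,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
print(f"{cached_fraction(sample_usage):.1%} of prompt tokens were cached")
```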
