- Reusing long system prompts across many requests to cut input token costs.
- Referencing large documents or codebases without re-sending them every call.
- Multi-turn conversations with a large, stable context that doesn’t change between turns.
- RAG pipelines where the same retrieved context is shared across many user queries.
Overview
Prompt Caching is a provider-level feature that caches prompts so that repeated requests are charged at a reduced rate. This is most effective when your requests share a large, stable prefix:- a long system prompt.
- a reference document.
- a tool definition list. Unlike Response Caching, which serves a stored response for identical requests, Prompt Caching still calls the model on every request, at a reduced cost. Both can be used together.
Anthropic
Prompt caching on Anthropic models requires explicit opt-in viacache_control markers on individual message parts.
Supported models
Claude Haiku, Sonnet, and Opus across all current versions.Enabling caching
Add acache_control object to any message part you want to mark as cacheable:
"ephemeral" is the only supported type. You can place it on:
- System message text parts.
- User message text parts.
- User message images, documents, and files (including PDFs).
- Tool result content.
Minimum token thresholds
Caching only activates once the marked content exceeds a minimum token count. Requests below the threshold are processed normally at full cost.| Model | Minimum tokens |
|---|---|
| Claude Opus 4.6, Opus 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 | 1,024 |
| Claude Haiku 4.5 | 4,096 |
| Claude Haiku 3.5, Haiku 3 | 2,048 |
Cache TTL
Thettl parameter controls how long cached content persists before expiring.
| Value | Duration |
|---|---|
"5m" (default) | 5 minutes from last use |
"1h" | 1 hour |
Example
OpenAI
Prompt caching on OpenAI models is fully automatic. Nocache_control or any request changes are required. The AI Gateway forwards requests normally; OpenAI caches the prompt prefix on its side and applies the discount transparently.
Caching activates on prompts longer than 1,024 tokens, in 128-token increments from that threshold. The API caches the longest matching prefix from prior requests on the same machine.
Cache retention duration is model-dependent and determined by OpenAI. Refer to OpenAI’s prompt caching documentation for the current retention policy per model.
Cache hits are reflected in the response usage object the same way as Anthropic. See Usage in the response below.
OpenAI
Set up your OpenAI API key to use GPT models with automatic prompt caching.
Google Gemini
Google Gemini offers two caching modes. Which one applies depends on which model generation you use. Implicit caching is enabled by default on Gemini 2.0 and newer models. No request changes are needed. The AI Gateway forwards requests normally and Google applies the cache discount automatically when a matching prefix exists. Cached tokens are discounted by up to 90% on Gemini 2.5 models and up to 75% on Gemini 2.0 models. Verify current rates in Google’s pricing documentation. Implicit caching activates at a minimum of 2,048 tokens. Explicit caching (creating a named cache object and referencing it by ID in subsequent requests) is a separate Google API that is not currently exposed through the AI Gateway.Google AI
Set up your Google AI API key to use Gemini models with implicit prompt caching.
Usage in the response
When a cache hit occurs, the responseusage object reflects it under prompt_tokens_details.cached_tokens: