# Prompt caching for reduced token costs

> Cache repeated prompt prefixes at the provider level to cut input token costs and latency on Anthropic, OpenAI, and Google Gemini models.

**Use Cases**

* Reusing long system prompts across many requests to cut input token costs.
* Referencing large documents or codebases without re-sending them every call.
* Running multi-turn conversations over a large context that stays the same between turns.
* RAG pipelines where the same retrieved context is shared across many user queries.

***

## Overview

Prompt Caching is a provider-level feature that caches prompt prefixes so that **repeated requests** reusing the same prefix are charged at a reduced rate for the cached input tokens.

This is most effective when your requests share a **large, stable prefix**:

* A long system prompt.
* A reference document.
* A tool definition list.

Unlike [Response Caching](/docs/ai-studio/ai-gateway/cache), which serves a stored response for identical requests, Prompt Caching still calls the model on every request, at a reduced cost. Both can be used together.

How caching is enabled and what gets cached varies by provider. See the provider sections below.
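
As a minimal sketch of what a "large, stable prefix" means in practice, the snippet below keeps the unchanging content in the first message and the per-request content in the last. The helper name `build_messages` and the `HANDBOOK` text are hypothetical and only serve as illustration.

```python
def build_messages(stable_prefix: str, user_query: str) -> list[dict]:
    """Put the unchanging content first and the per-request content last."""
    return [
        {"role": "system", "content": stable_prefix},
        {"role": "user", "content": user_query},
    ]


# Hypothetical stable prefix shared by every request.
HANDBOOK = "You are a support assistant. Company handbook: ..."

first = build_messages(HANDBOOK, "What is the refund policy?")
second = build_messages(HANDBOOK, "How do I reset my password?")

# The leading message is identical across requests, so a provider can
# match it as a shared prefix; only the final message differs.
print(first[0] == second[0])
```

Anything that changes per request (timestamps, user names, retrieved snippets) is best kept after the stable part, since a change early in the prompt alters the prefix that follows it.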

## Anthropic

Prompt caching on Anthropic models requires explicit opt-in via `cache_control` markers on individual message parts.

### Supported models

Claude Haiku, Sonnet, and Opus across all current versions.

### Enabling caching

Add a `cache_control` object to any message part you want to mark as cacheable:

```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
{
  "cache_control": { "type": "ephemeral" }
}
```

`"ephemeral"` is the only supported type. You can place the `cache_control` marker on:

* System message text parts.
* User message text parts.
* User message images, documents, and files (including PDFs).
* Tool result content.
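
The sketch below illustrates the second placement in the list above: a user message whose large document part is marked as cacheable while the short question is not. It assumes the Chat Completions part shape (`"type": "text"`) used in the example further down; the helper names `cacheable_text` and `user_message_with_document` are hypothetical.

```python
def cacheable_text(text: str) -> dict:
    """Return a text part carrying the cache_control marker."""
    return {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral"},
    }


def user_message_with_document(document: str, question: str) -> dict:
    """Build a user message: cacheable document first, uncached question last."""
    return {
        "role": "user",
        "content": [
            cacheable_text(document),
            {"type": "text", "text": question},
        ],
    }


# Placeholder document text, for illustration only.
message = user_message_with_document("<full contract text>", "Summarize clause 7.")
print(message["content"][0]["cache_control"])
```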

### Minimum token thresholds

Caching only activates once the marked content reaches a minimum token count, which depends on the model. Requests below the threshold are processed normally at full cost.

| Model                                                     | Minimum tokens |
| --------------------------------------------------------- | -------------- |
| Claude Opus 4.6, Opus 4.5                                 | 4,096          |
| Claude Sonnet 4.6                                         | 2,048          |
| Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 | 1,024          |
| Claude Haiku 4.5                                          | 4,096          |
| Claude Haiku 3.5, Haiku 3                                 | 2,048          |
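
A small lookup like the following can flag prompts that are too short to be cached before you send them. This is only a sketch: the keys are the display names from the table rather than API model IDs, the token count is assumed to come from your own tokenizer or estimate, and the threshold is treated as inclusive.

```python
# Minimum cacheable token counts, copied from the table above.
MIN_CACHE_TOKENS = {
    "Claude Opus 4.6": 4096,
    "Claude Opus 4.5": 4096,
    "Claude Sonnet 4.6": 2048,
    "Claude Sonnet 4.5": 1024,
    "Claude Opus 4.1": 1024,
    "Claude Opus 4": 1024,
    "Claude Sonnet 4": 1024,
    "Claude Sonnet 3.7": 1024,
    "Claude Haiku 4.5": 4096,
    "Claude Haiku 3.5": 2048,
    "Claude Haiku 3": 2048,
}


def meets_cache_threshold(model_name: str, token_count: int) -> bool:
    """Return True if token_count reaches the model's minimum (assumed inclusive)."""
    if model_name not in MIN_CACHE_TOKENS:
        raise KeyError(f"No threshold known for {model_name!r}")
    return token_count >= MIN_CACHE_TOKENS[model_name]


print(meets_cache_threshold("Claude Sonnet 4.6", 2500))
print(meets_cache_threshold("Claude Haiku 4.5", 3000))
```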

### Cache TTL

The `ttl` parameter controls how long cached content persists before expiring.

| Value            | Duration                |
| ---------------- | ----------------------- |
| `"5m"` (default) | 5 minutes from last use |
| `"1h"`           | 1 hour                  |

```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "1h"
  }
}
```
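
If you build requests in code, a small helper can keep the marker consistent and reject unsupported durations early. The function name `cache_control` is hypothetical; the two accepted values are the ones listed in the table above, and the sketch omits `ttl` when the default applies.

```python
ALLOWED_TTLS = {"5m", "1h"}


def cache_control(ttl: str = "5m") -> dict:
    """Build a cache_control marker, validating the ttl value."""
    if ttl not in ALLOWED_TTLS:
        raise ValueError(f"Unsupported ttl {ttl!r}; expected one of {sorted(ALLOWED_TTLS)}")
    marker = {"type": "ephemeral"}
    # "5m" is the default, so it does not need to be sent explicitly.
    if ttl != "5m":
        marker["ttl"] = ttl
    return marker


part = {
    "type": "text",
    "text": "Long reference document ...",  # placeholder text
    "cache_control": cache_control("1h"),
}
print(part["cache_control"])
```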

### Example

<CodeGroup>
  ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl -X POST https://api.orq.ai/v3/router/responses \
    -H "Authorization: Bearer $ORQ_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "anthropic/claude-sonnet-4-6",
      "input": [
        {
          "role": "system",
          "content": [
            {
              "type": "input_text",
              "text": "You are a senior legal assistant. The following is our complete contract template library...",
              "cache_control": { "type": "ephemeral" }
            }
          ]
        },
        {
          "role": "user",
          "content": "Summarize clause 7 of the NDA template."
        }
      ]
    }'
  ```

  ```bash cURL (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl -X POST https://api.orq.ai/v3/router/chat/completions \
    -H "Authorization: Bearer $ORQ_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "anthropic/claude-sonnet-4-6",
      "messages": [
        {
          "role": "system",
          "content": [
            {
              "type": "text",
              "text": "You are a senior legal assistant. The following is our complete contract template library...",
              "cache_control": { "type": "ephemeral" }
            }
          ]
        },
        {
          "role": "user",
          "content": "Summarize clause 7 of the NDA template."
        }
      ]
    }'
  ```

  ```typescript TypeScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import OpenAI from "openai";

  const client = new OpenAI({
    apiKey: process.env.ORQ_API_KEY,
    baseURL: "https://api.orq.ai/v3/router",
  });

  const response = await client.responses.create({
    model: "anthropic/claude-sonnet-4-6",
    input: [
      {
        role: "system",
        content: [
          {
            type: "input_text",
            text: "You are a senior legal assistant. The following is our complete contract template library...",
            cache_control: { type: "ephemeral" },
          },
        ],
      },
      { role: "user", content: "Summarize clause 7 of the NDA template." },
    ],
  });

  console.log(response.output_text);
  ```

  ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from openai import OpenAI
  import os

  client = OpenAI(
      api_key=os.environ.get("ORQ_API_KEY"),
      base_url="https://api.orq.ai/v3/router",
  )

  response = client.responses.create(
      model="anthropic/claude-sonnet-4-6",
      input=[
          {
              "role": "system",
              "content": [
                  {
                      "type": "input_text",
                      "text": "You are a senior legal assistant. The following is our complete contract template library...",
                      "cache_control": {"type": "ephemeral"},
                  }
              ],
          },
          {"role": "user", "content": "Summarize clause 7 of the NDA template."},
      ],
  )

  print(response.output_text)
  ```

  ```typescript TypeScript (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import OpenAI from "openai";

  const client = new OpenAI({
    apiKey: process.env.ORQ_API_KEY,
    baseURL: "https://api.orq.ai/v3/router",
  });

  const response = await client.chat.completions.create({
    model: "anthropic/claude-sonnet-4-6",
    messages: [
      {
        role: "system",
        content: [
          {
            type: "text",
            text: "You are a senior legal assistant. The following is our complete contract template library...",
            cache_control: { type: "ephemeral" },
          },
        ],
      },
      { role: "user", content: "Summarize clause 7 of the NDA template." },
    ],
  });

  console.log(response.choices[0].message.content);
  ```

  ```python Python (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from openai import OpenAI
  import os

  client = OpenAI(
      api_key=os.environ.get("ORQ_API_KEY"),
      base_url="https://api.orq.ai/v3/router",
  )

  response = client.chat.completions.create(
      model="anthropic/claude-sonnet-4-6",
      messages=[
          {
              "role": "system",
              "content": [
                  {
                      "type": "text",
                      "text": "You are a senior legal assistant. The following is our complete contract template library...",
                      "cache_control": {"type": "ephemeral"},
                  }
              ],
          },
          {"role": "user", "content": "Summarize clause 7 of the NDA template."},
      ],
  )

  print(response.choices[0].message.content)
  ```
</CodeGroup>

## OpenAI

Prompt caching on OpenAI models is **fully automatic**. No `cache_control` markers or other request changes are required. The router forwards requests normally; OpenAI caches the prompt prefix on its side and applies the discount transparently.

Caching activates on prompts of 1,024 tokens or more, and cached prefixes grow in 128-token increments from that threshold. OpenAI reuses the longest matching prefix from prior requests that were served by the same machine, so a hit is likely but not guaranteed.
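
The increment rule can be expressed as simple arithmetic. The sketch below computes the largest prefix that could be served from cache for a given prompt length, assuming the 1,024-token threshold is inclusive; the function name is hypothetical, and the result is an upper bound, since the actual number of cached tokens depends on how much of the prefix matches an earlier request.

```python
def max_cacheable_prefix(prompt_tokens: int, minimum: int = 1024, step: int = 128) -> int:
    """Largest cacheable prefix length: minimum plus whole 128-token steps."""
    if prompt_tokens < minimum:
        return 0  # below the threshold nothing is cached
    return minimum + ((prompt_tokens - minimum) // step) * step


for n in (1000, 1024, 1200, 2000):
    print(n, max_cacheable_prefix(n))
# 1000 -> 0, 1024 -> 1024, 1200 -> 1152, 2000 -> 1920
```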

Cache retention duration is model-dependent and determined by OpenAI. Refer to [OpenAI's prompt caching documentation](https://platform.openai.com/docs/guides/prompt-caching) for the current retention policy per model.

Cache hits are reflected in the response `usage` object the same way as for Anthropic models. See [Usage in the response](#usage-in-the-response) below.

<Card title="OpenAI" icon="openai" href="/docs/ai-studio/integrations/providers/openai" horizontal>
  Set up your OpenAI API key to use GPT models with automatic prompt caching.
</Card>

## Google Gemini

Google Gemini offers two caching modes. Which one applies depends on which model generation you use.

**Implicit caching** is enabled by default on Gemini 2.0 and newer models. No request changes are needed. The router forwards requests normally and Google applies the cache discount automatically when a matching prefix exists. Cached tokens are discounted by up to 90% on Gemini 2.5 models and up to 75% on Gemini 2.0 models. Verify current rates in [Google's pricing documentation](https://ai.google.dev/pricing).

Implicit caching activates at a minimum of 2,048 tokens.

**Explicit caching** (creating a named cache object and referencing it by ID in subsequent requests) is a separate Google API that is not currently exposed through the **AI Router**.

<Card title="Google AI" icon="https://mintcdn.com/orqai/d-t0Z04KwFlGVsS1/images/logos/google_ai_studio.svg?fit=max&auto=format&n=d-t0Z04KwFlGVsS1&q=85&s=eac05c3f32c81d329e7645eed547f5c0" href="/docs/ai-studio/integrations/providers/google-ai" horizontal width="48" height="48" data-path="images/logos/google_ai_studio.svg">
  Set up your Google AI API key to use Gemini models with implicit prompt caching.
</Card>

## Usage in the response

When a cache hit occurs, the response `usage` object reflects it under `prompt_tokens_details.cached_tokens`:

```json theme={"theme":{"light":"github-light","dark":"github-dark"}}
{
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 180,
    "total_tokens": 1380,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}
```
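
To track how well caching is working, you can compute the share of prompt tokens served from cache. The helper below is a sketch that reads a `usage` dictionary shaped like the JSON above; the name `cached_fraction` is hypothetical, and missing fields are treated as zero.

```python
def cached_fraction(usage: dict) -> float:
    """Return cached_tokens / prompt_tokens, or 0.0 when no prompt tokens are reported."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if not prompt_tokens:
        return 0.0
    return cached_tokens / prompt_tokens


# Values taken from the example response above.
usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 180,
    "total_tokens": 1380,
    "prompt_tokens_details": {"cached_tokens": 1024},
}

print(f"{cached_fraction(usage):.1%}")  # 85.3%
```

A value of `0.0` on a request you expected to hit the cache usually means the prefix changed, the cache expired, or the prompt was below the provider's minimum token threshold.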