# LLM response caching

> Cache responses to identical LLM requests to cut latency by roughly 95% on cache hits and reduce API costs. Configure exact-match caching and a TTL to speed up repeated queries.

**Use Cases**

* Eliminating redundant costs on repeated identical queries (FAQs, product lookups).
* Speeding up development and test loops by caching fixture requests.
* Serving the same prompt to many concurrent users without paying per call.
* Reducing tail latency on frequently-called endpoints.

***

## Quick Start

Add a `cache` object to the request body to enable caching. The examples below use exact-match caching with a one-hour TTL (`3600` seconds).

<CodeGroup>
  ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl -X POST https://api.orq.ai/v3/router/chat/completions \
    -H "Authorization: Bearer $ORQ_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "openai/gpt-4o",
      "messages": [{ "role": "user", "content": "Explain renewable energy" }],
      "cache": { "type": "exact_match", "ttl": 3600 }
    }'
  ```

  ```typescript TypeScript (Responses) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import OpenAI from "openai";

  const client = new OpenAI({
    apiKey: process.env.ORQ_API_KEY,
    baseURL: "https://api.orq.ai/v3/router",
  });

  const response = await client.responses.create({
    model: "openai/gpt-4o",
    input: "Explain renewable energy",
    // @ts-expect-error: `cache` is an Orq router parameter that is not in the OpenAI SDK types
    cache: {
      type: "exact_match",
      ttl: 3600,
    },
  });

  console.log(response.output_text);
  ```

  ```typescript TypeScript (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import OpenAI from "openai";

  const client = new OpenAI({
    apiKey: process.env.ORQ_API_KEY,
    baseURL: "https://api.orq.ai/v3/router",
  });

  const response = await client.chat.completions.create({
    model: "openai/gpt-4o",
    messages: [{ role: "user", content: "Explain renewable energy" }],
    // @ts-expect-error: `cache` is an Orq router parameter that is not in the OpenAI SDK types
    cache: {
      type: "exact_match",
      ttl: 3600,
    },
  });

  console.log(response.choices[0].message.content);
  ```
</CodeGroup>

## Configuration

| Parameter | Type            | Required | Description                                                      | Example         |
| --------- | --------------- | -------- | ---------------------------------------------------------------- | --------------- |
| `type`    | `"exact_match"` | Yes      | Cache strategy; `"exact_match"` is the only supported value      | `"exact_match"` |
| `ttl`     | number          | No       | Cache expiration in seconds (default: 1800, min: 1, max: 259200) | `3600`          |

**Cache Key**: Generated from the model, the input, and all other request parameters. Only requests that match on every one of these share a key and can be served from the cache.
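The exact key derivation is not documented on this page. To illustrate why any difference in a request produces a different key, here is a minimal sketch that hashes a canonical JSON encoding of the request. The `cache_key` function, the sorted-key JSON encoding, and the use of SHA-256 are assumptions for illustration only, not Orq's actual implementation.

```python
import hashlib
import json


def cache_key(model: str, messages: list, **params) -> str:
    """Hypothetical exact-match key: SHA-256 over a canonical JSON request."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,            # key order must not affect the hash
        separators=(",", ":"),     # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = cache_key("openai/gpt-4o", [{"role": "user", "content": "Hello"}], temperature=0)
b = cache_key("openai/gpt-4o", [{"role": "user", "content": "Hello"}], temperature=0)
c = cache_key("openai/gpt-4o", [{"role": "user", "content": "hello"}], temperature=0)
d = cache_key("openai/gpt-4o", [{"role": "user", "content": "Hello"}], temperature=0.2)

print(a == b)  # True: identical requests share a key
print(a == c)  # False: "Hello" and "hello" differ
print(a == d)  # False: a different temperature
```

Under this model, a change as small as one capital letter or one sampling parameter is a cache miss, which matches the behavior described in the Limitations section.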

## TTL Recommendations

| Use Case            | TTL (seconds)  | Reason                  |
| ------------------- | -------------- | ----------------------- |
| FAQ responses       | `86400` (24h)  | Static content          |
| Content generation  | `3600` (1h)    | Moderate freshness      |
| Development/testing | `300` (5min)   | Rapid iteration         |
| Data analysis       | `1800` (30min) | Balance speed/freshness |
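The recommendations above can be kept in one place so that every call site uses a consistent TTL. The following is a small sketch: the preset names and the `cache_config` helper are hypothetical, while the default and the minimum/maximum limits are the values stated in the Configuration and Limitations sections.

```python
from typing import Optional

# Limits as documented on this page; the preset keys are illustrative names.
DEFAULT_TTL = 1800
MIN_TTL = 1
MAX_TTL = 259200  # 3 days

TTL_PRESETS = {
    "faq": 86400,                # 24h, static content
    "content_generation": 3600,  # 1h, moderate freshness
    "development": 300,          # 5min, rapid iteration
    "data_analysis": 1800,       # 30min, balance speed and freshness
}


def cache_config(use_case: Optional[str] = None, ttl: Optional[int] = None) -> dict:
    """Build a `cache` object, clamping the TTL to the documented range."""
    if ttl is None:
        ttl = TTL_PRESETS.get(use_case, DEFAULT_TTL)
    ttl = max(MIN_TTL, min(int(ttl), MAX_TTL))
    return {"type": "exact_match", "ttl": ttl}


print(cache_config("faq"))       # {'type': 'exact_match', 'ttl': 86400}
print(cache_config(ttl=10**6))   # {'type': 'exact_match', 'ttl': 259200}
```

With the Python SDK, the result can be passed as `extra_body={"cache": cache_config("faq")}`.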

## Code Examples

<Note>
  The examples below use the Chat Completions endpoint. The same `cache` parameter applies to the Responses API: call `responses.create(...)` instead of `chat.completions.create(...)` and pass `input` in place of `messages`, as in the Quick Start.
</Note>

<CodeGroup>
  ```bash cURL (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl -X POST https://api.orq.ai/v3/router/chat/completions \
    -H "Authorization: Bearer $ORQ_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "openai/gpt-4o",
      "messages": [
        {
          "role": "user",
          "content": "Explain the benefits of renewable energy for businesses"
        }
      ],
      "cache": {
        "type": "exact_match",
        "ttl": 3600
      }
    }'
  ```

  ```python Python (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import os

  from openai import OpenAI

  client = OpenAI(
      api_key=os.environ.get("ORQ_API_KEY"),
      base_url="https://api.orq.ai/v3/router",
  )

  response = client.chat.completions.create(
      model="openai/gpt-4o",
      messages=[
          {
              "role": "user",
              "content": "Explain the benefits of renewable energy for businesses"
          }
      ],
      extra_body={
          "cache": {
              "type": "exact_match",
              "ttl": 3600
          }
      }
  )
  ```

  ```typescript TypeScript (Chat Completions) theme={"theme":{"light":"github-light","dark":"github-dark"}}
  import OpenAI from "openai";

  const client = new OpenAI({
    apiKey: process.env.ORQ_API_KEY,
    baseURL: "https://api.orq.ai/v3/router",
  });

  const response = await client.chat.completions.create({
    model: "openai/gpt-4o",
    messages: [
      {
        role: "user",
        content: "Explain the benefits of renewable energy for businesses",
      },
    ],
    // @ts-expect-error: `cache` is an Orq router parameter that is not in the OpenAI SDK types
    cache: {
      type: "exact_match",
      ttl: 3600,
    },
  });
  ```
</CodeGroup>

## Troubleshooting

**Low cache hit rate**

* Ensure every request parameter (temperature, max\_tokens, etc.) is identical across requests.
* Check that the TTL isn't too short for your use case.
* Verify that request content is character-for-character identical; matching is case-sensitive.
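One common source of accidental misses is incidental whitespace in user-supplied text. A minimal client-side sketch that normalizes message content before sending is shown below. The `normalize_prompt` and `normalize_messages` helpers are hypothetical, and this assumes whitespace is not significant in your prompts (it should not be applied to code or preformatted input). Case is deliberately left untouched, because lowercasing changes what the model sees.

```python
import re


def normalize_prompt(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_messages(messages: list) -> list:
    """Return a copy of the messages with string content normalized."""
    normalized = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            # Copy the message so the caller's list is not mutated.
            message = {**message, "content": normalize_prompt(content)}
        normalized.append(message)
    return normalized


messages = [{"role": "user", "content": "  Explain   renewable\nenergy "}]
print(normalize_messages(messages))
# [{'role': 'user', 'content': 'Explain renewable energy'}]
```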

**Cache not working**

* Confirm `type: "exact_match"` is specified.
* Check response headers for cache status.

**Performance issues**

* Use a shorter TTL for dynamic content.
* Consider warming the cache for predictable requests.
* Monitor cache hit/miss ratios.
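Cache warming amounts to sending each predictable request once before real traffic arrives. The sketch below builds the request bodies and hands them to a caller-supplied `send` function. Both `warm_cache` and `send` are hypothetical names: `send` stands in for whatever posts a body to the chat completions endpoint in your code. Warming only helps if the warmed body matches production requests exactly, including the model and every parameter.

```python
from typing import Callable, Iterable


def warm_cache(
    prompts: Iterable[str],
    send: Callable[[dict], object],
    model: str = "openai/gpt-4o",
    ttl: int = 3600,
) -> int:
    """Send each distinct prompt once with caching enabled; return the count sent."""
    sent = 0
    # dict.fromkeys removes duplicate prompts while preserving order.
    for prompt in dict.fromkeys(prompts):
        send({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "cache": {"type": "exact_match", "ttl": ttl},
        })
        sent += 1
    return sent


# Stand-in for a real HTTP call: record the bodies instead of sending them.
recorded = []
count = warm_cache(
    ["What are your opening hours?", "How do I reset my password?", "What are your opening hours?"],
    recorded.append,
)
print(count)  # 2
```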

## Limitations

* **Exact match only**: Any parameter change creates a new cache key.
* **Case-sensitive**: "Hello" and "hello" produce different cache keys.
* **No semantic matching**: Similar but not identical requests won't match.
* **Storage limits**: Very large responses consume more cache space.
* **TTL constraints**: Minimum 1 second, maximum 259200 seconds (3 days).

## Best Practices

* Set TTL based on content freshness requirements.
* Use cache for repeated, deterministic requests.
* Monitor cache hit rates to optimize TTL values.
* Avoid caching personalized or time-sensitive content.
* Test cache behavior in development before production.
