LLM response caching - Orq.ai Documentation

Use Cases

Eliminating redundant costs on repeated identical queries (FAQs, product lookups).
Speeding up development and test loops by caching fixture requests.
Serving the same prompt to many concurrent users without paying per call.
Reducing tail latency on frequently-called endpoints.

Quick Start

Cache identical requests to reduce latency by ~95% and save costs.

curl -X POST https://api.orq.ai/v3/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{ "role": "user", "content": "Explain renewable energy" }],
    "cache": { "type": "exact_match", "ttl": 3600 }
  }'

Configuration

Parameter	Type	Required	Description	Example
`type`	`"exact_match"`	Yes	Only supported cache type	`"exact_match"`
`ttl`	number	No	Cache expiration in seconds (default: 1800, max: 259200)	`3600`

Cache Key: Generated from model + input + all parameters. Identical requests share the same key.

TTL Recommendations

Use Case	TTL (seconds)	Reason
FAQ responses	`86400` (24h)	Static content
Content generation	`3600` (1h)	Moderate freshness
Development/testing	`300` (5min)	Rapid iteration
Data analysis	`1800` (30min)	Balance speed/freshness

Code examples

The examples below use the Chat Completions endpoint. The same cache parameter applies to the Responses API: replace chat.completions.create(...) with responses.create(...).

curl -X POST https://api.orq.ai/v3/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of renewable energy for businesses"
      }
    ],
    "cache": {
      "type": "exact_match",
      "ttl": 3600
    }
  }'

Troubleshooting

Low cache hit rate

Ensure identical parameters (temperature, max_tokens, etc.).
Check TTL isn’t too short for your use case.
Verify requests are truly identical (case-sensitive).

Cache not working

Confirm type: "exact_match" is specified.
Check response headers for cache status.

Performance issues

Use shorter TTL for dynamic content.
Consider cache warming for predictable requests.
Monitor cache hit/miss ratios.

Limitations

Exact match only: Any parameter change creates new cache key.
Case sensitive: “Hello” and “hello” are different cache keys.
No semantic matching: Similar but not identical requests won’t match.
Storage limits: Very large responses consume more cache space.
TTL constraints: Minimum 1 second, maximum 259200 seconds (3 days).

Best Practices

Set TTL based on content freshness requirements.
Use cache for repeated, deterministic requests.
Monitor cache hit rates to optimize TTL values.
Avoid caching personalized or time-sensitive content.
Test cache behavior in development before production.

​Quick Start

​Configuration

​TTL Recommendations

​Code examples

​Troubleshooting

​Limitations

​Best Practices

Quick Start

Configuration

TTL Recommendations

Code examples

Troubleshooting

Limitations

Best Practices