This page describes features that extend the AI Gateway, which provides a unified API for accessing multiple AI providers. To learn more, see AI Gateway.
Quick Start
Cache identical requests to reduce latency by ~95% and save costs.

Configuration
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| type | "exact_match" | Yes | The only supported cache type | "exact_match" |
| ttl | number | Yes | Cache expiration time in seconds | 3600 |
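Assembled into a request body, the two parameters might look like the following. This is a hedged sketch: the exact field names and nesting depend on your gateway's API, and the model and message values are placeholders.

```python
# Hypothetical request body with caching enabled. The "cache" object's
# fields follow the parameter table above; everything else is placeholder.
request_body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "What is your refund policy?"}],
    "cache": {
        "type": "exact_match",  # required: the only supported cache type
        "ttl": 3600,            # required: expiration in seconds (1 hour)
    },
}
```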
TTL Recommendations
| Use Case | TTL (seconds) | Reason |
|---|---|---|
| FAQ responses | 86400 (24h) | Static content |
| Content generation | 3600 (1h) | Moderate freshness |
| Development/testing | 300 (5min) | Rapid iteration |
| Data analysis | 1800 (30min) | Balance speed/freshness |
Code Examples
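The gateway performs caching server-side, but the mechanics can be sketched with a small in-memory model. This is illustrative only: ExactMatchCache and its methods are hypothetical, not a gateway API, and the key derivation is an assumption about how exact matching behaves.

```python
import hashlib
import json
import time

class ExactMatchCache:
    """Minimal in-memory sketch of exact-match response caching with TTL.

    Illustrative only: the real gateway caches server-side.
    """

    def __init__(self):
        self._store = {}

    def _key(self, request_body: dict) -> str:
        # Exact match: the key covers every parameter byte-for-byte, so any
        # change at all (even "Hello" vs "hello") produces a new key.
        canonical = json.dumps(request_body, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, request_body: dict):
        entry = self._store.get(self._key(request_body))
        if entry is None:
            return None  # miss: never cached
        response, expires_at = entry
        if time.time() >= expires_at:
            return None  # miss: TTL expired
        return response

    def put(self, request_body: dict, response, ttl: int):
        # Store the response alongside its absolute expiration time.
        self._store[self._key(request_body)] = (response, time.time() + ttl)

cache = ExactMatchCache()
body = {"messages": [{"role": "user", "content": "Hi"}], "temperature": 0}
cache.put(body, response="Hello!", ttl=3600)
hit = cache.get(body)                        # identical request: hit
miss = cache.get({**body, "temperature": 1})  # changed parameter: miss
```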
Troubleshooting
**Low cache hit rate**

- Ensure identical parameters (temperature, max_tokens, etc.)
- Check the TTL isn't too short for your use case
- Verify requests are truly identical (matching is case-sensitive)

**Cache not working**

- Confirm type: "exact_match" is specified
- Verify TTL is set (a required parameter)
- Check response headers for cache status

**Optimizing cache performance**

- Use a shorter TTL for dynamic content
- Consider cache warming for predictable requests
- Monitor cache hit/miss ratios
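To act on the last point, a simple counter can track hit/miss ratios over time. This is a sketch with a hypothetical CacheHitMonitor class; in practice you would record a hit or miss based on the cache-status response headers mentioned above.

```python
class CacheHitMonitor:
    """Tracks cache hits and misses so the hit rate can guide TTL tuning."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, was_hit: bool) -> None:
        if was_hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        # Fraction of requests served from cache (0.0 when nothing recorded).
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

monitor = CacheHitMonitor()
for was_hit in (True, True, True, False):
    monitor.record(was_hit)
rate = monitor.hit_rate  # 3 hits out of 4 requests -> 0.75
```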
Limitations
- Exact match only: Any parameter change creates new cache key
- Case sensitive: “Hello” and “hello” are different cache keys
- No semantic matching: Similar but not identical requests won’t match
- Storage limits: Very large responses consume more cache space
- TTL constraints: Minimum 1 second, maximum 86400 seconds (24 hours)
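The first two limitations follow directly from how an exact-match key can be derived. Sketching the key as a hash of the canonicalized request (an assumption about the mechanism, not the gateway's actual implementation) makes them concrete:

```python
import hashlib
import json

def cache_key(body: dict) -> str:
    # Hypothetical exact-match key: a hash over every byte of the request,
    # so any change at all yields a different key.
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

key_a = cache_key({"messages": [{"role": "user", "content": "Hello"}]})
key_b = cache_key({"messages": [{"role": "user", "content": "hello"}]})
same = key_a == key_b  # False: "Hello" and "hello" are different cache keys
```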
Best Practices
- Set TTL based on content freshness requirements
- Use cache for repeated, deterministic requests
- Monitor cache hit rates to optimize TTL values
- Avoid caching personalized or time-sensitive content
- Test cache behavior in development before production
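For repeated, deterministic requests such as FAQ responses, cache warming can be sketched as pre-issuing the exact bodies production traffic will send. Here warm_cache and send_gateway_request are hypothetical stand-ins for whatever client you use to call the gateway.

```python
# Predictable prompts worth warming; placeholders for illustration.
FAQ_PROMPTS = [
    "What is your refund policy?",
    "How do I reset my password?",
]

def warm_cache(send_gateway_request) -> int:
    """Pre-issue predictable requests so later user traffic hits the cache.

    `send_gateway_request` is a stand-in for a real gateway client call.
    Returns the number of requests issued.
    """
    for prompt in FAQ_PROMPTS:
        # The body must be byte-identical to production traffic; any
        # difference (case, parameters) would miss the exact-match cache.
        send_gateway_request({
            "messages": [{"role": "user", "content": prompt}],
            "cache": {"type": "exact_match", "ttl": 86400},
        })
    return len(FAQ_PROMPTS)

sent = []
issued = warm_cache(sent.append)  # collect bodies instead of calling a gateway
```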