Load Balancing
This page describes features extending the AI Proxy, which provides a unified API for accessing multiple AI providers. To learn more, see AI Proxy.
Quick Start
Distribute requests across multiple providers using weighted routing.
const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: {
    load_balancer: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| load_balancer | Array | Yes | List of models with weights |
| model | string | Yes | Model identifier |
| weight | number | Yes | Relative weight (0.0 - 1.0) |
Weight Calculation:
- Weights are normalized: [0.4, 0.8] → [33%, 67%] (see the sketch after this list)
- Higher weight = more traffic
- Minimum weight: 0.1 (10%)
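Normalization is plain proportional scaling. A minimal sketch of the arithmetic, purely illustrative and not the proxy's internal code:

// Illustrative only: proportional normalization of load-balancer weights.
const normalize = (weights: number[]): number[] => {
  const total = weights.reduce((sum, w) => sum + w, 0);
  return weights.map((w) => w / total);
};

normalize([0.4, 0.8]); // => [0.333..., 0.666...], i.e. ~33% / ~67%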
Common Patterns
// Equal distribution
load_balancer: [
  { model: "openai/gpt-4o", weight: 1.0 },
  { model: "anthropic/claude-3", weight: 1.0 },
];

// Cost optimization (cheap model primary)
load_balancer: [
  { model: "openai/gpt-3.5-turbo", weight: 0.8 },
  { model: "openai/gpt-4o", weight: 0.2 },
];

// A/B testing
load_balancer: [
  { model: "current-model", weight: 0.9 },
  { model: "experimental-model", weight: 0.1 },
];

// Multi-provider redundancy
load_balancer: [
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
];
Use Cases
| Scenario | Weight Strategy | Example |
|---|---|---|
| Cost optimization | Heavy on cheaper models | 80% GPT-3.5, 20% GPT-4 |
| Performance testing | Small traffic to new model | 95% current, 5% experimental |
| Provider redundancy | Split across providers | 60% OpenAI, 40% Anthropic |
| Capacity management | Distribute during peaks | Even split across models |
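To act on these strategies, especially A/B testing, you need to know which model actually served each request. Assuming the proxy reports the selected model in the response's standard model field (an assumption; verify against your own response payloads), a minimal sketch:

// Sketch: attribute each response to the load-balanced model that served it.
// Assumes `response.model` reflects the model the proxy selected.
const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: {
    load_balancer: [
      { model: "current-model", weight: 0.9 },
      { model: "experimental-model", weight: 0.1 },
    ],
  },
});
console.log("Served by:", response.model); // tally results per variant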
Code examples
curl -X POST https://api.orq.ai/v2/proxy/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
      }
    ],
    "orq": {
      "load_balancer": [
        { "model": "openai/gpt-3.5-turbo", "weight": 0.4 },
        { "model": "anthropic/claude-3-haiku-20240307", "weight": 0.8 }
      ]
    }
  }'
from openai import OpenAI
import os

openai = OpenAI(
    api_key=os.environ.get("ORQ_API_KEY"),
    base_url="https://api.orq.ai/v2/proxy",
)

response = openai.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a creative marketing slogan for an eco-friendly coffee brand",
        }
    ],
    extra_body={
        "orq": {
            "load_balancer": [
                {"model": "openai/gpt-3.5-turbo", "weight": 0.4},
                {"model": "anthropic/claude-3-haiku-20240307", "weight": 0.8},
            ]
        }
    },
)
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: "Write a creative marketing slogan for an eco-friendly coffee brand",
    },
  ],
  orq: {
    load_balancer: [
      { model: "openai/gpt-3.5-turbo", weight: 0.4 },
      { model: "anthropic/claude-3-haiku-20240307", weight: 0.8 },
    ],
  },
});
Monitoring
Track these metrics for optimal load balancing:
// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};
Key Metrics:
- Traffic distribution: actual vs. expected percentages (see the sketch after this list)
- Cost per model: monitor spending across providers
- Response times: compare latency by model
- Error rates: track failures by provider
Troubleshooting
Uneven distribution
- Check that weights are normalized correctly
- Verify sufficient request volume (at least 100 requests for a meaningful sample)
- Monitor over longer time periods
Unexpected costs
- Track actual vs expected cost distribution
- Monitor for expensive model overuse
- Set up cost alerts per provider
Performance issues
- Check latency differences between models
- Monitor for provider-specific slowdowns
- Adjust weights based on performance data
Limitations
- Probabilistic routing: Short-term traffic may not match exact weights
- Minimum volume needed: Requires sufficient requests for statistical accuracy
- Response variations: Different models may return varying output quality
- Cost complexity: Managing billing across multiple providers
- Provider dependencies: Requires API access to all models
Advanced Usage
Environment-specific weights:
const weights = {
  development: [
    { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
  ],
  production: [
    { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
    { model: "anthropic/claude-3", weight: 0.3 }, // Backup
  ],
};
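Selecting a set at request time could look like the following; using NODE_ENV as the switch is an assumption, so substitute whatever environment flag your app already uses:

// Pick the weight set for the current environment, defaulting to development.
const env = process.env.NODE_ENV === "production" ? "production" : "development";

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: { load_balancer: weights[env] },
});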
Dynamic weight adjustment:
// Adjust weights based on observed performance.
// `calculateWeight` is a scoring function you supply (sketched below).
const adjustWeights = (models) =>
  models.map((model) => ({
    model: model.name,
    weight: calculateWeight(model.latency, model.cost, model.quality),
  }));
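One possible calculateWeight heuristic, purely illustrative: favor high quality, penalize latency and cost; the scaling constants are arbitrary.

// Illustrative scoring only: higher quality and lower latency/cost yield a
// larger raw weight; weights are then normalized as described above.
const calculateWeight = (latencyMs: number, costPer1k: number, quality: number): number =>
  quality / (1 + latencyMs / 1000 + costPer1k);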
With other features:
{
  orq: {
    load_balancer: [
      { model: "openai/gpt-4o", weight: 0.6 },
      { model: "anthropic/claude-3", weight: 0.4 },
    ],
    retries: { count: 2, on_codes: [429] },
    timeout: { call_timeout: 15000 },
  },
}