Load Balancing

📖 This page describes features extending the AI Proxy, which provides a unified API for accessing multiple AI providers. To learn more, see AI Proxy.

Quick Start

Distribute requests across multiple providers using weighted routing.

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: {
    load_balancer: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});

Configuration

Parameter      Type    Required  Description
load_balancer  Array   Yes       List of models with weights
model          string  Yes       Model identifier
weight         number  Yes       Relative weight (0.0–1.0)

Weight Calculation:

  • Weights are normalized: [0.4, 0.8] → [33%, 67%] (see the sketch after this list)
  • Higher weight = more traffic
  • Minimum weight: 0.1 (10%)
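
Normalization is just each weight divided by the sum of all weights. The proxy does this server-side; the arithmetic, as a quick sketch:

// Sketch of weight normalization: [0.4, 0.8] becomes roughly [33%, 67%].
const weights = [0.4, 0.8];
const total = weights.reduce((sum, w) => sum + w, 0); // 1.2
const shares = weights.map((w) => w / total); // [0.333..., 0.666...]
console.log(shares.map((s) => `${Math.round(s * 100)}%`)); // ["33%", "67%"]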

Common Patterns

// Equal distribution
load_balancer: [
  { model: "openai/gpt-4o", weight: 1.0 },
  { model: "anthropic/claude-3", weight: 1.0 },
];

// Cost optimization (cheap model primary)
load_balancer: [
  { model: "openai/gpt-3.5-turbo", weight: 0.8 },
  { model: "openai/gpt-4o", weight: 0.2 },
];

// A/B testing
load_balancer: [
  { model: "current-model", weight: 0.9 },
  { model: "experimental-model", weight: 0.1 },
];

// Multi-provider redundancy
load_balancer: [
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
];

Use Cases

Scenario             Weight Strategy             Example
Cost optimization    Heavy on cheaper models     80% GPT-3.5, 20% GPT-4
Performance testing  Small traffic to new model  95% current, 5% experimental
Provider redundancy  Split across providers      60% OpenAI, 40% Anthropic
Capacity management  Distribute during peaks     Even split across models

Code examples

cURL

curl -X POST https://api.orq.ai/v2/proxy/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
      }
    ],
    "orq": {
      "load_balancer": [
        {
          "model": "openai/gpt-3.5-turbo",
          "weight": 0.4
        },
        {
          "model": "anthropic/claude-3-haiku-20240307",
          "weight": 0.8
        }
      ]
    }
  }'

Python

from openai import OpenAI
import os

openai = OpenAI(
  api_key=os.environ.get("ORQ_API_KEY"),
  base_url="https://api.orq.ai/v2/proxy"
)

response = openai.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
        }
    ],
    extra_body={
        "orq": {
            "load_balancer": [
                {
                    "model": "openai/gpt-3.5-turbo",
                    "weight": 0.4
                },
                {
                    "model": "anthropic/claude-3-haiku-20240307",
                    "weight": 0.8
                }
            ]
        }
    }
)

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: "Write a creative marketing slogan for an eco-friendly coffee brand",
    },
  ],
  orq: {
    load_balancer: [
      {
        model: "openai/gpt-3.5-turbo",
        weight: 0.4,
      },
      {
        model: "anthropic/claude-3-haiku-20240307",
        weight: 0.8,
      },
    ],
  },
});

Monitoring

Track these metrics for optimal load balancing:

// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};

Key Metrics:

  • Traffic distribution: Actual vs expected percentages (compared in the sketch below)
  • Cost per model: Monitor spending across providers
  • Response times: Compare latency by model
  • Error rates: Track failures by provider
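
A minimal client-side sketch of the first check, assuming you collect requestsByModel as above (trafficReport is a hypothetical helper, not an Orq API):

// Compare observed traffic share per model against the configured weights.
const expected = {
  "openai/gpt-3.5-turbo": 0.7,
  "anthropic/claude-3-haiku": 0.3,
};

const trafficReport = (requestsByModel) => {
  const total = Object.values(requestsByModel).reduce((s, n) => s + n, 0);
  return Object.entries(requestsByModel).map(([model, count]) => ({
    model,
    actual: count / total,
    expected: expected[model] ?? 0,
    drift: count / total - (expected[model] ?? 0),
  }));
};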

Troubleshooting

Uneven distribution

  • Check if weights are normalized correctly
  • Verify sufficient request volume (at least ~100 requests for the split to be meaningful)
  • Monitor over longer time periods

Unexpected costs

  • Track actual vs expected cost distribution
  • Monitor for expensive model overuse
  • Set up cost alerts per provider (see the sketch after this list)

Performance issues

  • Check latency differences between models
  • Monitor for provider-specific slowdowns
  • Adjust weights based on performance data
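
For the cost alerts above, a minimal client-side check might look like this (the budget figures and costsByModel values are placeholders):

// Hypothetical per-provider budget check using the costsByModel metric above.
const budgets = { "openai/gpt-4o": 50, "anthropic/claude-3": 25 }; // USD per day
const costsByModel = { "openai/gpt-4o": 62.1, "anthropic/claude-3": 8.4 };

for (const [model, cost] of Object.entries(costsByModel)) {
  if (cost > (budgets[model] ?? Infinity)) {
    console.warn(`Cost alert: ${model} spent $${cost.toFixed(2)} today`);
  }
}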

Limitations

  • Probabilistic routing: Short-term traffic may not match the configured weights exactly (see the simulation sketch below)
  • Minimum volume needed: Requires sufficient requests for statistical accuracy
  • Response variations: Different models may return varying output quality
  • Cost complexity: Managing billing across multiple providers
  • Provider dependencies: Requires API access to all models
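
Orq's internal selection algorithm isn't documented here, but weighted random routing generally behaves like this sketch: at low volume, the observed split can drift several points from the configured 70/30.

// Generic weighted-random selection (illustrative; not Orq's internal code).
const pick = (models) => {
  let r = Math.random() * models.reduce((s, m) => s + m.weight, 0);
  for (const m of models) {
    r -= m.weight;
    if (r <= 0) return m.model;
  }
  return models[models.length - 1].model;
};

const pool = [
  { model: "openai/gpt-3.5-turbo", weight: 0.7 },
  { model: "anthropic/claude-3-haiku", weight: 0.3 },
];
const counts = {};
for (let i = 0; i < 100; i++) {
  const choice = pick(pool);
  counts[choice] = (counts[choice] ?? 0) + 1;
}
console.log(counts); // e.g. { "openai/gpt-3.5-turbo": 74, "anthropic/claude-3-haiku": 26 }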

Advanced Usage

Environment-specific weights:

const weights = {
  development: [
    { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
  ],
  production: [
    { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
    { model: "anthropic/claude-3", weight: 0.3 }, // Backup
  ],
};
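
Selecting the active set at runtime could then be as simple as (assuming a NODE_ENV-style switch):

// Pick the weight set for the current environment.
const env = process.env.NODE_ENV === "production" ? "production" : "development";
const activeWeights = weights[env];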

Dynamic weight adjustment:

// Adjust weights based on observed performance.
// Example heuristic (illustrative only): favor low latency and low cost,
// reward quality, and clamp to the documented 0.1–1.0 range.
const calculateWeight = (latency, cost, quality) =>
  Math.min(1, Math.max(0.1, quality / (latency * cost)));

const adjustWeights = (models) =>
  models.map((m) => ({
    model: m.name,
    weight: calculateWeight(m.latency, m.cost, m.quality),
  }));
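
The returned array can then be passed as orq.load_balancer on subsequent requests.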

With other features:

{
  orq: {
    load_balancer: [
      { model: "openai/gpt-4o", weight: 0.6 },
      { model: "anthropic/claude-3", weight: 0.4 },
    ],
    retries: { count: 2, on_codes: [429] },
    timeout: { call_timeout: 15000 },
  },
}