Load Balancing

📖 This page describes features extending the AI Proxy, which provides a unified API for accessing multiple AI providers. To learn more, see AI Proxy.

Quick Start

Distribute requests across multiple providers using weighted routing.

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: {
    load_balancer: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});

Configuration

Parameter      Type    Required  Description
load_balancer  Array   Yes       List of models with weights
model          string  Yes       Model identifier
weight         number  Yes       Relative weight (0.0–1.0)

Weight Calculation:

  • Weights are normalized: [0.4, 0.8] → [33%, 67%] (see the sketch after this list)
  • Higher weight = more traffic
  • Minimum weight: 0.1 (10%)
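
Normalization is just each weight divided by the sum of all weights. The proxy does this server-side; the arithmetic, as a quick sketch:

// Sketch of weight normalization: [0.4, 0.8] becomes roughly [33%, 67%].
const weights = [0.4, 0.8];
const total = weights.reduce((sum, w) => sum + w, 0); // 1.2
const shares = weights.map((w) => w / total); // [0.333..., 0.666...]
console.log(shares.map((s) => `${Math.round(s * 100)}%`)); // ["33%", "67%"]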

Common Patterns

// Equal distribution
load_balancer: [
  { model: "openai/gpt-4o", weight: 1.0 },
  { model: "anthropic/claude-3", weight: 1.0 },
];

// Cost optimization (cheap model primary)
load_balancer: [
  { model: "openai/gpt-3.5-turbo", weight: 0.8 },
  { model: "openai/gpt-4o", weight: 0.2 },
];

// A/B testing
load_balancer: [
  { model: "current-model", weight: 0.9 },
  { model: "experimental-model", weight: 0.1 },
];

// Multi-provider redundancy
load_balancer: [
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
];

Use Cases

Scenario             Weight Strategy             Example
Cost optimization    Heavy on cheaper models     80% GPT-3.5, 20% GPT-4
Performance testing  Small traffic to new model  95% current, 5% experimental
Provider redundancy  Split across providers      60% OpenAI, 40% Anthropic
Capacity management  Distribute during peaks     Even split across models

Code examples

cURL

curl -X POST https://api.orq.ai/v2/proxy/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
      }
    ],
    "orq": {
      "load_balancer": [
        {
          "model": "openai/gpt-3.5-turbo",
          "weight": 0.4
        },
        {
          "model": "anthropic/claude-3-haiku-20240307",
          "weight": 0.8
        }
      ]
    }
  }'

Python

from openai import OpenAI
import os

openai = OpenAI(
  api_key=os.environ.get("ORQ_API_KEY"),
  base_url="https://api.orq.ai/v2/proxy"
)

response = openai.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
        }
    ],
    extra_body={
        "orq": {
            "load_balancer": [
                {
                    "model": "openai/gpt-3.5-turbo",
                    "weight": 0.4
                },
                {
                    "model": "anthropic/claude-3-haiku-20240307",
                    "weight": 0.8
                }
            ]
        }
    }
)

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: "Write a creative marketing slogan for an eco-friendly coffee brand",
    },
  ],
  orq: {
    load_balancer: [
      {
        model: "openai/gpt-3.5-turbo",
        weight: 0.4,
      },
      {
        model: "anthropic/claude-3-haiku-20240307",
        weight: 0.8,
      },
    ],
  },
});

Monitoring

Track these metrics for optimal load balancing:

// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};

Key Metrics:

  • Traffic distribution: Actual vs expected percentages (compared in the sketch below)
  • Cost per model: Monitor spending across providers
  • Response times: Compare latency by model
  • Error rates: Track failures by provider
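
A minimal client-side sketch of the first check, assuming you collect requestsByModel as above (trafficReport is a hypothetical helper, not an Orq API):

// Compare observed traffic share per model against the configured weights.
const expected = {
  "openai/gpt-3.5-turbo": 0.7,
  "anthropic/claude-3-haiku": 0.3,
};

const trafficReport = (requestsByModel) => {
  const total = Object.values(requestsByModel).reduce((s, n) => s + n, 0);
  return Object.entries(requestsByModel).map(([model, count]) => ({
    model,
    actual: count / total,
    expected: expected[model] ?? 0,
    drift: count / total - (expected[model] ?? 0),
  }));
};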

Troubleshooting

Uneven distribution

  • Check if weights are normalized correctly
  • Verify sufficient request volume (at least ~100 requests for the split to be meaningful)
  • Monitor over longer time periods

Unexpected costs

  • Track actual vs expected cost distribution
  • Monitor for expensive model overuse
  • Set up cost alerts per provider (see the sketch after this list)

Performance issues

  • Check latency differences between models
  • Monitor for provider-specific slowdowns
  • Adjust weights based on performance data
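
For the cost alerts above, a minimal client-side check might look like this (the budget figures and costsByModel values are placeholders):

// Hypothetical per-provider budget check using the costsByModel metric above.
const budgets = { "openai/gpt-4o": 50, "anthropic/claude-3": 25 }; // USD per day
const costsByModel = { "openai/gpt-4o": 62.1, "anthropic/claude-3": 8.4 };

for (const [model, cost] of Object.entries(costsByModel)) {
  if (cost > (budgets[model] ?? Infinity)) {
    console.warn(`Cost alert: ${model} spent $${cost.toFixed(2)} today`);
  }
}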

Limitations

  • Probabilistic routing: Short-term traffic may not match the configured weights exactly (see the simulation sketch below)
  • Minimum volume needed: Requires sufficient requests for statistical accuracy
  • Response variations: Different models may return varying output quality
  • Cost complexity: Managing billing across multiple providers
  • Provider dependencies: Requires API access to all models
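
Orq's internal selection algorithm isn't documented here, but weighted random routing generally behaves like this sketch: at low volume, the observed split can drift several points from the configured 70/30.

// Generic weighted-random selection (illustrative; not Orq's internal code).
const pick = (models) => {
  let r = Math.random() * models.reduce((s, m) => s + m.weight, 0);
  for (const m of models) {
    r -= m.weight;
    if (r <= 0) return m.model;
  }
  return models[models.length - 1].model;
};

const pool = [
  { model: "openai/gpt-3.5-turbo", weight: 0.7 },
  { model: "anthropic/claude-3-haiku", weight: 0.3 },
];
const counts = {};
for (let i = 0; i < 100; i++) {
  const choice = pick(pool);
  counts[choice] = (counts[choice] ?? 0) + 1;
}
console.log(counts); // e.g. { "openai/gpt-3.5-turbo": 74, "anthropic/claude-3-haiku": 26 }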

Advanced Usage

Environment-specific weights:

const weights = {
  development: [
    { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
  ],
  production: [
    { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
    { model: "anthropic/claude-3", weight: 0.3 }, // Backup
  ],
};
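
Selecting the active set at runtime could then be as simple as (assuming a NODE_ENV-style switch):

// Pick the weight set for the current environment.
const env = process.env.NODE_ENV === "production" ? "production" : "development";
const activeWeights = weights[env];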

Dynamic weight adjustment:

// Adjust weights based on observed performance.
// Example heuristic (illustrative only): favor low latency and low cost,
// reward quality, and clamp to the documented 0.1–1.0 range.
const calculateWeight = (latency, cost, quality) =>
  Math.min(1, Math.max(0.1, quality / (latency * cost)));

const adjustWeights = (models) =>
  models.map((m) => ({
    model: m.name,
    weight: calculateWeight(m.latency, m.cost, m.quality),
  }));
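
The returned array can then be passed as orq.load_balancer on subsequent requests.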

With other features:

{
  orq: {
    load_balancer: [
      { model: "openai/gpt-4o", weight: 0.6 },
      { model: "anthropic/claude-3", weight: 0.4 },
    ],
    retries: { count: 2, on_codes: [429] },
    timeout: { call_timeout: 15000 },
  },
}