This page describes features extending the AI Gateway, which provides a unified API for accessing multiple AI providers. To learn more, see AI Gateway.

Quick Start

Distribute requests across multiple providers using weighted routing.
const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: {
    load_balancer: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});
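
This snippet assumes an OpenAI SDK client already pointed at the gateway. A minimal setup sketch (the proxy baseURL matches the curl example further down, and the environment variable name is the one used there):

import OpenAI from "openai";

// Route OpenAI SDK traffic through the AI Gateway proxy endpoint.
const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});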

Configuration

Parameter | Type | Required | Description
load_balancer | Array | Yes | List of models with weights
model | string | Yes | Model identifier
weight | number | Yes | Relative weight (0.0 - 1.0)
Weight Calculation:
  • Weights are normalized: [0.4, 0.8] becomes [33%, 67%] (see the sketch after this list)
  • Higher weight = more traffic
  • Minimum weight: 0.1 (10%)
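
A minimal sketch of the normalization described above (illustrative only; the gateway applies this server-side):

// Normalize raw weights so they sum to 1.0:
// [0.4, 0.8] -> [0.33..., 0.67...], i.e. roughly 33% / 67% of traffic.
const normalizeWeights = (entries) => {
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  return entries.map((e) => ({ ...e, weight: e.weight / total }));
};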

Common Patterns

// Equal distribution
load_balancer: [
  { model: "openai/gpt-4o", weight: 1.0 },
  { model: "anthropic/claude-3", weight: 1.0 },
];

// Cost optimization (cheap model primary)
load_balancer: [
  { model: "openai/gpt-3.5-turbo", weight: 0.8 },
  { model: "openai/gpt-4o", weight: 0.2 },
];

// A/B testing
load_balancer: [
  { model: "current-model", weight: 0.9 },
  { model: "experimental-model", weight: 0.1 },
];

// Multi-provider redundancy
load_balancer: [
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
];

Use Cases

Scenario | Weight Strategy | Example
Cost optimization | Heavy on cheaper models | 80% GPT-3.5, 20% GPT-4
Performance testing | Small traffic to new model | 95% current, 5% experimental
Provider redundancy | Split across providers | 60% OpenAI, 40% Anthropic
Capacity management | Distribute during peaks | Even split across models

Code Examples

curl -X POST https://api.orq.ai/v2/proxy/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "WriteWrite aa creativecreative marketingmarketing sloganslogan forfor anan eco-friendlyeco-friendly coffeecoffee brandbrand"
      }
    ],
    "orq": {
      "load_balancer": [
        {
          "model": "openai/gpt-3.5-turbo",
          "weight": 0.4
        },
        {
          "model": "anthropic/claude-3-haiku-20240307",
          "weight": 0.8
        }
      ]
    }
  }'
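
The weights above (0.4 and 0.8) are normalized to roughly 33% and 67%, as described in Configuration. The same request from plain JavaScript, as a sketch using the built-in fetch (Node 18+); endpoint, headers, and payload mirror the curl call:

const response = await fetch("https://api.orq.ai/v2/proxy/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ORQ_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: "Write a creative marketing slogan for an eco-friendly coffee brand",
      },
    ],
    orq: {
      load_balancer: [
        { model: "openai/gpt-3.5-turbo", weight: 0.4 },
        { model: "anthropic/claude-3-haiku-20240307", weight: 0.8 },
      ],
    },
  }),
});
const completion = await response.json();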

Monitoring

Track these metrics for optimal load balancing:
// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};
Key Metrics:
  • Traffic distribution: Actual vs expected percentages (see the sketch after this list)
  • Cost per model: Monitor spending across providers
  • Response times: Compare latency by model
  • Error rates: Track failures by provider
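
A minimal sketch of the traffic-distribution check, assuming you collect per-model request counts as in the metrics object above (the helper name and tolerance are illustrative, not part of the gateway API):

// Warn when a model's observed traffic share drifts from its configured weight.
const checkDistribution = (requestsByModel, expectedWeights, tolerance = 0.05) => {
  const total = Object.values(requestsByModel).reduce((a, b) => a + b, 0);
  if (total === 0) return; // nothing observed yet
  for (const [model, expected] of Object.entries(expectedWeights)) {
    const actual = (requestsByModel[model] ?? 0) / total;
    if (Math.abs(actual - expected) > tolerance) {
      console.warn(`${model}: expected ${expected}, observed ${actual.toFixed(2)}`);
    }
  }
};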

Troubleshooting

Uneven distribution:
  • Check if weights are normalized correctly
  • Verify sufficient request volume (min 100 requests for accuracy)
  • Monitor over longer time periods
Unexpected costs:
  • Track actual vs expected cost distribution
  • Monitor for expensive model overuse
  • Set up cost alerts per provider
Performance issues:
  • Check latency differences between models
  • Monitor for provider-specific slowdowns
  • Adjust weights based on performance data

Limitations

  • Probabilistic routing: Short-term traffic may not match exact weights (see the simulation sketch after this list)
  • Minimum volume needed: Requires sufficient requests for statistical accuracy
  • Response variations: Different models may return varying output quality
  • Cost complexity: Managing billing across multiple providers
  • Provider dependencies: Requires API access to all models
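
To see why short-term traffic drifts from the configured weights, here is a small simulation of weighted random routing (illustrative only, not the gateway's actual implementation):

// With few requests the observed split is noisy; with many it converges
// on the configured 70/30 weights.
const pick = (entries) => {
  let r = Math.random();
  for (const e of entries) {
    if ((r -= e.weight) <= 0) return e.model;
  }
  return entries[entries.length - 1].model;
};

const entries = [
  { model: "openai/gpt-3.5-turbo", weight: 0.7 },
  { model: "anthropic/claude-3-haiku", weight: 0.3 },
];

for (const n of [100, 10000]) {
  const counts = {};
  for (let i = 0; i < n; i++) {
    const m = pick(entries);
    counts[m] = (counts[m] ?? 0) + 1;
  }
  console.log(n, counts); // e.g. 100 requests may land 63/37; 10000 near 70/30
}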

Advanced Usage

Environment-specific weights:
const weights = {
  development: [
    { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
  ],
  production: [
    { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
    { model: "anthropic/claude-3", weight: 0.3 }, // Backup
  ],
};
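
Selecting the active set at runtime might then look like this (reading NODE_ENV is an assumption about your deployment setup):

// Fall back to the development weights when NODE_ENV is unset.
const load_balancer = weights[process.env.NODE_ENV] ?? weights.development;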
Dynamic weight adjustment:
// Adjust weights based on observed performance. `models` is an array of
// per-model stats from your monitoring; `calculateWeight` is a user-defined
// scoring function (hypothetical, not part of the gateway API).
const adjustWeights = (models) => {
  return models.map((model) => ({
    model: model.name,
    weight: calculateWeight(model.latency, model.cost, model.quality),
  }));
};
With other features:
{
  orq: {
    load_balancer: [
      { model: "openai/gpt-4o", weight: 0.6 },
      { model: "anthropic/claude-3", weight: 0.4 },
    ],
    retries: { count: 2, on_codes: [429] },
    timeout: { call_timeout: 15000 },
  },
}