Load balancing across providers - Orq.ai Documentation

Use Cases

Distributing traffic across multiple provider accounts to stay within per-key rate limits.
A/B testing providers by routing a configurable percentage of traffic to each.
Reducing blast radius from a single provider outage without code changes.
Maximizing throughput when one provider’s capacity is a bottleneck.

Quick Start

Distribute requests across multiple providers using weighted routing.

curl -X POST https://api.orq.ai/v3/router/responses \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "input": "Write a marketing slogan",
    "load_balancer": {
      "type": "weight_based",
      "models": [
        {"model": "openai/gpt-5-mini", "weight": 0.7},
        {"model": "anthropic/claude-haiku-4-5-20251001", "weight": 0.3}
      ]
    }
  }'

Configuration

Parameter	Type	Required	Description
`load_balancer`	Object	Yes	Load balancer configuration (top-level)
`load_balancer.type`	string	Yes	Strategy type (`weight_based` or `round_robin`)
`load_balancer.models`	Array	Yes	List of models with weights
`models[].model`	string	Yes	Model identifier
`models[].weight`	number	No	Relative weight (0.001 - 1.0, default 0.5)
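
The parameters above can be mirrored as client-side TypeScript types. This is an illustrative sketch only: the type names and the `withDefaults` helper are hypothetical and not part of any Orq.ai SDK. The weight range and default are taken from the table. Whether the API itself rejects out-of-range weights is not stated here, so the check is purely a local safeguard.

```typescript
// Hypothetical client-side types mirroring the documented parameters.
type LoadBalancerType = "weight_based" | "round_robin";

interface WeightedModel {
  model: string; // Model identifier, e.g. "openai/gpt-4o"
  weight?: number; // Optional; documented range 0.001 - 1.0, default 0.5
}

interface LoadBalancer {
  type: LoadBalancerType;
  models: WeightedModel[];
}

const MIN_WEIGHT = 0.001;
const MAX_WEIGHT = 1.0;
const DEFAULT_WEIGHT = 0.5;

// Fills in the default weight and rejects values outside the documented range.
function withDefaults(lb: LoadBalancer): Required<WeightedModel>[] {
  return lb.models.map(({ model, weight = DEFAULT_WEIGHT }) => {
    if (!(weight >= MIN_WEIGHT && weight <= MAX_WEIGHT)) {
      throw new Error(
        `weight for ${model} must be between ${MIN_WEIGHT} and ${MAX_WEIGHT}`,
      );
    }
    return { model, weight };
  });
}
```

Calling `withDefaults` on a configuration that omits a weight yields 0.5 for that model, matching the documented default.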

Weight Calculation:

Weights are normalized to shares of the total: [0.4, 0.8] → [33%, 67%].
A higher weight routes a larger share of traffic to that model.
Minimum weight: 0.001.
Default weight: 0.5.
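
The normalization rule amounts to dividing each weight by the sum of all weights. The sketch below shows that arithmetic, plus one common way to sample from the resulting shares. `normalizeWeights` and `pickIndex` are hypothetical helper names, and the sampling step is an assumption for illustration, not a description of the router's internals.

```typescript
// Divide each weight by the total so the shares sum to 1.
function normalizeWeights(weights: number[]): number[] {
  const total = weights.reduce((sum, w) => sum + w, 0);
  if (total <= 0) throw new Error("weights must sum to a positive number");
  return weights.map((w) => w / total);
}

// Pick an index for a random number r in [0, 1) by walking the cumulative shares.
function pickIndex(shares: number[], r: number): number {
  let cumulative = 0;
  for (let i = 0; i < shares.length; i++) {
    cumulative += shares[i];
    if (r < cumulative) return i;
  }
  return shares.length - 1; // Guards against floating-point rounding
}

const shares = normalizeWeights([0.4, 0.8]);
console.log(shares.map((s) => `${Math.round(s * 100)}%`).join(", ")); // 33%, 67%
```

With these shares, `pickIndex(shares, Math.random())` returns index 1 about two thirds of the time.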

Common Patterns

// Equal distribution
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-4o", weight: 1.0 },
    { model: "anthropic/claude-sonnet-4-6", weight: 1.0 },
  ],
}

// Cost optimization (cheap model primary)
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-5-mini", weight: 0.8 },
    { model: "openai/gpt-4o", weight: 0.2 },
  ],
}

// A/B testing
load_balancer: {
  type: "weight_based",
  models: [
    { model: "current-model", weight: 0.9 },
    { model: "experimental-model", weight: 0.1 },
  ],
}

// Multi-provider redundancy
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-4o", weight: 0.5 },
    { model: "anthropic/claude-sonnet-4-6", weight: 0.3 },
    { model: "azure/gpt-4o", weight: 0.2 },
  ],
}

Use Cases

Scenario	Weight Strategy	Example
Cost optimization	Heavy on cheaper models	80% GPT-3.5, 20% GPT-4
Performance testing	Small traffic to new model	95% current, 5% experimental
Provider redundancy	Split across providers	60% OpenAI, 40% Anthropic
Capacity management	Distribute during peaks	Even split across models

Code examples

curl -X POST https://api.orq.ai/v3/router/responses \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "input": "Write a creative marketing slogan for an eco-friendly coffee brand",
    "load_balancer": {
      "type": "weight_based",
      "models": [
        {"model": "openai/gpt-5-mini", "weight": 0.4},
        {"model": "anthropic/claude-haiku-4-5-20251001", "weight": 0.6}
      ]
    }
  }'
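
The same request can be assembled in TypeScript. A minimal sketch: the endpoint, headers, and body fields are copied from the curl example, while `buildRouterRequest` is a hypothetical helper rather than an SDK function, and the API key is a placeholder.

```typescript
interface WeightedModel {
  model: string;
  weight?: number;
}

interface LoadBalancer {
  type: "weight_based" | "round_robin";
  models: WeightedModel[];
}

// Builds the URL and request options for the router endpoint used in the curl example.
function buildRouterRequest(apiKey: string, input: string, loadBalancer: LoadBalancer) {
  return {
    url: "https://api.orq.ai/v3/router/responses",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "openai/gpt-4o-mini",
        input,
        load_balancer: loadBalancer,
      }),
    },
  };
}

const { url, init } = buildRouterRequest(
  "YOUR_ORQ_API_KEY", // Placeholder; read the real key from your environment
  "Write a creative marketing slogan for an eco-friendly coffee brand",
  {
    type: "weight_based",
    models: [
      { model: "openai/gpt-5-mini", weight: 0.4 },
      { model: "anthropic/claude-haiku-4-5-20251001", weight: 0.6 },
    ],
  },
);
```

In a runtime that provides the Fetch API, `fetch(url, init)` would send the request. The response shape is not covered in this section, so it is left out of the sketch.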

Monitoring

Track these metrics for optimal load balancing:

// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};

Key Metrics:

Traffic distribution: Actual vs expected percentages.
Cost per model: Monitor spending across providers.
Response times: Compare latency by model.
Error rates: Track failures by provider.
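
The traffic distribution metric can be computed from per-model request counts. This sketch assumes the counts are collected by your own application code, since this page does not describe a metrics API. `distributionDrift` is a hypothetical helper.

```typescript
// Compares observed request shares with the shares implied by the configured weights.
// Returns observed minus expected per model (positive = more traffic than configured).
function distributionDrift(
  weights: Record<string, number>,
  requestsByModel: Record<string, number>,
): Record<string, number> {
  const models = Object.keys(weights);
  const totalWeight = models.reduce((sum, m) => sum + weights[m], 0);
  const totalRequests = models.reduce((sum, m) => sum + (requestsByModel[m] ?? 0), 0);
  const drift: Record<string, number> = {};
  for (const m of models) {
    const expected = weights[m] / totalWeight;
    const observed = totalRequests > 0 ? (requestsByModel[m] ?? 0) / totalRequests : 0;
    drift[m] = observed - expected;
  }
  return drift;
}

const drift = distributionDrift(
  { "openai/gpt-5-mini": 0.7, "anthropic/claude-haiku-4-5-20251001": 0.3 },
  { "openai/gpt-5-mini": 650, "anthropic/claude-haiku-4-5-20251001": 350 },
);
// drift is about -0.05 for the first model and about +0.05 for the second
```

A drift of a few percentage points over a small sample is expected with probabilistic routing. A persistent drift over a large sample is the case worth investigating.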

Troubleshooting

Uneven distribution

Check if weights are normalized correctly.
Verify sufficient request volume (min 100 requests for accuracy).
Monitor over longer time periods.

Unexpected costs

Track actual vs expected cost distribution.
Monitor for expensive model overuse.
Set up cost alerts per provider.

Performance issues

Check latency differences between models.
Monitor for provider-specific slowdowns.
Adjust weights based on performance data.

Limitations

Probabilistic routing: Short-term traffic may not match exact weights.
Minimum volume needed: Requires sufficient requests for statistical accuracy.
Response variations: Output quality and style may differ between models.
Cost complexity: Billing must be tracked across multiple providers.
Provider dependencies: API access is required for every model in the list.
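
The first two limitations can be put into numbers. Assuming each request is routed independently at random according to the normalized weights (an assumption; this page only says routing is probabilistic), the observed share of a model with normalized share p over n requests has a standard deviation of sqrt(p(1 - p) / n). `shareStdDev` is a hypothetical helper.

```typescript
// Standard deviation of a model's observed traffic share after n requests,
// assuming independent random routing with normalized share p (binomial model).
function shareStdDev(p: number, n: number): number {
  if (p < 0 || p > 1 || n <= 0) throw new Error("need 0 <= p <= 1 and n > 0");
  return Math.sqrt((p * (1 - p)) / n);
}

for (const n of [100, 1000, 10000]) {
  console.log(n, shareStdDev(0.7, n).toFixed(4));
}
// 100 requests: 0.0458, 1000 requests: 0.0145, 10000 requests: 0.0046
```

Under this model, a 70% share measured over 100 requests can plausibly land anywhere between roughly 61% and 79% (two standard deviations), so small samples are a poor basis for judging whether the weights are being honored.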

Advanced Usage

Environment-specific weights:

const weights = {
  development: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-5-mini", weight: 1.0 }, // Cheap for dev
    ],
  },
  production: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
      { model: "anthropic/claude-sonnet-4-6", weight: 0.3 }, // Backup
    ],
  },
};

Dynamic weight adjustment:

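// Adjust weights based on performance
type ModelStats = { model: string; latency: number; cost: number; quality: number };

// Higher quality raises the score; higher latency or cost lowers it.
// Assumes latency, cost, and quality are positive.
const calculateWeight = (latency: number, cost: number, quality: number) =>
  quality / (latency * cost);

const adjustWeights = (models: ModelStats[]) => {
  const scores = models.map((m) => calculateWeight(m.latency, m.cost, m.quality));
  const maxScore = Math.max(...scores);
  return {
    type: "weight_based",
    models: models.map((m, i) => ({
      model: m.model,
      // Scale scores into the documented 0.001 - 1.0 weight range
      weight: Math.max(0.001, scores[i] / maxScore),
    })),
  };
};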

With other features:

{
  "model": "openai/gpt-4o",
  "load_balancer": {
    "type": "weight_based",
    "models": [
      { "model": "openai/gpt-4o", "weight": 0.6 },
      { "model": "anthropic/claude-sonnet-4-6", "weight": 0.4 }
    ]
  },
  "retry": { "count": 2, "on_codes": [429] },
  "timeout": { "call_timeout": 15000 }
}

See also: Organization-level load balancing
