- Distributing traffic across multiple provider accounts to stay within per-key rate limits.
- A/B testing providers by routing a configurable percentage of traffic to each.
- Reducing blast radius from a single provider outage without code changes.
- Maximizing throughput when one provider’s capacity is a bottleneck.
Quick Start
Distribute requests across multiple providers using weighted routing.
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| load_balancer | Object | Yes | Load balancer configuration (top-level) |
| load_balancer.type | string | Yes | Strategy type (weight_based or round_robin) |
| load_balancer.models | Array | Yes | List of models with weights |
| models[].model | string | Yes | Model identifier |
| models[].weight | number | No | Relative weight (0.001 - 1.0, default 0.5) |
- Weights are normalized: [0.4, 0.8] → [33%, 67%].
- Higher weight = more traffic.
- Minimum weight: 0.001.
- Default weight: 0.5.
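As a minimal sketch of the configuration described above, the snippet below builds a `load_balancer` object and computes the traffic share each model would receive after normalization. The field names come from the parameter table; the model identifiers and the `normalized_shares` helper are hypothetical and only illustrate the arithmetic, not the service's internal implementation.

```python
import json

# Field names follow the parameter table; model identifiers are placeholders.
config = {
    "load_balancer": {
        "type": "weight_based",
        "models": [
            {"model": "provider-a/model-x", "weight": 0.4},
            {"model": "provider-b/model-y", "weight": 0.8},
        ],
    }
}

DEFAULT_WEIGHT = 0.5   # documented default when weight is omitted
MIN_WEIGHT = 0.001     # documented minimum
MAX_WEIGHT = 1.0       # documented maximum


def normalized_shares(models):
    """Return {model: share}, where the shares sum to 1.0 (hypothetical helper)."""
    weights = {}
    for entry in models:
        weight = entry.get("weight", DEFAULT_WEIGHT)
        if not MIN_WEIGHT <= weight <= MAX_WEIGHT:
            raise ValueError(f"weight {weight} outside [{MIN_WEIGHT}, {MAX_WEIGHT}]")
        weights[entry["model"]] = weight
    total = sum(weights.values())
    return {model: weight / total for model, weight in weights.items()}


shares = normalized_shares(config["load_balancer"]["models"])
for model, share in shares.items():
    print(f"{model}: {share:.0%}")

# The same object serialized as JSON, e.g. for inclusion in a request body.
print(json.dumps(config, indent=2))
```

With weights 0.4 and 0.8 the shares work out to roughly 33% and 67%, matching the normalization rule above.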
Common Patterns
Use Cases
| Scenario | Weight Strategy | Example |
|---|---|---|
| Cost optimization | Heavy on cheaper models | 80% GPT-3.5, 20% GPT-4 |
| Performance testing | Small traffic to new model | 95% current, 5% experimental |
| Provider redundancy | Split across providers | 60% OpenAI, 40% Anthropic |
| Capacity management | Distribute during peaks | Even split across models |
See also: Organization-level load balancing
To apply load balancing across your organization without changing request code, use Routing Rules to configure Fallback, Weighted, and Round Robin strategies at the workspace level.
Code examples
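The two strategy types can be illustrated with a small local simulation. This is a sketch of how weight-based and round-robin selection generally behave, assuming weighted random choice for `weight_based` and simple cycling for `round_robin`; it is not the service's actual routing code, and the model identifiers and function names are hypothetical.

```python
import itertools
import random
from collections import Counter

# Placeholder model identifiers with relative weights.
models = [("provider-a/model-x", 0.6), ("provider-b/model-y", 0.4)]


def pick_weighted(models, rng):
    """Pick one model at random, with probability proportional to its weight."""
    names = [name for name, _ in models]
    weights = [weight for _, weight in models]
    return rng.choices(names, weights=weights, k=1)[0]


def round_robin(models):
    """Yield models in a fixed repeating order, ignoring weights."""
    return itertools.cycle(name for name, _ in models)


rng = random.Random(42)  # seeded so the simulation is repeatable
weighted_counts = Counter(pick_weighted(models, rng) for _ in range(10_000))

rr = round_robin(models)
rr_counts = Counter(next(rr) for _ in range(10_000))

print("weight_based:", dict(weighted_counts))
print("round_robin: ", dict(rr_counts))
```

In this sketch the weighted counts land near a 60/40 split without matching it exactly, while round robin produces an even split regardless of the weights.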
Monitoring
Track these metrics for optimal load balancing:
- Traffic distribution: Actual vs expected percentages.
- Cost per model: Monitor spending across providers.
- Response times: Compare latency by model.
- Error rates: Track failures by provider.
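The metrics above can be derived from per-request logs. The following sketch assumes each log record carries `model`, `latency_ms`, `cost_usd`, and `ok` fields; that record shape and the `summarize` helper are assumptions for illustration, not a documented export format.

```python
from collections import defaultdict

# Hypothetical per-request log records.
records = [
    {"model": "provider-a/model-x", "latency_ms": 420, "cost_usd": 0.002, "ok": True},
    {"model": "provider-a/model-x", "latency_ms": 510, "cost_usd": 0.002, "ok": False},
    {"model": "provider-a/model-x", "latency_ms": 390, "cost_usd": 0.003, "ok": True},
    {"model": "provider-b/model-y", "latency_ms": 800, "cost_usd": 0.010, "ok": True},
    {"model": "provider-b/model-y", "latency_ms": 760, "cost_usd": 0.012, "ok": True},
]


def summarize(records):
    """Aggregate traffic share, error rate, average latency, and cost per model."""
    totals = defaultdict(
        lambda: {"requests": 0, "errors": 0, "latency_ms": 0.0, "cost_usd": 0.0}
    )
    for record in records:
        stats = totals[record["model"]]
        stats["requests"] += 1
        stats["errors"] += 0 if record["ok"] else 1
        stats["latency_ms"] += record["latency_ms"]
        stats["cost_usd"] += record["cost_usd"]

    total_requests = len(records)
    return {
        model: {
            "actual_share": stats["requests"] / total_requests,
            "error_rate": stats["errors"] / stats["requests"],
            "avg_latency_ms": stats["latency_ms"] / stats["requests"],
            "total_cost_usd": stats["cost_usd"],
        }
        for model, stats in totals.items()
    }


summary = summarize(records)
for model, stats in summary.items():
    print(model, stats)
```

Comparing `actual_share` against the configured, normalized weight for each model gives the actual vs expected view described in the first bullet.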
Troubleshooting
Uneven distribution
- Check if weights are normalized correctly.
- Verify sufficient request volume (min 100 requests for accuracy).
- Monitor over longer time periods.
Unexpected costs
- Track actual vs expected cost distribution.
- Monitor for expensive model overuse.
- Set up cost alerts per provider.
Performance issues
- Check latency differences between models.
- Monitor for provider-specific slowdowns.
- Adjust weights based on performance data.
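When investigating uneven distribution, it helps to separate ordinary sampling noise from a real misconfiguration. The sketch below uses a normal approximation to the binomial distribution, which is a generic statistical check chosen for this example rather than anything the service provides; the `share_within_noise` helper and the three-standard-error threshold are assumptions.

```python
import math


def share_within_noise(observed_count, total_requests, expected_share, z=3.0):
    """Return True if the observed share is within z standard errors of expected.

    Uses the normal approximation to the binomial distribution, which is
    only reasonable with enough requests (on the order of 100 or more).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    observed_share = observed_count / total_requests
    std_error = math.sqrt(expected_share * (1 - expected_share) / total_requests)
    return abs(observed_share - expected_share) <= z * std_error


# 40 of 100 requests hit a model expected to receive about 33%:
# the deviation is plausible at this volume.
small_sample_ok = share_within_noise(40, 100, 1 / 3)

# 4,500 of 10,000 requests for the same model: too far off to be noise,
# so the configured weights are worth re-checking.
large_sample_ok = share_within_noise(4_500, 10_000, 1 / 3)

print(small_sample_ok, large_sample_ok)
```

The same deviation in percentage points matters much more at high volume, which is why monitoring over longer periods gives a more reliable signal.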
Limitations
- Probabilistic routing: Short-term traffic may not match exact weights.
- Minimum volume needed: Requires sufficient requests for statistical accuracy.
- Response variations: Different models may return varying output quality.
- Cost complexity: Managing billing across multiple providers.
- Provider dependencies: Requires API access to all models.