This page describes features extending the AI Gateway, which provides a unified API for accessing multiple AI providers. To learn more, see AI Gateway.

Quick Start

Distribute requests across multiple providers using weighted routing.
const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  orq: {
    load_balancer: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});
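
This snippet assumes an OpenAI SDK client already pointed at the gateway. A minimal setup sketch (the proxy baseURL matches the curl example further down, and the environment variable name is the one used there):

import OpenAI from "openai";

// Route OpenAI SDK traffic through the AI Gateway proxy endpoint.
const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v2/proxy",
});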

Configuration

Parameter | Type | Required | Description
load_balancer | Array | Yes | List of models with weights
model | string | Yes | Model identifier
weight | number | Yes | Relative weight (0.0 - 1.0)
Weight Calculation:
  • Weights are normalized: [0.4, 0.8] becomes [33%, 67%] (see the sketch after this list)
  • Higher weight = more traffic
  • Minimum weight: 0.1 (10%)
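
A minimal sketch of the normalization described above (illustrative only; the gateway applies this server-side):

// Normalize raw weights so they sum to 1.0:
// [0.4, 0.8] -> [0.33..., 0.67...], i.e. roughly 33% / 67% of traffic.
const normalizeWeights = (entries) => {
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  return entries.map((e) => ({ ...e, weight: e.weight / total }));
};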

Common Patterns

// Equal distribution
load_balancer: [
  { model: "openai/gpt-4o", weight: 1.0 },
  { model: "anthropic/claude-3", weight: 1.0 },
];

// Cost optimization (cheap model primary)
load_balancer: [
  { model: "openai/gpt-3.5-turbo", weight: 0.8 },
  { model: "openai/gpt-4o", weight: 0.2 },
];

// A/B testing
load_balancer: [
  { model: "current-model", weight: 0.9 },
  { model: "experimental-model", weight: 0.1 },
];

// Multi-provider redundancy
load_balancer: [
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
];

Use Cases

Scenario | Weight Strategy | Example
Cost optimization | Heavy on cheaper models | 80% GPT-3.5, 20% GPT-4
Performance testing | Small traffic to new model | 95% current, 5% experimental
Provider redundancy | Split across providers | 60% OpenAI, 40% Anthropic
Capacity management | Distribute during peaks | Even split across models

Code Examples

curl -X POST https://api.orq.ai/v2/proxy/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "WriteWrite aa creativecreative marketingmarketing sloganslogan forfor anan eco-friendlyeco-friendly coffeecoffee brandbrand"
      }
    ],
    "orq": {
      "load_balancer": [
        {
          "model": "openai/gpt-3.5-turbo",
          "weight": 0.4
        },
        {
          "model": "anthropic/claude-3-haiku-20240307",
          "weight": 0.8
        }
      ]
    }
  }'
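
The weights above (0.4 and 0.8) are normalized to roughly 33% and 67%, as described in Configuration. The same request from plain JavaScript, as a sketch using the built-in fetch (Node 18+); endpoint, headers, and payload mirror the curl call:

const response = await fetch("https://api.orq.ai/v2/proxy/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ORQ_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: "Write a creative marketing slogan for an eco-friendly coffee brand",
      },
    ],
    orq: {
      load_balancer: [
        { model: "openai/gpt-3.5-turbo", weight: 0.4 },
        { model: "anthropic/claude-3-haiku-20240307", weight: 0.8 },
      ],
    },
  }),
});
const completion = await response.json();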

Monitoring

Track these metrics for optimal load balancing:
// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};
Key Metrics:
  • Traffic distribution: Actual vs expected percentages (see the sketch after this list)
  • Cost per model: Monitor spending across providers
  • Response times: Compare latency by model
  • Error rates: Track failures by provider
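
A minimal sketch of the traffic-distribution check, assuming you collect per-model request counts as in the metrics object above (the helper name and tolerance are illustrative, not part of the gateway API):

// Warn when a model's observed traffic share drifts from its configured weight.
const checkDistribution = (requestsByModel, expectedWeights, tolerance = 0.05) => {
  const total = Object.values(requestsByModel).reduce((a, b) => a + b, 0);
  if (total === 0) return; // nothing observed yet
  for (const [model, expected] of Object.entries(expectedWeights)) {
    const actual = (requestsByModel[model] ?? 0) / total;
    if (Math.abs(actual - expected) > tolerance) {
      console.warn(`${model}: expected ${expected}, observed ${actual.toFixed(2)}`);
    }
  }
};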

Troubleshooting

Uneven distribution:
  • Check if weights are normalized correctly
  • Verify sufficient request volume (min 100 requests for accuracy)
  • Monitor over longer time periods
Unexpected costs:
  • Track actual vs expected cost distribution
  • Monitor for expensive model overuse
  • Set up cost alerts per provider
Performance issues:
  • Check latency differences between models
  • Monitor for provider-specific slowdowns
  • Adjust weights based on performance data

Limitations

  • Probabilistic routing: Short-term traffic may not match exact weights (see the simulation sketch after this list)
  • Minimum volume needed: Requires sufficient requests for statistical accuracy
  • Response variations: Different models may return varying output quality
  • Cost complexity: Managing billing across multiple providers
  • Provider dependencies: Requires API access to all models
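
To see why short-term traffic drifts from the configured weights, here is a small simulation of weighted random routing (illustrative only, not the gateway's actual implementation):

// With few requests the observed split is noisy; with many it converges
// on the configured 70/30 weights.
const pick = (entries) => {
  let r = Math.random();
  for (const e of entries) {
    if ((r -= e.weight) <= 0) return e.model;
  }
  return entries[entries.length - 1].model;
};

const entries = [
  { model: "openai/gpt-3.5-turbo", weight: 0.7 },
  { model: "anthropic/claude-3-haiku", weight: 0.3 },
];

for (const n of [100, 10000]) {
  const counts = {};
  for (let i = 0; i < n; i++) {
    const m = pick(entries);
    counts[m] = (counts[m] ?? 0) + 1;
  }
  console.log(n, counts); // e.g. 100 requests may land 63/37; 10000 near 70/30
}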

Advanced Usage

Environment-specific weights:
const weights = {
  development: [
    { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
  ],
  production: [
    { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
    { model: "anthropic/claude-3", weight: 0.3 }, // Backup
  ],
};
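
Selecting the active set at runtime might then look like this (reading NODE_ENV is an assumption about your deployment setup):

// Fall back to the development weights when NODE_ENV is unset.
const load_balancer = weights[process.env.NODE_ENV] ?? weights.development;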
Dynamic weight adjustment:
// Adjust weights based on observed performance. `models` is an array of
// per-model stats from your monitoring; `calculateWeight` is a user-defined
// scoring function (hypothetical, not part of the gateway API).
const adjustWeights = (models) => {
  return models.map((model) => ({
    model: model.name,
    weight: calculateWeight(model.latency, model.cost, model.quality),
  }));
};
With other features:
{
  orq: {
    load_balancer: [
      { model: "openai/gpt-4o", weight: 0.6 },
      { model: "anthropic/claude-3", weight: 0.4 },
    ],
    retries: { count: 2, on_codes: [429] },
    timeout: { call_timeout: 15000 },
  },
}