Router.Chat.Completions
Create a Completion
Creates a model response for the given chat conversation with support for retries, fallbacks, prompts, and variables.

from orq_ai_sdk import Orq
import os

with Orq(
    api_key=os.getenv("ORQ_API_KEY", ""),
) as orq:
    res = orq.router.chat.completions.create(messages=[], model="Model 3", fallbacks=[
        {
            "model": "openai/gpt-4o-mini",
        },
    ], retry={
        "on_codes": [
            429,
            500,
            502,
            503,
            504,
        ],
    }, cache={
        "ttl": 3600,
        "type": "exact_match",
    }, load_balancer={
        "type": "weight_based",
        "models": [
            {
                "model": "openai/gpt-4o",
                "weight": 0.7,
            },
            {
                "model": "anthropic/claude-3-5-sonnet",
                "weight": 0.3,
            },
        ],
    }, timeout={
        "call_timeout": 30000,
    }, variables={
        "customer_name": "John Smith",
        "product_name": "Premium Plan",
    }, stream=False)

    with res as event_stream:
        for event in event_stream:
            # handle event
            print(event, flush=True)
Show Parameters
A list of messages comprising the conversation so far.
Model ID used to generate the response, like openai/gpt-4o or anthropic/claude-haiku-4-5-20251001. The AI Gateway offers a wide range of models with different capabilities, performance characteristics, and price points. Refer to [Supported models](/docs/proxy/supported-models) to browse available models.
Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys can have a maximum length of 64 characters and values can have a maximum length of 512 characters.
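The metadata limits above can be checked on the client before sending a request. The following is a minimal sketch, assuming "set of 16" means at most 16 pairs; the helper name `validate_metadata` is hypothetical and not part of the SDK.

```python
# Limits taken from the parameter description; reading "16" as an upper bound is an assumption.
MAX_PAIRS = 16
MAX_KEY_LEN = 64
MAX_VALUE_LEN = 512


def validate_metadata(metadata):
    """Return a list of problems; an empty list means the metadata fits the limits."""
    problems = []
    if len(metadata) > MAX_PAIRS:
        problems.append(f"too many pairs: {len(metadata)} > {MAX_PAIRS}")
    for key, value in metadata.items():
        if len(key) > MAX_KEY_LEN:
            problems.append(f"key too long ({len(key)} chars): {key[:20]}...")
        if len(str(value)) > MAX_VALUE_LEN:
            problems.append(f"value too long for key {key[:20]!r}")
    return problems


print(validate_metadata({"team": "support", "ticket": "12345"}))
```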
The name to display on the trace. If not specified, the default system name will be used.
Parameters for audio output. Required when audio output is requested with modalities: ["audio"].
Show Properties of audio
The voice the model uses to respond. Supported voices are alloy, echo, fable, onyx, nova, and shimmer.
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.
[Deprecated]. The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via API. This value is now deprecated in favor of max_completion_tokens, and is not compatible with o1 series models.
An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens.
Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
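The dependency between the two log-probability settings can be enforced before a request is built. This is a small sketch, assuming the second parameter is named `top_logprobs` (the name does not appear in the text above) and using a hypothetical helper `logprob_params`.

```python
def logprob_params(logprobs=False, top_logprobs=None):
    """Build the log-probability part of a request, rejecting invalid combinations."""
    params = {"logprobs": logprobs}
    if top_logprobs is not None:
        # The description requires logprobs to be true whenever top_logprobs is used.
        if not logprobs:
            raise ValueError("top_logprobs requires logprobs=True")
        if not 0 <= top_logprobs <= 20:
            raise ValueError("top_logprobs must be between 0 and 20")
        params["top_logprobs"] = top_logprobs
    return params


print(logprob_params(logprobs=True, top_logprobs=5))
```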
How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep n as 1 to minimize costs.
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
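To make the difference between the frequency and presence penalties concrete, the sketch below follows the adjustment OpenAI documents for its own models: each token's logit is reduced by `count * frequency_penalty`, plus `presence_penalty` once if the token has appeared at all. Whether every provider behind the router applies the same formula is an assumption, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with both penalties applied to already-generated tokens."""
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            # Frequency penalty grows with each repetition; presence penalty is a flat, one-time cost.
            adjusted[token] -= count * frequency_penalty + presence_penalty
    return adjusted


logits = {"cat": 1.0, "dog": 1.0}
print(apply_penalties(logits, ["cat", "cat"], frequency_penalty=0.5, presence_penalty=0.25))
```

Tokens that have not been generated yet keep their original logit, which is why positive values steer the model away from repetition.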
An object specifying the format that the model must output.
Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.
- gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
- All models before gpt-5.1 default to medium reasoning effort, and do not support none.
- The gpt-5-pro model defaults to (and only supports) high reasoning effort.
- xhigh is currently only supported for gpt-5.1-codex-max.
Any of "none", "minimal", "low", "medium", "high", "xhigh".
Adjusts response verbosity. Lower levels yield shorter answers.
If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result.
Up to 4 sequences where the API will stop generating further tokens.
Options for streaming response. Only set this when you set stream: true.
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
Limits the model to consider only the top k most likely tokens at each step.
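The two truncation strategies above (nucleus sampling and top-k) can be illustrated on a toy distribution. This is a conceptual sketch of the general techniques, not the code any particular provider runs; both function names are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches top_p."""
    kept, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.7))
```

Top-k fixes the number of candidates, while top-p lets that number vary with how peaked the distribution is.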
A list of tools the model may call.
Show Properties of tools
The type of the tool. Currently, only function is supported.
Controls which (if any) tool is called by the model.
Whether to enable parallel function calling during tool use.
Output types that you would like the model to generate. Most models are capable of generating text, which is the default: ["text"]. The gpt-4o-audio-preview model can also be used to generate audio. To request that this model generate both text and audio responses, you can use: ["text", "audio"].
A list of guardrails to apply to the request.
Show Properties of guardrails
Array of fallback models to use if the primary model fails.
Retry configuration for the request
Show Properties of retry
Number of retry attempts (1-5)
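How retries and fallbacks interact can be pictured with a small simulation. The sketch below is an illustration only, not Orq's implementation: `send` is a hypothetical stand-in for the real model call, and it assumes the retry count means additional attempts after the first call, with a move to the next fallback model once retries are exhausted or a non-retryable code is returned.

```python
def complete_with_fallbacks(models, send, on_codes, retries):
    """Try each model in order, retrying a model while it returns a retryable status code.

    send(model) must return a (status_code, body) tuple.
    """
    last = None
    for model in models:
        for _ in range(1 + retries):
            status, body = send(model)
            if status < 400:
                return model, body
            last = (status, body)
            if status not in on_codes:
                break  # not retryable: move on to the next model
    raise RuntimeError(f"all models failed, last response: {last}")


def fake_send(model):
    # Simulated provider: the primary model is rate limited, the fallback succeeds.
    return (429, "rate limited") if model == "openai/gpt-4o" else (200, "ok")


print(complete_with_fallbacks(
    ["openai/gpt-4o", "openai/gpt-4o-mini"], fake_send, {429, 500, 502, 503, 504}, retries=2
))
```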
Cache configuration for the request.
Show Properties of cache
Time to live for cached responses in seconds. Maximum 259200 seconds (3 days).
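An exact-match cache with a TTL can be sketched in a few lines. How the gateway actually derives its cache key is not stated here; the version below assumes a hash of the canonical JSON request, and the class `ExactMatchCache` is hypothetical.

```python
import hashlib
import json
import time

MAX_TTL_SECONDS = 259200  # 3 days, the documented maximum


class ExactMatchCache:
    def __init__(self, ttl, clock=time.monotonic):
        if not 0 < ttl <= MAX_TTL_SECONDS:
            raise ValueError(f"ttl must be between 1 and {MAX_TTL_SECONDS} seconds")
        self.ttl = ttl
        self._clock = clock
        self._entries = {}

    @staticmethod
    def key_for(request):
        # Sorting keys makes the hash independent of dict ordering.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, request):
        key = self.key_for(request)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired
            return None
        return response

    def put(self, request, response):
        self._entries[self.key_for(request)] = (self._clock(), response)


cache = ExactMatchCache(ttl=3600)
cache.put({"model": "openai/gpt-4o", "messages": []}, "cached answer")
print(cache.get({"messages": [], "model": "openai/gpt-4o"}))
```

Only byte-for-byte identical requests (after canonicalization) hit this cache; any change to the messages or parameters produces a different key.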
Load balancer configuration for the request.
Timeout configuration to apply to the request. If the request exceeds the timeout, it will be retried or fall back to the next model if configured.
Variables to substitute in message templates. Uses f-string syntax ({{variableName}}) by default. For advanced templating with Jinja or Mustache syntax, use in conjunction with template_engine.
Leverage Orq’s intelligent routing capabilities to enhance your AI application with enterprise-grade reliability and observability. Orq provides automatic request management including retries on failures, model fallbacks for high availability, identity-level analytics tracking, conversation threading, and dynamic prompt templating with variable substitution.
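The {{variableName}} substitution described for variables can be approximated locally, for example to preview a prompt before sending it. This is a rough sketch and not the gateway's template engine; leaving unknown placeholders untouched is an assumption, and `render` is a hypothetical helper.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace each {{name}} with its value; unknown names are left as-is."""
    def replace(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return _PLACEHOLDER.sub(replace, template)


print(render(
    "Hello {{customer_name}}, welcome to {{product_name}}.",
    {"customer_name": "John Smith", "product_name": "Premium Plan"},
))
```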
Show Properties of ~~`orq`~~
The name to display on the trace. If not specified, the default system name will be used.
Retry configuration for the request
Show Properties of retry
Number of retry attempts (1-5)
Array of fallback models to use if the primary model fails.
Prompt configuration for the request
Show Properties of prompt
Unique identifier of the prompt to use
Information about the identity making the request. If the identity does not exist, it will be created automatically.
Show Properties of identity
@deprecated Use identity instead. Information about the contact making the request.
Show Properties of ~~`contact`~~
Thread information to group related requests
Show Properties of thread
Unique thread identifier to group related invocations.
@deprecated Use top-level variables field instead. Values to replace in the prompt messages using {{variableName}} syntax.
Cache configuration for the request.
Show Properties of cache
Time to live for cached responses in seconds. Maximum 259200 seconds (3 days).
Show Properties of knowledgeBases
The number of results to return. If not provided, defaults to the knowledge base's configured top_k.
The threshold to apply to the search. If not provided, defaults to the knowledge base's configured threshold.
The type of search to perform. If not provided, defaults to the knowledge base's configured retrieval_type.
The metadata filter to apply to the search. See Searching a Knowledge Base for more information.
Override the rerank configuration for this search. If not provided, will use the knowledge base configured rerank settings.
Show Properties of rerankConfig
The name of the rerank model to use. Refer to the model list.
The threshold value used to filter the rerank results. Only documents with a relevance score greater than the threshold are returned.
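The rerank threshold behaves like a strict lower bound on relevance. The sketch below shows that filtering step on toy data; the `score` field name and the helper `filter_reranked` are hypothetical, and the optional `top_k` cut is included only to show how the two settings could combine.

```python
def filter_reranked(documents, threshold, top_k=None):
    """Keep documents scoring strictly above threshold, best first, optionally capped at top_k."""
    kept = [doc for doc in documents if doc["score"] > threshold]
    kept.sort(key=lambda doc: doc["score"], reverse=True)
    return kept if top_k is None else kept[:top_k]


docs = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.5},
    {"id": "c", "score": 0.7},
]
print(filter_reranked(docs, threshold=0.5, top_k=2))
```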
Override the agentic RAG configuration for this search. If not provided, will use the knowledge base configured agentic RAG settings.
Unique identifier of the knowledge base to search
Array of models with weights for load balancing requests
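Weight-based load balancing amounts to a weighted random choice per request. This is a conceptual sketch using the weights from the request example at the top of the page, not the gateway's routing code; `pick_model` is a hypothetical helper.

```python
import random

MODELS = [
    {"model": "openai/gpt-4o", "weight": 0.7},
    {"model": "anthropic/claude-3-5-sonnet", "weight": 0.3},
]


def pick_model(models, rng=random):
    """Choose one model name with probability proportional to its weight."""
    names = [entry["model"] for entry in models]
    weights = [entry["weight"] for entry in models]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same picks
picks = [pick_model(MODELS, rng) for _ in range(10_000)]
print(picks.count("openai/gpt-4o") / len(picks))
```

Over many requests the share of traffic per model approaches its weight, while any single request can land on either model.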