LLM response streaming - Orq.ai Documentation

Use Cases

Chat UIs that show responses as they arrive, before generation completes.
Long-form generation (reports, code) where waiting for the full output hurts UX.
Agent workflows that surface reasoning steps or tool calls in real time.
Reducing perceived latency on slow models or large outputs.

Quick Start

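Set stream to true in the request body to receive the response incrementally instead of waiting for the full output. The -N flag disables curl's output buffering so chunks print as they arrive.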

curl -N -X POST https://api.orq.ai/v3/router/responses \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.4",
    "input": "Write a story about space exploration",
    "stream": true
  }'

Configuration

Parameter	Type	Required	Description
`stream`	boolean	Yes	Enable streaming responses

All models support streaming: no additional configuration needed.

Response Format

Streaming chunks:

{
  "type": "response.output_text.delta",
  "delta": "Hello"
}

Final chunk:

{
  "type": "response.output_text.done"
}
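
As a minimal sketch of how a client might consume these events, the function below folds a sequence of already-parsed events into the final text. The StreamEvent interface and the collectText name are illustrative and not part of any SDK. It assumes only the two event types shown above matter and skips everything else.

```typescript
// Minimal event shape, based only on the two examples in this section.
interface StreamEvent {
  type: string;
  delta?: string;
}

// Concatenate text deltas until the "done" event is seen.
function collectText(events: StreamEvent[]): { text: string; done: boolean } {
  let text = "";
  let done = false;
  for (const event of events) {
    if (event.type === "response.output_text.delta") {
      text += event.delta ?? "";
    } else if (event.type === "response.output_text.done") {
      done = true;
      break; // nothing after the final chunk is read
    }
    // Any other event type is ignored in this sketch.
  }
  return { text, done };
}
```

A done value of false after the input is exhausted indicates the stream ended before the final chunk arrived, which is worth treating as an incomplete response.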

Code examples

curl -N -X POST https://api.orq.ai/v3/router/responses \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.4",
    "input": "Write a detailed explanation of quantum computing",
    "stream": true
  }'

Stream Processing Patterns

The examples in this section use the Chat Completions endpoint. The same patterns apply to the Responses API: replace chat.completions.create(...) with responses.create(...), update the endpoint to /v3/router/responses, and handle response.output_text.delta events instead of choices[0].delta.content.

Basic processing

Accumulate deltas into a full string and detect completion via finish_reason.

const processStream = async (stream) => {
  let fullResponse = "";

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || "";
    if (content) {
      fullResponse += content;
      process.stdout.write(content); // Real-time output, without a newline after every chunk
    }

    // Check for completion
    if (chunk.choices[0]?.finish_reason) {
      console.log(`\nStream finished: ${chunk.choices[0].finish_reason}`);
      break;
    }
  }

  return fullResponse;
};

With error handling

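Guard against network drops and unexpected errors by wrapping the stream loop in a try/catch. The example also races each read against an idle timer, so the stream fails with a timeout error if no chunk arrives within 30 seconds.

const updateUI = (content: string) => { process.stdout.write(content); }; // replace with your UI update logic

const robustStreamProcessing = async (stream, idleTimeoutMs = 30000) => {
  const iterator = stream[Symbol.asyncIterator]();
  let response = "";

  try {
    while (true) {
      // Start a fresh timer for every read so the timeout measures idle time between chunks
      let timer: ReturnType<typeof setTimeout> | undefined;
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("Stream timeout")), idleTimeoutMs);
      });

      let result;
      try {
        // Rejecting a promise (rather than throwing inside the timer callback) lets the catch below see it
        result = await Promise.race([iterator.next(), timeout]);
      } finally {
        clearTimeout(timer);
      }
      if (result.done) break;

      const choice = result.value.choices[0];
      if (choice?.delta?.content) {
        response += choice.delta.content;
        // Update UI with new content
        updateUI(choice.delta.content);
      }

      if (choice?.finish_reason) {
        break;
      }
    }

    return response;
  } catch (error) {
    console.error("Streaming error:", error);
    throw error;
  }
};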

Function Calling with Streaming

Stream tool calls as they’re generated:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_weather",
      description: "Get current weather",
      parameters: {
        type: "object",
        properties: { location: { type: "string" } },
        required: ["location"],
      },
    },
  },
];

const stream = await client.chat.completions.create({
  model: "openai/gpt-5.4",
  messages: [{ role: "user", content: "What's the weather in Paris?" }],
  tools,
  stream: true,
});

for await (const chunk of stream) {
  if (!chunk.choices.length) continue;
  const delta = chunk.choices[0].delta;
  if (delta.tool_calls?.[0]?.function?.arguments) {
    process.stdout.write(delta.tool_calls[0].function.arguments);
  } else if (delta.content) {
    process.stdout.write(delta.content);
  }
}
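
The loop above prints argument fragments as they arrive. To actually invoke a tool, the fragments usually need to be joined and parsed once the stream ends. The following is a minimal sketch of that step, assuming the chunks follow the OpenAI-compatible format in which each tool call delta carries an index, and the id and function name appear on the first fragment. The ToolCallDelta, ToolCall, and accumulateToolCalls names are hypothetical.

```typescript
// One entry of delta.tool_calls, as assumed from the OpenAI-compatible chunk format.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCall {
  id: string;
  name: string;
  arguments: string; // raw JSON text, complete only after the stream ends
}

// Merge streamed fragments into complete tool calls, keyed by their index.
function accumulateToolCalls(deltas: ToolCallDelta[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const delta of deltas) {
    const call = (calls[delta.index] ??= { id: "", name: "", arguments: "" });
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name = delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
  }
  return calls;
}
```

In a real loop, each delta.tool_calls entry would be collected into an array as chunks arrive, and JSON.parse would be applied to arguments only after the stream finishes, since intermediate fragments are not valid JSON.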

UI Integration Examples

React hook for streaming

Encapsulate streaming state in a hook so components receive response and isStreaming without managing the event loop themselves.

import OpenAI from "openai";
import { useState, useCallback } from "react";

const client = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const useStreamingChat = () => {
  const [response, setResponse] = useState("");
  const [isStreaming, setIsStreaming] = useState(false);

  const streamChat = useCallback(async (message) => {
    setIsStreaming(true);
    setResponse("");

    try {
      const stream = await client.chat.completions.create({
        model: "openai/gpt-5.4",
        messages: [{ role: "user", content: message }],
        stream: true,
      });

      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || "";
        if (content) {
          setResponse((prev) => prev + content);
        }

        if (chunk.choices[0]?.finish_reason) {
          break;
        }
      }
    } catch (error) {
      console.error("Streaming failed:", error);
    } finally {
      // Runs on every exit path, including streams that end without a finish_reason
      setIsStreaming(false);
    }
  }, []);

  return { response, isStreaming, streamChat };
};

Server-Sent Events (Browser):

const streamWithSSE = async (message: string): Promise<void> => {
  const response = await fetch("/api/chat-stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  });

  if (!response.ok || !response.body) {
    throw new Error(`Request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const output = document.getElementById("response")!;
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

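    for (const rawLine of lines) {
      const line = rawLine.trimEnd(); // tolerate \r\n line endings
      if (line === "data: [DONE]") return; // end-of-stream sentinel: stop reading, not just this batch
      if (!line.startsWith("data: ")) continue;
      const data = JSON.parse(line.slice(6));
      const content = data.choices?.[0]?.delta?.content;
      if (content) output.textContent += content; // textContent keeps model output from being parsed as HTML
    }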
  }
};

Performance Optimization

Chunk buffering

Batching small chunks before flushing to the UI reduces render cycles and smooths perceived output.

class StreamBuffer {
  private buffer: string;
  private flushInterval: number;
  private lastFlush: number;

  constructor(flushInterval = 50) {
    this.buffer = "";
    this.flushInterval = flushInterval;
    this.lastFlush = Date.now();
  }

  add(content: string): void {
    this.buffer += content;

    // Flush periodically or when buffer is large
    if (
      Date.now() - this.lastFlush > this.flushInterval ||
      this.buffer.length > 100
    ) {
      this.flush();
    }
  }

  flush(): void {
    if (this.buffer) {
      this.onFlush(this.buffer);
      this.buffer = "";
      this.lastFlush = Date.now();
    }
  }

  onFlush(content: string): void {
    // Override this method
    console.log(content);
  }
}

Memory management

For long completions, cap accumulation to avoid unbounded memory growth.

const processLargeStream = async (stream, maxMemory = 1000000) => {
  let totalLength = 0;
  const chunks = [];

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || "";

    if (content) {
      totalLength += content.length;
      chunks.push(content);

      // Prevent memory overflow
      if (totalLength > maxMemory) {
        console.warn("Stream too large, truncating");
        break;
      }
    }

    if (chunk.choices[0]?.finish_reason) {
      break;
    }
  }

  return chunks.join("");
};

Best Practices

Stream management

Set reasonable timeouts (30-60 seconds).
Implement proper error boundaries.
Handle network interruptions gracefully.
Provide user cancellation options.
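
One way to offer cancellation is to check a standard AbortSignal between chunks. The sketch below works on any async iterable of text pieces, so it does not depend on a particular SDK. The consumeUntilAborted name and its return shape are illustrative assumptions.

```typescript
// Consume an async-iterable stream of text pieces until it ends or the signal aborts.
async function consumeUntilAborted(
  stream: AsyncIterable<string>,
  signal: AbortSignal,
  onText: (text: string) => void,
): Promise<{ text: string; cancelled: boolean }> {
  let text = "";
  for await (const piece of stream) {
    if (signal.aborted) {
      // Returning from for-await also asks the iterator to clean up.
      return { text, cancelled: true };
    }
    text += piece;
    onText(piece);
  }
  return { text, cancelled: false };
}
```

A stop button would call abort() on the AbortController that owns the signal. This version only notices the abort when the next chunk arrives. If the client library accepts an AbortSignal in its request options, passing the same signal there would also tear down the underlying connection, but that should be verified against the library in use.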

UI/UX considerations

Show typing indicators during streaming.
Allow users to stop generation.
Buffer small chunks for smoother display.
Handle rapid updates efficiently.

Error recovery example

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const streamWithRetry = async (input: string, maxRetries = 3) => {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const stream = await client.responses.create({
        model: "openai/gpt-5.4",
        input,
        stream: true,
      });

      let fullResponse = "";
      for await (const event of stream) {
        if (event.type === "response.output_text.delta") {
          fullResponse += event.delta;
          process.stdout.write(event.delta);
        }
      }
      return fullResponse;
    } catch (error) {
      if (attempt === maxRetries) throw error;

      console.log(`Stream attempt ${attempt} failed, retrying...`);
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
};

Troubleshooting

Stream cuts off unexpectedly

Check network stability.
Verify timeout settings.
Monitor for rate limiting.
Check model-specific limits.

Slow streaming performance

Optimize chunk processing.
Reduce buffer flush frequency.
Check network latency.
Consider model selection.

Memory issues

Implement chunk size limits.
Use streaming parsers.
Clear processed chunks.
Monitor memory usage.
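
One possible reading of "use streaming parsers" and "clear processed chunks" is to hand off complete units (here, lines) as soon as they are available and keep only the unfinished tail in memory. The LineSplitter class below is a hypothetical sketch of that idea, not part of any SDK.

```typescript
// Emits complete lines as they become available and retains only the unfinished tail.
class LineSplitter {
  private tail = "";

  // Feed one chunk; returns the lines that this chunk completed.
  push(chunk: string): string[] {
    const parts = (this.tail + chunk).split("\n");
    this.tail = parts.pop() ?? ""; // the last part may be an incomplete line
    return parts;
  }

  // Call once at end of stream to retrieve any unterminated final line.
  end(): string[] {
    const rest = this.tail;
    this.tail = "";
    return rest ? [rest] : [];
  }
}
```

Because each returned line can be written to disk, rendered, or forwarded immediately, memory use stays proportional to the longest single line rather than to the whole response.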

Limitations

Limitation	Impact	Workaround
Network interruption	Stream breaks	Implement reconnection logic
Processing overhead	Slight performance cost	Optimize chunk handling
Model variations	Different chunk sizes	Handle variable chunk lengths
Rate limiting	Stream throttling	Implement backoff strategies
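
For the backoff workaround in the table above, a common approach is exponential backoff with jitter, which spreads retries out so that many clients do not retry at the same moment. This is a sketch under assumed defaults: the backoffDelay name, the 500 ms base, and the 30 second cap are illustrative values, not documented limits.

```typescript
// Exponential backoff with full jitter:
// a random delay in [0, min(maxMs, baseMs * 2^(attempt - 1))).
function backoffDelay(
  attempt: number, // 1-based attempt number
  baseMs = 500,
  maxMs = 30_000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}
```

In a retry loop, the returned value would replace a fixed or linear wait, for example await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt))).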

Advanced Features

Stream with other Gateway features

AI Gateway features like caching, timeouts, and deployment names compose directly with streaming: add them to the same request object.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const advancedStream = await client.chat.completions.create({
  model: "openai/gpt-5.4",
  messages: [{ role: "user", content: "Explain machine learning" }],
  stream: true,
  name: "StreamingBot-v1",
  cache: { type: "exact_match", ttl: 3600 },
  timeout: { call_timeout: 30000 },
});

Parallel streaming

Fire multiple streams concurrently with Promise.all to get independent responses without waiting for each one to finish before starting the next.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const processQuery = async (query) => {
  const stream = await client.chat.completions.create({
    model: "openai/gpt-5.4",
    messages: [{ role: "user", content: query }],
    stream: true,
  });
  let fullResponse = "";
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || "";
    if (content) fullResponse += content;
  }
  return fullResponse;
};

const parallelStreaming = async (queries) => Promise.all(queries.map(processQuery));
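
Promise.all rejects as soon as any one stream fails, which discards the results of the streams that succeeded. Where partial results are acceptable, the standard Promise.allSettled keeps each outcome separate. The sketch below is generic over the worker function, so a function like processQuery could be passed in. The runAllSettled name and the QueryResult shape are hypothetical.

```typescript
interface QueryResult {
  query: string;
  ok: boolean;
  text?: string;
  error?: string;
}

// Run one worker per query concurrently and report each outcome separately.
async function runAllSettled(
  queries: string[],
  worker: (query: string) => Promise<string>,
): Promise<QueryResult[]> {
  const settled = await Promise.allSettled(queries.map((q) => worker(q)));
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? { query: queries[i], ok: true, text: outcome.value }
      : { query: queries[i], ok: false, error: String(outcome.reason) },
  );
}
```

Results come back in the same order as the input queries, so a failed entry can be retried individually without rerunning the rest.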
