Streaming

Overview

Who is this for? Developers building conversational AI applications who need real-time, token-by-token responses for a better user experience.

What you'll achieve: Implement streaming responses that display AI-generated content as it's being produced, creating more engaging and responsive applications.

The AI Proxy supports Server-Sent Events (SSE) streaming for both chat completions and text completions across all supported providers, with automatic chunk combination and error handling.

Supported Streaming Types

| Endpoint | Description | Response Format |
| --- | --- | --- |
| /v2/chat/completions | Conversational AI with streaming | data: {"choices":[{"delta":{"content":"token"}}]} |
| /v2/completions | Text generation with streaming | data: {"choices":[{"text":"token"}]} |

Basic Streaming

Chat Completions

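A minimal request sketch using curl; the base URL is a placeholder for your proxy deployment and the model name is illustrative. The -N flag disables output buffering so tokens print as they arrive:

```bash
curl -N https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```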

Response:
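An illustrative chunk sequence; exact fields vary by provider, and the stream terminates with data: [DONE]:

```
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"choices":[{"index":0,"delta":{"content":"Once"}}]}

data: {"choices":[{"index":0,"delta":{"content":" upon"}}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```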

Text Completions

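The same pattern applies to plain text completions; again, the base URL and model are placeholders. Chunks arrive in the data: {"choices":[{"text":"token"}]} format shown in the table above:

```bash
curl -N https://your-proxy.example.com/v2/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "stream": true
  }'
```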

Advanced Streaming Features

Streaming with Tool Calls

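A sketch of a streaming request that includes a tool definition; get_weather is a hypothetical function used purely for illustration:

```bash
curl -N https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "stream": true,
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```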

Response includes tool call chunks:
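Function arguments arrive incrementally across delta chunks and must be concatenated client-side. The chunks below are illustrative:

```
data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_abc123","type":"function","function":{"name":"get_weather","arguments":""}}]}}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"city\":"}}]}}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" \"Paris\"}"}}]}}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}

data: [DONE]
```

Concatenating the arguments fragments yields the complete JSON payload {"city": "Paris"}.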

Multi-Provider Streaming

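Because the proxy normalizes every provider's stream to the same SSE format, switching providers is a one-line change to the model field; client code stays identical. The model identifiers below are illustrative:

```bash
# Anthropic Claude: same endpoint, same chunk format
curl -N https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-3-5-sonnet", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'

# Google Gemini: identical client code, different model value
curl -N https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini-2.0-flash", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
```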

Implementation Examples

JavaScript/Node.js

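A minimal client sketch for Node.js 18+ using the built-in fetch and a hand-rolled SSE parser; the base URL and model are placeholders:

```javascript
// Minimal streaming client for Node.js 18+ using the built-in fetch.
// Base URL and model are placeholders for your deployment.
async function streamChat(prompt) {
  const response = await fetch("https://your-proxy.example.com/v2/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  const decoder = new TextDecoder();
  let buffer = "";

  for await (const chunk of response.body) {
    buffer += decoder.decode(chunk, { stream: true });

    // SSE events are separated by blank lines; the last element may be
    // an incomplete event, so keep it in the buffer for the next chunk.
    const events = buffer.split("\n\n");
    buffer = events.pop();

    for (const event of events) {
      const data = event.replace(/^data: /, "").trim();
      if (!data || data === "[DONE]") continue;
      const delta = JSON.parse(data).choices[0]?.delta?.content;
      if (delta) process.stdout.write(delta);
    }
  }
}

streamChat("Tell me a story").catch(console.error);
```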

Python

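An equivalent sketch in Python using the requests library; the same placeholder caveats apply:

```python
# Minimal Python streaming client using the requests library.
# Base URL and model are placeholders for your deployment.
import json
import os

import requests

def stream_chat(prompt: str) -> None:
    response = requests.post(
        "https://your-proxy.example.com/v2/chat/completions",
        headers={
            "Authorization": f"Bearer {os.environ['API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,  # keep the connection open and yield bytes as they arrive
    )
    response.raise_for_status()

    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip SSE keep-alive blank lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if "content" in delta:
            print(delta["content"], end="", flush=True)

stream_chat("Tell me a story")
```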

React Streaming Component

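A minimal component sketch that appends tokens to state as they arrive and exposes a cancel button via AbortController. The endpoint path and model are placeholders, and auth is assumed to be handled by the backend:

```jsx
// Minimal React component that renders tokens as they stream in.
import { useRef, useState } from "react";

export function StreamingChat() {
  const [output, setOutput] = useState("");
  const [busy, setBusy] = useState(false);
  const abortRef = useRef(null);

  async function send(prompt) {
    setOutput("");
    setBusy(true);
    abortRef.current = new AbortController();

    try {
      const response = await fetch("/v2/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "gpt-4o",
          messages: [{ role: "user", content: prompt }],
          stream: true,
        }),
        signal: abortRef.current.signal,
      });

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const events = buffer.split("\n\n");
        buffer = events.pop(); // keep any partial event for the next read
        for (const event of events) {
          const data = event.replace(/^data: /, "").trim();
          if (!data || data === "[DONE]") continue;
          const delta = JSON.parse(data).choices[0]?.delta?.content;
          if (delta) setOutput((prev) => prev + delta);
        }
      }
    } catch (err) {
      if (err.name !== "AbortError") throw err; // user cancellation is expected
    } finally {
      setBusy(false);
    }
  }

  return (
    <div>
      <button onClick={() => send("Tell me a story")} disabled={busy}>Ask</button>
      <button onClick={() => abortRef.current?.abort()} disabled={!busy}>Stop</button>
      <pre>{output}</pre>
    </div>
  );
}
```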

Provider-Specific Streaming

OpenAI & Compatible Providers

  • Supports all streaming parameters (stream, stream_options); see the example after this list
  • Compatible with Groq, Perplexity, NVIDIA, TogetherAI, etc.
  • Tool calling streams function arguments incrementally
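For example, stream_options can request a final usage-accounting chunk. This sketch assumes the proxy passes the parameter through unchanged, as it does for other OpenAI-compatible streaming parameters:

```bash
curl -N https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'
```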

Anthropic Claude

  • Automatic conversion from Anthropic's streaming format
  • Preserves Claude's reasoning tokens in the stream
  • Maintains message structure compatibility

Google AI (Gemini)

  • Converts Google's streaming format to an OpenAI-compatible one
  • Handles Gemini's candidate structure automatically
  • Supports streaming with function calling

Error Handling in Streams

Network Interruption Recovery

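One possible recovery pattern is to retry the stream with exponential backoff when the connection drops mid-response; streamChat here refers to the Node.js example above:

```javascript
// Retry a dropped stream with exponential backoff. A fuller implementation
// might pass the text accumulated so far back as context on each retry.
async function streamWithRetry(prompt, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await streamChat(prompt);
    } catch (err) {
      if (attempt === maxRetries) throw err;
      const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
      console.warn(`Stream interrupted (${err.message}); retrying in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```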

Handling Malformed Chunks

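Providers occasionally emit partial or non-JSON lines. A defensive parser buffers incomplete events and skips anything that fails to parse rather than aborting the whole stream; a sketch:

```javascript
// Defensive SSE parsing: buffer partial events and skip anything that
// fails to parse instead of crashing the whole stream.
function parseSSEChunk(buffer, rawChunk, onDelta) {
  buffer += rawChunk;
  const events = buffer.split("\n\n");
  const remainder = events.pop(); // possibly incomplete; carry it forward

  for (const event of events) {
    const data = event.replace(/^data: /, "").trim();
    if (!data || data === "[DONE]") continue;
    try {
      const delta = JSON.parse(data).choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);
    } catch {
      // Malformed JSON: log and move on rather than aborting the stream.
      console.warn("Skipping malformed chunk:", data.slice(0, 80));
    }
  }
  return remainder; // pass back in as `buffer` on the next call
}
```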

Best Practices

Performance Optimization

  • Buffer Management: Process chunks in batches to avoid UI lag (see the sketch after this list)
  • Memory Usage: Clear processed chunks to prevent memory leaks
  • Rate Limiting: Implement client-side throttling for rapid updates
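As a concrete example of buffer management, a browser-side sketch that coalesces streamed tokens into at most one render per animation frame (this assumes a browser environment where requestAnimationFrame is available):

```javascript
// Batch streamed tokens into one UI update per animation frame
// instead of re-rendering on every chunk.
function createFrameBatcher(render) {
  let pending = "";
  let scheduled = false;
  return function push(token) {
    pending += token;
    if (!scheduled) {
      scheduled = true;
      requestAnimationFrame(() => {
        render(pending); // flush all text accumulated this frame
        pending = "";
        scheduled = false;
      });
    }
  };
}
```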

User Experience

  • Loading Indicators: Show typing indicators during streaming
  • Cancellation: Allow users to stop generation early
  • Error Recovery: Gracefully handle stream interruptions

Security Considerations

  • Input Validation: Validate streaming parameters
  • Rate Limiting: Implement per-user streaming limits
  • Content Filtering: Apply real-time content moderation

Troubleshooting

Common Issues

Stream Never Ends
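A stream that never terminates usually means data: [DONE] was lost to a dropped connection or never sent by the provider. One guard is an idle timeout that aborts the request when no chunk arrives for a while; a sketch using AbortController:

```javascript
// Abort the fetch if no chunk arrives within idleMs.
// Pass watchdog.signal to fetch() and call watchdog.touch() on every chunk.
function makeIdleWatchdog(idleMs = 30000) {
  const controller = new AbortController();
  let timer = setTimeout(() => controller.abort(), idleMs);
  return {
    signal: controller.signal,
    touch() {
      // A chunk arrived: restart the countdown.
      clearTimeout(timer);
      timer = setTimeout(() => controller.abort(), idleMs);
    },
    stop() {
      clearTimeout(timer); // call after [DONE] to avoid a stray abort
    },
  };
}
```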

Missing Content Chunks

  • Ensure proper UTF-8 decoding
  • Handle partial JSON chunks correctly
  • Implement chunk buffering for incomplete data

Provider-Specific Errors

  • Check provider status endpoints
  • Implement provider-specific retry logic
  • Monitor rate limit headers in error responses

Next Steps