Streaming
Overview
Who is this for? Developers building conversational AI applications who need real-time, token-by-token responses for better user experience.
What you'll achieve: Implement streaming responses that display AI-generated content as it's being produced, creating more engaging and responsive applications.
The AI Proxy supports Server-Sent Events (SSE) streaming for both chat completions and text completions across all supported providers, with automatic chunk combination and error handling.
Supported Streaming Types
| Endpoint | Description | Response Format |
|---|---|---|
| `/v2/chat/completions` | Conversational AI with streaming | `data: {"choices":[{"delta":{"content":"token"}}]}` |
| `/v2/completions` | Text generation with streaming | `data: {"choices":[{"text":"token"}]}` |
Basic Streaming
Chat Completions
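A minimal request sketch using curl. The base URL (`https://your-proxy.example.com`), bearer-token header, and model name are placeholders for your deployment; setting `"stream": true` switches the response to SSE.

```bash
curl https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true
  }'
```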
Response:
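Illustrative SSE output (IDs and token boundaries vary). Each `data:` line carries one chunk; following the OpenAI convention, the stream ends with a `data: [DONE]` sentinel.

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" upon a time"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```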
Text Completions
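The same pattern applies to `/v2/completions`; chunks carry `choices[0].text` instead of a `delta`. A sketch with the same placeholder base URL and key:

```bash
curl https://your-proxy.example.com/v2/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": true
  }'
```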
Advanced Streaming Features
Streaming with Tool Calls
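A sketch combining `"stream": true` with a tool definition; the `get_weather` function is hypothetical.

```bash
curl https://your-proxy.example.com/v2/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "stream": true,
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```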
Response includes tool call chunks:
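Function arguments arrive incrementally and must be concatenated across chunks before parsing. Illustrative output:

```
data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_abc123","type":"function","function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"city\": \"Pa"}}]},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"ris\"}"}}]},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}

data: [DONE]
```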
Multi-Provider Streaming
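The request shape is identical across providers; only the `model` value changes. The model identifiers below are illustrative; substitute the IDs your proxy exposes.

```bash
for MODEL in "gpt-4o" "claude-3-5-sonnet" "gemini-1.5-pro"; do
  curl https://your-proxy.example.com/v2/chat/completions \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"stream\": true}"
done
```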
Implementation Examples
JavaScript/Node.js
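A minimal consumer sketch for Node.js 18+ using the built-in `fetch`. It buffers partial lines between chunks so JSON is only parsed on complete SSE lines; the base URL and `API_KEY` environment variable are placeholders.

```javascript
const BASE_URL = "https://your-proxy.example.com"; // placeholder

async function streamChat(messages) {
  const response = await fetch(`${BASE_URL}/v2/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o", messages, stream: true }),
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  const decoder = new TextDecoder();
  let buffer = "";

  // response.body is async-iterable in Node 18+
  for await (const chunk of response.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep the partial trailing line for the next chunk

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice(6).trim();
      if (payload === "[DONE]") return;
      const delta = JSON.parse(payload).choices[0]?.delta;
      if (delta?.content) process.stdout.write(delta.content);
    }
  }
}

streamChat([{ role: "user", content: "Tell me a short story." }]).catch(console.error);
```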
Python
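An equivalent sketch with the `requests` library (`pip install requests`); same placeholder base URL and key.

```python
import json
import os

import requests

BASE_URL = "https://your-proxy.example.com"  # placeholder


def stream_chat(messages):
    response = requests.post(
        f"{BASE_URL}/v2/chat/completions",
        headers={
            "Authorization": f"Bearer {os.environ['API_KEY']}",
            "Content-Type": "application/json",
        },
        json={"model": "gpt-4o", "messages": messages, "stream": True},
        stream=True,  # keep the connection open and iterate over chunks
    )
    response.raise_for_status()

    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            print(delta["content"], end="", flush=True)


stream_chat([{"role": "user", "content": "Tell me a short story."}])
```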
React Streaming Component
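A component sketch that appends tokens as they arrive and supports early cancellation via `AbortController`. It assumes the proxy is reachable at a same-origin `/v2/chat/completions` path (e.g., behind a server-side route that injects credentials).

```jsx
import { useRef, useState } from "react";

export function StreamingChat() {
  const [output, setOutput] = useState("");
  const [streaming, setStreaming] = useState(false);
  const abortRef = useRef(null);

  async function run(prompt) {
    const controller = new AbortController();
    abortRef.current = controller;
    setOutput("");
    setStreaming(true);
    try {
      const res = await fetch("/v2/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "gpt-4o",
          messages: [{ role: "user", content: prompt }],
          stream: true,
        }),
        signal: controller.signal,
      });
      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop(); // keep the partial line for the next read
        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          const payload = line.slice(6).trim();
          if (payload === "[DONE]") continue;
          const delta = JSON.parse(payload).choices[0]?.delta;
          if (delta?.content) setOutput((prev) => prev + delta.content);
        }
      }
    } catch (err) {
      if (err.name !== "AbortError") throw err; // ignore user cancellation
    } finally {
      setStreaming(false);
    }
  }

  return (
    <div>
      <button onClick={() => run("Tell me a short story.")}>Run</button>
      <button disabled={!streaming} onClick={() => abortRef.current?.abort()}>
        Stop
      </button>
      <pre>{output}</pre>
    </div>
  );
}
```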
Provider-Specific Streaming
OpenAI & Compatible Providers
- Supports all streaming parameters (`stream`, `stream_options`)
- Compatible with Groq, Perplexity, NVIDIA, TogetherAI, etc.
- Tool calling streams function arguments incrementally
Anthropic Claude
- Automatic conversion from Anthropic's streaming format
- Preserves Claude's reasoning tokens in the stream
- Maintains message structure compatibility
Google AI (Gemini)
- Converts Google's streaming format to the OpenAI-compatible format
- Handles Gemini's candidate structure automatically
- Supports streaming with function calling
Error Handling in Streams
Network Interruption Recovery
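Chat streams cannot be resumed mid-response, so the usual pattern is to retry the whole request with exponential backoff. A sketch, reusing the `streamChat` helper from the Node.js example above:

```javascript
async function streamWithRetry(messages, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await streamChat(messages);
    } catch (err) {
      if (attempt === maxRetries) throw err;
      const delay = 2 ** attempt * 1000; // 1s, 2s, 4s ...
      console.warn(`Stream interrupted (${err.message}); retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```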
Handling Malformed Chunks
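Defensive parsing sketch: skip anything that is not valid JSON rather than aborting the whole stream. Partial lines should already be handled by the line buffering shown earlier; this guards against anything that slips through.

```javascript
function parseChunk(line) {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice(6).trim();
  if (payload === "[DONE]") return null;
  try {
    return JSON.parse(payload);
  } catch {
    console.warn("Skipping malformed chunk:", payload.slice(0, 80));
    return null; // drop the bad chunk, keep streaming
  }
}
```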
Best Practices
Performance Optimization
- Buffer Management: Process chunks in batches to avoid UI lag (see the sketch after this list)
- Memory Usage: Clear processed chunks to prevent memory leaks
- Rate Limiting: Implement client-side throttling for rapid updates
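One way to batch, as a sketch: accumulate tokens and flush them to the UI on a fixed interval instead of re-rendering per token. The 50 ms cadence is illustrative.

```javascript
function createTokenBatcher(onFlush, intervalMs = 50) {
  let pending = "";
  const timer = setInterval(() => {
    if (pending) {
      onFlush(pending); // one UI update for the whole batch
      pending = "";
    }
  }, intervalMs);
  return {
    push(token) { pending += token; },
    stop() { clearInterval(timer); if (pending) onFlush(pending); },
  };
}
```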
User Experience
- Loading Indicators: Show typing indicators during streaming
- Cancellation: Allow users to stop generation early
- Error Recovery: Gracefully handle stream interruptions
Security Considerations
- Input Validation: Validate streaming parameters
- Rate Limiting: Implement per-user streaming limits
- Content Filtering: Apply real-time content moderation
Troubleshooting
Common Issues
Stream Never Ends
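First check that your parser recognizes the `data: [DONE]` sentinel. As a safety net, abort the request when no chunk arrives within an idle window; a sketch (the 30-second threshold is illustrative):

```javascript
async function streamWithIdleTimeout(url, init, idleMs = 30_000) {
  const controller = new AbortController();
  let timer = setTimeout(() => controller.abort(), idleMs);

  const response = await fetch(url, { ...init, signal: controller.signal });
  for await (const chunk of response.body) {
    clearTimeout(timer);
    timer = setTimeout(() => controller.abort(), idleMs); // reset on activity
    // ...decode and parse `chunk` as in the examples above...
  }
  clearTimeout(timer);
}
```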
Missing Content Chunks
- Ensure proper UTF-8 decoding
- Handle partial JSON chunks correctly
- Implement chunk buffering for incomplete data (see the sketch below)
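A buffering sketch covering all three points: `TextDecoder` with `stream: true` handles multi-byte UTF-8 sequences split across chunks, and the trailing partial line is held back until the next chunk completes it.

```javascript
const decoder = new TextDecoder("utf-8");
let buffer = "";

function onChunk(bytes, handleLine) {
  buffer += decoder.decode(bytes, { stream: true }); // safe across split UTF-8
  const lines = buffer.split("\n");
  buffer = lines.pop(); // hold back the incomplete trailing line
  lines.filter(Boolean).forEach(handleLine);
}
```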
Provider-Specific Errors
- Check provider status endpoints
- Implement provider-specific retry logic
- Monitor rate limit headers in error responses
Next Steps
- Tool Calling: Combine streaming with function calls
- Vision: Stream responses for image analysis
- Error Handling: Advanced retry strategies for streams