Max Tokens & Context Window
There is often some confusion between max tokens and the context window. This article explains and differentiates the two.
Context window
The context window is the maximum number of tokens the model can take as input. For example, you can attach a book of 10,000 tokens in an LLM call. Each model has a different context window:
| Model | Context window (tokens) |
| --- | --- |
| Gemini 1.5 Flash | 1,000,000 |
| Claude 3 (all variants) | 200,000 |
| Gemini 1.5 Pro | 128,000 |
| GPT-4 Turbo | 128,000 |
| GPT-4o and GPT-4o mini | 128,000 |
| GPT-4 32k | 32,000 |
| Gemini Pro | 32,000 |
| Mistral Large | 32,000 |
| GPT-4 | 8,000 |
| Llama 3 (all variants) | 8,000 |
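As a quick sanity check, you can count a prompt's tokens before sending it. Below is a minimal sketch using OpenAI's tiktoken library; tokenization differs per provider, so the count is only exact for OpenAI models, and the limits dictionary is illustrative, taken from the table above.

```python
import tiktoken  # OpenAI's tokenizer; other providers tokenize differently

# Illustrative limits, taken from the table above
CONTEXT_WINDOWS = {
    "gpt-4-turbo": 128_000,
    "gpt-4": 8_000,
}

def fits_in_context(prompt: str, model: str) -> bool:
    """Return True if the prompt's token count fits in the model's context window."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(prompt)) <= CONTEXT_WINDOWS[model]

# Roughly 5,000 tokens, which fits in GPT-4's 8,000-token window
print(fits_in_context("Call me Ishmael. " * 1000, "gpt-4"))
```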
Max tokens
Max tokens is the maximum number of tokens the model can use to generate a response.
This means that if you set the max tokens parameter to 256, the model will never generate more than 256 tokens. If the model would otherwise produce a longer output, the response is cut off, resulting in an incomplete generation.
Providing the model with an adequate max tokens value is therefore crucial for generating a complete answer.
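For example, here is how a truncated generation shows up with the OpenAI Python SDK (a minimal sketch; other providers report the cutoff through a similar stop or finish reason field):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the plot of Moby-Dick."}],
    max_tokens=256,  # the model will never generate more than 256 tokens
)

choice = response.choices[0]
# finish_reason == "length" means the output hit the max tokens limit
if choice.finish_reason == "length":
    print("Incomplete generation:", choice.message.content)
else:
    print(choice.message.content)
```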
If you want to influence the model to generate shorter answers, specify this in the system prompt rather than lowering max tokens, since a low max tokens value truncates the response instead of shortening it.
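For instance, a brevity instruction in the system prompt (the exact wording below is just an illustration) produces a complete but short answer:

```python
# Asking for brevity in the system prompt shortens the answer itself,
# whereas a low max tokens value merely cuts off a longer answer.
messages = [
    {"role": "system", "content": "Answer in at most two sentences."},
    {"role": "user", "content": "Explain what a context window is."},
]
```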
Max tokens is configurable in Orq; the context window is not, since it is fixed per model.