Max tokens & context window

There is often some confusion between max tokens and the context window. This article explains and differentiates the two.

Context window

The context window is the maximum number of tokens the model can take as input. For example, you could attach a 10,000-token book to an LLM call. Each model has a different context window.
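To check whether an input fits, you can count its tokens before sending the call. Below is a minimal sketch using the tiktoken tokenizer library; the file name is a placeholder, and the 128,000-token limit is taken from the table below as an example.

```python
# Sketch: count prompt tokens with tiktoken before an LLM call,
# to verify the input fits within the model's context window.
import tiktoken

CONTEXT_WINDOW = 128_000  # e.g. GPT-4 Turbo, per the table below

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 models

prompt = open("book.txt").read()  # hypothetical long document
n_tokens = len(encoding.encode(prompt))

if n_tokens > CONTEXT_WINDOW:
    print(f"Prompt is {n_tokens} tokens; it exceeds the context window.")
else:
    print(f"Prompt is {n_tokens} tokens; it fits.")
```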


| Model | Context window (tokens) |
| --- | --- |
| Gemini 1.5 Flash | 1,000,000 |
| Claude 3 (all variants) | 200,000 |
| Gemini 1.5 Pro | 128,000 |
| GPT-4 Turbo | 128,000 |
| GPT-4o and 4o-mini | 128,000 |
| GPT-4 32k | 32,000 |
| Gemini Pro | 32,000 |
| Mistral Large | 32,000 |
| GPT-4 | 8,000 |
| Llama 3 (all variants) | 8,000 |

Max tokens

Max tokens is the maximum number of tokens the model is allowed to use for its response.

This means that if you set the max tokens parameter to 256, the model will never generate more than 256 tokens. If the model would otherwise produce a longer answer, the output is cut off mid-generation, resulting in an incomplete response.
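As an illustration, here is a sketch using the OpenAI Python SDK (the model name and prompt are placeholders): the output is capped at 256 tokens, and the finish_reason field reveals whether the response was truncated by that cap.

```python
# Sketch: cap output at 256 tokens and detect truncation.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the history of Rome."}],
    max_tokens=256,  # the model will never generate more than 256 tokens
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The generation hit the max tokens limit: the answer is incomplete.
    print("Warning: output was truncated")
print(choice.message.content)
```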

Providing the model with an adequate max tokens value is crucial for generating a complete answer.

If you want the model to generate shorter answers, you can specify this in the system prompt instead of lowering max tokens, as shown in the sketch below.
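This way the answer ends naturally rather than being cut off mid-sentence. A short sketch, again with the OpenAI SDK and placeholder prompts:

```python
# Sketch: steer answer length via the system prompt rather than max_tokens.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer in at most two sentences."},
        {"role": "user", "content": "Summarize the history of Rome."},
    ],
)
print(response.choices[0].message.content)
```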

Max tokens is configurable in Orq; the context window is not, because it is fixed by the model.