Max tokens & context window

There is often some confusion between max tokens and the context window. This article explains and differentiates the two.

Context window

The context window is the maximum number of tokens the model can take as input. For example, you could attach a 10,000-token book to an LLM call. Each model has a different context window.
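To check whether an input fits, you can count its tokens before sending the call. Below is a minimal sketch using the tiktoken tokenizer library; the file name is a placeholder, and the 128,000-token limit is taken from the table below as an example.

```python
# Sketch: count prompt tokens with tiktoken before an LLM call,
# to verify the input fits within the model's context window.
import tiktoken

CONTEXT_WINDOW = 128_000  # e.g. GPT-4 Turbo, per the table below

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 models

prompt = open("book.txt").read()  # hypothetical long document
n_tokens = len(encoding.encode(prompt))

if n_tokens > CONTEXT_WINDOW:
    print(f"Prompt is {n_tokens} tokens; it exceeds the context window.")
else:
    print(f"Prompt is {n_tokens} tokens; it fits.")
```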


| Model | Context window (tokens) |
| --- | --- |
| Gemini 1.5 Flash | 1,000,000 |
| Claude 3 (all variants) | 200,000 |
| Gemini 1.5 Pro | 128,000 |
| GPT-4 Turbo | 128,000 |
| GPT-4o and 4o-mini | 128,000 |
| GPT-4 32k | 32,000 |
| Gemini Pro | 32,000 |
| Mistral Large | 32,000 |
| GPT-4 | 8,000 |
| Llama 3 (all variants) | 8,000 |

Max tokens

Max tokens is the maximum number of tokens the model is allowed to use for its response.

This means that if you set the max tokens parameter to 256, the model will never generate more than 256 tokens. If the model would otherwise produce a longer answer, the output is cut off mid-generation, resulting in an incomplete response.
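As an illustration, here is a sketch using the OpenAI Python SDK (the model name and prompt are placeholders): the output is capped at 256 tokens, and the finish_reason field reveals whether the response was truncated by that cap.

```python
# Sketch: cap output at 256 tokens and detect truncation.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the history of Rome."}],
    max_tokens=256,  # the model will never generate more than 256 tokens
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The generation hit the max tokens limit: the answer is incomplete.
    print("Warning: output was truncated")
print(choice.message.content)
```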

Providing the model with an adequate max tokens value is crucial for generating a complete answer.

If you want the model to generate shorter answers, you can specify this in the system prompt instead of lowering max tokens, as shown in the sketch below.
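This way the answer ends naturally rather than being cut off mid-sentence. A short sketch, again with the OpenAI SDK and placeholder prompts:

```python
# Sketch: steer answer length via the system prompt rather than max_tokens.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer in at most two sentences."},
        {"role": "user", "content": "Summarize the history of Rome."},
    ],
)
print(response.choices[0].message.content)
```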

Max tokens is configurable in Orq; the context window is not, because it is fixed by the model.