Context Window
The context window of a large language model (LLM) is the span of text the model can consider at once when generating a response; its size shapes the model's ability to understand long inputs and produce coherent output.
The context window of a Large Language Model (LLM) defines the maximum number of tokens (words, sub-words, or characters) the model can process simultaneously when generating output. The window acts as the model's short-term memory, encompassing the input prompt and any previously generated text. A larger context window lets the LLM retain more of a conversation or document, improving coherence, relevance, and its grasp of complex instructions or lengthy narratives. For instance, a model with a 4,096-token context window can "remember" roughly 3,000 words of text.

The LLM's architecture, particularly its attention mechanism (e.g., self-attention in Transformers), determines how efficiently it can use this window. Trade-offs exist: larger context windows require significantly more computational resources (memory and processing power) and can increase latency during inference. Models may also exhibit the "lost in the middle" phenomenon, where information in the middle of a very long context is used less effectively than information near its beginning or end. Techniques such as sliding-window attention, sparse attention, and retrieval-augmented generation (RAG) are employed to mitigate these limitations and extend effective context handling.
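A minimal sketch of the "short-term memory" behavior described above: when a conversation exceeds the token budget, the oldest text scrolls out of the window. The 4-characters-per-token heuristic and the `fit_to_window` helper are illustrative assumptions, not any real model's API; production systems count tokens with a model-specific tokenizer (e.g., BPE).

```python
CONTEXT_WINDOW = 4096  # tokens, matching the 4,096-token example above


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text (heuristic)."""
    return max(1, len(text) // 4)


def fit_to_window(messages: list[str], budget: int = CONTEXT_WINDOW) -> list[str]:
    """Keep the most recent messages whose combined token estimate fits the budget.

    Older messages are dropped first, mirroring how a model "forgets"
    text that no longer fits inside its context window.
    """
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["old note " * 50, "recent question?", "latest answer."]
trimmed = fit_to_window(history, budget=30)
# The long, oldest message exceeds the remaining budget and is dropped;
# only the two recent messages stay "in context".
```

The same budget logic is what chat frontends apply before each request: system prompt and recent turns are kept, older turns are truncated or summarized.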
```mermaid
graph LR
Center["Context Window"]:::main
Pre_computer_science["computer-science"]:::pre --> Center
click Pre_computer_science "/terms/computer-science"
Rel_large_language_model["large-language-model"]:::related -.-> Center
click Rel_large_language_model "/terms/large-language-model"
Rel_hallucination_ai["hallucination-ai"]:::related -.-> Center
click Rel_hallucination_ai "/terms/hallucination-ai"
Rel_token["token"]:::related -.-> Center
click Rel_token "/terms/token"
classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
linkStyle default stroke:#4b5563,stroke-width:2px;
```
🧒 Explain Like I'm 5
It's like the [LLM](/en/terms/llm)'s notepad; it can only remember what fits on the current page when writing its answer.
🤓 Expert Deep Dive
The context window's size, typically measured in tokens, is a critical architectural parameter that directly affects an LLM's ability to perform tasks requiring long-range dependencies. Transformer-based architectures, dominant in modern LLMs, employ self-attention mechanisms whose computational complexity scales quadratically ($O(N^2)$) with the sequence length $N$ (the context window size), making very large windows prohibitively expensive in memory and computation.

This cost has spurred research into efficient attention variants, such as sparse attention (e.g., Longformer, BigBird) and linear attention, as well as retrieval-augmented generation (RAG). RAG augments the LLM with an external knowledge retrieval system, effectively extending its accessible "context" beyond the fixed window by dynamically fetching relevant information. Architectural choices such as positional encodings (e.g., absolute, relative, rotary) also influence how well the model can interpret token positions within the window. Edge cases include catastrophic forgetting when fine-tuning on new data and the aforementioned "lost in the middle" problem, where attention scores may degrade for tokens positioned far from the context's beginning or end.