Attention Mechanism
An attention mechanism is a technique in neural networks that allows the model to focus on the most relevant parts of the input when producing an output.
The attention mechanism is a technique used primarily in deep learning models, especially for sequence-to-sequence tasks like machine translation, text summarization, and image captioning. It addresses a limitation of traditional models (like basic RNNs) that struggle with long input sequences because they compress all information into a fixed-size context vector. Attention instead allows the model to dynamically focus on specific parts of the input sequence when generating each part of the output sequence.

Architecturally, it involves calculating "attention scores" between the current state of the decoder and each element of the encoded input sequence. These scores are then normalized (often using a softmax function) to produce "attention weights". A weighted sum of the input elements, using these weights, forms a context vector that is specific to the current decoding step. This allows the model to "attend" to the most relevant input information, improving performance on tasks requiring understanding of long-range dependencies.

Trade-offs include increased computational complexity and memory usage compared to non-attentional models, but the gains in accuracy and the ability to handle longer sequences are often significant.
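The score → softmax → weighted-sum pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration using dot-product scoring, with made-up toy vectors; a real model would learn the encoder and decoder states, and the function names here are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_states):
    """Context vector = attention-weighted sum of encoder states.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (T, d) one hidden state per input position
    """
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    weights = softmax(scores)                 # (T,) attention weights, sum to 1
    context = weights @ encoder_states        # (d,) weighted sum of inputs
    return context, weights

# Toy example: 4 input positions, hidden size 3
enc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
query = np.array([2.0, 0.0, 0.0])
context, weights = dot_product_attention(query, enc)
print(weights.round(3))  # positions most aligned with the query get the most weight
```

Note that the context vector is recomputed at every decoding step, because the query (the decoder state) changes; that is what makes the retrieval dynamic rather than a fixed summary.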
graph LR
Center["Attention Mechanism"]:::main
Pre_deep_learning["deep-learning"]:::pre --> Center
click Pre_deep_learning "/terms/deep-learning"
Pre_linear_algebra["linear-algebra"]:::pre --> Center
click Pre_linear_algebra "/terms/linear-algebra"
Rel_transformer_architecture["transformer-architecture"]:::related -.-> Center
click Rel_transformer_architecture "/terms/transformer-architecture"
Rel_context_window["context-window"]:::related -.-> Center
click Rel_context_window "/terms/context-window"
classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
linkStyle default stroke:#4b5563,stroke-width:2px;
🧒 Explain Like I'm 5
It's like when you're reading a long story and you look back at a specific sentence you read earlier to help you understand what's happening right now.
🤓 Expert Deep Dive
Attention mechanisms fundamentally enhance sequence modeling by enabling variable-length, context-dependent information retrieval. Architecturally, they introduce a query-key-value (QKV) paradigm: the decoder's current state typically serves as the query, while the encoder's hidden states act as keys and values. The alignment score between the query and each key determines the attention weight via a scoring function (e.g., dot product, additive). This weight distribution allows the model to compute a context vector as a weighted sum of values, effectively retrieving relevant information.

Self-attention, as used in Transformers, applies this mechanism not just between encoder and decoder but also within the input sequence itself, allowing the model to weigh the importance of different words relative to each other. This parallelizable computation and the ability to capture long-range dependencies without recurrence are key architectural advantages. Vulnerabilities can arise from adversarial attacks that manipulate attention patterns, or from biases learned from skewed data distributions.
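The QKV formulation above can be sketched as single-head scaled dot-product self-attention, where every position queries every other position in the same sequence. This is an illustrative NumPy sketch with random stand-in weights; in a real Transformer, Wq, Wk, and Wv are learned parameters, and multiple heads run in parallel.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (T, d_model) input sequence
    Wq, Wk, Wv: (d_model, d_k) projection matrices (learned in practice)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T) pairwise alignment scores
    # Row-wise softmax: each row is a distribution over input positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V                # (T, d_k) one context vector per position

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4): every position gets its own context vector
```

The scaling by the square root of d_k keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot regions with vanishing gradients. Note that all T context vectors are computed in one matrix product, which is the parallelism advantage over recurrence mentioned above.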