Transformer

A Transformer is a deep learning model built on self-attention, a mechanism that weighs the importance of different parts of the input when processing each element; it excels at tasks such as natural language processing.

A Transformer, in the context of deep learning, is a neural network architecture distinguished by its reliance on the self-attention mechanism. Unlike Recurrent Neural Networks (RNNs), which process tokens one at a time, or Convolutional Neural Networks (CNNs), which rely on local receptive fields, Transformers process the entire input sequence in parallel, enabling them to capture long-range dependencies more effectively.

The core innovation is the self-attention mechanism, which allows the model to dynamically weigh the importance of different parts of the input sequence when producing an output for a specific part. For example, when translating a sentence, the model can determine which source words are most relevant to translating a particular target word.

The architecture typically comprises an encoder stack and a decoder stack. The encoder processes the input sequence and generates a rich contextual representation, while the decoder uses this representation, along with previously generated outputs, to produce the final output sequence. Positional encodings are crucial additions to the input embeddings, because the self-attention mechanism itself has no inherent notion of element order. Thanks to its scalability and performance, this architecture has become the de facto standard for many Natural Language Processing (NLP) tasks, including machine translation, text generation, and sentiment analysis.
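Since self-attention is order-agnostic, position information must be injected explicitly. A minimal NumPy sketch of the sinusoidal positional-encoding scheme from the original Transformer paper (function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings.

    Each position gets a unique vector of sines and cosines at
    geometrically spaced frequencies; the result is added to the
    token embeddings so the model can infer token order.
    """
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
# pe has shape (10, 16); position 0 encodes as [0, 1, 0, 1, ...]
```

Learned positional embeddings are an equally common alternative; the sinusoidal form has the advantage of extrapolating to sequence lengths unseen during training.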

```mermaid
graph LR
  Center["Transformer"]:::main
  Pre_neural_network["neural-network"]:::pre --> Center
  click Pre_neural_network "/terms/neural-network"
  Pre_linear_algebra["linear-algebra"]:::pre --> Center
  click Pre_linear_algebra "/terms/linear-algebra"
  Pre_deep_learning["deep-learning"]:::pre --> Center
  click Pre_deep_learning "/terms/deep-learning"
  Rel_transformer_architecture["transformer-architecture"]:::related -.-> Center
  click Rel_transformer_architecture "/terms/transformer-architecture"
  classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
  classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
  classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
  classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
  linkStyle default stroke:#4b5563,stroke-width:2px;
```

🧒 Explain Like I'm 5

It's like a super-smart reader that can look at all the words in a sentence at once and figure out which words are most important to understand the meaning of each individual word.

🤓 Expert Deep Dive

The Transformer model's success stems from its ability to model dependencies without regard to their distance in the input or output sequences. The self-attention mechanism computes a weighted sum of value vectors, where the weight assigned to each value is determined by the compatibility (dot product) of its corresponding key vector with a query vector. This allows for direct modeling of relationships between any two positions in the sequence. Multi-head attention further enhances this by allowing the model to jointly attend to information from different representation subspaces at different positions.

The encoder uses stacked self-attention and point-wise feed-forward layers, while the decoder adds masked self-attention (to prevent attending to future tokens) and encoder-decoder attention.

The absence of recurrence makes the model highly parallelizable, leading to faster training on modern hardware compared to RNNs. However, the quadratic complexity of self-attention with respect to sequence length ($O(n^2)$) remains a bottleneck for very long sequences, prompting research into more efficient variants.
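The query/key/value computation described above can be sketched in a few lines of NumPy. This is a single-head illustration under simplifying assumptions (Q, K, V are passed in directly rather than produced by learned projections; the function name is ours):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Computes softmax(Q K^T / sqrt(d_k)) V, optionally causally masked."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): one score per query-key pair
    if causal:
        # Mask out future positions so each query attends only to itself
        # and earlier tokens (the decoder's masked self-attention).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax (shifted by the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, model dimension 8
out, w = scaled_dot_product_attention(X, X, X, causal=True)
# each row of w sums to 1; with the causal mask, token 0 attends only to itself
```

The `scores` matrix is n x n, which is exactly where the $O(n^2)$ cost in sequence length comes from; multi-head attention simply runs several such computations on lower-dimensional projections and concatenates the results.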

🔗 Related Terms

📚 Sources