18. Transformer
Overview
The Transformer is the architecture proposed in "Attention Is All You Need" (Vaswani et al., 2017) and is the foundation of most modern deep learning models for sequence data. It processes sequences using only self-attention, with no recurrence (RNNs) or convolutions.
Learning Objectives
- Self-Attention: Understanding Query, Key, Value operations
- Multi-Head Attention: Parallel processing of multiple attention heads
- Positional Encoding: Injecting position information
- Encoder-Decoder: Overall architecture structure
Mathematical Background
1. Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
- Q (Query): what the current position is looking for
- K (Key): what each position offers to be matched against
- V (Value): the actual content that gets retrieved
- d_k: dimension of the Key vectors (used as the scaling factor)
Formula breakdown:
1. QK^T: compute the similarity between every Query and every Key → a (seq_len, seq_len) score matrix
2. / √d_k: scale the scores down so they do not grow with d_k (keeps softmax in a stable range)
3. softmax: turn each row of scores into a probability distribution over positions
4. × V: take the probability-weighted average of the Value vectors
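Below is a minimal PyTorch sketch of this formula; the function and variable names are illustrative, not taken from this chapter's files.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); mask is True where attention should be blocked
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq_len, seq_len) similarities
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))   # blocked positions get ~zero weight
    weights = F.softmax(scores, dim=-1)                    # probability distribution over positions
    return weights @ v, weights                            # weighted average of the Values

# Example: batch of 1, 5 tokens, d_k = 64
q = k = v = torch.randn(1, 5, 64)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])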
2. Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
Features:
- Learn attention from multiple "perspectives"
- Each head captures different patterns
- Can be parallelized
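A minimal Multi-Head Attention sketch, reusing the scaled_dot_product_attention function from the sketch above; the class and layer names are illustrative, not this repo's actual API.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q (all heads packed into one matrix)
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)
        def split(x, proj):
            # Project, then reshape to (batch, num_heads, seq_len, d_head)
            return proj(x).view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out, _ = scaled_dot_product_attention(q, k, v, mask)   # all heads attend in parallel
        # Concatenate heads back to (batch, seq_len, d_model) and apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_head)
        return self.w_o(out)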
3. Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Purpose:
- Self-attention is permutation-invariant, so the Transformer has no built-in notion of token order
- Positional encoding explicitly injects position information into the embeddings
- Sinusoidal encodings require no training and can extrapolate to sequence lengths not seen during training
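A minimal sketch of how the sinusoidal table can be built (the function name is illustrative):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)    # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)             # even dimensions 0, 2, 4, ...
    div = torch.exp(-math.log(10000.0) * i / d_model)                # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                               # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)                               # PE(pos, 2i+1)
    return pe                                                        # (max_len, d_model)

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# The table is added (not concatenated) to the token embeddings: x = token_emb + pe[:seq_len]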
File Structure
07_Transformer/
├── README.md
├── pytorch_lowlevel/
│   ├── attention_lowlevel.py    # Basic Attention implementation
│   ├── multihead_attention.py   # Multi-Head Attention
│   ├── positional_encoding.py   # Positional encoding
│   └── transformer_lowlevel.py  # Complete Transformer
├── paper/
│   ├── transformer_paper.py     # Paper reproduction
│   └── transformer_xl.py        # Transformer-XL variant
└── exercises/
    ├── 01_flash_attention.md    # Flash Attention implementation
    ├── 02_rotary_embeddings.md  # RoPE implementation
    └── 03_kv_cache.md           # KV Cache implementation
Core Concepts
1. Self-Attention vs Cross-Attention
Self-Attention:
- Q, K, V all come from the same sequence
- Used inside both the Encoder and the Decoder

Cross-Attention:
- Q comes from the Decoder; K and V come from the Encoder output
- Connects the Decoder to the Encoder
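A short illustration of where Q, K, V come from in each case, using torch.nn.MultiheadAttention; the tensor shapes are made up for the example.

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

enc_out = torch.randn(2, 10, 512)   # encoder output        (batch, src_len, d_model)
dec_h   = torch.randn(2, 7, 512)    # decoder hidden states (batch, tgt_len, d_model)

# Self-attention: Q, K, V all come from the same sequence
self_out, _ = mha(dec_h, dec_h, dec_h)        # (2, 7, 512)

# Cross-attention: Q from the Decoder, K and V from the Encoder output
cross_out, _ = mha(dec_h, enc_out, enc_out)   # (2, 7, 512)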
2. Masking
# Padding mask: ignore padding tokens
padding_mask = (input_ids == pad_token_id)  # (batch, seq_len), True where the token is padding

# Causal mask: prevent attending to future tokens
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()  # True above the diagonal
# True positions are filled with -inf in the attention scores before softmax,
# so they receive (near-)zero attention weight
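One common way to apply both masks at once, assuming scores is the pre-softmax attention score tensor with shape (batch, num_heads, seq_len, seq_len); this is a sketch, not code from this repo.

# Broadcast: causal (1, 1, seq, seq) OR padding (batch, 1, 1, seq) -> (batch, 1, seq, seq)
combined_mask = causal_mask[None, None, :, :] | padding_mask[:, None, None, :]
scores = scores.masked_fill(combined_mask, float("-inf"))  # the mask broadcasts over the head dimension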
3. Feed-Forward Network
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
Or (using GELU):
FFN(x) = GELU(xW_1)W_2
Features:
- Position-wise: applied independently to each position
- Expansion: usually 4x expansion (d_model → 4*d_model → d_model)
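A minimal position-wise FFN sketch; the class name and sizes are illustrative (d_ff = 2048 is the 4x expansion of d_model = 512 used in the paper).

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),    # expand: d_model -> 4 * d_model
            nn.ReLU(),                   # the paper uses ReLU; GELU is a common modern substitute
            nn.Linear(d_ff, d_model),    # project back: 4 * d_model -> d_model
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); applied independently at every position
        return self.net(x)

ffn = PositionwiseFFN()
y = ffn(torch.randn(2, 10, 512))   # (2, 10, 512)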
Practice Problems
Basic
- Directly implement Scaled Dot-Product Attention
- Visualize Positional Encoding
- Visualize Self-Attention patterns
Intermediate
- Implement Multi-Head Attention
- Complete Encoder block
- Complete Decoder block (including causal mask)
Advanced
- Optimize autoregressive generation with KV Cache
- Implement Flash Attention (memory efficient)
- Implement Rotary Position Embedding (RoPE)
References
- Vaswani et al. (2017). "Attention Is All You Need." NeurIPS 2017.
- Harvard NLP, "The Annotated Transformer"
- Jay Alammar, "The Illustrated Transformer"