20. GPT
Overview
GPT (Generative Pre-trained Transformer) is an autoregressive language model developed by OpenAI. It generates text left-to-right, one token at a time, and its decoder-only design became the foundation of modern LLMs.
Mathematical Background
1. Causal Language Modeling
Objective function:
L = -Σ_t log P(x_t | x_<t)
Autoregressive factorization:
P(x_1, x_2, ..., x_n) = Π_{t=1}^{n} P(x_t | x_1, ..., x_{t-1})
Features:
- Cannot reference future tokens (causal mask)
- All tokens are training signals
- Natural for text generation
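The objective above is just next-token cross-entropy applied at every position. A minimal PyTorch sketch (tensor names and shapes are illustrative, not taken from this repo's code):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    # logits: (batch, seq_len, vocab_size), input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..n-1
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy example with random tensors
logits = torch.randn(2, 8, 100)
input_ids = torch.randint(0, 100, (2, 8))
print(causal_lm_loss(logits, input_ids))
```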
2. Causal Self-Attention
Standard Attention:
Attention(Q, K, V) = softmax(QK^T / √d) V
Causal Attention (future masking):
mask[i][j] = -∞ if j > i, else 0
Attention(Q, K, V) = softmax(QK^T / √d + mask) V
Matrix visualization:
Q\K | t1 t2 t3 t4
---------------------
t1 | ✓ × × ×
t2 | ✓ ✓ × ×
t3 | ✓ ✓ ✓ ×
t4 | ✓ ✓ ✓ ✓
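A minimal single-head sketch of causal attention in PyTorch, following the scaled dot-product formula above; function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d)
    d = q.size(-1)
    seq_len = q.size(1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq_len, seq_len)
    # Strictly upper-triangular positions (future tokens) are masked to -inf.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)           # row t attends only to positions ≤ t
    return weights @ v

q = k = v = torch.randn(1, 4, 16)
print(causal_attention(q, k, v).shape)  # torch.Size([1, 4, 16])
```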
3. GPT vs BERT
BERT (Bidirectional):
- Masked LM: 15% masking
- Bidirectional context
- Strong at classification/understanding tasks
GPT (Autoregressive):
- Causal LM: predict next token
- Left context only
- Strong at generation tasks
GPT-2 Architecture
| Model | Parameters | Hidden size | Layers | Attention heads |
|---|---|---|---|---|
| GPT-2 Small | 117M | 768 | 12 | 12 |
| GPT-2 Medium | 345M | 1024 | 24 | 16 |
| GPT-2 Large | 774M | 1280 | 36 | 20 |
| GPT-2 XL | 1.5B | 1600 | 48 | 25 |
Structure:
Token Embedding + Position Embedding
↓
Transformer Decoder × L layers (Pre-LN)
↓
Layer Norm
↓
LM Head (shared with embedding)
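The sizes above can be captured in a small config object. A sketch (field names follow common convention and are not tied to this repo's code):

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    vocab_size: int = 50257            # GPT-2 BPE vocabulary
    max_position_embeddings: int = 1024
    hidden_size: int = 768
    num_layers: int = 12
    num_heads: int = 12

# Illustrative mapping of the published GPT-2 sizes to this config
GPT2_SIZES = {
    "small":  GPT2Config(hidden_size=768,  num_layers=12, num_heads=12),
    "medium": GPT2Config(hidden_size=1024, num_layers=24, num_heads=16),
    "large":  GPT2Config(hidden_size=1280, num_layers=36, num_heads=20),
    "xl":     GPT2Config(hidden_size=1600, num_layers=48, num_heads=25),
}
```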
File Structure
09_GPT/
├── README.md
├── pytorch_lowlevel/
│ └── gpt_lowlevel.py # Direct GPT Decoder implementation
├── paper/
│ └── gpt2_paper.py # GPT-2 paper reproduction
└── exercises/
├── 01_text_generation.md # Text generation practice
└── 02_kv_cache.md # KV Cache implementation
Core Concepts
1. Pre-LN vs Post-LN
Post-LN (original Transformer):
x → Attention → Add → LayerNorm → FFN → Add → LayerNorm
Pre-LN (GPT-2):
x → LayerNorm → Attention → Add → LayerNorm → FFN → Add
Pre-LN advantages:
- Improved training stability
- Enables deeper networks
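A minimal sketch of a Pre-LN decoder block in PyTorch, using nn.MultiheadAttention for brevity instead of a hand-written attention module; module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        # Pre-LN: normalize *before* each sublayer, then add the residual.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x

block = PreLNBlock(hidden_size=768, num_heads=12)
x = torch.randn(1, 4, 768)
print(block(x).shape)  # torch.Size([1, 4, 768])
```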
2. Weight Tying
Share weights between the token Embedding and the LM Head:
E = Embedding matrix (vocab_size × hidden_size)
LM head: logits = h E^T (the same matrix E is reused as the output projection)
Advantages:
- Saves parameters
- Learns consistent representations
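A minimal sketch of weight tying in PyTorch; variable names are illustrative:

```python
import torch.nn as nn

vocab_size, hidden_size = 50257, 768
token_embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Both weights have shape (vocab_size, hidden_size), so the same tensor can be shared.
lm_head.weight = token_embedding.weight

# The embedding and the output projection now update the same parameters,
# saving vocab_size * hidden_size ≈ 38.6M parameters for GPT-2 Small.
```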
3. Generation Strategies
Greedy: argmax P(x_t | x_<t)
- Deterministic; prone to repetition
Sampling: x_t ~ P(x_t | x_<t)
- Diverse, but quality can degrade
Top-K: sample from the K most probable tokens
- Balances quality and diversity
Top-P (Nucleus): sample from the smallest set of tokens whose cumulative probability exceeds P
- Candidate set size adapts to the distribution
Temperature: softmax(logits / T)
- T < 1: more deterministic
- T > 1: more diverse
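A sketch of one decoding step that combines temperature, Top-K, and Top-P filtering; function and parameter names are illustrative:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    # logits: (batch, vocab_size) for the last position
    logits = logits / temperature

    if top_k > 0:
        # Keep only the k largest logits; mask the rest to -inf.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    if top_p < 1.0:
        # Nucleus sampling: keep the smallest set of tokens whose cumulative
        # probability exceeds top_p (the token crossing the threshold is kept).
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cum_probs = torch.cumsum(sorted_probs, dim=-1)
        remove = cum_probs - sorted_probs > top_p
        sorted_logits[remove] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

vocab_logits = torch.randn(1, 100)
print(sample_next_token(vocab_logits, temperature=0.8, top_k=50, top_p=0.9))
```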
Implementation Levels
Level 2: PyTorch Low-Level (pytorch_lowlevel/)
- Direct Causal Attention implementation
- Pre-LN structure
- Text generation function
Level 3: Paper Implementation (paper/)
- Exact GPT-2 specifications
- WebText style training
- Various generation strategies
Level 4: Code Analysis (separate document)
- Analyze HuggingFace GPT2
- Analyze nanoGPT code
Learning Checklist
- [ ] Implement causal mask
- [ ] Understand Pre-LN structure
- [ ] Understand weight tying
- [ ] Implement various generation strategies
- [ ] Implement KV Cache optimization (see the sketch after this list)
- [ ] Understand the differences between GPT and BERT
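For reference, the idea behind the KV Cache item above: during generation, keys and values of already-processed tokens are cached so each step only computes projections for the new token. A minimal sketch (class and variable names are illustrative, not the exercise's required interface):

```python
import torch

class KVCache:
    def __init__(self):
        self.k = None  # (batch, cached_len, d)
        self.v = None

    def update(self, k_new, v_new):
        # Append this step's keys/values to the cache and return the full history.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

cache = KVCache()
for step in range(3):
    k_new = torch.randn(1, 1, 16)   # keys/values for the single new token
    v_new = torch.randn(1, 1, 16)
    k_all, v_all = cache.update(k_new, v_new)
print(k_all.shape)  # torch.Size([1, 3, 16])
```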
References
- Radford et al. (2018). "Improving Language Understanding by Generative Pre-Training" (GPT-1)
- Radford et al. (2019). "Language Models are Unsupervised Multitask Learners" (GPT-2)
- nanoGPT
- ../LLM_and_NLP/03_BERT_GPT_Architecture.md