# 09. GPT
## Overview
GPT (Generative Pre-trained Transformer) is an autoregressive language model developed by OpenAI. It generates text from left to right and became the foundation of modern LLMs.
## Mathematical Background
### 1. Causal Language Modeling
Objective function:

L = -Σ_t log P(x_t | x_<t)

Autoregressive factorization:

P(x_1, x_2, ..., x_n) = Π_t P(x_t | x_1, ..., x_{t-1})

Characteristics:

- No access to future tokens (causal mask)
- Every token position provides a training signal
- A natural fit for text generation (see the loss sketch below)
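A minimal PyTorch sketch of this objective, assuming `logits` of shape `(batch, seq_len, vocab_size)` from a decoder-only model; the function and variable names here are illustrative:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits[:, t] is the model's prediction for token t+1, so shift by one position.
    shift_logits = logits[:, :-1, :].contiguous()   # predictions for positions 1..n-1
    shift_labels = input_ids[:, 1:].contiguous()    # the tokens those positions must predict
    # Mean of -log P(x_t | x_<t) over every position: each token is a training signal.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```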
### 2. Causal Self-Attention
Standard attention:

Attention(Q, K, V) = softmax(QK^T / √d) V

Causal attention (future masking):

mask = upper_triangular(-∞)
Attention(Q, K, V) = softmax((QK^T + mask) / √d) V
Mask visualization (✓ = may attend, × = masked):

Q\K | t1 t2 t3 t4
-----------------
t1  | ✓  ×  ×  ×
t2  | ✓  ✓  ×  ×
t3  | ✓  ✓  ✓  ×
t4  | ✓  ✓  ✓  ✓
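A sketch of the masked attention above for a single head, assuming inputs already projected to `q`, `k`, `v` of shape `(batch, seq_len, d)`; the helper name is illustrative:

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (batch, T, T)
    T = scores.size(-1)
    # True above the diagonal: position t may only attend to positions <= t.
    future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # additive -inf mask before softmax
    weights = F.softmax(scores, dim=-1)                    # each row normalizes over the past only
    return weights @ v
```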
### 3. GPT vs BERT
BERT (bidirectional):

- Masked LM: 15% of tokens masked
- Bidirectional context
- Strong at classification/understanding tasks

GPT (autoregressive):

- Causal LM: next-token prediction
- Left context only
- Strong at generation tasks
## GPT-2 Architecture
| Model        | Params | Hidden size | Layers | Attention heads |
|--------------|--------|-------------|--------|-----------------|
| GPT-2 Small  | 117M   | 768         | 12     | 12              |
| GPT-2 Medium | 345M   | 1024        | 24     | 16              |
| GPT-2 Large  | 774M   | 1280        | 36     | 20              |
| GPT-2 XL     | 1.5B   | 1600        | 48     | 25              |
Structure:

Token Embedding + Position Embedding
    ↓
Transformer Decoder × L layers (Pre-LN)
    ↓
Layer Norm
    ↓
LM Head (shared with embedding)
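A compact sketch of this stack, assuming PyTorch and GPT-2 Small hyperparameters; dropout, initialization, and other details of the reference implementation are omitted, and the class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN decoder block: LN -> causal attention -> add, LN -> MLP -> add."""
    def __init__(self, d, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        T = x.size(1)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=future, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

class GPT2(nn.Module):
    def __init__(self, vocab_size=50257, n_positions=1024, d=768, n_layer=12, n_head=12):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d)        # token embedding
        self.wpe = nn.Embedding(n_positions, d)       # learned position embedding
        self.blocks = nn.ModuleList([Block(d, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d)                   # final layer norm
        self.lm_head = nn.Linear(d, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight         # weight tying (see Core Concepts)

    def forward(self, idx):                           # idx: (batch, seq_len) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))             # (batch, seq_len, vocab_size) logits
```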
## File Structure
09_GPT/
├── README.md
├── pytorch_lowlevel/
│   └── gpt_lowlevel.py          # GPT decoder implemented from scratch
├── paper/
│   └── gpt2_paper.py            # GPT-2 paper reproduction
└── exercises/
    ├── 01_text_generation.md    # Text generation exercise
    └── 02_kv_cache.md           # KV cache implementation
## Core Concepts
### 1. Pre-LN vs Post-LN
Post-LN (original Transformer):

x → Attention → Add → LayerNorm → FFN → Add → LayerNorm

Pre-LN (GPT-2):

x → LayerNorm → Attention → Add → LayerNorm → FFN → Add

Advantages of Pre-LN (compared in code below):

- More stable training
- Enables deeper networks
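The two residual orderings side by side, written as forward functions over hypothetical sublayer modules `attn`, `ffn`, `ln1`, `ln2`:

```python
def post_ln_block(x, attn, ffn, ln1, ln2):
    # Original Transformer: sublayer, residual add, then normalize.
    x = ln1(x + attn(x))
    x = ln2(x + ffn(x))
    return x

def pre_ln_block(x, attn, ffn, ln1, ln2):
    # GPT-2: normalize the sublayer input, then residual add.
    # The residual path itself is never normalized, which stabilizes deep stacks.
    x = x + attn(ln1(x))
    x = x + ffn(ln2(x))
    return x
```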
### 2. Weight Tying
Sharing weights between the embedding and the LM head (a minimal sketch follows below):

E = embedding matrix (vocab_size × hidden_size)
LM_head = E^T (the same matrix, reused)

Advantages:

- Fewer parameters
- Consistent input/output token representations
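In PyTorch this is a single parameter assignment; the sizes below are GPT-2 Small values used for illustration:

```python
import torch.nn as nn

vocab_size, hidden_size = 50257, 768
wte = nn.Embedding(vocab_size, hidden_size)               # E: (vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # logits = h @ E^T
lm_head.weight = wte.weight                               # both modules share one Parameter

# Saves one vocab_size x hidden_size matrix (~38.6M parameters for GPT-2 Small).
```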
### 3. Generation Strategies
Greedy: argmax P(x_t | x_<t)

- Deterministic; prone to repetition

Sampling: x_t ~ P(x_t | x_<t)

- More diverse; quality can degrade

Top-K: sample from the K highest-probability tokens

- Balances quality and diversity

Top-P (nucleus): sample only from tokens within cumulative probability P

- Dynamic candidate-set size

Temperature: softmax(logits / T)

- T < 1: more deterministic
- T > 1: more diverse
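These strategies compose naturally into a single decoding step. A sketch over next-token `logits` of shape `(vocab_size,)`, with illustrative parameter names and defaults:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / temperature                        # T < 1 sharpens, T > 1 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]       # K-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p                       # tokens outside the nucleus
        remove[1:] = remove[:-1].clone()                 # keep the token that crosses P
        remove[0] = False                                # always keep at least one candidate
        logits[sorted_idx[remove]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # greedy would be probs.argmax()
```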
## Implementation Levels
### Level 2: PyTorch Low-Level (pytorch_lowlevel/)

- Causal attention implemented from scratch
- Pre-LN structure
- Text generation function
### Level 3: Paper Implementation (paper/)

- Exact GPT-2 specification
- WebText-style training
- Multiple generation strategies
### Level 4: Code Analysis (separate document)

- HuggingFace GPT2 analysis
- nanoGPT code analysis
## Learning Checklist
- [ ] Implement the causal mask
- [ ] Understand the Pre-LN structure
- [ ] Understand weight tying
- [ ] Implement multiple generation strategies
- [ ] Optimize generation with a KV cache
- [ ] Explain the differences between GPT and BERT
## References
- Radford et al. (2018). "Improving Language Understanding by Generative Pre-Training" (GPT-1)
- Radford et al. (2019). "Language Models are Unsupervised Multitask Learners" (GPT-2)
- nanoGPT
- ../LLM_and_NLP/03_BERT_GPT_Architecture.md