09. GPT

Overview

GPT (Generative Pre-trained Transformer) is an autoregressive language model developed by OpenAI. It generates text from left to right and became the foundation of modern LLMs.


Mathematical Background

1. Causal Language Modeling

λͺ©μ ν•¨μˆ˜:
L = -Ξ£ log P(x_t | x_<t)

μžκΈ°νšŒκ·€ λͺ¨λΈ:
P(x_1, x_2, ..., x_n) = Ξ  P(x_t | x_1, ..., x_{t-1})

Properties:
- No access to future tokens (causal mask)
- Every token position provides a training signal (see the loss sketch below)
- A natural fit for text generation

2. Causal Self-Attention

Standard attention:
Attention(Q, K, V) = softmax(QK^T / √d) V

Causal attention (future positions masked):
mask[i][j] = -∞ if j > i, else 0
Attention(Q, K, V) = softmax((QK^T + mask) / √d) V

ν–‰λ ¬ μ‹œκ°ν™”:
Q\K  | t1  t2  t3  t4
---------------------
t1   |  βœ“   Γ—   Γ—   Γ—
t2   |  βœ“   βœ“   Γ—   Γ—
t3   |  βœ“   βœ“   βœ“   Γ—
t4   |  βœ“   βœ“   βœ“   βœ“
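A minimal single-head causal attention sketch matching the formula above; causal_attention is an illustrative name, and multi-head splitting is omitted for clarity.

import math
import torch
import torch.nn.functional as F

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a causal mask. q, k, v: (batch, seq_len, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, seq, seq)
    # Strictly upper-triangular positions (j > i) are future tokens: set to -inf.
    seq_len = q.size(-2)
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # each row sums to 1 over visible keys
    return weights @ v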

3. GPT vs BERT

BERT (bidirectional):
- Masked LM: 15% of tokens masked
- Bidirectional context
- Strong at classification/understanding tasks

GPT (autoregressive):
- Causal LM: next-token prediction
- Left-to-right context only
- Strong at generation tasks

GPT-2 μ•„ν‚€ν…μ²˜

GPT-2 Small (117M):
- Hidden size: 768
- Layers: 12
- Attention heads: 12

GPT-2 Medium (345M):
- Hidden size: 1024
- Layers: 24
- Attention heads: 16

GPT-2 Large (774M):
- Hidden size: 1280
- Layers: 36
- Attention heads: 20

GPT-2 XL (1.5B):
- Hidden size: 1600
- Layers: 48
- Attention heads: 25

Structure (a code sketch follows the diagram):
Token Embedding + Position Embedding
  ↓
Transformer Decoder Γ— L layers (Pre-LN)
  ↓
Layer Norm
  ↓
LM Head (shared with embedding)
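A minimal sketch of this data flow, assuming GPT-2 Small hyperparameters and using nn.TransformerEncoderLayer(norm_first=True) as a stand-in for the actual GPT-2 decoder block (the repo's gpt_lowlevel.py builds the block directly); MiniGPT2 is an illustrative name, and details such as dropout and initialization are omitted.

import torch
import torch.nn as nn

class MiniGPT2(nn.Module):
    """Skeleton of the GPT-2 forward path: embeddings β†’ L Pre-LN blocks β†’ LayerNorm β†’ LM head."""

    def __init__(self, vocab_size=50257, max_len=1024, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned position embedding
        # Stand-in for the GPT-2 decoder block: Pre-LN self-attention + FFN.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                       activation="gelu", batch_first=True, norm_first=True)
            for _ in range(n_layers)
        )
        self.ln_f = nn.LayerNorm(d_model)                   # final layer norm
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight           # weight tying (shared with embedding)

    def forward(self, input_ids):                           # input_ids: (batch, seq)
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(input_ids.device)
        for block in self.blocks:
            x = block(x, src_mask=causal)                   # causal mask keeps attention left-to-right
        return self.lm_head(self.ln_f(x))                   # logits: (batch, seq, vocab)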

File Structure

09_GPT/
β”œβ”€β”€ README.md
β”œβ”€β”€ pytorch_lowlevel/
β”‚   └── gpt_lowlevel.py         # GPT decoder implemented from scratch
β”œβ”€β”€ paper/
β”‚   └── gpt2_paper.py           # GPT-2 paper reproduction
└── exercises/
    β”œβ”€β”€ 01_text_generation.md   # Text generation exercise
    └── 02_kv_cache.md          # KV cache implementation

Key Concepts

1. Pre-LN vs Post-LN

Post-LN (original Transformer):
x β†’ Attention β†’ Add β†’ LayerNorm β†’ FFN β†’ Add β†’ LayerNorm

Pre-LN (GPT-2):
x β†’ LayerNorm β†’ Attention β†’ Add β†’ LayerNorm β†’ FFN β†’ Add

Pre-LN advantages (see the block sketch below):
- More stable training
- Enables deeper networks

2. Weight Tying

The embedding and the LM head share one weight matrix:

E = embedding matrix (vocab_size Γ— hidden_size)
LM head: logits = h Β· E^T (the output projection reuses E instead of learning a separate matrix)

Advantages (a PyTorch sketch follows):
- Fewer parameters
- Consistent input/output representations

3. Generation Strategies

Greedy: argmax(P(x_t | x_<t))
- Deterministic, prone to repetition

Sampling: x_t ~ P(x_t | x_<t)
- Diverse, but quality can degrade

Top-K: sample from the K highest-probability tokens
- Balances quality and diversity

Top-P (nucleus): sample from the smallest set whose cumulative probability reaches P
- Candidate set size adapts per step

Temperature: softmax(logits / T)
- T < 1: more deterministic
- T > 1: more diverse
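A minimal sketch of one decoding step that combines temperature, top-k, and top-p filtering; sample_next_token is an illustrative helper, not the repo's API.

import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """Pick the next token from the last position's logits (shape: (vocab,))."""
    logits = logits / temperature                       # T < 1 sharpens, T > 1 flattens
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]      # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cum_probs = torch.cumsum(sorted_probs, dim=-1)
        # Drop tokens whose preceding cumulative probability already reaches top_p
        # (the first token is always kept).
        drop = cum_probs - sorted_probs >= top_p
        logits[sorted_idx[drop]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)      # greedy decoding would use probs.argmax()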

Implementation Levels

Level 2: PyTorch Low-Level (pytorch_lowlevel/)

  • Causal Attention 직접 κ΅¬ν˜„
  • Pre-LN ꡬ쑰
  • ν…μŠ€νŠΈ 생성 ν•¨μˆ˜

Level 3: Paper Implementation (paper/)

  • GPT-2 μ •ν™•ν•œ 사양
  • WebText μŠ€νƒ€μΌ ν•™μŠ΅
  • λ‹€μ–‘ν•œ 생성 μ „λž΅

Level 4: Code Analysis (separate document)

  • HuggingFace GPT2 뢄석
  • nanoGPT μ½”λ“œ 뢄석

Learning Checklist

  • [ ] Causal mask κ΅¬ν˜„
  • [ ] Pre-LN ꡬ쑰 이해
  • [ ] Weight tying 이해
  • [ ] λ‹€μ–‘ν•œ 생성 μ „λž΅ κ΅¬ν˜„
  • [ ] KV Cache μ΅œμ ν™”
  • [ ] GPT vs BERT 차이점
