07. Transformer

Overview

The Transformer is the architecture proposed in "Attention Is All You Need" (Vaswani et al., 2017) and sits at the core of modern deep learning. It processes sequences with self-attention alone, without any RNN.

Learning Objectives

  1. Self-Attention: understand the Query, Key, Value operations
  2. Multi-Head Attention: parallel processing with multiple attention heads
  3. Positional Encoding: injecting position information
  4. Encoder-Decoder: the overall architecture

์ˆ˜ํ•™์  ๋ฐฐ๊ฒฝ

1. Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / โˆšd_k) V

where:
- Q (Query): what to look for
- K (Key): what to match against
- V (Value): the values actually retrieved
- d_k: dimension of the keys (the scaling factor)

Breaking the formula down:
1. QK^T: computes Query-Key similarity → (seq_len, seq_len)
2. / √d_k: prevents large values (keeps the softmax stable)
3. softmax: converts scores into a probability distribution
4. × V: weighted average of the values
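The four steps above translate almost line for line into PyTorch. This is a minimal sketch; the function name, tensor shapes, and optional `mask` argument are illustrative, not the course's reference implementation:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    # steps 1-2: similarity scores, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    # step 3: probability distribution over the keys
    weights = F.softmax(scores, dim=-1)
    # step 4: weighted average of the values
    return weights @ v, weights

q = k = v = torch.randn(2, 5, 64)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64])
```

Note that each row of `w` sums to 1, which is exactly the softmax step turning scores into a distribution.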

2. Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)

Characteristics:
- attention is learned from multiple "perspectives"
- each head can capture different patterns
- heads can be computed in parallel
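A compact sketch of the formula above, assuming the common implementation trick of fusing the per-head projections W^Q_i, W^K_i, W^V_i into single d_model × d_model linear layers and splitting into heads by reshaping (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        # fused projections for all heads; W^O is the output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        batch = q.size(0)
        def split(x, proj):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return proj(x).view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        # all heads attend in parallel over the extra "heads" dimension
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        attn = scores.softmax(dim=-1)
        # Concat(head_1, ..., head_h): merge heads back into d_model, then apply W^O
        out = (attn @ v).transpose(1, 2).reshape(batch, -1, self.num_heads * self.d_k)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=128, num_heads=8)
x = torch.randn(2, 10, 128)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 128])
```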

3. Positional Encoding

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

๋ชฉ์ :
- Transformer๋Š” ์ˆœ์„œ ์ •๋ณด๊ฐ€ ์—†์Œ
- ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ์ฃผ์ž…
- Sinusoidal: ํ•™์Šต ์—†์ด ์ƒ์„ฑ, ์™ธ์‚ฝ ๊ฐ€๋Šฅ
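The two formulas above can be generated in a few lines. A sketch, assuming positions on the rows and dimensions on the columns (the function name is illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1)            # (max_len, 1)
    i = torch.arange(0, d_model, 2)                     # even dimension indices 2i
    div = torch.exp(-math.log(10000.0) * i / d_model)   # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)  # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(100, 64)
print(pe.shape)  # torch.Size([100, 64])
```

At position 0 the sine entries are 0 and the cosine entries are 1, which is a quick sanity check when visualizing the encoding.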

ํŒŒ์ผ ๊ตฌ์กฐ

07_Transformer/
├── README.md
├── pytorch_lowlevel/
│   ├── attention_lowlevel.py      # basic attention implementation
│   ├── multihead_attention.py     # Multi-Head Attention
│   ├── positional_encoding.py     # positional encoding
│   └── transformer_lowlevel.py    # full Transformer
├── paper/
│   ├── transformer_paper.py       # paper reproduction
│   └── transformer_xl.py          # Transformer-XL variant
└── exercises/
    ├── 01_flash_attention.md      # Flash Attention implementation
    ├── 02_rotary_embeddings.md    # RoPE implementation
    └── 03_kv_cache.md             # KV Cache implementation

Core Concepts

1. Self-Attention vs Cross-Attention

Self-Attention:
- Q, K, V all come from the same sequence
- used inside both the Encoder and the Decoder

Cross-Attention:
- Q comes from the Decoder; K and V come from the Encoder
- connects the Encoder and the Decoder
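The distinction is only about where the three tensors come from, which PyTorch's built-in `F.scaled_dot_product_attention` (available since PyTorch 2.0) makes easy to see; the shapes here are illustrative:

```python
import torch
import torch.nn.functional as F

enc = torch.randn(2, 7, 64)  # encoder states, seq_len 7
dec = torch.randn(2, 5, 64)  # decoder states, seq_len 5

# Self-attention: Q, K, V all come from the same sequence
self_out = F.scaled_dot_product_attention(enc, enc, enc)

# Cross-attention: Q from the decoder, K and V from the encoder
cross_out = F.scaled_dot_product_attention(dec, enc, enc)
print(cross_out.shape)  # torch.Size([2, 5, 64])
```

The cross-attention output keeps the decoder's sequence length (one output per query) while mixing in information from the encoder's keys and values.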

2. Masking

import torch

# Padding mask: ignore padding tokens
padding_mask = (input_ids == pad_token_id)  # (batch, seq_len), True at padded positions

# Causal mask: block attention to future tokens
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
# The True entries (strict upper triangle) mark the scores to set to -inf before softmax

3. Feed-Forward Network

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

or, with GELU:
FFN(x) = GELU(xW_1)W_2

Characteristics:
- Position-wise: applied independently at each position
- Expansion: typically expanded 4× (d_model → 4*d_model → d_model)
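A minimal sketch of the position-wise FFN with the usual 4× expansion; the class name and the GELU choice (vs. the original ReLU) are illustrative:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: d_model -> 4*d_model -> d_model."""
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # the conventional 4x expansion
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),               # swap in nn.ReLU() for the original formulation
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # nn.Linear acts on the last dim only, so each position is
        # transformed independently: (batch, seq, d_model) -> same shape
        return self.net(x)

ffn = FeedForward(d_model=64)
print(ffn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```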

Exercises

Basic

  1. Implement Scaled Dot-Product Attention from scratch
  2. Visualize the positional encoding
  3. Visualize self-attention patterns

Intermediate

  1. Implement Multi-Head Attention
  2. Complete the Encoder block
  3. Complete the Decoder block (including the causal mask)

Advanced

  1. Optimize autoregressive generation with a KV Cache
  2. Implement Flash Attention (memory-efficient)
  3. Implement Rotary Position Embedding (RoPE)

์ฐธ๊ณ  ์ž๋ฃŒ
