10. Attention and Transformers

Learning Objectives

  • How the attention mechanism works
  • Understanding self-attention
  • The Transformer architecture
  • Implementing it in PyTorch

1. Why Attention Is Needed

Limitations of Seq2Seq

Encoder: "λ‚˜λŠ” 학ꡐ에 κ°„λ‹€" β†’ fixed-size vector
                              ↓
Decoder: fixed vector β†’ "I go to school"

Problem: the information in long sentences gets compressed into one vector and lost

How Attention Solves This

When generating each output word, the decoder
can "attend" to every word in the encoder input.

When generating "I" β†’ high attention on "λ‚˜λŠ”"
When generating "school" β†’ high attention on "학ꡐ"

2. The Attention Mechanism

Formulation

# Query, Key, Value
Q = current decoder state
K = all encoder states
V = all encoder states (usually identical to K)

# Attention score
score = Q @ K.T  # (query_len, key_len)

# Attention weights (softmax)
weight = softmax(score / sqrt(d_k))  # scaling

# Context vector
context = weight @ V  # weighted sum

Scaled Dot-Product Attention

import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Scale dot products by sqrt(d_k) to keep softmax gradients stable
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)

    # Masked positions (mask == 0) get -1e9 so softmax assigns them ~0 weight
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights
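
A quick shape check with random tensors (the sizes below are arbitrary, for illustration only):

# 2 sequences, 5 query positions, 7 key positions, 64-dim features
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 7, 64)
V = torch.randn(2, 7, 64)

context, weights = attention(Q, K, V)
print(context.shape)  # torch.Size([2, 5, 64])
print(weights.shape)  # torch.Size([2, 5, 7]), each row sums to 1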

3. Self-Attention

Concept

Within a single sequence, every word attends to every other word.

"The cat sat on the mat because it was tired"
"it" attends strongly to "cat" β†’ pronoun resolution

Formulation

# Q, K, V are all projections of the same input X
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Self-attention
output = attention(Q, K, V)
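
A runnable sketch of the above, using nn.Linear for the projection matrices W_Q, W_K, W_V (the dimensions are arbitrary):

import torch.nn as nn

d_model = 64
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(2, 10, d_model)  # (batch, seq, d_model)
out, attn = attention(W_Q(X), W_K(X), W_V(X))
print(out.shape)   # torch.Size([2, 10, 64])
print(attn.shape)  # torch.Size([2, 10, 10]), a seq Γ— seq attention map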

4. Multi-Head Attention

Idea

Multiple attention heads learn different kinds of relationships in parallel.

Head 1: syntactic relations
Head 2: semantic relations
Head 3: positional relations
...

Implementation

def multi_head_attention(Q, K, V, num_heads):
    batch, seq, d_model = Q.shape
    d_k = d_model // num_heads

    # Split into heads: (batch, num_heads, seq, d_k)
    Q = Q.view(batch, seq, num_heads, d_k).transpose(1, 2)
    K = K.view(batch, seq, num_heads, d_k).transpose(1, 2)
    V = V.view(batch, seq, num_heads, d_k).transpose(1, 2)

    # Run scaled dot-product attention in every head in parallel
    attn_output, _ = attention(Q, K, V)

    # Concatenate heads back into (batch, seq, d_model)
    output = attn_output.transpose(1, 2).contiguous().view(batch, seq, d_model)
    return output
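
Usage as self-attention (Q = K = V). Note that a full multi-head layer would also apply a final output projection W_O after concatenating the heads; this sketch omits it:

x = torch.randn(2, 10, 64)  # (batch, seq, d_model)
out = multi_head_attention(x, x, x, num_heads=8)
print(out.shape)  # torch.Size([2, 10, 64]), same shape, 8 heads inside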

5. The Transformer Architecture

Structure

Input β†’ Embedding β†’ Positional Encoding
                      ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Multi-Head Self-Attention          β”‚
β”‚           ↓                         β”‚
β”‚  Add & LayerNorm                    β”‚
β”‚           ↓                         β”‚
β”‚  Feed Forward Network               β”‚
β”‚           ↓                         β”‚
β”‚  Add & LayerNorm                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            Γ— N layers
                ↓
             Output

Key Components (combined into a single block in the sketch below)

  1. Multi-Head Attention
  2. Position-wise Feed Forward
  3. Residual Connection
  4. Layer Normalization
  5. Positional Encoding
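
One way these five pieces combine into an encoder block, using nn.MultiheadAttention with post-LayerNorm as in the diagram above (a minimal sketch; names and defaults are illustrative):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(            # position-wise feed forward
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sublayer: residual connection, then LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))

        # Feed-forward sublayer: residual connection, then LayerNorm
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x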

6. Positional Encoding

Why It Is Needed

Attention by itself has no notion of token order,
so position information must be added explicitly.

Sinusoidal Encoding

def positional_encoding(seq_len, d_model):
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    # Geometric progression of wavelengths from 2Ο€ to 10000Β·2Ο€
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000) / d_model))

    PE[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    PE[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return PE
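
The TransformerClassifier in section 7 expects a PositionalEncoding module; a minimal wrapper over the function above might look like this (a sketch, assuming batch-first input of shape (batch, seq, d_model)):

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Precompute the table once and register it as a (non-trainable) buffer
        self.register_buffer('pe', positional_encoding(max_len, d_model))

    def forward(self, x):
        # x: (batch, seq, d_model); add the encodings for the first seq positions
        return x + self.pe[:x.size(1)]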

7. PyTorch Transformer

Basic Usage

import torch.nn as nn

# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Forward pass (the default layout is sequence-first)
x = torch.randn(10, 32, 512)  # (seq, batch, d_model)
output = encoder(x)

Classification Model

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, num_classes):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)

        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)

        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, seq) token indices
        x = self.embedding(x) * math.sqrt(self.d_model)  # scale embeddings
        x = self.pos_encoder(x)
        x = x.transpose(0, 1)  # (seq, batch, d_model) for the encoder
        x = self.transformer(x)
        x = x.mean(dim=0)  # mean pooling over sequence positions
        return self.fc(x)
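
A quick smoke test (all sizes here are made up):

model = TransformerClassifier(vocab_size=10000, d_model=512,
                              nhead=8, num_layers=6, num_classes=2)
tokens = torch.randint(0, 10000, (4, 20))  # 4 sentences, 20 tokens each
logits = model(tokens)
print(logits.shape)  # torch.Size([4, 2])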

8. Vision Transformer (ViT)

Idea

Split the image into patches and process them as a sequence.

Image (224Γ—224) β†’ 196 patches of 16Γ—16 β†’ Transformer

Structure

class VisionTransformer(nn.Module):
    def __init__(self, img_size, patch_size, num_classes, d_model, nhead, num_layers):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size ** 2

        self.patch_size = patch_size
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))

        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)

        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # Extract and embed patches (extract_patches is defined below)
        patches = extract_patches(x, self.patch_size)
        x = self.patch_embed(patches)

        # Prepend the learnable CLS token
        cls_tokens = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)

        # Add learnable position embeddings
        x = x + self.pos_embed

        # Transformer (sequence-first layout)
        x = self.transformer(x.transpose(0, 1))

        # Classify from the CLS token (position 0)
        return self.fc(x[0])
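
extract_patches is not defined above; one possible implementation using Tensor.unfold, plus a shape check (the helper's name and signature are assumptions of this sketch):

def extract_patches(x, patch_size):
    # x: (batch, 3, H, W) β†’ (batch, num_patches, 3 * patch_size**2)
    batch = x.size(0)
    x = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # now (batch, 3, H/p, W/p, p, p); put the two grid dims next to batch
    x = x.permute(0, 2, 3, 1, 4, 5).contiguous()
    return x.view(batch, -1, 3 * patch_size ** 2)

vit = VisionTransformer(img_size=224, patch_size=16, num_classes=10,
                        d_model=512, nhead=8, num_layers=6)
imgs = torch.randn(2, 3, 224, 224)
print(vit(imgs).shape)  # torch.Size([2, 10])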

9. Attention vs. RNN

Aspect                    RNN/LSTM    Transformer
Parallelization           Hard        Easy
Long-range dependencies   Hard        Easy
Training speed            Slow        Fast
Memory                    O(n)        O(nΒ²)
Position information      Implicit    Explicit
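
The O(nΒ²) memory term comes from the attention score matrix, which stores one entry per query-key pair:

for n in [128, 512, 2048]:
    scores = torch.randn(n, 64) @ torch.randn(n, 64).T  # (n, n) score matrix
    print(n, scores.numel())  # 16384, 262144, 4194304: quadratic in n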

10. Practical Applications

NLP

  • BERT: bidirectional encoder
  • GPT: decoder-only generation
  • T5: encoder-decoder

Vision

  • ViT: image classification
  • DETR: object detection
  • Swin Transformer: hierarchical structure

Summary

Key Concepts

  1. Attention: computes relevance with Query-Key-Value
  2. Self-Attention: every position attends to every position in the sequence
  3. Multi-Head: learns multiple kinds of relationships at once
  4. Positional Encoding: injects order information

Key Code

# Scaled Dot-Product Attention
scores = Q @ K.T / sqrt(d_k)
weights = softmax(scores)
output = weights @ V

# PyTorch Transformer
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
    num_layers=6
)

λ‹€μŒ 단계

23_Training_Optimization.mdμ—μ„œ κ³ κΈ‰ ν•™μŠ΅ 기법을 λ°°μ›λ‹ˆλ‹€.
