10. Attention and the Transformer
Learning Objectives
- The principle behind the attention mechanism
- Understanding self-attention
- The Transformer architecture
- Implementation in PyTorch
1. Why Attention Is Needed
The Limits of Seq2Seq
Encoder: "나는 학교에 간다" → fixed-size vector
↓
Decoder: fixed vector → "I go to school"
Problem: the information of a long sentence is compressed into one vector, so detail is lost
How Attention Solves This
When the decoder generates each output word,
it can "attend" to every word in the encoder input:
When generating "I" → high attention on "나는"
When generating "school" → high attention on "학교"
2. The Attention Mechanism
Formulation
# Query, Key, Value
Q = current decoder state
K = all encoder states
V = all encoder states (usually the same as K)
# Attention score
score = Q @ K.T  # (query_len, key_len)
# Attention weights (softmax)
weight = softmax(score / sqrt(d_k))  # scaled by sqrt(d_k)
# Context vector
context = weight @ V  # weighted sum of the values
Scaled Dot-Product Attention
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # block masked positions
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights
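A quick check of the function above (redefined here so the snippet runs standalone): masked key positions receive essentially zero weight, and each row of attention weights sums to 1.

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.randn(1, 4, 8)   # (batch, query_len, d_k)
K = torch.randn(1, 6, 8)   # (batch, key_len, d_k)
V = torch.randn(1, 6, 8)

# Mask out the last two key positions
mask = torch.ones(1, 4, 6)
mask[..., 4:] = 0

out, w = attention(Q, K, V, mask)
print(out.shape)               # torch.Size([1, 4, 8])
print(w[..., 4:].abs().max())  # ~0: masked keys get no weight
print(w.sum(dim=-1))           # each row sums to 1
```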
3. Self-Attention
Concept
Within a single sequence, every word attends to every other word.
"The cat sat on the mat because it was tired"
"it" attends strongly to "cat" → pronoun resolution
Example
# Generate Q, K, V from the input X
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
# Self-attention
output, weights = attention(Q, K, V)
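A minimal runnable sketch of those projections, using random matrices in place of the learned W_Q, W_K, W_V (in practice these are nn.Linear layers):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16
X = torch.randn(5, d_model)          # a sequence of 5 token vectors

# Random projection matrices standing in for learned weights
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / math.sqrt(d_model)
weights = F.softmax(scores, dim=-1)  # (5, 5): every token attends to every token
output = weights @ V

print(output.shape)   # torch.Size([5, 16])
```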
4. Multi-Head Attention
Idea
Several attention heads learn different kinds of relationships in parallel:
Head 1: syntactic relationships
Head 2: semantic relationships
Head 3: positional relationships
...
Example
def multi_head_attention(Q, K, V, num_heads):
    batch, seq, d_model = Q.size()
    d_k = d_model // num_heads
    # Split into heads: (batch, num_heads, seq, d_k)
    Q = Q.view(batch, seq, num_heads, d_k).transpose(1, 2)
    K = K.view(batch, seq, num_heads, d_k).transpose(1, 2)
    V = V.view(batch, seq, num_heads, d_k).transpose(1, 2)
    # Attention within each head
    attn_output, _ = attention(Q, K, V)
    # Recombine the heads: (batch, seq, d_model)
    output = attn_output.transpose(1, 2).contiguous().view(batch, seq, d_model)
    return output
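PyTorch ships this operation as nn.MultiheadAttention; a sketch of the equivalent built-in call (with batch_first=True so inputs are (batch, seq, d_model)):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)   # (batch, seq, d_model)
out, attn_weights = mha(x, x, x)  # self-attention: Q = K = V = x

print(out.shape)           # torch.Size([2, 10, 512])
print(attn_weights.shape)  # (batch, seq, seq), averaged over heads
```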
5. The Transformer Architecture
Structure
Input → Embedding → Positional Encoding
        ↓
┌───────────────────────────────┐
│  Multi-Head Self-Attention    │
│            ↓                  │
│  Add & LayerNorm              │
│            ↓                  │
│  Feed Forward Network         │
│            ↓                  │
│  Add & LayerNorm              │
└───────────────────────────────┘
        × N layers
        ↓
Output
Core Components
- Multi-Head Attention
- Position-wise Feed Forward
- Residual Connection
- Layer Normalization
- Positional Encoding
6. Positional Encoding
Why It Is Needed
Attention itself carries no notion of order,
so position information must be added explicitly.
Sinusoidal Encoding
import math
import torch

def positional_encoding(seq_len, d_model):
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000) / d_model))
    PE[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    PE[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return PE
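A quick sanity check of the encoding (redefined here so it runs standalone): every value is bounded by sin/cos, and no two positions share the same row.

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000) / d_model))
    PE[:, 0::2] = torch.sin(position * div_term)
    PE[:, 1::2] = torch.cos(position * div_term)
    return PE

PE = positional_encoding(50, 128)
print(PE.shape)                      # torch.Size([50, 128])
print(PE.abs().max() <= 1.0)         # True: sin/cos are bounded
print(torch.allclose(PE[0], PE[1]))  # False: each position is distinct
```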
7. The PyTorch Transformer
Basic Usage
import torch
import torch.nn as nn

# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Forward pass
x = torch.randn(10, 32, 512)  # (seq, batch, d_model); pass batch_first=True for (batch, seq, d_model)
output = encoder(x)
Classification Model
import math

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, num_classes):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # assumes a PositionalEncoding module wrapping the sinusoidal table above
        self.pos_encoder = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, seq)
        x = self.embedding(x) * math.sqrt(self.d_model)  # scale the embeddings
        x = self.pos_encoder(x)
        x = x.transpose(0, 1)  # (seq, batch, d_model)
        x = self.transformer(x)
        x = x.mean(dim=0)      # mean pooling over the sequence
        return self.fc(x)
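The classifier references a PositionalEncoding module that is not defined in this chapter; a minimal sketch, wrapping the sinusoidal table from section 6 as an nn.Module (the name and max_len default are assumptions):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds a fixed sinusoidal position signal to token embeddings."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)  # fixed table, not a learned parameter

    def forward(self, x):
        # x: (batch, seq, d_model) -- add the encoding for each position
        return x + self.pe[:x.size(1)]

pos = PositionalEncoding(d_model=64)
x = torch.randn(2, 10, 64)
print(pos(x).shape)   # torch.Size([2, 10, 64])
```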
8. Vision Transformer (ViT)
Idea
Split the image into patches → process them as a sequence:
Image (224×224) → 196 patches of 16×16 → Transformer
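The patch count follows directly from the image and patch sizes; for the 224×224 / 16×16 example:

```python
img_size, patch_size, channels = 224, 16, 3

num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196 patches
patch_dim = channels * patch_size ** 2       # 3 * 256 = 768 values per patch

print(num_patches, patch_dim)  # 196 768
```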
Structure
class VisionTransformer(nn.Module):
    def __init__(self, img_size, patch_size, num_classes, d_model, nhead, num_layers):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size ** 2
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, num_classes)

    def extract_patches(self, x):
        # (batch, 3, H, W) → (batch, num_patches, 3 * patch_size²)
        p = self.patch_size
        batch = x.size(0)
        x = x.unfold(2, p, p).unfold(3, p, p)         # (batch, 3, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).contiguous()
        return x.view(batch, -1, 3 * p * p)

    def forward(self, x):
        # Extract and embed the patches
        patches = self.extract_patches(x)
        x = self.patch_embed(patches)
        # Prepend the CLS token
        cls_tokens = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)
        # Add the positional embeddings
        x = x + self.pos_embed
        # Transformer expects (seq, batch, d_model)
        x = self.transformer(x.transpose(0, 1))
        # Classify from the CLS token
        return self.fc(x[0])
9. Attention vs. RNN
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Hard | Easy |
| Long-range dependencies | Hard | Easy |
| Training speed | Slow | Fast |
| Memory | O(n) | O(n²) |
| Position information | Implicit | Explicit |
10. Practical Applications
NLP
- BERT: bidirectional encoder
- GPT: decoder-based generation
- T5: encoder-decoder
Vision
- ViT: image classification
- DETR: object detection
- Swin Transformer: hierarchical structure
Summary
Key Concepts
- Attention: computes relevance via Query-Key-Value
- Self-Attention: every position attends to every position in the sequence
- Multi-Head: learns multiple kinds of relationships in parallel
- Positional Encoding: injects order information
Key Code
# Scaled dot-product attention
scores = Q @ K.T / sqrt(d_k)
weights = softmax(scores)
output = weights @ V

# PyTorch Transformer
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
    num_layers=6
)
Next Steps
23_Training_Optimization.md covers advanced training techniques.