05. Understanding GPT
Learning Objectives
- Understand the GPT architecture
- Autoregressive language modeling
- Text generation techniques
- Evolution of the GPT series
1. GPT Overview
GPT = a stack of Transformer decoders
Key characteristics:
- Unidirectional (left to right)
- Autoregressive generation
- Trained with next-token prediction
BERT vs GPT
| Item | BERT | GPT |
|------|------|-----|
| Architecture | Encoder | Decoder |
| Direction | Bidirectional | Unidirectional |
| Training | MLM | Next-token prediction |
| Use cases | Understanding (classification, QA) | Generation (dialogue, writing) |
2. Autoregressive Language Modeling
Training objective
P(x) = P(x₁) × P(x₂|x₁) × P(x₃|x₁,x₂) × ...
Sentence: "I love NLP"
P("I") × P("love"|"I") × P("NLP"|"I love") × P("<eos>"|"I love NLP")
Loss: -log P(next token | previous tokens)
Causal Language Modeling
import torch
import torch.nn as nn
import torch.nn.functional as F
def causal_lm_loss(logits, targets):
    """
    logits: (batch, seq, vocab_size)
    targets: (batch, seq) - the next token at each position
    Input:  [BOS, I, love, NLP]
    Target: [I,   love, NLP, EOS]
    """
    batch_size, seq_len, vocab_size = logits.shape
    # Flatten to (batch*seq, vocab) vs (batch*seq,) for cross-entropy
    loss = F.cross_entropy(
        logits.view(-1, vocab_size),
        targets.view(-1),
        ignore_index=-100  # ignore padding positions
    )
    return loss
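A minimal usage sketch (the token IDs below are made up) showing the shift-by-one convention: the input drops the last token and the target drops the first.
# Hypothetical token IDs for [BOS, I, love, NLP, EOS]
tokens = torch.tensor([[1, 5, 8, 3, 2]])
input_ids = tokens[:, :-1]   # [BOS, I, love, NLP]
targets = tokens[:, 1:]      # [I, love, NLP, EOS]
# Random logits stand in for a model's output here
logits = torch.randn(1, input_ids.size(1), 100)
print(causal_lm_loss(logits, targets))  # scalar loss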
3. GPT Architecture
Structure
Input tokens
    ↓
Token Embedding + Position Embedding
    ↓
┌─────────────────────────────────┐
│  Masked Multi-Head Attention    │
│              ↓                  │
│        Add & LayerNorm          │
│              ↓                  │
│         Feed Forward            │
│              ↓                  │
│        Add & LayerNorm          │
└─────────────────────────────────┘
            × N layers
    ↓
LayerNorm
    ↓
Linear (vocab_size)
    ↓
Softmax → next-token probabilities
Implementation
class GPTBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-LayerNorm (GPT-2 style): normalize before each sublayer
        ln_x = self.ln1(x)
        attn_out, _ = self.attn(ln_x, ln_x, ln_x, attn_mask=attn_mask)
        x = x + self.dropout(attn_out)
        ln_x = self.ln2(x)
        x = x + self.ffn(ln_x)
        return x
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12,
                 num_layers=12, d_ff=3072, max_len=1024, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: share weights between the input embedding and output head
        self.head.weight = self.token_emb.weight
        # Register the causal mask as a buffer (True = blocked position)
        mask = torch.triu(torch.ones(max_len, max_len), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)

    def forward(self, input_ids):
        batch_size, seq_len = input_ids.shape
        assert seq_len <= self.max_len
        # Embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        x = self.drop(x)
        # Causal mask cropped to the current sequence length
        mask = self.causal_mask[:seq_len, :seq_len]
        # Transformer blocks
        for block in self.blocks:
            x = block(x, attn_mask=mask)
        x = self.ln_f(x)
        logits = self.head(x)  # (batch, seq, vocab)
        return logits
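A quick smoke test of the model above, with hyperparameters shrunk for speed (the vocabulary size is arbitrary):
model = GPT(vocab_size=1000, d_model=128, num_heads=4,
            num_layers=2, d_ff=512, max_len=64)
input_ids = torch.randint(0, 1000, (2, 16))  # (batch=2, seq=16)
logits = model(input_ids)
print(logits.shape)  # torch.Size([2, 16, 1000])
print(sum(p.numel() for p in model.parameters()))  # parameter count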
4. Text Generation
Greedy Decoding
def generate_greedy(model, input_ids, max_new_tokens):
    """Always pick the single most probable next token."""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
Temperature Sampling
def generate_with_temperature(model, input_ids, max_new_tokens, temperature=1.0):
    """Reshape the next-token distribution with a temperature."""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
# temperature < 1: more deterministic (favors high-probability tokens)
# temperature > 1: more random (greater diversity)
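To see the effect numerically, a small standalone sketch (the logits are made up) showing how temperature sharpens or flattens the softmax distribution:
logits = torch.tensor([2.0, 1.0, 0.5])
for t in [0.5, 1.0, 2.0]:
    print(t, F.softmax(logits / t, dim=-1))
# t=0.5 concentrates mass on the top token; t=2.0 spreads it out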
Top-k Sampling
def generate_top_k(model, input_ids, max_new_tokens, k=50, temperature=1.0):
    """Sample only from the k most probable tokens."""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)[:, -1, :] / temperature
            # Top-k filtering
            top_k_logits, top_k_indices = logits.topk(k, dim=-1)
            probs = F.softmax(top_k_logits, dim=-1)
            # Sample within the top k, then map back to vocabulary indices
            idx = torch.multinomial(probs, num_samples=1)
            next_token = top_k_indices.gather(-1, idx)
            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
Top-p (Nucleus) Sampling
def generate_top_p(model, input_ids, max_new_tokens, p=0.9, temperature=1.0):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            # Sort probabilities in descending order
            sorted_probs, sorted_indices = probs.sort(descending=True)
            cumsum = sorted_probs.cumsum(dim=-1)
            # Mask tokens outside the nucleus (keep the token that crosses p)
            mask = cumsum - sorted_probs > p
            sorted_probs[mask] = 0
            sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
            # Sample, then map back to vocabulary indices
            idx = torch.multinomial(sorted_probs, num_samples=1)
            next_token = sorted_indices.gather(-1, idx)
            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
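An end-to-end sketch tying the strategies to the toy GPT defined earlier (the model is untrained, so outputs are random; this only exercises the interfaces):
model = GPT(vocab_size=1000, d_model=128, num_heads=4,
            num_layers=2, d_ff=512, max_len=64)
prompt = torch.randint(0, 1000, (1, 4))  # stand-in for tokenized text
print(generate_greedy(model, prompt, max_new_tokens=8))
print(generate_top_k(model, prompt, max_new_tokens=8, k=20))
print(generate_top_p(model, prompt, max_new_tokens=8, p=0.9))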
5. The GPT Series
GPT-1 (2018)
- 12 layers, 768 dimensions, 117M parameters
- Trained on BooksCorpus
- Introduced the pre-train then fine-tune paradigm
GPT-2 (2019)
- Up to 48 layers, 1.5B parameters
- Trained on WebText (40GB)
- Zero-shot abilities discovered
- Initially withheld as "too dangerous to release"
Size variants:
- Small: 117M (same as GPT-1)
- Medium: 345M
- Large: 762M
- XL: 1.5B
GPT-3 (2020)
- 96 layers, 175B parameters
- Few-shot / in-context learning
- Offered only through an API
Key findings:
- Performs a wide range of tasks from prompts alone
- Scaling laws: larger models yield better performance
GPT-4 (2023)
- Multimodal (text + images)
- Longer context windows (8K, 32K, 128K)
- Improved reasoning ability
- Aligned with RLHF
6. HuggingFace GPT-2
Basic usage
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Text generation
input_text = "The quick brown fox"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
# Generate
output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
Generation parameters
output = model.generate(
    input_ids,
    max_length=100,            # maximum total length
    min_length=10,             # minimum length
    do_sample=True,            # enable sampling
    temperature=0.8,           # temperature
    top_k=50,                  # Top-k
    top_p=0.95,                # Top-p
    num_return_sequences=3,    # number of sequences to return
    no_repeat_ngram_size=2,    # block repeated n-grams
    repetition_penalty=1.2,    # repetition penalty
    pad_token_id=tokenizer.eos_token_id
)
Conditional generation
# Prompt-based generation
prompt = """
Q: What is the capital of France?
A:"""
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False  # Greedy
)
print(tokenizer.decode(output[0]))
7. In-Context Learning
Zero-shot
Performing a task from the prompt alone:
"Translate English to French:
Hello, how are you? →"
Few-shot
Include examples in the prompt (see the sketch after the example):
"Translate English to French:
Hello → Bonjour
Thank you → Merci
Good morning → Bonjour
How are you? →"
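A hedged sketch of feeding this few-shot prompt to the GPT-2 model loaded in section 6 (GPT-2 small is weak at translation, so this demonstrates the mechanics rather than output quality):
few_shot_prompt = ("Translate English to French:\n"
                   "Hello -> Bonjour\n"
                   "Thank you -> Merci\n"
                   "How are you? ->")
input_ids = tokenizer.encode(few_shot_prompt, return_tensors='pt')
output = model.generate(input_ids, max_new_tokens=10, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))  # continuation only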
Chain-of-Thought (CoT)
Eliciting step-by-step reasoning:
"Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many balls does he have now?
A: Let's think step by step.
Roger started with 5 balls.
2 cans of 3 balls each = 6 balls.
5 + 6 = 11 balls.
The answer is 11."
8. KV Cache
Efficient generation
class GPTWithKVCache(nn.Module):
    def forward(self, input_ids, past_key_values=None):
        """
        past_key_values: cached K and V tensors from previous tokens.
        Only the new token is computed; the cache is then updated.
        """
        if past_key_values is None:
            # First call: process the full sequence
            ...
        else:
            # Later calls: process only the last token
            ...
        return logits, new_past_key_values
# During generation
past = None
for _ in range(max_new_tokens):
    logits, past = model(new_token, past_key_values=past)
# Each step processes only the one new token (O(1)) instead of
# re-running the forward pass over all n previous tokens (O(n))
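To make the idea concrete, a minimal runnable sketch of KV caching for a single attention head (the names here are our own illustration, not the full GPT above):
# Single-head attention with an incrementally grown K/V cache
d = 8
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

def attend_step(x_new, cache):
    """x_new: (1, d) embedding of the newest token; cache: (K, V) or None."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    K = k if cache is None else torch.cat([cache[0], k], dim=0)
    V = v if cache is None else torch.cat([cache[1], v], dim=0)
    # The new token attends to every cached position; causality holds
    # automatically because future tokens are not in the cache yet
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V, (K, V)

cache = None
for _ in range(5):
    out, cache = attend_step(torch.randn(1, d), cache)
print(cache[0].shape)  # K cache has grown to (5, d)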
HuggingFace KV Cache
output = model.generate(
    input_ids,
    max_new_tokens=50,
    use_cache=True  # enable the KV cache (the default)
)
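A hedged timing sketch to check the effect empirically (absolute numbers depend on hardware; use_cache=False forces full recomputation at every step):
import time
for use_cache in (True, False):
    start = time.time()
    model.generate(input_ids, max_new_tokens=100, use_cache=use_cache,
                   pad_token_id=tokenizer.eos_token_id)
    print(use_cache, f"{time.time() - start:.2f}s")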
Summary
Comparison of generation strategies
| Method | Pros | Cons | Use cases |
|--------|------|------|-----------|
| Greedy | Fast, consistent | Repetitive, dull | Translation, QA |
| Temperature | Controls diversity | Needs tuning | General generation |
| Top-k | Stable | Fixed k | General generation |
| Top-p | Adaptive | Slightly slower | Creative writing, dialogue |
Key code
# HuggingFace GPT-2
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Generate
output = model.generate(input_ids, max_length=50, do_sample=True,
                        temperature=0.7, top_p=0.9)
Next Steps
06_HuggingFace_Basics.md covers the HuggingFace Transformers library.