05. Understanding GPT

Learning Objectives

  • Understand the GPT architecture
  • Autoregressive language modeling
  • Text generation techniques
  • Evolution of the GPT series

1. GPT Overview

Generative Pre-trained Transformer

GPT = a stack of Transformer decoder blocks

Key characteristics:
- Unidirectional (left to right)
- Autoregressive generation
- Trained by predicting the next token

BERT vs GPT

| Item | BERT | GPT |
|------|------|-----|
| Architecture | Encoder | Decoder |
| Direction | Bidirectional | Unidirectional |
| Pre-training objective | MLM | Next-token prediction |
| Typical use | Understanding (classification, QA) | Generation (dialogue, writing) |

2. μžκΈ°νšŒκ·€ μ–Έμ–΄ λͺ¨λΈλ§

Training Objective

P(x) = P(x₁) Γ— P(xβ‚‚|x₁) Γ— P(x₃|x₁,xβ‚‚) Γ— ...

Sentence: "I love NLP"
P("I") Γ— P("love"|"I") Γ— P("NLP"|"I love") Γ— P("<eos>"|"I love NLP")

Loss: -log P(next token | previous tokens)

Causal Language Modeling

import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_lm_loss(logits, targets):
    """
    logits: (batch, seq, vocab_size)
    targets: (batch, seq) - next tokens

    Input:  [BOS, I, love, NLP]
    Target: [I, love, NLP, EOS]
    """
    batch_size, seq_len, vocab_size = logits.shape

    # (batch*seq, vocab) vs (batch*seq,)
    loss = F.cross_entropy(
        logits.view(-1, vocab_size),
        targets.view(-1),
        ignore_index=-100  # ignore padding positions
    )
    return loss
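
In practice, the targets are simply the input sequence shifted left by one position. A minimal usage sketch of causal_lm_loss (the token ids and random logits below are made up for illustration):

# Hypothetical ids for [BOS, I, love, NLP, EOS]
tokens = torch.tensor([[1, 10, 11, 12, 2]])

input_ids = tokens[:, :-1]   # [BOS, I, love, NLP]
targets = tokens[:, 1:]      # [I, love, NLP, EOS]

vocab_size = 100
logits = torch.randn(1, input_ids.size(1), vocab_size)  # stand-in for model(input_ids)
print(causal_lm_loss(logits, targets).item())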

3. GPT Architecture

Structure

Input tokens
    ↓
Token Embedding + Position Embedding
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Masked Multi-Head Attention    β”‚
β”‚           ↓                     β”‚
β”‚  Add & LayerNorm                β”‚
β”‚           ↓                     β”‚
β”‚  Feed Forward                   β”‚
β”‚           ↓                     β”‚
β”‚  Add & LayerNorm                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            Γ— N layers
    ↓
LayerNorm
    ↓
Linear (vocab_size)
    ↓
Softmax β†’ next-token probabilities

Implementation

class GPTBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-LayerNorm (GPT-2 style)
        ln_x = self.ln1(x)
        attn_out, _ = self.attn(ln_x, ln_x, ln_x, attn_mask=attn_mask)
        x = x + self.dropout(attn_out)

        ln_x = self.ln2(x)
        x = x + self.ffn(ln_x)

        return x


class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12,
                 num_layers=12, d_ff=3072, max_len=1024, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len

        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)

        self.blocks = nn.ModuleList([
            GPTBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying
        self.head.weight = self.token_emb.weight

        # Register the causal mask as a buffer
        mask = torch.triu(torch.ones(max_len, max_len), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)

    def forward(self, input_ids):
        batch_size, seq_len = input_ids.shape
        assert seq_len <= self.max_len

        # Token + position embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        x = self.drop(x)

        # Causal mask
        mask = self.causal_mask[:seq_len, :seq_len]

        # Transformer blocks
        for block in self.blocks:
            x = block(x, attn_mask=mask)

        x = self.ln_f(x)
        logits = self.head(x)  # (batch, seq, vocab)

        return logits
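
A quick smoke test of the model above, using a deliberately small, hypothetical configuration and random token ids:

vocab_size = 1000
model = GPT(vocab_size, d_model=128, num_heads=4, num_layers=2, d_ff=512, max_len=64)

input_ids = torch.randint(0, vocab_size, (2, 16))  # (batch=2, seq=16)
logits = model(input_ids)                          # (2, 16, 1000)

# Next-token loss: targets are the inputs shifted left by one
loss = causal_lm_loss(logits[:, :-1, :].contiguous(), input_ids[:, 1:].contiguous())
print(logits.shape, loss.item())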

4. ν…μŠ€νŠΈ 생성

Greedy Decoding

def generate_greedy(model, input_ids, max_new_tokens):
    """항상 κ°€μž₯ ν™•λ₯  높은 토큰 선택"""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

Temperature Sampling

def generate_with_temperature(model, input_ids, max_new_tokens, temperature=1.0):
    """Temperature둜 뢄포 쑰절"""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

# temperature < 1: more deterministic (favors high-probability tokens)
# temperature > 1: more random (increases diversity)
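
A small numeric illustration of how temperature reshapes a softmax distribution (the logits are arbitrary):

logits = torch.tensor([2.0, 1.0, 0.1])
for t in [0.5, 1.0, 2.0]:
    probs = F.softmax(logits / t, dim=-1)
    print(f"temperature={t}: {probs.tolist()}")
# Lower temperature sharpens the distribution; higher temperature flattens it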

Top-k Sampling

def generate_top_k(model, input_ids, max_new_tokens, k=50, temperature=1.0):
    """μƒμœ„ k개 ν† ν°μ—μ„œλ§Œ μƒ˜ν”Œλ§"""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)[:, -1, :] / temperature

            # Top-k filtering
            top_k_logits, top_k_indices = logits.topk(k, dim=-1)
            probs = F.softmax(top_k_logits, dim=-1)

            # Sample from the filtered distribution
            idx = torch.multinomial(probs, num_samples=1)
            next_token = top_k_indices.gather(-1, idx)

            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

Top-p (Nucleus) Sampling

def generate_top_p(model, input_ids, max_new_tokens, p=0.9, temperature=1.0):
    """λˆ„μ  ν™•λ₯  pκΉŒμ§€μ˜ ν† ν°μ—μ„œ μƒ˜ν”Œλ§"""
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids)[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)

            # Sort probabilities in descending order
            sorted_probs, sorted_indices = probs.sort(descending=True)
            cumsum = sorted_probs.cumsum(dim=-1)

            # Zero out tokens past the cumulative-probability threshold p
            mask = cumsum - sorted_probs > p
            sorted_probs[mask] = 0
            sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)

            # Sample and map back to original token indices
            idx = torch.multinomial(sorted_probs, num_samples=1)
            next_token = sorted_indices.gather(-1, idx)

            input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
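
A usage sketch exercising the decoding functions above with the toy GPT from section 3 (the configuration and prompt ids are made up; an untrained model will of course produce gibberish):

vocab_size = 1000
model = GPT(vocab_size, d_model=128, num_heads=4, num_layers=2, d_ff=512, max_len=64)
prompt = torch.randint(0, vocab_size, (1, 5))  # dummy prompt ids

print(generate_greedy(model, prompt, max_new_tokens=10).shape)        # (1, 15)
print(generate_top_k(model, prompt, max_new_tokens=10, k=20).shape)   # (1, 15)
print(generate_top_p(model, prompt, max_new_tokens=10, p=0.9).shape)  # (1, 15)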

5. The GPT Series

GPT-1 (2018)

- 12 layers, hidden size 768, 117M parameters
- Trained on BooksCorpus
- Introduced the pre-train / fine-tune paradigm

GPT-2 (2019)

- Up to 48 layers, 1.5B parameters
- Trained on WebText (40GB)
- Demonstrated zero-shot capabilities
- Initially withheld as "too dangerous to release"

Size variants:
- Small: 117M (same size as GPT-1)
- Medium: 345M
- Large: 762M
- XL: 1.5B

GPT-3 (2020)

- 96 layers, 175B parameters
- Few-shot / In-context Learning
- Available only through an API

Key findings:
- Performs a wide range of tasks from prompts alone
- Scaling laws: model size ↑ = performance ↑

GPT-4 (2023)

- Multimodal (text + images)
- Longer context windows (8K, 32K, 128K)
- Improved reasoning ability
- Aligned with RLHF

6. HuggingFace GPT-2

Basic Usage

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# ν…μŠ€νŠΈ 생성
input_text = "The quick brown fox"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate
output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Generation Parameters

output = model.generate(
    input_ids,
    max_length=100,           # maximum total length
    min_length=10,            # minimum length
    do_sample=True,           # enable sampling
    temperature=0.8,          # temperature
    top_k=50,                 # Top-k
    top_p=0.95,               # Top-p
    num_return_sequences=3,   # number of sequences to return
    no_repeat_ngram_size=2,   # block repeated n-grams
    repetition_penalty=1.2,   # repetition penalty
    pad_token_id=tokenizer.eos_token_id
)

Conditional Generation

# Prompt-based generation
prompt = """
Q: What is the capital of France?
A:"""

input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False  # Greedy
)
print(tokenizer.decode(output[0]))

7. In-Context Learning

Zero-shot

Perform the task from the prompt alone, with no examples:

"Translate English to French:
Hello, how are you? β†’"

Few-shot

Include a few examples in the prompt:

"Translate English to French:
Hello β†’ Bonjour
Thank you β†’ Merci
Good morning β†’ Bonjour
How are you? β†’"
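
A minimal few-shot sketch using the GPT-2 tokenizer and model loaded in section 6 (the prompt is the one above; the small GPT-2 model may not follow the pattern reliably):

few_shot_prompt = (
    "Translate English to French:\n"
    "Hello β†’ Bonjour\n"
    "Thank you β†’ Merci\n"
    "Good morning β†’ Bonjour\n"
    "How are you? β†’"
)

input_ids = tokenizer.encode(few_shot_prompt, return_tensors='pt')
output = model.generate(
    input_ids,
    max_new_tokens=10,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0], skip_special_tokens=True))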

Chain-of-Thought (CoT)

Encourage step-by-step reasoning:

"Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many balls does he have now?
A: Let's think step by step.
Roger started with 5 balls.
2 cans of 3 balls each = 6 balls.
5 + 6 = 11 balls.
The answer is 11."

8. KV Cache

Efficient Generation

class GPTWithKVCache(nn.Module):
    def forward(self, input_ids, past_key_values=None):
        """
        past_key_values: cached K, V tensors from previous tokens
        compute attention only for the new token, then update the cache
        """
        if past_key_values is None:
            # prefill: process the full sequence
            ...
        else:
            # process only the newest token
            ...

        return logits, new_past_key_values

# During generation (sketch; prompt_ids stands for the encoded prompt)
past = None
next_token = prompt_ids  # first step feeds the full prompt
for _ in range(max_new_tokens):
    logits, past = model(next_token, past_key_values=past)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    # per-step cost: attend over the cache, O(n), instead of recomputing the whole sequence, O(nΒ²)

HuggingFace KV Cache

output = model.generate(
    input_ids,
    max_new_tokens=50,
    use_cache=True  # enable the KV cache (default)
)
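
As a sketch of what use_cache does under the hood, GPT2LMHeadModel also exposes the cache through past_key_values, so a manual greedy loop only feeds the newest token after the first step (illustrative only; greedy decoding and 10 new tokens are arbitrary choices):

model.eval()
input_ids = tokenizer.encode("The quick brown fox", return_tensors='pt')

with torch.no_grad():
    out = model(input_ids, use_cache=True)  # prefill: process the whole prompt once
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    for _ in range(9):
        # feed only the newest token; attention reuses the cached K, V
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))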

Summary

Comparison of Generation Strategies

| Method | Pros | Cons | Typical use |
|--------|------|------|-------------|
| Greedy | Fast, consistent | Repetitive, dull | Translation, QA |
| Temperature | Tunable diversity | Needs tuning | General generation |
| Top-k | Stable | Fixed k | General generation |
| Top-p | Adaptive | Slightly slower | Creative writing, dialogue |

Key Code

# HuggingFace GPT-2
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate
output = model.generate(input_ids, max_length=50, do_sample=True,
                        temperature=0.7, top_p=0.9)

λ‹€μŒ 단계

06_HuggingFace_Basics.md covers the HuggingFace Transformers library.
