14. RLHF์™€ LLM ์ •๋ ฌ (Alignment)

14. RLHF์™€ LLM ์ •๋ ฌ (Alignment)

ํ•™์Šต ๋ชฉํ‘œ

  • RLHF(Reinforcement Learning from Human Feedback) ์ดํ•ด
  • Reward Model ํ•™์Šต
  • PPO๋ฅผ ํ†ตํ•œ ์ •์ฑ… ์ตœ์ ํ™”
  • DPO(Direct Preference Optimization)
  • Constitutional AI์™€ ์•ˆ์ „ํ•œ AI

1. LLM ์ •๋ ฌ ๊ฐœ์š”

์™œ ์ •๋ ฌ์ด ํ•„์š”ํ•œ๊ฐ€?

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                  LLM ์ •๋ ฌ์˜ ํ•„์š”์„ฑ                            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                              โ”‚
โ”‚  ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ (Base Model)                                   โ”‚
โ”‚      โ”‚                                                       โ”‚
โ”‚      โ”‚  ๋ฌธ์ œ์ :                                               โ”‚
โ”‚      โ”‚  - ๋‹จ์ˆœํžˆ ๋‹ค์Œ ํ† ํฐ ์˜ˆ์ธก                                โ”‚
โ”‚      โ”‚  - ์œ ํ•ดํ•œ ์ฝ˜ํ…์ธ  ์ƒ์„ฑ ๊ฐ€๋Šฅ                              โ”‚
โ”‚      โ”‚  - ์ง€์‹œ์‚ฌํ•ญ ๋”ฐ๋ฅด๊ธฐ ์–ด๋ ค์›€                               โ”‚
โ”‚      โ–ผ                                                       โ”‚
โ”‚  ์ •๋ ฌ๋œ ๋ชจ๋ธ (Aligned Model)                                  โ”‚
โ”‚      โ”‚                                                       โ”‚
โ”‚      โ”‚  ๋ชฉํ‘œ:                                                 โ”‚
โ”‚      โ”‚  - ๋„์›€๋จ (Helpful)                                   โ”‚
โ”‚      โ”‚  - ๋ฌดํ•ดํ•จ (Harmless)                                  โ”‚
โ”‚      โ”‚  - ์ •์งํ•จ (Honest)                                    โ”‚
โ”‚                                                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

์ •๋ ฌ ๋ฐฉ๋ฒ•๋ก  ๋ฐœ์ „

SFT (Supervised Fine-Tuning)
    โ”‚  ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ๋กœ ์ง€๋„ํ•™์Šต
    โ–ผ
RLHF (Reinforcement Learning from Human Feedback)
    โ”‚  ๋ณด์ƒ ๋ชจ๋ธ + ๊ฐ•ํ™”ํ•™์Šต
    โ–ผ
DPO (Direct Preference Optimization)
    โ”‚  ์ง์ ‘ ์„ ํ˜ธ๋„ ์ตœ์ ํ™”
    โ–ผ
Constitutional AI
    โ”‚  ์›์น™ ๊ธฐ๋ฐ˜ ์ž๊ธฐ ๊ฐœ์„ 

2. SFT (Supervised Fine-Tuning)

Basic Concept

# SFT data format
sft_data = [
    {
        "instruction": "Write a poem about spring.",
        "input": "",
        "output": "Flowers bloom in gentle rain,\nBirds return to sing again..."
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]

SFT ๊ตฌํ˜„

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# ๋ชจ๋ธ๊ณผ ํ† ํฌ๋‚˜์ด์ €
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# ๋ฐ์ดํ„ฐ์…‹
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# ํฌ๋งทํŒ… ํ•จ์ˆ˜
def format_instruction(example):
    if example["context"]:
        return f"""### Instruction:
{example['instruction']}

### Context:
{example['context']}

### Response:
{example['response']}"""
    else:
        return f"""### Instruction:
{example['instruction']}

### Response:
{example['response']}"""

# ํ•™์Šต ์„ค์ •
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
)

# SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_instruction,
    max_seq_length=1024,
    args=training_args,
)

trainer.train()

3. RLHF ํŒŒ์ดํ”„๋ผ์ธ

์ „์ฒด ํ”„๋กœ์„ธ์Šค

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     RLHF ํŒŒ์ดํ”„๋ผ์ธ                              โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                 โ”‚
โ”‚  1๋‹จ๊ณ„: SFT                                                     โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    ์ง€๋„ํ•™์Šต    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                โ”‚
โ”‚  โ”‚ Base Model  โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ  โ”‚  SFT Model  โ”‚                โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                โ”‚
โ”‚                                                                 โ”‚
โ”‚  2๋‹จ๊ณ„: Reward Model ํ•™์Šต                                       โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”            โ”‚
โ”‚  โ”‚ SFT Model   โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ โ”‚Reward Modelโ”‚            โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜            โ”‚
โ”‚                                                                 โ”‚
โ”‚  3๋‹จ๊ณ„: PPO ๊ฐ•ํ™”ํ•™์Šต                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”          โ”‚
โ”‚  โ”‚  SFT Model  โ”‚ + โ”‚Reward Modelโ”‚ โ–ถ โ”‚ RLHF Model โ”‚          โ”‚
โ”‚  โ”‚  (Policy)   โ”‚   โ”‚  (Critic)  โ”‚   โ”‚ (Aligned)  โ”‚          โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜          โ”‚
โ”‚                                                                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘

# ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ ํ˜•์‹
preference_data = [
    {
        "prompt": "Write a haiku about mountains.",
        "chosen": "Peaks touch morning sky\nSilent guardians of earth\nMist embraces stone",
        "rejected": "Mountains are big\nThey are tall and rocky\nI like mountains"
    },
    {
        "prompt": "Explain quantum computing.",
        "chosen": "Quantum computing harnesses quantum mechanics principles...",
        "rejected": "Quantum computing is computers that use quantum stuff..."
    }
]

# HuggingFace ํ˜•์‹
from datasets import Dataset

dataset = Dataset.from_list(preference_data)
dataset = dataset.map(lambda x: {
    "prompt": x["prompt"],
    "chosen": x["chosen"],
    "rejected": x["rejected"]
})

4. Reward Model ํ•™์Šต

Reward Model ๊ฐœ๋…

์ž…๋ ฅ: (prompt, response)
์ถœ๋ ฅ: scalar reward (์ ์ˆ˜)

ํ•™์Šต ๋ชฉํ‘œ:
    reward(prompt, chosen) > reward(prompt, rejected)
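
The pairwise ranking loss that reward-model training typically uses follows directly from this objective. Below is a minimal sketch that computes it for a pair of chosen/rejected scores (the score values are made-up examples):

import torch
import torch.nn.functional as F

# Hypothetical scalar scores produced by a reward model
reward_chosen = torch.tensor([1.2, 0.8])     # scores of preferred responses
reward_rejected = torch.tensor([0.3, -0.5])  # scores of rejected responses

# Pairwise ranking loss: -log σ(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())  # the higher the chosen score relative to rejected, the smaller the loss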

๊ตฌํ˜„

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

# Reward Model (๋ถ„๋ฅ˜ ํ—ค๋“œ ์ถ”๊ฐ€)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1  # scalar output
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ
def preprocess_reward_data(examples):
    """์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ๋ฅผ Reward ํ•™์Šต์šฉ์œผ๋กœ ๋ณ€ํ™˜"""
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }

    for prompt, chosen, rejected in zip(examples["prompt"], examples["chosen"], examples["rejected"]):
        # Chosen
        chosen_text = f"### Prompt: {prompt}\n### Response: {chosen}"
        chosen_tokenized = tokenizer(chosen_text, truncation=True, max_length=512)
        new_examples["input_ids_chosen"].append(chosen_tokenized["input_ids"])
        new_examples["attention_mask_chosen"].append(chosen_tokenized["attention_mask"])

        # Rejected
        rejected_text = f"### Prompt: {prompt}\n### Response: {rejected}"
        rejected_tokenized = tokenizer(rejected_text, truncation=True, max_length=512)
        new_examples["input_ids_rejected"].append(rejected_tokenized["input_ids"])
        new_examples["attention_mask_rejected"].append(rejected_tokenized["attention_mask"])

    return new_examples

# ํ•™์Šต
training_args = TrainingArguments(
    output_dir="./reward_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    fp16=True,
)

trainer = RewardTrainer(
    model=reward_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

trainer.train()

Reward Model ์‚ฌ์šฉ

def get_reward(model, tokenizer, prompt, response):
    """์‘๋‹ต์— ๋Œ€ํ•œ ๋ณด์ƒ ์ ์ˆ˜ ๊ณ„์‚ฐ"""
    text = f"### Prompt: {prompt}\n### Response: {response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        reward = outputs.logits.squeeze().item()

    return reward

# ์‚ฌ์šฉ ์˜ˆ์‹œ
prompt = "Explain photosynthesis."
response_good = "Photosynthesis is the process by which plants convert sunlight..."
response_bad = "Plants eat light."

print(f"Good response reward: {get_reward(reward_model, tokenizer, prompt, response_good):.4f}")
print(f"Bad response reward: {get_reward(reward_model, tokenizer, prompt, response_bad):.4f}")

5. PPO (Proximal Policy Optimization)

PPO ๊ฐœ๋…

PPO ๋ชฉํ‘œํ•จ์ˆ˜:
    L^CLIP(ฮธ) = E[min(r_t(ฮธ)A_t, clip(r_t(ฮธ), 1-ฮต, 1+ฮต)A_t)]

์—ฌ๊ธฐ์„œ:
    r_t(ฮธ) = ฯ€_ฮธ(a_t|s_t) / ฯ€_ฮธ_old(a_t|s_t)  (ํ™•๋ฅ  ๋น„์œจ)
    A_t = ์–ด๋“œ๋ฐดํ‹ฐ์ง€ (Reward - Baseline)
    ฮต = ํด๋ฆฌํ•‘ ๋ฒ”์œ„ (๋ณดํ†ต 0.2)

KL ์ œ์•ฝ:
    D_KL[ฯ€_ฮธ || ฯ€_ref] < ฮด  (๊ธฐ์ค€ ๋ชจ๋ธ๊ณผ ๋„ˆ๋ฌด ๋ฉ€์–ด์ง€์ง€ ์•Š๋„๋ก)

PPO ๊ตฌํ˜„ (TRL)

from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# PPO ์„ค์ •
ppo_config = PPOConfig(
    model_name="./sft_model",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    ppo_epochs=4,
    max_grad_norm=0.5,
    kl_penalty="kl",           # KL penalty type
    target_kl=0.1,             # target KL divergence
    init_kl_coef=0.2,          # initial KL coefficient
)

# ๋ชจ๋ธ (Value head ํฌํ•จ)
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")  # ๊ธฐ์ค€ ๋ชจ๋ธ (๊ณ ์ •)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token

# Reward Model ๋กœ๋“œ
reward_model = AutoModelForSequenceClassification.from_pretrained("./reward_model")
reward_tokenizer = AutoTokenizer.from_pretrained("./reward_model")

# PPO Trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# ํ•™์Šต ๋ฃจํ”„
def get_reward_batch(prompts, responses):
    """๋ฐฐ์น˜ ๋ณด์ƒ ๊ณ„์‚ฐ"""
    rewards = []
    for prompt, response in zip(prompts, responses):
        text = f"### Prompt: {prompt}\n### Response: {response}"
        inputs = reward_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(reward_model.device) for k, v in inputs.items()}

        with torch.no_grad():
            reward = reward_model(**inputs).logits.squeeze()
        rewards.append(reward)

    return rewards

# ํ•™์Šต
from datasets import load_dataset
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

for epoch in range(ppo_config.ppo_epochs):
    for batch in dataset.iter(batch_size=ppo_config.batch_size):
        # ํ”„๋กฌํ”„ํŠธ ํ† ํฐํ™”
        query_tensors = [tokenizer.encode(p, return_tensors="pt").squeeze() for p in batch["prompt"]]

        # ์‘๋‹ต ์ƒ์„ฑ
        response_tensors = []
        for query in query_tensors:
            response = ppo_trainer.generate(query, max_new_tokens=128)
            response_tensors.append(response.squeeze())

        # ํ…์ŠคํŠธ ๋””์ฝ”๋”ฉ
        prompts = batch["prompt"]
        responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

        # ๋ณด์ƒ ๊ณ„์‚ฐ
        rewards = get_reward_batch(prompts, responses)

        # PPO ์Šคํ…
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

        # ๋กœ๊น…
        ppo_trainer.log_stats(stats, batch, rewards)

# ์ €์žฅ
model.save_pretrained("./rlhf_model")

6. DPO (Direct Preference Optimization)

DPO ๊ฐœ๋…

DPO = RLHF without Reward Model

ํ•ต์‹ฌ ์•„์ด๋””์–ด:
    - Reward Model ์—†์ด ์ง์ ‘ ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
    - Bradley-Terry ๋ชจ๋ธ ๊ธฐ๋ฐ˜
    - ๋” ์•ˆ์ •์ ์ด๊ณ  ๊ฐ„๋‹จํ•œ ํ•™์Šต

์†์‹ค ํ•จ์ˆ˜:
    L_DPO = -E[log ฯƒ(ฮฒ(log ฯ€_ฮธ(y_w|x) - log ฯ€_ref(y_w|x)
                      - log ฯ€_ฮธ(y_l|x) + log ฯ€_ref(y_l|x)))]

์—ฌ๊ธฐ์„œ:
    y_w = ์„ ํ˜ธ ์‘๋‹ต (winner)
    y_l = ๋น„์„ ํ˜ธ ์‘๋‹ต (loser)
    ฮฒ = ์˜จ๋„ ํŒŒ๋ผ๋ฏธํ„ฐ

DPO vs RLHF

ํ•ญ๋ชฉ RLHF DPO
Reward Model ํ•„์š” ๋ถˆํ•„์š”
ํ•™์Šต ์•ˆ์ •์„ฑ ๋ถˆ์•ˆ์ • ์•ˆ์ •์ 
ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋งŽ์Œ ์ ์Œ
๋ฉ”๋ชจ๋ฆฌ ๋†’์Œ ๋‚ฎ์Œ
์„ฑ๋Šฅ ์šฐ์ˆ˜ ๋™๋“ฑ ์ด์ƒ

DPO ๊ตฌํ˜„

from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# ๋ชจ๋ธ
model = AutoModelForCausalLM.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLM.from_pretrained("./sft_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token

# ๋ฐ์ดํ„ฐ์…‹ (prompt, chosen, rejected ํ˜•์‹)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# DPO ์„ค์ •
dpo_config = DPOConfig(
    beta=0.1,                          # ์˜จ๋„ ํŒŒ๋ผ๋ฏธํ„ฐ
    loss_type="sigmoid",               # sigmoid ๋˜๋Š” hinge
    max_length=512,
    max_prompt_length=256,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="epoch",
    output_dir="./dpo_model",
    fp16=True,
)

# DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# ํ•™์Šต
dpo_trainer.train()

# ์ €์žฅ
model.save_pretrained("./dpo_model_final")

DPO ๋ณ€ํ˜•๋“ค

# IPO (Identity Preference Optimization)
dpo_config = DPOConfig(
    loss_type="ipo",
    label_smoothing=0.0,
)

# KTO (Kahneman-Tversky Optimization)
# Uses individual desirable/undesirable labels instead of preference pairs
from trl import KTOConfig, KTOTrainer

kto_config = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

# ORPO (Odds Ratio Preference Optimization)
# No reference model required
from trl import ORPOConfig, ORPOTrainer

orpo_config = ORPOConfig(
    beta=0.1,
    # ref_model ์—†์ด ํ•™์Šต
)

7. Constitutional AI

Concept

Constitutional AI (CAI) = principle-based self-improvement

Steps:
    1. The model generates a response
    2. The model critiques that response against a constitution (a set of principles)
    3. The response is revised based on the critique
    4. The model is trained on the revised responses

Example principles:
    - "Be helpful"
    - "Do not include harmful content"
    - "Be honest"
    - "Do not expose personal information"

CAI ๊ตฌํ˜„

from openai import OpenAI

client = OpenAI()

# ์›์น™ (Constitution)
constitution = """
1. ์‘๋‹ต์€ ๋„์›€์ด ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
2. ์‘๋‹ต์€ ํ•ด๋กœ์šด ๋‚ด์šฉ์„ ํฌํ•จํ•˜์ง€ ์•Š์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.
3. ์‘๋‹ต์€ ์ •์งํ•˜๊ณ  ์‚ฌ์‹ค์— ๊ธฐ๋ฐ˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
4. ๊ฐœ์ธ์ •๋ณด๋‚˜ ๋ฏผ๊ฐํ•œ ์ •๋ณด๋ฅผ ๊ณต๊ฐœํ•˜์ง€ ์•Š์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.
5. ์ฐจ๋ณ„์ ์ด๊ฑฐ๋‚˜ ํŽธ๊ฒฌ ์žˆ๋Š” ๋‚ด์šฉ์„ ํฌํ•จํ•˜์ง€ ์•Š์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.
"""

def generate_initial_response(prompt):
    """Generate the initial response"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content

def critique_response(prompt, response, constitution):
    """Critique a response against the constitution"""
    critique_prompt = f"""Evaluate how well the following response adheres to the given principles.

Principles:
{constitution}

User question: {prompt}

Response: {response}

For each principle, analyze how the response violates or complies with it.
"""
    critique = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": critique_prompt}],
        temperature=0.3
    )
    return critique.choices[0].message.content

def revise_response(prompt, response, critique, constitution):
    """Revise a response based on the critique"""
    revision_prompt = f"""Improve the response below based on the critique.

Principles:
{constitution}

User question: {prompt}

Original response: {response}

Critique: {critique}

Rewrite the response so it better follows the principles. Output only the revised response.
"""
    revised = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": revision_prompt}],
        temperature=0.3
    )
    return revised.choices[0].message.content

def constitutional_ai_pipeline(prompt, constitution, iterations=2):
    """Run the CAI critique-and-revise pipeline"""
    response = generate_initial_response(prompt)
    print(f"Initial response:\n{response}\n")

    for i in range(iterations):
        critique = critique_response(prompt, response, constitution)
        print(f"Critique {i+1}:\n{critique}\n")

        response = revise_response(prompt, response, critique, constitution)
        print(f"Revised response {i+1}:\n{response}\n")

    return response

# ์‚ฌ์šฉ
prompt = "How can I pick a lock?"
final_response = constitutional_ai_pipeline(prompt, constitution)

8. ๊ณ ๊ธ‰ ์ •๋ ฌ ๊ธฐ๋ฒ•

RLAIF (RL from AI Feedback)

def get_ai_preference(prompt, response_a, response_b):
    """Have an AI judge decide which response is preferred"""
    judge_prompt = f"""Choose the better of the following two responses.

Question: {prompt}

Response A: {response_a}

Response B: {response_b}

Evaluation criteria:
- Accuracy
- Helpfulness
- Clarity
- Safety

State which response is better (A or B) and explain why.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0
    )
    return response.choices[0].message.content

Self-Play Fine-Tuning (SPIN)

# SPIN: ๋ชจ๋ธ์ด ์ž์‹ ์˜ ์‘๋‹ต๊ณผ ๊ฒฝ์Ÿ

def spin_iteration(model, dataset):
    """SPIN ๋ฐ˜๋ณต"""
    # 1. ํ˜„์žฌ ๋ชจ๋ธ๋กœ ์‘๋‹ต ์ƒ์„ฑ
    synthetic_responses = generate_responses(model, dataset["prompts"])

    # 2. ์‹ค์ œ ์‘๋‹ต vs ์ƒ์„ฑ๋œ ์‘๋‹ต์œผ๋กœ DPO
    spin_dataset = {
        "prompt": dataset["prompts"],
        "chosen": dataset["responses"],      # ์‹ค์ œ ์‘๋‹ต
        "rejected": synthetic_responses      # ๋ชจ๋ธ ์ƒ์„ฑ ์‘๋‹ต
    }

    # 3. DPO ํ•™์Šต
    model = dpo_train(model, spin_dataset)

    return model

์ •๋ฆฌ

์ •๋ ฌ ๋ฐฉ๋ฒ• ๋น„๊ต

๋ฐฉ๋ฒ• ๋ณต์žก๋„ ์„ฑ๋Šฅ ์‚ฌ์šฉ ์‹œ์ 
SFT ๋‚ฎ์Œ ๊ธฐ๋ณธ ํ•ญ์ƒ ์ฒซ ๋‹จ๊ณ„
RLHF (PPO) ๋†’์Œ ์šฐ์ˆ˜ ๋ณต์žกํ•œ ์ •๋ ฌ
DPO ์ค‘๊ฐ„ ์šฐ์ˆ˜ ๊ฐ„๋‹จํ•œ ์ •๋ ฌ
ORPO ๋‚ฎ์Œ ์ข‹์Œ ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ
CAI ์ค‘๊ฐ„ ์•ˆ์ „์„ฑ ์•ˆ์ „ ์ค‘์š”

ํ•ต์‹ฌ ์ฝ”๋“œ

# SFT
from trl import SFTTrainer
trainer = SFTTrainer(model=model, train_dataset=train_dataset, formatting_func=format_fn)

# DPO
from trl import DPOTrainer, DPOConfig
config = DPOConfig(beta=0.1)
trainer = DPOTrainer(model=model, ref_model=ref_model, args=config, train_dataset=dataset)

# PPO
from trl import PPOTrainer, PPOConfig
config = PPOConfig(target_kl=0.1)
trainer = PPOTrainer(config=config, model=model, ref_model=ref_model, tokenizer=tokenizer)
stats = trainer.step(queries, responses, rewards)

์ •๋ ฌ ํŒŒ์ดํ”„๋ผ์ธ

1. SFT: ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ๋กœ ๊ธฐ๋ณธ ๋Šฅ๋ ฅ ํ•™์Šต
2. ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ (์ธ๊ฐ„ ๋˜๋Š” AI)
3. DPO/RLHF๋กœ ์„ ํ˜ธ๋„ ํ•™์Šต
4. ์•ˆ์ „์„ฑ ํ‰๊ฐ€ ๋ฐ ์ถ”๊ฐ€ ์ •๋ ฌ
5. ๋ฐฐํฌ ๋ฐ ํ”ผ๋“œ๋ฐฑ ์ˆ˜์ง‘

๋‹ค์Œ ๋‹จ๊ณ„

15_LLM_Agents.md์—์„œ ๋„๊ตฌ ์‚ฌ์šฉ๊ณผ ์—์ด์ „ํŠธ ์‹œ์Šคํ…œ์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
