14. RLHF and LLM Alignment
Learning Objectives
- Understand RLHF (Reinforcement Learning from Human Feedback)
- Train a reward model
- Optimize a policy with PPO
- Apply DPO (Direct Preference Optimization)
- Understand Constitutional AI and safe AI
1. Overview of LLM Alignment
Why is alignment necessary?
┌──────────────────────────────────────────────────────────┐
│               Why LLM Alignment Is Needed                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Pretrained model (Base Model)                           │
│    │                                                     │
│    │ Problems:                                           │
│    │  - Simply predicts the next token                   │
│    │  - Can generate harmful content                     │
│    │  - Struggles to follow instructions                 │
│    ▼                                                     │
│  Aligned Model                                           │
│    │                                                     │
│    │ Goals:                                              │
│    │  - Helpful                                          │
│    │  - Harmless                                         │
│    │  - Honest                                           │
│                                                          │
└──────────────────────────────────────────────────────────┘
Evolution of Alignment Methods
SFT (Supervised Fine-Tuning)
  │ Supervised learning on high-quality data
  ▼
RLHF (Reinforcement Learning from Human Feedback)
  │ Reward model + reinforcement learning
  ▼
DPO (Direct Preference Optimization)
  │ Direct optimization on preference data
  ▼
Constitutional AI
  │ Principle-based self-improvement
2. SFT (Supervised Fine-Tuning)
Basic Concept
# SFT data format
sft_data = [
{
"instruction": "Write a poem about spring.",
"input": "",
"output": "Flowers bloom in gentle rain,\nBirds return to sing again..."
},
{
"instruction": "Translate to French.",
"input": "Hello, how are you?",
"output": "Bonjour, comment allez-vous?"
}
]
SFT Implementation
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
# Model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
# Dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# Formatting function
def format_instruction(example):
if example["context"]:
return f"""### Instruction:
{example['instruction']}
### Context:
{example['context']}
### Response:
{example['response']}"""
else:
return f"""### Instruction:
{example['instruction']}
### Response:
{example['response']}"""
# Training configuration
training_args = TrainingArguments(
output_dir="./sft_model",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
warmup_ratio=0.03,
logging_steps=10,
save_strategy="epoch",
fp16=True,
)
# SFTTrainer
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
formatting_func=format_instruction,
max_seq_length=1024,
args=training_args,
)
trainer.train()
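After training, a quick sanity check is to generate from the fine-tuned model with the same prompt template used during training. A minimal sketch (the prompt and generation settings here are illustrative, not part of the original recipe):
import torch
# Generate with the SFT model using the training-time template
prompt = "### Instruction:\nExplain overfitting in one sentence.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))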
3. The RLHF Pipeline
Overall Process
┌─────────────────────────────────────────────────────────────────┐
│                          RLHF Pipeline                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Stage 1: SFT                                                   │
│  ┌─────────────┐   supervised learning   ┌─────────────┐        │
│  │ Base Model  │ ──────────────────────▶ │  SFT Model  │        │
│  └─────────────┘                         └─────────────┘        │
│                                                                 │
│  Stage 2: Reward model training                                 │
│  ┌─────────────┐    preference data     ┌──────────────┐        │
│  │  SFT Model  │ ─────────────────────▶ │ Reward Model │        │
│  └─────────────┘                        └──────────────┘        │
│                                                                 │
│  Stage 3: PPO reinforcement learning                            │
│  ┌─────────────┐   ┌──────────────┐     ┌─────────────┐         │
│  │  SFT Model  │ + │ Reward Model │ ──▶ │ RLHF Model  │         │
│  │  (Policy)   │   │   (Critic)   │     │  (Aligned)  │         │
│  └─────────────┘   └──────────────┘     └─────────────┘         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Collecting Preference Data
# Preference data format
preference_data = [
{
"prompt": "Write a haiku about mountains.",
"chosen": "Peaks touch morning sky\nSilent guardians of earth\nMist embraces stone",
"rejected": "Mountains are big\nThey are tall and rocky\nI like mountains"
},
{
"prompt": "Explain quantum computing.",
"chosen": "Quantum computing harnesses quantum mechanics principles...",
"rejected": "Quantum computing is computers that use quantum stuff..."
}
]
# HuggingFace format
from datasets import Dataset
dataset = Dataset.from_list(preference_data)  # columns: prompt, chosen, rejected
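In practice, annotators often rank several candidate responses per prompt rather than labeling one pair at a time; a ranking of K responses expands into K·(K-1)/2 pairwise examples. A minimal sketch (the helper function and sample data are illustrative):
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into pairwise (chosen, rejected) examples."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]

# 3 ranked responses -> 3 preference pairs
pairs = ranking_to_pairs("Explain gravity.", ["best answer", "decent answer", "weak answer"])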
4. Training the Reward Model
Reward Model Concept
Input: (prompt, response)
Output: scalar reward (a score)
Training objective:
reward(prompt, chosen) > reward(prompt, rejected)
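Concretely, this objective is usually realized as the pairwise Bradley-Terry loss, which pushes the chosen score above the rejected score. A minimal sketch of the loss computation (the tensor values are illustrative; TRL's RewardTrainer below implements this internally):
import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen, rewards_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Chosen responses should score higher than rejected ones
loss = pairwise_reward_loss(torch.tensor([1.2, 0.8]), torch.tensor([0.3, 1.0]))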
Implementation
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer
# Reward model (adds a scalar classification head)
reward_model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-7b-hf",
    num_labels=1  # scalar output
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
# Data preprocessing
def preprocess_reward_data(examples):
"""์ ํธ๋ ๋ฐ์ดํฐ๋ฅผ Reward ํ์ต์ฉ์ผ๋ก ๋ณํ"""
new_examples = {
"input_ids_chosen": [],
"attention_mask_chosen": [],
"input_ids_rejected": [],
"attention_mask_rejected": [],
}
for prompt, chosen, rejected in zip(examples["prompt"], examples["chosen"], examples["rejected"]):
# Chosen
chosen_text = f"### Prompt: {prompt}\n### Response: {chosen}"
chosen_tokenized = tokenizer(chosen_text, truncation=True, max_length=512)
new_examples["input_ids_chosen"].append(chosen_tokenized["input_ids"])
new_examples["attention_mask_chosen"].append(chosen_tokenized["attention_mask"])
# Rejected
rejected_text = f"### Prompt: {prompt}\n### Response: {rejected}"
rejected_tokenized = tokenizer(rejected_text, truncation=True, max_length=512)
new_examples["input_ids_rejected"].append(rejected_tokenized["input_ids"])
new_examples["attention_mask_rejected"].append(rejected_tokenized["attention_mask"])
return new_examples
# Training
training_args = TrainingArguments(
output_dir="./reward_model",
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
logging_steps=10,
eval_strategy="steps",
eval_steps=500,
save_strategy="steps",
save_steps=500,
fp16=True,
)
# Build train/eval splits and tokenize them with the preprocessing function above
split = dataset.train_test_split(test_size=0.1)
train_dataset = split["train"].map(preprocess_reward_data, batched=True,
                                   remove_columns=split["train"].column_names)
eval_dataset = split["test"].map(preprocess_reward_data, batched=True,
                                 remove_columns=split["test"].column_names)
trainer = RewardTrainer(
    model=reward_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
Using the Reward Model
def get_reward(model, tokenizer, prompt, response):
"""์๋ต์ ๋ํ ๋ณด์ ์ ์ ๊ณ์ฐ"""
text = f"### Prompt: {prompt}\n### Response: {response}"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
reward = outputs.logits.squeeze().item()
return reward
# Usage example
prompt = "Explain photosynthesis."
response_good = "Photosynthesis is the process by which plants convert sunlight..."
response_bad = "Plants eat light."
print(f"Good response reward: {get_reward(reward_model, tokenizer, prompt, response_good):.4f}")
print(f"Bad response reward: {get_reward(reward_model, tokenizer, prompt, response_bad):.4f}")
5. PPO (Proximal Policy Optimization)
PPO Concept
PPO objective:
L^CLIP(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]
where:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)   (probability ratio)
A_t = advantage (reward - baseline)
ε = clipping range (typically 0.2)
KL constraint:
D_KL[π_θ || π_ref] < δ   (keeps the policy from drifting too far from the reference model)
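To make the clipped objective concrete, here is a minimal sketch of the policy loss; TRL's PPOTrainer implements this (plus the value loss and adaptive KL control) internally, and all tensor names here are illustrative:
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped PPO policy loss: -E[min(r·A, clip(r, 1-eps, 1+eps)·A)]."""
    ratio = torch.exp(logprobs - old_logprobs)            # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# In RLHF the per-token reward is typically shaped with a KL penalty:
#   reward = r_RM - kl_coef * (logprob_policy - logprob_ref)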
PPO Implementation (TRL)
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# PPO configuration
ppo_config = PPOConfig(
model_name="./sft_model",
learning_rate=1.41e-5,
batch_size=16,
mini_batch_size=4,
gradient_accumulation_steps=1,
ppo_epochs=4,
max_grad_norm=0.5,
    kl_penalty="kl",    # KL penalty type
    target_kl=0.1,      # target KL divergence
    init_kl_coef=0.2,   # initial KL coefficient
)
# Models (with value heads)
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")  # reference model (frozen)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token
# Load the reward model
reward_model = AutoModelForSequenceClassification.from_pretrained("./reward_model")
reward_tokenizer = AutoTokenizer.from_pretrained("./reward_model")
# PPO Trainer
ppo_trainer = PPOTrainer(
config=ppo_config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
)
# Training loop
def get_reward_batch(prompts, responses):
"""๋ฐฐ์น ๋ณด์ ๊ณ์ฐ"""
rewards = []
for prompt, response in zip(prompts, responses):
text = f"### Prompt: {prompt}\n### Response: {response}"
inputs = reward_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(reward_model.device) for k, v in inputs.items()}
with torch.no_grad():
reward = reward_model(**inputs).logits.squeeze()
rewards.append(reward)
return rewards
# Training
from datasets import load_dataset
# Assumes a dataset with a "prompt" column; note Anthropic/hh-rlhf actually ships
# full chosen/rejected dialogues, so prompts must be extracted from them first
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
num_epochs = 1  # outer passes over the data (ppo_epochs above is PPO's inner optimization loop)
for epoch in range(num_epochs):
    for batch in dataset.iter(batch_size=ppo_config.batch_size):
        # Tokenize prompts
        query_tensors = [tokenizer.encode(p, return_tensors="pt").squeeze() for p in batch["prompt"]]
        # Generate responses
        response_tensors = []
        for query in query_tensors:
            response = ppo_trainer.generate(query, max_new_tokens=128)
            response_tensors.append(response.squeeze())
        # Decode responses to text
        prompts = batch["prompt"]
        responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
        # Compute rewards
        rewards = get_reward_batch(prompts, responses)
        # PPO step
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        # Logging
        ppo_trainer.log_stats(stats, batch, rewards)
# Save
model.save_pretrained("./rlhf_model")
6. DPO (Direct Preference Optimization)
DPO Concept
DPO = RLHF without a reward model
Key ideas:
- Trains directly on preference data, with no reward model
- Based on the Bradley-Terry preference model
- More stable and simpler to train than PPO
Loss function:
L_DPO = -E[log σ(β(log π_θ(y_w|x) - log π_ref(y_w|x)
                  - log π_θ(y_l|x) + log π_ref(y_l|x)))]
where:
y_w = preferred response (winner)
y_l = dispreferred response (loser)
β = temperature parameter
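The loss can be computed directly from sequence log-probabilities under the policy and the frozen reference model. A minimal sketch (DPOTrainer below handles this for you; tensor names are illustrative):
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss from summed log-probs of y_w (chosen) and y_l (rejected)."""
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -F.logsigmoid(logits).mean()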
DPO vs RLHF
| Aspect | RLHF | DPO |
|--------|------|-----|
| Reward model | Required | Not required |
| Training stability | Unstable | Stable |
| Hyperparameters | Many | Few |
| Memory usage | High | Low |
| Performance | Excellent | Comparable or better |
DPO Implementation
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
# Models
model = AutoModelForCausalLM.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLM.from_pretrained("./sft_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token
# Dataset (needs prompt/chosen/rejected columns; hh-rlhf ships raw dialogues
# and requires preprocessing into that format)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
# DPO configuration
dpo_config = DPOConfig(
    beta=0.1,             # temperature parameter
    loss_type="sigmoid",  # "sigmoid" or "hinge"
max_length=512,
max_prompt_length=256,
learning_rate=5e-7,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
num_train_epochs=1,
logging_steps=10,
save_strategy="epoch",
output_dir="./dpo_model",
fp16=True,
)
# DPO Trainer
dpo_trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=dpo_config,
train_dataset=dataset,
tokenizer=tokenizer,
)
# Train
dpo_trainer.train()
# Save
model.save_pretrained("./dpo_model_final")
DPO Variants
# IPO (Identity Preference Optimization)
dpo_config = DPOConfig(
loss_type="ipo",
label_smoothing=0.0,
)
# KTO (Kahneman-Tversky Optimization)
# Scores desirable/undesirable examples individually instead of in pairs
from trl import KTOConfig, KTOTrainer
kto_config = KTOConfig(
beta=0.1,
desirable_weight=1.0,
undesirable_weight=1.0,
)
# ORPO (Odds Ratio Preference Optimization)
# No reference model required
from trl import ORPOConfig, ORPOTrainer
orpo_config = ORPOConfig(
beta=0.1,
    # trains without a ref_model
)
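Because KTO drops the pairing requirement, its training data is just individually labeled examples. A sketch of the unpaired format TRL's KTOTrainer expects (prompt/completion/label columns; the sample rows are illustrative):
# KTO uses unpaired examples: one response per row with a boolean label
kto_data = [
    {"prompt": "Explain entropy.", "completion": "Entropy measures disorder...", "label": True},
    {"prompt": "Explain entropy.", "completion": "Entropy is when stuff happens.", "label": False},
]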
7. Constitutional AI
Concept
Constitutional AI (CAI) = principle-based self-improvement
Steps:
1. The model generates a response
2. It critiques the response against a constitution (a set of principles)
3. It revises the response based on the critique
4. The revised responses are used for training
Example principles:
- "Be helpful"
- "Do not include harmful content"
- "Be honest"
- "Do not expose personal information"
CAI Implementation
from openai import OpenAI
client = OpenAI()
# Principles (constitution)
constitution = """
1. Responses must be helpful.
2. Responses must not contain harmful content.
3. Responses must be honest and grounded in fact.
4. Do not disclose personal or sensitive information.
5. Do not include discriminatory or biased content.
"""
def generate_initial_response(prompt):
"""์ด๊ธฐ ์๋ต ์์ฑ"""
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
temperature=0.7
)
return response.choices[0].message.content
def critique_response(prompt, response, constitution):
    """Critique a response against the constitution"""
    critique_prompt = f"""Evaluate whether the following response adheres to the principles below.
Principles:
{constitution}
User question: {prompt}
Response: {response}
For each principle, analyze whether the response violates or complies with it.
"""
critique = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": critique_prompt}],
temperature=0.3
)
return critique.choices[0].message.content
def revise_response(prompt, response, critique, constitution):
    """Revise a response based on the critique"""
    revision_prompt = f"""Improve the following response based on the critique.
Principles:
{constitution}
User question: {prompt}
Original response: {response}
Critique: {critique}
Revise the response to better follow the principles. Output only the revised response.
"""
revised = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": revision_prompt}],
temperature=0.3
)
return revised.choices[0].message.content
def constitutional_ai_pipeline(prompt, constitution, iterations=2):
    """Run the CAI critique-and-revise loop"""
    response = generate_initial_response(prompt)
    print(f"Initial response:\n{response}\n")
    for i in range(iterations):
        critique = critique_response(prompt, response, constitution)
        print(f"Critique {i+1}:\n{critique}\n")
        response = revise_response(prompt, response, critique, constitution)
        print(f"Revised response {i+1}:\n{response}\n")
    return response
# Usage
prompt = "How can I pick a lock?"
final_response = constitutional_ai_pipeline(prompt, constitution)
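In the full CAI recipe, the revised responses then become supervised training data (step 4 in the concept outline above). A minimal sketch of collecting them, with illustrative red-team prompts:
# Collect (prompt, revised response) pairs as SFT data for the next training round
red_team_prompts = ["How can I pick a lock?", "Write a convincing scam email."]  # illustrative
cai_sft_data = []
for p in red_team_prompts:
    revised = constitutional_ai_pipeline(p, constitution)
    cai_sft_data.append({"instruction": p, "input": "", "output": revised})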
8. Advanced Alignment Techniques
RLAIF (RL from AI Feedback)
def get_ai_preference(prompt, response_a, response_b):
"""AI๊ฐ ์ ํธ๋ ํ๋จ"""
judge_prompt = f"""๋ค์ ๋ ์๋ต ์ค ๋ ์ข์ ๊ฒ์ ์ ํํ์ธ์.
์ง๋ฌธ: {prompt}
์๋ต A: {response_a}
์๋ต B: {response_b}
ํ๊ฐ ๊ธฐ์ค:
- ์ ํ์ฑ
- ์ ์ฉ์ฑ
- ๋ช
ํ์ฑ
- ์์ ์ฑ
๋ ์ข์ ์๋ต (A ๋๋ B)์ ์ด์ ๋ฅผ ๋งํ์ธ์.
"""
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": judge_prompt}],
temperature=0
)
return response.choices[0].message.content
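To close the RLAIF loop, the judge's verdicts are converted into preference pairs that can train a reward model or feed DPO, just like human labels. A rough sketch (the verdict parsing here is deliberately naive and illustrative):
def build_ai_preference_pair(prompt, response_a, response_b):
    """Turn an AI judge verdict into a (chosen, rejected) pair."""
    verdict = get_ai_preference(prompt, response_a, response_b)
    a_wins = verdict.strip().upper().startswith("A")  # crude parsing; real code should be stricter
    return {
        "prompt": prompt,
        "chosen": response_a if a_wins else response_b,
        "rejected": response_b if a_wins else response_a,
    }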
Self-Play Fine-Tuning (SPIN)
# SPIN: the model competes against its own responses
# (generate_responses and dpo_train are placeholder helpers, not TRL APIs)
def spin_iteration(model, dataset):
    """One SPIN iteration"""
    # 1. Generate responses with the current model
    synthetic_responses = generate_responses(model, dataset["prompts"])
    # 2. Build DPO pairs: real responses vs. model-generated ones
    spin_dataset = {
        "prompt": dataset["prompts"],
        "chosen": dataset["responses"],   # real (human) responses
        "rejected": synthetic_responses   # model-generated responses
    }
    # 3. Train with DPO
    model = dpo_train(model, spin_dataset)
    return model
Summary
Comparison of Alignment Methods
| Method | Complexity | Performance | When to Use |
|--------|------------|-------------|-------------|
| SFT | Low | Baseline | Always the first step |
| RLHF (PPO) | High | Excellent | Complex alignment goals |
| DPO | Medium | Excellent | Straightforward preference alignment |
| ORPO | Low | Good | Memory-constrained settings |
| CAI | Medium | Safety-focused | Safety-critical use cases |
Key Code Snippets
# SFT
from trl import SFTTrainer
trainer = SFTTrainer(model, train_dataset, formatting_func=format_fn)
# DPO
from trl import DPOTrainer, DPOConfig
config = DPOConfig(beta=0.1)
trainer = DPOTrainer(model, ref_model, args=config, train_dataset=dataset)
# PPO
from trl import PPOTrainer, PPOConfig
config = PPOConfig(target_kl=0.1)
trainer = PPOTrainer(config, model, ref_model, tokenizer)
stats = trainer.step(queries, responses, rewards)
Alignment Pipeline
1. SFT: learn core abilities from high-quality data
2. Collect preference data (from humans or AI)
3. Learn preferences with DPO or RLHF
4. Evaluate safety and apply further alignment as needed
5. Deploy and gather feedback
Next Steps
15_LLM_Agents.md covers tool use and agent systems.