14. RLHF and LLM Alignment
Learning Objectives
- Understand RLHF (Reinforcement Learning from Human Feedback)
- Train a reward model
- Optimize a policy with PPO
- Apply DPO (Direct Preference Optimization)
- Understand Constitutional AI and safe AI practices
1. LLM Alignment Overview
Why is Alignment Needed?
The Need for LLM Alignment

Pre-trained Model (Base Model)
    │
    │  Problems:
    │   - Simply predicts the next token
    │   - Can generate harmful content
    │   - Has difficulty following instructions
    ▼
Aligned Model
    │
    │  Goals:
    │   - Helpful
    │   - Harmless
    │   - Honest
Evolution of Alignment Methodologies
SFT (Supervised Fine-Tuning)
    │  Supervised learning with high-quality data
    ▼
RLHF (Reinforcement Learning from Human Feedback)
    │  Reward model + reinforcement learning
    ▼
DPO (Direct Preference Optimization)
    │  Direct preference optimization
    ▼
Constitutional AI
    │  Principle-based self-improvement
2. SFT (Supervised Fine-Tuning)
Basic Concept
# SFT data format: instruction, optional input, and target output
sft_data = [
    {
        "instruction": "Write a poem about spring.",
        "input": "",
        "output": "Flowers bloom in gentle rain,\nBirds return to sing again..."
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]
SFT Implementation
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Formatting function
def format_instruction(example):
    if example["context"]:
        return f"""### Instruction:
{example['instruction']}
### Context:
{example['context']}
### Response:
{example['response']}"""
    else:
        return f"""### Instruction:
{example['instruction']}
### Response:
{example['response']}"""

# Training configuration
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
)

# SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_instruction,
    max_seq_length=1024,
    args=training_args,
)

trainer.train()
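After training, a quick generation check helps confirm the model follows the instruction template. A minimal sketch reusing the objects defined above (the prompt text and generation settings are illustrative):

# Persist the final weights, then generate with the same prompt template used for training
trainer.save_model("./sft_model")

prompt = "### Instruction:\nExplain overfitting in one paragraph.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))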
3. RLHF Pipeline
Overall Process
Stage 1: SFT
    Base Model ──(supervised fine-tuning)──▶ SFT Model

Stage 2: Reward Model Training
    SFT Model ──(preference data)──▶ Reward Model

Stage 3: PPO Reinforcement Learning
    SFT Model (policy) + Reward Model (critic) ──▶ RLHF Model (aligned)
Preference Data Collection
# Preference data format: a prompt plus a preferred (chosen) and a dispreferred (rejected) response
preference_data = [
    {
        "prompt": "Write a haiku about mountains.",
        "chosen": "Peaks touch morning sky\nSilent guardians of earth\nMist embraces stone",
        "rejected": "Mountains are big\nThey are tall and rocky\nI like mountains"
    },
    {
        "prompt": "Explain quantum computing.",
        "chosen": "Quantum computing harnesses quantum mechanics principles...",
        "rejected": "Quantum computing is computers that use quantum stuff..."
    }
]

# HuggingFace Dataset with the prompt / chosen / rejected columns expected by TRL
from datasets import Dataset

dataset = Dataset.from_list(preference_data)
4. Reward Model Training
Reward Model Concept
Input: (prompt, response)
Output: scalar reward (score)
Training objective:
reward(prompt, chosen) > reward(prompt, rejected)
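In practice this objective is usually implemented as a Bradley-Terry style pairwise ranking loss over the two scores. A minimal sketch with made-up reward values:

import torch
import torch.nn.functional as F

# Hypothetical scalar rewards the model assigns to chosen vs. rejected responses
r_chosen = torch.tensor([1.8, 0.4])
r_rejected = torch.tensor([0.2, 0.9])

# Pairwise ranking loss: minimized when reward(chosen) > reward(rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss)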
Implementation
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

# Reward Model: a causal LM backbone with a scalar classification head
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1  # Scalar output
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Data preprocessing
def preprocess_reward_data(examples):
    """Convert preference data into the chosen/rejected format RewardTrainer expects"""
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for prompt, chosen, rejected in zip(examples["prompt"], examples["chosen"], examples["rejected"]):
        # Chosen
        chosen_text = f"### Prompt: {prompt}\n### Response: {chosen}"
        chosen_tokenized = tokenizer(chosen_text, truncation=True, max_length=512)
        new_examples["input_ids_chosen"].append(chosen_tokenized["input_ids"])
        new_examples["attention_mask_chosen"].append(chosen_tokenized["attention_mask"])

        # Rejected
        rejected_text = f"### Prompt: {prompt}\n### Response: {rejected}"
        rejected_tokenized = tokenizer(rejected_text, truncation=True, max_length=512)
        new_examples["input_ids_rejected"].append(rejected_tokenized["input_ids"])
        new_examples["attention_mask_rejected"].append(rejected_tokenized["attention_mask"])

    return new_examples

# Build train/eval splits from the preference dataset prepared in section 3
processed = dataset.map(preprocess_reward_data, batched=True, remove_columns=dataset.column_names)
split = processed.train_test_split(test_size=0.1)
train_dataset, eval_dataset = split["train"], split["test"]

# Training
training_args = TrainingArguments(
    output_dir="./reward_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    fp16=True,
)

trainer = RewardTrainer(
    model=reward_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

trainer.train()
Using Reward Model
def get_reward(model, tokenizer, prompt, response):
    """Calculate the reward score for a response"""
    text = f"### Prompt: {prompt}\n### Response: {response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    reward = outputs.logits.squeeze().item()
    return reward

# Usage example
prompt = "Explain photosynthesis."
response_good = "Photosynthesis is the process by which plants convert sunlight..."
response_bad = "Plants eat light."

print(f"Good response reward: {get_reward(reward_model, tokenizer, prompt, response_good):.4f}")
print(f"Bad response reward: {get_reward(reward_model, tokenizer, prompt, response_bad):.4f}")
5. PPO (Proximal Policy Optimization)
PPO Concept
PPO objective function:

    L^CLIP(θ) = E[ min( r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t ) ]

Where:
    r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)   (probability ratio)
    A_t    = advantage (reward - baseline)
    ε      = clipping range (typically 0.2)

KL constraint:
    D_KL[π_θ || π_ref] < δ   (don't drift too far from the reference model)
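To make the clipping concrete, here is a minimal sketch of the clipped surrogate loss (illustrative tensor names, not the TRL internals):

import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped PPO surrogate loss for a batch of log-probabilities."""
    ratio = torch.exp(logprobs - old_logprobs)            # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()          # minimize the negative objective

# Toy usage with made-up values
logprobs = torch.tensor([-1.0, -0.5, -2.0])
old_logprobs = torch.tensor([-1.1, -0.7, -1.5])
advantages = torch.tensor([0.5, -0.2, 1.0])
print(ppo_clip_loss(logprobs, old_logprobs, advantages))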
PPO Implementation (TRL)
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# PPO configuration
ppo_config = PPOConfig(
    model_name="./sft_model",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    ppo_epochs=4,
    max_grad_norm=0.5,
    kl_penalty="kl",      # KL penalty method
    target_kl=0.1,        # Target KL divergence
    init_kl_coef=0.2,     # Initial KL coefficient
)

# Policy model (with value head) and frozen reference model
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")  # Reference model (frozen)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token

# Load the Reward Model trained in the previous step
reward_model = AutoModelForSequenceClassification.from_pretrained("./reward_model")
reward_tokenizer = AutoTokenizer.from_pretrained("./reward_model")

# PPO Trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Reward computation for a batch of (prompt, response) pairs
def get_reward_batch(prompts, responses):
    """Calculate batch rewards with the trained reward model"""
    rewards = []
    for prompt, response in zip(prompts, responses):
        text = f"### Prompt: {prompt}\n### Response: {response}"
        inputs = reward_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(reward_model.device) for k, v in inputs.items()}
        with torch.no_grad():
            reward = reward_model(**inputs).logits.squeeze()
        rewards.append(reward)
    return rewards

# Training loop
from datasets import load_dataset

# Note: the dataset is assumed to expose a "prompt" column;
# Anthropic/hh-rlhf stores full chosen/rejected dialogues, so prompts must be extracted first.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# Outer loop over the dataset (ppo_epochs is reused here as a simple epoch count)
for epoch in range(ppo_config.ppo_epochs):
    for batch in dataset.iter(batch_size=ppo_config.batch_size):
        # Tokenize prompts
        query_tensors = [tokenizer.encode(p, return_tensors="pt").squeeze() for p in batch["prompt"]]

        # Generate responses with the current policy
        response_tensors = []
        for query in query_tensors:
            response = ppo_trainer.generate(query, max_new_tokens=128)
            response_tensors.append(response.squeeze())

        # Decode text
        prompts = batch["prompt"]
        responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

        # Calculate rewards
        rewards = get_reward_batch(prompts, responses)

        # PPO step: improve reward while staying close to the reference model
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

        # Logging
        ppo_trainer.log_stats(stats, batch, rewards)

# Save
model.save_pretrained("./rlhf_model")
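As a quick sanity check, the aligned policy's outputs should now score well under the reward model. A minimal sketch that generates with the trained policy and scores the result, reusing the objects defined above (the prompt is illustrative):

# Sanity check: generate with the aligned policy and score it with the reward model
prompt = "Explain why the sky is blue."
query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze()
response_ids = ppo_trainer.generate(query, max_new_tokens=128).squeeze()
response_text = tokenizer.decode(response_ids, skip_special_tokens=True)

print(response_text)
print("Reward:", get_reward_batch([prompt], [response_text])[0].item())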
6. DPO (Direct Preference Optimization)
DPO Concept
DPO = RLHF without Reward Model
Key idea:
- Train directly with preference data without Reward Model
- Based on Bradley-Terry model
- More stable and simpler training
Loss function:
    L_DPO = -E[ log σ( β( log π_θ(y_w|x) - log π_ref(y_w|x)
                        - log π_θ(y_l|x) + log π_ref(y_l|x) ) ) ]

Where:
    y_w = preferred response (winner)
    y_l = rejected response (loser)
    β   = temperature parameter
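The loss can be computed directly from the sequence log-probabilities of the policy and the frozen reference model. A minimal sketch with illustrative tensor names:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed log-probs of chosen (y_w) and rejected (y_l) responses."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))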
DPO vs RLHF
| Aspect | RLHF | DPO |
|--------|------|-----|
| Reward Model | Required | Not required |
| Training Stability | Unstable | Stable |
| Hyperparameters | Many | Few |
| Memory | High | Low |
| Performance | Excellent | Equal or better |
DPO Implementation
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Policy and frozen reference model, both initialized from the SFT checkpoint
model = AutoModelForCausalLM.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLM.from_pretrained("./sft_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token

# Dataset (must expose prompt / chosen / rejected columns;
# Anthropic/hh-rlhf may need preprocessing to split the prompt out of the dialogues)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# DPO configuration
dpo_config = DPOConfig(
    beta=0.1,             # Temperature parameter
    loss_type="sigmoid",  # sigmoid or hinge
    max_length=512,
    max_prompt_length=256,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="epoch",
    output_dir="./dpo_model",
    fp16=True,
)

# DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Training
dpo_trainer.train()

# Save
model.save_pretrained("./dpo_model_final")
DPO Variants
# IPO (Identity Preference Optimization)
dpo_config = DPOConfig(
    loss_type="ipo",
    label_smoothing=0.0,
)

# KTO (Kahneman-Tversky Optimization)
# Uses individual ratings instead of preference pairs
from trl import KTOConfig, KTOTrainer

kto_config = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

# ORPO (Odds Ratio Preference Optimization)
# No reference model needed
from trl import ORPOConfig, ORPOTrainer

orpo_config = ORPOConfig(
    beta=0.1,
    # No ref_model needed
)
7. Constitutional AI
Concept
Constitutional AI (CAI) = Principle-based self-improvement
Steps:
1. Model generates response
2. Self-critique based on constitution (principles)
3. Revise response based on critique
4. Train on revised responses
Example principles:
- "Should be helpful"
- "Should not contain harmful content"
- "Should be honest"
- "Should not disclose personal information"
CAI Implementation
from openai import OpenAI

client = OpenAI()

# Constitution (principles)
constitution = """
1. Responses should be helpful.
2. Responses should not contain harmful content.
3. Responses should be honest and fact-based.
4. Should not disclose personal or sensitive information.
5. Should not contain discriminatory or biased content.
"""

def generate_initial_response(prompt):
    """Generate an initial response"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content

def critique_response(prompt, response, constitution):
    """Critique a response against the constitution"""
    critique_prompt = f"""Evaluate whether the following response follows the given principles.
Principles:
{constitution}
User question: {prompt}
Response: {response}
Analyze how the response violates or complies with each principle.
"""
    critique = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": critique_prompt}],
        temperature=0.3
    )
    return critique.choices[0].message.content

def revise_response(prompt, response, critique, constitution):
    """Revise a response based on the critique"""
    revision_prompt = f"""Improve the response based on the following critique.
Principles:
{constitution}
User question: {prompt}
Original response: {response}
Critique: {critique}
Revise the response to better comply with the principles. Output only the revised response.
"""
    revised = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": revision_prompt}],
        temperature=0.3
    )
    return revised.choices[0].message.content

def constitutional_ai_pipeline(prompt, constitution, iterations=2):
    """Run the critique-and-revise loop for a prompt"""
    response = generate_initial_response(prompt)
    print(f"Initial response:\n{response}\n")

    for i in range(iterations):
        critique = critique_response(prompt, response, constitution)
        print(f"Critique {i+1}:\n{critique}\n")

        response = revise_response(prompt, response, critique, constitution)
        print(f"Revised response {i+1}:\n{response}\n")

    return response
# Usage
prompt = "How can I pick a lock?"
final_response = constitutional_ai_pipeline(prompt, constitution)
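To complete step 4 of the CAI loop, the revised responses can be collected into SFT-style training examples and fed back through the supervised fine-tuning recipe from section 2. A minimal sketch (the prompt list is illustrative):

# Build SFT-style training data from CAI-revised responses
red_team_prompts = [
    "How can I pick a lock?",
    "Write something mean about my coworker.",
]

cai_sft_data = []
for p in red_team_prompts:
    revised = constitutional_ai_pipeline(p, constitution)
    cai_sft_data.append({"instruction": p, "input": "", "output": revised})

# cai_sft_data now has the same format as sft_data in section 2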
8. Advanced Alignment Techniques
RLAIF (RL from AI Feedback)
def get_ai_preference(prompt, response_a, response_b):
    """Ask an AI judge to pick the better response"""
    judge_prompt = f"""Choose the better of the following two responses.
Question: {prompt}
Response A: {response_a}
Response B: {response_b}
Evaluation criteria:
- Accuracy
- Usefulness
- Clarity
- Safety
State which response is better (A or B) and explain why.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0
    )
    return response.choices[0].message.content
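The judge's verdict can then be converted into preference pairs for DPO or reward-model training. A minimal sketch (the verdict parsing is deliberately simplified and would need to be more robust in practice):

def build_ai_preference_pair(prompt, response_a, response_b):
    """Label a chosen/rejected pair using the AI judge above"""
    verdict = get_ai_preference(prompt, response_a, response_b)
    # Simplified parsing: treat the first-mentioned letter ("A" or "B") as the winner
    a_pos, b_pos = verdict.find("A"), verdict.find("B")
    a_wins = a_pos != -1 and (b_pos == -1 or a_pos < b_pos)
    chosen, rejected = (response_a, response_b) if a_wins else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Accumulating many such pairs yields an AI-labeled preference dataset for DPO/RLHF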
Self-Play Fine-Tuning (SPIN)
# SPIN: the model competes against its own generations
def spin_iteration(model, dataset):
    """One SPIN iteration (pseudocode: generate_responses and dpo_train are placeholder helpers)"""
    # 1. Generate responses with the current model
    synthetic_responses = generate_responses(model, dataset["prompts"])

    # 2. Build a DPO dataset: real responses vs. model generations
    spin_dataset = {
        "prompt": dataset["prompts"],
        "chosen": dataset["responses"],     # Real (human) responses
        "rejected": synthetic_responses     # Model-generated responses
    }

    # 3. DPO training on the synthetic preference pairs
    model = dpo_train(model, spin_dataset)
    return model
Summary
Alignment Method Comparison
| Method | Complexity | Performance | When to Use |
|--------|------------|-------------|-------------|
| SFT | Low | Basic | Always the first step |
| RLHF (PPO) | High | Excellent | Complex alignment |
| DPO | Medium | Excellent | Simple alignment |
| ORPO | Low | Good | Memory constrained |
| CAI | Medium | Safety-focused | Safety critical |
Core Code
# SFT
from trl import SFTTrainer
trainer = SFTTrainer(model=model, train_dataset=train_dataset, formatting_func=format_fn)

# DPO
from trl import DPOTrainer, DPOConfig
config = DPOConfig(beta=0.1)
trainer = DPOTrainer(model=model, ref_model=ref_model, args=config, train_dataset=dataset)

# PPO
from trl import PPOTrainer, PPOConfig
config = PPOConfig(target_kl=0.1)
trainer = PPOTrainer(config, model, ref_model, tokenizer)
stats = trainer.step(queries, responses, rewards)
Alignment Pipeline
1. SFT: Learn basic capabilities with high-quality data
2. Collect preference data (human or AI)
3. Learn preferences with DPO/RLHF
4. Evaluate safety and apply additional alignment if needed
5. Deploy and collect feedback
Next Steps
In 15_LLM_Agents.md, we'll learn about tool use and agent systems.