14. RLHF and LLM Alignment

Learning Objectives

  • Understand RLHF (Reinforcement Learning from Human Feedback)
  • Train a reward model from human preference data
  • Optimize a policy with PPO
  • Apply DPO (Direct Preference Optimization)
  • Understand Constitutional AI and safe-AI practices

1. LLM Alignment Overview

Why is Alignment Needed?

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  The Need for LLM Alignment                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  Pre-trained Model (Base Model)                              β”‚
β”‚      β”‚                                                       β”‚
β”‚      β”‚  Problems:                                            β”‚
β”‚      β”‚  - Simply predicts next token                         β”‚
β”‚      β”‚  - Can generate harmful content                       β”‚
β”‚      β”‚  - Difficulty following instructions                  β”‚
β”‚      β–Ό                                                       β”‚
β”‚  Aligned Model                                               β”‚
β”‚      β”‚                                                       β”‚
β”‚      β”‚  Goals:                                               β”‚
β”‚      β”‚  - Helpful                                            β”‚
β”‚      β”‚  - Harmless                                           β”‚
β”‚      β”‚  - Honest                                             β”‚
β”‚                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Evolution of Alignment Methodologies

SFT (Supervised Fine-Tuning)
    β”‚  Supervised learning with high-quality data
    β–Ό
RLHF (Reinforcement Learning from Human Feedback)
    β”‚  Reward model + reinforcement learning
    β–Ό
DPO (Direct Preference Optimization)
    β”‚  Direct preference optimization
    β–Ό
Constitutional AI
    β”‚  Principle-based self-improvement

2. SFT (Supervised Fine-Tuning)

Basic Concept

# SFT data format
sft_data = [
    {
        "instruction": "Write a poem about spring.",
        "input": "",
        "output": "Flowers bloom in gentle rain,\nBirds return to sing again..."
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]

SFT Implementation

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Formatting function
def format_instruction(example):
    if example["context"]:
        return f"""### Instruction:
{example['instruction']}

### Context:
{example['context']}

### Response:
{example['response']}"""
    else:
        return f"""### Instruction:
{example['instruction']}

### Response:
{example['response']}"""

# Training configuration
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
)

# SFTTrainer (note: in newer TRL versions, max_seq_length moves into SFTConfig)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_instruction,
    max_seq_length=1024,
    args=training_args,
)

trainer.train()
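
After training, a quick sanity check is to generate from the fine-tuned model with the same prompt template. This is a minimal sketch, assuming the final weights and tokenizer have been saved to ./sft_model (e.g. via trainer.save_model() and tokenizer.save_pretrained()); the prompt and generation settings are illustrative.

# Quick generation check with the fine-tuned SFT model
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_model = AutoModelForCausalLM.from_pretrained("./sft_model")    # assumes trainer.save_model("./sft_model")
sft_tokenizer = AutoTokenizer.from_pretrained("./sft_model")       # assumes tokenizer.save_pretrained("./sft_model")

prompt = "### Instruction:\nExplain RLHF in two sentences.\n\n### Response:\n"
inputs = sft_tokenizer(prompt, return_tensors="pt")
outputs = sft_model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(sft_tokenizer.decode(outputs[0], skip_special_tokens=True))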

3. RLHF Pipeline

Overall Process

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     RLHF Pipeline                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Stage 1: SFT                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    Supervised    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚  β”‚ Base Model  β”‚ ──────────▢     β”‚  SFT Model  β”‚              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚                                                                 β”‚
β”‚  Stage 2: Reward Model Training                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    Preference data β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚ SFT Model   β”‚ ──────────────▢   β”‚Reward Modelβ”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                 β”‚
β”‚  Stage 3: PPO Reinforcement Learning                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚  SFT Model  β”‚ + β”‚Reward Modelβ”‚ β–Ά β”‚ RLHF Model β”‚          β”‚
β”‚  β”‚  (Policy)   β”‚   β”‚  (Critic)  β”‚   β”‚ (Aligned)  β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Preference Data Collection

# Preference data format
preference_data = [
    {
        "prompt": "Write a haiku about mountains.",
        "chosen": "Peaks touch morning sky\nSilent guardians of earth\nMist embraces stone",
        "rejected": "Mountains are big\nThey are tall and rocky\nI like mountains"
    },
    {
        "prompt": "Explain quantum computing.",
        "chosen": "Quantum computing harnesses quantum mechanics principles...",
        "rejected": "Quantum computing is computers that use quantum stuff..."
    }
]

# HuggingFace format
from datasets import Dataset

dataset = Dataset.from_list(preference_data)
# Columns are already named prompt / chosen / rejected, as expected by TRL trainers

4. Reward Model Training

Reward Model Concept

Input: (prompt, response)
Output: scalar reward (score)

Training objective:
    reward(prompt, chosen) > reward(prompt, rejected)
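
This objective is typically implemented as a Bradley-Terry style logistic loss over the score difference. Below is a minimal sketch of the pairwise loss (simplified; RewardTrainer batches chosen and rejected sequences together internally):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: push r(chosen) above r(rejected)."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example with two preference pairs
r_chosen = torch.tensor([1.2, 0.8])
r_rejected = torch.tensor([0.3, 0.9])
print(pairwise_reward_loss(r_chosen, r_rejected))  # scalar loss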

Implementation

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

# Reward Model (add classification head)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1  # Scalar output
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Data preprocessing
def preprocess_reward_data(examples):
    """Convert preference data for Reward training"""
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }

    for prompt, chosen, rejected in zip(examples["prompt"], examples["chosen"], examples["rejected"]):
        # Chosen
        chosen_text = f"### Prompt: {prompt}\n### Response: {chosen}"
        chosen_tokenized = tokenizer(chosen_text, truncation=True, max_length=512)
        new_examples["input_ids_chosen"].append(chosen_tokenized["input_ids"])
        new_examples["attention_mask_chosen"].append(chosen_tokenized["attention_mask"])

        # Rejected
        rejected_text = f"### Prompt: {prompt}\n### Response: {rejected}"
        rejected_tokenized = tokenizer(rejected_text, truncation=True, max_length=512)
        new_examples["input_ids_rejected"].append(rejected_tokenized["input_ids"])
        new_examples["attention_mask_rejected"].append(rejected_tokenized["attention_mask"])

    return new_examples

# Build train/eval splits from a preference dataset (the toy `preference_data` above is
# illustrative; in practice use a larger corpus in prompt/chosen/rejected format)
dataset = dataset.map(preprocess_reward_data, batched=True, remove_columns=dataset.column_names)
split = dataset.train_test_split(test_size=0.05)
train_dataset, eval_dataset = split["train"], split["test"]

# Training
training_args = TrainingArguments(
    output_dir="./reward_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    fp16=True,
)

trainer = RewardTrainer(
    model=reward_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

trainer.train()

Using Reward Model

def get_reward(model, tokenizer, prompt, response):
    """Calculate reward score for response"""
    text = f"### Prompt: {prompt}\n### Response: {response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        reward = outputs.logits.squeeze().item()

    return reward

# Usage example
prompt = "Explain photosynthesis."
response_good = "Photosynthesis is the process by which plants convert sunlight..."
response_bad = "Plants eat light."

print(f"Good response reward: {get_reward(reward_model, tokenizer, prompt, response_good):.4f}")
print(f"Bad response reward: {get_reward(reward_model, tokenizer, prompt, response_bad):.4f}")

5. PPO (Proximal Policy Optimization)

PPO Concept

PPO objective function:
    L^CLIP(ΞΈ) = E[min(r_t(ΞΈ)A_t, clip(r_t(ΞΈ), 1-Ξ΅, 1+Ξ΅)A_t)]

Where:
    r_t(ΞΈ) = Ο€_ΞΈ(a_t|s_t) / Ο€_ΞΈ_old(a_t|s_t)  (probability ratio)
    A_t = advantage (Reward - Baseline)
    Ξ΅ = clipping range (typically 0.2)

KL constraint:
    D_KL[Ο€_ΞΈ || Ο€_ref] < Ξ΄  (don't drift too far from reference model)
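
As a minimal sketch of the clipped surrogate objective in PyTorch, assuming per-token log-probabilities and advantages are already computed (TRL handles this internally, including masking of padding tokens):

import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO policy loss over a batch of token log-probabilities."""
    ratio = torch.exp(logprobs - old_logprobs)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the surrogate objective => minimize its negative
    return -torch.min(unclipped, clipped).mean()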

PPO Implementation (TRL)

from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# PPO configuration
ppo_config = PPOConfig(
    model_name="./sft_model",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    ppo_epochs=4,
    max_grad_norm=0.5,
    kl_penalty="kl",           # KL penalty method
    target_kl=0.1,             # Target KL divergence
    init_kl_coef=0.2,          # Initial KL coefficient
)

# Model (with Value head)
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")  # Reference model (frozen)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token

# Load Reward Model
reward_model = AutoModelForSequenceClassification.from_pretrained("./reward_model")
reward_tokenizer = AutoTokenizer.from_pretrained("./reward_model")

# PPO Trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
def get_reward_batch(prompts, responses):
    """Calculate batch rewards"""
    rewards = []
    for prompt, response in zip(prompts, responses):
        text = f"### Prompt: {prompt}\n### Response: {response}"
        inputs = reward_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(reward_model.device) for k, v in inputs.items()}

        with torch.no_grad():
            reward = reward_model(**inputs).logits.squeeze()
        rewards.append(reward)

    return rewards

# Training
from datasets import load_dataset

# Note: hh-rlhf stores full conversations in "chosen"/"rejected";
# a "prompt" column must be extracted from them before this loop
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

num_epochs = 1  # passes over the prompt dataset; ppo_config.ppo_epochs controls inner PPO updates
for epoch in range(num_epochs):
    for batch in dataset.iter(batch_size=ppo_config.batch_size):
        # Tokenize prompts
        query_tensors = [tokenizer.encode(p, return_tensors="pt").squeeze() for p in batch["prompt"]]

        # Generate responses (return_prompt=False keeps only the newly generated tokens)
        response_tensors = []
        for query in query_tensors:
            response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=128)
            response_tensors.append(response.squeeze())

        # Decode text
        prompts = batch["prompt"]
        responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

        # Calculate rewards
        rewards = get_reward_batch(prompts, responses)

        # PPO step
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

        # Logging
        ppo_trainer.log_stats(stats, batch, rewards)

# Save
model.save_pretrained("./rlhf_model")

6. DPO (Direct Preference Optimization)

DPO Concept

DPO = RLHF without Reward Model

Key idea:
    - Train directly with preference data without Reward Model
    - Based on Bradley-Terry model
    - More stable and simpler training

Loss function:
    L_DPO = -E[log Οƒ(Ξ²(log Ο€_ΞΈ(y_w|x) - log Ο€_ref(y_w|x)
                      - log Ο€_ΞΈ(y_l|x) + log Ο€_ref(y_l|x)))]

Where:
    y_w = preferred response (winner)
    y_l = rejected response (loser)
    Ξ² = temperature parameter
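
The loss can be computed directly from sequence-level log-probabilities. A minimal sketch follows, assuming the log-probs of each response under the policy and the frozen reference model are already summed over tokens (DPOTrainer does this internally):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sigmoid DPO loss from sequence-level log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()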

DPO vs RLHF

Aspect               RLHF        DPO
Reward Model         Required    Not required
Training Stability   Unstable    Stable
Hyperparameters      Many        Few
Memory               High        Low
Performance          Excellent   Equal or better

DPO Implementation

from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Model
model = AutoModelForCausalLM.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLM.from_pretrained("./sft_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
tokenizer.pad_token = tokenizer.eos_token

# Dataset (must provide prompt / chosen / rejected columns; hh-rlhf requires
# extracting the prompt from its "chosen"/"rejected" conversations first)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# DPO configuration
dpo_config = DPOConfig(
    beta=0.1,                          # Temperature parameter
    loss_type="sigmoid",               # sigmoid or hinge
    max_length=512,
    max_prompt_length=256,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="epoch",
    output_dir="./dpo_model",
    fp16=True,
)

# DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Training
dpo_trainer.train()

# Save
model.save_pretrained("./dpo_model_final")

DPO Variants

# IPO (Identity Preference Optimization)
dpo_config = DPOConfig(
    loss_type="ipo",
    label_smoothing=0.0,
)

# KTO (Kahneman-Tversky Optimization)
# Uses individual ratings instead of preference pairs
from trl import KTOConfig, KTOTrainer

kto_config = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

# ORPO (Odds Ratio Preference Optimization)
# No reference model needed
from trl import ORPOConfig, ORPOTrainer

orpo_config = ORPOConfig(
    beta=0.1,
    # No ref_model needed
)
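
Since ORPO folds the preference term into the SFT loss, no reference model is passed to the trainer. A minimal usage sketch, reusing the dataset and tokenizer from the DPO example (argument names may differ slightly across TRL versions):

# ORPO: single-model preference optimization, no ref_model
orpo_trainer = ORPOTrainer(
    model=model,
    args=orpo_config,
    train_dataset=dataset,      # prompt / chosen / rejected format
    tokenizer=tokenizer,
)
orpo_trainer.train()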

7. Constitutional AI

Concept

Constitutional AI (CAI) = Principle-based self-improvement

Steps:
    1. Model generates response
    2. Self-critique based on constitution (principles)
    3. Revise response based on critique
    4. Train on revised responses

Example principles:
    - "Should be helpful"
    - "Should not contain harmful content"
    - "Should be honest"
    - "Should not disclose personal information"

CAI Implementation

from openai import OpenAI

client = OpenAI()

# Constitution (Principles)
constitution = """
1. Responses should be helpful.
2. Responses should not contain harmful content.
3. Responses should be honest and fact-based.
4. Should not disclose personal or sensitive information.
5. Should not contain discriminatory or biased content.
"""

def generate_initial_response(prompt):
    """Generate initial response"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content

def critique_response(prompt, response, constitution):
    """Critique response"""
    critique_prompt = f"""Evaluate whether the following response follows the given principles.

Principles:
{constitution}

User question: {prompt}

Response: {response}

Analyze how the response violates or complies with each principle.
"""
    critique = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": critique_prompt}],
        temperature=0.3
    )
    return critique.choices[0].message.content

def revise_response(prompt, response, critique, constitution):
    """Revise response"""
    revision_prompt = f"""Improve the response based on the following critique.

Principles:
{constitution}

User question: {prompt}

Original response: {response}

Critique: {critique}

Revise the response to better comply with the principles. Output only the revised response.
"""
    revised = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": revision_prompt}],
        temperature=0.3
    )
    return revised.choices[0].message.content

def constitutional_ai_pipeline(prompt, constitution, iterations=2):
    """CAI pipeline"""
    response = generate_initial_response(prompt)
    print(f"Initial response:\n{response}\n")

    for i in range(iterations):
        critique = critique_response(prompt, response, constitution)
        print(f"Critique {i+1}:\n{critique}\n")

        response = revise_response(prompt, response, critique, constitution)
        print(f"Revised response {i+1}:\n{response}\n")

    return response

# Usage
prompt = "How can I pick a lock?"
final_response = constitutional_ai_pipeline(prompt, constitution)

8. Advanced Alignment Techniques

RLAIF (RL from AI Feedback)

def get_ai_preference(prompt, response_a, response_b):
    """AI judges preference"""
    judge_prompt = f"""Choose the better of the following two responses.

Question: {prompt}

Response A: {response_a}

Response B: {response_b}

Evaluation criteria:
- Accuracy
- Usefulness
- Clarity
- Safety

State which response is better (A or B) and explain why.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0
    )
    return response.choices[0].message.content
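
The judge's verdicts can then be converted into the same chosen/rejected format used by the reward-model and DPO stages. A minimal sketch, assuming the judge is prompted to begin its answer with "A" or "B" (the parsing below is illustrative, not a robust evaluator):

def build_ai_preference_pair(prompt, response_a, response_b):
    """Convert an AI judgment into a DPO-style preference example."""
    verdict = get_ai_preference(prompt, response_a, response_b)
    # Illustrative parsing: assumes the judge's answer starts with "A" or "B"
    prefers_a = verdict.strip().upper().startswith("A")
    return {
        "prompt": prompt,
        "chosen": response_a if prefers_a else response_b,
        "rejected": response_b if prefers_a else response_a,
    }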

Self-Play Fine-Tuning (SPIN)

# SPIN: the model is trained to prefer real (human) responses over its own generations.
# `generate_responses` and `dpo_train` below are placeholder helpers, not TRL functions.

def spin_iteration(model, dataset):
    """SPIN iteration"""
    # 1. Generate responses with current model
    synthetic_responses = generate_responses(model, dataset["prompts"])

    # 2. DPO with real vs generated responses
    spin_dataset = {
        "prompt": dataset["prompts"],
        "chosen": dataset["responses"],      # Real responses
        "rejected": synthetic_responses      # Model-generated responses
    }

    # 3. DPO training
    model = dpo_train(model, spin_dataset)

    return model

Summary

Alignment Method Comparison

Method       Complexity   Performance   When to Use
SFT          Low          Basic         Always the first step
RLHF (PPO)   High         Excellent     Complex alignment
DPO          Medium       Excellent     Simple alignment
ORPO         Low          Good          Memory constrained
CAI          Medium       Safety        Safety critical

Core Code

# SFT
from trl import SFTTrainer
trainer = SFTTrainer(model=model, train_dataset=dataset, formatting_func=format_fn)

# DPO
from trl import DPOTrainer, DPOConfig
config = DPOConfig(beta=0.1)
trainer = DPOTrainer(model=model, ref_model=ref_model, args=config, train_dataset=dataset)

# PPO
from trl import PPOTrainer, PPOConfig
config = PPOConfig(target_kl=0.1)
trainer = PPOTrainer(config=config, model=model, ref_model=ref_model, tokenizer=tokenizer)
stats = trainer.step(queries, responses, rewards)

Alignment Pipeline

1. SFT: Learn basic capabilities with high-quality data
2. Collect preference data (human or AI)
3. Learn preferences with DPO/RLHF
4. Evaluate safety and apply additional alignment if needed
5. Deploy and collect feedback

Next Steps

In 15_LLM_Agents.md, we'll learn about tool use and agent systems.
