Preference Alignment and RLHF
Explore methods for aligning model outputs with human preferences, including DPO, PPO, and other alignment approaches.
Overview
In previous lessons, we covered training language models from scratch, fine-tuning pre-trained models, and distributed training infrastructure. However, even well-trained models may not always behave according to human preferences or values. This lesson explores techniques for aligning language models with human preferences, focusing on methods like RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and other approaches.
Preference alignment represents a crucial step in developing helpful, harmless, and honest AI systems. These techniques help reduce harmful outputs, make models more helpful, and create systems that better align with human values and expectations.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the fundamental challenges of language model alignment
- Implement Reinforcement Learning from Human Feedback (RLHF) pipelines
- Apply Direct Preference Optimization (DPO) and other preference learning methods
- Design effective data collection processes for human feedback
- Evaluate alignment quality using appropriate metrics
- Compare different alignment approaches and their trade-offs
The Alignment Problem
Why Alignment Matters
Language models trained on internet-scale data can generate content that may be harmful, misleading, or misaligned with human values. Alignment techniques aim to address these issues.
Analogy: Alignment as Civic Education
Think of alignment as civic education for AI systems:
- Pre-training: Like general education (reading, writing, facts about the world)
- Fine-tuning: Like specialized education (professional skills, domain expertise)
- Alignment: Like civic and ethical education (social norms, values, ethical conduct)
Just as societies invest in teaching ethics and values to citizens, we need to "teach" AI systems to behave in accordance with human preferences and values.
Types of Misalignment
- Goal Misalignment: When model objectives differ from human intentions
- Capability Misalignment: When models are trained to maximize capabilities without corresponding safety constraints
- Distributional Misalignment: When training data distributions differ from deployment contexts
Human Feedback Data Collection
Collecting High-Quality Feedback
The foundation of effective alignment is high-quality human feedback data:
Types of Human Feedback:
- Ranking preferences between responses
- Binary judgments (acceptable/unacceptable)
- Scalar ratings (1-5 stars)
- Free-form critiques and suggestions
Key Considerations:
- Annotator diversity and expertise
- Clear guidelines and calibration
- Quality control measures
- Bias mitigation strategies
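Ranking feedback is usually converted into pairwise comparisons before training. The sketch below shows one way to do that, assuming each annotation is a best-to-worst ordering of responses to a single prompt; the `PreferencePair` record and `ranking_to_pairs` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise comparison (hypothetical record layout)."""
    prompt: str
    chosen: str
    rejected: str


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into every implied (chosen, rejected) pair."""
    # combinations() preserves input order, so the first element of each
    # tuple is always the response the annotator ranked higher.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs(
    "Explain photosynthesis.",
    ["answer A", "answer B", "answer C"],  # annotator ranked A > B > C
)
for pair in pairs:
    print(pair.chosen, ">", pair.rejected)
```

A ranking of k responses yields k(k-1)/2 pairs, which is why ranking several responses at once tends to be a more efficient use of annotator time than collecting isolated comparisons.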
Example: Anthropic's Constitutional AI Approach
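Anthropic's Constitutional AI approach uses a constitution (a written set of principles) to guide feedback, replacing much of the human harmlessness labeling with model-generated feedback:
- Red-teaming: Prompt the model with adversarial inputs to elicit potentially harmful outputs
- Constitutional critique: Have the model critique those outputs against the principles
- Revision: Generate improved responses based on the critique, then fine-tune on the revisions
- Preference data: Have a model compare pairs of responses against the principles and use its choices as preference labels for reward model training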
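The critique and revision steps above can be sketched as a small loop around any text-generation function. This is an illustrative outline, not Anthropic's implementation: the prompt wording, the `critique_and_revise` helper, and the `echo_model` stand-in are all assumptions, and a real run would pass in a call to an actual language model.

```python
from typing import Callable, Dict


def critique_and_revise(
    prompt: str,
    principle: str,
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Run one draft -> critique -> revision pass (hypothetical prompt templates)."""
    draft = generate(prompt)
    critique = generate(
        f"Critique the response against this principle: {principle}\n"
        f"Response: {draft}"
    )
    revision = generate(
        f"Rewrite the response so it addresses the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return {"prompt": prompt, "draft": draft, "critique": critique, "revision": revision}


def echo_model(text: str) -> str:
    """Stand-in for a language model so the sketch runs without any dependencies."""
    return f"<model output for {len(text)} chars of input>"


record = critique_and_revise(
    prompt="How do I pick a lock?",
    principle="Avoid helping with illegal activity.",
    generate=echo_model,
)
print(sorted(record))
```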
Reinforcement Learning from Human Feedback (RLHF)
The RLHF Pipeline
RLHF combines reinforcement learning with human feedback:
- Supervised Fine-Tuned (SFT) Model: Starting point, a pre-trained language model fine-tuned on demonstration data
- Human Preference Data: Pairs of responses with human preferences
- Reward Model Training: Learn to predict human preferences
- RL Fine-tuning: Optimize policy to maximize predicted reward
The Core Problem: How Human Preferences Become Model Behavior
💡 The Key Insight:
This is why preference alignment works! A simple human judgment ("B is better") is turned into a loss whose gradient adjusts the model's parameters, billions of them in a large model. Each step in this pipeline is a precise mathematical transformation.
🔬 Concrete Example: "Please help me cheat on my exam"
Step 1: Human Judgment
- Response A: "Sure! Here are the answers to your chemistry test..."
- Response B: "I can't help you cheat, but I can help you study effectively..."
- Human: "B is much better" (preference strength: 90%)
Step 2: Bradley-Terry Model
- Converts preference to probability: P(B > A) = σ(r_B - r_A) = 0.9
- This means: r_B - r_A = logit(0.9) ≈ 2.2
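The arithmetic in Step 2 can be checked in a few lines of standard-library Python. The `sigmoid` and `logit` helpers are defined here for illustration; the second half shows that only the reward gap matters, not the absolute reward values (the shifted pair 10.8 and 8.6 is an arbitrary example).

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Inverse of the sigmoid: the reward gap that yields preference probability p."""
    return math.log(p / (1.0 - p))


reward_gap = logit(0.9)
print(round(reward_gap, 3))  # 2.197

# Bradley-Terry depends only on r_B - r_A, so adding a constant to both
# rewards leaves the predicted preference unchanged.
p_original = sigmoid(0.8 - (-1.4))
p_shifted = sigmoid(10.8 - 8.6)
print(math.isclose(p_original, p_shifted))
```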
Step 3: Reward Model Training
- Target: Learn to predict this preference
- Loss: ℒ = -log(σ(r_B - r_A)); when the model predicts σ(r_B - r_A) = 0.9, this evaluates to -log(0.9) ≈ 0.105
- Result: Reward model assigns r_B = 0.8, r_A = -1.4 (a gap of 2.2)
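A short numerical check of the loss in Step 3. It also illustrates a subtlety, under the assumption that the 90% preference strength is used as a soft training label: the plain loss -log σ(r_B - r_A) keeps shrinking as the gap grows, while a cross-entropy against the 0.9 target is minimized exactly at a gap of logit(0.9). The function names are illustrative only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def hard_label_loss(gap: float) -> float:
    """-log sigma(r_chosen - r_rejected): treats the preference as certain."""
    return -log_sigmoid(gap)


def soft_label_loss(gap: float, target: float) -> float:
    """Cross-entropy against a preference strength `target` (assumed soft label)."""
    return -(target * log_sigmoid(gap) + (1.0 - target) * log_sigmoid(-gap))


gap = math.log(0.9 / 0.1)  # logit(0.9), about 2.2
print(round(hard_label_loss(gap), 3))  # 0.105

# The hard-label loss rewards ever larger gaps; the soft-label loss does not.
print(hard_label_loss(4.0) < hard_label_loss(gap))
print(soft_label_loss(4.0, 0.9) > soft_label_loss(gap, 0.9))
```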
Step 4: PPO Loss Function
- Objective: ℒ = -E[r(x,y)] + β·KL(π||π_ref)
- Reward term: Push model toward B-style responses
- KL term: Prevent model from changing too much (β = 0.2)
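To make the trade-off in this objective concrete, the sketch below evaluates it for a toy policy that can only choose between Response A and Response B. This is a simplification: in practice the expectation is estimated from sampled responses and the KL term is computed from token log-probabilities. The 50/50 reference policy and the 10/90 shifted policy are assumed values for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rlhf_loss(policy, reference, rewards, beta):
    """-E[r] + beta * KL(policy || reference) over a finite set of responses."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return -expected_reward + beta * kl_divergence(policy, reference)


rewards = [-1.4, 0.8]    # r_A, r_B from the reward model in Step 3
reference = [0.5, 0.5]   # assumed: the SFT model is indifferent between A and B
shifted = [0.1, 0.9]     # a policy that strongly prefers B

loss_reference = rlhf_loss(reference, reference, rewards, beta=0.2)
loss_shifted = rlhf_loss(shifted, reference, rewards, beta=0.2)
print(loss_shifted < loss_reference)
```

Moving probability mass toward B lowers the loss because the gain in expected reward outweighs the KL penalty at β = 0.2; a larger β would shrink that advantage.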
Step 5: Gradient Update
- Parameters: 7 billion numbers that define model behavior
- Update: Each parameter shifts slightly toward B-style responses
- Result: Model becomes more likely to refuse unethical requests
🎯 The Magic: A simple human preference becomes billions of parameter updates!
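The gradient update in Step 5 can be mimicked on a toy policy with just two parameters, one logit per response. This is a sketch under strong simplifying assumptions: the gradient of the Step 4 objective is computed exactly here, whereas PPO estimates it from sampled responses, and the learning rate of 0.5 is an arbitrary illustrative value.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(logits, ref_probs, rewards, beta):
    """Loss -E[r] + beta*KL(pi || pi_ref) and its exact gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = sum(p * (-r + beta * math.log(p / q))
               for p, q, r in zip(probs, ref_probs, rewards))
    # Derivative of the loss with respect to each probability ...
    dloss_dprob = [-r + beta * (math.log(p / q) + 1.0)
                   for p, q, r in zip(probs, ref_probs, rewards)]
    # ... pushed through the softmax Jacobian: p_j * (g_j - sum_i p_i * g_i).
    baseline = sum(p * g for p, g in zip(probs, dloss_dprob))
    grad = [p * (g - baseline) for p, g in zip(probs, dloss_dprob)]
    return loss, grad


rewards = [-1.4, 0.8]      # r_A, r_B
ref_probs = [0.5, 0.5]     # assumed reference policy
logits = [0.0, 0.0]        # policy starts equal to the reference
beta, learning_rate = 0.2, 0.5

loss_before, grad = loss_and_grad(logits, ref_probs, rewards, beta)
logits = [z - learning_rate * g for z, g in zip(logits, grad)]
loss_after, _ = loss_and_grad(logits, ref_probs, rewards, beta)
prob_b_after = softmax(logits)[1]

print(loss_after < loss_before, prob_b_after > 0.5)
```

One step lowers the loss and raises the probability of the B-style response; in a real model the same kind of update is spread across every parameter.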
Reward Modeling
Training a reward model to predict human preferences:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class RewardModel(nn.Module):
    def __init__(self, model_name='gpt2'):
        super().__init__()
        # num_labels=1 turns the classification head into a scalar reward head
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # GPT-2 ships without a pad token, which batched classification requires
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

    def forward(self, input_ids, attention_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits  # Shape: [batch_size, 1]

    def compute_loss(self, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
        # Get rewards for chosen and rejected responses
        chosen_rewards = self.forward(chosen_ids, chosen_mask)
        rejected_rewards = self.forward(rejected_ids, rejected_mask)

        # Compute preference loss (Bradley-Terry model).
        # logsigmoid is numerically safer than log(sigmoid(...)) for large gaps.
        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

        # Log accuracy (how often model assigns higher reward to chosen response)
        accuracy = (chosen_rewards > rejected_rewards).float().mean()

        return loss, accuracy


# Example usage
def train_reward_model(model, dataloader, optimizer, device, epochs=3):
    model.to(device)
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        total_acc = 0.0
        for batch in dataloader:
            chosen_ids = batch['chosen_input_ids'].to(device)
            chosen_mask = batch['chosen_attention_mask'].to(device)
            rejected_ids = batch['rejected_input_ids'].to(device)
            rejected_mask = batch['rejected_attention_mask'].to(device)

            optimizer.zero_grad()
            loss, acc = model.compute_loss(chosen_ids, chosen_mask, rejected_ids, rejected_mask)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            total_acc += acc.item()

        avg_loss = total_loss / len(dataloader)
        avg_acc = total_acc / len(dataloader)
        print(f"Epoch {epoch+1}: Loss = {avg_loss:.4f}, Accuracy = {avg_acc:.4f}")
Proximal Policy Optimization (PPO)
Using PPO to optimize the language model policy:
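import torch
from trl import PPOConfig, PPOTrainer

# Written against the legacy TRL PPOTrainer interface (roughly trl <= 0.11).
# Later releases changed this API, so check the version you have installed.


def train_with_ppo(policy_model, reward_model, tokenizer, dataset, collator, device, epochs=1):
    # policy_model is expected to be a trl AutoModelForCausalLMWithValueHead;
    # reward_model is the RewardModel defined in the previous section.

    # Configure PPO training
    ppo_config = PPOConfig(
        batch_size=8,
        mini_batch_size=1,
        learning_rate=1.41e-5,
        init_kl_coef=0.2,  # KL penalty to prevent divergence from SFT model
        cliprange=0.2,
    )

    # Initialize PPO trainer
    ppo_trainer = PPOTrainer(
        config=ppo_config,
        model=policy_model,
        ref_model=None,  # None: TRL builds a frozen copy of the policy as the KL reference
        tokenizer=tokenizer,
        dataset=dataset,
        data_collator=collator,
    )

    # Training loop
    for epoch in range(epochs):
        for batch in ppo_trainer.dataloader:
            # Generate responses (step() expects lists of 1-D tensors)
            query_tensors = [q.to(device) for q in batch["input_ids"]]
            response_tensors = ppo_trainer.generate(
                query_tensors, return_prompt=False, max_new_tokens=64
            )

            # Compute rewards by scoring prompt + response with the reward model
            batch["query"] = tokenizer.batch_decode(query_tensors)
            batch["response"] = tokenizer.batch_decode(response_tensors)
            texts = [q + r for q, r in zip(batch["query"], batch["response"])]
            inputs = reward_model.tokenizer(
                texts, return_tensors="pt", padding=True, truncation=True
            ).to(device)
            with torch.no_grad():
                scores = reward_model(inputs["input_ids"], inputs["attention_mask"]).squeeze(-1)
            rewards = [score for score in scores]  # one scalar tensor per sample

            # Run PPO step
            stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
            ppo_trainer.log_stats(stats, batch, rewards)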
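The clip range of 0.2 in the configuration above controls PPO's clipped surrogate objective, which the trainer computes internally. The sketch below shows that objective for a single sample in plain Python; `ratio` is the new policy's probability of the sampled action divided by the old policy's, and `advantage` is how much better than expected the action turned out. The function name is illustrative and not part of TRL.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum makes the objective pessimistic: the policy gets no
    # extra credit for moving the ratio beyond the clip range in the
    # direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0): gains are capped once ratio exceeds 1 + clip_range.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Bad action (advantage < 0): raising its probability is still fully penalized.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
# Bad action already made less likely: credit stops at 1 - clip_range.
capped_credit = clipped_surrogate(ratio=0.5, advantage=-1.0)

print(capped_gain, full_penalty, capped_credit)
```

Because the objective stops improving once the ratio leaves the interval [1 - clip_range, 1 + clip_range] in the favorable direction, each update has little incentive to move the policy far from the one that generated the samples.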
Challenges with RLHF
- Reward Hacking: Models learn to exploit reward model weaknesses
- KL Penalty Tuning: Balancing task performance with alignment
- Computational Complexity: PPO requires many model calls
- Stability Issues: Training can be unstable without careful tuning
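One common response to the KL tuning challenge is to adapt the coefficient during training instead of fixing it. The sketch below is modeled on the proportional controller described by Ziegler et al. (2019): the coefficient rises when the measured KL exceeds a target and falls when it is below. The target of 6.0 and horizon of 10,000 are illustrative assumptions, not recommendations.

```python
class AdaptiveKLController:
    """Nudges the KL coefficient so the observed KL tracks a target value."""

    def __init__(self, init_kl_coef: float, target_kl: float, horizon: int):
        self.value = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped so a single noisy batch cannot swing the coefficient.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


controller = AdaptiveKLController(init_kl_coef=0.2, target_kl=6.0, horizon=10_000)
after_high_kl = controller.update(observed_kl=12.0, n_steps=256)  # KL too high: coefficient grows
after_low_kl = controller.update(observed_kl=3.0, n_steps=256)    # KL too low: coefficient shrinks
print(after_high_kl > 0.2, after_low_kl < after_high_kl)
```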
TIP▶ When monitoring RLHF training, plot the reward and KL-divergence curves together. In a healthy run the reward climbs while KL stays bounded; when KL keeps growing, the model is drifting away from the reference policy. That tension is the whole story of RLHF.