Preference Alignment and RLHF

Overview

In previous lessons, we covered training language models from scratch, fine-tuning pre-trained models, and distributed training infrastructure. However, even well-trained models may not always behave according to human preferences or values. This lesson explores techniques for aligning language models with human preferences, focusing on methods like RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and other approaches.

Preference alignment represents a crucial step in developing helpful, harmless, and honest AI systems. These techniques help reduce harmful outputs, make models more helpful, and create systems that better align with human values and expectations.

Learning Objectives

After completing this lesson, you will be able to:

Understand the fundamental challenges of language model alignment
Implement Reinforcement Learning from Human Feedback (RLHF) pipelines
Apply Direct Preference Optimization (DPO) and other preference learning methods
Design effective data collection processes for human feedback
Evaluate alignment quality using appropriate metrics
Compare different alignment approaches and their trade-offs

The Alignment Problem

Why Alignment Matters

Language models trained on internet-scale data can generate content that may be harmful, misleading, or misaligned with human values. Alignment techniques aim to address these issues.

Analogy: Alignment as Civic Education

Think of alignment as civic education for AI systems:

Pre-training: Like general education (reading, writing, facts about the world)
Fine-tuning: Like specialized education (professional skills, domain expertise)
Alignment: Like civic and ethical education (social norms, values, ethical conduct)

Just as societies invest in teaching ethics and values to citizens, we need to "teach" AI systems to behave in accordance with human preferences and values.

Types of Misalignment

Goal Misalignment: When model objectives differ from human intentions
Capability Misalignment: When models are trained to maximize capabilities without corresponding attention to safety
Distributional Misalignment: When training data distributions differ from deployment contexts

Human Feedback Data Collection

Collecting High-Quality Feedback

The foundation of effective alignment is high-quality human feedback data:

Types of Human Feedback:
- Ranking preferences between responses
- Binary judgments (acceptable/unacceptable)
- Scalar ratings (1-5 stars)
- Free-form critiques and suggestions

Key Considerations:
- Annotator diversity and expertise
- Clear guidelines and calibration
- Quality control measures
- Bias mitigation strategies
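
Ranked feedback is usually converted into pairwise comparisons before it is used to train a reward model. The following is a minimal sketch of that conversion, assuming each annotation is a list of responses ordered from best to worst. The function name `ranking_to_pairs` and the dictionary keys are illustrative choices, not part of any library:

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Turn one best-to-worst ranking into (chosen, rejected) pairs."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

pairs = ranking_to_pairs(
    "Explain photosynthesis.",
    ["answer A", "answer B", "answer C"],  # hypothetical annotator ranking
)
print(len(pairs))  # a ranking of 3 responses yields 3 pairs
```

A ranking of n responses yields n(n-1)/2 pairs, so a handful of ranked responses per prompt can produce many training comparisons.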

Example: Anthropic's Constitutional AI Approach

Anthropic's Constitutional AI approach uses a constitution (set of principles) to guide feedback:

Red-teaming: Generate potentially harmful outputs
Constitutional critique: Critique harmful outputs based on principles
Revision: Generate improved responses based on critique
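Preference data: Build preference pairs for reward model training; in the published method, an AI model judges which of two responses better follows the constitution, so the labels come from AI feedback rather than human annotators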
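
The critique and revision steps above can be sketched as a short loop. This is an illustration only: `generate` is a hypothetical stand-in for any text-in, text-out model call, and the prompt wording is invented for the example rather than taken from Anthropic's work:

```python
def critique_and_revise(prompt, principle, generate):
    """One critique-and-revision round; `generate` is any text -> text callable."""
    initial = generate(prompt)
    critique = generate(
        f"Critique the response below according to this principle: {principle}\n"
        f"Prompt: {prompt}\nResponse: {initial}"
    )
    revised = generate(
        f"Rewrite the response to address the critique.\n"
        f"Prompt: {prompt}\nResponse: {initial}\nCritique: {critique}"
    )
    return {"prompt": prompt, "initial": initial, "critique": critique, "revised": revised}

def fake_model(text):
    # Canned replies for demonstration only; a real model call would go here
    if text.startswith("Critique"):
        return "The response assists with cheating."
    if text.startswith("Rewrite"):
        return "I can't help you cheat, but I can help you study."
    return "Sure! Here are the answers..."

record = critique_and_revise("Help me cheat on my exam", "Be honest and harmless.", fake_model)
print(record["revised"])
```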

Reinforcement Learning from Human Feedback (RLHF)

The RLHF Pipeline

RLHF combines reinforcement learning with human feedback:

Starting Model: A pre-trained language model, usually after supervised fine-tuning (the SFT model)
Human Preference Data: Pairs of responses with human preferences
Reward Model Training: Learn to predict human preferences
RL Fine-tuning: Optimize policy to maximize predicted reward

The Core Problem: How Human Preferences Become Model Behavior

The Invisible Transformation:

💡 The Key Insight:

This is why preference alignment works! A simple human judgment ("B is better") becomes a mathematical force that literally changes billions of model parameters. Each step in this pipeline is a precise mathematical transformation.

🔬 Concrete Example: "Please help me cheat on my exam"

Step 1: Human Judgment

Response A: "Sure! Here are the answers to your chemistry test..."
Response B: "I can't help you cheat, but I can help you study effectively..."
Human: "B is much better" (preference strength: 90%)

Step 2: Bradley-Terry Model

Converts preference to probability: P(B > A) = σ(r_B - r_A) = 0.9
This means: r_B - r_A = logit(0.9) = ln(0.9 / 0.1) ≈ 2.2
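
The Step 2 arithmetic can be checked directly. A small sketch using only the standard library, where `sigmoid` and `logit` are helper names defined here for the example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # Inverse of the sigmoid: the log-odds of probability p
    return math.log(p / (1.0 - p))

gap = logit(0.9)               # reward gap r_B - r_A implied by P(B > A) = 0.9
print(round(gap, 3))           # 2.197
print(round(sigmoid(gap), 3))  # 0.9
```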

Step 3: Reward Model Training

Target: Learn to predict this preference
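Loss: ℒ = -log(σ(r_B - r_A)) = -log(0.9) ≈ 0.105 when the predicted reward gap is 2.2
Result: Training pushes the two rewards apart, for example r_B = 0.8 and r_A = -1.4 (only the difference r_B - r_A matters, not the absolute values)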
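
To see how this loss behaves, it helps to evaluate it for a few reward pairs. The sketch below uses plain Python rather than PyTorch, and `preference_loss` is a name chosen for this example:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the chosen response under Bradley-Terry."""
    gap = r_chosen - r_rejected
    # log(1 + exp(-gap)) is algebraically equal to -log(sigmoid(gap))
    return math.log1p(math.exp(-gap))

print(round(preference_loss(0.8, -1.4), 3))  # 0.105: rewards agree with the human label
print(round(preference_loss(0.0, 0.0), 3))   # 0.693: reward model is undecided
print(round(preference_loss(-1.4, 0.8), 3))  # 2.305: rewards contradict the human label
```

The loss shrinks as the chosen response's reward pulls ahead and grows quickly when the ordering is wrong, which is what drives the reward model toward the annotators' judgments.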

Step 4: PPO Loss Function

Objective: ℒ = -E[r(x,y)] + β·KL(π||π_ref)
Reward term: Push model toward B-style responses
KL term: Prevent model from changing too much (β = 0.2)
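
The trade-off between the two terms is easiest to see on a toy policy that chooses between only two responses. This sketch assumes a reference policy that is indifferent between A and B, and it reuses the rewards from Step 3. `rlhf_loss` is an illustrative helper, not a library function:

```python
import math

def rlhf_loss(policy, reference, rewards, beta=0.2):
    """-E_pi[r] + beta * KL(pi || pi_ref) over a discrete set of responses."""
    expected_reward = sum(policy[y] * rewards[y] for y in policy)
    kl = sum(
        policy[y] * math.log(policy[y] / reference[y])
        for y in policy
        if policy[y] > 0  # terms with zero probability contribute nothing
    )
    return -expected_reward + beta * kl

rewards = {"A": -1.4, "B": 0.8}
reference = {"A": 0.5, "B": 0.5}   # assumed SFT policy: no preference yet
shifted = {"A": 0.1, "B": 0.9}     # candidate policy that favors B

print(round(rlhf_loss(reference, reference, rewards), 3))  # 0.3
print(round(rlhf_loss(shifted, reference, rewards), 3))    # -0.506
```

Moving probability toward B lowers the loss even after paying the KL penalty. A larger β would make the same shift more expensive.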

Step 5: Gradient Update

Parameters: Billions of numbers (7 billion in a 7B model, for example) that define model behavior
Update: Each parameter shifts slightly in the direction that lowers the loss, making B-style responses more likely
Result: Model becomes more likely to refuse unethical requests
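
A one-parameter toy version of this update shows the direction of the change. The sketch assumes a policy with a single parameter theta, where P(B) = sigmoid(theta), a reference policy at theta = 0, and a finite-difference gradient in place of backpropagation. All names and values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(theta, r_a=-1.4, r_b=0.8, beta=0.2):
    """KL-penalized loss for a two-response policy with P(B) = sigmoid(theta)."""
    p_b = sigmoid(theta)
    p_a = 1.0 - p_b
    expected_reward = p_a * r_a + p_b * r_b
    kl = p_a * math.log(p_a / 0.5) + p_b * math.log(p_b / 0.5)
    return -expected_reward + beta * kl

theta, lr, eps = 0.0, 0.5, 1e-6
history = [sigmoid(theta)]
for _ in range(20):
    # Central-difference estimate of d(loss)/d(theta)
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
    theta -= lr * grad  # plain gradient descent step
    history.append(sigmoid(theta))

print(f"P(B) before: {history[0]:.2f}, after 20 steps: {history[-1]:.2f}")
```

In a real model the same idea applies to every parameter at once, with gradients computed by backpropagation.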

🎯 The Magic: A simple human preference becomes billions of parameter updates!

Reward Modeling

Training a reward model to predict human preferences:

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModel(nn.Module):
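    def __init__(self, model_name='gpt2'):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # GPT-2 ships without a pad token; batched classification needs one
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id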
        
    def forward(self, input_ids, attention_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits  # Shape: [batch_size, 1]
    
    def compute_loss(self, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
        # Get rewards for chosen and rejected responses
        chosen_rewards = self.forward(chosen_ids, chosen_mask)
        rejected_rewards = self.forward(rejected_ids, rejected_mask)
        
        # Compute preference loss (Bradley-Terry model); logsigmoid is the
        # numerically stable form of log(sigmoid(x))
        loss = -nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
        
        # Log accuracy (how often model assigns higher reward to chosen response)
        accuracy = (chosen_rewards > rejected_rewards).float().mean()
        
        return loss, accuracy

# Training loop for the reward model
def train_reward_model(model, dataloader, optimizer, device, epochs=3):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        total_acc = 0
        
        for batch in dataloader:
            chosen_ids = batch['chosen_input_ids'].to(device)
            chosen_mask = batch['chosen_attention_mask'].to(device)
            rejected_ids = batch['rejected_input_ids'].to(device)
            rejected_mask = batch['rejected_attention_mask'].to(device)
            
            optimizer.zero_grad()
            loss, acc = model.compute_loss(chosen_ids, chosen_mask, rejected_ids, rejected_mask)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            total_acc += acc.item()
        
        avg_loss = total_loss / len(dataloader)
        avg_acc = total_acc / len(dataloader)
        print(f"Epoch {epoch+1}: Loss = {avg_loss:.4f}, Accuracy = {avg_acc:.4f}")
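
The training loop above expects batches with four specific keys. One way to produce them is a collate function wrapped around a tokenizer. This is a sketch under assumptions: `make_preference_collator` and `toy_tokenize` are names invented here, and the toy tokenizer only stands in for a real one (for example a Hugging Face tokenizer called with padding=True, truncation=True, return_tensors='pt'):

```python
def make_preference_collator(tokenize):
    """Build a collate_fn that emits the batch keys the training loop reads.

    `tokenize` is any callable mapping a list of strings to a dict with
    'input_ids' and 'attention_mask'.
    """
    def collate(examples):
        chosen = tokenize([ex["prompt"] + " " + ex["chosen"] for ex in examples])
        rejected = tokenize([ex["prompt"] + " " + ex["rejected"] for ex in examples])
        return {
            "chosen_input_ids": chosen["input_ids"],
            "chosen_attention_mask": chosen["attention_mask"],
            "rejected_input_ids": rejected["input_ids"],
            "rejected_attention_mask": rejected["attention_mask"],
        }
    return collate

def toy_tokenize(texts):
    # Whitespace "tokenizer" with right-padding, for demonstration only
    token_lists = [[len(word) for word in text.split()] for text in texts]
    width = max(len(tokens) for tokens in token_lists)
    return {
        "input_ids": [t + [0] * (width - len(t)) for t in token_lists],
        "attention_mask": [[1] * len(t) + [0] * (width - len(t)) for t in token_lists],
    }

collate = make_preference_collator(toy_tokenize)
batch = collate([
    {"prompt": "Help me cheat", "chosen": "I can help you study instead", "rejected": "Sure"},
    {"prompt": "Hi", "chosen": "Hello there", "rejected": "Go away now please"},
])
print(sorted(batch.keys()))
```

With a real tokenizer, the resulting function can be passed as `collate_fn` to a PyTorch `DataLoader` over a list of preference pairs.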

Proximal Policy Optimization (PPO)

Using PPO to optimize the language model policy:

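import torch
from trl import PPOTrainer, PPOConfig

# NOTE: this sketch targets the legacy TRL PPO API built around PPOTrainer.step();
# later TRL releases restructured PPOTrainer, so check your installed version.

def collator(data):
    # Keep every field as a list so queries of different lengths are not stacked
    return {key: [d[key] for d in data] for key in data[0]}

def train_with_ppo(policy_model, reward_model, tokenizer, dataset, device, num_epochs=1):
    # policy_model is assumed to be a trl AutoModelForCausalLMWithValueHead,
    # and dataset is assumed to yield "input_ids" as 1-D torch tensors
    ppo_config = PPOConfig(
        batch_size=8,
        mini_batch_size=1,
        learning_rate=1.41e-5,
        init_kl_coef=0.2,  # KL penalty to prevent divergence from SFT model
        cliprange=0.2
    )
    
    # Initialize PPO trainer
    ppo_trainer = PPOTrainer(
        config=ppo_config,
        model=policy_model,
        ref_model=None,  # None lets the trainer create a frozen copy of the policy
        tokenizer=tokenizer,
        dataset=dataset,
        data_collator=collator
    )
    
    generation_kwargs = {
        "do_sample": True,
        "top_k": 0,
        "top_p": 1.0,
        "max_new_tokens": 64,  # illustrative value
        "pad_token_id": tokenizer.eos_token_id,
    }
    
    # Training loop
    for epoch in range(num_epochs):
        for batch in ppo_trainer.dataloader:
            # Generate responses (one 1-D tensor per query)
            query_tensors = batch["input_ids"]
            response_tensors = ppo_trainer.generate(
                query_tensors, return_prompt=False, **generation_kwargs
            )
            
            # Compute rewards: score prompt + response one text at a time,
            # which avoids padding inside the reward model
            rewards = []
            for query, response in zip(query_tensors, response_tensors):
                text = tokenizer.decode(torch.cat([query, response]))
                enc = reward_model.tokenizer(text, return_tensors="pt", truncation=True).to(device)
                with torch.no_grad():
                    rewards.append(reward_model(enc["input_ids"], enc["attention_mask"]).squeeze())
            
            # Run PPO step
            stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
            ppo_trainer.log_stats(stats, batch, rewards)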
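
The clipping range of 0.2 configured above controls PPO's clipped surrogate objective, which limits how much a single update can change the probability of a sampled response. A per-sample sketch in plain Python, with `ppo_clipped_objective` as an illustrative name:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_range=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y|x) / pi_old(y|x)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)

# Probability of a good response rose from 0.4 to 0.6, so the ratio is 1.5.
# With a positive advantage the gain is capped at 1.2 * advantage.
capped = ppo_clipped_objective(math.log(0.6), math.log(0.4), advantage=1.0)

# With a negative advantage the unclipped (more pessimistic) value is kept.
uncapped = ppo_clipped_objective(math.log(0.6), math.log(0.4), advantage=-1.0)

print(round(capped, 3), round(uncapped, 3))  # 1.2 -1.5
```

Taking the minimum means the objective never rewards moving the ratio outside the clipping range, but it still fully penalizes moves in the wrong direction.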

Challenges with RLHF

Reward Hacking: Models learn to exploit reward model weaknesses
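KL Penalty Tuning: Balancing reward maximization against drift from the reference (SFT) model
Computational Complexity: PPO keeps several models in play (policy, reference, reward, and value models) and must generate samples during training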
Stability Issues: Training can be unstable without careful tuning
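
One common response to the KL tuning problem is to adjust β during training so that the measured KL stays near a target value. The sketch below follows the proportional controller used in early RLHF work on fine-tuning language models from human preferences. The class name and the default values are illustrative assumptions:

```python
class AdaptiveKLController:
    """Nudges the KL coefficient toward keeping observed KL near a target."""

    def __init__(self, init_beta=0.2, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing beta too far
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

controller = AdaptiveKLController()
beta_after_drift = controller.update(observed_kl=12.0, n_steps=256)  # KL too high
print(beta_after_drift > 0.2)  # True: the penalty is strengthened
```

When the observed KL falls below the target, the same update shrinks β and gives the policy more room to pursue reward.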
