Overview
When language models predict the next token, they output probability distributions over the entire vocabulary. But how do we convert these probabilities into actual text? This lesson explores deterministic generation methods—approaches that make predictable, reproducible choices about which tokens to select from transformer-based language models.
Understanding these foundational methods is crucial before we explore more creative sampling techniques. Think of this as learning to walk before we run: mastering predictable text generation gives us the foundation to understand when and why we might want to introduce randomness in language model outputs.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the fundamental challenge of converting probabilities to text
- Implement and explain greedy search generation
- Implement and explain beam search generation
- Compare the trade-offs between different deterministic approaches
- Choose the right method for specific use cases
- Debug common issues in deterministic generation
The Core Challenge: From Probabilities to Words
The Decision Point
Every time a language model generates text, it faces the same fundamental challenge at each step:
```
Model output: [0.4, 0.3, 0.15, 0.1, 0.05, ...]
Vocabulary:   ["the", "a", "an", "this", "that", ...]
Decision:     Which word do we actually choose?
```
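At its simplest, the decision reduces to picking an index from the distribution. A minimal sketch with an illustrative (made-up) vocabulary and distribution:

```python
# Illustrative only: a toy distribution and vocabulary, not real model output.
probs = [0.4, 0.3, 0.15, 0.1, 0.05]
vocab = ["the", "a", "an", "this", "that"]

# The greedy answer to "which word?" is simply the argmax of the distribution.
best_index = max(range(len(probs)), key=lambda i: probs[i])
print(vocab[best_index])  # -> "the"
```

Every generation strategy in this lesson is, at heart, a different policy for making this one choice repeatedly.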
Mental Model: The Path Through a Forest
Imagine text generation as walking through a dense forest where every step forward reveals multiple possible paths:
- Starting point: Your prompt (the trailhead)
- Each step: Choosing the next word
- Path options: All possible next words, each with a "difficulty score" (inverse of probability)
- Goal: Reach a meaningful destination (complete text)
Different generation strategies represent different hiking philosophies:
- Greedy: Always take the easiest visible path
- Beam Search: Scout ahead on multiple promising paths, then choose the best overall route
Greedy Search: The Simplest Approach
Core Concept
Greedy search always chooses the most probable next token. It's called "greedy" because it makes locally optimal choices without considering future consequences.
Algorithm in Plain English:
- Look at all possible next words
- Pick the one with the highest probability
- Add it to your text
- Repeat until you decide to stop
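The four steps above can be sketched as a short loop. To keep the example self-contained, the "model" here is just an illustrative dictionary mapping the current word to a next-word distribution, standing in for a real language model:

```python
# Toy stand-in for a language model: current word -> next-word distribution.
# All words and probabilities are illustrative.
toy_model = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<end>": 0.2},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
    "dog": {"<end>": 1.0},
}

def greedy_generate(model, max_steps=10):
    tokens = ["<start>"]
    for _ in range(max_steps):
        dist = model[tokens[-1]]
        # Step 1-2: look at all candidates, pick the highest-probability one.
        next_token = max(dist, key=dist.get)
        if next_token == "<end>":  # Step 4: stop condition
            break
        tokens.append(next_token)  # Step 3: add it to the text
    return tokens[1:]

print(greedy_generate(toy_model))  # -> ['the', 'cat', 'sat']
```

Note that at the "the" step the model commits to "cat" (0.5) and never reconsiders "dog", even if "dog" would have led to a better overall sentence; this is exactly the local-optimum weakness beam search addresses.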
Beam Search: Exploring Multiple Paths
Instead of committing to the single best token at every step, beam search keeps the top `beam_width` candidate sequences ("beams") and extends each of them in parallel, pruning back to the best `beam_width` after every step.
Algorithm Walkthrough
Let's trace through beam search step by step with a beam width of 3:
Step 1: Start with prompt
```
Prompt: "The future of AI"
Beams:  ["The future of AI"]
```
Step 2: Generate first token
```
Top candidates: "is", "will", "depends"
Beams: ["The future of AI is",
        "The future of AI will",
        "The future of AI depends"]
```
Step 3: Generate second token (from each beam)
```
From "...is":      "bright", "uncertain", "promising"
From "...will":    "be", "depend", "involve"
From "...depends": "on", "heavily", "largely"

Keep top 3 overall:
1. "The future of AI is bright"
2. "The future of AI will be"
3. "The future of AI depends on"
```
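The "keep top 3 overall" step ranks beams by cumulative score. Because multiplying many probabilities quickly underflows toward zero, implementations sum log-probabilities instead, which preserves the ranking. A quick sketch with illustrative numbers:

```python
import math

# Illustrative per-step probabilities for one beam, e.g. "is" -> "bright" -> ...
token_probs = [0.4, 0.5, 0.7]

# Multiplying probabilities and summing log-probabilities rank beams identically,
# but the log form stays numerically stable on long sequences.
product = 1.0
for p in token_probs:
    product *= p
score = sum(math.log(p) for p in token_probs)

print(round(product, 3))  # joint probability of the beam
print(round(score, 3))    # cumulative log-probability (always <= 0)
```

This cumulative log-probability is exactly the `score` tracked per beam in the implementation below.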
Python Implementation
```python
import torch

def beam_search(model, tokenizer, prompt, beam_width=3, max_length=20):
    """
    Simplified beam search implementation for learning purposes.

    Args:
        beam_width: Number of sequences to track simultaneously
        max_length: Maximum tokens to generate
    """
    # Start with the prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Each beam: (token_sequence, cumulative_score)
    beams = [(input_ids[0].tolist(), 0.0)]

    for step in range(max_length):
        all_candidates = []

        # For each current beam, generate next-token candidates
        for tokens, score in beams:
            # Get model predictions for the next token
            with torch.no_grad():
                outputs = model(input_ids=torch.tensor([tokens]))
            probs = torch.softmax(outputs.logits[0, -1, :], dim=-1)

            # Get top candidates for this beam
            top_probs, top_indices = torch.topk(probs, beam_width)

            # Create new candidate sequences
            for prob, token_id in zip(top_probs, top_indices):
                new_tokens = tokens + [token_id.item()]
                new_score = score + torch.log(prob).item()
                all_candidates.append((new_tokens, new_score))

        # Keep only the best beam_width candidates overall
        all_candidates.sort(key=lambda x: x[1], reverse=True)
        beams = all_candidates[:beam_width]

    # Return the best sequence
    best_tokens, _ = beams[0]
    return tokenizer.decode(best_tokens)

# Example usage
result = beam_search(model, tokenizer, "The future of AI", beam_width=3)
print(result)
```
Beam Search Advantages
✅ Better than greedy because:
- Avoids local optima: Considers multiple paths before deciding
- Higher quality output: Often more coherent and fluent
- Still deterministic: Same input always gives same output
- Configurable exploration: Adjust beam width for your needs
Comparison example:
```
Prompt: "The solution to climate change requires"

Greedy:         "a global effort to reduce carbon emissions."
Beam (width=5): "coordinated international cooperation, technological
                 innovation, and fundamental changes in how we produce
                 and consume energy."
```
Beam Search Limitations
❌ Still has issues:
- Computational cost: Scales with beam width; beam width 5 is roughly 5x more expensive than greedy
- Generic output: Tends toward "safe" but boring completions
- Length bias: Favors shorter sequences, since every additional token multiplies in another probability below 1. Mitigate with a length penalty greater than 1.0 (e.g., `length_penalty` in Hugging Face `generate`)
- Repetition: Can still repeat n-grams. Mitigate with n-gram blocking (e.g., `no_repeat_ngram_size`)
- Still deterministic: No creativity or surprise
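The length-bias issue above is easy to see in a few lines. A common mitigation is to divide each beam's summed log-probability by its length raised to a penalty exponent; the `alpha` exponent and log-probabilities below are illustrative, not from a real model:

```python
# Sketch of length normalization for beam scores. With alpha=1.0, a beam's
# score becomes its average per-token log-probability.
def normalized_score(logprobs, alpha=1.0):
    return sum(logprobs) / (len(logprobs) ** alpha)

short = [-0.2, -0.2]             # 2 tokens, raw sum -0.4
long = [-0.2, -0.2, -0.2, -0.2]  # 4 tokens, raw sum -0.8

# Raw sums favor the short beam even though both are equally "confident"
# per token; normalized scores treat them equally.
print(sum(short), sum(long))
print(normalized_score(short), normalized_score(long))
```

Raising `alpha` above 1.0 tilts the comparison further toward longer sequences, which is why length penalties greater than 1.0 counteract beam search's preference for short output.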
Beam Width Selection Guide
| Task Type | Recommended Beam Width | Reasoning |
|---|---|---|
| Translation | 4-6 | Balance quality and computation |
| Summarization | 3-5 | Want coherent but not too generic |
| Question Answering | 1-3 | Usually one correct answer |
| Creative Writing | Not recommended | Too conservative |
| Code Generation | 2-4 | Syntax correctness important |