Text Generation: Deterministic Methods

Overview

When language models predict the next token, they output probability distributions over the entire vocabulary. But how do we convert these probabilities into actual text? This lesson explores deterministic generation methods—approaches that make predictable, reproducible choices about which tokens to select from transformer-based language models.

Understanding these foundational methods is crucial before we explore more creative sampling techniques. Think of this as learning to walk before we run: mastering predictable text generation gives us the foundation to understand when and why we might want to introduce randomness in language model outputs.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the fundamental challenge of converting probabilities to text
  • Implement and explain greedy search generation
  • Implement and explain beam search generation
  • Compare the trade-offs between different deterministic approaches
  • Choose the right method for specific use cases
  • Debug common issues in deterministic generation

The Core Challenge: From Probabilities to Words

The Decision Point

Every time a language model generates text, it faces the same fundamental challenge at each step:

Model output: [0.4, 0.3, 0.15, 0.1, 0.05, ...]
Vocabulary:   ["the", "a", "an", "this", "that", ...]
Decision:     Which word do we actually choose?
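
To make the decision point concrete, here is a minimal sketch using the toy numbers above; the five-word vocabulary and the probabilities are illustrative, not real model output.

import torch

# Toy values matching the example above -- not real model output
probs = torch.tensor([0.4, 0.3, 0.15, 0.1, 0.05])
vocab = ["the", "a", "an", "this", "that"]

# One possible decision rule: always take the highest-probability token
choice = vocab[torch.argmax(probs).item()]
print(choice)  # -> "the"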

Mental Model: The Path Through a Forest

Imagine text generation as walking through a dense forest where every step forward reveals multiple possible paths:

  • Starting point: Your prompt (the trailhead)
  • Each step: Choosing the next word
  • Path options: All possible next words, each with a "difficulty score" (inverse of probability)
  • Goal: Reach a meaningful destination (complete text)

Different generation strategies represent different hiking philosophies:

  • Greedy: Always take the easiest visible path
  • Beam Search: Scout ahead on multiple promising paths, then choose the best overall route

Greedy Search: The Simplest Approach

Core Concept

Greedy search always chooses the most probable next token. It's called "greedy" because it makes locally optimal choices without considering future consequences.
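
A small worked example, with made-up numbers, shows how locally optimal choices can miss the globally best sequence:

# Hypothetical two-step probabilities (illustrative, not from a real model):
#   Step 1: token A has p = 0.5, token B has p = 0.4
#   Step 2: the best continuation after A has p = 0.1;
#           the best continuation after B has p = 0.9

greedy_path = 0.5 * 0.1  # greedy commits to A first -> joint probability 0.05
better_path = 0.4 * 0.9  # starting with B instead   -> joint probability 0.36

print(greedy_path < better_path)  # True: greedy misses the better sequence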

Algorithm in Plain English:

  1. Look at all possible next words
  2. Pick the one with the highest probability
  3. Add it to your text
  4. Repeat until you decide to stop

Python Implementation

import torch

def greedy_search(model, tokenizer, prompt, max_length=50):
    """
    Generate text using greedy search (always pick the most likely token).

    Args:
        model: The language model
        tokenizer: Tokenizer for encoding/decoding
        prompt: Starting text
        max_length: Maximum number of tokens to generate
    """
    # Convert the prompt to token IDs
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for step in range(max_length):
        # Get model predictions for the next token
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :]  # Logits at the last position

        # Choose the token with the highest probability
        next_token_id = torch.argmax(next_token_logits).item()

        # Add the chosen token to the sequence
        generated.append(next_token_id)

        # Stop if we generate the end-of-sequence token
        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)

# Example usage
prompt = "The most important breakthrough in AI is"
result = greedy_search(model, tokenizer, prompt)
print(f"Input: {prompt}")
print(f"Output: {result}")

When Greedy Search Works Well

✅ Excellent for:

  • Factual question answering: "What is the capital of France?" → "Paris"
  • Short completions: Where there's usually one best answer
  • Translation of common phrases: Well-established translations
  • Code completion: Where syntax correctness matters most

Example outputs where greedy excels:

Input: "The chemical symbol for water is" Greedy: "H2O, consisting of two hydrogen atoms and one oxygen atom." ✅ Perfect! Factual and correct. Input: "To install numpy, run" Greedy: "pip install numpy" ✅ Exactly what we want!

When Greedy Search Fails

❌ Problems arise with:

  • Creative writing: Produces boring, predictable text
  • Long text generation: Gets stuck in repetitive loops
  • Open-ended questions: Gives generic responses

Example where greedy fails:

Input: "Once upon a time, in a magical forest" Greedy: "there was a little girl who was walking through the forest. She was walking through the forest when she saw a little girl walking through the forest..." ❌ Repetitive and boring!

The Repetition Problem

Greedy search often gets trapped in loops because:

  1. Common words have high probability
  2. Once generated, they increase probability of similar patterns
  3. No mechanism to avoid repetition

Demonstration:

# This often happens with greedy search
prompt = "I think that"
result = greedy_search(model, tokenizer, prompt, max_length=30)
print(result)
# Greedy might produce: "I think that I think that I think that..."

Beam Search: Exploring Multiple Paths

Core Concept

Beam search improves on greedy search by considering multiple possible sequences simultaneously. Instead of committing to one path, it keeps track of the top-K most promising sequences (called "beams").

Key Innovation: Look ahead before making final decisions.

The Beam Width Parameter

  • Beam width = 1: Equivalent to greedy search
  • Beam width = 3: Keep track of 3 best sequences
  • Beam width = 5: Keep track of 5 best sequences
  • Higher beam width: More exploration, more computation
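
As a quick sketch of what this parameter changes in practice, the Hugging Face pipeline set up later in this lesson exposes the beam width as num_beams; the prompt and widths here are illustrative.

# Assumes the `generator` pipeline and `tokenizer` from the
# "Practical Implementation with Hugging Face" section below
for width in [1, 3, 5]:  # num_beams=1 is exactly greedy search
    output = generator(
        "The future of AI",
        max_length=30,
        num_beams=width,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"num_beams={width}: {output[0]['generated_text']}")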

Algorithm Walkthrough

Let's trace through beam search step by step:

Step 1: Start with prompt

Prompt: "The future of AI" Beams: ["The future of AI"]

Step 2: Generate first token

Top candidates: "is", "will", "depends"
Beams: ["The future of AI is",
        "The future of AI will",
        "The future of AI depends"]

Step 3: Generate second token (from each beam)

From "...is": "bright", "uncertain", "promising" From "...will": "be", "depend", "involve" From "...depends": "on", "heavily", "largely" Keep top 3 overall: 1. "The future of AI is bright" 2. "The future of AI will be" 3. "The future of AI depends on"

Python Implementation

import torch

def beam_search(model, tokenizer, prompt, beam_width=3, max_length=20):
    """
    Simplified beam search implementation for learning purposes.

    Args:
        beam_width: Number of sequences to track simultaneously
        max_length: Maximum number of tokens to generate
    """
    # Start with the prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Each beam: (token_sequence, cumulative_score)
    beams = [(input_ids[0].tolist(), 0.0)]

    for step in range(max_length):
        all_candidates = []

        # For each current beam, generate next-token candidates
        for tokens, score in beams:
            # Get model predictions for the next token
            outputs = model(input_ids=torch.tensor([tokens]))
            probs = torch.softmax(outputs.logits[0, -1, :], dim=-1)

            # Get the top candidates for this beam
            top_probs, top_indices = torch.topk(probs, beam_width)

            # Create new candidate sequences
            for prob, token_id in zip(top_probs, top_indices):
                new_tokens = tokens + [token_id.item()]
                # Scores are summed log-probabilities (higher is better)
                new_score = score + torch.log(prob).item()
                all_candidates.append((new_tokens, new_score))

        # Keep only the best beam_width candidates overall
        all_candidates.sort(key=lambda x: x[1], reverse=True)
        beams = all_candidates[:beam_width]

    # Return the best sequence
    best_tokens, _ = beams[0]
    return tokenizer.decode(best_tokens)

# Example usage
result = beam_search(model, tokenizer, "The future of AI", beam_width=3)
print(result)

Beam Search Advantages

✅ Better than greedy because:

  • Avoids local optima: Considers multiple paths before deciding
  • Higher quality output: Often more coherent and fluent
  • Still deterministic: Same input always gives same output
  • Configurable exploration: Adjust beam width for your needs

Comparison example:

Prompt: "The solution to climate change requires" Greedy: "a global effort to reduce carbon emissions." Beam (width=5): "coordinated international cooperation, technological innovation, and fundamental changes in how we produce and consume energy."

Beam Search Limitations

❌ Still has issues:

  • Computational cost: Roughly beam-width times more expensive than greedy (5x for beam width 5)
  • Generic output: Tends toward "safe" but boring completions
  • Length bias: Favors shorter sequences, since every additional token lowers the cumulative probability. Mitigate with a length penalty > 1.0.
  • Repetition: Can still repeat n-grams. Mitigate with no_repeat_ngram_size (see the parameters below).
  • Still deterministic: No creativity or surprise

Beam Width Selection Guide

Task Type          | Recommended Beam Width | Reasoning
-------------------|------------------------|----------------------------------
Translation        | 4-6                    | Balance quality and computation
Summarization      | 3-5                    | Want coherent but not too generic
Question Answering | 1-3                    | Usually one correct answer
Creative Writing   | Not recommended        | Too conservative
Code Generation    | 2-4                    | Syntax correctness important

Direct Comparison: Greedy vs. Beam Search

Quality Metrics Comparison

Aspect       | Greedy Search | Beam Search (width=5)
-------------|---------------|----------------------
Fluency      | Good          | Very Good
Coherence    | Good          | Very Good
Creativity   | Poor          | Poor
Consistency  | High          | High
Speed        | Fast          | 5x Slower
Memory Usage | Low           | 5x Higher

When to Choose Each Method

Choose Greedy Search when:

  • Speed is critical
  • Memory is limited
  • Task has clear "correct" answers
  • Generating short completions
  • Prototyping or debugging

Choose Beam Search when:

  • Quality is more important than speed
  • Task benefits from planning ahead (translation, summarization)
  • You have computational resources
  • Output length is moderate (20-100 tokens)
  • Serving fewer requests but need higher quality

Practical Implementation with Hugging Face

Modern libraries make these methods easy to use:

from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer

# Setup
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Greedy search (do_sample=False, num_beams=1)
greedy_output = generator(
    "The future of renewable energy",
    max_length=50,
    num_return_sequences=1,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

# Beam search (do_sample=False, num_beams > 1)
beam_output = generator(
    "The future of renewable energy",
    max_length=50,
    num_beams=5,
    num_return_sequences=1,
    do_sample=False,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)

print("Greedy:", greedy_output[0]['generated_text'])
print("Beam:", beam_output[0]['generated_text'])

Key Parameters Explained

  • do_sample=False: Use deterministic methods (greedy/beam)
  • num_beams=1: Greedy search
  • num_beams>1: Beam search with specified width
  • early_stopping=True: End beam search as soon as num_beams complete candidates have been found
  • num_return_sequences: How many different outputs to return
  • length_penalty: >1.0 encourages longer outputs; <1.0 prefers shorter
  • no_repeat_ngram_size: Prevents repeating any n-gram of the given size
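
The pipeline example above leaves the last two parameters out. Below is a minimal sketch that calls model.generate directly with them, reusing the model and tokenizer defined above; the specific values (length_penalty=1.2, no_repeat_ngram_size=3) are illustrative, not recommendations.

input_ids = tokenizer.encode("The future of renewable energy", return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    do_sample=False,
    early_stopping=True,
    length_penalty=1.2,      # > 1.0 nudges beam search toward longer outputs
    no_repeat_ngram_size=3,  # forbid repeating any 3-gram
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))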

Common Issues and Debugging

Problem 1: Repetitive Output

# Symptoms: "I think that I think that I think..." # Solutions: - Try beam search instead of greedy - Add length constraints - Use repetition penalty (covered in next lesson)

Problem 2: Truncated Output

# Symptoms: Output stops too early
# Solutions:
#   - Increase the max_length parameter
#   - Check for unexpected EOS tokens
#   - Verify tokenizer settings

Problem 3: Generic Output

# Symptoms: Boring, predictable text
# Solutions:
#   - This is expected with deterministic methods
#   - Consider probabilistic sampling (next lesson)
#   - Adjust beam width

Problem 4: High Memory Usage

# Symptoms: Out-of-memory errors with beam search
# Solutions:
#   - Reduce beam width
#   - Process shorter sequences
#   - Run generation under torch.no_grad() so no gradients are stored
#   - Switch to greedy for prototyping

Summary and Next Steps

What We've Learned

  1. Greedy Search: Simple, fast, but can get stuck in loops
  2. Beam Search: Better quality through exploration, but more expensive
  3. Trade-offs: Speed vs. quality, determinism vs. creativity
  4. Use Cases: When to choose each method

The Limitation of Deterministic Methods

Both greedy and beam search share a fundamental limitation: they're too conservative. They always choose the "safe" options, leading to:

  • Predictable, sometimes boring text
  • Lack of creativity and surprise
  • Difficulty with open-ended creative tasks

What's Next

In our next lesson, we'll explore probabilistic sampling techniques that introduce controlled randomness:

  • Temperature sampling: Adding creativity while maintaining quality
  • Top-k sampling: Limiting choices to reasonable options
  • Nucleus (top-p) sampling: Dynamically adjusting choice sets
  • Combining techniques: Building production-ready systems

These methods will give us the tools to generate more interesting, creative, and diverse text while still maintaining quality and coherence.

Practice Exercises

Exercise 1: Implementation Challenge

Implement both greedy and beam search from scratch (without using Hugging Face's built-in methods). Compare your results with the library versions.

Exercise 2: Parameter Exploration

Test beam search with different beam widths (1, 3, 5, 10) on the same prompt. Analyze how beam width affects:

  • Output quality
  • Generation time
  • Memory usage

Exercise 3: Use Case Analysis

For each scenario below, decide whether to use greedy search or beam search and justify your choice:

  1. Real-time chatbot responses
  2. Academic paper summarization
  3. Code completion in an IDE
  4. News article generation
  5. Poetry writing assistant

Exercise 4: Debugging Practice

Given these problematic outputs, identify the likely issue and propose solutions:

  1. "The the the the the the..."
  2. Output stops after just 3 tokens
  3. Very generic, dictionary-like responses
  4. Out of memory errors

Additional Resources