Overview
In our previous lesson, we mastered deterministic generation methods—greedy search and beam search. These techniques are excellent for tasks requiring consistency and correctness, but they share a fundamental limitation when generating text from language models: they're too conservative.
When we want creative, diverse, or surprising text generation from transformer models, we need to introduce controlled randomness. This lesson explores probabilistic sampling techniques that balance creativity with quality, giving language models the ability to produce varied, interesting outputs while maintaining coherence.
Think of this as the difference between a conversation with a very knowledgeable but predictable expert versus one with a creative, thoughtful friend who surprises you with interesting perspectives.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why randomness improves text generation
- Implement and tune temperature sampling for creativity control
- Use top-k sampling to limit choice sets intelligently
- Apply nucleus (top-p) sampling for dynamic token selection
- Combine multiple techniques for production-ready systems
- Debug and optimize sampling parameters for different use cases
- Handle common issues like repetition and incoherence
The Case for Controlled Randomness
Why Perfect Predictions Aren't Perfect
Deterministic methods optimize for likelihood—they choose what's most probable given the training data. But the most probable text isn't always the most:
- Interesting: "The weather is nice" vs. "The crimson sunset painted the horizon"
- Useful: Generic responses vs. specific, tailored answers
- Human-like: Robotic predictability vs. natural variation
The Exploration-Exploitation Balance
Every text generation step involves a fundamental trade-off:
Real-world analogy: Choosing a restaurant
- Exploitation: Always go to your proven favorite
- Exploration: Try completely random new places
- Smart sampling: Try highly-rated new places in genres you like
Temperature Sampling: The Creativity Dial
Core Concept
Temperature sampling modifies the probability distribution before sampling, controlling how "sharp" or "flat" the distribution becomes.
Mathematical formulation:
$$P(x_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$

Where:
- $z_i$ = original logit for token $i$
- $T$ = temperature parameter
- Lower $T$ → more focused (sharper distribution)
- Higher $T$ → more random (flatter distribution)
Temperature Effects Visualization
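A quick, self-contained sketch shows the effect numerically. The five toy logits below are invented purely for illustration, not taken from a real model:

```python
import torch

# Toy logits for five hypothetical tokens (made-up values for illustration)
logits = torch.tensor([3.0, 2.5, 1.0, 0.5, -1.0])

for temperature in [0.3, 0.7, 1.0, 1.5]:
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Lower T concentrates probability mass on the top token (sharper distribution);
# higher T spreads it across more tokens (flatter distribution).
```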
Understanding Temperature Values
| Temperature | Effect | Use Cases | Example Output Style |
|---|---|---|---|
| 0.1-0.3 | Very focused, almost deterministic | Factual Q&A, technical writing | "Solar panels convert sunlight into electricity through photovoltaic cells." |
| 0.5-0.8 | Balanced creativity and coherence | General content, articles | "Solar technology represents a paradigm shift toward sustainable energy solutions." |
| 0.9-1.2 | Creative and diverse | Creative writing, brainstorming | "Sunlight dances across crystalline surfaces, awakening electrons in their silicon dreams." |
| 1.5+ | Highly creative, potentially incoherent | Experimental art, poetry | "Quantum photons whisper secrets to semiconducting consciousness, birthing energy..." |
Python Implementation
```python
import torch

def temperature_sampling(model, tokenizer, prompt, temperature=0.7, max_length=50):
    """
    Generate text using temperature sampling.

    Args:
        temperature: Controls randomness (lower = more focused, higher = more random)
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        # Get model predictions
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :]

        # Apply temperature scaling
        scaled_logits = next_token_logits / temperature

        # Convert to probabilities
        probs = torch.nn.functional.softmax(scaled_logits, dim=-1)

        # Sample from the distribution
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        generated.append(next_token_id)

        # Stop if we generate the end-of-sequence token
        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)


# Example: Different temperatures for creative writing
prompt = "In a world where dreams become reality"

creative_output = temperature_sampling(model, tokenizer, prompt, temperature=1.1)
balanced_output = temperature_sampling(model, tokenizer, prompt, temperature=0.7)
focused_output = temperature_sampling(model, tokenizer, prompt, temperature=0.3)

print(f"Creative (T=1.1): {creative_output}")
print(f"Balanced (T=0.7): {balanced_output}")
print(f"Focused (T=0.3): {focused_output}")
```
Temperature Tuning Guidelines
For different content types:
```python
# Recommended temperature ranges
TEMPERATURE_GUIDES = {
    "factual_qa": 0.1,        # Want precise, correct answers
    "technical_docs": 0.3,    # Clear, accurate explanations
    "news_articles": 0.5,     # Professional but not robotic
    "blog_posts": 0.7,        # Engaging and personable
    "creative_writing": 0.9,  # Original and surprising
    "poetry": 1.2,            # Highly creative and artistic
    "brainstorming": 1.5,     # Maximum idea diversity
}
```
Top-K Sampling: Intelligent Choice Limitation
Core Concept
Top-K sampling addresses a key problem with temperature sampling: even with low temperature, there's still a small chance of selecting very inappropriate tokens. Top-K limits the choice to only the K most likely tokens.
Algorithm:
- Get probability distribution from model
- Select only the top-K most likely tokens
- Renormalize probabilities among these K tokens
- Sample from this reduced distribution (optionally with temperature)
Visualization: Top-K Filtering Effect
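To see the filtering effect concretely, here is a small sketch (again with invented logits) that keeps only the top 3 tokens and renormalizes:

```python
import torch

logits = torch.tensor([3.0, 2.5, 1.0, 0.5, -1.0, -2.0])  # hypothetical logits
k = 3

# Keep only the k highest logits; everything else becomes -inf (probability 0)
top_k_logits, top_k_indices = torch.topk(logits, k)
filtered = torch.full_like(logits, float('-inf'))
filtered.scatter_(0, top_k_indices, top_k_logits)

print("Original probabilities:", [round(p, 3) for p in torch.softmax(logits, dim=-1).tolist()])
print("Top-3 probabilities:   ", [round(p, 3) for p in torch.softmax(filtered, dim=-1).tolist()])
# Tokens outside the top 3 get probability 0; the remaining mass is renormalized.
```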
Python Implementation
```python
def top_k_sampling(model, tokenizer, prompt, k=50, temperature=1.0, max_length=50):
    """
    Generate text using top-k sampling.

    Args:
        k: Number of top tokens to consider
        temperature: Temperature scaling (applied before top-k filtering)
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :]

        # Apply temperature
        scaled_logits = next_token_logits / temperature

        # Get top-k tokens and their logits
        top_k_logits, top_k_indices = torch.topk(scaled_logits, k)

        # Create a filtered distribution (set non-top-k logits to -inf)
        filtered_logits = torch.full_like(scaled_logits, float('-inf'))
        filtered_logits.scatter_(0, top_k_indices, top_k_logits)

        # Convert to probabilities and sample
        probs = torch.nn.functional.softmax(filtered_logits, dim=-1)
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        generated.append(next_token_id)

        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)
```
Choosing K Values
| K Value | Effect | Best For | Reasoning |
|---|---|---|---|
| 10-20 | Very constrained | Technical writing, Q&A | Only most confident predictions |
| 30-50 | Balanced filtering | General content creation | Good quality-diversity balance |
| 80-100 | Light filtering | Creative writing | Removes only clearly bad options |
| 200+ | Minimal effect | When you trust the model | Mostly preserves original distribution |
Top-K vs. Temperature Trade-offs
```python
# Comparison of different approaches
examples = [
    {
        "method": "Pure temperature",
        "params": {"temperature": 0.8},
        "pros": ["Simple", "Smooth control"],
        "cons": ["Can select very low-probability tokens"],
    },
    {
        "method": "Pure top-k",
        "params": {"k": 50, "temperature": 1.0},
        "pros": ["Prevents bad tokens", "Consistent quality"],
        "cons": ["Hard cutoff can be arbitrary"],
    },
    {
        "method": "Combined",
        "params": {"k": 50, "temperature": 0.8},
        "pros": ["Best of both worlds", "Production-ready"],
        "cons": ["More parameters to tune"],
    },
]
```
Nucleus (Top-P) Sampling: Dynamic Choice Sets
Core Concept
Nucleus sampling (also called top-p sampling) addresses a key limitation of top-k: different contexts require different numbers of reasonable choices.
Key insight: Instead of a fixed number of tokens, select the smallest set of tokens whose cumulative probability exceeds threshold p.
Algorithm:
- Sort tokens by probability (descending)
- Find the smallest set where cumulative probability ≥ p
- Renormalize probabilities within this "nucleus"
- Sample from the nucleus
Why Nucleus Sampling is Revolutionary
Context-adaptive selection:
- Confident predictions: Nucleus might contain only 5-10 tokens
- Uncertain predictions: Nucleus might contain 100+ tokens
- Self-adjusting: Model's confidence determines choice set size
Visualization: Nucleus Formation
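A small sketch (with invented, already-sorted probabilities) shows how the nucleus grows until the cumulative probability crosses p:

```python
import torch

# Invented, already-sorted token probabilities for one generation step
probs = torch.tensor([0.42, 0.28, 0.15, 0.07, 0.04, 0.02, 0.02])
p = 0.9

cumulative = torch.cumsum(probs, dim=-1)
# Tokens strictly below the threshold, plus the one that crosses it
nucleus_size = int((cumulative < p).sum().item()) + 1

print("Cumulative probabilities:", [round(c, 2) for c in cumulative.tolist()])
print(f"Nucleus size for p={p}: {nucleus_size} tokens")
# A peaked distribution crosses p after only a few tokens; a flat one needs
# many more, so the choice set adapts to the model's confidence automatically.
```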
Python Implementation
```python
def nucleus_sampling(model, tokenizer, prompt, p=0.9, temperature=1.0, max_length=50):
    """
    Generate text using nucleus (top-p) sampling.

    Args:
        p: Cumulative probability threshold (0.0 to 1.0)
        temperature: Temperature scaling
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :]

        # Apply temperature
        scaled_logits = next_token_logits / temperature

        # Convert to probabilities
        probs = torch.nn.functional.softmax(scaled_logits, dim=-1)

        # Sort probabilities in descending order
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)

        # Calculate cumulative probabilities
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

        # Find nucleus: tokens whose cumulative probability <= p
        nucleus_mask = cumulative_probs <= p

        if nucleus_mask.sum() == 0:
            # Always include at least the first token (highest probability)
            nucleus_mask[0] = True
        elif nucleus_mask.sum() < len(cumulative_probs):
            # Include the first token that pushes us over the threshold
            nucleus_mask[nucleus_mask.sum()] = True

        # Select tokens and probabilities in the nucleus
        nucleus_tokens = sorted_indices[nucleus_mask]
        nucleus_probs = sorted_probs[nucleus_mask]

        # Renormalize probabilities
        nucleus_probs = nucleus_probs / nucleus_probs.sum()

        # Sample from the nucleus
        sample_idx = torch.multinomial(nucleus_probs, num_samples=1).item()
        next_token_id = nucleus_tokens[sample_idx].item()
        generated.append(next_token_id)

        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)
```
Choosing P Values
| P Value | Effect | Nucleus Size | Best For |
|---|---|---|---|
| 0.5-0.7 | Conservative | Small, focused | Technical content, Q&A |
| 0.8-0.9 | Balanced | Medium, adaptive | General content, articles |
| 0.92-0.95 | Creative | Larger, diverse | Creative writing, storytelling |
| 0.98+ | Very creative | Very large | Experimental, artistic content |
Nucleus vs. Top-K Comparison
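A short sketch with two invented distributions makes the contrast concrete: top-k keeps a fixed number of tokens regardless of context, while the nucleus shrinks for confident predictions and grows for uncertain ones:

```python
import torch

def nucleus_size(probs, p=0.9):
    # Number of tokens needed for cumulative probability to reach p
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    return int((cumulative < p).sum().item()) + 1

# A peaked (confident) and a flat (uncertain) distribution over 5 toy tokens
confident = torch.softmax(torch.tensor([8.0, 2.0, 1.0, 0.5, 0.1]), dim=-1)
uncertain = torch.softmax(torch.tensor([1.2, 1.1, 1.0, 0.9, 0.8]), dim=-1)

print("Top-k with k=3 keeps 3 tokens in both cases.")
print("Nucleus size (confident):", nucleus_size(confident))  # just the top token
print("Nucleus size (uncertain):", nucleus_size(uncertain))  # nearly all tokens
```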
Advanced Techniques and Combinations
The Production Recipe: Combined Sampling
Most production systems combine multiple techniques for optimal results:
```python
def combined_sampling(model, tokenizer, prompt, top_k=50, top_p=0.9,
                      temperature=0.7, max_length=50):
    """
    Simplified production sampling combining key techniques.
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        outputs = model(input_ids=torch.tensor([generated]))
        logits = outputs.logits[0, -1, :]

        # Apply temperature scaling
        scaled_logits = logits / temperature

        # Apply top-k filtering
        if top_k > 0:
            top_k_logits, top_k_indices = torch.topk(scaled_logits, top_k)
            filtered_logits = torch.full_like(scaled_logits, float('-inf'))
            filtered_logits.scatter_(0, top_k_indices, top_k_logits)
            scaled_logits = filtered_logits

        # Convert to probabilities and apply nucleus filtering
        probs = torch.softmax(scaled_logits, dim=-1)

        if top_p < 1.0:
            sorted_probs, sorted_indices = torch.sort(probs, descending=True)
            cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
            nucleus_mask = cumulative_probs <= top_p

            # Always include at least the top token
            if nucleus_mask.sum() == 0:
                nucleus_mask[0] = True

            # Zero out probabilities outside the nucleus and renormalize
            probs = probs * 0
            probs.scatter_(0, sorted_indices[nucleus_mask], sorted_probs[nucleus_mask])
            probs = probs / probs.sum()

        # Sample next token
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        generated.append(next_token_id)

        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)
```
Other modern strategies
- Typical sampling (a.k.a. locally typical decoding): prioritizes tokens whose surprise is close to the expected entropy, often improving coherence over pure top‑p.
- Contrastive search: balances model likelihood with a degeneration penalty to reduce repetition.
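Both strategies are supported by the `generate()` API in recent versions of Hugging Face transformers. A minimal sketch follows; the parameter values are illustrative starting points, not tuned recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The future of artificial intelligence will", return_tensors="pt")

# Typical sampling: keep tokens whose information content is close to the expected entropy
typical_ids = model.generate(**inputs, do_sample=True, typical_p=0.95, max_new_tokens=40)

# Contrastive search: deterministic; trades likelihood against a degeneration penalty
contrastive_ids = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=40)

print(tokenizer.decode(typical_ids[0], skip_special_tokens=True))
print(tokenizer.decode(contrastive_ids[0], skip_special_tokens=True))
```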
Handling Repetition
Repetition is a common issue in probabilistic sampling. Several techniques help:
Repetition Penalty
Reduce probability of recently used tokens:
```python
# Simple repetition penalty implementation
def apply_repetition_penalty(logits, past_tokens, penalty=1.2):
    # Penalize each previously generated token once (a set avoids compounding
    # the penalty when the same token appears multiple times in the history)
    for token_id in set(past_tokens):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits
```
Frequency and Presence Penalties
- Frequency penalty: Penalize based on how often a token appears
- Presence penalty: Penalize any token that has appeared at all
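Neither penalty appears in the code above. A minimal sketch of the commonly used formulation (subtract the occurrence count times a frequency coefficient, plus a flat presence coefficient for any seen token); the function name and default values here are illustrative:

```python
from collections import Counter

def apply_frequency_presence_penalties(logits, past_tokens,
                                        frequency_penalty=0.5, presence_penalty=0.3):
    # Frequency: subtract more the more often a token has appeared.
    # Presence: subtract a flat amount from any token that has appeared at all.
    counts = Counter(past_tokens)
    for token_id, count in counts.items():
        logits[token_id] -= count * frequency_penalty + presence_penalty
    return logits
```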
Parameter Recommendations by Use Case
| Use Case | Temperature | Top-K | Top-P | Repetition Penalty | Notes |
|---|---|---|---|---|---|
| Chat Assistant | 0.7 | 50 | 0.9 | 1.1 | Balanced and helpful |
| Creative Writing | 0.9 | 100 | 0.95 | 1.2 | Encourage creativity |
| Technical Docs | 0.3 | 30 | 0.8 | 1.0 | Prioritize accuracy |
| News Articles | 0.6 | 40 | 0.85 | 1.15 | Professional tone |
| Code Generation | 0.2 | 20 | 0.7 | 1.0 | Syntax correctness |
| Poetry | 1.1 | 150 | 0.97 | 1.3 | Maximum creativity |
Practical Implementation with Hugging Face
The Transformers library makes advanced sampling easy:
```python
from transformers import pipeline

# Set up the pipeline
generator = pipeline('text-generation', model='gpt2')

prompt = "The future of artificial intelligence will"

# Temperature sampling
temp_output = generator(
    prompt,
    max_length=50,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=2
)

# Combined approach (production-ready)
production_output = generator(
    prompt,
    max_length=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.2,
    num_return_sequences=2
)

print("Temperature Sampling:")
for output in temp_output:
    print(f"- {output['generated_text']}")

print("\nProduction Sampling:")
for output in production_output:
    print(f"- {output['generated_text']}")
```
Key Hugging Face Parameters
- `do_sample=True`: Enable probabilistic sampling
- `temperature`: Control randomness (0.1-2.0)
- `top_k`: Limit to top-k tokens (0 = disabled)
- `top_p`: Nucleus sampling threshold (0.0-1.0)
- `repetition_penalty`: Penalize repeated tokens (1.0-2.0)
- `num_return_sequences`: Generate multiple outputs
Common Issues and Solutions
Quick Troubleshooting Guide
| Problem | Symptoms | Solution |
|---|---|---|
| Too Random | Nonsensical text, grammar errors | Lower temperature (0.5-0.7), reduce top_p (0.8-0.9) |
| Too Boring | Generic responses, repetitive | Increase temperature (0.8-1.0), increase top_p (0.9-0.95) |
| Inconsistent | Some outputs great, others terrible | Generate multiple samples, use conservative parameters |
| Repetitive | Repeated phrases despite penalties | Increase repetition penalty (1.2-1.5) |
Parameter Tuning Process
- Start with defaults: temperature=0.7, top_p=0.9, top_k=50
- Adjust temperature first: Control overall creativity level
- Fine-tune filtering: Adjust top_p/top_k for quality
- Test extensively: Use diverse prompts and evaluate outputs
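For step 4, a small sweep makes the differences easy to compare side by side. This sketch reuses the `generator` pipeline created in the Hugging Face section above; the prompt and parameter grids are arbitrary examples, not recommendations:

```python
sweep_prompt = "Write a short product description for a reusable water bottle."

for temperature in [0.5, 0.7, 0.9]:
    for top_p in [0.85, 0.9, 0.95]:
        result = generator(
            sweep_prompt,
            max_length=60,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=1,
        )
        print(f"T={temperature}, top_p={top_p}: {result[0]['generated_text']}\n")
```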
Evaluating Sampling Quality
Key Evaluation Criteria
- Fluency: Is the text grammatically correct?
- Coherence: Does it make logical sense?
- Relevance: Does it address the prompt appropriately?
- Creativity: Is it interesting and non-generic? (the distinct-n sketch below gives one rough automated proxy)
- Consistency: Does quality remain stable across samples?
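Fluency, coherence, and relevance usually need human or model-based judgment, but lexical diversity has simple automated proxies. A minimal distinct-n sketch; the helper name and sample strings are made up for illustration:

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across samples -- a rough lexical diversity proxy."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

samples = [
    "the cat sat on the mat",
    "a dog ran through the park",
    "the cat sat on the mat",  # duplicate sample lowers the score
]
print(f"distinct-2: {distinct_n(samples, n=2):.2f}")
```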
Summary
What We've Learned
- Temperature sampling: Control creativity with a single parameter
- Top-k sampling: Limit choices to reasonable options
- Nucleus sampling: Adaptive, context-aware token selection
- Combined approaches: Production-ready systems using multiple techniques
- Parameter tuning: Guidelines for different use cases
- Common issues: How to debug and fix sampling problems
The Complete Sampling Toolkit
You now have the complete toolkit for text generation:
Deterministic Methods (previous lesson):
- Greedy search: Fast, reliable, predictable
- Beam search: Higher quality, still deterministic
Probabilistic Methods (this lesson):
- Temperature: Creativity dial
- Top-k: Smart choice limitation
- Nucleus: Adaptive selection
- Combined: Production-ready systems
When to Use What
| Scenario | Recommended Approach | Key Parameters |
|---|---|---|
| Factual Q&A | Low temperature | temp=0.2, top_p=0.8 |
| Creative Writing | Nucleus sampling | temp=0.9, top_p=0.95 |
| Chat Assistant | Balanced combination | temp=0.7, top_k=50, top_p=0.9 |
| Code Generation | Conservative sampling | temp=0.3, top_k=30 |
| Brainstorming | High creativity | temp=1.1, top_p=0.97 |
Practice Exercises
Exercise 1: Parameter Exploration
Create a simple interface that lets you adjust temperature, top-k, and top-p parameters in real-time. Generate text with the same prompt using different settings and analyze the differences.
Exercise 2: Use Case Optimization
Choose a specific use case (e.g., writing product descriptions, generating study notes, creating story outlines) and systematically tune parameters to optimize for that task.
Exercise 3: Quality Evaluation
Implement automated metrics to evaluate generation quality. Compare different sampling methods on dimensions like diversity, fluency, and relevance.
Exercise 4: Repetition Handling
Experiment with different repetition penalty values and strategies. Create examples where repetition is problematic and show how to fix it.
Exercise 5: Production System
Build a complete text generation system that:
- Takes user prompts
- Allows parameter adjustment
- Generates multiple candidates
- Includes basic quality filtering
- Handles edge cases gracefully
Additional Resources
- The Curious Case of Neural Text Degeneration - Original nucleus sampling paper
- Hugging Face Generation Strategies
- Typical Sampling for Natural Language Generation
- How to Generate Text with Transformers
- OpenAI API Documentation - Real-world parameter examples