Text Generation: Probabilistic Sampling

Overview

In our previous lesson, we mastered deterministic generation methods—greedy search and beam search. These techniques are excellent for tasks requiring consistency and correctness, but they share a fundamental limitation when generating text from language models: they're too conservative.

When we want creative, diverse, or surprising text generation from transformer models, we need to introduce controlled randomness. This lesson explores probabilistic sampling techniques that balance creativity with quality, giving language models the ability to produce varied, interesting outputs while maintaining coherence.

Think of this as the difference between a conversation with a very knowledgeable but predictable expert versus one with a creative, thoughtful friend who surprises you with interesting perspectives.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand why randomness improves text generation
  • Implement and tune temperature sampling for creativity control
  • Use top-k sampling to limit choice sets intelligently
  • Apply nucleus (top-p) sampling for dynamic token selection
  • Combine multiple techniques for production-ready systems
  • Debug and optimize sampling parameters for different use cases
  • Handle common issues like repetition and incoherence

The Case for Controlled Randomness

Why Perfect Predictions Aren't Perfect

Deterministic methods optimize for likelihood—they choose what's most probable given the training data. But the most probable text isn't always the most:

  • Interesting: "The weather is nice" vs. "The crimson sunset painted the horizon"
  • Useful: Generic responses vs. specific, tailored answers
  • Human-like: Robotic predictability vs. natural variation

The Exploration-Exploitation Balance

Every text generation step involves a fundamental trade-off: exploitation picks the token the model considers most probable, which is safe but predictable, while exploration gives lower-probability tokens a chance, which risks mistakes but produces more varied and interesting text. The sampling techniques below are different ways of controlling where on that spectrum a model operates.

Temperature Sampling: The Creativity Dial

Temperature sampling is the simplest way to inject controlled randomness. Before sampling the next token, the model's logits are divided by a temperature value T and then passed through softmax: temperatures below 1 sharpen the distribution toward the most likely tokens, while temperatures above 1 flatten it so that less likely tokens are chosen more often.

Understanding Temperature Values

Temperature | Effect | Use Cases | Example Output Style
0.1-0.3 | Very focused, almost deterministic | Factual Q&A, technical writing | "Solar panels convert sunlight into electricity through photovoltaic cells."
0.5-0.8 | Balanced creativity and coherence | General content, articles | "Solar technology represents a paradigm shift toward sustainable energy solutions."
0.9-1.2 | Creative and diverse | Creative writing, brainstorming | "Sunlight dances across crystalline surfaces, awakening electrons in their silicon dreams."
1.5+ | Highly creative, potentially incoherent | Experimental art, poetry | "Quantum photons whisper secrets to semiconducting consciousness, birthing energy..."

Python Implementation

import torch

def temperature_sampling(model, tokenizer, prompt, temperature=0.7, max_length=50):
    """
    Generate text using temperature sampling.

    Args:
        temperature: Controls randomness (lower = more focused, higher = more random)
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        # Get model predictions
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :]

        # Apply temperature scaling
        scaled_logits = next_token_logits / temperature

        # Convert to probabilities
        probs = torch.nn.functional.softmax(scaled_logits, dim=-1)

        # Sample from the distribution
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        generated.append(next_token_id)

        # Stop if we generate the end-of-sequence token
        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)

# Example: different temperatures for creative writing
prompt = "In a world where dreams become reality"
creative_output = temperature_sampling(model, tokenizer, prompt, temperature=1.1)
balanced_output = temperature_sampling(model, tokenizer, prompt, temperature=0.7)
focused_output = temperature_sampling(model, tokenizer, prompt, temperature=0.3)

print(f"Creative (T=1.1): {creative_output}")
print(f"Balanced (T=0.7): {balanced_output}")
print(f"Focused (T=0.3): {focused_output}")

Temperature Tuning Guidelines

For different content types:

# Recommended temperature ranges
TEMPERATURE_GUIDES = {
    "factual_qa": 0.1,        # Want precise, correct answers
    "technical_docs": 0.3,    # Clear, accurate explanations
    "news_articles": 0.5,     # Professional but not robotic
    "blog_posts": 0.7,        # Engaging and personable
    "creative_writing": 0.9,  # Original and surprising
    "poetry": 1.2,            # Highly creative and artistic
    "brainstorming": 1.5,     # Maximum idea diversity
}
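
These presets plug directly into the temperature_sampling function above. The snippet below is a small illustrative usage example; it assumes the model and tokenizer objects from the earlier code are already loaded, and the prompt string is made up:

# Pick a preset based on the kind of content being generated
prompt = "Write a short introduction to solar energy."
output = temperature_sampling(
    model, tokenizer, prompt,
    temperature=TEMPERATURE_GUIDES["blog_posts"],  # 0.7: engaging but coherent
)
print(output)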

Top-K Sampling: Intelligent Choice Limitation

Core Concept

Top-K sampling addresses a key problem with temperature sampling: even with low temperature, there's still a small chance of selecting very inappropriate tokens. Top-K limits the choice to only the K most likely tokens.

Algorithm:

  1. Get probability distribution from model
  2. Select only the top-K most likely tokens
  3. Renormalize probabilities among these K tokens
  4. Sample from this reduced distribution, optionally with temperature (a minimal code sketch follows below)

Visualization: Top-K Filtering Effect

Top-K filtering keeps only the K highest-probability tokens and removes everything else from consideration before the remaining probabilities are renormalized.
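
As a minimal sketch of the algorithm above (not an official implementation), top-K sampling can reuse the skeleton of temperature_sampling: it assumes the same model, tokenizer, and torch import from the earlier example, and uses torch.topk to keep only the K most likely tokens before renormalizing and sampling.

def top_k_sampling(model, tokenizer, prompt, k=50, temperature=1.0, max_length=50):
    """Sketch: keep only the k most likely tokens, then sample from them."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :] / temperature

        # Keep the K largest logits and their token ids
        top_k_logits, top_k_indices = torch.topk(next_token_logits, k)

        # Renormalize over the reduced set and sample from it
        top_k_probs = torch.nn.functional.softmax(top_k_logits, dim=-1)
        sample_idx = torch.multinomial(top_k_probs, num_samples=1).item()
        next_token_id = top_k_indices[sample_idx].item()
        generated.append(next_token_id)

        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)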

Nucleus (Top-P) Sampling: Dynamic Token Selection

Nucleus sampling takes a different approach from top-K: instead of keeping a fixed number of tokens, it keeps the smallest set of tokens whose cumulative probability exceeds a threshold p. The candidate set therefore adapts to the model's confidence at each step.

Python Implementation

def nucleus_sampling(model, tokenizer, prompt, p=0.9, temperature=1.0, max_length=50):
    """
    Generate text using nucleus (top-p) sampling.

    Args:
        p: Cumulative probability threshold (0.0 to 1.0)
        temperature: Temperature scaling
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        outputs = model(input_ids=torch.tensor([generated]))
        next_token_logits = outputs.logits[0, -1, :]

        # Apply temperature
        scaled_logits = next_token_logits / temperature

        # Convert to probabilities
        probs = torch.nn.functional.softmax(scaled_logits, dim=-1)

        # Sort probabilities in descending order
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)

        # Calculate cumulative probabilities
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

        # Nucleus: tokens whose cumulative probability stays within p
        nucleus_mask = cumulative_probs <= p

        # Always include the token that pushes cumulative probability over p
        # (this also guarantees the nucleus is never empty)
        cutoff = int(nucleus_mask.sum().item())
        if cutoff < len(sorted_probs):
            nucleus_mask[cutoff] = True

        # Select tokens and probabilities in the nucleus
        nucleus_tokens = sorted_indices[nucleus_mask]
        nucleus_probs = sorted_probs[nucleus_mask]

        # Renormalize probabilities
        nucleus_probs = nucleus_probs / nucleus_probs.sum()

        # Sample from the nucleus
        sample_idx = torch.multinomial(nucleus_probs, num_samples=1).item()
        next_token_id = nucleus_tokens[sample_idx].item()
        generated.append(next_token_id)

        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)

Choosing P Values

P Value | Effect | Nucleus Size | Best For
0.5-0.7 | Conservative | Small, focused | Technical content, Q&A
0.8-0.9 | Balanced | Medium, adaptive | General content, articles
0.92-0.95 | Creative | Larger, diverse | Creative writing, storytelling
0.98+ | Very creative | Very large | Experimental, artistic content
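
For instance, the nucleus_sampling function above can be called with different p values to trade focus for diversity. This usage sketch assumes the model and tokenizer are already loaded, and the prompt is made up:

prompt = "The old lighthouse keeper discovered"
focused_story = nucleus_sampling(model, tokenizer, prompt, p=0.7)    # small, conservative nucleus
creative_story = nucleus_sampling(model, tokenizer, prompt, p=0.95)  # larger, more diverse nucleus
print(f"p=0.7:  {focused_story}")
print(f"p=0.95: {creative_story}")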

Nucleus vs. Top-K Comparison

Both methods trim the tail of the distribution before sampling, but they choose the cutoff differently: top-K keeps a fixed number of candidates regardless of how the probability mass is spread, while nucleus sampling adapts its candidate set to the model's confidence, shrinking when one token dominates and growing when many tokens are plausible.

Handling Repetition

Even well-tuned sampling can drift into repeated words and phrases. A common remedy is a repetition penalty that makes already-generated tokens less attractive at each step:

# Simple repetition penalty implementation
def apply_repetition_penalty(logits, past_tokens, penalty=1.2):
    # Penalize each previously generated token once
    for token_id in set(past_tokens):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

Frequency and Presence Penalties

  • Frequency penalty: Penalize based on how often a token appears
  • Presence penalty: Penalize any token that has appeared at all, regardless of count (a sketch of both follows below)
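
As a rough illustration, both penalties can be applied directly to the logits before sampling. The additive formulation below mirrors how some APIs expose these penalties; the function name and default values are illustrative, not a library API:

from collections import Counter

def apply_frequency_presence_penalties(logits, past_tokens,
                                        frequency_penalty=0.5,
                                        presence_penalty=0.5):
    """Sketch: subtract penalties from the logits of previously seen tokens."""
    counts = Counter(past_tokens)
    for token_id, count in counts.items():
        # Frequency penalty grows with the number of occurrences;
        # presence penalty is a flat cost for having appeared at all.
        logits[token_id] -= frequency_penalty * count + presence_penalty
    return logits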

Parameter Recommendations by Use Case

Use Case | Temperature | Top-K | Top-P | Repetition Penalty | Notes
Chat Assistant | 0.7 | 50 | 0.9 | 1.1 | Balanced and helpful
Creative Writing | 0.9 | 100 | 0.95 | 1.2 | Encourage creativity
Technical Docs | 0.3 | 30 | 0.8 | 1.0 | Prioritize accuracy
News Articles | 0.6 | 40 | 0.85 | 1.15 | Professional tone
Code Generation | 0.2 | 20 | 0.7 | 1.0 | Syntax correctness
Poetry | 1.1 | 150 | 0.97 | 1.3 | Maximum creativity
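
These recommendations can be encoded as presets and passed straight to a Hugging Face pipeline (shown in the next section). The dictionary below is an illustrative convenience that mirrors the table above, not a library feature:

# Illustrative presets mirroring the table above
GENERATION_PRESETS = {
    "chat_assistant":   {"temperature": 0.7, "top_k": 50,  "top_p": 0.9,  "repetition_penalty": 1.1},
    "creative_writing": {"temperature": 0.9, "top_k": 100, "top_p": 0.95, "repetition_penalty": 1.2},
    "technical_docs":   {"temperature": 0.3, "top_k": 30,  "top_p": 0.8,  "repetition_penalty": 1.0},
    "news_articles":    {"temperature": 0.6, "top_k": 40,  "top_p": 0.85, "repetition_penalty": 1.15},
    "code_generation":  {"temperature": 0.2, "top_k": 20,  "top_p": 0.7,  "repetition_penalty": 1.0},
    "poetry":           {"temperature": 1.1, "top_k": 150, "top_p": 0.97, "repetition_penalty": 1.3},
}

# Example usage with a Hugging Face pipeline (see the next section):
# generator(prompt, max_length=50, do_sample=True, **GENERATION_PRESETS["chat_assistant"])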

Practical Implementation with Hugging Face

The Transformers library makes advanced sampling easy:

from transformers import pipeline

# Set up the pipeline
generator = pipeline('text-generation', model='gpt2')

prompt = "The future of artificial intelligence will"

# Temperature sampling
temp_output = generator(
    prompt,
    max_length=50,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=2
)

# Combined approach (production-ready)
production_output = generator(
    prompt,
    max_length=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.2,
    num_return_sequences=2
)

print("Temperature Sampling:")
for output in temp_output:
    print(f"- {output['generated_text']}")

print("\nProduction Sampling:")
for output in production_output:
    print(f"- {output['generated_text']}")

Key Hugging Face Parameters

  • do_sample=True: Enable probabilistic sampling
  • temperature: Control randomness (0.1-2.0)
  • top_k: Limit to top-k tokens (0 = disabled)
  • top_p: Nucleus sampling threshold (0.0-1.0)
  • repetition_penalty: Penalize repeated tokens (1.0-2.0)
  • num_return_sequences: Generate multiple outputs
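
The same parameters also work with model.generate() if you are working with a model and tokenizer directly rather than through a pipeline. A minimal sketch with GPT-2 (the parameter values are just examples):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of artificial intelligence will", return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.2,
    num_return_sequences=2,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)

for seq in output_ids:
    print(tokenizer.decode(seq, skip_special_tokens=True))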

Common Issues and Solutions

Quick Troubleshooting Guide

Problem | Symptoms | Solution
Too Random | Nonsensical text, grammar errors | Lower temperature (0.5-0.7), reduce top_p (0.8-0.9)
Too Boring | Generic responses, repetitive | Increase temperature (0.8-1.0), increase top_p (0.9-0.95)
Inconsistent | Some outputs great, others terrible | Generate multiple samples, use conservative parameters
Repetitive | Repeated phrases despite penalties | Increase repetition penalty (1.2-1.5)

Parameter Tuning Process

  1. Start with defaults: temperature=0.7, top_p=0.9, top_k=50
  2. Adjust temperature first: Control overall creativity level
  3. Fine-tune filtering: Adjust top_p/top_k for quality
  4. Test extensively: Use diverse prompts and evaluate outputs
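
A simple way to carry out step 4 is a small sweep over candidate settings with the same prompts, then scoring or eyeballing the outputs. The sketch below assumes the generator pipeline from the previous section; the prompts and settings are made up:

test_prompts = [
    "Explain how solar panels work.",
    "Write the opening line of a mystery novel.",
]

candidate_settings = [
    {"temperature": 0.5, "top_p": 0.9},
    {"temperature": 0.7, "top_p": 0.9},
    {"temperature": 0.9, "top_p": 0.95},
]

for settings in candidate_settings:
    print(f"\n=== {settings} ===")
    for prompt in test_prompts:
        result = generator(prompt, max_length=50, do_sample=True, **settings)
        print(f"{prompt}\n -> {result[0]['generated_text']}")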

Evaluating Sampling Quality

Key Evaluation Criteria

  1. Fluency: Is the text grammatically correct?
  2. Coherence: Does it make logical sense?
  3. Relevance: Does it address the prompt appropriately?
  4. Creativity: Is it interesting and non-generic?
  5. Consistency: Does quality remain stable across samples?
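
Most of these criteria still call for human judgment, but diversity in particular is easy to approximate automatically. One common proxy is the distinct-n ratio: unique n-grams divided by total n-grams across a batch of samples. A minimal sketch, reusing the production_output samples from the Hugging Face example above:

def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a list of generated texts."""
    all_ngrams = []
    for text in texts:
        tokens = text.split()
        all_ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

samples = [out["generated_text"] for out in production_output]
print(f"distinct-2: {distinct_n(samples, n=2):.2f}")  # higher = more diverse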

Summary

What We've Learned

  1. Temperature sampling: Control creativity with a single parameter
  2. Top-k sampling: Limit choices to reasonable options
  3. Nucleus sampling: Adaptive, context-aware token selection
  4. Combined approaches: Production-ready systems using multiple techniques
  5. Parameter tuning: Guidelines for different use cases
  6. Common issues: How to debug and fix sampling problems

The Complete Sampling Toolkit

You now have the complete toolkit for text generation:

Deterministic Methods (previous lesson):

  • Greedy search: Fast, reliable, predictable
  • Beam search: Higher quality, still deterministic

Probabilistic Methods (this lesson):

  • Temperature: Creativity dial
  • Top-k: Smart choice limitation
  • Nucleus: Adaptive selection
  • Combined: Production-ready systems

When to Use What

Scenario | Recommended Approach | Key Parameters
Factual Q&A | Low temperature | temp=0.2, top_p=0.8
Creative Writing | Nucleus sampling | temp=0.9, top_p=0.95
Chat Assistant | Balanced combination | temp=0.7, top_k=50, top_p=0.9
Code Generation | Conservative sampling | temp=0.3, top_k=30
Brainstorming | High creativity | temp=1.1, top_p=0.97

Practice Exercises

Exercise 1: Parameter Exploration

Create a simple interface that lets you adjust temperature, top-k, and top-p parameters in real-time. Generate text with the same prompt using different settings and analyze the differences.

Exercise 2: Use Case Optimization

Choose a specific use case (e.g., writing product descriptions, generating study notes, creating story outlines) and systematically tune parameters to optimize for that task.

Exercise 3: Quality Evaluation

Implement automated metrics to evaluate generation quality. Compare different sampling methods on dimensions like diversity, fluency, and relevance.

Exercise 4: Repetition Handling

Experiment with different repetition penalty values and strategies. Create examples where repetition is problematic and show how to fix it.

Exercise 5: Production System

Build a complete text generation system that:

  • Takes user prompts
  • Allows parameter adjustment
  • Generates multiple candidates
  • Includes basic quality filtering
  • Handles edge cases gracefully

Additional Resources