Overview
In our previous lesson, we introduced basic tokenization methods like word and character tokenization. While intuitive, these approaches have significant limitations when handling large vocabularies, out-of-vocabulary words, and morphologically rich languages.
Modern NLP models rely on sophisticated subword tokenization strategies that find an optimal balance between character-level and word-level representations. Today's leading models use three main approaches:
- SentencePiece (Unigram): Dominant in 2024 - used by LLaMA, PaLM, T5, and most multilingual models
- Byte-Pair Encoding (BPE): Powers GPT models and many encoder-decoder architectures
- WordPiece: Foundation of BERT-family models and Google's ecosystem
This lesson explores these subword tokenization techniques that have revolutionized NLP, with hands-on tools to understand how each algorithm works in practice.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the limitations of traditional tokenization approaches
- Explain how modern subword tokenization algorithms work
- Compare different subword tokenization methods (BPE, WordPiece, SentencePiece)
- Implement and use subword tokenizers in practice
- Select appropriate tokenization strategies for different NLP tasks
The Need for Subword Tokenization
Limitations of Word-Level Tokenization
Word tokenization seemed intuitive in our previous lesson, but it has several critical weaknesses:
- Vocabulary Explosion: Languages are productive; they can generate a virtually unlimited number of words through compounding, inflection, and derivation.
- Out-of-Vocabulary (OOV) Words: Any word not seen during training becomes an `<UNK>` (unknown) token, losing all semantic information (a minimal sketch follows this list).
- Morphological Blindness: The tokens "play", "playing", and "played" are treated as completely different words, even though they share the same root.
- Rare Words Problem: Infrequent words have sparse statistics, making it difficult for models to learn good representations.
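To make the OOV problem concrete, here is a minimal sketch of word-level lookup; the toy corpus and the `encode` helper are illustrative stand-ins, not part of any library. The vocabulary is built from a tiny training set, and every unseen word collapses to a single `<UNK>` id:

```python
# Build a word-level vocabulary from a toy training corpus.
train_text = "the cat sat on the mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(train_text.split())), start=1)}
vocab["<UNK>"] = 0  # reserved id for unknown words

def encode(sentence):
    """Map each word to its id; words never seen in training collapse to <UNK>."""
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]

print(encode("the cat sat"))            # every word is known
print(encode("the cats were sitting"))  # "cats", "were", "sitting" all map to 0 (<UNK>)
```

Note how the inflected form "cats" loses all connection to "cat", which is exactly the morphological blindness described above.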
Analogy: Word Construction as Lego Blocks
Think of words as structures built from smaller reusable pieces, like Lego blocks. Rather than trying to pre-manufacture every possible structure (word), we can provide the fundamental blocks and rules for combining them.
- In English: "un" + "break" + "able" = "unbreakable"
- In German: "Grund" + "gesetz" + "buch" = "Grundgesetzbuch" (constitution)
Visualization: Vocabulary Size vs. Coverage
The Vocabulary Size Problem
| Vocabulary Size | Word-level Coverage | Subword Coverage (BPE) | Difference |
|---|---|---|---|
| 10K tokens | 80.5% | 95.8% | +15.3% |
| 20K tokens | 85.2% | 97.9% | +12.7% |
| 30K tokens | 87.9% | 98.6% | +10.7% |
| 50K tokens | 90.5% | 99.2% | +8.7% |
| 100K tokens | 93.4% | 99.8% | +6.4% |
Key Insight: Subword tokenization achieves 95%+ coverage with just 10K tokens, while word-level tokenization needs 100K+ tokens to reach 93% coverage.
Why This Matters
- Memory Efficiency: Smaller vocabularies mean smaller embedding matrices (see the quick calculation after this list)
- Better Generalization: Higher coverage means fewer `<UNK>` tokens
- Computational Efficiency: A smaller vocabulary means faster training and inference
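To put the memory point in numbers, here is a rough back-of-the-envelope calculation; the 768-dimensional embedding size is an assumption (roughly BERT-base-sized), not a value from this lesson:

```python
# Embedding parameters = vocabulary size x embedding dimension
embedding_dim = 768  # assumed hidden size, roughly BERT-base

for vocab_size in (10_000, 30_000, 100_000):
    params = vocab_size * embedding_dim
    print(f"{vocab_size:>7,} tokens -> {params / 1e6:5.1f}M embedding parameters")
# 10k tokens -> about 7.7M parameters; 100k tokens -> about 76.8M parameters
```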
Byte-Pair Encoding (BPE)
BPE is one of the most widely used subword tokenization algorithms, employed by models like GPT (OpenAI) and BART (Facebook). Modern GPT models use a byte‑level BPE variant operating directly on UTF‑8 bytes, ensuring any Unicode sequence is representable without UNK fallbacks.
History and Origins
Originally developed as a data compression algorithm by Philip Gage in 1994, BPE was adapted for NLP by Sennrich et al. in 2016 for neural machine translation.
How BPE Works
BPE follows a simple yet effective procedure (a minimal code sketch follows the list):
- Initialize vocabulary with individual characters
- Count all symbol pairs in the corpus
- Merge the most frequent pair
- Repeat steps 2-3 until desired vocabulary size or stopping criterion is reached
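The sketch below implements this loop directly. It is a minimal, unoptimized illustration; the function names (`train_bpe`, `merge_pair`, `get_pair_counts`) are our own, not from a library:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: each word starts as a sequence of characters plus an end-of-word marker.
    corpus = dict(Counter(tuple(w) + ("</w>",) for w in words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)    # Step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # Step 3: merge the most frequent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)                # Step 4: repeat until the merge budget is spent
    return merges, corpus
```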
Step-by-Step BPE Algorithm Example
Let's trace through the BPE algorithm with the corpus: "low lower lowest"
Step 1: Initialize with characters + end-of-word marker
```
Text:   low lower lowest
Tokens: l o w </w>   l o w e r </w>   l o w e s t </w>
```
Step 2: Count all adjacent pairs
```
Pair counts:
  ('l', 'o'):    3   # appears in all three words
  ('o', 'w'):    3   # appears in all three words
  ('w', '</w>'): 1   # only in "low"
  ('w', 'e'):    2   # in "lower" and "lowest"
  ('e', 'r'):    1   # only in "lower"
  ('r', '</w>'): 1   # only in "lower"
  ('e', 's'):    1   # only in "lowest"
  ('s', 't'):    1   # only in "lowest"
  ('t', '</w>'): 1   # only in "lowest"
```
Step 3: Merge most frequent pair → ('l', 'o') becomes 'lo'
```
Text:   low lower lowest
Tokens: lo w </w>   lo w e r </w>   lo w e s t </w>
```
Step 4: Recount pairs
```
Pair counts:
  ('lo', 'w'):   3   # now most frequent
  ('w', '</w>'): 1
  ('w', 'e'):    2
  ('e', 'r'):    1
  ... (other pairs)
```
Step 5: Merge ('lo', 'w') → 'low'
```
Text:   low lower lowest
Tokens: low </w>   low e r </w>   low e s t </w>
```
Step 6: Continue merging...
```
Next most frequent pair: ('low', 'e'), which appears twice.
After merging ('low', 'e'):
Tokens: low </w>   lowe r </w>   lowe s t </w>

All remaining pairs now occur only once, so further merges
(for example ('lowe', 'r') → 'lower') come down to tie-breaking.
```
Final vocabulary: {l, o, w, e, r, s, t, </w>, lo, low, lowe, ...}
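Running the sketch from above on the same three-word corpus reproduces this trace; the first tie (between ('l', 'o') and ('o', 'w')) and the later count-1 merges depend on how ties are broken:

```python
merges, corpus = train_bpe("low lower lowest".split(), num_merges=5)
print(merges[:3])  # with this tie-breaking: [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(corpus)      # the segmented corpus after five merges
```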
Key Insights from this Example
- Frequency drives merging: Most common character pairs get merged first
- Hierarchical building: Simple subwords become building blocks for complex ones
- Shared subwords: "low" appears in all variants, maximizing reuse
- Morphology awareness: on a larger corpus, frequent suffixes such as "er" and "est" recur across many words and emerge as tokens of their own
Unicode and Whitespace Caveats
- SentencePiece’s visible underscore (▁) indicates a word boundary; decoding restores original spaces.
- Byte-level BPE represents combining marks and multi-byte glyphs as raw UTF-8 bytes; token lists may look odd, but decoding is lossless (a quick round-trip check follows).
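As a quick check of these claims, the sketch below tokenizes an accented string and decodes it back; it assumes the Hugging Face `transformers` package and the public `gpt2` (byte-level BPE) and `t5-base` (SentencePiece) checkpoints:

```python
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")   # byte-level BPE
t5_tok = AutoTokenizer.from_pretrained("t5-base")  # SentencePiece (Unigram)

text = "naïve café"

for name, tok in [("GPT-2", gpt2_tok), ("T5", t5_tok)]:
    ids = tok.encode(text, add_special_tokens=False)
    tokens = tok.convert_ids_to_tokens(ids)
    # Token strings may contain Ġ, ▁, or raw byte pieces; that is expected.
    # Byte-level BPE round-trips any input exactly; SentencePiece round-trips
    # anything covered by its vocabulary.
    print(name, tokens, "->", repr(tok.decode(ids)))
```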
SentencePiece Implementation
```python
import sentencepiece as spm

# Train a SentencePiece model
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='sentencepiece',
    vocab_size=8000,
    model_type='unigram',  # or 'bpe'
    character_coverage=0.9995,
    normalization_rule_name='nmt_nfkc'
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load('sentencepiece.model')

# Encode and decode
text = "SentencePiece is an unsupervised text tokenizer."
encoded = sp.encode(text, out_type=str)
decoded = sp.decode(encoded)

print(f"Original: {text}")
print(f"Tokens: {encoded}")
print(f"Decoded: {decoded}")
```
Applications of SentencePiece
- Google's T5 and PaLM models
- Meta AI's LLaMA models
- XLNet and many multilingual models
- Particularly popular for non-English and multilingual models
Comparison of Tokenization Methods
Coverage Across Languages
| Language | Word-Level | BPE | WordPiece | SentencePiece | Winner |
|---|---|---|---|---|---|
| English | 92% | 95% | 94% | 95% | Tie |
| Chinese | 45% | 80% | 82% | 94% | SentencePiece |
| Japanese | 50% | 85% | 83% | 95% | SentencePiece |
| German | 85% | 92% | 91% | 94% | SentencePiece |
| Arabic | 80% | 88% | 89% | 93% | SentencePiece |
| Russian | 75% | 90% | 88% | 93% | SentencePiece |
Key Insight: SentencePiece consistently achieves the highest coverage across languages, explaining its dominance in multilingual and modern LLMs.
Feature Comparison
| Feature | BPE | WordPiece | SentencePiece |
|---|---|---|---|
| Current Popularity | High | Medium | Highest |
| 2024 Usage | GPT (byte‑level), RoBERTa | BERT family | LLaMA, T5, PaLM |
| Merge criterion | Frequency | Likelihood | Frequency or Likelihood |
| Pre-tokenization | Required | Required | Not required |
| Language support | Partially agnostic | Partially agnostic | Fully agnostic |
| Whitespace handling | Removed | Removed | Preserved (▁) |
| Subword marking | None | ## prefix | ▁ prefix |
| Vocabulary size | 10k-50k | 10k-30k | 8k-32k |
| Out-of-vocabulary | Character/byte fallback | Character fallback | Character fallback |
| Reversibility | Partial | Partial | Complete |
Which Tokenizer Should You Use in 2024?
Quick Recommendations:
- New projects: SentencePiece (Unigram) is the best overall choice, especially for multilingual work
- Fine-tuning existing models: use the original model's tokenizer
- English-only BERT-style tasks: WordPiece is fine
- GPT-style generation: BPE or byte-level BPE
Decision Tree:
- Fine-tuning a pre-trained model? → Use its original tokenizer
- Building from scratch + multilingual? → SentencePiece
- Building from scratch + English-only? → SentencePiece or BPE
- Need perfect reversibility? → SentencePiece
Advanced Topics
Tokenization Implications for Model Performance
The choice of tokenization strategy has profound effects on:
- Model Size: Vocabulary size directly impacts embedding layer parameters
- Training Efficiency: Better tokenization means more efficient training
- Language Support: Some tokenizers handle certain languages better
- Model Generalization: Good subword tokenization improves generalization to new words
Tokenization Challenges
- Language Boundaries: Not all languages use spaces or have clear word boundaries
- Morphologically Rich Languages: Languages like Finnish or Turkish have complex word structures
- Code-Switching: Handling text that mixes multiple languages
- Non-linguistic Content: Emojis, URLs, hashtags, code snippets
Beyond Subword Tokenization
Research continues to improve tokenization:
- Character-level Transformers: Operate directly on characters, bypassing subword tokenization entirely
- Byte-level BPE: GPT-2/3/4 use byte-level BPE to handle any Unicode character
- Dynamic Tokenization: Adapt tokenization based on the input
- Tokenization-free Models: Some experimental approaches try to work directly with raw text
Practical Implementation
Choosing the Right Tokenizer
Guidelines for selecting a tokenizer:
- Task Alignment: Match your tokenizer with your downstream task
- Model Compatibility: If fine-tuning, use the original model's tokenizer
- Language Support: Consider language-specific needs
- Vocabulary Size: Balance between coverage and computational efficiency
Tokenization in the Hugging Face Ecosystem
The Hugging Face Tokenizers library provides fast implementations of all major tokenization algorithms:
```python
from transformers import AutoTokenizer

# Load pre-trained tokenizers
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Example text
text = "Tokenization splits text into subword units!"

# Compare tokenization results
print("BERT (WordPiece):", bert_tokenizer.tokenize(text))
print("GPT-2 (BPE):", gpt2_tokenizer.tokenize(text))
print("T5 (SentencePiece):", t5_tokenizer.tokenize(text))
```
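If you run this, the BERT output should contain ## continuation pieces, the GPT-2 output Ġ-prefixed tokens marking word-initial spaces, and the T5 output ▁-prefixed tokens, matching the subword-marking row in the comparison table above.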