Overview
In our previous lessons, we explored the transformer architecture and various sampling techniques for text generation. Now, we'll trace the foundational evolutionary journey of transformer models that revolutionized NLP from 2018 to 2023.
This lesson examines how the original encoder-decoder transformer architecture branched into specialized variants (encoder-only, decoder-only, and encoder-decoder), each optimized for different tasks. We'll analyze milestone models such as BERT, GPT, and T5, and examine the key insights that drove this foundational evolution leading up to the modern era.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the architectural differences between encoder-only, decoder-only, and encoder-decoder models
- Explain the innovations and key contributions of foundational models (BERT, GPT-3, T5, etc.)
- Compare the strengths and weaknesses of different transformer variants
- Recognize the relationship between model architecture and NLP task suitability
- Identify key trends in the foundational evolution of transformer models
- Apply this knowledge to understand the principles behind architectural choices
The Transformer Family Tree
From General to Specialized Architectures
The original transformer model (Vaswani et al., 2017) introduced a general encoder-decoder architecture for sequence-to-sequence tasks. Since then, transformer models have evolved along three main branches (see the code sketch after this list):
- Encoder-only models (e.g., BERT, RoBERTa): Specialize in understanding language
- Decoder-only models (e.g., GPT, GPT-3): Focus on generating language
- Encoder-decoder models (e.g., T5, BART): Maintain the full architecture for sequence transformation
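To make the split concrete, here is a minimal sketch that loads one representative of each branch with the Hugging Face `transformers` library (assumed to be installed along with PyTorch); the checkpoint names are common public ones chosen purely for illustration.

```python
# One representative model from each branch, loaded via Hugging Face Transformers.
from transformers import (
    AutoModel,              # encoder-only backbone (e.g., BERT)
    AutoModelForCausalLM,   # decoder-only language model (e.g., GPT-2)
    AutoModelForSeq2SeqLM,  # encoder-decoder model (e.g., T5)
)

encoder_only = AutoModel.from_pretrained("bert-base-uncased")        # understanding
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")          # generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # seq-to-seq

for name, model in [("BERT", encoder_only), ("GPT-2", decoder_only), ("T5", encoder_decoder)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```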
BERT Architecture Variants
- BERT-base: 12 transformer layers, 12 attention heads, 768 hidden dimensions (110M parameters)
- BERT-large: 24 transformer layers, 16 attention heads, 1024 hidden dimensions (340M parameters)
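As a sanity check on these figures, the sketch below builds both configurations from scratch with `transformers`' `BertConfig` and counts parameters; no pretrained weights are downloaded, and the `intermediate_size` values are the standard 4x-hidden-size feed-forward choice.

```python
# Build randomly initialized BERT-base and BERT-large models and count parameters.
from transformers import BertConfig, BertModel

configs = {
    "BERT-base": BertConfig(
        num_hidden_layers=12, num_attention_heads=12,
        hidden_size=768, intermediate_size=3072,
    ),
    "BERT-large": BertConfig(
        num_hidden_layers=24, num_attention_heads=16,
        hidden_size=1024, intermediate_size=4096,
    ),
}

for name, cfg in configs.items():
    model = BertModel(cfg)  # random weights; nothing is downloaded
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters")
```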
BERT's Impact and Applications
BERT excels in a wide range of understanding tasks:
- Text classification
- Named entity recognition
- Question answering
- Sentiment analysis
- Natural language inference
The Fine-tuning Paradigm
BERT introduced a new two-step approach that has become standard:
- Pre-training on vast amounts of unlabeled text using self-supervised objectives
- Fine-tuning the pre-trained model on specific downstream tasks with labeled data
This approach dramatically reduced the amount of task-specific labeled data needed.
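A minimal sketch of the two steps, assuming the Hugging Face `transformers` library and PyTorch: pre-trained BERT weights are loaded, a fresh classification head is attached, and the model is fine-tuned on a toy labeled batch that stands in for a real downstream dataset.

```python
# Step 1: load pre-trained weights; Step 2: fine-tune on a (toy) labeled dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds a fresh classification head
)

texts = ["great movie, loved it", "what a waste of time"]  # stand-in labeled data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few passes over the toy batch
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.3f}")
```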
RoBERTa: Robustly Optimized BERT Approach
RoBERTa, introduced by Facebook AI in 2019, showed that BERT was significantly undertrained. It maintains BERT's architecture but introduces several training improvements.
RoBERTa's Improvements Over BERT
- More data and longer training: Using 10 times more data and computing power
- Larger batches: 8K vs. 256 examples per batch
- Dynamic masking: Generating new masked patterns every time a sequence is encountered
- Removing NSP: Focusing only on the masked language modeling task
- Longer sequences: Training on sequences of up to 512 tokens
These seemingly minor changes led to significantly better performance, highlighting the importance of training methodology.
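Dynamic masking is easy to see in code. The sketch below uses `transformers`' `DataCollatorForLanguageModeling`, which re-samples masked positions every time a batch is assembled, so the same sentence is masked differently on each pass; the sentence and masking probability are illustrative.

```python
# The collator re-samples masked positions every time it is called,
# so the same sentence gets a different mask pattern on each pass.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

encoded = dict(tokenizer("The quick brown fox jumps over the lazy dog."))
for epoch in range(3):
    batch = collator([encoded])  # fresh random mask pattern each call
    print(f"epoch {epoch}:", tokenizer.decode(batch["input_ids"][0]))
```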
| Aspect | BERT | RoBERTa |
|---|---|---|
| Training Data | 16GB (BookCorpus + Wikipedia) | 160GB (Including CC-News, OpenWebText, Stories) |
| Batch Size | 256 sequences | 8,000 sequences |
| Training Steps | 1,000,000 steps | 500,000 steps (but larger batches) |
| Masking Strategy | Static (masked once during preprocessing) | Dynamic (masked differently each epoch) |
| Pre-training Tasks | MLM + NSP | MLM only |
| Max Sequence Length | 512 tokens (most pre-training at 128) | 512 tokens throughout training |
| GLUE Benchmark | 82.2% | 88.5% |
Other Notable Encoder-Only Innovations
- ALBERT: Parameter reduction techniques (shared layers, factorized embedding)
- DistilBERT: Knowledge distillation for a smaller, faster model
- DeBERTa: Disentangled attention mechanism and enhanced mask decoder
- ELECTRA: Replaced MLM with a more efficient token detection objective
Decoder-Only Models: Generating Language
GPT: Generative Pre-trained Transformer
The GPT family, beginning with the original GPT released by OpenAI in 2018, showcased the power of the transformer decoder for text generation.
Key Characteristics of GPT Models
- Autoregressive generation: Models the probability of a token given previous tokens
- Unidirectional attention: Each token can only attend to previous tokens (causal attention)
- Generative capabilities: Optimized for producing coherent, fluent text
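The following sketch makes the autoregressive, causal setup explicit with GPT-2 (assuming `transformers` and PyTorch): at each step the model conditions only on the tokens generated so far, and the most likely next token is appended greedily.

```python
# Greedy autoregressive decoding with GPT-2: predict, append, repeat.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                            # generate 10 new tokens
        logits = model(input_ids).logits           # [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)  # most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```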
The GPT Evolution: Demonstrating Scaling Laws
GPT-2 showed that scaling up the model (from 117M to 1.5B parameters) and training data led to surprising emergent abilities:
- Better long-range coherence
- Improved factual knowledge
- Ability to perform simple reasoning
GPT-3: Emergence of Few-Shot Learning
GPT-3 (175B parameters) demonstrated a remarkable new capability: few-shot learning through in-context examples.
| Input Example | Expected Output | Model Response |
|---|---|---|
| I loved this movie, it was fantastic! | Positive | Positive (94% confidence) |
| Terrible service and the food was cold. | Negative | Negative (97% confidence) |
| The experience was neither good nor bad. | Neutral | Neutral (88% confidence) |
| The concert exceeded all my expectations, what a night! | ? | Positive (96% confidence) |
Note: In this few-shot learning demonstration, the model is shown examples 1-3 in its prompt and then predicts the sentiment of example 4 without any parameter updates.
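In code, few-shot prompting is just string construction: the labeled examples become part of the input, and the model is asked to continue the text. The sketch below assembles such a prompt; the instruction wording is an illustrative choice, and in practice the prompt would be sent to a large model such as GPT-3.

```python
# Assemble a few-shot prompt: labeled examples in-context, then the query to complete.
examples = [
    ("I loved this movie, it was fantastic!", "Positive"),
    ("Terrible service and the food was cold.", "Negative"),
    ("The experience was neither good nor bad.", "Neutral"),
]
query = "The concert exceeded all my expectations, what a night!"

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the model continues from here

print(prompt)
```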
The Impact of Scaling Laws
Research by Kaplan et al. (2020) revealed predictable scaling laws in language models that fundamentally changed how we think about model development:
- Power Law Relationship: Loss falls as a power law of model size, so each 10x increase in parameters yields a roughly constant proportional improvement, with diminishing absolute gains
- Measurable Improvements: Language model loss decreases from 2.5 (1M parameters) to 1.1 (1T parameters), a 56% relative improvement
- Predictable Scaling: This relationship allows researchers to predict performance gains from increasing model size
This discovery enabled researchers to make strategic trade-offs between model size, dataset size, and compute resources, leading to the rapid evolution of increasingly capable language models.
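The sketch below plugs approximate constants from Kaplan et al. (2020) into the parameter-count power law; the exact constants vary across fits, so treat the printed values as illustrating the shape of the trend rather than reproducing the figures quoted above.

```python
# Power-law form L(N) = (N_c / N) ** alpha_N with approximate Kaplan et al. constants.
ALPHA_N = 0.076   # fitted exponent for the parameter-count scaling law
N_C = 8.8e13      # fitted constant (non-embedding parameters)

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Each 10x increase in parameters multiplies the loss by the same constant factor.
print(f"loss multiplier per 10x parameters: {10 ** -ALPHA_N:.3f}")

for n in [1e6, 1e9, 1e12]:
    print(f"{n:.0e} parameters -> predicted loss {predicted_loss(n):.2f}")
```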
T5 Variants and Training
T5 (the Text-to-Text Transfer Transformer, introduced by Google Research in 2019) casts every NLP task as text-to-text generation. Its training procedure was chosen through extensive ablations:
- T5-Small to T5-11B: A range of model sizes from 60M to 11B parameters
- Extensive pre-training: On the large C4 (Colossal Clean Crawled Corpus)
- Multiple objectives tested: Vanilla language modeling, corrupted span prediction, etc.
The final T5 approach used a form of span corruption where randomly selected spans of text were replaced with sentinel tokens that the model had to reconstruct.
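The toy function below reproduces that format on the example sentence used in the T5 paper; span selection is hard-coded here for clarity, whereas T5 samples spans randomly, corrupting roughly 15% of tokens with a mean span length of 3.

```python
# Replace chosen spans with sentinel tokens; the target spells out what each sentinel hid.
def span_corrupt(tokens, spans):
    """spans: list of (start, end) token index pairs to remove from the input."""
    inp, tgt, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            tgt.extend(tokens[span[0]:span[1]])
            sentinel += 1
            i = span[1]
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append(f"<extra_id_{sentinel}>")  # final sentinel marks the end of the target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print("input :", inp)   # Thank you <extra_id_0> me to your party <extra_id_1> week
print("target:", tgt)   # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```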
BART: Bidirectional and Auto-Regressive Transformers
BART, introduced by Facebook AI in 2019, combines the bidirectional encoding of BERT with the autoregressive decoding of GPT.
BART's Innovative Pre-training
BART is pre-trained by:
- Corrupting documents with an arbitrary noising function
- Learning to reconstruct the original document
This allowed BART to explore various noising approaches:
- Token masking (like BERT)
- Token deletion
- Text infilling (multiple tokens replaced with a single mask)
- Sentence permutation
- Document rotation
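The toy functions below mimic a few of these noising operations on whitespace tokens rather than real subwords, purely to show the kinds of corruption the model learns to undo; the probabilities and sentence splitting are simplifications.

```python
# Toy versions of three BART noising functions, applied to whitespace tokens.
import random

random.seed(0)
doc = "The cat sat on the mat . It was warm . The dog barked ."
tokens = doc.split()

def token_deletion(toks, p=0.15):
    """Drop each token with probability p; the model must infer what is missing and where."""
    return [t for t in toks if random.random() > p]

def sentence_permutation(text):
    """Shuffle sentence order; the model must restore the original ordering."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return " . ".join(sentences) + " ."

def document_rotation(toks):
    """Start the document at a random token; the model must identify the true start."""
    k = random.randrange(len(toks))
    return toks[k:] + toks[:k]

print("deleted :", " ".join(token_deletion(tokens)))
print("permuted:", sentence_permutation(doc))
print("rotated :", " ".join(document_rotation(tokens)))
```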
BART's Flexibility
BART excels at a diverse set of tasks:
- Sequence classification
- Token classification
- Sequence generation
- Machine translation
Comparing the Three Paradigms
| Architecture | Pre-training Objective | Strengths | Weaknesses | Exemplar Models | Best For |
|---|---|---|---|---|---|
| Encoder-Only | Masked Language Modeling | Strong understanding of context and relationships | Limited generation capability | BERT, RoBERTa, DeBERTa | Classification, NER, Sentiment Analysis |
| Decoder-Only | Autoregressive Language Modeling | Excellent text generation, emergent abilities at scale | Unidirectional context limits some understanding tasks; less natural fit for seq2seq tasks | GPT, GPT-2, GPT-3 | Open-ended generation, dialogue, creative writing |
| Encoder-Decoder | Span corruption, denoising | Versatile, strong at sequence transformation tasks | More complex architecture, higher computational requirements | T5, BART, UL2 | Translation, Summarization, Question Answering |
Foundational Innovations Beyond the Basics
Parameter Efficiency Techniques
As models grew larger, researchers developed methods to make them more efficient:
- Parameter Sharing: ALBERT reduced parameters by sharing weights across layers
- Low-Rank Approximations: Compressing weight matrices with matrix factorization
- Knowledge Distillation: Training smaller "student" models to mimic larger "teacher" models
- Quantization: Reducing numerical precision without sacrificing significant performance
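As one concrete example, the sketch below shows a generic soft-target distillation loss of the kind DistilBERT uses: the student is trained to match the teacher's temperature-softened output distribution. Random logits stand in for real model outputs, and the temperature value is an illustrative choice.

```python
# Soft-target distillation loss: the student matches the teacher's softened distribution.
import torch
import torch.nn.functional as F

temperature = 2.0
teacher_logits = torch.randn(4, 30522)                       # stand-in teacher outputs
student_logits = torch.randn(4, 30522, requires_grad=True)   # stand-in student outputs

soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
log_student = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence between softened distributions, scaled by T^2 as is conventional.
distill_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2
distill_loss.backward()
print(f"distillation loss: {distill_loss.item():.3f}")
```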
Attention Mechanism Improvements
The core attention mechanism also evolved during this foundational period:
- Sparse Attention (Longformer, BigBird): Attending to select tokens rather than all
- Linear Attention (Linformer, Performer): Reducing complexity from O(n²) to O(n)
- Local+Global Attention (Longformer, BigBird): Combining local context with global tokens
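To illustrate the local-attention idea, the sketch below builds a sliding-window mask in the spirit of Longformer: each token may attend only to neighbors within a fixed window, so the number of attended pairs grows as O(n·w) rather than O(n²). The sequence length and window size are arbitrary illustrative values.

```python
# Sliding-window attention mask: token i may attend to token j only if |i - j| <= window.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask that is True where attention is allowed."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = local_attention_mask(seq_len=8, window=2)
print(mask.int())
print(f"attended pairs: {mask.sum().item()} of {8 * 8} (full attention)")
```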