Overview
In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. These models revolutionized NLP by capturing semantic relationships between words. However, they share a fundamental limitation: they assign the same vector to a word regardless of its context.
This lesson introduces contextual embeddings - dynamic representations that adapt based on surrounding words, enabling machines to understand the nuanced, context-dependent nature of human language.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why static embeddings fail with polysemous words
- Explain how contextual models like ELMo and BERT solve this limitation
- Recognize the key architectural innovations that enable context-sensitivity
- Compare different contextual embedding approaches and their trade-offs
- Apply contextual embeddings to real-world NLP tasks
Conceptual Introduction: The Chameleon Word Problem
Real-World Analogy: The Multi-Talented Actor
Imagine an actor who plays completely different characters:
- Character A: A serious bank president in a drama
- Character B: A muddy river guide in an adventure film
- Character C: A pilot banking an airplane in an action movie
The same actor embodies entirely different personas depending on the context (script, other actors, setting). Traditional word embeddings are like having only one headshot photo to represent this actor - it captures their appearance but misses the rich variety of roles they can play.
Contextual embeddings are like having a different photo for each performance, showing how the actor adapts to each role while maintaining their core identity.
Visualizing the Static Embedding Limitation
First, let's visualize how traditional embeddings work in vector space. Notice how each word has a fixed position regardless of context:
Key Observation: Each word occupies exactly one point in this space, which creates a fundamental problem...
The Linguistic Problem
Consider these sentences:
- "I'll bank the money" (financial institution)
- "I'll bank the fire" (cover with ashes)
- "I sat by the river bank" (edge of water)
Traditional embeddings assign identical vectors to "bank" in all contexts, collapsing distinct meanings into a single representation. This creates fundamental limitations in understanding and reasoning.
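A toy sketch makes this concrete: a static model is essentially a fixed lookup table, so every occurrence of "bank" retrieves the same vector no matter which sentence it appears in. The vectors below are made up purely for illustration.

```python
import numpy as np

# Toy static embedding table (illustrative values, not a real trained model)
static_embeddings = {
    "bank": np.array([0.21, -0.53, 0.88]),
    "river": np.array([0.10, 0.77, -0.31]),
    "money": np.array([0.45, -0.60, 0.52]),
}

sentences = [
    "I'll bank the money",
    "I'll bank the fire",
    "I sat by the river bank",
]

# A static model looks up the same vector for "bank" in every sentence
for sentence in sentences:
    print(sentence, "->", static_embeddings["bank"])
```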
Interactive Exploration: Experiencing the Word Sense Problem
Let's explore this limitation interactively. The following tool demonstrates how traditional embeddings handle polysemous words (words with multiple meanings):
What to observe:
- How different contexts reveal completely different meanings for the same word
- Why a single vector representation cannot capture this diversity
- The challenge this creates for downstream NLP tasks
This visualization shows exactly why the NLP field needed a revolutionary approach to word representation.
Theoretical Foundation: The Contextual Revolution
ELMo: The First Breakthrough
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major success in contextual embeddings.
Core Innovation
ELMo generates word representations by:
- Training bidirectional LSTM language models on large text corpora
- Using all internal states of the LSTMs, not just the final output
- Creating weighted combinations of representations from different layers
Mathematical Formulation
For word $k$ in a sequence, ELMo creates:

$$\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$

Where:
- $\mathbf{h}_{k,j}^{LM}$ = contextual representation of token $k$ from layer $j$
- $s_j^{task}$ = softmax-normalized learned weights
- $\gamma^{task}$ = global scaling parameter
- $L$ = number of LSTM layers
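This weighted combination is often called a "scalar mix". The following minimal PyTorch sketch shows the computation with made-up tensor shapes; a real ELMo would obtain `layer_outputs` from its pre-trained bidirectional LSTM language model.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: L+1 layers (including the token layer), 5 tokens, dim 8
num_layers, seq_len, dim = 3, 5, 8
layer_outputs = torch.randn(num_layers, seq_len, dim)  # h_{k,j} for each token k, layer j

s = torch.nn.Parameter(torch.zeros(num_layers))  # learned scalars, one per layer
gamma = torch.nn.Parameter(torch.ones(1))        # task-specific scaling gamma

weights = F.softmax(s, dim=0)                    # softmax-normalized layer weights s_j
# Weighted sum over layers, then global scaling: gamma * sum_j s_j * h_{k,j}
elmo_repr = gamma * (weights.view(-1, 1, 1) * layer_outputs).sum(dim=0)
print(elmo_repr.shape)  # (seq_len, dim): one context-dependent vector per token
```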
Layer Specialization Discovery
ELMo revealed that different layers capture different linguistic information:
- Lower layers: Syntactic information (part-of-speech, morphology)
- Higher layers: Semantic information (word sense, context-specific meaning)
BERT: The Transformer Revolution
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. in 2018, took the next major leap by replacing LSTMs with the transformer architecture.
Key Architectural Innovations
- Bidirectional Attention: Unlike sequential models, words attend to both left and right context simultaneously
- Self-Attention Mechanism: Each word can directly attend to any other word in the sequence
- Deep Contextualization: 12 layers (BERT-base) to 24 layers (BERT-large) of stacked self-attention
Attention Patterns in Practice
The interactive attention pattern visualizer illustrates the difference between self-attention and cross-attention, using the example sentence "The cat sat on the mat" to show how self-attention works.
Quick Comparison
Self-Attention
- Q, K, V come from the same sequence
- Tokens attend to each other
- Builds contextual representations
- Used in: BERT, GPT, encoder/decoder layers
Cross-Attention
- Q comes from the decoder; K and V come from the encoder
- The decoder attends to the encoder
- Connects two sequences
- Used in: T5, BART, translation models
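As a rough sketch of this difference, the snippet below reuses a single PyTorch `nn.MultiheadAttention` module for both cases purely for illustration (in real models, self- and cross-attention layers have their own learned weights); the only change is where the query, key, and value tensors come from.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

encoder_states = torch.randn(1, 6, embed_dim)   # e.g. "The cat sat on the mat"
decoder_states = torch.randn(1, 3, embed_dim)   # partial target sequence

# Self-attention: Q, K, V all come from the same sequence
self_out, _ = attn(encoder_states, encoder_states, encoder_states)

# Cross-attention: Q from the decoder, K and V from the encoder
cross_out, _ = attn(decoder_states, encoder_states, encoder_states)

print(self_out.shape)   # (1, 6, 16) - one updated vector per encoder token
print(cross_out.shape)  # (1, 3, 16) - decoder tokens enriched with encoder info
```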
Pre-training Objectives
BERT uses two clever training tasks:
1. Masked Language Modeling (MLM)
- Randomly mask 15% of input tokens
- Predict masked tokens using bidirectional context
- Forces the model to use full context for understanding
2. Next Sentence Prediction (NSP)
- Given two sentences, predict if the second logically follows the first
- Teaches inter-sentence relationships
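To see masked language modeling in action, you can query a pre-trained BERT through the Hugging Face fill-mask pipeline; the model choice and sentences below are just illustrative examples.

```python
from transformers import pipeline

# BERT uses context on BOTH sides of the [MASK] token to predict the missing word
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

print(fill_mask("I deposited money at the [MASK].")[0]["token_str"])
print(fill_mask("I sat on the grassy [MASK] of the river.")[0]["token_str"])
```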
Mathematical Foundation
BERT's self-attention mechanism computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$

Where for each token:
- $Q$ (Query): "What am I looking for?"
- $K$ (Key): "What do I represent?"
- $V$ (Value): "What information do I provide?"
This allows each word to dynamically gather relevant information from its entire context.
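As a concrete reference, here is a minimal single-head sketch of that formula in PyTorch. Tensor shapes are illustrative; real BERT first projects the hidden states into Q, K, and V with learned weight matrices and runs many heads in parallel.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)             # attention distribution over tokens
    return weights @ V                              # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional representations attending to each other
x = torch.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([4, 8])
```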
Beyond BERT: Modern Developments
RoBERTa: Optimized Training
- Longer training with larger datasets
- Removed Next Sentence Prediction task
- Dynamic masking patterns
- Result: Significant performance improvements
Multimodal Evolution: CLIP
CLIP (Contrastive Language-Image Pre-training) extends contextual embeddings to connect text and images in a unified representation space, enabling:
- Zero-shot image classification using text descriptions
- Cross-modal retrieval and understanding
- Foundation for modern AI systems like DALL-E and GPT-4V
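Here is a minimal sketch of zero-shot image classification with a public CLIP checkpoint via the Hugging Face CLIP classes; the blank placeholder image stands in for a real photo.

```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; use Image.open("photo.jpg") in practice
labels = ["a photo of a river bank", "a photo of a bank building"]

# Score the image against each text description in the shared embedding space
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```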
Implementation: Working with Contextual Embeddings
Using Pre-trained BERT for Semantic Similarity
Quick SBERT similarity (CPU)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def get_contextual_embeddings(text_list, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """Extract contextual sentence embeddings using a BERT-based model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings (masking out padding), then L2-normalize
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    return F.normalize(embeddings, p=2, dim=1)
```
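With the helper sketched above, sentence similarity reduces to a dot product, since the embeddings are L2-normalized:

```python
sentences = [
    "I deposited money at the bank.",
    "The bank approved my loan application.",
    "We had a picnic on the river bank.",
]
emb = get_contextual_embeddings(sentences)

# Dot products of normalized vectors are cosine similarities
similarities = emb @ emb.T
print(similarities)  # the two financial sentences should score highest together
```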
Zero-Shot Classification with Contextual Models
```python
from transformers import pipeline

# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example: Context-dependent classification
text = "The river bank was eroding due to heavy rainfall."
candidate_labels = ["finance", "geography", "action", "institution"]
result = classifier(text, candidate_labels)

# Labels come back sorted by score; the top label should reflect the geographic sense of "bank"
print(result["labels"][0], round(result["scores"][0], 3))
```
Connections: Comparative Analysis of Embedding Approaches
Now that we understand the theory and implementation, let's interactively compare how different embedding approaches handle the same linguistic challenges:
Key Insights to Explore:
- Static vs. Contextual: How Word2Vec and GloVe compare to BERT on ambiguous words
- Performance Trade-offs: Speed vs. accuracy across different models
- Use Case Matching: When to choose each approach for specific applications
- Evolution Timeline: The progression from simple to sophisticated representations
Cross-Domain Connections
The contextual embedding revolution parallels developments in other AI domains:
- Computer Vision: From fixed features (SIFT, HOG) to contextual features (Vision Transformers)
- Speech Recognition: From phoneme-based to contextual acoustic models
- Recommendation Systems: From static user profiles to dynamic, context-aware preferences
Modern Landscape: State-of-the-Art Models
MTEB Leaderboard Leaders
| Model | MTEB Score | Key Innovation |
|---|---|---|
| E5-large | 65.3 | Advanced contrastive learning with hard negatives |
| BGE-Large | 64.5 | Custom mining strategies for training data |
| GTE-Large | 63.7 | Curriculum learning approach to embedding quality |
MTEB (Massive Text Embedding Benchmark) evaluates models across 8 embedding task categories.
Why Contextual Embeddings Excel
1. Word Sense Disambiguation: Different vectors for different meanings
2. Compositional Understanding: Better phrase and sentence representations
3. Transfer Learning: Pre-trained representations adapt to new domains
4. Reduced Data Requirements: Leverage large-scale pre-training for small tasks
Practice: Applied Exercises
Exercise 1: Context Sensitivity Analysis
Compare how traditional and contextual embeddings handle these word pairs:
- "apple" (fruit) vs. "Apple" (company)
- "run" (jog) vs. "run" (execute program)
- "patient" (medical) vs. "patient" (waiting calmly)
Exercise 2: Domain Adaptation
Use BERT embeddings to build a semantic search system for:
- Scientific papers (using SciBERT)
- Legal documents (using Legal-BERT)
- Medical texts (using BioBERT)
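A minimal starting point, reusing the get_contextual_embeddings helper sketched earlier with a public SciBERT checkpoint (results will be rough without task-specific fine-tuning):

```python
import torch

corpus = [
    "Transformers achieve state-of-the-art results on language benchmarks.",
    "The mitochondria is the powerhouse of the cell.",
    "Gradient descent minimizes a loss function iteratively.",
]
query = "How are neural networks optimized?"
model_name = "allenai/scibert_scivocab_uncased"

# For simplicity the model is reloaded on each call; cache it in real use
corpus_emb = get_contextual_embeddings(corpus, model_name=model_name)
query_emb = get_contextual_embeddings([query], model_name=model_name)

scores = (query_emb @ corpus_emb.T)[0]        # cosine similarity to each document
best = torch.argmax(scores).item()
print(corpus[best], scores[best].item())
```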
Exercise 3: Cross-Lingual Understanding
Explore how multilingual BERT handles:
- Word translations across languages
- Cross-lingual semantic similarity
- Code-switching in multilingual texts
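A small starting point for the cross-lingual case, again reusing the earlier helper with a public multilingual sentence-embedding checkpoint:

```python
# Each pair contains an English sentence and a rough translation
pairs = [
    ("The weather is nice today.", "Il fait beau aujourd'hui."),
    ("I love reading books.", "Me encanta leer libros."),
]
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

for english, other in pairs:
    emb = get_contextual_embeddings([english, other], model_name=model_name)
    # Dot product of normalized vectors = cross-lingual cosine similarity
    print(english, "<->", other, (emb[0] @ emb[1]).item())
```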
Summary
In this lesson, we've explored the revolutionary shift from static to contextual embeddings:
- The Problem: Static embeddings cannot capture context-dependent word meanings
- The Solution: Contextual models like ELMo and BERT that adapt representations based on surrounding context
- The Architecture: Bidirectional attention mechanisms and sophisticated pre-training objectives
- The Impact: Dramatic improvements across virtually all NLP tasks
- The Applications: From semantic search to zero-shot classification
Contextual embeddings represent one of the most significant breakthroughs in NLP history, laying the foundation for modern language models like GPT and ChatGPT.
In our next lesson, we'll explore the pre-transformer architectures (RNNs, LSTMs, GRUs) that paved the way for these contextual breakthroughs, understanding how the field evolved from sequential processing to parallel attention mechanisms.