Overview
In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. These models revolutionized NLP by capturing semantic relationships between words. However, they share a fundamental limitation: they assign the same vector to a word regardless of its context.
This lesson introduces contextual embeddings - dynamic representations that adapt based on surrounding words, enabling machines to understand the nuanced, context-dependent nature of human language.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why static embeddings fail with polysemous words
- Explain how contextual models like ELMo and BERT solve this limitation
- Recognize the key architectural innovations that enable context-sensitivity
- Compare different contextual embedding approaches and their trade-offs
- Apply contextual embeddings to real-world NLP tasks
Conceptual Introduction: The Chameleon Word Problem
Real-World Analogy: The Multi-Talented Actor
Imagine an actor who plays completely different characters:
- Character A: A serious bank president in a drama
- Character B: A muddy river guide in an adventure film
- Character C: A pilot banking an airplane in an action movie
The same actor embodies entirely different personas depending on the context (script, other actors, setting). Traditional word embeddings are like having only one headshot photo to represent this actor - it captures their appearance but misses the rich variety of roles they can play.
Contextual embeddings are like having a different photo for each performance, showing how the actor adapts to each role while maintaining their core identity.
Visualizing the Static Embedding Limitation
First, consider how traditional embeddings work in vector space: each word has a single fixed position regardless of context.
What to observe:
- How different contexts reveal completely different meanings for the same word
- Why a single vector representation cannot capture this diversity
- The challenge this creates for downstream NLP tasks
This fixed-position behavior shows exactly why the NLP field needed a revolutionary approach to word representation.
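To make the limitation concrete, here is a minimal sketch using gensim's pre-trained GloVe vectors (the gensim library and the "glove-wiki-gigaword-50" model are tooling assumptions, not part of the original lesson). Because a static model is a pure table lookup, "bank" maps to exactly the same vector whether the sentence is about finance or rivers.

```python
# Minimal sketch: a static embedding is a fixed lookup table.
# Assumes gensim is installed; the model is downloaded on first use.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # small pre-trained GloVe vectors

sentence_a = "I deposited cash at the bank"          # financial context
sentence_b = "We fished from the bank of the river"  # geographic context

# The lookup depends only on the word itself, never on its neighbors,
# so both sentences receive the identical vector for "bank".
vector_a = glove["bank"]
vector_b = glove["bank"]
print((vector_a == vector_b).all())  # True: one fixed vector per word
```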
Theoretical Foundation: The Contextual Revolution
ELMo: The First Breakthrough
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major success in contextual embeddings.
Core Innovation
ELMo generates word representations by:
- Training bidirectional LSTM language models on large text corpora
- Using all internal states of the LSTMs, not just the final output
- Creating weighted combinations of representations from different layers
Mathematical Formulation
For a token $k$ in a sequence, ELMo creates a task-specific representation as a weighted combination of all biLM layers:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$

Where:
- $\mathbf{h}_{k,j}^{LM}$ = contextual representation of token $k$ from layer $j$ (layer 0 is the context-independent token embedding)
- $s_j^{task}$ = softmax-normalized learned weights
- $\gamma^{task}$ = global scaling parameter
- $L$ = number of LSTM layers
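The weighted combination above is easy to express in code. The following is a minimal PyTorch sketch of just the layer-mixing step, using random tensors as stand-ins for real biLM hidden states; it illustrates the formula rather than implementing an actual ELMo model.

```python
# Minimal sketch of ELMo's weighted layer combination ("scalar mix").
# The layer states below are random stand-ins, not real biLM outputs.
import torch
import torch.nn.functional as F

L = 2           # number of biLSTM layers
seq_len = 3     # toy sequence length
hidden_dim = 4  # toy representation size

# layer_states[j] holds one vector per token at layer j
# (index 0 stands in for the context-independent token embedding layer)
layer_states = [torch.randn(seq_len, hidden_dim) for _ in range(L + 1)]

s = torch.nn.Parameter(torch.zeros(L + 1))  # learned weights, softmax-normalized below
gamma = torch.nn.Parameter(torch.ones(1))   # global scaling parameter

weights = F.softmax(s, dim=0)               # the s_j in the formula
elmo_repr = gamma * sum(w * h for w, h in zip(weights, layer_states))

print(elmo_repr.shape)  # torch.Size([3, 4]): one contextual vector per token
```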
Layer Specialization Discovery
ELMo revealed that different layers capture different linguistic information (a short probing sketch follows this list):
- Lower layers: Syntactic information (part-of-speech, morphology)
- Higher layers: Semantic information (word sense, context-specific meaning)
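A simple way to explore this layer specialization yourself is to extract per-layer hidden states and probe them for different linguistic properties. The sketch below does this with a transformer model through the Hugging Face transformers library; the choice of bert-base-uncased is an illustrative assumption, not the biLSTM setup ELMo itself used.

```python
# Minimal sketch: retrieving per-layer hidden states for probing.
# Uses bert-base-uncased as an illustrative model, not ELMo's biLSTM.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The bank approved my loan.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: (embedding layer, layer 1, ..., layer 12)
hidden_states = outputs.hidden_states
print(f"Layers returned: {len(hidden_states)}")            # 13 for bert-base
print(f"Shape per layer: {tuple(hidden_states[0].shape)}") # (1, num_tokens, 768)
```

Lower and higher layers can then be compared by training small classifiers (probes) on these states for syntactic versus semantic prediction tasks.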
BERT: The Transformer Revolution
BERT (Bidirectional Encoder Representations from Transformers) by Devlin et al. (2018) represented a quantum leap by replacing LSTMs with transformer architecture.
Key Architectural Innovations
- Bidirectional Attention: Unlike sequential models, words attend to both left and right context simultaneously
- Self-Attention Mechanism: Each word can directly attend to any other word in the sequence (see the sketch after this list)
- Deep Contextualization: 12 (BERT-base) to 24 (BERT-large) layers of stacked self-attention
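To see what it means for every word to attend to every other word, here is a minimal sketch of single-head scaled dot-product self-attention with toy, randomly initialized projections; real BERT layers add multiple heads, residual connections, and layer normalization on top of this core operation.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# Projections are random toy weights, not trained parameters.
import math
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)    # one representation per token

W_q = torch.randn(d_model, d_model)  # query projection
W_k = torch.randn(d_model, d_model)  # key projection
W_v = torch.randn(d_model, d_model)  # value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token scores every other token, left and right context alike
scores = Q @ K.T / math.sqrt(d_model)
attn = F.softmax(scores, dim=-1)     # each row sums to 1
contextualized = attn @ V            # context-mixed token representations

print(attn.shape)            # torch.Size([5, 5]): token-to-token attention weights
print(contextualized.shape)  # torch.Size([5, 8])
```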
Contextual embeddings in practice
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def get_contextual_embeddings(text_list, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """Extract contextual embeddings from text using BERT-based models."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize input texts
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')

    # Generate embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean pooling over token embeddings
    token_embeddings = model_output[0]
    attention_mask = encoded_input['attention_mask']
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sentence_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings

# Example: Context-sensitive similarity
sentences = [
    "The bank approved my loan application.",    # Financial context
    "We sat on the bank of the river.",          # Geographic context
    "I need to bank this money quickly.",        # Action context
    "The investment bank offers great returns."  # Financial context
]

embeddings = get_contextual_embeddings(sentences)

# Calculate similarities
similarities = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:])

print("Similarities with 'The bank approved my loan application.':")
for i, sim in enumerate(similarities):
    print(f"  → '{sentences[i+1]}': {sim:.3f}")
```
Zero-Shot Classification with Contextual Models
```python
from transformers import pipeline

# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example: Context-dependent classification
text = "The river bank was eroding due to heavy rainfall."
candidate_labels = ["finance", "geography", "action", "institution"]

result = classifier(text, candidate_labels)

print(f"Text: {text}")
print("\nClassification results:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.3f}")
```
Connections: Comparative Analysis of Embedding Approaches
Now that we understand the theory and implementation, let's interactively compare how different embedding approaches handle the same linguistic challenges: