Contextual Embeddings and Modern Representations

Overview

In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. These models revolutionized NLP by capturing semantic relationships between words. However, they share a fundamental limitation: they assign the same vector to a word regardless of its context.

This lesson introduces contextual embeddings - dynamic representations that adapt based on surrounding words, enabling machines to understand the nuanced, context-dependent nature of human language.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand why static embeddings fail with polysemous words
  • Explain how contextual models like ELMo and BERT solve this limitation
  • Recognize the key architectural innovations that enable context-sensitivity
  • Compare different contextual embedding approaches and their trade-offs
  • Apply contextual embeddings to real-world NLP tasks

Conceptual Introduction: The Chameleon Word Problem

Real-World Analogy: The Multi-Talented Actor

Imagine an actor who plays completely different characters:

  • Character A: A serious bank president in a drama
  • Character B: A muddy river guide in an adventure film
  • Character C: A pilot banking an airplane in an action movie

The same actor embodies entirely different personas depending on the context (script, other actors, setting). Traditional word embeddings are like having only one headshot photo to represent this actor - it captures their appearance but misses the rich variety of roles they can play.

Contextual embeddings are like having a different photo for each performance, showing how the actor adapts to each role while maintaining their core identity.

Visualizing the Static Embedding Limitation

First, let's visualize how traditional embeddings work in vector space. Notice how each word has a fixed position regardless of context:

Vector Space Visualizer

Explore word embeddings in 2D space and discover semantic relationships

Selected words: king, queen, man, woman

PCA Projection

(2D scatter of the selected words along Dimension 1 and Dimension 2, with two visible clusters: "royalty" and "people".)

Word Distances

| Word pair | Distance | Similarity |
|---|---|---|
| king ↔ queen | 44.7 | 0.959 |
| king ↔ man | 89.4 | 0.982 |
| king ↔ woman | 100.5 | 0.937 |
| queen ↔ man | 100.0 | 0.906 |
| queen ↔ woman | 85.4 | 0.987 |
| man ↔ woman | 53.9 | 0.903 |

About PCA

PCA reduces high-dimensional embeddings to 2D while preserving maximum variance. Points close together are similar in the original space.
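To make the projection concrete, here is a minimal sketch of how such a 2D view can be produced with scikit-learn; the embedding matrix below uses random values as stand-ins for real 300-dimensional Word2Vec or GloVe vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 4 x 300 matrix of static word vectors (random stand-ins for
# the real embeddings of "king", "queen", "man", "woman").
words = ["king", "queen", "man", "woman"]
vectors = np.random.default_rng(0).normal(size=(4, 300))

# Project the 300-dimensional vectors down to 2D, preserving maximum variance.
coords = PCA(n_components=2).fit_transform(vectors)

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```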

How to Use

  • Click words to add/remove them from the visualization
  • Toggle labels and connections to explore relationships
  • Try different visualization methods to see various perspectives
  • Words closer together are more semantically similar
  • Colors represent semantic clusters (animals, emotions, etc.)

Key Observation: Each word occupies exactly one point in this space, which creates a fundamental problem...

The Linguistic Problem

Consider these sentences:

  • "I'll bank the money" (financial institution)
  • "I'll bank the fire" (cover with ashes)
  • "I sat by the river bank" (edge of water)

Traditional embeddings assign identical vectors to "bank" in all contexts, collapsing distinct meanings into a single representation. This creates fundamental limitations in understanding and reasoning.

Interactive Exploration: Experiencing the Word Sense Problem

Let's explore this limitation interactively. The following tool demonstrates how traditional embeddings handle polysemous words (words with multiple meanings):

Word Sense Disambiguation

Explore how context determines the meaning of polysemous words in embeddings

Financial Institution

A place where money is kept and financial services are provided

Related Words:
money, account, loan

River Bank

The land alongside a river or stream

Related Words:
river, water, shore

Sloped Ground

A slope or inclined area of land

Related Words:
slope, hill, incline

Understanding Word Sense Disambiguation

  • Polysemous words have multiple related meanings (like "bank")
  • Context is crucial for determining the intended sense
  • Traditional embeddings struggle with this - they assign one vector per word
  • Contextual embeddings (like BERT) can handle multiple senses better
  • Click on different senses to see how context affects meaning

What to observe:

  • How different contexts reveal completely different meanings for the same word
  • Why a single vector representation cannot capture this diversity
  • The challenge this creates for downstream NLP tasks

This visualization shows exactly why the NLP field needed a revolutionary approach to word representation.

Theoretical Foundation: The Contextual Revolution

ELMo: The First Breakthrough

ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major success in contextual embeddings.

Core Innovation

ELMo generates word representations by:

  1. Training bidirectional LSTM language models on large text corpora
  2. Using all internal states of the LSTMs, not just the final output
  3. Creating weighted combinations of representations from different layers

Mathematical Formulation

For a word $w_k$ in a sequence, ELMo creates:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \, \mathbf{h}_{k,j}^{LM}$$

Where:

  • $\mathbf{h}_{k,j}^{LM}$ = contextual representation of token $k$ from layer $j$
  • $s_j$ = softmax-normalized learned weights
  • $\gamma$ = global scaling parameter
  • $L$ = number of LSTM layers
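To make the weighted combination concrete, here is a minimal sketch in PyTorch; the layer activations are random stand-ins, and the layer weights and scale are shown as trainable scalars the way a downstream task would learn them.

```python
import torch
import torch.nn.functional as F

# Toy setup: L = 2 biLSTM layers plus the layer-0 token representation,
# a 6-token sequence, hidden size 1024 -- all values are random stand-ins.
L, seq_len, hidden = 2, 6, 1024
layer_states = torch.randn(L + 1, seq_len, hidden)      # h_{k,j}^{LM} for j = 0..L

layer_logits = torch.zeros(L + 1, requires_grad=True)   # s_j before softmax normalization
gamma = torch.ones(1, requires_grad=True)               # global scaling parameter

s = F.softmax(layer_logits, dim=0)                      # softmax-normalized layer weights
# ELMo_k = gamma * sum_j s_j * h_{k,j}: a weighted sum over the layer dimension
elmo = gamma * (s[:, None, None] * layer_states).sum(dim=0)
print(elmo.shape)  # torch.Size([6, 1024])
```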

Layer Specialization Discovery

ELMo revealed that different layers capture different linguistic information:

  • Lower layers: Syntactic information (part-of-speech, morphology)
  • Higher layers: Semantic information (word sense, context-specific meaning)

BERT: The Transformer Revolution

BERT (Bidirectional Encoder Representations from Transformers) by Devlin et al. (2018) represented a quantum leap by replacing LSTMs with transformer architecture.

Key Architectural Innovations

  1. Bidirectional Attention: Unlike sequential models, words attend to both left and right context simultaneously
  2. Self-Attention Mechanism: Each word can directly attend to any other word in the sequence
  3. Deep Contextualization: Stacks of self-attention layers (12 in BERT-base, 24 in BERT-large) that progressively refine each token's representation

Attention patterns in practice

Attention Pattern Visualizer

Understanding the difference between self-attention and cross-attention

Self-Attention: "The cat sat on the mat"

Same sequence → Q, K, V → Attention
Input: "The cat sat on the mat" (Q, K, V are all derived from this one sequence)
Self-attention weights (rows = query token, columns = key token):

|     | The | cat | sat | on  | the | mat |
|-----|-----|-----|-----|-----|-----|-----|
| The | 0.8 | 0.1 | 0.1 | 0.0 | 0.0 | 0.0 |
| cat | 0.2 | 0.6 | 0.1 | 0.0 | 0.0 | 0.0 |
| sat | 0.1 | 0.3 | 0.5 | 0.1 | 0.0 | 0.0 |
| on  | 0.1 | 0.2 | 0.4 | 0.3 | 0.1 | 0.0 |
| the | 0.0 | 0.1 | 0.3 | 0.4 | 0.1 | 0.1 |
| mat | 0.0 | 0.0 | 0.1 | 0.1 | 0.3 | 0.5 |
Key insight: Each word can attend to any word in the same sequence, including itself. Notice how "cat" attends most strongly to itself (0.6) while spreading the rest of its attention over the neighboring words.
How Self-Attention Works:

  1. Same sequence for Q, K, V: All three matrices (Query, Key, Value) come from the same input sequence
  2. Bidirectional attention: Each word can attend to every other word (unless masked)
  3. Context building: Words gather information from their context to build richer representations
  4. Used in: BERT (encoder), GPT (decoder with masking), both encoder and decoder of T5

Quick Comparison

Self-Attention

  • Q, K, V from same sequence
  • Tokens attend to each other
  • Builds contextual representations
  • Used in: BERT, GPT, encoder/decoder layers

Cross-Attention

  • Q from decoder, K, V from encoder
  • Decoder attends to encoder
  • Connects two sequences
  • Used in: T5, BART, translation models

Pre-training Objectives

BERT uses two clever training tasks:

1. Masked Language Modeling (MLM)

  • Randomly mask 15% of input tokens
  • Predict masked tokens using bidirectional context
  • Forces the model to use full context for understanding
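A quick way to see MLM in action is the Hugging Face fill-mask pipeline; the checkpoint choice below (bert-base-uncased) and the example sentence are illustrative.

```python
from transformers import pipeline

# Ask a pre-trained BERT to predict a masked token from bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I deposited money at the [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```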

2. Next Sentence Prediction (NSP)

  • Given two sentences, predict if the second logically follows the first
  • Teaches inter-sentence relationships
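For example, the pair "The man went to the store." / "He bought a gallon of milk." would be labeled IsNext, while pairing the first sentence with a randomly drawn sentence such as "Penguins are flightless birds." would be labeled NotNext.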

Mathematical Foundation

BERT's self-attention mechanism computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where for each token:

  • $Q$ (Query): "What am I looking for?"
  • $K$ (Key): "What do I represent?"
  • $V$ (Value): "What information do I provide?"

This allows each word to dynamically gather relevant information from its entire context.
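The formula is short enough to implement directly. The sketch below uses a single head with random Q, K, V matrices; in BERT these would come from learned linear projections of the token embeddings, with multiple heads running in parallel.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)   # (seq_len, seq_len) raw scores
    weights = F.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ V, weights

# Toy example: 6 tokens ("The cat sat on the mat"), head dimension 64.
torch.manual_seed(0)
Q, K, V = (torch.randn(6, 64) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # torch.Size([6, 64]) torch.Size([6, 6])
```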

Beyond BERT: Modern Developments

RoBERTa: Optimized Training

  • Longer training with larger datasets
  • Removed Next Sentence Prediction task
  • Dynamic masking patterns
  • Result: Significant performance improvements

Multimodal Evolution: CLIP

CLIP (Contrastive Language-Image Pre-training) extends contextual embeddings to connect text and images in a unified representation space, enabling:

  • Zero-shot image classification using text descriptions
  • Cross-modal retrieval and understanding
  • Foundation for modern AI systems like DALL-E and GPT-4V
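As a sketch of how this looks in practice, the snippet below scores an image against a few candidate captions using the openai/clip-vit-base-patch32 checkpoint; the image path is a placeholder you would replace with a local file.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a dog", "a photo of a cat", "a photo of a river bank"]

# Encode the image and the candidate captions into CLIP's shared embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-caption similarity scores

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```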

Implementation: Working with Contextual Embeddings

Using Pre-trained BERT for Semantic Similarity

Quick SBERT similarity (CPU)

The helper below loads a lightweight sentence-transformers model, mean-pools the token embeddings (the standard pooling for this model family), and returns L2-normalized sentence vectors.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def get_contextual_embeddings(text_list, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """Extract contextual embeddings from text using BERT-based models."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize the batch and run a single forward pass (no gradients needed)
    inputs = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the token embeddings, ignoring padding, then L2-normalize
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(embeddings, p=2, dim=1)
```
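
A quick usage check (the sentences are illustrative): because the vectors are normalized, a dot product gives cosine similarity, and the two financial-sense sentences should typically score closer to each other than to the river-sense sentence.

```python
sentences = [
    "I deposited money at the bank.",
    "The bank approved my loan application.",
    "We had a picnic on the river bank.",
]
emb = get_contextual_embeddings(sentences)

print("financial vs financial:", float(emb[0] @ emb[1]))
print("financial vs river:    ", float(emb[0] @ emb[2]))
```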

Zero-Shot Classification with Contextual Models

```python
from transformers import pipeline

# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example: Context-dependent classification
text = "The river bank was eroding due to heavy rainfall."
candidate_labels = ["finance", "geography", "action", "institution"]
result = classifier(text, candidate_labels)
```
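
The pipeline returns the candidate labels ranked by score; because the model reads "bank" in its river sense here, "geography" should typically come out ahead of "finance".

```python
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```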

Connections: Comparative Analysis of Embedding Approaches

Now that we understand the theory and implementation, let's interactively compare how different embedding approaches handle the same linguistic challenges:

Embedding Model Comparison

Compare different word embedding approaches and their characteristics

Word2Vec

Neural network-based embeddings using Skip-gram or CBOW

Released: 2013
Dims: 300

GloVe

Global Vectors using matrix factorization on co-occurrence statistics

Released: 2014
Dims: 300

FastText

Extension of Word2Vec with subword information

Released: 2016
Dims: 300

BERT (contextual)

Contextual embeddings that vary based on sentence context

Released: 2018
Dims: 768

Word2Vec - Similar to "king"

  • man (0.98)
  • queen (0.96)
  • woman (0.94)
  • japan (0.67)
  • tokyo (0.63)

GloVe - Similar to "king"

  • man (0.98)
  • queen (0.94)
  • woman (0.93)
  • japan (0.68)
  • tokyo (0.66)

Model Selection Guide

  • Word2Vec: Good general-purpose choice, fast and reliable
  • GloVe: Better for analogy tasks, leverages global statistics
  • FastText: Best for handling rare words and morphologically rich languages
  • BERT: Use when context matters and computational resources allow

Key Insights to Explore:

  1. Static vs. Contextual: How Word2Vec and GloVe compare to BERT on ambiguous words
  2. Performance Trade-offs: Speed vs. accuracy across different models
  3. Use Case Matching: When to choose each approach for specific applications
  4. Evolution Timeline: The progression from simple to sophisticated representations

Cross-Domain Connections

The contextual embedding revolution parallels developments in other AI domains:

  • Computer Vision: From fixed features (SIFT, HOG) to contextual features (Vision Transformers)
  • Speech Recognition: From phoneme-based to contextual acoustic models
  • Recommendation Systems: From static user profiles to dynamic, context-aware preferences

Modern Landscape: State-of-the-Art Models

MTEB Leaderboard Leaders

| Model | MTEB Score | Key Innovation |
|-------|------------|----------------|
| E5-large | 65.3 | Advanced contrastive learning with hard negatives |
| BGE-Large | 64.5 | Custom mining strategies for training data |
| GTE-Large | 63.7 | Curriculum learning approach to embedding quality |

MTEB (Massive Text Embedding Benchmark) evaluates models across 8 embedding task categories.

Why Contextual Embeddings Excel

  1. Word Sense Disambiguation: Different vectors for different meanings
  2. Compositional Understanding: Better phrase and sentence representations
  3. Transfer Learning: Pre-trained representations adapt to new domains
  4. Reduced Data Requirements: Leverage large-scale pre-training for small tasks

Practice: Applied Exercises

Exercise 1: Context Sensitivity Analysis

Compare how traditional and contextual embeddings handle these word pairs:

  • "apple" (fruit) vs. "Apple" (company)
  • "run" (jog) vs. "run" (execute program)
  • "patient" (medical) vs. "patient" (waiting calmly)

Exercise 2: Domain Adaptation

Use BERT embeddings to build a semantic search system for:

  • Scientific papers (using SciBERT)
  • Legal documents (using Legal-BERT)
  • Medical texts (using BioBERT)
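One way to begin (a sketch reusing the get_contextual_embeddings helper from earlier; allenai/scibert_scivocab_uncased is the SciBERT checkpoint, the document strings are illustrative, and mean pooling is only an approximation for models not trained for sentence similarity):

```python
papers = [
    "Transformer architectures improve protein structure prediction.",
    "A randomized trial of statins for cardiovascular disease prevention.",
    "Graph neural networks for molecular property prediction.",
]
query = ["deep learning methods for molecules"]

doc_emb = get_contextual_embeddings(papers, model_name="allenai/scibert_scivocab_uncased")
query_emb = get_contextual_embeddings(query, model_name="allenai/scibert_scivocab_uncased")

# Rank documents by cosine similarity to the query (vectors are already L2-normalized)
scores = (doc_emb @ query_emb.T).squeeze(1)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {papers[idx]}")
```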

Exercise 3: Cross-Lingual Understanding

Explore how multilingual BERT handles:

  • Word translations across languages
  • Cross-lingual semantic similarity
  • Code-switching in multilingual texts

Summary

In this lesson, we've explored the revolutionary shift from static to contextual embeddings:

  1. The Problem: Static embeddings cannot capture context-dependent word meanings
  2. The Solution: Contextual models like ELMo and BERT that adapt representations based on surrounding context
  3. The Architecture: Bidirectional attention mechanisms and sophisticated pre-training objectives
  4. The Impact: Dramatic improvements across virtually all NLP tasks
  5. The Applications: From semantic search to zero-shot classification

Contextual embeddings represent one of the most significant breakthroughs in NLP history, laying the foundation for modern language models like GPT and ChatGPT.

In our next lesson, we'll explore the pre-transformer architectures (RNNs, LSTMs, GRUs) that paved the way for these contextual breakthroughs, understanding how the field evolved from sequential processing to parallel attention mechanisms.

Additional Resources