Overview
In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. These models revolutionized NLP by capturing semantic relationships between words. However, they share a fundamental limitation: they assign the same vector to a word regardless of its context.
This lesson introduces contextual embeddings - dynamic representations that adapt based on surrounding words, enabling machines to understand the nuanced, context-dependent nature of human language.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why static embeddings fail with polysemous words
- Explain how contextual models like ELMo and BERT solve this limitation
- Recognize the key architectural innovations that enable context-sensitivity
- Compare different contextual embedding approaches and their trade-offs
- Apply contextual embeddings to real-world NLP tasks
Conceptual Introduction: The Chameleon Word Problem
Real-World Analogy: The Multi-Talented Actor
Imagine an actor who plays completely different characters:
- Character A: A serious bank president in a drama
- Character B: A muddy river guide in an adventure film
- Character C: A pilot banking an airplane in an action movie
The same actor embodies entirely different personas depending on the context (script, other actors, setting). Traditional word embeddings are like having only one headshot photo to represent this actor - it captures their appearance but misses the rich variety of roles they can play.
Contextual embeddings are like having a different photo for each performance, showing how the actor adapts to each role while maintaining their core identity.
Visualizing the Static Embedding Limitation
First, consider how traditional embeddings work in vector space: each word has a single fixed position regardless of context.
What to observe:
- How different contexts reveal completely different meanings for the same word
- Why a single vector representation cannot capture this diversity
- The challenge this creates for downstream NLP tasks
This fixed-position behavior shows exactly why the NLP field needed a revolutionary approach to word representation.
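To make the limitation concrete, here is a minimal sketch using gensim's pre-trained GloVe vectors (the gensim library and the "glove-wiki-gigaword-50" model are tooling assumptions, not part of the original lesson). Because a static model is a pure table lookup, "bank" maps to exactly the same vector whether the sentence is about finance or rivers.

```python
# Minimal sketch: a static embedding is a fixed lookup table.
# Assumes gensim is installed; the model is downloaded on first use.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # small pre-trained GloVe vectors

sentence_a = "I deposited cash at the bank"          # financial context
sentence_b = "We fished from the bank of the river"  # geographic context

# The lookup depends only on the word itself, never on its neighbors,
# so both sentences receive the identical vector for "bank".
vector_a = glove["bank"]
vector_b = glove["bank"]
print((vector_a == vector_b).all())  # True: one fixed vector per word
```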
Theoretical Foundation: The Contextual Revolution
ELMo: The First Breakthrough
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major success in contextual embeddings.
Core Innovation
ELMo generates word representations by:
- Training bidirectional LSTM language models on large text corpora
- Using all internal states of the LSTMs, not just the final output
- Creating weighted combinations of representations from different layers
Mathematical Formulation
For a token $k$ in a sequence, ELMo creates a task-specific representation as a weighted combination of all biLM layers:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$

Where:
- $\mathbf{h}_{k,j}^{LM}$ = contextual representation of token $k$ from layer $j$ (layer 0 is the context-independent token embedding)
- $s_j^{task}$ = softmax-normalized learned weights
- $\gamma^{task}$ = global scaling parameter
- $L$ = number of LSTM layers
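The weighted combination above is easy to express in code. The following is a minimal PyTorch sketch of just the layer-mixing step, using random tensors as stand-ins for real biLM hidden states; it illustrates the formula rather than implementing an actual ELMo model.

```python
# Minimal sketch of ELMo's weighted layer combination ("scalar mix").
# The layer states below are random stand-ins, not real biLM outputs.
import torch
import torch.nn.functional as F

L = 2           # number of biLSTM layers
seq_len = 3     # toy sequence length
hidden_dim = 4  # toy representation size

# layer_states[j] holds one vector per token at layer j
# (index 0 stands in for the context-independent token embedding layer)
layer_states = [torch.randn(seq_len, hidden_dim) for _ in range(L + 1)]

s = torch.nn.Parameter(torch.zeros(L + 1))  # learned weights, softmax-normalized below
gamma = torch.nn.Parameter(torch.ones(1))   # global scaling parameter

weights = F.softmax(s, dim=0)               # the s_j in the formula
elmo_repr = gamma * sum(w * h for w, h in zip(weights, layer_states))

print(elmo_repr.shape)  # torch.Size([3, 4]): one contextual vector per token
```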
Layer Specialization Discovery
ELMo revealed that different layers capture different linguistic information (a short probing sketch follows this list):
- Lower layers: Syntactic information (part-of-speech, morphology)
- Higher layers: Semantic information (word sense, context-specific meaning)
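A simple way to explore this layer specialization yourself is to extract per-layer hidden states and probe them for different linguistic properties. The sketch below does this with a transformer model through the Hugging Face transformers library; the choice of bert-base-uncased is an illustrative assumption, not the biLSTM setup ELMo itself used.

```python
# Minimal sketch: retrieving per-layer hidden states for probing.
# Uses bert-base-uncased as an illustrative model, not ELMo's biLSTM.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The bank approved my loan.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: (embedding layer, layer 1, ..., layer 12)
hidden_states = outputs.hidden_states
print(f"Layers returned: {len(hidden_states)}")            # 13 for bert-base
print(f"Shape per layer: {tuple(hidden_states[0].shape)}") # (1, num_tokens, 768)
```

Lower and higher layers can then be compared by training small classifiers (probes) on these states for syntactic versus semantic prediction tasks.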
BERT: The Transformer Revolution
BERT (Bidirectional Encoder Representations from Transformers) by Devlin et al. (2018) represented a quantum leap by replacing LSTMs with transformer architecture.
Key Architectural Innovations
- Bidirectional Attention: Unlike sequential models, words attend to both left and right context simultaneously
- Self-Attention Mechanism: Each word can directly attend to any other word in the sequence (see the sketch after this list)
- Deep Contextualization: 12 (BERT-base) to 24 (BERT-large) layers of stacked self-attention
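To see what it means for every word to attend to every other word, here is a minimal sketch of single-head scaled dot-product self-attention with toy, randomly initialized projections; real BERT layers add multiple heads, residual connections, and layer normalization on top of this core operation.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# Projections are random toy weights, not trained parameters.
import math
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)    # one representation per token

W_q = torch.randn(d_model, d_model)  # query projection
W_k = torch.randn(d_model, d_model)  # key projection
W_v = torch.randn(d_model, d_model)  # value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token scores every other token, left and right context alike
scores = Q @ K.T / math.sqrt(d_model)
attn = F.softmax(scores, dim=-1)     # each row sums to 1
contextualized = attn @ V            # context-mixed token representations

print(attn.shape)            # torch.Size([5, 5]): token-to-token attention weights
print(contextualized.shape)  # torch.Size([5, 8])
```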
Contextual embeddings in practice
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def get_contextual_embeddings(text_list, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """Extract contextual embeddings from text using BERT-based models."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize input texts
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')

    # Generate embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean pooling over token embeddings
    token_embeddings = model_output[0]
    attention_mask = encoded_input['attention_mask']
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sentence_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings

# Example: Context-sensitive similarity
sentences = [
    "The bank approved my loan application.",    # Financial context
    "We sat on the bank of the river.",          # Geographic context
    "I need to bank this money quickly.",        # Action context
    "The investment bank offers great returns."  # Financial context
]

embeddings = get_contextual_embeddings(sentences)

# Calculate similarities
similarities = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:])

print("Similarities with 'The bank approved my loan application.':")
for i, sim in enumerate(similarities):
    print(f"  → '{sentences[i+1]}': {sim:.3f}")
```
Zero-Shot Classification with Contextual Models
```python
from transformers import pipeline

# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example: Context-dependent classification
text = "The river bank was eroding due to heavy rainfall."
candidate_labels = ["finance", "geography", "action", "institution"]

result = classifier(text, candidate_labels)

print(f"Text: {text}")
print("\nClassification results:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.3f}")
```
Connections: Comparative Analysis of Embedding Approaches
Now that we understand the theory and implementation, let's interactively compare how different embedding approaches handle the same linguistic challenges: