Contextual Embeddings and Modern Representations

Overview

In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. These models revolutionized NLP by capturing semantic relationships between words. However, they share a fundamental limitation: they assign the same vector to a word regardless of its context.

This lesson introduces contextual embeddings - dynamic representations that adapt based on surrounding words, enabling machines to understand the nuanced, context-dependent nature of human language.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand why static embeddings fail with polysemous words
  • Explain how contextual models like ELMo and BERT solve this limitation
  • Recognize the key architectural innovations that enable context-sensitivity
  • Compare different contextual embedding approaches and their trade-offs
  • Apply contextual embeddings to real-world NLP tasks

Conceptual Introduction: The Chameleon Word Problem

Real-World Analogy: The Multi-Talented Actor

Imagine an actor who plays completely different characters:

  • Character A: A serious bank president in a drama
  • Character B: A muddy river guide in an adventure film
  • Character C: A pilot banking an airplane in an action movie

The same actor embodies entirely different personas depending on the context (script, other actors, setting). Traditional word embeddings are like having only one headshot photo to represent this actor - it captures their appearance but misses the rich variety of roles they can play.

Contextual embeddings are like having a different photo for each performance, showing how the actor adapts to each role while maintaining their core identity.

Visualizing the Static Embedding Limitation

First, let's visualize how traditional embeddings work in vector space. Notice how each word has a fixed position regardless of context:

[Interactive visualization: static word embeddings plotted at fixed positions in vector space]

What to observe:

  • How different contexts reveal completely different meanings for the same word
  • Why a single vector representation cannot capture this diversity
  • The challenge this creates for downstream NLP tasks

This visualization shows exactly why the NLP field needed a revolutionary approach to word representation.
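To see the limitation in code, the short sketch below loads a small set of pretrained static vectors (GloVe through gensim's downloader, an illustrative choice rather than the tool shown above) and confirms that the word "bank" receives exactly the same vector in two very different sentences.

import gensim.downloader as api
import numpy as np

# Illustrative assumption: we use gensim's small "glove-wiki-gigaword-50" vectors;
# any static embedding model (Word2Vec, GloVe, FastText) behaves the same way.
glove = api.load("glove-wiki-gigaword-50")

sentences = [
    "The bank approved my loan application.",  # financial sense
    "We sat on the bank of the river.",        # geographic sense
]

# A static model stores exactly one vector per word type...
bank_vector = glove["bank"]

# ...so "bank" gets an identical representation in both sentences.
for sentence in sentences:
    print(sentence)
    print("  vector for 'bank' (first 5 dims):", np.round(bank_vector[:5], 3))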

Theoretical Foundation: The Contextual Revolution

ELMo: The First Breakthrough

ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major success in contextual embeddings.

Core Innovation

ELMo generates word representations by:

  1. Training bidirectional LSTM language models on large text corpora
  2. Using all internal states of the LSTMs, not just the final output
  3. Creating weighted combinations of representations from different layers

Mathematical Formulation

For a word $w_k$ in a sequence, ELMo computes:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \, \mathbf{h}_{k,j}^{LM}$$

Where:

  • $\mathbf{h}_{k,j}^{LM}$ = contextual representation of word $k$ from layer $j$
  • $s_j$ = softmax-normalized learned weights
  • $\gamma$ = global scaling parameter
  • $L$ = number of biLSTM layers
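As a concrete illustration of this weighted combination, the sketch below mixes made-up per-layer hidden states for a single word position using softmax-normalized weights and a global scale. The layer count, hidden size, and parameter values are invented for demonstration and do not come from a trained ELMo model.

import torch
import torch.nn.functional as F

# Toy setup (illustrative values, not from a trained ELMo model):
# three representations per word position k: j = 0 (token layer) plus L = 2 biLSTM layers.
num_reps, hidden_size = 3, 4
h_k = torch.randn(num_reps, hidden_size)     # h_{k,j}^{LM} for j = 0..L

# Learned scalars in real ELMo: layer weights s_j (softmax-normalized) and a global scale gamma.
raw_weights = torch.tensor([0.2, 1.0, 0.5])  # would be learned per downstream task
s = F.softmax(raw_weights, dim=0)
gamma = torch.tensor(1.3)                    # would also be learned

# ELMo_k = gamma * sum_j s_j * h_{k,j}^{LM}
elmo_k = gamma * (s.unsqueeze(-1) * h_k).sum(dim=0)
print(elmo_k.shape)  # torch.Size([4]) – one context-dependent vector for word k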

Layer Specialization Discovery

ELMo revealed that different layers capture different linguistic information:

  • Lower layers: Syntactic information (part-of-speech, morphology)
  • Higher layers: Semantic information (word sense, context-specific meaning)
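You can probe this yourself by extracting every layer's hidden states from a pretrained model and comparing how a word's representation changes from layer to layer. The sketch below does so with a BERT-style model through Hugging Face transformers rather than ELMo's biLSTM; the model name and the "bank" comparison are illustrative assumptions, not an experiment from the ELMo paper.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Assumption: we probe bert-base-uncased instead of ELMo's biLSTM; the idea
# (different layers encode different kinds of information) is the same.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

sentences = ["The bank approved my loan.", "We sat on the river bank."]
encoded = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden]
hidden_states = outputs.hidden_states

# Locate the token "bank" in each sentence ("bank" is a single token in this vocab).
bank_id = tokenizer.convert_tokens_to_ids("bank")
positions = [ids.index(bank_id) for ids in encoded["input_ids"].tolist()]

# Compare the two occurrences of "bank" layer by layer.
for layer, h in enumerate(hidden_states):
    v0, v1 = h[0, positions[0]], h[1, positions[1]]
    sim = F.cosine_similarity(v0, v1, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity between the two 'bank' tokens = {sim:.3f}")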

BERT: The Transformer Revolution

BERT (Bidirectional Encoder Representations from Transformers) by Devlin et al. (2018) represented a quantum leap by replacing LSTMs with transformer architecture.

Key Architectural Innovations

  1. Bidirectional Attention: Unlike models that read text in one direction (or concatenate two separately trained directions, as ELMo does), every word attends to both its left and right context jointly
  2. Self-Attention Mechanism: Each word can directly attend to any other word in the sequence, regardless of distance (a minimal sketch of this computation follows the list)
  3. Deep Contextualization: 12 (BERT-base) to 24 (BERT-large) stacked layers of attention
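To ground the self-attention idea, here is a minimal sketch of single-head scaled dot-product self-attention. The sequence length, dimensions, and random projection matrices are illustrative assumptions, not weights from an actual BERT checkpoint.

import torch
import torch.nn.functional as F

# Toy single-head scaled dot-product self-attention (illustrative shapes, random values).
seq_len, d_model = 5, 8               # 5 tokens, 8-dimensional representations
x = torch.randn(seq_len, d_model)     # token representations entering the layer

# Learned projections in a real model; random here for demonstration.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token attends to every other token, left and right alike (bidirectional).
scores = Q @ K.T / (d_model ** 0.5)   # [seq_len, seq_len] attention logits
attn = F.softmax(scores, dim=-1)      # each row sums to 1
contextualized = attn @ V             # context-mixed representations

print(attn.shape, contextualized.shape)  # torch.Size([5, 5]) torch.Size([5, 8])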

Attention patterns in practice

[Interactive visualization: attention patterns across tokens in a BERT-style model]
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def get_contextual_embeddings(text_list, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """Extract contextual embeddings from text using BERT-based models."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize input texts
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')

    # Generate embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean pooling over token embeddings
    token_embeddings = model_output[0]
    attention_mask = encoded_input['attention_mask']
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sentence_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9)

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings

# Example: Context-sensitive similarity
sentences = [
    "The bank approved my loan application.",    # Financial context
    "We sat on the bank of the river.",          # Geographic context
    "I need to bank this money quickly.",        # Action context
    "The investment bank offers great returns."  # Financial context
]

embeddings = get_contextual_embeddings(sentences)

# Calculate similarities
similarities = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:])

print("Similarities with 'The bank approved my loan application.':")
for i, sim in enumerate(similarities):
    print(f"  → '{sentences[i+1]}': {sim:.3f}")

Zero-Shot Classification with Contextual Models

from transformers import pipeline

# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example: Context-dependent classification
text = "The river bank was eroding due to heavy rainfall."
candidate_labels = ["finance", "geography", "action", "institution"]

result = classifier(text, candidate_labels)

print(f"Text: {text}")
print("\nClassification results:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.3f}")

Connections: Comparative Analysis of Embedding Approaches

Now that we understand the theory and implementation, let's interactively compare how different embedding approaches handle the same linguistic challenges:

[Interactive comparison: static vs. contextual embedding approaches on the same examples]