Overview
In our previous lessons, we've explored word representations from static embeddings to contextual embeddings. But a critical question remains: how do we effectively process sequences of these word representations to understand the full meaning of sentences, paragraphs, and documents?
This lesson introduces Recurrent Neural Networks (RNNs), the foundational architecture for sequential data processing in NLP. Before transformers became the dominant paradigm, RNNs and their variants (LSTM, GRU) were the state-of-the-art for tasks like language modeling, machine translation, and sentiment analysis.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why sequential data requires specialized neural architectures
- Explain the basic RNN architecture and its recurrence mechanism
- Describe the vanishing/exploding gradient problems in vanilla RNNs
- Compare LSTM and GRU architectures and their advantages
- Implement RNN variants for common NLP tasks
- Recognize the limitations that led to the transformer revolution
The Sequential Nature of Language
The Challenge of Variable-Length Input
Traditional neural networks expect fixed-size inputs, but language is inherently variable in length:
- Sentences can be short ("I agree.") or very long
- Documents can range from tweets to novels
- Conversations can have arbitrary turns and lengths
How do we design neural networks that can handle this variability while preserving the sequential relationships?
Analogy: Understanding Music
Consider how you understand music. A single note in isolation gives limited information, but as you hear sequences of notes, you build an understanding of the melody, rhythm, and emotional tone.
If you were to hear only random isolated notes, you'd lose the temporal patterns that make music meaningful. Similarly, to understand language, we need to process words not in isolation, but as part of a meaningful sequence while maintaining the memory of what came before.
Why Feed-Forward Networks Fall Short
| Requirement | Feed-Forward Networks | Recurrent Networks |
|---|---|---|
| Variable-length input | Fixed input size | Can handle any sequence length |
| Parameter sharing across positions | Separate weights for each input position | Same weights reused at every time step |
| Memory of previous inputs | No memory mechanism | State vector carries information forward |
| Order sensitivity | Order agnostic | Order matters |
| Position awareness | No positional awareness | Position implicitly encoded through recurrence |
Recurrent Neural Networks: The Basic Architecture
The Recurrence Mechanism
The key innovation in RNNs is the recurrence mechanism: the network maintains a hidden state (or "memory") that is updated at each time step based on both the current input and the previous hidden state.
Interactive RNN Architecture Explorer
Explore different RNN architectures and see how they evolved to solve various problems:
💡 Tip: Use the tabs above to compare vanilla RNNs, LSTMs, GRUs, and bidirectional variants. We'll explore training dynamics and sequence processing with additional tools as we progress through the lesson.
Mathematical Formulation
At each time step $t$, the vanilla RNN computes:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = g(W_{hy} h_t + b_y)$$

Where:
- $x_t$ is the input at time step $t$ (e.g., a word embedding)
- $h_t$ is the hidden state at time step $t$
- $h_{t-1}$ is the hidden state from the previous time step
- $y_t$ is the output at time step $t$
- $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices
- $b_h$ and $b_y$ are bias vectors
- $f$ is typically the tanh or ReLU activation function
- $g$ is an output activation function (e.g., softmax for classification)
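To make the formulation concrete, here is a minimal sketch of a single vanilla RNN step in PyTorch. The tensor names (`W_xh`, `W_hh`, `W_hy`) mirror the symbols above and are purely illustrative, not a library API:

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, output_size = 8, 16, 4

# Weight matrices and biases from the formulation above
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
W_hy = torch.randn(output_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = torch.softmax(W_hy @ h_t + b_y, dim=0)  # output distribution
    return h_t, y_t

x_t = torch.randn(input_size)      # e.g., a word embedding
h_prev = torch.zeros(hidden_size)  # initial hidden state
h_t, y_t = rnn_step(x_t, h_prev)
print(h_t.shape, y_t.shape)  # torch.Size([16]) torch.Size([4])
```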
Parameter Sharing
A key advantage of RNNs is parameter sharing across time steps. The same weights are used at each step, which:
- Drastically reduces the number of parameters
- Allows processing sequences of any length
- Enables the network to recognize patterns regardless of position
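As a quick illustration of parameter sharing, the sketch below (using PyTorch's built-in `nn.RNN` with illustrative sizes) shows that one module with a fixed set of weights can process sequences of any length; only the hidden state changes from step to step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# The same weight matrices are applied at every time step,
# so a single module handles sequences of any length.
short_seq = torch.randn(1, 5, 8)   # batch of 1, 5 time steps
long_seq = torch.randn(1, 50, 8)   # batch of 1, 50 time steps

out_short, _ = rnn(short_seq)
out_long, _ = rnn(long_seq)

print(out_short.shape)  # torch.Size([1, 5, 16])
print(out_long.shape)   # torch.Size([1, 50, 16])
# Parameter count does not depend on sequence length
print(sum(p.numel() for p in rnn.parameters()))
```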
Training RNNs: Backpropagation Through Time (BPTT)
RNNs are trained using an extension of backpropagation called Backpropagation Through Time (BPTT), which unfolds the recurrent network through time and treats it as a deep feed-forward network.
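The following small experiment (an illustrative setup, not part of the lesson's later code) unrolls a vanilla RNN over a long sequence and inspects the gradient of a loss at the final step with respect to earlier inputs. With a tanh recurrence, the gradient norm typically shrinks as the distance grows, which is exactly the vanishing-gradient effect that BPTT exposes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=32, batch_first=True)

seq_len = 60
x = torch.randn(1, seq_len, 4, requires_grad=True)

out, _ = rnn(x)
loss = out[:, -1].sum()  # loss depends only on the final time step
loss.backward()          # BPTT: gradients flow back through every step

# Gradient norm w.r.t. the input at selected time steps:
# earlier steps typically receive much smaller gradients.
for t in [0, 20, 40, 59]:
    print(f"step {t:2d}: grad norm = {x.grad[0, t].norm().item():.2e}")
```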
Training Dynamics: Backpropagation Through Time
Now let's explore the training challenges that led to LSTM and GRU innovations:
💡 Tip: Use the tabs above to explore gradient flow problems and compare how different architectures handle training challenges. This visualization shows why vanilla RNNs struggle with long sequences.
Long Short-Term Memory (LSTM): Solving the Long-Term Dependency Problem
To address the vanishing gradient problem, Hochreiter and Schmidhuber introduced the Long Short-Term Memory (LSTM) architecture in 1997. LSTMs use a more complex recurrent unit with gates that control information flow.
LSTM Architecture
👆 Use the Architecture Explorer above and select "LSTM" to see the detailed gate structure and how it differs from vanilla RNNs.
The Gate Mechanism
An LSTM cell contains three gates that regulate information flow:
- Forget Gate: Decides what information to discard from the cell state
- Input Gate: Decides what new information to store in the cell state
- Output Gate: Decides what parts of the cell state to output
Mathematical Formulation
For input $x_t$ at time step $t$:

Forget Gate:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Input Gate:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Cell State Update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Output Gate:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Where:
- $\sigma$ is the sigmoid function
- $\odot$ represents element-wise multiplication
- $C_t$ is the cell state at time $t$
- $h_t$ is the hidden state at time $t$
- $W_*$ and $b_*$ are the gate-specific weight matrices and bias vectors
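Here is a minimal sketch of one LSTM step written out gate-by-gate to mirror the equations above. It is meant for illustration; `nn.LSTM` and `nn.LSTMCell` implement the same gating far more efficiently:

```python
import torch
import torch.nn as nn

class ManualLSTMCell(nn.Module):
    """One LSTM step with explicit forget, input, and output gates."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Each gate has its own weights over the concatenated [h_{t-1}, x_t]
        self.forget = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_ = nn.Linear(input_size + hidden_size, hidden_size)
        self.cell = nn.Linear(input_size + hidden_size, hidden_size)
        self.output = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        f_t = torch.sigmoid(self.forget(hx))   # forget gate
        i_t = torch.sigmoid(self.input_(hx))   # input gate
        c_tilde = torch.tanh(self.cell(hx))    # candidate cell state
        c_t = f_t * c_prev + i_t * c_tilde     # cell state update
        o_t = torch.sigmoid(self.output(hx))   # output gate
        h_t = o_t * torch.tanh(c_t)            # new hidden state
        return h_t, c_t

cell = ManualLSTMCell(input_size=8, hidden_size=16)
x_t = torch.randn(2, 8)                # batch of 2
h = c = torch.zeros(2, 16)
h, c = cell(x_t, h, c)
print(h.shape, c.shape)  # torch.Size([2, 16]) torch.Size([2, 16])
```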
Memory Management Analogy
Think of the LSTM cell as a skilled personal assistant managing your information flow:
- Forget Gate: Like clearing your desk of irrelevant papers
- Input Gate: Like deciding which new information deserves to be filed away
- Cell State: Like your organized filing cabinet of important information
- Output Gate: Like preparing a briefing of only the relevant information you need right now
Addressing Long-Term Dependencies
LSTMs excel at capturing long-term dependencies through their explicit memory mechanism. The combination of the cell state (long-term memory) and hidden state (working memory) allows LSTMs to maintain relevant information across many time steps while forgetting irrelevant details.
Gated Recurrent Unit (GRU): A Streamlined Alternative
Introduced in 2014 by Cho et al., the Gated Recurrent Unit (GRU) is a simplified variant of the LSTM that combines the forget and input gates into a single "update gate."
GRU Architecture
👆 Use the Architecture Explorer above and select "GRU" to see how it simplifies the LSTM design while maintaining effectiveness.
Mathematical Formulation
For input $x_t$ at time step $t$:

Update Gate:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

Reset Gate:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

Candidate Hidden State:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$

Final Hidden State:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
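A corresponding sketch of one GRU step, written gate-by-gate to mirror the equations above (illustrative only; `nn.GRU` implements the gating internally with a slightly different parameterization):

```python
import torch
import torch.nn as nn

class ManualGRUCell(nn.Module):
    """One GRU step with explicit update and reset gates."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.update = nn.Linear(input_size + hidden_size, hidden_size)
        self.reset = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.update(hx))    # update gate
        r_t = torch.sigmoid(self.reset(hx))     # reset gate
        h_tilde = torch.tanh(self.candidate(
            torch.cat([r_t * h_prev, x_t], dim=-1)))  # candidate state
        h_t = (1 - z_t) * h_prev + z_t * h_tilde      # interpolate old and new
        return h_t

cell = ManualGRUCell(input_size=8, hidden_size=16)
h = torch.zeros(2, 16)
h = cell(torch.randn(2, 8), h)
print(h.shape)  # torch.Size([2, 16])
```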
LSTM vs. GRU: Comparison
| Feature | LSTM | GRU |
|---|---|---|
| Parameters | More (4 sets of weights and biases) | Fewer (3 sets of weights and biases) |
| Memory unit | Cell state and hidden state | Hidden state only |
| Gates | Forget, input, and output gates | Update and reset gates |
| Training speed | Slower | Faster |
| Performance on very long dependencies | Slightly better | Good |
| Computational efficiency | More computation | Less computation |
Note: GRUs typically train faster and require fewer parameters, but LSTMs may perform better on certain tasks, especially those requiring fine-grained memory control.
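The parameter difference in the table is easy to verify directly. A quick sketch comparing `nn.LSTM` and `nn.GRU` with identical sizes (the ratio is roughly 4:3, matching the four vs. three sets of gate weights):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

print(f"LSTM parameters: {count_params(lstm):,}")
print(f"GRU parameters:  {count_params(gru):,}")
```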
Bidirectional RNNs: Capturing Context from Both Directions
In many NLP tasks, understanding a word requires context from both past and future words. Bidirectional RNNs process the sequence in both forward and backward directions.
Bidirectional Architecture
👆 Use the Architecture Explorer above and select "Bidirectional" to see how information flows in both directions.
Benefits for NLP Tasks
Bidirectional processing is especially valuable for:
- Named Entity Recognition
- Part-of-Speech Tagging
- Machine Translation
- Question Answering
Example: Disambiguating Word Sense
The word "bank" has different meanings depending on context. Bidirectional RNNs can use both past and future context to determine the correct interpretation.
Example contexts:
- "I went to the bank to deposit money" (financial institution)
- "We sat by the river bank watching the sunset" (edge of water)
- "The pilot had to bank the airplane to the left" (to tilt)
Bidirectional RNNs excel at these disambiguation tasks because they can consider the full sentence context.
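Here is a minimal sketch of a bidirectional LSTM encoder for a tagging-style setup (sizes are illustrative). The forward and backward hidden states are concatenated, so each position's representation sees both its left and right context:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, num_tags = 1000, 64, 128, 9

embedding = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
tagger = nn.Linear(hidden_dim * 2, num_tags)  # forward + backward states are concatenated

tokens = torch.randint(0, vocab_size, (1, 12))  # a batch of one 12-token sentence
states, _ = bilstm(embedding(tokens))
print(states.shape)      # torch.Size([1, 12, 256]) -- 2 * hidden_dim per position
tag_scores = tagger(states)
print(tag_scores.shape)  # torch.Size([1, 12, 9]) -- one tag distribution per token
```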
Common NLP Applications of RNNs
Language Modeling
Language modeling is the task of predicting the next word given a sequence of previous words. This is a fundamental NLP task with applications in:
- Speech recognition
- Machine translation
- Text generation
- Spelling correction
Code Example: Simple Character-Level Language Model
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Sample text data
text = """Natural language processing (NLP) is a subfield of linguistics, computer
science, and artificial intelligence concerned with the interactions between
computers and human language, in particular how to program computers to process
and analyze large amounts of natural language data."""

# Prepare character-level data
chars = sorted(list(set(text)))
char_to_idx = {char: i for i, char in enumerate(chars)}
idx_to_char = {i: char for i, char in enumerate(chars)}
vocab_size = len(chars)


class CharLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super(CharLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.vocab_size = vocab_size

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        # LSTM layer(s)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True, dropout=0.2)
        # Output layer
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        batch_size = x.size(0)
        if hidden is None:
            h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
            c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
            hidden = (h0, c0)

        embedded = self.embedding(x)                    # Embedding
        lstm_out, hidden = self.lstm(embedded, hidden)  # LSTM forward pass
        output = self.fc(lstm_out)                      # Output layer
        return output, hidden


# Create training sequences
def create_sequences(text, seq_length=40):
    sequences = []
    targets = []
    for i in range(len(text) - seq_length):
        seq = text[i:i + seq_length]
        target = text[i + 1:i + seq_length + 1]
        sequences.append([char_to_idx[char] for char in seq])
        targets.append([char_to_idx[char] for char in target])
    return torch.tensor(sequences), torch.tensor(targets)


# Prepare data
seq_length = 40
X, y = create_sequences(text, seq_length)

# Initialize model
model = CharLSTM(vocab_size, hidden_size=128, num_layers=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.002)


# Training function
def train_model(model, X, y, epochs=50, batch_size=32):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        hidden = None
        for i in range(0, len(X) - batch_size, batch_size):
            batch_X = X[i:i + batch_size]
            batch_y = y[i:i + batch_size]

            # Reset gradients
            optimizer.zero_grad()

            # Forward pass
            output, hidden = model(batch_X, hidden)

            # Detach hidden state to prevent backprop through the entire history
            hidden = tuple(h.detach() for h in hidden)

            # Calculate loss
            loss = criterion(output.view(-1, vocab_size), batch_y.view(-1))

            # Backward pass with gradient clipping
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()

            total_loss += loss.item()

        if epoch % 10 == 0:
            print(f'Epoch {epoch}, Loss: {total_loss / len(X):.4f}')


# Text generation function
def generate_text(model, seed_text, length=100, temperature=0.8):
    model.eval()
    # Convert seed text to indices
    current_seq = [char_to_idx[char] for char in seed_text[-seq_length:]]
    generated = seed_text

    with torch.no_grad():
        hidden = None
        for _ in range(length):
            # Prepare input
            x = torch.tensor([current_seq]).long()

            # Forward pass
            output, hidden = model(x, hidden)

            # Apply temperature to the logits of the last position
            logits = output[0, -1] / temperature
            probs = F.softmax(logits, dim=0)

            # Sample next character
            next_char_idx = torch.multinomial(probs, 1).item()
            next_char = idx_to_char[next_char_idx]

            # Update sequence and generated text
            generated += next_char
            current_seq = current_seq[1:] + [next_char_idx]

    return generated


# Example usage:
# train_model(model, X, y, epochs=50)
# generated_text = generate_text(model, "Natural language processing is ", length=100)
# print(generated_text)

print("Model architecture:")
print(model)
```
Sentiment Analysis
Sentiment analysis determines the emotional tone behind text, often used for customer reviews, social media monitoring, and brand analysis.
Code Example: Sentiment Classification with LSTM
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Sample data
texts = [
    "This movie was fantastic! I really enjoyed it.",
    "The plot was intriguing and kept me engaged.",
    "Terrible movie, waste of time and money.",
    "I hated the characters and the story made no sense.",
    "The acting was superb and the cinematography was beautiful.",
    "What a disappointment, I expected much better.",
    "Amazing storyline with incredible character development.",
    "Boring and predictable, couldn't wait for it to end.",
    "Outstanding performances from all the actors.",
    "The worst movie I've ever seen in my life."
]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # 1 for positive, 0 for negative


class TextTokenizer:
    def __init__(self, max_words=1000):
        self.max_words = max_words
        self.word_to_idx = {}
        self.idx_to_word = {}

    def fit_on_texts(self, texts):
        # Count word frequencies
        word_count = {}
        for text in texts:
            for word in text.lower().split():
                word_count[word] = word_count.get(word, 0) + 1

        # Sort by frequency and take the top words (reserving 2 slots for special tokens)
        sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
        vocab_words = [word for word, count in sorted_words[:self.max_words - 2]]

        # Build vocabulary
        self.word_to_idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx_to_word = {0: '<PAD>', 1: '<UNK>'}
        for i, word in enumerate(vocab_words):
            self.word_to_idx[word] = i + 2
            self.idx_to_word[i + 2] = word

    def texts_to_sequences(self, texts):
        sequences = []
        for text in texts:
            words = text.lower().split()
            sequence = [self.word_to_idx.get(word, 1) for word in words]  # 1 is <UNK>
            sequences.append(sequence)
        return sequences


def pad_sequences(sequences, maxlen=None, padding='post', truncating='post'):
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[:maxlen] if truncating == 'post' else seq[-maxlen:]
        if padding == 'post':
            seq = seq + [0] * (maxlen - len(seq))
        else:
            seq = [0] * (maxlen - len(seq)) + seq
        padded.append(seq)
    return torch.tensor(padded, dtype=torch.long)


# Tokenize the texts
max_words = 100
max_len = 20
tokenizer = TextTokenizer(max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=max_len)
y = torch.tensor(labels, dtype=torch.float32)


class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=64, num_layers=1, dropout=0.2):
        super(SentimentLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0)
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(hidden_dim, 32)
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        # Embedding
        embedded = self.embedding(x)

        # LSTM
        lstm_out, (hidden, _) = self.lstm(embedded)

        # Take the last non-padded output for each sequence
        batch_size = x.size(0)
        seq_lengths = (x != 0).sum(dim=1)  # actual sequence lengths

        last_outputs = []
        for i in range(batch_size):
            last_idx = max(0, seq_lengths[i] - 1)
            last_outputs.append(lstm_out[i, last_idx, :])
        last_hidden = torch.stack(last_outputs)

        # Fully connected layers
        x = self.dropout(last_hidden)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.sigmoid(self.fc2(x))
        return x.squeeze()


# Initialize model
vocab_size = len(tokenizer.word_to_idx)
model = SentimentLSTM(vocab_size, embedding_dim=128, hidden_dim=64, dropout=0.2)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


# Training function
def train_sentiment_model(model, X, y, epochs=100, batch_size=4):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        # Create batches
        for i in range(0, len(X), batch_size):
            batch_X = X[i:i + batch_size]
            batch_y = y[i:i + batch_size]

            # Reset gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)

            # Backward pass
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        if epoch % 20 == 0:
            print(f'Epoch {epoch}, Loss: {total_loss:.4f}')


# Prediction function
def predict_sentiment(model, tokenizer, text, max_len=20):
    model.eval()
    # Tokenize and pad the input text
    sequence = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequence, maxlen=max_len)
    with torch.no_grad():
        prediction = model(padded).item()
    return {
        'text': text,
        'positive_probability': prediction,
        'negative_probability': 1 - prediction,
        'sentiment': 'Positive' if prediction > 0.5 else 'Negative'
    }


# Example usage:
# train_sentiment_model(model, X, y, epochs=100)
# result = predict_sentiment(model, tokenizer, "This movie was absolutely amazing!")
# print(result)

print("Model architecture:")
print(model)
```
Machine Translation with Encoder-Decoder Architecture
Machine translation uses a sequence-to-sequence (Seq2Seq) architecture with an encoder RNN and a decoder RNN.
Interactive Translation Demo
See how RNN encoder-decoder models with attention work for machine translation:
💡 Tip: This tool demonstrates the attention mechanism that became the foundation for transformers. Notice how the decoder "attends" to different parts of the source sequence when generating each target word.
Code Example: Simple Encoder-Decoder for Translation
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random

# Simple Encoder-Decoder with Attention for Neural Machine Translation

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [batch size, src len]
        embedded = self.dropout(self.embedding(src))
        # embedded = [batch size, src len, emb dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs = [batch size, src len, hid dim]
        # hidden = [n layers, batch size, hid dim]
        # cell = [n layers, batch size, hid dim]
        return outputs, hidden, cell


class Attention(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        self.attn = nn.Linear(hid_dim * 2, hid_dim)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden = [batch size, hid dim]
        # encoder_outputs = [batch size, src len, hid dim]
        src_len = encoder_outputs.shape[1]

        # Repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        # hidden = [batch size, src len, hid dim]

        # Calculate energy
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        # energy = [batch size, src len, hid dim]

        attention = self.v(energy).squeeze(2)
        # attention = [batch size, src len]
        return F.softmax(attention, dim=1)


class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(hid_dim + emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc_out = nn.Linear(hid_dim * 2 + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell, encoder_outputs):
        # input = [batch size, 1]
        # hidden = [n layers, batch size, hid dim]
        # cell = [n layers, batch size, hid dim]
        # encoder_outputs = [batch size, src len, hid dim]
        embedded = self.dropout(self.embedding(input))
        # embedded = [batch size, 1, emb dim]

        # Calculate attention weights over the encoder outputs
        a = self.attention(hidden[-1], encoder_outputs)
        # a = [batch size, src len]
        a = a.unsqueeze(1)
        # a = [batch size, 1, src len]

        weighted = torch.bmm(a, encoder_outputs)
        # weighted = [batch size, 1, hid dim]

        rnn_input = torch.cat((embedded, weighted), dim=2)
        # rnn_input = [batch size, 1, hid dim + emb dim]

        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        # output = [batch size, 1, hid dim]

        # Calculate prediction from the RNN output, context vector, and embedding
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=2))
        # prediction = [batch size, 1, output dim]
        return prediction.squeeze(1), hidden, cell


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [batch size, src len]
        # trg = [batch size, trg len]
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim

        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        # Encoder forward pass
        encoder_outputs, hidden, cell = self.encoder(src)

        # First input to the decoder is the <sos> tokens
        input = trg[:, 0].unsqueeze(1)

        for t in range(1, trg_len):
            # Insert input token embedding, previous hidden and cell states, and encoder outputs
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)

            # Place predictions in a tensor holding predictions for each token
            outputs[:, t] = output

            # Decide whether to use teacher forcing for the next step
            teacher_force = random.random() < teacher_forcing_ratio

            # Get the highest-probability predicted token
            top1 = output.argmax(1)

            # If teacher forcing, use the actual next token as the next input; otherwise use the prediction
            input = trg[:, t].unsqueeze(1) if teacher_force else top1.unsqueeze(1)

        return outputs


# Model parameters
INPUT_DIM = 1000   # Source vocabulary size
OUTPUT_DIM = 1000  # Target vocabulary size
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Initialize model components
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
attn = Attention(HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)


# Training function
def train_translation_model(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src, trg = batch
        src, trg = src.to(device), trg.to(device)

        optimizer.zero_grad()
        output = model(src, trg)
        # trg = [batch size, trg len]
        # output = [batch size, trg len, output dim]
        output_dim = output.shape[-1]

        # Skip the <sos> position and reshape for the loss calculation
        output = output[:, 1:].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)


# Example initialization
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Assuming 0 is the padding token

print("Seq2Seq Model with Attention:")
print(f"Encoder: {sum(p.numel() for p in enc.parameters() if p.requires_grad):,} trainable parameters")
print(f"Decoder: {sum(p.numel() for p in dec.parameters() if p.requires_grad):,} trainable parameters")
print(f"Total: {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters")


# Translation function (for inference, batch size 1)
def translate_sentence(model, src_tensor, src_vocab, trg_vocab, max_len=50):
    model.eval()
    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)

        # Start decoding from the start-of-sequence token
        trg_indexes = [trg_vocab['<sos>']]
        for i in range(max_len):
            trg_tensor = torch.LongTensor([trg_indexes[-1]]).unsqueeze(1).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred_token = output.argmax(1).item()
            trg_indexes.append(pred_token)
            if pred_token == trg_vocab['<eos>']:
                break

    # Convert indexes back to tokens
    idx_to_token = {idx: tok for tok, idx in trg_vocab.items()}
    trg_tokens = [idx_to_token[i] for i in trg_indexes]
    return trg_tokens[1:]  # Remove the <sos> token
```
RNNs with Attention Mechanism: A Step Toward Transformers
The attention mechanism, introduced by Bahdanau et al. in 2014, was a critical advancement that addressed limitations of the encoder-decoder architecture, particularly for long sequences.
The Problem: Information Bottleneck
In the basic encoder-decoder architecture, the entire source sequence is compressed into a fixed-size vector, creating an information bottleneck.
Attention Mechanism: The Bridge to Transformers
Attention allows the decoder to "focus" on different parts of the source sequence at each decoding step. This was the conceptual breakthrough that led to transformers.
Note: This is encoder-decoder attention between RNNs. In our next lesson on transformers, we'll see how this concept evolved into self-attention, where sequences attend to themselves.
Mathematical Formulation
1. Calculate alignment scores between the decoder state $s_{t-1}$ and each encoder hidden state $h_i$:
$$e_{t,i} = \text{score}(s_{t-1}, h_i)$$
2. Normalize the scores to get attention weights:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}$$
3. Calculate the context vector as the weighted sum of encoder states:
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$$
4. Generate the output using the context vector and the current decoder state:
$$s_t = f(s_{t-1}, y_{t-1}, c_t)$$
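The four steps above in a compact, self-contained sketch. For simplicity it uses a dot-product score; an additive (Bahdanau-style) score works the same way, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hid_dim, src_len = 16, 6

encoder_states = torch.randn(src_len, hid_dim)  # h_1 ... h_n from the encoder
decoder_state = torch.randn(hid_dim)            # s_{t-1}, the current decoder state

# 1. Alignment scores (dot-product score for simplicity)
scores = encoder_states @ decoder_state         # [src_len]

# 2. Attention weights via softmax
alpha = F.softmax(scores, dim=0)                # [src_len], sums to 1

# 3. Context vector as the weighted sum of encoder states
context = alpha @ encoder_states                # [hid_dim]

# 4. The decoder would combine this context with its state to produce the next output
print(alpha)          # which source positions the decoder attends to
print(context.shape)  # torch.Size([16])
```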
The Bridge to Transformers
The attention mechanism was a crucial step toward the transformer architecture:
- Eliminated the bottleneck of fixed-size context vectors
- Allowed direct connections between distant positions
- Introduced the concept of weighted importance between elements
- Provided a foundation for self-attention in transformers
Coming up: In our next lesson, we'll see how this encoder-decoder attention evolved into self-attention, where sequences attend to themselves, leading to the revolutionary transformer architecture.
Limitations of RNNs and the Path to Transformers
Despite their innovations, RNNs (even with LSTM/GRU and attention) have several limitations:
Sequential Processing and Limited Context
RNNs process tokens sequentially, making them inherently difficult to parallelize. Even with gating mechanisms, RNNs struggle to maintain very long-range dependencies.
👆 Use the "RNN Comparison" tab in the Training Visualization above to see how different architectures perform on various metrics like training speed, memory usage, and dependency modeling.
Emergence of Transformers
The transformer architecture addressed these limitations by:
- Parallelization: Processing all tokens simultaneously
- Direct connections: Allowing each position to attend to all positions
- Multi-head attention: Capturing different types of relationships
- Positional encoding: Maintaining sequence order without recurrence
Summary
In this lesson, we've covered:
- The sequential nature of language and why it requires specialized architectures
- Vanilla RNN architecture and its limitations
- LSTM and GRU cells that address the vanishing gradient problem
- Bidirectional RNNs for capturing context from both directions
- Applications in language modeling, sentiment analysis, and machine translation
- Attention mechanisms that paved the way for transformers
- Limitations of RNNs that led to the transformer revolution
RNNs represent a crucial chapter in the evolution of NLP architectures. While transformers have largely superseded them for most tasks, understanding RNNs is essential for appreciating the motivations behind modern architectures and for contexts where their sequential processing and efficiency keep them relevant.
In our next lesson, we'll explore transformers in depth, understanding how they revolutionized NLP and enabled the powerful language models we use today.
Practice Exercises
1. RNN from Scratch:
   - Implement a vanilla RNN in PyTorch
   - Observe the vanishing gradient problem firsthand
   - Compare training stability across different sequence lengths
2. LSTM Language Model:
   - Build a character-level language model using LSTMs
   - Generate text samples and analyze coherence
   - Experiment with temperature settings in sampling
3. Sentiment Analysis Comparison:
   - Implement sentiment classifiers using:
     - Bag-of-words + Logistic Regression
     - Word embeddings + Vanilla RNN
     - Word embeddings + LSTM
     - Word embeddings + Bidirectional LSTM
   - Compare performance and training time
4. Neural Machine Translation:
   - Implement a simple encoder-decoder model for translation
   - Add an attention mechanism
   - Analyze which source words receive attention for different target words
Additional Resources
- Understanding LSTM Networks by Christopher Olah
- The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy
- Sequence to Sequence Learning with Neural Networks by Sutskever et al.
- Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al.
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Chung et al.
- Deep Learning for NLP and Speech Recognition by Kamath et al. (Chapters 7-9)