Overview
In our previous lessons, we've explored word representations from static embeddings to contextual embeddings. But a critical question remains: how do we effectively process sequences of these word representations to understand the full meaning of sentences, paragraphs, and documents?
This lesson introduces Recurrent Neural Networks (RNNs), the foundational architecture for sequential data processing in NLP. Before transformers became the dominant paradigm, RNNs and their variants (LSTM, GRU) were the state-of-the-art for tasks like language modeling, machine translation, and sentiment analysis.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why sequential data requires specialized neural architectures
- Explain the basic RNN architecture and its recurrence mechanism
- Describe the vanishing/exploding gradient problems in vanilla RNNs
- Compare LSTM and GRU architectures and their advantages
- Implement RNN variants for common NLP tasks
- Recognize the limitations that led to the transformer revolution
The Sequential Nature of Language
The Challenge of Variable-Length Input
Traditional neural networks expect fixed-size inputs, but language is inherently variable in length:
- Sentences can be short ("I agree.") or very long
- Documents can range from tweets to novels
- Conversations can have arbitrary turns and lengths
How do we design neural networks that can handle this variability while preserving the sequential relationships?
Analogy: Understanding Music
Consider how you understand music. A single note in isolation gives limited information, but as you hear sequences of notes, you build an understanding of the melody, rhythm, and emotional tone.
If you were to hear only random isolated notes, you'd lose the temporal patterns that make music meaningful. Similarly, to understand language, we need to process words not in isolation, but as part of a meaningful sequence while maintaining the memory of what came before.
Why Feed-Forward Networks Fall Short
| Requirement | Feed-Forward Networks | Recurrent Networks |
|---|---|---|
| Variable-length input | Fixed input size | Can handle any sequence length |
| Parameter sharing across positions | Separate weights for each input position | Same weights reused at every time step |
| Memory of previous inputs | No memory mechanism | State vector carries information forward |
| Order sensitivity | Order agnostic | Order matters |
| Position awareness | No positional awareness | Position implicitly encoded through recurrence |
Recurrent Neural Networks: The Basic Architecture
The Recurrence Mechanism
The key innovation in RNNs is the recurrence mechanism: the network maintains a hidden state (or "memory") that is updated at each time step based on both the current input and the previous hidden state.
Interactive RNN Architecture Explorer
Explore different RNN architectures and see how they evolved to solve various problems:
💡 Tip: Use the tabs above to compare vanilla RNNs, LSTMs, GRUs, and bidirectional variants. We'll explore training dynamics and sequence processing with additional tools as we progress through the lesson.
Mathematical Formulation
At each time step $t$, the vanilla RNN computes:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = g(W_{hy} h_t + b_y)$$

Where:
- $x_t$ is the input at time step $t$ (e.g., a word embedding)
- $h_t$ is the hidden state at time step $t$
- $h_{t-1}$ is the hidden state from the previous time step
- $y_t$ is the output at time step $t$
- $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices
- $b_h$ and $b_y$ are bias vectors
- $f$ is typically the tanh or ReLU activation function
- $g$ is an output activation function (e.g., softmax for classification)
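To make the formulation concrete, here is a minimal sketch of a single vanilla RNN step in PyTorch. The tensor names (`W_xh`, `W_hh`, `W_hy`) mirror the symbols above and are purely illustrative, not a library API:

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, output_size = 8, 16, 4

# Weight matrices and biases from the formulation above
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
W_hy = torch.randn(output_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = torch.softmax(W_hy @ h_t + b_y, dim=0)  # output distribution
    return h_t, y_t

x_t = torch.randn(input_size)      # e.g., a word embedding
h_prev = torch.zeros(hidden_size)  # initial hidden state
h_t, y_t = rnn_step(x_t, h_prev)
print(h_t.shape, y_t.shape)  # torch.Size([16]) torch.Size([4])
```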
Parameter Sharing
A key advantage of RNNs is parameter sharing across time steps. The same weights are used at each step, which:
- Drastically reduces the number of parameters
- Allows processing sequences of any length
- Enables the network to recognize patterns regardless of position
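As a quick illustration of parameter sharing, the sketch below (using PyTorch's built-in `nn.RNN` with illustrative sizes) shows that one module with a fixed set of weights can process sequences of any length; only the hidden state changes from step to step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# The same weight matrices are applied at every time step,
# so a single module handles sequences of any length.
short_seq = torch.randn(1, 5, 8)   # batch of 1, 5 time steps
long_seq = torch.randn(1, 50, 8)   # batch of 1, 50 time steps

out_short, _ = rnn(short_seq)
out_long, _ = rnn(long_seq)

print(out_short.shape)  # torch.Size([1, 5, 16])
print(out_long.shape)   # torch.Size([1, 50, 16])
# Parameter count does not depend on sequence length
print(sum(p.numel() for p in rnn.parameters()))
```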
Training RNNs: Backpropagation Through Time (BPTT)
RNNs are trained using an extension of backpropagation called Backpropagation Through Time (BPTT), which unfolds the recurrent network through time and treats it as a deep feed-forward network.
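The following small experiment (an illustrative setup, not part of the lesson's later code) unrolls a vanilla RNN over a long sequence and inspects the gradient of a loss at the final step with respect to earlier inputs. With a tanh recurrence, the gradient norm typically shrinks as the distance grows, which is exactly the vanishing-gradient effect that BPTT exposes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=32, batch_first=True)

seq_len = 60
x = torch.randn(1, seq_len, 4, requires_grad=True)

out, _ = rnn(x)
loss = out[:, -1].sum()  # loss depends only on the final time step
loss.backward()          # BPTT: gradients flow back through every step

# Gradient norm w.r.t. the input at selected time steps:
# earlier steps typically receive much smaller gradients.
for t in [0, 20, 40, 59]:
    print(f"step {t:2d}: grad norm = {x.grad[0, t].norm().item():.2e}")
```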
Training Dynamics: Backpropagation Through Time
Now let's explore the training challenges that led to LSTM and GRU innovations:
💡 Tip: Use the tabs above to explore gradient flow problems and compare how different architectures handle training challenges. This visualization shows why vanilla RNNs struggle with long sequences.
Long Short-Term Memory (LSTM): Solving the Long-Term Dependency Problem
To address the vanishing gradient problem, Hochreiter and Schmidhuber introduced the Long Short-Term Memory (LSTM) architecture in 1997. LSTMs use a more complex recurrent unit with gates that control information flow.
LSTM Architecture
👆 Use the Architecture Explorer above and select "LSTM" to see the detailed gate structure and how it differs from vanilla RNNs.
The Gate Mechanism
An LSTM cell contains three gates that regulate information flow:
- Forget Gate: Decides what information to discard from the cell state
- Input Gate: Decides what new information to store in the cell state
- Output Gate: Decides what parts of the cell state to output
Mathematical Formulation
For input $x_t$ at time step $t$:

Forget Gate:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Input Gate:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Cell State Update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Output Gate:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Where:
- $\sigma$ is the sigmoid function
- $\odot$ represents element-wise multiplication
- $C_t$ is the cell state at time $t$
- $h_t$ is the hidden state at time $t$
- $W_*$ and $b_*$ are the gate-specific weight matrices and bias vectors
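Here is a minimal sketch of one LSTM step written out gate-by-gate to mirror the equations above. It is meant for illustration; `nn.LSTM` and `nn.LSTMCell` implement the same gating far more efficiently:

```python
import torch
import torch.nn as nn

class ManualLSTMCell(nn.Module):
    """One LSTM step with explicit forget, input, and output gates."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Each gate has its own weights over the concatenated [h_{t-1}, x_t]
        self.forget = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_ = nn.Linear(input_size + hidden_size, hidden_size)
        self.cell = nn.Linear(input_size + hidden_size, hidden_size)
        self.output = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        f_t = torch.sigmoid(self.forget(hx))   # forget gate
        i_t = torch.sigmoid(self.input_(hx))   # input gate
        c_tilde = torch.tanh(self.cell(hx))    # candidate cell state
        c_t = f_t * c_prev + i_t * c_tilde     # cell state update
        o_t = torch.sigmoid(self.output(hx))   # output gate
        h_t = o_t * torch.tanh(c_t)            # new hidden state
        return h_t, c_t

cell = ManualLSTMCell(input_size=8, hidden_size=16)
x_t = torch.randn(2, 8)                # batch of 2
h = c = torch.zeros(2, 16)
h, c = cell(x_t, h, c)
print(h.shape, c.shape)  # torch.Size([2, 16]) torch.Size([2, 16])
```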
Memory Management Analogy
Think of the LSTM cell as a skilled personal assistant managing your information flow:
- Forget Gate: Like clearing your desk of irrelevant papers
- Input Gate: Like deciding which new information deserves to be filed away
- Cell State: Like your organized filing cabinet of important information
- Output Gate: Like preparing a briefing of only the relevant information you need right now
Addressing Long-Term Dependencies
LSTMs excel at capturing long-term dependencies through their explicit memory mechanism. The combination of the cell state (long-term memory) and hidden state (working memory) allows LSTMs to maintain relevant information across many time steps while forgetting irrelevant details.
Gated Recurrent Unit (GRU): A Streamlined Alternative
Introduced in 2014 by Cho et al., the Gated Recurrent Unit (GRU) is a simplified variant of the LSTM that combines the forget and input gates into a single "update gate."
GRU Architecture
👆 Use the Architecture Explorer above and select "GRU" to see how it simplifies the LSTM design while maintaining effectiveness.
Mathematical Formulation
For input $x_t$ at time step $t$:

Update Gate:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

Reset Gate:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

Candidate Hidden State:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$

Final Hidden State:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
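A corresponding sketch of one GRU step, written gate-by-gate to mirror the equations above (illustrative only; `nn.GRU` implements the gating internally with a slightly different parameterization):

```python
import torch
import torch.nn as nn

class ManualGRUCell(nn.Module):
    """One GRU step with explicit update and reset gates."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.update = nn.Linear(input_size + hidden_size, hidden_size)
        self.reset = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.update(hx))    # update gate
        r_t = torch.sigmoid(self.reset(hx))     # reset gate
        h_tilde = torch.tanh(self.candidate(
            torch.cat([r_t * h_prev, x_t], dim=-1)))  # candidate state
        h_t = (1 - z_t) * h_prev + z_t * h_tilde      # interpolate old and new
        return h_t

cell = ManualGRUCell(input_size=8, hidden_size=16)
h = torch.zeros(2, 16)
h = cell(torch.randn(2, 8), h)
print(h.shape)  # torch.Size([2, 16])
```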
LSTM vs. GRU: Comparison
| Feature | LSTM | GRU |
|---|---|---|
| Parameters | More (4 sets of weights and biases) | Fewer (3 sets of weights and biases) |
| Memory unit | Cell state and hidden state | Hidden state only |
| Gates | Forget, input, and output gates | Update and reset gates |
| Training speed | Slower | Faster |
| Performance on very long dependencies | Slightly better | Good |
| Computational efficiency | More computation | Less computation |
Note: GRUs typically train faster and require fewer parameters, but LSTMs may perform better on certain tasks, especially those requiring fine-grained memory control.
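The parameter difference in the table is easy to verify directly. A quick sketch comparing `nn.LSTM` and `nn.GRU` with identical sizes (the ratio is roughly 4:3, matching the four vs. three sets of gate weights):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

print(f"LSTM parameters: {count_params(lstm):,}")
print(f"GRU parameters:  {count_params(gru):,}")
```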
Bidirectional RNNs: Capturing Context from Both Directions
In many NLP tasks, understanding a word requires context from both past and future words. Bidirectional RNNs process the sequence in both forward and backward directions.
Bidirectional Architecture
👆 Use the Architecture Explorer above and select "Bidirectional" to see how information flows in both directions.
Benefits for NLP Tasks
Bidirectional processing is especially valuable for:
- Named Entity Recognition
- Part-of-Speech Tagging
- Machine Translation
- Question Answering
Example: Disambiguating Word Sense
The word "bank" has different meanings depending on context. Bidirectional RNNs can use both past and future context to determine the correct interpretation.
Example contexts:
- "I went to the bank to deposit money" (financial institution)
- "We sat by the river bank watching the sunset" (edge of water)
- "The pilot had to bank the airplane to the left" (to tilt)
Bidirectional RNNs excel at these disambiguation tasks because they can consider the full sentence context.
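Here is a minimal sketch of a bidirectional LSTM encoder for a tagging-style setup (sizes are illustrative). The forward and backward hidden states are concatenated, so each position's representation sees both its left and right context:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, num_tags = 1000, 64, 128, 9

embedding = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
tagger = nn.Linear(hidden_dim * 2, num_tags)  # forward + backward states are concatenated

tokens = torch.randint(0, vocab_size, (1, 12))  # a batch of one 12-token sentence
states, _ = bilstm(embedding(tokens))
print(states.shape)      # torch.Size([1, 12, 256]) -- 2 * hidden_dim per position
tag_scores = tagger(states)
print(tag_scores.shape)  # torch.Size([1, 12, 9]) -- one tag distribution per token
```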
Common NLP Applications of RNNs
Language Modeling
Language modeling is the task of predicting the next word given a sequence of previous words. This is a fundamental NLP task with applications in:
- Speech recognition
- Machine translation
- Text generation
- Spelling correction
Code Example: Simple Character-Level Language Model
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Sample text data
text = """Natural language processing (NLP) is a subfield of linguistics, computer
science, and artificial intelligence concerned with the interactions between
computers and human language, in particular how to program computers to process
and analyze large amounts of natural language data."""

# Prepare character-level data
chars = sorted(list(set(text)))
char_to_idx = {char: i for i, char in enumerate(chars)}
idx_to_char = {i: char for i, char in enumerate(chars)}
vocab_size = len(chars)


class CharLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super(CharLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.vocab_size = vocab_size

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        # LSTM layer(s)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True, dropout=0.2)
        # Output layer
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        batch_size = x.size(0)
        if hidden is None:
            h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
            c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
            hidden = (h0, c0)

        embedded = self.embedding(x)                    # Embedding
        lstm_out, hidden = self.lstm(embedded, hidden)  # LSTM forward pass
        output = self.fc(lstm_out)                      # Output layer
        return output, hidden


# Create training sequences
def create_sequences(text, seq_length=40):
    sequences = []
    targets = []
    for i in range(len(text) - seq_length):
        seq = text[i:i + seq_length]
        target = text[i + 1:i + seq_length + 1]
        sequences.append([char_to_idx[char] for char in seq])
        targets.append([char_to_idx[char] for char in target])
    return torch.tensor(sequences), torch.tensor(targets)


# Prepare data
seq_length = 40
X, y = create_sequences(text, seq_length)

# Initialize model
model = CharLSTM(vocab_size, hidden_size=128, num_layers=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.002)


# Training function
def train_model(model, X, y, epochs=50, batch_size=32):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        hidden = None
        for i in range(0, len(X) - batch_size, batch_size):
            batch_X = X[i:i + batch_size]
            batch_y = y[i:i + batch_size]

            # Reset gradients
            optimizer.zero_grad()

            # Forward pass
            output, hidden = model(batch_X, hidden)

            # Detach hidden state to prevent backprop through the entire history
            hidden = tuple(h.detach() for h in hidden)

            # Calculate loss
            loss = criterion(output.view(-1, vocab_size), batch_y.view(-1))

            # Backward pass with gradient clipping
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()

            total_loss += loss.item()

        if epoch % 10 == 0:
            print(f'Epoch {epoch}, Loss: {total_loss / len(X):.4f}')


# Text generation function
def generate_text(model, seed_text, length=100, temperature=0.8):
    model.eval()
    # Convert seed text to indices
    current_seq = [char_to_idx[char] for char in seed_text[-seq_length:]]
    generated = seed_text

    with torch.no_grad():
        hidden = None
        for _ in range(length):
            # Prepare input
            x = torch.tensor([current_seq]).long()

            # Forward pass
            output, hidden = model(x, hidden)

            # Apply temperature to the logits of the last position
            logits = output[0, -1] / temperature
            probs = F.softmax(logits, dim=0)

            # Sample next character
            next_char_idx = torch.multinomial(probs, 1).item()
            next_char = idx_to_char[next_char_idx]

            # Update sequence and generated text
            generated += next_char
            current_seq = current_seq[1:] + [next_char_idx]

    return generated


# Example usage:
# train_model(model, X, y, epochs=50)
# generated_text = generate_text(model, "Natural language processing is ", length=100)
# print(generated_text)

print("Model architecture:")
print(model)
```
Sentiment Analysis
Sentiment analysis determines the emotional tone behind text, often used for customer reviews, social media monitoring, and brand analysis.
Code Example: Sentiment Classification with LSTM
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Sample data
texts = [
    "This movie was fantastic! I really enjoyed it.",
    "The plot was intriguing and kept me engaged.",
    "Terrible movie, waste of time and money.",
    "I hated the characters and the story made no sense.",
    "The acting was superb and the cinematography was beautiful.",
    "What a disappointment, I expected much better.",
    "Amazing storyline with incredible character development.",
    "Boring and predictable, couldn't wait for it to end.",
    "Outstanding performances from all the actors.",
    "The worst movie I've ever seen in my life."
]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # 1 for positive, 0 for negative


class TextTokenizer:
    def __init__(self, max_words=1000):
        self.max_words = max_words
        self.word_to_idx = {}
        self.idx_to_word = {}

    def fit_on_texts(self, texts):
        # Count word frequencies
        word_count = {}
        for text in texts:
            for word in text.lower().split():
                word_count[word] = word_count.get(word, 0) + 1

        # Sort by frequency and take the top words (reserving 2 slots for special tokens)
        sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
        vocab_words = [word for word, count in sorted_words[:self.max_words - 2]]

        # Build vocabulary
        self.word_to_idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx_to_word = {0: '<PAD>', 1: '<UNK>'}
        for i, word in enumerate(vocab_words):
            self.word_to_idx[word] = i + 2
            self.idx_to_word[i + 2] = word

    def texts_to_sequences(self, texts):
        sequences = []
        for text in texts:
            words = text.lower().split()
            sequence = [self.word_to_idx.get(word, 1) for word in words]  # 1 is <UNK>
            sequences.append(sequence)
        return sequences


def pad_sequences(sequences, maxlen=None, padding='post', truncating='post'):
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[:maxlen] if truncating == 'post' else seq[-maxlen:]
        if padding == 'post':
            seq = seq + [0] * (maxlen - len(seq))
        else:
            seq = [0] * (maxlen - len(seq)) + seq
        padded.append(seq)
    return torch.tensor(padded, dtype=torch.long)


# Tokenize the texts
max_words = 100
max_len = 20
tokenizer = TextTokenizer(max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=max_len)
y = torch.tensor(labels, dtype=torch.float32)


class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=64, num_layers=1, dropout=0.2):
        super(SentimentLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0)
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(hidden_dim, 32)
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        # Embedding
        embedded = self.embedding(x)

        # LSTM
        lstm_out, (hidden, _) = self.lstm(embedded)

        # Take the last non-padded output for each sequence
        batch_size = x.size(0)
        seq_lengths = (x != 0).sum(dim=1)  # actual sequence lengths

        last_outputs = []
        for i in range(batch_size):
            last_idx = max(0, seq_lengths[i] - 1)
            last_outputs.append(lstm_out[i, last_idx, :])
        last_hidden = torch.stack(last_outputs)

        # Fully connected layers
        x = self.dropout(last_hidden)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.sigmoid(self.fc2(x))
        return x.squeeze()


# Initialize model
vocab_size = len(tokenizer.word_to_idx)
model = SentimentLSTM(vocab_size, embedding_dim=128, hidden_dim=64, dropout=0.2)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


# Training function
def train_sentiment_model(model, X, y, epochs=100, batch_size=4):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        # Create batches
        for i in range(0, len(X), batch_size):
            batch_X = X[i:i + batch_size]
            batch_y = y[i:i + batch_size]

            # Reset gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)

            # Backward pass
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        if epoch % 20 == 0:
            print(f'Epoch {epoch}, Loss: {total_loss:.4f}')


# Prediction function
def predict_sentiment(model, tokenizer, text, max_len=20):
    model.eval()
    # Tokenize and pad the input text
    sequence = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequence, maxlen=max_len)
    with torch.no_grad():
        prediction = model(padded).item()
    return {
        'text': text,
        'positive_probability': prediction,
        'negative_probability': 1 - prediction,
        'sentiment': 'Positive' if prediction > 0.5 else 'Negative'
    }


# Example usage:
# train_sentiment_model(model, X, y, epochs=100)
# result = predict_sentiment(model, tokenizer, "This movie was absolutely amazing!")
# print(result)

print("Model architecture:")
print(model)
```
Machine Translation with Encoder-Decoder Architecture
Machine translation uses a sequence-to-sequence (Seq2Seq) architecture with an encoder RNN and a decoder RNN.
Interactive Translation Demo
See how RNN encoder-decoder models with attention work for machine translation:
💡 Tip: This tool demonstrates the attention mechanism that became the foundation for transformers. Notice how the decoder "attends" to different parts of the source sequence when generating each target word.
Code Example: Simple Encoder-Decoder for Translation
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random

# Simple Encoder-Decoder with Attention for Neural Machine Translation

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [batch size, src len]
        embedded = self.dropout(self.embedding(src))
        # embedded = [batch size, src len, emb dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs = [batch size, src len, hid dim]
        # hidden = [n layers, batch size, hid dim]
        # cell = [n layers, batch size, hid dim]
        return outputs, hidden, cell


class Attention(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        self.attn = nn.Linear(hid_dim * 2, hid_dim)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden = [batch size, hid dim]
        # encoder_outputs = [batch size, src len, hid dim]
        src_len = encoder_outputs.shape[1]

        # Repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        # hidden = [batch size, src len, hid dim]

        # Calculate energy
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        # energy = [batch size, src len, hid dim]

        attention = self.v(energy).squeeze(2)
        # attention = [batch size, src len]
        return F.softmax(attention, dim=1)


class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(hid_dim + emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc_out = nn.Linear(hid_dim * 2 + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell, encoder_outputs):
        # input = [batch size, 1]
        # hidden = [n layers, batch size, hid dim]
        # cell = [n layers, batch size, hid dim]
        # encoder_outputs = [batch size, src len, hid dim]
        embedded = self.dropout(self.embedding(input))
        # embedded = [batch size, 1, emb dim]

        # Calculate attention weights over the encoder outputs
        a = self.attention(hidden[-1], encoder_outputs)
        # a = [batch size, src len]
        a = a.unsqueeze(1)
        # a = [batch size, 1, src len]

        weighted = torch.bmm(a, encoder_outputs)
        # weighted = [batch size, 1, hid dim]

        rnn_input = torch.cat((embedded, weighted), dim=2)
        # rnn_input = [batch size, 1, hid dim + emb dim]

        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        # output = [batch size, 1, hid dim]

        # Calculate prediction from the RNN output, context vector, and embedding
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=2))
        # prediction = [batch size, 1, output dim]
        return prediction.squeeze(1), hidden, cell


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [batch size, src len]
        # trg = [batch size, trg len]
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim

        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        # Encoder forward pass
        encoder_outputs, hidden, cell = self.encoder(src)

        # First input to the decoder is the <sos> tokens
        input = trg[:, 0].unsqueeze(1)

        for t in range(1, trg_len):
            # Insert input token embedding, previous hidden and cell states, and encoder outputs
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)

            # Place predictions in a tensor holding predictions for each token
            outputs[:, t] = output

            # Decide whether to use teacher forcing for the next step
            teacher_force = random.random() < teacher_forcing_ratio

            # Get the highest-probability predicted token
            top1 = output.argmax(1)

            # If teacher forcing, use the actual next token as the next input; otherwise use the prediction
            input = trg[:, t].unsqueeze(1) if teacher_force else top1.unsqueeze(1)

        return outputs


# Model parameters
INPUT_DIM = 1000   # Source vocabulary size
OUTPUT_DIM = 1000  # Target vocabulary size
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Initialize model components
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
attn = Attention(HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)


# Training function
def train_translation_model(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src, trg = batch
        src, trg = src.to(device), trg.to(device)

        optimizer.zero_grad()
        output = model(src, trg)
        # trg = [batch size, trg len]
        # output = [batch size, trg len, output dim]
        output_dim = output.shape[-1]

        # Skip the <sos> position and reshape for the loss calculation
        output = output[:, 1:].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)


# Example initialization
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Assuming 0 is the padding token

print("Seq2Seq Model with Attention:")
print(f"Encoder: {sum(p.numel() for p in enc.parameters() if p.requires_grad):,} trainable parameters")
print(f"Decoder: {sum(p.numel() for p in dec.parameters() if p.requires_grad):,} trainable parameters")
print(f"Total: {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters")


# Translation function (for inference, batch size 1)
def translate_sentence(model, src_tensor, src_vocab, trg_vocab, max_len=50):
    model.eval()
    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)

        # Start decoding from the start-of-sequence token
        trg_indexes = [trg_vocab['<sos>']]
        for i in range(max_len):
            trg_tensor = torch.LongTensor([trg_indexes[-1]]).unsqueeze(1).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred_token = output.argmax(1).item()
            trg_indexes.append(pred_token)
            if pred_token == trg_vocab['<eos>']:
                break

    # Convert indexes back to tokens
    idx_to_token = {idx: tok for tok, idx in trg_vocab.items()}
    trg_tokens = [idx_to_token[i] for i in trg_indexes]
    return trg_tokens[1:]  # Remove the <sos> token
```
RNNs with Attention Mechanism: A Step Toward Transformers
The attention mechanism, introduced by Bahdanau et al. in 2014, was a critical advancement that addressed limitations of the encoder-decoder architecture, particularly for long sequences.
The Problem: Information Bottleneck
In the basic encoder-decoder architecture, the entire source sequence is compressed into a fixed-size vector, creating an information bottleneck.
Attention Mechanism: The Bridge to Transformers
Attention allows the decoder to "focus" on different parts of the source sequence at each decoding step. This was the conceptual breakthrough that led to transformers.
Note: This is encoder-decoder attention between RNNs. In our next lesson on transformers, we'll see how this concept evolved into self-attention, where sequences attend to themselves.
Mathematical Formulation
1. Calculate alignment scores between the decoder state $s_{t-1}$ and each encoder hidden state $h_i$:
$$e_{t,i} = \text{score}(s_{t-1}, h_i)$$
2. Normalize the scores to get attention weights:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}$$
3. Calculate the context vector as the weighted sum of encoder states:
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$$
4. Generate the output using the context vector and the current decoder state:
$$s_t = f(s_{t-1}, y_{t-1}, c_t)$$
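The four steps above in a compact, self-contained sketch. For simplicity it uses a dot-product score; an additive (Bahdanau-style) score works the same way, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hid_dim, src_len = 16, 6

encoder_states = torch.randn(src_len, hid_dim)  # h_1 ... h_n from the encoder
decoder_state = torch.randn(hid_dim)            # s_{t-1}, the current decoder state

# 1. Alignment scores (dot-product score for simplicity)
scores = encoder_states @ decoder_state         # [src_len]

# 2. Attention weights via softmax
alpha = F.softmax(scores, dim=0)                # [src_len], sums to 1

# 3. Context vector as the weighted sum of encoder states
context = alpha @ encoder_states                # [hid_dim]

# 4. The decoder would combine this context with its state to produce the next output
print(alpha)          # which source positions the decoder attends to
print(context.shape)  # torch.Size([16])
```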
The Bridge to Transformers
The attention mechanism was a crucial step toward the transformer architecture:
- Eliminated the bottleneck of fixed-size context vectors
- Allowed direct connections between distant positions
- Introduced the concept of weighted importance between elements
- Provided a foundation for self-attention in transformers
Coming up: In our next lesson, we'll see how this encoder-decoder attention evolved into self-attention, where sequences attend to themselves, leading to the revolutionary transformer architecture.
Limitations of RNNs and the Path to Transformers
Despite their innovations, RNNs (even with LSTM/GRU and attention) have several limitations:
Sequential Processing and Limited Context
RNNs process tokens sequentially, making them inherently difficult to parallelize. Even with gating mechanisms, RNNs struggle to maintain very long-range dependencies.
👆 Use the "RNN Comparison" tab in the Training Visualization above to see how different architectures perform on various metrics like training speed, memory usage, and dependency modeling.
Emergence of Transformers
The transformer architecture addressed these limitations by:
- Parallelization: Processing all tokens simultaneously
- Direct connections: Allowing each position to attend to all positions
- Multi-head attention: Capturing different types of relationships
- Positional encoding: Maintaining sequence order without recurrence
Summary
In this lesson, we've covered:
- The sequential nature of language and why it requires specialized architectures
- Vanilla RNN architecture and its limitations
- LSTM and GRU cells that address the vanishing gradient problem
- Bidirectional RNNs for capturing context from both directions
- Applications in language modeling, sentiment analysis, and machine translation
- Attention mechanisms that paved the way for transformers
- Limitations of RNNs that led to the transformer revolution
RNNs represent a crucial chapter in the evolution of NLP architectures. While transformers have largely superseded them for most tasks, understanding RNNs is essential for appreciating the motivations behind modern architectures and for contexts where their sequential processing and efficiency keep them relevant.
In our next lesson, we'll explore transformers in depth, understanding how they revolutionized NLP and enabled the powerful language models we use today.
Practice Exercises
1. RNN from Scratch:
   - Implement a vanilla RNN in PyTorch
   - Observe the vanishing gradient problem firsthand
   - Compare training stability across different sequence lengths
2. LSTM Language Model:
   - Build a character-level language model using LSTMs
   - Generate text samples and analyze coherence
   - Experiment with temperature settings in sampling
3. Sentiment Analysis Comparison:
   - Implement sentiment classifiers using:
     - Bag-of-words + Logistic Regression
     - Word embeddings + Vanilla RNN
     - Word embeddings + LSTM
     - Word embeddings + Bidirectional LSTM
   - Compare performance and training time
4. Neural Machine Translation:
   - Implement a simple encoder-decoder model for translation
   - Add an attention mechanism
   - Analyze which source words receive attention for different target words
Additional Resources
- Understanding LSTM Networks by Christopher Olah
- The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy
- Sequence to Sequence Learning with Neural Networks by Sutskever et al.
- Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al.
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Chung et al.
- Deep Learning for NLP and Speech Recognition by Kamath et al. (Chapters 7-9)