Essential NLP Tasks and Applications

Overview

In our previous lessons, we've explored the fundamental components of modern NLP systems, from text preprocessing and tokenization to transformer architectures, different generation techniques, and the evolution of language models. Now it's time to see how these foundational concepts come together in practical applications.

This lesson focuses on practical applications in NLP, covering tasks such as text classification, named entity recognition (NER), and question answering. We'll explore real-world use cases, implementation approaches, and evaluation metrics for each task, providing you with hands-on experience in building and deploying practical NLP solutions.

Learning Objectives

After completing this lesson, you will be able to:

  • Identify common NLP tasks and their appropriate applications
  • Implement text classification solutions for sentiment analysis and topic categorization
  • Develop named entity recognition systems for information extraction
  • Build question answering models for information retrieval
  • Select appropriate evaluation metrics for each NLP task
  • Apply best practices for model selection and deployment

Text Classification: Understanding and Categorizing Content

What is Text Classification?

Text classification is the task of assigning predefined categories to text documents. It's one of the most fundamental and widely used NLP tasks, with applications ranging from sentiment analysis and spam detection to content categorization and intent recognition.

Interactive Exploration: Text Classification in Action

Before diving into the theory, let's explore how text classification works with different preprocessing approaches:

Text Preprocessing Explorer

Preprocessing Steps Applied:

  • Lowercasing: Convert all text to lowercase to maintain consistency.
  • URL Removal: Remove web addresses that typically don't add semantic value.
  • Contraction Expansion: Convert contractions like it's → it is for standardization.
  • Special Character Removal: Remove punctuation and non-alphabetic characters.
  • Repeated Character Normalization: Reduce repeated letters (loooove → love) to standardize words.
  • Whitespace Normalization: Remove extra spaces and standardize spacing.


Try different text inputs to see how preprocessing affects the features available for classification.
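
The steps above can be chained into a small preprocessing function. Below is a minimal sketch; the regular expressions and the tiny contraction map are illustrative, not exhaustive:

import re

CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}  # small illustrative map

def preprocess(text):
    text = text.lower()                                   # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # URL removal
    for contraction, expansion in CONTRACTIONS.items():   # contraction expansion
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)                 # special character removal
    text = re.sub(r"(.)\1{2,}", r"\1", text)              # repeated character normalization
    text = re.sub(r"\s+", " ", text).strip()              # whitespace normalization
    return text

print(preprocess("I loooove this!!! It's great: https://example.com"))
# -> "i love this it is great"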

Analogy: Library Organization System

Think of text classification like a library's organization system:

  • Each book (document) needs to be placed in the right section (category)
  • Librarians (classifiers) use features like the book's content, title, and keywords
  • The goal is to make it easy for visitors to find books relevant to their interests
  • A well-organized library makes information retrieval efficient and accurate

Just as libraries organize books by genre or subject matter, text classification systems organize text by relevant categories, making it possible to efficiently process and retrieve large volumes of textual information.

Types of Text Classification Tasks

  • Sentiment Analysis: Identifying the emotional tone or opinion in text. Examples: Positive/Negative/Neutral. Common applications: customer feedback analysis, social media monitoring.
  • Topic Classification: Categorizing text by subject matter. Examples: Sports, Politics, Technology, Entertainment. Common applications: content recommendation, news aggregation.
  • Intent Recognition: Identifying the purpose or goal of a text. Examples: Purchase, Information, Support. Common applications: customer service automation, chatbots.
  • Language Identification: Determining the language of a text. Examples: English, Spanish, French. Common applications: multilingual content routing, translation services.
  • Spam Detection: Identifying unwanted or harmful messages. Examples: Spam/Not Spam. Common applications: email filtering, content moderation.

Text Classification Approaches

Traditional Machine Learning Approaches

Traditional approaches to text classification typically follow these steps:

  1. Feature Extraction: Convert text to numerical vectors using techniques like:

    • Bag of Words (BoW)
    • TF-IDF (Term Frequency-Inverse Document Frequency)
    • N-grams
  2. Model Training: Train a classifier using algorithms such as:

    • Naive Bayes
    • Support Vector Machines (SVM)
    • Random Forests

Example: TF-IDF with SVM Classifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Sample data (replace with your dataset)
texts = [
    "I love this product, it works great!",
    "This is the worst purchase I've ever made",
    "Neutral experience, nothing special about it",
    "Excellent customer service and fast delivery",
    "Completely disappointed with the quality"
]
labels = ["positive", "negative", "neutral", "positive", "negative"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create a pipeline with TF-IDF and SVM
classifier = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('svm', LinearSVC())
])

# Train the classifier
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, predictions))

# Classify a new text
new_text = ["The product exceeded my expectations"]
predicted_class = classifier.predict(new_text)[0]
print(f"The sentiment of '{new_text[0]}' is: {predicted_class}")

Deep Learning Approaches

Modern text classification often uses neural networks and transformer-based models:

  1. Embedding + Neural Networks:

    • Word embeddings (e.g., Word2Vec, GloVe)
    • Convolutional Neural Networks (CNNs)
    • Recurrent Neural Networks (RNNs, LSTMs, GRUs)
  2. Transformer-based Models:

    • Fine-tuning pre-trained models (e.g., BERT, RoBERTa, T5)
    • Adapter-based fine-tuning for efficiency

Example: Fine-tuning BERT for Sentiment Analysis

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
num_labels = 2  # negative, positive (SST-2 is a binary sentiment dataset)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load dataset (using the SST-2 dataset as an example)
dataset = load_dataset('glue', 'sst2')

# Copy the dataset labels (0 = negative, 1 = positive) into the 'labels' column the Trainer expects
def map_labels(example):
    return {'labels': example['label']}

mapped_dataset = dataset.map(map_labels)

# Preprocess the data
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True, padding='max_length', max_length=128)

tokenized_datasets = mapped_dataset.map(preprocess_function, batched=True)

# Define metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    compute_metrics=compute_metrics,
)

# Fine-tune the model
# trainer.train()

# Test with a new example
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probabilities, dim=-1).item()
    # Map prediction index to label
    labels = {0: "negative", 1: "positive"}
    return labels[prediction], probabilities[0][prediction].item()

# Example usage
text = "This movie was fantastic, I really enjoyed it!"
prediction, confidence = predict_sentiment(text)
print(f"Text: '{text}' Sentiment: {prediction} (confidence: {confidence:.2f})")

Modern Model Architecture Exploration

Explore how different transformer architectures approach text classification tasks:


This visualization shows how BERT's bidirectional training enables better text understanding for classification tasks.

Evaluating Text Classification Models

Selecting the right evaluation metrics is crucial for assessing text classification performance. Different metrics emphasize different aspects of performance and are appropriate for different scenarios.

Common Evaluation Metrics

Text classification models are typically evaluated using metrics like:

  • Accuracy: Percentage of correctly classified instances
  • Precision: Proportion of true positives among positive predictions
  • Recall: Proportion of true positives identified among all actual positives
  • F1 Score: Harmonic mean of precision and recall
  • AUC-ROC: Area under the Receiver Operating Characteristic curve

How informative these metrics are can vary significantly between balanced and imbalanced datasets.

Which Metrics to Use When

  1. Accuracy:

    • The proportion of correctly classified instances
    • Best for balanced datasets with equal importance for all classes
    • Can be misleading for imbalanced datasets
  2. Precision:

    • The ratio of true positives to all predicted positives
    • Important when the cost of false positives is high
    • Example: Spam detection (misclassifying legitimate emails is costly)
  3. Recall:

    • The ratio of true positives to all actual positives
    • Important when the cost of false negatives is high
    • Example: Toxic content detection (missing toxic content is costly)
  4. F1 Score:

    • The harmonic mean of precision and recall
    • Balances precision and recall
    • Good for imbalanced datasets
  5. AUC-ROC:

    • Area under the Receiver Operating Characteristic curve
    • Measures discrimination ability across thresholds
    • Less sensitive to class imbalance

Visualizing a confusion matrix (toy data)

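A minimal sketch with toy labels, using scikit-learn to build the confusion matrix and derive per-class precision, recall, and F1 from the same counts:

from sklearn.metrics import confusion_matrix, classification_report

# Toy ground-truth labels and predictions for a 3-class sentiment task
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive", "positive", "neutral"]

labels = ["positive", "negative", "neutral"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print("Confusion matrix (rows = true, columns = predicted):")
for label, row in zip(labels, cm):
    print(f"{label:>9}: {row}")

# Per-class precision, recall, and F1, plus macro averages, from the same counts
print(classification_report(y_true, y_pred, labels=labels))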

Metric selection cheat‑sheet

  • Balanced classes: Accuracy + macro‑F1
  • Imbalanced + high FP cost: Precision, PR‑AUC
  • Imbalanced + high FN cost: Recall, PR‑AUC
  • Multi‑class: macro‑F1 + per‑class F1; include confusion matrix

Handling Common Challenges in Text Classification

Class Imbalance

Many real-world classification problems have imbalanced class distributions:

  1. Resampling Techniques:

    • Oversampling: Duplicate instances from minority classes
    • Undersampling: Remove instances from majority classes
    • SMOTE: Generate synthetic examples for minority classes (see the sketch after this list)
  2. Class Weighting:

    • Assign higher weights to minority classes during training
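
As a minimal resampling sketch, assuming the imbalanced-learn package is installed and using synthetic numeric features as a stand-in for vectorized text, SMOTE can rebalance the training set before the classifier sees it:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for extracted text features (e.g., TF-IDF vectors)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)
print("Class counts before:", Counter(y))

# Generate synthetic minority-class examples until the classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("Class counts after: ", Counter(y_resampled))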

Example: Class Weighting in PyTorch

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Example of imbalanced dataset (80% negative, 20% positive)
num_samples = 1000
X = torch.randn(num_samples, 10)  # 10 features per sample
y = torch.zeros(num_samples)
y[:200] = 1  # Only 20% positive samples

# Calculate class weights
class_counts = np.bincount(y.numpy().astype(int))
total_samples = len(y)
class_weights = torch.FloatTensor([total_samples / (len(class_counts) * count) for count in class_counts])
print(f"Class weights: {class_weights}")

# Create a simple classifier
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 2)
)

# Create weighted loss function
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)

# Create dataloader
dataset = TensorDataset(X, y.long())
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Training loop (just showing setup, not executing)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Example of one training step
for inputs, labels in dataloader:
    # Forward pass
    outputs = model(inputs)
    loss = weighted_loss(outputs, labels)

    # Backward pass (commented out)
    # optimizer.zero_grad()
    # loss.backward()
    # optimizer.step()

    print(f"Loss with class weighting: {loss.item()}")
    break

Named Entity Recognition: Extracting Structure from Text

What is Named Entity Recognition?

Named Entity Recognition (NER) is the task of identifying and classifying named entities in text into predefined categories such as names of persons, organizations, locations, dates, quantities, monetary values, and more.

Analogy: The Highlighter Approach

Think of NER as highlighting different categories of information in text:

  • A researcher reads a document and highlights different types of information with different colors
  • Yellow for people, blue for organizations, green for locations, pink for dates
  • This structured highlighting makes it easy to extract specific information types
  • The researcher must understand context to correctly identify entities

NER systems perform this highlighting automatically, enabling the extraction of structured information from unstructured text.

Applications of NER

NER enables various applications in the NLP ecosystem:

  • Information Extraction: Extracting structured data from unstructured text
  • Knowledge Graph Construction: Identifying entities and relationships for graph databases
  • Question Answering: Extracting entities to answer specific queries
  • Semantic Search: Improving search relevance with entity understanding
  • Content Recommendation: Personalizing content based on entities of interest

Understanding Contextual Representations for NER

Modern NER systems rely heavily on contextual embeddings. Explore how different words can have different meanings based on context:


Notice how the same word can represent different entity types depending on context - this is why contextual embeddings are crucial for accurate NER.
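
A minimal sketch of this effect, assuming the target word maps to a single WordPiece token: compare BERT's contextual vector for the same word in two different sentences.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    # Return the hidden state at the position of the target word
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    position = tokens.index(word)  # assumes the word is a single WordPiece token
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden_states[0, position]

v_company = contextual_vector("apple announced record quarterly profits", "apple")
v_fruit = contextual_vector("she ate an apple with her lunch", "apple")

similarity = torch.nn.functional.cosine_similarity(v_company, v_fruit, dim=0)
print(f"Cosine similarity between the two 'apple' vectors: {similarity.item():.3f}")

Because the vectors differ by context, a token classification head can assign an entity tag to one occurrence and not the other.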

Common NER Entity Types

The standard types of named entities include:

  1. Person (PER): Names of individuals
  2. Organization (ORG): Companies, institutions, agencies
  3. Location (LOC): Countries, cities, geographical features
  4. Date/Time (DATE): Temporal expressions
  5. Money (MONEY): Monetary values
  6. Percentage (PERCENT): Percentage values
  7. Product (PROD): Products, works of art
  8. Event (EVENT): Named events like wars, sports events
  9. Miscellaneous (MISC): Entities that don't fit into other categories

Domain-specific NER systems may include additional categories like genes, proteins, diseases, drugs, etc.
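
As a quick illustration of these categories, an off-the-shelf general-purpose model tags many of them directly. This sketch assumes spaCy and its small English model en_core_web_sm are installed (note that spaCy labels cities and countries as GPE rather than LOC):

import spacy

# Load a small general-purpose English pipeline with a pretrained NER component
nlp = spacy.load("en_core_web_sm")

text = "Apple bought the startup for $2 billion on March 3, 2023 in San Francisco."
doc = nlp(text)

# Each entity span carries its text and a label such as ORG, MONEY, DATE, or GPE
for ent in doc.ents:
    print(f"{ent.text:>20}  {ent.label_}")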

NER Approaches

Traditional Sequence Labeling Approaches

NER is typically framed as a sequence labeling problem, where each token is assigned a tag:

  1. BIO Tagging Scheme:

    • B-X: Beginning of entity of type X
    • I-X: Inside of entity of type X
    • O: Outside of any entity
  2. Traditional Models:

    • Hidden Markov Models (HMMs)
    • Conditional Random Fields (CRFs)
    • Maximum Entropy Markov Models (MEMMs)

Example: CRF-based NER

from sklearn_crfsuite import CRF

# Example data (usually this would be much larger)
train_data = [
    [("Apple", "B-ORG"), ("Inc.", "I-ORG"), ("is", "O"), ("based", "O"), ("in", "O"),
     ("Cupertino", "B-LOC"), ("California", "B-LOC"), (".", "O")],
    [("Tim", "B-PER"), ("Cook", "I-PER"), ("is", "O"), ("the", "O"), ("CEO", "O"),
     ("of", "O"), ("Apple", "B-ORG"), (".", "O")]
]

# Test data
test_data = [
    [("Microsoft", "B-ORG"), ("is", "O"), ("based", "O"), ("in", "O"), ("Redmond", "B-LOC"), (".", "O")]
]

# Feature extraction function
def word2features(sent, i):
    word = sent[i][0]

    # Basic features
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit()
    }

    # Features for previous word
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper()
        })
    else:
        features['BOS'] = True

    # Features for next word
    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper()
        })
    else:
        features['EOS'] = True

    return features

# Extract features for a sentence
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Extract labels from a sentence
def sent2labels(sent):
    return [label for token, label in sent]

# Prepare data
X_train = [sent2features(s) for s in train_data]
y_train = [sent2labels(s) for s in train_data]
X_test = [sent2features(s) for s in test_data]
y_test = [sent2labels(s) for s in test_data]

# Train CRF model
crf = CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

# Make predictions
y_pred = crf.predict(X_test)

# Print predictions
for i, sentence in enumerate(test_data):
    print("Sentence: ", " ".join([word for word, _ in sentence]))
    print("Predicted:", " ".join(y_pred[i]))
    print("Actual:   ", " ".join(y_test[i]))
    print()

Deep Learning Approaches for NER

Modern NER systems use neural network architectures:

  1. Bi-LSTM-CRF:

    • Bidirectional LSTM to capture context in both directions
    • CRF layer to model label dependencies
    • Often combined with word and character embeddings
  2. Transformer-based Models:

    • Fine-tuning pre-trained models like BERT, RoBERTa, XLNet
    • Token classification heads for sequence labeling
    • Contextual embeddings capture rich semantic information

Example: BERT for NER

from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import pipeline
import torch

# Load pre-trained model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Test on an example
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in Cupertino, California."
results = ner_pipeline(text)

# Process and print results
for entity in results:
    print(f"Entity: {entity['word']}")
    print(f"Type: {entity['entity_group']}")
    print(f"Confidence: {entity['score']:.4f}")
    print(f"Start-End: {entity['start']}-{entity['end']}")
    print()

# Manual prediction with more control
def predict_entities(text):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted labels
    predictions = torch.argmax(outputs.logits, dim=2)

    # Convert IDs to labels
    predicted_labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]

    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Align tokens with predictions
    aligned_predictions = []
    current_entity = None

    for token, label in zip(tokens, predicted_labels):
        # Skip special tokens
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            continue

        # Handle subword tokens (starting with ##)
        if token.startswith("##"):
            if current_entity:
                current_entity["word"] += token[2:]
            continue

        # B- prefix indicates beginning of entity
        if label.startswith("B-"):
            if current_entity:
                aligned_predictions.append(current_entity)
            current_entity = {"word": token, "label": label[2:]}
        # I- prefix indicates continuation of entity
        elif label.startswith("I-") and current_entity and current_entity["label"] == label[2:]:
            current_entity["word"] += " " + token
        # O indicates outside any entity
        elif label == "O":
            if current_entity:
                aligned_predictions.append(current_entity)
            current_entity = None

    # Add final entity if exists
    if current_entity:
        aligned_predictions.append(current_entity)

    return aligned_predictions

# Example usage of manual prediction
manual_predictions = predict_entities(text)
print("Manual Prediction Results:")
for entity in manual_predictions:
    print(f"Entity: {entity['word']} - Type: {entity['label']}")

Evaluating NER Systems

Evaluation for NER requires special consideration:

Entity-Level vs. Token-Level Evaluation

  1. Token-level metrics: Evaluate each token's prediction independently

    • Standard precision, recall, F1 score for each token
    • Doesn't account for entity boundaries
  2. Entity-level metrics: Evaluate complete entity predictions

    • An entity is correct only if both the type and boundaries are correct
    • More reflective of real-world performance

Common NER Evaluation Metrics

  • Token-level F1: F1 score calculated for each token independently. Use when token classification accuracy is important. Limitation: doesn't account for entity boundaries.
  • Entity-level F1: F1 score for complete entity predictions. Use when complete entity extraction is important. Limitation: strict matching may be too rigid.
  • Partial Match F1: F1 score allowing partial entity matches. Use when partial entity recognition is acceptable. Limitation: may overstate performance.
  • Type-only F1: F1 score considering only entity types. Use when entity type matters more than exact boundaries. Limitation: ignores boundary errors.

Calculating Entity-level F1 Score

from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

# Example: ground truth and predictions
true_tags = [
    ['O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'O', 'B-PER', 'I-PER', 'O', 'O']
]
pred_tags = [
    ['O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'O'],      # Incomplete ORG entity
    ['O', 'B-LOC', 'I-LOC', 'O', 'B-ORG', 'I-ORG', 'O', 'O']   # Misclassified PER as ORG
]

# Calculate metrics
print(classification_report(true_tags, pred_tags))
print(f"Overall F1: {f1_score(true_tags, pred_tags):.4f}")
print(f"Precision: {precision_score(true_tags, pred_tags):.4f}")
print(f"Recall: {recall_score(true_tags, pred_tags):.4f}")

# Calculate metrics for specific entity types
print("\nMetrics by entity type:")
entity_types = ['PER', 'ORG', 'LOC']
for entity_type in entity_types:
    # Keep only tags of this entity type; treat everything else as 'O'
    true_filtered = [[t if t == f'B-{entity_type}' or t == f'I-{entity_type}' else 'O' for t in seq] for seq in true_tags]
    pred_filtered = [[t if t == f'B-{entity_type}' or t == f'I-{entity_type}' else 'O' for t in seq] for seq in pred_tags]

    # Calculate metrics
    entity_f1 = f1_score(true_filtered, pred_filtered)
    print(f"{entity_type} F1: {entity_f1:.4f}")

Tokenization Impact on NLP Tasks

Before moving to Question Answering, let's explore how different tokenization strategies affect NLP task performance:

Tokenizer Comparison

Compare different tokenization methods easily.

Different tokenization approaches can significantly impact model performance. Notice how subword tokenization handles out-of-vocabulary words better than word-level tokenization.
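
A minimal sketch of this difference, assuming the Hugging Face tokenizers for bert-base-uncased and gpt2 are available locally or can be downloaded:

from transformers import AutoTokenizer

# A sentence containing a rare word that is unlikely to be in any fixed word vocabulary
text = "The anticoagulant dabigatran was discontinued."

# Naive word-level tokenization: the rare word stays as one unseen token
print("word-level:", text.lower().split())

# Subword tokenizers break unknown words into known pieces instead of mapping them to <unk>
for name in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}:", tokenizer.tokenize(text))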

Question Answering: Finding Answers in Context

What is Question Answering?

Question Answering (QA) is the task of providing accurate answers to questions based on relevant context. Modern QA systems can extract answers from provided passages, retrieve relevant documents from a large corpus, or generate answers based on their learned knowledge.

Analogy: The Helpful Librarian

Think of QA systems as skilled librarians:

  • A librarian listens to your question and understands what you're looking for
  • They search through their collection to find relevant information
  • They can point to a specific passage in a book or synthesize information from multiple sources
  • They return a precise answer rather than simply a stack of books

Just as good librarians save time by providing direct answers rather than making users read entire books, QA systems extract or generate the specific information users need.

Types of Question Answering Systems

  • Extractive QA: Extracts answer spans from provided context. Input: question + context passage. Output: text span from the context. Examples: SQuAD, BERT QA.
  • Retrieval QA: Retrieves documents, then extracts answers. Input: question only. Output: answer from retrieved documents. Examples: DrQA, RAG.
  • Generative QA: Generates free-text answers. Input: question (+ optional context). Output: generated text answer. Examples: T5, GPT models.
  • Knowledge-Based QA: Answers from structured knowledge bases. Input: question. Output: answer from the knowledge base. Examples: KGQA systems.

Extractive Question Answering

Extractive QA systems identify spans of text in a context passage that answer a given question:

Example: BERT for Extractive QA

from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

# Load pre-trained model and tokenizer
model_name = "deepset/bert-base-cased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Example question and context
question = "Where was Alan Turing born?"
context = ("Alan Mathison Turing was an English mathematician, computer scientist, logician, "
           "cryptanalyst, philosopher, and theoretical biologist. Turing was born in Maida Vale, "
           "London, while his father was on leave from his position with the Indian Civil Service.")

# Tokenize input
inputs = tokenizer(question, context, return_tensors="pt", max_length=512, truncation=True, padding="max_length")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Get start and end positions
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Get the most likely answer span
start_idx = torch.argmax(start_scores)
end_idx = torch.argmax(end_scores)

# Convert token indices to actual text span
all_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
answer_tokens = all_tokens[start_idx : end_idx + 1]
answer = tokenizer.convert_tokens_to_string(answer_tokens)

# Clean up answer (remove special tokens and extra whitespace)
answer = answer.replace("[CLS]", "").replace("[SEP]", "").strip()

print(f"Question: {question}")
print(f"Answer: {answer}")

# Confidence score
start_score = torch.max(start_scores).item()
end_score = torch.max(end_scores).item()
confidence = (start_score + end_score) / 2
print(f"Confidence: {confidence:.2f}")

Retrieval-based Question Answering

Retrieval QA combines information retrieval with answer extraction:

  1. Document Retrieval: Find relevant documents from a corpus
  2. Passage Ranking: Identify the most relevant passages
  3. Answer Extraction: Extract the specific answer from top passages

Architecture of a Retrieval QA System

A typical retrieval QA system includes:

  • Question processing
  • Document retrieval component
  • Passage ranking system
  • Answer extraction module
  • Document corpus database

The system flows from question to retriever to ranker to reader to final answer, with the document corpus feeding into the retriever.

Example: Simple Retrieval QA with Dense Passage Retrieval

from transformers import DPRQuestionEncoder, DPRContextEncoder, AutoTokenizer, AutoModelForQuestionAnswering
import torch
import torch.nn.functional as F

# Sample corpus (in practice, this would be much larger)
corpus = [
    "Alan Mathison Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist.",
    "Turing was born in Maida Vale, London, while his father was on leave from his position with the Indian Civil Service.",
    "Turing is widely considered to be the father of theoretical computer science and artificial intelligence.",
    "During the Second World War, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's codebreaking center.",
    "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity.",
    "Einstein was born in Ulm, in the Kingdom of Württemberg in the German Empire."
]

# Load DPR models for retrieval
question_encoder_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_encoder_name = "facebook/dpr-ctx_encoder-single-nq-base"
question_tokenizer = AutoTokenizer.from_pretrained(question_encoder_name)
question_encoder = DPRQuestionEncoder.from_pretrained(question_encoder_name)
ctx_tokenizer = AutoTokenizer.from_pretrained(ctx_encoder_name)
ctx_encoder = DPRContextEncoder.from_pretrained(ctx_encoder_name)

# Load QA model for answer extraction
qa_model_name = "deepset/bert-base-cased-squad2"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)

# Encode corpus passages
def encode_corpus(corpus, tokenizer, encoder):
    corpus_embeddings = []
    for passage in corpus:
        inputs = tokenizer(passage, return_tensors="pt", max_length=512, truncation=True)
        with torch.no_grad():
            outputs = encoder(**inputs)
        corpus_embeddings.append(outputs.pooler_output[0])
    return corpus_embeddings

# Pre-encode corpus (this would typically be done offline)
corpus_embeddings = encode_corpus(corpus, ctx_tokenizer, ctx_encoder)

# Retrieve relevant passages
def retrieve_passages(question, k=2):
    # Encode question
    question_inputs = question_tokenizer(question, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        question_embedding = question_encoder(**question_inputs).pooler_output[0]

    # Calculate similarity with all passages
    similarities = [F.cosine_similarity(question_embedding, passage_emb, dim=0) for passage_emb in corpus_embeddings]

    # Get top-k passages
    top_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:k]
    top_passages = [corpus[i] for i in top_indices]
    top_scores = [similarities[i].item() for i in top_indices]

    return list(zip(top_passages, top_scores))

# Extract answer from retrieved passages
def extract_answer(question, passages):
    best_answer = ""
    best_score = -float('inf')

    for passage, _ in passages:
        # Prepare input for QA model
        inputs = qa_tokenizer(question, passage, return_tensors="pt", max_length=512, truncation=True)

        # Get predictions
        with torch.no_grad():
            outputs = qa_model(**inputs)

        # Get scores and positions
        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        start_idx = torch.argmax(start_scores)
        end_idx = torch.argmax(end_scores)

        # Calculate confidence
        confidence = (torch.max(start_scores) + torch.max(end_scores)).item() / 2

        # Keep the answer if it is better than the previous best
        if confidence > best_score:
            best_score = confidence
            # Convert indices to text
            all_tokens = qa_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
            answer_tokens = all_tokens[start_idx : end_idx + 1]
            answer = qa_tokenizer.convert_tokens_to_string(answer_tokens)
            # Clean up answer
            answer = answer.replace("[CLS]", "").replace("[SEP]", "").strip()
            best_answer = answer

    return best_answer, best_score

# Question answering pipeline
def answer_question(question):
    print(f"Question: {question}")

    # Retrieve passages
    retrieved_passages = retrieve_passages(question, k=2)
    print("\nRetrieved passages:")
    for i, (passage, score) in enumerate(retrieved_passages):
        print(f"{i+1}. [{score:.4f}] {passage}")

    # Extract answer
    answer, confidence = extract_answer(question, retrieved_passages)
    print(f"\nAnswer: {answer}")
    print(f"Confidence: {confidence:.4f}")

# Example usage
question = "Where was Alan Turing born?"
answer_question(question)

Generative Question Answering

Generative QA systems can produce free-form answers by synthesizing information:

  1. Sequence-to-Sequence Models: Generate answers as sequences
  2. Large Language Models: Leverage extensive pre-training to answer directly
  3. Controlled Generation: Balance between extracting and hallucinating

Example: T5 for Generative QA

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load pre-trained model and tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5 expects a text-to-text format with a specific prefix for QA
def answer_question(question, context=None):
    if context:
        input_text = f"question: {question} context: {context}"
    else:
        input_text = f"question: {question}"

    # Tokenize the input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

    # Generate answer
    outputs = model.generate(
        inputs["input_ids"],
        max_length=64,
        num_beams=4,
        early_stopping=True
    )

    # Decode answer
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Example with context
question1 = "Where was Alan Turing born?"
context1 = ("Alan Mathison Turing was an English mathematician, computer scientist, logician, "
            "cryptanalyst, philosopher, and theoretical biologist. Turing was born in Maida Vale, "
            "London, while his father was on leave from his position with the Indian Civil Service.")
answer1 = answer_question(question1, context1)
print(f"Question: {question1}")
print(f"Context: {context1[:100]}...")
print(f"Answer: {answer1}\n")

# Example without context (requires model to use its parametric knowledge)
question2 = "What is the capital of France?"
answer2 = answer_question(question2)
print(f"Question: {question2}")
print(f"Answer: {answer2}")

Understanding Attention in Question Answering

Question answering systems rely heavily on attention mechanisms to focus on relevant parts of the context. Explore how attention works:


Attention patterns show how the model focuses on different parts of the input when generating answers. Cross-attention is particularly important for QA tasks.
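
A minimal sketch of inspecting attention weights with the same extractive QA checkpoint used earlier; requesting attentions via output_attentions is a standard option in the transformers API:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "deepset/bert-base-cased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, output_attentions=True)

question = "Where was Alan Turing born?"
context = "Turing was born in Maida Vale, London."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len)
last_layer = outputs.attentions[-1][0]   # (num_heads, seq_len, seq_len)
avg_attention = last_layer.mean(dim=0)   # average over heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# For each token, show which other token it attends to most strongly
for i, tok in enumerate(tokens):
    j = int(avg_attention[i].argmax())
    print(f"{tok:>12} -> {tokens[j]}")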

Evaluating Question Answering Systems

Evaluation Metrics for QA Systems

  • Exact Match (EM): Binary; 1 if the prediction matches any reference answer exactly, 0 otherwise. Best for factoid QA with specific answers. Limitation: too strict for open-ended questions.
  • F1 Score: Token overlap between prediction and reference. Best for QA with slightly varying answers. Limitation: doesn't capture semantic similarity.
  • ROUGE: N-gram overlap metrics, commonly used for summarization. Best for long-form QA. Limitation: doesn't handle paraphrasing well.
  • BLEU: Machine translation metric applied to QA. Best for generative QA. Limitation: focuses on precision, not recall.
  • BERTScore: Contextual embedding similarity between answers. Best for semantic evaluation. Limitation: computationally expensive.

Calculating F1 Score for QA

import re
import string

def normalize_answer(s):
    """Normalize an answer by removing articles and punctuation, and lowercasing."""

    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, ground_truth):
    """Compute exact match between prediction and ground truth."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def compute_f1(prediction, ground_truth):
    """Compute token-level F1 score."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()

    # If either is empty, return 0
    if len(prediction_tokens) == 0 or len(ground_truth_tokens) == 0:
        return 0

    # Count common tokens
    common = sum(1 for token in prediction_tokens if token in ground_truth_tokens)

    # If no common tokens, return 0
    if common == 0:
        return 0

    # Compute precision, recall, and F1
    precision = common / len(prediction_tokens)
    recall = common / len(ground_truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

# Example usage
prediction = "Maida Vale, London"
ground_truth = "Maida Vale in London"

em = compute_exact_match(prediction, ground_truth)
f1 = compute_f1(prediction, ground_truth)

print(f"Prediction: {prediction}")
print(f"Ground Truth: {ground_truth}")
print(f"Exact Match: {em}")
print(f"F1 Score: {f1:.4f}")

# Multiple references
references = ["Maida Vale in London", "London", "Maida Vale, London, England"]
max_f1 = max(compute_f1(prediction, reference) for reference in references)
any_em = any(compute_exact_match(prediction, reference) for reference in references)

print(f"\nBest F1 Score across references: {max_f1:.4f}")
print(f"Any Exact Match: {any_em}")


Conclusion: From Fundamentals to Production

The practical NLP tasks we've explored—text classification, named entity recognition, and question answering—form the foundation of many real-world NLP applications. By understanding these fundamental tasks and how to implement them effectively, you're well-positioned to build sophisticated NLP systems that solve real problems.

What You've Learned in This Course

Throughout this NLP Fundamentals course, you've built a comprehensive understanding of:

  1. Text Processing Foundations: From basic preprocessing to advanced tokenization techniques
  2. Representation Learning: Traditional word embeddings to modern contextual representations
  3. Architecture Evolution: The journey from RNNs to the transformer revolution
  4. Generation Methods: Both deterministic and probabilistic approaches to text generation
  5. Model Landscape: Understanding of modern language models and their capabilities
  6. Practical Applications: Core NLP tasks and how to approach them effectively

Ready for Advanced Topics?

With this foundation, you're now prepared to tackle the engineering and production aspects of NLP. If you're interested in learning how to:

  • Train and fine-tune large language models from scratch
  • Optimize models for production deployment through quantization and acceleration
  • Build production systems like RAG applications and monitoring infrastructure
  • Implement alignment techniques to ensure model safety and helpfulness

Consider continuing with our "Advanced NLP: Training & Production Systems" course, which builds directly on the concepts you've learned here.

Key Takeaways

  1. Text Classification is essential for categorizing content, enabling applications from sentiment analysis to content moderation.

  2. Named Entity Recognition extracts structured information from unstructured text, providing the foundation for information extraction systems.

  3. Question Answering combines retrieval, extraction, and generation to provide direct answers to user queries.

  4. Evaluation Matters: Selecting appropriate evaluation metrics is critical for understanding model performance in real-world scenarios.

  5. Modern Approaches: Deep learning and transformer-based models have dramatically improved performance across all these tasks.

Modern Model Performance Comparison

Explore how different modern language models perform across these practical NLP tasks:


This comparison shows real-world performance of different models on standardized benchmarks for text classification, NER, and QA tasks.

Practice Exercises

Exercise 1: Multi-class Text Classification

Build a multi-class text classifier for news article categorization:

  1. Use a dataset with multiple news categories (e.g., politics, sports, technology)
  2. Compare performance of a traditional ML approach (TF-IDF + SVM) with a transformer-based approach
  3. Analyze which categories are most often confused with each other
  4. Implement appropriate evaluation metrics for a multi-class problem

Exercise 2: Domain-Specific NER

Develop a named entity recognition system for a specific domain:

  1. Choose a domain of interest (e.g., medical, legal, finance)
  2. Create or find a domain-specific dataset with entity annotations
  3. Fine-tune a pre-trained NER model on your domain data
  4. Evaluate performance on domain-specific entities compared to general entities

Exercise 3: End-to-End QA System

Implement a complete question answering system:

  1. Build a retrieval-based QA system using a document collection of your choice
  2. Implement both sparse (TF-IDF) and dense (embedding-based) retrieval
  3. Compare extractive vs. generative approaches for answer generation
  4. Evaluate using both automatic metrics and human assessment
