Dimensionality Reduction: PCA, t-SNE, UMAP

Introduction: The Curse of High Dimensions

Imagine trying to visualize a dataset with 1,000 features. Impossible, right? Or imagine training a model where each sample has 10,000 dimensions: training becomes slow and memory-intensive, and the model is prone to overfitting.

Dimensionality reduction solves this by projecting high-dimensional data into fewer dimensions while preserving essential structure. It's like looking at shadows: a 3D object casts a 2D shadow that captures its key shape.

In this lesson, we'll master three powerful techniques: PCA (linear, fast), t-SNE (non-linear, great for visualization), and UMAP (non-linear, fast, and a good balance of local and global structure).


Prerequisites

  • Completed ML Fundamentals and Clustering Algorithms
  • Linear algebra: vectors, matrices, eigenvectors
  • Python with NumPy, scikit-learn
  • Nice to have: Understanding of variance and covariance

1. Why Dimensionality Reduction?

The Curse of Dimensionality

As dimensions increase, strange things happen:

  • Volume explodes: A 10D hypercube has 1,024 corners!
  • Distance loses meaning: All pairwise distances become nearly equal
  • Data becomes sparse: Most of the volume is empty
  • Models suffer: Need exponentially more data to generalize

Analogy: On a 1D line, 10 points give decent coverage. In 2D you need 100 points for the same density; in 10D you'd need 10 billion – a million points barely scratch the surface!
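A quick way to see the "distance loses meaning" effect empirically is to sample random points and compare nearest- and farthest-neighbor distances as the dimension grows (a minimal sketch; exact numbers depend on the random seed):

```python
import numpy as np

rng = np.random.default_rng(42)

for d in [1, 2, 10, 100, 1000]:
    X = rng.random((500, d))                      # 500 random points in the unit hypercube
    dists = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from one point to all others
    ratio = dists.min() / dists.max()             # nearest vs farthest neighbor
    print(f"{d:>5}D  nearest/farthest distance ratio: {ratio:.3f}")

# As d grows, the ratio approaches 1: every point looks roughly equally far away.
```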

Benefits of Dimensionality Reduction

Benefit | Example
------- | -------
Visualization | Plot 1000D data in 2D/3D for exploration
Speed | Faster training with fewer features
Memory | Store less data
Noise reduction | Remove irrelevant dimensions
Multicollinearity | Remove correlated features
Better generalization | Avoid overfitting from too many features

Two Approaches

  1. Feature Selection: Choose a subset of original features (we covered this in ML Fundamentals)
  2. Feature Extraction: Create new features by combining originals (this lesson!)

2. Principal Component Analysis (PCA)

The Intuition

PCA finds directions of maximum variance in your data. Think of it as rotating your coordinate system to align with the data's natural spread.

Analogy: Imagine photographing a flat disk from different angles. One angle (face-on) captures the most information. That's the first principal component!

Key Idea: Project data onto fewer dimensions that capture the most variance.

Mathematical Foundation

Given a data matrix $X$ (n samples × p features):

  1. Center the data: $X_c = X - \bar{X}$
  2. Compute the covariance matrix: $C = \frac{1}{n-1} X_c^T X_c$
  3. Find eigenvectors/eigenvalues: $C v_i = \lambda_i v_i$
  4. Sort by eigenvalue: largest $\lambda$ → most variance
  5. Project: $Z = X_c V_k$ (keep the top $k$ eigenvectors)

Explained Variance:

$\text{Explained Variance Ratio}_k = \frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}$
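As a hypothetical worked example: if the sorted eigenvalues of a three-feature dataset were $\lambda = (4.0, 1.5, 0.5)$, the first principal component alone would explain $4.0 / 6.0 \approx 67\%$ of the variance, and the first two together about $92\%$.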

Interactive Exploration


Try this:

  1. Switch between datasets – see how PCA projects 3D → 2D
  2. Check "Explained Variance" – how much information is retained?
  3. Compare PCA with t-SNE and UMAP on the same data

Implementation from Scratch

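Here is a minimal from-scratch sketch of the five steps above in NumPy, cross-checked against scikit-learn's PCA (the helper name pca_from_scratch and the choice of the Iris dataset are illustrative, not part of the original lesson):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

def pca_from_scratch(X, k=2):
    # 1. Center the data
    X_c = X - X.mean(axis=0)
    # 2. Covariance matrix (p x p)
    C = np.cov(X_c, rowvar=False)
    # 3. Eigendecomposition (eigh is fine: C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Sort by eigenvalue, largest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project onto the top-k eigenvectors
    Z = X_c @ eigvecs[:, :k]
    explained = eigvals[:k] / eigvals.sum()
    return Z, explained

X = load_iris().data
Z, explained = pca_from_scratch(X, k=2)
print("Explained variance ratio (scratch):", explained.round(3))
print("Explained variance ratio (sklearn):",
      PCA(n_components=2).fit(X).explained_variance_ratio_.round(3))
# The two should agree, up to sign flips of the components.
```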

Choosing the Number of Components

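A common way to choose k is to look at the cumulative explained variance curve; a minimal sketch on scikit-learn's digits dataset (an illustrative choice) follows:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                  # 1797 samples, 64 features

pca = PCA().fit(X)                      # fit all components
cumvar = np.cumsum(pca.explained_variance_ratio_)
k95 = np.argmax(cumvar >= 0.95) + 1     # first k reaching 95% cumulative variance
print(f"Components needed for 95% variance: {k95} of {X.shape[1]}")

# Equivalent shortcut: pass a variance target directly to scikit-learn
pca95 = PCA(n_components=0.95).fit(X)
print("n_components_ chosen by sklearn:", pca95.n_components_)
```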

Rule of Thumb: Keep enough components to explain 90-95% of variance.

PCA Strengths & Limitations

Strengths | Limitations
--------- | -----------
✅ Fast and scalable | ❌ Only captures linear relationships
✅ Mathematically principled | ❌ Assumes variance = importance
✅ Reversible (can reconstruct) | ❌ Sensitive to scaling
✅ No hyperparameters | ❌ Hard to interpret components
✅ Noise reduction | ❌ Fails on non-linear manifolds

3. t-SNE: t-Distributed Stochastic Neighbor Embedding

The Intuition

t-SNE preserves local structure – points that are neighbors in high dimensions stay neighbors in the low-dimensional map (global distances, however, are not faithfully preserved). Unlike PCA, it can capture non-linear relationships.

Analogy: Imagine a crumpled paper (high-D) that you carefully unfold (2D) while preserving which parts touch.

Key Idea: Model pairwise similarities in high-D, then find low-D layout that preserves those similarities.

How t-SNE Works

  1. High-dimensional similarities: Model as Gaussian

    $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$
  2. Low-dimensional similarities: Model as t-distribution (heavy-tailed)

    $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$
  3. Optimize: Minimize KL-divergence between P and Q using gradient descent

Why a t-distribution? Its heavy tails mitigate the "crowding problem" – moderately distant points don't all get squashed together in the low-dimensional map.
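To make the three steps concrete, here is a toy sketch of the quantities t-SNE works with – Gaussian affinities in high dimensions, Student-t affinities in low dimensions, and the KL divergence that gradient descent minimizes. For simplicity it uses a single fixed σ rather than the per-point binary search real t-SNE performs to match the perplexity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))           # high-dimensional points
Y = rng.normal(size=(30, 2)) * 1e-2     # low-dimensional layout (normally optimized)

def squared_dists(A):
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(-1)

# High-D: conditional Gaussian affinities p_{j|i}, symmetrized into p_{ij}
D = squared_dists(X)
sigma = 1.0                             # fixed here; t-SNE tunes sigma_i per point
P_cond = np.exp(-D / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)
P = (P_cond + P_cond.T) / (2 * len(X))

# Low-D: Student-t affinities q_{ij}
Q = 1.0 / (1.0 + squared_dists(Y))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# The objective that gradient descent minimizes
kl = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
print(f"KL(P || Q) = {kl:.3f}")
```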

Interactive Exploration


Try this:

  1. Adjust Perplexity – balances local vs global structure (typical: 5-50)
  2. Notice t-SNE is stochastic – different runs give different results
  3. Compare with PCA on the same "Swiss Roll" data

Implementation Example

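A minimal scikit-learn sketch on the digits dataset (parameter values here are reasonable starting points, not the only valid choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data, digits.target       # 1797 samples, 64 features

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_2d = tsne.fit_transform(X)            # note: no separate transform() for new data

plt.figure(figsize=(6, 5))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(label="digit")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```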

The Perplexity Parameter

Perplexity roughly corresponds to the number of close neighbors for each point.

  • Low perplexity (5-15): Focuses on local structure, may create fragmented clusters
  • High perplexity (30-50): Considers more global structure, may merge clusters
  • Rule of thumb: Start with 30, adjust based on dataset size

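One way to see this effect is to run t-SNE with a few perplexity values on the same data and compare the plots side by side (a sketch, again using the digits dataset):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"perplexity = {perp}")
plt.tight_layout()
plt.show()
# Low perplexity tends to fragment clusters; high perplexity emphasizes global layout.
```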

t-SNE Strengths & Limitations

Strengths | Limitations
--------- | -----------
✅ Excellent for visualization | ❌ Very slow for large datasets
✅ Captures non-linear structure | ❌ Non-deterministic (random init)
✅ Preserves local neighborhoods | ❌ Doesn't preserve global structure
✅ Beautiful visualizations | ❌ Can't transform new data
✅ Works well for clusters | ❌ Perplexity needs tuning

Key Limitation: t-SNE is for visualization only – don't use distances in 2D for downstream tasks!


4. UMAP: Uniform Manifold Approximation and Projection

The Intuition

UMAP aims for the best of both worlds: it preserves local structure like t-SNE, while running faster and retaining more global structure.

Key Idea: Build a graph of the data manifold, then find a low-dimensional layout that preserves the graph structure.

Mathematical Approach

  1. Construct a fuzzy topological graph (in practice, a weighted k-nearest-neighbor graph) in high dimensions
  2. Optimize a low-dimensional layout so that it matches the graph structure
  3. The underlying theory uses Riemannian geometry and algebraic topology, but you don't need that machinery to apply UMAP

Advantages over t-SNE:

  • Faster (can handle millions of points)
  • Preserves more global structure
  • Can transform new data
  • Fewer hyperparameters

Interactive Exploration


Try this:

  1. Adjust N Neighbors – similar to perplexity in t-SNE
  2. Notice UMAP is faster than t-SNE
  3. Compare global structure preservation across all three methods

Implementation Example

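Assuming the umap-learn package is installed (pip install umap-learn), a minimal usage sketch might look like this:

```python
import matplotlib.pyplot as plt
import umap                              # from the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_2d = reducer.fit_transform(X)

plt.figure(figsize=(6, 5))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(label="digit")
plt.title("UMAP embedding of the digits dataset")
plt.show()

# Unlike t-SNE, the fitted reducer can embed new samples:
X_new_2d = reducer.transform(X[:10])
print(X_new_2d.shape)                    # (10, 2)
```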

UMAP Hyperparameters

  • n_neighbors: Size of local neighborhood (default 15). Smaller → more local focus
  • min_dist: Minimum distance between points in embedding (default 0.1). Smaller → tighter clusters
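To build intuition for these two knobs, one option is a small grid sweep over n_neighbors and min_dist (a self-contained sketch that assumes umap-learn; it refits UMAP nine times, so it takes a little while):

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

n_neighbors_values = [5, 15, 50]
min_dist_values = [0.0, 0.1, 0.5]

fig, axes = plt.subplots(3, 3, figsize=(12, 12))
for i, nn in enumerate(n_neighbors_values):
    for j, md in enumerate(min_dist_values):
        emb = umap.UMAP(n_neighbors=nn, min_dist=md, random_state=0).fit_transform(X)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=3)
        axes[i, j].set_title(f"n_neighbors={nn}, min_dist={md}")
plt.tight_layout()
plt.show()
```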

UMAP Strengths & Limitations

Strengths | Limitations
--------- | -----------
✅ Fast – scales to millions of points | ❌ Less mature than PCA/t-SNE
✅ Preserves local AND global structure | ❌ Still stochastic
✅ Can transform new data | ❌ Hyperparameters affect results
✅ Strong theoretical foundation | ❌ Requires an additional library
✅ Great for visualization | ❌ Sensitive to hyperparameters

5. Method Comparison

Side-by-Side Comparison

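A rough sketch that embeds the same dataset with all three methods and times them (the UMAP part assumes umap-learn is installed; exact timings will vary by machine):

```python
import time
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap                              # umap-learn

X, y = load_digits(return_X_y=True)

methods = {
    "PCA": PCA(n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
    "UMAP": umap.UMAP(n_components=2, random_state=0),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, methods.items()):
    start = time.perf_counter()
    emb = model.fit_transform(X)
    elapsed = time.perf_counter() - start
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"{name} ({elapsed:.1f}s)")
plt.tight_layout()
plt.show()
```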

Decision Matrix

Criterion | PCA | t-SNE | UMAP
--------- | --- | ----- | ----
Speed | ⚡⚡⚡ Very fast | ⚡ Slow | ⚡⚡ Fast
Scalability | Millions | Thousands | Millions
Non-linear | ❌ No | ✅ Yes | ✅ Yes
Global structure | ✅ Yes | ❌ No | ✅ Moderate
Local structure | ❌ No | ✅ Yes | ✅ Yes
Deterministic | ✅ Yes | ❌ No | ❌ No
Transform new data | ✅ Yes | ❌ No | ✅ Yes
Best for | Preprocessing | Visualization | General purpose

When to Use Each

Use PCA when:

  • You need speed and scalability
  • Data is mostly linear
  • You need to transform new data
  • You want interpretable components
  • Preprocessing before other algorithms

Use t-SNE when:

  • Primary goal is visualization
  • Dataset is small-medium (<10k samples)
  • Local structure is most important
  • You have time for hyperparameter tuning

Use UMAP when:

  • You want visualization + modeling
  • Large datasets
  • Both local and global structure matter
  • You need to transform new data
  • Modern, general-purpose choice

6. Real-World Application: Image Dataset Exploration

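One possible end-to-end sketch for exploring an image dataset: scale the 8×8 digit images, compress them with PCA first (a common trick that denoises and speeds up the next step), then embed with t-SNE for visualization. The dataset and parameter choices here are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# 8x8 handwritten digit images flattened to 64 features
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Step 1: PCA down to a moderate dimensionality
X_pca = PCA(n_components=30, random_state=0).fit_transform(X_scaled)

# Step 2: non-linear embedding of the compressed data for exploration
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

plt.figure(figsize=(7, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="digit", loc="best", fontsize=8)
plt.title("Digits: PCA (64 → 30 dims) followed by t-SNE (30 → 2 dims)")
plt.show()
```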


Practice Problems

Problem 1: Explain the Difference

A colleague says "PCA and t-SNE are both dimensionality reduction, so they're interchangeable." Explain why this is incorrect.

Problem 2: Curse of Dimensionality

Show empirically that as dimensions increase, the ratio of distances between nearest and farthest neighbors approaches 1.

Problem 3: Face Recognition Pipeline

Build a face recognition system using:

  1. PCA for dimensionality reduction
  2. K-Means or DBSCAN for clustering
  3. Evaluate using silhouette score

Key Takeaways

  • Dimensionality reduction makes high-D data manageable and visualizable
  • PCA: fast, linear, mathematically clean – perfect for preprocessing
  • t-SNE: best for visualization; captures non-linear structure and preserves local neighborhoods
  • UMAP: modern alternative – faster than t-SNE, balances local and global structure
  • Choose wisely: PCA for speed/preprocessing, t-SNE for visualization, UMAP for both
  • Applications: visualization, preprocessing, noise reduction, feature engineering


What's Next?

Next up: Anomaly Detection – using dimensionality reduction and distance metrics to identify outliers!