Introduction: The Curse of High Dimensions
Imagine trying to visualize a dataset with 1,000 features. Impossible, right? Or training a model where each sample has 10,000 dimensions – slow, memory-intensive, and prone to overfitting.
Dimensionality reduction solves this by projecting high-dimensional data into fewer dimensions while preserving essential structure. It's like looking at shadows: a 3D object casts a 2D shadow that captures its key shape.
In this lesson, we'll master three powerful techniques: PCA (linear, fast), t-SNE (non-linear, great for visualization), and UMAP (balances both, fast).
Prerequisites
- Completed ML Fundamentals and Clustering Algorithms
- Linear algebra: vectors, matrices, eigenvectors
- Python with NumPy, scikit-learn
- Nice to have: Understanding of variance and covariance
1. Why Dimensionality Reduction?
The Curse of Dimensionality
As dimensions increase, strange things happen:
- Volume explodes: A 10D hypercube has 1,024 corners!
- Distance loses meaning: Nearest and farthest neighbors end up almost equally far apart
- Data becomes sparse: Most of the volume is empty
- Models suffer: Need exponentially more data to generalize
Analogy: In a 1D line, 10 points fill it well. Matching that density in 2D takes 100 points, and in 10D it takes 10 billion – a million points barely scratch the surface!
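A quick numerical sketch of the "distance loses meaning" point: sample random points in a unit hypercube and compare nearest to farthest neighbor distances as the dimension grows (the dimensions and sample size below are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))  # 500 random points in the d-dimensional unit cube
    # Distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    ratio = dists.min() / dists.max()  # a ratio near 1 means "everything is equally far"
    print(f"d={d:5d}  nearest/farthest distance ratio = {ratio:.3f}")
```

As the dimension grows, the ratio climbs toward 1 – exactly why distance-based methods struggle in high dimensions.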
Benefits of Dimensionality Reduction
Benefit | Example |
---|---|
Visualization | Plot 1000D data in 2D/3D for exploration |
Speed | Faster training with fewer features |
Memory | Store less data |
Noise reduction | Remove irrelevant dimensions |
Multicollinearity | Remove correlated features |
Better generalization | Avoid overfitting from too many features |
Two Approaches
- Feature Selection: Choose a subset of original features (we covered this in ML Fundamentals)
- Feature Extraction: Create new features by combining originals (this lesson!)
2. Principal Component Analysis (PCA)
The Intuition
PCA finds directions of maximum variance in your data. Think of it as rotating your coordinate system to align with the data's natural spread.
Analogy: Imagine photographing a flat disk from different angles. One angle (face-on) captures the most information. That's the first principal component!
Key Idea: Project data onto fewer dimensions that capture the most variance.
Mathematical Foundation
Given a data matrix $X \in \mathbb{R}^{n \times p}$ ($n$ samples × $p$ features):
- Center the data: $X_c = X - \bar{X}$ (subtract each feature's mean)
- Compute covariance matrix: $C = \frac{1}{n-1} X_c^\top X_c$
- Find eigenvectors/eigenvalues: $C v_i = \lambda_i v_i$
- Sort by eigenvalues: Largest $\lambda_i$ → most variance
- Project: $Z = X_c W_k$ (keep the top $k$ eigenvectors as the columns of $W_k$)
Explained Variance of component $i$: $\dfrac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$
Interactive Exploration
Try this:
- Switch between datasets – see how PCA projects 3D → 2D
- Check "Explained Variance" – how much information is retained?
- Compare PCA with t-SNE and UMAP on the same data
Implementation from Scratch
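A minimal from-scratch sketch of these steps in NumPy (the function name and the synthetic data are just for illustration):

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project X (n samples x p features) onto its top-k principal components."""
    # 1. Center the data
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix (p x p)
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigendecomposition (symmetric matrix -> eigh)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenpairs by eigenvalue, largest first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Project onto the top-k eigenvectors
    Z = X_centered @ eigenvectors[:, :k]
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return Z, explained_ratio

# Quick check on random correlated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
Z, ratio = pca_fit_transform(X, k=2)
print("Explained variance ratio:", np.round(ratio, 3))
```

scikit-learn's PCA computes the same projection (via SVD) and adds conveniences such as whitening and inverse_transform.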
Choosing the Number of Components
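A common recipe with scikit-learn: fit PCA with all components, then look at the cumulative explained variance curve. The digits dataset here is just an illustrative stand-in:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 1797 samples x 64 features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA().fit(X_scaled)                     # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 95% explained variance
k_95 = np.argmax(cumulative >= 0.95) + 1
print(f"{k_95} of {X.shape[1]} components explain 95% of the variance")
```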
Rule of Thumb: Keep enough components to explain 90-95% of variance.
PCA Strengths & Limitations
Strengths | Limitations |
---|---|
✅ Fast and scalable | ❌ Only captures linear relationships |
✅ Mathematically principled | ❌ Assumes variance = importance |
✅ Reversible (can reconstruct) | ❌ Sensitive to scaling |
✅ Few hyperparameters (just the number of components) | ❌ Components can be hard to interpret |
✅ Noise reduction | ❌ Fails on non-linear manifolds |
3. t-SNE: t-Distributed Stochastic Neighbor Embedding
The Intuition
t-SNE preserves local structure – keeping nearby points close and far points far. Unlike PCA, it can capture non-linear relationships.
Analogy: Imagine a crumpled sheet of paper (high-D) that you carefully unfold into 2D while keeping track of which parts were touching.
Key Idea: Model pairwise similarities in high-D, then find low-D layout that preserves those similarities.
How t-SNE Works
- High-dimensional similarities: Model pairwise affinities with a Gaussian
- Low-dimensional similarities: Model them with a t-distribution (heavy-tailed)
- Optimize: Minimize the KL-divergence between P and Q using gradient descent
Why a t-distribution? Its heavy tails prevent the "crowding problem" – moderately distant points get room to spread out in low dimensions instead of being crushed together.
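For reference, the standard formulation (each bandwidth $\sigma_i$ is chosen so that the Gaussian around point $i$ matches the user-specified perplexity):

$$
p_{j\mid i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}, \qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}, \qquad
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$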
Interactive Exploration
Try this:
- Adjust Perplexity – balances local vs global structure (typical: 5-50)
- Notice t-SNE is stochastic – different runs give different results
- Compare with PCA on the same "Swiss Roll" data
Implementation Example
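A minimal scikit-learn sketch on the digits dataset (the dataset and parameter values are illustrative choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(label="digit")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```

Note that scikit-learn's TSNE has no separate transform() method – you can only embed the data you fit on.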
The Perplexity Parameter
Perplexity roughly corresponds to the number of close neighbors for each point.
- Low perplexity (5-15): Focuses on local structure, may create fragmented clusters
- High perplexity (30-50): Considers more global structure, may merge clusters
- Rule of thumb: Start with 30, adjust based on dataset size
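One way to build intuition is to embed the same data at several perplexities (a sketch; t-SNE runtime grows quickly with sample count):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, perplexity in zip(axes, [5, 15, 30, 50]):
    X_2d = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=42).fit_transform(X)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"perplexity = {perplexity}")
plt.tight_layout()
plt.show()
```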
t-SNE Strengths & Limitations
Strengths | Limitations |
---|---|
✅ Excellent for visualization | ❌ Very slow for large datasets |
✅ Captures non-linear structure | ❌ Non-deterministic (random init) |
✅ Preserves local neighborhoods | ❌ Doesn't preserve global structure |
✅ Beautiful visualizations | ❌ Can't transform new data |
✅ Works well for clusters | ❌ Perplexity needs tuning |
Key Limitation: t-SNE is for visualization only – cluster sizes and between-cluster distances in the 2D embedding are not meaningful, so don't feed them into downstream tasks!
4. UMAP: Uniform Manifold Approximation and Projection
The Intuition
UMAP aims for the best of both worlds: it preserves local structure like t-SNE, but runs faster and retains more global structure.
Key Idea: Build a graph of the data manifold, then find a low-dimensional layout that preserves the graph structure.
Mathematical Approach
- Construct fuzzy topological graph in high dimensions
- Optimize layout in low dimensions to match graph structure
- Builds on Riemannian geometry and algebraic topology (we only need the intuition here; the full theory is beyond this lesson)
Advantages over t-SNE:
- Faster (can handle millions of points)
- Preserves more global structure
- Can transform new data
- Fewer hyperparameters
Interactive Exploration
Try this:
- Adjust N Neighbors – similar to perplexity in t-SNE
- Notice UMAP is faster than t-SNE
- Compare global structure preservation across all three methods
Implementation Example
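A minimal sketch using the umap-learn package (installed separately, e.g. `pip install umap-learn`); the dataset is again just an example:

```python
import matplotlib.pyplot as plt
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_2d = reducer.fit_transform(X)

# Unlike t-SNE, a fitted UMAP model can embed new samples:
# X_new_2d = reducer.transform(X_new)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.title("UMAP embedding of the digits dataset")
plt.show()
```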
UMAP Hyperparameters
- n_neighbors: Size of local neighborhood (default 15). Smaller → more local focus
- min_dist: Minimum distance between points in embedding (default 0.1). Smaller → tighter clusters
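To see how the two interact, a small grid sweep helps (a sketch; the specific values are chosen only for contrast):

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(2, 2, figsize=(10, 10))
for i, nn in enumerate([5, 50]):          # local vs broader neighborhoods
    for j, md in enumerate([0.0, 0.5]):   # tight vs spread-out clusters
        emb = umap.UMAP(n_neighbors=nn, min_dist=md,
                        random_state=42).fit_transform(X)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
        axes[i, j].set_title(f"n_neighbors={nn}, min_dist={md}")
plt.tight_layout()
plt.show()
```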
UMAP Strengths & Limitations
Strengths | Limitations |
---|---|
✅ Fast – scales to millions | ❌ Less mature than PCA/t-SNE |
✅ Preserves local AND global | ❌ Still stochastic |
✅ Can transform new data | ❌ Hyperparameters affect results |
✅ Strong theoretical foundation | ❌ Requires an additional library (umap-learn) |
✅ Great for visualization | ❌ Distances in the embedding shouldn't be over-interpreted |
5. Method Comparison
Side-by-Side Comparison
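A sketch that runs all three methods on the same dataset and times each one (exact timings depend on your machine and library versions):

```python
import time
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

methods = {
    "PCA": PCA(n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=30, init="pca", random_state=42),
    "UMAP": umap.UMAP(n_components=2, random_state=42),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, methods.items()):
    start = time.perf_counter()
    X_2d = model.fit_transform(X)
    elapsed = time.perf_counter() - start
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"{name} ({elapsed:.1f}s)")
plt.tight_layout()
plt.show()
```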
Decision Matrix
Criterion | PCA | t-SNE | UMAP |
---|---|---|---|
Speed | ⚡⚡⚡ Very Fast | ⚡ Slow | ⚡⚡ Fast |
Scalability | Millions | Thousands | Millions |
Non-linear | ❌ No | ✅ Yes | ✅ Yes |
Global structure | ✅ Yes | ❌ No | ✅ Moderate |
Local structure | ❌ No | ✅ Yes | ✅ Yes |
Deterministic | ✅ Yes | ❌ No | ❌ No |
Transform new data | ✅ Yes | ❌ No | ✅ Yes |
Best for | Preprocessing | Visualization | General purpose |
When to Use Each
Use PCA when:
- You need speed and scalability
- Data is mostly linear
- You need to transform new data
- You want components with a clear mathematical meaning (directions of maximum variance)
- Preprocessing before other algorithms
Use t-SNE when:
- Primary goal is visualization
- Dataset is small-medium (<10k samples)
- Local structure is most important
- You have time for hyperparameter tuning
Use UMAP when:
- You want visualization + modeling
- Large datasets
- Both local and global structure matter
- You need to transform new data
- Modern, general-purpose choice
6. Real-World Application: Image Dataset Exploration
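One possible workflow, sketched on the digits dataset as a stand-in for a larger image collection: PCA first for compression and noise reduction, then UMAP for the final 2-D map.

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 8x8 grayscale digits, 64 features

# Step 1: PCA keeps ~95% of the variance in far fewer dimensions
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)
print(f"PCA: 64 -> {X_reduced.shape[1]} dimensions")

# Step 2: UMAP maps the compressed data to 2-D for exploration
X_2d = umap.UMAP(n_neighbors=15, min_dist=0.1,
                 random_state=42).fit_transform(X_reduced)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="digit", loc="best", fontsize=8)
plt.title("PCA (95% variance) + UMAP on the digits dataset")
plt.show()
```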
Practice Problems
Problem 1: Explain the Difference
A colleague says "PCA and t-SNE are both dimensionality reduction, so they're interchangeable." Explain why this is incorrect.
Problem 2: Curse of Dimensionality
Show empirically that as dimensions increase, the ratio of distances between nearest and farthest neighbors approaches 1.
Problem 3: Face Recognition Pipeline
Build a face recognition system using:
- PCA for dimensionality reduction
- K-Means or DBSCAN for clustering
- Evaluate using silhouette score
Key Takeaways
✅ Dimensionality reduction makes high-D data manageable and visualizable
✅ PCA: Fast, linear, mathematically clean. Perfect for preprocessing.
✅ t-SNE: Best for visualization, captures non-linear structure, preserves local neighborhoods
✅ UMAP: Modern alternative, faster than t-SNE, balances local and global structure
✅ Choose wisely: PCA for speed/preprocessing, t-SNE for visualization, UMAP for both
✅ Applications: Visualization, preprocessing, noise reduction, feature engineering
What's Next?
Next up: Anomaly Detection – using dimensionality reduction and distance metrics to identify outliers!