Introduction: Discovering Hidden Patterns
Imagine you're an e-commerce analyst with millions of customers. You want to group them into segments for targeted marketing, but you don't have labels. This is unsupervised learning: finding structure in unlabeled data.
Clustering is the art of grouping similar data points together. It's like sorting a messy pile of photos into albums (family, vacations, pets) without anyone telling you the categories beforehand!
Key Insight: Clustering reveals natural groupings in data, enabling pattern discovery, data compression, and feature engineering.
Learning Objectives
- Master K-Means clustering and understand its assumptions
- Apply DBSCAN for density-based clustering
- Understand hierarchical clustering methods
- Choose the right algorithm for different data structures
- Evaluate clustering quality with appropriate metrics
- Handle real-world clustering challenges
1. K-Means Clustering
See K-Means run step-by-step before reading the math: Naftali Harris's interactive K-Means visualizer lets you click to drop points and step through each iteration. A few minutes of interaction does what hours of static slides can't.
The Intuition
K-Means partitions data into K clusters by iteratively:
- Assigning points to the nearest centroid
- Updating centroids as the mean of assigned points
Analogy: Imagine K post offices. Each house is served by the nearest office. As people move, the optimal office locations shift!
How K-Means Works
Algorithm Steps:
- Initialize: Randomly place K centroids
- Assign: Each point joins the nearest centroid's cluster
- Update: Move centroids to the mean of their cluster
- Repeat: Until centroids stop moving (convergence)
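The four steps above can be run in a few lines with scikit-learn. This is a minimal sketch on synthetic blob data; the dataset, `n_clusters=3`, and other parameter values are illustrative choices, not part of the lesson:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic toy data: 300 points in 3 well-separated 2-D blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 reruns the algorithm from 10 random initializations
# and keeps the run with the lowest within-cluster variance
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # (3, 2): one 2-D centroid per cluster
print(km.inertia_)                # within-cluster sum of squared distances
```

Because K-Means only converges to a local optimum, the `n_init` restarts matter in practice: a single unlucky initialization can leave a centroid stranded between two true clusters.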
Mathematical Formulation:
Minimize the within-cluster sum of squared distances:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid (mean) of cluster $C_k$.
K-Means Strengths & Limitations
| Strengths | Limitations |
|---|---|
| ✅ Fast and scalable | ❌ Requires choosing K beforehand |
| ✅ Simple to implement | ❌ Sensitive to initialization |
| ✅ Guaranteed to converge | ❌ Assumes spherical clusters |
| ✅ Works well for convex clusters | ❌ Struggles with varying densities |
| ✅ Easy to interpret | ❌ Affected by outliers |
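The "requires choosing K beforehand" limitation is usually handled with the elbow method: fit K-Means for a range of K and look for the point where inertia stops dropping sharply. A minimal sketch (the data and K range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia always decreases as K grows; the "elbow" is where the
# marginal improvement flattens out
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, j in zip(range(1, 8), inertias):
    print(f"K={k}: inertia={j:.1f}")
```

On this data the curve typically bends around K=4, matching the number of generated blobs. The elbow is a heuristic, not a test; pair it with the silhouette score discussed later.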
2. DBSCAN: Density-Based Clustering
The Intuition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions separated by sparse regions. It can discover arbitrary-shaped clusters!
Analogy: Imagine cities from space at night. Dense lights = cities (clusters). Sparse lights = roads between cities (noise).
Key Advantage: Doesn't require specifying the number of clusters!
How DBSCAN Works
Core Concepts:
- ε (epsilon): Neighborhood radius
- MinPts: Minimum points to form a dense region
- Core point: Has ≥ MinPts within ε radius
- Border point: Within ε of a core point but not core itself
- Noise: Not core or border
Algorithm:
- For each unvisited point, find all points within ε radius
- If ≥ MinPts found, start a new cluster and expand
- Mark isolated points as noise
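To see why density-based clustering matters, run DBSCAN on the classic two-moons dataset, where K-Means fails. A minimal sketch using scikit-learn; the `eps` and `min_samples` values here are illustrative and would need tuning on real data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighborhood radius, min_samples = MinPts from the text
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 2: one cluster per moon
```

Note the `-1` convention: DBSCAN never forces an outlier into a cluster, which is exactly the "robust to outliers" strength in the table below.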
DBSCAN Strengths & Limitations
| Strengths | Limitations |
|---|---|
| ✅ Finds arbitrary-shaped clusters | ❌ Sensitive to ε and MinPts parameters |
| ✅ Automatically determines # clusters | ❌ Struggles with varying densities |
| ✅ Robust to outliers (marks as noise) | ❌ High-dimensional data challenges |
| ✅ No spherical assumption | ❌ Memory intensive for large datasets |
3. Hierarchical Clustering
The Intuition
Hierarchical clustering builds a tree (dendrogram) of nested clusters. You can cut the tree at different heights to get different numbers of clusters!
Analogy: Like a family tree ā individuals ā families ā communities ā cities ā countries.
Two Approaches
- Agglomerative (Bottom-Up): Start with each point as a cluster, merge closest pairs
- Divisive (Top-Down): Start with one cluster, recursively split
We focus on Agglomerative (more common).
Linkage Criteria
How to measure distance between clusters?
| Linkage | Distance Metric |
|---|---|
| Single | Minimum distance between any two points |
| Complete | Maximum distance between any two points |
| Average | Average distance between all pairs |
| Ward | Minimizes within-cluster variance |
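Agglomerative clustering with these linkage criteria can be sketched with SciPy, which also gives you the dendrogram's merge history so you can "cut the tree" at any level. The toy data and the choice of Ward linkage here are illustrative:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Ward linkage: at each step, merge the pair of clusters whose
# union least increases within-cluster variance
Z = linkage(X, method="ward")  # shape (n-1, 4): one row per merge

# Cut the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))  # SciPy cluster ids start at 1
```

Swapping `method="ward"` for `"single"`, `"complete"`, or `"average"` applies the other linkage criteria from the table; `scipy.cluster.hierarchy.dendrogram(Z)` plots the tree itself.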
4. Comparing Clustering Methods
Decision Guide
Use K-Means when:
- You know the number of clusters
- Clusters are roughly spherical and similar size
- Speed is important
- Data is not too noisy
Use DBSCAN when:
- You don't know the number of clusters
- Clusters have arbitrary shapes
- Data contains outliers
- Varying cluster sizes
Use Hierarchical when:
- You want a hierarchy of clusters
- Small to medium datasets
- Need to explore different numbers of clusters
- Interpretability is important
4.5. Modern Clustering Techniques
Spectral Clustering
Graph-based clustering for complex, non-convex shapes! It builds a similarity graph over the points and clusters using the eigenvectors of the graph Laplacian.
Analogy: Cutting a social network along its weakest ties: groups that are densely connected internally stay together, no matter how irregular their shape.
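Spectral clustering handles the same non-convex shapes that defeat K-Means. A minimal sketch with scikit-learn on two-moons data; the nearest-neighbors affinity and `n_neighbors=10` are illustrative settings:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Build a nearest-neighbor similarity graph, then partition it
# where connections are weakest (the "weakest ties")
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(len(set(labels)))  # 2
```

Unlike DBSCAN, spectral clustering still requires choosing the number of clusters, but it trades that for very flexible cluster shapes.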
Real-World: Customer Segmentation
E-commerce with 1M customers - which algorithm?
Solution: Try K-Means (fast), DBSCAN (noise-robust), Hierarchical (interpretable), then profile segments!
5. Evaluating Clustering Quality
Internal Metrics (No Labels)
Silhouette Score: Measures how similar points are to their cluster vs. other clusters
- Range: [-1, 1]
- Higher is better
- 1 = perfect clustering
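The silhouette score is one function call in scikit-learn. A minimal sketch, scoring a K-Means fit on illustrative blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all points: (b - a) / max(a, b), where
# a = mean distance to own cluster, b = mean distance to nearest other cluster
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for well-separated blobs
```

Because it needs no ground-truth labels, the same call works for comparing K-Means, DBSCAN, and hierarchical results on the same data (exclude DBSCAN's noise points first).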
Key Takeaways
✅ K-Means: Fast, simple, requires K, assumes spherical clusters
✅ DBSCAN: Finds arbitrary shapes, handles noise, automatic K, needs ε and MinPts
✅ Hierarchical: Builds cluster hierarchy, flexible, interpretable dendrogram
✅ Choose based on: Data structure, cluster shapes, whether K is known, outlier presence
✅ Evaluation: Use silhouette score, elbow method, domain knowledge
What's Next?
Next lesson: Dimensionality Reduction: using PCA, t-SNE, and UMAP to visualize high-dimensional clustered data!
Further Reading
Interactive Visualizations
- Naftali Harris, Visualizing K-Means Clustering: click to drop points, step through each K-Means iteration, and watch centroids migrate. The classic interactive.
- Naftali Harris, Visualizing DBSCAN Clustering: companion piece that shows ε-neighborhoods and core-point propagation in real time.
- scikit-learn, Comparing Clustering Algorithms: the canonical 9-dataset × 10-algorithm grid. Print this one.
- Google PAIR, Understanding UMAP: bridges clustering and dimensionality reduction with interactive hyperparameter sliders.
Video Tutorials
- StatQuest: K-means clustering and Hierarchical Clustering (Josh Starmer).
- Google ML Crash Course, Clustering: free interactive mini-course with exercises.
Papers & Articles
- DBSCAN Revisited, Revisited (Schubert et al., TODS 2017). The definitive modern reference; worth reading before using DBSCAN in production.
- HDBSCAN: Density-Based Clustering Based on Hierarchical Density Estimates (Campello, Moulavi, Sander). The stronger successor to DBSCAN; library: hdbscan.
- A Survey of Clustering Algorithms for Big Data: still a good taxonomy if you want the map of the landscape.
Documentation & Books
- scikit-learn: Clustering: the ranked comparison table is genuinely the best single page on clustering online.
- Book: Pattern Recognition and Machine Learning, Bishop (Chapter 9).