Introduction: Discovering Hidden Patterns
Imagine you're an analyst at an e-commerce company with millions of customers. You want to group them into segments for targeted marketing, but you have no labels. This is unsupervised learning: finding structure in unlabeled data.
Clustering is the art of grouping similar data points together. It's like sorting a messy pile of photos into albums – family, vacations, pets – without anyone telling you the categories beforehand!
Key Insight: Clustering reveals natural groupings in data, enabling pattern discovery, data compression, and feature engineering.
Learning Objectives
- Master K-Means clustering and understand its assumptions
- Apply DBSCAN for density-based clustering
- Understand hierarchical clustering methods
- Choose the right algorithm for different data structures
- Evaluate clustering quality with appropriate metrics
- Handle real-world clustering challenges
1. K-Means Clustering
The Intuition
K-Means partitions data into K clusters by iteratively:
- Assigning points to the nearest centroid
- Updating centroids as the mean of assigned points
Analogy: Imagine K post offices. Each house is served by the nearest office. As people move, the optimal office locations shift!
How K-Means Works
Algorithm Steps:
- Initialize: Randomly place K centroids
- Assign: Each point joins the nearest centroid's cluster
- Update: Move centroids to the mean of their cluster
- Repeat: Until centroids stop moving (convergence)
Mathematical Formulation:
Minimize the within-cluster sum of squared distances:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid (mean) of cluster $C_k$.
Interactive Exploration
Try this:
- Click "Generate Data" and watch K-Means iterate
- Adjust K (number of clusters) – how does it change results?
- Try different datasets – when does K-Means fail?
- Notice how K-Means creates spherical clusters
Implementation from Scratch
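Below is a minimal from-scratch sketch of these steps using NumPy. The `kmeans` function, the toy blob data, and the parameter values are illustrative choices rather than a production implementation (scikit-learn's `KMeans` adds smarter initialization such as k-means++).

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: centroids stopped moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: three well-separated 2D blobs
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0, 0], 0.6, size=(50, 2)),
    rng.normal([5, 5], 0.6, size=(50, 2)),
    rng.normal([0, 5], 0.6, size=(50, 2)),
])
labels, centroids = kmeans(X, k=3)
print("Centroids:\n", centroids)
```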
K-Means Strengths & Limitations
| Strengths | Limitations |
|---|---|
| ✅ Fast and scalable | ❌ Requires choosing K beforehand |
| ✅ Simple to implement | ❌ Sensitive to initialization |
| ✅ Guaranteed to converge | ❌ Assumes spherical clusters |
| ✅ Works well for convex clusters | ❌ Struggles with varying densities |
| ✅ Easy to interpret | ❌ Affected by outliers |
2. DBSCAN: Density-Based Clustering
The Intuition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions separated by sparse regions. It can discover arbitrary-shaped clusters!
Analogy: Imagine cities from space at night. Dense lights = cities (clusters). Sparse lights = roads between cities (noise).
Key Advantage: Doesn't require specifying the number of clusters!
How DBSCAN Works
Core Concepts:
- ε (epsilon): Neighborhood radius
- MinPts: Minimum points to form a dense region
- Core point: Has ≥ MinPts within ε radius
- Border point: Within ε of a core point but not core itself
- Noise: Not core or border
Algorithm:
- For each unvisited point, find all points within ε radius
- If ≥ MinPts found, start a new cluster and expand
- Mark isolated points as noise
Interactive Exploration
Try this:
- Compare DBSCAN with K-Means on the "Moons" dataset
- Adjust ε (epsilon) – too small creates noise, too large merges clusters
- Adjust Min Points – controls density threshold
- Notice DBSCAN handles arbitrary shapes and identifies outliers
Implementation Example
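As a sketch, here is how DBSCAN might be applied with scikit-learn to the two-moons shape mentioned above; the `eps` and `min_samples` values are illustrative and would normally be tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles: a shape K-Means cannot separate cleanly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighborhood radius, min_samples = MinPts (illustrative values)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # DBSCAN marks noise points with the label -1

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```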
DBSCAN Strengths & Limitations
| Strengths | Limitations |
|---|---|
| ✅ Finds arbitrary-shaped clusters | ❌ Sensitive to ε and MinPts parameters |
| ✅ Automatically determines # clusters | ❌ Struggles with varying densities |
| ✅ Robust to outliers (marks them as noise) | ❌ High-dimensional data challenges |
| ✅ No spherical assumption | ❌ Memory intensive for large datasets |
3. Hierarchical Clustering
The Intuition
Hierarchical clustering builds a tree (dendrogram) of nested clusters. You can cut the tree at different heights to get different numbers of clusters!
Analogy: Like a family tree – individuals → families → communities → cities → countries.
Two Approaches
- Agglomerative (Bottom-Up): Start with each point as a cluster, merge closest pairs
- Divisive (Top-Down): Start with one cluster, recursively split
We focus on Agglomerative (more common).
Linkage Criteria
How to measure distance between clusters?
| Linkage | How cluster distance is measured |
|---|---|
| Single | Minimum distance between any two points in the two clusters |
| Complete | Maximum distance between any two points in the two clusters |
| Average | Average distance over all pairs of points across the two clusters |
| Ward | Merge that minimizes the increase in within-cluster variance |
Interactive Exploration
Try this:
- Watch the hierarchical clustering animation
- Try different linkage methods (Single, Complete, Average)
- Notice how the dendrogram shows cluster relationships
- Cut at different heights to get different numbers of clusters
Implementation Example
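A short sketch of agglomerative clustering with SciPy, assuming Ward linkage and toy 2D blobs; `fcluster` cuts the dendrogram at a chosen level to produce a flat clustering.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy data: three small 2D blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(20, 2)),
    rng.normal([4, 4], 0.5, size=(20, 2)),
    rng.normal([0, 4], 0.5, size=(20, 2)),
])

# Agglomerative clustering with Ward linkage (merges that minimize
# the increase in within-cluster variance)
Z = linkage(X, method="ward")

# Cut the tree to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("First 10 labels:", labels[:10])

# The dendrogram shows the full merge hierarchy
dendrogram(Z)
plt.xlabel("Sample index")
plt.ylabel("Merge distance")
plt.show()
```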
4. Comparing Clustering Methods
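One way to compare the three methods side by side, sketched with scikit-learn on the non-convex moons dataset (the parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

# Non-convex "moons" data: a classic case where cluster shape matters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

models = {
    "K-Means (K=2)": KMeans(n_clusters=2, n_init=10, random_state=42),
    "DBSCAN (eps=0.3, MinPts=5)": DBSCAN(eps=0.3, min_samples=5),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=2),
    "Agglomerative (single link)": AgglomerativeClustering(n_clusters=2, linkage="single"),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore DBSCAN noise
    n_noise = int(np.sum(labels == -1))
    print(f"{name:28s} clusters={n_clusters}  noise points={n_noise}")
```

Plotting the label assignments makes the difference visible: K-Means and Ward linkage tend to cut each crescent in half, while DBSCAN and single linkage typically recover the two moons.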
Decision Guide
Use K-Means when:
- You know the number of clusters
- Clusters are roughly spherical and similar size
- Speed is important
- Data is not too noisy
Use DBSCAN when:
- You don't know the number of clusters
- Clusters have arbitrary shapes
- Data contains outliers
- Cluster sizes vary widely
Use Hierarchical when:
- You want a hierarchy of clusters
- Small to medium datasets
- Need to explore different numbers of clusters
- Interpretability is important
5. Evaluating Clustering Quality
Internal Metrics (No Labels)
Silhouette Score: For each point, compares the mean distance to points in its own cluster (a) with the mean distance to points in the nearest other cluster (b): s = (b - a) / max(a, b)
- Range: [-1, 1]
- Higher is better
- Values near 1 indicate dense, well-separated clusters; values near 0 suggest overlapping clusters
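As a sketch of how this might be used in practice, the loop below scores K-Means over a range of K values with scikit-learn's `silhouette_score`; the blob data and the K range are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 underlying blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Score K-Means over a range of K; the best-scoring K should stand out
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette = {silhouette_score(X, labels):.3f}")
```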
Key Takeaways
✅ K-Means: Fast, simple, requires K, assumes spherical clusters
✅ DBSCAN: Finds arbitrary shapes, handles noise, automatic K, needs ε and MinPts
✅ Hierarchical: Builds cluster hierarchy, flexible, interpretable dendrogram
✅ Choose based on: Data structure, cluster shapes, whether K is known, outlier presence
✅ Evaluation: Use silhouette score, elbow method, domain knowledge
What's Next?
Next lesson: Dimensionality Reduction – using PCA, t-SNE, and UMAP to visualize high-dimensional clustered data!