Clustering Algorithms: K-Means, DBSCAN, Hierarchical

Introduction: Discovering Hidden Patterns

Imagine you're an e-commerce analyst with millions of customers. You want to group them into segments for targeted marketing – but you don't have labels. This is unsupervised learning: finding structure in unlabeled data.

Clustering is the art of grouping similar data points together. It's like sorting a messy pile of photos into albums – family, vacations, pets – without anyone telling you the categories beforehand!

Key Insight: Clustering reveals natural groupings in data, enabling pattern discovery, data compression, and feature engineering.

Learning Objectives

  • Master K-Means clustering and understand its assumptions
  • Apply DBSCAN for density-based clustering
  • Understand hierarchical clustering methods
  • Choose the right algorithm for different data structures
  • Evaluate clustering quality with appropriate metrics
  • Handle real-world clustering challenges

1. K-Means Clustering

The Intuition

K-Means partitions data into K clusters by iteratively:

  1. Assigning points to the nearest centroid
  2. Updating centroids as the mean of assigned points

Analogy: Imagine K post offices. Each house is served by the nearest office. As people move, the optimal office locations shift!

How K-Means Works

Algorithm Steps:

  1. Initialize: Randomly place K centroids
  2. Assign: Each point joins the nearest centroid's cluster
  3. Update: Move centroids to the mean of their cluster
  4. Repeat: Until centroids stop moving (convergence)

Mathematical Formulation:

Minimize within-cluster variance:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

Where \mu_k is the centroid of cluster C_k.
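
To make the objective concrete, here is a minimal NumPy sketch (the toy points and the fixed cluster assignment are made up for illustration) that computes J for a given clustering:

```python
import numpy as np

# Toy data: 6 points in 2-D with a hard assignment to K = 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.8, 7.5]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Centroids are the per-cluster means (this is the "update" step)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# J = sum over clusters of squared distances from points to their centroid
J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(2))
print(f"Within-cluster sum of squares J = {J:.3f}")
```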

Interactive Exploration


Try this:

  1. Click "Generate Data" and watch K-Means iterate
  2. Adjust K (number of clusters) – how does it change results?
  3. Try different datasets – when does K-Means fail?
  4. Notice how K-Means creates spherical clusters

Implementation from Scratch

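The embedded runtime isn't reproduced here, so the following is a minimal from-scratch sketch of the assign/update loop, assuming NumPy and synthetic two-blob data:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k random data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # 2. Assign: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # 4. Repeat until centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Quick check on two synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("Centroids:\n", centroids)
```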

K-Means Strengths & Limitations

| Strengths | Limitations |
|---|---|
| ✅ Fast and scalable | ❌ Requires choosing K beforehand |
| ✅ Simple to implement | ❌ Sensitive to initialization |
| ✅ Guaranteed to converge (to a local optimum) | ❌ Assumes spherical clusters |
| ✅ Works well for convex clusters | ❌ Struggles with varying densities |
| ✅ Easy to interpret | ❌ Affected by outliers |
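
The "requires choosing K" limitation is commonly handled with the elbow method (mentioned again in the evaluation notes): run K-Means for a range of K and look for the point where inertia stops dropping sharply. A short scikit-learn sketch on synthetic blobs (the data and K range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Inertia = within-cluster sum of squares (the objective J above)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}: inertia={km.inertia_:.1f}")
# Look for the "elbow" where inertia stops dropping sharply (here around K=4).
```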

2. DBSCAN: Density-Based Clustering

The Intuition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions separated by sparse regions. It can discover arbitrary-shaped clusters!

Analogy: Imagine cities from space at night. Dense lights = cities (clusters). Sparse lights = roads between cities (noise).

Key Advantage: Doesn't require specifying the number of clusters!

How DBSCAN Works

Core Concepts:

  • ε (epsilon): Neighborhood radius
  • MinPts: Minimum points to form a dense region
  • Core point: Has at least MinPts points (including itself) within its ε radius
  • Border point: Within ε of a core point, but not a core point itself
  • Noise point: Neither a core point nor a border point
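
To make these definitions concrete, here is a tiny NumPy sketch (toy points, illustrative ε and MinPts) that labels each point as core, border, or noise:

```python
import numpy as np

# Toy 2-D points: a dense clump, one nearby point, one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.3, 0.0], [2.0, 2.0]])
eps, min_pts = 0.25, 4

# Pairwise distances and ε-neighborhood counts (each point counts itself)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
neighbor_counts = (dists <= eps).sum(axis=1)

is_core = neighbor_counts >= min_pts
# Border: not core, but within ε of at least one core point
is_border = ~is_core & (dists[:, is_core] <= eps).any(axis=1)
is_noise = ~is_core & ~is_border

for i in range(len(X)):
    kind = "core" if is_core[i] else "border" if is_border[i] else "noise"
    print(f"point {i} {X[i]} -> {kind}")
```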

Algorithm:

  1. For each unvisited point, find all points within its ε radius
  2. If at least MinPts are found, start a new cluster and grow it by repeating the neighborhood search from every newly added core point
  3. Mark points that belong to no cluster as noise

Interactive Exploration


Try this:

  1. Compare DBSCAN with K-Means on the "Moons" dataset
  2. Adjust ε (epsilon) – too small creates noise, too large merges clusters
  3. Adjust Min Points – controls density threshold
  4. Notice DBSCAN handles arbitrary shapes and identifies outliers

Implementation Example

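In place of the embedded runtime, here is a minimal scikit-learn sketch on the two-moons dataset mentioned above (the ε and MinPts values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-circles: a shape K-Means cannot separate cleanly
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```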

DBSCAN Strengths & Limitations

| Strengths | Limitations |
|---|---|
| ✅ Finds arbitrary-shaped clusters | ❌ Sensitive to ε and MinPts parameters |
| ✅ Automatically determines the number of clusters | ❌ Struggles with varying densities |
| ✅ Robust to outliers (marks them as noise) | ❌ Struggles in high-dimensional spaces |
| ✅ No spherical assumption | ❌ Memory intensive for large datasets |

3. Hierarchical Clustering

The Intuition

Hierarchical clustering builds a tree (dendrogram) of nested clusters. You can cut the tree at different heights to get different numbers of clusters!

Analogy: Like a family tree – individuals → families → communities → cities → countries.

Two Approaches

  1. Agglomerative (Bottom-Up): Start with each point as a cluster, merge closest pairs
  2. Divisive (Top-Down): Start with one cluster, recursively split

We focus on Agglomerative (more common).

Linkage Criteria

How to measure distance between clusters?

| Linkage | Distance between clusters |
|---|---|
| Single | Minimum distance between any pair of points (one from each cluster) |
| Complete | Maximum distance between any pair of points (one from each cluster) |
| Average | Average distance over all cross-cluster pairs |
| Ward | Merge that minimizes the increase in within-cluster variance |

Interactive Exploration


Try this:

  1. Watch the hierarchical clustering animation
  2. Try different linkage methods (Single, Complete, Average)
  3. Notice how the dendrogram shows cluster relationships
  4. Cut at different heights to get different numbers of clusters

Implementation Example

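In place of the embedded runtime, here is a short SciPy sketch that builds the linkage matrix and cuts the dendrogram; Ward linkage and three clusters are example choices, not recommendations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Small synthetic dataset with 3 blobs
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.7, random_state=42)

# Agglomerative clustering: repeatedly merge the closest pair of clusters
Z = linkage(X, method="ward")  # try "single", "complete", "average" too

# "Cut" the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels)[1:])

# scipy.cluster.hierarchy.dendrogram(Z) draws the tree if matplotlib is available
```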


4. Comparing Clustering Methods

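As a stand-in for the embedded comparison, the sketch below runs all three algorithms side by side on the moons data (parameters are illustrative) and reports cluster counts and silhouette scores:

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

models = {
    "K-Means":      KMeans(n_clusters=2, n_init=10, random_state=42),
    "DBSCAN":       DBSCAN(eps=0.2, min_samples=5),
    "Hierarchical": AgglomerativeClustering(n_clusters=2, linkage="ward"),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    score = silhouette_score(X, labels) if n_found > 1 else float("nan")
    print(f"{name:13s} clusters={n_found}  silhouette={score:.3f}")
```

Note that the silhouette score favors compact, convex clusters, so DBSCAN's shape-respecting partition of the moons can score lower than the K-Means split even when it matches the visual structure better; always inspect the clusters, not just the metric.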

Decision Guide

Use K-Means when:

  • You know the number of clusters
  • Clusters are roughly spherical and similar size
  • Speed is important
  • Data is not too noisy

Use DBSCAN when:

  • You don't know the number of clusters
  • Clusters have arbitrary shapes
  • Data contains outliers
  • Varying cluster sizes

Use Hierarchical when:

  • You want a hierarchy of clusters
  • Small to medium datasets
  • Need to explore different numbers of clusters
  • Interpretability is important

5. Evaluating Clustering Quality

Internal Metrics (No Labels)

Silhouette Score: Measures how similar each point is to its own cluster compared with the nearest other cluster.

  • Range: [-1, 1]
  • Higher is better
  • Values near 1 indicate dense, well-separated clusters; values near 0 indicate overlapping clusters; negative values suggest misassigned points

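In place of the embedded runtime, a minimal scikit-learn sketch that scans a range of K and compares silhouette scores on synthetic blobs (the data and K range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Silhouette needs at least 2 clusters; scan a range of K and pick the best
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette={silhouette_score(X, labels):.3f}")
# The true structure (4 blobs) should score highest, but always sanity-check
# the chosen K against domain knowledge.
```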


Key Takeaways

K-Means: Fast, simple, requires K, assumes spherical clusters

DBSCAN: Finds arbitrary shapes, handles noise, automatic K, needs ε and MinPts

Hierarchical: Builds cluster hierarchy, flexible, interpretable dendrogram

Choose based on: Data structure, cluster shapes, whether K is known, outlier presence

Evaluation: Use silhouette score, elbow method, domain knowledge


What's Next?

Next lesson: Dimensionality Reduction – using PCA, t-SNE, and UMAP to visualize high-dimensional clustered data!