Introduction: Discovering Hidden Patterns
Imagine you're an e-commerce analyst with millions of customers. You want to group them into segments for targeted marketing, but you don't have labels. This is unsupervised learning: finding structure in unlabeled data.
Clustering is the art of grouping similar data points together. It's like sorting a messy pile of photos into albums (family, vacations, pets) without anyone telling you the categories beforehand!
Key Insight: Clustering reveals natural groupings in data, enabling pattern discovery, data compression, and feature engineering.
Learning Objectives
- Master K-Means clustering and understand its assumptions
- Apply DBSCAN for density-based clustering
- Understand hierarchical clustering methods
- Choose the right algorithm for different data structures
- Evaluate clustering quality with appropriate metrics
- Handle real-world clustering challenges
1. K-Means Clustering
See K-Means run step-by-step before reading the math: Naftali Harris's interactive K-Means visualizer lets you click to drop points and step through each iteration. A few minutes of interaction does what hours of static slides can't.
The Intuition
K-Means partitions data into K clusters by iteratively:
- Assigning points to the nearest centroid
- Updating centroids as the mean of assigned points
Analogy: Imagine K post offices. Each house is served by the nearest office. As people move, the optimal office locations shift!
How K-Means Works
Algorithm Steps:
- Initialize: Randomly place K centroids
- Assign: Each point joins the nearest centroid's cluster
- Update: Move centroids to the mean of their cluster
- Repeat: Until centroids stop moving (convergence)
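The four steps above can be run in a few lines with scikit-learn. This is a minimal sketch on synthetic blob data; the dataset, `n_clusters=3`, and other parameter values are illustrative choices, not part of the lesson:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic toy data: 300 points in 3 well-separated 2-D blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 reruns the algorithm from 10 random initializations
# and keeps the run with the lowest within-cluster variance
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # (3, 2): one 2-D centroid per cluster
print(km.inertia_)                # within-cluster sum of squared distances
```

Because K-Means only converges to a local optimum, the `n_init` restarts matter in practice: a single unlucky initialization can leave a centroid stranded between two true clusters.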
Mathematical Formulation:
Minimize the within-cluster sum of squared distances:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid (mean) of cluster $C_k$.
K-Means Strengths & Limitations
| Strengths | Limitations |
|---|---|
| ✅ Fast and scalable | ❌ Requires choosing K beforehand |
| ✅ Simple to implement | ❌ Sensitive to initialization |
| ✅ Guaranteed to converge | ❌ Assumes spherical clusters |
| ✅ Works well for convex clusters | ❌ Struggles with varying densities |
| ✅ Easy to interpret | ❌ Affected by outliers |
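The "requires choosing K beforehand" limitation is usually handled with the elbow method: fit K-Means for a range of K and look for the point where inertia stops dropping sharply. A minimal sketch (the data and K range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia always decreases as K grows; the "elbow" is where the
# marginal improvement flattens out
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, j in zip(range(1, 8), inertias):
    print(f"K={k}: inertia={j:.1f}")
```

On this data the curve typically bends around K=4, matching the number of generated blobs. The elbow is a heuristic, not a test; pair it with the silhouette score discussed later.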
2. DBSCAN: Density-Based Clustering
The Intuition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions separated by sparse regions. It can discover arbitrary-shaped clusters!
Analogy: Imagine cities from space at night. Dense lights = cities (clusters). Sparse lights = roads between cities (noise).
Key Advantage: Doesn't require specifying the number of clusters!
How DBSCAN Works
Core Concepts:
- ε (epsilon): Neighborhood radius
- MinPts: Minimum points to form a dense region
- Core point: Has ≥ MinPts within ε radius
- Border point: Within ε of a core point but not core itself
- Noise: Not core or border
Algorithm:
- For each unvisited point, find all points within ε radius
- If ≥ MinPts found, start a new cluster and expand
- Mark isolated points as noise
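To see why density-based clustering matters, run DBSCAN on the classic two-moons dataset, where K-Means fails. A minimal sketch using scikit-learn; the `eps` and `min_samples` values here are illustrative and would need tuning on real data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighborhood radius, min_samples = MinPts from the text
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 2: one cluster per moon
```

Note the `-1` convention: DBSCAN never forces an outlier into a cluster, which is exactly the "robust to outliers" strength in the table below.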
DBSCAN Strengths & Limitations
| Strengths | Limitations |
|---|---|
| ✅ Finds arbitrary-shaped clusters | ❌ Sensitive to ε and MinPts parameters |
| ✅ Automatically determines # clusters | ❌ Struggles with varying densities |
| ✅ Robust to outliers (marks as noise) | ❌ High-dimensional data challenges |
| ✅ No spherical assumption | ❌ Memory intensive for large datasets |
3. Hierarchical Clustering
The Intuition
Hierarchical clustering builds a tree (dendrogram) of nested clusters. You can cut the tree at different heights to get different numbers of clusters!
Analogy: Like a family tree ā individuals ā families ā communities ā cities ā countries.
Two Approaches
- Agglomerative (Bottom-Up): Start with each point as a cluster, merge closest pairs
- Divisive (Top-Down): Start with one cluster, recursively split
We focus on Agglomerative (more common).
Linkage Criteria
How to measure distance between clusters?
| Linkage | Distance Metric |
|---|---|
| Single | Minimum distance between any two points |
| Complete | Maximum distance between any two points |
| Average | Average distance between all pairs |
| Ward | Minimizes within-cluster variance |
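Agglomerative clustering with these linkage criteria can be sketched with SciPy, which also gives you the dendrogram's merge history so you can "cut the tree" at any level. The toy data and the choice of Ward linkage here are illustrative:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Ward linkage: at each step, merge the pair of clusters whose
# union least increases within-cluster variance
Z = linkage(X, method="ward")  # shape (n-1, 4): one row per merge

# Cut the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))  # SciPy cluster ids start at 1
```

Swapping `method="ward"` for `"single"`, `"complete"`, or `"average"` applies the other linkage criteria from the table; `scipy.cluster.hierarchy.dendrogram(Z)` plots the tree itself.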
4. Comparing Clustering Methods
Decision Guide
Use K-Means when:
- You know the number of clusters
- Clusters are roughly spherical and similar size
- Speed is important
- Data is not too noisy
Use DBSCAN when:
- You don't know the number of clusters
- Clusters have arbitrary shapes
- Data contains outliers
- Varying cluster sizes
Use Hierarchical when:
- You want a hierarchy of clusters
- Small to medium datasets
- Need to explore different numbers of clusters
- Interpretability is important
4.5. Modern Clustering Techniques
Spectral Clustering
Graph-based clustering for complex, non-convex shapes! It builds a similarity graph over the points and clusters using the eigenvectors of the graph Laplacian.
Analogy: Cutting a social network along its weakest ties: groups that are densely connected internally stay together, no matter how irregular their shape.
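Spectral clustering handles the same non-convex shapes that defeat K-Means. A minimal sketch with scikit-learn on two-moons data; the nearest-neighbors affinity and `n_neighbors=10` are illustrative settings:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Build a nearest-neighbor similarity graph, then partition it
# where connections are weakest (the "weakest ties")
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(len(set(labels)))  # 2
```

Unlike DBSCAN, spectral clustering still requires choosing the number of clusters, but it trades that for very flexible cluster shapes.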
Real-World: Customer Segmentation
E-commerce with 1M customers - which algorithm?
Solution: Try K-Means (fast), DBSCAN (noise-robust), Hierarchical (interpretable), then profile segments!
5. Evaluating Clustering Quality
Internal Metrics (No Labels)
Silhouette Score: Measures how similar points are to their cluster vs. other clusters
- Range: [-1, 1]
- Higher is better
- 1 = perfect clustering
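The silhouette score is one function call in scikit-learn. A minimal sketch, scoring a K-Means fit on illustrative blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all points: (b - a) / max(a, b), where
# a = mean distance to own cluster, b = mean distance to nearest other cluster
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for well-separated blobs
```

Because it needs no ground-truth labels, the same call works for comparing K-Means, DBSCAN, and hierarchical results on the same data (exclude DBSCAN's noise points first).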
Key Takeaways
✅ K-Means: Fast, simple, requires K, assumes spherical clusters
✅ DBSCAN: Finds arbitrary shapes, handles noise, automatic K, needs ε and MinPts
✅ Hierarchical: Builds cluster hierarchy, flexible, interpretable dendrogram
✅ Choose based on: Data structure, cluster shapes, whether K is known, outlier presence
✅ Evaluation: Use silhouette score, elbow method, domain knowledge
What's Next?
Next lesson: Dimensionality Reduction: using PCA, t-SNE, and UMAP to visualize high-dimensional clustered data!
Further Reading
Interactive Visualizations
- Naftali Harris, Visualizing K-Means Clustering: click to drop points, step through each K-Means iteration, and watch centroids migrate. The classic interactive.
- Naftali Harris, Visualizing DBSCAN Clustering: companion piece that shows ε-neighborhoods and core-point propagation in real time.
- scikit-learn, Comparing Clustering Algorithms: the canonical 9-dataset × 10-algorithm grid. Print this one.
- Google PAIR, Understanding UMAP: bridges clustering and dimensionality reduction with interactive hyperparameter sliders.
Video Tutorials
- StatQuest: K-means clustering and Hierarchical Clustering (Josh Starmer).
- Google ML Crash Course, Clustering: free interactive mini-course with exercises.
Papers & Articles
- DBSCAN Revisited, Revisited (Schubert et al., TODS 2017). The definitive modern reference; worth reading before using DBSCAN in production.
- HDBSCAN: Density-Based Clustering Based on Hierarchical Density Estimates (Campello, Moulavi, Sander). The stronger successor to DBSCAN; library: hdbscan.
- A Survey of Clustering Algorithms for Big Data: still a good taxonomy if you want the map of the landscape.
Documentation & Books
- scikit-learn: Clustering: the ranked comparison table is genuinely the best single page on clustering online.
- Book: Pattern Recognition and Machine Learning, Bishop (Chapter 9).