Clustering Algorithms: K-Means, DBSCAN, Hierarchical

Introduction: Discovering Hidden Patterns

Imagine you're an e-commerce analyst with millions of customers. You want to group them into segments for targeted marketing – but you don't have labels. This is unsupervised learning: finding structure in unlabeled data.

Clustering is the art of grouping similar data points together. It's like sorting a messy pile of photos into albums – family, vacations, pets – without anyone telling you the categories beforehand!

Key Insight: Clustering reveals natural groupings in data, enabling pattern discovery, data compression, and feature engineering.

Learning Objectives

  • Master K-Means clustering and understand its assumptions
  • Apply DBSCAN for density-based clustering
  • Understand hierarchical clustering methods
  • Choose the right algorithm for different data structures
  • Evaluate clustering quality with appropriate metrics
  • Handle real-world clustering challenges

1. K-Means Clustering

The Intuition

K-Means partitions data into K clusters by iteratively:

  1. Assigning points to the nearest centroid
  2. Updating centroids as the mean of assigned points

Analogy: Imagine K post offices. Each house is served by the nearest office. As people move, the optimal office locations shift!

How K-Means Works

Algorithm Steps:

  1. Initialize: Randomly place K centroids
  2. Assign: Each point joins the nearest centroid's cluster
  3. Update: Move centroids to the mean of their cluster
  4. Repeat: Until centroids stop moving (convergence)

Mathematical Formulation:

Minimize within-cluster variance:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

Where \mu_k is the centroid of cluster C_k.
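
To make the objective concrete, here is a minimal NumPy sketch (the toy points and the fixed cluster assignment are made up for illustration) that computes J for a given clustering:

```python
import numpy as np

# Toy data: 6 points in 2-D with a hard assignment to K = 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.8, 7.5]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Centroids are the per-cluster means (this is the "update" step)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# J = sum over clusters of squared distances from points to their centroid
J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(2))
print(f"Within-cluster sum of squares J = {J:.3f}")
```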

Interactive Exploration


Try this:

  1. Click "Generate Data" and watch K-Means iterate
  2. Adjust K (number of clusters) – how does it change results?
  3. Try different datasets – when does K-Means fail?
  4. Notice how K-Means creates spherical clusters

Implementation from Scratch

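The embedded runtime isn't reproduced here, so the following is a minimal from-scratch sketch of the assign/update loop, assuming NumPy and synthetic two-blob data:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k random data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # 2. Assign: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # 4. Repeat until centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Quick check on two synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("Centroids:\n", centroids)
```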

K-Means Strengths & Limitations

| Strengths | Limitations |
|---|---|
| ✅ Fast and scalable | ❌ Requires choosing K beforehand |
| ✅ Simple to implement | ❌ Sensitive to initialization |
| ✅ Guaranteed to converge (to a local optimum) | ❌ Assumes spherical clusters |
| ✅ Works well for convex clusters | ❌ Struggles with varying densities |
| ✅ Easy to interpret | ❌ Affected by outliers |
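
The "requires choosing K" limitation is commonly handled with the elbow method (mentioned again in the evaluation notes): run K-Means for a range of K and look for the point where inertia stops dropping sharply. A short scikit-learn sketch on synthetic blobs (the data and K range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Inertia = within-cluster sum of squares (the objective J above)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}: inertia={km.inertia_:.1f}")
# Look for the "elbow" where inertia stops dropping sharply (here around K=4).
```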

2. DBSCAN: Density-Based Clustering

The Intuition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions separated by sparse regions. It can discover arbitrary-shaped clusters!

Analogy: Imagine cities from space at night. Dense lights = cities (clusters). Sparse lights = roads between cities (noise).

Key Advantage: Doesn't require specifying the number of clusters!

How DBSCAN Works

Core Concepts:

  • ε (epsilon): Neighborhood radius
  • MinPts: Minimum points to form a dense region
  • Core point: Has at least MinPts points (including itself) within its ε radius
  • Border point: Within ε of a core point, but not a core point itself
  • Noise point: Neither a core point nor a border point
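
To make these definitions concrete, here is a tiny NumPy sketch (toy points, illustrative ε and MinPts) that labels each point as core, border, or noise:

```python
import numpy as np

# Toy 2-D points: a dense clump, one nearby point, one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.3, 0.0], [2.0, 2.0]])
eps, min_pts = 0.25, 4

# Pairwise distances and ε-neighborhood counts (each point counts itself)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
neighbor_counts = (dists <= eps).sum(axis=1)

is_core = neighbor_counts >= min_pts
# Border: not core, but within ε of at least one core point
is_border = ~is_core & (dists[:, is_core] <= eps).any(axis=1)
is_noise = ~is_core & ~is_border

for i in range(len(X)):
    kind = "core" if is_core[i] else "border" if is_border[i] else "noise"
    print(f"point {i} {X[i]} -> {kind}")
```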

Algorithm:

  1. For each unvisited point, find all points within its ε radius
  2. If at least MinPts are found, start a new cluster and grow it by repeating the neighborhood search from every newly added core point
  3. Mark points that belong to no cluster as noise

Interactive Exploration


Try this:

  1. Compare DBSCAN with K-Means on the "Moons" dataset
  2. Adjust ε (epsilon) – too small creates noise, too large merges clusters
  3. Adjust Min Points – controls density threshold
  4. Notice DBSCAN handles arbitrary shapes and identifies outliers

Implementation Example

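In place of the embedded runtime, here is a minimal scikit-learn sketch on the two-moons dataset mentioned above (the ε and MinPts values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-circles: a shape K-Means cannot separate cleanly
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```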

DBSCAN Strengths & Limitations

| Strengths | Limitations |
|---|---|
| ✅ Finds arbitrary-shaped clusters | ❌ Sensitive to ε and MinPts parameters |
| ✅ Automatically determines the number of clusters | ❌ Struggles with varying densities |
| ✅ Robust to outliers (marks them as noise) | ❌ Struggles in high-dimensional spaces |
| ✅ No spherical assumption | ❌ Memory intensive for large datasets |

3. Hierarchical Clustering

The Intuition

Hierarchical clustering builds a tree (dendrogram) of nested clusters. You can cut the tree at different heights to get different numbers of clusters!

Analogy: Like a family tree – individuals → families → communities → cities → countries.

Two Approaches

  1. Agglomerative (Bottom-Up): Start with each point as a cluster, merge closest pairs
  2. Divisive (Top-Down): Start with one cluster, recursively split

We focus on Agglomerative (more common).

Linkage Criteria

How to measure distance between clusters?

| Linkage | Distance between clusters |
|---|---|
| Single | Minimum distance between any pair of points (one from each cluster) |
| Complete | Maximum distance between any pair of points (one from each cluster) |
| Average | Average distance over all cross-cluster pairs |
| Ward | Merge that minimizes the increase in within-cluster variance |

Interactive Exploration


Try this:

  1. Watch the hierarchical clustering animation
  2. Try different linkage methods (Single, Complete, Average)
  3. Notice how the dendrogram shows cluster relationships
  4. Cut at different heights to get different numbers of clusters

Implementation Example

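In place of the embedded runtime, here is a short SciPy sketch that builds the linkage matrix and cuts the dendrogram; Ward linkage and three clusters are example choices, not recommendations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Small synthetic dataset with 3 blobs
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.7, random_state=42)

# Agglomerative clustering: repeatedly merge the closest pair of clusters
Z = linkage(X, method="ward")  # try "single", "complete", "average" too

# "Cut" the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels)[1:])

# scipy.cluster.hierarchy.dendrogram(Z) draws the tree if matplotlib is available
```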


4. Comparing Clustering Methods

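As a stand-in for the embedded comparison, the sketch below runs all three algorithms side by side on the moons data (parameters are illustrative) and reports cluster counts and silhouette scores:

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

models = {
    "K-Means":      KMeans(n_clusters=2, n_init=10, random_state=42),
    "DBSCAN":       DBSCAN(eps=0.2, min_samples=5),
    "Hierarchical": AgglomerativeClustering(n_clusters=2, linkage="ward"),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    score = silhouette_score(X, labels) if n_found > 1 else float("nan")
    print(f"{name:13s} clusters={n_found}  silhouette={score:.3f}")
```

Note that the silhouette score favors compact, convex clusters, so DBSCAN's shape-respecting partition of the moons can score lower than the K-Means split even when it matches the visual structure better; always inspect the clusters, not just the metric.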

Decision Guide

Use K-Means when:

  • You know the number of clusters
  • Clusters are roughly spherical and similar size
  • Speed is important
  • Data is not too noisy

Use DBSCAN when:

  • You don't know the number of clusters
  • Clusters have arbitrary shapes
  • Data contains outliers
  • Varying cluster sizes

Use Hierarchical when:

  • You want a hierarchy of clusters
  • Small to medium datasets
  • Need to explore different numbers of clusters
  • Interpretability is important

5. Evaluating Clustering Quality

Internal Metrics (No Labels)

Silhouette Score: Measures how similar each point is to its own cluster compared with the nearest other cluster.

  • Range: [-1, 1]
  • Higher is better
  • Values near 1 indicate dense, well-separated clusters; values near 0 indicate overlapping clusters; negative values suggest misassigned points

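In place of the embedded runtime, a minimal scikit-learn sketch that scans a range of K and compares silhouette scores on synthetic blobs (the data and K range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Silhouette needs at least 2 clusters; scan a range of K and pick the best
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette={silhouette_score(X, labels):.3f}")
# The true structure (4 blobs) should score highest, but always sanity-check
# the chosen K against domain knowledge.
```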


Key Takeaways

K-Means: Fast, simple, requires K, assumes spherical clusters

DBSCAN: Finds arbitrary shapes, handles noise, automatic K, needs ε and MinPts

Hierarchical: Builds cluster hierarchy, flexible, interpretable dendrogram

Choose based on: Data structure, cluster shapes, whether K is known, outlier presence

Evaluation: Use silhouette score, elbow method, domain knowledge


What's Next?

Next lesson: Dimensionality Reduction – using PCA, t-SNE, and UMAP to visualize high-dimensional clustered data!