Introduction: Soft Clustering with Probabilities
Remember K-Means? It assigns each point to exactly ONE cluster – a hard assignment. But what if a point is right between two clusters? What if clusters overlap?
Gaussian Mixture Models (GMMs) solve this with soft clustering: each point can belong to multiple clusters with different probabilities!
Key Insight: GMMs model data as a mixture of Gaussian distributions, providing probabilistic cluster assignments and enabling anomaly detection, density estimation, and generative modeling.
Learning Objectives
- Understand Gaussian distributions and mixture models
- Master the Expectation-Maximization (EM) algorithm
- Compare GMMs with K-Means
- Use GMMs for anomaly detection
- Apply GMMs to real-world clustering problems
- Handle model selection (choosing number of components)
1. From Gaussian to Mixture
Single Gaussian Distribution
A Gaussian (normal) distribution is defined by:
- Mean $\mu$: center
- Covariance $\Sigma$: spread and orientation

Probability density:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
Mixture of Gaussians
A mixture is a weighted sum of multiple Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Where:
- $K$ = number of components
- $\pi_k$ = mixing coefficient (weight) for component $k$
- $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$

Interpretation: Data is generated by first choosing a cluster $k$ with probability $\pi_k$, then sampling from $\mathcal{N}(\mu_k, \Sigma_k)$.
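To make this generative story concrete, here is a minimal NumPy/SciPy sketch. The variable names (`weights`, `means`, `covs`) and the specific parameter values are illustrative, not from any library:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component mixture in 2D
weights = np.array([0.6, 0.4])                       # mixing coefficients pi_k (sum to 1)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.8], [0.8, 1.5]])]

def mixture_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def sample(n, rng=np.random.default_rng(0)):
    """Generative story: pick component k with probability pi_k, then draw x ~ N(mu_k, Sigma_k)."""
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])

X = sample(500)
print(mixture_density(np.array([0.0, 0.0])))    # high density near the first component
print(mixture_density(np.array([10.0, 10.0])))  # near zero far from both components
```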
2. The EM Algorithm
Since we don't know which component generated each point, we use Expectation-Maximization (EM):
E-Step (Expectation)
Compute the responsibility $\gamma_{ik}$ = probability that point $x_i$ belongs to component $k$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

Interpretation: "How much" does component $k$ explain point $x_i$?
M-Step (Maximization)
Update parameters using the responsibilities:

$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^\top$$
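A matching sketch of the M-step, assuming a `gamma` responsibility matrix like the one produced by the E-step sketch above; the small ridge term is an assumption added for numerical stability, not part of the textbook update:

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate weights, means, and covariances from responsibilities."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                       # effective number of points per component
    weights = Nk / n                             # pi_k = N_k / N
    means = (gamma.T @ X) / Nk[:, None]          # mu_k = responsibility-weighted average
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]
        cov_k = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        covs.append(cov_k + 1e-6 * np.eye(d))    # small ridge keeps covariance invertible
    return weights, means, covs
```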
Algorithm Steps
- Initialize: Random $\pi_k$, $\mu_k$, $\Sigma_k$
- E-step: Compute responsibilities
- M-step: Update parameters using responsibilities
- Repeat until convergence (log-likelihood stops improving)
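In practice you rarely hand-roll this loop: scikit-learn's `GaussianMixture` runs EM for you. A minimal usage sketch on toy data (the two blobs are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping blobs as toy data
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([3, 3], 1.2, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                          # runs EM until the log-likelihood stops improving

print(gmm.converged_)               # True if EM converged within max_iter
print(gmm.weights_)                 # mixing coefficients pi_k
print(gmm.means_)                   # component means mu_k
print(gmm.predict_proba(X[:3]))     # soft assignments (responsibilities) for first 3 points
```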
3. GMM vs K-Means
Comparison
| Aspect | K-Means | GMM |
|---|---|---|
| Assignment | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical | Elliptical (any orientation) |
| Algorithm | Lloyd's algorithm | EM algorithm |
| Output | Cluster labels | Probabilities + density model |
| Generative | No | Yes (can sample new data) |
| Anomaly detection | Difficult | Natural (low likelihood) |
| Speed | Faster | Slower |
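To see the hard vs. soft distinction in code, here is a small side-by-side sketch on deliberately overlapping toy data (the data and example outputs are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([2.5, 2.5], 1.0, size=(150, 2))])   # overlapping clusters

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

point = X[:1]                          # a point that may sit between the clusters
print(kmeans.predict(point))           # hard label, e.g. [0]
print(gmm.predict_proba(point))        # soft assignment, e.g. [[0.73, 0.27]]
```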
When to Use Each

Prefer K-Means when speed matters and you only need hard labels for roughly spherical clusters. Prefer GMM when clusters overlap or are elongated, when you need probabilities or a density model, or when you want to detect anomalies or generate new samples.
4. Anomaly Detection with GMM
GMMs naturally support anomaly detection: points with low likelihood under the model are anomalies!
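A minimal sketch using scikit-learn's `score_samples`, which returns the per-point log-likelihood under the fitted mixture; the 1st-percentile threshold is an arbitrary illustrative choice, not a recommended default:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X_train = rng.normal([0, 0], 1.0, size=(500, 2))      # "normal" data
X_test = np.array([[0.1, -0.2], [8.0, 8.0]])          # one typical point, one outlier

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# Log-likelihood of each test point under the fitted mixture
log_likelihood = gmm.score_samples(X_test)

# Flag points less likely than the 1st percentile of the training data
threshold = np.percentile(gmm.score_samples(X_train), 1)
is_anomaly = log_likelihood < threshold
print(log_likelihood, is_anomaly)      # the outlier gets a much lower score
```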
5. Model Selection: Choosing K
How many components should we use? Use information criteria:
Bayesian Information Criterion (BIC)

$$\text{BIC} = -2 \ln \hat{L} + p \ln n$$

Where:
- $\hat{L}$ = maximized likelihood of the model
- $p$ = number of free parameters
- $n$ = number of samples
Lower BIC = better model. BIC penalizes model complexity.
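scikit-learn exposes this directly via `GaussianMixture.bic`; a typical model-selection loop looks roughly like this (the three-blob toy data is illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 0.8, size=(200, 2)),
               rng.normal([0, 5], 0.6, size=(200, 2))])

bics = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))            # lower BIC is better

best_k = int(np.argmin(bics)) + 1
print(bics)
print("Best K by BIC:", best_k)        # should typically recover K = 3 here
```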
Key Takeaways
✅ GMMs model data as a mixture of Gaussian distributions
✅ Soft clustering: Points can belong to multiple clusters with probabilities
✅ EM algorithm: Iteratively estimates responsibilities (E-step) and updates parameters (M-step)
✅ Advantages over K-Means: Elliptical clusters, probabilistic assignments, generative model
✅ Anomaly detection: Natural with likelihood-based scoring
✅ Model selection: Use BIC or AIC to choose number of components
What's Next?
Next lesson: Neural Networks Fundamentals – from biological inspiration to backpropagation!
Further Reading
Interactive Visualizations
- scikit-learn — GMM Covariance Comparison — see how `full`, `tied`, `diag`, and `spherical` covariance matrices change cluster shape on the same data.
- scikit-learn — Density Estimation: GMM — runnable example with concentric distributions and overlapping components.
- Distill — Gaussian Process — neighboring topic with the same Gaussian-distribution intuition you're using here.
Video Tutorials
- StatQuest — Gaussian Mixture Models, Clearly Explained (Josh Starmer).
- StatQuest — Expectation-Maximization — the underlying algorithm, broken down.
- Mathematicalmonk — EM Algorithm Series — the most rigorous free walkthrough.
Papers & Articles
- Maximum Likelihood from Incomplete Data via the EM Algorithm — Dempster, Laird, Rubin, JRSS 1977. The foundational EM paper.
- A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants — Neal & Hinton, 1998. Why EM works and when it doesn't.
- Variational Inference: A Review for Statisticians — Blei, Kucukelbir, McAuliffe, JASA 2017. The modern probabilistic-modeling generalization of EM.
Documentation & Books
- scikit-learn: Gaussian Mixture Models — covers `GaussianMixture` and `BayesianGaussianMixture` (which auto-selects K).
- Book: Pattern Recognition and Machine Learning — Bishop (Chapter 9). Still the gold standard for GMMs and EM.
- Book: Bayesian Reasoning and Machine Learning — Barber (free PDF).