ADVANCED ML: UNSUPERVISED LEARNING & PRODUCTION / L04GAUSSIAN MIXTURE MODELS & EM ALGORITHM
课程 · 12 · 04 / 12
LESSON 04 · ADVANCED · 60 MIN · ◆ 1 INSTRUMENT

Gaussian Mixture Models & EM Algorithm

Probabilistic clustering with GMMs. Understand the Expectation-Maximization algorithm and Bayesian approaches to clustering.

Introduction: Soft Clustering with Probabilities

Remember K-Means? It assigns each point to exactly ONE cluster – a hard assignment. But what if a point is right between two clusters? What if clusters overlap?

Gaussian Mixture Models (GMMs) solve this with soft clustering: each point can belong to multiple clusters with different probabilities!

Key Insight: GMMs model data as a mixture of Gaussian distributions, providing probabilistic cluster assignments and enabling anomaly detection, density estimation, and generative modeling.

Learning Objectives

  • Understand Gaussian distributions and mixture models
  • Master the Expectation-Maximization (EM) algorithm
  • Compare GMMs with K-Means
  • Use GMMs for anomaly detection
  • Apply GMMs to real-world clustering problems
  • Handle model selection (choosing number of components)

1. From Gaussian to Mixture

Single Gaussian Distribution

A Gaussian (normal) distribution is defined by:

  • Mean μ\mu: center
  • Covariance Σ\Sigma: spread and orientation

Probability density:

N(xμ,Σ)=1(2π)d/2Σ1/2exp(12(xμ)TΣ1(xμ))\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)
FIG. 02Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 02Interactive Python code execution environment

Mixture of Gaussians

A mixture is a weighted sum of multiple Gaussians:

p(x)=k=1KπkN(xμk,Σk)p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)

Where:

  • KK = number of components
  • πk\pi_k = mixing coefficient (weight) for component kk
  • k=1Kπk=1\sum_{k=1}^{K} \pi_k = 1 and πk0\pi_k \geq 0

Interpretation: Data is generated by first choosing a cluster kk with probability πk\pi_k, then sampling from N(μk,Σk)\mathcal{N}(\mu_k, \Sigma_k).

FIG. 04Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 04Interactive Python code execution environment

2. The EM Algorithm

Since we don't know which component generated each point, we use Expectation-Maximization (EM):

E-Step (Expectation)

Compute responsibility γik\gamma_{ik} = probability that point ii belongs to component kk:

γik=πkN(xiμk,Σk)j=1KπjN(xiμj,Σj)\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j)}

Interpretation: "How much" does component kk explain point ii?

M-Step (Maximization)

Update parameters using responsibilities:

πk=1Ni=1Nγik\pi_k = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ik} μk=i=1Nγikxii=1Nγik\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} x_i}{\sum_{i=1}^{N} \gamma_{ik}} Σk=i=1Nγik(xiμk)(xiμk)Ti=1Nγik\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{N} \gamma_{ik}}

Algorithm Steps

  1. Initialize: Random μk\mu_k, Σk\Sigma_k, πk\pi_k
  2. E-step: Compute responsibilities γik\gamma_{ik}
  3. M-step: Update parameters using responsibilities
  4. Repeat until convergence (log-likelihood stops improving)
FIG. 06Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 06Interactive Python code execution environment

3. GMM vs K-Means

Comparison

AspectK-MeansGMM
AssignmentHard (0 or 1)Soft (probabilities)
Cluster shapeSphericalElliptical (any orientation)
AlgorithmLloyd's algorithmEM algorithm
OutputCluster labelsProbabilities + density model
GenerativeNoYes (can sample new data)
Anomaly detectionDifficultNatural (low likelihood)
SpeedFasterSlower

When to Use Each

FIG. 08Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 08Interactive Python code execution environment

4. Anomaly Detection with GMM

GMMs naturally support anomaly detection: points with low likelihood under the model are anomalies!

Anomaly Score(x)=logp(x)\text{Anomaly Score}(x) = -\log p(x)
FIG. 10Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 10Interactive Python code execution environment

5. Model Selection: Choosing K

How many components should we use? Use information criteria:

Bayesian Information Criterion (BIC)

BIC=2logL+klogn\text{BIC} = -2 \log \mathcal{L} + k \log n

Where:

  • L\mathcal{L} = likelihood
  • kk = number of parameters
  • nn = number of samples

Lower BIC = better model. BIC penalizes model complexity.

FIG. 12Python Code Executor
INTERACTIVE
LOADING INSTRUMENT
Fig. 12Interactive Python code execution environment

Key Takeaways

GMMs model data as a mixture of Gaussian distributions

Soft clustering: Points can belong to multiple clusters with probabilities

EM algorithm: Iteratively estimates responsibilities (E-step) and updates parameters (M-step)

Advantages over K-Means: Elliptical clusters, probabilistic assignments, generative model

Anomaly detection: Natural with likelihood-based scoring

Model selection: Use BIC or AIC to choose number of components


What's Next?

Next lesson: Neural Networks Fundamentals – from biological inspiration to backpropagation!


Further Reading

Interactive Visualizations

Video Tutorials

Papers & Articles

Documentation & Books

  • scikit-learn: Gaussian Mixture Models — covers GaussianMixture and BayesianGaussianMixture (which auto-selects K).
  • Book: Pattern Recognition and Machine Learning — Bishop (Chapter 9). Still the gold standard for GMMs and EM.
  • Book: Bayesian Reasoning and Machine Learning — Barber (free PDF).