Gaussian Mixture Models & EM Algorithm

Introduction: Soft Clustering with Probabilities

Remember K-Means? It assigns each point to exactly ONE cluster – a hard assignment. But what if a point is right between two clusters? What if clusters overlap?

Gaussian Mixture Models (GMMs) solve this with soft clustering: each point can belong to multiple clusters with different probabilities!

Key Insight: GMMs model data as a mixture of Gaussian distributions, providing probabilistic cluster assignments and enabling anomaly detection, density estimation, and generative modeling.

Learning Objectives

  • Understand Gaussian distributions and mixture models
  • Master the Expectation-Maximization (EM) algorithm
  • Compare GMMs with K-Means
  • Use GMMs for anomaly detection
  • Apply GMMs to real-world clustering problems
  • Handle model selection (choosing number of components)

1. From Gaussian to Mixture

Single Gaussian Distribution

A Gaussian (normal) distribution is defined by:

  • Mean $\mu$: center
  • Covariance $\Sigma$: spread and orientation

Probability density:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

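A quick sketch (assuming NumPy and SciPy are available) that evaluates this density directly from the formula and checks it against `scipy.stats.multivariate_normal`; the values chosen for $\mu$, $\Sigma$, and $x$ are arbitrary examples:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, -0.5])

# Direct evaluation of the Gaussian density formula
d = len(mu)
diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
density = norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Same value via SciPy
print(density, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```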

Mixture of Gaussians

A mixture is a weighted sum of multiple Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Where:

  • $K$ = number of components
  • $\pi_k$ = mixing coefficient (weight) for component $k$
  • $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$

Interpretation: Data is generated by first choosing a cluster $k$ with probability $\pi_k$, then sampling from $\mathcal{N}(\mu_k, \Sigma_k)$.

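A small sketch of that two-step generative process, assuming NumPy; the mixture parameters below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters: weights, means, covariances
pis = np.array([0.5, 0.3, 0.2])
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([-4.0, 4.0])]
Sigmas = [np.eye(2), np.array([[1.5, 0.7], [0.7, 0.8]]), 0.5 * np.eye(2)]

def sample_mixture(n):
    # Step 1: pick a component k with probability pi_k
    ks = rng.choice(len(pis), size=n, p=pis)
    # Step 2: sample from the chosen Gaussian N(mu_k, Sigma_k)
    return np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])

X = sample_mixture(500)
print(X.shape)  # (500, 2)
```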


2. The EM Algorithm

Since we don't observe which component generated each point, we fit the parameters by maximum likelihood using Expectation-Maximization (EM):

E-Step (Expectation)

Compute the responsibility $\gamma_{ik}$ = probability that point $i$ belongs to component $k$:

$$\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

Interpretation: "How much" does component $k$ explain point $i$?

M-Step (Maximization)

Update parameters using responsibilities:

$$\pi_k = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ik}$$

$$\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}}$$

$$\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{N} \gamma_{ik}}$$

Algorithm Steps

  1. Initialize: Random $\mu_k$, $\Sigma_k$, $\pi_k$
  2. E-step: Compute responsibilities $\gamma_{ik}$
  3. M-step: Update parameters using responsibilities
  4. Repeat until convergence (log-likelihood stops improving)

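A minimal NumPy/SciPy sketch of this loop. The helper name `fit_gmm_em` and the initialization scheme (random means drawn from the data, identity covariances, uniform weights) are illustrative assumptions, not a reference implementation; in practice `sklearn.mixture.GaussianMixture` does this for you with better numerics:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iters=100, tol=1e-6, seed=0):
    """Minimal EM for a GMM; a teaching sketch, not production code."""
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # 1. Initialize: means sampled from the data, identity covariances, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.eye(d)] * K)
    pis = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E-step: responsibilities gamma[i, k] ∝ pi_k * N(x_i | mu_k, Sigma_k)
        dens = np.column_stack([
            pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X) for k in range(K)
        ])                                   # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # 3. M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        Nk = gamma.sum(axis=0)               # effective number of points per component
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigmas[k] += 1e-6 * np.eye(d)    # small ridge for numerical stability

        # 4. Convergence check on the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return pis, mus, Sigmas, gamma
```

Calling `fit_gmm_em(X, K=3)` returns the mixture weights, means, covariances, and the final responsibility matrix.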


3. GMM vs K-Means

Comparison

| Aspect | K-Means | GMM |
| --- | --- | --- |
| Assignment | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical | Elliptical (any orientation) |
| Algorithm | Lloyd's algorithm | EM algorithm |
| Output | Cluster labels | Probabilities + density model |
| Generative | No | Yes (can sample new data) |
| Anomaly detection | Difficult | Natural (low likelihood) |
| Speed | Faster | Slower |

When to Use Each

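As a rule of thumb, K-Means is a fast default when clusters are roughly spherical and similar in size; a GMM is worth the extra cost when clusters are elongated or overlapping, or when you need probabilities or a density model. A sketch (assuming scikit-learn) comparing the two on elongated, overlapping clusters; the covariances and seeds are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two elongated (elliptical) clusters that K-Means tends to split poorly
X1 = rng.multivariate_normal([0, 0], [[6.0, 5.0], [5.0, 6.0]], size=300)
X2 = rng.multivariate_normal([6, -4], [[6.0, 5.0], [5.0, 6.0]], size=300)
X = np.vstack([X1, X2])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
gmm_labels = gmm.predict(X)        # hard labels (argmax of responsibilities)
gmm_probs = gmm.predict_proba(X)   # soft assignments, one column per component

print(gmm_probs[:3].round(3))      # boundary points get split probabilities
```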


4. Anomaly Detection with GMM

GMMs naturally support anomaly detection: points with low likelihood under the model are anomalies!

$$\text{Anomaly Score}(x) = -\log p(x)$$

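A sketch of this idea with scikit-learn's `GaussianMixture`; the synthetic two-blob data, the uniform outliers, and the 99th-percentile threshold are all illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# "Normal" data from two Gaussian blobs, plus a few far-away outliers
X_normal = np.vstack([
    rng.normal([0, 0], 1.0, size=(300, 2)),
    rng.normal([8, 8], 1.0, size=(300, 2)),
])
X_outliers = rng.uniform(-15, 25, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_normal)

# score_samples returns log p(x); the anomaly score is its negative
anomaly_score = -gmm.score_samples(X)

# Flag points whose score exceeds, say, the 99th percentile of the training scores
threshold = np.percentile(-gmm.score_samples(X_normal), 99)
is_anomaly = anomaly_score > threshold
print(is_anomaly.sum(), "points flagged as anomalies")
```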


5. Model Selection: Choosing K

How many components should we use? Use information criteria:

Bayesian Information Criterion (BIC)

$$\text{BIC} = -2 \log \mathcal{L} + k \log n$$

Where:

  • $\mathcal{L}$ = maximized likelihood of the fitted model
  • $k$ = number of free parameters
  • $n$ = number of samples

Lower BIC = better model; the $k \log n$ term penalizes model complexity.

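A sketch of BIC-based selection with scikit-learn; the synthetic three-blob dataset and the candidate range $K = 1, \dots, 7$ are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data drawn from three blobs, so the "right" K is 3
X = np.vstack([
    rng.normal([0, 0], 0.8, size=(200, 2)),
    rng.normal([5, 5], 1.2, size=(200, 2)),
    rng.normal([0, 6], 0.6, size=(200, 2)),
])

# Fit GMMs for a range of K and pick the one with the lowest BIC
bics = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
print(bics, "-> best K by BIC:", best_k)
```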


Key Takeaways

GMMs model data as a mixture of Gaussian distributions

Soft clustering: Points can belong to multiple clusters with probabilities

EM algorithm: Iteratively estimates responsibilities (E-step) and updates parameters (M-step)

Advantages over K-Means: Elliptical clusters, probabilistic assignments, generative model

Anomaly detection: Natural with likelihood-based scoring

Model selection: Use BIC or AIC to choose number of components


What's Next?

Next lesson: Neural Networks Fundamentals – from biological inspiration to backpropagation!