Gaussian Mixture Models & EM Algorithm

Introduction: Soft Clustering with Probabilities

Remember K-Means? It assigns each point to exactly ONE cluster – a hard assignment. But what if a point is right between two clusters? What if clusters overlap?

Gaussian Mixture Models (GMMs) solve this with soft clustering: each point can belong to multiple clusters with different probabilities!

Key Insight: GMMs model data as a mixture of Gaussian distributions, providing probabilistic cluster assignments and enabling anomaly detection, density estimation, and generative modeling.

Learning Objectives

Understand Gaussian distributions and mixture models
Master the Expectation-Maximization (EM) algorithm
Compare GMMs with K-Means
Use GMMs for anomaly detection
Apply GMMs to real-world clustering problems
Handle model selection (choosing number of components)

1. From Gaussian to Mixture

Single Gaussian Distribution

A Gaussian (normal) distribution is defined by:

Mean $\mu$: center
Covariance $\Sigma$: spread and orientation

Probability density for a $d$-dimensional point $x$:

\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)
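To make the density formula concrete, here is a minimal NumPy sketch that evaluates it for a single point. The function name `gaussian_pdf` is just an illustrative choice, not something defined in this lesson.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density N(x | mu, sigma) for one d-dimensional point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    # rather than an explicit matrix inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    # Normalizing constant (2*pi)^{d/2} * |Sigma|^{1/2}.
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

# At the mean of a standard 2-D Gaussian the density equals 1 / (2*pi).
p_center = gaussian_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
p_far = gaussian_pdf([3.0, 3.0], [0.0, 0.0], np.eye(2))
```

The density is highest at the mean and falls off as the point moves away, which is why `p_far` is much smaller than `p_center`.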

Mixture of Gaussians

A mixture is a weighted sum of multiple Gaussians:

p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)

Where:

$K$ = number of components
$\pi_k$ = mixing coefficient (weight) for component $k$
$\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$

Interpretation: Each data point is generated by first choosing a component $k$ with probability $\pi_k$, then sampling from $\mathcal{N}(\mu_k, \Sigma_k)$.
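The two-stage generative story can be sketched in a few lines of NumPy. The parameter values below (two components in 2-D) are made up purely for illustration, and `sample_gmm` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed, not from the lesson): K = 2, d = 2.
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
sigmas = np.array([[[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 0.8], [0.8, 1.0]]])

def sample_gmm(n, pis, mus, sigmas, rng):
    # Stage 1: pick a component index for each sample with probabilities pi_k.
    z = rng.choice(len(pis), size=n, p=pis)
    # Stage 2: draw each sample from the Gaussian of its chosen component.
    x = np.empty((n, mus.shape[1]))
    for k in range(len(pis)):
        idx = z == k
        x[idx] = rng.multivariate_normal(mus[k], sigmas[k], size=int(idx.sum()))
    return x, z

X, z = sample_gmm(5000, pis, mus, sigmas, rng)
```

In real data only `X` is observed. The component labels `z` are hidden, and that is exactly the difficulty a fitting procedure has to work around.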

2. The EM Algorithm

Since we don't know which component generated each point, we can't simply fit each Gaussian to "its own" points. Expectation-Maximization (EM) handles this by alternating two steps:

E-Step (Expectation)

Compute the responsibility $\gamma_{ik}$: the posterior probability that point $x_i$ was generated by component $k$, given the current parameters:

\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j)}

Interpretation: "How much" does component $k$ explain point $i$?
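A minimal NumPy sketch of the E-step follows. The helper names `gaussian_pdf` and `e_step` and the toy parameters are assumptions for illustration. For readability it works with raw densities; a more careful implementation would typically work in log space to avoid underflow.

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Row-wise density N(x_i | mu, sigma) for X of shape (N, d)."""
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

def e_step(X, pis, mus, sigmas):
    """Return gamma of shape (N, K); row i holds gamma_{i1}, ..., gamma_{iK}."""
    # Numerator of the responsibility formula: pi_k * N(x_i | mu_k, Sigma_k).
    weighted = np.column_stack(
        [pis[k] * gaussian_pdf(X, mus[k], sigmas[k]) for k in range(len(pis))]
    )
    # Denominator: sum over all components, so each row sums to 1.
    return weighted / weighted.sum(axis=1, keepdims=True)

# Toy setup (assumed values): two unit-covariance components.
X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
gamma = e_step(X, pis, mus, sigmas)
```

The first two points sit on a component mean and are claimed almost entirely by that component. The third point lies exactly halfway between the means, so its responsibilities split evenly.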

M-Step (Maximization)

Update the parameters, using the responsibilities as soft counts (the $\mu_k$ in the covariance update is the newly computed mean):

\pi_k = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ik}

\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} x_i}{\sum_{i=1}^{N} \gamma_{ik}}

\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{N} \gamma_{ik}}
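The three update formulas translate almost line for line into NumPy. The sketch below assumes a responsibility matrix `gamma` of shape (N, K) is already available; the function name `m_step` and the toy inputs are illustrative only.

```python
import numpy as np

def m_step(X, gamma):
    """Update (pi, mu, Sigma) from data X (N, d) and responsibilities gamma (N, K)."""
    N, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)             # soft count of points per component
    pis = Nk / N                       # pi_k = (1/N) * sum_i gamma_ik
    mus = (gamma.T @ X) / Nk[:, None]  # responsibility-weighted means
    sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]              # uses the updated mean mu_k
        sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, sigmas

# Toy check with hard (0/1) responsibilities: the updates reduce to
# ordinary per-cluster proportions, means, and covariances.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pis, mus, sigmas = m_step(X, gamma)
```

With fractional responsibilities, each point simply contributes to every component in proportion to its $\gamma_{ik}$.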

Algorithm Steps

Initialize: Random $\mu_k$, $\Sigma_k$, $\pi_k$
E-step: Compute responsibilities $\gamma_{ik}$
M-step: Update parameters using responsibilities
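Repeat the E-step and M-step until convergence (the log-likelihood stops improving). EM only guarantees a local optimum, so the result depends on the initialization.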
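Putting the steps together, here is a compact sketch of the full loop for the one-dimensional case, where each covariance is a scalar variance. Several details are assumptions made for this example: means are initialized at random data points, variances start at the overall data variance, a small variance floor guards against collapse, and the synthetic data is invented for the demonstration.

```python
import numpy as np

def fit_gmm_1d(x, K, n_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Initialize: random data points as means, shared variance, uniform weights.
    mus = rng.choice(x, size=K, replace=False)
    variances = np.full(K, x.var())
    pis = np.full(K, 1.0 / K)
    history = []  # log-likelihood at each iteration
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k].
        dens = np.exp(-0.5 * (x[:, None] - mus) ** 2 / variances)
        dens /= np.sqrt(2 * np.pi * variances)
        weighted = pis * dens
        total = weighted.sum(axis=1)
        gamma = weighted / total[:, None]
        # Convergence check on the log-likelihood of the current parameters.
        history.append(np.log(total).sum())
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: update weights, means, variances.
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = gamma.T @ x / Nk
        variances = (gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
        variances = np.maximum(variances, 1e-6)  # assumed floor, not part of EM itself
    return pis, mus, variances, history

# Synthetic data (assumed): 30% from N(-3, 1), 70% from N(4, 1.5^2).
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(-3.0, 1.0, 300), data_rng.normal(4.0, 1.5, 700)])
pis, mus, variances, history = fit_gmm_1d(x, K=2)
```

The `history` list is useful for a sanity check: the log-likelihood should never decrease from one iteration to the next. If it does, there is probably a bug in one of the updates.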

3. GMM vs K-Means

Comparison

Aspect	K-Means	GMM
Assignment	Hard (0 or 1)	Soft (probabilities)
Cluster shape	Spherical	Elliptical (any orientation)
Algorithm	Lloyd's algorithm	EM algorithm
Output	Cluster labels	Probabilities + density model
Generative	No	Yes (can sample new data)
Anomaly detection	Difficult	Natural (low likelihood)
Speed	Faster	Slower
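The first row of the comparison, hard versus soft assignment, can be illustrated on a single point. This is a small sketch with made-up values: two 1-D clusters with centers at 0 and 2, and for the GMM side equal weights and unit variances.

```python
import numpy as np

centers = np.array([0.0, 2.0])  # two cluster centers (illustrative values)
x = 0.9                         # a point slightly closer to the first center

# K-Means style hard assignment: index of the nearest center.
hard_label = int(np.argmin((x - centers) ** 2))

# GMM style soft assignment, assuming equal weights and unit variances.
dens = np.exp(-0.5 * (x - centers) ** 2) / np.sqrt(2 * np.pi)
soft = 0.5 * dens / np.sum(0.5 * dens)
```

Here `hard_label` is 0, with no indication of how close the call was, while `soft` comes out at roughly 0.55 and 0.45, which makes the ambiguity explicit.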

When to Use Each

4. Anomaly Detection with GMM

GMMs naturally support anomaly detection: points that have low density under the fitted model are likely anomalies!

\text{Anomaly Score}(x) = -\log p(x)
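The score can be computed directly from the mixture density. The sketch below evaluates $\log p(x)$ with the log-sum-exp trick for numerical stability. The mixture parameters, the test points, and the cutoff of 10.0 are all assumptions for illustration; in practice a threshold is often picked from a percentile of the scores on training data.

```python
import numpy as np

def gmm_log_density(X, pis, mus, sigmas):
    """log p(x_i) under the mixture, for each row of X with shape (N, d)."""
    N, d = X.shape
    log_terms = np.empty((N, len(pis)))
    for k in range(len(pis)):
        diff = X - mus[k]
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigmas[k]), diff)
        _, logdet = np.linalg.slogdet(sigmas[k])
        # log(pi_k) + log N(x | mu_k, Sigma_k)
        log_terms[:, k] = np.log(pis[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Log-sum-exp over components.
    m = log_terms.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_terms - m).sum(axis=1, keepdims=True))).ravel()

# Assumed mixture: two unit-covariance components.
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
sigmas = np.array([np.eye(2), np.eye(2)])

X = np.array([[0.1, -0.2], [5.0, 5.1], [20.0, -15.0]])
scores = -gmm_log_density(X, pis, mus, sigmas)  # Anomaly Score(x) = -log p(x)
threshold = 10.0                                # hypothetical cutoff
is_anomaly = scores > threshold
```

The first two points lie near a component mean and get low scores. The third is far from both components, so its score is very large and it is the only one flagged.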

5. Model Selection: Choosing K

How many components should we use? Use information criteria:

Bayesian Information Criterion (BIC)

\text{BIC} = -2 \log \mathcal{L} + k \log n

Where:

$\mathcal{L}$ = maximized likelihood of the fitted model
$k$ = number of free parameters in the model (not the component index used earlier)
$n$ = number of samples

Lower BIC = better model. The $k \log n$ term penalizes model complexity, so adding components only pays off if the likelihood improves enough.
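For a GMM, the parameter count in the BIC formula depends on the number of components and the data dimension. The sketch below assumes full (unconstrained) covariance matrices; the function names are illustrative, and the log-likelihood value in the example is a placeholder rather than a real result.

```python
import math

def n_free_params(K, d):
    """Free parameters of a K-component GMM in d dimensions, full covariances."""
    weights = K - 1                 # mixing weights sum to 1, so one is determined
    means = K * d
    covs = K * d * (d + 1) // 2     # each symmetric d x d matrix
    return weights + means + covs

def bic(log_likelihood, n_params, n_samples):
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# With the same (placeholder) log-likelihood, the model with more
# components gets the higher, i.e. worse, BIC.
n = 1000
ll = -2500.0  # hypothetical value, for illustration only
bic_k2 = bic(ll, n_free_params(2, 2), n)
bic_k3 = bic(ll, n_free_params(3, 2), n)
```

A typical workflow would fit one model per candidate number of components, compute the BIC of each from its own log-likelihood, and keep the one with the lowest value.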

Key Takeaways

✅ GMMs model data as a mixture of Gaussian distributions

✅ Soft clustering: Points can belong to multiple clusters with probabilities

✅ EM algorithm: Iteratively estimates responsibilities (E-step) and updates parameters (M-step)

✅ Advantages over K-Means: Elliptical clusters, probabilistic assignments, generative model

✅ Anomaly detection: Natural with likelihood-based scoring

✅ Model selection: Use BIC or AIC to choose number of components

What's Next?

Next lesson: Neural Networks Fundamentals – from biological inspiration to backpropagation!