How EM Finds Hidden Structure

Gaussian Mixture Model — each cluster is a Gaussian component
Components K =
μ1μ2π1=0.54π2=0.46

Each ellipse shows the 1.5σ contour. Points are colored by their most likely component (hard assignment shown, EM uses soft).

There's a class of problems in statistics where the obstacle isn't computing the answer — it's that you don't quite have all the information you'd need to compute it. The data you can see depends on something you can't see. What do you do?

This situation arises constantly. Gene expression measurements come from a mixture of cell types, but you don't know which gene belongs to which cell type. Audio recordings contain overlapping speech from multiple speakers, but you don't know who said what. Survey responses reflect different population subgroups, but you don't know which respondent belongs to which group.

The Expectation-Maximization algorithm — EM — is the standard answer to this class of problems. And the Gaussian Mixture Model is the clearest, most pedagogically useful example of it. Understanding GMM + EM gives you a template that applies across an enormous range of latent-variable models.

The setup

Suppose your data is a cloud of points in 2D, and you suspect it was generated by some number of overlapping Gaussian distributions — clusters with different centers, sizes, and orientations. If you knew which cluster each point came from, fitting would be trivial: separate the points by cluster, fit a Gaussian to each group using maximum likelihood, done.

But you don't know which cluster each point came from. That's the whole problem.

The Gaussian Mixture Model makes this precise. We posit KK components. Component kk has:

  • Weight πk\pi_k: the prior probability that a randomly drawn point comes from this component. (These must sum to 1.)
  • Mean μk\boldsymbol{\mu}_k: the center of the Gaussian.
  • Covariance Σk\Sigma_k: the shape and orientation.

The probability density of a data point x\mathbf{x} under this model is:

p(x)=k=1KπkN(xμk,Σk)p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)

The problem is to estimate all these parameters — all the πk\pi_k, μk\boldsymbol{\mu}_k, Σk\Sigma_k — from data x1,,xN\mathbf{x}_1, \ldots, \mathbf{x}_N.

The natural approach is maximum likelihood: find the parameters that maximize the log-likelihood

(θ)=n=1Nlogk=1KπkN(xnμk,Σk)\ell(\theta) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k \,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\Sigma_k)

The log of a sum has no closed-form solution. You can't just take a derivative and set it to zero. This is where EM comes in.

The key insight

EM's insight is to introduce the missing information explicitly as a latent variable and work with it.

For each data point xn\mathbf{x}_n, introduce an unobserved variable zn{1,,K}z_n \in \{1, \ldots, K\} indicating which component generated it. Now we have a complete data model:

p(xn,zn=k)=πkN(xnμk,Σk)p(\mathbf{x}_n, z_n=k) = \pi_k \,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\Sigma_k)

The joint probability of everything, written out:

p(X,Zθ)=n=1Np(xn,znθ)p(\mathbf{X}, \mathbf{Z}|\theta) = \prod_{n=1}^N p(\mathbf{x}_n, z_n|\theta)

The complete-data log-likelihood is:

logp(X,Zθ)=nk1[zn=k][logπk+logN(xnμk,Σk)]\log p(\mathbf{X}, \mathbf{Z}|\theta) = \sum_n \sum_k \mathbf{1}[z_n=k] \left[\log\pi_k + \log\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\right]

This is much easier to maximize than (θ)\ell(\theta): it decomposes into separate terms for each component, each of which is a standard Gaussian MLE problem.

The catch: we don't know Z\mathbf{Z}, so we can't directly maximize this. But we can take its expectation given the data and current parameter estimates.

The two steps

EM alternates between two operations:

E-step (Expectation).

With current parameters θold\theta^{\text{old}} fixed, compute the expected complete-data log-likelihood:

Q(θθold)=EZX,θold[logp(X,Zθ)]Q(\theta|\theta^{\text{old}}) = E_{\mathbf{Z}|\mathbf{X},\theta^{\text{old}}}\left[\log p(\mathbf{X},\mathbf{Z}|\theta)\right]

The key quantity is the responsibility — the posterior probability that component kk generated point nn:

γnk=p(zn=kxn,θold)=πkoldN(xnμkold,Σkold)jπjoldN(xnμjold,Σjold)\gamma_{nk} = p(z_n=k|\mathbf{x}_n, \theta^{\text{old}}) = \frac{\pi_k^{\text{old}}\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k^{\text{old}},\Sigma_k^{\text{old}})}{\sum_{j}\pi_j^{\text{old}}\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j^{\text{old}},\Sigma_j^{\text{old}})}

This is Bayes' theorem: prior (πk\pi_k) times likelihood (N\mathcal{N}) divided by evidence. The γnk\gamma_{nk} replace the unknown 1[zn=k]\mathbf{1}[z_n=k] in the complete-data log-likelihood.

For each data point, the responsibilities form a probability vector summing to 1. A point right between two clusters might get (0.5,0.5)(0.5, 0.5); a point deep inside one cluster might get (0.98,0.02)(0.98, 0.02).

M-step (Maximization).

With responsibilities γnk\gamma_{nk} fixed, maximize Q(θθold)Q(\theta|\theta^{\text{old}}) over θ\theta. Because QQ decomposes over components, each component's parameters can be updated independently. Define the effective count Nk=nγnkN_k = \sum_n \gamma_{nk}:

π^k=NkN\hat{\pi}_k = \frac{N_k}{N}

μ^k=1Nknγnkxn\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k}\sum_n \gamma_{nk}\mathbf{x}_n

Σ^k=1Nknγnk(xnμ^k)(xnμ^k)T\hat{\Sigma}_k = \frac{1}{N_k}\sum_n \gamma_{nk}(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_k)^T

Each formula is the responsibility-weighted version of the ordinary MLE formula. The weight of point nn in the update for component kk is exactly how responsible component kk is for that point.

These updates are intuitive. The new mixing weight πk\pi_k is the fraction of the data (by responsibility) that belongs to component kk. The new mean μk\boldsymbol{\mu}_k is the responsibility-weighted centroid. The new covariance is the responsibility-weighted scatter matrix.

Why it works

EM is guaranteed to converge — specifically, to never decrease the observed log-likelihood (θ)\ell(\theta).

The proof goes via Jensen's inequality. For any distribution qq over latent variables, we can write:

(θ)=logp(Xθ)Zq(Z)logp(X,Zθ)q(Z)L(q,θ)=:ELBO\ell(\theta) = \log p(\mathbf{X}|\theta) \geq \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z})\log\frac{p(\mathbf{X},\mathbf{Z}|\theta)}{q(\mathbf{Z})}}_{\mathcal{L}(q,\theta)} =: \text{ELBO}

The ELBO (Evidence Lower BOund) is a lower bound on the likelihood. EM can be understood as coordinate ascent on the ELBO:

  • E-step: maximize L\mathcal{L} over qq with θ\theta fixed. The optimal qq is the posterior p(ZX,θ)p(\mathbf{Z}|\mathbf{X},\theta), which makes the bound tight: L=(θ)\mathcal{L} = \ell(\theta).
  • M-step: maximize L\mathcal{L} over θ\theta with qq fixed. Since qq is fixed and the bound just got tighter at the current θ\theta, increasing L\mathcal{L} necessarily increases \ell.

The beautiful consequence: you can never go downhill. Each EM step either increases the log-likelihood or keeps it the same. Unlike gradient descent, there's no learning rate to tune, no oscillation, no overshooting.

Local maxima

EM's guarantee is only that it converges to a local maximum, not a global one. The GMM likelihood surface has many local maxima. Two strategies help: (1) run EM from multiple random initializations and take the best result, or (2) initialize means with k-means (which is fast and usually gives good starting points).

An additional degeneracy: a component with σk0\sigma_k \to 0 that collapses onto a single data point produces a likelihood that diverges to ++\infty — a spurious maximum. Fix this by adding a small regularization to covariance matrices: ΣkΣk+ϵI\Sigma_k \leftarrow \Sigma_k + \epsilon I.

Choosing K

A critical question: how many components? More components always fit better — a GMM with K=NK = N can put one component on each data point and achieve arbitrarily high likelihood. This is overfitting.

Standard remedies:

BIC (Bayesian Information Criterion): BIC=2+plnN\text{BIC} = -2\ell + p\ln N, where pp is the number of free parameters. For a full-covariance GMM with KK components in dd dimensions: p=K(d+d(d+1)2)+(K1)p = K\left(d + \frac{d(d+1)}{2}\right) + (K - 1). Each component contributes a mean vector and a symmetric covariance matrix; the final (K1)(K - 1) counts the independent mixing weights. BIC penalizes complexity and is asymptotically consistent (selects the true KK as NN \to \infty if the true model is a GMM).

AIC (Akaike Information Criterion): AIC=2+2p\text{AIC} = -2\ell + 2p. Weaker penalty than BIC — tends to prefer slightly more complex models.

Bayesian GMM: place a Dirichlet process prior over the mixing weights. The model can in principle use infinitely many components, but regularization prunes unused ones (those with tiny πk\pi_k). The "right" number of components emerges from inference.

Fit a GMM for K=1,2,,KmaxK = 1, 2, \ldots, K_{\max}, plot BIC vs. KK, and look for the elbow or minimum. In practice, domain knowledge often constrains the reasonable range.

The reach of EM

The GMM is where most people encounter EM for the first time, but the algorithm's reach is far wider. Anywhere you have a latent-variable model where direct likelihood maximization is intractable, EM applies.

Hidden Markov Models (HMMs): sequences of observations, each generated by a hidden state that follows a Markov chain. The EM algorithm for HMMs is called Baum-Welch. The E-step runs the forward-backward algorithm to compute state posteriors; the M-step updates transition and emission probabilities.

Factor Analysis: a linear latent-variable model where observations are linear functions of low-dimensional latent factors plus noise. EM fits the factor loadings and noise variances.

Mixture of Experts: the routing network in a mixture-of-experts model can be trained with EM, treating expert assignments as latent variables.

EM for missing data: even without a mixture model, when data has missing values at random (MAR), EM can impute the missing entries in the E-step and update model parameters in the M-step.

In each case, the structure is the same: unobserved quantities are introduced, posteriors are computed in the E-step, and parameters are updated in the M-step. The core idea — iterate between inferring the latent structure and updating the parameters — applies across all of them.

A summary of the GMM parameters

After running EM to convergence, you have:

ParameterMeaning
πk\pi_kFraction of data from component kk
μk\boldsymbol{\mu}_kCenter of cluster kk
Σk\Sigma_kShape and orientation of cluster kk
γnk\gamma_{nk}Probability point nn came from component kk
p(x)p(\mathbf{x})Full density estimate at any point x\mathbf{x}

These are the objects you use downstream: hard cluster assignments (argmaxkγnk\arg\max_k \gamma_{nk}), soft assignments for probabilistic inference, density estimates for anomaly detection, or samples from the generative model for data augmentation.

The Gaussian Mixture Model is simultaneously a clustering algorithm, a density estimator, and a generative model. EM is what ties these together — turning an intractable optimization into a convergent sequence of tractable steps. It is one of the most influential ideas in all of computational statistics.

See Gaussian Mixture Models for the formal definitions, exercises, and worked derivations. For the broader context of latent variable modeling, see Principal Component Analysis (a related linear latent variable model) and K-Means Clustering (the hard-assignment limit).