Expectation Maximization Algorithm

An introduction to the expectation-maximization (EM) algorithm for fitting a mixture of parametric probabilistic models to data

Authors
  • Dr. Dmitrij Sitenko, Mathematical Image Processing at Ruprecht Karls University Heidelberg

Expectation Maximization (EM)

A widely applied parametric counterpart to soft-$k$-means clustering for acquiring prototypes on a measurable space $\mathcal{X}$ is based on the assumption of i.i.d. random samples $x_i \in \mathcal{X}_n$ and on the following parametric family of mixture distributions:

$$p(x,\Gamma) = \sum\limits_{j\in [k]} \pi_j\, p(x,\theta_j), \qquad \Gamma = (\theta,\pi),$$

with unknown parameters:

$$\Gamma = (\theta,\pi), \qquad \theta = (\theta_1,\dots,\theta_k), \qquad \pi = (\pi_1,\dots,\pi_k)^T \in \Delta_k.$$

Here, the different parameters $\theta_j \in \Theta$, $j \in [k]$, parameterize the mixture distribution on $\mathcal{X}$ that partitions the set $\mathcal{X}_n$ into $k$ different clusters with proportions $\pi \in \Delta_k$, where each $x_i$ has the density $p(x_i,\theta_j)$.

Starting from this statistical perspective, where an approximation $\hat{\Gamma} = (\hat{\theta},\hat{\pi})$ to the model parameters $\Gamma$ is given, clustering amounts to fitting the mixture distribution $p(x, \Gamma)$ by estimating the true model parameters $\Gamma$ via maximization of the log-likelihood:

$$L(\theta) = \sum\limits_{i \in [n]}\log \Big(\sum\limits_{j \in [k]} \pi_j\, p(x_i,\theta_j) \Big).$$
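To make this objective concrete, the following sketch evaluates $L(\theta)$ for a mixture with Gaussian components using NumPy/SciPy. The Gaussian choice, the function name `mixture_log_likelihood`, and the array layout are assumptions made for this illustration, not part of the derivation above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, mus, covs):
    """Log-likelihood L(theta) of a Gaussian mixture, summed over all samples x_i.

    X    : (n, d) data matrix
    pis  : (k,)   mixture weights on the simplex
    mus  : (k, d) component means
    covs : (k, d, d) component covariances
    """
    k = len(pis)
    # densities[i, j] = p(x_i, theta_j)
    densities = np.column_stack(
        [multivariate_normal.pdf(X, mean=mus[j], cov=covs[j]) for j in range(k)]
    )
    # L(theta) = sum_i log( sum_j pi_j p(x_i, theta_j) )
    return np.sum(np.log(densities @ pis))
```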

Making the further assumption that each $x_i$ is generated by exactly one of the $k$ component distributions and augmenting the data set:

$$(\mathcal{X}_n,\mathcal{Y}_n) = (x_1,\dots,x_n,y_1,\dots,y_n), \qquad x_i \in \mathcal{X}_n, \quad y_i \in [k],$$

where $y_i$ corresponds to the class assignment of point $x_i$ to the associated distribution $p(x_i,\theta_{y_i})$, optimization is performed by instead maximizing the following lower bound to $L(\theta)$:

$$\sum_{j \in [k]}\sum_{i \in [n]} p(j|x_i,\hat{\Gamma})\,\log\left(\frac{p(x_i,j,\Gamma)}{p(j|x_i,\hat{\Gamma})}\right).$$
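To see why this expression bounds $L(\theta)$ from below, write the mixture density as $p(x_i,\Gamma) = \sum_{j \in [k]} p(x_i,j,\Gamma)$ and apply Jensen's inequality to the concave logarithm with weights $p(j|x_i,\hat{\Gamma})$:

$$L(\theta) = \sum_{i \in [n]}\log\Big(\sum_{j \in [k]} p(j|x_i,\hat{\Gamma})\,\frac{p(x_i,j,\Gamma)}{p(j|x_i,\hat{\Gamma})}\Big) \;\geq\; \sum_{j \in [k]}\sum_{i \in [n]} p(j|x_i,\hat{\Gamma})\,\log\Big(\frac{p(x_i,j,\Gamma)}{p(j|x_i,\hat{\Gamma})}\Big).$$

Equality holds at $\Gamma = \hat{\Gamma}$, where the ratio inside the logarithm reduces to $p(x_i,\hat{\Gamma})$ and no longer depends on $j$, so maximizing the bound over $\Gamma$ can only increase the likelihood.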

Maximization of this lower bound amounts to performing the so-called EM iterates (expectation-maximization), see [Bishop, 2007], which, starting from the initialization $\Gamma^{(0)} = \hat{\Gamma}$, produce a sequence of parameter estimates

$$\Gamma^{(t)} = (\theta^{(t)},\pi^{(t)}).$$

EM Algorithm Updates:

  1. Expectation Step:

    $$p(j|x_i,\Gamma^{(t)}) = \frac{\pi_j^{(t)}\, p(x_i,\theta_j^{(t)})}{\sum_{l \in [k]}\pi_l^{(t)}\, p(x_i,\theta_l^{(t)})}$$

  2. Maximization Step for $\pi$ and $\theta$:

    $$\pi_j^{(t+1)} = \frac{1}{n}\sum\limits_{i \in [n]} p(j|x_i, \Gamma^{(t)}),$$

    $$\theta_j^{(t+1)} = \arg\max\limits_{\theta_j}\sum\limits_{i \in [n]} p(j|x_i,\Gamma^{(t)})\,\log p(x_i,\theta_j).$$

The procedure consists of two main steps: $(i)$ expectation over the conditional distributions $p(j|x_i,\hat{\Gamma})$, yielding a lower bound for the objective $L(\theta)$, and $(ii)$ maximization over the parameters $\theta$.
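As an illustration of these iterates, here is a minimal sketch for the common case of Gaussian components $p(x,\theta_j) = \mathcal{N}(x;\mu_j,\Sigma_j)$, written with NumPy/SciPy. The function name `em_gaussian_mixture`, the initialization scheme, and the fixed iteration count are assumptions made for this example; the M-step uses the standard closed-form weighted mean and covariance updates that solve the $\arg\max$ over $\theta_j$ for Gaussians.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, n_iter=100, seed=0):
    """EM iterates Gamma^(t) = (theta^(t), pi^(t)) for a mixture of k Gaussians."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # initialization Gamma^(0): random means, identity covariances, uniform weights
    mus = X[rng.choice(n, size=k, replace=False)]
    covs = np.stack([np.eye(d) for _ in range(k)])
    pis = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # ---- E-step: responsibilities p(j | x_i, Gamma^(t)) ----
        dens = np.column_stack(
            [pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=covs[j]) for j in range(k)]
        )                                    # entries pi_j * p(x_i, theta_j), shape (n, k)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # ---- M-step: pi^(t+1) and weighted-MLE update for theta_j = (mu_j, Sigma_j) ----
        Nj = resp.sum(axis=0)                # effective cluster sizes
        pis = Nj / n                         # pi_j^(t+1)
        mus = (resp.T @ X) / Nj[:, None]     # weighted means
        for j in range(k):                   # weighted covariances (small ridge for stability)
            Xc = X - mus[j]
            covs[j] = (resp[:, j, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(d)

    return pis, mus, covs, resp
```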

Exponential Family Distribution Case:

Consider now the case where the components $p(x,\theta_j)$ belong to an exponential family of distributions, represented in terms of a Bregman divergence:

$$p(x,\Gamma) = \sum_{j \in [k]} \pi_j \exp\big(-D_{f}(F(x),\nu_j)\big)\, b_f(x),$$

where parameters are expressed via conjugation through:

$$\Gamma = (\nu,\pi), \qquad \nu = (\nu_1,\dots,\nu_k) \quad \text{with} \quad \nu_j = \psi(\theta_j), \quad j \in [k].$$

The EM updates simplify to:

  1. Expectation Step with Bregman Divergence:

    $$p(j|x_i,\Gamma^{(t)}) = \frac{\pi_j^{(t)}\exp\big(-D_f(F(x_i),\psi(\theta_j^{(t)}))\big)}{\sum_{l \in [k]}\pi_l^{(t)}\exp\big(-D_f(F(x_i),\psi(\theta_l^{(t)}))\big)}$$

  2. Maximization Step:

    $$\pi_j^{(t+1)} = \frac{1}{n}\sum\limits_{i \in [n]} p(j|x_i, \Gamma^{(t)}),$$

    $$\nu^{(t)}_{ij} = \frac{p(j|x_i,\Gamma^{(t)})}{\sum_{s \in [n]} p(j|x_s,\Gamma^{(t)})},$$

    $$\mu_j^{(t+1)} = \arg\min\limits_{\mu_j}\sum_{i \in [n]}\nu^{(t)}_{ij}\, D_f(F(x_i),\mu_j),$$

where the final step admits a closed-form solution similar to mean-shift updates:

$$\mu_j^{(t+1)} = \sum_{i \in [n]} \nu_{ij}^{(t)}\, F(x_i).$$

For detailed treatment of EM iterates in connection with Bregman divergences, see [Banerjee, 2005].
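For intuition, the sketch below instantiates these simplified updates for the squared Euclidean divergence $D_f(x,\mu) = \tfrac{1}{2}\lVert x-\mu\rVert^2$ with $F(x) = x$, which corresponds to spherical Gaussian components. The function name `bregman_soft_clustering`, the divergence choice, and the initialization are assumptions made for this example rather than details taken from [Banerjee, 2005].

```python
import numpy as np

def bregman_soft_clustering(X, k, n_iter=100, seed=0):
    """EM with a Bregman divergence, here D_f(x, mu) = 0.5 * ||x - mu||^2 and F(x) = x."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, size=k, replace=False)]    # initial prototypes mu_j
    pis = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: p(j | x_i, Gamma^(t)) proportional to pi_j * exp(-D_f(F(x_i), mu_j))
        D = 0.5 * ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        logits = np.log(pis)[None, :] - D
        logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
        post = np.exp(logits)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: pi_j^(t+1), normalized weights nu_ij^(t), closed-form prototype update
        pis = post.mean(axis=0)
        nu = post / post.sum(axis=0, keepdims=True)   # nu_ij = p(j|x_i) / sum_s p(j|x_s)
        mus = nu.T @ X                                # mu_j^(t+1) = sum_i nu_ij F(x_i)

    return pis, mus, post
```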

Let us consider a simple example of applying the EM algorithm to fit a mixture of three Gaussians. Given a data set $\mathcal{D}$ generated from $p(x,\mu_i,\Sigma_i)$ with

$$\mu_1 = \begin{pmatrix} 1.9 \\ 2.1 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 6.8 \\ 7.1 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} 1.2 \\ 6.9 \end{pmatrix},$$

$$\Sigma_1 = \begin{pmatrix} 0.95 & 0.45 \\ 0.45 & 1.05 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 1.1 & -0.25 \\ -0.25 & 1.0 \end{pmatrix}, \quad \Sigma_3 = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.1 \end{pmatrix},$$

the goal is to recover the three component parameters from the samples alone.

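A minimal way to reproduce this setup is sketched below: it draws samples from the three Gaussians above with NumPy and fits a three-component mixture with scikit-learn's GaussianMixture, which fits the model by the EM iterates described above. The sample size of 400 points per component (i.e. equal mixture weights) and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# true parameters of the three Gaussian components (from the equations above)
mus = [np.array([1.9, 2.1]), np.array([6.8, 7.1]), np.array([1.2, 6.9])]
covs = [np.array([[0.95, 0.45], [0.45, 1.05]]),
        np.array([[1.1, -0.25], [-0.25, 1.0]]),
        np.array([[1.0, 0.0], [0.0, 1.1]])]

# assumption: 400 samples per component, i.e. equal mixture weights pi_j = 1/3
X = np.vstack([rng.multivariate_normal(m, S, size=400) for m, S in zip(mus, covs)])

# fit a 3-component Gaussian mixture via the EM algorithm
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print("estimated weights:\n", gmm.weights_)
print("estimated means:\n", gmm.means_)
print("estimated covariances:\n", gmm.covariances_)
```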