Graphical Models and Exponential Families


Authors
  • Dr. Dmitrij Sitenko, Mathematical Image Processing at Ruprecht Karls University Heidelberg

Exponential families provide an elegant mathematical framework for analyzing a wide range of graphical models, particularly through the lens of convex optimization. These models are characterized by their ability to represent probability distributions in exponential form. Specifically, an exponential family distribution can be written as:

$$p_{\theta}(x_{1},x_{2},\dots,x_{n})=\exp\big\{\langle\theta,\phi(x)\rangle - A(\theta)\big\}, \qquad A(\theta)=\log\int_{\mathcal{X}^{m}}\exp\langle\theta,\phi(x)\rangle\,\nu(dx). \quad (1)$$
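As a concrete instance of $(1)$, the Bernoulli distribution can be written in exponential form with sufficient statistic $\phi(x) = x$ and log partition function $A(\theta) = \log(1 + e^{\theta})$. A minimal sketch (the parameter value is a hypothetical choice for illustration):

```python
import math

def log_partition(theta):
    # A(theta) = log(1 + e^theta) for the Bernoulli family with phi(x) = x
    return math.log(1.0 + math.exp(theta))

def bernoulli_exp_family(x, theta):
    # p_theta(x) = exp(theta * x - A(theta)), x in {0, 1}
    return math.exp(theta * x - log_partition(theta))

theta = 0.7  # hypothetical natural parameter
# A(theta) normalizes the density: the two probabilities sum to one.
total = bernoulli_exp_family(0, theta) + bernoulli_exp_family(1, theta)
# Mean parameter E[phi(x)] = p_theta(1), the sigmoid of theta.
mean = bernoulli_exp_family(1, theta)
```

A well-known consequence of this form is that $\nabla A(\theta) = \mathbb{E}_{p_\theta}[\phi(x)]$, which here reduces to the sigmoid function.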

Here $A(\theta)$ is called the log partition function. Given a dataset $\mathcal{D} = \{ x_i \}_{i = 1}^N$, the goal in many cases is to estimate a parameter $\theta \in \Theta \subseteq \mathbb{R}^{p}$. Instead of performing the parameter inference via the maximum likelihood objective

$$\theta^{\text{MLE}} = \arg\max_{\theta \in \Theta} p(\mathcal{D} \mid \theta),$$

we proceed differently and search for a parameter $\theta$ by matching expectations under the distribution $p_{\theta}$ to empirical moments of the data:

$$\mathbb{E}_{p_{\theta}}[\phi_{\alpha}(x)] = \widehat{\mu}_{\alpha}, \qquad \widehat{\mu}_{\alpha}:=\frac{1}{n}\sum_{i=1}^{n}\phi_{\alpha}(X^{i}),\quad\text{for all }\alpha.$$
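The empirical side of this equation is straightforward to compute. A minimal sketch, using a hypothetical dataset of Gaussian draws and the two statistics discussed below ($\phi_1(x) = x$ and $\phi_2(x) = x^2$):

```python
import random

random.seed(0)
# Hypothetical dataset: n scalar samples (here drawn from N(1.5, 0.5^2)).
data = [random.gauss(1.5, 0.5) for _ in range(100_000)]
n = len(data)

# Empirical moments mu_hat_alpha = (1/n) * sum_i phi_alpha(X^i)
mu_hat_mean = sum(data) / n                    # phi_1(x) = x
mu_hat_second = sum(x * x for x in data) / n   # phi_2(x) = x^2

# Moment matching would now seek theta with E_{p_theta}[phi] equal to these values.
```

For this data the empirical moments concentrate around $\mu = 1.5$ and $\mu^2 + \sigma^2 = 2.5$ as $n$ grows.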

Examples of such moments are the empirical mean, $\phi_{\alpha}(x) = x_{\alpha}$, and the second moment, $\phi_{\alpha}(x) = x_{\alpha}^2$. The key observation is that the inference task within a parameter space $\Theta$ is underdetermined, in the sense that a vast number of parameters $\theta$ satisfy this property. This raises the question of how to choose the best parameter $\theta^{\ast}$ among them. From the statistical perspective we could pick the parameter $\theta^{\ast}$ that parametrizes the least informative distribution, i.e. one with no bias towards any assumption beyond the observed data (no priors are known). In mathematical terms we seek the $\theta^{\ast}$ whose distribution has the highest entropy (maximal uncertainty):

$$\theta^{\text{Ent}} = \arg\max_{\theta \in \Theta} H(p_{\theta}), \qquad H(p):=-\int_{\mathcal{X}} p(x)\log p(x)\,\nu(dx).$$

It turns out that every maximum entropy distribution subject to such moment constraints can be represented in the exponential form $(1)$ for some $\theta$ belonging to the convex set

$$\Theta:=\{\theta\in\mathbb{R}^{d}\mid A(\theta)< +\infty\}. \quad (4)$$
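The maximum entropy principle can be sanity-checked numerically on a tiny example. A sketch on $\mathcal{X} = \{0,1,2\}$ with the single statistic $\phi(x) = x$ and constraint $\mathbb{E}[x] = 1$: the exponential family member matching the constraint is the uniform distribution ($\theta = 0$), and any other constraint-satisfying distribution (the alternative below is a hypothetical choice) has strictly lower entropy.

```python
import math

def entropy(p):
    # Shannon entropy H(p) = -sum p(x) log p(x), counting measure on {0,1,2}
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p_exp = [1/3, 1/3, 1/3]   # theta = 0 member of exp(theta*x - A(theta)); mean = 1
q = [0.3, 0.4, 0.3]       # hypothetical alternative distribution, also mean = 1

mean_q = sum(x * q[x] for x in range(3))
h_exp, h_q = entropy(p_exp), entropy(q)
```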

Depending on the properties of the set $(4)$ we distinguish the following classes of exponential families:

  • Regular representation: $\Theta$ is assumed to be an open set.
  • Minimal representation: There is no nonzero vector $a \in \mathbb{R}^d$ such that the linear combination
$$\left\langle a,\phi(x)\right\rangle=\sum_{\alpha\in\mathcal{C}}a_{\alpha}\phi_{\alpha}(x) \qquad (5)$$

is constant. In other words, we want a unique parametrization of the underlying family.

  • Overcomplete representation: In contrast to the minimal representation we want the opposite: a representation for which there exists a nonzero $a \in \mathbb{R}^d$ such that $\langle a,\phi(x)\rangle$ is a constant function. An important subclass is characterized by indicator sufficient statistics $\phi_{\alpha}(x_{\alpha}) = \delta_{\alpha}(x_{\alpha})$, for which the moments coincide with the marginals of the distribution $p_{\theta}$.
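The overcomplete indicator case can be made concrete for a single discrete variable with $K$ states. A minimal sketch (the parameter values are hypothetical): the statistics $\phi_k(x) = \delta_k(x)$ satisfy $\sum_k \phi_k(x) = 1$ for every $x$, so $a = (1,\dots,1)$ makes $\langle a, \phi(x)\rangle$ constant, and the mean parameters are exactly the marginal probabilities.

```python
import math

K = 3
theta = [0.2, -0.4, 1.1]  # hypothetical canonical parameters, one per state

# p_theta(x) = exp(theta_x) / Z with Z = sum_x exp(theta_x)
Z = sum(math.exp(t) for t in theta)
p = [math.exp(t) / Z for t in theta]

# E[phi_k] = sum_x p(x) * 1[x = k] = p(k): moments equal marginals.
moments = [sum(p[x] * (1 if x == k else 0) for x in range(K)) for k in range(K)]
```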

Graphical Models as Exponential Families

We now illustrate the deeper connection between graphical models and exponential families through several examples.

Ising Model: Given a graph $G = (V,E)$, each node $v \in V$ carries a random variable $x_v$ with an underlying Bernoulli distribution $p_{\tau}(x_v = x) = \tau_v^{x}(1-\tau_v)^{1-x}$, and each edge $e_{vw} \in E$ encodes the interaction of two neighbouring variables on the graph $G$. The corresponding graphical model can be represented in terms of a factor graph with compatibility functions $\psi_v(x_v)$ and $\psi_{vw}(x_v,x_w)$:

$$p(x_1,\dots,x_n) = \frac{1}{Z} \prod_{v \in V}\psi_v(x_v)\prod_{(v,w)\in E}\psi_{vw}(x_v,x_w).$$

In order to arrive at the exponential representation, the identity $\exp(\log t)=t$ yields

$$p(x_1,\dots,x_n) = \exp\Big( \sum_{v \in V} \log \psi_v(x_v) + \sum_{(v,w) \in E}\log\psi_{vw}(x_v,x_w) -\log Z\Big),$$

where the Ising model is obtained by choosing pairwise interactions $\psi_{vw}(x_v,x_w) = \exp(\theta_{vw}x_v x_w)$ and $\psi_v(x_v) = \tau_v^{x_v}(1-\tau_v)^{1-x_v}$, equivalently $\theta_v(x_v) = x_v \log\frac{\tau_v}{1-\tau_v} + \log(1-\tau_v)$.

Since the state space is finite, $A(\theta) < \infty$ for every $\theta \in \mathbb{R}^d$, so $\Theta = \mathbb{R}^d$ is open and the family is regular. Moreover, the identity

$$\sum_{v \in V}\theta_v x_v+\sum_{(v,w)\in E}\theta_{vw}x_v x_w = c \qquad \text{for all } x_v, x_w \in \{0,1\}$$

can only hold if all $\theta_v, \theta_{vw}$ (and $c$) are zero, so the Ising model is also a minimal exponential family.
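For a small graph the Ising model can be written out explicitly by enumerating all states. A minimal brute-force sketch on a hypothetical 3-node chain with illustrative parameters, verifying that $p_\theta(x) = \exp(\langle\theta,\phi(x)\rangle - A(\theta))$ is a normalized distribution:

```python
import itertools
import math

# Hypothetical 3-node chain with illustrative parameters theta_v, theta_vw.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
theta_v = {0: 0.5, 1: -0.3, 2: 0.1}
theta_vw = {(0, 1): 0.8, (1, 2): -0.2}

def inner(x):
    # <theta, phi(x)> = sum_v theta_v x_v + sum_{(v,w)} theta_vw x_v x_w
    return (sum(theta_v[v] * x[v] for v in V)
            + sum(theta_vw[(v, w)] * x[v] * x[w] for (v, w) in E))

# Log partition function A(theta) = log sum_x exp <theta, phi(x)>
A = math.log(sum(math.exp(inner(x)) for x in itertools.product((0, 1), repeat=3)))

def p(x):
    # Exponential-family form p_theta(x) = exp(<theta, phi(x)> - A(theta))
    return math.exp(inner(x) - A)

total = sum(p(x) for x in itertools.product((0, 1), repeat=3))
```

Enumeration is only feasible for tiny graphs ($2^{|V|}$ states); this is precisely why variational methods built on the exponential-family view are of interest.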

  • Todo: Potts models, metric labeling
  • Gaussian Markov Random Fields are a special type of undirected graphical model in which the nodes represent random variables that follow a multivariate normal (Gaussian) distribution, and the conditional independence structure is encoded by a graph.

The data $x_i \in \mathcal{D}$ are generated by the multivariate normal distribution with parameters $\mu \in \mathbb{R}^m, \Sigma \in \mathbb{R}^{m \times m}$:

$$p(\mathbf{x})=\frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right), \qquad x_i \sim \mathcal{N}(\mu,\Sigma),$$

where $\Sigma$ is a dense covariance matrix. The corresponding exponential family is obtained as

$$p_{\theta}(x)=\exp \left\{ \langle \theta,x \rangle+\frac{1}{2} \operatorname{tr}(\Theta\, xx^{T})-A(\theta,\Theta)\right\},$$

with sufficient statistics given by $x_i, x_i^2, x_ix_j$, where by the Hammersley-Clifford theorem the matrix $\Theta$ is the negative of the precision matrix $P = \Sigma^{-1}$, whose sparsity pattern reflects the graph structure.
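The identification $\theta = P\mu$, $\Theta = -P$ can be checked numerically: up to an $x$-independent constant, the Gaussian exponent $-\frac{1}{2}(x-\mu)^T P (x-\mu)$ equals $\langle\theta,x\rangle + \frac{1}{2}\operatorname{tr}(\Theta\, xx^T)$. A minimal pure-Python sketch on a 2-dimensional example with hypothetical values:

```python
# Hypothetical 2x2 precision matrix P = Sigma^{-1}; its zero pattern would
# encode the missing edges of the graph in a larger example.
P = [[2.0, -0.5], [-0.5, 1.0]]
mu = [1.0, -2.0]

def quad(M, a, b):
    # a^T M b for 2x2 M
    return sum(a[i] * M[i][j] * b[j] for i in range(2) for j in range(2))

theta = [sum(P[i][j] * mu[j] for j in range(2)) for i in range(2)]  # theta = P mu

x = [0.3, 0.7]  # arbitrary test point
# Gaussian exponent: -1/2 (x - mu)^T P (x - mu)
d = [x[i] - mu[i] for i in range(2)]
lhs = -0.5 * quad(P, d, d)
# Exponential-family exponent <theta, x> - 1/2 x^T P x, shifted by the
# x-independent constant -1/2 mu^T P mu
rhs = sum(theta[i] * x[i] for i in range(2)) - 0.5 * quad(P, x, x) - 0.5 * quad(P, mu, mu)
```

The two expressions agree because $P$ is symmetric, so the cross term $x^T P \mu$ appears identically on both sides.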