Graphical Models and Exponential Families


Authors
  • Dr. Dmitrij Sitenko, Mathematical Image Processing at Ruprecht Karls University Heidelberg

Exponential families provide an elegant mathematical framework for analyzing a wide range of graphical models, particularly through the lens of convex optimization. These models are characterized by their ability to represent probability distributions in exponential form. Specifically, an exponential family distribution can be written as:

$$p_{\theta}(x_{1},x_{2},\dots,x_{n})=\exp\big\{\langle\theta,\phi(x)\rangle - A(\theta)\big\}, \qquad A(\theta)=\log\int_{\mathcal{X}^{m}}\exp\langle\theta,\phi(x)\rangle\,\nu(dx). \quad (1)$$
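As a concrete instance of $(1)$, the Bernoulli distribution can be written in exponential form with sufficient statistic $\phi(x) = x$ and log partition function $A(\theta) = \log(1 + e^{\theta})$. A minimal sketch (the parameter value is a hypothetical choice for illustration):

```python
import math

def log_partition(theta):
    # A(theta) = log(1 + e^theta) for the Bernoulli family with phi(x) = x
    return math.log(1.0 + math.exp(theta))

def bernoulli_exp_family(x, theta):
    # p_theta(x) = exp(theta * x - A(theta)), x in {0, 1}
    return math.exp(theta * x - log_partition(theta))

theta = 0.7  # hypothetical natural parameter
# A(theta) normalizes the density: the two probabilities sum to one.
total = bernoulli_exp_family(0, theta) + bernoulli_exp_family(1, theta)
# Mean parameter E[phi(x)] = p_theta(1), the sigmoid of theta.
mean = bernoulli_exp_family(1, theta)
```

A well-known consequence of this form is that $\nabla A(\theta) = \mathbb{E}_{p_\theta}[\phi(x)]$, which here reduces to the sigmoid function.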

Here $A(\theta)$ is called the log partition function. Given a dataset $\mathcal{D} = \{ x_i \}_{i = 1}^N$, the goal in many cases is to estimate a parameter $\theta \in \Theta \subseteq \mathbb{R}^{p}$. Instead of performing the parameter inference via the maximum likelihood objective

$$\theta^{\text{MLE}} = \arg\max_{\theta \in \Theta} p(\mathcal{D} \mid \theta),$$

we proceed differently and search for a parameter $\theta$ by matching expectations under the distribution $p_{\theta}$ to empirical moments of the data:

$$\mathbb{E}_{p_{\theta}}[\phi_{\alpha}(x)] = \widehat{\mu}_{\alpha}, \qquad \widehat{\mu}_{\alpha}:=\frac{1}{n}\sum_{i=1}^{n}\phi_{\alpha}(X^{i}),\quad\text{for all }\alpha.$$
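The empirical side of this equation is straightforward to compute. A minimal sketch, using a hypothetical dataset of Gaussian draws and the two statistics discussed below ($\phi_1(x) = x$ and $\phi_2(x) = x^2$):

```python
import random

random.seed(0)
# Hypothetical dataset: n scalar samples (here drawn from N(1.5, 0.5^2)).
data = [random.gauss(1.5, 0.5) for _ in range(100_000)]
n = len(data)

# Empirical moments mu_hat_alpha = (1/n) * sum_i phi_alpha(X^i)
mu_hat_mean = sum(data) / n                    # phi_1(x) = x
mu_hat_second = sum(x * x for x in data) / n   # phi_2(x) = x^2

# Moment matching would now seek theta with E_{p_theta}[phi] equal to these values.
```

For this data the empirical moments concentrate around $\mu = 1.5$ and $\mu^2 + \sigma^2 = 2.5$ as $n$ grows.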

Examples of such moments are the empirical mean, $\phi_{\alpha}(x) = x_{\alpha}$, and the second moment, $\phi_{\alpha}(x) = x_{\alpha}^2$. The key observation is that the inference task within a parameter space $\Theta$ is underdetermined, in the sense that a vast number of parameters $\theta$ satisfy this property. This raises the question of how to choose the best parameter $\theta^{\ast}$ among them. From the statistical perspective we could pick the parameter $\theta^{\ast}$ that parametrizes the least informative distribution, i.e. one with no bias towards any assumption beyond the observed data (no priors are known). In mathematical terms we seek the $\theta^{\ast}$ whose distribution has the highest entropy (maximal uncertainty):

$$\theta^{\text{Ent}} = \arg\max_{\theta \in \Theta} H(p_{\theta}), \qquad H(p):=-\int_{\mathcal{X}} p(x)\log p(x)\,\nu(dx).$$

It turns out that every maximum entropy distribution subject to such moment constraints can be represented in the exponential form $(1)$ for some $\theta$ belonging to the convex set

$$\Theta:=\{\theta\in\mathbb{R}^{d}\mid A(\theta)< +\infty\}. \quad (4)$$
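The maximum entropy principle can be sanity-checked numerically on a tiny example. A sketch on $\mathcal{X} = \{0,1,2\}$ with the single statistic $\phi(x) = x$ and constraint $\mathbb{E}[x] = 1$: the exponential family member matching the constraint is the uniform distribution ($\theta = 0$), and any other constraint-satisfying distribution (the alternative below is a hypothetical choice) has strictly lower entropy.

```python
import math

def entropy(p):
    # Shannon entropy H(p) = -sum p(x) log p(x), counting measure on {0,1,2}
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p_exp = [1/3, 1/3, 1/3]   # theta = 0 member of exp(theta*x - A(theta)); mean = 1
q = [0.3, 0.4, 0.3]       # hypothetical alternative distribution, also mean = 1

mean_q = sum(x * q[x] for x in range(3))
h_exp, h_q = entropy(p_exp), entropy(q)
```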

Depending on the properties of the set $(4)$ we distinguish the following classes of exponential families:

  • Regular representation: $\Theta$ is assumed to be an open set.
  • Minimal representation: There is no nonzero vector $a \in \mathbb{R}^d$ such that the linear combination
$$\left\langle a,\phi(x)\right\rangle=\sum_{\alpha\in\mathcal{C}}a_{\alpha}\phi_{\alpha}(x) \qquad (5)$$

is constant. In other words, we want a unique parametrization of the underlying family.

  • Overcomplete representation: In contrast to the minimal representation we want the opposite: a representation for which there exists a nonzero $a \in \mathbb{R}^d$ such that $\langle a,\phi(x)\rangle$ is a constant function. An important subclass is characterized by indicator sufficient statistics $\phi_{\alpha}(x_{\alpha}) = \delta_{\alpha}(x_{\alpha})$, for which the moments coincide with the marginals of the distribution $p_{\theta}$.
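The overcomplete indicator case can be made concrete for a single discrete variable with $K$ states. A minimal sketch (the parameter values are hypothetical): the statistics $\phi_k(x) = \delta_k(x)$ satisfy $\sum_k \phi_k(x) = 1$ for every $x$, so $a = (1,\dots,1)$ makes $\langle a, \phi(x)\rangle$ constant, and the mean parameters are exactly the marginal probabilities.

```python
import math

K = 3
theta = [0.2, -0.4, 1.1]  # hypothetical canonical parameters, one per state

# p_theta(x) = exp(theta_x) / Z with Z = sum_x exp(theta_x)
Z = sum(math.exp(t) for t in theta)
p = [math.exp(t) / Z for t in theta]

# E[phi_k] = sum_x p(x) * 1[x = k] = p(k): moments equal marginals.
moments = [sum(p[x] * (1 if x == k else 0) for x in range(K)) for k in range(K)]
```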

Graphical Models as Exponential Families

We now illustrate the deeper connection between graphical models and exponential families through several examples.

Ising Model: Given a graph $G = (V,E)$, each node $v \in V$ carries a random variable $x_v$ with an underlying Bernoulli distribution $p_{\tau}(x_v = x) = \tau_v^{x}(1-\tau_v)^{1-x}$, and each edge $e_{vw} \in E$ encodes the interaction of two neighbouring variables on the graph $G$. The corresponding graphical model can be represented in terms of a factor graph with compatibility functions $\psi_v(x_v)$ and $\psi_{vw}(x_v,x_w)$:

$$p(x_1,\dots,x_n) = \frac{1}{Z} \prod_{v \in V}\psi_v(x_v)\prod_{(v,w)\in E}\psi_{vw}(x_v,x_w).$$

In order to arrive at the exponential representation, the identity $\exp(\log t)=t$ yields

$$p(x_1,\dots,x_n) = \exp\Big( \sum_{v \in V} \log \psi_v(x_v) + \sum_{(v,w) \in E}\log\psi_{vw}(x_v,x_w) -\log Z\Big),$$

where the Ising model is obtained by choosing pairwise interactions $\psi_{vw}(x_v,x_w) = \exp(\theta_{vw}x_v x_w)$ and $\psi_v(x_v) = \tau_v^{x_v}(1-\tau_v)^{1-x_v}$, equivalently $\theta_v(x_v) = x_v \log\frac{\tau_v}{1-\tau_v} + \log(1-\tau_v)$.

Since the state space is finite, $A(\theta) < \infty$ for every $\theta \in \mathbb{R}^d$, so $\Theta = \mathbb{R}^d$ is open and the family is regular. Moreover, the identity

$$\sum_{v \in V}\theta_v x_v+\sum_{(v,w)\in E}\theta_{vw}x_v x_w = c \qquad \text{for all } x_v, x_w \in \{0,1\}$$

can only hold if all $\theta_v, \theta_{vw}$ (and $c$) are zero, so the Ising model is also a minimal exponential family.
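For a small graph the Ising model can be written out explicitly by enumerating all states. A minimal brute-force sketch on a hypothetical 3-node chain with illustrative parameters, verifying that $p_\theta(x) = \exp(\langle\theta,\phi(x)\rangle - A(\theta))$ is a normalized distribution:

```python
import itertools
import math

# Hypothetical 3-node chain with illustrative parameters theta_v, theta_vw.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
theta_v = {0: 0.5, 1: -0.3, 2: 0.1}
theta_vw = {(0, 1): 0.8, (1, 2): -0.2}

def inner(x):
    # <theta, phi(x)> = sum_v theta_v x_v + sum_{(v,w)} theta_vw x_v x_w
    return (sum(theta_v[v] * x[v] for v in V)
            + sum(theta_vw[(v, w)] * x[v] * x[w] for (v, w) in E))

# Log partition function A(theta) = log sum_x exp <theta, phi(x)>
A = math.log(sum(math.exp(inner(x)) for x in itertools.product((0, 1), repeat=3)))

def p(x):
    # Exponential-family form p_theta(x) = exp(<theta, phi(x)> - A(theta))
    return math.exp(inner(x) - A)

total = sum(p(x) for x in itertools.product((0, 1), repeat=3))
```

Enumeration is only feasible for tiny graphs ($2^{|V|}$ states); this is precisely why variational methods built on the exponential-family view are of interest.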

  • Todo: Potts models, metric labeling
  • Gaussian Markov Random Fields are a special type of undirected graphical model in which the nodes represent random variables that follow a multivariate normal (Gaussian) distribution, and the conditional independence structure is encoded by a graph.

The data $x_i \in \mathcal{D}$ are generated by the multivariate normal distribution with parameters $\mu \in \mathbb{R}^m, \Sigma \in \mathbb{R}^{m \times m}$:

$$p(\mathbf{x})=\frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right), \qquad x_i \sim \mathcal{N}(\mu,\Sigma),$$

where $\Sigma$ is a dense covariance matrix. The corresponding exponential family is obtained as

$$p_{\theta}(x)=\exp \left\{ \langle \theta,x \rangle+\frac{1}{2} \operatorname{tr}(\Theta\, xx^{T})-A(\theta,\Theta)\right\},$$

with sufficient statistics given by $x_i, x_i^2, x_ix_j$, where by the Hammersley-Clifford theorem the matrix $\Theta$ is the negative of the precision matrix $P = \Sigma^{-1}$, whose sparsity pattern reflects the graph structure.
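The identification $\theta = P\mu$, $\Theta = -P$ can be checked numerically: up to an $x$-independent constant, the Gaussian exponent $-\frac{1}{2}(x-\mu)^T P (x-\mu)$ equals $\langle\theta,x\rangle + \frac{1}{2}\operatorname{tr}(\Theta\, xx^T)$. A minimal pure-Python sketch on a 2-dimensional example with hypothetical values:

```python
# Hypothetical 2x2 precision matrix P = Sigma^{-1}; its zero pattern would
# encode the missing edges of the graph in a larger example.
P = [[2.0, -0.5], [-0.5, 1.0]]
mu = [1.0, -2.0]

def quad(M, a, b):
    # a^T M b for 2x2 M
    return sum(a[i] * M[i][j] * b[j] for i in range(2) for j in range(2))

theta = [sum(P[i][j] * mu[j] for j in range(2)) for i in range(2)]  # theta = P mu

x = [0.3, 0.7]  # arbitrary test point
# Gaussian exponent: -1/2 (x - mu)^T P (x - mu)
d = [x[i] - mu[i] for i in range(2)]
lhs = -0.5 * quad(P, d, d)
# Exponential-family exponent <theta, x> - 1/2 x^T P x, shifted by the
# x-independent constant -1/2 mu^T P mu
rhs = sum(theta[i] * x[i] for i in range(2)) - 0.5 * quad(P, x, x) - 0.5 * quad(P, mu, mu)
```

The two expressions agree because $P$ is symmetric, so the cross term $x^T P \mu$ appears identically on both sides.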