Concepts of Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models and Factorizations

Author: Dr. Dmitrij Sitenko, Mathematical Image Processing at Ruprecht Karls University Heidelberg

Why Graphical Models?

Imagine you are given a dataset $\{(x_i, y_i)\}_{i=1}^N$ consisting of square $n \times n$ images, so that each image $x_i$ contains $n^2$ pixels. Here, $y_i$ denotes the corresponding image class, i.e., the ground-truth label of the underlying data point $x_i$, such as the digit label in the MNIST dataset:

Figure: Samples from the MNIST dataset.

The holy grail is to obtain the joint probability distribution $p(x, y)$ underlying the data-generation process. For a binary dataset of $28 \times 28$ images, this boils down to an inference task with $2^{28 \cdot 28} - 1$ free parameters, which is computationally infeasible. However, if we have access to:

  • The statistical properties of the underlying distribution, and
  • The geometry of the space where the data points reside,

we can incorporate this knowledge into our inference process. The key idea of graphical models is to factorize $p(x, y)$ over the underlying graph $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of edges. Each node $v \in V$ models a random variable $X_v$, which interacts with its neighboring nodes $w \in \mathcal{N}(v)$ via edges $e_{vw}$.
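To make the benefit of factorization concrete, here is a minimal back-of-the-envelope sketch in Python; the chain-structured model is purely an illustrative assumption, used only to compare parameter counts with the full joint table:

```python
# Free parameters of a full joint over n binary pixels vs. a factorized model.
n = 28 * 28  # number of binary pixels in a 28x28 image

# Full joint table: one probability per configuration, minus normalization.
full_joint = 2**n - 1

# Hypothetical chain factorization p(x_1) * prod_i p(x_i | x_{i-1}):
# one parameter for p(x_1), plus one parameter per parent state per factor.
chain = 1 + (n - 1) * 2

print(f"full joint : {full_joint:.3e} parameters")  # ~1.02e+236
print(f"chain model: {chain} parameters")           # 1567
```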

In general, by choosing an ordering of nodes $x_1, \dots, x_n$ on a graph, the distribution can be factorized using conditional distributions via an autoregressive factorization:

$$p(x_1,\dots,x_n) = \textcolor{red}{p(x_1)}\frac{\textcolor{green}{p(x_1,x_2)}}{\textcolor{red}{p(x_1)}}\frac{p(x_1,x_2,x_3)}{\textcolor{green}{p(x_1,x_2)}}\cdots \frac{p(x_1,\dots,x_{n})}{p(x_1,\dots,x_{n-1})} = p(x_1)\prod_{i = 2}^n p(x_i \mid x_1, \dots, x_{i-1}) \qquad \text{(1)}$$
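Equation (1) is simply the telescoping chain rule, so it can be checked numerically. Here is a quick sketch (assuming NumPy) on a random joint over three binary variables:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Random joint distribution p(x1, x2, x3) over three binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

def marginal(p, keep):
    """Sum out every axis not listed in `keep`."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    return p.sum(axis=drop)

p1 = marginal(p, (0,))     # p(x1)
p12 = marginal(p, (0, 1))  # p(x1, x2)

for x1, x2, x3 in itertools.product(range(2), repeat=3):
    # p(x1) * p(x2|x1) * p(x3|x1,x2), written as the ratios in equation (1)
    chain = p1[x1] * (p12[x1, x2] / p1[x1]) * (p[x1, x2, x3] / p12[x1, x2])
    assert np.isclose(chain, p[x1, x2, x3])
```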

Based on how the joint distribution is factorized, the following common classes of graphical models arise:

Directed Graphical Models

In Directed Graphical Models, also known as Bayesian Networks, the edges carry an orientation, meaning $e_{vw} \in E$ does not imply $e_{wv} \in E$. For each vertex $v \in V$, we define:

  • Parents of a node $v \in V$: the set $\pi(v)$ of all nodes $w$ with a directed edge $e_{wv} \in E$ pointing into $v$. More generally, $w$ is called an ancestor of $v$ if there is a directed path $w \to s \to \dots \to v$.
  • A directed graph is a Directed Acyclic Graph (DAG) if it contains no cycles (i.e., no paths of the form $v \to s \to \dots \to v$).
  • If $G = (V, E)$ is a DAG, we can specify a partial order on the node set $V$, denoting $w < v$ if $w$ is an ancestor of $v$, so that every parent precedes its children. Given an ordering consistent with this partial order, the autoregressive factorization in equation (1) simplifies so that each node is conditioned only on its parents:
$$p(x_1,\dots,x_n) = \prod_{v \in V}p(x_v \mid x_{\pi(v)}) \qquad \text{(2)}$$
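As a minimal sketch of equation (2), consider a toy DAG $a \to b$, $a \to c$; the conditional probability tables below are made-up illustrative numbers, not taken from any dataset:

```python
import itertools

# Toy DAG a -> b, a -> c over binary variables; all CPT entries are
# illustrative assumptions.
parents = {"a": (), "b": ("a",), "c": ("a",)}
cpt = {
    "a": {(): [0.6, 0.4]},                      # p(a)
    "b": {(0,): [0.7, 0.3], (1,): [0.2, 0.8]},  # p(b | a)
    "c": {(0,): [0.5, 0.5], (1,): [0.9, 0.1]},  # p(c | a)
}

def joint(assignment):
    """Equation (2): product of p(x_v | x_pi(v)) over all nodes v."""
    prob = 1.0
    for v, pa in parents.items():
        pa_vals = tuple(assignment[w] for w in pa)
        prob *= cpt[v][pa_vals][assignment[v]]
    return prob

# Sanity check: the factorized joint sums to one.
total = sum(joint(dict(zip("abc", xs)))
            for xs in itertools.product(range(2), repeat=3))
print(total)  # 1.0
```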

Undirected Graphical Models

In Undirected Graphical Models, also known as Markov Random Fields (MRFs), edges carry no orientation, i.e., $e_{vw} \in E$ if and only if $e_{wv} \in E$. The factorization in equation (2) does not hold here because there is no ancestor-like partial ordering. Instead, the graph is decomposed into cliques $C \subseteq V$, which are subsets of fully connected nodes (i.e., $e_{vw} \in E$ for all $v, w \in C$).

Each clique $C$ is associated with a compatibility function $\psi_C(x_C)$, leading to the factorization:

$$p(x_1,\dots,x_n) = \frac{1}{Z}\prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_{x}\prod_{C \in \mathcal{C}} \psi_C(x_C) \qquad \text{(3)}$$

In practice, cliques are chosen as maximal fully connected subsets for convenience.
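As a sketch of equation (3), take a three-node chain with maximal cliques $\{0,1\}$ and $\{1,2\}$ and Ising-style compatibilities; the coupling strength is an arbitrary illustrative choice, and the partition function $Z$ is computed by brute-force enumeration:

```python
import itertools
import math

# Three-node chain with maximal cliques {0, 1} and {1, 2}; Ising-style
# compatibility psi(x_v, x_w) = exp(J * x_v * x_w) with states x in {-1, +1}.
J = 0.5  # illustrative coupling strength
cliques = [(0, 1), (1, 2)]

def psi(xv, xw):
    return math.exp(J * xv * xw)

states = list(itertools.product([-1, 1], repeat=3))

# Partition function Z: sum of unnormalized clique products over all states.
Z = sum(math.prod(psi(x[v], x[w]) for v, w in cliques) for x in states)

def p(x):
    """Equation (3): normalized product of clique compatibilities."""
    return math.prod(psi(x[v], x[w]) for v, w in cliques) / Z

print(sum(p(x) for x in states))  # 1.0, as required
```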

Factor Graphs

For complex graphs with high inter-node dependencies, determining compatibility functions that factorize as in equation (3) can be challenging. To address this, Factor Graphs augment the node set $V$ with factor nodes $V_a$, which represent cliques of the graph. The edge set is likewise augmented with edges $e_{va}$ whenever $v \in C_a$.

The resulting graph is bipartite (i.e., the node set is separable into two disjoint sets), yielding the factorization:

$$p(x_1,\dots,x_n) = \frac{1}{Z}\prod_{i \in V}\psi_i(x_i)\prod_{a \in V_a}\psi_a(x_a) \qquad \text{(4)}$$

where $x_a = (x_v)_{v \in C_a}$ collects the variables attached to factor node $a$, and $Z$ normalizes the product as in equation (3).

For example, if the graph is a tree, the compatibility functions on the factor nodes $x_a$ and variable nodes $x_i$ are given in terms of the marginals (for more details, refer to the blog post).
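The bipartite structure is straightforward to represent explicitly. A minimal sketch for the chain from the previous example (the factor names are illustrative) lists the augmented edges $e_{va}$:

```python
# Variable nodes 0, 1, 2 and factor nodes "a0" (clique {0, 1}) and
# "a1" (clique {1, 2}); the names are illustrative.
variables = [0, 1, 2]
factors = {"a0": (0, 1), "a1": (1, 2)}

# Edges e_{va} connect a variable v to every factor a with v in C_a.
edges = [(v, a) for a, scope in factors.items() for v in scope]

# Bipartiteness: every edge joins a variable node to a factor node.
assert all(v in variables and a in factors for v, a in edges)
print(edges)  # [(0, 'a0'), (1, 'a0'), (1, 'a1'), (2, 'a1')]
```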

Inference Problems

Given a probabilistic graphical model $p(x)$ on a graph $G = (V, E)$, inference typically involves two main tasks:

  1. Marginal computation: Computing marginals such as $p(x_v) = \sum_{x_{V \setminus \{v\}}} p(x)$, which requires summing over an exponential number of configurations.
  2. Finding modes: Identifying the configuration $x^{\ast}$ with the highest probability, i.e.,

$$x^{\ast} = \argmax_{x \in \{1,\dots,d\}^{n}} p(x) \qquad \text{(5)}$$

which is, in general, an NP-hard integer programming problem. For the particular case of acyclic graphs, however, both tasks can be solved exactly via the dynamic programming paradigm.
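To illustrate the dynamic programming idea on the simplest acyclic case, a chain, here is a max-product (Viterbi-style) sketch that solves the mode-finding problem (5) exactly, reusing the toy Ising chain from above:

```python
import itertools
import math

# Max-product dynamic programming for the MAP problem (5) on a chain with
# pairwise compatibilities psi; same illustrative Ising model as above.
J = 0.5
n = 3
vals = [-1, 1]

def psi(xv, xw):
    return math.exp(J * xv * xw)

# Forward pass: m[i][x] = best achievable product over x_1..x_i with x_i = x;
# backpointers record the maximizing predecessor for each state.
m = [{x: 1.0 for x in vals}]
back = []
for i in range(1, n):
    scores, ptr = {}, {}
    for x in vals:
        best_prev = max(vals, key=lambda xp: m[-1][xp] * psi(xp, x))
        scores[x] = m[-1][best_prev] * psi(best_prev, x)
        ptr[x] = best_prev
    m.append(scores)
    back.append(ptr)

# Backward pass: recover an optimal configuration x*.
x_star = [max(vals, key=lambda x: m[-1][x])]
for ptr in reversed(back):
    x_star.append(ptr[x_star[-1]])
x_star.reverse()

def score(x):
    return math.prod(psi(x[i], x[i + 1]) for i in range(n - 1))

# Brute-force check over all d^n configurations (feasible only for tiny n).
best = max(itertools.product(vals, repeat=n), key=score)
assert math.isclose(score(tuple(x_star)), score(best))
print(x_star)  # [-1, -1, -1] (tied with [1, 1, 1] by symmetry)
```

The same forward/backward pattern with sums in place of maxima yields the marginals, which is exactly the sum-product algorithm on trees.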