Concepts of Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models and Factorizations

Author: Dr. Dmitrij Sitenko, Mathematical Image Processing at Ruprecht Karls University Heidelberg

Why Graphical Models?

Imagine you are given a dataset $\{(x_i, y_i)\}_{i=1}^N$ consisting of square $n \times n$ images, so that each image $x_i$ contains $n^2$ pixels. Here, $y_i$ denotes the corresponding image class, i.e., the ground-truth label of the underlying data point $x_i$, such as the digit label in the MNIST dataset:

Figure: Samples from the MNIST dataset.

The holy grail is to obtain the joint probability distribution $p(x, y)$ underlying the data-generation process. For a binary dataset of $28 \times 28$ images, this boils down to an inference task with $2^{28 \cdot 28} - 1$ free parameters, which is computationally infeasible. However, if we have access to:

  • The statistical properties of the underlying distribution, and
  • The geometry of the space where the data points reside,

we can incorporate this knowledge into our inference process. The key idea of graphical models is to factorize $p(x, y)$ over the underlying graph $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of edges. Each node $v \in V$ models a random variable $X_v$, which interacts with its neighboring nodes $w \in \mathcal{N}(v)$ via edges $e_{vw}$.
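To make the benefit of factorization concrete, here is a minimal back-of-the-envelope sketch in Python; the chain-structured model is purely an illustrative assumption, used only to compare parameter counts with the full joint table:

```python
# Free parameters of a full joint over n binary pixels vs. a factorized model.
n = 28 * 28  # number of binary pixels in a 28x28 image

# Full joint table: one probability per configuration, minus normalization.
full_joint = 2**n - 1

# Hypothetical chain factorization p(x_1) * prod_i p(x_i | x_{i-1}):
# one parameter for p(x_1), plus one parameter per parent state per factor.
chain = 1 + (n - 1) * 2

print(f"full joint : {full_joint:.3e} parameters")  # ~1.02e+236
print(f"chain model: {chain} parameters")           # 1567
```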

In general, by choosing an ordering of nodes $x_1, \dots, x_n$ on a graph, the distribution can be factorized using conditional distributions via an autoregressive factorization:

$$p(x_1,\dots,x_n) = \textcolor{red}{p(x_1)}\frac{\textcolor{green}{p(x_1,x_2)}}{\textcolor{red}{p(x_1)}}\frac{p(x_1,x_2,x_3)}{\textcolor{green}{p(x_1,x_2)}}\cdots \frac{p(x_1,\dots,x_{n})}{p(x_1,\dots,x_{n-1})} = p(x_1)\prod_{i = 2}^n p(x_i \mid x_1, \dots, x_{i-1}) \qquad \text{(1)}$$
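Equation (1) is simply the telescoping chain rule, so it can be checked numerically. Here is a quick sketch (assuming NumPy) on a random joint over three binary variables:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Random joint distribution p(x1, x2, x3) over three binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

def marginal(p, keep):
    """Sum out every axis not listed in `keep`."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    return p.sum(axis=drop)

p1 = marginal(p, (0,))     # p(x1)
p12 = marginal(p, (0, 1))  # p(x1, x2)

for x1, x2, x3 in itertools.product(range(2), repeat=3):
    # p(x1) * p(x2|x1) * p(x3|x1,x2), written as the ratios in equation (1)
    chain = p1[x1] * (p12[x1, x2] / p1[x1]) * (p[x1, x2, x3] / p12[x1, x2])
    assert np.isclose(chain, p[x1, x2, x3])
```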

Based on how the joint distribution is factorized, the following common classes of graphical models arise:

Directed Graphical Models

In Directed Graphical Models, also known as Bayesian Networks, the edges carry an orientation, meaning $e_{vw} \in E$ does not imply $e_{wv} \in E$. For each vertex $v \in V$, we define:

  • Parents of a node $v \in V$: the set $\pi(v)$ of all nodes $w$ with a directed edge $e_{wv} \in E$ pointing into $v$. More generally, $w$ is called an ancestor of $v$ if there is a directed path $w \to s \to \dots \to v$.
  • A directed graph is a Directed Acyclic Graph (DAG) if it contains no cycles (i.e., no paths of the form $v \to s \to \dots \to v$).
  • If $G = (V, E)$ is a DAG, we can specify a partial order on the node set $V$, denoting $w < v$ if $w$ is an ancestor of $v$, so that every parent precedes its children. Given an ordering consistent with this partial order, the autoregressive factorization in equation (1) simplifies so that each node is conditioned only on its parents:
$$p(x_1,\dots,x_n) = \prod_{v \in V}p(x_v \mid x_{\pi(v)}) \qquad \text{(2)}$$
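As a minimal sketch of equation (2), consider a toy DAG $a \to b$, $a \to c$; the conditional probability tables below are made-up illustrative numbers, not taken from any dataset:

```python
import itertools

# Toy DAG a -> b, a -> c over binary variables; all CPT entries are
# illustrative assumptions.
parents = {"a": (), "b": ("a",), "c": ("a",)}
cpt = {
    "a": {(): [0.6, 0.4]},                      # p(a)
    "b": {(0,): [0.7, 0.3], (1,): [0.2, 0.8]},  # p(b | a)
    "c": {(0,): [0.5, 0.5], (1,): [0.9, 0.1]},  # p(c | a)
}

def joint(assignment):
    """Equation (2): product of p(x_v | x_pi(v)) over all nodes v."""
    prob = 1.0
    for v, pa in parents.items():
        pa_vals = tuple(assignment[w] for w in pa)
        prob *= cpt[v][pa_vals][assignment[v]]
    return prob

# Sanity check: the factorized joint sums to one.
total = sum(joint(dict(zip("abc", xs)))
            for xs in itertools.product(range(2), repeat=3))
print(total)  # 1.0
```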

Undirected Graphical Models

In Undirected Graphical Models, also known as Markov Random Fields (MRFs), edges carry no orientation, i.e., $e_{vw} \in E$ if and only if $e_{wv} \in E$. The factorization in equation (2) does not hold here because there is no ancestor-like partial ordering. Instead, the graph is decomposed into cliques $C \subseteq V$, which are subsets of fully connected nodes (i.e., $e_{vw} \in E$ for all $v, w \in C$).

Each clique $C$ is associated with a compatibility function $\psi_C(x_C)$, leading to the factorization:

$$p(x_1,\dots,x_n) = \frac{1}{Z}\prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_{x}\prod_{C \in \mathcal{C}} \psi_C(x_C) \qquad \text{(3)}$$

In practice, cliques are chosen as maximal fully connected subsets for convenience.
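As a sketch of equation (3), take a three-node chain with maximal cliques $\{0,1\}$ and $\{1,2\}$ and Ising-style compatibilities; the coupling strength is an arbitrary illustrative choice, and the partition function $Z$ is computed by brute-force enumeration:

```python
import itertools
import math

# Three-node chain with maximal cliques {0, 1} and {1, 2}; Ising-style
# compatibility psi(x_v, x_w) = exp(J * x_v * x_w) with states x in {-1, +1}.
J = 0.5  # illustrative coupling strength
cliques = [(0, 1), (1, 2)]

def psi(xv, xw):
    return math.exp(J * xv * xw)

states = list(itertools.product([-1, 1], repeat=3))

# Partition function Z: sum of unnormalized clique products over all states.
Z = sum(math.prod(psi(x[v], x[w]) for v, w in cliques) for x in states)

def p(x):
    """Equation (3): normalized product of clique compatibilities."""
    return math.prod(psi(x[v], x[w]) for v, w in cliques) / Z

print(sum(p(x) for x in states))  # 1.0, as required
```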

Factor Graphs

For complex graphs with high inter-node dependencies, determining compatibility functions that factorize as in equation (3) can be challenging. To address this, Factor Graphs augment the node set $V$ with factor nodes $V_a$, which represent cliques of the graph. The edge set is likewise augmented with edges $e_{va}$ whenever $v \in C_a$.

The resulting graph is bipartite (i.e., the node set is separable into two disjoint sets), yielding the factorization:

$$p(x_1,\dots,x_n) = \frac{1}{Z}\prod_{i \in V}\psi_i(x_i)\prod_{a \in V_a}\psi_a(x_a) \qquad \text{(4)}$$

where $x_a = (x_v)_{v \in C_a}$ collects the variables attached to factor node $a$, and $Z$ normalizes the product as in equation (3).

For example, if the graph is a tree, the compatibility functions on the factor nodes $x_a$ and variable nodes $x_i$ are given in terms of the marginals (for more details, refer to the blog post).
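The bipartite structure is straightforward to represent explicitly. A minimal sketch for the chain from the previous example (the factor names are illustrative) lists the augmented edges $e_{va}$:

```python
# Variable nodes 0, 1, 2 and factor nodes "a0" (clique {0, 1}) and
# "a1" (clique {1, 2}); the names are illustrative.
variables = [0, 1, 2]
factors = {"a0": (0, 1), "a1": (1, 2)}

# Edges e_{va} connect a variable v to every factor a with v in C_a.
edges = [(v, a) for a, scope in factors.items() for v in scope]

# Bipartiteness: every edge joins a variable node to a factor node.
assert all(v in variables and a in factors for v, a in edges)
print(edges)  # [(0, 'a0'), (1, 'a0'), (1, 'a1'), (2, 'a1')]
```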

Inference Problems

Given a probabilistic graphical model $p(x)$ on a graph $G = (V, E)$, inference typically involves two main tasks:

  1. Marginal computation: Computing marginals such as $p(x_v) = \sum_{x_{V \setminus \{v\}}} p(x)$, which requires summing over an exponential number of configurations.
  2. Finding modes: Identifying the configuration $x^{\ast}$ with the highest probability, i.e.,

$$x^{\ast} = \argmax_{x \in \{1,\dots,d\}^{n}} p(x) \qquad \text{(5)}$$

which is, in general, an NP-hard integer programming problem. For the particular case of acyclic graphs, however, both tasks can be solved exactly via the dynamic programming paradigm.
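To illustrate the dynamic programming idea on the simplest acyclic case, a chain, here is a max-product (Viterbi-style) sketch that solves the mode-finding problem (5) exactly, reusing the toy Ising chain from above:

```python
import itertools
import math

# Max-product dynamic programming for the MAP problem (5) on a chain with
# pairwise compatibilities psi; same illustrative Ising model as above.
J = 0.5
n = 3
vals = [-1, 1]

def psi(xv, xw):
    return math.exp(J * xv * xw)

# Forward pass: m[i][x] = best achievable product over x_1..x_i with x_i = x;
# backpointers record the maximizing predecessor for each state.
m = [{x: 1.0 for x in vals}]
back = []
for i in range(1, n):
    scores, ptr = {}, {}
    for x in vals:
        best_prev = max(vals, key=lambda xp: m[-1][xp] * psi(xp, x))
        scores[x] = m[-1][best_prev] * psi(best_prev, x)
        ptr[x] = best_prev
    m.append(scores)
    back.append(ptr)

# Backward pass: recover an optimal configuration x*.
x_star = [max(vals, key=lambda x: m[-1][x])]
for ptr in reversed(back):
    x_star.append(ptr[x_star[-1]])
x_star.reverse()

def score(x):
    return math.prod(psi(x[i], x[i + 1]) for i in range(n - 1))

# Brute-force check over all d^n configurations (feasible only for tiny n).
best = max(itertools.product(vals, repeat=n), key=score)
assert math.isclose(score(tuple(x_star)), score(best))
print(x_star)  # [-1, -1, -1] (tied with [1, 1, 1] by symmetry)
```

The same forward/backward pattern with sums in place of maxima yields the marginals, which is exactly the sum-product algorithm on trees.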