
Energy Based Models

This post covers ideas on energy based models, starting from physics via the Gibbs distribution and leading up to the Restricted Boltzmann Machine.

Author: Dr. Dmitrij Sitenko, Mathematical Image Processing, Ruprecht Karls University Heidelberg

In this post we give an intuitive understanding of the key concepts behind a broad class of machine learning models known as energy based models, which are widely applied to various tasks including:

  • Data generation and Data reconstruction
  • Feature extraction
  • Data compression

Motivation

Throughout the post we will explain how energy based models tackle the problems listed above. But, remembering the quote

Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.
Marie Curie

let us first start from the origins and try to find out what these models actually are and where they emerged from. From our previous post on exponential families and graphical models we already know how exponential families emerge as solutions to maximum entropy problems with mean-like constraints. From the perspective of statistical physics, the mean parameters are interpreted as averaged energies $\mathbb{E}_{p}[E(x)]$, fixed constants of our universe that we can measure and have access to.
In other words, given only the energy function $E(x)$ and no assumptions on the underlying generating distribution $p(x)$, we are in a state of maximal uncertainty, and in order to determine $p(x)$ we need to solve the maximum entropy problem:

$$p^{\ast} = \argmax\limits_{p} H(p), \qquad H(p) := -\int_{\mathcal{X}} (\log p(x))\, p(x)\, \nu(dx)$$
$$\text{subject to} \quad \int_{\mathcal{X}} p(x)\, \nu(dx) = 1, \qquad \mathbb{E}_{p}[E(x)] = \langle E \rangle \qquad (1)$$

The corresponding dual representation of (1) is given in terms of Lagrange multipliers associated with each constraint:

$$p^{\ast},\lambda^{\ast},\beta^{\ast} = \argmax\limits_{p,\lambda,\beta}\; H(p) + \beta\big(\langle E \rangle - \mathbb{E}_p[E]\big) + \lambda\Big(1 - \int_{\mathcal{X}} p(x)\, d\nu(x)\Big) \qquad (2)$$

Setting the functional derivative with respect to $p(x)$ to zero yields the optimality condition for $p^{\ast}(x)$:

$$-\log p^{\ast}(x) - 1 - \beta E(x) - \lambda = 0 \quad \Leftrightarrow \quad p^{\ast}(x) = \frac{1}{e^{1+\lambda}}\, e^{-\beta E(x)} \qquad (3)$$

where the Lagrange multiplier $\lambda^{\ast}$ now corresponds to the log-partition function via $e^{\lambda^{\ast}+1} = \int_{\mathcal{X}} e^{-\beta E(x)}\, d\nu(x)$ and can therefore be expressed in closed form

$$\lambda^{\ast} + 1 = \log\Big( \int_{\mathcal{X}} e^{-\beta E(x)}\, d\nu(x) \Big) = \log Z$$

Plugging this expression into (3) yields the closed-form expression for $p^{\ast}$:

$$p^{\ast}(x) = \frac{1}{Z}\, e^{-\beta E(x)}, \qquad Z = \int_{\mathcal{X}} e^{-\beta E(x)}\, d\nu(x) \qquad (4)$$
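As a quick numerical illustration of (4), here is a minimal sketch assuming a toy one-dimensional quadratic energy $E(x) = x^2/2$ and $\beta = 1$ (both chosen purely for illustration): approximating $Z$ by a brute-force sum recovers the standard normal density.

```python
import numpy as np

beta = 1.0
E = lambda x: 0.5 * x**2                        # toy quadratic energy (illustrative choice)

# Partition function Z = \int exp(-beta * E(x)) dx, approximated by a Riemann sum
xs, dx = np.linspace(-10.0, 10.0, 200_001, retstep=True)
Z = np.sum(np.exp(-beta * E(xs))) * dx

p_star = lambda x: np.exp(-beta * E(x)) / Z     # Gibbs distribution (4)

# For E(x) = x^2/2 and beta = 1, p* is the standard normal density
x0 = 1.3
print(p_star(x0))                               # ~0.1714
print(np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi))  # ~0.1714
```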

For the particular case of a discrete state space $\mathcal{X}$, (4) boils down to

$$p^{\ast}(x) = \frac{e^{-\beta E(x)}}{Z}, \qquad Z = \sum\limits_{x \in \mathcal{X}} e^{-\beta E(x)} \qquad (5)$$

In statistical physics the maximum entropy solution (5) is known as the Boltzmann-Gibbs distribution, which in the language of machine learning is nowadays referred to as an energy based model, due to the energy term $E(x)$ that one wants to minimize in practice. Since the two formulations are equivalent, they come with equivalent challenges: evaluating the probability $p^{\ast}(x)$ of a particular observed data point is hard, because we need access to the partition function $Z$.
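To make (5) and the role of $Z$ concrete, the following sketch (with an arbitrary random quadratic energy, purely illustrative) computes the Boltzmann-Gibbs distribution over all binary vectors of length $n$ by brute force; the sum over $2^n$ states is exactly what becomes infeasible for realistic $n$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, beta = 10, 1.0

# Arbitrary energy on binary states x in {0,1}^n (illustrative choice):
# a random quadratic form E(x) = -x^T A x
A = rng.normal(size=(n, n))
def energy(x):
    return -x @ A @ x

# Brute-force partition function: the sum runs over all 2^n states,
# which is what becomes intractable for large n.
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
energies = np.array([energy(x) for x in states])
Z = np.sum(np.exp(-beta * energies))

# Boltzmann-Gibbs probabilities (5); evaluating a single p(x) already needs Z.
p = np.exp(-beta * energies) / Z
print(p.sum())   # ~1.0
print(p[0])      # probability of the all-zeros state
```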

Inserting representation (5) into (1) we get an equivalent formulation of the entropy

$$H(p^{\ast}) = \beta \int_{\mathcal{X}} p^{\ast}(x) E(x)\, d\nu(x) + \log Z = \beta \langle E \rangle + \log Z \qquad (6)$$

where $-\log Z$ and $\langle E \rangle$ are called the Gibbs free energy and the averaged energy, respectively. In particular, decomposition (6) marked the beginning of the still ongoing research on approximation ansätze for evaluating the intractable partition function $Z$, including:

  • Mean field methods
  • Kikuchi clustering and Bethe approximations
  • Region based entropy approximation
  • Deep learning driven approaches.
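As a quick numerical sanity check of decomposition (6), here is a minimal sketch on a small discrete state space (random energies, purely illustrative) that verifies $H(p^{\ast}) = \beta\langle E \rangle + \log Z$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.7
E = rng.normal(size=16)              # random energies on a 16-state space (illustrative)

Z = np.sum(np.exp(-beta * E))        # partition function
p = np.exp(-beta * E) / Z            # Boltzmann-Gibbs distribution (5)

H = -np.sum(p * np.log(p))           # entropy of p*
mean_E = np.sum(p * E)               # averaged energy <E>

print(H, beta * mean_E + np.log(Z))  # both numbers coincide, verifying (6)
```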

We continue with the last item of the list above, focusing on the discrete case, and aim to find a variational model for $p^{\ast}$ that is parametrized by a neural network through the energy term $E_{\theta}(x)$. This idea dates back to the early 1980s with the formulation of the Hopfield model and Boltzmann machines.

Hopfield Model

The Hopfield model is the grandfather of neural networks. Its idea originates from modeling associative memory and is motivated by the Hebbian rule, which states that synaptic connections between neurons that activate simultaneously become stronger than those of neurons activating alone:

neurons that fire together wire together

Mathematically, for the special case of $N$ neurons that take only the binary values fire / not fire, the strength of such synaptic connections is represented by

$$w_{ij} = \begin{cases} \frac{1}{N}\sum_{\mu=1}^{p} x_{i}^{\mu} x_{j}^{\mu}, & i \neq j \\ 0, & i = j \end{cases} \qquad (7)$$

where $x_k^{\mu}$ represents the $k$-th bit of the binary firing pattern $\mu$. The weights $w_{ij}$ define a symmetric matrix, given (up to the $1/N$ scaling and the zero diagonal) by the sum of $p$ binary outer products $W = \sum_{\mu=1}^{p} x^{\mu}(x^{\mu})^{T}$, which in turn defines an energy of the underlying network

$$E_{\Theta}(x) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij} x_{i} x_{j} + \sum_{i=1}^{N}\theta_{i} x_{i} = -\frac{1}{2}\langle x, Wx \rangle + \langle \theta, x \rangle \qquad (8)$$

We are now ready to build the Hopfield energy based model by inserting (8) into expression (5), which yields a family of distributions over all binary patterns of length $N$, one for each pair of parameters $\Theta = (W,\theta)$. The network states are sequentially updated according to

$$x_i = \mathrm{sign}\Big(\sum_{j} w_{ij} x_{j} + \theta_{i}\Big), \qquad \mathrm{sign}(x) = \begin{cases} -1, & x \leq 0 \\ +1, & x > 0 \end{cases}$$
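Putting (7), (8) and the update rule together, here is a minimal NumPy sketch of a Hopfield network (random $\pm 1$ patterns, asynchronous updates; all names and sizes are illustrative choices, not a reference implementation): it stores a few patterns via the Hebbian rule and recalls one of them from a corrupted version.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                                # number of neurons / stored patterns
patterns = rng.choice([-1, 1], size=(P, N))  # binary (+1/-1) firing patterns x^mu

# Hebbian weights (7): W = (1/N) * sum_mu x^mu (x^mu)^T with zero diagonal
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)
theta = np.zeros(N)                          # no bias, for simplicity

def sign(a):
    return np.where(a > 0, 1, -1)            # sign convention of the update rule

# Corrupt one stored pattern by flipping 15% of its bits
x = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
x[flip] *= -1

# Asynchronous updates: x_i <- sign(sum_j w_ij x_j + theta_i)
for _ in range(5):
    for i in rng.permutation(N):
        x[i] = sign(W[i] @ x + theta[i])

print(np.mean(x == patterns[0]))             # typically 1.0: the pattern is recalled
```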

Boltzmann Machines

One drawback of Hopfield networks is their inability to model higher-order dependencies in the data, since the energy (8) contains only pairwise interactions. Incorporating higher-order terms results in tensors $W$, which in turn require efficient contraction routines, as explained in the post on tensor networks. A different approach to modeling dependencies beyond pairwise interactions is the idea of purification from quantum statistical physics: view the unknown distribution $p(x)$ as a marginal of a higher-dimensional distribution $p(x,h)$ with $p(x) = \sum_{h} p(x,h)$. The Boltzmann machine takes the form (5) with

$$E(x,h) = -\frac{1}{2}\sum_{i,j} x_{i} l_{ij} x_{j} - \frac{1}{2}\sum_{i,j} h_{i} j_{ij} h_{j} - \sum_{i,j} x_{i} w_{ij} h_{j} - \sum_{i}\theta_{i} x_{i} - \sum_{j}\eta_{j} h_{j}$$
$$= -\frac{1}{2}\langle x, Lx \rangle - \frac{1}{2}\langle h, Jh \rangle - \langle x, Wh \rangle - \langle \theta, x \rangle - \langle \eta, h \rangle \qquad (9)$$
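To make the purification idea concrete, the following brute-force sketch (tiny layer sizes and random parameters, purely illustrative) evaluates the Boltzmann-machine energy (9) and recovers the marginal $p(x) = \sum_h p(x,h)$ by explicitly summing out the hidden units.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 4, 3                                     # visible / hidden units (tiny, for brute force)
L = rng.normal(size=(nv, nv)); L = (L + L.T) / 2  # symmetric visible-visible couplings
J = rng.normal(size=(nh, nh)); J = (J + J.T) / 2  # symmetric hidden-hidden couplings
W = rng.normal(size=(nv, nh))                     # visible-hidden couplings
theta, eta = rng.normal(size=nv), rng.normal(size=nh)

def energy(x, h):
    # Boltzmann machine energy (9)
    return (-0.5 * x @ L @ x - 0.5 * h @ J @ h
            - x @ W @ h - theta @ x - eta @ h)

xs = np.array(list(itertools.product([0, 1], repeat=nv)), dtype=float)
hs = np.array(list(itertools.product([0, 1], repeat=nh)), dtype=float)

# Joint p(x,h) via (5), then the marginal p(x) = sum_h p(x,h)
joint = np.array([[np.exp(-energy(x, h)) for h in hs] for x in xs])
Z = joint.sum()
p_x = joint.sum(axis=1) / Z

print(p_x.sum())   # ~1.0
print(p_x[:4])     # marginal probabilities of the first few visible configurations
```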

Restricted Boltzmann Machines (RBMs)

The Restricted Boltzmann Machine (RBM) is obtained from (9) by dropping the dependencies encoded by the matrices $L$ and $J$, i.e. by removing all connections between visible nodes $x_i, x_j$ and between hidden nodes $h_i, h_j$, respectively.

$$E(x,h) = -\sum_{ij} x_{i} w_{ij} h_{j} - \sum_{i}\theta_{i} x_{i} - \sum_{j}\eta_{j} h_{j} = -\langle x, Wh \rangle - \langle \theta, x \rangle - \langle \eta, h \rangle \qquad (10)$$
Figure: RBM with 6 visible and 3 hidden nodes.

Assuming binary units $x_i, h_j \in \{0,1\}$ and setting $\beta = 1$, the key advantage of an RBM over more general Boltzmann machines is that the conditional probabilities have closed-form expressions:

$$p(x_i = 1 \mid h) = \frac{p(x_i = 1, h)}{p(h)} = \frac{\sum\limits_{x \in \{0,1\}^{n},\, x_i = 1} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)}{\sum\limits_{x \in \{0,1\}^{n}} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)}$$
$$= \frac{1}{1 + \frac{\sum\limits_{x \in \{0,1\}^{n},\, x_i = 0} \exp(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle)}{\sum\limits_{x \in \{0,1\}^{n},\, x_i = 1} \exp(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle)}} = \frac{1}{1 + \exp(-\langle W_{i}, h \rangle - \theta_i)} \qquad (11)$$

where the last step follows from the identity
$$\sum\limits_{x \in \{0,1\}^{n},\, x_i = 0} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)\, \exp\big(\langle W_i, h \rangle + \theta_i\big) = \sum\limits_{x \in \{0,1\}^{n},\, x_i = 1} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)$$
with $W_i$ denoting the $i$-th row of $W$: flipping $x_i$ from $0$ to $1$ multiplies each summand by exactly $\exp(\langle W_i, h \rangle + \theta_i)$. An analogous calculation for $p(h_j = 1 \mid x)$ yields the closed-form expressions

$$p(h_j = 1 \mid x) = \frac{1}{1 + e^{-\langle W_{\cdot j},\, x \rangle - \eta_j}}, \qquad p(x_i = 1 \mid h) = \frac{1}{1 + e^{-\langle W_{i},\, h \rangle - \theta_i}} \qquad (12)$$
where $W_{\cdot j}$ denotes the $j$-th column of $W$.
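Finally, a short sketch of the RBM conditionals (all parameter values random and illustrative): it implements the sigmoids (12), checks $p(x_i = 1 \mid h)$ against the brute-force sums in (11), and uses the conditionals for one step of block Gibbs sampling, the basic building block of RBM training procedures such as contrastive divergence.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 6, 3                               # e.g. 6 visible and 3 hidden units, as in the figure
W = rng.normal(size=(nv, nh))
theta, eta = rng.normal(size=nv), rng.normal(size=nh)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Closed-form conditionals (12)
def p_x_given_h(h):
    return sigmoid(W @ h + theta)           # vector of p(x_i = 1 | h)

def p_h_given_x(x):
    return sigmoid(W.T @ x + eta)           # vector of p(h_j = 1 | x)

# Brute-force check of (11) for one unit i and a fixed h
h = rng.integers(0, 2, size=nh).astype(float)
i = 2
xs = np.array(list(itertools.product([0, 1], repeat=nv)), dtype=float)
weights = np.exp(xs @ W @ h + xs @ theta + eta @ h)   # unnormalized p(x,h) = exp(-E(x,h))
brute = weights[xs[:, i] == 1].sum() / weights.sum()
print(brute, p_x_given_h(h)[i])             # the two values agree

# One step of block Gibbs sampling x -> h -> x'
x = rng.integers(0, 2, size=nv).astype(float)
h_sample = (rng.random(nh) < p_h_given_x(x)).astype(float)
x_new = (rng.random(nv) < p_x_given_h(h_sample)).astype(float)
print(x_new)
```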