
Energy Based Models

This post covers ideas on energy based models, starting from physics via the Gibbs distribution and leading up to the Restricted Boltzmann Machine.

Author: Dr. Dmitrij Sitenko, Mathematical Image Processing, Ruprecht Karls University Heidelberg

In this post we give an intuitive understanding of the key concepts behind a broad class of machine learning models known as energy based models, which are widely applied to various tasks including:

  • Data generation and Data reconstruction
  • Feature extraction
  • Data compression

Motivation

Throughout the post we will explain how energy based models tackle the problems listed above. But, remembering the quote

Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.
Marie Curie

let us first start from the origins and try to find out what these models actually are and where they emerged from. From our previous post on exponential families and graphical models we already know how exponential families emerge as solutions to maximum entropy problems with mean-like constraints. From the perspective of statistical physics, the mean parameters are interpreted as averaged energies $\mathbb{E}_{p}[E(x)]$, fixed constants of our universe that we can measure and have access to.
In other words, given only the energy function $E(x)$ and no assumptions on the underlying generating distribution $p(x)$, we are in a state of maximal uncertainty, and in order to determine $p(x)$ we need to solve the maximum entropy problem:

$$p^{\ast} = \argmax\limits_{p} H(p), \qquad H(p) := -\int_{\mathcal{X}} (\log p(x))\, p(x)\, \nu(dx)$$
$$\text{subject to} \quad \int_{\mathcal{X}} p(x)\, \nu(dx) = 1, \qquad \mathbb{E}_{p}[E(x)] = \langle E \rangle \qquad (1)$$

The corresponding dual representation of (1) is given in terms of Lagrange multipliers associated with each constraint:

$$p^{\ast},\lambda^{\ast},\beta^{\ast} = \argmax\limits_{p,\lambda,\beta}\; H(p) + \beta\big(\langle E \rangle - \mathbb{E}_p[E]\big) + \lambda\Big(1 - \int_{\mathcal{X}} p(x)\, d\nu(x)\Big) \qquad (2)$$

Setting the functional derivative with respect to $p(x)$ to zero yields the optimality condition for $p^{\ast}(x)$:

$$-\log p^{\ast}(x) - 1 - \beta E(x) - \lambda = 0 \quad \Leftrightarrow \quad p^{\ast}(x) = \frac{1}{e^{1+\lambda}}\, e^{-\beta E(x)} \qquad (3)$$

where the Lagrange multiplier $\lambda^{\ast}$ now corresponds to the log-partition function via $e^{\lambda^{\ast}+1} = \int_{\mathcal{X}} e^{-\beta E(x)}\, d\nu(x)$ and can therefore be expressed in closed form

$$\lambda^{\ast} + 1 = \log\Big( \int_{\mathcal{X}} e^{-\beta E(x)}\, d\nu(x) \Big) = \log Z$$

Plugging this expression into (3) yields the closed-form expression for $p^{\ast}$:

$$p^{\ast}(x) = \frac{1}{Z}\, e^{-\beta E(x)}, \qquad Z = \int_{\mathcal{X}} e^{-\beta E(x)}\, d\nu(x) \qquad (4)$$
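As a quick numerical illustration of (4), here is a minimal sketch assuming a toy one-dimensional quadratic energy $E(x) = x^2/2$ and $\beta = 1$ (both chosen purely for illustration): approximating $Z$ by a brute-force sum recovers the standard normal density.

```python
import numpy as np

beta = 1.0
E = lambda x: 0.5 * x**2                        # toy quadratic energy (illustrative choice)

# Partition function Z = \int exp(-beta * E(x)) dx, approximated by a Riemann sum
xs, dx = np.linspace(-10.0, 10.0, 200_001, retstep=True)
Z = np.sum(np.exp(-beta * E(xs))) * dx

p_star = lambda x: np.exp(-beta * E(x)) / Z     # Gibbs distribution (4)

# For E(x) = x^2/2 and beta = 1, p* is the standard normal density
x0 = 1.3
print(p_star(x0))                               # ~0.1714
print(np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi))  # ~0.1714
```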

For the particular case of a discrete state space $\mathcal{X}$, (4) boils down to

$$p^{\ast}(x) = \frac{e^{-\beta E(x)}}{Z}, \qquad Z = \sum\limits_{x \in \mathcal{X}} e^{-\beta E(x)} \qquad (5)$$

In statistical physics the maximum entropy solution (5) is known as the Boltzmann-Gibbs distribution, which in the language of machine learning is nowadays referred to as an energy based model, due to the energy term $E(x)$ that one wants to minimize in practice. Since the two formulations are equivalent, they come with equivalent challenges: evaluating the probability $p^{\ast}(x)$ of a particular observed data point is hard, because we need access to the partition function $Z$.
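To make (5) and the role of $Z$ concrete, the following sketch (with an arbitrary random quadratic energy, purely illustrative) computes the Boltzmann-Gibbs distribution over all binary vectors of length $n$ by brute force; the sum over $2^n$ states is exactly what becomes infeasible for realistic $n$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, beta = 10, 1.0

# Arbitrary energy on binary states x in {0,1}^n (illustrative choice):
# a random quadratic form E(x) = -x^T A x
A = rng.normal(size=(n, n))
def energy(x):
    return -x @ A @ x

# Brute-force partition function: the sum runs over all 2^n states,
# which is what becomes intractable for large n.
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
energies = np.array([energy(x) for x in states])
Z = np.sum(np.exp(-beta * energies))

# Boltzmann-Gibbs probabilities (5); evaluating a single p(x) already needs Z.
p = np.exp(-beta * energies) / Z
print(p.sum())   # ~1.0
print(p[0])      # probability of the all-zeros state
```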

Inserting representation (5) into (1) we get an equivalent formulation of the entropy

$$H(p^{\ast}) = \beta \int_{\mathcal{X}} p^{\ast}(x) E(x)\, d\nu(x) + \log Z = \beta \langle E \rangle + \log Z \qquad (6)$$

where $-\log Z$ and $\langle E \rangle$ are called the Gibbs free energy and the averaged energy, respectively. In particular, decomposition (6) marked the beginning of the still ongoing research on approximation ansätze for evaluating the intractable partition function $Z$, including:

  • Mean field methods
  • Kikuchi clustering and Bethe approximations
  • Region based entropy approximation
  • Deep learning driven approaches.
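As a quick numerical sanity check of decomposition (6), here is a minimal sketch on a small discrete state space (random energies, purely illustrative) that verifies $H(p^{\ast}) = \beta\langle E \rangle + \log Z$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.7
E = rng.normal(size=16)              # random energies on a 16-state space (illustrative)

Z = np.sum(np.exp(-beta * E))        # partition function
p = np.exp(-beta * E) / Z            # Boltzmann-Gibbs distribution (5)

H = -np.sum(p * np.log(p))           # entropy of p*
mean_E = np.sum(p * E)               # averaged energy <E>

print(H, beta * mean_E + np.log(Z))  # both numbers coincide, verifying (6)
```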

We continue with the last item of the list above, focusing on the discrete case, and aim to find a variational model for $p^{\ast}$ that is parametrized by a neural network through the energy term $E_{\theta}(x)$. This idea dates back to the early 1980s with the formulation of the Hopfield model and Boltzmann machines.

Hopfield Model

The Hopfield model is the grandfather of neural networks. Its idea originates from modeling associative memory and is motivated by the Hebbian rule, which states that synaptic connections between neurons that activate simultaneously become stronger than those of neurons activating alone:

neurons that fire together wire together

Mathematically, for the special case of $N$ neurons that take only the binary values fire / not fire, the strength of such synaptic connections is represented by

$$w_{ij} = \begin{cases} \frac{1}{N}\sum_{\mu=1}^{p} x_{i}^{\mu} x_{j}^{\mu}, & i \neq j \\ 0, & i = j \end{cases} \qquad (7)$$

where $x_k^{\mu}$ represents the $k$-th bit of the binary firing pattern $\mu$. The weights $w_{ij}$ define a symmetric matrix, given (up to the $1/N$ scaling and the zero diagonal) by the sum of $p$ binary outer products $W = \sum_{\mu=1}^{p} x^{\mu}(x^{\mu})^{T}$, which in turn defines an energy of the underlying network

$$E_{\Theta}(x) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij} x_{i} x_{j} + \sum_{i=1}^{N}\theta_{i} x_{i} = -\frac{1}{2}\langle x, Wx \rangle + \langle \theta, x \rangle \qquad (8)$$

We are now ready to build the Hopfield energy based model by inserting (8) into expression (5), which yields a family of distributions over all binary patterns of length $N$, one for each pair of parameters $\Theta = (W,\theta)$. The network states are sequentially updated according to

$$x_i = \mathrm{sign}\Big(\sum_{j} w_{ij} x_{j} + \theta_{i}\Big), \qquad \mathrm{sign}(x) = \begin{cases} -1, & x \leq 0 \\ +1, & x > 0 \end{cases}$$
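Putting (7), (8) and the update rule together, here is a minimal NumPy sketch of a Hopfield network (random $\pm 1$ patterns, asynchronous updates; all names and sizes are illustrative choices, not a reference implementation): it stores a few patterns via the Hebbian rule and recalls one of them from a corrupted version.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                                # number of neurons / stored patterns
patterns = rng.choice([-1, 1], size=(P, N))  # binary (+1/-1) firing patterns x^mu

# Hebbian weights (7): W = (1/N) * sum_mu x^mu (x^mu)^T with zero diagonal
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)
theta = np.zeros(N)                          # no bias, for simplicity

def sign(a):
    return np.where(a > 0, 1, -1)            # sign convention of the update rule

# Corrupt one stored pattern by flipping 15% of its bits
x = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
x[flip] *= -1

# Asynchronous updates: x_i <- sign(sum_j w_ij x_j + theta_i)
for _ in range(5):
    for i in rng.permutation(N):
        x[i] = sign(W[i] @ x + theta[i])

print(np.mean(x == patterns[0]))             # typically 1.0: the pattern is recalled
```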

Boltzmann Machines

One drawback of Hopfield networks is their inability to model higher-order dependencies in the data, since the energy (8) contains only pairwise interactions. Incorporating higher-order terms results in tensors $W$, which in turn require efficient contraction routines, as explained in the post on tensor networks. A different approach to modeling dependencies beyond pairwise interactions is the idea of purification from quantum statistical physics: view the unknown distribution $p(x)$ as a marginal of a higher-dimensional distribution $p(x,h)$ with $p(x) = \sum_{h} p(x,h)$. The Boltzmann machine takes the form (5) with

$$E(x,h) = -\frac{1}{2}\sum_{i,j} x_{i} l_{ij} x_{j} - \frac{1}{2}\sum_{i,j} h_{i} j_{ij} h_{j} - \sum_{i,j} x_{i} w_{ij} h_{j} - \sum_{i}\theta_{i} x_{i} - \sum_{j}\eta_{j} h_{j}$$
$$= -\frac{1}{2}\langle x, Lx \rangle - \frac{1}{2}\langle h, Jh \rangle - \langle x, Wh \rangle - \langle \theta, x \rangle - \langle \eta, h \rangle \qquad (9)$$
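To make the purification idea concrete, the following brute-force sketch (tiny layer sizes and random parameters, purely illustrative) evaluates the Boltzmann-machine energy (9) and recovers the marginal $p(x) = \sum_h p(x,h)$ by explicitly summing out the hidden units.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 4, 3                                     # visible / hidden units (tiny, for brute force)
L = rng.normal(size=(nv, nv)); L = (L + L.T) / 2  # symmetric visible-visible couplings
J = rng.normal(size=(nh, nh)); J = (J + J.T) / 2  # symmetric hidden-hidden couplings
W = rng.normal(size=(nv, nh))                     # visible-hidden couplings
theta, eta = rng.normal(size=nv), rng.normal(size=nh)

def energy(x, h):
    # Boltzmann machine energy (9)
    return (-0.5 * x @ L @ x - 0.5 * h @ J @ h
            - x @ W @ h - theta @ x - eta @ h)

xs = np.array(list(itertools.product([0, 1], repeat=nv)), dtype=float)
hs = np.array(list(itertools.product([0, 1], repeat=nh)), dtype=float)

# Joint p(x,h) via (5), then the marginal p(x) = sum_h p(x,h)
joint = np.array([[np.exp(-energy(x, h)) for h in hs] for x in xs])
Z = joint.sum()
p_x = joint.sum(axis=1) / Z

print(p_x.sum())   # ~1.0
print(p_x[:4])     # marginal probabilities of the first few visible configurations
```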

Restricted Boltzmann Machines (RBMs)

The Restricted Boltzmann Machine (RBM) is obtained from (9) by dropping the dependencies encoded by the matrices $L$ and $J$, i.e. by removing all connections between visible nodes $x_i, x_j$ and between hidden nodes $h_i, h_j$, respectively.

$$E(x,h) = -\sum_{ij} x_{i} w_{ij} h_{j} - \sum_{i}\theta_{i} x_{i} - \sum_{j}\eta_{j} h_{j} = -\langle x, Wh \rangle - \langle \theta, x \rangle - \langle \eta, h \rangle \qquad (10)$$
Figure: RBM with 6 visible and 3 hidden nodes.

Assuming binary units $x_i, h_j \in \{0,1\}$ and setting $\beta = 1$, the key advantage of an RBM over more general Boltzmann machines is that the conditional probabilities have closed-form expressions:

$$p(x_i = 1 \mid h) = \frac{p(x_i = 1, h)}{p(h)} = \frac{\sum\limits_{x \in \{0,1\}^{n},\, x_i = 1} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)}{\sum\limits_{x \in \{0,1\}^{n}} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)}$$
$$= \frac{1}{1 + \frac{\sum\limits_{x \in \{0,1\}^{n},\, x_i = 0} \exp(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle)}{\sum\limits_{x \in \{0,1\}^{n},\, x_i = 1} \exp(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle)}} = \frac{1}{1 + \exp(-\langle W_{i}, h \rangle - \theta_i)} \qquad (11)$$

where the last step follows from the identity
$$\sum\limits_{x \in \{0,1\}^{n},\, x_i = 0} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)\, \exp\big(\langle W_i, h \rangle + \theta_i\big) = \sum\limits_{x \in \{0,1\}^{n},\, x_i = 1} \exp\big(\langle x, Wh \rangle + \langle \theta, x \rangle + \langle \eta, h \rangle\big)$$
with $W_i$ denoting the $i$-th row of $W$: flipping $x_i$ from $0$ to $1$ multiplies each summand by exactly $\exp(\langle W_i, h \rangle + \theta_i)$. An analogous calculation for $p(h_j = 1 \mid x)$ yields the closed-form expressions

$$p(h_j = 1 \mid x) = \frac{1}{1 + e^{-\langle W_{\cdot j},\, x \rangle - \eta_j}}, \qquad p(x_i = 1 \mid h) = \frac{1}{1 + e^{-\langle W_{i},\, h \rangle - \theta_i}} \qquad (12)$$
where $W_{\cdot j}$ denotes the $j$-th column of $W$.
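Finally, a short sketch of the RBM conditionals (all parameter values random and illustrative): it implements the sigmoids (12), checks $p(x_i = 1 \mid h)$ against the brute-force sums in (11), and uses the conditionals for one step of block Gibbs sampling, the basic building block of RBM training procedures such as contrastive divergence.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 6, 3                               # e.g. 6 visible and 3 hidden units, as in the figure
W = rng.normal(size=(nv, nh))
theta, eta = rng.normal(size=nv), rng.normal(size=nh)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Closed-form conditionals (12)
def p_x_given_h(h):
    return sigmoid(W @ h + theta)           # vector of p(x_i = 1 | h)

def p_h_given_x(x):
    return sigmoid(W.T @ x + eta)           # vector of p(h_j = 1 | x)

# Brute-force check of (11) for one unit i and a fixed h
h = rng.integers(0, 2, size=nh).astype(float)
i = 2
xs = np.array(list(itertools.product([0, 1], repeat=nv)), dtype=float)
weights = np.exp(xs @ W @ h + xs @ theta + eta @ h)   # unnormalized p(x,h) = exp(-E(x,h))
brute = weights[xs[:, i] == 1].sum() / weights.sum()
print(brute, p_x_given_h(h)[i])             # the two values agree

# One step of block Gibbs sampling x -> h -> x'
x = rng.integers(0, 2, size=nv).astype(float)
h_sample = (rng.random(nh) < p_h_given_x(x)).astype(float)
x_new = (rng.random(nv) < p_x_given_h(h_sample)).astype(float)
print(x_new)
```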