- Published at
Energy Based Models
This post covers ideas on energy based models, starting from physics via the Gibbs distribution and leading up to the Restricted Boltzmann Machine.
- Authors
- Name
- Dr. Dmitrij Sitenko
- Mathematical Image Processing at Ruprecht Karls University Heidelberg
In this post we give an intuitive understanding of the key concepts behind a broad class of machine learning models known as energy based models, which are widely applied to various tasks including:
- Data generation and Data reconstruction
- Feature extraction
- Data compression
Motivation
Throughout the post we will explain how energy based models tackle the above problems, but remembering the quote
Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.
Marie Curie
let us first start from the origins and try to find out what these models actually are and where they emerged from.
From our previous post on exponential families and graphical models
we already know how exponential families emerge as solutions to maximum entropy problems subject to mean-value constraints.
From the perspective of statistical physics, the mean parameters are interpreted as average energies, fixed constants of our universe that we can measure and have access to.
In other words, given only the energy function $E(x)$ and no further assumptions on the underlying generating distribution $p(x)$,
we are in a state of maximal uncertainty, and in order to determine $p(x)$ we need to solve a maximum entropy problem.
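In the discrete notation used below, with $H(p)$ denoting the entropy and $\langle E\rangle$ the measured average energy, one common way to state this problem is

$$
\max_{p}\; H(p) \;=\; -\sum_{x} p(x)\log p(x)
\quad \text{subject to} \quad
\sum_{x} p(x)\,E(x) \;=\; \langle E\rangle,
\qquad
\sum_{x} p(x) \;=\; 1 .
$$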
The corresponding dual representation of (1) is given in terms of Lagrange multipliers associated with each constraint.
Setting the functional derivative with respect to $p$ to zero yields the optimality condition for $p$, in which the Lagrange multiplier of the normalization constraint corresponds to the log-partition function and can therefore be expressed in closed form. Plugging this expression back into (3) yields the closed form expression for $p$.
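To make this step concrete, here is a sketch of the computation, writing $\theta$ and $\lambda$ for the multipliers of the mean and normalization constraints (the symbols are our own choice):

$$
\frac{\delta}{\delta p(x)}\Big[ H(p) + \theta\big(\langle E\rangle - \textstyle\sum_x p(x)E(x)\big) + \lambda\big(1 - \textstyle\sum_x p(x)\big)\Big]
= -\log p(x) - 1 - \theta E(x) - \lambda = 0 ,
$$

so that $p(x) \propto e^{-\theta E(x)}$, and normalization fixes $e^{1+\lambda} = \sum_x e^{-\theta E(x)} = Z(\theta)$, i.e. $1+\lambda = \log Z(\theta)$, giving

$$
p(x) \;=\; \frac{e^{-\theta E(x)}}{Z(\theta)} .
$$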
For the particular case of a discrete state space, the left-hand side of (4) boils down to a sum over the finitely many states.
In statistical physics the maximum entropy solution is known as the Boltzmann-Gibbs distribution, which in the language of machine learning is nowadays referred to as an energy based model, due to the energy term that one wants to minimize in practice. However, since these names describe the same object, they come with the same challenges. This means evaluating the probability of a particular observed data point is hard, as we need access to the partition function $Z$.
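To get a feeling for why this is hard, here is a minimal numerical sketch (the quadratic toy energy, the dimension $n$, and all names are ours, purely for illustration): it evaluates the Gibbs distribution over all binary vectors of length $n$ by brute force, which requires summing $2^n$ terms and becomes infeasible very quickly as $n$ grows.

```python
import itertools
import numpy as np

def gibbs_probability(x, energy, states):
    """Brute-force Gibbs probability p(x) = exp(-E(x)) / Z.

    The partition function Z is computed by summing exp(-E(s)) over
    every state s, which is only possible for very small state spaces.
    """
    log_weights = np.array([-energy(s) for s in states])
    # log-sum-exp for numerical stability when computing log Z
    log_Z = np.logaddexp.reduce(log_weights)
    return np.exp(-energy(x) - log_Z)

# Toy example: n binary units with a simple quadratic (Ising-like) energy.
n = 12                                   # already 2**12 = 4096 states
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                        # symmetric couplings

def energy(x):
    x = np.asarray(x, dtype=float)
    return -0.5 * x @ W @ x

states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]
x0 = states[0]
print(gibbs_probability(x0, energy, states))
# Doubling n to 24 would already require summing ~16.7 million terms,
# and realistic models (e.g. 784 binary pixels) are far out of reach.
```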
Inserting representation (5) into (1) we get an equivalent formulation of the entropy,
where $F$ and $\langle E\rangle$ are called the Gibbs free energy and the average energy, respectively (a short derivation is sketched after the list below). In particular, this decomposition was the beginning of the still ongoing research on approximation ansatzes for evaluating the intractable partition function, including:
- Mean field methods
- Kikuchi clustering and Bethe approximations
- Region based entropy approximation
- Deep learning driven approaches.
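For completeness, here is the short calculation behind this decomposition, writing $F = -\log Z$ for the Gibbs free energy (at unit temperature, a convention we adopt here) and $\langle E\rangle = \sum_x p(x)\,E(x)$ for the average energy:

$$
H(p) \;=\; -\sum_x p(x)\log p(x)
\;=\; \sum_x p(x)\big(E(x) + \log Z\big)
\;=\; \langle E\rangle + \log Z
\;=\; \langle E\rangle - F .
$$

Mean field, Bethe, and region-based methods can all be read as different ways of approximating the entropy (or, equivalently, the free energy $F$) in this identity.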
We continue with the last point by focusing on the discrete case and aim to find a variational model for $p$ that is parametrized by a neural network through the energy term $E$. This idea dates back to the 1980s with the formulation of Boltzmann machines and the Hopfield model.
Hopfield Model
The Hopfield model is the grandfather of neural networks. Its idea originates from modeling associative memory and is motivated by the observation that synaptic connections between neurons that activate simultaneously become stronger than those of neurons that activate in isolation, known as the Hebbian rule:
neurons that fire together wire together
Mathematically, for the special case of $N$ neurons that take only the two binary values fire and not fire, the strength of such synaptic connections is represented by weights $w_{ij}$,
where $x_i^{\mu}$ denotes the $i$-th bit of a stored binary firing pattern $x^{\mu}$. The weights define a symmetric matrix $W$ as a sum of $p$ binary outer products, which in turn defines the energy of the underlying network.
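Written out in the standard notation for Hopfield networks, with patterns $x^{\mu}\in\{-1,+1\}^{N}$ for $\mu = 1,\dots,p$ (the $1/p$ normalization is one common convention among several), the Hebbian weights and the resulting network energy read

$$
w_{ij} \;=\; \frac{1}{p}\sum_{\mu=1}^{p} x_i^{\mu}\, x_j^{\mu},
\qquad
E(x) \;=\; -\frac{1}{2}\sum_{i,j} w_{ij}\, x_i\, x_j \;=\; -\frac{1}{2}\, x^{\top} W x .
$$

The stored patterns then sit at (or near) local minima of this energy, which is exactly what makes the network behave as an associative memory.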
We are now ready to build the Hopfield energy based model by inserting (8) into expression (5), which yields a family of distributions over all binary patterns of length $N$ for each choice of the weight parameters. The model parameters are updated sequentially, one stored pattern at a time, according to the Hebbian rule, as sketched below.
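As a concrete illustration, here is a minimal sketch of a Hopfield network in the standard $\pm 1$ formulation: the weights are accumulated one pattern at a time via the Hebbian rule, and recall runs the usual asynchronous sign-updates that never increase the energy (8). Function names and the toy patterns are ours.

```python
import numpy as np

def hebbian_weights(patterns):
    """Accumulate the weight matrix pattern by pattern (Hebbian rule)."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for x in patterns:                      # sequential update, one pattern at a time
        W += np.outer(x, x)
    np.fill_diagonal(W, 0)                  # no self-connections
    return W / len(patterns)

def energy(W, x):
    """Hopfield energy E(x) = -1/2 x^T W x."""
    return -0.5 * x @ W @ x

def recall(W, x, n_sweeps=10, rng=None):
    """Asynchronous recall: update one randomly chosen neuron at a time."""
    rng = np.random.default_rng(rng)
    x = x.copy()
    n = len(x)
    for _ in range(n_sweeps * n):
        i = rng.integers(n)
        x[i] = 1 if W[i] @ x >= 0 else -1   # sign update never increases E
    return x

# Toy demo: store two patterns, then recover one from a corrupted version.
patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
W = hebbian_weights(patterns)
noisy = patterns[0].copy()
noisy[:2] *= -1                             # flip two bits
recovered = recall(W, noisy)
print(recovered, energy(W, recovered))
```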
Boltzmann Machines
One drawback of Hopfield networks is that they cannot model higher order dependencies between the data, because the energy (8) only contains pairwise interactions. Incorporating higher order terms would result in tensors, which in turn require efficient contraction routines, as explained in the post on tensor networks. A different approach to model higher order dependencies beyond pairwise interactions is the idea of purification from quantum statistical physics: view the unknown distribution as the marginal of a higher dimensional distribution over visible and hidden variables. The Boltzmann machine takes the form (5) with an energy that couples visible and hidden units, written out below.
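A standard way to write this energy uses visible units $v$, hidden units $h$, bias vectors $b$ and $c$, and weight matrices $W$ (visible to hidden), $L$ (visible to visible), and $J$ (hidden to hidden); these letters follow a common convention and are our own choice here:

$$
E(v,h) \;=\; -\,v^{\top} W h \;-\; \tfrac{1}{2}\, v^{\top} L\, v \;-\; \tfrac{1}{2}\, h^{\top} J\, h \;-\; b^{\top} v \;-\; c^{\top} h,
\qquad
p(v) \;=\; \frac{1}{Z}\sum_{h} e^{-E(v,h)} .
$$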
Restricted Boltzmann Machines (RBMs)
The Restricted Boltzmann Machine is obtained from (9) by dropping the dependencies encoded by the $L$ and $J$ matrices, i.e. removing all interrelations among visible nodes and among hidden nodes, respectively.
Assuming binary units, the key advantage of an RBM over more general Boltzmann machines is the closed form expression for the conditional probabilities:
given the visible units, the hidden units are conditionally independent, so $p(h\mid v)$ factorizes over the hidden units and each factor reduces to a logistic sigmoid of its input. An analogous calculation for $p(v\mid h)$ yields the symmetric closed form expression.
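To make the closed form concrete, here is a minimal numerical sketch of these conditionals and one step of block Gibbs sampling for a binary RBM; array shapes, variable names, and the random toy parameters are ours, and we use the convention $p(h_j = 1 \mid v) = \sigma\big(c_j + \sum_i v_i W_{ij}\big)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """Closed-form conditional: each hidden unit is an independent Bernoulli
    with mean sigma(c_j + sum_i v_i W_ij)."""
    return sigmoid(v @ W + c)

def p_v_given_h(h, W, b):
    """Symmetric closed-form conditional for the visible units."""
    return sigmoid(h @ W.T + b)

def gibbs_step(v, W, b, c, rng):
    """One step of block Gibbs sampling: v -> h -> v'."""
    h = rng.random(c.shape) < p_h_given_v(v, W, c)
    v_new = rng.random(b.shape) < p_v_given_h(h.astype(float), W, b)
    return v_new.astype(float), h.astype(float)

# Tiny toy RBM with random parameters, purely for illustration.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = 0.1 * rng.normal(size=(n_visible, n_hidden))
b = np.zeros(n_visible)      # visible biases
c = np.zeros(n_hidden)       # hidden biases

v0 = rng.integers(0, 2, size=n_visible).astype(float)
v1, h0 = gibbs_step(v0, W, b, c, rng)
print("p(h=1|v0):", p_h_given_v(v0, W, c))
print("sampled h:", h0, "resampled v:", v1)
```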