Astro Starter Blog
Published at

Convex Hypertree Based Reparametrization and Convex Upperbounds

Todo

Authors
  • avatar
    Name
    Dr. Dmitrij Sitenko
  • Mathematical Image Processing at Ruprecht Karls University Heidelberg
Table of Contents

To motivate why reparametrizations of probability distributions are important we first consider a simple scenario of exponenential family pθ,θΘRpp_{\theta}, \theta \in \Theta \in \R^{p} on a tree structured graph GT=(VT,ET)G^T = (V^T,E^T). We already know one simple clique factorization (edge and nodes on VTV^T)

p(x)=1Z  CGψC(xC)=1ZvVTψv(xv)vwψvw(xv,xw)(1)p(x)={\frac{1}{Z}}\;\prod_{C\in G}\psi_{C}(x_{C}) = \frac{1}{Z}\prod_{v \in V^T}\psi_{v}(x_v)\prod_{vw}\psi_{vw}(x_v,x_w) \qquad (1)

As we already know from the concepts of graphical models a major chalange of inference tasks is the access to partition function ZZ. Assumming for a moment that we have access to the marginals p(xv,xw)p(x_v,x_w), then we can factorize (1)(1) directly in terms of the margnials via

p(x)=vVTp(xv)(v,w)ETp(xv,xw)p(xv)p(xw)(2)p(x)=\prod_{v\in V^T}\,p(x_{v})\prod_{(v\,,w)\in E^T}\,\frac{p(x_{v},{\mathcal x}_{w})}{p(x_{v})p(x_{w})} \qquad (2)

where now due to the proper normalization of the marginals the partition function vanishes, i.e. Z=1Z = 1. Now given as exponential family with initial regular parametrization θast\theta^{\\ast} leveraging factorization (2) implies

pθ(x)=exp(vVTθvϕv(xv)+vwθvwϕvw(xv,xw))(3)p_{\theta^{\ast}}(x)=\exp\left(\sum_{v \in V^T} \theta^{\ast}_{v}\phi_v(x_v)+\sum_{vw}\theta^{\ast}_{vw}\phi_{vw}(x_v,x_w)\right) \qquad (3)

where the correspodning log partition function becomes A(θ)=0A(\theta^{\ast}) = 0. In this way we can observe that instead computing normalization contansts we alternatively can seek for a reparametrization parameter θ\theta^{\ast} which factorizes according to (2)(2). If the graph is a tree, such reparametrization is easely obtained by marginal inference via message passing updates in linear time. However, the major challenge is to define proper reparametrizations updates on graphs with cycles. This is due to the following reasons

  • (i)(i) The geometry of the underlying parameter space Θ\Theta directly depends on the complex shape of the underlying marginal polytop, i.e. in the presence of cycles this is a highly contrained set.
  • (ii)(ii) Each reparametrization must index the same joint probability distribution, i.e. wee need access the global properties of the underlying distribution.

In turns out, that for a particular class of exponential families, the gloabl shape of distribution in (ii)(ii) is reconstructed from a family of much simpler structures (chains, trees, hypertrees, etc.) which also yields approximations and error bounds to the marginal polytope in (i)(i)

Inference Problems on Tree-Graphs and Bethe Approximation

We first consider a generall parametrization of exponential family (3)

pθ(x)exp{sVθs(xs)+(s,t)Eθst(xs,xt)},p\theta(x)\propto\exp\big\{\sum_{s\in V}\theta_{s}(x_{s})+\sum_{(s,t)\in E}\theta_{s t}(x_{s},x_{t})\big\},

which arises from (3) for the particular choice of sufficient statistics given by overcomplete representation

θs(xs):=iθs;j1s;j(xs),θst(xs,xt):=(j,k)θst;jk1st;jk(xs,xt)\theta_{s}(x_{s}):=\sum_{i}\theta_{s;j}\mathbb{1}_{s;j}(x_{s}), \qquad \theta_{s t}(x_{s},x_{t}):=\sum_{(j,k)}\theta_{s t;j k}\mathbb{1}_{s t;j k}(x_{s},x_{t})

which is also commonly denoted as nonidentifable due to the existance of affine subpace AθA^{\theta} such that θAθ\theta \in A^{\theta} indexis the same distribution.

and introduce the set of all realizable marginals

M(G):={μRd  p  with marginals  μs(xs),μst(xs,xt)}(Marginal Polytope).\mathbb{M}(G):=\{\mu\in\mathbb{R}^{d}\mid\exists\;p\;{\mathrm{with~marginals}}\;\mu_{s}(x_{s}),\,\mu_{s t}(x_{s},x_{t})\} \qquad (\text{Marginal Polytope}).

The above set is denoted as the marginal polytop and can equivalently be written as a convex hull over vector corresponding to each configuration xx or equivalently as an intersection over an exponential number of half spaces for each configuration.
If the graph is a tree (no cycles) it can be shown that the marginal polytope admits a much simpler characterization in terms of locally consistent marginal distributions 0ξL(G)0 \leq \xi \in \mathbb{L}(G) which satisfy the pairwise marginalization constraints

xsξs(xs)=1,xtξst(xs,xt)=ξs(xs),xsXs\sum_{x_{s}}\xi_{s}(x_{s})=1, \qquad \sum_{x_{t}^{\prime}}\xi_{s t}(x_{s},x_{t}^{\prime}) = \xi_{s}(x_{s}),\quad\forall\,x_{s}\in\mathcal{X}_{s}

xtξst(xs,xt)=ξs(xs),xsXs,xsξst(xs,xt)=ξt(xt),xtXt\sum_{x_{t}^{\prime}}\xi_{s t}(x_{s},x_{t}^{\prime})\,=\,\xi_{s}(x_{s}),\quad\forall\,x_{s}\in\mathcal{X}_{s}, \qquad \sum_{x_{s}^{\prime}}\xi_{s t}(x_{s}^{\prime},x_{t})\,=\,\xi_{t}(x_{t}),\quad\forall\,x_{t}\in X_{t}

An important observation is that if the graph is a tree than the two sets agree M(G)=L(G)\mathbb{M}(G) = \mathbb{L}(G) and in the presence of cycles M(G)L(G)\mathbb{M}(G) \subset \mathbb{L}(G), i.e. it always serves as an upperbound to the complex marginal polytope.

In order to see the difference between the two sets let us consider a classical example of a one cycle graph G=({1,2,3},{(12),(23),(31)})G = (\{1,2,3\},\{(12),(23),(31)\}). For each parameter ξst\xi_{st} we consider variational marginals

ξs(xs):= [0.5   0.5]ξst(xs,xt):=[τst0.5ξst0.5+βstτst],\xi_{s}(x_{s}):=\ [0.5\ \ \ 0.5] \qquad \xi_{s t}(x_{s},x_{t}):=\left[\begin{array}{c c}{{\tau_{s t}}}&{{0.5-\xi_{s t}}}\\ {{0.5+\beta_{s t}}}&{{\tau_{s t}}}\end{array}\right],

which satisfy marginalization conditions and therefore ξL(G)\xi \in \mathbb{L}(G). In order to seek for the corresponding joint distrubution p(x1,x2,x3)p(x_1,x_2,x_3) on graph GG with such marginals we check all the marginalization constraints while keeping the node marginals ξ1=ξ2=ξ3=12\xi_1 = \xi_2= \xi_3 = \frac{1}{2} fixed.

ξs(xs)=xt,xuξstu(xs,xt,xu)ξst(xs,xt)=xuξstu(xs,xt,xu)\xi_{s}(x_{s})\,=\,\sum_{x_{t}^{\prime},x_{u}^{\prime}}\xi_{s t u}(x_{s},x_{t}^{\prime},x_{u}^{\prime})\qquad \xi_{s t}(x_{s},x_{t})\,=\,\sum_{x_{u}^{\prime}}\xi_{s t u}(x_{s},x_{t},x_{u}^{\prime})

which after small algebraic calculation can be reduced to the following set of inequalities

ξstu0ξstuξs+ξst+ξt+ξsuξstu1ξs+τt+ξst+ξst+ξsu+ξtuξstuξst,ξsu,ξtu.\begin{array}{l}{{\displaystyle \xi_{s t u}\ge0}}\\ {{\displaystyle \xi_{s t u}\ge-\xi_{s}+\xi_{s t}+\xi_{t}+\xi_{s u}}}\\ {{\displaystyle \xi_{s t u}\le1-\xi_{s}+\tau_{t}+\xi_{s t}+\xi_{s t}+\xi_{s u}+\xi_{t u}}}\\ {{\displaystyle\xi_{s t u}\le\xi_{s t},\xi_{s u},\xi_{t u}.}}\end{array}

Returning to our example the above set of inequalities boilds down to

ξ12+ξ23ξ1312,ξ12ξ23+ξ1312,ξ12+ξ23+ξ1312,and ξ12+ξ23+ξ1312.\begin{array}{l l}{{\xi_{12}+\xi_{23}-\xi_{13}\leq\frac{1}{2},}}&{{\xi_{12}-\xi_{23}+\xi_{13}\leq\frac{1}{2},}}\\ {{-\xi_{12}+\xi_{23}+\xi_{13}\leq\frac{1}{2},}}&{{\mathrm{and~}\quad \xi_{12}+\xi_{23}+\xi_{13}\leq\frac{1}{2}.}}\end{array}

which illustrates how the marginal polytope is contained within the 3D cube [0,12]3[0,\frac{1}{2}]^3. Now after plugging in the expression for the local polytope it is easely seen that

τ12+τ23τ13=0.4+0.40.1>12.\tau_{12}+\tau_{23}-\tau_{13}\,=\,0.4+0.4-0.1\,\gt \,{\frac{1}{2}}.

which shows that for τst=0.4\tau_{st} = 0.4 there exists no probability distribution on G\mathcal{G} with the marginal given by (add number ).

Bethe Entropy Approximation

Not only the marginal polytope on tree-structured graphs posesses a particular simple structure but also the Bethe entropy admits a closed form expression

H(pμ)=sVHs(μs)(s,t)ETIst(μst). H(p_{\mu}) = \sum_{s\in V}H_{s}(\mu_{s})-\sum_{(s,t)\in E^T}I_{s t}(\mu_{s t}).

with

Hs(ξs):=xsXsξs(xs)logξs(xs),Ist(ξst):=(xs,xt)Xt×Xsξst(xs,xt)logξst(xs,xt)ξs(xs)ξt(xt)H_{s}(\xi_{s}):=-\sum_{x_{s}\in{\mathcal{X}}_{s}}\xi_{s}(x_{s})\log\xi_{s}(x_{s}), \qquad I_{s t}(\xi_{s t}):=\sum_{(x_{s},x_{t})\in X_t\times X_s}\xi_{s t}(x_{s},x_{t})\log\frac{\xi_{s t}(x_{s},x_{t})}{\xi_{s}(x_{s})\xi t(x_{t})} Replacing the edge set ETE^T with arbitrary edge sets correspodning to graphs with cycles yields the Bethe entropy approximation

The corresponding approximation to he log partition function is then given (add link) by the Bethe variational problem

maxξL(G) ⁣{θ,ξ>+sVHs(ξs)(s,t)EIst(ξst)}.\begin{array}{c}{{\mathrm{max}}}_{\xi \in L(G)}\end{array}\!\biggr\{\bigl\langle\theta,\xi\bigr\gt +\sum_{s\in V}H_{s}(\xi_{s})-\sum_{(s,t)\in E}I_{s t}(\xi_{s t})\Bigr\}.

The Auto-Encoder-Decoder The Auto-Encoder-Decoder

Bethe Loop Series Expansion Formula

V(E~):={tV(t,u)E~for  some  u}.V(\widetilde{E}):=\left\{t\in V\mid(t,u)\in\widetilde{E}\,\mathrm{for\;some\;}u\right\}.

degree of each vertex

ds(E~):={tV(s,t)E~}.d_{s}(\widetilde E):=\,\{t\in V\,|\,\left(s,t\right)\in\widetilde E\}.

A generalized loop on GG is any subgraph G(E~)GG(\tilde{E}) \subseteq G such for all sG(E~)s \in G(\tilde{E}) it holds ds(E~)1d_s(\tilde{E})\neq 1

The Auto-Encoder-Decoder

Given some stationary points of pseudomarginals after runnign message passing algorithm which minize the Bethe variational problem we obtain the following error expression w

A(θ)=ABethe(θ)+log{1+EE~βE~sVEτs[(Xsτs)ds(E~)]}.A(\theta)=A_{\mathrm{Bethe}}(\theta)+\log \Big\{ 1+\sum_{E \subseteq \tilde{E}}\beta_{\tilde{E}} \prod_{s\in V} \mathbb{E}_{\tau_{s}}[(X_{s}-\tau_{s})^{d_{s}(\tilde{E})}]\Big\}.

where

βst:=τstτsτtτs(1τs)τt(1τt),βE~:=(s,t)E~βst.\beta_{s t}:=\,\frac{\tau_{s t}-\tau_{s}\tau_{t}}{\tau_{s}({\bf1}-\tau_{s})\tau_{t}({\bf1}-\tau_{t})}, \qquad \beta_{\tilde{E}}:=\prod_{(s,t)\in\tilde{E}}\beta_{s t}.

denotes the edge weights associated to set of pseudomarginals ξ\xi. In view of the above error bound we can make the following observations

  • The error in the Bethe approximation is influenced by generalized loops in the graph, particularly those subgraphs G(E~)G(\tilde{E}) where every vertex has a degree greater than one. The contribution of each loop to the error is weighted by βE~\beta_{\tilde{E}} which depends on the pseudomarginal estimates.
  • For tree-structured graphs (where loops are absent), the error term vanishes, and the Bethe approximation becomes exact.

This result provides a systematic understanding of the accuracy of the Bethe entropy approximation and its dependence on the underlying graph structure.

Reparametrization Updates

Upper Bound on the Partition Function

Error Bounds