Slide 1: Genome evolution (Amos Tanay, 2010) – Lecture 9: Variational inference and belief propagation
Slide 2: Expectation-Maximization
Slide 3: Log-likelihood to free energy

We have so far worked on computing the likelihood. Computing the likelihood is hard; we can reformulate the problem by adding parameters and transforming it into an optimization problem. Given a trial function q over the hidden variables h, define the free energy of the model as:

F(q, θ) = Σ_h q(h) log p(h, s | θ) − Σ_h q(h) log q(h)

Better: when q is a distribution, the free energy bounds the likelihood from below:

log p(s | θ) = F(q, θ) + D(q || p(h | s, θ)) ≥ F(q, θ)

The free energy is exactly the log-likelihood when q is the posterior, since then D(q || p(h | s)) = 0.
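To make the bound concrete, here is a minimal numeric sketch (not from the slides; the model and all numbers are made up) showing that F(q) ≤ log p(s) for an arbitrary q, with equality when q is the posterior:

```python
import numpy as np

# Toy model: one hidden variable h in {0,1} and one observed symbol s.
p_h = np.array([0.7, 0.3])          # hypothetical prior p(h)
p_s_given_h = np.array([0.2, 0.9])  # hypothetical p(s = observed | h)

p_joint = p_h * p_s_given_h         # p(h, s) for the observed s
log_ps = np.log(p_joint.sum())      # exact log-likelihood log p(s)

def free_energy(q):
    # F(q) = E_q[log p(h, s)] + H(q), a lower bound on log p(s)
    return np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))

posterior = p_joint / p_joint.sum()
print(free_energy(np.array([0.5, 0.5])), "<", log_ps)  # strict bound
print(free_energy(posterior), "=", log_ps)             # tight at the posterior
```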
Slide 4: Energy?? What energy?

In statistical mechanics, a system at temperature T with states x and an energy function E(x) is characterized by Boltzmann's law:

P(x) = e^{−E(x)/kT} / Z, where Z is the partition function Z = Σ_x e^{−E(x)/kT}

Given a model p(h, s | θ) (a BN), we can define the energy using Boltzmann's law. If we think of P(h | s, θ):

E(h) = −log p(h, s | θ), so that P(h | s, θ) = e^{−E(h)} / Z with Z = p(s | θ)
Slide 5: Free energy and variational free energy

The Helmholtz free energy is defined in physics as F = −log Z (at unit temperature). This free energy is important in statistical mechanics, but it is difficult to compute, as is our probabilistic Z (= p(s)).

The variational transformation introduces trial functions q(h) and sets the variational free energy (or Gibbs free energy) to:

F(q) = U(q) − H(q)

where the average energy is U(q) = Σ_h q(h) E(h) = −Σ_h q(h) log p(h, s), and the variational entropy is H(q) = −Σ_h q(h) log q(h). And as before:

F(q) = −log Z + D(q || p(h | s)) ≥ −log Z

(Note the flipped sign convention: minimizing this variational free energy is the same as maximizing the likelihood lower bound from slide 3.)
Slide 6: Solving the variational optimization problem

So instead of computing p(s), we can search for a q that optimizes the free energy. This is still as hard as before, but we can simplify the problem by restricting q (this is where the additional degrees of freedom become important).

The two terms pull in opposite directions: the energy term favors concentrating q on the maximal (most probable) configurations, while the entropy term favors spreading out the distribution.
Slide 7: Simplest variational approximation: mean field

Let's assume complete independence among the posteriors of the random variables:

q(h) = Π_i q_i(h_i)

Under this assumption we can try optimizing the q_i – looking for the minimal free energy, again trading off concentration on probable configurations (energy) against spreading out the distribution (entropy).
Slide 8: Mean field inference

We optimize iteratively:
- Select i (sequentially, or using any other schedule)
- Optimize q_i to minimize F_MF(q_1, .., q_i, …, q_n) while fixing all other q's
- Terminate when F_MF cannot be improved further

Remember: F_MF always bounds the likelihood, and the q_i optimization can usually be done efficiently.
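As a concrete illustration, here is a minimal mean-field sweep on a toy pairwise model (a hedged sketch: the model, its potentials, and all numbers are made up, and this is not the tree model of the next slides). Each coordinate update sets log q_i to the expected log-potential terms that involve variable i:

```python
import numpy as np

# Toy pairwise model: p(x) ∝ exp(Σ_i θ_i[x_i] + Σ_(i,j) θ_ij[x_i, x_j]), x_i in {0,1}
rng = np.random.default_rng(0)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_i = rng.normal(size=(n, 2))
theta_ij = {e: rng.normal(size=(2, 2)) for e in edges}

q = np.full((n, 2), 0.5)  # fully factorized trial distribution q(h) = Π_i q_i
for sweep in range(50):
    for i in range(n):
        # log q_i(x_i) = θ_i(x_i) + Σ_neighbors E_{q_j}[θ_ij(x_i, x_j)] + const
        logq = theta_i[i].copy()
        for (a, b), th in theta_ij.items():
            if a == i:
                logq += th @ q[b]      # Σ_xj θ_ij[x_i, x_j] q_j(x_j)
            elif b == i:
                logq += th.T @ q[a]
        logq -= logq.max()             # numerical stability before exponentiating
        q[i] = np.exp(logq) / np.exp(logq).sum()

print(q)  # approximate posterior marginals q_i(x_i)
```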
Slide 9: Mean field for a simple-tree model

Just for illustration, since we know how to solve this one exactly: we select a node i and optimize its q_i while making sure it remains a distribution. To ease notation, assume the left (l) and right (r) children of i are hidden. The energy decomposes, and only a few terms are affected – those involving h_i:

Σ_{h_pai} q_pai(h_pai) log p(h_i | h_pai) + Σ_{h_l} q_l(h_l) log p(h_l | h_i) + Σ_{h_r} q_r(h_r) log p(h_r | h_i)
Slide 10: Mean field for a simple-tree model (cont.)

Optimizing q_i under the normalization constraint sets log q_i(h_i) equal to the sum of the affected terms above, up to a normalizing constant.
Slide 11: Mean field for a phylo-HMM model

Now we don't know how to solve this exactly, but MF is still simple.

[Figure: a fragment of the phylo-HMM model around a hidden variable h_i^j (species i, position j), with its parent variables h_pai^{j−1}, h_pai^j, its children h_l^j, h_r^j, and the chain neighbors at positions j−1 and j+1.]
Slide 12: Mean field for a phylo-HMM model (cont.)

As before, the optimal solution is derived by making log q_i^j equal to the sum of the affected energy terms: those coupling h_i^j to its parent, its children, and its chain neighbors at positions j−1 and j+1.
Slide 13: Simple mean field is usually not a good idea

Why? Because the MF trial function is very crude. For example, we said before that the joint posteriors cannot be approximated by an independent product of the hidden variables' posteriors.

[Figure: a small tree with observed leaves A and C whose common ancestor is A or C with equal probability – the posterior is bimodal and the hidden variables are strongly coupled, which no product of independent marginals can represent.]
Slide 14: Exploiting additional structure

We can greatly improve accuracy by generalizing the mean field algorithm using larger building blocks. The approximation specifies an independent distribution q_j for each locus j, but maintains the tree dependencies within each q_j. We now optimize each tree q_j separately, given the current potentials of the other trees. The key point is that optimizing any given tree is efficient: we just use a modified up-down algorithm.
Slide 15: Tree-based variational inference

Each tree is only affected by the tree before it and the tree after it (positions j−1 and j+1).
Slide 16: Tree-based variational inference (cont.)

We obtain the same functional form as we had for the simple tree, so we can use the up-down algorithm to optimize q_j.
Slide 17: Chain-cluster variational inference

We can use any partition of a BN into trees and derive a similar MF algorithm. For example, instead of trees we can use the Markov chain of each species. Which will work better for us? It depends on the strength of the dependencies along each dimension – we should try to capture as much "dependency" as possible inside the clusters.
Slide 18: Simple tree: inference as message passing

[Figure: a tree with observed data s at the leaves. Messages are passed along the edges: a message entering a node summarizes "P(H | the data in my subtree)", and after combining all incoming messages a node knows "P(H | all data)".]
Slide 19: Factor graphs

Defining the joint probability for a set of random variables, given:
1) any set of node subsets (a hypergraph)
2) functions on the node subsets (potentials)

Joint distribution: p(x) = (1/Z) Π_a f_a(x_a)
Partition function: Z = Σ_x Π_a f_a(x_a)

If the potentials are conditional probabilities, what will Z be? Not necessarily 1! (Can you think of an example?) Things are difficult when there are several modes.

[Figure: a bipartite factor graph with factor nodes and r.v. nodes.]
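To ground these definitions, here is a small brute-force sketch (toy tables, not from the lecture) that computes Z by enumeration. It shows Z = 1 when the potentials are the CPDs of a BN, and Z ≠ 1 for conditional probabilities that do not compose into a BN:

```python
import itertools
import numpy as np

def partition(scopes, tables, n):
    # Z = Σ_x Π_a f_a(x_a), by brute-force enumeration over binary variables
    Z = 0.0
    for x in itertools.product(range(2), repeat=n):
        Z += np.prod([t[tuple(x[i] for i in sc)] for sc, t in zip(scopes, tables)])
    return Z

# Case 1: potentials are the CPDs of a BN (x0 -> x1): Z = 1.
px0 = np.array([0.6, 0.4])
px1_x0 = np.array([[0.9, 0.1], [0.3, 0.7]])   # rows: x0, cols: x1
print(partition([(0,), (0, 1)], [px0, px1_x0], 2))        # 1.0

# Case 2: conditional probabilities that do not form a BN
# (p(x1|x0) together with p(x0|x1)): Z is not 1.
px0_x1 = np.array([[0.8, 0.5], [0.2, 0.5]])   # columns sum to 1: p(x0|x1)
print(partition([(0, 1), (0, 1)], [px1_x0, px0_x1], 2))   # 1.18, not 1
```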
Slide 20: Converting directed models to factor graphs (loops!)

[Figure: the DBN and PhyloHMM models, with variables h_i^j and h_pai^j over positions j−1, j, j+1, redrawn as factor graphs; the conversion introduces loops. With the original conditional probabilities as factors the construction is well defined and Z = 1; with general potentials Z ≠ 1.]
Slide 21: More definitions

The model: p(x) = (1/Z) Π_a f_a(x_a)

Potentials can be defined on discrete, real-valued, etc. domains. It is also common to define general log-linear models directly:

p(x) = (1/Z) exp(Σ_a θ_a f_a(x_a))

Inference: compute posteriors/marginals such as p(x_i | s) (and the partition function Z).
Learning: find the factors' parameterization θ.
Slide 22: Inference in factor graphs: algorithms

Directed models are sometimes more natural and easier to understand; their popularity stems from their original role of expressing knowledge in AI. They are not very natural for modeling physical phenomena, except for time-dependent processes. Undirected models are analogous to well-developed models in statistical physics (e.g., spin glass models), so we can borrow computational ideas from physicists (these people are big on approximations). The models are also convex (in their log-linear parameterization), which gives them important algorithmic properties.

Which algorithms carry over to factor graphs?
- Dynamic programming: no in general (also not in loopy BNs!)
- Forward sampling (likelihood weighting): no (there is no topological order to sample along)
- Metropolis/Gibbs: yes
- Mean field: yes
- Structural variational inference: yes
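Since Gibbs sampling is one of the methods that does carry over directly, here is a minimal sketch on a toy pairwise model (hypothetical weights and sizes, not code from the course). Gibbs only needs each variable's conditional up to a constant, which the factor graph provides locally:

```python
import numpy as np

# Toy binary pairwise model: p(x) ∝ exp(Σ_(i,j) w_ij · s_i s_j), s_i = 2 x_i − 1
rng = np.random.default_rng(1)
n = 5
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
w = {e: rng.normal() for e in edges}  # made-up coupling weights

x = rng.integers(0, 2, size=n)
samples = []
for sweep in range(3000):
    for i in range(n):
        # log-odds of x_i = 1 vs x_i = 0 given the rest: 2 Σ_j w_ij s_j
        logit = 2 * sum(w[(a, b)] * (2 * x[b if a == i else a] - 1)
                        for (a, b) in edges if i in (a, b))
        x[i] = int(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    if sweep >= 1000:                 # discard burn-in sweeps
        samples.append(x.copy())

print(np.mean(samples, axis=0))  # Monte Carlo estimates of the marginals p(x_i = 1)
```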
Slide 23: Belief propagation in a factor graph

Remember, a factor graph is defined by a set of random variables (indices i, j, k, ..) and a set of factors on groups of variables (indices a, b, ..). x_a refers to an assignment of values to the inputs of factor a, and Z is the partition function (which is hard to compute).

The BP algorithm is constructed by computing and updating messages:
- Messages from factors to variables, m_{a→i}: a function from any value attainable by x_i to the reals
- Messages from variables to factors, m_{i→a}

Think of messages as transmitting beliefs:
- a→i: "given my other input variables, and ignoring your message, you are x"
- i→a: "given my other input factors and my potential, and ignoring your message, you are x"
Slide 24: Message update rules

Messages from variables to factors:

m_{i→a}(x_i) = Π_{b ∈ N(i)\a} m_{b→i}(x_i)

Messages from factors to variables:

m_{a→i}(x_i) = Σ_{x_a \ x_i} f_a(x_a) Π_{j ∈ N(a)\i} m_{j→a}(x_j)
Slide 25: The BP algorithm

Define the beliefs as approximating the single-variable posteriors p(h_i | s):

b_i(x_i) ∝ Π_{a ∈ N(i)} m_{a→i}(x_i)

Algorithm:
- Initialize all messages to uniform
- Iterate until no message changes:
  - Update the factor-to-variable messages
  - Update the variable-to-factor messages

Why is this different from the mean field algorithm?
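Here is a compact sketch of the algorithm on the small chain factor graph used earlier (illustrative tables; the variable and factor layout is ours). Since that graph is a tree, BP converges and its beliefs are the exact marginals, foreshadowing slide 27:

```python
import itertools
import numpy as np

# Chain factor graph: f0(x0) f1(x0,x1) f2(x1,x2), binary variables (toy tables)
scopes = [(0,), (0, 1), (1, 2)]
tables = [np.array([0.6, 0.4]),
          np.array([[0.9, 0.1], [0.3, 0.7]]),
          np.array([[0.8, 0.2], [0.5, 0.5]])]
n_vars, K = 3, 2

# Initialize all messages to uniform
msg_fv = {(a, i): np.ones(K) / K for a, sc in enumerate(scopes) for i in sc}
msg_vf = {(i, a): np.ones(K) / K for a, sc in enumerate(scopes) for i in sc}

for _ in range(30):
    # variable -> factor: product of the messages from all *other* factors
    for (i, a) in msg_vf:
        m = np.ones(K)
        for (b, j), mb in msg_fv.items():
            if j == i and b != a:
                m *= mb
        msg_vf[(i, a)] = m / m.sum()
    # factor -> variable: marginalize the factor times the other incoming messages
    for (a, i) in msg_fv:
        sc = scopes[a]
        m = np.zeros(K)
        for xs in itertools.product(range(K), repeat=len(sc)):
            w = tables[a][xs]
            for pos, j in enumerate(sc):
                if j != i:
                    w *= msg_vf[(j, a)][xs[pos]]
            m[xs[sc.index(i)]] += w
        msg_fv[(a, i)] = m / m.sum()

# Beliefs b_i(x_i) ∝ Π_{a in N(i)} m_{a→i}(x_i)
for i in range(n_vars):
    b = np.ones(K)
    for (a, j), ma in msg_fv.items():
        if j == i:
            b *= ma
    print(i, b / b.sum())  # exact marginals here, since the graph is a tree
```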
Slide 26: Beliefs on factor inputs

We also define beliefs on whole factor inputs:

b_a(x_a) ∝ f_a(x_a) Π_{i ∈ N(a)} m_{i→a}(x_i)

This is far from mean field, since, for example, b_a retains the coupling between the variables of a factor. The update rules can be viewed as derived from constraints on the beliefs:
1. Normalization requirement on the variable beliefs b_i
2. Normalization requirement on the factor beliefs b_a
3. Marginalization requirement: Σ_{x_a \ x_i} b_a(x_a) = b_i(x_i)
Slide 27: BP on a tree = up-down

[Figure: a small tree with hidden variables h_1, h_2, h_3, observed leaves s_1..s_4, and factors a–e; running BP on this tree reproduces exactly the up-down message schedule.]
Slide 28: Loopy BP is not guaranteed to converge

[Example: two variables X and Y coupled by symmetric factors favoring the configurations 11 and 00; the messages can oscillate between the two modes instead of converging.]

This is not a hypothetical scenario – it frequently happens when there is too much symmetry. For example, most mutational effects are double-stranded and therefore symmetric, which can result in problematic loops.
Slide 29: The Bethe free energy

LBP was introduced in several domains (BNs, coding) and is considered very practical in many cases. But unlike the variational approaches we studied before, it was not clear how it approximates the likelihood/partition function, even when it converges. In the early 2000s, Yedidia, Freeman and Weiss discovered a connection between the LBP algorithm and the Bethe free energy, developed by Hans Bethe to approximate the free energy in crystal field theory back in the 40's/50's:

F_Bethe = Σ_a Σ_{x_a} b_a(x_a) [log b_a(x_a) − log f_a(x_a)] − Σ_i (d_i − 1) Σ_{x_i} b_i(x_i) log b_i(x_i)

where d_i is the number of factors involving variable i. Compare to the variational free energy F = U − H, which uses one joint trial distribution q instead of local beliefs.

Theorem: beliefs are LBP fixed points if and only if they are locally optimal for the Bethe free energy.
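To make the definition concrete, here is a self-contained sketch (the same toy chain as above, our numbers) that evaluates F_Bethe with the exact marginals as beliefs and checks that on a tree it equals −log Z:

```python
import itertools
import numpy as np

# Toy chain x0 - x1 - x2 with factors f0(x0), f1(x0,x1), f2(x1,x2)
tables = [np.array([0.6, 0.4]),
          np.array([[0.9, 0.1], [0.3, 0.7]]),
          np.array([[0.8, 0.2], [0.5, 0.5]])]

# Exact joint p(x) = Π_a f_a(x_a) / Z, by enumeration
joint = np.zeros((2, 2, 2))
for x in itertools.product(range(2), repeat=3):
    joint[x] = tables[0][x[0]] * tables[1][x[0], x[1]] * tables[2][x[1], x[2]]
Z = joint.sum()
joint /= Z

# Exact beliefs: variable marginals b_i and factor marginals b_a
b_var = [joint.sum(axis=tuple(k for k in range(3) if k != i)) for i in range(3)]
b_fac = [b_var[0], joint.sum(axis=2), joint.sum(axis=0)]
deg = [2, 2, 1]  # d_i = number of factors involving variable i

F = 0.0
for a in range(3):  # Σ_a Σ_xa b_a (log b_a − log f_a)
    F += np.sum(b_fac[a] * (np.log(b_fac[a]) - np.log(tables[a])))
for i in range(3):  # − Σ_i (d_i − 1) Σ_xi b_i log b_i
    F -= (deg[i] - 1) * np.sum(b_var[i] * np.log(b_var[i]))

print(F, -np.log(Z))  # equal up to float error: Bethe is exact on trees
```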
Slide 30: Generalization: region-based free energy

Start with a factor graph (X, A). Introduce regions R = (X_R, A_R) – subsets of variables and of factors, such that every factor in A_R involves only variables in X_R – and multipliers c_R. We require that every factor and every variable is counted exactly once:

Σ_{R: a ∈ A_R} c_R = 1 for every factor a, and Σ_{R: i ∈ X_R} c_R = 1 for every variable i

We will work with valid region graphs. Define:

Region average energy: U_R(b_R) = −Σ_{x_R} b_R(x_R) Σ_{a ∈ A_R} log f_a(x_a)
Region entropy: H_R(b_R) = −Σ_{x_R} b_R(x_R) log b_R(x_R)
Region free energy: F_R = U_R − H_R
Region-based average energy: U = Σ_R c_R U_R
Region-based entropy: H = Σ_R c_R H_R
Region-based free energy: F = Σ_R c_R F_R
Slide 31: Bethe regions

The Bethe regions are the factor regions – each factor a together with its neighbor variable set N(a), with c_R = 1 – plus the single-variable regions. We compensate for the multiple counting of variables using the multiplicity constants c_i = 1 − d_i. We can add larger regions, as long as we update the multipliers to keep the region graph valid.

[Figure: factors a, b, c with the Bethe region R_a and larger regions R_ac, R_bc.]
Slide 32: Multipliers compensate on average, not on entropy

Claim: For valid regions, if the regions' beliefs are exact (b_R(x_R) = p(x_R)), then the region-based average energy is exact. We cannot guarantee as much for the region-based entropy.

Claim: the region-based entropy is exact when the model is the uniform distribution. Proof: exercise. This means that the entropy counts the correct number of degrees of freedom – e.g., for N binary variables, H = N log 2.

Definition: a region-based free energy approximation is said to be max-ent normal if its region-based entropy is maximized when the beliefs are uniform. A non-max-ent approximation can minimize the region free energy by erroneously selecting high-entropy beliefs!
Slide 33: Bethe's regions are max-ent normal

Claim: the Bethe regions give a max-ent normal approximation (i.e., the region-based entropy is maximized on the uniform distribution).

Proof sketch: rewrite the Bethe entropy as single-variable entropies minus factor-wise information terms,

H_Bethe = Σ_i H_i(b_i) − Σ_a I_a(b_a), with I_a = Σ_{i ∈ N(a)} H_i − H_a

The entropy terms are maximal on the uniform distribution, and the information terms are nonnegative and 0 on the uniform distribution.
Slide 34: Example: a non-max-ent approximation

Start with a complete graph over six binary variables with pairwise factors. Add all variable triplets, pairs, and singletons as regions, and generate multipliers (these guarantee consistency):
- triplets: c_R = 1 (20 overall)
- pairs: c_R = −3 (15 overall)
- singletons: c_R = 6 (6 overall)

Now look at the consistent beliefs that put probability ½ on the all-zeros assignment and ½ on the all-ones assignment. The entropy of every region (of any size) is ln 2, so the total region-based entropy is (20·1 − 3·15 + 6·6) ln 2 = 11 ln 2. But we claimed before that the region-based entropy of the uniform distribution is exact: 6 ln 2. So these non-uniform beliefs get a higher region-based entropy than the uniform distribution – the approximation is not max-ent normal.
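A quick arithmetic check of this example, directly computing the counting numbers stated on the slide:

```python
from math import comb, log

n = 6                                 # six binary variables, complete graph
c_tri, c_pair, c_single = 1, -3, 6    # multipliers from the slide

# Validity: the multipliers over the regions containing each variable/pair sum to 1
per_var = comb(n - 1, 2) * c_tri + (n - 1) * c_pair + c_single
per_pair = (n - 2) * c_tri + c_pair
print(per_var, per_pair)              # 1 1

# Two-spike beliefs (all-zeros / all-ones): every region entropy is ln 2,
# so the region-based entropy is (Σ_R c_R) · ln 2
total = comb(n, 3) * c_tri + comb(n, 2) * c_pair + n * c_single
print(total * log(2), n * log(2))     # 11·ln2 > 6·ln2, the exact uniform entropy
```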
Slide 35: Inference as minimization of region-based free energy

We want to solve a variational problem: minimize the region-based free energy over the regions' beliefs, while enforcing normalization and marginalization constraints on them.

Unlike the structured variational approximation we discussed before, and although the beliefs are (regionally) compatible, we can have optimal beliefs that do not represent any true global posterior distribution. [Example: three variables A, B, C in a loop with negative feedback; the optimal region beliefs are identical to the factors, and it can be shown that they cannot be the marginals of any joint distribution on the three variables.]
Slide 36: Inference as minimization of region-based free energy (cont.)

Claim: when it converges, LBP finds a minimum of the Bethe free energy.

Proof idea: we have an optimization problem (minimum energy) with constraints (beliefs are consistent and add up to 1). We write down a Lagrangian that expresses both the minimization goal and the constraints, and show that it is minimized when the LBP update rules hold.

Important technical point: we shall assume that at the fixed point all beliefs are nonzero. This can be shown to hold if all factors are "soft" (do not contain zero values for any assignment).
Slide 37: The Bethe Lagrangian

L = F_Bethe
  + Σ_a γ_a (1 − Σ_{x_a} b_a(x_a))                                        [factor-region beliefs are normalized]
  + Σ_i γ_i (1 − Σ_{x_i} b_i(x_i))                                        [variable-region beliefs are normalized]
  + Σ_a Σ_{i ∈ N(a)} Σ_{x_i} λ_{ai}(x_i) [b_i(x_i) − Σ_{x_a \ x_i} b_a(x_a)]   [marginalization]
Slide 38: The Bethe Lagrangian (cont.)

Take the derivatives with respect to each b_a and b_i:

∂L/∂b_a(x_a) = log b_a(x_a) + 1 − log f_a(x_a) − γ_a − Σ_{i ∈ N(a)} λ_{ai}(x_i) = 0
∂L/∂b_i(x_i) = −(d_i − 1)(log b_i(x_i) + 1) − γ_i + Σ_{a ∈ N(i)} λ_{ai}(x_i) = 0
Slide 39: Bethe minima are LBP fixed points

So here are the conditions:

b_a(x_a) ∝ f_a(x_a) exp(Σ_{i ∈ N(a)} λ_{ai}(x_i))
b_i(x_i) ∝ exp( (1/(d_i − 1)) Σ_{a ∈ N(i)} λ_{ai}(x_i) )

And we can solve them if we set:

λ_{ai}(x_i) = log m_{i→a}(x_i) = log Π_{b ∈ N(i)\a} m_{b→i}(x_i)

Giving us:

b_a(x_a) ∝ f_a(x_a) Π_{i ∈ N(a)} m_{i→a}(x_i),  b_i(x_i) ∝ Π_{a ∈ N(i)} m_{a→i}(x_i)

We saw before that these conditions, together with the marginalization constraint, generate the update rules! So "L minimum → LBP fixed point" is proven. The other direction is quite direct – see the exercise. LBP is in fact computing the Lagrange multipliers – a very powerful observation.
Slide 40: Generalizing LBP to region graphs

A region graph is a graph on subsets of nodes of the factor graph: regions (X_R, A_R) with valid multipliers c_R, as defined above. For a region R, denote by P(R) its parents and by D(R) its descendants. The generalized algorithm sends parent-to-child messages m_{P→R} and defines parent-to-child beliefs from them.

[Figure: a region graph with a region R, its parents P(R), and its descendants D(R).]
Slide 41: Generalizing LBP to region graphs (cont.)

Parent-to-child algorithm: the belief of a region R combines the factors inside R with the messages that cross into R's subtree from outside. The update of a message from a parent P to a child R uses two message sets (writing D(R)+R for R together with its descendants):

N(P, R) = { (I, J) : I ∉ D(P)+P, and J ∈ D(P)+P but J ∉ D(R)+R }
D(P, R) = { (I, J) : I ∈ D(P)+P but I ∉ D(R)+R, and J ∈ D(R)+R }
Slide 42: GLBP in practice

- LBP is very attractive for users: really simple to implement, and very fast.
- LBP performance is limited by the size of the factor assignment space X_a, which can grow rapidly with the factors' degrees or the size of large regions.
- GLBP is powerful when large regions can capture significant dependencies that are not captured by individual factors – think of small positive loops or other symmetric effects.
- LBP messages can be computed synchronously (factors → variables → factors …); other scheduling options may boost performance considerably.
- LBP is just one (quite indirect) way by which Bethe energies can be minimized. Other approaches are possible, some of which can be guaranteed to converge.
- The Bethe/region energy minimization can be further constrained to force the beliefs to be realizable. This gives rise to the concept of the Wainwright-Jordan marginal polytope and convex algorithms on it.