Genome evolution: a sequence-centric approach Lecture 5: Undirected models and variational inference
Course outline: Probabilistic models; Inference; Parameter estimation; Genome structure; Mutations; Population; Inferring selection (Probability, Calculus/Matrix theory, some graph theory, some statistics). Simple tree models; HMMs and variants; PhyloHMM, DBN; Context-aware MM; DP; Sampling; EM.
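Log-likelihood to Free Energy: We have so far worked on computing the likelihood: $\log p(s|\theta) = \log \sum_h p(h,s|\theta)$. Computing the likelihood is hard. We can reformulate the problem by adding parameters and transforming it into an optimization problem. Given a trial function $q(h)$, define the free energy of the model as: $F(q) = -\sum_h q(h) \log \frac{p(h,s|\theta)}{q(h)}$ (with the sign chosen so that $F$ is minimized, as in the rest of the lecture). The free energy is exactly the negative log-likelihood when $q$ is the posterior: $F(q) = -\log p(s|\theta) + D(q \,\|\, p(h|s,\theta))$, so $F(q) = -\log p(s|\theta)$ when $q(h) = p(h|s,\theta)$. Better: when $q$ is any distribution, the free energy bounds the likelihood, because $D \ge 0$: $-F(q) \le \log p(s|\theta)$.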
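Energy?? What energy? In statistical mechanics, a system at temperature $T$ with states $x$ and an energy function $E(x)$ is characterized by Boltzmann's law: $p(x) = \frac{1}{Z} e^{-E(x)/T}$. $Z$ is the partition function: $Z = \sum_x e^{-E(x)/T}$. Given a directed model $p(h,s|\theta)$ (a BN), we can define the energy using Boltzmann's law (taking $T = 1$): $E(h) = -\log p(h,s|\theta)$. If we think of $p(h|s,\theta)$: $p(h|s,\theta) = \frac{1}{Z} e^{-E(h)}$, with $Z = \sum_h p(h,s|\theta) = p(s|\theta)$.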
Free Energy and Variational Free Energy: The Helmholtz free energy is defined in physics as: $F = -\log Z$. This free energy is important in statistical mechanics, but it is difficult to compute, just like our probabilistic $Z$ ($= p(s)$). The variational transformation introduces trial functions $q(h)$ and sets the variational free energy (or Gibbs free energy) to: $F(q) = U(q) - H(q)$. The average energy is: $U(q) = \sum_h q(h) E(h)$. The variational entropy is: $H(q) = -\sum_h q(h) \log q(h)$. And as before: $F(q) = F + D(q \,\|\, p(h|s))$.
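The identity on this slide can be checked numerically on a tiny model. The sketch below is only an illustration under stated assumptions: the four joint values `p_hs` are made up, the temperature is set to 1, and the helper names (`free_energy`, `kl`) are hypothetical rather than part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 4 joint hidden states h, with made-up values
# of p(h, s) for one fixed observation s.
p_hs = np.array([0.10, 0.02, 0.05, 0.03])
energy = -np.log(p_hs)            # E(h) = -log p(h, s), assuming T = 1
Z = p_hs.sum()                    # partition function, equals p(s)
posterior = p_hs / Z              # p(h | s)

def free_energy(q):
    """Variational free energy F(q) = U(q) - H(q)."""
    U = np.sum(q * energy)        # average energy under q
    H = -np.sum(q * np.log(q))    # entropy of q
    return U - H

def kl(q, p):
    """Kullback-Leibler divergence D(q || p)."""
    return np.sum(q * np.log(q / p))

q = rng.dirichlet(np.ones(4))     # an arbitrary trial distribution
helmholtz = -np.log(Z)            # F = -log Z
gap = free_energy(q) - helmholtz  # should equal D(q || p(h|s))
```

Under these assumptions `gap` matches `kl(q, posterior)` and is never negative, and `free_energy(posterior)` equals `helmholtz`, which is the sense in which the variational free energy bounds the likelihood.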
Solving the variational optimization problem: So instead of computing $p(s)$, we can search for the $q$ that optimizes the free energy. This is still as hard as before, but we can simplify the problem by restricting $q$ (this is where the additional degrees of freedom become important). Two terms compete in $F(q) = U(q) - H(q)$: lowering the average energy $U$ focuses $q$ on the maximum-probability configurations, while raising the entropy $H$ spreads out the distribution.
Simplest variational approximation: Mean Field. Let's assume complete independence among the r.v.'s posteriors: $q(h) = \prod_i q_i(h_i)$. Under this assumption we can try optimizing the $q_i$ (looking for minimal energy!): $F_{MF}(q_1,\ldots,q_n) = -\sum_h \prod_i q_i(h_i) \log p(h,s|\theta) + \sum_i \sum_{h_i} q_i(h_i) \log q_i(h_i)$.
Mean Field Inference: We optimize iteratively: 1) Select $i$ (sequentially, or using any other method). 2) Optimize $q_i$ to minimize $F_{MF}(q_1,\ldots,q_i,\ldots,q_n)$ while fixing all other $q$s. 3) Terminate when $F_{MF}$ cannot be improved further. Remember: $F_{MF}$ always bounds the likelihood ($-F_{MF} \le \log p(s)$). The $q_i$ optimization can usually be done efficiently.
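The iteration above can be sketched for a model small enough to store as a full table. This is a minimal illustration, not the lecture's implementation: it assumes the joint $\log p(h,s)$ for a fixed observation is given as an array `log_p` with one axis per hidden variable, runs a fixed number of sweeps instead of testing for convergence, and all function names are hypothetical.

```python
import numpy as np

def outer(qs, skip=None):
    """Table prod_j q_j(h_j); the factor for axis `skip` is left out."""
    n = len(qs)
    table = np.ones([len(q) for q in qs])
    for j, q in enumerate(qs):
        if j != skip:
            shape = [1] * n
            shape[j] = len(q)
            table = table * q.reshape(shape)
    return table

def free_energy(log_p, qs):
    """F_MF = sum_h q(h) [log q(h) - log p(h, s)] with q = prod_i q_i."""
    q = outer(qs)
    return float(np.sum(q * (np.log(q) - log_p)))

def mean_field(log_p, n_sweeps=30, seed=0):
    rng = np.random.default_rng(seed)
    n = log_p.ndim
    qs = [rng.dirichlet(np.ones(k)) for k in log_p.shape]
    history = [free_energy(log_p, qs)]
    for _ in range(n_sweeps):
        for i in range(n):                      # select i sequentially
            others = tuple(j for j in range(n) if j != i)
            # Expected log p(h, s) under all other q_j, as a function of h_i.
            expected = (outer(qs, skip=i) * log_p).sum(axis=others)
            w = np.exp(expected - expected.max())
            qs[i] = w / w.sum()                 # optimal q_i given the rest
            history.append(free_energy(log_p, qs))
    return qs, history

# Made-up joint over three binary hidden variables.
rng = np.random.default_rng(1)
log_p = np.log(0.1 * rng.random((2, 2, 2)) + 0.01)
qs, history = mean_field(log_p)
neg_log_Z = -np.log(np.exp(log_p).sum())        # -log p(s)
```

Each single-variable update is the exact minimizer of $F_{MF}$ with the other factors fixed, so `history` never increases and stays above `neg_log_Z`.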
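Mean field for a simple-tree model: Just for illustration, since we know how to solve this one exactly. We select a node $i$ and optimize its $q_i$ while making sure it is a distribution (a Lagrange multiplier $\lambda$ enforces $\sum_{h_i} q_i(h_i) = 1$): $\frac{\partial}{\partial q_i(h_i)} \left[ F_{MF} + \lambda \left( \sum_{h_i} q_i(h_i) - 1 \right) \right] = 0$. To ease notation, assume the left ($l$) and right ($r$) children are hidden. The energy decomposes, and only a few terms are affected: $\log q_i(h_i) = \sum_{h_{pa_i}} q_{pa_i}(h_{pa_i}) \log p(h_i|h_{pa_i}) + \sum_{h_l} q_l(h_l) \log p(h_l|h_i) + \sum_{h_r} q_r(h_r) \log p(h_r|h_i) + \text{const}$.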
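A single node update of this kind might look as follows. This is a sketch under assumptions that are not in the slides: a four-letter alphabet, one made-up substitution matrix shared by all branches, and a hypothetical function name `update_node`. Matrices are indexed as `P[parent_state, child_state]`.

```python
import numpy as np

def update_node(q_pa, q_l, q_r, P_i, P_l, P_r):
    """Mean field update for a hidden node with a hidden parent and two
    hidden children. P_i, P_l, P_r are conditional probability tables
    for the branches leading to the node, its left and right child."""
    log_q = (q_pa @ np.log(P_i)        # sum_a q_pa(a) log p(h_i | a)
             + np.log(P_l) @ q_l       # sum_b q_l(b) log p(b | h_i)
             + np.log(P_r) @ q_r)      # sum_c q_r(c) log p(c | h_i)
    q = np.exp(log_q - log_q.max())    # subtract max for numerical stability
    return q / q.sum()                 # normalize so q_i is a distribution

# Made-up substitution matrix: stay with prob. 0.85, change with 0.05 each.
P = np.full((4, 4), 0.05) + 0.80 * np.eye(4)

confident_A = np.array([0.97, 0.01, 0.01, 0.01])
uniform = np.full(4, 0.25)

q_sharp = update_node(confident_A, confident_A, confident_A, P, P, P)
q_flat = update_node(uniform, uniform, uniform, P, P, P)
```

When all three neighbors are confident about the first state, `q_sharp` concentrates on it; when the neighbors are uniform, the symmetric matrix leaves `q_flat` uniform.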
Mean field for a phylo-hmm model: Now we don't know how to solve this exactly, but MF is still simple. [Figure: neighborhood of $h^j_i$ in the phylo-HMM, with nodes $h^{j-1}_i$, $h^{j+1}_i$, $h^{j-1}_{pa_i}$, $h^j_{pa_i}$, $h^j_l$, $h^{j+1}_l$, $h^{j-1}_r$, $h^j_r$, $h^{j+1}_r$] As before, the optimal solution is derived by making $\log q^j_i$ equal to the sum of the affected terms, that is, the conditional probabilities in which $h^j_i$ appears, each averaged over the current $q$s of the other variables.
Simple Mean Field is usually not a good idea. Why? Because the MF trial function is very crude. For example, we said before that the joint posterior cannot be approximated by an independent product of the hidden variables' posteriors. [Figure: example with node labels A, C, A, C, A/C]
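A two-variable toy case makes the point concrete. The numbers below are made up for illustration and are not taken from the slide's figure: two hidden nodes over the states (A, C) that almost always agree, with A and C equally likely overall.

```python
import numpy as np

# Made-up joint posterior post[h1, h2] over states (A, C).
post = np.array([[0.45, 0.05],
                 [0.05, 0.45]])

m1 = post.sum(axis=1)                 # marginal posterior of h1
m2 = post.sum(axis=0)                 # marginal posterior of h2
product = np.outer(m1, m2)            # independent (mean field style) product

# Probability that the two nodes carry the same state.
agree_true = post[0, 0] + post[1, 1]
agree_indep = product[0, 0] + product[1, 1]
```

Both marginals are uniform, so the independent product assigns the same probability to every configuration and keeps no trace of the strong agreement between the two nodes.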
Exploiting additional structure: We can greatly improve accuracy by generalizing the mean field algorithm using larger building blocks. The approximation specifies an independent distribution for each locus, but maintains the tree dependencies within a locus: $q(h) = \prod_j q^j(h^j)$. We now optimize each tree $q^j$ separately, given the current potentials of the other trees. The key point is that optimizing for any given tree is efficient: we just use a modified up-down algorithm.
Tree based variational inference Each tree is only affected by the tree before and the tree after:
Tree based variational inference: We got the same functional form as we had for the simple tree, so we can use the up-down algorithm to optimize $q^j$.
Chain cluster variational inference: We can use any partition of a BN into trees and derive a similar MF algorithm. For example, instead of trees we can use the Markov chains in each species. What will work better for us? It depends on the strength of the dependencies in each dimension: we should try to capture as much "dependency" as possible.
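Directionality and acyclicity are crucial for BNs. This is why: the joint probability is a product of conditional probabilities and is normalized by construction, $p(h,s|\theta) = \prod_i p(x_i \,|\, pa(x_i), \theta_i)$. It also allows us to estimate parameters using EM: Given a set of observations $s^1, s^2, \ldots$: 1) Start with any set of CPDs. 2) While improving: compute posteriors (somehow), then update all CPDs. The maximization part is simple because each factor in the joint probability can be optimized separately.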
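The EM loop above can be sketched on a very small BN. The model here is an assumption chosen for brevity, not one from the lecture: a single hidden root `h` with `K` observed children (a star tree), exact posteriors in the E step, and synthetic data drawn from made-up parameters. All names are hypothetical.

```python
import numpy as np

def joint_table(data, prior, cpds):
    """joint[n, h] = p(h) * prod_k p(s_nk | h) for every observation n."""
    lik = np.prod([cpd[:, data[:, k]].T for k, cpd in enumerate(cpds)], axis=0)
    return prior * lik

def em(data, n_hidden, n_symbols, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    K = data.shape[1]
    # Start with any set of CPDs (random here).
    prior = rng.dirichlet(np.ones(n_hidden))
    cpds = [rng.dirichlet(np.ones(n_symbols), size=n_hidden) for _ in range(K)]
    trace = []
    for _ in range(n_iter):
        joint = joint_table(data, prior, cpds)
        trace.append(np.log(joint.sum(axis=1)).sum())     # log-likelihood
        post = joint / joint.sum(axis=1, keepdims=True)   # E step: p(h | s_n)
        prior = post.mean(axis=0)                         # M step: root CPD
        for k in range(K):                                # M step: child CPDs,
            counts = np.zeros((n_symbols, n_hidden))      # each one separately
            np.add.at(counts, data[:, k], post)           # expected counts
            cpds[k] = (counts / counts.sum(axis=0, keepdims=True)).T
    return prior, cpds, trace

# Synthetic observations from made-up parameters.
rng = np.random.default_rng(0)
true_prior = np.array([0.6, 0.4])
true_cpd = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.1, 0.1, 0.7]])
h = rng.choice(2, size=200, p=true_prior)
data = np.array([[rng.choice(4, p=true_cpd[hn]) for _ in range(3)] for hn in h])

prior, cpds, trace = em(data, n_hidden=2, n_symbols=4)
```

Each CPD is re-estimated from its own expected counts, independently of the others, which is the separability the slide refers to; with exact posteriors the values in `trace` do not decrease.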
But we are minimizing energy, not computing likelihood: would the generalized EM still work? If we make sure that the M step optimizes the free energy, we will be OK. We will usually obtain the same functional form for the optimization problem, where each conditional probability can be optimized independently. It is crucial that the model defines a probability distribution without normalizing. Without this, we will lose independence among the product factors.
Directionality fits temporal behavior, but in a complex way. Directed acyclic approximations are limited. [Figure: DBN and PhyloHMM structures over the nodes $h^{j-1}_i$, $h^j_i$, $h^{j+1}_i$, $h^{j-1}_{pa_i}$, $h^j_{pa_i}$, $h^{j+1}_{pa_i}$] This may be more accurate (but we've added directed cycles): [Figure: a third structure over the same nodes]
Factor graphs/Markov Nets: Defining the joint probability for a set of random variables given: 1) any set of node subsets (a hypergraph); 2) functions on the node subsets (potentials). Joint distribution: $p(x) = \frac{1}{Z} \prod_a \phi_a(x_a)$. Partition function: $Z = \sum_x \prod_a \phi_a(x_a)$. If the potentials are conditional probabilities, what will $Z$ be? Not necessarily 1! (Can you think of an example?) Things are difficult when there are several modes; think of these as local optima. [Figure: factor graph, with legend entries "Factor" and "R.V."]
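One possible answer to the question on this slide, as a brute-force sketch: two binary variables whose potentials are conditional probabilities, once arranged as a valid BN and once with a directed cycle. The tables and the function name `partition_function` are made up for illustration.

```python
import itertools
import numpy as np

def partition_function(n_vars, factors, n_states=2):
    """Brute-force Z = sum_x prod_a phi_a(x_a).
    `factors` is a list of (variable_indices, table) pairs."""
    Z = 0.0
    for x in itertools.product(range(n_states), repeat=n_vars):
        weight = 1.0
        for idx, table in factors:
            weight *= table[tuple(x[i] for i in idx)]
        Z += weight
    return Z

# copy[parent, child] = p(child | parent): the child copies the parent
# with probability 0.9.
copy = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
prior = np.array([0.5, 0.5])

# Acyclic: p(a) p(b | a), a valid BN.
Z_dag = partition_function(2, [((0,), prior), ((0, 1), copy)])

# Cyclic: p(b | a) p(a | b); every potential is a CPD, but a depends on b
# and b depends on a.
Z_loop = partition_function(2, [((0, 1), copy), ((1, 0), copy)])
```

Here `Z_dag` is 1, while `Z_loop` is 2(0.9)(0.9) + 2(0.1)(0.1) = 1.64, so the product of conditional probabilities around a cycle is not a normalized distribution.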
Converting directional models to factor graphs (Loops!): [Figure: DBN and PhyloHMM structures over the nodes $h^{j-1}_i$, $h^j_i$, $h^{j+1}_i$, $h^{j-1}_{pa_i}$, $h^j_{pa_i}$, $h^{j+1}_{pa_i}$, redrawn as factor graphs; panel labels: "Well defined, $Z = 1$" and "$Z \ne 1$"]
More definitions: Remember: potentials can be defined on discrete variables, real-valued variables, etc. It is also common to define general log-linear models directly: $p(x) = \frac{1}{Z} \exp\left( \sum_k w_k f_k(x) \right)$, with weights $w_k$ and feature functions $f_k$. These models are very expressive and broad. The techniques we discussed today (and also MCMC inference) work without change, but anything that relies on $Z = 1$ (e.g. forward sampling, EM) becomes more difficult. Directed models are sometimes more natural and easier to understand. Their popularity stems from their original role in expressing knowledge in AI, not from their adequacy for modeling physical phenomena. Undirected models are very similar to techniques from statistical physics (e.g., spin glass models), and we can use ideas from physicists (the guys are big on approximations). The models are convex, which gives them important algorithmic properties; these were recently exploited to derive convex variational optimization (Wainwright and Jordan 2003).