Genome evolution: a sequence-centric approach Lecture 5: Undirected models and variational inference

Course outline: probabilistic models (simple tree models, HMMs and variants, phylo-HMM, DBN, context-aware MM), inference (dynamic programming, sampling), parameter estimation (EM), genome structure, mutations, population, inferring selection. Prerequisites: probability, calculus/matrix theory, some graph theory, some statistics.

Log-likelihood to free energy. We have so far worked on computing the likelihood: log P(s|θ) = log Σ_h P(h,s|θ). Computing the likelihood is hard, but we can reformulate the problem by adding parameters and transforming it into an optimization problem. Given a trial function q over the hidden variables, define the free energy of the model as F(q,θ) = Σ_h q(h) log [P(h,s|θ) / q(h)]. Better: when q is a distribution, the free energy bounds the likelihood, since log P(s|θ) = F(q,θ) + D(q || p(h|s)) ≥ F(q,θ). The free energy is exactly the log-likelihood when q is the posterior p(h|s).
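As a quick numeric illustration of this bound (not from the lecture; the toy joint below is made up), here is a minimal sketch for a model with a single binary hidden variable:

```python
# Check that F(q) = sum_h q(h) log[P(h,s)/q(h)] lower-bounds log P(s),
# with equality when q is the posterior P(h|s). Toy numbers only.
import numpy as np

p_joint = np.array([0.3, 0.1])      # P(h, s) for the single observed s
log_ps = np.log(p_joint.sum())      # log-likelihood log P(s)

def free_energy(q):
    q = np.asarray(q, dtype=float)
    return np.sum(q * (np.log(p_joint) - np.log(q)))

posterior = p_joint / p_joint.sum()  # P(h|s)

print("log P(s)      :", log_ps)
print("F(uniform q)  :", free_energy([0.5, 0.5]))   # strictly below log P(s)
print("F(posterior q):", free_energy(posterior))    # equals log P(s)
```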

Energy?? What energy? In statistical mechanics, a system at temperature T with states x and an energy function E(x) is characterized by Boltzmann's law: P(x) = e^{-E(x)/T} / Z, where Z = Σ_x e^{-E(x)/T} is the partition function. If we think of P(h|s,θ): given a directed model P(h,s|θ) (a BN), we can define the energy using Boltzmann's law (at T = 1) as E(h) = -log P(h,s|θ), so that P(h|s,θ) = e^{-E(h)} / Z. Z is the partition function: Z = Σ_h e^{-E(h)} = Σ_h P(h,s|θ) = P(s|θ).
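A minimal numeric sketch (same made-up toy joint as above) of reading a BN posterior as a Boltzmann distribution:

```python
# With E(h) = -log P(h,s), the Boltzmann form exp(-E(h))/Z recovers P(h|s),
# and the partition function Z equals P(s). Toy numbers only.
import numpy as np

p_joint = np.array([0.3, 0.1])        # P(h, s) for one fixed observation s
E = -np.log(p_joint)                  # energy of each hidden configuration (T = 1)
Z = np.exp(-E).sum()                  # partition function

print("Z                :", Z)                       # equals P(s) = 0.4
print("Boltzmann P(h|s) :", np.exp(-E) / Z)
print("Direct posterior :", p_joint / p_joint.sum())
```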

Free energy and variational free energy. The Helmholtz free energy is defined in physics as F = -log Z (at T = 1). This free energy is important in statistical mechanics, but it is difficult to compute, just like our probabilistic Z (= P(s)). The variational transformation introduces trial functions q(h) and sets the variational free energy (or Gibbs free energy) to F(q) = U(q) - H(q), where the average energy is U(q) = Σ_h q(h) E(h) and the variational entropy is H(q) = -Σ_h q(h) log q(h). And as before: F(q) = -log Z + D(q || p(h|s)) ≥ -log Z, with equality when q is the posterior (this is the earlier bound up to a sign: minimizing the variational free energy is the same as maximizing the likelihood bound).
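Again only as a numeric sketch with the same toy joint as above (not the lecture's example), checking that F(q) = U(q) - H(q) stays above -log Z and touches it at the posterior:

```python
# Variational free energy F(q) = U(q) - H(q) with Boltzmann energy E(h) = -log P(h,s).
import numpy as np

p_joint = np.array([0.3, 0.1])          # toy P(h, s) as before
E = -np.log(p_joint)                    # energies
Z = p_joint.sum()                       # partition function = P(s)

def variational_free_energy(q):
    q = np.asarray(q, dtype=float)
    U = np.sum(q * E)                   # average energy
    H = -np.sum(q * np.log(q))          # variational entropy
    return U - H

posterior = p_joint / Z
print("-log Z        :", -np.log(Z))
print("F(uniform q)  :", variational_free_energy([0.5, 0.5]))  # above -log Z
print("F(posterior q):", variational_free_energy(posterior))   # equals -log Z
```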

Solving the variational optimization problem. So instead of computing P(s), we can search for a q that optimizes (minimizes) the free energy. This is still as hard as before, but we can simplify the problem by restricting q (this is where the additional degrees of freedom become important). Minimizing U? Focus on the maximum-probability (minimum-energy) configurations. Maximizing H? Spread out the distribution. Minimizing F = U - H must balance the two.

Simplest variational approximation: mean field. Let's assume complete independence among the posteriors of the random variables: q(h) = Π_i q_i(h_i). Under this assumption we can try optimizing the q_i one at a time (looking for minimal free energy!).

Mean field inference. We optimize iteratively: select i (sequentially, or using any other schedule); optimize q_i to minimize F_MF(q_1,...,q_i,...,q_n) while fixing all the other q's; terminate when F_MF cannot be improved further. Remember: F_MF always bounds the likelihood, and the q_i optimization can usually be done efficiently.
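To make the update loop concrete, here is a rough, self-contained sketch of naive mean-field coordinate ascent on a small binary chain MRF; the potentials, sizes, and schedule are all made up for illustration and are not from the lecture:

```python
# Naive mean-field coordinate ascent on a chain MRF with unary and pairwise
# log-potentials (illustrative random parameters).
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 2                                   # 6 variables, binary states
theta_unary = rng.normal(size=(n, K))         # log phi_i(x_i)
theta_pair = rng.normal(size=(n - 1, K, K))   # log phi_{i,i+1}(x_i, x_{i+1})

def normalize(logv):
    v = np.exp(logv - logv.max())
    return v / v.sum()

q = np.full((n, K), 1.0 / K)                  # factorized marginals q_i

for sweep in range(50):
    for i in range(n):
        # log q_i(x_i) = theta_i(x_i) + expected pairwise terms under neighbors' q
        logq = theta_unary[i].copy()
        if i > 0:                              # edge (i-1, i)
            logq += q[i - 1] @ theta_pair[i - 1]   # sum_{x_{i-1}} q(x_{i-1}) theta(x_{i-1}, x_i)
        if i < n - 1:                          # edge (i, i+1)
            logq += theta_pair[i] @ q[i + 1]       # sum_{x_{i+1}} theta(x_i, x_{i+1}) q(x_{i+1})
        q[i] = normalize(logq)

print("Mean-field marginals:\n", np.round(q, 3))
```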

Mean field for a simple-tree model. Just for illustration, since we know how to solve this one exactly: we select a node i and optimize its q_i while making sure it remains a distribution. To ease notation, assume the left (l) and right (r) children of i are hidden. The energy decomposes, and only a few terms are affected:

Mean field for a simple-tree model (cont.): optimizing q_i keeps only the terms that touch h_i (the edge to its parent and the edges to its two children), each averaged under the current q's of the neighbors: log q_i(h_i) = Σ_{h_pai} q_pai(h_pai) log P(h_i|h_pai) + Σ_{h_l} q_l(h_l) log P(h_l|h_i) + Σ_{h_r} q_r(h_r) log P(h_r|h_i) + const, followed by normalization of q_i.
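A small runnable sketch of these updates on a toy three-leaf tree (hidden root h0, hidden internal node h1, observed leaves s2, s3, s4), assuming for simplicity a single substitution matrix shared by all edges; all numbers and names are illustrative, not the lecture's:

```python
# Mean-field updates on a tiny phylogenetic tree: h0 -> (h1, s2), h1 -> (s3, s4).
import numpy as np

K = 4                                       # e.g. four nucleotides
rng = np.random.default_rng(1)
prior = np.full(K, 1.0 / K)                 # root prior
T = rng.dirichlet(np.ones(K) * 5, size=K)   # substitution matrix, rows = parent state
logT = np.log(T)
s2, s3, s4 = 0, 1, 1                        # observed leaf states (hypothetical data)

def normalize(logv):
    v = np.exp(logv - logv.max())
    return v / v.sum()

q0 = np.full(K, 1.0 / K)                    # q over hidden root h0
q1 = np.full(K, 1.0 / K)                    # q over hidden internal node h1

for _ in range(100):
    # Terms touching h0: root prior, edge (h0,h1) averaged under q1, edge (h0,s2)
    q0 = normalize(np.log(prior) + logT @ q1 + logT[:, s2])
    # Terms touching h1: edge (h0,h1) averaged under q0, edges (h1,s3), (h1,s4)
    q1 = normalize(q0 @ logT + logT[:, s3] + logT[:, s4])

print("q(h0):", np.round(q0, 3))
print("q(h1):", np.round(q1, 3))
```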

Mean field for a phylo-HMM model. Now we don't know how to solve this exactly, but MF is still simple: [Figure: the phylo-HMM neighborhood of h_i^j, showing its parent h_pai^j, its children h_l^j and h_r^j, and the copies of these nodes at the adjacent loci j-1 and j+1.]

Mean field for a phylo-HMM model (cont.): as before, the optimal solution is derived by setting log q_i^j equal to the sum of the affected terms, i.e., the expectations (under the other q's) of all the log-CPDs in which h_i^j appears, followed by normalization.

Simple mean field is usually not a good idea. Why? Because the MF trial function is very crude. For example, we said before that the joint posterior cannot be approximated by an independent product of the hidden variables' posteriors: in a tree whose leaves are observed as A and C, each hidden ancestor is marginally A or C, but the hidden nodes are strongly correlated with one another, which a product of marginals cannot express.

Exploiting additional structure. We can greatly improve accuracy by generalizing the mean field algorithm using larger building blocks. The approximation specifies an independent distribution for each locus, but maintains the tree dependencies within it: q(h) = Π_j q^j(h^j), where h^j is the tree of hidden variables at locus j. We now optimize each tree q^j separately, given the current potentials of the other trees. The key point is that optimizing any given tree is efficient: we just use a modified up-down algorithm.

Tree-based variational inference. Each tree is only affected by the tree before it and the tree after it: the free-energy terms involving q^j couple it only to q^{j-1} and q^{j+1}.

Tree-based variational inference (cont.): we obtain the same functional form as we had for the simple tree, so we can use the up-down algorithm to optimize q^j.

Chain cluster variational inference. We can use any partition of the BN into trees (or other tractable blocks) and derive a similar MF algorithm. For example, instead of trees we can use the Markov chain of each species. What will work better for us? It depends on the strength of the dependencies along each dimension: we should try to capture as much "dependency" as possible inside the blocks.

Directionality and acyclicity are crucial for BNs. This is why: the joint P(x) = Π_i P(x_i | pa_i) is automatically normalized, since summing the variables out in reverse topological order telescopes to 1 (Z = 1 for free). It also allows us to estimate parameters using EM: given a set of observations s^1, s^2, ..., start with any set of CPDs; while improving, compute posteriors (somehow) and update all CPDs. The maximization part is simple because each factor in the joint probability can be optimized separately.
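For concreteness, here is a minimal EM sketch (synthetic data and hypothetical parameters, not the lecture's model) for the simplest BN with a hidden node, H -> S, where each M-step re-estimates each CPD independently from expected counts:

```python
# EM for a two-node Bayes net H -> S with H hidden: estimate P(H) and P(S|H)
# from observations of S alone.
import numpy as np

rng = np.random.default_rng(2)
K, V, N = 2, 3, 500                        # hidden states, observed symbols, samples

# Synthetic data from a "true" model (illustrative numbers)
true_pH = np.array([0.3, 0.7])
true_pSgH = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.2, 0.7]])
h = rng.choice(K, size=N, p=true_pH)
s = np.array([rng.choice(V, p=true_pSgH[hi]) for hi in h])

# Random initialization of the CPDs
pH = rng.dirichlet(np.ones(K))
pSgH = rng.dirichlet(np.ones(V), size=K)

for it in range(100):
    # E-step: posterior P(H | s_n) for every observation
    post = pH * pSgH[:, s].T                 # shape (N, K), unnormalized
    post /= post.sum(axis=1, keepdims=True)
    # M-step: each CPD is re-estimated separately from expected counts
    pH = post.mean(axis=0)
    for v in range(V):
        pSgH[:, v] = post[s == v].sum(axis=0)
    pSgH /= pSgH.sum(axis=1, keepdims=True)

# Note: estimates match the truth only up to permutation of the hidden labels.
print("Estimated P(H):", np.round(pH, 2))
print("Estimated P(S|H):\n", np.round(pSgH, 2))
```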

But we are minimizing a free energy, not computing the likelihood; would the generalized EM still work? If we make sure that the M step optimizes the free energy, we will be fine. We will usually obtain the same functional form for the optimization problem, where each conditional probability can be optimized independently. It is crucial that we define a probability distribution without having to normalize; without this, we would lose the independence among the product factors.

Directed acyclic approximations are limited. Directionality fits temporal behavior, but in a complex way. [Figure: local structures over h_i^{j-1}, h_i^j, h_i^{j+1} and their parents h_pai^{j-1}, h_pai^j, h_pai^{j+1}: a DBN, a PhyloHMM, and a third variant that may be more accurate, but adds directed cycles.]

Factor graphs / Markov nets. We define the joint probability for a set of random variables given: 1) any set of node subsets (a hypergraph); 2) functions on the node subsets (potentials). Joint distribution: P(x) = (1/Z) Π_k φ_k(x_{S_k}). Partition function: Z = Σ_x Π_k φ_k(x_{S_k}), where S_k is the k-th node subset. If the potentials are conditional probabilities, what will Z be? Not necessarily 1! (Can you think of an example?) Things are difficult when there are several modes; think of these like local optima. [Figure: a factor graph, with factor nodes connected to their random variables (R.V.).]
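One possible example (a rough brute-force sketch with made-up numbers, not from the lecture): take two binary variables and use two conditional probability tables that "point at each other" as potentials; the resulting Z is not 1:

```python
# Brute-force partition function for a tiny factor graph whose two potentials
# are both valid CPTs, yet Z != 1 because they form a directed cycle A <-> B.
import itertools
import numpy as np

# phi1[a, b] = "P(A=a | B=b)"  (each column sums to 1)
phi1 = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
# phi2[a, b] = "P(B=b | A=a)"  (each row sums to 1)
phi2 = np.array([[0.6, 0.4],
                 [0.3, 0.7]])

Z = sum(phi1[a, b] * phi2[a, b] for a, b in itertools.product(range(2), repeat=2))
print("Z =", Z)        # 1.21, not 1

def joint(a, b):
    return phi1[a, b] * phi2[a, b] / Z

print("P(A=0,B=0) =", joint(0, 0))
```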

Converting directed models to factor graphs (loops!). [Figure: the DBN and PhyloHMM structures over h_i^{j-1}, h_i^j, h_i^{j+1} and their parents, redrawn as factor graphs. The directed acyclic versions are well defined with Z = 1; the variant with the extra (cyclic) dependencies has loops and Z != 1.]

More definitions. Remember: potentials can be defined on discrete, real-valued, etc. variables; it is also common to define general log-linear models directly: P(x) = (1/Z) exp(Σ_k w_k f_k(x)). These models are very expressive and broad. The techniques we discussed today (and also MCMC inference) work without change, but anything that relies on Z = 1 (e.g., forward sampling, EM) becomes more difficult. Directed models are sometimes more natural and easier to understand; their popularity stems from their original role in expressing knowledge in AI, not from their adequacy for modeling physical phenomena. Undirected models are very similar to techniques from statistical physics (e.g., spin-glass models), so we can use ideas from the physicists (who are big on approximations). The models also have important convexity properties, which were recently exploited to derive convex variational optimization (Wainwright and Jordan 2003).
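A tiny sketch of such a log-linear parameterization over two binary variables, with hypothetical features and weights, normalized by brute force:

```python
# Log-linear model p(x) = exp(sum_k w_k f_k(x)) / Z; Z computed by enumeration.
import itertools
import numpy as np

features = [
    lambda x: float(x[0] == x[1]),   # agreement feature
    lambda x: float(x[0] == 1),      # bias on the first variable
]
w = np.array([1.5, -0.5])            # illustrative weights

states = list(itertools.product([0, 1], repeat=2))
scores = np.array([w @ [f(x) for f in features] for x in states])
Z = np.exp(scores).sum()             # explicit normalization -- no longer free as in a BN
probs = np.exp(scores) / Z

for x, p in zip(states, probs):
    print(x, round(float(p), 3))
```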