Lecture 5: Learning models using EM

Presentation transcript:

Intro to Comp Genomics, Lecture 5: Learning models using EM

Mixtures of Gaussians. We have experimental measurements of some value and want to describe their behavior: essentially one behavior? Two behaviors? More? In one dimension it may look very easy: just looking at the distribution gives a good idea. We can formulate the model probabilistically as a mixture of normal distributions. As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable, and then generate a value using the selected normal distribution. If the data is multidimensional, the problem becomes non-trivial.
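As a rough illustration of this generative story, a minimal sketch for a two-component one-dimensional mixture (the parameters are made up for the example, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component mixture (parameters chosen for the example only)
p = np.array([0.2, 0.8])      # mixture weights
mu = np.array([0.0, 1.0])     # component means
sigma = np.array([1.0, 0.2])  # component standard deviations

def sample_mixture(n):
    # Step 1: choose a sub-model for each sample from the mixture variable
    z = rng.choice(len(p), size=n, p=p)
    # Step 2: generate a value from the selected normal distribution
    return rng.normal(mu[z], sigma[z]), z

x, z = sample_mixture(1000)
```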

Inference. Let's represent the model with a mixture variable s (weights p0, p1) and component densities N(m0, s0^2) and N(m1, s1^2). What is the inference problem in our model? Inference: computing the posterior probability of a hidden variable given the data and the model parameters. For example, for p0=0.2, p1=0.8, m0=0, m1=1, s0=1, s1=0.2, what is Pr(s=0 | x=0.8)?
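By Bayes' rule this posterior is the component's weighted density divided by the total density; a minimal sketch of the computation for the numbers above:

```python
from scipy.stats import norm

p = [0.2, 0.8]        # p0, p1
mu = [0.0, 1.0]       # m0, m1
sigma = [1.0, 0.2]    # s0, s1

def posterior_s0(x):
    # Pr(s=0 | x) = p0*N(x; m0, s0) / (p0*N(x; m0, s0) + p1*N(x; m1, s1))
    weighted = [p[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)]
    return weighted[0] / sum(weighted)

print(posterior_s0(0.8))  # roughly 0.056: x = 0.8 is much better explained by the narrow component at 1
```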

Estimation/parameter learning. Given data, how can we estimate the model parameters? Transform it into an optimization problem! The likelihood is a function of the parameters, defined given the data. Find the parameters that maximize the likelihood: the ML problem. It can be approached heuristically, using any generic optimization technique (gradient ascent, simulated annealing, genetic algorithms, and more), but it is a non-linear problem which may be very difficult.

The EM algorithm for mixtures. We start by guessing parameters. We then go over the samples and compute their posteriors (i.e., inference). We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients. Continue iterating until convergence. The EM theorem: the algorithm will converge and will improve the likelihood monotonically. But: there is no guarantee of finding the optimum, or of finding anything meaningful. The initial conditions are critical: think of starting from m0=0, m1=10, s0=s1=1. Solutions: start from "reasonable" parameters, or try many starting points.
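A minimal sketch of these E/M updates for a one-dimensional Gaussian mixture (generic code under the usual mixture assumptions, not the course's own implementation):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, p, mu, sigma, n_iter=100):
    """EM for a 1D Gaussian mixture; p, mu, sigma are the initial guesses (length-K arrays)."""
    x = np.asarray(x, float)
    p, mu, sigma = np.asarray(p, float), np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        weighted = p * norm.pdf(x[:, None], mu, sigma)      # shape (n, K)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: expected sufficient statistics -> new mixture coefficients, means, variances
        nk = resp.sum(axis=0)
        p = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return p, mu, sigma
```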

Hidden Markov Models. We observe only the emissions of the states into some probability space E; each state is equipped with an emission distribution (x a state, e an emission). Caution: the state-transition diagram is NOT the HMM Bayes net: (1) it contains cycles, and (2) the states are NOT random variables!

Simple example: a mixture with "memory". We sample a sequence of dependent values: at each step, we decide whether to continue sampling from the same distribution (A or B) or to switch, with probability p. We can compute the probability directly only given the hidden variables; P(x) is derived by summing over all possible combinations of hidden variables. This is another form of the inference problem (why?). There is an exponential number of h assignments; can we still solve the problem efficiently?
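A minimal sketch of this generative process, with illustrative parameters (the switch probability and the two emission distributions are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_memory_mixture(n, p_switch, mu, sigma):
    """Sample n dependent values: keep the current distribution, or switch with probability p_switch."""
    s = int(rng.integers(2))               # start in state A (0) or B (1) at random
    states, values = [], []
    for _ in range(n):
        if rng.random() < p_switch:        # switch to the other distribution with probability p
            s = 1 - s
        states.append(s)
        values.append(rng.normal(mu[s], sigma[s]))
    return np.array(values), np.array(states)

x, h = sample_memory_mixture(200, p_switch=0.05, mu=[0.0, 3.0], sigma=[1.0, 1.0])
```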

Inference in HMM. Forward formula: the probability of emitting the first i characters and being at state s at position i. Backward formula: the probability of emitting the remaining characters given that we are at state s at position i. (Slide figure: trellis with Start, States, Emissions, Finish.)
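A minimal forward-backward sketch for a discrete-emission HMM, in generic notation (the lecture's own formulas are in the slide figures); for long sequences one would rescale or work in log space, which is omitted here:

```python
import numpy as np

def forward_backward(obs, start, trans, emit):
    """obs: emission indices; start[s], trans[s, s'], emit[s, o]: model parameters (float arrays)."""
    n, K = len(obs), len(start)
    fwd = np.zeros((n, K))
    bwd = np.zeros((n, K))
    # Forward: fwd[i, s] = P(x_1..x_i, state_i = s)
    fwd[0] = start * emit[:, obs[0]]
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ trans) * emit[:, obs[i]]
    # Backward: bwd[i, s] = P(x_{i+1}..x_n | state_i = s)
    bwd[-1] = 1.0
    for i in range(n - 2, -1, -1):
        bwd[i] = trans @ (emit[:, obs[i + 1]] * bwd[i + 1])
    total = fwd[-1].sum()                  # P(x)
    posterior = fwd * bwd / total          # P(state_i = s | x)
    return fwd, bwd, posterior, total
```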

EM for HMMs. What is the posterior probability of emitting the i'th character from state s? What is the posterior probability of a transition from s' to s after character i? With multiple sequences, assume independence (accumulate the statistics). Claim: HMM EM monotonically improves the likelihood.
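Building on the hypothetical forward_backward sketch above, one way to compute these posterior (expected) counts, i.e. the E-step statistics:

```python
import numpy as np

def expected_counts(obs, start, trans, emit):
    """E-step for one sequence: expected emission and transition counts (uses forward_backward above)."""
    fwd, bwd, posterior, total = forward_backward(obs, start, trans, emit)
    K = len(start)
    # Posterior probability of emitting character i from state s, accumulated per observed symbol
    emit_counts = np.zeros_like(emit)
    for i, o in enumerate(obs):
        emit_counts[:, o] += posterior[i]
    # Posterior probability of a transition s' -> s after character i, summed over positions
    trans_counts = np.zeros((K, K))
    for i in range(len(obs) - 1):
        trans_counts += (fwd[i][:, None] * trans *
                         (emit[:, obs[i + 1]] * bwd[i + 1])[None, :]) / total
    # With multiple sequences, accumulate these counts across sequences before normalizing (M-step)
    return emit_counts, trans_counts
```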

The EM theorem for mixtures, simplified. Assume that we know which distribution generated each sample (samples S_i generated from distribution i). We want to maximize the model's likelihood given this extra information. The likelihood separates, so we solve for each group of parameters separately; the mixture coefficients give a "multinomial estimator", solved using Lagrange multipliers.
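A sketch of the multinomial-estimator step in generic notation (a standard derivation; the slide's own formulas are in the images):

```latex
% Maximize the mixture-coefficient part of the complete-data log-likelihood
\max_{p_1,\dots,p_K}\; \sum_{k=1}^{K} |S_k|\,\log p_k
\quad \text{subject to} \quad \sum_{k=1}^{K} p_k = 1 .
% Lagrangian: \sum_k |S_k|\log p_k + \lambda\,(1 - \sum_k p_k);
% setting the derivative |S_k|/p_k - \lambda to zero gives p_k \propto |S_k|, hence
\hat{p}_k = \frac{|S_k|}{\sum_{j=1}^{K} |S_j|} .
```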

The EM theorem for mixtures, simplified (continued). Again assume that we know which distribution generated each sample (samples S_i generated from distribution i) and maximize the likelihood given this extra information. The normal distribution estimator uses the observed sufficient statistics (the normal is an exponential family), and each component is solved separately. We have thus found the global optimum of the likelihood in the case of full data.
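In generic notation, the per-component estimators from the observed sufficient statistics are the familiar sample mean and variance (a standard result, not taken from the slide images):

```latex
\hat{\mu}_k = \frac{1}{|S_k|} \sum_{x \in S_k} x ,
\qquad
\hat{\sigma}_k^2 = \frac{1}{|S_k|} \sum_{x \in S_k} \left( x - \hat{\mu}_k \right)^2 .
```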

The EM theorem for mixtures, simplified (continued). Assume now that each sample i is known to be from distribution j with probability P_ij. We can write down the expected complete log-likelihood, and the same maximization holds: solve each group of parameters separately, deriving the EM formulas. In the EM algorithm, the P_ij we used are the posteriors under the current parameters, so Q depends on the current parameters and we denote it Q(theta | theta^t). What is missing? Q is not L!
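A generic form of this objective for a mixture (a standard writing of the expected complete log-likelihood, not the slide's exact notation):

```latex
Q(\theta \mid \theta^{t})
 = \sum_{i} \sum_{j} P_{ij}\,\log\!\big( p_j \, f_j(x_i \mid \theta_j) \big),
\qquad
P_{ij} = \Pr\big( j \mid x_i, \theta^{t} \big).
```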

Expectation-Maximization (Dempster et al., 1977). The proof of the EM theorem combines two ingredients: the relative entropy is >= 0, and the M-step maximizes Q.
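The standard argument, written in generic notation (the slide's own derivation is in the figures):

```latex
% For hidden variables h and current parameters \theta^t:
\log P(x \mid \theta)
 = \underbrace{\sum_h P(h \mid x, \theta^t)\,\log P(x, h \mid \theta)}_{Q(\theta \mid \theta^t)}
 \; - \; \sum_h P(h \mid x, \theta^t)\,\log P(h \mid x, \theta) .
% Subtracting the same identity at \theta = \theta^t, the second terms differ by
% D_{KL}\big( P(\cdot \mid x,\theta^t) \,\|\, P(\cdot \mid x,\theta) \big) \ge 0, hence
\log P(x \mid \theta) - \log P(x \mid \theta^t)
 \;\ge\; Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t),
% so maximizing Q over \theta cannot decrease the likelihood.
```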

KL-divergence. Shannon entropy and the Kullback-Leibler divergence. Note that the KL divergence is not a metric!
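For reference, the standard discrete definitions behind these terms:

```latex
% Standard discrete definitions:
H(p) = -\sum_x p(x)\,\log p(x)            % Shannon entropy
\qquad
D_{KL}(p \,\|\, q) = \sum_x p(x)\,\log\frac{p(x)}{q(x)} \;\ge\; 0
% with equality iff p = q. D_{KL} is not symmetric and violates the triangle
% inequality, so it is not a metric.
```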

Bayesian learning vs. maximum likelihood. The maximum likelihood estimator uses no prior beliefs. Bayesian learning introduces prior beliefs on the process (alternatively: think of them as virtual evidence) and computes posterior probabilities on the parameters. (Slide figure: the parameter space, contrasting the MLE with the MAP and posterior-mean (PME) estimates derived from the beliefs.)
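For reference, the three estimators the figure contrasts, in standard notation:

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D \mid \theta),
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D)
                            = \arg\max_{\theta} P(D \mid \theta)\, P(\theta),
\qquad
\hat{\theta}_{\mathrm{PME}} = \mathrm{E}\!\left[\theta \mid D\right].
```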

Your Task. Preparations: get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17, and cut the data into segments of 50,000 data points. Modeling: use EM to build a probabilistic model for the peak signals and the background; use heuristics for peak finding to initialize the EM. Analysis: test whether your model for a single peak structure is as good as the model for two peak structures, and compute the distribution of peaks relative to transcription start sites. (Background: ChIP-seq, CTCF and PolII; modeling ChIP-seq, binning.)

Your Task: Modeling. (Task outline as above; bin size = 50bp.) The model uses K states for the peak and one state for the background; use K=40. (Slide figure: state diagram with Start, peak states P1, P2, P3, ..., a background state B, and Finish.)

Your Task: Modeling. Implement HMM inference, i.e. forward-backward (let's write them together). Make sure the total probability from the forward algorithm equals the total probability from the backward algorithm! Implement the EM update rules. Run EM from multiple random starting points and record the likelihoods you derive. Implement a smarter initialization: take the average values around all probes with a value over a threshold. Compute posterior peak probabilities and report all loci with P(Peak) > 0.8.
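A minimal sketch of that consistency check and of the posterior peak calls, reusing the hypothetical forward_backward helper from the HMM inference slide above (note: for 50,000-bin segments the unscaled recursions underflow, so in practice you would rescale or work in log space):

```python
import numpy as np

def check_and_call_peaks(obs, start, trans, emit, peak_states, threshold=0.8):
    """Check forward vs. backward totals, then report bins whose posterior peak probability exceeds the threshold."""
    fwd, bwd, posterior, total_fwd = forward_backward(obs, start, trans, emit)
    # Total probability from the backward side: sum_s start[s] * emit[s, x_1] * bwd[0, s]
    total_bwd = (start * emit[:, obs[0]] * bwd[0]).sum()
    assert np.isclose(total_fwd, total_bwd), "forward and backward totals disagree"
    # Posterior probability of being in any peak state, per bin
    p_peak = posterior[:, peak_states].sum(axis=1)
    return np.where(p_peak > threshold)[0], p_peak
```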

Your Task: Analysis. Compare the two peak structures you get (from CTCF and PolII). Retrain a single model jointly on the two datasets. Compute the log-likelihood of the unified model and compare it to the sum of the likelihoods of the two separate models. Optional: test whether the difference is significant by sampling data from the unified model, training two models on the synthetic data, and computing the likelihood delta as for the real data. Use a set of known TSSs to compute the distribution of peaks relative to genes.