Hidden Markov Models BMI/CS 576

Slides:



Advertisements
Similar presentations
. Lecture #8: - Parameter Estimation for HMM with Hidden States: the Baum Welch Training - Viterbi Training - Extensions of HMM Background Readings: Chapters.
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
Marjolijn Elsinga & Elze de Groot1 Markov Chains and Hidden Markov Models Marjolijn Elsinga & Elze de Groot.
HMM II: Parameter Estimation. Reminder: Hidden Markov Model Markov Chain transition probabilities: p(S i+1 = t|S i = s) = a st Emission probabilities:
Learning HMM parameters
Lecture 8: Hidden Markov Models (HMMs) Michael Gutkin Shlomi Haba Prepared by Originally presented at Yaakov Stein’s DSPCSP Seminar, spring 2002 Modified.
Hidden Markov Models Eine Einführung.
MNW2 course Introduction to Bioinformatics
Hidden Markov Models Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Statistical NLP: Lecture 11
Hidden Markov Models Theory By Johan Walters (SR 2003)
Statistical NLP: Hidden Markov Models Updated 8/12/2005.
Hidden Markov Models Fundamentals and applications to bioinformatics.
Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer.
Hidden Markov Model 11/28/07. Bayes Rule The posterior distribution Select k with the largest posterior distribution. Minimizes the average misclassification.
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
. Parameter Estimation For HMM Background Readings: Chapter 3.3 in the book, Biological Sequence Analysis, Durbin et al., 2001.
Lecture 5: Learning models using EM
S. Maarschalkerweerd & A. Tjhang1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
Master’s course Bioinformatics Data Analysis and Tools
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Forward-backward algorithm LING 572 Fei Xia 02/23/06.
Elze de Groot1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
Hidden Markov Models.
Hidden Markov models Sushmita Roy BMI/CS 576 Oct 16 th, 2014.
Learning HMM parameters Sushmita Roy BMI/CS 576 Oct 21 st, 2014.
Visual Recognition Tutorial1 Markov models Hidden Markov models Forward/Backward algorithm Viterbi algorithm Baum-Welch estimation algorithm Hidden.
. Class 5: Hidden Markov Models. Sequence Models u So far we examined several probabilistic model sequence models u These model, however, assumed that.
1 Markov Chains. 2 Hidden Markov Models 3 Review Markov Chain can solve the CpG island finding problem Positive model, negative model Length? Solution:
Probabilistic Sequence Alignment BMI 877 Colin Dewey February 25, 2014.
MNW2 course Introduction to Bioinformatics Lecture 22: Markov models Centre for Integrative Bioinformatics FEW/FALW
Hidden Markov Models for Sequence Analysis 4
. Parameter Estimation For HMM Lecture #7 Background Readings: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Fundamentals of Hidden Markov Model Mehmet Yunus Dönmez.
Hidden Markov Models Yves Moreau Katholieke Universiteit Leuven.
Hidden Markov Models Usman Roshan CS 675 Machine Learning.
Hidden Markov Models BMI/CS 776 Mark Craven March 2002.
S. Salzberg CMSC 828N 1 Three classic HMM problems 2.Decoding: given a model and an output sequence, what is the most likely state sequence through the.
PGM 2003/04 Tirgul 2 Hidden Markov Models. Introduction Hidden Markov Models (HMM) are one of the most common form of probabilistic graphical models,
CS Statistical Machine learning Lecture 24
1 CONTEXT DEPENDENT CLASSIFICATION  Remember: Bayes rule  Here: The class to which a feature vector belongs depends on:  Its own value  The values.
1 CS 552/652 Speech Recognition with Hidden Markov Models Winter 2011 Oregon Health & Science University Center for Spoken Language Understanding John-Paul.
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Markov Chains and Hidden Markov Model.
CZ5226: Advanced Bioinformatics Lecture 6: HHM Method for generating motifs Prof. Chen Yu Zong Tel:
1 CSE 552/652 Hidden Markov Models for Speech Recognition Spring, 2006 Oregon Health & Science University OGI School of Science & Engineering John-Paul.
1 Hidden Markov Model Observation : O1,O2,... States in time : q1, q2,... All states : s1, s2,... Si Sj.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Elements of a Discrete Model Evaluation.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
1 Hidden Markov Models Hsin-min Wang References: 1.L. R. Rabiner and B. H. Juang, (1993) Fundamentals of Speech Recognition, Chapter.
Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.
Definition of the Hidden Markov Model A Seminar Speech Recognition presentation A Seminar Speech Recognition presentation October 24 th 2002 Pieter Bas.
Visual Recognition Tutorial1 Markov models Hidden Markov models Forward/Backward algorithm Viterbi algorithm Baum-Welch estimation algorithm Hidden.
Hidden Markov Models – Concepts 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Hidden Markov Models HMM Hassanin M. Al-Barhamtoshy
Learning Sequence Motif Models Using Expectation Maximization (EM)
Hidden Markov Models - Training
Hidden Markov Models Part 2: Algorithms
1.
Three classic HMM problems
N-Gram Model Formulas Word sequences Chain rule of probability
CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (I)
CONTEXT DEPENDENT CLASSIFICATION
Input Output HMMs for modeling network dynamics
Introduction to HMM (cont)
Hidden Markov Model Lecture #6
Hidden Markov Models By Manish Shrivastava.
CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (I)
Presentation transcript:

Hidden Markov Models BMI/CS 576 www.biostat.wisc.edu/bmi576.html Colin Dewey cdewey@biostat.wisc.edu Fall 2010

Classifying sequences Markov chains useful for modeling a single class of sequence likelihood ratios of different models can be used to classify sequences What if a sequence contains multiple classes of elements? Example: a whole genome sequence How can we model such sequences? How can we partition these sequences into their component elements?

One attempt: merge Markov chains “CpG island” “background” Problem: given say a T in our input sequence, which state emitted it?

Hidden State we’ll distinguish between the observed parts of a problem and the hidden parts in the Markov models we’ve considered previously, it is clear which state accounts for each part of the observed sequence in the model above, there are multiple states that could account for each part of the observed sequence this is the hidden part of the problem

Two HMM random variables Observed sequence Hidden state sequence HMM: Markov chain over hidden sequence Dependence between

The Parameters of an HMM as in Markov chain models, we have transition probabilities probability of a transition from state k to l represents a path (sequence of states) through the model since we’ve decoupled states and characters, we also have emission probabilities Explain probability notation probability of emitting character b in state k

A Simple HMM with Emission Parameters probability of a transition from state 1 to state 3 probability of emitting character A in state 2 0.4 A 0.4 C 0.1 G 0.2 T 0.3 A 0.1 C 0.4 G 0.4 T 0.1 A 0.2 C 0.3 G 0.3 T 0.2 begin end 0.5 0.2 0.8 0.6 0.1 0.9 5 4 3 2 1 G 0.1 T 0.4 0.8

Simple HMM for Gene Finding Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

Three Important Questions How likely is a given sequence? the Forward algorithm What is the most probable “path” (sequence of hidden states) for generating a given sequence? the Viterbi algorithm How can we learn the HMM parameters given a set of sequences? the Forward-Backward (Baum-Welch) algorithm

How Likely is a Given Path and Sequence? the probability that the path is taken and the sequence is generated: (assuming begin/end are the only silent states on path) Note that we have both transitions and emissions

How Likely Is A Given Path and Sequence? 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 begin end 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 \pi = 0 1 1 3 5 2 4 0.1 0.8

How Likely is a Given Sequence? We usually only observe the sequence, not the path To find the probability of a sequence, we must sum over all possible paths but the number of paths can be exponential in the length of the sequence... the Forward algorithm enables us to compute this efficiently

How Likely is a Given Sequence: The Forward Algorithm A dynamic programming solution subproblem: define to be the probability of generating the first i characters and ending in state k we want to compute , the probability of generating the entire sequence (x) and ending in the end state (state N) can define this recursively Draw trellis

The Forward Algorithm because of the Markov property, don’t have to explicitly enumerate every path A 0.4 C 0.1 G 0.2 T 0.3 A 0.1 C 0.4 G 0.4 T 0.1 G 0.1 T 0.4 A 0.2 C 0.3 G 0.3 T 0.2 begin end 0.5 0.2 0.8 0.4 0.6 0.1 0.9 5 4 3 2 1 e.g. compute using

The Forward Algorithm initialization: probability that we’re in start state and have observed 0 characters from the sequence

The Forward Algorithm recursion for emitting states (i =1…L): recursion for silent states: Draw trellis Sum is usually not over all states (sparse transition matrix)

The Forward Algorithm termination: probability that we’re in the end state and have observed the entire sequence

Forward Algorithm Example C 0.1 G 0.2 T 0.3 A 0.1 C 0.4 G 0.4 T 0.1 G 0.1 T 0.4 A 0.2 C 0.3 G 0.3 T 0.2 begin end 0.5 0.2 0.8 0.4 0.6 0.1 0.9 5 4 3 2 1 given the sequence x = TAGA

Forward Algorithm Example given the sequence x = TAGA initialization computing other values

Forward Algorithm Note in some cases, we can make the algorithm more efficient by taking into account the minimum number of steps that must be taken to reach a state e.g. for this HMM, we don’t need to initialize or compute the values begin end 5 4 3 2 1

Three Important Questions How likely is a given sequence? What is the most probable “path” for generating a given sequence? How can we learn the HMM parameters given a set of sequences?

Finding the Most Probable Path: The Viterbi Algorithm Dynamic programming approach, again! subproblem: define to be the probability of the most probable path accounting for the first i characters of x and ending in state k we want to compute , the probability of the most probable path accounting for all of the sequence and ending in the end state can define recursively can use DP to find efficiently

Finding the Most Probable Path: The Viterbi Algorithm initialization:

The Viterbi Algorithm recursion for emitting states (i =1…L): keep track of most probable path recursion for silent states:

The Viterbi Algorithm termination: \pi*: a mostly likely path traceback: follow pointers back starting at

Forward & Viterbi Algorithms Forward/Viterbi algorithms effectively consider all possible paths for a sequence Forward to find probability of a sequence Viterbi to find most probable path consider a sequence of length 4… begin end

Three Important Questions How likely is a given sequence? What is the most probable “path” for generating a given sequence? How can we learn the HMM parameters given a set of sequences?

Learning without Hidden State learning is simple if we know the correct path for each sequence in our training set 2 2 4 4 5 C A G T begin 1 3 end 5 If we know path, then can estimate transition parameters just as Markov model. Emission parameters estimated via standard multinomial estimation as described in Markov model lecture. We sometimes do this when we have an annotated training set. Example: experimentally derived gene annotations 2 4 estimate parameters by counting the number of times each parameter is used across the training set

Learning with Hidden State if we don’t know the correct path for each sequence in our training set, consider all possible paths for the sequence ? ? ? ? 5 C A G T begin 1 3 end 5 2 4 estimate parameters through a procedure that counts the expected number of times each transition and emission occurs across the training set

Learning Parameters if we know the state path for each training sequence, learning the model parameters is simple no hidden state during training count how often each transition and emission occurs normalize/smooth to get probabilities process is just like it was for Markov chain models if we don’t know the path for each training sequence, how can we determine the counts? key insight: estimate the counts by considering every path weighted by its probability

Learning Parameters: The Baum-Welch Algorithm a.k.a the Forward-Backward algorithm an Expectation Maximization (EM) algorithm EM is a family of algorithms for learning probabilistic models in problems that involve hidden state in this context, the hidden state is the path that explains each training sequence

Learning Parameters: The Baum-Welch Algorithm algorithm sketch: initialize the parameters of the model iterate until convergence calculate the expected number of times each transition or emission is used adjust the parameters to maximize the likelihood of these expected values

The Expectation Step first, we need to know the probability of the i th symbol being produced by state k, given sequence x given this we can compute our expected counts for state transitions, character emissions

The Expectation Step the probability of of producing x with the i th symbol being produced by state k is the first term is , computed by the forward algorithm the second term is , computed by the backward algorithm

The Expectation Step C A G T we want to know the probability of producing sequence x with the i th symbol being produced by state k (for all x, i and k) 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 begin 1 3 end 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 0.8 C A G T

The Expectation Step C A G T the forward algorithm gives us , the probability of being in state k having observed the first i characters of x 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 end begin 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 0.8 C A G T

The Expectation Step C A G T the backward algorithm gives us , the probability of observing the rest of x, given that we’re in state k after i characters 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 begin end 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 Give probabilistic definitions for f_k(i) and b_k(i) 2 4 0.1 0.8 C A G T

The Expectation Step C A G T putting forward and backward together, we can compute the probability of producing sequence x with the i th symbol being produced by state k 0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 begin end 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 2 4 0.1 0.8 C A G T

The Backward Algorithm initialization: for states with a transition to end state

The Backward Algorithm recursion (i =L-1…0): An alternative to the forward algorithm for computing the probability of a sequence:

The Expectation Step now we can calculate the probability of the i th symbol being produced by state k, given x

The Expectation Step now we can calculate the expected number of times letter c is emitted by state k here we’ve added the superscript j to refer to a specific sequence in the training set sum over sequences sum over positions where c occurs in xj

The Expectation Step sum over positions sum over sequences where c occurs in x

The Expectation Step and we can calculate the expected number of times that the transition from k to l is used or if l is a silent state

The Maximization Step Let be the expected number of emissions of c from state k for the training set estimate new emission parameters by: just like in the simple case but typically we’ll do some “smoothing” (e.g. add pseudocounts)

The Maximization Step let be the expected number of transitions from state k to state l for the training set estimate new transition parameters by: Can also add pseudocounts

The Baum-Welch Algorithm initialize the parameters of the HMM iterate until convergence initialize , with pseudocounts E-step: for each training set sequence j = 1…n calculate values for sequence j add the contribution of sequence j to , M-step: update the HMM parameters using ,

Baum-Welch Algorithm Example given the HMM with the parameters initialized as shown the training sequences TAG, ACG A 0.1 C 0.4 G 0.4 T 0.1 A 0.4 C 0.1 G 0.1 T 0.4 begin end 1.0 0.1 0.9 0.2 0.8 3 2 1 we’ll work through one iteration of Baum-Welch

Baum-Welch Example (Cont) determining the forward values for TAG here we compute just the values that represent events with non-zero probability in a similar way, we also compute forward values for ACG Note that some of the forward probabilities not computed actually have non-zero value, but are not necessary to compute because the corresponding backward value is zero

Baum-Welch Example (Cont) determining the backward values for TAG here we compute just the values that represent events with non-zero probability in a similar way, we also compute backward values for ACG

Baum-Welch Example (Cont) determining the expected emission counts for state 1 contribution of TAG contribution of ACG pseudocount *note that the forward/backward values in these two columns differ; in each column they are computed for the sequence associated with the column

Baum-Welch Example (Cont) determining the expected transition counts for state 1 (not using pseudocounts) in a similar way, we also determine the expected emission/transition counts for state 2 contribution of TAG contribution of ACG

Baum-Welch Example (Cont) determining probabilities for state 1

Computational Complexity of HMM Algorithms given an HMM with N states and a sequence of length L, the time complexity of the Forward, Backward and Viterbi algorithms is this assumes that the states are densely interconnected Given M sequences of length L, the time complexity of Baum-Welch on each iteration is N^2 is maximum number of edges in state transition diagram

Baum-Welch Convergence some convergence criteria likelihood of the training sequences changes little fixed number of iterations reached parameters are not changing significantly usually converges in a small number of iterations will converge to a local maximum (in the likelihood of the data given the model) Draw 1-D local maxima problem What type of algorithm is Baum-Welch: greedy! Hill-climbing If there is only one local maximum (likelihood is concave), then EM is guaranteed to work Otherwise, there is no guarantee parameters

Learning and Prediction Tasks Given: a model, a set of training sequences Do: find model parameters that explain the training sequences with relatively high probability (goal is to find a model that generalizes well to sequences we haven’t seen before) classification Given: a set of models representing different sequence classes, a test sequence Do: determine which model/class best explains the sequence segmentation Given: a model representing different sequence classes, a test sequence Do: segment the sequence into subsequences, predicting the class of each subsequence

Algorithms for Learning and Prediction Tasks correct path known for each training sequence  simple maximum-likelihood or Bayesian estimation correct path not known  Baum-Welch (EM) algorithm for maximum likelihood estimation classification simple Markov model  calculate probability of sequence along single path for each model hidden Markov model  Forward algorithm to calculate probability of sequence along all paths for each model segmentation hidden Markov model  Viterbi algorithm to find most probable path for sequence