CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models.

In the next few lectures Hidden Markov Models –The theory –Probabilistic treatment of sequence alignments using HMMs –Application of HMMs to biological sequence modeling and discovery of features such as genes

Markov models A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values. First order: P(x_i | x_1, …, x_{i-1}) = P(x_i | x_{i-1}). Second order: P(x_i | x_1, …, x_{i-1}) = P(x_i | x_{i-1}, x_{i-2}). 0th order: P(x_i | x_1, …, x_{i-1}) = P(x_i) (independence).

Probability of generating a sequence with a 1st-order Markov model: P(x) = P(x_1) · P(x_2 | x_1) · … · P(x_L | x_{L-1}).

CpG islands For biological reasons, CpG (a C followed by a G) is very rare in mammalian genomes, but relatively more common in promoter regions. Detecting CpG islands helps predict genes. Model: consider all 16 conditional probabilities P(G | C), P(T | C), P(A | C), P(C | C), P(G | T), …

A 1st-order Markov model for CpG islands Essentially a finite state automaton with probabilistic transitions. 4 states: A, C, G, T. 16 transitions: a_st = P(x_i = t | x_{i-1} = s). Begin/End states (for convenience).

Probability of a sequence What’s the probability of ACGGCTA? –For a 0th-order Markov model, simply P(A)P(C)… –For a 1st-order Markov model like the one above: P(A) · P(C|A) · P(G|C) · … · P(A|T) = a_BA a_AC a_CG … a_TA

Probability of a sequence What’s the probability of ACGGCTA? P(A) · P(C|A) · P(G|C) · … · P(A|T) = a_BA a_AC a_CG … a_TA. Equivalently: follow the path in the automaton and multiply the probability of each arc on the path. Given a sequence, there is only one path you can walk.
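A minimal sketch of this computation in Python; the transition values below are hypothetical placeholders, not the trained CpG parameters:

```python
import math

# Hypothetical 1st-order transition probabilities a[s][t] = P(x_i = t | x_{i-1} = s);
# real values would be estimated from training data as shown on the next slides.
a = {
    'A': {'A': 0.30, 'C': 0.20, 'G': 0.30, 'T': 0.20},
    'C': {'A': 0.25, 'C': 0.30, 'G': 0.25, 'T': 0.20},
    'G': {'A': 0.20, 'C': 0.30, 'G': 0.30, 'T': 0.20},
    'T': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
}
begin = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}  # a_Bs: begin-state probabilities (assumed uniform)

def log_prob_markov(x):
    """log P(x) = log a_B,x1 + sum_i log a_{x_{i-1}, x_i} under the 1st-order chain."""
    lp = math.log(begin[x[0]])
    for s, t in zip(x, x[1:]):
        lp += math.log(a[s][t])
    return lp

print(log_prob_markov("ACGGCTA"))
```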

Training Estimate the parameters of the model. –Training sequences: sequences with labels (CpG islands or not CpG islands) –Test sequences: sequences for testing (labels removed) Two models: –a CpG model learned from known CpG islands –a background model learned from known non-CpG islands What to learn? Transition frequencies: a_st = #(s→t) / #(s→·), i.e. the number of s→t transitions divided by the total number of transitions out of s.
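A sketch of the counting step; the training sequences below are hypothetical placeholders standing in for real labeled data:

```python
from collections import defaultdict

def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum-likelihood estimate a_st = #(s->t) / #(s->anything) from labeled sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    a = {}
    for s in alphabet:
        total = sum(counts[s][t] for t in alphabet)
        # In practice a pseudocount is often added to avoid zero probabilities.
        a[s] = {t: counts[s][t] / total if total else 1.0 / len(alphabet) for t in alphabet}
    return a

# Hypothetical training data, one model per class:
train_plus = ["CGCGCGTACGCG", "GCGCGCGC"]        # known CpG islands
train_minus = ["ATATTTGACTAA", "TTTAAACCAT"]     # known non-CpG sequences
a_plus = estimate_transitions(train_plus)
a_minus = estimate_transitions(train_minus)
```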

Training Parameters learned from known CpG islands and non-CpG islands

Discrimination / Classification Given a sequence, is it a CpG island or not? Log likelihood ratio: for X = ACGGCTA, compute log(P(X | CpG+) / P(X | CpG-)), or log(P(CpG+ | X) / P(CpG- | X)) given priors. P(X | CpG+) = a+_BA a+_AC a+_CG … a+_TA and P(X | CpG-) = a-_BA a-_AC a-_CG … a-_TA. Q: how to get a+_BA and a-_BA? (B: begin state)
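A sketch of the log-odds classifier; it assumes transition tables such as a_plus / a_minus estimated as in the previous sketch, and uses uniform begin probabilities as a placeholder:

```python
import math

def log_odds(x, a_plus, a_minus, begin_plus, begin_minus):
    """log [ P(x | CpG+) / P(x | CpG-) ]; positive values favor the CpG+ model.
    begin_plus / begin_minus play the role of a+_Bs and a-_Bs for the first symbol."""
    score = math.log(begin_plus[x[0]]) - math.log(begin_minus[x[0]])
    for s, t in zip(x, x[1:]):
        score += math.log(a_plus[s][t]) - math.log(a_minus[s][t])
    return score

# Usage, with a_plus / a_minus as estimated in the previous sketch and uniform begin distributions:
# uniform = {c: 0.25 for c in "ACGT"}
# print(log_odds("ACGGCTA", a_plus, a_minus, uniform, uniform))
```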

CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Questions Q1: given a short sequence, is it more likely from the feature model or the background model? Q2: given a long sequence, where are the features in it (if any)? –Approach 1: score fixed-length windows (e.g., 100 bp). Pro: simple. Con: arbitrary fixed length, inflexible. –Approach 2: combine the + and - models.

Combined model Now given a sequence, we cannot directly tell which state a symbol is in: there are two states for each symbol (hidden Markov model => hidden states).

Decoding Given a sequence, e.g. ACGTACGTATA…, what’s the most probable path?

Most probable path (Trellis figure: each position of the sequence has a + state and a - state, with arcs from a begin state B.) The probability of a path is the product of all probabilities for the arcs on the path. Find a path with the following objective: maximize the product of probabilities for the arcs on the path, i.e., maximize the sum of log probabilities for the arcs on the path.

Most probable path
V(+, i+1) = max { V(+, i) + w(x+_i, x+_{i+1}),  V(-, i) + w(x-_i, x+_{i+1}) }
V(-, i+1) = max { V(+, i) + w(x+_i, x-_{i+1}),  V(-, i) + w(x-_i, x-_{i+1}) }
where w(s, t) = log(a_st).

Simpler but more general case We can go to any state at any time, independent of the input. Each state can emit any symbol with some probability (possibly zero). Fair die: P(s|F) = 1/6 for s = 1..6. Loaded die: P(s|L) = 1/10 for s = 1..5, P(6|L) = 1/2. The probability of a path is the product of all probabilities for the arcs on the path, times the probabilities of emitting the observed sequence of symbols.
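To make the "arcs plus emissions" statement concrete, here is the casino model as plain dictionaries with a function for the probability of one particular path. The emission probabilities come from the slide; the transition and begin probabilities are not given on the slide, so the values below are only an assumption:

```python
# Emission probabilities as on the slide.
emit = {
    'F': {s: 1 / 6 for s in range(1, 7)},                  # fair die
    'L': {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2},   # loaded die
}
# Transition and begin probabilities: assumed values, not given on the slide.
trans = {'F': {'F': 0.95, 'L': 0.05},
         'L': {'F': 0.10, 'L': 0.90}}
start = {'F': 0.5, 'L': 0.5}

def path_probability(x, pi):
    """P(x, pi) = a_B,pi1 * e_pi1(x_1) * prod_i a_{pi_{i-1}, pi_i} * e_{pi_i}(x_i)."""
    p = start[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[pi[i - 1]][pi[i]] * emit[pi[i]][x[i]]
    return p

# e.g. path_probability([6, 6, 1], ['L', 'L', 'F'])
```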

Two types of rewards: –Mileage: for using an arc (fixed) –Bonus: for visiting a node (depending on the token you show) (Trellis figure: F and L states at each position, starting from B; emissions P(s|F) = 1/6 for s in 1..6, P(s|L) = 1/10 for s in 1..5, P(6|L) = 1/2.)

V(F, i+1) = max { V(F, i) + w(F, F) + r(F, x_{i+1}),  V(L, i) + w(L, F) + r(F, x_{i+1}) }
V(L, i+1) = max { V(F, i) + w(F, L) + r(L, x_{i+1}),  V(L, i) + w(L, L) + r(L, x_{i+1}) }
where r(k, x) = log(e_k(x)) and w(k, l) = log(a_kl).

FSA interpretation Emissions: P(s|F) = 1/6 for s = 1..6; P(s|L) = 1/10 for s = 1..5, P(6|L) = 1/2. Node bonuses r = log(10·p) (the factor 10 adds the same constant to every path of the same length, so it does not change which path is best): r(F, s) ≈ 0.5 for s = 1..6; r(L, s) = 0 for s = 1..5, r(L, 6) ≈ 1.6.
V(F, i+1) = max { V(L, i) + w(L, F) + r(F, x_{i+1}),  V(F, i) + w(F, F) + r(F, x_{i+1}) }
V(L, i+1) = max { V(L, i) + w(L, L) + r(L, x_{i+1}),  V(F, i) + w(F, L) + r(L, x_{i+1}) }

More general case (Trellis figure: a model with k states at every position x_1 … x_N, starting from B.)

An FSA with k states, fully connected:
V(1, i+1) = max { V(1, i) + w(1, 1) + r(1, x_{i+1}),  V(2, i) + w(2, 1) + r(1, x_{i+1}),  …,  V(k, i) + w(k, 1) + r(1, x_{i+1}) }
V(2, i+1) = max { V(1, i) + w(1, 2) + r(2, x_{i+1}),  V(2, i) + w(2, 2) + r(2, x_{i+1}),  …,  V(k, i) + w(k, 2) + r(2, x_{i+1}) }

Generalize:
V(j, i+1) = max { V(1, i) + w(1, j) + r(j, x_{i+1}),  V(2, i) + w(2, j) + r(j, x_{i+1}),  …,  V(k, i) + w(k, j) + r(j, x_{i+1}) }
or simply: V(j, i+1) = max_l { V(l, i) + w(l, j) + r(j, x_{i+1}) }

The previous equation assumes a completely connected structure; in practice this may not be the case. Example with three states, Fair, Load_1 and Load_2: P(s|F) = 1/6 for s = 1..6; P(s|L1) = 1/10 for s = 1..5, P(6|L1) = 1/2; P(1|L2) = 1/2, P(s|L2) = 1/10 for s = 2..6.

Take advantage of the sparse structure (Fair connects to Load_1 and Load_2, but the two loaded states do not connect to each other):
V(F, i+1) = max { V(L1, i) + w(L1, F) + r(F, x_{i+1}),  V(L2, i) + w(L2, F) + r(F, x_{i+1}),  V(F, i) + w(F, F) + r(F, x_{i+1}) }
V(L1, i+1) = max { V(L1, i) + w(L1, L1) + r(L1, x_{i+1}),  V(F, i) + w(F, L1) + r(L1, x_{i+1}) }
V(L2, i+1) = max { V(L2, i) + w(L2, L2) + r(L2, x_{i+1}),  V(F, i) + w(F, L2) + r(L2, x_{i+1}) }

The Viterbi Algorithm
Input: x = x_1 … x_N
Initialization: P_0(0) = 1 (subscript 0 is the start state; the argument 0 is an imaginary position before the sequence); P_k(0) = 0 for all k > 0.
Iteration: for each position i = 1..N, for each state j = 1..K: P_j(i) = e_j(x_i) · max_k a_kj P_k(i-1); Ptr_j(i) = argmax_k a_kj P_k(i-1).
Termination: P(x, π*) = max_k P_k(N).
Traceback: π*_N = argmax_k P_k(N); π*_{i-1} = Ptr_{π*_i}(i).

The Viterbi Algorithm (in log space)
Input: x = x_1 … x_N
Initialization: V_0(0) = 0 (subscript 0 is the start state; the argument 0 is an imaginary position before the sequence); V_k(0) = -∞ for all k > 0.
Iteration: for each position i = 1..N, for each state j = 1..K: V_j(i) = r_j(x_i) + max_k (w_kj + V_k(i-1)), where r_j(x_i) = log(e_j(x_i)) and w_kj = log(a_kj); Ptr_j(i) = argmax_k (w_kj + V_k(i-1)).
Termination: P(x, π*) = exp{ max_k V_k(N) }.
Traceback: π*_N = argmax_k V_k(N); π*_{i-1} = Ptr_{π*_i}(i).
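A runnable sketch of the log-space recursion, written against the generic (states, start, trans, emit) representation used in the dice sketch above; the begin state is handled by folding the begin probabilities into position 1, which is equivalent to setting a_0k = start[k]:

```python
import math

def viterbi(x, states, start, trans, emit):
    """Log-space Viterbi: return (max over pi of log P(x, pi), pi*)."""
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            # Maximize w_kj + V_k(i-1) over the previous state k.
            best = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][j]))
            V[i][j] = math.log(emit[j][x[i]]) + V[i - 1][best] + math.log(trans[best][j])
            ptr[i][j] = best
    # Termination and traceback.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return V[-1][last], path[::-1]

# Usage with the dice model from the earlier sketch (the rolls are hypothetical):
# score, path = viterbi([1, 2, 6, 6, 3, 6, 6, 6, 4, 5], ['F', 'L'], start, trans, emit)
```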

The Viterbi Algorithm Similar to “aligning” a set of states to a sequence. Time: O(K²N). Space: O(KN). (Figure: the K × N dynamic programming matrix of values V_j(i), with states 1..K as rows and positions x_1 … x_N as columns.)

So far so good, but…

Problem Is the most probable path necessarily the best one? –Single optimum vs. multiple sub-optima: what if there are many sub-optimal paths with only slightly lower probabilities? –Global vs. local: what’s best globally may not be best for each individual position.

For example (paths shown in a figure): one path has probability 0.4 and two other paths have probability 0.3 each. What’s the most probable state at step 2? A total probability of 0.4 goes through state 5, while 0.6 goes through state 6, so the Viterbi path may not be the only interesting answer.

Another example: the dishonest casino. Say x is a sequence of rolls (shown on the slide). The most probable path is π = FF……F, i.e. all fair. However, the marked letters are more likely to be L than the unmarked letters. Another way to interpret the problem: with Viterbi, every position is assigned a label, but what is the confidence level for each assignment?

Posterior decoding Viterbi finds the single path with the highest probability, but that probability is usually very tiny anyway. What we want to know is P(π_i = k | x) for every position i and state k (and for each i these probabilities sum to 1 over k).

Probability of a sequence The probability that a sequence is generated by a model: P(X | M), sometimes simply written as P(X), and sometimes written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for θ to optimize the likelihood. This is not equal to the probability of a path P(X, π): there are many possible paths that can generate X, each with a different probability, and P(X) = Σ_π P(X, π) = Σ_π P(X | π) P(π). Why do we need this? And how do we compute it without summing over all paths (exponentially many)?

The forward algorithm Define f_k(i), the probability to get to state k at step i, summing over all possible previous paths. (Remember the way we counted the number of alignments?)

The forward algorithm

We can compute f_k(i) for all k and i using dynamic programming!
Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0.
Iteration: f_k(i) = e_k(x_i) Σ_j f_j(i-1) a_jk.
Termination: P(x) = Σ_k f_k(N).
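A minimal sketch of this recursion, written as a standalone function that takes the model as arguments; it folds the begin probabilities into position 1 (equivalent to a_0k = start[k]) and works in plain probability space, so long sequences would need scaling or log-space arithmetic:

```python
def forward(x, states, start, trans, emit):
    """f[i][k] = P(x_1..x_i, pi_i = k); returns (f, P(x))."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
                  for k in states})
    return f, sum(f[-1][k] for k in states)

# Usage with the dice model objects defined earlier:
# f, px = forward(rolls, states, start, trans, emit)
```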

Relation between Forward and Viterbi:
VITERBI. Initialization: P_0(0) = 1; P_k(0) = 0 for all k > 0. Iteration: P_k(i) = e_k(x_i) max_j P_j(i-1) a_jk. Termination: P(x, π*) = max_k P_k(N).
FORWARD. Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0. Iteration: f_k(i) = e_k(x_i) Σ_j f_j(i-1) a_jk. Termination: P(x) = Σ_k f_k(N).
Forward is Viterbi with the max replaced by a sum.

The backward algorithm Define b_k(i) as the probability of the paths starting from position i in state k and going to the end of the sequence, summing over all possible paths from i to N.
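A matching sketch of the backward recursion; it assumes there is no explicit end state, so b_k(N) = 1 for all k (with an end state the initialization would be b_k(N) = a_k0):

```python
def backward(x, states, trans, emit):
    """b[i][k] = P(x_{i+1}..x_N | pi_i = k).
    Note that b_k(i) does not include the emission probability of x_i itself."""
    N = len(x)
    b = [{k: 1.0 for k in states} for _ in range(N)]   # b_k(N) = 1: no explicit end state assumed
    for i in range(N - 2, -1, -1):
        for k in states:
            b[i][k] = sum(trans[k][j] * emit[j][x[i + 1]] * b[i + 1][j] for j in states)
    return b
```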

(Note: b_k(i) does not include the emission probability of x_i.)

f_k(i): the probability of getting to position i in state k, emitting x_1 … x_i (including x_i). b_k(i): the probability of emitting x_{i+1} … x_N, given that position i is in state k. What is f_k(i) · b_k(i)?

Is your guess correct?

What we need is P(π_i = k | x). But f_k(i) · b_k(i) = P(x, π_i = k), so P(π_i = k | x) = f_k(i) b_k(i) / P(x), and we already have P(x) from the forward algorithm. Can verify: Σ_k P(π_i = k | x) = 1.

The forward-backward algorithm Compute f_k(i) for each state k and position i. Compute b_k(i) for each state k and position i. Compute P(x) = Σ_k f_k(N). Compute P(π_i = k | x) = f_k(i) · b_k(i) / P(x).
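A sketch of posterior decoding that assumes the forward and backward functions from the two sketches above:

```python
def posterior_decode(x, states, start, trans, emit):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x), plus the per-position argmax state."""
    f, px = forward(x, states, start, trans, emit)     # forward sketch from above
    b = backward(x, states, trans, emit)               # backward sketch from above
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
    pi_hat = [max(states, key=lambda k: p[k]) for p in post]
    return post, pi_hat
```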

f_k(i) · b_k(i) is the probability of x with the constraint that the path passes through state k at step i; dividing by P(X) gives P(π_i = k | x). (Figure: forward probabilities cover the sequence up to position i, backward probabilities cover the rest.) Space: O(KN). Time: O(K²N).

Relation to another F-B algorithm We’ve already seen a forward-backward algorithm in linear-space sequence alignment: Hirschberg’s algorithm, also known as the forward-backward alignment algorithm.

What’s P(π_i = k | x) good for? For each position, you can assign a probability (in [0, 1]) to each state that the system might be in at that point: a confidence level. Assign each symbol to the most likely state according to this probability, rather than to the state on the most probable path (posterior decoding): π̂_i = argmax_k P(π_i = k | x).

For example, if P(π_i = Fair | x) > 0.5, roll i is more likely to have been generated by the fair die than by the loaded die.

CpG islands again Data: 41 human sequences, totaling 60 kbp, including 48 CpG islands of about 1 kbp each. Viterbi: found 46 of 48 islands, plus 121 “false positives”; after post-processing, 46/48 with 67 false positives. Posterior decoding: the same 2 false negatives (46/48), plus 236 false positives; after post-processing, 83 false positives. Post-processing: merge predictions within 500 bp of each other; discard predictions shorter than 500 bp.

What if a new genome comes? We just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG islands for porcupines We suspect the frequency and characteristics of CpG islands are quite different in porcupines How do we adjust the parameters in our model? - LEARNING

Learning When the states are known –We’ve already done that –Estimate parameters from labeled data (known CpG or non-CpG) –“supervised” learning When the states are unknown –Estimate parameters without labeled data –“unsupervised” learning

Basic idea 1. Estimate our “best guess” of the model parameters θ. 2. Use θ to predict the unknown labels. 3. Re-estimate a new set of θ. 4. Repeat steps 2 and 3. There are two ways to do this (next two slides).

Viterbi Training: given θ, estimate π; then re-estimate θ. 1. Make initial estimates of the parameters θ. 2. Find the Viterbi path π for each training sequence. 3. Count transitions/emissions on those paths, getting a new θ. 4. Repeat steps 2 and 3. Not rigorously optimizing the desired likelihood, but still useful and commonly used. (Arguably a good choice if you will be doing Viterbi decoding.)

Baum-Welch Training: given θ, estimate the ensemble of paths π; then re-estimate θ. Instead of estimating the new θ from the most probable path π*, we re-estimate θ from all possible paths. For example, suppose that according to Viterbi, position i is in state k and position i+1 is in state l; this contributes 1 count towards the frequency that the transition k → l is used. In Baum-Welch, this transition is counted only partially, according to the probability that this arc is taken by some path.

Estimated number of k → l transitions: A_kl = (1/P(x)) Σ_i f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1).
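A sketch of this E-step computation, again assuming the forward and backward functions from the earlier sketches; the formula is the standard Baum-Welch expected-count expression:

```python
def expected_transition_counts(x, states, start, trans, emit):
    """E-step for transitions: A_kl = (1/P(x)) * sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1)."""
    f, px = forward(x, states, start, trans, emit)     # forward/backward sketches from above
    b = backward(x, states, trans, emit)
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
    return A

# The M-step would then set a_kl = A[k][l] / (sum over l' of A[k][l']), and analogously
# re-estimate the emission probabilities from expected emission counts.
```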

Why does this work? The proof is complicated (Ch. 11 in the Durbin book), but basically: the forward-backward algorithm computes the likelihood of the data given the HMM; when we re-estimate θ, we maximize P(sequence | θ); the effect is that in each iteration the likelihood of the sequence improves, so the procedure is guaranteed to converge (though not necessarily to a global optimum).

Expectation-maximization (EM) The Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data. E-step: compute the expectations (e.g. the probability of each position being in a certain state). M-step: maximum-likelihood parameter estimation. We’ll see EM and similar techniques again in motif finding; k-means clustering has a similar flavor.

HMM summary Viterbi: best single path. Forward: sum over all paths. Backward: similar, from the other end. Baum-Welch: training via EM, using forward-backward. Viterbi training: another “EM”-like procedure, but Viterbi-based.