CS5263 Bioinformatics Lecture 12: Hidden Markov Models and applications.

Project ideas Implement an HMM including Viterbi decoding, posterior decoding, and Baum-Welch learning –Construct a model –Generate sequences with the model –Given labels, estimate parameters –Given parameters, decode –Given nothing, learn the parameters and decode

Project ideas Implement a progressive multiple sequence aligner with iterative refinement –Use an inferred phylogenetic tree –Affine gap penalty? –Compare with results on protein families? –Compare with HMM-based methods?

Project ideas Implement a combinatorial motif finder –Fast enumeration using a suffix tree? –Statistical evaluation –Word clustering? –Test on simulated data: can you find known motifs embedded in sequences? –Test on real data: find motifs in some real promoter sequences and compare with what is known about those genes

Project ideas Pick a paper about some algorithm and implement it Do your own experiments Or pick a topic and do a survey

Problems in HMMs Decoding –Predict the state of each symbol: the most probable path, or the most probable state for each position (posterior decoding) Evaluation –The probability that a sequence is generated by a model –The basis for posterior decoding Learning –Decode without knowing the model parameters –Estimate the parameters without knowing the states

Review of last lecture

Decoding Input: an HMM (transition and emission parameters) and a sequence Output: the state of each position in the sequence

Decoding Solution 1: find the most probable path Algorithm: Viterbi

HMM for loaded/fair dice Two states: Fair (F) and Loaded (L) Transition probabilities: a_FF = 0.95, a_FL = 0.05, a_LF = 0.05, a_LL = 0.95 Emission probabilities: e_F(1) = e_F(2) = … = e_F(6) = 1/6; e_L(1) = e_L(2) = … = e_L(5) = 1/10, e_L(6) = 1/2 The probability of a path is the product of the transition probabilities and emission probabilities along the path
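
To make the example concrete, here is a minimal Python sketch of the fair/loaded dice model; the dictionary representation and variable names are our own choice, not from the slides:

```python
# Minimal sketch of the dishonest-casino HMM
states = ["F", "L"]                         # F = fair die, L = loaded die

# Transition probabilities a_kl = P(next state = l | current state = k)
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "F"): 0.05, ("L", "L"): 0.95}

# Emission probabilities e_k(x) for rolls x = 1..6
emit = {"F": {x: 1/6 for x in range(1, 7)},                      # fair die: uniform
        "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}   # loaded die favors 6

# The probability of a particular path is the product of transition and emission
# probabilities along it, e.g. the path F, F, L for rolls 2, 6, 6
# (assuming, for illustration, that we start in state F with probability 1):
p = emit["F"][2] * trans[("F", "F")] * emit["F"][6] * trans[("F", "L")] * emit["L"][6]
```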

HMM unrolled [Figure: trellis with a begin state B followed by one column per symbol x_1 … x_10, each column containing an F node and an L node] Node weight: r(F, x) = log(e_F(x)) Edge weight: w(F, L) = log(a_FL) Find a path with the following objective: maximize the product of transition and emission probabilities ⇒ maximize the sum of the weights ⇒ Strategy: dynamic programming

FSA interpretation The same dice HMM viewed as a finite state automaton with two states, Fair and Loaded.
Probabilities: a_FF = 0.95, a_FL = 0.05, a_LF = 0.05, a_LL = 0.95; e_F(1) = … = e_F(6) = 1/6; e_L(1) = … = e_L(5) = 1/10, e_L(6) = 1/2
Viterbi recurrence in probability space:
P(L, i+1) = max { P(L, i) · a_LL · e_L(x_{i+1}), P(F, i) · a_FL · e_L(x_{i+1}) }
P(F, i+1) = max { P(L, i) · a_LF · e_F(x_{i+1}), P(F, i) · a_FF · e_F(x_{i+1}) }
Equivalent recurrence in log (weight) space:
V(L, i+1) = max { V(L, i) + w(L, L) + r(L, x_{i+1}), V(F, i) + w(F, L) + r(L, x_{i+1}) }
V(F, i+1) = max { V(L, i) + w(L, F) + r(F, x_{i+1}), V(F, i) + w(F, F) + r(F, x_{i+1}) }
Weights on the slide: w(F, F) = w(L, L) = 2.3, w(F, L) = w(L, F) = -0.7; r(F, x) = 0.5 for every roll x; r(L, 1) = … = r(L, 5) = 0, r(L, 6) = 1.6 (these appear to be log-probabilities shifted by a constant, which adds the same amount to every path of a given length and therefore leaves the optimal path unchanged)

More general cases [Figure: fully connected state diagram with states 1, 2, 3, …, K] K states, completely connected (possibly with 0 transition probabilities) Each state has a set of emission probabilities (emission probabilities may be 0 for some symbols in some states)

HMM unrolled [Figure: trellis for the general case, with a begin state B followed by one column of states 1 … K per symbol x_1 … x_n]

Recurrence V(j, i+1) = max { V(1, i) + w(1, j) + r(j, x_{i+1}), V(2, i) + w(2, j) + r(j, x_{i+1}), …, V(K, i) + w(K, j) + r(j, x_{i+1}) } Or simply: V(j, i+1) = max_l { V(l, i) + w(l, j) + r(j, x_{i+1}) }

The Viterbi Algorithm Fill a K × N dynamic programming table V_j(i) (rows: states 1 … K; columns: positions x_1 … x_N) Time: O(K²N) Space: O(KN)
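
A minimal log-space Viterbi sketch, reusing the `states`, `trans`, and `emit` dictionaries from the dice example above. The initial state distribution is not specified on the slides, so a uniform start is assumed here, and all probabilities are assumed nonzero:

```python
import math

def viterbi(x, states, trans, emit, init=None):
    """Return the most probable state path for the observation sequence x."""
    if init is None:                                  # start distribution not given on the
        init = {k: 1 / len(states) for k in states}   # slides; assume uniform as a placeholder
    # V[k] = best log-probability of any path ending in state k at the current position
    V = {k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}
    back = []                                         # back-pointers, one dict per position
    for obs in x[1:]:
        newV, ptr = {}, {}
        for k in states:
            best = max(states, key=lambda j: V[j] + math.log(trans[(j, k)]))
            ptr[k] = best
            newV[k] = V[best] + math.log(trans[(best, k)]) + math.log(emit[k][obs])
        V = newV
        back.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda k: V[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

rolls = [2, 6, 6, 6, 3, 6, 6, 1, 4, 6]
print("".join(viterbi(rolls, states, trans, emit)))   # prints a string of F's and L's
```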

Problems with Viterbi decoding The most probable path is not necessarily the only interesting one –Single optimal path vs. multiple sub-optimal paths –Global optimum vs. local optima

For example Suppose three paths have probabilities 0.4, 0.3, and 0.3 The most probable path (probability 0.4) passes through state 5 at step 2, while the other two paths (total probability 0.6) pass through state 6 So the most probable state at step 2 is 6, even though the most probable path does not visit it –0.4 goes through 5 –0.6 goes through 6

Another example The dishonest casino Say x = … (a given sequence of rolls) Most probable path: π = FF……F However: the rolls marked on the slide are individually more likely to have been generated by L than the unmarked rolls

Posterior decoding Viterbi finds the single path with the highest probability Posterior decoding: π̂_i = argmax_k P(π_i = k | x) Need to know P(π_i = k | x) for each position i

Posterior decoding In order to do posterior decoding, we need to know the probability of a sequence given a model, since P(π_i = k | x) = P(π_i = k, x) / P(x) Computing P(x) is called the evaluation problem The solution: the forward-backward algorithm

The forward algorithm f_k(i) = P(x_1 … x_i, π_i = k): the probability of emitting the first i symbols and ending in state k at position i

Relation between Forward and Viterbi
VITERBI — Initialization: P_0(0) = 1, P_k(0) = 0 for all k > 0; Iteration: P_k(i) = e_k(x_i) · max_j P_j(i-1) a_jk; Termination: P(x, π*) = max_k P_k(N)
FORWARD — Initialization: f_0(0) = 1, f_k(0) = 0 for all k > 0; Iteration: f_k(i) = e_k(x_i) · Σ_j f_j(i-1) a_jk; Termination: P(x) = Σ_k f_k(N)
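
A corresponding sketch of the forward algorithm, again reusing the dice dictionaries and assuming a uniform start. It works with raw probabilities, so it will underflow on long sequences; a real implementation would rescale or work in log space:

```python
def forward(x, states, trans, emit, init=None):
    """Return (f, P(x)) where f[i][k] = P(x_1..x_{i+1}, pi_{i+1} = k) (0-based positions)."""
    if init is None:
        init = {k: 1 / len(states) for k in states}   # uniform start, assumed
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for obs in x[1:]:
        prev = f[-1]
        # f_k(i) = e_k(x_i) * sum_j f_j(i-1) * a_jk
        f.append({k: emit[k][obs] * sum(prev[j] * trans[(j, k)] for j in states)
                  for k in states})
    return f, sum(f[-1][k] for k in states)           # P(x) = sum_k f_k(N)
```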

The backward algorithm b_k(i) = P(x_{i+1} … x_N | π_i = k): the probability of emitting the rest of the sequence, given that position i is in state k Initialization: b_k(N) = 1 for all k Iteration: b_k(i) = Σ_j a_kj e_j(x_{i+1}) b_j(i+1) Note: b_k(i) does not include the emission probability of x_i

Forward-backward algorithm f_k(i): the probability of getting to position i in state k, having emitted x_1 … x_i b_k(i): the probability of emitting x_{i+1} … x_N, given that position i is in state k What is f_k(i) · b_k(i)? –Answer: P(π_i = k, x), the probability of the whole sequence with position i in state k

The forward-backward algorithm Compute f_k(i) for each state k and position i Compute b_k(i) for each state k and position i Compute P(x) = Σ_k f_k(N) Compute P(π_i = k | x) = f_k(i) · b_k(i) / P(x)
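
A sketch of the backward pass and posterior decoding, built on the `forward` function above (again unscaled, so only suitable for short sequences):

```python
def backward(x, states, trans, emit):
    """Return b where b[i][k] = P(x_{i+2}..x_N | pi_{i+1} = k) (0-based positions)."""
    b = [{k: 1.0 for k in states}]                    # b_k(N) = 1 for all k
    for obs in reversed(x[1:]):                       # obs plays the role of x_{i+1}
        nxt = b[0]
        # b_k(i) = sum_j a_kj * e_j(x_{i+1}) * b_j(i+1)
        b.insert(0, {k: sum(trans[(k, j)] * emit[j][obs] * nxt[j] for j in states)
                     for k in states})
    return b

def posterior_decode(x, states, trans, emit):
    f, px = forward(x, states, trans, emit)
    b = backward(x, states, trans, emit)
    # P(pi_i = k | x) = f_k(i) * b_k(i) / P(x); pick the most probable state per position
    return [max(states, key=lambda k: f[i][k] * b[i][k] / px) for i in range(len(x))]
```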

[Figure: the forward probabilities f_k(i) and the backward probabilities b_k(i) are each stored in a K × N table (states × sequence positions); multiplying them element-wise and dividing by P(x) gives P(π_i = k | x)] Space: O(KN) Time: O(K²N)

The forward-backward algorithm Posterior decoding: π̂_i = argmax_k P(π_i = k | x) P(π_i = k | x) also gives a confidence level for the assignment The same idea can be used to compute P(π_i = k, π_{i+1} = l | x): the probability that a particular transition is used at position i

For example If P(fair) > 0.5 at a position, the roll at that position is more likely to have been generated by the fair die than by the loaded die

Posterior decoding The resulting state sequence may not be a valid path: consecutive states may be connected by transitions of probability 0

Today Learning Practical issues in HMM learning

What if a new genome comes along? Suppose we have just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG islands for porcupines, and we suspect the frequency and characteristics of CpG islands are quite different in porcupines How do we adjust the parameters in our model? – LEARNING

Learning When the states are known –We have already done this –Estimate parameters from labeled data (known CpG or non-CpG regions) –"Supervised" learning –Frequency counting is called "maximum likelihood parameter estimation": the parameters you find maximize the likelihood of your data under the model When the states are unknown –Estimate parameters without labeled data –"Unsupervised" learning

Basic idea 1. We estimate our "best guess" of the model parameters θ 2. We use θ to predict the unknown labels 3. We re-estimate a new set of θ 4. Repeat 2 & 3 Two ways to do this: Viterbi training and Baum-Welch

Viterbi Training Given θ, estimate π; then re-estimate θ 1. Make initial estimates (guesses) of the parameters θ 2. Find the Viterbi path π for each training sequence 3. Count transitions/emissions on those paths to get a new θ 4. Repeat 2 & 3 Not rigorously optimizing the desired likelihood, but still useful and commonly used (arguably a good match if you will also be doing Viterbi decoding)
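
A rough sketch of this loop, reusing the `viterbi` function above. The pseudocounts are our own addition to avoid zero probabilities for unseen transitions and emissions:

```python
def viterbi_train(seqs, states, trans, emit, n_iter=10, pseudocount=1.0):
    """Viterbi training: decode with the current parameters, then re-estimate from the paths."""
    for _ in range(n_iter):
        # transition and emission counts, started at a small pseudocount
        A = {(k, l): pseudocount for k in states for l in states}
        E = {k: {t: pseudocount for t in emit[k]} for k in states}
        for x in seqs:
            path = viterbi(x, states, trans, emit)       # step 2: most probable path
            for i, obs in enumerate(x):                   # step 3: count along that path
                E[path[i]][obs] += 1
                if i + 1 < len(x):
                    A[(path[i], path[i + 1])] += 1
        # normalize the counts into new transition and emission probabilities
        trans = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
                 for k in states for l in states}
        emit = {k: {t: E[k][t] / sum(E[k].values()) for t in E[k]} for k in states}
    return trans, emit
```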

Baum-Welch Training Given θ, estimate the ensemble of paths π; then re-estimate θ Instead of estimating the new θ from the single most probable path π*, we re-estimate θ from all possible paths –For example, according to Viterbi, position i is in state k and position i+1 is in state l –This contributes 1 count towards the frequency with which the transition k → l is used –In Baum-Welch, this transition is counted only partially, according to the probability that it is taken by some path –Similarly for emissions

Question How do we compute P(π_i = k, π_{i+1} = l | x)? This is an evaluation problem –Solvable with the forward-backward algorithm

Answer P(π_i = k, π_{i+1} = l, x) = P(x_1 … x_i, π_i = k) · a_kl · e_l(x_{i+1}) · P(x_{i+2} … x_N | π_{i+1} = l) = f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1)

Estimated number of k → l transitions: A_kl = Σ_sequences Σ_i f_k(i) · a_kl · e_l(x_{i+1}) · b_l(i+1) / P(x) New transition probabilities: a_kl = A_kl / Σ_l' A_kl' Estimated number of times symbol t is emitted in state k: E_k(t) = Σ_sequences Σ_{i: x_i = t} f_k(i) · b_k(i) / P(x) New emission probabilities: e_k(t) = E_k(t) / Σ_t' E_k(t')
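
A single Baum-Welch re-estimation step for one sequence, built on the `forward` and `backward` sketches above (unscaled, no pseudocounts, and the initial-state distribution is not re-estimated):

```python
def baum_welch_step(x, states, trans, emit):
    """One Baum-Welch re-estimation step on a single sequence (unscaled sketch)."""
    f, px = forward(x, states, trans, emit)
    b = backward(x, states, trans, emit)
    N = len(x)
    # expected number of k -> l transitions: sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
    A = {(k, l): sum(f[i][k] * trans[(k, l)] * emit[l][x[i + 1]] * b[i + 1][l]
                     for i in range(N - 1)) / px
         for k in states for l in states}
    # expected number of times symbol t is emitted in state k: sum over {i: x_i = t} of f_k(i) b_k(i) / P(x)
    E = {k: {t: sum(f[i][k] * b[i][k] for i in range(N) if x[i] == t) / px
             for t in emit[k]}
         for k in states}
    # normalize the expected counts into new parameters
    new_trans = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
                 for k in states for l in states}
    new_emit = {k: {t: E[k][t] / sum(E[k].values()) for t in E[k]} for k in states}
    return new_trans, new_emit
```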

Why does this work? The proof is rather technical (chapter 11 in the Durbin book) But basically: –The forward-backward algorithm computes the likelihood of the data, P(X | θ) –When we re-estimate θ, we maximize P(X | θ) (at least locally) –Effect: in each iteration, the likelihood of the sequence does not decrease –Therefore, the procedure is guaranteed to converge (though not necessarily to a global optimum) –Viterbi training is also guaranteed to converge: every iteration improves the probability of the most probable path

Expectation-maximization (EM) The Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data E-step: compute the expectations (e.g., the probability of each position being in a certain state) M-step: maximum-likelihood parameter estimation We'll see EM and similar techniques again in motif finding; k-means clustering is also a special case of EM

Does it actually work? Depends on: Nature of the problem Quality of the model (architecture) Size of the training data Selection of initial parameters

Initial parameters May come from prior knowledge If no prior knowledge is available, use multiple sets of random initial parameters –Each run ends up in a local maximum –Hopefully one of them leads to the correct answer

HMM summary Viterbi – best single path Forward – sum over all paths Backward – similar, from the other end Baum-Welch – training via EM, using forward-backward Viterbi training – another "EM"-like procedure, but based on the Viterbi path

HMM structure We have assumed a fully connected structure –In practice, many transitions are impossible or have very low probabilities "Letting the model find out for itself" which transitions to use –Almost never works in practice –Gives a poor model even with plenty of training data –Too many local optima when the structure is not constrained Most successful HMMs are based on prior knowledge –The model topology should have an interpretation –There are standard topologies to choose from in typical situations Define the model as well as you can –Then do model surgery based on the data

Duration modeling For any sub-path, the probability consists of two components –The product of emission probabilities (depends on the symbols and the state path) –The product of transition probabilities (depends only on the state path)

Duration modeling Model a stretch of DNA for which the emission distribution does not change for a certain length Duration: the number of steps that a state is used consecutively without visiting other states [Figure: a single state with self-transition probability p and exit probability 1-p; plot of P(length = L) against L] The simplest model implies that P(length = L) = (1-p) · p^(L-1), i.e., the length follows a geometric distribution –Not always appropriate
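
For instance, with a self-transition probability of p = 0.95, the implied duration distribution and its mean can be computed directly:

```python
p = 0.95                                    # self-transition probability of the state
P_len = lambda L: (1 - p) * p ** (L - 1)    # geometric duration distribution
print(P_len(1), P_len(20))                  # 0.05 and roughly 0.0189
print(1 / (1 - p))                          # expected duration: 20 steps
```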

Duration models [Figure: chaining several copies of a state changes the duration distribution. A chain of n states, each with self-transition probability p, gives a negative binomial length distribution; a chain of n states followed by a final self-looping state gives a minimum duration of n followed by a geometric tail]

Explicit duration modeling [Figure: a gene-structure HMM with Intergenic, Exon, and Intron states; the Intron state has emission probabilities P(A | I) = 0.3, P(C | I) = 0.2, P(G | I) = 0.2, P(T | I) = 0.3, and its duration is drawn from an empirical intron length distribution (plot of P against L)]

Explicit duration modeling Can use any arbitrary length distribution Called a generalized HMM; often used in gene finders Upon entering a state: 1. Choose a duration d according to the state's length distribution 2. Generate d letters according to the emission probabilities 3. Take a transition to the next state according to the transition probabilities Viterbi recurrence: P_k(i) = max_l max_{d=1..D} P_l(i-d) · a_lk · e_k(x_{i-d+1}, …, x_i) · P(duration = d in state k) Disadvantage: increased complexity – Time: O(D²), Space: O(D), where D = maximum duration of a state
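
The three-step generative procedure above can be sketched directly; the two-state model, emission values, and duration samplers below are hypothetical placeholders, not parameters from the slides:

```python
import random

def sample_ghmm(n, trans, emit, duration, start):
    """Generate about n symbols from a generalized HMM with explicit state durations."""
    seq, labels, k = [], [], start
    while len(seq) < n:
        d = duration[k]()                                  # 1. choose a duration d for state k
        for _ in range(d):                                  # 2. emit d symbols from state k
            symbols, probs = zip(*emit[k].items())
            seq.append(random.choices(symbols, probs)[0])
            labels.append(k)
        nexts, probs = zip(*trans[k].items())               # 3. move to the next state
        k = random.choices(nexts, probs)[0]
    return "".join(seq[:n]), "".join(labels[:n])

# Hypothetical toy parameters: intergenic "N" alternating with intron-like "I"
emit = {"N": dict(A=0.25, C=0.25, G=0.25, T=0.25),
        "I": dict(A=0.3, C=0.2, G=0.2, T=0.3)}
trans = {"N": {"I": 1.0}, "I": {"N": 1.0}}
duration = {"N": lambda: random.randint(100, 500),          # stand-ins for real empirical
            "I": lambda: random.randint(60, 400)}           # length distributions
dna, path = sample_ghmm(1000, trans, emit, duration, "N")
```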

Silent states Silent states are states that do not emit symbols (e.g., the state 0 in our previous examples) Silent states can be introduced in HMMs to reduce the number of transitions

Silent states Suppose we want to model a sequence in which arbitrary deletions are allowed (more on this in the next lecture) In that case we need a completely forward-connected HMM (O(m²) edges)

Silent states If we use silent states, we need only O(m) edges But nothing comes for free: if we want to assign high probability to the transitions 1→5 and 2→4, there is no way to also assign low probability to 1→4 and 2→5 The algorithms can be modified easily to deal with silent states, as long as there are no loops consisting entirely of silent states

HMM applications Pair-wise sequence alignment Multiple sequence alignment Gene finding Speech recognition (a good tutorial is on the course website) Machine translation Many others

Connection between HMMs and sequence alignment

FSA for global alignment [Figure: a three-state finite state automaton. One state: X_i and Y_j aligned; one state: X_i aligned to a gap; one state: Y_j aligned to a gap. Gap-open penalty d on the transitions from the match state into a gap state, gap-extension penalty e on the gap-state self-loops]

HMM for global alignment [Figure: a pair HMM with three states. Match state M: X_i and Y_j aligned, emission probabilities P(x_i, y_j) (16 values); state X: X_i aligned to a gap, emission probabilities q(x_i) (4 values); state Y: Y_j aligned to a gap, emission probabilities q(y_j) (4 values). Transitions: M→X and M→Y with probability δ each, M→M with 1-2δ; X→X and Y→Y with probability ε, X→M and Y→M with 1-ε] A pair-wise HMM emits two sequences simultaneously The algorithms are similar to those for a regular HMM, but need an additional dimension: e.g., in Viterbi we need V_k(i, j) instead of V_k(i)
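
For reference, the Viterbi recurrences for this pair HMM take roughly the following form (a sketch consistent with the transition probabilities above; termination probabilities are omitted, see Durbin et al., chapter 4, for the complete version):

```latex
V^M(i,j) = p(x_i, y_j)\,\max\{(1-2\delta)\,V^M(i-1,j-1),\;(1-\varepsilon)\,V^X(i-1,j-1),\;(1-\varepsilon)\,V^Y(i-1,j-1)\}
V^X(i,j) = q(x_i)\,\max\{\delta\,V^M(i-1,j),\;\varepsilon\,V^X(i-1,j)\}
V^Y(i,j) = q(y_j)\,\max\{\delta\,V^M(i,j-1),\;\varepsilon\,V^Y(i,j-1)\}
```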

HMM and FSA for alignment FSA: regular grammar HMM: stochastic regular grammar After a proper transformation between probabilities and substitution scores, the two are identical (roughly): s(a, b) ↔ log [ p(a, b) / (q(a) q(b)) ], -d ↔ log δ, -e ↔ log ε Details in the Durbin book, chapter 4 Finding an optimal FSA alignment is equivalent to finding the most probable path with Viterbi

HMM for pair-wise alignment Theoretical advantages: –Full probabilistic interpretation of alignment scores –Probability of all alignments, instead of just the best alignment (forward-backward) –Sampling of sub-optimal alignments –Posterior probability that A_i is aligned to B_j Not commonly used in practice –Needleman-Wunsch and Smith-Waterman work pretty well and are more intuitive to biologists –Other reasons?

Next lecture HMMs for multiple alignment –Very useful HMMs for gene finding –Very useful –But very technical –Include many knowledge-based fine-tunings and extensions –We'll only discuss the basic ideas