Hidden Markov Models: Fundamentals and Applications to Bioinformatics

Markov Chains Given a finite discrete set S of possible states, a Markov chain process occupies one of these states at each unit of time. The process either stays in the same state or moves to some other state in S. This occurs in a stochastic way, rather than in a deterministic one. The process is memoryless and time homogeneous.

Transition Matrix Let S = {S1, S2, S3}. A Markov chain is described by a table of transition probabilities such as the following (the entry in row Si and column Sj is the probability of moving from Si to Sj):

        S1     S2     S3
  S1     0      1      0
  S2    1/3    2/3     0
  S3    1/2    1/3    1/6

A simple example Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy, cloudy, sunny. We postulate that on day t the weather is characterized by exactly one of the three states above, and we give ourselves a 3 x 3 transition probability matrix A.

- 2 - Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun"?

- 3 - Given that the model is in a known state S_i, what is the probability that it stays in that state for exactly d days? The answer is p_i(d) = (a_ii)^(d-1) (1 - a_ii), a geometric distribution. Thus the expected number of consecutive days in the same state is 1/(1 - a_ii). So the expected number of consecutive sunny days, according to the model, is 5 (corresponding to a(sun, sun) = 0.8).
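
To make the arithmetic concrete, here is a small Python sketch of both computations above. The transition matrix is an assumption (the slide's own matrix is not reproduced in this transcript); it is only chosen to give a sun-to-sun probability of 0.8, consistent with the expected five consecutive sunny days.

# Weather Markov-chain computations. The matrix A is an assumed example.
import numpy as np

states = ["rain", "cloudy", "sun"]
A = np.array([
    [0.4, 0.3, 0.3],   # from rain
    [0.2, 0.6, 0.2],   # from cloudy
    [0.1, 0.1, 0.8],   # from sun
])
idx = {s: i for i, s in enumerate(states)}

# P("sun sun rain rain sun cloudy sun" over the next 7 days | day 1 is sunny):
path = ["sun", "sun", "sun", "rain", "rain", "sun", "cloudy", "sun"]
p = 1.0
for a, b in zip(path, path[1:]):
    p *= A[idx[a], idx[b]]
print(p)

# Probability of staying in a state for exactly d days, and expected duration:
a_ii = A[idx["sun"], idx["sun"]]
d = 3
p_exactly_d = a_ii ** (d - 1) * (1 - a_ii)
expected_days = 1.0 / (1 - a_ii)   # = 5 when a_ii = 0.8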

Hidden? What if each state does not correspond to an observable (physical) event? What if the observation is a probabilistic function of the state? To clarify, let us analyze another simple example, the urn-and-ball model, before formally defining Hidden Markov Models, or simply HMMs.

Elements of an HMM An HMM is characterized by the following:
1) N, the number of states in the model.
2) M, the number of distinct observation symbols per state.
3) The state transition probability distribution A = {a_ij}, where a_ij = P(q_{t+1} = E_j | q_t = E_i).
4) The observation symbol probability distribution in state E_j, B = {b_j(k)}, where b_j(k) is the probability that the k-th observation symbol is emitted at time t, given that the model is in state E_j.
5) The initial state distribution p = {p_i}, where p_i = P(q_1 = E_i).
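
As a concrete illustration (not part of the original slides), these five elements can be collected in a small Python container; the toy numbers below are assumptions that are reused in the later sketches.

# Minimal container for the five elements of an HMM listed above.
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    A: np.ndarray    # N x N transitions, A[i, j] = P(q_{t+1} = E_j | q_t = E_i)
    B: np.ndarray    # N x M emissions, B[j, k] = b_j(k)
    pi: np.ndarray   # length-N initial state distribution p_i

    @property
    def N(self):     # number of states
        return self.A.shape[0]

    @property
    def M(self):     # number of distinct observation symbols
        return self.B.shape[1]

# A toy two-state, three-symbol model (assumed values):
model = HMM(
    A=np.array([[0.7, 0.3],
                [0.4, 0.6]]),
    B=np.array([[0.5, 0.4, 0.1],
                [0.1, 0.3, 0.6]]),
    pi=np.array([0.6, 0.4]),
)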

Three Basic Problems for HMMs
1) Given the observation sequence O = O_1 O_2 ... O_T and a model m = (A, B, p), how do we efficiently compute P(O | m)?
2) Given the observation sequence O and a model m, how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense?
3) How do we adjust the model parameters to maximize P(O | m)?

Solution to Problem (1) Given an observed output sequence O = O_1, ..., O_T, we have that P[O] = the sum, over all state sequences q_1, ..., q_T, of p_{q1} b_{q1}(O_1) a_{q1 q2} b_{q2}(O_2) ... a_{q(T-1) qT} b_{qT}(O_T). This sum has N^T terms, each being a product of about 2T factors, so the total number of operations is on the order of 2T N^T. Fortunately, there is a much more efficient algorithm, called the forward algorithm.
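
For comparison with the forward algorithm introduced next, here is a brute-force sketch of that sum over all N^T state sequences, using the assumed toy parameters from the earlier container.

# Brute-force evaluation of P[O]: sum over all N**T state paths.
from itertools import product
import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # assumed toy transitions
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # assumed toy emissions
pi = np.array([0.6, 0.4])                          # assumed initial distribution
obs = [0, 2, 1, 2]                                 # observed symbol indices, length T

N, T = A.shape[0], len(obs)
p_O = 0.0
for path in product(range(N), repeat=T):           # all N**T state sequences
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, T):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    p_O += p
print(p_O)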

The Forward Algorithm It focuses on the calculation of the quantity α(t, i) = P[O_1, ..., O_t and the state at time t is E_i], which is the joint probability that the sequence of observations seen up to and including time t is O_1, ..., O_t, and that the state of the HMM at time t is E_i. Once these quantities are known, P[O] = Σ_i α(T, i).

…continuation The calculation of the α(t, i)'s is by induction on t, starting from α(1, i) = p_i b_i(O_1). From the definition we get the recursion α(t+1, j) = [ Σ_i α(t, i) a_ij ] b_j(O_{t+1}).
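
A minimal Python sketch of this induction; the toy parameters at the end are the same assumed values as before.

# Forward algorithm.  With 0-based time, alpha[t, i] is the slides' alpha(t+1, i):
# the joint probability of the first t+1 observations and being in state i then.
import numpy as np

def forward(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha(1, i) = p_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # [sum_i alpha(t, i) a_ij] * b_j(O_{t+1})
    return alpha                                      # P[O] = alpha[-1].sum()

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, [0, 2, 1, 2])[-1].sum())      # same P[O], now in O(T * N^2) time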

Backward Algorithm Another approach is the backward algorithm. Specifically, we calculate β(t, i) = P[O_{t+1}, ..., O_T | the state at time t is E_i] by the formula β(t, i) = Σ_j a_ij b_j(O_{t+1}) β(t+1, j), with β(T, i) = 1. Again, by induction one can find the β(t, i)'s starting with the value t = T – 1, then for the value t = T – 2, and so on, eventually working back to t = 1.
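
A matching sketch of the backward induction, written to mirror the forward sketch above.

# Backward algorithm: beta[t, i] = P(O_{t+2}..O_T | state E_i at time t+1) in the
# slides' 1-based notation, filled in from the last time step back to the first.
import numpy as np

def backward(A, B, obs):
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))                              # beta(T, i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # sum_j a_ij b_j(O_{t+1}) beta(t+1, j)
    return beta                                         # P[O] = sum_i p_i b_i(O_1) beta[0, i]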

Solution to Problem (2) Given an observed sequence O = O_1, ..., O_T of outputs, we want to compute efficiently a state sequence Q = q_1, ..., q_T that has the highest conditional probability given O. In other words, we want to find a Q that makes P[Q | O] maximal. There may be many Q's that make P[Q | O] maximal. We give an algorithm to find one of them.

The Viterbi Algorithm It is divided into two steps: first it finds max_Q P[Q | O], and then it backtracks to find a Q that realizes this maximum. First define, for arbitrary t and i, δ(t, i) to be the maximum probability over all ways to end in state S_i at time t having observed the sequence O_1 O_2 ... O_t. Then max_Q P[Q and O] = max_i δ(T, i).

- 2 - But P[Q | O] = P[Q and O] / P[O]. Since the denominator on the RHS does not depend on Q, we have that the Q maximizing P[Q | O] is the same Q that maximizes P[Q and O]. We calculate the δ(t, i)'s inductively: δ(1, i) = p_i b_i(O_1) and δ(t+1, j) = [ max_i δ(t, i) a_ij ] b_j(O_{t+1}).

- 3 - Finally, we recover the q_t's as follows. Define ψ(T) = argmax_i δ(T, i) and put q_T = S_ψ(T). This is the last state in the desired state sequence. The remaining q_t for t < T are found recursively by defining ψ(t) = argmax_i δ(t, i) a_{i ψ(t+1)} and putting q_t = S_ψ(t).
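
Putting the recursion and the backtracking together, a minimal Viterbi sketch (toy parameters again assumed):

# Viterbi algorithm: delta[t, i] = max probability of any state path that emits
# the first t+1 observations and ends in state i; psi stores the argmax for backtracking.
import numpy as np

def viterbi(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta(t, i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)                     # backtrack to recover one optimal Q
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()                      # (state sequence, max_Q P[Q and O])

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 2, 1, 2]))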

Solution to Problem (3) We are given a set of observed data from an HMM for which the topology is known. We wish to estimate the parameters in that HMM. We briefly describe the intuition behind the Baum-Welch method of parameter estimation. Assume that the observation alphabet (of size M) and the number of states N are fixed at the outset. The data we use to estimate the parameters constitute a set of observed sequences {O^(d)}.

The Baum-Welch Algorithm We start by setting the parameters p_i, a_ij, b_i(k) at some initial values. We then calculate, using these initial parameter values: 1) p_i* = the expected proportion of times in state S_i at the first time point, given {O^(d)}.

2) a_ij* = E[N_ij | {O^(d)}] / E[N_i | {O^(d)}], and 3) b_i*(k) = E[N_i(k) | {O^(d)}] / E[N_i | {O^(d)}], where N_ij is the random number of times q_t^(d) = S_i and q_{t+1}^(d) = S_j for some d and t; N_i is the random number of times q_t^(d) = S_i for some d and t; and N_i(k) equals the random number of times q_t^(d) = S_i and it emits symbol k, for some d and t.

Upshot It can be shown that if the parameter set m = (p_i, a_ij, b_i(k)) is replaced by m* = (p_i*, a_ij*, b_i*(k)), then P[{O^(d)} | m*] ≥ P[{O^(d)} | m], with equality holding if and only if m* = m. Thus successive iterations continually increase the probability of the data, given the model. Iterations continue until a local maximum of the probability is reached.
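
A sketch of one such re-estimation iteration, for a single observed sequence only; a full implementation pools the expected counts over all sequences {O^(d)} and works in log space or with rescaling to avoid numerical underflow.

# One Baum-Welch re-estimation step, using the expected-count formulas above.
import numpy as np

def baum_welch_step(A, B, pi, obs):
    N, M, T = A.shape[0], B.shape[1], len(obs)
    obs = np.asarray(obs)

    # Forward and backward variables.
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_O = alpha[-1].sum()

    # gamma[t, i] = P(state i at time t | O); xi[t, i, j] = P(states i, j at times t, t+1 | O).
    gamma = alpha * beta / p_O
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_O

    # Re-estimates: expected counts divided by expected totals.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(M):
        B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new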

Applications

Some preliminary remarks Sequence alignment is useful for discovering functional, structural, and evolutionary information in biological research. Different metrics (or notions of distance) could be defined to compare sequences. Mathematician Peter Sellers (1974) showed that if a sequence alignment is formulated in terms of distances instead of similarity, a biologically more appealing interpretation of gaps is possible. The latter is an evolution-motivated definition, relying on the concept of ancestry.

Modeling Protein Families The states of our HMM will be divided into match states, insert states and delete states. It is useful to include an initial state and a final one, and we assume that no match or delete state is visited more than once. The alphabet consists of the twenty amino acids together with one dummy symbol δ representing "delete". Delete states output δ only. Each insert and match state has its own distribution over the 20 amino acids, and does not emit the symbol δ.
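
To make the state layout concrete, the following sketch (an illustration, not code from the slides) enumerates the states and allowed transitions of a profile HMM of hypothetical length L = 3, following the standard match/insert/delete topology.

# Profile-HMM topology: match, insert and delete states plus begin and end.
# Transition and emission probabilities would still have to be estimated
# (e.g. with Baum-Welch), as described in the following slides.
L = 3
states = (["begin"]
          + [f"m{j}" for j in range(1, L + 1)]    # match states
          + [f"i{j}" for j in range(0, L + 1)]    # insert states (i0 follows begin)
          + [f"d{j}" for j in range(1, L + 1)]    # delete states
          + ["end"])

def successors(s):
    """Allowed transitions out of state s in the standard profile-HMM layout."""
    if s == "begin":
        return ["m1", "i0", "d1"]
    if s == "end":
        return []
    kind, j = s[0], int(s[1:])
    ahead = ["end"] if j == L else [f"m{j + 1}", f"d{j + 1}"]
    if kind == "i":
        return [s] + ahead          # an insert state may repeat, then move on
    return [f"i{j}"] + ahead        # match/delete may enter insert j or move on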

- 2 - If the emission probabilities for the match and insert states are uniform over the 20 amino acids, the model will produce random sequences not having much in common. If each state emits one specific amino acid with probability 1, then the model will always produce the same sequence. Somewhere in between these two extremes, the parameters can be set so that the model is interesting.

- 3 - Each choice of parameters produces a different family of sequences. This family can be rather “tight”, or it can be rather “loose”. It is possible that the tightness occurs locally. Allowing gap penalties and substitution probabilities to vary along the sequences reflects biological reality better.

- 4 - Dynamic programming and BLAST are essential for certain applications, but HMMs are more efficient for modeling large families of sequences. The HMM model is sufficiently flexible to model the varying features of a protein along its length. The model described has proven in practice to provide a good compromise between flexibility and tractability. Such HMMs are called profile HMMs.

- 5 - All applications start with training. This estimation procedure uses the Baum-Welch algorithm. The model is chosen to have length equal to the average length of a sequence in the training set, and all parameters are initialized by using uniform distributions.

Multiple Sequence Alignment The msa of a set of sequences may be viewed as an evolutionary history of the sequences. HMMs often provide an msa as good as, if not better than, other methods. The approach is well grounded in probability theory; no sequence ordering is required; insertion/deletion penalties are not needed; and experimentally derived information may be incorporated.

Description In this section we describe how to use the theory of the previous section to compute an msa for a set of sequences. The sequences to be aligned are used as the training data, to train the parameters of the model. For each sequence, the Viterbi algorithm is then used to determine a path most likely to have produced that sequence.

- 2 - Consider the sequences CAEFDDH and CDAEFPDDH. Suppose the model has length 10 and their most likely paths through the model are m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10 and m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10. The induced alignment is found by aligning positions that were generated by the same match state. This leads to the alignment

C-AEF-DDH
CDAEFPDDH
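
The mechanics of reading an alignment off two state paths can be sketched as follows; the tiny paths and sequences in the example at the end are hypothetical, chosen only to keep the illustration short (they are not the slide's example).

# Residues emitted by the same match state share a column, delete states give
# gaps, and insert-state residues get columns of their own (gap in the other row).
def parse_path(path, seq):
    """Map a state path such as ['m1', 'i1', 'm2', 'd3'] onto its sequence."""
    match, inserts, k = {}, {}, 0
    for s in path:
        kind, j = s[0], int(s[1:])
        if kind == "m":
            match[j] = seq[k]; k += 1
        elif kind == "d":
            match[j] = "-"
        else:                                   # insert after model position j
            inserts.setdefault(j, []).append(seq[k]); k += 1
    return match, inserts

def align_pair(path1, seq1, path2, seq2, L):
    (m1, i1), (m2, i2) = parse_path(path1, seq1), parse_path(path2, seq2)
    rows = ["", ""]
    for j in range(L + 1):
        for r in i1.get(j, []):                 # inserts after position j, sequence 1
            rows[0] += r; rows[1] += "-"
        for r in i2.get(j, []):                 # inserts after position j, sequence 2
            rows[0] += "-"; rows[1] += r
        if j < L:                               # match/delete column for position j + 1
            rows[0] += m1.get(j + 1, "-"); rows[1] += m2.get(j + 1, "-")
    return rows

# Hypothetical example: prints ['C-AE', 'CDAE'].
print(align_pair(["m1", "m2", "m3"], "CAE",
                 ["m1", "i1", "m2", "m3"], "CDAE", L=3))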

Pfam Pfam is a web-based resource maintained by the Sanger Center. Pfam uses the basic theory described above to determine protein domains in a query sequence. Suppose that a new protein is obtained for which no information is available except the raw sequence. We wish to "annotate" this sequence.

- 2 - The typical starting point is a BLAST search. Pfam returns a family of protein domains, which enriches the information obtained by a BLAST search. The domains in Pfam are determined based on expert knowledge, sequence similarity, and other protein family databases. Currently, Pfam contains more than 2000 domains.

- 3 - For each domain a set of examples of this domain is selected. The sequences representing each domain are put into an alignment, and the alignments themselves are used to set the parameters. Recall that an alignment implies for each sequence in the alignment a path through the HMM, as in the previous sections.

- 4 - The proportion of times these paths take a given transition, and the proportion of times a given symbol is emitted in a given state, are used to estimate the transition and the emission probabilities. Given the HMMs for all the domains, a query sequence is then run past each one using the forward algorithm. When a portion of the query sequence has a probability of having been produced by an HMM that exceeds a certain cutoff, the corresponding domain is reported.

Gene Finding Currently, a popular and successful gene finder for human DNA sequences is GENSCAN (Burge et al. 1997). It is based on a generalization of HMMs called semihidden Markov models. The algorithms involved in this model are an order of magnitude more complex than for a regular HMM. The gene-finding application requires a generalization of the Viterbi algorithm.

- 2 - Burge (1997) observed that if the lengths of the long intergenic regions can be taken as having geometric distributions, and if these regions generate sequences in a relatively i.i.d. fashion, then the algorithm can be adjusted so that practical running times can be obtained.

Final Remarks HMMs have been used to model alignments of three-dimensional structure in proteins (Stultz et al. 1993; Hubbard and Park 1995; Di Francesco et al. 1997, 1999; FORREST Web server). In one example of this approach, the models are trained on patterns of α helices, β strands, tight turns, and loops in specific structural classes, which then may be used to provide the most probable structure and structural class of a protein.

Well… those weren’t the final remarks A version of GeneMark (Borodosky and McIninch 1993) called GeneMark.HMM uses a particular type of HMM (called a fifth-order Markov Model) to search for E. coli genes (Lukashin and Borodovsky 1998). The success of the HMM method depends on having appropriate initial or prior conditions, i.e., a good prior model for the sequences and a sufficient number of sequences to train the model.

Finally Another consideration in using HMMs is the number of sequences. If a good prior model is used, it should be possible to train the HMM with as few as 20 sequences. In general, the smaller the number of sequences, the more important the prior conditions become. HMMs are also more effective if methods that inject statistical noise into the model are used during the training procedure.