Hidden Markov Models, Lecture #5. Prepared by Dan Geiger. Background readings: Chapter 3 in the textbook (Durbin et al., Biological Sequence Analysis).

2 Dependencies along the genome
In previous classes we assumed that every letter in a sequence is sampled independently from some distribution q(·) over the alphabet {A,C,T,G}. This is not the case in real genomes:
1. Genomic sequences come in triplets, the codons, which encode amino acids via the genetic code.
2. There are special subsequences in the genome, like TATA within the regulatory region upstream of a gene.
3. The pair C followed by G is less common than expected under random sampling.
We will focus on analyzing the third example, using a model called a Hidden Markov Model.

3 Example: CpG islands
In human genomes the pair CG often transforms to (methyl-C)G, which in turn often transforms to TG. Hence the pair CG appears less frequently than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the start regions of many genes. These areas are called CpG islands (the "p" denotes the phosphodiester bond between the C and the G along the strand).

4 Example: CpG islands (cont.)
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island? We use Markov chains.
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length? We use Hidden Markov Models.

5 (Stationary) Markov chains
[Figure: a chain X_1 -> X_2 -> ... -> X_{L-1} -> X_L]
Every variable X_i has a domain; for example, suppose the domain is the set of letters {a, c, t, g}. Every variable is associated with a local (transition) probability table p(X_i = x_i | X_{i-1} = x_{i-1}), together with an initial distribution p(X_1 = x_1). The joint distribution is given by
P(x_1, ..., x_L) = p(x_1) * prod_{i=2}^{L} p(x_i | x_{i-1}),
in short, p(x_1) p(x_2 | x_1) ... p(x_L | x_{L-1}). "Stationary" means that the transition probability tables do not depend on i.
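As a concrete illustration of this factorization, here is a minimal Python sketch that scores a sequence under a stationary Markov chain. The parameter tables below are uniform placeholders (not estimates from data), stored as plain nested dictionaries.

```python
import math

# Hypothetical placeholder parameters: initial distribution p(x_1) and a
# stationary transition table p(x_i | x_{i-1}) over the alphabet {a,c,t,g}.
initial = {x: 0.25 for x in "actg"}
transition = {prev: {x: 0.25 for x in "actg"} for prev in "actg"}

def markov_log_prob(seq, initial, transition):
    """log P(x_1,...,x_L) = log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp

print(markov_log_prob("acgtg", initial, transition))
```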

6 Question 1: Using two Markov chains
[Figure: a chain X_1 -> X_2 -> ... -> X_L]
For CpG islands we need to specify p_I(x_i | x_{i-1}), where I stands for CpG island. Rows are indexed by X_{i-1}, columns by X_i:

  X_{i-1} \ X_i     A        C        T        G
  A              p(A|A)   p(C|A)   p(T|A)   p(G|A)
  C               0.4     p(C|C)   p(T|C)    high
  T               0.1     p(C|T)   p(T|T)   p(G|T)
  G               0.3     p(C|G)   p(T|G)   p(G|G)

Each row must add up to one; the columns need not.

7 Question 1: Using two Markov chains
[Figure: a chain X_1 -> X_2 -> ... -> X_L]
For non-CpG islands we need to specify p_N(x_i | x_{i-1}), where N stands for non-CpG island:

  X_{i-1} \ X_i     A        C        T        G
  A              p(A|A)   p(C|A)   p(T|A)   p(G|A)
  C               0.4     p(C|C)   p(T|C)    low
  T               0.1     p(C|T)   p(T|T)    high
  G               0.3     p(C|G)   p(T|G)   p(G|G)

Some entries may or may not change compared to p_I(x_i | x_{i-1}).

8 Question 1: Log odds-ratio test
Comparing the two options via an odds-ratio test yields
log Q = log [ P(x_1,...,x_L | island model) / P(x_1,...,x_L | non-island model) ]
      = log [ p_I(x_1) / p_N(x_1) ] + sum_{i=2}^{L} log [ p_I(x_i | x_{i-1}) / p_N(x_i | x_{i-1}) ].
If log Q > 0, then a CpG island is more likely. If log Q < 0, then a non-CpG island is more likely.
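A minimal sketch of this test, assuming the two models' parameters are given as nested dictionaries as in the earlier sketch; in practice one often divides log Q by the sequence length so that scores of sequences of different lengths are comparable.

```python
import math

def markov_log_prob(seq, init, trans):
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

def log_odds(seq, init_I, trans_I, init_N, trans_N):
    # log Q > 0 favours the CpG-island model, log Q < 0 the background model.
    return (markov_log_prob(seq, init_I, trans_I)
            - markov_log_prob(seq, init_N, trans_N))
```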

9 Maximum likelihood estimation (MLE) of the parameters (with a teacher, i.e. labeled data)
The needed parameters are p_I(x_1), p_I(x_i | x_{i-1}), p_N(x_1), p_N(x_i | x_{i-1}).
The ML estimates are given by counting:
p_I(x_1 = a) = N_{a,I} / sum_{a'} N_{a',I}, where N_{a,I} is the number of times letter a appears in CpG islands in the dataset;
p_I(x_i = b | x_{i-1} = a) = N_{ba,I} / sum_{b'} N_{b'a,I}, where N_{ba,I} is the number of times letter b appears immediately after letter a in CpG islands in the dataset.
The estimates for p_N are defined analogously from the non-island parts of the data.
Using MLE is justified when we have a large sample; the numbers appearing in the textbook are based on about 60,000 labeled nucleotides. When only small samples are available, Bayesian learning is an attractive alternative, which we will cover soon.
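A counting sketch of these ML estimates for one class (say the island model), assuming the training data is given as a list of subsequences labeled as islands. Unseen letters or pairs simply get no entry here; the Bayesian treatment mentioned above would add pseudocounts.

```python
from collections import defaultdict

def mle_markov(sequences):
    """Estimate p(x_1) and p(x_i = b | x_{i-1} = a) by counting."""
    start_counts = defaultdict(float)
    pair_counts = defaultdict(lambda: defaultdict(float))  # pair_counts[a][b] = N_{ba}
    for seq in sequences:
        start_counts[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            pair_counts[a][b] += 1
    total_starts = sum(start_counts.values())
    init = {a: c / total_starts for a, c in start_counts.items()}
    trans = {a: {b: c / sum(row.values()) for b, c in row.items()}
             for a, row in pair_counts.items()}
    return init, trans

# Toy usage on made-up island-labeled subsequences (not real data):
init_I, trans_I = mle_markov(["cgcgcg", "gcgc", "cggcgc"])
```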

10 Hidden Markov Models (HMMs)
[Figure: a hidden chain S_1 -> S_2 -> ... -> S_i -> ..., where each hidden state S_i emits an observation R_i; the transitions are given by a k x k transition matrix]
This HMM depicts the factorization
P(s_1,...,s_L, r_1,...,r_L) = P(s_1) P(r_1 | s_1) * prod_{i=2}^{L} P(s_i | s_{i-1}) P(r_i | s_i).
Application in communication: the message sent is (s_1,...,s_m) but we receive (r_1,...,r_m); compute the most likely message sent.
Applications in computational biology, discussed in this and the next few classes: CpG islands, gene finding, genetic linkage analysis.

11 HMM for finding CpG islands
Question 2: The input is a long sequence, parts of which come from CpG islands and parts of which do not. We wish to find the most likely assignment of the two labels {I, N} to each letter in the sequence.
We define a variable H_i that encodes both the letter at location i and the (hidden) label at that location, namely
Domain(H_i) = {I, N} x {A, C, T, G} (8 states/values).
These hidden variables H_i are assumed to form a Markov chain H_1 -> H_2 -> ... -> H_L, so the transition matrix is of size 8 x 8.

12 HMM for finding CpG islands (cont.)
The HMM: [Figure: a hidden chain H_1 -> H_2 -> ... -> H_L, where each H_i emits the observed letter X_i]
Domain(H_i) = {I, N} x {A, C, T, G} (8 values); Domain(X_i) = {A, C, T, G} (4 values).
In this representation p(x_i | h_i) is 0 or 1, depending on whether x_i is consistent with h_i. E.g., x_i = G is consistent with h_i = (I, G) and with h_i = (N, G), but not with any other state of h_i. The size of the local probability table p(x_i | h_i) is 8 x 4.
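A sketch of how such an 8-state model could be assembled from the two 4x4 letter tables p_I and p_N. The probabilities of keeping or switching the label (stay_I, stay_N below) are illustrative assumptions, not values given in the lecture.

```python
LETTERS = "ACTG"
STATES = [(label, x) for label in "IN" for x in LETTERS]   # 8 hidden states

def build_transition(p_I, p_N, stay_I=0.99, stay_N=0.999):
    """8x8 table p(h_i | h_{i-1}); p_I, p_N are 4x4 nested dicts of letter
    transitions, stay_I / stay_N are assumed label-persistence probabilities."""
    trans = {}
    for (lab1, x1) in STATES:
        row = {}
        for (lab2, x2) in STATES:
            stay = stay_I if lab1 == "I" else stay_N
            label_prob = stay if lab2 == lab1 else 1.0 - stay
            letter_table = p_I if lab2 == "I" else p_N
            row[(lab2, x2)] = label_prob * letter_table[x1][x2]
        trans[(lab1, x1)] = row
    return trans

def build_emission():
    """8x4 table p(x_i | h_i): 1 if the letter inside h_i matches x_i, else 0."""
    return {(lab, x): {y: 1.0 if y == x else 0.0 for y in LETTERS}
            for (lab, x) in STATES}
```

Each row of the resulting transition table sums to one, since the letter tables themselves are row-normalized.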

13 Queries of interest (MAP)
The Maximum A Posteriori query:
(h_1*, ..., h_L*) = argmax_{h_1,...,h_L} P(h_1,...,h_L | x_1,...,x_L).
It is the same problem if we instead maximize the joint distribution p(h_1,...,h_L, x_1,...,x_L). An answer to this query gives the most probable I/N labeling for all locations. An efficient solution, assuming the local probability tables ("the parameters") are known, is called the Viterbi algorithm.

14 Queries of interest (belief update / posterior decoding)
1. Compute the posterior belief in H_i (for a specific i) given the evidence {x_1,...,x_L}, for each of H_i's values h_i; namely, compute p(h_i | x_1,...,x_L).
2. Do the same computation for every H_i, but without repeating the first task L times.
The local probability tables are assumed to be known. An answer to this query gives the probability of having label I or N at an arbitrary location.

15 Learning the parameters (EM algorithm)
A common algorithm for learning the parameters from unlabeled sequences is the Expectation-Maximization (EM) algorithm, to which we will devote several classes. In the current context we just note that it is an iterative algorithm that alternates an E-step and an M-step until convergence, and that the E-step uses the algorithms we develop in this class.

16 Decomposing the computation of belief update (posterior decoding)
P(x_1,...,x_L, h_i) = P(x_1,...,x_i, h_i) P(x_{i+1},...,x_L | x_1,...,x_i, h_i)
                    = P(x_1,...,x_i, h_i) P(x_{i+1},...,x_L | h_i)
                    ≡ f(h_i) b(h_i),
where the second equality is due to the conditional independence Ind({x_{i+1},...,x_L}; {x_1,...,x_i} | H_i).
Belief update: P(h_i | x_1,...,x_L) = (1/K) P(x_1,...,x_L, h_i), where K = sum_{h_i} P(x_1,...,x_L, h_i).

17 The forward algorithm
The task: compute f(h_i) = P(x_1,...,x_i, h_i) for i = 1,...,L (namely, considering the evidence up to time slot i).
Basis step: f(h_1) = P(x_1, h_1) = P(h_1) P(x_1 | h_1).
Second step:
P(x_1, x_2, h_2) = sum_{h_1} P(x_1, h_1, h_2, x_2)
                 = sum_{h_1} P(x_1, h_1) P(h_2 | x_1, h_1) P(x_2 | x_1, h_1, h_2)
                 = sum_{h_1} P(x_1, h_1) P(h_2 | h_1) P(x_2 | h_2),
where the last equality is due to the conditional independencies of the model.
Step i: f(h_i) = P(x_1,...,x_i, h_i) = sum_{h_{i-1}} P(x_1,...,x_{i-1}, h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i) = sum_{h_{i-1}} f(h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i).
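A direct transcription of this recursion into Python, assuming the parameters are stored as nested dictionaries (p_init[h], p_trans[g][h], p_emit[h][letter]). A practical implementation would work in log space or rescale f at every step to avoid numerical underflow on long sequences.

```python
def forward(x, states, p_init, p_trans, p_emit):
    """f[i][h] = P(x_1..x_{i+1}, H_{i+1} = h), with 0-based indexing over x."""
    f = [{h: p_init[h] * p_emit[h][x[0]] for h in states}]         # basis step
    for xi in x[1:]:
        prev = f[-1]
        f.append({h: p_emit[h][xi] *
                     sum(prev[g] * p_trans[g][h] for g in states)  # step i
                  for h in states})
    return f
```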

18 The backward algorithm
The task: compute b(h_i) = P(x_{i+1},...,x_L | h_i) for i = L-1,...,1 (namely, considering the evidence after time slot i).
First step:
b(h_{L-1}) = P(x_L | h_{L-1}) = sum_{h_L} P(x_L, h_L | h_{L-1}) = sum_{h_L} P(h_L | h_{L-1}) P(x_L | h_{L-1}, h_L)
           = sum_{h_L} P(h_L | h_{L-1}) P(x_L | h_L),
where the last equality is due to conditional independence.
Step i: b(h_i) = P(x_{i+1},...,x_L | h_i) = sum_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) P(x_{i+2},...,x_L | h_{i+1}) = sum_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b(h_{i+1}).
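The backward recursion in the same style (again, log space or rescaling would be needed for long sequences); by convention the entry for the last position is set to 1.

```python
def backward(x, states, p_trans, p_emit):
    """b[i][h] = P(x_{i+2}..x_L | H_{i+1} = h); b at the last position is 1."""
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L - 1] = {h: 1.0 for h in states}
    for i in range(L - 2, -1, -1):                                  # step i
        b[i] = {h: sum(p_trans[h][g] * p_emit[g][x[i + 1]] * b[i + 1][g]
                       for g in states)
                for h in states}
    return b
```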

19 The combined answer
1. To compute the posterior belief in H_i (for a specific i) given the evidence {x_1,...,x_L}: run the forward algorithm to compute f(h_i) = P(x_1,...,x_i, h_i), run the backward algorithm to compute b(h_i) = P(x_{i+1},...,x_L | h_i); the product f(h_i) b(h_i) = P(x_1,...,x_L, h_i) is the answer (for every possible value h_i), up to normalization over h_i.
2. To compute the posterior belief for every H_i, simply run the forward and backward algorithms once, storing f(h_i) and b(h_i) for every i (and every value h_i), and compute f(h_i) b(h_i) for every i.
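Combining the two passes for posterior decoding, reusing the forward and backward functions sketched above:

```python
def posterior(x, states, p_init, p_trans, p_emit):
    """post[i][h] = P(H_{i+1} = h | x_1..x_L) = f(h) b(h) / K."""
    f = forward(x, states, p_init, p_trans, p_emit)
    b = backward(x, states, p_trans, p_emit)
    post = []
    for fi, bi in zip(f, b):
        unnorm = {h: fi[h] * bi[h] for h in states}
        K = sum(unnorm.values())        # equals P(x_1..x_L) at every position
        post.append({h: v / K for h, v in unnorm.items()})
    return post
```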

20 Consequence I: the E-step
Recall that belief update has been computed via
P(x_1,...,x_L, h_i) = P(x_1,...,x_i, h_i) P(x_{i+1},...,x_L | h_i) ≡ f(h_i) b(h_i).
Now we wish to compute (for the E-step)
p(x_1,...,x_L, h_i, h_{i+1}) = p(x_1,...,x_i, h_i) p(h_{i+1} | h_i) p(x_{i+1} | h_{i+1}) p(x_{i+2},...,x_L | h_{i+1})
                             = f(h_i) p(h_{i+1} | h_i) p(x_{i+1} | h_{i+1}) b(h_{i+1}).
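The corresponding pairwise posterior, again reusing f and b from the sketches above; normalized over all state pairs, these quantities become the expected transition counts in the M-step of EM.

```python
def pairwise_posterior(i, x, f, b, states, p_trans, p_emit):
    """P(H_{i+1} = g, H_{i+2} = h | x), for 0-based i in 0..L-2."""
    unnorm = {(g, h): f[i][g] * p_trans[g][h] * p_emit[h][x[i + 1]] * b[i + 1][h]
              for g in states for h in states}
    K = sum(unnorm.values())
    return {gh: v / K for gh, v in unnorm.items()}
```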

21 Consequence II: Likelihood of evidence
1. To compute the likelihood of evidence P(x_1,...,x_L), do one more step in the forward algorithm, namely
   sum_{h_L} f(h_L) = sum_{h_L} P(x_1,...,x_L, h_L).
2. Alternatively, do one more step in the backward algorithm, namely
   sum_{h_1} b(h_1) P(h_1) P(x_1 | h_1) = sum_{h_1} P(x_2,...,x_L | h_1) P(h_1) P(x_1 | h_1).

22 Time and space complexity of the forward/backward algorithms
Time complexity is linear in the length of the chain, provided the number of states of each variable is a constant. More precisely, time complexity is O(k^2 L), where k is the maximum domain size of each variable. Space complexity is O(kL) for storing f(h_i) and b(h_i) for every i and every value h_i, plus O(k^2) for the transition table.

23 The MAP query in an HMM
1. Recall that the likelihood-of-evidence query is to compute P(x_1,...,x_L) = sum_{h_1,...,h_L} P(x_1,...,x_L, h_1,...,h_L).
2. Now we wish to compute a similar quantity:
   P*(x_1,...,x_L) = max_{h_1,...,h_L} P(x_1,...,x_L, h_1,...,h_L),
and, of course, we wish to find a MAP assignment (h_1*,...,h_L*) that attains this maximum.

24 Example: Revisiting likelihood of evidence
For a chain of length three (H_1, H_2, H_3 with observations x_1, x_2, x_3):
P(x_1, x_2, x_3) = sum_{h_1} P(h_1) P(x_1 | h_1) sum_{h_2} P(h_2 | h_1) P(x_2 | h_2) sum_{h_3} P(h_3 | h_2) P(x_3 | h_3)
                 = sum_{h_1} P(h_1) P(x_1 | h_1) sum_{h_2} P(h_2 | h_1) P(x_2 | h_2) b(h_2)
                 = sum_{h_1} P(h_1) P(x_1 | h_1) b(h_1).

25 Example: Computing the MAP assignment
Replace the sums with maximizations:
maximum = max_{h_1} P(h_1) P(x_1 | h_1) max_{h_2} P(h_2 | h_1) P(x_2 | h_2) max_{h_3} P(h_3 | h_2) P(x_3 | h_3)
        = max_{h_1} P(h_1) P(x_1 | h_1) max_{h_2} P(h_2 | h_1) P(x_2 | h_2) b_{h_3}(h_2)
        = max_{h_1} P(h_1) P(x_1 | h_1) b_{h_2}(h_1).   {finding the maximum}
Finding the MAP assignment, using the recorded argmax pointers x*:
h_1* = argmax_{h_1} P(h_1) P(x_1 | h_1) b_{h_2}(h_1);
h_2* = x*_{h_2}(h_1*);
h_3* = x*_{h_3}(h_2*).

26 Viterbi's algorithm
Backward phase (storing the best value as a function of the previous variable's value):
  b_{h_{L+1}}(h_L) = 1
  For i = L-1 downto 1 do
    b_{h_{i+1}}(h_i) = max_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b_{h_{i+2}}(h_{i+1})
    x*_{h_{i+1}}(h_i) = argmax_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b_{h_{i+2}}(h_{i+1})
Forward phase (tracing the MAP assignment):
  h_1* = argmax_{h_1} P(h_1) P(x_1 | h_1) b_{h_2}(h_1)
  For i = 1 to L-1 do
    h_{i+1}* = x*_{h_{i+1}}(h_i*)
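A sketch of this backward-then-trace formulation in Python, done in log space to avoid underflow; the _log helper guards against the zero emission probabilities of the CpG HMM.

```python
import math

def _log(p):
    # Zero probabilities (e.g. the 0/1 emissions above) map to -infinity.
    return math.log(p) if p > 0 else float("-inf")

def viterbi(x, states, p_init, p_trans, p_emit):
    """MAP assignment: backward max-product pass, then forward trace."""
    L = len(x)
    logb = [{h: 0.0 for h in states} for _ in range(L)]      # log b(h_L) = 0
    best_next = [dict() for _ in range(L)]                   # x*(h_i)
    for i in range(L - 2, -1, -1):
        for h in states:
            scores = {g: _log(p_trans[h][g]) + _log(p_emit[g][x[i + 1]])
                         + logb[i + 1][g]
                      for g in states}
            g_star = max(scores, key=scores.get)
            logb[i][h] = scores[g_star]
            best_next[i][h] = g_star
    first = {h: _log(p_init[h]) + _log(p_emit[h][x[0]]) + logb[0][h]
             for h in states}
    path = [max(first, key=first.get)]                       # h_1*
    for i in range(L - 1):
        path.append(best_next[i][path[-1]])                  # h_{i+1}* = x*(h_i*)
    return path
```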

27 Summary of HMMs
1. Belief update (posterior decoding): the forward-backward algorithm.
2. Maximum a posteriori assignment: the Viterbi algorithm.
3. Learning the parameters: the EM algorithm; Viterbi training.