
1 CS5263 Bioinformatics Lecture 10: Markov Chain and Hidden Markov Models

2 Change of lecture order: Markov models & hidden Markov models, applications of HMMs in bioinformatics, suffix trees, motif finding. Why motif finding here? The statistical learning algorithms are the same as for HMMs, and motif finding can be done using either combinatorial search or statistical learning methods.

3 Probabilistic Calculus If A, B are mutually exclusive: P(A ∪ B) = P(A) + P(B). Thus P(not(A)) = P(A^c) = 1 – P(A).

4 Probabilistic Calculus P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

5 Conditional probability The joint probability of two events A and B, P(A∩B) or simply P(A, B), is the probability that A and B occur at the same time. The conditional probability P(B|A) is the probability that B occurs given that A occurred. In particular, P(A | B) = P(A ∩ B) / P(B).

6 Independence P(A | B) = P(A ∩ B) / P(B) => P(A ∩ B) = P(B) * P(A | B) A, B are independent iff –P(A ∩ B) = P(A) * P(B) –That is, P(A) = P(A | B) Also implies that P(B) = P(B | A) –P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

7 Theorem of total probability Let B_1, B_2, …, B_N be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S. An event A can be represented as A = (A ∩ B_1) ∪ (A ∩ B_2) ∪ … ∪ (A ∩ B_N). Since B_1, B_2, …, B_N are mutually exclusive, P(A) = P(A∩B_1) + P(A∩B_2) + … + P(A∩B_N), and therefore P(A) = P(A|B_1)*P(B_1) + P(A|B_2)*P(B_2) + … + P(A|B_N)*P(B_N) = Σ_i P(A | B_i) * P(B_i).

8 Bayes theorem From P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A) it follows that P(B | A) = P(A | B) * P(B) / P(A): the posterior probability of B given A equals the likelihood P(A | B) times the prior P(B), divided by the normalizing constant P(A). This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics. Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition.

9 Bayes theorem (cont'd) Given B_1, B_2, …, B_N, a partition of the sample space S, suppose that event A occurs; what is the probability of event B_j? P(B_j | A) = P(A | B_j) * P(B_j) / P(A) = P(A | B_j) * P(B_j) / Σ_i P(A | B_i) * P(B_i). Think of the B_j as different models. Having observed A, should you choose the model that maximizes P(B_j | A) or the one that maximizes P(A | B_j)? That depends on how much you know about B_j!

10 Model selection Maximum likelihood principle (ML): compare P(observation | model1) vs. P(observation | model2). Maximum a posteriori probability (MAP): compare P(model1 | observation) vs. P(model2 | observation), i.e. P(observation | model1) * P(model1) vs. P(observation | model2) * P(model2).
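To make the ML vs. MAP comparison concrete, here is a minimal Python sketch; the rolls and the 99:1 prior are illustrative assumptions, not values from the lecture.

```python
import math

# Hypothetical observation: ten die rolls (illustrative, not from the lecture)
rolls = [6, 3, 6, 6, 1, 6, 4, 6, 6, 2]

# Two candidate models: P(face | model)
fair   = {f: 1 / 6 for f in range(1, 7)}                          # model 1
loaded = {f: (1 / 2 if f == 6 else 1 / 10) for f in range(1, 7)}  # model 2

def log_likelihood(obs, model):
    """log P(observation | model), assuming independent rolls."""
    return sum(math.log(model[r]) for r in obs)

# Maximum likelihood (ML): compare P(observation | model) only
ll_fair, ll_loaded = log_likelihood(rolls, fair), log_likelihood(rolls, loaded)

# Maximum a posteriori (MAP): also weigh in the priors P(model)
# (assumed priors: the casino uses the loaded die only 1% of the time)
map_fair   = ll_fair + math.log(0.99)
map_loaded = ll_loaded + math.log(0.01)

print("ML prefers:",  "fair" if ll_fair > ll_loaded else "loaded")
print("MAP prefers:", "fair" if map_fair > map_loaded else "loaded")
```

With these numbers ML favours the loaded die (six 6s in ten rolls), while the strong prior narrowly tips MAP back to the fair die, which is the difference between the two criteria.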

11 Question We’ve seen that given a sequence of observations, and two models, we can test which model is more likely to generate the data –Is the die loaded or fair? –Either ML or MAP Given a set of observations, and a model, can you estimate the parameters? –Given the results of rolling a die, how to infer the probability of each face?

12 Question You are told that there are two dice: one is loaded, with a 50% chance of rolling a six, and one is fair. Given a series of numbers resulting from rolling the two dice, can you tell which number was generated by which die?

13 Question You are told that there are two dice, one loaded and one fair, but you don't know how "loaded" the loaded die is or how frequently the dice are switched. Given a series of numbers resulting from rolling the two dice, can you tell how the die is loaded and which number was generated by which die?

14 In the next few lectures Hidden Markov Models –The theory –Probabilistic treatment of sequence alignments using HMMs –Application of HMMs to biological sequence modeling and discovery of features such as genes

15 Back to biology Two dice with four faces: {A, C, G, T}. One (M1) has the distribution pA = pC = pG = pT = 0.25. The other (M2) has the distribution pA = 0.20, pC = 0.28, pG = 0.30, pT = 0.22.

16 Back to biology Assume nature generates a DNA sequence as follows: 1. Randomly select one die. 2. Roll it, append the symbol to the string. 3. Repeat 2 until all symbols have been generated. We are given a string, say X = "GATTCCAA…".

17 What is the probability that the sequence was generated by M1? By M2? That is, P(X|M1) and P(X|M2). We can calculate the log likelihood ratio log(P(X|M1)/P(X|M2)).

18 Model selection by the maximum likelihood criterion
P(X | M1) = P(x_1, x_2, …, x_n | M1) = P(x_1|M1) P(x_2|M1) … P(x_n|M1) (each position is generated independently)
P(X | M2) = P(x_1, x_2, …, x_n | M2) = P(x_1|M2) P(x_2|M2) … P(x_n|M2)
P(X|M1) / P(X|M2) = Π_i P(x_i|M1) / P(x_i|M2)
log likelihood ratio = Σ_i log(P(x_i|M1)/P(x_i|M2)) = n_A S_A + n_C S_C + n_G S_G + n_T S_T,
where S_i = log(P(i | M1) / P(i | M2)) for i = A, C, G, T, and n_i is the count of nucleotide i in X.
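As a small sketch of this computation in Python, using the M1/M2 nucleotide frequencies from the "Back to biology" slide and the example string X = "GATTCCAA":

```python
import math
from collections import Counter

# Nucleotide distributions from the earlier slide: M1 is uniform, M2 is biased
M1 = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
M2 = {'A': 0.20, 'C': 0.28, 'G': 0.30, 'T': 0.22}

# Per-nucleotide scores S_i = log(P(i | M1) / P(i | M2))
S = {b: math.log(M1[b] / M2[b]) for b in "ACGT"}

def log_likelihood_ratio(x):
    """log(P(x|M1)/P(x|M2)) = n_A*S_A + n_C*S_C + n_G*S_G + n_T*S_T."""
    counts = Counter(x)
    return sum(counts[b] * S[b] for b in "ACGT")

print(log_likelihood_ratio("GATTCCAA"))  # > 0 favours M1, < 0 favours M2
```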

19 Model selection by the maximum a posteriori probability criterion
Take the prior probabilities of M1 and M2 into consideration if known:
P(M1|X) / P(M2|X) = P(X|M1) / P(X|M2) * P(M1) / P(M2)
Log ratio = n_A S_A + n_C S_C + n_G S_G + n_T S_T + log(P(M1)) – log(P(M2))
If P(M1) ≈ P(M2), the result is similar to the log likelihood ratio test.

20 We have assumed independence of nucleotides in different positions - definitely unrealistic –What we can see next may depend on what we have seen in the past

21 Markov models A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values:
P(x_i | x_1, …, x_{i-1}) = P(x_i | x_{i-k}, …, x_{i-1})
First order: P(x_i | x_1, …, x_{i-1}) = P(x_i | x_{i-1})
Second order: P(x_i | x_1, …, x_{i-1}) = P(x_i | x_{i-2}, x_{i-1})
0th order: P(x_i | x_1, …, x_{i-1}) = P(x_i) (independence)

22 First order Markov model

23 Example: CpG islands CpG = 2 adjacent nucleotides C and G on the same strand (not a base pair; the "p" stands for the phosphodiester bond of the DNA backbone). The C of CpG is often (70-80%) methylated in mammals, i.e., a CH3 group is added. –Many reasons for methylation –Changes the state of the DNA –"Protected" Major exception: promoters of housekeeping genes.

24 CpG islands Methyl-C mutates to T relatively easily. Net effect: C followed by G is less common than expected, i.e. f(CpG) < f(C) f(G). In promoter regions, CpG remains unmethylated, so the CpG→TpG mutation is less likely there: this makes "CpG islands", which often mark gene-rich regions.

25 CpG islands CpG islands have more CpG than elsewhere, and more C & G than elsewhere too. Typical length: a few hundred to a few thousand bp. Questions: Is a short sequence (say, 200 bp) a CpG island or not? Given a long sequence (say, 10-100 kb), where are the CpG islands (if any)?

26 A 1st order Markov model for CpG islands Essentially a finite state automaton with probabilistic transitions. 4 states: A, C, G, T. 16 transitions: a_st = P(x_i = t | x_{i-1} = s).

27 A 1st order Markov model for CpG islands Essentially a finite state automaton with probabilistic transitions. 4 states: A, C, G, T. 16 transitions: a_st = P(x_i = t | x_{i-1} = s). Begin/End states are added for convenience.

28 Probability of emitting sequence x: P(x) = P(x_1, x_2, …, x_N) = P(x_1) Π_{i=2..N} a_{x_{i-1} x_i}; with the begin state 0, P(x_1) = a_{0 x_1}.
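A minimal sketch of this computation in Python; the transition and initial probabilities below are uniform placeholders, not trained CpG values.

```python
import math

def markov_log_probability(x, initial, transitions):
    """
    log P(x) under a first-order Markov chain:
    P(x) = P(x_1) * prod_{i=2..N} a_{x_{i-1} x_i},
    where transitions[s][t] = a_st = P(x_i = t | x_{i-1} = s)
    and initial[s] plays the role of the begin-state transition a_0s.
    """
    logp = math.log(initial[x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += math.log(transitions[prev][cur])
    return logp

# Placeholder parameters (a uniform chain), not trained CpG values
initial = {b: 0.25 for b in "ACGT"}
transitions = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
print(markov_log_probability("GATTCCAA", initial, transitions))
```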

29 Training Estimate the parameters of the model –Count the transition frequencies from known CpG islands – CpG model –Also count the transition frequencies from sequences without CpG islands – background model
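One possible way to do this counting in Python; the training strings below are tiny made-up examples, and the pseudocount is an assumption to avoid zero probabilities.

```python
from collections import defaultdict

def estimate_transitions(training_seqs, alphabet="ACGT", pseudocount=1.0):
    """
    Estimate a_st = P(x_i = t | x_{i-1} = s) by counting adjacent pairs
    in the training sequences; pseudocounts avoid zero probabilities.
    """
    counts = {s: defaultdict(float) for s in alphabet}
    for seq in training_seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    table = {}
    for s in alphabet:
        row = {t: counts[s][t] + pseudocount for t in alphabet}
        total = sum(row.values())
        table[s] = {t: row[t] / total for t in alphabet}
    return table

# Train one chain on known CpG islands and one on background sequence
# (tiny made-up training strings, for illustration only)
plus_model  = estimate_transitions(["CGCGGCGCGC", "GCGCGGCCGC"])
minus_model = estimate_transitions(["ATTATAATGC", "TTAACTGATT"])
```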

30 Discrimination / Classification Given a sequence, is it a CpG island or not? Compute the log likelihood ratio S(x) = log [ P(x | + model) / P(x | – model) ] = Σ_i log( a^+_{x_{i-1} x_i} / a^-_{x_{i-1} x_i} ).
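A sketch of the resulting score, assuming plus and minus are transition tables like those trained in the previous sketch; the length normalization matches the histogram on the next slide.

```python
import math

def cpg_score(x, plus, minus):
    """
    Length-normalized log likelihood ratio
    S(x) = (1/L) * sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ).
    `plus` and `minus` are transition tables like those trained above.
    S(x) > 0 suggests the CpG-island model, S(x) < 0 the background model.
    """
    total = sum(math.log(plus[p][c] / minus[p][c]) for p, c in zip(x, x[1:]))
    return total / len(x)

# Example (using the illustrative models trained in the previous sketch):
# print(cpg_score("GCGCGCGC", plus_model, minus_model))
```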

31 CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

32 Questions Q1: given a short sequence, is it more likely from feature model or background model? Q2: Given a long sequence, where are the features in it (if any)? –Approach 1: score 100 bp (e.g.) windows Pro: simple Con: arbitrary, fixed length, inflexible –Approach 2: combine +/- models.

33 Combined model Given a long sequence, predict which state each position is in (this leads to a hidden Markov model).

34 Hidden Markov Models (HMMs) Introduced in the 70s for speech recognition; have been shown to be good models for biosequences: alignment, gene prediction, protein domain analysis, … An HMM has states 1, 2, 3, …, transitions between states, and emissions from states. Path: sequence of states. Observed data: the emission sequence. Hidden data: the state sequence.

35 A Hidden Markov Model is memory-less At each time step t, the only thing that affects future states is the current state π_t:
P(π_{t+1} = k | "whatever happened so far") = P(π_{t+1} = k | π_1, π_2, …, π_t, x_1, x_2, …, x_t) = P(π_{t+1} = k | π_t)

36 Hidden Markov Models 8 states (two states for each symbol): A+, C+, G+, T+, A-, C-, G-, T-. Emission probability: for A+, P(A) = 1.0 and the rest is 0. In general, however, each state may emit several symbols with different probabilities. Transition probabilities: P(+|-), P(-|+), etc. Since CpG islands are expected to be smaller than the surrounding "sea", we expect P(+|-) < P(-|+). Given a DNA sequence, we cannot directly tell which state the system is in by looking at the symbol.

37 Hidden Markov Models Given a DNA sequence, we cannot directly tell which state the system is in by looking at the symbol. However, if we were given a path π of states, we can compute the probability of that path. For ATCGAG with π = --++++: P(A-) P(T-|A-) P(C+|T-) P(G+|C+) P(A+|G+) P(G+|A+). How do we search over all paths to find the one with the highest probability? Dynamic programming!

38 Back to the Dishonest Casino A casino has two dice. Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2. The casino player switches back and forth between the fair and loaded die once in a while.

39 Simple scenario You don't know the probabilities, but the casino player lets you observe which die he uses every time, so the "state" of each roll is known. This is a parameter estimation problem: how often does the casino player switch dice? How "loaded" is the loaded die? Simply count the number of times each face appeared and the frequency of die switching (pseudo-counts may be added). These parameters may be used later to test whether a die is loaded or not.
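A possible sketch of this fully observed estimation in Python; the rolls, state labels, and pseudo-count value are made up for illustration.

```python
from collections import Counter

# Hypothetical fully labelled data: each roll paired with the die that produced it
rolls  = [1, 6, 6, 2, 6, 3, 4, 6, 6, 5]
states = ['F', 'L', 'L', 'F', 'L', 'F', 'F', 'L', 'L', 'F']

def estimate_parameters(rolls, states, pseudo=1.0):
    """Count emissions and die switches; pseudo-counts avoid zero estimates."""
    emit  = {s: Counter({face: pseudo for face in range(1, 7)}) for s in "FL"}
    trans = {s: Counter({t: pseudo for t in "FL"}) for s in "FL"}
    for roll, state in zip(rolls, states):
        emit[state][roll] += 1
    for a, b in zip(states, states[1:]):
        trans[a][b] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return ({s: normalize(emit[s]) for s in "FL"},
            {s: normalize(trans[s]) for s in "FL"})

emissions, transitions = estimate_parameters(rolls, states)
print(transitions['F']['L'])   # estimated frequency of switching Fair -> Loaded
```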

40 The dishonest casino model Two states, Fair and Loaded. Transition probabilities: stay with the same die with probability 0.95, switch with probability 0.05. Emission probabilities: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.
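Written as data, the model above might look like the following sketch; the dictionary layout is my own choice, and the ½/½ start probabilities are taken from the worked examples a few slides below.

```python
# The dishonest casino HMM written as plain dictionaries (illustrative layout)
states = ['F', 'L']                      # Fair, Loaded
start  = {'F': 0.5, 'L': 0.5}            # either die is picked first with probability 1/2
trans  = {'F': {'F': 0.95, 'L': 0.05},
          'L': {'F': 0.05, 'L': 0.95}}
emit   = {'F': {face: 1 / 6 for face in range(1, 7)},
          'L': {face: (1 / 2 if face == 6 else 1 / 10) for face in range(1, 7)}}
```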

41 More complex scenarios The “state” of each roll is unknown: –You are given the results of a series of rolls –You don’t know which number is generated by which die –You may or may not know the parameters How “loaded” is the loaded die How frequently the casino player switches dice

42 Question # 1 – Decoding GIVEN Model parameters (probabilities). And a sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

43 Question # 2 – Evaluation GIVEN Model parameters and a sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How likely is this sequence, given our model of how the casino works? (different from “how likely is this path”. This sums over all paths). This is the EVALUATION problem in HMMs

44 Question # 3 – Learning GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs

45 A parse of a sequence Given a sequence x = x_1 … x_N, a parse of x is a sequence of states π = π_1, …, π_N. [Trellis diagram: the K states at each of the sequence positions x_1, x_2, x_3, …]

46 Likelihood of a parse Given a sequence x = x_1 … x_N and a parse π = π_1, …, π_N, how likely is the parse (given our HMM)?
P(x, π) = P(x_1, …, x_N, π_1, …, π_N)
= P(x_N, π_N | π_{N-1}) P(x_{N-1}, π_{N-1} | π_{N-2}) … P(x_2, π_2 | π_1) P(x_1, π_1)
= P(x_N | π_N) P(π_N | π_{N-1}) … P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
= a_{0 π_1} a_{π_1 π_2} … a_{π_{N-1} π_N} e_{π_1}(x_1) … e_{π_N}(x_N)
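A direct translation of the last line into code. This only defines a generic function; it assumes model tables shaped like the casino dictionaries sketched earlier (start, trans, emit).

```python
import math

def parse_log_likelihood(x, path, start, trans, emit):
    """
    log P(x, pi) = log a_{0,pi_1} + log e_{pi_1}(x_1)
                 + sum_{i>=2} [ log a_{pi_{i-1},pi_i} + log e_{pi_i}(x_i) ]
    """
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][x[0]])
    for i in range(1, len(x)):
        logp += math.log(trans[path[i - 1]][path[i]]) + math.log(emit[path[i]][x[i]])
    return logp
```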

47 In more detail:
P(x, π) = P(x_1, …, x_N, π_1, …, π_N)
= P(x_N, π_N | x_1, …, x_{N-1}, π_1, …, π_{N-1}) P(x_1, …, x_{N-1}, π_1, …, π_{N-1})
= P(x_N, π_N | π_{N-1}) P(x_1, …, x_{N-1}, π_1, …, π_{N-1})
= P(x_1, π_1) Π_{i=2..N} P(x_i, π_i | π_{i-1})
where
P(x_i, π_i | π_{i-1}) = P(x_i | π_i, π_{i-1}) P(π_i | π_{i-1}) = P(x_i | π_i) P(π_i | π_{i-1}) = e_{π_i}(x_i) a_{π_{i-1} π_i}
P(x_1, π_1) = P(x_1 | π_1) P(π_1) = e_{π_1}(x_1) a_{0 π_1}

48 Probability of a parse What's the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair and rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
P = ½ × P(1|F) P(F|F) P(2|F) P(F|F) … P(4|F) = ½ × (0.95)^9 × (1/6)^10 ≈ 5 × 10^-9

49 Probability of a parse What's the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded and rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
P = ½ × P(1|L) P(L|L) P(2|L) P(L|L) … P(4|L) = ½ × (0.95)^9 × (1/10)^8 × (1/2)^2 ≈ 8 × 10^-10

50 Probability of a parse What's the likelihood of π = Fair, Fair, Fair, Fair, Loaded, Loaded, Loaded, Loaded, Fair, Fair and rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
P = ½ × P(1|F) P(F|F) … P(5|F) P(L|F) P(6|L) P(L|L) … P(4|F) = ½ × (0.95)^7 × (0.05)^2 × (1/6)^6 × (1/10)^2 × (1/2)^2 ≈ 5 × 10^-11
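A quick numerical check of the three parse probabilities above (the slide values are rounded):

```python
# P(x, pi) for the three parses of x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
p_all_fair   = 0.5 * 0.95**9 * (1 / 6)**10                                      # ~5e-9
p_all_loaded = 0.5 * 0.95**9 * (1 / 10)**8 * (1 / 2)**2                         # ~8e-10
p_mixed      = 0.5 * 0.95**7 * 0.05**2 * (1 / 6)**6 * (1 / 10)**2 * (1 / 2)**2  # ~5e-11
print(p_all_fair, p_all_loaded, p_mixed)
```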

51 The three main questions on HMMs
1. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[ x, π | M ]
2. Evaluation: GIVEN an HMM M and a sequence x, FIND P[ x | M ]
3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x, FIND the parameters θ = (e_i(.), a_ij) that maximize P[ x | θ ]

52 Decoding Parse (path) is unknown. What to do? Alternatives: Most probable single path –Viterbi algorithm Sequence of most probable states –Forward-backward algorithm

53 The Viterbi algorithm Finds the most probable path π* = argmax_π P(x, π).

54 Equivalent problem: find the heaviest path through the trellis, with weights on both edges and nodes: edge weights log(a_kl) (e.g. log(a_LL), log(a_FF), log(a_LF), log(a_FL)) and node weights log(e_l(x_i)) (e.g. log(e_L(6)), log(e_F(6))). Dynamic programming!

55 V_l(i): weight of the heaviest path emitting x_1, x_2, …, x_i and ending at state l; equivalently, the log probability of the most probable such path. To get to a node in the (i+1)-th column, you must pass through a node in the i-th column. As long as we know the best scores up to the i-th column, how we got there has no impact on future scores.

56 The Viterbi Algorithm
Input: x = x_1 … x_N
Initialization: V_0(0) = 1 (the subscript 0 is the start state), V_k(0) = 0 for all k > 0 (the 0 in parentheses is the imaginary first position)
Iteration: V_j(i) = e_j(x_i) × max_k [ a_kj V_k(i-1) ], Ptr_j(i) = argmax_k [ a_kj V_k(i-1) ]
Termination: P(x, π*) = max_k V_k(N)
Traceback: π*_N = argmax_k V_k(N), π*_{i-1} = Ptr_{π*_i}(i)
The actual calculation is done in log space, which is numerically more stable; it is best to pre-compute all log(transition prob) and log(emission prob) values.
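A compact Python sketch of the algorithm in log space. It folds the begin state into the first column (equivalent to V_0(0) = 1 with a_0k = start[k]); the casino parameters are the ones from the earlier slide, and the roll sequence is made up.

```python
import math

def viterbi(x, states, start, trans, emit):
    """
    Most probable state path for observations x, computed in log space.
    V[i][k] = log probability of the best path that emits x[:i+1] and ends in state k.
    """
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]   # ptr[i][k]: best predecessor of state k at position i
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][j]))
            V[i][j] = V[i - 1][best_k] + math.log(trans[best_k][j]) + math.log(emit[j][x[i]])
            ptr[i][j] = best_k
    # Traceback from the best final state
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, V[-1][last]

# The casino model from the earlier slide; the roll sequence is made up
states = ['F', 'L']
start  = {'F': 0.5, 'L': 0.5}
trans  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.05, 'L': 0.95}}
emit   = {'F': {f: 1 / 6 for f in range(1, 7)},
          'L': {f: (1 / 2 if f == 6 else 1 / 10) for f in range(1, 7)}}
rolls = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4, 6, 6, 6, 6, 6, 3, 6, 6]
path, logp = viterbi(rolls, states, start, trans, emit)
print("".join(path), round(logp, 2))
```

On these made-up rolls the decoded path should assign the long run of sixes at the end to the Loaded state, which is exactly the segmentation the decoding question asks for.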

57 The Viterbi Algorithm is similar to "aligning" a set of states to a sequence. Time: O(K^2 N), Space: O(KN). [Trellis: K states (rows) by N positions x_1 … x_N (columns); each cell holds V_j(i).]

58

59 CpG islands Data: 41 human sequences, totaling 60 kbp, including 48 CpG islands of about 1 kbp each. Viterbi: found 46 of 48, plus 121 false positives. Post-processing (merge islands within 500 bp, discard those shorter than 500 bp): found 46/48, with 67 false positives.

60 Next lecture Evaluation Learning!

