
Hidden Markov models and their application to bioinformatics


1 Hidden Markov models and their application to bioinformatics

2 Overview
- Hidden Markov models (HMMs)
  - A problem in bioinformatics solved by HMMs: multiple alignment
  - Standard algorithms for HMMs:
    - finding the most likely state transition (the Viterbi algorithm)
    - learning parameters from data (the Baum-Welch algorithm)

3 Problem in this talk (1/2)
- Finding conserved regions of proteins (amino acid sequences)
  - Different species can have (almost) the same functional protein, sharing common amino acid subsequences.
[Figure: an ancestral amino acid sequence (mgpg..) diverges over time into descendant sequences such as mgdv.. and mgpv..]

4 Problem in this talk (2/2)
- Finding conserved regions of amino acid sequences
  - Different species can have common subsequences of amino acids in the same functional protein; some amino acids were changed in the evolutionary process.
  - Common subsequences are conserved, presumably because they are functionally important.
- Example: amino acid sequences of cytochrome c, a protein that transfers electrons in aerobic respiration:
    human  mgdvekgkki fimkcsqcht vekggkhktg pnlhglfgrk...
    mouse  mgdvekgkki fvqkcaqcht vekggkhktg pnlhglfgrk...
    fly    mgvpagdvek gkklfvqrca qchtveaggk hkvgpnlhgl...
- Problem to be solved: find common parts in multiple sequences efficiently.

5 Comparison of sequences
- Arrange the sequences so that the number of matched characters is maximized.
  - Align sequences by inserting gap characters "-".
  - This process is called alignment.
- Simply listed sequences:
    mgdvekgkki fimkcsqcht..
    mgdvekgkki fvqkcaqcht..
    mgvpagdvek gkklfvqrca..
- Sequences aligned with gaps:
    mg----dvek gkkifimkcsqcht..
    mg----dvek gkkifvqkcaqcht..
    mgvpagdvek gkklfvqrca..

6 Approaches to alignments
(N: # sequences, L: max. sequence length, M: # states in an HMM)

                          Dynamic programming   Hidden Markov models (HMMs)
  Best alignment          can be found          often found in practice
  # seqs. in practice     only a few            dozens of seqs. applicable
  Computation time        O(2^N L^N)            O(N M^2 L)

[Figure: computation time vs. number of sequences N; the DP curve blows up far sooner than the HMM curve.]
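To get a feel for the gap between the two columns, here is a quick back-of-the-envelope check in Python; the concrete values of N, L, and M are illustrative assumptions, not numbers from the talk:

```python
# Rough growth of the two complexity formulas from the table above.
# N, L, M below are illustrative choices, not values from the slides.
N, L, M = 5, 100, 120           # 5 sequences, length 100, 120 HMM states

dp_cost  = (2 ** N) * (L ** N)  # O(2^N L^N): dynamic programming over N sequences
hmm_cost = N * M ** 2 * L       # O(N M^2 L): profile-HMM alignment

print(f"DP : {dp_cost:.1e}")    # ~3.2e+11 even for just 5 short sequences
print(f"HMM: {hmm_cost:.1e}")   # ~7.2e+06, and it grows only linearly in N
```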

7 Hidden Markov models
- We move from state to state probabilistically.
- The Markov property holds: the state transition and the output symbol depend only on the current state.
[Figure: an example HMM with three states; edges carry transition probabilities (0.7, 0.1, 0.2, 0.1, 0.4, 0.5, ...), and each state has output probabilities, e.g. a: 0.2, b: 0.8 for one state and a: 0.6, b: 0.4 for another. A sample time course visits states 1, 3, 2 while outputting a, b, b.]
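As a concrete illustration of the Markov property, a minimal generative sketch of a 3-state HMM in Python. The initial and output probabilities follow the slide's example where readable; the full transition matrix is not recoverable from the figure, so two rows below are assumptions:

```python
import random

# A 3-state HMM over the alphabet {a, b}. Initial probabilities and the
# first transition row follow the slide; the other rows and state 3's
# outputs are assumptions, since the figure only labels some edges.
init  = {1: 0.2, 2: 0.3, 3: 0.5}
trans = {1: {1: 0.7, 2: 0.1, 3: 0.2},   # row 1 as labeled on the slide
         2: {1: 0.1, 2: 0.4, 3: 0.5},   # assumed
         3: {1: 0.4, 2: 0.1, 3: 0.5}}   # assumed
emit  = {1: {'a': 0.2, 'b': 0.8},
         2: {'a': 0.6, 'b': 0.4},
         3: {'a': 0.4, 'b': 0.6}}       # state 3's outputs assumed

def sample(length):
    """Generate (state, symbol) pairs; each step depends only on the current state."""
    state = random.choices(list(init), weights=list(init.values()))[0]
    out = []
    for _ in range(length):
        syms = list(emit[state])
        sym = random.choices(syms, weights=[emit[state][s] for s in syms])[0]
        out.append((state, sym))
        nxt = list(trans[state])
        state = random.choices(nxt, weights=[trans[state][s] for s in nxt])[0]
    return out

print(sample(5))   # e.g. [(3, 'b'), (2, 'a'), ...]
```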

8 Alignment with HMM
- [Idea] A state of the HMM = a position (column) of the alignment.
- Example: aligning gar... with g-k..., where arginine (r) and lysine (k) are similar amino acids.
[Figure: states 1, 2, 3; one path outputs g then r, another outputs g then k, so a single state can account for similar amino acids.]

9 Arginine and Lysine
[Figure: chemical structures of arginine, lysine, serine, and glycine. Arginine and lysine both carry long, positively charged side chains, so they are chemically similar; serine and glycine are shown for contrast.]

10 Alignment with HMM
- [Idea] A state of the HMM = a position of the alignment.
- Example (as on slide 8): gar... aligned with g-k...; arginine and lysine are similar amino acids.
- A profile HMM can describe similar amino acids through its states and output probabilities.

11 Profile HMM
- Each state represents a difference (match, insertion, deletion) of symbols from a basis sequence.
  - Match states (m0, m1, m2, m3): a symbol is output.
  - Insertion states (i0, i1, i2): a symbol is output.
  - Deletion states (d1, d2): no symbol is output.
[Figure: the profile-HMM topology m0, i0, d1, m1, i1, d2, m2, i2, m3. Example: against the basis sequence A-KVG, another sequence ASR-G has an inserted symbol (S) and a deleted symbol (V).]

12 Alignment with HMM
- A state transition corresponds to an alignment of matched symbols.
  Alignment:
    A - K V G
    A - R V G
    A S R - G
  State transitions:
    A K V G:  m0 -> m1 -> m2 -> m3
    A R V G:  m0 -> m1 -> m2 -> m3
    A S R G:  m0 -> i0 -> m1 -> d2 -> m3
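This correspondence is mechanical enough to express in a few lines. A small sketch (a hypothetical helper, not from the talk) maps one aligned row to its profile-HMM state path, using the convention that columns where the basis row has a residue are match columns:

```python
def state_path(basis_row, seq_row):
    """Map one aligned row to a profile-HMM state path.
    Columns where the basis row has a residue are match columns
    (match state m_k or delete state d_k); columns where the basis
    row has a gap are insert columns (insert state i_{k-1})."""
    path, k = [], 0            # k = number of match columns passed so far
    for b, s in zip(basis_row, seq_row):
        if b != '-':           # match column k
            path.append(f"m{k}" if s != '-' else f"d{k}")
            k += 1
        elif s != '-':         # insert column after match column k-1
            path.append(f"i{k - 1}")
    return path

print(state_path("A-KVG", "ASR-G"))  # ['m0', 'i0', 'm1', 'd2', 'm3'], as on the slide
```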

13 Overview
- Hidden Markov models (HMMs)
  - A problem in bioinformatics solved by HMMs: multiple alignment
  - Standard algorithms for HMMs:
    - finding the most likely state transition (the Viterbi algorithm)
    - learning parameters from data (the Baum-Welch algorithm)

14 Prediction of the best state transition
- Given multiple sequences (AKVG, ARVG, ASRG) and a hidden Markov model (the profile HMM m0, i0, d1, m1, i1, d2, m2, i2, m3), a prediction algorithm finds the state transition that maximizes the probability:
    A K V G:  m0 -> m1 -> m2 -> m3
    A R V G:  m0 -> m1 -> m2 -> m3
    A S R G:  m0 -> i0 -> m1 -> d2 -> m3

15 Enumeration
- Compute the probabilities of all possible state transitions.
- Example: for the sequence aba and the 3-state example HMM, each path is scored by multiplying initial, output, and transition probabilities:
    1 -> 1 -> 1:  0.0031
    1 -> 1 -> 2:  0.0013
    1 -> 1 -> 3:  0.0018
- There are (# states)^length paths, so this is impossible to compute in practice.
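A brute-force sketch of this enumeration in Python. The first transition row and the output probabilities reproduce the path scores 0.0031, 0.0013, and 0.0018 shown on the slide; the transition rows for states 2 and 3 are assumptions:

```python
from itertools import product

init  = {1: 0.2, 2: 0.3, 3: 0.5}
trans = {1: {1: 0.7, 2: 0.1, 3: 0.2},
         2: {1: 0.1, 2: 0.4, 3: 0.5},   # assumed
         3: {1: 0.4, 2: 0.1, 3: 0.5}}   # assumed
emit  = {1: {'a': 0.2, 'b': 0.8},
         2: {'a': 0.6, 'b': 0.4},
         3: {'a': 0.4, 'b': 0.6}}
seq = "aba"

# Score every one of the M^L = 3^3 = 27 state paths.
for path in product([1, 2, 3], repeat=len(seq)):
    p = init[path[0]] * emit[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        p *= trans[prev][cur] * emit[cur][sym]
    print(path, f"{p:.4f}")   # (1,1,1) -> 0.0031, (1,1,2) -> 0.0013, (1,1,3) -> 0.0018
```

With 27 paths this is instant, but the path count is M^L, which is why the slide calls plain enumeration impossible in practice.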

16 To find the best state transition
(N: # sequences, L: max. sequence length, M: # states in the HMM)

                               Enumeration      Viterbi algorithm
  State trans. with max prob.  can be found     can be found
  Computation time             O(N M^L)         O(N M^2 L)
  Seq. length in practice      short only       longer seqs. applicable

[Figure: computation time vs. sequence length L; enumeration blows up exponentially while the Viterbi curve stays nearly flat.]

17 Viterbi algorithm (1/4)
- [Idea] Combine state transitions that end in the same state.
  - Transition probabilities and output probabilities are independent of past states, so only the best path into each state needs to be kept.
[Figure: a trellis with states 1, 2, 3 in each column, one column per symbol of the sequence a b a.]

18 Viterbi algorithm (2/4)
- (1) Computing the probabilities of state transitions, step by step over the trellis.
[Figure: the trellis for the sequence a b a with initial probabilities 0.2, 0.3, 0.5 and output probabilities for a of 0.2, 0.6, 0.4. Each edge multiplies the current value by a transition probability, e.g. 0.2 x 0.2 x 0.7 = 0.028.]

19 Viterbi algorithm (2/4)
- (1) Computing the probabilities of state transitions (continued).
[Figure: the second trellis column, for symbol b with output probabilities 0.8, 0.4, 0.6. Each state keeps the maximum over its incoming transitions.]

20 Viterbi algorithm (2/4)
- (1) Computing the probabilities of state transitions (continued).
[Figure: the full trellis for a b a. Second-column values 0.050, 0.018, 0.030; third-column values (symbol a, output probabilities 0.2, 0.6, 0.4) 0.010, 0.011, 0.012.]

21 Viterbi algorithm (3/4)
- (2) Tracing back the state transition with the maximum probability.
[Figure: the same trellis; starting from the largest final value (0.012), back-pointers are followed to recover the most probable state path.]

22 Viterbi algorithm (4/4)
- Computing time: O(N M^2 L)
  - At each node of the trellis, it takes O(M) time.
  - The number of nodes per sequence is O(ML).
[Figure: the same trellis as on the previous slides.]
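A minimal Viterbi sketch matching the complexity argument above, reusing the init/trans/emit dictionaries from the enumeration example (log-space is a standard numerical safeguard the slides do not discuss):

```python
import math

def viterbi(seq, states, init, trans, emit):
    """Most likely state path, O(M^2 L) per sequence of length L."""
    # v[k] = log-probability of the best path ending in state k
    v = {k: math.log(init[k] * emit[k][seq[0]]) for k in states}
    back = []                              # one dict of back-pointers per column
    for sym in seq[1:]:
        ptr, nv = {}, {}
        for k in states:                   # O(M) predecessors per state -> O(M^2) per column
            best = max(states, key=lambda j: v[j] + math.log(trans[j][k]))
            ptr[k] = best
            nv[k] = v[best] + math.log(trans[best][k] * emit[k][sym])
        back.append(ptr)
        v = nv
    last = max(states, key=v.get)          # (2) trace back from the best final state
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, math.exp(v[last])

# path, p = viterbi("aba", [1, 2, 3], init, trans, emit)
```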

23 Appropriate parameters
- When the parameters change, the best state transition can also change.
[Figure: the same trellis with one output probability changed (0.2 and 0.8 swapped); a path value changes (e.g. to 0.004) and a different path becomes the maximizer.]

24 Overview
- Hidden Markov models (HMMs)
  - A problem in bioinformatics solved by HMMs: multiple alignment
  - Standard algorithms for HMMs:
    - finding the most likely state transition (the Viterbi algorithm)
    - learning parameters from data (the Baum-Welch algorithm)

25 Learning (training) algorithm
- Algorithm to find appropriate parameters: the Baum-Welch algorithm
  - An instance of the EM (Expectation-Maximization) family of algorithms.
  - Each update always increases the likelihood P(X | theta), where X is the set of sequences and theta the parameters.
  - The expected numbers of occurrences of the parameters (state transitions and outputs) are used.
- Flow of the B-W algorithm:
  1. Set initial parameters at random.
  2. Update the parameters.
  3. If the increase of the likelihood is < epsilon, output the parameters; otherwise go to 2.

26 Appropriate parameters
- Probabilistic parameters are better when they generate the given data with higher probability.
- Example: a die is cast 30 times and shows spots 1, 3, 5 ten times each (never 2, 4, 6).

  Spots:        1    2    3    4    5    6
  # observed:   10   0    10   0    10   0
  Params (1):   1/6  1/6  1/6  1/6  1/6  1/6
  Params (2):   1/3  0    1/3  0    1/3  0

- Probability of observing the 30 casts: (1/6)^30 under (1), (1/3)^30 under (2); (1/6)^30 < (1/3)^30, so parameters (2) fit the data better.
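The comparison is easy to verify; a quick check assuming the observed counts from the table:

```python
import math

counts  = {1: 10, 2: 0, 3: 10, 4: 0, 5: 10, 6: 0}       # the 30 observed casts
params1 = {s: 1 / 6 for s in range(1, 7)}               # fair die
params2 = {1: 1/3, 2: 0, 3: 1/3, 4: 0, 5: 1/3, 6: 0}    # concentrated on 1, 3, 5

def likelihood(p):
    """Probability of the observed counts under parameters p."""
    return math.prod(p[s] ** n for s, n in counts.items())

print(likelihood(params1))   # (1/6)^30 ~ 4.5e-24
print(likelihood(params2))   # (1/3)^30 ~ 4.9e-15, far larger
```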

27 # occurrences of parameters
- In updating the transition probability a_{k,l}, we need the number of occurrences of the transition from state k to state l.
- The state transitions are hidden, so the probability that each path uses the k -> l transition is summed instead (e.g. 0.003 + 0.004 + ... = 1.244), giving the expectation A_{k,l} of the number of occurrences; the updated value of a_{k,l} is computed from A_{k,l}.

28 Computing # occurrences of parameters
- The expected number of occurrences of the transition from state k to state l combines, at each position i of a sequence x:
  - the forward probability f_k(i),
  - the transition probability a_{k,l},
  - the output probability e_l(x_{i+1}),
  - the backward probability b_l(i+1).
- Summed over positions: A_{k,l} = (1 / P(x)) * sum_i f_k(i) * a_{k,l} * e_l(x_{i+1}) * b_l(i+1)

29 Forward probability
- f_k(i): the probability that the HMM outputs x_1 .. x_i and is in state k at position i.
- Recursion: f_k(i) = e_k(x_i) * ( f_1(i-1) * a_{1,k} + ... + f_M(i-1) * a_{M,k} )
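The recursion translates directly into code; a minimal sketch, with parameters in the same dictionary format as the earlier examples:

```python
def forward(seq, states, init, trans, emit):
    """Forward probabilities, built left to right:
    f[i][k] corresponds to f_k(i+1) = e_k(x_i) * sum_j f_j(i-1) * a_{j,k}."""
    f = [{k: init[k] * emit[k][seq[0]] for k in states}]
    for sym in seq[1:]:
        prev = f[-1]
        f.append({k: emit[k][sym] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states})
    return f

# Total probability of the sequence: P(x) = sum of the last column,
# e.g. sum(forward("aba", [1, 2, 3], init, trans, emit)[-1].values())
```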

30 Backward probability
- b_k(i): the probability that the HMM outputs x_{i+1} .. x_L given that it is in state k at position i.
- Recursion: b_k(i) = a_{k,1} * e_1(x_{i+1}) * b_1(i+1) + ... + a_{k,M} * e_M(x_{i+1}) * b_M(i+1)
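And the mirror-image sketch for the backward probabilities:

```python
def backward(seq, states, trans, emit):
    """Backward probabilities, built right to left:
    b[i][k] corresponds to b_k(i+1) = sum_j a_{k,j} * e_j(x_{i+1}) * b_j(i+1)."""
    b = [{k: 1.0 for k in states}]          # base case: b_k(L) = 1
    for sym in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][j] * emit[j][sym] * nxt[j] for j in states)
                     for k in states})
    return b
```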

31 Forward/backward probabilities
- Time to compute them for one sequence:
  - Each value f_k(i) (or b_k(i)) is a sum of M terms, so it takes O(M) time.
  - k ranges over M states and i over L positions, giving O(ML) values.
  - Hence each sequence takes O(M^2 L) time.
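Putting the pieces together: a sketch of the expected transition counts A_{k,l} from slides 27-28, using the forward and backward functions above (the normalization by P(x) follows the standard Baum-Welch formulation):

```python
def expected_transitions(seq, states, init, trans, emit):
    """A[k][l] = expected number of k -> l transitions while generating seq:
    A_{k,l} = (1 / P(x)) * sum_i f_k(i) * a_{k,l} * e_l(x_{i+1}) * b_l(i+1)."""
    f = forward(seq, states, init, trans, emit)
    b = backward(seq, states, trans, emit)
    px = sum(f[-1].values())                  # total probability P(x)
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(seq) - 1):             # positions that have a next symbol
        for k in states:
            for l in states:
                A[k][l] += (f[i][k] * trans[k][l]
                            * emit[l][seq[i + 1]] * b[i + 1][l]) / px
    return A   # Baum-Welch then sets a_{k,l} proportional to A[k][l]
```

The expected output counts are computed analogously from f_k(i) * b_k(i), and each Baum-Welch update renormalizes both sets of counts.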

32 Conclusion
- Alignment of multiple sequences: used to find conserved regions of protein sequences.
- Hidden Markov models (HMMs)
  - Profile HMMs describe alignments of multiple sequences.
  - Prediction algorithm: the Viterbi algorithm.
  - Learning (training) algorithm: the Baum-Welch algorithm; for efficiency, the forward and backward probabilities are used.

