Presentation on theme: "CZ5226: Advanced Bioinformatics Lecture 6: HMM Method for generating motifs. Prof. Chen Yu Zong. Tel: 6874-6877" — Presentation transcript:

1 CZ5226: Advanced Bioinformatics. Lecture 6: HMM Method for generating motifs. Prof. Chen Yu Zong. Tel: 6874-6877. Email: csccyz@nus.edu.sg. Web: http://xin.cz3.nus.edu.sg. Room 07-24, level 7, SOC1, National University of Singapore.

2 Problem in biology
– Data and patterns are often not clear-cut.
– When we want to build a method to recognise a pattern (e.g. a sequence motif), we have to learn it from the data (e.g. there may be other differences between sequences that have the pattern and those that do not).
– This leads to data mining and machine learning.

3 Contents:
– Markov chain models (1st order, higher order and inhomogeneous models; parameter estimation; classification)
– Interpolated Markov models (and back-off models)
– Hidden Markov models (forward, backward and Baum-Welch algorithms; model topologies; applications to gene finding and protein family modeling)
A widely used machine learning approach: Markov models

4 (no text transcribed for this slide)

5 Markov Chain Models
A Markov chain model is defined by:
– a set of states: some states emit symbols; other states (e.g. the begin state) are silent
– a set of transitions with associated probabilities: the transitions emanating from a given state define a distribution over the possible next states

6 Markov Chain Models
Given some sequence x of length L, we can ask how probable the sequence is under our model.
For any probabilistic model of sequences, we can write this probability as Pr(x) = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1).
Key property of a (1st-order) Markov chain: the probability of each x_i depends only on x_{i-1}, so Pr(x) = Pr(x_1) Pr(x_2 | x_1) ... Pr(x_L | x_{L-1}).

7 Markov Chain Models. Example: Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)
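As a minimal sketch (not from the slides) of how this factorisation can be computed, the snippet below multiplies an initial probability and transition probabilities along the sequence. All probability values are invented placeholders.

```python
# Minimal sketch of the slide-7 factorisation Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g).
# The probability values below are made up for illustration only.

initial = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}           # Pr(first symbol)
transition = {                                                    # Pr(next | current)
    "a": {"a": 0.30, "c": 0.20, "g": 0.30, "t": 0.20},
    "c": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
    "g": {"a": 0.20, "c": 0.30, "g": 0.30, "t": 0.20},
    "t": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
}

def markov_chain_probability(seq):
    """Pr(x) = Pr(x_1) * prod_i Pr(x_i | x_{i-1}) for a 1st-order chain."""
    p = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= transition[prev][curr]
    return p

print(markov_chain_probability("cggt"))   # = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)
```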

8 Markov Chain Models
Can also have an end state, allowing the model to represent:
– sequences of different lengths
– preferences for sequences ending with particular symbols

9 Markov Chain Models
The transition parameters can be denoted by a_{x_{i-1} x_i}, where a_{x_{i-1} x_i} = Pr(x_i | x_{i-1}).
Similarly, we can denote the probability of a sequence x as Pr(x) = a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i}, where a_{B x_1} represents the transition from the begin state to the first symbol.

10 Example Application: CpG islands
– CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C and G
– but the regions upstream of genes are richer in CG dinucleotides than elsewhere; these are CpG islands
– useful evidence for finding genes
We could predict CpG islands with Markov chains:
– one to represent CpG islands
– one to represent the rest of the genome
The example uses maximum likelihood and Bayesian statistics to estimate the parameters of the Markov models from labelled sequence data.
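A hedged sketch of how the two-chain approach could score a sequence: the log-odds of the CpG-island chain versus the background chain is summed over adjacent base pairs. The transition tables below are illustrative placeholders, not the (untranscribed) values from slides 13-17.

```python
import math

# Hypothetical transition tables for the two chains; real values would be
# estimated from annotated CpG-island and background genome sequence.
cpg_plus = {
    "a": {"a": 0.18, "c": 0.27, "g": 0.43, "t": 0.12},
    "c": {"a": 0.17, "c": 0.37, "g": 0.27, "t": 0.19},
    "g": {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12},
    "t": {"a": 0.08, "c": 0.36, "g": 0.38, "t": 0.18},
}
cpg_minus = {
    "a": {"a": 0.30, "c": 0.21, "g": 0.28, "t": 0.21},
    "c": {"a": 0.32, "c": 0.30, "g": 0.08, "t": 0.30},
    "g": {"a": 0.25, "c": 0.24, "g": 0.30, "t": 0.21},
    "t": {"a": 0.18, "c": 0.24, "g": 0.29, "t": 0.29},
}

def log_odds(seq):
    """Sum of log2 ratios of '+' vs '-' transition probabilities.
    Positive scores favour the CpG-island model."""
    score = 0.0
    for prev, curr in zip(seq, seq[1:]):
        score += math.log2(cpg_plus[prev][curr] / cpg_minus[prev][curr])
    return score

print(log_odds("cgcgcgta"))   # positive => looks more like a CpG island
```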

11 Estimating the Model Parameters
Given some data (e.g. a set of sequences from CpG islands), how can we determine the probability parameters of our model?
One approach: maximum likelihood estimation
– given a set of data D
– set the parameters θ to maximize Pr(D | θ)
– i.e. make the data D look likely under the model

12 Maximum Likelihood Estimation
Suppose we want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t), and we're given the sequences: accgcgctta, gcttagtgac, tagccgttac
Then the maximum likelihood estimates are:
Pr(a) = 6/30 = 0.2    Pr(c) = 9/30 = 0.3    Pr(g) = 7/30 = 0.233    Pr(t) = 8/30 = 0.267
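A small sketch reproducing the counts above; the sequences are from the slide, the code itself is only illustrative.

```python
from collections import Counter

sequences = ["accgcgctta", "gcttagtgac", "tagccgttac"]   # the slide-12 training set

counts = Counter("".join(sequences))
total = sum(counts.values())   # 30 symbols in total

# Maximum likelihood estimate: the observed relative frequency of each symbol.
mle = {base: counts[base] / total for base in "acgt"}
print(mle)   # {'a': 0.2, 'c': 0.3, 'g': 0.2333..., 't': 0.2666...}
```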

13-16 (no text transcribed for these slides)

17 These data are derived from genome sequences.

18-20 (no text transcribed for these slides)

21 Higher Order Markov Chains
An nth-order Markov chain over some alphabet is equivalent to a first-order Markov chain over the alphabet of n-tuples.
Example: a 2nd-order Markov model for DNA can be treated as a 1st-order Markov model over the alphabet AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT (i.e. all possible dinucleotides), as sketched below.
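As a rough illustration (not from the slides), the sketch below treats a DNA sequence as a 1st-order chain over overlapping n-tuple states and counts the observed transitions; the example sequence is arbitrary.

```python
from collections import defaultdict

def tuple_transition_counts(seq, n=2):
    """Count transitions of an n-th order chain viewed as a 1st-order chain
    whose states are n-tuples (e.g. dinucleotides for n = 2)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - n):
        state = seq[i:i + n]          # current n-tuple, e.g. "ac"
        nxt = seq[i + 1:i + n + 1]    # next n-tuple (overlaps by n-1 symbols)
        counts[state][nxt] += 1
    return counts

counts = tuple_transition_counts("acgtacgcgt", n=2)
print(dict(counts["cg"]))   # which dinucleotide states follow "cg"
```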

22 A Fifth Order Markov Chain

23 Inhomogeneous Markov Chains
In the Markov chain models we have considered so far, the probabilities do not depend on where we are in a given sequence.
In an inhomogeneous Markov model, we can have different distributions at different positions in the sequence.
Consider modeling codons in protein-coding regions.

24 Inhomogeneous Markov Chains

25 A Fifth Order Inhomogeneous Markov Chain

26 Selecting the Order of a Markov Chain Model
Higher-order models remember more "history", and additional history can have predictive value.
Example:
– predict the next word in this sentence fragment: "…finish __" (up, it, first, last, …?)
– now predict it given more history: "Fast guys finish __"

27 Hidden Markov models (HMMs). Given, say, a T in our input sequence, which state emitted it?

28 Hidden Markov models (HMMs): Hidden State
We will distinguish between the observed parts of a problem and the hidden parts.
In the Markov models we have considered previously, it is clear which state accounts for each part of the observed sequence.
In the model on the preceding slide, there are multiple states that could account for each part of the observed sequence:
– this is the hidden part of the problem
– states are decoupled from sequence symbols

29 HMM-based homology searching: transition probabilities and emission probabilities. Gapped HMMs also have insertion and deletion states.

30 Profile HMM: m = match state, I = insert state, d = delete state; the model is traversed from left to right. I and m states emit amino acids; d states are "silent".
The figure shows states m0-m5 (match), I0-I4 (insert) and d1-d4 (delete), plus Start and End.

31 HMM-based homology searching
– The most widely used HMM-based profile searching tools are currently SAM-T99 (Karplus et al., 1998) and HMMER2 (Eddy, 1998)
– formal probabilistic basis and consistent theory behind gap and insertion scores
– HMMs are good for profile searches, bad for alignment (due to parametrisation of the models)
– HMMs are slow

32 Homology-derived Secondary Structure of Proteins (Sander & Schneider, 1991)

33 The Parameters of an HMM
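As a reference for the notation used on the following slides (standard HMM notation, e.g. Durbin et al.; not taken from this slide itself): an HMM is parameterized by transition probabilities a_kl = Pr(π_i = l | π_{i-1} = k) between states, and emission probabilities e_k(b) = Pr(x_i = b | π_i = k) for a state k emitting symbol b. Here a_0k denotes a transition from the begin state and a_kN a transition to the end state.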

34 HMM for Eukaryotic Gene Finding (figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences)

35 A Simple HMM

36 Three Important Questions
– How likely is a given sequence? (the Forward algorithm)
– What is the most probable "path" for generating a given sequence? (the Viterbi algorithm)
– How can we learn the HMM parameters given a set of sequences? (the Forward-Backward, or Baum-Welch, algorithm)

37 How Likely is a Given Sequence?
The probability that the path π is taken and the sequence x is generated (assuming begin and end are the only silent states on the path) is
Pr(x, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}
where a_{0 π_1} is the transition from the begin state and a_{π_L π_{L+1}} = a_{π_L N} is the transition to the end state.

38 How Likely is a Given Sequence?

39 How Likely is a Given Sequence?
The probability over all paths is Pr(x) = Σ_π Pr(x, π), but the number of paths can be exponential in the length of the sequence...
The Forward algorithm enables us to compute this efficiently.

40 How Likely is a Given Sequence: The Forward Algorithm
Define f_k(i) to be the probability of being in state k having observed the first i characters of x.
We want to compute f_N(L), the probability of being in the end state having observed all of x.
We can define this recursively.

41 How Likely is a Given Sequence?

42 The forward algorithm
Initialisation: f_0(0) = 1 (start state), f_k(0) = 0 for all other silent states k; i.e. we are in the start state and have observed 0 characters of the sequence.
Recursion: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl (emitting states); f_l(i) = Σ_k f_k(i) a_kl (silent states).
Termination: Pr(x) = Pr(x_1 … x_L) = f_N(L) = Σ_k f_k(L) a_kN, the probability that we are in the end state and have observed the entire sequence.
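A minimal sketch of the forward recursion for a fully connected two-state model with no internal silent states; the states "I" and "G", and all transition and emission values, are invented for illustration and are not the lecture's example.

```python
# Toy two-state HMM; all probabilities are placeholders.
states = ["I", "G"]
start = {"I": 0.5, "G": 0.5}                       # a_0k: begin -> state k
trans = {"I": {"I": 0.8, "G": 0.2},                # a_kl
         "G": {"I": 0.3, "G": 0.7}}
emit = {"I": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},   # e_k(b)
        "G": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def forward(x):
    """Return Pr(x), summed over all paths, via the forward recursion."""
    # Initialisation: f_k(1) = a_0k * e_k(x_1)
    f = {k: start[k] * emit[k][x[0]] for k in states}
    # Recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for symbol in x[1:]:
        f = {l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    # Termination: Pr(x) = sum_k f_k(L) * a_kN; here we take a_kN = 1,
    # i.e. no explicit end state in this toy model.
    return sum(f.values())

print(forward("cggt"))
```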

43 Forward algorithm example

44 Three Important Questions
– How likely is a given sequence?
– What is the most probable "path" for generating a given sequence?
– How can we learn the HMM parameters given a set of sequences?

45 Finding the Most Probable Path: The Viterbi Algorithm
Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.
We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.
This can be defined recursively, and we can use dynamic programming to find v_N(L) efficiently.

46 Finding the Most Probable Path: The Viterbi Algorithm
Initialisation: v_0(0) = 1 (start), v_k(0) = 0 (non-silent states).
Recursion for emitting states (i = 1 … L): v_l(i) = e_l(x_i) max_k [v_k(i-1) a_kl], keeping a pointer to the maximizing k.
Recursion for silent states: v_l(i) = max_k [v_k(i) a_kl].

47 Finding the Most Probable Path: The Viterbi Algorithm
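A matching Viterbi sketch for the same kind of toy two-state model (again, all probabilities are invented placeholders); it returns both the probability of the best path and the path itself via traceback.

```python
# Toy two-state HMM; all probabilities are placeholders.
states = ["I", "G"]
start = {"I": 0.5, "G": 0.5}
trans = {"I": {"I": 0.8, "G": 0.2},
         "G": {"I": 0.3, "G": 0.7}}
emit = {"I": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
        "G": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def viterbi(x):
    """Return (probability, state path) of the most probable path for x."""
    # Initialisation: v_k(1) = a_0k * e_k(x_1)
    v = {k: start[k] * emit[k][x[0]] for k in states}
    back = []                                   # backpointers per position
    # Recursion: v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl
    for symbol in x[1:]:
        ptr, new_v = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: v[k] * trans[k][l])
            ptr[l] = best_k
            new_v[l] = emit[l][symbol] * v[best_k] * trans[best_k][l]
        back.append(ptr)
        v = new_v
    # Termination and traceback of the best path
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[last], "".join(reversed(path))

print(viterbi("cggt"))
```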

48 Three Important Questions
– How likely is a given sequence? (clustering)
– What is the most probable "path" for generating a given sequence? (alignment)
– How can we learn the HMM parameters given a set of sequences?

49 The Learning Task
Given: a model and a set of sequences (the training set)
Do: find the most likely parameters to explain the training sequences
The goal is to find a model that generalizes well to sequences we haven't seen before.

50 Learning Parameters
If we know the state path for each training sequence, learning the model parameters is simple:
– no hidden state during training
– count how often each parameter is used
– normalize/smooth to get probabilities
– the process is just like it was for Markov chain models
If we don't know the path for each training sequence, how can we determine the counts?
– key insight: estimate the counts by considering every path weighted by its probability

51 Learning Parameters: The Baum-Welch Algorithm
An EM (expectation-maximization) approach; a forward-backward algorithm.
Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence:
  – calculate the expected number of times each transition or emission is used
  – adjust the parameters to maximize the likelihood of these expected values

52-55 The Expectation step (no text transcribed for these slides)

56 The Expectation step
First, we need to know the probability of the i-th symbol being produced by state k, given the sequence x: Pr(π_i = k | x).
Given this, we can compute our expected counts for state transitions and character emissions.
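Combining the forward and backward variables (the standard decomposition, e.g. Durbin et al.; not shown in the transcript itself), this posterior is
Pr(π_i = k | x) = f_k(i) b_k(i) / Pr(x)
where f_k(i) covers the sequence up to and including position i, b_k(i) covers the remainder, and Pr(x) = f_N(L) comes from the forward algorithm's termination step.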

57 The Expectation step (no text transcribed for this slide)

58 The Backward Algorithm
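For reference, the standard backward recursion (a common formulation, e.g. Durbin et al.; the slide body itself was not transcribed) mirrors the forward algorithm but works from the end of the sequence:
Initialisation: b_k(L) = a_kN for all states k.
Recursion (i = L-1, …, 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1).
Termination: Pr(x) = Σ_l a_0l e_l(x_1) b_l(1).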

59-61 The Expectation step (no text transcribed for these slides)

62-63 The Maximization step (no text transcribed for these slides)
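For reference, the standard Baum-Welch re-estimation (a common formulation consistent with the expectation step above, e.g. Durbin et al.; not transcribed from these slides) sets each parameter to its expected relative frequency:
a_kl = A_kl / Σ_{l'} A_{kl'}    and    e_k(b) = E_k(b) / Σ_{b'} E_k(b')
where A_kl is the expected number of k-to-l transitions and E_k(b) is the expected number of times state k emits symbol b, both computed in the expectation step.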

64 The Baum-Welch Algorithm
– Initialize the parameters of the model
– Iterate until convergence:
  – calculate the expected number of times each transition or emission is used
  – adjust the parameters to maximize the likelihood of these expected values
This algorithm will converge to a local maximum (in the likelihood of the data given the model), usually in a fairly small number of iterations.

