1 Computational Biology Lecture #8: Hidden Markov Models, Motifs & Patterns
Bud Mishra
Professor of Computer Science and Mathematics
11/12/2001
©Bud Mishra, 2001

2 Hidden Markov Models (HMM)
An HMM is defined by an alphabet Σ, a set of (hidden) states Q, a matrix A of state transition probabilities, and a matrix E of emission probabilities.

3 States
Σ = an alphabet of symbols
Q = a set of states that emit symbols from the alphabet Σ
A = (a_{kl}) = |Q| × |Q| matrix of state transition probabilities
E = (e_k(b)) = |Q| × |Σ| matrix of emission probabilities

4 A Simple Example
Fair Bet Casino:
Σ = {H, T}: a two-sided coin
Q = {F, B}: the casino may use a fair coin, Pr(H) = Pr(T) = ½, or a biased coin, Pr(H) = ¾, Pr(T) = ¼.

5 Fair Bet Casino Problem
[State diagram: states F (Pr[H] = ½, Pr[T] = ½) and B (Pr[H] = ¾, Pr[T] = ¼), each with self-transition probability 0.9 and switching probability 0.1 to the other state.]

6 Transition and Emission
a_{FF} = a_{BB} = 0.9, a_{FB} = a_{BF} = 0.1
The casino switches the coin with probability 0.1.
e_F(H) = ½, e_F(T) = ½, e_B(H) = ¾, e_B(T) = ¼
Given a sequence of coin tosses, find out when the dealer used a fair coin and when the dealer used a biased coin.
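As a concrete illustration, the Fair Bet Casino HMM above can be written down as plain data structures. This is a minimal sketch; the probabilities are taken from the slide, while the variable names and dictionary layout are just one possible choice.

```python
# Fair Bet Casino HMM as plain Python data structures.

states = ["F", "B"]        # hidden states Q: fair and biased
alphabet = ["H", "T"]      # emitted symbols (the alphabet Sigma)

# transitions[k][l] = a_{kl}, probability of moving from state k to state l
transitions = {
    "F": {"F": 0.9, "B": 0.1},
    "B": {"F": 0.1, "B": 0.9},
}

# emissions[k][b] = e_k(b), probability that state k emits symbol b
emissions = {
    "F": {"H": 0.5, "T": 0.5},
    "B": {"H": 0.75, "T": 0.25},
}
```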

7 HMM in Genomics
CG-Islands: subsequences around genes where CG appears relatively frequently.
Methylation is suppressed in CG-islands.
C within the CG dinucleotide is typically methylated.
Methyl-C has a tendency to mutate into a T.

8 A Path in the HMM
π = π_1 π_2 ⋯ π_n = a sequence of states ∈ Q* in the hidden Markov model M
x ∈ Σ* = sequence generated by the path π, determined by the model M
P(x | π) = P(π_1) ∏_{i=1}^{n} P(x_i | π_i) P(π_i → π_{i+1})

9 A Path in the HMM
P(x | π) = P(π_1) ∏_{i=1}^{n} P(x_i | π_i) P(π_i → π_{i+1})
P(x_i | π_i) = e_{π_i}(x_i)
P(π_i → π_{i+1}) = a_{π_i, π_{i+1}}
π_0 = initial state "begin", π_{n+1} = final state "end"
P(x | π) = a_{π_0,π_1} e_{π_1}(x_1) a_{π_1,π_2} e_{π_2}(x_2) ⋯ e_{π_n}(x_n) a_{π_n,π_{n+1}}
         = a_{π_0,π_1} ∏_{i=1}^{n} e_{π_i}(x_i) a_{π_i,π_{i+1}}
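The product formula above is straightforward to compute directly. Below is a minimal sketch that reuses the hypothetical transitions/emissions dictionaries from the casino example; a start distribution stands in for the explicit "begin" state, and the trailing factor into "end" is dropped for simplicity (both are assumptions, not part of the slide).

```python
def path_probability(x, path, start, transitions, emissions):
    """P(x | pi) = a_{pi0,pi1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}.

    x     : list of emitted symbols, e.g. ["H", "T", "H"]
    path  : list of hidden states of the same length, e.g. ["F", "F", "B"]
    start : dict, start[k] = probability of entering state k from "begin"
    """
    p = start[path[0]]
    for i in range(len(x)):
        p *= emissions[path[i]][x[i]]                    # e_{pi_i}(x_i)
        if i + 1 < len(x):
            p *= transitions[path[i]][path[i + 1]]       # a_{pi_i, pi_{i+1}}
    return p

# Example with the casino HMM, assuming equal start probabilities:
# path_probability(["H", "H", "T"], ["F", "B", "B"],
#                  {"F": 0.5, "B": 0.5}, transitions, emissions)
```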

10 Decoding Problem
For a given sequence x and a given path π, the (Markovian) model defines the probability P(x | π).
The dealer knows π and x; the player knows x but not π: "the path of x is hidden."
Decoding Problem: find an optimal path π* for x such that P(x | π) is maximized:
π* = arg max_π P(x | π)

11 Dynamic Programming Approach
Principle of Optimality: an optimal path for the (i+1)-prefix x_1 ⋯ x_{i+1} of x uses a path for the i-prefix of x that is optimal among the paths ending in an (unknown) state π_i = k ∈ Q.

12 Dynamic Programming Approach
s_k(i) = the probability of the most probable path for the i-prefix ending in state k, ∀ k ∈ Q, ∀ 1 ≤ i ≤ n
s_l(i+1) = e_l(x_{i+1}) · max_{k ∈ Q} [ s_k(i) · a_{kl} ]

13 Dynamic Programming
i = 0: s_begin(0) = 1, s_k(0) = 0 ∀ k ≠ begin
0 ≤ i < n: s_l(i+1) = e_l(x_{i+1}) · max_{k ∈ Q} [ s_k(i) · a_{kl} ]
i = n+1: P(x | π*) = max_{k ∈ Q} s_k(n) a_{k,end}

14 Viterbi Algorithm
Dynamic programming with the log-score function S_l(i) = log s_l(i):
S_l(i+1) = log e_l(x_{i+1}) + max_{k ∈ Q} [ S_k(i) + log a_{kl} ]
Space complexity = O(n |Q|)
Time complexity = O(n |Q|²)
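A minimal log-space Viterbi sketch follows, again assuming the hypothetical transitions/emissions dictionaries and a start distribution in place of the explicit "begin"/"end" states (the a_{k,end} factor is taken to be 1, a simplifying assumption). A traceback is included so the decoded path can be read off.

```python
import math

def viterbi(x, states, start, transitions, emissions):
    """Most probable state path for x, using the log-scores S_l(i)."""
    # S[i][l] = best log-probability of a path for x[0..i] ending in state l
    S = [{l: math.log(start[l]) + math.log(emissions[l][x[0]]) for l in states}]
    back = [{}]                                  # back-pointers for traceback
    for i in range(1, len(x)):
        S.append({})
        back.append({})
        for l in states:
            # max over predecessor states k of S_k(i-1) + log a_{kl}
            k_best = max(states,
                         key=lambda k: S[i - 1][k] + math.log(transitions[k][l]))
            S[i][l] = (S[i - 1][k_best] + math.log(transitions[k_best][l])
                       + math.log(emissions[l][x[i]]))
            back[i][l] = k_best
    # traceback from the best final state
    last = max(states, key=lambda l: S[-1][l])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), S[-1][last]

# e.g. viterbi(list("HHHTTT"), states, {"F": 0.5, "B": 0.5}, transitions, emissions)
```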

15 Estimating the i-th State
P(π_i = k | x) = given a sequence x ∈ Σ*, the probability that the HMM was in state k at instant i.

16 Forward Estimate
f_k(i) = P(x_1 ⋯ x_i, π_i = k) = probability of emitting the prefix x_1 ⋯ x_i and reaching the state π_i = k
f_k(i) = e_k(x_i) · ∑_{l ∈ Q} f_l(i−1) a_{lk}

17 Backward Estimate
b_k(i) = P(x_{i+1} ⋯ x_n | π_i = k) = probability of emitting the suffix x_{i+1} ⋯ x_n given that the path is in state π_i = k
b_k(i) = ∑_{l ∈ Q} a_{kl} · e_l(x_{i+1}) · b_l(i+1)

18 Applying Bayes' Rule
P(π_i = k | x) = (1/P(x)) × P(x_1 ⋯ x_i, π_i = k) × P(x_{i+1} ⋯ x_n | π_i = k)
              = [f_k(i) b_k(i)] / P(x)
P(x) = ∑_π P(x | π)
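The forward and backward recurrences translate directly into code. Here is a minimal posterior-decoding sketch under the same assumptions as the Viterbi sketch above (start distribution instead of the "begin" state, no explicit "end" state); all names are illustrative.

```python
def posterior_state_probs(x, states, start, transitions, emissions):
    """Return P(pi_i = k | x) for every position i and state k."""
    n = len(x)
    # forward: f[i][k] = P(x_1..x_{i+1}, pi_{i+1} = k)   (0-based i)
    f = [{k: start[k] * emissions[k][x[0]] for k in states}]
    for i in range(1, n):
        f.append({k: emissions[k][x[i]] *
                     sum(f[i - 1][l] * transitions[l][k] for l in states)
                  for k in states})
    # backward: b[i][k] = P(x_{i+2}..x_n | pi_{i+1} = k)
    b = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(transitions[k][l] * emissions[l][x[i + 1]] * b[i + 1][l]
                       for l in states)
                for k in states}
    px = sum(f[n - 1][k] for k in states)            # P(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
```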

19 Sequence Motifs
Sequence patterns of biological significance.
Examples:
DNA: protein binding sites (e.g. promoters, regulatory sequences)
Protein: sequences corresponding to conserved pieces of structure (e.g. local features, at various scales: blocks, domains & families)

20 MEME Algorithm
Uses the EM (Expectation Maximization) algorithm to find multiple motifs in a set of sequences.
Description of a motif:
W = (fixed) width of the motif
P = (p_{lc}), l ∈ Σ, c ∈ 1..W = |Σ| × W matrix of probabilities that letter l occurs at position c

21 Example
DNA motif of width 3:
Σ = {A, T, C, G}
ρ = motif ⇒ W_ρ = 3, P_ρ = a 4 × 3 stochastic matrix
P_ρ = [4 × 3 matrix of per-position letter probabilities shown on the original slide]

22 Computational Problem
Given: a set of sequences G and a width parameter W.
Find: motifs of width W common to the sequences in G, and present their probabilistic descriptions.
Note that the motif start sites in each sequence are unknown (hidden).

23 Basic EM Approach
G = training sequences; |G| = total number of sequences = m; minimum length of a sequence in G = l.
Z = m × l matrix of probabilities: z_{ij} = probability that the motif starts in position j in sequence i.

24 EM Algorithm
Set initial values for P
do
    re-estimate Z from P
    re-estimate P from Z
until change in P < ε
return P
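As a rough sketch of this loop in code, assuming estimate_Z and estimate_P are the E- and M-step routines sketched after the Estimating Z and Estimating P slides below (all names, and the convergence test on the entries of P, are illustrative choices):

```python
def em_motif(seqs, P_init, W, eps=1e-6, max_iters=1000):
    """Basic EM loop: alternate re-estimating Z and P until P stops changing."""
    P = P_init
    for _ in range(max_iters):
        Z = estimate_Z(seqs, P, W)               # E-step (sketched below)
        P_new = estimate_P(seqs, Z, W)           # M-step (sketched below)
        change = max(abs(P_new[c][k] - P[c][k]) for c in P for k in range(W))
        P = P_new
        if change < eps:
            break
    return P, Z
```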

25 EM Algorithm
Maximize the likelihood of a motif in the training sequences:
s_i ∈ G = the i-th sequence
I_{ij} = 1 if the motif starts at position j in sequence i, and 0 otherwise
l_k = the character at position j+k−1 in sequence s_i
Pr(s_i | I_{ij} = 1, ρ) = ∏_{k=1}^{W} ρ_{l_k, k}

26 Example
s_i = AGGCTGTAGACAC; the candidate occurrence starting at position 5 is TGT.
Pr(s_i | I_{i5} = 1, ρ) = ρ_{T,1} × ρ_{G,2} × ρ_{T,3} = 0.2 × 0.1 × 0.1 = 2 × 10⁻³
(using the P_ρ matrix shown on the slide)
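A minimal sketch of this per-position probability, assuming P is stored as a dict mapping each letter to a list of per-column probabilities (a hypothetical layout chosen for these sketches, not MEME's actual data structure):

```python
def motif_prob_at(seq, j, P, W):
    """Pr(s_i | I_ij = 1, rho) restricted to the motif columns:
    product of P[letter][column] over the W positions starting at j
    (0-based j here, unlike the 1-based slides)."""
    prob = 1.0
    for k in range(W):
        prob *= P[seq[j + k]][k]
    return prob

# With the slide's values, motif_prob_at("AGGCTGTAGACAC", 4, P, 3) multiplies
# rho_{T,1} * rho_{G,2} * rho_{T,3}  (slide position 5 is index 4 here).
```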

27 Estimating Z
z_{ij} = Pr(I_{ij} = 1 | ρ, s_i) estimates the motif starting position in s_i ∈ G.
z_{ij} = Pr(I_{ij} = 1 | ρ, s_i)
       = Pr(s_i, I_{ij} = 1 | ρ) / Pr(s_i | ρ)
       = Pr(s_i | I_{ij} = 1, ρ) Pr(I_{ij} = 1) / ∑_k Pr(s_i | I_{ik} = 1, ρ) Pr(I_{ik} = 1)
       = Pr(s_i | I_{ij} = 1, ρ) / ∑_k Pr(s_i | I_{ik} = 1, ρ)
This follows from an application of Bayes' rule and the assumption that the motif is equally likely to start in any position: ∀ j, k  Pr(I_{ij} = 1) = Pr(I_{ik} = 1).

28 Example
s_i = AGGCTGTAGACAC; using the P_ρ matrix shown on the slide:
z'_{i1} = 0.1 × 0.1 × 0.6 = 6 × 10⁻³
z'_{i2} = 0.3 × 0.1 × 0.1 = 3 × 10⁻³
z'_{i3} = 0.3 × 0.2 × 0.1 = 6 × 10⁻³
then NORMALIZE so that the z_{ij} sum to 1 over j.
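A minimal E-step along these lines, reusing the hypothetical motif_prob_at helper from above (function and variable names are illustrative, not MEME's own):

```python
def estimate_Z(seqs, P, W):
    """E-step: z_ij proportional to Pr(s_i | I_ij = 1, rho), normalized per sequence."""
    Z = []
    for seq in seqs:
        scores = [motif_prob_at(seq, j, P, W) for j in range(len(seq) - W + 1)]
        total = sum(scores)
        Z.append([s / total for s in scores])
    return Z
```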

29 Estimating P
Given Z, estimate the probability that the character c occurs at the k-th position of a motif:
n_{ck} = ∑_{s_i ∈ G} ∑_{j : s_{i,j+k−1} = c} z_{ij}
       = expected number of occurrences of the character c at the k-th position of a motif ρ (assuming that the motif "start position" is known).
p_{ck} = (n_{ck} + 1) / ∑_d (n_{dk} + 1)

30 Example
s1: A C A G C A
s2: A G G C A G
s3: T C A G T C
A occurs as the first character of a candidate motif at the positions weighted by z_{1,1}, z_{1,3}, z_{2,1}, z_{3,3}, so
p_{A,1} = (z_{11} + z_{13} + z_{21} + z_{33} + 1) / (z_{11} + z_{12} + ⋯ + z_{33} + z_{34} + 4)
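A minimal M-step sketch with the +1 pseudocounts from the previous slide (again, the names and the dict-of-lists layout are illustrative assumptions):

```python
def estimate_P(seqs, Z, W, alphabet="ACGT"):
    """M-step: re-estimate p_ck from the expected counts n_ck, with +1 pseudocounts."""
    n = {c: [0.0] * W for c in alphabet}         # n[c][k] = expected count n_ck
    for seq, z_row in zip(seqs, Z):
        for j, z in enumerate(z_row):            # j = candidate start position
            for k in range(W):
                n[seq[j + k]][k] += z            # character at position j+k gets weight z_ij
    P = {c: [0.0] * W for c in alphabet}
    for k in range(W):
        denom = sum(n[d][k] + 1 for d in alphabet)
        for c in alphabet:
            P[c][k] = (n[c][k] + 1) / denom
    return P
```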

31 MEME
Uses the basic EM approach.
Tries many starting points.
Allows multiple occurrences of a motif per sequence.
Allows multiple motifs to be learned simultaneously.

32 MEME
Initial set of possible motifs: take every distinct subsequence of length W in the training set and derive an initial matrix P from it:
p_{ck} = α if c occurs in position k in the subsequence, (1 − α) / (|Σ| − 1) otherwise
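A minimal sketch of this initialization rule, using the same hypothetical dict-of-lists layout as the earlier sketches:

```python
def init_P_from_subsequence(sub, alphabet="ACGT", alpha=0.5):
    """Initial P: column k puts mass alpha on the letter sub[k] and spreads
    the remaining (1 - alpha) evenly over the other |Sigma| - 1 letters."""
    W = len(sub)
    off = (1 - alpha) / (len(alphabet) - 1)
    return {c: [alpha if sub[k] == c else off for k in range(W)] for c in alphabet}

# e.g. init_P_from_subsequence("TAT") puts 0.5 on T in columns 1 and 3,
# 0.5 on A in column 2, and 1/6 everywhere else (the slide's next example).
```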

33 Example
W = 3, ρ = TAT, α = 0.5
By the rule above, P_ρ has 0.5 for T in columns 1 and 3, 0.5 for A in column 2, and (1 − 0.5)/3 = 1/6 for every other letter.
Run EM to convergence; choose the motif model with the highest likelihood.

