1
Computational Biology Lecture #8: Hidden Markov Models, Motifs & Patterns
Bud Mishra, Professor of Computer Science and Mathematics
11/12/2001
1/18/2019 ©Bud Mishra, 2001
2
Hidden Markov Models (HMM)
Defined by an alphabet $\Sigma$, a set of (hidden) states $Q$, a matrix of state transition probabilities $A$, and a matrix of emission probabilities $E$.
3
States
$\Sigma$ = an alphabet of symbols
$Q$ = a set of states that emit symbols from the alphabet $\Sigma$
$A = (a_{kl})$ = $|Q| \times |Q|$ matrix of state transition probabilities
$E = (e_k(b))$ = $|Q| \times |\Sigma|$ matrix of emission probabilities
4
A Simple Example
Fair Bet Casino:
$\Sigma = \{H, T\}$: a two-sided coin
$Q = \{F, B\}$: the casino may use a fair coin, $\Pr(H) = \Pr(T) = 1/2$, or a biased coin, $\Pr(H) = 3/4$, $\Pr(T) = 1/4$.
5
Fair Bet Casino Problem
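[State diagram: states F (fair coin, Pr[H] = ½, Pr[T] = ½) and B (biased coin, Pr[H] = ¾, Pr[T] = ¼); each state loops to itself with probability 0.9 and moves to the other state with probability 0.1.]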
6
Transition and Emission
$a_{FF} = a_{BB} = 0.9$
$a_{FB} = a_{BF} = 0.1$
The casino switches the coin with probability 0.1 and keeps the current coin with probability 0.9.
$e_F(H) = 1/2$, $e_F(T) = 1/2$
$e_B(H) = 3/4$, $e_B(T) = 1/4$
Given a sequence of coin tosses, find out when the dealer used a fair coin and when the dealer used a biased coin.
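The casino model can be written down directly as lookup tables. The sketch below encodes the transition and emission values from this slide and draws a toss sequence from them. The names `STATES`, `TRANSITION`, `EMISSION`, and `sample_tosses` are illustrative, and starting in the fair state is an assumption, since the slides do not give a start distribution.

```python
import random

# Fair Bet Casino parameters as stated on the slide.
STATES = ("F", "B")
TRANSITION = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMISSION = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}


def sample_tosses(n, start_state="F", seed=0):
    """Draw n tosses plus the hidden coin used for each one.

    start_state="F" is an assumption; the slides give no start distribution.
    """
    rng = random.Random(seed)
    state = start_state
    tosses, path = [], []
    for _ in range(n):
        path.append(state)
        # Emit a symbol from the current coin.
        symbols = list(EMISSION[state])
        weights = [EMISSION[state][s] for s in symbols]
        tosses.append(rng.choices(symbols, weights=weights)[0])
        # Decide which coin is used for the next toss.
        targets = list(TRANSITION[state])
        weights = [TRANSITION[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
    return "".join(tosses), "".join(path)
```

Calling `sample_tosses(20)` returns the visible tosses together with the hidden path, which is useful for checking a decoder against a known answer.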
7
HMM in Genomics
CG-Islands: subsequences around genes where CG appears relatively frequently.
C within the CG dinucleotide is typically methylated.
Methyl-C has a tendency to mutate into a T.
Methylation is suppressed in CG islands.
8
A Path in the HMM
$\pi = \pi_1 \pi_2 \cdots \pi_n$ = a sequence of states $\in Q^*$ in the hidden Markov model $M$
$x \in \Sigma^*$ = sequence generated by the path $\pi$, determined by the model $M$
$P(x \mid \pi) = P(\pi_1) \left[ \prod_{i=1}^{n} P(x_i \mid \pi_i)\, P(\pi_{i+1} \mid \pi_i) \right]$
9
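A Path in the HMM
$P(x \mid \pi) = \left[ \prod_{i=1}^{n} P(x_i \mid \pi_i)\, P(\pi_{i+1} \mid \pi_i) \right] P(\pi_1)$
$P(x_i \mid \pi_i) = e_{\pi_i}(x_i)$
$P(\pi_{i+1} \mid \pi_i) = a_{\pi_i, \pi_{i+1}}$
$\pi_0$ = initial state "begin"
$\pi_{n+1}$ = final state "end"
$P(x \mid \pi) = a_{\pi_0, \pi_1}\, e_{\pi_1}(x_1)\, a_{\pi_1, \pi_2}\, e_{\pi_2}(x_2) \cdots e_{\pi_n}(x_n)\, a_{\pi_n, \pi_{n+1}} = a_{\pi_0, \pi_1} \prod_{i=1}^{n} e_{\pi_i}(x_i)\, a_{\pi_i, \pi_{i+1}}$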
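The product formula above can be evaluated directly for a given toss sequence and a given path. The following is a minimal sketch using the Fair Bet Casino numbers. The begin probabilities `BEGIN` (0.5 for each coin) are an assumption, since the slides do not state $a_{\text{begin},k}$, and the transition into "end" is taken as 1 unless an `end` table is supplied. The function name `path_probability` is illustrative.

```python
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}
BEGIN = {"F": 0.5, "B": 0.5}  # assumed a_{begin,k}; not given on the slides


def path_probability(x, path, begin, trans, emit, end=None):
    """a_{begin,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i,pi_{i+1}}.

    If `end` is None the final transition a_{pi_n,end} is treated as 1.
    """
    if len(x) != len(path) or not x:
        raise ValueError("x and path must be non-empty and of equal length")
    p = begin[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        p *= emit[state][symbol]
        if i + 1 < len(path):
            p *= trans[state][path[i + 1]]
    if end is not None:
        p *= end[path[-1]]
    return p
```

For example, `path_probability("HHT", "FFB", BEGIN, TRANS, EMIT)` multiplies 0.5 (begin), 0.5, 0.9, 0.5, 0.1 and 0.25 in that order.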
10
Decoding Problem
For a given sequence $x$ and a given path $\pi$, the (Markovian) model defines the probability $P(x \mid \pi)$.
The dealer knows $\pi$ and $x$.
The player knows $x$ but not $\pi$: "The path of $x$ is hidden."
Decoding Problem: find an optimal path $\pi^*$ for $x$ such that $P(x \mid \pi)$ is maximized.
$\pi^* = \arg\max_{\pi} P(x \mid \pi)$
11
Dynamic Programming Approach
Principle of Optimality: the optimal path for the $(i+1)$-prefix of $x$, $x_1 \cdots x_{i+1}$, uses a path for the $i$-prefix of $x$ that is optimal among the paths ending in an (unknown) state $\pi_i = k \in Q$.
12
Dynamic Programming Approach
$s_k(i)$ = the probability of the most probable path for the $i$-prefix ending in state $k$, for all $k \in Q$ and $1 \le i \le n$.
$s_l(i+1) = e_l(x_{i+1}) \cdot \max_{k \in Q} [\, s_k(i) \cdot a_{kl} \,]$
13
Dynamic Programming
$i = 0$: $s_{\text{begin}}(0) = 1$, $s_k(0) = 0$ for all $k \ne \text{begin}$
$0 \le i < n$: $s_l(i+1) = e_l(x_{i+1}) \cdot \max_{k \in Q} [\, s_k(i) \cdot a_{kl} \,]$
$i = n+1$: $P(x \mid \pi^*) = \max_{k \in Q} s_k(n)\, a_{k,\text{end}}$
14
Viterbi Algorithm
Dynamic programming with the log-score function $S_l(i) = \log s_l(i)$
Space complexity = $O(n\,|Q|)$
Time complexity = $O(n\,|Q|^2)$, since each of the $n\,|Q|$ table entries takes a maximum over $|Q|$ predecessors
$S_l(i+1) = \log e_l(x_{i+1}) + \max_{k \in Q} [\, S_k(i) + \log a_{kl} \,]$
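The log-score recurrence above translates into a short table-filling routine with back-pointers. This is a sketch rather than a reference implementation. It uses the Fair Bet Casino parameters, assumes uniform begin probabilities (not stated on the slides), omits the explicit "end" state, and expects a non-empty input. All names (`viterbi`, `LOG_BEGIN`, and so on) are illustrative.

```python
import math

STATES = ("F", "B")
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

# Log-space copies of the tables; the uniform begin distribution is an assumption.
LOG_BEGIN = {k: math.log(0.5) for k in STATES}
LOG_TRANS = {k: {l: math.log(p) for l, p in row.items()} for k, row in TRANS.items()}
LOG_EMIT = {k: {b: math.log(p) for b, p in row.items()} for k, row in EMIT.items()}


def viterbi(x, states, log_begin, log_trans, log_emit):
    """Return (most probable state path, its log score) for a non-empty x."""
    # S_k(1) = log a_{begin,k} + log e_k(x_1)
    score = [{k: log_begin[k] + log_emit[k][x[0]] for k in states}]
    back = []
    for symbol in x[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for l in states:
            # S_l(i+1) = log e_l(x_{i+1}) + max_k [S_k(i) + log a_kl]
            best = max(states, key=lambda k: prev[k] + log_trans[k][l])
            ptr[l] = best
            row[l] = log_emit[l][symbol] + prev[best] + log_trans[best][l]
        score.append(row)
        back.append(ptr)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda k: score[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), score[-1][last]
```

The two nested loops over states inside the loop over positions are where the $|Q|^2$ factor in the running time comes from.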
15
Estimating the ith State
$P(\pi_i = k \mid x)$ = given a sequence $x \in \Sigma^*$, the probability that the HMM was in state $k$ at instant $i$.
16
Forward Estimate:
$f_k(i) = P(x_1 \cdots x_i, \pi_i = k)$ = probability of emitting the prefix $x_1 \cdots x_i$ and reaching the state $\pi_i = k$
$f_k(i) = e_k(x_i) \cdot \sum_{l \in Q} f_l(i-1)\, a_{lk}$
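The forward recurrence has the same shape as the Viterbi recurrence with the maximum replaced by a sum. A minimal sketch on the casino model follows. The uniform `BEGIN` distribution is an assumption, the "end" state is omitted, and no rescaling is done, so very long sequences would underflow.

```python
STATES = ("F", "B")
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}
BEGIN = {"F": 0.5, "B": 0.5}  # assumed; the slides give no begin probabilities


def forward(x, states, begin, trans, emit):
    """Return (table f with f[i][k] for 0-based i, total probability P(x))."""
    # f_k(1) = a_{begin,k} * e_k(x_1)
    f = [{k: begin[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        # f_k(i) = e_k(x_i) * sum_l f_l(i-1) * a_lk
        f.append({
            k: emit[k][symbol] * sum(prev[l] * trans[l][k] for l in states)
            for k in states
        })
    return f, sum(f[-1].values())
```

Summing the last row gives $P(x)$ without enumerating the $|Q|^n$ individual paths.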
17
Backward Estimate:
$b_k(i) = P(x_{i+1} \cdots x_n \mid \pi_i = k)$ = probability of emitting the suffix $x_{i+1} \cdots x_n$ given that the model is in state $\pi_i = k$.
$b_k(i) = \sum_{l \in Q} a_{kl} \cdot e_l(x_{i+1}) \cdot b_l(i+1)$
18
Applying Bayes' Rule
$P(\pi_i = k \mid x) = \frac{1}{P(x)} \times P(x_1 \cdots x_i, \pi_i = k) \times P(x_{i+1} \cdots x_n \mid \pi_i = k) = \frac{f_k(i)\, b_k(i)}{P(x)}$
$P(x) = \sum_{\pi} P(x \mid \pi)$
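Combining the forward and backward tables gives the posterior state probabilities at every position. The sketch below is self-contained and again uses the casino numbers. It assumes uniform begin probabilities, initializes the backward table with $b_k(n) = 1$ (that is, no explicit "end" state), and takes the emission inside the backward sum from the successor state. The name `posterior` is illustrative.

```python
STATES = ("F", "B")
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}
BEGIN = {"F": 0.5, "B": 0.5}  # assumed; the slides give no begin probabilities


def posterior(x, states, begin, trans, emit):
    """Return a list whose i-th entry maps state k to P(pi_i = k | x)."""
    n = len(x)
    f = [dict() for _ in range(n)]
    b = [dict() for _ in range(n)]
    for k in states:
        f[0][k] = begin[k] * emit[k][x[0]]
        b[n - 1][k] = 1.0  # assumed boundary condition (no end state)
    for i in range(1, n):
        for k in states:
            f[i][k] = emit[k][x[i]] * sum(f[i - 1][l] * trans[l][k] for l in states)
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
    px = sum(f[n - 1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
```

Each returned dictionary sums to 1, which is a quick sanity check that the forward and backward tables are mutually consistent.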
19
Sequence Motifs
Sequence patterns of biological significance.
Examples:
DNA: protein binding sites (e.g. promoters, regulatory sequences)
Protein: sequences corresponding to conserved pieces of structure (e.g. local features, at various scales: blocks, domains & families)
20
MEME Algorithm
Uses the EM (Expectation Maximization) algorithm to find multiple motifs in a set of sequences.
Description of a motif:
$W$ = (fixed) width of a motif
$P = (p_{lc})_{l \in \Sigma,\, c \in 1..W}$ = $|\Sigma| \times W$ matrix of probabilities that letter $l$ occurs at position $c$
21
Example
DNA motif: $\Sigma = \{A, T, C, G\}$
$\rho$ = motif $\Rightarrow$ $W_\rho = 3$, $P_\rho$ = $4 \times 3$ stochastic matrix
$P_\rho$ = [matrix not preserved in this transcript]
22
Computational Problem
Given: a set of sequences $G$ and a width parameter $W$.
Find: motifs of width $W$ common to the sequences in $G$, and present their probabilistic descriptions.
Note that the motif start sites in each sequence are unknown (hidden).
23
Basic EM Approach
$G$ = training sequences
$|G|$ = total number of sequences = $m$
Minimum length of a sequence in $G$ = $l$
$Z$ = $m \times l$ matrix of probabilities
$z_{ij}$ = probability that the motif starts at position $j$ in sequence $i$
24
EM Algorithm
Set initial values for P
do
    Re-estimate Z from P
    Re-estimate P from Z
until change in P < ε
return P
25
EM Algorithm
Maximize the likelihood of a motif in the training sequences:
$S_i \in G$ = the $i$th sequence
$I_{ij} = 1$ if the motif starts at position $j$ in sequence $i$, and $0$ otherwise
$l_k$ = the character at position $j+k-1$ in sequence $S_i$
$\Pr(S_i \mid I_{ij} = 1, \rho) = \prod_{k=1}^{W} \rho_{l_k, k}$
26
Example
$S_i$ = AGGCTGTAGACAC
$\Pr(S_i = \mathrm{TGT} \mid I_{i5} = 1, \rho) = \rho_{T,1} \times \rho_{G,2} \times \rho_{T,3} = 0.2 \times 0.1 \times 0.1 = 2 \times 10^{-3}$
$P_\rho$ = [matrix not preserved in this transcript]
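The window likelihood in this example is a product of one matrix entry per motif position. A small sketch follows. The matrix `RHO` is partly hypothetical: the transcript does not preserve the original matrix, so entries that appear in the worked examples are copied, C at position 1 and A at position 3 are inferred from the columns summing to 1, and A and T at position 2 are plain assumptions. The function name `window_likelihood` is illustrative.

```python
# rho[letter] = [probability at motif position 1, 2, 3]
# Values at A/2 and T/2 are assumed; C/1 and A/3 are inferred from column sums.
RHO = {
    "A": [0.1, 0.5, 0.2],
    "C": [0.4, 0.2, 0.1],
    "G": [0.3, 0.1, 0.6],
    "T": [0.2, 0.2, 0.1],
}


def window_likelihood(seq, start, rho):
    """Product of rho[l_k][k] over the motif window at 1-based position `start`."""
    width = len(next(iter(rho.values())))
    window = seq[start - 1:start - 1 + width]
    if start < 1 or len(window) < width:
        raise ValueError("motif window does not fit inside the sequence")
    p = 1.0
    for k, letter in enumerate(window):
        p *= rho[letter][k]
    return p
```

With these values, the window starting at position 5 of AGGCTGTAGACAC is TGT and only uses entries that appear in the example itself.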
27
Estimating Z
$z_{ij} = \Pr(I_{ij} = 1 \mid \rho, S_i)$ estimates the starting position in $S_i \in G$.
$z_{ij} = \Pr(I_{ij} = 1 \mid \rho, S_i)$
$= \Pr(S_i, I_{ij} = 1 \mid \rho) \,/\, \Pr(S_i \mid \rho)$
$= \Pr(S_i \mid I_{ij} = 1, \rho) \Pr(I_{ij} = 1) \,/\, \sum_k \Pr(S_i \mid I_{ik} = 1, \rho) \Pr(I_{ik} = 1)$
$= \Pr(S_i \mid I_{ij} = 1, \rho) \,/\, \sum_k \Pr(S_i \mid I_{ik} = 1, \rho)$
This follows from an application of Bayes' rule and the assumption that "it is equally likely that the motif will start in any position":
$\forall j, k:\ \Pr(I_{ij} = 1) = \Pr(I_{ik} = 1)$
28
Example
$S_i$ = AGGCTGTAGACAC
Unnormalized estimates computed from $P_\rho$:
$z'_{i1} = 0.1 \times 0.1 \times 0.6 = 6 \times 10^{-3}$
$z'_{i2} = 0.3 \times 0.1 \times 0.1 = 3 \times 10^{-3}$
$z'_{i3} = 0.3 \times 0.2 \times 0.1 = 6 \times 10^{-3}$
…
Then normalize so that $\sum_j z_{ij} = 1$.
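The re-estimation of one row of Z can be sketched as "score every window, then normalize". The matrix `RHO` below is partly hypothetical, as the original is not preserved in the transcript: the entries used by the first three windows come from this example, while A and T at position 2 are assumptions and C at position 1 and A at position 3 are inferred from column sums. Later windows therefore depend on assumed values. The function name `estimate_z_row` is illustrative.

```python
# rho[letter] = [probability at motif position 1, 2, 3]; partly assumed (see lead-in).
RHO = {
    "A": [0.1, 0.5, 0.2],
    "C": [0.4, 0.2, 0.1],
    "G": [0.3, 0.1, 0.6],
    "T": [0.2, 0.2, 0.1],
}


def estimate_z_row(seq, rho):
    """Return (unnormalized z', normalized z) over all valid start positions."""
    width = len(next(iter(rho.values())))
    raw = []
    for j in range(len(seq) - width + 1):  # j is the 0-based start position
        p = 1.0
        for k in range(width):
            p *= rho[seq[j + k]][k]
        raw.append(p)
    total = sum(raw)
    return raw, [p / total for p in raw]
```

Under the equal-prior assumption stated earlier, dividing by the row total is all that Bayes' rule requires here.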
29
Estimating $P_\rho$
Given $Z$, estimate the probability that the character $c$ occurs at the $k$th position of a motif.
$n_{ck} = \sum_{s_i \in G} \ \sum_{\{j \,\mid\, s_{i, j+k-1} = c\}} z_{ij}$
= expected number of occurrences of the character $c$ at the $k$th position of a motif $\rho$ (assuming that the motif "start position" is known).
$p_{ck} = (n_{ck} + 1) \,/\, \sum_{d} (n_{dk} + 1)$
30
Example
$s_1$: A C A G C A
$s_2$: A G G C A G
$s_3$: T C A G T C
Start positions at which the character is A: $z_{1,1}$, $z_{1,3}$, $z_{2,1}$, $z_{3,3}$
$p_{A,1} = (z_{11} + z_{13} + z_{21} + z_{33} + 1) \,/\, (z_{11} + z_{12} + \cdots + z_{33} + z_{34} + 4)$
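The pseudocount update can be sketched as accumulating expected counts and then normalizing each column. In the snippet below the three sequences come from this example, while the uniform Z (0.25 for each of the four start positions) is an assumed input chosen only to make the arithmetic easy to follow. The function name `estimate_matrix` is illustrative.

```python
def estimate_matrix(seqs, z, width, alphabet="ACGT"):
    """Re-estimate p[c][k] = (n_ck + 1) / sum_d (n_dk + 1) from Z.

    z[i][j] is the probability that the motif starts at 0-based position j
    of seqs[i]; each row must have len(seqs[i]) - width + 1 entries.
    """
    counts = {c: [0.0] * width for c in alphabet}
    for seq, row in zip(seqs, z):
        if len(row) != len(seq) - width + 1:
            raise ValueError("row of Z does not match the sequence length")
        for j, z_ij in enumerate(row):
            for k in range(width):
                # The character at motif position k of the window starting at j.
                counts[seq[j + k]][k] += z_ij
    p = {c: [0.0] * width for c in alphabet}
    for k in range(width):
        total = sum(counts[d][k] + 1.0 for d in alphabet)
        for c in alphabet:
            p[c][k] = (counts[c][k] + 1.0) / total
    return p


SEQS = ["ACAGCA", "AGGCAG", "TCAGTC"]
UNIFORM_Z = [[0.25] * 4 for _ in SEQS]  # assumed Z, for illustration only
P_NEW = estimate_matrix(SEQS, UNIFORM_Z, width=3)
```

With the uniform Z, the numerator for A at position 1 is 4 × 0.25 + 1 and the denominator is 12 × 0.25 + 4, matching the structure of the formula above.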
31
MEME
Uses the basic EM approach.
Tries many starting points.
Allows multiple occurrences of a motif per sequence.
Allows multiple motifs to be learned simultaneously.
32
MEME
Initial set of possible motifs:
Take every distinct subsequence of length $W$ in the training set.
Derive an initial matrix $P$:
$p_{ck} = \alpha$ if $c$ occurs in position $k$ in the subsequence
$p_{ck} = (1 - \alpha) \,/\, (|\Sigma| - 1)$ otherwise
33
Example
$W = 3$, $\rho$ = T A T, $\alpha = 0.5$
$P_\rho$ = [matrix not preserved in this transcript]
Choose the motif model with the highest likelihood.
Run EM to convergence.
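The initial matrix for a seed subsequence follows mechanically from the rule for $p_{ck}$ given earlier. The sketch below applies it to the seed TAT with $\alpha = 0.5$. The function name `initial_matrix` and the alphabet ordering are illustrative choices, and the output is computed from the stated rule rather than copied from the original slide, whose matrix is not preserved.

```python
def initial_matrix(seed, alpha, alphabet="ACGT"):
    """p[c][k] = alpha if seed[k] == c, else (1 - alpha) / (|alphabet| - 1)."""
    other = (1.0 - alpha) / (len(alphabet) - 1)
    return {c: [alpha if c == s else other for s in seed] for c in alphabet}


P_INIT = initial_matrix("TAT", 0.5)
for letter in "ACGT":
    # One row per letter, one column per motif position.
    print(letter, ["%.3f" % v for v in P_INIT[letter]])
```

Every column sums to 1 by construction, since one letter receives $\alpha$ and the remaining $|\Sigma| - 1$ letters share $1 - \alpha$ equally.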