
1 Variants of HMMs

2 Higher-order HMMs
How do we model "memory" larger than one time point?
  P(π_{i+1} = l | π_i = k) = a_kl                    (first order)
  P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl      (second order)
  ...
A second-order HMM with K states is equivalent to a first-order HMM with K^2 states.
Example with two states H and T: the second-order transitions a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T) become the first-order transitions a_HHT, a_HTT, a_HTH, a_THH, a_THT, a_TTH between the pair states HH, HT, TH, TT.
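As a rough illustration of the K → K^2 expansion, here is a minimal Python sketch that turns a second-order transition table over {H, T} into an equivalent first-order table over the pair states; the numbers in `a2` are made up purely for illustration and are not from the slides.

```python
# Expand a 2nd-order HMM over {H, T} into an equivalent 1st-order HMM
# whose states are pairs (prev, curr).
import itertools

states = ["H", "T"]

# a2[j][k][l] = P(pi_{i+1} = l | pi_i = k, pi_{i-1} = j)  (illustrative values)
a2 = {
    "H": {"H": {"H": 0.7, "T": 0.3}, "T": {"H": 0.4, "T": 0.6}},
    "T": {"H": {"H": 0.5, "T": 0.5}, "T": {"H": 0.2, "T": 0.8}},
}

# First-order transition a1[(j, k)][(k, l)] = a2[j][k][l]; any pair transition
# that does not share the middle state is impossible (probability 0).
a1 = {}
for j, k in itertools.product(states, repeat=2):
    a1[(j, k)] = {}
    for k2, l in itertools.product(states, repeat=2):
        a1[(j, k)][(k2, l)] = a2[j][k][l] if k2 == k else 0.0

# Each row still sums to 1, so this is a valid first-order HMM on K^2 = 4 states.
for pair, row in a1.items():
    assert abs(sum(row.values()) - 1.0) < 1e-12
```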

3 Modeling the Duration of States
Consider two states X and Y, where X has self-transition probability p and leaves with probability 1 - p, and Y has self-transition probability q and leaves with probability 1 - q.
The length of a region generated by X is geometrically distributed, P(l_X = d) = p^{d-1} (1 - p), with mean E[l_X] = 1/(1 - p).
This is a significant disadvantage of HMMs.
Several solutions exist for modeling different length distributions.
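A quick numerical check of the geometric-duration claim, assuming a single state with self-transition probability p; the value p = 0.9 is arbitrary.

```python
# Duration distribution implied by a self-loop probability p:
# P(d) = p**(d - 1) * (1 - p), a geometric distribution with mean 1 / (1 - p).
p = 0.9
durations = range(1, 11)
probs = [p ** (d - 1) * (1 - p) for d in durations]

# Truncated mean over a long horizon; should be close to 1 / (1 - p) = 10.
mean = sum(d * p ** (d - 1) * (1 - p) for d in range(1, 10_000))

print([round(q, 4) for q in probs])
print(mean)
```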

4 Sol'n 1: Chain several states
Chain several copies of X in a row before the copy that carries the self-loop p (which then moves to Y with probability 1 - p; Y keeps its self-loop q).
Disadvantage: still very inflexible; l_X = C + geometric with mean 1/(1 - p), where C is the number of extra chained copies.

5 Sol'n 2: Negative binomial distribution
Chain n copies of X, each with self-loop probability p and probability 1 - p of moving on to the next copy (the last copy moves on to Y).
Duration in X: m turns, where
  - during the first m - 1 turns, exactly n - 1 "move on" arrows are followed
  - during the m-th turn, a "move on" arrow is followed
P(l_X = m) = C(m-1, n-1) (1 - p)^{n-1+1} p^{(m-1)-(n-1)} = C(m-1, n-1) (1 - p)^n p^{m-n},
where C(m-1, n-1) is the binomial coefficient "m-1 choose n-1".
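A small sketch of this duration distribution using the slide's parameterization (n chained copies of X, each with self-loop probability p); the values n = 3, p = 0.9 are illustrative only.

```python
# Negative binomial duration of a chain of n copies of X with self-loop p:
# P(l_X = m) = C(m-1, n-1) * (1 - p)**n * p**(m - n)   for m >= n.
from math import comb

def duration_pmf(m, n, p):
    if m < n:
        return 0.0
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9

# The mean of this distribution is n / (1 - p); check it numerically.
mean = sum(m * duration_pmf(m, n, p) for m in range(n, 5000))
print(mean)   # ~ 30 = 3 / 0.1
```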

6 Example: genes in prokaryotes
EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A) that uses a negative binomial duration model with n = 3.

7 Solution 3: Duration modeling
Upon entering a state:
  1. Choose a duration d according to a probability distribution.
  2. Generate d letters according to the emission probabilities.
  3. Take a transition to the next state according to the transition probabilities.
Disadvantage: increase in complexity.
  Time: O(D^2) -- why?
  Space: O(D)
where D = maximum duration of a state.
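A minimal generative sketch of the three steps above; the duration, emission, and transition tables are hypothetical placeholders, not parameters from any real model.

```python
# One visit to an explicit-duration ("hidden semi-Markov") state:
# choose a duration, emit that many letters, then transition.
import random

duration_dist = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}   # step 1
emission = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}          # step 2
transition = {"X": 0.0, "Y": 1.0}                            # step 3

def visit_state():
    d = random.choices(list(duration_dist),
                       weights=list(duration_dist.values()))[0]
    letters = random.choices(list(emission),
                             weights=list(emission.values()), k=d)
    nxt = random.choices(list(transition),
                         weights=list(transition.values()))[0]
    return "".join(letters), nxt

print(visit_state())
```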

8 Connection Between Alignment and HMMs

9 A state model for alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
States: M (+1, +1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J.

10 Let's score the transitions
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
States: M (+1, +1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J.
Scores: s(x_i, y_j) for an aligned pair in M, -d for opening a gap (entering I or J), -e for extending a gap (staying in I or J).

11 How do we find the optimal alignment according to this model?
Dynamic programming:
  M(i, j): optimal alignment of x_1…x_i to y_1…y_j ending in M
  I(i, j): optimal alignment of x_1…x_i to y_1…y_j ending in I
  J(i, j): optimal alignment of x_1…x_i to y_1…y_j ending in J
The score is additive, therefore we can apply DP recurrence formulas.

12 Needleman-Wunsch with affine gaps -- state version
(Here d and e denote the gap-open and gap-extend score contributions, i.e. the -d and -e of the previous slide, so in practice they are negative.)
Initialization:
  M(0, 0) = 0;  M(i, 0) = M(0, j) = -∞, for i, j > 0
  I(i, 0) = d + i·e;  J(0, j) = d + j·e
Iteration:
  M(i, j) = s(x_i, y_j) + max { M(i-1, j-1), I(i-1, j-1), J(i-1, j-1) }
  I(i, j) = max { d + M(i-1, j), e + I(i-1, j) }
  J(i, j) = max { d + M(i, j-1), e + J(i, j-1) }
Termination:
  Optimal alignment score given by max { M(m, n), I(m, n), J(m, n) }
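A compact Python sketch of these recurrences, assuming d and e are given as negative gap-open/gap-extend scores and s is a simple match/mismatch score; traceback is omitted, so only the optimal score is returned, and the parameter values are arbitrary.

```python
# Three-state (M / I / J) affine-gap Needleman-Wunsch, score only.
NEG_INF = float("-inf")

def affine_nw_score(x, y, match=1.0, mismatch=-1.0, d=-3.0, e=-1.0):
    m, n = len(x), len(y)
    s = lambda a, b: match if a == b else mismatch

    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap in y (consumes x_i)
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap in x (consumes y_j)

    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = d + i * e
    for j in range(1, n + 1):
        J[0][j] = d + j * e

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = s(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            I[i][j] = max(d + M[i - 1][j], e + I[i - 1][j])
            J[i][j] = max(d + M[i][j - 1], e + J[i][j - 1])

    return max(M[m][n], I[m][n], J[m][n])

print(affine_nw_score("AGGCTATCAC", "TAGCTATCAC"))
```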

13 Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution.
Goals:
  Produce the most likely alignment.
  Assert the likelihood that the sequences are indeed related.

14 A Pair HMM for alignments
States and emissions:
  M emits an aligned pair with probability P(x_i, y_j)
  I emits x_i (gap in y) with probability P(x_i)
  J emits y_j (gap in x) with probability P(y_j)
Transitions:
  M → M: 1 - 2δ - τ;  M → I and M → J: δ
  I → I and J → J: ε;  I → M and J → M: 1 - ε - τ
  BEGIN behaves like M; every state → END: τ

15 A Pair HMM for unaligned sequences (Model R)
Two independent chains: BEGIN → I (emits letters with probability P(x_i), self-loop 1 - η, exit η) → END, and BEGIN → J (emits letters with probability P(y_j), self-loop 1 - η, exit η) → END.
P(x, y | R) = η (1 - η)^m P(x_1)⋯P(x_m) · η (1 - η)^n P(y_1)⋯P(y_n)
            = η^2 (1 - η)^{m+n} ∏_i P(x_i) ∏_j P(y_j)
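A small sketch that evaluates log P(x, y | R) directly from this formula; the background frequencies and the value of η below are illustrative assumptions.

```python
# Log-probability of two unaligned sequences under the random model R:
# P(x, y | R) = eta^2 * (1 - eta)^(m + n) * prod_i P(x_i) * prod_j P(y_j).
from math import log

def log_prob_random(x, y, p, eta):
    m, n = len(x), len(y)
    return (2 * log(eta)
            + (m + n) * log(1 - eta)
            + sum(log(p[c]) for c in x)
            + sum(log(p[c]) for c in y))

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed background
print(log_prob_random("AGGCT", "TAGCT", p, eta=0.01))
```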

16 To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
  (1 - 2δ - τ) P(x_i, y_j) when matched
  δ P(x_i) P(y_j) when gapped
  (1 - η)^2 P(x_i) P(y_j) in the random model
Focus on the comparison of P(x_i, y_j) vs. P(x_i) P(y_j).
(Diagrams of the alignment model and the random model repeated from slides 14-15.)

17 To compare ALIGNMENT vs. RANDOM hypothesis
Idea: we will divide the alignment score by the random score, and take logarithms.
Let
  s(x_i, y_j) = log [ P(x_i, y_j) / (P(x_i) P(y_j)) ] + log [ (1 - 2δ - τ) / (1 - η)^2 ]
  d = - log [ δ (1 - ε - τ) P(x_i) / ((1 - η)(1 - 2δ - τ) P(x_i)) ]
  e = - log [ ε P(x_i) / ((1 - η) P(x_i)) ]
Every letter b in the random model contributes (1 - η) P(b).
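A sketch that plugs numbers into these formulas (all logs base 2); the values of δ, ε, τ, η and the emission probabilities are assumed, not estimated. Note that the P(x_i) factors in d and e cancel, so they drop out.

```python
# Log-odds scores implied by the pair-HMM parameters on this slide.
from math import log2

delta, epsilon, tau, eta = 0.05, 0.3, 0.001, 0.001   # illustrative values

def s(p_xy, p_x, p_y):
    return log2(p_xy / (p_x * p_y)) + log2((1 - 2 * delta - tau) / (1 - eta) ** 2)

# Gap-open and gap-extend penalties (the P(x_i) terms have cancelled).
d = -log2(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
e = -log2(epsilon / (1 - eta))

print(s(0.125, 0.25, 0.25), d, e)
```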

18 The meaning of alignment scores
Because δ and ε are small, and η and τ are very small:
  s(x_i, y_j) = log [ P(x_i, y_j) / (P(x_i) P(y_j)) ] + log [ (1 - 2δ - τ) / (1 - η)^2 ]
              ≈ log [ P(x_i, y_j) / (P(x_i) P(y_j)) ] + log(1 - 2δ)
  d = - log [ δ (1 - ε - τ) / ((1 - η)(1 - 2δ - τ)) ] ≈ - log [ δ (1 - ε) / (1 - 2δ) ]
  e = - log [ ε / (1 - η) ] ≈ - log ε

19 The meaning of alignment scores
The Viterbi algorithm for Pair HMMs corresponds exactly to the Needleman-Wunsch algorithm with affine gaps.
However, now we need to score alignments with parameters that add up to probability distributions:
  δ ≈ 1/(mean arrival time of the next gap)
  1 - ε ≈ 1/(mean length of the next gap)
    (affine gaps decouple arrival time from gap length)
  τ ≈ 1/(mean length of aligned sequences)    (set to ~0)
  η ≈ 1/(mean length of unaligned sequences)  (set to ~0)

20 The meaning of alignment scores
Match/mismatch scores:
  s(a, b) ≈ log [ P(x_i, y_j) / (P(x_i) P(y_j)) ]
  (let's ignore log(1 - 2δ) for the moment -- assume no gaps)
Example: say DNA regions between human and mouse have an average conservation of 50%.
  Then P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8 (the 4 matches sum to 1/2),
  and P(A,C) = P(A,G) = … = P(T,G) = 1/24 (the 12 mismatches sum to 1/2).
  Say P(A) = P(C) = P(G) = P(T) = 1/4.
  Then s(match) = log [ (1/8) / (1/4 · 1/4) ] = log 2 = 1,
  and s(mismatch) = log [ (1/24) / (1/4 · 1/4) ] = log (16/24) = -0.585.
Cutoff similarity that scores 0: s·1 - (1 - s)·0.585 = 0, i.e. s ≈ 0.37.
According to this model, a 37.5%-conserved sequence with no gaps would score on average 0.375 · 1 - 0.625 · 0.585 ≈ 0.
Why? 37.5% is between the 50% conservation of the related model and the random 25% conservation model!
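A numerical check of this example (all logs base 2), using the probabilities stated on the slide.

```python
# The 50%-conservation example: match/mismatch log-odds and expected score.
from math import log2

p_a = 0.25                    # background P(A) = P(C) = P(G) = P(T)
p_match = 1 / 8               # each of the 4 matching pairs; together 1/2
p_mismatch = 1 / 24           # each of the 12 mismatching pairs; together 1/2

s_match = log2(p_match / (p_a * p_a))        # = 1.0
s_mismatch = log2(p_mismatch / (p_a * p_a))  # ~ -0.585

# Expected score per position for a gap-free pair with conservation c.
expected = lambda c: c * s_match + (1 - c) * s_mismatch

print(s_match, s_mismatch)
print(expected(0.375))   # ~ 0, close to the break-even conservation level
```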

21 Substitution matrices
A more meaningful way to assign match/mismatch scores.
For protein sequences, different substitutions have dramatically different frequencies!
PAM matrices:
  1. Start from a curated set of very similar protein sequences.
  2. Construct ancestral sequences (using parsimony).
  3. Calculate A_ab: frequency of letters a and b interchanging.
  4. Calculate B_ab = P(b | a) = A_ab / (Σ_c A_ac).
  5. Adjust matrix B so that Σ_{a,b} q_a q_b B_ab = 0.01 -- this is PAM(1).
  6. Let PAM(N) = [PAM(1)]^N -- commonly used: PAM(250).
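A toy sketch of step 6 only: PAM(N) as the N-th matrix power of PAM(1). The 3-letter "alphabet" and matrix below are invented to show the operation; a real PAM(1) is 20×20.

```python
# PAM(N) = [PAM(1)]^N, computed as a matrix power.
import numpy as np

pam1 = np.array([
    [0.990, 0.005, 0.005],
    [0.004, 0.990, 0.006],
    [0.006, 0.004, 0.990],
])  # rows are conditional distributions P(b | a)

pam250 = np.linalg.matrix_power(pam1, 250)

# Rows remain probability distributions after taking powers.
print(pam250.sum(axis=1))
# Log-odds scores would then be s(a, b) = log(pam250[a, b] / q[b]) for
# background frequencies q (not shown here).
```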

22 Substitution Matrices
BLOSUM matrices:
  1. Start from the BLOCKS database (curated, gap-free alignments).
  2. Cluster sequences according to > X% identity.
  3. Calculate A_ab: the number of aligned a-b pairs between distinct clusters, correcting by 1/(mn), where m and n are the two cluster sizes.
  4. Estimate P(a) = (Σ_b A_ab) / (Σ_{c≤d} A_cd);  P(a, b) = A_ab / (Σ_{c≤d} A_cd).
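A minimal sketch of the between-cluster pair counting in step 3 and the P(a, b) estimate in step 4; the two tiny "clusters" from a gap-free block are made up for illustration.

```python
# BLOSUM-style counting: aligned pairs between sequences in different
# clusters, each pair weighted by 1 / (m * n) for cluster sizes m and n.
from collections import Counter
from itertools import product

clusters = [["ACGT", "ACGA"], ["ACTT"]]          # toy gap-free block

counts = Counter()
for c1, c2 in product(range(len(clusters)), repeat=2):
    if c1 >= c2:
        continue                                  # each cluster pair once
    m, n = len(clusters[c1]), len(clusters[c2])
    for s1 in clusters[c1]:
        for s2 in clusters[c2]:
            for a, b in zip(s1, s2):
                counts[tuple(sorted((a, b)))] += 1 / (m * n)

total = sum(counts.values())
p_ab = {pair: c / total for pair, c in counts.items()}   # estimate of P(a, b)
print(p_ab)
```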

23 BLOSUM matrices
BLOSUM 50 and BLOSUM 62 (the two are scaled differently). [The score matrices themselves are shown as figures on the slide.]

