Hidden Markov Models
[Figure: HMM trellis with K hidden states at each position, emitting x_1, x_2, x_3, …]
CS262 Lecture 8, Win07, Batzoglou
Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences!
CS262 Lecture 8, Win07, Batzoglou
Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate A_ab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate P(a) = (Σ_b A_ab) / (Σ_{c≤d} A_cd);  P(a, b) = A_ab / (Σ_{c≤d} A_cd)
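A minimal sketch of step 4 in Python, assuming the cluster-corrected pair counts A_ab from steps 2–3 have already been collected into a dictionary; the function names and the toy counts below are illustrative, not part of the lecture.

```python
from collections import defaultdict
from math import log2

def blosum_probabilities(pair_counts):
    """Estimate P(a) and P(a, b) from cluster-corrected pair counts A_ab.

    pair_counts: dict mapping unordered pairs (a, b), a <= b, to counts A_ab
    (already corrected by 1/mn across clusters, as in steps 2-3).
    """
    total = sum(pair_counts.values())                  # sum over c <= d of A_cd
    p_pair = {ab: n / total for ab, n in pair_counts.items()}

    # P(a) = (sum_b A_ab) / (sum_{c<=d} A_cd), following the slide's formula.
    p_single = defaultdict(float)
    for (a, b), n in pair_counts.items():
        p_single[a] += n / total
        if b != a:
            p_single[b] += n / total
    return dict(p_single), p_pair

def log_odds(a, b, p_single, p_pair):
    """BLOSUM-style log-odds, log2[ P(a, b) / (P(a) P(b)) ], before scaling/rounding."""
    key = (a, b) if (a, b) in p_pair else (b, a)
    return log2(p_pair[key] / (p_single[a] * p_single[b]))

# Toy example with a 2-letter alphabet (made-up counts):
counts = {("A", "A"): 6.0, ("A", "B"): 2.0, ("B", "B"): 2.0}
p1, p2 = blosum_probabilities(counts)
print(round(log_odds("A", "A", p1, p2), 2))
```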
CS262 Lecture 8, Win07, Batzoglou
Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution
Goal:
Produce the most likely alignment
Assert the likelihood that the sequences are indeed related
CS262 Lecture 8, Win07, Batzoglou
A Pair HMM for alignments
[Diagram, Model M: states M, I, J; M emits P(x_i, y_j), I emits P(x_i), J emits P(y_j); M continues with probability 1 – 2δ, opens a gap with probability δ, and I/J extend with ε and return to M with 1 – ε]
This model generates two sequences simultaneously
Match/Mismatch state M: P(x, y) reflects substitution frequencies between pairs of amino acids
Insertion states I, J: P(x), P(y) reflect frequencies of each amino acid
δ: set so that 1/(2δ) is the avg. length before the next gap
ε: set so that 1/(1 – ε) is the avg. length of a gap
CS262 Lecture 8, Win07, Batzoglou
A Pair HMM for unaligned sequences
[Diagram, Model R: two independent insertion states; I emits P(x_i), J emits P(y_j)]
Two sequences are generated independently of one another
P(x, y | R) = P(x_1)…P(x_m) · P(y_1)…P(y_n) = Π_i P(x_i) · Π_j P(y_j)
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
M: (1 – 2δ) P(x_i, y_j) when matched; P(x_i) P(y_j) when gapped
R: P(x_i) P(y_j) in random model
Focus on comparison of P(x_i, y_j) vs. P(x_i) P(y_j)
[Diagrams: the pair HMM (M, I, J) and the random model (I, J) side by side]
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
M: (1 – 2δ) P(x_i, y_j) when matched; P(x_i) P(y_j) when gapped
R: P(x_i) P(y_j) in random model
Focus on comparison of P(x_i, y_j) vs. P(x_i) P(y_j)
[Diagrams: the pair HMM redrawn with rescaled transition probabilities so that the comparison with the random model is equivalent]
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
M: (1 – 2δ) P(x_i, y_j) when matched; P(x_i) P(y_j) when gapped
R: P(x_i) P(y_j) in random model
Focus on comparison of P(x_i, y_j) vs. P(x_i) P(y_j)
[Diagrams: equivalent formulation in which M emits the ratio P(x_i, y_j) / (P(x_i) P(y_j)) and the I, J states emit 1]
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Idea: We will divide the alignment score by the random score, and take logarithms
Let
s(x_i, y_j) = log [ P(x_i, y_j) / (P(x_i) P(y_j)) ] + log (1 – 2δ)    (defn: substitution score)
d = – log [ δ (1 – ε) P(x_i) / ((1 – 2δ) P(x_i)) ]    (defn: gap initiation penalty)
e = – log [ ε P(x_i) / P(x_i) ]    (defn: gap extension penalty)
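A small sketch of how these three definitions turn pair-HMM parameters into alignment scores; `p_pair`, `p_single`, `delta`, and `epsilon` are assumed inputs, and the P(x_i) factors in d and e cancel.

```python
from math import log

def pair_hmm_scores(p_pair, p_single, delta, epsilon):
    """Substitution scores and gap penalties from pair-HMM parameters.

    s(a, b) = log[ P(a, b) / (P(a) P(b)) ] + log(1 - 2*delta)
    d       = -log[ delta * (1 - epsilon) / (1 - 2*delta) ]   # gap initiation (P(x_i) cancels)
    e       = -log(epsilon)                                   # gap extension  (P(x_i) cancels)
    """
    s = {
        (a, b): log(p / (p_single[a] * p_single[b])) + log(1 - 2 * delta)
        for (a, b), p in p_pair.items()
    }
    d = -log(delta * (1 - epsilon) / (1 - 2 * delta))
    e = -log(epsilon)
    return s, d, e
```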
CS262 Lecture 8, Win07, Batzoglou
The meaning of alignment scores
The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps:
V_M(i, j) = max { V_M(i – 1, j – 1), V_I(i – 1, j – 1), V_J(i – 1, j – 1) } + s(x_i, y_j)
V_I(i, j) = max { V_M(i – 1, j) – d, V_I(i – 1, j) – e }
V_J(i, j) = max { V_M(i, j – 1) – d, V_J(i, j – 1) – e }
s(.,.) contains (1 – 2δ): ~ how often a pair of letters substitute one another
ε: ~ 1/mean length of next gap
δ (1 – ε) / (1 – 2δ): ~ 1/mean arrival time of next gap
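A compact sketch of the three-matrix recurrence above (score only, no traceback); `s` is a substitution-score function and `d`, `e` are the gap penalties from the previous slide. The boundary initialization is one reasonable choice, not necessarily the lecture's.

```python
NEG_INF = float("-inf")

def affine_global_align_score(x, y, s, d, e):
    """Best global alignment score of x and y with affine gap penalties.

    V_M/V_I/V_J mirror the three pair-HMM states (match, gap in y, gap in x).
    """
    m, n = len(x), len(y)
    VM = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VJ = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = 0.0
    for i in range(1, m + 1):                      # leading gap in y
        VI[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):                      # leading gap in x
        VJ[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            VM[i][j] = max(VM[i-1][j-1], VI[i-1][j-1], VJ[i-1][j-1]) + s(x[i-1], y[j-1])
            VI[i][j] = max(VM[i-1][j] - d, VI[i-1][j] - e)
            VJ[i][j] = max(VM[i][j-1] - d, VJ[i][j-1] - e)
    return max(VM[m][n], VI[m][n], VJ[m][n])
```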
CS262 Lecture 8, Win07, Batzoglou
The meaning of alignment scores
Match/mismatch scores:
s(a, b) = log [ P(x_i, y_j) / (P(x_i) P(y_j)) ]    (ignore log(1 – 2δ) for the moment)
Example: DNA regions between human and mouse genes have average conservation of 80%
1. What is the substitution score for a match?
P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8, so P(x, x) = 0.2
P(a) = P(c) = P(g) = P(t) = 0.25
s(x, x) = log [ 0.2 / 0.25² ] = 1.163
2. What is the substitution score for a mismatch?
P(a, c) + … + P(t, g) = 0.2, so P(x, y≠x) = 0.2/12 = 0.0167
s(x, y≠x) = log [ 0.0167 / 0.25² ] = –1.322
3. What ratio matches/(matches + mism.) gives score 0?
x (#match) – y (#mism) = 0:  1.163 (#match) – 1.322 (#mism) = 0, so #match = 1.137 (#mism), i.e., matches = 53.2%
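A quick check of the arithmetic above (natural logs, uniform base frequencies), using only the numbers on the slide.

```python
from math import log

p_match_total = 0.8                 # P(a,a) + P(c,c) + P(g,g) + P(t,t)
p_mismatch_total = 0.2
p_xx = p_match_total / 4            # each of the 4 identical pairs
p_xy = p_mismatch_total / 12        # each of the 12 ordered mismatch pairs
p_base = 0.25                       # uniform P(a) = P(c) = P(g) = P(t)

s_match = log(p_xx / p_base**2)     # ~ +1.163
s_mismatch = log(p_xy / p_base**2)  # ~ -1.322

# Fraction of matches at which the expected score is zero:
# s_match * m + s_mismatch * (1 - m) = 0
m = -s_mismatch / (s_match - s_mismatch)
print(round(s_match, 3), round(s_mismatch, 3), round(m, 3))  # 1.163 -1.322 0.532
```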
CS262 Lecture 8, Win07, Batzoglou
The meaning of alignment scores
The global alignment algorithm we learned corresponds to: find the most likely alignment under the 3-state pair HMM
The score of an alignment corresponds to: the log-likelihood ratio between P(best alignment | alignment model) and P(sequences were generated independently)
CS262 Lecture 8, Win07, Batzoglou
Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate A_ab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate P(a) = (Σ_b A_ab) / (Σ_{c≤d} A_cd);  P(a, b) = A_ab / (Σ_{c≤d} A_cd)
CS262 Lecture 8, Win07, Batzoglou
BLOSUM matrices
[Figures: the BLOSUM 50 and BLOSUM 62 matrices; the two are scaled differently]
CS262 Lecture 8, Win07, Batzoglou
Conditional Random Fields
A brief description of a relatively new kind of graphical model
CS262 Lecture 8, Win07, Batzoglou
Let's look at an HMM again
Why are HMMs convenient to use? Because we can do dynamic programming with them!
"Best" state sequence for 1…i interacts with "best" sequence for i+1…N using K² arrows
V_l(i+1) = e_l(x_{i+1}) max_k V_k(i) a_kl = max_k ( V_k(i) + [ e(l, i+1) + a(k, l) ] )    (where e(.,.) and a(.,.) are logs)
Total likelihood of all state sequences for 1…i+1 can be calculated from total likelihood for 1…i by only summing up K² arrows
[Figure: HMM trellis with K states per position, emitting x_1, x_2, x_3, …, x_N]
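A minimal log-space sketch of this recursion; `log_a`, `log_e`, and `log_init` are assumed to be nested dicts of log transition, emission, and initial probabilities (the initial distribution is an assumption of the sketch, not something on the slide).

```python
def viterbi_log(x, states, log_a, log_e, log_init):
    """V_l(i+1) = max_k [ V_k(i) + a(k, l) ] + e(l, x_{i+1}), everything in log space."""
    V = {k: log_init[k] + log_e[k][x[0]] for k in states}   # position 1
    backptrs = []
    for ch in x[1:]:
        new_V, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[k] + log_a[k][l])
            new_V[l] = V[best_k] + log_a[best_k][l] + log_e[l][ch]
            ptr[l] = best_k
        V = new_V
        backptrs.append(ptr)
    # Trace back the highest-scoring state path.
    last = max(states, key=lambda k: V[k])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return V[last], path[::-1]
```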
CS262 Lecture 8, Win07, Batzoglou
Let's look at an HMM again
Some shortcomings of HMMs:
Can't model state duration. Solution: explicit duration models (Semi-Markov HMMs)
Unfortunately, the state at position i cannot "look" at any letter other than x_i!
Strong independence assumption: P(π_i | x_1…x_{i–1}, π_1…π_{i–1}) = P(π_i | π_{i–1})
[Figure: HMM trellis]
CS262 Lecture 8, Win07, Batzoglou
Let's look at an HMM again
Another way to put this: the features used in the objective function P(x, π) are a_kl and e_k(b), where b ∈ Σ
At position i: all K² a_kl features, and all K e_l(x_i) features, play a role
OK, forget the probabilistic interpretation for a moment
"Given that prev. state is k, current state is l, how much is current score?"
V_l(i) = V_k(i – 1) + (a(k, l) + e(l, i)) = V_k(i – 1) + g(k, l, x_i)
Let's generalize g!!!   V_k(i – 1) + g(k, l, x, i)
[Figure: HMM trellis]
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
What do we put in g(k, l, x, i)?
The "higher" g(k, l, x, i), the more we like going from k to l at position i
Richer models using this additional power. Examples:
Casino player looks at previous 100 positions; if > 50 6s, he likes to go to Fair:
g(Loaded, Fair, x, i) += 1[x_{i–100}, …, x_{i–1} has > 50 6s] · w_DON'T_GET_CAUGHT
Genes are close to CpG islands; for any state k:
g(k, exon, x, i) += 1[x_{i–1000}, …, x_{i+1000} has > 1/16 CpG] · w_CG_RICH_REGION
[Figure: sequence x_1 … x_10, with arbitrary positions feeding the transition from π_{i–1} to π_i]
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
Conditional Random Fields: Features
1. Define a set of features that you think are important
All features should be functions of current state, previous state, x, and position i
Example: old features: transition k→l, emission b from state k; plus new features: prev 100 letters have 50 6s
Number the features 1…n: f_1(k, l, x, i), …, f_n(k, l, x, i); features are indicator true/false variables
Find appropriate weights w_1, …, w_n for when each feature is true; weights are the parameters of the model
2. Let's assume for now each feature f_j has a weight w_j
Then, g(k, l, x, i) = Σ_{j=1…n} f_j(k, l, x, i) · w_j
[Figure: sequence x_1 … x_10]
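A tiny sketch of step 2; the two indicator features and their weights are made-up illustrations (the second one is the casino "50 sixes in the last 100 rolls" feature from the previous slide), not part of any standard library.

```python
# Indicator features f_j(k, l, x, i) -> 0/1; x is a list of die rolls, i is a 1-based position.
def f_trans_loaded_to_fair(k, l, x, i):
    return 1 if (k, l) == ("Loaded", "Fair") else 0

def f_many_sixes_before(k, l, x, i):
    window = x[max(0, i - 101): i - 1]          # up to the previous 100 rolls
    return 1 if l == "Fair" and window.count(6) > 50 else 0

features = [f_trans_loaded_to_fair, f_many_sixes_before]
weights = [0.5, 2.0]                            # w_j: placeholder values, normally learned

def g(k, l, x, i):
    """g(k, l, x, i) = sum_j f_j(k, l, x, i) * w_j."""
    return sum(w * f(k, l, x, i) for f, w in zip(features, weights))
```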
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
Define V_k(i): optimal score of "parsing" x_1…x_i and ending in state k
Then, assuming V_k(i) is optimal for every k at position i, it follows that
V_l(i+1) = max_k [ V_k(i) + g(k, l, x, i+1) ]
Why? Even though at position i+1 we "look" at arbitrary positions in x, we are only "affected" by the choice of ending state k
Therefore, the Viterbi algorithm again finds the optimal (highest-scoring) parse for x_1…x_N
[Figure: sequence x_1 … x_10]
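A sketch of this generalized Viterbi, reusing a scoring function g like the one sketched above; the special "start" state used for the first position is an assumption of the sketch, not something from the slides.

```python
def crf_viterbi(x, states, g, start="start"):
    """Highest-scoring parse of x under edge scores g(k, l, x, i), i = 1..len(x)."""
    V = {l: g(start, l, x, 1) for l in states}
    backptrs = []
    for i in range(2, len(x) + 1):
        new_V, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[k] + g(k, l, x, i))
            new_V[l] = V[best_k] + g(best_k, l, x, i)
            ptr[l] = best_k
        V = new_V
        backptrs.append(ptr)
    last = max(states, key=lambda l: V[l])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return V[last], path[::-1]
```

For instance, `crf_viterbi(rolls, ["Fair", "Loaded"], g)` would decode a casino roll sequence under the toy features sketched earlier.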
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
Score of a parse depends on all of x at each position
Can still do Viterbi, because state π_i only "looks" at the previous state π_{i–1} and the constant sequence x
[Figure: HMM chain, where each π_i emits only x_i, vs. CRF chain, where each transition π_{i–1} → π_i can look at all of x]
CS262 Lecture 8, Win07, Batzoglou
How many parameters are there, in general?
Arbitrarily many parameters!
For example, let f_j(k, l, x, i) depend on x_{i–5}, x_{i–4}, …, x_{i+5}
Then, we would have up to K·|Σ|^11 parameters!
Advantage: powerful, expressive model
Example: "if there are more than 50 6's in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 6's, this is evidence we are in Fair state"
Interpretation: the casino player is afraid to be caught, so he switches to Fair when he sees too many 6's
Example: "if there are any CG-rich regions in the vicinity (window of 2000 pos) then favor predicting lots of genes in this region"
Question: how do we train these parameters?
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
Hidden Markov Model training:
Given training sequence x and "true" parse π, maximize P(x, π)
Disadvantage: P(x, π) = P(π | x) P(x)
P(π | x): the quantity we care about, so as to get a good parse
P(x): a quantity we don't care so much about, because x is always given
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
P(x, π) = P(π | x) P(x), so P(π | x) = P(x, π) / P(x)
Recall F(j, x, π) = # of times feature f_j occurs in (x, π)
= Σ_{i=1…N} f_j(k, l, x, i), where k, l are the states at positions i–1, i    (count f_j in x, π)
In HMMs, let's denote by w_j the weight of the j-th feature: w_j = log(a_kl) or log(e_k(b))
Then,
HMM: P(x, π) = exp [ Σ_{j=1…n} w_j F(j, x, π) ]
CRF: Score(x, π) = exp [ Σ_{j=1…n} w_j F(j, x, π) ]
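A sketch of the global feature counts F(j, x, π) and the resulting unnormalized score, with π given as a list of states aligned to x and `features`/`weights` as in the earlier sketch; the "start" state is again an assumption.

```python
from math import exp

def feature_count(f, x, pi, start="start"):
    """F(j, x, pi): number of positions at which feature f fires along the parse pi."""
    total, prev = 0, start
    for i, state in enumerate(pi, start=1):
        total += f(prev, state, x, i)
        prev = state
    return total

def crf_score(x, pi, features, weights, start="start"):
    """Score(x, pi) = exp[ sum_j w_j * F(j, x, pi) ]  (unnormalized)."""
    return exp(sum(w * feature_count(f, x, pi, start) for f, w in zip(features, weights)))
```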
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
In HMMs, P(π | x) = P(x, π) / P(x)
P(x, π) = exp [ Σ_{j=1…n} w_j F(j, x, π) ]
P(x) = Σ_π exp [ Σ_{j=1…n} w_j F(j, x, π) ] =: Z
Then, in CRFs we can do the same to normalize Score(x, π) into a probability:
P_CRF(π | x) = exp [ Σ_{j=1…n} w_j F(j, x, π) ] / Z
QUESTION: Why is this a probability???
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
1. We need to be given a set of sequences x and "true" parses π
2. Calculate Z by a sum-of-paths algorithm similar to HMM; we can then easily calculate P(π | x)
3. Calculate the partial derivative of P(π | x) w.r.t. each parameter w_j (not covered; akin to forward/backward), and update each parameter with gradient descent!
4. Continue until convergence to the optimal set of weights
P(π | x) = exp [ Σ_{j=1…n} w_j F(j, x, π) ] / Z is convex!!!
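A rough sketch of step 2: computing Z with a forward-style sum over all parses (the same recursion shape as the Viterbi sketch, with max replaced by log-sum-exp); the gradient of step 3 is left out, as on the slide, and the "start" state is again an assumption.

```python
from math import exp, log

def logsumexp(vals):
    m = max(vals)
    return m + log(sum(exp(v - m) for v in vals))

def log_partition(x, states, g, start="start"):
    """log Z = log of the sum over all parses pi of exp[ sum_i g(pi_{i-1}, pi_i, x, i) ]."""
    F = {l: g(start, l, x, 1) for l in states}
    for i in range(2, len(x) + 1):
        F = {l: logsumexp([F[k] + g(k, l, x, i) for k in states]) for l in states}
    return logsumexp(list(F.values()))

def log_p_parse(x, pi, states, g, start="start"):
    """log P(pi | x) = sum_i g(pi_{i-1}, pi_i, x, i) - log Z."""
    score, prev = 0.0, start
    for i, state in enumerate(pi, start=1):
        score += g(prev, state, x, i)
        prev = state
    return score - log_partition(x, states, g, start)
```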
CS262 Lecture 8, Win07, Batzoglou
Conditional Random Fields: Summary
1. Ability to incorporate complicated non-local feature sets
Do away with some independence assumptions of HMMs
Parsing is still equally efficient
2. Conditional training
Train parameters that are best for parsing, not modeling
Need labeled examples: sequences x and "true" parses π
(Can train on unlabeled sequences, however it is unreasonable to train too many parameters this way)
Training is significantly slower: many iterations of forward/backward