Hidden Markov Models
[Figure: HMM trellis with K hidden states at each position, emitting x_1, x_2, x_3, …]
CS262 Lecture 8, Win07, Batzoglou
Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences!
CS262 Lecture 8, Win07, Batzoglou
Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate A_ab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate P(a) = (Σ_b A_ab) / (Σ_{c≤d} A_cd);  P(a, b) = A_ab / (Σ_{c≤d} A_cd)
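A minimal sketch of step 4 in Python, assuming the cluster-corrected pair counts A_ab from steps 2–3 have already been collected into a dictionary; the function names and the toy counts below are illustrative, not part of the lecture.

```python
from collections import defaultdict
from math import log2

def blosum_probabilities(pair_counts):
    """Estimate P(a) and P(a, b) from cluster-corrected pair counts A_ab.

    pair_counts: dict mapping unordered pairs (a, b), a <= b, to counts A_ab
    (already corrected by 1/mn across clusters, as in steps 2-3).
    """
    total = sum(pair_counts.values())                  # sum over c <= d of A_cd
    p_pair = {ab: n / total for ab, n in pair_counts.items()}

    # P(a) = (sum_b A_ab) / (sum_{c<=d} A_cd), following the slide's formula.
    p_single = defaultdict(float)
    for (a, b), n in pair_counts.items():
        p_single[a] += n / total
        if b != a:
            p_single[b] += n / total
    return dict(p_single), p_pair

def log_odds(a, b, p_single, p_pair):
    """BLOSUM-style log-odds, log2[ P(a, b) / (P(a) P(b)) ], before scaling/rounding."""
    key = (a, b) if (a, b) in p_pair else (b, a)
    return log2(p_pair[key] / (p_single[a] * p_single[b]))

# Toy example with a 2-letter alphabet (made-up counts):
counts = {("A", "A"): 6.0, ("A", "B"): 2.0, ("B", "B"): 2.0}
p1, p2 = blosum_probabilities(counts)
print(round(log_odds("A", "A", p1, p2), 2))
```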
CS262 Lecture 8, Win07, Batzoglou
Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution
Goal:
Produce the most likely alignment
Assert the likelihood that the sequences are indeed related
CS262 Lecture 8, Win07, Batzoglou
A Pair HMM for alignments
[Diagram, Model M: states M, I, J; M emits P(x_i, y_j), I emits P(x_i), J emits P(y_j); M continues with probability 1 – 2δ, opens a gap with probability δ, and I/J extend with ε and return to M with 1 – ε]
This model generates two sequences simultaneously
Match/Mismatch state M: P(x, y) reflects substitution frequencies between pairs of amino acids
Insertion states I, J: P(x), P(y) reflect frequencies of each amino acid
δ: set so that 1/(2δ) is the avg. length before the next gap
ε: set so that 1/(1 – ε) is the avg. length of a gap
CS262 Lecture 8, Win07, Batzoglou
A Pair HMM for unaligned sequences
[Diagram, Model R: two independent insertion states; I emits P(x_i), J emits P(y_j)]
Two sequences are generated independently of one another
P(x, y | R) = P(x_1)…P(x_m) · P(y_1)…P(y_n) = Π_i P(x_i) · Π_j P(y_j)
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
M: (1 – 2δ) P(x_i, y_j) when matched; P(x_i) P(y_j) when gapped
R: P(x_i) P(y_j) in random model
Focus on comparison of P(x_i, y_j) vs. P(x_i) P(y_j)
[Diagrams: the pair HMM (M, I, J) and the random model (I, J) side by side]
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
M: (1 – 2δ) P(x_i, y_j) when matched; P(x_i) P(y_j) when gapped
R: P(x_i) P(y_j) in random model
Focus on comparison of P(x_i, y_j) vs. P(x_i) P(y_j)
[Diagrams: the pair HMM redrawn with rescaled transition probabilities so that the comparison with the random model is equivalent]
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
M: (1 – 2δ) P(x_i, y_j) when matched; P(x_i) P(y_j) when gapped
R: P(x_i) P(y_j) in random model
Focus on comparison of P(x_i, y_j) vs. P(x_i) P(y_j)
[Diagrams: equivalent formulation in which M emits the ratio P(x_i, y_j) / (P(x_i) P(y_j)) and the I, J states emit 1]
CS262 Lecture 8, Win07, Batzoglou
To compare ALIGNMENT vs. RANDOM hypothesis
Idea: We will divide the alignment score by the random score, and take logarithms
Let
s(x_i, y_j) = log [ P(x_i, y_j) / (P(x_i) P(y_j)) ] + log (1 – 2δ)    (defn: substitution score)
d = – log [ δ (1 – ε) P(x_i) / ((1 – 2δ) P(x_i)) ]    (defn: gap initiation penalty)
e = – log [ ε P(x_i) / P(x_i) ]    (defn: gap extension penalty)
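A small sketch of how these three definitions turn pair-HMM parameters into alignment scores; `p_pair`, `p_single`, `delta`, and `epsilon` are assumed inputs, and the P(x_i) factors in d and e cancel.

```python
from math import log

def pair_hmm_scores(p_pair, p_single, delta, epsilon):
    """Substitution scores and gap penalties from pair-HMM parameters.

    s(a, b) = log[ P(a, b) / (P(a) P(b)) ] + log(1 - 2*delta)
    d       = -log[ delta * (1 - epsilon) / (1 - 2*delta) ]   # gap initiation (P(x_i) cancels)
    e       = -log(epsilon)                                   # gap extension  (P(x_i) cancels)
    """
    s = {
        (a, b): log(p / (p_single[a] * p_single[b])) + log(1 - 2 * delta)
        for (a, b), p in p_pair.items()
    }
    d = -log(delta * (1 - epsilon) / (1 - 2 * delta))
    e = -log(epsilon)
    return s, d, e
```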
CS262 Lecture 8, Win07, Batzoglou
The meaning of alignment scores
The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps:
V_M(i, j) = max { V_M(i – 1, j – 1), V_I(i – 1, j – 1), V_J(i – 1, j – 1) } + s(x_i, y_j)
V_I(i, j) = max { V_M(i – 1, j) – d, V_I(i – 1, j) – e }
V_J(i, j) = max { V_M(i, j – 1) – d, V_J(i, j – 1) – e }
s(.,.) contains (1 – 2δ): ~ how often a pair of letters substitute one another
ε: ~ 1/mean length of next gap
δ (1 – ε) / (1 – 2δ): ~ 1/mean arrival time of next gap
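A compact sketch of the three-matrix recurrence above (score only, no traceback); `s` is a substitution-score function and `d`, `e` are the gap penalties from the previous slide. The boundary initialization is one reasonable choice, not necessarily the lecture's.

```python
NEG_INF = float("-inf")

def affine_global_align_score(x, y, s, d, e):
    """Best global alignment score of x and y with affine gap penalties.

    V_M/V_I/V_J mirror the three pair-HMM states (match, gap in y, gap in x).
    """
    m, n = len(x), len(y)
    VM = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VJ = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = 0.0
    for i in range(1, m + 1):                      # leading gap in y
        VI[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):                      # leading gap in x
        VJ[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            VM[i][j] = max(VM[i-1][j-1], VI[i-1][j-1], VJ[i-1][j-1]) + s(x[i-1], y[j-1])
            VI[i][j] = max(VM[i-1][j] - d, VI[i-1][j] - e)
            VJ[i][j] = max(VM[i][j-1] - d, VJ[i][j-1] - e)
    return max(VM[m][n], VI[m][n], VJ[m][n])
```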
CS262 Lecture 8, Win07, Batzoglou
The meaning of alignment scores
Match/mismatch scores:
s(a, b) = log [ P(x_i, y_j) / (P(x_i) P(y_j)) ]    (ignore log(1 – 2δ) for the moment)
Example: DNA regions between human and mouse genes have average conservation of 80%
1. What is the substitution score for a match?
P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8, so P(x, x) = 0.2
P(a) = P(c) = P(g) = P(t) = 0.25
s(x, x) = log [ 0.2 / 0.25² ] = 1.163
2. What is the substitution score for a mismatch?
P(a, c) + … + P(t, g) = 0.2, so P(x, y≠x) = 0.2/12 = 0.0167
s(x, y≠x) = log [ 0.0167 / 0.25² ] = –1.322
3. What ratio matches/(matches + mism.) gives score 0?
x (#match) – y (#mism) = 0:  1.163 (#match) – 1.322 (#mism) = 0, so #match = 1.137 (#mism), i.e., matches = 53.2%
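A quick check of the arithmetic above (natural logs, uniform base frequencies), using only the numbers on the slide.

```python
from math import log

p_match_total = 0.8                 # P(a,a) + P(c,c) + P(g,g) + P(t,t)
p_mismatch_total = 0.2
p_xx = p_match_total / 4            # each of the 4 identical pairs
p_xy = p_mismatch_total / 12        # each of the 12 ordered mismatch pairs
p_base = 0.25                       # uniform P(a) = P(c) = P(g) = P(t)

s_match = log(p_xx / p_base**2)     # ~ +1.163
s_mismatch = log(p_xy / p_base**2)  # ~ -1.322

# Fraction of matches at which the expected score is zero:
# s_match * m + s_mismatch * (1 - m) = 0
m = -s_mismatch / (s_match - s_mismatch)
print(round(s_match, 3), round(s_mismatch, 3), round(m, 3))  # 1.163 -1.322 0.532
```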
CS262 Lecture 8, Win07, Batzoglou
The meaning of alignment scores
The global alignment algorithm we learned corresponds to: find the most likely alignment under the 3-state pair HMM
The score of an alignment corresponds to: the log-likelihood ratio between P(best alignment | alignment model) and P(sequences were generated independently)
CS262 Lecture 8, Win07, Batzoglou
Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate A_ab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate P(a) = (Σ_b A_ab) / (Σ_{c≤d} A_cd);  P(a, b) = A_ab / (Σ_{c≤d} A_cd)
CS262 Lecture 8, Win07, Batzoglou
BLOSUM matrices
[Figures: the BLOSUM 50 and BLOSUM 62 matrices; the two are scaled differently]
CS262 Lecture 8, Win07, Batzoglou
Conditional Random Fields
A brief description of a relatively new kind of graphical model
CS262 Lecture 8, Win07, Batzoglou
Let's look at an HMM again
Why are HMMs convenient to use? Because we can do dynamic programming with them!
"Best" state sequence for 1…i interacts with "best" sequence for i+1…N using K² arrows
V_l(i+1) = e_l(x_{i+1}) max_k V_k(i) a_kl = max_k ( V_k(i) + [ e(l, i+1) + a(k, l) ] )    (where e(.,.) and a(.,.) are logs)
Total likelihood of all state sequences for 1…i+1 can be calculated from total likelihood for 1…i by only summing up K² arrows
[Figure: HMM trellis with K states per position, emitting x_1, x_2, x_3, …, x_N]
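A minimal log-space sketch of this recursion; `log_a`, `log_e`, and `log_init` are assumed to be nested dicts of log transition, emission, and initial probabilities (the initial distribution is an assumption of the sketch, not something on the slide).

```python
def viterbi_log(x, states, log_a, log_e, log_init):
    """V_l(i+1) = max_k [ V_k(i) + a(k, l) ] + e(l, x_{i+1}), everything in log space."""
    V = {k: log_init[k] + log_e[k][x[0]] for k in states}   # position 1
    backptrs = []
    for ch in x[1:]:
        new_V, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[k] + log_a[k][l])
            new_V[l] = V[best_k] + log_a[best_k][l] + log_e[l][ch]
            ptr[l] = best_k
        V = new_V
        backptrs.append(ptr)
    # Trace back the highest-scoring state path.
    last = max(states, key=lambda k: V[k])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return V[last], path[::-1]
```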
CS262 Lecture 8, Win07, Batzoglou
Let's look at an HMM again
Some shortcomings of HMMs:
Can't model state duration. Solution: explicit duration models (Semi-Markov HMMs)
Unfortunately, the state at position i cannot "look" at any letter other than x_i!
Strong independence assumption: P(π_i | x_1…x_{i–1}, π_1…π_{i–1}) = P(π_i | π_{i–1})
[Figure: HMM trellis]
CS262 Lecture 8, Win07, Batzoglou
Let's look at an HMM again
Another way to put this: the features used in the objective function P(x, π) are a_kl and e_k(b), where b ∈ Σ
At position i: all K² a_kl features, and all K e_l(x_i) features, play a role
OK, forget the probabilistic interpretation for a moment
"Given that prev. state is k, current state is l, how much is current score?"
V_l(i) = V_k(i – 1) + (a(k, l) + e(l, i)) = V_k(i – 1) + g(k, l, x_i)
Let's generalize g!!!   V_k(i – 1) + g(k, l, x, i)
[Figure: HMM trellis]
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
What do we put in g(k, l, x, i)?
The "higher" g(k, l, x, i), the more we like going from k to l at position i
Richer models using this additional power. Examples:
Casino player looks at previous 100 positions; if > 50 6s, he likes to go to Fair:
g(Loaded, Fair, x, i) += 1[x_{i–100}, …, x_{i–1} has > 50 6s] · w_DON'T_GET_CAUGHT
Genes are close to CpG islands; for any state k:
g(k, exon, x, i) += 1[x_{i–1000}, …, x_{i+1000} has > 1/16 CpG] · w_CG_RICH_REGION
[Figure: sequence x_1 … x_10, with arbitrary positions feeding the transition from π_{i–1} to π_i]
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
Conditional Random Fields: Features
1. Define a set of features that you think are important
All features should be functions of current state, previous state, x, and position i
Example: old features: transition k→l, emission b from state k; plus new features: prev 100 letters have 50 6s
Number the features 1…n: f_1(k, l, x, i), …, f_n(k, l, x, i); features are indicator true/false variables
Find appropriate weights w_1, …, w_n for when each feature is true; weights are the parameters of the model
2. Let's assume for now each feature f_j has a weight w_j
Then, g(k, l, x, i) = Σ_{j=1…n} f_j(k, l, x, i) · w_j
[Figure: sequence x_1 … x_10]
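A tiny sketch of step 2; the two indicator features and their weights are made-up illustrations (the second one is the casino "50 sixes in the last 100 rolls" feature from the previous slide), not part of any standard library.

```python
# Indicator features f_j(k, l, x, i) -> 0/1; x is a list of die rolls, i is a 1-based position.
def f_trans_loaded_to_fair(k, l, x, i):
    return 1 if (k, l) == ("Loaded", "Fair") else 0

def f_many_sixes_before(k, l, x, i):
    window = x[max(0, i - 101): i - 1]          # up to the previous 100 rolls
    return 1 if l == "Fair" and window.count(6) > 50 else 0

features = [f_trans_loaded_to_fair, f_many_sixes_before]
weights = [0.5, 2.0]                            # w_j: placeholder values, normally learned

def g(k, l, x, i):
    """g(k, l, x, i) = sum_j f_j(k, l, x, i) * w_j."""
    return sum(w * f(k, l, x, i) for f, w in zip(features, weights))
```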
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
Define V_k(i): optimal score of "parsing" x_1…x_i and ending in state k
Then, assuming V_k(i) is optimal for every k at position i, it follows that
V_l(i+1) = max_k [ V_k(i) + g(k, l, x, i+1) ]
Why? Even though at position i+1 we "look" at arbitrary positions in x, we are only "affected" by the choice of ending state k
Therefore, the Viterbi algorithm again finds the optimal (highest-scoring) parse for x_1…x_N
[Figure: sequence x_1 … x_10]
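A sketch of this generalized Viterbi, reusing a scoring function g like the one sketched above; the special "start" state used for the first position is an assumption of the sketch, not something from the slides.

```python
def crf_viterbi(x, states, g, start="start"):
    """Highest-scoring parse of x under edge scores g(k, l, x, i), i = 1..len(x)."""
    V = {l: g(start, l, x, 1) for l in states}
    backptrs = []
    for i in range(2, len(x) + 1):
        new_V, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[k] + g(k, l, x, i))
            new_V[l] = V[best_k] + g(best_k, l, x, i)
            ptr[l] = best_k
        V = new_V
        backptrs.append(ptr)
    last = max(states, key=lambda l: V[l])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return V[last], path[::-1]
```

For instance, `crf_viterbi(rolls, ["Fair", "Loaded"], g)` would decode a casino roll sequence under the toy features sketched earlier.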
CS262 Lecture 8, Win07, Batzoglou
"Features" that depend on many pos. in x
Score of a parse depends on all of x at each position
Can still do Viterbi, because state π_i only "looks" at the previous state π_{i–1} and the constant sequence x
[Figure: HMM chain, where each π_i emits only x_i, vs. CRF chain, where each transition π_{i–1} → π_i can look at all of x]
CS262 Lecture 8, Win07, Batzoglou
How many parameters are there, in general?
Arbitrarily many parameters!
For example, let f_j(k, l, x, i) depend on x_{i–5}, x_{i–4}, …, x_{i+5}
Then, we would have up to K·|Σ|^11 parameters!
Advantage: powerful, expressive model
Example: "if there are more than 50 6's in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 6's, this is evidence we are in Fair state"
Interpretation: the casino player is afraid to be caught, so he switches to Fair when he sees too many 6's
Example: "if there are any CG-rich regions in the vicinity (window of 2000 pos) then favor predicting lots of genes in this region"
Question: how do we train these parameters?
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
Hidden Markov Model training:
Given training sequence x and "true" parse π, maximize P(x, π)
Disadvantage: P(x, π) = P(π | x) P(x)
P(π | x): the quantity we care about, so as to get a good parse
P(x): a quantity we don't care so much about, because x is always given
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
P(x, π) = P(π | x) P(x), so P(π | x) = P(x, π) / P(x)
Recall F(j, x, π) = # of times feature f_j occurs in (x, π)
= Σ_{i=1…N} f_j(k, l, x, i), where k, l are the states at positions i–1, i    (count f_j in x, π)
In HMMs, let's denote by w_j the weight of the j-th feature: w_j = log(a_kl) or log(e_k(b))
Then,
HMM: P(x, π) = exp [ Σ_{j=1…n} w_j F(j, x, π) ]
CRF: Score(x, π) = exp [ Σ_{j=1…n} w_j F(j, x, π) ]
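A sketch of the global feature counts F(j, x, π) and the resulting unnormalized score, with π given as a list of states aligned to x and `features`/`weights` as in the earlier sketch; the "start" state is again an assumption.

```python
from math import exp

def feature_count(f, x, pi, start="start"):
    """F(j, x, pi): number of positions at which feature f fires along the parse pi."""
    total, prev = 0, start
    for i, state in enumerate(pi, start=1):
        total += f(prev, state, x, i)
        prev = state
    return total

def crf_score(x, pi, features, weights, start="start"):
    """Score(x, pi) = exp[ sum_j w_j * F(j, x, pi) ]  (unnormalized)."""
    return exp(sum(w * feature_count(f, x, pi, start) for f, w in zip(features, weights)))
```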
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
In HMMs, P(π | x) = P(x, π) / P(x)
P(x, π) = exp [ Σ_{j=1…n} w_j F(j, x, π) ]
P(x) = Σ_π exp [ Σ_{j=1…n} w_j F(j, x, π) ] =: Z
Then, in CRFs we can do the same to normalize Score(x, π) into a probability:
P_CRF(π | x) = exp [ Σ_{j=1…n} w_j F(j, x, π) ] / Z
QUESTION: Why is this a probability???
CS262 Lecture 8, Win07, Batzoglou
Conditional Training
1. We need to be given a set of sequences x and "true" parses π
2. Calculate Z by a sum-of-paths algorithm similar to HMM; we can then easily calculate P(π | x)
3. Calculate the partial derivative of P(π | x) w.r.t. each parameter w_j (not covered; akin to forward/backward), and update each parameter with gradient descent!
4. Continue until convergence to the optimal set of weights
P(π | x) = exp [ Σ_{j=1…n} w_j F(j, x, π) ] / Z is convex!!!
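A rough sketch of step 2: computing Z with a forward-style sum over all parses (the same recursion shape as the Viterbi sketch, with max replaced by log-sum-exp); the gradient of step 3 is left out, as on the slide, and the "start" state is again an assumption.

```python
from math import exp, log

def logsumexp(vals):
    m = max(vals)
    return m + log(sum(exp(v - m) for v in vals))

def log_partition(x, states, g, start="start"):
    """log Z = log of the sum over all parses pi of exp[ sum_i g(pi_{i-1}, pi_i, x, i) ]."""
    F = {l: g(start, l, x, 1) for l in states}
    for i in range(2, len(x) + 1):
        F = {l: logsumexp([F[k] + g(k, l, x, i) for k in states]) for l in states}
    return logsumexp(list(F.values()))

def log_p_parse(x, pi, states, g, start="start"):
    """log P(pi | x) = sum_i g(pi_{i-1}, pi_i, x, i) - log Z."""
    score, prev = 0.0, start
    for i, state in enumerate(pi, start=1):
        score += g(prev, state, x, i)
        prev = state
    return score - log_partition(x, states, g, start)
```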
CS262 Lecture 8, Win07, Batzoglou
Conditional Random Fields: Summary
1. Ability to incorporate complicated non-local feature sets
Do away with some independence assumptions of HMMs
Parsing is still equally efficient
2. Conditional training
Train parameters that are best for parsing, not modeling
Need labeled examples: sequences x and "true" parses π
(Can train on unlabeled sequences, however it is unreasonable to train too many parameters this way)
Training is significantly slower: many iterations of forward/backward