Hidden Markov Models CBB 231 / COMPSCI 261
What is an HMM?

An HMM is a stochastic machine M = (Q, Σ, P_t, P_e) consisting of the following:
- a finite set of states, Q = {q_0, q_1, ..., q_m}
- a finite alphabet, Σ = {s_0, s_1, ..., s_n}
- a transition distribution P_t : Q × Q → [0,1], i.e., P_t(q_j | q_i)
- an emission distribution P_e : Q × Σ → [0,1], i.e., P_e(s_j | q_i)

An Example

M_1 = ({q_0, q_1, q_2}, {Y, R}, P_t, P_e)

[State diagram: q_0 →(100%) q_1; q_1 →(80%) q_1, →(15%) q_2, →(5%) q_0; q_2 →(70%) q_2, →(30%) q_1; q_1 emits Y = 100%, R = 0%; q_2 emits Y = 0%, R = 100%]

P_t = {(q_0,q_1,1), (q_1,q_1,0.8), (q_1,q_2,0.15), (q_1,q_0,0.05), (q_2,q_2,0.7), (q_2,q_1,0.3)}
P_e = {(q_1,Y,1), (q_1,R,0), (q_2,Y,0), (q_2,R,1)}
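To make the definition concrete, M_1 can be written down directly as lookup tables of transition and emission probabilities. This is only a sketch; the dictionary-based representation and the names trans and emit are illustrative, not part of the slides.

```python
# The example HMM M1: states q0 (silent start/end), q1, q2; alphabet {Y, R}.
# Pairs absent from the dictionaries have probability zero.
trans = {            # trans[(i, j)] = P_t(q_j | q_i)
    (0, 1): 1.0,
    (1, 1): 0.8, (1, 2): 0.15, (1, 0): 0.05,
    (2, 2): 0.7, (2, 1): 0.3,
}
emit = {             # emit[(i, s)] = P_e(s | q_i)
    (1, 'Y'): 1.0, (1, 'R'): 0.0,
    (2, 'Y'): 0.0, (2, 'R'): 1.0,
}
```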
Probability of a Sequence

[Same state diagram as above: q_0 (start/end), q_1 (emits Y), q_2 (emits R)]

P(YRYRY | M_1) = a_{0→1} b_{1,Y} a_{1→2} b_{2,R} a_{2→1} b_{1,Y} a_{1→2} b_{2,R} a_{2→1} b_{1,Y} a_{1→0}
             = 1 · 1 · 0.15 · 1 · 0.3 · 1 · 0.15 · 1 · 0.3 · 1 · 0.05
             ≈ 1.01 × 10^-4
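The same computation in code: multiply the transition and emission probabilities along the unique path q_0 → q_1 → q_2 → q_1 → q_2 → q_1 → q_0 that can emit YRYRY. A sketch that reuses the hypothetical trans and emit dictionaries from the block above.

```python
path = [0, 1, 2, 1, 2, 1, 0]   # q0 -> q1 -> q2 -> q1 -> q2 -> q1 -> q0
seq = "YRYRY"                  # emitted by the five intermediate states of the path

p = 1.0
for k, symbol in enumerate(seq):
    p *= trans.get((path[k], path[k + 1]), 0.0)    # a_{path[k] -> path[k+1]}
    p *= emit.get((path[k + 1], symbol), 0.0)      # b_{path[k+1], symbol}
p *= trans.get((path[-2], path[-1]), 0.0)          # final transition back to q0
print(p)                                           # 0.15^2 * 0.3^2 * 0.05 ≈ 1.01e-4
```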
Another Example

M_2 = (Q, Σ, P_t, P_e), with Q = {q_0, q_1, q_2, q_3, q_4} and Σ = {A, C, G, T}

[State diagram: four emitting states q_1..q_4 plus the silent start/end state q_0; transition probabilities 100%, 50%, 65%, 35%, 20%, 80%; per-state emission distributions: (A=10%, C=40%, G=20%, T=30%), (A=11%, C=43%, G=29%, T=17%), (A=35%, C=15%, G=25%, T=25%), (A=27%, C=22%, G=37%, T=14%)]
Finding the Most Probable Path

Example input sequence: CATTAATAG

[Figure: two candidate state paths through M_2 for CATTAATAG, with probabilities 7.0× (top path) and 2.8× (bottom path)]

The most probable path yields this parse:
- feature 1: C
- feature 2: ATTAATA
- feature 3: G
Decoding with an HMM

Decoding asks: given an observed sequence S = x_0 ... x_{L-1}, which state path most probably generated it? The most probable path maximizes the joint probability of path and sequence, i.e., the product over positions i of the emission prob. P_e(x_i | y_i) and the transition prob. P_t(y_i | y_{i-1}).
The Best Partial Parse

Define V(i,k) as the probability of the most probable partial path that emits the prefix x_0 ... x_{k-1} and ends in state q_i (the analogue of F(i,k) below, with a maximum in place of the sum). The Viterbi algorithm fills in these values one column at a time.
The Viterbi Algorithm

[DP matrix diagram: rows are states, columns are sequence positions ..., k-2, k-1, k, k+1, ...; cell (i,k) is computed from the cells in column k-1]
Viterbi: Traceback

Starting from the best state i in the final column and repeatedly following the traceback pointers T leads back to the start state:

T(T(T(... T(T(i, L-1), L-2) ..., 2), 1), 0) = 0
Viterbi Algorithm in Pseudocode

trans[q_i] = {q_j | P_t(q_i | q_j) > 0}   (predecessor states of q_i)
emit[s] = {q_i | P_e(s | q_i) > 0}        (states able to emit s)

Outline:
1. initialization
2. fill out the main part of the DP matrix
3. choose the best state from the last column of the DP matrix
4. traceback
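A compact Python sketch of that outline, using log probabilities to avoid numerical underflow on long sequences. The representation (trans and emit dictionaries, state 0 as the silent start/end state) follows the hypothetical encoding of the earlier blocks, not the course's own data structures.

```python
import math

def viterbi(seq, states, trans, emit):
    """Most probable state path for seq under the model.
    states: emitting states (the silent start/end state is 0).
    trans[(i, j)] = P_t(q_j | q_i); emit[(i, s)] = P_e(s | q_i)."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    L = len(seq)
    V = [dict() for _ in range(L)]  # V[k][i]: best log-prob of a path emitting seq[:k+1], ending in i
    T = [dict() for _ in range(L)]  # traceback pointers

    # initialization: a single transition out of the start state q0
    for i in states:
        V[0][i] = lg(trans.get((0, i), 0)) + lg(emit.get((i, seq[0]), 0))
        T[0][i] = 0

    # fill out the main part of the DP matrix
    for k in range(1, L):
        for i in states:
            best = max(states, key=lambda j: V[k - 1][j] + lg(trans.get((j, i), 0)))
            V[k][i] = V[k - 1][best] + lg(trans.get((best, i), 0)) + lg(emit.get((i, seq[k]), 0))
            T[k][i] = best

    # choose the best state from the last column (folding in the transition back to q0)
    end = max(states, key=lambda i: V[L - 1][i] + lg(trans.get((i, 0), 0)))

    # traceback
    path = [end]
    for k in range(L - 1, 0, -1):
        path.append(T[k][path[-1]])
    return path[::-1]
```

With the M_1 dictionaries above, viterbi("YRYRY", [1, 2], trans, emit) returns [1, 2, 1, 2, 1], the only path that can emit YRYRY.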
The Forward Algorithm: Probability of a Sequence

F(i,k) represents the probability P(S_{0..k-1} | q_i) that the machine emits the subsequence x_0 ... x_{k-1} by any path ending in state q_i, i.e., so that symbol x_{k-1} is emitted by state q_i.
The Forward Algorithm: Probability of a Sequence

[Same DP matrix diagram as for Viterbi: cell (i,k) is computed from the cells in column k-1]

Viterbi: the single most probable path (maximum over predecessor states).
Forward: sum over all paths, i.e., F(i,k) = P_e(x_{k-1} | q_i) · Σ_j F(j, k-1) · P_t(q_i | q_j).
The Forward Algorithm in Pseudocode

Outline:
1. fill out the DP matrix
2. sum over the final column to get P(S)
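A matching sketch of the Forward algorithm, again with the hypothetical dictionary encoding. The only change from Viterbi is that each cell sums over predecessor states instead of taking the maximum; a production version would rescale each column or work in log space to avoid underflow.

```python
def forward(seq, states, trans, emit):
    """P(S): probability that the model emits seq, summed over all state paths.
    trans[(i, j)] = P_t(q_j | q_i); emit[(i, s)] = P_e(s | q_i); 0 is the silent start state."""
    L = len(seq)
    F = [dict() for _ in range(L)]  # F[k][i]: prob. of emitting seq[:k+1] by any path ending in i

    for i in states:
        F[0][i] = trans.get((0, i), 0.0) * emit.get((i, seq[0]), 0.0)

    for k in range(1, L):  # same DP matrix as Viterbi, but sum over predecessors instead of max
        for i in states:
            F[k][i] = emit.get((i, seq[k]), 0.0) * sum(
                F[k - 1][j] * trans.get((j, i), 0.0) for j in states)

    # sum over the final column to get P(S); if the model requires ending in q0,
    # weight each term by trans.get((i, 0), 0.0) instead
    return sum(F[L - 1].values())
```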
Training an HMM from Labeled Sequences

Example labeled training sequence (state labels shown in the original figure):
CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC

transitions (counts and estimated probabilities):

                to state 0   to state 1   to state 2
  from state 0  0 (0%)       1 (100%)     0 (0%)
  from state 1  1 (4%)       21 (84%)     3 (12%)
  from state 2  0 (0%)       3 (20%)      12 (80%)

emissions (counts and estimated probabilities):

              A         C         G         T
  in state 1  6 (24%)   7 (28%)   5 (20%)   7 (28%)
  in state 2  3 (20%)   2 (13%)   7 (47%)   3 (20%)
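The counting step behind such tables is simple enough to sketch: tally each observed transition and emission in the labeled training data and normalize by row. The input format (parallel symbol and state sequences) and the function name are assumptions made for illustration.

```python
from collections import Counter

def train_labeled(pairs):
    """Maximum-likelihood estimates of P_t and P_e from labeled sequences.
    pairs: iterable of (sequence, state_path) with one state per symbol; 0 is the silent start/end state."""
    t_counts, e_counts = Counter(), Counter()
    for seq, states in pairs:
        path = [0] + list(states) + [0]              # add the silent start/end state
        for a, b in zip(path, path[1:]):
            t_counts[(a, b)] += 1                    # count transitions a -> b
        for s, q in zip(seq, states):
            e_counts[(q, s)] += 1                    # count emissions of s from state q

    t_totals, e_totals = Counter(), Counter()
    for (a, _), n in t_counts.items():
        t_totals[a] += n
    for (q, _), n in e_counts.items():
        e_totals[q] += n
    P_t = {k: n / t_totals[k[0]] for k, n in t_counts.items()}
    P_e = {k: n / e_totals[k[0]] for k, n in e_counts.items()}
    return P_t, P_e
```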
Recall: Eukaryotic Gene Structure

[Figure: a complete mRNA and its coding segment; two exons separated by an intron; start codon ATG and stop codon TGA; donor site GT at the 5' end of the intron and acceptor site AG at its 3' end]
Using an HMM for Gene Prediction

[Figure, top to bottom:
the Markov model: states q_0 (intergenic), start codon, exon, donor, intron, acceptor, stop codon;
the input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA;
the most probable path through the model;
the gene prediction: exon 1, exon 2, exon 3]
Higher Order Markovian Eukaryotic Recognizer (HOMER)

Versions: H_3, H_5, H_17, H_27, H_77, H_95
HOMER, version H_3

I = intron state, E = exon state, N = intergenic state
[State diagram: intergenic (q_0), start codon, exon, donor, intron, acceptor, stop codon]

Tested on 500 Arabidopsis genes:
[Results table, baseline vs. H_3: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and start/stop codons, Sn/Sp/F for exons, Sn/# for genes]
Recall: Sensitivity and Specificity

Sensitivity (Sn) is the fraction of true features that are predicted, TP/(TP+FN); specificity (Sp), in the gene-finding convention, is the fraction of predictions that are correct, TP/(TP+FP); F is their harmonic mean, 2·Sn·Sp/(Sn+Sp).
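A small helper reflecting those definitions (the function and its argument names are illustrative only):

```python
def accuracy(tp, fp, fn):
    """Sensitivity, specificity (gene-finding convention), and F-score from prediction counts."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f
```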
HOMER, version H_5

Three exon states, one for each of the three codon positions.

[Results table, H_3 vs. H_5: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and start/stop codons, Sn/Sp/F for exons, Sn/# for genes]
HOMER version H_17

[State diagram now includes explicit acceptor site, donor site, start codon, and stop codon states alongside the intergenic, exon, and intron states]

[Results table, H_5 vs. H_17: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and start/stop codons, Sn/Sp/F for exons, Sn/# for genes]
Maintaining Phase Across an Intron

[Figure: the sequence GTATGCGATAGTCAAGAGTGATCGCTAGACC with its coordinates and the codon phase annotated at each exon position, showing that the phase must be carried across the intervening intron]
HOMER version H_27

Three separate intron models.

[Results table, H_17 vs. H_27: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and start/stop codons, Sn/Sp/F for exons, Sn/# for genes]
Recall: Weight Matrices

Signal consensus sequences: ATG (start codons); TGA, TAA, TAG (stop codons); GT (donor splice sites); AG (acceptor splice sites).
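As a rough illustration of how such a signal model is used, a weight matrix can score a candidate window by summing per-position log probabilities. The function name and the matrix values below are invented for illustration, not taken from the slides.

```python
import math

def score_site(wm, window):
    """Log-probability of a candidate signal under a position-specific weight matrix.
    wm[k] maps each base to its probability at position k of the signal."""
    return sum(math.log(wm[k].get(base, 1e-9)) for k, base in enumerate(window))

# A hypothetical 3-column matrix strongly favoring the start codon ATG:
start_wm = [{'A': 0.97, 'C': 0.01, 'G': 0.01, 'T': 0.01},
            {'A': 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.97},
            {'A': 0.01, 'C': 0.01, 'G': 0.97, 'T': 0.01}]
```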
HOMER version H_77

Positional biases near splice sites.

[Results table, H_27 vs. H_77: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and start/stop codons, Sn/Sp/F for exons, Sn/# for genes]
HOMER version H_95

[Results table, H_77 vs. H_95: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and start/stop codons, Sn/Sp/F for exons, Sn/# for genes]
Summary of HOMER Results
Higher-order Markov Models

For the sequence ...ACGCTA..., the probability of emitting the G depends on the preceding bases:
0th order: P(G)
1st order: P(G | C)
2nd order: P(G | AC)
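The higher-order emission tables themselves can be estimated by counting each symbol together with its preceding context. A sketch for a single state at order 2; the function name and input format are assumptions for illustration.

```python
from collections import Counter

def order2_emissions(training_seq):
    """Estimate P(x_k | x_{k-2} x_{k-1}) for one state from a training sequence."""
    counts, totals = Counter(), Counter()
    for k in range(2, len(training_seq)):
        context, sym = training_seq[k - 2:k], training_seq[k]
        counts[(context, sym)] += 1
        totals[context] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}

# order2_emissions(training_seq)[("AC", "G")] approximates P(G | AC)
```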
Higher-order Markov Models

[Results table by Markov-chain order, for six HOMER versions: Sn/Sp/F for nucleotides, Sn/Sp for splice sites and starts/stops, Sn/Sp/F for exons, Sn/# for genes]
Variable-Order Markov Models
Interpolation Results
Summary

- An HMM is a stochastic generative model which emits sequences.
- Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence.
- Training of unambiguous HMMs can be accomplished using labeled sequence training.
- Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...).
- Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...).