1
Hidden Markov Models

[Figure: HMM trellis with states 1, 2, …, K in each column and emitted symbols $x_1, x_2, x_3, \dots, x_K$]
2
Viterbi, Forward, Backward

VITERBI
Initialization: $V_0(0) = 1$; $V_k(0) = 0$ for all $k > 0$
Iteration: $V_l(i) = e_l(x_i) \max_k V_k(i-1)\, a_{kl}$
Termination: $P(x, \pi^*) = \max_k V_k(N)$

FORWARD
Initialization: $f_0(0) = 1$; $f_k(0) = 0$ for all $k > 0$
Iteration: $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
Termination: $P(x) = \sum_k f_k(N)$

BACKWARD
Initialization: $b_k(N) = 1$ for all $k$
Iteration: $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$
Termination: $P(x) = \sum_k a_{0k}\, e_k(x_1)\, b_k(1)$
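To make the recursions concrete, here is a minimal NumPy sketch of Forward and Backward (not from the slides; the names a0, a, e and the dense-array layout are assumptions):

```python
import numpy as np

def forward(x, a0, a, e):
    """Forward: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl.
    x: sequence of symbol indices (length N); a0[k] = a_0k initial probs;
    a[k, l]: transition probs; e[k, s]: emission probs."""
    K, N = a.shape[0], len(x)
    f = np.zeros((K, N))
    f[:, 0] = a0 * e[:, x[0]]                     # f_k(1) = a_0k e_k(x_1)
    for i in range(1, N):
        f[:, i] = e[:, x[i]] * (f[:, i - 1] @ a)  # sum over predecessor k
    return f, f[:, -1].sum()                      # P(x) = sum_k f_k(N)

def backward(x, a0, a, e):
    """Backward: b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)."""
    K, N = a.shape[0], len(x)
    b = np.ones((K, N))                           # b_k(N) = 1
    for i in range(N - 2, -1, -1):
        b[:, i] = a @ (e[:, x[i + 1]] * b[:, i + 1])
    return b, (a0 * e[:, x[0]] * b[:, 0]).sum()   # same P(x) as forward
```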
3
Posterior Decoding

We can now calculate
$P(\pi_i = k \mid x) = \dfrac{f_k(i)\, b_k(i)}{P(x)}$

Then we can ask: what is the most likely state at position $i$ of sequence $x$? Define $\hat\pi$ by Posterior Decoding:
$\hat\pi_i = \mathrm{argmax}_k\, P(\pi_i = k \mid x)$

Derivation:
$P(\pi_i = k \mid x) = P(\pi_i = k, x) / P(x)$
$= P(x_1, \dots, x_i, \pi_i = k, x_{i+1}, \dots, x_N) / P(x)$
$= P(x_1, \dots, x_i, \pi_i = k)\, P(x_{i+1}, \dots, x_N \mid \pi_i = k) / P(x)$
$= f_k(i)\, b_k(i) / P(x)$
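Combining the two recursions gives posterior decoding directly; a short sketch reusing the illustrative forward/backward functions above:

```python
def posterior_decode(x, a0, a, e):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x), then argmax per position."""
    f, px = forward(x, a0, a, e)
    b, _ = backward(x, a0, a, e)
    post = f * b / px                 # post[k, i] = P(pi_i = k | x)
    return post, post.argmax(axis=0)  # posterior-decoded path pi-hat
```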
4
Posterior Decoding

For each state, Posterior Decoding gives us a curve of the likelihood of that state at each position. That is sometimes more informative than the Viterbi path $\pi^*$.

Posterior Decoding may give an invalid sequence of states (of probability 0). Why?
5
Posterior Decoding

$P(\pi_i = k \mid x) = \sum_\pi P(\pi \mid x)\, \mathbf{1}(\pi_i = k) = \sum_{\pi:\, \pi_i = k} P(\pi \mid x)$

where $\mathbf{1}(A) = 1$ if $A$ is true, 0 otherwise.

[Figure: trellis over $x_1 x_2 x_3 \dots x_N$, states 1 through K, highlighting the paths passing through state $l$ at position $i$ that contribute to $P(\pi_i = l \mid x)$]
6
Variants of HMMs
7
Higher-order HMMs

How do we model “memory” larger than one time point?
First order: $P(\pi_{i+1} = l \mid \pi_i = k) = a_{kl}$
Second order: $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j) = a_{jkl}$
…

A second-order HMM with K states is equivalent to a first-order HMM with $K^2$ states.

[Figure: coin example. States H and T with history-dependent transitions $a_{HT}(\text{prev}=H)$, $a_{HT}(\text{prev}=T)$, $a_{TH}(\text{prev}=H)$, $a_{TH}(\text{prev}=T)$ become first-order pair states HH, HT, TH, TT with transitions $a_{HHT}, a_{HTT}, a_{HTH}, a_{THH}, a_{THT}, a_{TTH}$]
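One way to see the $K^2$ equivalence is to build the expanded first-order transition matrix explicitly; a sketch assuming the second-order probabilities are given as a tensor a2[j, k, l] (names are illustrative):

```python
import numpy as np

def expand_second_order(a2):
    """Map a second-order HMM over K states to a first-order HMM over
    K*K pair-states: pair (j, k) can only move to pairs (k, l),
    with probability a2[j, k, l] = P(pi_{i+1}=l | pi_i=k, pi_{i-1}=j)."""
    K = a2.shape[0]
    A = np.zeros((K * K, K * K))
    for j in range(K):
        for k in range(K):
            A[j * K + k, k * K : (k + 1) * K] = a2[j, k, :]
    return A  # rows sum to 1; transitions to inconsistent pairs stay 0
```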
8
Similar Algorithms to 1st Order

$P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j)$

$V_{lk}(i) = \max_j \{ V_{kj}(i - 1) + \dots \}$

Time? Space?
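Working out the answer (not spelled out on the slide): the table $V_{lk}(i)$ has $K^2$ entries per position, and each entry maximizes over $K$ choices of $j$, so time is $O(NK^3)$; space is $O(NK^2)$ with traceback, or $O(K^2)$ if only the score is needed.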
9
Modeling the Duration of States

[Figure: two states X and Y; X loops to itself with probability $p$ and moves to Y with probability $1-p$; Y loops with probability $q$ and returns with $1-q$]

Length distribution of region X: geometric, with mean $E[l_X] = 1/(1-p)$.

This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.
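A one-line derivation of the geometric duration (using the self-loop probability $p$ from the figure): staying in X for exactly $l$ steps means $l-1$ self-loops followed by one exit, so

$P(l_X = l) = p^{\,l-1}(1-p), \qquad E[l_X] = \sum_{l \ge 1} l\, p^{\,l-1}(1-p) = \frac{1}{1-p}.$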
10
Example: exon lengths in genes
11
Solution 1: Chain several states

[Figure: several copies of state X chained in sequence, each with self-loop probability $p$, before the transition to Y]

Disadvantage: still very inflexible. $l_X = C + (\text{geometric with mean } 1/(1-p))$
12
Solution 2: Negative binomial distribution

[Figure: states $X^{(1)}, X^{(2)}, \dots, X^{(n)}$, each with self-loop probability $p$ and forward transition $1-p$, followed by Y]

Duration in X: $m$ turns, where
- during the first $m - 1$ turns, exactly $n - 1$ arrows to the next state are followed;
- during the $m$-th turn, an arrow to the next state is followed.

$P(l_X = m) = \binom{m-1}{n-1} (1-p)^{(n-1)+1}\, p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^n\, p^{m-n}$
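As a sanity check (not on the slide): $l_X$ is a sum of $n$ independent geometric durations, so its mean is $E[l_X] = n/(1-p)$, and choosing $n > 1$ gives a length distribution peaked away from 1, unlike the plain geometric.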
13
Example: genes in prokaryotes

EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A). Uses a negative binomial with n = 3.
14
Solution 3: Duration modeling

Upon entering a state:
1. Choose duration d according to a probability distribution
2. Generate d letters according to emission probabilities
3. Take a transition to the next state according to transition probabilities

[Figure: state F with duration $d < D_f$ emitting $x_i \dots x_{i+d-1}$, duration distribution $P_f$]

Disadvantage: increase in complexity of Viterbi.
Time: O(D); Space: O(1), where D = maximum duration of a state.
(Warning: Rabiner’s tutorial claims O(D²) and O(D) increases.)
15
Viterbi with duration modeling

Recall the original iteration:
$V_l(i) = \max_k V_k(i-1)\, a_{kl}\, e_l(x_i)$

New iteration:
$V_l(i) = \max_k \max_{d = 1 \dots D_l} V_k(i-d)\, P_l(d)\, a_{kl} \prod_{j=i-d+1}^{i} e_l(x_j)$

[Figure: states F and L, each emitting a block of $d < D$ letters ($x_i \dots x_{i+d-1}$ and $x_j \dots x_{j+d-1}$) with duration distributions $P_f$, $P_l$]

Precompute cumulative values.
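A sketch of duration-modeled Viterbi in log space, using the cumulative-emission precomputation the slide suggests (the names log_pi, log_a, logP_dur, log_e and the initialization convention are assumptions, not from the slides):

```python
import numpy as np

def viterbi_duration(x, log_pi, log_a, logP_dur, log_e, D):
    """V[l, i] = best log-prob of x[0..i] whose last segment is state l ending at i.
    logP_dur[l, d-1] = log P_l(d); the extra max over d is the O(D) factor."""
    K, N = log_a.shape[0], len(x)
    cum = np.cumsum(log_e[:, x], axis=1)   # cum[l, i] = sum_{j<=i} log e_l(x_j)
    V = np.full((K, N), -np.inf)
    for i in range(N):
        for l in range(K):
            for d in range(1, min(D, i + 1) + 1):
                # emissions of x[i-d+1 .. i] in O(1) via the cumulative sums
                emis = cum[l, i] - (cum[l, i - d] if i >= d else 0.0)
                if d == i + 1:             # the segment starts the sequence
                    prev = log_pi[l]
                else:                      # best predecessor state ending at i-d
                    prev = (V[:, i - d] + log_a[:, l]).max()
                score = prev + logP_dur[l, d - 1] + emis
                if score > V[l, i]:
                    V[l, i] = score
    return V                               # overall best: V[:, N-1].max()
```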
16
Proteins, Pair HMMs, and Alignment
17
A state model for alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

States: M (+1, +1), I (+1, 0), J (0, +1)

Alignments correspond 1-to-1 with sequences of states M, I, J.
18
Let’s score the transitions

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

States: M (+1, +1), I (+1, 0), J (0, +1), scored with $s(x_i, y_j)$ for match/mismatch and $-d$, $-e$ for gap open and gap extension.

Alignments correspond 1-to-1 with sequences of states M, I, J.
19
Alignment with affine gaps – state version

Dynamic programming:
M(i, j): optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in M
I(i, j): optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in I
J(i, j): optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in J

The score is additive, therefore we can apply DP recurrence formulas.
20
Alignment with affine gaps – state version

Initialization:
$M(0, 0) = 0$; $M(i, 0) = M(0, j) = -\infty$ for $i, j > 0$
$I(i, 0) = -d - (i-1)e$; $J(0, j) = -d - (j-1)e$

Iteration:
$M(i, j) = s(x_i, y_j) + \max \{ M(i-1, j-1),\ I(i-1, j-1),\ J(i-1, j-1) \}$
$I(i, j) = \max \{ M(i-1, j) - d,\ I(i-1, j) - e \}$
$J(i, j) = \max \{ M(i, j-1) - d,\ J(i, j-1) - e \}$

Termination: optimal alignment given by $\max \{ M(m, n),\ I(m, n),\ J(m, n) \}$
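A runnable sketch of this three-matrix recurrence (Gotoh-style; the function and parameter names are mine, not the slides’):

```python
import numpy as np

def affine_align(x, y, s, d, e):
    """Global alignment score with affine gaps, following the M/I/J recurrence.
    s(a, b): substitution score; d, e: positive gap-open / gap-extend penalties."""
    m, n = len(x), len(y)
    M = np.full((m + 1, n + 1), -np.inf)
    I = np.full((m + 1, n + 1), -np.inf)   # x_i aligned against a gap
    J = np.full((m + 1, n + 1), -np.inf)   # y_j aligned against a gap
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        I[i, 0] = -d - (i - 1) * e
    for j in range(1, n + 1):
        J[0, j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i, j] = s(x[i-1], y[j-1]) + max(M[i-1, j-1], I[i-1, j-1], J[i-1, j-1])
            I[i, j] = max(M[i-1, j] - d, I[i-1, j] - e)
            J[i, j] = max(M[i, j-1] - d, J[i, j-1] - e)
    return max(M[m, n], I[m, n], J[m, n])

# e.g. affine_align("AGGCTA", "AGCTA", lambda a, b: 1 if a == b else -1, d=3, e=1)
```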
21
Brief introduction to the evolution of proteins

- Protein sequence and structure
- Protein classification
- Phylogeny trees
- Substitution matrices
22
Structure Determines Function

What determines structure?
- Energy
- Kinematics

How can we determine structure?
- Experimental methods
- Computational predictions – the Protein Folding Problem
23
Primary Structure: Sequence The primary structure of a protein is the amino acid sequence
24
Primary Structure: Sequence Twenty different amino acids have distinct shapes and properties
25
Primary Structure: Sequence A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"
26
Secondary Structure: α, β, & loops

α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms.
27
Tertiary Structure: A Protein Fold
28
Actin structure
29
Actin sequence

Actin is ancient and abundant:
- Most abundant protein in cells
- 1-2 actin genes in bacteria, yeasts, amoebas
- Humans: 6 actin genes; α-actin in muscles; β-actin, γ-actin in non-muscle cells
- ~4 amino acids different between each version

MUSCLE ACTIN Amino Acid Sequence
1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF
30
A related protein in bacteria
31
Relation between sequence and structure
32
Protein Phylogenies Proteins evolve by both duplication and species divergence
33
Protein Phylogenies – Example
34
PDB Growth

[Figure: chart of new PDB structures deposited per year]
35
Only a few folds are found in nature
36
Substitutions of Amino Acids Mutation rates between amino acids have dramatic differences!
37
Substitution Matrices

BLOSUM matrices:
1. Start from the BLOCKS database (curated, gap-free alignments)
2. Cluster sequences at > X% identity
3. Calculate $A_{ab}$: the number of aligned a-b pairs between distinct clusters, weighting each by $1/(mn)$, where $m$, $n$ are the two cluster sizes
4. Estimate $P(a) = \left(\sum_b A_{ab}\right) / \left(\sum_{c \le d} A_{cd}\right)$; $P(a, b) = A_{ab} / \left(\sum_{c \le d} A_{cd}\right)$
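A sketch of steps 3-4 under assumed inputs (alignment columns plus a precomputed clustering; all names here are illustrative, not from the slides):

```python
from collections import defaultdict
from itertools import combinations

def blosum_probs(columns, cluster_of, cluster_size):
    """columns: list of gap-free alignment columns (one residue per sequence);
    cluster_of[s]: cluster id of sequence s; cluster_size[c]: size of cluster c.
    Returns P(a, b) over unordered pairs and P(a), per the slide's formulas."""
    A = defaultdict(float)
    for col in columns:
        for s, t in combinations(range(len(col)), 2):
            cs, ct = cluster_of[s], cluster_of[t]
            if cs == ct:
                continue                     # only pairs across distinct clusters
            pair = tuple(sorted((col[s], col[t])))
            A[pair] += 1.0 / (cluster_size[cs] * cluster_size[ct])
    total = sum(A.values())                  # sum_{c<=d} A_cd
    P_ab = {pair: v / total for pair, v in A.items()}
    P_a = defaultdict(float)
    for (a, b), v in A.items():              # P(a) = (sum_b A_ab) / total
        P_a[a] += v / total
        if b != a:
            P_a[b] += v / total
    return P_ab, P_a
```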
38
Probabilistic interpretation of an alignment

An alignment is a hypothesis that the two sequences are related by evolution.

Goals:
- Produce the most likely alignment
- Assess the likelihood that the sequences are indeed related
39
A Pair HMM for alignments

[Figure: Model M. Match state M emits pairs with $P(x_i, y_j)$; insertion states I, J emit single letters with $P(x_i)$, $P(y_j)$; M stays in M with probability $1 - 2\delta$ and opens a gap with probability $\delta$; gap states extend with $\varepsilon$ and return to M with $1 - \varepsilon$]

This model generates two sequences simultaneously.
- Match/mismatch state M: $P(x, y)$ reflects substitution frequencies between pairs of amino acids
- Insertion states I, J: $P(x)$, $P(y)$ reflect frequencies of each amino acid
- $\delta$: set so that $1/(2\delta)$ is the average length before the next gap
- $\varepsilon$: set so that $1/(1 - \varepsilon)$ is the average length of a gap
40
A Pair HMM for unaligned sequences

[Figure: Model R. Independent insertion states I and J, each with self-loop probability 1, emitting letters with $P(x_i)$ and $P(y_j)$]

The two sequences are generated independently of one another:
$P(x, y \mid R) = P(x_1) \cdots P(x_m)\, P(y_1) \cdots P(y_n) = \prod_i P(x_i) \prod_j P(y_j)$
41
To compare ALIGNMENT vs. RANDOM hypothesis

Every pair of letters contributes:
- Model M: $(1 - 2\delta)\, P(x_i, y_j)$ when matched; $\varepsilon\, P(x_i)\, P(y_j)$ when gapped
- Model R: $P(x_i)\, P(y_j)$

Focus on the comparison of $P(x_i, y_j)$ vs. $P(x_i)\, P(y_j)$.
42
To compare ALIGNMENT vs. RANDOM hypothesis

Every pair of letters contributes:
- Model M: $(1 - 2\delta)\, P(x_i, y_j)$ when matched; $\varepsilon\, P(x_i)\, P(y_j)$ when gapped; plus some extra term for gap opening
- Model R: $P(x_i)\, P(y_j)$

Focus on the comparison of $P(x_i, y_j)$ vs. $P(x_i)\, P(y_j)$.

[Figure: Model M redrawn with the gap-opening transition rescaled to $\delta(1 - \varepsilon)/(1 - 2\delta)$ so that per-letter contributions become directly comparable to Model R. Equivalent!]
43
To compare ALIGNMENT vs. RANDOM hypothesis

Idea: divide the Model M alignment score by the Model R score, and take logarithms.

Let
$s(x_i, y_j) = \log \dfrac{P(x_i, y_j)}{P(x_i)\, P(y_j)} + \log(1 - 2\delta)$  (definition: substitution score)

$d = -\log \dfrac{\delta (1 - \varepsilon)\, P(x_i)}{(1 - 2\delta)\, P(x_i)}$  (definition: gap initiation penalty)

$e = -\log \dfrac{\varepsilon\, P(x_i)}{P(x_i)}$  (definition: gap extension penalty)
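Working the definitions through (the numeric values below are illustrative, not from the slides): the $P(x_i)$ factors cancel, so

$d = -\log \frac{\delta(1 - \varepsilon)}{1 - 2\delta}, \qquad e = -\log \varepsilon.$

For instance, $\delta = 0.1$, $\varepsilon = 0.5$ give $d = -\log(0.05/0.8) \approx 2.77$ and $e = -\log 0.5 \approx 0.69$ in natural-log units.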
44
The meaning of alignment scores

The Viterbi algorithm for Pair HMMs corresponds exactly to the global alignment DP with affine gaps:

$V^M(i, j) = \max \{ V^M(i-1, j-1),\ V^I(i-1, j-1),\ V^J(i-1, j-1) \} + s(x_i, y_j)$
$V^I(i, j) = \max \{ V^M(i-1, j) - d,\ V^I(i-1, j) - e \}$
$V^J(i, j) = \max \{ V^M(i, j-1) - d,\ V^J(i, j-1) - e \}$

- $s(\cdot, \cdot)$, with its $(1 - 2\delta)$ term: how often a pair of letters substitute one another
- $1 - \varepsilon$: 1/(mean length of a gap)
- $\delta(1 - \varepsilon)/(1 - 2\delta)$: 1/(mean arrival time of the next gap)
45
The meaning of alignment scores

Match/mismatch scores:
$s(a, b) \approx \log \dfrac{P(x_i, y_j)}{P(x_i)\, P(y_j)}$  (ignore $\log(1 - 2\delta)$ for the moment)

Example: genes between human and mouse have an average conservation of 80%.

1. Let’s calculate this way the substitution score for a match:
$P(a,a) + P(c,c) + P(g,g) + P(t,t) = 0.8 \Rightarrow P(x, x) = 0.2$
$P(a) = P(c) = P(g) = P(t) = 0.25$
$s(x, x) = \log[0.2 / 0.25^2] = 1.163$

2. …and for a mismatch:
$P(a,c) + \dots + P(t,g) = 0.2 \Rightarrow P(x, y \ne x) = 0.2/12 = 0.0167$
$s(x, y \ne x) = \log[0.0167 / 0.25^2] = -1.322$

3. What ratio matches/(matches + mismatches) gives score 0?
$1.163\,(\#\text{match}) - 1.322\,(\#\text{mism}) = 0 \Rightarrow \#\text{match} = 1.137\,(\#\text{mism}) \Rightarrow \text{matches} = 53.2\%$
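A few lines of Python reproduce the slide’s arithmetic (natural logs, as the values 1.163 and -1.322 imply):

```python
from math import log

p_match, p_mism, p_base = 0.2, 0.2 / 12, 0.25      # numbers from the slide
s_match = log(p_match / p_base**2)                  # 1.163
s_mism = log(p_mism / p_base**2)                    # -1.322
# break-even match fraction m: s_match*m + s_mism*(1 - m) = 0
m = -s_mism / (s_match - s_mism)
print(round(s_match, 3), round(s_mism, 3), round(m, 3))  # 1.163 -1.322 0.532
```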