CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models


1 CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models
Model comparison CISC841, F07, Liao

2 How to tell if two HMMs are equivalent?
If not equivalent, how (dis-)similar are they?
Remember: HMMs are generative. Given a sequence x, P(x|M) is the probability that x is generated by the model M.
How to compare two probability distributions?
Mutual entropy (also known as relative entropy or Kullback-Leibler divergence):
$H(M, M') = \sum_x P(x|M) \log [ P(x|M) / P(x|M') ]$
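The sum over x can be evaluated directly only for toy cases. The following is a minimal sketch, assuming a small HMM without an explicit end state, so that each model defines a distribution over sequences of one fixed length. The two models `M1` and `M2`, the alphabet, and the helper names are made up for illustration and are not from the slides. P(x|M) is computed with the forward algorithm and the sum runs over every sequence of the given length.

```python
import math
from itertools import product

def forward_prob(seq, init, trans, emit):
    """P(seq | M) by the forward algorithm (HMM without an explicit end state)."""
    states = range(len(init))
    f = [init[s] * emit[s][seq[0]] for s in states]
    for sym in seq[1:]:
        f = [sum(f[r] * trans[r][s] for r in states) * emit[s][sym]
             for s in states]
    return sum(f)

def relative_entropy(m1, m2, alphabet, length):
    """Sum over all sequences x of the given length of P(x|m1) log[P(x|m1)/P(x|m2)]."""
    total = 0.0
    for seq in product(alphabet, repeat=length):
        p = forward_prob(seq, *m1)
        q = forward_prob(seq, *m2)
        if p > 0:  # terms with P(x|m1) = 0 contribute nothing
            total += p * math.log(p / q)
    return total

# Hypothetical two-state models: (initial probs, transition matrix, emissions).
EMIT = [{"A": 0.8, "B": 0.2}, {"A": 0.2, "B": 0.8}]
M1 = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], EMIT)
M2 = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], EMIT)

print(relative_entropy(M1, M2, "AB", 4))  # positive: the models differ
print(relative_entropy(M1, M1, "AB", 4))  # 0.0: a model compared with itself
```

The loop visits |alphabet|^length sequences, so this brute-force evaluation is only usable for very short sequences.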

3 Mutual entropy: $H(p|q) \geq 0$ (why?)
$H(p|q) = 0$ iff $p = q$
Complexity of comparing HMMs
- It is proved to be NP-hard. (Lyngso and Pedersen, LNCS, 2001, 2223)
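One standard way to answer the "why?" is Jensen's inequality applied to the concave logarithm. The derivation below is a sketch of that argument, assuming q(x) > 0 wherever p(x) > 0; it is not taken from the slides.

```latex
\begin{align*}
-H(p|q) &= \sum_{x:\,p(x)>0} p(x) \log \frac{q(x)}{p(x)} \\
        &\le \log \sum_{x:\,p(x)>0} p(x)\,\frac{q(x)}{p(x)}
           && \text{(Jensen, since $\log$ is concave)} \\
        &= \log \sum_{x:\,p(x)>0} q(x) \;\le\; \log 1 = 0 .
\end{align*}
```

Both inequalities are equalities only when q(x)/p(x) is constant on the support of p and q puts all its mass there, which forces p = q.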

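4 Hidden Markov Model
[Figure: profile HMM with match (M), insert (I) and delete (D) states between Start and End, built from the alignment below; three columns are marked X as match positions 1, 2, 3.]

bat   A G - - - C
rat   A - A G - C
cat   A G - A A -
gnat  - - A A A C
goat  A G - - - C

Observed emission/transition counts
[Table: emission counts for A, C, G, T and transition counts MM, MD, MI, IM, ID, II, DM, DD, DI by node position; the values are not preserved in this transcript.]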
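The emission and transition counts can be tallied mechanically from the bat/rat/cat/gnat/goat alignment once the match columns are fixed. The sketch below assumes that the first, second and last alignment columns are the three match positions (the exact placement of the X marks is not recoverable here) and that Start and End are counted as match states; both choices, and all names in the code, are assumptions for illustration.

```python
from collections import Counter

ALIGNMENT = {
    "bat":  "AG---C",
    "rat":  "A-AG-C",
    "cat":  "AG-AA-",
    "gnat": "--AAAC",
    "goat": "AG---C",
}
MATCH_COLUMNS = [0, 1, 5]  # assumption: 0-based indices of the marked columns

def count_profile_hmm(alignment, match_columns):
    """Return (emissions, transitions) counted along each sequence's state path."""
    emissions = Counter()    # (state type, node, residue) -> count
    transitions = Counter()  # (source node, e.g. "MD") -> count
    match_set = set(match_columns)
    for row in alignment.values():
        prev, node = "M", 0  # Start is treated as the match state of node 0
        for col, ch in enumerate(row):
            if col in match_set:
                state, new_node = ("M" if ch != "-" else "D"), node + 1
            elif ch != "-":
                state, new_node = "I", node  # insert state stays at the node
            else:
                continue  # gap in an insert column: no state is visited
            transitions[(node, prev + state)] += 1
            if state != "D":  # delete states emit nothing
                emissions[(state, new_node, ch)] += 1
            prev, node = state, new_node
        transitions[(node, prev + "M")] += 1  # End is treated as a match state
    return emissions, transitions

emissions, transitions = count_profile_hmm(ALIGNMENT, MATCH_COLUMNS)
for key in sorted(transitions):
    print(key, transitions[key])
```

In practice such raw counts are usually combined with pseudocounts before being turned into probabilities, so that unseen events do not get probability zero.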

5 Comparison levels for homology detection
Sequence-to-sequence (pairwise)
- for proteins with relatively high sequence identity
- dynamic programming methods
Sequence-to-profile
- for distant relationships and improved alignment accuracy
- PSI-BLAST, HMMER, SAM
Profile-to-profile
- for more sensitivity and accuracy of alignment
- COMPASS, Prof_sim
[Figure: three panels, labeled "Seq dbase", "Prof dbase / Seq dbase" and "Prof dbase", each pairing a query with a hit; the query and hit are single sequences or multiple alignments depending on the comparison level.]

6 Performance quantifiers
Ability to detect distant relationships
- sensitivity
- specificity
Accuracy of alignment prediction (when compared to the corresponding structure-based alignment)

[Table: columns query, hit, e-val, relationship, tp, fp. Query: HBA_HUMAN. Hits: HBB_HUMAN, MYG_PHYCA, GLB3_CHITP, GLB5_PETMA, GLB2_LUPLU, GLB1_GLYDI. The numeric values are not preserved in this transcript.]
tp: true positive count
fp: false positive count
relationship is +1 if the query and hit sequences are related at the superfamily level

Sequence-based alignment:
V G A H A G E Y
A G A H D G E F
Structure-based alignment:
G A H A G E
G A H D G E

Modeler's accuracy metric: Qm = Nc / Nseq
Developer's accuracy metric: Qd = Nc / Nstr
Combined metric: Qc = Nc / (Nseq + Nstr - Nc)
where
Nc = number of aligned pairs common to both alignments
Nseq = number of aligned pairs in the sequence-based alignment
Nstr = number of aligned pairs in the structure-based alignment
[Table: example values of sid, Qm, Qd and Qc; only the fragments "5/ / /6 5/8" survive in this transcript.]
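The three accuracy metrics follow directly from the sets of aligned residue pairs. Here is a minimal sketch, assuming each alignment is represented as a set of (query position, hit position) pairs; the function name and the example pairs are made up for illustration and are not the example from the slide.

```python
def alignment_accuracy(seq_pairs, str_pairs):
    """Qm, Qd and Qc from two alignments given as sets of aligned position pairs."""
    seq_pairs, str_pairs = set(seq_pairs), set(str_pairs)
    n_c = len(seq_pairs & str_pairs)   # Nc: pairs common to both alignments
    n_seq = len(seq_pairs)             # Nseq: pairs in the sequence-based alignment
    n_str = len(str_pairs)             # Nstr: pairs in the structure-based alignment
    return {
        "Qm": n_c / n_seq,
        "Qd": n_c / n_str,
        "Qc": n_c / (n_seq + n_str - n_c),
    }

# Hypothetical data: 8 sequence-based pairs, 6 structure-based pairs, 5 shared.
sequence_based = {(i, i) for i in range(8)}
structure_based = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 7)}
print(alignment_accuracy(sequence_based, structure_based))
```

Qm penalizes over-prediction by the sequence-based method, Qd penalizes missed structural pairs, and Qc penalizes both, because its denominator is the size of the union of the two pair sets.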

7 On profile-profile comparisons
[Figure: two multiple sequence alignments (columns 1 to 20) are converted, with the help of a substitution matrix, into numeric profiles of lengths L1 and L2.]
From MSA to numeric profiles
- sampling
- dropping columns
Alignment of numeric profiles
- scoring functions
- dynamic programming alignment
Example: COMPASS (Sadreyev et al., J. Mol. Biol. (2003) 326, pp. 317–336.)

8 Quasi consensus based comparison of HMMs
Build profile HMMs using existing packages (SAM-T99 or HMMER)
Generation of a quasi consensus sequence from each model
Alignment of the consensus sequence of one model with the other model
Extraction of the two alignments, one in each direction
[Figure: the Seed 1 and Seed 2 alignments give models M1 and M2, with Consensus 1 (V G A N V A E H) and Consensus 2 (V K A T I A E H). Aligning Consensus 2 to M1 gives the score S(c2|M1) and alignment Aln21; aligning Consensus 1 to M2 gives the score S(c1|M2) and alignment Aln12.]

9 Benchmark experiment I: Detection ability
All-vs-all comparisons of 569 MSAs from (Wang and Dunbrack, 2004) using COMPASS and QC-COMP.
Two MSAs are said to be related if their seed sequences are from the same SCOP superfamily.
In all-vs-all comparisons using QC-COMP, the ith HMM is used to score the consensus sequences from the remaining 568 HMMs, and the resulting scores are transformed into z-scores:
$z_i(c_k) = [s_i(c_k) - \langle s \rangle] / \sigma$
where $\langle s \rangle$ and $\sigma$ denote the mean and standard deviation of the scores $s_i(c_k)$.
$M_i = \{ z_i(c_1), z_i(c_2), \ldots, z_i(c_{i-1}), z_i(c_{i+1}), \ldots, z_i(c_{569}) \}$
$M_j = \{ z_j(c_1), z_j(c_2), \ldots, z_j(c_{j-1}), z_j(c_{j+1}), \ldots, z_j(c_{569}) \}$
$d_{ij} = M_i \cdot e_j = z_i(c_j)$: asymmetric similarity measure between $M_i$ and $M_j$
$d_{ij} = M_i \cdot e_j + M_j \cdot e_i = z_i(c_j) + z_j(c_i)$: symmetric similarity measure between $M_i$ and $M_j$
The same experiment is repeated using seed sequences instead of consensus sequences.
For COMPASS, the ith profile is compared with the remaining 568 profiles and the scores are transformed into z-scores. The same similarity measures are used. We also consider E-value measures.
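The z-score transformation and the two similarity measures can be sketched on a small score matrix. The code assumes that `scores[i][k]` holds the score of consensus sequence k under model i, that the diagonal is ignored, and that the population standard deviation is used as the divisor; the matrix values and function names are invented for illustration.

```python
from statistics import mean, pstdev

def zscore_rows(scores):
    """z[i][k] = (s_i(c_k) - mean_i) / std_i over all k != i; diagonal stays None."""
    n = len(scores)
    z = [[None] * n for _ in range(n)]
    for i in range(n):
        others = [scores[i][k] for k in range(n) if k != i]
        mu, sigma = mean(others), pstdev(others)  # assumes the scores are not all equal
        for k in range(n):
            if k != i:
                z[i][k] = (scores[i][k] - mu) / sigma
    return z

def asymmetric(z, i, j):
    """d_ij = z_i(c_j): consensus of model j scored by model i."""
    return z[i][j]

def symmetric(z, i, j):
    """d_ij = z_i(c_j) + z_j(c_i): both directions combined."""
    return z[i][j] + z[j][i]

# Hypothetical raw scores for four models (diagonal entries are unused).
scores = [
    [0.0, 9.0, 3.0, 3.0],
    [7.0, 0.0, 1.0, 1.0],
    [2.0, 2.0, 0.0, 8.0],
    [4.0, 4.0, 10.0, 0.0],
]
z = zscore_rows(scores)
print(asymmetric(z, 0, 1), symmetric(z, 0, 1))
```

Normalizing each row separately puts the scores produced by different HMMs on a common scale, which is what makes adding the two directions in the symmetric measure meaningful.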

10 Results for detection ability experiment
[Table: ROC values for COMPASS, SEED and CON under the sym, asym and e-value measures; the numeric values are not preserved in this transcript.]

11 Benchmark experiment II: Alignment accuracy
2305 pairs of MSAs from (Wang and Dunbrack, 2004) were aligned using COMPASS and QC-COMP.
The same experiment is repeated using seed sequences instead of consensus sequences.
[Table: columns Region, Identity range, #Pairs; the values are not preserved in this transcript.]
Accuracy parameters: Qm, Qd and Qc
Extraction schemes
- MAX
- AND
- AND1
- AND2
- AND3
[Figure: for COMPASS, the two MSAs are aligned as profiles and an alignment of the seed sequences is extracted. For QC-COMP, the alignments Aln21 and Aln12 are obtained from S(c2|M1) and S(c1|M2), and an alignment is extracted from them.]

12 Results for alignment accuracy experiment
Consensus based

13 Results for alignment accuracy experiment
Seed based

14 Results for alignment accuracy experiment
Mixing scheme: if the symmetric similarity measure between a pair of HMMs is less than -22.0, the seed-based alignment is taken; otherwise, the consensus-based alignment is chosen. The threshold -22.0 was determined using a separate training set (1136 pairs of HMMs).
Mix
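The mixing rule is a single threshold test on the symmetric similarity measure. A minimal sketch follows; the function and parameter names are hypothetical, and the alignments are passed in as opaque objects.

```python
MIX_THRESHOLD = -22.0  # threshold reported on the slide

def choose_alignment(sym_similarity, seed_alignment, consensus_alignment,
                     threshold=MIX_THRESHOLD):
    """Pick the seed-based alignment for low-similarity pairs, else consensus-based."""
    if sym_similarity < threshold:
        return seed_alignment
    return consensus_alignment

print(choose_alignment(-30.0, "seed-based", "consensus-based"))  # seed-based
print(choose_alignment(4.5, "seed-based", "consensus-based"))    # consensus-based
```

A pair whose similarity equals the threshold exactly goes to the consensus-based branch, since the rule says "less than".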
