CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models


CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models Model comparison CISC841, F07, Liao

How to tell if two HMMs are equivalent? If they are not equivalent, how (dis-)similar are they?
Remember: HMMs are generative. Given a sequence x, P(x|M) is the probability that x is generated by the model M.
How to compare two probability distributions? Use the mutual entropy (more commonly called relative entropy or Kullback-Leibler divergence):
H(M, M') = Σ_x P(x|M) log [P(x|M) / P(x|M')]
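The sum in the relative entropy can be evaluated directly only when the set of sequences is small enough to enumerate. The sketch below does that for two toy distributions; the dictionaries `p_m1` and `p_m2` and the function name `relative_entropy` are hypothetical illustrations, not output of real HMMs.

```python
import math


def relative_entropy(p, q):
    """Return sum_x p(x) * log(p(x) / q(x)) over outcomes x with p(x) > 0.

    p and q map each sequence x to its probability under a model.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


# Hypothetical toy distributions over three sequences (assumed values).
p_m1 = {"AC": 0.5, "AG": 0.3, "GC": 0.2}
p_m2 = {"AC": 0.4, "AG": 0.4, "GC": 0.2}

print(relative_entropy(p_m1, p_m2))  # small positive number
print(relative_entropy(p_m1, p_m1))  # 0.0
```

Note that the measure is not symmetric: swapping the two arguments generally gives a different value.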

Mutual entropy: H(p|q) ≥ 0 (why?)
H(p|q) = 0 iff p = q
Complexity of comparing HMMs
- It is proved to be NP-hard (Lyngso and Pedersen, LNCS, 2001, 2223:416-428).
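One standard way to answer the "why?" is Jensen's inequality applied to the concave logarithm. The following is a sketch, writing p(x) for P(x|M) and q(x) for P(x|M') and restricting the sums to sequences with p(x) > 0 (this notation is an assumption made for the derivation):

```latex
\begin{aligned}
-H(p|q) &= \sum_{x:\,p(x)>0} p(x)\,\log\frac{q(x)}{p(x)} \\
        &\le \log \sum_{x:\,p(x)>0} p(x)\,\frac{q(x)}{p(x)} \\
        &= \log \sum_{x:\,p(x)>0} q(x) \;\le\; \log 1 = 0 .
\end{aligned}
```

Because the logarithm is strictly concave, both inequalities are tight only when q(x)/p(x) is constant and q sums to 1 on the support of p, that is, when p = q.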

Hidden Markov Model
(Figure: profile HMM with Start and End states and match (M), insert (I) and delete (D) states at node positions 1 to 3.)

Training alignment (X marks a match column, . marks an insert column):

        X X . . . X
bat     A G - - - C
rat     A - A G - C
cat     A G - A A -
gnat    - - A A A C
goat    A G - - - C

Observed emission/transition counts

                        node position
                        0   1   2   3
                        -------------
match emissions    A    -   4   0   0
                   C    -   0   0   4
                   G    -   0   3   0
                   T    -   0   0   0
insert emissions   A    0   0   6   0
                   C    0   0   0   0
                   G    0   0   1   0
                   T    0   0   0   0
transitions        MM   4   3   2   4
                   MD   1   1   0   0
                   MI   0   0   1   0
                   IM   0   0   2   0
                   ID   0   0   1   0
                   II   0   0   4   0
                   DM   -   0   0   1
                   DD   -   1   0   0
                   DI   -   0   2   0
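The emission counts on this slide can be reproduced from the five-sequence alignment. The sketch below assumes the columns marked X are match columns and the columns marked . are insert columns, and it covers emissions only (transition counts are left out). The names `alignment`, `match_mask` and `emission_counts` are hypothetical.

```python
from collections import Counter

# Alignment from the slide; "-" is a gap.
alignment = {
    "bat":  "AG---C",
    "rat":  "A-AG-C",
    "cat":  "AG-AA-",
    "gnat": "--AAAC",
    "goat": "AG---C",
}
match_mask = "XX...X"  # X = match column, . = insert column (assumed reading)


def emission_counts(alignment, match_mask):
    """Count residues emitted by match and insert states at each node."""
    n_nodes = match_mask.count("X")
    match = [Counter() for _ in range(n_nodes + 1)]   # match[0] stays empty
    insert = [Counter() for _ in range(n_nodes + 1)]
    for seq in alignment.values():
        node = 0  # index of the most recent match column
        for col, residue in zip(match_mask, seq):
            if col == "X":
                node += 1
                if residue != "-":
                    match[node][residue] += 1
            elif residue != "-":
                insert[node][residue] += 1
    return match, insert


match, insert = emission_counts(alignment, match_mask)
print(match[1]["A"], match[2]["G"], match[3]["C"])  # 4 3 4
print(insert[2]["A"], insert[2]["G"])               # 6 1
```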

Comparison levels for homology detection
Sequence-to-sequence (pairwise)
- for proteins with relatively high sequence identity
- dynamic programming methods
Sequence-to-profile
- for distant relationships and improved alignment accuracy
- PSI-BLAST, HMMER, SAM
Profile-to-profile
- for more sensitivity and accuracy of alignment
- COMPASS, Prof_sim
(Figure: example query and hit at each level, drawn from sequence databases (Seq dbase) and profile databases (Prof dbase).)

Performance quantifiers
Ability to detect distant relationships
- sensitivity
- specificity
Accuracy of alignment prediction (when compared to the corresponding structure-based alignment)

query        hit          e-val      relationship   tp   fp
HBA_HUMAN    HBB_HUMAN    3.87e-60   +1             1    0
             MYG_PHYCA    5.02e-23   +1             2    0
             GLB3_CHITP   5.60e-4    +1             3    0
             GLB5_PETMA   1.43e-1    -1             3    1
             GLB2_LUPLU   1.56e+1    +1             4    1
             GLB1_GLYDI   1.45e+3    -1             4    2

tp: true positive count
fp: false positive count
relationship is +1 if the query and hit sequences are related at the superfamily level

Sequence-based alignment:
V G A H - A G E Y
A G A H D - G E F

Structure-based alignment:
G A H A G E
G A H D G E

Modeler's accuracy metric (Qm) = Nc / Nseq
Developer's accuracy metric (Qd) = Nc / Nstr
Combined metric (Qc) = Nc / (Nseq + Nstr - Nc)
where
Nc = number of aligned pairs common to both alignments
Nseq = number of aligned pairs in the sequence-based alignment
Nstr = number of aligned pairs in the structure-based alignment

For the example alignments:
sid   Qm    Qd    Qc
5/6   5/7   5/6   5/8
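The three accuracy metrics can be checked on the example alignments from this slide. The sketch below represents each alignment as a set of (i, j) residue-index pairs. It assumes the structure-based fragment starts at the second residue of each sequence (index 1), which is how the fragment lines up with the full sequences; the function names are hypothetical.

```python
from fractions import Fraction


def aligned_pairs(row_a, row_b, start_a=0, start_b=0):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows."""
    pairs = set()
    i, j = start_a, start_b
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def accuracy_metrics(seq_pairs, str_pairs):
    """Return (Qm, Qd, Qc) as exact fractions."""
    n_c = len(seq_pairs & str_pairs)
    n_seq, n_str = len(seq_pairs), len(str_pairs)
    return (
        Fraction(n_c, n_seq),                # Qm = Nc / Nseq
        Fraction(n_c, n_str),                # Qd = Nc / Nstr
        Fraction(n_c, n_seq + n_str - n_c),  # Qc = Nc / (Nseq + Nstr - Nc)
    )


seq_pairs = aligned_pairs("VGAH-AGEY", "AGAHD-GEF")
# Assumption: the structural fragment begins at index 1 in both sequences.
str_pairs = aligned_pairs("GAHAGE", "GAHDGE", start_a=1, start_b=1)
qm, qd, qc = accuracy_metrics(seq_pairs, str_pairs)
print(qm, qd, qc)  # 5/7 5/6 5/8
```

Using index pairs rather than residue letters matters here: two alignments agree on a pair only if they align the same positions, not merely the same amino acids.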

On profile-profile comparisons
From MSA to numeric profiles
- sampling
- dropping columns
Alignment of numeric profiles
- scoring functions
- dynamic programming alignment
Example: COMPASS (Sadreyev et al., J. Mol. Biol. (2003) 326, pp. 317-336)
(Figure: two MSAs, L1 and L2, with columns numbered 1 to 20 and the labels "Numeric profiles" and "Subs matrix".)

Quasi consensus based comparison of HMMs
- Build profile HMMs using existing packages (SAM-T99 or HMMER)
- Generate a quasi consensus sequence from each model
- Align the consensus sequence of one model with the other model
- Extract two alignments, one in each direction
(Figure: models M1 and M2 built from the Seed 1 and Seed 2 MSAs, their consensus sequences Consensus 1 (V G A N V A E H) and Consensus 2 (V K A T I A E H), the scores S(c2|M1) and S(c1|M2), and the extracted alignments Aln21 and Aln12.)

Benchmark experiment I: Detection ability
All-vs-all comparisons of 569 MSAs from (Wang and Dunbrack, 2004) using COMPASS and QC-COMP. Two MSAs are said to be related if their seed sequences are from the same SCOP superfamily.
In all-vs-all comparisons using QC-COMP, the ith HMM is used to score the consensus sequences from the remaining 568 HMMs, and the resulting scores are transformed into z-scores:
z_i(c_k) = [s_i(c_k) - <s>] / σ
where <s> is the mean and σ the standard deviation of the scores.
M_i = { z_i(c_1), z_i(c_2), ..., z_i(c_{i-1}), z_i(c_{i+1}), ..., z_i(c_569) }
M_j = { z_j(c_1), z_j(c_2), ..., z_j(c_{j-1}), z_j(c_{j+1}), ..., z_j(c_569) }
Asymmetric similarity measure between M_i and M_j: d_ij = M_i · e_j = z_i(c_j)
Symmetric similarity measure between M_i and M_j: d_ij = M_i · e_j + M_j · e_i = z_i(c_j) + z_j(c_i)
The same experiment is repeated using seed sequences instead of consensus sequences.
For COMPASS, the ith profile is compared with the remaining 568 profiles and the scores are transformed into z-scores. The same similarity measures are used. We also consider E-value measures.
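The z-score transform and the two similarity measures can be sketched on a small score matrix. In the sketch below, `scores[i][k]` stands for s_i(c_k), the score that model i assigns to the consensus sequence of model k. The 3 x 3 matrix is made up for illustration, and the use of the population standard deviation over each model's row of scores is an assumption, since the slide does not say which estimator was used.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Convert each row of raw scores to z-scores, skipping the diagonal."""
    n = len(scores)
    z = [[None] * n for _ in range(n)]
    for i in range(n):
        others = [scores[i][k] for k in range(n) if k != i]
        mu, sigma = mean(others), pstdev(others)  # assumed: population std
        for k in range(n):
            if k != i:
                z[i][k] = (scores[i][k] - mu) / sigma
    return z


def asymmetric(z, i, j):
    """d_ij = z_i(c_j)."""
    return z[i][j]


def symmetric(z, i, j):
    """d_ij = z_i(c_j) + z_j(c_i)."""
    return z[i][j] + z[j][i]


# Hypothetical raw scores; diagonal entries are never used.
scores = [
    [0.0, 10.0, 2.0],
    [8.0, 0.0, 4.0],
    [1.0, 3.0, 0.0],
]
z = z_scores(scores)
print(asymmetric(z, 0, 1))  # 1.0
print(symmetric(z, 0, 1))   # 2.0
```

A row whose off-diagonal scores are all equal has zero standard deviation and would need separate handling.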

Results for detection ability experiment

ROC values   COMPASS    SEED       CON
sym          0.883450   0.858050   0.914950
asym         0.839538   0.761250   0.866337
e-value      0.876912   -
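The slides do not state how the ROC values were computed. One common choice is the area under the ROC curve of a ranked hit list, which equals the fraction of (related, unrelated) pairs in which the related hit is ranked higher. The sketch below is an illustration under that assumption, not the exact procedure behind the table. It takes labels ordered from best to worst score, with +1 for related and -1 for unrelated; the function name `roc_area` is hypothetical.

```python
def roc_area(labels):
    """Area under the ROC curve for labels sorted from best to worst score."""
    true_positives = 0
    area = 0
    for label in labels:
        if label > 0:
            true_positives += 1
        else:
            # Each false positive adds a column whose height is the number
            # of true positives ranked above it.
            area += true_positives
    n_pos = true_positives
    n_neg = len(labels) - n_pos
    return area / (n_pos * n_neg)


# Example ranking: three related hits, one unrelated, one related, one unrelated.
print(roc_area([+1, +1, +1, -1, +1, -1]))  # 0.875
```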

Benchmark experiment II: Alignment accuracy
2305 pairs of MSAs from (Wang and Dunbrack, 2004) were aligned using COMPASS and QC-COMP. The same experiment is repeated using seed sequences instead of consensus sequences.

Region   Identity range   #Pairs
1        0.00 - 0.05      58
2        0.05 - 0.10      522
3        0.10 - 0.15      598
4        0.15 - 0.20      382
5        0.20 - 0.25      258
6        0.25 - 0.30      217
7        0.30 - 0.35      162
8        0.35 - 0.40      108

Accuracy parameters: Qm, Qd and Qc
Extraction schemes
- MAX
- AND
- AND1
- AND2
- AND3
(Figure: a COMPASS alignment of two MSAs next to the extracted alignment obtained from Aln21 and Aln12, with scores S(c2|M1) and S(c1|M2).)

Results for alignment accuracy experiment: consensus based

Results for alignment accuracy experiment: seed based

Results for alignment accuracy experiment: mixing scheme
If the symmetric similarity measure between a pair of HMMs is less than -22.0, the seed-based alignment is taken. Otherwise, the consensus-based alignment is chosen.
The threshold -22.0 was determined using a separate training set (1136 pairs of HMMs).
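The mixing rule is a single threshold test on the symmetric similarity measure. A minimal sketch follows, in which the function name `choose_alignment` and the string placeholders for the two alignments are hypothetical; only the threshold of -22.0 comes from the slide.

```python
def choose_alignment(symmetric_score, seed_alignment, consensus_alignment,
                     threshold=-22.0):
    """Pick the seed-based alignment below the threshold, else consensus-based."""
    if symmetric_score < threshold:
        return seed_alignment
    return consensus_alignment


# Placeholder alignments; any alignment representation would work here.
print(choose_alignment(-30.0, "seed-based", "consensus-based"))  # seed-based
print(choose_alignment(5.0, "seed-based", "consensus-based"))    # consensus-based
```

A score exactly equal to the threshold falls on the consensus side, matching the "less than" wording of the rule.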
