Using the Fisher kernel method to detect remote protein homologies. Tommi Jaakkola, Mark Diekhans, David Haussler. ISMB '99. Talk by O, Jangmin (2001/01/16)

Abstract
- Detecting remote protein homologies
- Fisher kernel method
- A variant of support vector machines (SVMs) that uses a new kernel function
  - The kernel is derived from hidden Markov models (HMMs)

Introduction (1)
- Detecting protein homologies (sequence-based algorithms)
  - BLAST, FASTA, PROBE, templates, profiles, position-specific weight matrices, HMMs
- Comparisons of these methods (Brenner 1996; Park et al. 1998)
  - Based on the SCOP classification of protein structures
  - Remote protein homologies exist between protein domains in the same structural superfamily.
  - Statistical models such as PSI-BLAST and HMMs perform better than simple pairwise comparison methods.

Introduction (2)
- Generative statistical models (HMMs)
  - Extract features from protein sequences
  - Map every protein sequence to a point in a Euclidean feature space of fixed dimension
- A general discriminative statistical method then classifies the points.
- Improvement obtained
  - Over HMMs alone

Methods
- How generative models (HMMs) work
  - Training examples (sequences known to be members of the protein family) serve as positive examples.
  - Parameters are tuned using a priori knowledge.
  - The model assigns a probability to any given protein sequence.
  - Sequences from the family should yield a higher probability than sequences from outside the family.
- The log-likelihood ratio is used as the score.
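As a minimal sketch of log-likelihood ratio scoring, the example below replaces the HMM with a much simpler stand-in: an i.i.d. residue model over a hypothetical 4-letter alphabet. The names `FAMILY_MODEL`, `NULL_MODEL`, and `log_odds_score` and all probabilities are illustrative assumptions, not values from the talk.

```python
import math

# Hypothetical toy models over a 4-letter alphabet (stand-ins for an HMM
# and a null model; the probabilities are made up for illustration).
FAMILY_MODEL = {"A": 0.4, "C": 0.3, "D": 0.2, "E": 0.1}
NULL_MODEL = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}


def log_likelihood(seq, model):
    """Log-probability of a sequence under an i.i.d. residue model."""
    return sum(math.log(model[residue]) for residue in seq)


def log_odds_score(seq, family_model=FAMILY_MODEL, null_model=NULL_MODEL):
    """Log-likelihood ratio: positive when the family model explains seq better."""
    return log_likelihood(seq, family_model) - log_likelihood(seq, null_model)


if __name__ == "__main__":
    for seq in ("AACA", "EEDE"):
        print(seq, round(log_odds_score(seq), 3))
```

A sequence rich in residues the family model favors gets a positive score, and one dominated by disfavored residues gets a negative score.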

Discriminative approaches
- Use both positive and negative examples.
- Parameters are tuned so that the model optimally discriminates members of the family from nonmembers.
- When training examples are few
  - The likelihood ratio is optimal if the generative model fits the data perfectly, but...
  - Discriminative methods often perform better.

Kernel methods
- Discriminant function L(X)
  - Defined over training examples {X_i, i = 1,…,n} and hypothesis classes H_1, H_2
  - +: sequences in the family; -: sequences outside the family
- Contribution of each kernel term
  - λ_i: overall importance of the example X_i
  - K(X_i, X): measure of pairwise similarity
- The user supplies the type of kernel for the application area.
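The equation for L(X) did not survive in this transcript. The sketch below assumes the usual kernel-expansion form, L(X) = Σ_i y_i λ_i K(X_i, X), with y_i = +1 for family members and -1 otherwise. The linear kernel, the example vectors, and the weights are hypothetical placeholders; in practice the weights λ_i come from training.

```python
def linear_kernel(u, v):
    """Dot product, used here only as a placeholder similarity measure."""
    return sum(a * b for a, b in zip(u, v))


def discriminant(x, examples, labels, weights, kernel=linear_kernel):
    """Assumed form: L(x) = sum_i y_i * lambda_i * K(x_i, x).

    labels  : +1 for family members, -1 for non-members
    weights : non-negative importance lambda_i of each training example
    """
    return sum(
        y * lam * kernel(x_i, x)
        for x_i, y, lam in zip(examples, labels, weights)
    )


if __name__ == "__main__":
    examples = [(1.0, 0.0), (-1.0, 0.0)]  # hypothetical feature vectors
    labels = [+1, -1]
    weights = [1.0, 1.0]                  # hypothetical lambda_i values
    print(discriminant((0.5, 2.0), examples, labels, weights))
```

The sign of L(x) gives the predicted class; its magnitude can be used as a ranking score.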

The Fisher kernel (1)
- Deriving the kernel function from generative models
  - Advantage 1: handles variable-length protein sequences
  - Advantage 2: encodes prior knowledge about protein sequences
- Difference from HMMs
  - The kernel function specifies a similarity score for any pair of sequences.
  - The likelihood score from an HMM only measures how close a sequence is to the model itself.

The Fisher kernel (2)
- Sufficient statistics
  - One per HMM parameter: the posterior frequency
    - of taking a particular transition, or
    - of generating one of the residues of the query sequence.
  - They reflect the process of generating the query sequence from the HMM.
- Alternative to the sufficient statistics: the Fisher score
  - Magnitude of the components: how much each parameter contributes to generating the query sequence.
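The Fisher score is the gradient of the log-likelihood with respect to the model parameters, U_X = ∇_θ log P(X | θ). The sketch below illustrates the idea on a deliberately simpler model than an HMM: an i.i.d. residue model with softmax-parameterized probabilities, for which the gradient is `count_k - N * p_k`. The alphabet and the probabilities in `PROBS` are hypothetical.

```python
ALPHABET = "ACDE"  # hypothetical 4-letter alphabet standing in for amino acids

# Hypothetical residue probabilities of the trained toy model.
PROBS = {"A": 0.4, "C": 0.3, "D": 0.2, "E": 0.1}


def fisher_score(seq, probs=PROBS):
    """Gradient of log P(seq) w.r.t. the softmax logits of an i.i.d. model.

    log P(seq) = sum_k n_k * log p_k with p = softmax(theta), so
    d/d theta_k = n_k - N * p_k (observed minus expected count).
    The result has one component per parameter, whatever len(seq) is.
    """
    n = len(seq)
    return [seq.count(residue) - n * probs[residue] for residue in ALPHABET]


if __name__ == "__main__":
    print(fisher_score("AC"))        # short sequence
    print(fisher_score("ACDEACDE"))  # longer sequence, same vector length
```

Sequences of different lengths map to vectors of the same dimension, which is the property the kernel relies on.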

The Fisher kernel (3)
- Kernel function used in this paper
  - Note that the Fisher score is a fixed-length vector.
- Summary
  - Train an HMM with positive examples.
  - Map each new protein sequence X to a fixed-length vector, its Fisher score.
  - Calculate the kernel function.
  - Obtain the resulting discriminant function (SVM-Fisher).
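The kernel formula itself is missing from this transcript. As a sketch only, the code below assumes a Gaussian (RBF) kernel on Fisher-score vectors, K(X, X') = exp(-||U_X - U_X'||² / (2σ²)); the width `sigma` and the example vectors are hypothetical, and how σ is chosen is left open here.

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """Assumed kernel: exp(-||u - v||^2 / (2 * sigma^2)) on Fisher scores."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_matrix(score_vectors, sigma=1.0):
    """Pairwise kernel values for a list of fixed-length score vectors."""
    return [
        [gaussian_kernel(u, v, sigma) for v in score_vectors]
        for u in score_vectors
    ]


if __name__ == "__main__":
    # Hypothetical Fisher-score vectors for three sequences.
    scores = [[0.2, -0.2, 0.0], [0.1, -0.1, 0.0], [-1.5, 1.0, 0.5]]
    for row in kernel_matrix(scores, sigma=1.0):
        print([round(value, 3) for value in row])
```

The resulting matrix can be passed to any SVM trainer that accepts a precomputed kernel.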

The Fisher kernel (4)
- Combination of scores
  - There may be more than one HMM for the family or superfamily of interest.
- Average score
- Maximum score
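A minimal sketch of the two combination rules, assuming each model yields one numeric score per query sequence and that higher scores are better. The function name `combine_scores` and the example values are hypothetical.

```python
def combine_scores(scores, method="max"):
    """Combine per-model scores for one query sequence.

    scores : one score per model (assumed: higher means more family-like)
    method : "max" keeps the best model's score, "average" takes the mean
    """
    if not scores:
        raise ValueError("at least one score is required")
    if method == "max":
        return max(scores)
    if method == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    per_model_scores = [1.0, 3.0, 2.0]  # hypothetical scores from three models
    print(combine_scores(per_model_scores, "max"))
    print(combine_scores(per_model_scores, "average"))
```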

Experimental Methods
- Methods
  - SVM-Fisher (this paper)
  - BLAST (Altschul et al. 1990; Gish & States 1993)
  - HMMs using the SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995; 1996)
- Measurement of the recognition rate for members of superfamilies of the SCOP protein structure classification (Hubbard et al. 1997)
  - Withhold all members of one SCOP family.
  - Train with the remaining members of the SCOP superfamily.
  - Test with the withheld data.
  - Question: "Could the method discover a new family of a known superfamily?"

Overview of experiments
- Database
  - SCOP version 1.37 PDB90: protein domains, no two of which have 90% or more residue identity
  - PDB90 eliminates redundant sequences.
- Generative models
  - SAM-T98 HMMs
- Data selection
  - 33 test families from 16 superfamilies
- Evaluation strategy
  - Assess to what extent the method gave better scores to the positive test examples than to the negative test examples.

SCOP: a Structural Classification of Proteins database
- Hierarchical levels
  - Family: proteins clustered by common evolutionary origin; residue identities above 30%, or lower sequence identities but very similar functions and structures
  - Superfamily: low sequence identities but a probable common evolutionary origin
  - Fold: same major secondary structures in the same arrangement and with the same topological connections

Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family

Multiple models used
- Modeling a superfamily
  - SAM-T98 starts with a single sequence (the guide sequence for the domain) and builds a model.
  - Too many sequences!
  - A subset of PDB90 is used.
  - The SVM-Fisher method is trained using each of the models in turn.

Details on the training and test sets
- All PDB90 sequences outside the fold of the test family were used as either negative training or negative test examples.
  - The test/training allocation of negative examples was reversed and the experiments were repeated.
  - Negative examples were split on a fold-by-fold basis.
- For positive examples
  - PDB90 sequences in the superfamily of the test family are used.
  - Homologs found by each individual SAM-T98 model are used.

BLAST methods
- WU-BLAST version 2.0a16 (Altschul & Gish 1996)
  - The PDB90 database was queried with each positive training example, and E-values were recorded.
  - BLAST:SCOP-only
  - BLAST:SCOP+SAM-T98-homologs
  - Scores were combined by the maximum method.

Generative HMM models
- SAM-T98 method
  - Null model: reverse-sequence model
  - Same data and same set of models as in SVM-Fisher
  - Scores combined with the maximum method

Results
- Metric: the rate of false positives (RFP)
- RFP for a positive test sequence: the fraction of negative test sequences that score as well as or better than the positive sequence.
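The RFP definition above can be sketched as follows, assuming higher scores are better. The function names and the example scores are hypothetical; for methods that report E-values, where lower is better, the scores would need to be negated first.

```python
import statistics


def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negatives scoring as well as or better than the positive."""
    if not negative_scores:
        raise ValueError("at least one negative score is required")
    hits = sum(1 for score in negative_scores if score >= positive_score)
    return hits / len(negative_scores)


def median_rfp(positive_scores, negative_scores):
    """Median RFP over all positive test sequences of a family."""
    return statistics.median(
        rate_of_false_positives(p, negative_scores) for p in positive_scores
    )


if __name__ == "__main__":
    negatives = [0.1, 0.5, 0.95, 0.9]  # hypothetical negative test scores
    print(rate_of_false_positives(0.9, negatives))
    print(median_rfp([0.9, 1.0, 0.3], negatives))
```

An RFP of 0 means the positive sequence outscored every negative test sequence.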

G-proteins
- Results for the G proteins family of the nucleotide triphosphate hydrolases SCOP superfamily
  - Tests the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds.
- Table 1
  - With SVM-Fisher, 5 of the 8 G proteins score better than all 2439 negative test sequences.
  - Maximum RFP
  - Median RFP
- Figure 2
  - RFP curve

Table 1. Rate of false positives for the G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and SVM-F = SVM-Fisher method.

Figure 2: The four methods on the 33 test families. Curves of median RFP.

Discussion
- New approach to the recognition of remote protein homologies
  - A discriminative method built on top of a generative model (HMMs)
  - Significant improvement
- The way multiple scores are combined could be improved.
- Allocation problem
  - Use one training set for tuning the HMM and a different training set for the discriminative model.
- The method could be extended to identify multiple domains within large protein sequences.