Protein Homology Detection Using String Alignment Kernels Jean-Philippe Vert, Tatsuya Akutsu.

Learning Sequence Based Protein Classification
Problem: classify protein sequence data into families and superfamilies.
Motivation: many proteins have been sequenced, but their structure and function often remain unknown.
Motivation: infer structure and function from sequence-based classification.

Sequence Data Versus Structure and Function
Sequences for four chains of human hemoglobin:
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
[Figure: tertiary structure]
Function: oxygen transport

Structural Hierarchy
SCOP: Structural Classification of Proteins.
We are interested in superfamily-level homology, a remote evolutionary relationship that is difficult to detect.

Learning Problem
Reduce to a binary classification problem: positive (+) if the example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise.
Focus on remote homology detection.
Use a supervised learning approach to train a classifier: labeled training sequences → learning algorithm → classification rule.
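The reduction to a binary problem can be illustrated with a minimal sketch. The function name `one_vs_rest_labels` and the identifiers in `ids` are hypothetical placeholders, not taken from the slides.

```python
def one_vs_rest_labels(group_ids, target):
    """Return +1 for sequences in the target family/superfamily, -1 otherwise."""
    return [1 if g == target else -1 for g in group_ids]


# Hypothetical group identifiers, one per training sequence.
ids = ["sf_A", "sf_B", "sf_A", "sf_C"]
labels = one_vs_rest_labels(ids, "sf_A")
print(labels)
```

One such label vector would be built per family or superfamily, and one classifier trained on each.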

Two supervised learning approaches to classification
Generative model approach: build a generative model for a single protein family and classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
Discriminative approach: the learning algorithm tries to learn a decision boundary between positive and negative examples. Uses both positive and negative training sequences.

Targets of the current methods
[Figure: SCOP levels (class, fold, superfamily, family) and the methods that target them: HMM, PSI-BLAST, SVM; SW, BLAST, FASTA; threading; secondary structure prediction]

Discriminative Learning
Discriminative approach: train on both positive and negative examples to learn a classifier.
Based on modern computational learning theory.
Goal: learn a classifier that generalizes well to new examples.
Does not use the training data to estimate the parameters of a probability distribution, which would run into the “curse of dimensionality”.

SVM for protein classification
We want to define a feature map from the space of protein sequences to a vector space.
Goals:
Computational efficiency.
Competitive performance with known methods.
No reliance on a generative model, giving a general method for sequence-based classification problems.

Summary of the current kernel methods
Feature vector from HMM: Fisher kernel (Jaakkola et al., 2000), Marginalized kernel (Tsuda et al., 2002).
Feature vector from sequence: Spectrum kernel (Leslie et al., 2002), Mismatch kernel (Leslie et al., 2003).
Feature vector from other score: SVM pairwise (Liao & Noble, 2002).
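As an illustration of a kernel built from a sequence-derived feature vector, here is a minimal sketch of a spectrum-style kernel: each sequence is mapped to its k-mer counts and the kernel is the inner product of the two count vectors. The function names and the default `k=3` are assumptions for this example, not details from the slides.

```python
from collections import Counter


def spectrum_features(seq, k=3):
    """Count every contiguous k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of x and y."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Counter returns 0 for k-mers that are absent from y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(spectrum_kernel("VLSPADKTNV", "VHLTPEEKSAV", k=2))
```

Because the kernel is an explicit inner product, it is positive definite by construction, which is the property the SW score lacks.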

String Alignment Kernels
Observation: the Smith-Waterman (SW) alignment score provides a measure of similarity that incorporates biological knowledge about protein evolution.
It cannot be used as a kernel because it is not positive definite.
A family of local alignment (LA) kernels that mimic the SW score is presented.

LA Kernels
Other kernels: choose a feature vector representation, then get the kernel as the inner product of the vectors.
LA kernel: measure similarity, then get a valid kernel.

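LA Kernels
Two building blocks are defined for β ≥ 0: the pair score kernel $K_a^{(\beta)}(x,y)$, where s is a symmetric similarity score, and the gap kernel $K_g^{(\beta)}(x,y)$ for a gap penalty model in which d is the gap opening cost and e is the gap extension cost.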
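The slide's formulas did not survive in the transcript. One common way to write the two building blocks, given here as a reconstruction under assumptions rather than the slide's exact notation, is shown below. The sign convention is an assumption: d and e are taken as non-negative costs, so the penalty g enters with a minus sign; other presentations fold the sign into d and e.

```latex
K_a^{(\beta)}(x,y) =
\begin{cases}
0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1,\\[2pt]
\exp\bigl(\beta\, s(x,y)\bigr) & \text{otherwise,}
\end{cases}
\qquad
K_g^{(\beta)}(x,y) = \exp\Bigl(-\beta\,\bigl[g(|x|) + g(|y|)\bigr]\Bigr),
\qquad
g(0) = 0,\quad g(n) = d + e\,(n-1)\ \text{ for } n \ge 1 .
```

Under this reading, the pair score kernel is non-zero only for single residues, and the gap kernel depends only on the lengths of the unaligned stretches.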

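LA Kernels
Kernel convolution: for n ≥ 1, the string kernel can be expressed as $K_{(n)}^{(\beta)} = K_0 \star \bigl(K_a^{(\beta)} \star K_g^{(\beta)}\bigr)^{(n-1)} \star K_a^{(\beta)} \star K_0$, with $K_0 = 1$.
$K_0$ is the initial part, followed by a succession of n aligned residues ($K_a^{(\beta)}$) with n-1 possible gaps ($K_g^{(\beta)}$) between them, and a terminal part $K_0$.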

LA Kernels
The sum of these kernels over n converges for any x and y because only a finite number of terms are non-zero.
It is a point-wise limit of Mercer kernels.

LA with SW score
π: a local alignment of x and y.
p(x,y,π): the score of the local alignment π of x and y.
Π: the set of all possible local alignments of x and y.
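With this notation, the relation between the SW score and the LA kernel can be written out as follows. This is a sketch based on the definitions above; the slide's own formulas are not in the transcript, so the exact form is an assumption.

```latex
\mathrm{SW}(x,y) = \max_{\pi \in \Pi} \, p(x,y,\pi),
\qquad
K_{LA}^{(\beta)}(x,y) = \sum_{\pi \in \Pi} \exp\bigl(\beta\, p(x,y,\pi)\bigr),
\qquad
\lim_{\beta \to +\infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x,y) = \mathrm{SW}(x,y).
```

The SW score keeps only the best-scoring alignment, whereas the LA kernel sums contributions from every local alignment; for large β the best alignment dominates the sum.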

Why SW cannot be a kernel
1. SW keeps only the best alignment instead of summing over all alignments of x and y.
2. Taking the logarithm can destroy positive definiteness.
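The difference between keeping the best alignment and summing over all of them can be seen numerically. The sketch below uses a hypothetical list of alignment scores and compares their maximum with (1/β) · ln Σ exp(β · score); the function name `soft_max_score` and the score values are assumptions for illustration only.

```python
import math


def soft_max_score(scores, beta):
    """Compute (1/beta) * log(sum(exp(beta * s))) in a numerically stable way."""
    m = max(scores)
    return m + math.log(sum(math.exp(beta * (s - m)) for s in scores)) / beta


# Hypothetical scores of three candidate local alignments.
scores = [3.0, 1.0, -2.0]
for beta in (0.1, 1.0, 50.0):
    print(beta, soft_max_score(scores, beta), max(scores))
```

For small β every alignment contributes noticeably; as β grows the value approaches the maximum, which is the quantity SW reports.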

Example
[Figure: LA kernel versus SW score]

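[Figure: SVM-pairwise versus LA kernel. SVM-pairwise: x and y are each mapped to a vector of SW scores, e.g. (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01), and the kernel is their inner product. LA kernel: x and y are compared through a pair HMM.]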
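Reading the two vectors on this slide as per-sequence score vectors (an interpretation of the figure, not something the transcript states), the inner product can be worked out directly:

```python
# Score vectors shown on the slide, read here as the feature vectors of x and y.
phi_x = (0.9, 0.05, 0.3, 0.2)
phi_y = (0.2, 0.3, 0.1, 0.01)

# Kernel value as the inner product of the two feature vectors.
k_xy = sum(a * b for a, b in zip(phi_x, phi_y))
print(round(k_xy, 3))
```

The four products are 0.18, 0.015, 0.03 and 0.002, which sum to 0.227.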

Diagonal Dominance Issue
K(x,x) is easily orders of magnitude larger than K(x,y) for similar sequences, which biases the performance of the SVM.

(1) The eigen kernel LA-eig:
a. If the training Gram matrix has negative eigenvalues, subtract the most negative eigenvalue from its diagonal.
b. LA-eig is equal to the original kernel except possibly on the diagonal.
(2) The empirical kernel map LA-ekm.
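The slide does not spell out the empirical kernel map. Assuming the standard definition, in which each sequence is represented by its vector of kernel values against the training set and the new kernel is the inner product of those vectors, a minimal sketch looks like this. The function name and the toy 2×2 matrix are hypothetical.

```python
def empirical_kernel_map(gram):
    """Return G @ G.T for a square Gram matrix given as a list of rows.

    Row i of the input is the feature vector of training sequence i, so the
    result is a matrix of inner products and is positive semidefinite even
    when the input is not.
    """
    n = len(gram)
    return [
        [sum(gram[i][t] * gram[j][t] for t in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Toy symmetric matrix with eigenvalues 3 and -1, so it is not a valid kernel.
gram = [[1.0, 2.0], [2.0, 1.0]]
ekm = empirical_kernel_map(gram)
print(ekm)
```

The result for the toy matrix has eigenvalues 9 and 1, the squares of the original ones, so the negative eigenvalue is gone.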

Methods
Implementation: the kernel (and therefore the kernels derived from it) can be computed with complexity O(|x| · |y|) using dynamic programming, by a slight modification of the SW algorithm.
Normalization
Dataset: 4352 sequences extracted from the Astral database, grouped into families and superfamilies.
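The dynamic programming computation can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it sums exp(β · score) over all local alignments with an affine gap penalty, using one table for alignments ending in an aligned pair, two for open gaps, and two for trailing unaligned residues. The function name, the toy scoring function, and the parameter values are assumptions; gap costs are taken as non-negative and applied with a minus sign.

```python
import math


def la_kernel(x, y, score, beta=0.5, gap_open=2.0, gap_extend=1.0):
    """Sum of exp(beta * alignment score) over all local alignments of x and y."""
    n, m = len(x), len(y)
    eo = math.exp(-beta * gap_open)    # weight of opening a gap
    ee = math.exp(-beta * gap_extend)  # weight of extending a gap

    def table():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    M, X, Y, X2, Y2 = table(), table(), table(), table(), table()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Alignments whose last aligned pair is (x[i-1], y[j-1]).
            M[i][j] = math.exp(beta * score(x[i - 1], y[j - 1])) * (
                1.0 + X[i - 1][j - 1] + Y[i - 1][j - 1] + M[i - 1][j - 1]
            )
            # Open gap: residue of x skipped, then residue of y skipped.
            X[i][j] = eo * M[i - 1][j] + ee * X[i - 1][j]
            Y[i][j] = eo * (M[i][j - 1] + X[i][j - 1]) + ee * Y[i][j - 1]
            # Trailing unaligned residues carry no penalty.
            X2[i][j] = M[i - 1][j] + X2[i - 1][j]
            Y2[i][j] = M[i][j - 1] + X2[i][j - 1] + Y2[i][j - 1]
    # The leading 1 accounts for the empty alignment.
    return 1.0 + X2[n][m] + Y2[n][m] + M[n][m]


def toy_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


print(la_kernel("AB", "BA", toy_score, beta=1.0))
```

The two nested loops touch each cell once, which gives the O(|x| · |y|) cost stated above. In practice a substitution matrix such as BLOSUM would replace `toy_score`, and long sequences would call for working in log space to avoid overflow.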

ROC Curve

Summary for the kernels