Protein Homology Detection Using String Alignment Kernels. Jean-Philippe Vert, Tatsuya Akutsu.


1 Protein Homology Detection Using String Alignment Kernels. Jean-Philippe Vert, Tatsuya Akutsu

2 Problem: classification of protein sequence data into families and superfamilies. Motivation: many proteins have been sequenced, but their structure and function often remain unknown. Motivation: infer structure and function from sequence-based classification. Learning Sequence Based Protein Classification

3 Sequences for four chains of human hemoglobin:
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Tertiary Structure. Function: oxygen transport. Sequence Data Versus Structure and Function
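The hemoglobin records on this slide use FASTA-style headers (a line starting with ">" followed by sequence lines). As a minimal sketch, assuming the records are stored in standard multi-line FASTA text, they could be read as follows; the function name `parse_fasta` and the variable names are hypothetical and not part of the presentation.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; drop the leading ">" from the header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Chain A of 1A3N, copied from the slide.
sample = """>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
"""
chains = parse_fasta(sample)
```

Each value in `chains` is a plain string of residues, which is the form of input the sequence kernels discussed later operate on.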

4 SCOP: Structural Classification of Proteins. Interested in superfamily-level homology, i.e. remote evolutionary relationships. Difficult! Structural Hierarchy

5 Reduce to a binary classification problem: positive (+) if an example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise. Focus on remote homology detection. Use a supervised learning approach to train a classifier: labeled training sequences are fed to a learning algorithm, which outputs a classification rule. Learning Problem

6 Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Only uses positive training sequences. Discriminative approach: the learning algorithm tries to learn a decision boundary between positive and negative examples. Uses both positive and negative training sequences. Two supervised learning approaches to classification

7 Class: secondary structure prediction. Fold: threading. Superfamily: HMM, PSI-BLAST, SVM. Family: SW, BLAST, FASTA. Targets of the current methods

8 Discriminative approach: train on both positive and negative examples to learn a classifier. Modern computational learning theory. Goal: learn a classifier that generalizes well to new examples. Do not use training data to estimate the parameters of a probability distribution (“curse of dimensionality”). Discriminative Learning

9 Want to define a feature map from the space of protein sequences to a vector space. Goals: computational efficiency; competitive performance with known methods; no reliance on a generative model, giving a general method for sequence-based classification problems. SVM for protein classification

10 Feature vector from HMM: Fisher kernel (Jaakkola et al., 2000), Marginalized kernel (Tsuda et al., 2002). Feature vector from sequence: Spectrum kernel (Leslie et al., 2002), Mismatch kernel (Leslie et al., 2003). Feature vector from other score: SVM pairwise (Liao & Noble, 2002). Summary of the current kernel methods
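To make the "feature vector from sequence" idea concrete, here is a minimal sketch of a spectrum-style kernel: each sequence is mapped to its vector of k-mer counts and the kernel is the inner product of the two count vectors. The function names and the default k = 3 are assumptions for illustration, not details taken from the slides.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * cy[kmer] for kmer, count in cx.items())


value = spectrum_kernel("ABAB", "BABA", k=2)
```

Because the kernel is an explicit inner product of feature vectors, it is positive semidefinite by construction, which is the property the SW score lacks.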

11 Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge about protein evolution. It cannot be used as a kernel because it is not positive definite. A family of local alignment (LA) kernels that mimic the SW score is presented. String Alignment Kernels

12 Other kernels: choose a feature vector representation, then get the kernel as the inner product of the vectors. LA kernel: measure similarity, then get a valid kernel. LA Kernels

13 Pair score kernel K_a^β(x,y) and gap kernel K_g^β(x,y), for an affine gap penalty model in which d is the gap opening cost and e is the gap extension cost. β >= 0, and s is a symmetric similarity score. LA Kernels

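14 Kernel convolution: for n >= 1, the string kernel K_(n)^β can be expressed as K_0 * (K_a^β * K_g^β)^(n-1) * K_a^β * K_0, where * denotes convolution and K_0 = 1. K_0 is the initial part; it is followed by a succession of n aligned residues (K_a^β) separated by n-1 possible gaps (K_g^β), and by a terminal part K_0. LA Kernels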

15 The LA kernel is the sum of these kernels over n. The sum converges for any x and y because only a finite number of terms are non-null. As a point-wise limit of Mercer kernels, it is itself a valid kernel. LA Kernels
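The construction on the preceding slides can be sketched as a direct sum over local alignments. The code below is a naive illustration, not the presenters' implementation: it runs in O(|x|^2 · |y|^2) time, it assumes each alignment is weighted by exp(β · score), that gaps are charged an affine cost d + e·(n-1) with d and e positive, and that the empty alignment contributes 1. The names `la_kernel_naive`, `gap_cost` and `toy_score`, and the toy parameter values, are hypothetical.

```python
import math


def gap_cost(n, d, e):
    """Affine gap cost (assumed form): 0 for no gap, d + e*(n-1) for length n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)


def la_kernel_naive(x, y, score, beta=0.5, d=3.0, e=1.0):
    """Sum exp(beta * alignment score) over all local alignments of x and y."""
    # a[i][j]: total weight of alignments whose last aligned pair is (x[i], y[j]).
    a = [[0.0] * len(y) for _ in x]
    total = 1.0  # assumption: the empty alignment contributes exp(0) = 1
    for i in range(len(x)):
        for j in range(len(y)):
            prev = 1.0  # the pair (i, j) may start a new alignment
            for i2 in range(i):
                for j2 in range(j):
                    # Extend every alignment ending at (i2, j2), paying for
                    # the residues skipped in x and in y.
                    gaps = gap_cost(i - i2 - 1, d, e) + gap_cost(j - j2 - 1, d, e)
                    prev += a[i2][j2] * math.exp(-beta * gaps)
            a[i][j] = math.exp(beta * score(x[i], y[j])) * prev
            total += a[i][j]
    return total


def toy_score(p, q):
    """Toy symmetric similarity score (not a real substitution matrix)."""
    return 2.0 if p == q else -1.0


k_aa = la_kernel_naive("A", "A", toy_score, beta=0.5)
```

For two single-residue sequences there are exactly two alignments (the empty one and the single matched pair), so `k_aa` equals 1 + exp(β · s(A, A)). A faster dynamic program is needed in practice; this version only makes the definition explicit.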

16 π: a local alignment. p(x,y,π): score of the local alignment π over x,y. Π: set of all possible local alignments over x,y. LA with SW score

17 1. SW keeps only the best alignment instead of summing over all alignments of x,y. 2. Taking the logarithm can destroy the property of being positive definite. Why SW cannot be a kernel
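The two points above can be written out with the notation π, p(x,y,π) and Π. The following is a sketch under the assumption that the LA kernel weights every alignment by exp(β p(x,y,π)); the symbols SW and K_LA^(β) are labels chosen here for illustration.

```latex
\begin{align*}
SW(x,y) &= \max_{\pi \in \Pi} p(x,y,\pi), \\
K_{LA}^{(\beta)}(x,y) &= \sum_{\pi \in \Pi} \exp\bigl(\beta\, p(x,y,\pi)\bigr), \\
SW(x,y) \;\le\; \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x,y) &\;\le\; SW(x,y) + \frac{\ln |\Pi|}{\beta}.
\end{align*}
```

The lower bound holds because the sum contains the best alignment's term, and the upper bound because each of the |Π| terms is at most exp(β SW(x,y)). Under these assumptions (1/β) ln K_LA^(β) tends to the SW score as β grows, so the SW score differs from a valid kernel by exactly the two operations listed: a maximum in place of a sum, and a logarithm.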

18 LA Kernel SW score Example

19 SVM-pairwise versus LA kernel. SVM-pairwise: inner product of the SW score vectors (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01) for x and y, giving 0.227. LA kernel: pair HMM on (x, y), giving 0.253.
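The value 0.227 on this slide is the inner product of the two four-component vectors shown. A minimal sketch of that arithmetic follows; which vector belongs to x and which to y is an assumption (the result is the same either way), and the names `inner_product`, `phi_x` and `phi_y` are hypothetical.

```python
def inner_product(u, v):
    """Plain dot product of two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


# Score vectors copied from the slide; assignment to x and y is assumed.
phi_x = (0.9, 0.05, 0.3, 0.2)
phi_y = (0.2, 0.3, 0.1, 0.01)
k_xy = inner_product(phi_x, phi_y)  # 0.18 + 0.015 + 0.03 + 0.002
```

In this representation each sequence is described only through its scores against a fixed set of reference sequences, whereas the LA kernel compares x and y directly.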

20 K(x,x) is easily orders of magnitude larger than K(x,y), even for similar sequences x and y, which biases the performance of the SVM. Diagonal Dominance Issue

21 (1) The eigen kernel LA-eig: a. Obtained by subtracting from the diagonal the smallest negative eigenvalue of the training Gram matrix, if there are negative eigenvalues. b. LA-eig equals the unmodified matrix except possibly on the diagonal. (2) The empirical kernel map LA-ekm
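Option (2) can be sketched in a few lines. The assumption here is the usual form of an empirical kernel map: each sequence is represented by its vector of similarities to the n training sequences, and the new kernel is the inner product of two such vectors, so the new Gram matrix is the product of the similarity matrix with its transpose. The function name and the 2 x 2 example matrix are hypothetical.

```python
def empirical_kernel_map(gram):
    """Gram matrix of the empirical kernel map.

    Row i of `gram` holds the similarities of training item i to all
    training items; the new kernel is the inner product of two rows.
    """
    n = len(gram)
    return [
        [sum(gram[i][k] * gram[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric similarity matrix with eigenvalues 3 and -1,
# so it is not positive semidefinite.
sim = [[1.0, 2.0], [2.0, 1.0]]
ekm = empirical_kernel_map(sim)
```

For z = (1, -1) the quadratic form of `sim` is negative (-2), while for `ekm` it is positive (2). The mapped matrix is always positive semidefinite because it is a matrix of inner products. Option (1) instead shifts the diagonal and needs an eigenvalue computation, which is not shown here.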

22 Implementation: the kernel [and therefore of ] is computed with a complexity in O(|x| · |y|), using dynamic programming by a slight modification of the SW algorithm. Normalization. Dataset: 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies. Methods

23 ROC Curve

25 Summary for the kernels

