1
Protein Homology Detection Using String Alignment Kernels
Jean-Philippe Vert, Tatsuya Akutsu
2
Learning Sequence-Based Protein Classification
Problem: classification of protein sequence data into families and superfamilies.
Motivation: many proteins have been sequenced, but their structure/function often remains unknown.
Motivation: infer structure/function from sequence-based classification.
3
Sequence Data Versus Structure and Function
Sequences for the four chains of human hemoglobin. [Figure: tertiary structure; function: oxygen transport]

>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
4
Structural Hierarchy
SCOP: Structural Classification of Proteins.
We are interested in superfamily-level homology, i.e. remote evolutionary relationships. Difficult!
5
Learning Problem
Reduce to a binary classification problem: an example is positive (+) if it belongs to a given family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), and negative (-) otherwise.
Focus on remote homology detection.
Use a supervised learning approach to train a classifier: labeled training sequences -> learning algorithm -> classification rule.
6
Two Supervised Learning Approaches to Classification
Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
Discriminative approach: the learning algorithm tries to learn the decision boundary between positive and negative examples. Uses both positive and negative training sequences.
7
Targets of the Current Methods
[Figure: SCOP hierarchy (Class, Fold, Superfamily, Family) annotated with the methods targeting each level:]
Family: SW, BLAST, FASTA
Superfamily: HMM, PSI-BLAST, SVM
Fold: Threading
Class: Secondary structure prediction
8
Discriminative Learning
Discriminative approach: train on both positive and negative examples to learn a classifier.
Modern computational learning theory. Goal: learn a classifier that generalizes well to new examples.
Do not use the training data to estimate the parameters of a probability distribution (the "curse of dimensionality").
9
SVM for Protein Classification
We want to define a feature map from the space of protein sequences to a vector space. Goals:
- computational efficiency
- performance competitive with known methods
- no reliance on a generative model: a general method for sequence-based classification problems
A minimal sketch of this setup follows.
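As a concrete illustration of the discriminative setup, here is a sketch of training an SVM on a precomputed string-kernel Gram matrix with scikit-learn; the matrix entries and labels below are made-up toy values, not data from the paper:

    import numpy as np
    from sklearn.svm import SVC

    # Toy symmetric, positive semi-definite Gram matrix for four training
    # sequences; in practice the entries would come from a string kernel.
    gram_train = np.array([[4.0, 2.0, 0.5, 0.3],
                           [2.0, 3.0, 0.4, 0.2],
                           [0.5, 0.4, 3.5, 1.8],
                           [0.3, 0.2, 1.8, 2.9]])
    y_train = np.array([+1, +1, -1, -1])   # +1: in the superfamily, -1: not

    svm = SVC(kernel="precomputed")
    svm.fit(gram_train, y_train)

    # Each test row holds kernel values against the training sequences.
    gram_test = np.array([[1.9, 1.7, 0.3, 0.2]])
    print(svm.predict(gram_test))          # close to the positives -> [1]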
10
Summary of the Current Kernel Methods
Feature vector from an HMM:
- Fisher kernel (Jaakkola et al., 2000)
- Marginalized kernel (Tsuda et al., 2002)
Feature vector from the sequence:
- Spectrum kernel (Leslie et al., 2002)
- Mismatch kernel (Leslie et al., 2003)
Feature vector from other scores:
- SVM-pairwise (Liao & Noble, 2002)
11
String Alignment Kernels
Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge of protein evolution. However, it cannot be used as a kernel because it lacks positive definiteness.
A family of local alignment (LA) kernels that mimic the SW score is presented.
12
LA Kernels
Other kernels: choose a feature vector representation, then get the kernel as the inner product of feature vectors.
LA kernel: measure similarity directly, then get a valid kernel from it.
13
LA Kernels
Pair score kernel: $K_a^{(\beta)}(x,y) = \exp\big(\beta\, s(x,y)\big)$ if $|x| = |y| = 1$, and $0$ otherwise, where $\beta \ge 0$ and $s$ is a symmetric similarity score.
Gap kernel: $K_g^{(\beta)}(x,y) = \exp\big(\beta\,[g(|x|) + g(|y|)]\big)$ for the affine gap penalty model $g(0) = 0$, $g(n) = d + e\,(n-1)$ for $n \ge 1$, where $d$ is the gap opening cost and $e$ the extension cost.
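To make the building blocks concrete, here is a minimal Python sketch of the two elementary kernels. The function names are mine, the default parameter values are illustrative assumptions, and d and e are taken as negative scores so that gaps lower the alignment score:

    import math

    def k_a(x, y, s, beta):
        # Pair score kernel: non-zero only for single residues, where it
        # equals exp(beta * s(x, y)); s is a symmetric similarity score.
        if len(x) == 1 and len(y) == 1:
            return math.exp(beta * s(x, y))
        return 0.0

    def g(n, d=-11.0, e=-1.0):
        # Affine gap model: g(0) = 0, g(n) = d + e*(n-1) for n >= 1.
        # Sign convention assumed here: d, e negative, so gaps penalize.
        return 0.0 if n == 0 else d + e * (n - 1)

    def k_g(x, y, beta, d=-11.0, e=-1.0):
        # Gap kernel: exp(beta * (g(|x|) + g(|y|))).
        return math.exp(beta * (g(len(x), d, e) + g(len(y), d, e)))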
14
LA Kernels
Kernel convolution (for string kernels $K_1$, $K_2$):
$(K_1 \star K_2)(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\, K_2(x_2,y_2)$.
For $n \ge 1$, the string kernel can be expressed as
$K^{(n)} = K_0 \star \big(K_a^{(\beta)} \star K_g^{(\beta)}\big)^{\star(n-1)} \star K_a^{(\beta)} \star K_0$, with $K_0 \equiv 1$:
an initial part $K_0$, a succession of $n$ aligned residues $K_a^{(\beta)}$ with $n-1$ possible gaps $K_g^{(\beta)}$, and a terminal part $K_0$.
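A naive sketch of the convolution operation itself (exponentially slower than the dynamic program used in practice, but it shows what the star in the formula above means; the function name is mine):

    def conv(K1, K2):
        # Convolution of two string kernels (Haussler, 1999):
        # (K1 * K2)(x, y) sums K1(x1, y1) * K2(x2, y2) over every way of
        # splitting x = x1 + x2 and y = y1 + y2.
        def K(x, y):
            return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
                       for i in range(len(x) + 1)
                       for j in range(len(y) + 1))
        return K

    K0 = lambda x, y: 1.0   # the constant kernel K_0 = 1
    # e.g. the n = 1 term of the LA kernel is conv(K0, conv(Ka, K0)),
    # with Ka(x, y) = k_a(x, y, s, beta) from the sketch above.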
15
LA Kernels
The LA kernel is the sum $K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K^{(n)}$.
It is convergent for any $x$ and $y$ because only a finite number of terms are non-null.
It is a point-wise limit of Mercer kernels, and is therefore a Mercer (positive definite) kernel itself.
16
LA with SW Score
$\pi$: a local alignment of $x$ and $y$; $p(x,y,\pi)$: the score of local alignment $\pi$; $\Pi$: the set of all possible local alignments of $x$ and $y$.
With these definitions, $K_{LA}^{(\beta)}(x,y) = \sum_{\pi \in \Pi} \exp\big(\beta\, p(x,y,\pi)\big)$, whereas the SW score keeps only the best alignment: $SW(x,y) = \max_{\pi \in \Pi} p(x,y,\pi)$.
17
Why the SW Score Cannot Be a Kernel
1. SW keeps only the best alignment (a maximum) instead of a sum over all alignments of $x$ and $y$.
2. The logarithm involved in recovering SW from the LA kernel ($\frac{1}{\beta}\ln K_{LA}^{(\beta)} \to SW$ as $\beta \to \infty$) can destroy positive definiteness.
18
Example
[Figure: a worked example comparing the LA kernel with the SW score.]
19
SW Score
[Figure: comparison of SVM-pairwise with the LA kernel. SVM-pairwise represents x and y by vectors of SW scores, e.g. (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01), and takes their inner product; the LA kernel compares x and y directly through a pair-HMM-style sum over alignments (values 0.227 and 0.253 shown in the figure).]
20
Diagonal Dominance Issue
The problem is that K(x,x) is easily orders of magnitude larger than K(x,y) even for similar sequences, which biases the performance of the SVM.
21
Two remedies:
(1) The eigen kernel LA-eig:
a. Subtract from the diagonal the smallest negative eigenvalue of the training Gram matrix, if there are negative eigenvalues.
b. LA-eig is then equal to the LA kernel except possibly on the diagonal.
(2) The empirical kernel map LA-ekm: represent each sequence by its vector of kernel values against the training sequences.
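A minimal numpy sketch of the two corrections, assuming the Gram matrix K has already been computed (the function names are mine, not the paper's):

    import numpy as np

    def la_eig(K):
        # LA-eig: if the training Gram matrix has negative eigenvalues,
        # subtract the smallest one from the diagonal; this shifts the
        # spectrum up so K becomes positive semi-definite while leaving
        # off-diagonal entries untouched.
        lam_min = np.linalg.eigvalsh(K).min()   # K is symmetric
        return K - lam_min * np.eye(len(K)) if lam_min < 0 else K

    def la_ekm(K):
        # Empirical kernel map: represent sequence i by row i of K (its
        # kernel values against the training set) and take inner products
        # of these vectors as the new kernel.
        return K @ K.T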
22
Methods
Implementation: the kernel (and therefore the kernels derived from it) can be computed with complexity $O(|x| \cdot |y|)$, using dynamic programming via a slight modification of the SW algorithm.
Normalization: the kernel values are normalized, e.g. as $\tilde{K}(x,y) = K(x,y)/\sqrt{K(x,x)\,K(y,y)}$.
Dataset: 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.
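A sketch of that dynamic program in Python, following the recursion structure of the LA-kernel paper (Saigo et al., 2004); the sign convention for d and e (negative gap scores) and the toy substitution score are my assumptions, and a real implementation would use BLOSUM-style scores:

    import numpy as np

    def la_kernel(x, y, s, beta=0.5, d=-11.0, e=-1.0):
        # O(|x|*|y|) dynamic program for the LA kernel.  M tracks partial
        # alignments ending in an aligned pair, X/Y those ending in a gap,
        # and X2/Y2 account for trailing unaligned residues.
        n, m = len(x), len(y)
        M  = np.zeros((n + 1, m + 1))
        X  = np.zeros((n + 1, m + 1))
        Y  = np.zeros((n + 1, m + 1))
        X2 = np.zeros((n + 1, m + 1))
        Y2 = np.zeros((n + 1, m + 1))
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                M[i, j] = np.exp(beta * s(x[i-1], y[j-1])) * (
                    1 + X[i-1, j-1] + Y[i-1, j-1] + M[i-1, j-1])
                X[i, j] = (np.exp(beta * d) * M[i-1, j]
                           + np.exp(beta * e) * X[i-1, j])
                Y[i, j] = (np.exp(beta * d) * (M[i, j-1] + X[i, j-1])
                           + np.exp(beta * e) * Y[i, j-1])
                X2[i, j] = M[i-1, j] + X2[i-1, j]
                Y2[i, j] = M[i, j-1] + X2[i, j-1] + Y2[i, j-1]
        return 1 + X2[n, m] + Y2[n, m] + M[n, m]

    # Toy usage with a hypothetical match/mismatch score:
    s = lambda a, b: 2.0 if a == b else -1.0
    print(la_kernel("VLSPADK", "VHLTPEEK", s))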
23
ROC Curve
25
Summary of the kernels