1
Protein Homology Detection Using String Alignment Kernels
Jean-Philippe Vert, Tatsuya Akutsu
2
Learning Sequence-Based Protein Classification
Problem: classification of protein sequence data into families and superfamilies.
Motivation: many proteins have been sequenced, but their structure/function often remains unknown.
Motivation: infer structure/function from sequence-based classification.
3
Sequence Data Versus Structure and Function
Sequences for the four chains of human hemoglobin. [Figure: tertiary structure; function: oxygen transport]

>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
4
Structural Hierarchy
SCOP: Structural Classification of Proteins.
We are interested in superfamily-level homology, i.e. remote evolutionary relationships. Difficult!
5
Learning Problem
Reduce to a binary classification problem: an example is positive (+) if it belongs to a given family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), and negative (-) otherwise.
Focus on remote homology detection.
Use a supervised learning approach to train a classifier: labeled training sequences -> learning algorithm -> classification rule.
6
Two Supervised Learning Approaches to Classification
Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
Discriminative approach: the learning algorithm tries to learn the decision boundary between positive and negative examples. Uses both positive and negative training sequences.
7
Targets of the Current Methods
[Figure: SCOP hierarchy (Class, Fold, Superfamily, Family) annotated with the methods targeting each level:]
Family: SW, BLAST, FASTA
Superfamily: HMM, PSI-BLAST, SVM
Fold: Threading
Class: Secondary structure prediction
8
Discriminative Learning
Discriminative approach: train on both positive and negative examples to learn a classifier.
Modern computational learning theory. Goal: learn a classifier that generalizes well to new examples.
Do not use the training data to estimate the parameters of a probability distribution (the "curse of dimensionality").
9
SVM for Protein Classification
We want to define a feature map from the space of protein sequences to a vector space. Goals:
- computational efficiency
- performance competitive with known methods
- no reliance on a generative model: a general method for sequence-based classification problems
A minimal sketch of this setup follows.
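As a concrete illustration of the discriminative setup, here is a sketch of training an SVM on a precomputed string-kernel Gram matrix with scikit-learn; the matrix entries and labels below are made-up toy values, not data from the paper:

    import numpy as np
    from sklearn.svm import SVC

    # Toy symmetric, positive semi-definite Gram matrix for four training
    # sequences; in practice the entries would come from a string kernel.
    gram_train = np.array([[4.0, 2.0, 0.5, 0.3],
                           [2.0, 3.0, 0.4, 0.2],
                           [0.5, 0.4, 3.5, 1.8],
                           [0.3, 0.2, 1.8, 2.9]])
    y_train = np.array([+1, +1, -1, -1])   # +1: in the superfamily, -1: not

    svm = SVC(kernel="precomputed")
    svm.fit(gram_train, y_train)

    # Each test row holds kernel values against the training sequences.
    gram_test = np.array([[1.9, 1.7, 0.3, 0.2]])
    print(svm.predict(gram_test))          # close to the positives -> [1]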
10
Summary of the Current Kernel Methods
Feature vector from an HMM:
- Fisher kernel (Jaakkola et al., 2000)
- Marginalized kernel (Tsuda et al., 2002)
Feature vector from the sequence:
- Spectrum kernel (Leslie et al., 2002)
- Mismatch kernel (Leslie et al., 2003)
Feature vector from other scores:
- SVM-pairwise (Liao & Noble, 2002)
11
String Alignment Kernels
Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge of protein evolution. However, it cannot be used as a kernel because it lacks positive definiteness.
A family of local alignment (LA) kernels that mimic the SW score is presented.
12
LA Kernels
Other kernels: choose a feature vector representation, then get the kernel as the inner product of feature vectors.
LA kernel: measure similarity directly, then get a valid kernel from it.
13
LA Kernels
Pair score kernel: $K_a^{(\beta)}(x,y) = \exp\big(\beta\, s(x,y)\big)$ if $|x| = |y| = 1$, and $0$ otherwise, where $\beta \ge 0$ and $s$ is a symmetric similarity score.
Gap kernel: $K_g^{(\beta)}(x,y) = \exp\big(\beta\,[g(|x|) + g(|y|)]\big)$ for the affine gap penalty model $g(0) = 0$, $g(n) = d + e\,(n-1)$ for $n \ge 1$, where $d$ is the gap opening cost and $e$ the extension cost.
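To make the building blocks concrete, here is a minimal Python sketch of the two elementary kernels. The function names are mine, the default parameter values are illustrative assumptions, and d and e are taken as negative scores so that gaps lower the alignment score:

    import math

    def k_a(x, y, s, beta):
        # Pair score kernel: non-zero only for single residues, where it
        # equals exp(beta * s(x, y)); s is a symmetric similarity score.
        if len(x) == 1 and len(y) == 1:
            return math.exp(beta * s(x, y))
        return 0.0

    def g(n, d=-11.0, e=-1.0):
        # Affine gap model: g(0) = 0, g(n) = d + e*(n-1) for n >= 1.
        # Sign convention assumed here: d, e negative, so gaps penalize.
        return 0.0 if n == 0 else d + e * (n - 1)

    def k_g(x, y, beta, d=-11.0, e=-1.0):
        # Gap kernel: exp(beta * (g(|x|) + g(|y|))).
        return math.exp(beta * (g(len(x), d, e) + g(len(y), d, e)))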
14
LA Kernels
Kernel convolution (for string kernels $K_1$, $K_2$):
$(K_1 \star K_2)(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\, K_2(x_2,y_2)$.
For $n \ge 1$, the string kernel can be expressed as
$K^{(n)} = K_0 \star \big(K_a^{(\beta)} \star K_g^{(\beta)}\big)^{\star(n-1)} \star K_a^{(\beta)} \star K_0$, with $K_0 \equiv 1$:
an initial part $K_0$, a succession of $n$ aligned residues $K_a^{(\beta)}$ with $n-1$ possible gaps $K_g^{(\beta)}$, and a terminal part $K_0$.
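A naive sketch of the convolution operation itself (exponentially slower than the dynamic program used in practice, but it shows what the star in the formula above means; the function name is mine):

    def conv(K1, K2):
        # Convolution of two string kernels (Haussler, 1999):
        # (K1 * K2)(x, y) sums K1(x1, y1) * K2(x2, y2) over every way of
        # splitting x = x1 + x2 and y = y1 + y2.
        def K(x, y):
            return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
                       for i in range(len(x) + 1)
                       for j in range(len(y) + 1))
        return K

    K0 = lambda x, y: 1.0   # the constant kernel K_0 = 1
    # e.g. the n = 1 term of the LA kernel is conv(K0, conv(Ka, K0)),
    # with Ka(x, y) = k_a(x, y, s, beta) from the sketch above.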
15
LA Kernels
The LA kernel is the sum $K_{LA}^{(\beta)} = \sum_{n=0}^{\infty} K^{(n)}$.
It is convergent for any $x$ and $y$ because only a finite number of terms are non-null.
It is a point-wise limit of Mercer kernels, and is therefore a Mercer (positive definite) kernel itself.
16
LA with SW Score
$\pi$: a local alignment of $x$ and $y$; $p(x,y,\pi)$: the score of local alignment $\pi$; $\Pi$: the set of all possible local alignments of $x$ and $y$.
With these definitions, $K_{LA}^{(\beta)}(x,y) = \sum_{\pi \in \Pi} \exp\big(\beta\, p(x,y,\pi)\big)$, whereas the SW score keeps only the best alignment: $SW(x,y) = \max_{\pi \in \Pi} p(x,y,\pi)$.
17
Why the SW Score Cannot Be a Kernel
1. SW keeps only the best alignment (a maximum) instead of a sum over all alignments of $x$ and $y$.
2. The logarithm involved in recovering SW from the LA kernel ($\frac{1}{\beta}\ln K_{LA}^{(\beta)} \to SW$ as $\beta \to \infty$) can destroy positive definiteness.
18
Example
[Figure: a worked example comparing the LA kernel with the SW score.]
19
SW Score
[Figure: comparison of SVM-pairwise with the LA kernel. SVM-pairwise represents x and y by vectors of SW scores, e.g. (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01), and takes their inner product; the LA kernel compares x and y directly through a pair-HMM-style sum over alignments (values 0.227 and 0.253 shown in the figure).]
20
Diagonal Dominance Issue
The problem is that K(x,x) is easily orders of magnitude larger than K(x,y) even for similar sequences, which biases the performance of the SVM.
21
Two remedies:
(1) The eigen kernel LA-eig:
a. Subtract from the diagonal the smallest negative eigenvalue of the training Gram matrix, if there are negative eigenvalues.
b. LA-eig is then equal to the LA kernel except possibly on the diagonal.
(2) The empirical kernel map LA-ekm: represent each sequence by its vector of kernel values against the training sequences.
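A minimal numpy sketch of the two corrections, assuming the Gram matrix K has already been computed (the function names are mine, not the paper's):

    import numpy as np

    def la_eig(K):
        # LA-eig: if the training Gram matrix has negative eigenvalues,
        # subtract the smallest one from the diagonal; this shifts the
        # spectrum up so K becomes positive semi-definite while leaving
        # off-diagonal entries untouched.
        lam_min = np.linalg.eigvalsh(K).min()   # K is symmetric
        return K - lam_min * np.eye(len(K)) if lam_min < 0 else K

    def la_ekm(K):
        # Empirical kernel map: represent sequence i by row i of K (its
        # kernel values against the training set) and take inner products
        # of these vectors as the new kernel.
        return K @ K.T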
22
Methods
Implementation: the kernel (and therefore the kernels derived from it) can be computed with complexity $O(|x| \cdot |y|)$, using dynamic programming via a slight modification of the SW algorithm.
Normalization: the kernel values are normalized, e.g. as $\tilde{K}(x,y) = K(x,y)/\sqrt{K(x,x)\,K(y,y)}$.
Dataset: 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.
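A sketch of that dynamic program in Python, following the recursion structure of the LA-kernel paper (Saigo et al., 2004); the sign convention for d and e (negative gap scores) and the toy substitution score are my assumptions, and a real implementation would use BLOSUM-style scores:

    import numpy as np

    def la_kernel(x, y, s, beta=0.5, d=-11.0, e=-1.0):
        # O(|x|*|y|) dynamic program for the LA kernel.  M tracks partial
        # alignments ending in an aligned pair, X/Y those ending in a gap,
        # and X2/Y2 account for trailing unaligned residues.
        n, m = len(x), len(y)
        M  = np.zeros((n + 1, m + 1))
        X  = np.zeros((n + 1, m + 1))
        Y  = np.zeros((n + 1, m + 1))
        X2 = np.zeros((n + 1, m + 1))
        Y2 = np.zeros((n + 1, m + 1))
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                M[i, j] = np.exp(beta * s(x[i-1], y[j-1])) * (
                    1 + X[i-1, j-1] + Y[i-1, j-1] + M[i-1, j-1])
                X[i, j] = (np.exp(beta * d) * M[i-1, j]
                           + np.exp(beta * e) * X[i-1, j])
                Y[i, j] = (np.exp(beta * d) * (M[i, j-1] + X[i, j-1])
                           + np.exp(beta * e) * Y[i, j-1])
                X2[i, j] = M[i-1, j] + X2[i-1, j]
                Y2[i, j] = M[i, j-1] + X2[i, j-1] + Y2[i, j-1]
        return 1 + X2[n, m] + Y2[n, m] + M[n, m]

    # Toy usage with a hypothetical match/mismatch score:
    s = lambda a, b: 2.0 if a == b else -1.0
    print(la_kernel("VLSPADK", "VHLTPEEK", s))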
23
ROC Curve
25
Summary of the kernels