Mismatch String Kernels for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.

1 Mismatch String Kernels for SVM Protein Classification. Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble. Presented by Pradeep Anand Ravindranath, Mei Sze Lam

2 Introduction. Problem in computational biology: classification of proteins into functional and structural classes based on homology of protein sequence data.

3 Methods for Protein Classification and Homology Detection: pairwise sequence alignment; profiles for protein families; consensus patterns using motifs; profile HMMs.

4 Focus: Remote Homology Detection

5 How is the problem handled currently? Fisher-SVM: one of the successful discriminative techniques for protein classification, and the best-performing method for remote homology detection.

6 Fisher-SVM. Build a profile HMM for the positive training sequences, defining the log-likelihood function $\log P(x \mid \theta)$ for any protein sequence x. $\theta_0$ is the maximum likelihood estimate of the model parameters.

7 The gradient vector $\nabla_\theta \log P(x \mid \theta)\big|_{\theta=\theta_0}$ assigns to each (positive or negative) training sequence x an explicit feature vector called its Fisher scores. This feature mapping defines a kernel function, called the Fisher kernel, which can then be used to train an SVM classifier.

8 Strengths. Combines the biological information encoded in an HMM with the discriminative power of the SVM algorithm.

9 Negatives. Needs lots of data or sophisticated priors to train the HMM. It is expensive to compute the kernel matrix, as calculating the Fisher scores requires computing forward and backward probabilities from the Baum-Welch algorithm.

10 Mismatch-SVM. The (k,m)-mismatch kernel is based on a feature map to a vector space indexed by all possible subsequences of amino acids of a fixed length k. Each instance of a fixed k-length subsequence in an input sequence contributes to all feature coordinates differing from it by at most m mismatches.

11 Mismatch kernel. Thus, the mismatch kernel adds the biologically important idea of mismatching to the computationally simpler spectrum kernel. The paper describes how to compute the new kernel efficiently using a mismatch tree data structure for values of (k,m) useful in this application.

12 Advantages. By using the mismatch tree data structure, the kernel is fast enough to use on real datasets. Considerably less expensive than the Fisher kernel. Performance equal to Fisher-SVM. Outperforms other methods.

13 This kernel does not depend on any generative model and can be used for other sequence-based classification problems.

14 Feature Maps for Strings. The (k,m)-mismatch kernel is based on a feature map from the space of all finite sequences from an alphabet $A$ of size $|A| = l$ to the $l^k$-dimensional vector space indexed by the set of k-length subsequences (“k-mers”) from $A$, where $A$ is the alphabet representing amino acids and $l$ is the number of amino acids.

15 If $\alpha$ is a k-mer, $\beta$ ranges over all k-length sequences, and $N_{(k,m)}(\alpha)$ is the set of all k-length sequences differing from $\alpha$ by at most m mismatches, we define our feature map $\Phi_{(k,m)}$ as $\Phi_{(k,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in A^k}$, where $\phi_\beta(\alpha) = 1$ if $\beta$ belongs to $N_{(k,m)}(\alpha)$ and $\phi_\beta(\alpha) = 0$ otherwise.
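
The mismatch neighborhood of a single k-mer can be enumerated directly. The following is a minimal sketch, assuming the 20 standard amino-acid letters as the alphabet; the function name `mismatch_neighborhood` and the constant `AMINO_ACIDS` are illustrative and do not come from the slides.

```python
from itertools import combinations, product

# Assumed alphabet: the 20 standard amino-acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mismatch_neighborhood(alpha, m, alphabet=AMINO_ACIDS):
    """Return the set of k-mers that differ from `alpha` in at most m positions."""
    k = len(alpha)
    neighbors = {alpha}  # zero mismatches
    for num_mismatches in range(1, m + 1):
        for positions in combinations(range(k), num_mismatches):
            # At each chosen position, allow every letter except the original.
            choices = [[c for c in alphabet if c != alpha[p]] for p in positions]
            for replacement in product(*choices):
                kmer = list(alpha)
                for p, c in zip(positions, replacement):
                    kmer[p] = c
                neighbors.add("".join(kmer))
    return neighbors

neighbors = mismatch_neighborhood("AVL", m=1)
print(len(neighbors))  # 1 + 3 * 19 = 58
```

Each member of this set is a feature coordinate that receives a 1 from the k-mer AVL.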

16 For a sequence x of any length, we extend the map additively by summing the feature vectors for all the k-mers in x: $\Phi_{(k,m)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} \Phi_{(k,m)}(\alpha)$. The (k,m)-mismatch kernel is given by $K_{(k,m)}(x,y) = \langle \Phi_{(k,m)}(x), \Phi_{(k,m)}(y) \rangle$. For m = 0, we retrieve the k-spectrum kernel.
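
To make the definition concrete, here is a brute-force sketch that follows it literally: it loops over every coordinate β in A^k and counts the k-mers of x within m mismatches of β. It is only practical for tiny alphabets or small k, and is not the efficient method from the paper. The function names and the two-letter toy alphabet are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def mismatch_features(x, k, m, alphabet):
    """Sparse feature vector: coordinate beta -> number of k-mers of x within m mismatches."""
    kmers_in_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    features = {}
    for beta in map("".join, product(alphabet, repeat=k)):
        count = sum(c for alpha, c in kmers_in_x.items() if hamming(alpha, beta) <= m)
        if count:
            features[beta] = count
    return features

def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two (k,m)-mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(v * fy.get(beta, 0) for beta, v in fx.items())

TOY_ALPHABET = "AB"  # toy alphabet so that all of A^k can be enumerated by hand
print(mismatch_kernel("AAB", "ABB", k=2, m=0, alphabet=TOY_ALPHABET))  # 1 (spectrum kernel)
print(mismatch_kernel("AAB", "ABB", k=2, m=1, alphabet=TOY_ALPHABET))  # 9
```

With m = 0 only the shared 2-mer AB contributes; with m = 1 the feature vectors over (AA, AB, BA, BB) are (2, 2, 1, 1) and (1, 2, 1, 2), whose inner product is 9.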

17 Fisher Scores and Spectrum Kernel. Even though the spectrum and mismatch feature maps are defined without any reference to a generative model, there is some similarity between the k-spectrum feature map and the Fisher scores associated with an order k-1 Markov chain model.

18 Efficient Computation of the Mismatch Kernel: Mismatch Tree Data Structure. The mismatch tree data structure is used to represent the feature space (the set of all k-mers) and to perform a lexical traversal of all k-mer instances occurring in the sample dataset, matched with up to m mismatches.

19 Example: Traversing the Mismatch Tree Traversal for input sequence: AVLALKAVLL, k=8, m=1

20 Example: Computing the Kernel for a Pair of Sequences. Traversal of the trie for k=3 (m=0), with s1 = EADLALGKAVF and s2 = ADLALGADQVFNG. Descending the path A, D, L reaches the leaf for the feature ADL, which occurs in both sequences. Update the kernel value K(s1, s2) by adding the contribution for feature ADL.
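
The value this traversal accumulates for the two sequences on the slide can be cross-checked by counting shared 3-mers directly, since m = 0 reduces to the spectrum kernel. This is a sketch using plain k-mer counts rather than the trie; `spectrum_kernel` is an illustrative name.

```python
from collections import Counter

def spectrum_kernel(x, y, k):
    """k-spectrum kernel: sum over k-mers of (count in x) * (count in y)."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(n * cy[kmer] for kmer, n in cx.items())

s1 = "EADLALGKAVF"
s2 = "ADLALGADQVFNG"

shared = sorted(
    set(s1[i:i + 3] for i in range(len(s1) - 2))
    & set(s2[i:i + 3] for i in range(len(s2) - 2))
)
print(shared)                      # ['ADL', 'ALG', 'DLA', 'LAL']
print(spectrum_kernel(s1, s2, 3))  # 4
```

ADL is one of four shared 3-mers, each occurring once in each sequence, so each leaf visit adds 1 to K(s1, s2).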

21 Efficiency Issues for Kernel Computation. Depth-first search via a recursive function: efficient use of memory, so large data sets are not a problem.
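
The recursive depth-first traversal can be sketched as follows. This is a simplified illustration of the idea under stated assumptions, not the authors' implementation: each node holds the k-mer instances that are still within m mismatches of the path from the root, branches with no surviving instances are pruned, and the kernel matrix is updated at depth k. The function name and the instance representation (sequence index, start offset, mismatch count) are assumptions.

```python
def mismatch_kernel_matrix(seqs, k, m, alphabet):
    """Compute the (k,m)-mismatch kernel matrix by depth-first traversal of an implicit trie."""
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    # At the root, every k-mer instance is alive with zero mismatches.
    root = [(s, i, 0) for s, seq in enumerate(seqs) for i in range(len(seq) - k + 1)]

    def descend(instances, depth):
        if not instances:
            return  # prune: no k-mer instance is within m mismatches of this path
        if depth == k:
            # Leaf = one feature coordinate; count surviving instances per sequence.
            counts = {}
            for s, _, _ in instances:
                counts[s] = counts.get(s, 0) + 1
            for a, ca in counts.items():
                for b, cb in counts.items():
                    K[a][b] += ca * cb
            return
        for letter in alphabet:
            survivors = []
            for s, start, mismatches in instances:
                updated = mismatches + (seqs[s][start + depth] != letter)
                if updated <= m:
                    survivors.append((s, start, updated))
            descend(survivors, depth + 1)

    descend(root, 0)
    return K

print(mismatch_kernel_matrix(["AAB", "ABB"], k=2, m=1, alphabet="AB"))
# [[10, 9], [9, 10]]
```

Only the instance lists along the current root-to-leaf path are held in memory at any time, which is the memory advantage of the depth-first approach.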

22 Computational Cost. The theoretical computational cost depends on: the number of mismatches, m (as m increases, the computational cost increases exponentially); the number of different amino acids, l; the number of characters in each subsequence, k (the classifier breaks proteins up into k-mers of length 5~6, and longer strings are handled by summing the feature vectors); and the total length of the sample data, N.
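
One way to see the exponential dependence on m is to count how many feature coordinates a single k-mer touches, i.e. the size of its mismatch neighborhood, $\sum_{i=0}^{m} \binom{k}{i}(l-1)^i$, which is $O(k^m l^m)$ for fixed m. The short sketch below evaluates that count; the function name is illustrative, and k = 5 and l = 20 are chosen to match the k-mer length and alphabet size mentioned on the slides.

```python
from math import comb

def neighborhood_size(k, m, l):
    """Number of k-mers over an l-letter alphabet within m mismatches of a fixed k-mer."""
    return sum(comb(k, i) * (l - 1) ** i for i in range(m + 1))

for m in range(3):
    print(m, neighborhood_size(k=5, m=m, l=20))
# 0 1
# 1 96
# 2 3706
```

Every k-mer instance in the data is carried down this many trie paths, so each extra allowed mismatch multiplies the traversal work by roughly a factor of k times l.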

23 Computational Cost (cont'd). Worst-case scenario for M sequences, where M is just the number of sequences to be processed: supposing the M sequences are all equal, with the maximum number of non-zero entries, the kernel-update cost scales as M x M x n = M x N, where n is the length of each sequence, so that N = M x n.

24 Training and Test Data. 33 classes of superfamily proteins (taken from the SCOP database); families of proteins, each belonging to 1 of 33 superfamilies. In the diagram, + marks the class we are interested in (Class 1) and - marks any class other than Class 1 (Class 2, Class 3, ...). For each class, we want to know whether a given protein sequence belongs to that class (Y/N). 160 experiments were performed on 33 classes.

25 Implementation and Comparison of Methods. We test the mismatch kernel with a publicly available SVM implementation. Four methods are compared: mismatch kernel (uses SVM implementation); Fisher kernel (uses SVM implementation); SAM-T98 (HMM); PSI-BLAST (alignment scoring).

26 Performance Measurement: ROC and ROC50. Show on board: the closer to 1, the better the score (more true positives relative to false positives).
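
Both scores can be computed from a ranked list of classifier outputs. The sketch below assumes the usual definitions: the ROC score is the normalized area under the curve of true positives against false positives, and ROC50 is the same area truncated at the first 50 false positives. The function name `roc_n` is illustrative, and tied scores are not given special treatment.

```python
def roc_n(scores, labels, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    labels: 1 for positive, 0 for negative.
    n=None uses every negative, which gives the ordinary ROC score.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    limit = num_neg if n is None else min(n, num_neg)
    if num_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = false_pos = area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == limit:
                break
    return area / (limit * num_pos)

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
print(roc_n(scores, labels))        # 4/6, about 0.667
print(roc_n(scores, labels, n=1))   # 1/3, about 0.333
print(roc_n(scores, labels, n=50))  # same as the full ROC score here: only 2 negatives
```

A perfect ranking, with every positive above every negative, scores 1.0 under both measures.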

27 Performance Comparison. Comparison of four homology detection methods. Many of the Mismatch-SVM and Fisher kernel classifications fall close to 1, meaning there is a low false-positive error rate. Both classifiers manage to classify almost all of the 33 classes with ROC score > 0.85.

28 Mismatch vs. Spectrum (ROC and ROC50 plots). The mismatch kernel outperforms the spectrum kernel.

29 Mismatch vs. Fisher (ROC50 and ROC plots). No significant difference!

30 Discussion & Conclusion. What was it for? Constructing a kernel for homology detection. What was achieved? A kernel equal in performance to the best known classifier but with a lower computational cost. Future work: since the kernel does not depend on a generative model (unlike Fisher), it can easily be applied to other sequence-based problems, e.g. splice site prediction. Since it is computationally cheaper (i.e. faster), it can be used for practical biological purposes, e.g. multiclass prediction.

