Mismatch string kernels for discriminative protein classification, by Leslie et al. Presented by Yan Wang.

Similar presentations
Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?


Support Vector Machines
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Machine learning continued Image source:
LOGO Classification IV Lecturer: Dr. Bo Yuan
Support Vector Machines and Kernel Methods
Support Vector Machines (SVMs) Chapter 5 (Duda et al.)
Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids Y. Wang, O. Zaiane, R. Goebel.
Prénom Nom Document Analysis: Linear Discrimination Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Machine Learning for Protein Classification Ashutosh Saxena CS 374 – Algorithms in Biology Thursday, Nov 16, 2006.
On-line Learning with Passive-Aggressive Algorithms Joseph Keshet The Hebrew University Learning Seminar,2004.
COM (Co-Occurrence Miner): Graph Classification Based on Pattern Co-occurrence Ning Jin, Calvin Young, Wei Wang University of North Carolina at Chapel.
1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class.
Author: Jason Weston et., al PANS Presented by Tie Wang Protein Ranking: From Local to global structure in protein similarity network.
Remote homology detection  Remote homologs:  low sequence similarity, conserved structure/function  A number of databases and tools are available 
Active Learning with Support Vector Machines
Support Vector Machines Kernel Machines
Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.
Support Vector Machine and String Kernels for Protein Classification Christina Leslie Department of Computer Science Columbia University.
The Implicit Mapping into Feature Space. In order to learn non-linear relations with a linear machine, we need to select a set of non- linear features.
Semi-supervised protein classification using cluster kernels Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff and William Stafford.
A Study of the Relationship between SVM and Gabriel Graph ZHANG Wan and Irwin King, Multimedia Information Processing Laboratory, Department of Computer.
What is Learning All about ?  Get knowledge of by study, experience, or being taught  Become aware by information or from observation  Commit to memory.
Protein Classification. PDB Growth New PDB structures.
Nycomed Chair for Bioinformatics and Information Mining Kernel Methods for Classification From Theory to Practice 14. Sept 2009 Iris Adä, Michael Berthold,
Remote Homology detection: A motif based approach CS 6890: Bioinformatics - Dr. Yan CS 6890: Bioinformatics - Dr. Yan Swati Adhau Swati Adhau 04/14/06.
M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.
SUPERVISED NEURAL NETWORKS FOR PROTEIN SEQUENCE ANALYSIS Lecture 11 Dr Lee Nung Kion Faculty of Cognitive Sciences and Human Development UNIMAS,
Linear hyperplanes as classifiers Usman Roshan. Hyperplane separators.
This week: overview on pattern recognition (related to machine learning)
Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.
INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio LECTURE: Support Vector Machines.
Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)
Classifiers Given a feature representation for images, how do we learn a model for distinguishing features from different classes? Zebra Non-zebra Decision.
A Comparative Study of Kernel Methods for Classification Applications Yan Liu Oct 21, 2003.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
Protein Classification Using Averaged Perceptron SVM
Protein Classification. Given a new protein, can we place it in its “correct” position within an existing protein hierarchy? Methods BLAST / PsiBLAST.
Protein motif extraction with neuro-fuzzy optimization Bill C. H. Chang and Author : Bill C. H. Chang and Saman K. Halgamuge Saman K. Halgamuge Adviser.
1 CISC 841 Bioinformatics (Fall 2007) Kernel engineering and applications of SVMs.
CS 1699: Intro to Computer Vision Support Vector Machines Prof. Adriana Kovashka University of Pittsburgh October 29, 2015.
University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Support Vector Machines.
Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.
Support Vector Machines Tao Department of computer science University of Illinois.
CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel:
Support Vector Machines. Notation Assume a binary classification problem. –Instances are represented by vector x   n. –Training examples: x = (x 1,
Application of latent semantic analysis to protein remote homology detection Wu Dongyin 4/13/2015.
Protein Family Classification using Sparse Markov Transducers Proceedings of Eighth International Conference on Intelligent Systems for Molecular Biology.
Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features 王荣 14S
Supervised Machine Learning: Classification Techniques Chaleece Sandberg Chris Bradley Kyle Walsh.
Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.
Fast Kernel Methods for SVM Sequence Classifiers Pavel Kuksa and Vladimir Pavlovic Department of Computer Science Rutgers University.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Ubiquitination Sites Prediction Dah Mee Ko Advisor: Dr.Predrag Radivojac School of Informatics Indiana University May 22, 2009.
SVMs in a Nutshell.
Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.
1 An introduction to support vector machine (SVM) Advisor : Dr.Hsu Graduate : Ching –Wen Hong.
Support Vector Machines Part 2. Recap of SVM algorithm Given training set S = {(x 1, y 1 ), (x 2, y 2 ),..., (x m, y m ) | (x i, y i )   n  {+1, -1}
Linear Models & Clustering Presented by Kwak, Nam-ju 1.
High resolution product by SVM. L’Aquila experience and prospects for the validation site R. Anniballe DIET- Sapienza University of Rome.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
Support Vector Machines
An Introduction to Support Vector Machines
Linear Discrimination
Presentation transcript:

Mismatch string kernels for discriminative protein classification, by Leslie et al. Presented by Yan Wang

Outline Problem Definition Support Vector Machines Mismatch Kernel Mismatch Tree Data Structure Experiments Conclusions

Protein Classification Problem: classification of protein sequences into functional and structural families based on sequence homology. Motivation: many proteins have been sequenced, but their structure and function often remain unknown. A discriminative supervised machine learning approach to inferring protein structure and function needs an efficient way to do the computation.

Protein Classification Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?

Remote Homology Remote homology: superfamily-level homology, i.e. sequences that belong to the same superfamily but not the same family. Motivation: classify proteins based on sequence data into homologous groups to understand the structure and functions of proteins. Previously known approaches: pairwise sequence alignment, profiles, HMMs. New approach: discriminative models.

Discriminative approach A more direct route to the goal and among the most accurate methods. Protein sequences are treated as a set of labeled examples: positive if they are in the family, negative otherwise.

Discriminative Models -- SVM An SVM learns a separating hyperplane that maximizes the margin. Assume the training data are linearly separable in feature space; the linear classification rule is f(x) = sign(<w, x> + b).

The Kernel Function The linear classifier relies only on inner products between vectors. If every data point in the input space is mapped into a high-dimensional space called feature space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = <φ(xi), φ(xj)>, called the kernel. Φ is called the feature map. A kernel function is any function that corresponds to an inner product in some feature space.
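To make the kernel trick concrete, here is a small Python sketch (illustrative, not from the slides) showing that a degree-2 polynomial kernel equals the inner product under an explicit feature map, without ever forming the feature vectors:

```python
import math

def phi(v):
    """Explicit degree-2 feature map for 2-D input:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel: K(x, y) = (<x, y>)^2, which equals
    <phi(x), phi(y)> while staying in the 2-D input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
# explicit and poly2_kernel(x, y) agree: both are 121.0
```

The same principle is what lets an SVM work with string inputs: any function that is an inner product in some feature space can replace <x, y>.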

The k-Spectrum of a Sequence Feature map for the SVM based on the spectrum of a sequence. The k-spectrum of a sequence is the set of all k-length (contiguous, k >= 1) subsequences that it contains. We refer to such a k-length subsequence as a k-mer. Dimension of the k-mer feature space = l^k (l = 20 for the alphabet of amino acids). Example: AKQDYYYYEI contains the 3-mers AKQ, KQD, QDY, DYY, YYY, YYY, YYE, YEI.
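A minimal Python sketch (function name is mine) that extracts the k-spectrum of a sequence as a multiset of k-mers:

```python
from collections import Counter

def k_spectrum(seq, k=3):
    """Map a sequence to its k-spectrum: a Counter from each
    contiguous k-mer to its number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

spec = k_spectrum("AKQDYYYYEI", k=3)
# 8 overlapping 3-mers in total; YYY occurs twice.
```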

k-Spectrum Feature Map The feature map is indexed by all possible k-mers. The k-spectrum feature map with no mismatches: for a sequence x, Φk(x) = (φa(x)) over all k-mers a, where φa(x) = number of occurrences of a in x. Example: AKQDYYYYEI → (0, 0, …, 1, …, 1, …, 2, …, 1), with coordinates indexed AAA, AAC, …, AKQ, …, DYY, …, YYY, …, YEI.

k-Spectrum Kernel The k-spectrum kernel K(x, y) for two sequences x and y is obtained by taking the inner product in feature space: K(x, y) = <Φk(x), Φk(y)>. This kernel simply counts the occurrences of k-length subsequences in each of the sequences under consideration. It gives a simple notion of sequence similarity: two sequences will have a large k-spectrum kernel value if they share many of the same k-mers.
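A sparse Python sketch of the k-spectrum kernel (function name is illustrative): the inner product only needs k-mers that actually occur, so the l^k-dimensional vectors are never materialized:

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum kernel: inner product of the two k-mer count
    vectors, summed sparsely over shared k-mers only."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(n * cy[a] for a, n in cx.items() if a in cy)

spectrum_kernel("AKQDYYYYEI", "AKQDYYYYEI", k=3)  # self-kernel = 10
```

The self-kernel of AKQDYYYYEI is 1+1+1+1+2²+1+1 = 10, since YYY contributes its squared count.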

(k, m)-Mismatch Kernel A slight modification of the k-spectrum kernel: a parameter m allows up to m mismatches in the counting of occurrences. For a fixed k-mer α, its mismatch neighborhood N(k,m)(α) is the set of all k-mers within m mismatches of α. The k-spectrum feature map allowing m mismatches, applied to a k-mer α: Φ(k,m)(α) = (φβ(α)) over all k-mers β, where φβ(α) = 1 if β is within m mismatches of α, and 0 otherwise.

(k,m)-Mismatch Kernel The map extends additively to longer sequences x by summing over all k-mers α in x: Φ(k,m)(x) = Σα in x Φ(k,m)(α). The (k, m)-mismatch kernel is once again just the inner product in feature space: K(k,m)(x, y) = <Φ(k,m)(x), Φ(k,m)(y)>. SVMs can be learned by supplying this kernel function. The learned SVM classifier is given by f(x) = sign(Σi yi αi K(xi, x) + b).
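A naive Python sketch of the (k,m)-mismatch kernel (my own enumeration, not the paper's fast algorithm; for brevity it handles m <= 1 only). It uses the fact that Φ(α)·Φ(β) equals the number of k-mers lying within m mismatches of both α and β:

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def neighborhood(kmer, m=1, alphabet=AMINO):
    """All k-mers within m mismatches of kmer (sketch: m <= 1 only)."""
    out = {kmer}
    if m >= 1:
        for i in range(len(kmer)):
            for c in alphabet:
                out.add(kmer[:i] + c + kmer[i + 1:])
    return out

def mismatch_kernel(x, y, k=3, m=1):
    """(k,m)-mismatch kernel, computed pairwise over k-mers:
    K(x, y) = sum over k-mers a in x, b in y of |N(a) & N(b)|."""
    kx = [x[i:i + k] for i in range(len(x) - k + 1)]
    ky = [y[i:i + k] for i in range(len(y) - k + 1)]
    return sum(len(neighborhood(a, m) & neighborhood(b, m))
               for a in kx for b in ky)
```

With m = 0 this reduces to the k-spectrum kernel; the mismatch tree introduced on the later slides replaces this quadratic pairwise enumeration with a single traversal.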

Example: (k, m)-Mismatch Feature Map

A Simple Application: We first normalize the kernels: Knorm(x, y) = K(x, y) / sqrt(K(x, x) K(y, y)). Then consider the induced distance: d(x, y)^2 = Knorm(x, x) - 2 Knorm(x, y) + Knorm(y, y) = 2 - 2 Knorm(x, y).
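A short Python sketch (illustrative) of the normalization and induced distance, parameterized over any kernel function; here a tiny 3-spectrum kernel stands in for the mismatch kernel:

```python
import math
from collections import Counter

def spec3(x, y):
    """Stand-in kernel: 3-spectrum kernel of two sequences."""
    cx = Counter(x[i:i + 3] for i in range(len(x) - 2))
    cy = Counter(y[i:i + 3] for i in range(len(y) - 2))
    return sum(n * cy[a] for a, n in cx.items() if a in cy)

def normalized(kernel, x, y):
    """K_norm(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))."""
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))

def induced_distance(kernel, x, y):
    """d(x, y) = sqrt(2 - 2 * K_norm(x, y)); 0 for identical inputs."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * normalized(kernel, x, y)))
```

Normalization makes every sequence a unit vector in feature space, so the distance ranges from 0 (identical spectra) to sqrt(2) (no shared k-mers).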

Efficient computation of the kernel matrix with a mismatch tree data structure The entire kernel matrix can be computed in one depth-first traversal of the mismatch tree structure. The (k,m)-mismatch tree is a rooted tree of depth k, where each internal node has 20 branches:  each branch is labeled with an amino acid.  each leaf node represents a k-mer.  each internal node represents the prefix of a k-mer. (Figure: an (8,1)-mismatch tree for the sequence AVLALKAVLL.)

Example: Mismatch Tree Data Structure (Figure: a depth-2 tree over the toy alphabet {A, B, C}; its nine leaves are the 2-mers AA, AB, AC, BA, BB, BC, CA, CB, CC.)
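A compact Python sketch (mine, not the paper's exact code) of the depth-first traversal: each tree node carries the k-mer instances from all sequences that survive with at most m mismatches along the path from the root, and every leaf adds the outer product of its per-sequence instance counts to the kernel matrix. The toy alphabet {A, B, C} matches the example tree:

```python
def mismatch_tree_kernel(seqs, k=2, m=1, alphabet="ABC"):
    """Compute the full (k,m)-mismatch kernel matrix in one depth-first
    traversal of the (implicit) mismatch tree. Each active item is
    (sequence index, start offset, mismatches used); a branch drops an
    instance once it exceeds m mismatches, and each surviving leaf adds
    count(i) * count(j) to every kernel matrix entry (i, j)."""
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    # Initial instances: every k-mer of every sequence, 0 mismatches used.
    items = [(i, s, 0) for i, x in enumerate(seqs)
             for s in range(len(x) - k + 1)]

    def dfs(depth, items):
        if not items:
            return          # prune: no instance survives below this node
        if depth == k:      # leaf: all surviving instances co-occur here
            counts = [0] * n
            for i, _, _ in items:
                counts[i] += 1
            for i in range(n):
                for j in range(n):
                    K[i][j] += counts[i] * counts[j]
            return
        for c in alphabet:  # one branch per alphabet symbol
            child = [(i, s, mm + (seqs[i][s + depth] != c))
                     for i, s, mm in items
                     if mm + (seqs[i][s + depth] != c) <= m]
            dfs(depth + 1, child)

    dfs(0, items)
    return K
```

Because an instance is pruned as soon as its mismatch budget is exhausted, only a small fraction of the tree is ever expanded, which is what makes the single traversal efficient.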

Experiments SCOP experiments with domain homologs

Experiments SCOP experiments without domain homologs

Conclusions Presented mismatch kernels that measure sequence similarity without requiring alignment and without depending on a generative model. Presented a method for efficiently computing these kernels. In the SCOP experiments, this method performs competitively when compared with state-of-the-art methods such as SVM-Fisher and SVM-pairwise.  The mismatch kernel approach gives efficient kernel computation and linear-time prediction, and maintains good performance even when there is little training data.  The mismatch kernel approach can extract high-scoring k-mers from a trained SVM-mismatch classifier in order to look for discriminative motif regions in the positive sequence family.