Machine Learning for Protein Classification Ashutosh Saxena CS 374 – Algorithms in Biology Thursday, Nov 16, 2006.

2 Papers  Mismatch string kernels for discriminative protein classification. Christina Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston and William Stafford Noble. Bioinformatics, 2004.  Semi-supervised protein classification using cluster kernels. Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff and William Stafford Noble. Bioinformatics, 2005.

3 Presentation Overview  Protein Classification  Machine Learning algorithms overview  Features, model and learning  Supervised, Unsupervised, and Semi-supervised  SVM: Kernels and Margins  Mismatch Kernels and Cluster Kernels  Ranking

4 Protein Sequence Classification  Protein represented by sequence of amino acids, encoded by a gene  Easy to sequence proteins, difficult to obtain structure  Classification Problem: Learn how to classify protein sequence data into families and superfamilies defined by structure/function relationships.

5 Structural Hierarchy of Proteins  Remote homologs: sequences that belong to the same superfamily but not the same family – remote evolutionary relationship  Structure and function conserved, though sequence similarity can be low

6 Learning Problem  Use discriminative supervised learning approach to classify protein sequences into structurally-related families, superfamilies  Labeled training data: proteins with known structure, positive (+) if example belongs to a family or superfamily, negative (-) otherwise  Focus on remote homology detection

7 Protein Ranking  Ranking problem: given query protein sequence, return ranked list of similar proteins from sequence database  Limitations of classification framework  Small amount of labeled data (proteins with known structure), huge unlabeled databases  Missing classes: undiscovered structures  Good local similarity scores, based on heuristic alignment: BLAST, PSI-BLAST

8 Machine Learning Background  Data -> Features, Labels  Supervised:  Training: (Features, Labels) => learn weights  Prediction: (Features) => Labels using weights.  Unsupervised: Only features, no labels  Labels (classification) could be ranks (ranking) or continuous values (regression).  Classification  Logistic Regression  SVM

9 Support Vector Machines  Training examples mapped to (high-dimensional) feature space F(x) = (F_1(x), …, F_N(x))  Learn linear classifier in feature space f(x) = ⟨w, F(x)⟩ + b, where w = Σ_i y_i α_i F(x_i), by solving an optimization problem: trade-off between maximizing geometric margin and minimizing margin violations  Large margin gives good generalization performance, even in high dimensions
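The decision function on this slide can be sketched directly. A minimal toy example, with the support vectors and dual coefficients α assumed rather than learned (the identity feature map stands in for F):

```python
# Sketch of the SVM decision function f(x) = <w, F(x)> + b,
# with w = sum_i y_i * alpha_i * F(x_i). The alphas below are
# assumed for illustration, not the result of training.

def decision(x, support, labels, alphas, b):
    """Evaluate f(x) = sum_i y_i * alpha_i * <x_i, x> + b."""
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    return sum(y * a * dot(xi, x)
               for xi, y, a in zip(support, labels, alphas)) + b

# Toy separable data: one support vector per class.
support = [(1.0, 1.0), (-1.0, -1.0)]
labels  = [+1, -1]
alphas  = [0.5, 0.5]   # assumed dual coefficients
b = 0.0

print(decision((2.0, 2.0), support, labels, alphas, b))    # positive side
print(decision((-2.0, -1.0), support, labels, alphas, b))  # negative side
```

Note that f(x) depends on the training points only through inner products, which is what makes the kernel substitution on the next slide possible.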

10 Support Vector Machines: Kernel  Kernel trick: To train an SVM, can use kernel rather than explicit feature map  Can define kernels for sequences, graphs, other discrete objects: { sequences } -> R^N  For sequences x, y, feature map F, kernel value is inner product in feature space: K(x, y) = ⟨F(x), F(y)⟩

11 String Kernels for Biosequences  Biological data needs kernels suited to its structure  Fast string kernels for biological sequence data  Biologically-inspired underlying feature map  Kernels scale linearly with sequence length, O(c_K(|x| + |y|)) to compute  Protein classification performance competitive with best available methods  Mismatches for inexact sequence matching (other models later)

12 Spectrum-based Feature Map  Idea: feature map based on spectrum of a sequence  The k-spectrum of a sequence is the set of all k-length contiguous subsequences that it contains  Feature map is indexed by all possible k-length subsequences (“k-mers”) from the alphabet of amino acids  Dimension of feature space = |Σ|^k (|Σ| = 20 for amino acids)
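The k-spectrum feature map described above is easy to sketch: count every length-k substring, and take the inner product of the two count vectors. The toy sequences below are over a two-letter alphabet for brevity:

```python
# Minimal k-spectrum kernel: Phi_k(x) counts every length-k contiguous
# subsequence of x; K(x, y) is the inner product of the count vectors.
from collections import Counter

def spectrum(x, k):
    """Count vector of all k-mers contained in sequence x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, y, k):
    """K(x, y) = <Phi_k(x), Phi_k(y)>, summed over shared k-mers only."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(sx[kmer] * sy[kmer] for kmer in sx if kmer in sy)

print(spectrum_kernel("ABABA", "BABAB", 2))  # AB and BA each occur twice in both
```

Only k-mers actually present in the sequences contribute, so the |Σ|^k-dimensional feature vector never has to be materialized.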

13 K-Spectrum Feature Map

14 Inexact Matches  Improve performance by explicitly taking care of mismatches

15 Mismatch Tree  Mismatches can be efficiently tracked through a tree.
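The mismatch tree is what makes this efficient in practice; for intuition only, here is a brute-force (k, m)-mismatch kernel that enumerates the whole k-mer feature space, which the tree avoids. The two-letter alphabet and toy sequences are assumptions for illustration, not the papers' data:

```python
# Naive (k, m)-mismatch kernel: Phi[t] = number of k-mers in x within
# Hamming distance m of k-mer t; K(x, y) = <Phi(x), Phi(y)>.
# Enumerating alphabet**k is exponential -- the mismatch tree prunes this.
from itertools import product

def mismatch_feature(x, k, m, alphabet):
    """Phi[t] over all t in alphabet^k, by direct enumeration."""
    kmers = [x[i:i + k] for i in range(len(x) - k + 1)]
    phi = {}
    for t in product(alphabet, repeat=k):
        t = "".join(t)
        phi[t] = sum(1 for s in kmers
                     if sum(a != b for a, b in zip(s, t)) <= m)
    return phi

def mismatch_kernel(x, y, k, m, alphabet):
    px, py = (mismatch_feature(s, k, m, alphabet) for s in (x, y))
    return sum(px[t] * py[t] for t in px)

print(mismatch_kernel("ABBA", "BAAB", 2, 1, "AB"))
```

With m = 0 this reduces exactly to the k-spectrum kernel, which is a useful sanity check on any implementation.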

16 Fast Prediction  Kernel (string mismatch) can be computed efficiently.  SVM is fast (a dot product over support vectors)  Therefore, protein classification is fast.

17 Experiments  Tested with experiments on SCOP dataset from Jaakkola et al.  Experiments designed to ask: Could the method discover a new family of a known superfamily?

18 SCOP Experiments  160 experiments for 33 target families from 16 superfamilies  Compared results against  SVM-Fisher (HMM-based kernel)  SAM-T98 (profile HMM)  PSI-BLAST (heuristic alignment-based method)  ROC scores: area under the graph of true positives as a function of false positives, scaled so that both axes vary between 0 and 1
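The ROC score defined on this slide can be computed directly from a ranked list of predictions. A small sketch over (score, is_positive) pairs, ignoring score ties; the toy scores are made up:

```python
# ROC score: area under the curve of true positives vs false positives,
# with both axes scaled to [0, 1]. Assumes no tied scores across classes.

def roc_score(scored):
    """scored: list of (score, is_positive). Returns area under ROC curve."""
    ranked = sorted(scored, key=lambda p: -p[0])
    pos = sum(1 for _, p in ranked if p)
    neg = len(ranked) - pos
    tp = fp = area = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp            # each false positive adds the current TP count
    return area / (pos * neg)     # normalize both axes to [0, 1]

print(roc_score([(0.9, True), (0.8, True), (0.7, False), (0.6, False)]))  # 1.0
```

A perfect ranking (all positives above all negatives) scores 1.0; a fully inverted ranking scores 0.0.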

19 Result

20 Result (contd.)

21 Aside: Similarity with SVM-Fisher  Fisher kernel for Markov chain model similar to k-spectrum kernel  Mismatch-SVM performs as well as SVM-Fisher but avoids computational expense, training difficulties of profile HMM

22 Advantages of mismatch SVM  Efficient computation: scales linearly with sequence length  Fast prediction: classify test sequences in linear time  Interpretation of learned classifier  General approach for biosequence data, does not rely on alignment or generative model

23 Other Extensions of Mismatch Kernel  Other models for inexact matching  Restricted gaps  Probabilistic substitutions  Wildcards

24 Inexact Matching through Gaps

25 Result: Gap Kernel

26 Inexact matching through Probabilistic Substitutions  Use substitution matrices to obtain P(a|b), substitution probabilities for residues a, b  The mutation neighborhood M_(k,σ)(s) is the set of all k-mers t such that −Σ_i log P(s_i | t_i) < σ
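A membership test for the mutation neighborhood M_(k,σ)(s) follows directly from the definition above. The two-letter substitution table here is invented for illustration; real kernels derive P(a|b) from matrices such as BLOSUM62:

```python
# A k-mer t is in M_(k,sigma)(s) when -sum_i log P(s_i | t_i) < sigma.
import math

# Toy conditional substitution table P[a][b] = P(a | b); invented values.
P = {"A": {"A": 0.9, "B": 0.1},
     "B": {"A": 0.1, "B": 0.9}}

def in_neighborhood(s, t, sigma):
    """True if -sum_i log P(s_i | t_i) < sigma."""
    return -sum(math.log(P[a][b]) for a, b in zip(s, t)) < sigma

print(in_neighborhood("AB", "AB", 1.0))  # exact match: small cost, inside
print(in_neighborhood("AB", "BA", 1.0))  # two substitutions: large cost, outside
```

Likely substitutions add little to the cost, so the threshold σ controls how far the neighborhood extends beyond exact matches.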

27 Results: Substitution

28 Wildcards

29 Results: Wildcard

30 Cluster Kernels: Semi-Supervised  About 30,000 proteins with known structure (labeled proteins), but about 1 million sequenced proteins  BLAST, PSI-BLAST: widely-used heuristic alignment-based sequence similarity scores  Good local similarity score, less useful for more remote homology  BLAST/PSI-BLAST E-values give good measure of distance between sequences  Can we use unlabeled data, combined with good local distance measure, for semi-supervised approach to protein classification?

31 Cluster Kernels  Use unlabeled data to change the (string) kernel representation  Cluster assumption: decision boundary should pass through low density region of input space; clusters in the data are likely to have consistent labels  Profile kernel  Bagged kernel

32 Profile Kernel

33 Profile Kernel

34 Bagged Kernel  A hack to use clustering for a kernel  Use k-means clustering to cluster data (labeled + unlabeled), N bagged runs  Using N runs, define p(x,y) = (# times x, y are in same cluster)/N  Bagged kernel: K Bagged (x,y) = p(x,y) K(x,y)
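The bagged kernel on this slide is simple to sketch once the N clustering runs are available. Here, hypothetical cluster assignments stand in for real k-means runs over labeled + unlabeled data, and a constant placeholder stands in for the base string kernel:

```python
# Bagged kernel: p(x, y) = fraction of N bagged clustering runs placing
# x and y in the same cluster; K_bagged(x, y) = p(x, y) * K(x, y).

# runs[r][i] = cluster id of point i in bagged run r (assumed, N = 4 runs).
runs = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]

def p_same_cluster(i, j):
    """Fraction of runs in which points i and j share a cluster."""
    return sum(run[i] == run[j] for run in runs) / len(runs)

def bagged_kernel(i, j, base_kernel):
    return p_same_cluster(i, j) * base_kernel(i, j)

base = lambda i, j: 1.0  # placeholder for the base string kernel
print(bagged_kernel(0, 1, base))  # points 0 and 1 co-cluster in 3 of 4 runs
```

The factor p(x, y) shrinks the kernel value for pairs that rarely co-cluster, pushing the decision boundary toward low-density regions, which is the cluster assumption in action.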

35 Experimental Setup  Full dataset: 7329 SCOP protein sequences  Experiments:  54 target families (remote homology detection)  Test + training sets: approximately <4000 sequences per experiment; other data treated as unlabeled  Evaluation: How well do cluster kernel approaches compare to the standard approach, adding positive homologs to dataset?

36 Results: Cluster Kernels

37 Conclusions  SVMs with string kernels – like mismatch kernels – that incorporate inexact matching are competitive with best-known methods for protein classification  Efficient kernel computation: linear-time prediction, feature extraction  Gaps, substitutions, and wildcards other possible kernels  Semi-supervised cluster kernels – using unlabeled data to modify kernel representation – improve on original string kernel

38 Future Directions  Full multiclass protein classification problem  1000s of classes of different sizes, hierarchical labels  Use of unlabeled data for improving kernel representation  Domain segmentation problem: predict and classify domains of multidomain proteins  Local structure prediction: predict local conformation state (backbone angles) for short peptide segments, step towards structure prediction

Thank you CS 374 – Algorithms in Biology Tuesday, October 24, 2006

40 Protein Ranking  Pairwise sequence comparison: most fundamental bioinformatics application  BLAST, PSI-BLAST: widely-used heuristic alignment-based sequence similarity scores  Given query sequence, search unlabeled database and return ranked list of similar sequences  Query sequence does not have to belong to known family or superfamily  Good local similarity score, less useful for more remote homology  Can we use semi-supervised machine learning to improve on PSI-BLAST?

41 Ranking Induced By Clustering

42 Ranking: Experimental Setup  Training set: 4246 SCOP protein sequences (from 554 superfamilies) – known classes  Test set:  3083 SCOP protein sequences (from 516 different superfamilies) – hidden classes  101,403 unlabeled sequences from SWISSPROT  Task: How well can we retrieve database sequences (from train + test sets) in same superfamily as query? Evaluate with ROC-50 scores  For initial experiments, SWISSPROT sequences only used for PSI-BLAST scores, not for clustering

43 Results: Ranking