Machine Learning for Protein Classification: Kernel Methods. CS 374, Rajesh Ranganath, 4/10/2008.

Outline: Biological motivation and background; algorithmic concepts; mismatch kernels; semi-supervised methods.

Proteins

The Protein Problem: Primary structure can be determined easily, and 3D structure determines function, but grouping proteins into structural and evolutionary families is difficult. Use machine learning to group proteins.

How to Look at Amino Acid Chains: the Smith-Waterman idea and the mismatch idea.

Families: Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity). Families are further subdivided into proteins, and proteins are divided into species; the same protein may be found in several species. (Hierarchy figure: Fold > Superfamily > Family > Proteins. Morten Nielsen, CBS, BioCentrum, DTU.)

Superfamilies: Proteins which are (remotely) evolutionarily related. Sequence similarity is low, but they share function and special structural features. Relationships between members of a superfamily may not be readily recognizable from the sequence alone.

Folds: Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold. There need be no evolutionary relation between proteins sharing a fold.

Protein Classification: Given a new protein, can we place it in its "correct" position within an existing protein hierarchy? Methods: BLAST / PSI-BLAST, profile HMMs, supervised machine learning methods.

Machine Learning Concepts: Supervised methods; discriminative vs. generative models; transductive learning; support vector machines; kernel methods; semi-supervised methods.

Discriminative and Generative Models (two-panel figure: discriminative; generative).

Transductive Learning: Most learning is inductive: given (x_1, y_1), ..., (x_m, y_m), for any test input x* predict the label y*. Transductive learning: given (x_1, y_1), ..., (x_m, y_m) and all the test inputs {x_1*, ..., x_p*}, predict the labels {y_1*, ..., y_p*}.

Support Vector Machines: A popular discriminative learning algorithm; an optimal geometric-margin classifier that can be solved efficiently using the Sequential Minimal Optimization (SMO) algorithm. If x_1, ..., x_n are the training examples, sign(Σ_i α_i x_i^T x) "decides" where x falls; train the α_i to achieve the best margin.

Support Vector Machines (2): Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs, so sign(Σ_i α_i K(x_i, x)) determines the class of x.
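A minimal sketch of the kernelized decision rule above. The α values, support vectors, and the linear kernel here are illustrative stand-ins (the slide folds the training labels into the α_i, which this sketch does as well):

```python
def svm_decision(alphas, support_vectors, kernel, x):
    """Kernelized SVM decision: sign(sum_i alpha_i * K(x_i, x)).

    Each alpha_i already carries the sign of its training label,
    matching the slide's formulation.
    """
    s = sum(a * kernel(xi, x) for a, xi in zip(alphas, support_vectors))
    return 1 if s >= 0 else -1

# Toy 1-D example with a linear kernel (illustrative numbers only).
linear = lambda u, v: u * v
alphas = [0.5, -0.5]      # one positive, one negative support vector
svs = [2.0, -2.0]
print(svm_decision(alphas, svs, linear, 3.0))    # -> 1
print(svm_decision(alphas, svs, linear, -3.0))   # -> -1
```

Because the decision rule only touches the inputs through `kernel`, swapping in a string kernel (such as the mismatch kernel later in the deck) requires no other changes.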

Kernel Methods: K(x, z) = f(x)^T f(z), where f is the feature mapping and x and z are input vectors. High-dimensional features never need to be calculated explicitly; think of the kernel function as a similarity measure between x and z.
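The slide's worked example did not survive extraction, so here is one common stand-in (not necessarily the original slide's example): the quadratic kernel K(x, z) = (x · z)^2, whose implicit computation agrees with an explicit dot product in the degree-2 feature space:

```python
def quad_kernel(x, z):
    # K(x, z) = (x . z)^2, computed without building the feature map.
    return sum(a * b for a, b in zip(x, z)) ** 2

def quad_features(x):
    # Explicit feature map f(x) = (x_i * x_j for all i, j),
    # so that K(x, z) = f(x) . f(z).
    return [a * b for a in x for b in x]

x, z = [1.0, 2.0], [3.0, 1.0]
implicit = quad_kernel(x, z)
explicit = sum(a * b for a, b in zip(quad_features(x), quad_features(z)))
print(implicit, explicit)  # -> 25.0 25.0
```

For d-dimensional inputs the explicit map has d^2 coordinates while the implicit computation stays O(d), which is the point of kernelizing.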

Mismatch Kernel: Regions of similar amino acid sequence yield similar protein tertiary structure. Used as a kernel for an SVM to identify protein homologies.

k-mer Based SVMs: For a given word size k and mismatch tolerance l, define K(X, Y) = the number of distinct k-long word occurrences with ≤ l mismatches, and the normalized mismatch kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y)). An SVM can be learned by supplying this kernel function. Example (k = 3, l = 1): X = ABACARDI, Y = ABRADABI; K(X, Y) = 4, K'(X, Y) = 4/sqrt(7*7) = 4/7.
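A brute-force sketch of a (k, l)-mismatch kernel via explicit feature vectors: Φ_a(S) counts the k-mers of S within Hamming distance l of each candidate k-mer a, and the kernel is the dot product of two such vectors. The exact counting convention behind the slide's K(X, Y) = 4 is ambiguous in the transcript, so this sketch only checks basic kernel properties rather than reproducing that number:

```python
from itertools import product
from math import sqrt

def mismatch_kernel(x, y, k=3, l=1):
    """Brute-force (k, l)-mismatch kernel over the joint alphabet of x and y."""
    alphabet = sorted(set(x) | set(y))

    def phi(s):
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        feats = {}
        for a in product(alphabet, repeat=k):
            c = sum(1 for w in kmers
                    if sum(p != q for p, q in zip(a, w)) <= l)
            if c:
                feats[a] = c
        return feats

    px, py = phi(x), phi(y)
    return sum(v * py.get(a, 0) for a, v in px.items())

def normalized_mismatch(x, y, k=3, l=1):
    # K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y)), as on the slide.
    return mismatch_kernel(x, y, k, l) / sqrt(
        mismatch_kernel(x, x, k, l) * mismatch_kernel(y, y, k, l))

X, Y = "ABACARDI", "ABRADABI"
print(normalized_mismatch(X, X))  # -> 1.0 by construction
```

Enumerating all |alphabet|^k candidate k-mers is exponential in k; the Leslie et al. paper referenced at the end of the deck computes the same kernel efficiently with a mismatch tree.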

Disadvantages: Determining the 3D structure of proteins is practically impossible at scale, while primary sequences are cheap to determine. How do we use all this unlabeled data? Use semi-supervised learning based on the cluster assumption.

Semi-Supervised Methods: Some examples are labeled; assume labels vary smoothly among all examples.

Semi-Supervised Methods: Some examples are labeled; assume labels vary smoothly among all examples. SVMs and other discriminative methods may make significant mistakes due to lack of data.

Semi-Supervised Methods: Some examples are labeled; assume labels vary smoothly among all examples. Attempt to "contract" the distances within each cluster while keeping the distances between clusters large.

Cluster Kernels: Semi-supervised methods. 1. Neighborhood: (a) for each X, run PSI-BLAST to get similar sequences, Nbd(X); (b) define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φ_original(X'): "counts of all k-mers matching with at most 1 difference, over all sequences that are similar to X"; (c) then K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y'). 2. Next: bagged mismatch.
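The neighborhood kernel K_nbd above is just the base kernel averaged over all pairs drawn from the two neighborhoods. A sketch, with a toy dot-product base kernel standing in for the mismatch kernel and hand-picked neighborhoods standing in for PSI-BLAST results:

```python
def neighborhood_kernel(base_K, nbd, x, y):
    """K_nbd(X, Y): average base_K over Nbd(X) x Nbd(Y).

    nbd maps an item to its neighbor list (which includes the item
    itself), e.g. as produced by PSI-BLAST in the slide's setting.
    """
    nx, ny = nbd[x], nbd[y]
    total = sum(base_K(a, b) for a in nx for b in ny)
    return total / (len(nx) * len(ny))

# Toy example: 2-D points with a dot-product base kernel.
points = {"p": (1.0, 0.0), "q": (0.0, 1.0), "r": (1.0, 1.0)}
dot = lambda a, b: sum(u * v for u, v in zip(points[a], points[b]))
nbd = {"p": ["p", "r"], "q": ["q", "r"]}
print(neighborhood_kernel(dot, nbd, "p", "q"))  # -> 1.0
```

Even though p and q are orthogonal under the base kernel, sharing the neighbor r pulls their neighborhood-kernel similarity up, which is exactly the smoothing effect the cluster assumption is after.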

Bagged Mismatch Kernel: The final method. 1. Run k-means clustering n times, giving assignments c_p(X) for p = 1, ..., n. 2. For every X and Y, count the fraction of runs in which they are bagged together: K_bag(X, Y) = (1/n) Σ_p 1[c_p(X) = c_p(Y)]. 3. Combine the "bag fraction" with the original comparison K(·,·): K_new(X, Y) = K_bag(X, Y) · K(X, Y).
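The three steps above can be sketched directly. The cluster assignments and base kernel value here are hypothetical stand-ins for the n k-means runs and the mismatch kernel:

```python
def bagged_kernel(assignments, base_K, x, y):
    """K_new(X, Y) = K_bag(X, Y) * K(X, Y), where K_bag is the
    fraction of clustering runs placing x and y in the same cluster.

    assignments: list of dicts, item -> cluster id, one dict per run
    (e.g. n independent runs of k-means in the slide's setting).
    """
    n = len(assignments)
    frac = sum(1 for c in assignments if c[x] == c[y]) / n
    return frac * base_K(x, y)

# Hypothetical assignments from n = 4 clustering runs.
runs = [{"a": 0, "b": 0}, {"a": 0, "b": 1},
        {"a": 1, "b": 1}, {"a": 0, "b": 0}]
K = lambda x, y: 2.0   # stand-in base kernel value
print(bagged_kernel(runs, K, "a", "b"))   # -> 3/4 * 2.0 = 1.5
```

Running k-means several times with different initializations makes K_bag a soft co-clustering estimate instead of a single hard partition, so unstable cluster boundaries are downweighted rather than trusted outright.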

(Figure credit: O. Jangmin)

What Works Best? The transductive setting.

References: C. Leslie et al. "Mismatch string kernels for discriminative protein classification." Bioinformatics Advance Access, January 22. J. Weston et al. "Semi-supervised protein classification using cluster kernels." Images pulled from Wikimedia Commons.