Sp'10 Bafna/Ideker: Classification (SVMs / Kernel method)


LP versus quadratic programming. LP: linear constraints and a linear objective function; an LP can be solved in polynomial time. In QP, the objective function contains a quadratic form. For positive semidefinite Q, the QP can also be solved in polynomial time.
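For reference, a standard way to write a QP (the slide's own notation is not preserved in this transcript):
\[ \min_x \; \tfrac{1}{2}x^T Q x + c^T x \quad \text{s.t. } Ax \le b, \]
which reduces to an LP when Q = 0.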

Margin of separation. Suppose we find a separating hyperplane (β, β_0) such that: – for all positive points x, β^T x - β_0 >= 1 – for all negative points x, β^T x - β_0 <= -1. What is the margin of separation? (The figure shows the three parallel hyperplanes β^T x - β_0 = 0, 1, -1.)
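The question has a standard answer, sketched here: the margin is the distance between the two bounding hyperplanes,
\[ \text{margin} = \frac{(\beta_0 + 1) - (\beta_0 - 1)}{\lVert \beta \rVert} = \frac{2}{\lVert \beta \rVert}, \]
so maximizing the margin amounts to minimizing the norm of β.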

Separating by a wider margin. Solutions with a wider margin are better.

Separating via misclassification. In general, data is not linearly separable. What if we also want to minimize the number of misclassified points? Recall that each sample x_i in our training set has a label y_i in {-1, 1}. For each point i, y_i(β^T x_i - β_0) should be positive. Define the slack ξ_i >= max{0, 1 - y_i(β^T x_i - β_0)}. If i is correctly classified with margin (y_i(β^T x_i - β_0) >= 1), then ξ_i = 0; if i is misclassified, or close to the boundary, then ξ_i > 0. We therefore also want to minimize Σ_i ξ_i.

Support Vector Machines (wide margin and misclassification). Maximize the margin while minimizing misclassification. The problem is solved using non-linear optimization techniques. It can be reformulated so that it uses only dot products of the data points, which allows us to employ the kernel method; this gives the method much of its power.

Reformulating the optimization.
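The equations of this slide are not preserved in the transcript; the standard soft-margin formulation consistent with the preceding slides (with C a misclassification-penalty parameter) is:
\[ \min_{\beta,\,\beta_0,\,\xi}\;\tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(\beta^T x_i - \beta_0) \ge 1 - \xi_i,\;\; \xi_i \ge 0 \;\;\forall i. \]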

Lagrangian relaxation. We take the goal and its constraints and minimize the corresponding Lagrangian.
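A sketch of the corresponding Lagrangian, with multipliers α_i >= 0 for the margin constraints and μ_i >= 0 for ξ_i >= 0 (the slide's own symbols are lost in this transcript):
\[ L(\beta,\beta_0,\xi,\alpha,\mu) = \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i - \sum_i \alpha_i\bigl[y_i(\beta^T x_i - \beta_0) - 1 + \xi_i\bigr] - \sum_i \mu_i \xi_i. \]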

Simplifying. For fixed α >= 0 and μ >= 0, we minimize the Lagrangian over β, β_0, and ξ.
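Setting the derivatives of L with respect to β, β_0, and ξ_i to zero gives the stationarity conditions that the next slides substitute back; presumably these are the slides' equations (1)-(3), whose standard form is:
\[ (1)\;\beta = \sum_i \alpha_i y_i x_i, \qquad (2)\;\sum_i \alpha_i y_i = 0, \qquad (3)\;\alpha_i + \mu_i = C \;\;\forall i. \]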

Substituting. Substituting (1) into the Lagrangian eliminates β.

Substituting (2, 3), we obtain the minimization problem.
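The resulting problem (the slide's equation is not preserved; this is the standard dual consistent with conditions (1)-(3), written as a maximization) depends on the data only through the dot products x_i^T x_j:
\[ \max_\alpha\; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C. \]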

Classification using SVMs. Under these conditions, the problem is a quadratic program and can be solved using known techniques. Quiz: once we have solved this QP, how do we classify a point x?

The kernel method. The SVM formulation can be solved as a QP involving only dot products. As SVMs are wide-margin classifiers, they provide a more robust solution. However, the true power of the SVM approach comes from the 'kernel method', which allows us to work in higher-dimensional (and non-linear) spaces.

Kernel. Let X be the set of objects – Ex: X = the set of samples in a micro-array experiment – Each object x in X is a vector of gene expression values. A function k: X × X -> R is a positive semidefinite kernel if – k is symmetric – k is positive semidefinite.
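Spelled out (a standard statement of the second condition, which the transcript does not show explicitly): for every finite set of objects x_1, ..., x_n and real coefficients c_1, ..., c_n,
\[ \sum_{i=1}^n \sum_{j=1}^n c_i c_j\, k(x_i, x_j) \ge 0, \]
equivalently, the matrix K with K_ij = k(x_i, x_j) is positive semidefinite.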

Kernels as dot products. Quiz: suppose the objects x are all real vectors (as in gene expression). Define the linear kernel k_L(x, x') = x^T x'. Is k_L a kernel? It is symmetric, but is it positive semidefinite?

The linear kernel is positive semidefinite. Write X as a matrix whose columns are the samples: X = [x_1 x_2 ...]. By definition, the linear kernel matrix is K_L = X^T X. For any c, c^T K_L c = c^T X^T X c = ||Xc||^2 >= 0.

Generalizing kernels. Any object can be represented by a feature vector in real space.

Generalizing. Note that the feature mapping Φ could actually be non-linear. On the flip side, every kernel can be represented as a dot product in a high-dimensional space. Sometimes the kernel is easier to define than the mapping Φ.

The kernel trick. If an algorithm for vectorial data is expressed exclusively in terms of dot products, it can be turned into an algorithm on an arbitrary kernel – simply replace each dot product by the kernel.

Kernel trick example. Consider a kernel k defined through a mapping Φ: k(x, x') = Φ(x)^T Φ(x'). It could be that Φ is very difficult to compute explicitly, while k is easy to compute. Suppose we define the distance between two objects as the distance between their feature vectors, d(x, x') = ||Φ(x) - Φ(x')||. How do we compute this distance?
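Expanding the squared distance shows that it can be evaluated with kernel calls alone, which is the standard answer to the question above:
\[ d(x,x')^2 = \lVert\Phi(x)-\Phi(x')\rVert^2 = k(x,x) - 2\,k(x,x') + k(x',x'). \]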

Kernels and SVMs. Recall that SVM-based classification is described as follows.
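The slide's equation is not preserved; the standard decision rule implied by condition (1) above, written so that the data enter only through dot products, is:
\[ f(x) = \operatorname{sign}\Bigl(\sum_i \alpha_i y_i\, x_i^T x - \beta_0\Bigr). \]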

Kernels and SVMs. Applying the kernel trick, we can try kernels that are biologically relevant.
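Replacing each dot product with the kernel gives the kernelized classifier (again a standard sketch, not necessarily the slide's exact equation):
\[ f(x) = \operatorname{sign}\Bigl(\sum_i \alpha_i y_i\, k(x_i, x) - \beta_0\Bigr). \]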

Examples of kernels for vectors.
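The examples themselves are not preserved in the transcript; standard vector kernels consistent with the conclusion slide (linear, polynomial, RBF) are:
\[ k_{\mathrm{lin}}(x,x') = x^T x', \qquad k_{\mathrm{poly}}(x,x') = (x^T x' + 1)^d, \qquad k_{\mathrm{RBF}}(x,x') = \exp\!\Bigl(-\tfrac{\lVert x-x'\rVert^2}{2\sigma^2}\Bigr). \]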

String kernel. Consider a string s = s_1 s_2 .... Define an index set I as a subset of positions in s; s[I] is the string formed by the characters of s at those positions. Let l(I) be the span of I, and define the weight W(I) = c^{l(I)} with c < 1, so the weight decreases as the span increases. For any string u of length k, the feature Φ_u(s) sums W(I) over all index sets I with s[I] = u.

String kernel. Map every string to a |Σ|^n-dimensional space, indexed by all strings u of length up to n. The mapping Φ is expensive to compute, but given two strings s and t, the dot-product kernel k(s,t) = Φ(s)^T Φ(t) can be computed in O(n |s| |t|) time.
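As an illustration of the definition above, here is a brute-force Python sketch that enumerates index sets directly. It is exponential in string length and is not the O(n |s| |t|) dynamic program mentioned on the slide; the function names and the decay value c = 0.5 are ours.

from itertools import combinations
from collections import defaultdict

def subsequence_features(s, k, c=0.5):
    # phi_u(s): sum of c**l(I) over index sets I of size k with s[I] == u
    phi = defaultdict(float)
    for I in combinations(range(len(s)), k):   # all index sets of size k
        u = "".join(s[i] for i in I)           # the string s[I]
        span = I[-1] - I[0] + 1                # l(I), the span of I
        phi[u] += c ** span                    # weight W(I) = c**l(I)
    return phi

def string_kernel(s, t, k, c=0.5):
    # k(s, t) = phi(s)^T phi(t)
    ps, pt = subsequence_features(s, k, c), subsequence_features(t, k, c)
    return sum(ps[u] * pt[u] for u in ps if u in pt)

print(string_kernel("cat", "cart", k=2))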

SVM conclusion. SVMs are a generic scheme for classifying data with wide margins and few misclassifications. For data that are not easily represented as vectors, the kernel trick provides a standard recipe for classification – define a meaningful kernel and solve using an SVM. Many standard kernels are available (linear, polynomial, RBF, string).

Classification review. We started by treating the classification problem as one of separating points in high-dimensional space. This is natural for gene expression data, but applicable to any kind of data. We discussed separability and linear separation, and algorithms for classification – perceptron – linear discriminant – maximum likelihood – linear programming – SVMs – kernel methods with SVMs.

Classification review. Recall that we considered three problems: – group samples in an unsupervised fashion (clustering) – classify based on training data (often by learning a separating hyperplane) – select marker genes that are diagnostic for the class; all other genes can be discarded, leading to lower dimensionality.

Dimensionality reduction. Many genes have highly correlated expression profiles. By discarding some of the genes, we can greatly reduce the dimensionality of the problem. There are other, more principled ways to do such dimensionality reduction.

Why is high dimensionality bad? With a high enough dimensionality, all points can be linearly separated. Recall that a point x_i is misclassified if – it is positive, but β^T x_i - β_0 <= 0 – it is negative, but β^T x_i - β_0 > 0. In the first case, choose δ_i such that β^T x_i - β_0 + δ_i >= 0. By adding a dimension for each misclassified point, we obtain a higher-dimensional hyperplane that perfectly separates all of the points!

Principal Components Analysis. We get the intrinsic dimensionality of a data set.

Principal Components Analysis. Consider the expression values of 2 genes over 6 samples. Clearly, the expression of the two genes is highly correlated. Projecting all the data onto a single line could explain most of the data. This is a generalization of "discarding the gene".

Projecting. Consider the mean of all points, m, and a vector e emanating from the mean. Algebraically, projection onto e means that each sample x can be represented by the single value e^T(x - m). (The figure shows x, its deviation x - m, and its projection e^T(x - m) onto the line through m along e.)

Higher dimensions. Consider a set of 2 (more generally, k) orthonormal vectors e_1, e_2, .... After projection, each sample x is represented by a 2-dimensional (k-dimensional) vector: (e_1^T(x - m), e_2^T(x - m), ...).

How to project. The generic scheme allows us to project an m-dimensional data set onto a k-dimensional one. How do we select the k 'best' dimensions? The strategy used by PCA is to maximize the variance of the projected points around the mean.

PCA. Suppose all of the data were to be reduced by projecting onto a single line through the mean, in direction e. How do we select the line e?

PCA cont'd. Let each point x_k map to x'_k = m + a_k e. We want to minimize the error Σ_k ||x_k - x'_k||^2. Observation 1: the optimal mapping takes each point x_k to x'_k = m + e^T(x_k - m) e, i.e. a_k = e^T(x_k - m).

Proof of Observation 1. Differentiate the error with respect to a_k.
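The derivation on the slide is not preserved; a standard version of the step, assuming the squared error defined above and ||e|| = 1, is:
\[ J = \sum_k \lVert m + a_k e - x_k \rVert^2 = \sum_k a_k^2 - 2\sum_k a_k\, e^T(x_k - m) + \sum_k \lVert x_k - m \rVert^2, \]
\[ \frac{\partial J}{\partial a_k} = 2a_k - 2\, e^T(x_k - m) = 0 \;\Longrightarrow\; a_k = e^T(x_k - m). \]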

Minimizing PCA error. To minimize the error, we must maximize e^T S e (where S is the scatter matrix of the data), subject to e^T e = 1. At the optimum, S e = λ e with λ = e^T S e; that is, λ is an eigenvalue of S and e the corresponding eigenvector. Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.

PCA steps. X = starting matrix with n columns (samples) and m rows (genes).
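The remaining steps are not preserved in the transcript; a minimal NumPy sketch of the standard procedure (center the data, form the scatter matrix, take the top-k eigenvectors, project) is:

import numpy as np

def pca(X, k):
    # X: m x n data matrix, one sample per column; returns the top-k
    # principal directions (m x k) and the projected coordinates (k x n).
    m_vec = X.mean(axis=1, keepdims=True)      # mean sample m
    Xc = X - m_vec                             # center the data: x - m
    S = Xc @ Xc.T / X.shape[1]                 # scatter / covariance matrix
    vals, vecs = np.linalg.eigh(S)             # eigendecomposition (ascending order)
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # eigenvectors of the k largest eigenvalues
    return top, top.T @ Xc                     # coordinates e_i^T (x - m)

X = np.random.rand(10, 6)                      # e.g. 10 genes x 6 samples
directions, coords = pca(X, k=2)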

End of Lecture.


ALL-AML classification. The two leukemias need different therapeutic regimens, and are usually distinguished through hematopathology. Can gene expression be used for a more definitive test? – 38 bone marrow samples – Total mRNA was hybridized against probes for 6817 genes – Q: Are these classes separable?

Neighborhood analysis (cont'd). Each gene is represented by an expression vector v(g) = (e_1, e_2, ..., e_n). Choose an idealized expression vector as the center. Discriminating genes will be 'closer' to the center (any distance measure can be used).

Neighborhood analysis. Q: Are there genes whose expression correlates with one of the two classes? A: For each class, create an idealized vector c – Compute the number of genes N_c whose expression 'matches' the idealized expression vector – Is N_c significantly larger than N_c* for a random c*?

Neighborhood test. Distance measure used: – For any binary vector c, let the 1 entries denote class 1 and the 0 entries denote class 2 – Compute the mean and std. dev. [μ_1(g), σ_1(g)] of the expression of gene g in class 1, and likewise [μ_2(g), σ_2(g)] in class 2 – P(g,c) = [μ_1(g) - μ_2(g)] / [σ_1(g) + σ_2(g)] – N_1(c,r) = {g | P(g,c) = r} – High density for some r is indicative of correlation with the class distinction – A neighborhood is significant if a random center does not produce the same density.
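A short Python sketch of the statistic P(g,c) as defined above (array names and the random data are illustrative, not from the slides):

import numpy as np

def neighborhood_scores(expr, c):
    # expr: genes x samples expression matrix; c: 0/1 vector over samples
    # (1 = class 1, 0 = class 2). Returns P(g,c) for every gene g.
    x1, x2 = expr[:, c == 1], expr[:, c == 0]
    mu1, mu2 = x1.mean(axis=1), x2.mean(axis=1)    # mu_1(g), mu_2(g)
    sd1, sd2 = x1.std(axis=1), x2.std(axis=1)      # sigma_1(g), sigma_2(g)
    return (mu1 - mu2) / (sd1 + sd2)               # P(g,c)

expr = np.random.rand(100, 38)                     # e.g. 100 genes x 38 samples
c = np.array([1] * 27 + [0] * 11)                  # hypothetical 0/1 class labels
print((neighborhood_scores(expr, c) > 0.3).sum())  # genes with P(g,c) > 0.3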

Neighborhood analysis. #{g | P(g,c) > 0.3} > 709 for the ALL class, vs. 173 by chance. Class prediction should therefore be possible using micro-array expression values.

Class prediction. Choose a fixed set of informative genes (based on their correlation with the class distinction). – The predictor is uniquely defined by the samples and the subset of informative genes. For each informative gene g, define (w_g, b_g): – w_g = P(g,c) (when is this positive?) – b_g = [μ_1(g) + μ_2(g)]/2. Given a new sample X: – x_g is the normalized expression value at g – vote of gene g = w_g (x_g - b_g) (a positive value is a vote for class 1, a negative value for class 2).
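A compact Python sketch of the weighted-voting rule above (the helper, gene count, and labels are illustrative stand-ins, not the slides' data):

import numpy as np

def snr(expr, c):
    # P(g,c) from the neighborhood-test slide
    x1, x2 = expr[:, c == 1], expr[:, c == 0]
    return (x1.mean(axis=1) - x2.mean(axis=1)) / (x1.std(axis=1) + x2.std(axis=1))

def train_predictor(expr, c, n_genes=50):
    # pick the n_genes genes most correlated with the class distinction
    P = snr(expr, c)
    idx = np.argsort(np.abs(P))[::-1][:n_genes]
    mu1 = expr[idx][:, c == 1].mean(axis=1)
    mu2 = expr[idx][:, c == 0].mean(axis=1)
    return idx, P[idx], (mu1 + mu2) / 2            # informative genes, w_g, b_g

def predict(x, idx, w, b):
    # sum the per-gene votes w_g * (x_g - b_g); the sign decides the class
    total = (w * (x[idx] - b)).sum()
    return 1 if total > 0 else 2

expr = np.random.rand(7129, 38)                    # hypothetical genes x samples matrix
c = np.array([1] * 27 + [0] * 11)                  # hypothetical class labels
genes, w, b = train_predictor(expr, c)
print(predict(expr[:, 0], genes, w, b))            # classify the first sample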

Prediction strength. PS = [V_win - V_lose] / [V_win + V_lose] – reflects the margin of victory. A 50-gene predictor is correct on 36/38 samples (cross-validation). Prediction accuracy on other samples is 100% (a prediction was made for 29/34 samples; median PS = 0.73). Other predictors with between 10 and 200 genes all worked well.

Performance.

Differentially expressed genes? Do the predictive genes reveal any biology? The initial expectation was that most genes would be of hematopoietic lineage. However, many of the genes relate to – cell cycle progression – chromatin remodelling – transcription – known oncogenes – leukemia drug targets (etoposide).

Relationship between the ML classifier and the Golub predictor. The maximum-likelihood classifier, when the covariance matrix is diagonal and identical across the two classes, is similar to Golub's classifier.

Automatic class discovery. The classification of different cancers has emerged over years of hypothesis-driven research. Suppose you were given unlabeled samples of ALL/AML; would you be able to distinguish the two classes?

Self-Organizing Maps. An SOM was applied to group the 38 samples. Class A1 contained 24/25 ALL samples and 3/13 AML samples. How can we validate this? Use the labels to do supervised classification via cross-validation: a 20-gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertains.

Comparing various error models.

Conclusion.