Support Vector Machines Classification
Venables & Ripley, Section 12.5
CSU Hayward Statistics 6601
Joseph Rickert & Timothy McKusick
December 1, 2004
Support Vector Machine
What is the SVM?
- The SVM is a generalization of the Optimal Hyperplane Algorithm (OHA).
Why is the SVM important?
- It allows the use of more similarity measures than the OHA.
- Through the use of kernel methods, it works with non-vector data.
Simple Linear Classifier
X = R^p,  f(x) = w^T x + b
Each x ∈ X is classified into 2 classes labeled y ∈ {+1, -1}:
  y = +1 if f(x) ≥ 0 and y = -1 if f(x) < 0.
S = {(x_1, y_1), (x_2, y_2), ...}
Given S, the problem is to learn f (find w and b).
For each f, check whether all (x_i, y_i) are correctly classified, i.e. y_i f(x_i) ≥ 0.
Choose f so that the number of errors is minimized.
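As an illustration (not from the original slides), a minimal R sketch of this decision rule, with w and b chosen by hand:

# illustrative linear classifier: f(x) = w'x + b, predict sign(f)
linear_classify <- function(X, w, b) {
  f <- as.vector(X %*% w + b)     # decision values f(x_i)
  ifelse(f >= 0, 1, -1)           # labels in {+1, -1}
}

# toy data and hand-picked parameters
X <- matrix(rnorm(20), ncol = 2)          # 10 points in R^2
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)   # true labels
w <- c(1, 1); b <- 0                      # hypothetical w and b
y.hat <- linear_classify(X, w, b)
sum(y.hat != y)                           # number of training errors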
But what if the training set is not linearly separable?
f(x) = w^T x + b defines two half planes {x: f(x) ≥ 1} and {x: f(x) ≤ -1}, separated by a margin of width 2/||w||.
Classify with the "hinge" loss function c(f,x,y) = max(0, 1 - y f(x)), which measures the distance from the correct half plane.
If (x,y) is correctly classified with large confidence, then c(f,x,y) = 0.
- y f(x) ≥ 1: correct with large confidence
- 0 ≤ y f(x) < 1: correct with small confidence
- y f(x) < 0: misclassified
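A small R sketch of the hinge loss (illustrative, not from the slides):

# hinge loss c(f, x, y) = max(0, 1 - y * f(x))
hinge <- function(f.values, y) pmax(0, 1 - y * f.values)

# toy example: three decision values for points that are truly in class +1
f.values <- c(2.0, 0.4, -0.3)
hinge(f.values, y = c(1, 1, 1))   # 0.0 (large conf), 0.6 (small conf), 1.3 (misclassified)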
SVMs combine the requirements of a large margin and few misclassifications by solving the problem:
New formulation: min (1/2)||w||^2 + C Σ_i c(f, x_i, y_i)  w.r.t. w and b
- C is a parameter that controls the tradeoff between margin and misclassification.
  Large C: smaller margin, but more samples correctly classified with strong confidence.
- Technical difficulty: the hinge loss c(f, x_i, y_i) is not differentiable.
Even better formulation: use slack variables ξ_i:
  min (1/2)||w||^2 + C Σ_i ξ_i  w.r.t. w, ξ and b, under the constraint ξ_i ≥ c(f, x_i, y_i)   (*)
- But (*) is equivalent to ξ_i ≥ 0 and y_i(w^T x_i + b) ≥ 1 - ξ_i, for i = 1...n.
- Solve this quadratic optimization problem with Lagrange multipliers.
Support Vectors
Lagrange multiplier formulation: find the α that maximizes
  W(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j x_i^T x_j
under the constraints Σ_i y_i α_i = 0 and 0 ≤ α_i ≤ C.
The points with positive Lagrange multipliers, α_i > 0, are called support vectors.
- The set of support vectors contains all the information used by the SVM to learn the discrimination function.
(Figure: points labeled α_i = C, 0 < α_i < C, and α_i = 0.)
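To make the dual concrete, here is a minimal sketch that solves this quadratic program directly with the quadprog package on a toy linear problem. This package is not used in the original slides (e1071's svm() handles the optimization internally); the simulated data, the ridge term and the threshold for α_i > 0 are choices made purely for the illustration.

# solve.QP minimizes (1/2) a'D a - d'a subject to A'a >= b0,
# with the first `meq` constraints treated as equalities
library(quadprog)

set.seed(1)
n <- 40
X <- rbind(matrix(rnorm(n, mean =  1), ncol = 2),
           matrix(rnorm(n, mean = -1), ncol = 2))
y <- rep(c(1, -1), each = n / 2)
C <- 1

K    <- X %*% t(X)                        # linear kernel x_i' x_j
Dmat <- (y %*% t(y)) * K + diag(1e-8, n)  # small ridge keeps D positive definite
dvec <- rep(1, n)
Amat <- cbind(y, diag(n), -diag(n))       # sum(y * a) = 0, a >= 0, -a >= -C
bvec <- c(0, rep(0, n), rep(-C, n))

alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
sv    <- which(alpha > 1e-5)              # support vectors: alpha_i > 0
length(sv)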
Kernel Methods
Data are not represented individually, but only through a set of pairwise comparisons.
- X: a set of objects (e.g. proteins)
- S = (aatcgagtcac, atggacgtct, tgcactact): each object is represented by a sequence
- K: the kernel matrix (3 x 3 for these three sequences); each number in the kernel matrix is a measure of the similarity or "distance" between two objects.
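As an illustration (not from the slides), a tiny 2-mer "spectrum" kernel for these three sequences: each sequence is mapped to a vector of 2-mer counts (an explicit feature map Φ), and K is the matrix of pairwise dot products. The choice of 2-mers is arbitrary and only for demonstration.

seqs  <- c("aatcgagtcac", "atggacgtct", "tgcactact")
kmers <- apply(expand.grid(c("a","c","g","t"), c("a","c","g","t")), 1, paste, collapse = "")

count_kmers <- function(s, kmers) {
  chunks <- substring(s, 1:(nchar(s) - 1), 2:nchar(s))   # all 2-mers in s
  sapply(kmers, function(k) sum(chunks == k))
}

Phi <- t(sapply(seqs, count_kmers, kmers = kmers))  # feature vectors, one row per sequence
K   <- Phi %*% t(Phi)                               # kernel matrix of pairwise similarities
K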
Kernels
Properties of kernels:
- Kernels are measures of similarity: K(x,x') is large when x and x' are similar.
- Kernels must be:
  - positive definite
  - symmetric
- For every such kernel K there exist a Hilbert space F and a mapping Φ: X → F with K(x,x') = ⟨Φ(x), Φ(x')⟩ for all x, x' ∈ X.
- Hence all kernels can be thought of as dot products in some feature space.
Advantages of kernels:
- Data of very different nature can be analyzed in a unified framework.
- No matter what the objects are, n objects are always represented by an n x n matrix.
- It is often easier to compare objects than to represent them numerically.
- Complete modularity between the function used to represent the data and the algorithm used to analyze it.
The "Kernel Trick"
- Any algorithm for vector data that can be expressed in terms of dot products can be performed implicitly in the feature space associated with the kernel, by replacing each dot product with the kernel evaluation.
- e.g. For some feature space F, let d(x,x') = ||Φ(x) - Φ(x')||.
- But ||Φ(x) - Φ(x')||^2 = ⟨Φ(x), Φ(x)⟩ + ⟨Φ(x'), Φ(x')⟩ - 2⟨Φ(x), Φ(x')⟩
- So d(x,x') = (K(x,x) + K(x',x') - 2K(x,x'))^(1/2)
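Given any kernel matrix K (for instance the spectrum-kernel matrix sketched above), this distance can be computed in one line of R (illustrative):

# pairwise feature-space distances: d(x_i, x_j) = sqrt(K(i,i) + K(j,j) - 2 K(i,j))
D <- sqrt(outer(diag(K), diag(K), "+") - 2 * K)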
Nonlinear Separation
- If X is a vector space and the kernel is nonlinear, then linear separation in the feature space F corresponds to nonlinear separation in X.
(Figure: a nonlinear boundary in X mapped by Φ to a linear boundary in F.)
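A small sketch (not from the slides) of this effect with e1071: data with a circular class boundary in R^2 cannot be separated by a linear kernel but are handled well by the radial kernel. The simulated data and the accuracy comparison are purely illustrative.

library(e1071)
set.seed(2)
n <- 200
x <- matrix(runif(2 * n, -1, 1), ncol = 2)
y <- factor(ifelse(rowSums(x^2) < 0.5, "inside", "outside"))  # circular class boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit.lin <- svm(y ~ ., data = dat, kernel = "linear")
fit.rbf <- svm(y ~ ., data = dat, kernel = "radial")
mean(predict(fit.lin, dat) == dat$y)   # poor training accuracy
mean(predict(fit.rbf, dat) == dat$y)   # near-perfect training accuracy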
SVM with Kernel
Final formulation:
- Find the α that maximizes W(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j k(x_i, x_j)
  under the constraints Σ_i y_i α_i = 0 and 0 ≤ α_i ≤ C.
- Find an index i with 0 < α_i < C and set b = y_i - Σ_j y_j α_j k(x_i, x_j).
- The classification of a new object x ∈ X is then determined by the sign of
  f(x) = Σ_i y_i α_i k(x_i, x) + b.
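A minimal R sketch of this decision function (illustrative; the support vectors X.sv, their labels y.sv, the multipliers alpha.sv, the offset b and the kernel function k are assumed to come from a fitted model):

# f(x) = sum_i y_i * alpha_i * k(x_i, x) + b; classify by sign(f)
svm_decision <- function(x.new, X.sv, y.sv, alpha.sv, b, k) {
  f <- sum(y.sv * alpha.sv * apply(X.sv, 1, k, x.new)) + b
  ifelse(f >= 0, 1, -1)
}

# example kernel: radial basis, k(u, v) = exp(-gamma * |u - v|^2)
rbf <- function(u, v, gamma = 0.25) exp(-gamma * sum((u - v)^2))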
iris data set (Anderson 1935)
150 cases, 50 each of 3 species of iris.
Example from page 48 of "The e1071 Package" manual.
First 10 lines of iris:

> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
SVM ANALYSIS OF IRIS DATA

# SVM ANALYSIS OF IRIS DATA SET
# classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
summary(model)

Call:
svm(formula = Species ~ ., data = iris)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  51
Number of Classes:  3
Levels:
 setosa versicolor virginica

The "cost" parameter is the C of the Lagrangian formulation; the radial kernel is exp(-gamma * |u - v|^2).
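The cost and gamma shown above are the package defaults. A sketch of how one might search over them with e1071's tune.svm (this tuning step is not part of the original analysis, and the grid values are arbitrary):

# grid search over gamma and cost with 10-fold cross-validation
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 10^(-2:1), cost = 10^(0:2))
summary(tuned)          # cross-validation error for each (gamma, cost) pair
tuned$best.parameters   # best combination found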
Exploring the SVM Model

# test with training data
x <- subset(iris, select = -Species)
y <- iris$Species
pred <- predict(model, x)

# Check accuracy:
table(pred, y)

# compute decision values:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4, ]

(Output shown on the slide: the confusion table of predicted vs. true species, and the first four rows of the decision-value matrix with columns setosa/versicolor, setosa/virginica and versicolor/virginica.)
Visualize classes with MDS

# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[, -5])),
     col = as.integer(iris[, 5]),
     pch = c("o", "+")[1:150 %in% model$index + 1])

cmdscale: multidimensional scaling (principal coordinates analysis)
black: setosa   red: versicolor   green: virginica
iris split into training and test sets
The first 25 cases of each species form the training set; the remaining cases form the test set (a sketch of how the split might be constructed follows this slide).

## SECOND SVM ANALYSIS OF IRIS DATA SET
## classification mode
# default with factor response
# Train with iris.train data
model.2 <- svm(fS.TR ~ ., data = iris.train)

# output from summary
summary(model.2)

Call:
svm(formula = fS.TR ~ ., data = iris.train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  32
Number of Classes:  3
Levels:
 setosa versicolor virginica
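The construction of iris.train, iris.test and the factor responses fS.TR and fS.TE is not shown on the slides; a plausible sketch, assuming the split described above (first 25 rows of each species for training, the rest for testing):

idx.train  <- c(1:25, 51:75, 101:125)                  # first 25 of each species
fS.TR      <- factor(iris$Species[idx.train])          # training response
fS.TE      <- factor(iris$Species[-idx.train])         # test response
iris.train <- data.frame(iris[idx.train, 1:4], fS.TR)  # predictors + response
iris.test  <- data.frame(iris[-idx.train, 1:4], fS.TE)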
iris test results

# test with iris.test data
x.2 <- subset(iris.test, select = -fS.TE)
y.2 <- fS.TE
pred.2 <- predict(model.2, x.2)

# Check accuracy:
table(pred.2, y.2)

# compute decision values and probabilities:
pred.2 <- predict(model.2, x.2, decision.values = TRUE)
attr(pred.2, "decision.values")[1:4, ]

(Output shown on the slide: the confusion table of predicted vs. true species for the test set, and the first four rows of the decision-value matrix with columns setosa/versicolor, setosa/virginica and versicolor/virginica.)
iris training and test sets
Microarray Data from Golub et al., "Molecular Classification of Cancer: Class Prediction by Gene Expression Monitoring," Science, Vol. 286, 10/15/1999

Heatmap of expression levels of predictive genes:
- Rows: genes
- Columns: samples
- Expression levels (EL) of each gene are relative to the mean EL for that gene in the initial dataset:
  - red if EL > mean
  - blue if EL < mean
  - the scale indicates how far above or below the mean
- Top panel: genes highly expressed in ALL
- Bottom panel: genes more highly expressed in AML
Microarray Data Transposed (rows = samples, columns = genes)

Training data: 38 samples (ALL: 27, AML: 11); the 7129 x 38 expression matrix is transposed to 38 x 7129.
Test data: 34 samples (ALL: 20, AML: 14); the 7129 x 34 expression matrix is transposed to 34 x 7129.

(The slide shows the first 15 rows and 10 columns of the transposed training matrix.)
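The code that builds fmat.train and fmat.test is not shown; a plausible sketch, assuming the raw Golub expression matrices (genes in rows, samples in columns) have already been read into hypothetical objects X.train (7129 x 38) and X.test (7129 x 34):

# transpose so that rows = samples and columns = genes, as svm() expects
fmat.train <- data.frame(t(X.train))   # 38 x 7129
fmat.test  <- data.frame(t(X.test))    # 34 x 7129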
SVM ANALYSIS OF MICROARRAY DATA

# classification mode
# default with factor response
y  <- c(rep(0, 27), rep(1, 11))
fy <- factor(y, levels = 0:1)
levels(fy) <- c("ALL", "AML")

# compute svm on the first 3000 genes only, because of memory overflow problems
model.ma <- svm(fy ~ ., data = fmat.train[, 1:3000])

Call:
svm(formula = fy ~ ., data = fmat.train[, 1:3000])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.0003333333

Number of Support Vectors:  37
Number of Classes:  2
Levels:
 ALL AML
Visualize Microarray Training Data with Multidimensional Scaling

# visualize training data
# (classes by color, SV by crosses)
# multidimensional scaling
pc <- cmdscale(dist(fmat.train[, 1:3000]))
plot(pc, col = as.integer(fy),
     pch = c("o", "+")[1:38 %in% model.ma$index + 1],  # 38 samples; '+' marks support vectors
     main = "Training Data ALL 'Black' and AML 'Red' Classes")
Check Model with Training Data; Predict Outcomes of Test Data

# check the training data
x <- fmat.train[, 1:3000]
pred.train <- predict(model.ma, x)
# check accuracy:
table(pred.train, fy)

            fy
pred.train  ALL AML
       ALL   27   0
       AML    0  11

# classify the test data
y2  <- c(rep(0, 20), rep(1, 14))
fy2 <- factor(y2, levels = 0:1)
levels(fy2) <- c("ALL", "AML")
x2 <- fmat.test[, 1:3000]
pred <- predict(model.ma, x2)
# check accuracy:
table(pred, fy2)

      fy2
pred   ALL AML
  ALL   20  13
  AML    0   1

The training data are correctly classified, but nearly every test sample is predicted as ALL: the model is worthless so far.
Conclusion
- The SVM appears to be a powerful classifier applicable to many different kinds of data.
But:
- Kernel design is a full-time job.
- Selecting model parameters is far from obvious.
- The math is formidable.