JBR1 Support Vector Machines Classification
Venables & Ripley, Section 12.5
CSU Hayward Statistics 6601
Joseph Rickert & Timothy McKusick
December 1, 2004

JBR2 Support Vector Machine
What is the SVM?
- The SVM is a generalization of the Optimal Hyperplane Algorithm.
Why is the SVM important?
- It allows the use of more similarity measures than the OHA.
- Through the use of kernel methods it works with non-vector data.

JBR3 Simple Linear Classifier
X = R^p, f(x) = w^T x + b
Each x ∈ X is classified into 2 classes labeled y ∈ {+1, -1}:
  y = +1 if f(x) ≥ 0 and y = -1 if f(x) < 0
Training set S = {(x_1, y_1), (x_2, y_2), ...}
Given S, the problem is to learn f (find w and b).
For each f, check whether all (x_i, y_i) are correctly classified, i.e. y_i f(x_i) ≥ 0.
Choose f so that the number of errors is minimized.
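
A short R sketch (my own illustration, not part of the original deck) makes the notation concrete: decision values f(x) = w^T x + b for a candidate (w, b), and the count of training errors y_i f(x_i) < 0 on toy data.

# Hedged sketch: a hand-rolled linear classifier on toy data
f <- function(X, w, b) as.vector(X %*% w + b)              # decision values w'x + b
classify <- function(X, w, b) ifelse(f(X, w, b) >= 0, 1, -1)
n.errors <- function(X, y, w, b) sum(y * f(X, w, b) < 0)   # y_i f(x_i) < 0 means misclassified

set.seed(1)                                                # toy data: two clouds in R^2
X <- rbind(matrix(rnorm(40, mean =  2), ncol = 2),
           matrix(rnorm(40, mean = -2), ncol = 2))
y <- c(rep(1, 20), rep(-1, 20))
n.errors(X, y, w = c(1, 1), b = 0)                         # errors for one arbitrary candidate f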

JBR4 But what if the training set is not linearly separable?
f(x) = w^T x + b defines two half planes {x: f(x) ≥ 1} and {x: f(x) ≤ -1}, separated by a margin of 2/‖w‖.
Classify with the "hinge" loss function: c(f,x,y) = max(0, 1 - y f(x))
- c(f,x,y) grows with the distance from the correct half plane.
- If (x,y) is correctly classified with large confidence, then c(f,x,y) = 0.
- y f(x) ≥ 1: correct with large confidence
- 0 ≤ y f(x) < 1: correct with small confidence
- y f(x) < 0: misclassified
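
A two-line R version of the hinge loss (an illustrative sketch; the function name is mine) covers the three cases above:

hinge <- function(fx, y) pmax(0, 1 - y * fx)    # c(f,x,y) = max(0, 1 - y f(x))
hinge(fx = c(2.0, 0.3, -0.5), y = c(1, 1, 1))   # 0.0, 0.7, 1.5: large-confidence correct,
                                                # small-confidence correct, misclassified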

JBR5 SVMs combine the requirements of large margin and few misclassifications by solving the problem:
New formulation: min (1/2)‖w‖² + C Σ_i c(f,x_i,y_i) w.r.t. w and b
- C is a parameter that controls the tradeoff between margin and misclassification.
- Large C: smaller margins, but more samples correctly classified with strong confidence.
- Technical difficulty: the hinge loss function c(f,x_i,y_i) is not differentiable.
Even better formulation: use slack variables ξ_i:
  min (1/2)‖w‖² + C Σ_i ξ_i w.r.t. w, ξ and b, under the constraints ξ_i ≥ c(f,x_i,y_i)  (*)
- But (*) is equivalent to ξ_i ≥ 0 and ξ_i ≥ 1 - y_i(w^T x_i + b), for i = 1...n.
- Solve this quadratic optimization problem with Lagrange multipliers.
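
For concreteness, the objective above can be evaluated for any given (w, b); a minimal R sketch (my own, purely illustrative — not how the quadratic program is actually solved):

svm.objective <- function(X, y, w, b, C = 1) {
  slack <- pmax(0, 1 - y * (as.vector(X %*% w) + b))   # xi_i = max(0, 1 - y_i(w'x_i + b))
  0.5 * sum(w^2) + C * sum(slack)                      # (1/2)||w||^2 + C * sum of slacks
}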

JBR6 Support Vectors
Lagrange multiplier formulation:
- Find α that maximizes
    W(α) = -(1/2) Σ_i Σ_j y_i y_j α_i α_j x_i^T x_j + Σ_i α_i
  under the constraints Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.
- The points with positive Lagrange multipliers, α_i > 0, are called support vectors.
- The set of support vectors contains all the information used by the SVM to learn a discrimination function.
(Figure labels: α_i = C, 0 < α_i < C, α_i = 0 for the three kinds of training points.)
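
In the e1071 package used later in these slides, the fitted object exposes these dual quantities directly; a hedged look (field names as documented in e1071):

library(e1071)
model <- svm(Species ~ ., data = iris)
head(model$index)   # which training rows are support vectors (alpha_i > 0)
head(model$coefs)   # the products y_i * alpha_i for those support vectors
model$rho           # the negative intercept used in the decision function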

JBR7 Kernel Methods
Data are not represented individually, but only through a set of pairwise comparisons.
- X is a set of objects (e.g. proteins); each object is represented by a sequence:
    S = (aatcgagtcac, atggacgtct, tgcactact)
- K = the 3 x 3 kernel matrix of pairwise comparisons (numeric values not shown here).
- Each number in the kernel matrix is a measure of the similarity or "distance" between two objects.
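
As an illustration of "pairwise comparisons only" (my own example, not the kernel used on the original slide), a simple spectrum-style kernel counts overlapping 2-mers in each sequence and takes dot products of the count vectors:

seqs <- c("aatcgagtcac", "atggacgtct", "tgcactact")
kmers <- function(s, k = 2) substring(s, 1:(nchar(s) - k + 1), k:nchar(s))
vocab <- sort(unique(unlist(lapply(seqs, kmers))))
counts <- t(sapply(seqs, function(s) table(factor(kmers(s), levels = vocab))))
K <- counts %*% t(counts)   # 3 x 3 kernel matrix: K[i,j] compares sequence i with sequence j
K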

JBR8 Kernels
Properties of kernels:
- Kernels are measures of similarity: K(x,x') is large when x and x' are similar.
- Kernels must be positive definite and symmetric.
- For every kernel K there exist a Hilbert space F and a mapping Φ: X → F such that
    K(x,x') = <Φ(x), Φ(x')> for all x, x' ∈ X.
- Hence all kernels can be thought of as dot products in some feature space.
Advantages of kernels:
- Data of very different nature can be analyzed in a unified framework.
- No matter what the objects are, n objects are always represented by an n x n matrix.
- It is often easier to compare objects than to represent them numerically.
- Complete modularity between the function used to represent the data and the algorithm used to analyze them.
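
Both requirements can be checked numerically on a concrete Gram matrix; a small sketch with an RBF kernel on a few iris rows (my own example, not from the slides):

X <- as.matrix(iris[1:10, 1:4])
K <- exp(-0.25 * as.matrix(dist(X))^2)    # RBF kernel matrix with gamma = 0.25
isSymmetric(K)                            # TRUE: symmetric
min(eigen(K, symmetric = TRUE)$values)    # smallest eigenvalue is (numerically) nonnegative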

JBR9 The "Kernel Trick"
- Any algorithm for vector data that can be expressed in terms of dot products can be performed implicitly in the feature space associated with the kernel, by replacing each dot product with the kernel.
- e.g. For some feature space F, let d(x,x') = ‖Φ(x) - Φ(x')‖.
- But ‖Φ(x) - Φ(x')‖² = <Φ(x),Φ(x)> + <Φ(x'),Φ(x')> - 2<Φ(x),Φ(x')>.
- So d(x,x') = (K(x,x) + K(x',x') - 2K(x,x'))^(1/2).
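
The identity is easy to verify in R with any kernel; a short sketch using the radial kernel that appears later in the slides (function names are mine):

rbf <- function(x, xp, gamma = 0.25) exp(-gamma * sum((x - xp)^2))
kdist <- function(x, xp, K) sqrt(K(x, x) + K(xp, xp) - 2 * K(x, xp))
x <- c(1, 2); xp <- c(2, 0)
kdist(x, xp, rbf)           # ||phi(x) - phi(x')|| computed without ever forming phi
sqrt(2 - 2 * rbf(x, xp))    # the same value, since K(x,x) = K(x',x') = 1 for the RBF kernel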

JBR10 Nonlinear Separation
Nonlinear kernel:
- X is a vector space, but the kernel is nonlinear.
- Linear separation in the feature space F can be associated with nonlinear separation in X.
(Figure: a nonlinear boundary in X corresponds to a linear boundary in F.)

JBR11 SVM with Kernel
Final formulation:
- Find α that maximizes
    W(α) = -(1/2) Σ_i Σ_j y_i y_j α_i α_j k(x_i, x_j) + Σ_i α_i
  under the constraints Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.
- Find an index i with 0 < α_i < C and set b = y_i - Σ_j y_j α_j k(x_i, x_j).
- The classification of a new object x ∈ X is then determined by the sign of
    f(x) = Σ_i y_i α_i k(x_i, x) + b.
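
This decision function can be reconstructed from a fitted e1071 model to check the formula; a hedged sketch on a two-class subset of iris, using the package's SV, coefs (which store y_i α_i), rho (the negative of b) and gamma fields:

library(e1071)
two <- droplevels(subset(iris, Species != "setosa"))
X <- scale(as.matrix(two[, 1:4]))                        # scale by hand so the SV stay comparable
fit <- svm(x = X, y = two$Species, kernel = "radial", scale = FALSE)
k <- function(u, v) exp(-fit$gamma * sum((u - v)^2))     # the radial kernel
f <- function(x) sum(fit$coefs * apply(fit$SV, 1, k, v = x)) - fit$rho
f(X[1, ])                                                # hand-computed decision value
attr(predict(fit, X[1, , drop = FALSE], decision.values = TRUE), "decision.values")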

JBR12 iris data set (Anderson 1935)
150 cases, 50 each of 3 species of iris. Example from page 48 of The e1071 Package documentation.
First 10 lines of iris:
> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

JBR13 SVM ANALYSIS OF IRIS DATA
# SVM ANALYSIS OF IRIS DATA SET
# classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
summary(model)

Call:
svm(formula = Species ~ ., data = iris)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  51

Number of Classes:  3

Levels:
 setosa versicolor virginica

Slide annotations: "cost" is the parameter C in the Lagrange formulation; the radial kernel is exp(-γ‖u - v‖²).
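
The cost and gamma values reported above are the package defaults; a hedged follow-up (not on the original slide) shows how they could instead be chosen by cross-validation with e1071's tune.svm:

library(e1071)
set.seed(1)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = c(0.1, 0.25, 0.5), cost = c(0.5, 1, 10))
summary(tuned)            # cross-validated error for each (gamma, cost) combination
tuned$best.parameters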

JBR14 Exploring the SVM Model
# test with training data
x <- subset(iris, select = -Species)
y <- Species   # assumes attach(iris), as in the e1071 example; otherwise use iris$Species
pred <- predict(model, x)

# Check accuracy:
table(pred, y)

# compute decision values:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]

Output shown on the slide (numeric values not legible):
- table(pred, y): a 3 x 3 confusion matrix, rows pred and columns y, over setosa, versicolor, virginica.
- Decision values for rows [1,] to [4,], with columns setosa/versicolor, setosa/virginica, versicolor/virginica.

JBR15 Visualize classes with MDS
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])

cmdscale: multidimensional scaling (principal coordinates analysis)
Figure: black = setosa, red = versicolor, green = virginica

JBR16 iris split into training and test sets
The first 25 cases of each species form the training set (a sketch of this split follows below).
## SECOND SVM ANALYSIS OF IRIS DATA SET
## classification mode
# default with factor response
# Train with iris.train data
model.2 <- svm(fS.TR ~ ., data = iris.train)
# output from summary
summary(model.2)

Call:
svm(formula = fS.TR ~ ., data = iris.train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  32

Number of Classes:  3

Levels:
 setosa versicolor virginica
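
The slides do not show how iris.train, iris.test and the factors fS.TR, fS.TE were built; a minimal sketch consistent with "first 25 of each species as the training set" (object names are copied from the calls on these slides, the construction details are my guess):

train.idx <- c(1:25, 51:75, 101:125)                           # first 25 of each species
iris.train <- data.frame(iris[train.idx, 1:4],  fS.TR = factor(iris$Species[train.idx]))
iris.test  <- data.frame(iris[-train.idx, 1:4], fS.TE = factor(iris$Species[-train.idx]))
fS.TR <- iris.train$fS.TR
fS.TE <- iris.test$fS.TE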

JBR17 iris test results
# test with iris.test data
x.2 <- subset(iris.test, select = -fS.TE)
y.2 <- fS.TE
pred.2 <- predict(model.2, x.2)

# Check accuracy:
table(pred.2, y.2)

# compute decision values and probabilities:
pred.2 <- predict(model.2, x.2, decision.values = TRUE)
attr(pred.2, "decision.values")[1:4,]

Output shown on the slide (numeric values not legible):
- table(pred.2, y.2): a 3 x 3 confusion matrix, rows pred.2 and columns y.2, over setosa, versicolor, virginica.
- Decision values for rows [1,] to [4,], with columns setosa/versicolor, setosa/virginica, versicolor/virginica.

JBR18 iris training and test sets

JBR19 Microarray Data
From Golub et al., "Molecular Classification of Cancer: Class Prediction by Gene Expression Monitoring", Science, Vol. 286, 10/15/1999.
Figure: expression levels of predictive genes.
- Rows: genes; columns: samples.
- Expression levels (EL) of each gene are relative to the mean EL for that gene in the initial dataset.
- Red if EL > mean, blue if EL < mean; the scale indicates how far above or below the mean.
- Top panel: genes highly expressed in ALL.
- Bottom panel: genes more highly expressed in AML.

JBR20 Microarray Data Transposed (rows = samples, columns = genes)
Training data:
- 38 samples (ALL: 27, AML: 11)
- 7129 genes: a 7129 x 38 expression matrix, i.e. 38 x 7129 after transposing
Test data:
- 34 samples (ALL: 20, AML: 14)
- a 7129 x 34 expression matrix, i.e. 34 x 7129 after transposing
The slide also previews rows [1,] to [15,] and columns [,1] to [,10] of the transposed training matrix (numeric values not legible).
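
The slides never show how fmat.train and fmat.test were created; a hedged sketch of the transpose step implied by "rows = samples, columns = genes" (the file names are placeholders, not from the deck):

raw.train <- as.matrix(read.table("golub_train.txt"))   # 7129 genes x 38 samples (hypothetical file)
raw.test  <- as.matrix(read.table("golub_test.txt"))    # 7129 genes x 34 samples (hypothetical file)
fmat.train <- data.frame(t(raw.train))                  # 38 x 7129: one row per sample
fmat.test  <- data.frame(t(raw.test))                   # 34 x 7129
dim(fmat.train); dim(fmat.test)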

JBR21 SVM ANALYSIS OF MICROARRAY DATA
# classification mode
# default with factor response
y <- c(rep(0,27), rep(1,11))
fy <- factor(y, levels = 0:1)
levels(fy) <- c("ALL","AML")
# compute svm on the first 3000 genes only, because of memory overflow problems
model.ma <- svm(fy ~ ., data = fmat.train[,1:3000])

Call:
svm(formula = fy ~ ., data = fmat.train[, 1:3000])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.0003333333   (not legible on the slide; the e1071 default is 1/3000)

Number of Support Vectors:  37

Number of Classes:  2

Levels:
 ALL AML

JBR22 Visualize Microarray Training Data with Multidimensional Scaling
# visualize training data
# (classes by color, SV by crosses)
# multidimensional scaling
pc <- cmdscale(dist(fmat.train[,1:3000]))
plot(pc, col = as.integer(fy),
     pch = c("o","+")[1:38 %in% model.ma$index + 1],   # 38 training samples; support vectors plotted as "+"
     main = "Training Data ALL 'Black' and AML 'Red' Classes")

JBR23 Check Model with Training Data, Predict Outcomes of Test Data
# check the training data
x <- fmat.train[,1:3000]
pred.train <- predict(model.ma, x)
# check accuracy:
table(pred.train, fy)

# classify the test data
y2 <- c(rep(0,20), rep(1,14))
fy2 <- factor(y2, levels = 0:1)
levels(fy2) <- c("ALL","AML")
x2 <- fmat.test[,1:3000]
pred <- predict(model.ma, x2)
# check accuracy:
table(pred, fy2)

The training data are classified perfectly:
          fy
pred.train ALL AML
       ALL  27   0
       AML   0  11

The test data are not ("Model is worthless so far"): in table(pred, fy2) the AML row reads 0 and 1, i.e. only one of the 14 AML test samples is predicted as AML (the counts in the ALL row are not legible on the slide).

JBR24 Conclusion
- The SVM appears to be a powerful classifier applicable to many different kinds of data.
But:
- Kernel design is a full-time job.
- Selecting model parameters is far from obvious.
- The math is formidable.