Nonlinear Data Discrimination via Generalized Support Vector Machines
David R. Musicant and Olvi L. Mangasarian
University of Wisconsin - Madison
Outline
• The linear support vector machine (SVM)
  – Linear kernel
• Generalized support vector machine (GSVM)
  – Nonlinear indefinite kernel
• Linear programming formulation of the GSVM
  – MINOS
• Quadratic programming formulation of the GSVM
  – Successive overrelaxation (SOR)
• Numerical comparisons
• Conclusions
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
[Figure: point sets A+ and A- separated by a separating surface]
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
• Given m points in the n-dimensional space $R^n$
• Represented by an m × n matrix A
• Membership of each point $A_i$ in the classes +1 or -1 is specified by:
  – An m × m diagonal matrix D with $\pm 1$ along its diagonal
• Separate by two bounding planes, $x'w = \gamma + 1$ and $x'w = \gamma - 1$, such that:
  $A_i w \ge \gamma + 1$ for $D_{ii} = +1$, and $A_i w \le \gamma - 1$ for $D_{ii} = -1$
• More succinctly: $D(Aw - e\gamma) \ge e$, where e is a vector of ones.
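As a concrete illustration of this notation, here is a minimal numpy sketch; the matrix A, the labels, and the plane (w, γ) are made-up values, not data from the talk:

    import numpy as np

    # Toy illustration of the slide's notation (A, labels, w, gamma are made up).
    A = np.array([[2.0, 3.0],   # two points of class +1
                  [3.0, 4.0],
                  [0.0, 0.0],   # two points of class -1
                  [1.0, 0.0]])
    d = np.array([1, 1, -1, -1])        # class of each row A_i
    D = np.diag(d)                      # m x m diagonal matrix D
    e = np.ones(A.shape[0])             # vector of ones

    w, gamma = np.array([1.0, 1.0]), 2.5   # a candidate plane x'w = gamma

    # The bounding planes x'w = gamma + 1 and x'w = gamma - 1 separate the
    # classes exactly when D(Aw - e*gamma) >= e holds componentwise:
    print(D @ (A @ w - e * gamma) >= e)    # all True => linearly separated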
Preliminary Attempt at the (Linear) Support Vector Machine:
Robust Linear Programming
• Solve the following mathematical program:
  $\min_{w,\gamma,y} \; \nu e'y$ subject to $D(Aw - e\gamma) + y \ge e$, $y \ge 0$,
  where y = nonnegative error (slack) vector
• Note: y = 0 if the convex hulls of A+ and A- do not intersect.
The (Linear) Support Vector Machine
Maximize Margin Between Separating Planes
[Figure: point sets A+ and A- with two bounding planes and the margin between them]
The (Linear) Support Vector Machine Formulation
• Solve the following mathematical program:
  $\min_{w,\gamma,y} \; \nu e'y + \|w\|_1$ subject to $D(Aw - e\gamma) + y \ge e$, $y \ge 0$,
  where y = nonnegative error (slack) vector and the norm term maximizes the margin
• Note: y = 0 if the convex hulls of A+ and A- do not intersect.
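With the 1-norm margin term this program is a linear program. The following is a sketch using scipy.optimize.linprog; the function name and variable stacking are illustrative assumptions (the talk itself used MINOS), and dropping the norm term from the objective recovers the robust LP of the previous slide:

    import numpy as np
    from scipy.optimize import linprog

    def svm_1norm_lp(A, d, nu=1.0):
        """Sketch of the 1-norm linear SVM as an LP (not the authors' code).

        min  nu*e'y + e's  s.t.  D(Aw - e*gamma) + y >= e, -s <= w <= s, y >= 0
        Variables stacked as x = [w (n), gamma (1), y (m), s (n)].
        """
        m, n = A.shape
        D = np.diag(d.astype(float))
        e = np.ones(m)

        c = np.concatenate([np.zeros(n), [0.0], nu * e, np.ones(n)])

        # D(Aw - e*gamma) + y >= e  rewritten as  -DA w + De gamma - y <= -e
        row1 = np.hstack([-D @ A, (D @ e)[:, None], -np.eye(m), np.zeros((m, n))])
        # w - s <= 0 and -w - s <= 0 make e's an upper bound on ||w||_1
        row2 = np.hstack([np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)])
        row3 = np.hstack([-np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)])
        A_ub = np.vstack([row1, row2, row3])
        b_ub = np.concatenate([-e, np.zeros(2 * n)])

        bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:n], res.x[n]    # w, gamma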
GSVM: Generalized Support Vector Machine
Linear Programming Formulation
• Linear support vector machine (linear separating surface $x'w = \gamma$)
• By "duality", set $w = A'Du$ (linear separating surface $x'A'Du = \gamma$)
• Nonlinear support vector machine: replace $AA'$ by a nonlinear kernel $K(A,A')$.
  Nonlinear separating surface: $K(x', A')Du = \gamma$
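Spelling out the substitution (a worked step in the slide's notation, not shown in the original):

    $$D(Aw - e\gamma) + y \ge e
      \;\xrightarrow{\;w = A'Du\;}\;
      D(AA'Du - e\gamma) + y \ge e
      \;\xrightarrow{\;AA' \to K(A,A')\;}\;
      D(K(A,A')Du - e\gamma) + y \ge e.$$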
Examples of Kernels
• Polynomial kernel: $(AA' + \mu ee')^{d}_{\bullet}$, where $\bullet$ denotes componentwise exponentiation as in MATLAB
• Radial basis kernel: $\exp(-\mu \|A_i - A_j\|^2)$, $i, j = 1, \ldots, m$
• Neural network kernel: $(AA' + \mu ee')_{*}$, where $(\cdot)_{*}$ denotes the step function applied componentwise
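The three kernel families above might be coded as follows in numpy; μ and d play the roles on the slide, though the exact scaling conventions here are assumptions:

    import numpy as np

    # Kernels K(A, B') between row-point matrices A (m x n) and B (k x n).

    def polynomial_kernel(A, B, mu=1.0, d=2):
        # (AB' + mu*ee')^d with componentwise ("MATLAB dot") exponentiation
        return (A @ B.T + mu) ** d

    def radial_basis_kernel(A, B, mu=1.0):
        # exp(-mu * ||A_i - B_j||^2) for every pair of rows
        sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-mu * np.maximum(sq, 0.0))

    def neural_network_kernel(A, B, mu=1.0):
        # componentwise step function of (AB' + mu*ee'); an indefinite
        # kernel, which the GSVM framework permits
        return (A @ B.T + mu >= 0).astype(float)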
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in $R^2$
Separate 486 Asterisks from 514 Dots
[Figure: checkerboard training set]
Previous Work
[Figure: checkerboard separation obtained by previous methods]
Polynomial Kernel
[Figure: checkerboard separation obtained with the polynomial kernel]
Large Margin Classifier
(SOR) Reformulation in $(w, \gamma)$ Space
[Figure: point sets A+ and A- with bounding planes and the margin between them]
(SOR) Linear Support Vector Machine
Quadratic Programming Formulation
• Solve the following mathematical program:
  $\min_{w,\gamma,y} \; \nu e'y + \tfrac{1}{2}(w'w + \gamma^2)$ subject to $D(Aw - e\gamma) + y \ge e$, $y \ge 0$
• The quadratic term here maximizes the distance between the bounding planes in the $(w, \gamma)$ space
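Why the quadratic term maximizes the margin (a one-line justification, consistent with the slide's claim): measured in $R^{n+1}$, the space of $(w, \gamma)$, the distance between the bounding planes is

    $$\frac{2}{\|(w,\gamma)\|_2} \;=\; \frac{2}{\sqrt{w'w + \gamma^2}},$$

so minimizing $\tfrac{1}{2}(w'w + \gamma^2)$ maximizes this distance.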
Introducing a Nonlinear Kernel
• The Wolfe dual of the SOR linear SVM is:
  $\min_{0 \le u \le \nu e} \; \tfrac{1}{2} u'D(AA' + ee')Du - e'u$
  – Linear separating surface: $x'A'Du + e'Du = 0$
• Substitute a kernel $K(A,A')$ for the $AA'$ term:
  $\min_{0 \le u \le \nu e} \; \tfrac{1}{2} u'D(K(A,A') + ee')Du - e'u$
  – Nonlinear separating surface: $K(x',A')Du + e'Du = 0$
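For reference, the primal-dual correspondence behind the surface formulas above is the standard Wolfe-dual step (a derivation sketch in the slide's notation, not shown in the original):

    $$w = A'Du, \qquad \gamma = -e'Du, \qquad \text{so} \quad x'w - \gamma = x'A'Du + e'Du,$$

and substituting $K(A,A')$ for $AA'$ turns $x'A'Du$ into $K(x',A')Du$.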
SVM Optimality Conditions
• Define $M = D(K(A,A') + ee')D$
• Then the dual SVM becomes much simpler: $\min_{0 \le u \le \nu e} \; \tfrac{1}{2} u'Mu - e'u$
• Gradient projection necessary & sufficient optimality condition:
  $u = \left(u - \omega(Mu - e)\right)_{\#}$ for any $\omega > 0$
• $(\cdot)_{\#}$ denotes projecting u onto the region $0 \le u \le \nu e$
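A minimal numpy sketch of the projection $(\cdot)_{\#}$ and the resulting optimality test; the function names are illustrative, not from the paper:

    import numpy as np

    def project(u, nu):
        # (u)_# : componentwise projection onto the box 0 <= u <= nu*e
        return np.clip(u, 0.0, nu)

    def is_dual_optimal(M, u, nu, omega=1.0, tol=1e-8):
        # Gradient projection condition: u = (u - omega*(Mu - e))_#
        e = np.ones(len(u))
        return np.linalg.norm(u - project(u - omega * (M @ u - e), nu)) <= tol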
SOR Algorithm & Convergence
• The above optimality condition leads to the SOR algorithm ($0 < \omega < 2$):
  $u^{i+1} = \left(u^i - \omega E^{-1}\left(Mu^i - e + L(u^{i+1} - u^i)\right)\right)_{\#}$,
  where E is the diagonal of M and L its strictly lower triangular part
  – Remember, the optimality condition is expressed as: $u = (u - \omega(Mu - e))_{\#}$
• SOR linear convergence [Luo & Tseng, 1993]:
  – The iterates $u^i$ of the SOR algorithm converge R-linearly to a solution $\bar{u}$ of the dual problem
  – The objective function values $f(u^i)$ converge Q-linearly to $f(\bar{u})$
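One way the SOR sweep might be implemented for the kernel dual $\min \tfrac{1}{2}u'Mu - e'u$, $0 \le u \le \nu e$: update one component at a time with the freshest values, which componentwise matches the matrix recurrence above. A sketch under those assumptions; all names are mine, not the authors' code:

    import numpy as np

    def sor_dual_svm(K, d, nu=1.0, omega=1.3, max_iter=10000, tol=1e-6):
        """SOR sketch for min 0.5 u'Mu - e'u, 0 <= u <= nu*e,
        with M = D(K + ee')D and relaxation factor 0 < omega < 2."""
        m = len(d)
        M = np.outer(d, d) * (K + 1.0)  # D(K + ee')D, elementwise form
        u = np.zeros(m)
        grad = -np.ones(m)              # gradient Mu - e at u = 0
        for _ in range(max_iter):
            u_old = u.copy()
            for j in range(m):
                # relaxed coordinate step, projected onto [0, nu]
                new_uj = min(max(u[j] - omega * grad[j] / M[j, j], 0.0), nu)
                delta = new_uj - u[j]
                if delta != 0.0:
                    grad += delta * M[:, j]  # keep gradient consistent with u
                    u[j] = new_uj
            if np.linalg.norm(u - u_old) <= tol:
                break
        return u

    # The resulting u defines the surface K(x', A')Du + e'Du = 0, so a new
    # point x with kernel row K_xA = K(x', A') is classified by:
    def classify(K_xA, d, u):
        return np.sign(K_xA @ (d * u) + np.sum(d * u))

Maintaining the gradient incrementally after each coordinate change, rather than recomputing $Mu - e$ from scratch, is what keeps each sweep affordable on large problems.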
Numerical Testing
• Comparison of linear & nonlinear kernels using:
  – Linear programming formulations
  – Quadratic programming (SOR) formulations
• Data sets:
  – UCI Liver Disorders: 345 points in $R^6$
  – Bell Labs Checkerboard: 1000 points in $R^2$
  – Gaussian Synthetic: 1000 points in $R^{32}$
  – SCDS Synthetic: 1 million points in $R^{32}$
  – Massive Synthetic: 10 million points in $R^{32}$
• Machines:
  – Cluster of 4 Sun Enterprise E6000 machines, each consisting of 16 UltraSPARC II 250 MHz processors with 2 GB RAM
  – Total: 64 processors, 8 GB RAM
Comparison of Linear & Nonlinear SVMs
Generated by Linear Programming
[Table: training and testing set correctness for linear vs. nonlinear kernels]
• Nonlinear kernels yield better training and testing set correctness
SOR Results
• Examples of training on massive data:
  – 1 million point dataset generated by the SCDS generator:
    Trained completely in 9.7 hours
    Tuning set reached 99.7% of final accuracy in 0.3 hours
  – 10 million point randomly generated dataset:
    Tuning set reached 95% of final accuracy in 14.3 hours
    Under 10,000 iterations
• Comparison of linear and nonlinear kernels
  [Table: training and testing set correctness for linear vs. nonlinear kernels under SOR]
Conclusions
• Linear programming and successive overrelaxation can generate complex nonlinear separating surfaces via GSVMs
• Nonlinear separating surfaces improve generalization over linear ones
• SOR can handle very large problems not (easily) solvable by other methods
• SOR scales up with virtually no changes
• Future directions:
  – Parallel SOR for very large problems not resident in memory
  – Massive multicategory discrimination via SOR
  – Support vector regression
Questions?