Massive Support Vector Regression (via Row and Column Chunking) David R. Musicant and O.L. Mangasarian NIPS 99 Workshop on Learning With Support Vectors.

Presentation transcript:

Massive Support Vector Regression (via Row and Column Chunking) David R. Musicant and O.L. Mangasarian NIPS 99 Workshop on Learning With Support Vectors, December 3, 1999

Chunking with 1 billion nonzero elements

Outline
• Problem Formulation
  – New formulation of Support Vector Regression (SVR)
  – Theoretically close to the LP formulation of Smola, Schölkopf, and Rätsch
  – Interpretation of the perturbation parameter
• Numerical Comparisons
  – Speed comparisons of our method against prior formulations
• Massive Regression
  – Chunking methods for solving large problems: row chunking and row-column chunking
• Conclusions & Future Work

Support Vector Tolerant Regression
• ε-insensitive interval within which errors are tolerated
• Can improve performance on test sets by avoiding overfitting
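As a point of reference (the standard definition, not taken from the slide), the ε-insensitive loss underlying this tolerant interval is:

```latex
% errors within the tolerance epsilon are not penalized at all
L_{\varepsilon}\bigl(y, f(x)\bigr) = \max\bigl(0,\ \lvert y - f(x)\rvert - \varepsilon\bigr)
```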

Deriving the SVR Problem
• m points in R^n, represented by an m x n matrix A; y is the vector to be approximated (e is a vector of ones)
• We wish to fit Aw + be to y, measuring the error by s with a tolerance ε: the constraints bound the errors and enforce the tolerance
• Let w be represented by its dual formulation; this suggests replacing AA' by a general nonlinear kernel K(A,A')
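The equations on this slide were rendered as images; the following is a hedged reconstruction of the steps the text describes (the error bound with tolerance, and the dual/kernel substitution), not a verbatim copy of the slide:

```latex
% bound the errors by s, with the tolerance entering as a constraint
-s \le Aw + be - y \le s, \qquad s \ge \varepsilon e
% represent w through the data rows (dual formulation), then kernelize
w = A'\alpha \quad\Longrightarrow\quad AA'\alpha + be \approx y
\quad\longrightarrow\quad K(A,A')\alpha + be \approx y
```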

Deriving the SVR Problem (continued)
• Add a regularization term and minimize the error with weight C > 0
• Parametrically maximize the tolerance ε via the parameter μ; this maximizes the minimum error component, thereby resulting in error uniformity
• The pieces of the problem: regularization, error, and interval size in the objective; bound-errors and tolerance constraints
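The full LP was shown as an image; the following is a hedged, schematic reconstruction consistent with the labels above (the 1-norm regularizer, the C weighting, and the m scaling of the ε term are assumptions, not taken from the slide):

```latex
\min_{w,\,b,\,s,\,\varepsilon}\;
\underbrace{\|w\|_1}_{\text{regularization}}
\;+\; C\,\underbrace{e's}_{\text{error}}
\;-\; C\,\mu\,\underbrace{m\,\varepsilon}_{\text{interval size}}
\quad\text{s.t.}\quad
\underbrace{-s \le Aw + be - y \le s}_{\text{bound errors}},\qquad
\underbrace{s \ge \varepsilon e}_{\text{tolerance}},\qquad
\varepsilon \ge 0
```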

Equivalent to the Smola, Schölkopf, Rätsch (SSR) Formulation
• Our formulation: single error bound, tolerance as a constraint
• Smola, Schölkopf, Rätsch: multiple error bounds
• Reduction in:
  – Variables: 4m+2 --> 3m+2
  – Solution time
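To make the "multiple error bounds" vs. "single error bound, tolerance as a constraint" distinction concrete, here is a hedged sketch of the two constraint styles (the exact formulations were images on the slides):

```latex
% SSR-style: multiple error bounds (separate upper and lower slacks)
Aw + be - y \le \varepsilon e + \xi, \qquad
y - Aw - be \le \varepsilon e + \xi^{*}, \qquad
\xi,\ \xi^{*} \ge 0
% This formulation: a single error bound per point, tolerance as a constraint
-s \le Aw + be - y \le s, \qquad s \ge \varepsilon e
```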

Interpretation of the Perturbation Parameter μ
• Our linear program is equivalent to the classical stabilized least 1-norm approximation problem
• Perturbation theory results show there exists a fixed μ̄ > 0 such that for all μ ∈ (0, μ̄]:
  – we solve the above stabilized least 1-norm problem
  – additionally we maximize ε, the least error component
• As μ goes from 0 to 1, the least error component ε is a monotonically nondecreasing function of μ
• This gives a natural interpretation for μ

Numerical Testing
• Two sets of tests
  – Compare computational times of our method (MM) and the SSR method
  – Row-column chunking for massive datasets
• Datasets:
  – US Census Bureau Adult Dataset: 300,000 points in R^11
  – Delve Comp-Activ Dataset: 8,192 points in R^13
  – UCI Boston Housing Dataset: 506 points in R^13
  – Gaussian noise was added to each of these datasets
• Hardware: Locop2, a Dell PowerEdge 6300 server with:
  – Four gigabytes of memory, 36 gigabytes of disk space
  – Windows NT Server 4.0
  – CPLEX 6.5 solver

Experimental Process
• μ is a parameter which needs to be determined experimentally; use a hold-out tuning set to determine an optimal value for μ
• Algorithm:
    μ = 0
    while (tuning set accuracy continues to improve) {
        Solve LP
        increase μ
    }
• Run for both our method and the SSR method and compare times
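A minimal Python sketch of this tuning loop, assuming hypothetical helpers `solve_svr_lp(mu)` (solves the LP for a given μ) and `tuning_error(model)` (hold-out error); the step size and update rule are assumptions, not from the slides:

```python
def tune_mu(solve_svr_lp, tuning_error, mu_step=0.1):
    """Increase mu while the hold-out tuning-set error keeps improving."""
    mu = 0.0
    best_model = solve_svr_lp(mu)            # LP at mu = 0 (plain least 1-norm fit)
    best_err = tuning_error(best_model)
    while mu + mu_step <= 1.0:                # mu ranges over [0, 1] per the slides
        candidate = solve_svr_lp(mu + mu_step)
        err = tuning_error(candidate)
        if err >= best_err:                   # stop once accuracy no longer improves
            break
        mu, best_model, best_err = mu + mu_step, candidate, err
    return mu, best_model
```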

Comparison Results

Linear Programming Row Chunking
• Basic approach: (PSB/OLM) for classification problems
• The classification problem is solved for a subset, or chunk, of constraints (data points)
• Those constraints with positive multipliers (support vectors) are preserved and integrated into the next chunk
• The objective function is monotonically nondecreasing
• The dataset is repeatedly scanned until the objective function stops increasing
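A minimal sketch of this row-chunking loop, assuming a hypothetical `solve_chunk_lp(row_indices)` that solves the LP restricted to those constraints and returns the objective value and the constraint multipliers:

```python
import numpy as np

def row_chunking(m, solve_chunk_lp, chunk_size=1000, tol=1e-8):
    """Scan the m constraints in chunks until the objective stops increasing."""
    kept = np.empty(0, dtype=int)            # support rows carried to the next chunk
    prev_obj = -np.inf
    while True:
        improved = False
        for start in range(0, m, chunk_size):                    # one scan of the dataset
            chunk = np.union1d(kept, np.arange(start, min(start + chunk_size, m)))
            obj, multipliers = solve_chunk_lp(chunk)              # LP over this row chunk
            kept = chunk[multipliers > tol]                       # keep rows with positive multipliers
            if obj > prev_obj + tol:
                prev_obj, improved = obj, True
        if not improved:                                          # objective has stalled
            return kept, prev_obj
```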

Innovation: Simultaneous Row-Column Chunking
• Mapping of data points to constraints
  – Classification: each data point yields one constraint
  – Regression: each data point yields two constraints; row-column chunking manages which constraints to maintain for the next chunk
• Fixing dual variables at upper bounds for efficiency
  – Classification: simple to do since the problem is coded in its dual formulation; any support vectors with dual variables at the upper bound are held constant in successive chunks
  – Regression: the primal formulation was used for efficiency purposes, so we aggregated all constraints with fixed multipliers to yield a single constraint

Innovation: Simultaneous Row-Column Chunking (continued)
• Large number of columns
  – Row chunking: implemented for a linear kernel only; it cannot handle problems with large numbers of variables, and hence is limited practically to linear kernels
  – Row-column chunking: implemented for a general nonlinear kernel; new data increase the dimensionality of K(A,A') by adding both rows and columns (variables) to the problem, and we handle this with row-column chunking

Row-Column Chunking Algorithm

    while (problem termination criteria not satisfied) {
        choose a set of rows from the problem as a row chunk
        while (row chunk termination criteria not satisfied) {
            from this row chunk, select a set of columns
            solve the LP allowing only these columns as variables
            add those columns with nonzero values to the next column chunk
        }
        add those rows with nonzero dual multipliers to the next row chunk
    }
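A minimal Python sketch of this loop, assuming a hypothetical `solve_restricted_lp(rows, cols)` that solves the LP over the given constraint rows with only the given columns free and returns the objective, the column values, and the row multipliers; the termination tests are simplified here to a single pass over the data:

```python
import numpy as np

def row_column_chunking(n_rows, n_cols, solve_restricted_lp,
                        row_chunk=5000, col_chunk=1000, tol=1e-8):
    kept_rows = np.empty(0, dtype=int)    # rows with nonzero dual multipliers so far
    kept_cols = np.empty(0, dtype=int)    # columns (variables) with nonzero values so far
    obj = -np.inf
    for r0 in range(0, n_rows, row_chunk):                       # outer loop: row chunks
        rows = np.union1d(kept_rows, np.arange(r0, min(r0 + row_chunk, n_rows)))
        for c0 in range(0, n_cols, col_chunk):                   # inner loop: column chunks
            cols = np.union1d(kept_cols, np.arange(c0, min(c0 + col_chunk, n_cols)))
            obj, col_vals, row_duals = solve_restricted_lp(rows, cols)
            kept_cols = cols[np.abs(col_vals) > tol]             # keep columns with nonzero values
        kept_rows = rows[np.abs(row_duals) > tol]                # keep rows with nonzero multipliers
    return kept_rows, kept_cols, obj
```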

Row-Column Chunking Diagram (figure: steps 1a-1c, 2a-2c, 3a-3c, with a loop back to the start)

Chunking Experimental Results

Objective Value & Tuning Set Error for Billion-Element Matrix

Conclusions and Future Work
• Conclusions
  – Support Vector Regression can be handled more efficiently using improvements on previous formulations
  – Row-column chunking is a new approach which can handle massive regression problems
• Future work
  – Generalizing to other loss functions, such as the Huber M-estimator
  – Extension to larger problems using parallel processing for both linear and quadratic programming formulations

Questions?

LP Perturbation Regime #1
• Our LP is as given above; when μ = 0, the solution is the stabilized least 1-norm solution
• Therefore, by LP perturbation theory, there exists a μ̄ > 0 such that
  – the solution to the LP with μ ∈ (0, μ̄] is a solution to the least 1-norm problem that also maximizes ε

LP Perturbation Regime #2
• Our LP can be rewritten in an equivalent form
• Similarly, by LP perturbation theory, there exists a threshold value of μ such that
  – the solution to the LP, for μ in the corresponding range, is the solution that minimizes the least error (ε) among all minimizers of the average tolerated error

Motivation for Dual Variable Substitution
• Primal:
• Dual:
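The primal and dual LPs themselves were shown as images on this slide. As a generic reminder of the duality being invoked (standard LP duality, not the slide's exact pair):

```latex
\text{Primal:}\quad \min_{x}\ c'x \ \ \text{s.t.}\ \ Ax \ge b,\ x \ge 0
\qquad\qquad
\text{Dual:}\quad \max_{u}\ b'u \ \ \text{s.t.}\ \ A'u \le c,\ u \ge 0
```

In the SVR setting the dual variables multiply the rows of A, which motivates representing w as A'α and thereby replacing AA' with a kernel K(A,A'), as in the earlier derivation slides.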