Support Vector Regression David R. Musicant and O.L. Mangasarian International Symposium on Mathematical Programming Thursday, August 10, 2000


Support Vector Regression David R. Musicant and O.L. Mangasarian International Symposium on Mathematical Programming Thursday, August 10, 2000

2 Outline
- Robust Regression
  – Huber M-estimator loss function
  – New quadratic programming formulation
  – Numerical comparisons
  – Nonlinear kernels
- Tolerant Regression
  – New formulation of Support Vector Regression (SVR)
  – Numerical comparisons
  – Massive regression: row-column chunking
- Conclusions & Future Work

Focus 1: Robust Regression (a.k.a. Huber Regression)

4 "Standard" Linear Regression
Find w, b such that Aw + be ≈ y (e a vector of ones), where:
- the m points in R^n are represented by an m x n matrix A
- y in R^m is the vector to be approximated

5 Optimization Problem
- Find w, b such that Aw + be ≈ y
- Bound the error by s:  -s ≤ Aw + be - y ≤ s
- Minimize the error
Traditional approach: minimize the squared error.
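A hedged restatement of this bounded-error problem in one display (the exact norm used on the slide was not preserved; the squared error of the "traditional approach" is shown):

\min_{w,\,b,\,s} \; \|s\|_2^2 \quad \text{subject to} \quad -s \le Aw + be - y \le s

Replacing \|s\|_2^2 with the 1-norm \|s\|_1 gives the absolute-error loss introduced on the next slides.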

6 Examining the Loss Function
- Standard regression uses a squared-error loss function.
  – Points which are far from the predicted line (outliers) are overemphasized.

7 Alternative Loss Function
- Instead of squared error, try the absolute value of the error: this is the 1-norm loss function.

8 1-Norm Problems and Solution
- The 1-norm loss overemphasizes error on points close to the predicted line.
- Solution: the Huber loss function, a hybrid approach
  – quadratic for small errors, linear for large errors
Many practitioners prefer the Huber loss function.

9 Mathematical Formulation
- γ indicates the switchover from quadratic to linear.
- Larger γ means "more quadratic."
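The Huber formula itself did not survive transcription; as a sketch (with the switchover parameter written as gamma, an assumed notation), the standard Huber M-estimator loss is quadratic for small residuals and linear beyond gamma:

import numpy as np

def huber_loss(residual, gamma):
    """Huber M-estimator loss: quadratic for |t| <= gamma, linear beyond it."""
    t = np.abs(residual)
    return np.where(t <= gamma,
                    0.5 * t ** 2,                  # quadratic near zero
                    gamma * t - 0.5 * gamma ** 2)  # linear for large residuals

A larger gamma widens the quadratic region, matching the "more quadratic" remark above.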

10 Regression Approach Summary
- Quadratic loss function
  – Standard method in statistics
  – Overemphasizes outliers
- Linear loss function (1-norm)
  – Formulates well as a linear program
  – Overemphasizes small errors
- Huber loss function (hybrid approach)
  – Appropriate emphasis on large and small errors

11 Previous Attempts Complicated
- Earlier efforts to solve Huber regression:
  – Huber: Gauss-Seidel method
  – Madsen/Nielsen: Newton method
  – Li: conjugate gradient method
  – Smola: dual quadratic program
- Our new approach: a convex quadratic program
Our new approach is simpler and faster.

12 Experimental Results: Census20k
- Dataset: 20,000 points, 11 features
- [Chart of time (CPU sec): our method is faster.]

13 Experimental Results: CPUSmall
- Dataset: 8,192 points, 12 features
- [Chart of time (CPU sec): our method is faster.]

14 Introduce Nonlinear Kernel
- Begin with the previous formulation.
- Substitute w = A'α and minimize α instead.
- Substitute K(A,A') for AA'.
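As an illustration of the substitution (the slide does not specify which kernel was used; a Gaussian kernel is assumed here), the linear model Aw + be with w = A'α becomes K(A,A')α + be:

import numpy as np

def gaussian_kernel(A, B, mu=0.1):
    # K(A, B')_ij = exp(-mu * ||A_i - B_j||^2); mu is an assumed kernel parameter.
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-mu * sq_dists)

A = np.random.randn(100, 11)     # illustrative data: 100 points, 11 features
alpha = np.random.randn(100)     # dual variables replacing w = A' alpha
b = 0.0
linear_pred = A @ (A.T @ alpha) + b              # Aw + be with w = A' alpha
kernel_pred = gaussian_kernel(A, A) @ alpha + b  # K(A,A') alpha + be

With the linear kernel K(A,A') = AA' the two predictions coincide; a nonlinear K yields a nonlinear regression surface.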

15 Nonlinear results Nonlinear kernels improve accuracy.

Focus 2: Support Vector Tolerant Regression

17 Regression Approach Summary
- Quadratic loss function
  – Standard method in statistics
  – Overemphasizes outliers
- Linear loss function (1-norm)
  – Formulates well as a linear program
  – Overemphasizes small errors
- Huber loss function (hybrid approach)
  – Appropriate emphasis on large and small errors

18 Optimization Problem
- Find w, b such that Aw + be ≈ y
- Bound the error by s:  -s ≤ Aw + be - y ≤ s
- Minimize the magnitude of the error

19 The Overfitting Issue
- Noisy training data can be fitted "too well"
  – leads to poor generalization on future data
- Prefer simpler regressions, i.e. where
  – some w coefficients are zero
  – the line is "flatter"

20 Reducing Overfitting
- To achieve both goals, also minimize the magnitude of the w vector (see the sketch below).
- C is a parameter to balance the two goals
  – chosen by experimentation
- Reduces overfitting due to points far from the surface
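One hedged way to write the combined objective (the slide's exact norms were not preserved; a 1-norm version consistent with the linear-programming theme is shown):

\min_{w,\,b} \; C\,\|Aw + be - y\|_1 + \|w\|_1

A larger C emphasizes fitting the data; a smaller C emphasizes keeping w small (flatter, simpler regressions). The tolerance interval of the following slides would additionally exempt small residuals from the first term.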

21 Overfitting Again: "Close" Points
- "Close" points may be wrong due to noise only.
  – The line should be influenced by "real" data, not noise.
- Ignore errors from those points which are close!

22 Tolerant Regression
- Allow an interval of size ε with uniform error.
- How large should ε be?
  – As large as possible, while preserving accuracy.
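For illustration, the tolerant (ε-insensitive) loss ignores residuals that fall inside the interval and charges only the excess; eps below stands for the interval-size symbol that was lost in transcription:

import numpy as np

def tolerant_loss(residual, eps):
    # Zero inside the tolerance interval of size eps, linear (1-norm) outside it.
    return np.maximum(np.abs(residual) - eps, 0.0)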

23 How about a nonlinear surface?

24 Introduce Nonlinear Kernel
- Begin with the previous formulation.
- Substitute w = A'α and minimize α instead.
- Substitute K(A,A') for AA', where K(A,A') is a nonlinear kernel function.

25 Equivalent to Smola, Schölkopf, Rätsch (SSR) Formulation
- Our formulation: a single error bound, with the tolerance as a constraint

26 (continued)
- Smola, Schölkopf, Rätsch: multiple error bounds

27 (continued)
- Reduction in:
  – Variables: 4m+2 --> 3m+2
  – Solution time

28 Equivalent to Smola, Schölkopf, Rätsch (SSR) Formulation
- Our formulation: a single error bound, with the tolerance as a constraint
- Smola, Schölkopf, Rätsch: multiple error bounds
- Reduction in:
  – Variables: 4m+2 --> 3m+2
  – Solution time

29
- Perturbation theory results show there exists a fixed μ̄ > 0 such that for all μ in (0, μ̄]:
  – we solve the above stabilized least 1-norm problem
  – additionally we maximize ε, the least error component
- As μ goes from 0 to 1, the least error component ε is a monotonically nondecreasing function of μ
  – a natural interpretation for μ
- Our linear program is equivalent to the classical stabilized least 1-norm approximation problem.

30 Numerical Testing
- Two sets of tests
  – Compare computational times of our method (MM) and the SSR method
  – Row-column chunking for massive datasets
- Datasets:
  – US Census Bureau Adult dataset: 300,000 points in R^11
  – Delve Comp-Activ dataset: 8,192 points in R^13
  – UCI Boston Housing dataset: 506 points in R^13
  – Gaussian noise was added to each of these datasets.
- Hardware: Locop2, a Dell PowerEdge 6300 server with
  – four gigabytes of memory, 36 gigabytes of disk space
  – Windows NT Server 4.0
  – CPLEX 6.5 solver

31 Experimental Process
- μ is a parameter which needs to be determined experimentally.
- Use a hold-out tuning set to determine the optimal value for μ.
- Algorithm:
    μ = 0
    while (tuning set accuracy continues to improve) {
        solve LP
        increase μ
    }
- Run for both our method and the SSR method and compare times.
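A sketch of that tuning loop in Python; solve_lp, tuning_error, and the step size are placeholders for whatever LP solver, accuracy measure, and increment were actually used:

def tune_mu(solve_lp, tuning_error, step=0.1):
    # Increase mu from 0 while accuracy on the hold-out tuning set keeps improving.
    best_mu, best_err = 0.0, float("inf")
    mu = 0.0
    while mu <= 1.0:
        model = solve_lp(mu)       # solve the LP for this value of mu
        err = tuning_error(model)  # error on the hold-out tuning set
        if err >= best_err:        # accuracy stopped improving: stop
            break
        best_mu, best_err = mu, err
        mu += step
    return best_mu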

32 Comparison Results

33 Linear Programming Row Chunking
- Basic approach: (PSB/OLM) for classification problems
- The classification problem is solved for a subset, or chunk, of constraints (data points).
- Those constraints with positive multipliers (the support vectors) are preserved and integrated into the next chunk.
- The objective function is monotonically nondecreasing.
- The dataset is repeatedly scanned until the objective function stops increasing.

34 Innovation: Simultaneous Row-Column Chunking
- Row chunking
  – Cannot handle problems with large numbers of variables
  – Therefore: linear kernel only
- Row-column chunking
  – New data increase the dimensionality of K(A,A') by adding both rows and columns (variables) to the problem.
  – We handle this with row-column chunking.
  – General nonlinear kernel

35 Row-Column Chunking Algorithm
while (problem termination criteria not satisfied) {
    choose set of rows as row chunk
    while (row chunk termination criteria not satisfied) {
        from row chunk, select set of columns
        solve LP allowing only these columns to vary
        add columns with nonzero values to next column chunk
    }
    add rows with nonzero multipliers to next row chunk
}
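A schematic Python rendering of the same double loop; the chunk-selection helpers, termination tests, and LP solver are placeholders rather than the authors' code:

def row_column_chunking(pick_row_chunk, pick_columns, solve_lp,
                        problem_done, row_chunk_done):
    # Alternate over row chunks (data points) and column chunks (variables),
    # carrying forward only the rows and columns that remain active.
    row_chunk, col_chunk, solution = [], [], None
    while not problem_done(solution):
        row_chunk = pick_row_chunk(previous=row_chunk)
        while True:
            cols = pick_columns(previous=col_chunk)
            # Solve the LP over this row chunk, letting only `cols` vary.
            solution = solve_lp(rows=row_chunk, columns=cols)
            # Columns with nonzero values form the next column chunk.
            col_chunk = [c for c in cols if solution.value(c) != 0]
            if row_chunk_done(solution):
                break
        # Rows with nonzero multipliers carry over to the next row chunk.
        row_chunk = [r for r in row_chunk if solution.multiplier(r) != 0]
    return solution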

36-42 Row-Column Chunking Diagram
[A sequence of diagram-only slides illustrating successive row and column chunks.]

43 Chunking Experimental Results

44 Objective Value & Tuning Set Error for Billion-Element Matrix

45 Conclusions and Future Work
- Conclusions
  – Robust regression can be modeled simply and efficiently as a quadratic program.
  – Tolerant regression can be handled more efficiently using improvements on previous formulations.
  – Row-column chunking is a new approach which can handle massive regression problems.
- Future work
  – Chunking via parallel and distributed approaches
  – Scaling Huber regression to larger problems

46 Questions?

47 LP Perturbation Regime #1
- Our LP is the one given above; when μ = 0, its solution is the stabilized least 1-norm solution.
- Therefore, by LP perturbation theory, there exists a μ̄ > 0 such that:
  – The solution to the LP for any μ in (0, μ̄] is a solution to the least 1-norm problem that also maximizes ε.

48 LP Perturbation Regime #2
- Our LP can be rewritten in an equivalent form.
- Similarly, by LP perturbation theory, there exists a value of μ such that:
  – The solution to the LP with that μ is the solution that minimizes the least error among all minimizers of the average tolerated error.

49 Motivation for dual variable substitution
- Primal:
- Dual: