Support Vector Machines for Data Fitting and Classification
David R. Musicant, with Olvi L. Mangasarian
UW-Madison Data Mining Institute Annual Review, June 2, 2000

Overview
- Regression and its role in data mining
- Robust support vector regression
  - Our general formulation
- Tolerant support vector regression
  - Our contributions
  - Massive support vector regression
  - Integration with data mining tools
- Active support vector machines
- Other research and future directions

What is regression?
- Regression forms a rule for predicting an unknown numerical feature from known ones.
- Example: predicting purchase habits.
- Can we use...
  - age, income, level of education
- To predict...
  - purchasing patterns?
- And simultaneously...
  - avoid the "pitfalls" that standard statistical regression falls into?

Regression example
[Figure: example data table of known features used to predict an unknown value]

Role in data mining
- Goal: find new relationships in data
  - e.g. customer behavior, scientific experimentation
- Regression explores the importance of each known feature in predicting the unknown one.
  - Feature selection
- Regression is a form of supervised learning.
  - Use data where the predicted value is known for given instances to form a rule.
- Massive datasets
Regression is a fundamental task in data mining.

Part I: Robust Regression (a.k.a. Huber Regression)

“Standard” Linear Regression Find w, b such that:

Optimization problem
- Find w, b such that:
- Bound the error by s:
- Minimize the error:
Traditional approach: minimize squared error.
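In the notation used in the authors' related regression papers (assumed here: A is the m x n data matrix, y the vector of observed values, e a column of ones, and s a vector bounding the errors), the squared-error fit can be written as:

```latex
\min_{w,\,b,\,s}\ \ s^{\top}s
\quad \text{subject to} \quad
-s \;\le\; Aw + be - y \;\le\; s
```

Since s = |Aw + be - y| at the solution, this is ordinary least squares: \min_{w,b} \|Aw + be - y\|_2^2.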

Examining the loss function
- Standard regression uses a squared-error loss function.
  - Points which are far from the predicted line (outliers) are overemphasized.

Alternative loss function
- Instead of squared error, try the absolute value of the error.
This is called the 1-norm loss function.
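With the same assumed notation, the 1-norm alternative replaces the squared error with the sum of absolute errors:

```latex
\min_{w,\,b}\ \|Aw + be - y\|_1 \;=\; \sum_{i=1}^{m} \left| A_i w + b - y_i \right|
```

This can be posed as a linear program by bounding each error with a slack variable.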

1-Norm Problems and Solution
- Overemphasizes error on points close to the predicted line
- Solution: Huber loss function (hybrid approach)
  - Quadratic near zero, linear in the tails
Many practitioners prefer the Huber loss function.

Mathematical Formulation
- γ indicates the switchover from quadratic to linear.
- Larger γ means "more quadratic."
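One standard way to write the Huber loss with switchover parameter γ (the exact scaling used on the slide is an assumption here) applies the following function to each residual t = A_i w + b - y_i:

```latex
\rho_{\gamma}(t) \;=\;
\begin{cases}
\tfrac{1}{2}\,t^{2}, & |t| \le \gamma,\\[4pt]
\gamma\,|t| - \tfrac{1}{2}\,\gamma^{2}, & |t| > \gamma,
\end{cases}
```

so small residuals are penalized quadratically and large ones linearly; as γ grows, the loss behaves more like squared error.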

Regression Approach Summary
- Quadratic loss function
  - Standard method in statistics
  - Over-emphasizes outliers
- Linear loss function (1-norm)
  - Formulates well as a linear program
  - Over-emphasizes small errors
- Huber loss function (hybrid approach)
  - Appropriate emphasis on large and small errors

Previous attempts complicated
- Earlier efforts to solve Huber regression:
  - Huber: Gauss-Seidel method
  - Madsen/Nielsen: Newton method
  - Li: conjugate gradient method
  - Smola: dual quadratic program
- Our new approach: convex quadratic program
Our new approach is simpler and faster.
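A minimal sketch of posing Huber regression as a convex quadratic program, using the off-the-shelf CVXPY modeler and synthetic data (illustrative only; not the authors' formulation or code):

```python
import cvxpy as cp
import numpy as np

# Synthetic data: A is an m x n feature matrix, y the observed values.
rng = np.random.default_rng(0)
m, n = 200, 11
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

gamma = 1.0                      # switchover from quadratic to linear
w = cp.Variable(n)
b = cp.Variable()

# cp.huber(t, M) is quadratic for |t| <= M and linear beyond,
# matching the hybrid loss described on the slides.
residual = A @ w + b - y
problem = cp.Problem(cp.Minimize(cp.sum(cp.huber(residual, gamma))))
problem.solve()
print(w.value, b.value)
```

Any generic convex QP solver can handle this problem; no specialized Gauss-Seidel or Newton iteration is required.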

Experimental Results: Census20k (20,000 points, 11 features)
[Chart: solution time in CPU seconds; the new approach is faster]

Experimental Results: CPUSmall (8,192 points, 12 features)
[Chart: solution time in CPU seconds; the new approach is faster]

Introduce nonlinear kernel!
- Begin with the previous formulation.
- Substitute w = A'α and minimize α instead.
- Substitute K(A, A') for AA'.
A kernel is a nonlinear function.
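A small illustration of the kernel substitution, using a Gaussian kernel; the kernel choice and the parameter mu are assumptions made for the example, not necessarily the ones used in the experiments:

```python
import numpy as np

def gaussian_kernel(A, B, mu=0.1):
    """K(A, B')[i, j] = exp(-mu * ||A_i - B_j||^2): a common nonlinear kernel."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-mu * sq)

# Replacing the linear term AA' with K(A, A') turns the linear regression
# surface into a nonlinear one, at the cost of a dense m x m matrix.
A = np.random.default_rng(1).standard_normal((100, 12))
K = gaussian_kernel(A, A)        # 100 x 100 dense kernel matrix
```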

Nonlinear results Nonlinear kernels improve accuracy.

Part II: Tolerant Regression (a.k.a. Tolerant Training)

Regression Approach Summary
- Quadratic loss function
  - Standard method in statistics
  - Over-emphasizes outliers
- Linear loss function (1-norm)
  - Formulates well as a linear program
  - Over-emphasizes small errors
- Huber loss function (hybrid approach)
  - Appropriate emphasis on large and small errors

Optimization problem (1-norm)
- Find w, b such that:
- Bound the error by s:
- Minimize the error:
Minimize the magnitude of the error.

The overfitting issue
- Noisy training data can be fitted "too well"
  - leads to poor generalization on future data
- Prefer simpler regressions, i.e. where
  - some w coefficients are zero
  - the line is "flatter"

Reducing overfitting
- To achieve both goals
  - minimize the magnitude of the w vector
- C is a parameter to balance the two goals
  - Chosen by experimentation
- Reduces overfitting due to points far from the surface
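Putting the two goals together, with the same assumed notation, one common way to write the regularized 1-norm problem is:

```latex
\min_{w,\,b,\,s}\ \ \|w\|_{1} \;+\; C\,\|s\|_{1}
\quad \text{subject to} \quad
-s \;\le\; Aw + be - y \;\le\; s
```

The parameter C trades the fit against the size of w; the choice of the 1-norm as the regularizer on w is an assumption made here for illustration.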

Overfitting again: "close" points
- "Close" points may be wrong due to noise only
  - The line should be influenced by "real" data, not noise
- Ignore errors from those points which are close!

Tolerant regression
- Allow an interval of size ε with uniform error.
- How large should ε be?
  - As large as possible, while preserving accuracy
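Adding the tolerance interval gives an ε-insensitive version: errors smaller than ε are ignored entirely. In the same assumed notation:

```latex
\min_{w,\,b,\,s}\ \ \|w\|_{1} \;+\; C\,\|s\|_{1}
\quad \text{subject to} \quad
-s - \varepsilon e \;\le\; Aw + be - y \;\le\; s + \varepsilon e,
\qquad s \ge 0
```

Only the part of each error exceeding ε is penalized.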

How about a nonlinear surface?

Introduce nonlinear kernel!
- Begin with the previous formulation.
- Substitute w = A'α and minimize α instead.
- Substitute K(A, A') for AA'.
A kernel is a nonlinear function.

Our improvements
- This formulation and interpretation are new!
  - Improves intuition over prior results
  - Uses fewer variables
  - Solves faster!
- Computational tests run on DMI Locop2
  - Dell PowerEdge 6300 server with four gigabytes of memory and 36 gigabytes of disk space
  - Windows NT Server 4.0
  - CPLEX 6.5 solver
  - Donated to UW by Microsoft Corporation

Comparison Results

Problem size concerns
- How does the problem scale?
  - m = number of points
  - n = number of features
- For a linear kernel: problem size is O(mn)
- For a nonlinear kernel: problem size is O(m^2)
- Thousands of data points ==> massive problem!
Need an algorithm that will scale well.

Chunking approach
- Idea: use a chunking method
  - Bring as much into memory as possible
  - Solve this subset of the problem
  - Retain the solution and integrate it into the next subset
- Explored in depth by Paul Bradley and O.L. Mangasarian for linear kernels
Solve in pieces, one chunk at a time.
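A rough sketch of the row-chunking idea only (the names row_chunking_fit and solve_subproblem are hypothetical, and this is not the authors' row-column chunking code):

```python
import numpy as np

def row_chunking_fit(A, y, chunk_rows, solve_subproblem):
    """Solve on one chunk of rows at a time, keep only the rows that remain
    active in that subproblem's solution, and carry them into the next chunk."""
    kept_rows = np.empty((0, A.shape[1]))
    kept_y = np.empty(0)
    solution = None
    for start in range(0, A.shape[0], chunk_rows):
        chunk_A = np.vstack([kept_rows, A[start:start + chunk_rows]])
        chunk_y = np.concatenate([kept_y, y[start:start + chunk_rows]])
        # solve_subproblem is a user-supplied solver returning the current
        # solution and an index/mask of the rows that are still active.
        solution, active = solve_subproblem(chunk_A, chunk_y)
        kept_rows, kept_y = chunk_A[active], chunk_y[active]
    return solution
```

The talk's method additionally chunks over columns so that wide nonlinear-kernel subproblems also fit in memory.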

Row-Column Chunking
- Why column chunking also?
  - If a nonlinear kernel is used, chunks are very wide.
  - A wide chunk must have a small number of rows to fit in memory.
Both these chunks use the same memory!

Chunking Experimental Results

Objective Value & Tuning Set Error for Billion-Element Matrix Given enough time, we find the right answer!

Integration into data mining tools
- Method runs as a stand-alone application, with data resident on disk
- With minimal effort, it could sit on top of an RDBMS to manage data input/output
  - Queries select a subset of data - easily SQLable
- Database queries occur "infrequently"
  - Data mining can be performed on a different machine from the one maintaining the DBMS
- Licensing of a linear program solver is necessary
The algorithm can integrate with data mining tools.

Part III: Active Support Vector Machines (a.k.a. ASVM)

The Classification Problem
[Figure: separating surface between the two classes A+ and A-]
Find the surface that best separates the two classes.
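In the notation common to the authors' SVM papers (assumed here), the linear classifier is a separating plane x'w = γ that, in the separable case, satisfies

```latex
x^{\top}w = \gamma,
\qquad
D\,(Aw - e\gamma) \;\ge\; e,
```

where D is the diagonal matrix of ±1 class labels (A+ vs. A-) and e a column of ones; slack variables are added when the two classes are not linearly separable.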

Active Support Vector Machine
- Features
  - Solves classification problems
  - No special software tools necessary! No LP or QP!
  - FAST. Works on very large problems.
  - Web page: available for download and can be integrated into data mining tools
    - MATLAB integration already provided

Summary and Future Work
- Summary
  - Robust regression can be modeled simply and efficiently as a quadratic program
  - Tolerant regression can be used to solve massive regression problems
  - ASVM can solve massive classification problems quickly
- Future work
  - Parallel approaches
  - Distributed approaches
  - ASVM for various types of regression

Questions?