Minimal Kernel Classifiers
Glenn Fung, Olvi Mangasarian, Alexander Smola
Data Mining Institute, University of Wisconsin - Madison
INFORMS 2002, San Jose, California, Nov 17-20, 2002

Outline of Talk
- Linear Support Vector Machines (SVM)
  - Linear separating surface
  - Quadratic programming (QP) formulation
  - Linear programming (LP) formulation
- Nonlinear Support Vector Machines
  - Nonlinear kernel separating surface
  - LP formulation
- The Minimal Kernel Classifier (MKC)
  - The pound loss function (#)
  - MKC Algorithm
- Numerical experiments
- Conclusion

What is a Support Vector Machine?
- An optimally defined surface
- Linear or nonlinear in the input space
- Linear in a higher dimensional feature space
- Implicitly defined by a kernel function

What are Support Vector Machines Used For?
- Classification
- Regression & Data Fitting
- Supervised & Unsupervised Learning

Generalized Support Vector Machines: 2-Category Linearly Separable Case (figure: linearly separable point sets A+ and A-)

Support Vector Machines: Maximizing the Margin between Bounding Planes (figure: point sets A+ and A-, the two bounding planes, and the support vectors)
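For reference, a sketch of the bounding planes and the margin in the standard notation (assumed here, since the slide's formulas did not survive the transcript):

x^{\top} w = \gamma + 1, \qquad x^{\top} w = \gamma - 1, \qquad \text{margin} = \frac{2}{\|w\|_{2}}.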

Support Vector Machine Formulation: Algebra of the 2-Category Linearly Separable Case
- Given m points in n-dimensional space, represented by an m-by-n matrix A.
- Membership of each point in class +1 or -1 is specified by an m-by-m diagonal matrix D with +1 and -1 entries on its diagonal.
- Separate the two classes by two bounding planes; more succinctly, the separation condition can be written using e, a vector of ones (a sketch follows).
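A sketch of the separation condition in the standard Mangasarian notation (assumed, since the slide's formulas were lost in the transcript):

A_i w \ge \gamma + 1 \ \ \text{for } D_{ii} = +1, \qquad A_i w \le \gamma - 1 \ \ \text{for } D_{ii} = -1,

or, more succinctly, with e a vector of ones:

D(Aw - e\gamma) \ge e.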

QP Support Vector Machine Formulation
- The margin between the bounding planes is maximized by minimizing the norm of w.
- Solve a quadratic program that trades this margin term off against the classification error, for some weight on the error (a sketch follows).
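A sketch of the standard QP formulation, assuming the usual notation (\nu > 0 weights the error, y is the nonnegative slack vector):

(QP) \quad \min_{w,\gamma,y}\; \nu\, e^{\top} y + \tfrac{1}{2}\, w^{\top} w \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0.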

Support Vector Machines: Linear Programming Formulation
- Use the 1-norm of w instead of the 2-norm in the objective.
- The resulting problem is equivalent to a linear program (a sketch follows).
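A sketch of the 1-norm formulation and its LP equivalent, under the same assumed notation (the vector s bounds the absolute values of the components of w):

\min_{w,\gamma,y}\; \nu\, e^{\top} y + \|w\|_{1} \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0,

which is equivalent to the linear program

(LP) \quad \min_{w,\gamma,y,s}\; \nu\, e^{\top} y + e^{\top} s \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad -s \le w \le s, \quad y \ge 0.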

Nonlinear Kernel: LP Formulation
- Linear SVM: the LP above, whose solution gives a linear separating surface.
- By QP "duality", w can be expressed in terms of the training data; maximizing the margin in this "dual space" gives an LP in a dual variable u.
- Replacing the resulting linear kernel by a nonlinear kernel K gives the nonlinear kernel LP (a sketch follows).
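A sketch of the kernel substitution and the resulting LP, assuming the standard dual representation (the symbols follow the usual Mangasarian notation, not the lost slide formulas):

w = A^{\top} D u \;\Rightarrow\; \text{linear separating surface } x^{\top} A^{\top} D u = \gamma,

and replacing A A^{\top} by a nonlinear kernel K(A, A^{\top}) gives

\min_{u,\gamma,y,s}\; \nu\, e^{\top} y + e^{\top} s \quad \text{s.t.} \quad D\big(K(A, A^{\top}) D u - e\gamma\big) + y \ge e, \quad -s \le u \le s, \quad y \ge 0,

with nonlinear separating surface K(x^{\top}, A^{\top}) D u = \gamma.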

The Nonlinear Classifier
- The nonlinear classifier is defined by the kernel separating surface, where K is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel.
- The (i, j)-entry of the kernel matrix K(A, A') represents the "similarity" between the data points A_i and A_j.
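A minimal runnable sketch of this classifier with a Gaussian kernel; the kernel width mu and the names A, D, u, gamma are assumptions standing in for the quantities on the slide:

import numpy as np

def gaussian_kernel(X, A, mu=0.1):
    """K[i, j] = exp(-mu * ||X_i - A_j||^2): similarity of test point i to training point j."""
    sq_dists = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

def mkc_classify(X, A, D, u, gamma, mu=0.1):
    """Evaluate the kernel classifier sign(K(x', A') D u - gamma) for each row x of X."""
    return np.sign(gaussian_kernel(X, A, mu) @ (D @ u) - gamma)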

Nonlinear PSVM: Spiral Dataset 94 Red Dots & 94 White Dots

Model Simplification
- Goal #1: Generate a very sparse solution vector u.
  - Why? It minimizes the number of kernel functions used, which simplifies the separating surface.
- Goal #2: Minimize the number of active constraints.
  - Why? It reduces data dependence, which is useful for massive incremental classification.

Model Simplification, Goal #1: Simplifying the Separating Surface
- The nonlinear separating surface is determined by the kernel expansion in u.
- Wherever a component of u is zero, the separating surface does not depend explicitly on the corresponding data point.
- Hence: minimize the number of nonzero components of u.

Model Simplification, Goal #2: Minimize Data Dependence
- By the KKT conditions, the solution depends only on the data points whose constraints are active.
- Hence: minimize the number of active constraints.
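A sketch of the complementarity argument behind this goal, assuming the kernel LP above with a multiplier vector v for the classification constraints (v is an illustrative symbol, not from the slide):

v_i \Big[ D\big(K(A, A^{\top}) D u - e\gamma\big) + y - e \Big]_i = 0, \qquad i = 1, \dots, m,

so v_i can be nonzero only where the i-th constraint is active; the data points behind inactive constraints do not affect the optimal classifier.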

Achieving Model Simplification: Minimal Kernel Classifier Formulation
- The MKC is a mathematical program whose objective uses a new loss function, the pound loss #, to penalize the quantities targeted by the two simplification goals.

The (Pound) Loss Function #

Approximating the Pound Loss Function #
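One standard smooth surrogate from Mangasarian-style concave minimization is the concave exponential; whether this is the exact approximation on the slide is an assumption:

t_{*} = \begin{cases} 1, & t > 0 \\ 0, & t \le 0 \end{cases} \qquad \text{approximated for } t \ge 0 \text{ by } \qquad 1 - e^{-\alpha t}, \quad \alpha > 0,

which is concave, bounded above by 1, and approaches the step (pound-type) loss as \alpha \to \infty.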

Minimal Kernel Classifier as a Concave Minimization Problem
- Replacing the pound loss by its smooth concave approximation turns the MKC formulation into a concave minimization problem over a polyhedral set.
- For a sufficiently large approximation parameter, the smoothed problem yields a solution of the original problem.
- This problem can be solved effectively using the finite Successive Linearization Algorithm (SLA) (Mangasarian 1996).

Minimal Kernel Algorithm (SLA)
- Start with an initial feasible point.
- Given the current iterate, determine the next iterate by solving the LP obtained by linearizing the concave part of the objective around the current iterate.
- Stop when successive iterates produce no further decrease in the objective.

Minimal Kernel Algorithm (SLA)
- Each iteration of the algorithm solves a linear program.
- The algorithm terminates in a finite number of iterations (typically 5 to 7).
- The solution obtained satisfies the Minimum Principle necessary optimality condition.
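A minimal runnable sketch of a successive linearization loop of this kind, on a generic problem with a linear term plus the concave exponential surrogate; the problem data, the parameter names (alpha, tol), and the use of scipy's LP solver are illustrative assumptions, not the exact MKC linear programs:

import numpy as np
from scipy.optimize import linprog

def sla(c_lin, A_ub, b_ub, alpha=5.0, tol=1e-4, max_iter=20):
    """Minimize c_lin @ x + sum(1 - exp(-alpha * x)) subject to A_ub @ x <= b_ub, x >= 0,
    by successively linearizing the concave term at the current iterate."""
    c_lin = np.asarray(c_lin, dtype=float)
    n = c_lin.size
    x = np.zeros(n)                      # start from a feasible point (the origin, assumed feasible)
    prev_obj = np.inf
    for _ in range(max_iter):
        grad = alpha * np.exp(-alpha * x)          # gradient of the concave term at x
        res = linprog(c_lin + grad, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * n, method="highs")   # one LP per iteration
        x = res.x
        obj = c_lin @ x + np.sum(1.0 - np.exp(-alpha * x))
        if prev_obj - obj < tol:                   # stop when the objective stops decreasing
            break
        prev_obj = obj
    return x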

Checkerboard Separating Surface (figure): # of kernel functions = 27 (plotted as *), # of active constraints = 30 (plotted as o)

Numerical Experiments: Results for Six Public Datasets

Data Set (m x n) | Reduced rectangular kernel nnz(t) x nnz(u) | MKC Test % (Time Sec.) | SVM Test % (#SV) | Kernel SV reduction % | Testing time reduction % (SVM – MKC time Sec.)
Ionosphere 351 x | | (172.3) | 92.9 (288.2) | | (3.05 – 0.16)
Cleveland Heart 297 x | | (147.2) | 85.5 (241.0) | | ( )
Pima Indians 768 x | | (303.3) | 76.6 (637.3) | 98.8 | (3.95 – 0.05)
BUPA Liver 345 x | | (285.9) | 72.7 (310.5) | | (0.59 – 0.02)
Tic-Tac-Toe 958 x | | (150.4) | 98.3 (861.4) | | (6.97 – 0.13)
Mushroom 8124 x | | (2763.5) | oom (NA) | | NA

Conclusion
- A finite algorithm that generates a classifier depending on only a fraction of the input data.
- Important for fast online testing of unseen data, e.g. fraud or intrusion detection.
- Useful for incremental training on massive data.
- The overall algorithm consists of solving 5 to 7 LPs.
- Kernel data dependence is reduced by up to 98.8% relative to the data used by a standard SVM.
- Testing time is reduced by up to 98.2%.
- MKC testing set correctness is comparable to that of the more complex standard SVM.