1
Minimal Kernel Classifiers
INFORMS 2002, San Jose, California, Nov 17-20, 2002. Glenn Fung, Olvi Mangasarian, Alexander Smola. Data Mining Institute, University of Wisconsin - Madison
2
Outline of Talk
Linear Support Vector Machines (SVM): linear separating surface; quadratic programming (QP) formulation; linear programming (LP) formulation
Nonlinear Support Vector Machines: nonlinear kernel separating surface; LP formulation
The Minimal Kernel Classifier (MKC): the pound loss function (#); MKC algorithm
Numerical experiments
Conclusion
3
What is a Support Vector Machine?
An optimally defined surface Linear or nonlinear in the input space Linear in a higher dimensional feature space Implicitly defined by a kernel function
4
What are Support Vector Machines Used For?
Classification Regression & Data Fitting Supervised & Unsupervised Learning
5
Generalized Support Vector Machines 2-Category Linearly Separable Case
6
Support Vector Machines Maximizing the Margin between Bounding Planes
7
Support Vector Machine Formulation Algebra of 2-Category Linearly Separable Case
Given m points in n-dimensional space, represented by an m-by-n matrix A. Membership of each point in class +1 or -1 is specified by an m-by-m diagonal matrix D with +1 and -1 entries. Separate by two bounding planes:
  x'w = γ + 1,  x'w = γ - 1,
so that A_i w ≥ γ + 1 when D_ii = +1 and A_i w ≤ γ - 1 when D_ii = -1. More succinctly:
  D(Aw - eγ) ≥ e,
where e is a vector of ones.
8
QP Support Vector Machine Formulation
Solve the quadratic program, for some ν > 0:
  min_{w,γ,y}  ν e'y + (1/2) w'w
  s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0,   (QP)
where D_ii = +1 or -1 denotes the class membership of point A_i, and y is the vector of slack (error) variables. The margin 2/||w|| between the bounding planes is maximized by minimizing (1/2) w'w = (1/2)||w||².
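As a toy illustration of the QP above (not the authors' implementation), the sketch below stacks the variables as z = [w, γ, y] and hands the problem to a general-purpose solver; the function name `svm_qp` and the choice of SciPy's SLSQP method are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def svm_qp(A, d, nu=1.0):
    """Sketch of the QP SVM: min nu*e'y + (1/2)w'w
    s.t. D(Aw - e*gamma) + y >= e, y >= 0.
    Variables stacked as z = [w (n), gamma (1), y (m)]."""
    m, n = A.shape
    D = np.diag(d)

    def obj(z):
        w, y = z[:n], z[n + 1:]
        return nu * y.sum() + 0.5 * w @ w

    # "ineq" constraints mean fun(z) >= 0, i.e. D(Aw - e*gamma) + y - e >= 0
    cons = [{"type": "ineq",
             "fun": lambda z: D @ (A @ z[:n] - z[n]) + z[n + 1:] - 1.0}]
    bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # only y >= 0
    z0 = np.zeros(n + 1 + m)
    res = minimize(obj, z0, bounds=bounds, constraints=cons, method="SLSQP")
    return res.x[:n], res.x[n]
```

On a tiny separable 1-D example with points at ±2 and ±3, the solver recovers a plane separating the two classes.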
9
Support Vector Machines Linear Programming Formulation
Use the 1-norm of w instead of the 2-norm:
  min_{w,γ,y}  ν e'y + ||w||₁
  s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0.
This is equivalent to the following linear program:
  min_{w,γ,y,t}  ν e'y + e't
  s.t.  D(Aw - eγ) + y ≥ e,  -t ≤ w ≤ t,  y ≥ 0.
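A minimal sketch of this LP (an illustration under my own naming, not the talk's code): the variables are stacked as [w, γ, y, t] and the problem is passed to SciPy's `linprog`; the value of ν is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import linprog

def linear_svm_lp(A, d, nu=1.0):
    """1-norm linear SVM as an LP:
       min  nu*e'y + e't
       s.t. D(Aw - e*gamma) + y >= e,  -t <= w <= t,  y >= 0.
    Variables stacked as [w (n), gamma (1), y (m), t (n)]."""
    m, n = A.shape
    D = np.diag(d)
    c = np.concatenate([np.zeros(n), [0.0], nu * np.ones(m), np.ones(n)])
    # D(Aw - e*gamma) + y >= e  rewritten as  -(DA)w + d*gamma - y <= -e
    A1 = np.hstack([-D @ A, d.reshape(-1, 1), -np.eye(m), np.zeros((m, n))])
    # w - t <= 0  and  -w - t <= 0  encode  -t <= w <= t
    A2 = np.hstack([np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)])
    A3 = np.hstack([-np.eye(n), np.zeros((n, 1 + m)), -np.eye(n)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    bounds = ([(None, None)] * (n + 1)      # w, gamma free
              + [(0, None)] * (m + n))      # y >= 0, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]
```

The resulting classifier is x ↦ sign(x'w - γ).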
10
Nonlinear Kernel: LP Formulation
Linear SVM (linear separating surface x'w = γ):
  (LP)  min_{w,γ,y}  ν e'y + ||w||₁  s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0.
By QP "duality", w = A'Du. Maximizing the margin in the "dual space" of the variable u gives:
  min_{u,γ,y}  ν e'y + ||u||₁  s.t.  D(AA'Du - eγ) + y ≥ e,  y ≥ 0.
Replace the linear kernel AA' by a nonlinear kernel K(A, A'):
  min_{u,γ,y}  ν e'y + ||u||₁  s.t.  D(K(A, A')Du - eγ) + y ≥ e,  y ≥ 0.
11
The Nonlinear Classifier
The nonlinear classifier:
  K(x', A')Du = γ,
where K is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel:
  (K(A, A'))_ij = exp(-μ ||A_i' - A_j'||²),  i, j = 1, …, m.
The ij-entry of K(A, A') represents the "similarity" between the data points A_i and A_j.
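The Gaussian kernel matrix above can be computed in a few lines; this is a generic sketch (the function name and the width parameter μ are illustrative, not from the talk):

```python
import numpy as np

def gaussian_kernel(A, B, mu=0.1):
    """(K(A, B'))_ij = exp(-mu * ||A_i - B_j||^2): similarity between
    row i of A and row j of B, in (0, 1], equal to 1 iff the rows coincide."""
    # squared pairwise distances via the expansion ||a-b||^2 = a'a + b'b - 2a'b
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-mu * np.maximum(sq, 0.0))   # clip tiny negatives from rounding
```

For B = A this yields the symmetric m-by-m matrix K(A, A') used in the kernel LP.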
12
Nonlinear PSVM: Spiral Dataset 94 Red Dots & 94 White Dots
13
Model Simplification
Goal #1: Generate a very sparse solution vector u. Why? Minimizes the number of kernel functions used; simplifies the separating surface.
Goal #2: Minimize the number of active constraints. Why? Reduces data dependence; useful for massive incremental classification.
14
Model Simplification Goal #1 Simplifying Separating Surface
The nonlinear separating surface: K(x', A')Du = γ. Whenever u_i = 0, the separating surface does not depend explicitly on the datapoint A_i. Hence: minimize the number of nonzero components u_i.
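A sketch of why sparsity in u pays off at test time: only the rows of A with u_i ≠ 0 need to be stored and evaluated against a new point x. The function name and the Gaussian kernel width μ are illustrative assumptions; dropping the zero-u_i rows leaves the decision value unchanged because their kernel terms contribute nothing.

```python
import numpy as np

def mkc_decision(x, A, d, u, gamma, mu=0.1):
    """Evaluate sign(K(x', A')Du - gamma) using only the rows of A
    whose multiplier u_i is nonzero (the retained kernel functions)."""
    keep = np.flatnonzero(u)                  # sparse u -> few kernel evaluations
    Ak, dk, uk = A[keep], d[keep], u[keep]
    k = np.exp(-mu * np.sum((Ak - x) ** 2, axis=1))   # Gaussian kernel row K(x', Ak')
    return np.sign(k @ (dk * uk) - gamma)
```

With, say, 15 nonzero components of u out of m = 351 training points, each test evaluation costs 15 kernel computations instead of 351.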
15
Model Simplification Goal #2 Minimize Data Dependence
By the KKT optimality conditions, a constraint's multiplier can be nonzero only if that constraint is active at the solution. Hence datapoints whose constraints are inactive do not enter the solution and can be discarded. Goal: minimize the number of nonzero multipliers, i.e. the number of active constraints.
16
Achieving Model Simplification: Minimal Kernel Classifier Formulation
The minimal kernel classifier penalizes, via the pound loss #, both the error vector y and the kernel coefficient vector u of the kernel LP:
  min_{u,γ,y}  ν e'#(y) + e'#(u)
  s.t.  D(K(A, A')Du - eγ) + y ≥ e,  y ≥ 0,
where the new loss function #, applied componentwise, counts nonzero components:
  #(t) = 0 if t = 0,  #(t) = 1 if t ≠ 0.
17
The (Pound) Loss Function #
18
Approximating the Pound Loss Function #
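The pound function and its smooth concave approximation can be sketched as follows; the function names and the choice of α are illustrative assumptions, with the approximation taken from the concave exponential used later in the talk:

```python
import numpy as np

def pound(t):
    """Pound loss #: 0 at 0, 1 at any nonzero value (counts nonzeros)."""
    return (np.asarray(t) != 0).astype(float)

def pound_smooth(t, alpha=5.0):
    """Concave exponential approximation 1 - exp(-alpha*t) for t >= 0;
    approaches the pound function pointwise as alpha -> infinity."""
    return 1.0 - np.exp(-alpha * np.asarray(t, dtype=float))
```

Summing `pound_smooth` over the components of a vector gives a smooth concave surrogate for the number of its nonzero entries, which is what the MKC objective minimizes.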
19
Minimal Kernel Classifier as a Concave Minimization Problem
For t ≥ 0 and α > 0, the pound function is approximated by the concave exponential:
  #(t) ≈ 1 - exp(-αt),
which approaches # pointwise as α → ∞. Substituting this approximation turns the formulation into the minimization of a smooth concave function over a polyhedral set, which can be effectively solved using the finite Successive Linearization Algorithm (SLA) (Mangasarian 1996).
20
Minimal Kernel Algorithm (SLA)
Start with an initial point (u⁰, γ⁰, y⁰). Having (uⁱ, γⁱ, yⁱ), determine (uⁱ⁺¹, γⁱ⁺¹, yⁱ⁺¹) by solving the LP obtained by linearizing the concave exponential objective around the current iterate, subject to the original polyhedral constraints. Stop when the linearized objective decrease between successive iterates falls below a tolerance.
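The SLA iteration above can be sketched generically: minimize a concave objective over a polyhedron by repeatedly solving the LP whose objective is the gradient at the current iterate. This is a toy illustration of the algorithmic skeleton (not the MKC-specific LP); the function name and the stopping tolerance are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def sla(grad, A_ub, b_ub, bounds, x0, tol=1e-6, max_iter=50):
    """Successive Linearization Algorithm (Mangasarian 1996) sketch:
    repeatedly solve  min grad(x_i)' x  over the fixed polyhedron,
    stopping when the linearized decrease grad(x_i)'(x_{i+1} - x_i)
    is negligible (the Minimum Principle condition)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                               # linearize concave objective at x
        res = linprog(g, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        x_new = res.x
        if g @ (x_new - x) >= -tol:               # no further linearized decrease
            return x_new
        x = x_new
    return x
```

On a 1-D toy problem, minimizing the concave f(x) = -x² over [0, 1] (gradient -2x), the iteration reaches the vertex x = 1 in two LP solves, matching the "typically 5 to 7 iterations" flavor of finite termination at a vertex.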
21
Minimal Kernel Algorithm (SLA)
Each iteration of the algorithm solves a linear program. The algorithm terminates in a finite number of iterations (typically 5 to 7). The solution obtained satisfies the Minimum Principle necessary optimality condition.
23
Checkerboard Separating Surface # of Kernel Functions=27 * # of Active Constraints= 30 o
24
Numerical Experiments: Results for Six Public Datasets

| Dataset (m × n) | Reduced rectangular kernel, nnz(t) × nnz(u) | MKC test % (time, s) | SVM test % (#SV) | Kernel SV reduction % / testing-time reduction % (SVM - MKC, s) |
|---|---|---|---|---|
| Ionosphere (351 × 34) | 30.2 × 15.7 | 94.9 (172.3) | 92.9 (288.2) | 94.6 (3.05 - 0.16) |
| Cleveland Heart (297 × 13) | 64.6 × 7.6 | 85.8 (147.2) | 85.5 (241.0) | 96.9 / 96.3 ( ) |
| Pima Indians (768 × 8) | 263.1 × 7.8 | 77.7 (303.3) | 76.6 (637.3) | 98.8 (3.95 - 0.05) |
| BUPA Liver (345 × 6) | 144.4 × 10.5 | 75.0 (285.9) | 72.7 (310.5) | 96.6 / 97.5 (0.59 - 0.02) |
| Tic-Tac-Toe (958 × 9) | 31.3 × 14.3 | 98.4 (150.4) | 98.3 (861.4) | 98.2 (6.97 - 0.13) |
| Mushroom (8124 × 22) | 933.8 × 47.9 | 89.3 (2763.5) | oom (NA) | NA |

(oom = out of memory.)
25
Conclusion
A finite algorithm generating a classifier that depends on only a fraction of the input data.
Important for fast online testing of unseen data, e.g. fraud or intrusion detection.
Useful for incremental training on massive data.
The overall algorithm consists of solving 5 to 7 LPs.
Kernel data dependence reduced by up to 98.8% relative to the data used by a standard SVM.
Testing-time reduction of up to 98.2%.
MKC testing-set correctness is comparable to that of a more complex standard SVM.