Using Analytic QP and Sparseness to Speed Training of Support Vector Machines John C. Platt Presented by: Travis Desell

Overview
Introduction
– Motivation
– General SVMs
– General SVM training
– Related work
Sequential Minimal Optimization (SMO)
– Choosing the smallest optimization problem
– Solving the smallest optimization problem
Benchmarks
Conclusions
Remarks & Future Work
References

Motivation
Traditional SVM training algorithms:
– Require a quadratic programming (QP) package
– SVM training is slow, especially for large problems
Sequential Minimal Optimization (SMO):
– Requires no QP package
– Easy to implement
– Often faster
– Good scalability properties

General SVMs
u = Σ_i α_i y_i K(x_i, x) − b   (1)
u : SVM output
α_i : weights blending the kernel evaluations of the training examples
y_i ∈ {−1, +1} : desired output
b : threshold
x_i : stored training example (vector)
x : input (vector)
K : kernel function measuring the similarity of x_i to x
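As an illustration (not Platt's code), a minimal Python sketch of equation (1); the function names and the RBF kernel choice are my own assumptions:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    # One possible kernel K: the Gaussian (RBF) kernel (assumed here for illustration)
    return np.exp(-gamma * np.dot(x1 - x2, x1 - x2))

def svm_output(x, alphas, ys, Xtrain, b, kernel=rbf_kernel):
    # u = sum_i alpha_i * y_i * K(x_i, x) - b   (equation 1)
    u = 0.0
    for alpha_i, y_i, x_i in zip(alphas, ys, Xtrain):
        if alpha_i > 0:               # examples with alpha_i = 0 do not contribute (sparseness)
            u += alpha_i * y_i * kernel(x_i, x)
    return u - b
```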

General SVMs (2)
For linear SVMs, K is linear, so (1) can be expressed as the dot product of w and x minus the threshold:
u = w · x − b   (2)
w = Σ_i α_i y_i x_i   (3)
where w, x, and x_i are vectors.
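For the linear case, the weights can be folded into a single vector so that evaluating the SVM needs only one dot product per example. A minimal sketch under that assumption, with hypothetical function names:

```python
import numpy as np

def fold_linear_weights(alphas, ys, Xtrain):
    # w = sum_i alpha_i * y_i * x_i   (equation 3)
    return (alphas * ys) @ Xtrain

def linear_svm_output(x, w, b):
    # u = w . x - b   (equation 2); no kernel evaluations needed
    return np.dot(w, x) - b
```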

General SVM Training
Training an SVM means finding the α_i, expressed as minimizing a dual quadratic form:
min_α Ψ(α) = min_α ½ Σ_i Σ_j y_i y_j K(x_i, x_j) α_i α_j − Σ_i α_i   (4)
subject to the box constraints:
0 ≤ α_i ≤ C, for all i   (5)
and the linear equality constraint:
Σ_i y_i α_i = 0   (6)
The α_i are Lagrange multipliers of a primal QP problem: there is a one-to-one correspondence between each α_i and each training example x_i.
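To make the objective concrete, here is a hedged NumPy sketch that evaluates the dual objective (4) and checks constraints (5)-(6) for a candidate α; the function names are illustrative only:

```python
import numpy as np

def dual_objective(alphas, ys, K):
    # Psi(alpha) = 1/2 * sum_i sum_j y_i y_j K(x_i, x_j) alpha_i alpha_j - sum_i alpha_i   (4)
    ay = alphas * ys
    return 0.5 * ay @ K @ ay - alphas.sum()

def is_feasible(alphas, ys, C, tol=1e-8):
    # Box constraints (5) and linear equality constraint (6)
    in_box = np.all((alphas >= -tol) & (alphas <= C + tol))
    return bool(in_box) and abs(np.dot(alphas, ys)) <= tol
```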

General SVM Training (2)
SMO solves the QP expressed in (4)-(6).
It terminates when all of the Karush-Kuhn-Tucker (KKT) optimality conditions are fulfilled:
α_i = 0  ⇒  y_i u_i ≥ 1   (7)
0 < α_i < C  ⇒  y_i u_i = 1   (8)
α_i = C  ⇒  y_i u_i ≤ 1   (9)
where u_i is the SVM output for the i-th training example.
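A small sketch of how conditions (7)-(9) might be checked to within a tolerance; the thresholds and the function name are assumptions, not the paper's code:

```python
def violates_kkt(alpha_i, y_i, u_i, C, eps=1e-3):
    # Conditions (7)-(9), each checked to within a tolerance eps
    r = y_i * u_i
    if alpha_i <= 1e-12:         # alpha_i = 0  ->  require y_i * u_i >= 1
        return r < 1 - eps
    if alpha_i >= C - 1e-12:     # alpha_i = C  ->  require y_i * u_i <= 1
        return r > 1 + eps
    return abs(r - 1) > eps      # 0 < alpha_i < C  ->  require y_i * u_i = 1
```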

Related Work
“Chunking” [9]:
– Removing training examples with α_i = 0 does not change the solution.
– Breaks the large QP problem into smaller sub-problems to identify the non-zero α_i.
– Each QP sub-problem consists of every non-zero α_i from the previous sub-problem combined with the M worst examples that violate (7)-(9), for some M [1].
– The last step solves the entire QP problem, since all non-zero α_i have been found.
– Cannot handle large-scale training problems if standard QP techniques are used; Kaufman [3] describes a QP algorithm to overcome this.

Related Work (2)
Decomposition [6]:
– Breaks the large QP problem into smaller QP sub-problems.
– Osuna et al. [6] suggest using a fixed-size matrix for every sub-problem, which allows very large training sets.
– Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence.
– Until SMO, decomposition required a numerical QP library, which can be costly or slow.

Sequential Minimal Optimization
SMO decomposes the overall QP problem (4)-(6) into fixed-size QP sub-problems.
It chooses the smallest optimization problem (SOP) at each step.
– This smallest problem involves two elements of α, because of the linear equality constraint.
SMO repeatedly chooses two elements of α to jointly optimize until the overall QP problem is solved.

Choosing the SOP
Heuristic-based approach.
Terminates when the entire training set obeys (7)-(9) to within a tolerance ε (typically ε ≤ 10⁻³).
Repeatedly finds α_1 and α_2 and optimizes them until termination.

Finding α_1
“First choice heuristic”:
– Searches through the examples most likely to violate the KKT conditions (the non-bound subset).
– α_i at the bounds are likely to stay there; non-bound α_i will move as other examples are optimized.
“Shrinking heuristic”:
– Finds examples which fulfill (7)-(9) by more than the worst example violates them.
– Ignores these examples until a final pass at the end, which ensures that all examples fulfill (7)-(9).

Finding α_2
Chosen to maximize the size of the step taken during the joint optimization of α_1 and α_2.
A cached error value E is kept for each non-bound example.
If E_1 is positive, SMO chooses the α_2 with minimum E_2; if E_1 is negative, it chooses the α_2 with maximum E_2 (see the sketch below).
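A minimal sketch of this second-choice heuristic, assuming a cached array of errors and a list of non-bound indices (the names are hypothetical):

```python
def choose_alpha2(E1, errors, non_bound_indices):
    # Approximately maximize |E1 - E2| using the cached errors of non-bound examples
    if E1 > 0:
        return min(non_bound_indices, key=lambda j: errors[j])   # smallest E2
    else:
        return max(non_bound_indices, key=lambda j: errors[j])   # largest E2
```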

Solving the SOP
Computes the minimum along the direction of the linear equality constraint:
α_2^new = α_2 + y_2 (E_1 − E_2) / (K(x_1, x_1) + K(x_2, x_2) − 2 K(x_1, x_2))   (10)
E_i = u_i − y_i   (11)
Clips α_2^new to [L, H]:
L = max(0, α_2 + s α_1 − ½ (s + 1) C)   (12)
H = min(C, α_2 + s α_1 − ½ (s − 1) C)   (13)
s = y_1 y_2   (14)
Calculates α_1^new:
α_1^new = α_1 + s (α_2 − α_2^new,clipped)   (15)
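Putting equations (10)-(15) together, a hedged sketch of the analytic two-variable update (the degenerate case of non-positive curvature, which the paper handles separately, is simply skipped here):

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    # Analytic solution of the two-variable sub-problem, equations (10)-(15)
    s = y1 * y2                                     # (14)
    L = max(0.0, a2 + s * a1 - 0.5 * (s + 1) * C)   # (12)
    H = min(C,   a2 + s * a1 - 0.5 * (s - 1) * C)   # (13)
    if L >= H:
        return a1, a2                               # no feasible progress
    eta = K11 + K22 - 2.0 * K12                     # curvature along the constraint direction
    if eta <= 0:
        return a1, a2                               # degenerate case, handled separately in the paper
    a2_new = a2 + y2 * (E1 - E2) / eta              # (10), unconstrained minimum
    a2_new = min(max(a2_new, L), H)                 # clip to [L, H]
    a1_new = a1 + s * (a2 - a2_new)                 # (15)
    return a1_new, a2_new
```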

Benchmarks
UCI Adult: the SVM is given 14 attributes of a census record and asked to predict whether household income is greater than $50K. 8 categorical and 6 continuous attributes are converted into 123 binary attributes.
Web: classify whether a web page belongs to a category or not. 300 sparse binary keyword attributes.
MNIST: one classifier is trained; inputs are 784-dimensional, non-binary vectors stored as sparse vectors.

Description of Benchmarks
Web and Adult are trained with both linear and Gaussian SVMs.
Experiments are performed with and without sparse inputs, and with and without kernel caching; PCG chunking always uses caching.

Benchmarking SMO

Conclusions
PCG chunking is slower than SMO, partly because SMO ignores examples whose Lagrange multipliers are at C.
The overhead of PCG chunking is not in the kernel computation (kernel optimizations do not greatly affect its time).

Conclusions (2)
SVMlight solves 10-dimensional QP sub-problems; the differences are mostly due to kernel optimizations and numerical QP overhead.
SMO is faster on linear problems because of linear SVM folding, but SVMlight could potentially use this as well.
SVMlight benefits from a complex kernel cache, while SMO does not have one and thus does not benefit at large problem sizes.

Remarks & Future Work
The approach to finding the α_1 and α_2 to optimize is heuristic-based:
– Is it possible to determine an optimal choice strategy that minimizes the number of steps?
– Is there a proof that SMO always minimizes the QP problem?

References
[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
[2] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, 1998.

References (2)
[3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 147–168. MIT Press, 1998.
[6] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. IEEE Neural Networks for Signal Processing ’97, 1997.

References (3) [9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.