Stochastic Subgradient Approach for Solving Linear Support Vector Machines
Jan Rupnik, Jozef Stefan Institute

Outline
- Introduction
- Support Vector Machines
- Stochastic Subgradient Descent SVM (Pegasos)
- Experiments

Introduction
- Support Vector Machines (SVMs) have become one of the most popular classification tools of the last decade.
- Straightforward implementations cannot handle large sets of training examples.
- Recently, methods for solving SVMs with linear computational complexity have appeared.
- Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.

Hard Margin SVM
- Many hyperplanes perfectly separate the two classes (e.g. the red line in the figure).
- Only one hyperplane achieves the maximum margin (the blue line).
- The training examples that lie on the margin are called support vectors.

Soft Margin SVM
- Allow a small number of examples to be misclassified in order to find a large-margin classifier.
- The trade-off between classification accuracy on the training set and the size of the margin is a parameter of the SVM.

Problem setting
- Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be the set of input-output pairs, where $x_i \in \mathbb{R}^N$ and $y_i \in \{-1, 1\}$.
- Find a hyperplane with normal vector $w \in \mathbb{R}^N$ and offset $b \in \mathbb{R}$ that has good classification accuracy on the training set $S$ and a large margin.
- Classify a new example $x$ as $\mathrm{sign}(w^\top x - b)$.
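A minimal sketch of this decision rule in NumPy (the names predict, w, b, x are illustrative, taken from the symbols on the slide rather than from any original code):

```python
import numpy as np

def predict(w, b, x):
    """Classify example x with the linear rule sign(w'x - b)."""
    return np.sign(w @ x - b)

# Toy usage: a 2-dimensional example.
w = np.array([1.0, -2.0])   # normal vector of the hyperplane
b = 0.5                     # offset
x = np.array([3.0, 1.0])    # new example to classify
print(predict(w, b, x))     # prints 1.0 (the example lies on the positive side)
```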

Optimization problem
- Regularized hinge loss: $\min_w \; \frac{\lambda}{2} w^\top w + \frac{1}{m} \sum_i \big(1 - y_i(w^\top x_i - b)\big)_+$
- $\lambda$ controls the trade-off between the margin and the loss: the first term corresponds to the size of the margin, and the sum is the expected (empirical) hinge loss on the training set.
- $y_i(w^\top x_i - b)$ is positive for correctly classified examples and negative otherwise.
- $(1 - z)_+ := \max\{0, 1 - z\}$ (hinge loss).
- The first summand is a quadratic function and the sum is a piecewise linear function, so the whole objective is piecewise quadratic.
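To make the objective concrete, here is a small NumPy sketch that evaluates the regularized hinge loss for a given weight vector on a dataset (the array names X, y and the helper names are mine, not from the slides):

```python
import numpy as np

def hinge_loss(z):
    """Hinge loss (1 - z)_+ = max(0, 1 - z), applied elementwise."""
    return np.maximum(0.0, 1.0 - z)

def svm_objective(w, b, X, y, lam):
    """Regularized hinge loss: lam/2 * w'w + mean hinge loss over the training set."""
    margins = y * (X @ w - b)             # y_i (w'x_i - b) for every example
    return 0.5 * lam * (w @ w) + hinge_loss(margins).mean()

# Toy usage on a random dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 examples, 5 features
y = np.sign(X[:, 0])                      # labels in {-1, +1}
w, b, lam = np.zeros(5), 0.0, 0.1
print(svm_objective(w, b, X, y, lam))     # the zero vector has objective value 1.0
```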

Perceptron
- We ignore the offset parameter $b$ from now on ($b = 0$).
- Regularized hinge loss (SVM): $\min_w \; \frac{\lambda}{2} w^\top w + \frac{1}{m} \sum_i \big(1 - y_i(w^\top x_i)\big)_+$
- Perceptron: $\min_w \; \frac{1}{m} \sum_i \big(-y_i(w^\top x_i)\big)_+$

Loss functions
- Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount.
- Perceptron loss: penalizes an incorrectly classified example $x$ proportionally to the size of $|w^\top x|$.
- Hinge loss: penalizes incorrectly classified examples and also correctly classified examples that fall within the margin.
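A small sketch of the three loss functions applied to a signed margin $z = y \, w^\top x$ (the function names are mine, not from the slides):

```python
import numpy as np

def zero_one_loss(z):
    """0/1 loss: 1 for misclassified examples (z <= 0), 0 otherwise."""
    return (z <= 0).astype(float)

def perceptron_loss(z):
    """Perceptron loss (-z)_+: linear penalty for misclassified examples only."""
    return np.maximum(0.0, -z)

def hinge_loss(z):
    """Hinge loss (1 - z)_+: also penalizes correct examples inside the margin."""
    return np.maximum(0.0, 1.0 - z)

z = np.array([-1.5, -0.2, 0.3, 2.0])   # signed margins y_i * w'x_i
print(zero_one_loss(z))                # [1.  1.  0.  0. ]
print(perceptron_loss(z))              # [1.5 0.2 0.  0. ]
print(hinge_loss(z))                   # [2.5 1.2 0.7 0. ]
```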

Stochastic Subgradient Descent
- Perceptron: gradient descent optimization (smooth objective); at a differentiable point the gradient is unique.
- Pegasos: subgradient descent (non-differentiable objective); at a non-differentiable point there are many subgradients, all equally valid.

Stochastic Subgradient
- Subgradient of the perceptron objective: $\frac{1}{m} \sum_i -y_i x_i$, summed over the misclassified examples.
- Subgradient of the SVM objective: $\lambda w + \frac{1}{m} \sum_i -y_i x_i$, summed over the examples with non-zero hinge loss (i.e. $y_i w^\top x_i < 1$).
- For every point $w$ the subgradient is a function of the whole training sample $S$. We can estimate it from a smaller random subset $A \subseteq S$ of size $k$ (the stochastic part) and speed up the computation.
- Stochastic subgradient of the SVM objective: $\lambda w + \frac{1}{k} \sum_{(x,y) \in A,\; y w^\top x < 1} -y\, x$
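A sketch of the stochastic subgradient estimate described above, computed from a random mini-batch of size k (the function and variable names are mine; the margin-violation test y·w'x < 1 follows the slide's notion of a "misclassified" example):

```python
import numpy as np

def stochastic_subgradient(w, X, y, lam, k, rng):
    """Estimate a subgradient of the SVM objective from a random subset of size k."""
    idx = rng.choice(len(y), size=k, replace=False)   # the random subset A
    X_A, y_A = X[idx], y[idx]
    violated = y_A * (X_A @ w) < 1                    # examples with non-zero hinge loss
    # lam * w  minus  (1/k) * sum of y_i x_i over the margin-violating examples in A
    return lam * w - (X_A[violated] * y_A[violated, None]).sum(axis=0) / k

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.sign(X[:, 0])
g = stochastic_subgradient(np.zeros(20), X, y, lam=0.1, k=50, rng=rng)
print(g.shape)   # (20,)
```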

Pegasos – the algorithm
At each iteration: set the learning rate, draw a random subsample of size k, take a subgradient step, and project the iterate back into a ball (rescaling). The subgradient is zero on the training points that do not violate the margin.
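A compact sketch of the Pegasos loop as described on this slide: learning rate μ_t = 1/(λt), a mini-batch of size k, a subgradient step, and projection onto the ball of radius 1/√λ. This is my reconstruction for illustration, not the authors' code.

```python
import numpy as np

def pegasos(X, y, lam=0.1, k=50, T=1000, seed=0):
    """Mini-batch Pegasos for the objective lam/2 * ||w||^2 + mean hinge loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        mu = 1.0 / (lam * t)                            # learning rate mu_t = 1/(lam*t)
        idx = rng.choice(m, size=k, replace=False)      # random subsample A_t
        X_t, y_t = X[idx], y[idx]
        violated = y_t * (X_t @ w) < 1                  # margin-violating examples in A_t
        grad = lam * w - (X_t[violated] * y_t[violated, None]).sum(axis=0) / k
        w = w - mu * grad                               # subgradient step
        radius = 1.0 / np.sqrt(lam)                     # projection onto the ball
        norm = np.linalg.norm(w)
        if norm > radius:
            w = w * (radius / norm)                     # rescale back into the ball
    return w

# Toy usage on linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = np.sign(X @ rng.normal(size=10))
w = pegasos(X, y, lam=0.01, k=100, T=500)
print(np.mean(np.sign(X @ w) == y))   # training accuracy, expected to be high here
```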

What's new in Pegasos?
- The subgradient descent technique is about 50 years old; the soft margin SVM is about 14 years old.
- Gradient descent methods typically suffer from slow convergence.
- The authors of Pegasos proved that an aggressive decrease of the learning rate $\mu_t$ still leads to convergence:
  - previous works: $\mu_t = 1/(\lambda \sqrt{t})$
  - Pegasos: $\mu_t = 1/(\lambda t)$
- They also proved that the solution always lies in a ball of radius $1/\sqrt{\lambda}$.

SVM Light
- A popular SVM solver with superlinear computational complexity.
- Solves a large quadratic program.
- The solution can be expressed in terms of a small subset of training vectors, called support vectors.
- Uses an active set method to find the support vectors, solving a series of smaller quadratic problems.
- Produces highly accurate solutions.
- Algorithm and implementation by Thorsten Joachims: T. Joachims, Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.

Experiments
- Data: Reuters RCV2
  - news articles, already preprocessed into bag-of-words vectors, publicly available
  - sparse vectors with a large number of features
  - the binary task uses the category CCAT

Quick convergence to suboptimal solutions
- 200 iterations took 9.2 CPU seconds; the objective value was 0.3% higher than the optimal solution.
- 560 iterations were needed to get a 0.1% accurate solution.
- SVM Light takes roughly 4 hours of CPU time.

Test set error
- Optimizing the objective value to a high precision is often not necessary.
- The lowest error on the test set is achieved much earlier.

Parameters k and T
- The product kT determines how close to the optimal value we get.
- If kT is fixed, the particular choice of k does not play a significant role.
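Reusing the pegasos and svm_objective sketches defined earlier in this transcript (hypothetical helpers, not the authors' code), one way to probe this claim is to vary k while keeping the product kT fixed and compare the resulting objective values:

```python
import numpy as np

# Assumes the pegasos() and svm_objective() sketches from the earlier slides are defined.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 20))
y = np.sign(X @ rng.normal(size=20))

budget = 100_000                          # fixed product k * T
for k in (10, 100, 1000):
    T = budget // k
    w = pegasos(X, y, lam=0.01, k=k, T=T)
    obj = svm_objective(w, 0.0, X, y, lam=0.01)
    print(f"k={k:5d}  T={T:6d}  objective={obj:.4f}")
```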

Conclusions and final notes
- Pegasos is one of the most efficient suboptimal SVM solvers.
- Suboptimal solutions often generalize well to new examples.
- It can take advantage of sparsity.
- It is a linear solver; nonlinear extensions have been proposed, but they suffer from slower convergence.

Thank you!