Large Scale Support Vector Machines


Large Scale Support Vector Machines
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

Issues with Perceptrons
Overfitting: there is no regularization to control model complexity.
If the data is not separable, the weights dance around and never settle.
Mediocre generalization: the perceptron finds a "barely" separating solution.

Support Vector Machines
Which is the best linear separator?
Data: examples (x1, y1), …, (xn, yn)
Example i: xi = (xi(1), …, xi(d)) is a real-valued feature vector; the label yi ∈ {-1, +1}
Inner product: w·x = Σj w(j) x(j)
(Figure: + and - labeled training points with candidate separating lines.)

Largest Margin
Why define the margin as the distance to the closest example, rather than the average distance or some other measure?

Largest Margin
Prediction = sign(w·x + b)
"Confidence" of the prediction = (w·x + b) y
For the i-th datapoint: γi = (w·xi + b) yi
Want to solve: max_w min_i γi
Can rewrite as: max_{w,γ} γ  such that  ∀i, yi (w·xi + b) ≥ γ
(Figure: separating hyperplane w·x + b = 0 between + and - points.)

Support Vector Machine
Maximize the margin:
max_{w,γ} γ  such that  ∀i, yi (w·xi + b) ≥ γ
Maximizing the margin is good according to intuition, theory, and practice.
(Figure: hyperplane w·x + b = 0 with margin γ to the closest + and - points.)

Support Vector Machines
The separating hyperplane is defined by the support vectors: the points on the +1/-1 margin planes of the solution.
If you knew these points in advance, you could ignore the rest of the data.
If there are no degeneracies, there are d+1 support vectors (for d-dimensional data).

Canonical Hyperplane: Problem
Problem: if (w·x + b) y = γ, then (2w·x + 2b) y = 2γ, so scaling w and b increases the margin value without changing the hyperplane!
Solution: normalize w, i.e. work with w/‖w‖, where ‖w‖ = √(Σj=1..d (w(j))²)  (w(j) is the j-th component of w).
Also require the hyperplanes touching the support vectors to satisfy w·x + b = ±1.
(Figure: hyperplanes w·x + b = -1, 0, +1.)

Canonical Hyperplane: Solution
What is the relation between x1 (on the +1 plane) and x2 (on the -1 plane)?
x1 = x2 + 2γ · w/‖w‖
We also know: w·x1 + b = +1 and w·x2 + b = -1.
Then: w·(x2 + 2γ w/‖w‖) + b = +1
  ⇒ w·x2 + b + 2γ (w·w)/‖w‖ = +1
  ⇒ -1 + 2γ ‖w‖ = +1
  ⇒ γ = 1/‖w‖
(Figure: hyperplanes w·x + b = -1, 0, +1 with x1 and x2 on opposite margin planes.)

Maximizing the Margin
We started with: max_{w,γ} γ  such that  ∀i, yi (w·xi + b) ≥ γ
But w can be arbitrarily large!
We normalized (so that γ = 1/‖w‖) and then:
max γ = max 1/‖w‖ = min ‖w‖ = min ½‖w‖²
So we solve: min_w ½‖w‖²  such that  ∀i, yi (w·xi + b) ≥ 1
This is called SVM with "hard" constraints.
(Figure: hyperplanes w·x + b = -1, 0, +1; the margin between the ±1 planes has width 2γ.)

Non-linearly separable data?
If the data is not separable, introduce a penalty:
min_w ½‖w‖² + C · (number of training mistakes)
Set C using cross-validation.
How should we penalize mistakes? All mistakes are not equally bad!
(Figure: + and - points that no hyperplane separates perfectly.)

Support Vector Machines
Introduce slack variables ξi: if point xi is on the wrong side of the margin, it pays penalty ξi.
min_{w,b,ξi≥0} ½‖w‖² + C · Σi ξi  such that  ∀i, yi (w·xi + b) ≥ 1 - ξi
For each datapoint: if its margin is ≥ 1, we don't care; if its margin is < 1, we pay a linear penalty.
The slack penalty C:
  C = ∞: we only allow w, b that separate the data.
  C = 0: we can set ξi to anything, so w = 0 (the data is basically ignored).
(Figure: misclassified points at distance ξi from their margin hyperplane.)

Support Vector Machines
SVM in its "natural" form:
arg min_{w,b} ½‖w‖² + C · Σi max{0, 1 - yi (w·xi + b)}
The first term is the regularization (margin) term, the second is the empirical loss (how well we fit the training data); C is the regularization parameter.
SVM uses the "hinge loss" max{0, 1 - z}, where z = yi (w·xi + b).
(Figure: hinge loss vs. the 0/1 loss as a function of z.)
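
To make the hinge-loss objective concrete, here is a minimal sketch in Python/NumPy of evaluating f(w,b) on a dataset; the array names X, y and the constant C are illustrative assumptions, not part of the slides.

import numpy as np

def svm_objective(w, b, X, y, C):
    """Soft-margin SVM objective: 1/2 ||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w + b)                # y_i * (w . x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # max{0, 1 - y_i (w . x_i + b)}
    return 0.5 * np.dot(w, w) + C * hinge.sum()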

SVM: How to estimate w?
We want to estimate w and b.
Standard way: use a solver! A solver is software for finding solutions to "common" optimization problems.
Here we can use a quadratic solver: minimize a quadratic function subject to linear constraints.
Problem: solvers are inefficient for big data!

SVM: How to estimate w?
Alternative approach: minimize f(w,b) directly, where
f(w,b) = ½ Σj (w(j))² + C Σi max{0, 1 - yi (Σj w(j) xi(j) + b)}
How do we minimize convex functions? Gradient descent: to solve min_z f(z), iterate z_{t+1} ← z_t - η f'(z_t).
(Figure: a convex function f(z) with gradient steps descending toward its minimum.)

SVM: How to estimate w?
We want to minimize f(w,b) = ½ Σj (w(j))² + C Σi max{0, 1 - yi (w·xi + b)}.
Take the gradient with respect to w(j):
∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
where ∂L(xi, yi)/∂w(j) = 0 if yi (w·xi + b) ≥ 1, and -yi xi(j) otherwise.

SVM: How to estimate w?
Gradient descent:
Iterate until convergence:
  For j = 1 … d:
    Evaluate ∇f(j)
    Update: w(j) ← w(j) - η ∇f(j)
(η … learning rate)
Problem: computing ∇f(j) takes O(n) time, where n is the size of the training dataset!
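
A minimal batch gradient-descent sketch for this objective (the function and array names, the fixed learning rate eta, and the iteration count are assumptions; the slides do not prescribe them):

import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.001, n_iters=1000):
    """Full-batch gradient descent on the hinge-loss SVM objective.
    Every iteration touches all n examples, which is the O(n) cost per step
    that motivates switching to SGD."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                      # examples whose hinge loss is non-zero
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b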

SVM: How to estimate w?
We just had: ∇f(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
Stochastic gradient descent (SGD): instead of evaluating the gradient over all examples, evaluate it for each individual example:
∇f(j)_i = w(j) + C ∂L(xi, yi)/∂w(j)
Stochastic gradient descent:
Iterate until convergence:
  For i = 1 … n:
    For j = 1 … d:
      Evaluate ∇f(j)_i
      Update: w(j) ← w(j) - η ∇f(j)_i
(n … number of training examples, d … number of features)
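
A corresponding SGD sketch that performs one noisy gradient step per example (the names, the fixed learning rate, and the random visiting order are assumptions; the slides leave these choices open):

import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent on the hinge-loss SVM objective:
    one update per training example instead of one per full pass."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(n):          # visit the examples in random order
            if y[i] * (X[i] @ w + b) < 1:     # hinge loss active for this example
                grad_w = w - C * y[i] * X[i]
                grad_b = -C * y[i]
            else:                             # hinge loss inactive: only the regularizer
                grad_w = w
                grad_b = 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b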

Example: Text categorization
Example by Leon Bottou:
Reuters RCV1 document corpus; predict the category of a document; one-vs-rest classification.
n = 781,000 training examples, 23,000 test examples.
d = 50,000 features: one feature per word, after removing stop-words and low-frequency words.

Example: Text categorization
Questions:
Is SGD successful at minimizing f(w,b)?
How quickly does SGD find the minimum of f(w,b)?
What is the error on a test set?
(Table: training time, final value of f(w,b), and test error for a standard SVM solver, a "fast SVM" solver, and the SGD SVM.)

Optimization "Accuracy"
Optimization "accuracy" is measured as |f(w,b) - f_optimal(w,b)|, the gap between the current objective value and the optimal one.

Subsampling
What if we subsample the training data and solve exactly on the subsample?
Comparison: SGD on the full dataset vs. batch conjugate gradient on a sample of n training examples.
The point: methods that converge in fewer iterations don't necessarily run faster; a simple update that you can do quickly (but many times) can be better.

Practical Considerations
Need to choose the learning rate η and t0.
Leon Bottou suggests:
Choose t0 so that the expected initial updates are comparable with the expected size of the weights.
Choose η:
  Select a small subsample.
  Try various rates η (e.g., 10, 1, 0.1, 0.01, …).
  Pick the one that most reduces the cost.
  Use that η for the next 100k iterations on the full dataset.
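
A sketch of this rate-selection recipe, reusing svm_sgd and svm_objective from the earlier sketches (the candidate rates, subsample size, and single-epoch trial are illustrative assumptions):

import numpy as np

def pick_learning_rate(X, y, candidate_etas=(10.0, 1.0, 0.1, 0.01, 0.001),
                       sample_size=10000, C=1.0, seed=0):
    """Try several learning rates on a small subsample and keep the one
    that most reduces the SVM objective."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(sample_size, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_obj = None, np.inf
    for eta in candidate_etas:
        w, b = svm_sgd(Xs, ys, C=C, eta=eta, n_epochs=1)   # one pass over the subsample
        obj = svm_objective(w, b, Xs, ys, C)
        if obj < best_obj:
            best_eta, best_obj = eta, obj
    return best_eta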

Practical Considerations
Sparse linear SVM: the feature vector xi is sparse (contains many zeros).
Do not store it densely, e.g. xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]; represent xi as a sparse vector of (index, value) pairs, e.g. xi = [(4,1), (9,5), …].
The SGD update w ← w - η(w + C ∂L(xi,yi)/∂w) can then be computed in two steps:
(1) w ← w - η C ∂L(xi,yi)/∂w   (cheap: xi is sparse, so only a few coordinates j of w are updated)
(2) w ← w (1 - η)   (expensive: w is not sparse, so all coordinates need to be updated)
The two steps do not decompose exactly, but performing them separately only adds an extra η² term, which is negligible when η is close to 0.
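
A minimal sketch of this two-step update with xi given as (indices, values) of its non-zero features; the function and variable names are assumptions, and the split follows the cheap/expensive decomposition above:

import numpy as np

def sgd_step_sparse(w, b, x_idx, x_val, yi, C, eta):
    """One SGD update with a sparse example x_i = (x_idx, x_val).
    Step (1) touches only the non-zero coordinates of x_i;
    step (2) is the dense regularization shrink over all of w."""
    margin = yi * (np.dot(w[x_idx], x_val) + b)
    if margin < 1:                    # hinge loss active: step (1), cheap
        w[x_idx] += eta * C * yi * x_val
        b += eta * C * yi
    w *= (1.0 - eta)                  # step (2), expensive: every coordinate of w
    return w, b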

Practical Considerations
Two-step update procedure: step (1) is the sparse hinge-loss update, step (2) is the dense shrink w ← w (1 - η).
Solution 1: represent the vector w as the product of a scalar s and a vector v, i.e. w = s·v.
  Perform step (1) by updating v and step (2) by updating s.
Solution 2: perform only step (1) for each training example; perform step (2) with lower frequency and a higher η.
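
A sketch of Solution 1, keeping w as a scalar s times a vector v so that the dense shrink in step (2) becomes a single scalar multiplication (class and method names are assumptions):

import numpy as np

class ScaledVector:
    """Represents w = s * v, so scaling w is O(1) and sparse additions stay cheap."""

    def __init__(self, d):
        self.s = 1.0
        self.v = np.zeros(d)

    def add_sparse(self, idx, vals, coef):
        # Step (1): w += coef * x_i  becomes  v += (coef / s) * x_i on the non-zeros only
        self.v[idx] += (coef / self.s) * vals

    def scale(self, factor):
        # Step (2): w *= factor  becomes a scalar update
        self.s *= factor

    def dot_sparse(self, idx, vals):
        # w . x_i using only the non-zero coordinates of x_i
        return self.s * np.dot(self.v[idx], vals)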

Practical Considerations
Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation:
  Create a validation set.
  Monitor the cost function on the validation set.
  Stop when the loss stops decreasing.
Early stopping variant:
  Extract two disjoint subsamples A and B of the training data.
  Train on A, stopping by validating on B; the number of epochs used is an estimate of k.
  Train for k epochs on the full dataset.
(Question from the slide notes: why would training on A and B take about the same time as training on the full dataset?)
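
A sketch of the subsample-based early-stopping estimate, reusing svm_sgd and svm_objective from the earlier sketches (retraining from scratch for each epoch count keeps the example short; an incremental implementation would continue training instead):

import numpy as np

def estimate_epochs(XA, yA, XB, yB, max_epochs=50, C=1.0, eta=0.01):
    """Train on subsample A, monitor the SVM objective on subsample B,
    and return the epoch count k at which the B-objective stops improving."""
    best_obj, best_k = np.inf, 0
    for k in range(1, max_epochs + 1):
        w, b = svm_sgd(XA, yA, C=C, eta=eta, n_epochs=k)
        obj = svm_objective(w, b, XB, yB, C)
        if obj < best_obj:
            best_obj, best_k = obj, k
        else:
            break                      # validation objective stopped decreasing
    return best_k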

What about multiple classes?
One against all: learn 3 classifiers
  + vs. {o, -}
  - vs. {o, +}
  o vs. {+, -}
Obtain: w+, b+;  w-, b-;  wo, bo
How to classify? Return the class c = arg max_c (wc·x + bc), as sketched below.
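
A sketch of the one-against-all decision rule, assuming the per-class weights from the binary classifiers are stacked as rows of a matrix W with the biases in a vector b (these names are assumptions):

import numpy as np

def predict_one_vs_rest(W, b, x):
    """Return the class whose classifier gives the largest score wc . x + bc."""
    scores = W @ x + b          # one score per class
    return int(np.argmax(scores))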

Learn 1 classifier: Multiclass SVM
Learn 3 sets of weights simultaneously.
For each class c, estimate wc, bc.
Want the correct class to have the highest margin:
w_yi·xi + b_yi ≥ 1 + wc·xi + bc   for all c ≠ yi and all training examples (xi, yi)

Multiclass SVM
Optimization problem (soft-margin form, with slack variables as in the binary case):
min_{w,b,ξi≥0} ½ Σc ‖wc‖² + C Σi ξi
such that  w_yi·xi + b_yi ≥ wc·xi + bc + 1 - ξi   for all c ≠ yi and all i
To obtain the parameters wc, bc (for each class c) we can use techniques similar to those for the 2-class SVM.