Large Scale Support Vector Machines


1 Large Scale Support Vector Machines
CS246: Mining Massive Datasets Jure Leskovec, Stanford University

2 Issues with Perceptrons
Overfitting.
Regularization: if the data is not separable, the weights dance around.
Mediocre generalization: finds a "barely" separating solution.

3 Support Vector Machines
Which is the best linear separator?
Data: examples (x1, y1), ..., (xn, yn)
Example i: xi = (xi(1), ..., xi(d)), with real-valued components
Label: yi ∈ {-1, +1}
Inner product: w⋅x = Σj w(j) x(j)
[Figure: positive and negative examples with several candidate separating lines]

4 Largest Margin
Why define the margin as the distance to the closest example, rather than the average distance or something else?

5 Largest Margin
Prediction = sign(w⋅x + b)
"Confidence" = (w⋅x + b) y
For the i-th datapoint: γi = (w⋅xi + b) yi
Want to solve: maxw γ s.t. ∀i, γi ≥ γ
Can rewrite as: maxw γ s.t. ∀i, yi (w⋅xi + b) ≥ γ
(A small numeric sketch of the prediction and confidence follows below.)
[Figure: hyperplane w⋅x + b = 0 separating positive and negative examples, with margin γ]
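A minimal numpy sketch of these two quantities, assuming a weight vector w, bias b, and a labeled example (x, y) are given; the names and toy values are illustrative, not from the slides:

```python
import numpy as np

def prediction(w, b, x):
    # Predicted label: sign of the score w.x + b
    return np.sign(np.dot(w, x) + b)

def confidence(w, b, x, y):
    # "Confidence" (functional margin) of example (x, y): positive iff the prediction is correct
    return y * (np.dot(w, x) + b)

# Toy usage on a 2-D point
w, b = np.array([1.0, -2.0]), 0.5
x, y = np.array([3.0, 1.0]), +1
print(prediction(w, b, x), confidence(w, b, x, y))
```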

6 Support Vector Machine
Maximize the margin: good according to intuition, theory, and practice.
maxw,b γ s.t. ∀i, yi (w⋅xi + b) ≥ γ
[Figure: hyperplane w⋅x + b = 0 with the margin γ to the closest positive and negative examples]

7 Support Vector Machines
The separating hyperplane is defined by the support vectors: the points lying on the +1/-1 planes of the solution.
If you knew these points, you could ignore the rest of the data.
If there are no degeneracies, there are d+1 support vectors (for d-dimensional data).

8 Canonical Hyperplane: Problem
wx+b=0 wx+b=+1 wx+b=-1 Problem: Let (w x+b) y =  then (2w x+ 2b) y = 2  Scaling w increases margin! Solution: Normalize w: 𝑤 | 𝑤 | | w |= 𝑗=1 𝑑 𝑤 𝑗 2 Also require hyperplanes touching support vectors to be: 𝑤⋅𝑥+𝑏=±1 Note: w j … j-th component of w 12/9/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

9 Canonical Hyperplane: Solution
What is the relation between x1 and x2? x1 = x2 + 2γ w/||w||
We also know: w⋅x1 + b = +1 and w⋅x2 + b = -1.
Substituting the first relation into w⋅x1 + b = +1 and using w⋅w = ||w||2 gives γ = 1/||w|| (the derivation is written out below).
[Figure: x1 on the plane w⋅x + b = +1, x2 on w⋅x + b = -1, a distance 2γ apart along w]
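The chain of substitutions from the slide, spelled out (standard algebra, no new assumptions):

```latex
\begin{align*}
w \cdot x_1 + b &= +1
  && \text{($x_1$ lies on the $+1$ plane)}\\
w \cdot \Bigl(x_2 + 2\gamma \tfrac{w}{\lVert w\rVert}\Bigr) + b &= +1
  && \text{(substitute $x_1 = x_2 + 2\gamma\, w/\lVert w\rVert$)}\\
\underbrace{w \cdot x_2 + b}_{=\,-1} + 2\gamma \frac{w \cdot w}{\lVert w\rVert} &= +1
  && \text{($x_2$ lies on the $-1$ plane)}\\
2\gamma \lVert w\rVert &= 2
  && \text{(since $w \cdot w = \lVert w\rVert^2$)}\\
\gamma &= \frac{1}{\lVert w\rVert}
\end{align*}
```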

10 Maximizing the Margin
We started with: maxw,b γ s.t. ∀i, yi (w⋅xi + b) ≥ γ. But w can be made arbitrarily large.
After normalizing (canonical hyperplane, γ = 1/||w||), maximizing γ becomes:
max 1/||w||  =  min ||w||  =  min ½ ||w||2
Then: minw,b ½ ||w||2 s.t. ∀i, yi (w⋅xi + b) ≥ 1.
This is called the SVM with "hard" constraints.
[Figure: hyperplanes w⋅x + b = -1, 0, +1 with margin 2γ]

11 Non-linearly separable data?
If the data is not separable, introduce a penalty:
minw,b ½ ||w||2 + C · (number of training mistakes)
Set C using cross-validation.
How should mistakes be penalized? Counting them treats every mistake the same, but not all mistakes are equally bad!
[Figure: mostly separable data with a few points on the wrong side of w⋅x + b = 0]

12 Support Vector Machines
Introduce slack variables ξi: if point xi is on the wrong side of the margin, it pays a penalty ξi.
minw,b, ξi≥0 ½ ||w||2 + C Σi ξi s.t. ∀i, yi (w⋅xi + b) ≥ 1 - ξi
For each datapoint: if its margin is ≥ 1, don't care; if its margin is < 1, pay a linear penalty.
Slack penalty C:
C = ∞: only allow w, b that separate the data perfectly.
C = 0: ξi can be set to anything, so w = 0 (the data is essentially ignored).
[Figure: w⋅x + b = 0 with misclassified points and their slacks ξi]

13 Support Vector Machines
SVM in the “natural” form SVM uses “Hinge Loss”: Margin Empirical loss (how well we fit training data) Regularization parameter 0/1 loss penalty Hinge loss: max{0, 1-z} 12/9/2018 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

14 SVM: How to estimate w?
Want to estimate w, b. The standard way: use a solver, i.e. software for finding solutions to "common" optimization problems.
Here, use a quadratic programming solver: it minimizes a quadratic function subject to linear constraints. (A QP sketch follows below.)
Problem: off-the-shelf solvers are inefficient for big data!
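For concreteness, a minimal sketch (not the course's code) of the soft-margin primal SVM posed as a QP for the cvxopt solver; the variable layout z = [w, b, ξ] and the function name are my own choices:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_qp(X, y, C=1.0):
    n, d = X.shape
    m = d + 1 + n                        # variables: w (d), b (1), slacks xi (n)
    P = np.zeros((m, m))
    P[:d, :d] = np.eye(d)                # quadratic term: (1/2) ||w||^2
    P[d:, d:] = 1e-8 * np.eye(n + 1)     # tiny ridge so the KKT system stays non-singular
    q = np.concatenate([np.zeros(d + 1), C * np.ones(n)])   # linear term: C * sum(xi)
    # y_i (w.x_i + b) >= 1 - xi_i   <=>   -y_i x_i.w - y_i b - xi_i <= -1
    G_margin = np.hstack([-y[:, None] * X, -y[:, None] * np.ones((n, 1)), -np.eye(n)])
    # xi_i >= 0                     <=>   -xi_i <= 0
    G_slack = np.hstack([np.zeros((n, d + 1)), -np.eye(n)])
    G = np.vstack([G_margin, G_slack])
    h = np.concatenate([-np.ones(n), np.zeros(n)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]                   # w, b
```

Note that this dense formulation builds constraint matrices whose size grows with n, which is exactly why generic solvers become impractical at the scale discussed next.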

15 SVM: How to estimate w?
Want to estimate w, b. Alternative approach: minimize f(w,b) directly, where
f(w,b) = ½ ||w||2 + C Σi max{0, 1 - yi (w⋅xi + b)}
How do we minimize convex functions? Gradient descent: to solve minz f(z), iterate
zt+1 ← zt - η f'(zt)
where η is the learning rate.
[Figure: a convex function f(z) with gradient steps moving toward its minimum]

16 SVM: How to estimate w?
Want to minimize f(w,b). Take the gradient with respect to w(j):
∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
where L(xi, yi) = max{0, 1 - yi (w⋅xi + b)} and
∂L(xi, yi)/∂w(j) = 0 if yi (w⋅xi + b) ≥ 1, and -yi xi(j) otherwise.

17 SVM: How to estimate w?
Gradient descent: iterate until convergence:
For j = 1 ... d: evaluate ∇f(j), then update w(j) ← w(j) - η ∇f(j)
(η is the learning rate; n is the size of the training dataset.)
Problem: computing ∇f(j) takes O(n) time, since the sum runs over all n training examples! (A batch-gradient sketch follows below.)
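A minimal batch gradient descent sketch for this objective, assuming dense numpy arrays X (n × d) and labels y in {-1, +1}; the fixed learning rate and iteration count are illustrative choices:

```python
import numpy as np

def svm_gd(X, y, C=1.0, eta=0.01, iters=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1.0                    # examples with active hinge loss
        # Full (batch) gradient: every step touches all n examples -> O(n d) per iteration
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```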

18 SVM: How to estimate w?
We just had: ∇f(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
Stochastic gradient descent (SGD): instead of evaluating the gradient over all examples, evaluate it for each individual example:
∇f(j)(xi) = w(j) + C ∂L(xi, yi)/∂w(j)
SGD: iterate until convergence:
For i = 1 ... n: For j = 1 ... d: evaluate ∇f(j)(xi), then update w(j) ← w(j) - η ∇f(j)(xi)
(n is the number of training examples, d the number of features.) A per-example sketch follows below.
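A per-example (SGD) version of the same update; the random shuffling per epoch and the fixed η are my additions, common in practice but not stated on the slide:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.001, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):            # visit examples in random order
            xi, yi = X[i], y[i]
            if yi * (xi @ w + b) < 1.0:         # hinge loss is active for this example
                grad_w = w - C * yi * xi
                grad_b = -C * yi
            else:
                grad_w = w                      # only the regularizer contributes
                grad_b = 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b
```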

19 Example: Text categorization
Example by Leon Bottou: the Reuters RCV1 document corpus.
Task: predict the category of a document, using one-vs-the-rest classification.
n = 781,000 training examples, 23,000 test examples, d = 50,000 features.
One feature per word; stop-words and low-frequency words are removed.

20 Example: Text categorization
Questions:
Is SGD successful at minimizing f(w,b)?
How quickly does SGD find the minimum of f(w,b)?
What is the error on a test set?
[Table from the slide: training time, value of f(w,b), and test error for a standard SVM solver, a "fast" SVM solver, and SGD SVM; the numbers are not in the transcript.]

21 Optimization “Accuracy”
Optimization "accuracy" is measured as | f(w,b) - foptimal(w,b) |.
[Figure: optimization accuracy vs. training time for the compared methods; the plot is not in the transcript.]

22 Subsampling
What if we subsample the training data and solve exactly?
SGD on the full dataset vs. batch conjugate gradient on a sample of n training examples.
(Speaker note: the point of this slide is that methods that converge in fewer iterations don't necessarily run faster overall; a simple step that is cheap but repeated many times can win. Check the 2009 SVM slides for a better way to make this point; this slide is not great.)

23 Practical Considerations
Need to choose the learning rate η and t0.
Leon Bottou suggests:
Choose t0 so that the expected initial updates are comparable with the expected size of the weights.
Choose η: select a small subsample, try various rates η (e.g., 10, 1, 0.1, 0.01, ...), pick the one that most reduces the cost, and use that η for the next 100k iterations on the full dataset. (A sketch of this rate search follows below.)
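A sketch of this rate-selection recipe, reusing the svm_sgd and svm_objective sketches from the earlier slides; the candidate rates and subsample size are illustrative:

```python
import numpy as np

def pick_learning_rate(X, y, rates=(10, 1, 0.1, 0.01, 0.001),
                       sample_size=1000, C=1.0, seed=0):
    """Try each candidate rate on a small subsample and keep the one
    that most reduces the SVM objective after one SGD pass."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(sample_size, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_rate, best_cost = None, np.inf
    for eta in rates:
        w, b = svm_sgd(Xs, ys, C=C, eta=eta, epochs=1)   # sketch from slide 18
        cost = svm_objective(w, b, Xs, ys, C)            # sketch from slide 13
        if cost < best_cost:
            best_rate, best_cost = eta, cost
    return best_rate
```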

24 Practical Considerations
Sparse linear SVM: the feature vector xi is sparse (contains many zeros).
Do not store xi densely as [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]; represent it as a sparse list of (index, value) pairs, e.g. xi = [(4,1), (9,5), ...].
The SGD update w ← w - η (w + C ∂L(xi, yi)/∂w) can be computed in two steps:
(1) w ← w - η C ∂L(xi, yi)/∂w — cheap: xi is sparse, so only a few coordinates j of w are updated.
(2) w ← w (1 - η) — expensive: w is not sparse, so all coordinates need to be updated.
(Speaker note: strictly, the two steps do not decompose exactly; doing them separately adds one extra η2-order term, which is negligible since η is close to 0.)
A sketch of the sparse two-step update follows below.
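A sketch of the two-step update with xi held as (index, value) pairs; w is a plain Python list here, and the names are illustrative:

```python
def sgd_sparse_step(w, b, xi_sparse, yi, C, eta):
    """One SGD step with xi stored as a sparse list of (index, value) pairs.

    Step (1) touches only the nonzero coordinates of xi (cheap);
    step (2) shrinks every coordinate of w (expensive if done naively).
    """
    # The margin uses only xi's nonzero coordinates
    score = sum(w[j] * v for j, v in xi_sparse) + b
    if yi * score < 1.0:                        # hinge loss active
        for j, v in xi_sparse:                  # step (1): sparse loss update
            w[j] += eta * C * yi * v
        b += eta * C * yi
    for j in range(len(w)):                     # step (2): dense shrink w <- (1 - eta) w
        w[j] *= (1.0 - eta)
    return w, b
```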

25 Practical Considerations
Two-step update procedure (from the previous slide): (1) the sparse loss update on the few nonzero coordinates of xi, and (2) the dense shrink w ← w (1 - η).
Solution 1: represent the vector w as the product of a scalar s and a vector v (w = s·v). Perform step (1) by updating v and step (2) by updating s.
Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a correspondingly higher η.
(Speaker note: give the equations for steps 1 and 2; w = v·s.)
A sketch of Solution 1 follows below.
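A sketch of Solution 1, keeping w as a scalar s times a vector v so that the dense shrink becomes a single scalar multiplication; the class name is mine:

```python
class ScaledVector:
    """Represent w = s * v: the dense shrink w <- (1 - eta) w only touches s,
    while sparse loss updates touch the few affected entries of v."""

    def __init__(self, dim):
        self.s = 1.0
        self.v = [0.0] * dim

    def dot_sparse(self, xi_sparse):
        # w . xi = s * (v . xi), using only xi's nonzero coordinates
        return self.s * sum(self.v[j] * val for j, val in xi_sparse)

    def add_sparse(self, xi_sparse, coef):
        # w <- w + coef * xi  ==>  v[j] += coef * xi[j] / s  on nonzeros only
        for j, val in xi_sparse:
            self.v[j] += coef * val / self.s

    def shrink(self, eta):
        # w <- (1 - eta) * w  in O(1)
        self.s *= (1.0 - eta)
```

In practice one also folds s back into v occasionally so that s does not underflow after many shrink steps.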

26 Practical Considerations
Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing.
Early stopping: extract two disjoint subsamples A and B of the training data; train on A, stopping by validating on B; the resulting number of epochs is an estimate of k; then train for k epochs on the full dataset.
(Speaker note: why would training on A and B take the same time as on the full dataset?)
A sketch of validation-based early stopping follows below.
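A sketch of validation-based early stopping around the SGD loop; the patience counter is my addition, one common way to decide that the validation loss has "stopped decreasing":

```python
import numpy as np

def sgd_early_stopping(X_tr, y_tr, X_val, y_val, C=1.0, eta=0.001,
                       max_epochs=50, patience=2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X_tr.shape
    w, b = np.zeros(d), 0.0
    best_cost, best_wb, bad_epochs = np.inf, (w.copy(), b), 0
    for epoch in range(max_epochs):
        for i in rng.permutation(n):                 # one SGD pass over the training set
            xi, yi = X_tr[i], y_tr[i]
            if yi * (xi @ w + b) < 1.0:              # hinge loss active
                w -= eta * (w - C * yi * xi)
                b += eta * C * yi
            else:
                w -= eta * w                          # only the regularizer
        # Monitor the objective on the held-out validation set
        val_hinge = np.maximum(0.0, 1.0 - y_val * (X_val @ w + b)).sum()
        cost = 0.5 * np.dot(w, w) + C * val_hinge
        if cost < best_cost:
            best_cost, best_wb, bad_epochs = cost, (w.copy(), b), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:               # loss stopped decreasing
                break
    return best_wb
```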

27 What about multiple classes?
One against all: learn 3 classifiers:
+ vs. {o, -}, - vs. {o, +}, o vs. {+, -}
Obtain: w+, b+; w-, b-; wo, bo.
How to classify a new x? Return the class c = arg maxc (wc⋅x + bc).
A sketch follows below.
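A tiny sketch of this classification rule; the three hand-set classifiers are toy values, not learned weights:

```python
import numpy as np

def one_vs_rest_predict(classifiers, x):
    """classifiers: dict mapping class label c -> (w_c, b_c).
    Return the class whose linear score w_c . x + b_c is largest."""
    return max(classifiers, key=lambda c: np.dot(classifiers[c][0], x) + classifiers[c][1])

# Toy usage with three classifiers for classes '+', '-', 'o'
clfs = {
    '+': (np.array([1.0, 0.0]), 0.0),
    '-': (np.array([-1.0, 0.0]), 0.0),
    'o': (np.array([0.0, 1.0]), -0.5),
}
print(one_vs_rest_predict(clfs, np.array([0.2, 2.0])))   # -> 'o'
```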

28 Learn 1 classifier: Multiclass SVM
Learn all 3 sets of weights simultaneously.
For each class c, estimate wc, bc.
Want the correct class to have the highest margin:
wyi⋅xi + byi ≥ 1 + wc⋅xi + bc  for all c ≠ yi and for all (xi, yi).

29 Multiclass SVM Optimization problem:
To obtain the parameters wc, bc (for each class c) we can use techniques similar to those for the 2-class SVM. (A standard form of the optimization problem is given below.)
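The slide's own formula did not survive the transcript; a standard soft-margin multiclass formulation consistent with the constraint on slide 28 would be:

```latex
\begin{align*}
\min_{\{w_c,\, b_c\},\ \xi_i \ge 0}\quad
  & \tfrac{1}{2}\sum_{c}\lVert w_c\rVert^2 \;+\; C\sum_{i}\xi_i\\
\text{s.t.}\quad
  & w_{y_i}\cdot x_i + b_{y_i} \;\ge\; w_c\cdot x_i + b_c + 1 - \xi_i
    \qquad \forall\, c \ne y_i,\ \forall\, i
\end{align*}
```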

