Classification & Regression Part II


1 Classification & Regression Part II
COSC 526, Class 8: Classification & Regression Part II
Arvind Ramanathan
Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge

2 Last Class We saw different techniques for handling large-scale classification:
- Decision trees
- k-Nearest Neighbors (k-NN)
- We modified decision trees to handle streaming datasets
- For k-NN, we discussed efficient data structures to handle large-scale datasets

3 This class… More classification: support vector machines
- Basic outline of the algorithm
- Modifications for large-scale datasets
- Online approaches

4 Class Logistics

5 10% Grade… 40% – assignments (2 instead of 3); 50% – class project
What do we do with the 5%? Class participation:
- Select papers posted on the class website
- Present your take on one (a critique)
- 2 presentations per class, 15 minutes each
- Starting Tue (Feb 17)

6 Critique Structure
- 2–3 slides: what is the paper all about?
- 2–4 slides: key results presented
- 2–3 slides: key limitations/shortcomings
- 2–3 slides: what could have been done better?
- Aim for a discussion-oriented presentation rather than an in-depth walkthrough of the paper
- Peer reviews of the critiques are due every week

7 Support Vector Machines (SVM)

8 Support Vector Machines (SVM)
Easiest explanation: find the "best linear separator" for a dataset.
- Training examples: {(x1, y1), (x2, y2), …, (xn, yn)}
- Each data point xi = (xi(1), xi(2), …, xi(d))
- Labels yi ∈ {-1, +1}
- In higher-dimensional datasets we want to find the "best hyperplane"
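
To make this setup concrete, here is a minimal sketch (not from the slides) of a linear classifier of the form f(x) = sign(w·x + b) on toy data; the weight vector w and bias b below are made-up values for illustration, not learned parameters.

```python
import numpy as np

# Toy training examples: each row is a point x_i, labels y_i are in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

# Hypothetical separating hyperplane w.x + b = 0 (illustrative values).
w = np.array([1.0, 1.0])
b = -0.5

# Linear classifier: predict the sign of w.x + b for each point.
predictions = np.sign(X @ w + b)
print(predictions)                 # [ 1.  1. -1. -1.]
print(np.all(predictions == y))    # True: all toy points are on the correct side
```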

9 Classifier Margin…
- Margin γ: the width by which the boundary could be increased before hitting a data point
- We are interested in the maximum margin: the simplest formulation is the maximum-margin SVM classifier
- The data points that the margin pushes up against are the support vectors

10 Why are we interested in the Max-Margin SVM?
- Intuitively this feels right: we want the maximum margin that still fits the data correctly
- Robust to the location of the boundary: a small error in the boundary's location has little effect on classification
- The model is tolerant to removal of any non-support-vector data points
- Validation works well
- Works very well in practice

11 Why is the maximum margin a good thing?
Theoretical convenience: generalization error bounds exist that depend on the value of the margin.

12 Why is maximizing 𝜸 a good idea?
Recall what the dot product means: A · B = ‖A‖ ‖B‖ cos θ.
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]

13 Why is maximizing 𝜸 a good idea?
- Dot product: A · B = ‖A‖ ‖B‖ cos θ
- What are w · x1 and w · x2? They are (scaled) projections of x1 and x2 onto the normal w of the hyperplane w · x + b = 0
- So 𝜸 roughly corresponds to the margin: the bigger 𝜸, the bigger the separation
- [Figure: two example configurations comparing margins 𝜸1 and 𝜸2]
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]

14 What is the margin? Distance from a point to a line
Let:
- Line L: w ∙ x + b = w(1)x(1) + w(2)x(2) + b = 0, with w = (w(1), w(2))
- Point A = (xA(1), xA(2))
- Point M on the line = (xM(1), xM(2)), and H the projection of A onto L
Note: we assume ‖w‖2 = 1. Then:
d(A, L) = |AH| = |(A − M) ∙ w| = |(xA(1) – xM(1)) w(1) + (xA(2) – xM(2)) w(2)| = xA(1) w(1) + xA(2) w(2) + b = w ∙ A + b
Remember xM(1) w(1) + xM(2) w(2) = −b, since M belongs to line L.
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]
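
As a quick numeric check of the formula d(A, L) = w ∙ A + b (which assumes ‖w‖ = 1), the following sketch, with my own made-up line and point, compares it against the general point-to-line distance formula.

```python
import numpy as np

# Line L: w.x + b = 0 with a unit-norm normal vector w.
w = np.array([0.6, 0.8])          # ||w|| = 1
b = -2.0
A = np.array([5.0, 4.0])          # point whose distance to L we want

# Distance via the slide's formula (valid because ||w|| = 1): d(A, L) = w.A + b
d_formula = w @ A + b

# General formula for comparison: |w.A + b| / ||w||
d_general = abs(w @ A + b) / np.linalg.norm(w)

print(d_formula, d_general)       # 4.2 and 4.2 (signed vs. unsigned distance)
```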

15 Largest Margin
For the i-th data point, the confidence is γi = yi (w ∙ xi + b).
Want to solve: maximize over w the smallest such confidence, i.e., max_w min_i γi.
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]

16 Support Vector Machine
Maximize the margin: good according to intuition, theory (VC dimension), and practice.
𝜸 is the margin: the distance from the separating hyperplane w ∙ x + b = 0 to the nearest data point.
max_{w,γ} γ  subject to  ∀i, yi (w ∙ xi + b) ≥ γ
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]

17 How do we derive the margin?
- The separating hyperplane is defined by the support vectors: the points lying on the +1/−1 planes of the solution
- If you knew these points, you could ignore the rest
- Generally there are d+1 support vectors (for d-dimensional data)
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]

18 Canonical Hyperplane: Problem
- Let yi (w ∙ xi + b) ≥ γ. Now, scaling w increases the margin γ, so the problem is ill-posed as stated.
- Solution: work with a normalized w, i.e., measure the margin using w / ‖w‖.
- Also require the support vectors xj to lie on the planes defined by w ∙ xj + b = ±1 (on either side of w ∙ x + b = 0).
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets]

19 Canonical Hyperplane: Solution
- Want to maximize the margin γ.
- What is the relation between x1 (on the plane w ∙ x + b = +1) and x2 (on the plane w ∙ x + b = −1)? x1 = x2 + 2γ (w / ‖w‖).
- We also know: w ∙ x1 + b = +1 and w ∙ x2 + b = −1.
- Substituting: w ∙ (x2 + 2γ w/‖w‖) + b = −1 + 2γ ‖w‖ = +1, so γ = 1 / ‖w‖.
- Note: ‖w‖ = √(w ∙ w).

20 Maximizing the Margin
- We started with: max_{w,γ} γ subject to ∀i, yi (w ∙ xi + b) ≥ γ. But w can be arbitrarily large!
- We normalized (γ = 1/‖w‖) and obtained: max γ = max 1/‖w‖ = min ‖w‖ = min ½‖w‖².
- Then: min_w ½‖w‖² subject to ∀i, yi (w ∙ xi + b) ≥ 1.
- This is called the SVM with "hard" constraints.

21 Non-linearly Separable Data
- If the data is not separable, introduce a penalty: minimize ½‖w‖² + C · (number of training mistakes).
- Set C using cross-validation.
- How should we penalize mistakes? All mistakes are not equally bad!

22 Support Vector Machines
- Introduce slack variables ξi: if point xi is on the wrong side of the margin, it pays penalty ξi.
- min_{w,b,ξi≥0} ½‖w‖² + C Σi ξi  subject to  ∀i, yi (w ∙ xi + b) ≥ 1 − ξi
- For each data point: if its margin is ≥ 1, don't care; if its margin is < 1, pay a linear penalty.

23 What is the role of the slack penalty C?
- C = ∞: only want w, b that separate the data (hard margin).
- C = 0: can set ξi to anything, then w = 0 (basically ignores the data).
- [Figure: decision boundaries for small C, a "good" C, and big C]

24 Support Vector Machines
SVM in the "natural" form:
argmin_{w,b} ½ ‖w‖² + C Σi max{0, 1 − yi (w ∙ xi + b)}
- The first term controls the margin (regularization); C is the regularization parameter.
- The second term is the empirical loss L (how well we fit the training data).
- SVM uses the "hinge loss" max{0, 1 − z}, a convex surrogate for the 0/1 loss penalty.
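
As a sketch of this objective (my own code, mirroring the formula above with made-up data), the regularized hinge-loss value can be computed directly:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w, b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * w @ w + C * hinge.sum()

# Tiny illustrative example with made-up numbers.
X = np.array([[2.0, 1.0], [-1.0, -1.0], [0.2, 0.1]])
y = np.array([+1, -1, +1])
w = np.array([1.0, 1.0])
b = 0.0
print(svm_objective(w, b, X, y, C=1.0))   # 1.7 for these values
```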

25 SVM: How to estimate w? Want to estimate w and b!
- Standard way: use a solver! A solver is software for finding solutions to "common" optimization problems.
- Use a quadratic solver: minimize a quadratic function subject to linear constraints.
- Problem: solvers are inefficient for big data!
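
For modest dataset sizes, an off-the-shelf solver-backed implementation is the usual route. Below is a minimal sketch using scikit-learn's SVC, which solves the quadratic dual problem internally; the synthetic data and parameter values are placeholders for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 2-class data (placeholder for a real dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)),
               rng.normal(-2, 1, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Linear SVM; the quadratic optimization is handled by the library's solver.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_.shape)   # the support vectors found by the solver
print(clf.score(X, y))              # training accuracy
```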

26 SVM: How to estimate w? Want to estimate w, b! Alternative approach:
- Want to minimize f(w,b) = ½ Σj w(j)² + C Σi max{0, 1 − yi (w ∙ xi + b)}
- Side note: how do we minimize a convex function g(z)? Use gradient descent: to solve min_z g(z), iterate z_{t+1} ← z_t − η ∇g(z_t)

27 SVM: How to estimate w? Want to minimize f(w,b):
Compute the gradient ∇f(j) with respect to w(j):
∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
where the empirical-loss term is ∂L(xi, yi)/∂w(j) = 0 if yi (w ∙ xi + b) ≥ 1, and −yi xi(j) otherwise.

28 SVM: How to estimate w?
Gradient descent:
- Iterate until convergence: for j = 1 … d, evaluate ∇f(j), then update w(j) ← w(j) − η ∇f(j)
- η … learning rate parameter; C … regularization parameter
Problem: computing ∇f(j) takes O(n) time, where n is the size of the training dataset!
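
A minimal full-batch gradient-descent sketch for this objective; the learning rate and iteration count are placeholder values you would tune, and the vectorized gradient matches the per-coordinate formula above.

```python
import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.01, iters=1000):
    """Full-batch gradient descent on f(w,b) = 1/2||w||^2 + C * sum_i hinge_i."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                       # examples with non-zero hinge loss
        # Gradient of the objective w.r.t. w and b (costs O(n) per iteration).
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```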

29 SVM: How to estimate w? Stochastic Gradient Descent
- We just had: ∇f(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j)
- Stochastic gradient descent: instead of evaluating the gradient over all examples, evaluate it for each individual training example: ∇f(j)(xi) = w(j) + C ∂L(xi, yi)/∂w(j)
- Notice: no summation over i anymore
- Iterate until convergence: for i = 1 … n, for j = 1 … d: compute ∇f(j)(xi) and update w(j) ← w(j) − η ∇f(j)(xi)
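
The corresponding stochastic update, sketched under the same assumptions (one example per update, a fixed learning rate for simplicity, random visiting order):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
    """SGD for the hinge-loss SVM: one (x_i, y_i) per update, no sum over i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):            # visit examples in random order
            if y[i] * (X[i] @ w + b) < 1:       # hinge loss is active for x_i
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])
            else:                               # only the regularizer contributes
                w -= eta * w
    return w, b
```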

30 Making SVMs work with Big Data

31 Optimization problem set-up
In its dual form, the SVM training problem is:
maximize Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)  subject to  0 ≤ αi ≤ C and Σi αi yi = 0
where K(∙, ∙) is the kernel function. This is a good optimization problem for computers to solve, called quadratic programming.

32 Problems with SVMs on Big Data
- SVM complexity via quadratic programming (QP): time O(m³), space O(m²), where m is the size of the training data.
- Two types of approaches:
- Modify the SVM algorithm to work with large datasets
- Select representative training data and use a normal SVM

33 Reducing the Training Set
How do we reduce the size of the training set so that we can reduce the training time?
- Combine (a large number of) "small" SVMs to obtain the final SVM (see the sketch below)
- Reduced SVM: use a random rectangular subset of the kernel matrix
- All of these techniques only find an approximation to the optimal solution by an iterative approach; why not exploit this?
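
One simple way to read "combine many small SVMs" is a bagging-style sketch like the one below (my own illustration, not the specific reduced-SVM algorithm from the literature): train linear SVMs on random subsets of the training data, then average their decision values.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bagged_linear_svms(X, y, n_models=10, subset_size=1000, seed=0):
    """Train several small linear SVMs on random subsets of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
        models.append(LinearSVC(C=1.0).fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    # Average the decision values of the small SVMs, then take the sign.
    scores = np.mean([m.decision_function(X) for m in models], axis=0)
    return np.where(scores >= 0, 1, -1)
```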

34 Problematic datasets: even in 1D!
How will an SVM handle this dataset? [Figure: a 1-D dataset that is not linearly separable]

35 The Kernel Trick: Using higher dimensions to separate data
- Project the data into a higher-dimensional space
- Then find the separating hyperplane in that space
- This is a common trick in ML, used by many algorithms
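
A concrete 1-D illustration (my own toy example, not from the slides): points near the origin belong to one class and points far from it to the other, so no single threshold on x separates them; mapping x to the higher-dimensional feature z = (x, x²) makes them linearly separable.

```python
import numpy as np
from sklearn.svm import LinearSVC

# 1-D dataset that no single threshold can separate:
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1,   +1,   -1,  -1,  -1,  +1,  +1])   # -1 in the middle, +1 outside

# Project into 2-D with the feature map z = (x, x^2).
Z = np.column_stack([x, x ** 2])

clf = LinearSVC(C=1.0).fit(Z, y)
print(clf.score(Z, y))   # should be 1.0: linearly separable in the projected space
```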

36 Commonly used SVM basis functions
- zk = (polynomial terms of xk of degree 1 to q)
- zk = (radial basis functions of xk)
- zk = (sigmoid functions of xk)
These and many other functions are valid choices.

37 How to tackle training set sizes?
We need:
- An efficient and effective method for selecting a "working" set
- To "shrink" the optimization problem: there are far fewer support vectors than training examples, and many support vectors have αi at the upper bound C
- Computational improvements, including caching and incremental updates of the gradient

38 What does this algorithm look like?
In each iteration of the optimization, split the αi into two categories:
- a set B of free variables, updated in the current iteration
- a set N of fixed variables, temporarily fixed in the current iteration
Then divide and solve the optimization. While the optimality conditions are violated:
- Select q variables for the working set B; the remaining l − q variables form the fixed set N
- Solve the QP sub-problem of W(α) on B
[Joachims, T., Making Large-Scale SVM Learning Practical]

39 Now, how do we select a good "working" set?
- Select the set of variables such that the current iteration makes progress towards the minimum of W(α)
- Use a first-order approximation, i.e., the steepest descent direction d that has only q non-zero elements
- Convergence: terminate only when the optimal solution is found; if it has not been found, take a step towards the optimum

40 Other techniques to identify appropriate (and reduced) training sets
- Minimum enclosing ball (MEB) for selecting "core" sets; after the core sets are selected, solve the same optimization problem
- Complexity: time O(m/ε² + 1/ε⁴), space O(1/ε⁸)
- [Figure: minimum enclosing ball of radius R with an εR approximation]
[Tsang, I.W., Kwok, J.T., and Cheung, P.-M., JMLR 6 (2005)]

41 Incremental (and Decremental) SVM Learning
Solving this QP formulation yields optimality (KKT) conditions on the αi; incremental/decremental learning maintains these conditions as training points are added or removed.

42 Example of an SVM working with Big Datasets
Example by Léon Bottou: Reuters RCV1 document corpus
- Task: predict the category of a document (one vs. the rest classification)
- m = 781,000 training examples (documents); 23,000 test examples
- d = 50,000 features: one feature per word, after removing stop-words and low-frequency words

43 Text categorization
Questions:
(1) Is SGD successful at minimizing f(w,b)?
(2) How quickly does SGD find the minimum of f(w,b)?
(3) What is the error on a test set?
[Table: training time, value of f(w,b), and test error for a standard SVM, a "fast SVM", and SGD-SVM]
Findings:
(1) SGD-SVM is successful at minimizing the value of f(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM's test-set error is comparable

44 Optimization "Accuracy"
- Optimization quality: |f(w,b) − f(wopt, bopt)|
- [Figure: optimization quality vs. training time for SGD-SVM and a conventional SVM]
- For optimizing f(w,b) to within reasonable quality, SGD-SVM is super fast

45 SGD vs. Batch Conjugate Gradient
- SGD on the full dataset vs. conjugate gradient (CG) on a sample of n training examples
- Theory says: gradient descent converges in time proportional to 𝒌, while conjugate gradient converges in time proportional to √𝒌 (𝒌 … condition number)
- Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times

46 Practical Considerations
Need to choose the learning rate η and t0 (e.g., in a schedule such as ηt = η0 / (t + t0)). Léon suggests:
- Choose t0 so that the expected initial updates are comparable with the expected size of the weights
- Choose η: select a small subsample, try various rates η (e.g., 10, 1, 0.1, 0.01, …), and pick the one that most reduces the cost
- Use that η for the next 100k iterations on the full dataset
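
A sketch of this tuning recipe; the candidate rates, subsample size, and the train_step/objective callables are stand-ins for whatever SGD trainer and cost function you use (for example, the svm_sgd and svm_objective sketches above).

```python
import numpy as np

def pick_learning_rate(train_step, objective, X, y,
                       rates=(10, 1, 0.1, 0.01, 0.001),
                       subset_size=1000, seed=0):
    """Try several learning rates on a small subsample; keep the one that most
    reduces the cost, as suggested on the slide."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in rates:
        w, b = train_step(Xs, ys, eta=eta)      # e.g. a few SGD epochs at rate eta
        cost = objective(w, b, Xs, ys)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta
```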

47 Advanced Topics…

48 Sparse Linear SVMs
- The feature vector xi is sparse (contains many zeros)
- Do not store xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]; instead represent xi as a sparse vector, e.g., xi = [(4,1), (9,5), …]
- Can we do the SGD update w ← w − η(w + C ∂L(xi,yi)/∂w) more efficiently? Approximate it in two steps:
- (1) w ← w − η C ∂L(xi,yi)/∂w (cheap: xi is sparse, so only a few coordinates j of w are updated)
- (2) w ← w (1 − η) (expensive: w is not sparse, so all coordinates need to be updated)
- The two-step update is not equivalent to the original! Why does it work? Because η is close to 0; the difference is an extra η² term, which is of lower order

49 Sparse Linear SVMs: Practical Considerations
Two-step update procedure: (1) w ← w − η C ∂L(xi,yi)/∂w; (2) w ← w (1 − η)
- Solution 1: represent the vector w as the product of a scalar s and a vector v. Then step (1) updates only the few coordinates of v touched by the sparse gradient (scaled by 1/s), and step (2) becomes the single scalar update s ← s (1 − η)
- Solution 2: perform only step (1) for each training example; perform step (2) with lower frequency and higher η
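
A minimal sketch of Solution 1: representing w as s·v so that the shrinkage step costs O(1). The dict-based sparse format for gradients and examples is an assumption made for illustration.

```python
import numpy as np

class ScaledVector:
    """Represent w = s * v so that w <- (1 - eta) * w touches only the scalar s."""
    def __init__(self, dim):
        self.s = 1.0
        self.v = np.zeros(dim)

    def sparse_grad_step(self, sparse_grad, eta, C):
        # Step (1): w <- w - eta*C*grad, with grad sparse: {index: value}.
        # In terms of v: v[j] <- v[j] - eta*C*grad[j] / s  (only a few coordinates).
        for j, g in sparse_grad.items():
            self.v[j] -= eta * C * g / self.s

    def shrink(self, eta):
        # Step (2): w <- (1 - eta) * w is a single scalar multiplication.
        self.s *= (1.0 - eta)

    def dot(self, sparse_x):
        # w . x for a sparse x given as {index: value}.
        return self.s * sum(self.v[j] * xj for j, xj in sparse_x.items())
```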

50 Practical Considerations
Stopping criteria: how many iterations of SGD?
- Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing
- Early stopping: extract two disjoint subsamples A and B of the training data; train on A and stop by validating on B; the number of epochs is an estimate of k; then train for k epochs on the full dataset
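
A sketch of early stopping against a held-out set; the callable interface and the patience value are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def train_with_early_stopping(run_epoch, validation_loss, max_epochs=100, patience=3):
    """run_epoch(): performs one SGD epoch on subsample A (updates the model in place).
    validation_loss(): evaluates the current model on held-out subsample B.
    Stops when the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch   # an estimate of k; then retrain for k epochs on all data
```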

51 What about multiple classes?
Idea 1: one against all. Learn 3 classifiers:
- + vs. {o, −}
- − vs. {o, +}
- o vs. {+, −}
Obtain: w+ b+, w− b−, wo bo
How to classify? Return the class c = arg maxc wc ∙ x + bc
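
A sketch of the one-against-all decision rule; the per-class weights below are placeholder values, whereas in practice each (wc, bc) would come from training class c against the rest.

```python
import numpy as np

def one_vs_all_predict(W, b, X):
    """W: (n_classes, d) matrix of per-class weights, b: (n_classes,) biases.
    Return arg max_c (w_c . x + b_c) for each row of X."""
    scores = X @ W.T + b          # shape (n_samples, n_classes)
    return np.argmax(scores, axis=1)

# Illustrative placeholder weights for 3 classes in 2 dimensions.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
b = np.zeros(3)
X = np.array([[2.0, 0.1], [-3.0, 0.2], [0.1, 4.0]])
print(one_vs_all_predict(W, b, X))   # [0 1 2]
```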

52 Multiclass SVM: Learn 1 classifier
Idea 2: learn 3 sets of weights simultaneously!
- For each class c, estimate wc, bc
- Want the correct class to have the highest margin: w_yi ∙ xi + b_yi ≥ 1 + wc ∙ xi + bc for all c ≠ yi and all (xi, yi)

53 Multiclass SVM
Optimization problem:
min_{w,b,ξ≥0} ½ Σc ‖wc‖² + C Σi ξi  subject to  w_yi ∙ xi + b_yi ≥ wc ∙ xi + bc + 1 − ξi for all c ≠ yi
To obtain the parameters wc, bc (for each class c), we can use techniques similar to those for the 2-class SVM.

54 SVM: What you must know
- One of the most successful ML algorithms
- Modifications for handling big datasets: reduce the training set ("core sets"), modify the SVM training algorithm, or use incremental algorithms
- Multi-class modifications are more complex

