Classification & Regression Part II COSC 526, Class 8. Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge. Ph: 865-576-7266, E-mail: ramanathana@ornl.gov
Last Class We saw different techniques for handling large-scale classification: decision trees and k-nearest neighbors (k-NN). We modified decision trees to handle streaming datasets, and for k-NN we discussed efficient data structures for large-scale datasets.
This class… More classification… Support vector machines: Basic outline of the algorithm Modifications for large-scale datasets Online approaches
Class Logistics
10% Grade… 40% – assignments (2 instead of 3), 50% – class project. What do we do with the 5%? Class participation: select papers (updated on the class website) and present your take on them (critique). 2 presentations per class – 15 minutes each, starting Tue (Feb 17).
Critique Structure 2-3 slides: what is the paper all about? 2-4 slides: key results presented. 2-3 slides: key limitations/shortcomings. 2-3 slides: what could have been done better? A more discussion-oriented presentation rather than an in-depth review of the paper. Peer reviews of critiques are due every week…
Support Vector Machines (SVM)
Support Vector Machines (SVM) Easiest explanation: find the “best linear separator” for a dataset. Training examples: {(x1, y1), (x2, y2), …, (xn, yn)}. Each data point xi = (xi(1), xi(2), …, xi(d)), with yi ∈ {-1, +1}. In higher-dimensional datasets we want to find the “best hyperplane”. (Figure: classifier x → f(x, w) → y.)
Classifier Margin… Margin: the width that the boundary could be increased by before hitting a data point. We are interested in the maximum margin: the simplest case is the maximum-margin SVM classifier. The data points the margin pushes up against are the support vectors.
Why are we interested in the Max Margin SVM? Intuitively this feels right: we want the maximum margin that still fits the data correctly. It is robust to the location of the boundary: even if we have made a small error in the location of the boundary, there is not much effect on classification. The model obtained is tolerant to removal of any non-support-vector data points. Validation works well, and it works very well in practice.
Why is the maximum margin a good thing? Theoretical convenience, and the existence of generalization error bounds that depend on the value of the margin.
Why is maximizing 𝜸 a good idea? Recall what the dot product means: A · B = ‖A‖ ‖B‖ cos θ. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Why is maximizing 𝜸 a good idea? Dot product: A · B = ‖A‖ ‖B‖ cos θ. What is w · x1 + b, w · x2 + b? The value of w · x + b roughly corresponds to the margin: the bigger |w · x + b|, the bigger the separation from the hyperplane w · x + b = 0. (Figure: points x1, x2 on either side of the hyperplane, shown once with a small margin 𝜸1 and once with a larger margin 𝜸2.)
What is the margin? Distance from a point to a line. Let: line L: w · x + b = w(1)x(1) + w(2)x(2) + b = 0, with w = (w(1), w(2)); point A = (xA(1), xA(2)); point M on the line L = (xM(1), xM(2)). Note: we assume ‖w‖ = 1. Then
d(A, L) = |AH| = |(A − M) · w| = |(xA(1) − xM(1)) w(1) + (xA(2) − xM(2)) w(2)| = xA(1) w(1) + xA(2) w(2) + b = w · A + b.
Remember xM(1) w(1) + xM(2) w(2) = −b, since M belongs to line L. (Figure: point A, its projection H onto L, and the normal vector w.)
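A minimal numpy sketch of this distance formula (purely illustrative, not part of the slides); dividing by ‖w‖ covers the general case and reduces to w · A + b when ‖w‖ = 1:

```python
import numpy as np

def distance_to_hyperplane(A, w, b):
    """Signed distance from point A to the hyperplane w . x + b = 0.

    When ||w|| = 1 (as assumed on the slide), this is simply w . A + b."""
    return (np.dot(w, A) + b) / np.linalg.norm(w)

# Example: line x(1) + x(2) - 1 = 0 and point A = (1, 1): distance = 1/sqrt(2)
print(distance_to_hyperplane(np.array([1.0, 1.0]), np.array([1.0, 1.0]), -1.0))
```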
Largest Margin For the i-th data point, the margin is γi = yi (w · xi + b). Want to solve: maximize γ over w, γ subject to yi (w · xi + b) ≥ γ for all i. (Figure: separating hyperplane w · x + b = 0 with + and − examples on either side.)
Support Vector Machine Maximize the margin: good according to intuition, theory (VC dimension) & practice. 𝜸 is the margin – the distance from the separating hyperplane w · x + b = 0 to the closest training example: maximize γ over w, γ subject to yi (w · xi + b) ≥ γ for all i. (Figure: maximizing the margin around the separating hyperplane.)
How do we derive the margin? The separating hyperplane is defined by the support vectors – the points that lie on the +1/−1 planes of the solution. If you knew these points, you could ignore the rest. Generally there are d + 1 support vectors (for d-dimensional data).
Canonical Hyperplane: Problem Let γ = yi (w · xi + b). Now, scaling w increases the margin: γ can be made arbitrarily large without changing the hyperplane w · x + b = 0. Solution: work with a normalized w, i.e., γ = yi ((w/‖w‖) · xi + b). Also require the support vectors xj to lie on the planes defined by w · xj + b = ±1 (that is, w · x + b = +1 and w · x + b = −1).
Canonical Hyperplane: Solution Want to maximize the margin γ. What is the relation between x1 and x2 (support vectors on the +1 and −1 planes)? x1 = x2 + 2γ (w/‖w‖). We also know: w · x1 + b = +1 and w · x2 + b = −1. So: w · (x2 + 2γ w/‖w‖) + b = +1, i.e., (w · x2 + b) + 2γ (w · w)/‖w‖ = +1, which gives γ = ‖w‖ / (w · w) = 1/‖w‖. Note: w · w = ‖w‖².
Maximizing the Margin We started with: maximize γ over w, γ subject to yi (w · xi + b) ≥ γ for all i. But w can be arbitrarily large! We normalized, and: arg max γ = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½‖w‖². Then: minimize ½‖w‖² over w subject to yi (w · xi + b) ≥ 1 for all i. This is called SVM with “hard” constraints.
Non-linearly Separable Data If the data is not separable, introduce a penalty: minimize ½‖w‖² + C · (number of training mistakes). Set C using cross-validation. How should we penalize mistakes? All mistakes are not equally bad! (Figure: a few + and − examples on the wrong side of the boundary w · x + b = 0.)
Support Vector Machines Introduce slack variables ξi ≥ 0: if point xi is on the wrong side of the margin, it picks up a penalty ξi. Minimize ½‖w‖² + C Σi ξi over w, b, ξi ≥ 0, subject to yi (w · xi + b) ≥ 1 − ξi for all i. For each data point: if the margin is ≥ 1, don’t care; if the margin is < 1, pay a linear penalty. (Figure: points inside the margin with slacks ξi, ξj.)
What is the role of the slack penalty C? C = ∞: we only want w, b that separate the data (hard margin). C = 0: we can set ξi to anything, so w = 0 (basically ignores the data). (Figure: decision boundaries for small C, a “good” C, and big C.)
Support Vector Machines SVM in the “natural” form: arg min over w, b of ½ w · w + C Σi max{0, 1 − yi (w · xi + b)}. The first term is the margin (regularization), the second is the empirical loss L (how well we fit the training data), and C is the regularization parameter. SVM uses the “hinge loss”: max{0, 1 − z}, where z = yi (w · xi + b). (Figure: 0/1 loss vs. hinge loss as a function of z.)
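A short numpy sketch of the hinge loss and the “natural form” objective (the helper names are my own, not from the slides):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Per-example hinge loss max{0, 1 - y_i (w . x_i + b)}, averaged over the set."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()

def svm_objective(w, b, X, y, C):
    """The 'natural form' objective: (1/2) w . w + C * sum_i max{0, 1 - y_i (w . x_i + b)}."""
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.maximum(0.0, 1.0 - margins).sum()
```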
SVM: How to estimate w? We want to estimate w and b! Standard way: use a solver! A solver is software for finding solutions to “common” optimization problems. Here we use a quadratic solver: minimize a quadratic function subject to linear constraints. Problem: solvers are inefficient for big data!
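As an illustration of the “use a quadratic solver” route, here is a sketch of the soft-margin primal QP using the cvxpy library (cvxpy is my choice, not the slides’; any QP solver would do). On large m this is exactly where the inefficiency shows up:

```python
import numpy as np
import cvxpy as cp

def svm_qp(X, y, C=1.0):
    """Soft-margin SVM solved directly as a quadratic program (primal form)."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)                      # slack variables
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]   # y_i (w . x_i + b) >= 1 - xi_i
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```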
SVM: How to estimate w? We want to estimate w, b! Alternative approach: minimize f(w, b) = ½ Σj w(j)² + C Σi max{0, 1 − yi (Σj w(j) xi(j) + b)}. Side note: how do we minimize convex functions g(z)? Use gradient descent: minz g(z); iterate zt+1 ← zt − η ∇g(zt). (Figure: a convex function g(z) with gradient steps toward its minimum.)
SVM: How to estimate w? We want to minimize f(w, b) = ½ Σj w(j)² + C Σi L(xi, yi), where L(xi, yi) = max{0, 1 − yi (w · xi + b)} is the empirical loss. Compute the gradient ∇f(j) with respect to w(j): ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j), where ∂L(xi, yi)/∂w(j) = 0 if yi (w · xi + b) ≥ 1, and −yi xi(j) otherwise.
SVM: How to estimate w? Gradient descent: iterate until convergence; for j = 1 … d, evaluate ∇f(j) and update w(j) ← w(j) − η ∇f(j). Here η is the learning rate parameter and C is the regularization parameter. Problem: computing ∇f(j) takes O(n) time, where n is the size of the training dataset!
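A full-batch (sub)gradient descent sketch (illustrative; names are mine). Each iteration touches all n examples, which is the O(n) cost per gradient highlighted above:

```python
import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.01, iters=1000):
    """Batch (sub)gradient descent on f(w, b); each step costs O(n * d)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                          # examples violating the margin
        grad_w = w - C * (X[active].T @ y[active])    # w(j) + C * sum_i dL/dw(j)
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```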
SVM: How to estimate w? Stochastic Gradient Descent We just had: ∇f(j) = w(j) + C Σi ∂L(xi, yi)/∂w(j). Instead of evaluating the gradient over all examples, evaluate it for each individual training example: ∇f(j)(xi) = w(j) + C ∂L(xi, yi)/∂w(j). Notice: no summation over i anymore. Stochastic gradient descent: iterate until convergence; for i = 1 … n, for j = 1 … d: compute ∇f(j)(xi) and update w(j) ← w(j) − η ∇f(j)(xi).
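A matching SGD sketch (illustrative; the per-example gradient is the one defined on the slide, with the full regularizer w in every update):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=10):
    """Stochastic (sub)gradient descent: one update per training example."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):            # one pass over the data = one epoch
            if y[i] * (X[i] @ w + b) < 1:             # hinge loss is active for this example
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:                                     # only the regularizer contributes
                w -= eta * w
    return w, b
```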
Making SVMs work with Big Data
Optimization problem set-up: in its dual form, the SVM training problem is to maximize W(α) = Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj), subject to 0 ≤ αi ≤ C and Σi αi yi = 0, where K(xi, xj) is the kernel function. This is a good optimization problem for computers to solve, called quadratic programming.
Problems with SVMs on Big Data SVM complexity: quadratic programming (QP) – time O(m³), space O(m²), where m is the size of the training data. Two types of approaches: modify the SVM algorithm to work with large datasets, or select representative training data and use a normal SVM.
Reducing the Training Set How do we reduce the size of the training set so that we can reduce the training time? One option: combine (a large number of) ‘small’ SVMs together to obtain the final SVM, as sketched below. Reduced SVM: use a random rectangular subset of the kernel matrix. All of these techniques only find an approximation to the optimal solution by an iterative approach – why not exploit this?
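One crude way to realize the “combine many small SVMs” idea, sketched with scikit-learn (an assumption; the combination schemes in the literature, e.g. cascade SVMs, are more careful than a plain majority vote):

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm_ensemble(X, y, n_chunks=10, C=1.0):
    """Train one small SVM per random chunk of the data; predict by majority vote.
    Labels y are assumed to be in {-1, +1}."""
    idx = np.random.permutation(len(X))
    models = [SVC(kernel="linear", C=C).fit(X[chunk], y[chunk])
              for chunk in np.array_split(idx, n_chunks)]

    def predict(X_new):
        votes = np.stack([m.predict(X_new) for m in models])   # shape (n_chunks, n_new)
        return np.sign(votes.sum(axis=0))                      # majority vote (0 on ties)
    return predict
```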
Problematic datasets: even in 1D! How will an SVM handle this dataset?
The Kernel Trick: Using higher dimensions to separate data Project the data into a higher dimension, then find the separation in this space. This is a common trick in ML for many algorithms; a small example is sketched below.
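A tiny illustration on a hypothetical 1-D dataset like the one on the previous slide (the data values here are made up): mapping x → (x, x²) makes the classes linearly separable in 2-D.

```python
import numpy as np

# Hypothetical 1-D dataset: one class in the middle, the other on both sides
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([  1,    1,   -1,  -1,  -1,   1,   1 ])

# Project into 2-D: z = (x, x^2)
Z = np.column_stack([x, x ** 2])

# In the projected space the classes are separated by the horizontal line x^2 = 2:
# every +1 example has x^2 >= 4, every -1 example has x^2 <= 0.25.
print(Z)
```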
Commonly used SVM basis functions: zk = (polynomial terms of xk of degree 1 to q), zk = (radial basis functions of xk), zk = (sigmoid functions of xk). These and many other functions are valid.
How to tackle training set sizes? We need: an efficient and effective method for selecting a “working” set; a way to “shrink” the optimization problem (there are many fewer support vectors than training examples, and many support vectors have an αi at the upper bound C); and computational improvements, including caching and incremental updates of the gradient.
What does this algorithm look like? In each iteration of the optimization, we split the αi into two categories: a set B of free variables, updated in the current iteration, and a set N of fixed variables, temporarily fixed in the current iteration. Then divide and solve the optimization: while the optimality constraints are violated, select q variables for the working set B (the remaining ℓ − q variables form the fixed set N) and solve the QP sub-problem of W(α) on B. Joachims, T., Making large-scale SVM practical, Large-scale Machine Learning
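A sketch of one decomposition step, with a generic constrained solver (scipy’s SLSQP) standing in for the specialized QP code and the working-set selection of the next slide left out. Here Q[i, j] = yi yj K(xi, xj), and W(α) is maximized by minimizing its negation:

```python
import numpy as np
from scipy.optimize import minimize

def solve_working_set(Q, alpha, y, B, C):
    """Solve the QP sub-problem over the working set B, keeping alpha_N fixed."""
    B = np.asarray(B)
    N = np.setdiff1d(np.arange(len(alpha)), B)
    lin = Q[np.ix_(B, N)] @ alpha[N] - 1.0                 # linear term induced by the fixed set
    target = -alpha[N] @ y[N]                              # preserve sum_i alpha_i y_i = 0

    def neg_W(a_B):                                        # -W(alpha) restricted to alpha_B
        return 0.5 * a_B @ Q[np.ix_(B, B)] @ a_B + lin @ a_B

    res = minimize(neg_W, alpha[B], method="SLSQP",
                   bounds=[(0.0, C)] * len(B),
                   constraints=[{"type": "eq", "fun": lambda a_B: a_B @ y[B] - target}])
    new_alpha = alpha.copy()
    new_alpha[B] = res.x
    return new_alpha
```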
Now, how do we select a good “working” set? Select the set of variables such that the current iteration will make progress towards the minimum of W(α). Use a first-order approximation, i.e., the steepest descent direction d that has only q non-zero elements. Convergence: terminate only when the optimal solution is found; if not, take a step towards the optimum…
Other techniques to identify appropriate (and reduced) training sets: the minimum enclosing ball (MEB) for selecting “core” sets; after the core sets are selected, solve the same optimization problem… Complexity – time: O(m/ε² + 1/ε⁴), space: O(1/ε⁸). (Figure: minimum enclosing ball of radius R, expanded by εR.) Tsang, I.W., Kwok, J.T., and Cheung, P.-M., JMLR (6): 363-392 (2005).
Incremental (and Decremental) SVM Learning Solving this QP formulation, we find that the optimality (KKT) conditions partition the training points into margin support vectors (0 < αi < C), error vectors (αi = C), and ignored points (αi = 0); incremental and decremental learning adjust the αi to keep these conditions satisfied as examples are added or removed.
Example of an SVM working with Big Datasets Example by Léon Bottou: the Reuters RCV1 document corpus. Predict the category of a document (one vs. the rest classification). m = 781,000 training examples (documents), 23,000 test examples, d = 50,000 features – one feature per word, after removing stop-words and low-frequency words.
Text categorization Questions: (1) Is SGD successful at minimizing f(w,b)? (2) How quickly does SGD find the minimum of f(w,b)? (3) What is the error on a test set? (Table: training time, value of f(w,b), and test error for a standard SVM, a “fast SVM”, and SGD-SVM.) Findings: (1) SGD-SVM is successful at minimizing the value of f(w,b); (2) SGD-SVM is super fast; (3) SGD-SVM’s test-set error is comparable.
Optimization “Accuracy” Optimization quality: |f(w,b) − f(wopt, bopt)|. (Figure: optimization quality vs. training time for SGD-SVM and a conventional SVM.) For optimizing f(w,b) to within reasonable quality, SGD-SVM is super fast.
SGD vs. Batch Conjugate Gradient SGD on the full dataset vs. conjugate gradient on a sample of n training examples. Theory says: gradient descent converges in time linear in 𝒌, while conjugate gradient converges in √𝒌 (𝒌 … condition number). Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.
Practical Considerations Need to choose the learning rate η and t0 (e.g., ηt = η0/(t + t0)). Léon Bottou suggests: choose t0 so that the expected initial updates are comparable with the expected size of the weights. Choose η0: select a small subsample, try various rates (e.g., 10, 1, 0.1, 0.01, …), pick the one that most reduces the cost, and use it for the next 100k iterations on the full dataset.
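A sketch of that learning-rate search, reusing the svm_sgd and svm_objective sketches from above (illustrative only):

```python
import numpy as np

def pick_learning_rate(X, y, rates=(10, 1, 0.1, 0.01, 0.001),
                       subsample=1000, C=1.0):
    """Try several rates on a small subsample; keep the one that most reduces the cost."""
    idx = np.random.choice(len(X), size=min(subsample, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_rate, best_cost = None, np.inf
    for eta in rates:
        w, b = svm_sgd(Xs, ys, C=C, eta=eta, epochs=1)     # SGD sketch from earlier
        cost = svm_objective(w, b, Xs, ys, C)              # objective sketch from earlier
        if cost < best_cost:
            best_rate, best_cost = eta, cost
    return best_rate                                       # then use it on the full dataset
```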
Advanced Topics…
Sparse Linear SVMs The feature vector xi is sparse (contains many zeros). Do not store xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]; instead represent xi as a sparse vector xi = [(4,1), (9,5), …]. Can we do the SGD update w ← w(1 − η) − ηC ∂L(xi, yi)/∂w more efficiently? Approximate it in 2 steps: (1) w ← w − ηC ∂L(xi, yi)/∂w – cheap: xi is sparse, so only a few coordinates j of w are updated; (2) w ← w(1 − η) – expensive: w is not sparse, so all coordinates need to be updated. The two-step update is not equivalent to the original update! Why does it work? Since η is close to 0, the difference is a single extra η² term, which is of low order.
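A sketch of the two-step update on a sparse example stored as [(index, value), …] (bias term omitted for brevity; names are mine). Doing step (1) before step (2) is what introduces the extra η² term mentioned above:

```python
import numpy as np

def sparse_sgd_step(w, x_sparse, y_i, eta, C):
    """One two-step SGD update with a sparse example x_i given as [(index, value), ...]."""
    margin = y_i * sum(w[j] * v for j, v in x_sparse)   # only nonzero coordinates contribute
    # Step (1): hinge-loss part of the update -- touches only the nonzero coordinates (cheap)
    if margin < 1:
        for j, v in x_sparse:
            w[j] += eta * C * y_i * v
    # Step (2): regularization shrink -- touches every coordinate of w (expensive)
    w *= (1 - eta)
    return w
```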
Sparse Linear SVMs: Practical Considerations Two-step update procedure: (1) w ← w − ηC ∂L(xi, yi)/∂w; (2) w ← w(1 − η). Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s · v. Then the update procedure becomes: (1) v ← v − (ηC/s) ∂L(xi, yi)/∂w; (2) s ← s(1 − η). Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a higher η.
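Solution 1 as code: the dense shrink becomes a single scalar multiplication when w is stored as s · v (again a sketch with my own names):

```python
def scaled_sparse_sgd_step(s, v, x_sparse, y_i, eta, C):
    """One update with w represented as w = s * v; v is a numpy array, s a scalar."""
    margin = y_i * s * sum(v[j] * val for j, val in x_sparse)
    # Step (1): sparse hinge update on v (divide by s so that s * v changes by eta * C * y_i * x_i)
    if margin < 1:
        for j, val in x_sparse:
            v[j] += eta * C * y_i * val / s
    # Step (2): the shrink w <- w (1 - eta) is now O(1): scale s only
    s *= (1 - eta)
    return s, v
```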
Practical Considerations Stopping criteria: how many iterations of SGD? Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing. Alternatively: extract two disjoint subsamples A and B of the training data, train on A and stop by validating on B; the number of epochs used is an estimate of k; then train for k epochs on the full dataset.
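A sketch of the train-on-A / validate-on-B recipe, reusing the hinge_loss helper from earlier (illustrative; the caller then retrains for the returned number of epochs on the full dataset):

```python
import numpy as np

def epochs_by_early_stopping(X, y, C=1.0, eta=0.01, max_epochs=100, patience=3):
    """Run SGD on subsample A, monitor the loss on subsample B, return the epoch count k."""
    n, d = X.shape
    split = int(0.8 * n)                                   # A = first 80%, B = remaining 20%
    w, b = np.zeros(d), 0.0
    best_loss, best_epoch, bad = np.inf, 0, 0
    for epoch in range(1, max_epochs + 1):
        for i in np.random.permutation(split):             # one epoch of SGD on A
            if y[i] * (X[i] @ w + b) < 1:
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:
                w -= eta * w
        val_loss = hinge_loss(w, b, X[split:], y[split:])  # monitor the loss on B
        if val_loss < best_loss:
            best_loss, best_epoch, bad = val_loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:                            # loss stopped decreasing
                break
    return best_epoch
```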
What about multiple classes? Idea 1: one against all. Learn 3 classifiers: + vs. {o, −}, − vs. {o, +}, o vs. {+, −}, obtaining (w+, b+), (w−, b−), (wo, bo). How to classify? Return the class c = arg maxc (wc · x + bc).
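Idea 1 as a sketch, with scikit-learn’s LinearSVC standing in for the three classifiers (an assumption, not the slides’ code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all(X, y, classes=("+", "-", "o"), C=1.0):
    """Train one 'class c vs. the rest' classifier per class; predict arg max_c w_c . x + b_c."""
    models = {c: LinearSVC(C=C).fit(X, np.where(y == c, 1, -1)) for c in classes}

    def predict(X_new):
        scores = np.column_stack([models[c].decision_function(X_new) for c in classes])
        return np.array(classes)[scores.argmax(axis=1)]
    return predict
```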
Learn 1 classifier: Multiclass SVM Idea 2: learn 3 sets of weights simultaneously! For each class c, estimate wc, bc. We want the correct class to have the highest margin: w_yi · xi + b_yi ≥ 1 + wc · xi + bc, for all c ≠ yi and all i (i.e., for every training example (xi, yi)).
Multiclass SVM Optimization problem: minimize ½ Σc ‖wc‖² + C Σi ξi over {wc, bc} and ξi ≥ 0, subject to w_yi · xi + b_yi ≥ wc · xi + bc + 1 − ξi for all c ≠ yi and all i. To obtain the parameters wc, bc (for each class c) we can use techniques similar to those for the 2-class SVM.
SVM: what you must know One of the most successful ML algorithms. Modifications for handling big datasets: reduce the training set (“core sets”), modify the SVM training algorithm, or use incremental algorithms. Multi-class modifications are more complex.