Classification & Regression Part II


Classification & Regression Part II
COSC 526, Class 8
Arvind Ramanathan
Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266, E-mail: ramanathana@ornl.gov
(Start with ebola…)

Last Class
We saw different techniques for handling large-scale classification:
- Decision trees
- k-Nearest Neighbors (k-NN)
We modified decision trees to handle streaming datasets, and for k-NN we discussed efficient data structures to handle large-scale datasets.

This class…
More classification: support vector machines (SVMs)
- Basic outline of the algorithm
- Modifications for large-scale datasets
- Online approaches

Class Logistics

10% Grade…
- 40%: assignments (2 instead of 3)
- 50%: class project
What do we do with the 5%? Class participation:
- Selected papers are updated on the class website
- Present your take on one of them (a critique)
- 2 presentations per class, 15 minutes each
- Starting Tue (Feb 17)

Critique Structure
- 2-3 slides: what is the paper all about?
- 2-4 slides: key results presented
- 2-3 slides: key limitations/shortcomings
- 2-3 slides: what could have been done better?
Aim for a discussion-oriented presentation rather than an in-depth walkthrough of the paper.
Peer reviews of critiques are due every week.

Support Vector Machines (SVM)

Support Vector Machines (SVM)
Easiest explanation: find the "best linear separator" for a dataset.
- Training examples: {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
- Each data point x_i = (x_i^(1), x_i^(2), …, x_i^(d))
- Labels y_i ∈ {-1, +1}
In higher-dimensional datasets we want to find the "best hyperplane".
(Figure: an input x is mapped by the model f(x, w) to a label y.)
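
To make the setup concrete, here is a minimal Python/NumPy sketch of what the linear rule f(x) = sign(w · x + b) looks like once w and b have been learned; the weights, bias, and data points below are made-up values for illustration only.

```python
import numpy as np

# Toy illustration of a linear classifier f(x) = sign(w . x + b).
# The weights below are made-up values, not a trained model.
w = np.array([2.0, -1.0])   # weight vector (one entry per feature)
b = -0.5                    # bias / intercept

X = np.array([[1.0, 0.0],   # a few d = 2 data points
              [0.0, 2.0],
              [3.0, 1.0]])

scores = X @ w + b                      # signed scores w . x + b
labels = np.where(scores >= 0, 1, -1)   # predicted labels in {-1, +1}
print(scores, labels)
```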

Classifier Margin…
Margin: the width by which the boundary could be widened before hitting a data point.
We are interested in the maximum margin; the simplest formulation is the maximum-margin SVM classifier.
The data points that this widened boundary touches are the support vectors.

Why are we interested in Max Margin SVM?
- Intuitively this feels right: we need the maximum margin to fit the data correctly
- Robust to the location of the boundary: even if we have made a small error in the location of the boundary, there is not much effect on classification
- The model obtained is tolerant to removal of any non-support-vector data points
- Validation works well
- Works very well in practice

Why is the maximum margin a good thing?
Theoretical convenience: there exist generalization error bounds that depend on the value of the margin.

Why is maximizing γ a good idea?
Recall what the dot product means: A · B = ‖A‖ ‖B‖ cos θ.
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Why is maximizing γ a good idea? (continued)
What are w · x_1 + b and w · x_2 + b? The value of w · x + b roughly corresponds to the margin: the bigger it is (on the correct side of the hyperplane w · x + b = 0), the bigger the separation.
(Figure: the points x_1, x_2 and the vector w for two different hyperplanes; the resulting margins γ_1 and γ_2 differ with the length of w.)
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

What is the margin? Distance from a point to a line
Let:
- Line L: w · x + b = w^(1) x^(1) + w^(2) x^(2) + b = 0
- w = (w^(1), w^(2)); note we assume ‖w‖ = 1
- Point A = (x_A^(1), x_A^(2))
- Point M on the line, M = (x_M^(1), x_M^(2)), with H the foot of the perpendicular from A
Then:
d(A, L) = |AH| = |(A − M) · w|
        = |(x_A^(1) − x_M^(1)) w^(1) + (x_A^(2) − x_M^(2)) w^(2)|
        = |x_A^(1) w^(1) + x_A^(2) w^(2) + b| = |w · A + b|
Remember that x_M^(1) w^(1) + x_M^(2) w^(2) = −b, since M belongs to line L.
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)
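
The formula can be checked numerically. In the sketch below, w, b, and A are arbitrary made-up values (with w rescaled to unit norm, as the derivation assumes), and |w · A + b| is compared against the distance to the orthogonal projection of A onto the line.

```python
import numpy as np

# Numerical check of d(A, L) = |w . A + b| for a unit-norm w (values arbitrary).
w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)   # enforce ||w|| = 1 as in the derivation
b = -1.0

A = np.array([2.0, 0.5])    # the point whose distance we want

# Formula from the slide: with ||w|| = 1, the distance is |w . A + b|
d_formula = abs(w @ A + b)

# Brute-force check: the closest point on the line w . x + b = 0 is the
# orthogonal projection A - (w . A + b) * w
x_closest = A - (w @ A + b) * w
d_projection = np.linalg.norm(A - x_closest)

print(d_formula, d_projection)   # the two numbers agree
```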

Largest Margin
For the i-th data point we require y_i (w · x_i + b) ≥ γ.
We want to solve:
    max_{w, γ} γ   subject to   y_i (w · x_i + b) ≥ γ for all i = 1, …, n.
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Support Vector Machine
Maximize the margin γ:
- Good according to intuition, theory (VC dimension) & practice
- γ is the margin: the distance from the separating hyperplane w · x + b = 0 to the nearest data points
(Figure: positive and negative points on either side of wx + b = 0, with the margin γ marked.)
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

How do we derive the margin?
- The separating hyperplane is defined by the support vectors: the points lying on the +/− margin planes of the solution
- If you knew these points, you could ignore the rest of the data
- Generally there are d + 1 support vectors (for d-dimensional data)
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Canonical Hyperplane: Problem
Consider the hyperplane w · x + b = 0 together with the margin planes w · x + b = +1 and w · x + b = −1.
Problem: the constraint y_i (w · x_i + b) ≥ γ can be satisfied trivially, because scaling w (and b) increases the apparent margin without changing the separator.
Solution: work with a normalized w, and also require the support vectors x_j to lie in the planes defined by w · x_j + b = ±1.
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Canonical Hyperplane: Solution
We want to maximize the margin γ. What is the relation between x_1 and x_2, the support vectors on the two margin planes? They are 2γ apart along the direction of w:
    x_1 = x_2 + 2γ (w / ‖w‖).
We also know that w · x_1 + b = +1 and w · x_2 + b = −1. Substituting:
    w · (x_2 + 2γ w/‖w‖) + b = +1
    (w · x_2 + b) + 2γ ‖w‖ = +1
    −1 + 2γ ‖w‖ = +1,   so   γ = 1 / ‖w‖.
Note: maximizing γ is therefore the same as minimizing ‖w‖.

Maximizing the Margin
We started with
    max_{w, γ} γ   subject to   y_i (w · x_i + b) ≥ γ for all i,
but w can be arbitrarily large! After normalizing (the canonical hyperplane) we have γ = 1/‖w‖, so maximizing γ amounts to minimizing ‖w‖. The problem becomes:
    min_{w, b} (1/2) ‖w‖²   subject to   y_i (w · x_i + b) ≥ 1 for all i.
This is called the SVM with "hard" constraints.

Non-linearly Separable Data
If the data is not separable, introduce a penalty:
- Minimize ‖w‖² plus C times the number of training mistakes
- Set C using cross-validation
- How should we penalize mistakes? All mistakes are not equally bad!
(Figure: a dataset in which a few + and − points fall on the wrong side of wx + b = 0.)

Support Vector Machines
Introduce slack variables ξ_i: if point x_i is on the wrong side of the margin, it pays a penalty ξ_i:
    min_{w, b, ξ} (1/2) ‖w‖² + C Σ_i ξ_i   subject to   y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0.
For each data point: if its margin is ≥ 1, we don't care; if its margin is < 1, we pay a linear penalty.
(Figure: two misclassified points with their slacks ξ_i and ξ_j marked.)

What is the role of slack penalty C?
- C = ∞: we only want w, b that separate the data (hard margin)
- C = 0: we can set ξ_i to anything, and then w = 0 (the data is basically ignored)
(Figure: decision boundaries for small C, a "good" C, and big C.)

Support Vector Machines
SVM in its "natural" form:
    min_{w, b} (1/2) ‖w‖²  +  C Σ_i max{0, 1 − y_i (w · x_i + b)}
- The first term involves the margin; C is the regularization parameter
- The second term is the empirical loss L (how well we fit the training data)
SVM uses the hinge loss max{0, 1 − z}, where z = y_i (w · x_i + b).
(Figure: the 0/1 loss vs. the hinge loss, plotted as a function of z over the range −1 to 2.)
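
A minimal Python/NumPy sketch of this objective, assuming the hinge-loss form written above; the data, weights, and C are made-up values for illustration.

```python
import numpy as np

# SVM objective f(w,b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b)).
def svm_objective(w, b, X, y, C):
    margins = y * (X @ w + b)               # y_i (w . x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # 0 if margin >= 1, linear penalty otherwise
    return 0.5 * w @ w + C * hinge.sum()

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b, C = np.array([0.5, 0.5]), 0.0, 1.0
print(svm_objective(w, b, X, y, C))
```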

SVM: How to estimate w?
We want to estimate w and b. The standard way: use a solver!
- A solver is software for finding solutions to "common" optimization problems
- Here we can use a quadratic-programming solver: minimize a quadratic function subject to linear constraints
Problem: solvers are inefficient for big data!
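
For small datasets the solver route is easy. The sketch below uses scikit-learn's SVC (which wraps libsvm's solver) on a synthetic dataset; it only illustrates the "use a solver" approach and, as the slide notes, does not scale to big data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Let an off-the-shelf solver handle the quadratic program (libsvm via scikit-learn).
X, y = make_blobs(n_samples=200, centers=2, random_state=0)   # synthetic 2-class data
y = np.where(y == 0, -1, 1)                                   # relabel to {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```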

SVM: How to estimate w?
We want to estimate w, b. Alternative approach: minimize f(w, b) directly.
Side note: how do we minimize a convex function g(z)? Use gradient descent to solve min_z g(z):
    iterate z_{t+1} ← z_t − η ∇g(z_t).

SVM: How to estimate w?
We want to minimize
    f(w, b) = (1/2) ‖w‖² + C Σ_i max{0, 1 − y_i (w · x_i + b)}.
Compute the gradient ∇f^(j) with respect to each coordinate w^(j):
    ∇f^(j) = ∂f/∂w^(j) = w^(j) + C Σ_i ∂L(x_i, y_i)/∂w^(j),
where the partial derivative of the empirical loss L(x_i, y_i) is 0 if y_i (w · x_i + b) ≥ 1, and −y_i x_i^(j) otherwise.

SVM: How to estimate w?
Gradient descent: iterate until convergence; for j = 1 … d, evaluate ∇f^(j) and update
    w^(j) ← w^(j) − η ∇f^(j),
where η is the learning-rate parameter and C is the regularization parameter.
Problem: computing ∇f^(j) takes O(n) time, where n is the size of the training dataset!
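
A minimal batch (sub)gradient-descent sketch for this objective in Python/NumPy; the learning rate, C, iteration count, and toy data are made-up values. Note that every step touches all n examples, which is the O(n) cost mentioned above.

```python
import numpy as np

# Batch subgradient descent on f(w,b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
def svm_batch_gd(X, y, C=1.0, eta=0.01, iters=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        violators = margins < 1                     # examples with margin < 1
        grad_w = w - C * (y[violators, None] * X[violators]).sum(axis=0)
        grad_b = -C * y[violators].sum()
        w -= eta * grad_w                           # each step costs O(n)
        b -= eta * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_batch_gd(X, y)
print(np.sign(X @ w + b))   # matches y on this tiny separable set
```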

SVM: How to estimate w? Stochastic Gradient Descent
We just had: ∇f^(j) = w^(j) + C Σ_i ∂L(x_i, y_i)/∂w^(j).
Stochastic gradient descent (SGD): instead of evaluating the gradient over all examples, evaluate it for each individual training example:
    ∇f^(j)(x_i) = w^(j) + C ∂L(x_i, y_i)/∂w^(j)   (notice: no summation over i anymore).
Iterate until convergence: for i = 1 … n, for j = 1 … d, compute ∇f^(j)(x_i) and update w^(j) ← w^(j) − η ∇f^(j)(x_i).
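
The same sketch with the stochastic update: one example per step instead of a sum over all n. Again, the constants and toy data are made up.

```python
import numpy as np

# Stochastic gradient descent: same objective, but each update looks at a single
# example i instead of summing over all n examples.
def svm_sgd(X, y, C=1.0, eta=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):            # visit examples in random order
            if y[i] * (X[i] @ w + b) < 1:       # margin violated: hinge is active
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])
            else:                               # only the regularizer contributes
                w -= eta * w
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))
```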

Making SVMs work with Big Data

Optimization problem set up
In dual form, training becomes a problem over multipliers α_i in which the data appear only through a kernel function K(x_i, x_j):
    max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
    subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
This is a good optimization problem for computers to solve; it is called quadratic programming.

Problems with SVMs on Big Data
SVM training requires quadratic programming (QP), with complexity:
- Time: O(m³)
- Space: O(m²)
where m is the size of the training data.
Two types of approaches:
- Modify the SVM algorithm to work with large datasets
- Select representative training data and use a normal SVM

Reducing the Training Set
How do we reduce the size of the training set so that we can reduce training time?
- Combine (a large number of) "small" SVMs together to obtain the final SVM
- Reduced SVM: use a random rectangular subset of the kernel matrix
All of these techniques only find an approximation to the optimal solution by an iterative approach, so why not exploit this?

Problematic datasets: even in 1D!
How will an SVM handle a dataset like this one?
(Figure: a 1-D dataset in which no single threshold separates the two classes.)

The Kernel Trick: Using higher dimensions to separate data
- Project the data into a higher-dimensional space
- Now find the separating hyperplane in that space
This is a common trick in ML for many algorithms; a small sketch of the idea follows.
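
A tiny sketch of the idea, assuming the (hypothetical) feature map φ(x) = (x, x²): a 1-D dataset that no single threshold can separate becomes linearly separable after the lift.

```python
import numpy as np

# A 1-D set that no threshold can separate: the negatives sit between the positives.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,   1,   1 ])

# Lift each point to 2-D with the feature map phi(x) = (x, x^2).
Z = np.column_stack([x, x ** 2])

# In the lifted space the classes are separated by the horizontal line z2 = 2.5,
# i.e. the linear rule sign(z2 - 2.5) gets every point right.
print(np.sign(Z[:, 1] - 2.5) == y)
```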

Commonly used SVM basis functions
- z_k = (polynomial terms of x_k of degree 1 to q)
- z_k = (radial basis functions of x_k)
- z_k = (sigmoid functions of x_k)
These and many other functions are valid.
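
These three families correspond to the polynomial, RBF, and sigmoid kernels. A sketch of the kernel functions themselves follows; the hyperparameters (degree q, width gamma, and the sigmoid's kappa/theta) are made-up defaults, not values from the slides.

```python
import numpy as np

# The three kernel families written out explicitly (hyperparameters are made up).
def poly_kernel(x, z, q=3, c=1.0):
    return (x @ z + c) ** q                        # polynomial terms up to degree q

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))   # radial basis function

def sigmoid_kernel(x, z, kappa=0.1, theta=-1.0):
    return np.tanh(kappa * (x @ z) + theta)        # sigmoid (not PSD for all parameters)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```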

How to tackle training set sizes?
We need:
- An efficient and effective method for selecting a "working" set
- A way to "shrink" the optimization problem: there are many fewer support vectors than training examples, and many support vectors have α_i at the upper bound C
- Computational improvements, including caching and incremental updates of the gradient

What does this algorithm look like?
In each iteration of the optimization, split the variables α_i into two categories:
- Set B of free variables: updated in the current iteration
- Set N of fixed variables: temporarily fixed in the current iteration
Then divide and solve the optimization. While the optimality constraints are violated:
1. Select q variables for the working set B; the remaining l − q variables form the fixed set N.
2. Solve the QP sub-problem of W(α) on B.
(Joachims, T., Making Large-Scale SVM Learning Practical.)

Now, how do we select a good "working" set?
- Select the set of variables such that the current iteration makes progress towards the minimum of W(α)
- Use a first-order approximation, i.e., the steepest descent direction d that has only q non-zero elements
- Convergence: terminate only when the optimal solution is found; if it is not yet optimal, take a step towards the optimum

Other techniques to identify appropriate (and reduced) training sets
Minimum enclosing ball (MEB) for selecting "core" sets: after the core set is selected, solve the same optimization problem on it.
Complexity:
- Time: O(m/ε² + 1/ε⁴)
- Space: O(1/ε⁸)
(Figure: a minimum enclosing ball of radius R and its ε·R approximation.)
(Tsang, I.W., Kwok, J.T., and Cheung, P.-M., JMLR 6: 363-392, 2005.)

Incremental (and Decremental) SVM Learning
Solving this QP formulation incrementally: as training examples are added (or removed), the existing solution is adjusted while keeping the optimality conditions satisfied, rather than re-solving the whole problem from scratch.

Example of an SVM working with Big Datasets
Example by Leon Bottou: the Reuters RCV1 document corpus.
- Task: predict the category of a document (one vs. the rest classification)
- m = 781,000 training examples (documents); 23,000 test examples
- d = 50,000 features: one feature per word, with stop-words and low-frequency words removed

Text categorization
Questions:
1. Is SGD successful at minimizing f(w, b)?
2. How quickly does SGD find the minimum of f(w, b)?
3. What is the error on a test set?
(Table: training time, value of f(w, b), and test error, compared for a standard SVM, a "fast SVM", and SGD-SVM.)
Answers:
1. SGD-SVM is successful at minimizing the value of f(w, b)
2. SGD-SVM is super fast
3. SGD-SVM's test-set error is comparable

Optimization "Accuracy"
Optimization quality: | f(w, b) – f(w_opt, b_opt) |.
For optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast.
(Figure: optimization quality vs. training time for SGD-SVM and a conventional SVM.)

SGD vs. Batch Conjugate Gradient
SGD on the full dataset vs. conjugate gradient (CG) on a sample of n training examples.
Theory says: gradient descent converges in time linear in k, while conjugate gradient converges in √k, where k is the condition number.
Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Practical Considerations
We need to choose the learning rate η and t_0. Leon suggests:
- Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
- To choose η: select a small subsample, try various rates η (e.g., 10, 1, 0.1, 0.01, …), pick the one that most reduces the cost, and then use that η for the next 100k iterations on the full dataset (a sketch of this procedure follows)
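
A sketch of that learning-rate selection recipe, with a self-contained cost function, SGD pass, and synthetic data; all constants, sizes, and candidate rates are made-up values following the list above.

```python
import numpy as np

# Try several learning rates on a small subsample and keep the one that
# reduces the cost the most (all constants are made up).
def svm_cost(w, b, X, y, C=1.0):
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()

def sgd_pass(X, y, eta, C=1.0, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w + b) < 1:
                w, b = w - eta * (w - C * y[i] * X[i]), b + eta * C * y[i]
            else:
                w = w - eta * w
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

sub = rng.choice(len(y), size=20, replace=False)        # small subsample
candidates = [10, 1, 0.1, 0.01, 0.001]
costs = {eta: svm_cost(*sgd_pass(X[sub], y[sub], eta), X[sub], y[sub])
         for eta in candidates}
best_eta = min(costs, key=costs.get)
print(costs, "-> use eta =", best_eta, "on the full dataset")
```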

Advanced Topics…

Sparse Linear SVMs
The feature vector x_i is sparse (contains many zeros).
- Do not store x_i = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
- Instead represent x_i as a sparse vector, e.g., x_i = [(4,1), (9,5), …]
Can we do the SGD update w ← w − η(w − C y_i x_i) more efficiently? Approximate it in 2 steps:
1. w ← w + η C y_i x_i  (cheap: x_i is sparse, so only a few coordinates j of w are updated)
2. w ← w (1 − η)  (expensive: w is not sparse, so all coordinates need to be updated)
The two-step update is not equivalent to the exact update. Why does it work? Because η is close to 0: the two differ only by an extra η² term, which is of lower order.

Sparse Linear SVMs: Practical Considerations
Two-step update procedure: (1) w ← w + η C y_i x_i, (2) w ← w (1 − η).
Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s · v. Then step (1) touches only the nonzero coordinates of x_i in v, and step (2) is just a rescaling of s.
Solution 2: perform only step (1) for each training example; perform step (2) with lower frequency and a correspondingly higher η.
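
A sketch of Solution 1, assuming the objective and SGD update from the earlier slides; the sparse example, η, and C are made-up values, and the sparse vector is stored simply as (index, value) pairs.

```python
import numpy as np

# Sketch of "represent w as s * v": step (1) touches only the nonzeros of x_i,
# step (2) becomes an O(1) rescaling of the scalar s. Constants are made up.
eta, C, d = 0.01, 1.0, 10
s, v, b = 1.0, np.zeros(d), 0.0          # w is implicitly s * v

# sparse x_i stored as (index, value) pairs, in the spirit of [(4,1), (9,5), ...]
idx = np.array([4, 9])
vals = np.array([1.0, 5.0])
y_i = 1.0

w_dot_x = s * (v[idx] @ vals)            # uses only the nonzero coordinates of x_i
if y_i * (w_dot_x + b) < 1:              # hinge is active for this example
    # step (1): w += eta * C * y_i * x_i, applied to v and rescaled by 1/s
    v[idx] += (eta * C * y_i / s) * vals
    b += eta * C * y_i
# step (2): w *= (1 - eta), done by rescaling s only (no pass over all of w)
s *= (1 - eta)
# As noted on the previous slide, doing (1) then (2) differs from the exact
# one-step SGD update only by an eta^2 term, which is negligible for small eta.

print("w =", s * v, "b =", b)
```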

Practical Considerations
Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation:
- Create a validation set
- Monitor the cost function on the validation set
- Stop when the loss stops decreasing
Early stopping:
- Extract two disjoint subsamples A and B of the training data
- Train on A, stop by validating on B
- The number of epochs is an estimate of k
- Train for k epochs on the full dataset

What about multiple classes?
Idea 1: one against all. Learn 3 classifiers:
- + vs. {o, −}
- − vs. {o, +}
- o vs. {+, −}
Obtain w_+, b_+, w_−, b_−, w_o, b_o.
How to classify? Return the class c = arg max_c (w_c · x + b_c).
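
A sketch of the one-against-all recipe using scikit-learn's LinearSVC as the binary learner on synthetic data; scikit-learn can handle multiclass internally, but spelling it out shows the argmax rule above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# One-against-all by hand: one binary classifier per class c (c vs. the rest),
# then predict argmax_c (w_c . x + b_c).
X, y = make_blobs(n_samples=300, centers=3, random_state=0)   # classes 0, 1, 2

classifiers = {c: LinearSVC(C=1.0, max_iter=5000).fit(X, (y == c).astype(int))
               for c in np.unique(y)}

# decision_function gives w_c . x + b_c for the "class c vs. rest" problem
scores = np.column_stack([classifiers[c].decision_function(X)
                          for c in sorted(classifiers)])
pred = np.argmax(scores, axis=1)
print("training accuracy:", (pred == y).mean())
```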

Multiclass SVM
Idea 2: learn 1 classifier with 3 sets of weights, trained simultaneously!
- For each class c, estimate w_c, b_c
- We want the correct class to have the highest margin:
    w_{y_i} · x_i + b_{y_i} ≥ 1 + w_c · x_i + b_c   for all c ≠ y_i and all (x_i, y_i).

Multiclass SVM
Optimization problem: to obtain the parameters w_c, b_c (for each class c), we can use techniques similar to those for the 2-class SVM.

SVM: what you must know
- One of the most successful ML algorithms
- Modifications for handling big datasets:
  - Reduce the training set ("core sets")
  - Modify the SVM training algorithm
  - Incremental algorithms
- Multi-class modifications are more complex