1
Support Vector Machine. Slides from Andrew Moore and Mingyue Tan.
2
Linear Classifiers. f(x, w, b) = sign(w·x + b), where the labels are +1 and -1. How would you classify this data? (Figure: labeled points and a candidate boundary; w·x + b = 0 on the boundary, w·x + b > 0 on one side, w·x + b < 0 on the other.)
3
Linear Classifiers. f(x, w, b) = sign(w·x + b). How would you classify this data? (Figure: the same +1/-1 points with a different candidate separating line.)
4
Linear Classifiers. f(x, w, b) = sign(w·x + b). How would you classify this data? (Figure: yet another candidate separating line through the same points.)
5
Linear Classifiers. f(x, w, b) = sign(w·x + b). Any of these would be fine... but which is best?
6
Linear Classifiers. f(x, w, b) = sign(w·x + b). How would you classify this data? (Figure: a boundary that leaves a point misclassified into the +1 class.)
7
Classifier Margin. f(x, w, b) = sign(w·x + b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
8
Maximum Margin. The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM, Linear SVM). Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; other training examples are ignorable.
3. Empirically it works very, very well.
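As a concrete illustration (not part of the original slides), here is a minimal sketch of fitting a maximum-margin linear SVM with scikit-learn; the toy dataset and the very large C value (used to approximate a hard margin) are assumptions for illustration.

```python
# Minimal linear-SVM sketch (illustrative; toy data and parameters are assumptions).
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset: two linearly separable groups labeled -1 and +1.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin linear SVM described above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)  # the points the margin pushes against
```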
9
Linear SVM Mathematically. What we know: w·x⁺ + b = +1 and w·x⁻ + b = -1, so w·(x⁺ - x⁻) = 2. Projecting x⁺ - x⁻ onto the unit normal w/||w|| gives the margin width M = w·(x⁺ - x⁻)/||w|| = 2/||w||. (Figure: the "Predict Class = +1" zone beyond w·x + b = 1 and the "Predict Class = -1" zone beyond w·x + b = -1, with x⁺ and x⁻ on the two margin hyperplanes and M the margin width between them.)
10
Linear SVM Mathematically. Goal: 1) Correctly classify all training data: w·xᵢ + b ≥ +1 if yᵢ = +1 and w·xᵢ + b ≤ -1 if yᵢ = -1, i.e. yᵢ(w·xᵢ + b) ≥ 1 for all i. 2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wᵀw. We can formulate a quadratic optimization problem and solve for w and b: minimize ½ wᵀw subject to yᵢ(w·xᵢ + b) ≥ 1 for all i.
11
Solving the Optimization Problem. We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. Find w and b such that Φ(w) = ½ wᵀw is minimized, and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1.
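To make the quadratic program concrete, here is a small sketch that solves the hard-margin primal directly with a generic convex solver; the toy data and the choice of cvxpy (rather than a specialized SVM solver) are assumptions for illustration.

```python
# Hard-margin primal QP solved with a generic convex solver (illustrative sketch).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, +1, +1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin width M = 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```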
12
Solving the Optimization Problem. Primal problem (quadratic programming with linear constraints): minimize ½||w||² s.t. yᵢ(wᵀxᵢ + b) ≥ 1 for all i. Lagrangian function: L(w, b, α) = ½||w||² − Σᵢ αᵢ [yᵢ(wᵀxᵢ + b) − 1], with multipliers αᵢ ≥ 0.
13
Karush-Kuhn-Tucker (KKT) Conditions. At the optimum: ∂L/∂w = 0 ⇒ w = Σᵢ αᵢ yᵢ xᵢ; ∂L/∂b = 0 ⇒ Σᵢ αᵢ yᵢ = 0; primal feasibility yᵢ(wᵀxᵢ + b) − 1 ≥ 0; dual feasibility αᵢ ≥ 0; and complementary slackness αᵢ [yᵢ(wᵀxᵢ + b) − 1] = 0.
14
Solving the Optimization Problem. Setting the derivatives of the Lagrangian to zero gives w = Σᵢ αᵢ yᵢ xᵢ and Σᵢ αᵢ yᵢ = 0, with αᵢ ≥ 0; substituting these back into L eliminates w and b.
15
Solving the Optimization Problem. Lagrangian dual problem: maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ s.t. αᵢ ≥ 0 for all i, and Σᵢ αᵢ yᵢ = 0.
16
Solving the Optimization Problem. The solution has the form: w = Σᵢ αᵢ yᵢ xᵢ, and b = yₖ − wᵀxₖ for any xₖ with αₖ ≠ 0. From the KKT conditions we know that αᵢ [yᵢ(wᵀxᵢ + b) − 1] = 0; thus only the support vectors, the points lying on the margin hyperplanes, have αᵢ ≠ 0. (Figure: axes x₁, x₂ with the hyperplanes wᵀx + b = −1, 0, +1; the points x⁺ and x⁻ on the margin hyperplanes are the support vectors.)
17
Solving the Optimization Problem. The linear discriminant function is g(x) = wᵀx + b = Σᵢ αᵢ yᵢ xᵢᵀx + b. Notice that it relies on a dot product between the test point x and the support vectors xᵢ. Also keep in mind that solving the optimization problem involved computing the dot products xᵢᵀxⱼ between all pairs of training points.
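The sketch below checks this numerically: it recovers the αᵢyᵢ coefficients and support vectors from a fitted scikit-learn model and evaluates g(x) using only dot products with the support vectors. The toy data and variable names are assumptions for illustration.

```python
# Sketch: g(x) = sum_i alpha_i y_i x_i^T x + b uses only the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, stored only for support vectors
sv = clf.support_vectors_
b = clf.intercept_[0]

x_test = np.array([3.0, 3.0])
g_manual = np.sum(alpha_y * (sv @ x_test)) + b            # dot products with support vectors
g_sklearn = clf.decision_function(x_test.reshape(1, -1))[0]
print(g_manual, g_sklearn)  # the two values should agree up to numerical precision
```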
18
Properties of SVM: duality, margin, sparseness, convexity, kernels.
19
Dataset with noise. Hard margin: so far we require all data points be classified correctly, with no training error. What if the training set is noisy? Solution 1: use very powerful kernels. (Figure: +1 and -1 points separated by a highly contorted boundary, labeled OVERFITTING!)
20
Dataset with noise. What if the data is not linearly separable (noisy data, outliers, etc.)? Slack variables ξᵢ can be added to allow misclassification of difficult or noisy data points. (Figure: axes x₁, x₂ with hyperplanes wᵀx + b = −1, 0, +1 and slack for points on the wrong side of their margin.)
21
Large Margin Linear Classifier. Formulation: minimize ½||w||² + C Σᵢ ξᵢ such that yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i. The parameter C can be viewed as a way to control over-fitting. Known as C-SVM; it produces a soft margin.
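A minimal sketch of this soft-margin primal, again using cvxpy as a generic solver; the toy data (including the deliberate outlier) and the value of C are assumptions for illustration.

```python
# Soft-margin primal with slack variables (illustrative sketch; data and C are assumptions).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5], [1.8, 4.2]])  # last point is an outlier
y = np.array([-1, -1, +1, +1, -1])
C = 1.0

n, d = X.shape
w, b = cp.Variable(d), cp.Variable()
xi = cp.Variable(n)  # slack variables, one per training point

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("slacks =", np.round(xi.value, 3))  # nonzero slack marks a margin violation
```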
22
Large Margin Linear Classifier. Formulation (Lagrangian dual problem): maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ such that 0 ≤ αᵢ ≤ C and Σᵢ αᵢ yᵢ = 0. A small value of C will increase the number of training errors, while a large C will lead to behavior similar to that of a hard-margin SVM.
23
Large Margin Linear Classifier. The parameter C controls the trade-off between errors of the SVM on training data and margin maximization (C = ∞ leads to the hard-margin SVM). If it is too large, we have a high penalty for non-separable points, and we may store many support vectors and overfit. If it is too small, we may underfit. How to choose C: grid search over the parameters.
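A short sketch of choosing C by cross-validated grid search; the synthetic noisy dataset and the particular grid of C values are assumptions for illustration.

```python
# Choosing C by grid search with cross-validation (illustrative sketch).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=60) > 0, 1, -1)  # noisy labels

param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```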
24
Large Margin Linear Classifier. Another variant of SVM sets C = 1/(νN), where 0 ≤ ν ≤ 1 denotes the fraction of misclassifications that can be accepted. Known as ν-SVM.
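scikit-learn exposes this variant as NuSVC, where nu is set directly instead of C; the synthetic data and the nu value below are assumptions for illustration.

```python
# nu-SVM sketch: nu bounds the fraction of margin errors (illustrative; data and nu are assumptions).
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=80) > 0, 1, -1)

clf = NuSVC(nu=0.1, kernel="linear")  # accept roughly up to 10% margin errors
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
```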
25
Cost-sensitive SVM. 2C-SVM: use separate penalties C₊ and C₋ for errors on the two classes. 2ν-SVM: use ν₊ and ν₋; these can be the fraction of support vectors from the two classes or the fraction of misclassifications allowed per class.
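One practical way to get per-class penalties in scikit-learn is the class_weight argument of SVC, which scales C separately for each class; treating it as a stand-in for the 2C-SVM idea above, with an assumed imbalanced dataset and weights, is an illustrative choice rather than the slides' exact formulation.

```python
# Per-class penalties via class_weight (a practical stand-in for the 2C-SVM idea; illustrative).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=100) > 0.8, 1, -1)  # imbalanced classes

# Penalize errors on the rarer +1 class more heavily than on the -1 class.
clf = SVC(kernel="linear", C=1.0, class_weight={+1: 5.0, -1: 1.0})
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
```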
26
How to classify non-linearly separable datasets?
27
Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (Figure: 1-D data on the x axis that cannot be split by a single threshold becomes linearly separable after mapping each point x to (x, x²).)
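A small sketch of that 1-D example: the +1 points sit between the -1 points on the line, so no threshold separates them, but the explicit map x → (x, x²) makes them linearly separable. The specific numbers are assumptions for illustration.

```python
# Mapping 1-D data to (x, x^2): not separable on the line, separable in the plane (illustrative sketch).
import numpy as np
from sklearn.svm import SVC

# +1 points sit in the middle, -1 points on both sides: no single threshold on x separates them.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, +1, +1, +1, -1, -1, -1])

phi = np.column_stack([x, x ** 2])          # explicit feature map phi(x) = (x, x^2)
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print("training accuracy in feature space:", clf.score(phi, y))  # expected: 1.0
```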
28
Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
29
Nonlinear SVMs: The Kernel Trick. With this mapping, our discriminant function is now g(x) = Σᵢ αᵢ yᵢ φ(xᵢ)ᵀφ(x) + b. There is no need to know the mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
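A short numerical check of the kernel identity for one particular kernel: the homogeneous quadratic kernel K(x, z) = (xᵀz)² on 2-D inputs, whose explicit feature map is known in closed form. The choice of this kernel and the test vectors are assumptions for illustration.

```python
# Kernel trick check: K(x, z) = (x . z)^2 equals phi(x) . phi(z) for an explicit quadratic feature map.
import numpy as np

def phi(v):
    # Explicit feature map for the homogeneous 2nd-degree polynomial kernel on 2-D inputs.
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

def kernel(x, z):
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(kernel(x, z))                    # (1*3 + 2*(-1))^2 = 1.0
print(float(np.dot(phi(x), phi(z))))   # should match the kernel value
```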