1
Support Vector Machine. Slides from Andrew Moore and Mingyue Tan.
2
Linear Classifiers. f(x, w, b) = sign(w·x + b), where the labels are +1 and -1. How would you classify this data? (Figure: labeled points and a candidate boundary; w·x + b = 0 on the boundary, w·x + b > 0 on one side, w·x + b < 0 on the other.)
3
Linear Classifiers. f(x, w, b) = sign(w·x + b). How would you classify this data? (Figure: the same +1/-1 points with a different candidate separating line.)
4
Linear Classifiers. f(x, w, b) = sign(w·x + b). How would you classify this data? (Figure: yet another candidate separating line through the same points.)
5
Linear Classifiers. f(x, w, b) = sign(w·x + b). Any of these would be fine... but which is best?
6
Linear Classifiers. f(x, w, b) = sign(w·x + b). How would you classify this data? (Figure: a boundary that leaves a point misclassified into the +1 class.)
7
Classifier Margin. f(x, w, b) = sign(w·x + b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
8
Maximum Margin. The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM, Linear SVM). Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; other training examples are ignorable.
3. Empirically it works very, very well.
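As a concrete illustration (not part of the original slides), here is a minimal sketch of fitting a maximum-margin linear SVM with scikit-learn; the toy dataset and the very large C value (used to approximate a hard margin) are assumptions for illustration.

```python
# Minimal linear-SVM sketch (illustrative; toy data and parameters are assumptions).
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset: two linearly separable groups labeled -1 and +1.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin linear SVM described above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)  # the points the margin pushes against
```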
9
Linear SVM Mathematically. What we know: w·x⁺ + b = +1 and w·x⁻ + b = -1, so w·(x⁺ - x⁻) = 2. Projecting x⁺ - x⁻ onto the unit normal w/||w|| gives the margin width M = w·(x⁺ - x⁻)/||w|| = 2/||w||. (Figure: the "Predict Class = +1" zone beyond w·x + b = 1 and the "Predict Class = -1" zone beyond w·x + b = -1, with x⁺ and x⁻ on the two margin hyperplanes and M the margin width between them.)
10
Linear SVM Mathematically. Goal: 1) Correctly classify all training data: w·xᵢ + b ≥ +1 if yᵢ = +1 and w·xᵢ + b ≤ -1 if yᵢ = -1, i.e. yᵢ(w·xᵢ + b) ≥ 1 for all i. 2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wᵀw. We can formulate a quadratic optimization problem and solve for w and b: minimize ½ wᵀw subject to yᵢ(w·xᵢ + b) ≥ 1 for all i.
11
Solving the Optimization Problem. We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. Find w and b such that Φ(w) = ½ wᵀw is minimized, and for all {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1.
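To make the quadratic program concrete, here is a small sketch that solves the hard-margin primal directly with a generic convex solver; the toy data and the choice of cvxpy (rather than a specialized SVM solver) are assumptions for illustration.

```python
# Hard-margin primal QP solved with a generic convex solver (illustrative sketch).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, +1, +1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin width M = 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```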
12
Solving the Optimization Problem. Primal problem (quadratic programming with linear constraints): minimize ½||w||² s.t. yᵢ(wᵀxᵢ + b) ≥ 1 for all i. Lagrangian function: L(w, b, α) = ½||w||² − Σᵢ αᵢ [yᵢ(wᵀxᵢ + b) − 1], with multipliers αᵢ ≥ 0.
13
Karush-Kuhn-Tucker (KKT) Conditions. At the optimum: ∂L/∂w = 0 ⇒ w = Σᵢ αᵢ yᵢ xᵢ; ∂L/∂b = 0 ⇒ Σᵢ αᵢ yᵢ = 0; primal feasibility yᵢ(wᵀxᵢ + b) − 1 ≥ 0; dual feasibility αᵢ ≥ 0; and complementary slackness αᵢ [yᵢ(wᵀxᵢ + b) − 1] = 0.
14
Solving the Optimization Problem. Setting the derivatives of the Lagrangian to zero gives w = Σᵢ αᵢ yᵢ xᵢ and Σᵢ αᵢ yᵢ = 0, with αᵢ ≥ 0; substituting these back into L eliminates w and b.
15
Solving the Optimization Problem. Lagrangian dual problem: maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ s.t. αᵢ ≥ 0 for all i, and Σᵢ αᵢ yᵢ = 0.
16
Solving the Optimization Problem. The solution has the form: w = Σᵢ αᵢ yᵢ xᵢ, and b = yₖ − wᵀxₖ for any xₖ with αₖ ≠ 0. From the KKT conditions we know that αᵢ [yᵢ(wᵀxᵢ + b) − 1] = 0; thus only the support vectors, the points lying on the margin hyperplanes, have αᵢ ≠ 0. (Figure: axes x₁, x₂ with the hyperplanes wᵀx + b = −1, 0, +1; the points x⁺ and x⁻ on the margin hyperplanes are the support vectors.)
17
Solving the Optimization Problem. The linear discriminant function is g(x) = wᵀx + b = Σᵢ αᵢ yᵢ xᵢᵀx + b. Notice that it relies on a dot product between the test point x and the support vectors xᵢ. Also keep in mind that solving the optimization problem involved computing the dot products xᵢᵀxⱼ between all pairs of training points.
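The sketch below checks this numerically: it recovers the αᵢyᵢ coefficients and support vectors from a fitted scikit-learn model and evaluates g(x) using only dot products with the support vectors. The toy data and variable names are assumptions for illustration.

```python
# Sketch: g(x) = sum_i alpha_i y_i x_i^T x + b uses only the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, stored only for support vectors
sv = clf.support_vectors_
b = clf.intercept_[0]

x_test = np.array([3.0, 3.0])
g_manual = np.sum(alpha_y * (sv @ x_test)) + b            # dot products with support vectors
g_sklearn = clf.decision_function(x_test.reshape(1, -1))[0]
print(g_manual, g_sklearn)  # the two values should agree up to numerical precision
```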
18
Properties of SVM: duality, margin, sparseness, convexity, kernels.
19
Dataset with noise. Hard margin: so far we require all data points be classified correctly, with no training error. What if the training set is noisy? Solution 1: use very powerful kernels. (Figure: +1 and -1 points separated by a highly contorted boundary, labeled OVERFITTING!)
20
Dataset with noise. What if the data is not linearly separable (noisy data, outliers, etc.)? Slack variables ξᵢ can be added to allow misclassification of difficult or noisy data points. (Figure: axes x₁, x₂ with hyperplanes wᵀx + b = −1, 0, +1 and slack for points on the wrong side of their margin.)
21
Large Margin Linear Classifier. Formulation: minimize ½||w||² + C Σᵢ ξᵢ such that yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i. The parameter C can be viewed as a way to control over-fitting. Known as C-SVM; it produces a soft margin.
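A minimal sketch of this soft-margin primal, again using cvxpy as a generic solver; the toy data (including the deliberate outlier) and the value of C are assumptions for illustration.

```python
# Soft-margin primal with slack variables (illustrative sketch; data and C are assumptions).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5], [1.8, 4.2]])  # last point is an outlier
y = np.array([-1, -1, +1, +1, -1])
C = 1.0

n, d = X.shape
w, b = cp.Variable(d), cp.Variable()
xi = cp.Variable(n)  # slack variables, one per training point

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("slacks =", np.round(xi.value, 3))  # nonzero slack marks a margin violation
```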
22
Large Margin Linear Classifier. Formulation (Lagrangian dual problem): maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ such that 0 ≤ αᵢ ≤ C and Σᵢ αᵢ yᵢ = 0. A small value of C will increase the number of training errors, while a large C will lead to behavior similar to that of a hard-margin SVM.
23
Large Margin Linear Classifier. The parameter C controls the trade-off between errors of the SVM on training data and margin maximization (C = ∞ leads to the hard-margin SVM). If it is too large, we have a high penalty for non-separable points, and we may store many support vectors and overfit. If it is too small, we may underfit. How to choose C: grid search over the parameters.
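A short sketch of choosing C by cross-validated grid search; the synthetic noisy dataset and the particular grid of C values are assumptions for illustration.

```python
# Choosing C by grid search with cross-validation (illustrative sketch).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=60) > 0, 1, -1)  # noisy labels

param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```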
24
Large Margin Linear Classifier. Another variant of SVM sets C = 1/(νN), where 0 ≤ ν ≤ 1 denotes the fraction of misclassifications that can be accepted. Known as ν-SVM.
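scikit-learn exposes this variant as NuSVC, where nu is set directly instead of C; the synthetic data and the nu value below are assumptions for illustration.

```python
# nu-SVM sketch: nu bounds the fraction of margin errors (illustrative; data and nu are assumptions).
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=80) > 0, 1, -1)

clf = NuSVC(nu=0.1, kernel="linear")  # accept roughly up to 10% margin errors
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
```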
25
Cost-sensitive SVM. 2C-SVM: use separate penalties C₊ and C₋ for errors on the two classes. 2ν-SVM: use ν₊ and ν₋; these can be the fraction of support vectors from the two classes or the fraction of misclassifications allowed per class.
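One practical way to get per-class penalties in scikit-learn is the class_weight argument of SVC, which scales C separately for each class; treating it as a stand-in for the 2C-SVM idea above, with an assumed imbalanced dataset and weights, is an illustrative choice rather than the slides' exact formulation.

```python
# Per-class penalties via class_weight (a practical stand-in for the 2C-SVM idea; illustrative).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=100) > 0.8, 1, -1)  # imbalanced classes

# Penalize errors on the rarer +1 class more heavily than on the -1 class.
clf = SVC(kernel="linear", C=1.0, class_weight={+1: 5.0, -1: 1.0})
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
```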
26
How to classify non-linearly separable datasets?
27
Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (Figure: 1-D data on the x axis that cannot be split by a single threshold becomes linearly separable after mapping each point x to (x, x²).)
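A small sketch of that 1-D example: the +1 points sit between the -1 points on the line, so no threshold separates them, but the explicit map x → (x, x²) makes them linearly separable. The specific numbers are assumptions for illustration.

```python
# Mapping 1-D data to (x, x^2): not separable on the line, separable in the plane (illustrative sketch).
import numpy as np
from sklearn.svm import SVC

# +1 points sit in the middle, -1 points on both sides: no single threshold on x separates them.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, +1, +1, +1, -1, -1, -1])

phi = np.column_stack([x, x ** 2])          # explicit feature map phi(x) = (x, x^2)
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print("training accuracy in feature space:", clf.score(phi, y))  # expected: 1.0
```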
28
Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
29
Nonlinear SVMs: The Kernel Trick. With this mapping, our discriminant function is now g(x) = Σᵢ αᵢ yᵢ φ(xᵢ)ᵀφ(x) + b. There is no need to know the mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
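A short numerical check of the kernel identity for one particular kernel: the homogeneous quadratic kernel K(x, z) = (xᵀz)² on 2-D inputs, whose explicit feature map is known in closed form. The choice of this kernel and the test vectors are assumptions for illustration.

```python
# Kernel trick check: K(x, z) = (x . z)^2 equals phi(x) . phi(z) for an explicit quadratic feature map.
import numpy as np

def phi(v):
    # Explicit feature map for the homogeneous 2nd-degree polynomial kernel on 2-D inputs.
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

def kernel(x, z):
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(kernel(x, z))                    # (1*3 + 2*(-1))^2 = 1.0
print(float(np.dot(phi(x), phi(z))))   # should match the kernel value
```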