1
Introduction to SVMs
2
SVMs
– Geometric: maximizing the margin
– Kernel methods: making nonlinear decision boundaries linear, efficiently!
– Capacity: structural risk minimization
3
Linear Classifiers. An input x is mapped to an estimated label y_est by f(x,w,b) = sign(w·x - b); in the figure, one marker denotes +1 and the other denotes -1. How would you classify this data?
4–6
Linear Classifiers (the same slide repeated, each time with a different candidate separating line drawn through the data). How would you classify this data?
7
Linear Classifiers. f(x,w,b) = sign(w·x - b). Any of these would be fine... but which is best?
8
Classifier Margin. f(x,w,b) = sign(w·x - b). Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.
9
Maximum Margin. f(x,w,b) = sign(w·x - b). The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
10
Maximum Margin. f(x,w,b) = sign(w·x - b). The maximum margin linear classifier is the linear classifier with the maximum margin (the linear SVM). Support vectors are the datapoints that the margin pushes up against.
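To make this concrete, here is a minimal sketch of fitting a linear SVM, assuming scikit-learn is available (the slides name no library, and the toy points below are invented for illustration); a very large C approximates the hard margin described here. Note that scikit-learn writes the decision function as sign(w·x + b), so only the sign convention on b differs from the slides.

```python
# A minimal linear-SVM sketch (assumes scikit-learn and numpy; toy data invented).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("prediction for [0.5, 0.5]:", clf.predict([[0.5, 0.5]]))
```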
11
Why Maximum Margin? The maximum margin linear classifier is the linear classifier with the maximum margin; support vectors are the datapoints that the margin pushes up against.
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
4. Empirically it works very, very well.
12
A “Good” Separator (figure: two classes of datapoints, X and O, with a separating line).
13
Noise in the Observations (figure: the same X and O datapoints, shown with observation noise).
14
Ruling Out Some Separators (figure: X and O datapoints with some candidate separators ruled out).
15
Lots of Noise (figure: X and O datapoints with substantial noise).
16
Maximizing the Margin (figure: X and O datapoints separated by the maximum-margin line).
17
Specifying a line and margin. How do we represent this mathematically ... in m input dimensions? (Figure: the plus-plane, the minus-plane, and the classifier boundary, bounding a "Predict Class = +1" zone and a "Predict Class = -1" zone.)
18
Specifying a line and margin. Plus-plane = { x : w·x + b = +1 }. Minus-plane = { x : w·x + b = -1 }. (Figure: the lines w·x+b = +1, w·x+b = 0, w·x+b = -1.) Classify as: +1 if w·x + b >= 1; -1 if w·x + b <= -1; "universe explodes" if -1 < w·x + b < 1.
19
Computing the margin width. Plus-plane = { x : w·x + b = +1 }. Minus-plane = { x : w·x + b = -1 }. Claim: the vector w is perpendicular to the plus-plane. Why? M = margin width. How do we compute M in terms of w and b?
20
Computing the margin width. Claim: the vector w is perpendicular to the plus-plane. Why? Let u and v be two vectors on the plus-plane; then w·u + b = 1 and w·v + b = 1, so w·(u - v) = 0, i.e. w is perpendicular to every direction lying in the plane. And so of course the vector w is also perpendicular to the minus-plane.
21
Computing the margin width. The vector w is perpendicular to the plus-plane. Let x⁻ be any point on the minus-plane, and let x⁺ be the closest plus-plane point to x⁻. (x⁻ is any location in R^m, not necessarily a datapoint.) How do we compute M in terms of w and b?
22
Computing the margin width. Claim: x⁺ = x⁻ + λw for some value of λ. Why?
23
Computing the margin width. Claim: x⁺ = x⁻ + λw for some value of λ. Why? The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ you travel some distance in the direction of w.
24
Computing the margin width. What we know: w·x⁺ + b = +1; w·x⁻ + b = -1; x⁺ = x⁻ + λw; |x⁺ - x⁻| = M. It's now easy to get M in terms of w and b.
25
Computing the margin width. What we know: w·x⁺ + b = +1; w·x⁻ + b = -1; x⁺ = x⁻ + λw; |x⁺ - x⁻| = M. Then w·(x⁻ + λw) + b = 1 => w·x⁻ + b + λ w·w = 1 => -1 + λ w·w = 1 => λ = 2/(w·w).
26
Computing the margin width. What we know: w·x⁺ + b = +1; w·x⁻ + b = -1; x⁺ = x⁻ + λw; |x⁺ - x⁻| = M; λ = 2/(w·w). So M = |x⁺ - x⁻| = |λw| = λ√(w·w) = 2/√(w·w).
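As a quick numerical illustration of the result M = 2/√(w·w), here is a tiny sketch (numpy assumed; the weight vector is hypothetical):

```python
import numpy as np

def margin_width(w):
    """Margin width M = 2 / sqrt(w.w) for the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / np.sqrt(np.dot(w, w))

w = np.array([3.0, 4.0])   # hypothetical weight vector with |w| = 5
print(margin_width(w))     # 0.4, i.e. 2/5
```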
27
Learning the Maximum Margin Classifier. Given a guess of w and b we can: compute whether all data points are in the correct half-planes; compute the width of the margin, M = 2/√(w·w). So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
28
Linear Programming. Find the w maximizing c·w subject to w·a_i >= b_i (i = 1, …, m) and w_j >= 0 (j = 1, …, n). Don't worry… it's good for you… There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar's algorithm.
29
Learning via Quadratic Programming QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
30
Quadratic Programming. Find the u maximizing the quadratic criterion c + dᵀu + uᵀRu/2, subject to n additional linear inequality constraints and e additional linear equality constraints.
31
Quadratic Programming. Find the u maximizing the quadratic criterion c + dᵀu + uᵀRu/2, subject to n additional linear inequality constraints and e additional linear equality constraints. There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly… you probably don't want to write one yourself.)
32
Learning the Maximum Margin Classifier. Given a guess of w, b we can: compute whether all data points are in the correct half-planes; compute the margin width M = 2/√(w·w). Assume R datapoints, each (x_k, y_k) where y_k = +/- 1. What should our quadratic optimization criterion be? Minimize w·w. How many constraints will we have? R. What should they be? w·x_k + b >= 1 if y_k = 1; w·x_k + b <= -1 if y_k = -1.
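One concrete (though not the only) way to solve the quadratic program just written down is to hand it to an off-the-shelf QP solver. The sketch below assumes the cvxopt package and an invented, separable toy dataset; the variables are u = (w_1, …, w_m, b), the criterion is w·w, and the constraints are exactly the R constraints above (a tiny ridge on b is included only to keep the solver numerically happy).

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])  # toy, separable
y = np.array([1.0, 1.0, -1.0, -1.0])
R, m = X.shape

P = np.zeros((m + 1, m + 1))
P[:m, :m] = np.eye(m)        # quadratic criterion: w.w (b itself is not penalised)
P[m, m] = 1e-8               # tiny ridge on b for numerical stability
q = np.zeros(m + 1)

Xb = np.hstack([X, np.ones((R, 1))])   # each row is (x_k, 1)
G = -(y[:, None] * Xb)                 # y_k (w.x_k + b) >= 1  <=>  -y_k (w.x_k + b) <= -1
h = -np.ones(R)

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
u = np.array(sol["x"]).ravel()
w, b = u[:m], u[m]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```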
33
Uh-oh! (This data is not linearly separable.) This is going to be a problem! What should we do? Idea 1: find the minimum w·w while also minimizing the number of training set errors. Problem: two things to minimize makes for an ill-defined optimization.
34
Uh-oh! What should we do? Idea 1.1: minimize w·w + C·(#train errors), where C is a tradeoff parameter. There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
35
Uh-oh! Idea 1.1: minimize w·w + C·(#train errors), where C is a tradeoff parameter. The problem: this can't be expressed as a quadratic programming problem, and solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So… any other ideas?
36
Uh-oh! What should we do? Idea 2.0: minimize w·w + C·(distance of error points to their correct place).
37
Learning Maximum Margin with Noise. Given a guess of w, b we can: compute the sum of distances of points to their correct zones; compute the margin width M = 2/√(w·w). Assume R datapoints, each (x_k, y_k) where y_k = +/- 1. What should our quadratic optimization criterion be? How many constraints will we have? What should they be?
38
Large-margin Decision Boundary. The decision boundary should be as far away from the data of both classes as possible – we should maximize the margin, m. The distance between the origin and the line wᵀx = k is k/||w||. (Figure: Class 1, Class 2, and the margin m.)
39
Finding the Decision Boundary. Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i. The decision boundary should classify all points correctly: y_i(wᵀx_i + b) >= 1 for all i. The decision boundary can be found by solving the following constrained optimization problem: minimize ½||w||² subject to y_i(wᵀx_i + b) >= 1 for all i. This is a constrained optimization problem; solving it requires some new tools. (Feel free to ignore the following several slides; what is important is the constrained optimization problem above.)
40
Back to the Original Problem. The Lagrangian is L = ½wᵀw - Σ_i α_i (y_i(wᵀx_i + b) - 1), with α_i >= 0. (Note that ||w||² = wᵀw.) Setting the gradient of L w.r.t. w and b to zero, we have w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
41
The Karush-Kuhn-Tucker conditions: for every i, α_i >= 0, y_i(wᵀx_i + b) - 1 >= 0, and α_i [y_i(wᵀx_i + b) - 1] = 0.
42
The Dual Problem. If we substitute w = Σ_i α_i y_i x_i into the Lagrangian, we have W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j. Note that Σ_i α_i y_i = 0. This is a function of the α_i only.
43
The Dual Problem. The new objective function is in terms of the α_i only. It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w. The original problem is known as the primal problem. The objective function of the dual problem needs to be maximized! The dual problem is therefore: maximize W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j subject to α_i >= 0 (a property of the α_i when we introduce the Lagrange multipliers) and Σ_i α_i y_i = 0 (the result when we differentiate the original Lagrangian w.r.t. b).
44
The Dual Problem. This is a quadratic programming (QP) problem – a global maximum of the α_i can always be found. w can be recovered by w = Σ_i α_i y_i x_i.
45
Characteristics of the Solution. Many of the α_i are zero: w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier. The x_i with non-zero α_i are called support vectors (SV); the decision boundary is determined only by the SVs. Let t_j (j = 1, ..., s) be the indices of the s support vectors; we can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}. For testing with a new datum z: compute wᵀz + b = Σ_j α_{t_j} y_{t_j} (x_{t_j}ᵀz) + b and classify z as class 1 if the sum is positive, and class 2 otherwise. Note: w need not be formed explicitly.
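A small sketch of the "w need not be formed explicitly" point, with numpy assumed and the support-vector values invented for illustration: the new point z is classified directly from the support vectors, their labels and multipliers, and the (here linear) inner product.

```python
import numpy as np

def decision_value(z, sv_x, sv_y, sv_alpha, b, kernel=np.dot):
    """f(z) = sum_j alpha_j * y_j * K(x_j, z) + b, summed over support vectors only."""
    return sum(a * y * kernel(x, z) for a, y, x in zip(sv_alpha, sv_y, sv_x)) + b

# hypothetical support vectors in 2-D with their labels and multipliers
sv_x = np.array([[1.0, 1.0], [-1.0, 0.5], [0.5, -1.0]])
sv_y = np.array([+1.0, -1.0, -1.0])
sv_alpha = np.array([0.8, 0.5, 0.3])
b = 0.1

z = np.array([0.4, 0.2])
print("class 1" if decision_value(z, sv_x, sv_y, sv_alpha, b) > 0 else "class 2")
```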
46
A Geometrical Interpretation. (Figure: Class 1 and Class 2 with the separating boundary; most points have α_i = 0, while the support vectors carry non-zero multipliers, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6.)
47
Non-linearly Separable Problems. We allow "error" ξ_i in classification; it is based on the output of the discriminant function wᵀx + b. Σ_i ξ_i approximates the number of misclassified samples. (Figure: Class 1 and Class 2 with points on the wrong side of the boundary.)
48
Learning Maximum Margin with Noise. Given a guess of w, b we can: compute the sum of distances of points to their correct zones; compute the margin width M = 2/√(w·w). Assume R datapoints, each (x_k, y_k) where y_k = +/- 1. (Figure: points on the wrong side of their plane, with their distances ε_k marked.) What should our quadratic optimization criterion be? Minimize ½ w·w + C Σ_k ε_k. How many constraints will we have? R. What should they be? w·x_k + b >= 1 - ε_k if y_k = 1; w·x_k + b <= -1 + ε_k if y_k = -1.
49
Learning Maximum Margin with Noise. What should our quadratic optimization criterion be? Minimize ½ w·w + C Σ_k ε_k. Our original (noiseless data) QP had m+1 variables: w_1, w_2, …, w_m, and b. Our new (noisy data) QP has m+1+R variables: w_1, w_2, …, w_m, b, ε_1, …, ε_R (m = # input dimensions, R = # records). How many constraints will we have? R. What should they be? w·x_k + b >= 1 - ε_k if y_k = 1; w·x_k + b <= -1 + ε_k if y_k = -1.
50
Learning Maximum Margin with Noise. Minimize ½ w·w + C Σ_k ε_k subject to R constraints: w·x_k + b >= 1 - ε_k if y_k = 1; w·x_k + b <= -1 + ε_k if y_k = -1. There's a bug in this QP. Can you spot it?
51
Learning Maximum Margin with Noise. Minimize ½ w·w + C Σ_k ε_k. How many constraints will we have? 2R. What should they be? w·x_k + b >= 1 - ε_k if y_k = 1; w·x_k + b <= -1 + ε_k if y_k = -1; ε_k >= 0 for all k. (That was the bug: without ε_k >= 0, a point far on its correct side would allow a very negative ε_k and drive the criterion down without bound.)
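This soft-margin QP is essentially what library SVM implementations solve. As an illustration of the role of the tradeoff parameter C, here is a minimal sketch assuming scikit-learn and an invented, overlapping toy dataset: small C buys a wide margin at the cost of slack, large C penalises slack heavily and narrows the margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, (30, 2)),   # two noisy, overlapping classes
               rng.normal(-1.5, 1.0, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:<7} margin={2 / np.linalg.norm(w):.2f}  support vectors={len(clf.support_)}")
```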
53
An Equivalent Dual QP. The primal: minimize ½ w·w + C Σ_k ε_k subject to w·x_k + b >= 1 - ε_k if y_k = 1, w·x_k + b <= -1 + ε_k if y_k = -1, and ε_k >= 0 for all k. The equivalent dual: maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where Q_kl = y_k y_l (x_k·x_l), subject to 0 <= α_k <= C for all k and Σ_k α_k y_k = 0.
54
An Equivalent Dual QP. Maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where Q_kl = y_k y_l (x_k·x_l), subject to 0 <= α_k <= C for all k and Σ_k α_k y_k = 0. Then define w = Σ_k α_k y_k x_k, and compute b from a datapoint with the largest α_k (b = y_K(1 - ε_K) - x_K·w with K = argmax_k α_k). Then classify with f(x,w,b) = sign(w·x - b).
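For concreteness, here is a sketch of solving this dual QP directly with the cvxopt solver (an assumption of this note; the toy data are invented). The decision value is written w·x + b, so only the sign convention of b differs from the slide's sign(w·x - b).

```python
import numpy as np
from cvxopt import matrix, solvers

def fit_dual_svm(X, y, C=1.0):
    R = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)       # Q_kl = y_k y_l (x_k . x_l)
    P, q = matrix(Q), matrix(-np.ones(R))           # minimise 1/2 a'Qa - 1'a
    G = matrix(np.vstack([-np.eye(R), np.eye(R)]))  # 0 <= a_k <= C
    h = matrix(np.hstack([np.zeros(R), C * np.ones(R)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)    # sum_k a_k y_k = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    w = (alpha * y) @ X                             # w = sum_k a_k y_k x_k
    sv = alpha > 1e-6                               # the support vectors
    b0 = float(np.mean(y[sv] - X[sv] @ w))          # simple offset estimate from the SVs
    return w, b0, alpha

X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b0, alpha = fit_dual_svm(X, y, C=10.0)
print("w =", w, "b =", b0, "alphas =", np.round(alpha, 3))
```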
55
Example: the XOR problem revisited. Let the nonlinear mapping be φ(x) = (1, x_1², √2·x_1x_2, x_2², √2·x_1, √2·x_2)ᵀ, and likewise φ(x_i) = (1, x_{i1}², √2·x_{i1}x_{i2}, x_{i2}², √2·x_{i1}, √2·x_{i2})ᵀ. Therefore the feature space is 6-dimensional, with input data in 2D: x_1 = (-1,-1), d_1 = -1; x_2 = (-1, 1), d_2 = 1; x_3 = (1,-1), d_3 = 1; x_4 = (1, 1), d_4 = -1.
56
Q(a) = Σ_i a_i - ½ Σ_i Σ_j a_i a_j d_i d_j φ(x_i)ᵀφ(x_j)
     = a_1 + a_2 + a_3 + a_4 - ½(9a_1a_1 - 2a_1a_2 - 2a_1a_3 + 2a_1a_4 + 9a_2a_2 + 2a_2a_3 - 2a_2a_4 + 9a_3a_3 - 2a_3a_4 + 9a_4a_4).
To maximize Q we only need to set ∂Q/∂a_i = 0 (due to the optimality conditions), which gives
1 = 9a_1 - a_2 - a_3 + a_4
1 = -a_1 + 9a_2 + a_3 - a_4
1 = -a_1 + a_2 + 9a_3 - a_4
1 = a_1 - a_2 - a_3 + 9a_4
57
The solution of this system gives the optimal values a_{0,1} = a_{0,2} = a_{0,3} = a_{0,4} = 1/8. Then w_0 = Σ_i a_{0,i} d_i φ(x_i) = 1/8 [-φ(x_1) + φ(x_2) + φ(x_3) - φ(x_4)], where the first element of w_0 gives the bias b.
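As a quick check of this result, here is a short sketch (numpy assumed) that solves the 4×4 system above and then assembles w_0 from the feature map φ of the earlier slide:

```python
import numpy as np

A = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)
a = np.linalg.solve(A, np.ones(4))
print(a)            # [0.125 0.125 0.125 0.125], i.e. every a_i = 1/8

phi = lambda x1, x2: np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                               np.sqrt(2)*x1, np.sqrt(2)*x2])
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
d = np.array([-1, 1, 1, -1])
w0 = sum(ai * di * phi(*xi) for ai, di, xi in zip(a, d, X))
print(w0)           # only the sqrt(2)*x1*x2 component survives, so w0.phi(x) = -x1*x2
```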
58
From earlier we have that the optimal hyperplane is defined by w_0ᵀφ(x) = 0. That is, w_0ᵀφ(x) = -x_1x_2 = 0, which is the optimal decision boundary for the XOR problem. Furthermore, we note that the solution is unique, since the optimal decision boundary is unique.
59
Output for polynomial and RBF kernels. (Figure: the resulting decision surfaces.)
60
Harder 1-dimensional dataset. Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too. (Figure: 1-D datapoints along an axis marked x = 0.)
61
For a non-linearly separable problem we have to first map the data into a feature space in which they are linearly separable: x_i → φ(x_i). Given the training data sample {(x_i, y_i), i = 1, …, N}, find the optimum values of the weight vector w and bias b, where w = Σ_i a_{0,i} y_i φ(x_i) and the a_{0,i} are the optimal Lagrange multipliers determined by maximizing the objective function Q(a) = Σ_i a_i - ½ Σ_i Σ_j a_i a_j y_i y_j φ(x_i)ᵀφ(x_j), subject to the constraints Σ_i a_i y_i = 0 and a_i >= 0.
62
SVM building procedure:
1. Pick a nonlinear mapping φ.
2. Solve for the optimal weight vector.
However: how do we pick the function φ? In practical applications, if it is not totally impossible to find, it is very hard. In the previous example the function φ is quite complex: how would we find it? Answer: the kernel trick.
63
Notice that in the dual problem the images of the input vectors are involved only through inner products φ(x_i)ᵀφ(x_j). This means the optimization can be performed in the (lower-dimensional) input space, with the inner product replaced by an inner-product kernel. How do we relate the output of the SVM to the kernel K? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.
66
In the XOR problem, we chose to use the kernel function K(x, x_i) = (xᵀx_i + 1)² = 1 + x_1²x_{i1}² + 2x_1x_2x_{i1}x_{i2} + x_2²x_{i2}² + 2x_1x_{i1} + 2x_2x_{i2}, which implied the form of our nonlinear functions φ(x) = (1, x_1², √2·x_1x_2, x_2², √2·x_1, √2·x_2)ᵀ and φ(x_i) = (1, x_{i1}², √2·x_{i1}x_{i2}, x_{i2}², √2·x_{i1}, √2·x_{i2})ᵀ. However, we did not need to calculate φ at all: we could simply have used the kernel to calculate Q(a) = Σ_i a_i - ½ Σ_i Σ_j a_i a_j d_i d_j K(x_i, x_j), maximized it and solved for the a_i, and derived the hyperplane via Σ_i a_i d_i K(x_i, x) = 0.
67
We therefore only need a suitable choice of kernel function. Cf. Mercer's theorem: let K(x,y) be a continuous symmetric kernel defined on the closed interval [a,b]; the kernel K can be expanded in the form K(x,y) = φ(x)ᵀφ(y) provided it is positive definite. Some of the usual choices for K are:
Polynomial SVM: (xᵀx_i + 1)^p, with p specified by the user.
RBF SVM: exp(-||x - x_i||²/(2σ²)), with σ specified by the user.
MLP SVM: tanh(s_0 xᵀx_i + s_1).
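The three kernels in this list are easy to write down directly; the sketch below (numpy assumed) is just those formulas as functions, with illustrative parameter values.

```python
import numpy as np

def poly_kernel(x, xi, p=2):
    return (np.dot(x, xi) + 1) ** p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):
    # note: the tanh "kernel" satisfies Mercer's condition only for some (s0, s1)
    return np.tanh(s0 * np.dot(x, xi) + s1)

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xi), rbf_kernel(x, xi), mlp_kernel(x, xi))
```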
68
An Equivalent Dual QP. Maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where Q_kl = y_k y_l (x_k·x_l), subject to 0 <= α_k <= C for all k and Σ_k α_k y_k = 0. Then define w = Σ_k α_k y_k x_k and classify with f(x,w,b) = sign(w·x - b). Datapoints with α_k > 0 will be the support vectors, so this sum only needs to be over the support vectors.
69
Quadratic Basis Functions. Φ(x) consists of: the constant term 1; the linear terms √2·x_1, …, √2·x_m; the pure quadratic terms x_1², …, x_m²; and the quadratic cross-terms √2·x_1x_2, …, √2·x_{m-1}x_m. Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2. You may be wondering what those √2's are doing. You should be happy that they do no harm; you'll find out why they're there soon.
70
QP with basis functions. Maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where Q_kl = y_k y_l (Φ(x_k)·Φ(x_l)), subject to 0 <= α_k <= C for all k and Σ_k α_k y_k = 0. Then define w = Σ_{k: α_k > 0} α_k y_k Φ(x_k) and classify with f(x,w,b) = sign(w·Φ(x) - b).
71
QP with basis functions. Maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where Q_kl = y_k y_l (Φ(x_k)·Φ(x_l)), subject to the same constraints; classify with f(x,w,b) = sign(w·Φ(x) - b). We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications, so the whole thing costs R²m²/4. Yeeks! … or does it?
72
Quadratic Dot Products. Φ(a)·Φ(b) = 1 + 2 Σ_i a_ib_i + Σ_i a_i²b_i² + 2 Σ_{i<j} a_ia_jb_ib_j (the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms, multiplied pairwise and summed).
73
Quadratic Dot Products. Just out of casual, innocent interest, let's look at another function of a and b: (a·b + 1)² = (a·b)² + 2(a·b) + 1 = Σ_i a_i²b_i² + 2 Σ_{i<j} a_ia_jb_ib_j + 2 Σ_i a_ib_i + 1.
74
Quadratic Dot Products. Compare that expansion of (a·b + 1)² with Φ(a)·Φ(b): they're the same! And (a·b + 1)² is only O(m) to compute.
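A quick numerical check of the identity (numpy assumed): the explicit quadratic feature map with its √2 factors and the kernel (a·b + 1)² give the same dot product.

```python
import numpy as np

def phi(v):
    """Quadratic feature map: constant, linear (sqrt 2), pure quadratic, cross-terms (sqrt 2)."""
    m = len(v)
    cross = [np.sqrt(2) * v[i] * v[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * v, v ** 2, cross])

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.dot(phi(a), phi(b)))    # computed from the O(m^2) explicit features
print((np.dot(a, b) + 1) ** 2)   # the same number from O(m) work
```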
75
QP with Quadratic basis functions. Maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where now Q_kl = y_k y_l (x_k·x_l + 1)², subject to 0 <= α_k <= C for all k and Σ_k α_k y_k = 0; classify with f(x,w,b) = sign(w·Φ(x) - b). We must do R²/2 dot products to get this matrix ready, but each dot product now only requires m additions and multiplications.
76
Higher Order Polynomials
Quadratic: Φ(x) has all m²/2 terms up to degree 2. Cost to build the Q_kl matrix traditionally: m²R²/4 (2,500 R² for 100 inputs). Using Φ(a)·Φ(b) = (a·b+1)²: m R²/2 (50 R² for 100 inputs).
Cubic: Φ(x) has all m³/6 terms up to degree 3. Traditional cost: m³R²/12 (83,000 R² for 100 inputs). Using (a·b+1)³: m R²/2 (50 R²).
Quartic: Φ(x) has all m⁴/24 terms up to degree 4. Traditional cost: m⁴R²/48 (1,960,000 R² for 100 inputs). Using (a·b+1)⁴: m R²/2 (50 R²).
77
QP with Quintic basis functions. Maximize Σ_k α_k - ½ Σ_k Σ_l α_k α_l Q_kl, where Q_kl = y_k y_l (x_k·x_l + 1)⁵, subject to 0 <= α_k <= C for all k and Σ_k α_k y_k = 0. Then define w = Σ_{k: α_k > 0} α_k y_k Φ(x_k) and classify with f(x,w,b) = sign(w·Φ(x) - b).
78
QP with Quintic basis functions. We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million. But there are still worrying things lurking away. What are they? (1) The fear of overfitting with this enormous number of terms. (2) The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?).
79
QP with Quintic basis functions. The two worries: (1) the fear of overfitting with this enormous number of terms – the use of maximum margin magically makes this not a problem; (2) the evaluation phase will be very expensive, because each w·Φ(x) needs 75 million operations. What can be done?
80
QP with Quintic basis functions. For the evaluation cost: w·Φ(x) = Σ_{k: α_k > 0} α_k y_k (Φ(x_k)·Φ(x)) = Σ_{k: α_k > 0} α_k y_k (x_k·x + 1)⁵, which needs only Sm operations (S = # support vectors).
82
QP with Quintic basis functions. Why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters, α_1, α_2, …, α_R, and usually most are set to zero by the maximum margin. Asking for small w·w is like "weight decay" in neural nets, like the ridge parameter in ridge regression, and like the use of priors in Bayesian regression; all are designed to smooth the function and reduce overfitting.
83
SVM Kernel Functions. K(a,b) = (a·b + 1)^d is an example of an SVM kernel function. Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function:
– Radial-basis-style kernel function: K(a,b) = exp(-||a - b||² / (2σ²)).
– Neural-net-style kernel function: K(a,b) = tanh(κ a·b - δ).
Here σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM.
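The slide leaves "chosen by CV" abstract; here is one minimal way it is often done in practice, sketched with scikit-learn (an assumption of this note) on an invented XOR-like toy problem: grid-search the kernel and its magic parameters by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels: not linearly separable

param_grid = [
    {"kernel": ["poly"], "degree": [2, 3, 4], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1.0, 10.0], "C": [0.1, 1, 10]},  # gamma ~ 1/(2 sigma^2)
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```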
84
SVM Implementations
– Sequential Minimal Optimization (SMO): Platt's efficient implementation of SVM training; available in Weka.
– SVM light: http://svmlight.joachims.org/
85
References
Tutorial on VC-dimension and support vector machines: C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
The VC/SRM/SVM bible: Vladimir Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.