Support Vector Machine: An Introduction
(C) 2001-2005 by Yu Hen Hu

Linear Hyper-plane Classifier
Given: {(x_i, d_i); i = 1 to N, d_i ∈ {+1, −1}}. A linear hyper-plane classifier is a hyper-plane consisting of the points x such that H = {x | g(x) = w^T x + b = 0}, where g(x) is a discriminant function.
For x on the o side of H: w^T x + b ≥ 0, and d = +1. For x on the other side: w^T x + b < 0, and d = −1.
Distance from x to H: r = w^T x/|w| − (−b/|w|) = g(x)/|w|.
[Figure: two classes in the (x_1, x_2) plane, the hyper-plane H with normal vector w at offset −b/|w| from the origin, and the distance r from a sample x to H.]

Distance from a Point to a Hyper-plane
The hyper-plane H is characterized by w^T x + b = 0 (*), where w is the normal vector perpendicular to H. Equation (*) says that any vector x on H, projected onto w, has length OA = −b/|w|.
Consider a special point C corresponding to a vector x*. The magnitude of its projection onto w is w^T x*/|w| = OA + BC, or equivalently w^T x*/|w| = −b/|w| + r.
Hence r = (w^T x* + b)/|w| = g(x*)/|w|. If x* is on the other side of H (the same side as the origin), then r = −(w^T x* + b)/|w| = −g(x*)/|w|.
[Figure: point x* at distance r from H, with projection points A, B, C along the normal direction w.]
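As a small illustration (not part of the original slides; the numbers are made up), the discriminant value and the signed distance can be computed directly:

    % A minimal sketch with hypothetical numbers: signed distance from a point
    % to the hyper-plane H = {x : w'*x + b = 0}.
    w = [2; 1];            % normal vector of H
    b = -3;                % offset
    x = [3; 2];            % query point
    g = w'*x + b;          % discriminant function g(x)
    r = g / norm(w);       % signed distance; positive on the side w points toward
    fprintf('g(x) = %g, r = %g\n', g, r);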

Optimal Hyper-plane: Linearly Separable Case
For d_i = +1: g(x_i) = w^T x_i + b ≥ ρ|w|, or equivalently w_o^T x_i + b_o ≥ 1.
For d_i = −1: g(x_i) = w^T x_i + b ≤ −ρ|w|, or equivalently w_o^T x_i + b_o ≤ −1.
Here ρ is the half-width of the separation gap (see the next slide). The optimal hyper-plane should lie in the center of the gap.
Support vectors: the samples on the boundaries of the gap. The support vectors alone determine the optimal hyper-plane.
Question: how do we find the optimal hyper-plane?
[Figure: two classes in the (x_1, x_2) plane separated by a gap.]

Separation Gap
For x_i a support vector:
For d_i = +1: g(x_i) = w^T x_i + b = ρ|w|, or equivalently w_o^T x_i + b_o = 1.
For d_i = −1: g(x_i) = w^T x_i + b = −ρ|w|, or equivalently w_o^T x_i + b_o = −1.
Hence w_o = w/(ρ|w|) and b_o = b/(ρ|w|). But the distance from x_i to the hyper-plane is ρ = g(x_i)/|w|; thus w_o = w/g(x_i) and ρ = 1/|w_o|. The maximum distance between the two classes is 2ρ = 2/|w_o|.
The objective is to find w_o, b_o that minimize |w_o| (so that ρ is maximized) subject to the constraints w_o^T x_i + b_o ≥ 1 for d_i = +1, and w_o^T x_i + b_o ≤ −1 for d_i = −1. Combining these constraints, one has: d_i(w_o^T x_i + b_o) ≥ 1.

Quadratic Optimization Problem Formulation
Given {(x_i, d_i); i = 1 to N}, find w and b such that Φ(w) = w^T w/2 is minimized subject to the N constraints d_i(w^T x_i + b) − 1 ≥ 0, 1 ≤ i ≤ N.
Method of Lagrange multipliers: introduce multipliers α_i ≥ 0 and form
J(w, b, α) = w^T w/2 − Σ_{i=1}^{N} α_i [d_i(w^T x_i + b) − 1].
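This primal problem can be handed directly to a quadratic-programming solver. The sketch below is not from the slides; it assumes the Matlab Optimization Toolbox (quadprog) and a small linearly separable 2-D data set invented for the example.

    % Primal hard-margin SVM as a QP over z = [w; b]:  minimize 0.5*w'*w
    % subject to d_i*(w'*x_i + b) >= 1, written as A*z <= -1.
    X = [1 1; 2 2; 2 0; 4 4; 5 3; 4 6];      % hypothetical training inputs (one per row)
    d = [-1; -1; -1; 1; 1; 1];               % class labels
    N = size(X,1);
    H = blkdiag(eye(2), 0);                  % quadratic term: only w is penalized, not b
    f = zeros(3,1);
    A = -(d .* [X, ones(N,1)]);              % row i is -d_i*[x_i', 1]
    bvec = -ones(N,1);
    z = quadprog(H, f, A, bvec);
    w = z(1:2);  b = z(3);
    fprintf('w = [%g %g], b = %g, margin = %g\n', w(1), w(2), b, 2/norm(w));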

Optimization (continued)
The solution of the Lagrange multiplier problem is at a saddle point: the minimum is sought with respect to w and b, while the maximum is sought with respect to α_i.
Kuhn-Tucker condition: at the saddle point, α_i [d_i(w^T x_i + b) − 1] = 0 for 1 ≤ i ≤ N.
If x_i is NOT a support vector, the corresponding α_i = 0! Hence only the support vectors affect the result of the optimization!

A Numerical Example
Training samples (1-D): (x, d) = (1, −1), (2, +1), (3, +1).
The three inequality constraints are: 1·w + b ≤ −1; 2·w + b ≥ +1; 3·w + b ≥ +1.
J = w²/2 − α_1(−w − b − 1) − α_2(2w + b − 1) − α_3(3w + b − 1)
∂J/∂w = 0 ⇒ w = −α_1 + 2α_2 + 3α_3
∂J/∂b = 0 ⇒ 0 = −α_1 + α_2 + α_3
The Kuhn-Tucker condition implies: (a) α_1(−w − b − 1) = 0; (b) α_2(2w + b − 1) = 0; (c) α_3(3w + b − 1) = 0.
Later we will see that the solution is α_1 = α_2 = 2 and α_3 = 0. This yields w = 2, b = −3. Hence the decision boundary is 2x − 3 = 0, or x = 1.5, shown as the dashed line in the slide's figure.
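A quick numerical check of this example (a sketch, not from the slides) plugs the stated multipliers into the stationarity and Kuhn-Tucker conditions:

    % Verify the 1-D example: samples x = [1 2 3], labels d = [-1 1 1],
    % claimed multipliers alpha = [2 2 0].
    x = [1; 2; 3];  d = [-1; 1; 1];  alpha = [2; 2; 0];
    w = sum(alpha .* d .* x);                  % dJ/dw = 0  =>  w = 2
    assert(abs(sum(alpha .* d)) < 1e-12);      % dJ/db = 0
    b = d(1) - w*x(1);                         % active constraint on the support vector x = 1
    slack = d .* (w*x + b) - 1;                % d_i*(w*x_i + b) - 1 for each sample
    disp([w, b]);                              % expected: 2  -3
    disp(slack');                              % expected: 0  0  2  (x = 3 is not a support vector)
    assert(all(abs(alpha .* slack) < 1e-12));  % complementary slackness holds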

Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem in the Lagrange multipliers can be formulated that provides the solution.
Duality theorem (Bertsekas 1995):
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) For w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o).

Formulating the Dual Problem
At the saddle point, ∂J/∂w = 0 and ∂J/∂b = 0 give
w = Σ_{i=1}^{N} α_i d_i x_i and Σ_{i=1}^{N} α_i d_i = 0.
Substituting these relations into J(w, b, α), we obtain the dual problem:
Maximize Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
subject to Σ_{i=1}^{N} α_i d_i = 0 and α_i ≥ 0 for i = 1, 2, …, N.
Note that the dual depends on the training data only through the inner products x_i^T x_j.
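The dual is itself a QP and can be solved numerically. The sketch below is not from the slides; it reuses quadprog on the same hypothetical 2-D data as above and recovers w and b from the multipliers.

    % Hard-margin dual:  maximize sum(alpha) - 0.5*alpha'*H*alpha
    % with H_ij = d_i*d_j*x_i'*x_j, subject to d'*alpha = 0 and alpha >= 0.
    X = [1 1; 2 2; 2 0; 4 4; 5 3; 4 6];       % hypothetical training inputs (one per row)
    d = [-1; -1; -1; 1; 1; 1];
    N = size(X,1);
    H = (d*d') .* (X*X');                     % Gram matrix weighted by the labels
    f = -ones(N,1);                           % quadprog minimizes, so negate the linear term
    alpha = quadprog(H, f, [], [], d', 0, zeros(N,1), []);
    w = X' * (alpha .* d);                    % w = sum_i alpha_i d_i x_i
    sv = find(alpha > 1e-6);                  % support vectors have alpha_i > 0
    b = mean(d(sv) - X(sv,:)*w);              % from d_i*(w'*x_i + b) = 1 on the support vectors
    fprintf('support vectors: %s, b = %g\n', mat2str(sv'), b);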

Numerical Example (cont'd)
For the 1-D example, the dual objective is
Q(α) = α_1 + α_2 + α_3 − [0.5α_1² + 2α_2² + 4.5α_3² − 2α_1α_2 − 3α_1α_3 + 6α_2α_3]
subject to the constraints −α_1 + α_2 + α_3 = 0, and α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0.
Use the Matlab Optimization Toolbox command: x = fmincon('qalpha', X0, A, B, Aeq, Beq).
The solution is [α_1 α_2 α_3] = [2 2 0], as expected.
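A runnable version of that command might look like the following sketch. The slides' qalpha function is not shown, so the negated dual objective is written here as an anonymous function; the data and starting point are the ones from the example.

    % Dual of the 1-D example solved with fmincon (fmincon minimizes, so return -Q(alpha)).
    qalpha = @(a) -(sum(a) - 0.5*(-a(1) + 2*a(2) + 3*a(3))^2);
    X0  = [1; 1; 1];                 % starting point (need not be feasible)
    Aeq = [-1 1 1];  Beq = 0;        % equality constraint -a1 + a2 + a3 = 0
    lb  = zeros(3,1);                % a_i >= 0, expressed as lower bounds
    alpha = fmincon(qalpha, X0, [], [], Aeq, Beq, lb, []);
    disp(alpha');                    % expected: 2  2  0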

Implication of Minimizing ||w||
Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x_1, x_2, …, x_N}. The set of optimal hyper-planes described by the equation w_o^T x + b_o = 0 has a VC-dimension h bounded from above as
h ≤ min{⌈D²/ρ²⌉, m_0} + 1,
where m_0 is the dimension of the input vectors and ρ = 2/||w_o|| is the margin of separation of the hyper-planes.
The VC-dimension determines the complexity of the classifier structure; usually, the smaller the better.

Non-separable Cases
Recall that in the linearly separable case, each training sample pair (x_i, d_i) represents a linear inequality constraint
d_i(w^T x_i + b) ≥ 1, i = 1, 2, …, N. (*)
If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:
d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, …, N, (**)
where {ξ_i; 1 ≤ i ≤ N} are known as slack variables. Note that (*) is a normalized version of d_i g(x_i)/|w| ≥ ρ. With the slack variable ξ_i, that inequality becomes d_i g(x_i)/|w| ≥ ρ(1 − ξ_i). Hence, with slack variables, we allow some samples x_i to fall within the gap. Moreover, if ξ_i > 1, then the corresponding (x_i, d_i) is mis-classified, because the sample falls on the wrong side of the hyper-plane H.

Non-Separable Case
Since ξ_i > 1 implies mis-classification, the cost function must include a term that minimizes the number of mis-classified samples:
Φ(w, ξ) = w^T w/2 + λ Σ_{i=1}^{N} I(ξ_i > 1),
where I(·) is the indicator function and λ is a Lagrange multiplier. But this formulation is non-convex, and a solution is difficult to find using existing nonlinear optimization algorithms. Hence, we may instead use an approximated cost function
Φ(w, ξ) = w^T w/2 + C Σ_{i=1}^{N} ξ_i.
With this approximated cost function, the goal is to maximize ρ (minimize ||w||) while minimizing the ξ_i (≥ 0):
ξ_i = 0 (not counted): x_i is outside the gap and on the correct side;
0 < ξ_i < 1: x_i is inside the gap, but on the correct side;
ξ_i > 1: x_i is on the wrong side (inside or outside the gap).

Primal Problem Formulation
Primal optimization problem: given {(x_i, d_i); 1 ≤ i ≤ N}, find w and b such that
Φ(w, ξ) = w^T w/2 + C Σ_{i=1}^{N} ξ_i
is minimized subject to the constraints (i) ξ_i ≥ 0 and (ii) d_i(w^T x_i + b) ≥ 1 − ξ_i for i = 1, 2, …, N.
Using α_i and β_i as Lagrange multipliers, the unconstrained cost function becomes
J(w, b, ξ, α, β) = w^T w/2 + C Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} α_i [d_i(w^T x_i + b) − 1 + ξ_i] − Σ_{i=1}^{N} β_i ξ_i.
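This soft-margin primal is again a QP, now over the stacked variable [w; b; ξ]. The sketch below is not from the slides; the small 2-D data set (with one deliberately overlapping point) and the value of C are invented for the example.

    % Soft-margin primal over z = [w; b; xi]:
    %   minimize 0.5*w'*w + C*sum(xi)
    %   s.t.     d_i*(w'*x_i + b) >= 1 - xi_i,   xi_i >= 0
    X = [1 1; 2 2; 2 0; 4 4; 5 3; 1.5 1.5];  % last point overlaps the other class
    d = [-1; -1; -1; 1; 1; 1];
    C = 10;                                  % user-chosen penalty on the slack variables
    [N, m] = size(X);
    H = blkdiag(eye(m), 0, zeros(N));        % only w enters the quadratic term
    f = [zeros(m+1,1); C*ones(N,1)];
    A = [-(d .* [X, ones(N,1)]), -eye(N);    % -d_i*(w'*x_i + b) - xi_i <= -1
          zeros(N, m+1),         -eye(N)];   % -xi_i <= 0
    bvec = [-ones(N,1); zeros(N,1)];
    z = quadprog(H, f, A, bvec);
    w = z(1:m);  b = z(m+1);  xi = z(m+2:end);
    fprintf('b = %g, total slack = %g\n', b, sum(xi));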

Dual Problem Formulation
Note that at the saddle point, ∂J/∂ξ_i = 0 gives α_i + β_i = C.
Dual optimization problem: given {(x_i, d_i); 1 ≤ i ≤ N}, find the Lagrange multipliers {α_i; 1 ≤ i ≤ N} such that
Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
is maximized subject to the constraints (i) 0 ≤ α_i ≤ C (C is a user-specified positive number) and (ii) Σ_{i=1}^{N} α_i d_i = 0.
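A sketch of solving this box-constrained dual with quadprog (not from the slides; same hypothetical data as above), including the recovery of w and b described on the next slide:

    % Soft-margin dual: maximize sum(alpha) - 0.5*alpha'*H*alpha
    % subject to d'*alpha = 0 and 0 <= alpha_i <= C.
    X = [1 1; 2 2; 2 0; 4 4; 5 3; 1.5 1.5];
    d = [-1; -1; -1; 1; 1; 1];
    C = 10;
    N = size(X,1);
    H = (d*d') .* (X*X');
    f = -ones(N,1);
    alpha = quadprog(H, f, [], [], d', 0, zeros(N,1), C*ones(N,1));
    w = X' * (alpha .* d);                          % w_o = sum_i alpha_i d_i x_i
    Io = find(alpha > 1e-6 & alpha < C - 1e-6);     % margin support vectors: 0 < alpha_i < C
    if isempty(Io), Io = find(alpha > 1e-6); end    % fallback in degenerate cases
    b = mean(d(Io) - X(Io,:)*w);                    % average of d_i - w_o'*x_i over I_o
    yhat = sign(X*w + b);                           % decisions on the training set
    fprintf('b = %g, training errors = %d\n', b, sum(yhat ~= d));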

Solution to the Dual Problem
By the Karush-Kuhn-Tucker conditions, for i = 1, 2, …, N:
(i) α_i [d_i(w^T x_i + b) − 1 + ξ_i] = 0 (*)
(ii) β_i ξ_i = 0
At the optimal point, α_i + β_i = C. Thus, one may deduce that:
if 0 < α_i < C, then ξ_i = 0 and d_i(w^T x_i + b) = 1;
if α_i = C, then ξ_i ≥ 0 and d_i(w^T x_i + b) = 1 − ξ_i ≤ 1;
if α_i = 0, then d_i(w^T x_i + b) ≥ 1, and x_i is not a support vector.
Finally, the optimal solutions are
w_o = Σ_{i=1}^{N} α_i d_i x_i and b_o = (1/|I_o|) Σ_{i∈I_o} (d_i − w_o^T x_i), where I_o = {i; 0 < α_i < C}.

Inner Product Kernels
In general, the input is first transformed via a set of nonlinear functions {φ_j(x)} and then subjected to the hyper-plane classifier
g(x) = w^T φ(x) + b, where φ(x) = [φ_0(x), φ_1(x), …, φ_p(x)]^T.
Define the inner product kernel as
K(x, y) = φ^T(x) φ(y) = Σ_{j=0}^{p} φ_j(x) φ_j(y).
One then obtains a dual optimization problem of the same form as before, with x_i^T x_j replaced by K(x_i, x_j):
Maximize Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j K(x_i, x_j), subject to the same constraints.
Often, dim of φ (= p + 1) >> dim of x!

Polynomial Kernel
Consider the polynomial kernel K(x, y) = (1 + x^T y)². Writing K(x, y) = φ^T(x) φ(y), one has
φ(x) = [1, x_1², …, x_m², √2 x_1, …, √2 x_m, √2 x_1x_2, …, √2 x_1x_m, √2 x_2x_3, …, √2 x_2x_m, …, √2 x_{m−1}x_m]
     = [1, φ_1(x), …, φ_p(x)],
where p + 1 = 1 + m + m + (m − 1) + (m − 2) + … + 1 = (m + 2)(m + 1)/2.
Hence, using a kernel, a low-dimensional pattern classification problem (with dimension m) is solved in a higher-dimensional space (dimension p + 1). But only the φ_j(x) corresponding to support vectors are used for pattern classification!
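The identity K(x, y) = φ^T(x) φ(y) is easy to check numerically. The sketch below (not from the slides) does so for m = 3, where the feature space has dimension (3 + 2)(3 + 1)/2 = 10:

    % Verify that (1 + x'*y)^2 equals phi(x)'*phi(y) for m = 3.
    phi = @(x) [1; x(1)^2; x(2)^2; x(3)^2; sqrt(2)*x(1); sqrt(2)*x(2); sqrt(2)*x(3); ...
                sqrt(2)*x(1)*x(2); sqrt(2)*x(1)*x(3); sqrt(2)*x(2)*x(3)];
    x = randn(3,1);  y = randn(3,1);
    K_direct  = (1 + x'*y)^2;          % kernel evaluated in the input space
    K_feature = phi(x)' * phi(y);      % inner product in the 10-dimensional feature space
    fprintf('difference = %g\n', K_direct - K_feature);   % ~0 up to round-off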

Numerical Example: XOR Problem
Training samples: (−1, −1; −1), (−1, +1; +1), (+1, −1; +1), (+1, +1; −1), where x = [x_1, x_2]^T.
Use K(x, y) = (1 + x^T y)²; one has
φ(x) = [1, x_1², x_2², √2 x_1, √2 x_2, √2 x_1x_2]^T.
Note that dim[φ(x)] = 6 > dim[x] = 2! Dim(K) = N_s = the number of support vectors.
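The 4 x 4 kernel matrix for these training samples can be computed directly (a sketch, not from the slides):

    % Kernel matrix for the XOR training set with K(x,y) = (1 + x'*y)^2.
    X = [-1 -1; -1 1; 1 -1; 1 1];     % training inputs, one per row
    d = [-1; 1; 1; -1];               % XOR labels
    K = (1 + X*X').^2;                % element-wise square of 1 + pairwise inner products
    disp(K)                           % 9 on the diagonal, 1 elsewhere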

XOR Problem (Continued)
Note that K(x_i, x_j) can be calculated directly without using φ!
The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]^T. Hence the hyper-plane is
y = w^T φ(x) = −x_1 x_2.

(x_1, x_2):   (−1, −1)  (−1, +1)  (+1, −1)  (+1, +1)
y = −x_1x_2:     −1        +1        +1        −1
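These multipliers and the resulting decision function can be checked numerically (a sketch, not from the slides), reusing quadprog on the kernelized dual:

    % Kernelized dual for XOR: maximize sum(alpha) - 0.5*alpha'*((d*d').*K)*alpha.
    X = [-1 -1; -1 1; 1 -1; 1 1];
    d = [-1; 1; 1; -1];
    K = (1 + X*X').^2;
    N = numel(d);
    alpha = quadprog((d*d').*K, -ones(N,1), [], [], d', 0, zeros(N,1), []);
    disp(alpha')                                   % expected: 0.125 0.125 0.125 0.125
    % Decision function f(x) = sum_i alpha_i d_i K(x, x_i); compare with -x1*x2.
    f = @(x) sum(alpha .* d .* (1 + X*x).^2);      % x is a 2-by-1 column vector
    for t = [-1 -1; -1 1; 1 -1; 1 1]'              % loop over the four corners
        fprintf('(%2d,%2d): f = %g, -x1*x2 = %g\n', t(1), t(2), f(t), -t(1)*t(2));
    end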

Other Types of Kernels
Polynomial learning machine: K(x, y) = (x^T y + 1)^p, with p selected a priori.
Radial basis function: K(x, y) = exp(−|x − y|²/(2σ²)), with σ² selected a priori.
Two-layer perceptron: K(x, y) = tanh(β_0 x^T y + β_1), where only some β_0 and β_1 values are feasible.
What kernel is feasible? It must satisfy "Mercer's theorem"!

Mercer's Theorem
Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits the eigen-function expansion
K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y), with λ_i > 0 for each i.
This expansion converges absolutely and uniformly if and only if
∫_a^b ∫_a^b K(x, y) ψ(x) ψ(y) dx dy ≥ 0
for all ψ(x) such that ∫_a^b ψ²(x) dx < ∞.

Testing with Kernels
For many types of kernels, φ(x) cannot be explicitly represented or even found. However, since w = Σ_{i=1}^{N} α_i d_i φ(x_i),
g(x) = w^T φ(x) + b = Σ_{i=1}^{N} α_i d_i K(x_i, x) + b.
Hence there is no need to know φ(x) explicitly!
For example, in the XOR problem the weight vector on the kernel values is f = (1/8)[−1 +1 +1 −1]^T. Suppose that x = (−1, +1). Then the kernel values are [K(x_1, x), …, K(x_4, x)]^T = [1, 9, 1, 1]^T, and (with b = 0) g(x) = (1/8)(−1 + 9 + 1 − 1) = +1, the correct class.
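The same evaluation in code (a sketch, not from the slides): classify a test point using only kernel evaluations against the training samples.

    % Classify a test point in the XOR problem using only kernel evaluations.
    X = [-1 -1; -1 1; 1 -1; 1 1];          % training inputs
    d = [-1; 1; 1; -1];                    % labels
    alpha = 0.125*ones(4,1);               % multipliers from the dual solution
    xtest = [-1; 1];
    k = (1 + X*xtest).^2;                  % K(x_i, x) for each training sample
    g = sum(alpha .* d .* k);              % g(x) = sum_i alpha_i d_i K(x_i, x), b = 0 here
    fprintf('g(x) = %g -> class %d\n', g, sign(g));   % expected: +1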

SVM Using Nonlinear Kernels
Using a kernel, low-dimensional feature vectors are mapped into a high-dimensional (possibly infinite-dimensional) kernel feature space, where the data are likely to be linearly separable.
[Figure: block diagram of the SVM architecture: the inputs x_1, …, x_N pass through the nonlinear transform φ_0, …, φ_P; the kernel evaluations K(x, x_j) are weighted by W and summed to produce the output f.]
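As a closing sketch (not from the slides), the whole pipeline is also available as a single library call. This assumes MATLAB's Statistics and Machine Learning Toolbox, which these slides do not use; the test points are invented.

    % Train a nonlinear-kernel SVM on the XOR data and classify new points.
    X = [-1 -1; -1 1; 1 -1; 1 1];
    d = [-1; 1; 1; -1];
    Mdl = fitcsvm(X, d, 'KernelFunction', 'polynomial', 'PolynomialOrder', 2, ...
                  'BoxConstraint', 10);                 % soft-margin penalty C = 10
    Xtest = [-0.8 0.9; 0.7 0.6];                        % hypothetical test points
    yhat = predict(Mdl, Xtest);
    disp(yhat')                                         % expected: 1  -1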