Support Vector Machine: An Introduction
Linear Hyper-plane Classifier
Given: {(x_i, d_i); i = 1 to N, d_i ∈ {+1, −1}}.
A linear hyper-plane classifier is a hyper-plane consisting of points x such that
H = {x | g(x) = w^T x + b = 0}.
g(x) is a discriminant function.
For x on the side of ○: w^T x + b ≥ 0, d = +1. For x on the other side: w^T x + b < 0, d = −1.
Distance from x to H: r = w^T x/|w| − (−b/|w|) = g(x)/|w|.
[Figure: the hyper-plane H at distance −b/|w| from the origin in the (x_1, x_2) plane, with normal vector w and a point x at distance r from H.]
Distance from a Point to a Hyper-plane
The hyper-plane H is characterized by w^T x + b = 0 (*), where w is the normal vector perpendicular to H.
(*) says that any vector x on H, projected onto w, has length OA = −b/|w|.
Consider a particular point C corresponding to a vector x*. The magnitude of its projection onto w is w^T x*/|w| = OA + BC, or equivalently, w^T x*/|w| = −b/|w| + r.
Hence r = (w^T x* + b)/|w| = g(x*)/|w|.
If x* is on the other side of H (the same side as the origin), then r = −(w^T x* + b)/|w| = −g(x*)/|w|.
[Figure: the point x* at distance r from H, with A, B, C marking projections along the normal vector w.]
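As a quick illustration of the distance formula, the following sketch evaluates r = g(x)/|w| for a 2-D hyper-plane; the particular values of w, b, and x are assumptions chosen only for the example.

    % Minimal sketch: signed distance from a point to the hyper-plane
    % H = {x : w'*x + b = 0}, computed as r = g(x)/|w|.
    % The values of w, b, and x below are arbitrary example choices.
    w = [3; 4];  b = -5;            % hyper-plane parameters
    g = @(x) w'*x + b;              % discriminant function g(x)
    x = [2; 1];
    r = g(x) / norm(w)              % = 1; positive, so x lies on the d = +1 side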
Optimal Hyper-plane: Linearly Separable Case
Let r denote the distance from the nearest samples to the hyper-plane. Then:
For d_i = +1: g(x_i) = w^T x_i + b ≥ r|w|, i.e. w_o^T x_i + b_o ≥ +1.
For d_i = −1: g(x_i) = w^T x_i + b ≤ −r|w|, i.e. w_o^T x_i + b_o ≤ −1.
The optimal hyper-plane should lie in the center of the gap.
Support vectors: the samples on the boundaries. The support vectors alone determine the optimal hyper-plane.
Question: how do we find the optimal hyper-plane?
[Figure: two classes in the (x_1, x_2) plane separated by the optimal hyper-plane, with the support vectors on the margin boundaries.]
Separation Gap
For x_i a support vector:
For d_i = +1: g(x_i) = w^T x_i + b = r|w|, i.e. w_o^T x_i + b_o = +1.
For d_i = −1: g(x_i) = w^T x_i + b = −r|w|, i.e. w_o^T x_i + b_o = −1.
Hence w_o = w/(r|w|) and b_o = b/(r|w|). But the distance from x_i to the hyper-plane is r = g(x_i)/|w|. Thus w_o = w/g(x_i), and r = 1/|w_o|.
The maximum distance between the two classes is therefore 2r = 2/|w_o|.
The objective is to find w_o and b_o that minimize |w_o| (so that r is maximized), subject to the constraints
w_o^T x_i + b_o ≥ +1 for d_i = +1, and w_o^T x_i + b_o ≤ −1 for d_i = −1.
Combining these constraints, one has: d_i(w_o^T x_i + b_o) ≥ 1.
Quadratic Optimization Problem Formulation
Given {(x_i, d_i); i = 1 to N}, find w and b such that
Φ(w) = w^T w/2
is minimized subject to the N constraints
d_i(w^T x_i + b) − 1 ≥ 0, 1 ≤ i ≤ N.
Method of Lagrange multipliers: form the Lagrangian
J(w, b, α) = w^T w/2 − Σ_{i=1}^N α_i [d_i(w^T x_i + b) − 1], with α_i ≥ 0.
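The primal problem is itself a quadratic program and, for small data sets, can be solved directly. The sketch below is one way to do so with quadprog from the Matlab Optimization Toolbox; the function name hard_margin_primal and its calling convention are illustrative assumptions, not part of the slides.

    % Minimal sketch: solve the primal QP directly in the variables z = [w; b].
    % Assumes X is N-by-m (one sample per row), d is N-by-1 with entries +/-1,
    % and the data are linearly separable. Requires the Optimization Toolbox.
    function [w, b] = hard_margin_primal(X, d)
        [N, m] = size(X);
        H = blkdiag(eye(m), 0);         % cost Phi(w) = w'*w/2; b is not penalized
        f = zeros(m + 1, 1);
        A = -[d .* X, d];               % d_i*(w'*x_i + b) >= 1  <=>  A*z <= -1
        z = quadprog(H, f, A, -ones(N, 1));
        w = z(1:m);
        b = z(end);
    end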
Optimization (continued)
The solution of the Lagrange multiplier problem is at a saddle point: the minimum is sought with respect to w and b, while the maximum is sought with respect to the α_i.
Kuhn-Tucker condition: at the saddle point,
α_i [d_i(w^T x_i + b) − 1] = 0 for 1 ≤ i ≤ N.
If x_i is NOT a support vector, then d_i(w^T x_i + b) − 1 > 0 and the corresponding α_i = 0. Hence only the support vectors affect the result of the optimization!
A Numerical Example
Training samples (x_i, d_i): (1, −1), (2, +1), (3, +1).
Three inequalities: −(1·w + b) ≥ 1; 2w + b ≥ +1; 3w + b ≥ +1.
J = w²/2 − α_1(−w − b − 1) − α_2(2w + b − 1) − α_3(3w + b − 1)
∂J/∂w = 0 ⇒ w = −α_1 + 2α_2 + 3α_3
∂J/∂b = 0 ⇒ 0 = −α_1 + α_2 + α_3
The Kuhn-Tucker condition implies:
(a) α_1(−w − b − 1) = 0; (b) α_2(2w + b − 1) = 0; (c) α_3(3w + b − 1) = 0.
Later we will see that the solution is α_1 = α_2 = 2 and α_3 = 0. This yields w = 2, b = −3. Hence the decision boundary is 2x − 3 = 0, or x = 1.5. This is shown as the dashed line in the figure.
[Figure: the three training points on the real line with the decision boundary x = 1.5 drawn as a dashed line.]
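As a cross-check on the numbers above, the stationarity and active-constraint conditions can be solved as a small linear system. The sketch below assumes α_3 = 0 (x_3 = 3 is not a support vector) and solves for w, b, α_1, α_2.

    % Minimal sketch: with alpha_3 = 0, dJ/dw = 0, dJ/db = 0 and the two active
    % constraints form a linear system in [w; b; alpha_1; alpha_2]:
    %    w + alpha_1 - 2*alpha_2 = 0     (dJ/dw = 0)
    %        alpha_1 -   alpha_2 = 0     (dJ/db = 0)
    %   -w - b                  = 1      (constraint for x_1 = 1 active)
    %  2*w + b                  = 1      (constraint for x_2 = 2 active)
    A   = [ 1  0  1 -2;
            0  0  1 -1;
           -1 -1  0  0;
            2  1  0  0];
    rhs = [0; 0; 1; 1];
    sol = A \ rhs      % sol = [2; -3; 2; 2]: w = 2, b = -3, alpha_1 = alpha_2 = 2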
Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem whose Lagrange multipliers provide the solution can be formulated.
Duality Theorem (Bertsekas, 1995):
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and
Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o).
Formulating the Dual Problem
At the saddle point, we have
∂J/∂w = 0 ⇒ w = Σ_{i=1}^N α_i d_i x_i, and ∂J/∂b = 0 ⇒ Σ_{i=1}^N α_i d_i = 0.
Substituting these relations into J(w, b, α), we have the Dual Problem:
Maximize Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j
subject to Σ_{i=1}^N α_i d_i = 0 and α_i ≥ 0 for i = 1, 2, …, N.
Note that Q(α) depends on the training samples only through the inner products x_i^T x_j.
Numerical Example (cont'd)
For the example above,
Q(α) = α_1 + α_2 + α_3 − [0.5α_1² + 2α_2² + 4.5α_3² − 2α_1α_2 − 3α_1α_3 + 6α_2α_3]
subject to the constraints −α_1 + α_2 + α_3 = 0, and α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0.
Use the Matlab Optimization Toolbox command: x = fmincon('qalpha', X0, A, B, Aeq, Beq).
The solution is [α_1 α_2 α_3] = [2 2 0], as expected.
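Because Q(α) is quadratic, the same dual can also be solved with quadprog instead of fmincon (so no separate 'qalpha' function is needed); this substitution is an implementation choice, not something stated on the slide. quadprog minimizes, so the sketch below negates Q(α); the last two lines recover w and b from the primal relations.

    % Minimal sketch: the dual QP of the 1-D example, x = [1 2 3], d = [-1 +1 +1].
    % quadprog minimizes (1/2)*a'*H*a + f'*a, i.e. -Q(alpha).
    x = [1; 2; 3];
    d = [-1; 1; 1];
    H = (d*d') .* (x*x');           % H(i,j) = d_i d_j x_i x_j
    f = -ones(3, 1);                % from -sum_i alpha_i
    alpha = quadprog(H, f, [], [], d', 0, zeros(3,1), []);   % expected [2; 2; 0]
    w = sum(alpha .* d .* x);       % w = sum_i alpha_i d_i x_i = 2
    b = d(2) - w*x(2);              % from a support vector: d_i(w*x_i + b) = 1, so b = -3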
Implication of Minimizing ||w||
Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x_1, x_2, …, x_N}. The set of optimal hyper-planes described by the equation
w_o^T x + b_o = 0
has a VC-dimension h bounded from above as
h ≤ min{⌈D²/ρ²⌉, m_0} + 1,
where ⌈·⌉ denotes the integer ceiling, m_0 is the dimension of the input vectors, and ρ = 2/||w_o|| is the margin of separation of the hyper-planes.
The VC-dimension determines the complexity of the classifier structure; usually, the smaller the better.
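For concreteness, the bound can be evaluated on the 1-D example above (w_o = 2, inputs {1, 2, 3}); the numbers below are just that worked case.

    % Minimal sketch: evaluate h <= min(ceil(D^2/rho^2), m0) + 1 for the 1-D example.
    m0  = 1;                            % input dimension
    wo  = 2;                            % from the example: w_o = 2
    rho = 2/abs(wo);                    % margin of separation = 1
    D   = max([1 2 3]) - min([1 2 3]);  % diameter of the enclosing ball = 2
    h_bound = min(ceil(D^2/rho^2), m0) + 1    % = min(4, 1) + 1 = 2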
Non-separable Cases
Recall that in the linearly separable case, each training sample pair (x_i, d_i) represents a linear inequality constraint
d_i(w^T x_i + b) ≥ 1, i = 1, 2, …, N. (*)
If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:
d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, …, N. (**)
{ξ_i; 1 ≤ i ≤ N} are known as slack variables.
Note that originally, (*) is a normalized version of d_i g(x_i)/|w| ≥ r. With the slack variable ξ_i, that inequality becomes d_i g(x_i)/|w| ≥ r(1 − ξ_i). Hence, with the slack variables, we allow some samples x_i to fall within the gap. Moreover, if ξ_i > 1, then the corresponding (x_i, d_i) is mis-classified, because the sample falls on the wrong side of the hyper-plane H.
Non-Separable Case
Since ξ_i > 1 implies mis-classification, the cost function must include a term to minimize the number of samples that are mis-classified:
Φ(w, ξ) = w^T w/2 + λ Σ_{i=1}^N I(ξ_i > 1),
where λ is a Lagrange multiplier and I(·) is the indicator function. But this formulation is non-convex and a solution is difficult to find using existing nonlinear optimization algorithms.
Hence, we may instead use the approximated cost function
Φ(w, ξ) = w^T w/2 + C Σ_{i=1}^N ξ_i.
With this approximated cost function, the goal is to maximize ρ (minimize ||w||) while minimizing Σ_i ξ_i (ξ_i ≥ 0):
ξ_i = 0: not counted; x_i is outside the gap and on the correct side.
0 < ξ_i < 1: x_i is inside the gap, but on the correct side.
ξ_i > 1: x_i is on the wrong side of H (inside or outside the gap).
[Figure: samples with 0 < ξ < 1 (inside the gap, correct side) and ξ > 1 (wrong side of H).]
Primal Problem Formulation
Primal optimization problem: Given {(x_i, d_i); 1 ≤ i ≤ N}, find w and b such that
Φ(w, ξ) = w^T w/2 + C Σ_{i=1}^N ξ_i
is minimized subject to the constraints (i) ξ_i ≥ 0, and (ii) d_i(w^T x_i + b) ≥ 1 − ξ_i for i = 1, 2, …, N.
Using α_i and β_i as Lagrange multipliers, the unconstrained cost function becomes
J(w, b, ξ, α, β) = w^T w/2 + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [d_i(w^T x_i + b) − 1 + ξ_i] − Σ_{i=1}^N β_i ξ_i.
Dual Problem Formulation
Note that ∂J/∂w = 0 ⇒ w = Σ_{i=1}^N α_i d_i x_i, ∂J/∂b = 0 ⇒ Σ_{i=1}^N α_i d_i = 0, and ∂J/∂ξ_i = 0 ⇒ α_i + β_i = C.
Dual optimization problem: Given {(x_i, d_i); 1 ≤ i ≤ N}, find the Lagrange multipliers {α_i; 1 ≤ i ≤ N} such that
Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j
is maximized subject to the constraints (i) 0 ≤ α_i ≤ C (a user-specified positive number), and (ii) Σ_{i=1}^N α_i d_i = 0.
Solution to the Dual Problem
By the Karush-Kuhn-Tucker conditions, for i = 1, 2, …, N:
(i) α_i [d_i(w^T x_i + b) − 1 + ξ_i] = 0 (*)
(ii) β_i ξ_i = 0
At the optimal point, α_i + β_i = C. Thus, one may deduce that
if 0 < α_i < C, then ξ_i = 0 and d_i(w^T x_i + b) = 1;
if α_i = C, then ξ_i ≥ 0 and d_i(w^T x_i + b) = 1 − ξ_i ≤ 1;
if α_i = 0, then d_i(w^T x_i + b) ≥ 1: x_i is not a support vector.
Finally, the optimal solutions are
w_o = Σ_{i=1}^N α_i d_i x_i and b_o = (1/|I_o|) Σ_{i ∈ I_o} (d_i − w_o^T x_i), where I_o = {i; 0 < α_i < C}.
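Putting the soft-margin dual and these KKT relations together, a compact linear-kernel solver looks roughly like the sketch below; the function name soft_margin_svm, the tolerance 1e-6, and the use of quadprog are assumptions made for illustration.

    % Minimal sketch of the soft-margin dual (linear kernel) and recovery of w, b.
    % X: N-by-m data matrix (one sample per row), d: N-by-1 labels (+/-1),
    % C: user-specified penalty. Requires the Optimization Toolbox.
    function [w, b, alpha] = soft_margin_svm(X, d, C)
        N = size(X, 1);
        H = (d*d') .* (X*X');                       % H(i,j) = d_i d_j x_i' x_j
        f = -ones(N, 1);
        alpha = quadprog(H, f, [], [], d', 0, zeros(N,1), C*ones(N,1));
        w  = X' * (alpha .* d);                     % w_o = sum_i alpha_i d_i x_i
        Io = find(alpha > 1e-6 & alpha < C - 1e-6); % support vectors with 0 < alpha_i < C
        b  = mean(d(Io) - X(Io,:) * w);             % average of d_i - w_o'*x_i over I_o
    end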
Inner Product Kernels
In general, the input is first transformed via a set of nonlinear functions {φ_j(x)} and then subjected to the hyper-plane classifier
g(x) = Σ_{j=0}^p w_j φ_j(x) = w^T φ(x), where φ(x) = [φ_0(x) φ_1(x) … φ_p(x)]^T and φ_0(x) = 1 (so that w_0 plays the role of b).
Define the inner product kernel as
K(x, y) = φ^T(x) φ(y) = Σ_{j=0}^p φ_j(x) φ_j(y);
one may then obtain a dual optimization problem formulation:
Maximize Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j K(x_i, x_j), subject to the same constraints as before.
Often, the dimension of φ (= p + 1) >> the dimension of x!
Polynomial Kernel
Consider the polynomial kernel K(x, y) = (1 + x^T y)². Let K(x, y) = φ^T(x) φ(y); then
φ(x) = [1, x_1², …, x_m², √2 x_1, …, √2 x_m, √2 x_1x_2, …, √2 x_1x_m, √2 x_2x_3, …, √2 x_2x_m, …, √2 x_{m−1}x_m]^T = [1, φ_1(x), …, φ_p(x)]^T,
where p + 1 = 1 + m + m + (m−1) + (m−2) + … + 1 = (m+2)(m+1)/2.
Hence, using a kernel, a low-dimensional pattern classification problem (of dimension m) is solved in a higher-dimensional space (dimension p + 1). But only the φ_j(x) corresponding to support vectors are used for pattern classification!
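The identity K(x, y) = φ^T(x) φ(y) is easy to check numerically for m = 2; the test vectors below are arbitrary assumptions.

    % Minimal sketch: check that (1 + x'*y)^2 = phi(x)'*phi(y) for m = 2.
    phi = @(x) [1; x(1)^2; x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2); sqrt(2)*x(1)*x(2)];
    x = [0.5; -1.2];  y = [2.0; 0.3];       % arbitrary test vectors
    k_direct  = (1 + x'*y)^2;
    k_feature = phi(x)' * phi(y);
    disp([k_direct, k_feature])             % the two numbers agree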
Numerical Example: XOR Problem
Training samples: (−1, −1; −1), (−1, +1; +1), (+1, −1; +1), (+1, +1; −1), where x = [x_1, x_2]^T.
Using K(x, y) = (1 + x^T y)², one has
φ(x) = [1, x_1², x_2², √2 x_1, √2 x_2, √2 x_1x_2]^T.
Note that dim[φ(x)] = 6 > dim[x] = 2!
Dim(K) = N_s = the number of support vectors (here all four samples are support vectors).
XOR Problem (Continued)
Note that K(x_i, x_j) can be calculated directly without using φ:
K = [9 1 1 1; 1 9 1 1; 1 1 9 1; 1 1 1 9].
The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]^T.
Hence the hyper-plane is: y = w^T φ(x) = −x_1 x_2.
(x_1, x_2):      (−1, −1)  (−1, +1)  (+1, −1)  (+1, +1)
y = −x_1 x_2:      −1        +1        +1        −1
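The Gram matrix and the multipliers above can be reproduced with a few lines of Matlab; as before, quadprog is used in place of fmincon, which is an implementation choice rather than something stated on the slide.

    % Minimal sketch of the XOR example: Gram matrix and dual solution.
    X = [-1 -1; -1 +1; +1 -1; +1 +1];
    d = [-1; +1; +1; -1];
    K = (1 + X*X').^2;                  % equals 8*eye(4) + ones(4)
    H = (d*d') .* K;
    alpha = quadprog(H, -ones(4,1), [], [], d', 0, zeros(4,1), []);
    % expected: alpha = [1 1 1 1]'/8, so every sample is a support vector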
Other Types of Kernels
Type of SVM                   K(x, y)                    Comments
Polynomial learning machine   (x^T y + 1)^p              p: selected a priori
Radial basis function         exp(−||x − y||²/(2σ²))     σ²: selected a priori
Two-layer perceptron          tanh(β_0 x^T y + β_1)      only some β_0 and β_1 values are feasible
What kernel is feasible? It must satisfy Mercer's theorem!
Mercer's Theorem
Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigen-function expansion
K(x, y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)
with λ_i > 0 for each i. This expansion converges absolutely and uniformly if and only if
∫_a^b ∫_a^b K(x, y) ψ(x) ψ(y) dx dy ≥ 0
for all ψ(x) such that ∫_a^b ψ²(x) dx < ∞.
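A practical, finite-sample consequence of Mercer's condition is that the Gram matrix built from a valid kernel is positive semi-definite on any set of samples. The sketch below checks this numerically for the radial basis function kernel on random points; the sample size, the random seed, and σ² = 1 are arbitrary assumptions.

    % Minimal sketch: finite-sample check of Mercer's condition for the RBF kernel.
    % The Gram matrix of a valid kernel has no negative eigenvalues (up to round-off).
    rng(0);
    X = randn(20, 2);                             % 20 random 2-D samples
    sigma2 = 1.0;
    D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X'); % squared pairwise distances
    K = exp(-D2 / (2*sigma2));                    % RBF Gram matrix
    min(eig((K + K')/2))                          % smallest eigenvalue, >= 0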
Testing with Kernels
For many types of kernels, φ(x) cannot be explicitly represented or even found. However, since w = Σ_{i=1}^N α_i d_i φ(x_i), the decision function is
f(x) = w^T φ(x) = Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = Σ_{i=1}^N α_i d_i K(x_i, x).
Hence there is no need to know φ(x) explicitly!
For example, in the XOR problem the weights α_i d_i are (1/8)[−1, +1, +1, −1]^T. Suppose that x = (−1, +1); then
f(x) = (1/8)[−K(x_1, x) + K(x_2, x) + K(x_3, x) − K(x_4, x)] = (1/8)(−1 + 9 + 1 − 1) = +1,
so x is classified as d = +1.
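The same evaluation in Matlab, using only kernel values (the α found earlier is assumed):

    % Minimal sketch: evaluate the XOR decision function with kernel values only.
    X = [-1 -1; -1 +1; +1 -1; +1 +1];
    d = [-1; +1; +1; -1];
    alpha = ones(4, 1) / 8;
    f = @(x) sum(alpha .* d .* (1 + X*x).^2);   % f(x) = sum_i alpha_i d_i K(x_i, x)
    f([-1; +1])                                 % returns +1
    f([+1; +1])                                 % returns -1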
SVM Using Nonlinear Kernels
Using a kernel, low-dimensional feature vectors are mapped into a high-dimensional (possibly infinite-dimensional) kernel feature space, where the data are likely to be linearly separable.
[Figure: block diagram of the SVM — the input vectors x_1, …, x_N pass through the nonlinear transforms φ_0, …, φ_P (equivalently, the kernel evaluations K(x, x_j)), which are weighted by W and summed to produce the output f.]