2806 Neural Computation: Support Vector Machines. Lecture 6, 2005. Ari Visa.
Agenda
- Some historical notes
- Some theory
- Support Vector Machines
- Conclusions
Some Historical Notes
- Linear discriminant functions (Fisher 1936): one should know the underlying distributions.
- Smith 1969: a multicategory classifier built from two-category procedures.
- Linear machines were applied to larger and larger data sets, using linear programming (Block & Levin 1970) and stochastic approximation methods (Yau & Schumpert 1968).
- The neural network direction: Minsky & Papert, Perceptrons (1969).
Some Historical Notes
- Boser, Guyon & Vapnik (1992) and Schölkopf, Burges & Vapnik (1995) gave the key ideas.
- The Kuhn-Tucker construction (1951).
Some Theory
- A multicategory classifier using two-category procedures:
  a) reduce the problem to a set of two-class problems, or
  b) use c(c-1)/2 linear discriminants, one for every pair of classes.
- Both a) and b) can lead to unclassified regions.
Some Theory
- Consider the training sample {(x_i, d_i)}_{i=1}^N, where x_i is the input pattern for the ith example and d_i is the corresponding desired response. The patterns represented by the subset d_i = +1 and those represented by the subset d_i = -1 are assumed to be linearly separable.
- c) Define a linear machine g_i(x) = w_i^T x + w_{i0} and assign x to class i if g_i(x) > g_j(x) for all j ≠ i.
Some Theory
- A discriminant function: g(x) = w^T x + b, where w is the weight vector and b the bias (or threshold).
- We may write: w^T x_i + b ≥ 0 for d_i = +1 and w^T x_i + b < 0 for d_i = -1.
- The margin of separation is the separation between the hyperplane and the closest data point.
Some Theory
- The goal in training a Support Vector Machine is to find the separating hyperplane with the largest margin.
- g(x) = w_o^T x + b_o gives an algebraic measure of the distance from x to the hyperplane (w_o and b_o denote the optimum values).
- Writing x = x_p + r w_o/||w_o||, where x_p is the projection of x onto the hyperplane, gives r = g(x)/||w_o||. (A small numerical sketch of g(x) and r follows below.)
- Support vectors are the data points that lie closest to the decision surface.
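A minimal sketch of the discriminant g(x) = w^T x + b and the signed geometric distance r = g(x)/||w||. The weight vector, bias, and test point are made-up values, not taken from the slides.

```python
import numpy as np

w = np.array([2.0, 1.0])   # hypothetical weight vector
b = -1.0                   # hypothetical bias

def g(x):
    return w @ x + b       # algebraic measure of distance from x to the hyperplane

def distance(x):
    return g(x) / np.linalg.norm(w)   # signed geometric distance r = g(x) / ||w||

x = np.array([1.5, 0.5])
print("g(x) =", g(x), "  r =", distance(x))
print("predicted class:", +1 if g(x) >= 0 else -1)
```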
Some Theory
- The algebraic distance from a support vector x^(s) to the optimal hyperplane is r = g(x^(s))/||w_o||.
- This equals 1/||w_o|| if d^(s) = +1 and -1/||w_o|| if d^(s) = -1.
- The margin of separation is ρ = 2r = 2/||w_o||.
- The optimal hyperplane is unique: it gives the maximum possible separation between positive and negative examples.
Some Theory
- Finding the optimal hyperplane.
- Problem: given the training sample {(x_i, d_i)}_{i=1}^N, find the optimum values of the weight vector w and bias b such that they satisfy the constraints d_i(w^T x_i + b) ≥ 1 for i = 1, 2, ..., N and the weight vector w minimizes the cost function Φ(w) = ½ w^T w.
Some Theory
- The cost function Φ(w) is a convex function of w, and the constraints are linear in w, so the constrained optimization problem may be solved by the method of Lagrange multipliers.
- J(w, b, α) = ½ w^T w - Σ_{i=1}^N α_i [d_i(w^T x_i + b) - 1]
- The solution to the constrained optimization problem is determined by the saddle point of J(w, b, α): J has to be minimized with respect to w and b and maximized with respect to α.
Some Theory
- Kuhn-Tucker conditions and solution of the dual problem.
- Duality theorem: a) if the primal problem has an optimal solution, the dual problem has an optimal solution and the corresponding optimal values are equal; b) in order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o).
- Expanding the Lagrangian: J(w, b, α) = ½ w^T w - Σ_{i=1}^N α_i d_i w^T x_i - b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i
Some Theory
- The dual problem: given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) α_i ≥ 0 for i = 1, 2, ..., N. (A numerical sketch of this dual follows below.)
- The optimum weight vector: w_o = Σ_{i=1}^N α_{o,i} d_i x_i
- The optimum bias: b_o = 1 - w_o^T x^(s) for a support vector x^(s) with d^(s) = 1
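A minimal sketch of solving the hard-margin dual numerically. The toy data, the solver choice (SciPy's constrained minimizer), and the 1e-6 support-vector threshold are my assumptions, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
d = np.hstack([-np.ones(20), np.ones(20)])
N = len(d)

# H[i, j] = d_i d_j x_i^T x_j, so that Q(a) = sum_i a_i - 1/2 a^T H a
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(a):                                          # maximize Q  <=>  minimize -Q
    return 0.5 * a @ H @ a - a.sum()

constraints = {"type": "eq", "fun": lambda a: a @ d}   # sum_i a_i d_i = 0
bounds = [(0.0, None)] * N                             # a_i >= 0 (hard margin)
a = minimize(neg_Q, np.zeros(N), bounds=bounds, constraints=constraints).x

sv = a > 1e-6                                          # support vectors have a_i > 0
w_o = ((a * d)[:, None] * X).sum(axis=0)               # w_o = sum_i a_i d_i x_i
b_o = np.mean(d[sv] - X[sv] @ w_o)                     # from d_i (w_o^T x_i + b_o) = 1 at the SVs
print("w_o =", w_o, " b_o =", b_o, " #SV =", int(sv.sum()))
```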
Some Theory
- Optimal hyperplane for nonseparable patterns.
- The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the condition d_i(w^T x_i + b) ≥ 1, i = 1, ..., N.
- The slack variables {ξ_i}_{i=1}^N measure the deviation of a data point from the ideal condition of pattern separability: d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, ..., N.
Some Theory
- Our goal is to find a separating hyperplane for which the misclassification error, averaged over the training set, is minimized.
- We may minimize the functional Φ(ξ) = Σ_{i=1}^N I(ξ_i - 1) with respect to the weight vector w, subject to the constraint d_i(w^T x_i + b) ≥ 1 - ξ_i and the constraint ||w||² ≤ c.
- Minimization of Φ(ξ) with respect to w is a nonconvex optimization problem (NP-complete).
Some Theory
- We approximate the functional Φ(ξ) by writing: Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i
- The first term is related to minimizing the VC dimension, and the second term is an upper bound on the number of test errors.
- C is determined either experimentally or analytically by estimating the VC dimension.
Some Theory
- Problem: given the training sample {(x_i, d_i)}_{i=1}^N, find the optimum values of the weight vector w and bias b such that they satisfy the constraints d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, ..., N and ξ_i ≥ 0 for all i, and such that the weight vector w and the slack variables ξ_i minimize the cost function Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i, where C is a user-specified positive parameter.
Some Theory
- The dual problem for nonseparable patterns: given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) 0 ≤ α_i ≤ C for i = 1, 2, ..., N, where C is a user-specified positive parameter. (A soft-margin sketch follows below.)
- The optimum solution: w_o = Σ_{i=1}^{N_s} α_{o,i} d_i x_i, where N_s is the number of support vectors.
- The Kuhn-Tucker condition α_i [d_i(w^T x_i + b) - 1 + ξ_i] = 0, i = 1, 2, ..., N, determines b_o: take the mean value of b_o over all data points (x_i, d_i) in the training set for which 0 < α_{o,i} < C.
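A minimal soft-margin sketch using scikit-learn, which solves this same dual internally; the library choice, C = 1.0, and the overlapping toy data are my assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])  # overlapping classes
d = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, d)
w, b = clf.coef_[0], clf.intercept_[0]
alpha = np.abs(clf.dual_coef_[0])                   # |alpha_i d_i| for the support vectors
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors with 0 < alpha < C:", int(np.sum(alpha < clf.C - 1e-8)))
```

The support vectors with 0 < α_i < C are exactly the points the slide says to average over when computing b_o.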
Support Vector Machines
- The goal of a support vector machine is to find the particular hyperplane for which the margin of separation is maximized.
- The support vectors consist of a small subset of the training data extracted by the algorithm.
- Depending on how the inner-product kernel is generated, we may construct different learning machines characterized by their own nonlinear decision surfaces:
- Polynomial learning machines
- Radial-basis function networks
- Two-layer perceptrons
Support Vector Machines
- The idea:
- 1. Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output.
- 2. Construction of an optimal hyperplane for separating the features discovered in step 1.
Support Vector Machines
- Let x denote a vector drawn from the input space (dimension m_0), and let {φ_j(x)}_{j=1}^{m_1} denote a set of nonlinear transformations from the input space to the feature space (dimension m_1), where each φ_j(x) is defined a priori.
- We may define a hyperplane Σ_{j=1}^{m_1} w_j φ_j(x) + b = 0, or equivalently Σ_{j=0}^{m_1} w_j φ_j(x) = 0, where it is assumed that φ_0(x) = 1 for all x so that w_0 denotes b.
- The decision surface: w^T φ(x) = 0
- With w = Σ_{i=1}^N α_i d_i φ(x_i), the decision surface becomes Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0.
- The inner-product kernel: K(x, x_i) = φ^T(x) φ(x_i) = Σ_{j=0}^{m_1} φ_j(x) φ_j(x_i) for i = 1, 2, ..., N
- The optimal hyperplane: Σ_{i=1}^N α_{o,i} d_i K(x, x_i) = 0
- Mercer's theorem tells us whether or not a candidate kernel is actually an inner-product kernel in some space. (A kernel sketch follows below.)
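A minimal sketch of two commonly used inner-product kernels (the polynomial and Gaussian RBF forms written here are standard choices, assumed rather than given on this slide), together with a finite-sample check of Mercer's condition: on any sample, the Gram matrix K(x_i, x_j) must be symmetric positive semidefinite.

```python
import numpy as np

def poly_kernel(X, Y, p=2):
    # (x^T y + 1)^p, a polynomial inner-product kernel
    return (X @ Y.T + 1.0) ** p

def rbf_kernel(X, Y, gamma=1.0):
    # exp(-gamma ||x - y||^2), a Gaussian radial-basis function kernel
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

X = np.random.default_rng(2).normal(size=(30, 2))
for name, K in [("polynomial", poly_kernel(X, X)), ("RBF", rbf_kernel(X, X))]:
    min_eig = np.linalg.eigvalsh(K).min()
    # a valid (Mercer) kernel gives min_eig >= 0, up to numerical round-off
    print(name, "Gram matrix, smallest eigenvalue:", round(min_eig, 6))
```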
Support Vector Machines
- The expansion of the inner-product kernel K(x, x_i) permits us to construct a decision surface that is nonlinear in the input space but whose image in the feature space is linear.
- Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j K(x_i, x_j) subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) 0 ≤ α_i ≤ C for i = 1, 2, ..., N, where C is a user-specified positive parameter. (A worked example follows below.)
- The kernel matrix: K = {K(x_i, x_j)}_{i,j=1}^N
- w_o = Σ_{i=1}^{N_s} α_{o,i} d_i φ(x_i), where φ(x_i) is the image induced in the feature space by x_i. The first component of w_o represents the optimum bias b_0.
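A minimal sketch of this kernelized dual solved with scikit-learn's SVC; the library choice, the circular toy problem, and the values C = 10 and gamma = 1 are my assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
d = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable in the input space

clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, d)    # decision surface is linear in feature space
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, d))
```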
Support Vector Machines
- The requirement on the kernel K(x, x_i) is that it satisfy Mercer's theorem.
- The inner-product kernels of the polynomial and radial-basis function types always satisfy Mercer's theorem.
- The dimensionality of the feature space is determined by the number of support vectors extracted from the training data by the solution to the constrained optimization problem.
- The underlying theory of an SVM avoids the need for heuristics often used in the design of conventional RBF networks and MLPs.
Support Vector Machines
- In the RBF type of SVM, the number of radial-basis functions and their centers are determined automatically by the number of support vectors and their values, respectively.
- In the two-layer perceptron type of SVM, the number of hidden neurons and their weight vectors are determined automatically by the number of support vectors and their values, respectively.
- Conceptual problem: the dimensionality of the feature space is made very large.
- Computational problem: the curse of dimensionality is avoided by using the notion of an inner-product kernel and solving the dual form of the constrained optimization problem, formulated in the input space.
Support Vector Machines
- The XOR problem: (x_1 OR x_2) AND NOT (x_1 AND x_2). (A worked example follows below.)
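The XOR problem is not linearly separable in the input space, but a degree-2 polynomial kernel K(x, x_i) = (x^T x_i + 1)^2 separates it in the induced feature space. This sketch uses scikit-learn (my library choice), with a very large C standing in for a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
d = np.array([-1, 1, 1, -1])                  # XOR labels in {-1, +1}

# kernel (gamma * x^T x_i + coef0)^degree = (x^T x_i + 1)^2 with these settings
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, d)
print("predictions:", clf.predict(X))         # expected: [-1  1  1 -1]
print("number of support vectors:", clf.support_.size)  # all four points in the classic analysis
```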
Support Vector Machines for Nonlinear Regression
- Consider a nonlinear regressive model in which the dependence of a scalar d on a vector x is described by d = f(x) + v.
- Given a set of training data {(x_i, d_i)}_{i=1}^N, where x_i is a sample value of the input vector x and d_i is the corresponding value of the model output d, the problem is to provide an estimate of the dependence of d on x.
- The estimator: y = Σ_{j=0}^{m_1} w_j φ_j(x) = w^T φ(x)
- Minimize the empirical risk R_emp = (1/N) Σ_{i=1}^N L_ε(d_i, y_i), where L_ε is the ε-insensitive loss, subject to the inequality ||w||² ≤ c_0.
Support Vector Machines for Nonlinear Regression
- Introduce two sets of nonnegative slack variables {ξ_i}_{i=1}^N and {ξ'_i}_{i=1}^N:
  d_i - w^T φ(x_i) ≤ ε + ξ_i, i = 1, 2, ..., N
  w^T φ(x_i) - d_i ≤ ε + ξ'_i, i = 1, 2, ..., N
  ξ_i ≥ 0, ξ'_i ≥ 0, i = 1, 2, ..., N
- The cost function: Φ(w, ξ, ξ') = ½ w^T w + C Σ_{i=1}^N (ξ_i + ξ'_i)
- The Lagrangian: J(w, ξ, ξ', α, α', γ, γ') = C Σ_{i=1}^N (ξ_i + ξ'_i) + ½ w^T w - Σ_{i=1}^N α_i [w^T φ(x_i) - d_i + ε + ξ_i] - Σ_{i=1}^N α'_i [d_i - w^T φ(x_i) + ε + ξ'_i] - Σ_{i=1}^N (γ_i ξ_i + γ'_i ξ'_i)
- Setting the derivatives to zero gives w = Σ_{i=1}^N (α_i - α'_i) φ(x_i), with γ_i = C - α_i and γ'_i = C - α'_i.
- The inner-product kernel: K(x_i, x_j) = φ^T(x_i) φ(x_j)
Support Vector Machines for Nonlinear Regression
- Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N and {α'_i}_{i=1}^N that maximize the objective function
  Q(α, α') = Σ_{i=1}^N d_i (α_i - α'_i) - ε Σ_{i=1}^N (α_i + α'_i) - ½ Σ_{i=1}^N Σ_{j=1}^N (α_i - α'_i)(α_j - α'_j) K(x_i, x_j)
  subject to the constraints: 1) Σ_{i=1}^N (α_i - α'_i) = 0 and 2) 0 ≤ α_i ≤ C and 0 ≤ α'_i ≤ C for i = 1, 2, ..., N, where C is a user-specified constant. (A regression sketch follows below.)
- The parameters ε and C are free parameters selected by the user, and they must be tuned simultaneously.
- Regression is intrinsically more difficult than pattern classification.
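A minimal sketch of ε-insensitive support vector regression via scikit-learn; the library choice, the sinc-shaped toy model d = f(x) + v, and the values C = 10, epsilon = 0.1, gamma = 1 are my assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-3.0, 3.0, 200))[:, None]
d = np.sinc(x).ravel() + rng.normal(0.0, 0.1, 200)   # d = f(x) + v, with additive noise v

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(x, d)
print("support vectors used:", reg.support_.size, "of", len(d))
print("training R^2:", round(reg.score(x, d), 3))
```

Points falling inside the ε-tube contribute no loss, which is why only a subset of the training data ends up as support vectors.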
Summary
- The SVM is an elegant and highly principled learning method for the design of a feedforward network with a single layer of nonlinear units.
- The SVM includes the polynomial learning machine, the radial-basis function network, and the two-layer perceptron as special cases.
- The SVM provides a method for controlling model complexity independently of dimensionality.
- The SVM learning algorithm operates only in batch mode.
- By using a suitable inner-product kernel, the SVM automatically computes all the important parameters pertaining to that choice of kernel.
- In terms of running time, SVMs are currently slower than other neural networks.