2806 Neural Computation: Support Vector Machines. Lecture 6, 2005. Ari Visa.

Agenda
- Some historical notes
- Some theory
- Support Vector Machines
- Conclusions

Some Historical Notes
- Linear discriminant functions (Fisher 1936) -> require knowledge of the underlying distributions.
- Smith 1969: a multicategory classifier built from two-category procedures.
- Linear machines were applied to larger and larger data sets, using linear programming (Block & Levin 1970) and stochastic approximation methods (Yau & Schumpert 1968).
- The neural-network direction: Minsky & Papert, Perceptrons (1969).

Some Historical Notes
- Boser, Guyon, and Vapnik (1992) and Schölkopf, Burges, and Vapnik (1995) introduced the key ideas.
- The Kuhn-Tucker construction (1951) underlies the constrained-optimization formulation.

Some Theory
- A multicategory classifier using two-category procedures:
  a) Reduce the problem to two-class problems.
  b) Use c(c-1)/2 linear discriminants, one for every pair of classes.
- Both a) and b) can lead to unclassified regions.

Some Theory
- Consider the training sample {(x_i, d_i)}_{i=1}^N, where x_i is the input pattern for the i-th example and d_i is the corresponding desired response. The patterns represented by the subset with d_i = +1 and those with d_i = -1 are assumed to be linearly separable.
- c) Define a linear machine g_i(x) = w_i^T x + w_{i0} and assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i.

Some Theory
- A discriminant function: g(x) = w^T x + b, where w is the weight vector and b is the bias (or threshold).
- We may write w^T x + b ≥ 0 for d_i = +1 and w^T x + b < 0 for d_i = -1.
- The margin of separation is the separation between the hyperplane and the closest data point.

Some Theory
- The goal in training a support vector machine is to find the separating hyperplane with the largest margin.
- g(x) = w_o^T x + b_o gives an algebraic measure of the distance from x to the hyperplane (w_o and b_o denote the optimum values).
- Writing x = x_p + r w_o/‖w_o‖, the distance is r = g(x)/‖w_o‖.
- Support vectors = data points that lie closest to the decision surface.
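
To make the geometry concrete, here is a small illustrative sketch (not from the slides) that evaluates g(x) = w^T x + b and the signed distance r = g(x)/‖w‖ for a few made-up points; the hyperplane and data are invented for illustration. The points with the smallest |r| play the role of support vectors.

```python
import numpy as np

# Hypothetical separating hyperplane g(x) = w^T x + b (values chosen for illustration).
w = np.array([2.0, 1.0])
b = -1.0

# A few labeled points; d = +1 or -1, as in the lecture's notation.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-0.5, 0.0], [0.0, -1.0]])
d = np.array([+1, +1, -1, -1])

g = X @ w + b                 # algebraic measure g(x) = w^T x + b
r = g / np.linalg.norm(w)     # signed geometric distance r = g(x) / ||w||

# The points with the smallest |r| lie closest to the decision surface;
# in a trained SVM these would be the support vectors.
closest = np.argsort(np.abs(r))[:2]
print("g(x):", g)
print("distances r:", r)
print("closest points (support-vector candidates):", closest)
```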

Some Theory
- The algebraic distance from a support vector x^(s) to the optimal hyperplane is r = g(x^(s))/‖w_o‖.
- This equals 1/‖w_o‖ if d^(s) = +1 and -1/‖w_o‖ if d^(s) = -1.
- The margin of separation: ρ = 2r = 2/‖w_o‖.
- The optimal hyperplane is unique (it gives the maximum possible separation between positive and negative examples).

Some Theory
- Finding the optimal hyperplane.
- Problem: Given the training sample {(x_i, d_i)}_{i=1}^N, find the optimum values of the weight vector w and bias b such that they satisfy the constraints d_i(w^T x_i + b) ≥ 1 for i = 1, 2, ..., N and the weight vector w minimizes the cost function Φ(w) = ½ w^T w.
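
As an illustration only, this primal problem can be handed directly to a generic convex solver. The use of the cvxpy library and the toy data below are my assumptions, not part of the lecture; this is a sketch, not the prescribed algorithm.

```python
import cvxpy as cp
import numpy as np

def hard_margin_primal(X, d):
    """Sketch of the primal problem: minimize (1/2) w^T w
    subject to d_i (w^T x_i + b) >= 1 for all i."""
    n_features = X.shape[1]
    w = cp.Variable(n_features)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(d, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w_opt, b_opt = hard_margin_primal(X, d)
print("w:", w_opt, "b:", b_opt)
```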

Some Theory
- The cost function Φ(w) is a convex function of w.
- The constraints are linear in w.
- The constrained optimization problem may be solved by the method of Lagrange multipliers:
  J(w, b, α) = ½ w^T w - Σ_{i=1}^N α_i [d_i(w^T x_i + b) - 1]
- The solution to the constrained optimization problem is determined by the saddle point of J(w, b, α): it has to be minimized with respect to w and b, and maximized with respect to α.

Some Theory
- Kuhn-Tucker conditions and the solution of the dual problem.
- Duality theorem: a) If the primal problem has an optimal solution, the dual problem has an optimal solution and the corresponding optimal values are equal. b) In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o).
- Expanding the Lagrangian: J(w, b, α) = ½ w^T w - Σ_{i=1}^N α_i d_i w^T x_i - b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i

Some Theory
- The dual problem: Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
  Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j
  subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) α_i ≥ 0 for i = 1, 2, ..., N.
- w_o = Σ_{i=1}^N α_{o,i} d_i x_i
- b_o = 1 - w_o^T x^(s) for d^(s) = +1
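
Continuing the hedged sketch above, the dual can be posed directly in the multipliers α and w_o, b_o recovered from the solution; cvxpy is again an assumed choice of solver and the small ridge added to the Gram matrix is a numerical convenience, not part of the theory.

```python
import cvxpy as cp
import numpy as np

def hard_margin_dual(X, d):
    """Sketch of the dual: maximize Q(alpha) subject to
    sum_i alpha_i d_i = 0 and alpha_i >= 0."""
    N = X.shape[0]
    M = d[:, None] * X                      # rows are d_i * x_i
    G = M @ M.T                             # G_ij = d_i d_j x_i^T x_j (PSD Gram matrix)
    G = G + 1e-8 * np.eye(N)                # tiny ridge so the solver sees a PSD matrix
    alpha = cp.Variable(N)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G))
    constraints = [alpha >= 0, alpha @ d == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    w = ((a * d)[:, None] * X).sum(axis=0)  # w_o = sum_i alpha_{o,i} d_i x_i
    s = int(np.argmax(a * (d > 0)))         # index of a support vector with d^(s) = +1
    b = 1.0 - w @ X[s]                      # b_o = 1 - w_o^T x^(s)
    return w, b, a

# Same toy data as before (invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
print(hard_margin_dual(X, d))
```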

Some Theory
- Optimal hyperplane for nonseparable patterns.
- The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the condition d_i(w^T x_i + b) ≥ 1, i = 1, ..., N.
- Slack variables {ξ_i}_{i=1}^N measure the deviation of a data point from the ideal condition of pattern separability: d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, ..., N.
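
As a small aside (not in the slides), the smallest slack satisfying the relaxed constraint together with ξ_i ≥ 0 is ξ_i = max(0, 1 - d_i(w^T x_i + b)). The sketch below computes it for a hypothetical hyperplane and made-up points.

```python
import numpy as np

def slack_variables(w, b, X, d):
    """xi_i = max(0, 1 - d_i (w^T x_i + b)): zero for points on the correct
    side of the margin, between 0 and 1 for points inside the margin but still
    correctly classified, and greater than 1 for misclassified points."""
    return np.maximum(0.0, 1.0 - d * (X @ w + b))

# Hypothetical hyperplane and points, for illustration only.
w = np.array([1.0, 1.0]); b = 0.0
X = np.array([[2.0, 2.0], [0.3, 0.3], [-0.5, -0.2], [-2.0, -2.0]])
d = np.array([+1, +1, +1, -1])
print(slack_variables(w, b, X, d))   # the third point is misclassified: xi > 1
```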

Some Theory
- Our goal is to find a separating hyperplane for which the misclassification error, averaged over the training set, is minimized.
- We may minimize the functional Φ(ξ) = Σ_{i=1}^N I(ξ_i - 1) with respect to the weight vector w, subject to the constraint d_i(w^T x_i + b) ≥ 1 - ξ_i and a constraint on ‖w‖².
- Minimization of Φ(ξ) with respect to w is a nonconvex optimization problem (it is NP-complete).

Some Theory
- We approximate the functional Φ(ξ) by writing Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i.
- The first term is related to minimizing the VC dimension, and the second term is an upper bound on the number of test errors.
- C is determined either experimentally or analytically by estimating the VC dimension.

Some Theory
- Problem: Given the training sample {(x_i, d_i)}_{i=1}^N, find the optimum values of the weight vector w and bias b such that they satisfy the constraints d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, ..., N, with ξ_i ≥ 0 for all i, and such that the weight vector w and the slack variables ξ_i minimize the cost function Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i, where C is a user-specified positive parameter.

Some Theory
- The dual problem for nonseparable patterns: Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
  Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j
  subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) 0 ≤ α_i ≤ C for i = 1, 2, ..., N, where C is a user-specified positive parameter.
- The optimum solution: w_o = Σ_{i=1}^{N_s} α_{o,i} d_i x_i, where N_s is the number of support vectors.
- α_i [d_i(w^T x_i + b) - 1 + ξ_i] = 0 for i = 1, 2, ..., N. Take the mean value of b_o over all data points (x_i, d_i) in the training set for which 0 < α_{o,i} < C.
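
In practice this box-constrained dual is usually handed to a packaged solver. As one hedged illustration, scikit-learn's SVC with a linear kernel solves exactly this soft-margin problem; the library choice and the synthetic data are my assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping two-class data (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+1.5, size=(30, 2)),
               rng.normal(loc=-1.5, size=(30, 2))])
d = np.hstack([np.ones(30), -np.ones(30)])

# kernel='linear' gives the soft-margin linear SVM; C is the user-specified
# positive parameter that bounds the Lagrange multipliers (0 <= alpha_i <= C).
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, d)

print("number of support vectors per class:", clf.n_support_)
print("w_o:", clf.coef_[0], " b_o:", clf.intercept_[0])
print("alpha_i * d_i for the support vectors:", clf.dual_coef_[0])
```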

Support Vector Machines
- The goal of a support vector machine is to find the particular hyperplane for which the margin of separation ρ is maximized.
- The support vectors consist of a small subset of the training data extracted by the algorithm. Depending on how the inner-product kernel is generated, we may construct different learning machines characterized by their own nonlinear decision surfaces:
- Polynomial learning machines
- Radial-basis function networks
- Two-layer perceptrons

Support Vector Machines
The idea:
1. Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output.
2. Construction of an optimal hyperplane for separating the features discovered in step 1.

Support Vector Machines
- Let x denote a vector drawn from the input space (dimension m_0). Let {φ_j(x)}_{j=1}^{m_1} denote a set of nonlinear transformations from the input space to the feature space (dimension m_1); φ_j(x) is defined a priori for all j.
- We may define a hyperplane Σ_{j=1}^{m_1} w_j φ_j(x) + b = 0, or equivalently Σ_{j=0}^{m_1} w_j φ_j(x) = 0, where it is assumed that φ_0(x) = 1 for all x so that w_0 denotes b.
- The decision surface: w^T φ(x) = 0.
- With w = Σ_{i=1}^N α_i d_i φ(x_i), the decision surface becomes Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0.
- K(x, x_i) = φ^T(x) φ(x_i) = Σ_{j=0}^{m_1} φ_j(x) φ_j(x_i) for i = 1, 2, ..., N.
- The optimal hyperplane: Σ_{i=1}^N α_i d_i K(x, x_i) = 0.
- Mercer's theorem tells us whether or not a candidate kernel is actually an inner-product kernel in some space.

Support Vector Machines
- The expansion of the inner-product kernel K(x, x_i) permits us to construct a decision surface that is nonlinear in the input space, but its image in the feature space is linear.
- Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
  Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j K(x_i, x_j)
  subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) 0 ≤ α_i ≤ C for i = 1, 2, ..., N, where C is a user-specified positive parameter.
- K = {K(x_i, x_j)}_{i,j=1}^N.
- w_o = Σ_{i=1}^{N_s} α_{o,i} d_i φ(x_i), where φ(x_i) is the image induced in the feature space by x_i. The first component of w_o represents the optimum bias b_0.
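
A hedged sketch of the kernel trick in code: only the Gram matrix K = {K(x_i, x_j)} is ever needed, never the feature map φ itself. The RBF kernel, the use of scikit-learn's precomputed-kernel interface, and the toy data are my own choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel_matrix(XA, XB, gamma=0.5):
    """K(x, x') = exp(-gamma * ||x - x'||^2): an inner-product kernel in a
    high-dimensional feature space, satisfying Mercer's theorem."""
    sq = (np.sum(XA**2, axis=1)[:, None] + np.sum(XB**2, axis=1)[None, :]
          - 2.0 * XA @ XB.T)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
d = np.where(np.sum(X**2, axis=1) > 1.0, 1.0, -1.0)   # not linearly separable in input space

K = rbf_kernel_matrix(X, X)                  # K = {K(x_i, x_j)}
clf = SVC(kernel='precomputed', C=10.0).fit(K, d)

X_new = rng.normal(size=(5, 2))
K_new = rbf_kernel_matrix(X_new, X)          # only kernel evaluations are needed at test time
print(clf.predict(K_new))
```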

Support Vector Machines
- The requirement on the kernel K(x, x_i) is that it satisfy Mercer's theorem.
- The inner-product kernels for polynomial and radial-basis function types always satisfy Mercer's theorem.
- The dimensionality of the feature space is determined by the number of support vectors extracted from the training data by the solution to the constrained optimization problem.
- The underlying theory of an SVM avoids the need for the heuristics often used in the design of conventional RBF networks and MLPs.

Support Vector Machines
- In the RBF type of SVM, the number of radial-basis functions and their centers are determined automatically by the number of support vectors and their values, respectively.
- In the two-layer perceptron type of SVM, the number of hidden neurons and their weight vectors are determined automatically by the number of support vectors and their values, respectively.
- Conceptual problem: the dimensionality of the feature space is made very large.
- Computational problem: the curse of dimensionality is avoided by using the notion of an inner-product kernel and solving the dual form of the constrained optimization problem, formulated in the input space.

Support Vector Machines
- The XOR problem: (x_1 OR x_2) AND NOT (x_1 AND x_2).
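
As an illustrative sketch (the slide itself only states the problem), a second-degree polynomial kernel K(x, x_i) = (1 + x^T x_i)^2 is the classical way to make XOR separable in the feature space; the use of scikit-learn and the chosen parameters are my assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# XOR truth table: (x1 OR x2) AND NOT (x1 AND x2), encoded with +/-1 inputs.
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
d = np.array([-1, +1, +1, -1])

# In scikit-learn the polynomial kernel is (gamma * x^T x_i + coef0)^degree,
# so gamma=1, coef0=1, degree=2 gives K(x, x_i) = (1 + x^T x_i)^2.
# A large C approximates the hard-margin solution; all four points become support vectors.
clf = SVC(kernel='poly', degree=2, coef0=1.0, gamma=1.0, C=1e6)
clf.fit(X, d)

print("predictions:", clf.predict(X))        # reproduces the XOR labels
print("support vectors:\n", clf.support_vectors_)
```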

Support Vector Machines for Nonlinear Regression
- Consider a nonlinear regressive model in which the dependence of a scalar d on a vector x is described by d = f(x) + v.
- We are given a set of training data {(x_i, d_i)}_{i=1}^N, where x_i is a sample value of the input vector x and d_i is the corresponding value of the model output d. The problem is to provide an estimate of the dependence of d on x.
- y = Σ_{j=0}^{m_1} w_j φ_j(x) = w^T φ(x)
- Minimize the empirical risk R_emp = (1/N) Σ_{i=1}^N L_ε(d_i, y_i) subject to the inequality ‖w‖² ≤ c_0.

Support Vector Machines for Nonlinear Regression
- Introduce two sets of nonnegative slack variables {ξ_i}_{i=1}^N and {ξ'_i}_{i=1}^N:
  d_i - w^T φ(x_i) ≤ ε + ξ_i,  i = 1, 2, ..., N
  w^T φ(x_i) - d_i ≤ ε + ξ'_i,  i = 1, 2, ..., N
  ξ_i ≥ 0, ξ'_i ≥ 0,  i = 1, 2, ..., N
- Φ(w, ξ, ξ') = ½ w^T w + C Σ_{i=1}^N (ξ_i + ξ'_i)
- J(w, ξ, ξ', α, α', γ, γ') = C Σ_{i=1}^N (ξ_i + ξ'_i) + ½ w^T w - Σ_{i=1}^N α_i [w^T φ(x_i) - d_i + ε + ξ_i] - Σ_{i=1}^N α'_i [d_i - w^T φ(x_i) + ε + ξ'_i] - Σ_{i=1}^N (γ_i ξ_i + γ'_i ξ'_i)
- w = Σ_{i=1}^N (α_i - α'_i) φ(x_i)
- γ_i = C - α_i and γ'_i = C - α'_i
- K(x_i, x_j) = φ^T(x_i) φ(x_j)

Support Vector Machines for Nonlinear Regression
- Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N and {α'_i}_{i=1}^N that maximize the objective function
  Q(α, α') = Σ_{i=1}^N d_i (α_i - α'_i) - ε Σ_{i=1}^N (α_i + α'_i) - ½ Σ_{i=1}^N Σ_{j=1}^N (α_i - α'_i)(α_j - α'_j) K(x_i, x_j)
  subject to the constraints: 1) Σ_{i=1}^N (α_i - α'_i) = 0; 2) 0 ≤ α_i ≤ C and 0 ≤ α'_i ≤ C for i = 1, 2, ..., N, where C is a user-specified constant.
- The parameters ε and C are free parameters selected by the user; they must be tuned simultaneously.
- Regression is intrinsically more difficult than pattern classification.
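
As a hedged, practical counterpart to the ε-insensitive formulation above, scikit-learn's SVR exposes ε and C directly; the toy regression data and parameter values below are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# d = f(x) + v with f(x) = sin(x) and additive noise v (toy data for illustration).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 80))[:, None]
d = np.sin(x).ravel() + 0.1 * rng.normal(size=80)

# epsilon is the width of the insensitive tube, C the box constraint on the
# Lagrange multipliers; as the slide notes, the two must be tuned together.
model = SVR(kernel='rbf', C=10.0, epsilon=0.1)
model.fit(x, d)

print("number of support vectors:", len(model.support_))
print("prediction at x = pi/2:", model.predict(np.array([[np.pi / 2]])))
```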

Summary
- The SVM is an elegant and highly principled learning method for the design of a feedforward network with a single layer of nonlinear units.
- The SVM includes the polynomial learning machine, radial-basis function network, and two-layer perceptron as special cases.
- The SVM provides a method for controlling model complexity independently of dimensionality.
- The SVM learning algorithm operates only in batch mode.
- By using a suitable inner-product kernel, the SVM automatically computes all the important parameters pertaining to that choice of kernel.
- In terms of running time, SVMs are currently slower than other neural networks.