Support Vector Machines and Kernels
Adapted from slides by Tim Oates, Cognition, Robotics, and Learning (CORAL) Lab, University of Maryland Baltimore County.


Support Vector Machines and Kernels: Doing Really Well with Linear Decision Surfaces

Outline
- Prediction
- Why might predictions be wrong?
- Support vector machines
  - Doing really well with linear models
- Kernels
  - Making the non-linear linear

Supervised ML = Prediction
- Given training instances (x, y)
- Learn a model f such that f(x) = y
- Use f to predict y for new x
- Many variations on this basic theme
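
A minimal sketch of this fit/predict loop, assuming scikit-learn is available; the SVC classifier stands in for "a model f", and the data is made up purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy training instances (x, y): two features per instance, labels in {-1, +1}.
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 2.5]])
y_train = np.array([-1, -1, 1, 1])

# Learn a model f such that f(x) matches y on the training data.
f = SVC(kernel="linear")
f.fit(X_train, y_train)

# Use f to predict y for a new x.
x_new = np.array([[2.5, 2.0]])
print(f.predict(x_new))
```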

Why might predictions be wrong?
- True non-determinism
  - Flip a biased coin with p(heads) = θ
  - Estimate θ; if θ > 0.5 predict heads, else tails
- Lots of ML research on problems like this: learn a model, do the best you can in expectation

Why might predictions be wrong?
- Partial observability: something needed to predict y is missing from observation x
- N-bit parity problem
  - x contains N-1 bits (hard PO)
  - x contains N bits, but the learner ignores some of them (soft PO)

Why might predictions be wrong?
- True non-determinism
- Partial observability (hard, soft)
- Representational bias
- Algorithmic bias
- Bounded resources

Representational Bias
- Having the right features (x) is crucial
- (figure: points in the pattern X O O O O X X X along a single feature axis)

Support Vector Machines Doing Really Well with Linear Decision Surfaces

Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find globally best model
- Efficient algorithms
- Amenable to the kernel trick

Linear Separators
- Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
- Parameters: w ∈ ℝⁿ, b ∈ ℝ
- Hyperplane: ⟨w, x⟩ + b = 0, i.e. w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0
- Decision function: f(x) = sign(⟨w, x⟩ + b)
- Math review, inner (dot) product: ⟨a, b⟩ = a · b = Σᵢ aᵢbᵢ = a₁b₁ + a₂b₂ + … + aₙbₙ
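
A small numpy sketch of this decision function; the values of w, b, and the query point are made up for illustration.

```python
import numpy as np

def decision(w: np.ndarray, b: float, x: np.ndarray) -> int:
    """f(x) = sign(<w, x> + b); returns +1 or -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = -0.5                    # offset
x = np.array([1.0, 1.0])    # query point
print(decision(w, b, x))    # which side of the hyperplane x falls on
```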

Intuitions
(four figures: the same X and O training points shown with different candidate linear separators)

A “Good” Separator
(figure: the X and O points with a well-placed separating line far from both classes)

Noise in the Observations
(figure: the same X and O points shown with observation noise around each point)

Ruling Out Some Separators
(figure: separators that pass too close to the noisy points are ruled out)

Lots of Noise
(figure: with more noise, fewer candidate separators remain acceptable)

Maximizing the Margin
(figure: the separator with the largest margin to the nearest X and O points)

“Fat” Separators
(figure: the separator drawn as a thick band, i.e., a separating line together with its margin)

Why Maximize Margin?
- Increasing margin reduces capacity
- Must restrict capacity to generalize
- m training instances, 2^m ways to label them
- What if the function class can separate all of these labelings? Then it shatters the training instances
- VC dimension: the largest m such that the function class can shatter some set of m points

VC Dimension Example
(figure: three points in the plane shown with all 2³ = 8 possible X/O labelings; a line can separate every labeling, so linear separators shatter these three points)

Bounding Generalization Error
- R[f] = risk (test error)
- R_emp[f] = empirical risk (training error)
- h = VC dimension
- m = number of training instances
- δ = probability that the bound does not hold
- With probability at least 1 - δ:
  R[f] ≤ R_emp[f] + √( ( h (ln(2m/h) + 1) + ln(4/δ) ) / m )
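
A small numeric illustration of this bound, assuming the standard VC form reconstructed above; the numbers plugged in are made up.

```python
import math

def vc_bound(emp_risk: float, h: int, m: int, delta: float) -> float:
    """Upper bound on the true risk R[f], holding with probability >= 1 - delta."""
    complexity = (h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m
    return emp_risk + math.sqrt(complexity)

# Example: training error 0.05, VC dimension 10, 1000 instances, delta = 0.05.
print(vc_bound(0.05, h=10, m=1000, delta=0.05))
```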

Support Vectors
(figure: the maximum-margin separator with the training points that lie on the margin highlighted; these are the support vectors)

The Math
- Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
- Decision function: f(x) = sign(⟨w, x⟩ + b), with w ∈ ℝⁿ, b ∈ ℝ
- Find w and b that
  - Perfectly classify the training instances (assuming linear separability)
  - Maximize the margin

The Math
- For perfect classification, we want y_i(⟨w, x_i⟩ + b) ≥ 0 for all i
- Why? The sign of ⟨w, x_i⟩ + b must agree with the label y_i
- To maximize the margin, we want the w that minimizes |w|²
  (rescaling so that y_i(⟨w, x_i⟩ + b) ≥ 1 for the closest points makes the margin 2/|w|, so maximizing the margin is the same as minimizing |w|²)
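
A sketch of this maximum-margin fit, assuming scikit-learn; a very large C approximates the hard-margin ("perfect classification") problem on this separable toy data.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C -> (approximately) hard margin: minimize |w|^2 subject to
# y_i(<w, x_i> + b) >= 1 for all i.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width = 2 / |w| =", 2 / np.linalg.norm(w))
```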

Dual Optimization Problem
- Maximize over α:
  W(α) = Σᵢ αᵢ - ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩
- Subject to:
  αᵢ ≥ 0
  Σᵢ αᵢ yᵢ = 0
- Decision function:
  f(x) = sign( Σᵢ αᵢ yᵢ ⟨xᵢ, x⟩ + b )
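
A sketch of the dual quantities scikit-learn exposes after fitting; in sklearn, `dual_coef_` holds the products αᵢyᵢ for the support vectors, so the dual decision function can be reassembled by hand. The data is the same style of toy example as above.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]    # alpha_i * y_i for each support vector
sv = clf.support_vectors_      # the support vectors x_i
b = clf.intercept_[0]

# f(x) = sign( sum_i alpha_i y_i <x_i, x> + b ), summed over the support vectors.
x_new = np.array([3.0, 3.0])
score = np.sum(alpha_y * (sv @ x_new)) + b
print("dual decision value:   ", score)
print("sklearn decision value:", clf.decision_function([x_new])[0])
```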

What if Data Are Not Perfectly Linearly Separable?
- Cannot find w and b that satisfy y_i(⟨w, x_i⟩ + b) ≥ 1 for all i
- Introduce slack variables ξᵢ ≥ 0:
  y_i(⟨w, x_i⟩ + b) ≥ 1 - ξᵢ for all i
- Minimize |w|² + C Σᵢ ξᵢ
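
A brief sketch, assuming scikit-learn, of how the slack penalty C (the C parameter of SVC) trades margin width against slack on overlapping, non-separable data; the dataset is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.2, size=(50, 2)),    # class -1
               rng.normal(2.5, 1.2, size=(50, 2))])   # class +1 (overlapping)
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```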

Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find globally best model
- Efficient algorithms
- Amenable to the kernel trick …

What if the Surface is Non-Linear?
(figure: a cluster of X points completely surrounded by O points; no straight line separates the two classes)

Kernel Methods Making the Non-Linear Linear

When Linear Separators Fail
(figure: points in the pattern X O O O O X X X plotted against features x₁ and x₂ cannot be separated by a line, but after mapping x₁ to the pair (x₁, x₁²) they can)
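
A sketch of that idea, assuming scikit-learn: 1-D points in the pattern X O O O O X X X are not separable by a threshold on x alone, but adding x² as a second feature makes them linearly separable. The specific numbers are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D points in the pattern X O O O O X X X: the X class sits on both sides of the O class.
x = np.array([-3.0, -1.5, -0.5, 0.5, 1.5, 2.5, 3.0, 3.5])
y = np.array([  1,   -1,   -1,  -1,  -1,   1,   1,   1])   # X = +1, O = -1

# No threshold on x alone separates the classes, but the pair (x, x^2) does.
X_mapped = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear").fit(X_mapped, y)
print("training accuracy:", clf.score(X_mapped, y))   # should be 1.0 on this toy data
```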

Mapping into a New Feature Space
- Rather than run the SVM on xᵢ, run it on Φ(xᵢ)
- Find a non-linear separator in input space (it is linear in the new feature space)
- Example: Φ: x → Φ(x), with Φ(x₁, x₂) = (x₁, x₂, x₁², x₂², x₁x₂)
- What if Φ(xᵢ) is really big?
- Use kernels to compute it implicitly!
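
A sketch of the explicit version of this mapping, before the kernel trick is introduced; scikit-learn is assumed and the ring-shaped dataset is synthetic. A linear SVM in the mapped space corresponds to a non-linear boundary in the original (x₁, x₂) space.

```python
import numpy as np
from sklearn.svm import SVC

def phi(X: np.ndarray) -> np.ndarray:
    """Explicit feature map Phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# Ring-shaped data: an inner cluster (class +1) surrounded by an outer ring (class -1).
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.3, size=(100, 2))
outer = np.column_stack([2 * np.cos(angles), 2 * np.sin(angles)]) + rng.normal(0, 0.1, size=(100, 2))
X = np.vstack([inner, outer])
y = np.array([1] * 100 + [-1] * 100)

# A linear SVM on Phi(X) separates what no line in the input space can.
clf = SVC(kernel="linear").fit(phi(X), y)
print("training accuracy:", clf.score(phi(X), y))
```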

Kernels
- Find a kernel K such that K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩
- Computing K(x₁, x₂) should be efficient, much more so than computing Φ(x₁) and Φ(x₂)
- Use K(x₁, x₂) in the SVM algorithm rather than ⟨x₁, x₂⟩
- Remarkably, this is possible

The Polynomial Kernel
- K(x₁, x₂) = ⟨x₁, x₂⟩²
- With x₁ = (x₁₁, x₁₂) and x₂ = (x₂₁, x₂₂):
  ⟨x₁, x₂⟩ = x₁₁x₂₁ + x₁₂x₂₂
  ⟨x₁, x₂⟩² = x₁₁²x₂₁² + x₁₂²x₂₂² + 2x₁₁x₁₂x₂₁x₂₂
- Define Φ(x₁) = (x₁₁², x₁₂², √2 x₁₁x₁₂) and Φ(x₂) = (x₂₁², x₂₂², √2 x₂₁x₂₂)
- Then K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩
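
A quick numerical check of this identity with numpy; the two vectors are arbitrary.

```python
import numpy as np

def phi(x: np.ndarray) -> np.ndarray:
    """Feature map for the degree-2 polynomial kernel on 2-D inputs."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

kernel_value = np.dot(x1, x2) ** 2          # K(x1, x2) = <x1, x2>^2
explicit_value = np.dot(phi(x1), phi(x2))   # <Phi(x1), Phi(x2)>

print(kernel_value, explicit_value)          # both equal 1.0 for these vectors
```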

The Polynomial Kernel
- Φ(x) contains all monomials of degree d
- Useful in visual pattern recognition
- The number of monomials explodes: a 16x16 pixel image (256 inputs) has on the order of 10¹⁰ monomials of degree 5
- Never explicitly compute Φ(x)!
- Variation: K(x₁, x₂) = (⟨x₁, x₂⟩ + 1)²
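
That count can be checked directly; this assumes "monomials of degree 5" means monomials of degree exactly 5, with repetition, over the 256 pixel values.

```python
from math import comb

n_pixels = 16 * 16   # 256 input features
degree = 5

# Number of monomials of degree exactly d in n variables (with repetition): C(n + d - 1, d)
n_monomials = comb(n_pixels + degree - 1, degree)
print(n_monomials)   # 9,525,431,552, i.e. on the order of 10^10
```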

Kernels
- What does it mean to be a kernel?
  K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩ for some Φ
- What does it take to be a kernel?
  - The Gram matrix G with Gᵢⱼ = K(xᵢ, xⱼ) must be a positive definite matrix:
    Σᵢ,ⱼ cᵢcⱼGᵢⱼ ≥ 0 for all cᵢ, cⱼ ∈ ℝ
  - A positive definite kernel induces a positive definite Gram matrix for every sample of size m
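
A small numpy check of this condition: build the Gram matrix for a Gaussian kernel on random points and confirm that its eigenvalues are non-negative (up to numerical error). The kernel choice and data are illustrative.

```python
import numpy as np

def gaussian_kernel(x1: np.ndarray, x2: np.ndarray, sigma: float = 1.0) -> float:
    """K(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))    # 20 random points in R^3

# Gram matrix G_ij = K(x_i, x_j)
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(G)
print("smallest eigenvalue:", eigenvalues.min())   # >= 0 up to numerical error
```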

A Few Good Kernels
- Dot product kernel: K(x₁, x₂) = ⟨x₁, x₂⟩
- Polynomial kernel:
  - K(x₁, x₂) = ⟨x₁, x₂⟩^d (monomials of degree d)
  - K(x₁, x₂) = (⟨x₁, x₂⟩ + 1)^d (all monomials of degree 1, 2, …, d)
- Gaussian kernel: K(x₁, x₂) = exp(-|x₁ - x₂|² / 2σ²) (radial basis functions)
- Sigmoid kernel: K(x₁, x₂) = tanh(κ⟨x₁, x₂⟩ + θ) (neural networks)
- Establishing “kernel-hood” from first principles is non-trivial
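
These kernels are simple to write down directly; a numpy sketch with κ, θ, σ, d, and c as free parameters (the symbol names for the sigmoid kernel are the usual convention, not taken from the slide).

```python
import numpy as np

def dot_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, d=2, c=0.0):
    # c = 0 gives monomials of degree d; c = 1 also includes the lower-degree terms.
    return (np.dot(x1, x2) + c) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x1) - np.asarray(x2)) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x1, x2, kappa=1.0, theta=0.0):
    # Not positive definite for every choice of kappa and theta.
    return np.tanh(kappa * np.dot(x1, x2) + theta)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (dot_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(x1, x2))
```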

The Kernel Trick
- “Given an algorithm which is formulated in terms of a positive definite kernel K₁, one can construct an alternative algorithm by replacing K₁ with another positive definite kernel K₂.”
- SVMs can use the kernel trick

Using a Different Kernel in the Dual Optimization Problem
- For example, use the polynomial kernel with d = 4 (including lower-order terms)
- The inner products in the dual are kernels, so by the kernel trick we just replace them:
  - Maximize over α: W(α) = Σᵢ αᵢ - ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ (⟨xᵢ, xⱼ⟩ + 1)⁴
  - Subject to: αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0
  - Decision function: f(x) = sign( Σᵢ αᵢ yᵢ (⟨xᵢ, x⟩ + 1)⁴ + b )
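
In scikit-learn this kernel swap is just a constructor argument; a minimal sketch on synthetic data. With `gamma=1.0`, `coef0=1`, and `degree=4`, sklearn's polynomial kernel is exactly (⟨x₁, x₂⟩ + 1)⁴.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)   # circular decision boundary

# Same dual optimization as before, but with K(x1, x2) = (<x1, x2> + 1)^4.
clf = SVC(kernel="poly", degree=4, coef0=1, gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```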

Exotic Kernels
- Strings
- Trees
- Graphs
- The hard part is establishing kernel-hood

Application: “Beautification Engine” (Leyvand et al., 2008)

Conclusion
- SVMs find the optimal linear separator
- The kernel trick makes SVMs non-linear learning algorithms