Support Vector Machines Chapter 18.9. Nov 23rd, 2001. Copyright © 2001, 2003, Andrew W. Moore. Support Vector Machines. Andrew W. Moore, Professor, School.

Presentation transcript:

1 Support Vector Machines Chapter 18.9

Nov 23rd, 2001. Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machines
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: Comments and corrections gratefully received.

Support Vector Machines: Slide 3
Overview
Proposed by Vapnik and his colleagues
- Started in 1963, taking shape in the late 70’s as part of his statistical learning theory (with Chervonenkis)
- Current form established in the early 90’s (with Cortes)
Became popular in the last decade
- Classification, regression (function approximation), optimization
- Compares favorably to the MLP
Basic ideas
- Overcome the linear separability problem by transforming the problem into a higher-dimensional space using kernel functions (this becomes equivalent to a 2-layer perceptron when the kernel is a sigmoid function)
- Maximize the margin of the decision boundary

Support Vector Machines: Slide 4
Linear Classifiers
[Figure: block diagram x → f → y_est, and a scatter plot in which one marker denotes +1 and the other denotes -1]
f(x, w, b) = sign(w · x + b)
How would you classify this data?


Support Vector Machines: Slide 8
Linear Classifiers
[Figure: the same scatter plot with several candidate separating lines]
f(x, w, b) = sign(w · x + b)
Any of these would be fine... but which is best?
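The decision rule f(x, w, b) = sign(w · x + b) can be sketched in a few lines of Python. This is only an illustration: the weights, the offset and the two test points are made-up values, and assigning boundary points to +1 is an arbitrary tie-break that the slides do not specify.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_classify(x, w, b):
    """Return +1 or -1 according to sign(w . x + b).

    Points exactly on the boundary (w . x + b == 0) are assigned +1 here;
    that tie-break is an assumption of this sketch.
    """
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D boundary x1 + x2 = 3.
w, b = (1.0, 1.0), -3.0
print(linear_classify((2.5, 2.0), w, b))  # w . x + b = 1.5, so +1
print(linear_classify((0.5, 1.0), w, b))  # w . x + b = -1.5, so -1
```

Many different (w, b) pairs classify a separable training set perfectly, which is exactly why the slide asks which one is best.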

Support Vector Machines: Slide 9
Classifier Margin
[Figure: scatter plot with one separating line and the band around it; one marker denotes +1, the other denotes -1]
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
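As a rough sketch of this definition, the margin of a given boundary can be computed as twice the distance from the boundary to the nearest datapoint. That reading assumes the boundary is widened symmetrically on both sides; the dataset and both candidate boundaries below are hypothetical.

```python
import math


def margin_width(points, labels, w, b):
    """Width of the empty band around the boundary w . x + b = 0.

    Returns None if any point lies on the boundary or on the wrong side,
    because such a boundary has no margin in the sense used here.
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    distances = []
    for x, y in zip(points, labels):
        # Signed distance: positive when the point is on its correct side.
        signed = y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
        if signed <= 0:
            return None
        distances.append(signed)
    return 2 * min(distances)


# Hypothetical toy data and two candidate boundaries.
points = [(0.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
labels = [-1, +1, +1]
wide = margin_width(points, labels, (1.0, 0.0), -1.0)
narrow = margin_width(points, labels, (1.0, 1.0), -1.5)
print(wide, narrow)
```

Both boundaries separate the toy data, but the first leaves a wider empty band than the second.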

Support Vector Machines: Slide 10
Maximum Margin
[Figure: scatter plot with the maximum-margin separating line; one marker denotes +1, the other denotes -1]
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

Support Vector Machines: Slide 11
Maximum Margin
[Figure: the same plot with the support vectors highlighted]
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.

Support Vector Machines: Slide 12
Why Maximum Margin?
[Figure: maximum-margin separating line with support vectors highlighted; one marker denotes +1, the other denotes -1]
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. CV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There’s some theory that this is a good thing.
5. Empirically it works very very well.

Support Vector Machines: Slide 13
Specifying a line and margin
[Figure: Plus-Plane, Classifier Boundary and Minus-Plane, with a “Predict Class = +1” zone on one side and a “Predict Class = -1” zone on the other]
How do we represent this mathematically? ...in m input dimensions?

Support Vector Machines: Slide 14
Specifying a line and margin
Conditions for the optimal separating hyperplane for data points (x_1, y_1), ..., (x_l, y_l), where y_i = ±1:
1. w · x_i + b ≥ 1 if y_i = 1 (points in plus class)
2. w · x_i + b ≤ -1 if y_i = -1 (points in minus class)
[Figure: Plus-Plane (w · x + b = 1), Classifier Boundary (w · x + b = 0) and Minus-Plane (w · x + b = -1), with “Predict Class = +1” and “Predict Class = -1” zones]
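The two conditions on this slide fold into the single inequality y_i (w · x_i + b) ≥ 1, which is easy to check for a candidate (w, b). The sketch below does just that; the function name and the toy data are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def satisfies_margin_constraints(points, labels, w, b):
    """True if every point is on or beyond its own margin plane.

    For y = +1 the test is w . x + b >= 1; for y = -1 it is w . x + b <= -1.
    Multiplying by y turns both cases into y * (w . x + b) >= 1.
    """
    return all(y * (dot(w, x) + b) >= 1 for x, y in zip(points, labels))


# Hypothetical toy data.
points = [(0.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
labels = [-1, +1, +1]
print(satisfies_margin_constraints(points, labels, (1.0, 0.0), -1.0))
print(satisfies_margin_constraints(points, labels, (0.5, 0.0), -0.5))
```

The second candidate describes the same boundary line as the first, but its scaling puts the plus and minus planes beyond the nearest datapoints, so the constraints fail. The constraints therefore pin down the scale of w as well as its direction.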

Support Vector Machines: Slide 15
Computing the margin width
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
[Figure: planes w · x + b = 1, w · x + b = 0 and w · x + b = -1, with “Predict Class = +1” and “Predict Class = -1” zones; M = margin width]
How do we compute M in terms of w and b?

Support Vector Machines: Slide 16
Computing the margin width
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Let u and v be two vectors on the Plus Plane. What is w · (u – v)?
And so of course the vector w is also perpendicular to the Minus Plane.
How do we compute M in terms of w and b?
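The question "What is w · (u – v)?" has a one-line answer, sketched here as a worked step. Only u, v, w and b come from the slide; the layout of the derivation is ours.

```latex
\begin{aligned}
w \cdot u + b &= 1, \qquad w \cdot v + b = 1 \\
\Rightarrow\quad w \cdot (u - v) &= (1 - b) - (1 - b) = 0
\end{aligned}
```

Because u – v can be any direction lying within the Plus Plane, w is orthogonal to every such direction. The same argument with -1 in place of +1 covers the Minus Plane.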

Support Vector Machines: Slide 17
Computing the margin width
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
The vector w is perpendicular to the Plus Plane.
Let x⁻ be any point on the minus plane (any location in R^m: not necessarily a datapoint).
Let x⁺ be the closest plus-plane-point to x⁻.
[Figure: planes w · x + b = 1, 0 and -1 with the points x⁻ and x⁺ marked; M = margin width]
How do we compute M in terms of w and b?

Support Vector Machines: Slide 18
Computing the margin width
Claim: x⁺ = x⁻ + λ w for some value of λ. Why?

Support Vector Machines: Slide 19
Computing the margin width
Claim: x⁺ = x⁻ + λ w for some value of λ. Why?
The line from x⁻ to x⁺ is perpendicular to the planes. So to get from x⁻ to x⁺, travel some distance in direction w.

Support Vector Machines: Slide 20
Computing the margin width
What we know:
- w · x⁺ + b = +1
- w · x⁻ + b = -1
- x⁺ = x⁻ + λ w
- |x⁺ - x⁻| = M
It’s now easy to get M in terms of w and b.
[Figure: planes w · x + b = 1, 0 and -1 with the points x⁻ and x⁺ marked; M = margin width]

Support Vector Machines: Slide 21
Computing the margin width
w · (x⁻ + λ w) + b = 1
=> w · x⁻ + b + λ w · w = 1
=> -1 + λ w · w = 1
=> λ = 2 / (w · w)

Support Vector Machines: Slide 22
Computing the margin width
M = |x⁺ - x⁻| = |λ w| = λ |w| = λ sqrt(w · w) = 2 sqrt(w · w) / (w · w) = 2 / sqrt(w · w)
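A quick numeric sanity check of this derivation: write the step from the minus plane to the plus plane as a multiple lam of w (lam is our name for that scalar), take lam = 2 / (w · w), and confirm that the resulting point lands on the plus plane and that the step length equals 2 / sqrt(w · w). The particular w, b and starting point are arbitrary made-up values.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical hyperplane parameters and a point on the minus plane.
w, b = (3.0, 4.0), -2.0
x_minus = (-1.0, 1.0)          # w . x_minus + b = -3 + 4 - 2 = -1

lam = 2.0 / dot(w, w)          # step size along w
x_plus = tuple(xm + lam * wi for xm, wi in zip(x_minus, w))

# Length of the step from x_minus to x_plus, i.e. the margin width M.
step = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))

print(dot(w, x_plus) + b)                 # should be (numerically) +1
print(step, 2.0 / math.sqrt(dot(w, w)))   # both should be about 0.4
```

Note that b cancels out of the final expression, so the margin width depends only on the length of w.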

Support Vector Machines: Slide 23
Learning the Maximum Margin Classifier
[Figure: planes w · x + b = 1, 0 and -1 with the points x⁻ and x⁺ marked; M = margin width = 2 / sqrt(w · w)]
Given a guess of w and b we can
- Compute whether all data points are in the correct half-planes
- Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated annealing? Matrix inversion? EM? Newton’s method?

Support Vector Machines: Slide 24
Learning via Quadratic Programming
The optimal separating hyperplane can be found by solving [equation missing]
- This is a quadratic function
- Once [symbols missing] are found, the weight matrix [equation missing] and the decision function is [equation missing]
This optimization problem can be solved by quadratic programming (QP).
QP is a well-studied class of optimization algorithms that maximize a quadratic function subject to linear constraints.
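The equations on this slide did not survive in the transcript. For orientation only, the block below gives the standard textbook dual form of the hard-margin problem, which matches the slide's description (a quadratic function of multipliers that, once found, give the weights and the decision function). It is an assumption that the slide used this form; the multipliers α_i are not named in the surviving text.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
\text{subject to}\quad & \alpha_i \ge 0 \ \ (i = 1, \dots, l),
  \qquad \sum_{i=1}^{l} \alpha_i y_i = 0 \\[4pt]
w &= \sum_{i=1}^{l} \alpha_i y_i x_i,
  \qquad f(x) = \operatorname{sign}(w \cdot x + b)
\end{aligned}
```

In this form only the datapoints with α_i > 0 contribute to w, and those are the support vectors.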

Support Vector Machines: Slide 25
Quadratic Programming
Find [expression missing] (quadratic criterion)
Subject to [n additional linear inequality constraints, equations missing]
And subject to [e additional linear equality constraints, equations missing]

Support Vector Machines: Slide 26
Quadratic Programming
Find [expression missing] (quadratic criterion)
Subject to [n additional linear inequality constraints, equations missing]
And subject to [e additional linear equality constraints, equations missing]
There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly... you probably don’t want to write one yourself.)

Support Vector Machines: Slide 27
Learning the Maximum Margin Classifier
[Figure: planes w · x + b = 1, 0 and -1, with “Predict Class = +1” and “Predict Class = -1” zones; M = 2 / sqrt(w · w)]
Given a guess of w, b we can
- Compute whether all data points are in the correct half-planes
- Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
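One standard way to answer these questions, offered here as a sketch and not as the slide's own wording: since the margin width is 2 / sqrt(w · w), making the margin as wide as possible is the same as making w · w as small as possible, and each of the R datapoints contributes one linear constraint.

```latex
\begin{aligned}
\min_{w,\,b}\quad & w \cdot w \\
\text{subject to}\quad & y_k \,(w \cdot x_k + b) \ge 1,
  \qquad k = 1, \dots, R
\end{aligned}
```

The criterion is quadratic in w and the R constraints are linear in (w, b), so the problem has the quadratic-programming shape described on the preceding slides.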

Support Vector Machines: Slide 28
Suppose we’re in 1-dimension
[Figure: datapoints of both classes on a number line around x = 0]
What would SVMs do with this data?

Support Vector Machines: Slide 29
Suppose we’re in 1-dimension
[Figure: the same number line with the positive “plane” and negative “plane” marked]
Not a big surprise.

Support Vector Machines: Slide 30
Harder 1-dimensional dataset
[Figure: number line around x = 0 on which the classes cannot be split by a single threshold]
That’s wiped the smirk off SVM’s face. What can be done about this?

Support Vector Machines: Slide 31
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let’s permit them here too.
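A minimal sketch of the idea, under stated assumptions: the dataset below is invented to look like the "harder" case (one class near x = 0, the other on both sides), and the basis z = (x, x²) is one simple choice of non-linear basis function. In the new space a straight line separates the classes.

```python
def basis(x):
    """Map a 1-D input x to the 2-D feature vector z = (x, x**2)."""
    return (x, x * x)


# Hypothetical dataset: negatives near 0, positives on both sides.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

# In z-space the horizontal line z2 = 2 separates the classes,
# i.e. the linear classifier with w = (0, 1) and b = -2.
w, b = (0.0, 1.0), -2.0
predictions = [
    1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1
    for z in map(basis, xs)
]
print(predictions == ys)
```

No single threshold on x alone can reproduce these labels, because a 1-D linear classifier switches sign only once along the number line.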

Support Vector Machines: Slide 33
Common SVM basis functions
- z_k = (polynomial terms of x_k of degree 1 to q)
- z_k = (radial basis functions of x_k)
- z_k = (sigmoid functions of x_k)

Support Vector Machines: Slide 34
Explosion of feature space dimensionality
Consider a degree 2 polynomial kernel function z = Φ(x) for data point x = (x_1, x_2, ..., x_n):
- z_1 = x_1, ..., z_n = x_n
- z_{n+1} = (x_1)^2, ..., z_{2n} = (x_n)^2
- z_{2n+1} = x_1 x_2, ..., z_N = x_{n-1} x_n
where N = n(n+3)/2.
When constructing polynomials of degree 5 for a 256-dimensional input space, the feature space is billion-dimensional.
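The counts on this slide can be reproduced with a short calculation. The sketch assumes the feature vector holds every monomial of degree 1 up to q in n variables (no constant term), which is the convention that yields N = n(n+3)/2 for q = 2; the function name is ours.

```python
from math import comb


def num_poly_features(n, q):
    """Number of monomials of degree 1..q in n variables.

    There are comb(n + q, q) monomials of degree 0..q; subtracting 1
    drops the constant term.
    """
    return comb(n + q, q) - 1


# Degree 2 agrees with the closed form n(n+3)/2 from the slide.
for n in (2, 3, 10):
    print(n, num_poly_features(n, 2), n * (n + 3) // 2)

# Degree 5 in a 256-dimensional input space: billions of features.
print(num_poly_features(256, 5))
```

Storing or even enumerating that many features per datapoint is impractical, which motivates computing dot products in feature space without building the features.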

Support Vector Machines: Slide 35
Kernel trick
Example: polynomial kernel [equations missing]
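The worked example from this slide is not preserved, so here is an independent sketch of the same idea for the degree 2 polynomial kernel in two dimensions. The explicit feature map phi below is one standard choice (with square-root-of-2 scaling so the identity is exact) and is an assumption, not the slide's own map.

```python
import math


def phi(x):
    """Explicit 6-D feature map whose dot product equals (a . b + 1)**2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


def poly_kernel(a, b, d=2):
    """Polynomial kernel K(a, b) = (a . b + 1)**d, computed in input space."""
    return (sum(ai * bi for ai, bi in zip(a, b)) + 1.0) ** d


# Arbitrary test vectors.
a, b = (1.0, 2.0), (3.0, -1.0)
explicit = sum(p * q for p, q in zip(phi(a), phi(b)))  # dot product in 6-D
implicit = poly_kernel(a, b)                           # one 2-D dot product
print(explicit, implicit)
```

Both routes give the same number, but the kernel route never builds the 6-D vectors. In higher dimensions and degrees, that saving is the whole point.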

Support Vector Machines: Slide 36
Kernel trick + QP
The max margin classifier can be found by solving [equation missing]
- the weight matrix (no need to compute and store) [equation missing]
- the decision function is [equation missing]
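Since the slide's equations are missing, the sketch below uses the standard kernelized decision function, sign(sum_i alpha_i y_i K(x_i, x) + b), to show why the weight vector never has to be computed or stored. The function name and the multipliers are assumptions; the toy values happen to be the exact maximum-margin solution for the two datapoints shown, with a linear kernel.

```python
def kernel_decision(x, support_vectors, labels, alphas, b, kernel):
    """Classify x using only kernel evaluations against the support vectors."""
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if score >= 0 else -1


def linear_kernel(u, v):
    """Plain dot product; any other kernel could be passed in its place."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Two-point toy problem: (0, 0) is class -1 and (2, 0) is class +1.
# Its max-margin solution is w = (1, 0), b = -1, i.e. alpha = 0.5 for both.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, +1]
alphas = [0.5, 0.5]
b = -1.0

print(kernel_decision((3.0, 1.0), support_vectors, labels, alphas, b, linear_kernel))
print(kernel_decision((0.5, 0.0), support_vectors, labels, alphas, b, linear_kernel))
```

Swapping linear_kernel for a polynomial or radial-basis kernel changes the decision boundary without changing this function.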

Support Vector Machines: Slide 37
SVM Kernel Functions
Use kernel functions which compute [expression missing]
K(a, b) = (a · b + 1)^d is an example of an SVM polynomial kernel function.
Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function:
- Radial-basis-style kernel function: [equation missing]
- Neural-net-style kernel function: [equation missing]
σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM*
*see last lecture
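The three kernel families named on this slide can be sketched as follows. Only the polynomial form is given in the surviving text; the radial-basis and neural-net (tanh) forms below are the commonly used ones and are an assumption about what the slide showed, as are the parameter names sigma, kappa and delta.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(a, b, d):
    """K(a, b) = (a . b + 1)**d"""
    return (dot(a, b) + 1.0) ** d


def rbf_kernel(a, b, sigma):
    """K(a, b) = exp(-|a - b|^2 / (2 sigma^2)), an assumed common form."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(a, b, kappa, delta):
    """K(a, b) = tanh(kappa * a . b - delta), an assumed common form."""
    return math.tanh(kappa * dot(a, b) - delta)


# Arbitrary test vectors and parameter values.
a, b = (1.0, 2.0), (3.0, -1.0)
print(polynomial_kernel(a, b, d=2))
print(rbf_kernel(a, b, sigma=1.0))
print(sigmoid_kernel(a, b, kappa=0.5, delta=0.1))
```

The parameter values used here are arbitrary. As the slide says, they would be picked by a model selection method, not fixed in advance.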


Support Vector Machines: Slide 40
SVM Performance
Anecdotally they work very very well indeed.
- Overcomes the linear separability problem
  - Transforms the input space to a higher-dimensional feature space
  - Overcomes the dimensionality explosion by the kernel trick
- Generalizes well (overfitting not as serious)
  - Maximum margin separator (MMS)
  - Find the MMS by quadratic programming
Example: currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
Several reliable people doing practical real-world work claim that SVMs have saved them when their other favorite classifiers did poorly.

Hand-written character recognition
MNIST: a data set of hand-written digits
- 60,000 training samples
- 10,000 test samples
- Each sample consists of 28 x 28 = 784 pixels
Various techniques have been tried (failure rate for test samples)
- Linear classifier: 12.0%
- 2-layer BP net (300 hidden nodes): 4.7%
- 3-layer BP net ( hidden nodes): 3.05%
- Support vector machine (SVM): 1.4%
- Convolutional net: 0.4%
- 6-layer BP net (7500 hidden nodes): 0.35%

Support Vector Machines: Slide 42
SVM Performance
There is a lot of excitement and religious fervor about SVMs as of [year missing]. Despite this, some practitioners are a little skeptical.

Support Vector Machines: Slide 43
Doing multi-class classification
SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
To extend to output arity N, learn N SVMs:
- SVM 1 learns “Output==1” vs “Output != 1”
- SVM 2 learns “Output==2” vs “Output != 2”
- :
- SVM N learns “Output==N” vs “Output != N”
SVMs can also be extended to compute real-valued functions.
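The slide does not say how the N outputs are combined at prediction time. A common convention, assumed here, is to evaluate each SVM's real-valued score w · x + b and pick the class whose SVM pushes the input furthest into its positive region. The three linear scorers below are hypothetical stand-ins for trained SVMs.

```python
def make_linear_scorer(w, b):
    """Return a function computing the real-valued score w . x + b."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return score


def one_vs_rest_predict(x, scorers):
    """Pick the class label whose one-vs-rest scorer gives the largest score."""
    return max(scorers, key=lambda label: scorers[label](x))


# Hypothetical "Output==k vs Output != k" scorers for k = 1, 2, 3.
scorers = {
    1: make_linear_scorer((1.0, 0.0), 0.0),
    2: make_linear_scorer((-1.0, 0.0), 0.0),
    3: make_linear_scorer((0.0, 1.0), 0.0),
}

print(one_vs_rest_predict((2.0, 0.5), scorers))   # scores 2.0, -2.0, 0.5
print(one_vs_rest_predict((-1.0, 3.0), scorers))  # scores -1.0, 1.0, 3.0
```

Comparing raw scores across separately trained SVMs is a heuristic, since the scores are not calibrated against each other.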

Support Vector Machines: Slide 44
References
- An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2): ,
- The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998
- Download SVM-light: