Support Vector Machines. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. Nov 23rd, 2001. Copyright © 2001, 2003, Andrew W. Moore.


Nov 23rd, 2001. Copyright © 2001, 2003, Andrew W. Moore. Support Vector Machines. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: Comments and corrections gratefully received.

Support Vector Machines: Slide 2 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?

Support Vector Machines: Slide 3 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?

Support Vector Machines: Slide 4 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?

Support Vector Machines: Slide 5 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?

Support Vector Machines: Slide 6 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Any of these would be fine... but which is best?
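As a concrete sketch of the decision rule on these slides (the weights and points here are hypothetical, chosen just to show the two sides of a boundary):

```python
import numpy as np

def linear_classify(x, w, b):
    # f(x, w, b) = sign(w . x - b); break the tie at 0 toward +1
    return 1 if np.dot(w, x) - b >= 0 else -1

# Hypothetical weights: w = (1, 1), b = 1, i.e. the boundary x1 + x2 = 1
w, b = np.array([1.0, 1.0]), 1.0
print(linear_classify(np.array([2.0, 2.0]), w, b))    # a point on the +1 side
print(linear_classify(np.array([-1.0, -1.0]), w, b))  # a point on the -1 side
```

Every (w, b) that separates the data gives a valid classifier; the rest of the lecture is about which one to prefer.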

Support Vector Machines: Slide 7 Copyright © 2001, 2003, Andrew W. Moore Classifier Margin f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Support Vector Machines: Slide 8 Copyright © 2001, 2003, Andrew W. Moore Maximum Margin f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Linear SVM

Support Vector Machines: Slide 9 Copyright © 2001, 2003, Andrew W. Moore Maximum Margin f x  y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against Linear SVM

Support Vector Machines: Slide 10 Copyright © 2001, 2003, Andrew W. Moore Why Maximum Margin? denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us the least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.

Support Vector Machines: Slide 11 Copyright © 2001, 2003, Andrew W. Moore Estimate the Margin What is the distance expression for a point x to a line wx+b= 0? denotes +1 denotes -1 x wx +b = 0
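The standard answer, which the margin slides that follow build on: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||. A quick numeric check:

```python
import numpy as np

def point_to_hyperplane_distance(x, w, b):
    # Distance from point x to the hyperplane w.x + b = 0:
    # |w.x + b| / ||w||
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# e.g. distance from (3, 4) to the line x1 + x2 - 1 = 0
d = point_to_hyperplane_distance(np.array([3.0, 4.0]), np.array([1.0, 1.0]), -1.0)
# |3 + 4 - 1| / sqrt(2) = 6 / sqrt(2)
```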

Support Vector Machines: Slide 12 Copyright © 2001, 2003, Andrew W. Moore Estimate the Margin What is the expression for margin? denotes +1 denotes -1 wx +b = 0 Margin

Support Vector Machines: Slide 13 Copyright © 2001, 2003, Andrew W. Moore Maximize Margin denotes +1 denotes -1 wx +b = 0 Margin

Support Vector Machines: Slide 14 Copyright © 2001, 2003, Andrew W. Moore Maximize Margin denotes +1 denotes -1 wx +b = 0 Margin Min-max problem → game problem

Support Vector Machines: Slide 15 Copyright © 2001, 2003, Andrew W. Moore Maximize Margin denotes +1 denotes -1 wx +b = 0 Margin Strategy:

Support Vector Machines: Slide 16 Copyright © 2001, 2003, Andrew W. Moore Maximum Margin Linear Classifier How to solve it?

Support Vector Machines: Slide 17 Copyright © 2001, 2003, Andrew W. Moore Learning via Quadratic Programming QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Support Vector Machines: Slide 18 Copyright © 2001, 2003, Andrew W. Moore Quadratic Programming Find (quadratic criterion) Subject to n additional linear inequality constraints And subject to e additional linear equality constraints

Support Vector Machines: Slide 19 Copyright © 2001, 2003, Andrew W. Moore Quadratic Programming Find (quadratic criterion) Subject to n additional linear inequality constraints And subject to e additional linear equality constraints There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly… you probably don’t want to write one yourself)
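In practice you hand this constrained quadratic optimum to a solver rather than writing one, as the slide warns. As an illustrative sketch only — not a specialized QP package — here is the hard-margin dual QP for a tiny hand-made dataset solved with SciPy's general-purpose SLSQP optimizer (dataset and variable names are my own):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data on the line x1 = x2 (hypothetical example)
X = np.array([[0., 0.], [1., 1.], [3., 3.], [4., 4.]])
y = np.array([-1., -1., 1., 1.])

# Dual objective: maximize sum(a) - 1/2 a^T Q a, with Q_kl = y_k y_l x_k.x_l
Q = (X @ X.T) * np.outer(y, y)

def neg_dual(a):            # SciPy minimizes, so negate
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),             # a_k >= 0 (hard margin)
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                                       # w = sum_k a_k y_k x_k
sv = alpha > 1e-5                                         # support vectors
b = np.mean(X[sv] @ w - y[sv])                            # y_k (w.x_k - b) = 1 on SVs
```

For this dataset the optimizer recovers w ≈ (0.5, 0.5) and b ≈ 2, with only the two middle points as support vectors.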

Support Vector Machines: Slide 20 Copyright © 2001, 2003, Andrew W. Moore Quadratic Programming

Support Vector Machines: Slide 21 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do?

Support Vector Machines: Slide 22 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 1: Find minimum w.w, while minimizing number of training set errors. Problemette: Two things to minimize makes for an ill-defined optimization

Support Vector Machines: Slide 23 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 1.1: Minimize w.w + C (#train errors) There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is? Tradeoff parameter

Support Vector Machines: Slide 24 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 1.1: Minimize w.w + C (#train errors) There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is? Tradeoff parameter Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, doesn’t distinguish between disastrous errors and near misses) So… any other ideas?

Support Vector Machines: Slide 25 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 2.0: Minimize w.w + C (distance of error points to their correct place)

Support Vector Machines: Slide 26 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machine (SVM) for Noisy Data Any problem with the above formulation? denotes +1 denotes -1

Support Vector Machines: Slide 27 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machine (SVM) for Noisy Data Balance the trade off between margin and classification errors denotes +1 denotes -1
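The trade-off on this slide is usually written as minimizing ½ w·w + C Σ slack_k, where slack_k measures how far example k sits on the wrong side of its margin. A small sketch of that objective (variable names are my own):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # 1/2 ||w||^2 + C * sum of slacks, where slack_k is the distance
    # example k sits past its margin: max(0, 1 - y_k (w.x_k - b))
    slacks = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return 0.5 * np.dot(w, w) + C * slacks.sum()

X = np.array([[2.0], [0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0])
print(soft_margin_objective(w, 1.0, X, y, C=10.0))  # both points exactly on their margins -> 0.5
```

Increasing C punishes margin violations more, pulling the solution toward classifying every training point correctly.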

Support Vector Machines: Slide 28 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machine for Noisy Data How do we determine the appropriate value for C?

Support Vector Machines: Slide 29 Copyright © 2001, 2003, Andrew W. Moore The Dual Form of QP Maximize Σ_k α_k − ½ Σ_k Σ_l α_k α_l Q_kl where Q_kl = y_k y_l (x_k · x_l) Subject to these constraints: 0 ≤ α_k ≤ C for all k, and Σ_k α_k y_k = 0 Then define: w = Σ_k α_k y_k x_k Then classify with: f(x,w,b) = sign(w. x - b)

Support Vector Machines: Slide 30 Copyright © 2001, 2003, Andrew W. Moore The Dual Form of QP Maximize where Subject to these constraints: Then define:

Support Vector Machines: Slide 31 Copyright © 2001, 2003, Andrew W. Moore An Equivalent QP Maximize where Subject to these constraints: Then define: Datapoints with α_k > 0 will be the support vectors... so this sum only needs to be over the support vectors.

Support Vector Machines: Slide 32 Copyright © 2001, 2003, Andrew W. Moore Support Vectors denotes +1 denotes -1 Support Vectors Decision boundary is determined only by those support vectors! α_i = 0 for non-support vectors, α_i > 0 for support vectors

Support Vector Machines: Slide 33 Copyright © 2001, 2003, Andrew W. Moore The Dual Form of QP Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w. x - b) How to determine b ?

Support Vector Machines: Slide 34 Copyright © 2001, 2003, Andrew W. Moore An Equivalent QP: Determine b A linear programming problem ! Fix w

Support Vector Machines: Slide 35 Copyright © 2001, 2003, Andrew W. Moore An Equivalent QP Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w. x - b) Datapoints with α_k > 0 will be the support vectors... so this sum only needs to be over the support vectors. Why did I tell you about this equivalent QP? It’s a formulation that QP packages can optimize more quickly. Because of further jaw-dropping developments you’re about to learn.

Support Vector Machines: Slide 36 Copyright © 2001, 2003, Andrew W. Moore Suppose we’re in 1-dimension What would SVMs do with this data? x=0

Support Vector Machines: Slide 37 Copyright © 2001, 2003, Andrew W. Moore Suppose we’re in 1-dimension Not a big surprise Positive “plane” Negative “plane” x=0

Support Vector Machines: Slide 38 Copyright © 2001, 2003, Andrew W. Moore Harder 1-dimensional dataset That’s wiped the smirk off SVM’s face. What can be done about this? x=0

Support Vector Machines: Slide 39 Copyright © 2001, 2003, Andrew W. Moore Harder 1-dimensional dataset Remember how permitting non- linear basis functions made linear regression so much nicer? Let’s permit them here too x=0

Support Vector Machines: Slide 40 Copyright © 2001, 2003, Andrew W. Moore Harder 1-dimensional dataset Remember how permitting non- linear basis functions made linear regression so much nicer? Let’s permit them here too x=0
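A minimal numeric illustration of the trick (toy labels are my own): lifting each 1-d point x to z = (x, x²) turns a dataset with positives on both ends into one that is linearly separable in z-space.

```python
import numpy as np

# A "harder 1-d dataset": positives at both ends, negatives in the middle
x = np.array([-3., -2., -1., 0., 1., 2., 3.])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Lift into 2-d with the basis functions z = (x, x^2)
Z = np.column_stack([x, x ** 2])

# In z-space the nonlinear rule "x^2 >= 2.5" is just the linear
# decision z2 - 2.5 >= 0
pred = np.where(Z[:, 1] >= 2.5, 1, -1)
```

No 1-d threshold separates this data, but one linear boundary in the lifted space classifies every point correctly.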

Support Vector Machines: Slide 41 Copyright © 2001, 2003, Andrew W. Moore Common SVM basis functions z k = ( polynomial terms of x k of degree 1 to q ) z k = ( radial basis functions of x k ) z k = ( sigmoid functions of x k ) This is sensible. Is that the end of the story? No…there’s one more trick!

Support Vector Machines: Slide 42 Copyright © 2001, 2003, Andrew W. Moore Quadratic Basis Functions Constant Term Linear Terms Pure Quadratic Terms Quadratic Cross-Terms Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m^2/2 You may be wondering what those √2’s are doing. You should be happy that they do no harm. You’ll find out why they’re there soon.

Support Vector Machines: Slide 43 Copyright © 2001, 2003, Andrew W. Moore QP (old) Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w. x - b)

Support Vector Machines: Slide 44 Copyright © 2001, 2003, Andrew W. Moore QP with basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) Most important changes: x → φ(x)

Support Vector Machines: Slide 45 Copyright © 2001, 2003, Andrew W. Moore QP with basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications The whole thing costs R^2 m^2/4. Yeeks! …or does it?

Support Vector Machines: Slide 46 Copyright © 2001, 2003, Andrew W. Moore Quadratic Dot Products + + +

Support Vector Machines: Slide 47 Copyright © 2001, 2003, Andrew W. Moore Quadratic Dot Products Just out of casual, innocent, interest, let’s look at another function of a and b:

Support Vector Machines: Slide 48 Copyright © 2001, 2003, Andrew W. Moore Quadratic Dot Products Just out of casual, innocent, interest, let’s look at another function of a and b: They’re the same! And this is only O(m) to compute!
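You can check the slide's claim numerically: with the √2 coefficients in the quadratic feature map, the explicit dot product φ(a)·φ(b) equals (a·b + 1)², which costs only O(m) to compute.

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map with the sqrt(2) coefficients:
    # constant 1, sqrt(2)*x_i, x_i^2, and sqrt(2)*x_i*x_j for i < j
    m = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]
    feats += [xi ** 2 for xi in x]
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.array(feats)

a, b = np.array([1., 2., 3.]), np.array([4., 5., 6.])
lhs = phi(a) @ phi(b)       # expensive: O(m^2) features
rhs = (a @ b + 1) ** 2      # cheap: O(m)
```

Both sides give 1089 for this pair, confirming "They're the same!".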

Support Vector Machines: Slide 49 Copyright © 2001, 2003, Andrew W. Moore QP with Quadratic basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) We must do R^2/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications

Support Vector Machines: Slide 50 Copyright © 2001, 2003, Andrew W. Moore Higher Order Polynomials

Polynomial | φ(x) | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic | All m^2/2 terms up to degree 2 | m^2 R^2/4 | 2,500 R^2 | (a.b+1)^2 | m R^2/2 | 50 R^2
Cubic | All m^3/6 terms up to degree 3 | m^3 R^2/12 | 83,000 R^2 | (a.b+1)^3 | m R^2/2 | 50 R^2
Quartic | All m^4/24 terms up to degree 4 | m^4 R^2/48 | 1,960,000 R^2 | (a.b+1)^4 | m R^2/2 | 50 R^2

Support Vector Machines: Slide 51 Copyright © 2001, 2003, Andrew W. Moore QP with Quintic basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million But there are still worrying things lurking away. What are they?

Support Vector Machines: Slide 52 Copyright © 2001, 2003, Andrew W. Moore QP with Quintic basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million But there are still worrying things lurking away. What are they? The fear of overfitting with this enormous number of terms The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)

Support Vector Machines: Slide 53 Copyright © 2001, 2003, Andrew W. Moore QP with Quintic basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million But there are still worrying things lurking away. What are they? The fear of overfitting with this enormous number of terms The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?) Because each w·φ(x) (see below) needs 75 million operations. What can be done? The use of Maximum Margin magically makes this not a problem. (Not always!)

Support Vector Machines: Slide 54 Copyright © 2001, 2003, Andrew W. Moore QP with Quintic basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b) We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million But there are still worrying things lurking away. What are they? The fear of overfitting with this enormous number of terms The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?) Because each w·φ(x) (see below) needs 75 million operations. What can be done? The use of Maximum Margin magically makes this not a problem Only Sm operations (S = #support vectors)

Support Vector Machines: Slide 55 Copyright © 2001, 2003, Andrew W. Moore QP with Quintic basis functions Maximize where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(w·φ(x) - b)

Support Vector Machines: Slide 56 Copyright © 2001, 2003, Andrew W. Moore QP with Quadratic basis functions where Subject to these constraints: Then define: Then classify with: f(x,w,b) = sign(K(w, x) - b) Maximize Most important change:

Support Vector Machines: Slide 57 Copyright © 2001, 2003, Andrew W. Moore SVM Kernel Functions K(a,b)=(a. b +1) d is an example of an SVM Kernel Function Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function Radial-Basis-style Kernel Function: Neural-net-style Kernel Function:
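Sketches of two of the kernels named here — the polynomial kernel from the previous slides and a radial-basis-style kernel (the width σ is a free parameter you must choose):

```python
import numpy as np

def poly_kernel(a, b, d=2):
    # Polynomial kernel: K(a,b) = (a.b + 1)^d
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    # Radial-basis-style kernel: K(a,b) = exp(-||a - b||^2 / (2 sigma^2))
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))
```

Note that the RBF kernel of any point with itself is 1, and it decays toward 0 as points move apart — a similarity measure rather than an explicit feature map.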

Support Vector Machines: Slide 58 Copyright © 2001, 2003, Andrew W. Moore Kernel Tricks Replacing the dot product with a kernel function Not all functions are kernel functions Need to be decomposable: K(a,b) = φ(a) · φ(b) Could K(a,b) = (a-b)^3 be a kernel function? Could K(a,b) = (a-b)^4 − (a+b)^2 be a kernel function?

Support Vector Machines: Slide 59 Copyright © 2001, 2003, Andrew W. Moore Kernel Tricks Mercer’s condition To expand a kernel function K(x,y) into a dot product, i.e. K(x,y) = φ(x)·φ(y), K(x,y) has to be a positive semi-definite function, i.e., for any function f(x) whose ∫ f(x)^2 dx is finite, the following inequality holds: ∫∫ K(x,y) f(x) f(y) dx dy ≥ 0. Could be a kernel function?
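Mercer's condition can be sanity-checked on a finite sample: restrict K to n points and test whether the resulting n×n Gram matrix is positive semi-definite (a necessary condition for K to be a valid kernel). A sketch with my own test points:

```python
import numpy as np

def gram_is_psd(kernel, points, tol=1e-9):
    # Build the Gram matrix K_ij = kernel(x_i, x_j) and check that all
    # eigenvalues are (numerically) non-negative.
    K = np.array([[kernel(a, b) for b in points] for a in points])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

pts = [0.0, 1.0, 2.0, 3.0]
ok = gram_is_psd(lambda a, b: (a * b + 1) ** 2, pts)   # polynomial kernel: PSD
bad = gram_is_psd(lambda a, b: (a - b) ** 2, pts)      # squared distance: not PSD
```

Passing this check on one sample does not prove K is a kernel, but failing it proves K is not.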

Support Vector Machines: Slide 60 Copyright © 2001, 2003, Andrew W. Moore Kernel Tricks Pro: Introducing nonlinearity into the model; Computationally cheap Con: Still have potential overfitting problems

Support Vector Machines: Slide 61 Copyright © 2001, 2003, Andrew W. Moore Nonlinear Kernel (I)

Support Vector Machines: Slide 62 Copyright © 2001, 2003, Andrew W. Moore Nonlinear Kernel (II)

Support Vector Machines: Slide 63 Copyright © 2001, 2003, Andrew W. Moore Overfitting in SVM Training Error Testing Error

Support Vector Machines: Slide 64 Copyright © 2001, 2003, Andrew W. Moore SVM Performance Anecdotally they work very very well indeed. Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly. There is a lot of excitement and religious fervor about SVMs. Despite this, some practitioners are a little skeptical.

Support Vector Machines: Slide 65 Copyright © 2001, 2003, Andrew W. Moore Kernelize Logistic Regression How can we introduce the nonlinearity into the logistic regression?

Support Vector Machines: Slide 66 Copyright © 2001, 2003, Andrew W. Moore Kernelize Logistic Regression Representation Theorem

Support Vector Machines: Slide 67 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel Kernel function describes the correlation or similarity between two data points Given that I have a function s(x,y) that describes the similarity between two data points. Assume that it is a non-negative and symmetric function. How can we generate a kernel function based on this similarity function? A graph theory approach …

Support Vector Machines: Slide 68 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel Create a graph for the data points Each vertex corresponds to a data point The weight of each edge is the similarity s(x,y) Graph Laplacian Properties of Laplacian Negative semi-definite

Support Vector Machines: Slide 69 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel Consider a simple Laplacian Consider What do these matrixes represent? A diffusion kernel

Support Vector Machines: Slide 70 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel Consider a simple Laplacian Consider What do these matrixes represent? A diffusion kernel

Support Vector Machines: Slide 71 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel: Properties Positive definite Local relationships L induce global relationships Works for undirected weighted graphs with similarities How to compute the diffusion kernel

Support Vector Machines: Slide 72 Copyright © 2001, 2003, Andrew W. Moore Computing Diffusion Kernel Singular value decomposition of Laplacian L What is L^2?

Support Vector Machines: Slide 73 Copyright © 2001, 2003, Andrew W. Moore Computing Diffusion Kernel What about L^n? Compute diffusion kernel
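Putting the last few slides together for a tiny 3-node path graph: with the negative semi-definite Laplacian L = W − D used here, the diffusion kernel K = e^{βL} can be computed through the eigendecomposition of L. The graph weights and β below are arbitrary illustration values:

```python
import numpy as np

# Edge weights (similarities) of a 3-node path graph: 0 - 1 - 2
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

# Graph Laplacian as on these slides: L = W - D (negative semi-definite)
L = W - np.diag(W.sum(axis=1))

# Diffusion kernel K = exp(beta * L), via the eigendecomposition of L:
# exponentiate the eigenvalues, keep the eigenvectors
beta = 0.5
vals, vecs = np.linalg.eigh(L)
K = vecs @ np.diag(np.exp(beta * vals)) @ vecs.T
```

Because every eigenvalue of L is ≤ 0, every eigenvalue of K is e^{βλ} > 0, so K is positive definite — and local edges induce global similarities: node 0 ends up more similar to its neighbor 1 than to the distant node 2.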

Support Vector Machines: Slide 74 Copyright © 2001, 2003, Andrew W. Moore Doing multi-class classification SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs SVM 1 learns “Output==1” vs “Output != 1” SVM 2 learns “Output==2” vs “Output != 2” : SVM N learns “Output==N” vs “Output != N” Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
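The prediction rule in that last sentence, as a sketch (the scores here are hypothetical signed distances produced by each one-vs-rest SVM):

```python
import numpy as np

def one_vs_rest_predict(scores):
    # scores[i, k] = signed distance of input i under SVM k
    # ("Output==k" vs "Output!=k"); pick the SVM whose prediction
    # is furthest into the positive region.
    return np.argmax(scores, axis=1)

# Two inputs, three classes: hypothetical SVM outputs
scores = np.array([[ 0.9, -0.2, -1.1],
                   [-0.5,  0.1,  0.4]])
pred = one_vs_rest_predict(scores)
```

Note that argmax handles the case where several (or no) SVMs claim an input: the largest signed distance wins even if all scores are negative.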

Support Vector Machines: Slide 75 Copyright © 2001, 2003, Andrew W. Moore References An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2). The VC/SRM/SVM Bible (not for beginners, including myself): Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998. Software: SVM-light, free download: http://svmlight.joachims.org/

Support Vector Machines: Slide 76 Copyright © 2001, 2003, Andrew W. Moore Ranking Problem Consider a problem of ranking essays Three ranking categories: good, ok, bad Given an input document, predict its ranking category How should we formulate this problem? A simple multiple class solution Each ranking category is an independent class But, there is something missing here … We miss the ordinal relationship between classes!

Support Vector Machines: Slide 77 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression Which choice is better? How could we formulate this problem? ‘good’ ‘OK’ ‘bad’ w w’

Support Vector Machines: Slide 78 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression What are the two decision boundaries? What is the margin for ordinal regression? Maximize margin

Support Vector Machines: Slide 79 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression What are the two decision boundaries? What is the margin for ordinal regression? Maximize margin

Support Vector Machines: Slide 80 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression What are the two decision boundaries? What is the margin for ordinal regression? Maximize margin

Support Vector Machines: Slide 81 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression How do we solve this monster?

Support Vector Machines: Slide 82 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression The same old trick To remove the scaling invariance, set Now the problem is simplified as:

Support Vector Machines: Slide 83 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression Noisy case Is this sufficient?

Support Vector Machines: Slide 84 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression ‘good’ ‘OK’ ‘bad’ w