Support Vector Machines


1 Support Vector Machines
Modified from the slides by Dr. Andrew W. Moore. Copyright © 2001, 2003, Andrew W. Moore. Nov 23rd, 2001.

2 Linear Classifiers
A linear classifier f(x,w,b) = sign(w·x - b) maps an input x to an estimated label y_est. Points are labeled +1 or -1. How would you classify this data?

3 Linear Classifiers
f(x,w,b) = sign(w·x - b). How would you classify this data?

4 Linear Classifiers
f(x,w,b) = sign(w·x - b). How would you classify this data?

5 Linear Classifiers
f(x,w,b) = sign(w·x - b). How would you classify this data?

6 Linear Classifiers
f(x,w,b) = sign(w·x - b). Any of these would be fine... but which is best?

7 Classifier Margin
f(x,w,b) = sign(w·x - b). Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

8 Maximum Margin
f(x,w,b) = sign(w·x - b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

9 Maximum Margin
f(x,w,b) = sign(w·x - b). The maximum margin linear classifier is the linear classifier with the maximum margin. Support vectors are the datapoints that the margin pushes up against.

10 Why Maximum Margin? Intuitively this feels safest: if we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. Leave-one-out cross-validation is easy, since the model is unchanged by the removal of any non-support-vector datapoint. There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. Empirically it works very, very well.

11 Estimate the Margin
The decision boundary is the line w·x + b = 0. What is the distance from a point x to the line w·x + b = 0?
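For reference, the standard expression for this distance, in LaTeX:

    d(x) = \frac{|w \cdot x + b|}{\lVert w \rVert}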

12 Estimate the Margin
Decision boundary: w·x + b = 0. What is the expression for the margin?
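Again for reference: if (w, b) is rescaled so that the closest datapoints satisfy |w·x + b| = 1, the margin is

    M = \frac{2}{\lVert w \rVert}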

13 Maximize Margin
Decision boundary: w·x + b = 0. Choose w and b to maximize the margin.

14 Maximize Margin
Decision boundary: w·x + b = 0. Maximizing the minimum distance to the datapoints is a min-max problem, i.e. a game problem.

15 Maximize Margin
Strategy: remove the scaling freedom in (w, b) by requiring the closest datapoints to satisfy |w·x + b| = 1; maximizing the margin is then equivalent to minimizing w·w.

16 Maximum Margin Linear Classifier
How to solve it?
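For reference, a standard way to write the resulting optimization problem, with labels y_k in {+1, -1} over R datapoints:

    \min_{w,b}\ \tfrac{1}{2}\, w \cdot w \quad \text{subject to}\quad y_k\,(w \cdot x_k + b) \ge 1,\ \ k = 1, \dots, R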

17 Learning via Quadratic Programming
QP is a well-studied class of optimization problems: optimize a quadratic function of some real-valued variables subject to linear constraints.

18 Quadratic Programming
Quadratic criterion: find the optimum of a quadratic objective, subject to n additional linear inequality constraints and e additional linear equality constraints.

19 Quadratic Programming
There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly... you probably don't want to write one yourself.)
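A minimal sketch of handing such a problem to an off-the-shelf solver, assuming the cvxopt package (which minimizes ½uᵀPu + qᵀu subject to Gu ≤ h); the toy numbers are illustrative only, not an SVM:

    import numpy as np
    from cvxopt import matrix, solvers

    solvers.options['show_progress'] = False

    # minimize  u1^2 + u2^2 - 2*u1 - 5*u2   subject to  u1 >= 0, u2 >= 0
    P = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))    # quadratic term (1/2 u'Pu)
    q = matrix(np.array([[-2.0], [-5.0]]))            # linear term
    G = matrix(np.array([[-1.0, 0.0], [0.0, -1.0]]))  # -u <= 0, i.e. u >= 0
    h = matrix(np.zeros((2, 1)))

    sol = solvers.qp(P, q, G, h)
    print(np.array(sol['x']).ravel())                 # approximately [1.0, 2.5]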

20 Quadratic Programming

21 Uh-oh! This is going to be a problem! What should we do?

22 Uh-oh! This is going to be a problem! What should we do? Idea 1:
Find the minimum w·w while also minimizing the number of training-set errors. Problemette: having two things to minimize makes for an ill-defined optimization.

23 Uh-oh! This is going to be a problem! What should we do? Idea 1.1:
Minimize w·w + C (#train errors), where C is a tradeoff parameter. There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?

24 Uh-oh! This is going to be a problem! What should we do? Idea 1.1:
Minimize w·w + C (#train errors). Problem: this can't be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So... any other ideas?

25 Uh-oh! This is going to be a problem! What should we do? Idea 2.0:
Minimize w·w + C (distance of error points to their correct place).

26 Support Vector Machine (SVM) for Noisy Data
Any problem with the above formulation?

27 Support Vector Machine (SVM) for Noisy Data
Balance the trade-off between the margin and the classification errors.
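A standard way to write this trade-off, with slack variables ξ_k measuring how far each violating point is from its correct place (the exact notation here is an assumption):

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\, w \cdot w + C \sum_{k=1}^{R} \xi_k
    \quad \text{subject to}\quad y_k\,(w \cdot x_k + b) \ge 1 - \xi_k,\ \ \xi_k \ge 0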

28 Support Vector Machine for Noisy Data
How do we determine the appropriate value for C?
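The usual answer is cross-validation. A minimal sketch with scikit-learn; the synthetic data and the candidate C values are illustrative assumptions:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.randn(200) > 0, 1, -1)

    # 5-fold cross-validation over a small grid of C values
    grid = GridSearchCV(SVC(kernel='linear'),
                        param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_['C'])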

29 The Dual Form of QP
Maximize the dual objective subject to the dual constraints; then define w (and b) from the solution, and classify with f(x,w,b) = sign(w·x - b).
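For reference, the dual in common textbook notation:

    \max_{\alpha}\ \sum_{k} \alpha_k \;-\; \tfrac{1}{2}\sum_{k}\sum_{l} \alpha_k \alpha_l\, y_k y_l\,(x_k \cdot x_l)
    \quad \text{subject to}\quad \alpha_k \ge 0,\ \ \sum_{k} \alpha_k y_k = 0

Then w = \sum_k \alpha_k y_k x_k. (For the noisy-data version, the constraint becomes 0 ≤ α_k ≤ C.)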

30 The Dual Form of QP
Maximize the dual objective subject to the dual constraints; then define w from the solution.

31 An Equivalent QP
Maximize the dual objective subject to the dual constraints; then define w from the solution. Datapoints with α_k > 0 will be the support vectors, so this sum only needs to be over the support vectors.

32 Support Vectors Support Vectors i = 0 for non-support vectors
i  0 for support vectors denotes +1 denotes -1 Decision boundary is determined only by those support vectors ! Copyright © 2001, 2003, Andrew W. Moore

33 The Dual Form of QP
Maximize the dual objective subject to the dual constraints; then define w, and classify with f(x,w,b) = sign(w·x - b). How do we determine b?

34 An Equivalent QP: Determine b
Fix w; determining b is then a linear programming problem!
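In the separable case there is also a simple closed form (a standard fact, using any support vector x_k that lies exactly on the margin): since y_k (w·x_k - b) = 1 and y_k is +1 or -1,

    b = w \cdot x_k - y_k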

35 An Equivalent QP
Why did I tell you about this equivalent QP? Because it's a formulation that QP packages can optimize more quickly, and because of further jaw-dropping developments you're about to learn.

36 Suppose we're in 1-dimension
What would SVMs do with this data? (Points on the x-axis.)

37 Suppose we're in 1-dimension
Not a big surprise: we get a positive "plane" and a negative "plane".

38 Harder 1-dimensional dataset
That's wiped the smirk off SVM's face. What can be done about this?

39 Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too.

40 Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too: for example, map each point x_k to z_k = (x_k, x_k²), as in the sketch below.
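A minimal sketch of that idea (the specific dataset and the (x, x²) map are illustrative assumptions): the raw 1-D points are not linearly separable, but after the map a linear SVM separates them perfectly.

    import numpy as np
    from sklearn.svm import SVC

    x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0, -0.5, 0.0, 0.5])
    y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1])   # positives cluster near 0

    Z = np.column_stack([x, x ** 2])        # z_k = (x_k, x_k^2)
    clf = SVC(kernel='linear', C=1e3).fit(Z, y)
    print(clf.score(Z, y))                  # should print 1.0: separable after the map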

41 Common SVM basis functions
z_k = (polynomial terms of x_k of degree 1 to q)
z_k = (radial basis functions of x_k)
z_k = (sigmoid functions of x_k)
This is sensible. Is that the end of the story? No... there's one more trick!

42 Quadratic Basis Functions
The quadratic basis functions of x_k: the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms. Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2. You may be wondering what those √2's are doing. You should be happy that they do no harm; you'll find out why they're there soon.
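A sketch of such a feature map (the √2 placement follows the usual convention and should be treated as an assumption): a constant term, √2-weighted linear terms, pure quadratic terms, and √2-weighted cross-terms, giving (m+2)(m+1)/2 features.

    import numpy as np
    from itertools import combinations

    def phi_quadratic(x):
        x = np.asarray(x, dtype=float)
        cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
        return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

    m = 5
    z = phi_quadratic(np.arange(1, m + 1))
    print(len(z), (m + 2) * (m + 1) // 2)   # both 21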

43 QP (old)
Maximize the dual objective subject to its constraints; then define w, and classify with f(x,w,b) = sign(w·x - b).

44 QP with basis functions
Most important change: x → Φ(x). Maximize the dual objective subject to its constraints; then define w, and classify with f(x,w,b) = sign(w·Φ(x) - b).

45 QP with basis functions
We must do R²/2 dot products to get this matrix ready (R is the number of datapoints). Each dot product requires m²/2 additions and multiplications, so the whole thing costs R²m²/4. Yeeks! ... or does it?

46 Quadratic Dot Products
Expanding Φ(a)·Φ(b) term by term: the constant term, the linear terms, the pure quadratic terms, and the cross-terms.

47 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b: (a·b + 1)².

48 Quadratic Dot Products
They're the same! And this is only O(m) to compute!
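A numeric check of this identity (a sketch; the √2-weighted quadratic map from the earlier slide is assumed):

    import numpy as np
    from itertools import combinations

    def phi_quadratic(x):
        x = np.asarray(x, dtype=float)
        cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
        return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

    rng = np.random.RandomState(0)
    a, b = rng.randn(6), rng.randn(6)
    lhs = phi_quadratic(a) @ phi_quadratic(b)   # O(m^2) features
    rhs = (a @ b + 1) ** 2                      # O(m) kernel evaluation
    print(np.isclose(lhs, rhs))                 # True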

49 QP with Quintic basis functions
Maximize the dual objective subject to its constraints; then define w, and classify with f(x,w,b) = sign(w·Φ(x) - b).

50 QP with Quadratic basis functions
Most important change: the dot product becomes a kernel evaluation. Maximize the dual objective subject to its constraints; then define w, and classify with f(x,w,b) = sign(K(w, x) - b).

51 Higher Order Polynomials
Φ(x)                                      | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | Φ(a)·Φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic: all m²/2 terms up to degree 2  | m²R²/4                                  | 2,500 R²           | (a·b+1)²  | mR²/2                              | 50 R²
Cubic: all m³/6 terms up to degree 3      | m³R²/12                                 | 83,000 R²          | (a·b+1)³  | mR²/2                              | 50 R²
Quartic: all m⁴/24 terms up to degree 4   | m⁴R²/48                                 | 1,960,000 R²       | (a·b+1)⁴  | mR²/2                              | 50 R²

52 SVM Kernel Functions K(a,b) = (a·b + 1)^d is an example of an SVM kernel function. Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function. Radial-basis-style kernel function: K(a,b) = exp(-‖a - b‖² / (2σ²)). Neural-net-style kernel function: K(a,b) = tanh(κ a·b - δ).
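A sketch of the radial-basis-style kernel in use (scikit-learn's precomputed-kernel interface and the toy data are assumptions): build the Gram matrix K(x_i, x_j) and hand it to the SVM in place of dot products.

    import numpy as np
    from sklearn.svm import SVC

    def rbf_gram(A, B, sigma=1.0):
        # squared Euclidean distances between all rows of A and all rows of B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    rng = np.random.RandomState(0)
    X = rng.randn(80, 2)
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # not linearly separable

    K = rbf_gram(X, X, sigma=0.7)
    clf = SVC(kernel='precomputed', C=10.0).fit(K, y)
    print(clf.score(rbf_gram(X, X, sigma=0.7), y))           # training accuracy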

53 Kernel Tricks
Replacing the dot product with a kernel function. Not all functions are kernel functions: they need to be decomposable as K(a,b) = Φ(a)·Φ(b). Could K(a,b) = (a-b)³ be a kernel function? Could K(a,b) = (a-b)⁴ - (a+b)² be a kernel function?

54 Kernel Tricks
Mercer's condition. To expand a kernel function K(x,y) into a dot product, i.e. K(x,y) = Φ(x)·Φ(y), K(x,y) has to be a positive semi-definite function: for any function f(x) with finite ∫ f(x)² dx, the inequality ∫∫ K(x,y) f(x) f(y) dx dy ≥ 0 must hold. Could K(x,y) = … be a kernel function?
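The condition is awkward to check analytically; a quick necessary (but not sufficient) sanity check is that the Gram matrix built on any finite sample is positive semi-definite. A sketch, with illustrative candidate kernels:

    import numpy as np

    def min_gram_eigenvalue(kernel, X):
        K = np.array([[kernel(a, b) for b in X] for a in X])
        return np.linalg.eigvalsh(K).min()

    rng = np.random.RandomState(0)
    X = rng.randn(30, 3)

    good = lambda a, b: float((a @ b + 1) ** 2)     # valid polynomial kernel
    bad  = lambda a, b: float((a - b) @ (a - b))    # squared distance: not a kernel

    print(min_gram_eigenvalue(good, X))   # ~0 or positive (up to round-off)
    print(min_gram_eigenvalue(bad,  X))   # negative, so it fails the check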

55 Kernel Tricks
Pro: introduces nonlinearity into the model; computationally cheap. Con: still has potential overfitting problems.

56 Nonlinear Kernel (I)

57 Nonlinear Kernel (II)

58 Kernelize Logistic Regression
How can we introduce nonlinearity into logistic regression?

59 Kernelize Logistic Regression
Representer theorem: the optimal weight vector can be written as a linear combination of the (mapped) training points, so the model depends on the data only through kernel evaluations.
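One simple way to act on that idea (a sketch, not necessarily the slides' exact derivation): represent each point by its vector of kernel values against the training set and fit an ordinary logistic regression on those features (scikit-learn assumed).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # nonlinear boundary

    K = rbf_kernel(X, X, gamma=1.0)                       # empirical kernel map
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(K, y)
    print(clf.score(rbf_kernel(X, X, gamma=1.0), y))      # training accuracy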

60 Overfitting in SVM
[Figure: testing error vs. training error as model complexity grows.]

61 SVM Performance
Anecdotally they work very, very well indeed. Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark. Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly. There is a lot of excitement and religious fervor about SVMs as of 2001. Despite this, some practitioners are a little skeptical.

62 Diffusion Kernel
A kernel function describes the correlation or similarity between two data points. Suppose we have a function s(x,y) that describes the similarity between two data points; assume it is non-negative and symmetric. How can we generate a kernel function based on this similarity function? A graph-theory approach...

63 Diffusion Kernel
Create a graph for the data points: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x,y). Graph Laplacian L: the off-diagonal entries are the similarities, L_ij = s(x_i, x_j), and the diagonal entries are L_ii = -Σ_{j≠i} s(x_i, x_j). Property of this Laplacian: it is negative semi-definite.

64 Diffusion Kernel
Consider a simple Laplacian. Now consider … What do these matrices represent? A diffusion kernel.

65 Diffusion Kernel
Consider a simple Laplacian. Now consider … What do these matrices represent? A diffusion kernel.

66 Diffusion Kernel: Properties
Positive definite. Local relationships in L induce global relationships. Works for undirected weighted graphs with similarities. How do we compute the diffusion kernel?

67 Computing Diffusion Kernel
Eigendecomposition (singular value decomposition) of the symmetric Laplacian L: L = UΛUᵀ. Then what is L²? L² = UΛ²Uᵀ.

68 Computing Diffusion Kernel
What about Lⁿ? Lⁿ = UΛⁿUᵀ. So the diffusion kernel K = exp(βL) can be computed as K = U exp(βΛ) Uᵀ.
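A minimal sketch of that computation (the 3-point similarity matrix is an illustrative assumption):

    import numpy as np

    S = np.array([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.4],
                  [0.1, 0.4, 0.0]])           # symmetric, non-negative similarities
    L = S - np.diag(S.sum(axis=1))            # Laplacian: L_ii = -sum_j s_ij (negative semi-definite)

    beta = 1.0
    w, U = np.linalg.eigh(L)                  # L = U diag(w) U^T
    K = U @ np.diag(np.exp(beta * w)) @ U.T   # K = U diag(exp(beta*w)) U^T = exp(beta*L)
    print(np.round(K, 3))
    print(np.linalg.eigvalsh(K).min() > 0)    # True: the diffusion kernel is positive definite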

69 Doing multi-class classification
SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs: SVM 1 learns "Output==1" vs "Output != 1", SVM 2 learns "Output==2" vs "Output != 2", ..., SVM N learns "Output==N" vs "Output != N". Then, to predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region.
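A minimal sketch of that one-vs-rest scheme (scikit-learn and the synthetic three-class data are assumptions):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    centers = np.array([[0, 3], [3, -2], [-3, -2]])
    X = np.vstack([rng.randn(40, 2) + c for c in centers])
    y = np.repeat([0, 1, 2], 40)

    svms = []
    for k in range(3):                       # SVM k learns "Output==k" vs the rest
        svms.append(SVC(kernel='linear').fit(X, np.where(y == k, 1, -1)))

    scores = np.column_stack([s.decision_function(X) for s in svms])
    pred = scores.argmax(axis=1)             # class pushed furthest into the positive region
    print((pred == y).mean())                # training accuracy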

70 Ranking Problem
Consider the problem of ranking essays. Three ranking categories: good, OK, bad. Given an input document, predict its ranking category. How should we formulate this problem? A simple multi-class solution: each ranking category is an independent class. But there is something missing here... we miss the ordinal relationship between the classes!

71 Ordinal Regression
Order the classes 'good', 'OK', 'bad' along a direction w. Which choice is better? How could we formulate this problem?

72 Ordinal Regression
What are the two decision boundaries? What is the margin for ordinal regression? Maximize the margin.

73 Ordinal Regression
What are the two decision boundaries? What is the margin for ordinal regression? Maximize the margin.

74 Ordinal Regression
What are the two decision boundaries? What is the margin for ordinal regression? Maximize the margin.

75 Ordinal Regression
How do we solve this monster?

76 Ordinal Regression
The same old trick: to remove the scaling invariance, fix the scale of w. Now the problem is simplified.

77 Ordinal Regression
Noisy case: is this sufficient?

78 Ordinal Regression
[Figure: the three ordered classes 'good', 'OK', 'bad' separated by two parallel boundaries along the direction w.]

79 References
An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 2(2).
The VC/SRM/SVM bible (not for beginners, including myself): Vladimir Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
Software: SVM-light, free download.

