1
Copyright © 2001, Andrew W. Moore Support Vector Machines Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University
2
Support Vector Machines: Slide 2 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
3
Support Vector Machines: Slide 3 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
4
Support Vector Machines: Slide 4 Copyright © 2001, Andrew W. Moore Linear Classifiers
[Figure: scatter of training points, some labeled +1 and some -1; a learning machine f maps an input x to an estimated output y_est.]
f(x,w,b) = sign(w · x - b)
How would you classify this data? The learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = ±1.
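A minimal sketch of this decision rule in NumPy (the tie-breaking at exactly 0 and the toy numbers are illustrative choices, not from the slide):

```python
import numpy as np

def linear_classify(x, w, b):
    """Return +1 or -1 according to f(x,w,b) = sign(w . x - b)."""
    return 1 if np.dot(w, x) - b >= 0 else -1   # treat a point exactly on the boundary as +1

# Toy usage: a 2-D point classified by a hand-picked separator.
print(linear_classify(np.array([2.0, 1.0]), w=np.array([1.0, 1.0]), b=2.5))  # -> 1
```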
5
Support Vector Machines: Slide 5 Copyright © 2001, Andrew W. Moore Linear Classifiers (points labeled +1 and -1) f(x,w,b) = sign(w · x - b) Perceptron rule (stops only when no misclassification remains): w ← w + (t - y) x. Any of the lines in the figure is a possible decision boundary.
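The update on this slide was partly garbled in extraction; below is a minimal sketch of the standard perceptron rule it refers to. The loop structure, learning rate eta, and stopping test are illustrative assumptions. Note that on a misclassified point (t - y)x equals 2·t·x, so updating by t·x is the same rule up to a constant factor.

```python
import numpy as np

def perceptron_train(X, t, epochs=100, eta=1.0):
    """Perceptron rule: on each misclassified point, w += eta * t_k * x_k and b -= eta * t_k.
    Stops early once a full pass produces no misclassification."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_k, t_k in zip(X, t):
            if t_k * (np.dot(w, x_k) - b) <= 0:   # misclassified (or exactly on the boundary)
                w += eta * t_k * x_k
                b -= eta * t_k
                errors += 1
        if errors == 0:                            # stops only if no misclassification remains
            break
    return w, b
```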
6
Support Vector Machines: Slide 6 Copyright © 2001, Andrew W. Moore Linear Classifiers (points labeled +1 and -1) f(x,w,b) = sign(w · x - b) LMS rule uses linear elements and minimizes E = Σ (t - y)^2. It does not necessarily guarantee zero misclassification.
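A minimal sketch of the LMS (delta) rule the slide contrasts with the perceptron: plain gradient descent on the squared error with a linear (un-thresholded) output. Parameter names and the learning rate are illustrative assumptions.

```python
import numpy as np

def lms_train(X, t, epochs=200, eta=0.01):
    """LMS rule: y = w.x - b is linear (no sign), and each step moves w, b
    down the gradient of the squared error (t - y)^2."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_k, t_k in zip(X, t):
            y = np.dot(w, x_k) - b
            err = t_k - y
            w += eta * err * x_k   # -dE/dw is proportional to err * x (constant folded into eta)
            b -= eta * err         # -dE/db is proportional to -err
    return w, b
```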
7
Support Vector Machines: Slide 7 Copyright © 2001, Andrew W. Moore Linear Classifiers (points labeled +1 and -1) f(x,w,b) = sign(w · x - b) If the objective is to assure zero sample error, any of these boundaries would be fine... but which is best?
8
Support Vector Machines: Slide 8 Copyright © 2001, Andrew W. Moore Classifier Margin (points labeled +1 and -1) f(x,w,b) = sign(w · x - b) Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint. The margin of the separator is the width of separation between the classes.
9
Support Vector Machines: Slide 9 Copyright © 2001, Andrew W. Moore Maximum Margin (points labeled +1 and -1) f(x,w,b) = sign(w · x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM, Linear SVM). Support vectors are the datapoints that the margin pushes up against.
10
Support Vector Machines: Slide 10 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
11
Support Vector Machines: Slide 11 Copyright © 2001, Andrew W. Moore Specifying a line and margin How do we represent this mathematically? ...in m input dimensions? [Figure: the Classifier Boundary with a parallel Plus-Plane and Minus-Plane; the "Predict Class = +1" zone lies beyond the Plus-Plane and the "Predict Class = -1" zone beyond the Minus-Plane.]
12
Support Vector Machines: Slide 12 Copyright © 2001, Andrew W. Moore Specifying a line and margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Classify as +1 if w · x + b >= 1, as -1 if w · x + b <= -1, and "universe explodes" if -1 < w · x + b < 1.
[Figure: the three parallel lines wx+b=1, wx+b=0, wx+b=-1 separating the "Predict Class = +1" and "Predict Class = -1" zones.]
13
Support Vector Machines: Slide 13 Copyright © 2001, Andrew W. Moore Computing the margin width (M = margin width between the lines wx+b=1, wx+b=0, wx+b=-1) Plus-plane = { x : w · x + b = +1 } Minus-plane = { x : w · x + b = -1 } Claim: the vector w is perpendicular to the Plus-Plane. Why? Let u and v be two vectors on the Plus-Plane. What is w · (u - v)? And so of course the vector w is also perpendicular to the Minus-Plane. How do we compute M in terms of w and b?
14
Support Vector Machines: Slide 14 Copyright © 2001, Andrew W. Moore Computing the margin width Let x- be any point on the Minus-Plane and x+ the closest point to it on the Plus-Plane. The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ you travel some distance in the direction of w.
15
Support Vector Machines: Slide 15 Copyright © 2001, Andrew W. Moore Computing the margin width
16
Support Vector Machines: Slide 16 Copyright © 2001, Andrew W. Moore Computing the margin width
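The derivation on slides 13-16 was in equation images that did not survive extraction; below is a minimal reconstruction of the standard margin-width argument (it may differ from the slides' exact notation): write x+ = x- + λw, use the two plane equations, and solve for λ.

```latex
% x^- is any point on the Minus-Plane; x^+ is the closest point to it on the Plus-Plane,
% so x^+ = x^- + \lambda w for some scalar \lambda > 0.
\begin{align*}
  w \cdot x^+ + b &= 1, \qquad w \cdot x^- + b = -1 \\
  w \cdot (x^- + \lambda w) + b &= 1
    \;\Longrightarrow\; -1 + \lambda\,(w \cdot w) = 1
    \;\Longrightarrow\; \lambda = \frac{2}{w \cdot w} \\
  M &= \lVert x^+ - x^- \rVert = \lambda\,\lVert w \rVert = \frac{2}{\sqrt{w \cdot w}}
\end{align*}
```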
17
Support Vector Machines: Slide 17 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
18
Support Vector Machines: Slide 18 Copyright © 2001, Andrew W. Moore Learning the Maximum Margin Classifier Given a guess of w and b we can: compute whether all data points are in the correct half-planes, and compute the width of the margin. So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
19
Support Vector Machines: Slide 19 Copyright © 2001, Andrew W. Moore Learning via Quadratic Programming QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
20
Support Vector Machines: Slide 20 Copyright © 2001, Andrew W. Moore The Optimization Problem
21
Support Vector Machines: Slide 21 Copyright © 2001, Andrew W. Moore Learning the Maximum Margin Classifier
Q1: What is our quadratic optimization criterion? The objective is to maximize M while ensuring all data points are in the correct half-planes; since M = 2 / sqrt(w · w), that means: minimize w · w.
Q2: What are the constraints in our QP? How many constraints will we have? R, one per training point. What should they be?
w · x_k + b >= 1 if y_k = +1
w · x_k + b <= -1 if y_k = -1
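The QP itself appeared as equation images on slides 20-21; below is a hedged sketch of how this particular QP can be handed to an off-the-shelf solver (CVXOPT), with z = [w; b] as the variable, ½·w·w as the objective, and one linear constraint per training point. The toy data and function name are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def hard_margin_svm(X, y):
    """Solve  min (1/2) w.w  subject to  y_k (w.x_k + b) >= 1  as a QP over z = [w, b]."""
    n, m = X.shape
    P = np.zeros((m + 1, m + 1))
    P[:m, :m] = np.eye(m)                       # quadratic term acts on w only, not on b
    q = np.zeros((m + 1, 1))
    # CVXOPT expects G z <= h, so rewrite y_k(w.x_k + b) >= 1 as (-y_k * [x_k, 1]) . z <= -1
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones((n, 1))
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:m], z[m]                          # w, b

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b, np.sign(X @ w + b))                 # the signs should match y
```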
22
Support Vector Machines: Slide 22 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
23
Support Vector Machines: Slide 23 Copyright © 2001, Andrew W. Moore Uh-oh! (points labeled +1 and -1) This is going to be a problem! What should we do? Idea: minimize w · w + C * (distance of error points to their correct place), i.e., trade off the margin against the training error.
24
Support Vector Machines: Slide 24 Copyright © 2001, Andrew W. Moore Learning Maximum Margin with Noise
Our QP criterion: minimize w · w + C Σ_k ε_k, where ε_k is the slack (distance to its correct place) of point k.
How many constraints? 2R. What should they be?
w · x_k + b >= 1 - ε_k if y_k = +1
w · x_k + b <= -1 + ε_k if y_k = -1
ε_k >= 0 for all k
[Figure: the planes wx+b=1, wx+b=0, wx+b=-1 with a few misplaced points and their slacks marked.]
25
Support Vector Machines: Slide 25 Copyright © 2001, Andrew W. Moore Hard/Soft Margin Separation
26
Support Vector Machines: Slide 26 Copyright © 2001, Andrew W. Moore Controlling Soft Margin Separation
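The figures on slides 25-26 were images; as a stand-in, here is a hedged scikit-learn sketch of what "controlling soft margin separation" means in practice: sweeping C trades margin width against training error (the data and C values are illustrative).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian blobs: not perfectly separable, so slack is needed.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width for the linear case
    print(f"C={C:>6}: margin width={margin:.3f}, "
          f"support vectors={len(clf.support_)}, "
          f"training accuracy={clf.score(X, y):.2f}")
```

Small C tolerates more training error and tends to give a wider margin; large C approaches the hard-margin solution.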
27
Support Vector Machines: Slide 27 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
28
Support Vector Machines: Slide 28 Copyright © 2001, Andrew W. Moore Suppose we’re in 1-dimension What would SVMs do with this data? x=0
29
Support Vector Machines: Slide 29 Copyright © 2001, Andrew W. Moore Suppose we’re in 1-dimension Not a big surprise Positive “plane” Negative “plane” x=0
30
Support Vector Machines: Slide 30 Copyright © 2001, Andrew W. Moore Harder 1-dimensional dataset That’s wiped the smirk off SVM’s face. What can be done about this? x=0
31
Support Vector Machines: Slide 31 Copyright © 2001, Andrew W. Moore Harder 1-dimensional dataset Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too x=0
32
Support Vector Machines: Slide 32 Copyright © 2001, Andrew W. Moore Harder 1-dimensional dataset Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too x=0
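A hedged sketch of the fix these slides propose: expand each 1-D point into basis features such as z_k = (x_k, x_k²), after which a linear boundary in z-space separates the classes. The data below are illustrative, since the slides' own points were in a figure.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative 1-D data: the -1 points sit between the +1 points,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Basis-function expansion z_k = (x_k, x_k^2): now a straight line can separate them.
Z = np.column_stack([x, x ** 2])
clf = SVC(kernel='linear', C=1e6).fit(Z, y)
print(clf.predict(Z))               # matches y
print(clf.coef_, clf.intercept_)    # the separating line lives in (x, x^2) space
```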
33
Support Vector Machines: Slide 33 Copyright © 2001, Andrew W. Moore Outline
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
34
Support Vector Machines: Slide 34 Copyright © 2001, Andrew W. Moore Common SVM basis functions
z_k = (polynomial terms of x_k of degree 1 to q)
z_k = (radial basis functions of x_k)
z_k = (sigmoid functions of x_k)
This is sensible. Is that the end of the story? No... there's one more trick!
35
Support Vector Machines: Slide 35 Copyright © 2001, Andrew W. Moore The “Kernel Trick”
36
Support Vector Machines: Slide 36 Copyright © 2001, Andrew W. Moore Quadratic Dot Products Writing out Φ(a) · Φ(b) for the quadratic basis expansion gives a sum of four kinds of terms: a constant term (1), linear terms (2 Σ_i a_i b_i), pure quadratic terms (Σ_i a_i² b_i²), and quadratic cross-terms (2 Σ_{i<j} a_i a_j b_i b_j).
37
Support Vector Machines: Slide 37 Copyright © 2001, Andrew W. Moore Quadratic Dot Products Expanding (a · b + 1)² gives exactly the same terms. They're the same! And (a · b + 1)² is only O(m) to compute!
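A quick numerical check of the claim on slides 36-37, assuming the usual √2-scaled quadratic feature map (the scaling is what makes the identity exact): Φ(a)·Φ(b) equals (a·b + 1)², but the kernel form costs only O(m).

```python
import numpy as np
from itertools import combinations

def phi(a):
    """Explicit quadratic feature map: constant, linear, pure quadratic, and cross terms."""
    m = len(a)
    feats = [1.0]                                                        # constant term
    feats += [np.sqrt(2) * a[i] for i in range(m)]                       # linear terms
    feats += [a[i] ** 2 for i in range(m)]                               # pure quadratic terms
    feats += [np.sqrt(2) * a[i] * a[j] for i, j in combinations(range(m), 2)]  # cross-terms
    return np.array(feats)

rng = np.random.RandomState(1)
a, b = rng.randn(5), rng.randn(5)
explicit = phi(a) @ phi(b)            # O(m^2) features written out
kernel = (a @ b + 1.0) ** 2           # O(m) work
print(np.isclose(explicit, kernel))   # True: they're the same number
```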
38
Support Vector Machines: Slide 38 Copyright © 2001, Andrew W. Moore SVM Kernel Functions K(a,b) = (a · b + 1)^d is an example of an SVM kernel function. Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function, e.g. a radial-basis-style kernel function or a neural-net-style kernel function.
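The radial-basis-style and neural-net-style formulas on this slide were images; the forms sketched below are the standard ones. Parameter conventions vary (σ, κ, δ here are illustrative), and the sigmoid kernel is a valid Mercer kernel only for some parameter settings.

```python
import numpy as np

def poly_kernel(a, b, d=2):
    """Polynomial kernel K(a,b) = (a.b + 1)^d."""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """Radial-basis-style kernel K(a,b) = exp(-||a - b||^2 / (2 sigma^2))."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """Neural-net-style (sigmoid) kernel K(a,b) = tanh(kappa * a.b - delta)."""
    return np.tanh(kappa * np.dot(a, b) - delta)
```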
39
Support Vector Machines: Slide 39 Copyright © 2001, Andrew W. Moore Non-linear SVM with Kernels
40
Support Vector Machines: Slide 40 Copyright © 2001, Andrew W. Moore Doing multi-class classification SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs:
SVM 1 learns "Output==1" vs "Output != 1"
SVM 2 learns "Output==2" vs "Output != 2"
:
SVM N learns "Output==N" vs "Output != N"
Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
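A hedged sketch of this one-vs-rest recipe with scikit-learn, using decision_function as the "how far into the positive region" score; the toy data, class labels, and helper names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svc_kwargs):
    """Train one binary SVM per class: 'Output==c' vs 'Output!=c'."""
    return {c: SVC(**svc_kwargs).fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, X):
    """Pick the class whose SVM pushes the point furthest into its positive region."""
    classes = list(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

# Toy 3-class usage.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([1, 2, 3], 30)
models = train_one_vs_rest(X, y, classes=[1, 2, 3], kernel='linear', C=1.0)
print((predict_one_vs_rest(models, X) == y).mean())   # training accuracy
```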
41
Support Vector Machines: Slide 41 Copyright © 2001, Andrew W. Moore What You Should Know
- Linear SVMs
- The definition of a maximum margin classifier
- What QP can do for you (but, for this class, you don't need to know how it does it)
- How Maximum Margin can be turned into a QP problem
- How we deal with noisy (non-separable) data
- How we permit non-linear boundaries
- How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
42
Support Vector Machines: Slide 42 Copyright © 2001, Andrew W. Moore Remaining... Actually, I skipped ERM (Empirical Risk Minimization) and the VC (Vapnik-Chervonenkis) dimension, but they are needed at data-processing time. Also skipped: handling unbalanced datasets, selecting C, selecting the kernel, what a valid kernel is, and how to construct a valid kernel; these are needed at tuning time. Now the real game starts in earnest!
43
Support Vector Machines: Slide 43 Copyright © 2001, Andrew W. Moore SVM light http://svmlight.joachims.org (Thorsten Joachims @ Cornell Univ.)
Commands:
svm_learn [options] [train_file] [model_file], e.g. svm_learn -c 1.5 -x 1 train_file model_file
svm_classify [test_file] [model_file] [prediction_file], e.g. svm_classify test_file model_file prediction_file
File format: each line is "<class> <feature>:<value> <feature>:<value> ... # comment", with feature indices in increasing order. For example, the line
1 23:0.5 105:0.1 1023:0.8 1999:0.34
encodes a sparse vector whose value is 0.5 at index 23, 0.1 at index 105, 0.8 at index 1023, and 0.34 at index 1999 (0 at every other index, e.g. 1, 1024, 1998), with class label 1.
44
Support Vector Machines: Slide 44 Copyright © 2001, Andrew W. Moore Options
-z : selects whether the SVM is trained in classification (c) or regression (r) mode
-c : controls the trade-off between training error and margin: the lower the value of C, the more training error is tolerated. The best value of C depends on the data and must be determined empirically.
-w : width of the tube (i.e. ε) of the ε-insensitive loss function used in regression. The value must be non-negative.
-j : specifies cost factors for the loss functions in both classification and regression.
-b : switches between a biased (1) or an unbiased (0) hyperplane.
-i : selects how training errors are treated (1-4).
-x : selects whether to compute leave-one-out estimates of the generalization performance.
-t : selects the type of kernel function (0-4)
-q : maximum size of the working set.
-m : specifies the amount of memory (in megabytes) available for caching kernel evaluations.
...
45
Support Vector Machines: Slide 45 Copyright © 2001, Andrew W. Moore Example 1: Sense Disambiguation AK has 3 senses: above knee, acetate kinase, artificial kidney. Example context: w13, w345, w874, w8, w7345, AK (acetate kinase), w123, w13, w8, w14, w8 => training instance "2 w8:3 w13:2 w14:1 w123:1 w345:1 w874:1 w7345:1" (class 2 = acetate kinase, features = context-word counts).
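A small sketch of how the context-word counts in this example become one SVM-light training line (feature indices must appear in increasing order; the word-to-id mapping is taken straight from the example, and the helper name is illustrative).

```python
from collections import Counter

def to_svmlight_line(label, token_ids):
    """Emit '<class> <idx>:<count> ...' with feature indices in increasing order."""
    counts = Counter(token_ids)
    feats = " ".join(f"{idx}:{counts[idx]}" for idx in sorted(counts))
    return f"{label} {feats}"

# Context of one AK occurrence, with words already mapped to integer ids
# (w13 -> 13, w345 -> 345, ...), labeled sense 2 = acetate kinase.
context = [13, 345, 874, 8, 7345, 123, 13, 8, 14, 8]
print(to_svmlight_line(2, context))
# -> 2 8:3 13:2 14:1 123:1 345:1 874:1 7345:1
```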
46
Support Vector Machines: Slide 46 Copyright © 2001, Andrew W. Moore Example of training set 1 109:0.060606 193:0.090909 211:0.052632 3079:1.000000 5882:0.200000 1 109:0.030303 143:1.000000 377:0.500000 439:0.500000 1012:0.500000 3808:1.000000 3891:1.000000 4789:0.333333 18363:0.333333 23106:1.000000 1 174:0.166667 244:0.500000 321:0.500000 332:0.500000 723:0.333333 3064:0.500000 3872:1.000000 16401:1.000000 19369:1.000000 23109:1.000000 2 108:0.250000 148:0.333333 202:0.250000 313:0.200000 380:1.000000 3303:1.000000 8944:1.000000 11513:1.000000 23110:1.000000 23111:1.000000 1 100:0.125000 109:0.030303 129:0.200000 131:0.071429 135:0.050000 543:0.500000 5880:0.200000 17153:0.100000 23112:0.100000 1 100:0.125000 109:0.060606 125:0.500000 131:0.071429 193:0.045455 2553:0.200000 4790:0.100000 5880:0.200000 2 382:1.000000 410:1.000000 790:1.000000 1160:0.500000 2052:1.000000 3260:1.000000 7323:0.500000 23113:1.000000 23114:1.000000 23115:1.000000 2 133:1.000000 147:0.250000 148:0.333333 177:0.500000 214:1.000000 960:1.000000 1131:1.000000 1359:1.000000 6378:0.333333 22842:1.000000 1 109:0.060606 917:1.000000 2747:0.500000 2748:0.500000 2749:1.000000 3252:1.000000 21699:1.000000 1 543:0.500000 557:1.000000 563:0.500000 1077:1.000000 1160:0.500000 2747:0.500000 2748:0.500000 12805:1.000000 23116:1.000000 23117:1.000000 1 593:1.000000 1124:0.500000 1615:1.000000 3607:1.000000 13443:1.000000 23118:1.000000 23119:1.000000 23120:1.000000 …………………………………………………………
47
Support Vector Machines: Slide 47 Copyright © 2001, Andrew W. Moore Trained model of SVM
SVM-multiclass Version V1.01
2 # number of classes
24208 # number of base features
0 # loss function
3 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
48584 # highest feature index
167 # number of training documents
335 # number of support vectors plus 1
0 # threshold b, each following line is a SV (starting with alpha*y)
0.010000000000000000208166817117217 24401:0.2 24407:0.071428999 24756:1 24909:0.25 25283:1 25424:0.33333299 25487:0.043478001 25773:0.5 31693:1 37411:1 #
-0.010000000000000000208166817117217 193:0.2 199:0.071428999 548:1 701:0.25 1075:1 1216:0.33333299 1279:0.043478001 1565:0.5 7485:1 13203:1 #
0.010000000000000000208166817117217 24407:0.071428999 24419:0.0625 25807:0.166667 26905:0.166667 27631:1 27664:1 35043:1 35888:1 36118:1 48085:1 #
-0.010000000000000000208166817117217 199:0.071428999 211:0.0625 1599:0.166667 2697:0.166667 3423:1 3456:1 10835:1 11680:1 11910:1 23877:1 #
...
48
Support Vector Machines: Slide 48 Copyright © 2001, Andrew W. Moore Example 2: Text Chunking
49
Support Vector Machines: Slide 49 Copyright © 2001, Andrew W. Moore Example 3: Text Classification Corpus: Reuters-21578 (12,902 news stories, 118 categories), ModApte split (75% training: 9,603 docs / 25% test: 3,299 docs).
50
Support Vector Machines: Slide 50 Copyright © 2001, Andrew W. Moore Q & A