
Today’s Topics (CS 540, Fall 2015, Shavlik, Lecture 22, Week 10)
Support Vector Machines (SVMs): Three Key Ideas
– Max Margins
– Allowing Misclassified Training Examples
– Kernels (for non-linear models; in next lecture)

Three Key SVM Concepts
– Maximize the Margin: don’t choose just any separating plane
– Penalize Misclassified Examples: use soft constraints and ‘slack’ variables
– Use the ‘Kernel Trick’ to get Non-Linearity: roughly like ‘hardwiring’ the input → hidden-unit portion of an ANN (so only a perceptron needs to be learned)

Support Vector Machines: Maximizing the Margin between Bounding Planes
[Figure: two parallel bounding planes with the support vectors lying on them; the margin, ie the distance between the bounding planes, is 2 / ||w||₂]
SVMs define some inequalities we want satisfied. We then use advanced optimization methods (eg, linear programming) to find a satisfying solution, though in cs540 we’ll use a simpler approximation.

Margins and Learning Theory
Theorems exist that connect learning (‘PAC’) theory to the size of the margin
– Basically, the larger the margin, the better the expected future accuracy
– See, for example, Chapter 4 of An Introduction to Support Vector Machines by N. Cristianini & J. Shawe-Taylor, Cambridge University Press, 2000 (not an assigned reading)

Dealing with Data that is not Linearly Separable
[Figure: separating plane with the support vectors and the ‘slack’ variables labeled]
For each wrong example we pay a penalty, which is the distance we’d have to move it to get it on the right side of the decision boundary (ie, the separating plane). If we deleted any/all of the non-support vectors we’d get the same answer!

SVMs and Non-Linear Separating Surfaces
[Figure: ‘+’ and ‘–’ examples plotted in the original feature space (f1, f2), and again after the non-linear mapping h(f1, f2), g(f1, f2) into a new space where they are linearly separable]
Non-linearly map to a new space, linearly separate in the new space; the result is a non-linear separator in the original space. (A concrete sketch of this idea appears below.)
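As a minimal illustration of that idea (my own sketch, not from the lecture; the data and the quadratic map phi(x) = (x, x²) are made up): 1-D data that no single threshold can separate becomes linearly separable after a simple non-linear feature map.

```python
import numpy as np

# 1-D data that no single threshold on x can separate:
# positives sit on both sides of the negatives.
x = np.array([-4.0, -3.0, -0.5, 0.0, 0.7, 3.5, 4.2])
y = np.array([+1,   +1,   -1,   -1,  -1,  +1,  +1])

# Non-linear map to a new 2-D space: phi(x) = (x, x^2).
phi = np.column_stack([x, x**2])

# In the new space the linear rule "second feature > 4"
# (w = [0, 1], theta = 4) separates the classes.
w, theta = np.array([0.0, 1.0]), 4.0
predictions = np.where(phi @ w > theta, +1, -1)
print(predictions)                 # matches y
print(np.all(predictions == y))    # True
```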

Math Review: Dot Products
X · Y ≡ X₁Y₁ + X₂Y₂ + … + XₙYₙ
So if X = [4, 5, -3, 7] and Y = [9, 0, -8, 2]
then X · Y = 4·9 + 5·0 + (-3)·(-8) + 7·2 = 74
(weighted sums in ANNs are dot products)
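A quick check of that arithmetic with NumPy (not from the slides):

```python
import numpy as np

X = np.array([4, 5, -3, 7])
Y = np.array([9, 0, -8, 2])

# Dot product: sum of element-wise products.
print(np.dot(X, Y))   # 74
print(X @ Y)          # same thing, using the @ operator
```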

Some Equations
Separating plane:           w · x = θ
For all positive examples:  w · x ≥ θ + 1
For all negative examples:  w · x ≤ θ - 1
(w = weights, x = input features, θ = threshold)
These 1’s result from dividing through by a constant for convenience (it is the distance from the dashed bounding lines to the green separating line)
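A minimal sketch of these inequalities in code (my own illustration; the weights and threshold here are made up, not learned):

```python
import numpy as np

# Hypothetical parameters of a separating plane.
w = np.array([2.0, -1.0])   # weights
theta = 0.5                 # threshold

def satisfies_margin(x, label):
    """True if example x satisfies its margin constraint."""
    score = np.dot(w, x)
    if label == +1:
        return score >= theta + 1   # positive examples: w·x ≥ θ + 1
    else:
        return score <= theta - 1   # negative examples: w·x ≤ θ - 1

print(satisfies_margin(np.array([1.0, 0.2]), +1))   # 1.8 ≥ 1.5 -> True
print(satisfies_margin(np.array([0.0, 1.0]), -1))   # -1.0 ≤ -0.5 -> True
```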

Idea #1: The Margin (derivation not on final)
[Figure: the two parallel bounding lines, with a point x_A on the ‘+1’ line, a point x_B on the ‘-1’ line, and the weight vector w perpendicular to both]
(i)  w · x_A = θ + 1   (the green line is the set of all points that satisfy this equation; ditto for the red line and (ii))
(ii) w · x_B = θ - 1
(iii) Subtracting (ii) from (i) gives  w · (x_A - x_B) = 2
(iv) Since the lines are parallel and w is perpendicular to them, choosing x_A and x_B as the closest such points gives  x_A - x_B = margin · w / ||w||₂
Combining (iii) and (iv) we get  margin = 2 / ||w||₂
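A quick numeric sanity check of that result (my own illustration, with an arbitrary w and θ of my choosing):

```python
import numpy as np

w = np.array([3.0, 4.0])     # arbitrary weight vector, ||w||_2 = 5
theta = 2.0

# Pick one point on each bounding plane by moving from the separating
# plane (w·x = theta) along the unit normal w / ||w||.
unit_w = w / np.linalg.norm(w)
x_on_sep = theta * w / np.dot(w, w)          # a point with w·x = theta
x_A = x_on_sep + unit_w / np.linalg.norm(w)  # w·x_A = theta + 1
x_B = x_on_sep - unit_w / np.linalg.norm(w)  # w·x_B = theta - 1

print(np.dot(w, x_A), np.dot(w, x_B))        # 3.0 and 1.0 (theta ± 1)
print(np.linalg.norm(x_A - x_B))             # 0.4 = distance between planes
print(2 / np.linalg.norm(w))                 # 0.4 = 2 / ||w||_2, the margin
```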

Our Initial ‘Mathematical Program’
min over w, θ of  ||w||₁
(this is the ‘1-norm’ length of the weight vector, ie the sum of the absolute values of the weights; some SVMs use quadratic programs, but 1-norms have some preferred properties)
such that
  w · x_pos ≥ θ + 1   // for all ‘+’ examples
  w · x_neg ≤ θ – 1   // for all ‘–’ examples

The ‘p’ Norm – Generalization of the Familiar Euclidean Distance (p = 2)
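For reference, the p-norm is defined as ||x||_p = ( Σᵢ |xᵢ|ᵖ )^(1/p), so the 1-norm is the sum of absolute values and the 2-norm is the usual Euclidean length. A quick NumPy illustration (my own):

```python
import numpy as np

x = np.array([4.0, 5.0, -3.0, 7.0])

# p = 1: sum of absolute values (the norm minimized in this lecture's LP).
print(np.linalg.norm(x, ord=1))       # 19.0
# p = 2: the familiar Euclidean length.
print(np.linalg.norm(x, ord=2))       # sqrt(16+25+9+49) ≈ 9.95
# p -> infinity: the largest absolute component.
print(np.linalg.norm(x, ord=np.inf))  # 7.0
```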

Our Mathematical Program (cont.)
Note: w and θ are our adjustable parameters (we could, of course, use the ANN ‘trick’ and move θ to the left side of our inequalities and treat it as another weight)
We can now use existing mathematical-programming optimization s/w to find a solution to our current program (covered in cs525)

Idea #2: Dealing with Non-Separable Data
We can add what is called a ‘slack’ variable to each example. This variable can be viewed as
  = 0 if the example is correctly separated
  else = the ‘distance’ we would need to move the example to get it correct (ie, its distance from the decision boundary)
Note: we are NOT counting the number of misclassified examples; it would be nice to do so, but that becomes [mixed] integer programming, which is much harder

The Math Program with Slack Vars
(this is the linear-programming version; there is also a quadratic-programming version; in cs540 we won’t worry about the difference)
min over w, S, θ of  ||w||₁ + μ ||S||₁
such that
  w · x_pos_i + S_i ≥ θ + 1
  w · x_neg_j – S_j ≤ θ – 1
  ∀k  S_k ≥ 0
Dimensions: w has one entry per input feature, S has one entry per training example, θ is a scalar; μ is a scaling constant (use a tuning set to select its value)
The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface
Notice we are solving the perceptron task with a complexity penalty (sum of weights): Hinton’s weight decay!
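To make the role of the slacks concrete, here is a small sketch (my own; the data and the fixed w, θ are made up) that computes the slack each example would need under a given plane:

```python
import numpy as np

w = np.array([1.0, 1.0])
theta = 0.0

pos = np.array([[2.0, 1.0], [0.2, 0.3]])    # '+' examples
neg = np.array([[-2.0, -1.0], [0.5, 0.4]])  # '-' examples

# Slack = amount by which the margin constraint is violated (0 if satisfied).
s_pos = np.maximum(0.0, (theta + 1) - pos @ w)   # want w·x ≥ θ + 1
s_neg = np.maximum(0.0, neg @ w - (theta - 1))   # want w·x ≤ θ - 1

print(s_pos)   # [0.  0.5] -> second '+' example is inside the margin
print(s_neg)   # [0.  1.9] -> second '-' example is on the wrong side
```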

Slacks and Separability
If the training data is separable, will all S_i = 0? Not necessarily!
– We might get a larger margin by misclassifying a few examples (just like in d-tree pruning)
– This can also happen when using gradient descent to minimize an ANN’s cost function

Brief Intro to Linear Programs (LPs) - not on final
We need to convert our task into A z ≥ b, which is the basic form of an LP (A is a constant matrix, b is a constant vector, z is a vector of variables)
Notes:
– We can convert inequalities containing ≤ into ones using ≥ by multiplying both sides by -1 (eg, 5x ≤ 15 is the same as -5x ≥ -15)
– LPs can also handle = (ie, equalities); we could use ≥ and ≤ together to get =, but more efficient methods exist

Brief Intro to Linear Programs (cont.) - not on final
In addition, we want to minimize c · z under the linear A z ≥ b constraints; the vector c says how to penalize the settings of the variables in vector z
Highly optimized s/w for solving LPs exists (eg, CPLEX, COIN-OR [free])
[Figure: the yellow region is the set of points that satisfy the constraints; the dotted lines are iso-cost lines]
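A small sketch of this generic form using SciPy’s LP solver (my own toy example, not from the lecture). Note that scipy.optimize.linprog expects inequality constraints as A_ub z ≤ b_ub, so the ≥ system is flipped by multiplying by -1, exactly the trick from the previous slide:

```python
import numpy as np
from scipy.optimize import linprog

# Minimize c·z subject to A z ≥ b, with z = (z1, z2) and z ≥ 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0,  1.0],    # z1 + z2 ≥ 3
              [1.0, -1.0]])   # z1 - z2 ≥ -1
b = np.array([3.0, -1.0])

# linprog uses A_ub z ≤ b_ub, so multiply the ≥ constraints by -1.
result = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)])
print(result.x, result.fun)   # optimal z = [3, 0], minimized cost = 3.0
```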

Review: Matrix Multiplication
A · B = C, where matrix A is M by K, matrix B is K by N, and matrix C is M by N (the inner dimensions must match)
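A quick NumPy check of those shapes (my own illustration):

```python
import numpy as np

M, K, N = 2, 3, 4
A = np.arange(M * K).reshape(M, K)   # 2 x 3
B = np.arange(K * N).reshape(K, N)   # 3 x 4

C = A @ B                            # inner dimensions (K) must match
print(C.shape)                       # (2, 4) -> M by N
```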

Aside: Our SVM as an LP (not on final)
Let A_pos = our positive training examples and A_neg = our negative training examples (assume 50% pos and 50% neg for notational simplicity); f = # of features, e = # of examples
[Figure: the block matrix A and variable vector z = (W, S_pos, S_neg, θ, Z) laid out so that A z ≥ b encodes the SVM constraints; the column blocks have widths f, e/2, e/2, 1, and f, and the 1’s appearing in the diagonal blocks are identity matrices (often written as I)]

Our C Vector (determines the cost we’re minimizing, also not on final)
min  [ 0  μ  0  1 ] · [ W  S  θ  Z ]ᵀ  =  min  μ · S + 1 · Z
(here S = S_pos concatenated with S_neg)
Note we minimize the Z’s, not the W’s, since only the Z’s are constrained to be ≥ 0
  = min  μ ||S||₁ + ||W||₁
since all S are non-negative and the Z’s ‘squeeze’ the W’s
Aside: we could also penalize θ (but we would need to add more variables, since θ can be negative)
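Putting the last few slides together, here is a sketch of the whole 1-norm soft-margin SVM as a linear program, solved with scipy.optimize.linprog (my own construction following the slides’ formulation; the variable names and toy data are mine). The variables are stacked as z = (w, s, θ, t), where t plays the role of the Z’s that ‘squeeze’ |w|:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-D data (made up for illustration).
X_pos = np.array([[2.0, 2.0], [3.0, 1.0], [0.5, 0.2]])
X_neg = np.array([[-2.0, -1.0], [-1.0, -3.0], [-0.3, -0.4]])
mu = 1.0                                  # slack penalty (tune on a tuning set)

f = X_pos.shape[1]                        # number of features
e = len(X_pos) + len(X_neg)               # number of examples

# Variable vector z = (w, s, theta, t): at the optimum t_k = |w_k|.
n = 2 * f + e + 1
W, S, TH, T = np.arange(f), f + np.arange(e), f + e, f + e + 1 + np.arange(f)

c = np.zeros(n)                           # objective: mu * sum(s) + sum(t)
c[S] = mu
c[T] = 1.0

G, h = [], []                             # constraints in "G z >= h" form
for i, x in enumerate(X_pos):             # w·x + s_i - theta >= 1
    g = np.zeros(n); g[W] = x; g[S[i]] = 1.0; g[TH] = -1.0
    G.append(g); h.append(1.0)
for j, x in enumerate(X_neg):             # -w·x + s_j + theta >= 1
    g = np.zeros(n); g[W] = -x; g[S[len(X_pos) + j]] = 1.0; g[TH] = 1.0
    G.append(g); h.append(1.0)
for k in range(f):                        # t_k - w_k >= 0  and  t_k + w_k >= 0
    g1 = np.zeros(n); g1[T[k]] = 1.0; g1[W[k]] = -1.0
    g2 = np.zeros(n); g2[T[k]] = 1.0; g2[W[k]] = +1.0
    G += [g1, g2]; h += [0.0, 0.0]

bounds = ([(None, None)] * f +            # w is free
          [(0, None)] * e +               # slacks s >= 0
          [(None, None)] +                # theta is free
          [(0, None)] * f)                # t >= 0

# linprog wants A_ub z <= b_ub, so negate the ">=" system.
res = linprog(c, A_ub=-np.array(G), b_ub=-np.array(h), bounds=bounds)
w, theta = res.x[W], res.x[TH]
print("w =", w, " theta =", theta, " objective =", res.fun)
```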

Where We Are so Far
We have an ‘objective’ function that we can optimize by linear programming
– min ||w||₁ + μ ||S||₁ subject to some constraints
– Free LP solvers exist
– CS 525 teaches linear programming
We could also use gradient descent (see the sketch below)
– Perceptron learning with ‘weight decay’ is quite similar, though it uses SQUARED weights and SQUARED error (the S is this error)
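As a rough illustration of that gradient-descent alternative (my own sketch, not the lecture’s algorithm): minimize the squared hinge loss plus a squared-weight penalty, which parallels the LP objective with 2-norms in place of 1-norms.

```python
import numpy as np

def train_svm_gd(X, y, lam=0.01, lr=0.01, epochs=500):
    """Gradient descent on squared hinge loss + squared-weight decay.

    X: (n_examples, n_features); y: labels in {+1, -1}.
    """
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        margins = y * (X @ w - theta)           # want all margins >= 1
        slack = np.maximum(0.0, 1.0 - margins)  # squared-hinge 'errors'
        # Gradients of sum(slack^2) + lam * ||w||^2.
        grad_w = -2.0 * (slack * y) @ X + 2.0 * lam * w
        grad_theta = 2.0 * np.sum(slack * y)
        w -= lr * grad_w
        theta -= lr * grad_theta
    return w, theta

# Tiny usage example with made-up data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w, theta = train_svm_gd(X, y)
print(np.sign(X @ w - theta))   # should match y
```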