CS 485: Special Topics in Data Mining Jinze Liu


Classification - SVM

Linear Classifiers. A linear classifier takes an input x and produces an estimated label yest via f(x, w, b) = sign(w · x - b); in the figures, the two plotting symbols denote the +1 and -1 classes. Given such a two-class dataset, how would you classify this data? (The original slides show several different candidate separating lines drawn through the same scatter plot.)

Linear Classifiers. f(x, w, b) = sign(w · x - b). Any of these separating lines would be fine... but which is best?

Classifier Margin. For a linear classifier f(x, w, b) = sign(w · x - b), define the margin as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin. The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM). The support vectors are those datapoints that the margin pushes up against.

Why Maximum Margin? Intuitively this feels safest: if we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), a wide margin gives us the least chance of causing a misclassification. Leave-one-out cross-validation (LOOCV) is easy, since the model is immune to removal of any non-support-vector datapoint. Empirically it works very, very well. There is also some theory (using VC dimension) that is related to, but not the same as, the proposition that maximizing the margin is a good thing.

Specifying a line and margin. The classifier boundary lies midway between a plus-plane (the edge of the "Predict Class = +1" zone) and a minus-plane (the edge of the "Predict Class = -1" zone). How do we represent this mathematically, in m input dimensions?

Specifying a line and margin. The three parallel planes are w · x + b = +1, w · x + b = 0 (the classifier boundary), and w · x + b = -1. Plus-plane = { x : w · x + b = +1 }. Minus-plane = { x : w · x + b = -1 }. Classify as +1 if w · x + b >= 1, as -1 if w · x + b <= -1, and "Universe explodes" if -1 < w · x + b < 1 (the middle zone, which no training point is allowed to occupy).
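
As a concrete illustration of this classification rule, here is a minimal Python sketch; the weight vector, bias, and test points are made up, and it follows the w · x + b convention used for the plus- and minus-planes above.

```python
import numpy as np

# A minimal sketch of the rule above; w, b, and the test points are made up.
w = np.array([1.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

def classify(x, w, b):
    """+1 if w.x + b >= 1, -1 if w.x + b <= -1, None inside the margin zone."""
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return None  # the "universe explodes" zone: no training point should land here

print(classify(np.array([2.0, 0.0]), w, b))   # +1  (w.x + b = 2.5)
print(classify(np.array([0.0, 2.0]), w, b))   # -1  (w.x + b = -1.5)
```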

Computing the margin width. Let M be the margin width. How do we compute M in terms of w and b? Plus-plane = { x : w · x + b = +1 }, minus-plane = { x : w · x + b = -1 }. Claim: the vector w is perpendicular to the plus-plane. Why? Let u and v be two vectors on the plus-plane; then w · u + b = 1 and w · v + b = 1, so w · (u - v) = 0, i.e. w is perpendicular to every direction lying within the plane. And so of course the vector w is also perpendicular to the minus-plane.

Computing the margin width. Let x- be any point on the minus-plane, and let x+ be the closest plus-plane point to x- (these are any locations in R^m, not necessarily datapoints). Claim: x+ = x- + λw for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in the direction of w.

Computing the margin width. What we know: w · x+ + b = +1, w · x- + b = -1, x+ = x- + λw, and |x+ - x-| = M. It is now easy to get M in terms of w and b. Substituting: w · (x- + λw) + b = 1, so w · x- + b + λ (w · w) = 1, so -1 + λ (w · w) = 1, giving λ = 2 / (w · w). Therefore M = |x+ - x-| = |λw| = λ sqrt(w · w) = 2 / sqrt(w · w) = 2 / ||w||.
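
Below is a small numeric sanity check of the result M = 2 / ||w||; the particular w and b are hypothetical, chosen only for illustration.

```python
import numpy as np

# Numeric check of M = 2 / ||w||; w and b are hypothetical.
w = np.array([3.0, 4.0])
b = -2.0

margin = 2.0 / np.linalg.norm(w)              # M = 2 / sqrt(w . w)

# Geometric cross-check: take an x- on the minus-plane and step lambda*w to the plus-plane.
x_minus = np.array([0.0, (-1.0 - b) / w[1]])  # satisfies w . x + b = -1
lam = 2.0 / np.dot(w, w)                      # the lambda from the derivation above
x_plus = x_minus + lam * w                    # lands on w . x + b = +1
assert np.isclose(np.dot(w, x_plus) + b, 1.0)
assert np.isclose(np.linalg.norm(x_plus - x_minus), margin)
print(margin)                                 # 0.4 for this w
```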

Learning the Maximum Margin Classifier. M = margin width = 2 / sqrt(w · w). Given a guess of w and b we can compute whether all data points are in the correct half-planes, and compute the width of the margin. So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?

Learning via Quadratic Programming. QP is a well-studied class of optimization algorithms for maximizing a quadratic function of some real-valued variables subject to linear constraints.

Quadratic Programming. Find the u that maximizes the quadratic criterion c + d · u + (u · R u) / 2, subject to n additional linear inequality constraints (a_i · u <= b_i for i = 1, ..., n) and subject to e additional linear equality constraints (a_j · u = b_j for j = n+1, ..., n+e). There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly... you probably don't want to write one yourself.)
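
As a sketch of "hand the problem to a package instead of writing your own solver", here is a tiny, made-up quadratic program passed to the cvxopt library (assuming it is installed); cvxopt minimizes rather than maximizes, so a maximization criterion would be negated first.

```python
import numpy as np
from cvxopt import matrix, solvers

# minimize (1/2) u'Pu + q'u  subject to  Gu <= h; the numbers are made up.
solvers.options['show_progress'] = False
P = matrix(np.diag([2.0, 2.0]))      # quadratic term
q = matrix(np.array([-2.0, -5.0]))   # linear term
G = matrix(np.diag([-1.0, -1.0]))    # encodes u >= 0 as -u <= 0
h = matrix(np.zeros(2))

sol = solvers.qp(P, q, G, h)
print(np.array(sol['x']).ravel())    # optimum at roughly (1.0, 2.5)
```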

Learning the Maximum Margin Classifier. Given a guess of w, b we can compute whether all data points are in the correct half-planes and compute the margin width. Assume R datapoints, each (x_k, y_k) where y_k = +/- 1. What should our quadratic optimization criterion be? Minimize w · w. How many constraints will we have? R. What should they be? w · x_k + b >= 1 if y_k = 1, and w · x_k + b <= -1 if y_k = -1.
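
The following is a sketch of this optimization solved with scipy's general-purpose SLSQP routine standing in for a dedicated QP package; the four datapoints are hypothetical and linearly separable.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data (not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(u):
    # variables u = (w1, w2, b); criterion: minimize w . w
    w = u[:2]
    return np.dot(w, w)

# R constraints, one per datapoint: y_k (w . x_k + b) - 1 >= 0
constraints = [{'type': 'ineq',
                'fun': lambda u, xk=xk, yk=yk: yk * (np.dot(u[:2], xk) + u[2]) - 1.0}
               for xk, yk in zip(X, y)]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               constraints=constraints, method='SLSQP')
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
```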

Uh-oh! This is going to be a problem: the data are no longer linearly separable. What should we do? Idea 1: find the minimum w · w while also minimizing the number of training set errors. Problemette: two things to minimize makes for an ill-defined optimization.

Uh-oh! Idea 1.1: Minimize w · w + C (#train errors), where C is a tradeoff parameter. There's a serious practical problem that's about to make us reject this approach. Can you guess what it is? It can't be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So... any other ideas?

Uh-oh! Idea 2.0: Minimize w · w + C (distance of error points to their correct place).

Learning Maximum Margin with Noise. Given a guess of w, b we can compute the sum of distances of points to their correct zones and compute the margin width. Assume R datapoints, each (x_k, y_k) where y_k = +/- 1, and let ε_k be the distance of point k past its correct zone (ε_k = 0 for points already on the right side). What should our quadratic optimization criterion be? Minimize (1/2) w · w + C Σ_k ε_k. How many constraints will we have? R. What should they be? w · x_k + b >= 1 - ε_k if y_k = 1, and w · x_k + b <= -1 + ε_k if y_k = -1.

Learning Maximum Margin with Noise (continued). Here m = number of input dimensions and R = number of records. Our original (noiseless data) QP had m + 1 variables: w1, w2, ..., wm, and b. Our new (noisy data) QP has m + 1 + R variables: w1, w2, ..., wm, b, and ε_1, ..., ε_R. The criterion and the R constraints are the same as on the previous slide.

Learning Maximum Margin with Noise. Same criterion and constraints as before: minimize (1/2) w · w + C Σ_k ε_k, subject to w · x_k + b >= 1 - ε_k if y_k = 1 and w · x_k + b <= -1 + ε_k if y_k = -1. There's a bug in this QP. Can you spot it?

Learning Maximum Margin with Noise (fixed). What should our quadratic optimization criterion be? Minimize (1/2) w · w + C Σ_k ε_k. How many constraints will we have? 2R. What should they be? w · x_k + b >= 1 - ε_k if y_k = 1; w · x_k + b <= -1 + ε_k if y_k = -1; and ε_k >= 0 for all k. (That last set of constraints was the missing piece: without it, the ε_k of points far on the correct side could go negative and artificially reduce the criterion.)
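
In practice the soft-margin problem is solved by an off-the-shelf SVM implementation; the sketch below uses scikit-learn's SVC with a linear kernel on made-up, slightly overlapping data to show how the tradeoff parameter C affects the margin and the number of support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, slightly overlapping two-class data.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C}: margin = {2.0 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```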

An Equivalent QP (the dual). Maximize Σ_k α_k - (1/2) Σ_k Σ_l α_k α_l y_k y_l (x_k · x_l) over the α_k, subject to these constraints: 0 <= α_k <= C for all k, and Σ_k α_k y_k = 0. Then define w = Σ_k α_k y_k x_k, and recover b from any support vector that lies exactly on its margin plane. Then classify with f(x) = sign(w · x + b) (the earlier slides write this as sign(w · x - b); only the sign convention on b differs). Datapoints with α_k > 0 will be the support vectors, so the sum defining w only needs to be over the support vectors. Why mention this equivalent QP? It's a formulation that QP packages can optimize more quickly, and because of further jaw-dropping developments you're about to learn.
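
A small sketch of the "sum over support vectors" point, using scikit-learn's SVC on made-up data: the fitted model exposes the support vectors and the products α_k y_k, from which w can be rebuilt (scikit-learn stores the bias with a plus sign, as w · x + b).

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)

sv = clf.support_vectors_        # the x_k with alpha_k > 0
alpha_y = clf.dual_coef_[0]      # the products alpha_k * y_k for those points
b = clf.intercept_[0]            # scikit-learn's bias, used as w . x + b

w = alpha_y @ sv                 # w = sum over support vectors of alpha_k y_k x_k
x_new = np.array([1.0, 0.5])
print(np.sign(np.dot(w, x_new) + b))        # classify via the recovered w ...
print(np.sign(alpha_y @ (sv @ x_new) + b))  # ... or directly as a sum over the SVs
```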

Estimate the Margin. What is the distance expression for a point x to the line w · x + b = 0? It is |w · x + b| / ||w|| = |w · x + b| / sqrt(w · w).

Estimate the Margin. What is the expression for the margin? It is the distance from the boundary w · x + b = 0 to the closest training point: margin = min_k |w · x_k + b| / ||w||.

Maximize Margin. Choose w and b to maximize the margin, i.e. argmax over (w, b) of min_k |w · x_k + b| / ||w||, with every training point correctly classified. This is a min-max problem, which looks like a game problem. Strategy: the scale of (w, b) is arbitrary, so fix it by requiring min_k |w · x_k + b| = 1; maximizing the margin 1 / ||w|| is then equivalent to minimizing w · w subject to y_k (w · x_k + b) >= 1 for all k, which is exactly the quadratic program from before.
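
A minimal numeric version of this margin estimate, with a hypothetical w, b, and set of points:

```python
import numpy as np

# Hypothetical w, b, and points; distance of x_k to the plane is |w . x_k + b| / ||w||.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0], [1.5, 2.5], [-0.5, 0.0], [0.0, -1.0]])

distances = np.abs(X @ w + b) / np.linalg.norm(w)
print("distances:", distances)
print("margin estimate:", distances.min())
```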


Suppose we're in 1 dimension. What would SVMs do with this data? (The figure shows a handful of positive and negative points along the x-axis.)

Suppose we're in 1 dimension. Not a big surprise: the SVM puts the boundary at a single threshold on the axis, with a positive "plane" and a negative "plane" on either side of it.

Harder 1-dimensional dataset. That's wiped the smirk off SVM's face: one class sits in the middle of the other, so no single threshold separates them. What can be done about this?

Harder 1-dimensional dataset. Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too: map each point x_k to z_k = (x_k, x_k^2); in the new 2-D space the two classes become linearly separable.
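
A sketch of this idea on made-up 1-D data: the positive class sits between the negatives, so no threshold on x works, but after the basis expansion x -> (x, x^2) a linear SVM separates the classes.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data with the positive class sandwiched in the middle.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

Z = np.column_stack([x, x ** 2])     # basis expansion z_k = (x_k, x_k^2)
clf = SVC(kernel='linear', C=1e6).fit(Z, y)
print(clf.predict(Z))                # reproduces y: separable in the new space
```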

Common SVM basis functions: z_k = (polynomial terms of x_k of degree 1 to q); z_k = (radial basis functions of x_k); z_k = (sigmoid functions of x_k).
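
These three families of basis functions correspond to the polynomial, RBF, and sigmoid kernels offered by libraries such as scikit-learn; a quick sketch on a synthetic "circles" dataset (not from the slides):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Synthetic, non-linearly-separable data (two concentric rings).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```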

SVM Performance. Anecdotally, they work very, very well indeed. Example: at the time these slides were written, they were the best-known classifier on a well-studied handwritten-character recognition benchmark. There has been a lot of excitement and religious fervor about SVMs; despite this, some practitioners are a little skeptical.

Doing multi-class classification. SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs. SVM 1 learns "Output == 1" vs "Output != 1"; SVM 2 learns "Output == 2" vs "Output != 2"; ...; SVM N learns "Output == N" vs "Output != N". Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
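
A sketch of this one-vs-rest scheme using scikit-learn on synthetic data; OneVsRestClassifier trains one "Output == i" vs "Output != i" SVM per class, and prediction picks the SVM whose decision value is largest.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_blobs

# Synthetic 3-class data.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
scores = ovr.decision_function(X[:5])   # one column per "class i vs rest" SVM
print(np.argmax(scores, axis=1))        # pick the SVM that is most confidently positive
print(ovr.predict(X[:5]))               # the built-in predict does the same thing
```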

SVM Related Links. http://svm.dcs.rhbnc.ac.uk/ ; http://www.kernel-machines.org/ ; C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, 2(2), 1998; SVMlight, software (in C), http://ais.gmd.de/~thorsten/svm_light ; book: N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press.