Support Vector Machines Chapter 18.9 and the paper “Support vector machines” by M. Hearst, ed., 1998 Acknowledgments: These slides combine and modify ones provided by Andrew Moore (CMU), Carla Gomes (Cornell), Mingyue Tan (UBC), Jerry Zhu (Wisconsin), Glenn Fung (Wisconsin), and Olvi Mangasarian (Wisconsin)

What are Support Vector Machines (SVMs) Used For? Classification Regression and data-fitting Supervised and unsupervised learning

Lake Mendota, Madison, WI Identify areas of land cover (land, ice, water, snow) in a scene. Two methods: a scientist's manually-derived classifier and a Support Vector Machine (SVM). (Figure panels: Visible Image, Expert Labeled, Expert Derived, Automated Ratio, SVM.) Accuracy by class:
Class    Expert Derived   SVM
cloud    45.7%            58.5%
ice      60.1%            80.4%
land     93.6%            94.0%
snow     63.5%            71.6%
water    84.2%            89.1%
unclassified: 45.7%
Courtesy of Steve Chien of NASA/JPL

Linear Classifiers f x y denotes +1 denotes -1 How would you classify this data?

Linear Classifiers (aka Linear Discriminant Functions) Definition: A function that is a linear combination of the components of the input (column vector) x: f(x) = w^T x + b, where w is the weight (column) vector and b is the bias. A 2-class classifier then uses the rule: decide class c_1 if f(x) ≥ 0 and class c_2 if f(x) < 0, or, equivalently, decide c_1 if w^T x ≥ -b and c_2 otherwise
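To make the rule concrete, here is a minimal sketch (not from the slides) of this decision rule in NumPy; the weight vector and bias below are arbitrary made-up values.

```python
import numpy as np

def linear_classify(x, w, b):
    """Decide class c1 (+1) if w^T x + b >= 0, else class c2 (-1)."""
    f = np.dot(w, x) + b
    return 1 if f >= 0 else -1

# Hypothetical 2D example: w and b are made-up values for illustration
w = np.array([2.0, -1.0])
b = -0.5
print(linear_classify(np.array([1.0, 0.5]), w, b))   # f = 1.5 - 0.5 = 1.0  -> +1
print(linear_classify(np.array([-1.0, 1.0]), w, b))  # f = -3.0 - 0.5 = -3.5 -> -1
```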

w is the plane's normal vector, and b sets the plane's offset from the origin (the distance from the origin to the plane is |b| / ||w||). A planar decision surface in d dimensions is parameterized by (w, b)

Linear Classifiers f x y denotes +1 denotes -1 f (x, w, b) = sign(w T x + b) How would you classify this data?

Linear Classifiers f x y denotes +1 denotes -1 f (x, w, b) = sign(w T x + b) Any of these would be fine … … but which is best?

Classifier Margin f x y denotes +1 denotes -1 f (x, w, b) = sign(w T x + b) Define the margin of a linear classifier as the width that the boundary could be increased before hitting a data point

Maximum Margin f x y denotes +1 denotes -1 f (x, w, b) = sign(w T x + b) The maximum margin linear classifier is the linear classifier with the maximum margin This is the simplest kind of SVM (Called an LSVM) Linear SVM

Maximum Margin f x y denotes +1 denotes -1 f (x, w, b) = sign(w T x + b) The maximum margin linear classifier is the linear classifier with the maximum margin This is the simplest kind of SVM (Called an LSVM) Support Vectors are those data points that the margin pushes against Linear SVM

Why the Maximum Margin? denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those data points that the margin pushes against 1.Intuitively this feels safest 2.If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification 3.Robust to outliers since the model is immune to change/removal of any non-support-vector data points 4.There’s some theory (using “VC dimension”) that is related to (but not the same as) the proposition that this is a good thing 5.Empirically it works very well

Specifying a Line and Margin How do we represent this mathematically? … in d input dimensions? An example x = (x 1, …, x d ) T Plus-Plane Minus-Plane Classifier Boundary “Predict Class = +1” zone “Predict Class = -1” zone

Specifying a Line and Margin Plus-plane = { x : w^T x + b = +1 } Minus-plane = { x : w^T x + b = -1 } Classify as +1 if w^T x + b ≥ 1; as -1 if w^T x + b ≤ -1; undefined (?) if -1 < w^T x + b < 1. (Figure: Plus-Plane, Minus-Plane, Classifier Boundary, "Predict Class = +1" zone, "Predict Class = -1" zone; lines labeled w^T x+b=1, w^T x+b=0, w^T x+b=-1.) Weight vector: w = (w_1, …, w_d)^T Bias or threshold: b The dot product w^T x is a scalar: x's projection onto w

Computing the Margin Plus-plane = w T x + b = +1 Minus-plane = w T x + b = -1 Claim: The vector w is perpendicular to the Plus-Plane and the Minus-Plane “Predict Class = +1” zone “Predict Class = -1” zone w T x+b=1 w T x+b=0 w T x+b=-1 M = Margin (width) How do we compute M in terms of w and b? w Note: Transpose symbol on w implied

Computing the Margin Plus-plane = w T x + b = +1 Minus-plane = w T x + b = -1 Claim: The vector w is perpendicular to the Plus-Plane. Why? “Predict Class = +1” zone “Predict Class = -1” zone wx+b=1 wx+b=0 wx+b=-1 M = Margin How do we compute M in terms of w and b? Let u and v be two vectors on the Plus-Plane. What is w T ( u – v ) ? And so, of course, the vector w is also perpendicular to the Minus-Plane w

Computing the Margin Plus-plane = w^T x + b = +1 Minus-plane = w^T x + b = -1 The vector w is perpendicular to the Plus-Plane. Let x− be any point on the Minus-Plane, and let x+ be the closest Plus-Plane point to x−. (x− and x+ are any locations in the input space, not necessarily data points.) "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin How do we compute M in terms of w and b?

Computing the Margin Plus-plane = w^T x + b = +1 Minus-plane = w^T x + b = -1 The vector w is perpendicular to the Plus-Plane Let x− be any point on the Minus-Plane Let x+ be the closest Plus-Plane point to x− Claim: x+ = x− + λw for some value of λ "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin How do we compute M in terms of w and b?

Computing the Margin Plus-plane = w^T x + b = +1 Minus-plane = w^T x + b = -1 The vector w is perpendicular to the Plus-Plane Let x− be any point on the Minus-Plane Let x+ be the closest Plus-Plane point to x− Claim: x+ = x− + λw for some value of λ. Why? The line from x− to x+ is perpendicular to the planes, so to get from x− to x+ we travel some distance in direction w. "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin How do we compute M in terms of w and b?

Computing the Margin What we know: w^T x+ + b = +1; w^T x− + b = -1; x+ = x− + λw; ||x+ - x−|| = M. "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin

Computing the Margin What we know: w^T x+ + b = +1; w^T x− + b = -1; x+ = x− + λw; ||x+ - x−|| = M. It's now easy to get M in terms of w and b: w^T(x− + λw) + b = 1  =>  w^T x− + b + λ w^T w = 1  =>  -1 + λ w^T w = 1  =>  λ = 2 / (w^T w). "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin

Computing the Margin What we know: w^T x+ + b = +1; w^T x− + b = -1; x+ = x− + λw; ||x+ - x−|| = M; λ = 2 / (w^T w). So M = ||x+ - x−|| = ||λw|| = λ||w|| = 2||w|| / (w^T w) = 2 / ||w||. "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin = 2 / ||w||

Learning the Maximum Margin Classifier Given a guess of w and b, we can 1. Compute whether all data points are in the correct half-planes 2. Compute the width of the margin So now we just need to write a program to search the space of w's and b's to find the widest margin that classifies all the data points correctly. How? "Predict Class = +1" zone "Predict Class = -1" zone w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = Margin = 2 / ||w||

SVM as Constrained Optimization Unknowns: w, b. Objective function: maximize the margin M = 2/||w||, which is equivalent to minimizing ||w|| or ||w||^2 = w^T w. N training points: (x_k, y_k), y_k = +1 or -1. Subject to each training point being correctly classified (N constraints): y_k (w^T x_k + b) ≥ 1 for all k. This is a quadratic optimization problem (QP), which can be solved efficiently
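As an illustration of how this QP can be handed to a solver, here is a minimal sketch assuming the cvxopt package (my choice, not the slides'); the tiny ridge on the bias entry is only there to keep the solver numerically happy.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Minimize (1/2)||w||^2 subject to y_k (w^T x_k + b) >= 1, over z = [w; b]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)      # quadratic term acts on w only
    P[d, d] = 1e-8             # tiny ridge on b so P is numerically positive definite
    q = np.zeros(d + 1)
    # y_k (w^T x_k + b) >= 1  rewritten as  -y_k [x_k, 1] z <= -1
    G = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    h = -np.ones(N)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d]         # w, b
```

In practice one would use a dedicated SVM package (LIBSVM, SVMLight, scikit-learn) rather than a raw QP solver, but the formulation is exactly the one above.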

SVMs: More than Two Classes SVMs can only handle two-class problems N-class problem: Split the task into N binary tasks and learn N SVMs: Class 1 vs. the rest (classes 2 — N) Class 2 vs. the rest (classes 1, 3 — N) … Class N vs. the rest Finally, pick the class that puts the point farthest into the positive region
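A minimal sketch of this one-vs-rest scheme, assuming scikit-learn's LinearSVC as the two-class learner (any binary SVM would do):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_fit(X, y, classes):
    """Train one binary SVM per class: class c (+1) vs. the rest (-1)."""
    return {c: LinearSVC(C=1.0).fit(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_rest_predict(models, X):
    """Pick the class whose SVM puts the point farthest into its positive region."""
    classes = list(models.keys())
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array([classes[i] for i in np.argmax(scores, axis=1)])
```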

SVMs: Non Linearly-Separable Data What if the data are not linearly separable?

SVMs: Non Linearly-Separable Data Two solutions:  Allow a few points on the wrong side (slack variables)  Map data to a higher dimensional space, and do linear classification there (kernel trick)

Non Linearly-Separable Data Approach 1: Allow a few points on the wrong side (slack variables) “Soft Margin Classification”

denotes +1 denotes -1 What Should We Do?

denotes +1 denotes -1 What Should We Do? Idea 1: Find minimum ||w|| 2 while minimizing number of training set errors Problem: Two things to minimize makes for an ill-defined optimization

denotes +1 denotes -1 What Should We Do? Idea 1.1: Minimize ||w|| 2 + C (# train errors) There’s a serious practical problem with this approach Tradeoff “penalty” parameter

What Should We Do? Idea 1.1: Minimize ||w|| 2 + C (# train errors) There’s a serious practical problem with this approach denotes +1 denotes -1 Tradeoff “penalty” parameter Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, doesn’t distinguish between disastrous errors and near misses) Any other ideas?

denotes +1 denotes -1 What Should We Do? Minimize ||w|| 2 + C (distance of all “misclassified points” to their correct place)

Choosing C, the Penalty Factor It is critical to choose a good value for the constant parameter C: if C is too big, we have a high penalty for non-separable points, may store many support vectors, and may overfit; if C is too small, we may underfit
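C is normally picked by cross-validation; a short sketch assuming scikit-learn, with a made-up toy dataset and an arbitrary grid of C values:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data just to make the snippet runnable; replace with your own training set
rng = np.random.RandomState(0)
X_train = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y_train = np.array([1] * 50 + [-1] * 50)

# Try a range of C values and keep the one with the best cross-validated accuracy
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_["C"])
```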

Learning Maximum Margin with Noise Given a guess of w, b, we can 1. Compute the sum of distances of points to their correct zones 2. Compute the margin width Assume N examples, each (x_k, y_k) where y_k = +1 / -1 w^T x+b=1 w^T x+b=0 w^T x+b=-1 M = 2 / ||w|| What should our optimization criterion be?

Learning Maximum Margin with Noise Given a guess of w, b, we can 1. Compute the sum of distances of points to their correct zones 2. Compute the margin width Assume N examples, each (x_k, y_k) where y_k = +1 / -1. What should our optimization criterion be? Minimize (1/2) w^T w + C Σ_k ε_k, where the ε_k are "slack variables" (the figure labels the slacks of a few misclassified points). How many constraints will we have? N What should they be? y_k (w^T x_k + b) ≥ 1 − ε_k for all k

Learning Maximum Margin with Noise Given a guess of w, b, we can 1. Compute the sum of distances of points to their correct zones 2. Compute the margin width Assume N examples, each (x_k, y_k) where y_k = +1 / -1 (N = # examples, d = # input dimensions). What should our optimization criterion be? Minimize (1/2) w^T w + C Σ_k ε_k. Our original (noiseless data) QP had d+1 variables: w_1, w_2, …, w_d, and b. Our new (noisy data) QP has d+1+N variables: w_1, w_2, …, w_d, b, ε_1, …, ε_N. How many constraints will we have? N What should they be? w^T x_k + b ≥ 1 − ε_k if y_k = +1; w^T x_k + b ≤ -1 + ε_k if y_k = -1

Learning Maximum Margin with Noise Given a guess of w, b, we can 1. Compute the sum of distances of points to their correct zones 2. Compute the margin width Assume N examples, each (x_k, y_k) where y_k = +1 / -1. What should our optimization criterion be? Minimize (1/2) w^T w + C Σ_k ε_k. How many constraints will we have? N What should they be? w^T x_k + b ≥ 1 − ε_k if y_k = +1; w^T x_k + b ≤ -1 + ε_k if y_k = -1. There's a bug in this QP. Can you spot it?

Learning Maximum Margin with Noise Given a guess of w, b, we can 1. Compute the sum of distances of points to their correct zones 2. Compute the margin width Assume N examples, each (x_k, y_k) where y_k = +1 / -1. What should our optimization criterion be? Minimize (1/2) w^T w + C Σ_k ε_k, where the ε_k are "slack variables". How many constraints will we have? 2N What should they be? w^T x_k + b ≥ +1 − ε_k if y_k = +1; w^T x_k + b ≤ -1 + ε_k if y_k = -1; ε_k ≥ 0 for all k
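A sketch of this soft-margin primal QP with slack variables, again assuming cvxopt as the solver and stacking the unknowns as z = [w; b; ε]:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_svm(X, y, C=1.0):
    """Minimize (1/2)||w||^2 + C*sum(eps) s.t. y_k(w^T x_k + b) >= 1 - eps_k, eps_k >= 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, d = X.shape
    n = d + 1 + N                      # variables z = [w (d), b (1), eps (N)]
    P = np.zeros((n, n))
    P[:d, :d] = np.eye(d)
    P[d:, d:] += 1e-8 * np.eye(N + 1)  # tiny ridge so the QP is numerically well-posed
    q = np.hstack([np.zeros(d + 1), C * np.ones(N)])
    # margin constraints: -y_k (x_k^T w + b) - eps_k <= -1
    G_margin = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(N)])
    # slack positivity: -eps_k <= 0
    G_slack = np.hstack([np.zeros((N, d + 1)), -np.eye(N)])
    G = np.vstack([G_margin, G_slack])
    h = np.hstack([-np.ones(N), np.zeros(N)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d], z[d + 1:]      # w, b, slacks
```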

An Equivalent QP Maximize Σ_k α_k − (1/2) Σ_k Σ_l α_k α_l y_k y_l (x_k^T x_l) subject to these constraints: 0 ≤ α_k ≤ C for all k and Σ_k α_k y_k = 0. Then define: w = Σ_k α_k y_k x_k, and recover b from any support vector x_k with 0 < α_k < C via b = y_k − w^T x_k. Then classify with: f(x, w, b) = sign(w^T x + b). N examples, each (x_k, y_k) where y_k = +1 / -1

An Equivalent QP Maximize the same dual objective, Σ_k α_k − (1/2) Σ_k Σ_l α_k α_l y_k y_l (x_k^T x_l), subject to the same constraints (0 ≤ α_k ≤ C, Σ_k α_k y_k = 0). Then define w = Σ_k α_k y_k x_k, recover b as above, and classify with f(x, w, b) = sign(w^T x + b). Data points with α_k > 0 will be the support vectors, so the sum defining w only needs to be over the support vectors

An Equivalent QP Maximize the same dual objective subject to the same constraints; define w and b, and classify, as above. Data points with α_k > 0 will be the support vectors, so this sum only needs to be over the support vectors. Why did I tell you about this equivalent QP? It's a formulation that QP packages can optimize more quickly, and because of further jaw-dropping developments you're about to learn

Non Linearly-Separable Data Approach 2: Map data to a higher dimensional space, and then do linear classification there (called the kernel trick)

Suppose we’re in 1 Dimension What would SVMs do with this data? x=0

Suppose we’re in 1 Dimension Positive “plane” Negative “plane” x=0

Harder 1D Dataset: Not Linearly-Separable What can be done about this? x=0

Harder 1D Dataset The Kernel Trick: Preprocess the data, mapping x into a higher dimensional space, z = Φ(x). For example: Φ(x) = (x, x²). Here, Φ maps data from 1D to 2D. In general, map from the d-dimensional input space to a k-dimensional z space

Harder 1D Dataset x=0 The Kernel Trick: Preprocess the data, mapping x into a higher dimensional space, z = Φ(x) w T Φ(x) + b = +1 The data is linearly separable in the new space, so use a linear SVM in the new space

Another Example Project examples into some higher dimensional space where the data is linearly separable, defined by z = Φ(x)

Algorithm 1.Pick a Φ function 2.Map each training example, x, into the new higher-dimensional space using z = Φ(x) 3.Solve the optimization problem using the nonlinearly transformed training examples, z, to obtain a Linear SVM (with or without using the slack variable formulation) defined by w and b 4.Classify a new test example, e, by computing: sign(w T · Φ(e) + b)
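A small sketch of these four steps for the 1D example above, assuming the mapping Φ(x) = (x, x²) and scikit-learn's LinearSVC as the linear SVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

def phi(x):
    """Map 1D inputs to 2D feature space: z = (x, x^2)."""
    x = np.asarray(x, dtype=float).ravel()
    return np.column_stack([x, x ** 2])

# Toy 1D data that is not linearly separable on the line: class +1 at the ends, -1 in the middle
x = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

clf = LinearSVC(C=10.0).fit(phi(x), y)         # steps 1-3: pick phi, map, solve the linear SVM
print(clf.predict(phi(np.array([0.2, 3.5]))))  # step 4: classify new examples e via phi(e); expect roughly [-1, +1]
```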

Improving Efficiency The time complexity of the original optimization formulation depends on the dimensionality, k, of z (k >> d). We can convert the optimization problem into an equivalent form, called the Dual Form, with time complexity O(N^3) that depends on N, the number of training examples, not on k. The Dual Form will also allow us to rewrite the mapping function Φ in terms of "kernel functions" instead

Dual Optimization Problem Maximize Σ_k α_k − (1/2) Σ_k Σ_l α_k α_l y_k y_l (x_k^T x_l) subject to these constraints: 0 ≤ α_k ≤ C for all k and Σ_k α_k y_k = 0. Then define: w = Σ_k α_k y_k x_k and recover b from any support vector x_k with 0 < α_k < C via b = y_k − w^T x_k. Then classify with: f(x, w, b) = sign(w^T x + b). N examples: (x_k, y_k) where y_k = +1 / -1

Dual Optimization Problem N examples: (x_k, y_k) where y_k = +1 / -1. Maximize Σ_k α_k − (1/2) Σ_k Σ_l α_k α_l y_k y_l (x_k^T x_l) subject to 0 ≤ α_k ≤ C and Σ_k α_k y_k = 0; then define w and b and classify as above. The α_k are new variables; examples with α_k > 0 will be the support vectors. Note that the training data enter only through x_k^T x_l, the dot product of two examples

Algorithm Compute the N x N matrix Q with entries Q_ij = y_i y_j (x_i^T x_j) between all pairs of training points Solve the dual optimization problem to compute α_i for i = 1, …, N Each non-zero α_i indicates that example x_i is a support vector Compute w and b Then classify a test example x with: f(x) = sign(w^T x + b)
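A sketch of this algorithm for the linear-kernel case, again assuming cvxopt as the QP package; it builds Q, solves for the α_i, and recovers w and b from the support vectors.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_linear(X, y, C=1.0):
    """Solve the dual QP: max sum(a) - 1/2 a^T Q a, with 0 <= a <= C and sum(a_k y_k) = 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i^T x_j
    P, q = matrix(Q), matrix(-np.ones(N))            # minimize (1/2) a^T Q a - sum(a)
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))   # -a <= 0  and  a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A, b_eq = matrix(y[None, :]), matrix(0.0)        # sum_k a_k y_k = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()
    sv = alpha > 1e-6                                # non-zero alphas are the support vectors
    w = (alpha[sv] * y[sv]) @ X[sv]
    margin_sv = sv & (alpha < C - 1e-6)              # assumes at least one point sits exactly on the margin
    b = np.mean(y[margin_sv] - X[margin_sv] @ w)
    return w, b, alpha
```

A test point x is then classified as sign(w·x + b).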

Copyright © 2001, 2003, Andrew W. Moore Dual Optimization Problem (After Mapping) Maximize Σ_k α_k − (1/2) Σ_k Σ_l α_k α_l y_k y_l (Φ(x_k)^T Φ(x_l)) subject to these constraints: 0 ≤ α_k ≤ C for all k and Σ_k α_k y_k = 0. Then define: w = Σ_k α_k y_k Φ(x_k) and recover b from any support vector with 0 < α_k < C. Then classify with: f(x, w, b) = sign(w^T Φ(x) + b). N examples: (x_k, y_k) where y_k = +1 / -1

Copyright © 2001, 2003, Andrew W. Moore Dual Optimization Problem (After Mapping) Maximize Σ_k α_k − (1/2) Σ_k Σ_l α_k α_l y_k y_l (Φ(x_k)^T Φ(x_l)) subject to the same constraints (0 ≤ α_k ≤ C, Σ_k α_k y_k = 0); define w and b, and classify with f(x, w, b) = sign(w^T Φ(x) + b), as before. Note: about N²/2 dot products Φ(x_k)^T Φ(x_l) are needed to compute this matrix. N examples: (x_k, y_k) where y_k = +1 / -1

The dual formulation of the optimization problem depends on the input data only through dot products of the form Φ(x_i)^T · Φ(x_j), where x_i and x_j are two examples. We can compute these dot products efficiently for certain types of Φ, where K(x_i, x_j) = Φ(x_i)^T · Φ(x_j). Example: for an appropriate quadratic Φ, Φ(x_i)^T · Φ(x_j) = (x_i^T x_j)^2 = K(x_i, x_j). Since the data only appear as dot products, we do not need to map the data to the higher dimensional space (using Φ(x)) because we can use the kernel function K instead

Example Suppose we have 5 1D data points: x_1=1, x_2=2, x_3=4, x_4=5, x_5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y_1=1, y_2=1, y_3=-1, y_4=-1, y_5=1. We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)^2, and C = 100. We first find α_i (i = 1, …, 5) by maximizing Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0

Example Using a QP solver, we get α_1=0, α_2=2.5, α_3=0, α_4=7.333, α_5=4.833. The support vectors are {x_2=2, x_4=5, x_5=6}. The discriminant function is f(x) = 2.5(1)(2x+1)^2 + 7.333(-1)(5x+1)^2 + 4.833(1)(6x+1)^2 + b = 0.6667x^2 - 5.333x + b. b is recovered by solving f(2)=1, f(5)=-1, or f(6)=1 (since x_2, x_4, and x_5 lie on the margin), and all give b = 9, so f(x) = 0.6667x^2 - 5.333x + 9

Example (figure) The value of the discriminant function f(x) plotted over x, with the class 1 regions (f(x) > 0) and the class 2 region (f(x) < 0) marked
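This worked example can be reproduced with an off-the-shelf SVM. A sketch assuming scikit-learn's SVC, whose polynomial kernel (gamma·x·x' + coef0)^degree equals (xx' + 1)^2 with the settings below; solver numerics may differ slightly from the numbers quoted above.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

# kernel (gamma * <x, x'> + coef0)^degree = (x x' + 1)^2 with these settings
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0).fit(x, y)

print(clf.support_)      # indices of the support vectors (expect x2, x4, x5)
print(clf.dual_coef_)    # y_i * alpha_i for the support vectors (expect ~[-7.333, 2.5, 4.833] in some order)
print(clf.intercept_)    # b (expect ~9)
print(clf.decision_function(np.array([[2.0], [5.0], [6.0]])))  # expect ~[1, -1, 1]
```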

Kernel Functions A kernel function is a function that can be applied to pairs of input data points to evaluate dot products in some corresponding (possibly infinite dimensional) feature space

What's Special about a Kernel? Say one 2D example is s = (s_1, s_2). We decide to use a particular mapping into 6D space: Φ(s)^T = (s_1^2, s_2^2, √2 s_1 s_2, √2 s_1, √2 s_2, 1). Let another example be t = (t_1, t_2). Then Φ(s)^T Φ(t) = s_1^2 t_1^2 + s_2^2 t_2^2 + 2 s_1 s_2 t_1 t_2 + 2 s_1 t_1 + 2 s_2 t_2 + 1 = (s_1 t_1 + s_2 t_2 + 1)^2 = (s^T t + 1)^2. So, define the kernel function to be K(s, t) = (s^T t + 1)^2 = Φ(s)^T Φ(t). We save computation by using K
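A quick numeric check of this identity (a sketch; the two test points are arbitrary):

```python
import numpy as np

def phi(s):
    """6D feature map whose dot product equals the quadratic kernel (s^T t + 1)^2."""
    s1, s2 = s
    return np.array([s1**2, s2**2, np.sqrt(2)*s1*s2, np.sqrt(2)*s1, np.sqrt(2)*s2, 1.0])

def K(s, t):
    return (np.dot(s, t) + 1.0) ** 2

s, t = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(s), phi(t)))   # 4.0
print(K(s, t))                  # (1*3 + 2*(-1) + 1)^2 = 4.0
```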

Some Commonly Used Kernels Linear kernel: K(x_i, x_j) = x_i^T x_j Quadratic kernel: K(x_i, x_j) = (x_i^T x_j + 1)^2 Polynomial of degree d kernel: K(x_i, x_j) = (x_i^T x_j + 1)^d Radial Basis Function kernel: K(x_i, x_j) = exp(−||x_i − x_j||^2 / σ^2) Many possible kernels; picking a good one is tricky. Hacking with SVMs: create various kernels, hope their implicit space Φ is meaningful, plug them into the SVM, and pick the one with good classification accuracy. Kernels are usually combined with slack variables because there is no guarantee of linear separability in the new space
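Each of these kernels is a one-liner; a NumPy sketch (σ and the degree d are free parameters you must choose):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1.0) ** d          # d = 2 gives the quadratic kernel

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)
```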

Algorithm Compute the N x N matrix Q with entries Q_ij = y_i y_j K(x_i, x_j) between all pairs of training points Solve the dual optimization problem to compute α_i for i = 1, …, N Each non-zero α_i indicates that example x_i is a support vector Compute b Then classify a test example x with: f(x) = sign(Σ_i α_i y_i K(x_i, x) + b) (with a nonlinear kernel, w = Σ_i α_i y_i Φ(x_i) is never formed explicitly)
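Most libraries also accept the kernel directly, either as a function or as a precomputed Gram matrix. A sketch assuming scikit-learn's precomputed-kernel interface, with a made-up toy dataset and an RBF kernel:

```python
import numpy as np
from sklearn.svm import SVC

def gram(A, B, sigma=1.0):
    """RBF Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # a circular class boundary

clf = SVC(kernel="precomputed", C=10.0).fit(gram(X, X), y)
X_test = rng.randn(5, 2)
print(clf.predict(gram(X_test, X)))                      # kernel between test and training points
```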

Common SVM Kernel Functions z_k = (polynomial terms of x_k of degree 1 to d). For example, when d = 2 and m = 2, K(x, y) = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2. z_k = (radial basis functions of x_k). z_k = (sigmoid functions of x_k)

Some SVM Kernel Functions Polynomial Kernel Function: K(x_i, x_j) = (x_i^T x_j + 1)^d. Beyond polynomials there are other high-dimensional basis functions that can be made practical by finding the right kernel function. Radial-Basis-style Kernel Function: K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)). Neural-Net-style (sigmoid) Kernel Function: K(x_i, x_j) = tanh(κ x_i^T x_j − δ). Here σ, κ, and δ are magic parameters that must be chosen by a model selection method

Applications of SVMs Bioinformatics Machine Vision Text Categorization Ranking (e.g., Google searches) Handwritten Character Recognition Time series analysis Lots of very successful applications!!!

Handwritten Digit Recognition

Example Application: The Federalist Papers Dispute Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the U.S. Constitution. The papers consisted of short essays, 900 to 3500 words in length. The authorship of 12 of those papers has been in dispute (Madison or Hamilton); these papers are referred to as the disputed Federalist papers

Description of the Data For every paper: Machine readable text was created using a scanner Computed relative frequencies of 70 words that Mosteller-Wallace identified as good candidates for author-attribution studies Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies The dataset consists of 118 papers: 50 Madison papers 56 Hamilton papers 12 disputed papers

Function Words Based on Relative Frequencies

Feature Selection for Classifying the Disputed Federalist Papers Apply the SVM Successive Linearization Algorithm for feature selection to: Train on the 106 Federalist papers with known authors Find a classification hyperplane that uses as few words as possible Use the hyperplane to classify the 12 disputed papers
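The slides use Fung and Mangasarian's Successive Linearization Algorithm for this; purely as a loose stand-in (not the authors' method), here is a sketch of another common way to get a sparse, few-word linear classifier, an L1-penalized linear SVM in scikit-learn. The data below are random placeholders, not the actual Federalist word-frequency matrix.

```python
import numpy as np
from sklearn.svm import LinearSVC

# X: 106 x 70 matrix of relative word frequencies for the papers of known authorship
# y: +1 for Madison, -1 for Hamilton (placeholder random data here)
rng = np.random.RandomState(0)
X = rng.rand(106, 70)
y = np.where(rng.rand(106) < 0.5, 1, -1)

# The L1 penalty drives many word weights to exactly zero, yielding a few-word hyperplane
clf = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_.ravel())
print("words used by the hyperplane:", selected)

# X_disputed: 12 x 70 matrix for the disputed papers (placeholder)
X_disputed = rng.rand(12, 70)
print(clf.predict(X_disputed))   # +1 = Madison side, -1 = Hamilton side
```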

Hyperplane Classifier Using 3 Words A hyperplane was found that depends on only three word frequencies: "to" (with coefficient 0.537), "upon", and "would". All disputed papers ended up on the Madison side of the plane

Results: 3D Plot of Hyperplane

SVM Applet

Summary Learning linear functions: pick the separating plane that maximizes the margin; the separating plane is defined in terms of the support vectors only. Learning non-linear functions: project examples into a higher dimensional space and use kernel functions for efficiency. Generally avoids the overfitting problem. Global optimization method; no local optima. Can be expensive to apply, especially for multi-class problems. Biggest drawback: the choice of kernel function. There is no "set-in-stone" theory for choosing a kernel function for any given problem. Once a kernel function is chosen, there is only ONE modifiable parameter, the error penalty C

Software A list of SVM implementations can be found online. Some implementations (such as LIBSVM) can handle multi-class classification. SVMLight is one of the earliest and most frequently used implementations of SVMs. Several Matlab toolboxes for SVMs are available