An Introduction to Support Vector Machines
Outline
- What is a good decision boundary for a binary classification problem?
- From minimizing the misclassification error to maximizing the margin
- Two classes, linearly inseparable: how to deal with some noisy data
- How to make the SVM non-linear: kernels
- Conclusion
Two-Class Problem: Linearly Separable Case
- The problem with minimizing the misclassification error: many decision boundaries can separate these two classes without misclassification. Which one should we choose?
- The perceptron learning rule can be used to find some decision boundary between class 1 and class 2, but it may return any of them
[Figure: several candidate separating lines between the points of class 1 and class 2]
Maximizing the Margin
- The decision boundary should be as far away from the data of both classes as possible
- We should maximize the margin m
[Figure: separating hyperplane with margin m between class 1 and class 2]
The Optimization Problem
- Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
- The decision boundary w^T x + b = 0 should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i
- Maximizing the margin then becomes a constrained optimization problem:
    minimize (1/2) ||w||^2   subject to   y_i (w^T x_i + b) ≥ 1,  i = 1, ..., n
The Dual Problem
- We can transform the problem to its dual:
    maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
    subject to   α_i ≥ 0,  Σ_i α_i y_i = 0
- This is a quadratic programming (QP) problem: the global maximum over the α_i can always be found
- w can be recovered by w = Σ_i α_i y_i x_i
- Let x(1) and x(-1) be two support vectors, one from each class. Then b = -(1/2)(w^T x(1) + w^T x(-1))
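As a concrete illustration, here is a minimal sketch of this training procedure, assuming scikit-learn is available (the toy data and variable names are mine, not from the slides). SVC solves the dual QP internally and exposes α_i y_i for the support vectors through dual_coef_, from which w and b can be recovered:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class -1 around (0, 0), class +1 around (4, 4)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin formulation above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only,
# so w = sum_i alpha_i y_i x_i reduces to a sum over the support vectors
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("w =", w.ravel(), "b =", b)   # decision boundary: w^T x + b = 0
print("number of support vectors:", clf.support_vectors_.shape[0])
```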
A Geometrical Interpretation
[Figure: points of class 1 and class 2 with their α values; only the points on the margin get non-zero weights, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, while all other points have α_i = 0]
- So changing the interior points has no effect on the decision boundary
Characteristics of the Solution
- Many of the α_i are zero
  - w is a linear combination of a small number of data points: a sparse representation
- The x_i with non-zero α_i are called support vectors (SV)
- The decision boundary is determined only by the SVs
  - Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}
- For testing with a new datum z:
  - Compute w^T z + b = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
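A short continuation of the sketch above (clf is the model trained in the previous block), showing that classifying a new point really only touches the support vectors:

```python
# Classify a new point z using only the support vectors
z = np.array([2.0, 2.0])

# sum_j alpha_{t_j} y_{t_j} (x_{t_j}^T z) + b
score = clf.dual_coef_ @ (clf.support_vectors_ @ z) + clf.intercept_
label = 1 if score.item() > 0 else -1
print("score =", score.item(), "-> class", label)

# Sanity check against the library's own decision function
assert np.allclose(score, clf.decision_function(z.reshape(1, -1)))
```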
Some Notes
- There are theoretical upper bounds on the error of an SVM on unseen data
  - The larger the margin, the smaller the bound
  - The smaller the number of SVs, the smaller the bound
- Note that in both training and testing, the data are referenced only through inner products, x^T y
  - This is important for generalizing to the non-linear case
What About the Non-Linearly Separable Case?
- We allow an "error" ξ_i in classification to tolerate some noisy data
[Figure: overlapping classes, with slack ξ_i marking points on the wrong side of the margin]
Soft Margin Hyperplane
- Define ξ_i = 0 if there is no error for x_i; the ξ_i are just "slack variables" in optimization theory
- We want to minimize (1/2) ||w||^2 + C Σ_i ξ_i
  - C: tradeoff parameter between error and margin
- The optimization problem becomes
    minimize (1/2) ||w||^2 + C Σ_i ξ_i   subject to   y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0
The Optimization Problem
- The dual of the problem is
    maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
    subject to   0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
- w is recovered as before: w = Σ_i α_i y_i x_i
- The only difference from the linearly separable case is that there is an upper bound C on the α_i
- Once again, a QP solver can be used to find the α_i
- Note also that everything is still done through inner products
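The following sketch (again with scikit-learn and made-up overlapping data) shows how the tradeoff parameter C behaves: a small C tolerates more slack and keeps a wide margin with many support vectors, while a large C penalizes errors heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes: not linearly separable
X = np.vstack([rng.normal(0, 1.5, (50, 2)), rng.normal(3, 1.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)  # margin width is 2/||w||
    print(f"C={C:<7} support vectors: {clf.n_support_.sum():3d}   margin: {margin:.3f}")
```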
Extension to Non-Linear Decision Boundaries
- In most situations, the decision boundary we are looking for should NOT be a straight line
[Figure: a transformation φ(·) maps points from the input space to a feature space where a linear boundary suffices]
Extension to Non-Linear Decision Boundaries
- Key idea: use a function φ(x) to transform x_i to a higher-dimensional space, to "make life easier"
  - Input space: the space the x_i live in
  - Feature space: the space of the φ(x_i) after the transformation
- Search for a hyperplane in the feature space that maximizes the margin; this hyperplane corresponds to a curve in the input space
- Why transform?
  - We still like the idea of maximizing the margin
  - The classifier becomes more powerful and more flexible
  - Example (XOR): from x_1, x_2 we transform to x_1^2, x_2^2, x_1 x_2, which makes XOR linearly separable (see the sketch below)
- The transformation can also be viewed as feature extraction from the feature vector x, but now we extract more features than x originally has
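A minimal sketch of the XOR example (toy data; the explicit transform is the one named on the slide): XOR is not linearly separable in (x_1, x_2), but after mapping to (x_1^2, x_2^2, x_1 x_2) the product coordinate separates the two classes:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])  # XOR labels

# In the input space, a linear SVM cannot fit XOR (accuracy < 1)
linear = SVC(kernel="linear", C=1e6).fit(X, y)
print("input-space accuracy:", (linear.predict(X) == y).mean())

# After the transform, the x1*x2 coordinate alone separates the classes
Phi = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])
lifted = SVC(kernel="linear", C=1e6).fit(Phi, y)
print("feature-space accuracy:", (lifted.predict(Phi) == y).mean())  # 1.0
```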
Transformation and Kernel
Kernel: Efficient Computation
- Define the kernel function K(x, y) as K(x, y) = φ(x)^T φ(y)
- Consider the following transformation of a 2-D point x = (x_1, x_2):
    φ(x) = (x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2, 1)
  Then φ(x)^T φ(y) = (1 + x^T y)^2, so the inner product in the 6-D feature space can be computed in the input space, without carrying out the transformation
- In practice we don't need to worry about the transformation function φ(x); what we have to do is select a good kernel for our problem
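A quick numerical check of this identity (the 2-D example above, written out explicitly): the kernel evaluated in the input space matches the inner product in the lifted space:

```python
import numpy as np

def phi(x):
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2, 1)
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([x1**2, s * x1 * x2, x2**2, s * x1, s * x2, 1.0])

def K(x, y):
    # The same quantity computed directly in the input space
    return (1.0 + x @ y) ** 2

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])
print(K(x, y), phi(x) @ phi(y))  # the same value, computed two ways
assert np.isclose(K(x, y), phi(x) @ phi(y))
```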
Examples of Kernel Functions
- Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
- Radial basis function (RBF) kernel with width σ: K(x, y) = exp(-||x - y||^2 / (2σ^2))
  - Closely related to radial basis function neural networks
- Research on different kernel functions for different applications is very active
- Despite violating Mercer's condition, the sigmoid kernel K(x, y) = tanh(κ x^T y + θ) can still work in practice
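To see a non-linear kernel in action, here is a short sketch (toy concentric-ring data, assumed scikit-learn) where the RBF kernel succeeds and a linear boundary cannot:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line separates them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)  # gamma = 1/(2 sigma^2)

print("linear kernel accuracy:", linear.score(X, y))  # near chance
print("RBF kernel accuracy:   ", rbf.score(X, y))     # near 1.0
```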
Summary: Steps for Classification
- Prepare the data matrix
- Select the kernel function to use
- Select the parameters of the kernel function and the value of C
  - You can use the values suggested by the SVM software, or you can set apart a validation set to determine their values
- Execute the training algorithm to obtain the α_i
- Unseen data can then be classified using the α_i and the support vectors
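The whole recipe fits in a few lines; the sketch below (an assumed scikit-learn workflow with synthetic data) tunes C and the RBF width by cross-validation instead of a fixed validation split, then classifies held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Step 1: prepare the data matrix
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 2-3: pick a kernel and search over C and the kernel parameter
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
)

# Step 4: train (the alpha_i are found internally by a QP-style solver)
grid.fit(X_train, y_train)

# Step 5: classify unseen data
print("best parameters:", grid.best_params_)
print("test accuracy:  ", grid.best_estimator_.score(X_test, y_test))
```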
Classification Result of SVM
Conclusion
- SVMs are among the most popular tools for numeric binary classification
- Key ideas of the SVM:
  - Maximizing the margin leads to a "good" classifier
  - Transformation to a higher-dimensional space makes the classifier more flexible
  - The kernel trick makes the computation efficient
- Weakness of the SVM: it needs a "good" kernel function
Resources
- http://www.kernel-machines.org/
- http://www.support-vector.net/
- http://www.support-vector.net/icml-tutorial.pdf
- http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
- http://www.clopinet.com/isabelle/Projects/SVM/applist.html