Support Vector Machines 2. Outline: SVM; the kernel trick; selection of models; VC dimension; SVR; other applications.
Large margin classifier: In the optimization problem, each point i has a slack variable ξ_i. ξ_i=0 when the point is on the correct side of the margin; 0<ξ_i<1 when the point is in the margin but still on the correct side of the hyperplane; ξ_i>1 when the point passes the hyperplane to the wrong side.
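For reference, a standard soft-margin formulation that is consistent with the description of ξ above can be written as follows. It assumes labels y_i in {-1, +1}, a separating hyperplane x^T β + β_0 = 0, and a cost parameter named C here (the name is an assumption):

```latex
\min_{\beta,\,\beta_0,\,\xi}\ \frac{1}{2}\lVert\beta\rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad\text{subject to}\quad
y_i\,(x_i^{T}\beta+\beta_0)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\qquad i=1,\dots,N.
```

A larger C penalizes margin violations more heavily, which gives a narrower margin.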
Large margin classifier: The solution for β has the form β = Σ_i α_i y_i x_i, with non-zero coefficients α_i only for those points i for which the constraint y_i(x_i^T β + β_0) ≥ 1 − ξ_i is met with equality. These are called “support vectors”. Some lie on the edge of the margin (ξ_i = 0); the remainder have ξ_i > 0 and are on the wrong side of the margin.
SVM: Consider a basis expansion h(x) of the features. The solution of the large margin classifier in h() space is f(x) = h(x)^T β + β_0 = Σ_i α_i y_i ⟨h(x), h(x_i)⟩ + β_0.
SVM: h(x) is involved ONLY in the form of inner products! So as long as we define the kernel function K(x, x') = ⟨h(x), h(x')⟩, which computes the inner product in the transformed space, we don’t need to know what h(x) itself is! This is the “kernel trick”.
SVM: Recall that α_i=0 for non-support vectors, so f(x) depends only on the support vectors. The decision is essentially a weighted sum of the similarities between the object and each support vector.
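The weighted-sum view of the decision function can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the support vectors, the coefficients `alphas` and the intercept `beta0` are hypothetical values chosen so that f equals +1 and -1 at the two support vectors, and the plain inner product stands in for the kernel.

```python
def linear_kernel(x, z):
    """Plain inner product <x, z>; any other kernel function could be swapped in."""
    return sum(a * b for a, b in zip(x, z))


def decision_function(x, support_vectors, alphas, labels, beta0, kernel=linear_kernel):
    """f(x) = beta0 + sum_i alpha_i * y_i * K(x, x_i), summed over support vectors only."""
    return beta0 + sum(
        alpha * y * kernel(x, sv)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    )


# Hypothetical support vectors, coefficients and intercept (illustrative values).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]
beta0 = 0.0

f_pos = decision_function((2.0, 1.0), support_vectors, alphas, labels, beta0)
f_neg = decision_function((-1.0, -3.0), support_vectors, alphas, labels, beta0)
print(f_pos, f_neg)  # the sign of f gives the predicted class
```

Points that are not support vectors never enter the sum, so prediction cost grows with the number of support vectors rather than with the size of the training set.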
More on kernels: Most kernels don’t correspond to explicit basis functions, but there are exceptions. Example: the degree 2 polynomial kernel K(x, x') = (⟨x, x'⟩ + κ)^2 with κ=0 corresponds to an explicit degree 2 polynomial basis. Other degree 2 polynomial kernels with non-zero κ also correspond to explicit degree 2 polynomial bases.
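A quick numerical check makes the correspondence concrete. The sketch below assumes the convention K(x, z) = (⟨x, z⟩ + κ)^2 and 2-D inputs, for which h(x) = (x1^2, √2·x1·x2, x2^2) is one explicit basis that reproduces the κ=0 kernel; the function names are illustrative.

```python
import math


def poly2_kernel(x, z, kappa=0.0):
    """Degree 2 polynomial kernel (<x, z> + kappa)^2."""
    return (sum(a * b for a, b in zip(x, z)) + kappa) ** 2


def h(x):
    """Explicit degree 2 basis for 2-D input that reproduces the kappa=0 kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)                        # computed in the original space
explicit_value = sum(a * b for a, b in zip(h(x), h(z)))  # computed in h() space
print(kernel_value, explicit_value)  # equal up to floating-point rounding
```

The kernel needs one inner product in 2 dimensions, while the explicit route needs an inner product in 3 dimensions; the gap widens quickly with the input dimension and the polynomial degree.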
More on kernels: The (Gaussian) radial basis function kernel, or RBF kernel.
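As a sketch, one common parametrization of the RBF kernel is K(x, z) = exp(-γ‖x − z‖²). The width parameter `gamma` and the helper `gram_matrix` are assumptions made for this illustration; other texts write the same kernel with 1/(2σ²) in place of γ.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2); gamma > 0 controls how fast similarity decays."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma=1.0):
    """Matrix of pairwise kernel values K[i][j] = K(x_i, x_j)."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(points, gamma=0.5)
for row in K:
    print(row)
```

Every point has similarity 1 with itself, and the similarity shrinks toward 0 as two points move apart, which is why the kernel value can be read as a local similarity score.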
More on kernels: CPD: conditional positive definite. “SVM model using a sigmoid kernel function is equivalent to a two-layer, perceptron neural network”. Lin and Lin: “A Study on Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SMO-type Methods”.
More on kernels https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
SVM: Using the kernel trick brings the feature space to a very high dimension, with many parameters. Why doesn’t the method suffer from the curse of dimensionality or overfitting? Vapnik argues that the number of parameters alone, or dimensions alone, is not a true reflection of how flexible the classifier is. Compare two functions in 1-dimension: f(x)=α+βx and g(x)=sin(αx).
SVM: g(x)=sin(αx) is a really flexible classifier in 1-dimension, although it has only one parameter. f(x)=α+βx is guaranteed to separate only two points under every labeling, although it has one more parameter.
SVM: Vapnik-Chervonenkis dimension: The VC dimension of a class of classifiers {f(x,α)} is defined to be the largest number of points that can be shattered by members of {f(x,α)}. A set of points is said to be shattered by a class of functions if, no matter how the class labels are assigned, a member of the class can separate them perfectly.
SVM: A linear classifier is rigid. A hyperplane classifier has a VC dimension of d+1, where d is the feature dimension.
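The d+1 statement can be checked by brute force for d=1. The sketch below relies on one observation: a 1-D linear classifier sign(a + b·x) is a threshold, so it can realize a labeling of distinct points exactly when the labels, read in order of x, change sign at most once. The function names are illustrative.

```python
from itertools import product


def realizable_by_line(xs, labels):
    """True if sign(a + b*x) can produce these labels on distinct 1-D points xs.

    A 1-D linear classifier is a threshold, so the labels, read in order of x,
    may change sign at most once.
    """
    ordered = [y for _, y in sorted(zip(xs, labels))]
    changes = sum(1 for u, v in zip(ordered, ordered[1:]) if u != v)
    return changes <= 1


def is_shattered(xs):
    """True if every +/-1 labeling of xs is realizable by a 1-D linear classifier."""
    return all(
        realizable_by_line(xs, labels)
        for labels in product((-1, 1), repeat=len(xs))
    )


print(is_shattered([0.0, 1.0]))       # two points
print(is_shattered([0.0, 1.0, 2.0]))  # three points
```

Two points can be shattered, but three cannot because the labeling (+, -, +) needs two sign changes, which matches a VC dimension of d+1 = 2.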
SVM: The class sin(αx) has infinite VC dimension: by appropriate choice of α, any number of points can be shattered. The VC dimension of the nearest neighbor classifier is infinity; you can always get perfect classification on the training data. For many classifiers, it is difficult to compute the VC dimension exactly, but this doesn’t diminish its value for theoretical arguments. The VC dimension measures the complexity of a class of functions by assessing how wiggly its members can be.
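The shattering claim for sin(αx) can be verified numerically with a well-known construction. The sketch assumes the points are placed at x_i = 10^(-i) and the classifier is sign(sin(αx)); for labels y_i in {-1, +1}, choosing α = π(1 + Σ_i (1 − y_i)/2 · 10^i) reproduces any labeling. The function names and the choice of 4 points are illustrative.

```python
import math
from itertools import product


def shattering_alpha(labels):
    """Alpha that makes sign(sin(alpha * x_i)) equal labels[i-1] at x_i = 10**(-i)."""
    return math.pi * (
        1 + sum((1 - y) // 2 * 10 ** i for i, y in enumerate(labels, start=1))
    )


def classify(alpha, x):
    """Classifier sign(sin(alpha * x)) with outputs in {-1, +1}."""
    return 1 if math.sin(alpha * x) > 0 else -1


n_points = 4
xs = [10.0 ** (-i) for i in range(1, n_points + 1)]

# Try every one of the 2**n_points labelings with its own single parameter alpha.
all_ok = all(
    [classify(shattering_alpha(labels), x) for x in xs] == list(labels)
    for labels in product((-1, 1), repeat=n_points)
)
print(all_ok)
```

A single parameter is enough because a large α makes the sine oscillate fast enough to flip sign between any two of the chosen points. For much larger `n_points` the check eventually runs into floating-point precision, not a limit of the construction.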
SVM: For SVM, the VC dimension of the maximum-margin hyperplane does not necessarily depend on the number of features. The VC dimension is lower with a larger margin.
SVM: Strengths of SVM: flexibility; scales well for high-dimensional data; can control the complexity and error trade-off explicitly; as long as a kernel can be defined, non-traditional (non-vector) data, like strings and trees, can be input. Weakness: how to choose a good kernel? (A low degree polynomial or radial basis function can be a good start.)
K-fold cross-validation: The goal is to directly estimate the extra-sample error (error on an independent test set). Split the data into K roughly equal-sized parts. For each of the K parts, fit the model with the other K-1 parts, and calculate the prediction error on the part that is left out.
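Cross-validation: The CV estimate of the prediction error combines the K estimates: CV(α) = (1/N) Σ_i L(y_i, f^{−k(i)}(x_i, α)), where L is the loss and f^{−k(i)} is the model fitted without the part k(i) that contains observation i. α is the tuning parameter (different models, model parameters). Find the α that minimizes CV(α). Finally, fit the chosen model on all the data.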
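The procedure can be sketched with the standard library only. Everything specific here is an assumption made for illustration: the model is a 1-D k-nearest-neighbor regressor, the tuning parameter α is its number of neighbors, the loss is squared error, and the data are synthetic.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal-sized folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Average y over the n_neighbors training points closest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def cv_error(data, n_neighbors, k=5):
    """CV(alpha): mean squared error over all held-out points, alpha = n_neighbors."""
    total = 0.0
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        # Fit on the other k-1 parts, predict on the part that is left out.
        train = [pair for i, pair in enumerate(data) if i not in held]
        total += sum(
            (data[i][1] - knn_predict(train, data[i][0], n_neighbors)) ** 2
            for i in held_out
        )
    return total / len(data)


# Synthetic data (an assumption for illustration): y = x^2 plus Gaussian noise.
rng = random.Random(1)
data = [(x, x * x + rng.gauss(0.0, 0.1)) for x in (i / 20 for i in range(-20, 21))]

candidates = [1, 3, 5, 9, 15]
scores = {a: cv_error(data, a) for a in candidates}
best = min(scores, key=scores.get)  # the alpha that minimizes CV(alpha)
print(best, scores[best])
```

The same folds are reused for every candidate value, so differences in CV(α) reflect the tuning parameter and not the random split. The final model would then use `best` with all of `data` as training points.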
Cross-validation: Leave-one-out cross-validation (K=N) is approximately unbiased, yet it has high variance. With K=5 or 10, CV has lower variance but more bias. If the learning curve has a large slope at the training set size, a 5-fold or 10-fold CV can overestimate the prediction error substantially.
Support Vector Regression: Find a linear function f(x) = ⟨w, x⟩ + b that approximates all pairs (x_i, y_i) with ε precision. Smola & Schölkopf, “A Tutorial on Support Vector Regression”.
Support Vector Regression: Similar to classification, slack variables allow some points to deviate from the function by more than ε.
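For reference, the soft-constraint problem in the notation of the Smola & Schölkopf tutorial can be written as follows, with a linear function f(x) = ⟨w, x⟩ + b, slack variables ξ_i and ξ_i^* for deviations above and below the ε-tube, and a cost parameter C:

```latex
\min_{w,\,b,\,\xi,\,\xi^{*}}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\left(\xi_i+\xi_i^{*}\right)
\quad\text{subject to}\quad
\begin{cases}
y_i-\langle w,x_i\rangle-b \ \le\ \varepsilon+\xi_i,\\
\langle w,x_i\rangle+b-y_i \ \le\ \varepsilon+\xi_i^{*},\\
\xi_i,\ \xi_i^{*}\ \ge\ 0.
\end{cases}
```

Points inside the ε-tube have both slack variables equal to zero and contribute nothing to the penalty term.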
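Support Vector Regression: Similar to the classification case, the solution is a linear combination of support vectors: w = Σ_i (α_i − α_i^*) x_i, where α_i and α_i^* are the dual coefficients. Thus the prediction only involves the features through inner products: f(x) = Σ_i (α_i − α_i^*) ⟨x_i, x⟩ + b. The kernel trick is applicable.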
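Two pieces of SVR are easy to sketch in code: the ε-insensitive loss that defines the tube, and the prediction as a kernel expansion over support vectors. The values below are hypothetical and are not the result of solving any optimization problem; `coefs` stands for the differences (α_i − α_i^*), and the linear kernel could be replaced by any other kernel function.

```python
def eps_insensitive_loss(residual, eps):
    """|r|_eps = max(0, |r| - eps): deviations inside the eps-tube cost nothing."""
    return max(0.0, abs(residual) - eps)


def linear_kernel(x, z):
    """Plain inner product <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def svr_predict(x, support_vectors, coefs, b, kernel=linear_kernel):
    """f(x) = b + sum_i coef_i * K(x_i, x), where coef_i stands for (alpha_i - alpha_i*)."""
    return b + sum(c * kernel(sv, x) for c, sv in zip(coefs, support_vectors))


# Hypothetical support vectors, coefficients and intercept (illustrative values).
support_vectors = [(1.0,), (3.0,)]
coefs = [-0.5, 0.5]
b = 0.25

prediction = svr_predict((2.0,), support_vectors, coefs, b)
print(prediction)
print(eps_insensitive_loss(0.05, eps=0.1), eps_insensitive_loss(-0.3, eps=0.1))
```

With these values the expansion collapses to the linear function f(x) = x + 0.25, which shows that the kernel form and the explicit weight vector w give the same prediction in the linear case.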
Support Vector Regression PLoS ONE 8(11): e79970.
Kernel trick in other areas: Capturing non-linear effects without explicitly specifying a particular non-linear functional form. Statistical testing of combined effects of variables; clustering; quantile regression; dimensionality reduction; ……