Pattern Recognition and Machine Learning Chapter 7: Sparse Kernel Machines
Outline The problem: finding a sparse decision (and regression) machine that uses kernels. The solution: support vector machines (SVMs) and relevance vector machines (RVMs). The core ideas behind the solutions. The mathematical details.
The problem (1) The methods introduced in Chapters 3 and 4 take into account all data points in the training set (cumbersome) and do not take advantage of kernel methods, so the basis functions have to be specified explicitly. Examples: least squares and logistic regression.
The problem (2) Kernel methods require evaluation of the kernel function k(x_n, x_m) for all pairs of training points, which is cumbersome.
The solution (1) Support vector machines (SVMs) are kernel machines that compute a decision boundary making sparse use of data points
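A minimal sketch of this sparsity (assuming scikit-learn and a synthetic two-class dataset; the parameter values are only for illustration):

```python
# Fit a kernel SVM and inspect how few training points it actually uses.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, t = make_blobs(n_samples=200, centers=2, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, t)

# Only the support vectors enter the decision function; every other point
# has Lagrange multiplier a_n = 0 and could be discarded after training.
print("training points:", len(X))
print("support vectors:", clf.support_vectors_.shape[0])
```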
The solution (2) Relevance vector machines (RVMs) are kernel machines that compute a posterior class probability making sparse use of data points
The solution (3) SVMs as well as RVMs can also be used for regression, yielding even sparser solutions.
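A similar sketch for regression (again assuming scikit-learn, with a noisy sine curve and illustrative parameter values):

```python
# Epsilon-insensitive regression: points inside the tube do not become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 4 * np.pi, 200)[:, None]
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

reg = SVR(kernel="rbf", C=1.0, epsilon=0.2)
reg.fit(X, t)

# Only points on or outside the tube carry nonzero dual coefficients.
print("support vectors:", len(reg.support_), "of", len(X))
```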
SVM: The core idea (1) The class separator that maximizes the margin, i.e. the distance between itself and the nearest data points, is expected to have the smallest generalization error.
SVM: The core idea (2) In input space:
SVM: The core idea (3) For regression:
RVM: The core idea (1) Exclude those basis vectors whose presence reduces the probability (evidence) of the observed data; the data points associated with the remaining basis functions are the relevance vectors.
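A rough numerical sketch of this idea (not Tipping's full algorithm; it assumes a synthetic sinc dataset, Gaussian basis functions centred on the training inputs, and clipping added purely for numerical stability): iterating the evidence re-estimation updates drives most precisions α_i towards infinity, which prunes the corresponding basis vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = np.linspace(-5, 5, N)
t = np.sinc(x) + 0.05 * rng.standard_normal(N)

# Design matrix: one Gaussian basis function centred on each training point.
Phi = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

alpha = np.ones(N)   # one precision hyperparameter per basis function
beta = 100.0         # noise precision
for _ in range(50):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)   # posterior covariance
    m = beta * Sigma @ Phi.T @ t                                 # posterior mean
    gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-10, None)   # effective parameters
    alpha = np.clip(gamma / (m ** 2 + 1e-12), 1e-6, 1e12)        # evidence re-estimation
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)

relevant = alpha < 1e3   # basis functions whose weights are not pruned
print("relevance vectors kept:", int(relevant.sum()), "of", N)
```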
RVM: The core idea (2) For classification and regression:
SVM: The details (1) Equation of the decision surface: y(x) = w^T φ(x) + b = 0. Distance of a correctly classified point x_n with target t_n ∈ {-1, +1} from the decision surface: t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||.
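A tiny worked example of the distance formula (taking φ(x) = x and hypothetical values for w, b and the data):

```python
import numpy as np

w = np.array([2.0, -1.0])                 # assumed weight vector
b = -0.5                                  # assumed bias
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
t = np.array([1, -1, 1])                  # targets in {-1, +1}

y = X @ w + b                             # y(x_n) = w^T x_n + b
dist = t * y / np.linalg.norm(w)          # t_n y(x_n) / ||w||
print(dist)                               # positive for correctly classified points
```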
SVM: The details (2) Distance of a point from the decision surface: t_n y(x_n) / ||w||. Maximum margin solution: arg max over w, b of { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }.
SVM: The details (3) The distance t_n y(x_n) / ||w|| is unchanged under the rescaling w → κw, b → κb. We therefore may rescale w and b such that t_n (w^T φ(x_n) + b) = 1 for the point closest to the surface.
SVM: The details (4) Therefore, we can reduce the problem to minimizing (1/2) ||w||² under the constraints t_n (w^T φ(x_n) + b) ≥ 1, n = 1, …, N.
SVM: The details (5) To solve this, we introduce Lagrange multipliers a_n ≥ 0 and minimize L(w, b, a) = (1/2) ||w||² - Σ_n a_n { t_n (w^T φ(x_n) + b) - 1 } with respect to w and b. Equivalently, we can maximize the dual representation Σ_n a_n - (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n t_n = 0, where the kernel function k(x, x′) = φ(x)^T φ(x′) can be chosen without specifying φ(x) explicitly.
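A sketch of solving this dual numerically (assuming SciPy and a small hand-made linearly separable toy set with a linear kernel):

```python
# Maximize the dual (i.e. minimize its negative) subject to
# a_n >= 0 and sum_n a_n t_n = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, -2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                                   # linear kernel k(x_n, x_m) = x_n^T x_m
def neg_dual(a):
    return -(a.sum() - 0.5 * np.sum(np.outer(a * t, a * t) * K))

res = minimize(neg_dual, np.zeros(len(t)),
               bounds=[(0, None)] * len(t),
               constraints={"type": "eq", "fun": lambda a: a @ t})
print(np.round(res.x, 3))                     # only the support vectors get a_n > 0
```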
SVM: The details (6) Because of the constraint a_n { t_n y(x_n) - 1 } = 0, only those a_n survive for which x_n is on the margin, i.e. t_n y(x_n) = 1; for all other points a_n = 0. This leads to sparsity.
SVM: The details (7) Based on numerical optimization of the parameters a_n and b, predictions on new data points x can be made by evaluating the sign of y(x) = Σ_n a_n t_n k(x, x_n) + b.
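To check that this is indeed what a library SVM computes, one can reproduce the decision values from the fitted dual coefficients (assuming scikit-learn, which exposes a_n t_n for the support vectors as dual_coef_; data and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, t = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, t)

X_new = X[:5]
K = rbf_kernel(X_new, clf.support_vectors_, gamma=0.5)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_   # sum_n a_n t_n k(x, x_n) + b
print(np.allclose(manual, clf.decision_function(X_new)))
```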
SVM: The details (8) In cases where the data points are not separable in feature space, we need a soft margin, i.e. a (limited) tolerance for misclassified points. To achieve this, we introduce slack variables ξ_n ≥ 0, with ξ_n = 0 for points on or inside the correct margin boundary and ξ_n = |t_n - y(x_n)| for all other points, and replace the constraints by t_n y(x_n) ≥ 1 - ξ_n.
SVM: The details (9) Graphically: [figure: data points with slack variables ξ_n; points on or inside the correct margin boundary have ξ_n = 0, correctly classified points inside the margin have 0 < ξ_n < 1, misclassified points have ξ_n > 1]
SVM: The details (10) The same procedure as before (with additional Lagrange multipliers and corresponding additional constraints) again yields a sparse kernel-based solution: the same dual is maximized, now subject to the box constraints 0 ≤ a_n ≤ C and Σ_n a_n t_n = 0.
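A quick sketch of how the box constraint couples sparsity to C (assuming scikit-learn and a noisy synthetic dataset; smaller C tolerates more margin violations and typically keeps more support vectors):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, t = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, t)
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
```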
SVM: The details (11) The soft-margin approach can be formulated as minimizing the regularized error function C Σ_n ξ_n + (1/2) ||w||². This formulation can be extended to use SVMs for regression: minimize C Σ_n (ξ_n + ξ̂_n) + (1/2) ||w||², where ξ_n ≥ 0 and ξ̂_n ≥ 0 are slack variables describing the position of a data point above or below a tube of width 2ϵ around the estimate y(x).
SVM: The details (12) Graphically: [figure: the ϵ-insensitive tube around y(x), with slack variables ξ and ξ̂ for points lying above and below the tube]
SVM: The details (13) Again, optimization using Lagrange multipliers yields a sparse kernel-based solution: y(x) = Σ_n (a_n - â_n) k(x, x_n) + b, where only points on or outside the tube have nonzero coefficients.
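As with classification, the fitted coefficients (a_n - â_n), exposed by scikit-learn as dual_coef_, reproduce the library's predictions; this sketch assumes illustrative data and parameters:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200)[:, None]
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

reg = SVR(kernel="rbf", gamma=0.5, C=10.0, epsilon=0.1).fit(X, t)
K = rbf_kernel(X[:5], reg.support_vectors_, gamma=0.5)
manual = K @ reg.dual_coef_.ravel() + reg.intercept_   # sum_n (a_n - a_hat_n) k(x, x_n) + b
print(np.allclose(manual, reg.predict(X[:5])))
```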
SVM: Limitations The output is a decision, not a posterior probability. The extension of classification to more than two classes is problematic. The parameters C and ϵ have to be found by methods such as cross-validation. Kernel functions are required to be positive definite.
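A minimal sketch of choosing C and ϵ by cross-validation (assuming scikit-learn; the grid values are only illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200)[:, None]
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1, 0.5]},
                    cv=5)
grid.fit(X, t)
print(grid.best_params_)
```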