Recap
- Finds the boundary with “maximum margin”
- Uses “slack variables” to deal with outliers
- Uses “kernels”, and the “kernel trick”, to solve nonlinear problems.
SVM error function = hinge loss + 1/margin … where the hinge loss is summed over the datapoints, and the second term is proportional to the inverse of the margin, 1/m.
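A LaTeX sketch of that error function, assuming labels y_i in {−1, +1}, a threshold b and a regularisation weight λ (my notation, not necessarily the slides’):

E(\mathbf{w}, b) \;=\; \underbrace{\sum_i \max\bigl(0,\; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\bigr)}_{\text{hinge loss, summed over datapoints}} \;+\; \underbrace{\lambda \,\lVert \mathbf{w} \rVert^2}_{\propto\; 1/\text{margin}}

Since the margin is m = 2/\lVert \mathbf{w} \rVert, penalising \lVert \mathbf{w} \rVert is the same as penalising 1/m.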
Slack variables (aka soft margins): outliers are allowed some slack, at the cost of a penalty for using slack.
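For reference, a LaTeX sketch of the usual constrained form of the soft margin (the symbols ξ_i for the slack and C for the penalty weight are my notation):

\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i
\qquad \text{subject to} \qquad
y_i(\mathbf{w}^\top \mathbf{x}_i + b) \;\ge\; 1 - \xi_i, \qquad \xi_i \ge 0,

where ξ_i is the slack used by example i and C sets the penalty for using slack.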
What about much more complicated data?
- project into high dimensional space (e.g. 2D → 3D), and solve with a linear model
- project back to the original space, and the linear boundary becomes non-linear
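A minimal Python sketch of this idea on a toy dataset (the circular classes and the hand-crafted feature x3 = x1^2 + x2^2 are my own illustrative choices, not the slides’):

import numpy as np

# Toy 2D data: class 1 inside a circle of radius 1, class 0 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)

# Project into 3D by adding a hand-crafted feature x3 = x1^2 + x2^2.
X3d = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])

# In 3D the classes are separated by the plane x3 = 1, i.e. a linear model.
# Projected back to 2D, that plane is the circle x1^2 + x2^2 = 1:
# a non-linear boundary in the original space.
w = np.array([0.0, 0.0, -1.0])   # weights of the linear model in 3D
b = 1.0                          # threshold
pred = (X3d @ w + b > 0).astype(int)
print("training accuracy:", (pred == y).mean())   # 1.0 on this toy data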
The Kernel Trick
Slight rearrangement of our model – still equivalent though. Remember matrix notation – this is a “dot product”
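In LaTeX notation, the rearrangement presumably just writes the linear model as a dot product between the weight vector and the input (here b stands for the threshold, which a later slide calls t):

f(\mathbf{x}) \;=\; \sum_j w_j x_j + b \;=\; \mathbf{w}^\top \mathbf{x} + b .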
Project into higher dimensional space… …our new feature space. BUT WHERE DO WE GET THIS FROM!?
[Figure: the same data plotted in the original axes (x1, x2) and in the projected axes (x1, x2, x3)]
The Representer Theorem (Kimeldorf and Wahba, 1971) For a linear model, the optimal parameter vector is always a linear combination of the training examples…
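In LaTeX notation, the statement the following slides rely on (the coefficients α_i are my notation):

\mathbf{w} \;=\; \sum_i \alpha_i \, \mathbf{x}_i ,

i.e. the weight vector is a weighted sum of the training examples x_i.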
The Kernel Trick, PART 1 Substitute this into our model… Or, with our hypothetical high dimensional feature space:
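A LaTeX sketch of the substitution, using the representer-theorem form of w from above:

f(\mathbf{x}) \;=\; \sum_i \alpha_i \, \mathbf{x}_i^\top \mathbf{x} + b
\qquad \text{or, in the high dimensional feature space } \boldsymbol{\phi}: \qquad
f(\mathbf{x}) \;=\; \sum_i \alpha_i \, \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}) + b .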
The Kernel Trick, PART 2 The dot product of the two projected feature vectors is just a scalar value. Wouldn’t it be nice if we didn’t have to think up the feature map at all, and could just skip straight to the scalar value we need directly…? …If we had this, …our model would look like this (see the sketch below).
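A minimal Python sketch of that kernelized model, f(x) = Σ_i α_i k(x_i, x) + b (the training points, α values, threshold and the linear kernel used here are all illustrative assumptions):

import numpy as np

# Hypothetical fitted quantities (in practice these come from training the SVM):
X_train = np.array([[0.0, 1.0], [1.0, 0.0]])
alpha   = np.array([0.5, -0.5])
b       = 0.1

def kernel(x, z):
    # A kernel returns the scalar phi(x).phi(z) WITHOUT building phi explicitly.
    # The plain dot product (linear kernel) is used here as a placeholder.
    return x @ z

def predict(x):
    # Kernelized model: f(x) = sum_i alpha_i * k(x_i, x) + b
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train)) + b

print(predict(np.array([1.0, 1.0])))   # 0.5*1 - 0.5*1 + 0.1 = 0.1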
Kernels For example… the polynomial kernel. When d=2, the kernel corresponds to an implicit feature space – but we never actually calculate it! (A sketch of this is given below.)
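A Python sketch checking this, assuming the polynomial kernel has the form k(x, z) = (xᵀz)^d (some texts use (1 + xᵀz)^d); for 2D inputs and d = 2 the implied feature map is φ(x) = (x1², √2·x1·x2, x2²):

import numpy as np

def poly_kernel(x, z, d=2):
    # One common form of the polynomial kernel; some texts use (1 + x @ z) ** d.
    return (x @ z) ** d

def phi(x):
    # Explicit feature map implied by (x @ z) ** 2 for 2D inputs.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 1.0  -> (1*3 + 2*(-1))**2
print(phi(x) @ phi(z))     # same value, via the explicit feature space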
- project into high dimensional space, and solve with a linear model
- project back to the original space, and the linear boundary becomes non-linear
Polynomial kernel, with d=2
The Polynomial Kernel
The RBF (Gaussian) Kernel
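A minimal Python sketch of the RBF (Gaussian) kernel, assuming the common parameterisation k(x, z) = exp(−γ‖x − z‖²) (the γ symbol and the test values are my choices):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian / RBF kernel: similarity decays with squared distance.
    # gamma controls the width; large gamma -> very local, wiggly boundaries.
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])
print(rbf_kernel(x, z, gamma=0.5))   # exp(-1.0) ~= 0.37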
Varying two things at once!
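I’m assuming the “two things” being varied are the slack penalty C and the RBF kernel width γ. A scikit-learn sketch of varying both at once over a grid (the library, dataset and parameter values are my own additions, shown only to illustrate the idea):

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Vary the slack penalty C and the RBF kernel width gamma together.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)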
Summary of things…
SVMs versus Neural Networks
SVMs:
- Started from solid theory
- Theory led to many extensions (SVMs for text, images, graphs)
- Almost no parameter tuning
- Highly efficient to train
- Single optimum
- Highly resistant to overfitting
Neural Nets:
- Started from bio-inspired heuristics
- Ended up at theory equivalent to ideas from statistics
- Good performance = lots of parameter tuning
- Computationally intensive to train
- Suffers from local optima
- Prone to overfitting
SVMs, done. Tonight… read chapter 4 while it’s still fresh. Remember, by next week – read chapter 5.
Examples of Parameters obeying the Representer Theorem
[Figure: example datasets with positive (p) and negative (n) points]
We had before, for an x on the boundary, an expression relating w, x, and the threshold t. And we just worked out that w is a linear combination of the examples, which gives us the expression for t… The w and t are both linear functions of the examples.
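A LaTeX sketch of that argument, assuming the boundary is written as wᵀx + t = 0:

\mathbf{w} = \sum_i \alpha_i \, \mathbf{x}_i ,
\qquad
t = -\,\mathbf{w}^\top \mathbf{x} = -\sum_i \alpha_i \, \mathbf{x}_i^\top \mathbf{x}
\quad \text{for an } \mathbf{x} \text{ on the boundary},

so both w and t are linear functions of the training examples.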