Presentation on theme: "Today's Topics: Review of the linear SVM with Slack Variables; Kernels (for non-linear models)" (CS 540 - Fall 2015, Shavlik, Lecture 23, Week 11, 11/17/15) - Presentation transcript:

1  Today's Topics
- Review of the linear SVM with slack variables
- Kernels (for non-linear models)
- SVM wrapup
- Remind me to repeat questions for those listening to the audio
- Informal class poll: favorite ML algorithm? (Domingos's online 'five tribes' talk is 11/24)
  Nearest neighbors | D-trees / D-forests | Genetic algorithms | Naive Bayes / Bayesian nets | Neural networks | Support vector machines

2  Recall: Three Key SVM Concepts
- Maximize the margin: don't choose just any separating plane.
- Penalize misclassified examples: use soft constraints and 'slack' variables.
- Use the 'kernel trick' to get non-linearity: roughly like 'hardwiring' the input-to-hidden-unit (HU) portion of ANNs, so only a perceptron is needed.

3  Recall: 'Slack' Variables - Dealing with Data that is not Linearly Separable
[Figure: a linearly inseparable data set, the separating plane, the support vectors, and the margin of width 2 / ||w||2.]
For each wrong example we pay a penalty: the distance we'd have to move it to get it on the right side of the decision boundary (ie, the separating plane).
If we deleted any/all of the non-support vectors, we'd get the same answer!

4  Recall: The Math Program with Slack Variables

  min over w, S, θ:   ||w||1 + μ ||S||1
  such that           w · x_pos_i + S_i ≥ θ + 1    (for each positive example i)
                      w · x_neg_j - S_j ≤ θ - 1    (for each negative example j)
                      ∀k  S_k ≥ 0

The dimension of w is the number of input features; the dimension of S is the number of training examples.
The S's are how far we would need to move an example in order for it to be on the proper side of the decision surface.
Notice we are solving the perceptron task with a complexity penalty (the sum of the weights) - Hinton's weight decay!
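
A minimal sketch, not course code, of how this linear program might be encoded with scipy.optimize.linprog (assuming NumPy/SciPy; the function name linear_svm_1norm, the variable layout, and the +1/-1 label convention are my own). The auxiliary variables a_i stand in for |w_i| so that ||w||1 stays linear.

```python
import numpy as np
from scipy.optimize import linprog

def linear_svm_1norm(X, y, mu=1.0):
    """Solve min ||w||_1 + mu*||S||_1 subject to the slide's margin constraints."""
    n, d = X.shape                        # n examples, d input features
    # Variable vector z = [w (d), a (d), S (n), theta (1)], with a_i >= |w_i|.
    c = np.concatenate([np.zeros(d), np.ones(d), mu * np.ones(n), [0.0]])
    A_ub, b_ub = [], []
    for i in range(d):                    # encode |w_i| <= a_i as two inequalities
        row = np.zeros(2 * d + n + 1); row[i] = 1.0;  row[d + i] = -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(2 * d + n + 1); row[i] = -1.0; row[d + i] = -1.0
        A_ub.append(row); b_ub.append(0.0)
    for i in range(n):                    # margin constraints with slack
        row = np.zeros(2 * d + n + 1)
        if y[i] > 0:                      # w.x_pos + S_i >= theta + 1
            row[:d] = -X[i]; row[2 * d + i] = -1.0; row[-1] = 1.0
        else:                             # w.x_neg - S_j <= theta - 1
            row[:d] = X[i];  row[2 * d + i] = -1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(-1.0)
    bounds = ([(None, None)] * d + [(0, None)] * d +
              [(0, None)] * n + [(None, None)])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:d], res.x[-1]           # learned weights w and threshold theta
```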

5  Recall: SVMs and Non-Linear Separating Surfaces
[Figure: examples plotted in the original feature space (f1, f2) and again in a derived space (h(f1, f2), g(f1, f2)).]
- Non-linearly map the examples into a new space.
- Linearly separate them in the new space.
- The result is a non-linear separator in the original space.

6  Idea #3: Finding Non-Linear Separating Surfaces via Kernels
- Map the inputs into a new space, eg
  - ex1's features: x1 = 5, x2 = 4                        (old rep)
  - ex1's features: (x1^2, x2^2, 2·x1·x2) = (25, 16, 40)  (new rep: the 'square' of the old rep)
- Solve the linear SVM program in this new space
  - Computationally complex if there are many derived features
  - But a clever trick exists!
- SVM terminology (differs from other parts of ML)
  - Input space: the original features
  - Feature space: the space of derived features
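
A tiny illustration of the remapping on this slide (the variable names are mine, not the course's): the old representation (5, 4) becomes the new three-component representation.

```python
# Old rep (x1, x2) = (5, 4) mapped to (x1^2, x2^2, 2*x1*x2).
x1, x2 = 5.0, 4.0
new_rep = (x1 ** 2, x2 ** 2, 2 * x1 * x2)
print(new_rep)   # (25.0, 16.0, 40.0), matching the slide
```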

7  Kernels
- Kernels produce non-linear separating surfaces in the original space.
- Kernels are similarity functions between two examples, K(ex_i, ex_j), as in k-NN.
- Sample kernels (many variants exist):
  - K(ex_i, ex_j) = ex_i · ex_j                       (the linear kernel)
  - K(ex_i, ex_j) = exp{ -||ex_i - ex_j||^2 / σ^2 }   (the Gaussian kernel)
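
Possible NumPy implementations of the two sample kernels above, as a sketch (the function names and the σ default are my own choices):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)                              # ex_i . ex_j

def gaussian_kernel(x, z, sigma=1.0):
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / sigma ** 2)  # exp{-||x - z||^2 / sigma^2}
```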

8  Kernels as Features
- Let the similarity between examples be the features! Feature j for example i is K(ex_i, ex_j).
- Models are of the form: if Σ_j α_j K(ex_i, ex_j) > θ then + else -
- The α's weight the similarities (we hope many α = 0).
- So a model is determined by (a) finding some good exemplars (those with α ≠ 0; they are the support vectors) and (b) weighting the similarity to these exemplars.
- An instance-based learner! Bug or feature?
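
A sketch of the instance-based model form on this slide: classify an example as positive when the α-weighted similarities to the exemplars exceed θ. The alphas, exemplars, theta, and kernel K are assumed to come from training; the function name is my own.

```python
def predict(x, exemplars, alphas, theta, K):
    # Only exemplars with non-zero alpha (the support vectors) contribute.
    score = sum(a * K(x, e) for a, e in zip(alphas, exemplars) if a != 0)
    return '+' if score > theta else '-'
```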

9  Our Array of 'Feature' Values
[Figure: a rectangular array of numbers; rows are examples, columns are the features K(ex_i, ex_j), where entry (i, j) is the similarity between examples i and j.]
- Our models are linear in these features, but will be non-linear in the original features if K is a non-linear function.
- Notice that we can compute K(ex_i, ex_j) outside the SVM code! So we really only need code for the LINEAR SVM - it doesn't know from where the 'rectangle' of data has come.

10  Concrete Example
Use the 'squaring' kernel, K(x, z) = (x · z)^2, to convert the following set of examples.

Raw Data:
       F1   F2   Output
Ex1     4    2   T
Ex2    -6    3   T
Ex3    -5   -1   F

Derived Features (to fill in):
       K(ex_i, ex_1)   K(ex_i, ex_2)   K(ex_i, ex_3)   Output
Ex1                                                    T
Ex2                                                    T
Ex3                                                    F

11  Concrete Example (with answers)
Use the 'squaring' kernel, K(x, z) = (x · z)^2, to convert the following set of examples.

Raw Data:
       F1   F2   Output
Ex1     4    2   T
Ex2    -6    3   T
Ex3    -5   -1   F

Derived Features:
       K(ex_i, ex_1)   K(ex_i, ex_2)   K(ex_i, ex_3)   Output
Ex1        400             324             484         T
Ex2        324            2025             729         T
Ex3        484             729             676         F

(We would probably want to divide these by 1000 to scale the derived features.)
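
A quick NumPy check of this worked example (my own verification, not part of the slides): applying K(x, z) = (x · z)^2 to the three raw examples reproduces the derived-feature table.

```python
import numpy as np

X = np.array([[4, 2], [-6, 3], [-5, -1]], dtype=float)   # Ex1, Ex2, Ex3
K = (X @ X.T) ** 2                                        # squaring kernel on all pairs
print(K)
# [[ 400.  324.  484.]
#  [ 324. 2025.  729.]
#  [ 484.  729.  676.]]
print(K / 1000)   # the rescaled version suggested on the slide
```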

12  A Simple Example of a Kernel Creating a Non-Linear Separation
Assume K(A, B) = -distance(A, B).
[Figure, left: the original feature space with examples ex1-ex9 and a non-linear separating surface around ex1.
 Figure, right: the kernel-produced feature space (only two dimensions, K(ex_i, ex_1) and K(ex_i, ex_6), shown), where a separating plane suffices.]
Model: if K(ex_new, ex_1) > -5 then GREEN else RED

13  Our 1-Norm SVM with Kernels

  min over α, S, θ:   ||α||1 + μ ||S||1
  such that           ∀ pos ex's i:  Σ_j α_j · K(x_j, x_pos_i) + S_i ≥ θ + 1
                      ∀ neg ex's k:  Σ_j α_j · K(x_j, x_neg_k) - S_k ≤ θ - 1
                      ∀m  S_m ≥ 0

- We use α instead of w to indicate we're weighting similarities rather than 'raw' features.
- The same linear LP code can be used; simply create the K()'s externally!
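
A hedged sketch of the reuse point on this slide: build the kernel 'rectangle' outside the SVM, then hand it to the same linear 1-norm LP. The helpers kernelize, gaussian_kernel, and linear_svm_1norm are my own names from the earlier sketches in this transcript, not course code.

```python
import numpy as np

def kernelize(X, K):
    """Replace raw features with similarities: column j of the result is K(., ex_j)."""
    n = X.shape[0]
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

# Usage sketch:
# X_kernel = kernelize(X, gaussian_kernel)
# alphas, theta = linear_svm_1norm(X_kernel, y, mu=1.0)   # the 'weights' are now the alphas
```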

14  The Kernel 'Trick'
- The linear SVM can be written (using the [not-on-final] primal-dual concept of LPs) as

    min over α:  ½ Σ_i Σ_j y_i y_j α_i α_j (ex_i · ex_j)  -  Σ_i α_i

- Whenever we see the dot product ex_i · ex_j, we can replace it with a kernel K(ex_i, ex_j); this is called the 'kernel trick' (http://en.wikipedia.org/wiki/Kernel_trick).
- This trick is not only for SVMs; ie, 'kernel machines' are a broad ML topic
  - we can use 'similarity to examples' as features for ANY ML algorithm!
  - eg, run d-trees with kernelized features
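
One way to see the trick in practice, assuming scikit-learn is available (this uses the library's precomputed-kernel option, not the course's own code; the data reuses the earlier concrete example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[4, 2], [-6, 3], [-5, -1]], dtype=float)
y = np.array([1, 1, -1])

gram = (X @ X.T) ** 2                 # squaring kernel replaces ex_i . ex_j
clf = SVC(kernel='precomputed').fit(gram, y)

x_new = np.array([[3, 1]])
k_new = (x_new @ X.T) ** 2            # similarities of the new example to the training ex's
print(clf.predict(k_new))             # predicted class for the new example
```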

15  Kernels and Mercer's Theorem
- K(x, y)'s that are
  - continuous
  - symmetric: K(x, y) = K(y, x)
  - positive semidefinite (the square Hermitian matrix they create has only non-negative eigenvalues; see en.wikipedia.org/wiki/Positive_semidefinite_matrix)
  are equivalent to a dot product in some space: K(x, y) = Φ(x) · Φ(y)
- Note: we can use any similarity function to create a new 'feature space' and solve with a linear SVM, but the 'dot product in a derived space' interpretation is lost unless Mercer's Theorem holds.
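
A quick empirical check (my own sketch) of the symmetry and positive-semidefinite conditions on a sample of data: build the kernel matrix and inspect its eigenvalues, which should all be non-negative up to numerical noise.

```python
import numpy as np

def looks_psd(kernel, X, tol=1e-8):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    if not np.allclose(K, K.T):
        return False                          # must be symmetric
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```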

16  The New Space for a Sample Kernel
Let K(x, z) = (x · z)^2 and let the number of features be 2. Then

  (x · z)^2 = (x1 z1 + x2 z2)^2
            = x1 x1 z1 z1 + x1 x2 z1 z2 + x2 x1 z2 z1 + x2 x2 z2 z2
            = <x1 x1, x1 x2, x2 x1, x2 x2> · <z1 z1, z1 z2, z2 z1, z2 z2>

(Notation: <...> indicates a vector, with its components explicitly listed.)
This is our new feature space with 4 dimensions; we're doing a dot product in it!
Note: if we used an exponent > 2, we'd get a much larger 'virtual' feature space for very little cost!
Key point: we don't explicitly create the expanded 'raw' feature space, but the result is the same as if we did.
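
A sanity check of the derivation above (my own, not from the slide): the explicit 4-dimensional map really does give the same number as the kernel.

```python
import numpy as np

def phi(v):
    # Explicit map <x1*x1, x1*x2, x2*x1, x2*x2> from the slide.
    return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

x, z = np.array([5.0, 4.0]), np.array([-2.0, 3.0])
print(np.dot(phi(x), phi(z)), np.dot(x, z) ** 2)   # both print 4.0
```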

17  Review: Matrix Multiplication
A · B = C, where matrix A is M by K, matrix B is K by N, and matrix C is M by N.
From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2
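
A two-line NumPy reminder of the shape rule (an illustration of my own, with arbitrary sizes):

```python
import numpy as np
A = np.ones((3, 5))      # M by K
B = np.ones((5, 2))      # K by N
print((A @ B).shape)     # (3, 2): M by N
```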

18  The Kernel Matrix
- Let A be our usual array with one example per row and one (standard) feature per column.
- A' is 'A transpose' (A rotated around its diagonal): one (standard) feature per row and one example per column.
- The Kernel Matrix is K(A, A'): an e-by-e array (e = number of examples) whose (i, j) entry is the similarity between example i and example j.
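
A small sketch of the kernel matrix in NumPy (reusing the earlier example data; with the linear kernel it is simply A times A-transpose):

```python
import numpy as np

A = np.array([[4, 2], [-6, 3], [-5, -1]], dtype=float)   # e examples by f features
K_linear  = A @ A.T                                      # e-by-e kernel matrix, linear kernel
K_squared = (A @ A.T) ** 2                               # same shape, squaring kernel
print(K_linear.shape)                                    # (3, 3)
```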

19  The Reduced SVM (Lee & Mangasarian, 2001)
- With kernels, learned models are weighted sums of similarities to some of the training examples.
- The kernel matrix is size O(N^2), where N = number of examples. With 'big data', squaring can be prohibitive!
- But there is no reason all training examples need to be candidate 'exemplars'.
- We can randomly (or cleverly) choose a subset as candidates; the size can then scale O(N).
[Figure: the N-by-N array of K(e_i, e_j) values, with only a few columns (the chosen candidate exemplars) created and used.]
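
A sketch of the reduced-kernel idea (my own illustration, not Lee & Mangasarian's code): keep all N rows but compute similarities only to a randomly chosen subset of m candidate exemplars, giving an N-by-m rectangle instead of an N-by-N matrix.

```python
import numpy as np

def reduced_kernel_matrix(X, kernel, m, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    exemplar_idx = rng.choice(n, size=m, replace=False)   # the chosen 'columns'
    K = np.array([[kernel(X[i], X[j]) for j in exemplar_idx] for i in range(n)])
    return K, exemplar_idx                                 # K is n-by-m
```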

20  More on Kernels
- K(x, z) = tanh(c · (x · z) + d) relates to the sigmoid of ANNs (here the number of HUs is determined by the number of support vectors).
- How to choose a good kernel function?
  - Use a tuning set
  - Or just use the Gaussian kernel
  - Some theory exists
  - A sum of kernels is a kernel (and other 'closure' properties exist)
  - We don't want the kernel matrix to be all 0's off the diagonal, since we want to model examples as sums of other examples
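
Sketches of the tanh ('sigmoid'-like) kernel above and of the sum-of-kernels closure property (function names, parameter defaults, and the particular summed pair are my own choices):

```python
import numpy as np

def tanh_kernel(x, z, c=1.0, d=0.0):
    return np.tanh(c * np.dot(x, z) + d)

def sum_kernel(x, z, sigma=1.0):
    # Sum of the linear and Gaussian kernels is itself a kernel.
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.dot(x, z) + np.exp(-np.dot(x - z, x - z) / sigma ** 2)
```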

21  The Richness of Kernels
- Kernels need not solely be similarities computed on numeric data, or on 'raw' data that sits in a rectangle.
- We can define similarity between examples represented as
  - trees (eg, parse trees in NLP; see the image on the slide) - count common subtrees, say
  - sequences (eg, DNA sequences)
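
A toy sequence kernel of my own, not from the lecture, just to make the point concrete: count shared k-letter substrings between two DNA strings; no fixed-length feature vector is needed.

```python
def kmer_kernel(s, t, k=3):
    kmers_s = {s[i:i + k] for i in range(len(s) - k + 1)}
    kmers_t = {t[i:i + k] for i in range(len(t) - k + 1)}
    return len(kmers_s & kmers_t)          # number of shared k-mers

print(kmer_kernel("ACGTACGT", "TACGTTAC"))  # shared 3-letter substrings
```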

22  Using Gradient Descent Instead of Linear Programming
- Recall that last lecture we said perceptron training with weight decay is quite similar to SVM training.
- This is still the case with kernels; ie, we create a new (kernelized) data set outside the perceptron code and use gradient descent.
- So here we get the non-linearity provided by HUs in a 'hard-wired' fashion (ie, by using the kernel to non-linearly compute a new representation of the data).
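
A rough sketch of this alternative (my own, with illustrative learning rate, decay, and epoch settings): kernelize the data once, then train an ordinary perceptron-style linear unit with weight decay on the new representation.

```python
import numpy as np

def kernel_perceptron_gd(K, y, lr=0.01, decay=0.001, epochs=100):
    n = K.shape[0]                        # K[i, j] = K(ex_i, ex_j), built outside this code
    alpha, theta = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(n):
            out = 1 if K[i] @ alpha > theta else -1
            alpha += lr * (y[i] - out) * K[i] - decay * alpha   # perceptron update + weight decay
            theta -= lr * (y[i] - out)
    return alpha, theta
```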

23  SVM Wrapup
- For approximately a decade, SVMs were the 'hottest' topic in ML (deep NNs now are).
- They formalize nicely the task of finding a simple model with few 'outliers'.
- They use hard-wired 'kernels' to do the job done by HUs in ANNs.
- Kernels can be used in any ML algorithm:
  - just preprocess the data to create 'kernel' features
  - they can handle non-fixed-length feature vectors
- Lots of good theory and empirical results.

