Presentation on theme: "CS540 - Fall 2016 (Shavlik©), Lecture 24, Week 14"— Presentation transcript:

1 Today's Topics
Support Vector Machines (SVMs) and their three key ideas: max margins, allowing misclassified training examples, and kernels (for non-linear models).

2 Three Key SVM Concepts
- Maximize the margin: don't choose just any separating plane.
- Penalize misclassified examples: use soft constraints and 'slack' variables.
- Use the 'kernel trick' to get non-linearity: roughly like 'hardwiring' the input-to-HU portion of ANNs (so we only need a perceptron).

3 Support Vector Machines: Maximizing the Margin between Bounding Planes
SVMs define some inequalities we want satisfied. We then use advanced optimization methods (eg, linear programming) to find solutions that satisfy them, but in cs540 we'll do a simpler approximation. [Figure: two bounding planes with the support vectors lying on them; the margin between the planes is 2 / ||w||2.]

4 Margins and Learning Theory
Theorems exist that connect learning ('PAC') theory to the size of the margin: basically, the larger the margin, the better the expected future accuracy. See, for example, Chapter 4 of Support Vector Machines by N. Cristianini & J. Shawe-Taylor, Cambridge University Press (not an assigned reading).

5 ‘Slack’ Variables Dealing with Data that is not Linearly Separable
For each wrong example we pay a penalty, which is the distance we'd have to move it to get on the right side of the decision boundary (ie, the separating plane). If we deleted any/all of the non-support vectors, we'd get the same answer! [Figure: separating plane with the support vectors marked and misclassified points shown with their slack distances.]

6 SVMs and Non-Linear Separating Surfaces
- Non-linearly map the examples to a new space (the number of dimensions might be different from the original space).
- Linearly separate them in the new space.
- The result is a non-linear separator in the original space.
[Figure: original features f1, f2 with '+' and '_' examples; derived features g(f1, f2) and h(f1, f2) in which the classes become linearly separable.]

7 Math Review: Dot Products
X • Y ≡ X1 • Y1 + X2 • Y2 + … + Xn • Yn. So if X = [4, 5, -3, 7] and Y = [9, 0, -8, 2], then X • Y = (4)(9) + (5)(0) + (-3)(-8) + (7)(2) = 74. (Weighted sums in ANNs are dot products.)
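The same dot product in a throwaway bit of Python (illustration only, not course code):

  import numpy as np

  X = np.array([4, 5, -3, 7])
  Y = np.array([9, 0, -8, 2])
  print(np.dot(X, Y))   # 4*9 + 5*0 + (-3)*(-8) + 7*2 = 74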

8 Some Equations: the Separating Plane
[Figure: '+' and '-' examples separated by a plane defined by the weights w, the input features x, and the threshold θ, with dashed bounding planes on either side.] For all positive examples: w • x ≥ θ + 1. For all negative examples: w • x ≤ θ - 1. These 1's result from dividing through by a constant for convenience (it is the distance from the dashed lines to the green line).

9 Idea #1: The Margin
The green line is the set of all points that satisfy the bounding-plane equation (i), w • xA = θ + 1 (ditto for the red line and equation (ii), w • xB = θ - 1). [Figure: the two parallel bounding planes with points xA and xB on them and the weight vector w perpendicular to both.] Subtracting (ii) from (i) gives (iii): w • (xA - xB) = 2. Since the lines are parallel, (iv): xA - xB is parallel to w and its length is the margin. Combining (iii) and (iv) we get margin = 2 / ||w||2.

10 Our Initial ‘Mathematical Program’
min over w, θ of ||w||1 (this is the '1-norm' length of the weight vector, which is the sum of the absolute values of the weights; some SVMs use quadratic programs, but 1-norms have some preferred properties)
such that w • xpos ≥ θ + 1   // for '+' ex's
          w • xneg ≤ θ - 1   // for '-' ex's

11 The ‘p’ Norm – Generalization of the Familiar Euclidean Distance (p=2)
||x||p ≡ ( Σi |xi|^p )^(1/p). With p = 1 this is the sum of the absolute values; with p = 2 it is the familiar Euclidean length.
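A small illustrative snippet computing p-norms straight from this definition and checking against numpy (the helper name is mine):

  import numpy as np

  def p_norm(x, p):
      """||x||_p = (sum_i |x_i|^p)^(1/p)"""
      return np.sum(np.abs(x) ** p) ** (1.0 / p)

  w = np.array([3.0, -4.0, 1.0])
  print(p_norm(w, 1), np.linalg.norm(w, 1))   # 1-norm: sum of absolute values = 8
  print(p_norm(w, 2), np.linalg.norm(w, 2))   # 2-norm: Euclidean length ~ 5.099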

12 Our Mathematical Program (cont.)
Note: w and θ are our adjustable parameters (we could, of course, use the ANN 'trick' and move θ to the left side of our inequalities and treat it as another weight). We can now use existing math-programming optimization s/w to find a sol'n to our current program (covered in cs525).

13 Idea #2: Dealing with Non-Separable Data
We can add what is called a 'slack' variable to each example. This variable can be viewed as: 0 if the example is correctly separated, else the 'distance' we need to move the example to get it correct (ie, its distance from the decision boundary). Note: we are NOT counting the number misclassified; it would be nice to do so, but that becomes [mixed] integer programming, which is much harder.

14 The Math Program with Slack Vars
(This is the linear-programming version; there is also a quadratic-programming version - we won't worry about the difference.) Notice we are solving the perceptron task with a complexity penalty (sum of wgts) - Hinton's wgt decay!
min over w, S, θ of ||w||1 + μ ||S||1
such that w • xposi + Si ≥ θ + 1
          w • xnegj - Sj ≤ θ - 1
          Sk ≥ 0
Here μ is a scalar scaling constant (use a tuning set to select its value), S has dimension = # of training examples, and w has dimension = # of input features. The S's are how far we would need to move an example in order for it to be on the proper side of the decision surface.
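Below is a minimal sketch, in Python, of how this 1-norm program with slack variables could be handed to an off-the-shelf LP solver (scipy.optimize.linprog). The variable layout, the helper name train_linear_svm_lp, and the toy data are my own illustrative choices, not the course's code; the |w| terms are linearized with extra variables z, exactly the 'squeezing' trick described on the later LP slides.

  import numpy as np
  from scipy.optimize import linprog

  def train_linear_svm_lp(X, y, mu=1.0):
      """X: (n, f) feature matrix; y: +1/-1 labels; mu: slack penalty."""
      n, f = X.shape
      # Variable vector: [ w (f) | z (f) | s (n) | theta (1) ]; z_i will equal |w_i|.
      c = np.concatenate([np.zeros(f), np.ones(f), mu * np.ones(n), [0.0]])

      A_ub, b_ub = [], []
      for i in range(n):
          row = np.zeros(2 * f + n + 1)
          if y[i] > 0:           # w.x + s >= theta + 1  ->  -w.x - s + theta <= -1
              row[:f] = -X[i]; row[2 * f + i] = -1.0; row[-1] = 1.0
          else:                  # w.x - s <= theta - 1  ->   w.x - s - theta <= -1
              row[:f] = X[i]; row[2 * f + i] = -1.0; row[-1] = -1.0
          A_ub.append(row); b_ub.append(-1.0)
      for j in range(f):         # z >= w and z >= -w, so z_j = |w_j| at the optimum
          for sign in (+1.0, -1.0):
              row = np.zeros(2 * f + n + 1)
              row[j] = sign; row[f + j] = -1.0
              A_ub.append(row); b_ub.append(0.0)

      bounds = [(None, None)] * f + [(0, None)] * f + [(0, None)] * n + [(None, None)]
      res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds, method="highs")
      w, theta = res.x[:f], res.x[-1]
      return w, theta

  # Tiny linearly separable toy set
  X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
  y = np.array([+1, +1, -1, -1])
  w, theta = train_linear_svm_lp(X, y, mu=10.0)
  print(w, theta, np.sign(X @ w - theta))   # should recover the +/- labels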

15 Slack’s and Separability
If training data is separable, will all Si = 0 ? Not necessarily! Might get a larger margin by misclassifying a few examples (just like in d-tree pruning) This can also happen when using gradient- descent to minimize an ANN’s cost function 12/6&8/16 CS540 - Fall 2016 (Shavlik©), Lecture 24, Week 14

16 Idea #3: Finding Non-Linear Separating Surfaces via Kernels
Map inputs into a new space, eg:
  ex1 features: x1 = 5, x2 = 4  ← old rep
  ex1 features: (x1², x2², 2 • x1 • x2) = (25, 16, 40)  ← new rep (squares of the old rep)
Then solve the linear SVM program in this new space. This is computationally complex if there are many derived features - but a clever trick exists! SVM terminology (differs from other parts of ML): the 'input space' is the original features; the 'feature space' is the space of derived features.

17 Kernels
Kernels can produce non-linear separating surfaces in the original space. Kernels are similarity functions between two examples, K(exi, exj), like in k-NN. Sample kernels (many variants exist):
  K(exi, exj) = exi • exj                       (this is the linear kernel)
  K(exi, exj) = exp{ -||exi - exj||² / σ² }     (this is the Gaussian kernel)
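A minimal sketch of these two kernels in code (the function names and the σ default are my own choices):

  import numpy as np

  def linear_kernel(a, b):
      return np.dot(a, b)                       # just the dot product

  def gaussian_kernel(a, b, sigma=1.0):
      return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

  a, b = np.array([4.0, 2.0]), np.array([-6.0, 3.0])
  print(linear_kernel(a, b))                    # -18.0
  print(gaussian_kernel(a, b, sigma=5.0))       # similarity in (0, 1]; equals 1 only when a == b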

18 The Gaussian Kernel - heavily used in SVMs
The kernel is the similarity between two examples, K(extest, extrain). [Figure: in feature space, the Gaussian kernel forms a bump (a power of Euler's number e) centered on extrain; its height at extest gives the similarity.]

19 Bug or feature? Kernels as Features
Let the similarity between examples be the features! Feature j for example i is K(exi, exj). Models are of the form: if Σj αj K(exi, exj) > θ then + else -. This is an instance-based learner! So a model is determined by (a) finding some good exemplars (those with α ≠ 0; they are the support vectors) and (b) weighting the similarity to these exemplars. The α's weight the similarities (we hope many α = 0).
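A tiny sketch of that decision rule; the function name, the example exemplars, and the α and θ values below are invented purely for illustration (in practice they come from the SVM optimization):

  import numpy as np

  def predict(x_new, exemplars, alphas, theta, kernel):
      """If sum_j alpha_j * K(x_new, ex_j) > theta then '+' else '-'."""
      score = sum(a * kernel(x_new, e) for a, e in zip(alphas, exemplars) if a != 0)
      return '+' if score > theta else '-'

  kernel = lambda a, b: float(np.dot(a, b))     # linear kernel for the demo
  print(predict(np.array([1.0, 2.0]),
                exemplars=[np.array([1.0, 1.0]), np.array([-1.0, 0.0])],
                alphas=[0.5, -0.25], theta=0.0, kernel=kernel))   # prints '+'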

20 Our Array of ‘Feature’ Values
Our new dataset is an array of numbers whose rows are the examples and whose columns are the features K(exi, exj), ie the similarity between examples i and j. Our models are linear in these features, but will be non-linear in the original features if K is a non-linear function. Notice that we can compute K(exi, exj) outside the SVM code! So we really only need code for the LINEAR SVM - it doesn't know from where the 'rectangle' of data has come.

21 Concrete Example
Use the 'squaring' kernel, K(exi, exj) = (exi • exj)², to convert the following set of examples from the raw features to the derived features:

  Raw Data                      Derived Features
       F1   F2   Output              K(exi,ex1)  K(exi,ex2)  K(exi,ex3)  Output
  Ex1   4    2     T             Ex1                                       T
  Ex2  -6    3                   Ex2
  Ex3  -5   -1     F             Ex3                                       F

22 Concrete Example (w/ answers)
Use the ‘squaring’ kernel to convert the following set of examples K(exi, exj) = (exi • exj)2 Probably want to divide this by 1000 to scale the derived features Raw Data Derived Features F1 F2 Output Ex1 4 2 T Ex2 -6 3 Ex3 -5 -1 F K(exi, ex1) K(exi, ex2) K(exi, ex3) Output Ex1 400 324 484 T Ex2 2025 729 Ex3 676 F 12/6&8/16 CS540 - Fall 2016 (Shavlik©), Lecture 24, Week 14

23 Remember!
We can take any dataset of examples and create a new dataset of 'kernelized' features, then give that new dataset to any ML algorithm (that can accept numeric features).

24 The Richness of Kernels
Kernels need not solely be similarities computed on numeric data, nor on 'raw' data that fits in a rectangle! We can define similarity between examples represented as trees (eg, parse trees in NLP - count common subtrees, say) or as sequences (eg, DNA sequences).

25 Review: Matrix Multiplication
A × B = C, where matrix A is M by K, matrix B is K by N, and matrix C is M by N.

26 The Kernel Matrix
Let A be our usual array with one example per row and one (standard) feature per column. A' is 'A transpose' (A rotated around its diagonal): one (standard) feature per row, one example per column. The Kernel Matrix is K(A, A'). [Diagram: with e = #examples and f = #features, the (e × f) matrix A times the (f × e) matrix A' gives the (e × e) kernel matrix K.]
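A sketch of building the kernel matrix example-by-example (the helper name is mine); with the linear kernel this is literally A × A', and any other kernel gives a matrix of the same e × e shape:

  import numpy as np

  def kernel_matrix(A, kernel):
      """A: (e, f) array, one example per row.  Returns the (e, e) matrix K(A, A')."""
      e = A.shape[0]
      return np.array([[kernel(A[i], A[j]) for j in range(e)] for i in range(e)])

  A = np.array([[4.0, 2.0], [-6.0, 3.0], [-5.0, -1.0]])
  print(kernel_matrix(A, lambda a, b: np.dot(a, b)))       # equals A @ A.T
  print(kernel_matrix(A, lambda a, b: np.dot(a, b) ** 2))  # the squaring kernel again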

27 The Reduced SVM (Lee & Mangasarian, 2001)
With kernels, learned models are weighted sums of similarities to some of the training examples. The kernel matrix is size O(N²), where N = # of examples - with 'big data', squaring can be prohibitive! But there is no reason all training examples need to be candidate 'exemplars': we can randomly (or cleverly) choose a subset as candidates, whose size can scale O(N). [Figure: the examples-by-examples matrix of K(ei, ej) values; create (and use) only a subset of its columns.]
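A sketch of the reduction, under the assumption that we simply sample the candidate exemplars uniformly at random (the helper name and details are mine): every example still gets a row, but we only compute and store similarities to m chosen columns, so the matrix is e × m rather than e × e.

  import numpy as np

  def reduced_kernel_matrix(A, kernel, m, seed=0):
      """Return (e x m) similarities to m randomly chosen candidate exemplars."""
      rng = np.random.default_rng(seed)
      cols = rng.choice(A.shape[0], size=m, replace=False)   # indices of the kept columns
      return np.array([[kernel(a, A[j]) for j in cols] for a in A]), cols

  A = np.random.default_rng(1).normal(size=(1000, 5))        # pretend 'big' data
  K_red, kept = reduced_kernel_matrix(A, lambda a, b: np.dot(a, b), m=50)
  print(K_red.shape)                                         # (1000, 50), not (1000, 1000)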

28 A Simple Example of a Kernel Creating a Non-Linear Separation: Assume K(A, B) = -distance(A, B)
[Figure: the examples ex1 … ex9 plotted twice - once in the original feature space and once in the kernel-produced feature space (only two of its dimensions, K(exi, ex1) and K(exi, ex6), are shown). The separating plane in the derived space corresponds to a separating surface in the original feature space that is non-linear!] Model: if K(exnew, ex1) > -5 then GREEN else RED.

29 Our 1-Norm SVM with Kernels
We use α instead of w to indicate we're weighting similarities rather than 'raw' features.
min over α, S, θ of ||α||1 + μ ||S||1
such that ∀ pos ex's:  { Σj αj • K(xj, xposi) } + Si ≥ θ + 1
          ∀ neg ex's:  { Σj αj • K(xj, xnegk) } - Sk ≤ θ - 1
          ∀ Sm ≥ 0
The same linear LP code can be used - simply create the K()'s externally!

30 Advanced Note: The Kernel ‘Trick’
The Linear SVM can be written (using the primal-dual concept of LPs) as
  min over α of [ ½ Σi Σj yi yj αi αj (exi • exj) ] - Σi αi
Whenever we see the dot product exi • exj, we can replace it with the kernel K(exi, exj): this is called the 'kernel trick'. This trick is not only for SVMs - ie, 'kernel machines' are a broad ML topic. We can use 'similarity to examples' as features for ANY ML algorithm, eg, run d-trees with kernelized features.

31 Advanced Topic: Kernels and Mercer’s Theorem
K(x, y)’s that are continuous symmetric: K(x, y) = K(y, x) positive semidefinite (create a square Hermitian matrix whose eigenvectors are all positive; see en.wikipedia.org/wiki/Positive_semidefinite_matrix) Are equivalent to a dot product in some space K(x, y) = (x)  (y) Note: can use any similarity function to create a new ‘feature space’ and solve with a linear SVM, but the ‘dot product in a derived space’ interpretation will be lost unless Mercer’s Theorem holds 12/6&8/16 CS540 - Fall 2016 (Shavlik©), Lecture 24, Week 14

32 The New Space for a Sample Kernel
Let K(x, z) = (x • z)² and let #features = 2. Then
  (x • z)² = (x1z1 + x2z2)²
           = x1x1z1z1 + x1x2z1z2 + x2x1z2z1 + x2x2z2z2
           = <x1x1, x1x2, x2x1, x2x2> • <z1z1, z1z2, z2z1, z2z2>
This is our new feature space with 4 dimensions, and we're doing a dot product in it! Note: if we used an exponent > 2, we'd have gotten a much larger 'virtual' feature space for very little cost. Notation: <a, b, …, z> indicates a vector, with its components explicitly listed. Key point: we don't explicitly create the expanded 'raw' feature space, but the result is the same as if we did!
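A quick numeric check of this identity (the helper phi and the second example z are made up for the illustration):

  import numpy as np

  def phi(v):                       # explicit map: <v1v1, v1v2, v2v1, v2v2>
      return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

  x = np.array([5.0, 4.0])
  z = np.array([2.0, -3.0])         # arbitrary second example, chosen for the check
  print(np.dot(x, z) ** 2)          # kernel computed in the original 2-d space: 4.0
  print(np.dot(phi(x), phi(z)))     # same number via the explicit 4-d map: 4.0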

33 More on Kernels
Another kernel: K(x, z) = tanh(c • (x • z) + d), which relates to the sigmoid of ANNs (here the # of HUs is determined by the # of support vectors). How to choose a good kernel function? Use a tuning set, or just use the Gaussian kernel. Some theory exists: a sum of kernels is a kernel (& other 'closure' properties). We don't want the kernel matrix to be all 0's off the diagonal, since we want to model examples as sums of other examples.

34 Using Gradient Descent Instead of Linear Programming
Recall that earlier we said perceptron training with weight decay is quite similar to SVM training. This is still the case with kernels; ie, we create a new (kernelized) dataset outside the perceptron code and use gradient descent. So here we get the non-linearity provided by HUs in a 'hard-wired' fashion (ie, by using the kernel to non-linearly compute a new representation of the data).
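Below is a minimal sketch (assumed details and names, not the course's HW code) of that idea: kernelize the data outside the learner, then train an ordinary linear unit by gradient descent with weight decay on the new representation.

  import numpy as np

  def gaussian_kernel(a, b, sigma=1.0):
      return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

  def kernelize(X, exemplars, sigma=1.0):
      return np.array([[gaussian_kernel(x, e, sigma) for e in exemplars] for x in X])

  def train_perceptron_weight_decay(F, y, lr=0.1, decay=0.01, epochs=200):
      """F: kernelized features (n x m); y: +/-1 labels.  Minimizes squared error
      plus a weight-decay penalty, which plays the role of the SVM's ||w|| term."""
      n, m = F.shape
      w, theta = np.zeros(m), 0.0
      for _ in range(epochs):
          out = F @ w - theta                    # a single linear unit, no hidden units
          err = out - y
          w -= lr * (F.T @ err / n + decay * w)  # gradient of squared error + decay
          theta -= lr * (-err.mean())
      return w, theta

  X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
  y = np.array([-1, -1, +1, +1])
  F = kernelize(X, X)                            # similarity-to-each-example features
  w, theta = train_perceptron_weight_decay(F, y)
  print(np.sign(F @ w - theta))                  # should reproduce y on this toy set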

35 Where We Are
We have an 'objective' function that we can optimize by Linear Programming (LP): min ||w||1 + μ ||S||1 subject to some constraints. The same LP can handle kernelized features. Free LP solvers exist, and CS 525 teaches Linear Programming. We could also use gradient descent: perceptron learning with 'weight decay' is quite similar, though it uses SQUARED wgts and SQUARED error (the S is this error). Recall cs540 HW4.

36 SVM Wrapup
For approximately a decade, SVMs were the 'hottest' topic in ML (deep NNs now are). They formalize nicely the task of finding a simple model with few 'outliers', use hard-wired 'kernels' to do the job done by HUs in ANNs, and combine ideas from ANNs and 'instance-based' ML. Kernels can be used in any ML algorithm - just preprocess the data to create 'kernel' features - and can handle non-fixed-length feature vectors. Lots of good theory and empirical results.

37 Advanced SVM Details (not on final)
If you have taken (or will take) cs525, Intro to Linear Programming, the remaining slides in this deck might be of interest.

38 Brief Intro to Linear Programs (LP’s)
We need to convert our task into A z ≥ b, which is the basic form of an LP (A is a constant matrix, b is a constant vector, z is a vector of variables; we solve for z). Note: we can convert inequalities containing ≤ into ones using ≥ by multiplying both sides by -1 (eg, 5x ≤ 15 is the same as -5x ≥ -15). We can also handle = (ie, equalities); we could use ≥ and ≤ together to get =, but more efficient methods exist.

39 Brief Intro to Linear Programs (cont.)
In addition, we want to min c • z under the linear A z ≥ b constraints. The vector c says how to penalize settings of the variables in vector z. [Figure: the yellow region is the set of points that satisfy the constraints; the dotted lines are iso-cost lines.] Highly optimized s/w for solving LPs exists (eg, CPLEX, COINS [free]).
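A tiny made-up LP in this form, solved with scipy.optimize.linprog (which expects ≤ constraints, so we negate A z ≥ b into (-A) z ≤ (-b)):

  import numpy as np
  from scipy.optimize import linprog

  c = np.array([1.0, 2.0])                  # minimize z1 + 2*z2
  A = np.array([[1.0, 1.0], [2.0, 0.5]])    # subject to  z1 +   z2  >= 4
  b = np.array([4.0, 3.0])                  #       and 2*z1 + 0.5*z2 >= 3

  res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)], method="highs")
  print(res.x, res.fun)                     # optimal z and the minimized cost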

40 Aside: Our SVM as an LP
Let Apos = our positive training examples and Aneg = our negative training examples (assume 50% pos and 50% neg for notational simplicity), with e = #examples and f = #features. Stack the variables as [W; Spos; Sneg; θ; Z], whose blocks have widths f, e/2, e/2, 1, and f. The constraints then take the block form A z ≥ b shown below (the 1's are identity matrices, often written as I; the next slide carries out the multiplications):

  [  Apos   1   0  -1   0 ]   [ W    ]     [ 1 ]
  [ -Aneg   0   1   1   0 ]   [ Spos ]     [ 1 ]
  [   0     1   0   0   0 ] x [ Sneg ]  ≥  [ 0 ]
  [   0     0   1   0   0 ]   [ θ    ]     [ 0 ]
  [  -1     0   0   0   1 ]   [ Z    ]     [ 0 ]
  [   1     0   0   0   1 ]                [ 0 ]

41 Doing the Multiplies
Apos × W + Spos - θ ≥ 1    // Apos is a matrix, the others are vectors
-Aneg × W + Sneg + θ ≥ 1
Spos ≥ 0                   // Compare to the "The Math Program with Slack Vars" slide.
Sneg ≥ 0
-W + Z ≥ 0                 // Not on the earlier slide, but explained shortly.
W + Z ≥ 0

42 Our C Vector (determines the cost we’re minimizing)
Note we min Z’s not W’s since only Z’s ≥ 0 min [ 0 μ 0 1 ] W S Z = min μ ● S + 1 ● Z Aside: could also penalize  (but would need to add more variables since  can be negative) = min μ ||S||1 + ||W||1 since all S are non-negative and the Z’s ‘squeeze’ the W’s Note here: S = Spos concatenated with Sneg CS540 - Fall 2016 (Shavlik©), Lecture 24, Week 14 12/6&8/16

43 Squeezing the W's
Three slides ago, our matrix multiplies included -W + Z ≥ 0 and W + Z ≥ 0 (these are vectors; the inequality holds for every component). Plus, our MINIMIZE expression included 1 ● Z - if components of Z could be negative, we'd get huge negative values! The two inequalities above can be written as Z ≥ W and Z ≥ -W (holding for each component of Z and W), which means Zi can never be negative, though Wi can (adding the two inequalities produces 2 × Z ≥ 0, so Z ≥ 0). At the minimum, Zi = |Wi|: we get the NON-linear absolute value function in a linear program!

