Margins, support vectors, and linear programming Thanks to Terran Lane and S. Dreiseitl.


1 Margins, support vectors, and linear programming Thanks to Terran Lane and S. Dreiseitl

2 Exercise Derive the vector derivative expressions: Find an expression for the minimum squared error weight vector, w, in the loss function:
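
The derivative identities and the loss function on this slide were images and do not appear in the transcript; in standard matrix notation they are presumably:

    \nabla_w (a^T w) = a
    \nabla_w (w^T A w) = (A + A^T)\, w = 2 A w \quad (A \text{ symmetric})
    L(w) = \lVert t - X w \rVert^2 = (t - X w)^T (t - X w)

where X is the matrix whose rows are the inputs x_i and t is the vector of targets t_i.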

3 Solution to LSE regression
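
Setting the gradient of the loss above to zero gives the standard closed-form solution (a reconstruction; the slide showed this as an image):

    \nabla_w L(w) = -2 X^T (t - X w) = 0
    \Rightarrow\; X^T X \, w = X^T t
    \Rightarrow\; w = (X^T X)^{-1} X^T t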

4 The LSE method The quantity X^T X is called a Gram matrix and is positive semidefinite and symmetric The quantity (X^T X)^{-1} X^T is the pseudoinverse of X May not exist if the Gram matrix is not invertible The complete "learning algorithm" is 2 whole lines of Matlab code
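
The two-line Matlab program itself is not reproduced in the transcript; a minimal NumPy sketch of the same method (variable names are my own) might look like:

    import numpy as np

    def lse_fit(X, t):
        # Solve the normal equations (X^T X) w = X^T t for the LSE weights.
        # Solving the linear system is preferred to forming the explicit inverse.
        return np.linalg.solve(X.T @ X, X.T @ t)

    def lse_predict(X, w):
        # Linear regressor: one real-valued output per row of X.
        return X @ w

Appending a constant column of ones to X fits the bias term as well.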

5 LSE example w=[6.72, -0.36]

6 LSE example (figure: the fitted line t = y(x1, w))

7 LSE example (figure: the fitted line t = y(x1, w); annotations [0.36, 1] and 6.72)

8 The LSE method So far, we have a regressor -- estimates a real-valued t_i for each x_i Can convert to a classifier by assigning t = +1 or -1 to binary-class training data

9 Multiclass trouble? (figure)

10 Handling non-binary data All against all: Train O(c^2) classifiers, one for each pair of classes Run every test point through all classifiers Majority vote for the final classification More stable than one-vs-many Lots more overhead, especially for large c Data may be more balanced Each classifier is trained on only a very small part of the data

11 Support Vector Machines

12 Linear separators are nice... but what if your data looks like this:

13 Linearly nonseparable data 2 possibilities: Use nonlinear separators (different hypothesis space) Possibly an intersection of multiple linear separators, etc. (e.g., a decision tree)

14 Linearly nonseparable data 2 possibilities: Use nonlinear separators (different hypothesis space) Possibly an intersection of multiple linear separators, etc. (e.g., a decision tree) Change the data Nonlinear projection of the data These turn out to be flip sides of each other Easier to think about (do the math for) the 2nd case

15 Nonlinear data projection Suppose you have a "projection function" Φ from the original feature space to a "projected" space Usually the projected space has many more dimensions Do learning with a linear model in the projected space Ex:
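
The slide's example was an image; one common illustration (my choice, not necessarily the slide's) is the degree-2 polynomial projection of a 2-D point:

    \Phi : \mathbb{R}^d \to \mathbb{R}^D, \qquad D \gg d
    \Phi(x_1, x_2) = (x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2)

A linear separator in the projected coordinates corresponds to a quadratic (conic-section) boundary in the original (x_1, x_2) plane.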

16 Nonlinear data projection

17 (figure only)

18 (figure only)

19 Common projections Degree-k polynomials: Fourier expansions:
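
The projection formulas were images; in their usual form (a reconstruction, shown for a scalar input x in the Fourier case):

    \text{Degree-}k \text{ polynomials: } \Phi(x) = \text{all monomials } x_{i_1} x_{i_2} \cdots x_{i_j} \text{ of degree } j \le k
    \text{Fourier expansions: } \Phi(x) = (1, \sin x, \cos x, \sin 2x, \cos 2x, \ldots, \sin kx, \cos kx)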

20-23 Example nonlinear surfaces (figures only; SVM images from lecture notes by S. Dreiseitl)

24 The catch... How many dimensions does Φ(x) have? For degree-k polynomial expansions the dimension grows combinatorially in d and k E.g., for k = 4, d = 256 (16x16 images), that is on the order of 10^8 dimensions. Yikes! For "radial basis functions", the projected space is infinite-dimensional
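
A standard way to make the count explicit (assuming all monomials of degree at most k are included):

    D = \binom{d + k}{k}, \qquad \binom{256 + 4}{4} = \binom{260}{4} \approx 1.9 \times 10^8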

25 Linear surfaces for cheap Can't directly find linear surfaces in the projected space Have to find a clever "method" for finding them indirectly It'll take (quite) a bit of work to get there... Will need a different criterion than least squared error We'll look for the "maximum margin" classifier Surface s.t. class +1 ("true") data falls as far as possible on one side; class -1 ("false") falls as far as possible on the other

26 Max margin hyperplanes (figure: a separating hyperplane and its margin)

27 Max margin is unique (figure: the unique hyperplane with maximum margin)

28 Back to SVMs & margins The margins are parallel to the hyperplane, so they are defined by the same w, plus constant offsets ±b

29 Back to SVMs & margins The margins are parallel to the hyperplane, so they are defined by the same w, plus constant offsets ±b Want to ensure that all data points are "outside" the margins

30 Maximizing the margin So now we have a learning criterion: Pick w to maximize b s.t. all points still fall outside the margins Note: w.l.o.g. we can rescale w arbitrarily (why?)

31 Maximizing the margin So now we have a learning criterion: Pick w to maximize b s.t. all points still fall outside the margins Note: w.l.o.g. we can rescale w arbitrarily (why?) So we can formulate the full problem as: Minimize: Subject to: But how do you do that? And how does this help?
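
The objective and constraints were shown as images; in the standard formulation (after rescaling w so the margins sit at ±1, which may differ slightly from the slide's use of b), the problem is:

    \min_{w, b} \; \tfrac{1}{2} \lVert w \rVert^2
    \text{subject to} \quad t_i \, (w^T x_i + b) \ge 1 \quad \text{for all } i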

32 Quadratic programming Problems of the form Minimize: Subject to: are called “quadratic programming” problems There are off-the-shelf methods to solve them Actually solving this is way, way beyond the scope of this class Consider it a black box If a solution exists, it will be found & be unique Expensive, but not intractably so
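
For reference, the generic quadratic-programming form (not reproduced on the slide) is a quadratic objective with linear constraints, something like:

    \min_z \; \tfrac{1}{2} z^T Q z + c^T z \qquad \text{subject to} \quad A z \le h, \;\; E z = g

with Q positive semidefinite, so the problem is convex.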

33 Nonseparable data What if the data isn't linearly separable? Project into a higher-dimensional space (we'll get there) Allow some "slop" in the system Allow margins to be violated "a little"

34 The new "slackful" QP The ξ_i are "slack variables" Allow margins to be violated a little Still want to minimize margin violations, so add them to the QP instance: Minimize: Subject to:
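
In the standard soft-margin form (a reconstruction; the trade-off constant C may be named differently on the slide):

    \min_{w, b, \xi} \; \tfrac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i
    \text{subject to} \quad t_i \, (w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0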

35 You promised nonlinearity! Where did the nonlinear transform go in all this? Another clever trick With a little algebra (& help from Lagrange multipliers), can rewrite our QP in the form: Maximize: Subject to:
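
The rewritten (dual) problem on the slide is, in its standard form (my reconstruction):

    \max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, t_i t_j \, (x_i^T x_j)
    \text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i t_i = 0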

36 Kernel functions So??? It's still the same linear system Note, though, that the data appears in the system only through the dot product x_i^T x_j

37 Kernel functions So??? It's still the same linear system Note, though, that the data appears in the system only through the dot product x_i^T x_j Can replace x_i^T x_j with Φ(x_i)^T Φ(x_j) The inner product K(x_i, x_j) = Φ(x_i)^T Φ(x_j) is called a "kernel function"

38 Why are kernel fns cool? The cool trick is that many useful projections can be written as kernel functions in closed form I.e., can work with K() rather than Φ() If you know K(x_i, x_j) for every (i, j) pair, then you can construct the maximum margin hyperplane between the projected data without ever explicitly doing the projection!
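
A small numerical check of this point (my own illustration, using the degree-2 homogeneous polynomial kernel K(x, z) = (x^T z)^2 and its explicit 3-D feature map for 2-D inputs):

    import numpy as np

    def phi(x):
        # Explicit degree-2 feature map for a 2-D point:
        # Phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    def poly2_kernel(x, z):
        # Closed-form kernel: K(x, z) = (x . z)^2
        return float(np.dot(x, z)) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    # The two numbers agree: the kernel computes the projected-space dot
    # product without ever forming phi explicitly.
    print(np.dot(phi(x), phi(z)))   # 1.0
    print(poly2_kernel(x, z))       # 1.0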

39 Example kernels Homogeneous degree-k polynomial:

40 Example kernels Homogeneous degree-k polynomial: Inhomogeneous degree-k polynomial:

41 Example kernels Homogeneous degree-k polynomial: Inhomogeneous degree-k polynomial: Gaussian radial basis function:

42 Example kernels Homogeneous degree-k polynomial: Inhomogeneous degree-k polynomial: Gaussian radial basis function: Sigmoidal (neural network):
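
The kernel formulas themselves were images; their standard forms are (the constants σ, κ, θ are my notation and may differ from the slides'):

    \text{Homogeneous degree-}k\text{ polynomial: } K(x, z) = (x^T z)^k
    \text{Inhomogeneous degree-}k\text{ polynomial: } K(x, z) = (x^T z + 1)^k
    \text{Gaussian radial basis function: } K(x, z) = \exp\!\big(-\lVert x - z \rVert^2 / (2\sigma^2)\big)
    \text{Sigmoidal (neural network): } K(x, z) = \tanh(\kappa \, x^T z + \theta)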

43 Side note on kernels What precisely do kernel functions mean? Metric functions take two points and return a (generalized) distance between them What is the equivalent interpretation for kernels? Hint: think about what quantity the kernel function replaces in the max-margin QP formulation

44 Side note on kernels Kernel functions are generalized inner products Essentially, they give you the cosine of the angle between vectors Recall the law of cosines:
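
In vector form (the slide's equation was an image), the law of cosines ties distance to the inner product:

    \lVert u - v \rVert^2 = \lVert u \rVert^2 + \lVert v \rVert^2 - 2\, u^T v, \qquad u^T v = \lVert u \rVert \, \lVert v \rVert \cos\theta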

45 Side note on kernels Replace traditional dot product with “generalized inner product” and get:

46 Side note on kernels Replace traditional dot product with “generalized inner product” and get: Kernel (essentially) represents: Angle between vectors in the projected, high-dimensional space

47 Side note on kernels Replace traditional dot product with “generalized inner product” and get: Kernel (essentially) represents: Angle between vectors in the projected, high-dimensional space Alternatively: Nonlinear distribution of angles in low-dim space
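
Substituting the kernel for the dot product in the law of cosines gives the kernel-induced ("generalized") distance and angle; the slide's formula was presumably along these lines:

    d_K(u, v)^2 = K(u, u) - 2\, K(u, v) + K(v, v)
    \cos\theta_K = \frac{K(u, v)}{\sqrt{K(u, u)\, K(v, v)}}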

48 Example of kernel nonlinearity (figure)

49 (figure only)

50 Using the classifier Solution of the QP gives back a set of weights α_i Data points x_i for which α_i > 0 are called "support vectors" Turns out that we can write w as a weighted sum of the support vectors

51 Using the classifier And our classification rule for a query point x was: So:
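
The two expressions were images; in the standard notation used above they are presumably:

    w = \sum_i \alpha_i \, t_i \, \Phi(x_i)
    y(x) = \operatorname{sign}\!\big( w^T \Phi(x) + b \big) = \operatorname{sign}\!\Big( \sum_i \alpha_i \, t_i \, K(x_i, x) + b \Big)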

52 Using the classifier (figure: support vectors highlighted; SVM images from lecture notes by S. Dreiseitl)

53 Putting it all together (diagram: the original, low-dimensional data)

54 Putting it all together (diagram: the kernel function maps the original data matrix to the kernel matrix)

55 Putting it all together (diagram: the kernel matrix plus the original labels define the Quadratic Program instance, i.e., the "Maximize ... Subject to ..." dual above)

56 Putting it all together (diagram: the QP solver subroutine takes the Quadratic Program instance and returns the support vector weights)

57 Putting it all together (diagram: the support vector weights define the hyperplane in the projected space)

58 Putting it all together (diagram: the support vector weights define the final classifier)

59 Putting it all together (diagram: the final classifier is a nonlinear classifier in the original, low-dimensional space)
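
A compact end-to-end sketch of this pipeline, using scikit-learn's SVC as the black-box QP solver (scikit-learn and the toy data are my choices, not something the slides prescribe):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy nonlinearly separable data: class +1 inside a ring, class -1 outside.
    X = rng.normal(size=(200, 2))
    t = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

    # RBF kernel + soft margin; the dual QP is solved internally.
    clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
    clf.fit(X, t)

    # Only the support vectors are retained for prediction.
    print("number of support vectors:", len(clf.support_))
    print("training accuracy:", clf.score(X, t))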

60 Final notes on SVMs Note that only the x_i for which α_i > 0 actually contribute to the final classifier This is why they are called support vectors All the rest of the training data can be discarded

61 Final notes on SVMs Complexity of training (& ability to generalize) is based only on the amount of training data Not on the dimension of the hyperplane space (the projected space) Good classification performance In practice, SVMs are among the strongest classifiers we have Closely related to neural nets, boosting, etc.

