Support vector machines


1 Support vector machines
(kernel machines)

2 Support Vector (Kernel) Machines: Background
Used primarily for classification based on linear discriminants. Given the linear discriminants wiTx = 0, we can classify the data without estimating posteriors. Why did we estimate posteriors in ANN?
Lecture Notes for E Alpaydın 2010, Introduction to Machine Learning 2e © The MIT Press (V1.0)

3 Support Vector (Kernel) Machines: Background
Discriminants are defined in terms of support vectors. Recall the family car classification problem: the filled circles are the subset of training examples required to define the version space and the hypothesis with maximum margins.

4 Support Vector (Kernel) Machines: Background
Linear discriminants in d > 2 dimensions are called "hyperplanes". Finding optimal hyperplanes subject to constraints is an example of the "quadratic programming problem":
Minimize f(x) = ½xTQx + cTx subject to Ax ≤ b.
Since Q is a positive definite matrix, f(x) is convex. Since f(x) is convex and bounded from below, a unique global minimum exists in the solution space.
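The generic QP form on this slide can be solved numerically. Below is a minimal sketch, not from the lecture, using SciPy's SLSQP solver; the values of Q, c, A, and b are made-up illustrations, and any QP solver would do.

```python
# Minimize f(x) = 1/2 x^T Q x + c^T x  subject to  A x <= b  (illustrative values).
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 0.0], [0.0, 2.0]])   # positive definite -> convex objective
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 2.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([4.0, 0.0, 0.0])

f = lambda x: 0.5 * x @ Q @ x + c @ x
grad = lambda x: Q @ x + c
cons = [{"type": "ineq", "fun": lambda x: b - A @ x}]   # b - Ax >= 0  <=>  Ax <= b

res = minimize(f, x0=np.zeros(2), jac=grad, constraints=cons, method="SLSQP")
print(res.x)   # unique global minimum: f is convex and the feasible set is convex
```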

5 Support Vector (Kernel) Machines: Background
Minimize f(x) = ½xTQx + cTx subject to Ax ≤ b.
If Q is zero (not the case for kernel machines), the problem reduces to linear programming (a linear objective function subject to linear constraints), which is solved by the Simplex algorithm; the optimal x is at a vertex of the convex feasible region.
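For the Q = 0 case, here is a minimal sketch (again not from the lecture) of a linear program solved with SciPy's linprog; c, A, and b are illustrative values.

```python
# min c^T x  subject to  A x <= b, x >= 0  (illustrative values).
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])                  # linprog minimizes c^T x
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([4.0, 2.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)], method="highs")
print(res.x)   # the optimum lies at a vertex of the convex feasible region
```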

6 Active set in the quadratic programming problem
The active set determines which constraints will influence the final result of optimization. In the linear programming problem, the active set defines the allowed convex solution space and the vertex that is a solution. In a convex quadratic programming problem, the solution is not necessarily at a vertex but the active set still guides the search for a solution. In the SVM quadratic programming problem, support vectors define the active set.

7 Optimal Separating Hyperplane: 2 classes
Optimum hyperplane will have maximum margins.

8 2D linearly-separable example
(Figure labels: separating hyperplane, margins, support vectors. On the margins the linear discriminant equals ±1.)

9 Distance of xt from the hyperplane wTx + w0 = 0
Let xa and xb be points on the hyperplane:
wTxa + w0 = wTxb + w0
wT(xa − xb) = 0
xa − xb lies in the hyperplane, so w must be normal to the hyperplane.
(Figure: the two classes are labeled rt = +1 and rt = −1.)

10 Distance of xt from the hyperplane wTx + w0 = 0
Decompose any point x into components parallel and perpendicular to the hyperplane:
x = xp + d w/||w||
xp is the projection of x onto the hyperplane, d is the distance of x from the hyperplane, and w/||w|| is the unit normal of the hyperplane.

11 Distance of xt from the hyperplane wTx + w0 = 0
g(xp) = 0 = g(x − d w/||w||)
0 = wT(x − d w/||w||) + w0
0 = wTx + w0 − d ||w||
d = |wTx + w0| / ||w||
dt = |g(xt)| / ||w||
dt = rt(wTxt + w0) / ||w||
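A minimal sketch of the distance formula dt = rt(wTxt + w0)/||w|| derived on this slide; w, w0, the data points, and the labels are made-up values.

```python
# Signed distances of points x^t from the hyperplane w^T x + w0 = 0.
import numpy as np

w, w0 = np.array([2.0, -1.0]), 0.5
X = np.array([[1.0, 1.0], [-1.0, 2.0], [0.5, 3.0]])   # rows are data points x^t
r = np.array([1, -1, -1])                             # class labels r^t in {+1, -1}

d = r * (X @ w + w0) / np.linalg.norm(w)
print(d)   # positive entries: each point lies on the correct side of the hyperplane
```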

12 Maximizing margins
The distance of xt to the hyperplane is dt = rt(wTxt + w0)/||w||.
Find weights such that rt(wTxt + w0)/||w|| ≥ ρ for all t.
Get max ρ by min ||w|| (not obvious, since w appears in both the numerator and the denominator); fix the scale by requiring ρ||w|| = 1.
The quadratic programming problem becomes: minimize ½||w||² subject to rt(wTxt + w0) ≥ +1 for all t.
Note that the number of constraints = the size of the training set.
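A minimal sketch, not from the lecture, of this maximum-margin problem solved with scikit-learn's SVC; the very large C is an assumption used to approximate a hard margin, and the data are made up.

```python
# Approximately hard-margin linear SVM: large C forbids margin violations.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2.5], [3, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
r = np.array([-1, -1, -1, 1, 1, 1])           # labels r^t

clf = SVC(kernel="linear", C=1e6).fit(X, r)   # large C ~ hard margin
w, w0 = clf.coef_[0], clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)              # rho = 1/||w|| after scaling rho*||w|| = 1
print(w, w0, margin, clf.support_)            # weights, bias, margin, support-vector indices
```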

13 Solution by Lagrange multipliers
The subscript p denotes a function of the "primal" optimization problem:
Lp = ½||w||² − Σt αt [rt(wTxt + w0) − 1], with Lagrange multipliers αt ≥ 0.
Setting ∂Lp/∂w = 0 gives w = Σt αt rt xt; setting ∂Lp/∂w0 = 0 gives Σt αt rt = 0.
Substitute these conditions for a stationary point into the Lagrangian.

14 Expand Lp, substitute, and maximize Ld
The subscript d denotes the dual optimization problem. Expanding Lp and substituting w = Σt αt rt xt and Σt αt rt = 0 gives
Ld = Σt αt − ½ Σt Σs αt αs rt rs (xt)T xs,
to be maximized with respect to the αt subject to αt ≥ 0 and Σt αt rt = 0.

15 Wikipedia on duality in constrained optimization
In constrained optimization, it is often possible to convert the primal problem (i.e. the original form of the optimization problem) to a dual form. In the Lagrangian dual problem, non-negative Lagrange multipliers are used to add the constraints to the objective function. Minimizing the Lagrangian over the primal variables gives them as functions of the Lagrange multipliers, which are called dual variables. We then maximize the objective function with respect to the Lagrange multipliers under their derived constraints (at least, non-negativity). In general, the solution of the dual problem provides only a lower bound on the solution of the primal (minimization) problem; the difference is called the duality gap. For convex optimization problems with linear constraints, the duality gap is zero.

16

17 2D example of a linearly-separable 2-class problem
Task: find the linear discriminant with maximum margins. (Figure labels: separating hyperplane, margins, support vectors.)

18 Slightly different formulation of 2 class problem

19 Enables a simple expression for the distance of a data point from the hyperplane
dt = rt(wTxt + w0) / ||w||

20 Solution by Lagrange multipliers
The subscript p denotes a function of the "primal" optimization variables. Substitute these conditions for a stationary point into the Lagrangian to get the "dual", which is a function of the αt only.

21 Given the αt that maximize Ld, calculate w and w0
Decrease complexity by setting αt = 0 when the data point is sufficiently far from the discriminant to be ignored in the search for the hyperplane with maximum margins; only a small percentage of the αt remain non-zero.
Given the αt that maximize Ld, calculate w = Σt αt rt xt. For all αt > 0, rt(wTxt + w0) = 1; since (rt)² = 1, w0 = rt − wTxt. Usually w0 is averaged over all support vectors.
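A minimal sketch of recovering w and w0 from the dual solution, using scikit-learn for illustration (an assumption; the course assignment uses Weka/libsvm). In scikit-learn, dual_coef_ stores αt·rt for the support vectors only.

```python
# Recover w = sum_t alpha^t r^t x^t and w0 averaged over the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [3, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
r = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, r)
sv, alpha_r = clf.support_vectors_, clf.dual_coef_[0]   # alpha^t r^t over the support set

w = alpha_r @ sv                                        # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[clf.support_] - sv @ w)                  # average of r^t - w^T x^t over SVs
print(w, w0, clf.coef_[0], clf.intercept_[0])           # the two pairs should closely agree
```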

22 Application to classes that are not linearly separable
2 approaches:
soft margin hyperplanes: tradeoff between error and margins
kernel methods: transformation to a feature space

23 Soft Margin Hyperplane: Slack variables
ξt > 0: a penalty is applied to data in the margins or misclassified; soft error = Σt ξt.
(a) & (b): ξt = 0
(c) correct but in the margin: 0 < ξt < 1
(d) misclassified: ξt > 1

24 Soft error also called “hinge” loss function
yt is the output and rt is the target.
Lhinge(yt, rt) = 0 if yt rt ≥ 1, and 1 − yt rt otherwise.
There is a penalty for correctly classified data in the margin; the penalty increases linearly for misclassified data.
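A minimal sketch of the hinge loss described on this slide: zero when yt rt ≥ 1, growing linearly otherwise; the example values are made up.

```python
import numpy as np

def hinge_loss(y, r):
    """y: real-valued outputs; r: targets in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * r)

y = np.array([2.0, 0.4, -0.3])
r = np.array([1, 1, 1])
print(hinge_loss(y, r))   # [0.0, 0.6, 1.3]: outside margin, inside margin, misclassified
```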

25 Optimal soft margin hyperplane with slack variables
Minimize ½||w||² + C Σt ξt subject to rt(wTxt + w0) ≥ 1 − ξt and ξt ≥ 0 for all t. The Lagrangian is
Lp = ½||w||² + C Σt ξt − Σt αt [rt(wTxt + w0) − 1 + ξt] − Σt μt ξt.
C is a regularization parameter; the αt and μt are Lagrange multipliers. C is fine-tuned by cross validation to give the proper tradeoff between margin maximization and error minimization.
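A minimal sketch, not from the lecture, of tuning C by cross validation with scikit-learn; the synthetic data set and the C grid are illustrative assumptions.

```python
# 10-fold cross validation over a grid of C values for a soft-margin linear SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```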

26 Setting partial derivatives of Lp with respect to the "primal" variables w, w0, and ξt equal to zero gives
w = Σt αt rt xt, Σt αt rt = 0, and αt = C − μt, which implies 0 ≤ αt ≤ C because the Lagrange multipliers are non-negative.
Substitution into Lp gives the dual
Ld = Σt αt − ½ Σt Σs αt αs rt rs (xt)T xs,
to be maximized subject to Σt αt rt = 0 and 0 ≤ αt ≤ C for all t.
αt = 0: instances correctly classified with sufficient margin
0 < αt < C: support vectors that determine the weights
αt = C: instances in the margins and/or misclassified

27 If C = 0 then αt = C − μt implies μt = −αt
Then the slack-variable terms cancel and Lp with slack variables reduces to Lp without slack variables.
Larger C makes the penalty on the slack variables more important (i.e. the model is less tolerant of data in the margins or misclassified).

28 ν-SVM: another approach to soft margins
ν is a regularization parameter shown to be an upper bound on the fraction of instances in the margin. ρ is a primal variable related to the optimum margin width = 2ρ/||w||. Additional primal variables are w, w0, and the slack variables ξt. Maximize the dual.
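A minimal sketch, not from the lecture, of a ν-SVM in scikit-learn (NuSVC), where ν replaces C as the regularization parameter; the data and the value of ν are made up.

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=1)

clf = NuSVC(nu=0.1, kernel="linear").fit(X, y)
print(len(clf.support_) / len(X))   # fraction of support vectors, controlled by nu
```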

29 The Kernel Trick: transform inputs xt by basis functions
zt = φ(xt)
g(zt) = wTzt is a linear discriminant in z-space; g(xt) = wTφ(xt) is non-linear in x-space.
The explicit transformation is unnecessary: the kernel K(xt, xs) = φ(xt)T φ(xs) is a function defined on the input space.

30 Polynomial kernels
A degree-2 polynomial kernel on a 2D attribute space, K(x, y) = (xTy + 1)², corresponds to the basis functions φ(x) = (1, √2 x1, √2 x2, √2 x1x2, x1², x2²) of a 6D feature space.
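A minimal sketch verifying the correspondence on this slide: for 2D inputs, the degree-2 polynomial kernel (xTy + 1)² equals the dot product of the 6D basis expansions, so the explicit transformation is unnecessary.

```python
import numpy as np

def phi(x):
    # explicit 6D feature map for the degree-2 polynomial kernel on 2D inputs
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2, x1**2, x2**2])

def K(x, y):
    return (x @ y + 1.0) ** 2

x, y = np.array([0.5, -1.0]), np.array([2.0, 3.0])
print(K(x, y), phi(x) @ phi(y))   # identical values
```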

31 Radial-basis function kernels
K(x, xt) = exp(−||x − xt||² / (2s²)). As the spread s decreases, more data points become support vectors and the boundary becomes more complex.
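A minimal sketch, not from the lecture, using scikit-learn's RBF kernel exp(−γ||x − xt||²), where γ plays the role of 1/(2s²); on a made-up data set, the support-vector count grows as the spread shrinks (γ grows).

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for gamma in [0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    print(gamma, len(clf.support_))   # more support vectors at larger gamma (smaller spread)
```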

32 Kernel engineering
K(x, y) should get larger as x and y get more similar. Application-specific kernels start with a good measure of similarity; the similarity measure depends on the data type: string kernels, graph kernels, image kernels.

33 Empirical kernel map
Define a set of templates mi and a score s(x, mi) = similarity of x to template mi.
f(xt) = [s(xt, m1), s(xt, m2), ..., s(xt, mM)] is an M-dimensional representation of xt.
Define K(x, xt) = f(x)T f(xt).
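A minimal sketch of the empirical kernel map: the templates mi and the similarity score s(x, m) below are illustrative assumptions (here a Gaussian of the squared distance).

```python
import numpy as np

templates = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # m_1 ... m_M (made up)

def s(x, m):
    # similarity of x to template m (an example choice, not prescribed by the slide)
    return np.exp(-np.sum((x - m) ** 2))

def f(x):
    return np.array([s(x, m) for m in templates])            # M-dimensional representation

def K(x, xt):
    return f(x) @ f(xt)                                       # empirical kernel

x, xt = np.array([0.5, 0.5]), np.array([1.5, 0.5])
print(f(x), K(x, xt))
```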

34 Assignment 7 Due
Application of Support Vector Machines using Weka software. Must install libsvm.
Data set: breast cancer diagnostics.
Deliverables: performance metrics, including percent correct classifications and the confusion matrix, compared to classification by ANN-MLP. Use default settings for both techniques except normalization. Use 10-fold cross validation for testing.

35 Hints on installing and running libsvm
Use Weka 3.7. Find the package manager under Setup and Tools. Find libsvm in the list of packages. In Explorer, open Classify; under Choose, find libsvm under functions.
Discuss this paper in your report: what machine-learning techniques were used? How does their success compare with yours?

36 Multiple Classes, Ci i=1,...,K
Beyond SVM dichotomizers: 1 vs all. Train K SVMs gi(x), i = 1,...,K; in testing, assign x to the class with the largest gi(x).
Recall the (price, power) example: 2 attributes, multiple classes, axis-aligned rectangle hypothesis class. Same type of data, but the label is a Boolean vector (all zeros but one).
Treat the K-class classification problem as K 2-class problems and hence train K hypotheses. Examples belonging to Ci are positive for hi; examples belonging to all other classes are negative for hi. Training minimizes the sum of error over all classes. Generalization includes doubt if a (price, power) datum is not in one and only one boundary.
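A minimal sketch, not from the lecture, of the 1-vs-all scheme with scikit-learn: train K SVMs gi(x) and assign a test point to the class with the largest gi(x); the synthetic data are made up.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, n_classes=3,
                           n_informative=3, n_redundant=0, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
scores = ovr.decision_function(X[:1])      # one g_i(x) per class
print(scores, ovr.predict(X[:1]))          # predicted class = argmax_i g_i(x)
```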

37 One-Class SVM Machines
Consider a sphere with center a and radius R. Find a and R that define a soft boundary on the high-density data.
(a) data inside the sphere, not involved in finding the sphere
(b) data on the sphere (ξt = 0), used to find R given a
(c) data outside the sphere (ξt > 0)
Objective: distinguish (a) & (b) from (c).

38 Add Lagrange multipliers αt ≥ 0 and γt ≥ 0 for the constraints
Setting the derivatives of Lp with respect to the primal variables R, a, and ξt to zero gives Σt αt = 1, a = Σt αt xt, and αt = C − γt, hence 0 ≤ αt ≤ C.
Substituting back into Lp we get the dual, Ld = Σt αt (xt)T xt − Σt Σs αt αs (xt)T xs, to be maximized subject to Σt αt = 1 and 0 ≤ αt ≤ C.
Given the αt, find the radius R as the distance from the center a to any support vector on the sphere.

39 One-Class Kernel Machines
Typical instance in a high-density region: αt = 0.
Support vectors (ξt = 0) define R: 0 < αt < C.
Outliers (ξt > 0): αt = C.
Dot products in Ld can be replaced by a kernel.
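A minimal sketch, not from the lecture, of a one-class kernel machine using scikit-learn's OneClassSVM; scikit-learn parameterizes the soft boundary by ν rather than the slide's C, and the data are made up.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))            # high-density "normal" data
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])       # one typical point, one outlier

oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
print(oc.predict(X_test))                          # +1 = inside boundary, -1 = outlier
```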

40 One-Class Gaussian Kernel Machine
Consider x an outlier (not similar to the xt) if Σt αt K(x, xt) < Rc. If the variance (spread) is small, the sum drops below the threshold Rc unless x is very close to some xt.

41

42 SVM for linear regression: f(x)= wTx +w0
ε-sensitive error function: no penalty for deviations < ε; the penalty increases linearly for deviations > ε. (Figure: comparison of the squared-residual loss and the ε-sensitive loss.)
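A minimal sketch of the ε-sensitive error: zero inside the ε-tube, growing linearly with |residual| − ε outside it; ε = 0.5 and the residuals are made-up values.

```python
import numpy as np

def eps_sensitive(residual, eps=0.5):
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.array([0.2, -0.4, 1.0, -2.0])
print(eps_sensitive(r))   # [0.0, 0.0, 0.5, 1.5]
```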

43 SVM for linear regression: f(x)= wTx +w0
Use “slack” variable for soft margins Subject to Maximize dual Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 43

44 SVM for linear regression: f(x)= wTx +w0
In the tube: αt+ = αt− = 0.
On the tube: 0 < αt+ < C (or 0 < αt− < C).
Positive slack: αt+ = C (or αt− = C).

45 Kernel Regression
(Figures: fits with a quadratic kernel and with Gaussian kernels of 2 different spreads. Circled = support vector on the margin; squared = support vector with slack.)

46 Maximizing margins
The distance of xt to the hyperplane is dt = rt(wTxt + w0)/||w||.
Find weights such that rt(wTxt + w0)/||w|| ≥ ρ for all t.
Get max ρ by min ||w||. For a non-trivial solution, require ρ||w|| = 1 ("order-reversing duality").

47 Kernel machines: more advantages
Kernel functions are application-specific measures of similarity. Recall quadratic multivariate regression in x-space: we defined new variables z1 = x1, z2 = x2, z3 = x1², z4 = x2², z5 = x1x2, which enabled the use of a linear model in the new z-space (basis functions, kernel trick: Chapter 13).

48 Kernel machines: quadratic programming problem
Minimize f(x) = ½xTQx + cTx subject to Ax ≤ b.
If Q is a positive definite matrix, then f(x) is convex. If f(x) is convex and bounded from below, then a unique global minimum exists. The optimum hyperplane problem has all of these properties.
If Q is zero (not the case for kernel machines), the problem reduces to linear programming (a linear objective function subject to linear constraints).

49 Linear programming problem
When both the objective function and the constraints are linear functions of the model parameters, the Simplex algorithm can be used to find a solution. Example: metabolic flux balance under conditions of optimum growth rate. Linear constraints define a convex region of allowed solutions in parameter space.

50 Linear programming problem
The level sets of the linear objective function are hyperplanes in parameter space. Larger values of the objective function are found by moving such a hyperplane through parameter space in the direction of its gradient. The last intersection with the allowed solution space will be a point, an edge, or a face; in every case, the maximum value of the objective function is attained at a vertex of the solution space.

51 Kernel Dimensionality Reduction
Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel).
Kernel LDA
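A minimal sketch, not from the lecture, of kernel PCA with scikit-learn: with a linear kernel it matches canonical PCA, while an RBF kernel gives a nonlinear projection; the data set and γ are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

Z_lin = KernelPCA(n_components=2, kernel="linear").fit_transform(X)   # ~ canonical PCA
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=5.0).fit_transform(X)
print(Z_lin[:2])
print(Z_rbf[:2])
```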

52 Multiple Kernel Learning
Fixed kernel combinations: if K1(x, y) and K2(x, y) are valid kernels, then a non-negative weighted sum K(x, y) = c1 K1(x, y) + c2 K2(x, y), c1, c2 ≥ 0, is also valid.
Adaptive kernel combination: determine both the support vector parameters αt and the kernel weights.
Localized kernel combination: the weights are parameterized functions of the input.
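A minimal sketch of a fixed kernel combination, not from the lecture: a non-negative weighted sum of a linear and an RBF kernel passed to scikit-learn as a precomputed kernel matrix; the weights c1, c2 and the data are made up.

```python
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=4, random_state=0)

c1, c2 = 0.3, 0.7                                   # fixed, non-negative kernel weights
K = c1 * linear_kernel(X, X) + c2 * rbf_kernel(X, X, gamma=0.5)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))                              # training accuracy as a sanity check
```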

53 Alternatives to 1 vs all
Pairwise separation: K(K−1)/2 SVMs, one for each pair of classes.
Single multi-class optimization: zt is the class index of xt; K∙N variables, K weight vectors.
1 vs all is generally the preferred approach.
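A minimal sketch, not from the lecture, of pairwise (one-vs-one) separation with scikit-learn: K classes give K(K−1)/2 SVMs; the iris data set is used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                       # K = 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_), ovo.score(X, y))            # 3 pairwise SVMs, training accuracy
```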


Download ppt "Support vector machines"

Similar presentations


Ads by Google