1
Support Vector Machines and Kernel Methods
Kenan Gençol
Department of Electrical and Electronics Engineering, Anadolu University
Submitted in the course MAT592 Seminar
Advisor: Prof. Dr. Yalçın Küçük, Department of Mathematics
2
Agenda
Linear Discriminant Functions and Decision Hyperplanes
Introduction to SVM
Support Vector Machines
Introduction to Kernels
Nonlinear SVM
Kernel Methods
3
Linear Discriminant Functions and Decision Hyperplanes
Figure 1. Two classes of patterns and a linear decision function
4
Linear Discriminant Functions and Decision Hyperplanes
Each pattern is represented by a vector $x = [x_1 \; x_2]^T$.
The linear decision function has the equation
$$g(x) = w_1 x_1 + w_2 x_2 + w_0 = 0$$
where $w_1, w_2$ are the weights and $w_0$ is the bias term.
5
Linear Discriminant Functions and Decision Hyperplanes
The general decision hyperplane equation in d-dimensional space has the form
$$g(x) = w^T x + w_0 = 0$$
where $w = [w_1 \; w_2 \; \ldots \; w_d]^T$ is the weight vector and $w_0$ is the bias term.
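As a minimal sketch of how such a hyperplane is used, the snippet below evaluates g(x) = w · x + w0 and assigns a class by the sign of the result. The function names, the tie-breaking rule at g(x) = 0, and the example line x1 + x2 - 1 = 0 are illustrative assumptions, not part of the slides.

```python
def decision_value(w, w0, x):
    """Return g(x) = w . x + w0 for weight vector w and bias w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, w0, x):
    """Assign +1 or -1 by the side of the hyperplane g(x) = 0.

    Points exactly on the hyperplane are assigned +1 (an arbitrary choice).
    """
    return 1 if decision_value(w, w0, x) >= 0 else -1


# Hypothetical 2-D example: the line x1 + x2 - 1 = 0.
w_example, w0_example = [1.0, 1.0], -1.0
labels = [classify(w_example, w0_example, p) for p in ([2.0, 2.0], [0.0, 0.0])]
```

Here the first point lies on the positive side of the line and the second on the negative side.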
6
Introduction to SVM
There are many hyperplanes that separate the two classes.
Figure 2. An example of two possible classifiers
7
Introduction to SVM
THE GOAL:
Find the direction w and bias w0 that give the maximum possible margin; in other words, orient the hyperplane so that it is as far as possible from the closest members of both classes.
8
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes
9
SVM: Linearly Separable Case
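Our training data is of the form
$$\{x_i, y_i\}, \quad i = 1, \ldots, L, \quad y_i \in \{-1, +1\}, \quad x_i \in \mathbb{R}^D$$
The hyperplane can be described by
$$x \cdot w + b = 0$$
and is called the separating hyperplane.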
10
SVM: Linearly Separable Case
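Select variables w and b so that:
$$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1$$
$$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1$$
These equations can be combined into:
$$y_i (x_i \cdot w + b) - 1 \ge 0 \quad \forall i$$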
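The combined constraint y_i (x_i · w + b) - 1 >= 0 is easy to check numerically for a candidate (w, b). The sketch below does so on a small toy set; the data, the helper names, and the tolerance are assumptions made for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def constraint_slacks(w, b, X, y):
    """Return y_i (x_i . w + b) - 1 for every training point."""
    return [yi * (dot(xi, w) + b) - 1.0 for xi, yi in zip(X, y)]


def is_feasible(w, b, X, y, tol=1e-12):
    """True if every point satisfies y_i (x_i . w + b) - 1 >= 0."""
    return all(s >= -tol for s in constraint_slacks(w, b, X, y))


# Hypothetical toy data: two points of class +1 and one of class -1.
X_toy = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
y_toy = [1, 1, -1]

slacks = constraint_slacks([0.5, 0.5], 0.0, X_toy, y_toy)
```

A slack of exactly zero marks a point that sits on one of the two bounding hyperplanes; a positive slack marks a point further away.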
11
SVM: Linearly Separable Case
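The points that lie closest to the separating hyperplane are called support vectors (circled points in the diagram). The hyperplanes through them,
$$H_1: \; x_i \cdot w + b = +1, \qquad H_2: \; x_i \cdot w + b = -1,$$
are called supporting hyperplanes.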
13
SVM: Linearly Separable Case
The hyperplane's equidistance from H1 and H2 means that d1 = d2, and this quantity is known as the SVM margin:
$$d_1 + d_2 = \frac{2}{\|w\|}, \qquad d_1 = d_2 = \frac{1}{\|w\|}$$
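The margin value follows from the point-to-hyperplane distance formula. The short derivation below assumes that H1 and H2 are scaled so that x · w + b equals +1 and -1 on them, which is consistent with the constraint y_i (x_i · w + b) - 1 >= 0 used in the optimization.

```latex
% Distance from a point x_0 to the hyperplane x . w + b = 0
\operatorname{dist}(x_0) = \frac{|x_0 \cdot w + b|}{\|w\|}

% A support vector x_s lies on H_1 or H_2, so |x_s . w + b| = 1 and
d_1 = d_2 = \frac{1}{\|w\|},
\qquad
d_1 + d_2 = \frac{2}{\|w\|}
```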
14
SVM: Linearly Separable Case
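Maximizing the margin $1/\|w\|$ is equivalent to minimizing $\|w\|$:
$$\min \|w\| \quad \text{such that} \quad y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$$
Minimizing $\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, which makes it possible to perform Quadratic Programming (QP) optimization.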
15
SVM: Linearly Separable Case
Optimization problem:
Minimize
$$\frac{1}{2}\|w\|^2$$
subject to
$$y_i (x_i \cdot w + b) - 1 \ge 0, \quad i = 1, 2, \ldots, L$$
16
SVM: Linearly Separable Case
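This is an inequality-constrained optimization problem with Lagrangian function
$$L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] \qquad (1)$$
where $\alpha_i \ge 0$, $i = 1, 2, \ldots, L$, are Lagrange multipliers.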
17
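SVM
The corresponding KKT conditions are:
$$\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{L} \alpha_i y_i x_i \qquad (2)$$
$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{L} \alpha_i y_i = 0 \qquad (3)$$
together with $\alpha_i \ge 0$ and $\alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] = 0$ for $i = 1, 2, \ldots, L$.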
18
SVM
This is a convex optimization problem. The cost function is convex, and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian duality.
19
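SVM
Substituting (2) and (3) into (1) gives a new formulation which, being dependent only on α, we need to maximize:
$$L_D(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$$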
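The substitution can be written out step by step. This sketch assumes that (1) is the usual primal Lagrangian of the maximum-margin problem and that (2) and (3) are the stationarity conditions w = Σ α_i y_i x_i and Σ α_i y_i = 0; the names L_P and L_D for the primal and dual functions are a labeling choice.

```latex
\begin{align*}
L_P &= \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] \\
    &= \tfrac{1}{2}\, w \cdot w - w \cdot \sum_i \alpha_i y_i x_i
       - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
    &= \tfrac{1}{2}\, w \cdot w - w \cdot w - 0 + \sum_i \alpha_i
       && \text{using (2) and (3)} \\
    &= \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
     = L_D
\end{align*}
```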
20
SVM
This is called the dual form (Lagrangian dual) of the primal form. The dual form only requires the dot products of the input vectors to be calculated. This is important for the kernel trick, which will be described later.
21
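SVM
So the problem becomes a dual problem:
Maximize
$$L_D(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$$
subject to
$$\alpha_i \ge 0 \;\; \forall i, \qquad \sum_{i=1}^{L} \alpha_i y_i = 0$$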
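To make the dual concrete, the sketch below evaluates the dual objective (the sum of the multipliers minus half the label-weighted sum of pairwise dot products) on a hypothetical two-point problem, x1 = (1, 1) with y1 = +1 and x2 = (-1, -1) with y2 = -1. The equality constraint forces α1 = α2 = a, so the objective reduces to 2a - 4a², which peaks at a = 1/4. The data and function names are assumptions for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y):
    """Sum of alpha_i minus 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def satisfies_dual_constraints(alpha, y, tol=1e-12):
    """Check alpha_i >= 0 for all i and sum_i alpha_i y_i = 0."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return all(a >= 0 for a in alpha) and balanced


# Hypothetical two-point training set.
X_pair = [[1.0, 1.0], [-1.0, -1.0]]
y_pair = [1, -1]

best = dual_objective([0.25, 0.25], X_pair, y_pair)
```

Evaluating the objective at neighboring feasible points such as (0.2, 0.2) or (0.3, 0.3) gives smaller values, in line with the closed-form maximum.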
22
SVM
Differentiating with respect to the αi's and using the constraint equation, a system of equations is obtained. Solving the system yields the Lagrange multipliers, and the optimum hyperplane is given by
$$w = \sum_{i=1}^{L} \alpha_i y_i x_i, \qquad b = y_s - x_s \cdot w$$
where $x_s$ is any support vector.
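Once the multipliers are known, the hyperplane can be recovered as w = Σ α_i y_i x_i, with b obtained from any support vector x_s through y_s (x_s · w + b) = 1. The sketch below does this for a hypothetical two-point problem whose dual solution is α = (1/4, 1/4); averaging b over all support vectors is a common numerical convenience and an assumption here, not something the slides state.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def recover_hyperplane(alpha, X, y, tol=1e-9):
    """Return (w, b, support_indices) from dual multipliers alpha."""
    dim = len(X[0])
    # w = sum_i alpha_i y_i x_i, computed coordinate by coordinate.
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(dim)]
    # Support vectors are the points with alpha_i > 0.
    support = [i for i, a in enumerate(alpha) if a > tol]
    if not support:
        raise ValueError("no support vectors: all multipliers are zero")
    # For a support vector, y_s (x_s . w + b) = 1 and y_s^2 = 1 give b = y_s - x_s . w.
    b = sum(y[i] - dot(X[i], w) for i in support) / len(support)
    return w, b, support


# Hypothetical two-point training set and its dual solution.
X_pair = [[1.0, 1.0], [-1.0, -1.0]]
y_pair = [1, -1]
w_opt, b_opt, sv = recover_hyperplane([0.25, 0.25], X_pair, y_pair)
```

For this example the recovered hyperplane passes through the origin with w = (0.5, 0.5), so the margin 2/||w|| equals the distance between the two points.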
23
SVM
Some Notes:
SUPPORT VECTORS are the feature vectors with αi > 0, i = 1, 2, ..., L.
The cost function is strictly convex: its Hessian matrix is positive definite.
Any local minimum is therefore also global and unique.
The optimal hyperplane classifier of an SVM is UNIQUE.
Although the solution is unique, the resulting Lagrange multipliers are not necessarily unique.
24
Kernels: Introduction
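When applying our SVM to linearly separable data, we started by creating a matrix H from the dot products of our input variables:
$$H_{ij} = y_i y_j \, k(x_i, x_j), \qquad k(x_i, x_j) = x_i \cdot x_j = x_i^T x_j$$
Here $k(x_i, x_j) = x_i^T x_j$ is known as the linear kernel, an example of a family of functions called kernel functions.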
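A minimal sketch of building such a matrix, assuming its entries are H_ij = y_i y_j k(x_i, x_j) with the linear kernel as the default. The function names and the two-point data set are hypothetical.

```python
def linear_kernel(u, v):
    """Linear kernel: the ordinary dot product u . v."""
    return sum(a * b for a, b in zip(u, v))


def build_H(X, y, kernel=linear_kernel):
    """Return the matrix with entries H_ij = y_i y_j k(x_i, x_j)."""
    n = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)] for i in range(n)]


# Hypothetical two-point training set.
X_pair = [[1.0, 1.0], [-1.0, -1.0]]
y_pair = [1, -1]
H = build_H(X_pair, y_pair)
```

Passing a different kernel function as the third argument changes the geometry without changing any other part of the computation.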
25
Kernels: Introduction
The kernel functions are all based on calculating inner products of two vectors. This means that if the inputs are mapped to a higher-dimensional space by a nonlinear mapping function $x \mapsto \phi(x)$, only the inner products of the mapped inputs need to be determined, without needing to calculate φ explicitly. This is called the "kernel trick".
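A small numerical illustration of the kernel trick, using a standard textbook example that is an assumption here rather than something taken from the slides: for 2-D inputs, the kernel k(x, z) = (x · z)² equals the inner product of the explicit feature map φ(x) = (x1², √2 x1 x2, x2²), so the mapped vectors never need to be formed.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (3-D feature space)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in input space."""
    return dot(x, z) ** 2


x_a, x_b = [1.0, 2.0], [3.0, 4.0]
explicit = dot(phi(x_a), phi(x_b))   # inner product in feature space
implicit = poly2_kernel(x_a, x_b)    # same value without computing phi
```

Both routes give the same number up to floating-point rounding; the kernel route only needs one dot product in the original space.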
26
Kernels: Introduction
Kernel Trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but separable/regressable in a higher dimensional space.
27
Kernels: Introduction
Popular Kernel Families:
Radial Basis Function (RBF) Kernel
Polynomial Kernel
Sigmoidal (Hyperbolic Tangent) Kernel
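The three families can be sketched in their commonly used forms. The parameter names (sigma, a, b) and default values below are assumptions chosen for illustration; other parameterizations are in use.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))


def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def polynomial_kernel(x, z, a=1.0, b=2):
    """Polynomial kernel: (x . z + a)^b."""
    return (dot(x, z) + a) ** b


def sigmoid_kernel(x, z, a=1.0, b=0.0):
    """Sigmoidal kernel: tanh(a x . z - b).

    Note: this function is positive semi-definite only for some (a, b).
    """
    return math.tanh(a * dot(x, z) - b)
```

For example, `rbf_kernel(x, x)` is 1 for any x, since the squared distance of a point to itself is zero.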
28
Nonlinear Support Vector Machines
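The support vector machine with kernel functions becomes:
Maximize
$$\sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$
subject to
$$\alpha_i \ge 0 \;\; \forall i, \qquad \sum_{i=1}^{L} \alpha_i y_i = 0$$
and the resulting classifier:
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{L} \alpha_i y_i \, k(x_i, x) + b \right)$$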
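A minimal sketch of the kernel classifier, assuming the decision rule is the sign of Σ α_i y_i k(x_i, x) + b with the sum effectively running over support vectors only. The multipliers, bias, and data below come from a hypothetical two-point problem solved with the linear kernel; with a nonlinear kernel only the `kernel` argument changes.

```python
def linear_kernel(u, v):
    """Linear kernel: the ordinary dot product u . v."""
    return sum(a * b for a, b in zip(u, v))


def kernel_decision(x, alpha, X, y, b, kernel):
    """Return sum_i alpha_i y_i k(x_i, x) + b, skipping zero multipliers."""
    total = sum(
        alpha[i] * y[i] * kernel(X[i], x)
        for i in range(len(X))
        if alpha[i] > 0
    )
    return total + b


def predict(x, alpha, X, y, b, kernel):
    """Class label +1 or -1 from the sign of the decision value."""
    return 1 if kernel_decision(x, alpha, X, y, b, kernel) >= 0 else -1


# Hypothetical trained model: two support vectors, alpha = 1/4 each, b = 0.
X_sv = [[1.0, 1.0], [-1.0, -1.0]]
y_sv = [1, -1]
alpha_sv = [0.25, 0.25]
b_sv = 0.0

label_pos = predict([2.0, 0.0], alpha_sv, X_sv, y_sv, b_sv, linear_kernel)
label_neg = predict([-3.0, 1.0], alpha_sv, X_sv, y_sv, b_sv, linear_kernel)
```

The weight vector w is never formed explicitly; prediction needs only kernel evaluations against the support vectors.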
29
Nonlinear Support Vector Machines
Figure 4. The SVM architecture employing kernel functions.
30
Kernel Methods
Recall that a kernel function computes the inner product of the images of two data points under an embedding φ:
$$k(x, y) = \langle \phi(x), \phi(y) \rangle$$
A function $k: X \times X \to \mathbb{R}$ is a kernel if
1. k is symmetric: k(x,y) = k(y,x)
2. k is positive semi-definite, i.e., the "Gram matrix" Kij = k(xi,xj) is positive semi-definite for every finite set of points x1, ..., xn.
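Both conditions can be spot-checked numerically on a finite sample. The sketch below tests symmetry of the Gram matrix directly and tests positive semi-definiteness by evaluating the quadratic form zᵀKz for random vectors z. This is only a necessary-condition check on one sample, not a proof; the sample points, the RBF kernel used as a positive example, and the negated dot product used as a counterexample are all assumptions for illustration.

```python
import math
import random


def gram_matrix(points, kernel):
    """Gram matrix K_ij = k(x_i, x_j) for a list of points."""
    return [[kernel(p, q) for q in points] for p in points]


def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))


def passes_psd_spot_check(K, trials=200, tol=1e-9, seed=0):
    """Return False if any random z gives z^T K z < 0 (beyond tolerance)."""
    rng = random.Random(seed)
    n = len(K)
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        form = sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))
        if form < -tol:
            return False
    return True


def rbf(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def negated_dot(x, y):
    """Symmetric but NOT positive semi-definite: a deliberate counterexample."""
    return -sum(a * b for a, b in zip(x, y))


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5]]
K_rbf = gram_matrix(points, rbf)
K_bad = gram_matrix(points, negated_dot)
```

The RBF Gram matrix passes both checks, while the negated dot product is symmetric yet fails the quadratic-form test.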
31
Kernel Methods
For which kernels does there exist a pair {H, φ} with the properties described above, and for which does there not? The answer is given by Mercer's condition.
32
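Mercer's condition
Let $X$ be a compact subset of $\mathbb{R}^d$, let $x \in X$, and let $\phi: x \mapsto \phi(x) \in H$ be a mapping, where H is a Euclidean space. Then the inner product operation has an equivalent representation
$$\langle \phi(x), \phi(z) \rangle = K(x, z)$$
and $K(x, z)$ is a symmetric function satisfying the condition
$$\int_X \int_X K(x, z) \, g(x) \, g(z) \, dx \, dz \ge 0$$
for any $g(x)$, $x \in X$, such that
$$\int_X g(x)^2 \, dx < +\infty$$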
33
Mercer's Theorem
Theorem. Suppose K is a continuous symmetric non-negative definite kernel on [a, b] × [a, b], and let $T_K$ be the associated integral operator, $(T_K f)(s) = \int_a^b K(s, t) f(t) \, dt$. Then there is an orthonormal basis $\{e_i\}_i$ of $L^2[a, b]$ consisting of eigenfunctions of $T_K$ such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on [a, b], and K has the representation
$$K(s, t) = \sum_{j=1}^{\infty} \lambda_j \, e_j(s) \, e_j(t)$$
where the convergence is absolute and uniform.
34
Kernel Methods
Suppose k1 and k2 are valid (symmetric, positive semi-definite) kernels on X. Then the following are valid kernels: 1. 2. 3.
35
Kernel Methods 4. 5. 6. 7.
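Three standard closure rules can be checked on a small example: the sum k1 + k2, a positive multiple c · k1 with c > 0, and the product k1 · k2 of two valid kernels are again valid kernels. Which numbered items these correspond to is an assumption. The sketch uses two points so that each Gram matrix is 2 × 2, where positive semi-definiteness can be decided exactly from the diagonal entries and the determinant. The kernels and points are illustrative choices.

```python
import math


def linear(x, z):
    return sum(a * b for a, b in zip(x, z))


def rbf(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_sum(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)


def kernel_scale(c, k1):
    """Positive multiple of a kernel; c must be > 0."""
    return lambda x, z: c * k1(x, z)


def kernel_product(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)


def gram_2x2(kernel, p, q):
    return [[kernel(p, p), kernel(p, q)], [kernel(q, p), kernel(q, q)]]


def is_psd_2x2(K, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff both diagonal entries and det are >= 0."""
    a, b, c, d = K[0][0], K[0][1], K[1][0], K[1][1]
    return abs(b - c) <= tol and a >= -tol and d >= -tol and a * d - b * c >= -tol


p, q = [1.0, 0.0], [1.0, 1.0]
candidates = {
    "sum": kernel_sum(linear, rbf),
    "scaled": kernel_scale(3.0, linear),
    "product": kernel_product(linear, rbf),
}
results = {name: is_psd_2x2(gram_2x2(k, p, q)) for name, k in candidates.items()}
```

A passing check on one pair of points illustrates the rules but does not prove them; the general statements hold for every finite set of points.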
38
Thank You