Support Vector Machines and Kernel Methods Kenan Gençol Department of Electrical and Electronics Engineering Anadolu University submitted in the course MAT592 Seminar Advisor: Prof. Dr. Yalçın Küçük Department of Mathematics
Agenda Linear Discriminant Functions and Decision Hyperplanes Introduction to SVM Support Vector Machines Introduction to Kernels Nonlinear SVM Kernel Methods
Linear Discriminant Functions and Decision Hyperplanes Figure 1. Two classes of patterns and a linear decision function
Linear Discriminant Functions and Decision Hyperplanes Each pattern is represented by a vector Linear decision function has the equation where w1,w2 are weights and w0 is the bias term
Linear Discriminant Functions and Decision Hyperplanes The general decision hyperplane equation in d-dimensional space has the form: where w = [w1 w2 ....wd] is the weight vector and w0 is the bias term.
Figure 2. An example of two possible classifiers Introduction to SVM There are many hyperplanes that separates two classes Figure 2. An example of two possible classifiers
Introduction to SVM THE GOAL: Our goal is to search for direction w and bias w0 that gives the maximum possible margin, or in other words, to orientate this hyperplane in such a way as to be as far as possible from the closest members of both classes.
SVM: Linearly Separable Case Figure 3. Hyperplane through two linearly separable classes
SVM: Linearly Separable Case Our training data is of the form: This hyperplane can be described by and called separating hyperplane.
SVM: Linearly Separable Case Select variables w and b so that: These equations can be combined into:
SVM: Linearly Separable Case The points that lie closest to the separating hyperplane are called support vectors (circled points in diagram) and are called supporting hyperplanes.
SVM: Linearly Separable Case Figure 3. Hyperplane through two linearly separable classes (repeated)
SVM: Linearly Separable Case The hyperplane’s equidistance from H1 and H2 means that d1= d2 and this quantity is known as SVM Margin: d1+ d2 = d1= d2=
SVM: Linearly Separable Case Maximizing Minimizing min such that yi(xi . w + b) -1 >= 0 Minimizing is equivalent to minimizing to perform Quadratic Programming (QP) optimization
SVM: Linearly Separable Case Optimization problem: Minimize subject to
SVM: Linearly Separable Case This is an inequality constrained optimization problem with Lagrangian function: where αi >= 0 i=1,2,....,L are Lagrange multipliers. (1)
SVM The corresponding KKT conditions are: (2) (3)
SVM This is a convex optimization problem.The cost function is convex and the set of constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so called Lagrangian Duality
SVM Substituing (2) and (3) gives a new formulation which being dependent on α, we need to maximize.
SVM This is called Dual form (Lagrangian Dual) of the primary form. Dual form only requires the dot product of each input vector to be calculated. This is important for the Kernel Trick which will be described later.
SVM So the problem becomes a dual problem: Maximize subject to
SVM Differentiating with respect to αi ‘s and using the constraint equation, a system of equations is obtained. Solving the system, the Lagrange multipliers are found and optimum hyperplane is given according to the formula:
SVM SUPPORT VECTORS are the feature vectors for αi > 0 i=1,2,....,L Some Notes: SUPPORT VECTORS are the feature vectors for αi > 0 i=1,2,....,L The cost function is strictly convex. Hessian matrix is positive definite. Any local minimum is also global and unique. The optimal hyperplane classifier of a SVM is UNIQUE. Although the solution is unique, the resulting Lagrange multipliers are not unique.
Kernels: Introduction When applying our SVM to linearly separable data we have started by creating a matrix H from the dot product of our input variables: being known as Linear Kernel, an example of a family of functions called Kernel functions.
Kernels: Introduction The set of kernel functions are all based on calculating inner products of two vectors. This means if the function is mapped to a higher dimensionality space by a nonlinear mapping function only the inner products of the mapped inputs need to be determined without needing to explicitly calculate Ф . This is called “Kernel Trick”
Kernels: Introduction Kernel Trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but separable/regressable in a higher dimensional space.
Kernels: Introduction Popular Kernel Families: Radial Basis Function (RBF) Kernel Polynomial Kernel Sigmodial (Hyperbolic Tangent) Kernel
Nonlinear Support Vector Machines The support vector machine with kernel functions becomes: and the resulting classifier:
Nonlinear Support Vector Machines Figure 4. The SVM architecture employing kernel functions.
Kernel Methods Recall that a kernel function computes the inner product of the images under an embedding of two data points is a kernel if 1. k is symmetric: k(x,y) = k(y,x) 2. k is positive semi-definite, i.e., the “Gram Matrix” Kij = k(xi,xj) is positive semi-definite.
Kernel Methods The answer for which kernels does there exist a pair {H,φ}, with the properties described above, and for which does there not is given by Mercer’s condition.
Mercer’s condition Let be a compact subset of and let and a mapping where H is an Euclidean space. Then the inner product operation has an equivalent representation and is a symmetric function satisfying the following condition for any , such that
Mercer’s Theorem Theorem. Suppose K is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis {ei}i of L2[a, b] consisting of eigenfunctions of TK such that the corresponding sequence of eigenvalues {λi}i is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on [a, b] and K has the representation where the convergence is absolute and uniform.
Kernel Methods Suppose k1and k2 are valid (symmetric, positive definite) kernels on X. Then the following are valid kernels: 1. 2. 3.
Kernel Methods 4. 5. 6. 7.
References [1] C.J.C. Burges, “Tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery 2, 121-167, 1998. [2] Marques de Sa, J.P., “Pattern Recognition Concepts,Methods and Applications”, Springer, 2001. [3] S. Theodoridis, “Pattern Recognition”, Elsevier Academic Press, 2003.
References [4] T. Fletcher, “Support Vector Machines Explained”, UCL, March,2005. [5] Cristianini,N., Shawe-Taylor,J., “Kernel Methods for Pattern Analysis”, Cambridge University Press, 2004. [6] “Subject Title: Mercer’s Theorem”, Wikipedia: http://en.wikipedia.org/wiki/Mercer’s_theorem
Thank You