Support Vector Machines
M.W. Mak

1. Introduction to SVMs
2. Linear SVMs
3. Non-linear SVMs

References:
1. S.Y. Kung, M.W. Mak, and S.H. Lin. Biometric Authentication: A Machine Learning Approach. Prentice Hall, to appear.
2. S.R. Gunn. Support Vector Machines for Classification and Regression.
3. Bernhard Schölkopf. Statistical Learning and Kernel Methods. MSR-TR, Microsoft Research. (ftp://ftp.research.microsoft.com/pub/tr/tr pdf)
4. For more resources on support vector machines, see
Introduction
• SVMs were developed by Vapnik in 1995 and are becoming popular due to their attractive features and promising performance.
• Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.
• SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing an upper bound on the generalization error rather than just the training error.
• SVMs have been shown to possess better generalization capability than conventional neural networks.
Introduction (Cont.)
• Given N labeled empirical data:
  $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in \mathcal{X} \times \{-1, +1\}$,   (1)
  where $\mathcal{X}$ is the set of input data in $\mathbb{R}^d$ and the $y_i$ are the labels.
[Figure: labeled training data of the two classes in domain X]
Introduction (Cont.)
• We construct a simple classifier by computing the means of the two classes:
  $\mathbf{c}_1 = \frac{1}{N_1}\sum_{\{i:\, y_i = +1\}} \mathbf{x}_i, \qquad \mathbf{c}_2 = \frac{1}{N_2}\sum_{\{i:\, y_i = -1\}} \mathbf{x}_i$,   (2)
  where $N_1$ and $N_2$ are the numbers of data points in the classes with positive and negative labels, respectively.
• We assign a new point $\mathbf{x}$ to the class whose mean is closer to it.
• To achieve this, we compute the midpoint $\mathbf{c} = (\mathbf{c}_1 + \mathbf{c}_2)/2$ and the difference vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$.
Introduction (Cont.)
• Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$:
  $y = \mathrm{sgn}\big(\langle \mathbf{x} - \mathbf{c}, \mathbf{w} \rangle\big) = \mathrm{sgn}\big(\langle \mathbf{x}, \mathbf{c}_1 \rangle - \langle \mathbf{x}, \mathbf{c}_2 \rangle + b\big)$,
  where $b = \tfrac{1}{2}\big(\|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2\big)$.
[Figure: the new point x, the two class means, and the vector w in domain X]
Introduction (Cont.)
• In the special case where $b = 0$, we have
  $y = \mathrm{sgn}\Big(\frac{1}{N_1}\sum_{\{i:\, y_i = +1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle - \frac{1}{N_2}\sum_{\{i:\, y_i = -1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle\Big)$.   (3)
• This means that we use ALL data points $\mathbf{x}_i$, each being weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane.
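The mean-based classifier of Eqs. (2)–(3) is simple enough to code directly. Below is a minimal NumPy sketch; the toy data points are made up for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical toy data: rows are points x_i, labels y_i in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1, -1])

# Class means (Eq. 2)
c1 = X[y == +1].mean(axis=0)
c2 = X[y == -1].mean(axis=0)

# Midpoint c, direction w = c1 - c2, and bias b
c = (c1 + c2) / 2
w = c1 - c2
b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))

def classify(x):
    # Assign x to the class whose mean is closer: sign of <x - c, w>
    return np.sign(np.dot(x - c, w))

x_new = np.array([1.5, 1.0])
print(classify(x_new))                 # +1 or -1
print(np.sign(np.dot(x_new, w) + b))   # equivalent form with the bias b
```

Both print statements give the same label, since $\langle \mathbf{x} - \mathbf{c}, \mathbf{w} \rangle = \langle \mathbf{x}, \mathbf{w} \rangle + b$ for the $b$ defined above.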
Introduction (Cont.)
[Figure: the decision plane separating the two classes in domain X]
Introduction (Cont.)
• However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence on its position is usually small.
• We may also select only a few important data points (called support vectors) and weight them differently.
• Then, we have a support vector machine.
Introduction (Cont.)
[Figure: decision plane, support vectors, and margin in domain X]
• We aim to find a decision plane that maximizes the margin.
Linear SVMs
• Assume that all training data satisfy the constraints:
  $\langle \mathbf{w}, \mathbf{x}_i \rangle + b \ge +1 \;\;\text{for } y_i = +1, \qquad \langle \mathbf{w}, \mathbf{x}_i \rangle + b \le -1 \;\;\text{for } y_i = -1$,   (4)
  which means
  $y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) - 1 \ge 0, \quad i = 1, \ldots, N$.   (5)
• Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.
Linear SVMs (Cont.)
[Figure: the decision plane and the two margin hyperplanes]
• The margin, i.e. the distance between the hyperplanes $\langle \mathbf{w}, \mathbf{x} \rangle + b = +1$ and $\langle \mathbf{w}, \mathbf{x} \rangle + b = -1$, is $d = 2/\|\mathbf{w}\|$.
• Therefore, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$.
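A short justification of the margin width, restated here because the geometric details were carried by the slide's figure: the distance from a point $\mathbf{x}_0$ to the plane $\langle \mathbf{w}, \mathbf{x} \rangle + b = 0$ is $|\langle \mathbf{w}, \mathbf{x}_0 \rangle + b| / \|\mathbf{w}\|$. A support vector on either margin hyperplane satisfies $|\langle \mathbf{w}, \mathbf{x}_0 \rangle + b| = 1$, so each margin hyperplane lies at distance $1/\|\mathbf{w}\|$ from the decision plane, and

$$
d = \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}.
$$

Maximizing $d$ is therefore equivalent to minimizing $\|\mathbf{w}\|$, or equivalently $\tfrac{1}{2}\|\mathbf{w}\|^2$.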
Linear SVMs (Lagrangian)
• We minimize $\|\mathbf{w}\|^2$ subject to the constraint that
  $y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1, \quad i = 1, \ldots, N$.   (6)
• This can be achieved by introducing Lagrange multipliers $\alpha_i \ge 0$ and a Lagrangian
  $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \big[y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) - 1\big]$.   (7)
• The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i$.
Linear SVMs (Lagrangian)
• Setting $\partial L/\partial b = 0$ and $\partial L/\partial \mathbf{w} = \mathbf{0}$, we obtain
  $\sum_{i=1}^{N} \alpha_i y_i = 0 \qquad \text{and} \qquad \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$.   (8)
• Patterns for which $\alpha_i > 0$ are called Support Vectors. These vectors lie on the margin and satisfy
  $y_j\Big(\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}_j \rangle + b\Big) = 1, \quad j \in S$,
  where $S$ contains the indices of the support vectors.
• Patterns for which $\alpha_i = 0$ are considered to be irrelevant to the classification.
Linear SVMs (Wolfe Dual)
• Substituting (8) into (7), we obtain the Wolfe dual:
  maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$
  subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.   (9)
• The decision hyperplane is thus
  $\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle + b = 0$, giving the decision function $f(\mathbf{x}) = \mathrm{sgn}\Big(\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle + b\Big)$.
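The Wolfe dual is a small quadratic program, so it can be solved with a generic optimizer. Below is a minimal NumPy/SciPy sketch: the four training points are hypothetical (chosen only to be linearly separable), and the SLSQP solver is a convenient generic choice, not something prescribed by the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [0.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# H_ij = y_i y_j <x_i, x_j>
H = (y[:, None] * y[None, :]) * (X @ X.T)

# Wolfe dual: maximize sum(a) - 0.5 a^T H a  ->  minimize the negative
def neg_dual(a):
    return 0.5 * a @ H @ a - a.sum()

constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0.0, None)] * N                               # a_i >= 0

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=bounds, constraints=constraints)
alpha = res.x

# Recover w from Eq. (8); b from the support vectors (alpha_i > 0)
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)

print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", b)
```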
Linear SVMs (Example)
• Analytical example (3-point problem):
• Objective function: the Wolfe dual of Eq. (9), evaluated on the three training points.
Linear SVMs (Example)
• We introduce another Lagrange multiplier λ (for the equality constraint $\sum_i \alpha_i y_i = 0$) to obtain the Lagrangian F(α, λ).
• Differentiating F(α, λ) with respect to λ and α_i and setting the results to zero, we obtain
Linear SVMs (Example)
• Substituting the Lagrange multipliers into Eq. (8) gives the weight vector w and the bias term b.
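The slide's specific data points did not survive extraction, so here is a worked 3-point example of the same analytical procedure with made-up one-dimensional data: $x_1 = 2$ ($y_1 = +1$), $x_2 = 0$ ($y_2 = -1$), $x_3 = -2$ ($y_3 = -1$).

$$
W(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 - \tfrac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j x_i x_j
 = \alpha_1 + \alpha_2 + \alpha_3 - 2(\alpha_1 + \alpha_3)^2,
\qquad \alpha_1 - \alpha_2 - \alpha_3 = 0, \;\; \alpha_i \ge 0.
$$

Substituting $\alpha_2 = \alpha_1 - \alpha_3$ gives $W = 2\alpha_1 - 2(\alpha_1 + \alpha_3)^2$, which is maximized at $\alpha_3 = 0$ and $\alpha_1 = \tfrac{1}{2}$, hence $\alpha_2 = \tfrac{1}{2}$. Eq. (8) then gives

$$
w = \sum_i \alpha_i y_i x_i = 1, \qquad b = y_1 - w x_1 = -1,
$$

so the decision boundary is $x = 1$, midway between the two support vectors $x_1 = 2$ and $x_2 = 0$; the third point has $\alpha_3 = 0$ and does not affect the solution.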
Linear SVMs (Example)
• 4-point linearly separable problem:
[Figures: two solutions of the 4-point problem, one with 4 SVs and one with 3 SVs]
Linear SVMs (Non-linearly separable)
• Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.
[Figure: data that cause classification errors in a linear SVM]
Linear SVMs (Non-linearly separable)
• We introduce a set of slack variables $\xi_i$ with $\xi_i \ge 0$, $i = 1, \ldots, N$.
• The slack variables allow some data to violate the constraints defined for the linearly separable case (Eq. 6):
  $y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1 - \xi_i, \quad i = 1, \ldots, N$.
• Therefore, for some $\mathbf{x}_i$ (those with $\xi_i > 0$) we have
  $y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) < 1$.
Linear SVMs (Non-linearly separable)
• E.g. $\xi_{10} > 0$ and $\xi_{19} > 0$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margin, i.e. they violate the constraint (Eq. 6).
[Figure: the points x_10 and x_19 lying inside the margin]
Linear SVMs (Non-linearly separable)
• For non-separable cases, we minimize
  $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0$,
  where $C$ is a user-defined penalty parameter to penalize any violation of the margin.
• The Lagrangian becomes
  $L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\big[y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) - 1 + \xi_i\big] - \sum_{i=1}^{N}\beta_i \xi_i$,
  with multipliers $\alpha_i \ge 0$ and $\beta_i \ge 0$.
Linear SVMs (Non-linearly separable)
• Wolfe dual optimization:
  maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$
  subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$.
• The output weight vector and bias term are
  $\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{x}_i, \qquad b = \frac{1}{|S'|}\sum_{k \in S'}\big(y_k - \langle \mathbf{w}, \mathbf{x}_k \rangle\big)$,
  where $S$ is the set of support vectors and $S'$ is the set of support vectors on the margin ($0 < \alpha_k < C$).
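In practice the soft-margin dual is solved by an off-the-shelf SVM package. Below is a minimal scikit-learn sketch (scikit-learn is my choice here, not something the slides specify) that fits a linear soft-margin SVM on hypothetical overlapping data and recovers w, b, and the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical non-separable 2-D data: two overlapping Gaussian clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

# C is the penalty on margin violations (the sum of slacks xi_i)
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]        # weight vector w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]   # bias term b
print("w =", w, " b =", b)
print("number of support vectors:", len(clf.support_))
print("alpha_i * y_i of the SVs:", clf.dual_coef_[0])
```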
2. Linear SVMs (Types of SVs)
• Three types of support vectors:
  1. On the margin: $0 < \alpha_i < C$, $\xi_i = 0$.
  2. Inside the margin: $\alpha_i = C$, $0 < \xi_i \le 1$ (correctly classified but within the margin).
  3. Outside the margin: $\alpha_i = C$, $\xi_i > 1$ (misclassified).
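The three categories can be identified from the dual coefficients and the slacks of a fitted model. A self-contained sketch using the same hypothetical data as above; the numerical tolerance is an arbitrary assumption:

```python
import numpy as np
from sklearn.svm import SVC

# Same hypothetical overlapping data as in the previous sketch
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

C = 1.0
clf = SVC(kernel='linear', C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])      # alpha_i of each support vector
sv_y = y[clf.support_]
# Slack of each SV: xi_i = max(0, 1 - y_i f(x_i))
xi = np.maximum(0.0, 1.0 - sv_y * clf.decision_function(clf.support_vectors_))

tol = 1e-6                             # numerical tolerance (an assumption)
on_margin = alpha < C - tol                     # type 1: 0 < alpha_i < C, xi_i = 0
inside    = (alpha >= C - tol) & (xi <= 1.0)    # type 2: alpha_i = C, 0 < xi_i <= 1
outside   = (alpha >= C - tol) & (xi > 1.0)     # type 3: alpha_i = C, xi_i > 1
print(on_margin.sum(), inside.sum(), outside.sum())
```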
2. Linear SVMs (Types of SVs)
[Figure: the three types of support vectors]
2. Linear SVMs (Types of SVs)
• Swapping Class 1 and Class 2:
[Figure: the same problem with the class labels swapped]
2. Linear SVMs (Types of SVs)
• Effect of varying C:
[Figures: decision boundaries and support vectors for C = 0.1 and C = 100]
3. Non-linear SVMs
• When the training data are not linearly separable in the input space, we may use a non-linear mapping $\phi(\cdot)$ to map the data to a higher-dimensional feature space in which they become linearly separable; the kernel function $K(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ evaluates the inner products in that feature space without computing $\phi$ explicitly.
[Figure: input space (domain X) with a non-linear decision boundary, and the corresponding feature space with a linear one]
3. Non-linear SVMs (Cont.)
• The decision function becomes
  $f(\mathbf{x}) = \mathrm{sgn}\Big(\sum_{i \in S} \alpha_i y_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle + b\Big)$.
3. Non-linear SVMs (Cont.)
3. Non-linear SVMs (Cont.)
• The decision function becomes
  $f(\mathbf{x}) = \mathrm{sgn}\Big(\sum_{i \in S} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\Big)$.
• For RBF kernels:
  $K(\mathbf{x}, \mathbf{x}_i) = \exp\big(-\|\mathbf{x} - \mathbf{x}_i\|^2 / (2\sigma^2)\big)$.
• For polynomial kernels:
  $K(\mathbf{x}, \mathbf{x}_i) = \big(\langle \mathbf{x}, \mathbf{x}_i \rangle + 1\big)^p$.
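A small NumPy sketch of the two kernels above, applied to a hypothetical pair of vectors (the width σ and degree p values are arbitrary choices for illustration):

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, xi, p=2):
    # K(x, x_i) = (<x, x_i> + 1)^p
    return (np.dot(x, xi) + 1.0) ** p

x  = np.array([1.0, 2.0])
xi = np.array([0.5, -1.0])
print(rbf_kernel(x, xi, sigma=1.0), poly_kernel(x, xi, p=2))
```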
3. Non-linear SVMs (Cont.)
• The decision function becomes
  $f(\mathbf{x}) = \mathrm{sgn}\Big(\sum_{i \in S} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\Big)$.
• The optimization problem becomes:
  maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
  subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$.   (9)
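A minimal scikit-learn sketch of a non-linear (RBF-kernel) SVM on hypothetical non-linearly separable data; the dataset (concentric circles) and the C and gamma values are arbitrary illustrative choices, with gamma playing the role of $1/(2\sigma^2)$ in the RBF kernel above:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Hypothetical non-linearly separable data: one class inside the other
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
y = 2 * y - 1   # relabel {0, 1} -> {-1, +1}

# RBF-kernel soft-margin SVM
clf = SVC(kernel='rbf', C=10.0, gamma=1.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
# decision_function returns sum_i alpha_i y_i K(x, x_i) + b (before taking the sign)
print(clf.decision_function(X[:3]))
```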
3. Non-linear SVMs (Cont.)
• The effect of varying C on RBF-SVMs:
[Figures: decision boundaries for C = 10 and C = 1000]
3. Non-linear SVMs (Cont.)
• The effect of varying C on Polynomial-SVMs:
[Figures: decision boundaries for C = 10 and C = 1000]