CSE 4705 Artificial Intelligence

CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering http://www.engr.uconn.edu/~jinbo

Introduction and History SVM is a supervised learning technique applicable to both classification and regression. It is a classifier derived from statistical learning theory by Vapnik in 1995. SVM generates an input-output mapping function from a set of labeled training data: x → f(x, α), where α denotes the adjustable parameters. Its appeal lies in the simple geometric interpretation of the margin, the uniqueness of the solution, the statistical robustness of the loss function, the modularity of the kernel function, and the control of overfitting through the choice of a single regularization parameter.

Advantages of SVM Regularization is also known as capacity control; the SVM controls capacity by optimizing the classification margin. What is the margin, and how can we optimize it? Linearly separable case Linearly inseparable case (using hinge loss) Primal-dual optimization The other key feature of SVMs is the use of kernels. What are kernels? (May be omitted in this class)

Support Vector Machine Find a linear hyperplane (decision boundary) that will separate the data

Support Vector Machine One Possible Solution

Support Vector Machine Another Possible Solution

Support Vector Machine Other possible solutions

Support Vector Machine Which one is better, B1 or B2? How do you define which is better, or optimal?

Support Vector Machine Definitions and basic concepts: the examples closest to the hyperplane are the support vectors. The distance from a support vector to the separator is 1/||w|| (for a canonically scaled hyperplane). The margin of the separator is the width of separation between the support vectors of the two classes.

Support Vector Machine To find the optimal solution, find the hyperplane that maximizes the margin; thus B1 is better than B2.

Why Maximum Margin? Intuitively this feels safest. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification. The model is immune to removal of any non-support- vector data points. There’s statistical learning theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. Empirically it works very well.

Linearly separable case

Linearly non-separable case (soft margin): noisy data are handled by introducing slack variables ξ_i that measure the classification error.

Nonlinear Support Vector Machines What if the problem is not linearly separable?

Higher Dimensions Map the data into a higher-dimensional space (the kernel trick) where it becomes linearly separable, and then apply a linear SVM, which is easier to solve. (Figure: example points at (0,0), (1,0), (0,1) with labels -1/+1.)

Extension to a Non-linear Decision Boundary Possible problems of the transformation Φ: x → φ(x): high computational overhead, and it is hard to obtain a good estimate in the high-dimensional space. SVM addresses these two issues simultaneously: the kernel trick gives efficient computation, and minimizing ||w||² leads to a "good" classifier in the non-linearly separable case.

Extension to a Non-linear Decision Boundary Then we can solve it easily (non-linearly separable case).

Non-linear, non-separable case What if the decision boundary is not linear? The concept of a kernel mapping function is very powerful: it allows SVM models to perform separations even with very complex boundaries.

Linear Support Vector Machine Find the hyperplane that maximizes the margin; thus B1 is better than B2.

Definition Define the hyperplane H (decision function) such that: d1 = the shortest distance from H to the closest positive point, d2 = the shortest distance from H to the closest negative point. The margin of a separating hyperplane is d1 + d2.

Distance Distance between x_n and the plane: take any point x on the plane and project (x_n − x) onto the unit normal ŵ = w / ||w||. Then
distance = |ŵ^T (x_n − x)| = (1/||w||) |w^T x_n − w^T x| = (1/||w||) |w^T x_n + b − w^T x − b| = 1/||w||,
where the last equality holds because w^T x + b = 0 (x lies on the plane) and x_n is a support vector, which makes |w^T x_n + b| = 1.
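As a quick sanity check of this formula, the sketch below (an illustrative example, not from the slides; the hyperplane w, b and the point xn are made-up values) computes the distance of a point to a hyperplane with numpy.

import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and a query point xn (made-up values)
w = np.array([3.0, 4.0])
b = -5.0
xn = np.array([4.0, 3.0])

# Distance via the projection formula: |w^T xn + b| / ||w||
dist = abs(w @ xn + b) / np.linalg.norm(w)
print(dist)  # (3*4 + 4*3 - 5) / 5 = 19 / 5 = 3.8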

SVM boundary and margin Want: find w and b (offset) that maximize the margin 2/||w||, i.e. minimize ½||w||², subject to the constraints y_i (w^T x_i + b) ≥ 1 for all training points i.
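A minimal sketch of fitting this maximum-margin boundary in practice, assuming scikit-learn is available; the toy data are made up, and a very large C is used to approximate the hard-margin problem described here.

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM (essentially no slack allowed)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]           # normal vector of the separating hyperplane
b = clf.intercept_[0]      # offset
margin = 2.0 / np.linalg.norm(w)
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)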

The Lagrangian trick A "trick" often used in optimization is a Lagrangian formulation of the problem: the constraints are replaced by constraints on the Lagrange multipliers, and the training data occur only as dot products. Reformulate the optimization problem as L_P = ½||w||² − Σ_i α_i [ y_i (w^T x_i + b) − 1 ]. We need to minimize L_P with respect to w and b while maximizing it with respect to the multipliers α_i ≥ 0.

New formulation Having moved from minimizing L_P to maximizing the dual L_D, which depends only on α, we need to maximize L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j, subject to α_i ≥ 0 and Σ_i α_i y_i = 0. The dual form requires only the dot products x_i^T x_j of the input vectors. This is a convex quadratic optimization problem: we run a QP solver, which returns α, and from α we can get w. What remains is to calculate b.
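For concreteness, here is a rough sketch of how this dual QP could be handed to a generic solver; it assumes the cvxopt package and made-up separable data, and is meant only to illustrate the structure of the problem (maximizing L_D is rewritten as minimizing its negative).

import numpy as np
from cvxopt import matrix, solvers

# Made-up linearly separable data
X = np.array([[1.0, 1.0], [2.0, 2.5], [6.0, 5.0], [7.0, 7.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

K = X @ X.T                                         # Gram matrix of dot products
P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))   # quadratic term of -L_D (tiny ridge for numerical stability)
q = matrix(-np.ones(n))                             # linear term of -L_D
G = matrix(-np.eye(n))                              # encodes alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))                        # encodes sum_i alpha_i y_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol["x"]).ravel()
print("alpha =", alpha)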

The Karush-Kuhn-Tucker Conditions Definition of the KKT conditions. The standard form of the optimization problem is: minimize f(x) subject to g_i(x) ≤ 0 and h_j(x) = 0. The corresponding KKT conditions at a local minimum point x*, with multipliers μ_i ≥ 0 for the inequalities and λ_j for the equalities, are: stationarity, ∇f(x*) + Σ_i μ_i ∇g_i(x*) + Σ_j λ_j ∇h_j(x*) = 0; primal feasibility, g_i(x*) ≤ 0 and h_j(x*) = 0; dual feasibility, μ_i ≥ 0; and complementary slackness, μ_i g_i(x*) = 0.

The Karush-Kuhn-Tucker Conditions Geometric meaning of the KKT conditions (figure from Nonlinear Programming, Dimitri P. Bertsekas).

The Karush-Kuhn-Tucker Conditions Regularity condition for the KKT conditions: the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for the linearized constraints with the set of descent directions. SVM problems always satisfy this condition. (Nonlinear Programming, Dimitri P. Bertsekas)

The Karush-Kuhn-Tucker Conditions For the primal problem of the linear SVM: the SVM problem is convex (a convex objective function and a convex feasible region), so the KKT conditions are necessary and sufficient, which means the primal problem can be reduced to a KKT problem. The KKT conditions are: ∂L_P/∂w = 0, giving w = Σ_i α_i y_i x_i; ∂L_P/∂b = 0, giving Σ_i α_i y_i = 0; the primal constraints y_i(w^T x_i + b) − 1 ≥ 0; the dual constraints α_i ≥ 0; and complementary slackness, α_i [ y_i(w^T x_i + b) − 1 ] = 0.

Linear Support Vector Machines There are two cases for linear Support Vector Machines: 1. The separable case. 2. The non-separable case.

1. The separable case: use it when there is no noise in the training data, so a hyperplane can separate the classes exactly.

If there is noise in the training data, we need to move to the non-separable case.

2. The non-separable case: often the data are noisy, so no hyperplane can correctly classify all observations; the data are linearly inseparable. Idea: relax the constraints using slack variables ξ_i, one for each sample. To extend the SVM methodology to data that are not fully linearly separable, we relax the margin constraints slightly to allow for misclassified points by introducing positive slack variables ξ_i, i = 1, …, L, where L is the number of training points.

Slack variables ξ ξ_i is a measure of deviation from the ideal for x_i. - ξ_i > 1: x_i is on the wrong side of the separating hyperplane. - 0 < ξ_i < 1: x_i is correctly classified, but lies inside the margin. - ξ_i = 0: x_i is correctly classified and lies on or outside the margin. Σ_i ξ_i is the total distance by which points fall on the wrong side of their margin.
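To make the three regimes concrete, the sketch below (illustrative only; the separator w, b and the points are made up) computes ξ_i = max(0, 1 − y_i(w^T x_i + b)) for a few points and shows which case each falls into.

import numpy as np

# Hypothetical separator w^T x + b = 0 (made-up values)
w = np.array([1.0, 0.0])
b = 0.0

# Made-up points, all labeled +1
X = np.array([[ 2.0, 0.0],   # correctly classified, outside the margin
              [ 0.5, 0.0],   # correctly classified, inside the margin
              [-1.0, 0.0]])  # on the wrong side of the separator
y = np.array([1, 1, 1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack / hinge loss per point
print(xi)  # [0.0, 0.5, 2.0]  ->  xi = 0, 0 < xi < 1, xi > 1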

To deal with the non-separable case, we can rewrite the problem as: minimize ½||w||² + C Σ_i ξ_i, subject to y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i. The parameter C controls the trade-off between maximizing the margin and minimizing the training error.
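As an illustration of the role of C (a sketch with made-up, slightly overlapping data; scikit-learn assumed), fitting a linear soft-margin SVM with different C values shows the margin shrinking as C grows.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two made-up, slightly overlapping Gaussian blobs
X = np.vstack([rng.randn(20, 2) + [0, 0], rng.randn(20, 2) + [2, 2]])
y = np.array([-1] * 20 + [1] * 20)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)
    print(f"C={C:7.2f}  margin={margin:.3f}  #support vectors={len(clf.support_)}")
# Small C: wide (soft) margin and many support vectors; large C: narrower margin, fewer slack violations.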

Use the Lagrangian formulation for this optimization problem: introduce a positive Lagrange multiplier for each inequality constraint. The multipliers α_i correspond to the margin constraints, the multipliers μ_i are introduced to enforce positivity of the slack variables ξ_i, and C is the weight coefficient on the training error.

We get the following Lagrangian: L_P = ½||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i(w^T x_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i. Reformulated this way, we need to minimize L_P with respect to w, b and ξ, and maximize it with respect to α and μ.

Differentiating with respect to w, b and ξ_i and setting the derivatives to zero: (1) ∂L_P/∂w = 0 gives w = Σ_i α_i y_i x_i; (2) ∂L_P/∂b = 0 gives Σ_i α_i y_i = 0; and ∂L_P/∂ξ_i = 0 gives C = α_i + μ_i.

Substituting (1) and (2) into L_P, we get the dual formulation: having moved from minimizing L_P to maximizing L_D, we need to find max_α L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j, subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.

Calculate b We run a QP solver, which returns α, and from (1) α gives us w. What remains is to calculate b. Any support vector x_s lies exactly on the margin, so it satisfies y_s (w^T x_s + b) = 1. Substituting (1), we get y_s ( Σ_{m∈S} α_m y_m x_m^T x_s + b ) = 1, where S denotes the set of indices of the support vectors. Finally, averaging over all support vectors gives b = (1/|S|) Σ_{s∈S} ( y_s − Σ_{m∈S} α_m y_m x_m^T x_s ).
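A small sketch of this recovery step, assuming α has already been obtained from a QP solver (for example the cvxopt sketch above); the tolerance used to pick out support vectors is an arbitrary illustrative choice.

import numpy as np

def recover_w_b(alpha, X, y, tol=1e-6):
    """Recover w and b from dual variables alpha for a linear SVM."""
    # (1): w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # Support vectors are the points with alpha_i > 0 (up to numerical tolerance).
    # In the soft-margin case, points with alpha_i == C may violate the margin,
    # so in practice one often averages only over 0 < alpha_i < C.
    S = alpha > tol
    # Average b over the support vectors: b = mean_s ( y_s - w^T x_s )
    b = np.mean(y[S] - X[S] @ w)
    return w, b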

Summary of the non-separable case

Non-linear SVMs Cover's theorem on the separability of patterns: "A complex pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space." The power of SVMs resides in the fact that they represent a robust and efficient implementation of Cover's theorem. SVMs operate in two stages: first, perform a non-linear mapping of the feature vector x onto a high-dimensional space that is hidden from the inputs and the outputs; second, construct an optimal separating hyperplane in that high-dimensional space.

Non-linear SVMs Datasets that are linearly separable (perhaps with some noise) work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. from x to (x, x²)?

Non-linear SVMs: Feature Space General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
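A tiny illustration of this idea (made-up 1D data; scikit-learn assumed): points that are not separable on the line become separable after the explicit map φ(x) = (x, x²).

import numpy as np
from sklearn.svm import SVC

# Made-up 1D data: the negative class sits between two positive clusters,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Explicit feature map phi(x) = (x, x^2): in 2D a line can now separate the classes.
Phi = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(Phi, y)
print(clf.predict(Phi))  # matches y: the data are separable in feature space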

Non-linear SVMs

Non-linear SVMs Data is linearly separable in 3D This means that the problem can still be solved by a linear classifier

Non-linear SVMs Naïve application of this concept, by simply projecting to a high-dimensional non-linear manifold, has two major problems. Statistical: operating in high-dimensional spaces is ill-conditioned due to the "curse of dimensionality" and the consequent risk of overfitting. Computational: working in high dimensions requires more computational power, which limits the size of the problems that can be tackled. SVMs bypass these two problems in a robust and efficient manner. First, generalization in the high-dimensional manifold is ensured by enforcing a largest-margin classifier; recall that generalization in SVMs is strictly a function of the margin (or the VC dimension), regardless of the dimensionality of the feature space. Second, the projection onto the high-dimensional manifold is only implicit: recall that the SVM solution depends only on the dot products x_i^T x_j between training examples, so operations in the high-dimensional space φ(x) never have to be performed explicitly if we can find a function K(x_i, x_j) such that K(x_i, x_j) = φ(x_i)^T φ(x_j). K(x_i, x_j) is called a kernel function in SVM terminology.

Nonlinear SVMs: The Kernel Trick With this mapping, our discriminant function becomes g(x) = Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b. There is no need to know the mapping explicitly, because we only use dot products of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i)^T φ(x_j).

Nonlinear SVMs: The Kernel Trick An example: assume we choose the kernel function K(x_i, x_j) = (1 + x_i^T x_j)², with x ∈ R². Our goal is to find a non-linear projection φ(x) such that (1 + x_i^T x_j)² = φ(x_i)^T φ(x_j). Expanding K(x_i, x_j):
K(x_i, x_j) = (1 + x_i^T x_j)²
= 1 + x_{i1}² x_{j1}² + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}² x_{j2}² + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
= [1, x_{i1}², √2 x_{i1} x_{i2}, x_{i2}², √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}², √2 x_{j1} x_{j2}, x_{j2}², √2 x_{j1}, √2 x_{j2}]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2].
So in using the kernel K(x_i, x_j) = (1 + x_i^T x_j)², we are implicitly operating on a higher-dimensional non-linear manifold defined by φ(x). Notice that the inner product φ(x_i)^T φ(x_j) can be computed in R² by means of the kernel (1 + x_i^T x_j)², without ever having to project onto R⁶!
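A quick numerical check of this identity (illustrative only; the vectors are made up):

import numpy as np

def phi(x):
    # Explicit feature map for the kernel K(a, b) = (1 + a^T b)^2 with x in R^2
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

lhs = (1.0 + xi @ xj) ** 2        # kernel computed in R^2
rhs = phi(xi) @ phi(xj)           # dot product computed in R^6
print(lhs, rhs)                   # both equal 4.0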

Kernel methods Let’s now see how to put together all these concepts

Kernel methods Let's now see how to put all these concepts together. Assume that our original feature vector x lives in a space R^D. We are interested in non-linearly projecting x onto a higher-dimensional implicit space φ(x) ∈ R^{D1}, D1 > D, where the classes have a better chance of being linearly separable. Notice that we are not guaranteeing linear separability; we are only saying that we have a better chance, because of Cover's theorem. The separating hyperplane in R^{D1} will be defined by w^T φ(x) + b = 0. To eliminate the bias term b, let's augment the feature vector in the implicit space with a constant dimension φ_0(x) = 1; using vector notation, the resulting hyperplane becomes w^T φ(x) = 0. From our previous results, the optimal (maximum-margin) hyperplane in the implicit space is given by w = Σ_i α_i y_i φ(x_i).

Kernel methods Merging this optimal weight vector with the hyperplane equation, and using φ(x_i)^T φ(x) = K(x_i, x), the optimal hyperplane becomes Σ_i α_i y_i K(x_i, x) = 0. Therefore, classification of an unknown example x is performed by computing the weighted sum of the kernel function with respect to the support vectors x_i (remember that only the support vectors have non-zero dual variables α_i).
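A minimal sketch of this kernelized decision rule, assuming the dual variables alpha, the training data, and a bias b have already been obtained (b is zero if the augmented formulation above is used); the RBF kernel here is just one possible choice.

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, alpha, X_train, y_train, b=0.0, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; classify with sign(f(x))."""
    f = b
    for a_i, x_i, y_i in zip(alpha, X_train, y_train):
        if a_i > 1e-8:                       # only support vectors contribute
            f += a_i * y_i * kernel(x_i, x)
    return np.sign(f)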

Kernel methods How do we compute the dual variables α_i in the implicit space? Very simply: we use the same optimization problem as before and replace the dot product φ(x_i)^T φ(x_j) with the kernel K(x_i, x_j). The Lagrangian dual problem for the non-linear SVM is simply: maximize L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to the constraints 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.

Kernel methods How do we select the implicit mapping φ(x)? As we saw in the example a few slides back, we normally select a kernel function first and then determine the implicit mapping φ(x) to which it corresponds. Then, how do we select the kernel function K(x_i, x_j)? We must select a kernel for which an implicit mapping exists, that is, a kernel that can be expressed as the dot product of two vectors. For which kernels K(x_i, x_j) does there exist an implicit mapping φ(x)? The answer is given by Mercer's condition.

Some Notes on the Kernel Trick What's the point of the kernel trick? It is easier to compute! Example: consider the quadratic kernel K(a, b) = (a^T b + 1)², where our original data x have m dimensions. The explicit feature map has (m+2)-choose-2 terms, around m²/2. (Figure from Andrew Moore, CMU.)

Some Notes on the Kernel Trick That explicit dot product requires about m²/2 additions and multiplications. So, how about computing K(a, b) = (a^T b + 1)² directly? It is only O(m). (Figure from Andrew Moore, CMU.)

Some Notes on Φ and H Mercer's condition tells us whether or not a prospective kernel is a dot product in some space H, but how do we construct Φ, or even know what H is, if we are given a kernel? (Usually we don't need to know Φ; here we just try to have fun exploring what Φ looks like.) For homogeneous polynomial kernels we can actually construct Φ explicitly. Example: for data in R² and the kernel K(a, b) = (a^T b)², we can easily get Φ(x) = (x_1², √2 x_1 x_2, x_2²).

Some Notes on Φ and H This extends to arbitrary homogeneous polynomial kernels. Remember the multinomial theorem: for a homogeneous polynomial kernel K(a, b) = (a^T b)^p on data in R^d, expanding the kernel gives one feature per monomial of degree p, so the number of terms is (d + p − 1)-choose-p, and we can explicitly read off Φ from the expansion.

Some Notes on Φ and H We can also start with Φ and then construct the kernel. Example: consider the Fourier expansion of the data x ∈ R, cut off after N terms, so Φ maps x to the vector (1/√2, cos x, cos 2x, …, cos((N−1)x), sin x, sin 2x, …, sin((N−1)x)) in R^{2N−1}. The dot product Φ(x)^T Φ(z) then gives the (Dirichlet) kernel: K(x, z) = sin((N − ½)(x − z)) / (2 sin((x − z)/2)).

Some Notes on Φ and H Finally, it is clear that the above implicit-mapping trick will work for any algorithm in which the data appear only as dot products (for example, the nearest-neighbor algorithm).

Some Examples of Nonlinear SVMs Linear kernel: K(x_i, x_j) = x_i^T x_j. Polynomial kernels: K(x_i, x_j) = (1 + x_i^T x_j)^P, where P = 2, 3, …; to get the feature vectors we concatenate all polynomial terms of the components of x up to Pth order. Radial basis function (RBF) kernels: K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)), where σ is a user-specified parameter; in this case the feature space is an infinite-dimensional function space. Hyperbolic tangent (sigmoid) kernel: K(x_i, x_j) = tanh(κ x_i^T x_j + δ); this kernel satisfies Mercer's condition only for certain values of the parameters κ and δ. An SVM using a sigmoid kernel is equivalent to a two-layer perceptron neural network. A common value for κ is 1/N, where N is the data dimension.
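A brief sketch comparing these kernels on the same made-up data with scikit-learn (the kernel parameters here are illustrative choices, not recommendations):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Made-up two-class data: one class inside a ring of the other (not linearly separable)
r = np.r_[rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)]
t = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.array([1] * 50 + [-1] * 50)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2}),
                       ("rbf", {"gamma": 0.5}),       # gamma = 1 / (2 sigma^2)
                       ("sigmoid", {"gamma": 0.5, "coef0": 0.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(f"{kernel:8s} training accuracy = {clf.score(X, y):.2f}")
# The linear kernel struggles on this ring-shaped data; the poly and rbf kernels separate it easily.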

Polynomial Kernel SVM Example Slide from Tommi S. Jaakkola, MIT

RBF Kernel SVM Example The kernel we are using is K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)). Notice that decreasing σ moves the classifier towards a nearest-neighbor classifier. (Slides from A. Zisserman.)

RBF Kernel SVM Example With the same RBF kernel, notice that decreasing C gives a wider (softer) margin. (Slides from A. Zisserman.)

Global Solutions and Uniqueness Fact: training an SVM amounts to solving a convex quadratic programming problem, and a proper kernel must satisfy Mercer's positivity condition. Notes from optimization theory: if the objective function is strictly convex, the solution is guaranteed to be unique. For quadratic programming problems, convexity of the objective function is equivalent to positive semi-definiteness of the Hessian, and strict convexity to positive definiteness. For loosely convex problems (convex but not strictly convex), uniqueness must be examined case by case.

Global Solutions and Uniqueness (continued) Strict convexity of the primal objective function does not imply strict convexity of the dual objective function. There are then four possible situations: (1) both primal and dual solutions are unique; (2) the primal solution is unique while the dual solution is not; (3) the dual is unique but the primal is not; (4) neither solution is unique.

Global Solutions and Uniqueness So, if the Hessian is positive definite, the solution of the objective function is unique. (Quick reminder: the Hessian is the square matrix of second-order partial derivatives of a function.) What if the Hessian is only positive semi-definite, i.e. the objective function is loosely convex? A necessary and sufficient condition has been proposed by J. C. Burges et al.; in practice, non-uniqueness of the SVM solution is the exception rather than the rule.
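Since the Hessian of the dual objective has entries y_i y_j K(x_i, x_j), which is positive semi-definite exactly when the kernel Gram matrix K is, a quick numerical way to check this is to look at the eigenvalues of K; the sketch below (made-up data, RBF kernel) is one way to do so.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 2)                      # made-up data

# RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Mercer / PSD check: all eigenvalues should be >= 0 (up to numerical noise)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
print("PSD (up to tolerance):", eigvals.min() > -1e-10)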

Thanks (Picture by Mark A. Hicks)