
A tutorial about SVM. Omer Boehm, omerb@il.ibm.com

Outline: Introduction, Classification, Perceptron, SVM for linearly separable data, SVM for almost linearly separable data, SVM for non-linearly separable data.

Introduction Machine learning is a branch of artificial intelligence, a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. An important task of machine learning is classification. Classification is also referred to as pattern recognition.

Example Objects described by features such as income, debt, marital status and age (e.g. Shelley, Elad, Dan, Alona) are fed to a learning machine, which maps each object to one of two classes: approve or deny.

Types of learning problems: Supervised learning (n classes, n > 1): classification, regression. Unsupervised learning (no class labels): clustering (building equivalence classes), density estimation.

Approximation problems. Supervised learning: Regression – learn a continuous function from input samples. Example: stock prediction; input – a future date; output – the stock price; training – information on the stock price over the last period. Classification – learn a separation function from discrete inputs to classes. Example: Optical Character Recognition (OCR); input – images of digits; output – labels 0-9; training – labeled images of digits. In fact, both are approximation problems.

Regression

Classification

Density estimation

What makes learning difficult? Given the following examples, how should we draw the line?

What makes learning difficult? Which one is the most appropriate?

What makes learning difficult? The hidden test points.

What is learning (mathematically)? We would like to ensure that small changes of an input point away from a training point will not result in a jump to a different classification. Such an approximation is called a stable approximation. As a rule of thumb, small derivatives ensure a stable approximation.

Stable vs. unstable approximation. Lagrange approximation (unstable): given points (x_i, y_i), i = 0, …, n, we find the unique polynomial of degree n that passes through the given points. Spline approximation (stable): given points (x_i, y_i), we find a piecewise approximation by third-degree polynomials such that they pass through the given points, have common tangents at the division points and, in addition, satisfy conditions at the end points.

What would be the best choice? The “simplest” solution: a solution where the distance from each example is as small as possible and where the derivative is as small as possible.

Vector Geometry Just in case ….

Dot product The dot product of two vectors a = (a_1, …, a_n) and b = (b_1, …, b_n) is defined as a·b = a_1 b_1 + … + a_n b_n. An example: (1, 2)·(3, 4) = 1·3 + 2·4 = 11.

Dot product Geometrically, a·b = |a| |b| cos(θ), where |a| denotes the length (magnitude) of a. A unit vector in the direction of a is a/|a|.
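As a quick illustration, here is a minimal NumPy sketch (the vectors are made-up numbers, not from the slides) computing a dot product, a magnitude and a unit vector:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

dot = np.dot(a, b)            # sum of element-wise products: 1*4 + 2*(-5) + 3*6 = 12
length = np.linalg.norm(a)    # magnitude |a| = sqrt(a . a)
unit = a / length             # unit vector in the direction of a

print(dot, length, unit)
print(np.dot(unit, unit))     # a unit vector has length 1, so this is (numerically) 1.0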

Plane/Hyperplane A hyperplane can be defined by: three points; two vectors; or a normal vector and a point.

Plane/Hyperplane Let w be a vector perpendicular to the hyperplane H. Let p be the position vector of some known point P0 in the plane. A point P with position vector x is in the plane iff the vector drawn from P0 to P is perpendicular to w. Two vectors are perpendicular iff their dot product is zero, so the hyperplane H can be expressed as w·(x − p) = 0, i.e. w·x + b = 0 with b = −w·p.
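A small sketch of this representation (the normal vector and point are hypothetical values): the sign of w·x + b tells on which side of the hyperplane a point lies.

import numpy as np

# Hyperplane H: all points x with w . (x - p) = 0, i.e. w . x + b = 0 with b = -w . p
w = np.array([1.0, 2.0, -1.0])   # normal vector (hypothetical)
p = np.array([0.0, 0.0, 1.0])    # a known point on the plane (hypothetical)
b = -np.dot(w, p)

def side(x):
    """Signed value w . x + b: zero on H, positive on the side w points to, negative on the other."""
    return np.dot(w, x) + b

print(side(p))            # 0.0 -> p lies on the plane
print(side(p + w))        # positive: moved along the normal
print(side(p - 0.5 * w))  # negative: moved against the normal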

Classification

Solving approximation problems First we define the family of approximating functions F. Next we define the cost function J(f); this function tells how well f performs the required approximation. Having done this, the approximation/classification consists of solving the minimization problem min over f in F of J(f). A first necessary condition (after Fermat) is that the derivative of J vanishes. As we know, it is always possible to apply Newton-Raphson to this condition and get a sequence of approximations.

Classification A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories. X is the input space; x ∈ X is a data point from the input space. A typical input space is high-dimensional, for example X ⊆ R^d; x is also called a feature vector. Ω is a finite set of categories to which the input data points belong: Ω = {1, 2, …, C}. The elements ω ∈ Ω are called labels.

Classification Y is a finite set of decisions – the output set of the classifier. The classifier is a function f: X → Y.

The Perceptron

Perceptron - Frank Rosenblatt (1957) Linear separation of the input space

Perceptron algorithm
Start: the weight vector w(0) is generated randomly; set k = 0.
Test: a vector x is selected randomly from the training set;
if x belongs to class 1 and w(k)·x > 0, go to Test;
if x belongs to class 1 and w(k)·x ≤ 0, go to Add;
if x belongs to class 2 and w(k)·x < 0, go to Test;
if x belongs to class 2 and w(k)·x ≥ 0, go to Subtract.
Add: w(k+1) = w(k) + x, k = k + 1, go to Test.
Subtract: w(k+1) = w(k) − x, k = k + 1, go to Test.

Perceptron algorithm, shorter version. Update rule for the (k+1)-th iteration (one iteration per data point): if y_i (w(k)·x_i) ≤ 0 then w(k+1) = w(k) + y_i x_i, otherwise w(k+1) = w(k).
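A compact sketch of this update rule in Python/NumPy; the toy data and the trick of appending a constant 1 to absorb the bias are my own assumptions, not from the slides.

import numpy as np

def perceptron(X, y, max_epochs=100):
    """Mistake-driven perceptron: if y_i * (w . x_i) <= 0, update w <- w + y_i * x_i.
    Labels are assumed to be +1/-1; if the data is not linearly separable the loop
    simply stops after max_epochs without converging."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 so the bias is part of w
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # mistake-driven update
                mistakes += 1
        if mistakes == 0:                 # converged: every point classified correctly
            break
    return w

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print("weights (last entry is the bias):", w)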

Perceptron – visualization (intuition)

Perceptron - analysis The solution is a linear combination of training points. It only uses informative points (mistake driven). The coefficient of a point reflects its ‘difficulty’. The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR).

Support Vector Machines

Advantages of SVM (Vladimir Vapnik, 1979, 1998) Exhibits good generalization. Can implement confidence measures, etc. The hypothesis has an explicit dependence on the data (via the support vectors). Learning involves optimization of a convex function (no false minima, unlike NN). Few parameters are required for tuning the learning machine (unlike NN, where the architecture and various parameters must be found).

Advantages of SVM From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error. These generalization bounds have two important features:

Advantages of SVM The upper bound on the generalization error does not depend on the dimensionality of the space. The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data-points of each class.

Basic scenario - Separable data set

Basic scenario – define margin

In an arbitrary-dimensional space, a separating hyperplane can be written as w·x + b = 0, where w is the normal. The decision function would be f(x) = sign(w·x + b).

Note the argument in sign(·) is invariant under a rescaling of the form w → λw, b → λb. Implicitly, the scale can be fixed by defining the support vectors to satisfy w·x + b = ±1 (canonical hyperplanes).

The task is to select w and b so that the training data can be described as: w·x_i + b ≥ +1 for y_i = +1 and w·x_i + b ≤ −1 for y_i = −1. These can be combined into: y_i (w·x_i + b) ≥ 1 for all i.

The margin will be given by the projection of the vector x_1 − x_2, connecting a support vector of each class, onto the normal vector to the hyperplane, i.e. w/|w|. So the (Euclidean) distance can be formed as (x_1 − x_2)·w/|w|.

Note that x_1 lies on w·x + b = +1, i.e. w·x_1 + b = +1. Similarly for x_2: w·x_2 + b = −1. Subtracting the two results in w·(x_1 − x_2) = 2.

The margin can be put as 2/|w|. We can convert the problem to: minimize J(w) = |w|²/2 subject to the constraints y_i (w·x_i + b) ≥ 1 for all i. J(w) is a quadratic function, thus there is a single global minimum.
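To make this concrete, here is a minimal sketch that solves exactly this quadratic program on a tiny, made-up separable data set, assuming the cvxpy package is available (the data and solver choice are my own, not from the slides):

import numpy as np
import cvxpy as cp

# Toy separable data (hypothetical)
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# minimize |w|^2 / 2  subject to  y_i (w . x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()

print("w =", w.value, "b =", b.value)
print("margin = 2/|w| =", 2 / np.linalg.norm(w.value))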

Lagrange multipliers Problem definition: maximize f(x, y) subject to g(x, y) = c. A new variable λ, called a ‘Lagrange multiplier’, is used to define Λ(x, y, λ) = f(x, y) + λ (g(x, y) − c).

Lagrange multipliers - example Maximize f(x, y) subject to g(x, y) = c. Formally set Λ(x, y, λ) = f(x, y) + λ (g(x, y) − c) and set the derivatives ∂Λ/∂x, ∂Λ/∂y, ∂Λ/∂λ to 0. Combining the first two equations yields a relation between x and y; substituting it into the last (the constraint) gives the candidate points; evaluating the objective function f on these yields the constrained maximum.
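The concrete functions used on the original slide were lost in transcription; as an illustrative stand-in (the choice of f = xy and the constraint x + y = 8 are my own), here is the same procedure carried out with SymPy:

import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x * y                      # objective (illustrative choice)
g = x + y - 8                  # constraint g = 0, i.e. x + y = 8 (illustrative choice)
L = f + lam * g                # the Lagrangian

# Set all partial derivatives to zero and solve the resulting system
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for s in sols:
    # Combining dL/dx = 0 and dL/dy = 0 gives x = y; the constraint then gives x = y = 4
    print(s, "  f =", f.subs(s))   # x = 4, y = 4, lam = -4, f = 16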

Primal problem: minimize J(w) = |w|²/2 s.t. y_i (w·x_i + b) ≥ 1 for all i. Introduce Lagrange multipliers α_i ≥ 0 associated with the constraints. The solution to the primal problem is equivalent to determining the saddle point of the function L(w, b, α) = |w|²/2 − Σ_i α_i [y_i (w·x_i + b) − 1].

At the saddle point, L has a minimum with respect to w and b, requiring ∂L/∂w = 0, which gives w = Σ_i α_i y_i x_i, and ∂L/∂b = 0, which gives Σ_i α_i y_i = 0.

Primal-Dual Primal: minimize L(w, b, α) with respect to w and b, subject to α_i ≥ 0. Substitute w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0. Dual: maximize L_D(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j) with respect to α, subject to α_i ≥ 0 and Σ_i α_i y_i = 0.

Solving the QP using the dual problem. Maximize L_D(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j) constrained to α_i ≥ 0 and Σ_i α_i y_i = 0. We have N new variables, one for each data point. This is a convex quadratic optimization problem, and we run a QP solver to get α and then w = Σ_i α_i y_i x_i.
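A minimal sketch of this dual problem on the same toy data, again assuming cvxpy is available (the small ridge and the support-vector tolerance are numerical conveniences of my own):

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q_ij = y_i y_j (x_i . x_j); a tiny ridge keeps it numerically PSD for quad_form
Q = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(n)

alpha = cp.Variable(n)
prob = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
                  [alpha >= 0, y @ alpha == 0])
prob.solve()

a = alpha.value
w = ((a * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
sv = a > 1e-6                            # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)           # average of y_i - w . x_i over support vectors

print("alpha =", np.round(a, 4))
print("support vectors:", X[sv])
print("w =", w, "b =", b)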

‘b’ can be determined from the optimal α and the Karush-Kuhn-Tucker (KKT) conditions: for data points residing on the margin (the support vectors), y_i (w·x_i + b) = 1, which implies b = y_i − w·x_i; in practice b is taken as the average of this value over all support vectors.

For every data point i, one of the following must hold: α_i = 0, or y_i (w·x_i + b) = 1. Many α_i are zero (sparse solution). Data points with α_i > 0 are support vectors. The optimal hyperplane is completely defined by the support vectors.

SVM - the classification Given a new data point z, find its label y: y = sign(w·z + b) = sign(Σ_i α_i y_i (x_i·z) + b).
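In practice a library does all of the above; here is a short sketch with scikit-learn's SVC (the test point z and the very large C, used to approximate the hard-margin case, are arbitrary choices):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)

z = np.array([[1.0, 0.5]])          # a new, hypothetical data point
print("support vectors:", clf.support_vectors_)
print("w =", clf.coef_, "b =", clf.intercept_)
print("decision value w.z + b =", clf.decision_function(z))
print("predicted label =", clf.predict(z))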

Extended scenario - Non-Separable data set

Data is most likely not to be separable (inconsistencies, outliers, noise), but a linear classifier may still be appropriate. SVM can be applied in this non-separable case, as long as the data is almost linearly separable.

SVM with slacks Use non-negative slack variables ξ_i ≥ 0, one per data point. Change the constraints from y_i (w·x_i + b) ≥ 1 to y_i (w·x_i + b) ≥ 1 − ξ_i. ξ_i is a measure of the deviation from the ideal position for sample i.

SVM with slacks We would like to minimize J(w, ξ) = |w|²/2 + C Σ_i ξ_i constrained to y_i (w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. The parameter C is a regularization term, which provides a way to control over-fitting: if C is small, we allow a lot of samples not in the ideal position; if C is large, we want to have very few samples not in the ideal position.
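A quick sketch of this trade-off on synthetic, overlapping data (the cluster parameters, random seed and the particular C values are arbitrary choices):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds: almost, but not exactly, linearly separable
X = np.vstack([rng.normal(loc=[2, 2], scale=1.2, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.2, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates many margin violations (many support vectors);
    # large C tries to classify almost every training point correctly.
    print(f"C={C:7.2f}  support vectors={int(clf.n_support_.sum()):3d}  "
          f"train accuracy={clf.score(X, y):.2f}")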

SVM with slacks - dual formulation Maximize L_D(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j) constrained to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.

SVM - non linear mapping Cover’s theorem: “A pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space.” A one-dimensional space may not be linearly separable; lift it to a two-dimensional space with, for example, φ(x) = (x, x²).

SVM - non linear mapping Solve a non-linear classification problem with a linear classifier: project the data x to a high dimension using a function φ, and find a linear discriminant function for the transformed data φ(x). The final non-linear discriminant function is g(x) = w·φ(x) + b. In 2D the discriminant function is linear; in 1D the discriminant function is NOT linear.
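A concrete sketch of the 1D-to-2D lift with φ(x) = (x, x²), using a linear SVM in the lifted space (the toy data is my own choosing):

import numpy as np
from sklearn.svm import SVC

# 1-D data: the negative class sits between the two halves of the positive class,
# so no threshold on x alone separates them
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])

# Lift to 2-D with phi(x) = (x, x^2); in the lifted space a horizontal line separates the classes
Phi = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(Phi, y)
print("train accuracy in the lifted space:", clf.score(Phi, y))
print("w =", clf.coef_, "b =", clf.intercept_)
# The discriminant w1*x + w2*x^2 + b is linear in (x, x^2) but NOT linear in x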

SVM - non linear mapping Any linear classifier can be used after lifting the data into a higher-dimensional space. However, we will have to deal with the “curse of dimensionality”: poor generalization to test data and expensive computation. SVM handles the “curse of dimensionality” problem: enforcing the largest margin permits good generalization (it can be shown that generalization in SVM is a function of the margin, independent of the dimensionality), and computation in the higher-dimensional case is performed only implicitly through the use of kernel functions.

Non linear SVM - kernels Recall: the data points appear only in dot products. If x is mapped to a high-dimensional space F using φ, the high-dimensional product φ(x_i)·φ(x_j) is needed. The dimensionality of the space F is not necessarily important; we may not even know the map φ.

Kernel A function that returns the value of the dot product between the images of its two arguments: K(x_i, x_j) = φ(x_i)·φ(x_j). Given a function K, it is possible to verify that it is a kernel. Now we only need to compute K(x_i, x_j) instead of φ(x_i)·φ(x_j). The “kernel trick”: we do not need to perform operations in the high-dimensional space explicitly.
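A tiny numerical check of this identity for the homogeneous degree-2 polynomial kernel in 2D (the particular vectors are arbitrary):

import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D:
    phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so that phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, z):
    """The same kernel computed directly in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # dot product in the lifted 3-D space
print(k(x, z))                  # same value, computed without ever forming phi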

Kernel matrix K_ij = K(x_i, x_j): the central structure in kernel machines. It contains all the necessary information for the learning algorithm and fuses information about the data AND the kernel. It has many interesting properties.

Mercer’s Theorem The kernel matrix is symmetric positive (semi-)definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space. Every (semi-)positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write K(x, z) = φ(x)·φ(z).

Examples of kernels Some common choices (both satisfying Mercer’s condition): the polynomial kernel K(x, z) = (x·z + 1)^p and the Gaussian radial basis function (RBF) K(x, z) = exp(−|x − z|² / (2σ²)).
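A short sketch of both kernels, cross-checked against scikit-learn's pairwise kernel implementations (the sample points and parameter values are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])

def poly(X, Z, p=3, c=1.0):
    """Polynomial kernel K(x, z) = (x . z + c)^p."""
    return (X @ Z.T + c) ** p

def rbf(X, Z, sigma=1.0):
    """Gaussian RBF kernel K(x, z) = exp(-|x - z|^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Both Gram matrices match scikit-learn's versions of the same kernels
print(np.allclose(poly(X, X), polynomial_kernel(X, X, degree=3, gamma=1.0, coef0=1.0)))
print(np.allclose(rbf(X, X, sigma=1.0), rbf_kernel(X, X, gamma=0.5)))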

Polynomial Kernel - example

Applying - non linear SVM Start with data x_1, …, x_N, which lives in a feature space of dimension n. Choose a kernel K corresponding to some function φ, which takes a data point to a higher-dimensional space. Find the largest margin linear discriminant function in the higher-dimensional space by using a quadratic programming package to solve: maximize Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.

Applying - non linear SVM Weight vector w in the high-dimensional space: w = Σ_i α_i y_i φ(x_i). Linear discriminant function of largest margin in the high-dimensional space: g(y) = w·y + b. Non-linear discriminant function in the original space: g(x) = Σ_i α_i y_i φ(x_i)·φ(x) + b = Σ_i α_i y_i K(x_i, x) + b.
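Putting it together, a minimal sketch of a non-linear SVM with an RBF kernel on data that is not linearly separable in the original space (the circular toy data, random seed and parameter values are my own choices):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Points inside a circle of radius 1 are class +1, points outside are -1:
# not linearly separable in the original 2-D space
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

z = np.array([[0.0, 0.0], [1.8, 1.8]])   # one point near the centre, one far outside
# decision_function(z) = sum_i alpha_i y_i K(x_i, z) + b, summed over the support vectors only
print("number of support vectors:", len(clf.support_vectors_))
print("decision values:", clf.decision_function(z))
print("predicted labels:", clf.predict(z))
print("train accuracy:", clf.score(X, y))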

SVM summary Advantages: based on nice theory; excellent generalization properties; the objective function has no local minima; can be used to find non-linear discriminant functions; the complexity of the classifier is characterized by the number of support vectors rather than by the dimensionality of the transformed space. Disadvantages: it is not clear how to select a kernel function in a principled manner; it tends to be slower than other methods (in the non-linear case).