Support Vector Machines & Kernel Machines. Ohad Hageby, IDC. IP Seminar 2008, IDC Herzliya
Introduction to Support Vector Machines (SVM): Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. (from Wikipedia)
Introduction Continued: Often we are interested in classifying data as part of a machine-learning process. Each data point is represented by a p-dimensional vector (a list of p numbers), and each data point belongs to exactly one of two classes.
Training Data: We want to estimate a function f: R^N -> {+1, -1} using input-output training pairs (x_1, y_1), ..., (x_n, y_n) generated independently and identically distributed according to an unknown distribution P(x, y). If f(x_i) = -1, then x_i is in class 1; if f(x_i) = +1, then x_i is in class 2.
The Machine: The machine's task is to learn the mapping from x_i to y_i. It is defined by a set of possible mappings x -> f(x).
Expected Error: The test examples are assumed to follow the same probability distribution P(x, y) as the training data. The best function f we could have is the one minimizing the expected error (risk): R[f] = integral of l(f(x), y) dP(x, y).
Here l denotes the loss function. The "0/1 loss" is l(f(x), y) = 0 if f(x) = y and 1 otherwise; a common alternative is the squared loss l(f(x), y) = (f(x) - y)^2.
Empirical Risk: Unfortunately the risk cannot be minimized directly because the probability distribution is unknown. The "empirical risk" is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations): R_emp[f] = (1/n) sum_i l(f(x_i), y_i).
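As a minimal sketch of the definition above (the threshold classifier and the toy labels are hypothetical, chosen only for illustration), the empirical risk under the 0/1 loss is just the fraction of misclassified training points:

```python
def empirical_risk(f, xs, ys):
    """Mean 0/1 loss of classifier f on the training sample."""
    errors = sum(1 for x, y in zip(xs, ys) if f(x) != y)
    return errors / len(xs)

# Toy 1-D sample and a threshold classifier f(x) = sign(x)
f = lambda x: 1 if x >= 0 else -1
xs = [-2.0, -1.0, 0.5, 1.5, -0.5]
ys = [-1, -1, 1, 1, 1]            # the last point disagrees with f
print(empirical_risk(f, xs, ys))  # 1 of 5 points misclassified -> 0.2
```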
The Overfitting Dilemma: It is possible to give conditions on the learning machine which ensure that as n -> infinity, R_emp converges toward R_expected. For small sample sizes, however, overfitting may occur.
The Overfitting Dilemma Cont. (Figure from "An Introduction to Kernel-Based Learning Algorithms")
VC Dimension: A concept in VC theory, introduced by Vladimir Vapnik and Alexey Chervonenkis. It is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
Shattering Example (from Wikipedia): Consider a straight line as the classification model, i.e. the model used by a perceptron. The line should separate positive data points from negative ones. When there are 3 points that are not collinear, the line can shatter them.
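This claim can be checked by brute force (a sketch: the three points and the small integer grid of candidate lines are arbitrary choices that happen to suffice here):

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # three non-collinear points

def separable(labels):
    """Is there a line sign(w1*x + w2*y + b) realizing these labels?"""
    for w1, w2, b in product(range(-2, 3), repeat=3):
        if all((w1 * x + w2 * y + b > 0) == (lab == 1)
               for (x, y), lab in zip(points, labels)):
            return True
    return False

# All 2^3 = 8 labelings must be realizable for the set to be shattered.
shattered = all(separable(labels) for labels in product([1, -1], repeat=3))
print(shattered)  # True
```

The same search fails for suitable sets of 4 points, consistent with the VC dimension of lines in the plane being 3.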
Shattering: A classification model f with parameter vector theta is said to shatter a set of data points (x_1, x_2, ..., x_n) if, for every assignment of labels to those points, there exists a theta such that the model f makes no errors when evaluating that set of data points.
Shattering Continued: The VC dimension of a model f is the maximum h such that some data point set of cardinality h can be shattered by f. The VC dimension is useful in statistical learning theory because it yields a probabilistic upper bound on the test error of a classification model.
Upper Bound on Error: In our case an upper bound on the expected error is given by (Vapnik, 1995): for all delta > 0 and f in F, with probability at least 1 - delta, R[f] <= R_emp[f] + sqrt((h(ln(2n/h) + 1) + ln(4/delta)) / n), where h is the VC dimension of F.
Theorem, VC Dimension in R^n: The VC dimension of the set of oriented hyperplanes in R^n is n+1, since we can always choose n+1 points and then pick one of them as the origin such that the position vectors of the remaining n points are linearly independent; but we can never choose n+2 such points. (Anthony and Biggs, 1995)
Structural Risk Minimization: Fit the training points too closely and the model may be "too tight", predicting poorly on new test points; fit too loosely and it may not learn enough. One way to avoid the overfitting dilemma is to limit the complexity of the function class F from which we choose f. Intuition: a "simple" (e.g. linear) function that explains most of the data is preferable to a complex one (Occam's razor).
(Figure from "An Introduction to Kernel-Based Learning Algorithms")
The Support Vector Machine, Linear Case: In a linearly separable dataset there is some choice of w and b (which represent a hyperplane) such that y_i(w . x_i + b) > 0 for all i. Because the set of training data is finite, there is a family of such hyperplanes. We would like to maximize the distance (margin) of each class's points from the separating hyperplane. We can scale w and b such that y_i(w . x_i + b) >= 1 for all i, with equality for the closest points.
SVM, Linear Case: Linear separating hyperplanes. The support vectors (circled) are the points used to define the hyperplane.
Important Observations: Only a small part of the training set is used to build the hyperplane (the support vectors). At least one point on each side of the hyperplane achieves the equality y_i(w . x_i + b) = 1. For two such opposite points x_1, x_2 with minimal distance, the margin is w . (x_1 - x_2) / |w| = 2 / |w|.
Reformulating as a Quadratic Optimization Problem: Maximizing the margin 2/|w| is the same as minimizing (1/2)|w|^2: minimize (1/2)|w|^2 subject to y_i(w . x_i + b) >= 1, i = 1, ..., n.
Solving the SVM: We can solve by introducing Lagrange multipliers alpha_i >= 0 to obtain the Lagrangian L(w, b, alpha) = (1/2)|w|^2 - sum_i alpha_i (y_i(w . x_i + b) - 1), which should be minimized with respect to w and b and maximized with respect to alpha_i (Karush-Kuhn-Tucker conditions).
Solving the SVM Cont.: Setting the derivatives with respect to w and b to zero leads to the requirements w = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0. Note: we expect most alpha_i to be zero; those which are not correspond to the support vectors.
The Dual Problem: maximize W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j) subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
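For intuition, the dual can be solved by hand on a toy 1-D problem with two points x_1 = -1 (y_1 = -1) and x_2 = +1 (y_2 = +1): the constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, the objective becomes 2a - 2a^2, maximized at a = 1/2, giving w = sum_i alpha_i y_i x_i = 1. A brute-force grid scan (a sketch, not a real QP solver) confirms this:

```python
# Toy 1-D training set: one point per class.
xs = [-1.0, 1.0]
ys = [-1, 1]

def dual_objective(alphas):
    """W(a) = sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j>."""
    lin = sum(alphas)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * xs[i] * xs[j]
               for i in range(2) for j in range(2))
    return lin - 0.5 * quad

# Enforce sum(a_i * y_i) = 0, i.e. a_1 = a_2 = a, and scan a on a fine grid.
best_a = max((k / 1000 for k in range(2001)),
             key=lambda a: dual_objective([a, a]))
w = sum(best_a * ys[i] * xs[i] for i in range(2))
print(best_a, w)  # a = 0.5, w = 1.0
```

Both points attain y_i(w * x_i + b) = 1 with b = 0, so both are support vectors, matching the expectation that the nonzero alpha_i mark the support vectors.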
SVM, Non-Linear Case: The dataset is not always linearly separable!
Mapping to a Higher Dimension: We need a mapping Phi that takes each input x to a point Phi(x) in a higher-dimensional feature space.
Mapping to a Higher Dimension: Pro: in many problems we can separate the classes linearly once the feature space has higher dimension. Con: mapping to a higher dimension is computationally complex! The "curse of dimensionality" (in statistics) tells us we would need exponentially more samples. Is that really so?
Mapping to a Higher Dimension: Statistical learning theory tells us that learning in F can be simple if one uses low-complexity decision rules (like a linear classifier). In short, it is not the dimensionality but the complexity of the function class that matters. Fortunately, for some feature spaces and their mapping Phi we can use a trick!
The "Kernel Trick": A kernel function corresponds to mapping data vectors into a higher-dimensional feature space (like the Phi we are looking for). Some kernel functions have a special property: they can be used to calculate the scalar product in the feature space directly, without ever computing Phi.
Kernel Trick Example: Take the mapping Phi(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2) for vectors x, y in R^2, and see how the kernel function K(x, y) = (x . y)^2 equals the dot product Phi(x) . Phi(y).
Conclusion: we do not have to compute Phi at all to calculate k(x, y); it reduces to a straightforward dot-product calculation on x and y.
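A quick numerical check of this identity (assuming the degree-2 mapping Phi(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2), for which K(x, y) = (x . y)^2; the random test vectors are arbitrary):

```python
import math
import random

def phi(v):
    """Explicit feature map R^2 -> R^3 for the degree-2 polynomial kernel."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """K(x, y) = (x . y)^2, computed without ever forming phi."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

random.seed(0)
x = (random.uniform(-1, 1), random.uniform(-1, 1))
y = (random.uniform(-1, 1), random.uniform(-1, 1))
dot_in_feature_space = sum(a * b for a, b in zip(phi(x), phi(y)))
print(abs(kernel(x, y) - dot_in_feature_space) < 1e-12)  # True
```

Algebraically: Phi(x) . Phi(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2, which is exactly K(x, y).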
Moving Back to SVM in the Higher Dimension: The Lagrangian will be L(w, b, alpha) = (1/2)|w|^2 - sum_i alpha_i (y_i(w . Phi(x_i) + b) - 1). At the optimal point ("saddle point equations") the derivatives with respect to b and w vanish, which translates to sum_i alpha_i y_i = 0 and w = sum_i alpha_i y_i Phi(x_i).
And the optimization problem becomes: maximize W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j k(x_i, x_j) subject to alpha_i >= 0 and sum_i alpha_i y_i = 0, where k(x_i, x_j) = Phi(x_i) . Phi(x_j).
The Decision Function: Solving the (dual) optimization problem leads to the non-linear decision function f(x) = sgn(sum_i alpha_i y_i k(x, x_i) + b).
The Non-Separable Case: Until now we considered the separable case, which is consistent with zero empirical error. For noisy data this may not be the minimum of the expected risk (overfitting!). Solution: use "slack variables" xi_i >= 0 to relax the hard-margin constraints: y_i(w . x_i + b) >= 1 - xi_i, i = 1, ..., n.
We now also have to minimize an upper bound on the empirical risk: minimize (1/2)|w|^2 + C sum_i xi_i subject to the relaxed constraints, where the constant C > 0 trades margin size against training errors.
And the dual problem becomes: maximize W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j k(x_i, x_j) subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
Example Kernel Functions: polynomials, Gaussians, sigmoids, radial basis functions, ...
Example of an SV classifier found using an RBF kernel k(x, x') = exp(-||x - x'||^2). Here the input space is X = [-1, 1]^2. (Taken from Bill Freeman's notes)
Part 2: Gender Classification with SVMs
The Goal: Learning to classify pictures according to gender (male/female) when only the facial features appear (almost no hair).
The Experiment: Faces from FERET database pictures were processed to be consistent with the requirements of the experiment.
The Experiment: SVM performance was compared with: a linear classifier, a quadratic classifier, Fisher's linear discriminant, and nearest neighbor.
The Experiment Cont.: The experiment was conducted on two sets of data, high- and low-resolution versions of the same pictures, and a performance comparison was made. The goal was to learn the minimal data required for a classifier to classify gender. The performance of 30 humans was measured as well for comparison. The data: 1755 pictures, 711 females and 1044 males.
Training Data: 80-by-40-pixel images for the "high resolution" set and 21-by-12-pixel images for the thumbnails. Each classifier was estimated with 5-fold cross-validation (4/5 training, 1/5 testing).
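The 5-fold scheme can be sketched in a few lines of pure Python (the 1755-sample count is the one quoted above; the shuffling seed and strided fold assignment are arbitrary choices for illustration):

```python
import random

def five_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs: each fold tests on ~1/5 of the data."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    k = 5
    for fold in range(k):
        test = idx[fold::k]                  # every 5th index -> ~n/5 points
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

splits = list(five_fold_splits(1755))
print(len(splits), len(splits[0][1]))  # 5 folds, 351 test samples each
```

Each classifier is trained on the 4/5 train split and scored on the held-out 1/5; the reported error is the mean over the 5 folds.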
Support Faces
Results on Thumbnails
Human Error Rate
Human vs. SVM
Can you tell?
Can you tell? Answer: F-M-M-F-M