Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Learning through “empirical risk” minimization Typically, a discriminant function g(x) is estimated from a finite set of examples by minimizing an error function, e.g., the training error: E_emp = (1/n) Σ_{k=1..n} L(z_k, g(x_k)), where z_k is the correct class label of example x_k, g(x_k) is the predicted class, and L(·,·) penalizes disagreement between them. Minimizing this quantity is called empirical risk minimization.
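
A minimal illustration (not from the original slides) of computing the empirical risk of a candidate discriminant on a labeled training set, using a 0/1 loss as an assumed error function:

```python
import numpy as np

def empirical_risk(g, X, z):
    """Fraction of training examples misclassified by discriminant g.

    X: (n, d) array of training samples, z: (n,) array of labels in {-1, +1}.
    A sample counts as an error when sign(g(x)) != z (0/1 loss).
    """
    predictions = np.sign([g(x) for x in X])
    return np.mean(predictions != z)

# Toy usage with a hypothetical linear discriminant g(x) = w.x + w0
w, w0 = np.array([1.0, -1.0]), 0.5
g = lambda x: w @ x + w0
X = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]])
z = np.array([+1, -1, +1])
print(empirical_risk(g, X, z))  # training error on the toy set
```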

Learning through “empirical risk” minimization (cont’d) Conventional empirical risk minimization does not imply good generalization performance. – There could be several different functions g(x) which all approximate the training data set well. – It is difficult to determine which of these functions would have the best generalization performance. (Figure: two candidate solutions, “Solution 1” and “Solution 2”, that both separate the training data; which solution is best?)

Statistical Learning: Capacity and VC dimension To guarantee good generalization performance, the complexity or capacity of the learned functions must be controlled. Functions with high capacity are more complicated (i.e., have many degrees of freedom). (Figure: example decision boundaries of low-capacity vs. high-capacity functions.)

Statistical Learning: Capacity and VC dimension (cont’d) How can we measure the capacity of a discriminant function? – In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of the capacity of a classifier. – The VC dimension yields a probabilistic upper bound on the generalization error of a classifier.

Statistical Learning: Capacity and VC dimension (cont’d) It can be shown that a classifier that (1) minimizes the empirical risk and (2) has a low VC dimension will generalize well, regardless of the dimensionality of the input space. Specifically, with probability (1 − δ) the true risk is bounded by R(g) ≤ R_emp(g) + sqrt( [h(ln(2n/h) + 1) − ln(δ/4)] / n ), where n is the number of training examples and h is the VC dimension. Choosing the classifier that minimizes this bound is Vapnik’s “Structural Risk Minimization” principle (Vapnik, 1995).
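
A small numeric sketch of the bound’s confidence term (the values of n, h, and δ below are arbitrary choices for illustration):

```python
import math

def vc_confidence_term(n, h, delta):
    """Confidence term of the VC generalization bound:
    sqrt((h * (ln(2n/h) + 1) - ln(delta / 4)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(delta / 4)) / n)

# Larger capacity h loosens the bound; more training data n tightens it.
for n in (100, 1000, 10000):
    print(n, round(vc_confidence_term(n, h=10, delta=0.05), 3))
```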

VC dimension and margin of separation Vapnik has shown that maximizing the margin of separation (i.e., empty space between classes) is equivalent to minimizing the VC dimension. The optimal hyperplane is the one giving the largest margin of separation between the classes.

Margin of separation and support vectors How is the margin defined? – The margin is defined by the distance of the nearest training samples from the hyperplane. – Intuitively speaking, these are the most difficult samples to classify. – We refer to these samples as support vectors.

Margin of separation and support vectors (cont’d) (Figure: different candidate solutions and their corresponding margins of separation.)

SVM Overview SVMs are primarily two-class classifiers but can be extended to multiple classes. They perform structural risk minimization to achieve good generalization performance. The optimization criterion is the margin of separation between the classes. Training is equivalent to solving a quadratic programming problem with linear constraints.

Linear SVM: separable case Linear discriminant: g(x) = w^T x + w_0. Class labels: z_k = +1 if x_k ∈ ω_1 and z_k = −1 if x_k ∈ ω_2. Decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0; equivalently, require z_k g(x_k) > 0 for every training example x_k.

Linear SVM: separable case (cont’d) The distance r of a point x_k from the separating hyperplane is r = g(x_k)/||w||, and every training point should satisfy the constraint z_k g(x_k)/||w|| ≥ b for some margin b > 0. To constrain the length of w (for uniqueness of the solution), we impose b·||w|| = 1, so the constraint becomes z_k g(x_k) ≥ 1.

Linear SVM: separable case (cont’d) Maximize the margin 2/||w||, or, equivalently, minimize (1/2)||w||^2 subject to z_k(w^T x_k + w_0) ≥ 1 for all k. This is a quadratic programming problem with linear constraints.
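
A hedged sketch of solving this quadratic program in practice with scikit-learn (the very large C value approximates the hard-margin, separable-case formulation; the data set is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])  # class -1
z = np.array([+1, +1, +1, -1, -1, -1])

# A huge C effectively forbids margin violations (hard margin).
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, z)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("w =", w, " w0 =", w0)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```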

Linear SVM: separable case (cont’d) Using Lagrange optimization, minimize L(w, w_0, λ) = (1/2)||w||^2 − Σ_k λ_k [z_k(w^T x_k + w_0) − 1], with λ_k ≥ 0. It is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize Σ_k λ_k − (1/2) Σ_j Σ_k λ_j λ_k z_j z_k x_j^T x_k subject to Σ_k λ_k z_k = 0 and λ_k ≥ 0.

Linear SVM: separable case (cont’d) The solution is given by w = Σ_k λ_k z_k x_k, with w_0 = z_k − w^T x_k for any k with λ_k > 0. The discriminant is given by g(x) = Σ_k λ_k z_k (x_k^T x) + w_0, i.e., it involves the training data only through dot products.
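
Continuing the sketch above, the primal weight vector can be recovered from the dual solution as w = Σ_k λ_k z_k x_k; scikit-learn exposes the products λ_k z_k for the support vectors as dual_coef_:

```python
# Reconstruct w from the dual solution of the classifier fitted above:
# dual_coef_[0] holds lambda_k * z_k for each support vector.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print("w (primal)    =", clf.coef_[0])
print("w (from dual) =", w_from_dual)   # should match up to numerical error
```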

Linear SVM: separable case (cont’d) It can be shown that if x_k is not a support vector, then the corresponding λ_k = 0. The solution depends on the support vectors only!

Linear SVM: non-separable case Allow misclassifications (i.e., a soft-margin classifier) by introducing positive error (slack) variables ψ_k: minimize (1/2)||w||^2 + c Σ_k ψ_k subject to z_k(w^T x_k + w_0) ≥ 1 − ψ_k and ψ_k ≥ 0, where c is a constant.

Linear SVM: non-separable case (cont’d) The constant c controls the trade-off between the margin and the misclassification errors. This formulation aims to prevent outliers from affecting the optimal hyperplane.

Linear SVM: non-separable case (cont’d) Again it is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize Σ_k λ_k − (1/2) Σ_j Σ_k λ_j λ_k z_j z_k x_j^T x_k subject to Σ_k λ_k z_k = 0 and 0 ≤ λ_k ≤ c. The only difference from the separable case is the upper bound c on the multipliers λ_k.
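
A brief sketch of the effect of the constant c (exposed as C in scikit-learn) on the soft margin; the overlapping data set below is invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds (invented data).
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=1.5, scale=1.0, size=(50, 2))])
z = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, z)
    w = clf.coef_[0]
    print(f"C={C:<6}  margin width={2 / np.linalg.norm(w):.2f}  "
          f"#support vectors={clf.support_vectors_.shape[0]}")
# Small C -> wider margin but more margin violations (more support vectors);
# large C -> narrower margin, fewer violations tolerated.
```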

Nonlinear SVM The extension to the non-linear case involves mapping the data to an h-dimensional space through a transformation Φ(x). Mapping the data to a sufficiently high-dimensional space is likely to make the classes linearly separable in that space.

Nonlinear SVM (cont’d) Example:
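
A minimal stand-in example (my own construction, not necessarily the slide’s original one): a 1-D data set that is not linearly separable becomes separable after the mapping Φ(x) = (x, x^2).

```python
import numpy as np

# 1-D data: class +1 lies between -1 and 1, class -1 lies outside.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
z = np.array([-1,   -1,   +1,   +1,  +1,  -1,  -1])

# No single threshold on x separates the classes, but after the mapping
# Phi(x) = (x, x^2) they are separated by the line  x2 = 2,
# i.e. g(Phi(x)) = 2 - x^2  (positive for class +1, negative for class -1).
phi = np.column_stack([x, x ** 2])
g = 2.0 - phi[:, 1]
print(np.sign(g) == z)   # all True: separable in the mapped space
```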

Nonlinear SVM (cont’d) Linear SVM discriminant: g(x) = Σ_k λ_k z_k (x_k^T x) + w_0. Non-linear SVM discriminant: g(x) = Σ_k λ_k z_k (Φ(x_k)^T Φ(x)) + w_0.

Nonlinear SVM (cont’d) The disadvantage of this approach is that the mapping Φ(x) might be very expensive to compute. Is there an efficient way to compute the dot products Φ(x_k)^T Φ(x) without evaluating Φ explicitly?

The kernel trick Compute the dot products in the mapped space using a kernel function evaluated in the original space: K(x, y) = Φ(x)^T Φ(y).

The kernel trick (cont’d) Mercer’s condition – Kernel functions which can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper). – Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is. Advantages of the kernel trick – No need to know Φ(). – Computations remain feasible even if the feature space has very high dimensionality.

Polynomial Kernel K(x, y) = (x · y)^d

Polynomial Kernel (cont’d)
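
For d = 2 and x, y in R^2, the standard expansion is K(x, y) = (x · y)^2 = Φ(x)^T Φ(y) with Φ(x) = (x_1^2, √2 x_1 x_2, x_2^2); a quick numeric check of this identity (the test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for 2-D input."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2)          # kernel value (x . y)^2
print(phi(x) @ phi(y))       # dot product in the mapped space -> same value
```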

Common Kernel functions
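
Kernels commonly listed at this point include the polynomial, Gaussian (RBF), and sigmoid kernels; the definitions below are the standard textbook forms (not recovered from the slide), with placeholder parameter values:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    return (x @ y + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ y) + delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y), sigmoid_kernel(x, y))
```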

Example

Example (cont’d) h=6

Example (cont’d)

(Problem 4)

Example (cont’d) w_0 = 0

Example (cont’d) w =

Example (cont’d) The discriminant w =
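
As a hedged stand-in for the worked example (assuming it is the classic XOR problem with a degree-2 polynomial kernel, which is consistent with the h = 6 feature-space dimension and w_0 = 0 noted above, but is not confirmed by the transcript):

```python
import numpy as np
from sklearn.svm import SVC

# XOR data: not linearly separable in the original 2-D space.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
z = np.array([+1, +1, -1, -1])

# Degree-2 polynomial kernel K(x, y) = (x . y + 1)^2, i.e. a 6-D feature space
# with Phi(x) = (1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2).
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e10).fit(X, z)

print("support vectors:\n", clf.support_vectors_)    # all four points
print("w0 =", clf.intercept_[0])                     # ~0 for this symmetric problem
print("decision values:", clf.decision_function(X))  # signs match the labels z
```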

Comments SVM training is based on exact optimization, not on approximate methods (it is a global optimization with no local optima). SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well from small training sets. Performance depends on the choice of the kernel and its parameters. The complexity of the resulting classifier depends on the number of support vectors, not on the dimensionality of the transformed space.