Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Learning through “empirical risk” minimization Estimate g(x) from a finite set of observations by minimizing an error function, for example, the training error (also called empirical risk): $E_{emp} = \frac{1}{2n}\sum_{k=1}^{n} |z_k - g(x_k)|$, where the class labels are $z_k \in \{-1, +1\}$.

Learning through “empirical risk” minimization (cont’d) Conventional empirical risk minimization does not guarantee good generalization performance. Several different functions g(x) may all approximate the training data set equally well, and it is difficult to determine which of them would have the best generalization performance.

Learning through “empirical risk” minimization (cont’d) (Figure: two candidate solutions, Solution 1 and Solution 2, that both fit the training data.) Which solution is better?

Statistical Learning: Capacity and VC dimension To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled. Functions with high capacity are more complicated (i.e., have many degrees of freedom). (Figure: a low-capacity function vs. a high-capacity function.)

Statistical Learning: Capacity and VC dimension (cont’d) How do we measure capacity? In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity. The VC dimension provides a probabilistic upper bound on the generalization error of a classifier.

Statistical Learning: Capacity and VC dimension (cont’d) A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space. With probability $1-\delta$, $E_{gen} \le E_{emp} + \sqrt{\frac{h\left(\ln(2n/h)+1\right) - \ln(\delta/4)}{n}}$, where $h$ is the VC dimension and $n$ is the number of training examples (Vapnik, 1995, “Structural Risk Minimization Principle”).

VC dimension and margin of separation Vapnik has shown that maximizing the margin of separation (i.e., empty space between classes) is equivalent to minimizing the VC dimension. The optimal hyperplane is the one giving the largest margin of separation between the classes.

Margin of separation and support vectors How is the margin defined? The margin is defined by the distance of the nearest training samples from the hyperplane. We refer to these samples as support vectors. Intuitively speaking, these are the most difficult samples to classify.

Margin of separation and support vectors (cont’d) (Figure: different separating solutions and their corresponding margins.)

SVM Overview SVMs are primarily two-class classifiers but can be extended to multiple classes. They perform structural risk minimization to achieve good generalization performance. The optimization criterion is the margin of separation between the classes. Training is equivalent to solving a quadratic programming problem with linear constraints.

Linear SVM: separable case The linear discriminant is $g(x) = w^T x + w_0$ with class labels $z_k \in \{-1, +1\}$: decide ω1 if g(x) > 0 and ω2 if g(x) < 0. Consider the equivalent problem: find w and w0 such that $z_k\, g(x_k) > 0$ for every training sample $x_k$.

Linear SVM: separable case (cont’d) The distance of a point $x_k$ from the separating hyperplane should satisfy the constraint $\frac{z_k\, g(x_k)}{\lVert w \rVert} \ge b$, with $b > 0$ the margin. To constrain the length of w (for uniqueness of the solution), we impose $b\,\lVert w \rVert = 1$. Using the above constraint, the conditions become $z_k\, g(x_k) \ge 1$ for all $k$.
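For reference (an added derivation, not on the original slide), the normalization above implies the familiar expression for the margin:

```latex
% Signed distance of x_k from the hyperplane g(x) = w^T x + w_0 = 0:
%   r_k = g(x_k) / ||w||.
% With z_k g(x_k) >= 1, and equality holding for the support vectors,
% the closest samples on each side lie at distance 1/||w||, so
\[
\text{margin} \;=\; \frac{2}{\lVert w \rVert}.
\]
```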

Linear SVM: separable case (cont’d) Maximize the margin $\frac{2}{\lVert w \rVert}$, or equivalently minimize $\frac{1}{2}\lVert w \rVert^2$ subject to $z_k (w^T x_k + w_0) \ge 1$ for all $k$; this is a quadratic programming problem with linear constraints.
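As an added illustration (not part of the original slides), a minimal Python sketch that solves this quadratic program on a toy linearly separable dataset using scikit-learn's SVC; a very large C approximates the hard-margin, separable case:

```python
import numpy as np
from sklearn.svm import SVC

# A tiny linearly separable 2-D dataset with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
z = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin (separable-case) SVM.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, z)

w = clf.coef_[0]          # weight vector w
w0 = clf.intercept_[0]    # bias w0
print("w  =", w)
print("w0 =", w0)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
```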

Linear SVM: separable case (cont’d) Using Lagrange optimization, minimize $L(w, w_0, \lambda) = \frac{1}{2}\lVert w \rVert^2 - \sum_k \lambda_k \left[ z_k (w^T x_k + w_0) - 1 \right]$ with $\lambda_k \ge 0$. It is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize $\sum_k \lambda_k - \frac{1}{2} \sum_k \sum_j \lambda_k \lambda_j z_k z_j (x_k^T x_j)$ subject to $\sum_k \lambda_k z_k = 0$ and $\lambda_k \ge 0$.

Linear SVM: separable case (cont’d) The solution is given by $w = \sum_k \lambda_k z_k x_k$, so that $g(x) = \sum_k \lambda_k z_k (x_k^T x) + w_0$; the data enter the solution only through dot products.
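Not from the slides, but a hedged sketch of solving this dual directly with the cvxopt quadratic-programming solver, following the notation above:

```python
import numpy as np
from cvxopt import matrix, solvers  # pip install cvxopt

def linear_svm_dual(X, z):
    """Hard-margin SVM dual: maximize sum(l) - 0.5*l'Hl with H = (z z')*(X X')."""
    n = X.shape[0]
    H = np.outer(z, z) * (X @ X.T)
    P = matrix(H + 1e-8 * np.eye(n))      # small ridge for numerical stability
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                # -lambda_k <= 0, i.e. lambda_k >= 0
    h = matrix(np.zeros(n))
    A = matrix(z.reshape(1, -1).astype(np.double))  # sum_k lambda_k z_k = 0
    b = matrix(0.0)
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = (lam * z) @ X                     # w = sum_k lambda_k z_k x_k
    sv = lam > 1e-6                       # support vectors have lambda_k > 0
    w0 = np.mean(z[sv] - X[sv] @ w)       # from z_k (w'x_k + w0) = 1 at the SVs
    return w, w0, lam
```

Called on the toy dataset of the previous sketch, this should recover essentially the same w and w0 as the SVC fit.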

Linear SVM: separable case (cont’d) The discriminant depends on the data only through dot products: $g(x) = \sum_k \lambda_k z_k (x_k^T x) + w_0$. It can be shown that if $x_k$ is not a support vector, then the corresponding $\lambda_k = 0$. Only the support vectors contribute to the solution!
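A short added illustration: after fitting scikit-learn's SVC, the support vectors and their coefficients $\lambda_k z_k$ are exposed directly, and w can be rebuilt from them alone:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
z = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e10).fit(X, z)

# dual_coef_ holds lambda_k * z_k for the support vectors only.
w_from_sv = clf.dual_coef_[0] @ clf.support_vectors_
print("support vector indices:", clf.support_)
print("w rebuilt from support vectors:", w_from_sv)
print("w reported by the solver:      ", clf.coef_[0])  # should match
```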

Linear SVM: non-separable case Allow misclassifications (i.e., a soft-margin classifier) by introducing positive error (slack) variables ψk: the constraints become $z_k (w^T x_k + w_0) \ge 1 - \psi_k$ with $\psi_k \ge 0$, and we minimize $\frac{1}{2}\lVert w \rVert^2 + c \sum_k \psi_k$.

Linear SVM: non-separable case (cont’d) The constant c controls the trade-off between the width of the margin and the misclassification errors, and it helps prevent outliers from dominating the optimal hyperplane.
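To make the role of this constant concrete (an added sketch; scikit-learn's C parameter plays the role of c above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs => no perfect linear separation.
X = np.vstack([rng.normal(loc=+1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=-1.0, scale=1.0, size=(50, 2))])
z = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, z)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin={margin:.3f}, "
          f"#support vectors={len(clf.support_)}")
# Small C -> wider margin, more slack (more support vectors);
# large C -> narrower margin, fewer margin violations tolerated.
```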

Linear SVM: non-separable case (cont’d) Again it is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize $\sum_k \lambda_k - \frac{1}{2} \sum_k \sum_j \lambda_k \lambda_j z_k z_j (x_k^T x_j)$ subject to $\sum_k \lambda_k z_k = 0$ and $0 \le \lambda_k \le c$.

Nonlinear SVM Extending these concepts to the non-linear case involves mapping the data to a high-dimensional space of dimensionality h: $x \rightarrow \Phi(x)$. Mapping the data to a sufficiently high-dimensional space is likely to make them linearly separable in that space.

Nonlinear SVM (cont’d) Example:
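The example on this slide is shown as a figure; as a stand-in illustration (my own, not taken from the slides), here is an explicit quadratic mapping Φ that turns a radially separated 2-D problem into a linearly separable 3-D one:

```python
import numpy as np

def phi(X):
    """Map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2): one common degree-2 mapping."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
z = np.where(np.linalg.norm(X, axis=1) < 1.0, +1, -1)  # inside vs. outside a circle

# In the original space the classes are not linearly separable,
# but in phi-space the circle x1^2 + x2^2 = 1 becomes the plane u1 + u3 = 1.
Z = phi(X)
print(Z.shape)  # (200, 3)
```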

Nonlinear SVM (cont’d)

Nonlinear SVM (cont’d) The disadvantage of this approach is that the mapping $\Phi(x)$ might be very expensive to compute! In the non-linear SVM the discriminant becomes $g(x) = \sum_k \lambda_k z_k \Phi(x_k)^T \Phi(x) + w_0$. Is there an efficient way to compute $\Phi(x_k)^T \Phi(x)$?

The kernel trick Compute the dot products using a kernel function: $K(x, y) = \Phi(x)^T \Phi(y)$, so that $g(x) = \sum_k \lambda_k z_k K(x_k, x) + w_0$.

The kernel trick (cont’d) Comments: Kernel functions which can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper). Mercer’s condition does not tell us how to construct Φ(), or even what the high-dimensional space is. Advantages of the kernel trick: there is no need to know Φ(), and the computations remain feasible even if the feature space has very high dimensionality.
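A quick numerical check (added illustration; phi() below is one hypothetical explicit mapping) that a polynomial kernel computes the dot product in the mapped space without ever forming Φ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x . y)^d."""
    return (x @ y) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))       # dot product in the mapped space
print(poly_kernel(x, y, 2))  # same value, computed in the original space
```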

Polynomial Kernel $K(x, y) = (x \cdot y)^d$

Polynomial Kernel - Example
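The worked example on this slide appears as an equation image; a representative expansion (my reconstruction, for d = 2 and two-dimensional inputs) is:

```latex
% Degree-2 polynomial kernel in two dimensions:
\[
K(x, y) = (x \cdot y)^2 = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
        = \Phi(x)^T \Phi(y),
\]
\[
\text{with}\quad \Phi(x) = \bigl(x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2\bigr)^T .
\]
```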

Common Kernel functions
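The table on this slide is an image; the kernels usually listed in such tables are the linear, polynomial, Gaussian (RBF), and sigmoid kernels. A brief sketch of each (added code, with hypothetical default parameters):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=2, c=1.0):
    return (x @ y + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ y) + delta)
```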

Example

Example (cont’d) The mapped feature space has dimensionality h = 6.

Example (cont’d)

Example (cont’d) (Problem 4)

Example (cont’d) The bias term of the solution is w0 = 0.

Example (cont’d) (The resulting weight vector w is given on the slide.)

Example (cont’d) (The weight vector w and the corresponding discriminant g(x) are given on the slide.)
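The details of this worked example are in the slide images. As an added, hedged check, the code below fits an SVM with a degree-2 polynomial kernel to the classic XOR data from Duda et al., Chapter 5 (an assumption about which example is meant, consistent with h = 6 and w0 = 0):

```python
import numpy as np
from sklearn.svm import SVC

# The classic XOR data set (assumed example): labels z = x1 * x2.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
z = np.array([+1, +1, -1, -1])

# Degree-2 polynomial kernel K(x, y) = (x.y + 1)^2 maps to a 6-D space.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1e10).fit(X, z)

print("bias w0 =", clf.intercept_[0])          # expected to be ~0 by symmetry
print("g(x) at the four points:", clf.decision_function(X))
# The learned discriminant is equivalent to g(x) = x1 * x2 (up to scale),
# so the decision values should have the same signs as x1 * x2.
```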

Comments SVM training is based on exact optimization, not on approximate methods (it is a global optimization problem with no local optima). SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well from small training sets. Performance depends on the choice of the kernel and its parameters. The complexity of the classifier depends on the number of support vectors, not on the dimensionality of the transformed space.