Support Vector Machines (SVM)

Support Vector Machines (SVM). STT592-002: Intro. to Statistical Learning, Chapter 09. Disclaimer: this PPT is modified based on IOM 530: Intro. to Statistical Learning.

9.1 Support Vector Classifier

Separating Hyperplanes. Consider a two-class classification problem with two predictors, X1 and X2. The data are "linearly separable" if one can draw a straight line such that all points on one side belong to the first class and all points on the other side belong to the second class. A natural approach is then to find the straight line that gives the biggest separation between the classes, i.e. the line from which the points are as far away as possible. This is the basic idea of a support vector classifier.

Classification with a Separating Hyperplane. For a test observation x*, compute the value of f(x*). Q: what can the sign and magnitude of f(x*) tell us?

Classification with a Separating Hyperplane (continued). The sign of f(x*) tells us on which side of the hyperplane x* lies, and hence which class to assign. The magnitude of f(x*): if f(x*) is far from zero, then x* lies far from the hyperplane, and we can be confident about our class assignment for x*. On the other hand, if f(x*) is close to zero, then x* is located near the hyperplane, and we are less certain about the class assignment for x*.
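Written out in standard ISLR notation, the decision function and classification rule are the following sketch:

```latex
% Linear decision function for a separating hyperplane
f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \dots + \beta_p x_p^*
% Classification rule: assign x^* to class +1 if f(x^*) > 0 and to class -1 if f(x^*) < 0.
% |f(x^*)| grows with the distance of x^* from the hyperplane, so a large
% magnitude corresponds to a confident assignment and a value near zero to
% an uncertain one.
```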

Maximal Margin Classifier (Optimal Separating Hyperplane). Consider the distance from each training observation to a given separating hyperplane; the margin is the smallest such distance. The maximal margin hyperplane is the separating hyperplane with the largest margin, that is, the hyperplane whose minimum distance to the training observations is largest. The maximal margin classifier classifies a test observation based on which side of the maximal margin hyperplane it lies. Geometrically, it gives the widest "slab" between the two classes.

Support Vectors. In the right panel of the figure, three observations are known as support vectors. They "support" the maximal margin hyperplane in the sense that if they move slightly, the maximal margin hyperplane moves as well. Interestingly, the maximal margin hyperplane depends directly on the support vectors, but not on the other observations.

Construction of the Maximal Margin Classifier. The maximal margin hyperplane is the solution of an optimization problem in which M > 0 denotes the margin of the hyperplane.
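In the standard ISLR form (eq. 9.9 to 9.11), that optimization problem can be sketched as:

```latex
% Maximal margin classifier: make the margin M as large as possible.
\max_{\beta_0, \beta_1, \dots, \beta_p, \, M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1, \qquad
y_i \bigl( \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \bigr) \ge M
\quad \text{for all } i = 1, \dots, n.
% With the normalization constraint, the left-hand side of the last inequality
% is the signed distance from x_i to the hyperplane, so every training
% observation must lie at least a distance M on the correct side.
```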

Towards Support Vector Classifiers. Add a single point: the maximal margin hyperplane is extremely sensitive to a change in a single observation. We would therefore be willing to pay a price in margin width for: greater robustness to individual observations, and better classification of most of the training observations.

Support Vector Classifiers (Soft Margin Classifiers). Allow some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane.

Support Vector Classifiers. As with the maximal margin classifier, we maximize a margin M > 0, but now with slack variables εi: εi = 0 means observation i is on the correct side of the margin; εi > 0 means it is on the wrong side of the margin; εi > 1 means it is on the wrong side of the hyperplane. C is a tuning parameter, a budget for the total slack, chosen via cross-validation. The support vectors are the observations that lie on the margin or on the wrong side of the margin or hyperplane. Large C: many observations are support vectors, so low variance but potentially high bias. Small C: low bias but high variance.
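In the standard ISLR form (eq. 9.12 to 9.15), the support vector classifier solves the following problem; this sketch uses the slack and budget notation described above:

```latex
% Support vector (soft margin) classifier.
\max_{\beta_0, \dots, \beta_p, \, \epsilon_1, \dots, \epsilon_n, \, M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1, \qquad
y_i \bigl( \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \bigr) \ge M (1 - \epsilon_i),
\qquad \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.
% C is the total budget for margin violations: epsilon_i > 0 means observation
% i violates the margin, and epsilon_i > 1 means it is misclassified.
```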

It's Easiest To See With A Picture (Grand. Book). M is the minimum perpendicular distance between each point and the separating line. We find the line that maximizes M; this line is called the "optimal separating hyperplane". The classification of a point depends on which side of the line it falls on.

Non-Separable Example (Grand. Book). Let ξi* represent the amount by which the ith point is on the wrong side of the margin (the dashed line). We then want to maximize M subject to constraints that limit the total amount of such slack.

A Simulation Example With A Small Constant. This is the simulation example from Chapter 1. The distance between the dashed lines represents the margin, or 2M. The purple lines represent the Bayes decision boundaries. E.g.: with C = 10000, 62% of the observations are support vectors.

The Same Example With A Larger Constant. Using a larger constant allows for a greater margin and creates a slightly different classifier. Notice, however, that the decision boundary must always be linear. With C = 10000, 62% of the observations are support vectors; with C = 0.01, 85% are support vectors.
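For concreteness, here is a minimal scikit-learn sketch (an editorial example on simulated data, not the slides' original code) of how the number of support vectors changes with the cost setting. Note that scikit-learn's C is a penalty on margin violations, so a small C behaves roughly like a large budget in the formulation above: a wider margin and more support vectors.

```python
# Minimal sketch: linear support vector classifiers at different cost values.
# The data are simulated two-class points, a hypothetical stand-in for the
# slides' example; sklearn's C is a penalty (small C -> wider margin).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(1.5, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for cost in (0.01, 1.0, 10000.0):
    clf = SVC(kernel="linear", C=cost).fit(X, y)
    n_sv = clf.support_vectors_.shape[0]
    print(f"C = {cost}: {n_sv} support vectors "
          f"({100 * n_sv / len(y):.0f}% of the data)")

# In practice the cost parameter is chosen by cross-validation:
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("cross-validated choice of C:", grid.best_params_["C"])
```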

9.2 Support Vector Machine Classifier

Non-linear Class Boundaries. What about non-linear class boundaries? In practice we may not be able to find a hyperplane that perfectly separates the two classes. We then look for the plane that gives the best separation between the points that are correctly classified, subject to the points on the wrong side of the line not being off by too much. It is easier to see with a picture!

Classification with Non-linear Decision Boundaries. Consider enlarging the feature space using functions of the predictors, such as quadratic and cubic terms, to address this non-linearity, or use kernel functions. Left: an SVM with a polynomial kernel of degree 3 fit to the non-linear data gives an appropriate decision rule. Right: an SVM with a radial kernel. In this example, either kernel is capable of capturing the decision boundary.

Non-Linear Classifiers with Kernel Functions. The support vector classifier is fairly easy to think about. However, because it only allows for a linear decision boundary, it may not be all that powerful. Recall that in Chapter 3 we extended linear regression to non-linear regression using basis functions, i.e. a model of the form y = β0 + β1 b1(x) + β2 b2(x) + ⋯ + βM bM(x) + ε.

A Basis Approach. Conceptually, we take a similar approach with the support vector classifier. The support vector classifier finds the optimal hyperplane in the space spanned by X1, X2, …, Xp. We instead create transformations (a basis) b1(x), b2(x), …, bM(x) and find the optimal hyperplane in the transformed space spanned by b1(X), b2(X), …, bM(X). This approach produces a linear plane in the transformed space but a non-linear decision boundary in the original space. The result is called the Support Vector Machine (SVM) classifier.
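Written out, the boundary fitted in the transformed space is a sketch of the form:

```latex
% Hyperplane fitted in the transformed (basis) space:
\beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \dots + \beta_M b_M(x) = 0
% This is linear in b_1(x), ..., b_M(x), but generally traces out a non-linear
% curve in the original predictors x_1, ..., x_p (for example, when the b_m
% include quadratic and cubic terms).
```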

SVM for Classification. For a test observation x*, compute the value of f(x*); the linear support vector classifier can be represented entirely in terms of inner products of observations.
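In ISLR's notation, that inner-product representation is:

```latex
% Inner-product representation of the linear support vector classifier:
f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle ,
\qquad \langle x, x_i \rangle = \sum_{j=1}^{p} x_j \, x_{ij} .
% The alpha_i turn out to be nonzero only for the support vectors, so the sum
% can be restricted to f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle,
% where S is the set of support-vector indices.
```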

SVM for Classification: Kernels. Replace the inner product of the support vector classifier with a generalization of the form K(xi, xi'). A kernel is a function that quantifies the similarity of two observations. Common choices are the linear kernel, the polynomial kernel of degree d, and the radial kernel.
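The standard forms of these three kernels, as given in ISLR, are:

```latex
% Common kernel functions:
\text{linear:} \qquad K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} \, x_{i'j}
\text{polynomial of degree } d: \qquad K(x_i, x_{i'}) = \Bigl( 1 + \sum_{j=1}^{p} x_{ij} \, x_{i'j} \Bigr)^{d}
\text{radial:} \qquad K(x_i, x_{i'}) = \exp\Bigl( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \Bigr), \quad \gamma > 0
% The linear kernel recovers the support vector classifier; the polynomial and
% radial kernels produce non-linear decision boundaries in the original space.
```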

In Reality. While conceptually the basis approach is how the support vector machine works, there is some complicated math (which I will spare you) which means that we do not actually choose b1(x), b2(x), …, bM(x). Instead we choose a kernel function, which takes the place of the basis. Common kernel functions include: linear, polynomial, radial basis, and sigmoid.
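As a quick illustration, here is a minimal scikit-learn sketch (an editorial example, not the original deck's code) comparing these four kernels on simulated non-linear data:

```python
# Minimal sketch: the four common SVM kernels on simulated two-class data
# whose true boundary is non-linear (class depends on distance from origin).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 1.5).astype(int)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),      # polynomial kernel of degree d
                       ("rbf", {"gamma": "scale"}),  # radial basis function kernel
                       ("sigmoid", {})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:8s} kernel: cross-validated accuracy = {acc:.2f}")
```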

Summary: Support Vector Machines. Support vector classifier: f(x) = β0 + Σ αi ⟨x, xi⟩, a linear decision boundary. Support vector machine: f(x) = β0 + Σ αi K(x, xi), where replacing the inner product with a kernel K yields a non-linear decision boundary.

Polynomial Kernel on Simulated Data. Using a polynomial kernel, we now allow the SVM to produce a non-linear decision boundary. Notice that the test error rate is much lower.

Radial Basis Kernel. Using a radial basis function (RBF) kernel, you get an even lower error rate.

More Than Two Predictors. This idea works just as well with more than two predictors. For example, with three predictors we want to find the plane that produces the largest separation between the classes. With more than three dimensions it becomes hard to visualize a plane, but it still exists; in general such planes are called hyperplanes. For more than two classes: one-versus-one compares each pair of classes, while one-versus-all compares each of the K classes against the remaining K-1 classes (see the sketch below).
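A minimal sketch of the two multi-class strategies, assuming scikit-learn and its built-in iris data (three classes); scikit-learn's SVC uses one-versus-one internally, and OneVsRestClassifier wraps it into one-versus-all:

```python
# Minimal sketch: one-versus-one vs. one-versus-all multi-class SVMs.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # K = 3 classes, 4 predictors

# One-versus-one: a binary SVM for every pair of classes (K*(K-1)/2 fits).
ovo = SVC(kernel="rbf", C=1.0)
print("one-vs-one  CV accuracy:", cross_val_score(ovo, X, y, cv=5).mean())

# One-versus-all: one binary SVM per class against the remaining K-1 classes.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))
print("one-vs-all  CV accuracy:", cross_val_score(ovr, X, y, cv=5).mean())
```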

Review (Chapters 4 and 8): How to Draw an ROC Curve

LDA for the Default Data. Overall accuracy = 97.25%; the total number of mistakes is 252 + 23 = 275 (a 2.75% misclassification error rate). However, we mis-predict 252/333 = 75.7% of the defaulters. Examine the error rate at other thresholds using sensitivity and specificity. E.g.: sensitivity = % of true defaulters that are identified = 24.3% (low); specificity = % of non-defaulters that are correctly identified = 99.8%.

Use 0.2 as the Threshold for Default. Now the total number of mistakes is 138 + 235 = 373 (a 3.73% misclassification error rate), but we mis-predict only 138/333 = 41.4% of the defaulters. Examine the error rate at other thresholds using sensitivity and specificity. E.g.: sensitivity = % of true defaulters that are identified = 58.6% (higher); specificity = % of non-defaulters that are correctly identified = 97.6%.
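A minimal sketch of this threshold calculation; the data below are a synthetic stand-in for the Default data (the slides' actual data are not reproduced here), with the same class sizes of 333 defaulters and 9667 non-defaulters:

```python
# Minimal sketch: sensitivity and specificity of LDA at two thresholds.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_neg, n_pos = 9667, 333                       # class sizes as in Default
X = np.vstack([rng.normal(0.0, 1.0, size=(n_neg, 2)),
               rng.normal(1.0, 1.0, size=(n_pos, 2))])
y = np.array([0] * n_neg + [1] * n_pos)        # 1 = "default"

lda = LinearDiscriminantAnalysis().fit(X, y)
prob = lda.predict_proba(X)[:, 1]              # posterior P(default | x)

for threshold in (0.5, 0.2):                   # thresholds used on the slides
    pred = (prob >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
    sensitivity = tp / (tp + fn)               # % of true defaulters identified
    specificity = tn / (tn + fp)               # % of non-defaulters identified
    print(f"threshold {threshold}: sensitivity {sensitivity:.1%}, "
          f"specificity {specificity:.1%}")
```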

Receiver Operating Characteristic (ROC) Curve. The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve hugs the top left corner, so the larger the AUC, the better the classifier. For these data the AUC is 0.95, which is close to the maximum of one and so would be considered very good.
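Continuing the sketch above, the ROC curve and AUC can be traced out over all thresholds with scikit-learn (prob and y are the variables defined in the previous sketch):

```python
# Minimal sketch: ROC curve and AUC for the fitted LDA probabilities.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y, prob)      # false and true positive rates
print("AUC =", roc_auc_score(y, prob))

plt.plot(fpr, tpr, label="LDA")
plt.plot([0, 1], [0, 1], linestyle="--", label="no-information classifier")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```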

Receiver Operating Characteristic (ROC) Curve: for definitions of sensitivity and specificity, see https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Receiver Operating Characteristic (ROC) Curve. E.g.: in the Default data, "+" indicates an individual who defaults and "−" indicates one who does not. To connect to the classical hypothesis-testing literature, we think of "−" as the null hypothesis and "+" as the alternative (non-null) hypothesis.

Receiver Operating Characteristic (ROC) Curve. E.g.: sensitivity = % of true defaulters that are identified = 195/333 = 58.6%; specificity = % of non-defaulters that are correctly identified = 97.6%. False positive rate = 1 − specificity = 1 − 97.6% = 2.4% (or 235/9667); true positive rate = sensitivity = 195/333 = 58.6%.

An Application to the Heart Disease Data (training set). An optimal classifier will hug the top left corner of the ROC plot.

An Application to the Heart Disease Data (test set). An optimal classifier will hug the top left corner of the ROC plot.