Support Vector Machines


Support Vector Machines and Other Penalization Classifiers. Summer Course: Data Mining. Presenter: Georgi Nalbantov. August 2009.

Contents: Purpose; Linear Support Vector Machines; Nonlinear Support Vector Machines; (Theoretical justifications of SVM); Marketing examples; Other penalization classification methods; Conclusion and Q & A (some extensions).

Purpose. Task to be solved (the classification task): classify cases (customers) into “type 1” or “type 2” on the basis of some known attributes (characteristics). Chosen tool to solve this task: Support Vector Machines.

The Classification Task. Given data on explanatory and explained variables, where the explained variable can take two values {−1, +1}, find a function that gives the “best” separation between the “−1” cases and the “+1” cases. Given: (x1, y1), …, (xm, ym) ∈ ℝⁿ × {−1, +1}. Find: f : ℝⁿ → {−1, +1}. “Best” function: one for which the expected error on unseen data (xm+1, ym+1), …, (xm+k, ym+k) is minimal. Existing techniques to solve the classification task: Linear and Quadratic Discriminant Analysis; Logit choice models (logistic regression); decision trees, neural networks, Least Squares SVM.

Support Vector Machines: Definition. Support Vector Machines are a non-parametric tool for classification and regression. They are used for prediction rather than description purposes. They have been developed by Vapnik and co-workers.

Linear Support Vector Machines. A direct marketing company wants to sell a new book, “The Art History of Florence” (Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green, 2003). Problem: how to identify buyers and non-buyers using two variables: months since last purchase and number of art books purchased. (Figure: buyers ∆ and non-buyers ● plotted against months since last purchase and number of art books purchased.)

Linear SVM: Separable Case. Main idea of SVM: separate the groups by a line. However, there are infinitely many lines that have zero training error; which line shall we choose? (Figure: buyers ∆ and non-buyers ●.)

Linear SVM: Separable Case. SVM uses the idea of a margin around the separating line. The thinner the margin, the more complex the model; the best line is the one with the largest margin. (Figure: the separating line and its margin; buyers ∆ and non-buyers ●.)

Linear SVM: Separable Case. The line having the largest margin is w1·x1 + w2·x2 + b = 0, where x1 = months since last purchase and x2 = number of art books purchased. Note: w1·xi1 + w2·xi2 + b ≥ +1 for i ∈ ∆, and w1·xj1 + w2·xj2 + b ≤ −1 for j ∈ ●. (Figure: the lines w1x1 + w2x2 + b = +1, 0, −1 and the margin between them.)

Linear SVM: Separable Case. The width of the margin is given by 2/‖w‖, where ‖w‖ = √(w1² + w2²). So maximizing the margin is equivalent to minimizing ‖w‖ (or, equivalently, ‖w‖²/2). (Figure: the lines w1x1 + w2x2 + b = +1, 0, −1 and the margin width.)
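(A quick numeric illustration, not from the original slides: if w = (3, 4), then ‖w‖ = √(9 + 16) = 5 and the margin width is 2/5 = 0.4; halving w to (1.5, 2) gives ‖w‖ = 2.5 and doubles the margin to 0.8.)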

Linear SVM: Separable Case. The optimization problem for SVM is: minimize ½(w1² + w2²), that is, maximize the margin, subject to w1·xi1 + w2·xi2 + b ≥ +1 for i ∈ ∆ and w1·xj1 + w2·xj2 + b ≤ −1 for j ∈ ●. (Figure: the separating line and its margin.)

Linear SVM: Separable Case. “Support vectors” are those points that lie on the boundaries of the margin. The decision surface (line) is determined only by the support vectors; all other points are irrelevant. (Figure: the support vectors marked on the margin boundaries.)
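To make the preceding slides concrete, here is a minimal sketch (not part of the original presentation) that assumes scikit-learn and uses synthetic two-dimensional data in place of the book-buyer data; an SVC with a linear kernel and a very large C approximates the hard-margin, separable case, and its fitted attributes expose w, b, the margin width 2/‖w‖ and the support vectors.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, (nearly) linearly separable stand-in for the buyers / non-buyers data
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

# A very large C approximates the hard-margin (separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # (w1, w2) of the line w1*x1 + w2*x2 + b = 0
b = clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width =", 2 / np.linalg.norm(w))
print("support vectors:")
print(clf.support_vectors_)        # the points lying on the margin boundaries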

Linear SVM: Nonseparable Case. In the non-separable case there is no line that separates the two groups without errors. Here, SVM minimizes L(w, C) = Complexity + Errors (maximize the margin and minimize the training errors), subject to: w1·xi1 + w2·xi2 + b ≥ +1 − ξi for i ∈ ∆, w1·xj1 + w2·xj2 + b ≤ −1 + ξj for j ∈ ●, and ξi, ξj ≥ 0. (Training set: 1000 targeted customers. Figure: buyers ∆ and non-buyers ●, with the line w1x1 + w2x2 + b = 1 and the margin.)

Linear SVM: The Role of C. Bigger C: increased complexity (thinner margin) and a smaller number of errors (better fit on the data). Smaller C: decreased complexity (wider margin) and a bigger number of errors (worse fit on the data). C thus varies both complexity and empirical error, by affecting the optimal w and the optimal number of training errors. (Figures: the fitted line and margin for C = 1 and C = 5.)
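A sketch of the role of C under the same assumptions (scikit-learn, synthetic overlapping data rather than the 1000 targeted customers): refitting the same linear SVM with the two C values from the slide shows the wider margin, larger number of support vectors and larger number of training errors for the smaller C.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so some training errors are unavoidable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (1, 5):                   # the two values shown on the slide
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    errors = (clf.predict(X) != y).sum()
    print(f"C = {C}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}, "
          f"training errors = {errors}")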

Bias – Variance trade-off

From Regression into Classification. We have a linear model, such as y = w·x + b. We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion). The training data set consists of many observations of an input x and an output y ∈ {−1, +1}, for instance:
Output (y)   Input (x)
    −1          0.2
     1          0.5
     1          0.7
     …           …
    −1         −0.7

From Regression into Classification. The same training data shown graphically: the inputs x are plotted on the horizontal axis and the outputs y ∈ {−1, +1} on the vertical axis, together with a fitted line, its “margin”, and the support vectors.

From Regression into Classification: Support Vector Machines. A flatter line corresponds to greater penalization; equivalently, a smaller slope corresponds to a bigger “margin”. (Figure: the fitted line, the levels y = +1 and y = −1, and the “margin”.)

From Regression into Classification: Support Vector Machines. The same idea with two input variables x1 and x2: a flatter line corresponds to greater penalization; equivalently, a smaller slope corresponds to a bigger “margin”.

Nonlinear SVM: Nonseparable Case. Mapping into a higher-dimensional space. Optimization task: minimize L(w, C), subject to the same type of constraints as before. (Figure: data in the original x1–x2 space that cannot be separated by a line.)

Nonlinear SVM: Nonseparable Case. Map the data into a higher-dimensional space: ℝ² → ℝ³. (Figure: four points (1,1), (−1,1), (−1,−1), (1,−1) in the x1–x2 plane, belonging to two classes ∆ and ●.)

Nonlinear SVM: Nonseparable Case. Find the optimal hyperplane in the transformed space. (Figure: the same four points, now separated in the transformed space.)

Nonlinear SVM: Nonseparable Case. Observe the decision surface in the original space (optional). (Figure: the resulting nonlinear decision surface in the original x1–x2 space.)
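A minimal sketch of this mapping idea, on the assumption that the four corner points form the classic XOR-like pattern (labels assigned here for illustration, not taken from the slide): the explicit degree-2 map φ(x1, x2) = (x1², √2·x1·x2, x2²) makes the points linearly separable in ℝ³, and a degree-2 polynomial kernel gives the same classifier without ever forming φ explicitly.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([1, 1, -1, -1])       # assumed labels; no single line separates them

def phi(X):
    # Explicit degree-2 feature map: R^2 -> R^3
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Linear SVM in the transformed space ...
lin = SVC(kernel="linear", C=1000).fit(phi(X), y)
print(lin.predict(phi(X)))         # [ 1  1 -1 -1]

# ... is equivalent to a degree-2 polynomial kernel in the original space
poly = SVC(kernel="poly", degree=2, gamma=1, coef0=0, C=1000).fit(X, y)
print(poly.predict(X))             # [ 1  1 -1 -1]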

Nonlinear SVM: Nonseparable Case. Dual formulation of the (primal) SVM minimization problem: the primal problem (a minimization over w, b and the slack variables, subject to the margin constraints) has an equivalent dual problem (a maximization over one multiplier per training point, subject to simpler constraints).
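For reference, the standard textbook form of the two programs (writing yi = +1 for the ∆ class and yi = −1 for the ● class) is:
Primal: minimize over w, b, ξ:  ½·‖w‖² + C·Σi ξi,  subject to  yi·(wᵀxi + b) ≥ 1 − ξi  and  ξi ≥ 0  for all i.
Dual: maximize over α:  Σi αi − ½·Σi Σj αi αj yi yj xiᵀxj,  subject to  0 ≤ αi ≤ C for all i  and  Σi αi yi = 0.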

Nonlinear SVM: Nonseparable Case. Dual formulation of the (primal) SVM minimization problem: in the dual, the data enter only through inner products, which can be replaced by a kernel function K(xi, xj) = φ(xi)ᵀφ(xj), so the nonlinear mapping never has to be computed explicitly.

Strengths and Weaknesses of SVM. Strengths: training is relatively easy; no local minima; it scales relatively well to high-dimensional data; the trade-off between classifier complexity and error can be controlled explicitly via C; robustness of the results; the “curse of dimensionality” is avoided. Weaknesses: what is the best trade-off parameter C?; a good transformation of the original space is needed.

The Ketchup Marketing Problem. Two types of ketchup: Heinz and Hunts. Seven attributes: Feature Heinz, Feature Hunts, Display Heinz, Display Hunts, Feature&Display Heinz, Feature&Display Hunts, and the log price difference between Heinz and Hunts. Training data: 2498 cases (89.11% Heinz is chosen). Test data: 300 cases (88.33% Heinz is chosen).

The Ketchup Marketing Problem. Choose a kernel mapping: linear kernel, polynomial kernel, or RBF kernel. Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ). (Figure: cross-validation mean squared errors for the SVM with RBF kernel, plotted over C and σ from min to max.)
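A sketch of such a tuning procedure, assuming scikit-learn and synthetic data in place of the ketchup scans (scikit-learn parameterizes the RBF kernel by gamma, which plays the role of 1/(2σ²), and classification accuracy is used here as the cross-validation criterion instead of the mean squared error shown on the slide):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 7-attribute ketchup data
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # complexity / error trade-off
    "gamma": [0.01, 0.1, 1, 10],   # RBF width, roughly 1 / (2 * sigma**2)
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best (C, gamma):", search.best_params_)
print("best 5-fold CV accuracy:", round(search.best_score_, 3))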

The Ketchup Marketing Problem – Training Set. Model: Linear Discriminant Analysis.
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                  68               204       272
Heinz (count)                  58              2168      2226
Hunts (%)                  25.00%            75.00%   100.00%
Heinz (%)                   2.61%            97.39%   100.00%
Hit rate: 89.51%

The Ketchup Marketing Problem – Training Set. Model: Logit Choice Model.
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                 214                58       272
Heinz (count)                 497              1729      2226
Hunts (%)                  78.68%            21.32%   100.00%
Heinz (%)                  22.33%            77.67%   100.00%
Hit rate: 77.79%

The Ketchup Marketing Problem – Training Set. Model: Support Vector Machines.
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                 255                17       272
Heinz (count)                   6              2220      2226
Hunts (%)                  93.75%             6.25%   100.00%
Heinz (%)                   0.27%            99.73%   100.00%
Hit rate: 99.08%

The Ketchup Marketing Problem – Training Set. Model: Majority Voting (every case predicted as Heinz).
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                   0               272       272
Heinz (count)                   0              2226      2226
Hunts (%)                   0.00%           100.00%   100.00%
Heinz (%)                   0.00%           100.00%   100.00%
Hit rate: 89.11%

The Ketchup Marketing Problem – Test Set. Model: Linear Discriminant Analysis.
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                   3                32        35
Heinz (count)                   3               262       265
Hunts (%)                   8.57%            91.43%   100.00%
Heinz (%)                   1.13%            98.87%   100.00%
Hit rate: 88.33%

The Ketchup Marketing Problem – Test Set. Model: Logit Choice Model.
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                  29                 6        35
Heinz (count)                  63               202       265
Hunts (%)                  82.86%            17.14%   100.00%
Heinz (%)                  23.77%            76.23%   100.00%
Hit rate: 77%

The Ketchup Marketing Problem – Test Set. Model: Support Vector Machines.
Original group    Predicted Hunts   Predicted Heinz     Total
Hunts (count)                  25                10        35
Heinz (count)                   3               262       265
Hunts (%)                  71.43%            28.57%   100.00%
Heinz (%)                   1.13%            98.87%   100.00%
Hit rate: 95.67%
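Classification tables and hit rates like the ones above can be computed from any model's predictions; a minimal sketch with made-up labels (not the actual ketchup predictions):

from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted brands for a few test cases
y_true = ["Heinz", "Heinz", "Hunts", "Heinz", "Hunts", "Heinz"]
y_pred = ["Heinz", "Heinz", "Hunts", "Hunts", "Hunts", "Heinz"]

# Rows = actual class, columns = predicted class (order fixed by `labels`)
print(confusion_matrix(y_true, y_pred, labels=["Hunts", "Heinz"]))
print("hit rate:", accuracy_score(y_true, y_pred))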

Part II. Penalized classification and regression methods: Support Hyperplanes, Nearest Convex Hull classifier, Soft Nearest Neighbor. Application: an example Support Vector Regression financial study. Conclusion.

Classification: Support Hyperplanes. Consider a (separable) binary classification case: training data (+, −) and a test point x. There are infinitely many hyperplanes that are semi-consistent (i.e., commit no error) with the training data.

Classification: Support Hyperplanes. For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data (the “support hyperplane” of x). The SH decision surface: each point on it has two support hyperplanes.

Classification: Support Hyperplanes. Toy problem experiment with Support Hyperplanes and Support Vector Machines.

Classification: Support Vector Machines and Support Hyperplanes. (Figure: Support Vector Machines vs. Support Hyperplanes.)

Classification: Support Vector Machines and Nearest Convex Hull classification. (Figure: Support Vector Machines vs. Nearest Convex Hull classification.)

Classification: Support Vector Machines and Soft Nearest Neighbor. (Figure: Support Vector Machines vs. Soft Nearest Neighbor.)

Classification: Support Hyperplanes. (Figure: Support Hyperplanes with bigger penalization.)

Classification: Nearest Convex Hull classification. (Figure: Nearest Convex Hull classification with bigger penalization.)

Classification: Soft Nearest Neighbor. (Figure: Soft Nearest Neighbor with bigger penalization.)

Classification: Support Vector Machines, Nonseparable Case

Classification: Support Hyperplanes, Nonseparable Case

Classification: Nearest Convex Hull classification, Nonseparable Case

Classification: Soft Nearest Neighbor, Nonseparable Case

Summary: Penalization Techniques for Classification. Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbor (SNN). In all cases, the classification of the test point x is determined using the hyperplane h. Equivalently, x is labelled +1 (−1) if it is farther away from the set S− (S+).

Conclusion. Support Vector Machines (SVM) can be applied to binary and multi-class classification problems. SVM behave robustly in multivariate problems. Further research in various marketing areas is needed to justify or refute the applicability of SVM. Support Vector Regression (SVR) can also be applied. http://www.kernel-machines.org Email: nalbantov@few.eur.nl