1 Support Vector Machines (SVMs)
Copyright 2005 by David Helmbold.
References: Cristianini & Shawe-Taylor book; Vapnik's book; and "A Tutorial on Support Vector Machines for Pattern Recognition" by Burges, in Data Mining and Knowledge Discovery 2, 1998 (check CiteSeer)

2 SVMs
- Combine learning theory and optimization theory
- Solve the optimization exactly (as opposed to ANNs or decision trees)
- Allow the "kernel trick" to get more features almost for free
- Soft margin relaxes the linearly separable assumption

3 Basic Idea (separable case)
- Which linear threshold hypothesis to use? Pick the one with the biggest margin (gap)
  - Intuitively good, safe against small errors
  - Some supporting theory
  - Empirically works well
- Leads to a quadratic programming problem (Vapnik 95, Cortes and Vapnik 95)
[Figure: separable points with a maximum-margin separating line]

4 Basic Idea (continued)
- Pick the one with the biggest margin (gap); leads to a quadratic programming problem (Vapnik 95, Cortes and Vapnik 95)
- Support vectors are the points with the smallest margin (closest to the boundary, circled in the figure)
- Hypothesis stability: the hypothesis only changes if the support vectors change
[Figure: separable points with the support vectors circled]

5 Good generalization
- If f(x) = Σ_j w_j x_j + b with Σ_j w_j² = 1 and a large margin, then the classifier is expected to generalize well
- Generalization error is roughly on the order of R²/(δ²N), where R = radius of the support (of the data), N = # examples, δ = margin
- This is independent of the dimension (# of features)!

6 SVM applications
- Text classification (Joachims, who used outlier removal when the a_i got too large; Joachims is also the author of SVM-light)
- Object detection (e.g., Osuna et al., for faces)
- Handwritten digits: few support vectors
- Bioinformatics: protein homology, micro-arrays
- See the Cristianini & Shawe-Taylor book and www.kernel-machines.org for more info and references

7 SVM Mathematically
- Find w, b, δ such that:
  w · x_i + b ≥ δ when t_i = +1
  w · x_i + b ≤ -δ when t_i = -1
  with δ as big as possible
- Scaling issue: fix it by setting δ = 1 and finding the shortest w:
  min_{w,b} ||w||² subject to t_i (w · x_i + b) ≥ 1 for all examples (x_i, t_i)
- This is a quadratic programming problem; packages exist (also the SMO algorithm)
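
A minimal sketch of the optimization above, assuming scikit-learn and a made-up toy dataset (neither appears in the original slides); a very large C makes the soft-margin solver behave essentially like the hard-margin problem.

```python
# Hard-margin linear SVM sketch (hypothetical toy data; scikit-learn is an
# assumption -- the slide only states the optimization problem).
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy set with labels t_i in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])
t = np.array([+1, +1, +1, -1, -1, -1])

# A huge C makes margin violations so expensive that the soft-margin QP
# effectively solves min ||w||^2 subject to t_i (w.x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e10).fit(X, t)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
# Check the constraints t_i (w.x_i + b) >= 1 (up to numerical tolerance).
print("margins:", t * (X @ w + b))
```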

8 More Math (Quickly)
- Lagrange multipliers, dual problem, ..., magic, ...
- The w vector is a linear combination of the support vectors: w = Σ_{support vectors i} a_i x_i
- Predict on x based on w · x + b, or equivalently Σ_{support vectors i} a_i (x_i · x) + b
- Key idea 1: predictions (and finding the a_i's) depend only on dot products
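
A small sketch (same hypothetical toy data and scikit-learn assumption as above) checking Key Idea 1: the fitted decision value can be recomputed from dot products with the support vectors alone.

```python
# Verifying that predictions depend only on dot products with support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])
t = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e10).fit(X, t)

x_new = np.array([0.5, 0.2])
# dual_coef_ holds the a_i (label sign already folded in) for each support
# vector; support_vectors_ holds the x_i themselves.
a = clf.dual_coef_[0]
sv = clf.support_vectors_
score_from_dots = np.sum(a * (sv @ x_new)) + clf.intercept_[0]

print(score_from_dots)                     # sum_i a_i (x_i . x) + b
print(clf.decision_function([x_new])[0])   # should match
```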

9 Soft Margin for noise
- Allow margin errors: ξ_i ≥ 0 measures the amount of error on x_i
- Want to minimize both w · w and Σ_i ξ_i
- Need a trade-off parameter
- Still a quadratic programming problem; the math is messier
[Figure: overlapping points with slack variables ξ_1 and ξ_2 marked]
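
A hedged illustration of the trade-off parameter, on made-up overlapping data with scikit-learn (an assumption, not the slides' software): a larger C penalizes the slacks Σ_i ξ_i more heavily.

```python
# Soft-margin trade-off sketch: C weights the slack term against ||w||^2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
t = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    margins = t * clf.decision_function(X)
    # xi_i = max(0, 1 - t_i f(x_i)) measures the margin error on x_i.
    slacks = np.maximum(0.0, 1.0 - margins)
    print(f"C={C}: #support vectors={len(clf.support_)}, "
          f"total slack={slacks.sum():.2f}, ||w||={np.linalg.norm(clf.coef_):.2f}")
```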

10 Kernel functions
- Don't need to use the dot product; can use any "dot-product like" function K(x, x')
- Example: K(x, x') = (x · x' + 1)²
- In 1 dimension: K(x, x') = x²x'² + 2xx' + 1 = (x², √2 x, 1) · (x'², √2 x', 1), so we get a quadratic term too (an extra feature)
- How can this help?
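
A quick numeric check of the 1-dimensional example above (numpy only; the √2 factor is needed for the expansion to match exactly).

```python
# The polynomial kernel equals a dot product in an explicit feature space.
import numpy as np

def K(x, xp):
    return (x * xp + 1.0) ** 2

def phi(x):
    # (x.x' + 1)^2 = x^2 x'^2 + 2 x x' + 1 = phi(x) . phi(x')
    return np.array([x ** 2, np.sqrt(2.0) * x, 1.0])

x, xp = 1.7, -0.4
print(K(x, xp))           # kernel evaluated directly
print(phi(x) @ phi(xp))   # dot product in the embedded space -- same value
```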

11 [Figure: points on a line] Want to classify these points with an SVM, but no hyperplane (threshold) has a good margin

12 [Figure: the same points after a nonlinear embedding, now separable by a line]
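
A hypothetical numeric version of the picture on slides 11-12, assuming scikit-learn: adding an x² feature makes the 1-D points separable by a plane.

```python
# In 1-D, points with the "inner" class between the "outer" class cannot be
# split by a single threshold, but the embedding x -> (x, x^2) separates them.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
t = np.array([-1, -1, +1, +1, +1, -1, -1])   # inner points are +1

embedded = np.column_stack([x, x ** 2])      # add the x^2 feature
clf = SVC(kernel="linear", C=1e10).fit(embedded, t)
print(clf.predict(embedded))                 # perfectly separated
```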

13 Kernel trick
- Kernel functions embed the low-dimensional original space into a higher-dimensional one, like the film of a soap bubble
- Although the sample is not linearly separable in the original space, the higher-dimensional embedding might be
- Get the embedding and the "extra" features (almost) for free
- This is a general trick for any dot-product based algorithm (e.g., the Perceptron)
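
A minimal kernelized Perceptron sketch, assuming an RBF kernel and a tiny XOR-like dataset (none of this is from the slides); it shows that the algorithm only ever touches the data through kernel evaluations.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_perceptron(X, t, kernel=rbf, epochs=10):
    n = len(X)
    alpha = np.zeros(n)                       # one dual weight per example
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            # The prediction uses only kernel evaluations, never w explicitly.
            f = np.sum(alpha * t * K[:, i])
            if t[i] * f <= 0:                 # mistake: bump this example's weight
                alpha[i] += 1.0
    return alpha

# Hypothetical XOR-like data, not linearly separable in the input space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t = np.array([+1, +1, -1, -1])
alpha = kernel_perceptron(X, t)
preds = np.sign([(alpha * t) @ np.array([rbf(xi, x) for xi in X]) for x in X])
print(preds)   # should recover the labels
```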

14 Common Kernels
- Polynomials: K(x, x') = (x · x' + 1)^d
  - Gets new features like x_1 x_3
- Radial basis function (Gaussian): K(x, x') = exp(-||x - x'||² / 2σ²)
- Sigmoid: K(x, x') = tanh(a (x · x') - b)
- Also special-purpose string and graph kernels
- Which kernel and which parameters to use? Cross-validation can help; RBF kernels are common
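
A hedged sketch of choosing the kernel and its parameters by cross-validation, as suggested above; the dataset, the grid, and scikit-learn's GridSearchCV are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = [
    # coef0=1 gives the (x.x'+1)^d form used on the slide.
    {"kernel": ["poly"], "degree": [2, 3], "coef0": [1.0], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```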

15 What can be a Kernel function?
- K(x, z) is a (Mercer) kernel function if K(x, z) = Φ(x) · Φ(z) for some feature transformation Φ(·)
- Equivalently, K(x, z) is a Mercer kernel function if the matrix K = (K(x_i, x_j))_{i,j} is positive semi-definite for all subsets {x_1, ..., x_n} of the instance space
- Vapnik showed an expected test error bound: (expected # support vectors) / (# examples)
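
A numeric sanity check of the second (positive semi-definite) characterization on a random sample; the sample and the "non-kernel" used for contrast are made up for illustration.

```python
# A valid kernel's Gram matrix should have no (significantly) negative eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))

def gram(kernel, X):
    return np.array([[kernel(x, z) for z in X] for x in X])

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)    # Mercer kernel
not_kernel = lambda x, z: -np.sum((x - z) ** 2)           # generally not PSD

for name, k in [("rbf", rbf), ("negative squared distance", not_kernel)]:
    eigs = np.linalg.eigvalsh(gram(k, X))
    print(f"{name}: smallest eigenvalue = {eigs.min():.4f}")
```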

16 Gaussian Kernels
- Act like a weighted nearest neighbor on the support vectors
- Allow a very flexible hypothesis space
- Mathematically a dot product in an "infinite-dimensional" space
- SVMs unify LTUs (linear kernel) and nearest neighbor (Gaussian kernel)
- Recall w = Σ_i α_i x_i; what is the prediction?
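
A sketch answering the slide's closing question, under the assumption of scikit-learn (where gamma = 1/(2σ²)): the prediction is sign(Σ_i α_i exp(-||x_i - x||² / 2σ²) + b), a weighted vote of the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 1, (40, 2)), rng.normal(-1, 1, (40, 2))])
t = np.array([+1] * 40 + [-1] * 40)

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, t)

x_new = np.array([0.3, -0.2])
a, sv, b = clf.dual_coef_[0], clf.support_vectors_, clf.intercept_[0]
# Weighted "vote" of the support vectors, weighted by a Gaussian of the distance.
score = np.sum(a * np.exp(-np.sum((sv - x_new) ** 2, axis=1) / (2 * sigma ** 2))) + b
print(np.sign(score), clf.predict([x_new])[0])   # should agree
```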

17 Multiclass Kernel Machines
- 1-vs-all
- Pairwise separation (all pairs)
- Error-Correcting Output Codes (section 17.5)
- Single multiclass optimization (notation: superscript t = example index, z^t = label of example t, subscript i = class index)
(Slides 17-22 are from the Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press, V1.0)
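
A brief sketch of the first two strategies using scikit-learn's wrappers (an assumption; the single joint multiclass optimization is not shown here).

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)  # 1-vs-all
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)   # all pairs
print(len(ova.estimators_), "one-vs-rest classifiers")   # = #classes
print(len(ovo.estimators_), "pairwise classifiers")      # = #classes choose 2
```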

18 SVM for Regression
- Use a linear model (possibly kernelized): f(x) = wᵀx + w_0
- Use the ε-insensitive error function
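
A small ε-insensitive regression sketch, assuming scikit-learn's SVR on made-up noisy sine data; errors smaller than ε cost nothing, which keeps the solution sparse in support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.linspace(0, 2 * np.pi, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 60)

# Points inside the epsilon tube get zero dual weight (sparseness).
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors used:", len(reg.support_), "of", len(X))
print("prediction at pi/2:", reg.predict([[np.pi / 2]])[0])
```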


20 Kernel Regression [Figures: regression fits with a polynomial kernel and with a Gaussian kernel]

21 One-Class Kernel Machines
- Consider a sphere with center a and radius R that encloses the data as tightly as possible
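
A hedged one-class sketch assuming scikit-learn's OneClassSVM with a Gaussian kernel (not the slides' formulation); nu roughly upper-bounds the fraction of training points treated as outliers.

```python
# Fit a boundary around "normal" data only, then flag points outside it.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
normal = rng.normal(0, 1, (200, 2))          # training data: one class only
oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(normal)

test = np.array([[0.0, 0.0], [4.0, 4.0]])    # an inlier and an obvious outlier
print(oc.predict(test))                      # +1 = inside, -1 = outside
```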

22 [Figures: one-class SVM with Gaussian kernels for different settings (s is sigma)]

23 SVM Summary
- Model: very flexible, but many parameters must be picked (kernel, kernel parameters, trade-off parameter)
- Data: numeric (depending on the kernel)
- Interpretable? Yes for the dot-product kernel; pretty pictures for Gaussian kernels in low dimensions
- Missing values? No
- Noise/outliers? Very good (soft margin)
- Irrelevant features? Yes, for the dot product with an absolute-value penalty
- Computational efficiency? Quadratic in the # of examples, but "exact" optimization rather than approximate (unlike ANNs or decision trees)
- Chunking techniques and other optimizations exist (SMO iteratively optimizes the a_i's over pairs, SGD, Pegasos, etc.), as sketched below
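
A minimal Pegasos-style SGD sketch for the linear soft-margin objective λ/2 ||w||² + (1/N) Σ_i max(0, 1 - t_i w·x_i); the data, the step-size schedule, and the omission of a bias term and projection step are simplifying assumptions.

```python
import numpy as np

def pegasos(X, t, lam=0.01, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for step in range(1, iters + 1):
        i = rng.integers(n)                   # one random example per step
        eta = 1.0 / (lam * step)              # decreasing step size
        if t[i] * (w @ X[i]) < 1:             # hinge loss is active: shrink and push
            w = (1 - eta * lam) * w + eta * t[i] * X[i]
        else:                                 # only the regularizer contributes
            w = (1 - eta * lam) * w
    return w

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(+1, 1, (100, 2)), rng.normal(-1, 1, (100, 2))])
t = np.array([+1] * 100 + [-1] * 100)
w = pegasos(X, t)
print("training accuracy:", np.mean(np.sign(X @ w) == t))
```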