Part 2: Support Vector Machines

Part 2: Support Vector Machines
Vladimir Cherkassky, University of Minnesota, cherk001@umn.edu
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Electrical and Computer Engineering

SVM: Brief History
- Margin (Vapnik & Lerner), 1963
- Margin (Vapnik and Chervonenkis), 1964
- RBF kernels (Aizerman), 1964
- Optimization formulation (Mangasarian), 1965
- Kernels (Kimeldorf and Wahba), 1971
- SVMs (Vapnik et al.), 1992-1994
- Rapid growth, numerous applications, 1996 – present
- Extensions to other problems, 1996 – present

MOTIVATION for SVM
Problems with 'conventional' methods:
- model complexity ~ dimensionality (# features)
- nonlinear methods → multiple minima
- hard to control complexity
SVM solution approach:
- adaptive loss function (to control complexity independent of dimensionality)
- flexible nonlinear models
- tractable optimization formulation

SVM APPROACH
- Linear approximation in Z-space using a special adaptive loss function
- Complexity independent of dimensionality

OUTLINE
- Margin-based loss
- SVM for classification
- SVM examples
- Support vector regression
- Summary

Example: binary classification
Given: linearly separable data
How to construct a linear decision boundary?

Linear Discriminant Analysis
(Figure: LDA solution and separation margin)

Perceptron (linear NN)
(Figure: perceptron solutions and separation margin)

Largest-margin solution
- All solutions explain the data well (zero error)
- All solutions ~ the same linear parameterization
- Larger margin ~ more confidence (falsifiability)

Complexity of Δ-margin hyperplanes
If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
h ≤ min(R²/Δ², d) + 1
For large-margin hyperplanes, the VC dimension is controlled independently of dimensionality d.

Motivation: philosophical
Classical view: a good model explains the data + has low complexity
- Occam's razor (complexity ~ # parameters)
VC theory: a good model explains the data + has low VC-dimension
~ VC-falsifiability: a good model explains the data + has large falsifiability
The idea: falsifiability ~ empirical loss function

Adaptive loss functions
Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
- a (large) portion of the data has zero loss
- the rest of the data has non-zero loss, i.e. it falsifies the model
The trade-off (between the two goals) is adaptively controlled → adaptive loss function
Examples of such loss functions for different learning problems are shown next.

Margin-based loss for classification

Margin-based loss for classification: margin is adapted to training data
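A minimal sketch (added for illustration, not from the original slides) of the standard hinge loss used by soft-margin SVM classifiers, which implements this margin-based idea; the function name and sample values are illustrative:

import numpy as np

def hinge_loss(y, f):
    # Margin-based (hinge) loss: zero when y*f(x) >= 1, linear otherwise.
    # y is the class label in {-1, +1}; f is the real-valued classifier output.
    return np.maximum(0.0, 1.0 - y * f)

# Samples well on the correct side of the margin incur zero loss;
# samples inside the margin or misclassified samples falsify the model.
print(hinge_loss(+1, 2.0))   # 0.0  (correct, outside the margin)
print(hinge_loss(+1, 0.3))   # 0.7  (correct, but inside the margin)
print(hinge_loss(+1, -1.0))  # 2.0  (misclassified)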

Epsilon loss for regression

Parameter epsilon is adapted to training data
Example: linear regression y = x + noise, where noise = N(0, 0.36), x ~ [0, 1], 4 samples
Compare: squared, linear, and SVM loss (eps = 0.6)
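A rough sketch of the comparison mentioned above (squared, linear/absolute, and SVM epsilon-insensitive loss with eps = 0.6), evaluated on a few illustrative residuals:

import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def eps_insensitive_loss(r, eps=0.6):
    # SVM (epsilon-insensitive) loss: zero inside the epsilon tube
    return np.maximum(0.0, np.abs(r) - eps)

for r in [-1.0, -0.5, 0.0, 0.4, 0.9]:
    print(f"r={r:+.1f}  squared={squared_loss(r):.2f}  "
          f"absolute={absolute_loss(r):.2f}  eps-insensitive={eps_insensitive_loss(r):.2f}")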

OUTLINE
- Margin-based loss
- SVM for classification
  - Linear SVM classifier
  - Inner product kernels
  - Nonlinear SVM classifier
- SVM examples
- Support vector regression
- Summary

SVM Loss for Classification
The continuous quantity y·f(x) measures how close a sample x is to the decision boundary.

Optimal Separating Hyperplane
Distance between the hyperplane and the closest samples → margin
Shaded points are SVs

Linear SVM Optimization Formulation (for separable data)
Given training data (x_i, y_i), i = 1, ..., n, with y_i in {-1, +1}
Find parameters (w, b) of the linear hyperplane f(x) = (w · x) + b that minimize (1/2)||w||² under constraints y_i[(w · x_i) + b] ≥ 1, i = 1, ..., n
- Quadratic optimization with linear constraints, tractable for moderate dimensions d
- For large dimensions use the dual formulation:
  - scales with sample size (n) rather than d
  - uses only dot products
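A minimal sketch of this separable-case formulation using scikit-learn (not part of the original lecture); a very large C approximates the hard-margin problem, and the toy data are illustrative:

import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative, not from the lecture)
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # large C ~ hard-margin (separable) case
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)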

Classification for non-separable data

SVM for non-separable data
Minimize (1/2)||w||² + C Σ_i ξ_i under constraints y_i[(w · x_i) + b] ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n, where the slack variables ξ_i measure margin violations

SVM Dual Formulation
Given training data (x_i, y_i), i = 1, ..., n
Find parameters α_i of an optimal hyperplane as a solution to the maximization problem
maximize L(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
under constraints Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C
Solution: w = Σ_i α_i y_i x_i, so that f(x) = Σ_i α_i y_i (x_i · x) + b, where samples with nonzero α_i are SVs
Needs only inner products
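The dual solution can be checked numerically. In the hedged sketch below (scikit-learn attribute names; dual_coef_ stores the products α_i·y_i), the decision function is rebuilt from inner products with the support vectors only:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Illustrative data, not from the lecture
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = sum_i alpha_i * y_i * (x_i . x) + b, summed over support vectors only
sv = clf.support_vectors_
alpha_y = clf.dual_coef_[0]      # alpha_i * y_i for each support vector
b = clf.intercept_[0]

x_new = X[:3]
f_manual = (alpha_y[:, None] * (sv @ x_new.T)).sum(axis=0) + b
print(np.allclose(f_manual, clf.decision_function(x_new)))  # True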

Nonlinear Decision Boundary
- A fixed (linear) parameterization is too rigid
- A nonlinear curved decision boundary may yield a larger margin (falsifiability) and lower error

Nonlinear Mapping via Kernels
Nonlinear f(x, w) + margin-based loss = SVM
Nonlinear mapping to feature z-space, i.e. z = g(x)
Linear in z-space ~ nonlinear in x-space
BUT an explicit mapping may be intractable ~ kernel trick → compute the dot product via a kernel analytically
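A small sketch of the kernel trick for a degree-2 polynomial kernel (illustrative feature map and values, not from the slides): the kernel evaluated in input space equals the dot product of the explicitly mapped vectors, so the mapping never has to be computed:

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input x = (x1, x2)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(x, z, q=2):
    # Polynomial kernel K(x, z) = (x . z + 1)^q
    return (np.dot(x, z) + 1.0) ** q

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(np.dot(phi(x), phi(z)))  # dot product in feature (z) space: 0.25
print(poly_kernel(x, z))       # same value via the kernel trick: 0.25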

SVM Formulation (with kernels)
Replacing the dot product (x_i · x_j) with a kernel K(x_i, x_j) leads to:
Find parameters α_i of an optimal hyperplane as a solution to the maximization problem
maximize L(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
under constraints Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C
Given: the training data (x_i, y_i), an inner product kernel K(x, x'), and regularization parameter C

Examples of Kernels
A kernel is a symmetric function satisfying general mathematical conditions (Mercer's conditions)
Examples of kernels for different mappings x → z:
- Polynomial of degree q: K(x, x') = [(x · x') + 1]^q
- RBF kernel: K(x, x') = exp(−||x − x'||² / (2σ²))
- Neural-network (sigmoid) kernel: K(x, x') = tanh[κ(x · x') + θ], for given parameters κ, θ
Automatic selection of the number of hidden units (SVs)

More on Kernels
The kernel (Gram) matrix holds all the information (data + kernel):
H(1,1) H(1,2) ... H(1,n)
H(2,1) H(2,2) ... H(2,n)
...
H(n,1) H(n,2) ... H(n,n)
A kernel defines a distance in some feature space (aka kernel-induced feature space)
Kernels can incorporate a priori knowledge
Kernels can be defined over complex structures (trees, sequences, sets, etc.)
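For illustration (an assumed RBF kernel and bandwidth value, not specified on the slide), the kernel matrix can be built directly from the data:

import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    # Gram matrix H with H[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
H = rbf_kernel_matrix(X)
print(H.shape)                                          # (3, 3): one row/column per sample
print(np.allclose(H, H.T), np.all(np.diag(H) == 1.0))   # symmetric, unit diagonal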

Support Vectors
SVs ~ training samples with non-zero loss
SVs are samples that falsify the model
The model depends only on SVs → SVs ~ robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
SVM generalization ~ data compression

New insights provided by SVM
Why can linear classifiers generalize?
(1) The margin is large (relative to R)
(2) The percentage of SVs is small
(3) The ratio d/n is small
SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
Requires common-sense parameter tuning

OUTLINE
- Margin-based loss
- SVM for classification
- SVM examples
- Support vector regression
- Summary

Ripley's data set
- 250 training samples, 1,000 test samples
- SVM using RBF kernel
- Model selection via 10-fold cross-validation

Ripley's data set: SVM model
Decision boundary and margin borders; SVs are circled

Ripley's data set: model selection
SVM tuning parameters: C and the RBF kernel parameter γ
Select optimal parameter values via 10-fold cross-validation
Results of cross-validation are summarized below:

         C=0.1   C=1     C=10    C=100   C=1000  C=10000
γ=2^-3   98.4%   23.6%   18.8%   20.4%   18.4%   14.4%
γ=2^-2   51.6%   22%     20%     16%     14%
γ=2^-1   33.2%   19.6%   15.6%   13.6%   14.8%
γ=2^0    28%     18%     16.4%   12.8%
γ=2^1    20.8%   17.2%
γ=2^2    19.2%
γ=2^3
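A sketch of this kind of model selection with scikit-learn (the data loader is a stand-in, since Ripley's data set is not bundled with the library; the grid values follow the table above):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons  # stand-in for Ripley's data set

X, y = make_moons(n_samples=250, noise=0.3, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100, 1000, 10000],
    "gamma": [2.0**k for k in range(-3, 4)],  # 2^-3 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best 10-fold CV accuracy:", search.best_score_)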

Noisy Hyperbolas data set
This example shows the application of different kernels
Note: the decision boundaries are quite different
(Figure: decision boundaries for an RBF kernel and a polynomial kernel)

Many challenging applications
Mimic human recognition capabilities:
- high-dimensional data
- content-based
- context-dependent
Example: read the sentence
"Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word. It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly."
SVM is suitable for sparse high-dimensional data

Example SVM Applications
- Handwritten digit recognition
- Genomics
- Face detection in unrestricted images
- Text/document classification
- Image classification and retrieval
- ...

Handwritten Digit Recognition (mid-90's)
Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and 2K test samples
Data encoding: 16x16 pixel image → 256-dimensional vector
Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)
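A rough sketch of the one-vs-all scheme (scikit-learn's 8x8 digits data stand in for the 16x16 postal images; the kernel parameters are illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# 8x8 digit images flattened to 64-dim vectors (stand-in for 16x16 -> 256-dim)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One binary SVM per digit class (one-vs-all)
ova = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.001, C=10.0))
ova.fit(X_train, y_train)
print("number of binary classifiers:", len(ova.estimators_))  # 10
print("test accuracy:", ova.score(X_test, y_test))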

Digit Recognition Results
Summary:
- prediction accuracy better than custom NNs
- accuracy does not depend on the kernel type
- 100-400 support vectors per class (digit)
More details:

Type of kernel    No. of support vectors    Error %
Polynomial        274                       4.0
RBF               291                       4.1
Neural network    254                       4.2

~80-90% of the SVs coincide (for different kernels)

Document Classification (Joachims, 1998)
The problem: classification of text documents in large databases, for text indexing and retrieval
Traditional approach: human categorization (i.e. via feature selection), which relies on a good indexing scheme; this is time-consuming and costly
Predictive learning approach (SVM): construct a classifier using all possible features (words)
Document/text representation: individual words = input features (possibly weighted)
SVM performance: very promising (~90% accuracy vs. 80% by other classifiers)
Most problems are linearly separable → use linear SVM
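A hedged sketch of this approach (all words as weighted features + a linear SVM), using scikit-learn and the 20 newsgroups corpus as a stand-in for the databases discussed above:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Stand-in corpus: two newsgroup categories (downloaded on first use)
train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

# Every word becomes a (tf-idf weighted) input feature; a linear SVM separates the classes
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))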

OUTLINE
- Margin-based loss
- SVM for classification
- SVM examples
- Support vector regression
- Summary

Linear SVM regression
Assume linear parameterization f(x, w) = (w · x) + b

Direct Optimization Formulation
Given training data (x_i, y_i), i = 1, ..., n
Minimize (1/2)||w||² + C Σ_i (ξ_i + ξ_i*)
Under constraints:
y_i − (w · x_i) − b ≤ ε + ξ_i
(w · x_i) + b − y_i ≤ ε + ξ_i*
ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., n

Example: SVM regression using RBF kernel
The SVM estimate is shown as a dashed line
The SVM model uses only 5 SVs (out of the 40 points)
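A sketch in the same spirit (synthetic 1-D data and illustrative C, gamma, and epsilon values, since the slide's data are not reproduced here), showing that the epsilon-insensitive loss keeps the number of support vectors small:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 40)

# Only points outside the epsilon tube become support vectors
svr = SVR(kernel="rbf", C=10.0, gamma=5.0, epsilon=0.2)
svr.fit(x, y)
print("support vectors:", len(svr.support_), "out of", len(x), "training points")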

RBF regression model Weighted sum of 5 RBF kernels gives the SVM model

Summary
- Margin-based loss: robust + performs complexity control
- Nonlinear feature selection (~ SVs): performed automatically
- Tractable model selection – easier than most nonlinear methods
- SVM is not a magic-bullet solution:
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h