Introduction to Predictive Learning


Introduction to Predictive Learning LECTURE SET 7: Support Vector Machines Electrical and Computer Engineering

OUTLINE Objectives (explain motivation for SVM; describe basic SVM for classification & regression; compare SVM vs. statistical & NN methods) Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

MOTIVATION for SVM Recall ‘conventional’ methods: - model complexity ~ dimensionality - nonlinear methods → multiple local minima - hard to control complexity ‘Good’ learning method: (a) tractable optimization formulation (b) tractable complexity control (1-2 parameters) (c) flexible nonlinear parameterization Properties (a), (b) hold for linear methods → SVM solution approach

SVM APPROACH Linear approximation in Z-space using special adaptive loss function Complexity independent of dimensionality

Motivation for Nonlinear Methods 1. Nonlinear learning algorithm proposed using ‘reasonable’ heuristic arguments (reasonable ~ statistical or biological) 2. Empirical validation + improvement 3. Statistical explanation (why it really works) Examples: statistical, neural network methods. In contrast, SVM methods were originally proposed under the VC-theoretic framework.

OUTLINE Objectives Motivation for margin-based loss Loss functions for regression Loss functions for classification Philosophical interpretation Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

Main Idea Model complexity is controlled by a special loss function used for fitting training data. Such empirical loss functions may be different from the loss functions used in a learning problem setting. Such loss functions are adaptive, i.e. they can adapt their complexity to a particular data set. Different loss functions for different learning problems (classification, regression, etc.). Model complexity (VC-dimension) is controlled independently of the number of features.

Robust Loss Function for Regression Squared loss ~ motivated by large-sample settings, parametric assumptions, Gaussian noise. For practical settings it is better to use linear (absolute) loss.

Epsilon-insensitive Loss for Regression: L(y, f(x)) = max(0, |y − f(x)| − ε). Can also control model complexity (via the value of ε).

Empirical Comparison Univariate regression: squared, linear and SVM loss (with ε = 0.6). Red ~ target function, Dotted ~ estimate using squared loss, Dashed ~ linear loss, Dash-dotted ~ SVM loss

Empirical Comparison (cont’d) Univariate regression: squared, linear and SVM loss (with ε = 0.6). Test error (MSE) estimated for 5 independent realizations of training data (4 training samples):

Realization   Squared loss   Least modulus loss   SVM loss (ε = 0.6)
1             0.024          0.134                0.067
2             0.128          0.075                0.063
3             0.920          0.274                0.041
4             0.035          0.053                0.032
5             0.111          0.027                0.005
Mean          0.244          0.113                0.042
St. Dev.      0.381          0.099                0.025
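
For concreteness, a minimal NumPy sketch (not part of the original slides) of the three regression losses compared above, written as functions of the residual r = y − f(x); the residual values are illustrative, only ε = 0.6 is taken from the table.

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def linear_loss(r):                      # least-modulus / absolute loss
    return np.abs(r)

def eps_insensitive_loss(r, eps=0.6):    # SVM loss; eps = 0.6 as in the table
    return np.maximum(0.0, np.abs(r) - eps)

residuals = np.array([-1.2, -0.4, 0.0, 0.5, 2.0])
print(squared_loss(residuals))           # penalizes large residuals heavily
print(linear_loss(residuals))
print(eps_insensitive_loss(residuals))   # zero loss inside the epsilon-tube
```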

Loss Functions for Classification Decision rule: D(x) = sign(f(x, w)). The quantity y·f(x, w) is analogous to residuals in regression. Common loss functions: 0/1 loss and linear loss. Properties of a good loss function?

Motivation for margin-based loss (1) Given: Linearly separable data How to construct linear decision boundary? (a) Many linear decision boundaries (that have no errors)

Motivation for margin-based loss (2) Given: Linearly separable data Which linear decision boundary is better ? The model with larger margin is more robust for future data

Largest-margin solution All solutions explain the data well (zero error) All solutions ~ the same linear parameterization Larger margin ~ more confidence (larger falsifiability)

Margin-based loss for classification SVM loss, or hinge loss: L(y, f(x, w)) = max(0, 1 − y·f(x, w)) ~ minimization of slack variables
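
A small sketch (my addition, not from the slides) of this hinge loss as a function of the margin quantity y·f(x, w):

```python
import numpy as np

def hinge_loss(margin):                  # margin = y * f(x, w)
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge_loss(margins))               # [2. 1. 0.5 0. 0.] -- zero loss beyond the margin
```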

Margin-based loss for classification: margin size is adapted to training data

Motivation: philosophical Classical view: good model explains the data + low complexity Occam’s razor (complexity ~ # parameters) VC theory: good model explains the data + low VC-dimension → VC-falsifiability (small VC-dim ~ large falsifiability), i.e. the goal is to find a model that: can explain training data / cannot explain other data The idea: falsifiability ~ empirical loss function

Adaptive Loss Functions Both goals (explanation + falsifiability) can be encoded into an empirical loss function where - a (large) portion of the data has zero loss - the rest of the data has non-zero loss, i.e. it falsifies the model The trade-off (between the two goals) is adaptively controlled → adaptive loss function For classification, the degree of falsifiability is ~ margin size (see below)

Margin-based loss for classification

Classification: non-separable data

Margin-based complexity control Large degree of falsifiability is achieved by - large margin (classification) - small epsilon (regression) For linear classifiers: larger margin → smaller VC-dimension

Δ-margin hyperplanes Solutions provided by minimization of SVM loss can be indexed by the value of margin Δ → SRM structure indexed by VC-dimension. If data samples belong to a sphere of radius R, then the VC-dimension is bounded by h ≤ min(R²/Δ², d) + 1. For large-margin hyperplanes, the VC-dimension is controlled independently of dimensionality d.

SVM Model Complexity Two ways to control model complexity: - via model parameterization (using a fixed loss function) - via adaptive loss function (using a fixed linear parameterization) ~ Two types of SRM structures Margin-based loss can be motivated by Popper’s falsifiability

Margin-based loss: summary Classification: falsifiability controlled by margin Regression: falsifiability controlled by epsilon Single-class learning: falsifiability controlled by radius r NOTE: the same interpretation/motivation for margin-based loss holds for different types of learning problems.

OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers - Primal formulation (linearly separable case) - Dual optimization formulation - Soft-margin SVM formulation Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

Optimal Separating Hyperplane Distance between hyperplane and sample → Margin Shaded points are SVs

Optimization Formulation Given training data, find parameters of the linear hyperplane that minimize (1/2)||w||^2 under constraints y_i(w·x_i + b) ≥ 1. Quadratic optimization with linear constraints, tractable for moderate dimensions d. For large dimensions use the dual formulation: - scales better with n (rather than d) - uses only dot products

From Optimization Theory: For a given convex minimization problem with convex inequality constraints there exists an equivalent dual unconstrained maximization formulation with nonnegative Lagrange multipliers Karush-Kuhn-Tucker (KKT) conditions: Lagrange coefficients are nonzero only for samples that satisfy the original constraint with equality ~ SV’s have positive Lagrange coefficients

Convex Hull Interpretation of Dual Find convex hulls for each class. The closest points to an optimal hyperplane are support vectors

Dual Optimization Formulation Given training data, find parameters of an optimal hyperplane as the solution to a maximization problem under constraints. Note: data samples with nonzero Lagrange multipliers are SV’s. Formulation requires only inner products.
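
A minimal scikit-learn sketch (assumed library, synthetic data) illustrating this point: after fitting a linear SVM, only the support vectors carry nonzero dual (Lagrange) coefficients, and they fully determine the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of SVs per class:", clf.n_support_)
print("dual coefficients y_i * alpha_i:", clf.dual_coef_)   # nonzero only for SVs
print("w =", clf.coef_, " b =", clf.intercept_)
```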

Support Vectors SV’s ~ training samples with non-zero loss SV’s are samples that falsify the model The model depends only on SVs → SV’s ~ robust characterization of the data WSJ Feb 27, 2004: About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters.

Support Vectors SVM test error bound: E[test error] ≤ E[# SV’s] / n → small # SV’s ~ good generalization Can be explained using LOO cross-validation SVM generalization can be related to data compression

Soft-Margin SVM formulation Minimize (1/2)||w||^2 + C·Σ ξ_i under constraints y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0

SVM Dual Formulation Given training data, find parameters of an optimal hyperplane as the solution to a maximization problem under constraints. Note: data samples with nonzero Lagrange multipliers are SVs. Formulation requires only inner products.

OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

Nonlinear Decision Boundary Fixed (linear) parameterization is too rigid A nonlinear (curved) decision boundary may yield a larger margin (falsifiability) and lower error

Nonlinear Mapping via Kernels Nonlinear f(x,w) + margin-based loss = SVM Nonlinear mapping to feature z-space Linear in z-space ~ nonlinear in x-space But the dot product in z-space is a symmetric function of the inputs → compute the dot product via a kernel analytically

Example of Kernel Function 2D input space, mapping to z-space (2nd-order polynomial). Can show by direct substitution that for two input vectors their dot product in z-space is calculated analytically via the kernel.
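
The slide’s exact mapping is not shown in the transcript; a common choice for a 2nd-order polynomial kernel is sketched below (my assumption), verifying numerically that K(x, x′) = (x·x′)² equals the dot product of the mapped vectors z(x) = (x1², √2·x1·x2, x2²).

```python
import numpy as np

def z(x):
    # assumed homogeneous 2nd-order polynomial feature map for 2-D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(z(x), z(xp)))        # dot product computed explicitly in feature space
print(np.dot(x, xp) ** 2)         # same value via the kernel, without mapping
```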

SVM Formulation (with kernels) Replacing the inner product with a kernel K(x, x′) leads to: find parameters of an optimal hyperplane as a solution to a maximization problem under constraints. Given: the training data, an inner product kernel, regularization parameter C

Examples of Kernels A kernel is a symmetric function satisfying general (Mercer’s) conditions. Examples of kernels for different mappings x → z: Polynomials of degree m RBF kernel (width parameter) Neural Networks (for given parameters) Automatic selection of the number of hidden units (SV’s)

More on Kernels The kernel matrix has all info (data + kernel):
K(1,1) K(1,2) ... K(1,n)
K(2,1) K(2,2) ... K(2,n)
...
K(n,1) K(n,2) ... K(n,n)
Kernel defines a distance in some feature space (aka kernel-induced feature space) Kernel parameter controls nonlinearity Kernels can incorporate a priori knowledge Kernels can be defined over complex structures (trees, sequences, sets, etc.)
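
A short sketch (assuming scikit-learn is available) computing such an n x n Gram matrix for an RBF kernel on random data; the gamma value is illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).randn(5, 3)      # 5 samples, 3 features
K = rbf_kernel(X, X, gamma=0.5)               # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
print(K.shape)                                # (5, 5) -- all the info SVM training needs
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))   # symmetric, unit diagonal
```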

New insights provided by SVM Why can linear classifiers generalize? (1) Margin is large (relative to R) (2) % of SV’s is small (3) ratio d/n is small SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both Requires common-sense parameter tuning

OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples - Model Selection - Histogram of Projections - SVM Extensions and Modifications SVM for Regression Summary and Discussion

SVM Model Selection The quality of SVM classifiers depends on proper tuning of model parameters: - kernel type (poly, RBF, etc) - kernel complexity parameter - regularization parameter C Note: VC-dimension depends on both C and kernel parameters These parameters are usually selected via x-validation, by searching over wide range of parameter values (on the log-scale)

SVM Example 1: Ripley’s data Ripley’s data set: - 250 training samples - SVM using RBF kernel - model selection via 10-fold cross-validation Cross-validation error table (rows: gamma = 2^-3 ... 2^3; columns: C = 0.1, 1, 10, 100, 1000, 10000):
gamma = 2^-3:  98.4%  23.6%  18.8%  20.4%  18.4%  14.4%
gamma = 2^-2:  51.6%  22%    20%    16%    14%
gamma = 2^-1:  33.2%  19.6%  15.6%  13.6%  14.8%
gamma = 2^0:   28%    18%    16.4%  12.8%
gamma = 2^1:   20.8%  17.2%
gamma = 2^2:   19.2%
gamma = 2^3:
→ optimal C = 1,000, gamma = 1 Note: there may be multiple optimal parameter values
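
A sketch of this kind of model selection (Ripley’s data is not bundled with scikit-learn, so a two-moons data set stands in; the grids echo the log-scale search above but are otherwise illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=250, noise=0.3, random_state=0)   # stand-in 2-D data
param_grid = {"C": 10.0 ** np.arange(-1, 5),                  # 0.1 ... 10,000
              "gamma": 2.0 ** np.arange(-3, 4)}               # 2^-3 ... 2^3
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10).fit(X, y)
print(search.best_params_, 1.0 - search.best_score_)          # chosen (C, gamma), CV error
```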

Optimal SVM Model RBF SVM with optimal parameters C = 1,000, gamma = 1 Test error is 9.8% (estimated using 1,000 test samples)

SVM Example 2: Noisy Hyperbolas Noisy Hyperbolas data set: - 100 training samples (50 per class) - 100 validation samples (used for parameter tuning) RBF SVM model: Poly SVM model (5-th degree): Which model is ‘better’? Model interpretation?

SVM Example 3: handwritten digits MNIST handwritten digits (5 vs. 8) ~ high-dimensional data - 1,000 training samples (500 per class) - 1,000 validation samples (used for parameter tuning) - 1,866 test samples Each sample is a real-valued vector of size 28*28 = 784 (a 28 x 28 pixel image). RBF SVM: optimal parameters C = 1

How to visualize high-dim SVM model? Histogram of projections for linear SVM: - project training data onto normal vector w (of SVM model) - show univariate histogram of projected training samples On the histogram: ‘0’~ decision boundary, -1/+1 ~ margins Similar histograms can be obtained for nonlinear SVM
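
A sketch of this visualization with scikit-learn (synthetic data; for a linear kernel, decision_function(x) = w·x + b, so its histogram shows training samples relative to the decision boundary at 0 and the margins at ±1):

```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
proj = clf.decision_function(X)                  # projections onto normal direction w

plt.hist(proj[y == 0], bins=30, alpha=0.5, label="class 0")
plt.hist(proj[y == 1], bins=30, alpha=0.5, label="class 1")
for v in (-1, 0, +1):
    plt.axvline(v, linestyle="--")               # margins and decision boundary
plt.legend()
plt.show()
```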

Histogram of Projections for Digits Data Projections of training/test data onto normal direction of RBF SVM decision boundary: Training data Test data

Practical Issues for SVM Classifiers Pre-processing all inputs pre-scaled to the range [0,1] or [-1,+1] Model Selection (parameter tuning) SVM Extensions - multi-class problems - unbalanced data sets - unequal misclassification costs
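
A minimal sketch of the pre-scaling step (my example, using scikit-learn): putting the scaler and the SVM in one pipeline so the same [0, 1] scaling learned on training data is applied at prediction time.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=0)   # toy stand-in data
model = make_pipeline(MinMaxScaler(feature_range=(0, 1)),    # inputs pre-scaled to [0, 1]
                      SVC(kernel="rbf", C=1.0))
model.fit(X, y)
print(model.score(X, y))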

SVM for multi-class problems Digit recognition ~ ten-class problem: - estimate 10 binary classifiers (one digit vs the rest) For prediction: a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected
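
A sketch of this one-vs-rest scheme (scikit-learn’s small digits data set stands in for the zip-code/MNIST data): one binary SVM per digit, prediction by the largest SVM output.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)              # 10-class digit data
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0)).fit(X, y)
print(len(ovr.estimators_))                      # 10 binary classifiers (one digit vs. the rest)
print(ovr.predict(X[:5]), y[:5])                 # class with largest SVM output is selected
```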

Unbalanced Settings and Unequal Costs Unbalanced Data: - different number of positive and negative samples - different prior probabilities for training/test data Different Misclassification Costs: - two types of errors, FP and FN - Cost(false_positive) vs. Cost(false_negative) - Loss function: - these ‘costs’ need to be specified a priori, based on application requirements

SVM Modifications The Problem: Unbalanced Data + Unequal Costs - How to modify the standard SVM formulation? In practice, these costs need to be specified a priori.
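
One common practical way to encode such modifications (a sketch, not the formulation from the slides) is to weight the C parameter per class, e.g. via scikit-learn’s class_weight; the 1:5 cost ratio below is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)  # unbalanced
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1, 1: 5}).fit(X, y)   # assumed 1:5 cost ratio
# class_weight="balanced" would instead weight classes inversely to their frequency
print(clf.score(X, y))
```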

Example: SVM with unequal costs Ripley’s data set (as before) where - negative samples ~ ‘triangles’ - given misclassification costs Note: boundary shifted away from positive samples

SVM Applications Handwritten digit recognition Face detection in unrestricted images Text/ document classification Image classification and retrieval …….

Handwritten Digit Recognition (mid-90’s) Data set: postal images (zip-code), segmented, cropped; ~ 7K training samples, and 2K test samples Data encoding: 28x28 grey scale pixel image Original motivation: Compare SVM with custom MLP network (LeNet) designed for this application Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per each digit)

Digit Recognition Results Summary - prediction accuracy better than custom NN’s - accuracy does not depend on the kernel type - 100-400 support vectors per class (digit) More details:
Type of kernel    No. of Support Vectors    Error %
Polynomial        274                       4.0
RBF               291                       4.1
Neural Network    254                       4.2
~ 80-90% of SV’s coincide (for different kernels) Reduced-set SVM (Burges, 1996) ~ 15 per class

Document Classification (Joachims, 1998) The Problem: classification of text documents in large databases, for text indexing and retrieval Traditional approach: human categorization (i.e. via feature selection) – relies on a good indexing scheme. This is time-consuming and costly Predictive Learning Approach (SVM): construct a classifier using all possible features (words) Document/text representation: individual words = input features (possibly weighted) SVM performance: very promising (~90% accuracy vs. 80% by other classifiers) Most problems are linearly separable → use linear SVM

Image Data Mining (Chapelle et al, 1999) Example image data: classification of images in databases, for image indexing etc. DATA SET Corel photo images: 2670 samples divided into 7 classes: airplanes, birds, fish, vehicles etc. Training data: 1375 images; Test data: 1375 images (50%) MAIN ISSUE: invariant representation / data encoding

OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression - SV Regression formulation - Dual optimization formulation - Model selection - Example: Boston Housing Summary and Discussion

General SVM Modeling Approach 1. For a linear model, minimize the SVM functional using an SVM loss suitable for the learning problem at hand 2. Transform (1) to a dual optimization formulation (using only dot products) 3. Use kernels to obtain a nonlinear version of (2). Note: this approach is used for all learning problems. However, the tunable parameters of margin-based loss are different for various types of learning problems

SVM Regression For a linear model, minimize the SVM functional, where the empirical loss (for regression) is the ε-insensitive loss. Two distinct ways to control model complexity: - by the value of C (with fixed epsilon) - by the value of epsilon (with fixed large C) SVM regression tunes both epsilon and C for optimal performance
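
A minimal ε-SVR sketch with scikit-learn (synthetic sinc-like data; C and epsilon values are illustrative), showing that both parameters affect the fit and that the support vectors are the points contributing non-zero loss:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sinc(X).ravel() + 0.2 * rng.randn(40)          # noisy target, 40 samples

svr = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, y)
print("number of SVs:", len(svr.support_))            # points on or outside the eps-tube
```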

Linear SVM regression For linear parameterization SVM regression functional: where

Direct Optimization Formulation Given training data (x_i, y_i), i = 1, ..., n, minimize (1/2)||w||^2 + C·Σ (ξ_i + ξ_i*) under constraints y_i − (w·x_i + b) ≤ ε + ξ_i, (w·x_i + b) − y_i ≤ ε + ξ_i*, ξ_i, ξ_i* ≥ 0

Dual Formulation for SVM Regression Given training data And the values of Find coefficients which maximize Under constraints Yields the following solution

Example: RBF regression The RBF regression estimate (dashed line) obtained by the SVM model uses only 5 SV’s (out of the 40 points)

Example: decomposition of RBF model A weighted sum of 5 RBF kernel functions gives the SVM model

SVM Model Selection: General Setting/tuning of SVM hyper-parameters - usually performed by experts - more recently, by non-expert practitioners Issues for SVM model selection: (1) parameters controlling the ‘margin’ size (2) kernel type and kernel complexity Strategies for model selection: - exhaustive search in the parameter space (via resampling) - efficient search using VC analytic bounds - rule-of-thumb analytic strategies (for a particular type of learning problem)

Model Selection: continued Parameters controlling margin size: - for classification, parameter C - for regression, the value of epsilon - for single-class learning, the radius Complexity control ~ the fraction of SV’s (ν-SVM): - for classification, replace C with ν - for regression, specify the fraction of points allowed to lie outside the ε-insensitive zone For very sparse data (d/n >> 1) use linear SVM
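
A sketch of the ν-parameterization mentioned above (scikit-learn’s NuSVR; data and values are illustrative): ν is roughly an upper bound on the fraction of points outside the tube and a lower bound on the fraction of support vectors.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

for nu in (0.1, 0.5, 0.9):
    model = NuSVR(kernel="rbf", C=1.0, nu=nu).fit(X, y)
    print(nu, len(model.support_) / len(X))     # SV fraction grows with nu
```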

Parameter Selection for SVM Regression Selection of parameter C: recall the SVM solution, where → with bounded kernels (RBF). Selection of epsilon: in general ~ noise level, but this does not reflect the dependency on sample size. For linear regression: suggesting The final prescription

Effect of SVM parameters on test error Training data: univariate sinc(x) function with additive Gaussian noise (sigma = 0.2) (a) small sample size 50 (b) large sample size 200 [Surface plots of prediction risk vs. epsilon and C/n]
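
A sketch of this kind of experiment (only the sinc target and noise level sigma = 0.2 come from the slide; the sampling of x and the parameter grids below are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVR

def make_data(n, sigma=0.2, seed=0):
    rng = np.random.RandomState(seed)
    x = rng.uniform(-10, 10, size=(n, 1))
    return x, np.sinc(x / np.pi).ravel() + sigma * rng.randn(n)   # sinc(x) = sin(x)/x

X_tr, y_tr = make_data(50)                 # small-sample setting
X_te, y_te = make_data(2000, seed=1)       # large test set to estimate prediction risk
for eps in (0.05, 0.2, 0.6):
    for C in (0.1, 1.0, 10.0):
        svr = SVR(kernel="rbf", C=C, epsilon=eps).fit(X_tr, y_tr)
        mse = np.mean((svr.predict(X_te) - y_te) ** 2)
        print(f"eps={eps:4.2f}  C={C:5.1f}  MSE={mse:.4f}")
```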

SVM vs Regularization System imitation → SVM System identification → regularization But their risk functionals ‘look similar’ Recent claims: SVM = special case of regularization These claims neglect the role of margin loss

Comparison for Classification Linear SVM vs Penalized LDA – comparison is fair Data Sets: small (20 samples per class) large (100 samples per class)

Comparison results: classification Small sample size: Linear SVM yields 0.5% - 1.1% error rate Penalized LDA yields 2.8% - 3% error Large sample size: Linear SVM yields 0.4% - 1.1% error rate Penalized LDA yields 1.1% - 2.2% error Conclusion: margin based complexity control is better than regularization

Comparison for regression Linear SVM vs linear ridge regression Note: linear SVM has 2 parameters Sparse data set: 30 noisy samples, using a target function corrupted with Gaussian noise Complexity control: - for RR, vary the regularization parameter - for SVM, vary the epsilon and C parameters
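
A sketch of this comparison setup (the exact target function is not given in the transcript, so a simple sparse linear target and noise level are assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(30, 5)
w_true = np.array([2.0, 1.0, 0.0, 0.0, 0.0])          # assumed sparse linear target
y = X @ w_true + 0.5 * rng.randn(30)                  # 30 noisy samples

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)          # tune regularization
svr = GridSearchCV(SVR(kernel="linear"),
                   {"C": np.logspace(-2, 3, 6), "epsilon": [0.1, 0.3, 0.5]},
                   cv=5).fit(X, y)                                 # tune epsilon and C
print(ridge.coef_)
print(svr.best_estimator_.coef_)                                   # compare coefficient shrinkage
```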

Complexity control for ridge regression Coefficient shrinkage for ridge regression

Complexity control for SVM Coefficient shrinkage for SVM: (a) vary C (epsilon = 0) (b) vary epsilon (C = large) [Plot of coefficients w1-w5 vs. log(n/C)]

Comparison: ridge regression vs SVM Sparse setting: n = 10, with additive noise Ridge regression: regularization parameter chosen by cross-validation SV Regression: C selected by cross-validation Ave Risk (100 realizations): 0.44 (RR) vs 0.37 (SVM)

OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

Summary Direct approach → different formulations Margin-based loss: robust, controls complexity (falsifiability) SRM: new type of structure Nonlinear feature selection (~ SV’s): incorporated into model estimation Appropriate applications: - high-dimensional data - content-based / content-dependent