Low Bias Bagged Support Vector Machines
Giorgio Valentini, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
Thomas G. Dietterich, Department of Computer Science, Oregon State University, Corvallis, Oregon, USA
ICML 2003, August 22, 2003

Two Questions
- Can bagging help SVMs?
- If so, how should SVMs be tuned to give the best bagged performance?

The Answers
- Can bagging help SVMs? Yes.
- If so, how should SVMs be tuned to give the best bagged performance? Tune to minimize the bias of each SVM.

SVMs
- Soft margin classifier:
  minimize:   ||w||^2 + C Σ_i ξ_i
  subject to: y_i (w · x_i + b) + ξ_i ≥ 1
- Maximizes the margin (controlling the VC dimension) subject to soft separation of the training data
- The dot product can be generalized using kernels K(x_j, x_i; σ)
- Set C and σ using an internal validation set
- Excellent control of the bias/variance tradeoff: is there any room for improvement?
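
As a concrete illustration of the last two points, here is a minimal sketch in Python (my own, not from the slides) of tuning C and the Gaussian width σ on an internal validation split with scikit-learn's SVC. Note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ²), and the candidate grids below are arbitrary choices.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def tune_svm(X, y, Cs=(0.1, 1, 10, 100), sigmas=(0.5, 1, 2, 5), seed=0):
    # Hold out part of the training data as an internal validation set.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    best, best_acc = None, -1.0
    for C in Cs:
        for sigma in sigmas:
            gamma = 1.0 / (2.0 * sigma ** 2)   # scikit-learn's RBF parameterization
            acc = SVC(C=C, gamma=gamma).fit(X_tr, y_tr).score(X_val, y_val)
            if acc > best_acc:
                best, best_acc = (C, sigma), acc
    return best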

Bias/Variance Error Decomposition for Squared Loss
- For regression problems, the loss is (ŷ − y)^2
- error^2 = bias^2 + variance + noise:
  E_S[(ŷ − y)^2] = (E_S[ŷ] − f(x))^2 + E_S[(ŷ − E_S[ŷ])^2] + E[(y − f(x))^2]
- Bias: systematic error at data point x, averaged over all training sets S of size N
- Variance: variation around the average
- Noise: errors in the observed labels of x

Example: 20 points, y = x + 2 sin(1.5x) + N(0, 0.2) (figure)
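
The decomposition above can be estimated numerically on this example. A sketch, under assumptions of my own (50 training sets of 20 points each, a cubic polynomial as the learner, and a single test point x = 5; none of these details come from the slides):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x + 2.0 * np.sin(1.5 * x)
x_test = 5.0                                  # a single test point x
preds = []
for _ in range(50):                           # 50 training sets S of size 20
    x = rng.uniform(0, 10, 20)
    y = f(x) + rng.normal(0, 0.2, 20)
    coeffs = np.polyfit(x, y, deg=3)          # fit a cubic as the learner
    preds.append(np.polyval(coeffs, x_test))  # y_hat at the test point

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2       # (E_S[y_hat] - f(x))^2
variance = preds.var()                        # E_S[(y_hat - E_S[y_hat])^2]
noise = 0.2 ** 2                              # E[(y - f(x))^2], the label-noise variance
print(bias2, variance, noise, bias2 + variance + noise)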

Example: 50 fits (20 examples each) (figure)

Bias (figure)

Variance (figure)

Noise (figure)

Variance Reduction and Bagging
- Bagging attempts to simulate a large number of training sets and compute the average prediction y_m of those training sets
- It then predicts y_m
- If the simulation is good enough, this eliminates all of the variance
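
A minimal bagging sketch for binary labels in {−1, +1} (my illustration; the base learner, the number of bootstrap replicates, and majority-vote-by-sign are assumptions, not details from the slides):

import numpy as np
from sklearn.svm import SVC

def bagged_predict(X_train, y_train, X_test, n_bags=100, seed=0, **svm_params):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = np.zeros(len(X_test))
    for _ in range(n_bags):
        idx = rng.integers(0, n, n)                    # bootstrap sample: one simulated training set
        clf = SVC(**svm_params).fit(X_train[idx], y_train[idx])
        votes += clf.predict(X_test)                   # accumulate +1/-1 votes
    return np.sign(votes)                              # majority vote approximates y_m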

Bias and Variance for 0/1 Loss (Domingos, 2000)
- At each test point x, we have 100 estimates ŷ_1, …, ŷ_100 ∈ {−1, +1}
- Main prediction: y_m = majority vote
- Bias(x) = 0 if y_m is correct, 1 otherwise
- Variance(x) = probability that ŷ ≠ y_m
  - Unbiased variance V_U(x): variance when Bias = 0
  - Biased variance V_B(x): variance when Bias = 1
- Error rate(x) = Bias(x) + V_U(x) − V_B(x)
- Noise is assumed to be zero
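
A sketch of these definitions at a single test point, given an array of predictions in {−1, +1} from (say) 100 different training sets (illustrative code of mine, not from the paper):

import numpy as np

def domingos_terms(y_hats, y_true):
    y_hats = np.asarray(y_hats)
    y_m = 1 if (y_hats == 1).sum() >= (y_hats == -1).sum() else -1   # main prediction: majority vote
    bias = 0 if y_m == y_true else 1                                  # Bias(x)
    variance = np.mean(y_hats != y_m)                                 # P(y_hat != y_m)
    v_u = variance if bias == 0 else 0.0                              # unbiased variance V_U(x)
    v_b = variance if bias == 1 else 0.0                              # biased variance V_B(x)
    error = bias + v_u - v_b                                          # Error rate(x), noise assumed zero
    return bias, v_u, v_b, error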

Good Variance and Bad Variance
- Error rate(x) = Bias(x) + V_U(x) − V_B(x)
- V_B(x) is "good" variance, but only when the bias is high
- V_U(x) is "bad" variance
- Bagging will reduce both types of variance; this gives good results if Bias(x) is small
- Goal: tune classifiers to have small bias and rely on bagging to reduce variance

Lobag
- Given:
  - Training examples {(x_i, y_i)}, i = 1, …, N
  - A learning algorithm with tuning parameters α
  - Parameter settings to try {α_1, α_2, …}
- Do:
  - Apply internal bagging to compute out-of-bag estimates of the bias of each parameter setting
  - Let α* be the setting that gives minimum bias
  - Perform bagging using α*
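
A sketch of this procedure under stated assumptions (SVC as the base learner, labels in {−1, +1}, a modest number of internal bags for the out-of-bag bias estimate, and a simple grid of settings); the authors' actual implementation may differ.

import numpy as np
from sklearn.svm import SVC

def oob_bias(X, y, params, n_bags=30, seed=0):
    """Out-of-bag bias estimate: fraction of points whose out-of-bag
    majority vote (the main prediction y_m) disagrees with the true label."""
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros(n)                                # signed vote totals, labels in {-1, +1}
    for _ in range(n_bags):
        idx = rng.integers(0, n, n)                    # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)          # points left out of this bootstrap
        if len(oob) == 0:
            continue
        clf = SVC(**params).fit(X[idx], y[idx])
        votes[oob] += clf.predict(X[oob])
    covered = votes != 0                               # skip points with no vote or a tied vote
    return np.mean(np.sign(votes[covered]) != y[covered])

def lobag_fit(X, y, settings, n_bags=100, seed=0):
    """settings: list of SVC parameter dicts, e.g. [{'C': 1, 'gamma': 0.1}, ...] (hypothetical grid)."""
    best = min(settings, key=lambda p: oob_bias(X, y, p))   # alpha*: minimum out-of-bag bias
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_bags):                            # final bagged ensemble trained with alpha*
        idx = rng.integers(0, len(X), len(X))
        ensemble.append(SVC(**best).fit(X[idx], y[idx]))
    return ensemble, best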

Example: Letter2, RBF kernel, σ = 100 (figure marking the minimum-error and minimum-bias parameter settings)

Experimental Study
- Seven data sets: P2, waveform, grey-landsat, spam, musk, letter2 (letter recognition 'B' vs 'R'), letter2+noise (20% added noise)
- Three kernels: dot product, RBF (σ = Gaussian width), polynomial (σ = degree)
- Training set: 100 examples
- Final classifier is a bag of 100 SVMs trained with the chosen C and σ
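
One way to build the final classifier just described is with scikit-learn's BaggingClassifier (an assumed reconstruction, not the authors' code; older scikit-learn versions name the first argument base_estimator instead of estimator):

from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

def final_classifier(X_train, y_train, C, gamma):
    # Bag of 100 SVMs, each trained on a bootstrap replicate of the training set.
    bag = BaggingClassifier(estimator=SVC(C=C, gamma=gamma), n_estimators=100)
    return bag.fit(X_train, y_train)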

Results: Dot Product Kernel (figure)

Results (2): Gaussian Kernel (figure)

Results (3): Polynomial Kernel (figure)

McNemar's Tests: Bagging versus Single SVM (figure)

McNemar's Test: Lobag versus Single SVM (figure)

McNemar's Test: Lobag versus Bagging (figure)

Results: McNemar's Test (wins – ties – losses)

Kernel      | Lobag vs Bagging | Lobag vs Single | Bagging vs Single
Linear      | 3 – 26 – 1       | 23 – 7 – 0      | 21 – 9 – 0
Polynomial  | 12 – 23 – 0      | 17 – 17 – 1     | 9 – 24 – 2
Gaussian    | 17 – 17 – 1      | 18 – 15 – 2     | 9 – 22 – 4
Total       | 32 – 66 – 2      | 58 – 39 – 3     | 39 – 55 – 6
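
For reference, a sketch of the McNemar test used for these pairwise comparisons (standard chi-square form with continuity correction; the significance level and win/tie/loss bookkeeping are my assumptions):

import numpy as np
from scipy.stats import chi2

def mcnemar(y_true, pred_a, pred_b, alpha=0.05):
    a_wrong = pred_a != y_true
    b_wrong = pred_b != y_true
    n01 = np.sum(~a_wrong & b_wrong)   # A right, B wrong
    n10 = np.sum(a_wrong & ~b_wrong)   # A wrong, B right
    if n01 + n10 == 0:
        return "tie"
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)   # chi-square with continuity correction
    p = 1 - chi2.cdf(stat, df=1)
    if p >= alpha:
        return "tie"
    return "A wins" if n01 > n10 else "B wins"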

Discussion
- For small training sets:
  - Bagging can improve SVM error rates, especially for linear kernels
  - Lobag is at least as good as bagging and often better
- Consistent with previous experience:
  - Bagging works better with unpruned trees
  - Bagging works better with neural networks that are trained longer or with less weight decay

Conclusions
- Lobag is recommended for SVM problems with high variance (small training sets, high noise, many features)
- Added cost:
  - SVMs already require internal validation to set C and σ
  - Lobag requires internal bagging to estimate the bias for each setting of C and σ
- Future research:
  - Smart search for low-bias settings of C and σ
  - Experiments with larger training sets