Presentation transcript:

Nov 20th, 2001. Copyright © 2001, Andrew W. Moore. VC-dimension for characterizing classifiers. Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: Comments and corrections gratefully received.

VC-dimension: Slide 2 Copyright © 2001, Andrew W. Moore A learning machine. A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = +/- 1. (Diagram: x goes into f, which is parameterized by α and outputs y_est.) α is some vector of adjustable parameters.

VC-dimension: Slide 3 Copyright © 2001, Andrew W. Moore Examples. f(x,b) = sign(x.x – b). (Figure: 2-d data points, with dots denoting the +1 and -1 classes; the decision boundary is a circle centered at the origin.)

VC-dimension: Slide 4 Copyright © 2001, Andrew W. Moore Examples. f(x,w) = sign(x.w). (Figure: 2-d data points, with dots denoting the +1 and -1 classes; the decision boundary is a line through the origin.)

VC-dimension: Slide 5 Copyright © 2001, Andrew W. Moore Examples. f(x,w,b) = sign(x.w+b). (Figure: 2-d data points, with dots denoting the +1 and -1 classes; the decision boundary is a general line, not necessarily through the origin.)
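A minimal sketch of these three machines in Python (the point and parameter values below are made-up illustrations, not taken from the slides):

```python
import numpy as np

def circle_machine(x, b):
    # f(x, b) = sign(x.x - b): +1 outside the origin-centered circle, -1 inside
    return np.sign(np.dot(x, x) - b)

def origin_line_machine(x, w):
    # f(x, w) = sign(x.w): a linear separator passing through the origin
    return np.sign(np.dot(x, w))

def line_machine(x, w, b):
    # f(x, w, b) = sign(x.w + b): a general linear separator
    return np.sign(np.dot(x, w) + b)

# Made-up example point and parameters (np.sign returns 0 on ties, which the slides ignore)
x = np.array([2.0, 1.0])
print(circle_machine(x, b=3.0))                            # +1, since x.x = 5 > 3
print(origin_line_machine(x, w=np.array([0.0, 1.0])))      # +1
print(line_machine(x, w=np.array([1.0, -1.0]), b=-2.0))    # -1, since 2 - 1 - 2 = -1
```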

VC-dimension: Slide 6 Copyright © 2001, Andrew W. Moore How do we characterize “power”? Different machines have different amounts of “power”. Tradeoff between: More power: Can model more complex classifiers but might overfit. Less power: Not going to overfit, but restricted in what it can model. How do we characterize the amount of power?

VC-dimension: Slide 7 Copyright © 2001, Andrew W. Moore Some definitions. Given some machine f, and under the assumption that all training points (x_k, y_k) were drawn i.i.d. from some distribution, and under the assumption that future test points will be drawn from the same distribution, define: (the slide shows a two-column table, “Official terminology” vs. “Terminology we’ll use”.)

VC-dimension: Slide 8 Copyright © 2001, Andrew W. Moore Some definitions. Given some machine f, and under the assumption that all training points (x_k, y_k) were drawn i.i.d. from some distribution, and under the assumption that future test points will be drawn from the same distribution, define: (the same two-column table, “Official terminology” vs. “Terminology we’ll use”.) R = #training set data points.
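The table's equations are not shown here; a standard way to write them, in the Vapnik risk notation these slides follow (a hedged reconstruction rather than the slide's exact wording), is:

```latex
\[
\underbrace{R(\alpha)}_{\text{expected risk}} \;=\; \mathrm{TESTERR}(\alpha)
  \;=\; \mathbb{E}\!\left[\tfrac{1}{2}\bigl|\,y - f(x,\alpha)\bigr|\right]
  \;=\; \Pr\bigl(f(x,\alpha) \neq y\bigr)
\]
\[
\underbrace{R_{\mathrm{emp}}(\alpha)}_{\text{empirical risk}} \;=\; \mathrm{TRAINERR}(\alpha)
  \;=\; \frac{1}{R}\sum_{k=1}^{R} \tfrac{1}{2}\bigl|\,y_k - f(x_k,\alpha)\bigr|
  \;=\; \text{fraction of the training set misclassified}
\]
```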

VC-dimension: Slide 9 Copyright © 2001, Andrew W. Moore Vapnik-Chervonenkis dimension. Given some machine f, let h be its VC dimension. h is a measure of f’s power (h does not depend on the choice of training set). Vapnik showed that, with probability 1-η, the bound below holds. This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of f.
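The standard form of Vapnik's bound (as given in Burges' tutorial cited on slide 39, with R training points and confidence parameter η, both assumed here) is:

```latex
\[
\mathrm{TESTERR}(\alpha) \;\le\; \mathrm{TRAINERR}(\alpha)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2R}{h} + 1\right) \;-\; \ln\frac{\eta}{4}}{R}}
\]
```

The square-root term is the “VC-confidence” referred to on slide 39; it grows with h and shrinks with R.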

VC-dimension: Slide 10 Copyright © 2001, Andrew W. Moore What VC-dimension is used for. Given some machine f, let h be its VC dimension. h is a measure of f’s power. Vapnik showed that, with probability 1-η, the bound above holds. This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of f. But given machine f, how do we define and compute h?

VC-dimension: Slide 11 Copyright © 2001, Andrew W. Moore Shattering. Machine f can shatter a set of points x_1, x_2, …, x_r if and only if… for every possible training set of the form (x_1,y_1), (x_2,y_2), …, (x_r,y_r) …there exists some value of α that gets zero training error. There are 2^r such training sets to consider, each with a different combination of +1’s and –1’s for the y’s.
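Stated compactly (the same definition, just in symbols):

```latex
\[
f \text{ shatters } \{x_1,\dots,x_r\}
\;\Longleftrightarrow\;
\forall\, (y_1,\dots,y_r) \in \{-1,+1\}^r \;\;
\exists\, \alpha \;\;\text{such that}\;\;
f(x_k,\alpha) = y_k \;\;\text{for all } k = 1,\dots,r .
\]
```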

VC-dimension: Slide 12 Copyright © 2001, Andrew W. Moore Shattering. Machine f can shatter a set of points x_1, x_2, …, x_r if and only if… for every possible training set of the form (x_1,y_1), (x_2,y_2), …, (x_r,y_r) …there exists some value of α that gets zero training error. Question: Can the following f shatter the following points? f(x,w) = sign(x.w) (Figure: two data points.)

VC-dimension: Slide 13 Copyright © 2001, Andrew W. Moore Shattering. Machine f can shatter a set of points x_1, x_2, …, x_r if and only if… for every possible training set of the form (x_1,y_1), (x_2,y_2), …, (x_r,y_r) …there exists some value of α that gets zero training error. Question: Can the following f shatter the following points? f(x,w) = sign(x.w) Answer: No problem. There are four training sets to consider: w=(0,1), w=(0,-1), w=(2,-3), w=(-2,3).
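Questions like this can be checked directly from the definition: enumerate all 2^r labelings and, for each, look for a parameter value with zero training error. Here is a sketch for the sign(x.w) machine on two made-up points, searching a coarse grid of weight vectors (the points and the grid are illustrative assumptions, and a finite grid can only confirm shattering, never refute it):

```python
import itertools
import numpy as np

def shatters(points, candidate_params, machine):
    """True if every +/-1 labeling of `points` is achieved by some candidate parameter."""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if not any(all(machine(x, p) == y for x, y in zip(points, labels))
                   for p in candidate_params):
            return False    # this labeling defeats every candidate parameter
    return True

machine = lambda x, w: np.sign(np.dot(x, w))               # f(x, w) = sign(x.w)

points = [np.array([1.0, 1.0]), np.array([-1.0, 1.0])]     # two made-up points

# Coarse grid of candidate weight vectors (excluding w = 0)
grid = [np.array([a, b])
        for a in np.arange(-3.0, 3.5, 0.5)
        for b in np.arange(-3.0, 3.5, 0.5)
        if not (a == 0.0 and b == 0.0)]

print(shatters(points, grid, machine))   # True: all four labelings are realizable
```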

VC-dimension: Slide 14 Copyright © 2001, Andrew W. Moore Shattering. Machine f can shatter a set of points x_1, x_2, …, x_r if and only if… for every possible training set of the form (x_1,y_1), (x_2,y_2), …, (x_r,y_r) …there exists some value of α that gets zero training error. Question: Can the following f shatter the following points? f(x,b) = sign(x.x-b) (Figure: two data points.)

VC-dimension: Slide 15 Copyright © 2001, Andrew W. Moore Shattering. Machine f can shatter a set of points x_1, x_2, …, x_r if and only if… for every possible training set of the form (x_1,y_1), (x_2,y_2), …, (x_r,y_r) …there exists some value of α that gets zero training error. Question: Can the following f shatter the following points? f(x,b) = sign(x.x-b) Answer: No way my friend.

VC-dimension: Slide 16 Copyright © 2001, Andrew W. Moore Definition of VC dimension. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: What’s the VC dimension of f(x,b) = sign(x.x-b)?

VC-dimension: Slide 17 Copyright © 2001, Andrew W. Moore VC dim of trivial circle. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: What’s the VC dimension of f(x,b) = sign(x.x-b)? Answer = 1: we can’t even shatter two points, since whichever of the two points is nearer the origin can never be labeled +1 while the farther one is labeled -1 (but it’s clear we can shatter 1).

VC-dimension: Slide 18 Copyright © 2001, Andrew W. Moore Reformulated circle. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC dimension of f(x,q,b) = sign(qx.x-b)?

VC-dimension: Slide 19 Copyright © 2001, Andrew W. Moore Reformulated circle. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: What’s the VC dimension of f(x,q,b) = sign(qx.x-b)? Answer = 2. (Figure: the labelings of two points, each realized by some (q,b); in one case q and b are negative.)

VC-dimension: Slide 20 Copyright © 2001, Andrew W. Moore Reformulated circle. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: What’s the VC dimension of f(x,q,b) = sign(qx.x-b)? Answer = 2 (clearly can’t do 3). (Figure: as on the previous slide; q and b are negative in one case.)

VC-dimension: Slide 21 Copyright © 2001, Andrew W. Moore VC dim of separating line. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC-dim of f(x,w,b) = sign(w.x+b)? Well, can f shatter these three points? (Figure: three points.)

VC-dimension: Slide 22 Copyright © 2001, Andrew W. Moore VC dim of line machine. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC-dim of f(x,w,b) = sign(w.x+b)? Well, can f shatter these three points? Yes, of course: all -ve or all +ve is trivial; one +ve can be picked off by a line; one -ve can be picked off too.

VC-dimension: Slide 23 Copyright © 2001, Andrew W. Moore VC dim of line machine. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC-dim of f(x,w,b) = sign(w.x+b)? Well, can we find four points that f can shatter?

VC-dimension: Slide 24 Copyright © 2001, Andrew W. Moore VC dim of line machine. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC-dim of f(x,w,b) = sign(w.x+b)? Well, can we find four points that f can shatter? We can always draw six lines between pairs of four points.

VC-dimension: Slide 25 Copyright © 2001, Andrew W. Moore VC dim of line machine. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC-dim of f(x,w,b) = sign(w.x+b)? Well, can we find four points that f can shatter? We can always draw six lines between pairs of four points. Two of those lines will cross.

VC-dimension: Slide 26 Copyright © 2001, Andrew W. Moore VC dim of line machine. Given machine f, the VC-dimension h is the maximum number of points that can be arranged so that f shatters them. Example: For 2-d inputs, what’s the VC-dim of f(x,w,b) = sign(w.x+b)? Well, can we find four points that f can shatter? We can always draw six lines between pairs of four points. Two of those lines will cross. If we put the points linked by the crossing lines in the same class, they can’t be linearly separated. So a line can shatter 3 points but not 4, and the VC-dim of the Line Machine is 3.
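The "3 points can be shattered" half of this argument is easy to verify mechanically: for three points not lying on one line, you can solve w.x_k + b = y_k exactly for (w, b), which gives zero training error for every labeling. A sketch (the three points are an illustrative choice):

```python
import itertools
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points
A = np.hstack([pts, np.ones((3, 1))])                  # rows [x_1, x_2, 1], for solving w.x + b = y

for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels, dtype=float)
    wb = np.linalg.solve(A, y)        # exact solve, possible because the points are affinely independent
    w, b = wb[:2], wb[2]
    assert np.array_equal(np.sign(pts @ w + b), y)

print("All 8 labelings of the 3 points are realized by some line: they are shattered.")
```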

VC-dimension: Slide 27 Copyright © 2001, Andrew W. Moore VC dim of linear classifiers in m-dimensions. If the input space is m-dimensional and f is sign(w.x-b), what is the VC-dimension? Proof that h >= m: show that m points can be shattered. Can you guess how?

VC-dimension: Slide 28 Copyright © 2001, Andrew W. Moore VC dim of linear classifiers in m-dimensions. If the input space is m-dimensional and f is sign(w.x-b), what is the VC-dimension? Proof that h >= m: show that m points can be shattered. Define m input points thus: x_1 = (1,0,0,…,0), x_2 = (0,1,0,…,0), …, x_m = (0,0,0,…,1). So x_k[j] = 1 if k=j and 0 otherwise. Let y_1, y_2, …, y_m be any one of the 2^m combinations of class labels. Guess how we can define w_1, w_2, …, w_m and b to ensure sign(w.x_k + b) = y_k for all k? Note: w.x_k = sum_j w_j x_k[j] = w_k.

VC-dimension: Slide 29 Copyright © 2001, Andrew W. Moore VC dim of linear classifiers in m-dimensions. If the input space is m-dimensional and f is sign(w.x-b), what is the VC-dimension? Proof that h >= m: show that m points can be shattered. Define m input points thus: x_1 = (1,0,0,…,0), x_2 = (0,1,0,…,0), …, x_m = (0,0,0,…,1). So x_k[j] = 1 if k=j and 0 otherwise. Let y_1, y_2, …, y_m be any one of the 2^m combinations of class labels. Guess how we can define w_1, w_2, …, w_m and b to ensure sign(w.x_k + b) = y_k for all k? Note: w.x_k = sum_j w_j x_k[j] = w_k. Answer: b=0 and w_k = y_k for all k.
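A quick mechanical check of this construction (a sketch with a small made-up m; w = y and b = 0 as in the answer above):

```python
import itertools
import numpy as np

m = 4                      # small dimension, just for the check
X = np.eye(m)              # x_k = the k-th unit vector, as defined on the slide

for labels in itertools.product([-1, 1], repeat=m):
    y = np.array(labels, dtype=float)
    w, b = y, 0.0                         # the construction: w_k = y_k, b = 0
    predictions = np.sign(X @ w + b)      # sign(w.x_k + b) = sign(w_k) = y_k
    assert np.array_equal(predictions, y)

print(f"All 2^{m} = {2 ** m} labelings realized: the {m} unit vectors are shattered, so h >= m.")
```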

VC-dimension: Slide 30 Copyright © 2001, Andrew W. Moore VC dim of linear classifiers in m-dimensions. If the input space is m-dimensional and f is sign(w.x-b), what is the VC-dimension? Now we know that h >= m. In fact, h = m+1. The proof that h >= m+1 is easy. The proof that h < m+2 is moderate.

VC-dimension: Slide 31 Copyright © 2001, Andrew W. Moore What does VC-dim measure? Is it the number of parameters? Related but not really the same. I can create a machine with one numeric parameter that really encodes 7 parameters (How?) And I can create a machine with 7 parameters which has a VC-dim of 1 (How?) Andrew’s private opinion: it often is the number of parameters that counts.

VC-dimension: Slide 32 Copyright © 2001, Andrew W. Moore Structural Risk Minimization. Let Ω(f) = the set of functions representable by f. Suppose Ω(f_1) ⊆ Ω(f_2) ⊆ … ⊆ Ω(f_n). Then h(f_1) <= h(f_2) <= … <= h(f_n). (Hey, can you formally prove this?) We’re trying to decide which machine to use. We train each machine and make a table with one row per machine f_1, …, f_6 and columns i, f_i, TRAINERR, VC-Conf, Probable upper bound on TESTERR, Choice; the machine with the lowest upper bound is marked as the choice.
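A sketch of filling in that table, using the VC-confidence term of Vapnik's bound (see slide 9) for the VC-Conf column; the training errors, VC-dimensions, R and η below are made-up numbers for illustration:

```python
import numpy as np

def vc_confidence(h, R, eta=0.05):
    # The square-root (VC-confidence) term of Vapnik's bound; eta is an assumed confidence level
    return np.sqrt((h * (np.log(2 * R / h) + 1) - np.log(eta / 4)) / R)

R = 1000   # made-up number of training points

# (name, TRAINERR, VC-dimension h) for six hypothetical nested machines f_1..f_6
machines = [("f1", 0.20, 3), ("f2", 0.12, 10), ("f3", 0.08, 30),
            ("f4", 0.06, 100), ("f5", 0.05, 300), ("f6", 0.04, 1000)]

rows = [(name, trainerr, vc_confidence(h, R), trainerr + vc_confidence(h, R))
        for name, trainerr, h in machines]
for name, trainerr, conf, bound in rows:
    print(f"{name}: TRAINERR={trainerr:.2f}  VC-Conf={conf:.2f}  bound on TESTERR={bound:.2f}")

print("Choice:", min(rows, key=lambda row: row[3])[0])   # pick the lowest upper bound
```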

VC-dimension: Slide 33 Copyright © 2001, Andrew W. Moore Using VC-dimensionality. That’s what VC-dimensionality is about. People have worked hard to find the VC-dimension for… Decision Trees, Perceptrons, Neural Nets, Decision Lists, Support Vector Machines, and many many more. All with the goals of 1. understanding which learning machines are more or less powerful under which circumstances, and 2. using Structural Risk Minimization to choose the best learning machine.

VC-dimension: Slide 34 Copyright © 2001, Andrew W. Moore Alternatives to VC-dim-based model selection. What could we do instead of the scheme below? (The same table as before: one row per machine f_1, …, f_6 and columns i, f_i, TRAINERR, VC-Conf, Probable upper bound on TESTERR, Choice.)

VC-dimension: Slide 35 Copyright © 2001, Andrew W. Moore Alternatives to VC-dim-based model selection. What could we do instead of the scheme below? 1. Cross-validation. (Table: one row per machine f_1, …, f_6 and columns i, f_i, TRAINERR, 10-FOLD-CV-ERR, Choice; the machine with the lowest cross-validation error is marked as the choice.)
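A sketch of the cross-validation version of the table, using scikit-learn (the data set and the three candidate machines below are stand-ins chosen for illustration; the slides don't specify any):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

machines = {
    "f1 (logistic regression)": LogisticRegression(max_iter=1000),
    "f2 (RBF-kernel SVM)": SVC(),
    "f3 (depth-3 decision tree)": DecisionTreeClassifier(max_depth=3),
}

# Fill in the 10-FOLD-CV-ERR column and pick the machine with the lowest error
cv_err = {name: 1.0 - cross_val_score(f, X, y, cv=10).mean() for name, f in machines.items()}
for name, err in cv_err.items():
    print(f"{name}: 10-fold CV error = {err:.3f}")
print("Choice:", min(cv_err, key=cv_err.get))
```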

VC-dimension: Slide 36 Copyright © 2001, Andrew W. Moore Alternatives to VC-dim-based model selection. What could we do instead of the scheme below? 1. Cross-validation. 2. AIC (Akaike Information Criterion). (Table: one row per machine f_1, …, f_6 and columns i, f_i, LOGLIKE(TRAINERR), #parameters, AIC, Choice.) As the amount of data goes to infinity, AIC promises* to select the model that’ll have the best likelihood for future data. *Subject to about a million caveats.

VC-dimension: Slide 37 Copyright © 2001, Andrew W. Moore Alternatives to VC-dim-based model selection. What could we do instead of the scheme below? 1. Cross-validation. 2. AIC (Akaike Information Criterion). 3. BIC (Bayesian Information Criterion). (Table: one row per machine f_1, …, f_6 and columns i, f_i, LOGLIKE(TRAINERR), #parameters, BIC, Choice.) As the amount of data goes to infinity, BIC promises* to select the model that the data was generated from. More conservative than AIC. *Another million caveats.
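For reference, the usual textbook forms of the two scores (hedged: sign and scaling conventions vary, and the slides' tables work directly with LOGLIKE and #parameters). Here k = #parameters, \hat{L} = maximized training likelihood, R = #training points, and the model minimizing the score is chosen:

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad\qquad
\mathrm{BIC} = k\,\ln R - 2\ln\hat{L}
\]
```

Up to a factor of -2, these are "log-likelihood minus a complexity penalty" scores of the kind the tables above maximize; BIC's penalty grows with R, which is why it is the more conservative of the two.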

VC-dimension: Slide 38 Copyright © 2001, Andrew W. Moore Which model selection method is best? 1. (CV) Cross-validation. 2. AIC (Akaike Information Criterion). 3. BIC (Bayesian Information Criterion). 4. (SRMVC) Structural Risk Minimization with VC-dimension. AIC, BIC and SRMVC have the advantage that you only need the training error. CV error might have more variance. SRMVC is wildly conservative. Asymptotically, AIC and leave-one-out CV should be the same. Asymptotically, BIC and a carefully chosen k-fold should be the same. BIC is what you want if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding). There are many alternatives to the above, including proper Bayesian approaches. It’s an emotional issue.

VC-dimension: Slide 39 Copyright © 2001, Andrew W. Moore Extra Comments. Beware: that second “VC-confidence” term is usually very, very conservative (at least hundreds of times larger than the empirical overfitting effect). An excellent tutorial on VC-dimension and Support Vector Machines (which we’ll be studying soon): C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

VC-dimension: Slide 40 Copyright © 2001, Andrew W. Moore What you should know. The definition of a learning machine: f(x, α). The definition of shattering. Be able to work through simple examples of shattering. The definition of VC-dimension. Be able to work through simple examples of VC-dimension. Structural Risk Minimization for model selection. Awareness of other model selection methods.