Pattern Classification. All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 9: Algorithm-Independent Machine Learning (Sections 1-7)
1. Introduction
2. Lack of Inherent Superiority of Any Classifier
3. Bias and Variance
4. Resampling for Estimating Statistics
5. Resampling for Classifier Design
6. Estimating and Comparing Classifiers
7. Combining Classifiers

1. Introduction
Algorithm-independent machine learning means:
- mathematical foundations that do not depend on a particular classifier or learning algorithm
- techniques that can be used with different classifiers to provide guidance in their use

2. Lack of Inherent Superiority of Any Classifier
No Free Lunch Theorem
There are no a priori reasons to favor one learning or classification method over another.

The No Free Lunch Theorem has several parts:
Part 1 – averaged over all target functions, the expected off-training-set error is the same for all learning algorithms.
Part 2 – even for a known, fixed training set, no learning algorithm yields an off-training-set error superior to any other when averaged over all target functions.
Parts 3 & 4 – analogous statements for nonuniform distributions over target functions.
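
In the book's notation (my reconstruction from memory, so treat the exact symbols as an assumption), Part 1 says that the difference in expected off-training-set error between any two learning algorithms vanishes when summed uniformly over all target functions F:

\sum_{F} \big[ \mathcal{E}_1(E \mid F, n) - \mathcal{E}_2(E \mid F, n) \big] = 0,

where E denotes the off-training-set error and n the number of training points.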

Minimum Description Length (MDL)
Sometimes claimed to justify preferring one classifier over another, i.e., "simpler" classifiers over "complex" ones.
The algorithmic (Kolmogorov-Chaitin) complexity of a binary string x, defined by analogy to entropy, is the size of the shortest program y that computes x and halts when run on a universal Turing machine U.
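
The defining formula is not in the transcript; in the standard notation (which I am assuming here) it reads

K(x) = \min_{y:\, U(y) = x} |y|,

i.e., the length of the shortest program y for which the universal Turing machine U outputs x and then halts.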

Examples (x a binary string of length n):
- x = string of n 1s: the complexity requires only about log2 n bits, to specify the condition for halting.
- x = the first n binary digits of the constant π: the complexity does not grow with n.
- x = a "truly" random string of n binary digits: the complexity grows with n, the length of x.

Minimum Description Length Principle
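
The statement of the principle on the original slide is an equation that did not transcribe; roughly (my reconstruction, notation assumed), the preferred hypothesis h* minimizes the combined description length of the model and of the data encoded with the model's help:

h^{*} = \arg\min_{h} \big[ K(h) + K(D \text{ using } h) \big],

where K(h) is the algorithmic complexity of the hypothesis and K(D using h) that of the training data given the hypothesis.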

Overfitting Avoidance and Occam's Razor
Although we have mentioned the need to avoid overfitting via regularization, pruning, etc., the No Free Lunch result brings such techniques into question.
Nevertheless, although we cannot prove that they help, these techniques have been found useful in practice.

3. Bias and Variance
Bias and variance are two ways to measure the "match" or "alignment" of the learning algorithm to the classification problem.
Bias measures the accuracy of the match: high bias implies a poor match.
Variance measures the precision of the match: high variance implies a weak match.
Bias and variance are not independent.

Bias and Variance for Regression
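
The decomposition on the original slide is not in the transcript; the standard form, with g(x; D) the estimate learned from training set D and F(x) the true target function (my notation, so treat it as an assumption), is

\mathcal{E}_D\big[(g(x;D) - F(x))^2\big] = \big(\mathcal{E}_D[g(x;D)] - F(x)\big)^2 + \mathcal{E}_D\big[(g(x;D) - \mathcal{E}_D[g(x;D)])^2\big],

where the first term on the right is the squared bias and the second is the variance.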

[Figure: regression]

Bias and Variance for Classification
We will skip the mathematics here because it is complex, and simply review the figure on the next slide.

[Figure: classification]

4. Resampling for Estimating Statistics
Jackknife
We have standard methods for estimating the mean and variance of a sample, but not for estimating other statistics such as the median, the mode, or the bias of an estimator.
The jackknife and bootstrap methods are two of the most popular and theoretically grounded resampling techniques for extending estimates to arbitrary statistics.
The jackknife method is essentially a leave-one-out procedure for estimating various statistics.
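
As an illustration only (not from the slides), here is a minimal jackknife sketch in Python; the function name, the use of NumPy, and the choice of the median as the statistic are assumptions made for the example.

```python
import numpy as np

def jackknife_std_error(data, statistic):
    """Leave-one-out (jackknife) estimate of the standard error of `statistic`."""
    data = np.asarray(data)
    n = len(data)
    # Recompute the statistic n times, each time leaving out one sample.
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    # Jackknife variance: (n - 1)/n times the sum of squared deviations.
    var_jack = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
    return np.sqrt(var_jack)

rng = np.random.default_rng(0)
sample = rng.normal(size=50)
print(jackknife_std_error(sample, np.median))   # jackknife SE of the sample median
```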

Bootstrap
This method randomly creates B "bootstrap" data sets by repeatedly selecting n points from the training set with replacement (with replacement means individual samples can be repeated).
The bootstrap estimate of a statistic is then merely the mean of the B individual estimates.
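
A companion sketch of the bootstrap estimate just described (again my own example; the median is only a stand-in for an arbitrary statistic):

```python
import numpy as np

def bootstrap_estimate(data, statistic, B=1000, seed=0):
    """Mean (and spread) of `statistic` over B bootstrap data sets."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.array([
        statistic(rng.choice(data, size=n, replace=True))   # n points, with replacement
        for _ in range(B)
    ])
    # The bootstrap estimate is the mean of the B estimates; the standard
    # deviation of the B estimates doubles as a standard-error estimate.
    return estimates.mean(), estimates.std(ddof=1)

sample = np.random.default_rng(1).normal(size=50)
print(bootstrap_estimate(sample, np.median))
```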

5. Resampling for Classifier Design
The generic term arcing (adaptive reweighting and combining) refers to reusing or selecting data in order to improve classification.
5.1 Bagging (from "bootstrap aggregation")
Uses multiple versions of a training set, each created by drawing n' < n samples from the training set with replacement.
Each bootstrap data set is used to train a different component classifier, and the final decision is based on a vote of the component classifiers.
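
A minimal bagging sketch along these lines (my illustration; scikit-learn, the decision-tree base learner, and the 80% subset size are assumptions, not something the slides specify):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_classifiers=25, ratio=0.8, seed=0):
    """Train component classifiers on bootstrap subsets and take a majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    n_sub = int(ratio * n)                        # n' < n samples per bootstrap set
    votes = []
    for _ in range(n_classifiers):
        idx = rng.choice(n, size=n_sub, replace=True)      # draw with replacement
        clf = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    votes = np.stack(votes)                       # shape: (n_classifiers, n_test)
    # Majority vote of the component classifiers for each test point
    # (class labels are assumed to be non-negative integers).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```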

Boosting
As in bagging, each selected data set is used to train a different component classifier, and the final decision is based on a vote of the component classifiers.
Classification accuracy is "boosted" by adding component classifiers so as to form an ensemble with high accuracy on the training set.
The subsets of training data chosen for each new component classifier are the "most informative" given the current set of component classifiers.
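
The best-known boosting algorithm is AdaBoost; the following rough sketch (my own, with decision stumps as the weak learners and -1/+1 labels, both assumptions for the example) shows how sample reweighting concentrates each new component classifier on the currently most informative points:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """AdaBoost with decision stumps; labels y must be -1/+1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                       # uniform sample weights at the start
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 0.5:                            # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the component classifiers."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```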

[Figure: boosting]

Learning with Queries
This is a special case of resampling, also called active learning or interactive learning.
It uses an oracle (for example, a human expert) to label the most "valuable" unlabeled patterns.
Two methods of selecting informative patterns (the first is sketched below):
- Select the pattern for which the two largest discriminant functions have nearly the same value.
- For multiclassifier systems, select the pattern yielding the greatest disagreement among the k component classifiers.
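
A rough sketch of the first selection rule, assuming we have per-class discriminant scores for a pool of unlabeled patterns (the score matrix, function name, and numbers are mine, purely for illustration):

```python
import numpy as np

def select_query(scores):
    """Return the pool index whose two largest discriminant values are closest.

    scores has shape (n_pool, n_classes); scores[i, c] is the discriminant
    function g_c evaluated on unlabeled pattern i.
    """
    top_two = np.sort(scores, axis=1)[:, -2:]      # two largest values per pattern
    margin = top_two[:, 1] - top_two[:, 0]         # small margin = ambiguous pattern
    return int(np.argmin(margin))                  # query the most ambiguous one

pool_scores = np.array([[0.90, 0.05, 0.05],
                        [0.40, 0.38, 0.22],        # nearly tied -> most informative
                        [0.70, 0.20, 0.10]])
print(select_query(pool_scores))                   # -> 1
```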

Arcing, Learning with Queries, Bias and Variance
Resampling in general, and learning with queries in particular, seems to violate the assumption of drawing i.i.d. training data. How, then, can we do better with these techniques?
In learning with queries we are not fitting parameters in a model, but instead seeking decision boundaries more directly.
As the number of component classifiers is increased, techniques like boosting effectively broaden the class of implementable functions.

6. Estimating and Comparing Classifiers
There are two main reasons for determining the generalization rate of a classifier:
- to see whether the classifier is good enough to be useful
- to compare its performance with that of competing designs
6.1 Parametric Models
Computing the generalization rate from the assumed parametric model is dangerous:
- the estimates are often overly optimistic, and the unrepresentativeness of the training samples is not revealed
- the validity of the model itself should always be suspect
- it is difficult to compute the error rate for complex distributions

6.2 Cross-Validation
Simple validation: split the training set into two parts.
- The usual training part: typically 90% of the data.
- The validation part: the remaining 10%, used for estimating the generalization error.
Stop training when the error on the validation set reaches a minimum.

Cross-Validation (continued)
m-fold cross-validation: divide the training set into m parts of equal size n/m.
The classifier is trained m times, each time with a different part held out for validation; the estimated performance is the mean of the m tests.
In the limit where m = n, the method is in effect the leave-one-out approach.
The validation error gives an estimate of classifier accuracy.
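
A minimal m-fold cross-validation sketch matching the description above (my example; the logistic-regression classifier and scikit-learn are assumptions, not something the slides prescribe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def m_fold_cv_accuracy(X, y, m=5, seed=0):
    """Mean validation accuracy over m folds; m = len(X) gives leave-one-out."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, m)                       # m roughly equal parts
    accuracies = []
    for k in range(m):
        val = folds[k]                                   # held-out part for this fold
        train = np.concatenate([folds[j] for j in range(m) if j != k])
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        accuracies.append(np.mean(clf.predict(X[val]) == y[val]))
    return float(np.mean(accuracies))                    # estimated generalization accuracy
```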

For example, if no errors are made on 50 test samples, then with probability 0.95 the true error rate lies between zero and 8%.

Jackknife and Bootstrap Estimation of Classification Accuracy
The jackknife approach trains the classifier n times, each time leaving out one training sample; the estimated classifier accuracy is simply the mean of the n leave-one-out accuracies.
The bootstrap approach trains B classifiers, each with a different bootstrap data set, and estimates the accuracy as the mean of the B individual accuracies.

These subsections are skipped:
6.4 Maximum-Likelihood Model Comparison
6.5 Bayesian Model Comparison
6.6 The Problem-Average Error Rate

Predicting Final Performance from Learning Curves
Training on large data sets can be computationally intensive, so we would like to use a classifier's performance on a relatively small training set to predict its performance on a large one.
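
The learning-curve figures and fitted functional form that accompanied this part of the deck did not transcribe; a common model for such curves (an assumption on my part, though consistent with the textbook's power-law treatment) is a decay of test error toward an asymptote,

E_{\text{test}}(n') \approx a + \frac{b}{n'^{\alpha}}, \qquad a, b, \alpha \ge 0,

where n' is the training-set size; fitting a, b, and α on small n' lets the asymptotic error a serve as the prediction for a large training set.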

The Capacity of a Separating Plane
Of the 2^n possible dichotomies of n points in d dimensions, the fraction f(n, d) that is linearly separable is given by the formula below; for example, for n = 4 and d = 1, f(n, d) = 0.5.
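
The formula itself is not in the transcript; the standard function-counting result it refers to (my reconstruction, so verify against the text) is

f(n, d) =
\begin{cases}
1, & n \le d + 1, \\
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i}, & n > d + 1.
\end{cases}

As a check, f(4, 1) = (2/2^4)\big[\binom{3}{0} + \binom{3}{1}\big] = 8/16 = 0.5, matching the value quoted on the slide.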

The Capacity of a Separating Plane (cont.)
For four patterns in one dimension, f(4, 1) = 0.5.
The table on the original slide (not transcribed) lists all 16 equally likely labelings of four patterns along a line (1D); exactly half of them are linearly separable.

7. Combining Classifiers
Combining classifiers works well if each component classifier is an "expert" in a different region of the pattern space.
7.1 Component Classifiers with Discriminant Functions
The combination provides a mixture distribution; the basic architecture appears in a figure on the original slides (not transcribed).
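
In the usual notation for this kind of architecture (my reconstruction, so treat the symbols as assumptions), k component classifiers are blended by a gating mechanism whose weights sum to one:

g(y \mid \mathbf{x}) = \sum_{r=1}^{k} w_r(\mathbf{x})\, g_r(y \mid \mathbf{x}), \qquad \sum_{r=1}^{k} w_r(\mathbf{x}) = 1,

where g_r is the discriminant (or posterior estimate) of component r and w_r(\mathbf{x}) is its gating weight at input \mathbf{x}.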

Component Classifiers without Discriminant Functions
For example, we might combine three classifiers with different kinds of output:
- a neural network, producing analog values for each class
- a k-nearest-neighbor classifier, producing a rank order of the classes
- a decision tree, producing a single output class
Convert the outputs of each classifier to discriminant values g_i that sum to 1 (one possible conversion is sketched below).
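
A rough sketch of one way to perform these conversions; the specific recipes (softmax for the analog values, normalized rank weights for the kNN ordering, a one-hot vector for the tree's single output) are illustrative assumptions, not a prescription from the slides:

```python
import numpy as np

def from_analog(values):
    """Neural-net analog outputs -> discriminants summing to 1 (softmax)."""
    z = np.exp(values - np.max(values))          # subtract the max for stability
    return z / z.sum()

def from_rank_order(ranked_classes, n_classes):
    """kNN rank order (best class first) -> normalized rank weights."""
    g = np.zeros(n_classes)
    for position, c in enumerate(ranked_classes):
        g[c] = n_classes - position              # best-ranked class gets the largest weight
    return g / g.sum()

def from_single_label(label, n_classes):
    """Decision-tree single output -> one-hot discriminant vector."""
    g = np.zeros(n_classes)
    g[label] = 1.0
    return g

# Combine the three converted discriminant vectors, e.g. by simple averaging.
g_nn   = from_analog(np.array([2.0, 0.5, 0.1]))
g_knn  = from_rank_order([0, 2, 1], n_classes=3)
g_tree = from_single_label(0, n_classes=3)
g_combined = (g_nn + g_knn + g_tree) / 3.0
print(g_combined, g_combined.sum())              # the combined values sum to 1
```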