C4.5 - pruning decision trees

Quiz 1
Q: Is a tree with only pure leaves always the best classifier you can have?
A: No. Such a tree is the best classifier on the training set, but not necessarily on new, unseen data: because of overfitting, it may not generalize well.

Pruning
Goal: prevent overfitting to noise in the data.
Two strategies for “pruning” the decision tree:
Postpruning - take a fully-grown decision tree and discard unreliable parts
Prepruning - stop growing a branch when information becomes unreliable
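To make the two strategies concrete, here is a minimal sketch using scikit-learn. Note that scikit-learn implements CART with cost-complexity pruning, not C4.5's error-estimate-based method, and the dataset and parameter values below are illustrative assumptions; the point is only that the unpruned tree fits the training data almost perfectly but tends to do worse on held-out data than either pruned variant.

```python
# Sketch (not C4.5 itself): prepruning = stop early via growth limits,
# postpruning = grow fully, then cut back (here via cost-complexity pruning).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_informative=5, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                        # grown to pure leaves
pre  = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_tr, y_tr)   # prepruned
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)        # postpruned

for name, tree in [("full", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, "train acc:", round(tree.score(X_tr, y_tr), 2),
          "test acc:", round(tree.score(X_te, y_te), 2))
```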

Prepruning
Based on a statistical significance test: stop growing the tree when there is no statistically significant dependency between any attribute and the class at a particular node.
Most popular test: the chi-squared test. ID3 used the chi-squared test in addition to information gain: only statistically significant attributes were allowed to be selected by the information-gain procedure.
Information gain is itself a statistic, but it does not take the number of examples into account: even two records (one true, one false) can be split with an information gain of 1 bit.
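As a sketch of what such a significance check might look like (this is not ID3's or C4.5's actual code, and the counts in the contingency table are made up for illustration):

```python
# Hypothetical prepruning check: is the candidate attribute significantly
# associated with the class at this node? If not, stop splitting here.
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = attribute values, columns = class counts.
contingency = [[20, 5],
               [4, 21]]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")

if p_value > 0.05:          # no significant dependency at the 5% level
    print("stop growing this branch (prepruning)")
else:
    print("dependency is significant, keep splitting")
```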

Early stopping
Pre-pruning may stop the growth process prematurely: early stopping.
Classic example: the XOR/parity problem.

  #   a   b   class
  1   0   0   0
  2   0   1   1
  3   1   0   1
  4   1   1   0

No individual attribute exhibits any significant association with the class; the structure is only visible in the fully expanded tree, so pre-pruning won't even expand the root node.
But: XOR-type problems are rare in practice. And: pre-pruning is pretty fast.
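A quick check of this claim, with the information-gain computation written out by hand: on the four XOR records above, neither a nor b alone gives any information gain, so a prepruned tree never gets past the root.

```python
# Information gain of each single attribute on the XOR data above.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (a, b, class) = XOR table
classes = [c for _, _, c in rows]

for idx, name in [(0, "a"), (1, "b")]:
    gain = entropy(classes)
    for v in (0, 1):
        subset = [r[2] for r in rows if r[idx] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    print(f"information gain of {name}: {gain:.2f}")   # 0.00 for both
```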

Post-pruning
First, build the full tree; the fully-grown tree shows all attribute interactions. Then, prune it.
Two pruning operations: subtree replacement and subtree raising.

Subtree replacement
Bottom-up: consider replacing a tree only after considering all of its subtrees.
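A structural sketch of bottom-up subtree replacement (not C4.5's code; the tree representation and the error-estimate callback are assumptions for illustration): children are pruned first, and then the node itself is replaced by a leaf if that does not increase the estimated error.

```python
# Hypothetical tree node: either a leaf (children is None) or an internal node.
# estimate_error(node) stands in for whatever error estimate is used (e.g. the
# C4.5 upper-bound estimate described below); here it is left abstract.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    majority_class: str
    children: Optional[List["Node"]] = None   # None => leaf

def prune(node: Node, estimate_error: Callable[[Node], float]) -> Node:
    if node.children is None:
        return node                            # a leaf: nothing to do
    # Bottom-up: prune all subtrees first.
    node.children = [prune(child, estimate_error) for child in node.children]
    # Then consider replacing this subtree by a single leaf.
    as_leaf = Node(node.majority_class, children=None)
    if estimate_error(as_leaf) <= estimate_error(node):
        return as_leaf                         # subtree replacement
    return node
```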

Estimating error rates
Prune only if pruning reduces the estimated error.
C4.5's method: derive a confidence interval from the training data and use a heuristic limit, derived from it, for pruning.
This is the standard Bernoulli-process-based method, but its statistical assumptions are shaky (the estimate is based on the training data).

Estimating error rates
Q: What is the error rate on the training set (for a leaf that misclassifies 2 of its 6 examples)?
A: 0.33 (2 out of 6).
Q: What is the largest error you can get on the training set?
A: 0.5.
Q: When you find a pure node, what will the error be on the test set (new data)?
A: Somewhat bigger than 0.

Test set error
Two rates: the observed error rate (on the training set) and the (unknown) actual error rate, which will be different (typically larger).
Compute a confidence interval around the observed error rate and take the high end of the interval as the worst-case scenario.

Estimating the error
Assume making an error is a Bernoulli trial with probability p; p is unknown (the true error rate).
We observe f, the observed error rate: f = E/N.
For large enough N, f follows a Normal distribution with mean p and variance p(1 - p)/N.
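A quick numerical sanity check of this approximation (a sketch; p, N, and the number of repetitions are arbitrary choices, not from the slides):

```python
# Simulate the observed error rate f = E/N for many samples of N Bernoulli(p)
# trials and compare its mean and variance with p and p(1-p)/N.
import numpy as np

rng = np.random.default_rng(0)
p, N, repeats = 0.3, 200, 100_000

errors = rng.binomial(N, p, size=repeats)     # E for each repetition
f = errors / N                                # observed error rates

print("mean of f:", round(f.mean(), 4), "  expected:", p)
print("var  of f:", round(f.var(), 6), "  expected:", round(p * (1 - p) / N, 6))
```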

Estimating the error
A c% confidence interval [-z ≤ X ≤ z] for a random variable X with zero mean is given by:
Pr[-z ≤ X ≤ z] = c
With a symmetric distribution:
Pr[-z ≤ X ≤ z] = 1 - 2 · Pr[X ≥ z]

z-transforming f
Transformed value for f (subtract the mean and divide by the standard deviation):
(f - p) / √(p(1 - p)/N)
Resulting equation:
Pr[ -z ≤ (f - p) / √(p(1 - p)/N) ≤ z ] = c
Solving for p:
p = ( f + z²/(2N) ± z·√( f/N - f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

C4.5's method
The error estimate for a subtree is the weighted sum of the error estimates for all its leaves.
Error estimate for a node (the upper bound of the interval above):
e = ( f + z²/(2N) + z·√( f/N - f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
If c = 25% then z = 0.69 (from the normal distribution):

  Pr[X ≥ z]    z
  1%           2.33
  5%           1.65
  10%          1.28
  20%          0.84
  25%          0.69
  40%          0.25
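As a sketch, this upper-bound estimate can be written as a small function (the function name and the test values are illustrative; the formula is the one above):

```python
# Pessimistic (upper-bound) error estimate used in C4.5-style pruning,
# for observed error rate f over N instances with confidence factor z.
from math import sqrt

def pessimistic_error(f: float, N: int, z: float = 0.69) -> float:
    num = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

# A leaf with 2 errors out of 6 instances: estimate rises from 0.33 to about 0.47.
print(round(pessimistic_error(2 / 6, 6), 2))   # ~0.47
```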

C4.5's method
f is the observed error and z = 0.69.
The estimate e is always larger than f (not obvious from the formula, but it is an upper bound).
As N → ∞, e → f: with more data, the estimate approaches the observed error.

Example
A subtree has three leaves; combining their error estimates using the instance ratios 6:2:6 gives 0.51 for the subtree (one of the 6-instance leaves, for example, has f = 0.33 and e = 0.47).
Replacing the subtree with a single leaf gives f = 5/14 = 0.36 and e = 0.46.
Since 0.46 < 0.51, prune!
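A worked check of the combination step (a sketch; the slide only gives the estimate for the 6-instance leaves, so the 2-instance leaf is assumed here to have 1 error out of 2, an assumption that reproduces the 0.51 above):

```python
# Reproduce the subtree estimate: weighted combination of leaf error estimates.
from math import sqrt

def pessimistic_error(f: float, N: int, z: float = 0.69) -> float:
    num = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

# (errors, instances) per leaf; the middle leaf's counts are an assumption.
leaves = [(2, 6), (1, 2), (2, 6)]

subtree_e = (sum(n * pessimistic_error(err / n, n) for err, n in leaves)
             / sum(n for _, n in leaves))
print(round(subtree_e, 2))   # ~0.51, versus 0.46 for the replacing leaf => prune
```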

Summary: Decision Trees
splits - binary, multi-way
split criteria - information gain, gain ratio, ...
pruning
No method is always superior - experiment!