Decision trees and empirical methodology Sec 4.3, 5.1-5.4.

Decision trees and empirical methodology Sec 4.3, 5.1-5.4.

Review...
- Goal: want to find/replicate the target function f()
- Candidates come from the hypothesis space H; "best" candidate measured by accuracy (for the moment)
- Decision trees are built by greedy, recursive search
- Produces a piecewise-constant, axis-orthogonal, hyperrectangular model
- Can handle continuous or categorical attributes; only categorical class labels
- Learning bias toward small, well-balanced trees

Splitting criteria
What properties do we want our getBestSplitAttribute() function to have?
- Increase the purity of the data: after the split, the new sets should be closer to uniform labeling than before the split
- Want the subsets to have roughly the same purity
- Want the subsets to be as balanced as possible
These choices are designed to produce small trees.
Definition: learning bias == the tendency to find one class of solution out of H in preference to another

Entropy
We'll use entropy. Consider a set of true/false labels:
- Want our measure to be small when the set is pure (all true or all false), and large when the set is almost evenly split between the classes
- It expresses the amount of information in the set
- (Later we'll use the negative of this function, so it'll be better if the set is almost pure)

Entropy, cont'd
Define: class fractions (a.k.a. class prior probabilities): $p_i = \frac{|\{y \in Y : y = i\}|}{|Y|}$
Define: entropy of a set (two classes, with $p$ the fraction of true labels): $H(Y) = -p \log_2 p - (1-p)\log_2(1-p)$
In general, for $k$ classes: $H(Y) = -\sum_{i=1}^{k} p_i \log_2 p_i$
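
These definitions translate directly into code. A minimal sketch in Python (the function name and label encoding are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for count in Counter(labels).values():
        p = count / n            # class fraction p_i
        h -= p * math.log2(p)    # accumulate -p_i log2 p_i
    return h

print(entropy([True] * 8))                 # 0.0 -- a pure set
print(entropy([True] * 4 + [False] * 4))   # 1.0 -- an even split
```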

The entropy curve [figure: binary entropy H(p) plotted against the class fraction p; zero at p = 0 and p = 1, maximal at p = 1/2]
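
A few points along the curve, computed directly from the two-class formula (a throwaway snippet, not from the slides):

```python
import math

def H(p):
    """Binary entropy of a set whose true-class fraction is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.25, 0.5]:
    print(p, round(H(p), 3))   # 0.0, 0.469, 0.811, 1.0
```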

Entropy of a split
- A split produces a number of sets (one for each branch)
- Need a corresponding entropy of a split (i.e., entropy of a collection of sets)
Definition: entropy of a split that partitions the labels $Y$ into subsets $Y_1, \ldots, Y_k$: $H(Y_1, \ldots, Y_k) = \sum_{j=1}^{k} \frac{|Y_j|}{|Y|} H(Y_j)$
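
Continuing the running sketch (this reuses the entropy() function from above; the weighted-average form follows the definition on the slide):

```python
def split_entropy(subsets):
    """Weighted average entropy of label subsets Y_1..Y_k."""
    total = sum(len(y) for y in subsets)
    return sum(len(y) / total * entropy(y) for y in subsets)

# A perfect split drives the entropy to zero:
print(split_entropy([[True] * 3, [False] * 3]))   # 0.0
```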

Information gain
The last, easy step: want to pick the attribute that decreases the information content of the data as much as possible
Q: Why decrease?
Define: gain of splitting data set [X, Y] on attribute a, where the split partitions $Y$ into $Y_1, \ldots, Y_k$: $\text{gain}(X, Y, a) = H(Y) - \sum_{j=1}^{k} \frac{|Y_j|}{|Y|} H(Y_j)$
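
In code (again reusing the helpers above), the gain is just the before/after difference:

```python
def information_gain(labels, subsets):
    """Entropy before the split minus the split entropy after it."""
    return entropy(labels) - split_entropy(subsets)

Y = [True, True, True, False, False, False]
print(information_gain(Y, [[True] * 3, [False] * 3]))  # 1.0 -- perfect split
print(information_gain(Y, [[True, True, False],
                           [True, False, False]]))     # ~0.08 -- weak split
```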

Final algorithm
Now we have a complete algorithm for the getBestSplitAttribute() function:

    Input:  InstanceSet X, LabelSet Y
    Output: Attribute

    baseInfo = entropy(Y);
    foreach a in (X.attributes) {
        [X1,...,Xk, Y1,...,Yk] = splitData(X, Y, a);
        gain[a] = baseInfo - splitEntropy(Y1,...,Yk);
    }
    return argmax(gain);
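
One way the whole thing might look in runnable Python, continuing the sketch above. The data layout (a list of attribute dicts plus a parallel label list) and helper names are assumptions for illustration, not the course's actual code:

```python
from collections import defaultdict

def split_data(X, Y, a):
    """Partition the labels by the value of attribute a."""
    groups = defaultdict(list)
    for x, y in zip(X, Y):
        groups[x[a]].append(y)
    return list(groups.values())

def get_best_split_attribute(X, Y, attributes):
    base_info = entropy(Y)
    gains = {a: base_info - split_entropy(split_data(X, Y, a))
             for a in attributes}
    return max(gains, key=gains.get)

X = [{"outlook": "sun",  "windy": True},
     {"outlook": "sun",  "windy": False},
     {"outlook": "rain", "windy": True},
     {"outlook": "rain", "windy": False}]
Y = [False, False, True, True]
print(get_best_split_attribute(X, Y, ["outlook", "windy"]))  # "outlook"
```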

DTs in practice...
Growing to purity is bad (overfitting)
[figures: decision-tree partitions of a 2-D dataset grown to purity; axes x1: petal length, x2: sepal width]

DTs in practice...
- Growing to purity is bad (overfitting)
  - Terminate growth early
  - Grow to purity, then prune back
- Multiway splits are a pain
  - Entropy is biased in favor of more splits
  - Correct w/ gain ratio
- Real-valued attributes
  - Rules of the form if (x1 < 3.4) { ... }
  - How to pick the "3.4"? (one common approach is sketched below)
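
The slide leaves the threshold question open; one standard answer (an assumption here, not stated in the text) is to sort by the attribute, try the midpoint between each pair of consecutive distinct values, and keep the candidate with the highest gain. A sketch, reusing the entropy helpers above:

```python
def best_threshold(values, labels):
    """Pick a split threshold for one real-valued attribute by gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                # no gap to split in
        t = (pairs[i - 1][0] + pairs[i][0]) / 2     # midpoint candidate
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = base - split_entropy([left, right])
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

print(best_threshold([1.0, 2.0, 3.0, 4.0],
                     [False, False, True, True]))   # (2.5, 1.0)
```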

Measuring accuracy
So now you have a DT -- what now?
- Usually, want to use it to classify new data (previously unseen)
- Want to know how well you should expect it to perform
How do you estimate such a thing?
- Theoretically -- prove that you have the "right" tree (very, very hard in practice)
- Measure it (trickier than it sounds!...)

Testing with training data
So you have a data set $X = \{x_1, \ldots, x_N\}$ and corresponding labels $Y = \{y_1, \ldots, y_N\}$. You build your decision tree:

    tree = buildDecisionTree(X, Y)

What happens if you just do this?

    acc = 0.0;
    for (i = 1; i <= N; ++i) {
        acc += (tree.classify(X[i]) == Y[i]);
    }
    acc /= N;
    return acc;

Testing with training data
Answer: you tend to overestimate real accuracy (possibly drastically)
[figure: overfit tree boundaries fit to the training points; new, unseen points (marked "?") land in the wrong regions; axis x2: sepal width]

Separation of train & test Fundamental principle (1st amendment of ML): Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!

Holdout data
- Usual to "hold out" a separate set of data for testing; not used to train the classifier
- A.k.a. test set, holdout set, evaluation set, etc.
- E.g., $\widehat{acc}_{\text{train}}$ is training-set accuracy; $\widehat{acc}_{\text{test}}$ is test-set (or generalization) accuracy
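
A minimal holdout evaluation sketch (split fraction, seed, and function names are illustrative; buildDecisionTree / tree.classify stand in for whatever learner the course builds):

```python
import random

def accuracy(classify, X, Y):
    """Fraction of examples that classify() labels correctly."""
    return sum(classify(x) == y for x, y in zip(X, Y)) / len(Y)

def holdout_split(X, Y, test_frac=0.3, seed=0):
    """Shuffle indices, then hold out the last test_frac for evaluation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [Y[i] for i in tr],
            [X[i] for i in te], [Y[i] for i in te])

# Usage (assumes a tree learner like the one from earlier slides):
# Xtr, Ytr, Xte, Yte = holdout_split(X, Y)
# tree = buildDecisionTree(Xtr, Ytr)
# print(accuracy(tree.classify, Xtr, Ytr))  # training accuracy (optimistic)
# print(accuracy(tree.classify, Xte, Yte))  # test/generalization accuracy
```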

Gotchas...
What if you're unlucky when you split data into train/test?
- E.g., all train data are class A and all test are class B?
- No "red" things show up in training data
Best answer: stratification
- Try to make sure class (+feature) ratios are the same in train/test sets (and the same as in the original data)
- Why does this work? (a sketch follows below)
Almost as good: randomization
- Shuffle data randomly before splitting
- Why does this work?
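
A sketch of stratification on class labels only (balancing feature ratios as well would take more bookkeeping; names and defaults are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(X, Y, test_frac=0.3, seed=0):
    """Split so each class keeps (about) its original train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(Y):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)           # randomize within each class
        cut = int(len(indices) * (1 - test_frac))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return ([X[i] for i in train], [Y[i] for i in train],
            [X[i] for i in test],  [Y[i] for i in test])
```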