Machine Learning: Decision Trees. E. Keogh, UC Riverside.


Machine Learning Decision Trees

E. Keogh, UC Riverside. Decision Tree Classifier (Ross Quinlan). Example: classifying insects by two features, Abdomen Length and Antenna Length. The root test is "Abdomen Length > 7.1?": yes leads to Katydid; no leads to "Antenna Length > 6.0?", where yes gives Katydid and no gives Grasshopper. [Figure: scatterplot of Antenna Length vs. Abdomen Length with these two splits.] This two-test tree is sketched as code below.
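As a concrete illustration (our own sketch, not from the original slides), the tree can be written as two nested tests; the thresholds 7.1 and 6.0 are the ones shown on the slide, and the function name is ours:

```python
def classify_insect(abdomen_length, antenna_length):
    """Apply the two tests from the slide's decision tree."""
    if abdomen_length > 7.1:
        return "Katydid"
    if antenna_length > 6.0:
        return "Katydid"
    return "Grasshopper"

# Example: long abdomen -> Katydid; short abdomen and short antennae -> Grasshopper.
print(classify_insect(abdomen_length=8.0, antenna_length=2.0))  # Katydid
print(classify_insect(abdomen_length=3.0, antenna_length=2.0))  # Grasshopper
```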

E. Keogh, UC Riverside. Decision trees predate computers: a taxonomic key for insects applies a sequence of tests such as "Antennae shorter than body?", "3 Tarsi?", and "Foretibia has ears?" to distinguish Grasshoppers, Crickets, Katydids, and Camel Crickets. [Figure: the dichotomous key with Yes/No branches.]

E. Keogh, UC Riverside. Decision Tree Classification
A decision tree is:
–A flow-chart-like tree structure
–Internal nodes denote a test on an attribute
–Branches represent outcomes of the test
–Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
–Tree construction: at the start, all the training examples are at the root; examples are partitioned recursively based on selected attributes
–Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree (a small sketch of this follows below).
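To make the structure concrete, here is a minimal sketch (our own illustration, not the slides' notation) in which an internal node is a dict holding the attribute to test and one subtree per outcome, a leaf is a class label, and classification walks from the root:

```python
# Internal node: {"attribute": ..., "branches": {outcome: subtree}}; leaf: a class label.
tree = {
    "attribute": "abdomen_length_gt_7.1",
    "branches": {
        "yes": "Katydid",
        "no": {
            "attribute": "antenna_length_gt_6.0",
            "branches": {"yes": "Katydid", "no": "Grasshopper"},
        },
    },
}

def classify(node, sample):
    """Test the sample's attribute values against the tree until a leaf is reached."""
    while isinstance(node, dict):   # internal node: descend along the matching branch
        node = node["branches"][sample[node["attribute"]]]
    return node                     # leaf: the predicted class label

print(classify(tree, {"abdomen_length_gt_7.1": "no", "antenna_length_gt_6.0": "yes"}))  # Katydid
```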

E. Keogh, UC Riverside. How do we construct the decision tree?
Basic algorithm (a greedy algorithm):
–The tree is constructed in a top-down, recursive, divide-and-conquer manner
–At the start, all the training examples are at the root
–Attributes are categorical (continuous-valued attributes can be discretized in advance)
–Examples are partitioned recursively based on selected attributes
–Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping the partitioning:
–All samples at a given node belong to the same class
–There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
–There are no samples left
A sketch of this procedure appears below.
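A minimal, self-contained sketch of this greedy procedure (an ID3-style outline of our own; it assumes categorical attributes and examples given as dicts, and is not the exact code behind the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels; 0*log(0) is treated as 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from partitioning the examples on the attribute."""
    before = entropy([e[target] for e in examples])
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

def build_tree(examples, attributes, target="class"):
    """Top-down, recursive, divide-and-conquer tree construction."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                  # stop: all samples belong to the same class
        return labels[0]
    if not attributes:                         # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    node = {"attribute": best, "branches": {}}
    for value in {e[best] for e in examples}:  # partition recursively on the chosen attribute
        subset = [e for e in examples if e[best] == value]
        node["branches"][value] = build_tree(
            subset, [a for a in attributes if a != best], target)
    return node
```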

E. Keogh, UC Riverside. Information Gain as a Splitting Criterion
Select the attribute with the highest information gain (information gain is the expected reduction in entropy).
Assume there are two classes, P and N:
–Let the set of examples S contain p elements of class P and n elements of class N
–The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
(0 log(0) is defined as 0.)
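In code, the same quantity can be computed from the two class counts (our own small helper, mirroring the I(p, n) formula above):

```python
import math

def I(p, n):
    """Information needed to classify an example from a set with p P's and n N's."""
    total = p + n
    info = 0.0
    for count in (p, n):
        if count:                       # skip zero counts: 0 * log(0) is defined as 0
            info -= (count / total) * math.log2(count / total)
    return info

print(I(1, 1))   # 1.0  (equally mixed)
print(I(4, 0))   # 0.0  (all one class)
```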

E. Keogh, UC Riverside. Information Gain in Decision Tree Induction
Assume that using attribute A, the current set S will be partitioned into child sets S1, S2, …, Sv, where child Si contains pi elements of class P and ni elements of class N.
The expected information of the split is the size-weighted average of the child entropies:
E(A) = Σi ((pi + ni)/(p + n)) I(pi, ni)
The encoding information that would be gained by branching on A is:
Gain(A) = I(p, n) - E(A)
Note: entropy is at its minimum (zero) when all objects in a collection belong to the same class, and at its maximum when the classes are equally mixed.

E. Keogh, UC Riverside. The training data:
Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto     10"          180     38   M
Krusty   6"           200     45   M
Comic    8"           290     38   ?

How to Choose the Most Descriptive Rule?

Entropy
Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:
Entropy(S) = -p1 log2(p1) - p0 log2(p0)
where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
If all examples are in one category, entropy is zero (we define 0 log(0) = 0).
If examples are equally mixed (p1 = p0 = 0.5), entropy is at its maximum of 1.
Entropy can be viewed as the number of bits required on average to encode the class of an example in S, where data compression (e.g., Huffman coding) is used to give shorter codes to more likely cases.
For multi-class problems with c categories, entropy generalizes to:
Entropy(S) = -Σ(i=1..c) pi log2(pi)
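A small sketch (ours, not the slides') of the multi-class formula that also checks the two properties stated above; taking class counts rather than proportions as input is our choice:

```python
import math

def entropy_from_counts(class_counts):
    """Multi-class entropy: -sum_i p_i log2(p_i) over the class proportions."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]   # drop zero terms: 0*log(0) = 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy_from_counts([10, 0]))    # 0.0  -> all examples in one category
print(entropy_from_counts([5, 5]))     # 1.0  -> equally mixed binary case
print(entropy_from_counts([2, 3, 5]))  # ~1.485 -> a three-class example
```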

Entropy Plot for Binary Classification

Information Gain
The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:
Gain(S, F) = Entropy(S) - Σ(v in Values(F)) (|Sv|/|S|) Entropy(Sv)
where Sv is the subset of S having value v for feature F. The entropy of each resulting subset is weighted by its relative size.

E. Keogh, UC Riverside. Let us try splitting on Hair Length: Hair Length <= 5?
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
yes branch: Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
no branch: Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

E. Keogh, UC Riverside. Let us try splitting on Weight: Weight <= 160?
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
yes branch: Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
no branch: Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

E. Keogh, UC Riverside. Let us try splitting on Age: Age <= 40?
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
yes branch: Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
no branch: Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
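The three gains above can be reproduced in a few lines; this is our own verification sketch using the data table from earlier (hair length in inches, the unlabelled "Comic" row excluded):

```python
import math

# (name, hair length, weight, age, class) for the nine labelled people from the table
data = [("Homer", 0, 250, 36, "M"), ("Marge", 10, 150, 34, "F"),
        ("Bart", 2, 90, 10, "M"),   ("Lisa", 6, 78, 8, "F"),
        ("Maggie", 4, 20, 1, "F"),  ("Abe", 1, 170, 70, "M"),
        ("Selma", 8, 160, 41, "F"), ("Otto", 10, 180, 38, "M"),
        ("Krusty", 6, 200, 45, "M")]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(column, threshold):
    """Information gain of the binary test data[column] <= threshold."""
    labels = [row[-1] for row in data]
    yes = [row[-1] for row in data if row[column] <= threshold]
    no = [row[-1] for row in data if row[column] > threshold]
    return (entropy(labels)
            - (len(yes) / len(data)) * entropy(yes)
            - (len(no) / len(data)) * entropy(no))

print(round(gain(1, 5), 4))     # Hair Length <= 5 -> 0.0911
print(round(gain(2, 160), 4))   # Weight <= 160    -> 0.59
print(round(gain(3, 40), 4))    # Age <= 40        -> 0.0183
```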

E. Keogh, UC Riverside. Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified… RECURSION! Applying the same procedure to the Weight <= 160 subset, this time we find that we can split on Hair Length (Hair Length <= 2?), and we are done.

E. Keogh, UC Riverside. We don't need to keep the data around, just the test conditions. The final tree: Weight <= 160? no → Male; yes → Hair Length <= 2? yes → Male, no → Female. How would these people be classified?

E. Keogh, UC Riverside. It is trivial to convert Decision Trees to rules. From the tree above we get the following rules to classify Males/Females (a sketch of these rules as code follows):
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
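As code (our own sketch; attribute names and units follow the data table above):

```python
def classify_person(weight, hair_length):
    """The learned tree, written directly as the rules above."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify_person(weight=290, hair_length=8))   # Male   (e.g., the "Comic" row)
print(classify_person(weight=150, hair_length=10))  # Female (e.g., Marge)
```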

E. Keogh, UC Riverside. Once we have learned the decision tree, we don't even need a computer! This decision tree is attached to a medical machine and is designed to help nurses decide what type of doctor to call. [Figure: decision tree for a typical shared-care setting, applying the system to the diagnosis of prostatic obstructions.]

E. Keogh, UC Riverside. The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data… When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets. For example, the rule "Wears green?" perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes?"…

E. Keogh, UC Riverside. Avoid Overfitting in Classification
The generated tree may overfit the training data:
–Too many branches, some of which may reflect anomalies due to noise or outliers
–The result is poor accuracy on unseen samples
Two approaches to avoid overfitting (see the sketch below):
–Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
–Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the "best pruned tree".
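For instance, scikit-learn's DecisionTreeClassifier exposes both styles. This is a hedged sketch rather than a recipe: the dataset and the particular threshold value are arbitrary choices of ours.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Prepruning: refuse splits whose impurity decrease falls below a threshold.
pre = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0).fit(X_train, y_train)

# Postpruning: grow fully, then choose among progressively pruned trees
# (cost-complexity path) using data held out from training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
pruned = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
          for a in path.ccp_alphas]
best = max(pruned, key=lambda t: t.score(X_valid, y_valid))
print(pre.score(X_valid, y_valid), best.score(X_valid, y_valid))
```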

E. Keogh, UC Riverside. Which of the "Pigeon Problems" can be solved by a Decision Tree?
1) Deep bushy tree
2) Useless
3) Deep bushy tree
The decision tree has a hard time with correlated attributes.

UT Austin, R. Mooney. Cross-Validating without Losing Training Data
If the algorithm is modified to grow trees breadth-first rather than depth-first, we can stop growing after reaching any specified tree complexity:
–First, run several trials of reduced-error pruning using different random splits of growing and validation sets.
–Record the complexity of the pruned tree learned in each trial. Let C be the average pruned-tree complexity.
–Grow a final tree breadth-first from all the training data, but stop when the complexity reaches C.
A similar cross-validation approach can be used to set arbitrary algorithm parameters in general (see the sketch below).
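In the same spirit, a common way to pick a complexity parameter by cross-validation today is a grid search. This sketch uses scikit-learn's max_leaf_nodes as the complexity measure and the iris data for illustration; both are our choices, not the slides'.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validate over a range of tree complexities and keep the best one.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_leaf_nodes": list(range(2, 20))},
                      cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```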

E. Keogh, UC Riverside. Advantages/Disadvantages of Decision Trees
Advantages:
–Easy to understand (doctors love them!)
–Easy to generate rules
Disadvantages:
–May suffer from overfitting
–Classifies by rectangular partitioning (so does not handle correlated features very well)
–Can be quite large; pruning is necessary
–Does not handle streaming data easily

UT Austin, R. Mooney. Additional Decision Tree Issues
–Better splitting criteria (information gain prefers features with many values)
–Continuous features
–Predicting a real-valued function (regression trees)
–Missing feature values
–Features with costs
–Misclassification costs
–Incremental learning (ID4, ID5)
–Mining large databases that do not fit in main memory