Decision Trees.


Decision Trees

Representation of Concepts
Concept learning: a conjunction of attributes, e.g. (Sunny AND Hot AND Humid AND Windy).
Decision trees: a disjunction of conjunctions of attributes, e.g. (Sunny AND Normal) OR (Overcast) OR (Rain AND Weak).
A more powerful representation: larger hypothesis space H.
Can be represented as a tree; a common form of decision making in humans.
[Figure: OutdoorSport decision tree — Outlook at the root (sunny / overcast / rain), Humidity (high / normal) and Wind (strong / weak) subtrees leading to Yes / No leaves]

Rectangle learning
[Figure: + and - instances in a 2-D plane. A single conjunction corresponds to one axis-parallel rectangle enclosing the + instances; a disjunction of conjunctions corresponds to a union of such rectangles.]

Training Examples (PlayTennis)
Day  Outlook   Temp  Humidity  Wind    Tennis?
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Decision Trees
A decision tree represents the learned target function:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
- The tree can be re-expressed as logical formulas
[Figure: tree with Outlook at the root (sunny / overcast / rain); sunny branch tests Humidity (high -> No, normal -> Yes); overcast branch -> Yes; rain branch tests Wind (strong -> No, weak -> Yes)]

Representation in decision trees
Example of representing a rule as a DT:
if (outlook = sunny AND humidity = normal)
   OR (outlook = overcast)
   OR (outlook = rain AND wind = weak)
then playtennis
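
As an aside (not part of the original deck), the same disjunction of conjunctions can be written as a small Python predicate; attribute and value names simply follow the slide:

```python
# A minimal sketch: the PlayTennis concept above as a disjunction of conjunctions.
def play_tennis(outlook, humidity, wind):
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))

print(play_tennis("sunny", "normal", "strong"))   # True
print(play_tennis("rain", "high", "strong"))      # False
```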

Applications of Decision Trees
Instances describable by a fixed set of attributes and their values.
Target function is discrete valued (2-valued or N-valued; continuous functions can also be approximated).
Disjunctive hypothesis space.
Possibly noisy training data (errors, missing values, ...).
Examples: equipment or medical diagnosis, credit risk analysis, calendar scheduling preferences.

Decision Trees
Given a distribution of training instances in the Attribute 1 / Attribute 2 plane, draw axis-parallel lines to separate the instances of each class.
[Figure: + and - training instances scattered in the Attribute 1 / Attribute 2 plane]

Decision Tree Structure
Axis-parallel lines partition the Attribute 1 / Attribute 2 plane so that each region contains instances of a single class.
[Figure: the same + / - instances with axis-parallel splitting lines drawn in]

Decision Tree Structure
Decision nodes (splits) and decision leaves; alternate splits are possible.
Decision node = condition = box = collection of satisfying examples.
[Figure: the + / - instances with split thresholds at attribute values 20, 30, and 40]

Decision Tree Construction
Given a training data set, find the best tree structure.

Top-Down Construction
Start with an empty tree. Main loop (sketched in code below):
1. Pick the "best" decision attribute A for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes
Grow the tree just deep enough for perfect classification, if possible (or approximate at a chosen depth).
Which attribute is best?
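
A minimal, illustrative Python sketch of this main loop (not the deck's own code). It assumes examples are dicts of attribute -> value with a "label" key, and it takes best_attribute as a parameter: an assumed helper that returns the highest-information-gain attribute, as defined on the following slides.

```python
# ID3-style top-down construction: recursive, greedy, no backtracking.
from collections import Counter

def id3(examples, attributes, best_attribute):
    labels = [e["label"] for e in examples]
    # Stop if the examples are perfectly classified (all one label)
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return the majority label
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-2: pick the best attribute and make it the decision node
    a = best_attribute(examples, attributes)
    node = {"attribute": a, "children": {}}
    # Steps 3-5: one branch per value of a; recurse on the examples sorted to it
    for value in set(e[a] for e in examples):
        subset = [e for e in examples if e[a] == value]
        remaining = [x for x in attributes if x != a]
        node["children"][value] = id3(subset, remaining, best_attribute)
    return node
```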

Best attribute to split?
[Figure: the + / - instances in the Attribute 1 / Attribute 2 plane, not yet split]

Best attribute to split?
[Figure: the same instances with a candidate split, Attribute 1 > 40?]

Which split to make next?
After the first split there is a pure box/node (already a pure leaf, no further need to split) and a mixed box/node.
[Figure: the instances with the candidate split Attribute 1 > 20? on the mixed region]

Which split to make next?
The pure leaf needs no further split; the mixed region is split again.
[Figure: the instances with the candidate split Attribute 2 > 30? on the remaining mixed region]

Principle of Decision Tree Construction
Ultimately we want to form pure leaves (correct classification). Greedy approach:
1. Initially treat the entire data set as a single box
2. For each box, choose the split that reduces its impurity (in terms of class labels) by the maximum amount
3. Split the box having the highest reduction in impurity
4. Continue with Step 2
5. Stop when all boxes are pure

Choosing Best Attribute?
Consider 64 examples, 29+ and 35-. Which attribute is the better split?
First comparison: A1 splits [29+,35-] into t: [25+,5-] and f: [4+,30-]; A2 splits it into t: [15+,19-] and f: [14+,16-].
Second comparison: A1 splits [29+,35-] into t: [21+,5-] and f: [8+,30-]; A2 splits it into t: [18+,33-] and f: [11+,2-].

Entropy
A measure of uncertainty / purity / information content.
Information theory: an optimal-length code assigns (-log2 p) bits to a message having probability p.
S is a sample of training examples; p+ is the proportion of positive examples in S; p- is the proportion of negative examples in S.
Entropy of S: the average optimal number of bits needed to encode the class of an example drawn from S:
Entropy(S) = p+ (-log2 p+) + p- (-log2 p-) = -p+ log2 p+ - p- log2 p-
Can be generalized to more than two values.
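
A short illustrative sketch of this definition (an assumption of these notes, not the deck's code), for two classes given as positive/negative counts; it reproduces the roughly 0.99 entropy of the 29+/35- sample used on the next slides.

```python
import math

def entropy(pos, neg):
    """Two-class entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # 0 * log(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(29, 35), 2))       # 0.99, the E(S) of the running example
print(entropy(10, 0), entropy(8, 8))   # 0.0 (pure), 1.0 (maximally mixed)
```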

Entropy
Entropy can also be viewed as measuring the purity of S, the uncertainty in S, the information in S, ...
E.g.: Entropy = 0 for p+ = 1 or p+ = 0, and Entropy = 1 for p+ = 0.5.
Easy generalization beyond binary values: Entropy(S) = sum over i of pi (-log2 pi), where i ranges over the n classes in the general case (i is + or - in the binary case).

Choosing Best Attribute?
Consider the 64 examples (29+, 35-), with E(S) = 0.993, and compute the entropies of the children. Which split is better?
First comparison: A1 gives t: [25+,5-] (E = 0.650) and f: [4+,30-] (E = 0.522); A2 gives t: [15+,19-] (E = 0.989) and f: [14+,16-] (E = 0.997).
Second comparison: A1 gives t: [21+,5-] (E = 0.708) and f: [8+,30-] (E = 0.742); A2 gives t: [18+,33-] (E = 0.937) and f: [11+,2-] (E = 0.619).

Information Gain
Gain(S, A): the reduction in entropy after splitting S on attribute A:
Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv|/|S|) Entropy(Sv)
First comparison (E(S) = 0.993): A1 -> [25+,5-] (0.650) and [4+,30-] (0.522), Gain ≈ 0.41; A2 -> [15+,19-] (0.989) and [14+,16-] (0.997), Gain ≈ 0.000. A1 wins.
Second comparison (E(S) = 0.993): A1 -> [21+,5-] (0.708) and [8+,30-] (0.742), Gain ≈ 0.265; A2 -> [18+,33-] (0.937) and [11+,2-] (0.619), Gain ≈ 0.121. A1 wins.
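
Continuing the illustrative sketch: information gain over class-count pairs, reusing entropy() from the block above. It reproduces the second comparison (A1 clearly better than A2).

```python
def information_gain(parent, children):
    """parent is a (pos, neg) pair; children is a list of (pos, neg) pairs."""
    total = sum(parent)
    gain = entropy(*parent)
    for pos, neg in children:
        # subtract each child's entropy, weighted by the fraction of examples it holds
        gain -= (pos + neg) / total * entropy(pos, neg)
    return gain

print(round(information_gain((29, 35), [(21, 5), (8, 30)]), 3))   # ~0.266 (A1)
print(round(information_gain((29, 35), [(18, 33), (11, 2)]), 3))  # ~0.121 (A2)
```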

Gain function
Gain measures how much uncertainty can be reduced; its value lies between 0 and 1.
What is the significance of a gain of 0? For example, a 50/50 split of +/- both before and after discriminating on the attribute's values.
A gain of 1? Going from "perfect uncertainty" to perfect certainty after splitting on a fully predictive attribute.
Gain finds "patterns" in the training examples relating to attribute values and moves toward a locally minimal representation of them.

Training Examples: the PlayTennis table above (repeated).

Determine the Root Attribute
S: 9+, 5-, E = 0.940.
Humidity: High -> [3+,4-] (E = 0.985), Normal -> [6+,1-] (E = 0.592).
Wind: Weak -> [6+,2-] (E = 0.811), Strong -> [3+,3-] (E = 1.000).
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temp) = 0.029
Outlook gives the highest gain, so it becomes the root.
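
For concreteness, an illustrative check of these gains on the PlayTennis table, reusing entropy() and information_gain() from the sketches above (the third decimal differs slightly from the slide's rounding).

```python
# Reuses entropy() and information_gain() from the earlier sketches.
DATA = [  # (Outlook, Temp, Humidity, Wind, Tennis?)
    ("Sunny", "Hot", "High", "Weak", "No"),       ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),   ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Humidity": 2, "Wind": 3, "Temp": 1}

def counts(rows):
    """(#positive, #negative) class counts for a set of rows."""
    pos = sum(1 for r in rows if r[-1] == "Yes")
    return pos, len(rows) - pos

for name, col in ATTRS.items():
    children = [counts([r for r in DATA if r[col] == value])
                for value in set(r[col] for r in DATA)]
    print(name, round(information_gain(counts(DATA), children), 3))
# Outlook 0.247, Humidity 0.152, Wind 0.048, Temp 0.029 -> Outlook wins, matching the slide
```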

Sort the Training Examples
S = {D1, ..., D14}: 9+, 5-. Split on Outlook:
Sunny: {D1,D2,D8,D9,D11}, 2+, 3- -> ?
Overcast: {D3,D7,D12,D13}, 4+, 0- -> Yes
Rain: {D4,D5,D6,D10,D14}, 3+, 2- -> ?
For Ssunny = {D1,D2,D8,D9,D11}:
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temp) = 0.570
Gain(Ssunny, Wind) = 0.019

Final Decision Tree for Example
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Hypothesis Space Search (ID3)
The hypothesis space (all possible trees) is complete: the target function is included in it.

Hypothesis Space Search in Decision Trees
Conduct a search of the space of decision trees, which can represent all possible discrete functions.
Goal: find the best decision tree.
Finding a minimal decision tree consistent with a set of data is NP-hard.
So perform a greedy heuristic search: hill climbing without backtracking, with statistics-based decisions using all the data.

Hypothesis Space Search by ID3
The hypothesis space is complete: H is the space of all finite DTs (all discrete functions), so the target function is included.
Simple-to-complex hill-climbing search of H, using gain as the hill-climbing function.
Outputs a single hypothesis (which one?); it cannot assess all hypotheses consistent with D (there are usually many).
Analogy to breadth-first search: it examines all trees of a given depth and chooses the best.
No backtracking: locally optimal.
Statistics-based search choices: uses all training examples at each step, so it is robust to noisy data.

Restriction bias vs. Preference bias
Restriction bias (or language bias): an incomplete hypothesis space.
Preference (or search) bias: an incomplete search strategy.
Candidate Elimination has a restriction bias; ID3 has a preference bias.
In most cases, we have both a restriction and a preference bias.

Inductive Bias in ID3
Preference for short trees, and for trees that place high-information-gain attributes near the root.
Principle of Occam's razor: prefer the shortest hypothesis that fits the data.
Justification: there is a smaller likelihood of a short hypothesis fitting the data at random.
Problems: there are other ways to reduce random fits to the data, and the size of a hypothesis depends on the data representation. See the minimum description length principle.

Overfitting the Data
Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance:
- There may be noise in the training data that the tree is fitting
- The algorithm might be making decisions based on very little data
A hypothesis h is said to overfit the training data if there is another hypothesis h' such that h has smaller error than h' on the training data, but h has larger error than h' on the test data.
[Figure: accuracy on the training data vs. accuracy on the test data as a function of tree complexity]

Overfitting
[Figure: + and - instances in the Attribute 1 / Attribute 2 plane with one odd training example — a very deep tree is required to fit just that one example]

When to stop splitting further?
[Figure: the same instances — a very deep tree is required to fit just one odd training example]

Overfitting in Decision Trees
Consider adding a noisy training example (it should be +):
Day  Outlook  Temp  Humidity  Wind    Tennis?
D15  Sunny    Hot   Normal    Strong  No
What effect does it have on the earlier tree (Outlook at the root, Humidity under Sunny, Wind under Rain)?

Overfitting - Example
Noise or other coincidental regularities.
[Figure: the tree again — Outlook root; Sunny {1,2,8,9,11} (2+,3-) -> Humidity; Overcast {3,7,12,13} (4+,0-) -> Yes; Rain {4,5,6,10,14} (3+,2-) -> Wind]

Avoiding Overfitting
Two basic approaches:
- Prepruning: stop growing the tree at some point during construction, when it is determined that there is not enough data to make reliable choices.
- Postpruning: grow the full tree and then remove nodes that seem not to have sufficient evidence (more popular).
Methods for evaluating subtrees to prune:
- Cross-validation: reserve a hold-out set to evaluate utility (more popular).
- Statistical testing: test whether the observed regularity can be dismissed as likely to occur by chance.
- Minimum description length: is the additional complexity of the hypothesis smaller than remembering the exceptions?
This is related to the notion of regularization that we will see in other contexts: keep the hypothesis simple.

Reduced-Error Pruning
A post-pruning, cross-validation approach (sketched in code below):
- Partition the training data into a "grow" set and a "validation" set.
- Build a complete tree from the "grow" data.
- Until accuracy on the validation set decreases, do:
  For each non-leaf node in the tree, temporarily prune the subtree below it, replacing it by a majority vote, and test the accuracy of the resulting hypothesis on the validation set.
  Permanently prune the node whose removal gives the greatest increase in validation accuracy.
Problem: uses less data to construct the tree.
Sometimes done at the rules level instead.
General strategy: overfit, then simplify.
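
A rough illustrative sketch of reduced-error pruning for the dict-based trees produced by the id3() sketch earlier; the tree format and the classify()/majority() helpers are assumptions of these notes, not part of the original deck.

```python
import copy
from collections import Counter

def classify(tree, example):
    while isinstance(tree, dict):                      # descend until a leaf label
        tree = tree["children"].get(example[tree["attribute"]])
    return tree

def accuracy(tree, validation):
    return sum(classify(tree, e) == e["label"] for e in validation) / len(validation)

def majority(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def internal_nodes(tree, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_nodes(child, path + (value,))

def pruned_copy(tree, path, grow):
    """Copy of tree with the node at `path` replaced by a majority-vote leaf."""
    if not path:                                       # pruning the root: a single leaf
        return majority(grow)
    new_tree = copy.deepcopy(tree)
    node, subset = new_tree, grow
    for value in path[:-1]:                            # walk to the parent of the target
        subset = [e for e in subset if e[node["attribute"]] == value]
        node = node["children"][value]
    subset = [e for e in subset if e[node["attribute"]] == path[-1]]
    node["children"][path[-1]] = majority(subset) if subset else majority(grow)
    return new_tree

def reduced_error_prune(tree, grow, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p, grow) for p in internal_nodes(tree)]
        best_candidate = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best_candidate, validation) < best_acc:
            break                                      # stop when validation accuracy drops
        tree, best_acc = best_candidate, accuracy(best_candidate, validation)
    return tree
```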

Rule post-pruning
Allow the tree to grow until it best fits the data (allow overfitting).
Convert the tree to an equivalent set of rules, one rule per leaf node.
Prune each rule independently of the others, removing preconditions whenever doing so improves performance.
Sort the final rules into the desired sequence for use.

Example of rule post-pruning
IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ^ (Humidity = Normal) THEN PlayTennis = Yes
[Figure: the PlayTennis tree these rules are read from — Outlook root, Humidity under Sunny, Yes under Overcast, Wind under Rain]
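
Illustrative only: reading one rule per leaf off a dict-based tree in the id3() format used in the earlier sketches (the example tree and names are assumptions of these notes).

```python
def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, label) pairs, one per leaf."""
    if not isinstance(tree, dict):                     # leaf: emit a rule
        return [(list(conditions), tree)]
    rules = []
    for value, child in tree["children"].items():
        rules += tree_to_rules(child, conditions + ((tree["attribute"], value),))
    return rules

example_tree = {"attribute": "Outlook", "children": {
    "Sunny": {"attribute": "Humidity", "children": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attribute": "Wind", "children": {"Strong": "No", "Weak": "Yes"}},
}}
for conds, label in tree_to_rules(example_tree):
    body = " ^ ".join(f"({a} = {v})" for a, v in conds)
    print(f"IF {body} THEN PlayTennis = {label}")
```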

Extensions of the basic algorithm
- Continuous-valued attributes
- Attributes with many values
- Training examples with missing data
- Attributes with associated costs
- Other impurity measures
- Regression trees

Continuous Valued Attributes
Create a discrete attribute from a continuous variable, e.g. define a critical Temperature = 82.5.
Candidate thresholds are chosen by the gain function; there can be more than one threshold, typically where the labels change:
Temp:    40  48  60  72  80  90
Tennis?  N   N   Y   Y   Y   N
Candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85 (see the threshold-search sketch below).
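
An illustrative sketch of the threshold search: try the midpoints where the class label changes and keep the one with the highest gain. The values and labels are the Temperature example above; the function names are assumptions.

```python
import math

def entropy_of(labels):
    out = 0.0
    for label in set(labels):
        p = labels.count(label) / len(labels)
        out -= p * math.log2(p)
    return out

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    # candidate thresholds: midpoints between consecutive values where the label changes
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    def gain(t):
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        return entropy_of(labels) - (len(left) / len(pairs)) * entropy_of(left) \
                                  - (len(right) / len(pairs)) * entropy_of(right)
    return max(candidates, key=gain)

temps  = [40, 48, 60, 72, 80, 90]
tennis = ["N", "N", "Y", "Y", "Y", "N"]
print(best_threshold(temps, tennis))   # 54.0 — the (48+60)/2 candidate beats 85
```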

Attributes with Many Values
Problem: if an attribute has many values, Gain will tend to select it (why?).
E.g. a birthdate attribute with 365 possible values is likely to discriminate well on a small sample.
For a sample of fixed size n and an attribute with N values, as N -> infinity each branch receives only a few examples and becomes pure, so each child's -pi log2 pi term -> 0, the children's entropy -> 0, and the gain approaches its maximum value.

Attributes with many values
Problem: Gain will select the attribute with many values.
One approach: use GainRatio instead:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = - sum over values vi of A of (|Si|/|S|) log2(|Si|/|S|)
SplitInformation is the entropy of the partitioning itself, so it penalizes a higher number of partitions. Here Si is the subset of S for which A has value vi (e.g., if every |Si|/|S| = 1/N, then SplitInformation = log2 N).
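
A small illustrative sketch of SplitInformation and GainRatio as defined above; gain_ratio() reuses information_gain() from the earlier sketch, and the sizes are the |Si| of the partition.

```python
import math

def split_information(sizes):
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(parent, children):
    # children: list of (pos, neg) pairs, as in information_gain() above
    si = split_information([pos + neg for pos, neg in children])
    return information_gain(parent, children) / si if si else 0.0

# A uniform split into N parts has SplitInformation log2 N:
print(split_information([16, 16, 16, 16]))   # 2.0  (= log2 4)
```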

Unknown Attribute Values
What if some examples are missing values of attribute A? Use the training example anyway and sort it through the tree. If node n tests A:
- assign the most common value of A among the other examples sorted to node n, or
- assign the most common value of A among the other examples with the same target value, or
- assign probability pi to each possible value vi of A and pass a fraction pi of the example down each descendant.
Classify test instances with missing values in the same fashion. The fractional scheme is used in C4.5.

Attributes with Costs
Consider medical diagnosis: BloodTest has a cost of $150, Pulse a cost of $5. In robotics, Width-From-1ft costs 23 sec., Width-From-2ft costs 10 sec.
How do we learn a consistent tree with low expected cost? Replace Gain by a cost-sensitive measure, e.g.
Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)
Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w in [0, 1] determines the importance of cost.
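
The two cost-sensitive measures as tiny Python functions (the names, default w, and example numbers are illustrative, not a library API).

```python
def tan_schlimmer(gain, cost):
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    return (2 ** gain - 1) / (cost + 1) ** w

# Example: a slightly less informative but much cheaper attribute can win.
print(tan_schlimmer(0.15, 5), tan_schlimmer(0.25, 150))   # 0.0045 vs ~0.0004
print(nunez(0.15, 5), nunez(0.25, 150))
```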

Gini Index
Another sensible measure of impurity (i and j range over the classes):
Gini(S) = sum over i != j of p(i) p(j) = 1 - sum over i of p(i)^2
After splitting on attribute A, the resulting Gini index is
Gini(S, A) = sum over values v of A of (|Sv|/|S|) Gini(Sv)
Gini can be interpreted as an expected error rate: the probability of mislabeling an example if its class is drawn at random from the class distribution in S.
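
A short illustrative sketch of the Gini impurity and the post-split Gini defined above (class counts per node, helper names are assumptions).

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_after_split(children):
    """children: list of per-branch class-count lists."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

print(gini([10, 10]))                         # 0.5: maximally mixed two-class node
print(gini_after_split([[10, 0], [0, 10]]))   # 0.0: a perfect split
```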

Gini Index
[Figure: a set of objects with attributes color, border, and dot; the classification is triangle vs. square]

Gini Index for Color
[Figure: the objects split on Color? into red, green, and yellow groups]

Gain of Gini Index
Analogously to information gain, GiniGain(S, A) = Gini(S) - Gini(S, A): the reduction in Gini impurity obtained by splitting on A.

Three Impurity Measures

Regression Tree
Similar to classification, but uses a set of attributes to predict a value instead of a class label.
Instead of computing information gain, compute the sum of squared errors.
Partition the attribute space into a set of rectangular subspaces, each with its own predictor; the simplest predictor is a constant value.

Rectilinear Division
A regression tree is a piecewise constant function of the input attributes.
[Figure: the (X1, X2) plane cut by thresholds t1-t4 into rectangular regions r1-r5, alongside the corresponding tree of tests on X1 and X2]

Growing Regression Trees
To minimize the squared error on the learning sample, the prediction at a leaf is the average output of the learning cases reaching that leaf.
The impurity of a sample is defined by the variance of the output in that sample:
I(LS) = var{y | LS} = E{ (y - E{y | LS})^2 | LS }
The best split is the one that reduces the variance the most:
Delta I(LS, A) = var{y | LS} - sum over values a of A of (|LS_a| / |LS|) var{y | LS_a}
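
An illustrative sketch of the variance-reduction split for a single numeric attribute; the data values are made up for the example and the function names are assumptions of these notes.

```python
def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_variance_split(xs, ys):
    """Return (threshold, variance reduction) of the best binary split on x."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(len(pairs) - 1):
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        if not left or not right:
            continue
        # variance reduction = parent variance minus weighted child variances
        reduction = variance(ys) - (len(left) * variance(left)
                                    + len(right) * variance(right)) / len(ys)
        if best is None or reduction > best[1]:
            best = (threshold, reduction)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
print(best_variance_split(xs, ys))   # splits at x = 6.5 with a large variance reduction
```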

Regression Tree Pruning
Exactly the same algorithms apply: pre-pruning and post-pruning. In post-pruning, the tree that minimizes the squared error on the validation set (VS) is selected.
In practice, pruning is more important in regression because full trees are much more complex (often every object has a different output value, so the full tree has as many leaves as there are objects in the learning sample).

When Are Decision Trees Useful?
Advantages:
- Very fast: can handle very large datasets with many attributes
- Flexible: several attribute types, classification and regression problems, missing values...
- Interpretable: provide rules and attribute importance
Disadvantages:
- Instability of the trees (high variance)
- Not always competitive with other algorithms in terms of accuracy

History of Decision Tree Research
Hunt and colleagues in psychology used full-search decision tree methods to model human concept learning in the 1960s.
Quinlan developed ID3, with the information gain heuristic, in the late 1970s to learn expert systems from examples.
Breiman, Friedman, and colleagues in statistics developed CART (classification and regression trees) around the same time.
A variety of improvements followed in the 1980s: coping with noise, continuous attributes, missing data, non-axis-parallel splits, etc.
Quinlan's updated algorithm, C4.5 (1993), is commonly used (newer: C5).
Boosting (or bagging) over DTs is a good general-purpose algorithm.

Summary
Decision trees are practical for concept learning.
The basic information measure and gain function drive a best-first search of the space of DTs.
The ID3 procedure's search space is complete, with a preference for shorter trees.
Overfitting is an important issue, with various solutions.
Many variations and extensions are possible.

Software
In R: packages tree and rpart
C4.5: http://www.cse.unsw.edu.au/~quinlan
Weka: http://www.cs.waikato.ac.nz/ml/weka