Published by Donald Green. Modified over 9 years ago.
1
Decision Trees
Definition
Mechanism
Splitting Function
Issues in Decision-Tree Learning
Avoiding overfitting through pruning
Numeric and missing attributes
2
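Example of a Decision Tree
Example: learning to classify stars.
[Figure: decision tree that tests Luminosity against threshold T1 and Mass against threshold T2 (branches "> T1", "<= T1", "> T2", "<= T2"), with leaves Type A, Type B, and Type C.]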
3
Short vs Long Hypotheses
The top-down, greedy approach to constructing decision trees expresses a preference for short hypotheses over long hypotheses.
Why is this the right thing to do?
Occam’s Razor: prefer the simplest hypothesis that fits the data.
The principle dates back to William of Occam (around 1320) and is still the subject of a great debate in the philosophy of science.
4
Issues in Decision Tree Learning
Practical issues while building a decision tree can be enumerated as follows:
1) How deep should the tree be?
2) How do we handle continuous attributes?
3) What is a good splitting function?
4) What happens when attribute values are missing?
5) How do we improve the computational efficiency?
5
How deep should the tree be? Overfitting the Data
A tree overfits the data if we let it grow deep enough that it begins to capture “aberrations” in the data that harm its predictive power on unseen examples.
[Figure labels: size, humidity, t2, t3. Caption: the aberrant examples are possibly just noise, but the tree is grown larger to capture them.]
6
Overfitting the Data: Definition
Assume a hypothesis space H. We say a hypothesis h in H overfits a dataset D if there is another hypothesis h’ in H such that h has better classification accuracy than h’ on D but worse classification accuracy than h’ on D’, a set of unseen (testing) examples.
[Figure: accuracy (0.5 to 1.0) versus size of the tree, with one curve for training data and one for testing data, and the overfitting region marked.]
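The definition can be read as a simple check on two hypotheses. The following is a minimal sketch, assuming hypotheses are plain Python functions and datasets are lists of (attributes, label) pairs; the names `accuracy` and `overfits` are illustrative and do not come from the slides.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[tuple, str]          # (attribute values, class label)
Hypothesis = Callable[[tuple], str]  # maps attribute values to a class label


def accuracy(h: Hypothesis, data: Sequence[Example]) -> float:
    """Fraction of examples in `data` that h classifies correctly."""
    return sum(h(x) == y for x, y in data) / len(data)


def overfits(h: Hypothesis, h_alt: Hypothesis,
             train: Sequence[Example], unseen: Sequence[Example]) -> bool:
    """True if h beats h_alt on the training data but loses to it on unseen data."""
    return (accuracy(h, train) > accuracy(h_alt, train)
            and accuracy(h, unseen) < accuracy(h_alt, unseen))
```

A hypothesis that memorizes a noisy training example would typically pass this check against a simpler hypothesis that ignores the noise.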
7
Causes for Overfitting the Data
What causes a hypothesis to overfit the data?
1) Random errors or noise: examples have an incorrect class label or incorrect attribute values.
2) Coincidental patterns: by chance, examples seem to deviate from a pattern due to the small size of the sample.
Overfitting is a serious problem that can cause strong performance degradation.
8
Solutions for Overfitting the Data
There are two main classes of solutions:
1) Stop growing the tree early, before it begins to overfit the data.
   + In practice this solution is hard to implement because it is not clear what a good stopping point is.
2) Grow the tree until the algorithm stops, even if the overfitting problem shows up. Then prune the tree as a post-processing step.
   + This method has found great popularity in the machine learning community.
9
Decision Tree Pruning
1) Grow the tree to learn the training data.
2) Prune the tree to avoid overfitting the data.
10
Methods to Validate the New Tree
Training and Validation Set Approach
Divide dataset D into a training set TR and a validation set TE.
Build a decision tree on TR.
Test pruned trees on TE to decide the best final tree.
[Figure: dataset D divided into Training TR and Validation TE.]
11
Training and Validation
There are two approaches:
A. Reduced Error Pruning
B. Rule Post-Pruning
[Figure: dataset D divided into Training TR (normally 2/3 of D) and Validation TE (normally 1/3 of D).]
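The 2/3 and 1/3 division can be sketched as a shuffled split. This is only an illustration: the function name `split_train_validation`, the `seed` parameter, and the choice to shuffle before cutting are assumptions, not something the slides prescribe.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_validation(dataset: Sequence[T],
                           train_fraction: float = 2 / 3,
                           seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the dataset and cut it into (TR, TE)."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)        # fixed seed keeps the split reproducible
    cut = round(len(rows) * train_fraction)  # size of the training set TR
    return rows[:cut], rows[cut:]
```

Calling `split_train_validation(D)` on a dataset of nine examples would give six examples for TR and three for TE.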
12
Reduced Error Pruning
Main Idea:
1) Consider all internal nodes in the tree.
2) For each node, check whether removing it (along with the subtree below it) and assigning the most common class to it does not harm accuracy on the validation set.
3) Pick the node n* that yields the best performance and prune its subtree.
4) Go back to (2) until no more improvements are possible.
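The four steps above can be sketched in a few lines. The `Node` class below is a hypothetical tree representation chosen for the example; it assumes every node already stores the most common training class that reached it (`majority`). Ties are resolved in favor of pruning, which is one reading of "does not harm accuracy".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Example = Tuple[Dict[str, str], str]  # (attribute values, class label)


@dataclass
class Node:
    majority: str                     # most common training class at this node
    attribute: Optional[str] = None   # attribute tested here; None marks a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)


def predict(node: Node, x: Dict[str, str]) -> str:
    while node.attribute is not None:
        child = node.children.get(x.get(node.attribute))
        if child is None:             # unseen value: fall back to this node's majority
            break
        node = child
    return node.majority


def accuracy(root: Node, data: List[Example]) -> float:
    return sum(predict(root, x) == y for x, y in data) / len(data)


def internal_nodes(node: Node) -> List[Node]:
    if node.attribute is None:
        return []
    nodes = [node]
    for child in node.children.values():
        nodes.extend(internal_nodes(child))
    return nodes


def reduced_error_prune(root: Node, validation: List[Example]) -> Node:
    while True:
        best_node, best_acc = None, accuracy(root, validation)
        for node in internal_nodes(root):
            saved = (node.attribute, node.children)
            node.attribute, node.children = None, {}   # tentatively turn into a leaf
            acc = accuracy(root, validation)
            node.attribute, node.children = saved      # undo the tentative pruning
            if acc >= best_acc:                        # pruning here does not hurt
                best_node, best_acc = node, acc
        if best_node is None:                          # every pruning would hurt: stop
            return root
        best_node.attribute, best_node.children = None, {}
```

The loop always terminates because each pass either removes at least one internal node or returns.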
13
Example
[Figures: the original tree, followed by the possible trees after pruning.]
14
Example
[Figures: the pruned tree, followed by the possible trees after the 2nd pruning.]
15
Example
The process continues until no improvement is observed on the validation set.
[Figure: accuracy (0.5 to 1.0) on the validation data versus size of the tree, with the point marked "Stop pruning the tree".]
16
Reduced Error Pruning
Disadvantages: if the original dataset is small, setting examples aside for validation may leave you with few examples for training.
[Figure: a small dataset D divided into Training TR and Testing TE. Caption: the training set is too small and so is the validation set.]
17
Rule Post-Pruning
Main Idea:
1) Convert the tree into a rule-based system.
2) Prune every single rule first by removing redundant conditions.
3) Sort rules by accuracy.
18
Example
[Figure: original tree with internal nodes x1, x2, and x3, branches labeled 0 and 1, and leaves A, B, A, C.]
Rules:
~x1 & ~x2 -> Class A
~x1 & x2 -> Class B
x1 & ~x3 -> Class A
x1 & x3 -> Class C
Possible rules after pruning (based on validation set):
~x1 -> Class A
~x1 & x2 -> Class B
~x3 -> Class A
x1 & x3 -> Class C
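One way to prune rules like these is to drop a condition whenever the shorter rule is at least as accurate on the validation set, then sort the surviving rules by accuracy. The sketch below assumes that greedy criterion and a plain accuracy estimate over the examples a rule covers; the slides do not specify either, and all names here are illustrative.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, int]                 # (attribute, required value), e.g. ("x1", 0)
Rule = Tuple[Tuple[Condition, ...], str]    # (preconditions, class label)
Example = Tuple[Dict[str, int], str]        # (attribute values, class label)


def rule_accuracy(rule: Rule, data: List[Example]) -> float:
    """Accuracy of the rule over the examples whose attributes satisfy it."""
    conditions, label = rule
    covered = [y for x, y in data if all(x[a] == v for a, v in conditions)]
    if not covered:                         # a rule that covers nothing scores 0
        return 0.0
    return sum(y == label for y in covered) / len(covered)


def prune_rule(rule: Rule, validation: List[Example]) -> Rule:
    """Greedily drop conditions while validation accuracy does not decrease."""
    conditions, label = rule
    improved = True
    while improved and conditions:
        improved = False
        current = rule_accuracy((conditions, label), validation)
        for c in conditions:
            shorter = tuple(k for k in conditions if k != c)
            if rule_accuracy((shorter, label), validation) >= current:
                conditions, improved = shorter, True
                break
    return conditions, label


def post_prune(rules: List[Rule], validation: List[Example]) -> List[Rule]:
    """Prune each rule independently, then sort by validation accuracy."""
    pruned = [prune_rule(r, validation) for r in rules]
    return sorted(pruned, key=lambda r: rule_accuracy(r, validation), reverse=True)
```

With a suitable validation set, the rule `~x1 & ~x2 -> Class A` would shrink to `~x1 -> Class A`, as in the example. Because each rule is pruned on its own, conditions can be removed from one path without touching the others, which a subtree-based pruning cannot do.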
19
Advantages of Rule Post-Pruning
The language is more expressive.
It improves interpretability.
Pruning is more flexible.
In practice this method yields high accuracy.
20
Decision Trees
Definition
Mechanism
Splitting Functions
Issues in Decision-Tree Learning
Avoiding overfitting through pruning
Numeric and missing attributes
21
Discretizing Continuous Attributes
Example: attribute temperature.
1) Order all values in the training set.
2) Consider only those cut points where there is a change of class.
3) Choose the cut point that maximizes information gain.
temperature: 97, 97.5, 97.6, 97.8, 98.5, 99.0, 99.2, 100, 102.2, 102.6, 103.2
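The three steps can be sketched as follows. The slide lists temperature values but not their classes, so any labels passed to these functions are made up by the caller. Placing each cut at the midpoint between two adjacent values is a common convention assumed here, not something stated on the slide.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def candidate_cut_points(values: Sequence[float], labels: Sequence[str]) -> List[float]:
    """Midpoints between adjacent sorted values whose classes differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, ya), (b, yb) in zip(pairs, pairs[1:])
            if ya != yb and a != b]


def information_gain(values: Sequence[float], labels: Sequence[str], cut: float) -> float:
    """Entropy reduction from splitting the examples into <= cut and > cut."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_cut_point(values: Sequence[float], labels: Sequence[str]) -> float:
    return max(candidate_cut_points(values, labels),
               key=lambda cut: information_gain(values, labels, cut))
```

Restricting the search to class-change boundaries keeps the number of candidates small: with eleven values there are ten gaps, but only the gaps where the label flips need to be scored.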
22
Claude Shannon, 1916 – 2001
Founded information theory in 1948 with his paper “A Mathematical Theory of Communication”.
Awarded the Alfred Noble Prize for his master’s thesis.
Worked at MIT and Bell Labs.
Met with Alan Turing, Marvin Minsky, John von Neumann, and Albert Einstein.
Creator of the “Ultimate Machine”.
23
Missing Attribute Values
We are at a node n in the decision tree, and an example X has no value for the attribute tested at n.
Different approaches:
1) Assign the most common value for that attribute in node n.
2) Assign the most common value in n among examples with the same classification as X.
3) Assign a probability to each value of the attribute based on the frequency of those values in node n. Each fraction is propagated down the tree.
Example: X = (luminosity > T1, mass = ?)
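The three approaches can each be sketched as a small helper over the examples that reach node n. The function names and the use of `None` to mark a missing value are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Example = Tuple[Dict[str, Optional[str]], str]  # (attribute values, class label)


def _known_values(examples: List[Example], attribute: str) -> Counter:
    """Count the observed (non-missing) values of the attribute."""
    return Counter(x[attribute] for x, _ in examples if x.get(attribute) is not None)


def most_common_value(examples: List[Example], attribute: str) -> str:
    """Approach 1: the most common value of the attribute at node n."""
    return _known_values(examples, attribute).most_common(1)[0][0]


def most_common_value_same_class(examples: List[Example], attribute: str,
                                 label: str) -> str:
    """Approach 2: the most common value among examples sharing X's class."""
    same_class = [(x, y) for x, y in examples if y == label]
    return _known_values(same_class, attribute).most_common(1)[0][0]


def value_distribution(examples: List[Example], attribute: str) -> Dict[str, float]:
    """Approach 3: probability of each value, from its frequency at node n."""
    counts = _known_values(examples, attribute)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}
```

For approach 3, the example X would be sent down every branch of the node with a weight equal to the probability of that branch's value, and the weights would be combined at the leaves.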
24
Summary
Decision-tree induction is a popular approach to classification that enables us to interpret the output hypothesis.
The hypothesis space is very powerful: all possible DNF formulas.
We prefer shorter trees to larger trees.
Overfitting is an important issue in decision-tree induction. Different methods exist to avoid overfitting, such as reduced-error pruning and rule post-pruning.
Techniques exist to deal with continuous attributes and missing attribute values.