1
Decision Trees II CSC 600: Data Mining Class 9
2
Today
Decision Trees
Overfitting / Underfitting
Bias-Variance Tradeoff
Pruning
Regression Trees
Ensemble Learners: Boosting, Bagging, Random Forests
3
Training Set vs. Test Set
Overall dataset is divided into:
Training set – used to build the model
Test set – used to evaluate the model
(Sometimes a Validation set is also used; more on this later.)
4
Problem: error rates on the training set vs. the testing set might be drastically different. There is no guarantee that the model with the smallest training error rate will have the smallest testing error rate.
5
Overfitting: occurs when the model “memorizes” the training set data:
very low error rate on the training data
yet a high error rate on the test data
The model does not generalize to the overall problem. This is bad! We wish to avoid overfitting.
6
Bias and Variance
Bias: the error introduced by modeling a real-life problem (usually extremely complicated) by a much simpler problem.
The more flexible (complex) a method is, the less bias it will generally have.
Variance: how much the learned model would change if the training set were different.
Does changing a few observations in the training set dramatically affect the model?
Generally, the more flexible (complex) a method is, the more variance it has.
7
Data points (observations) are created by: Y = f(X) + ε
Example: we wish to build a model that separates the dark-colored points from the light-colored points.
The black line is a simple, linear model. Currently there is some classification error.
Low variance, but bias is present.
8
A more complex model (a curvy line instead of a linear one):
Zero classification error for these data points.
No linear-model bias.
But higher variance?
9
More data has been added. Re-train both models (the linear line and the curvy line) to minimize the error rate.
Variance: the linear model doesn't change much; the curvy line changes significantly.
Which model is better?
11
Model Overfitting
Errors committed by a classification model are generally divided into:
Training errors: misclassifications on training set records
Generalization errors (testing errors): errors made on the testing set / previously unseen instances
A good model has low training error and low generalization error.
Overfitting: the model has a low training error rate, but high generalization error.
12
Model Underfitting and Overfitting
When the tree is small: underfitting
Large training error rate and large testing error rate; the structure of the data hasn't been learned yet.
When the tree gets too large: beware of overfitting
The training error rate decreases while the testing error rate increases; the tree is too complex.
The tree “almost perfectly fits” the training data, but doesn't generalize to testing examples.
13
Reasons for Overfitting
Presence of noise
Lack of representative samples
14
Training Set (table): Body Temp., Gives Birth, 4-legged, Hibernates, and class label Mammal? for Porcupine, Cat, Bat, Whale, Salamander, Komodo Dragon, Python, Salmon, Eagle, and Guppy.
Two training records are mislabeled. The tree perfectly fits the training data.
15
Testing Set (table): Body Temp., Gives Birth, 4-legged, Hibernates, and class label Mammal? for Human, Pigeon, Elephant, Leopard Shark, Turtle, Penguin, Eel, Dolphin, Spiny Anteater, and Gila Monster.
Test error rate: 30%
16
Testing Set, test error rate: 30%. Reasons for misclassifications:
Mislabeled records in the training data
“Exceptional cases”: unavoidable; they establish the minimal error rate achievable by any classifier
17
Complex tree: training error rate 0%, test error rate 30% (overfitting).
Simpler tree: training error rate 20%, test error rate 10%.
18
Overfitting and Decision Trees
The likelihood of overfitting increases as a tree gets deeper, because the resulting classifications are based on smaller and smaller subsets of the full training dataset. Overfitting often involves splitting the data on an irrelevant feature.
19
Pruning: Handling Overfitting in Decision Trees
Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set used to induce it. Pruning will result in decision trees that are not consistent with the training set used to build them, but we are more interested in creating prediction models that generalize well to new data! Two approaches:
Pre-pruning (early stopping)
Post-pruning
20
Pre-pruning Techniques
Stop creating subtrees when:
the number of instances in a partition falls below a threshold;
the information gain measured at a node is not deemed sufficient to make partitioning the data worthwhile;
the depth of the tree goes beyond a predefined limit;
… or other, more advanced criteria are met.
Benefits: computationally efficient; works well for small datasets.
Downsides: stopping too early may fail to create the most effective trees.
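As a concrete illustration (not from the slides), these pre-pruning criteria map onto hyperparameters of a typical decision tree library. The sketch below assumes scikit-learn's DecisionTreeClassifier and a synthetic dataset; the threshold values are illustrative only.

```python
# Minimal pre-pruning sketch, assuming scikit-learn; thresholds are illustrative.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pre_pruned_tree = DecisionTreeClassifier(
    max_depth=5,                  # stop when the tree goes beyond a predefined depth
    min_samples_split=20,         # stop when a partition has too few instances
    min_impurity_decrease=0.01,   # stop when the gain from a split is too small
    random_state=0,
)
pre_pruned_tree.fit(X, y)
print(pre_pruned_tree.get_depth(), pre_pruned_tree.get_n_leaves())
```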
21
Post-pruning: the decision tree is initially grown to its maximum size. Then each branch is examined, and branches that are deemed likely to be due to overfitting are pruned.
Post-pruning tends to give better results than pre-pruning.
Which is faster? Post-pruning is more computationally expensive than pre-pruning because the entire tree must be grown first.
22
Reduced Error Pruning: starting at the leaves, each node is replaced with its most popular class. If accuracy is not affected, the change is kept.
Accuracy is evaluated on a validation set: set aside some of the training set as a validation set.
Advantages: simplicity and speed.
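A minimal sketch of reduced error pruning over a toy tree structure (not the course's implementation); the Node class, the record format, and the validation-set representation are all assumptions for illustration.

```python
# Reduced error pruning sketch: bottom-up, replace a subtree by a leaf predicting
# its majority class whenever validation accuracy does not get worse.
class Node:
    def __init__(self, feature=None, children=None, majority=None):
        self.feature = feature           # feature tested at this node (None for a leaf)
        self.children = children or {}   # feature value -> child Node
        self.majority = majority         # most popular class among training records here

    def is_leaf(self):
        return not self.children

def predict(node, record):
    while not node.is_leaf():
        child = node.children.get(record.get(node.feature))
        if child is None:                # unseen feature value: fall back to majority
            return node.majority
        node = child
    return node.majority

def accuracy(root, validation):
    # validation is a list of (record_dict, true_class) pairs
    return sum(predict(root, r) == c for r, c in validation) / len(validation)

def reduced_error_prune(root, node, validation):
    for child in node.children.values():         # prune children first (start at leaves)
        reduced_error_prune(root, child, validation)
    if node.is_leaf():
        return
    before = accuracy(root, validation)
    saved_children, node.children = node.children, {}   # tentatively turn node into a leaf
    if accuracy(root, validation) < before:              # accuracy hurt: undo the pruning
        node.children = saved_children
```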
23
Post-Pruning Example
Example validation set:
Induced decision tree from the training data. Do we need to prune?
24
Post-Pruning Example Pruned nodes in black
25
Occam’s Razor
General Principle (Occam’s Razor): given two models with the same generalization (testing) error, the simpler model is preferred over the more complex model. Additional components in a more complex model have a greater chance of being fitted purely by chance. The principle is attributed to the philosopher William of Ockham.
26
Advantages of Pruning
Smaller trees are easier to interpret
Increased generalization accuracy
27
Regression Trees
Target attribute:
Decision (classification) trees: qualitative
Regression trees: continuous
Decision trees reduce the entropy in each subtree; regression trees reduce the variance in each subtree.
Idea: adapt the ID3 algorithm's measure of information gain to use variance as the node impurity rather than entropy.
28
Regression Tree Splits
Classification trees vs. regression trees:
Gain measures the “goodness of the split”; a larger gain means a better split (a better test condition):
Gain = I(parent) − Σ_{j=1..k} [ N(n_j) / N ] · I(n_j)
where I(n) is the impurity measure at node n, k is the number of attribute values, N(n_j) is the total number of records at child node n_j, and N is the total number of records at the parent node.
For classification trees, I(n) is an impurity measure such as entropy; for regression trees, I(n) is the variance of the target values at node n. Select the feature to split on that minimizes the weighted variance across all resulting partitions.
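A small sketch (assumed, not from the slides) of choosing a numeric split point for a regression tree by minimizing the weighted variance of the two resulting partitions, i.e., maximizing the variance-reduction gain defined above.

```python
# Choose the threshold on a numeric feature that minimizes the weighted variance
# of the target across the two resulting partitions (equivalently, maximizes gain).
def variance(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_variance_split(x, y):
    n = len(y)
    parent_impurity = variance(y)
    best = None
    xs = sorted(set(x))
    # candidate thresholds: midpoints between consecutive distinct x values
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
        gain = parent_impurity - weighted      # larger gain => better split
        if best is None or gain > best[0]:
            best = (gain, t)
    return best  # (gain, threshold)
```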
29
Need to watch out for Overfitting
Want to avoid overfitting: use an early stopping criterion, e.g., stop partitioning the dataset if the number of training instances is less than some threshold (such as 5% of the dataset).
30
Example
31
Advantages and Disadvantages of Trees
Advantages:
Trees are very easy to explain (easier than linear regression).
Trees can be displayed graphically and interpreted by a non-expert.
Decision trees may more closely mirror human decision-making.
Trees can easily handle qualitative predictors (no dummy variables needed).
Disadvantages:
Trees usually do not have the same level of predictive accuracy as other data mining algorithms.
But the predictive performance of decision trees can be improved by aggregating trees. Techniques: bagging, boosting, random forests.
32
Sampling: a common approach for selecting a subset of data objects to be analyzed; select only some instances for the training set, instead of all of them.
Sampling without replacement: as each item is sampled, it is removed from the population.
Sampling with replacement: the same object/instance can be picked more than once.
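To make the two sampling schemes concrete, here is a small sketch using Python's standard library; the population values are made up for illustration.

```python
import random

population = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
random.seed(0)

# Sampling without replacement: each item can appear at most once.
without_replacement = random.sample(population, k=6)

# Sampling with replacement: the same item can be picked more than once.
with_replacement = random.choices(population, k=6)

print(without_replacement)
print(with_replacement)
```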
33
Ensemble Methods
So far we have used one single classifier, induced from the training data, as our model to predict the class of a test instance. What if we used multiple decision trees?
Motivation: a committee of experts working together is likely to solve a problem better than a single expert.
But no “group think”: each model should make predictions independently of the other models in the ensemble.
In practice these methods work surprisingly well and usually greatly improve decision tree accuracy. (They can also be applied to other learning algorithms.)
34
Ensemble Characteristics
Build multiple models from the same training data by creating each model on a modified version of the training data. Make a final, ensemble prediction by aggregating the predictions of the individual models.
Classification prediction: let each model vote on the class; assign the class with the most votes.
Regression prediction: use a measure of central tendency (mean or median) of the individual predictions.
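A tiny sketch (assumed, not from the slides) of the two aggregation rules: majority vote for classification and the mean for regression.

```python
from collections import Counter
from statistics import mean

def majority_vote(class_predictions):
    """Return the class predicted most often by the individual models."""
    return Counter(class_predictions).most_common(1)[0][0]

def aggregate_regression(numeric_predictions):
    """Return a central tendency (here the mean) of the individual predictions."""
    return mean(numeric_predictions)

print(majority_vote(["mammal", "mammal", "not mammal"]))  # -> 'mammal'
print(aggregate_regression([2.0, 3.0, 4.0]))              # -> 3.0
```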
35
Rationale: how can an ensemble method improve a classifier's performance? Assume we have 25 binary classifiers, each with error rate ε = 0.35. If all 25 classifiers are identical, they will vote the same way on every test instance, so the ensemble error rate remains ε = 0.35.
36
Rationale (continued): now assume the 25 classifiers (each with ε = 0.35) are independent, i.e., their errors are uncorrelated. The ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly:
e_ensemble = Σ_{i=13..25} C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
6% is much less than 35%.
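The 6% figure can be checked directly: with 25 independent classifiers, each wrong with probability 0.35, the ensemble errs only when 13 or more of them are wrong.

```python
from math import comb

eps, n = 0.35, 25
ensemble_error = sum(
    comb(n, i) * eps**i * (1 - eps) ** (n - i)
    for i in range(13, n + 1)   # more than half of the 25 classifiers are wrong
)
print(round(ensemble_error, 3))  # approximately 0.06
```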
37
In practice it is difficult to have base classifiers that are completely independent, but ensemble methods have been shown to improve classification accuracy even when there is some correlation.
Conditions necessary for an ensemble classifier to perform better than a single classifier:
The base classifiers should be independent of each other.
The base classifiers should not do worse than random guessing (e.g., for a two-class problem, base classifier error rate ε < 0.5).
38
Methods for Constructing an Ensemble Classifier
39
Decision Tree Model Ensembles
Boosting Bagging Random Forests
40
Boosting
Boosting works by iteratively creating models and adding them to the ensemble; iteration stops when a predefined number of models has been added. Each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified (via a weighted dataset).
41
General Boosting Algorithm
Initially, instances are assigned weights of 1/N, so each is equally likely to be chosen for the sample. A sample D_i is drawn with replacement, and a classifier is induced on D_i. The weights of the training examples are then updated:
instances classified incorrectly have their weights increased;
instances classified correctly have their weights decreased.
42
General Boosting Algorithm
During each iteration the algorithm:
Induces a model and calculates the total error ε by summing the weights of the training instances for which the model's predictions are incorrect.
Increases the weights of the misclassified instances and decreases the weights of the correctly classified instances.
Calculates a confidence factor α for the model, such that α increases as ε decreases (see the sketch below).
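As a concrete, hedged illustration of this scheme, the sketch below follows the AdaBoost instantiation, where α = ½·ln((1 − ε)/ε). It assumes scikit-learn decision stumps as base classifiers and placeholder arrays X, y with labels in {−1, +1}; the slides describe the general scheme, not necessarily this exact variant.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, rounds=10):
    """AdaBoost-style boosting sketch; y must contain labels -1 and +1."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)              # initial instance weights: 1/N
    models, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()          # total error: sum of weights of mistakes
        if eps >= 0.5:                    # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))  # confidence grows as eps shrinks
        w = w * np.exp(-alpha * y * pred)    # raise weights of mistakes, lower the rest
        w = w / w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boosted_predict(models, alphas, X):
    # Weighted aggregate of the individual models' predictions.
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```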
43
Boosting Example: instances sampled in each round (by instance number):
Boosting (Round 1): 7, 3, 2, 8, 9, 4, 10, 6
Boosting (Round 2): 5, 4, 9, 2, 1, 7
Boosting (Round 3): 4, 8, 10, 5, 6, 3
Suppose that instance #4 is hard to classify. Its weight will be increased in future iterations as it gets misclassified repeatedly. Examples not chosen in a previous round (instances #1 and #5) also have a better chance of being selected in the next round. Why? Predictions on them in the previous round are likely to be wrong, since the model wasn't trained on them. As boosting rounds proceed, the instances that are hardest to classify become even more prevalent.
44
Prediction: once the set of models has been created, the ensemble makes predictions using a weighted aggregate of the predictions made by the individual models. The weights used in this aggregation are simply the confidence factors associated with each model.
45
Boosting Algorithms: several different boosting algorithms exist. They differ in:
how the weights of the training instances are updated after each boosting round;
how the predictions made by the individual classifiers are combined.
Each boosting round produces one base classifier.
46
Bagging (also called bootstrap aggregating): an ensemble method that “manipulates the training set.”
Action: repeatedly sample with replacement according to a uniform probability distribution. Every instance has an equal chance of being picked; some instances may be picked multiple times while others may not be chosen at all.
Sample size: same as the training set. D_i denotes each bootstrap sample.
The probability of a particular instance being selected for D_i is 1 − (1 − 1/N)^N, which converges to 1 − 1/e ≈ 0.632. On average, each D_i will therefore contain about 63% of the original training data.
Consequently, every bootstrap sample will be missing some of the instances from the dataset, so each bootstrap sample will be different, and models trained on different bootstrap samples will also be different.
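The 63% figure follows directly from the formula above; a quick check (the sample sizes are chosen only for illustration):

```python
# Probability that a given instance appears in a bootstrap sample of size N.
for N in (10, 100, 10_000):
    print(N, 1 - (1 - 1 / N) ** N)   # approaches 1 - 1/e ≈ 0.632 as N grows
```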
47
Bagging Algorithm
Model generation:
Let n be the number of instances in the training data.
For each of t iterations:
  Sample n instances with replacement from the training data.
  Apply the learning algorithm to the sample.
  Store the resulting model.
Classification:
For each of the t models, predict the class of the instance using that model.
Return the class that has been predicted most often.
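A runnable sketch of this model-generation / classification loop (an assumption for illustration, using scikit-learn trees as the base learner and placeholder arrays X, y):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, random_state=0):
    """Train t trees, each on a bootstrap sample the same size as the training set."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)           # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Return, for each instance, the class predicted most often across the models."""
    votes = np.array([m.predict(X) for m in models])   # shape: (t, n_instances)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```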
48
Bagging Example
Decision stump: a one-level binary decision tree. We are now going to apply bagging and create many decision-stump base classifiers.
Dataset: 10 instances, predictor variable x, target variable y:
x: 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
y: 1 1 1 -1 -1 -1 -1 1 1 1
What is the best performance of a decision stump on this data? The splitting condition will be x <= k, where k is the split point. Best splits: x <= 0.35 or x <= 0.75. Best accuracy: 70%.
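A brute-force check of the 70% claim (a sketch; the y values are taken from the textbook version of this example in Tan et al.):

```python
# Exhaustively try every stump of the form "y = left_label if x <= k else -left_label".
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]   # targets as in the textbook example

best = (0.0, None, None)
for k in [0.05] + [(a + b) / 2 for a, b in zip(x, x[1:])] + [1.05]:
    for left_label in (1, -1):
        preds = [left_label if xi <= k else -left_label for xi in x]
        acc = sum(p == t for p, t in zip(preds, y)) / len(y)
        if acc > best[0]:
            best = (acc, k, left_label)

print(best)   # best accuracy 0.7, e.g. at k = 0.35 predicting y = 1 on the left
```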
49
Bagging Example: first choose how many “bagging rounds” to perform (chosen by the analyst). We'll do 10 bagging rounds in this example:
50
Bagging Example: in each round, create D_i by sampling with replacement, then learn a decision stump on D_i.
Round 1 bootstrap sample contains the x values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9 (some drawn more than once).
What stump will be learned? If x <= 0.35 then y = 1; if x > 0.35 then y = -1.
51
10 bagging rounds, 10 bootstrap samples D_i, 10 learned models. The stumps learned across the rounds were:
If x <= 0.35 then y = 1; if x > 0.35 then y = -1
If x <= 0.65 then y = 1; if x > 0.65 then y = -1
If x <= 0.35 then y = 1; if x > 0.35 then y = -1
If x <= 0.3 then y = 1; if x > 0.3 then y = -1
If x <= 0.35 then y = 1; if x > 0.35 then y = -1
If x <= 0.75 then y = -1; if x > 0.75 then y = 1
If x <= 0.75 then y = -1; if x > 0.75 then y = 1
If x <= 0.75 then y = -1; if x > 0.75 then y = 1
If x <= 0.75 then y = -1; if x > 0.75 then y = 1
If x <= 0.05 then y = 1; if x > 0.05 then y = -1
(Each stump was learned from its own bootstrap sample D_i.)
52
Classify each test instance by running every base classifier and taking a majority vote.
Bagging Example: 10 test instances (x = 0.1, 0.2, ..., 1.0). Each of the 10 stumps votes on each test instance; summing the votes and taking the sign of the sum classifies all 10 instances correctly. 100% overall ensemble accuracy, an improvement from the 70% achievable by a single stump.
53
Bagging Summary
In the previous example, even though the base classifier was a decision stump (depth 1), bagging aggregated the classifiers to effectively learn a decision tree of depth 2.
Bagging helps to reduce variance. Because bagging does not focus on any particular instance of the training data, it is less susceptible to overfitting noisy data and is robust to minor perturbations in the training set.
If the base classifiers are stable (not much variance), bagging can degrade performance.
Each bootstrap sample omits roughly 37% of the distinct instances in the original training data (even though it has the same total size).
54
Bagging vs. Boosting
Boosting samples with a non-uniform distribution, unlike bagging, where each instance has an equal chance of being selected.
Boosting motivation: focus on the instances that are harder to classify. How: give harder instances more weight in future rounds.
55
Random Forests: designed for decision tree classifiers (whereas bagging and boosting can be applied to other learning algorithms). An example of “manipulating the input features.”
Idea: combine predictions made by multiple decision trees, where each tree is induced based on an independent set of random vectors. The random vectors are generated from a fixed probability distribution (unlike in boosting, where the distribution adapts).
56
Random Forests
57
Random Forests Algorithm
Different variations of random forests exist. “Normal” decision tree induction uses the full set of features to determine each split; a random forest injects randomness by selecting a random subset of the features to determine each split.
How large should the subset be? Typically √K (for K features) works well for classification, and K/3 for regression.
Each tree is grown to full size without pruning. The overall prediction is the majority vote of all the individually trained trees.
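In a library setting this recipe corresponds to something like the following (a hedged sketch assuming scikit-learn; the dataset and parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown on a bootstrap sample
    max_features="sqrt",   # random subset of ~sqrt(K) features considered at each split
    max_depth=None,        # trees grown to full size, no pruning
    random_state=0,
).fit(X, y)

print(forest.score(X, y))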
58
Ensemble Method Performance on Classic Data Sets (Tan et al., Table 5.5)
59
Which approach should we use? Bagging is simpler to implement and parallelize than boosting, so it may be better with respect to ease of use and training time. Empirical results indicate:
boosted decision tree ensembles were the best-performing model tested for datasets containing up to 4,000 descriptive features;
random forest ensembles (based on bagging) performed better for datasets containing more than 4,000 features.
60
Ensemble Methods Summary
Advantages: astonishingly good performance; models human behavior, i.e., making a judgment based on many trusted advisors, each with their own specialty.
Disadvantage: the combined model is rather hard to analyze (tens or hundreds of individual models). Current research aims at making such models more comprehensible.
61
Comparison of Decision Tree Algorithms
C4.5 (and its proprietary successor C5.0): extension of the ID3 algorithm to handle continuous and categorical features and missing values; sorts continuous attributes at each node; uses post-pruning to mitigate overfitting; uses information gain / gain ratio to select the optimal split.
J48: open-source implementation of the C4.5 algorithm.
CART (Classification and Regression Trees): restricted to binary splits (less bushy trees); uses the Gini index.
Performing evaluation experiments using different models and variants helps determine which will work best for a specific problem.
62
References
Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
Data Science from Scratch, 1st edition, Grus
Data Mining and Business Analytics with R, 1st edition, Ledolter
An Introduction to Statistical Learning, 1st edition, James et al.
Discovering Knowledge in Data, 2nd edition, Larose et al.
Introduction to Data Mining, 1st edition, Tan et al.