Classification Modeling: Decision Tree (Part 2)
By Subhasis Dasgupta, Asst. Professor, Praxis Business School, Kolkata
Pruning a Tree
We will look into the following tree pruning methods:
❏ Cost complexity pruning
❏ Minimum error pruning
❏ Error based pruning
❏ Minimum description length pruning
❏ Optimal pruning (using a dynamic programming approach)
Cost Complexity Pruning
❏ Cost complexity pruning proceeds in two stages.
❏ In the first stage, a sequence of trees $T_0, T_1, T_2, \dots, T_k$ is generated, where $T_0$ is the fully grown tree and $T_k$ is the root tree.
❏ In the second stage, one of these trees is selected as the pruned tree based on its estimated generalization error.
❏ The tree $T_{i+1}$ is obtained by replacing one or more of the sub-trees in the predecessor tree $T_i$ with suitable leaves.
Cost Complexity Pruning
❏ The sub-trees that are pruned are those that yield the lowest increase in apparent error rate per pruned leaf:
$$\alpha = \frac{\varepsilon(\mathrm{pruned}(T,t),S) - \varepsilon(T,S)}{|\mathrm{leaves}(T)| - |\mathrm{leaves}(\mathrm{pruned}(T,t))|}$$
where $\varepsilon(T,S)$ indicates the error rate of the tree $T$ over the sample $S$, $|\mathrm{leaves}(T)|$ denotes the number of leaves in $T$, and $\mathrm{pruned}(T,t)$ denotes the tree obtained by replacing the node $t$ in $T$ with a suitable leaf.
❏ In the second phase, the generalization error of each pruned tree $T_0, T_1, \dots, T_k$ is estimated.
❏ The tree with the lowest generalization error is returned as the final pruned tree.
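The selection rule of the first stage can be illustrated with a small base-R sketch. All numbers below (error rates, leaf counts, and the node labels t1 to t3) are made up for illustration; they do not come from any dataset in these slides.

```r
# Hypothetical current tree T_i: apparent error rate and number of leaves
err_full    <- 0.10
leaves_full <- 8

# Hypothetical candidates: error rate and leaf count after pruning at node t
candidates <- data.frame(
  node          = c("t1", "t2", "t3"),
  err_pruned    = c(0.12, 0.11, 0.18),  # error rate of pruned(T_i, t)
  leaves_pruned = c(6, 5, 3)            # |leaves(pruned(T_i, t))|
)

# Increase in apparent error rate per pruned leaf
candidates$alpha <- (candidates$err_pruned - err_full) /
  (leaves_full - candidates$leaves_pruned)

# The node with the smallest alpha is the one pruned to obtain T_{i+1}
weakest <- as.character(candidates$node[which.min(candidates$alpha)])
print(candidates)
print(weakest)
```

Repeating this step on each successive tree until only the root remains would yield the sequence of trees that the second stage chooses from.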
Minimum Error Pruning
❏ MEP was proposed to obtain a single tree with the minimum expected error rate when classifying a set of examples into different classes.
❏ The expected error rate $E_k$ at a node is defined as
$$E_k = \frac{n - n_c + k - 1}{n + k}$$
where $n$ is the total number of examples at the node, $n_c$ is the number of examples in the majority class $c$, and $k$ is the total number of classes.
❏ This definition assumes that all the classes are equally probable.
Minimum Error Pruning
❏ The expected error rate is calculated at each node (also called the static error rate).
❏ The dynamic error rate of a parent node is the weighted sum of the static error rates of its child nodes, each weighted by the share of the parent's examples that reach that child.
❏ If the dynamic error rate is greater than the static error rate, the node is pruned.
Example MEP
Let us consider the following tree structure (class counts are shown for each node):
Node 26 (20:10:5)
    Node 27 (15:5:0)
        Node 30 (15:2:0)
        Node 31 (0:3:0)
    Node 28 (3:3:5)
    Node 29 (2:2:0)
Should we prune at node 27?
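Example MEP
$n$ (at node 27) = 20, $n$ (at node 30) = 17, $n$ (at node 31) = 3
$n_c$ (at node 27) = 15, $n_c$ (at node 30) = 15, $n_c$ (at node 31) = 3
$k = 3$
At node 27, static error rate $E = \frac{20 - 15 + 3 - 1}{20 + 3} = \frac{7}{23} \approx 0.304$
At node 27, dynamic error rate $E = \frac{17}{20}\cdot\frac{17 - 15 + 3 - 1}{17 + 3} + \frac{3}{20}\cdot\frac{3 - 3 + 3 - 1}{3 + 3} = 0.17 + 0.05 = 0.22$
The dynamic error rate is lower than the static error rate, so the error rate has been reduced through subsequent splitting and pruning should not be done.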
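The calculation for node 27 can be reproduced with a short base-R sketch. The function names `static_error` and `dynamic_error` are illustrative only, and the sketch assumes the children of the node are leaves (as nodes 30 and 31 are here), so no recursive backing-up of errors is needed.

```r
# Static (expected) error rate of a node from its class counts:
# E = (n - n_c + k - 1) / (n + k)
static_error <- function(counts) {
  n   <- sum(counts)     # examples reaching the node
  n_c <- max(counts)     # examples in the majority class
  k   <- length(counts)  # number of classes
  (n - n_c + k - 1) / (n + k)
}

# Dynamic error rate: static errors of the children, weighted by child size
dynamic_error <- function(children) {
  sizes <- sapply(children, sum)
  errs  <- sapply(children, static_error)
  sum(sizes / sum(sizes) * errs)
}

node27 <- c(15, 5, 0)
node30 <- c(15, 2, 0)
node31 <- c(0, 3, 0)

e_static  <- static_error(node27)
e_dynamic <- dynamic_error(list(node30, node31))
prune     <- e_dynamic > e_static   # prune only if splitting made things worse

print(c(static = e_static, dynamic = e_dynamic))
print(prune)
```

Here `prune` is FALSE, matching the conclusion of the example.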
Error Based Pruning
❏ Error based pruning was developed for the C4.5 algorithm.
❏ It uses an estimate of the expected error rate.
❏ The set of examples covered by a leaf of the tree is treated as a sample, from which a confidence limit for the posterior probability of misclassification can be calculated.
❏ Errors in the sample are assumed to follow a binomial distribution.
Error Based Pruning
❏ The upper confidence limit (default confidence level $CF$ = 25%) is obtained by solving for $p$ in the equation
$$CF = \sum_{i=0}^{E} \binom{N}{i} p^i (1-p)^{N-i}$$
where $N$ is the number of cases covered by a node and $E$ is the number of those cases that the node classifies erroneously.
❏ The upper confidence limit is multiplied by the number of cases covered by the leaf to determine the predicted error for the leaf.
❏ The predicted error of a subtree is the sum of the predicted errors of its branches.
❏ If the predicted error of a leaf is less than that of the subtree it would replace, the subtree is pruned.
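As a minimal sketch, the upper confidence limit can be found numerically in base R by solving the binomial equation for p with `uniroot()`. The function names and the leaf counts below are hypothetical, and the sketch solves the equation exactly as stated; it is not taken from the C4.5 or J48 source code, which may compute the limit differently.

```r
# Upper confidence limit on a leaf's error rate: the p that solves
#   sum_{i=0}^{E} choose(N, i) * p^i * (1 - p)^(N - i) = CF
# pbinom(E, N, p) evaluates exactly that left-hand side.
upper_error <- function(E, N, CF = 0.25) {
  if (E >= N) return(1)  # every covered case is an error
  f <- function(p) pbinom(E, N, p) - CF
  uniroot(f, interval = c(0, 1), tol = 1e-10)$root
}

# Predicted number of errors for a leaf covering N cases with E errors
predicted_errors <- function(E, N, CF = 0.25) N * upper_error(E, N, CF)

# Hypothetical subtree with three leaves, versus the single leaf
# (16 cases, 1 error) that would replace it
leaves_N <- c(6, 9, 1)
leaves_E <- c(0, 0, 0)
subtree_pred <- sum(mapply(predicted_errors, leaves_E, leaves_N))
leaf_pred    <- predicted_errors(E = 1, N = 16)

prune <- leaf_pred < subtree_pred
print(c(subtree = subtree_pred, leaf = leaf_pred))
print(prune)
```

With these made-up counts the single leaf has the lower predicted error, so the subtree would be replaced even though it makes no errors on the training cases.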
Tree Pruning in R
❏ The ‘rpart’ package has its own tree pruning function, which uses the cost complexity parameter (cp) to prune the tree.
❏ The rpart() function uses different methods for classification or regression depending on the dependent variable declared.
❏ For a classification problem, it uses the cross-validation error to evaluate the best pruned tree.
❏ The performance of the pruned trees can be inspected with the printcp() and plotcp() functions.
❏ The final model can be extracted with the prune() function.
Tree Pruning in R
❏ R has another package called ‘tree’, which uses the CART algorithm.
❏ It has its own cross-validation technique for pruning a tree.
❏ It does not use rpart's cp parameter for pruning; prune.tree() takes the target tree size (best) or the cost-complexity value (k) instead.
❏ The ‘C50’ package has its own pruning method, which is error based pruning.
❏ The ‘RWeka’ package has the J48() function for creating a decision tree, which can also prune the tree with error based pruning.
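Simple demonstration
Here we will use the ‘banknote’ dataset.
R code for the C4.5 algorithm using the J48() function from the RWeka package:
    library(RWeka)  # provides J48() and Weka_control(); 'banknote' is assumed to be already loaded
    # Build an unpruned decision tree using Weka_control parameters (U = TRUE: unpruned)
    modelJ48 <- J48(Status ~ ., data = banknote, control = Weka_control(U = TRUE))
    modelJ48        # print the fitted tree
    plot(modelJ48)  # plot the fitted tree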
Unpruned Tree
❏ The unpruned tree is also a small tree.
❏ However, it is possible to get similar performance with an even smaller tree.
Pruned Tree
The J48() function can prune a tree using either ‘Reduced error pruning’ or ‘Error based pruning’:
    # Reduced error pruning
    modelJ48.pruned <- J48(Status ~ ., data = banknote, control = Weka_control(R = TRUE))
    plot(modelJ48.pruned)
    modelJ48.pruned
    # Error based pruning with confidence factor 0.1
    modelJ48.pruned1 <- J48(Status ~ ., data = banknote, control = Weka_control(C = 0.1))
    plot(modelJ48.pruned1)
    modelJ48.pruned1
❏ Both calls generate the same pruned tree.
❏ A single variable is enough to detect genuine notes.
Another Example
Use the existing ‘kyphosis’ dataset from the ‘rpart’ package:
    library(rpart)  # provides rpart() and the kyphosis data
    fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)
    plot(fit, uniform = TRUE, main = "Classification Tree for Kyphosis")
    text(fit, use.n = TRUE, all = TRUE, cex = .8)
A better looking plot can be generated using the fancyRpartPlot() function from the ‘rattle’ library:
    library(rattle)
    library(rpart.plot)
    fancyRpartPlot(fit)
Plots
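Pruning with complexity parameter
    printcp(fit)   # table of cp values with cross-validation error
    plotcp(fit)    # plot of cross-validation error against cp
    pruneTree <- prune(fit, cp = 0.059)
    plot(pruneTree, uniform = TRUE, margin = 0.1, main = "Classification Tree for Kyphosis")
    text(pruneTree, use.n = TRUE, all = TRUE, cex = .8)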
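Instead of reading a cp value off the printcp() output by hand, the value can be picked programmatically from the fitted object's `cptable`. The sketch below takes the row with the smallest cross-validated error, which is only one possible selection rule (a one-standard-error rule is a common alternative); the seed is an arbitrary assumption, needed because the cross-validation folds are random.

```r
library(rpart)

set.seed(1)  # arbitrary seed: cross-validation folds are drawn at random
fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)

# cptable has one row per candidate tree, with columns
# "CP", "nsplit", "rel error", "xerror" and "xstd"
cp_table <- fit$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to the tree with the smallest cross-validated error
pruned <- prune(fit, cp = best_cp)
print(best_cp)
print(pruned)
```

Because the folds are random, the selected cp can change with the seed, so it is worth checking the plotcp() curve as well.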
Pruning with K-fold cross validation
❏ Use the tree() function from the ‘tree’ package to classify the auto dataset into automobiles and trucks (DV: type).
❏ The first two variables are removed because they have too many levels (tree() can handle factors with at most 32 levels).
    library(tree)
    # Create an unpruned tree by keeping mindev = 0
    model.tree <- tree(type ~ ., data = auto[, -c(1, 2)],
                       control = tree.control(nobs = 157, mindev = 0))
    plot(model.tree)
    text(model.tree, pretty = 2)
    # Run 10-fold cross validation on the original tree
    cv.model.tree <- cv.tree(model.tree)
    plot(cv.model.tree)
Pruning with K-fold cross validation
    # The lowest plateau of the cross-validation plot starts at size = 3
    prune.model.tree <- prune.tree(model.tree, best = 3)
Plot of Pruned tree
❏ A simplified tree is extracted based on minimum deviance.
Boosting
❏ Classifiers are produced sequentially.
❏ Each classifier depends on the previous one and focuses on the previous one's errors.
❏ Examples that were incorrectly predicted by previous classifiers are chosen more often or weighted more heavily.
Boosting
❏ Records that are wrongly classified will have their weights increased.
❏ Records that are classified correctly will have their weights decreased.
❏ Boosting algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each round, and (2) how the predictions made by each classifier are combined.
❏ Example 4 is hard to classify. Its weight is increased, so it is more likely to be chosen again in subsequent rounds.
Ada Boosting
Freund and Schapire, 1997
Ideas:
❏ Complex hypotheses tend to overfit.
❏ Simple hypotheses may not explain the data well.
❏ Combine many simple hypotheses into a complex one.
❏ The open questions are how to design the simple hypotheses and how to combine them.
Ada-boosting
Two Approaches:
❏ Select examples according to the error of the previous classifier, so that more representatives of the misclassified cases are selected (the more common approach).
❏ Weigh the errors of the misclassified cases higher, so that all cases are incorporated but with different weights (this does not work for some algorithms).
Boosting example (figure: the original training set followed by three successive resampled training sets)
Ada-boosting
Input:
    Training samples $S = \{(x_i, y_i)\}$, $i = 1, 2, \dots, N$
    Weak learner $h$
Initialization:
    Each sample has equal weight $w_i = 1/N$
For $k = 1 \dots T$:
    Train weak learner $h_k$ according to the weighted sample set
    Compute the classification errors
    Update the sample weights $w_i$
Output:
    Final model, which is a linear combination of the $h_k$
Ada-boosting
❏ Weak learner: its error rate is only slightly better than random guessing.
❏ Boosting: sequentially apply the weak learner to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $h(x)$.
❏ The predictions from all of the weak classifiers are combined through a weighted majority vote:
$$H(x) = \mathrm{sign}\left[\sum_{i} a_i h_i(x)\right]$$
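The weighted majority vote can be made concrete with a minimal base-R sketch that uses decision stumps on a single numeric feature as the weak learner. This is one common reweighting formulation, assumed here for illustration; the helper names, the toy data, the choice of three rounds, and the decision to simply stop when a stump is no better than chance are all assumptions, not the slides' own procedure.

```r
# Decision stump on one numeric feature, fitted with case weights w.
# Predicts `dir` when x <= thr and `-dir` otherwise (labels are -1 / +1).
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (thr in sort(unique(x))) {
    for (dir in c(1, -1)) {
      pred <- ifelse(x <= thr, dir, -dir)
      err  <- sum(w * (pred != y)) / sum(w)
      if (err < best$err) best <- list(thr = thr, dir = dir, err = err)
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x <= s$thr, s$dir, -s$dir)

adaboost <- function(x, y, rounds = 3) {
  n <- length(y)
  w <- rep(1 / n, n)                  # every sample starts with weight 1/N
  stumps <- list()
  a <- numeric(0)
  for (k in seq_len(rounds)) {
    s    <- fit_stump(x, y, w)
    miss <- predict_stump(s, x) != y
    err  <- sum(w * miss) / sum(w)    # weighted classification error
    if (err >= 0.5) break             # no better than chance: stop (simplification)
    err  <- max(err, 1e-10)           # avoid an infinite weight for a perfect stump
    a_k  <- log((1 - err) / err)      # vote weight of this weak classifier
    w    <- w * exp(a_k * miss)       # raise the weights of misclassified samples
    w    <- w / sum(w)                # renormalise so the weights sum to 1
    stumps[[length(stumps) + 1]] <- s
    a <- c(a, a_k)
  }
  list(stumps = stumps, a = a)
}

# H(x) = sign(sum_k a_k * h_k(x))
predict_adaboost <- function(model, x) {
  votes <- numeric(length(x))
  for (k in seq_along(model$stumps)) {
    votes <- votes + model$a[k] * predict_stump(model$stumps[[k]], x)
  }
  sign(votes)
}

# Toy data that no single stump can separate
x <- 0:9
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)
model <- adaboost(x, y, rounds = 3)
pred  <- predict_adaboost(model, x)
print(mean(pred == y))                # training accuracy of the combined vote
```

Each stump alone misclassifies at least three of the ten points, so the example shows how the weighted vote can outperform every individual weak classifier.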
Configuration (figure: the training samples and the successively weighted samples are used to fit $h_1(x)$, $h_2(x)$, $h_3(x)$, ..., $h_T(x)$, whose outputs are combined as Sign[sum])
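Calculations on weights and error
For $k = 1$ to $T$:
    Fit a learner to the training data using the weights $w_i$
    Compute the weighted classification error of the learner and its weight in the final vote
    Set the new weights $w_i$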
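For reference, one widely used formulation of these three steps (the AdaBoost.M1 form with labels in $\{-1, +1\}$) is sketched below. It is an assumption that this matches the formulas intended here; other AdaBoost variants scale the classifier weight differently or also shrink the weights of correctly classified samples explicitly. $I(\cdot)$ is the indicator function and $\mathrm{err}_k$ is a symbol introduced for this sketch.

```latex
\begin{align*}
\mathrm{err}_k &= \frac{\sum_{i=1}^{N} w_i \, I\big(y_i \neq h_k(x_i)\big)}{\sum_{i=1}^{N} w_i} \\
a_k &= \log\frac{1 - \mathrm{err}_k}{\mathrm{err}_k} \\
w_i &\leftarrow w_i \exp\big(a_k \, I(y_i \neq h_k(x_i))\big), \qquad i = 1, \dots, N
\end{align*}
```

In this form only the misclassified samples are scaled up; once the weights are renormalised to sum to one, the correctly classified samples end up with smaller relative weights.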
Points to keep in mind
❏ It penalizes models that have poor accuracy.
❏ If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/n and the re-sampling procedure is repeated.
❏ Because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be quite susceptible to overfitting.