By Subhasis Dasgupta Asst Professor Praxis Business School, Kolkata Classification Modeling Decision Tree (Part 2)


Pruning a Tree We will look into the following tree pruning methods:
❏ Cost complexity pruning
❏ Minimum error pruning
❏ Error based pruning
❏ Minimum Description Length pruning
❏ Optimal pruning (using a dynamic programming approach)

Cost Complexity Pruning
Cost complexity pruning proceeds in two stages.
❏ In the first stage, a sequence of trees T0, T1, T2, …, Tk is generated, where T0 is the fully grown tree and Tk is the root-only tree.
❏ Each tree Ti+1 is obtained by replacing one or more sub-trees of its predecessor Ti with suitable leaves.
❏ In the second stage, one of these trees is selected as the pruned tree based on an estimate of its generalization error.

Cost Complexity Pruning
The sub-trees pruned at each step are those giving the lowest increase in apparent error rate per pruned leaf:
α = [ ε(pruned(T,t), S) − ε(T, S) ] / [ |leaves(T)| − |leaves(pruned(T,t))| ]
where ε(T,S) indicates the error rate of the tree T over the sample S, |leaves(T)| denotes the number of leaves in T, and pruned(T,t) denotes the tree obtained by replacing the node t in T with a suitable leaf.
In the second phase, the generalization error of each pruned tree T0, T1, …, Tk is estimated, and the tree with the lowest generalization error is returned as the final pruned tree.
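As a quick illustration of this quantity, here is a minimal R sketch (the helper name cc_alpha and the example numbers are hypothetical, not taken from the slides):
# Increase in apparent error rate per pruned leaf for a candidate prune
# err_T, err_pruned: apparent error rates of T and pruned(T, t) on sample S
# leaves_T, leaves_pruned: number of leaves in T and in pruned(T, t)
cc_alpha <- function(err_T, err_pruned, leaves_T, leaves_pruned) {
  (err_pruned - err_T) / (leaves_T - leaves_pruned)
}
cc_alpha(0.10, 0.12, 8, 5)  # ~0.0067: error rises by about 0.7% per pruned leaf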

Minimum Error Pruning
Minimum error pruning (MEP) was proposed to obtain a single tree with the minimum expected error rate when classifying a set of examples into different classes.
If Ek denotes the expected error rate at a node, it is defined as
Ek = (n − nc + k − 1) / (n + k)
where n is the total number of examples reaching the node, nc is the number of examples in the majority class c, and k is the total number of classes.
This definition assumes that all classes are equally probable a priori.

Minimum Error Pruning
The expected error rate calculated at a node itself is called the static error rate.
The dynamic error rate of a node is the weighted sum of the static error rates of its child nodes, weighted by the proportion of examples reaching each child.
If the dynamic error rate is greater than the static error rate, the sub-tree below the node is pruned (splitting further does not reduce the expected error).

Example: MEP
Consider the following tree (class counts shown as class1:class2:class3 at each node): Node 26 (20:10:5) splits into Node 27 (15:5:0), Node 28 (3:3:5) and Node 29 (2:2:0); Node 27 splits further into Node 30 (15:2:0) and Node 31 (0:3:0).
Should we prune at node 27?

Example: MEP
n (at node 27) = 20, n (at node 30) = 17, n (at node 31) = 3
nc (at node 27) = 15, nc (at node 30) = 15, nc (at node 31) = 3, k = 3
At node 27, static error rate E = (20 − 15 + 3 − 1) / (20 + 3) = 7/23 ≈ 0.30
At node 27, dynamic error rate E = (17/20) × (17 − 15 + 2)/(17 + 3) + (3/20) × (3 − 3 + 2)/(3 + 3) = 0.85 × 0.20 + 0.15 × 0.33 ≈ 0.22
The error rate is reduced by the subsequent splitting (0.22 < 0.30), hence pruning should not be done at node 27.
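The same calculation can be reproduced with a short R sketch (the helper name mep is hypothetical; the numbers come from the example above):
# Expected (static) error rate under MEP: (n - nc + k - 1) / (n + k)
mep <- function(n, nc, k) (n - nc + k - 1) / (n + k)
static27  <- mep(20, 15, 3)                                    # ~0.304
dynamic27 <- (17/20) * mep(17, 15, 3) + (3/20) * mep(3, 3, 3)  # ~0.22
c(static27, dynamic27)   # dynamic < static, so do not prune at node 27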

Error Based Pruning
Error based pruning was developed for the C4.5 algorithm and uses an estimate of the expected error rate.
The set of examples covered by a leaf is treated as a sample, from which a confidence interval for the posterior probability of misclassification can be calculated.
The errors in the sample are assumed to follow a binomial distribution.

Error Based Pruning
The upper confidence limit on the error rate (default confidence level CF = 25%) is obtained by solving for p in the equation
Σ (x = 0 to E) C(N, x) p^x (1 − p)^(N − x) = CF
where N is the number of cases covered by a node and E is the number of those cases that the node classifies incorrectly.
The upper confidence limit is multiplied by the number of cases covered by the leaf to obtain the predicted error of the leaf.
The predicted error of a subtree is the sum of the predicted errors of its branches.
If the predicted error of the leaf that would replace a subtree is less than that of the subtree below it, pruning is done.
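As a rough illustration (a sketch, not the C4.5 source code): in base R the binomial upper confidence limit defined by the equation above can be obtained from the beta quantile; the helper name ucf is hypothetical.
# Upper confidence limit on the error rate of a leaf with N cases and E errors
# (solves the binomial-tail equation above via the equivalent beta quantile)
ucf <- function(E, N, CF = 0.25) qbeta(1 - CF, E + 1, N - E)
ucf(0, 6)        # ~0.206: upper error limit for a pure leaf covering 6 cases
6 * ucf(0, 6)    # ~1.24: predicted number of errors for that leaf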

Tree Pruning in R
The 'rpart' package has its own pruning function, which uses a cost complexity parameter (cp) for pruning the tree.
The rpart() function switches between classification and regression depending on how the dependent variable is declared.
For a classification problem it uses cross-validation error to evaluate the best pruned tree.
The performance of candidate pruned trees can be inspected with the printcp() and plotcp() functions, and the final model can be extracted with the prune() function.

Tree Pruning in R
R has another package called 'tree' which also implements the CART algorithm; it has its own cross-validation technique for pruning and does not use a complexity parameter.
The 'C50' package uses its own pruning method, which is error based pruning.
The 'RWeka' package provides the J48() function for building decision trees, which can also prune a tree using error based pruning.
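Since the C50 route is not demonstrated later, here is a minimal sketch of error based pruning with that package (assuming the same 'banknote' data frame with factor response Status used in the next slide; the confidence factor 0.1 is illustrative):
library(C50)
# C5.0 applies error based pruning; CF controls the confidence factor
modelC50 <- C5.0(Status ~ ., data = banknote, control = C5.0Control(CF = 0.1))
summary(modelC50)
plot(modelC50)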

Simple demonstration
Here we will use the 'banknote' dataset. R code for the C4.5 algorithm using the J48() function from the RWeka package:
library(RWeka)
# Build an unpruned decision tree using Weka_control parameters (U = unpruned)
modelJ48 = J48(Status ~ ., data = banknote, control = Weka_control(U = TRUE))
modelJ48
plot(modelJ48)

Unpruned Tree
The unpruned tree is itself fairly small; however, similar performance can be achieved with an even smaller tree.

Pruned Tree
The J48() function can prune a tree using either reduced error pruning or error based pruning:
# Reduced error pruning (R = TRUE)
modelJ48.pruned = J48(Status ~ ., data = banknote, control = Weka_control(R = TRUE))
plot(modelJ48.pruned)
modelJ48.pruned
# Error based pruning with confidence factor 0.1
modelJ48.pruned1 = J48(Status ~ ., data = banknote, control = Weka_control(C = 0.1))
plot(modelJ48.pruned1)
modelJ48.pruned1

Both methods generate the same pruned tree, shown below; a single variable is enough to detect genuine notes.

Another Example
Use the 'kyphosis' dataset that ships with the 'rpart' package:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)
plot(fit, uniform = TRUE, main = "Classification Tree for Kyphosis")
text(fit, use.n = TRUE, all = TRUE, cex = .8)
A better looking plot can be generated with the fancyRpartPlot() function from the 'rattle' package:
library(rattle)
library(rpart.plot)
fancyRpartPlot(fit)

Plots

Pruning with complexity parameter
# Inspect the cross-validated error for each value of cp
printcp(fit)
plotcp(fit)
# Prune at the chosen complexity parameter and re-plot
pruneTree = prune(fit, cp = 0.059)
plot(pruneTree, uniform = TRUE, margin = 0.1, main = "Classification Tree for Kyphosis")
text(pruneTree, use.n = TRUE, all = TRUE, cex = .8)

Pruning with K-fold cross validation
Use the tree() function from the 'tree' package to classify the auto dataset into automobiles and trucks (dependent variable: type).
The first two variables are removed because they have too many levels (the tree package can handle factors with at most 32 levels).
library(tree)
# Create an unpruned tree by setting mindev = 0
model.tree = tree(type ~ ., data = auto[, -c(1, 2)], control = tree.control(nobs = 157, mindev = 0))
plot(model.tree)
text(model.tree, pretty = 2)
# Run 10-fold cross-validation on the original tree
cv.model.tree = cv.tree(model.tree)
plot(cv.model.tree)

Pruning with K-fold cross validation
# The lowest plateau in the cross-validated deviance starts at size = 3
prune.model.tree = prune.tree(model.tree, best = 3)
plot(prune.model.tree)
text(prune.model.tree, pretty = 2)

Plot of Pruned Tree
A simplified tree is extracted based on minimum deviance.

Boosting
❏ Classifiers are produced sequentially.
❏ Each classifier depends on the previous one and focuses on the previous one's errors.
❏ Examples that were incorrectly predicted by previous classifiers are chosen more often or weighted more heavily.

Boosting
❏ Records that are wrongly classified have their weights increased.
❏ Records that are classified correctly have their weights decreased.
❏ Boosting algorithms differ in (1) how the weights of the training examples are updated at the end of each round, and (2) how the predictions made by the individual classifiers are combined.
In the illustration on the slide, example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds.

Ada Boosting (Freund and Schapire, 1997)
Ideas:
❏ Complex hypotheses tend to overfit.
❏ Simple hypotheses may not explain the data well.
❏ Combine many simple hypotheses into a complex one.
❏ The design questions are how to build the simple hypotheses and how to combine them.

Ada-boosting: Two Approaches
❏ Resample examples according to the errors of the previous classifier (more representatives of the misclassified cases are selected) – the more common approach.
❏ Reweight the cases, giving higher weight to the misclassified ones (all cases are kept, but with different weights) – does not work for some algorithms.

Boosting example (diagram): the original training set is resampled/reweighted into a sequence of modified training sets, one per boosting round.

Ada-boosting
Input: training samples S = {(xi, yi)}, i = 1, 2, …, N, and a weak learner h
Initialization: each sample has equal weight wi = 1/N
For k = 1 … T:
❏ Train the weak learner hk on the weighted samples
❏ Compute the (weighted) classification error
❏ Update the sample weights wi
Output: final model, a linear combination of the hk

Ada-boosting

Weak learner: a classifier whose error rate is only slightly better than random guessing.
Boosting: sequentially apply the weak learner to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers hk(x). The predictions of all the weak classifiers are combined through a weighted majority vote:
H(x) = sign[ Σk αk hk(x) ]

Configuration (diagram): the training samples are reweighted at each round, producing weak classifiers h1(x), h2(x), h3(x), …, hT(x); their outputs are combined by sign[ Σk αk hk(x) ].

Calculations on weights and error
For k = 1 to T:
❏ Fit a learner hk to the training data using weights wi
❏ Compute the weighted error errk = Σi wi · I(yi ≠ hk(xi)) / Σi wi
❏ Compute the classifier weight αk = log[(1 − errk) / errk]
❏ Set wi ← wi · exp[αk · I(yi ≠ hk(xi))] for i = 1, …, N, and renormalize the weights
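To make the loop above concrete, here is a minimal R sketch of AdaBoost.M1 with decision stumps built by rpart (illustrative only; the function names adaboost_m1 and predict_adaboost are hypothetical, and a two-level factor response is assumed):
library(rpart)
adaboost_m1 <- function(formula, data, T = 25) {
  n <- nrow(data)
  w <- rep(1 / n, n)                          # start with equal weights 1/N
  learners <- list(); alpha <- numeric(0)
  y <- data[[all.vars(formula)[1]]]
  for (k in 1:T) {
    # Fit a stump using the current case weights
    fit <- rpart(formula, data = data, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1))
    miss <- as.numeric(predict(fit, data, type = "class") != y)
    err <- sum(w * miss) / sum(w)             # weighted classification error
    if (err == 0 || err >= 0.5) break         # stop if perfect or no better than chance
    a <- log((1 - err) / err)                 # classifier weight alpha_k
    w <- w * exp(a * miss); w <- w / sum(w)   # up-weight misclassified cases
    learners[[length(learners) + 1]] <- fit
    alpha <- c(alpha, a)
  }
  list(learners = learners, alpha = alpha, levels = levels(y))
}
predict_adaboost <- function(model, newdata) {
  # Weighted majority vote: map the two classes to -1/+1 and take the sign of the sum
  score <- rep(0, nrow(newdata))
  for (k in seq_along(model$learners)) {
    p <- predict(model$learners[[k]], newdata, type = "class")
    score <- score + model$alpha[k] * ifelse(p == model$levels[2], 1, -1)
  }
  factor(ifelse(score >= 0, model$levels[2], model$levels[1]), levels = model$levels)
}
# e.g. m <- adaboost_m1(Kyphosis ~ Age + Number + Start, data = kyphosis, T = 25)
#      table(predict_adaboost(m, kyphosis), kyphosis$Kyphosis)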

Points to keep in mind
❏ Boosting penalizes component models that have poor accuracy.
❏ If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/n and the re-sampling procedure is repeated.
❏ Because of its tendency to focus on training examples that are wrongly classified, boosting can be quite susceptible to overfitting.