Lecture Notes 4: Pruning
Zhangxi Lin, ISQS 7342-001, Texas Tech University
Note: Most slides are from Decision Tree Modeling by SAS
Objectives
- Understand how the CHAID and CART algorithms, and other variations, finalize a decision tree by pruning
  - Pre-pruning vs. post-pruning (top-down vs. bottom-up)
  - Cross validation
- Understand the use of tree modeling parameters
  - Prior probabilities
  - Decision weights
  - Kass adjustment
- Examine the performance of different tree modeling configurations with SAS Enterprise Miner 5.2
- Know how PROC ARBORETUM is used
Chapter 3: Pruning
3.1 Pruning
3.2 Pruning for Profit
3.3 Pruning for Profit Using Cross Validation (Optional)
3.4 Compare Various Tree Settings and Performance
3.1 Pruning
Maximal Tree A maximal classification tree gives 100% accuracy on the training data and has no residual variability.
Overfitting
[Figure: a maximal tree's classification boundaries on training data vs. new data.]
A maximal tree is the result of overfitting: it fits the training data closely but generalizes poorly to new data.
Underfitting
[Figure: a small tree's classification boundaries on training data vs. new data.]
A small tree with only a few branches may underfit the data, missing real structure.
The “Right-Sized” Tree
Top-Down Stopping Rules (Pre-Pruning):
- Node size
- Tree depth
- Statistical significance
Bottom-Up Selection Criteria (Post-Pruning):
- Accuracy
- Profit
- Class-probability trees
- Least squares
(See the sketch below for how the two families of controls differ.)
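To make the two families concrete, here is a sketch in scikit-learn terms, a stand-in for the SAS Enterprise Miner settings; the data and parameter values are illustrative. Stopping rules are set before growth, while bottom-up selection prunes a maximal tree after growth.

```python
# Pre-pruning (stopping rules) vs. post-pruning (bottom-up subtree selection).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Top-down stopping rules: limits on tree depth and node size.
pre = DecisionTreeClassifier(max_depth=6, min_samples_split=20).fit(X, y)

# Bottom-up selection: grow a maximal tree, then pick a pruned subtree
# from the cost-complexity path (an accuracy/least-squares style criterion).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```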
Top-Down Pruning
[Figure: example tree with candidate splits annotated with their test statistics (logworths) and node sizes; splitting stops top-down when no candidate split is significant enough.]
Depth Multiplier
The depth adjustment multiplies a split's p-value by the depth multiplier: adjusted p-value = p-value × depth multiplier. The multiplier for a node is the product of the number of branches of each split on the path from the root to that node.
[Figure: tree annotated with depth multipliers 1, 3, 6, 12, 24, 36, 48 at successive nodes.]
Example: a depth multiplier of 48 = 2 × 2 × 4 × 3 arises at a node reached through splits with 2, 2, 4, and 3 branches.
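A tiny sketch of the adjustment (the helper names are hypothetical; the branch counts are taken from the example above):

```python
# Depth multiplier: product of branch counts on the path from the root.
import math

def depth_multiplier(branches_on_path):
    """branches_on_path: number of branches of each split from root to node."""
    return math.prod(branches_on_path)

def depth_adjusted_p(p_value, branches_on_path):
    # Adjusted p-value = raw p-value x depth multiplier (capped at 1).
    return min(1.0, p_value * depth_multiplier(branches_on_path))

print(depth_multiplier([2, 2, 4, 3]))         # 48, as in the slide
print(depth_adjusted_p(0.001, [2, 2, 4, 3]))  # 0.048
```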
Tree Node Defaults and Options
Property groups: Splitting Rule, Node, Split Search, Subtree, P-Value Adjustment.
Top-Down Pruning Options
- The default maximum depth in the Decision Tree node is 6. The value can be changed with the Maximum Depth option.
- The Split Size option specifies the smallest number of training observations that a node must have to be considered for splitting. Valid values are between 2 and 32767.
- The liberal significance level of .2 (logworth = -log10(.2) ≈ 0.7) is the default. It can be changed with the Significance Level option.
- By default, the depth multiplier is applied. It can be turned off with the Split Adjustment option in the P-Value Adjustment properties.
- A further adjustment for the number of inputs available at a node for splitting can be used. This option is in the P-Value Adjustment properties and is not activated by default (Inputs = No).
- To specify pre-pruning only, set the SubTree option to Largest.
Bottom-Up Pruning
[Figure: performance versus number of leaves. Performance on the training data keeps improving as leaves are added, while generalization performance peaks and then declines; bottom-up pruning selects the subtree at the peak.]
Top-down vs. Bottom-up
Top-down pruning is usually faster, but less effective, than bottom-up pruning.
Breiman and Friedman, in their criticism of the FACT tree algorithm (Loh and Vanichsetakul 1988): "Each stopping rule was tested on hundreds of simulated data sets with different structures. Each new stopping rule failed on some data set. It was not until a very large tree was built and then pruned, using cross-validation to govern the degree of pruning, that we observed something that worked consistently."
Model Selection Criteria
[Table: training/validation Accuracy, Profit, and ASE for candidate subtrees with 1 to 5 leaves (e.g., ASE falls from .48/.46 at the root to .17/.15 at 5 leaves); different selection criteria can favor different subtrees.]
Bottom-up Selection Criteria
- The default tree selection criterion is Decision: the final tree is selected based on profit or loss if a decision matrix has been specified.
- The Lift criterion of Assessment Measure enables the user to restrict assessment to a specified proportion of the data. By default, Assessment Fraction is set to 0.25.
Effect of Prior Probabilities: Confusion Matrix
[Table: confusion matrix of actual class by decision/action, with counts corrected for the priors: each actual-class-i count is weighted by πᵢ/ρᵢ.]
πᵢ – proportion of class i in the original population; ρᵢ – proportion of class i in the sample.
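A minimal numeric sketch of this correction (the counts below are made up; the priors 0.98/0.02 come from the INSURANCE demonstration later in this chapter):

```python
# Re-weight a confusion matrix so class frequencies match the population.
import numpy as np

counts = np.array([[40, 10],    # rows: actual class 0, 1 (illustrative counts)
                   [15, 35]])   # cols: decision 0, 1
rho = counts.sum(axis=1) / counts.sum()   # class proportions in the sample
pi = np.array([0.98, 0.02])               # prior (population) proportions

corrected = counts * (pi / rho)[:, None]  # scale each actual-class row by pi_i/rho_i
print(corrected)
```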
Tree Accuracy
[Figure: tree with leaves t1, t2, t3.]
Tree accuracy is each leaf's accuracy weighted by the proportion of cases in that leaf.
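Written as a formula, with $P(t)$ the proportion of cases that fall into leaf $t$:

$$\text{Accuracy}(T) = \sum_{t \in \text{leaves}(T)} P(t)\,\text{Accuracy}(t)$$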
Maximize Accuracy
[Figure: two-leaf tree. Leaf 1 (classified 1): training 85% class 1 / 15% class 0, 42% of cases; validation 83% / 17%, 40% of cases. Leaf 2 (classified 0): training 8.6% / 91%, 58% of cases; validation 3.4% / 97%, 60% of cases.]
Training Accuracy = (.42)(.85) + (.58)(.91) = .88
Validation Accuracy = (.40)(.83) + (.60)(.97) = .91
Profit Matrix
[Table: profit matrix with rows = actual class and columns = decision; entry Profit(actual, decision) gives the profit of that combination.]
Bayes rule: make decision 1 if its expected profit beats that of decision 0, i.e. if
$p_1\,\text{Profit}(1,1) + (1-p_1)\,\text{Profit}(0,1) > p_1\,\text{Profit}(1,0) + (1-p_1)\,\text{Profit}(0,0)$,
where $p_1$ is the proportion of class 1 in the leaf.
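Solving the inequality for $p_1$ gives a probability cutoff. As a worked example with the decision weights from the INSURANCE demonstration later in this chapter, and assuming $150 is the profit of deciding 1 for an actual 1, -$3 the profit of deciding 1 for an actual 0, and decision 0 worth 0 (this mapping onto the matrix is an assumption):

$$150\,p_1 - 3\,(1 - p_1) > 0 \;\Longleftrightarrow\; p_1 > \frac{3}{153} \approx 0.02$$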
Maximize Profit
[Figure: the same two-leaf tree. Leaf 1: training 85% class 1 / 15% class 0, 42% of cases; validation 83% / 17%, 40% of cases. Leaf 2: training 8.6% / 91%, 58% of cases; validation 3.4% / 97%, 60% of cases.]
Profit matrix (actual × predicted): deciding 1 earns 1.56 for an actual 1 and -1 for an actual 0; deciding 0 earns 0 either way.
Expected profit of deciding 1: leaf 1 = 1.18 (training) and 1.11 (validation), so leaf 1 is classified 1; leaf 2 = -.78 (training) and -.91 (validation), so leaf 2 is classified 0, with profit 0.
Training Profit = (.42)(1.18) + (.58)(0) = .50
Validation Profit = (.40)(1.11) + (.60)(0) = .44
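A sketch of these calculations in Python (the profit-matrix values 1.56 and -1 follow the reconstruction above; leaf proportions are from the slide):

```python
# Expected profit per leaf under the profit matrix, then tree-level profit.
profit_decide_1 = {1: 1.56, 0: -1.0}   # profit by actual class when deciding 1
profit_decide_0 = {1: 0.0, 0: 0.0}     # deciding 0 earns nothing either way

def leaf_profit(p1):
    """Best achievable expected profit in a leaf with P(class 1) = p1."""
    e1 = p1 * profit_decide_1[1] + (1 - p1) * profit_decide_1[0]
    e0 = p1 * profit_decide_0[1] + (1 - p1) * profit_decide_0[0]
    return max(e1, e0)                 # Bayes rule: take the better decision

# Leaf 1 holds 42% of training data with p1=.85; leaf 2 holds 58% with p1=.086.
tree_profit = 0.42 * leaf_profit(0.85) + 0.58 * leaf_profit(0.086)
print(round(tree_profit, 2))  # ~0.49 (the slide's .50 rounds leaf profit to 1.18 first)
```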
3.2 Pruning for Profit
Demonstration – Pruning for Profit
Data set: INSURANCE
Parameters:
- Prior probabilities: (0.02, 0.98)
- Decision weights: $150, -$3
Purposes:
- Get familiar with defining prior probabilities for the target variable (recall how this is done in SAS EM 4.3)
- View the results of the Tree node
- Understand how parameters defined in the Tree node panel affect the results
Note: Interactive tree growing is not working at this moment.
Cross Validation
Why cross validation? When the holdout set is small, the performance measure can be unreliable.
How:
1) Build a CHAID-type tree using the p-value associated with the chi-square or F statistic as a forward stopping rule.
2) Use v-fold cross validation: split the data into v equal sets, hold out each set in turn for validation while training on the remaining sets, then average the results.
[Figure: 5-fold scheme with folds A–E. Fold 1 trains on BCDE and validates on A; fold 2 on ACDE/B; fold 3 on ABDE/C; fold 4 on ABCE/D; fold 5 on ABCD/E.]
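A minimal sketch of the v-fold scheme in Python (scikit-learn and synthetic data stand in for the SAS tooling; all names and values are illustrative):

```python
# 5-fold cross validation: train on 4 folds, validate on the 5th, average.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data
tree = DecisionTreeClassifier(min_samples_split=20, random_state=0)

folds = KFold(n_splits=5, shuffle=True, random_state=0)  # the A..E partition
scores = cross_val_score(tree, X, y, cv=folds)           # one score per fold
print(scores.mean())                                     # averaged CV accuracy
```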
CV Program Summary
CV is most efficiently performed using the ARBORETUM procedure and SAS code. The procedure uses the p-value setting.
PREPARE DATA FOR CV
DO LOOP: vary p-value settings for the tree
    NESTED DO LOOP: 10× CV for each p-value
    END
END
SELECT BEST P-VALUE SETTING
FIT FINAL MODEL
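The same program skeleton, sketched in Python for concreteness. scikit-learn's trees have no chi-square p-value stopping rule, so the cost-complexity parameter ccp_alpha stands in for the p-value setting being tuned; the data and candidate values are illustrative.

```python
# Outer loop over candidate complexity settings; inner loop = 10-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # illustrative data

candidate_alphas = [0.0, 0.001, 0.005, 0.01, 0.05]  # stand-in for p-value settings
cv_means = []
for alpha in candidate_alphas:                      # DO LOOP: vary settings
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10)     # NESTED DO LOOP: 10x CV
    cv_means.append(scores.mean())

best = candidate_alphas[int(np.argmax(cv_means))]   # SELECT BEST SETTING
final_model = DecisionTreeClassifier(ccp_alpha=best, random_state=0).fit(X, y)  # FIT FINAL MODEL
```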
3.3 Pruning for Profit Using Cross Validation (Optional)
Demonstration – Cross Validation
Data set: INS_SMALL
SAS code: ex3.2.sas
Parameters: p-value = 0.052
Purposes:
- How a SAS-generated graph is displayed in the web browser
- How to use PROC ARBOR
- How to customize the Tree node
Configure the Tree Node
Parameters (PROC ARBOR):
- Maximum Branch = 4 (MAXBRANCH=4)
- Split Size = 80 (SPLITSIZE=80)
- Leaf Size = 40 (LEAFSIZE=40)
- Exhaustive = 0 (EXHAUST=0)
- Method = Largest (SUBTREE=largest)
- Minimum Categorical Size = 15 (MINCATSIZE=15)
- Time of Kass Adjustment = After (PADJUST=chaidafter)
Class Probability Tree
[Figure: class probability tree assessed by profit and ASE.]
Least Squares Pruning (for regression trees)
[Figure: least squares pruning illustrated on a binary target.]
What is a regression tree?
In a linear regression model, when the data have many features that interact in complicated, nonlinear ways, assembling a single global model can be very difficult. An alternative approach to nonlinear regression is to partition the space into smaller regions, where the interactions are more manageable. The subdivisions can be partitioned further (recursive partitioning) until the chunks of the space are small enough that a simple model can be fit to each. The global model then has two parts: the recursive partition itself, i.e. the regression tree, and a simple model for each cell of the partition. There are two kinds of predictive trees: regression trees and classification trees (or class probability trees).
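A small illustration of the idea, assuming scikit-learn and synthetic data (not part of the course materials): the tree recursively partitions the input range and fits the simplest possible model, a constant, in each cell.

```python
# A regression tree approximates a nonlinear function with piecewise constants.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))           # one feature, synthetic data
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)   # nonlinear target with noise

tree = DecisionTreeRegressor(max_leaf_nodes=8)  # 8 cells in the partition
tree.fit(X, y)

# Each leaf predicts the mean of its training cases: a simple model per cell.
print(tree.predict([[1.0], [4.5], [8.0]]))
```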
CART-Like Class Probability Tree Settings
[Screenshot: Decision Tree node property settings for a CART-like class probability tree.]
3.4 Compare Various Tree Settings and Performance
Demonstration – Tree Settings Comparison
Data set: CUSTOMERS (used as test data)
Purposes:
- How to use a test data set from another Data Source node
- Compare the performance of the cross-validation tree model against the model built with partitioned data
- Compare the typical decision tree model with the CHAID-like and CART-like models
Models: diagram for the case in Chapter 1.
CV Tree vs. CART-Like Class Probability Tree
[Figure: assessment comparison. One model is annotated "better" (about $200 more profit), the other "worse (overfitting?)".]
Models
- CHAID-like
- CART-like
- CART-like Class Probability
- CHAID-like + Validation Data
[Screenshots: fitted trees and results for each configuration: Decision Tree, CART-Like, CHAID-Like, CART-Like Class Probability, and CHAID-Like + Validation Data.]
Questions
- Why are the models called "CART-like" or "CHAID-like"?
- How do the settings match the features of the CHAID algorithm or the CART algorithm?
- Try fitting a tree using the entropy criterion used in machine-learning tree algorithms (e.g., C4.5/C5.0). How does it perform?
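For the last question, a minimal sketch using scikit-learn's entropy criterion (illustrative data; SAS EM exposes the analogous choice through the splitting-criterion property):

```python
# Fit and compare trees under the Gini vs. entropy splitting criteria.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    print(criterion, cross_val_score(tree, X, y, cv=10).mean())
```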