Lecture Notes 4 Pruning Zhangxi Lin ISQS

1 Lecture Notes 4: Pruning
Zhangxi Lin, ISQS 7342-001, Texas Tech University
Note: Most slides are from Decision Tree Modeling by SAS

2 Objectives
Understand how CHAID, CART, and other variations finalize a decision tree by pruning
  Pre-pruning vs. post-pruning (top-down vs. bottom-up)
  Cross validation
Understand the use of tree modeling parameters
  Prior probabilities
  Decision weights
  Kass adjustment
Examine the performance of different tree modeling configurations with SAS Enterprise Miner 5.2
Know how Proc ARBORETUM is used

3 Chapter 3: Pruning
3.1 Pruning
3.2 Pruning for Profit
3.3 Pruning for Profit Using Cross Validation (Optional)
3.4 Compare Various Tree Settings and Performance

5 Maximal Tree A maximal classification tree gives 100%
accuracy on the training data and has no residual variability.

6 Overfitting
[Figure panels: Training Data, New Data]
A maximal tree is the result of overfitting.

7 Underfitting
[Figure panels: Training Data, New Data]
A small tree with only a few branches may underfit the data.

8 The “Right-Sized” Tree
Top-Down Stopping Rules (Pre-Pruning)
  Node size
  Tree depth
  Statistical significance
Bottom-Up Selection Criteria (Post-Pruning)
  Accuracy
  Profit
  Class-probability trees
  Least squares

9 Top-Down Pruning
[Figure: tree diagram annotated with numeric values at each node; the individual values are not recoverable from the transcript.]

10 Depth Multiplier
The depth adjustment = p-value × depth multiplier
Example: depth multiplier 48 = 2 × 2 × 4 × 3
[Figure: tree diagram showing depth multipliers 1, 3, 6, 12, 36, 24, 48.]
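
The depth adjustment above can be sketched in a few lines of Python. This is only an illustration: it assumes the depth multiplier is the product of the branch counts on the path from the root (which is consistent with the slide's 48 = 2 × 2 × 4 × 3), and the cap at 1 and the function names are assumptions of this sketch, not SAS behavior confirmed by the slides.

```python
from math import prod

def depth_multiplier(branch_counts):
    """Product of the branch counts of the splits above a node (1 at the root)."""
    return prod(branch_counts)

def depth_adjusted_p_value(p_value, branch_counts):
    """Raw p-value times the depth multiplier; capped at 1 (an assumption here)."""
    return min(1.0, p_value * depth_multiplier(branch_counts))

path = [2, 2, 4, 3]                      # branch counts from the slide's example
multiplier = depth_multiplier(path)      # 2 * 2 * 4 * 3
adjusted = depth_adjusted_p_value(0.001, path)
print(multiplier, adjusted)
```

A split that looks significant near the root can fail the same significance test deep in the tree, because its p-value is inflated by the multiplier.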

11 Tree Node Defaults and Options
[Figure: Decision Tree node property groups: Splitting Rule, Node, Split Search, Subtree, P-Value Adjustment.]

12 Top-Down Pruning Options
Maximum Depth: the default maximum depth in the Decision Tree node is 6.
Split Size: the smallest number of training observations that a node must have to be considered for splitting. Valid values are between 2 and [upper bound missing in the transcript].
Significance Level: the liberal significance level of .2 (logworth = 0.7) is the default.
Split Adjustment (in the P-Value Adjustment properties): the depth multiplier is applied by default and can be turned off with this option.
Inputs (in the P-Value Adjustment properties): a further adjustment for the number of inputs available at a node for splitting. It is not activated by default (Inputs=No).
SubTree: to specify pre-pruning only, set the SubTree option to Largest.
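
As a rough illustration of how these stopping rules combine, the sketch below computes logworth as -log10(p) and checks a candidate split against the three top-down rules. The function `may_split`, its argument names, and the convention that the root has depth 0 are hypothetical; only the defaults of 6 and .2 come from the slide.

```python
from math import log10

def logworth(p_value):
    """Logworth is the negative base-10 logarithm of the p-value."""
    return -log10(p_value)

def may_split(n_obs, depth, p_value, *, split_size,
              max_depth=6, significance=0.2, multiplier=1):
    """Return True if a node passes all three pre-pruning checks.

    depth counts the root as 0 (an assumption of this sketch);
    multiplier is the depth multiplier applied to the raw p-value.
    """
    adjusted = min(1.0, p_value * multiplier)
    return (n_obs >= split_size
            and depth < max_depth
            and adjusted <= significance)

print(round(logworth(0.2), 1))                                # threshold on the logworth scale
print(may_split(100, 2, 0.01, split_size=80, multiplier=12))  # adjusted p = 0.12
print(may_split(100, 2, 0.01, split_size=80, multiplier=48))  # adjusted p = 0.48
```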

13 Bottom-Up Pruning
[Figure: Performance plotted against the number of Leaves, with curves labeled Training Data and Generalization.]

14 Top-down vs. Bottom-up
Top-down pruning is usually faster but less effective than bottom-up pruning.
Breiman and Friedman, in their criticism of the FACT tree algorithm (Loh and Vanichsetakul 1988):
“Each stopping rule was tested on hundreds of simulated data sets with different structures. Each new stopping rule failed on some data set. It was not until a very large tree was built and then pruned, using cross-validation to govern the degree of pruning, that we observed something that worked consistently.”

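15 Model Selection Criteria
Each cell is training/validation. The Accuracy and Profit rows have only four values in the transcript, so their alignment with the Leaves row is uncertain.
Leaves:    5        4        3        2        1
Accuracy:  .90/.88  .89/.91  .88/.91  .59/.64
Profit:    .51/.43  .51/.40  .49/.44  .04/ .1
ASE:       .17/.15  .18/.14  .19/.16  .20/.16  .48/.46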
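
A minimal sketch of bottom-up selection: score each candidate subtree on validation data and keep the best one. The validation ASE values are read from the slide's ASE row, assuming its five values line up with 5, 4, 3, 2, and 1 leaves in that order. Breaking ties in favor of the smaller tree is an assumption of this sketch, and `select_subtree` is a hypothetical helper name.

```python
# Validation ASE by number of leaves (alignment with leaf counts is assumed).
validation_ase = {5: 0.15, 4: 0.14, 3: 0.16, 2: 0.16, 1: 0.46}

def select_subtree(scores, higher_is_better):
    """Return the leaf count with the best score; ties go to the smaller tree."""
    sign = -1 if higher_is_better else 1
    return min(scores, key=lambda leaves: (sign * scores[leaves], leaves))

best_by_ase = select_subtree(validation_ase, higher_is_better=False)
print(best_by_ase)
```

The same helper works for accuracy or profit by passing `higher_is_better=True`; different criteria can pick different subtrees from the same sequence.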

16 Bottom-up Selection Criteria
The default tree selection criterion is Decision. The final tree will be selected based upon profit or loss if a decision matrix has been specified. The Lift criterion of Assessment Measure enables the user to restrict assessment to a specified proportion of the data. By default Assessments Fraction is set to 0.25.

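17 Effect of Prior Probabilities: Confusion Matrix
[Table: confusion matrix of Actual Class by Decision/Action, with counts corrected for the prior probabilities. The two symbols indexed by class i were lost in extraction: one denotes the proportion of class i in the population of the original data, the other the proportion of class i in the sample.]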
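
One common way to apply such a correction is to rescale each actual-class row of the confusion matrix by the ratio of the population proportion to the sample proportion. The sketch below assumes that form. The counts are made up for illustration, and assigning the prior 0.02 to class 1 is also an assumption.

```python
def adjust_for_priors(counts, priors):
    """Rescale confusion-matrix rows so class shares match the priors.

    counts[i][j]: sample count with actual class i and decision j.
    priors[i]:    proportion of class i in the original population.
    """
    total = sum(sum(row.values()) for row in counts.values())
    adjusted = {}
    for actual, row in counts.items():
        sample_share = sum(row.values()) / total
        weight = priors[actual] / sample_share
        adjusted[actual] = {decision: n * weight for decision, n in row.items()}
    return adjusted

# Made-up balanced sample: 500 observations of each actual class.
counts = {1: {1: 400, 0: 100}, 0: {1: 50, 0: 450}}
priors = {1: 0.02, 0: 0.98}
adjusted = adjust_for_priors(counts, priors)
for actual, row in adjusted.items():
    print(actual, row)
```

After the correction the rare class contributes far less to any accuracy or profit figure than it did in the oversampled data.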

18 Tree Accuracy
Tree accuracy is the accuracy of each leaf, weighted by the size of the leaf.
[Figure: tree with leaves t1, t2, t3.]

19 Maximize Accuracy
Leaf  Data  1:    0:   tot:  Class
1     Tr    85%   15%  42%   1
1     Va    83%   17%  40%   1
2     Tr    8.6%  91%  58%   0
2     Va    3.4%  97%  60%   0
Training Accuracy = (.42)(.85) + (.58)(.91) = .88
Validation Accuracy = (.40)(.83) + (.60)(.97) = .91
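
The arithmetic on this slide can be checked with a short sketch. Each leaf contributes its accuracy weighted by its share of the observations. The function name `tree_accuracy` is illustrative only.

```python
def tree_accuracy(leaves):
    """leaves: (share_of_observations, leaf_accuracy) pairs."""
    return sum(share * accuracy for share, accuracy in leaves)

training = tree_accuracy([(0.42, 0.85), (0.58, 0.91)])
validation = tree_accuracy([(0.40, 0.83), (0.60, 0.97)])
print(round(training, 2), round(validation, 2))
```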

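20 Profit Matrix
[Table: profit matrix with Actual Class against Decision; the cell values are not recoverable from the transcript.]
Bayes Rule: Decision 1 if [condition missing in the transcript]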

21 Maximize Profit
Leaf  Data  1:    0:   tot:  Profit  Class
1     Tr    85%   15%  42%   1.18    1
1     Va    83%   18%  40%   1.11    1
2     Tr    8.6%  91%  58%   .78
2     Va    3.4%  97%  60%   .91
The slide labels its profit columns P1: and P0:.
Training Profit = (.42)(1.18) + (.58)(0) = .50
Validation Profit = (.40)(1.11) + (.60)(0) = .44
[Table: Profit Matrix, actual by predicted; only the values 1.56 and 1 survive in the transcript.]

22 Chapter 3: Pruning
3.1 Pruning
3.2 Pruning for Profit
3.3 Pruning for Profit Using Cross Validation (Optional)
3.4 Compare Various Tree Settings and Performance

23 Demonstration – Pruning for Profit
Data set: INSURANCE
Parameters
  Prior probabilities: (0.02, 0.98)
  Decision weights: $150, -$3
Purposes
  To get familiar with defining prior probabilities for the target variable (recall how this is done in SAS EM 4.3)
  To view the results of the tree node
  To understand how the parameters defined in the tree node panel affect the results
Note: Interactive tree growing is not working at this moment

24 Cross Validation
The data is split into five sets, A B C D E:
     Train  Validate
1)   BCDE   A
2)   ACDE   B
3)   ABDE   C
4)   ABCE   D
5)   ABCD   E
Why cross validation?
  When the holdout set is small, the performance measure can be unreliable.
How
  1) Build a CHAID-type tree using the p-value associated with the chi-square or F statistic as a forward stopping rule.
  2) Use v-fold cross validation: split the data into several equal sets, use each set in turn for validation while training on the rest, then average the results.
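
The fold scheme on this slide can be sketched in Python. The helper names `k_fold_splits` and `cross_validated_score` are hypothetical, and `fit_and_score` stands in for whatever model fitting and assessment is being cross-validated.

```python
def k_fold_splits(items, k):
    """Yield (train, validate) pairs; each of the k folds validates once."""
    folds = [items[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validate

def cross_validated_score(items, k, fit_and_score):
    """Average the validation score over the k folds."""
    scores = [fit_and_score(train, validate)
              for train, validate in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(list("ABCDE"), 5))
for train, validate in splits:
    print("".join(train), "".join(validate))
```

With the five sets A to E this reproduces the Train/Validate pairing from the slide: BCDE with A, ACDE with B, and so on.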

25 CV Program Summary
CV is most efficiently performed using the ARBORETUM procedure and SAS code. The procedure uses the p-value setting.
PREPARE DATA FOR CV
DO LOOP: vary p-value settings for tree
  NESTED DO LOOP: 10x CV for each p-value
  END
END
SELECT BEST P-VALUE SETTING
FIT FINAL MODEL
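
The loop structure above can be mirrored in Python, purely as a sketch of the control flow rather than of the SAS program. `select_setting` and `toy_fit_and_score` are hypothetical names. The toy scorer (a shrunken mean with negative squared error) is only a stand-in for fitting a tree at a given p-value setting, and the example uses 5 folds on 10 numbers instead of the slide's 10-fold CV.

```python
from statistics import mean

def folds(data, k):
    """Yield (train, validate) pairs for k-fold cross validation."""
    for i in range(k):
        validate = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, validate

def select_setting(data, settings, k, fit_and_score):
    """Outer loop over settings, inner loop over folds; higher score is better."""
    cv_scores = {}
    for setting in settings:                          # DO LOOP
        fold_scores = [fit_and_score(train, validate, setting)
                       for train, validate in folds(data, k)]  # NESTED DO LOOP
        cv_scores[setting] = mean(fold_scores)
    best = max(cv_scores, key=cv_scores.get)          # SELECT BEST SETTING
    return best, cv_scores

def toy_fit_and_score(train, validate, shrink):
    """Stand-in model: predict shrink * mean(train); score is negative MSE."""
    prediction = shrink * mean(train)
    return -mean((y - prediction) ** 2 for y in validate)

data = list(range(1, 11))
best, cv_scores = select_setting(data, [0.0, 0.5, 1.0], 5, toy_fit_and_score)
print(best, cv_scores[best])
```

The final step in the summary, fitting the final model, would refit on all of the data using the selected setting.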

26 Chapter 3: Pruning
3.1 Pruning
3.2 Pruning for Profit
3.3 Pruning for Profit Using Cross Validation (Optional)
3.4 Compare Various Tree Settings and Performance

27 Demonstration – Cross Validation
Data set: INS_SMALL
SAS Code: ex3.2.sas
Parameters: p-value = 0.052
Purposes:
  How a SAS-generated graph is displayed in the web browser
  How to use PROC ARBOR
  How to customize the tree node

28 Configure the tree node
Parameters (Proc ARBOR)
  Maximum Branch=4 (MAXBRANCH=4)
  Split Size=80 (SPLITSIZE=80)
  Leaf Size (LEAFSIZE=40)
  Exhaustive=0 (EXHAUST=0)
  Method=Largest (SUBTREE=largest)
  Minimum Categorical Size (MINCATSIZE=15)
  Time of Kass Adjustment=after (PADJUST=chaidafter)

29 Class Probability Tree
Profit ASE

30 Least Squares Pruning (for regression trees)
Binary Target

31 What is a regression tree?
When the data has many features that interact in complicated, nonlinear ways, assembling a single global model such as a linear regression can be very difficult. An alternative approach to nonlinear regression is to partition the space into smaller regions where the interactions are more manageable. The sub-divisions can be partitioned further (recursive partitioning) until the chunks of the space are simple enough that each can be fit with a simple model.
The resulting model has two parts: the recursive partition, i.e. the regression tree, and a simple model for each cell of the partition.
There are two kinds of predictive trees: regression trees and classification trees (or class probability trees).
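
A minimal sketch of recursive partitioning for a single numeric input, assuming the simplest choice of model in each cell (the mean of the target) and a least-squares split criterion. The function names, the dictionary representation of the tree, and the `max_depth` and `min_leaf` defaults are all illustrative choices.

```python
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(points, depth=0, max_depth=3, min_leaf=2):
    """Recursively partition (x, y) points; each leaf predicts its mean y."""
    points = sorted(points)
    ys = [y for _, y in points]
    best = None
    if depth < max_depth:
        for i in range(min_leaf, len(points) - min_leaf + 1):
            if points[i - 1][0] == points[i][0]:
                continue  # cannot split between equal x values
            cost = sse(ys[:i]) + sse(ys[i:])
            if best is None or cost < best[0]:
                best = (cost, i)
    if best is None or best[0] >= sse(ys):
        return {"predict": mean(ys)}  # no useful split: make a leaf
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2,
        "left": grow(points[:i], depth + 1, max_depth, min_leaf),
        "right": grow(points[i:], depth + 1, max_depth, min_leaf),
    }

def predict(tree, x):
    """Walk down the partition to a leaf and return its prediction."""
    while "threshold" in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["predict"]

# A step function: y is 1 for x < 4 and 5 otherwise.
data = [(x, 1 if x < 4 else 5) for x in range(8)]
tree = grow(data)
print(tree["threshold"], predict(tree, 2), predict(tree, 6))
```

The two parts named in the text are visible here: the nested thresholds are the recursive partition, and the per-leaf mean is the simple model for each cell.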

32 CART-Like Class Probability Tree Settings

33 Chapter 3: Pruning
3.1 Pruning
3.2 Pruning for Profit
3.3 Pruning for Profit Using Cross Validation (Optional)
3.4 Compare Various Tree Settings and Performance

34 Demonstration – Tree Settings Comparison
Data set: CUSTOMERS (used as test data)
Purposes:
  How to use a test data set from another data source node
  Compare the performance of the cross validation tree model and the model using partitioned data
  Compare the typical decision tree model with the CHAID and CART models

35 Models Diagram for the case in Chapter 1

36 CV Tree vs. CART-Like Class Probability Tree
Better Worse (overfitting?) $200 more

37 Models

38 CHAID-like

39 CART-like

40 CART-like Class Probability

41 CHAID-like + Validation Data

42 Decision Tree

43 CART-Like

44 CHAID-Like

45 CART-Like Class Probability

46 CHAID-Like + Validation Data

47 Questions
Why is the model called “CART-like” or “CHAID-like”?
How do the settings match the features of the CHAID algorithm or the CART algorithm?
Try fitting a tree using the entropy criterion used in machine learning tree algorithms (e.g. C4.5/5.0). How does it perform?
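
As a starting point for the last question, the sketch below computes the impurity reduction of a binary-target split under the entropy criterion and, for comparison, under the Gini index used by CART. The leaf proportions are the training values from the Maximize Accuracy slide. The helper names are hypothetical, and this is not how SAS Enterprise Miner computes its split worth.

```python
from math import log2

def entropy(p1):
    """Entropy in bits of a binary target with class-1 proportion p1."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))

def gini(p1):
    """Gini index of a binary target with class-1 proportion p1."""
    return 2 * p1 * (1 - p1)

def split_gain(impurity, children):
    """Impurity reduction of a split; children are (share, p1) pairs."""
    parent_p1 = sum(share * p1 for share, p1 in children)
    weighted = sum(share * impurity(p1) for share, p1 in children)
    return impurity(parent_p1) - weighted

children = [(0.42, 0.85), (0.58, 0.086)]   # training leaves from slide 19
entropy_gain = split_gain(entropy, children)
gini_gain = split_gain(gini, children)
print(entropy_gain, gini_gain)
```

Both criteria reward the same kind of split, one that produces purer children, but they scale impurity differently and so can rank competing splits differently.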

