Lecture Notes 4: Pruning
Zhangxi Lin, ISQS 7342-001, Texas Tech University
Note: Most slides are from Decision Tree Modeling by SAS
Objectives
- Understand how the CHAID and CART algorithms, and other variations, finalize a decision tree by pruning
  - Pre-pruning vs. post-pruning (top-down vs. bottom-up)
  - Cross validation
- Understand the use of tree modeling parameters
  - Prior probabilities
  - Decision weights
  - Kass adjustment
- Examine the performance of different tree modeling configurations with SAS Enterprise Miner 5.2
- Know how PROC ARBORETUM is used
Chapter 3: Pruning
3.1 Pruning
3.2 Pruning for Profit
3.3 Pruning for Profit Using Cross Validation (Optional)
3.4 Compare Various Tree Settings and Performance
3.1 Pruning
Maximal Tree A maximal classification tree gives 100% accuracy on the training data and has no residual variability.
Overfitting
[Figure: a maximal tree's classification boundaries on training data vs. new data.]
A maximal tree is the result of overfitting: it fits the training data closely but generalizes poorly to new data.
Underfitting
[Figure: a small tree's classification boundaries on training data vs. new data.]
A small tree with only a few branches may underfit the data, missing real structure.
The “Right-Sized” Tree
Top-Down Stopping Rules (Pre-Pruning):
- Node size
- Tree depth
- Statistical significance
Bottom-Up Selection Criteria (Post-Pruning):
- Accuracy
- Profit
- Class-probability trees
- Least squares
(See the sketch below for how the two families of controls differ.)
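To make the two families concrete, here is a sketch in scikit-learn terms, a stand-in for the SAS Enterprise Miner settings; the data and parameter values are illustrative. Stopping rules are set before growth, while bottom-up selection prunes a maximal tree after growth.

```python
# Pre-pruning (stopping rules) vs. post-pruning (bottom-up subtree selection).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Top-down stopping rules: limits on tree depth and node size.
pre = DecisionTreeClassifier(max_depth=6, min_samples_split=20).fit(X, y)

# Bottom-up selection: grow a maximal tree, then pick a pruned subtree
# from the cost-complexity path (an accuracy/least-squares style criterion).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```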
Top-Down Pruning
[Figure: example tree with candidate splits annotated with their test statistics (logworths) and node sizes; splitting stops top-down when no candidate split is significant enough.]
Depth Multiplier
The depth adjustment multiplies a split's p-value by the depth multiplier: adjusted p-value = p-value × depth multiplier. The multiplier for a node is the product of the number of branches of each split on the path from the root to that node.
[Figure: tree annotated with depth multipliers 1, 3, 6, 12, 24, 36, 48 at successive nodes.]
Example: a depth multiplier of 48 = 2 × 2 × 4 × 3 arises at a node reached through splits with 2, 2, 4, and 3 branches.
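A tiny sketch of the adjustment (the helper names are hypothetical; the branch counts are taken from the example above):

```python
# Depth multiplier: product of branch counts on the path from the root.
import math

def depth_multiplier(branches_on_path):
    """branches_on_path: number of branches of each split from root to node."""
    return math.prod(branches_on_path)

def depth_adjusted_p(p_value, branches_on_path):
    # Adjusted p-value = raw p-value x depth multiplier (capped at 1).
    return min(1.0, p_value * depth_multiplier(branches_on_path))

print(depth_multiplier([2, 2, 4, 3]))         # 48, as in the slide
print(depth_adjusted_p(0.001, [2, 2, 4, 3]))  # 0.048
```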
Tree Node Defaults and Options
Property groups: Splitting Rule, Node, Split Search, Subtree, P-Value Adjustment.
Top-Down Pruning Options
- The default maximum depth in the Decision Tree node is 6. The value can be changed with the Maximum Depth option.
- The Split Size option specifies the smallest number of training observations that a node must have to be considered for splitting. Valid values are between 2 and 32767.
- The liberal significance level of .2 (logworth = -log10(.2) ≈ 0.7) is the default. It can be changed with the Significance Level option.
- By default, the depth multiplier is applied. It can be turned off with the Split Adjustment option in the P-Value Adjustment properties.
- A further adjustment for the number of inputs available at a node for splitting can be used. This option is in the P-Value Adjustment properties and is not activated by default (Inputs = No).
- To specify pre-pruning only, set the SubTree option to Largest.
Bottom-Up Pruning
[Figure: performance versus number of leaves. Performance on the training data keeps improving as leaves are added, while generalization performance peaks and then declines; bottom-up pruning selects the subtree at the peak.]
Top-down vs. Bottom-up
Top-down pruning is usually faster, but less effective, than bottom-up pruning.
Breiman and Friedman, in their criticism of the FACT tree algorithm (Loh and Vanichsetakul 1988): "Each stopping rule was tested on hundreds of simulated data sets with different structures. Each new stopping rule failed on some data set. It was not until a very large tree was built and then pruned, using cross-validation to govern the degree of pruning, that we observed something that worked consistently."
Model Selection Criteria
[Table: training/validation Accuracy, Profit, and ASE for candidate subtrees with 1 to 5 leaves (e.g., ASE falls from .48/.46 at the root to .17/.15 at 5 leaves); different selection criteria can favor different subtrees.]
Bottom-up Selection Criteria
- The default tree selection criterion is Decision: the final tree is selected based on profit or loss if a decision matrix has been specified.
- The Lift criterion of Assessment Measure enables the user to restrict assessment to a specified proportion of the data. By default, Assessment Fraction is set to 0.25.
Effect of Prior Probabilities: Confusion Matrix
[Table: confusion matrix of actual class by decision/action, with counts corrected for the priors: each actual-class-i count is weighted by πᵢ/ρᵢ.]
πᵢ – proportion of class i in the original population; ρᵢ – proportion of class i in the sample.
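A minimal numeric sketch of this correction (the counts below are made up; the priors 0.98/0.02 come from the INSURANCE demonstration later in this chapter):

```python
# Re-weight a confusion matrix so class frequencies match the population.
import numpy as np

counts = np.array([[40, 10],    # rows: actual class 0, 1 (illustrative counts)
                   [15, 35]])   # cols: decision 0, 1
rho = counts.sum(axis=1) / counts.sum()   # class proportions in the sample
pi = np.array([0.98, 0.02])               # prior (population) proportions

corrected = counts * (pi / rho)[:, None]  # scale each actual-class row by pi_i/rho_i
print(corrected)
```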
Tree Accuracy
[Figure: tree with leaves t1, t2, t3.]
Tree accuracy is each leaf's accuracy weighted by the proportion of cases in that leaf.
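Written as a formula, with $P(t)$ the proportion of cases that fall into leaf $t$:

$$\text{Accuracy}(T) = \sum_{t \in \text{leaves}(T)} P(t)\,\text{Accuracy}(t)$$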
Maximize Accuracy
[Figure: two-leaf tree. Leaf 1 (classified 1): training 85% class 1 / 15% class 0, 42% of cases; validation 83% / 17%, 40% of cases. Leaf 2 (classified 0): training 8.6% / 91%, 58% of cases; validation 3.4% / 97%, 60% of cases.]
Training Accuracy = (.42)(.85) + (.58)(.91) = .88
Validation Accuracy = (.40)(.83) + (.60)(.97) = .91
Profit Matrix
[Table: profit matrix with rows = actual class and columns = decision; entry Profit(actual, decision) gives the profit of that combination.]
Bayes rule: make decision 1 if its expected profit beats that of decision 0, i.e. if
$p_1\,\text{Profit}(1,1) + (1-p_1)\,\text{Profit}(0,1) > p_1\,\text{Profit}(1,0) + (1-p_1)\,\text{Profit}(0,0)$,
where $p_1$ is the proportion of class 1 in the leaf.
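Solving the inequality for $p_1$ gives a probability cutoff. As a worked example with the decision weights from the INSURANCE demonstration later in this chapter, and assuming $150 is the profit of deciding 1 for an actual 1, -$3 the profit of deciding 1 for an actual 0, and decision 0 worth 0 (this mapping onto the matrix is an assumption):

$$150\,p_1 - 3\,(1 - p_1) > 0 \;\Longleftrightarrow\; p_1 > \frac{3}{153} \approx 0.02$$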
Maximize Profit
[Figure: the same two-leaf tree. Leaf 1: training 85% class 1 / 15% class 0, 42% of cases; validation 83% / 17%, 40% of cases. Leaf 2: training 8.6% / 91%, 58% of cases; validation 3.4% / 97%, 60% of cases.]
Profit matrix (actual × predicted): deciding 1 earns 1.56 for an actual 1 and -1 for an actual 0; deciding 0 earns 0 either way.
Expected profit of deciding 1: leaf 1 = 1.18 (training) and 1.11 (validation), so leaf 1 is classified 1; leaf 2 = -.78 (training) and -.91 (validation), so leaf 2 is classified 0, with profit 0.
Training Profit = (.42)(1.18) + (.58)(0) = .50
Validation Profit = (.40)(1.11) + (.60)(0) = .44
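A sketch of these calculations in Python (the profit-matrix values 1.56 and -1 follow the reconstruction above; leaf proportions are from the slide):

```python
# Expected profit per leaf under the profit matrix, then tree-level profit.
profit_decide_1 = {1: 1.56, 0: -1.0}   # profit by actual class when deciding 1
profit_decide_0 = {1: 0.0, 0: 0.0}     # deciding 0 earns nothing either way

def leaf_profit(p1):
    """Best achievable expected profit in a leaf with P(class 1) = p1."""
    e1 = p1 * profit_decide_1[1] + (1 - p1) * profit_decide_1[0]
    e0 = p1 * profit_decide_0[1] + (1 - p1) * profit_decide_0[0]
    return max(e1, e0)                 # Bayes rule: take the better decision

# Leaf 1 holds 42% of training data with p1=.85; leaf 2 holds 58% with p1=.086.
tree_profit = 0.42 * leaf_profit(0.85) + 0.58 * leaf_profit(0.086)
print(round(tree_profit, 2))  # ~0.49 (the slide's .50 rounds leaf profit to 1.18 first)
```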
3.2 Pruning for Profit
Demonstration – Pruning for Profit
Data set: INSURANCE
Parameters:
- Prior probabilities: (0.02, 0.98)
- Decision weights: $150, -$3
Purposes:
- Get familiar with defining prior probabilities for the target variable (recall how this is done in SAS EM 4.3)
- View the results of the Tree node
- Understand how parameters defined in the Tree node panel affect the results
Note: Interactive tree growing is not working at this moment.
Cross Validation
Why cross validation? When the holdout set is small, the performance measure can be unreliable.
How:
1) Build a CHAID-type tree using the p-value associated with the chi-square or F statistic as a forward stopping rule.
2) Use v-fold cross validation: split the data into v equal sets, hold out each set in turn for validation while training on the remaining sets, then average the results.
[Figure: 5-fold scheme with folds A–E. Fold 1 trains on BCDE and validates on A; fold 2 on ACDE/B; fold 3 on ABDE/C; fold 4 on ABCE/D; fold 5 on ABCD/E.]
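A minimal sketch of the v-fold scheme in Python (scikit-learn and synthetic data stand in for the SAS tooling; all names and values are illustrative):

```python
# 5-fold cross validation: train on 4 folds, validate on the 5th, average.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data
tree = DecisionTreeClassifier(min_samples_split=20, random_state=0)

folds = KFold(n_splits=5, shuffle=True, random_state=0)  # the A..E partition
scores = cross_val_score(tree, X, y, cv=folds)           # one score per fold
print(scores.mean())                                     # averaged CV accuracy
```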
CV Program Summary
CV is most efficiently performed using the ARBORETUM procedure and SAS code. The procedure uses the p-value setting.
PREPARE DATA FOR CV
DO LOOP: vary p-value settings for the tree
    NESTED DO LOOP: 10× CV for each p-value
    END
END
SELECT BEST P-VALUE SETTING
FIT FINAL MODEL
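The same program skeleton, sketched in Python for concreteness. scikit-learn's trees have no chi-square p-value stopping rule, so the cost-complexity parameter ccp_alpha stands in for the p-value setting being tuned; the data and candidate values are illustrative.

```python
# Outer loop over candidate complexity settings; inner loop = 10-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # illustrative data

candidate_alphas = [0.0, 0.001, 0.005, 0.01, 0.05]  # stand-in for p-value settings
cv_means = []
for alpha in candidate_alphas:                      # DO LOOP: vary settings
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10)     # NESTED DO LOOP: 10x CV
    cv_means.append(scores.mean())

best = candidate_alphas[int(np.argmax(cv_means))]   # SELECT BEST SETTING
final_model = DecisionTreeClassifier(ccp_alpha=best, random_state=0).fit(X, y)  # FIT FINAL MODEL
```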
3.3 Pruning for Profit Using Cross Validation (Optional)
Demonstration – Cross Validation
Data set: INS_SMALL
SAS code: ex3.2.sas
Parameters: p-value = 0.052
Purposes:
- How a SAS-generated graph is displayed in the web browser
- How to use PROC ARBOR
- How to customize the Tree node
Configure the Tree Node
Parameters (PROC ARBOR):
- Maximum Branch = 4 (MAXBRANCH=4)
- Split Size = 80 (SPLITSIZE=80)
- Leaf Size = 40 (LEAFSIZE=40)
- Exhaustive = 0 (EXHAUST=0)
- Method = Largest (SUBTREE=largest)
- Minimum Categorical Size = 15 (MINCATSIZE=15)
- Time of Kass Adjustment = After (PADJUST=chaidafter)
Class Probability Tree
[Figure: class probability tree assessed by profit and ASE.]
Least Squares Pruning (for regression trees)
[Figure: least squares pruning illustrated on a binary target.]
What is a regression tree?
In a linear regression model, when the data have many features that interact in complicated, nonlinear ways, assembling a single global model can be very difficult. An alternative approach to nonlinear regression is to partition the space into smaller regions, where the interactions are more manageable. The subdivisions can be partitioned further (recursive partitioning) until the chunks of the space are small enough that a simple model can be fit to each. The global model then has two parts: the recursive partition itself, i.e. the regression tree, and a simple model for each cell of the partition. There are two kinds of predictive trees: regression trees and classification trees (or class probability trees).
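A small illustration of the idea, assuming scikit-learn and synthetic data (not part of the course materials): the tree recursively partitions the input range and fits the simplest possible model, a constant, in each cell.

```python
# A regression tree approximates a nonlinear function with piecewise constants.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))           # one feature, synthetic data
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)   # nonlinear target with noise

tree = DecisionTreeRegressor(max_leaf_nodes=8)  # 8 cells in the partition
tree.fit(X, y)

# Each leaf predicts the mean of its training cases: a simple model per cell.
print(tree.predict([[1.0], [4.5], [8.0]]))
```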
CART-Like Class Probability Tree Settings
[Screenshot: Decision Tree node property settings for a CART-like class probability tree.]
3.4 Compare Various Tree Settings and Performance
Demonstration – Tree Settings Comparison
Data set: CUSTOMERS (used as test data)
Purposes:
- How to use a test data set from another Data Source node
- Compare the performance of the cross-validation tree model against the model built with partitioned data
- Compare the typical decision tree model with the CHAID-like and CART-like models
Models: diagram for the case in Chapter 1.
CV Tree vs. CART-Like Class Probability Tree
[Figure: assessment comparison. One model is annotated "better" (about $200 more profit), the other "worse (overfitting?)".]
Models
- CHAID-like
- CART-like
- CART-like Class Probability
- CHAID-like + Validation Data
[Screenshots: fitted trees and results for each configuration: Decision Tree, CART-Like, CHAID-Like, CART-Like Class Probability, and CHAID-Like + Validation Data.]
Questions
- Why are the models called "CART-like" or "CHAID-like"?
- How do the settings match the features of the CHAID algorithm or the CART algorithm?
- Try fitting a tree using the entropy criterion used in machine-learning tree algorithms (e.g., C4.5/C5.0). How does it perform?
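For the last question, a minimal sketch using scikit-learn's entropy criterion (illustrative data; SAS EM exposes the analogous choice through the splitting-criterion property):

```python
# Fit and compare trees under the Gini vs. entropy splitting criteria.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    print(criterion, cross_val_score(tree, X, y, cv=10).mean())
```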