Machine Learning in Real World: CART

2 Outline
- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
- Finding Optimal Tree

3 CART – Classification And Regression Tree
- Developed by 4 statistics professors: Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
- Focused on accurate assessment when data is noisy
- Currently distributed by Salford Systems

4 CART Tutorial Data: Gymtutor (CART HELP, Sec 3 in CARTManual.pdf)
- ANYRAQT – Racquetball usage (binary indicator coded 0, 1)
- ONAER – Number of on-peak aerobics classes attended
- NSUPPS – Number of supplements purchased
- OFFAER – Number of off-peak aerobics classes attended
- NFAMMEM – Number of family members
- TANNING – Number of visits to tanning salon
- ANYPOOL – Pool usage (binary indicator coded 0, 1)
- SMALLBUS – Small business discount (binary indicator coded 0, 1)
- FIT – Fitness score
- HOME – Home ownership (binary indicator coded 0, 1)
- PERSTRN – Personal trainer (binary indicator coded 0, 1)
- CLASSES – Number of classes taken
- SEGMENT – Member's market segment (1, 2, 3) – the target

5 View data  CART Menu: View -> Data Info …

6 CART Example: Gymtutor

7 CART Model Setup
- Target – required
- Predictors (default – all)
- Categorical predictors: ANYRAQT, ANYPOOL, SMALLBUS, HOME
  - a field is treated as categorical if its name ends in "$", or it can be set from its values
- Testing
  - default – 10-fold cross-validation
- …

8 Sample Tree

9 Color-coding using class

10 Decision Tree: Splitters

11 Tree Details

12 Tree Summary Reports

13 Pruning the tree

14 Keeping only important variables

15 Revised Tree

16 Automating CART: Command Log

Key CART Features
- Automated field selection
  - handles any number of fields
  - automatically selects relevant fields
- No data preprocessing needed
  - does not require any kind of variable transforms
  - impervious to outliers
- Missing value tolerant
  - moderate loss of accuracy due to missing values

CART: Key Parts of Tree-Structured Data Analysis
- Tree growing
  - splitting rules to generate the tree
  - stopping criteria: how far to grow?
  - missing values: using surrogates
- Tree pruning
  - trimming off parts of the tree that don't work
  - ordering the nodes of a large tree by contribution to tree accuracy … which nodes come off first?
- Optimal tree selection
  - deciding on the best tree after growing and pruning
  - balancing simplicity against accuracy

CART is a Form of Binary Recursive Partitioning
- Data is split into two partitions
  - Q: Does C4.5 always have binary partitions?
- Partitions can also be split into sub-partitions
  - hence the procedure is recursive
- A CART tree is generated by repeated partitioning of the data set
  - the parent gets two children
  - each child produces two grandchildren
  - the four grandchildren produce eight great-grandchildren
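As a quick illustration of this structure (not the Salford Systems product used in this deck), scikit-learn's CART-style DecisionTreeClassifier can be fit on a bundled data set and its nested yes/no splits printed, with every parent node producing exactly two children:

```python
# Binary recursive partitioning with scikit-learn's CART-style tree
# (an illustrative stand-in for the Salford CART product).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each internal node asks one yes/no question and has exactly two children;
# the children are split again, so the partitioning is recursive.
print(export_text(tree, feature_names=list(iris.feature_names)))
```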

Splits Are Always Determined by Questions with YES/NO Answers
- Is continuous variable X ≤ c?
- Does categorical variable D take on level i, j, or k?
  - e.g. is GENDER M or F?
- Standard split:
  - if the answer to the question is YES, a case goes left; otherwise it goes right
  - this is the form of all primary splits
  - example: Is AGE ≤ 62.5?
- More complex conditions are possible:
  - Boolean combinations: AGE <= 62 OR BP <= 91
  - Linear combinations: 0.66*AGE - 0.75*BP < -40

Searching All Possible Splits
- For any node, CART will examine ALL possible splits
  - CART allows the search over a random sample if desired
- Look at the first variable in our data set, AGE, with minimum value 40
  - Test split: Is AGE ≤ 40?
  - will separate out the youngest persons to the left
  - could be many cases if many people share the same AGE
- Next, increase the AGE threshold to the next youngest person
  - Is AGE ≤ 43?
  - this will direct additional cases to the left
- Continue increasing the splitting threshold value by value
  - each value is tested for how good the split is … how effective it is in separating the classes from each other
- Q: Do splits between values of the same class need to be considered?
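This exhaustive threshold search can be sketched in a few lines. The AGE values below are made up, and the weighted-impurity score is just a placeholder for CART's actual criterion (the Gini index, defined on the next slide):

```python
# Sketch of an exhaustive split search on one numeric variable (toy data).
def impurity(labels):
    """Generic impurity of a list of class labels (0 = pure node)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

ages    = [40, 43, 45, 51, 58, 62, 65, 70]          # invented AGE values
classes = ["B", "B", "A", "B", "A", "A", "A", "A"]  # invented class labels

best = None
for threshold in sorted(set(ages)):                 # "Is AGE <= 40?", "Is AGE <= 43?", ...
    left  = [c for a, c in zip(ages, classes) if a <= threshold]
    right = [c for a, c in zip(ages, classes) if a >  threshold]
    score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(ages)
    if best is None or score < best[0]:
        best = (score, threshold)

print("best split found: Is AGE <=", best[1])
```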

Split Tables
(Slide shows two split tables for the same cases: one sorted by Age, one sorted by Blood Pressure.)
Q: Where do splits need to be evaluated?

23 CART Splitting Criteria: Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σ_j (p_j)^2
  where p_j is the relative frequency of class j in T.
- gini(T) is minimized (approaches 0) when the class distribution in T is skewed toward a single class.
- Advanced: CART also has other splitting criteria
  - Twoing is recommended for multi-class problems
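A minimal sketch of this definition in code, using invented class counts rather than the Gymtutor data:

```python
# Gini index of a node: gini(T) = 1 - sum_j p_j**2,
# where p_j is the relative frequency of class j in T.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["A"] * 50 + ["B"] * 50))  # 0.5   -- evenly mixed node (maximal impurity for 2 classes)
print(gini(["A"] * 95 + ["B"] * 5))   # 0.095 -- skewed node, close to the minimum of 0
```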

Handling of Missing Splitter Values in Tree Growing
- If the splitter variable is missing, we don't know which way to send the case (left or right in a binary tree)
- Option: delete cases that have missing values
  - the method used in classical statistical modeling
  - unacceptable in a data mining context with many missings
- Option: freeze the case in the node where the missing splitter is encountered
  - make do with what the tree has learned so far for this case
- Option: allow cases with a missing split variable to follow the majority
  - assumes all missings are somehow typical
- Option: allow missing to be a separate value of the variable
  - used by the CHAID algorithm; an option in Salford software
  - allows special handling for missing, but all missings are treated as indistinguishable from each other

Missing as a Distinct Splitter Value
- CHAID treats missing as a distinct categorical value
  - e.g. AGE is 25-44, 45-64, or missing
  - the method was also adopted by C4.5
- If missing is a distinct value, then all cases with missing go the same way in the tree
  - Assumption: whatever the unknown value is, it is the same for all cases with a missing value
- Problem: there can be more than one reason for a database field to be missing
  - e.g. Income as a splitter wants to separate high from low
  - Which levels are most likely to be missing? High income AND low income!
  - we don't want to send both groups to the same side of the tree

26 CART Treatment of Missing Primary Splitters: Surrogates
- CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field
  - the surrogate should be a valid replacement for the primary
- Consider our example of INCOME
  - other variables such as Education or Occupation might work as good surrogates
  - higher-education people usually have higher incomes
  - people in high-income occupations will usually (though not always) have higher incomes
- Using a surrogate means cases missing the primary are not all treated the same way
  - whether a case goes left or right depends on its surrogate value
  - thus record-specific … some cases go left, others go right

Surrogates: Mimicking Alternatives to Primary Splitters
- A primary splitter is the best splitter of a node
- A surrogate is a splitter that splits in a fashion similar to the primary
  - a surrogate is a variable with near-equivalent information
- Why useful?
  - if the primary is expensive or difficult to gather and the surrogate is not, consider using the surrogate instead
  - the loss in predictive accuracy might be slight
- If the primary splitter is MISSING, then CART will use a surrogate
  - if the top surrogate is missing, CART uses the 2nd-best surrogate, and so on
  - if all surrogates are also missing, CART uses the majority rule
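A rough sketch of the idea: score a candidate surrogate by how often its left/right assignment agrees with the primary split. The agreement measure, the variable names, and the data below are all illustrative assumptions; Salford's CART uses its own association measure.

```python
# Score a candidate surrogate variable by agreement with the primary split.
def surrogate_agreement(candidate_values, primary_goes_left):
    """Return (best_threshold, fraction of cases sent the same way as the primary split)."""
    best = (None, 0.0)
    for t in sorted(set(candidate_values)):
        agree = sum((v <= t) == went_left
                    for v, went_left in zip(candidate_values, primary_goes_left))
        frac = agree / len(candidate_values)
        if frac > best[1]:
            best = (t, frac)
    return best

# Invented example: the primary split (on INCOME) sent these 8 cases left/right;
# EDUCATION (years) is the candidate surrogate.
primary_goes_left = [True, True, False, True, False, False, True, False]
education_years   = [10,   12,   16,    11,   18,    17,    12,   16]
print(surrogate_agreement(education_years, primary_goes_left))  # (12, 1.0): a perfect mimic
```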

28 Competitors vs. Surrogates
Node before splitting: Class A = 100, Class B = 100, Class C = 100

           Primary split    Competitor split    Surrogate split
           Left / Right     Left / Right        Left / Right
Class A      90 / 10          80 / 20             78 / 22
Class B      80 / 20          25 / 75             74 / 26
Class C      15 / 85          14 / 86             21 / 79

CART Pruning Method: Grow Full Tree, Then Prune
- You will never know when to stop … so don't!
- Instead … grow trees that are obviously too big
- The largest tree grown is called the "maximal" tree
- The maximal tree could have hundreds or thousands of nodes
  - usually we instruct CART to grow only moderately too big
  - rule of thumb: grow trees about twice the size of the truly best tree
- This becomes the first stage in finding the best tree
- Next we have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)

30 Maximal Tree Example

Tree Pruning
- Take a very large tree (the "maximal" tree)
  - the tree may be radically over-fit
  - it tracks all the idiosyncrasies of THIS data set
  - it tracks patterns that may not be found in other data sets
  - at the bottom of the tree, splits are based on very few cases
  - analogous to a regression with a very large number of variables
- PRUNE away branches from this large tree
  - but which branch to cut first?
- CART determines a pruning sequence:
  - the exact order in which each node should be removed
  - the pruning sequence is determined for EVERY node
  - the sequence is determined all the way back to the root node

32 Pruning: Which nodes come off next?

"weakest link"  Prune away "weakest link" — the nodes that add least to overall accuracy of the tree  contribution to overall tree a function of both increase in accuracy and size of node  accuracy gain is weighted by share of sample  small nodes tend to get removed before large ones  If several nodes have same contribution they all prune away simultaneously  Hence more than two terminal nodes could be cut off in one pruning  Sequence determined all the way back to root node  need to allow for possibility that entire tree is bad  if target variable is unpredictable we will want to prune back to root... the no model solution Order of Pruning: Weakest Link Goes First

34 Pruning Sequence Example
(Slide shows successive trees from the pruning sequence, with 24, 21, 20, and 18 terminal nodes.)

35 Now we test every tree in the pruning sequence
- Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  - how many cases right and how many wrong
  - measure accuracy overall and by class
- Do the same for the 2nd-largest tree, the 3rd-largest tree, etc.
- The performance of every tree in the sequence is measured
  - results are reported in table and graph formats
- Note that this critical stage is impossible to complete without test data
  - the CART procedure requires test data to guide tree evaluation
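As an illustration of both steps, growing the pruning sequence and then scoring every tree in it on held-out data, scikit-learn's minimal cost-complexity (weakest-link) pruning can stand in for the Salford product:

```python
# Grow a maximal tree, obtain the weakest-link pruning sequence, and
# evaluate every tree in the sequence on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One ccp_alpha value per subtree in the pruning sequence (larger alpha = smaller tree).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.5f}  terminal nodes={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```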

Training Data vs. Test Data Error Rates
- Compare error rates measured on
  - the learn (training) data
  - a large test set
- The learn error rate R(T) always decreases as the tree grows (Q: Why?)
- The test error rate R_ts(T) first declines, then increases (Q: Why?)
- Overfitting is the result of too much reliance on the learn R(T)
  - it can lead to disasters when the tree is applied to new data
(Slide shows a table of R(T) and R_ts(T) by number of terminal nodes.)

Why Look at Training Data Error Rates (or Cost) at All?
- First, they provide a rough guide of how you are doing
  - the truth will typically be WORSE than the training-data measure
  - if the tree performs poorly even on training data, you may not want to pursue it further
- The training data error rate is more accurate for smaller trees
  - so it is a reasonable guide for smaller trees
  - but a poor guide for larger trees
- At the optimal tree, training and test error rates should be similar
  - if not, something is wrong
  - it is useful to compare not just the overall error rate but also within-node performance between training and test data

CART: Optimal Tree
- Within a single CART run, which tree is best?
- The process of pruning the maximal tree can yield many sub-trees
- A test data set or cross-validation measures the error rate of each tree
- Current wisdom: select the tree with the smallest error rate
- The only drawback: the minimum may not be precisely estimated
  - the typical error rate as a function of tree size has a flat region
  - the minimum could be anywhere in this region

One SE Rule – One Standard Error Rule
- The original monograph recommends NOT choosing the minimum-error tree, because of possible instability of results from run to run
- Instead it suggests the SMALLEST TREE within 1 SE of the minimum-error tree
  - tends to provide very stable results from run to run
  - is possibly as accurate as the minimum-cost tree, yet simpler
- Current learning: the one SE rule is good for small data sets
  - for large data sets one should pick the most accurate tree
  - known as the zero SE rule
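A small sketch of the rule, using made-up cross-validated error rates and standard errors for each candidate tree size:

```python
# Pick the smallest tree whose CV error is within one SE of the minimum (one SE rule).
# (number of terminal nodes, CV error rate, standard error of that estimate) -- invented numbers
cv_results = [(24, 0.210, 0.012), (18, 0.195, 0.011), (12, 0.185, 0.011),
              (9, 0.182, 0.010), (6, 0.188, 0.010), (3, 0.230, 0.012)]

best_size, best_err, best_se = min(cv_results, key=lambda r: r[1])
threshold = best_err + best_se                      # minimum error plus one standard error

one_se = min((r for r in cv_results if r[1] <= threshold), key=lambda r: r[0])
print("zero SE rule picks", best_size, "terminal nodes")   # the minimum-error tree
print("one SE rule picks", one_se[0], "terminal nodes")    # smallest tree within 1 SE
```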

In What Sense is the Optimal Tree "Best"?
- The optimal tree has the lowest or near-lowest cost, as determined by a test procedure
- The tree should exhibit very similar accuracy when applied to new data
- BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  - trees somewhat larger or smaller than the "optimal" one may be preferred
- There is room for user judgment
  - the judgment is not about split variables or values
  - it is a judgment about how much of the tree to keep
  - determined by the story the tree is telling
  - and by the willingness to sacrifice a small amount of accuracy for simplicity

41 CART Summary
- CART key features:
  - binary splits
  - gini index as the splitting criterion
  - grow, then prune
  - surrogates for missing values
  - optimal tree – 1 SE rule
  - lots of nice graphics

42 Decision Tree Summary
- Decision trees
  - splits – binary, multi-way
  - split criteria – entropy, gini, …
  - missing value treatment
  - pruning
  - rule extraction from trees
- Both C4.5 and CART are robust tools
- No method is always superior – experiment!

(witten & eibe)