Business Intelligence and Decision Modeling
Week 9
Customer Profiling
Decision Trees (Part 2): CHAID and CRT
CHAID or CART
  CHAID: Chi-Square Automatic Interaction Detector
    Based on the chi-square test
    All variables discretized
    Dependent variable: nominal
  CART: Classification and Regression Tree
    Variables can be discrete or continuous
    Based on Gini or F-test
    Dependent variable: nominal or continuous
Use of Decision Trees
  Classify observations from a binary or nominal target variable (segmentation)
  Predictive response analysis from a numerical target variable (behaviour)
  Decision support rules (processing)
Decision Tree
Example: dmdata.sav
Underlying Theory: $\chi^2$
CHAID Algorithm: Selecting Variables
  Example: Regions (4), Gender (3, including Missing), Age (6, including Missing)
  For each variable, collapse categories to maximize the chi-square test of independence
    Ex: Region (N, S, E, W, *) becomes (WSE, N*)
  Select the most significant variable
  Go to the next branch … and the next level
  Stop growing if … estimated $\chi^2$ < theoretical $\chi^2$
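The stopping rule above compares an estimated chi-square against a theoretical (critical) value. A minimal sketch of that comparison for one candidate split follows. The counts are hypothetical and are not taken from dmdata.sav, and 3.841 is the standard critical value for 1 degree of freedom at the 0.05 level. The sketch does not perform category merging or any multiple-comparison adjustment.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of predictor and target
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = collapsed region groups (WSE, N*),
# columns = (responded, did not respond)
table = [[30, 70], [60, 40]]

CRITICAL_05_DF1 = 3.841  # theoretical chi-square, df = 1, alpha = 0.05

stat = chi_square(table)
keep_split = stat >= CRITICAL_05_DF1  # stop growing when stat < critical value
print(round(stat, 3), keep_split)
```

With these illustrative counts the estimated statistic exceeds the critical value, so the split would be kept.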
CART (Nominal Target)
  Nominal targets: Gini (impurity reduction or entropy)
  Squared probability of node membership
  Gini = 0 when targets are perfectly classified
  Gini index = $1 - \sum_i p_i^2$
  Example: Prob: Bus = 0.4, Car = 0.3, Train = 0.3
    Gini = $1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660$
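The Gini calculation on this slide can be checked with a short sketch. The function name `gini` is illustrative only and does not refer to any particular package.

```python
def gini(probabilities):
    """Gini index: 1 minus the sum of squared class probabilities in a node."""
    return 1.0 - sum(p * p for p in probabilities)

# Slide example: Bus = 0.4, Car = 0.3, Train = 0.3
example = gini([0.4, 0.3, 0.3])

# A perfectly classified node has a single class with probability 1
pure = gini([1.0, 0.0, 0.0])

print(round(example, 3), pure)
```

The first value reproduces the 0.660 from the slide, and the second confirms that Gini is 0 for a pure node.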
CART (Metric Target)
  Continuous variables: variance reduction (F-test)
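For a metric target, a split is judged by how much it reduces the variance of the target within the child nodes. The sketch below illustrates only the variance-reduction part. The spending values are hypothetical, the helper names are illustrative, and no F-test is computed.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted

# Hypothetical target values, already sorted by the candidate predictor
spend = [10, 12, 11, 30, 32, 31]

# Try every split position and keep the one with the largest reduction
best_k = max(
    range(1, len(spend)),
    key=lambda k: variance_reduction(spend, spend[:k], spend[k:]),
)
best_gain = variance_reduction(spend, spend[:best_k], spend[best_k:])
print(best_k, round(best_gain, 3))
```

Here the best split separates the three low values from the three high values, which removes almost all of the parent node's variance.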
Comparative Advantages (From Wikipedia)
  Simple to understand and interpret
  Requires little data preparation
  Able to handle both numerical and categorical data
  Uses a white box model easily explained by Boolean logic
  Possible to validate a model using statistical tests
  Robust