
1 Implementation In Tree Stat 6601 November 24, 2004 Bin Hu, Philip Wong, Yu Ye

2 Data Background
From the SPSS Answer Tree program, we use its credit scoring example.
There are 323 data points.
The target variable is credit ranking (good [48%], bad [52%]).
The four predictor variables are:
- age, categorical (young [58%], middle [24%], old [18%])
- has AMEX card (yes [48%], no [52%])
- paid weekly/monthly (weekly pay [51%], monthly salary [49%])
- social class (management [12%], professional [49%], clerical [15%], skilled [13%], unskilled [12%])

3 Data Background
It is useful to see how the target variable is distributed by each of the predictor variables.
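For example, one such breakdown can be tabulated in R (a minimal sketch, assuming the training.csv file and the CREDIT_R/AGE column names that appear on slide 9 and in the SAS output):

> credit_data <- read.csv(file="training.csv")
> tab <- table(credit_data$AGE, credit_data$CREDIT_R)   # counts of bad/good within each age group
> prop.table(tab, margin=1)                             # row proportions: credit mix by age group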

4 Data Background
Correlation matrix (SAS output):

Pearson Correlation Coefficients, N = 323
Prob > |r| under H0: Rho = 0

            CREDIT_R  PAY_WEEK       AGE      AMEX
CREDIT_R     1.00000   0.70885   0.66273   0.02653
                        <.0001    <.0001    0.6348
PAY_WEEK     0.70885   1.00000   0.51930   0.08292
              <.0001              <.0001    0.1370
AGE          0.66273   0.51930   1.00000  -0.00172
              <.0001    <.0001              0.9755
AMEX         0.02653   0.08292  -0.00172   1.00000
              0.6348    0.1370    0.9755
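The same matrix could be reproduced in R (a hedged sketch, assuming the variables are coded numerically, e.g. 0/1, as they evidently were in the SAS run; the variable names follow the SAS output):

> vars <- credit_data[, c("CREDIT_R", "PAY_WEEK", "AGE", "AMEX")]
> round(cor(vars), 5)                       # Pearson correlation matrix
> cor.test(vars$CREDIT_R, vars$PAY_WEEK)    # p-value for H0: Rho = 0, matching Prob > |r|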

5 Objective
To create a predictive model of good credit risks.
To assess the performance of the model, we randomly split the data into two parts: a training set (60%) to develop the model and the rest (40%) to validate it.
This is done to avoid possible "overfitting", since the validation set was not involved in deriving the model.
Using the same data, we compare the results from R's tree, Answer Tree's CART, and SAS's Proc Logistic.

6 Logistic Regression
Let x be a vector of explanatory variables.
Let y be a binary target variable (0 or 1).
p = Pr(Y=1|x) is the target probability.
The linear logistic model has the form logit(p) = log(p/(1-p)) = α + β'x.
Predicted probability: phat = 1/(1+exp(-(α+β'x))).
Note that the range of p is (0,1), but logit(p) covers the whole real line.
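In R, such a model can be fit with glm() (a minimal sketch, not the actual SAS Proc Logistic run reported on the next slide; assumes CREDIT_R is coded 0/1):

> logit_fit <- glm(CREDIT_R ~ PAY_WEEK + AMEX, data=credit_data,
                   family=binomial(link="logit"))
> summary(logit_fit)                           # estimates of alpha and beta
> phat <- predict(logit_fit, type="response")  # phat = 1/(1+exp(-(alpha+beta'x)))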

7 Logistic Results
Using the training set, the maximum likelihood estimation failed to converge when the social class and age variables were included.
Only the paid weekly/monthly and has AMEX card variables could be estimated.
The AMEX variable was highly insignificant and so was dropped.
Apparently, the tree algorithm does a better job of handling all the variables.
SAS output of the model:

                             Standard      Wald
Parameter    DF  Estimate       Error  Chi-Square  Pr > ChiSq
Intercept     1    1.5856      0.2662     35.4756      <.0001
PAY_WEEK 1    1   -3.6066      0.4169     74.8285      <.0001

So the odds of a weekly-pay person being a good risk, relative to a monthly-salary person, are exp(-3.6066) ≈ 0.027 to 1, or about 37 to 1 against.
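These numbers can be checked directly from the reported estimates (simple arithmetic on the SAS output above):

> alpha <- 1.5856; beta <- -3.6066
> exp(beta)                        # ~0.027: odds ratio, weekly pay vs monthly salary
> 1/(1 + exp(-(alpha + beta*1)))   # weekly pay: predicted probability ~0.117
> 1/(1 + exp(-(alpha + beta*0)))   # monthly salary: predicted probability ~0.830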

8 Validation Results
With only one variable in our predictive model, there are only two possible predicted probabilities: 0.117 and 0.830.
Taking the higher probability as predicting a "good" account, our results are below.

                     Validation Set               Training Set
                  Actual Bad  Actual Good    Actual Bad  Actual Good
Predicted Bad             60           11            83           11
Predicted Good             8           50            17           83
Percent Agreement      85.3%                      85.6%

The better measure is to use the validation set results. Note that the results are very similar, so overfitting does not appear to be a problem.
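The percent agreement figures follow from these confusion matrices (a quick check in R; matrices fill column by column, rows = predicted bad/good):

> validation <- matrix(c(60, 8, 11, 50), nrow=2)  # col 1 = actual bad, col 2 = actual good
> training   <- matrix(c(83, 17, 11, 83), nrow=2)
> sum(diag(validation)) / sum(validation)         # 0.853
> sum(diag(training)) / sum(training)             # 0.856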

9 Growing a Tree in R (based on training data)

> credit_data <- read.csv(file="training.csv")
> library(tree)
> credit_tree <- tree(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX, data=credit_data, split=c("gini"))
> tree.pr <- prune.tree(credit_tree)
> plot(tree.pr)                                              # figure 1
> plot(credit_tree, type="u"); text(credit_tree, pretty=0)   # figure 2
> tree.1 <- prune.tree(credit_tree, best=5)
> plot(tree.1, type="u"); text(tree.1, pretty=0)             # figures 3, 4, 5
> summary(tree.1)
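A natural follow-up (not shown in the original slides) is to score the 40% validation set with the pruned tree. This sketch assumes a hypothetical validation.csv laid out like training.csv, and that CREDIT_R was read as a factor so that tree() grew a classification tree:

> validation_data <- read.csv(file="validation.csv")
> pred <- predict(tree.1, newdata=validation_data, type="class")
> table(Predicted=pred, Actual=validation_data$CREDIT_R)   # confusion matrix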

10 Figure 1

11 Figure 2

12 Figure 3

13 Figure 4

14 Figure 5

15 Tree Based on Validation Data

16 Implementation using SPSS ANSWER TREE: training sample – C&RT (min impurity change .01)

17 Implementation using SPSS ANSWER TREE: training sample – CHAID (Pearson Chi2, p = .05)
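Answer Tree is point-and-click, but a rough R analogue of the C&RT run exists in the rpart package (a sketch only; rpart's complexity parameter cp is assumed here as a stand-in for Answer Tree's minimum impurity change of .01 — the two settings are related but not identical):

> library(rpart)
> cart_fit <- rpart(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
                    data=credit_data, method="class",
                    control=rpart.control(cp=0.01))
> print(cart_fit)   # compare the splits with the summary tables on the next slides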

18 Summary classification for training data

Level 1       Level 2        Predicted  % of Error
Pay-Monthly   Age-med,old    Good        0%
Pay-Monthly   Age-Young      Bad        47%
Pay-Weekly    Class-P,M      Bad        32%
Pay-Weekly    Class-C,S,U    Bad         3%

19 Summary of validation data grouped by training data classification

Level 1       Level 2        Predicted  % of Error
Pay-Monthly   Age-med,old    Good        2%
Pay-Monthly   Age-Young      Bad        59%
Pay-Weekly    Class-P,M      Bad        23%
Pay-Weekly    Class-C,S,U    Bad        13%

20 Crosstabulation of predicted and actual classification

                           Training Sample            Validation Sample
                        Actual Good  Actual Bad    Actual Good  Actual Bad
Predict good                     68           0             67          21
Predict bad                      26         100              1          40
Agreement                     86.6%                       82.9%
Agreement in regression       85.6%                       85.3%

21 Summary of results
Similar trees were generated from R and SPSS ANSWER TREE.
Similar results were derived using different tree generation methods – C&RT and CHAID.
The classification tree has a higher percentage of agreement between predicted and actual values than logistic regression on the training data.
Utilizing the grouping criteria derived from the training data, logistic regression has a higher percentage of agreement than the classification tree.

22 Conclusion
A classification tree is a non-parametric method that selects predictive variables sequentially and groups cases into homogeneous clusters to derive the highest predictive probability.
Classification trees can be implemented in different software packages and with different tree-growing methodologies.
Classification trees normally perform better than parametric models, with a higher percentage of agreement between predicted and actual values.
Classification trees have special advantages in industries like credit cards and marketing research by 1) grouping individuals into homogeneous clusters and 2) assigning not only the predicted values but also the probability of prediction error.

23 Conclusion – cont'd
As a non-parametric method, no functional form is specified and no parameters are estimated or tested.
As shown in this small study, the lower percentage of agreement for the validation data suggests that "overfitting" might be a potential problem with classification trees.

