Implementation in Tree
Stat 6601, November 24, 2004
Bin Hu, Philip Wong, Yu Ye

Data Background
- From the SPSS Answer Tree program, we use its credit scoring example.
- There are 323 data points.
- The target variable is credit ranking (good [48%], bad [52%]).
- The four predictor variables are:
  - age, categorical (young [58%], middle [24%], old [18%])
  - has AMEX card (yes [48%], no [52%])
  - paid weekly/monthly (weekly pay [51%], monthly salary [49%])
  - social class (management [12%], professional [49%], clerical [15%], skilled [13%], unskilled [12%])

Data Background
It is useful to see how the target variable is distributed by each of the predictor variables.
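A minimal R sketch of such a check (not from the original deck; it assumes the column names CREDIT_R, AGE, and PAY_METHOD used in the R session later in these slides):

> credit_data <- read.csv(file="training.csv")
> # row-wise proportions: distribution of credit ranking within each age group
> prop.table(table(credit_data$AGE, credit_data$CREDIT_R), margin=1)
> # and within each pay method
> prop.table(table(credit_data$PAY_METHOD, credit_data$CREDIT_R), margin=1)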

Data Background – Correlation Matrix
Pearson correlation coefficients, N = 323 (Prob > |r| under H0: rho = 0), among CREDIT_R, PAY_WEEK, AGE, and AMEX. [The coefficient and p-value entries were garbled in transcription; only the p < .0001 markers for the PAY_WEEK and AGE correlations with CREDIT_R survive.]

Objective
- To create a predictive model of good credit risks.
- To assess the performance of the model, we randomly split the data into two parts: a training set (60%) to develop the model and the remaining 40% to validate it (a sketch of such a split follows below).
- This is done to avoid possible "overfitting," since the validation set was not involved in deriving the model.
- Using the same data, we compare the results from R's tree package, Answer Tree's CART, and SAS's PROC LOGISTIC.
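A hedged sketch of one way the 60/40 split could be done in R (the deck does not show its own split code; the file name credit.csv and the seed are assumptions):

> full_data <- read.csv(file="credit.csv")   # assumed name for the full 323-row file
> set.seed(6601)                             # any fixed seed makes the split reproducible
> train_rows <- sample(nrow(full_data), size=round(0.6 * nrow(full_data)))
> training <- full_data[train_rows, ]        # 60% used to develop the model
> validation <- full_data[-train_rows, ]     # remaining 40% used to validate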

Logistic Regression
- Let x be a vector of explanatory variables.
- Let y be a binary target variable (0 or 1).
- p = Pr(Y = 1 | x) is the target probability.
- The linear logistic model has the form logit(p) = log(p / (1 - p)) = α + β'x.
- The predicted probability is then phat = 1 / (1 + exp(-(α + β'x))).
- Note that the range of p is (0, 1), but logit(p) is the whole real line.
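A quick numeric illustration of the logit link and its inverse in R (the values here are for illustration only, not estimates from the credit data):

> logit <- function(p) log(p / (1 - p))
> invlogit <- function(z) 1 / (1 + exp(-z))
> logit(0.5)             # 0: probability 1/2 maps to the middle of the real line
> invlogit(logit(0.9))   # 0.9: the two functions are inverses
> invlogit(c(-5, 0, 5))  # about 0.007, 0.500, 0.993 -- always inside (0, 1)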

Logistic Results
- Using the training set, the maximum likelihood estimation failed to converge when the social class and age variables were included.
- Only the paid weekly/monthly and has-AMEX-card variables could be estimated.
- The AMEX variable was highly insignificant and so was dropped.
- Apparently, the tree algorithm does a better job of handling all the variables.

SAS output of the model: the estimate and standard-error columns of the parameter table were garbled in transcription; both the Intercept and PAY_WEEK terms are significant (Wald Pr > ChiSq < .0001), and the PAY_WEEK estimate is about -3.6. So the odds for a weekly-paid person to be a good risk, relative to a monthly-salaried person, are exp(-3.6) ≈ 0.027 to 1, or about 36 to 1 against.
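A minimal sketch of fitting the same one-variable model with R's glm() instead of SAS PROC LOGISTIC (it assumes CREDIT_R and PAY_METHOD are factors, as in the tree session later in the deck; the sign of the coefficient depends on how the factor levels are coded):

> logit_fit <- glm(CREDIT_R ~ PAY_METHOD, data=credit_data, family=binomial)
> summary(logit_fit)     # the pay-method coefficient should be near -3.6 in magnitude
> exp(coef(logit_fit))   # odds ratios; exp(-3.6) is roughly 0.027, i.e., 36 to 1 against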

Validation Results
- With only one variable in our prediction model, there are only two possible predicted probabilities, one for weekly pay and one for monthly salary (the two values were garbled in transcription).
- Taking the higher probability as predicting a "good" account, our results are below.

[Crosstabulation of predicted vs. actual good/bad accounts for the validation and training sets; the cell counts were garbled in transcription.]
Percent agreement: 85.3% (validation set), 85.6% (training set).

The better measure is the validation-set result. Note that the two results are very similar, so overfitting does not appear to be a problem.
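A hedged sketch of reproducing the validation-set agreement from the glm fit above (the file name validation.csv, the factor levels "good"/"bad", and the 0.5 cutoff, which matches "taking the higher probability" here, are all assumptions):

> valid_data <- read.csv(file="validation.csv")
> phat <- predict(logit_fit, newdata=valid_data, type="response")  # predicted Pr(good), given the factor coding
> pred <- ifelse(phat > 0.5, "good", "bad")
> mean(pred == valid_data$CREDIT_R)   # proportion of agreement; should match the 85.3% above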

Growing a Tree in R (based on training data)

> credit_data <- read.csv(file="training.csv")
> library(tree)
> credit_tree <- tree(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
+                     data=credit_data, split=c("gini"))
> tree.pr <- prune.tree(credit_tree)
> plot(tree.pr)                                              # figure 1
> plot(credit_tree, type="u"); text(credit_tree, pretty=0)   # figure 2
> tree.1 <- prune.tree(credit_tree, best=5)
> plot(tree.1, type="u"); text(tree.1, pretty=0)             # figures 3, 4, 5
> summary(tree.1)
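A sketch (not part of the original session) of scoring the validation data with the pruned tree and computing the agreement reported later in the deck; the validation file name is an assumption:

> valid_data <- read.csv(file="validation.csv")
> tree_pred <- predict(tree.1, newdata=valid_data, type="class")  # predicted class per row
> table(tree_pred, valid_data$CREDIT_R)                           # crosstab of predicted vs. actual
> mean(tree_pred == valid_data$CREDIT_R)                          # percent agreement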

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Tree Based on Validation Data

Implementing using SPSS Answer Tree: training sample – C&RT (minimum impurity change = .01)

Implementing using SPSS Answer Tree: training sample – CHAID (Pearson chi-square, p = .05)

Summary classification for training data

Level 1       Level 2       Predicted   % of error
Pay-Monthly   Age-med,old   Good         0%
Pay-Monthly   Age-Young     Bad         47%
Pay-Weekly    Class-P,M     Bad         32%
Pay-Weekly    Class-C,S,U   Bad          3%

Summary of validation data grouped by the training-data classification

Level 1       Level 2       Predicted   % of error
Pay-Monthly   Age-med,old   Good         2%
Pay-Monthly   Age-Young     Bad         59%
Pay-Weekly    Class-P,M     Bad         23%
Pay-Weekly    Class-C,S,U   Bad         13%

Crosstabulation of predicted and actual classification

                          Training sample            Validation sample
                          Actual good   Actual bad   Actual good   Actual bad
Predict good              [cell counts garbled in transcription]
Predict bad

Agreement (tree)               86.6%                      82.9%
Agreement in regression        85.6%                      85.3%
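As a reminder of how these agreement figures are computed: percent agreement is the sum of the diagonal cells of the crosstab divided by the total count. A tiny R illustration with hypothetical cell counts (the real counts were garbled in transcription):

> conf <- matrix(c(55, 9, 8, 57), nrow=2,   # hypothetical counts, not the study's data
+                dimnames=list(predicted=c("good", "bad"), actual=c("good", "bad")))
> sum(diag(conf)) / sum(conf)               # diagonal over total = percent agreement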

Summary of Results
- Similar trees were generated by R and SPSS Answer Tree.
- Similar results were derived using different tree-growing methods – C&RT and CHAID.
- On the training data, the classification tree has a higher percentage of agreement between predicted and actual values than logistic regression.
- Utilizing the grouping criteria derived from the training data, logistic regression has a higher percentage of agreement on the validation data than the classification tree.

Conclusion
- A classification tree is a non-parametric method that selects predictive variables sequentially and groups cases into homogeneous clusters to derive the highest predictive probability.
- Classification trees can be implemented in different software packages and with different tree-growing methodologies.
- Classification trees normally perform better than parametric models, with a higher percentage of agreement between predicted and actual values.
- Classification trees have special advantages in industries like credit cards and marketing research: 1) they group individuals into homogeneous clusters, and 2) they assign not only the predicted values but also the probability of prediction error.

Conclusion – cont'd
- As a non-parametric method, no functional form is specified and no parameters are estimated or tested.
- As shown in this small study, the lower percentage of agreement on the validation data indicates that "overfitting" might be a potential problem for classification trees.