6.1.4 AIC, Model Selection, and the Correct Model
o Any model is a simplification of reality
o If a model has relatively little bias, it tends to provide accurate estimates of the quantities of interest
o The best model is often the simplest one (fewer parameters): model parsimony
Akaike Information Criterion (AIC): an alternative to significance tests for estimating quantities of interest
o Criterion for choosing between competing statistical models
o AIC judges a model by how close its fitted values tend to be to the true values
o The AIC selects the model that minimizes: AIC = -2(maximized log likelihood - number of parameters in the model)
o This penalizes a model for having too many parameters
o Serves the purpose of model comparison only; it does not provide a diagnostic of how well the model fits the data
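As a minimal sketch of the criterion above, the following Python snippet computes AIC from a maximized log likelihood and a parameter count, then picks the model with the smallest value. The function names and the log-likelihood values are hypothetical and chosen only for illustration; they are not the crab-data results.

```python
def aic(max_log_lik, n_params):
    """AIC = -2(maximized log likelihood - number of parameters)."""
    return -2.0 * (max_log_lik - n_params)


def best_by_aic(models):
    """Return the name of the model with the smallest AIC.

    `models` maps a model name to a pair
    (maximized log likelihood, number of parameters).
    """
    return min(models, key=lambda name: aic(*models[name]))


# Hypothetical candidate models (illustrative numbers only).
candidates = {
    "small": (-100.0, 2),
    "medium": (-96.0, 4),
    "large": (-95.5, 7),
}

for name, (log_lik, p) in candidates.items():
    print(name, aic(log_lik, p))
print("selected:", best_by_aic(candidates))
```

In this made-up example the largest model has the highest log likelihood but is not selected, because the gain in fit does not offset the penalty for its extra parameters.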
AIC = -2(maximized log likelihood - number of parameters in the model)
In SAS: AIC = -2LogL + 2p
Crab Example: Table 6.2 (p. 215): The best models have the smallest AICs
o The best model has the main effects COLOR and WIDTH (AIC = 197.5)
PROC LOGISTIC (Backward Elimination):
    proc logistic descending;
      class color spine / param = ref;
      model y = width weight color spine / selection = backward lackfit;
    run;
Backward Elimination Procedure
    Step 0. The following effects were entered: Intercept width weight color spine
    Step 1. Effect spine is removed
    Step 2. Effect weight is removed
Model Fit Statistics (values not preserved in this copy)
    Criterion    Intercept Only    Intercept and Covariates
    AIC
    Log L
Analysis of Maximum Likelihood Estimates (only the p-values are preserved in this copy)
    Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
    Intercept                                                           <.0001
    width                                                               <.0001
In our case, AIC is equal in all steps: AIC = -2LogL + 2p, where p = 1 (the numeric values are not preserved in this copy)
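The two ways of writing AIC above are the same quantity, and comparing the AIC of two nested models amounts to comparing a likelihood-ratio statistic with twice the number of extra parameters. A short derivation, writing $L$ for the maximized log likelihood and $p$ for the number of parameters (the subscripts 0 and 1, for the simpler and the larger model, are notation assumed here):

```latex
\begin{aligned}
\mathrm{AIC} &= -2(L - p) = -2L + 2p, \\
\mathrm{AIC}_0 - \mathrm{AIC}_1 &= -2(L_0 - L_1) - 2(p_1 - p_0).
\end{aligned}
```

So the larger model has the smaller AIC exactly when $-2(L_0 - L_1)$ exceeds $2(p_1 - p_0)$, that is, when the improvement in fit is worth more than two units per added parameter.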
6.1.5 Using Causal Hypotheses to Guide Model Building
o Rather than relying on selection techniques such as stepwise procedures, which look at the significance level of each parameter, use theory and common sense to build a model (add and remove parameters in ways that make sense)
o A time ordering among variables may suggest causal relationships
Example (Table 6.3, p. 217): In a British study, 1036 men and women (married and divorced) were asked whether they had had premarital and/or extramarital sex. We want to determine whether G = gender, P = premarital sex, and E = extramarital sex are factors in M = marital status (married or divorced).
Simple Model: G → P → E → M
o Any of these is an explanatory variable when a variable listed to its right is the response
Complex Model (Triangular) (Fig. 3.1, p. 218):
o 1st stage: predicts G has a direct effect on P
o 2nd stage: predicts P and G have direct effects on E
o 3rd stage: predicts E has a direct effect on M; P has direct and indirect effects on M; G has indirect effects through P and E
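The rule that a variable can serve as an explanatory variable for anything listed to its right can be written down directly. The sketch below is only an illustration of that ordering; the names `ORDER` and `candidate_predictors` are hypothetical.

```python
# Time ordering taken from the simple model G -> P -> E -> M.
ORDER = ["G", "P", "E", "M"]


def candidate_predictors(response, order=ORDER):
    """Return the variables that precede `response` in the ordering.

    These are the variables the causal hypothesis allows as
    explanatory variables when `response` is modeled.
    """
    return order[: order.index(response)]


# One stage per response variable, as in the triangular model.
for response in ORDER[1:]:
    print(response, "<-", candidate_predictors(response))
```

Each stage of the triangular model then starts from these candidates and tests which of them are actually needed.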
Table 6.4: Goodness of Fit Tests for Model Selection
1st Stage: predicts Gender has a direct effect on Premarital Sex
o The estimated odds of premarital sex for females are 0.27 times those for males.
Premarital sex (PMS) by gender (counts not preserved in this copy)
              PMS Yes    PMS No    Total
    Female
    Male
    Total
Data step (the counts on the data lines are not preserved in this copy):
    data causal2;
      input gender $ PMS TOTALPMS;
      datalines;
    F
    M
    ;
Model (Response P, no Actual Explanatory)
    proc genmod data = causal2 descending;
      class gender;
      model PMS/TOTALPMS = / dist = bin link = logit;
    run;
Model (Response P, Actual Explanatory G)
    proc genmod data = causal2 descending;
      class gender;
      model PMS/TOTALPMS = gender / dist = bin link = logit type3 residuals obstats;
    run;
SAS output for each of the two models: Criteria For Assessing Goodness Of Fit, with columns Criterion, DF, Value, Value/DF and rows Deviance, Pearson Chi-Square, Log Likelihood (values not preserved in this copy)
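The odds comparison quoted for the first stage is a ratio of two odds, one per gender. Because the counts of the table are not available here, the sketch below uses clearly hypothetical counts, so it shows only the arithmetic, not the study's result; the function names are also assumptions.

```python
def odds(yes, no):
    """Odds of 'yes' given counts of yes and no."""
    return yes / no


def odds_ratio(yes_a, no_a, yes_b, no_b):
    """Odds of 'yes' in group A divided by the odds of 'yes' in group B."""
    return odds(yes_a, no_a) / odds(yes_b, no_b)


# Hypothetical counts, NOT the data of Table 6.3.
ratio = odds_ratio(yes_a=20, no_a=80, yes_b=50, no_b=50)
print(ratio)
```

With a single binary explanatory variable, the logit model with that variable reproduces the sample odds ratio: the exponentiated coefficient of the group indicator equals this ratio.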
Goodness of Fit as a Likelihood-Ratio Test
o The likelihood-ratio statistic $-2(L_0 - L_1)$ tests whether certain model parameters are zero by comparing the log likelihood $L_1$ of the fitted model $M_1$ with the log likelihood $L_0$ of the simpler model $M_0$ (formula p. 187)
o For the example, we use the fact that $-2(L_0 - L_1) = G^2(M_0) - G^2(M_1)$, with both deviances read from the SAS output.
1st Stage: $G^2 = G^2(M_0) - G^2(M_1) = -2(L_0 - L_1)$ (the numeric values are not preserved in this copy)
o df = 1 - 0 = 1 and the χ² p-value is < .001, so there is evidence of a gender effect on premarital sex; the model with G as an explanatory variable is the better model.
2nd Stage: predicts Gender and Premarital Sex have direct effects on Extramarital Sex
Extramarital sex (EMS) by gender and PMS (counts not preserved in this copy)
    Gender    PMS    EMS Yes    EMS No    Total
    Female    Yes
    Female    No
    Male      Yes
    Male      No
Data step (the counts on the data lines are not preserved in this copy):
    data causal3;
      input gender $ PMS $ EMS TOTALEMS;
      datalines;
    F Y
    F N
    M Y
    M N
    ;
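The identity between the likelihood-ratio statistic and the difference of deviances follows from the definition of the deviance. A short derivation, writing $L_S$ for the maximized log likelihood of the saturated model (a symbol introduced here for illustration):

```latex
\begin{aligned}
G^2(M) &= -2\,(L_M - L_S), \\
G^2(M_0) - G^2(M_1) &= -2\,(L_0 - L_S) + 2\,(L_1 - L_S) = -2\,(L_0 - L_1).
\end{aligned}
```

The saturated-model term cancels, which is why two deviances from separate PROC GENMOD runs on the same data can simply be subtracted.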
Model (Response E, no Actual Explanatory)
    proc genmod data = causal3 descending;
      class gender PMS;
      model EMS/TOTALEMS = / dist = bin link = logit type3 residuals obstats;
    run;
Model (Response E, P Actual Explanatory)
    proc genmod data = causal3;
      class gender PMS;
      model EMS/TOTALEMS = PMS / dist = bin link = logit type3 residuals obstats;
    run;
Model (Response E, G+P Actual Explanatory)
    proc genmod data = causal3 descending;
      class gender PMS;
      model EMS/TOTALEMS = gender PMS / dist = bin link = logit type3 residuals obstats;
    run;
SAS output for each of the three models: Criteria For Assessing Goodness Of Fit, with columns Criterion, DF, Value, Value/DF and rows Deviance, Pearson Chi-Square, Log Likelihood (values not preserved in this copy)
Model E = 1 vs. E = P
o $G^2(M_0) - G^2(M_1) = -2(L_0 - L_1)$ (the numeric values are not preserved in this copy)
o df = 3 - 2 = 1 and the χ² p-value is < .001, so there is evidence of a P effect on E
Model E = P vs. E = G+P
o $G^2 = G^2(M_0) - G^2(M_1) = 2.9$
o df = 2 - 1 = 1, so the χ² p-value is about .09 (between .05 and .10); there is only weak evidence that G has a direct effect, as well as an indirect effect, on E. So E = P is a sufficient model.
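Every comparison in these stages has df = 1, and in that case the χ² p-value can be obtained from the normal distribution without any statistics library. The sketch below assumes df = 1 only; the function name is hypothetical, and the statistic 2.9 is the one quoted above for E = P vs. E = G+P.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 df.

    If Z is standard normal, Z**2 is chi-squared with 1 df, so
    P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Likelihood-ratio statistic quoted for E = P vs. E = G+P.
p_value = chi2_sf_df1(2.9)
print(round(p_value, 3))  # about 0.09
```

For comparisons with more than one degree of freedom, a general chi-squared tail function would be needed instead.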
3rd Stage: predicts Extramarital Sex has a direct effect on Marriage; Premarital Sex has direct and indirect effects on Marriage; Gender has indirect effects through PMS and EMS
Divorce by gender, PMS, and EMS (counts not preserved in this copy)
    Gender    PMS    EMS    Divorced Yes    Divorced No
    Female    Yes    Yes
    Female    Yes    No
    Female    No     Yes
    Female    No     No
    Male      Yes    Yes
    Male      Yes    No
    Male      No     Yes
    Male      No     No
Data step (the counts on the data lines are not preserved in this copy):
    data causal;
      input gender $ PMS $ EMS $ DIVORCED TOTAL;
      datalines;
    F Y Y
    F Y N
    F N Y
    F N N
    M Y Y
    M Y N
    M N Y
    M N N
    ;
Model M = E + P vs. M = E*P
o $G^2 = G^2(M_0) - G^2(M_1) = 12.91$, with df = 5 - 4 = 1, so the χ² p-value is < .001; the model with the EMS*PMS interaction is the better model for predicting divorce
Model M = E*P vs. M = E*P + G
o $G^2 = G^2(M_0) - G^2(M_1)$ (the numeric value is not preserved in this copy), with df = 4 - 3 = 1, so the χ² p-value is < .05; adding G to the EMS*PMS interaction model fits slightly better.
Conclusion for Causal Relationships
o Using common sense to hypothesize relationships is a good alternative approach to model building
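The degrees of freedom quoted in the third-stage comparisons (5, 4, and 3) are residual degrees of freedom: the number of binomial observations minus the number of model parameters. The sketch below assumes that each of the 2 x 2 x 2 = 8 combinations of gender, PMS, and EMS is one binomial observation and that the parameter counts include the intercept; the names are hypothetical.

```python
def residual_df(n_binomial_obs, n_params):
    """Residual df of a binomial logit model fitted to grouped data."""
    return n_binomial_obs - n_params


N_OBS_STAGE3 = 8  # gender x PMS x EMS combinations (assumed grouping)

# Number of parameters per model, intercept included (assumed coding).
stage3_params = {
    "E + P": 3,    # intercept, E, P
    "E*P": 4,      # intercept, E, P, E:P
    "E*P + G": 5,  # intercept, E, P, E:P, G
}

for model, p in stage3_params.items():
    print(model, residual_df(N_OBS_STAGE3, p))
```

The df of each likelihood-ratio comparison is the difference between two of these residual df values, which is 1 for both third-stage comparisons.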
6.1.6 New Model-Building Strategies for Data Mining
o Data mining is the analysis of huge data sets in order to find previously unsuspected relationships that are of interest or value
o Model building is challenging
o There are alternatives to traditional statistical methods, such as automated algorithms that ignore concepts such as sampling error and modeling
o Significance tests are usually irrelevant, since nearly any variable has a significant effect if n is sufficiently large
o For large n, inference is less relevant than summary measures of predictive power
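The point about large n can be illustrated with a small calculation: a fixed, practically negligible difference between two proportions becomes "significant" once the sample is large enough. The sketch below uses a two-sided Wald z-test with an unpooled standard error; the proportions 0.50 and 0.51 and the sample sizes are hypothetical.

```python
import math


def two_sided_p(p1, p2, n):
    """Two-sided p-value of a Wald z-test for a difference between two
    sample proportions, each based on n observations."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z = abs(p1 - p2) / se
    # For standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Same one-point difference in proportions, increasing sample size per group.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, two_sided_p(0.50, 0.51, n))
```

The effect size never changes in this loop; only the p-value does, which is why summary measures of predictive power are more informative than significance tests for very large data sets.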