
6.1.4 AIC, Model Selection, and the Correct Model

- Any model is a simplification of reality.
- If a model has relatively little bias, it tends to provide accurate estimates of the quantities of interest.
- The best model is often the simplest one (fewest parameters): model parsimony.

Akaike Information Criterion (AIC): an alternative to significance tests for choosing among competing statistical models.

- AIC judges a model by how close its fitted values tend to be to the true values.
- AIC selects the model that minimizes
      AIC = -2(maximized log likelihood - number of parameters in the model),
  which penalizes a model for having too many parameters.
- AIC serves the purpose of model comparison only; it does not provide a diagnostic of how well the model fits the data.
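The AIC definition above is easy to sketch as code. A minimal Python helper (the log-likelihood values below are hypothetical, not the crab-data fits):

```python
def aic(log_likelihood, n_params):
    """AIC = -2(maximized log likelihood - number of parameters),
    equivalently -2*logL + 2*p as reported by SAS."""
    return -2.0 * (log_likelihood - n_params)

# Hypothetical maximized log likelihoods for two competing models:
# a 2-parameter model and a 5-parameter model with a slightly better fit.
aic_small = aic(-100.0, 2)   # -2*(-100 - 2) = 204.0
aic_large = aic(-99.0, 5)    # -2*(-99 - 5)  = 208.0

# The penalty of 2 per parameter favors the simpler model here:
best = min([("small", aic_small), ("large", aic_large)], key=lambda t: t[1])
print(best)  # ('small', 204.0)
```

The larger model's fit improves the log likelihood by only 1, which does not offset its 3 extra parameters, so AIC picks the smaller model.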

AIC = -2(maximized log likelihood - number of parameters in the model); in SAS output, AIC = -2 Log L + 2p.

Crab example (Table 6.2, p. 215): the best models have the smallest AICs. The best model here has the main effects COLOR and WIDTH (AIC = 197.5).

PROC LOGISTIC (backward elimination):

    proc logistic descending;
      class color spine / param = ref;
      model y = width weight color spine / selection = backward lackfit;
    run;

Backward Elimination Procedure
    Step 0. The following effects were entered: Intercept width weight color spine
    Step 1. Effect spine is removed.
    Step 2. Effect weight is removed.

The output's Model Fit Statistics (AIC and -2 Log L, for the intercept-only model and for the model with covariates) and the Analysis of Maximum Likelihood Estimates follow; both the Intercept and width estimates are significant with Pr > ChiSq < .0001. At each step, AIC is computed as -2 Log L + 2p.

Using Causal Hypotheses to Guide Model Building

- Rather than using selection techniques such as stepwise, which look only at the significance level of each parameter, use theory and common sense to build a model (add and remove parameters that make sense).
- A time ordering among the variables may suggest causal relationships.

Example (Table 6.3, p. 217): In a British study, 1036 men and women (married and divorced) were asked whether they had had premarital and/or extramarital sex. We want to determine whether G = gender, P = premarital sex, and E = extramarital sex are factors in whether a person is M = married or divorced.

Simple model: G → P → E → M. Each variable is explanatory for any variable listed to its right, which acts as the response.

Complex (triangular) model (Fig. 3.1, p. 218):
- 1st stage: G has a direct effect on P.
- 2nd stage: P and G have direct effects on E.
- 3rd stage: E has a direct effect on M; P has direct and indirect effects on M; G has indirect effects through P and E.

Table 6.4: Goodness-of-Fit Tests for Model Selection

1st stage: predicts Gender has a direct effect on Premarital Sex. The estimated odds of premarital sex for females are .27 times the odds for males. (A 2x2 table cross-classifies gender, Female/Male, by premarital sex, Yes/No, with totals; the counts did not survive in this transcript.)

    data causal2;
      input gender $ PMS TOTALPMS;
      datalines;
    F
    M
    ;

Model (response P, no explanatory variables):

    PROC GENMOD DATA = CAUSAL2 DESCENDING;
      CLASS GENDER;
      MODEL PMS/TOTALPMS = / DIST = BIN LINK = LOGIT;
    RUN;

Model (response P, explanatory variable G):

    PROC GENMOD DATA = CAUSAL2 DESCENDING;
      CLASS GENDER;
      MODEL PMS/TOTALPMS = GENDER / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;
    RUN;

Each model's output reports Criteria for Assessing Goodness of Fit: Deviance, Pearson Chi-Square, and Log Likelihood, with DF and Value/DF.
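The .27 odds ratio quoted for the first stage is just the cross-product ratio of the 2x2 table. A minimal sketch with made-up counts (the actual survey counts are not in this transcript):

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for the 2x2 table [[a, b], [c, d]]:
    (a/b) / (c/d) = (a*d) / (b*c)."""
    return (a * d) / (b * c)

# Hypothetical counts, NOT the British-survey data:
# rows = female, male; columns = PMS yes, PMS no.
print(round(odds_ratio(30, 70, 60, 40), 2))  # 0.29
```

An odds ratio below 1 means the odds in the first row (females) are lower than in the second row (males), which is the direction of the .27 estimate in the slide.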

Goodness of Fit as a Likelihood-Ratio Test

The likelihood-ratio statistic -2(L0 - L1) tests whether certain model parameters are zero by comparing the maximized log likelihood L1 for the fitted model M1 with L0 for the simpler model M0 (formula, p. 187). For this example we use the fact that -2(L0 - L1) = G²(M0) - G²(M1), taken from the SAS output.

1st stage: G² = G²(M0) - G²(M1) = -2(L0 - L1), with df = 1 - 0 = 1, gives a χ² p-value < .001, so there is evidence of a gender effect on premarital sex, suggesting that the model with G as an explanatory variable is better.

2nd stage: predicts Gender and Premarital Sex have direct effects on Extramarital Sex. (A table cross-classifies EMS, Yes/No with totals, by gender and PMS; the counts did not survive in this transcript.)

    data causal3;
      input gender $ PMS $ EMS TOTALEMS;
      datalines;
    F Y
    F N
    M Y
    M N
    ;
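All of the χ² comparisons in these slides use df = 1, for which the right-tail probability has a closed form, P(χ²₁ > x) = erfc(√(x/2)), computable with only the standard library. A small sketch (2.9 is the G² difference quoted for the E = P vs. E = G + P comparison):

```python
import math

def chi2_sf_df1(x):
    """Right-tail p-value P(X > x) for a chi-square variable with
    1 degree of freedom: since X = Z**2 for standard normal Z,
    P(X > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# The df = 1 critical value at the .05 level is 1.96**2 = 3.8416:
print(round(chi2_sf_df1(1.96 ** 2), 4))  # 0.05

# The G-squared difference of 2.9 from the E = P vs. E = G + P comparison:
print(round(chi2_sf_df1(2.9), 3))
```

This is why a G² difference of 2.9 on 1 df counts as weak evidence (p above .05), while differences past about 10.8 give p < .001.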

Model (response E, no explanatory variables):

    PROC GENMOD DATA = CAUSAL3 DESCENDING;
      CLASS GENDER PMS;
      MODEL EMS/TOTALEMS = / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;
    RUN;

Model (response E, explanatory variable P):

    PROC GENMOD DATA = CAUSAL3;
      CLASS GENDER PMS;
      MODEL EMS/TOTALEMS = PMS / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;
    RUN;

Model (response E, explanatory variables G and P):

    PROC GENMOD DATA = CAUSAL3 DESCENDING;
      CLASS GENDER PMS;
      MODEL EMS/TOTALEMS = GENDER PMS / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;
    RUN;

Each model's output reports Criteria for Assessing Goodness of Fit: Deviance, Pearson Chi-Square, and Log Likelihood.

Model E = 1 vs. E = P: G²(M0) - G²(M1) = -2(L0 - L1), with df = 3 - 2 = 1, gives a χ² p-value < .001, so there is evidence of a P effect on E.

Model E = P vs. E = G + P: G² = G²(M0) - G²(M1) = 2.9, with df = 2 - 1 = 1, gives a χ² p-value > .05, so there is only weak evidence that G has a direct effect (as well as an indirect effect) on E. So E = P is a sufficient model.

3rd stage: predicts Extramarital Sex has a direct effect on Marital status; Premarital Sex has direct and indirect effects on Marital status; Gender has indirect effects through PMS and EMS. (A table cross-classifies Divorced counts, with totals, by gender, PMS, and EMS; the counts did not survive in this transcript.)

    data causal;
      input gender $ PMS $ EMS $ DIVORCED TOTAL;
      datalines;
    F Y Y
    F Y N
    F N Y
    F N N
    M Y Y
    M Y N
    M N Y
    M N N
    ;

Model M = E + P vs. M = E*P: G² = G²(M0) - G²(M1) = 12.91, with df = 5 - 4 = 1, so the χ² p-value < .001 and the model with the EMS*PMS interaction predicts divorce better.

Model M = E*P vs. M = E*P + G: G² = G²(M0) - G²(M1), with df = 4 - 3 = 1, gives a χ² p-value < .05, so adding G to the EMS*PMS interaction fits slightly better.

Conclusion on causal relationships: hypothesizing a causal ordering is a good alternative strategy for model building, using common sense to propose the relationships among variables.

6.1.6 New Model-Building Strategies for Data Mining

- Data mining is the analysis of huge data sets in order to find previously unsuspected relationships that are of interest or value.
- Model building is challenging for such data sets.
- There are alternatives to traditional statistical methods, such as automated algorithms that ignore concepts like sampling error and modeling.
- Significance tests are usually irrelevant, since nearly any variable has a significant effect if n is sufficiently large.
- For large n, inference is less relevant than summary measures of predictive power.
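The point about large n can be made concrete: holding the effect size fixed, the two-sample z statistic for a difference in proportions grows like √n, so even a practically trivial difference eventually becomes "significant". A minimal sketch (the 52% vs. 50% figures are made up):

```python
import math

def two_prop_z(p1, p2, n):
    """z statistic for comparing proportions p1 and p2 observed in
    two groups of size n each, using the pooled variance estimate
    (with equal group sizes the pooled proportion is the average)."""
    p = (p1 + p2) / 2.0
    se = math.sqrt(p * (1.0 - p) * (2.0 / n))
    return (p1 - p2) / se

# The same tiny 2-point difference, judged at the 1.96 cutoff:
for n in (1_000, 100_000):
    z = two_prop_z(0.52, 0.50, n)
    print(n, round(z, 2), "significant" if abs(z) > 1.96 else "not significant")
```

Multiplying n by 100 multiplies z by exactly 10, which is why, for large n, summary measures of predictive power tell you more than a p-value does.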