Regression diagnostics
Hein Stigum
Presentation, data and programs at: http://folk.uio.no/heins/Talks/
Jan-19
Agenda
  Linear regression diagnostics
    Assumptions
    Robust results
  (Logistic regression)
  (Poisson regression)
Time:
  Linear: 50-60 min
  Logistic, binary, conditional: 60 min
2 January 2019
Linear regression
Birth weight by gestational age
Workflow
  Scatterplots
  Bivariate analysis
  Regression
    Model fitting
      Cofactors in/out
      Interactions
    Test of assumptions
      Independent errors
      Linear effects
      Constant error variance
    Influence (robustness)
Model fitting: adjust if the cofactor is a confounder. Problems arise if the cofactor has missing values, is measured with error, or is in the causal path.
Test of assumptions: in Stata the estimation remains in memory, so use post-estimation commands.
Normally check assumptions first, then influence. Note the leverage value of the outlier.
Scatterplot
  N=518
  Pregnancy = 280 days is normal
Results
  Outcome: birth weight
  Covariates: gestational age, sex, parity
  Model: linear regression
  Note: synthetic data
Study birth weight: it varies with sex, mother's age and especially gestational age (length of pregnancy in days).
Descriptive (bivariate analysis).
Expected birth weight = 3531 for boys with parity = 0 and gest = 280.
To get the average, take 3530 - 166/2 + (2/5)*230 + 2*17 ≈ 3580 (the terms as written sum to 3573).
Model diagnostics
  Model
  Assumptions
    Independent errors (residuals)
    Linear effects
    Constant error variance
  Robustness
Must Y be normal?
  Normal Y, skewed X
  Skewed Y, normal X
Checking assumptions
1. Independent residuals
  No diagnostic tool
  Possible violations
    Pupils nested in schools: weak correlations
    Repeated measurements: strong correlations
  Models
    Adjust for clustering
    Linear mixed models
    GEE
Birth weight example, possible violation: repeated births by the same mother
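The three modelling options could look roughly like this in Stata. This is a sketch only: `weigth`, `gest` and `sex` are the variable names used on the summary slide, while `mother_id` is a hypothetical identifier for the mother that is not part of the original material.

```stata
* Hypothetical variable: mother_id identifies births by the same mother

* 1) Adjust for clustering: same coefficients, cluster-robust standard errors
regress weigth gest sex, vce(cluster mother_id)

* 2) Linear mixed model with a random intercept per mother
*    (the command is called xtmixed in Stata 12 and earlier)
mixed weigth gest sex || mother_id:

* 3) GEE with an exchangeable working correlation within mother
xtset mother_id
xtgee weigth gest sex, family(gaussian) link(identity) corr(exchangeable)
```

Option 1 changes only the standard errors. Options 2 and 3 also model the within-mother correlation.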
2. Linear effects
  Save residuals and predicted values
  Plot residuals vs predicted values
  If non-linear:
    Plot residuals vs continuous variables
    Add a square term or cut into categories
Add gest^2: linearity now OK
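A minimal Stata sketch of these steps, using the variable names from the summary slide (`weigth`, `gest`, `sex`). The names `pred`, `res` and `gest2` are chosen here for illustration.

```stata
* Fit the model with linear terms only
regress weigth gest sex

* Save predicted values and residuals from the model in memory
predict pred, xb
predict res, residuals

* Residuals vs predicted values: look for a curved pattern around zero
scatter res pred, yline(0)

* If non-linear: residuals vs each continuous covariate
scatter res gest, yline(0)

* Add a square term and check the residual plot again
generate gest2 = gest^2
regress weigth gest gest2 sex
rvfplot, yline(0)
```

`rvfplot` draws the residual-versus-fitted plot directly from the last regression, so the saved variables are only needed for custom plots.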
Linear effect test
  Model 1: only linear terms
  Model 2: linear terms + square term
  A significant test of Model 2 against Model 1 means non-linearity
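One possible way to run this comparison in Stata, as a sketch. It uses factor-variable notation for the square term instead of a generated variable, and assumes the variable names from the summary slide.

```stata
* Model 1: only linear terms
regress weigth gest sex
estimates store m1

* Model 2: linear terms + square term of gest
* c.gest##c.gest expands to gest and gest*gest
regress weigth c.gest##c.gest sex
estimates store m2

* Wald test of the square term
test c.gest#c.gest

* Likelihood-ratio test of Model 1 nested in Model 2
lrtest m1 m2
```

With a single added term the two tests usually lead to the same conclusion. The likelihood-ratio test requires both models to be fitted on the same observations.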
3. Constant residual variance
  Plot residuals vs predicted values
  If non-constant variance:
    Robust regression
    Weighted regression
SPSS: log transform, use Poisson regression, look for a missing cofactor
Constant variance test
  Significant means non-constant variance
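The slide does not say which test is shown. The sketch below uses two tests that Stata offers after `regress`, followed by a refit with robust standard errors, with variable names as on the summary slide.

```stata
regress weigth gest sex

* Residual-versus-fitted plot: look for a fan shape
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest

* White's general test for heteroskedasticity
estat imtest, white

* If the variance is not constant: keep the coefficients,
* but use robust (sandwich) standard errors
regress weigth gest sex, vce(robust)
```

`vce(robust)` is the current spelling of the `robust` option used on the summary slide.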
Weighted regression
  Estimate residual variance
  Weights = 1/variance
  Effects
    Takes care of heteroskedasticity
    "Robustification"
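The slide does not say how the residual variance is estimated. One common approach, shown here only as an assumption and not as the author's method, is to model the log of the squared residuals and use the inverse of the predicted variance as weights. The names `res0`, `logres2`, `logvar` and `w` are illustrative.

```stata
* Step 1: ordinary regression, save residuals
regress weigth gest sex
predict res0, residuals

* Step 2: estimate the residual variance as a function of the covariates
*         (the log scale keeps the predicted variance positive)
generate logres2 = ln(res0^2)
regress logres2 gest sex
predict logvar, xb

* Step 3: weights = 1/variance
generate w = 1/exp(logvar)

* Step 4: weighted regression with analytic weights
regress weigth gest sex [aweight = w]
```

Analytic weights in Stata are treated as inversely proportional to the variance of each observation, which matches weights = 1/variance.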
Summary of assumptions
  Dependent residuals
    Linear mixed models: xtmixed
  Non-linear effects
    gen gest2=gest^2
    regress weigth gest gest2 sex
  Non-constant variance
    regress weigth gest sex, robust
Checking robustness, measures of influence
Measures of influence
  Remove obs 1, see change
  Remove obs 2, see change
  Measure change in:
    Predicted (y)
    Deviance
    Coefficients (beta)
Influence idea
  Outlierness: residuals
  Leverage: distance from x-mean
  Influence: combination
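The three ingredients can be saved as variables after a regression. In this sketch Cook's distance is used as one example of a combined influence measure; that choice is an assumption, and the new variable names are illustrative.

```stata
regress weigth gest sex

* Outlierness: studentized residuals
predict rstu, rstudent

* Leverage: distance from the mean of the covariates (hat values)
predict lev, leverage

* Influence: Cook's distance combines residual and leverage
predict cook, cooksd

summarize rstu lev cook

* The five observations with the largest Cook's distance
gsort -cook
list rstu lev cook in 1/5
```

Note that `gsort` reorders the data, so observation numbers change after this step.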
Leverage versus residuals^2
  Look at:
    321: high leverage, medium residual
    111: low leverage, high residual
Added variable plot (partial regression leverage): an "adjusted" scatterplot
These plots lack the ability to show two points with large opposite effects, as delta-beta can.
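Both plots are available as post-estimation commands in Stata. A sketch, where `obsno` is a label variable created here so that points such as 321 and 111 can be identified; whether the labels on the slide are row numbers is an assumption.

```stata
* Label variable: row number in the current sort order (illustrative)
generate obsno = _n

regress weigth gest sex

* Leverage versus squared normalized residuals
lvr2plot, mlabel(obsno)

* Added variable plot for gestational age, adjusted for the other covariates
avplot gest, mlabel(obsno)
```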
Delta beta (for gestational age)
  Pro: direct measure of the coefficient we are interested in, in both positive and negative directions
  Con: one measure for each covariate
  Scaled?
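A sketch of how delta-beta for gestational age could be obtained in Stata. After `regress`, Stata's DFBETA is scaled: it is the change in the coefficient when an observation is removed, divided by the coefficient's standard error. The names `dfb_gest` and `nr` are illustrative.

```stata
regress weigth gest sex

* Scaled delta-beta for the gest coefficient
predict dfb_gest, dfbeta(gest)

* Plot against row number; positive and negative influence are both visible
generate nr = _n
scatter dfb_gest nr, yline(0) mlabel(nr)

* Rule-of-thumb cutoff: |DFBETA| > 2/sqrt(N)
list nr dfb_gest if abs(dfb_gest) > 2/sqrt(e(N)) & !missing(dfb_gest)
```

The cutoff is a conventional screening value, not a formal test. With large N it flags many points, so the plot is often more informative.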
Delta fitted value, Dfits
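Dfits can be saved in the same way. In this sketch the variable name `dfit` and the rule-of-thumb cutoff 2*sqrt(k/N) are additions and are not taken from the slide.

```stata
regress weigth gest sex

* Scaled change in the fitted value when each observation is removed
predict dfit, dfits

* k = number of estimated coefficients, including the constant
scalar cut = 2*sqrt((e(df_m) + 1)/e(N))

list dfit if abs(dfit) > cut & !missing(dfit)
```

Unlike delta-beta, this gives one measure per observation for the whole model rather than one per covariate.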
Summary: Robustness, influence
  Linear regression is sensitive!
  Look for influential points
    Leverage versus residual plots
    Added variable plots
    Delta-beta
  Rerun the regression without influential points and look for change in:
    coefficients
    constant term
    p-values
Found an influential point in a dataset with N=30 000! (MoBa, exercise)
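The rerun-and-compare step could be sketched as follows. Observation 321 is the point flagged on the leverage slide; treating that label as a row number is an assumption, so replace the condition with the identifier used in the actual data.

```stata
* All observations
regress weigth gest sex
estimates store full

* Without the suspected influential point (assumed here to be row 321)
regress weigth gest sex if _n != 321
estimates store dropped

* Coefficients, constant term, standard errors and p-values side by side
estimates table full dropped, b(%9.2f) se p
```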
Logistic, Poisson regression
  Assumptions
    Independent errors: as before
    Linear effects: as before
    Constant error variance: no!
  Robustness
    Linear: not robust!
    Poisson: medium robust
    Logistic: fairly robust