Regression: Checking the Model. Peter T. Donnan, Professor of Epidemiology and Biostatistics. Statistics for Health Research.
Objectives of session: Recognise the need to check fit of the model. Carry out checks of assumptions in SPSS for simple linear regression. Understand predictive model. Understand residuals.
How is the fitted line obtained? Use the method of least squares (LS): seek to minimise the sum of squared vertical differences between each point and the fitted line. This results in parameter estimates, or regression coefficients, of slope (b) and intercept (a): y = a + bx
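As a rough illustration of the least squares idea, the closed-form estimates for a single predictor can be sketched in Python. This is a minimal sketch using made-up data, not the LDL data set; the function name `fit_line` is hypothetical and this is not how SPSS is implemented internally.

```python
from statistics import mean

def fit_line(x, y):
    """Return (a, b) minimising the sum of squared vertical differences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through the means
    return a, b

# Made-up illustrative data (assumption, not from the slides)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
print(f"y = {a:.2f} + {b:.2f}x")
```

The slope is the ratio of the x-y cross-products to the x sum of squares, and the intercept follows from forcing the line through the point of means.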
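Consider the fitted line y = a + bx. [Figure: fitted line with Explanatory (x) on the horizontal axis, Dependent (y) on the vertical axis, and the intercept a marked.]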
Consider the regression of minimum LDL cholesterol achieved on age. In SPSS select Regression, then Linear…. Dependent (y): Min LDL achieved. Independent (x): Age_Base.
Output from SPSS linear regression: Coefficients table (Dependent Variable: Min LDL achieved) with columns Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), t and Sig., and rows (Constant) and Age at baseline; the cell values are not preserved here. N.B. The slope for age may look very small, but it represents the DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year.
H0: slope b = 0. Test t = slope/SE = -0.008/0.002, with p < 0.001, so statistically significant. Predicted LDL = a - 0.008 × Age, where a is the (Constant) estimate in the SPSS Coefficients table (Dependent Variable: Min LDL achieved).
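The test of H0: slope = 0 divides the slope by its standard error. A minimal sketch of that calculation, using made-up data rather than the LDL data set, might look like this. The function name `slope_t_statistic` is hypothetical, and the p-value (from a t distribution with n - 2 degrees of freedom, reported by SPSS in the Sig. column) is not computed here.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y):
    """Return (b, se_b, t) for the slope of a simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    # Residual sum of squares around the fitted line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope: residual variance (n - 2 df) over Sxx
    se_b = sqrt(sse / (n - 2) / sxx)
    return b, se_b, b / se_b

# Made-up illustrative data (assumption, not from the slides)
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
b, se_b, t = slope_t_statistic(x, y)
print(f"slope={b:.3f}  se={se_b:.3f}  t={t:.3f}")
```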
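Prediction equation from linear regression: Predicted LDL achieved = a - 0.008 × Age, with a the fitted intercept. So for a man aged 65 the predicted LDL achieved = a - 0.008 × 65 = a - 0.52. The slide tabulates Age against Predicted Min LDL; the values are not preserved here.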
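The prediction equation can be sketched as a small function. The slope of -0.008 per year is taken from the slide; the intercept is not available here, so the value used below is a HYPOTHETICAL placeholder for illustration only, and the function name `predict_min_ldl` is likewise an assumption.

```python
SLOPE_PER_YEAR = -0.008  # slope for age quoted on the slide

def predict_min_ldl(age, intercept, slope=SLOPE_PER_YEAR):
    """Predicted minimum LDL = intercept + slope * age."""
    return intercept + slope * age

# HYPOTHETICAL intercept, for illustration only (not from the SPSS output)
hypothetical_intercept = 3.0

for age in (45, 55, 65, 75):
    print(age, round(predict_min_ldl(age, hypothetical_intercept), 3))
```

Whatever the intercept, each extra year of age lowers the prediction by 0.008, so a 10-year age difference corresponds to a difference of 0.08 in predicted LDL.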
Assumptions of Regression 1. Relationship is linear 2. Outcome variable and hence residuals or error terms are approx. Normally distributed
Use Graphs and Scatterplot to obtain the Lowess line of fit
1. Create a Scatterplot and then double-click it to enter the chart editor. 2. Choose the icon ‘Add fit line at total’. 3. Then select the type of fit, such as Lowess.
Linear assumption: fitted lowess smoothed line. The lowess smoothed line (red) gives a good eyeball check of the linear assumption when compared with the fitted straight line (green).
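To give a feel for what a lowess-type smoother does, here is a simplified sketch: at each x it fits a straight line to the nearest points, weighted by the tricube function. This is an assumption-laden illustration (no robustness iterations, hypothetical function name `lowess_smooth`, made-up data) and is not SPSS's implementation.

```python
def lowess_smooth(x, y, frac=0.5):
    """Local linear fit with tricube weights at each x (no robustness passes)."""
    n = len(x)
    k = min(n, max(2, round(frac * n)))  # number of neighbours in each window
    smoothed = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1]  # window half-width: k-th smallest distance
        if h == 0:
            # All neighbours share the same x: use only those points
            w = [1.0 if d == 0 else 0.0 for d in dist]
        else:
            w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
        yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
        sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(yw + slope * (x0 - xw))
    return smoothed

# Made-up curved data (assumption): a straight line would fit this poorly
x = list(range(10))
y = [0.5 * xi ** 2 for xi in x]
print([round(s, 2) for s in lowess_smooth(x, y)])
```

If the smoothed values track a straight line closely, the linear assumption looks reasonable; systematic bending away from it suggests another functional form.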
Definition of a residual: A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation: r_i = y_i - E(y_i), or r_i = y_i - (a + b x_i)
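The definition can be sketched directly in code: fit the line, then subtract each fitted value from the observed value. The data are made up and the function name `residuals` is hypothetical.

```python
from statistics import mean

def residuals(x, y):
    """Return r_i = y_i - (a + b*x_i) for the least squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up illustrative data (assumption, not from the slides)
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
res = residuals(x, y)
print([round(r, 2) for r in res])
print("mean of residuals:", round(mean(res), 10))
```

With an intercept in the model, least squares residuals always average to zero, which is why the mean of the residuals is not itself evidence of a good fit.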
Residuals
To assess the residuals in SPSS linear regression, select Plots…: choose the standardised (normalised) residual and the standardised predicted value of LDL for the scatterplot, then select the histogram of residuals and the normal probability plot.
In SPSS linear regression, select Statistics…: select confidence intervals for the regression coefficients and model fit; select Durbin-Watson for serial correlation and casewise diagnostics for identification of outliers.
Output: Scatterplot of residuals vs. predicted. Note: 1) Mean of residuals = 0. 2) Most of the data lie within ±3 SDs of the mean.
Assumptions of Regression 1. Relationship is linear 2. Outcome variable and hence residuals or error terms are approx. Normally distributed
Output: Histogram of standardised residuals. Plot of residuals with normal curve superimposed.
Output: Cumulative probability plot. Look for deviation from the diagonal line to indicate non-normality.
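The points on a normal cumulative probability (P-P) plot can be sketched as follows: the observed cumulative proportion of each ordered residual is paired with the cumulative probability expected under a standard normal distribution. The plotting positions (i + 0.5)/n used here are an assumption; SPSS may use a different rank formula. The function name `pp_plot_points` and the residual values are made up.

```python
from statistics import NormalDist

def pp_plot_points(std_residuals):
    """Return (observed, expected) cumulative probabilities for a normal P-P plot."""
    ordered = sorted(std_residuals)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, SD 1
    points = []
    for i, z in enumerate(ordered):
        observed = (i + 0.5) / n            # assumed plotting position
        expected = standard_normal.cdf(z)   # probability under normality
        points.append((observed, expected))
    return points

# Made-up standardised residuals (assumption, not from the slides)
z = [-1.6, -0.9, -0.4, -0.1, 0.2, 0.5, 1.0, 1.8]
points = pp_plot_points(z)
largest_gap = max(abs(obs - exp) for obs, exp in points)
print(f"largest deviation from the diagonal: {largest_gap:.3f}")
```

Points close to the diagonal (observed roughly equal to expected) are consistent with approximate normality; a large or systematic gap is the deviation the slide asks you to look for.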
Output: Description of residuals. Descriptive statistics for residuals. Casewise Diagnostics table (Dependent Variable: Min LDL achieved) listing subjects with standardised residuals > 3 in absolute value, with columns Case Number, Std. Residual, Min LDL, Predicted and Residual; the cell values are not preserved here. Worth investigation?
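A casewise listing of this kind can be sketched by dividing each residual by the standard error of the estimate and reporting the cases beyond 3. The data below are made up with one deliberately planted outlier, and the function name `flag_outliers` is hypothetical.

```python
from math import sqrt
from statistics import mean

def flag_outliers(x, y, threshold=3.0):
    """Return (case number, standardised residual) where |std residual| > threshold."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(r * r for r in res) / (n - 2))  # std. error of the estimate
    # Case numbers start at 1, as in an SPSS data sheet
    return [(i + 1, r / s) for i, r in enumerate(res) if abs(r / s) > threshold]

# Made-up data (assumption): a line with small wiggles and one planted outlier
x = list(range(1, 21))
y = [0.5 * xi + (0.2 if xi % 2 else -0.2) for xi in x]
y[9] += 10  # case 10 is pushed far above the line
flagged = flag_outliers(x, y)
for case, z in flagged:
    print(f"case {case}: standardised residual {z:.2f}")
```

Flagged cases are candidates for checking (data entry errors, unusual patients), not for automatic deletion.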
Output: Model fit and serial correlation. R: correlation between min LDL achieved and Age at baseline (value not preserved here). R²: % variation explained, here 1.5%, not particularly high. Durbin-Watson test for serial correlation of residuals: the statistic should be approximately 2 if there is no serial correlation. Model Summary table with columns R, R Square, Adjusted R Square, Std. Error of the Estimate and Durbin-Watson (Predictors: (Constant), Age at baseline).
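Both summary quantities can be computed from the residuals. The sketch below uses the standard definitions (R² as one minus residual over total sum of squares, Durbin-Watson as the sum of squared successive residual differences over the residual sum of squares); the data are made up and the function name `model_summary` is hypothetical.

```python
from statistics import mean

def model_summary(x, y):
    """Return (r_squared, durbin_watson) for a simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(r * r for r in res)                 # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)      # total variation
    r_squared = 1 - sse / sst
    # Residuals are taken in the order the cases appear in the data
    dw = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res))) / sse
    return r_squared, dw

# Made-up illustrative data (assumption, not from the slides)
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
r_squared, dw = model_summary(x, y)
print(f"R squared = {r_squared:.2f}, Durbin-Watson = {dw:.2f}")
```

The Durbin-Watson statistic ranges from 0 to 4; values well below 2 suggest positive serial correlation and values well above 2 suggest negative serial correlation. It only makes sense when the case order is meaningful, for example time order.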
Summary: After fitting any regression model, check assumptions. Functional form: linearity is the default but often not the best fit; consider quadratic… Check residuals for approx. normality. Check residuals for outliers (> 3 SDs). All accomplished within SPSS.
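The quadratic alternative to a straight line can be sketched by adding a squared term and solving the least squares normal equations. This is a minimal illustration with made-up data; the function names `solve` and `fit_quadratic` are hypothetical, and in practice the fit would be done in SPSS.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_quadratic(x, y):
    """Least squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    cols = [[1.0] * len(x), list(x), [xi ** 2 for xi in x]]
    A = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * yi for p, yi in zip(ci, y)) for ci in cols]
    return solve(A, rhs)

# Made-up curved data (assumption, not from the slides)
x = [0, 1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 4.2, 8.8, 16.1, 25.2, 35.9]
c0, c1, c2 = fit_quadratic(x, y)
sse_quadratic = sum((yi - (c0 + c1 * xi + c2 * xi ** 2)) ** 2 for xi, yi in zip(x, y))
print(f"y = {c0:.2f} + {c1:.2f}x + {c2:.2f}x^2, residual SS = {sse_quadratic:.3f}")
```

Comparing the residual sum of squares (or R²) of this fit with that of the straight line gives a first indication of whether the extra term improves the functional form.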
Practical on Model Checking: Read in ‘LDL Data.sav’. 1) Fit an age squared term in the min LDL model and check the fit of the model compared to the linear fit (Hint: use Transform/Compute to create the age squared term and fit age and age²). 2) Fit separate linear regressions of min Chol achieved with predictors of a) baseline Chol, b) APOE_lin, c) adherence. Check assumptions and interpret results.