Unit 2
Assumptions in linear regression models

Yi = β0 + β1 x1i + … + βk xki + εi,  i = 1, …, n

Assumptions
- x1i, …, xki are deterministic (not random variables).
- ε1, …, εn are independent random variables with null mean, i.e. E(εi) = 0, and common variance, i.e. V(εi) = σ².

Consequences
- E(Yi) = β0 + β1 x1i + … + βk xki and V(Yi) = σ², i = 1, …, n.
- The OLS (ordinary least squares) estimators of β0, …, βk, denoted b0, …, bk, are BLUE (Best Linear Unbiased Estimators) by the Gauss-Markov theorem.
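As a rough illustration of what the OLS estimators compute, here is a minimal Python sketch using NumPy's least-squares solver. The helper name `ols_fit` and the simulated data (true betas 1, 2 and -0.5) are assumptions made for this example; they are not part of the course material, which uses SPSS.

```python
import numpy as np

def ols_fit(x, y):
    """Return OLS estimates (b0, b1, ..., bk) for y = b0 + b1*x1 + ... + bk*xk."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    # Prepend a column of ones so that b0 is the intercept.
    design = np.column_stack([np.ones(len(y)), x])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b

# Illustrative data (assumed, not from the slides): fixed x values, random errors.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 2))
eps = rng.normal(0, 1, size=200)          # independent, mean 0, common variance
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 1] + eps
print(ols_fit(x, y))                      # estimates should lie near 1, 2, -0.5
```

The intercept is handled by an explicit column of ones, so the returned vector is ordered b0, b1, …, bk as on the slide.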
Normality assumption

If, in addition, we assume that the errors are Normal r.v.:
- ε1, …, εn are independent NORMAL r.v. with null mean and common variance σ², i.e. εi ~ N(0, σ²), i = 1, …, n.

Consequences
- Yi ~ N(β0 + β1 x1i + … + βk xki, σ²), i = 1, …, n
- bi ~ N(βi, V(bi)), i = 0, …, k

The normality assumption is needed to make reliable inference (confidence intervals and tests of hypotheses), i.e. it makes the probability statements exact.
If the normality assumption does not hold then, under some conditions, a large number of observations n allows reliable asymptotic inference on the estimated betas via a Central Limit Theorem.
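To make the link to inference concrete: when σ² is estimated from the residuals, the standardized estimator follows a Student t distribution, which yields the usual confidence interval. The symbols se(b_i) (estimated standard error of b_i) and t_{α/2, n-k-1} (upper α/2 quantile) are notation introduced here for the sketch, not taken from the slides.

```latex
\frac{b_i - \beta_i}{\widehat{\mathrm{se}}(b_i)} \sim t_{\,n-k-1},
\qquad
b_i \pm t_{\alpha/2,\; n-k-1}\,\widehat{\mathrm{se}}(b_i),
\qquad i = 0, \dots, k
```

The degrees of freedom n - k - 1 are the residual degrees of freedom of a model with k independent variables and an intercept.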
Checking assumptions

The error term ε is unobservable, but it can be estimated by using the parameter estimates. The regression residual is defined as
ei = yi - ŷi, i = 1, 2, ..., n
where ŷi = b0 + b1 x1i + … + bk xki is the predicted value.
Plots of the regression residuals are fundamental in revealing model inadequacies such as
- non-normality
- unequal variances
- presence of outliers
- correlation (in time) of error terms
Detecting model lack of fit with residuals

- Plot the residuals ei on the vertical axis against each of the independent variables x1, ..., xk on the horizontal axis.
- Plot the residuals ei on the vertical axis against the predicted value ŷ on the horizontal axis.
- In each plot look for trends, dramatic changes in variability, and/or more than 5% of the residuals lying more than 2s away from 0, where s is the estimated standard deviation of the errors. Any of these patterns indicates a problem with model fit.

Use the Scatter/Dot graph command in SPSS to construct any of the plots above.
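The numerical part of these checks can be sketched outside SPSS as follows. This is a minimal Python example under stated assumptions: the function name `residual_summary` and the simulated straight-line data are hypothetical, and s is computed as the square root of the residual sum of squares divided by n minus the number of fitted coefficients.

```python
import numpy as np

def residual_summary(design, y):
    """Fit OLS and return residuals, fitted values, s, and the share of |e_i| > 2s.

    `design` must already contain the column of ones for the intercept.
    """
    design = np.asarray(design, dtype=float)
    y = np.asarray(y, dtype=float)
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ b
    resid = y - fitted
    n, p = design.shape                      # p = k + 1 coefficients
    s = np.sqrt(resid @ resid / (n - p))     # estimated error standard deviation
    share_outside = float(np.mean(np.abs(resid) > 2 * s))
    return resid, fitted, s, share_outside

# Illustrative data (assumed): a correctly specified straight-line model.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 1, 200)
design = np.column_stack([np.ones_like(x), x])
resid, fitted, s, share = residual_summary(design, y)
print(f"s = {s:.3f}, share of residuals beyond 2s = {share:.1%}")
```

The returned `resid` and `fitted` arrays are what would go on the vertical and horizontal axes of the residual plot; a share clearly above 5% is the warning sign mentioned above.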
Examples: residuals vs. predicted
[Figure: four residual plots, labelled "fine", "nonlinearity", "unequal variances" and "outliers"]
Examples: residuals vs. predicted (continued)
[Figure: two residual plots, labelled "auto-correlation" and "nonlinearity and auto-correlation"]
Partial residuals plot

An alternative method to detect lack of fit in models with more than one independent variable uses the partial residuals. For a selected j-th independent variable xj,
e* = y - (b0 + b1 x1 + ... + b_{j-1} x_{j-1} + b_{j+1} x_{j+1} + ... + bk xk) = e + bj xj
Partial residuals measure the influence of xj after the effects of all other independent variables have been removed.
A plot of the partial residuals for xj against xj often reveals more information about the relationship between y and xj than the usual residual plot. If everything is fine, the points should follow a straight line with slope bj.
Partial residual plots can be calculated in SPSS by selecting “Produce all partial plots” in the “Plots” options in the “Regression” dialog box.
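The definition e* = e + bj xj translates directly into code. The sketch below is a minimal Python illustration; the function name `partial_residuals`, the 0-based column index `j`, and the simulated data (where y depends on the second variable through a square) are assumptions made for the example.

```python
import numpy as np

def partial_residuals(x, y, j):
    """Return (e*, b_j) where e* = e + b_j * x_j for column j (0-based) of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b                 # ordinary residuals e
    bj = b[j + 1]                          # +1 skips the intercept b0
    return resid + bj * x[:, j], bj

# Illustrative data (assumed): y is linear in x1 but quadratic in x2.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(100, 2))
y = 2 + 0.5 * x[:, 0] + x[:, 1] ** 2 + rng.normal(0, 1, 100)

e_star, b2 = partial_residuals(x, y, j=1)
# The least-squares line through (x_j, e*) has slope b_j by construction.
slope = np.polyfit(x[:, 1], e_star, 1)[0]
print(f"b2 = {b2:.3f}, slope of e* on x2 = {slope:.3f}")
```

Plotting `e_star` against `x[:, 1]` would show the curvature around that straight line, which is the information the ordinary residual plot tends to hide.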
Example

A supermarket chain wants to investigate the effect of price on the weekly demand for a house brand of coffee. Eleven prices were randomly assigned to the stores and were advertised using the same procedure. A few weeks later the chain conducted the same experiment using no advertising.
Y: weekly demand, in pounds
X1: price, in dollars/pound
X2: advertisement (1 = Yes, 0 = No)
Model 1: E(Y) = β0 + β1x1 + β2x2
Data: Coffee2.sav
Computer Output
Residual and partial residual (price) plots
[Figure: residuals vs. price] Shows non-linearity.
[Figure: partial residuals for price vs. price] Shows the nature of the non-linearity.
Try using 1/x instead of x.
E(Y) = β0 + β1(1/x1) + β2x2
RecPrice = 1/Price
Residual and partial residual (1/price) plots
After fitting the model with the independent variable “x1 = 1/price”, the residual plot does not show any pattern and the partial residual plot for 1/price does not show any non-linearity.
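The effect of replacing price by its reciprocal can be sketched with made-up numbers. The data below are NOT Coffee2.sav: the prices, coefficients and noise level are assumptions chosen only so that demand truly depends on 1/price, and `residual_ss` is a hypothetical helper.

```python
import numpy as np

def residual_ss(design, y):
    """Residual sum of squares of the OLS fit of y on the columns of design."""
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b
    return float(resid @ resid)

# Made-up data (NOT Coffee2.sav): 11 prices, each used with and without advertising.
rng = np.random.default_rng(1)
price = np.tile(np.linspace(3.0, 5.0, 11), 2)
advert = np.repeat([1.0, 0.0], 11)
demand = 100 + 2000 / price + 300 * advert + rng.normal(0, 2, size=22)

ones = np.ones_like(price)
ss_linear = residual_ss(np.column_stack([ones, price, advert]), demand)
ss_recip = residual_ss(np.column_stack([ones, 1 / price, advert]), demand)
print(f"residual SS with price:   {ss_linear:.1f}")
print(f"residual SS with 1/price: {ss_recip:.1f}")
```

In this artificial setting the reciprocal model leaves a much smaller residual sum of squares, mirroring the disappearance of the pattern in the residual plots.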
An example with simulated data

The true model, supposedly unknown, is
Y = 1 + x1 + 2∙x2 + 1.5∙x1∙x2 + ε, with ε ~ N(0,1)
Data: Interaz.sav
Fit a model based on data.
[Figure: scatter plots of Y, x1 and x2; Cor(X1, X2) = 0.131]
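A data set of this kind can be generated and fitted in a few lines. The sketch below is not a reconstruction of Interaz.sav: the sample size n = 100, the uniform distribution of x1 and x2 on (0, 5) and the seed are assumptions, and `fit` is a hypothetical helper. Only the true model equation comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
# Assumed design: the slides do not say how x1 and x2 were generated.
x1 = rng.uniform(0, 5, n)
x2 = rng.uniform(0, 5, n)
# True model from the slide: Y = 1 + x1 + 2*x2 + 1.5*x1*x2 + eps, eps ~ N(0, 1)
y = 1 + x1 + 2 * x2 + 1.5 * x1 * x2 + rng.normal(0, 1, n)

def fit(design, y):
    """Return OLS coefficients and the residual mean square."""
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b
    return b, float(resid @ resid) / (len(y) - design.shape[1])

ones = np.ones(n)
b1, ms1 = fit(np.column_stack([ones, x1, x2]), y)            # main effects only
b2, ms2 = fit(np.column_stack([ones, x1, x2, x1 * x2]), y)   # adds the interaction
print("main effects:    ", np.round(b1, 3), "residual MS =", round(ms1, 3))
print("with interaction:", np.round(b2, 3), "residual MS =", round(ms2, 3))
```

Omitting the interaction inflates the residual mean square and distorts the main-effect estimates, while including it brings the estimates close to the true values 1, 1, 2 and 1.5.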
Model 1: E(Y) = β0 + β1x1 + β2x2

ANOVA
              SS         df   MS         F         Sig.
Regression    8447.42     2   4223.711   768.494   .000
Residual       533.12    97      5.496
Total         8980.54    99

Adj. R² = 0.939

Coefficients
              B        Std. Error   t        Sig.   VIF
(Constant)    -6.092   .630         -9.668   .000
X1             3.625   .207         17.528          1.018
X2             6.145   .189         32.465
Model 1: standardized residual plot
[Figure: standardized residual plot for Model 1]
Nonlinearity is present. What is it due to? Since the scatter plots do not show any non-linearity, it could be due to an interaction.
Model 1: partial regression plots
[Figure: partial regression plots for X1 and X2]
They show that the linear effects are roughly fine, but some non-linearity shows up.
Model 2: E(Y) = β0 + β1x1 + β2x2 + β3x1x2

ANOVA
              SS         df   MS         F         Sig.
Regression    8885.372    3   2961.791   2987.64   .000
Residual        95.169   96       .991
Total         8980.541   99

Adj. R² = 0.989

Coefficients
              B       Std. Error   t        Sig.   VIF
(Constant)     .305   .405           .753   .453
X1            1.288   .142          9.087   .000   2.648
X2            2.098   .209         10.051          6.857
IntX1X2       1.411   .067         21.018          9.280
Model 2: standardized residual plot
[Figure: standardized residual plot for Model 2]
Looks fine.
Model 2: partial regression plots
[Figure: partial regression plots for X1, X2 and X1X2]
All plots show linearity of the corresponding terms. Maybe an outlier is present.
Model 3: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x2²

Suppose I want to try fitting a quadratic term.

ANOVA
              SS         df   MS        F         Sig.
Regression    8890.686    4   2222.67   2349.92   .000
Residual        89.856   95      .946
Total         8980.541   99

Adj. R² = 0.990

Coefficients
              B       Std. Error   t        Sig.   VIF
(Constant)     .023   .413           .055   .956
X1            1.258   .139          9.051   .000    2.670
X2            2.615   .299          8.757          14.713
IntX1X2       1.436   .066         21.619           9.528
X2Square      -.137   .058         -2.370   .020   11.307

The x2² term seems fine (Sig. = .020), but multicollinearity (MC) is higher, as the larger VIFs show.
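The VIF column can be reproduced from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing the j-th independent variable on all the others. With only two independent variables this reduces to 1 / (1 - r²), so a correlation of about 0.131 gives a VIF of about 1.02, in line with the value reported for Model 1. The sketch below is a minimal Python illustration; the function name `vif` and the simulated columns are assumptions, not the course data.

```python
import numpy as np

def vif(x):
    """Variance inflation factor of each column of x (intercept added internally)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    out = []
    for j in range(k):
        target = x[:, j]
        others = np.column_stack([np.ones(n), np.delete(x, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ b
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Illustrative columns (assumed): adding product and square terms raises the VIFs.
rng = np.random.default_rng(3)
x1 = rng.uniform(0, 5, 100)
x2 = rng.uniform(0, 5, 100)
print("x1, x2:              ", np.round(vif(np.column_stack([x1, x2])), 3))
print("x1, x2, x1*x2, x2**2:", np.round(vif(np.column_stack([x1, x2, x1 * x2, x2 ** 2])), 3))
```

Interaction and squared terms are built from the original variables, so they are strongly correlated with them; centering the variables before forming such terms is a common way to reduce this effect.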
Model 3: standardized residual plot
[Figure: standardized residual plot for Model 3]
Looks fine.
Model 3: partial regression plots
[Figure: partial regression plots for X1, X2, X1X2 and X2²; one panel is annotated: doesn’t show “linearity”]
Checking the normality assumption

The inference procedures on the estimates (tests and confidence intervals) are based on the Normality assumption on the error term ε. If this assumption is not satisfied, the conclusions drawn may be wrong.
Again, the residuals ei are used for checking this assumption. Two widely used graphical tools are
- the P-P plot for Normality of the residuals
- the histogram of the residuals compared with the Normal density function.
The P-P plot for Normality and the histogram of the residuals can be produced in SPSS by selecting the appropriate boxes in the “Plots” options in the “Regression” dialog box.
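The coordinates of a Normal P-P plot can be computed with the Python standard library alone. This is a minimal sketch: the function name `normal_pp_points` is hypothetical, the plotting positions (i - 0.5)/n are one common convention assumed here (SPSS may use a different one), and the sample is simulated rather than taken from a fitted model.

```python
import random
from statistics import NormalDist, mean, stdev

def normal_pp_points(residuals):
    """Return (observed, expected) cumulative probabilities for a Normal P-P plot."""
    z = sorted(residuals)
    n = len(z)
    fitted = NormalDist(mean(z), stdev(z))               # Normal with the sample mean and sd
    observed = [(i - 0.5) / n for i in range(1, n + 1)]  # assumed plotting positions
    expected = [fitted.cdf(v) for v in z]
    return observed, expected

# Illustrative residuals (assumed): a simulated Normal sample.
rnd = random.Random(0)
sample = [rnd.gauss(0, 1) for _ in range(500)]
observed, expected = normal_pp_points(sample)
max_gap = max(abs(o - e) for o, e in zip(observed, expected))
print(f"largest gap from the diagonal: {max_gap:.3f}")
```

Plotting `expected` against `observed` gives the P-P plot; for Normal residuals the points stay close to the diagonal, so the largest gap is small.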
Social Workers example: E(ln(Y)) = β0 + β1x
[Figure: histogram of the residuals with the Normal density overlaid] The histogram should match the continuous line.
[Figure: Normal P-P plot of the residuals] The points should be as close as possible to the straight line.
Neither graph shows strong departures from the Normality assumption.