Violations of Regression Assumptions
Copyright (c) 2008 by The McGraw-Hill Companies. This spreadsheet is intended solely for educational purposes by licensed users of LearningStats. It may not be copied or resold for profit.
Minitab Copyright Notice
Portions of MINITAB Statistical Software input and output contained in this document are printed with permission of Minitab, Inc. MINITAB™ is a trademark of Minitab, Inc. in the United States and other countries and is used herein with the owner's permission.
Regression Assumptions
yi = b0 + b1x1i + b2x2i + … + bpxpi + ei
- Correct model specified (no relevant variables omitted)
- Appropriate model form (e.g., linear)
- Predictors are non-stochastic and independent
- Errors (disturbances) are random:
  - zero mean
  - normally distributed
  - homoscedastic (constant variance)
  - mutually independent (non-autocorrelated)
Standard Notation
ei ~ N(0, σ2), which says that:
- Errors have zero mean
- Errors are normally distributed
- Errors have constant variance σ2
Violations of Assumptions
yi = b0 + b1x1i + b2x2i + … + bpxpi + ei
- Relevant predictors were omitted
- Wrong model form specified (e.g., linear when the true relationship is curved)
- Collinear predictors (i.e., correlated Xj and Xk)
- Non-normal errors (e.g., skewed, outliers)
- Heteroscedastic errors (non-constant variance)
- Autocorrelated errors (non-independent)
What Is Specification Bias?
Wrong model form or the wrong variables.
- Example of wrong model form: you said Y = a + bX, but actually Y = a + bX + cX2
- Example of wrong variables: you said Y = a + bX, but actually Y = a + bX + cZ
Specification Bias
[Scatter plot: a linear model was specified, but the data actually are non-linear]
Detecting Specification Bias
In a bivariate model:
- Plot Y against X
- Plot residuals against estimated Y
In a multivariate model:
- Plot residuals against actual Y
- Plot fitted Y against actual Y
Look for patterns (there should be none).
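The pattern-hunting above can be sketched numerically as well as graphically. Below is a minimal pure-Python illustration (the data are invented for this example, not from the slides): a straight line is fitted to data that are actually quadratic, and the residual signs show a run pattern instead of random scatter.

```python
# Fit a straight line y = a + b*x by least squares to data that are
# actually quadratic, then inspect the residual signs for a pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [x * x for x in xs]          # true relationship is y = x^2

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]
print(signs)  # ['+', '-', '-', '-', '-', '+']: runs of like signs suggest a misspecified form
```

A correctly specified model would give residual signs that look random; the systematic +, then a run of -, then + pattern here is the numeric counterpart of the curved residual plot.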
Residuals Plotted on Y
[Plot: the residuals are correlated with Y, which suggests incorrect specification]
What Is Multicollinearity?
The "independent" variables are related.
- Collinearity: correlation between any two predictors
- Multicollinearity: relationships among several predictors
Effects of Multicollinearity
- Estimates may be unstable
- Standard errors may be misleading
- Confidence intervals generally too wide
- High R2 yet insignificant t statistics
Variance Inflation Factor
VIFs give a simple multicollinearity test. Each predictor has a VIF. For predictor j, the VIF is
VIFj = 1 / (1 - Rj2)
where Rj2 is the coefficient of determination when predictor j is regressed against all the other predictors.
Variance Inflation Factor
- Example A: If Rj2 = .00, then VIFj = 1/(1 - .00) = 1
- Example B: If Rj2 = .90, then VIFj = 1/(1 - .90) = 10
MegaStat and MINITAB will calculate the VIF for each predictor if you request it.
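The arithmetic above is easy to reproduce by hand for the two-predictor case, where Rj2 is simply the squared correlation between the two predictors. A small pure-Python sketch (the data are illustrative, not from the slides):

```python
# For two predictors, Rj^2 is just the squared correlation between them,
# so VIF = 1 / (1 - r^2) for both predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8]   # nearly a multiple of x1 -> collinear

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

r2 = correlation(x1, x2) ** 2
vif = 1.0 / (1.0 - r2)
print(vif)   # a VIF above 10 signals troublesome collinearity
```

With more than two predictors each Rj2 comes from a full regression of Xj on the others, which is why the slides suggest letting MegaStat or MINITAB do the work.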
Evidence of Multicollinearity
- Any VIF > 10
- Sum of the VIFs > 10
- High correlation for pairs of predictors Xj and Xk
- Unstable estimates (i.e., the remaining coefficients change sharply when a suspect predictor is dropped from the model)
Example: Estimating Body Fat
Problem: Several VIFs exceed 10.
Correlation Matrix of Predictors
Age and Height are relatively independent of other predictors. Problem: Neck, Chest, Abdomen, and Thigh are highly correlated.
Solution: Eliminate Some Predictors
R2 is reduced slightly, but all VIFs are now below 10.
Stability Check for Coefficients
There are large changes in estimated coefficients as high VIF predictors are eliminated, revealing that the original estimates were unstable. But the “fit” deteriorates when we eliminate predictors.
Example: College Graduation Rates
Minor problem? The sum of the VIFs exceeds 10 (but few statisticians would worry since no single VIF is very large).
Remedies for Multicollinearity
- Drop one or more predictors (but this may create specification error)
- Transform some variables (e.g., log X)
- Enlarge the sample size (if you can)
Tip: If they feel the model is correctly specified, statisticians tend to ignore multicollinearity unless its influence on the estimates is severe.
What Is Heteroscedasticity?
Non-constant error variance.
- Homoscedastic: errors have the same variance for all values of the predictors (or Y)
- Heteroscedastic: the error variance changes with the values of the predictors (or Y)
How to Detect Heteroscedasticity
- Plot residuals against each predictor (a bit tedious)
- Plot residuals against estimated Y (quick check)
- There are more general tests, but they are complex
Excel, MegaStat, and MINITAB will do these residual plots if you request them.
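A crude numeric version of the residual-plot check, in the spirit of the Goldfeld-Quandt idea: fit the model, split the residuals by fitted value, and compare the residual variance in the two halves. This is only a sketch with invented data (not from the slides), not a formal test:

```python
# Fit a line, then compare residual variance in the lower vs. upper half
# of the x range; a large ratio hints at non-constant error variance.
xs = [1, 2, 3, 4, 5, 6, 7, 8]                       # already sorted
ys = [2.1, 3.8, 6.5, 7.2, 11.9, 9.1, 17.8, 12.0]    # noisier at larger x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
resid = [y - (a + b * x) for x, y in zip(xs, ys)]

half = n // 2
var_low = sum(e * e for e in resid[:half]) / half
var_high = sum(e * e for e in resid[half:]) / half
ratio = var_high / var_low
print(ratio)   # a ratio far from 1 hints at heteroscedasticity
```

The formal Goldfeld-Quandt test turns this ratio into an F statistic; the plots the slides recommend convey the same information visually.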
Homoscedastic Residuals
Heteroscedastic Residuals
To detect heteroscedasticity, plot the residuals against each predictor. Some predictors may show a problem, while others are O.K. A quick overall test is to plot the residuals against estimated Y alone.
Effects of Heteroscedasticity
Happily ...
- OLS coefficients bj are still unbiased
- OLS coefficients bj are still consistent
But ...
- Standard errors of the b's are biased (the bias may be + or -)
- t values and confidence intervals for the b's may be unreliable
- May indicate incorrect model specification
Remedies for Heteroscedasticity
- Avoid totals (e.g., use per capita data)
- Transform some variables (e.g., log X)
- Don't worry about it (it may not be serious)
What Is Autocorrelation?
The errors are not independent.
- Independent errors: et does not depend on et-1 (ρ = 0)
- Autocorrelated errors: et depends on et-1 (ρ ≠ 0)
Good news: autocorrelation is a worry in time-series models (the subscript t = 1, 2, ..., n denotes time) but generally not in cross-sectional data.
What Is Autocorrelation?
Assumed model: et = ρet-1 + ut, where ut is assumed non-autocorrelated.
- Independent errors: et does not depend on et-1 (ρ = 0)
- Autocorrelated errors: et depends on et-1 (ρ ≠ 0)
The residuals will show a pattern over time.
Autocorrelated Residuals
- When a residual tends to be followed by another of the same sign, we have positive autocorrelation (common)
- When a residual tends to be followed by another of the opposite sign, we have negative autocorrelation (rare)
How to Detect Autocorrelation
- Look for a pattern in residuals plotted against time
  - Look for runs of + residuals followed by runs of - residuals
  - Look for an alternating +, -, +, - pattern
- Calculate the correlation between et and et-1
  - This is called the "autocorrelation coefficient"
  - It should not differ significantly from zero
- Check the Durbin-Watson statistic
  - DW = 2 indicates absence of autocorrelation
  - DW < 2 indicates positive autocorrelation (common)
  - DW > 2 indicates negative autocorrelation (rare)
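The Durbin-Watson statistic mentioned above is simple enough to compute directly: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals. A pure-Python sketch with an invented residual series (not from the slides):

```python
# Durbin-Watson statistic: DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2.
def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Residuals with runs of like signs, typical of positive autocorrelation.
resid = [1.0, 1.2, 0.8, 0.9, -0.7, -1.1, -0.9, -1.0]
dw = durbin_watson(resid)
print(dw)   # well below 2, consistent with positive autocorrelation
```

For independent residuals DW hovers near 2; the runs of + then - in this series drag it toward 0, matching the rule of thumb on the slide.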
Residual Time Plot
[Time plot: the residuals are autocorrelated; problem: runs of + residuals followed by runs of - residuals]
Durbin-Watson Test
Common Types of Autocorrelation
- First-order autocorrelation (relatively minor): the errors are autocorrelated
  Yt = b0 + b1Xt1 + b2Xt2 + et, where et = ρet-1 + ut and ut is N(0, σ2)
- Lagged Y used as a predictor (OK if n is large)
  Yt = b0 + b1Xt-1 + b2Yt-1 + et
Effects of Simple First-Order Autocorrelation
- OLS coefficients bj are still unbiased
- OLS coefficients bj are still consistent
- If ρ > 0 (the typical situation), then:
  - the standard errors of the bj are underestimated
  - computed t values will be too high
  - confidence intervals for the bj will be too narrow
General Effects of Autocorrelation
Data Transformations for Autocorrelation
- Use first differences: ΔYt = g0 + b1ΔX1t + b2ΔX2t + et
  Comment: simple, but it only suffices when ρ is near 1.
- Use the Cochrane-Orcutt transformation: Yt* = Yt - ρYt-1, Xt* = Xt - ρXt-1
  Comment: we must estimate the sample autocorrelation coefficient of the residuals and use it as an estimate of ρ.
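The Cochrane-Orcutt recipe above can be sketched in a few lines of pure Python. The residual series and data here are invented for illustration (not from the slides): estimate ρ as the lag-1 autocorrelation of the residuals, then form the quasi-differenced series Yt* and Xt*.

```python
# Cochrane-Orcutt style transform: estimate rho from the residuals, then
# quasi-difference the data: Y*_t = Y_t - rho*Y_{t-1}, X*_t = X_t - rho*X_{t-1}.
def lag1_autocorr(e):
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den

resid = [1.0, 0.8, 0.9, -0.2, -0.8, -0.9, -0.5, 0.4]   # from a first-pass OLS fit
rho = lag1_autocorr(resid)

y = [10.0, 11.0, 12.5, 13.0, 14.8, 15.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
print(rho, y_star[0], x_star[0])
```

One then re-runs OLS on (x_star, y_star); the transformed errors ut = et - ρet-1 are (approximately) non-autocorrelated. Note that one observation is lost in the transformation.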
What Is Non-Normality?
The assumption is that the errors are normally distributed.
Normal errors:
- The histogram of residuals is "bell-shaped"
- There are no outliers in the residuals
- The probability plot is linear
Non-normal errors:
- Any violation of the above
Residual Histogram
[Histogram: it should be symmetric and bell-shaped; here there are outliers beyond 3σ]
Residual Probability Plot
[Probability plot: if the errors are normal, the dots should fall along a 45° line; here there is a possible outlier]
Effects of Non-Normal Errors
- Confidence intervals for Y may be incorrect
- May indicate outliers
- May indicate incorrect model specification
- But usually not considered a serious problem
Detection of Non-Normal Errors
- Look at the histogram of residuals
  - It should be symmetric and bell-shaped
  - Look for outliers or asymmetry (outliers are a serious violation; mild asymmetry is common)
- Look at the probability plot of residuals
  - It should be linear
  - Look for outliers
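A simple numeric companion to the histogram check (a sketch, not something the slides prescribe): the sample skewness of the residuals should be near 0 if the errors are symmetric. The residual values below are illustrative.

```python
# Sample skewness of residuals: m3 / s^3, near 0 for symmetric errors.
def skewness(e):
    n = len(e)
    m = sum(e) / n
    s2 = sum((v - m) ** 2 for v in e) / n
    m3 = sum((v - m) ** 3 for v in e) / n
    return m3 / s2 ** 1.5

symmetric = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
print(skewness(symmetric))   # 0.0 for this perfectly symmetric sample
```

A markedly positive or negative value corresponds to the visible asymmetry the slides tell you to look for in the histogram.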
Remedies for Non-Normal Errors
- Avoid totals (e.g., use per capita data)
- Transform some variables (e.g., log X)
- Enlarge the sample (asymptotic normality)
Influential Observations
- Data points with extreme X values have high "leverage"
- Such cases are sometimes called "high-leverage" observations
- A single case may strongly affect the estimates
How to Detect Influential Observations
- In MINITAB, look for observations denoted "X" (observations with unusual X values)
- In MINITAB, look for observations denoted "R" (observations with unusual residuals)
- Do your own tests (MINITAB does them automatically)
Rules for Finding Influential Observations
- Unusual X: look at the hat matrix for hii > 2p/n, where p is the number of coefficients in the model and n is the number of observations
- Unusual Y: look for large studentized deleted residuals
  - Use z values as a reference if n is large
  - Use t values with d.f. = n - p - 1 if n is small
- Unusual X and Y: use Cook's distance measure (use F(p, n-p) as the critical value)
- Unusual X and Y: use MINITAB's DFITS measure (rule of thumb: DFITS > 2(p/n)^.5)
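The hii > 2p/n rule above is easy to apply by hand in the one-predictor case, where the hat-matrix diagonal has the closed form hii = 1/n + (xi - x̄)²/Σ(xj - x̄)². A pure-Python sketch with invented data (not from the slides):

```python
# Leverage check for simple regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx.
# Flag points with h_ii > 2p/n (p = 2 coefficients: intercept and slope).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]   # the last x is extreme
n, p = len(xs), 2
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
leverage = [1 / n + (x - xbar) ** 2 / sxx for x in xs]
flagged = [x for x, h in zip(xs, leverage) if h > 2 * p / n]
print(flagged)   # only the extreme x value is flagged
```

As a sanity check, the leverages always sum to p (the trace of the hat matrix), so 2p/n is simply twice the average leverage, which is what makes it a natural cutoff.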
Remedies for Influential Observations
- Discard the observation only if you have logical reasons for thinking it is flawed
- Use the method of least absolute deviations (but Minitab and Excel don't calculate absolute deviations)
- Call a professional statistician
Assessing Fit
There are many ways. The fit of a model can be assessed by:
- R2 and R2adj
- the F statistic in the ANOVA table
- the standard error sy|x
- a plot of fitted Y against actual Y
Overall Fit: Actual Y versus Fitted Y
- The correlation between actual Y and fitted Y is the multiple correlation coefficient
- The closer the points lie to a 45° line, the better the fit
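The multiple correlation coefficient described above can be computed directly as corr(actual Y, fitted Y); its square is R². A pure-Python sketch with a one-predictor fit on invented, nearly linear data (not from the slides):

```python
# The multiple correlation coefficient is corr(actual Y, fitted Y);
# squaring it gives R^2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
fitted = [a + b * x for x in xs]

def correlation(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = sum((p - mu) ** 2 for p in u) ** 0.5
    sv = sum((q - mv) ** 2 for q in v) ** 0.5
    return cov / (su * sv)

R = correlation(ys, fitted)
r_squared = R ** 2
print(r_squared)   # close to 1 for this nearly linear data
```

Plotting ys against fitted would show points hugging the 45° line, the visual counterpart of an R² near 1.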
Summing It Up
- Computers do most of the work
- Regression is somewhat robust
- Be careful, but don't panic
Excelsior!