Violations of Regression Assumptions



Regression Assumptions
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$
- Correct model specified (no variables omitted)
- Appropriate model form (e.g., linear)
- Predictors are non-stochastic and independent
- Errors (disturbances) are random: zero mean, normally distributed, homoscedastic (constant variance), and mutually independent (non-autocorrelated)
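As a running illustration (not part of the original slides), here is a minimal Python sketch that fits such a model with statsmodels on simulated data. The names model, X_const, and residuals are my own and are reused by the later sketches.

```python
# Minimal sketch, assuming Python with numpy and statsmodels installed.
# Simulated data; in practice X and y come from your own sample.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 2))                      # two hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

X_const = sm.add_constant(X)                     # adds the intercept column
model = sm.OLS(y, X_const).fit()                 # ordinary least squares
residuals = model.resid                          # estimates of the errors
print(model.summary())
```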

Standard Notation
$\varepsilon_i \sim N(0, \sigma^2)$: the errors have zero mean, are normally distributed, and have constant variance $\sigma^2$.

Violations of Assumptions
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$
- Relevant predictors were omitted
- Wrong model form specified (e.g., linear when the relationship is not)
- Collinear predictors (i.e., correlated $X_j$ and $X_k$)
- Non-normal errors (e.g., skewed, outliers)
- Heteroscedastic errors (non-constant variance)
- Autocorrelated errors (non-independent)

What Is Specification Bias?
Wrong model form or the wrong variables.
- Example of wrong model form: you said $Y = a + bX$, but actually $Y = a + bX + cX^2$.
- Example of wrong variables: you said $Y = a + bX$, but actually $Y = a + bX + cZ$.

Specification Bias: a linear model was specified, but the data actually are non-linear.

Detecting Specification Bias
- In a bivariate model: plot Y against X, and plot residuals against estimated Y.
- In a multivariate model: plot residuals against actual Y, and plot fitted Y against actual Y.
Look for patterns; there should be none. (A sketch of these plots follows.)
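A sketch of these plots (my own illustration, continuing from the snippet above) using matplotlib:

```python
# Sketch: residual and fit plots for spotting specification bias.
# Reuses model and y from the earlier snippet.
import matplotlib.pyplot as plt

fitted = model.fittedvalues
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(fitted, model.resid)             # residuals vs estimated Y
axes[0].axhline(0, color="gray")
axes[0].set(xlabel="Fitted Y", ylabel="Residual", title="Residuals vs fitted")

axes[1].scatter(y, fitted)                       # fitted Y vs actual Y
axes[1].set(xlabel="Actual Y", ylabel="Fitted Y", title="Fitted vs actual")
plt.tight_layout()
plt.show()                   # look for patterns; there should be none
```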

Residuals Plotted on Y
The residuals are correlated with Y, which suggests incorrect specification.

What Is Multicollinearity?
The "independent" variables are related.
- Collinearity: correlation between any two predictors.
- Multicollinearity: a relationship among several predictors.

Effects of Multicollinearity
- Estimates may be unstable
- Standard errors may be misleading
- Confidence intervals are generally too wide
- High $R^2$ yet insignificant t statistics

Variance Inflation Factor
VIFs give a simple multicollinearity test. Each predictor has a VIF. For predictor j, the VIF is
$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$
where $R_j^2$ is the coefficient of determination when predictor j is regressed on all the other predictors.
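In Python, statsmodels can run the auxiliary regressions for you; a sketch, continuing from the first snippet:

```python
# Sketch: one VIF per predictor via statsmodels.
from statsmodels.stats.outliers_influence import variance_inflation_factor

for j in range(1, X_const.shape[1]):             # column 0 is the intercept
    vif = variance_inflation_factor(X_const, j)  # 1 / (1 - R_j^2)
    print(f"VIF for predictor {j}: {vif:.2f}")   # flag any value above 10
```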

Variance Inflation Factor
- Example A: if $R_j^2 = .00$, then $\mathrm{VIF}_j = 1/(1 - .00) = 1$.
- Example B: if $R_j^2 = .90$, then $\mathrm{VIF}_j = 1/(1 - .90) = 10$.
MegaStat and MINITAB will calculate the VIF for each predictor if you request it.

Evidence of Multicollinearity
- Any VIF > 10
- Sum of the VIFs > 10
- High correlation for a pair of predictors $X_j$ and $X_k$
- Unstable estimates (i.e., the remaining coefficients change sharply when a suspect predictor is dropped from the model)

Example: Estimating Body Fat
Problem: several VIFs exceed 10.

Correlation Matrix of Predictors
Age and Height are relatively independent of the other predictors. Problem: Neck, Chest, Abdomen, and Thigh are highly correlated.
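The body-fat data themselves are not reproduced here, but a correlation matrix like the one on this slide can be computed with pandas; a sketch with illustrative column names, continuing from the first snippet:

```python
# Sketch: pairwise correlations among predictors (illustrative names).
import pandas as pd

predictors = pd.DataFrame(X, columns=["X1", "X2"])  # X from the first snippet
print(predictors.corr().round(2))      # look for pairs with |r| near 1
```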

Solution: Eliminate Some Predictors
$R^2$ is reduced slightly, but all VIFs are now below 10.

Stability Check for Coefficients
There are large changes in the estimated coefficients as high-VIF predictors are eliminated, revealing that the original estimates were unstable. But the "fit" deteriorates when we eliminate predictors.

Example: College Graduation Rates
A minor problem? The sum of the VIFs exceeds 10 (but few statisticians would worry, since no single VIF is very large).

Remedies for Multicollinearity
- Drop one or more predictors (but this may create specification error)
- Transform some variables (e.g., log X)
- Enlarge the sample size (if you can)
Tip: if they feel the model is correctly specified, statisticians tend to ignore multicollinearity unless its influence on the estimates is severe.

What Is Heteroscedasticity?
Non-constant error variance.
- Homoscedastic: the errors have the same variance for all values of the predictors (or Y).
- Heteroscedastic: the error variance changes with the values of the predictors (or Y).

How to Detect Heteroscedasticity
- Plot residuals against each predictor (a bit tedious)
- Plot residuals against estimated Y (quick check)
- There are more general tests, but they are complex (one is sketched below)
Excel, MegaStat, and MINITAB will produce these residual plots if you request them.
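One such general test is the Breusch-Pagan test (my choice of example; the slides do not name a specific test). A sketch, continuing from the first snippet:

```python
# Sketch: Breusch-Pagan test for heteroscedasticity via statsmodels.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X_const)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3f}")
# A small p-value is evidence that the error variance is not constant.
```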

Homoscedastic Residuals

Heteroscedastic Residuals
To detect heteroscedasticity, we plot the residuals against each predictor. Some predictors may show a problem, while others are O.K. A quick overall test is to plot the residuals against estimated Y alone.

Effects of Heteroscedasticity
Happily ...
- OLS coefficients $b_j$ are still unbiased
- OLS coefficients $b_j$ are still consistent
But ...
- Standard errors of the $b_j$ are biased (the bias may be + or -)
- t values and confidence intervals for the $b_j$ may be unreliable
- May indicate incorrect model specification

Remedies for Heteroscedasticity
- Avoid totals (e.g., use per capita data)
- Transform some variables (e.g., log X)
- Don't worry about it (it may not be serious)

What Is Autocorrelation?
The errors are not independent.
- Independent errors: $\varepsilon_t$ does not depend on $\varepsilon_{t-1}$ ($\rho = 0$).
- Autocorrelated errors: $\varepsilon_t$ depends on $\varepsilon_{t-1}$ ($\rho \neq 0$).
Good news: autocorrelation is a worry in time-series models (the subscript t = 1, 2, ..., n denotes time) but generally not in cross-sectional data.

What Is Autocorrelation?
Assumed model: $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$, where $u_t$ is assumed non-autocorrelated.
- Independent errors: $\varepsilon_t$ does not depend on $\varepsilon_{t-1}$ ($\rho = 0$).
- Autocorrelated errors: $\varepsilon_t$ depends on $\varepsilon_{t-1}$ ($\rho \neq 0$), so the residuals will show a pattern over time.

Autocorrelated Residuals
- When a residual tends to be followed by another of the same sign, we have positive autocorrelation (common).
- When a residual tends to be followed by another of the opposite sign, we have negative autocorrelation (rare).

How to Detect Autocorrelation
- Look for a pattern in the residuals plotted against time: cycles of + + + + followed by - - - -, or an alternating + - + - pattern.
- Calculate the correlation between $e_t$ and $e_{t-1}$, called the "autocorrelation coefficient"; it should not differ significantly from zero.
- Check the Durbin-Watson statistic: DW = 2 indicates absence of autocorrelation, DW < 2 indicates positive autocorrelation (common), and DW > 2 indicates negative autocorrelation (rare).
(The last two checks are sketched below.)
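A sketch of the last two checks, continuing from the first snippet (in a real application the rows must be ordered in time):

```python
# Sketch: lag-1 autocorrelation coefficient and Durbin-Watson statistic.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = model.resid
r = np.corrcoef(e[1:], e[:-1])[0, 1]   # correlation between e_t and e_(t-1)
dw = durbin_watson(e)                  # approximately 2 * (1 - r)
print(f"lag-1 autocorrelation: {r:.3f}, Durbin-Watson: {dw:.3f}")
```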

Residual Time Plot
The residuals are autocorrelated; problem: runs of + + + + and - - - -.

Durbin-Watson Test
The test statistic is $DW = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$, which is near 2 when there is no autocorrelation.

Common Types of Autocorrelation
- First-order autocorrelation of the errors (relatively minor): $Y_t = \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + \varepsilon_t$, where $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$ and $u_t$ is $N(0, \sigma^2)$.
- Lagged Y used as a predictor (OK if n is large): $Y_t = \beta_0 + \beta_1 X_{t-1} + \beta_2 Y_{t-1} + \varepsilon_t$.

Effects of Simple First-Order Autocorrelation
- OLS coefficients $b_j$ are still unbiased
- OLS coefficients $b_j$ are still consistent
- But if $\rho > 0$ (the typical situation), the standard errors of the $b_j$ are underestimated, computed t values will be too high, and confidence intervals for the $b_j$ will be too narrow

General Effects of Autocorrelation

Data Transformations for Autocorrelation
- Use first differences: $\Delta Y_t = \gamma_0 + \beta_1 \Delta X_{1t} + \beta_2 \Delta X_{2t} + \varepsilon_t$. Comment: simple, but it only suffices when $\rho$ is near 1.
- Use the Cochrane-Orcutt transformation: $Y_t^* = Y_t - \rho Y_{t-1}$ and $X_t^* = X_t - \rho X_{t-1}$. Comment: we must estimate the sample autocorrelation coefficient and use it to estimate $\rho$. (A sketch follows.)
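A sketch of the Cochrane-Orcutt transformation (my own illustration, continuing from the first snippet), estimating rho from the lag-1 residual correlation:

```python
# Sketch: quasi-differencing the data with an estimated rho, then refitting.
import numpy as np
import statsmodels.api as sm

e = model.resid
rho = np.corrcoef(e[1:], e[:-1])[0, 1]        # sample estimate of rho

y_star = y[1:] - rho * y[:-1]                 # Y*_t = Y_t - rho * Y_(t-1)
X_star = X[1:] - rho * X[:-1]                 # X*_t = X_t - rho * X_(t-1)

transformed = sm.OLS(y_star, sm.add_constant(X_star)).fit()
print(transformed.params)                     # coefficients from the refit
```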

What Is Non-Normality?
- Normal errors: the histogram of residuals is "bell-shaped", there are no outliers in the residuals, and the probability plot is linear.
- Non-normal errors: any violation of the above.

Residual Histogram
The histogram should be symmetric and bell-shaped; here there are outliers beyond 3s.

Residual Probability Plot
If the errors are normal, the dots should be linear (a 45° line). Note the possible outlier.

Effects of Non-Normal Errors
- Confidence intervals for Y may be incorrect
- May indicate outliers
- May indicate incorrect model specification
- But usually not considered a serious problem

Detection of Non-Normal Errors
- Look at a histogram of the residuals: it should be symmetric and bell-shaped. Look for outliers or asymmetry (outliers are a serious violation; mild asymmetry is common).
- Look at a probability plot of the residuals: it should be linear. Look for outliers.
(Both checks are sketched below.)
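A sketch of both checks (continuing from the first snippet), using scipy's probability plot:

```python
# Sketch: residual histogram and normal probability plot.
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(model.resid, bins=15)                      # should look bell-shaped
axes[0].set(title="Residual histogram", xlabel="Residual")

stats.probplot(model.resid, dist="norm", plot=axes[1])  # dots should be linear
plt.tight_layout()
plt.show()
```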

Remedies for Non-Normal Errors
- Avoid totals (e.g., use per capita data)
- Transform some variables (e.g., log X)
- Enlarge the sample (asymptotic normality)

Influential Observations
Certain data points have high "leverage": these are points with extreme X values, sometimes called high-leverage observations. A single such case may strongly affect the estimates.

How to Detect Influential Observations
- In MINITAB, look for observations denoted "X" (observations with unusual X values).
- In MINITAB, look for observations denoted "R" (observations with unusual residuals).
- Do your own tests (MINITAB does them automatically).

Rules for Finding Influential Observations
- Unusual X: look at the hat matrix for $h_{ii} > 2p/n$, where p is the number of coefficients in the model and n is the number of observations.
- Unusual Y: look for large studentized deleted residuals. Use z values as a reference if n is large, or t values with d.f. = n - p - 1 if n is small.
- Unusual X and Y: use Cook's distance measure, with F(p, n-p) as the critical value.
- Unusual X and Y: use MINITAB's DFITS measure; the rule of thumb is $|\mathrm{DFITS}| > 2\sqrt{p/n}$.
(These checks are sketched below.)
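statsmodels computes all four measures; a sketch, continuing from the first snippet:

```python
# Sketch: leverage, studentized deleted residuals, Cook's D, and DFFITS.
influence = model.get_influence()
n, p = X_const.shape                                       # observations, coefficients

high_leverage = influence.hat_matrix_diag > 2 * p / n      # unusual X
big_resid = abs(influence.resid_studentized_external) > 2  # unusual Y
cooks_d, _ = influence.cooks_distance                      # unusual X and Y
big_dffits = abs(influence.dffits[0]) > 2 * (p / n) ** 0.5 # unusual X and Y

print("high-leverage rows:", high_leverage.nonzero()[0])
print("large studentized residuals:", big_resid.nonzero()[0])
```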

Remedies for Influential Observations
- Discard the observation only if you have logical reasons for thinking it is flawed.
- Use the method of least absolute deviations (but MINITAB and Excel don't calculate it).
- Call a professional statistician.

Assessing Fit
There are many ways to assess the fit of a model:
- $R^2$ and $R^2_{\mathrm{adj}}$
- the F statistic in the ANOVA table
- the standard error $s_{y|x}$
- a plot of fitted Y against actual Y

Overall Fit: Actual Y versus Fitted Y
The correlation between actual Y and fitted Y is the multiple correlation coefficient. The closer the plot is to a 45° line, the better the fit.

Summing It Up
- Computers do most of the work
- Regression is somewhat robust
- Be careful, but don't panic
- Excelsior!