Regression Diagnostics
SRM 625 Applied Multiple Regression, Hutchinson
Prior to interpreting your regression results, you should examine your data for potential problems that could affect your findings, using various diagnostic techniques.
Types of possible problems
- Assumption violations
- Outliers and influential cases
- Multicollinearity
Regression Assumptions
- Error-free measurement
- Correct model specification
- Assumptions about residuals
Assumption that variables are measured without error
- Presence of measurement error in Y leads to an increase in the standard error of estimate.
- If the standard error of estimate is inflated, what happens to the F test for R²? (Hint: think about the relationship between the standard error and mean square error.)
- In a bivariate regression, measurement error in X always leads to underestimation of the regression coefficient. What are the implications of this for interpreting results regarding X?
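Below is a minimal Python sketch (statsmodels, synthetic data) illustrating this attenuation; the variable names and the reliability of .5 are illustrative assumptions, not values from the course.

```python
# Simulation sketch: measurement error in X biases the bivariate slope toward zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(625)
n = 5_000
x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(size=n)               # true slope = 2
x_observed = x_true + rng.normal(size=n)            # unreliable X (rxx = .5 here)

for label, x in [("error-free X", x_true), ("error-laden X", x_observed)]:
    slope = sm.OLS(y, sm.add_constant(x)).fit().params[1]
    print(f"{label}: estimated slope = {slope:.2f}")
# Prints roughly 2.00 and 1.00: the slope is attenuated by the
# reliability factor var(x_true) / var(x_observed) = .5.
```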
- What are the possible consequences of measurement error when one or more IVs have poor reliability in a multiple regression model?
Evidence to assess violation of the assumption of error-free measurement
- Reliability estimates for your independent and dependent variables
- What would constitute "acceptable" reliability?
How might you attempt to minimize violation of this assumption during the design and planning phase of your study?
Assumption that the regression model has been correctly specified
- Linearity
- Inclusion of all relevant independent variables
- Exclusion of irrelevant independent variables
Assumption of Linearity
- Violation of this assumption can lead to downward bias of regression coefficients.
- If variables are curvilinearly related, there are methods for dealing with the curvilinearity; these require the use of multiple regression and transformation of variables.
- Note: we will discuss methods for addressing nonlinear relationships later in the course.
Detecting nonlinearity
- In bivariate regression, you can examine scatterplots of X and Y; this is not sufficient in multiple regression.
- However, you can examine partial regression plots between each IV and the DV, controlling for the other IVs (see the sketch after this list).
- In multiple regression, residuals plots are primarily used.
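A minimal sketch of partial regression (added-variable) plots using statsmodels; the model, variable names, and data are illustrative assumptions.

```python
# Sketch: partial regression plots, one per IV, from a fitted OLS model.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 1.5 * X["x1"] + 0.5 * X["x2"] ** 2 + rng.normal(size=200)  # x2 enters curvilinearly

results = sm.OLS(y, sm.add_constant(X)).fit()
fig = sm.graphics.plot_partregress_grid(results)   # the x2 panel shows the curve
fig.tight_layout()
plt.show()
```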
Residuals plots
- Typically scatterplots with standardized, studentized, or unstandardized residuals plotted against predicted Y (i.e., residuals versus Ŷ).
- A residuals scatterplot should reflect a broad horizontal band of points (i.e., it should look like a scatterplot for r = 0).
- If the plot forms some type of pattern, it could indicate an assumption violation; specifically, nonlinearity would appear as a curve. (A plotting sketch follows.)
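A minimal sketch (statsmodels, synthetic data) of the standard residuals-versus-predicted plot; a healthy plot is the patternless horizontal band described above.

```python
# Sketch: studentized residuals plotted against predicted Y.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ [1.0, -0.5] + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()
influence = results.get_influence()

plt.scatter(results.fittedvalues, influence.resid_studentized_internal, s=15)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Y")
plt.ylabel("Studentized residual")
plt.show()
```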
[Figure: sample residuals plot. Does this appear to be a correlation of 0?]
[Figure: sample partial regression plot]
Assumption that all important independent variables have been included
- If omitted variables are correlated with variables in the equation, violation of this assumption can lead to biased parameter estimates (e.g., incorrect values of regression coefficients); this is a fairly serious violation.
- Violation can also lead to non-random residuals (i.e., residuals that include systematic variance associated with the omitted variables).
- If omitted variables are not correlated with variables in the model, parameter estimates are not biased, but the standard errors associated with the independent variables are biased upward (i.e., inflated).
For example, suppose job satisfaction is regressed on salary alone. The error then includes autonomy, task enjoyment, working conditions, etc. Therefore, if autonomy, task enjoyment, etc. are correlated with job satisfaction, the residuals (which reflect autonomy, task enjoyment, etc.) would be correlated with predicted job satisfaction.
How do we determine if this assumption is violated?
- Can examine residuals plots: again, plot residuals against predicted values of Y and hope to see a broad horizontal band of points.
- If the plot reflects some type of discernible pattern, e.g., a linear pattern, it could suggest omitted variables.
What can you do if it appears you have violated this assumption?
How might we attempt to prevent violation of this assumption?
Assumption that no irrelevant independent variables have been included
- Violation will lead to inflated standard errors for the regression coefficients (not just those corresponding to the irrelevant variables).
- What effect could this have on conclusions you draw about the contributions of your independent variables?
How can you determine if you have violated this assumption?
What might you do to avoid this potential assumption violation?
Assumptions about errors
- Residuals have a mean of zero
- Residuals are random
- Residuals are normally distributed
- Residuals have equal variance (i.e., homoscedasticity)
Residuals (or errors) are random
- Residuals should be uncorrelated with both Y and predicted Y.
- Residuals should be uncorrelated with the independent variables.
- Residuals should be uncorrelated with one another. This is comparable to the independence-of-observations assumption: the reason for prediction error for one person should be unrelated to the reason for prediction error for another person.
- If violated, tests of significance cannot be trusted; the F and t tests are not robust to violations of this assumption.
- This assumption is most likely to be violated in longitudinal studies, when important variables have been left out of the equation, or when observations are clustered (e.g., when subjects are sampled from intact groups or in cluster sampling).
Residuals are normally distributed
- Residuals are assumed to be normally distributed around the regression line for all values of X.
- This is analogous to the normality assumption in a t-test or ANOVA.
[Figure: illustration of data that violate the assumption of normality]
[Figure: normal probability plot of residuals]
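A minimal sketch of producing such a plot with statsmodels; the data here are synthetic and well-behaved, so the points should track the reference line.

```python
# Sketch: normal probability (Q-Q) plot of OLS residuals.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=200)

resid = sm.OLS(y, X).fit().resid
sm.qqplot(resid, line="s")   # "s" adds a standardized reference line
plt.show()
```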
Residuals have equal variance
- Residuals should be evenly spread around the regression line; this is known as the assumption of homoscedasticity.
- It is the same as the assumption of homogeneity of variance in ANOVA, but with equal variances on Y for each value of X.
[Figure: illustration of homoscedastic data]
[Figure: illustration of heteroscedasticity]
[Figure: further evidence of heteroscedasticity and nonnormality]
Why is violation of the homoscedasticity assumption a problem?
What can you do if your data are heteroscedastic?
- Can use weighted least squares (WLS) instead of ordinary least squares (OLS) as your estimation procedure.
- WLS weights each case so that cases with larger error variances receive less weight (in OLS, each case receives a weight of 1). A WLS sketch follows this list.
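A minimal WLS sketch in statsmodels; the error structure (SD proportional to x) and hence the weights 1/x² are illustrative assumptions — in practice the weights must be estimated or justified.

```python
# Sketch: WLS with weights inversely proportional to the error variance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 3.0 + 2.0 * x + rng.normal(scale=x, size=300)   # error SD grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()        # weight = 1 / variance

print("OLS slope SE:", ols.bse[1].round(3))
print("WLS slope SE:", wls.bse[1].round(3))         # typically smaller
```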
Outliers and Influential Cases
- Influential observations
- Leverage
- Extreme on both X and Y
What is an outlier?
- A case with an extreme value of Y.
- The presence of outliers can be detected by examination of residuals.
Types of residuals used in outlier detection
- Standardized residuals
- Studentized residuals
- Studentized deleted residuals
Standardized Residuals
- Unstandardized residuals that have been converted to z-scores.
- Not recommended by some, because their calculation assumes that all residuals have the same variance (as measured by the overall standard error of estimate, S_Y.X).
Studentized Residuals
- Similar to standardized residuals, but use a different standard deviation for each residual.
- Generally more sensitive than standardized residuals.
- Follow an approximate t distribution.
Studentized Deleted Residuals
- The same as studentized residuals, except that the case with the extreme value is removed from their calculation.
- This addresses a potential problem with studentized residuals, which include the outlier in their calculation (thus increasing the risk of an inflated standard error).
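A minimal sketch comparing the three residual types via statsmodels' influence results; the planted outlier and all names are illustrative.

```python
# Sketch: standardized, studentized, and studentized deleted residuals.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ [0.0, 1.0, 1.0] + rng.normal(size=100)
y[0] += 6.0                                      # plant one outlier on Y

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
resids = pd.DataFrame({
    "standardized": fit.resid / np.sqrt(fit.mse_resid),
    "studentized": infl.resid_studentized_internal,
    "studentized deleted": infl.resid_studentized_external,
})
print(resids.head(3).round(2))   # the deleted column flags case 0 most strongly
```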
[Figure: comparing the three types of residuals]
Leverage
- Reflects cases with extreme values on one or more of the independent variables.
- Such cases may or may not exert influence on the equation.
How does one identify cases with high leverage?
- SPSS produces values of leverage (h), which can range between 0 and 1.
- One rule of thumb suggests h > 2(k + 1)/N indicates a high-leverage value.
- Another rule of thumb is that h ≤ .2 indicates trivial leverage, whereas values > .5 suggest substantial leverage requiring further examination.
- Other researchers recommend looking at relative differences. (A computational sketch follows.)
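A minimal sketch of flagging high-leverage cases with the 2(k + 1)/N rule; the data (and the one extreme case) are synthetic.

```python
# Sketch: leverage (hat) values against the 2(k + 1)/N cutoff.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, k = 171, 3
X = rng.normal(size=(n, k))
X[0] = [5.0, -5.0, 5.0]                   # one case extreme on the IVs
y = X.sum(axis=1) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
h = fit.get_influence().hat_matrix_diag
cutoff = 2 * (k + 1) / n
print("cutoff:", round(cutoff, 3))        # 2(3 + 1)/171 ≈ .047
print("flagged cases:", np.where(h > cutoff)[0])
```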
[Example output: leverage values, based on 3 IVs, N = 171]
Mahalanobis distance (D²)
- A method for detecting multivariate outliers, i.e., cases with unexpected combinations of independent variables.
- Represents the distance of a case from the centroid of the remaining cases, where the centroid is the intersection of the means of all the variables.
- One rule of thumb suggests high values exceed the χ² critical value with degrees of freedom equal to the number of IVs in the model. (A computational sketch follows.)
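A minimal numpy/scipy sketch of the D² computation; the .001 alpha for the χ² cutoff is a common convention and an assumption here, not a value from the slides.

```python
# Sketch: Mahalanobis D² for the IVs, compared to a chi-square critical value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=200)
X[0] = [2.5, -2.5]                           # an unusual *combination* of IVs

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # diag(diff Σ⁻¹ diffᵀ)

cutoff = stats.chi2.ppf(0.999, df=X.shape[1])        # df = number of IVs
print("flagged cases:", np.where(d2 > cutoff)[0])
```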
[Example output: Mahalanobis D² values; model based on 6 IVs]
It should be noted that a case that is an outlier and/or exhibits high leverage is not necessarily influential.
Influential Observations
- Tend to be outliers on both X and Y (although they do not have to be).
- Are considered influential because their presence (or absence) makes a difference in the regression equation; e.g., coefficients, R², etc. tend to change when influential observations are versus aren't in the sample.
How are influential cases identified?
- DFBETAs
- Cook's D
DFBETA
- Represents the estimated change in an unstandardized regression coefficient when a particular case is deleted (standardized values, DFBETAS, can also be requested).
- There will be a DFBETA value for each IV and for each subject/participant.
- Larger values indicate greater influence exerted by a particular case. One rule of thumb is to flag standardized values > 2/√N.

Cook's D
- A measure of influence that flags observations which might be influential due to their values on one or more X's, on Y, or on a combination.
- One rule of thumb is to consider values of Cook's D > 1 as indicating potential influence; another is to look for "gaps" in the distribution of D values. (A computational sketch of both indices follows.)
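A minimal sketch of both indices from statsmodels, applying the two rules of thumb above; the planted influential case is illustrative.

```python
# Sketch: DFBETAS (|value| > 2/sqrt(N)) and Cook's D (> 1) flags.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
X[0, 1:] = [4.0, -4.0]                        # extreme on the IVs...
y = X @ [0.0, 1.0, 1.0] + rng.normal(size=n)
y[0] += 8.0                                   # ...and on Y

infl = sm.OLS(y, X).fit().get_influence()
dfbetas = infl.dfbetas                        # one column per coefficient
cooks_d, _ = infl.cooks_distance

flagged = np.unique(np.where(np.abs(dfbetas) > 2 / np.sqrt(n))[0])
print("DFBETAS flags:", flagged)
print("Cook's D flags:", np.where(cooks_d > 1)[0])
```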
[Example output: Cook's D values]
If cases are identified as outliers, high-leverage cases, or potentially influential observations, what should you do with them? Keep or drop?
General Recommendations
- Identify cases which are outliers on Y; check first for coding errors.
- Identify cases which are outliers on X; again, check for coding errors.
- Identify points that are flagged as potentially influential.
For those cases flagged as potentially influential, run the regression analysis with and without those points (deleting one at a time) to see what effect they have on the regression results.
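A minimal sketch of this with-and-without comparison; the helper name and the synthetic data are illustrative.

```python
# Sketch: compare coefficients with and without one flagged case.
import numpy as np
import statsmodels.api as sm

def with_and_without(X, y, i):
    """Coefficients from the full fit and from a fit dropping case i."""
    full = sm.OLS(y, X).fit().params
    dropped = sm.OLS(np.delete(y, i), np.delete(X, i, axis=0)).fit().params
    return full, dropped

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=50)

full, dropped = with_and_without(X, y, i=0)
print(np.round(full - dropped, 3))   # large shifts suggest real influence
```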
What will you look for? How will you decide what to do with the outlying case(s)?
Regardless of whether or not an outlier is influential, you should attempt to find out the reasons for such extreme scores. How might you do that, and why?
Collinearity
- In general, collinearity refers to overlap or correlation between 2 independent variables.
- In the extreme case, the 2 variables are identical; i.e., in a scatterplot, observations for the 2 variables would fall exactly on the same line.
- Multicollinearity refers to collinearity among more than 2 variables.
Collinearity (cont'd)
- Redundancy and repetitiveness are two related concepts.
- Redundancy indicates two variables that are telling us something similar but which may or may not represent the same concept.
- Repetitiveness occurs when the researcher includes more than one measure of the same construct. In this case, it might be preferable to test the variables as a set rather than as individual variables.
Effects of Collinearity
- Can produce misleading regression results, e.g., where 2 (highly correlated) independent variables correlate similarly with the dependent variable, but only one is statistically significant in the multiple regression.
- Can lead to underestimates of regression coefficients.
- Can inflate standard errors of regression coefficients. Standard errors are at a minimum when the IVs are completely uncorrelated.
- When r = 1 between 2 or more IVs, standard errors cannot be computed: the determinant of the matrix is 0, so the matrix cannot be inverted.
Detection of Collinearity
- Bivariate correlations (inadequate for detecting multicollinearity)
- Large changes in regression coefficients as variables are added to (or deleted from) the model
- Presence of large standard errors, or signs of coefficients in unexpected directions
- VIF
- Tolerance
- Condition numbers
VIF (Variance Inflation Factor)
- Indicates the inflation in the variance of the b's or betas that results from collinearity among the independent variables.
- Larger VIF values indicate greater collinearity; VIF = 1 (its lowest value) when r = 0 among the IVs.
- Some have suggested VIF > 10 as indicating collinearity; however, problematic collinearity can occur even with VIF considerably less than 10.
- VIF = 1 / tolerance
Tolerance
- For any given independent variable, tolerance reflects the proportion of its variance that is NOT accounted for by the remaining independent variables; therefore, small values indicate collinearity.
- SPSS uses .0001 as its default tolerance for halting analyses on the basis of collinearity; however, collinearity will lead to problems long before tolerance reaches such an extreme level.
- As tolerance values become small, problems occur in the accuracy of calculating the parameter estimates. (A computational sketch of VIF and tolerance follows.)
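A minimal sketch of computing VIF (and tolerance = 1/VIF) per IV with statsmodels; the near-duplicate x2 is planted to show an inflated value.

```python
# Sketch: VIF and tolerance for each IV (intercept included in the design).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # nearly redundant with x1
x3 = rng.normal(size=300)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns[1:], start=1):   # skip the intercept
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```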
[Example output: tolerance values]
Condition Numbers and Eigenvalues
- Eigenvalues can also be used as a diagnostic for collinearity, with smaller eigenvalues indicating greater collinearity; an eigenvalue of 0 indicates a linear dependency.
- An index based on the eigenvalues is the condition number. Larger values indicate greater collinearity, with values > 15 suggesting some collinearity and values > 30 suggesting a serious problem. (A computational sketch follows.)
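A minimal numpy sketch of the eigenvalue-based condition number; scaling the columns to unit length follows the usual (Belsley-style) convention and is an assumption here.

```python
# Sketch: eigenvalues of the scaled X'X matrix and the condition number.
import numpy as np

rng = np.random.default_rng(11)
x1 = rng.normal(size=300)
X = np.column_stack([np.ones(300), x1,
                     x1 + rng.normal(scale=0.05, size=300)])  # near dependency

Xs = X / np.linalg.norm(X, axis=0)             # unit-length columns
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
cond_number = np.sqrt(eigvals.max() / eigvals.min())
print("eigenvalues:", eigvals.round(4))        # smallest is near 0
print("condition number:", round(cond_number, 1))   # > 30 flags a serious problem
```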
[Example output: condition numbers]
What to do if faced with collinearity
- Could omit one of the "problem" variables, but you might then risk model misspecification.
- Avoid multiple indicators of the same construct. If not too correlated, they could be tested as a block of variables; but if the correlations between indicators are excessively high, the collinearity could still cause problems for other variables in the model.
- If it makes conceptual sense to do so, you can combine or aggregate the correlated independent variables.
- Use another type of regression, such as ridge regression, for which collinearity is not as much of a problem.
- Could use centering, but this is only appropriate for non-essential collinearity (see the sketch below).
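A minimal sketch of why centering helps with non-essential collinearity, using the classic x versus x² case; the data are synthetic.

```python
# Sketch: centering removes the collinearity between x and x**2.
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(10, 20, size=300)     # mean far from zero
xc = x - x.mean()                     # centered version

print("r(x, x^2), raw:      ", round(np.corrcoef(x, x**2)[0, 1], 3))     # near 1
print("r(xc, xc^2), centered:", round(np.corrcoef(xc, xc**2)[0, 1], 3))  # near 0
```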