Regression Diagnostics
SRM 625 Applied Multiple Regression, Hutchinson
Prior to interpreting your regression results, you should examine your data for potential problems that could affect your findings, using various diagnostic techniques.
Types of possible problems
- Assumption violations
- Outliers and influential cases
- Multicollinearity
Regression Assumptions
- Error-free measurement
- Correct model specification
- Assumptions about residuals
Assumption that variables are measured without error
- Presence of measurement error in Y leads to an increase in the standard error of estimate.
- If the standard error of estimate is inflated, what happens to the F test for R²? (Hint: think about the relationship between the standard error and mean square error.)
- In a bivariate regression, measurement error in X always leads to underestimation of the regression coefficient. What are the implications of this for interpreting results regarding X?
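Below is a minimal Python sketch (statsmodels, synthetic data) illustrating this attenuation; the variable names and the reliability of .5 are illustrative assumptions, not values from the course.

```python
# Simulation sketch: measurement error in X biases the bivariate slope toward zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(625)
n = 5_000
x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(size=n)               # true slope = 2
x_observed = x_true + rng.normal(size=n)            # unreliable X (rxx = .5 here)

for label, x in [("error-free X", x_true), ("error-laden X", x_observed)]:
    slope = sm.OLS(y, sm.add_constant(x)).fit().params[1]
    print(f"{label}: estimated slope = {slope:.2f}")
# Prints roughly 2.00 and 1.00: the slope is attenuated by the
# reliability factor var(x_true) / var(x_observed) = .5.
```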
- What are the possible consequences of measurement error when one or more IVs have poor reliability in a multiple regression model?
Evidence to assess violation of the assumption of error-free measurement
- Reliability estimates for your independent and dependent variables
- What would constitute "acceptable" reliability?
How might you attempt to minimize violation of this assumption during the design and planning phase of your study?
Assumption that the regression model has been correctly specified
- Linearity
- Inclusion of all relevant independent variables
- Exclusion of irrelevant independent variables
Assumption of Linearity
- Violation of this assumption can lead to downward bias of regression coefficients.
- If variables are curvilinearly related, there are methods for dealing with the curvilinearity; these require the use of multiple regression and transformation of variables.
- Note: we will discuss methods for addressing nonlinear relationships later in the course.
Detecting nonlinearity
- In bivariate regression, you can examine scatterplots of X and Y; this is not sufficient in multiple regression.
- However, you can examine partial regression plots between each IV and the DV, controlling for the other IVs (see the sketch after this list).
- In multiple regression, residuals plots are primarily used.
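A minimal sketch of partial regression (added-variable) plots using statsmodels; the model, variable names, and data are illustrative assumptions.

```python
# Sketch: partial regression plots, one per IV, from a fitted OLS model.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 1.5 * X["x1"] + 0.5 * X["x2"] ** 2 + rng.normal(size=200)  # x2 enters curvilinearly

results = sm.OLS(y, sm.add_constant(X)).fit()
fig = sm.graphics.plot_partregress_grid(results)   # the x2 panel shows the curve
fig.tight_layout()
plt.show()
```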
Residuals plots
- Typically scatterplots with standardized, studentized, or unstandardized residuals plotted against predicted Y (i.e., residuals versus Ŷ).
- A residuals scatterplot should reflect a broad horizontal band of points (i.e., it should look like a scatterplot for r = 0).
- If the plot forms some type of pattern, it could indicate an assumption violation; specifically, nonlinearity would appear as a curve. (A plotting sketch follows.)
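A minimal sketch (statsmodels, synthetic data) of the standard residuals-versus-predicted plot; a healthy plot is the patternless horizontal band described above.

```python
# Sketch: studentized residuals plotted against predicted Y.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ [1.0, -0.5] + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()
influence = results.get_influence()

plt.scatter(results.fittedvalues, influence.resid_studentized_internal, s=15)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Y")
plt.ylabel("Studentized residual")
plt.show()
```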
[Figure: sample residuals plot. Does this appear to be a correlation of 0?]
[Figure: sample partial regression plot]
Assumption that all important independent variables have been included
- If omitted variables are correlated with variables in the equation, violation of this assumption can lead to biased parameter estimates (e.g., incorrect values of regression coefficients); this is a fairly serious violation.
- Violation can also lead to non-random residuals (i.e., residuals that include systematic variance associated with the omitted variables).
- If omitted variables are not correlated with variables in the model, parameter estimates are not biased, but the standard errors associated with the independent variables are biased upward (i.e., inflated).
For example, suppose job satisfaction is regressed on salary alone. The error then includes autonomy, task enjoyment, working conditions, etc. Therefore, if autonomy, task enjoyment, etc. are correlated with job satisfaction, the residuals (which reflect autonomy, task enjoyment, etc.) would be correlated with predicted job satisfaction.
How do we determine if this assumption is violated?
- Can examine residuals plots: again, plot residuals against predicted values of Y and hope to see a broad horizontal band of points.
- If the plot reflects some type of discernible pattern, e.g., a linear pattern, it could suggest omitted variables.
What can you do if it appears you have violated this assumption?
How might we attempt to prevent violation of this assumption?
Assumption that no irrelevant independent variables have been included
- Violation will lead to inflated standard errors for the regression coefficients (not just those corresponding to the irrelevant variables).
- What effect could this have on conclusions you draw about the contributions of your independent variables?
How can you determine if you have violated this assumption?
What might you do to avoid this potential assumption violation?
Assumptions about errors
- Residuals have a mean of zero
- Residuals are random
- Residuals are normally distributed
- Residuals have equal variance (i.e., homoscedasticity)
Residuals (or errors) are random
- Residuals should be uncorrelated with both Y and predicted Y.
- Residuals should be uncorrelated with the independent variables.
- Residuals should be uncorrelated with one another. This is comparable to the independence-of-observations assumption: the reason for prediction error for one person should be unrelated to the reason for prediction error for another person.
- If violated, tests of significance cannot be trusted; the F and t tests are not robust to violations of this assumption.
- This assumption is most likely to be violated in longitudinal studies, when important variables have been left out of the equation, or when observations are clustered (e.g., when subjects are sampled from intact groups or in cluster sampling).
Residuals are normally distributed
- Residuals are assumed to be normally distributed around the regression line for all values of X.
- This is analogous to the normality assumption in a t-test or ANOVA.
[Figure: illustration of data that violate the assumption of normality]
[Figure: normal probability plot of residuals]
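A minimal sketch of producing such a plot with statsmodels; the data here are synthetic and well-behaved, so the points should track the reference line.

```python
# Sketch: normal probability (Q-Q) plot of OLS residuals.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=200)

resid = sm.OLS(y, X).fit().resid
sm.qqplot(resid, line="s")   # "s" adds a standardized reference line
plt.show()
```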
Residuals have equal variance
- Residuals should be evenly spread around the regression line; this is known as the assumption of homoscedasticity.
- It is the same as the assumption of homogeneity of variance in ANOVA, but with equal variances on Y for each value of X.
[Figure: illustration of homoscedastic data]
[Figure: illustration of heteroscedasticity]
[Figure: further evidence of heteroscedasticity and nonnormality]
Why is violation of the homoscedasticity assumption a problem?
What can you do if your data are heteroscedastic?
- Can use weighted least squares (WLS) instead of ordinary least squares (OLS) as your estimation procedure.
- WLS weights each case so that cases with larger error variances receive less weight (in OLS, each case receives a weight of 1). A WLS sketch follows this list.
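A minimal WLS sketch in statsmodels; the error structure (SD proportional to x) and hence the weights 1/x² are illustrative assumptions — in practice the weights must be estimated or justified.

```python
# Sketch: WLS with weights inversely proportional to the error variance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 3.0 + 2.0 * x + rng.normal(scale=x, size=300)   # error SD grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()        # weight = 1 / variance

print("OLS slope SE:", ols.bse[1].round(3))
print("WLS slope SE:", wls.bse[1].round(3))         # typically smaller
```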
Outliers and Influential Cases
- Influential observations
- Leverage
- Extreme on both X and Y
What is an outlier?
- A case with an extreme value of Y.
- The presence of outliers can be detected by examination of residuals.
Types of residuals used in outlier detection
- Standardized residuals
- Studentized residuals
- Studentized deleted residuals
Standardized Residuals
- Unstandardized residuals that have been converted to z-scores.
- Not recommended by some, because their calculation assumes that all residuals have the same variance (as measured by the overall standard error of estimate, S_Y.X).
Studentized Residuals
- Similar to standardized residuals, but use a different standard deviation for each residual.
- Generally more sensitive than standardized residuals.
- Follow an approximate t distribution.
Studentized Deleted Residuals
- The same as studentized residuals, except that the case with the extreme value is removed from their calculation.
- This addresses a potential problem with studentized residuals, which include the outlier in their calculation (thus increasing the risk of an inflated standard error).
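A minimal sketch comparing the three residual types via statsmodels' influence results; the planted outlier and all names are illustrative.

```python
# Sketch: standardized, studentized, and studentized deleted residuals.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ [0.0, 1.0, 1.0] + rng.normal(size=100)
y[0] += 6.0                                      # plant one outlier on Y

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
resids = pd.DataFrame({
    "standardized": fit.resid / np.sqrt(fit.mse_resid),
    "studentized": infl.resid_studentized_internal,
    "studentized deleted": infl.resid_studentized_external,
})
print(resids.head(3).round(2))   # the deleted column flags case 0 most strongly
```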
[Figure: comparing the three types of residuals]
Leverage
- Reflects cases with extreme values on one or more of the independent variables.
- Such cases may or may not exert influence on the equation.
How does one identify cases with high leverage?
- SPSS produces values of leverage (h), which can range between 0 and 1.
- One rule of thumb suggests h > 2(k + 1)/N indicates a high-leverage value.
- Another rule of thumb is that h ≤ .2 indicates trivial leverage, whereas values > .5 suggest substantial leverage requiring further examination.
- Other researchers recommend looking at relative differences. (A computational sketch follows.)
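A minimal sketch of flagging high-leverage cases with the 2(k + 1)/N rule; the data (and the one extreme case) are synthetic.

```python
# Sketch: leverage (hat) values against the 2(k + 1)/N cutoff.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, k = 171, 3
X = rng.normal(size=(n, k))
X[0] = [5.0, -5.0, 5.0]                   # one case extreme on the IVs
y = X.sum(axis=1) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
h = fit.get_influence().hat_matrix_diag
cutoff = 2 * (k + 1) / n
print("cutoff:", round(cutoff, 3))        # 2(3 + 1)/171 ≈ .047
print("flagged cases:", np.where(h > cutoff)[0])
```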
[Example output: leverage values, based on 3 IVs, N = 171]
Mahalanobis distance (D²)
- A method for detecting multivariate outliers, i.e., cases with unexpected combinations of independent variables.
- Represents the distance of a case from the centroid of the remaining cases, where the centroid is the intersection of the means of all the variables.
- One rule of thumb suggests high values exceed the χ² critical value with degrees of freedom equal to the number of IVs in the model. (A computational sketch follows.)
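A minimal numpy/scipy sketch of the D² computation; the .001 alpha for the χ² cutoff is a common convention and an assumption here, not a value from the slides.

```python
# Sketch: Mahalanobis D² for the IVs, compared to a chi-square critical value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=200)
X[0] = [2.5, -2.5]                           # an unusual *combination* of IVs

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # diag(diff Σ⁻¹ diffᵀ)

cutoff = stats.chi2.ppf(0.999, df=X.shape[1])        # df = number of IVs
print("flagged cases:", np.where(d2 > cutoff)[0])
```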
[Example output: Mahalanobis D² values; model based on 6 IVs]
It should be noted that a case that is an outlier and/or exhibits high leverage is not necessarily influential.
Influential Observations
- Tend to be outliers on both X and Y (although they do not have to be).
- Are considered influential because their presence (or absence) makes a difference in the regression equation; e.g., coefficients, R², etc. tend to change when influential observations are versus aren't in the sample.
How are influential cases identified?
- DFBETAs
- Cook's D
DFBETA
- Represents the estimated change in an unstandardized regression coefficient when a particular case is deleted (standardized values, DFBETAS, can also be requested).
- There will be a DFBETA value for each IV and for each subject/participant.
- Larger values indicate greater influence exerted by a particular case. One rule of thumb is to flag standardized values > 2/√N.

Cook's D
- A measure of influence that flags observations which might be influential due to their values on one or more X's, on Y, or on a combination.
- One rule of thumb is to consider values of Cook's D > 1 as indicating potential influence; another is to look for "gaps" in the distribution of D values. (A computational sketch of both indices follows.)
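A minimal sketch of both indices from statsmodels, applying the two rules of thumb above; the planted influential case is illustrative.

```python
# Sketch: DFBETAS (|value| > 2/sqrt(N)) and Cook's D (> 1) flags.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
X[0, 1:] = [4.0, -4.0]                        # extreme on the IVs...
y = X @ [0.0, 1.0, 1.0] + rng.normal(size=n)
y[0] += 8.0                                   # ...and on Y

infl = sm.OLS(y, X).fit().get_influence()
dfbetas = infl.dfbetas                        # one column per coefficient
cooks_d, _ = infl.cooks_distance

flagged = np.unique(np.where(np.abs(dfbetas) > 2 / np.sqrt(n))[0])
print("DFBETAS flags:", flagged)
print("Cook's D flags:", np.where(cooks_d > 1)[0])
```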
[Example output: Cook's D values]
If cases are identified as outliers, high-leverage cases, or potentially influential observations, what should you do with them? Keep or drop?
General Recommendations
- Identify cases which are outliers on Y; check first for coding errors.
- Identify cases which are outliers on X; again, check for coding errors.
- Identify points that are flagged as potentially influential.
For those cases flagged as potentially influential, run the regression analysis with and without those points (deleting one at a time) to see what effect they have on the regression results.
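A minimal sketch of this with-and-without comparison; the helper name and the synthetic data are illustrative.

```python
# Sketch: compare coefficients with and without one flagged case.
import numpy as np
import statsmodels.api as sm

def with_and_without(X, y, i):
    """Coefficients from the full fit and from a fit dropping case i."""
    full = sm.OLS(y, X).fit().params
    dropped = sm.OLS(np.delete(y, i), np.delete(X, i, axis=0)).fit().params
    return full, dropped

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=50)

full, dropped = with_and_without(X, y, i=0)
print(np.round(full - dropped, 3))   # large shifts suggest real influence
```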
What will you look for? How will you decide what to do with the outlying case(s)?
Regardless of whether or not an outlier is influential, you should attempt to find out the reasons for such extreme scores. How might you do that, and why?
Collinearity
- In general, collinearity refers to overlap or correlation between 2 independent variables.
- In the extreme case, the 2 variables are identical; i.e., in a scatterplot, observations for the 2 variables would fall exactly on the same line.
- Multicollinearity refers to collinearity among more than 2 variables.
Collinearity (cont'd)
- Redundancy and repetitiveness are two related concepts.
- Redundancy indicates two variables that are telling us something similar but which may or may not represent the same concept.
- Repetitiveness occurs when the researcher includes more than one measure of the same construct. In this case, it might be preferable to test the variables as a set rather than as individual variables.
Effects of Collinearity
- Can produce misleading regression results, e.g., where 2 (highly correlated) independent variables correlate similarly with the dependent variable, but only one is statistically significant in the multiple regression.
- Can lead to underestimates of regression coefficients.
- Can inflate standard errors of regression coefficients. Standard errors are at a minimum when the IVs are completely uncorrelated.
- When r = 1 between 2 or more IVs, standard errors cannot be computed: the determinant of the matrix is 0, so the matrix cannot be inverted.
Detection of Collinearity
- Bivariate correlations (inadequate for detecting multicollinearity)
- Large changes in regression coefficients as variables are added to (or deleted from) the model
- Presence of large standard errors, or signs of coefficients in unexpected directions
- VIF
- Tolerance
- Condition numbers
VIF (Variance Inflation Factor)
- Indicates the inflation in the variance of the b's or betas that results from collinearity among the independent variables.
- Larger VIF values indicate greater collinearity; VIF = 1 (its lowest value) when r = 0 among the IVs.
- Some have suggested VIF > 10 as indicating collinearity; however, problematic collinearity can occur even with VIF considerably less than 10.
- VIF = 1 / tolerance
Tolerance
- For any given independent variable, tolerance reflects the proportion of its variance that is NOT accounted for by the remaining independent variables; therefore, small values indicate collinearity.
- SPSS uses .0001 as its default tolerance for halting analyses on the basis of collinearity; however, collinearity will lead to problems long before tolerance reaches such an extreme level.
- As tolerance values become small, problems occur in the accuracy of calculating the parameter estimates. (A computational sketch of VIF and tolerance follows.)
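A minimal sketch of computing VIF (and tolerance = 1/VIF) per IV with statsmodels; the near-duplicate x2 is planted to show an inflated value.

```python
# Sketch: VIF and tolerance for each IV (intercept included in the design).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # nearly redundant with x1
x3 = rng.normal(size=300)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns[1:], start=1):   # skip the intercept
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```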
[Example output: tolerance values]
Condition Numbers and Eigenvalues
- Eigenvalues can also be used as a diagnostic for collinearity, with smaller eigenvalues indicating greater collinearity; an eigenvalue of 0 indicates a linear dependency.
- An index based on the eigenvalues is the condition number. Larger values indicate greater collinearity, with values > 15 suggesting some collinearity and values > 30 suggesting a serious problem. (A computational sketch follows.)
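A minimal numpy sketch of the eigenvalue-based condition number; scaling the columns to unit length follows the usual (Belsley-style) convention and is an assumption here.

```python
# Sketch: eigenvalues of the scaled X'X matrix and the condition number.
import numpy as np

rng = np.random.default_rng(11)
x1 = rng.normal(size=300)
X = np.column_stack([np.ones(300), x1,
                     x1 + rng.normal(scale=0.05, size=300)])  # near dependency

Xs = X / np.linalg.norm(X, axis=0)             # unit-length columns
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
cond_number = np.sqrt(eigvals.max() / eigvals.min())
print("eigenvalues:", eigvals.round(4))        # smallest is near 0
print("condition number:", round(cond_number, 1))   # > 30 flags a serious problem
```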
[Example output: condition numbers]
What to do if faced with collinearity
- Could omit one of the "problem" variables, but you might then risk model misspecification.
- Avoid multiple indicators of the same construct. If not too correlated, they could be tested as a block of variables; but if the correlations between indicators are excessively high, the collinearity could still cause problems for other variables in the model.
- If it makes conceptual sense to do so, you can combine or aggregate the correlated independent variables.
- Use another type of regression, such as ridge regression, for which collinearity is not as much of a problem.
- Could use centering, but this is only appropriate for non-essential collinearity (see the sketch below).
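A minimal sketch of why centering helps with non-essential collinearity, using the classic x versus x² case; the data are synthetic.

```python
# Sketch: centering removes the collinearity between x and x**2.
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(10, 20, size=300)     # mean far from zero
xc = x - x.mean()                     # centered version

print("r(x, x^2), raw:      ", round(np.corrcoef(x, x**2)[0, 1], 3))     # near 1
print("r(xc, xc^2), centered:", round(np.corrcoef(xc, xc**2)[0, 1], 3))  # near 0
```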