Outliers and Influential Data Points in Regression Analysis
James P. Stevens
Sujin Jang, November 10, 2008
Beware of Outliers
Regression is sensitive to outliers
– Important to detect outliers and influential points
Summary statistics can be misleading
– Important to explore the data rather than relying on just one or two summary statistics
Look at Your Data!
– For all three plots, the correlation r, the means, and the standard deviations are equal
But it’s not enough to look…
So what should we do?
Ways of Detecting Outliers:
– Studentized residuals for outliers on y
– Mahalanobis distance and the hat matrix for outliers in the space of predictors
Types of Outliers
Classifying Outliers:
– Outliers in the space of outcomes (outliers on y)
– Outliers in the space of predictors (outliers on x)
So what should we do?
Ways of Detecting Outliers:
– Studentized residuals for outliers on y
– Mahalanobis distance and the hat matrix for outliers in the space of predictors
BUT… the points these measures identify will not necessarily be influential, that is, they will not necessarily change the regression coefficients.
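The detection measures named above can be sketched for the simplest case. This is a minimal illustration, assuming a single predictor with an intercept (so the hat-matrix diagonal has a closed form) and internally studentized residuals; the data and the function names are made up for the example and are not from the slides.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def leverages(x):
    """Hat-matrix diagonal h_i for a one-predictor model with intercept."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]

def studentized_residuals(x, y):
    """Internally studentized residuals: e_i / (s * sqrt(1 - h_i))."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # 2 fitted parameters
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverages(x))]

def mahalanobis_sq(x):
    """Squared Mahalanobis distance of each x from the mean (one predictor)."""
    n = len(x)
    mx = sum(x) / n
    var = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    return [(xi - mx) ** 2 / var for xi in x]

# Hypothetical data: case 3 is unusual on y, case 7 is unusual on x.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 2.0, 2.9, 8.0, 5.1, 6.0, 6.9, 20.2]

h = leverages(x)
r = studentized_residuals(x, y)
d2 = mahalanobis_sq(x)

for i in range(len(x)):
    print(f"case {i}: leverage={h[i]:.3f}  stud.resid={r[i]:+.2f}  D^2={d2[i]:.2f}")
```

In this made-up data the largest studentized residual belongs to the case that is off on y, while the largest leverage and Mahalanobis distance belong to the case that is far out on x. With one predictor the two x-space measures carry the same information, since h_i = 1/n + D²_i/(n − 1).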
Outliers and Influential Points
[Figure: diagram relating outliers and influential points]
Example: Influential Points
[Figure: two panels, "Non-influential" and "Influential"]
Cook’s Distance: Identifying Influential Points
A measure of the change in the regression coefficients that would occur if the case were omitted.
– Affected by the case being an outlier both on y and in the set of predictors
– Measures the joint (combined) influence of the case being an outlier on y and on x
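A rough sketch of Cook's distance for the one-predictor case follows. It assumes the standard closed form D_i = (r_i² / p) · h_i / (1 − h_i), with r_i the internally studentized residual, h_i the leverage, and p = 2 fitted parameters. The data and helper names are hypothetical; the two data sets differ only in whether the high-leverage case lies on the trend of the other points.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def cooks_distance(x, y):
    """Cook's distance for each case: (r_i^2 / p) * h_i / (1 - h_i)."""
    n, p = len(x), 2  # p = intercept + slope
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    out = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - mx) ** 2 / sxx   # leverage
        r2 = e * e / (s2 * (1 - h))          # squared studentized residual
        out.append(r2 / p * h / (1 - h))
    return out

def slope_without(x, y, i):
    """Slope of the line refitted with case i omitted."""
    return fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])[1]

# Hypothetical data: the last case is far out on x in both sets.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 14]
y_on = [1.5, 1.6, 3.4, 3.7, 5.5, 5.6, 7.4, 7.6, 9.4, 14.0]   # on the trend
y_off = [1.5, 1.6, 3.4, 3.7, 5.5, 5.6, 7.4, 7.6, 9.4, 5.0]   # off the trend

last = len(x) - 1
d_on = cooks_distance(x, y_on)
d_off = cooks_distance(x, y_off)

for label, y, d in (("on trend", y_on, d_on), ("off trend", y_off, d_off)):
    change = slope_without(x, y, last) - fit_line(x, y)[1]
    print(f"{label}: Cook's D = {d[last]:.3f}, slope change if omitted = {change:+.3f}")
```

The same x value gives a negligible Cook's distance when the case follows the trend and a large one when it does not, which is the distinction the slide draws between outliers and influential points. A common rule of thumb treats values above about 1 as large, though that cutoff is a convention and not a test.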
Now what?
Step 1. Detect
Step 2. Isolate
Step 3. Examine
– Are they qualitatively different?
– Are they influential?
Another thing to consider: influential “clusters”?
Example: Groups of Cases
Now what?
Step 1. Detect
Step 2. Isolate
Step 3. Examine
– Are they qualitatively different?
– Are they influential?
Step 4. Delete or retain as you see fit … or try both
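One way to "try both" is to fit the model with and without the flagged cases and compare the coefficients. The sketch below is only an illustration under assumptions: a single predictor, made-up data, and a hypothetical helper `fit_without` that drops a whole group of cases at once, which also covers the cluster situation mentioned earlier.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def fit_without(x, y, drop):
    """Refit after removing every case whose index is in `drop`."""
    keep = [i for i in range(len(x)) if i not in drop]
    return fit_line([x[i] for i in keep], [y[i] for i in keep])

# Hypothetical data: cases 8 and 9 form a cluster far out on x and off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 14, 15]
y = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 7.8, 4.0, 4.2]
flagged = {8, 9}

b0_all, b1_all = fit_line(x, y)
b0_sub, b1_sub = fit_without(x, y, flagged)

print(f"all cases:       intercept={b0_all:.2f}, slope={b1_all:.2f}")
print(f"flagged removed: intercept={b0_sub:.2f}, slope={b1_sub:.2f}")
```

Reporting both fits side by side lets the reader see how much the conclusion depends on the flagged cases. Dropping the group together matters here because two nearby cases can hide each other when cases are deleted one at a time.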
The End