Outliers and influential data points
No outliers?
An outlier? Influential?
Impact on regression analyses
Not every outlier strongly influences the estimated regression function.
Always determine whether the estimated regression function is unduly influenced by one or a few cases.
Use simple plots for simple linear regression and summary measures for multiple linear regression.
The hat matrix H
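Least squares estimates
The regression model: $Y = X\beta + \varepsilon$
Least squares estimates: $b = (X'X)^{-1}X'Y$
Fitted values: $\hat{Y} = Xb = X(X'X)^{-1}X'Y = HY$, where $H = X(X'X)^{-1}X'$ is the hat matrix.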
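As a rough illustration of how the hat matrix turns observed responses into fitted values, the sketch below builds H for a simple linear regression (intercept plus one predictor) and checks that HY matches the fitted line. The data and the helper names `hat_matrix` and `fit_line` are made up for this example, and the closed form for the entries of H applies only to the one-predictor case.

```python
# Hypothetical data, made up for illustration (not taken from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 14.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0]

def hat_matrix(x):
    """Hat matrix H = X (X'X)^{-1} X' for X = [1, x] (one predictor only)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Closed form for simple linear regression:
    # h_ij = 1/n + (x_i - xbar)(x_j - xbar) / Sxx
    return [[1.0 / n + (xi - xbar) * (xj - xbar) / sxx for xj in x] for xi in x]

def fit_line(x, y):
    """Least squares intercept b0 and slope b1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

H = hat_matrix(x)
b0, b1 = fit_line(x, y)

# Fitted values two ways: from the fitted line, and as H times Y.
fits_direct = [b0 + b1 * xi for xi in x]
fits_via_H = [sum(hij * yj for hij, yj in zip(row, y)) for row in H]

# The diagonal of H holds the leverages h_ii.
leverages = [H[i][i] for i in range(len(x))]
```

The two lists of fitted values should agree up to rounding, which is what "H puts the hat on Y" means in practice.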
Identifying outlying Y values
Residuals
Standardized residuals (also called internally studentized residuals)
Deleted residuals
Deleted t residuals (also called studentized deleted residuals or externally studentized residuals)
Residuals
Ordinary residuals, defined for each observation $i = 1, \ldots, n$: $e_i = y_i - \hat{y}_i$
Using matrix notation: $e = Y - \hat{Y} = Y - HY = (I - H)Y$
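Variance of the residuals
Residual vector: $e = (I - H)Y$
Variance matrix: $\sigma^2(e) = \sigma^2 (I - H)$
Variance of the i th residual: $\sigma^2(e_i) = \sigma^2 (1 - h_{ii})$
Estimated variance of the i th residual: $s^2(e_i) = MSE \, (1 - h_{ii})$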
Standardized residuals
Standardized residuals, defined for each observation $i = 1, \ldots, n$: $r_i = e_i / s(e_i) = e_i / \sqrt{MSE \, (1 - h_{ii})}$
Standardized residuals quantify how large the residuals are in standard deviation units.
A standardized residual larger than 2 or smaller than -2 suggests that the corresponding y value is unusual.
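The calculation can be sketched for a simple linear regression as follows. The data are hypothetical, p = 2 assumes an intercept and one predictor, and the leverage formula used is the one-predictor closed form; the cutoff of 2 is the rule of thumb stated above.

```python
import math

# Hypothetical data, made up for illustration; the fifth y value is
# deliberately placed far from the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.0, 4.1, 5.9, 8.2, 15.0, 12.1, 13.8, 16.2, 17.9, 20.1]

n, p = len(x), 2  # p = 2 parameters: intercept and slope
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Ordinary residuals e_i = y_i - yhat_i
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Leverages h_ii (closed form for one predictor)
lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# MSE = SSE / (n - p)
mse = sum(e ** 2 for e in resid) / (n - p)

# Standardized residuals r_i = e_i / sqrt(MSE * (1 - h_ii))
std_resid = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(resid, lev)]

# Indices (0-based) of cases beyond the +/- 2 rule of thumb
flagged = [i for i, r in enumerate(std_resid) if abs(r) > 2.0]
```

Each residual is divided by its own estimated standard deviation, so cases with high leverage (small 1 - h_ii) are scaled up more than cases near the mean of x.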
An outlying y value?
[Minitab output: data listing with columns x, y, FITS1, HI1, s(e), RESI1, SRES; the value of S; and an Unusual Observations table with columns Obs, x, y, Fit, SE Fit, Residual, St Resid]
R denotes an observation with a large standardized residual.
Deleted residuals
If an observed $y_i$ is extreme, it may “pull” the fitted equation towards itself, thereby yielding a small ordinary residual.
Delete the i th case, estimate the regression function using the remaining n-1 cases, and use the x values of the i th case to predict its response.
Deleted residual: $d_i = y_i - \hat{y}_{i(i)}$, where $\hat{y}_{i(i)}$ is the predicted response for the i th case from the fit that excludes it.
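Deleted t residuals
A deleted t residual is just a standardized deleted residual: $t_i = d_i / s(d_i) = e_i / \sqrt{MSE_{(i)} (1 - h_{ii})}$, where $MSE_{(i)}$ is the mean squared error of the fit that excludes the i th case.
The deleted t residuals follow a t distribution with (n-1)-p degrees of freedom.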
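A minimal sketch of the delete-and-refit procedure for a simple linear regression, on hypothetical data. It computes the deleted t residuals the long way, by refitting n times, and then with a standard algebraic shortcut that needs only the full fit: $t_i = e_i \sqrt{(n-p-1) / (SSE(1-h_{ii}) - e_i^2)}$. The shortcut is a textbook identity brought in here for comparison; the helper names are illustrative.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def leverages(x):
    """Leverages h_ii for X = [1, x]."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.9, 4.2, 6.1, 7.7, 14.0, 12.2, 13.9, 16.1]
n, p = len(x), 2

b0, b1 = fit_line(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = leverages(x)
sse = sum(ei ** 2 for ei in e)

# Long way: delete case i, refit on the other n-1 cases, predict case i.
t_refit = []
for i in range(n):
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit_line(xs, ys)
    d_i = y[i] - (a0 + a1 * x[i])              # deleted residual
    mse_del = sum((yj - (a0 + a1 * xj)) ** 2
                  for xj, yj in zip(xs, ys)) / ((n - 1) - p)
    s_d = math.sqrt(mse_del / (1.0 - h[i]))    # s(d_i)
    t_refit.append(d_i / s_d)

# Shortcut: same quantity from the full fit only.
t_short = [ei * math.sqrt((n - p - 1) / (sse * (1.0 - hi) - ei ** 2))
           for ei, hi in zip(e, h)]
```

Both lists should agree up to rounding, which is why software can report deleted t residuals without actually refitting the model n times.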
[Minitab output: data listing with columns x, y, RESI1, TRES]
[Minitab output: data listing with columns Row, x, y, RESI1, SRES1, TRES]
Identifying outlying X values
Use the diagonal elements, $h_{ii}$, of the hat matrix H to identify outlying X values. The $h_{ii}$ are called leverages.
Properties of the leverages ($h_{ii}$)
The $h_{ii}$ is a measure of the distance between the X values for the i th case and the means of the X values for all n cases.
The $h_{ii}$ is a number between 0 and 1, inclusive.
The sum of the $h_{ii}$ equals p, the number of parameters.
[Minitab output: column HI1 of leverages and its sum, Sum of HI1]
Properties of the leverages ($h_{ii}$)
If the i th case is outlying in terms of its X values, it has a large leverage value $h_{ii}$, and therefore exercises substantial leverage in determining the fitted value $\hat{y}_i$.
Using leverages to identify outlying X values
Minitab flags any observation whose leverage value, $h_{ii}$, is more than 3 times the mean leverage value, $\bar{h} = p/n$, or is greater than 0.99.
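The flagging rule can be sketched as below for a simple linear regression. The data are hypothetical (twenty evenly spaced x values plus one far to the right), p = 2 assumes an intercept and one predictor, and the mean leverage is taken as p/n because the leverages sum to p.

```python
# Hypothetical x values, made up for illustration: 1..20 plus one remote point.
x = [float(v) for v in range(1, 21)] + [100.0]

n, p = len(x), 2  # intercept + slope
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Leverages h_ii (closed form for one predictor)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

mean_leverage = p / n  # because the h_ii sum to p

# Flag a case if h_ii > 3 * (p / n) or h_ii > 0.99
flagged = [i for i, hi in enumerate(h)
           if hi > 3.0 * mean_leverage or hi > 0.99]
```

Note that with very small n the cutoff 3p/n can exceed 1, in which case only the 0.99 condition can ever trigger.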
[Minitab output: Unusual Observations table with columns Obs, x, y, Fit, SE Fit, Residual, St Resid; data listing with columns x, y, HI]
X denotes an observation whose X value gives it large influence.
[Minitab output: data listing with columns x, y, HI; Unusual Observations table with columns Obs, x, y, Fit, SE Fit, Residual, St Resid]
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Identifying influential cases
Influence
A case is influential if its exclusion causes major changes in the estimated regression function.
Identifying influential cases
Difference in fits, DFITS
Cook’s distance measure
DFITS
The difference in fits, $DFITS_i = (\hat{y}_i - \hat{y}_{i(i)}) / \sqrt{MSE_{(i)} h_{ii}}$, represents the number of standard deviations by which the fitted value increases or decreases when the i th case is included.
DFITS
A case is influential if the absolute value of its DFITS value is greater than 1 for small to medium data sets, or greater than $2\sqrt{p/n}$ for large data sets.
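A minimal sketch of the DFITS calculation for a simple linear regression, on hypothetical data. It computes each value by actually deleting the case and refitting, then checks it against the usual shortcut $t_i \sqrt{h_{ii}/(1-h_{ii})}$, where $t_i$ is the deleted t residual. The large-data-set cutoff is taken here as $2\sqrt{p/n}$, the guideline commonly paired with this rule; treat that value as an assumption, since it is not legible in the slide text.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data, made up for illustration: the last case is remote in x
# and lies off the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.2, 3.8, 6.1, 8.3, 9.7, 12.2, 13.9, 18.0]
n, p = len(x), 2

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

b0, b1 = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fits)]
sse = sum(ei ** 2 for ei in e)

# By definition: refit without case i and compare the two fitted values.
dfits_refit = []
for i in range(n):
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit_line(xs, ys)
    mse_del = sum((yj - (a0 + a1 * xj)) ** 2
                  for xj, yj in zip(xs, ys)) / ((n - 1) - p)
    fit_without_i = a0 + a1 * x[i]
    dfits_refit.append((fits[i] - fit_without_i) / math.sqrt(mse_del * h[i]))

# Shortcut: deleted t residual times sqrt(h / (1 - h)).
t = [ei * math.sqrt((n - p - 1) / (sse * (1.0 - hi) - ei ** 2))
     for ei, hi in zip(e, h)]
dfits_short = [ti * math.sqrt(hi / (1.0 - hi)) for ti, hi in zip(t, h)]

# Cutoffs: 1 for small to medium data sets; 2*sqrt(p/n) assumed for large ones.
small_cutoff = 1.0
large_cutoff = 2.0 * math.sqrt(p / n)
influential = [i for i, d in enumerate(dfits_refit) if abs(d) > small_cutoff]
```

A case needs both a sizeable deleted t residual and a sizeable leverage to produce a large DFITS value, which matches the idea that influence combines outlyingness in y and in x.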
[Minitab output: data listing with columns x, y, DFIT]
[Minitab output: data listing with columns x, y, DFIT]
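Cook’s distance
Cook’s distance measure, $D_i = \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 / (p \cdot MSE)$, considers the influence of the i th case on all n fitted values. Here $\hat{y}_{j(i)}$ is the fitted value for the j th case when the i th case is excluded from the fit.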
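To make "all n fitted values" concrete, the sketch below computes Cook's distance for a simple linear regression by deleting each case in turn, refitting, and summing the squared changes in every fitted value. It then checks the result against the standard one-fit shortcut $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$. The data and helper names are hypothetical, and p = 2 assumes an intercept and one predictor.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.2, 3.8, 6.1, 8.3, 9.7, 12.2, 13.9, 18.0]
n, p = len(x), 2

b0, b1 = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fits)]
mse = sum(ei ** 2 for ei in e) / (n - p)

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# By definition: change in ALL n fitted values when case i is deleted.
cook_refit = []
for i in range(n):
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit_line(xs, ys)
    fits_without_i = [a0 + a1 * xj for xj in x]  # predictions for every case
    change = sum((fj - gj) ** 2 for fj, gj in zip(fits, fits_without_i))
    cook_refit.append(change / (p * mse))

# Shortcut from the full fit only.
cook_short = [(ei ** 2 / (p * mse)) * hi / (1.0 - hi) ** 2
              for ei, hi in zip(e, h)]

most_influential = max(range(n), key=lambda i: cook_refit[i])
```

The two versions should agree up to rounding. Comparing each D_i with a percentile of the F(p, n-p) distribution would need a statistics library and is left out of this sketch.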
Cook’s distance
Relate $D_i$ to the F(p, n-p) distribution.
If $D_i$ is greater than the 50th percentile, F(0.50, p, n-p), then the i th case has lots of influence.
[Minitab output: data listing with columns x, y, COOK]
[Minitab output: data listing with columns x, y, COOK]