1 Outliers and Influential Observations KNN Ch. 10 (pp )
2 Outlying Observations
Data sets sometimes contain observations that are outlying or extreme. Such outliers can have a strong effect on the regression analysis. We have to identify these observations and then decide whether they should be eliminated or whether their influence should be reduced. With more than one predictor variable, simple plots (boxplots, scatterplots, etc.) may fail to reveal outliers, so we rely on the residuals or on functions of the residuals. We will now look at some of these functions.
3 Residuals and Semistudentized Residuals
Previously, we examined:
Residuals: $e_i = Y_i - \hat{Y}_i$
Semistudentized residuals: $e_i^* = e_i / \sqrt{MSE}$
We will now introduce a few refinements that are more effective in identifying Y outliers. First we need to recall the hat matrix.
4 Leverages
We previously defined the hat matrix as $H = X(X'X)^{-1}X'$. Using the hat matrix, $\hat{Y} = HY$ and $e = (I - H)Y$.
The diagonal elements of the hat matrix, $h_{ii}$, with $0 \le h_{ii} \le 1$, are called leverages. They are used to detect outlying X observations. Leverage values are also useful for detecting hidden extrapolations when p > 3.
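The hat-matrix relations above can be checked numerically. The following is a minimal sketch using NumPy; the data values and the variable names (`x1`, `x2`, `y`) are hypothetical and are not taken from the slides' example.

```python
import numpy as np

# Hypothetical data: n = 8 cases, an intercept plus two predictors (p = 3).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.6, 15.2, 19.0])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

# Hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T

fitted = H @ y                    # Y-hat = H Y
residuals = (np.eye(n) - H) @ y   # e = (I - H) Y
leverages = np.diag(H)            # h_ii, the diagonal of H

print("leverages:", np.round(leverages, 3))
print("sum of leverages:", round(float(leverages.sum()), 6), "(equals p)")
```

Forming the explicit inverse is fine for a small illustration; for larger or ill-conditioned problems a QR-based computation would usually be preferred.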
5 Measures for Y-outlier detection
An estimator of the standard deviation of the i-th residual is $s\{e_i\} = \sqrt{MSE\,(1 - h_{ii})}$.
Dividing each residual by its estimated standard deviation gives the studentized residuals: $r_i = e_i / s\{e_i\}$.
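As an illustration, the studentized residuals can be computed directly from the leverages, taking the estimated standard deviation of the i-th residual to be sqrt(MSE (1 - h_ii)). This is a sketch on hypothetical data; the function name `studentized_residuals` is chosen here for illustration only.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return r_i = e_i / sqrt(MSE * (1 - h_ii)) for every case."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                 # leverages h_ii
    e = y - H @ y                  # ordinary residuals
    mse = (e @ e) / (n - p)        # MSE = SSE / (n - p)
    return e / np.sqrt(mse * (1.0 - h))

# Hypothetical data: intercept plus two predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.6, 15.2, 19.0])
X = np.column_stack([np.ones_like(x1), x1, x2])

r = studentized_residuals(X, y)
print(np.round(r, 3))
```

Unlike the semistudentized residuals, which divide every residual by the same sqrt(MSE), each residual here is scaled by its own standard deviation, so high-leverage cases are not understated.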
6 Measures for Y-outlier detection
Another effective measure for Y-outlier identification is obtained when we delete observation i, fit the regression function to the remaining n - 1 observations, and obtain the predicted value $\hat{Y}_{i(i)}$ for that observation given its X levels. The difference between the observed value and this predicted value is the deleted residual. It can also be expressed using the leverage value, so no refitting is needed.
Deleted residuals: $d_i = Y_i - \hat{Y}_{i(i)} = \dfrac{e_i}{1 - h_{ii}}$
Studentized deleted residuals: $t_i = \dfrac{d_i}{s\{d_i\}} = e_i \left[ \dfrac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2}$
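The sketch below computes deleted residuals from the leverage shortcut d_i = e_i / (1 - h_ii), computes the studentized deleted residuals from the closed form t_i = e_i sqrt((n - p - 1) / (SSE (1 - h_ii) - e_i^2)), and compares the deleted residuals with a brute-force refit that actually drops each case. The data and the function names are hypothetical.

```python
import numpy as np

def deleted_residuals(X, y):
    """Return (d, t): deleted and studentized deleted residuals, without refitting."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    sse = e @ e
    d = e / (1.0 - h)
    t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    return d, t

def deleted_residual_by_refit(X, y, i):
    """Drop case i, refit by least squares, and return Y_i minus its prediction."""
    keep = np.arange(len(y)) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return y[i] - X[i] @ b

# Hypothetical data: intercept plus two predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.6, 15.2, 19.0])
X = np.column_stack([np.ones_like(x1), x1, x2])

d, t = deleted_residuals(X, y)
d_refit = np.array([deleted_residual_by_refit(X, y, i) for i in range(len(y))])
print("max |shortcut - refit|:", float(np.max(np.abs(d - d_refit))))
```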
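7 Detection of outlying Y Observations
Criterion for outliers: to establish that the i-th observation is an outlier, compare $|t_i|$ with $t(1 - \alpha/2n;\ n - p - 1)$, the $100(1 - \alpha/2n)$-th percentile of the t distribution with n - p - 1 degrees of freedom. If $|t_i|$ exceeds this value, observation i is a Y outlier. Dividing $\alpha$ by 2n is a Bonferroni adjustment for testing all n cases.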
8 Outlying X Observations
The average leverage value is $\bar{h} = \dfrac{1}{n}\sum_{i=1}^{n} h_{ii} = \dfrac{p}{n}$.
Criterion for outliers: if $h_{ii} > 2p/n$, then observation i is an X outlier.
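A short sketch of the leverage rule, flagging cases whose leverage exceeds twice the average leverage p/n. The data are hypothetical, with the last case deliberately placed far from the others in X1; the function name `x_outliers` is illustrative.

```python
import numpy as np

def x_outliers(X):
    """Return (h, flags) where flags[i] is True when h_ii > 2p/n."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    return h, h > 2.0 * p / n

# Hypothetical data: the last case has an unusually large X1 value.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
X = np.column_stack([np.ones_like(x1), x1, x2])

h, flags = x_outliers(X)
print("cutoff 2p/n:", 2.0 * X.shape[1] / X.shape[0])
print("flagged cases (0-based):", np.flatnonzero(flags))
```

With these hypothetical values the cutoff is 0.75 and only the last case exceeds it. Note that the rule uses X alone; the response Y plays no part in the leverages.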
9 A Simple Example
[Figure: plot of X1, X2, X3, and Y]
10 A Simple Example (continued)
[Regression output: regression equation of Y on X1, X2, X3; coefficient table with columns Predictor, Coef, StDev, T, P; R-Sq = 95.7%, R-Sq(adj) = 95.6%; analysis of variance table with columns Source, DF, SS, MS, F, P and rows Regression, Residual Error, Total; case table with columns Y, Pred Y, Resid., Stud. Res., Del. Stud. Res., h_ii]
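11 Influence of Outlying X/Y Observations
Influence on a single fitted value: DFFITS measures the influence that case i has on its own fitted value $\hat{Y}_i$. Omission is the test: if excluding the case causes major changes in the fitted regression function, then the case is indeed influential.
$DFFITS_i = \dfrac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}} = t_i \left( \dfrac{h_{ii}}{1 - h_{ii}} \right)^{1/2}$
where $\hat{Y}_{i(i)}$ and $MSE_{(i)}$ come from the fit with case i omitted, and $t_i$ is the studentized deleted residual.
Criteria for influential observations: $|DFFITS_i| > 1$ (small to medium data sets), or $|DFFITS_i| > 2\sqrt{p/n}$ (large data sets).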
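A minimal sketch of DFFITS, computed as the studentized deleted residual t_i multiplied by sqrt(h_ii / (1 - h_ii)), so that no refitting is required. The data and the function name `dffits` are hypothetical; both cutoffs (1 and 2 sqrt(p/n)) are printed so either rule can be applied.

```python
import numpy as np

def dffits(X, y):
    """Return DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)) for every case."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    sse = e @ e
    # Studentized deleted residuals from the closed form.
    t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    return t * np.sqrt(h / (1.0 - h))

# Hypothetical data: intercept plus two predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.6, 15.2, 19.0])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

values = dffits(X, y)
print("DFFITS:", np.round(values, 3))
print("cutoff (small/medium data sets):", 1.0)
print("cutoff (large data sets):", round(2.0 * np.sqrt(p / n), 3))
```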
12 Influence of Outlying X/Y Observations
An aggregate measure is also required: one which measures the effect of omitting case i on all n fitted values, not just the i-th fitted value. The statistic is Cook's Distance:
$D_i = \dfrac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p\, MSE} = \dfrac{e_i^2}{p\, MSE} \left[ \dfrac{h_{ii}}{(1 - h_{ii})^2} \right]$
Criterion for influential observations: compare $D_i$ with the F distribution with (p, n - p) degrees of freedom. If the percentile that $D_i$ cuts off from the left side of the distribution curve is below about 10 to 20, the observation has little influence; if this percentile is 50 or more, the influence is large.
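Cook's Distance can likewise be computed from the residuals and leverages alone, using D_i = (e_i^2 / (p MSE)) h_ii / (1 - h_ii)^2. The sketch below uses hypothetical data and an illustrative function name, and also checks the shortcut against the definition, which sums the squared changes in all n fitted values after case i is dropped.

```python
import numpy as np

def cooks_distance(X, y):
    """Return D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2 for every case."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse = (e @ e) / (n - p)
    return (e**2 / (p * mse)) * h / (1.0 - h) ** 2

def cooks_distance_by_refit(X, y, i):
    """Definition: sum_j (Yhat_j - Yhat_j(i))^2 / (p * MSE), refitting without case i."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    mse = np.sum((y - X @ b) ** 2) / (n - p)
    return np.sum((X @ b - X @ b_i) ** 2) / (p * mse)

# Hypothetical data: intercept plus two predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.6, 15.2, 19.0])
X = np.column_stack([np.ones_like(x1), x1, x2])

D = cooks_distance(X, y)
D_refit = np.array([cooks_distance_by_refit(X, y, i) for i in range(len(y))])
print("Cook's D:", np.round(D, 3))
```

To apply the percentile criterion, each D_i would be passed to the cumulative distribution function of F(p, n - p), for example `scipy.stats.f.cdf(D, p, n - p)` if SciPy is available; that step is left out here to keep the sketch dependent on NumPy only.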
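13 Influence of outliers on betas
Another measure is required: one which measures the effect of omitting case i on the OLS estimates of the regression coefficients (betas).
$DFBETAS_{k(i)} = \dfrac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}\, c_{kk}}}, \quad k = 0, 1, \ldots, p - 1$
Here, $b_{k(i)}$ is the estimate of the k-th coefficient with case i omitted, and $c_{kk}$ is the k-th diagonal element of $(X'X)^{-1}$.
Criteria for influential observations: $|DFBETAS_{k(i)}| > 1$ for small data sets, or $|DFBETAS_{k(i)}| > 2/\sqrt{n}$ for large data sets.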
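The following sketch computes the full n-by-p table of DFBETAS values without refitting. It relies on two standard deletion identities, stated here as assumptions of the sketch: b - b_(i) = (X'X)^{-1} x_i e_i / (1 - h_ii), and SSE_(i) = SSE - e_i^2 / (1 - h_ii). The data and function name are hypothetical.

```python
import numpy as np

def dfbetas(X, y):
    """Return an (n, p) array whose entry [i, k] is DFBETAS_k(i)."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)   # leverages x_i' (X'X)^{-1} x_i
    b = xtx_inv @ X.T @ y
    e = y - X @ b
    sse = e @ e
    # MSE with case i deleted, from SSE_(i) = SSE - e_i^2 / (1 - h_ii).
    mse_del = (sse - e**2 / (1.0 - h)) / (n - p - 1)
    # Row i holds b - b_(i) = (X'X)^{-1} x_i * e_i / (1 - h_ii).
    delta = (X @ xtx_inv) * (e / (1.0 - h))[:, None]
    c = np.diag(xtx_inv)                          # c_kk
    return delta / np.sqrt(mse_del[:, None] * c[None, :])

# Hypothetical data: intercept plus two predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.6, 15.2, 19.0])
X = np.column_stack([np.ones_like(x1), x1, x2])

table = dfbetas(X, y)
print(np.round(table, 3))
print("large-sample cutoff 2/sqrt(n):", round(2.0 / np.sqrt(len(y)), 3))
```

Each row corresponds to a deleted case and each column to a coefficient (intercept first), so a large entry points to both the influential case and the coefficient it moves.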
14 A Simple Example
[Regression output: regression equation of Y on X1, X2, X3; coefficient table with columns Predictor, Coef, StDev, T, P; R-Sq = 95.7%, R-Sq(adj) = 95.6%; analysis of variance table with columns Source, DF, SS, MS, F, P and rows Regression, Residual Error, Total; case table with columns Y, Resid., Stud. Res., Del. Stud. Res., h_ii, DFFITS, COOKD]