Individual observations need to be checked to see whether they are:
– outliers; or
– influential observations

Outliers are defined as observations that differ from the majority of cases in the data set. They can be outliers in either:
– the covariate (x) direction; or
– the response (y) direction

Depending on the dimension, it may be easy or difficult to find outliers in the covariate (x) direction:
– one x: easy; do a univariate plot (a boxplot shows outliers)
– two x’s: do a scatterplot of one against the other
– multiple x’s: more difficult…
Outliers in the y direction could be due to:
– an unusually large epsilon (the error term)
– recording errors (in either the x’s or y)
– missing covariate(s)

Outliers can often be found, but their causes and the right remedy (should they be excluded?) are often hard to determine. Try fitting the model with and without the outliers: if there is no substantive change in the results, then remove them; if the results do change, then be careful! Can additional data be collected? Outliers can often be the most interesting cases…
– see Figures 6.6 (a-c) on page 185
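The "with and without" comparison can be sketched in a few lines of NumPy. The data, the helper name `fit_ols`, and the choice of suspect case are all made up for illustration; this is only one way to set up the comparison.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit of y on an intercept and one x; returns (b0, b1)."""
    A = np.column_stack([np.ones(x.size), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Hypothetical data: an exact line, with the last response replaced by a wild value
x = np.arange(10.0)
y = 3.0 + 0.5 * x
suspect = 9
y[suspect] = 20.0

keep = np.arange(x.size) != suspect
beta_all = fit_ols(x, y)                    # fit using every case
beta_without = fit_ols(x[keep], y[keep])    # fit with the suspect case left out

print("all cases:      ", beta_all)
print("without suspect:", beta_without)
```

Here the two fits disagree noticeably in the slope, which is the "be careful" situation described above.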
An easy check for possible outliers in the y direction is to use the Studentized residuals $d_i$. They are approximately N(0,1), so if $|d_i| > 2.5$ or so, then the ith observation is a possible outlier in the response direction. Plots of these residuals will usually show these points clearly… try normal quantile plots of the $d_i$. But be careful: the case in Figure 6.6(a) would clearly show up, but the one in 6.6(b) would not…
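A minimal sketch of this check, assuming the Studentized residual is the internally Studentized form $d_i = e_i / (s\sqrt{1 - h_{ii}})$ and that the model includes an intercept. The function name and the simulated data are illustrative only.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally Studentized residuals d_i = e_i / (s * sqrt(1 - h_ii)).

    X holds the explanatory variables, shape (n,) or (n, p), WITHOUT the
    intercept column; the intercept is added here (an assumption).
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    H = A @ np.linalg.pinv(A)              # hat matrix
    h = np.diag(H)                         # leverages h_ii
    e = y - H @ y                          # ordinary residuals
    s2 = e @ e / (n - A.shape[1])          # residual mean square
    return e / np.sqrt(s2 * (1.0 - h))

# Made-up data: a straight line with one response shifted upward
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 25)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[12] += 6.0

d = studentized_residuals(x, y)
flagged = np.flatnonzero(np.abs(d) > 2.5)  # the rule of thumb from the text
print(flagged)
```

The shifted case sits in the middle of the x range, so it has low leverage and its residual stays large; a shifted case at the edge of the x range can hide, because the fit is pulled toward it.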
An individual observation is influential if the conclusions of the analysis done without the observation are vastly different from the conclusions with the observation included; see Fig. 6.7 (a-b) on page 187. The “hat” matrix $H = X(X'X)^{-1}X'$ gives us information about the leverage that individual points have, since we have that $\hat{y} = Hy$, i.e. $\hat{y}_i = h_{ii} y_i + \sum_{j \ne i} h_{ij} y_j$. So large values of $h_{ii}$ (close to 1), relative to the other h’s, mean that the ith observation has high leverage, in the sense that the ith fitted value is “attracted” to the response of the ith observation. (Think about the simple linear regression case…)
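For the simple linear regression case, the leverage has a closed form. This is the standard result for a model with an intercept and a single x, given here as an aid and not taken from the slides:

$$
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
$$

So $h_{ii}$ takes its smallest value, $1/n$, when $x_i = \bar{x}$, and grows with the squared distance of $x_i$ from $\bar{x}$; the response y does not enter at all.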
Some properties of the leverage $h_{ii}$:
– it is a function of the explanatory variables but not of y
– it is small for cases near the centroid of the X space and large for cases far away. The centroid is $(\bar{x}_1, \ldots, \bar{x}_p)$
– $\sum_{i=1}^{n} h_{ii} = p + 1$, where p = # of explanatory variables;
– so, the average leverage is $\bar{h} = (p+1)/n$

Thus, one way to check for large leverage is to compare $h_{ii}$ with the mean: if $h_{ii}$ is bigger than 2 times $\bar{h}$, it is usually considered a high leverage observation. Your author says: “Cases with high leverage need to be identified and examined carefully”
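The leverage check can be sketched as follows, assuming the model includes an intercept so that the leverages sum to p + 1 and the average is (p + 1)/n. The function name and data are hypothetical.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a model with intercept.

    X holds the explanatory variables, shape (n,) or (n, p), without the
    intercept column. Uses H = Q Q^T from a reduced QR of the design matrix.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    Q, _ = np.linalg.qr(A)            # reduced QR: Q has shape (n, p + 1)
    return np.sum(Q**2, axis=1)       # row sums of squares give h_ii

# Made-up x values: 0, 1, ..., 18 plus one case far from the rest
x = np.append(np.arange(19.0), 60.0)
h = leverages(x)

n, p = x.size, 1
h_bar = (p + 1) / n                   # average leverage
high = np.flatnonzero(h > 2 * h_bar)  # rule of thumb: h_ii > 2 * h_bar
print(high, h[high])
```

Note that no y appears anywhere: leverage depends only on the explanatory variables.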
Another way to check for an influential point is to see what happens when that point is removed and the regression is done without it… Several statistics are built on this idea. The first is called Cook’s D, defined as $D_i = \dfrac{d_i^2}{p+1} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$; here $h_{ii}$ is the ith leverage and $d_i$ is the ith Studentized residual. Note that both a large leverage and a large $d_i$ are required to make Cook’s D large. How big does it have to be? Values > 1 (or even > 0.5) are given in the literature as influential…
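A minimal sketch of Cook’s D, assuming the usual form $D_i = d_i^2 h_{ii} / ((p+1)(1 - h_{ii}))$ with the internally Studentized residual $d_i$ and a model that includes an intercept. The function name and the data are made up for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D from leverages and Studentized residuals of the full fit.

    X holds the explanatory variables, shape (n,) or (n, p), without the
    intercept column (the intercept is added here).
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    k = A.shape[1]                        # k = p + 1 coefficients
    H = A @ np.linalg.pinv(A)             # hat matrix
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - k)
    d = e / np.sqrt(s2 * (1.0 - h))       # Studentized residuals
    return d**2 * h / (k * (1.0 - h))

# Made-up data: one case far out in x whose response is well off the line
rng = np.random.default_rng(0)
x = np.append(np.arange(19.0), 60.0)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[19] -= 40.0

D = cooks_distance(x, y)
print(np.flatnonzero(D > 1))              # cases exceeding the "> 1" guideline
```

No refitting is needed: everything comes from the regression on the full data set, even though D measures how much the estimates would move if case i were deleted.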
Another quantity of interest is called the ith PRESS residual: $e_{(i)} = y_i - \hat{y}_{(i)} = \dfrac{e_i}{1 - h_{ii}}$. Here the (i) indicates that the ith case is removed, so $\hat{y}_{(i)}$ is the prediction for case i from the fit without it; notice that large leverage makes these PRESS residuals large. Only the original residuals $e_i$ and the leverages from the regression on the full data set are needed to compute these statistics. They are called PRESS residuals because their sum of squares is called the “prediction error sum of squares”. Let’s go through the forestry example in section on page …
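The shortcut can be checked numerically. The sketch below assumes the PRESS residual is $e_i / (1 - h_{ii})$ and compares it with the brute-force version that actually refits the model n times; function names and data are illustrative only.

```python
import numpy as np

def press_residuals(X, y):
    """PRESS residuals e_i / (1 - h_ii) from a single fit to the full data."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])   # intercept added (an assumption)
    H = A @ np.linalg.pinv(A)
    e = y - H @ y
    return e / (1.0 - np.diag(H))

def press_residuals_refit(X, y):
    """Same quantities the slow way: drop case i, refit, predict case i."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        out[i] = y[i] - A[i] @ beta
    return out

# Made-up data
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=15)
y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=x.size)

r_fast = press_residuals(x, y)
r_slow = press_residuals_refit(x, y)
press = np.sum(r_fast**2)                  # prediction error sum of squares
print(np.allclose(r_fast, r_slow), press)
```

Because each $1 - h_{ii}$ is below 1, every PRESS residual is at least as large in magnitude as the ordinary residual, so PRESS is never smaller than the residual sum of squares.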