12/17/ lecture 111 STATS 330: Lecture 11
12/17/ lecture 112 Outliers and high-leverage points An outlier is a point that has a larger or smaller y value that the model would suggest Can be due to a genuine large error Can be caused by typographical errors in recording the data A high leverage point is a point with extreme values of the explanatory variables
12/17/ lecture 113 Outliers The effect of an outlier depends on whether it is also a high leverage point A “high leverage” outlier Can attract the fitted plane, distorting the fit, sometimes extremely In extreme cases may not have a big residual In extreme cases can increase R 2 A “low leverage” outlier Does not distort the fit to the same extent Usually has a big residual Inflates standard errors, decreases R 2
12/17/ lecture 114 No outliers No high- leverage points Low leverage Outlier: big residual High leverage Not an outlier High- leverage outlier
12/17/ lecture 115 Example: the education data (ignoring urban) High leverage point
12/17/ lecture 116 An outlier also? Residual somewhat extreme
12/17/ lecture 117 Measuring leverage It can be shown (see eg STATS 310) that the fitted value of case i is related to the response data y 1,…,y n by the equation The h ij depend on the explanatory variables. The quantities h ii are called “hat matrix diagonals” (HMD’s) and measure the influence y i has on the ith fitted value. They can also be interpreted as the distance between the X-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are
12/17/ lecture 118 Interpreting the HMDs Each HMD lies between 0 and 1 The average HMD is p/n (p=no of reg coefficients, p=k+1) An HMD more than 3p/n is considered extreme
12/17/ lecture 119 Example: the education data educ.lm<-lm(educ~percapita+under18, data=educ.df) hatvalues(educ.lm)[50] > 9/50 [1] 0.18 Clearly extreme! n=50, p=3
12/17/ lecture 1110 Studentized residuals How can we recognize a big residual? How big is big? The actual size depends on the units in which the y-variable is measured, so we need to standardize them. Can divide by their standard deviations Variance of a typical residual e is var(e) = (1-h) 2 where h is the hat matrix diagonal for the point.
12/17/ lecture 1111 Studentized residuals (2) “Internally studentised” (Called “standardised” in R) “Externally studentised” (Called “studentised” in R) s 2 is Usual Estimate of 2 s 2 i is estimate of 2 after deleting the ith data point
12/17/ lecture 1112 Studentized residuals (3) How big is big? Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers. (in fact the externally studentised one has a t- distribution) Thus, studentised residuals should be between -2 and 2 with approximately 95% probability.
12/17/ lecture 1113 Studentized residuals (4) Calculating in R: library(MASS) # load the MASS library stdres(educ.lm) #internally studentised (standardised in R) studres(educ.lm) #externally studentised (studentised in R) > stdres(educ.lm)[50] > studres(educ.lm)[50]
What does studentised mean? 12/17/ lecture 1114
12/17/ lecture 1115 Recognizing outliers If a point is a low influence outlier, the residual will usually be large, so large residual and a low HMD indicates an outlier If a point is a high leverage outlier, then a large error usually will cause a large residual. However, in extreme cases, a high leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD, and the residual is not particularly big, we can’t always tell if a point is an outlier or not.
High-leverage outlier 12/17/ lecture 1116 Small residual!
12/17/ lecture 1117 Leverage-residual plot Can plot standardised residuals versus leverage (HMD’s): the leverage-residual plot (LR plot) Point 50 is high leverage, big residual, is an outlier plot(educ.lm, which=5)
Interpreting LR plots 12/17/ lecture 1118 leverage Standardised residualStandardised residual High leverage outlier High leverage, outlier Possible high leverage outlier OK Low leverage outlier 3p/n
12/17/ lecture 1119 Residuals and HMD’s No big studentized residuals, no big HMD’s (3p/n = 0.2 for this example)
12/17/ lecture 1120 Residuals and HMD’s (2) One big studentized residual, no big HMD’s (3p/n = 0.2 for this example). Line moves a bit Point 24
12/17/ lecture 1121 Residuals and HMD’s (3) No big studentized residual, one big HMD, pt 1. (3p/n = 0.2 for this example). Line hardly moves.Pt 1 is high leverage but not influential. Point 1
12/17/ lecture 1122 Residuals and HMD’s (4) One big studentized residual, one big HMD, both pt 1. (3p/n = 0.2 for this example). Line moves but residual is large. Pt 1 is influential Point 1
12/17/ lecture 1123 Residuals and HMD’s (5) No big studentized residuals, big HMD, 3p/n= 0.2. Point 1 is high leverage and influential Point 1
12/17/ lecture 1124 Influential points How can we tell if a high-leverage point/outlier is affecting the regression? By deleting the point and refitting the regression: a large change in coefficients means the point is affecting the regression Such points are called influential points Don’t want analysis to be driven by one or two points
12/17/ lecture 1125 “Leave-one out” measures We can calculate a variety of measures by leaving out each data point in turn, and looking at the change in key regression quantities such as Coefficients Fitted values Standard errors We discuss each in turn
12/17/ lecture 1126 Example: education data With point 50Without point 50 Const percap under
12/17/ lecture 1127 Standardized difference in coefficients: DFBETAS Formula: Problem when: Greater than 1 in absolute value This is the criterion coded into R
12/17/ lecture 1128 Standardized difference in fitted values: DFFITS Formula: Problem when: Greater than (p/(N-p) ) in absolute value (p=number of regression coefficients)
12/17/ lecture 1129 COV RATIO & Cooks D Cov Ratio: Measures change in the standard errors of the estimated coefficients Problem indicated: when Cov Ratio more than 1+3p/n or less than 1-3p/n Cook’s D Measures overall change in the coefficients Problem when: More than qf(0.50, p,n-p) (lower 50% of F distribution), roughly 1 in most cases
12/17/ lecture 1130 Calculations > influence.measures(educ.lm) Influence measures of lm(formula = educ ~ under18 + percap, data = educ.df) dfb.1. dfb.un18 dfb.prcp dffit cov.r cook.d hat inf e * e * e * > p=3, n=50, 3p/n=0.18, 3 (p/(n-p)) =0.758, qf(0.5,3,47)=
12/17/ lecture 1131 Plotting influence # set up plot window with 2 x 4 array of plots par(mfrow=c(2,4)) # plot dffbetas, dffits, cov ratio, # cooks D, HMD’s influenceplots(educ.lm)
12/17/ lecture 1132
12/17/ lecture 1133 Remedies for outliers Correct typographical errors in the data Delete a small number of points and refit (don’t want fitted regression to be determined by one or two influential points) Report existence of outliers separately: they are often of scientific interest Don’t delete too many points (1 or 2 max)
12/17/ lecture 1134 Summary: Doing it in R LR plot: plot(educ.lm, which=5) Full diagnostic display plot(educ.lm) Influence measures: influence.measures(educ.lm) Plots of influence measures par(mfrow=c(2,4)) influenceplots(educ.lm)
12/17/ lecture 1135 HMD Summary Hat matrix diagonals Measure the effect of a point on its fitted value Measure how outlying the x-values are (how “high –leverage” a point is) Are always between 0 and 1 with bigger values indicating high leverage Points with HMD’s more than 3p/n are considered “high leverage”