Three Measures of Influence Lecture 16 Outline: Review of Lecture 15 Masking and Swamping Problems Three Measures of Influence 2/22/2019 ST3131, Lecture 16
Review of Lecture 15: Leverage, Influence, and Outliers
1. High leverage points (outliers in the predictor variables, X-direction): Observations with large leverage values $p_{ii}$ are called high leverage points. High leverage points are also called outliers in the predictor variables.
2. Outliers in the response variable (Y-direction): Observations whose standardized residuals exceed 2 or 3 in absolute value are usually called outliers.
3. Influential points: A point is an influential point if its deletion, singly or in combination with others (2 or 3), causes substantial changes in the fitted model (estimates, fitted values, t-tests, etc.).
Masking and Swamping Problems
Standardized residuals provide useful information for validating the linearity and normality assumptions and for identifying outliers. However, residual-based methods may fail to detect outliers and influential observations for the following reasons:
The presence of high leverage points
The ordinary residuals $e_i$ and the leverage values $p_{ii}$ satisfy the following relationship:
$$p_{ii} + \frac{e_i^2}{SSE} \le 1,$$
where $SSE$ is the residual sum of squares. This implies that high leverage points tend to have small residuals. Thus, methods based on standardized residuals may fail to detect outliers that are also high leverage points.
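To make the link between leverage and residuals concrete, here is a minimal sketch that checks the bound $p_{ii} + e_i^2/SSE \le 1$ numerically. It assumes a simple regression with one predictor, for which the leverage is $p_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$; the data and the helper names `fit_line` and `leverage_values` are hypothetical and are not taken from the lecture.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def leverage_values(x):
    """Leverage p_ii = 1/n + (x_i - xbar)^2 / Sxx for one predictor."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Hypothetical toy data: the last point lies far out in the X-direction.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 9.0]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
p_ii = leverage_values(x)

# Left-hand side of the bound p_ii + e_i^2 / SSE <= 1.
bound_lhs = [p + e ** 2 / sse for p, e in zip(p_ii, residuals)]

for i, (p, e, b) in enumerate(zip(p_ii, residuals, bound_lhs), start=1):
    print(f"obs {i}: p_ii={p:.3f}  e_i={e:+.3f}  p_ii + e_i^2/SSE={b:.3f}")
```

In this toy data the last observation has by far the largest leverage, so the bound leaves little room for its squared residual, which is why a residual-based check alone can miss it.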
The masking and swamping problems
Masking happens when we fail to detect some outliers that are hidden by other outliers. Swamping happens when we “detect” some non-outliers as outliers.
The above plots fail to detect Observation 5 as an outlier even though it is one: it is masked. Thus, other measures that can detect such outliers need to be defined.
Measures of Influence
The influence of an observation is measured by the effects it produces on the fit when it is deleted in the fitting process.
Let $\hat\beta_{0(i)}, \hat\beta_{1(i)}, \ldots, \hat\beta_{p(i)}$ denote the regression coefficients obtained when the $i$-th observation is deleted. Similarly, let $\hat y_{1(i)}, \ldots, \hat y_{n(i)}$ and $\hat\sigma^2_{(i)}$ denote the fitted values and the noise variance estimator obtained without the $i$-th observation.
Influence measures look at the differences produced in quantities such as $\hat\beta_j - \hat\beta_{j(i)}$ or $\hat y_j - \hat y_{j(i)}$.
Three measures will be defined in later slides.
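The deletion idea can be sketched in a few lines: refit the model once per observation with that observation left out, and compare the coefficients with the full fit. The example below assumes one predictor and uses made-up data in which five points lie exactly on the line y = x and a sixth point is far off; `fit_line` is a hypothetical helper, not something defined in the lecture.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Hypothetical data: points 1-5 lie on y = x, point 6 does not.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0]

b0, b1 = fit_line(x, y)

# Refit with the i-th observation deleted, for every i.
deleted_fits = []
for i in range(len(x)):
    x_del = x[:i] + x[i + 1:]
    y_del = y[:i] + y[i + 1:]
    deleted_fits.append(fit_line(x_del, y_del))

# Absolute change in the slope caused by deleting each observation.
slope_change = [abs(b1 - b1_i) for (_, b1_i) in deleted_fits]

print(f"full fit: b0={b0:.3f}, b1={b1:.3f}")
for i, ((b0_i, b1_i), ch) in enumerate(zip(deleted_fits, slope_change), start=1):
    print(f"without obs {i}: b0={b0_i:.3f}, b1={b1_i:.3f}, |slope change|={ch:.3f}")
```

Deleting the sixth observation restores the line y = x exactly, so it produces the largest change in the slope; deleting any of the other points changes the fit much less.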
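Cook’s Distance
Cook’s distance measures the influence of the $i$-th observation as
$$C_i = \frac{\sum_{j=1}^{n} (\hat y_j - \hat y_{j(i)})^2}{\hat\sigma^2 (p+1)},$$
which can be expressed as
$$C_i = \frac{r_i^2}{p+1} \cdot \frac{p_{ii}}{1-p_{ii}},$$
where $r_i$ is the $i$-th standardized residual and $p_{ii}$ is the $i$-th leverage value. This is a product of the squared standardized residual and the potential function $p_{ii}/(1-p_{ii})$ of the leverage value. The first factor is large when the $i$-th observation is an outlier, while the second factor is large when the $i$-th observation is a high leverage point.
It is suggested that observations with $C_i$ greater than the 50% point of the $F$ distribution with $p+1$ and $n-p-1$ degrees of freedom be classified as influential points. In practice, a dot plot or index plot of $C_i$ is used to flag influential points.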
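As a rough check on the two ways of writing Cook's distance, the sketch below computes it once by actually deleting each observation and refitting, and once from the standardized residual and leverage. It assumes the standard definitions, $C_i = \sum_j (\hat y_j - \hat y_{j(i)})^2 / (\hat\sigma^2 (p+1))$ and $C_i = \frac{r_i^2}{p+1}\cdot\frac{p_{ii}}{1-p_{ii}}$, a single predictor (p = 1), and made-up data; all function names are hypothetical.

```python
P = 1  # number of predictors in this simple-regression sketch


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def cooks_by_deletion(x, y):
    """C_i from the definition: refit without observation i."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma2 = sse / (n - P - 1)
    out = []
    for i in range(n):
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Squared shift of all n fitted values when observation i is dropped.
        shift = sum((fj - (a0 + a1 * xj)) ** 2 for xj, fj in zip(x, fitted))
        out.append(shift / (sigma2 * (P + 1)))
    return out


def cooks_closed_form(x, y):
    """C_i = r_i^2 / (p+1) * p_ii / (1 - p_ii), with no refitting."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(ei ** 2 for ei in e) / (n - P - 1)
    out = []
    for xi, ei in zip(x, e):
        p_ii = 1.0 / n + (xi - xbar) ** 2 / sxx
        r2 = ei ** 2 / (sigma2 * (1.0 - p_ii))  # squared standardized residual
        out.append(r2 / (P + 1) * p_ii / (1.0 - p_ii))
    return out


# Hypothetical toy data with one far-out point in the X-direction.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 9.0]

c_del = cooks_by_deletion(x, y)
c_cf = cooks_closed_form(x, y)
for i, (a, b) in enumerate(zip(c_del, c_cf), start=1):
    print(f"obs {i}: C_i by deletion={a:.4f}  closed form={b:.4f}")
```

The two columns agree up to rounding, and in this toy data the last observation has by far the largest value. The closed form is the practical choice because it needs only one fit.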
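Welsch and Kuh Measure
DFITS is defined as
$$DFITS_i = \frac{\hat y_i - \hat y_{i(i)}}{\hat\sigma_{(i)} \sqrt{p_{ii}}},$$
which can be written as
$$DFITS_i = r_i^* \sqrt{\frac{p_{ii}}{1-p_{ii}}},$$
where $r_i^*$ is the standardized residual computed with $\hat\sigma_{(i)}$ in place of $\hat\sigma$. When $\hat\sigma_{(i)}$ is replaced by $\hat\sigma$, the absolute value of this measure is equal to $\sqrt{(p+1) C_i}$.
Points with $|DFITS_i|$ greater than $2\sqrt{(p+1)/(n-p-1)}$ are usually classified as influential points. In practice, a dot plot or index plot of $DFITS_i$ is used to flag influential points.
$C_i$ and $DFITS_i$ are approximately monotone transformations of each other, and hence they give similar answers for detecting influential points.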
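The following sketch computes DFITS in two ways and applies the cutoff $2\sqrt{(p+1)/(n-p-1)}$ quoted on this slide. It assumes the standard definition $DFITS_i = (\hat y_i - \hat y_{i(i)}) / (\hat\sigma_{(i)}\sqrt{p_{ii}})$, the identity $SSE_{(i)} = SSE - e_i^2/(1-p_{ii})$ for the deleted residual sum of squares, a single predictor (p = 1), and made-up data; the function names are hypothetical.

```python
import math

P = 1  # number of predictors in this simple-regression sketch


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def dfits_by_deletion(x, y):
    """DFITS_i from the definition: refit without observation i."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    out = []
    for i in range(n):
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(x_del, y_del)
        sse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(x_del, y_del))
        sigma_del = math.sqrt(sse_del / (n - 1 - P - 1))
        p_ii = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        shift = (b0 + b1 * x[i]) - (a0 + a1 * x[i])
        out.append(shift / (sigma_del * math.sqrt(p_ii)))
    return out


def dfits_closed_form(x, y):
    """DFITS_i = r*_i * sqrt(p_ii / (1 - p_ii)), with no refitting."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    out = []
    for xi, ei in zip(x, e):
        p_ii = 1.0 / n + (xi - xbar) ** 2 / sxx
        sse_del = sse - ei ** 2 / (1.0 - p_ii)  # SSE without observation i
        sigma_del = math.sqrt(sse_del / (n - P - 2))
        r_star = ei / (sigma_del * math.sqrt(1.0 - p_ii))
        out.append(r_star * math.sqrt(p_ii / (1.0 - p_ii)))
    return out


# Hypothetical toy data with one far-out point in the X-direction.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 9.0]

d_del = dfits_by_deletion(x, y)
d_cf = dfits_closed_form(x, y)
cutoff = 2.0 * math.sqrt((P + 1) / (len(x) - P - 1))
flagged = [i for i, d in enumerate(d_cf) if abs(d) > cutoff]

print(f"cutoff = {cutoff:.3f}")
for i, (a, b) in enumerate(zip(d_del, d_cf), start=1):
    print(f"obs {i}: DFITS by deletion={a:+.4f}  closed form={b:+.4f}")
print("flagged observations (1-based):", [i + 1 for i in flagged])
```

Both computations agree up to rounding, and the far-out last observation exceeds the cutoff in this toy data.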
Hadi’s Influence Measure
As seen above, Cook’s distance and the Welsch and Kuh measure are multiplicative functions of the standardized residuals and the potential function. Hadi’s influence measure is instead a sum of the potential function and a scaled residual function, defined as
$$H_i = \frac{p_{ii}}{1-p_{ii}} + \frac{p+1}{1-p_{ii}} \cdot \frac{d_i^2}{1-d_i^2},$$
where $d_i = e_i/\sqrt{SSE}$ is the normalized residual. The first term is large for outliers in the X-direction (high leverage points), while the second term is large for outliers in the Y-direction. The index plot of $H_i$ is often used to detect influential points.
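A small sketch of Hadi's measure, keeping its two additive parts separate so that each can be inspected on its own. It assumes the standard form $H_i = \frac{p_{ii}}{1-p_{ii}} + \frac{p+1}{1-p_{ii}}\cdot\frac{d_i^2}{1-d_i^2}$ with $d_i = e_i/\sqrt{SSE}$, a single predictor (p = 1), and made-up data; `hadi_components` is a hypothetical helper name.

```python
P = 1  # number of predictors in this simple-regression sketch


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def hadi_components(x, y):
    """Return a list of (potential, residual_part, H_i) per observation."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    rows = []
    for xi, ei in zip(x, e):
        p_ii = 1.0 / n + (xi - xbar) ** 2 / sxx
        d2 = ei ** 2 / sse  # squared normalized residual d_i^2
        potential = p_ii / (1.0 - p_ii)
        residual_part = (P + 1) / (1.0 - p_ii) * d2 / (1.0 - d2)
        rows.append((potential, residual_part, potential + residual_part))
    return rows


# Hypothetical toy data with one far-out point in the X-direction.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 9.0]

rows = hadi_components(x, y)
h = [row[2] for row in rows]
for i, (pot, res, hi) in enumerate(rows, start=1):
    print(f"obs {i}: potential={pot:.3f}  residual part={res:.3f}  H_i={hi:.3f}")
```

In this toy data the last observation has the largest $H_i$, driven mainly by its potential term. Plotting the first component against the second gives the coordinates used by the potential-residual plot.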
The Potential-Residual Plot
The index plot of a single measure can be used to detect one kind of unusual observation. The potential-residual (P-R) plot can be used to detect two different kinds of unusual observations at once: high leverage points and outliers.
The P-R plot is obtained by plotting the potential function
$$\frac{p_{ii}}{1-p_{ii}}$$
against the scaled residual function
$$\frac{p+1}{1-p_{ii}} \cdot \frac{d_i^2}{1-d_i^2},$$
where $d_i = e_i/\sqrt{SSE}$ is the normalized residual.
It is clear that some observations may be flagged as high leverage points, outliers, or influential points. All these points should be carefully examined for accuracy (gross error, transcription error), relevancy (whether they belong to the data), and special significance (abnormal condition, unique situation). Points with high leverage that are not influential do not cause problems. Points with high leverage that are influential should be investigated.
Examples with MLR
The above examples are based on one response Y and one predictor variable (X4) for simplicity of presentation. In fact, the above results are valid for any number of predictor variables. For the New York Rivers Data, if all 4 predictor variables are included, we can draw and analyze the index plots in the same way.
This is the matrix plot for the New York Rivers Data. See Page 6 for the data description, and Page 10 of the textbook for the data.
These are residual plots.
These are the index plots of the influence measures.