Outliers and Influential Points Erik Johnson AP Statistics 5/25/04 erik.PPT
Definitions
Outlier: A value in a set of data that does not fit with the rest of the data
Influential point: A point in a data set that has leverage on the regression line, so that it strongly affects the regression coefficients (slope and intercept)
Leverage: A point is said to have leverage if the regression line changes substantially when the point is removed
Q1, Q3: The first and third quartiles; approximately half of the data lies between them
Interquartile range (IQR): Q3 - Q1
Outliers
An outlier can be described in several ways:
Data points more than 2 standard deviations away from the mean of the data set
Data points that do not fit the pattern governed by the rest of the data
In regression, any data point that has an unusually large residual
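The 2-standard-deviation description can be sketched in a few lines of Python. This is only an illustration: the function name and the `scores` list are made up for the example, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def two_sd_outliers(values):
    """Return the values lying more than 2 standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - center) > 2 * spread]

scores = [10, 12, 11, 13, 12, 11, 30]  # hypothetical data, not from the slides
flagged = two_sd_outliers(scores)
print(flagged)  # [30]
```

Note that an extreme value inflates the standard deviation it is measured against, so in small data sets this rule can miss points that the IQR rule would flag.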
How can I tell if a point in my data set is an outlier?
Take the IQR (interquartile range) of your data set and multiply it by 1.5.
Subtract that number from Quartile 1 (Q1) and add it to Quartile 3 (Q3).
Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR can be considered an outlier.
Now you try a sample problem on outliers!
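As a rough sketch, the 1.5 x IQR rule can be applied to raw data as follows. The function name and the `data` list are hypothetical. The quartiles come from Python's `statistics.quantiles`, whose default convention may give slightly different Q1 and Q3 than a graphing calculator or textbook.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return (lower fence, upper fence, flagged values) using the 1.5 x IQR rule."""
    q1, _median, q3 = quantiles(values, n=4)  # quartile convention is an assumption
    step = 1.5 * (q3 - q1)
    lower_fence = q1 - step  # subtract from Q1
    upper_fence = q3 + step  # add to Q3
    flagged = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, flagged

data = [2, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data, not from the slides
lower_fence, upper_fence, flagged = iqr_outliers(data)
print(lower_fence, upper_fence, flagged)  # -2.5 15.5 [30]
```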
Sample Problem on IQR
In a data set with five-number summary [12, 18, 19, 21, 25], how many values can be considered outliers?
A) None
B) Exactly 1
C) At least 1
D) Exactly 2
E) At least 2
IF YOU ANSWERED C, YOU'RE RIGHT!
The interquartile range for this set of data is 21 - 18 = 3, and when multiplied by 1.5 you get 4.5.
Adding this number to Q3 = 21 gives you 25.5, which is larger than the maximum value of 25. This means that there are no outliers on the upper side of the data.
When you subtract 4.5 from Q1 = 18, you get 13.5. The minimum value of 12 is below this number, meaning that there is at least 1 outlier in the set of data.
The five-number summary does not tell us whether any other values also fall below 13.5, so the answer is "at least 1" rather than "exactly 1".
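The arithmetic in this answer can be checked with a short sketch. The variable names are chosen for the example; the numbers are the five-number summary from the problem.

```python
# Five-number summary from the sample problem
minimum, q1, median, q3, maximum = 12, 18, 19, 21, 25

iqr = q3 - q1                  # 21 - 18 = 3
lower_fence = q1 - 1.5 * iqr   # 18 - 4.5 = 13.5
upper_fence = q3 + 1.5 * iqr   # 21 + 4.5 = 25.5

low_outlier_exists = minimum < lower_fence    # 12 < 13.5
high_outlier_exists = maximum > upper_fence   # 25 > 25.5 is false

print(lower_fence, upper_fence)                 # 13.5 25.5
print(low_outlier_exists, high_outlier_exists)  # True False
```

Only the minimum and maximum are known individually, so the check can confirm that an outlier exists on the low side but cannot count how many there are.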
Influential Points
Influential points are normally outliers in the X direction, but are not always outliers in terms of regression.
A point is said to be influential if it is responsible for changes to the least-squares regression (LSR) line.
Any point that has leverage on a set of data is an influential point.
To the right is a chart of a data set with a perfect linear fit: r^2 = 1 and a regression equation of Y = X.
There are no outliers in either the X or Y direction.
Now look at this graph.
The point whose X value was previously 5 has been moved to X = 8.
The regression equation has changed and the r^2 value has significantly decreased.
Watch how the regression line changes as the point (8,5) is added.
The point (8,5) is an influential point in this data set.
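The effect of moving the point can be reproduced with a small least-squares sketch. The slides do not list the original data, so the five points (1,1) through (5,5) below are an assumption chosen to match a fit of Y = X with r^2 = 1. The function name is also made up for the example.

```python
def least_squares(points):
    """Return (slope, intercept, r_squared) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Assumed original data: a perfect fit on the line Y = X
original = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
# Same data with the point at X = 5 moved to X = 8
moved = original[:-1] + [(8, 5)]

for label, points in (("original", original), ("moved", moved)):
    slope, intercept, r_squared = least_squares(points)
    print(f"{label}: Y = {slope:.3f}X + {intercept:.3f}, r^2 = {r_squared:.3f}")
# original: Y = 1.000X + 0.000, r^2 = 1.000
# moved: Y = 0.548X + 1.027, r^2 = 0.877
```

Under this assumed data, moving a single point roughly halves the slope, which is the kind of change that marks (8,5) as influential.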
Sample Problem on Influential Points
Given the plot below, which of the following can you conclude about the data point in the upper right-hand corner?
A) It is an outlier in the regression
B) It is an influential point
C) It does not fit the pattern of the data
D) It has a large residual
E) All of the above
The correct answer is B: it is an influential point.
Explanation for Sample Question
Since the data point in question seems to fit the general pattern of the other observations in the data set, there is no evidence to call it an outlier in terms of regression.
Likewise, it will not have a large residual when an LSR line is fit to the data.
This data point IS an influential point, because it has an X value differing greatly from the others in the set.
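One way to put numbers on this explanation is to compute each point's residual together with its leverage value. The slides define leverage only informally; the formula used below, h = 1/n + (x - mean of x)^2 / Sxx, is the standard leverage for a one-predictor regression and is brought in here as an assumption. The data are hypothetical, built so that the last point follows the pattern but sits far to the right.

```python
def residuals_and_leverages(xs, ys):
    """Fit an LSR line; return (residuals, leverages) for each point."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Standard one-predictor leverage formula (not taken from the slides)
    leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    return residuals, leverages

# Hypothetical data: the last point fits the pattern but has an extreme X value
xs = [1, 2, 3, 4, 5, 15]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

residuals, leverages = residuals_and_leverages(xs, ys)
for x, res, lev in zip(xs, residuals, leverages):
    print(f"x = {x:2d}  residual = {res:6.3f}  leverage = {lev:.3f}")
```

In this made-up data set the last point has a residual near zero and a leverage above 0.9, which matches the explanation: it is not an outlier in the regression, yet its X value sets it apart from the rest.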
THE END