Published by David Norton. Modified over 9 years ago.
1
Least-Squares Regression: Prediction, Outliers, Influential Points and Extrapolation
Section 3.3, Part II
2
What You’ll Learn:
How to use the least-squares regression line (LSRL) for prediction
Why we need to be cautious when predicting outside our original data
How to spot an “outlier”
The effect of an “influential point”
3
Using the LSRL for prediction
Once we have determined that we have a useful model, we tend to use it for prediction. Let’s consider a new set of data: the distance (in miles) and the airfare for twelve destinations from Baltimore, Maryland. A scatterplot is included.
4
Traveling from Baltimore
Based on this scatterplot, does it seem that knowing the distance to a destination would be useful for predicting the airfare?
5
Should we use a linear model?
The scatterplot looks promising.
Let’s check out the correlation coefficient and a plot of the residuals.
r = 0.795, r² = 0.632
The residual plot does not appear to have a pattern, so it looks like we can use our linear model.
Simple linear regression results:
Dependent Variable: Airfare
Independent Variable: Distance
Airfare = 83.26736 + 0.11737509 Distance
Sample size: 12
R (correlation coefficient) = 0.795
R-sq = 0.63200194
Estimate of error standard deviation: 37.827023
6
Using the model for prediction
LSRL for Distance vs Airfare
Airfare = 83.26736 + 0.11737509(Distance)
When we use our regression equation for prediction, remember that we are finding the “average” response value for a particular explanatory value.
This means that our predicted values will not always agree with actual observed values. We will under-predict for some and over-predict for others.
To see this in action, let’s consider predicting for one of the distances we used to compute the LSRL.
7
Prediction
Consider flying from Baltimore to Atlanta, 576 miles. This point is circled on both the fitted line plot and the residual plot.
The actual (observed) airfare for this flight is $178.00.
Our line predicts
Airfare = 83.26736 + 0.11737509(576) = $150.88
The residual = observed - predicted = 178.00 - 150.88 = $27.12
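The Atlanta calculation can be reproduced in a few lines of Python. This is a minimal sketch: the coefficients and the observed fare come from the slides, while the function name `predict_airfare` is chosen here purely for illustration.

```python
# Coefficients of the LSRL reported on the slides: Airfare = a + b * Distance
INTERCEPT = 83.26736
SLOPE = 0.11737509


def predict_airfare(distance_miles):
    """Return the predicted (average) airfare in dollars for a given distance."""
    return INTERCEPT + SLOPE * distance_miles


observed = 178.00                      # actual Baltimore-to-Atlanta fare
predicted = predict_airfare(576)       # Atlanta is 576 miles away
residual = observed - predicted        # residual = observed - predicted

print(f"predicted: ${predicted:.2f}")  # about $150.88
print(f"residual:  ${residual:.2f}")   # about $27.12
```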
8
Prediction
Notice these three things:
The actual point is above the fitted line
The residual is above the “zero” line
The value of the residual is positive
All these indicate that our prediction line under-predicts this particular airfare.
9
Over and Under Predictions
Over-predictions:
Point lies below the regression line
Residual lies below the “zero” line
Value of the residual is negative
Under-predictions:
Point lies above the regression line
Residual lies above the “zero” line
Value of the residual is positive
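The sign rule above can be written as a small helper. This is only a sketch; the function name `classify_prediction` is hypothetical, and the example fares are the Atlanta and San Francisco figures quoted elsewhere in the deck.

```python
def classify_prediction(observed, predicted):
    """Describe a prediction from the sign of its residual (observed - predicted)."""
    residual = observed - predicted
    if residual > 0:
        return "under-prediction"   # point lies above the regression line
    if residual < 0:
        return "over-prediction"    # point lies below the regression line
    return "exact"                  # point lies on the regression line


# Atlanta: observed $178.00, predicted about $150.88
print(classify_prediction(178.00, 150.88))   # under-prediction
# San Francisco: observed $198.00, predicted about $416.85
print(classify_prediction(198.00, 416.85))   # over-prediction
```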
10
Prediction Errors
Why does this happen?
A relationship between two variables does NOT indicate that the explanatory variable causes changes in the response variable; it only describes the relationship between them.
In this case, r² = 0.63, which means that about 63% of the variation we see in airfare can be explained by the variation we see in distance traveled.
This means that about 37% of the variation in price has yet to be explained. In other words, we may want to explore other variables that may affect the cost of the ticket, such as type of airport, season, etc.
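The explained and unexplained shares follow directly from the reported correlation coefficient. A short sketch of that arithmetic, using only r = 0.795 from the regression output (the variable names are illustrative):

```python
r = 0.795                       # correlation coefficient reported on the slides

r_squared = r ** 2              # share of variation in airfare explained by distance
unexplained = 1 - r_squared     # share left for other variables to explain

print(f"explained:   {r_squared:.1%}")    # about 63.2%
print(f"unexplained: {unexplained:.1%}")  # about 36.8%
```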
11
So should we predict?
As long as we recognize that our predictions are an average response value for a given value of the explanatory variable, we will have some valuable information.
Let’s use our model to predict for a destination that is 900 miles from Baltimore.
Airfare = 83.26736 + 0.11737509(900) = $188.90
This means that if we were looking for a 900-mile flight, we would expect to pay about $188.90.
Which means that if we find a flight for $200.00, we might keep looking!
12
Another Prediction
What about the airfare from Baltimore to San Francisco, which is 2842 miles away?
Airfare = 83.26736 + 0.11737509(2842) = $416.85
OK, so that’s reasonable, right?
Well, in 1998, when this data was gathered, a flight from Baltimore to San Francisco cost only $198.00! So, although we expect some error, this is much more than we are willing to accept!
Why are we so far off?
13
Predicting outside our data
Consider the distances we used to create the model. They range from New York at 189 miles to Denver at 1502 miles.
Our prediction for the flight to San Francisco assumes that the same relationship continues even though this distance is almost twice as far as Denver. We have no way of knowing whether this relationship stays the same outside the domain of the original data. Prediction outside this domain is called “extrapolation”. This type of prediction is dangerous and should be avoided.
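One way to guard against accidental extrapolation is to check each requested distance against the range of the original data before predicting. The sketch below assumes the 189 to 1502 mile range and the coefficients given on the slides; the function name `predict_with_range_check` and the choice to warn rather than refuse are illustrative decisions, not part of the original presentation.

```python
import warnings

INTERCEPT = 83.26736
SLOPE = 0.11737509
MIN_DISTANCE = 189    # New York, shortest distance in the data
MAX_DISTANCE = 1502   # Denver, longest distance in the data


def predict_with_range_check(distance_miles):
    """Predict airfare, warning when the distance is outside the observed range."""
    if not MIN_DISTANCE <= distance_miles <= MAX_DISTANCE:
        warnings.warn(
            f"{distance_miles} miles is outside the observed range "
            f"({MIN_DISTANCE} to {MAX_DISTANCE} miles); this is extrapolation."
        )
    return INTERCEPT + SLOPE * distance_miles


print(f"${predict_with_range_check(900):.2f}")    # inside the range, about $188.90
print(f"${predict_with_range_check(2842):.2f}")   # warns, then prints about $416.85
```

The warning does not make the San Francisco prediction any more accurate; it only flags that the number rests on an assumption the data cannot support.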
14
Unusual Points (Outliers & Influential Points)
Outliers are pieces of data that do not fit the overall pattern.
If a point lies far away from the regression line in the y-direction, it will have a large residual (either positive or negative).
Consider the following data, which shows the relationship between the age (in months) at which a child first speaks and their subsequent score on a test of mental ability, the Gesell score.
15
Unusual Points
Notice the circled point.
This point is far from the regression line, so it has a very large residual. Care should be taken to ensure that we have recorded this point correctly.
16
Unusual Points in the y-direction
Consider the regression analysis with this point included.
Simple linear regression results:
Dependent Variable: Score
Independent Variable: Age
Score = 109.87384 - 1.1269889 Age
Sample size: 21
R (correlation coefficient) = -0.6403
R-sq = 0.40997127
Estimate of error standard deviation: 11.022908
Now consider the analysis without this point.
Simple linear regression results:
Dependent Variable: Score
Independent Variable: Age
Score = 109.30468 - 1.1933107 Age
Sample size: 20
R (correlation coefficient) = -0.7561
R-sq = 0.571631
Estimate of error standard deviation: 8.628196
Notice that although the y-intercept and slope changed slightly, the biggest change occurred in the value of the correlation coefficient, “r”, and thus “r²”. An outlier in the y-direction will weaken the strength of the linear relationship.
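To see which quantity moved most, the two regression outputs on this slide can be compared side by side. The sketch below only restates the reported numbers and computes their relative change; the dictionary names are illustrative.

```python
# Values copied from the two regression outputs on this slide
with_point = {"intercept": 109.87384, "slope": -1.1269889, "r": -0.6403}
without_point = {"intercept": 109.30468, "slope": -1.1933107, "r": -0.7561}

# Relative change in each quantity when the outlier is removed
relative_change = {
    key: abs((without_point[key] - with_point[key]) / with_point[key])
    for key in with_point
}

for key, change in relative_change.items():
    print(f"{key:>9}: {with_point[key]:.4f} -> {without_point[key]:.4f} "
          f"({change:.1%} change)")
```

The intercept moves by roughly 0.5% and the slope by roughly 6%, while r changes by roughly 18%, which matches the observation that the correlation is the quantity most affected.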
17
Unusual Points in the x-direction
Again consider the data set for age vs score, and notice that a second unusual point exists. However, this point is extreme in the x-direction.
Notice that this point is close to the regression line and will not have a large residual.
18
Unusual Points in the x-direction
So how does a point like this affect the regression?
Consider the regression analysis with this point included.
Simple linear regression results:
Dependent Variable: Score
Independent Variable: Age
Score = 109.87384 - 1.1269889 Age
Sample size: 21
R (correlation coefficient) = -0.6403
R-sq = 0.40997127
Estimate of error standard deviation: 11.022908
Now consider the analysis without this point.
Simple linear regression results:
Dependent Variable: Score
Independent Variable: Age
Score = 105.62987 - 0.77922076 Age
Sample size: 20
R (correlation coefficient) = -0.3349
R-sq = 0.112162925
Estimate of error standard deviation: 11.106756
Notice that the LSRL y-intercept and slope change a great deal when this point is removed and the regression is repeated. Let’s look at scatterplots of both scenarios.
19
How an unusual point in the x-direction affects the LSRL
When we removed the point, our line changed a great deal. When a point has this type of effect on an LSRL, we call it an “influential point”.
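The difference between an ordinary outlier and an influential point shows up clearly in how much the slope moves when each point is removed. The sketch below uses the three slopes reported in the regression outputs earlier in the deck; the helper name `relative_slope_change` is hypothetical.

```python
slope_all_points = -1.1269889          # all 21 children
slope_without_y_outlier = -1.1933107   # point with the large residual removed
slope_without_x_extreme = -0.77922076  # point extreme in the x-direction removed


def relative_slope_change(new_slope, old_slope=slope_all_points):
    """Fractional change in the slope after removing a point."""
    return abs((new_slope - old_slope) / old_slope)


print(f"{relative_slope_change(slope_without_y_outlier):.1%}")   # about 5.9%
print(f"{relative_slope_change(slope_without_x_extreme):.1%}")   # about 30.9%
```

Removing the point that is extreme in the x-direction changes the slope about five times as much as removing the y-direction outlier, even though its own residual is small.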
20
Additional Resources
The Practice of Statistics (YMM), pp. 151-159
The Practice of Statistics (YMS), pp. 167-173
21
What you learned
How to use the LSRL for prediction
Why we need to be cautious when predicting outside our original data
How to spot an “outlier”
The effect of an “influential point”