Published by Alexander Stafford. Modified over 6 years ago.
1
Residuals, Residual Plots, Coefficient of determination, & Influential points
2
Residuals
In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.
Definition: A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = observed y - predicted y = y - ŷ
Points above the line have positive residuals; points below the line have negative residuals.
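As a minimal sketch of the definition above, the following Python computes residuals for a handful of points. The data and the line ŷ = 1 + 2x are hypothetical values chosen for illustration; they are not from the presentation.

```python
# Hypothetical data and a hypothetical fitted line y-hat = 1 + 2x (illustration only).
xs = [1, 2, 3, 4]
ys = [3.5, 4.5, 7.0, 9.5]


def predict(x):
    """Predicted y (y-hat) from the assumed line."""
    return 1.0 + 2.0 * x


# residual = observed y - predicted y
residuals = [y - predict(x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    side = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x}: observed={y}, predicted={predict(x)}, residual={e:+.1f} ({side} the line)")
```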
3
Residuals (error) - Think AP: Actual y – predicted y
A residual is the vertical deviation between an observation and the LSRL.
The sum of the residuals from the LSRL is always zero.
error = observed - predicted
Think AP: Actual y - Predicted y
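The zero-sum property can be checked numerically. This sketch fits a least-squares line to hypothetical data (the numbers and the helper name `lsrl` are assumptions for illustration) and adds up the residuals.

```python
def lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)

# Zero up to floating-point rounding.
print(f"sum of residuals = {total:.12f}")
```

The sum is zero for any data set when the line is the LSRL, because the least-squares line always passes through the point (x̄, ȳ).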
4
Residual Plots Definition: A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
5
Residual plot A scatterplot of the (x, residual) pairs.
Residuals can also be graphed against quantities other than x (for example, the predicted values).
The purpose is to tell whether a linear association exists between the x and y variables, that is, how well a regression line fits the data.
6
Residual plot: If the points in the residual plot show no obvious pattern (no curves or fanning), a linear model is appropriate. The residuals should also be relatively small in size.
7
[Figure panels: Linear; Not linear]
8
Linear model not appropriate
Interpreting Residual Plots
A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns.
The residual plot should show no obvious patterns.
The residuals should be relatively small in size.
A pattern in the residuals means a linear model is not appropriate.
You can't just look at the correlation coefficient r!
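To illustrate why r alone is not enough, here is a sketch with hypothetical, deliberately curved data (y = x² for x = 1 to 10; not from the presentation). The correlation is high, yet the residual signs follow a clear curved pattern.

```python
import math


def lsrl(xs, ys):
    """Return (intercept a, slope b, correlation r) for the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / math.sqrt(sxx * syy)


# Hypothetical curved data: y = x^2 for x = 1..10.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

a, b, r = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Signs of the residuals in order of increasing x.
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"r = {r:.3f}")
print("residual signs by increasing x:", signs)
```

Positive residuals at both ends and negative residuals in the middle form the kind of curve that signals a linear model is not appropriate, even though r is close to 1.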
9
Does the formula look familiar?
Standard Deviation of the Residuals (s)
Definition: If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by
s = sqrt( Σ(residuals)² / (n - 2) ) = sqrt( Σ(y - ŷ)² / (n - 2) )
This value gives the approximate size of a "typical" or "average" prediction error (residual).
Does the formula look familiar?
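A small sketch of this calculation, using hypothetical data and hypothetical helper names (`lsrl`, `residual_sd`); the divisor n - 2 is the standard one for the standard deviation of the residuals from a least-squares line.

```python
import math


def lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residual_sd(xs, ys):
    """Standard deviation of the residuals: sqrt(sum of squared residuals / (n - 2))."""
    a, b = lsrl(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


# Hypothetical data (illustration only).
xs = [0, 1, 2, 3]
ys = [1, 3, 2, 4]

s = residual_sd(xs, ys)
print(f"s = {s:.4f}")
```

For these four points the fitted line is ŷ = 1.3 + 0.8x, the squared residuals sum to 1.8, and s = sqrt(1.8 / 2).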
10
Graph the data in a scatterplot. Find the LSRL:
One measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation. Is there a linear relationship between age and range of motion?
[Table: Age, Range of Motion]
Graph the data in a scatterplot.
Find the LSRL: Predicted range of motion = (age)
11
Find the predicted y’s:
[Table: Age, Range of Motion]
Predicted range of motion = (age)
Find the predicted y's, then find the residuals.
12
One measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation. Is there a linear relationship between age and range of motion?
Graph a residual plot.
[Figure: residual plot of residuals against x (age)]
Since there is no pattern in the residual plot, a linear model is appropriate for the relationship between age and range of motion.
13
Plot the residuals against the y-hats. How does this residual plot compare to the previous one?
[Figure: residual plot of residuals against y-hat]
14
Residual plots show the same pattern whether plotted against x or against y-hat.
Because ŷ is a linear function of x, plotting the residuals against ŷ only rescales the horizontal axis (and flips it left to right when the slope is negative).
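One way to check this claim without drawing anything is to compare the left-to-right order of the residuals in the two plots. The sketch below uses hypothetical data and hypothetical helper names; it is an illustration, not the presentation's own example.

```python
def lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residual_orders(xs, ys):
    """Return the slope and the residuals read left to right in each kind of plot."""
    a, b = lsrl(xs, ys)
    fitted = [a + b * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    by_x = [e for _, e in sorted(zip(xs, residuals))]
    by_fitted = [e for _, e in sorted(zip(fitted, residuals))]
    return b, by_x, by_fitted


# Hypothetical data with a positive slope, then the same y-values reversed (negative slope).
xs = [1, 2, 3, 4, 5]
b_up, up_by_x, up_by_fitted = residual_orders(xs, [2, 4, 5, 4, 5])
b_down, down_by_x, down_by_fitted = residual_orders(xs, [5, 4, 5, 4, 2])

print("positive slope, same order:", up_by_x == up_by_fitted)
print("negative slope, mirrored order:", down_by_fitted == down_by_x[::-1])
```

With a positive slope the two plots list the residuals in the same order; with a negative slope the plot against ŷ is the mirror image of the plot against x. Either way, any curve or fan shape is preserved.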
15
Find the standard deviation of the residuals:
[Table: Age, Range of Motion]
Predicted range of motion = (age)
Find the standard deviation of the residuals:
Use 1-variable statistics to find the standard deviation of the residuals: sx = 9.93
Why the difference?
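One likely reason for the difference is the divisor: one-variable statistics divide the sum of squared deviations by n - 1, while the standard deviation of the residuals divides by n - 2. The sketch below shows the gap on hypothetical data (the knee-surgery values are not reproduced here); `statistics.stdev` is Python's sample standard deviation with divisor n - 1.

```python
import math
import statistics


def lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

a, b = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)

s_one_var = statistics.stdev(residuals)   # divides by n - 1
s_regression = math.sqrt(sse / (n - 2))   # divides by n - 2

print(f"1-variable statistics: {s_one_var:.4f}")
print(f"regression s:          {s_regression:.4f}")
```

Because the residuals average to zero, the two values differ only by the factor sqrt((n - 2) / (n - 1)), so the one-variable result is always slightly smaller.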
16
Coefficient of determination-
Gives the approximate proportion of the variation in the values of y that can be accounted for by the least-squares regression line of y on x.
Remains the same no matter which variable is labeled x and which is labeled y.
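The second property can be checked directly. In this sketch (hypothetical data and helper names), swapping x and y changes the slope of the fitted line but leaves r² unchanged.

```python
def slope_and_r_squared(xs, ys):
    """Return (slope of the LSRL of y on x, coefficient of determination r^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_y_on_x, r2_y_on_x = slope_and_r_squared(xs, ys)
slope_x_on_y, r2_x_on_y = slope_and_r_squared(ys, xs)

print(f"y on x: slope = {slope_y_on_x:.2f}, r^2 = {r2_y_on_x:.2f}")
print(f"x on y: slope = {slope_x_on_y:.2f}, r^2 = {r2_x_on_y:.2f}")
```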
17
Interpretation of r²: Approximately r² (expressed as a percent) of the variation in y can be explained by the LSRL of y on x.
18
Least-Squares Regression
The Role of r² in Regression
The standard deviation of the residuals gives us a numerical estimate of the average size of our prediction errors. Another numerical quantity tells us how well the least-squares regression line predicts values of the response y.
Definition: The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate r² using the following formula:
r² = 1 - SSE/SST, where SSE = Σ(residual)² = Σ(y - ŷ)² and SST = Σ(y - ȳ)²
19
Least-Squares Regression
The Role of r² in Regression
r² tells us how much better the LSRL does at predicting values of y than simply guessing the mean of y for each value in the dataset. Consider the backpack example: if we needed to predict a backpack weight for a new hiker but didn't know the hiker's body weight, we could use the average backpack weight as our prediction.
If we use the mean backpack weight as our prediction, the sum of the squared residuals is SST = 83.87.
If we use the LSRL to make our predictions, the sum of the squared residuals is SSE = 30.90.
SSE/SST = 30.90/83.87 = 0.368, so 36.8% of the variation in pack weight is unaccounted for by the least-squares regression line.
r² = 1 - SSE/SST = 1 - 30.90/83.87 = 0.632, so 63.2% of the variation in backpack weight is accounted for by the linear model relating pack weight to body weight.
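The arithmetic in this example can be reproduced in a few lines. The sketch uses SST = 83.87 and SSE = 30.90 as stated for the two sums of squared residuals; the variable names are illustrative.

```python
# Sums of squared residuals from the backpack example.
sst = 83.87  # predicting with the mean backpack weight
sse = 30.90  # predicting with the LSRL

unexplained = sse / sst        # fraction of variation NOT accounted for by the LSRL
r_squared = 1 - unexplained    # fraction of variation accounted for by the LSRL

print(f"SSE/SST = {unexplained:.3f}")
print(f"r^2     = {r_squared:.3f}")
```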
20
How well does age predict the range of motion after knee surgery?
Approximately 30.6% of the variation in range of motion after knee surgery can be explained by the linear regression of range of motion on age.
21
Interpreting Computer Regression Output
A number of statistical software packages produce similar regression output. Be sure you can locate the slope b, the y-intercept a, and the values of s and r².
22
Correlation and regression must be interpreted with caution
Plot the data to be sure the relationship is roughly linear and to detect outliers and influential points.
Outlier: an observation that lies outside the overall pattern of the other observations. In a regression setting, an outlier is a data point with a large residual.
23
Influential point- A point that influences where the LSRL is located
If removed, it will significantly change the slope of the LSRL.
An influential point usually has a small residual (or a residual of 0).
24
Outliers and Influential points
25
One factor in the development of tennis elbow is the impact-induced vibration of the racket and arm at ball contact.
[Table: Racket, Resonance (Hz), Acceleration (m/sec/sec)]
Sketch a scatterplot of these data.
Calculate the LSRL and the correlation coefficient.
Does there appear to be an influential point? If so, remove it and then calculate the new LSRL and correlation coefficient.
26
Predicted acceleration = 42.37 - 0.06(resonance), r = -0.775, r² = 60.1%
(189, 30) could be influential. Remove it and recalculate the LSRL.
27
Predicted acceleration = 38.81 - 0.033(resonance), r = -0.174, r² = 3%
(189, 30) was influential: removing it changed the slope of the LSRL and dropped r² from 60.1% to 3%.
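The same remove-and-refit check can be sketched in code. The racket data are not reproduced here, so the points below are hypothetical: five clustered points with almost no linear relationship plus one point far out in the x direction.

```python
import math


def fit(xs, ys):
    """Return (intercept a, slope b, correlation r) for the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / math.sqrt(sxx * syy)


# Hypothetical data; the last point (20, 15) is the suspected influential point.
xs = [1, 2, 3, 4, 5, 20]
ys = [5, 3, 6, 4, 5, 15]

a_with, b_with, r_with = fit(xs, ys)
a_without, b_without, r_without = fit(xs[:-1], ys[:-1])

# Residual of the suspected point from the line fitted WITH it included.
residual_of_point = ys[-1] - (a_with + b_with * xs[-1])

print(f"with point:    slope = {b_with:.3f}, r = {r_with:.3f}")
print(f"without point: slope = {b_without:.3f}, r = {r_without:.3f}")
print(f"residual of (20, 15) = {residual_of_point:.3f}")
```

In this made-up case the single far-out point pulls the line toward itself, so its own residual is small even though removing it changes the slope and the correlation substantially.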
28
Which of these measures are resistant?
LSRL
Correlation coefficient
Coefficient of determination
NONE: all are affected by outliers.
29
Correlation and Regression Wisdom
Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, be aware of their limitations.
1. The distinction between explanatory and response variables is important in regression.
30
4. Association does not imply causation.
2. Correlation and regression lines describe only linear relationships.
3. Correlation and least-squares regression lines are not resistant.
4. Association does not imply causation.
31
Association Does Not Imply Causation
A serious study once found that people with two cars live longer than people who own only one car. Owning three cars is even better, and so on. There is a substantial positive correlation between number of cars x and length of life y. Why?
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
32
The table shows the tuition rates from the U of A for the years 2002-2009.
Year  Tuition
2002  $4228
2003  $4768
2004  $5179
2005  $5495
2006  $5808
2007  $6038
2008  $6299
2009  $6459
Make a scatterplot of the data.
33
Find the correlation coefficient and describe the relationship.
r = 0.9861. There is a strong, positive, linear relationship between tuition and year at the U of A.
Find the LSRL: Predicted tuition = (year)
Interpret the slope: For each 1-year increase, U of A tuition goes up by an average of $
Find the coefficient of determination and interpret it in the context of the problem: r² = 97.2%. About 97.2% of the variation in tuition can be explained by the linear relationship between tuition and year at the U of A.
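The fit can be recomputed from the tuition table. The sketch below is one way to do it in plain Python; the slope and intercept it prints come from this calculation, and the helper name `fit` is an assumption for illustration.

```python
import math


def fit(xs, ys):
    """Return (intercept a, slope b, correlation r) for the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / math.sqrt(sxx * syy)


# Tuition data from the table (year, dollars).
years = [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
tuition = [4228, 4768, 5179, 5495, 5808, 6038, 6299, 6459]

intercept, slope, r = fit(years, tuition)
r_squared = r ** 2

print(f"predicted tuition = {intercept:.1f} + {slope:.2f}(year)")
print(f"r = {r:.4f}, r^2 = {r_squared:.1%}")
```

The printed r and r² agree with the values quoted above, and the slope is the average yearly increase in tuition, in dollars, over this period.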
34
Make a residual plot of (x, residuals) and of (ŷ, residuals). Sketch and compare.
[Figure: residual plot of residuals against x (year)]
A linear model is not the best model: there is a definite curved pattern in the residual plot!
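The curved pattern can be seen in the residuals themselves. This sketch refits the least-squares line to the tuition data and prints each residual; years are shifted to start at 0, which is a convenience that leaves the residuals unchanged.

```python
# Tuition data from the table; x = years since 2002.
xs = list(range(8))
tuition = [4228, 4768, 5179, 5495, 5808, 6038, 6299, 6459]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(tuition) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, tuition))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# residual = observed tuition - predicted tuition
residuals = [y - (intercept + slope * x) for x, y in zip(xs, tuition)]

for x, e in zip(xs, residuals):
    print(f"{2002 + x}: residual = {e:+8.2f}")
```

The residuals are negative at the start, positive through the middle years, and negative again at the end: an arch shape, which is why the very high r² by itself does not justify a linear model here.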
35
Association does not imply causation!
Do turnovers increase scoring in the NBA? In the National Basketball Association, there is a strong positive association between the number of turnovers a player has and the number of points that he scores. A turnover is when a player loses the ball to the other team. Could a player increase his point totals by turning the ball over more frequently? No! Turning the ball over to the other team doesn’t cause a player to score more points. Instead, there is an important lurking variable that influences both variables: playing time. Players who are on the court more often tend to score more points and have more turnovers than players who don’t get much playing time.
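A toy simulation can make the lurking-variable argument concrete. Everything below is hypothetical: playing time is drawn at random, and points and turnovers each depend only on playing time plus independent noise, so neither causes the other. The rates 0.5 and 0.07 per minute are made-up values, and the noise can occasionally push a simulated total below zero.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
minutes, points, turnovers = [], [], []
for _ in range(2000):
    m = rng.uniform(5, 40)                           # lurking variable: playing time
    minutes.append(m)
    points.append(0.5 * m + rng.gauss(0, 2))         # depends only on minutes
    turnovers.append(0.07 * m + rng.gauss(0, 0.5))   # depends only on minutes

r_totals = pearson_r(turnovers, points)
r_rates = pearson_r(
    [t / m for t, m in zip(turnovers, minutes)],
    [p / m for p, m in zip(points, minutes)],
)

print(f"totals:     r = {r_totals:.2f}")
print(f"per minute: r = {r_rates:.2f}")
```

The totals are strongly correlated only because both grow with playing time. Converting to per-minute rates is one rough way to account for playing time, and in this simulation the association then largely disappears.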