Chapter 5 Lesson 5.3 Summarizing Bivariate Data

1 Chapter 5 Lesson 5.3 Summarizing Bivariate Data
5.3: Assessing the Fit of a Line

2 Assessing the fit of the LSRL
Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are:
Is the line an appropriate way to summarize the relationship between x and y?
Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

3 In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
x: 6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y: 0.00  6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
Predictor            Coef     SE Coef   T       P
Constant             -7.69    13.33     -0.58   0.582
Distance to debris   3.234    1.782     1.82    0.112
S =    R-Sq = 32.0%    R-Sq(adj) = 22.3%
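
As a rough check on the output above, the coefficients can be recomputed from the nine (x, y) pairs. This is a minimal sketch in plain Python; the function name `fit_lsrl` is a hypothetical helper, not something from the presentation.

```python
# Deer mouse data from the slide (distances in meters).
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]


def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = S_xy / S_xx; the line always passes through (x_bar, y_bar).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_lsrl(x, y)
print(f"y-hat = {intercept:.2f} + {slope:.3f} x")
```

The printed coefficients should agree with the Constant and Distance to debris rows of the output, up to rounding.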

4 Residuals
The vertical deviation between a point and the LSRL is called the residual. Residual = Actual – Predicted (think AP!). If the point is above the line, the residual will be positive. If the point is below the line, the residual will be negative.
[Scatterplot of the deer mouse data: distance traveled (y) versus distance to debris (x)]

5 Residual plots
A residual plot is a scatterplot of the (x, residual) pairs. Residuals can also be graphed against the predicted y-values, as (y-hat, residual) pairs. The purpose is to determine whether a linear model is an appropriate way to describe the relationship between the x and y variables. If the residual plot shows no pattern, then the linear model is appropriate.

6 This residual plot shows no pattern, so it indicates that the linear model is appropriate.
This residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.

7 Plot the residuals against the distance from debris (x)
Use the values in this table to create a residual plot for the deer mouse data set. Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?
Distance from debris (x)   Distance traveled (y)   Predicted distance traveled   Residual
6.94                       0.00                    14.76                         -14.76
5.23                       6.13                    9.23                          -3.10
5.21                       11.29                   9.16                          2.13
7.10                       14.35                   15.28                         -0.93
8.16                       12.03                   18.70                         -6.67
5.50                       22.72                   10.10                         12.62
9.19                       20.11                   22.04                         -1.93
9.05                       26.16                   21.58                         4.58
9.36                       30.65                   22.59                         8.06
The residuals sum to zero. This should remind students of the fact that the sum of the deviations from the mean is always zero.
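
The predicted values and residuals for this slide can be reproduced with a short script. This is only a sketch: it refits the line from the nine data pairs rather than using the rounded coefficients, so individual values may differ from the slide in the last digit. The variable names are illustrative.

```python
# Deer mouse data (distances in meters).
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residual = Actual - Predicted.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

# The two sets of points a residual plot can use.
pairs_vs_x = list(zip(x, residuals))
pairs_vs_predicted = list(zip(predicted, residuals))

for xi, yi, y_hat, e in zip(x, y, predicted, residuals):
    print(f"{xi:5.2f} {yi:6.2f} {y_hat:6.2f} {e:7.2f}")

# For a least squares line the residuals sum to zero (up to float error).
print("residuals sum to zero:", abs(sum(residuals)) < 1e-9)
```

Either list of pairs can be handed to any plotting tool to draw the residual plot.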

8 Now plot the residuals against the predicted distance traveled.
Since the residual plot against x displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food. Now plot the residuals against the predicted distance traveled.

9 What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values? Residual plots can be plotted against either the x-values or the predicted y-values.
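
One way to see why the two residual plots show the same scatter: each predicted value is a linear function of x, so switching the horizontal axis from x to y-hat only rescales and shifts it. The symbols a (intercept) and b (slope) follow the notation used later in the presentation.

```latex
\hat{y}_i = a + b\,x_i
\quad\Longrightarrow\quad
\hat{y}_i - \hat{y}_j = b\,(x_i - x_j)
```

When b is positive, as it is for the deer mouse data, the points keep their left-to-right order and relative spacing. A negative slope would give a mirror image of the plot.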

10 Getting the “Bends” (cont.)
The curved relationship between fuel efficiency and weight is more obvious in the plot of the residuals than in the original scatterplot:

11 Assessing the fit of the LSRL
Important questions are:
Is the line an appropriate way to summarize the relationship between x and y?
Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

12 Outliers, Leverage, and Influence
Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis. Any point that stands away from the others can be called an outlier and deserves your special attention.

13 Outliers, Leverage, and Influence (cont.)
The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…

14 Outliers, Leverage, and Influence (cont.)
The red line shows the effects that one unusual point can have on a regression:

15 Outliers, Leverage, and Influence (cont.)
A data point can also be unusual if its x-value is far from the mean of the x-values. Such points are said to have high leverage.

16 Outliers, Leverage, and Influence (cont.)
A point with high leverage has the potential to change the regression line. We say that a point is influential if omitting it from the analysis gives a very different model.

17 Outliers, Leverage, and Influence (cont.)

18 Outliers, Leverage, and Influence (cont.)
Warning: Influential points can hide in plots of residuals. Points with high leverage pull the line close to them, so they often have small residuals. You’ll see influential points more easily in scatterplots of the original data or by finding a regression model with and without the points.
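
The warning above can be illustrated by fitting a line with and without a suspect point. The sketch below uses a made-up toy data set (not from the presentation): five points lying exactly on y = x plus one high-leverage point far to the right. The helper `fit_lsrl` is a hypothetical name.

```python
def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Toy data invented for illustration only.
# The last point has an x-value far from the mean of x (high leverage).
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 0]

a_all, b_all = fit_lsrl(x, y)                  # fit with every point
a_without, b_without = fit_lsrl(x[:-1], y[:-1])  # fit omitting the last point

# Residuals from the fit that includes the high-leverage point.
residuals = [yi - (a_all + b_all * xi) for xi, yi in zip(x, y)]

print(f"slope with the point:    {b_all:.3f}")
print(f"slope without the point: {b_without:.3f}")
print(f"residual of the high-leverage point: {residuals[-1]:.3f}")
print(f"largest residual among the others:   {max(abs(e) for e in residuals[:-1]):.3f}")
```

In this toy case, omitting the point changes the slope from negative to exactly 1, so the point is influential. Its own residual is still smaller in size than the largest residual among the other points, which is how such a point can hide in a residual plot.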

19 Assessing the fit of the LSRL
Important questions are:
Is the line an appropriate way to summarize the relationship between x and y?
Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

20 Coefficient of determination
Denoted by r² or R². Gives the proportion or percentage of the variation in y that can be (1) attributed to the variation in x, or (2) accounted for by the LSRL model. Example (Target microwaves): R² = .6543 means that 65.43% of the differences in microwave prices are attributed to the differences in wattage.
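
For reference, a standard way to write this proportion is in terms of the residuals and the deviations of y from its mean. The summation notation below is an assumption of this note; the slide itself gives no formula.

```latex
R^2 \;=\; 1 \;-\; \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator measures the variation left over in the residuals, and the denominator measures the total variation in y, so the ratio is the fraction of variation the line does not account for.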

21 Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:

22 Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately boring:

23 Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well: The equation is [not reproduced in the transcript]. r = .83, so R-squared = ? 69% of the variation in total fat is accounted for by the model. Equivalently, 69% of the variation in total fat is accounted for by the variation in protein.
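
The 69% figure follows from squaring the correlation coefficient given on the slide:

```latex
R^2 = r^2 = (0.83)^2 = 0.6889 \approx 69\%
```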

24 How Big Should R² Be?
R² is always between 0% and 100%. 1 – R² is the percentage of the variation left in the residuals. The standard deviation of the residuals (s_e) can give us more information about the usefulness of the regression by telling us how much scatter there is around the line.
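
Both quantities can be computed directly from the residuals. The sketch below uses the deer mouse data from earlier in the presentation and assumes the usual definition of s_e with n - 2 in the denominator, which the slide does not spell out. Variable names are illustrative.

```python
import math

# Deer mouse data (distances in meters).
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

ss_resid = sum(e ** 2 for e in residuals)        # variation left in the residuals
ss_total = sum((yi - y_bar) ** 2 for yi in y)    # total variation in y

r_squared = 1 - ss_resid / ss_total
# Assumed definition: s_e = sqrt(SSResid / (n - 2)).
s_e = math.sqrt(ss_resid / (n - 2))

print(f"R-squared = {r_squared:.1%}")
print(f"1 - R-squared = {1 - r_squared:.1%}")
print(f"s_e = {s_e:.2f} meters")
```

By this calculation the typical deviation from the line is roughly 8.7 meters, which is large relative to distances that range from 0 to about 31 meters. That is consistent with the modest R² for these data.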

25 Let’s review the values from this output and their meanings.
Partial output from the regression analysis of deer mouse data:
Predictor            Coef     SE Coef   T       P
Constant             -7.69    13.33     -0.58   0.582
Distance to debris   3.234    1.782     1.82    0.112
S =    R-sq = 32.0%    R-sq(adj) = 22.3%
What does this number represent?
The y-intercept (a = -7.69): This value has no meaning in context since it doesn't make sense to have a negative distance.
The slope (b = 3.234): For every increase of 1 meter in the distance to the nearest debris pile, the predicted distance traveled to food increases by approximately 3.234 meters.
The standard deviation (s): This is the typical amount by which an observation deviates from the least squares regression line.
The coefficient of determination (R²): 32% of the variation in distance traveled to food is accounted for by the model.
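
A few of the remaining columns can be checked against each other. The sketch below assumes the usual conventions for this kind of output: T is Coef divided by SE Coef, and R-sq(adj) for a single predictor is 1 - (1 - R²)(n - 1)/(n - 2). The sample size n = 9 comes from counting the data pairs; `predict` is a hypothetical helper.

```python
# Values as printed in the regression output.
n = 9  # number of (x, y) pairs in the deer mouse data
coef = {"Constant": -7.69, "Distance to debris": 3.234}
se_coef = {"Constant": 13.33, "Distance to debris": 1.782}
r_sq = 0.320

# Assumed convention: T = Coef / SE Coef.
t_ratios = {name: coef[name] / se_coef[name] for name in coef}

# Assumed convention for one predictor: adjusted R-squared.
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / (n - 2)


def predict(distance_to_debris):
    """Predicted distance traveled (meters) from the fitted line."""
    return coef["Constant"] + coef["Distance to debris"] * distance_to_debris


for name, t in t_ratios.items():
    print(f"{name}: T = {t:.3f}")
print(f"R-sq(adj) = {adj_r_sq:.1%}")
# Slope interpretation: one more meter of x adds the slope to the prediction.
print(f"predict(8) - predict(7) = {predict(8) - predict(7):.3f}")
```

Because the printed coefficients are already rounded, the recomputed T values agree with the output only to within about 0.01.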

26 Homework Pg.260: # , 5.36

