Chapter 5 Lesson 5.3 Summarizing Bivariate Data

5.3: Assessing the Fit of a Line

Assessing the fit of the LSRL Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are: Is the line an appropriate way to summarize the relationship between x and y? Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions? If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x   6.94   5.23   5.21   7.10   8.16   5.50   9.19   9.05   9.36
y   0.00   6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

Predictor             Coef    SE Coef      T      P
Constant             -7.69     13.33   -0.58  0.582
Distance to debris    3.234     1.782   1.82  0.112

S = 8.67071   R-Sq = 32.0%   R-Sq(adj) = 22.3%
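As a rough check on the output above, the least-squares line can be recomputed from the nine data pairs. This is a minimal sketch in plain Python, not the software that produced the output; the helper name `least_squares_line` is made up for illustration, and the first y-value is taken to be 0.00, as listed in the residual table later in the lesson.

```python
# Deer mouse data from the lesson (distances in meters).
# Assumption: the first y-value is 0.00, as in the lesson's residual table.
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]         # distance to debris
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # distance traveled


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    # The least-squares line always passes through (x-bar, y-bar).
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

The printed values should agree with the Coef column (Constant and Distance to debris) up to rounding.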

The vertical deviation between a point and the LSRL is called the residual. Residual = Actual - Predicted (think AP!). If the point is above the line, the residual will be positive. If the point is below the line, the residual will be negative. [Scatterplot of the deer mouse data with the LSRL: distance traveled (y) versus distance to debris (x).]
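The definition can be tried out on the deer mouse data. The sketch below assumes the rounded coefficients from the regression output (a = -7.69, b = 3.234) and a first y-value of 0.00, so its residuals may differ from the lesson's table in the last decimal place.

```python
# Rounded coefficients taken from the regression output in the lesson.
a, b = -7.69, 3.234

x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]         # distance to debris
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # distance traveled

# Predicted value on the line for each x, then residual = actual - predicted.
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

for xi, yi, y_hat, e in zip(x, y, predicted, residuals):
    position = "above" if e > 0 else "below"
    print(f"x={xi:5.2f}  y={yi:5.2f}  y-hat={y_hat:5.2f}  residual={e:7.2f}  ({position} the line)")
```

A positive residual marks a point above the line, and a negative residual marks a point below it.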

Residual plots A residual plot is a scatterplot of the (x, residual) pairs. Residuals can also be graphed against the predicted y-values, as (ŷ, residual) pairs. The purpose is to determine whether a linear model is an appropriate way to describe the relationship between the x and y variables. If the residual plot shows no pattern, then the linear model is appropriate.

This residual plot shows no pattern, so it indicates that the linear model is appropriate.
This residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.
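To see what a curved pattern looks like numerically, here is a small sketch that fits a line to deliberately curved data. The data (y = x² for x = 1 to 9) are hypothetical and not from the lesson, and `least_squares_line` is an illustrative helper.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical curved data: y = x^2 for x = 1..9.
x = list(range(1, 10))
y = [xi ** 2 for xi in x]

a, b = least_squares_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Record only the sign of each residual, in order of increasing x.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)  # prints ++-----++
```

The residuals are positive at both ends and negative in the middle. That systematic run of signs is the curved pattern that signals a line is not appropriate; residuals from a suitable linear model would switch sign without any such order.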

Use the values in this table to create a residual plot for the deer mouse data set. Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?

Distance from debris (x)   Distance traveled (y)   Predicted distance traveled   Residual
6.94                        0.00                   14.76                         -14.76
5.23                        6.13                    9.23                          -3.10
5.21                       11.29                    9.16                           2.13
7.10                       14.35                   15.28                          -0.93
8.16                       12.03                   18.70                          -6.67
5.50                       22.72                   10.10                          12.62
9.19                       20.11                   22.04                          -1.93
9.05                       26.16                   21.58                           4.58
9.36                       30.65                   22.59                           8.06

The residuals sum to zero. This should remind students of the fact that the sum of the deviations from the mean is always zero.

Plot the residuals against the distance from debris (x).
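The zero-sum remark can be checked directly from the table. A minimal sketch, assuming the residual column and y-values as listed in the lesson:

```python
# Values copied from the lesson's table (meters).
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]
residuals = [-14.76, -3.10, 2.13, -0.93, -6.67, 12.62, -1.93, 4.58, 8.06]

# Residuals from a least-squares line sum to zero (up to rounding).
residual_sum = sum(residuals)

# Deviations of the y-values from their own mean also sum to zero.
y_bar = sum(y) / len(y)
deviation_sum = sum(yi - y_bar for yi in y)

print(f"sum of residuals:            {abs(residual_sum):.6f}")
print(f"sum of deviations from mean: {abs(deviation_sum):.6f}")
```

Both sums come out as zero apart from floating-point noise, which is why the positive and negative residuals balance around the line.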

Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food. Now plot the residuals against the predicted distance traveled.

What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values? Residual plots can be plotted against either the x-values or the predicted y-values.

Getting the “Bends” (cont.) The curved relationship between fuel efficiency and weight is more obvious in the plot of the residuals than in the original scatterplot:

Outliers, Leverage, and Influence Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis. Any point that stands away from the others can be called an outlier and deserves your special attention.

Outliers, Leverage, and Influence (cont.) The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…

Outliers, Leverage, and Influence (cont.) The red line shows the effects that one unusual point can have on a regression:

Outliers, Leverage, and Influence (cont.) A data point can also be unusual if its x-value is far from the mean of the x-values. Such points are said to have high leverage.

Outliers, Leverage, and Influence (cont.) A point with high leverage has the potential to change the regression line. We say that a point is influential if omitting it from the analysis gives a very different model.

Outliers, Leverage, and Influence (cont.) Warning: Influential points can hide in plots of residuals. Points with high leverage pull the line close to them, so they often have small residuals. You’ll see influential points more easily in scatterplots of the original data or by finding a regression model with and without the points.
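The warning can be illustrated by fitting a line with and without a single high-leverage point. The data below are hypothetical (eight points close to y = x plus one point at x = 20), and `least_squares_line` is an illustrative helper, so treat this as a sketch rather than an example from the lesson.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data: eight points near y = x ...
x_main = [1, 2, 3, 4, 5, 6, 7, 8]
y_main = [1.2, 1.9, 3.1, 4.0, 4.8, 6.1, 7.0, 8.1]
# ... plus one point whose x-value is far from the mean of the others.
x_lev, y_lev = 20, 5.0

a_without, b_without = least_squares_line(x_main, y_main)
a_with, b_with = least_squares_line(x_main + [x_lev], y_main + [y_lev])

# Residual of the high-leverage point from the line that includes it.
resid_lev = y_lev - (a_with + b_with * x_lev)
# Residuals of the other points from that same line.
resid_others = [yi - (a_with + b_with * xi) for xi, yi in zip(x_main, y_main)]
# Distance of the high-leverage point from the line fitted without it.
gap_without = y_lev - (a_without + b_without * x_lev)

print(f"slope without the point: {b_without:.3f}")
print(f"slope with the point:    {b_with:.3f}")
print(f"its residual from the line it helped fit: {resid_lev:.2f}")
print(f"largest residual among the other points:  {max(abs(e) for e in resid_others):.2f}")
print(f"its distance from the line fitted without it: {gap_without:.2f}")
```

Omitting the point changes the slope drastically, so the point is influential. Its own residual is still not the largest in the data set, because it has pulled the line toward itself. Comparing the two fitted models reveals the problem more readily than the residuals alone.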

Coefficient of determination: Denoted by r² or R². Gives the proportion or percentage of the variation (differences) in y that can be 1) attributed to the variation in x, or equivalently 2) accounted for by the LSRL model. Example: Target Microwaves Hd R² = .6543 means that 65.43% of the differences in microwave prices are attributed to the differences in wattage.
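One common way to compute this proportion is R² = 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of y from its mean. The sketch below applies that formula to the deer mouse values from earlier in the lesson (the table's rounded residuals, with 0.00 as the first y-value); the variable names are illustrative.

```python
# Deer mouse values from the lesson's table (meters).
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]
residuals = [-14.76, -3.10, 2.13, -0.93, -6.67, 12.62, -1.93, 4.58, 8.06]

y_bar = sum(y) / len(y)

# Total variation in y around its mean.
sst = sum((yi - y_bar) ** 2 for yi in y)
# Variation left over after fitting the line.
sse = sum(e ** 2 for e in residuals)

r_squared = 1 - sse / sst
print(f"R-squared = {r_squared:.3f}")
```

The result should be close to the R-Sq = 32.0% reported in the regression output for these data.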

Fat Versus Protein: An Example The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:

Residuals Revisited (cont.) The residuals for the BK menu regression look appropriately boring:

Fat Versus Protein: An Example The regression line for the Burger King data fits the data well. [Equation of the line shown on the slide.] Since r = .83, R² = (.83)² ≈ .69, so 69% of the variation in total fat is accounted for by the model. Equivalently, 69% of the variation in total fat is accounted for by the variation in protein.

How Big Should R² Be? R² is always between 0% and 100%. 1 - R² is the percentage of the variation left in the residuals. The standard deviation of the residuals (s_e) can give us more information about the usefulness of the regression by telling us how much scatter there is around the line.
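For a line fitted to n points, the standard deviation of the residuals is usually computed as s_e = sqrt(SSE / (n - 2)). A minimal sketch using the rounded deer mouse residuals from earlier in the lesson:

```python
import math

# Residuals from the lesson's deer mouse table (meters).
residuals = [-14.76, -3.10, 2.13, -0.93, -6.67, 12.62, -1.93, 4.58, 8.06]

n = len(residuals)
sse = sum(e ** 2 for e in residuals)

# Divide by n - 2 because two quantities (slope and intercept) were estimated.
s_e = math.sqrt(sse / (n - 2))
print(f"s_e = {s_e:.2f} meters")
```

This should come out near the S = 8.67071 shown in the regression output, meaning a typical observation lies roughly 8.7 meters from the line.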

Partial output from the regression analysis of deer mouse data. Let’s review the values from this output and their meanings.

Predictor             Coef    SE Coef      T      P
Constant             -7.69     13.33   -0.58  0.582
Distance to debris    3.234     1.782   1.82  0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

What does each number represent?
The y-intercept (a = -7.69): This value has no meaning in context since it doesn't make sense to have a negative distance.
The slope (b = 3.234): For every 1-meter increase in the distance to the nearest debris pile, the predicted distance traveled for food increases by approximately 3.234 meters.
The standard deviation (s = 8.67071): This is the typical amount by which an observation deviates from the least squares regression line.
The coefficient of determination (R² = 32.0%): 32% of the variation in distance traveled for food is accounted for by the model.

Homework Pg.260: #5.31-5.33, 5.36