1
7.3 Best-Fit Lines and Prediction
LEARNING GOAL Become familiar with the concept of a best-fit line for a correlation, recognize when such lines have predictive value and when they may not, understand how the square of the correlation coefficient is related to the quality of the fit, and qualitatively understand the use of multiple regression. Page 307
2
Definition: The best-fit line (or regression line) on a scatter diagram is a line that lies closer to the data points than any other possible line (according to a standard statistical measure of closeness). Page 307
3
Predictions with Best-Fit Lines
Cautions in Making Predictions from Best-Fit Lines
1. Don’t expect a best-fit line to give a good prediction unless the correlation is strong and there are many data points. If the sample points lie very close to the best-fit line, the correlation is very strong and the prediction is more likely to be accurate. If the sample points lie away from the best-fit line by substantial amounts, the correlation is weak and predictions tend to be much less accurate.
2. Don’t use a best-fit line to make predictions beyond the bounds of the data points to which the line was fit.
4
Cautions in Making Predictions from Best-Fit Lines (cont.)
3. A best-fit line based on past data is not necessarily valid now and might not result in valid predictions of the future.
4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.
5. Remember that a best-fit line is meaningless when there is no significant correlation or when the relationship is nonlinear.
Page 309
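The caution against predicting beyond the bounds of the data can be sketched as a simple guard in code. This is only an illustration: the function name `predict_within_bounds`, the exercise hours, and the slope and intercept are all invented for the example and do not come from the text.

```python
def predict_within_bounds(x_new, xs, slope, intercept):
    """Return slope * x_new + intercept only if x_new lies inside the fitted x range."""
    lo, hi = min(xs), max(xs)
    if not (lo <= x_new <= hi):
        raise ValueError(f"x = {x_new} is outside the fitted range [{lo}, {hi}]")
    return slope * x_new + intercept


# Hypothetical sample: hours of exercise per day for the people who were surveyed
hours = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
# Made-up best-fit line (calories per day), assumed to have been fit to that sample
slope, intercept = 300.0, 2000.0

# A value inside the observed range gets a prediction
print(predict_within_bounds(1.0, hours, slope, intercept))

# A value far outside the observed range is refused
try:
    predict_within_bounds(18.0, hours, slope, intercept)
except ValueError as err:
    print("refused:", err)
```

A guard like this only addresses the extrapolation caution. It says nothing about whether the correlation is strong, current, or drawn from the right population.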
5
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.
You’ve found a best-fit line for a correlation between the number of hours per day that people exercise and the number of calories they consume each day. You’ve used this correlation to predict that a person who exercises 18 hours per day would consume 15,000 calories per day.
Solution: No one exercises 18 hours per day on an ongoing basis, so this much exercise must be beyond the bounds of any data collected. Therefore, a prediction about someone who exercises 18 hours per day should not be trusted.
6
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.
There is a well-known but weak correlation between SAT scores and college grades. You use this correlation to predict the college grades of your best friend from her SAT scores.
Solution: The fact that the correlation between SAT scores and college grades is weak means there is much scatter in the data. As a result, we should not expect great accuracy if we use this weak correlation to make a prediction about a single individual.
7
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.
Historical data have shown a strong negative correlation between national birth rates and affluence. That is, countries with greater affluence tend to have lower birth rates. These data predict a high birth rate in Russia.
Solution: We cannot automatically assume that the historical data still apply today. In fact, Russia currently has a very low birth rate, despite also having a low level of affluence.
8
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.
A study in China has discovered correlations that are useful in designing museum exhibits that Chinese children enjoy. A curator suggests using this information to design a new museum exhibit for Atlanta-area school children.
Solution: The suggestion to use information from the Chinese study for an Atlanta exhibit assumes that predictions made from correlations in China also apply to Atlanta. However, given the cultural differences between China and Atlanta, the curator’s suggestion should not be considered without more information to back it up.
9
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.
Scientific studies have shown a very strong correlation between children’s ingesting of lead and mental retardation. Based on this correlation, paints containing lead were banned.
Solution: Given the strength of the correlation and the severity of the consequences, this prediction and the ban that followed seem quite reasonable. In fact, later studies established lead as an actual cause of mental retardation, making the rationale behind the ban even stronger.
10
EXAMPLE 1 Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.
Based on a large data set, you’ve made a scatter diagram for salsa consumption (per person) versus years of education. The diagram shows no significant correlation, but you’ve drawn a best-fit line anyway. The line predicts that someone who consumes a pint of salsa per week has at least 13 years of education.
Solution: Because there is no significant correlation, the best-fit line and any predictions made from it are meaningless.
11
The Correlation Coefficient and Best-Fit Lines
Best-Fit Lines and $r^2$
The square of the correlation coefficient, or $r^2$, is the proportion of the variation in a variable that is accounted for by the best-fit line. Page 311
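To see what "proportion of the variation accounted for" means numerically, the sketch below compares the squared correlation coefficient with one minus the ratio of leftover (residual) variation to total variation. The data are invented for illustration, and NumPy is an assumed tool rather than something the text prescribes.

```python
import numpy as np

# Hypothetical sample data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Correlation coefficient r between x and y
r = np.corrcoef(x, y)[0, 1]

# Least-squares best-fit line and the y values it predicts
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Total variation of y about its mean, and variation left over after the fit
ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - predicted) ** 2)
explained = 1.0 - ss_resid / ss_total

print(f"r^2                  = {r ** 2:.4f}")
print(f"explained proportion = {explained:.4f}")
```

For a straight-line fit with an intercept, the two printed values agree up to rounding error.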
12
EXAMPLE 4 Voter Turnout and Unemployment
Political scientists are interested in knowing what factors affect voter turnout in elections. One such factor is the unemployment rate. Data collected in presidential election years since 1964 show a very weak negative correlation between voter turnout and the unemployment rate, with a correlation coefficient of about r = -0.1. Based on this correlation, should we use the unemployment rate to predict voter turnout in the next presidential election?
Solution: The square of the correlation coefficient is $r^2 = (-0.1)^2 = 0.01$, which means that only about 1% of the variation in the data is accounted for by the best-fit line. Nearly all of the variation in the data must therefore be explained by other factors. We conclude that unemployment is not a reliable predictor of voter turnout.
Page 311. Note that there is a scatter diagram of the voter turnout data on page 312.
13
TIME OUT TO THINK Consider Table 7.1 (reproduced on the next slide). Notice, for example, that Diamonds 4 and 5 have nearly identical weights, but Diamond 4 costs only $4,299 while Diamond 5 costs $9,589. Can differences in their color explain the different prices? Study other examples in Table 7.1 in which two diamonds have similar weights but different prices. Overall, do you think that the correlation with price would be stronger if we used weight and color together instead of either one alone? Explain. Page 312
14
Table 7.1 is found on page 287.
15
Multiple Regression Definition
The use of multiple regression allows the calculation of a best-fit equation that represents the best fit between one variable (such as price) and a combination of two or more other variables (such as weight and color). The coefficient of determination, $R^2$, tells us the proportion of the scatter in the data accounted for by the best-fit equation. Page 312
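A minimal sketch of multiple regression follows, assuming NumPy's least-squares solver. The diamond weights, color scores, and prices are made up for this example and are not the values from Table 7.1. The helper `r_squared` is a hypothetical name.

```python
import numpy as np

# Hypothetical diamonds (NOT Table 7.1): weight in carats, a numeric color score, price in dollars
weight = np.array([0.5, 0.7, 1.0, 1.0, 1.5, 2.0])
color = np.array([3.0, 5.0, 2.0, 6.0, 4.0, 3.0])
price = np.array([1500.0, 2000.0, 4500.0, 3500.0, 7000.0, 10500.0])


def r_squared(columns, y):
    """Fit y = b0 + b1*x1 + b2*x2 + ... by least squares and return R^2."""
    # Design matrix: a column of ones for the intercept, then one column per predictor
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coeffs
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)


r2_weight = r_squared([weight], price)
r2_both = r_squared([weight, color], price)

print(f"R^2 using weight alone:    {r2_weight:.3f}")
print(f"R^2 using weight and color: {r2_both:.3f}")
```

With a least-squares fit that includes an intercept, adding a second variable can never lower $R^2$. A higher value alone therefore does not show that the extra variable is meaningful.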
16
Finding Equations for Best-Fit Lines (Optional Section)
If we draw any line on a scatter diagram, we can measure the vertical distance between each data point and that line. One measure of how well the line fits the data is the sum of the squares of these vertical distances. A large sum means that the vertical distances of data points from the line are fairly large and hence the line is not a very good fit. A small sum means the data points lie close to the line and the fit is good. Of all possible lines, the best-fit line is the line that minimizes the sum of the squares of the vertical distances. Page 313
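The idea of scoring a line by its sum of squared vertical distances can be sketched as follows. The data points and the candidate lines are invented for illustration, and `np.polyfit` is used here as one assumed way to obtain the least-squares line.

```python
import numpy as np

# Hypothetical data points, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])


def sum_squared_distances(m, b):
    """Sum of squared vertical distances from the points to the line y = m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))


# Least-squares line and its score
best_m, best_b = np.polyfit(x, y, 1)
best = sum_squared_distances(best_m, best_b)
print(f"best-fit line: y = {best_m:.3f}x + {best_b:.3f}, sum of squares = {best:.4f}")

# A few arbitrary candidate lines for comparison
candidates = [(1.5, 1.0), (2.0, 0.0), (2.5, -1.0)]
for m, b in candidates:
    print(f"y = {m}x + {b}: sum of squares = {sum_squared_distances(m, b):.4f}")
```

No candidate line can score lower than the least-squares line.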
17
You may recall that the equation of any straight line can be written in the general form
$y = mx + b$
where $m$ is the slope of the line and $b$ is the y-intercept of the line. The formulas for the slope and y-intercept of the best-fit line are as follows:
slope: $m = r \times \frac{s_y}{s_x}$
y-intercept: $b = \bar{y} - (m \times \bar{x})$
In the above expressions, $r$ is the correlation coefficient, $s_x$ denotes the standard deviation of the x values (or the values of the first variable), $s_y$ denotes the standard deviation of the y values, $\bar{x}$ represents the mean of the values of the variable x, and $\bar{y}$ represents the mean of the values of the variable y. Page 313
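As a check on the slope and intercept formulas (slope equals r times the ratio of the standard deviation of y to that of x, and intercept equals the mean of y minus the slope times the mean of x), the sketch below computes both by hand and compares them with a library fit. The data are invented, and NumPy is an assumed tool.

```python
import numpy as np

# Hypothetical sample data, invented for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.7, 11.2])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
s_x = np.std(x, ddof=1)       # sample standard deviation of the x values
s_y = np.std(y, ddof=1)       # sample standard deviation of the y values

m = r * s_y / s_x             # slope = r * (s_y / s_x)
b = y.mean() - m * x.mean()   # intercept = mean of y - (slope * mean of x)

# Compare with a least-squares fit from the library
check_m, check_b = np.polyfit(x, y, 1)

print(f"formula: m = {m:.4f}, b = {b:.4f}")
print(f"polyfit: m = {check_m:.4f}, b = {check_b:.4f}")
```

Only the ratio of the two standard deviations enters the slope. Using the population form (`ddof=0`) for both would therefore give the same line.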
18
Because these formulas are tedious to use by hand, we usually use a calculator or computer to find the slope and y-intercept of best-fit lines. When software or a calculator is used to find the slope and intercept of the best-fit line, results are commonly expressed in the format $y = b_0 + b_1 x$, where $b_0$ is the intercept and $b_1$ is the slope, so be careful to correctly identify those two values. Page 313
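The warning about identifying intercept and slope correctly matters in practice, because software does not always report them in the same order. As one illustration, assuming NumPy and made-up data, two of its fitting functions return the same line with the coefficients in opposite orders:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical sample data, invented for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 4.9, 7.1, 9.0, 10.8, 13.1])

# np.polyfit lists the highest power first: (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)

# numpy.polynomial.polynomial.polyfit lists the lowest power first: (b0, b1)
b0, b1 = P.polyfit(x, y, 1)

print(f"np.polyfit:           slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"polynomial.polyfit:   b0 (intercept) = {b0:.4f}, b1 (slope) = {b1:.4f}")
```

Other calculators and packages have their own conventions, so it is worth checking the documentation of whichever tool is used.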
19
The End