Published by Morgan Chambers. Modified over 8 years ago.
1
Chapter 11: Using Relationships to Make Predictions
Objectives (SWBAT):
1) Use a model to make predictions
2) Model linear relationships with a least-squares regression line
3) Make predictions using a least-squares regression line
4) Calculate least-squares regression lines
5) Calculate the standard deviation of the residuals
2
Using a Model to Make Predictions Is it possible to use Statistics to predict how many wins a baseball team will have? This is a concept portrayed in the movie “Moneyball.” Bill James is an American baseball writer, historian, and statistician who has been writing about baseball since 1977. His approach to baseball is called sabermetrics, and it was brought into the mainstream by the Oakland Athletics, as seen in “Moneyball.”
3
James investigated the relationship between runs scored, runs allowed, and winning percentage. He proposed the Pythagorean expectation of winning percentage: winning percentage = (runs scored)² / ((runs scored)² + (runs allowed)²).
4
Let’s take a look at this formula in action by examining the 2008 World Champion Philadelphia Phillies. In the 2008 season, they scored 799 runs and allowed 680 runs. In a 162-game season, the predicted number of wins for the Phillies would be: 162 × 799² / (799² + 680²) ≈ 162 × 0.580 = 93.96.
5
The Phillies were predicted to win 93.96 games. However, they won only 92 games, approximately 2 games below the prediction. The difference between the actual number of wins and the predicted number of wins is called a residual. The residual for the 2008 Phillies is: 92 − 93.96 = −1.96. The negative sign indicates that the Phillies won 1.96 fewer games than expected, based on their runs scored and runs allowed.
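The Phillies calculation can be reproduced with a short Python sketch. It assumes the standard form of the Pythagorean expectation (runs scored squared divided by the sum of runs scored squared and runs allowed squared); the function and variable names are illustrative and do not come from the slides. Note that the unrounded prediction is about 93.95 wins; the 93.96 on the slide appears to come from rounding the winning percentage to 0.580 before multiplying by 162.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2):
    """Predicted winning percentage (assumed standard Pythagorean form)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


GAMES = 162
ACTUAL_WINS = 92  # 2008 Phillies, from the slides

win_pct = pythagorean_win_pct(799, 680)
predicted_wins = win_pct * GAMES                  # unrounded prediction
predicted_wins_slide = round(win_pct, 3) * GAMES  # rounds the percentage first

# Residual = actual - predicted (the "AP" order of subtraction)
residual = ACTUAL_WINS - predicted_wins_slide

print(f"winning percentage: {win_pct:.4f}")
print(f"predicted wins (unrounded): {predicted_wins:.2f}")
print(f"predicted wins (slide-style rounding): {predicted_wins_slide:.2f}")
print(f"residual: {residual:.2f}")
```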
6
Here are the residuals for all 2008 MLB teams (pg 400-401). Remember that a negative residual means a team PERFORMED below their prediction, and a positive means a team PERFORMED above their prediction. To help remember the order of subtraction, think of the acronym AP: actual - predicted
7
Choosing the Best Model Why do you think Bill James chose to use the exponent “2” in his Pythagorean model? Why not 3, or 1, or even 17? In other words, how can we pick the “best” exponent to help us make the best predictions (the exponent that produces the smallest residuals)?
8
Let’s examine the same formula, but replace the exponent of 2 with a 1. Then we’ll calculate the prediction for the Phillies. With an exponent of 1, the predicted number of wins is about 87.48. Remember they actually won 92 games, so the new residual is 92 − 87.48 = 4.52. The other model gave a residual of −1.96.
9
The closer a residual is to 0, the more accurate a prediction is. The residual of 4.52 is more than double the size of the residual of −1.96. This indicates that, for the Phillies, the exponent of 2 was more appropriate than the exponent of 1. Did this hold true for all teams? Here are the residuals for all 30 teams when using an exponent of 1. At first glance, these residuals seem larger than the initial ones.
10
To make a comparison between the model with an exponent of 2 and the model with an exponent of 1, let’s square the residuals for each model and then sum them up (squaring them will guarantee that they are positive). The sum is called the sum of squared residuals (SSR). For the exponent of 2: For the exponent of 1: Because the SSR is smaller when using an exponent of 2, that makes it a better model than using an exponent of 1.
11
The Concept of Least Squares In general, when statisticians compare models to see which one makes the best predictions, they compare the sum of squared residuals for each model and choose the model with the smallest sum. This is known as the “least-squares” method.
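A minimal Python sketch of the least-squares comparison: compute the sum of squared residuals (SSR) for each candidate exponent and keep the exponent with the smallest sum. Only the Phillies row comes from the slides; the other two rows are hypothetical placeholders, not real 2008 data, and would need to be replaced with the full 30-team table. The function names are illustrative.

```python
def predicted_wins(runs_scored, runs_allowed, exponent, games=162):
    """Pythagorean prediction of wins for a given exponent."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


def sum_squared_residuals(teams, exponent):
    """SSR = sum of (actual - predicted)^2 over all teams."""
    return sum(
        (wins - predicted_wins(rs, ra, exponent)) ** 2
        for rs, ra, wins in teams
    )


# (runs scored, runs allowed, actual wins)
teams = [
    (799, 680, 92),  # 2008 Phillies (from the slides)
    (700, 750, 74),  # hypothetical team, for illustration only
    (850, 700, 95),  # hypothetical team, for illustration only
]

# Candidate exponents 1.0, 1.1, ..., 3.0
exponents = [round(1.0 + 0.1 * i, 1) for i in range(21)]
ssr_by_exponent = {e: sum_squared_residuals(teams, e) for e in exponents}

# Least squares: choose the model with the smallest SSR
best_exponent = min(ssr_by_exponent, key=ssr_by_exponent.get)
print(best_exponent, round(ssr_by_exponent[best_exponent], 2))
```

With placeholder rows the winning exponent itself is not meaningful; the point is the procedure of scoring every candidate model by its SSR.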
12
The scatterplot to the right shows the SSR using the Pythagorean formula with different exponents from 1.0 to 3.0. As you can see, the best exponent to use is 2, since it results in the smallest SSR (the most accurate predictions). When there is a linear association between two numerical variables, we can use a linear model to make predictions. Let’s see how exciting that is…
13
Least-Squares Regression Lines A least-squares regression line is used to model a linear relationship between an explanatory variable x and a response variable y. In a previous Algebra class you’ve likely talked about “lines of best fit.” The least-squares regression line best models an association because, of all possible lines, it has the smallest sum of squared residuals.
14
Let’s examine the relationship between home runs and runs scored for teams in the 2008 MLB season. One would think that teams that are good at hitting home runs are also good at scoring runs. There is a moderately strong, positive, linear relationship between the number of home runs a team hit and the number of runs they scored. The correlation is r=0.62. We can use a least-squares regression line to model this data. The question becomes what is the equation for this line?
15
Least-squares regression lines are expressed in the following form: ŷ = a + bx, where ŷ is the predicted value of y, a is the y-intercept, and b is the slope.
16
Once we have the least-squares regression line, we can use it to make predictions for specific teams. The 2008 Texas Rangers hit 194 home runs. Let’s use that to predict how many runs they would score. We predict they would score approximately 793 runs.
17
This means the Rangers scored 108.5 more runs than predicted, based on the number of home runs they hit.
18
The San Diego Padres hit 154 home runs. Find their predicted number of runs scored. They actually scored 637 runs. Calculate the value of their residual, and then interpret what that means. Since the residual is negative, we know the Padres scored 105.5 fewer runs than predicted, based on the number of home runs they hit.
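The Rangers and Padres calculations can be sketched in Python. The intercept of 550 is stated in the slides; the slope of 1.25 is an assumption, inferred from the slides' own numbers (a prediction of about 793 runs for 194 home runs, and a residual of −105.5 for the Padres). The Rangers' actual total of 901 runs is likewise implied by the stated residual of 108.5 rather than quoted directly.

```python
INTERCEPT = 550.0  # from the slides
SLOPE = 1.25       # assumption: inferred from the slides' worked numbers


def predict_runs(home_runs):
    """Predicted runs scored from home runs: y-hat = a + b*x."""
    return INTERCEPT + SLOPE * home_runs


def residual(actual, predicted):
    """Residual = actual - predicted."""
    return actual - predicted


rangers_pred = predict_runs(194)
rangers_resid = residual(901, rangers_pred)  # 901 implied by the 108.5 residual

padres_pred = predict_runs(154)
padres_resid = residual(637, padres_pred)    # 637 is from the slides

print(rangers_pred, rangers_resid)
print(padres_pred, padres_resid)
```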
19
So what exactly do the slope and y-intercept mean in a least-squares regression equation? Remember from algebra that slope is rise over run. The interpretation of the slope is the predicted change in the y variable when the value of the x variable is increased by 1.
20
The y-intercept (a) is the constant (where the least-squares regression line crosses the y axis). The y-intercept gives us the predicted value of y when x is 0. Sticking with our least-squares regression line using home runs to predict runs scored, the constant value is 550. This means that a team with 0 home runs would be predicted to score about 550 runs. Caution: No teams in 2008 hit 0 home runs, or were anywhere near that mark. We really don’t know anything about the relationship between home runs and runs scored for teams with this little power.
21
Extrapolation is trying to make predictions outside of the range of the data we have. This can occur when we try to make a prediction for a value much smaller than the other values (for example 0 with the home run and runs scored example) or much larger than the other values (for example 4000 with the home run and runs scored example). It is risky to make predictions using extrapolation since the association between the variables may not be the same for extremely small or extremely large values of the explanatory variable.
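One way to guard against extrapolation in practice is to flag any prediction whose x value falls outside the range of the data. The sketch below assumes the home run line with intercept 550 (from the slides) and slope 1.25 (inferred from the slides' worked numbers). The bounds 90 and 240 are hypothetical placeholders, not the actual 2008 minimum and maximum.

```python
INTERCEPT = 550.0  # from the slides
SLOPE = 1.25       # assumption: inferred from the slides' worked numbers

# Hypothetical placeholders for the smallest and largest observed
# home run totals; replace with the real extremes from the data.
OBSERVED_MIN_HR = 90
OBSERVED_MAX_HR = 240


def predict_runs(home_runs):
    """Return (predicted runs, is_extrapolation)."""
    is_extrapolation = not (OBSERVED_MIN_HR <= home_runs <= OBSERVED_MAX_HR)
    return INTERCEPT + SLOPE * home_runs, is_extrapolation


for hr in (0, 194, 4000):
    runs, risky = predict_runs(hr)
    note = " (extrapolation: treat with caution)" if risky else ""
    print(f"{hr} HR -> {runs:.1f} runs{note}")
```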
22
Extrapolation also occurs when people try to project how well an athlete or team will do in the future based on previous PERFORMANCES. This is especially true early in a game or season. For example, suppose a running back gains 200 yards in his first game of a 16-game season. At that pace, he would be predicted to run for 3200 yards for the entire season. This would likely be inaccurate, especially considering the single-season rushing record is 2105 yards (Eric Dickerson).
23
Calculating the Least-Squares Regression Line There are formulas to manually calculate the least-squares regression line. However, the TI-84 makes this calculation much easier, so we can thank our friends at Texas Instruments for helping us out. Let’s look at how we can calculate the least-squares regression line using our 84. Trig seems to be a fan of this idea.
24
To the right is 2008 MLB data showing every team, the number of hits they had, and the number of runs they had. Let’s calculate the least-squares regression equation relating hits (x) to runs (y). (pg 413) Like with many other calculator functions, the first thing we need to do is enter this data into lists. Enter hits in L1 and runs into L2.
25
Now press STAT, go to CALC, and go to option 8: LinReg(a+bx). Make sure your Xlist is L1 and your Ylist is L2. Then press calculate.
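For readers without a TI-84, the same line can be computed from the standard least-squares formulas: the slope is b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and the intercept is a = ȳ − b·x̄. The Python sketch below uses a small hypothetical dataset, not the 2008 hits and runs table, and the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


# Hypothetical data (plays the role of L1 and L2 on the calculator)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```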
26
To view a graph of the scatterplot, press STAT PLOT (2nd Y=), choose Plot 1, turn it on, choose the first graph type (scatterplot), and make sure Xlist is L1 and Ylist is L2. To graph the least-squares regression line with the scatterplot, press Y= and enter the equation. Then press ZOOM 9.
27
The same Correlation and Regression applet used in the last chapter can also calculate the least-squares regression line for us. www.tinyurl.com/SRISapplets Observed slope is b and observed intercept is a.
28
Another fantastic applet is the one found at www.tinyurl.com/SPAapplets. Click on Two Quantitative Variables, and then enter the data for the explanatory and response variable. After pressing “begin analysis,” you can utilize other features such as calculating correlation and calculating the least-squares regression line.
29
The Standard Deviation of the Residuals So far we’ve looked at least-squares regression lines to model the relationship between home runs and runs scored, and the relationship between hits and runs scored. Which of the two explanatory variables, home runs or hits, does a better job of predicting the number of runs scored? In order to find out, we have to calculate the standard deviation of the residuals (s).
30
The standard deviation of the residuals (s) estimates the typical distance between the actual values of the response variable and their corresponding predicted values. In this instance, the standard deviation of the residuals measures the typical distance between a team’s actual number of runs scored and their predicted number of runs scored. In other words, it estimates about how far away we should expect the points on the scatterplot to be from the graph of the least-squares regression line.
31
Let’s calculate s for our hits and runs example. (pg 416) This means that when using hits to predict runs, our predicted values will typically be about 51.6 runs away from the actual values.
32
Calculating the standard deviation of the residuals can also be done on the TI-84.
1) Just as before, enter hits in L1 and runs in L2, then press STAT, CALC, 8: LinReg(a+bx).
2) To display the residuals in L3, go into STAT, EDIT, and move the cursor to the heading L3. Now press LIST (2nd STAT), scroll down to option 7: RESID, and press ENTER twice. (Note: this only works if you calculate the regression line FIRST.)
33
3) To find the sum of the squared residuals, press STAT, CALC, 1: 1-Var Stats, and enter L3 for the list. The value reported as Σx² is the sum of the squared residuals.
4) Finish by substituting into the formula for s: s = √(SSR / (n − 2)), where n is the number of observations.
Note: For the first applet, enter the information the same as before, and look for the option “observed s.” This gives you the standard deviation of the residuals. For the second applet, calculate the least-squares regression line and s is listed.
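The calculator steps can be mirrored in Python. The sketch assumes the usual formula for the standard deviation of the residuals, s = √(SSR / (n − 2)); that divisor is consistent with the OPS figures the slides report (a sum of squared residuals of 14,872 for 30 teams giving s of about 23.0). The residual list is hypothetical and the function names are illustrative.

```python
import math


def residual_std_from_ssr(ssr, n):
    """s = sqrt(SSR / (n - 2)) for n observations."""
    return math.sqrt(ssr / (n - 2))


def residual_std(residuals):
    """Square the residuals, sum them, then apply the formula for s."""
    ssr = sum(r ** 2 for r in residuals)
    return residual_std_from_ssr(ssr, len(residuals))


# Hypothetical residuals (what the calculator would store in L3)
toy_residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
s_toy = residual_std(toy_residuals)

# Check against the OPS numbers reported in the slides
s_ops = residual_std_from_ssr(14872, 30)

print(f"toy s = {s_toy:.3f}")
print(f"OPS s = {s_ops:.2f}")
```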
34
We still need to decide which variable (hits or home runs) is a better predictor of runs scored. Let’s compare the standard deviation of the residuals for each relationship. The explanatory variable that provides the smaller standard deviation of the residuals will be the better predictor (since the predictions will be closer to the actual values). As previously calculated, the standard deviation of the residuals using hits was 51.6 runs. The standard deviation of the residuals using home runs is approximately 55.0 runs. Since the model using hits had a smaller s than the model using home runs, we can state that using hits is better than using home runs to predict runs.
35
The table to the right shows a variety of offensive variables that can be used to predict runs for the 2008 MLB season. The variables with the smallest standard deviations of the residuals are on-base percentage (OBP) and slugging average (SLG). Note: These two statistics measure the two fundamental components of scoring runs: getting on base and moving runners around the bases.
36
Recently, a hybrid of on-base percentage and slugging average has been proposed as a way to evaluate a player’s overall ABILITY to help his team score runs. The hybrid is called OPS, short for on-base percentage plus slugging average: OPS = OBP + SLG. To the right is a scatterplot of OPS and runs scored for all 30 teams in 2008. As you can see, there is much less variability from the line in this scatterplot. The sum of the squared residuals is only 14,872, resulting in a standard deviation of the residuals of just s = 23.0 runs.
37
Influential Points In the previous chapter, we examined the effect that a single observation can have on the value of correlation. The lesson is the same in this chapter. A single observation can have a big effect on the equation of the least-squares regression line, as well as on related measures such as the standard deviation of the residuals.
38
To the right is a scatterplot showing shooting percentage and number of wins for the 30 NBA teams in the 2008-2009 regular season. The Phoenix Suns made 50.4% of their shots but won only 46 games, which seems to be out of the pattern of the rest of the teams. You can see two least-squares regression lines on the scatterplot, one that includes the Suns and one that does not. We can see that the Suns’ PERFORMANCE had a strong influence on the slope of the line. Removing the Suns causes the slope to go from 6.8 to 10.8. It also causes the standard deviation of the residuals to go from s = 11 wins to s = 9 wins, and the correlation to go from r = 0.65 to r = 0.79.
39
Occasionally, some points that seem influential aren’t. Below is a scatterplot showing the relationship between shooting percentage and average points scored per game for the 30 NBA teams in the 2008-2009 season. The Warriors scored much more than they were expected to, and seem out of the pattern of the rest of the teams.
40
However, as you can see, when the Warriors were removed from the data, the equation of the least-squares regression line barely changed. The slope went from 1.66 to 1.67 and the y-intercept went from 23.8 to 22.8. The change was a little bit more noticeable on the standard deviation of the residuals (from s = 3.66 points to s = 3.30 points) and the correlation (from r = 0.53 to r = 0.58). To identify observations that are potentially influential, look for points on the scatterplot that do not follow the pattern of the rest of the data. Points that have especially small or large values of the explanatory variable can be particularly influential on the equation of the least-squares regression line, the standard deviation of the residuals, and the correlation, whereas points that have values of the explanatory variable near the average tend not to influence the equation of the least-squares regression line as much.
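The difference between an unusual point near the average x value and an unusual point at an extreme x value can be seen in a small Python sketch. All data here are hypothetical and chosen only to make the effect easy to see; they are not the NBA data from the scatterplots.

```python
def least_squares_line(points):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical base data lying exactly on the line y = x (mean x is 5)
base = [(x, x) for x in range(1, 10)]

# Unusual point whose x value equals the mean of x
_, slope_mid = least_squares_line(base + [(5, 15)])

# Unusual point whose x value is far above the rest
_, slope_far = least_squares_line(base + [(20, 5)])

_, slope_base = least_squares_line(base)
print(f"base slope: {slope_base:.3f}")
print(f"with unusual point at average x: {slope_mid:.3f}")
print(f"with unusual point at extreme x: {slope_far:.3f}")
```

In this toy example the point at the average x value shifts the intercept but leaves the slope unchanged, while the point at the extreme x value pulls the slope far below its original value.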
41
Regression to the Mean In the beginning of the year, you all tried out for the varsity coin flipping team. Now it’s time to try out for the AAU coin flipping team. Whoever wants to try out, come grab a coin, flip it ten times, and record how many heads you get. (Table columns: Student, # of Heads)
42
Let’s have our best PERFORMING student(s) and worst PERFORMING student(s) stand up. Who do you think has a better chance of improving their PERFORMANCE if we did 10 more flips? WHY?????
43
At the beginning of every sports season, there are athletes who PERFORM much better than their ABILITY. Oftentimes sportscasters get carried away talking about possible records that can be broken. However, how likely is it that an athlete keeps PERFORMING that much better than their ABILITY for a continued period of time? Thinking about it from the other side, sometimes athletes start off in a slump, PERFORMING worse than their ABILITY. How likely is it that they keep up their poor PERFORMANCE?
44
Mr. Doback has a class of 26 students and runs the same coin flipping experiment that we just did, except he has the students flip their coins 50 times, and he runs the experiment two days in a row. On the first day, 15 students flipped 25 or more heads, and 11 students flipped fewer than 25 heads. On the second day, of the 15 students who flipped 25 or more heads on day 1, only 2 improved their heads total, meaning 13 students’ totals went down. However, of the 11 students who flipped fewer than 25 heads, 8 increased their number of heads.
45
Here is a scatterplot showing the number of heads on day 1 and the number of heads on day 2 for the 26 students. The vertical line at 25 represents the expected number of heads. Any point to the left of the line indicates a student who got fewer than 25 heads on day 1, and any point to the right indicates a student who got more than 25 heads on day 1. The line y=x is a reference. If a point is above the line, then a student did better on day 2. If a point is below the line, then a student did worse on day 2.
46
The graph shows that 87% of students who did well on the first day did worse on the second day (ended up below y=x) and 73% of students who did worse on the first day did better on the second (ended up above y=x). When measuring the same variable in two different time periods, the tendency for better PERFORMANCES to follow poor PERFORMANCES and for worse PERFORMANCES to follow good PERFORMANCES is called regression to the mean.
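The coin flipping experiment is easy to simulate. The Python sketch below is a rough illustration rather than a reproduction of Mr. Doback's class: it uses 5,000 simulated students instead of 26 so that the proportions are stable, and the seed and function name are arbitrary choices.

```python
import random


def simulate_two_days(n_students=5000, n_flips=50, seed=1):
    """Simulate fair coin flips on two days and measure regression to the mean.

    Returns (share of day-1 high scorers who dropped,
             share of day-1 low scorers who improved).
    """
    rng = random.Random(seed)

    def count_heads():
        return sum(rng.random() < 0.5 for _ in range(n_flips))

    day1 = [count_heads() for _ in range(n_students)]
    day2 = [count_heads() for _ in range(n_students)]
    expected = n_flips / 2

    # Same grouping as the slides: "25 or more" versus "fewer than 25"
    high = [(a, b) for a, b in zip(day1, day2) if a >= expected]
    low = [(a, b) for a, b in zip(day1, day2) if a < expected]

    frac_high_down = sum(b < a for a, b in high) / len(high)
    frac_low_up = sum(b > a for a, b in low) / len(low)
    return frac_high_down, frac_low_up


frac_high_down, frac_low_up = simulate_two_days()
print(f"high on day 1, lower on day 2: {frac_high_down:.0%}")
print(f"low on day 1, higher on day 2: {frac_low_up:.0%}")
```

Because every student has the same ABILITY here (a fair coin), any day-1 difference is RANDOM CHANCE, so most high scorers drop and most low scorers improve on day 2.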
47
Let’s look at another example using batting averages of MLB players. Each dot represents a player and his batting average PERFORMANCES in 2008 and 2009. The vertical lines at 0.260 and 0.300 are to identify what is commonly accepted as bad and good hitting PERFORMANCES. Players above the line y=x are players that improved from 2008 to 2009, players below the line are players that regressed from 2008 to 2009, and players on the line are players that kept the same average.
48
18 of the 24 players who hit over 0.300 in 2008 had worse PERFORMANCES in 2009. 10 of the 15 players who hit worse than 0.260 in 2008 had better PERFORMANCES in 2009. The PERFORMANCES of both groups regressed to the mean. In other words, players who were great in 2008 tended to be closer to the average of all players in 2009, and players who were not so great in 2008 also tended to be closer to the average of all players in 2009.
49
Let’s look at the same scatterplot with the least-squares regression line. The least-squares regression line predicts that players that had bad PERFORMANCES in 2008 will improve, but will still be below average. For example, if a player hit 0.220 in 2008, they will be predicted to hit 0.133+0.52(0.220) which is 0.247. This is still below average, but is better than 0.220.
50
Likewise, it predicts players who had good PERFORMANCES in 2008 will do worse in 2009, but still be above average. For example, if a player hit 0.350 in 2008, they will be predicted to hit 0.133+0.52(0.350) which is 0.315. This is still above average, but is not as good as 0.350.
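The two batting average predictions can be checked with a few lines of Python using the equation from the slides, ŷ = 0.133 + 0.52x. The sketch also computes the average at which the line predicts no change, found by solving x = 0.133 + 0.52x; that break-even calculation is an addition for illustration and is not on the slides.

```python
INTERCEPT = 0.133  # from the slides
SLOPE = 0.52       # from the slides


def predict_2009_average(avg_2008):
    """Predicted 2009 batting average from the 2008 batting average."""
    return INTERCEPT + SLOPE * avg_2008


low = predict_2009_average(0.220)   # below-average hitter in 2008
high = predict_2009_average(0.350)  # above-average hitter in 2008

# Average at which the prediction equals the input: x = a / (1 - b)
break_even = INTERCEPT / (1 - SLOPE)

print(f"0.220 in 2008 -> {low:.3f} predicted for 2009")
print(f"0.350 in 2008 -> {high:.3f} predicted for 2009")
print(f"no predicted change at about {break_even:.3f}")
```

Hitters below the break-even average are predicted to improve and hitters above it are predicted to decline, which is the regression to the mean pattern described in the slides.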
51
In general, when the association is strong in a scatterplot, there will be less regression to the mean. Likewise, when the association is weak in a scatterplot, there will be more regression to the mean.
52
The example below shows the strikeout rates for pitchers in 2008 and 2009. There is a very strong association (r=0.76), so there is very little regression to the mean. The example below shows the winning percentage for pitchers in 2008 and 2009. There is a weak association (r=0.31), so there is much more regression to the mean.
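One way to see why a stronger association means less regression to the mean is a general property of least-squares lines: measured in standard deviations from the mean, the predicted y value is r times the x value. This property is a standard regression fact and is not stated on the slides. The Python sketch applies it to the two correlations reported for pitchers, using a hypothetical pitcher who was 2 standard deviations above average in 2008.

```python
def predicted_z(r, z_x):
    """Predicted z-score of y for an x value with z-score z_x."""
    return r * z_x


Z_2008 = 2.0  # hypothetical: 2 standard deviations above average in 2008

strikeout_z_2009 = predicted_z(0.76, Z_2008)  # r for strikeout rate (slides)
win_pct_z_2009 = predicted_z(0.31, Z_2008)    # r for winning percentage (slides)

print(f"strikeout rate: predicted {strikeout_z_2009:.2f} SDs above average")
print(f"winning percentage: predicted {win_pct_z_2009:.2f} SDs above average")
```

The prediction for winning percentage falls much closer to average than the prediction for strikeout rate, matching the slides' point that a weaker association goes with more regression to the mean.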
53
There are many sports examples of regression to the mean. One of the most famous is the sophomore slump, in which rookies who had great seasons go on to have worse PERFORMANCES the next year. A study from 2004 found that of the 112 Rookie of the Year award winners in MLB (Henry Rowengartner wasn’t one of them), 63.4% PERFORMED worse in their second year, while 33% had better PERFORMANCES. About 3.6% stayed the same. The explanation is that these sophomores regress to their ABILITY after benefitting from RANDOM CHANCE the previous season. Other examples are the Sports Illustrated and Madden jinxes.
54
Regression to the mean can be used to make decisions. Think about stocks. People always say to “buy low and sell high.” The same reasoning can be used in sports. A team or fantasy owner can consider trading a player who is PERFORMING above his ABILITY while the trade value is high. Furthermore, a team should consider acquiring players who are PERFORMING below their ABILITY because they will tend to be bargains.