Moneyball in the Classroom Using Baseball to Teach Statistics Josh Tabor Canyon del Oro High School joshtabor@hotmail.com

Objectives
By the end of the session, participants will:
Obtain several classroom-tested examples that promote the real-world applications of mathematics and help students meet the Common Core State Standards
Understand that the goal of a model should be to minimize the size of prediction errors
Understand the properties of least-squares regression lines and how to interpret the slope and intercept
Understand the concept of regression to the mean and what it reveals about future performances

Move over, Brad Pitt, here is the real star of Moneyball:
$\text{Predicted winning percentage} = \dfrac{RS^2}{RS^2 + RA^2}$
Created by Bill James and called the “Pythagorean” expected winning percentage, this formula uses a team’s runs scored (RS) and runs allowed (RA) to predict its winning percentage. Does it work? Why did he use 2 for the exponent instead of some other value?

In 2012, the Oakland A’s scored 713 runs, allowed 614 runs, and won 94 games. According to the Pythagorean formula, a team with this many runs scored and runs allowed would be expected to win about 57.4% of their games. In a 162-game season, this is 0.574(162) = 92.99 expected wins. This means that Oakland won 94 – 92.99 = 1.01 more games than expected, based on their runs scored and allowed.
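The Oakland calculation can be checked with a few lines of Python. This is a minimal sketch: the function name `pythagorean_win_pct` and its `exponent` parameter are illustrative choices, not part of the original slides, and the winning percentage is rounded to three decimals before multiplying only to mirror the arithmetic shown above.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expected winning percentage: RS^k / (RS^k + RA^k)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# 2012 Oakland A's figures quoted in the slide above
win_pct = pythagorean_win_pct(713, 614)

# The slide rounds to 0.574 before multiplying by 162 games
expected_wins = round(win_pct, 3) * 162
residual = 94 - expected_wins  # actual wins minus predicted wins

print(f"{win_pct:.3f}")        # 0.574
print(f"{expected_wins:.2f}")  # 92.99
print(f"{residual:.2f}")       # 1.01
```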

The difference between an actual value and a predicted value is called a residual:
residual = actual value – predicted value
In the Common Core State Standards, our students are expected to “informally assess the fit of a function by plotting and analyzing residuals” (S-ID-6.b).

Here is a partial table showing how the formula worked for other teams:

Team   RS   RA  Wins  Predicted Wins   Residual
ARI   734  688    81         86.235    -5.23503
ATL   700  600    94        93.3882    0.611765
BAL   712  705    93        81.8003     11.1997
BOS   734  806    69        73.4425    -4.44249
CHC   613  759    61         63.954    -2.95396
CHW   748  676    85        89.1701    -4.17012
CIN   669  588    97         91.396     5.60403
CLE   667  845    68        62.1893     5.81073
COL   758  890    64         68.107    -4.10699
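The predicted wins and residuals for these nine teams can be recomputed from runs scored and runs allowed. The sketch below assumes a 162-game season for every team; the names `TEAMS`, `predicted_wins`, and `residuals` are made up for this illustration.

```python
GAMES = 162  # assumed season length for every team

# (team, runs scored, runs allowed, actual wins), copied from the partial table
TEAMS = [
    ("ARI", 734, 688, 81),
    ("ATL", 700, 600, 94),
    ("BAL", 712, 705, 93),
    ("BOS", 734, 806, 69),
    ("CHC", 613, 759, 61),
    ("CHW", 748, 676, 85),
    ("CIN", 669, 588, 97),
    ("CLE", 667, 845, 68),
    ("COL", 758, 890, 64),
]

def predicted_wins(rs, ra, exponent=2.0):
    """Expected wins over a full season from the Pythagorean formula."""
    return GAMES * rs**exponent / (rs**exponent + ra**exponent)

# residual = actual wins - predicted wins
residuals = {team: wins - predicted_wins(rs, ra) for team, rs, ra, wins in TEAMS}

for team, resid in residuals.items():
    print(f"{team}  {resid:+.2f}")
```

A positive residual means the team won more games than its runs scored and allowed would suggest; a negative residual means it won fewer.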

So, why did Bill James use 2 for the exponent? Will another value for the exponent work better? Here is a partial table using 1 for the exponent. Does this model work better?

Team   RS   RA  Wins  Predicted Wins   Residual
ARI   734  688    81        83.6203    -2.62025
ATL   700  600    94        87.2308     6.76923
BAL   712  705    93        81.4001     11.5999
BOS   734  806    69         77.213    -8.21299
CHC   613  759    61        72.3805    -11.3805
CHW   748  676    85        85.0955    -0.09551
CIN   669  588    97        86.2196     10.7804
CLE   667  845    68        71.4643    -3.46429
COL   758  890    64        74.5121    -10.5121

Which model is better? In general, we prefer models that produce smaller residuals. To compare these two models, we can compare the sum of squared residuals (SSR).
For an exponent of 2, $SSR = (-5.2)^2 + (0.6)^2 + \dots = 411$
For an exponent of 1, $SSR = (-2.6)^2 + (6.8)^2 + \dots = 1300$
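The comparison can be sketched in Python using only the nine teams from the partial tables. Because the totals of 411 and 1300 presumably come from the full set of teams, the nine-team sums printed here will be smaller, but the ordering of the two models is the point. The helper name `ssr` and the 162-game assumption are illustrative.

```python
GAMES = 162  # assumed season length

# (runs scored, runs allowed, actual wins) for the nine teams in the partial tables
TEAMS = [
    (734, 688, 81), (700, 600, 94), (712, 705, 93),
    (734, 806, 69), (613, 759, 61), (748, 676, 85),
    (669, 588, 97), (667, 845, 68), (758, 890, 64),
]

def ssr(exponent):
    """Sum of squared residuals (actual - predicted wins) for a given exponent."""
    total = 0.0
    for rs, ra, wins in TEAMS:
        predicted = GAMES * rs**exponent / (rs**exponent + ra**exponent)
        total += (wins - predicted) ** 2
    return total

ssr_2 = ssr(2.0)
ssr_1 = ssr(1.0)
print(f"exponent 2: SSR = {ssr_2:.1f}")
print(f"exponent 1: SSR = {ssr_1:.1f}")
print("exponent 2 fits better" if ssr_2 < ssr_1 else "exponent 1 fits better")
```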

The best model is the one that produces the smallest sum of squared residuals (SSR). This is called the least-squares criterion. Here is a scatterplot showing different exponents from 1 to 3 along with their corresponding SSR. Which exponent looks best?
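The search over exponents can be imitated with a simple grid from 1 to 3 in steps of 0.1. This is only a sketch on the nine teams listed earlier, so the exponent it selects need not match the one the full-league scatterplot suggests; the grid spacing and the names `ssr`, `grid`, and `best_exponent` are assumptions made for the example.

```python
GAMES = 162  # assumed season length

# (runs scored, runs allowed, actual wins) for the nine teams in the partial tables
TEAMS = [
    (734, 688, 81), (700, 600, 94), (712, 705, 93),
    (734, 806, 69), (613, 759, 61), (748, 676, 85),
    (669, 588, 97), (667, 845, 68), (758, 890, 64),
]

def ssr(exponent):
    """Sum of squared residuals for the Pythagorean model with this exponent."""
    return sum(
        (wins - GAMES * rs**exponent / (rs**exponent + ra**exponent)) ** 2
        for rs, ra, wins in TEAMS
    )

# Candidate exponents 1.0, 1.1, ..., 3.0
grid = [1 + i / 10 for i in range(21)]

# Least-squares criterion: keep the exponent with the smallest SSR
best_exponent = min(grid, key=ssr)

for k in grid:
    marker = "  <-- smallest SSR" if k == best_exponent else ""
    print(f"exponent {k:.1f}: SSR = {ssr(k):8.1f}{marker}")
```

Plotting the printed pairs (exponent on the horizontal axis, SSR on the vertical axis) gives the same kind of picture as the scatterplot described above.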

Interestingly, there is a different “ideal” exponent for each sport. (Class activity alert!) For example, here is a scatterplot showing different exponents and SSR for NBA teams in 2009:

Part 2: Modeling Runs Scored
Now that we understand how to use runs scored and runs allowed to model predicted winning percentage, how can we model runs scored and runs allowed? Using team data from the 2012 season, we can look for variables that have a strong relationship with runs scored. Here is a scatterplot showing hits vs. runs scored for the 30 teams:

Because the association appears linear, we should use a line to model the relationship between hits and runs scored. But, which line is best? Time for Fathom….

The “best” line is the one that makes the sum of squared residuals the least. Not surprisingly, it is called the least-squares regression line. Here is the scatterplot again, along with the least-squares regression line:
predicted RS = -79 + 0.556(hits)
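For readers without Fathom, the least-squares line can be computed directly from the standard formulas: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. The sketch below uses four made-up points (not baseball data) purely so the arithmetic is easy to follow, and then confirms that a nearby competing line has a larger SSR. The function names are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def ssr(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, chosen only for easy arithmetic
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

intercept, slope = least_squares_line(xs, ys)
fitted_ssr = ssr(xs, ys, intercept, slope)
rival_ssr = ssr(xs, ys, 0.0, 2.0)  # a plausible-looking rival line, y = 2x

print(f"least-squares line: y = {intercept:.2f} + {slope:.2f}x, SSR = {fitted_ssr:.2f}")
print(f"rival line:         y = 0.00 + 2.00x, SSR = {rival_ssr:.2f}")
```

With real team data, `xs` would hold each team's hits and `ys` its runs scored.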

CCSS: S-ID-7: Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.
The slope of the least-squares regression line is 0.556. How do we interpret this value? What about the intercept?
Slope: For each additional hit, the predicted number of runs scored increases by 0.556.
Intercept: If a team had 0 hits for the season, the predicted number of runs scored would be -79. Is this realistic? Why not?

Suppose that Oakland has a chance to improve at one position and can expect to have 40 more hits. How many wins is that worth, assuming the performances of other players stay the same? For each additional hit, we predict 0.556 more runs. So, 40 additional hits are worth 40(0.556) = 22.24 more runs. This means Oakland would score 735.24 runs instead of 713. Using the Pythagorean formula with 735.24 runs scored and 614 runs allowed, the predicted winning percentage is about 58.9%, and 58.9% of 162 is 95.42 wins. This means 2.43 additional expected wins (95.42 – 92.99 = 2.43).
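The chain from hits to runs to wins can be written out in Python. This sketch does not round the winning percentages along the way, so it gives roughly 2.4 additional wins rather than exactly the 2.43 obtained above from the rounded percentages. The constant and function names are illustrative.

```python
RUNS_PER_HIT = 0.556  # slope of the least-squares line: predicted runs per extra hit
GAMES = 162

def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expected winning percentage."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

runs_scored, runs_allowed = 713, 614  # Oakland, 2012
extra_hits = 40

new_runs = runs_scored + extra_hits * RUNS_PER_HIT
wins_before = GAMES * pythagorean_win_pct(runs_scored, runs_allowed)
wins_after = GAMES * pythagorean_win_pct(new_runs, runs_allowed)
extra_wins = wins_after - wins_before

print(f"runs scored: {runs_scored} -> {new_runs:.2f}")
print(f"expected wins: {wins_before:.2f} -> {wins_after:.2f}")
print(f"additional expected wins: {extra_wins:.2f}")
```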

Which variable does the best job of modeling runs scored? Here are some scatterplots:

The best model is the one with the smallest sum of squared residuals (SSR). Here is a table showing the SSR when predicting runs scored using the following variables:

Variable              SSR
Hits                  40,603
Home runs             56,830
On-base percentage    37,138
Slugging average      14,237
OPS                   10,109

Part 3: Modeling Runs Allowed
Modeling runs allowed is much more difficult. However, sabermetricians have been making good progress in the last decade after a revolutionary discovery by Voros McCracken. He demonstrated that a pitcher has very little (if any) control over what happens to a ball once it is hit. BABIP (batting average on balls in play) is a measure of what happens during at-bats that don’t end in strikeouts, walks, or home runs. McCracken showed that a pitcher's BABIP is essentially random from year to year.

Here is a scatterplot showing the BABIP for pitchers in two consecutive years (2008 and 2009):

Because the outcome of batted balls is basically random, McCracken suggested that the best way to model runs allowed is to use variables that pitchers do have control over, such as strikeout rate, walk rate, and home run rate. Here is a scatterplot of strikeout rate in 2008 and 2009 for these same pitchers:

Part 4: Regression to the Mean
“It’s difficult to make predictions, especially about the future.” –Yogi Berra
So far, we have been investigating relationships between variables within the same season. What teams really want to know is how to make predictions about what will happen next year. Before we do that, let’s flip some coins…

Here is a scatterplot showing the outcomes of two sets of 10 coin flips, along with the line y = x. If we know a flipper did well the first time, what should we predict will happen the second time? What if a flipper did poorly the first time?
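The coin-flipping activity can be simulated when there is no time to flip real coins. The sketch below assumes 10,000 imaginary flippers, fair coins, and cutoffs of 7 or more heads for "did well" and 3 or fewer for "did poorly"; all of these choices, along with the fixed random seed, are assumptions for the illustration.

```python
import random

rng = random.Random(2012)  # fixed seed so the run is repeatable

def heads_in_ten(rng):
    """Number of heads in 10 flips of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(10))

# Each flipper gets two independent sets of 10 flips
flippers = [(heads_in_ten(rng), heads_in_ten(rng)) for _ in range(10_000)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

did_well = [(first, second) for first, second in flippers if first >= 7]
did_poorly = [(first, second) for first, second in flippers if first <= 3]

well_first = mean(first for first, _ in did_well)
well_second = mean(second for _, second in did_well)
poor_first = mean(first for first, _ in did_poorly)
poor_second = mean(second for _, second in did_poorly)

print(f"did well first time:   {well_first:.2f} heads, then {well_second:.2f}")
print(f"did poorly first time: {poor_first:.2f} heads, then {poor_second:.2f}")
```

Since coin flipping involves no skill, both groups should average close to 5 heads the second time, no matter how they did the first time.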

Here again is the scatterplot of BABIP for two consecutive years, including the line y = x. If a pitcher had a bad (high) BABIP in 2008, what can we expect to happen the following year? Which players should a poor team like Oakland try to sign?

Now, let’s look at hitters in two consecutive years. Here is a scatterplot showing batting average in 2008 and 2009, along with the line y = x. Do we see the same thing?

Now, here is the same scatterplot with the least-squares regression line added as well. The line predicts that players who were above average in 2008 will be good, but not quite as good in 2009. Likewise, it predicts that players who were below average in 2008 will be bad, but not quite as bad in 2009. This is regression to the mean.

What causes regression to the mean? In sports, performance = ability + random chance. A good performance is usually a combination of good ability and good luck. In future performances, the good luck is unlikely to continue, even if the player's ability stays the same. This explains the SI Jinx and the Madden Curse.
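The model performance = ability + random chance can be simulated to see regression to the mean appear. In this sketch, ability and luck are both drawn from a standard normal distribution in arbitrary units; that equal weighting, the 10,000 simulated players, and the "top 10%" cutoff are assumptions for illustration, not estimates from baseball data.

```python
import random

rng = random.Random(466)  # fixed seed so the run is repeatable
N = 10_000

# Ability stays the same in both years; luck is drawn fresh each year
ability = [rng.gauss(0.0, 1.0) for _ in range(N)]
year1 = [a + rng.gauss(0.0, 1.0) for a in ability]
year2 = [a + rng.gauss(0.0, 1.0) for a in ability]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# The top 10% of performers in year 1
top = sorted(range(N), key=lambda i: year1[i], reverse=True)[: N // 10]

overall_mean = mean(year2)
top_year1 = mean(year1[i] for i in top)
top_year2 = mean(year2[i] for i in top)

print(f"overall average in year 2:  {overall_mean:+.2f}")
print(f"top group, year 1 average:  {top_year1:+.2f}")
print(f"top group, year 2 average:  {top_year2:+.2f}")
```

The top group is still above average in year 2 because its ability is real, but it falls back toward the overall mean because its year-1 luck does not carry over.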

This also applies to student performance on tests, especially multiple-choice tests: a good performance one year is likely due to good ability and good luck. What is likely to happen next year? What about an intervention class for students with low scores the previous year? Understanding regression to the mean is vital for making predictions about the future.
Evaluations: Session #466
