A Derivation of Bill James' Pythagorean Won-Loss Formula


What is Sabermetrics?
The term "sabermetrics," coined by noted baseball analyst Bill James, comes from the acronym for the Society for American Baseball Research, or SABR. James unofficially defined sabermetrics as "the search for objective knowledge about baseball." Wolfram defines sabermetrics as the study of baseball statistics.

Bill James: Godfather of Sabermetrics
Bill James is a baseball historian, writer, and statistician who was one of the first pioneers of sabermetrics and has been its most influential practitioner since the discipline began. He started his work in sabermetrics in the early 1970s. Though his work was unpopular at the time, his influence has spread, and many of his ideas and statistical inventions are in common use in baseball (as well as other sports) today. He is currently the Senior Operations Advisor for the Boston Red Sox, and in 2006 was named one of Time magazine's 100 Most Influential People.

James' Pythagorean Won-Loss Record
This formula gives what a baseball team's overall winning percentage SHOULD have been, based on the number of runs scored and runs allowed. Statistically speaking, it gives an expected value for a team's winning percentage as a function of the team's runs scored and runs allowed. The formula was named Pythagorean W-L because it reminded James of the Pythagorean theorem.
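The formula itself is not reproduced in the slide text. In its standard form, the expected winning percentage is RS^2 / (RS^2 + RA^2), where RS is runs scored and RA is runs allowed. A minimal Python sketch of that calculation follows; the function names, the `exponent` parameter, and the 800/700 run totals are illustrative assumptions, not values from the slides.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed.

    exponent=2.0 gives the classic "relationship of squares"; it is
    exposed as a parameter only so other exponents can be tried.
    """
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    if runs_scored == 0 and runs_allowed == 0:
        raise ValueError("at least one run total must be positive")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def expected_wins(runs_scored, runs_allowed, games=162):
    """Scale the expected winning percentage to a win total over `games`."""
    return games * pythagorean_expectation(runs_scored, runs_allowed)


if __name__ == "__main__":
    # Hypothetical team (not from the slides): 800 runs scored, 700 allowed.
    pct = pythagorean_expectation(800, 700)
    print(f"expected W-L%: {pct:.3f}")
    print(f"expected wins over 162 games: {expected_wins(800, 700):.1f}")
```

A team that scores exactly as many runs as it allows comes out at .500, which is a quick sanity check on any implementation.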

The Pythagorean formula is often used in the middle of a baseball season to estimate how a team will finish the season, or at the end of the season for a reasonable guess at next year's W-L record. Here are a few interesting examples from this season:

On August 4th, the 2008 Texas Rangers had a W-L% of .522, with a Pythagorean expectation of .478. They finished the season at .488.

On July 20th, the 2008 Cleveland Indians had a W-L% of .443, with a Pythagorean expectation of .505. They finished the season at .500.

On July 20th, the 2008 Toronto Blue Jays had a W-L% of .490, with a Pythagorean expectation of .531. They finished the season at .531.

The lesson here is that a team's luck will usually catch up with it over the course of a 162-game season. Of course, there are always exceptions:

On July 20th, the 2008 Anaheim Angels had a W-L% of .612, with a Pythagorean expectation of .541. They finished the season at .617.

James' Derivation of the Pythagorean Formula
Bill James' discovery of this formula was, by his own admission, lucky. In response to an email that I sent him asking about his methods for deriving the formula, he responded:

"Mostly luck. I had been experimenting with the data and had several other good formulas for data within 1 standard deviation of the mean. However, many of them were complicated, and they returned absurd answers in extreme cases. But one day, as I was walking across campus at the University of Kansas, it hit me: it was a simple relationship of squares. This presented a much better fit to the data, and was much more elegant."

James' Derivation of the Pythagorean Formula
James' formula for predicting a baseball team's winning percentage worked beautifully, despite the fact that its derivation had little basis in statistical theory. A paper published by Steven J. Miller (then an Associate Professor of Mathematics at Brown University) showed that, under reasonable statistical assumptions about a baseball team's runs scored and runs allowed, James' Pythagorean formula follows mathematically.

Assumptions

Runs scored and runs allowed can be approximated by continuous random variables.
"In order to obtain simple closed-form expressions for the probability of scoring more runs than allowing in a game, we assume that the runs scored and runs allowed are drawn from continuous and not discrete distributions. This allows us to replace discrete sums with continuous integrals... Of course assumptions of continuous run distribution cannot be correct in baseball, but the hope is that such a computationally useful assumption is a reasonable approximation to reality."

Runs scored and runs allowed can be modeled by continuous Weibull distributions.
"[The Weibull's flexible shape parameters] make it much easier to fit the observed baseball data with a Weibull distribution than with some of the better known distributions. Further, the exponential decays too slowly to be realistic; it leads to too many games with large scores. By choosing our parameters appropriately, a Weibull has a much more realistic decay..."

Runs scored and runs allowed are statistically independent.
"In a baseball game, runs scored and runs allowed cannot be entirely independent, as games do not end in ties... Modified chi-squared tests do show that, given that runs scored and runs allowed must be distinct integers, the runs scored and runs allowed per game are statistically independent."

The Weibull Distribution
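The density shown on this slide did not survive in the transcript. For reference, the standard three-parameter Weibull density can be written as below. The symbol names are an assumption chosen to match the notation commonly used for this derivation: α > 0 is the scale, β is the shift (the left endpoint of the support), and γ > 0 is the shape.

```latex
f(x;\alpha,\beta,\gamma) =
\begin{cases}
\dfrac{\gamma}{\alpha}\left(\dfrac{x-\beta}{\alpha}\right)^{\gamma-1}
\exp\!\left[-\left(\dfrac{x-\beta}{\alpha}\right)^{\gamma}\right], & x \ge \beta,\\[1ex]
0, & \text{otherwise.}
\end{cases}
```

With γ = 1 this reduces to a shifted exponential density, the slowly decaying case that the Weibull's extra shape parameter is meant to improve on.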

Remark on the Weibull Distribution parameters

Statement of the Theorem:
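The statement on this slide was lost in extraction. What follows is a reconstruction of the result as it is usually stated for this Weibull model, not the slide's exact wording. Here RS and RA denote average runs scored and allowed per game, and α_RS, α_RA, β, γ are the Weibull scale, shift, and shape parameters (notation assumed).

```latex
\textbf{Theorem.} Let the runs scored $X$ and runs allowed $Y$ per game be
independent random variables drawn from Weibull distributions with parameters
$(\alpha_{RS},\beta,\gamma)$ and $(\alpha_{RA},\beta,\gamma)$ respectively,
where $\alpha_{RS}$ and $\alpha_{RA}$ are chosen so that the means are
$RS$ and $RA$. If $\gamma > 0$, then
\[
P(X > Y) \;=\;
\frac{(RS-\beta)^{\gamma}}{(RS-\beta)^{\gamma} + (RA-\beta)^{\gamma}} .
\]
```

Taking γ = 2 and ignoring the shift β recovers the familiar ratio of squares.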

The Joint PDF
Since a team's winning percentage is the probability that it will score more runs than it allows, we want to find P(X>Y), where X is runs scored and Y is runs allowed. Since this probability depends jointly on X and Y, we use a joint probability density function:
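The expression from the slide is missing in the transcript. In general terms, the probability is the integral of the joint density over the region where x exceeds y. Writing f_{X,Y} for the joint density (a symbol assumed here), this is:

```latex
P(X > Y)
\;=\; \iint_{\{x > y\}} f_{X,Y}(x,y)\,dy\,dx
\;=\; \int_{-\infty}^{\infty}\int_{-\infty}^{x} f_{X,Y}(x,y)\,dy\,dx .
```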

Independence of Random Variables
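The slide content is missing here, so the following is a sketch of the step independence makes possible, under assumed notation: f_X and f_Y are Weibull densities with scales α_RS and α_RA, common shift β, and common shape γ. Independence lets the joint density factor into a product, and the resulting integral can be evaluated in closed form with the substitution u = (x − β)^γ.

```latex
\begin{align*}
f_{X,Y}(x,y) &= f_X(x)\,f_Y(y), \\
P(X > Y) &= \int_{\beta}^{\infty} f_X(x)\left[\int_{\beta}^{x} f_Y(y)\,dy\right]dx \\
&= \int_{\beta}^{\infty} f_X(x)
   \left[1 - e^{-\left((x-\beta)/\alpha_{RA}\right)^{\gamma}}\right]dx \\
&= 1 - \int_{\beta}^{\infty}
   \frac{\gamma}{\alpha_{RS}}\left(\frac{x-\beta}{\alpha_{RS}}\right)^{\gamma-1}
   e^{-\left((x-\beta)/\alpha_{RS}\right)^{\gamma}
      -\left((x-\beta)/\alpha_{RA}\right)^{\gamma}}\,dx \\
&= 1 - \frac{\alpha_{RS}^{-\gamma}}{\alpha_{RS}^{-\gamma} + \alpha_{RA}^{-\gamma}}
 \;=\; \frac{\alpha_{RS}^{\gamma}}{\alpha_{RS}^{\gamma} + \alpha_{RA}^{\gamma}} .
\end{align*}
```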

Expected Values of the RVs X and Y
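The slide content is missing, so this is a sketch under assumed notation. A shifted Weibull with scale α, shift β, and shape γ has mean αΓ(1 + 1/γ) + β. Setting the mean equal to the average runs scored RS gives α_RS = (RS − β)/Γ(1 + 1/γ), and likewise for RA. Substituting these scales into α_RS^γ / (α_RS^γ + α_RA^γ) cancels the Gamma factors and leaves (RS − β)^γ / ((RS − β)^γ + (RA − β)^γ). The Python simulation below checks that claim numerically. The function names are hypothetical, and the default values β = −0.5 and γ = 1.82 and the per-game averages 5.0 and 4.5 are illustrative assumptions, not figures from the slides.

```python
import math
import random


def weibull_scale(mean, beta=-0.5, gamma=1.82):
    """Scale alpha for which a shifted Weibull(alpha, beta, gamma) has this mean."""
    return (mean - beta) / math.gamma(1.0 + 1.0 / gamma)


def pythagorean_weibull(rs, ra, beta=-0.5, gamma=1.82):
    """Closed-form P(X > Y) for independent shifted Weibulls with means rs, ra."""
    a = (rs - beta) ** gamma
    b = (ra - beta) ** gamma
    return a / (a + b)


def simulate_win_fraction(rs, ra, beta=-0.5, gamma=1.82, games=200_000, seed=0):
    """Monte Carlo estimate of P(X > Y) under the same model."""
    rng = random.Random(seed)
    alpha_rs = weibull_scale(rs, beta, gamma)
    alpha_ra = weibull_scale(ra, beta, gamma)
    wins = 0
    for _ in range(games):
        # random.weibullvariate(scale, shape) samples an unshifted Weibull;
        # adding beta applies the shift.
        x = beta + rng.weibullvariate(alpha_rs, gamma)
        y = beta + rng.weibullvariate(alpha_ra, gamma)
        if x > y:
            wins += 1
    return wins / games


if __name__ == "__main__":
    rs, ra = 5.0, 4.5  # hypothetical runs scored / allowed per game
    print("closed form:", round(pythagorean_weibull(rs, ra), 4))
    print("simulated:  ", round(simulate_win_fraction(rs, ra), 4))
```

The two printed numbers should agree to within sampling error, which shrinks as `games` grows.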