The statistics behind the game

Baseball Findings: The statistics behind the game
Harlan Thompson, Sungjin Cho, Ryan Fagan

An Introduction Throughout its long history, baseball has been the subject of many statistical studies. It lends itself well to statistics because very careful records are kept of everything that happens in every game. The topics that have been studied range from the effect of interleague play on team standings to the role of chance in streaks and slumps. Other topics of study include records and predicting the outcomes of games. We thought that looking at home runs and salary would be interesting because the great number of home runs hit and the inflation of salaries are both controversial topics.

Home Runs Per Year: How has the total number of home runs in major league baseball changed from year to year?

Test #1 We ran a regression with the year as the independent variable and the number of home runs per game as the dependent variable to find out the rate at which the number of home runs in the league is increasing.

Scatterplot

Results

      Source |       SS       df       MS              Number of obs =      25
       Model |  .911664082     1  .911664082           Prob > F      =  0.0001
    Residual |  .966758514    23  .042032979           R-squared     =  0.4853
-------------+------------------------------           Adj R-squared =  0.4630
       Total |   1.8784226    24  .078267608           Root MSE      =  .20502

------------------------------------------------------------------------------
          hr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .0264817   .0056862     4.66   0.000     .0147189    .0382445
       _cons |   1.350108    .079607    16.96   0.000     1.185428    1.514787

Interpretation The 95% confidence interval for the coefficient of year lies entirely above zero, so the upward trend in home runs is statistically significant at the 5% level. The positive coefficient gives the direction of the relationship; the R2 value of .4853 means that year explains just under half of the variation in home runs per game, so the fit is moderate rather than strong. This could be because many other factors can affect the number of home runs hit: weather, injuries to certain players, etc. The coefficient of year is .0264817, so each year about .02648 more home runs are hit in each game. Assuming a 162-game schedule, this works out to over 4 more home runs per year (.02648 × 162 ≈ 4.3).
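The summary figures in the regression output can be cross-checked from its own sums of squares. The following is a minimal sketch, not the authors' procedure: the variable names are ours, and the t critical value for 23 degrees of freedom is taken from a standard t table rather than computed.

```python
import math

# Values copied from the Stata output for the regression of hr on year.
ss_model, ss_resid = 0.911664082, 0.966758514
df_model, df_resid = 1, 23
coef, std_err = 0.0264817, 0.0056862

# R-squared is the share of the total sum of squares explained by the model.
r_squared = ss_model / (ss_model + ss_resid)

# Root MSE is the square root of the residual mean square.
ms_resid = ss_resid / df_resid
root_mse = math.sqrt(ms_resid)

# With one regressor, the overall F statistic equals the squared t statistic.
f_stat = (ss_model / df_model) / ms_resid
t_stat = coef / std_err

# 97.5th percentile of Student's t with 23 df (assumption: from a t table).
T_CRIT_23 = 2.069
ci_low = coef - T_CRIT_23 * std_err
ci_high = coef + T_CRIT_23 * std_err

print(f"R-squared = {r_squared:.4f}, Root MSE = {root_mse:.5f}")
print(f"t = {t_stat:.2f}, F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
print(f"95% CI for year: ({ci_low:.4f}, {ci_high:.4f})")
```

The recomputed values should agree with the output to the precision shown there; a mismatch would point to a transcription error in the table.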

Test #2 We split the home run data into two groups, 1976-1987 and 1988-1999. Then we ran a hypothesis test on the two groups to find out whether their variances are equal, to determine whether or not we could use a paired t test on the data. We used the following hypotheses:
H0: var(HR(’76-’87)) = var(HR(’88-’99))
HA: var(HR(’76-’87)) ≠ var(HR(’88-’99))

Results

Variable |  Obs       Mean   Std. Err.   Std. Dev.    [95% Conf. Interval]
---------+-----------------------------------------------------------------
     hr1 |   12   1.584108    .073678     .255228     1.421944    1.746272
     hr2 |   12      1.775   .0807902    .2798653     1.597182    1.952818
   Comb. |   24   1.679554   .0570527    .2794998     1.561532    1.797577

Ho: sd(hr1) = sd(hr2)
F(11,11) observed   = F_obs           = 0.832
F(11,11) lower tail = F_L = F_obs     = 0.832
F(11,11) upper tail = F_U = 1/F_obs   = 1.202

Critical values at .05 significance level: (.288, 3.47)
Because the F statistic does not lie outside of this region, we cannot reject the null hypothesis!!
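The observed F statistic is the ratio of the two sample variances, which can be recomputed from the standard deviations in the output. This is a small sketch with our own variable names; the critical values are the ones reported on the slide, not recomputed here.

```python
# Sample standard deviations of home runs per game, from the output above.
sd_hr1 = 0.255228   # 1976-1987
sd_hr2 = 0.2798653  # 1988-1999

# Variance-ratio statistic with (11, 11) degrees of freedom.
f_obs = (sd_hr1 / sd_hr2) ** 2
f_upper_tail = 1 / f_obs

# Two-sided critical values at the .05 level, as reported on the slide.
crit_low, crit_high = 0.288, 3.47
reject_equal_variances = not (crit_low < f_obs < crit_high)

print(f"F_obs = {f_obs:.3f}, 1/F_obs = {f_upper_tail:.3f}")
print("reject H0" if reject_equal_variances else "cannot reject H0")
```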

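Interpretation The variance in home runs per game did not differ significantly between 1976-1987 and 1988-1999. We therefore went on to compare these two sets of data with a paired t test to determine whether or not the number of home runs hit has increased. (Strictly speaking, equal variances is an assumption of the pooled two-sample t test; a paired test works on the differences between matched observations and does not require it.)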

Test #3 Because we found that the two groups did not have an appreciable difference in variance, we used a paired t test to determine whether or not the number of home runs hit per game has risen from the period 1976-1987 to the period 1988-1999. So we ran a hypothesis test on the two groups with the following hypotheses:
H0: mean HR(’76-’87) = mean HR(’88-’99)
HA: mean HR(’76-’87) ≠ mean HR(’88-’99)

Results

Paired t test

Variable |  Obs        Mean   Std. Err.   Std. Dev.    [95% Conf. Interval]
---------+------------------------------------------------------------------
     hr1 |   12    1.584108    .073678     .255228     1.421944    1.746272
     hr2 |   12       1.775   .0807902    .2798653     1.597182    1.952818
    diff |   12   -.1908917   .0665798     .230639    -.3374327   -.0443506

Ho: mean(hr1 - hr2) = mean(diff) = 0

Ha: mean(diff) < 0       Ha: mean(diff) ~= 0       Ha: mean(diff) > 0
     t = -2.8671              t = -2.8671               t = -2.8671
 P < t =  0.0077        P > |t| =  0.0153           P > t =  0.9923

Interpretation The mean for the years from 1976 to 1987 was 1.584108 HR/game vs. 1.775 HR/game from 1988 to 1999. We can reject our null hypothesis because we found t = -2.8671, which is beyond the two-sided 5% critical value of -2.201 for 11 degrees of freedom (and also beyond the normal value of -1.96). The two-sided p-value is .0153, so a difference this large would arise by chance only about 1.5% of the time if the two means were really equal. Therefore, the mean number of home runs per game from 1988 to 1999 was significantly greater than the mean number from ’76 to ’87. So, the number of home runs per game does seem to be increasing over time.
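The paired t statistic depends only on the mean and standard deviation of the 12 differences, so it can be reproduced from the summary row for diff. A minimal sketch, with our own names and with the 11-degree-of-freedom critical value taken from a standard t table:

```python
import math

n_pairs = 12
mean_hr1, mean_hr2 = 1.584108, 1.775   # HR/game, 1976-1987 and 1988-1999
mean_diff = -0.1908917                 # reported mean of hr1 - hr2
sd_diff = 0.230639                     # reported std. dev. of the differences

# Standard error of the mean difference, then the t statistic with n - 1 df.
std_err = sd_diff / math.sqrt(n_pairs)
t_stat = mean_diff / std_err

# Two-sided 5% critical value for 11 df (assumption: from a t table).
T_CRIT_11 = 2.201
reject_h0 = abs(t_stat) > T_CRIT_11

print(f"std. err. = {std_err:.7f}, t = {t_stat:.4f}, reject H0: {reject_h0}")
```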

Home Runs by Position First we looked at last year’s home runs by position for each team. The following is a sample of the data we accumulated...

Team      | SS HR | 1B HR | 2B HR | 3B HR | C HR | LF HR | CF HR | RF HR | TOT HR
----------+-------+-------+-------+-------+------+-------+-------+-------+-------
Anaheim   |     6 |    36 |     9 |    47 |   14 |    35 |    25 |    34 |    206
NY Mets   |     4 |    22 |    25 |    24 |   13 |    15 |    17 |    18 |    138
San Fran  |    20 |    19 |    33 |    10 |   14 |    49 |    12 |    24 |    181

Next we calculated the total number of home runs and at bats as well as the average number of home runs per at bat from each position for the whole league (in order of performance)...

Position     |     HR/AB | HRs |   ABs
-------------+-----------+-----+------
First Base   |  0.051925 | 752 | 14737
Left Field   |  0.047185 | 629 | 13098
Right Field  | 0.0462273 | 667 | 14314
Center Field | 0.0391384 | 627 | 15623
Third Base   |  0.038288 | 523 | 13154
Catcher      | 0.0348063 | 381 | 10681
Shortstop    | 0.0243771 | 354 | 14050
Second Base  | 0.0239095 | 300 | 14535

Do some positions hit significantly more than the average? The league average of home runs per at bat is .0384. For each position, we used binomial hypothesis tests to test whether or not the number of home runs per at bat from that position differs significantly from the league average. For each position,
Ho: HR/AB = .0384
HA: HR/AB ≠ .0384
(Reject if |z| > 1.96)

Results
SIGNIFICANTLY BETTER (reject null)
First Base: z = 7.978
Left Field: z = 5.731
Right Field: z = 5.104
ABOUT AVERAGE (fail to reject null)
Center Field: z = 1.127
Third Base: z = 0.812
Catcher: z = -1.468
BELOW AVERAGE (reject null)
Shortstop: z = -8.145
Second Base: z = -11.143
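These z values can be reproduced with the usual normal approximation to the binomial, z = (p̂ - p0) / sqrt(p0(1 - p0) / n), taking p̂ as total home runs divided by total at bats for the position. The sketch below uses the HRs and ABs counts from the league table; note that it is this ratio of totals, rather than the HR/AB column printed in that table, that matches the reported z values. Function and variable names are ours.

```python
import math

P0 = 0.0384  # league-wide home runs per at bat, as stated on the slide

# (home runs, at bats) for each position, copied from the league table.
counts = {
    "First Base": (752, 14737),
    "Left Field": (629, 13098),
    "Right Field": (667, 14314),
    "Center Field": (627, 15623),
    "Third Base": (523, 13154),
    "Catcher": (381, 10681),
    "Shortstop": (354, 14050),
    "Second Base": (300, 14535),
}


def z_score(home_runs, at_bats, p0=P0):
    """One-sample z statistic for a proportion under H0: p = p0."""
    p_hat = home_runs / at_bats
    std_err = math.sqrt(p0 * (1 - p0) / at_bats)
    return (p_hat - p0) / std_err


z_by_position = {pos: z_score(hr, ab) for pos, (hr, ab) in counts.items()}

for pos, z in z_by_position.items():
    verdict = "reject H0" if abs(z) > 1.96 else "fail to reject H0"
    print(f"{pos:13s} z = {z:8.3f}  {verdict}")
```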

Interpretation So, we found that first basemen, left fielders and right fielders are significantly above the league average in home run hitting. Shortstops and second basemen are significantly below it. Center fielders, third basemen and catchers are about average. This makes sense: the players at positions that require the most mobility (shortstop, second base) tend not to be as powerful as those who play positions that require less speed and agility. It is interesting that center fielders, unlike the other outfielders, are not above average; they do have to have a lot more flexibility and speed.

Does salary affect performance? We looked at team salary vs. number of wins to see if the amount of money paid to the players has a significant effect on a team’s performance. Below is some of the data we used.

Wins vs. Payroll for 2000

Wins vs. Payroll for 1999

Wins vs. Payroll for 1998

Results for 2000

Results for 1999

Results for 1998

Interpretation The R2 value for the year 2000 (.1952) reflects only a weak relationship between total payroll and number of wins, whereas the values for 1998 (.5442) and 1999 (.4691) reflect a moderate one. Because the coefficient of the number of wins is roughly 1 for all three years, an additional win is associated with about a million dollars more in payroll.

Salary and home run hitting Finally, we thought we’d combine these two studies of salary and home run hitting and analyze how the changes in average salary have been accompanied by changes in the number of home runs hit per player. Exactly how many more home runs are we getting per $1? We looked at data from 1969 to 2000. We found average salary but we could not find the average number of home runs per player. However, we thought the leader in home run percentage might give some indication of the number of home runs being hit.

Salary vs. Home Runs

Year | Salary (thousands) | HR Pct Leader
-----+--------------------+--------------
1969 |               24.9 |          9.16
1970 |               29.3 |          7.95
1971 |               31.5 |          9.49
1972 |               34.1 |          7.57
1973 |               36.6 |          10.2
1974 |               40.8 |          6.93
1975 |               44.7 |          7.17
1976 |               51.5 |          7.81
1977 |               76.1 |          8.46
1978 |               99.9 |          7.18
1979 |              113.6 |          9.02
1980 |              143.8 |          8.76
1981 |              185.7 |          8.76
1982 |              241.5 |          7.45
1983 |              289.2 |          7.49
1984 |              329.4 |          6.87
1985 |              371.6 |          7.92
1986 |              412.5 |          7.08
1987 |              412.5 |           8.8
1988 |              438.7 |          7.18
1989 |              497.3 |          8.66
1990 |              597.5 |           8.9
1991 |              851.5 |          7.69
1992 |             1028.7 |          8.99
1993 |             1076.1 |          8.58
1994 |             1168.3 |          9.75
1995 |             1110.8 |          12.3
1996 |               1120 |         12.29
1997 |             1336.6 |          9.29
1998 |             1398.8 |         13.75
1999 |             1611.2 |         12.48
2000 |             1895.6 |         10.21

Regression

Results

      Source |       SS       df       MS              Number of obs =      32
       Model |  40.3836608     1  40.3836608           Prob > F      =  0.0000
    Residual |  54.7139269    30  1.82379756           R-squared     =  0.4247
-------------+------------------------------           Adj R-squared =  0.4055
       Total |  95.0975877    31  3.06766412           Root MSE      =  1.3505

------------------------------------------------------------------------------
       hrpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sal |    .002094    .000445     4.71   0.000     .0011852    .0030029
       _cons |   7.760353   .3369654    23.03   0.000     7.072178    8.448528
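Because the full data set appears in the Salary vs. Home Runs slide, the fit can be reproduced with the closed-form least-squares formulas. The sketch below is plain Python rather than the authors' Stata session; the variable names are ours, and the data are typed in from the slide (salary in thousands of dollars, home run percentage of the league leader).

```python
# (average salary in $ thousands, leader's HR pct) for 1969 through 2000.
data = [
    (24.9, 9.16), (29.3, 7.95), (31.5, 9.49), (34.1, 7.57),
    (36.6, 10.2), (40.8, 6.93), (44.7, 7.17), (51.5, 7.81),
    (76.1, 8.46), (99.9, 7.18), (113.6, 9.02), (143.8, 8.76),
    (185.7, 8.76), (241.5, 7.45), (289.2, 7.49), (329.4, 6.87),
    (371.6, 7.92), (412.5, 7.08), (412.5, 8.8), (438.7, 7.18),
    (497.3, 8.66), (597.5, 8.9), (851.5, 7.69), (1028.7, 8.99),
    (1076.1, 8.58), (1168.3, 9.75), (1110.8, 12.3), (1120, 12.29),
    (1336.6, 9.29), (1398.8, 13.75), (1611.2, 12.48), (1895.6, 10.21),
]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Centered sums of squares and cross products.
s_xx = sum((x - mean_x) ** 2 for x, _ in data)
s_yy = sum((y - mean_y) ** 2 for _, y in data)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)

# Ordinary least squares for a single regressor.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"n = {n}, slope = {slope:.6f}, intercept = {intercept:.4f}")
print(f"R-squared = {r_squared:.4f}")
# Change in HR pct associated with $100,000 more in average salary.
print(f"per $100k: {100 * slope:.3f}")
```

Scaling the slope by 100 converts it from a per-thousand-dollar effect to a per-hundred-thousand-dollar effect.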

Interpretation The coefficient of salary is .002094 and the entire confidence interval for this value is positive. So it seems that an increase in salary is associated with an increase in home run hitting. For every additional hundred thousand dollars in average salary, the leading hitter’s home run percentage is about .2 higher (100 × .002094 ≈ .21). We found an R2 value of .4247, which indicates a moderate fit. However, from 1969 to 1976, salary stayed fairly flat (compared to the inflation today), so this may have hurt our regression since home runs were increasing at the time, although not as rapidly as recently. This suggests that home runs and salary may be increasing independently through time. There may not be an actual relationship between the two. Further study would be needed to determine if they are related.