Statistical Inference and Regression Analysis: GB.3302.30
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Inference and Regression Not Perfect Collinearity
Variance Inflation and Multicollinearity
When variables are highly but not perfectly correlated, least squares is difficult to compute accurately.
Variances of least squares slopes become very large.
Variance inflation factors: For each xk, VIF(k) = 1/[1 - R2(k)], where R2(k) is the R2 in the regression of xk on all the other x variables in the data matrix.
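The VIF definition above can be sketched in a few lines. This is only an illustration on simulated data, not the gasoline data; the function name vif, the variables x1, x2, x3 and the use of NumPy's least squares routine are all assumptions made for the example.

```python
import numpy as np

def vif(X):
    """VIF(k) = 1 / [1 - R2(k)] for each column of X (X holds no constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on a constant and all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical data: x2 is nearly collinear with x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The first two factors should come out large and the third close to 1, which is the pattern the slide describes.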
Gasoline Market
Regression Analysis: logG versus logIncome, logPG
The regression equation is
logG = - 0.468 + 0.966 logIncome - 0.169 logPG

Predictor      Coef  SE Coef      T      P
Constant   -0.46772  0.08649  -5.41  0.000
logIncome   0.96595  0.07529  12.83  0.000
logPG      -0.16949  0.03865  -4.38  0.000

S = 0.0614287   R-Sq = 93.6%   R-Sq(adj) = 93.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  2.7237  1.3618  360.90  0.000
Residual Error  49  0.1849  0.0038
Total           51  2.9086

R2 = 2.7237/2.9086 = 0.93643
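As a quick check on the analysis of variance table, R2 and F can be recomputed from the sums of squares and degrees of freedom. The figures are the ones printed above; the variable names are just labels chosen for this sketch.

```python
# Values copied from the two-variable gasoline regression output.
ss_regression = 2.7237
ss_residual = 0.1849
df_regression = 2
df_residual = 49

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total                  # regression SS / total SS
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

print(round(r_squared, 5), round(f_stat, 2))
```

Both values agree with the printed R2 and F up to the rounding in the table.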
Gasoline Market
Regression Analysis: logG versus logIncome, logPG, ...
The regression equation is
logG = - 0.558 + 1.29 logIncome - 0.0280 logPG - 0.156 logPNC + 0.029 logPUC - 0.183 logPPT

Predictor      Coef  SE Coef      T      P
Constant    -0.5579   0.5808  -0.96  0.342
logIncome    1.2861   0.1457   8.83  0.000
logPG      -0.02797  0.04338  -0.64  0.522
logPNC      -0.1558   0.2100  -0.74  0.462
logPUC       0.0285   0.1020   0.28  0.781
logPPT      -0.1828   0.1191  -1.54  0.132

S = 0.0499953   R-Sq = 96.0%   R-Sq(adj) = 95.6%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       5  2.79360  0.55872  223.53  0.000
Residual Error  46  0.11498  0.00250
Total           51  2.90858

R2 = 2.79360/2.90858 = 0.96047
logPG is no longer statistically significant when the other variables are added to the model.
Evidence of Multicollinearity: Regression of logPG on the other variables gives a very good fit.
Diagnostic Tools
Look for incremental contributions to R2 when additional predictors are added.
Look for predictor variables not to be well explained by other predictors. (These are all the same.)
Look for “information” and independent sources of information.
Collinearity and influential observations can be related.
Removing influential observations can make collinearity worse or better.
The relationship is far too complicated to say anything useful about how these two might interact.
NIST Statistical Reference Data Sets – Accuracy Tests
The Filipelli Problem
VIF for X10: R2 = 0.99999999999999630, VIF = 0.27294543196184830D+15 (about 2.7 x 10^14)
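The size of this VIF is easiest to appreciate in double precision arithmetic. The sketch below simply plugs the printed R2 into 1/[1 - R2]; the variable names are illustrative. At this point 1 - R2 is only a few dozen multiples of the spacing between adjacent doubles just below 1 (2^-53, about 1.1e-16), so the regressor is collinear with the others almost to the limit of what the machine can represent.

```python
r2 = 0.99999999999999630          # R2 for X10, as printed on the slide
one_minus_r2 = 1.0 - r2
vif = 1.0 / one_minus_r2

# How many spacing units (2**-53) separate R2 from 1 in double precision?
units_below_one = one_minus_r2 / 2.0 ** -53

print(vif)
print(units_below_one)
```

The computed VIF is on the order of 2.7e14, in line with the value shown above.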
Other software: Minitab reports the correct answer; Stata drops X10.
Accurate and Inaccurate Computation of Filipelli Results
Accurate computation requires not actually computing (X'X)^-1. We (and others) use the QR method. See text for details.
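A minimal sketch of the two computational routes, assuming NumPy. It uses a small polynomial design on made-up data, not the NIST data, and the function names ols_qr and ols_normal_equations are hypothetical. It is meant only to show the idea of solving through QR instead of forming and inverting X'X, not any particular package's implementation.

```python
import numpy as np

def ols_qr(X, y):
    """Least squares via X = QR: solve R b = Q'y, never forming X'X."""
    Q, R = np.linalg.qr(X)            # reduced QR, R is upper triangular
    return np.linalg.solve(R, Q.T @ y)

def ols_normal_equations(X, y):
    """Textbook formula b = (X'X)^-1 X'y, shown only for comparison."""
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

# Illustrative design: powers 0..5 of x on [0, 1], which are highly collinear.
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, 6, increasing=True)
beta_true = np.ones(6)
y = X @ beta_true                     # exact fit, so the true slopes are known

err_qr = np.max(np.abs(ols_qr(X, y) - beta_true))
err_ne = np.max(np.abs(ols_normal_equations(X, y) - beta_true))
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)

print(err_qr, err_ne)
print(cond_X, cond_XtX)
```

Forming X'X roughly squares the condition number of the problem, which is why the inverse-based formula tends to lose about twice as many digits as the QR route.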
Stata Filipelli Results
Even after dropping two (random) columns, results are correct to only 1 or 2 digits.
Inference and Regression Testing Hypotheses
Testing Hypotheses
Hypothesis Testing: Criteria
The F Statistic has an F Distribution
Nonnormality or Large N
Denominator of F converges to 1.
Numerator converges to chi squared[J]/J.
Rely on law of large numbers for the denominator and CLT for the numerator: JF -> Chi squared[J] (convergence in distribution).
Use critical values from chi squared.
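The large sample claim can be illustrated with a small simulation. This is a sketch under simplifying assumptions: it draws the exact F ratio (chi squared[J]/J over chi squared[N-K]/(N-K)) and does not model nonnormal disturbances. The function name, replication count and seed are arbitrary choices, and 7.815 is the standard 95% critical value for chi squared with 3 degrees of freedom.

```python
import random

def simulated_jf_quantile(j, df_denominator, reps=20000, prob=0.95, seed=12345):
    """Simulated quantile of J*F when F = (chi2[J]/J) / (chi2[df]/df)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        # chi squared[d] is a gamma variate with shape d/2 and scale 2.
        numerator = rng.gammavariate(j / 2.0, 2.0) / j
        denominator = rng.gammavariate(df_denominator / 2.0, 2.0) / df_denominator
        draws.append(j * numerator / denominator)
    draws.sort()
    return draws[int(prob * reps)]

CHI2_3_CRITICAL_95 = 7.815

q_small = simulated_jf_quantile(3, 10)      # few denominator degrees of freedom
q_large = simulated_jf_quantile(3, 2000)    # denominator is essentially 1

print(q_small, q_large, CHI2_3_CRITICAL_95)
```

With few denominator degrees of freedom the 95% point of JF sits well above the chi squared value; with many, the two should nearly coincide.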
Significance of the Regression - R*2 = 0
Table of 95% Critical Values for F
+----------------------------------------------------+
| Ordinary least squares regression                  |
| LHS=LOGBOX    Mean                = 16.47993       |
|               Standard deviation  = .9429722       |
|               Number of observs.  = 62             |
| Residuals     Sum of squares      = 25.36721       |
|               Standard error of e = .6984489       |
| Fit           R-squared           = .5323241       |
|               Adjusted R-squared  = .4513802       |
+----------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|   11.9602***        .91818       13.026    .0000           |
|LOGBUDGT|    .38159**         .18711        2.039    .0465    3.71468|
|STARPOWR|    .01303           .01315         .991    .3263    18.0316|
|SEQUEL  |    .33147           .28492        1.163    .2500     .14516|
|MPRATING|   -.21185           .13975       -1.516    .1356    2.96774|
|ACTION  |   -.81404**         .30760       -2.646    .0107     .22581|
|COMEDY  |    .04048           .25367         .160    .8738     .32258|
|ANIMATED|   -.80183*          .40776       -1.966    .0546     .09677|
|HORROR  |    .47454           .38629        1.228    .2248     .09677|
|PCBUZZ  |    .39704***        .08575        4.630    .0000    9.19362|
+--------+------------------------------------------------------------+

F = [(.6211405 - .5323241)/3] / [(1 - .6211405)/(62 - 13)] = 3.829; F* = 2.84
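The F computation at the bottom of this slide can be reproduced directly from the two R2 values. The function and argument names below are hypothetical; the numbers are the ones on the slide, with .6211405 taken as the R2 of the larger model that has 13 coefficients and 3 more variables than the one shown.

```python
def f_statistic(r2_unrestricted, r2_restricted, j, n, k_unrestricted):
    """F = [(R2_u - R2_r)/J] / [(1 - R2_u)/(n - K)] for J restrictions."""
    numerator = (r2_unrestricted - r2_restricted) / j
    denominator = (1.0 - r2_unrestricted) / (n - k_unrestricted)
    return numerator / denominator

f = f_statistic(0.6211405, 0.5323241, j=3, n=62, k_unrestricted=13)
f_critical = 2.84                 # 95% critical value used on the slide
reject = f > f_critical

print(round(f, 3), reject)
```

The statistic exceeds the critical value, so the hypothesis that the three added coefficients are all zero is rejected at the 5% level.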
Inference and Regression A Case Study
Mega Deals for Stars
A Capital Budgeting Computation
Costs and Benefits
Certainty: Costs
Uncertainty: Benefits
Long Term: Need for discounting
Baseball Story
A Huge Sports Contract
Alex Rodriguez hired by the Texas Rangers for something like $25 million per year in 2000.
Costs: the salary plus and minus some fine tuning of the numbers
Benefits: more fans in the stands
How to determine if the benefits exceed the costs? Use a regression model.
The Texas Deal for Alex Rodriguez
Year                 Amount ($M)
2001 Signing Bonus   10
2001                 21
2002                 21
2003                 21
2004                 21
2005                 25
2006                 25
2007                 27
2008                 27
2009                 27
2010                 27
Total: $252M ???
The Real Deal
Year  Salary  Bonus  Deferral
2001  21      2      5 to 2011
Deferrals accrue interest of 3% per year.
Costs
Insurance: About 10% of the contract per year
(Taxes: About 40% of the contract)
Some additional costs from league revenue sharing (anticipated, about 17.5% of marginal benefits; uncertain)
Interest on deferred salary: $150,000 in the first year, well over $1,000,000 in 2010
(Reduction) $3M it would cost to have a different shortstop (Nomar Garciaparra)
PDV of the Costs
Using 8% discount factor (the rate they used)
Accounting for all costs: roughly $21M to $28M in each year from 2001 to 2010, then the deferred payments from 2010 to 2020
Total costs: About $165 Million in 2001 (present discounted value)
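The discounting step can be sketched as follows. The example discounts only the headline salary schedule from the contract slide at 8%, treating 2001 as period 0 and the signing bonus as paid up front. Those timing conventions and the function name are assumptions, and the sketch deliberately leaves out insurance, deferrals, revenue sharing and the shortstop offset, so it is not a reconstruction of the $165 million figure.

```python
def present_value(cash_flows, rate, first_period=0):
    """Sum of cash_flows[t] / (1 + rate)**(first_period + t)."""
    return sum(cf / (1.0 + rate) ** (first_period + t)
               for t, cf in enumerate(cash_flows))

# Headline contract in $ millions, 2001-2010, as listed on the contract slide.
salaries = [21, 21, 21, 21, 25, 25, 27, 27, 27, 27]
signing_bonus = 10
discount_rate = 0.08

nominal_total = signing_bonus + sum(salaries)
pdv_headline = signing_bonus + present_value(salaries, discount_rate)

print(nominal_total, round(pdv_headline, 1))
```

The nominal total is the $252M on the contract slide; the discounted figure is necessarily smaller because later payments are worth less in 2001 dollars.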
Benefits
More fans in the seats
  Gate
  Parking
  Merchandise
Increased chance at playoffs and World Series
Sponsorships
(Loss to revenue sharing)
Franchise value
How Many New Fans?
Projected 8 more wins per year.
What is the relationship between wins and attendance?
  Not known precisely
  Many empirical studies (The Journal of Sports Economics)
Use a regression model to find out.
Baseball Data
31 teams, 17 years (fewer years for 6 teams)
Winning percentage: Wins = 162 * percentage
Rank
Average attendance: Attendance = 81 * Average
Average team salary
Number of all stars
Manager years of experience
Percent of team that is rookies
Lineup changes
Mean player experience
Dummy variable for change in manager
Baseball Data (Panel Data)
A Dynamic Equation
About 220,000 fans
The Regression Model
Marginal Value of One Win
Marginal Value of an A Rod
8 games * 63,734 fans = 509,878 fans
509,878 fans * ($18 per ticket + $2.50 parking etc. + $1.80 stuff (hats, bobble head dolls,…)) = $11.3 Million per year !!!!!
It’s not close. (Marginal cost is at least $16.5M / year)
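The arithmetic on this slide, written out. All figures are taken from the slide; the variable names are just labels for this sketch.

```python
additional_fans = 509_878              # 8 more wins times the fans gained per win
revenue_per_fan = 18.00 + 2.50 + 1.80  # ticket + parking etc. + merchandise, in $
marginal_revenue = additional_fans * revenue_per_fan

marginal_cost_floor = 16.5e6           # "at least $16.5M / year"
shortfall = marginal_cost_floor - marginal_revenue

print(round(marginal_revenue / 1e6, 2), round(shortfall / 1e6, 2))
```

The product comes to roughly $11.4 million per year, in line with the slide's $11.3 million, which leaves a gap of about $5 million per year against the lower bound on cost.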
The IPN Player
A-Rod and Yankees: The Iconic Performance Network Player
Attendance rose to 4M in 2005, 4.3M in 2007
MVP in 2005 and 2007
Huge growth in the YES network
Seemed certain to break Bonds’ HR record (Asterisk?)
New deal: $275M over 10 years
Chicago Cubs offer included team ownership.
Drug problems probably derailed this career path.
The Ghosts of Seasons Past: Long Run Implications - The Shadow Cost
The commitment to A-Rod limited the ability of the Texas Rangers to field a great team.
The same problem now faces the Yankees.
A-Rod is aging and becoming less likely to break the records.
His steroid use has tarnished his reputation and reduced the value of his history.
Why do teams do these long term mega deals for baseball players?
Kershaw vs. A Rod
Shorter term, risk shifting onto the team
Bargaining strength has shifted in favor of the player.