Statistical Inference and Regression Analysis: GB.3302.30
Professor William Greene, Stern School of Business, IOMS Department and Department of Economics

Inference and Regression
Perfect Collinearity

Perfect Multicollinearity
If X does not have full column rank, then at least one column can be written as a linear combination of the other columns. X’X then does not have full rank and cannot be inverted, so b cannot be computed.

Multicollinearity
Enhanced Monet Area Effect Model: Height and Width Effects
Log(Price) = β1 + β2 log Area + β3 log Aspect Ratio + β4 log Height + β5 Signature + ε
(Aspect Ratio = Height/Width)

Short Rank X
Enhanced Monet Area Effect Model: Height and Width Effects
Log(Price) = β1 + β2 log Area + β3 log Aspect Ratio + β4 log Height + β5 Signature + ε
(Aspect Ratio = Height/Width)
X1 = 1, X2 = logArea, X3 = logAspect, X4 = logHeight, X5 = Signature
X2 = logH + logW
X3 = logH - logW
X4 = logH
X5 = Signature
X2 + X3 - 2X4 = (logH + logW) + (logH - logW) - 2logH = 0
X4 = 1/2 X2 + 1/2 X3
c = [0, 1, 1, -2, 0]
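The short-rank result can be checked numerically. The sketch below is only an illustration: the heights, widths, and signature indicator are simulated (hypothetical data, not the Monet sample), and the names log_h, log_w, and signature are assumptions introduced here. It builds the five columns listed above and confirms that Xc = 0 for c = [0, 1, 1, -2, 0], so X has rank 4 rather than 5.

```python
import numpy as np

# Hypothetical data standing in for the Monet sample (not the real data).
rng = np.random.default_rng(0)
n = 50
log_h = rng.normal(size=n)                      # log Height
log_w = rng.normal(size=n)                      # log Width
signature = rng.integers(0, 2, size=n).astype(float)

X = np.column_stack([
    np.ones(n),        # X1 = constant
    log_h + log_w,     # X2 = log Area
    log_h - log_w,     # X3 = log Aspect Ratio
    log_h,             # X4 = log Height
    signature,         # X5 = Signature
])
c = np.array([0.0, 1.0, 1.0, -2.0, 0.0])

rank = np.linalg.matrix_rank(X)        # number of linearly independent columns
max_abs_Xc = np.max(np.abs(X @ c))     # zero up to rounding error
print("columns:", X.shape[1], "rank:", rank)
print("largest |Xc| entry:", max_abs_Xc)
```

Because the rank is below the number of columns, X’X is singular and the normal equations have no unique solution.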

Least Squares Fit

Minimizing e’e
b minimizes e’e = (y - Xb)’(y - Xb). Any other coefficient vector has a larger sum of squares. (Least squares is least squares.)
A quick proof: let d be any coefficient vector other than b, and u = y - Xd. Then
u’u = (y - Xd)’(y - Xd)
= [y - Xb - X(d - b)]’[y - Xb - X(d - b)]
= [e - X(d - b)]’[e - X(d - b)]
Expand, using X’e = 0 to eliminate the cross terms, to find u’u = e’e + (d - b)’X’X(d - b) > e’e (strictly, when X has full column rank and d is not b).
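A small numerical check of this identity may help. The sketch below uses simulated data and an arbitrary alternative vector d (both are assumptions made for illustration). It verifies that the residuals are orthogonal to X and that u’u = e’e + (d - b)’X’X(d - b).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares coefficients
e = y - X @ b                               # least squares residuals

d = b + np.array([0.3, -0.2, 0.1])          # some other coefficient vector
u = y - X @ d                               # residuals based on d

lhs = u @ u
rhs = e @ e + (d - b) @ (X.T @ X) @ (d - b)
print("u'u:", lhs)
print("e'e + (d-b)'X'X(d-b):", rhs)
print("largest |X'e| entry:", np.max(np.abs(X.T @ e)))
```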

Dropping a Variable
An important special case: comparing the results that we get with and without a variable z in the equation in addition to the other variables in X. Results which we can show using the previous result:
Dropping a variable(s) cannot improve the fit - that is, reduce the sum of squares. The relevant d is (*, *, *, …, 0), i.e., some vector that has a zero in a particular place.
Adding a variable(s) cannot degrade the fit - that is, increase the sum of squares. Compare the sum of squares when there is a zero in the location to where the vector does not contain the zero – just reverse the cases.

The Fit of the Regression
“Variation:” In the context of the “model” we speak of variation of a variable as movement of the variable, usually associated with (not necessarily caused by) movement of another variable.

Decomposing the Variation of y
Total sum of squares = Regression Sum of Squares (SSR) + Residual Sum of Squares (SSE)

Decomposing the Variation

A Fit Measure
R2 = Regression Sum of Squares / Total Sum of Squares = 1 - e’e / Total Sum of Squares
(Very Important Result.) R2 is bounded by zero and one if and only if: (a) There is a constant term in X and (b) The line is computed by linear least squares.

Understanding R2 R2 = squared correlation between y and the prediction of y given by the regression
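The decomposition and the correlation interpretation can both be verified on a small example. This is a sketch on simulated data (the coefficients and sample size are arbitrary assumptions). It uses the slides' naming, with SSR for the regression sum of squares and SSE for the residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # includes a constant
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
sse = e @ e                              # residual sum of squares

r2 = 1.0 - sse / sst
corr2 = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared correlation of y and its prediction
print("SST:", sst, "SSR + SSE:", ssr + sse)
print("R2:", r2, "squared correlation:", corr2)
```

Both equalities rely on the constant term in X. Without it the residuals need not sum to zero and the decomposition can fail.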

Regression Results
[Regression output; the numerical values were not preserved in this transcript.]
Ordinary least squares regression, LHS = BOX
Regressors: Constant, CNTWAIT3, BUDGET
Model test: F[2, 59]
Reported items: mean and standard deviation of BOX, number of observations, regression, residual and total sums of squares, standard error of e, R-squared, R-bar squared, and for each coefficient the estimate, standard error, t ratio, p value and confidence interval.

Adding Variables R2 never falls when a z is added to the regression.
A useful general result

Adding Variables to a Model
What is the effect of adding PN, PD, PS, YEAR to the model (one at a time)?
[Regression output; the numerical values were not preserved in this transcript.]
Ordinary least squares regression, LHS = G
Regressors: Constant, PG, Y
Model test: F[2, 33] (prob) = (.0000)
Effects of additional variables on the regression: for each of PD, PN, PS, YEAR, the output reports Coefficient, New R-sqrd, Chg.R-sqrd, Partial-Rsq and Partial F.

Adjusted R Squared
Adjusted R2 (for degrees of freedom?): Adjusted R2 = 1 - [(N-1)/(N-K)](1 - R2)
Includes a penalty for variables that don’t add much fit. Can fall when a variable is added to the equation.

Adjusted R-Squared
As we will discover when we study regression with more than one variable, a researcher can increase R2 just by adding variables to a model, even if those variables do not really explain y or have any real relationship to it at all. To have a fit measure that accounts for this, “Adjusted R2” is a number that increases with the correlation but decreases with the number of variables.
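The following sketch illustrates the point with simulated data. It assumes the standard degrees-of-freedom formula, Adjusted R2 = 1 - [(N-1)/(N-K)](1 - R2), and the helper name r2_and_adjusted is hypothetical. A regressor that is pure noise by construction is added to a simple regression. R2 cannot fall, while the adjusted measure is free to move in either direction.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R2, adjusted R2) for least squares of y on X; X must contain a constant."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - (e @ e) / sst
    adj = 1.0 - (n - 1) / (n - k) * (1.0 - r2)   # standard degrees-of-freedom adjustment
    return r2, adj

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
noise = rng.normal(size=n)                        # unrelated to y by construction

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)
print("R2:          ", r2_small, "->", r2_big)
print("Adjusted R2: ", adj_small, "->", adj_big)
```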

Notes About Adjusted R2

Transformed Data

Linear Transformations of Data
Change units of measurement by dividing every observation – e.g., $ to Millions of $ (see internet buzz regression) by dividing Box by 1,000,000
Change meaning of variables:
x=(x1=nominal interest=i, x2=inflation=dp, x3=GDP)
z=(x1-x2 = real interest i-dp, x2=inflation=dp, x3=GDP)
Change theory of art appreciation:
x=(x1=logHeight, x2=logWidth, x3=signature)
z=(x1-x2=logAspectRatio, x2=logWidth, x3=signature)

(Linearly) Transformed Data
How does linear transformation affect the results of least squares?
Z = XP for KxK nonsingular P (Each variable in Z is a combination of the variables in X.)
Based on X, b = (X’X)-1X’y.
You can show (just multiply it out) that the coefficients when y is regressed on Z are c = P-1b.
“Fitted value” is Zc = XPP-1b = Xb. The same!!
Residuals from using Z are y - Zc = y - Xb (we just proved this). The same!!
Sum of squared residuals must be identical, as y - Xb = e = y - Zc.
R2 must also be identical, as R2 = 1 - e’e/same total SS.
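The invariance result is easy to confirm numerically. In the sketch below the data are simulated and the matrix P is an arbitrary nonsingular choice made for illustration (it replaces the second column of X with the difference between the second and third columns, in the spirit of the examples above).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

# Nonsingular P: Z keeps column 0, replaces column 1 by (column 1 - column 2), keeps column 2.
P = np.array([[1.0,  0.0, 0.0],
              [0.0,  1.0, 0.0],
              [0.0, -1.0, 1.0]])
Z = X @ P

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # coefficients using X
c, *_ = np.linalg.lstsq(Z, y, rcond=None)   # coefficients using Z

print("c:        ", c)
print("P^-1 b:   ", np.linalg.solve(P, b))
print("max |Zc - Xb|:", np.max(np.abs(Z @ c - X @ b)))
```

The coefficients change, but the fitted values, residuals, and R2 do not.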

Principal Components
Z = XC
Fewer columns than X
Includes as much ‘variation’ of X as possible
Columns of Z are orthogonal
Why do we do this?
Collinearity
Combine variables of ambiguous identity such as test scores as measures of ‘ability’

Model Building and Functional Form

Using Logs

Time Trends in Regression
y = α + β1x + β2t + ε
β2 is the period to period increase not explained by anything else.
log y = α + β1log x + β2t + ε (not log t, just t)
β2 (times 100) is the period to period % increase not explained by anything else.

U.S. Gasoline Market: Price and Income Elasticities Downward Trend in Gasoline Usage

Application: Health Care Data
German Health Care Usage Data
There are altogether 27,326 observations on German households.
DOCTOR = 1(number of doctor visits > 0)
HOSPITAL = 1(number of hospital visits > 0)
HSAT = health satisfaction, coded 0 (low) - 10 (high)
DOCVIS = number of doctor visits in last three months
HOSPVIS = number of hospital visits in last calendar year
PUBLIC = insured in public health insurance = 1; otherwise = 0
ADDON = insured by add-on insurance = 1; otherwise = 0
INCOME = household nominal monthly net income in German marks /
HHKIDS = children under age 16 in the household = 1; otherwise = 0
EDUC = years of schooling
FEMALE = 1(female headed household)
AGE = age in years
MARRIED = marital status

Dummy Variable
D = 0 in one case and 1 in the other
Y = a + bX + cD + e
When D = 0, E[Y|X] = a + bX
When D = 1, E[Y|X] = a + c + bX

A Conspiracy Theory for Art Sales at Auction
Sotheby’s and Christie’s conspired on commission rates from 1995 to about 2000.

If the Theory is Correct…
Sold from 1995 to 2000 Sold before 1995 or after 2000

Evidence: Two Dummy Variables Signature and Conspiracy Effects
The statistical evidence seems to be consistent with the theory.

Set of Dummy Variables
Usually, Z = Type = 1,2,…,K
Y = a + bX + d1 if Type=1
+ d2 if Type=2
…
+ dK if Type=K

A Set of Dummy Variables
Complete set of dummy variables divides the sample into groups. Fit the regression with “group” effects. Need to drop one (any one) of the variables to compute the regression. (Avoid the “dummy variable trap.”)
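The dummy variable trap is a case of perfect collinearity: the full set of group dummies sums to the constant column. The sketch below uses simulated data with K = 4 hypothetical types. The variable names are assumptions for illustration. It compares the rank of the data matrix with all K dummies against the rank after one is dropped.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 80, 4
group = np.arange(n) % K + 1                 # Type = 1,...,K, every type appears
x = rng.normal(size=n)

# One indicator column per type.
D = (group[:, None] == np.arange(1, K + 1)[None, :]).astype(float)

X_trap = np.column_stack([np.ones(n), x, D])        # constant + x + all K dummies
X_ok = np.column_stack([np.ones(n), x, D[:, 1:]])   # drop the first dummy

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("with all dummies: columns", X_trap.shape[1], "rank", rank_trap)
print("one dummy dropped: columns", X_ok.shape[1], "rank", rank_ok)
```

Dropping any one dummy restores full column rank. The remaining dummy coefficients are then measured relative to the omitted group.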

Group Effects in Teacher Ratings

Rankings of 132 U.S. Liberal Arts Colleges
Nancy Burnett: Journal of Economic Education, 1998
Reputation = α + β1Religious + β2GenderEcon + β3EconFac + β4North + β5South + β6Midwest + β7West + ε

Minitab does not like this model.

Too many dummy variables cause perfect multicollinearity
If we use all four region dummies:
Reputation = a + bn + … if north
Reputation = a + bm + … if midwest
Reputation = a + bs + … if south
Reputation = a + bw + … if west
Only three are needed – so Minitab dropped west:
Reputation = a + … if west

Unordered Categorical Variables
House price data (fictitious)
Type 1 = Split level
Type 2 = Ranch
Type 3 = Colonial
Type 4 = Tudor
Use 3 dummy variables for this kind of data. (Not all 4)
Using variable STYLE in the model makes no sense. You could change the numbering scale any way you like. 1,2,3,4 are just labels.

Transform Style to Types

Hedonic House Price Regression
Each of these is relative to a Split Level, since that is the omitted category. E.g., the price of a Ranch house is $74,369 less than a Split Level of the same size with the same number of bedrooms.

We used McDonald’s Per Capita

More Movie Madness
McDonald’s and Movies (Craig, Douglas, Greene: International Journal of Marketing)
Log Foreign Box Office(movie,country,year) = α + β1*LogBox(movie,US,year) + β2*LogPCIncome + β4*LogMacsPC + GenreEffect + CountryEffect + ε.

Movie Madness Data (n=2198)

Macs and Movies
Genres (MPAA): 1=Drama 2=Romance 3=Comedy 4=Action 5=Fantasy 6=Adventure 7=Family 8=Animated 9=Thriller 10=Mystery 11=Science Fiction 12=Horror 13=Crime
Countries and Some of the Data
Columns: Code, Country, Pop(mm), per cap Income, # of McDonalds, Language (the numeric columns were not preserved in this transcript)
1 Argentina Spanish
2 Chile Spanish
3 Spain Spanish
4 Mexico Spanish
5 Germany German
6 Austria German
7 Australia English
8 UK UK

CRIME is the left out GENRE.
AUSTRIA is the left out country. Australia and UK were left out for other reasons (algebraic problem with only 8 countries).

Functional Form: Quadratic
Y = a + b1X + b2X^2 + e
dE[Y|X]/dX = b1 + 2b2X

Interaction Effect
Y = a + b1X + b2Z + b3X*Z + e
E.g., the benefit of a year of education depends on how old one is.
Log(income) = a + b1*Ed + b2*Ed^2 + b3*Ed*Age + e
dlogIncome/dEd = b1 + 2b2*Ed + b3*Age

Effect of an additional year of education increases from about 6.8% at age 20 to 7.2% at age 40

Statistics and Data Analysis
Properties of Least Squares

Terms of Art Estimates and estimators
Properties of an estimator - the sampling distribution “Finite sample” properties as opposed to “asymptotic” or “large sample” properties

Least Squares

Deriving the Properties of b
So, b = the parameter vector + a linear combination of the disturbances, each times a vector. Therefore, b is a vector of random variables. We analyze it as such. We do the analysis conditional on an X, then show that results do not depend on the particular X in hand, so the result must be general – i.e., independent of X.

Unbiasedness of b

Left Out Variable Bias
A Crucial Result About Specification: Two sets of variables in the regression, X1 and X2.
y = X1β1 + X2β2 + ε
What if the regression is computed without the second set of variables? What is the expectation of the "short" regression estimator?
b1 = (X1’X1)-1X1’y

The Left Out Variable Formula
E[b1] = β1 + (X1’X1)-1X1’X2β2
The (truly) short regression estimator is biased.
Application: Quantity = β1Price + β2Income + ε
If you regress Quantity on Price and leave out Income, what do you get?

Application: Left out Variable
Leave out Income. What do you get?
In time series data, β1 < 0, β2 > 0 (usually).
Cov[Price,Income] > 0 in time series data.
So, the short regression will overestimate the price coefficient.
Simple Regression of G on a constant and PG: Price Coefficient should be negative.
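The left out variable formula also holds exactly in the sample when β2 is replaced by its estimate from the long regression: the short-regression coefficients equal the long-regression coefficients on X1 plus (X1’X1)-1X1’X2 times the estimated coefficient on X2. The sketch below checks this on simulated data. The data-generating values (price coefficient -1, income coefficient +1, positively correlated price and income) are assumptions chosen to mimic the signs described above, and a constant term is included in X1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
price = rng.normal(size=n)
income = 0.8 * price + rng.normal(size=n)          # Cov[Price, Income] > 0
quantity = 2.0 - 1.0 * price + 1.0 * income + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), price])          # included variables
X2 = income.reshape(-1, 1)                         # left out variable
X = np.column_stack([X1, X2])

b_long, *_ = np.linalg.lstsq(X, quantity, rcond=None)
b_short, *_ = np.linalg.lstsq(X1, quantity, rcond=None)
b1_long, b2_long = b_long[:2], b_long[2:]

# In-sample version of the formula: (X1'X1)^-1 X1'X2 times the long-regression b2.
shift = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ b2_long

print("long regression price coefficient: ", b1_long[1])
print("short regression price coefficient:", b_short[1])
print("long + shift:                      ", (b1_long + shift)[1])
```

With these signs the short regression pushes the price coefficient upward, toward zero or even above it, which is the pattern the gasoline example describes.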

Estimated ‘Demand’ Equation Shouldn’t the Price Coefficient be Negative?

Multiple Regression of G on Y and PG. The Theory Works!
[Regression output; the numerical values were not preserved in this transcript.]
Ordinary least squares regression, LHS = G
Regressors: Constant, Y, PG
Model test: F[2, 33] (prob) = (.0000)

Specification Errors-1
Omitting relevant variables: Suppose the correct model is y = X1β1 + X2β2 + ε, i.e., two sets of variables. Compute least squares omitting X2. Some easily proved results:
Var[b1] is smaller than Var[b1.2]. You get a smaller variance when you omit X2. (One interpretation: Omitting X2 amounts to using extra information (β2 = 0). Even if the information is wrong (see the next result), it reduces the variance.) (This is an important result.)

Specification Errors-2
Including superfluous variables: Just reverse the results. Including superfluous variables increases variance. (The cost of not using information.) Does not cause a bias, because if the variables in X2 are truly superfluous, then β2 = 0, so E[b1.2] = β1.

Estimating Var[b|X]

Variance of the Least Squares Estimator

Gauss-Markov Theorem
A theorem of Gauss and Markov: Least Squares is the Minimum Variance Linear Unbiased Estimator
1. Linear estimator
2. Unbiased: E[b|X] = β
Comparing positive definite matrices: Var[c|X] – Var[b|X] is nonnegative definite for any other linear and unbiased estimator c.

True Variance of b|X

Estimating σ2
Using the residuals instead of the disturbances:
The natural estimator: e’e/N as a sample surrogate for ε’ε/N
Imperfect observation of εi: ei = εi + (β - b)’xi
Downward bias of e’e/N. We obtain the result E[e’e|X] = (N-K)σ2

Expectation of e’e

Expected Value of e’e:

Estimating σ2
The unbiased estimator is s2 = e’e/(N-K).
N-K = “Degrees of freedom correction”

Var[b|X] Estimating the Covariance Matrix for b|X
The true covariance matrix is σ2(X’X)-1.
The natural estimator is s2(X’X)-1.
“Standard errors” of the individual coefficients are the square roots of the diagonal elements.
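These steps can be sketched in a few lines. The data are simulated, with N = 36 and K = 3 chosen arbitrarily. The explicit inverse is used only to mirror the formulas, and is not a recommendation for how to compute them on ill-conditioned data.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 36, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)     # (X'X)^-1, formed explicitly for transparency
b = XtX_inv @ X.T @ y                # least squares coefficients
e = y - X @ b                        # residuals

s2 = (e @ e) / (n - k)               # unbiased estimator of sigma^2
cov_b = s2 * XtX_inv                 # estimated covariance matrix of b given X
std_err = np.sqrt(np.diag(cov_b))    # standard errors of the coefficients

print("s2:", s2)
print("coefficients:   ", b)
print("standard errors:", std_err)
```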

[Displayed matrices: X’X, (X’X)-1, and s2(X’X)-1]

Regression Results
[Regression output; the numerical values were not preserved in this transcript.]
Ordinary least squares regression, LHS = G
Regressors: Constant, PG, Y, TREND, PNC, PUC, PPT
Standard error of e = sqr[(Residual sum of squares)/(36 – 7)]
Model test: F[6, 29] (prob) = (.0000)
Create ; trend=year-1960$
Namelist ; x=one,pg,y,trend,pnc,puc,ppt$
Regress ; lhs=g ; rhs=x$

Not Perfect Collinearity

Variance Inflation and Multicollinearity
When variables are highly but not perfectly correlated, least squares is difficult to compute accurately.
Variances of least squares slopes become very large.
Variance inflation factors: For each xk, VIF(k) = 1/[1 – R2(k)], where R2(k) is the R2 in the regression of xk on all the other x variables in the data matrix.
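A direct implementation of the VIF definition is short. In this sketch the helper name vif and the simulated regressors are assumptions for illustration: x3 is built to be nearly, but not exactly, the sum of x1 and x2, so its VIF should be very large.

```python
import numpy as np

def vif(X, k):
    """VIF(k) = 1 / (1 - R2(k)), with R2(k) from regressing column k on the other columns."""
    xk = X[:, k]
    others = np.delete(X, k, axis=1)          # all other columns, including the constant
    coef, *_ = np.linalg.lstsq(others, xk, rcond=None)
    resid = xk - others @ coef
    r2_k = 1.0 - (resid @ resid) / np.sum((xk - xk.mean()) ** 2)
    return 1.0 / (1.0 - r2_k)

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)      # highly, not perfectly, collinear
X = np.column_stack([np.ones(n), x1, x2, x3])

vifs = [vif(X, k) for k in (1, 2, 3)]         # skip the constant column
for name, value in zip(("x1", "x2", "x3"), vifs):
    print(name, "VIF =", value)
```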

NIST Statistical Reference Data Sets – Accuracy Tests

The Filippelli Problem

VIF for X10: R2 = VIF = D+15

Other software: Minitab reports the correct answer
Stata drops X10

Accurate and Inaccurate Computation of Filippelli Results
Accurate computation requires not actually computing (X’X)-1. We (and others) use the QR method. See text for details.
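A minimal sketch of the QR approach follows. It is an illustration under stated assumptions, not the routine used in any particular package. The design matrix is a hypothetical degree-5 polynomial in x (not the NIST data), and y is generated without noise so the true coefficients are known. With X = QR, the coefficients solve Rb = Q’y, so (X’X)-1 is never formed.

```python
import numpy as np

# Hypothetical ill-conditioned design: columns 1, x, x^2, ..., x^5 on [0, 1].
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, 6, increasing=True)
beta = np.array([1.0, -2.0, 3.0, -1.0, 0.5, 2.0])
y = X @ beta                                  # exact fit, so b should recover beta

Q, R = np.linalg.qr(X)                        # reduced QR: Q is 50x6, R is 6x6 upper triangular
b_qr = np.linalg.solve(R, Q.T @ y)            # solve R b = Q'y

cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)            # roughly cond(X) squared
print("b via QR:", b_qr)
print("cond(X):", cond_X, " cond(X'X):", cond_XtX)
```

The condition number of X’X is roughly the square of that of X, which is why working with X directly loses far fewer digits than inverting X’X.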

Testing Hypotheses

Hypothesis Testing: Criteria

The F Statistic has an F Distribution

Nonnormality or Large N
Denominator of F converges to 1. Numerator converges to chi squared[J]/J. Rely on law of large numbers for the denominator and CLT for the numerator: JF → Chi squared[J]. Use critical values from chi squared.

Significance of the Regression - R*2 = 0

Table of 95% Critical Values for F

A Case Study

Mega Deals for Stars A Capital Budgeting Computation
Costs and Benefits
Certainty: Costs
Uncertainty: Benefits
Long Term: Need for discounting

Baseball Story A Huge Sports Contract
Alex Rodriguez hired by the Texas Rangers for something like $25 million per year in 2000. Costs – the salary plus and minus some fine tuning of the numbers Benefits – more fans in the stands. How to determine if the benefits exceed the costs? Use a regression model.

The Texas Deal for Alex Rodriguez
2001 Signing Bonus = 10M Total: $252M ???

The Real Deal
Year  Salary  Bonus  Deferral
2001  21      2      5 to 2011
Deferrals accrue interest of 3% per year.

Costs
Insurance: About 10% of the contract per year
(Taxes: About 40% of the contract)
Some additional costs in revenue sharing revenues from the league (anticipated, about 17.5% of marginal benefits – uncertain)
Interest on deferred salary - $150,000 in first year, well over $1,000,000 in 2010.
(Reduction) $3M it would cost to have a different shortstop. (Nomar Garciaparra)

PDV of the Costs
Using 8% discount factor (They used)
Accounting for all costs
Roughly $21M to $28M in each year from 2001 to 2010, then the deferred payments from 2010 to 2020
Total costs: About $165 Million/Year in 2001 (Present discounted value)

Benefits
More fans in the seats
Gate
Parking
Merchandise
Increased chance at playoffs and world series
Sponsorships
(Loss to revenue sharing)
Franchise value

How Many New Fans? Projected 8 more wins per year.
What is the relationship between wins and attendance? Not known precisely Many empirical studies (The Journal of Sports Economics) Use a regression model to find out.

Baseball Data
31 teams, 17 years (fewer years for 6 teams)
Winning percentage: Wins = 162 * percentage
Rank
Average attendance. Attendance = 81*Average
Average team salary
Number of all stars
Manager years of experience
Percent of team that is rookies
Lineup changes
Mean player experience
Dummy variable for change in manager

Baseball Data (Panel Data)

A Dynamic Equation

About 220,000 fans

Marginal Value of One More Win

The Regression Model

Marginal Value of One Win

Marginal Value of an A Rod
8 games * 63,734 fans ≈ 509,878 fans
509,878 fans * ($18 per ticket + $2.50 parking etc. + $1.80 stuff (hats, bobble head dolls,…)) ≈ $11.3 Million per year !!!!!
It’s not close. (Marginal cost is at least $16.5M / year)
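The arithmetic behind this comparison can be reproduced directly. The sketch below assumes the three dollar figures are per-fan amounts to be added together, which is an interpretation of the slide rather than something it states, and it uses the slide's fan count of 509,878 as given.

```python
extra_wins = 8
fans_per_win = 63_734                 # marginal fans per additional win, from the slide
new_fans = 509_878                    # the slide's total (8 * 63,734 is 509,872; the gap is rounding)

ticket, parking, merchandise = 18.00, 2.50, 1.80   # dollars per fan, assumed additive
revenue_per_fan = ticket + parking + merchandise

annual_benefit = new_fans * revenue_per_fan
annual_cost_floor = 16.5e6            # "at least $16.5M / year" from the slide

print("new fans from the slide:", new_fans)
print("8 * fans per win:       ", extra_wins * fans_per_win)
print("annual benefit: $%.2f million" % (annual_benefit / 1e6))
print("benefit covers cost:", annual_benefit >= annual_cost_floor)
```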
