Econometric Analysis of Panel Data William Greene Department of Economics Stern School of Business
Reverse Causality in the Preston Curve? lnSPI = + *lnGDPPC(PPP) + , 0 < < 1. (Huffington Post, 2/16/16)
In two of my projects, I was asked by reviewers to address the endogeneity concerns. In one project, I regress employee departure on project termination. Arguably project termination is not exogenous. DEPARTURE = f(a + b*PROJECT TERMINATION + e) (Time until departure???) In the other, I regress firms’ charitable giving in specific countries on their business activities in the local community. Again, business presence in countries are not exogenous. The problem is, both papers used non-linear models (Hazard model in one, and hurdle model in the other), which are required by the data I have. Are you aware of any econometric methods to deal with endogeneity in non-linear models? My search online did not go anywhere. Hazard Model: Not a linear model. Prob[event happens in time interval t to t+Δ| event happens after time t] = a function of (x’β) http://people.stern.nyu.edu/wgreene/Econometrics/ NonlinearPanelDataModels.pdf
I have been asked this question (or ones like it) dozens of times I have been asked this question (or ones like it) dozens of times. I think the issue is getting way overplayed. But, I'm not the majority voice, so you are going to have to deal with this. Step 1: you or the referee need to figure out (make a case for) by what construction is "project termination" endogenous. What is correlated with what **in your hazard model** that makes the variable endogenous? There must be a second equation that implies that project termination is endogenous. What is it? What unobservable in that equation is correlated with what unobservable in the hazard model that makes it endogenous. Same questions for your hurdle model. Step 2: Depends on the outcome of Step 1…
By what construction is SNAP endogenous in the HEALTH equation? SNAP = XβSNAP + Zδ + ε HEALTH = XβHEALTH + ηSNAP + v
Endogeneity y = X+ε, Definition: E[ε|x]≠0 Why not? Omitted variables Unobserved heterogeneity (equivalent to omitted variables) Measurement error on the RHS (equivalent to omitted variables) Endogenous sampling and attrition Simultaneity (?) (“reverse causality”)
Cornwell and Rupert Data Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years Variables in the file are EXP = work experience WKS = weeks worked OCC = occupation, 1 if blue collar, IND = 1 if manufacturing industry SOUTH = 1 if resides in south SMSA = 1 if resides in a city (SMSA) MS = 1 if married FEM = 1 if female UNION = 1 if wage set by union contract ED = years of education LWAGE = log of wage = dependent variable in regressions These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal of Applied Econometrics, 3, 1988, pp. 149-155. See Baltagi, page 122 for further analysis. The data were downloaded from the website for Baltagi's text. 8
Specification: Quadratic Effect of Experience
The Effect of Education on LWAGE
What Influences LWAGE?
The General Problem
Instrumental Variables Framework: y = X + , K variables in X. There exists a set of K variables, Z such that plim(Z’X/n) 0 but plim(Z’/n) = 0 The variables in Z are called instrumental variables. An alternative (to least squares) estimator of is bIV = (Z’X)-1Z’y ~ Cov(Z,y) / Cov(Z,X) We consider the following: Why use this estimator? What are its properties compared to least squares? We will also examine an important application
An Exogenous Influence
Instrumental Variables Structure LWAGE (ED,EXP,EXPSQ,WKS,OCC, SOUTH,SMSA,UNION) ED (…,MS, FEM) Equation explains the endogeneity Reduced Form: LWAGE[ ED (…,MS, FEM), EXP,EXPSQ,WKS,OCC, SOUTH,SMSA,UNION ]
X Z
Instrumental Variables in Regression One “problem” variable – the “last” one yit = 1x1it + 2x2it + … + KxKit + εit E[εit|xKit] ≠ 0. (0 for all others) There exists a variable zit such that Relevance E[xKit| x1it, x2it,…, xK-1,it,zit] = g(x1it, x2it,…, xK-1,it,zit) In the presence of the other variables, zit “explains” xit A projection interpretation: In the projection xKt =θ1x1it,+ θ2x2it + … + θk-1xK-1,it + θK zit, θK ≠ 0. Exogeneity E[εit| x1it, x2it,…, xK-1,it,zit] = 0 In the presence of the other variables, zit and εit are uncorrelated.
Two Stage Least Squares Strategy Reduced Form: LWAGE[ ED (MS, FEM,X), EXP,EXPSQ,WKS,OCC, SOUTH,SMSA,UNION ] Strategy (1) Purge ED of the influence of everything but MS, FEM (and the other variables). Predict ED using all exogenous information in the sample (X and Z). (2) Regress LWAGE on this prediction of ED and everything else. Standard errors must be adjusted for the predicted ED
OLS
2SLS (Weak Instruments?) The weird results for the coefficient on ED may be due to the instruments, MS and FEM being dummy variables. There is not much variation in these variables and not much covariation with the other variables.
The Source of the Endogeneity LWAGE = f(ED, EXP,EXPSQ,WKS,OCC, SOUTH,SMSA,UNION) + ED = f(MS,FEM, EXP,EXPSQ,WKS,OCC, SOUTH,SMSA,UNION) + u
Can We Remove the Endogeneity? LWAGE = f(ED, EXP,EXPSQ,WKS,OCC, SOUTH,SMSA,UNION) + u + Strategy Estimate u Add u to the equation. ED is correlated with u+ because it is correlated with u. ED is uncorrelated with when u is in the equation.
Auxiliary Regression for ED to Obtain Residuals IVs Exog. Vars
OLS with Residual (Control Function) Added 2SLS
A Warning About Control Functions Sum of squares is not computed correctly because U is in the regression. A general result. Control function estimators usually require a fix to the estimated covariance matrix for the estimator.
Estimating σ2
Robust estimation of VC “Predicted” X “Actual” X
2SLS vs. Robust Standard Errors +--------------------------------------------------+ | Robust Standard Errors | +---------+--------------+----------------+--------+ |Variable | Coefficient | Standard Error |b/St.Er.| B_1 45.4842872 4.02597121 11.298 B_2 .05354484 .01264923 4.233 B_3 -.00169664 .00029006 -5.849 B_4 .01294854 .05757179 .225 B_5 .38537223 .07065602 5.454 B_6 .36777247 .06472185 5.682 B_7 .95530115 .08681261 11.000 | 2SLS Standard Errors | B_1 45.4842872 .36908158 123.236 B_2 .05354484 .03139904 1.705 B_3 -.00169664 .00069138 -2.454 B_4 .01294854 .16266435 .080 B_5 .38537223 .17645815 2.184 B_6 .36777247 .17284574 2.128 B_7 .95530115 .20846241 4.583
Inference with IV Estimators
Endogeneity Test? (Hausman) Exogenous Endogenous OLS Consistent, Efficient Inconsistent 2SLS Consistent, Inefficient Consistent Base a test on d = b2SLS - bOLS Use a Wald statistic, d’[Var(d)]-1d What to use for the variance matrix? Hausman: V2SLS - VOLS
Hausman Test
Hausman Test: One at a Time?
Endogeneity Test: Wu Considerable complication in Hausman test (text, pp. 234-237) Simplification: Wu test. Regress y on X and estimated for the endogenous part of X. Then use an ordinary Wald test.
Monday, 2/6/17
Regression Based Endogeneity Test
Wu Test Since this is 2SLS using a control function, the standard errors should have been adjusted to carry out this test. (The sum of squares is too small.)
Testing Endogeneity of WKS (1) Regress WKS on 1,EXP,EXPSQ,OCC,SOUTH,SMSA,MS. U=residual, WKSHAT=prediction (2) Regress LWAGE on 1,EXP,EXPSQ,OCC,SOUTH,SMSA,WKS, U or WKSHAT +---------+--------------+----------------+--------+---------+----------+ |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X| Constant -9.97734299 .75652186 -13.188 .0000 EXP .01833440 .00259373 7.069 .0000 19.8537815 EXPSQ -.799491D-04 .603484D-04 -1.325 .1852 514.405042 OCC -.28885529 .01222533 -23.628 .0000 .51116447 SOUTH -.26279891 .01439561 -18.255 .0000 .29027611 SMSA .03616514 .01369743 2.640 .0083 .65378151 WKS .35314170 .01638709 21.550 .0000 46.8115246 U -.34960141 .01642842 -21.280 .0000 -.341879D-14 WKS .00354028 .00116459 3.040 .0024 46.8115246 WKSHAT .34960141 .01642842 21.280 .0000 46.8115246
More General Test for Endogeneity
Weak Instruments One endogenous variable: y = X +xk+ ; Instruments (X,Z) Z is exogenous Symptom: The relevance condition, plim Z’X/n not zero, is close to being violated. Relevance: Z must “explain” xk after controlling for X. Detection: Standard F test in the regression of xk on (X,Z). Wald test of coefficients on Z equal 0, F < 10 suggests a problem. (Staiger and Stock) Other versions by Stock and Yogo (2005), Cragg and Donald (1993), Kleibergen and Paap (2006) for when xk is more than one variable and for when xk = (X,Z) + u and is heteroscedastic, clustered, nonnormal, etc. Remedy: Not much – most of the discussion is about the condition, not what to do about it. Use LIML instead of 2SLS? Requires a normality assumption. Probably not too restrictive.
Weak Instruments (cont.)
Weak Instruments
Endogenous Union Effect Name ; Xunion = one,occ,smsa,ed,exp,union $ Name ; Zinst = fem,ind,south $ Name ; Zunion = one,occ,smsa,ed,exp,Zinst $ ? Inconsistent OLS Regr ; Lhs = lwage ; Rhs = Xunion ; Cluster = 7 $ ? Two Stage Least Squares gives a nonsense result 2sls ; Lhs=lwage;rhs=Xunion;Inst=Zunion ; Cluster=7 $ ? Test for weak instruments Regr ; Lhs=union ; Rhs=Zunion ; res = u ; Cluster = 7 ; test:zinst $ ? Control function estimator ? 2SLS coefficients with the wrong standard errors Regr ; Lhs=lwage;rhs=xunion,u;cluster=7$
OLS Should Be Inconsistent
Nonsense 2SLS Result
Weak Instruments? No What is going on here? When the endogenous variable and/or the excluded instruments are binary, the actual results are sometimes a bit unstable. The theoretical results are generally about covariation of continuous variables.
Appendix Miscellaneous
Aggregate Data and Multinomial Choice: The Model of Berry, Levinsohn and Pakes
Theoretical Foundation Consumer market for J differentiated brands of a good j =1,…, Jt brands or types i = 1,…, N consumers t = i,…,T “markets” (like panel data) Consumer i’s utility for brand j (in market t) depends on p = price x = observable attributes f = unobserved attributes w = unobserved heterogeneity across consumers ε = idiosyncratic aspects of consumer preferences Observed data consist of aggregate choices, prices and features of the brands.
BLP Automobile Market Jt t
Random Utility Model Utility: Uijt=U(wi,pjt,xjt,fjt|), i = 1,…,(large)N, j=1,…,J wi = individual heterogeneity; time (market) invariant. w has a continuous distribution across the population. pjt, xjt, fjt, = price, observed attributes, unobserved features of brand j; all may vary through time (across markets) Revealed Preference: Choice j provides maximum utility Across the population, given market t, set of prices pt and features (Xt,ft), there is a set of values of wi that induces choice j, for each j=1,…,Jt; then, sj(pt,Xt,ft|) is the market share of brand j in market t. There is an outside good that attracts a nonnegligible market share, j=0. Therefore,
Endogenous Prices: Demand side Uijt=U(wi,pjt,xjt,fjt|)= xjt'βi – αpj + fjt + εijt fj is unobserved features of model j Utility responds to the unobserved fj Price pj is partly determined by features fj. In a choice model based on observables, price is correlated with the unobservables that determine the observed choices.
Lambeth Southwark & Vauxhall Main sewage discharge The First IV Study Was a Natural Experiment (Snow, J., On the Mode of Communication of Cholera, 1855) http://www.ph.ucla.edu/epi/snow/snowbook3.html London Cholera epidemic, ca 1853-4 Cholera = f(Water Purity,u) + ε. ‘Causal’ effect of water purity on cholera? Purity=f(cholera prone environment (poor, garbage in streets, rodents, etc.). Regression does not work. Two London water companies Lambeth Southwark & Vauxhall Main sewage discharge River Thames Paul Grootendorst: A Review of Instrumental Variables Estimation of Treatment Effects… http://individual.utoronto.ca/grootendorst/pdf/IV_Paper_Sept6_2007.pdf A review of instrumental variables estimation in the applied health sciences. Health Services and Outcomes Research Methodology 2007; 7(3-4):159-179.
IV Estimators bIV = (Z’X)-1Z’y = (Z’X/n)-1 (Z’X/n)β+ (Z’X/n)-1Z’ε/n * Consistent bIV = (Z’X)-1Z’y = (Z’X/n)-1 (Z’X/n)β+ (Z’X/n)-1Z’ε/n = β+ (Z’X/n)-1Z’ε/n β * Asymptotically normal (same approach to proof as for OLS) * Inefficient – to be shown. * By construction, the IV estimator is consistent. We have an estimator that is consistent when least squares is not.
IV Estimation Why use an IV estimator? Suppose that X and are correlated. Then least squares is neither unbiased nor consistent. Recall the proof of consistency of least squares: b = + (X’X/n)-1(X’/n). Plim b = requires plim(X’/n) = 0. If this does not hold, the estimator is inconsistent.
A Popular Misconception If only one variable in X is correlated with , the other coefficients are consistently estimated. False. The problem is “smeared” over the other coefficients.
Consistency and Asymptotic Normality of the IV Estimator
Asymptotic Covariance Matrix of bIV
Asymptotic Efficiency Asymptotic efficiency of the IV estimator. The variance is larger than that of LS. (A large sample type of Gauss-Markov result is at work.) (1) It’s a moot point. LS is inconsistent. (2) Mean squared error is uncertain: MSE[estimator|β]=Variance + square of bias. IV may be better or worse. Depends on the data
Two Stage Least Squares How to use an “excess” of instrumental variables (1) X is K variables. Some (at least one) of the K variables in X are correlated with ε. (2) Z is M > K variables. Some of the variables in Z are also in X, some are not. None of the variables in Z are correlated with ε. (3) Which K variables to use to compute Z’X and Z’y?
Choosing the Instruments Choose K randomly? Choose the included Xs and the remainder randomly? Use all of them? How? A theorem: (Brundy and Jorgenson, ca. 1972) There is a most efficient way to construct the IV estimator from this subset: (1) For each column (variable) in X, compute the predictions of that variable using all the columns of Z. (2) Linearly regress y on these K predictions. This is two stage least squares
Algebraic Equivalence Two stage least squares is equivalent to (1) each variable in X that is also in Z is replaced by itself. (2) Variables in X that are not in Z are replaced by predictions of that X using All other variables in X that are not correlated with ε All the variables in Z that are not in X.
2SLS Algebra
Asymptotic Covariance Matrix for 2SLS
2SLS Has Larger Variance than LS
A study of moral hazard Riphahn, Wambach, Million: “Incentive Effects in the Demand for Healthcare” Journal of Applied Econometrics, 2003
A study of moral hazard Riphahn, Wambach, Million: “Incentive Effects in the Demand for Healthcare” Journal of Applied Econometrics, 2003 Did the presence of the ADDON insurance influence the demand for health care – doctor visits and hospital visits? For a simple example, we examine the PUBLIC insurance (89%) instead of ADDON insurance (2%).
Application: Health Care Panel Data German Health Care Usage Data, 7,293 Individuals, Varying Numbers of Periods Data downloaded from Journal of Applied Econometrics archive. This is an unbalanced panel with 7,293 individuals. They can be used for regression, count models, binary choice, ordered choice, and bivariate binary choice. This is a large data set. There are altogether 27,326 observations. The number of observations ranges from 1 to 7. (Frequencies are: 1=1525, 2=2158, 3=825, 4=926, 5=1051, 6=1000, 7=987). DOCTOR = 1(Number of doctor visits > 0) HOSPITAL = 1(Number of hospital visits > 0) HSAT = health satisfaction, coded 0 (low) - 10 (high) DOCVIS = number of doctor visits in last three months HOSPVIS = number of hospital visits in last calendar year PUBLIC = insured in public health insurance = 1; otherwise = 0 ADDON = insured by add-on insurance = 1; otherswise = 0 HHNINC = household nominal monthly net income in German marks / 10000. (4 observations with income=0 were dropped) HHKIDS = children under age 16 in the household = 1; otherwise = 0 EDUC = years of schooling AGE = age in years MARRIED = marital status EDUC = years of education 71
Evidence of Moral Hazard?
Regression Study
Endogenous Dummy Variable Doctor Visits = f(Age, Educ, Health, Presence of Insurance, Other unobservables) Insurance = f(Expected Doctor Visits, Other unobservables)
Approaches (Parametric) Control Function: Build a structural model for the two variables (Heckman) (Semiparametric) Instrumental Variable: Create an instrumental variable for the dummy variable (Barnow/Cain/ Goldberger, Angrist, current generation of researchers) (?) Propensity Score Matching (Heckman et al., Becker/Ichino, Many recent researchers)
Heckman’s Control Function Approach Y = xβ + δT + E[ε|T] + {ε - E[ε|T]} λ = E[ε|T] , computed from a model for whether T = 0 or 1 Magnitude = 11.1200 is nonsensical in this context.
Instrumental Variable Approach Construct a prediction for T using only the exogenous information Use 2SLS using this instrumental variable. Magnitude = 23.9012 is also nonsensical in this context.
Propensity Score Matching Create a model for T that produces probabilities for T=1: “Propensity Scores” Find people with the same propensity score – some with T=1, some with T=0 Compare number of doctor visits of those with T=1 to those with T=0.
Treatment Effect Earnings and Education: Effect of an additional year of schooling Estimating Average and Local Average Treatment Effects of Education when Compulsory Schooling Laws Really Matter Philip Oreopoulos AER, 96,1, 2006, 152-175
Treatment Effects and Natural Experiments
Alternative to Hausman’s Formula? H test requires the difference between an efficient and an inefficient estimator. Any way to compare any two competing estimators even if neither is efficient? Bootstrap? (Maybe)
On Sat, May 3, 2014 at 4:48 PM, … wrote: Dear Professor Greene, I am giving an Econometrics course in Brazil and we are using your textbook. I got a question which I think only you can help me. In our last class, I did a formal proof that var(beta_hat_OLS) is lower or equal than var(beta_hat_2SLS), under homoscedasticity. We know this assertive is also valid under heteroscedasticity, but a graduate student asked me the proof (which is my problem). Do you know where can I find it?