Analysis of Experimental Data II Christoph Engel
linear model
I. treatment effect
II. continuous explanatory variable
III. heteroskedasticity
IV. control variables
V. interaction effects
VI. outliers
VII. endogeneity
VIII. small and big problems
I. treatment effect
pro
(usually) more statistical power
greater flexibility: control variables, heteroskedasticity, instrumental variables, time series and panel models, non-linear functional form
automatic estimate of effect size: (in principle) the marginal effect
contra
more assumptions
data generation
set obs 1000
gen uid = _n
gen error = rnormal()
gen treat = (uid > 500)
gen dv = 5 + 2*treat + error
data
non-parametric
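A minimal sketch of the non-parametric alternative, assuming a Mann-Whitney rank-sum test on the simulated data (the slide does not name the test):
ranksum dv, by(treat)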
parametric
ttest: hardly ever used with experimental data; it yields no effect size and assumes normality
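For comparison, a sketch of the parametric test the slide critiques, using Stata's two-sample t test on the simulated data:
ttest dv, by(treat)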
(linear) regression
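A sketch of the regression presumably reported here, regressing dv on the treatment dummy:
reg dv treat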
reference category: the baseline (its mean = cons); mean of the treatment group = cons + treat coefficient = 6.992
(linear) regression: reliability of estimates
(linear) regression: explained variance
regression model
explanandum: depvar(i)
explanans: indepvars(i)
explanation: cons, coef
regression model: depvar(i) = cons + coef*indepvars(i) + error(i)
fundamental assumption: the error is uncorrelated with the explanatory variables
graphical way of testing: plot the residuals against the predicted values; the two should be orthogonal
plot
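One way to produce such a plot in Stata; rvfplot graphs the residuals against the fitted values of the last regression:
reg dv treat
rvfplot, yline(0)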
II. continuous explanatory variable
data generating process
dv = 5 + .5*level + error
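A sketch of simulating this dgp; the distribution of level is an assumption, as the slides do not specify it:
clear
set obs 1000
gen uid = _n
gen level = 10*runiform()    // assumed: level uniform on [0,10]
gen error = rnormal()
gen dv = 5 + .5*level + error
reg dv level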
regression
interpretation: in a linear model, coef = marginal effect (take the first derivative wrt level)
prediction: a one-unit increase of level leads to a .495 increase of dv
orthogonality of error
prediction
reg dv level
predict preddv
twoway (scatter dv level) (scatter preddv level, c(L))
regression
significance
intuitive criterion: H0: the regressor has no explanatory power, i.e. its coefficient is zero
is 0 within the confidence interval?
how to construct the confidence interval? mean ± critical value*SE (≈ 1.96*SE for a 95% interval), where SE = sqrt(the corresponding diagonal entry of the variance-covariance matrix)
not very intuitive
intuitive approximation: assume the error is orthogonal to the regressor and has mean 0
graph
what goes wrong? 6.3% of the resulting distribution lies below 0: the procedure attributes the entire unexplained variance to the level regressor
III. heteroskedasticity
dv = 5 + .5*level + .1*level*error
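A sketch of the heteroskedastic dgp, reusing level and error from the sketch above; the error's spread now grows with level:
replace dv = 5 + .5*level + .1*level*error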
estimation
problem
pure heteroskedasticity does not bias the point estimates (cf. VIII.), but at any rate the standard errors are wrong: SE of level underestimated, SE of cons overestimated
solution: (heteroskedasticity-)robust standard errors
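A sketch of both estimations; the point estimates are identical, only the standard errors differ:
reg dv level
reg dv level, vce(robust)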
technically
assuming homoskedasticity, all obs are iid: the variance / sd / se is the same all over (and all covariance terms are 0)
σ² 0  0  0  0
0  σ² 0  0  0
0  0  σ² 0  0
0  0  0  σ² 0
0  0  0  0  σ²
by contrast, under heteroskedasticity each observation has its own variance:
σ₁² 0   0   0   0
0   σ₂² 0   0   0
0   0   σ₃² 0   0
0   0   0   σ₄² 0
0   0   0   0   σ₅²
IV. control variables
data generating process: two-dimensional, orthogonal
exact orthogonality is rare in experimental data, but correlation of the indepvars is no problem if it is not very pronounced (multicollinearity)
dv = 5 + 2*treat + .5*level + error
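A sketch of this two-dimensional design; generating treat and level independently makes them (approximately) orthogonal, and level's distribution is again an assumption:
clear
set obs 1000
gen uid = _n
gen error = rnormal()
gen treat = (uid > 500)
gen level = 10*runiform()
gen dv = 5 + 2*treat + .5*level + error
reg dv treat level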
omitted variables
if orthogonal: no problem with consistency; but the SEs are wrong, and cons is wrong
prediction
same with collinearity
data generating process as before, but:
replace treat = treat + .1*level
consistency affected: once treat and level are correlated, omitting level biases the treat estimate
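A sketch contrasting the full and the omitted-variable regression under the correlated dgp:
replace treat = treat + .1*level
replace dv = 5 + 2*treat + .5*level + error
reg dv treat level    // full model: consistent
reg dv treat          // level omitted: treat estimate biased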
V. interaction effects
data generating process
dv = 5 + 2*treat + .5*level - .25*treat*level + error
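A sketch of this dgp and the corresponding specification, with the interaction entered as a generated product term (treat is again the binary dummy):
clear
set obs 1000
gen uid = _n
gen error = rnormal()
gen treat = (uid > 500)
gen level = 10*runiform()    // assumed distribution
gen dv = 5 + 2*treat + .5*level - .25*treat*level + error
gen inter = treat*level
reg dv treat level inter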
regression
prediction
testing the net effect: is something relevant happening in the treatment at the beginning (level = 0)?
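In this specification the treatment effect at level = 0 is simply the treat coefficient, so a linear-combination test answers the question:
lincom _b[treat]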
testing the treatment effect at various levels: is there a treatment effect at the beginning? is there one at the end? everywhere?
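A sketch using factor-variable notation, which lets margins evaluate the treatment effect across levels; the grid 0(2)10 assumes level runs from 0 to 10:
reg dv i.treat##c.level
margins, dydx(treat) at(level=(0(2)10))
marginsplot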
VI. outliers
data generating process
dv = 5 + .5*level + error
replace dv = 1000 if uid > 995
heavy problem
what to do? think of an endgame effect: the proximate cause is the highest level (the last period)
relatively good, but level becomes insignificant
transform dv
best: 1/sqrt(dv)
good for cons after retransformation; very poor for level
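A sketch of the transformation; dvtrans is a hypothetical name, and the constant is retransformed by inverting 1/sqrt(.):
gen dvtrans = 1/sqrt(dv)
reg dvtrans level
display 1/(_b[_cons]^2)    // retransformed constant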
find reason / contingency
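A sketch of modeling the contingency directly, assuming the outliers are the endgame observations (uid > 995) from the dgp:
gen last = (uid > 995)
reg dv level last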
problem solved
VII. endogeneity
immaterial for the treatment effect: randomization prevents it
easily relevant when explaining the treatment effect
data generating process
level = 2 + .5*trait + error
dv = 5 + 2*treat + .5*level + error
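A sketch of this dgp; the same error term enters both equations, which is what makes level endogenous (trait's distribution is an assumption):
clear
set obs 1000
gen uid = _n
gen treat = (uid > 500)
gen trait = rnormal()                      // assumed distribution
gen error = rnormal()
gen level = 2 + .5*trait + error
gen dv = 5 + 2*treat + .5*level + error    // shared error: level endogenous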
inconsistency
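The inconsistency shows up when this dgp is estimated by plain OLS; the level coefficient absorbs the shared error and is biased upward:
reg dv treat level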
2sls
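A sketch of the 2SLS estimation, instrumenting level with trait:
ivregress 2sls dv treat (level = trait)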
VIII. small and big problems
heteroskedasticity: consistent; robust SE
non-normality (of the error term): (law of large numbers); alternative functional form
non-independence: dgp-induced; match with the statistical model
(small and big problems)
(omitted variables): decontextualisation
outliers: capture by specification; (transform dv)
endogeneity: (randomization); (create an) iv