Lurking inferential monsters? Exploring selection bias in school-based interventions
Ben Weidmann
benweidmann@g.harvard.edu
A strategy for empirically assessing bias: within-study comparisons
- When we have experimental data, we can test for selection bias by doing a ‘within-study comparison’
- This idea is not new [Lalonde 1986]
[Diagram: Participate in Program. Group 1: Experimental Treatment. Group 2: Experimental Control. Group 3: Observational Control.]
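
A minimal sketch of the comparison, in Python. It assumes the non-experimental estimate is a raw difference in means between Group 1 and Group 3; in practice that estimate would come from matching or another adjustment. The function name `within_study_bias` and the toy outcome values are made up for illustration and are not from any real trial.

```python
from statistics import mean


def within_study_bias(treated, experimental_control, observational_control):
    """Estimated selection bias = non-experimental estimate - experimental estimate.

    Assumption: both estimates are simple differences in mean outcomes.
    """
    # Benchmark: Group 1 vs Group 2 (randomised, so unbiased in expectation)
    experimental_estimate = mean(treated) - mean(experimental_control)
    # Observational: Group 1 vs Group 3 (may be affected by selection)
    observational_estimate = mean(treated) - mean(observational_control)
    return observational_estimate - experimental_estimate


# Toy outcomes in effect size units (hypothetical values)
treated = [0.30, 0.10, 0.50, 0.20]
experimental_control = [0.10, 0.00, 0.20, 0.10]
observational_control = [0.00, -0.10, 0.10, 0.00]

bias = within_study_bias(treated, experimental_control, observational_control)
print(round(bias, 3))
```

With unadjusted means, the estimated bias reduces to the gap between the two control groups, which is why the choice of observational control group matters so much.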
Some comments on existing within-study comparisons
The within-study comparison literature has generally:
- Looked at single evaluations [rather than systematically examining a large set, in a particular context]
- Said little about schools [with some exceptions, e.g. Bifulco (2012) on charter schools]
There is reason to think that school evaluations might be a good context for ‘selection on observables’:
- In education evaluations, non-experimental estimates are probably better than in the canonical job-training literature [Cook, Shadish and Wong (2008)]
Research questions and priors
1. What is the distribution of selection bias across a range of school-based interventions in the UK?
   - On average, the estimated bias will be between 0.05 and 0.1 s.d. (in effect size units)
2. Are some types of interventions more prone to selection bias than others?
   - Selection bias will be smaller for interventions focussed on older children
   - Selection bias will be smaller for interventions focussed on Maths (as opposed to literacy)
3. What non-experimental methods are best at recovering experimental estimates?
   - Mahalanobis distance (rather than just using the propensity score)
   - Preference for matches from within the same LA
   - Sub-classification (rather than nearest neighbour)
4. [RELATED WORK: How different are the EEF trial results if they’re reweighted to represent a more general population of students?]
Research questions and limitations
1. What is the distribution of selection bias across a range of school-based interventions in the UK?
   - We’re examining a specific selection mechanism (that may not apply to other contexts)
2. Are some types of interventions more prone to selection bias than others?
   - We have relatively few cases (~15), so quantitative analysis will be highly uncertain
3. What non-experimental methods are best at recovering experimental estimates?
   - There are lots of methods that we won’t be testing (e.g. coarsened exact matching)
What might the results look like? 3 stylised possibilities (for research question 1)
[Figure: three stylised distributions of estimated bias, labelled ‘Small’, ‘Mainly positive (or negative)’ and ‘Big’. x-axis: Estimated Bias (effect size, d), from -0.5 to 0.5.]
Questions and comments welcome!
SIMULATION STUDY
Motivation
- The UK is hoping to set up a Data Service
- The goal of the service would be to provide ‘impact estimates’ for programs that are already operating in schools
- The idea is that organisations contact the Data Service and provide a list of the schools in which they’re operating
- The Data Service then performs an observational study, using matching
- The resulting estimate will be fed back to schools and the organisation (and/or used to decide which programs will get a fancy, expensive RCT)
Problem
- What if this exciting new program is only operating in 1 school? Would we be comfortable providing a ‘Data Service Approved’ impact estimate?
- Two costs:
  1. Providing the estimate takes time and money. It’s not worth doing if the estimate is going to be too noisy
  2. Although we’ll provide information on uncertainty, consumers of research (e.g. teachers, journalists, policy makers) might not always take it into account
- But how big should our sample be? Power calculations! [Taking into account the fact that our observational study will have bias]
Goal of my simulation
- Provide a tool to help decide how big sample sizes need to be to justify providing an official estimate of ‘impact’
- Illustrate the power and Type S error (sign error) rates for different, realistic scenarios
  - Power: for a given effect size, the probability we correctly reject $H_0$ of no treatment effect
  - Type S error (sign error): the true effect is negative but we confidently conclude that the effect is positive (or vice versa)
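
A rough analytic sketch of these two quantities, in Python. This is not the simulation itself. It assumes the impact estimate is approximately normal with mean (true effect + bias) and a known standard error, and that we use a two-sided test at the 5% level. The function name `power_and_type_s` and its parameters are hypothetical. ‘Power’ here means a significant result with the correct sign, and the Type S rate is the unconditional probability of a significant result with the wrong sign.

```python
from statistics import NormalDist


def power_and_type_s(delta, bias, se, alpha=0.05):
    """Return (power, type_s) for a biased, approximately normal estimator.

    Assumption: estimate ~ Normal(delta + bias, se**2), two-sided z-test.
    power  = P(significant and same sign as the true effect delta)
    type_s = P(significant and opposite sign to delta)
    """
    if delta == 0:
        raise ValueError("the sign of the true effect is undefined when delta == 0")
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    centre = (delta + bias) / se               # mean of the z-statistic
    p_sig_pos = 1 - std.cdf(crit - centre)     # P(z > crit)
    p_sig_neg = std.cdf(-crit - centre)        # P(z < -crit)
    if delta > 0:
        return p_sig_pos, p_sig_neg
    return p_sig_neg, p_sig_pos


# Hypothetical scenario: true effect 0.1 s.d., standard error 0.05
no_bias = power_and_type_s(delta=0.1, bias=0.0, se=0.05)
opposing_bias = power_and_type_s(delta=0.1, bias=-0.2, se=0.05)
print(no_bias, opposing_bias)
```

In the second scenario the bias is larger than the true effect and has the opposite sign, so most significant results point the wrong way.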
Overview of data generating process
$Y_i(0) = \beta_1 X_i + U_i$
$Y_i(1) = Y_i(0) + \Delta$
- Y is the outcome [e.g. standardised reading score at age 11]
- X is a predictor [e.g. standardised reading score at age 7]; $X \sim N(0, 1)$
- U is unobserved characteristics (including error); $U \sim N(0, \sigma_U^2)$; X and U are independent
- $\Delta$ is the treatment effect
- Z is a treatment indicator, $Z \in \{0, 1\}$, with $P(Z_i = 1) = \Phi(\alpha_1 U_i)$
- The parameter $\alpha_1$ determines the extent of ‘bias’
- Bias: defined as $b = E[U \mid Z = 1] - E[U \mid Z = 0]$
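
The data generating process above can be sketched in Python using only the standard library. The function names (`simulate`, `selection_bias`, `naive_estimate`) and all default parameter values are illustrative assumptions, not the settings used for the results. The naive estimate is an unadjusted difference in mean observed outcomes, included only to show how the bias feeds through.

```python
import random
from statistics import NormalDist, mean


def simulate(n, beta1=1.0, sigma_u=1.0, delta=0.1, alpha1=0.5, seed=0):
    """Draw n units; return a list of (x, u, z, y_observed) tuples."""
    rng = random.Random(seed)
    phi = NormalDist().cdf                     # standard normal CDF
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)                    # X ~ N(0, 1)
        u = rng.gauss(0, sigma_u)              # U ~ N(0, sigma_u^2), independent of X
        y0 = beta1 * x + u                     # potential outcome without treatment
        y1 = y0 + delta                        # potential outcome with treatment
        z = 1 if rng.random() < phi(alpha1 * u) else 0   # selection depends on U
        rows.append((x, u, z, y1 if z else y0))
    return rows


def selection_bias(rows):
    """Sample analogue of b = E[U | Z = 1] - E[U | Z = 0]."""
    u_treated = [u for _, u, z, _ in rows if z == 1]
    u_control = [u for _, u, z, _ in rows if z == 0]
    return mean(u_treated) - mean(u_control)


def naive_estimate(rows):
    """Unadjusted difference in mean observed outcomes (treated - control)."""
    y_treated = [y for _, _, z, y in rows if z == 1]
    y_control = [y for _, _, z, y in rows if z == 0]
    return mean(y_treated) - mean(y_control)


rows = simulate(5000)
print(selection_bias(rows), naive_estimate(rows))
```

Setting `alpha1=0` removes the selection on U, so the sample bias should be close to zero, and the naive estimate should then be close to the treatment effect.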
Factors for the simulation
Inputs into the simulation:
- Sample size
- $R^2$: in a regression of Y(0)~X
- Bias
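
One way the $R^2$ input could map onto the parameters of the data generating process is sketched below. It uses only $X \sim N(0, 1)$, $U \sim N(0, \sigma_U^2)$ and the independence of X and U. It refers to the population $R^2$ of Y(0) on X across all units. Whether the simulation fixes $\beta_1$ and solves for $\sigma_U^2$ in this way is an assumption.

```latex
\operatorname{Var}\big(Y_i(0)\big)
  = \beta_1^2 \operatorname{Var}(X_i) + \operatorname{Var}(U_i)
  = \beta_1^2 + \sigma_U^2,
\qquad
R^2 = \frac{\beta_1^2}{\beta_1^2 + \sigma_U^2}
\;\Longrightarrow\;
\sigma_U^2 = \beta_1^2 \, \frac{1 - R^2}{R^2}.
```

For example, with $\beta_1 = 1$, an $R^2$ of 0.5 corresponds to $\sigma_U^2 = 1$.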
Results (Power)
Results (Type S error)
Conclusions
- When bias has the opposite sign of the true effect, bias either reduces power or increases the chance that you’re going to make a Type S error
- When bias and the true effect have the same sign, it helps in terms of power and avoiding Type S errors [although you might make a bad mistake about magnitude]
- As a general takeaway, if the expected bias is similar in magnitude to the expected effect size, you’re toast [regardless of sample size] unless you have strongly predictive covariates