Comparing Bayesian and Frequentist Inference for Decision-Making
Presentation at SREE, March 2, 2017
Jesse Chandler · Mariel Finucane · Ignacio Martinez · Alexandra Resch · Jeffrey Terziev
Motivation for the study
- We presented results from a small pilot of an ed tech product
- The evaluation was added after the pilot began and was underpowered
- Goals of our technical assistance:
  - Help the district learn whether its initiatives are effective
  - Build district capacity for evidence use and generation
- Our summary: promising results, but inconclusive
- There were strong reactions to calling the results "promising" given the lack of statistical significance
- It became clear that audiences default to p < 0.05
- This made us reevaluate our defaults
Can schools generate useful evidence?
Observations:
- Schools need to make decisions whether or not an effectiveness study is planned or feasible
- Operational decisions take priority over evaluation
- Even when a study is desired, resource constraints affect its design
- Study findings are often not useful to decision makers
Questions:
- What can districts learn from everyday decisions?
- Is there a better way to present information to decision makers?
This study: Do people make different decisions?
- Using an online platform, we showed a convenience sample information about hypothetical school district decisions: a choice between two software products
- In both cases there was some evidence that the new software is more effective, but participants were told that switching takes time and money
- There is no "correct" answer; it depends on how you value the costs and benefits, and on your risk tolerance
- For each scenario, we asked:
  - What would you decide to do?
  - How confident do you feel about your choice?
Some caveats
- We are looking at a particular way of presenting frequentist results: null hypothesis testing using the defaults common in program evaluation in education
- ...versus a particular way of presenting Bayesian results: the posterior probability that the treatment is effective
- The appropriate methods and ideal presentation of results will vary with the application at hand
- We test this with a convenience sample
Why we chose a Bayesian comparison
- It could produce inferences we thought were better aligned with how decision makers think: P(truth | data) vs. P(data | truth)
- Findings can be phrased as probabilistic statements (e.g., "there is an 80% chance the intervention has a positive effect on student achievement")
- There are other possible benefits, but they are not examined here
The scenario
- You are a curriculum coordinator for a school district and need to decide whether to stick with a current technology or switch to a new one
- The district conducted a pilot in 10 classrooms, randomly assigning 5 classrooms to each product
- The products cost the same, but switching carries some transition costs
- Each condition sees a different presentation of the results
- Participants are asked: "Based on the data, your recommendation is to..."
  - Use [existing product]
  - Use [new product]
  - Collect more data before deciding which software to use
Randomized crossover design
- Within subjects: whether the Bayesian or the frequentist version comes first
  - All subjects see both
  - We randomize which scenario is math and which is reading
- Between subjects: text only vs. text + graph
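The assignment scheme above can be sketched as follows (the function and field names here are hypothetical, not taken from the study's actual code):

```python
import random

def assign_conditions(participant_id: int, rng: random.Random) -> dict:
    """Sketch of the crossover randomization described above."""
    # Within subjects: everyone sees both framings; we randomize the order
    # and which scenario (math vs. reading) gets the Bayesian framing.
    first_framing = rng.choice(["bayesian", "frequentist"])
    bayesian_scenario = rng.choice(["math", "reading"])
    # Between subjects: how the results are displayed.
    display = rng.choice(["text only", "text + graph"])
    return {
        "id": participant_id,
        "first_framing": first_framing,
        "bayesian_scenario": bayesian_scenario,
        "display": display,
    }

# One assignment per participant, each with an independently seeded RNG.
assignments = [assign_conditions(i, random.Random(i)) for i in range(280)]
```

Because the framing order is within subjects, every participant contributes data to both conditions, while the display format comparison is purely between subjects.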
All conditions see average scores
"Your data specialist tells you that, on average, the students in the classrooms that used the new MathCoach software scored 10.31 points higher on the year-end tests than the students in the classrooms that used MathTech."
Interpretation differs by condition
Standard frequentist: "The 95% confidence interval of the difference in test scores between the two groups of classrooms ... includes 0, so they cannot reject the hypothesis that the interventions have the same effect."
Bayesian: "There is a 77% chance that the new technology improves achievement, and a 23% chance that the new software decreases achievement."
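With a flat prior and a normal likelihood, both summaries above come from the same sampling distribution; only the framing differs. A minimal sketch of that correspondence follows (the standard error is an assumed value chosen for illustration, since the slides report only the 10.31-point mean difference):

```python
from statistics import NormalDist

# Hypothetical pilot summary: mean difference and an assumed standard error.
mean_diff = 10.31   # new software minus old, year-end test points
se_diff = 14.0      # assumed standard error of the difference

# Frequentist framing: a 95% confidence interval for the difference.
z = NormalDist().inv_cdf(0.975)            # ~1.96
lo, hi = mean_diff - z * se_diff, mean_diff + z * se_diff
print(f"95% CI: ({lo:.1f}, {hi:.1f})")     # interval includes 0

# Bayesian framing with a flat prior: the posterior for the effect is
# Normal(mean_diff, se_diff), so the chance the new software helps is
# one tail of that posterior.
p_positive = 1 - NormalDist(mean_diff, se_diff).cdf(0)
print(f"P(new software improves scores) = {p_positive:.0%}")
```

With these assumed numbers the interval spans zero ("not significant") while the posterior probability of a positive effect is about 77%, matching the two presentations shown to participants.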
The standard treatment
The Bayesian treatment
The sample
- Convenience sample of 280 participants recruited through Amazon Mechanical Turk
- Samples drawn from MTurk are relatively young and well educated (for an overview, see Chandler & Shapiro, 2016)
- Our sample: 56% male; mean age 37 years (SD = 12); 48% have at least a college degree
- We asked four factual questions about the scenarios and excluded 11 participants who gave more than 2 incorrect answers
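The screening rule above can be written directly (the function name is hypothetical):

```python
def passes_screen(num_correct: int, total_questions: int = 4) -> bool:
    """Keep a participant only if they missed at most 2 of the
    four factual questions about the scenarios."""
    return (total_questions - num_correct) <= 2

# 280 recruited, 11 excluded under this rule: 269 analyzed.
analyzed = 280 - 11
```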
Bayesian results are actionable
Graphs are actionable
Bayesian results increase confidence
Bayesian results are easier to understand
Showing results from the first scenario only; the second scenario is complicated by a practice effect but produces substantially the same interpretation.