How do we know what works?

Robert Coe, ResearchEd, London, 5 Sept 2015

Higgins, S., Katsipataki, M., Kokotsaki, D., Coleman, R., Major, L.E., & Coe, R. (2013). The Sutton Trust-Education Endowment Foundation Teaching and Learning Toolkit. London: Education Endowment Foundation.
Coe, R. (2013). Improving Education: A triumph of hope over experience. Inaugural Lecture of Professor Robert Coe, Durham University, 18 June. Essay version and video available.
Coe, R., Aloisi, C., Higgins, S. and Elliot Major, L. (2014). ‘What makes great teaching? Review of the underpinning research’. Sutton Trust, October.
Cordingley, P., Higgins, S., Greany, T., Buckler, N., Coles-Jordan, D., Crisp, B., Saunders, L., Coe, R. (2015). Developing Great Teaching: Lessons from the international reviews into effective professional development. Teacher Development Trust.

How do we know what works?
Progress in evidence-based education
Defining ‘what works’
The case for RCTs
Some standard objections
When ‘what works’ doesn’t work
Practical implications

How far have we come?
1999
  Very few UK education researchers who had done RCTs
  Dominant view: you can’t (or shouldn’t) do RCTs in education
  Very limited policy interest in robust evaluation
2015
  Growing, sustainable body of UK researchers with education RCT expertise
  EEF funding changed those views
  Policy interest excellent in parts

To claim something ‘works’
Is there a choice between two (or more) plausible options?
  Well-defined (inc how to implement)
  Repeatable, generalisable, transferable
  Feasible, acceptable, equipoise
Can we agree what outcome(s) are important?
  Value judgements resolved or explicit
  Valid measurement process
Is there rigorous systematic evidence to support one choice?
  Systematic review
  Overall average difference & ‘moderators’

http://www. dylanwiliam

In a ‘research-based’ profession:
Professionals would, for the majority of decisions they need to take, be able to find and access credible research studies that provided evidence that particular courses of action would, implemented as directed, be substantially more likely to lead to better outcomes than others.
  Wiliam (2014)
Professionals would, for some decisions they need to take, be able to access and understand high-quality evidence that particular courses of action would be likely to lead to better outcomes than others.
  Coe (2015)

From Corder et al (2015) International Journal of Behavioral Nutrition and Physical Activity
Appropriately cautious claims:
  “An extra hour of screen time was associated with 9.3 (−14.3, −4.3) fewer [GCSE] points”
  “it would be impossible to tell whether reductions in screen time caused an increase in academic performance without a randomised controlled trial”
But also some implicit causal claims:
  “Screen time was associated with lower academic performance, suggesting that strategies to limit screen behaviours among adolescents may benefit academic performance”
Corder, K., Atkin, A. J., Bamber, D. J., Brage, S., Dunn, V. J., Ekelund, U., ... & Goodyer, I. M. (2015). Revising on the run or studying on the sofa: Prospective associations between physical activity, sedentary behaviour, and exam results in British adolescents. International Journal of Behavioral Nutrition and Physical Activity, 12:106.

Media quotes
But even if pupils spent more time studying, more time spent watching TV or online, still harmed their results, the analysis suggested.
"We believe that programmes aimed at reducing screen time could have important benefits for teenagers' exam grades, as well as their health," said Dr Van Sluijs
“We found that TV viewing, computer games and internet use were detrimental to academic performance”

Is screen time the cause of poorer GCSEs?
A statistical association can be evidence for a causal relationship if other explanations for the relationship have been systematically generated, tested and discredited
  E.g. smoking and cancer
But even high correlations, sophisticated models and ‘strong’ controls do not guarantee this
  Coe (2009): “What appeared to the original researchers to be substantial and unequivocal causal effects were reduced to tiny and uncertain differences when the effects of plausible unobserved differences were taken into account.”
In this study
  No control for any prior cognitive measure
  Weak control for SES (IMD from LSOAs)
  Many obvious alternative explanations
Coe, R. (2009). Unobserved but not unimportant: the effects of unmeasured variables on causal attributions. Effective Education, 1(2).
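
The point about unobserved differences can be made concrete with a small simulation. This is a sketch under invented assumptions, not a re-analysis of Corder et al: the variable names, coefficients and sample size are all hypothetical. Here an unmeasured prior-attainment variable drives both screen time and exam points, and screen time itself has no causal effect at all.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(y, x):
    """What is left of y after removing its linear relationship with x."""
    b = slope(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return [yi - mean_y - b * (xi - mean_x) for xi, yi in zip(x, y)]


def simulate(n=20000, seed=1):
    """Return (naive slope, slope after adjusting for the hidden variable)."""
    rng = random.Random(seed)
    # Hypothetical unmeasured prior attainment (standardised).
    prior = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: higher prior attainment goes with less screen time.
    # (Values are illustrative only and may even be negative.)
    screen = [2.0 - 0.5 * p + rng.gauss(0, 1) for p in prior]
    # Assumption: points depend on prior attainment only, so the true
    # causal effect of screen time on points is exactly zero.
    points = [40 + 10 * p + rng.gauss(0, 5) for p in prior]
    naive = slope(screen, points)
    # Regressing residuals on residuals gives the screen-time coefficient
    # from a regression that also includes prior attainment.
    adjusted = slope(residuals(screen, prior), residuals(points, prior))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"points per extra hour, no adjustment: {naive:.2f}")
    print(f"points per extra hour, adjusted for prior attainment: {adjusted:.2f}")
```

In this made-up world the unadjusted analysis shows several points lost per extra hour of screen time, and the association disappears once the hidden variable is included. In a real study the hidden variable is, by definition, not available to adjust for.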

Is screen time the cause of poorer GCSEs?
It is a meaningless question:
  What are the well-defined, feasible, repeatable options for action?
A question you could answer:
  Does intervention X to reduce the time 14-year-olds spend on non-educational screen time lead to increases in their GCSEs?
Related questions
  Does X actually reduce screen time?
  What support factors are required for it to work?
  Does it work more/less with some groups?

Is the RCT a gold standard?

Claim: If you do A, it will improve B
Evidence: We did A and it improved B
Would B have improved anyway? (counterfactual)
Was it really A? (attribution)
Did B really improve? (interpretation)
Will it work again for me? (generalisation)

1. Would B have improved anyway? (counterfactual)
Was there an equivalent, randomly allocated comparison group?
  Randomisation done properly?
  Beware attrition
Was there a comparison group, equivalent on observed measures?
  Quality of the measures?
  Quantity of the measures (inc repeated measures)?
  Unobserved differences (e.g. enthusiasm, choice)?
Was there a non-equivalent comparison group?
  Select an overlapping subset (propensity score matching)
  Statistical ‘control’ is problematic
If no direct comparison: impossible to interpret
Examples
  Subversion of randomisation: Tennessee STAR
  Attrition in response to treatment: EEF???; this study?
  Poor measures/unobserved: Coe 2009 paper: Easter summer school, G&T identification
  Non-equivalent comparison: grammar schools
  No comparison: Y9 mentoring
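
The difference between a randomly allocated comparison group and a self-selected one can be sketched in a few lines. Everything here is a hypothetical illustration: the "enthusiasm" variable, the effect sizes and the function names are assumptions, and the true effect of the intervention is set to zero so that any apparent effect is pure selection.

```python
import random


def mean(values):
    return sum(values) / len(values)


def apparent_effect(treated, outcomes):
    """Difference in mean outcome between treated and untreated pupils."""
    yes = [y for t, y in zip(treated, outcomes) if t]
    no = [y for t, y in zip(treated, outcomes) if not t]
    return mean(yes) - mean(no)


def compare_designs(n=20000, true_effect=0.0, seed=2):
    """Return (apparent effect with volunteers, apparent effect with random allocation)."""
    rng = random.Random(seed)
    # Hypothetical unobserved trait that also improves outcomes.
    enthusiasm = [rng.gauss(0, 1) for _ in range(n)]

    def outcome(e, treated):
        return 0.5 * e + true_effect * treated + rng.gauss(0, 1)

    # Design 1: pupils choose; the enthusiastic are more likely to opt in.
    volunteered = [e + rng.gauss(0, 1) > 0 for e in enthusiasm]
    # Design 2: a coin toss decides, independent of everything else.
    allocated = [rng.random() < 0.5 for _ in range(n)]

    self_selected = apparent_effect(
        volunteered, [outcome(e, t) for e, t in zip(enthusiasm, volunteered)]
    )
    randomised = apparent_effect(
        allocated, [outcome(e, t) for e, t in zip(enthusiasm, allocated)]
    )
    return self_selected, randomised


if __name__ == "__main__":
    self_selected, randomised = compare_designs()
    print(f"apparent effect, self-selected groups: {self_selected:.2f}")
    print(f"apparent effect, randomised groups:    {randomised:.2f}")
```

With volunteers, an intervention that does nothing appears to help, because the groups already differed on enthusiasm. With random allocation the groups differ only by chance, so the estimate sits close to the true value of zero.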

2. Was it really A? (attribution)
Could the process of being observed or involved in an experiment have caused it (reactivity/Hawthorne effects)?
Could the involvement of the researcher or developer be a factor?
Contamination: what did control group do?
Did they actually do A (faithfully)?
Did other things change?
Examples
  Hawthorne: meta-analysis by Adair; find another route (this study); Peer learning
  Researcher involvement: Success for All
  Contamination:
  Not implemented:
  Other changes: league tables in Wales

3. Did B really improve? (interpretation)
Was the measure of B adequate?
  Validity of measure (biased, unreliable, misinterpreted)
  Too narrow/broad
  Ceiling/floor effects
  Un-blinded judgements
Was the ‘post-test’ timing too soon/late?
Any attrition? (missing data, lost persons, units)
Could it have been just chance? (statistical significance)
Was the reporting comprehensive and unbiased?
  Data dredging
  Selective reporting
  Publication bias
Examples
  Narrow/broad: meta-analysis of standardised tests vs bespoke
  Unblinded: Numbers count (Torgerson et al)
  Timing: High-Scope Perry pre-school; Cambridge-Somerville Youth Study
  Attrition (loss to follow-up):
  File drawer effects:
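
The "just chance" and "data dredging" questions interact. If an evaluation measures many outcomes and reports whichever one reaches significance, a false positive becomes likely even when the intervention does nothing. With 20 independent outcomes tested at the 5% level, the chance of at least one spurious "significant" result is 1 − 0.95^20, about 0.64. The sketch below checks this by simulation. The numbers of outcomes, pupils and trials are arbitrary assumptions, and it uses a simple z-test with a known standard deviation of 1 rather than any particular study's analysis.

```python
import random
from statistics import NormalDist


def p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means, assuming sd = 1."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (1 / n_a + 1 / n_b) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rate(n_outcomes=20, n_per_group=30, n_trials=1000,
                     alpha=0.05, seed=3):
    """Share of null trials in which at least one outcome looks significant."""
    rng = random.Random(seed)
    alarms = 0
    for _trial in range(n_trials):
        for _outcome in range(n_outcomes):
            # Both groups come from the same distribution: no real effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if p_value(a, b) < alpha:
                alarms += 1
                break  # one 'finding' is enough to report
    return alarms / n_trials


if __name__ == "__main__":
    print(f"one outcome:     {false_alarm_rate(n_outcomes=1):.2f}")
    print(f"twenty outcomes: {false_alarm_rate(n_outcomes=20):.2f}")
```

Pre-specifying a single primary outcome keeps the false-alarm rate near the nominal 5%, which is one reason comprehensive, unbiased reporting matters.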

4. Will it work again for me? (generalisation)
Representativeness
  Context (including support factors)
  Population (achieved, not just intended)
Intervention not specified or replicable
Will it still work at a large scale?
Examples
  Support factors: California class size
  Intervention not specified: AfL
  Scale effects: Slavin & Smith

Claim: If you do A, it will improve B
Evidence: We did A and it improved B
Does RCT help?
Would B have improved anyway? (counterfactual)
Was it really A? (attribution)
Did B really improve? (interpretation)
Will it work again for me? (generalisation)

Some standard objections to RCTs
Causation
  Social world is too complex / Humans are free agents
Values
  Positivism requires objective, value-free stance
Generalisation
  Every context is unique
Too hard
  Problems with RCTs: clustering, power, file-drawer, wrong questions, moderators, wrong outcomes, etc
Not my thing

When ‘what works’ doesn’t work …

What should have worked
Durham Shared Maths (EEF)
California class size (Cartwright & Hardie, 2012)
Scale-up (Slavin & Smith 2008)
AfL
Cartwright, N., & Hardie, J. (2012). Evidence-based policy: a practical guide to doing it better. Oxford: Oxford University Press.
Slavin, R.E. and Smith, D. (2008). Effects of sample size on effect size in systematic reviews in education. Educational Evaluation and Policy Analysis.
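
One mechanism that could make small studies report larger effects than large ones is selective publication: a small study only reaches significance when its estimate happens to be large. The sketch below illustrates that mechanism only. It is an assumption offered for illustration, not a summary of Slavin and Smith's analysis, and the true effect, sample sizes and publication rule are invented. As a shortcut, each study's estimate is drawn directly from its sampling distribution (outcome sd assumed to be 1) rather than from simulated pupils.

```python
import random
from statistics import mean


def mean_published_effect(n_per_arm, true_effect=0.1, n_studies=5000, seed=4):
    """Average estimated effect among studies that pass a significance filter."""
    rng = random.Random(seed)
    # Standard error of a difference in means with sd = 1 in each arm.
    se = (2 / n_per_arm) ** 0.5
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        # Hypothetical rule: only significantly positive results get published.
        if estimate / se > 1.96:
            published.append(estimate)
    return mean(published)


if __name__ == "__main__":
    for n in (20, 500):
        print(f"n per arm = {n:3d}: mean published effect = "
              f"{mean_published_effect(n):.2f} (true effect 0.10)")
```

Under these assumptions the published small studies overstate the true effect many times over, while the published large studies overstate it only modestly, so an intervention chosen on the strength of small trials may disappoint at scale.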

What should you do?
Don’t ignore the evidence just because it is imperfect: understand the limitations and help to improve it
Simple, superficial knowledge of research evidence may not improve decision making: deep, integrated understanding is required
Routinely monitor the effectiveness of your practice
Evaluate the impact of any changes you make
Four aces for improving the quality of teaching
  ‘Four Aces’ from rEdScot: Experience, Data, Feedback, Research
