Evaluation in Education: 'new' approaches, different perspectives, design challenges Camilla Nevill Head of Evaluation, Education Endowment Foundation 24th January 2017 camilla.nevill@eefoundation.org.uk www.educationendowmentfoundation.org.uk @EducEndowFoundn
Introduction The EEF is an independent charity dedicated to breaking the link between family income and educational achievement. The Education Endowment Foundation was set up in 2011 by the Sutton Trust, as lead charity, in partnership with the Impetus Trust. The EEF is funded by a £125m Department for Education grant and will spend over £220m over its fifteen-year lifespan. In 2013, the EEF was named, with the Sutton Trust, as the government-designated ‘What Works’ centre for improving education outcomes for school-aged children.
The EEF: two aims, and our approach
1. Break the link between family income and school attainment
2. Build the evidence base on the most promising ways of closing the attainment gap
The Teaching and Learning Toolkit: a meta-analysis of education research containing c.10,000 studies. Cost, impact and security are included to aid comparison.
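As context for "a meta-analysis of education research", the sketch below shows generic fixed-effect, inverse-variance pooling of study effect sizes. It is not the Toolkit's actual methodology, and the three studies in the example are hypothetical.

```python
# Generic fixed-effect, inverse-variance pooling of study effect sizes.
# Illustrative only: not the Toolkit's methodology; the inputs are hypothetical.
import math

def pool_fixed_effect(effects, std_errors):
    """Pool standardised effect sizes weighted by inverse variance."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * es for w, es in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# Three hypothetical studies of the same intervention
print(pool_fixed_effect([0.15, 0.25, 0.10], [0.08, 0.12, 0.06]))
```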
The EEF, March 2016:
133 projects funded to date
7,500 schools currently participating in projects
750,000 pupils currently involved in EEF projects
£220m estimated spend over the lifetime of the EEF
26 independent evaluation teams
100 RCTs
£82m funding awarded to date
66 published reports
New approach, different perspectives, design challenges Design with the end user in mind There is no one right answer – communicate and compromise
New approach: evaluate projects through rigorous, independent evaluations; longitudinal outcomes; a robust counterfactual (RCTs); impact and process evaluations.
Education v other fields How does this compare to evaluation in your field?
Trials: Education v public health / development
Education: some independent evaluation. Public health: usually not independently funded?
Education: mostly cluster and multi-site trials, clusters clearly defined. Public health: mostly cluster and multi-site trials, clusters less clearly defined?
Education: high ICC. Public health: low ICC?
Education: obtaining consent can be easy. Public health: obtaining consent can be complex and difficult.
Education: follow-up is in theory easy, as children must attend school. Public health: follow-up can be harder.
Education: administrative data (NPD). Public health: depends on outcome.
Education: unfamiliarity with the method. Public health: more familiar, and more respect for the method in medicine.
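The ICC row matters for design: with clustered data, the design effect inflates the minimum detectable effect size (MDES). A minimal sketch, assuming a two-arm cluster-randomised trial with equal cluster sizes, 5% two-sided significance, 80% power and no covariate adjustment:

```python
# Rough MDES for a two-arm cluster-randomised trial, showing why the
# intraclass correlation (ICC) matters. Assumptions (mine, not the slide's):
# equal cluster sizes, equal arms, alpha = 0.05 two-sided, 80% power.
import math

def mdes(n_clusters_per_arm, cluster_size, icc, z_alpha=1.96, z_beta=0.84):
    design_effect = 1 + (cluster_size - 1) * icc
    n_per_arm = n_clusters_per_arm * cluster_size       # pupils per arm
    se = math.sqrt(2 * design_effect / n_per_arm)       # SE of standardised difference
    return (z_alpha + z_beta) * se

# Same number of pupils, very different detectable effects:
print(round(mdes(50, 25, 0.15), 2))   # education-style ICC of ~0.15
print(round(mdes(50, 25, 0.02), 2))   # lower ICC, as in some public health outcomes
```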
Main messages Design with the end user in mind There is no right answer – communicate and compromise
Process for appointing evaluators
1. Grants team identify projects; 1st Grants Committee shortlist.
2. Evaluation teams receive 1-page project descriptions and submit a 2-page expression of interest (EoI).
3. Teams chosen to submit a proposal; teams submit an 8-page proposal.
4. 2nd Grants Committee shortlist; teams chosen to evaluate projects.
5. First set-up meeting with evaluation team, project team and EEF: share understanding of the intervention logic; decide overall design, timeline, sample size and control group condition; developer (and evaluator) budgets set.
6. Second set-up meeting with evaluation team, project team and EEF: finalise the evaluation design; decide on eligibility criteria, details of the protocol, and process evaluation measures linked to the logic model.
Different perspectives: the EEF, the evaluator and the developer come together at the set-up meeting.
Different perspectives at the set-up meeting
EEF: useful results; quick results; keep costs down.
Evaluator: publications; funding to do research; personal interests.
Developer: funding to deliver the programme; demonstrate impact; good relationships with schools; publications?
Design challenges: Improving Working Memory
Teaching memory strategies by playing computer games
For 5-year-olds struggling at maths
Delivered by Teaching Assistants
Developed by Oxford University educational psychologists
Evidence of improvement in WM from two small controlled studies (30 and 150 children)
Design challenges: How many arms?
Possible arms:
1. Working Memory (WM)
2. WM blended with maths
3. Matched time maths support
4. Business as usual (BAU)
Design challenges: When would you randomise?
Logic model: school recruited → identify TAs and link teacher → identify pupils (bottom 1/3) → one-day training for TAs by Oxford University → deliver programme (10 hours: computer games for 5 hours; 1-to-1 support for 20-30 minutes at a time, 5 hours in total) → improved working memory → maths attainment
Design challenges: the same logic model with the evaluation overlaid. Randomisation; delivery log during the 10-hour programme; survey, observations and interviews; working memory test; maths test (attainment).
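One common answer to "when would you randomise?" is to allocate pupils to arms within each school once they have been identified. The sketch below is illustrative only: the arm names follow the "how many arms?" slide, the within-school (stratified) allocation is an assumption rather than the trial's registered design, and the school and pupil identifiers are hypothetical.

```python
# Sketch of pupil-level randomisation stratified (blocked) by school.
# Arm names follow the earlier slide; schools and pupils are hypothetical.
import random

ARMS = ["WM", "WM + maths", "Matched time maths", "Business as usual"]

def randomise_within_school(pupils_by_school, arms=ARMS, seed=2017):
    rng = random.Random(seed)            # fixed seed so the allocation is reproducible
    allocation = {}
    for school, pupils in pupils_by_school.items():
        shuffled = pupils[:]
        rng.shuffle(shuffled)
        for i, pupil in enumerate(shuffled):
            allocation[pupil] = arms[i % len(arms)]    # balanced within each school
    return allocation

example = {"School A": ["p1", "p2", "p3", "p4"], "School B": ["p5", "p6", "p7", "p8"]}
print(randomise_within_school(example))
```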
Design challenges: Catch Up Numeracy
For 4 to 11-year-olds struggling at maths
Delivered by Teaching Assistants
10 modules of tailored support
Flexible delivery model (no fixed length)
Evidence from an EEF pupil-randomised efficacy trial:
Catch Up v BAU control: 108 pupils; effect size 0.21 (0.01, 0.42); estimated months' progress +3
Matched time support v BAU control: 102 pupils; effect size 0.27 (0.06, 0.49); estimated months' progress +4
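As background on where numbers like 0.21 (0.01, 0.42) come from, here is a minimal sketch of a standardised effect size (Hedges' g) with a 95% confidence interval computed from group summary statistics. The means, SDs and group sizes below are placeholders, not the Catch Up Numeracy data.

```python
# Generic Hedges' g with a 95% CI from group summary statistics.
# The input numbers are placeholders for illustration only.
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = (mean_t - mean_c) / pooled_sd
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)        # small-sample correction
    g = d * correction
    se = math.sqrt((n_t + n_c) / (n_t * n_c) + g**2 / (2 * (n_t + n_c)))
    return g, (g - 1.96 * se, g + 1.96 * se)

print(hedges_g(mean_t=102.3, mean_c=100.0, sd_t=11.0, sd_c=10.5, n_t=54, n_c=54))
```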
Design challenges What control group would you use?
Design challenges: Catch Up Numeracy
150 schools recruited; each identifies TAs and ~8 children in years 3-5 behind in maths
Randomise:
75 schools (600 children): business as usual control group
75 schools (600 children): flexible Catch Up delivery model
Follow-up maths test
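A minimal sketch of the school-level randomisation this design implies: 150 recruited schools split evenly between the two groups. The school identifiers and the simple shuffle-and-split scheme are illustrative assumptions (real trials typically stratify or block).

```python
# Sketch of school-level (cluster) randomisation: 150 schools split 75/75.
# Simple shuffle-and-split for illustration; identifiers are placeholders.
import random

def randomise_schools(school_ids, seed=42):
    rng = random.Random(seed)
    shuffled = school_ids[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"Catch Up": shuffled[:half], "Business as usual": shuffled[half:]}

schools = [f"school_{i:03d}" for i in range(1, 151)]
groups = randomise_schools(schools)
print(len(groups["Catch Up"]), len(groups["Business as usual"]))   # 75 75
```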
Problems with interpretation
What if we see no effect of Catch Up, but the control group has received lots more support?
What if we see a big effect of Catch Up, but the control group has received lots less support?
A radical idea: pre-specify interpretation!
A grid of what the control group received (longer than Catch Up, matched time, shorter than Catch Up) against the observed result (positive effect, no effect, negative effect):

Control longer than Catch Up
Positive effect: Catch Up is more effective, even with more active control time. → Do Catch Up (continuing active control without appropriate stopping may have a harmful effect).
No effect: both did or did not work; probably did, given the existing evidence? → Do Catch Up, because it gives the same effect with less time.
Negative effect: Catch Up is less effective than providing longer active control. → Assess the cost of each and do active control if it is not much more expensive.

Matched time
Positive effect: Catch Up is more effective than active control.
No effect: both did or did not work; probably did, given the existing evidence? → Do Catch Up or active control.
Negative effect: Catch Up is less effective than active control. → Do active control.

Control shorter than Catch Up
Positive effect: Catch Up is more effective than less active control time. → Do Catch Up, because its structure is needed to stop TAs stopping too early.
No effect: → Do active control, as it gives the same effect with less time.
Negative effect: Catch Up is less effective than providing less active control.
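Pre-specifying interpretation amounts to writing the decision rule down before the results arrive. A minimal sketch of that idea as a lookup is below; the wording paraphrases the grid above, and the decisions for cells without an explicit recommendation are my reading, not the trial's registered protocol.

```python
# The pre-specified interpretation grid above encoded as a lookup table.
# Cells marked "(implied)" had no explicit recommendation on the slide.
DECISIONS = {
    ("control longer",  "positive"):  "Do Catch Up",
    ("control longer",  "no effect"): "Do Catch Up (same effect with less time)",
    ("control longer",  "negative"):  "Cost both; do active control if not much more expensive",
    ("matched time",    "positive"):  "Do Catch Up (implied)",
    ("matched time",    "no effect"): "Do Catch Up or active control",
    ("matched time",    "negative"):  "Do active control",
    ("control shorter", "positive"):  "Do Catch Up",
    ("control shorter", "no effect"): "Do active control (same effect with less time)",
    ("control shorter", "negative"):  "Do active control (implied)",
}

def prespecified_decision(control_dose: str, result: str) -> str:
    """Return the decision agreed before the trial reported."""
    return DECISIONS[(control_dose, result)]

print(prespecified_decision("matched time", "no effect"))
```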
Design challenges
Boarding school: children in need, at risk of going into care, referred by Local Authorities.
Teenage Sleep: changing school start times to later; positive effects from US trials (8am start v 11am start).
Main messages (and sub-messages)
Design with the end user in mind: test the right intervention; make sure your comparison is relevant; measure implementation and cost.
There is no right answer – communicate and compromise: use a logic model to understand the intervention; pre-specify the interpretation to aid decision-making; not all interventions can be randomised.
Thank you camilla.nevill@eefoundation.org.uk www.educationendowmentfoundation.org.uk @EducEndowFoundn
Measuring the security of trials
Summary of the security of evaluation findings: 'padlocks' developed in consultation with evaluators. Five categories (design, power, attrition, balance, threats to validity) are combined to create an overall rating.

Example finding: Literacy intervention; 550 pupils; effect size 0.10 (0.03, 0.18); estimated months' progress +2; evidence strength shown as a padlock rating.

Padlock criteria by rating:
5: fair and clear experimental design (RCT); MDES < 0.2; attrition < 10%; well-balanced on observables; no threats to validity
4: fair and clear experimental design (RCT, RDD); MDES < 0.3; attrition < 20%
3: well-matched comparison (quasi-experiment); MDES < 0.4; attrition < 30%
2: matched comparison (quasi-experiment); MDES < 0.5; attrition < 40%
1: comparison group with poor or no matching; MDES < 0.6; attrition < 50%
0: no comparator; MDES > 0.6; attrition > 50%; imbalanced on observables; significant threats to validity
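To illustrate how thresholds like these could be applied, here is a minimal sketch that rates the power (MDES) and attrition columns and combines them with a design rating. The combination rule (taking the minimum across categories) is my assumption; the slide does not say how the five categories are weighted.

```python
# Sketch of the padlock thresholds above applied per category.
# Combining by taking the minimum is an assumption, not the EEF's stated rule.
def power_rating(mdes):
    thresholds = [(0.2, 5), (0.3, 4), (0.4, 3), (0.5, 2), (0.6, 1)]
    return next((padlocks for limit, padlocks in thresholds if mdes < limit), 0)

def attrition_rating(attrition):
    thresholds = [(0.10, 5), (0.20, 4), (0.30, 3), (0.40, 2), (0.50, 1)]
    return next((padlocks for limit, padlocks in thresholds if attrition < limit), 0)

def overall_rating(design, mdes, attrition):
    # design: padlocks already judged from the design column (e.g. an RCT = 5)
    return min(design, power_rating(mdes), attrition_rating(attrition))

print(overall_rating(design=5, mdes=0.18, attrition=0.12))   # -> 4 under these assumptions
```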