Building Evidence in Education: Workshop for EEF evaluators. 2nd June: York; 6th June: London. www.educationendowmentfoundation.org.uk

The EEF by numbers
- 83 evaluations funded to date
- 3,000 schools participating in projects
- 34 topics in the Toolkit
- 16 independent evaluation teams
- 600,000 pupils involved in EEF projects
- 14 members of the EEF team
- £220m estimated spend over the lifetime of the EEF
- 6,000 heads presented to since launch
- 10 reports published

Session 1: Design
- RCT design, power calculations and randomisation (Ben Styles, NFER)
- Maximising power using the NPD (John Jerrim, Institute of Education)

RCT design: power calculations and randomisation
Ben Styles
Education Endowment Foundation, June 2014

RCT design
- The ideal trial
- Methods of randomisation
- Power calculations
- Syntax exercise!

A statistician's ideal trial
- Randomly select eligible pupils from the NPD
- No consent!
- Simple randomisation of pupils to intervention and control groups
- No attrition
- No data matching problems
- No measurement error

BEFORE YOU START!
1. Trial registration: specification of primary and secondary outcomes, in addition to sub-group analyses
2. Recruit participants and explain the method to stakeholders
3. Select participants according to fixed eligibility criteria
4. Obtain consent
5. Baseline outcome measurement (or use existing administrative data)
6. Randomise eligible participants into groups (the evaluator carries out the randomisation)
7. Intervention runs in the experimental group; control receives 'business-as-usual'/an alternative activity
8. Administer follow-up measurement (evaluator)
9. Intention-to-treat analysis followed by reporting as per CONSORT guidelines
10. Control receives the intervention (under what circumstances?)

Why we depart from the ideal
- Schools manage pupils!
- Nature of the intervention
- Contamination: how serious is the risk?

Restricted randomisation?
- Use simple randomisation where you can
- Timetable considerations in a pupil-randomised trial → stratify by school
- Important predictor variable with a small and important category → stratify by that predictor
- Fewer than 20 schools → minimise
- Multiple recruitment tranches → blocked randomisation
- Pairing → BAD IDEA!
(A sketch of stratified, blocked randomisation follows below.)
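The sketch below shows one way stratified, blocked randomisation can be implemented. It is not the workshop's actual randomisation syntax; the block size, seed and school/pupil identifiers are illustrative assumptions.

```python
# A minimal sketch of stratified, blocked randomisation; block size, seed and
# identifiers are illustrative, not the workshop's actual syntax.
import random

rng = random.Random(20140602)  # fixed seed so the allocation is reproducible

def blocked_randomise(units, block_size=4):
    """Allocate units to intervention/control in permuted blocks."""
    arms = []
    for _ in range(0, len(units), block_size):
        block = ["intervention", "control"] * (block_size // 2)
        rng.shuffle(block)
        arms.extend(block)
    # An incomplete final block may leave the last stratum slightly unbalanced.
    return dict(zip(units, arms))

# Stratify by school: randomise separately within each school so that
# timetable constraints and school effects are balanced across the arms.
pupils_by_school = {
    "school_A": [f"A{i}" for i in range(1, 9)],  # 8 pupils (illustrative)
    "school_B": [f"B{i}" for i in range(1, 7)],  # 6 pupils (illustrative)
}
allocation = {}
for school, pupils in pupils_by_school.items():
    allocation.update(blocked_randomise(pupils))
print(allocation)
```

Randomising within each school keeps the arms balanced within every stratum, and the fixed seed makes the allocation reproducible and auditable.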

Restricted randomisation
[Slide diagram: simple randomisation vs restricted randomisation]
- Restricted randomisation is more complicated and can go wrong
- Take strata into account in the analysis

To remember!
If you have restricted your randomisation using a factor that is associated with the outcome (e.g. school), THEN INCLUDE THE FACTOR AS A COVARIATE IN YOUR ANALYSIS.

Chance imbalance at baseline
- As distinct from bias induced by measurement attrition
- Can be quite large in small trials, e.g. on the baseline measure
- Include the covariate in the final analysis
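A minimal sketch of the analysis this implies: an ordinary least squares model with the baseline score and the stratification factor as covariates. The variable names, simulated data and use of statsmodels are illustrative assumptions, not the EEF's prescribed analysis.

```python
# Illustrative analysis: adjust the treatment effect estimate for the baseline
# score and the stratification factor (school), assuming a pupil-randomised
# trial stratified by school. Data are simulated for the example.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "school": rng.choice(["A", "B", "C", "D"], size=n),
    "treatment": rng.integers(0, 2, size=n),
    "baseline": rng.normal(size=n),
})
df["outcome"] = 0.2 * df["treatment"] + 0.7 * df["baseline"] + rng.normal(scale=0.7, size=n)

# The baseline covariate soaks up chance imbalance; C(school) reflects the strata.
model = smf.ols("outcome ~ treatment + baseline + C(school)", data=df).fit()
print(model.params["treatment"], model.bse["treatment"])
```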

Sample size calculations
- School- or pupil-randomised?
- Intra-cluster correlation
- Correlation between covariate and outcome
- Expected effect size
- p(type I error) = 0.05; power = 0.8
- Attrition

Rule of thumb (Lehr, 1992): for 80% power at a two-sided 5% significance level, the required number per arm is roughly 16 divided by the squared standardised effect size.
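A quick numeric illustration of the rule of thumb, assuming a simple two-arm, pupil-randomised comparison with no clustering:

```python
# Lehr's rule of thumb: pupils needed per arm for 80% power at two-sided
# alpha = 0.05, for a simple (non-clustered) two-group comparison.
def lehr_n_per_arm(effect_size):
    return 16 / effect_size ** 2

for d in (0.2, 0.25, 0.3, 0.5):
    print(f"effect size {d:.2f}: ~{lehr_n_per_arm(d):.0f} pupils per arm")
```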

Pupil-randomised
- ICC = 0
- Correlation between baseline and outcome: see the EEF pre-testing paper (Pre-testing_paper.pdf) and your previous work
- Effect size: previous evidence; cost-effectiveness; EEF security ratings
- Attrition: EEF allow recruitment to be 15% above the sample size required after attrition
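One common approximation (an assumption here, not a formula taken from the EEF pre-testing paper) folds the baseline correlation into the rule of thumb: adjusting for a covariate correlated r with the outcome scales the required sample size by roughly (1 − r²).

```python
# Approximate sample size per arm when a baseline covariate with
# baseline-outcome correlation r is included: 16 * (1 - r^2) / d^2.
# An approximation for a pupil-randomised (ICC = 0) trial.
def n_per_arm_with_covariate(effect_size, r):
    return 16 * (1 - r ** 2) / effect_size ** 2

for r in (0.0, 0.5, 0.7):
    print(f"r = {r:.1f}: ~{n_per_arm_with_covariate(0.2, r):.0f} pupils per arm")
# A correlation of 0.7 roughly halves the sample needed relative to no baseline measure.
```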

Cluster-randomised
- Same as for pupils, aside from the ICC
- ICC: the proportion of total variance that is due to between-cluster variance
- The EEF pre-testing paper has some useful guidance
- A pre-test also reduces the ICC, e.g. from 0.2 to 0.15 for a KS2 baseline and GCSE outcome
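The sketch below illustrates why a lower ICC matters, using the standard design-effect formula; the cluster size of 50 pupils is an assumed figure for illustration.

```python
# Design effect for a cluster-randomised trial: deff = 1 + (m - 1) * rho,
# where m is pupils per school and rho the intra-class correlation.
# Shows the slide's point: lowering rho from 0.20 to 0.15 (e.g. via a
# pre-test) makes each school worth noticeably more effective pupils.
def design_effect(m, rho):
    return 1 + (m - 1) * rho

m = 50  # assumed pupils per school, for illustration
for rho in (0.20, 0.15):
    deff = design_effect(m, rho)
    effective_pupils = m / deff  # effective independent pupils contributed per school
    print(f"rho = {rho:.2f}: deff = {deff:.2f}, "
          f"a school of {m} counts as ~{effective_pupils:.1f} independent pupils")
```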

MDES: minimum detectable effect size
- EEF require this, calculated from real parameters, for the security rating (avoiding a retrospective power calculation)
- How good were my estimates?

Sample size spreadsheet (fill in the highlighted boxes): Scenario 1
- Expected number of pupils per school being sampled: 180
- ROH (intra-class correlation: percentage of variance in the outcome being studied attributable to the school attended): 0.15
- Deff (adjustment for nested design): 27.85
- Confidence level (of the test we will use to assess the effect): 95.0%
- Critical t-value: 1.96
- Correlation between before and after scores: 0.70
- SD of residuals in scores (if scores have SD of 1): 0.71
- Expected effect size (in terms of absolute outcome scores): 0.2
- Expected effect size (in terms of residual outcome scores): 0.28
- n(schools) in intervention: 31
- n(schools) in control: 31
- n(pupils) in intervention: 5,580
- n(pupils) in control: 5,580
- Expected SE of difference between groups (in SDs): 0.10
- Power: 80.0%
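Scenario 1 can be reproduced approximately from the standard formulas. The sketch below is not the spreadsheet itself, just a check of its arithmetic using a normal approximation.

```python
# A minimal check of Scenario 1 above, assuming the usual design-effect
# formula and a baseline covariate absorbed into the residual outcome scale.
from math import sqrt
from statistics import NormalDist

m = 180            # pupils sampled per school
rho = 0.15         # intra-class correlation (ROH)
r_pre_post = 0.70  # correlation between before and after scores
k_per_arm = 31     # schools per arm
effect_abs = 0.20  # expected effect size in absolute outcome SDs
alpha = 0.05

deff = 1 + (m - 1) * rho                      # 27.85, as in the spreadsheet
sd_resid = sqrt(1 - r_pre_post ** 2)          # ~0.71
effect_resid = effect_abs / sd_resid          # ~0.28

n_pupils_arm = k_per_arm * m                  # 5,580
n_effective_arm = n_pupils_arm / deff         # effective pupils per arm
se_diff = sqrt(2 / n_effective_arm)           # ~0.10 (in residual SDs)

z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96
power = NormalDist().cdf(effect_resid / se_diff - z_crit)
print(f"Deff = {deff:.2f}  SE = {se_diff:.3f}  power = {power:.2f}")  # ~0.80
```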

Running the randomisation: SYNTAX EXERCISE
- In pairs, explain what each of the steps does
- How many schools were randomised in this block?

Conclusions
- Always think of any RCT (indeed, any quantitative impact evaluation) as a departure from the ideal trial
- The design, power calculations, method of randomisation and analysis all interrelate and need to be consistent

Maximising power using the NPD
John Jerrim (Institute of Education)

Structure
- How much power do EEF trials currently have? PISA, power, star ratings and current EEF trials
- Exercise: work in groups to design an EEF trial; goal = maximise power at minimal cost
- My answers: how might I try to maximise power?
- Your answers! / Discussion

Power in context
Effect sizes, PISA rankings and EEF padlock ratings

How powerful are EEF trials thus far?
- EEF secondary school trials, as of 01/05/2014
- Median detectable effect size = 0.25
- Between 4* and 5* by EEF guidelines…

Power and the PISA reading rankings
[Chart: the UK's current position in the PISA reading rankings, with markers for where effect sizes of 0.10, 0.20 (EEF 5*), 0.30 (EEF 4*), 0.40 (EEF 3*) and 0.50 (EEF 2*) would move it; median EEF trial = 0.25]
Implication: effect sizes of 0.20 are damn big, particularly given the pretty small doses we are giving.

Do we currently have a power problem?
- Quite possibly!
- So trying to get more power into future trials is very important…

Exercise

Task: in groups, discuss how you would design the following trial
- Intervention = teaching children how to play chess
- Maximum number of treatment schools = 20 secondary schools
- Year group = Year 7
- Level of randomisation = school level
- Test = one-to-one non-verbal IQ assessment with a trained educationalist (end of Year 7)
- Control condition = 'business as usual'
- Study type = 'efficacy' study (proof of concept)
Objective: maximise power at minimum cost
- How would you design this trial to meet these twin objectives?
- What could you do to increase power in this trial? E.g. would you use a baseline test? If so, what?

My answers
- The usual suspects…
- …and less obvious options

The usual suspects…
1. Use a regression model and include baseline covariates: adding controls explains variance and boosts power.
2. Use Key Stage 2 test scores as the "pre-test": the point of baseline covariates is to explain variance; KS2 maths scores are likely to be reasonably correlated with the outcome (non-verbal IQ); and they are CHEAP, coming from the NPD.
3. Stratify the sample prior to randomisation: this potentially reduces error variance and thus boosts power, with the additional advantage of balancing baseline characteristics.
4. Really engage with control schools: make sure we minimise loss of sample through attrition.

Less ‘obvious’ options….

Don't test every child…
- There are around 200 children per secondary school…
- …one-to-one testing is expensive…
- …and testing more than 50 pupils buys you little additional power
- RANDOMLY SAMPLE PUPILS WITHIN SCHOOLS!
[Chart: % power against number of pupils tested per school; assumptions: 20 schools, rho = 0.15, a fixed pre/post correlation]
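The sketch below illustrates the diminishing returns by computing the minimum detectable effect size as the number of pupils tested per school grows. The parameters (20 schools per arm, ICC = 0.15, no baseline covariate) are assumptions chosen to echo the slide, not its exact chart.

```python
# Diminishing returns from testing more pupils per school: minimum detectable
# effect size (MDES) for a cluster-randomised trial with 20 schools per arm
# and ICC = 0.15 (assumed values; no baseline covariate in this sketch).
from math import sqrt
from statistics import NormalDist

def mdes(k_per_arm, m, rho, alpha=0.05, power=0.80):
    nd = NormalDist()
    multiplier = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)  # ~2.80
    deff = 1 + (m - 1) * rho
    return multiplier * sqrt(2 * deff / (k_per_arm * m))

for m in (10, 30, 50, 100, 200):
    print(f"{m:3d} pupils tested per school: MDES = {mdes(20, m, 0.15):.2f}")
# Beyond ~50 pupils per school the MDES barely moves, so sampling pupils
# (rather than testing every child) saves one-to-one testing costs.
```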

…use an unequal sampling fraction
- We all know that more clusters (k) means more power
- This example is limited to only a small number of treatment schools (20)…
- …but the control condition is non-intrusive and cheap
- So don't just recruit 20 control schools as well: recruit more!
- Nothing about RCTs means we need equal k for treatment and control
- The power calculation becomes more complex (anybody know it!?)
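With unequal numbers of schools, the variance of the difference between arms scales with 1/k_treatment + 1/k_control rather than 2/k. The sketch below (assumed cluster size and ICC) shows how adding cheap control schools while treatment schools stay at 20 shrinks the minimum detectable effect size, with diminishing returns.

```python
# Effect of adding control schools while treatment schools are fixed at 20.
# The SE of the group difference scales with sqrt(1/k_t + 1/k_c) at school
# level. Assumed: 50 pupils per school, ICC = 0.15 (illustrative values).
from math import sqrt
from statistics import NormalDist

def mdes_unequal(k_treat, k_control, m, rho, alpha=0.05, power=0.80):
    nd = NormalDist()
    multiplier = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    deff = 1 + (m - 1) * rho
    var_per_school = deff / m  # variance of one school's mean, outcome SD = 1
    return multiplier * sqrt(var_per_school * (1 / k_treat + 1 / k_control))

for k_c in (20, 30, 40, 60):
    print(f"20 treatment vs {k_c} control schools: "
          f"MDES = {mdes_unequal(20, k_c, 50, 0.15):.2f}")
```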

Use a more homogeneous selection of schools…
[Chart: all UK schools vs low-performing schools only]

Why does rho decline?
- The within-school variation barely changes…
- …while the between-school variation declines substantially
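A small numeric illustration of this point; the variance components below are assumed values, not figures taken from the slide.

```python
# rho = between-school variance / (between + within). Restricting to a more
# homogeneous set of schools shrinks the between-school component while the
# within-school component is largely unchanged, so rho falls.
def icc(between, within):
    return between / (between + within)

print(f"All schools:            rho = {icc(between=0.20, within=0.80):.2f}")  # 0.20
print(f"Low-performing schools: rho = {icc(between=0.08, within=0.78):.2f}")  # ~0.09
```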

Implications
- As this example is an efficacy study, why not restrict attention to low-performing schools only?
  - Boosts power!
  - Fits with the EEF mandate (close the performance gap)
  - Not worried about generalisability
- We implicitly do this anyway (e.g. by doing trials in just one or two LAs)…
- …but can we do it in a smarter way?
- There is a little-appreciated trade-off between POWER and GENERALISABILITY
  - Long-term implications for the EEF
  - A trial representative of the England population is very hard to achieve

Conclusions
- Do we have a "power problem"? Quite possibly: the median detectable effect size is 0.25 in EEF secondary school trials. If we were to boost UK reading PISA scores by this amount, we would move above Canada, Taiwan and Finland in the rankings…
- Ways to potentially increase power:
  - Include baseline covariates (from the NPD where possible)
  - Stratify the sample prior to randomisation
  - Engage with control schools!
  - Do you need to test every child? Are there practical alternatives?
  - Could you increase the number of control schools without adding much to cost (an unequal randomisation fraction)?
  - Could you restrict your focus to a narrower population (e.g. low-performing schools only)?