Slide 1: EEF Evaluators' Conference, 25th June 2015
Slide 2: Session 1: Interpretation / impact, 25th June 2015
Slide 3: Rethinking the EEF Padlocks
Calum Davey, Education Endowment Foundation
25th June 2015
Slide 4: Overview
→ Background
→ Problems
→ Attrition
→ Power/chance
→ Testing
→ Proposal
→ Discussion
Slide 5: Background
– Summary of the security of evaluation findings
– 'Padlocks' developed in consultation with evaluators
Slide 6: Background
– Summary of the security of evaluation findings
– 'Padlocks' developed in consultation with evaluators

Group | Number of pupils | Effect size | Estimated months' progress | Evidence strength
Literacy intervention | 550 | 0.10 (0.03, 0.18) | +2 |
Slide 7: Background (same content as slide 6)
Slide 8: Background
– Summary of the security of evaluation findings
– 'Padlocks' developed in consultation with evaluators
– Five categories, combined to create an overall rating:

Group | Number of pupils | Effect size | Estimated months' progress | Evidence strength
Literacy intervention | 550 | 0.10 (0.03, 0.18) | +2 |

Rating | 1. Design | 2. Power (MDES) | 3. Attrition | 4. Balance | 5. Threats to validity
5 | Fair and clear experimental design (RCT) | < 0.2 | < 10% | Well-balanced on observables | No threats to validity
4 | Fair and clear experimental design (RCT, RDD) | < 0.3 | < 20% | |
3 | Well-matched comparison (quasi-experiment) | < 0.4 | < 30% | |
2 | Matched comparison (quasi-experiment) | < 0.5 | < 40% | |
1 | Comparison group with poor or no matching | < 0.6 | < 50% | |
0 | No comparator | > 0.6 | > 50% | Imbalanced on observables | Significant threats
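A minimal sketch of how the five category scores might combine into the overall rating, based on the thresholds in the table above. Treating the overall rating as the minimum of the category scores is an assumption made here for illustration, not EEF's published algorithm, and only the Power and Attrition columns are encoded.

```python
# Sketch of the combination logic implied by the criteria table above.
# Assumption: the overall padlock rating is capped by the weakest category
# (a "minimum of the scores" rule used for illustration only).

def power_score(mdes):
    thresholds = [(0.2, 5), (0.3, 4), (0.4, 3), (0.5, 2), (0.6, 1)]
    return next((score for cutoff, score in thresholds if mdes < cutoff), 0)

def attrition_score(attrition):
    thresholds = [(0.10, 5), (0.20, 4), (0.30, 3), (0.40, 2), (0.50, 1)]
    return next((score for cutoff, score in thresholds if attrition < cutoff), 0)

def overall_rating(design, mdes, attrition, balance, validity):
    return min(design, power_score(mdes), attrition_score(attrition), balance, validity)

# A well-run RCT (design score 5) with MDES 0.25 and 12% attrition:
print(overall_rating(design=5, mdes=0.25, attrition=0.12, balance=5, validity=5))  # -> 4
```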
Slide 9: Background
Slide 10: Oxford Improving Numeracy and Literacy (padlock rating criteria table repeated from slide 8)
Slide 11: Act, Sing, Play (padlock rating criteria table repeated from slide 8)
Slide 12: Team Alphie (padlock rating criteria table repeated from slide 8)
Slide 13: Problems: power
– MDES at baseline
– MDES changes
– Confusion with p-values and CIs:
  – Effect bigger than MDES! E.g. Calderdale: ES = 0.74, MDES < 0.5
  – P-value < 0.05! E.g. Butterfly Phonics: ES = 0.43, p < 0.05, MDES > 0.5

Rating | 2. Power (MDES)
5 | < 0.2
4 | < 0.3
3 | < 0.4
2 | < 0.5
1 | < 0.6
0 | > 0.6
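The "p < 0.05 despite a high MDES" case has a simple arithmetic explanation: the MDES is the effect detectable with 80% power, while statistical significance only requires the observed effect to exceed roughly 1.96 standard errors, about 70% of the MDES. A minimal sketch, assuming an individually-randomised two-arm trial with equal arms, no covariate adjustment, and invented pupil numbers:

```python
# Why an observed effect can be significant (p < 0.05) yet smaller than the
# MDES quoted in the padlock rating. Numbers are illustrative, not from any
# EEF evaluation.
from scipy.stats import norm

n_treat, n_control = 275, 275                      # e.g. 550 pupils split evenly
se = (1 / n_treat + 1 / n_control) ** 0.5          # approx. SE of a standardised effect size

# MDES at 80% power, two-sided alpha = 0.05
multiplier = norm.ppf(0.975) + norm.ppf(0.80)      # approx. 2.80
mdes = multiplier * se

# Smallest effect that would reach p < 0.05
critical_effect = norm.ppf(0.975) * se             # approx. 1.96 * se, about 70% of the MDES

print(f"MDES (80% power): {mdes:.2f}")
print(f"Effect needed for p < 0.05: {critical_effect:.2f}")
# Any observed effect between these two values is significant yet below the
# MDES, so "ES < MDES" does not contradict "p < 0.05".
```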
Slide 14: Problems: attrition
– Calculated overall at the level of randomisation
– 10% of pupils off school each day
– Disadvantages individually-randomised trials:
  – Act, Sing, Play (pupil-randomised): 0% attrition at school or class level, 10% at pupil level
  – Oxford Science (school-randomised): 3% attrition at school level, 16% at pupil level
– Are the levels right?

Rating | 3. Attrition
5 | < 10%
4 | < 20%
3 | < 30%
2 | < 40%
1 | < 50%
0 | > 50%
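A minimal sketch of how the level at which attrition is calculated changes the headline figure; the school and pupil counts below are invented for illustration and are not taken from the Act, Sing, Play or Oxford Science evaluations.

```python
# How the level of calculation changes the headline attrition figure.
# Counts are hypothetical.

def attrition(randomised, analysed):
    return 1 - analysed / randomised

# School-randomised trial: few whole schools drop out, but more pupils miss the test
schools_randomised, schools_analysed = 60, 58
pupils_randomised, pupils_analysed = 3000, 2520

print(f"School-level attrition: {attrition(schools_randomised, schools_analysed):.0%}")  # ~3%
print(f"Pupil-level attrition:  {attrition(pupils_randomised, pupils_analysed):.0%}")    # ~16%
# A rating based only on the level of randomisation rewards this trial relative
# to a pupil-randomised trial with the same pupil-level attrition.
```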
Slide 15: Problems: testing
– Lots of testing administered by teachers
– Teachers rarely blinded to intervention status
– What is the threat to validity when effect sizes are small?

Rating | 5. Threats to validity
5 | No threats to validity
4 |
3 |
2 |
1 |
0 | Significant threats
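One way to see the scale of the problem: compare a hypothetical scoring bias from unblinded testers against a typical small effect size. Both numbers below are assumptions chosen for illustration, not estimates from any EEF trial.

```python
# How much of a small effect could be accounted for by unblinded testers?
# Both figures are hypothetical, chosen only to illustrate the scale.

observed_effect = 0.10      # a typical small effect size, in standard deviations
hypothetical_bias = 0.05    # assumed average scoring advantage when the tester
                            # knows the pupil received the intervention

share_explained = hypothetical_bias / observed_effect
print(f"Bias would account for {share_explained:.0%} of the observed effect")  # -> 50%
```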
Slide 16: Potential solution?
– Assess 'chance' as well as MDES in the padlock?
– Assess attrition at pupil level for all trials?
– Randomise invigilation of testing to assess bias?
– Number of pupils (number with intervention)?
– Confidence interval for months' progress?
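A minimal sketch of the last proposal: reporting a confidence interval in months' progress alongside the effect size. The conversion factor (one month per 0.05 standard deviations) is an illustrative approximation chosen so the point estimate matches the +2 months shown on the Background slides, not EEF's published conversion table.

```python
# Reporting a confidence interval in months' progress alongside the effect size.
# Assumption: a simple linear conversion of one month per 0.05 SD, for
# illustration only.

def months_progress(effect_size, sd_per_month=0.05):
    return effect_size / sd_per_month

es, ci_low, ci_high = 0.10, 0.03, 0.18   # e.g. the literacy intervention from the Background slides

print(f"Estimated progress: +{months_progress(es):.0f} months "
      f"(95% CI: +{months_progress(ci_low):.0f} to +{months_progress(ci_high):.0f})")
# -> Estimated progress: +2 months (95% CI: +1 to +4)
```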
Slide 17: Discussion
– Can p-values, confidence intervals, power, sample size, etc. be combined into a measure of 'chance'?
– What are the advantages and disadvantages of reporting confidence intervals alongside the security rating?
– Is it right to include all attrition in the security rating? What potential disadvantages are there?
– What is the most appropriate way to ensure unbiased testing?
– Would it be possible to conduct a trial across evaluations?