Narrowing the evaluation gap


Narrowing the evaluation gap
Session 4: Eight challenges facing the quantitative social sciences in being relevant for public policy
John Jerrim
2018 EEF Evaluators' Conference – Narrowing the evaluation gap
#EEFeval18 @EducEndowFound

Eight challenges facing the quantitative social sciences in being relevant for public policy……

1. The prevalence of overly-complicated methods
2. How slow everything is
3. The problem of peer review
4. Lack of real quality assurance
5. Publication bias
6. Open-access data vs pre-registration of studies
7. Over-reliance upon hypothesis testing / statistical inference
8. Why do the EEF toolkit and RCT results look so different?

1. The prevalence of overly-complicated methods
The beauty of RCTs = just compare mean scores across groups. Other methods are getting increasingly complicated, and frankly, many (most?) researchers using them don't really understand them!
Example – structural equation modelling (SEM):
- Now widely used in sociology/psychology.
- Can actually be quite a complex method.
- Many people don't really understand the numbers being produced (fit statistics!).
It becomes very hard to communicate the methods/results as things get complex:
- Hard to review/quality assure. Not transparent.
- Hard to communicate!

Example of overly-complicated social science results: Saunders (2010: Figure 1), Social Mobility Myths.

2. How slow everything is
If I come up with a policy-relevant research idea, here is what I have to do:
- Month 0 = Come up with the idea.
- Month 3 = Research proposal written up and submitted.
- Month 9 = Grant decision made.
- Month 10 = Start work.
- Month 13 = Complete first draft of paper. Submit to journal.
- Month 17 = Decision from journal. Revise and resubmit.
- Month 19 = Submit back to journal.
- Month 21 = Decision from journal. Accept.
- Month 22 = Proofs from journal.
- Month 24 = Publication.
It has taken almost half of a parliamentary term to go from my initial idea to getting the evidence out there…
…but I could get this out in around 3 months if I just got on and did it!

3. The problem of peer review
A large chunk of this time lag is due to peer review:
- Grant review process (6 months)
- Journal review process (6 months)
This would be OK if peer review in academia worked well. It doesn't. Lots of bad practice:
- The ESRC does not blind reviewers to the applicant.
- The ESRC sent me a peer review to do – of my own PhD student!
- 'Special issues' = you become the editor and give your mates an easy ride.
- A journal sent me a paper to review that was written by my co-author.
- Reviews are very subjective.
- Not at all transparent.
And if you fail to get accepted, just publish the paper elsewhere:
- It will get published somewhere eventually!

4. Lack of real quality assurance procedures
Peer review of papers in academic journals is a very low bar.
- If at first you don't succeed, try, try again!
Do more prestigious journals mean higher-quality articles?
- Possible, but debatable!
- Many poor papers still get into top journals.
- Policymakers can't tell prestigious journals from any others!
- The only way to judge the quality is to read it yourself.
No one actually checks people's workings:
- No one checks the code for errors (or typically even asks for it)…
- …but errors happen! We are all human.
Many journals still do not require code to be published:
- You don't have to make freely available how you reached your conclusions.

What can we do to change, improve, or replace peer review?
Improve the transparency of reviews (the BMJ approach):
- Publish reviewers' comments and author responses.
- Publish all iterations of papers (first submission through to final article).
- Make open publication of code mandatory.
Get rid of journals entirely and publish everything online in working-paper series?
- It happens anyway!
- A lot cheaper.
- Why do we really need academic journals anyway?
Fund people rather than specific projects?
- E.g. fund people for renewable 5-year periods.
- Let people get on with things.
- Stop wasting time on funding applications rather than actual research.

5. Publication bias
A more severe issue in the social sciences than in the medical sciences?
Think about RCTs:
- Researchers are already heavily invested by the time the results come in.
- Writing up is a relatively small piece of marginal effort.
- High 'sunk costs'.
Think about social science research using survey data:
- Very quick to do rough estimations/analysis…
- …you get an idea of the answer within a few days…
- …very low sunk cost…
- …so not a lot is lost by not writing up!
A lot of null findings in the social sciences will never be written up (a simple simulation of the consequence is sketched below).
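To make that concrete, here is a minimal simulation sketch (not from the slides; the true effect, sample size, and number of studies are made-up illustrative values) of what selective write-up does to the published evidence base when only "significant" findings are written up:

```python
# A minimal sketch (made-up numbers, not EEF data) of how selective write-up
# inflates the published evidence base: every study estimates the same small
# true effect, but only "significant" results get written up.
import random
from statistics import NormalDist, fmean

random.seed(2)

TRUE_EFFECT = 0.05   # true standardised effect (close to the typical EEF RCT result)
N_PER_ARM = 100      # pupils per arm in each small study
N_STUDIES = 2000

se = (2 / N_PER_ARM) ** 0.5                      # approx. SE of a standardised mean difference
published, all_estimates = [], []
for _ in range(N_STUDIES):
    estimate = random.gauss(TRUE_EFFECT, se)     # the effect size this study happens to find
    all_estimates.append(estimate)
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p < 0.05:                                 # only "significant" findings are written up
        published.append(estimate)

print(f"Mean effect across ALL studies:       {fmean(all_estimates):.2f}")
print(f"Mean effect across PUBLISHED studies: {fmean(published):.2f}")
# The published mean comes out several times larger than the true effect of 0.05,
# even though nobody has done anything "wrong" within any single study.
```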

6. Are pre-registration/protocols the solution?
A great idea when doing primary data collection (EEF trials)…
…but most quantitative social science is still based upon secondary data (e.g. birth cohorts).
Such resources are (quite rightly) open access…
…but there is a trade-off between open access and pre-registration!
Pre-registration/protocols are therefore not a viable option for most QSS research.
- What else can we do to make sure null/small results are written up?

7. What on earth is a confidence interval?
1. The probability that the true mean is greater than 0 is at least 95%.
2. The probability that the true mean equals 0 is smaller than 5%.
3. The 'null hypothesis' that the true mean equals 0 is likely to be incorrect.
4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.
In groups: which of these statements are true, and which are false?

7. What on earth is a confidence interval?
Professor Gorard conducts an experiment and reports: "The 95% confidence interval for the mean ranges from 0.1 to 0.4."
In groups, decide which of the following statements are true and which are false:
1. The probability that the true mean is greater than 0 is at least 95%.
2. The probability that the true mean equals 0 is smaller than 5%.
3. The 'null hypothesis' that the true mean equals 0 is likely to be incorrect.
4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.

How did a (convenience) sample of psychology students and researchers respond?

7. Over-reliance upon statistical inference
P-values, confidence intervals, and statistical significance are overused!
What is a 95% confidence interval? If you were to repeat the same random sampling process 20 times and compute a confidence interval each time, then on roughly 19 of those occasions the interval would contain the true population parameter (see the simulation sketch below).
- It gives you an indication of the uncertainty about the 'true' value in the population arising from the sampling procedure…
- …it tells you nothing about importance, magnitude, or policy relevance…
- …nor does it tell you about any other kind of uncertainty (e.g. missing data).
When is it important to report such things? When we have truly random samples from a well-defined population (e.g. PISA)…
…BUT even then it should be secondary to estimates of magnitude (e.g. effect size).
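For anyone who wants to see the repeated-sampling interpretation in action, here is a minimal simulation sketch (the true mean, standard deviation, and sample size are made-up values, not from the slides):

```python
# A minimal sketch illustrating the repeated-sampling interpretation of a 95%
# confidence interval: the *interval* varies from sample to sample, and roughly
# 95% of the intervals capture the fixed true mean.
import random
import statistics

random.seed(1)

TRUE_MEAN = 0.25   # fixed (but in practice unknown) population mean
TRUE_SD = 1.0      # population standard deviation
N = 100            # observations per "study"
REPS = 1000        # number of repeated studies

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lower, upper = mean - 1.96 * se, mean + 1.96 * se   # normal-approximation 95% CI
    if lower <= TRUE_MEAN <= upper:
        covered += 1

print(f"{covered / REPS:.1%} of the {REPS} intervals contain the true mean")
# Expect roughly 95%. Note that any single interval either does or does not
# contain the true mean: the "95%" describes the procedure across repetitions,
# not the probability attached to one particular interval.
```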

Should p-values be used to decide what is a 'promising project'?
The texting-parents intervention was included within the 'promising project' group – results shown on the slide.
Discussion points (see also the sketch below):
- Should p-values be used to decide what is a 'promising project'?
- Based upon the results above, do people think this is a promising project?
- What criteria should the EEF use to define a promising project?
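As a discussion aid, here is a minimal sketch (hypothetical numbers, not the texting-parents trial results) of why a p-value on its own is a poor yardstick for "promising": the same small effect size flips from non-significant to highly significant purely because the sample gets bigger.

```python
# A minimal sketch with hypothetical numbers: the effect size is held fixed at
# 0.05 (the rough average EEF RCT effect size mentioned on the next slide),
# while the sample size varies. Only the p-value changes.
from statistics import NormalDist

def two_sample_p_and_d(mean_diff, sd, n_per_arm):
    """Approximate two-sided p-value (z-test) and Cohen's d for a two-arm trial."""
    d = mean_diff / sd                          # standardised effect size
    se = sd * (2 / n_per_arm) ** 0.5            # SE of the difference in means
    z = mean_diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided p-value
    return p, d

for n in (100, 1000, 10000):                    # pupils per arm
    p, d = two_sample_p_and_d(mean_diff=0.05, sd=1.0, n_per_arm=n)
    print(f"n per arm = {n:>6}: effect size d = {d:.2f}, p = {p:.3f}")

# The effect size is 0.05 in every case, yet the verdict flips from
# "not significant" to "highly significant" as n grows.
```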

8. Why do the EEF toolkit and the RCT results look so different?
- Lots of effect sizes of around 0.4 in the toolkit…
- …and likewise in well-known studies (e.g. Hattie).
- BUT the average EEF RCT effect size is roughly 0.05…
Discussion points:
1. Why do people think there are such big differences?
2. Should the EEF update the toolkit to reflect this difference? If so, how?