Beyond p-Values: Characterizing Education Intervention Effects in Meaningful Ways
Mark Lipsey, Kelly Puzio, Cathy Yun, Michael Hebert, Kasia Steinka-Fry, Mikel Cole, Megan Roberts, Karen Anthony, Matthew Busick (Vanderbilt University)
with Howard Bloom, Carolyn Hill, & Alison Black
IES Research Conference, Washington, DC, June 2010

Intervention research model
Compare a treatment (T) sample with a control (C) sample on an education outcome measure. The description of the intervention effect that results from this comparison:
  Means on the outcome measure for the T and C samples; the difference between those means
  A p-value for the statistical significance of the difference between the means

Problem to be addressed
The native statistical findings that represent the effect of an intervention on an education outcome often provide little insight into the nature, magnitude, or practical significance of that effect. Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful.

Example
Intervention: vocabulary-building program
Samples: fifth graders receiving (T) and not receiving (C) the program
Outcome: CAT5 reading achievement test
Mean score for T: 718
Mean score for C: 703
Difference between T and C means: 15 points
p-value: < .05 [Note: not an indicator of the magnitude of the effect!]
Questions:
  Is this a big effect or a trivial one? Do the students read a lot better now, or just a little better?
  If they were poor readers before, is this a big enough effect to now make them proficient readers?
  If they were behind their peers, have they now caught up?
Someone intimately familiar with CAT5 scoring may be able to look at the means and answer such questions, but most of us haven't a clue.

Two approaches to review here
Descriptive representations of intervention effects: translations of the native statistical results into forms that are more readily understood
Practical significance: assessing the magnitude of intervention effects against criteria that have recognized value in the context of application

Useful Descriptive Representations of Intervention Effects

Representation in terms of the original metric
Often inherently meaningful, e.g.:
  proportion of days a student was absent
  number of suspensions or expulsions
  proportion of assignments completed
Covariate-adjusted means (to account for baseline differences and attrition)
Pretest baselines and differential pre-post change (example on next slide)

Fuller picture with a pretest baseline
Middle school students, conflict resolution intervention; surveys at the beginning and end of the school year measured self-reported interpersonal aggression.
[Table: Pre-Post Change Differentials that Result in the Same Posttest Difference. Pretest and posttest means for the intervention and control groups under three scenarios (A, B, and C) that differ in pre-post change but end at the same posttest difference.]
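To make the three scenarios concrete, here is a minimal Python sketch with hypothetical numbers (the slide's actual values are not reproduced in this transcript), showing how the same posttest difference can arise from very different pre-post change patterns:

```python
# Hypothetical pre/post means: three scenarios that all end with the same
# posttest T-C difference of 10 points.
scenarios = {
    "A": {"T": (50, 60), "C": (50, 50)},  # T improves, C flat
    "B": {"T": (55, 60), "C": (50, 50)},  # T started ahead, smaller gain
    "C": {"T": (50, 60), "C": (55, 50)},  # T gains while C declines
}

for name, groups in scenarios.items():
    t_pre, t_post = groups["T"]
    c_pre, c_post = groups["C"]
    posttest_diff = t_post - c_post
    # Difference-in-differences: change in T minus change in C
    did = (t_post - t_pre) - (c_post - c_pre)
    print(f"Scenario {name}: posttest diff = {posttest_diff}, "
          f"pre-post change differential = {did}")
```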

Effect size
Typically the standardized mean difference:
ES_d = Δ / σ
where Δ is the difference between the T and C means and σ is the standard deviation of the outcome measure.
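A minimal sketch of the computation, assuming the common choice of the pooled within-group standard deviation for σ (the slide does not commit to a particular standardizer); the sample data are hypothetical:

```python
import numpy as np

def effect_size_d(treatment, control):
    """Standardized mean difference: (mean_T - mean_C) / pooled SD."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    # Pooled SD weights each group's variance by its degrees of freedom
    pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) \
                 / (len(t) + len(c) - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

# Hypothetical samples loosely echoing the CAT5 example (means 718 vs. 703)
rng = np.random.default_rng(0)
print(f"ES_d = {effect_size_d(rng.normal(718, 40, 100), rng.normal(703, 40, 100)):.2f}")
```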

Utility of effect size
Useful for comparing effects across studies where the 'same' outcome is measured differently
Somewhat meaningful to researchers
But not very intuitive; provides little insight into the nature and magnitude of an effect, especially for nonresearchers
Often reported in relation to Cohen's guidelines for 'small,' 'medium,' and 'large': a bad idea (see the benchmark discussion below)

Notes and quirks about effect sizes
Better computed with covariate-adjusted means
Don't adjust the variance/SD: that would undermine the concept of standardization (sketch below)
The issue of which variance to standardize on
Effect sizes standardized on a variance/SD other than the between-individuals variance
Effect sizes from multilevel analysis results
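The first two bullets can be made concrete with a short sketch: the numerator is the covariate-adjusted mean difference (here, the group coefficient from an ANCOVA-style regression on the pretest), while the denominator remains the unadjusted between-individuals SD. All data here are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, n)              # 1 = treatment, 0 = control
pretest = rng.normal(500, 50, n)
posttest = 0.8 * pretest + 12 * group + rng.normal(0, 30, n)

# Adjusted mean difference: coefficient on the group indicator
X = np.column_stack([np.ones(n), group, pretest])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
adjusted_diff = beta[1]

# Standardize on the UNADJUSTED between-individuals SD, not a residual SD
t, c = posttest[group == 1], posttest[group == 0]
pooled_sd = np.sqrt(((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1))
                    / (len(t) + len(c) - 2))
print(f"covariate-adjusted ES = {adjusted_diff / pooled_sd:.2f}")
```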

Proportions of T and C samples above or below a threshold score

Cohen's U3 overlap index (adapted from Redfield & Rousseau)
[Figure: overlapping control and treatment score distributions. 50% of the control sample scores above the control mean; 77% of the treatment sample scores above the control mean.]
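Under the usual normality and equal-variance assumptions, U3 is simply the normal CDF evaluated at the effect size; a quick sketch (the slide's 77% corresponds to a d of roughly 0.75):

```python
from scipy.stats import norm

def u3(d):
    """Proportion of the treatment group scoring above the control group's mean."""
    return norm.cdf(d)

print(f"U3 for d = 0.75: {u3(0.75):.0%}")  # about 77%, matching the figure
```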

Rosenthal & Rubin's BESD (binomial effect size display), illustrated for d = .80
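A sketch of the standard BESD conversion: d is first converted to a correlation r (the formula below assumes equal group sizes), and the two displayed 'success rates' are then 50% minus and plus r/2:

```python
import math

def besd(d):
    """Binomial effect size display 'success rates' implied by Cohen's d."""
    r = d / math.sqrt(d**2 + 4)       # d-to-r conversion for equal group sizes
    return 0.5 - r / 2, 0.5 + r / 2   # (control rate, treatment rate)

c_rate, t_rate = besd(0.80)
print(f"d = .80 -> success rates: C = {c_rate:.0%}, T = {t_rate:.0%}")  # 31% vs. 69%
```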

Proportion reaching or exceeding a performance threshold

Options for threshold values
Mean of the control sample (U3)
Grand mean of the combined T and C samples (BESD)
Predefined performance threshold (e.g., NAEP proficiency levels)
Other possibilities:
  Mean of a norming sample, e.g., a standard score of 100 on the PPVT
  Mean of a reference group with a 'gap,' e.g., students who don't qualify for FRPL, majority students
  Study-determined threshold, e.g., the score at which teachers see behavior as problematic
  Target value, e.g., the achievement gain needed for AYP
  Any other identifiable score on the measure that has interpretable meaning within the context of the intervention study
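Whichever threshold is chosen from this list, the descriptive computation is the same; a minimal sketch with hypothetical score samples:

```python
import numpy as np

def pct_at_or_above(scores, threshold):
    """Proportion of a sample reaching or exceeding a threshold score."""
    return (np.asarray(scores) >= threshold).mean()

rng = np.random.default_rng(2)
t_scores = rng.normal(718, 40, 100)   # hypothetical T and C samples
c_scores = rng.normal(703, 40, 100)
threshold = c_scores.mean()           # e.g., the U3-style control-mean threshold
print(f"T: {pct_at_or_above(t_scores, threshold):.0%}  "
      f"C: {pct_at_or_above(c_scores, threshold):.0%}")
```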

Conversion to grade equivalent (and age equivalent) scores
[Table: Mean Reading Grade Equivalent (GE) Scores of Success for All (SFA) and Control Samples, from Slavin et al., 1996.]

Characteristics and quirks of grade equivalent scores
Provided (or not) by the test developer [Note: could be developed by the researcher for the context of an intervention study]
Vary from X.0 to X.9 over the 9-month school year
Not criterion-referenced; estimated from an empirical norming sample
Imputed where norming data are thin, especially for students outside the grade range
Nonlinear relationship to test scores: a given GE difference in the early grades represents a larger score difference than in the later grades, but there is greater within-grade variation in the later grades
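On the bracketed note above (researchers constructing their own GEs): one straightforward approach is to interpolate a scale score onto the grade scale using the norming sample's mean (or median) score at each grade placement. The norming values below are hypothetical:

```python
import numpy as np

# Hypothetical norming data: mean scale score at each grade placement.
# Note the shrinking per-grade gains, the source of the nonlinearity quirk above.
grade_placements = np.array([3.0, 4.0, 5.0, 6.0])
norm_means = np.array([660.0, 690.0, 710.0, 724.0])

def grade_equivalent(score):
    """Linearly interpolate a scale score onto the grade scale."""
    return np.interp(score, norm_means, grade_placements)

print(f"GE for a score of 703: {grade_equivalent(703.0):.2f}")  # 4.65
```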

Practical Significance: Criterion Frameworks for Assessing the Magnitude of Intervention Effects

Practical significance must be judged in reference to some external standard relevant to the intervention context. E.g., compare the effect found in a study with:
  Effects others have found on similar measures with similar interventions
  Normative expectations for change
  Policy-relevant performance gaps
  Intervention costs (not discussed here)

Cohen's rules of thumb for interpreting effect size: normative but overly broad
Cohen:  Small = 0.20, Medium = 0.50, Large = 0.80
  Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Hillsdale, NJ: Lawrence Erlbaum.
Lipsey:  Small = 0.15, Medium = 0.45, Large = 0.90
  Lipsey, Mark W. (1990). Design Sensitivity: Statistical Power for Experimental Research. Newbury Park, CA: Sage Publications.

Effect sizes for achievement from random assignment studies of education interventions
124 random assignment studies
181 independent subject samples
831 effect size estimates

Achievement effect sizes by grade level and type of achievement test
[Table: number of effect size estimates, mean, and SD for three measure types (standardized test, broad; standardized test, narrow; specialized topic/test) at the elementary, middle, and high school levels; no estimates were available for broad standardized tests in high school.]

Achievement effect sizes by target recipients
[Table: number of effect size estimates, mean, and SD for interventions delivered to individual students (one-on-one), small groups (not classrooms), classrooms of students, whole schools, and mixed recipients.]

Normative expectations for change: estimating annual gains in effect size from national norming samples for standardized tests
Up to seven tests were used for reading, math, science, and social science
The mean and standard deviation of scale scores for each grade were obtained from test manuals
The standardized mean difference across succeeding grades was computed (sketch below)
These results were averaged across tests, weighted according to Hedges (1982)
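A sketch of the grade-to-grade computation described above, for one pair of adjacent grades on one test (the norming means and SDs are hypothetical, and the cross-test weighting per Hedges, 1982, is omitted):

```python
import math

def annual_gain_es(mean_lo, sd_lo, mean_hi, sd_hi):
    """Standardized mean difference between adjacent grades in a norming sample."""
    pooled_sd = math.sqrt((sd_lo**2 + sd_hi**2) / 2)
    return (mean_hi - mean_lo) / pooled_sd

# Hypothetical grade 4 and grade 5 reading means/SDs on the same scale
print(f"grade 4 -> 5 gain ES = {annual_gain_es(690, 38, 710, 40):.2f}")  # about 0.51
```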

Annual reading growth
[Table: reading growth per grade transition, from K-1 upward, expressed as an effect size.]
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie, MAT8, Terra Nova CAT, and SAT10.

Policy-relevant demographic performance gaps
The effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups
Effect size gaps for groups may vary across grades, years, tests, and districts

Policy-relevant performance gaps between "average" and "weak" schools
Main idea:
  What is the performance gap (in effect size) for the same types of students in different schools?
Approach:
  Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status
  Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution (a sketch follows below)
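A sketch of one way to implement this approach: a student-level regression with school fixed effects, with the gap between high- and median-performing schools then expressed in student-level SD units. The data frame and column names are hypothetical, and a real analysis might prefer a multilevel model:

```python
import numpy as np
import statsmodels.formula.api as smf

def school_gap_es(df):
    """Adjusted gap (in student SD units) between 90th- and 50th-percentile schools.

    Expects columns: score, school, prior_score, race, female, overage, frpl.
    """
    model = smf.ols(
        "score ~ C(school) + prior_score + C(race) + female + overage + frpl",
        data=df,
    ).fit()
    # Collect the school fixed effects; the omitted reference school is 0
    fe = np.array([0.0] + [v for k, v in model.params.items()
                           if k.startswith("C(school)")])
    gap = np.percentile(fe, 90) - np.percentile(fe, 50)
    return gap / df["score"].std(ddof=1)
```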

In conclusion …
The native statistical form for intervention effects provides little understanding of their nature or magnitude
Translating the effects into a more descriptive and intuitive form makes them easier for practitioners, policymakers, and researchers to understand and assess
There are a number of easily applied translations that could be routinely used in reporting intervention effects
The practical significance of those effects, however, requires that they be compared with some criterion meaningful in the intervention context
Assessing practical significance is more difficult, but there are a number of approaches that may be appropriate depending on the intervention and outcome construct