
1 Beyond p-Values: Characterizing Education Intervention Effects in Meaningful Ways Mark Lipsey, Kelly Puzio, Cathy Yun, Michael Hebert, Kasia Steinka-Fry, Mikel Cole, Megan Roberts, Karen Anthony, Matthew Busick Vanderbilt University And also Howard Bloom, Carolyn Hill, & Alison Black IES Research Conference Washington, DC June 2010

2 Intervention research model
Compare a treatment (T) sample with a control (C) sample on an education outcome measure.
Description of the intervention effect that results from this comparison:
  - Means on the outcome measure for the T and C samples; the difference between those means
  - p-value for the statistical significance of the difference between the means
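A minimal sketch of such a T-vs-C comparison, with invented data and an ordinary two-sample t-test (actual studies may use more elaborate models); the means echo the CAT5 example on slide 4, while the SD and sample sizes are hypothetical:

```python
# Minimal sketch of the native T-vs-C comparison (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(loc=718, scale=40, size=100)  # invented T sample scores
control = rng.normal(loc=703, scale=40, size=100)    # invented C sample scores

diff = treatment.mean() - control.mean()               # difference between means
t_stat, p_value = stats.ttest_ind(treatment, control)  # p-value for that difference
print(f"T mean = {treatment.mean():.1f}, C mean = {control.mean():.1f}, "
      f"difference = {diff:.1f}, p = {p_value:.3f}")
```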

3 Problem to be addressed
The native statistical findings that represent the effect of an intervention on an education outcome often provide little insight into the nature, magnitude, or practical significance of the effect.
Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful.

4 Example
Intervention: vocabulary-building program
Samples: fifth graders receiving (T) and not receiving (C) the program
Outcome: CAT5 reading achievement test
Mean score for T: 718
Mean score for C: 703
Difference between T and C means: 15 points
p-value: < .05 [Note: not an indicator of the magnitude of the effect!]
Questions:
  - Is this a big effect or a trivial one? Do the students read a lot better now, or just a little better?
  - If they were poor readers before, is this a big enough effect to now make them proficient readers?
  - If they were behind their peers, have they now caught up?
Someone intimately familiar with CAT5 scoring may be able to look at the means and answer such questions, but most of us haven't a clue.

5 Two approaches to review here
1. Descriptive representations of intervention effects: translations of the native statistical results into forms that are more readily understood
2. Practical significance: assessing the magnitude of intervention effects in relation to criteria that have recognized value in the context of application

6 Useful Descriptive Representations of Intervention Effects

7 Representation in terms of the original metric
Often inherently meaningful, e.g.:
  - proportion of days the student was absent
  - number of suspensions or expulsions
  - proportion of assignments completed
Covariate-adjusted means (to handle baseline differences and attrition)
Pretest baselines and differential pre-post change (example on next slide)

8 Fuller picture with a pretest baseline
Middle school students, conflict resolution program; surveys at the beginning and end of the school year on self-reported interpersonal aggression.

Pre-Post Change Differentials that Result in the Same Posttest Difference

                 Scenario A          Scenario B          Scenario C
                 Pretest  Posttest   Pretest  Posttest   Pretest  Posttest
  Intervention   25.5     23.8       17.7     23.8       22.9     23.8
  Control        25.6     27.4       17.6     27.4       23.0     27.4
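A small worked computation using the values from the table above (pandas is used only for tidy display); it shows that all three scenarios yield the same posttest difference even though the pre-post trajectories differ:

```python
# Differential pre-post change for the three scenarios in the table above.
import pandas as pd

data = {
    "A": {"T_pre": 25.5, "T_post": 23.8, "C_pre": 25.6, "C_post": 27.4},
    "B": {"T_pre": 17.7, "T_post": 23.8, "C_pre": 17.6, "C_post": 27.4},
    "C": {"T_pre": 22.9, "T_post": 23.8, "C_pre": 23.0, "C_post": 27.4},
}
df = pd.DataFrame(data).T
df["T_change"] = df["T_post"] - df["T_pre"]            # change in intervention group
df["C_change"] = df["C_post"] - df["C_pre"]            # change in control group
df["differential"] = df["T_change"] - df["C_change"]   # pre-post change differential
df["posttest_diff"] = df["T_post"] - df["C_post"]      # same -3.6 in every scenario
print(df[["T_change", "C_change", "differential", "posttest_diff"]])
```

Assuming higher scores mean more self-reported aggression, the intervention group improved while the control group worsened in Scenario A, both groups worsened in Scenario B, and the intervention group held roughly steady in Scenario C; the posttest-only comparison cannot distinguish these situations.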


10 Effect size
Typically the standardized mean difference: ES_d = Δ / σ, where Δ is the difference between the T and C means and σ is the standard deviation used to standardize (typically the pooled within-group SD).
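A minimal sketch of the computation, assuming σ is the pooled within-group standard deviation (other choices are discussed on slide 12); the data are invented:

```python
# Standardized mean difference ES_d = (T mean - C mean) / pooled SD (sketch).
import numpy as np

def effect_size(treatment, control):
    """Standardized mean difference, standardized on the pooled within-group SD."""
    t, c = np.asarray(treatment, dtype=float), np.asarray(control, dtype=float)
    n_t, n_c = len(t), len(c)
    pooled_var = ((n_t - 1) * t.var(ddof=1) + (n_c - 1) * c.var(ddof=1)) / (n_t + n_c - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

# Hypothetical data: a 15-point difference on a test with an SD of roughly 40.
rng = np.random.default_rng(1)
print(round(effect_size(rng.normal(718, 40, 200), rng.normal(703, 40, 200)), 2))
```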

11 Utility of effect size
- Useful for comparing effects across studies with the 'same' outcome measured differently
- Somewhat meaningful to researchers
- But not very intuitive; provides little insight into the nature and magnitude of the effect, especially for nonresearchers
- Often reported in relation to Cohen's guidelines for 'small,' 'medium,' and 'large' (a BAD IDEA)

12 Notes and quirks about ESs
- Better computed with covariate-adjusted means
- Don't adjust the variance/SD: that would undermine the concept of standardization
- Issue of the variance on which to standardize
- Effect sizes standardized on a variance/SD other than that between individuals
- Effect sizes from multilevel analysis results
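A sketch of the first two points: take the covariate-adjusted T-C difference from an ANCOVA-style regression, but standardize it on the unadjusted pooled within-group SD. The column names (posttest, pretest, treat) and the data are hypothetical:

```python
# Sketch: covariate-adjusted mean difference over the UNADJUSTED pooled SD.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def adjusted_effect_size(df: pd.DataFrame) -> float:
    # Covariate-adjusted T-C difference from a regression of posttest on treatment + pretest.
    adj_diff = smf.ols("posttest ~ treat + pretest", data=df).fit().params["treat"]
    # Standardize on the unadjusted pooled within-group SD of the posttest.
    t = df.loc[df["treat"] == 1, "posttest"]
    c = df.loc[df["treat"] == 0, "posttest"]
    pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) / (len(t) + len(c) - 2)
    return adj_diff / float(np.sqrt(pooled_var))

# Hypothetical data: pretest predicts posttest, plus a modest treatment effect.
rng = np.random.default_rng(2)
pretest = rng.normal(700, 40, 400)
treat = np.repeat([1, 0], 200)
posttest = 0.7 * pretest + 210 + 12 * treat + rng.normal(0, 25, 400)
print(round(adjusted_effect_size(pd.DataFrame(
    {"posttest": posttest, "pretest": pretest, "treat": treat})), 2))
```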

13 Proportions of T and C samples above or below a threshold score

14 Cohen's U3 overlap index
Adapted from Redfield & Rousseau, 1981.
[Figure: overlapping T and C distributions with the T distribution shifted by .73 σ; 50% of the C sample scores above the C mean vs. 77% of the T sample.]
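Assuming normal, equal-variance score distributions, U3 is simply the standard normal CDF evaluated at the effect size, which is how 77% follows from a shift of .73 σ; a minimal sketch:

```python
# Cohen's U3: proportion of the T distribution above the C mean, assuming normality.
from scipy.stats import norm

def u3(d):
    return norm.cdf(d)

print(f"{u3(0.73):.0%}")  # ~77% of the T sample above the C mean (50% by definition for C)
```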

15 Rosenthal & Rubin's Binomial Effect Size Display (BESD)
[Figure: BESD illustrated for d = .80]
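A sketch of the standard BESD arithmetic (not taken from the slide itself): convert d to a correlation r, assuming equal group sizes, and display hypothetical 'success' rates of .50 ± r/2:

```python
# Rosenthal & Rubin BESD: d -> r -> 'success' rates for T and C (sketch).
import math

def besd(d):
    r = d / math.sqrt(d**2 + 4)          # assumes equal group sizes
    return 0.5 + r / 2, 0.5 - r / 2      # (T success rate, C success rate)

t_rate, c_rate = besd(0.80)
print(f"T: {t_rate:.0%}, C: {c_rate:.0%}")  # roughly 69% vs 31% for d = .80
```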

16 Proportion reaching or exceeding a performance threshold

17 [Figure-only slide: no transcript text]

18 Options for threshold values
- Mean of the control sample (U3)
- Grand mean of the combined T and C samples (BESD)
- Predefined performance threshold (e.g., NAEP)
- Other possibilities:
  - Mean of a norming sample, e.g., a standard score of 100 on the PPVT
  - Mean of a reference group with a 'gap,' e.g., students who don't qualify for FRPL, majority students
  - Study-determined threshold, e.g., the score at which teachers see behavior as problematic
  - Target value, e.g., the achievement gain needed for AYP
  - Any other identifiable score on the measure that has interpretable meaning within the context of the intervention study
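Whatever threshold is chosen, the proportion of each sample at or above it can be computed directly from the raw scores or, as a rough sketch, approximated from means and SDs under a normality assumption; the cut score and SD below are hypothetical:

```python
# Proportion of T and C samples at or above a chosen threshold (sketch).
import numpy as np
from scipy.stats import norm

def proportion_above(scores, threshold):
    """Empirical proportion at or above the threshold (use when raw scores are available)."""
    return (np.asarray(scores, dtype=float) >= threshold).mean()

def proportion_above_normal(mean, sd, threshold):
    """Normal approximation from summary statistics only."""
    return norm.sf((threshold - mean) / sd)

# Hypothetical: a proficiency cut score of 730 on a test with SD ~40.
print(f"T: {proportion_above_normal(718, 40, 730):.0%}, "
      f"C: {proportion_above_normal(703, 40, 730):.0%}")
```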

19 Conversion to grade equivalent (and age equivalent) scores Mean Reading Grade Equivalent (GE) Scores of Success for All (SFA) and Control Samples [from Slavin et al., 1996]

20 Characteristics and quirks of grade equivalent scores
- Provided (or not) by the test developer [Note: could also be developed by the researcher for the context of an intervention study]
- Vary from X.0 to X.9 over the 9-month school year
- Not criterion-referenced; estimated from an empirical norming sample
- Imputed where norming data are thin, especially for students outside the grade range of the test
- Nonlinear relationship to test scores: a given GE difference corresponds to a larger score difference in the early grades than in later grades, but within-grade variation is greater in later grades
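As a sketch of the note that researchers could construct grade equivalents themselves: a GE can be obtained by interpolating a scale score against grade-level means from a norming sample. The norming table below is entirely hypothetical:

```python
# Sketch: deriving a grade-equivalent score by interpolating against grade-level
# norm means (the norming table below is hypothetical, for illustration only).
import numpy as np

grade_levels = np.array([3.0, 4.0, 5.0, 6.0])            # grade at time of norming
norm_means   = np.array([680.0, 695.0, 707.0, 716.0])    # mean scale score per grade

def grade_equivalent(scale_score):
    # Piecewise-linear interpolation of grade level as a function of the norm means.
    return float(np.interp(scale_score, norm_means, grade_levels))

print(grade_equivalent(703.0))  # a score of 703 falls between the grade 4 and grade 5 norms
```

Real GE scales also spread within-year growth across the school months (X.0 to X.9), which this sketch ignores.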

21 Practical Significance: Criterion Frameworks for Assessing the Magnitude of Intervention Effects

22 Practical significance must be judged in reference to some external standard relevant to the intervention context
E.g., compare the effect found in a study with:
  - Effects others have found on similar measures with similar interventions
  - Normative expectations for change
  - Policy-relevant performance gaps
  - Intervention costs (not discussed here)

23 Cohen's rules of thumb for interpreting effect size: normative but overly broad

  Cohen:   Small = 0.20   Medium = 0.50   Large = 0.80
  Lipsey:  Small = 0.15   Medium = 0.45   Large = 0.90

Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Hillsdale, NJ: Lawrence Erlbaum.
Lipsey, Mark W. (1990). Design Sensitivity: Statistical Power for Experimental Research. Newbury Park, CA: Sage Publications.

24 Effect sizes for achievement from random assignment studies of education interventions
- 124 random assignment studies
- 181 independent subject samples
- 831 effect size estimates

25 Achievement effect sizes by grade level and type of achievement test

  Grade Level & Achievement Measure     N of ES Estimates   Mean   SD
  Elementary School                     693                 .14    .31
    Standardized test (broad)            89                 .06    .12
    Standardized test (narrow)          374                 .13    .24
    Specialized topic/test              230                 .28    .52
  Middle School                          70                 .11    .19
    Standardized test (broad)            13                 .07    .16
    Standardized test (narrow)           30                 .24    .18
    Specialized topic/test               27                 .14    .28
  High School                            68                 .10    .20
    Standardized test (broad)            --                 --     --
    Standardized test (narrow)           22                 .05    .06
    Specialized topic/test               43                 .38    .28


27 Achievement effect sizes by target recipients

  Target Recipients                  N of ES Estimates   Mean   SD
  Individual students (one-on-one)   252                 .16    .33
  Small groups (not classrooms)      322                 .24    .33
  Classroom of students              178                 .11    .27
  Whole school                        35                 .10    .18
  Mixed                               44                 .07    .15

28 Normative expectations for change: estimating annual gains in effect size from national norming samples for standardized tests
- Up to seven tests were used for reading, math, science, and social science
- The mean and standard deviation of scale scores for each grade were obtained from test manuals
- The standardized mean difference across succeeding grades was computed
- These results were averaged across tests, weighted according to Hedges (1982)
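A sketch of the per-test computation described above, with a hypothetical norming table (the grade-level means and SDs are invented for illustration); the cross-test averaging with Hedges (1982) weights is omitted:

```python
# Sketch: annual gain in effect size units from one test's norming table
# (grade-level means and SDs are hypothetical placeholders).
import numpy as np

grades = [3, 4, 5, 6]
means  = np.array([680.0, 695.0, 707.0, 716.0])   # norming-sample mean scale scores
sds    = np.array([38.0, 39.0, 41.0, 42.0])       # norming-sample SDs

for g_lo, g_hi, m_lo, m_hi, s_lo, s_hi in zip(grades, grades[1:], means, means[1:], sds, sds[1:]):
    pooled_sd = np.sqrt((s_lo**2 + s_hi**2) / 2)  # simple pooling of adjacent grades
    es = (m_hi - m_lo) / pooled_sd                # standardized mean difference across grades
    print(f"Grade {g_lo} -> {g_hi}: ES = {es:.2f}")
```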

29 Annual reading growth

  Grade Transition   Effect Size
  K - 1              1.52
  1 - 2              0.97
  2 - 3              0.60
  3 - 4              0.36
  4 - 5              0.40
  5 - 6              0.32
  6 - 7              0.23
  7 - 8              0.26
  8 - 9              0.24
  9 - 10             0.19
  10 - 11            0.19
  11 - 12            0.06

Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates-MacGinitie, MAT8, Terra Nova CAT, and SAT10.

30 [Figure-only slide: no transcript text]

31 Policy-relevant demographic performance gaps
- The effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups
- Effect size gaps for groups may vary across grades, years, tests, and districts
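A small worked example of the idea, with invented numbers: express the gap between two demographic groups in effect size units, then ask how much of that gap a given intervention effect would close:

```python
# Sketch: intervention effect size relative to a demographic performance gap.
# All values are hypothetical placeholders.
group_a_mean, group_b_mean, student_sd = 715.0, 695.0, 40.0

gap_es = (group_a_mean - group_b_mean) / student_sd   # gap in effect size units (0.50 here)
intervention_es = 0.15                                 # hypothetical intervention effect

print(f"Gap ES = {gap_es:.2f}; intervention closes about {intervention_es / gap_es:.0%} of it")
```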

32 [Figure-only slide: no transcript text]

33 Policy-relevant performance gaps between "average" and "weak" schools
Main idea:
  - What is the performance gap (in effect size) for the same types of students in different schools?
Approach:
  - Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status
  - Infer the performance gap (in effect size) between schools at different percentiles of the school performance distribution
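A minimal sketch of one way to implement this approach, assuming a student-level DataFrame with hypothetical column names (score, school_id, race_ethnicity, prior_score, female, overage, frpl); the original analysis may differ in its model specification:

```python
# Sketch: performance gap between "average" and "weak" schools for comparable students.
# Column names are assumptions for illustration, not the authors' actual variables.
import pandas as pd
import statsmodels.formula.api as smf

def school_gap_effect_size(df: pd.DataFrame) -> float:
    # OLS with school fixed effects, controlling for student characteristics.
    model = smf.ols(
        "score ~ C(school_id) + C(race_ethnicity) + prior_score + female + overage + frpl",
        data=df,
    ).fit()
    # Estimated school effects (adjusted school means, up to a common constant;
    # the omitted reference school is ignored in this sketch).
    school_effects = model.params[model.params.index.str.startswith("C(school_id)")]
    # Gap between a median (50th percentile) and a weak (10th percentile) school,
    # expressed in student-level standard deviation units.
    gap = school_effects.quantile(0.50) - school_effects.quantile(0.10)
    return gap / df["score"].std(ddof=1)
```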

34 [Figure-only slide: no transcript text]

35 In conclusion ...
- The native statistical form of intervention effects provides little understanding of their nature or magnitude
- Translating the effects into a more descriptive and intuitive form makes them easier for practitioners, policymakers, and researchers to understand and assess
- There are a number of easily applied translations that could be routinely used in reporting intervention effects
- The practical significance of those effects, however, requires that they be compared with some criterion meaningful in the intervention context
- Assessing practical significance is more difficult, but there are a number of approaches that may be appropriate depending on the intervention and the outcome construct

