On teaching statistical inference: What do p values (not) mean?
Bruce Blaine, PhD, PStat®
Department of Mathematical and Computing Sciences
St. John Fisher College
Limitations of NHST
The misapplication of null hypothesis significance testing (NHST) procedures for statistical inference is well known.
o NHST procedures do not address what researchers most want to know.
o NHST procedures test a (nil) null hypothesis, which is rarely true and therefore uninformative to reject.
o NHST procedures deliver a conditional probability, p(D|H0), which is commonly misinterpreted.
o NHST procedures do not test research hypotheses.
o NHST procedures do not quantify effect size.
Misinterpretations of p values
Two misinterpretations of p values from NHST procedures are common in the social sciences (cf. Kline, 2004):
1. Magnitude fallacy: p values are misunderstood as an effect size statistic, as if a smaller p indicated a larger treatment effect.
   “…the effect was marginally significant, p=.07”
   “…the effect was highly (or extremely) significant, p<.001”
2. Validity fallacy: p(D|H0) is misunderstood as p(H1|D).
   “…the treatment improved the outcome, p<.05”
   “…the treatment had no effect on the outcome, p>.05”
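The gap between p(D|H0) and p(H1|D) can be made concrete with Bayes' rule. The sketch below treats "D" as "the test came out significant" and shows how the probability of H1 given a significant result depends on quantities a p value never sees. The function name and the prior, alpha, and power values are illustrative assumptions and do not come from the slides.

```python
def prob_h1_given_significant(prior_h1, alpha, power):
    """P(H1 | significant result) by Bayes' rule.

    alpha = P(significant | H0); power = P(significant | H1).
    """
    prior_h0 = 1.0 - prior_h1
    p_significant = power * prior_h1 + alpha * prior_h0
    return power * prior_h1 / p_significant


# Hypothetical inputs, chosen only for illustration.
for prior in (0.1, 0.5, 0.9):
    posterior = prob_h1_given_significant(prior, alpha=0.05, power=0.5)
    print(f"prior P(H1) = {prior:.1f} -> P(H1 | significant) = {posterior:.3f}")
```

The same alpha yields very different values of P(H1 | significant) as the prior changes, which is why a p value alone cannot be read as the probability that the research hypothesis is true.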
Classroom exercise 1: Addressing the magnitude fallacy
1. In Excel (using the Data Analysis ToolPak add-in), have students enter the data from a hypothetical experiment in Table 1.
2. Provide the results table shown in Table 2, or have students create it.
3. Have students run an independent-samples t test (assume equal variances).
4. Copy and paste the treatment and control data to increase the ns by 5, repeating the t test each time.
5. Fill in the table with values from the analyses.
[Table 1] [Table 2]
Classroom exercise 1: Results
This exercise should point out that p values decrease across the 3 experiments even though the treatment has the same effect in each. Why?
Students should come to appreciate that larger samples are associated with smaller estimated standard errors. For a constant mean difference (which doesn’t change in this exercise), this will produce larger t values and smaller p values.
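For instructors who want to check the pattern outside Excel, the exercise can be sketched in Python using only the standard library. The treatment and control values below are made up for illustration (the data in Table 1 are not reproduced here), and the p value is approximated by numerically integrating the t density rather than by calling a statistics package.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Approximate two-sided p value via Simpson's rule over [0, |t|]."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def pooled_t_test(treatment, control):
    """Independent-samples t test assuming equal variances."""
    n1, n2 = len(treatment), len(control)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean(treatment) - mean(control)
    t = diff / se
    return diff, se, t, two_sided_p(t, df)


# Hypothetical data (assumption): n = 5 per group, mean difference of 1.
treatment = [6.0, 7.0, 8.0, 9.0, 10.0]
control = [5.0, 6.0, 7.0, 8.0, 9.0]

results = []
for copies in (1, 2, 3):  # pasting the data again raises n by 5 each time
    diff, se, t, p = pooled_t_test(treatment * copies, control * copies)
    results.append((5 * copies, diff, se, t, p))
    print(f"n = {5 * copies:2d}  diff = {diff:.2f}  SE = {se:.3f}  t = {t:.3f}  p = {p:.3f}")
```

The mean difference is identical in every row; only the standard error, and with it t and p, changes as n grows.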
Classroom exercise 2: Addressing the validity fallacy
Imagine 3 studies that compare students with high (Treatment, or T) and low (Control, or C) Facebook time on GPA, with descriptive statistics from the studies in the table below:
1. Have students observe (via hand-calculated t tests or 95% confidence intervals) that none of the 3 studies would reject H0 at p<.05.
2. In Excel (using the Meta Easy add-in), have students enter the data from the 3 hypothetical studies and generate a meta-analysis of the effect of Facebook time on GPA.
Classroom exercise 2: Results
The exercise should point out that although none of the 3 studies is statistically significant (defined as p<.05), when their data are combined the Facebook effect on GPA is significant. Students should notice that the 95% CI estimate of the Facebook effect on GPA (the FE diamond) does not include 0.
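The logic of this result can be sketched with a fixed-effect, inverse-variance meta-analysis of raw mean differences. This is an assumed stand-in for illustration, not a description of what the Meta Easy add-in computes. The summary statistics are hypothetical (the slide's table is not reproduced here), both groups in a study are assumed to share one n and one SD, and the confidence intervals use a normal approximation.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

# Hypothetical studies (assumption): (n per group, mean GPA for T, mean GPA for C, SD)
studies = [
    (20, 2.90, 3.10, 0.50),
    (25, 2.85, 3.10, 0.55),
    (30, 2.95, 3.15, 0.50),
]


def mean_difference(n, mean_t, mean_c, sd):
    """Mean difference (T minus C) and its standard error."""
    diff = mean_t - mean_c
    se = math.sqrt(2 * sd ** 2 / n)
    return diff, se


def ci95(estimate, se):
    """Normal-approximation 95% confidence interval."""
    return estimate - Z95 * se, estimate + Z95 * se


effects = [mean_difference(*study) for study in studies]
study_cis = [ci95(diff, se) for diff, se in effects]

# Fixed-effect pooling: weight each study by the inverse of its variance.
weights = [1 / se ** 2 for _, se in effects]
pooled = sum(w * diff for w, (diff, _) in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_ci = ci95(pooled, pooled_se)

for (diff, _), (lo, hi) in zip(effects, study_cis):
    print(f"study:  diff = {diff:+.3f}  95% CI = ({lo:+.3f}, {hi:+.3f})")
print(f"pooled: diff = {pooled:+.3f}  95% CI = ({pooled_ci[0]:+.3f}, {pooled_ci[1]:+.3f})")
```

With these made-up numbers, each single-study interval spans 0 while the pooled interval does not, because pooling shrinks the standard error without changing the direction of the effect.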
Summary lessons
These exercises allow data to teach students where p values come from and how to properly interpret them.
o Exercise 1 shows that although p values are influenced by both the mean difference and the sample size, they cannot be trusted to quantify the mean difference alone.
o Exercise 2 shows that evidence from “nonsignificant” studies, when taken as evidence against H1, can be misleading. Genuine treatment effects may be obscured in studies with small samples, high variability, or both.
On teaching statistical inference: more estimation, less NHST
o Typical social science statistics textbooks and curricula are overdependent on NHST methods for statistical inference.
o These exercises can be part of a larger effort to teach more estimation methods in basic statistics courses, including confidence intervals, effect size statistics, and meta-analysis.
o Estimation methods are more intuitive because they speak to research, rather than null, hypotheses.