
1 Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations
Greenland et al (2016)

2 Overview of Paper
Statistical tests account for randomness
Often misused, misinterpreted
Seeks to clear up confusion regarding the use of: statistical tests, P values, confidence intervals, statistical power

Notes: Random variation is a source of error in scientific experiments, and statistical tests help to account or control for this variation, but clear and concise interpretations of these tests are often not available. The paper provides 25 examples of common mistakes and misinterpretations. We assume you've read all the examples, so we won't go over them individually; instead we've grouped them by theme.

3 Recurring Themes
All model assumptions must be true
If you're wrong, you're wrong
Bias in reporting and publishing
Other explanations may exist

Notes: These are errors that pop up, in one variation or another, throughout the paper. Any result from a statistical test comes with the caveat that all model assumptions are true; if any assumptions are violated, this can lead to too-large or too-small p values, or conflicting results across multiple studies. If you reject the null hypothesis because your p value is less than 0.05, you don't have a 5% chance of being wrong: if you falsely reject the null hypothesis, you're 100% wrong, no matter what your p value is. This holds true for confidence intervals and power as well. Under a different model, the same data may generate a different p value; the p value you see in publication is the one the author chose to publish or the editor chose to print, and without transparency you cannot know how the researcher arrived at her results. Even if a p value shows support for a particular hypothesis, another, untested hypothesis may be an even better fit, or an entirely different model may fit better.

4 A Few Definitions
Statistical model: mathematical representation of data variability
Test hypothesis: the hypothesis targeted by the test
P value: continuous measure of compatibility between data and hypothesis, under the given model
More definitions to come...

Notes: Statistical model: includes assumptions, such as random sampling, that may or may not be met or realistic; often presented in abstract form, or not presented at all. Test hypothesis: often, but not always, the null hypothesis; may also specify an effect of a certain size. P value: "model" here includes all assumptions of the model; ranges from 0 (no compatibility) to 1 (complete compatibility); generally some arbitrary cutoff (alpha) determines which values are deemed significant or non-significant. See the sketch below.
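A minimal sketch of our own (not from the paper) of a p value as a compatibility measure: the model and numbers are hypothetical, with the null model being a fair coin. The p value is simply the frequency with which chance alone produces data at least as extreme as what was observed.

```python
# Hypothetical setup: 61 heads observed in 100 flips; null model: fair coin.
import numpy as np

rng = np.random.default_rng(0)
n_flips, n_sims = 100, 100_000
observed_heads = 61

# Simulate the null model and ask how often chance alone produces a result
# at least as extreme as the one observed (two-sided, by symmetry).
sims = rng.binomial(n_flips, 0.5, size=n_sims)
p_value = np.mean(np.abs(sims - 50) >= abs(observed_heads - 50))
print(f"simulated two-sided p value: {p_value:.3f}")
```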

5 What Probability Does P Really Represent?
The P value simply indicates the degree to which the data conform to the pattern predicted by the null hypothesis
The P value is computed from a set of assumptions, so it cannot refer to the probability of those same assumptions

Notes: The probability is computed as if chance alone were operating, which is to say as if the test hypothesis were true; more generally, every assumption used to compute the P value, including the null hypothesis itself, is treated as if it were correct.

6 What Does the Size of the P Value Mean?
A significant p value (p ≤ 0.05) does not mean the test hypothesis should be rejected; it only flags that the data are unusual under the test hypothesis
A nonsignificant p value (p > 0.05) does not mean the test hypothesis should be accepted; it only indicates the data are not unusual under the test hypothesis
Any p value < 1 indicates some other hypothesis may be a better fit; the data may not be unusual under some other hypothesis not tested

7 Statistical Significance and Effect Size
Non-significance is not the same as no effect
Significance is not the same as importance
Effects may be lost in statistical noise
The p value, or significance level, is not the same thing as the effect size

Notes: Any p value less than 1 indicates at least some effect was present, but that effect may not be very large. A very small p value indicates strong support for some effect, but gives no indication of how much effect (see the sketch below). Additionally, not all effects may be detected by statistical tests: in a small study, even very large effects can be lost. Statistical significance (alpha) is generally set before the study is conducted, commonly at 0.05, and represents how likely you are to erroneously reject the null hypothesis.
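A minimal sketch (ours, with hypothetical effect and sample sizes): the same trivially small effect yields a tiny p value in a huge sample but is invisible in a small one, so significance says nothing about the size or importance of the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_shift = 0.05  # a tiny effect, in standard-deviation units

for n in (50, 100_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_shift, 1.0, n)
    _, p = stats.ttest_ind(a, b)  # two-sample t test
    print(f"n={n:>7}: mean difference={b.mean() - a.mean():+.3f}, p={p:.2g}")
```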

8 Equality and Inequality
Precision and transparency when reporting p values
P values refer to extremity

Notes: Precise reporting allows a reader to accurately interpret one's results. Saying a p value is equal to 0.05 is not the same thing as saying a p value is less than or equal to 0.05; using an inequality conceals the true p value. P values are a kind of inequality themselves: they represent the probability of observing the results observed *and results more extreme* than the results observed, not the probability of observing only the results observed (see the sketch below).
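A minimal sketch (ours, hypothetical numbers): the p value is a tail probability, not the probability of the exact observed result.

```python
# Hypothetical example: 60 heads in 100 flips of a fair coin.
from scipy import stats

n, observed = 100, 60
point_prob = stats.binom.pmf(observed, n, 0.5)    # P(exactly 60 heads)
tail_prob = stats.binom.sf(observed - 1, n, 0.5)  # P(60 or more heads)

print(f"P(exactly {observed} heads)  = {point_prob:.4f}")
print(f"P({observed} or more heads) = {tail_prob:.4f}  <- one-sided p value")
```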

9 What Are You Actually Testing?
Statistical significance is a property of the test result
Match the test to the hypothesis

Notes: Statistical significance is not an inherent property of what's being studied, so you can't "find evidence of" significance. Rather, significance is a property of the results you got from your statistical test: your p value can be significant (or not), but your effect cannot be. If your test hypothesis is that the measured effect will equal a certain value, then it is appropriate to use a two-sided p value; if your hypothesis is that the effect will be greater than a certain value, then it is appropriate to use a one-sided p value. That is what ultimately matches your test to the hypothesis (see the sketch below).
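A minimal sketch (ours, with hypothetical data) of matching the test to the hypothesis; note the `alternative` keyword requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(0.4, 1.0, 40)

# Hypothesis "the effect equals zero" -> two-sided p value.
_, p_two = stats.ttest_ind(treated, control, alternative="two-sided")
# Hypothesis "the effect is greater than zero" -> one-sided p value.
_, p_one = stats.ttest_ind(treated, control, alternative="greater")

print(f"two-sided p = {p_two:.3f}, one-sided p = {p_one:.3f}")
```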

10 P Values Across Studies
Even under ideal conditions, further studies are not likely to produce the same p value
P values are extremely sensitive to small variations in study parameters
Whole ≠ sum of parts

Notes: Since a p value is the probability of obtaining results at least as extreme as those observed, if your study produces a p value of 0.03, then (when the test hypothesis and all other assumptions hold) there is only a 3% chance a future study would obtain a p value that small or smaller. And that's under ideal conditions: p values are very sensitive to violations of model assumptions and to differences in sample size. Similar p values across individual studies may mean something very different when combined: multiple studies with nonsignificant p values, combined using the Fisher formula, may produce a significant p value (see the sketch below).
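A minimal sketch (ours, hypothetical p values): four individually nonsignificant results become jointly significant under Fisher's combination method, so the whole is not the sum of the parts.

```python
from scipy.stats import combine_pvalues

p_values = [0.10, 0.09, 0.12, 0.08]  # hypothetical study results, all > 0.05
stat, p_combined = combine_pvalues(p_values, method="fisher")

print(f"individual p values: {p_values}")
print(f"combined p value:    {p_combined:.4f}")  # well below 0.05
```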

11 P Values Across Populations
P values are very sensitive to differences in sample size
Compare populations, not p values

Notes: Because p values are so sensitive to differences in sample size, two studies can report different p values even when their results are clearly in agreement, and vice versa. P values cannot be compared between populations; only the populations, via their estimated effects, can be compared to each other (see the sketch below).
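A minimal sketch (ours, illustrative numbers): two studies with the identical estimated effect get very different p values purely because of sample size, so the estimates and intervals, not the p values, should be compared.

```python
import numpy as np
from scipy import stats

effect, sd = 0.3, 1.0  # hypothetical identical estimate in both studies
for n in (30, 500):
    se = sd * np.sqrt(2.0 / n)       # std. error of a mean difference
    p = 2 * stats.norm.sf(abs(effect / se))  # two-sided p value
    lo, hi = effect - 1.96 * se, effect + 1.96 * se
    print(f"n={n:>3} per group: estimate={effect}, "
          f"95% CI=({lo:+.2f}, {hi:+.2f}), p={p:.3f}")
```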

12 Confidence Intervals
A range of values so defined that there is a specified probability that the value of a parameter lies within it.

13 What Is Inside and Outside a Confidence Interval?
A confidence interval is the range between two numbers
The true effect is either in the confidence interval or not
Assumptions could be violated, leading to false results; also be careful with "disproved"

Notes: The 95% refers to how often 95% confidence intervals computed from very many studies would contain the true effect size, if all the assumptions used to compute the intervals were correct (see the sketch below). It is the combination of the data with those assumptions that is needed to declare an effect size outside the interval incompatible with the observations.
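A minimal sketch (ours, hypothetical parameters): the "95%" is a long-run coverage frequency across repeated studies under a correct model, not a probability statement about any single computed interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sd, n, n_studies = 10.0, 2.0, 25, 10_000

covered = 0
for _ in range(n_studies):
    sample = rng.normal(true_mean, sd, n)
    m, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    # Does this study's 95% CI contain the true mean?
    if m - t_crit * se <= true_mean <= m + t_crit * se:
        covered += 1

print(f"coverage: {covered / n_studies:.3f}  (about 0.95 when assumptions hold)")
```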

14 How to Compare Confidence Intervals?
Two confidence intervals can overlap even when the difference between the estimates is statistically significant, so the test hypotheses' P values must still be considered
Even under ideal conditions, a future estimate will fall inside the current interval much less than 95% of the time

Notes: When the model is correct, the precision of statistical estimation is measured directly by the width of the confidence interval; precision is not a matter of inclusion or exclusion of the null or any other value. CIs are superior to tests and P values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes compatible with the data (see the sketch below).
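A minimal sketch (ours, hypothetical parameters): simulate how often a replication's point estimate lands inside the first study's 95% CI. Under ideal, equal-sized replications the capture rate is roughly 83%, not 95%.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean, sd, n, n_pairs = 0.0, 1.0, 50, 20_000
se = sd / np.sqrt(n)

first = rng.normal(true_mean, se, n_pairs)   # first-study estimates
second = rng.normal(true_mean, se, n_pairs)  # replication estimates

captured = np.abs(second - first) <= 1.96 * se
print(f"future estimate inside current 95% CI: {captured.mean():.3f}")
```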

15 How Is Statistical Power Used?
The pre-study probability that the test will correctly reject the null hypothesis
Not a p value!
Best used in pre-study planning

Notes: Greenland defines power as the pre-study probability that the test will correctly reject the null hypothesis. Power does not measure the compatibility of the results with the hypothesis, which is to say it is not a p value, and power cannot be calculated from the results. It cannot be compared to a p value as a measure of support for or against a hypothesis (see the sketch below).
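A minimal sketch (ours, with hypothetical planning numbers) of power as a pre-study planning tool, using the standard normal approximation for a two-sample comparison of means: pick the group size that achieves the power you want, before collecting any data.

```python
import numpy as np
from scipy import stats

effect, sd, alpha = 0.5, 1.0, 0.05       # assumed before the study
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value

for n in (20, 50, 100):                  # candidate group sizes
    se = sd * np.sqrt(2.0 / n)
    # Approximate probability of rejecting the null if the effect is real.
    power = stats.norm.sf(z_crit - effect / se)
    print(f"n={n:>3} per group -> power ~ {power:.2f}")
```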

16 Solutions and Guidelines
Offer more interpretation than "the p value is above or below 0.05"
Explain how the results were generated and how the tests were chosen
Be careful about which results best support which hypotheses
Correct statistical evaluation of multiple studies requires a pooled analysis that addresses study biases
Any opinion offered about the probability, likelihood, or certainty of a hypothesis cannot be derived from statistical methods alone
All statistical methods make assumptions

17 Conclusions
Statistical tests are inherently limited
Tests of statistical significance were intended to account for random variability as a source of error and to prevent overinterpretation of data
They have instead evolved to be "ritualistic," used to make broad statements of significance or lack thereof
"The tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision" (Neyman & Pearson)
Transparency and moderation/caution!
"No statistical method is immune to misinterpretation and misuse, but prudent users of statistics will avoid approaches especially prone to serious abuse."

