Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations Greenland et al (2016)

Overview of Paper
- Statistical tests account for randomness
- Often misused and misinterpreted
- Seeks to clear up confusion regarding the use of: statistical tests, P values, confidence intervals, statistical power

Random variation is a source of error in scientific experiments, and statistical tests help to account or control for that variation. Clear and concise interpretations of these tests are often not available. The paper provides 25 examples of common mistakes and misinterpretations. We assume you've read all the examples, so we won't go over them individually; instead, we've grouped them by theme.

Recurring themes
- All model assumptions must be true
- If you're wrong, you're wrong
- Bias in reporting and publishing
- Other explanations may exist

These are errors that pop up, in one variation or another, throughout the paper. Any result from a statistical test comes with the caveat that all model assumptions are true; if any assumption is violated, this can lead to too-large or too-small p values, or to conflicting results across multiple studies. If you reject the null hypothesis because your p value is less than 0.05, you do not have a 5% chance of being wrong: if you falsely reject the null hypothesis, you are 100% wrong, no matter what your p value is. This holds true for confidence intervals and power as well. Under a different model, the same data may generate a different p value; the p value you see in publication is the one the author chose to publish or the editor chose to print, and without transparency you cannot know how the researcher arrived at her results. Even if a p value shows support for a particular hypothesis, another, untested hypothesis may be an even better fit, or an entirely different model may be a better fit.

A Few Definitions
- Statistical model: a mathematical representation of data variability
- Test hypothesis: the hypothesis targeted by the test
- P value: a continuous measure of compatibility between data and hypothesis, under the given model
- More definitions to come...

Statistical model: includes assumptions, such as random sampling, that may or may not be met or realistic; often presented in abstract form, or not presented at all. Test hypothesis: often, but not always, the null hypothesis; may also specify an effect of a certain size. P value: "model" here includes all assumptions of the model; it ranges from 0 (no compatibility) to 1 (complete compatibility), and generally some arbitrary cutoff (alpha) determines which values are called significant or nonsignificant.
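In symbols (our notation, not the paper's): writing T for the test statistic, t_obs for its observed value, H for the test hypothesis, and A for the full set of model assumptions, the p value is the tail probability

```latex
% p value as a tail probability, computed as if H and A both hold
p \;=\; \Pr\!\left[\, T \ge t_{\mathrm{obs}} \;\middle|\; H,\, A \,\right]
```

Every assumption in A sits inside the conditioning, which is why a small p can signal a problem with any assumption, not just with H.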

What probability does P really represent?
- The P value simply indicates the degree to which the data conform to the pattern predicted by the null hypothesis
- The P value is computed from a set of assumptions, so it cannot itself refer to the probability of those same assumptions
- The probability is computed assuming chance alone was operating, that is, as if the test hypothesis and every other assumption used to compute the P value were correct; the computation proceeds as if all of those assumptions hold, including the null hypothesis
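A minimal simulation sketch of this point (ours, not the paper's): when the null hypothesis and every model assumption hold exactly, p values are uniform on [0, 1], so p tells you how unusual the data are under those assumptions, not how probable the assumptions are.

```python
# Sketch: simulate p values when the null hypothesis and all model
# assumptions are exactly true. The p values come out uniform on [0, 1];
# by construction about 5% fall below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = np.array([
    stats.ttest_1samp(rng.normal(loc=0.0, scale=1.0, size=30), 0.0).pvalue
    for _ in range(10_000)
])
print(f"share of p < 0.05 under a true null: {np.mean(pvals < 0.05):.3f}")
```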

What does the size of the p value mean?
- A significant p value (p ≤ 0.05) does not mean the test hypothesis should be rejected: it only flags that the data are unusual under the test hypothesis
- A nonsignificant p value (p > 0.05) does not mean the test hypothesis should be accepted: it only indicates that the data are not unusual under the test hypothesis
- Any p value < 1 indicates that some other hypothesis may be a better fit: the data may be less unusual under a hypothesis that was never tested (see the sketch below)
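A small illustration of the last bullet (our sketch, not from the paper): for fixed data, the one-sample t test's p value equals 1 exactly when the tested mean equals the observed mean, so any p below 1 means some untested value of the mean fits the data better than the one tested.

```python
# Sketch: p is maximized (p = 1) at the observed mean and shrinks as the
# tested value moves away from it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=50)
for mu0 in (0.0, x.mean(), 1.0):
    print(f"H: mu = {mu0:+.3f} -> p = {stats.ttest_1samp(x, mu0).pvalue:.3f}")
```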

Statistical significance and effect size
- Nonsignificance is not the same as no effect
- Significance is not the same as importance
- Effects may be lost in statistical noise
- The p value, or significance level, is not the same thing as effect size

Any p value less than 1 indicates that at least some effect was present, but that effect may not be very large. A very small p value indicates strong support for some effect, but gives no indication of how large the effect is. Additionally, not all effects will be detected by statistical tests: in a small study, even very large effects can be lost in noise (see the sketch below). Statistical significance (alpha) is generally set before the study is conducted, commonly at 0.05, and represents how likely you are to erroneously reject the null hypothesis.
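A sketch of both failure modes (the effect sizes and sample sizes are illustrative assumptions, not from the paper): a trivial effect turns "significant" with a huge sample, while a large effect can fail to reach significance in a tiny one.

```python
# Sketch: significance tracks sample size as much as effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
tiny_effect_big_n = rng.normal(loc=0.02, scale=1.0, size=200_000)  # trivial effect
big_effect_small_n = rng.normal(loc=0.8, scale=1.0, size=8)        # large effect

# Huge n makes the trivial effect "highly significant".
print(f"p (n=200000, true mean 0.02): {stats.ttest_1samp(tiny_effect_big_n, 0.0).pvalue:.2e}")
# Tiny n often leaves the large effect "nonsignificant".
print(f"p (n=8, true mean 0.80): {stats.ttest_1samp(big_effect_small_n, 0.0).pvalue:.3f}")
```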

Equality and inequality
- Precision and transparency when reporting p values
- P values refer to extremity

Precision allows a reader to accurately interpret one's results: saying a p value is equal to 0.05 is not the same thing as saying a p value is less than or equal to 0.05, and using an inequality conceals the true p value. P values are a kind of inequality themselves: they represent the probability of observing the results observed and results *more extreme* than the results observed, not the probability of observing *only* the results observed.
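The "extremity" reading in symbols (a sketch for the two-sided case, in our notation):

```latex
% two-sided p value: probability of results at least as extreme as observed
p \;=\; \Pr_{H_0}\!\left(\, |T| \ge |t_{\mathrm{obs}}| \,\right)
```

The event is |T| ≥ |t_obs|, not |T| = |t_obs|: the p value already bundles in everything more extreme than what was seen, which is why reporting "p ≤ 0.05" instead of the exact value discards information.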

What are you actually testing?
- Statistical significance is a property of the test result
- Match the test to the hypothesis

Statistical significance is not an inherent property of the thing being studied, so you cannot "find evidence of" significance; rather, significance is a property of the results you got from your statistical test. Your p value can be significant (or not), but your effect cannot be. If your test hypothesis is that the measured effect equals a certain value, it is appropriate to use a two-sided p value; if your hypothesis is that the effect is greater than a certain value, it is appropriate to use a one-sided p value. That is what it means to match the test to the hypothesis (see the sketch below).
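A minimal sketch of the one-sided/two-sided distinction using a normal approximation (the z value is an illustrative assumption):

```python
# Sketch: one- and two-sided p values from the same z statistic.
from scipy.stats import norm

z_obs = 1.8
p_two_sided = 2 * norm.sf(abs(z_obs))  # H: effect equals the null value
p_one_sided = norm.sf(z_obs)           # H: effect is at most the null value
print(f"two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
```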

P values across studies
- Even under ideal conditions, further studies are not likely to reproduce the same p value
- P values are extremely sensitive to small variations in study parameters
- Whole ≠ sum of parts

Since a p value is the probability of obtaining results at least as extreme as those observed, if your study produces a p value of 0.03, there is only a 3% chance that a future study would obtain a p value that small or smaller, and that is under ideal conditions. P values are very sensitive to violations of model assumptions and to differences in sample size. Similar p values across individual studies may also mean something very different when combined: multiple studies with nonsignificant p values, combined using the Fisher formula, may produce a significant p value (see the sketch below).
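For the Fisher formula mentioned above, a sketch using scipy's implementation (the specific p values are illustrative): three individually nonsignificant p values combine into a significant one.

```python
# Fisher's method: X^2 = -2 * sum(ln p_i) is chi-squared with 2k degrees of
# freedom when all k null hypotheses (and model assumptions) are true.
from scipy.stats import combine_pvalues

stat, p_combined = combine_pvalues([0.10, 0.08, 0.12], method="fisher")
print(f"chi2 = {stat:.2f}, combined p = {p_combined:.3f}")  # ~0.031, below 0.05
```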

P values across populations
- P values are very sensitive to differences in sample size
- Compare populations, not p values

Because p values are so sensitive to differences in sample size, two studies can produce different p values even when their results are clearly in agreement, and vice versa (see the sketch below). P values cannot be meaningfully compared between populations; only the populations themselves can be compared to each other.
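A sketch of that sensitivity (the group statistics are illustrative assumptions): two comparisons with identical estimated effects and spreads, differing only in sample size, give very different p values.

```python
# Same observed difference (0.5) and SD (1.0); only the sample size changes.
from scipy.stats import ttest_ind_from_stats

for n in (20, 200):
    res = ttest_ind_from_stats(mean1=0.5, std1=1.0, nobs1=n,
                               mean2=0.0, std2=1.0, nobs2=n)
    print(f"n = {n:3d} per group -> p = {res.pvalue:.4f}")
# n=20 looks "nonsignificant", n=200 "highly significant", same estimate.
```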

Confidence intervals
"A range of values so defined that there is a specified probability that the value of a parameter lies within it."
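For a concrete special case (a textbook sketch, not the paper's notation): a confidence interval for a mean with known standard deviation σ and sample size n is

```latex
\bar{x} \;\pm\; z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} ,
\qquad\text{with } z_{0.975} \approx 1.96 \text{ for } \alpha = 0.05 .
```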

What is inside and outside a Confidence Interval?
- A confidence interval is the range between two numbers
- The true effect is either in the confidence interval or not
- Assumptions could be violated, leading to false results; also be careful with "disproved"

The 95% refers to how often 95% confidence intervals computed from very many studies would contain the true effect size, if all the assumptions used to compute the intervals were correct (see the sketch below). The combination of the data with those assumptions is what is needed to declare an effect size outside the interval incompatible with the observations; "incompatible" is not the same as "disproved."
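A coverage simulation makes the frequency reading concrete (our sketch; the true mean, SD, and n are illustrative assumptions):

```python
# Sketch: the "95%" is a property of the procedure across repeated studies,
# not of any single interval. With all assumptions holding, about 95% of
# intervals computed this way cover the true mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, n, covered = 10.0, 25, 0
for _ in range(10_000):
    x = rng.normal(loc=true_mean, scale=2.0, size=n)
    half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half) <= true_mean <= (x.mean() + half)
print(f"coverage: {covered / 10_000:.3f}")  # ~0.95
```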

How to compare Confidence Intervals?
- Intervals can overlap, but the P value for the test hypothesis must still be considered (see the sketch below)
- Even under ideal conditions, a future estimate will fall inside the current interval much less than 95% of the time

When the model is correct, the precision of statistical estimation is measured directly by the width of the confidence interval; it is not a matter of inclusion or exclusion of the null or any other value. CIs are superior to tests and P values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes compatible with the data.
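A sketch of why overlap is not a test (the means and standard errors are illustrative assumptions): these two 95% intervals overlap, yet the z test on their difference is significant at 0.05.

```python
# Overlapping 95% CIs do not imply a nonsignificant difference.
import numpy as np
from scipy.stats import norm

m1, se1 = 0.0, 1.0   # estimate and standard error, group 1
m2, se2 = 3.0, 1.0   # estimate and standard error, group 2
for label, m, se in (("group 1", m1, se1), ("group 2", m2, se2)):
    print(f"{label} 95% CI: ({m - 1.96*se:.2f}, {m + 1.96*se:.2f})")

z = (m2 - m1) / np.sqrt(se1**2 + se2**2)                   # z ~= 2.12
print(f"p for the difference: {2 * norm.sf(abs(z)):.3f}")  # ~0.034
```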

How is statistical power used?
- The pre-study probability that the test will correctly reject the null hypothesis
- Not a p value!
- Best used in pre-study planning

Greenland defines power as… Power does not measure the compatibility of the results with the hypothesis, which is to say it is not a p value, and power cannot be calculated from the results. It cannot be compared to a p value as a measure of support for or against a hypothesis. A simulation sketch follows.
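A pre-study power calculation by simulation (a sketch; the assumed effect size, sample size, and alpha are illustrative, not from the paper):

```python
# Sketch: power as the probability that a two-sided t test at alpha = 0.05
# rejects the null, given an assumed true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
effect, n, alpha, reps = 0.5, 40, 0.05, 5_000
rejections = 0
for _ in range(reps):
    a = rng.normal(loc=effect, scale=1.0, size=n)
    b = rng.normal(loc=0.0, scale=1.0, size=n)
    rejections += stats.ttest_ind(a, b).pvalue < alpha
print(f"estimated power: {rejections / reps:.3f}")  # ~0.60 for these settings
```

Note that only assumed quantities enter the calculation; nothing about the eventual data does, which is why power belongs to planning rather than interpretation.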

Solutions and Guidelines
- Offer more interpretation than "the P value is above or below 0.05"
- Explain how the results were generated and why the tests were chosen
- Be careful about which results best support which hypotheses
- Correct statistical evaluation of multiple studies requires a pooled analysis that addresses study biases
- Any opinion offered about the probability, likelihood, or certainty of a hypothesis cannot be derived from statistical methods alone
- All statistical methods make assumptions

Conclusions
- Statistical tests are inherently limited
- Tests of statistical significance were intended to account for random variability as a source of error and to prevent overinterpretation of data
- They have evolved to be "ritualistic," used to make broad statements of significance or the lack thereof
- "The tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision" (Neyman and Pearson)
- Transparency, and moderation/caution!
- "No statistical method is immune to misinterpretation and misuse, but prudent users of statistics will avoid approaches especially prone to serious abuse."