Traps and pitfalls in medical statistics Arvid Sjölander.


26 April 2015, Arvid Sjölander

Motivating example
• You are involved in a project to find out whether snus causes ulcer.
• A questionnaire is sent out to 300 randomly chosen subjects.
• 200 subjects respond:

                 Ulcer: Yes   Ulcer: No   Risk
    Snus: Yes         2           28      2/30 ≈ 0.07
    Snus: No         17          153      17/170 = 0.1

• We can use the relative risk (RR) to measure the association between snus and ulcer: RR = 0.07/0.1 ≈ 0.7.
• Can we safely conclude that snus prevents ulcer?
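
As a minimal sketch, the relative risk for these data can be computed as below. The non-user counts (17 with ulcer, 153 without) are not legible in the transcript; they are inferred here from the stated risk of 0.1 among 170 non-users and should be treated as an assumption. The unrounded result is about 0.67, which the slides report as 0.7.

```python
# 2x2 table from the motivating example.
# Assumption: the non-user counts (17, 153) are inferred from "/170 = 0.1".
snus_ulcer, snus_no_ulcer = 2, 28
nosnus_ulcer, nosnus_no_ulcer = 17, 153


def risk(cases, non_cases):
    """Proportion of subjects in a group who have the outcome."""
    return cases / (cases + non_cases)


risk_snus = risk(snus_ulcer, snus_no_ulcer)
risk_nosnus = risk(nosnus_ulcer, nosnus_no_ulcer)
relative_risk = risk_snus / risk_nosnus

print(f"risk among snus users: {risk_snus:.2f}")
print(f"risk among non-users:  {risk_nosnus:.2f}")
print(f"relative risk:         {relative_risk:.2f}")
```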

Outline
• Systematic errors
    ◦ Selection bias
    ◦ Confounding
    ◦ Randomization
    ◦ Reverse causation
• Random errors
    ◦ Confidence interval
    ◦ P-value
    ◦ Hypothesis test
    ◦ Significance level
    ◦ Power

One possible explanation
• It is a widespread hypothesis that snus causes ulcer.
• Snus users who develop ulcer may therefore feel somewhat guilty, and may be reluctant to participate in the study.
• Hence, RR<1 may be (partly) explained by an underrepresentation of snus users with ulcer among the responders.
• This is a case of selection bias.

Selection bias
• We only observe the RR among the potential responders.
• The RR among the responders (observed) may not be equal to the population RR (unobserved).
[Diagram labels: Population; Potential non-responders; Potential responders; Sample]
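
The following sketch illustrates how differential non-response alone can produce RR<1. All counts and response probabilities are hypothetical and chosen only for illustration: snus has no effect in this made-up population, but snus users with ulcer respond less often than everyone else.

```python
# Hypothetical population in which snus has no effect on ulcer (true RR = 1).
population = {
    ("snus", "ulcer"): 45,
    ("snus", "no ulcer"): 405,
    ("no snus", "ulcer"): 255,
    ("no snus", "no ulcer"): 2295,
}

# Hypothetical response probabilities: snus users with ulcer respond less often.
response_prob = {
    ("snus", "ulcer"): 0.2,
    ("snus", "no ulcer"): 0.7,
    ("no snus", "ulcer"): 0.7,
    ("no snus", "no ulcer"): 0.7,
}


def relative_risk(counts):
    """RR of ulcer for snus users versus non-users from a dict of cell counts."""
    snus_total = counts[("snus", "ulcer")] + counts[("snus", "no ulcer")]
    nosnus_total = counts[("no snus", "ulcer")] + counts[("no snus", "no ulcer")]
    risk_snus = counts[("snus", "ulcer")] / snus_total
    risk_nosnus = counts[("no snus", "ulcer")] / nosnus_total
    return risk_snus / risk_nosnus


# Expected number of responders in each cell.
responders = {cell: n * response_prob[cell] for cell, n in population.items()}

population_rr = relative_risk(population)
responder_rr = relative_risk(responders)

print(f"population RR: {population_rr:.2f}")
print(f"RR among responders: {responder_rr:.2f}")
```

In this made-up setting the responders show a clearly "protective" association even though the population RR is 1.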

How do we avoid selection bias?
• Make sure that the sample is drawn randomly from the whole population of interest: we must trace the non-responders.
• Send out the questionnaire again, make follow-up phone calls, etc.
[Diagram labels: Population; Potential non-responders; Potential responders; Sample]

Another possible explanation
• Because of age trends, young people use snus more often than old people.
• For biological reasons, young people have a smaller risk of ulcer than old people.
• Hence, RR<1 may be (partly) explained by snus users being in “better shape” than non-users.
• This is a case of confounding, and age is called a confounder.

Confounding
• The RR measures the association between snus and ulcer.
• The association depends on both the causal effect and the influence of age.
• In particular, even in the absence of a causal effect, there will be an (inverse) association between snus and ulcer (RR < 1).
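
A small numerical sketch of this point, using hypothetical counts that are not from the study: within each age group snus users and non-users have the same risk (RR = 1), yet the crude RR is well below 1 because snus users are mostly young and the young have a lower risk.

```python
# Hypothetical counts: (ulcer cases, group size) by age stratum and snus use.
strata = {
    "young": {"snus": (10, 200), "no snus": (10, 200)},  # risk 5% in both
    "old": {"snus": (10, 50), "no snus": (80, 400)},     # risk 20% in both
}


def rr(exposed, unexposed):
    """Relative risk from two (cases, group size) pairs."""
    (a, n1), (c, n0) = exposed, unexposed
    return (a / n1) / (c / n0)


def pooled(group):
    """Add up cases and group sizes over the age strata, ignoring age."""
    cases = sum(strata[age][group][0] for age in strata)
    size = sum(strata[age][group][1] for age in strata)
    return cases, size


stratum_rr = {age: rr(g["snus"], g["no snus"]) for age, g in strata.items()}
crude_rr = rr(pooled("snus"), pooled("no snus"))

print("RR within each age group:", stratum_rr)
print(f"crude RR ignoring age: {crude_rr:.2f}")
```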

How do we avoid confounding?
• At the design stage: randomization, i.e. assigning “snus” and “no snus” by “the flip of a coin”.
    ◦ Pro: reliable; it eliminates the influence of all confounders.
    ◦ Con: expensive and possibly unethical.
• At the analysis stage: adjust (the observed association) for (the influence of) age, e.g. by stratification, matching, or regression modeling.
    ◦ Pro: cheap and ethical.
    ◦ Con: not fully reliable; cannot adjust for unknown or unmeasured confounders.
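
One way to adjust by stratification is the Mantel-Haenszel relative risk, sketched below. The slides do not name a specific method, so this choice is an assumption, and the age-stratified counts are hypothetical. Here the crude RR is about 0.53, while the age-adjusted RR is 1.

```python
# Hypothetical age strata:
# (snus cases, snus total, non-user cases, non-user total)
strata = [
    (10, 200, 10, 200),  # young: risk 5% in both groups
    (10, 50, 80, 400),   # old: risk 20% in both groups
]


def crude_rr(strata):
    """Relative risk after collapsing all strata into one table."""
    a = sum(s[0] for s in strata)
    n1 = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    n0 = sum(s[3] for s in strata)
    return (a / n1) / (c / n0)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary relative risk across strata."""
    numerator = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata)
    denominator = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata)
    return numerator / denominator


crude = crude_rr(strata)
adjusted = mantel_haenszel_rr(strata)

print(f"crude RR:        {crude:.2f}")
print(f"age-adjusted RR: {adjusted:.2f}")
```

The adjustment only works for confounders that were measured and stratified on, which is the limitation the slide points out.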

Yet another explanation
• It is a widespread hypothesis among physicians that snus causes and aggravates ulcer.
• Snus users who suffer from ulcer may therefore be advised by their physicians to quit.
• Hence, RR<1 may be (partly) explained by a tendency among people with ulcer to quit using snus.
• This is a case of reverse causation.

Reverse causation
• Reverse causation can be avoided by randomization.
[Diagram labels: Snus; Ulcer; "?"]

Systematic errors
• Selection bias, confounding, and reverse causation are referred to as systematic errors, or bias.
    ◦ “You don’t measure what you are interested in”.
• How can you tell if your study is biased?
    ◦ You can’t! (At least not from the observed data.)
• It is important to design the study carefully and “think ahead” to avoid bias.
    ◦ What may the reasons be for potential response/non-response?
    ◦ How can we trace the non-responders?
    ◦ Which are the possible confounders?
    ◦ Do we need to randomize the study? Would randomization be ethical and practically possible?

Example cont’d
• Assume that we believe that the study is unbiased (no selection bias, no confounding and no reverse causation).
• Can we safely conclude that snus prevents ulcer?

                 Ulcer: Yes   Ulcer: No   Risk
    Snus: Yes         2           28      2/30 ≈ 0.07
    Snus: No         17          153      17/170 = 0.1

Random errors
• True RR = observed RR?
• True RR ≠ observed RR!
[Diagram labels: Population (true RR); Sample (observed RR = 0.7)]

Confidence interval
• Where can we expect the true RR to be?
• The 95% Confidence Interval (CI) answers this question.
    ◦ It is a range of plausible values for the true RR.
• Example: RR=0.7, 95% CI: (0.5,0.9).
• The narrower the CI, the less uncertainty about the true RR.
• The width of the CI depends on the sample size: the larger the sample, the narrower the CI.
• How do we compute a CI? Ask a statistician!

CI for our data
• RR=0.7, 95% CI: (0.16,2.74).
• Conclusion?
[Table: the same 2×2 snus/ulcer data as in the motivating example]
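
The slides do not say how this interval was obtained. As a sketch, the usual large-sample (Wald) interval on the log scale gives approximately (0.16, 2.74) for these data, assuming the non-user counts are 17 and 153 (inferred from the stated risk of 0.1 among 170 non-users).

```python
import math


def rr_confidence_interval(a, b, c, d, z=1.96):
    """Wald CI for the relative risk, computed on the log scale.

    a, b: exposed subjects with / without the outcome
    c, d: unexposed subjects with / without the outcome
    z:    normal quantile (1.96 for a 95% interval)
    """
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Assumption: non-user counts 17 and 153 are inferred, not read from the slide.
rr, lower, upper = rr_confidence_interval(2, 28, 17, 153)
print(f"RR = {rr:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```

Since the interval includes 1 and extends well on both sides of it, the data are compatible with snus being protective, harmful, or neither.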

P-value
• Often, we specifically want to know whether the true RR is equal to 1 (no association between snus and ulcer).
• The hypothesis that the true RR = 1 is called the “null hypothesis”, H0.
• The p-value (p) is an objective measure of the strength of evidence in the observed data against H0.
    ◦ 0 < p < 1.
    ◦ The smaller the p-value, the stronger the evidence against H0.
• How do we compute p? Ask a statistician?

Factors that determine the p-value
• What do you think p depends on?
    ◦ The sample size: the larger the sample, the smaller p.
    ◦ The magnitude of the observed association: the stronger the association, the smaller p.
• A common mistake: “The p-value is low, but the sample size is small so we cannot trust the results”.
    ◦ Yes you can!
    ◦ The p-value takes the sample size into account. Once the p-value is computed, the sample size carries no further information.
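
Both factors can be illustrated with a simple sketch. The Wald test on log(RR) used here is an assumption (the slides do not name a test), and the scaled-up tables are hypothetical: the first comparison keeps the observed proportions fixed while multiplying the sample size, the second keeps the group sizes fixed while making the association stronger.

```python
import math


def wald_p_value(a, n1, c, n0):
    """Two-sided p-value for H0: RR = 1, from a Wald test on log(RR).

    a, n1: cases and group size among the exposed
    c, n0: cases and group size among the unexposed
    """
    log_rr = math.log((a / n1) / (c / n0))
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return math.erfc(abs(log_rr / se) / math.sqrt(2))


# Same proportions (RR about 0.67), sample size multiplied by k.
p_by_scale = {k: wald_p_value(2 * k, 30 * k, 17 * k, 170 * k)
              for k in (1, 5, 10, 20)}

# Same group sizes, fewer cases among snus users (stronger association).
p_by_cases = {cases: wald_p_value(cases, 300, 170, 1700)
              for cases in (30, 20, 10)}

for k, p in p_by_scale.items():
    print(f"sample size x{k:<2}: p = {p:.3f}")
for cases, p in p_by_cases.items():
    print(f"{cases} cases among 300 snus users: p = {p:.4f}")
```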

P-value for our data
• P = 0.81
• Conclusion?
[Table: the same 2×2 snus/ulcer data as in the motivating example]
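
The slides do not state which test produced this p-value. As a sketch under that caveat, a chi-square test with Yates' continuity correction gives a p-value of about 0.81 for these data, assuming the non-user counts are 17 and 153 (inferred from the stated risk of 0.1 among 170 non-users). Other tests give different values for the same table.

```python
import math


def yates_chi2_p_value(a, b, c, d):
    """Chi-square test with Yates' continuity correction for a 2x2 table.

    a, b: exposed subjects with / without the outcome
    c, d: unexposed subjects with / without the outcome
    Returns the test statistic and the p-value (1 degree of freedom).
    """
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Assumption: non-user counts 17 and 153 are inferred, not read from the slide.
chi2, p = yates_chi2_p_value(2, 28, 17, 153)
print(f"chi-square = {chi2:.3f}, p = {p:.2f}")
```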

Making a decision
• The p-value is an objective measure of the strength of evidence against H0.
    ◦ The smaller the p-value, the stronger the evidence against H0.
• Sometimes, we have to make a formal decision on whether or not to reject H0.
• This decision process is formally called hypothesis testing.
• We reject H0 when the evidence against H0 is “strong enough”,
    ◦ i.e. when the p-value is “small enough”.

Significance level
• The rejection threshold is called the significance level.
    ◦ E.g. “5% significance level” means that we have decided to reject H0 if p<0.05.
• Using a low significance level means that we require strong evidence against H0 for rejection.
• Using a high significance level means that we are satisfied with weak evidence against H0 for rejection.
• What is the advantage of using a low significance level? What about a high significance level?

A parallel to the courtroom
• H0 = the accused is innocent.
• p-value = the strength of evidence against H0.
• Low significance level = need strong evidence to sentence to jail.
    ◦ Few innocent in jail, but many guilty go free.
• High significance level = weak evidence is sufficient to sentence to jail.
    ◦ Many guilty in jail, but many innocent in jail as well.

Type I and type II errors

                      H0 is false                      H0 is true
    Reject H0         OK                               Type I error (false positive)
    Don’t reject H0   Type II error (false negative)   OK

• There is always a trade-off between the risk for type I errors and the risk for type II errors.
    ◦ Low significance level (difficult to reject H0) → small risk for type I errors, but large risk for type II errors.
    ◦ High significance level (easy to reject H0) → small risk for type II errors, but large risk for type I errors.
• By convention, we use the 5% significance level (reject H0 if p<0.05).

Relation between significance level and type I errors

                      H0 is false                      H0 is true
    Reject H0         OK                               Type I error (false positive) [risk = sig level]
    Don’t reject H0   Type II error (false negative)   OK

• In fact, the significance level = the risk for type I errors.
• If we follow the convention and use the 5% significance level (reject H0 if p<0.05) then we have a 5% risk of type I errors.
• What does this mean, more concretely?
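
One concrete reading: if H0 is true and the same study is repeated many times, about 5% of the repetitions will reject H0 at the 5% level. The simulation below is a sketch of that idea; the risk of 0.1, the group size of 200, the number of repetitions, and the two-proportion z-test are all assumptions chosen for illustration.

```python
import math
import random


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation in the data, so no evidence against H0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_rejection_rate(risk1, risk2, n_per_group, n_studies, alpha, seed):
    """Fraction of simulated studies in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        x1 = sum(rng.random() < risk1 for _ in range(n_per_group))
        x2 = sum(rng.random() < risk2 for _ in range(n_per_group))
        if two_proportion_p_value(x1, n_per_group, x2, n_per_group) < alpha:
            rejections += 1
    return rejections / n_studies


# H0 is true here: both groups have the same risk, so every rejection
# is a type I error.
type_one_rate = simulate_rejection_rate(0.1, 0.1, 200, 2000, 0.05, seed=1)
print(f"share of studies rejecting a true H0: {type_one_rate:.3f}")
```

The printed share should be close to 0.05, up to simulation noise and the discreteness of the counts.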

Power

                      H0 is false                      H0 is true
    Reject H0         OK [power]                       Type I error (false positive) [risk = sig level]
    Don’t reject H0   Type II error (false negative)   OK

• Power = the chance of being able to reject H0, when H0 is false.
• Relation between significance level and power:
    ◦ High significance level (easy to reject H0) → high power.
    ◦ Low significance level (difficult to reject H0) → low power.
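
The relation between significance level and power can be sketched with a standard normal approximation for comparing two proportions. The risks (0.10 versus 0.15) and the group size of 500 are hypothetical, and the formula ignores the negligible chance of rejecting in the wrong direction.

```python
from statistics import NormalDist


def approx_power(p1, p2, n_per_group, alpha):
    """Approximate power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error of the risk difference under H0 and under the alternative.
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    return norm.cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)


# Hypothetical scenario: true risks 0.10 and 0.15, 500 subjects per group.
power_by_alpha = {alpha: approx_power(0.10, 0.15, 500, alpha)
                  for alpha in (0.01, 0.05, 0.10)}

for alpha, power in power_by_alpha.items():
    print(f"significance level {alpha:.2f}: power = {power:.2f}")
```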

Power calculations
• It is important to determine the power of the study before data is collected.
• Low power means that we will probably not find what we are looking for.
• Direct calculation of the power is beyond the scope of this course. Ask a statistician!

Power calculations, cont’d
• Heuristically, the power of the study is determined by three factors:
    ◦ The significance level: a higher significance level gives higher power.
    ◦ The true RR: a stronger association gives higher power.
    ◦ The sample size: a larger sample gives higher power.
• Typically, we want to have a power of at least 80%.
• In practice, the significance level is fixed at 5%.
• We also typically have an idea of which deviations from H0 are scientifically relevant to detect (e.g. RR > 1.5).
• We then determine the sample size that we need to have the desired power.
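
A sketch of the last step, using a common normal-approximation formula for comparing two proportions with equal group sizes. The baseline risk of 0.10 is an assumption; together with RR = 1.5 it implies a risk of 0.15 in the exposed group. A real study would need a calculation matched to its actual design and analysis.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided z-test
    comparing two proportions, with equal group sizes."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    term_null = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil((term_null + term_alt) ** 2 / (p1 - p2) ** 2)


# Assumed baseline risk 0.10; RR = 1.5 then gives a risk of 0.15.
baseline_risk = 0.10
target_rr = 1.5
n_needed = sample_size_per_group(baseline_risk, baseline_risk * target_rr)
print(f"subjects needed per group: {n_needed}")

# A weaker association (RR = 1.2) needs a much larger sample.
n_weaker = sample_size_per_group(baseline_risk, baseline_risk * 1.2)
print(f"subjects needed per group for RR = 1.2: {n_weaker}")
```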

Systematic vs random errors
• There are two qualitative differences between systematic and random errors.
• #1
    ◦ Data can tell us if an observed association is possibly due to random errors: check the p-value.
    ◦ Data can never tell us if an observed association is due to systematic errors.
• #2
    ◦ Uncertainty due to random errors can be reduced by increasing the sample size → narrower confidence intervals.
    ◦ Systematic errors result from a poor study design, and cannot be reduced by increasing the sample size.

Summary
• In medical research, we are often interested in the causal effect of one variable on another.
• An observed association between two variables does not necessarily imply that one causes the other.
• Always be aware of the following pitfalls:
    ◦ Selection bias
    ◦ Confounding
    ◦ Reverse causation
    ◦ Random errors