Hypothesis Testing: Cautions

Presentation transcript:

Hypothesis Testing: Cautions STAT 250 Dr. Kari Lock Morgan SECTION 4.3, 4.5 Errors (4.3) Multiple testing (4.5) Replication

Intervals and Tests: Confidence intervals are most useful when you want to estimate population parameters. Hypothesis tests and p-values are most useful when you want to test hypotheses about population parameters. Confidence intervals give you a range of plausible values; p-values quantify the strength of evidence against the null hypothesis.

Interval, Test, or Neither? Is the following question best assessed using a confidence interval, a hypothesis test, or is statistical inference not relevant? How much do college students sleep, on average? (Confidence interval / Hypothesis test / Statistical inference not relevant)

Interval, Test, or Neither? Is the following question best assessed using a confidence interval, a hypothesis test, or is statistical inference not relevant? Do college students sleep more than the recommended 8 hours a night, on average? (Confidence interval / Hypothesis test / Statistical inference not relevant)

Interval, Test, or Neither? Is the following question best assessed using a confidence interval, a hypothesis test, or is statistical inference not relevant? What proportion of college students in the sleep study sample slept at least 8 hours? (Confidence interval / Hypothesis test / Statistical inference not relevant)

Reproducibility Crisis

Reproducibility Crisis
"Study: half of the studies you read about in the news are wrong" (Vox, 3/3/2017)
"Poor replication validity of biomedical association studies reported by newspapers" (PLOS One, 2/21/2017)
"The fickle p-value generates irreproducible results" (Nature, 2/26/2015)
"Why most published research findings are false" (PLOS Medicine, 8/30/2005)

Question of the Day: Does choice of mate improve offspring fitness (in fruit flies)?

Mate Choice and Offspring: What effect (if any) do you think freedom to choose a mate has on offspring fitness? (Improves it / Worsens it / Does not affect it)

Original Study: p-value < 0.01. Controversial: it went against conventional wisdom. Researchers at Penn State tried to replicate the results… (Partridge, L. "Mate choice increases a component of offspring fitness in fruit flies." Nature, 283: 290-291, 1/17/80.)

Fruit Fly Mate Choice Experiment: Took 600 female fruit flies and randomly divided them into two groups: 300 were put in a cage with 900 males (mate choice), and 300 were placed in individual vials with only one male each (no mate choice). After mating, females were separated from the males and put in egg-laying chambers. 200 larvae from each chamber were taken and placed in a cage with 200 mutant flies (for competition). This was repeated 10 times/day for 5 days (50 runs). Schaeffer, S.W., Brown, C.J., Anderson, W.W. (1984). "Does mate choice affect fitness?" Genetics, 107: s94. (Conducted at PSU by Dr. Steve Schaeffer in Biology.)

Mate Choice and Offspring Survival: 6,067 of the 10,000 mate choice larvae survived, and 5,976 of the 10,000 no mate choice larvae survived. p-value: 0.102
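
These survival counts are enough to sketch how such a p-value can be approximated with a randomization test for a difference in proportions, in the spirit of the course's randomization distributions. The sketch below uses assumed choices (NumPy, 5,000 shuffles, a fixed seed), so its p-value will only be in the neighborhood of the value reported above, not identical to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts from the slide: 6,067 of 10,000 mate-choice larvae survived,
# versus 5,976 of 10,000 no-choice larvae.
survived = np.concatenate([np.ones(6067), np.zeros(3933),    # mate choice group
                           np.ones(5976), np.zeros(4024)])   # no choice group
group = np.repeat(["choice", "none"], 10000)

def diff_in_proportions(labels):
    return survived[labels == "choice"].mean() - survived[labels == "none"].mean()

obs_diff = diff_in_proportions(group)

# Randomization distribution: shuffle the group labels, assuming H0 (no effect)
diffs = np.array([diff_in_proportions(rng.permutation(group)) for _ in range(5000)])

# One-sided p-value: proportion of shuffles at least as extreme as what was observed
p_value = np.mean(diffs >= obs_diff)
print(f"observed difference = {obs_diff:.4f}, randomization p-value ~ {p_value:.3f}")
```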

Mate Choice and Offspring Survival: Another possibility: consider each run of the experiment a case, rather than each fly. Paired data, so look at the difference for each pair. p-value = 0.21

Errors
Errors can happen! There are four possibilities:
                   Decision: Reject H0    Decision: Do not reject H0
Truth: H0 true     TYPE I ERROR           (correct)
Truth: H0 false    (correct)              TYPE II ERROR
A Type I Error is rejecting a true null (false positive). A Type II Error is not rejecting a false null (false negative).

Mate Choice and Offspring Fitness
Option #1: The original study (p-value < 0.01) made a Type I error, and H0 is really true.
Option #2: The second study (p-value = 0.102 or 0.21) made a Type II error, and Ha is really true.
Option #3: No errors were made; different experimental settings yielded different results. Same species of fruit fly, same type of mutant, same design. Possible difference: the original study had flies that had been in the lab for longer, so they were more likely to be at genetic equilibrium.
[Note: Dr. Schaeffer suspects Option #1, saying the original study is an outlier among studies of this kind]

Analogy to Law: A person is innocent until proven guilty, and evidence must be beyond the shadow of a doubt. Types of mistakes in a verdict? Convicting an innocent person (analogous to a Type I error) or releasing a guilty person (analogous to a Type II error).

Probability of Type I Error. Distribution of statistics, assuming H0 true: if the null hypothesis is true, 5% of statistics will be in the most extreme 5%; 5% of statistics will give p-values less than 0.05; 5% of statistics will lead to rejecting H0 at α = 0.05. So if α = 0.05, there is a 5% chance of a Type I error.

Probability of Type I Error. Distribution of statistics, assuming H0 true: if the null hypothesis is true, 1% of statistics will be in the most extreme 1%; 1% of statistics will give p-values less than 0.01; 1% of statistics will lead to rejecting H0 at α = 0.01. So if α = 0.01, there is a 1% chance of a Type I error.

Probability of Type I Error The probability of making a Type I error (rejecting a true null) is the significance level, α
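
A small simulation (not from the slides) can make this concrete: repeatedly draw samples from a population where H0 really is true, test H0 each time, and see how often it gets rejected. The one-sample t-test, normal population, sample size, and seed below are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_tests, n = 0.05, 10_000, 50

# Simulate samples where the null hypothesis (mu = 0) really is true,
# test H0: mu = 0 each time, and count the false rejections (Type I errors).
false_rejections = 0
for _ in range(n_tests):
    sample = rng.normal(loc=0, scale=1, size=n)             # H0 is true here
    p_value = stats.ttest_1samp(sample, popmean=0).pvalue
    false_rejections += p_value < alpha

print(f"Type I error rate ~ {false_rejections / n_tests:.3f} (should be close to {alpha})")
```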

Probability of Type II Error: How can we reduce the probability of making a Type II Error (not rejecting a false null)? (Decrease the sample size / Increase the sample size)

Larger sample size makes it easier to reject the null. Example: H0: p = 0.5, Ha: p > 0.5, n = 100. So, increase n to decrease the chance of a Type II error.
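
To see the effect of n directly, here is a rough simulation for the H0: p = 0.5 vs Ha: p > 0.5 setting above. The true proportion of 0.55 is an assumed alternative, and the test used is SciPy's binomial test; both are illustrative choices, not part of the original slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, true_p = 0.05, 0.55   # true_p is an assumed alternative, just for illustration

def estimated_power(n, reps=2000):
    """Fraction of simulated samples that reject H0: p = 0.5 against Ha: p > 0.5."""
    rejections = 0
    for _ in range(reps):
        successes = rng.binomial(n, true_p)
        p_value = stats.binomtest(successes, n, 0.5, alternative="greater").pvalue
        rejections += p_value < alpha
    return rejections / reps

# Power rises (and the Type II error rate falls) as the sample size grows
for n in (100, 400, 1600):
    print(f"n = {n:>4}: power ~ {estimated_power(n):.2f}")
```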

Probability of Type II Error: How can we reduce the probability of making a Type II Error (not rejecting a false null)? (Decrease the significance level / Increase the significance level)

Significance Level and Errors
Reject H0: could be making a Type I error if H0 is true; α is the chance of a Type I error.
Do not reject H0: could be making a Type II error if Ha is true; α is related to the chance of making a Type II error.
Decrease α if a Type I error is very bad. Increase α if a Type II error is very bad.

Multiple Testing: Because the chance of a Type I error is α… α of all tests with true null hypotheses will yield significant results just by chance. If 100 tests are done with α = 0.05 and nothing is really going on, 5% of them will yield significant results, just by chance. This is known as the problem of multiple testing.

Multiple Testing Consider a topic that is being investigated by research teams all over the world Using α = 0.05, 5% of teams are going to find something significant, even if the null hypothesis is true

Multiple Testing Consider a research team/company doing many hypothesis tests Using α = 0.05, 5% of tests are going to be significant, even if the null hypotheses are all true
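
One way to quantify the problem (a standard calculation, not taken from the slides): if every null hypothesis is true and the tests are independent, the chance of at least one false positive among k tests is 1 - (1 - α)^k, which grows quickly with k.

```python
# Chance of at least one Type I error among k independent tests
# when every null hypothesis is true and alpha = 0.05.
alpha = 0.05
for k in (1, 10, 20, 50, 100):
    print(f"{k:>3} tests: P(at least one false positive) = {1 - (1 - alpha) ** k:.3f}")
```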

Mate Choice and Offspring Fitness
The experiment actually comprised 50 smaller experiments. What if we had calculated the p-value for each run? The 50 p-values:
0.9570 0.8498 0.1376 0.5407 0.7640 0.9845 0.3334 0.8437 0.2080 0.8912
0.8879 0.6615 0.6695 0.8764 1.0000 0.0064 0.9982 0.7671 0.9512 0.2730
0.5812 0.1088 0.0181 0.0013 0.6242 0.0131 0.7882 0.0777 0.9641 0.0001
0.8851 0.1280 0.3421 0.1805 0.1121 0.6562 0.0133 0.3082 0.6923 0.1925
0.4207 0.0607 0.3059 0.2383 0.2391 0.1584 0.1735 0.0319 0.0171 0.1082
What if we just reported the run that yielded a p-value of 0.0001? Is that ethical?

Publication Bias: Publication bias refers to the fact that usually only the significant results get published. The one study that turns out significant gets published, and no one knows about all the insignificant results (also known as the file drawer problem). This, combined with the problem of multiple testing, can yield very misleading results.

Jelly Beans Cause Acne! http://xkcd.com/882/ Consider having your students act this out in class, each reading aloud a different part. It's very fun!

Multiple Testing and Publication Bias α of all tests with true null hypotheses will yield significant results just by chance. The one that happens to be significant is the one that gets published. THIS SHOULD SCARE YOU.

Clinical Trials
Preclinical (animal studies)
Phase 0: Study pharmacodynamics and pharmacokinetics
Phase 1: Screening for safety
Phase 2: Placebo trials to establish efficacy
Phase 3: Trials against standard treatment and to confirm efficacy
Only then does a drug go to market…

What Can You Do? Point #1: Errors (Type I and II) are possible. Point #2: Multiple testing and publication bias are a huge problem. Is it all hopeless? What can you do? Recognize when a claim is one of many tests, adjust for multiple tests (e.g. Bonferroni; see the sketch below), and look for replication of results…
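
As a concrete (and entirely illustrative) sketch of a Bonferroni adjustment, the 50 per-run p-values listed earlier can be compared against α/m rather than α, where m is the number of tests. This is one simple way to adjust, not the only one, and the course does not prescribe this particular implementation.

```python
# The 50 per-run p-values reported on the earlier slide
p_values = [
    0.9570, 0.8498, 0.1376, 0.5407, 0.7640, 0.9845, 0.3334, 0.8437, 0.2080, 0.8912,
    0.8879, 0.6615, 0.6695, 0.8764, 1.0000, 0.0064, 0.9982, 0.7671, 0.9512, 0.2730,
    0.5812, 0.1088, 0.0181, 0.0013, 0.6242, 0.0131, 0.7882, 0.0777, 0.9641, 0.0001,
    0.8851, 0.1280, 0.3421, 0.1805, 0.1121, 0.6562, 0.0133, 0.3082, 0.6923, 0.1925,
    0.4207, 0.0607, 0.3059, 0.2383, 0.2391, 0.1584, 0.1735, 0.0319, 0.0171, 0.1082,
]

alpha = 0.05
m = len(p_values)

# Naive approach: compare each p-value to alpha (ignores multiple testing)
naive = sum(p < alpha for p in p_values)

# Bonferroni adjustment: compare each p-value to alpha / m
bonferroni = sum(p < alpha / m for p in p_values)

print(f"Significant at alpha = {alpha}: {naive} of {m} runs")
print(f"Significant after Bonferroni (alpha/m = {alpha/m:.4f}): {bonferroni} of {m} runs")
```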

Replication Replication (or reproducibility) of a study in another setting or by another researcher is extremely important! Studies that have been replicated with similar conclusions gain credibility Studies that have been replicated with different conclusions lose credibility Replication helps guard against Type I errors AND helps with generalizability

Mate Choice and Offspring Fitness
Actually, the research at Penn State included 3 different experiments, with two different species of fruit flies and three different mutant types:
1. Drosophila melanogaster, mutant: sparkling eyes
2. Drosophila melanogaster, mutant: white eyes
3. Drosophila pseudoobscura, mutant: orange eyes
Multiple possible outcomes (% surviving in each group, % of survivors who were from the experimental group rather than mutants). Multiple ways to analyze: proportions, quantitative paired analysis.

Mate Choice and Offspring Fitness
Original study: significant in favor of choice, p-value < 0.01.
PSU study #1: not significant. 6067/10000 - 5976/10000 = 0.6067 - 0.5976 = 0.009; p-value = 0.09.
PSU study #2: significant in favor of no choice. 4579/10000 - 4749/10000 = 0.4579 - 0.4749 = -0.017; p-value = 0.992 for choice, 0.008 for no choice.
PSU study #3: significant in favor of no choice. 1641/5000 - 1758/5000 = 0.3282 - 0.3516 = -0.02; p-value = 0.993 for choice, 0.007 for no choice.
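
The p-values above are consistent with a one-sided test for a difference in two proportions. The sketch below is a rough normal-approximation check using the counts on this slide; it is not necessarily the analysis the researchers used (which may have been a randomization or exact method), but it lands close to the reported values.

```python
from math import sqrt
from scipy.stats import norm

# (survivors with choice, n with choice, survivors without choice, n without choice)
studies = {"PSU #1": (6067, 10000, 5976, 10000),
           "PSU #2": (4579, 10000, 4749, 10000),
           "PSU #3": (1641, 5000, 1758, 5000)}

for name, (x1, n1, x2, n2) in studies.items():
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # One-sided p-value for Ha: choice improves survival (p1 > p2)
    p_choice = 1 - norm.cdf(z)
    print(f"{name}: diff = {p1 - p2:+.4f}, z = {z:+.2f}, one-sided p (choice) = {p_choice:.3f}")
```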

Reproducibility Crisis: “While the public remains relatively unaware of the problem, it is now a truism in the scientific establishment that many preclinical biomedical studies, when subjected to additional scrutiny, turn out to be false. Many researchers believe that if scientists set out to reproduce preclinical work published over the past decade, a majority would fail. This, in short, is the reproducibility crisis.” (Amid a Sea of False Findings, the NIH Tries Reform, 3/16/15) A recent study tried to replicate 100 results published in psychology journals: 97% of the original results were significant, but only 36% of the replicated results were significant. (Estimating the reproducibility of psychological science, 8/28/15)

Summary: Conclusions based on p-values are not perfect. Type I and Type II errors can happen. α of all tests with true null hypotheses will be significant just by chance. Often, only the significant results get published. Replication is important for credibility.

To Do HW 4.4, 4.5 (due Monday, 3/20)

www.causeweb.org Author: JB Landers