Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Slide 21- 1
Chapter 21 More About Tests
How to Think About P-Values
A P-value is a conditional probability: the probability of seeing a statistic at least as extreme as the one observed, given that the null hypothesis is true.
The P-value is NOT the probability that the null hypothesis is true.
It’s not even the conditional probability that the null hypothesis is true given the data.
Be careful to interpret the P-value correctly.
Example 1: P-value
A medical researcher has tested a new treatment for poison ivy against the traditional ointment. With a P-value of 0.047, he concludes the new treatment is more effective. Explain what the P-value means in this context.
Example 1 continued…
If there is no difference in effectiveness, the chance of seeing an observed difference this large or larger through natural sampling variation alone is 4.7%. That is low enough (below the common 0.05 threshold) that he rejects the null hypothesis and concludes there is evidence that the new treatment is more effective.
Alpha Levels
Sometimes we need to make a firm decision about whether or not to reject the null hypothesis.
When the P-value is small, it tells us that our data are rare given the null hypothesis.
How rare is “rare”?
Alpha Levels (cont.)
We can define “rare event” arbitrarily by setting a threshold for our P-value.
If our P-value falls below that point, we’ll reject H0. We call such results statistically significant.
The threshold is called an alpha level, denoted by α.
Alpha Levels (cont.)
Common alpha levels are 0.10, 0.05, and 0.01.
You have the option, almost the obligation, to consider your alpha level carefully and choose an appropriate one for the situation.
The alpha level is also called the significance level. When we reject the null hypothesis, we say that the test is “significant at that level.”
Example 2: Alpha
A researcher developing scanners to search for hidden weapons at airports has concluded that a new device is significantly better than the current scanner. He made this decision based on a test using α = .05. Would he have made the same decision at α = .10? How about α = .01? Explain.
Example 2: Alpha
At α = .10, he would have made the same decision. We know his P-value was less than .05, so it must also be less than .10.
To reject H0 at α = .01, the P-value must be less than .01, which is not necessarily the case. So he might not make the same decision.
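The decision rule behind Example 2 can be sketched in a few lines of Python. The researcher's actual P-value is not given, so the value 0.047 below is only an illustrative assumption (borrowed from Example 1), and `decide` is a hypothetical helper name.

```python
def decide(p_value: float, alpha: float) -> str:
    """Return the test decision for a given P-value and alpha level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical P-value, used for illustration only (the slide does not give one).
p_value = 0.047

# Apply the same P-value to the three common alpha levels.
decisions = {alpha: decide(p_value, alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, decision in decisions.items():
    print(f"alpha = {alpha:.2f}: {decision}")
```

Any P-value below .05 is rejected at α = .10 as well, but whether it is rejected at α = .01 depends on the P-value itself.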
Critical Values Again
When the alternative is one-sided, the critical value puts all of α on one side.
When the alternative is two-sided, the critical value splits α equally into two tails.
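As a rough sketch of how critical values depend on the alternative, the following Python uses the standard library's `NormalDist`. It assumes a Normal model and, for the one-sided case, an upper-tail alternative; `critical_values` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_values(alpha: float, two_sided: bool) -> tuple[float, ...]:
    """Critical z-values for a Normal model (one-sided case assumes an upper tail)."""
    z = NormalDist()  # standard Normal: mean 0, standard deviation 1
    if two_sided:
        # Split alpha equally: alpha/2 in each tail.
        z_star = z.inv_cdf(1 - alpha / 2)
        return (-z_star, z_star)
    # Put all of alpha in one tail.
    return (z.inv_cdf(1 - alpha),)

one_sided = critical_values(0.05, two_sided=False)
two_sided = critical_values(0.05, two_sided=True)
print(f"one-sided z*: {one_sided[0]:.3f}")
print(f"two-sided z*: {two_sided[0]:.3f}, {two_sided[1]:.3f}")
```

At α = .05 this gives a one-sided critical value near 1.645 and two-sided critical values near ±1.96.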
Confidence Intervals and Hypothesis Tests
Confidence intervals and hypothesis tests are built from the same calculations. They have the same assumptions and conditions.
You can approximate a hypothesis test by examining a confidence interval. Just ask whether the null hypothesis value is consistent with a confidence interval for the parameter at the corresponding confidence level.
Example 3: Click It or Ticket
Teens are at greatest risk of being killed or injured in traffic crashes. According to the National Highway Traffic Safety Administration, 65% of young people killed were not wearing a safety belt. Because many deaths could easily be prevented by the use of safety belts, several states have begun “Click It or Ticket” campaigns. In 2005, a local newspaper reported that a roadblock resulted in 23 tickets to drivers who were unbelted out of 134 stopped for inspection. Does this provide evidence that the goal of over 80% compliance was met? Let’s also use a confidence interval to test this hypothesis.
Example 3 continued…
Hypothesis:
H0: p = .80
HA: p > .80
The null hypothesis is that 80% of the drivers will be wearing their safety belts. The alternative hypothesis is that more than 80% will be wearing their safety belts due to the “Click It or Ticket” campaign.
Example 3 continued…
Model:
1. I will assume that the drivers are not likely to influence each other about wearing their seatbelts, making them mutually independent.
2. This isn’t a random sample, but I assume that these drivers are representative of the driving public.
3. 10% condition: 134 is certainly less than 10% of all drivers.
4. Success/Failure: np0 = 134(.80) = 107.2 ≥ 10 and nq0 = 134(.20) = 26.8 ≥ 10, so the sample is large enough.
Since all of the conditions are met, the sampling model is approximately Normal and we can do a one-tailed one-proportion z-test.
Example 3 continued…
Mechanics: We have to construct a confidence interval whose confidence level corresponds to the alpha level of the test. Because this is a one-sided test with α = .05, we should create a 90% confidence interval. That leaves 5% in each tail, so the 5% on the side of interest matches α.
Example 3 continued…
I am 90% confident that between 77.4% and 88.2% of all drivers wear their seatbelts.
Because the hypothesized rate of 80% is within this interval, the true rate could be 80% or even lower. I fail to reject the null hypothesis. There is insufficient evidence to conclude that more than 80% of all drivers are now wearing seatbelts.
The small sample size might also make the interval too wide to be very specific.
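The Click It or Ticket mechanics can be sketched in Python as follows. This is a minimal illustration under the Normal model assumed in the example, using only the counts from the newspaper report (23 unbelted out of 134); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 134                 # drivers stopped
unbelted = 23           # tickets issued
p_hat = (n - unbelted) / n   # observed proportion wearing a belt
p0 = 0.80               # null hypothesis value
alpha = 0.05

# One-proportion z-test (upper tail); the standard deviation uses p0.
sd_null = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd_null
p_value = 1 - NormalDist().cdf(z)

# 90% confidence interval (matches a one-sided test at alpha = .05);
# the standard error uses p_hat.
se_hat = sqrt(p_hat * (1 - p_hat) / n)
z_star = NormalDist().inv_cdf(1 - alpha)
lower, upper = p_hat - z_star * se_hat, p_hat + z_star * se_hat

print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, P-value = {p_value:.3f}")
print(f"90% CI: ({lower:.3f}, {upper:.3f})")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The interval comes out at roughly 77.5% to 88.2%, agreeing with the slide's interval up to rounding. Both the test and the interval lead to the same decision: fail to reject H0.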
Making Errors
Here’s some shocking news for you: nobody’s perfect. Even with lots of evidence we can still make the wrong decision.
When we perform a hypothesis test, we can make mistakes in two ways:
I. The null hypothesis is true, but we mistakenly reject it. (Type I error)
II. The null hypothesis is false, but we fail to reject it. (Type II error)
Analogy to medicine…
In medical disease testing, the null hypothesis is usually the assumption that a person is healthy. The alternative is that he or she has the disease we’re testing for.
A Type I error is a false positive: a healthy person is diagnosed with the disease.
A Type II error is a false negative: an infected person is diagnosed as disease free.
Another analogy:
In a Statistics final exam (with H0: the student has learned only 60% of the material):
What is a Type I error? (hint: false positive)
What is a Type II error? (hint: false negative)
Making Errors (cont.)
Which type of error is more serious depends on the situation at hand. In other words, the gravity of the error is context dependent.
The four situations in a hypothesis test:
H0 is true and we reject it: Type I error.
H0 is true and we fail to reject it: correct decision.
H0 is false and we reject it: correct decision.
H0 is false and we fail to reject it: Type II error.
Example 4: Alzheimer’s
Testing for Alzheimer’s disease can be a long and expensive process, consisting of lengthy tests and medical diagnosis. Recently a group of researchers (Solomon et al., 1998) devised a 7-minute test to serve as a quick screen for the disease for use in the general population of senior citizens. A patient who tested positive would then go through the more expensive battery of tests and medical diagnosis. The authors reported a false positive rate of 4% and a false negative rate of 8%.
a. Put this in the context of a hypothesis test. What are the null and alternative hypotheses?
b. What would a Type I error mean?
c. What would a Type II error mean?
d. Which is worse here, a Type I or Type II error? Explain.
e. What is the power of this test?
Example 4 continued…
a. The null hypothesis is that a person is healthy. The alternative is that they have Alzheimer’s disease.
b. A Type I error is deciding a person has Alzheimer’s when he or she doesn’t.
c. A Type II error is failing to diagnose Alzheimer’s disease when the person has it.
d. A Type I error would require more testing, resulting in time and money lost. A Type II error would mean that the person did not receive the treatment he or she needed. A Type II error is much worse.
e. The power is 1 minus the false negative rate: 1 – 0.08 = 0.92, or 92%.
Making Errors (cont.)
How often will a Type I error occur? Since a Type I error is rejecting a true null hypothesis, the probability of a Type I error is our α level.
When H0 is false and we reject it, we have done the right thing. A test’s ability to detect a false null hypothesis is called the power of the test.
Making Errors (cont.)
When H0 is false and we fail to reject it, we have made a Type II error. We assign the letter β to the probability of this mistake.
It’s harder to assess the value of β because we don’t know what the value of the parameter really is.
There is no single value for β; we can think of a whole collection of β’s, one for each incorrect parameter value.
Making Errors (cont.)
One way to focus our attention on a particular β is to think about the effect size. Ask “How big a difference would matter?”
We could reduce β for all alternative parameter values by increasing α. This would reduce β but increase the chance of a Type I error.
This tension between Type I and Type II errors is inevitable. The only way to reduce both types of errors is to collect more data. Otherwise, we just wind up trading off one kind of error against the other.
Power
The power of a test is the probability that it correctly rejects a false null hypothesis.
When the power is high, we can be confident that we’ve looked hard enough at the situation.
The power of a test is 1 – β.
Power (cont.)
Whenever a study fails to reject its null hypothesis, the test’s power comes into question.
When we calculate power, we imagine that the null hypothesis is false. The value of the power depends on how far the truth lies from the null hypothesis value.
The distance between the null hypothesis value, p0, and the truth, p, is called the effect size. Power depends directly on effect size.
A Picture Worth a Thousand Words
The larger the effect size, the easier it should be to see it.
Obtaining a larger sample size decreases the probability of a Type II error, so it increases the power.
It also makes sense that the more we’re willing to accept a Type I error, the less likely we will be to make a Type II error.
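These three relationships can be checked numerically. The sketch below assumes an upper-tail one-proportion z-test under the Normal model; the function name `power_upper_tail` and all of the parameter values (true proportions of 0.85 and 0.90, sample sizes of 134 and 500) are hypothetical choices for illustration, not figures from the slides.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(p0: float, p_true: float, n: int, alpha: float) -> float:
    """Approximate power of an upper-tail one-proportion z-test (Normal model)."""
    z = NormalDist()
    # Smallest sample proportion that leads to rejecting H0: p = p0.
    cutoff = p0 + z.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Spread of the sample proportion when the true proportion is p_true.
    sd_true = sqrt(p_true * (1 - p_true) / n)
    # Probability the sample proportion exceeds the cutoff when p_true holds.
    return 1 - z.cdf((cutoff - p_true) / sd_true)

base = power_upper_tail(p0=0.80, p_true=0.85, n=134, alpha=0.05)
bigger_effect = power_upper_tail(p0=0.80, p_true=0.90, n=134, alpha=0.05)
bigger_sample = power_upper_tail(p0=0.80, p_true=0.85, n=500, alpha=0.05)
bigger_alpha = power_upper_tail(p0=0.80, p_true=0.85, n=134, alpha=0.10)

for label, value in [("baseline", base), ("larger effect size", bigger_effect),
                     ("larger sample", bigger_sample), ("larger alpha", bigger_alpha)]:
    print(f"{label}: power = {value:.3f}, beta = {1 - value:.3f}")
```

Each of the three changes raises the power above the baseline, which is the same as lowering β.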
A Picture Worth a Thousand Words (cont.)
[Figure not reproduced: diagram showing the relationship between these concepts.]
Reducing Both Type I and Type II Error
The previous figure seems to show that if we reduce Type I error, we must automatically increase Type II error.
But we can reduce both types of error by making both curves narrower.
How do we make the curves narrower? Increase the sample size.
Reducing Both Type I and Type II Error (cont.)
This figure (not reproduced here) has means that are just as far apart as in the previous figure, but the sample sizes are larger, the standard deviations are smaller, and the error rates are reduced.
Reducing Both Type I and Type II Error (cont.)
[Figures not reproduced: original comparison of errors; comparison of errors with a larger sample size.]
Example 5: Equal opportunity?
A company is sued for job discrimination because only 19% of the newly hired candidates were minorities when 27% of all applicants were minorities. Is this strong evidence that the company’s hiring practices are discriminatory?
a. Is this a one-tailed or a two-tailed test? Why?
b. In this context, what would a Type I error be?
c. In this context, what would a Type II error be?
d. In this context, describe what is meant by the power of the test.
e. If the hypothesis is tested at the 5% level of significance instead of 1%, how will that affect the power of the test?
f. The lawsuit is based on the hiring of 37 employees. Is the power of the test higher than, lower than, or the same as it would be if it were based on 87 hires?
Example 5 continued…
a. One-tailed. The company wouldn’t be sued if “too many” minorities were hired.
b. Deciding the company is discriminating when it is not.
c. Deciding the company is not discriminating when it is.
d. The probability of correctly detecting discrimination when it exists.
e. Increases the power.
f. Lower, since the sample size is smaller.
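Answers e and f can be illustrated with a rough power calculation. The sketch below assumes a lower-tail one-proportion z-test of H0: p = 0.27 under the Normal model, and it assumes, purely for illustration, that the company's true minority hiring rate is 0.19. The function name `power_lower_tail` is hypothetical, and the Normal approximation is only rough at n = 37.

```python
from math import sqrt
from statistics import NormalDist

def power_lower_tail(p0: float, p_true: float, n: int, alpha: float) -> float:
    """Approximate power of a lower-tail one-proportion z-test (Normal model)."""
    z = NormalDist()
    # Largest sample proportion that still leads to rejecting H0: p = p0.
    cutoff = p0 - z.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Spread of the sample proportion when the true proportion is p_true.
    sd_true = sqrt(p_true * (1 - p_true) / n)
    # Probability the sample proportion falls below the cutoff when p_true holds.
    return z.cdf((cutoff - p_true) / sd_true)

P0 = 0.27       # minority share of applicants (null value)
P_TRUE = 0.19   # ASSUMED true hiring rate, for illustration only

power_37_at_5 = power_lower_tail(P0, P_TRUE, n=37, alpha=0.05)
power_37_at_1 = power_lower_tail(P0, P_TRUE, n=37, alpha=0.01)
power_87_at_5 = power_lower_tail(P0, P_TRUE, n=87, alpha=0.05)

print(f"n = 37, alpha = .05: power = {power_37_at_5:.3f}")
print(f"n = 37, alpha = .01: power = {power_37_at_1:.3f}")
print(f"n = 87, alpha = .05: power = {power_87_at_5:.3f}")
```

Under these assumptions, the power is higher at the 5% level than at the 1% level, and higher with 87 hires than with 37.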
Example 6: Hoops
A basketball player with a poor foul-shot record practices intensively during the off-season. He tells the coach that he has raised his proficiency from 60% to 80%. Dubious, the coach asks him to take 10 shots, and is surprised when the player hits 9 out of 10. Did the player prove that he has improved?
a. Suppose the player really is no better than before, still a 60% shooter. What’s the probability he could hit at least 9 out of 10 shots? (Hint: Use a Binomial model)
b. If that is what happened, now the coach thinks the player has improved, when he has not. Which type of error is that?
c. If the player really is an 80% shooter now, and it takes at least 9 out of 10 shots to convince the coach, what is the power of the test?
d. List a way the coach and player could increase the power to detect any improvement.
Example 6 continued…
a. About 4.6%: 1 – binomcdf(10, .6, 8)
b. Type I
c. 37.6%: 1 – binomcdf(10, .8, 8). This is the power of the test if the player really is an 80% shooter and must hit at least 9 out of 10.
d. Increase the number of shots.
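The two Binomial calculations in Example 6 can be reproduced in Python. This sketch treats binomcdf(n, p, k) as the cumulative Binomial probability P(X ≤ k), as on a graphing calculator; the helper `binom_cdf` is a hypothetical stand-in written with the standard library.

```python
from math import comb

def binom_cdf(n: int, p: float, k: int) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Probability a 60% shooter hits at least 9 of 10 (chance of a Type I error).
type_i_rate = 1 - binom_cdf(10, 0.6, 8)

# Probability an 80% shooter hits at least 9 of 10 (power of the test).
power = 1 - binom_cdf(10, 0.8, 8)

print(f"P(at least 9 of 10 | p = .6) = {type_i_rate:.3f}")
print(f"P(at least 9 of 10 | p = .8) = {power:.3f}")
```

The results are about 0.046 and 0.376, matching the 4.6% and 37.6% obtained with binomcdf.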
What have we learned?
There’s a lot more to hypothesis testing than a simple yes/no decision.
We’ve also learned about the two kinds of errors we might make and seen why, in the end, we’re never sure we’ve made the right decision.