IE241: Introduction to Hypothesis Testing
We said before that estimation of parameters was one of the two major areas of statistics. Now let’s turn to the second major area of statistics, hypothesis testing. What is a statistical hypothesis? A statistical hypothesis is an assumption about f(X) if X is continuous or p(X) if X is discrete. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the hypothesis.
Let’s look at an example. A buyer of light bulbs bought 50 bulbs of each of two brands. When he tested them, Brand A had an average life of 1208 hours with a standard deviation of 94 hours. Brand B had a mean life of 1282 hours with a standard deviation of 80 hours. Are brands A and B really different in quality?
We set up two hypotheses. The first, called the null hypothesis Ho, is the hypothesis of no difference.
Ho: μ_A = μ_B
The second, called the alternative hypothesis Ha, is the hypothesis that there is a difference.
Ha: μ_A ≠ μ_B
On the basis of the sample of 50 from each of the two populations of light bulbs, we shall either reject or not reject the hypothesis of no difference. In statistics, we always test the null hypothesis. The alternative hypothesis is the default winner if the null hypothesis is rejected.
We never really accept the null hypothesis; we simply fail to reject it on the basis of the evidence in hand. Now we need a procedure to test the null hypothesis. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the null hypothesis. There are two possible decisions, reject or not reject. This means there are also two kinds of error we could make.
The two types of error are shown in the table below.

                      True state
Decision              Ho true             Ho false
Reject Ho             Type 1 error (α)    Correct decision
Do not reject Ho      Correct decision    Type 2 error (β)

If we reject Ho when Ho is in fact true, then we make a type 1 error. The probability of a type 1 error is α. If we do not reject Ho when Ho is really false, then we make a type 2 error. The probability of a type 2 error is β.
Now we need a decision rule that will make the probability of the two types of error very small. The problem is that the rule cannot make both of them small simultaneously. Because in science we have to take the conservative route and never claim that we have found a new result unless we are really convinced that it is true, we choose a very small α, the probability of type 1 error.
Then among all possible decision rules given α, we choose the one that makes β as small as possible. The decision rule consists of a test statistic and a critical region where the test statistic may fall. For means from a normal population, the test statistic is

$$t = \frac{\bar{x}_B - \bar{x}_A}{\sqrt{s_A^2/n_A + s_B^2/n_B}}$$

where $\bar{x}$, $s$, and $n$ are each brand's sample mean, standard deviation, and sample size, and the denominator is the standard deviation of the difference between two independent means.
The critical region is a tail of the distribution of the test statistic. If the test statistic falls in the critical region, Ho is rejected. Now, how much of the tail should be in the critical region? That depends on just how small you want α to be. The usual choice is α = .05, but in some very critical cases, α is set at .01. Here we have just a non-critical choice of light bulbs, so we’ll choose α = .05. Because Ha is two-sided, this means that the critical region has probability .025 in each tail of the t distribution.
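For a t distribution with .025 in each tail, the critical value of t is approximately 1.96, essentially the same as z because the sample size is greater than 30. The critical region then is |t| > 1.96. In our light bulb example, the test statistic is

$$t = \frac{1282 - 1208}{\sqrt{94^2/50 + 80^2/50}} = \frac{74}{17.46} \approx 4.23$$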
Now 4.23 is much greater than 1.96, so we reject the null hypothesis of no difference and declare that the average life of the B bulbs is longer than that of the A bulbs. Because α = .05, we have 95% confidence in the decision we made.
We cannot say that there is a 95% probability that we are right because we are either right or wrong and we don’t know which. But there is such a small probability that t will land in the critical region if Ho is true that if it does get there, we choose to believe that Ho is not true. If we had chosen α = .01, the critical value of t would be 2.58 and because 4.23 is greater than 2.58, we would still reject Ho. This time it would be with 99% confidence.
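The light bulb calculation can be checked with a few lines of Python. This is only a minimal sketch: the function name `two_sample_t` is made up for illustration, and the critical values 1.96 and 2.58 are the large-sample values quoted in the text rather than exact t quantiles.

```python
import math

def two_sample_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference of sample means (B minus A) divided by the
    standard deviation of the difference of two independent means."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_b - mean_a) / se

# Brand A: mean 1208, sd 94, n 50.  Brand B: mean 1282, sd 80, n 50.
t = two_sample_t(1208, 94, 50, 1282, 80, 50)

# Two-sided critical regions |t| > 1.96 (alpha = .05) and |t| > 2.58 (alpha = .01).
reject_05 = abs(t) > 1.96
reject_01 = abs(t) > 2.58

print(f"t = {t:.3f}, reject at .05: {reject_05}, reject at .01: {reject_01}")
```

The printed statistic is the value the text reports as 4.23, and both decisions come out as reject.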
How do we know that the test we used is the best test possible? We have controlled the probability of Type 1 error. But what is the probability of Type 2 error in this test? Does this test minimize it subject to the value of α?
To answer this question, we need to consider the concept of test power. The power of a statistical test is the probability of rejecting Ho when Ho is really false. Thus power = 1 − β. Clearly if the test maximizes power, it minimizes the probability of Type 2 error β. If a test maximizes power for a given α, it is called a most powerful test of size α.
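To make power concrete, here is a rough sketch of how it could be computed for the two-sided light bulb test at α = .05. It rests on assumptions that are not in the lecture: the test statistic is treated as normal with a known standard error, and the true differences in mean life (0, 20, 40, and 74 hours) are hypothetical values picked only for illustration. The helper names are also made up.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_power(true_diff, se, z_crit=1.96):
    """P(|Z| > z_crit) when Z is normal with mean true_diff / se and sd 1."""
    shift = true_diff / se
    return normal_cdf(-z_crit + shift) + normal_cdf(-z_crit - shift)

# Standard error of the difference, from the light bulb samples.
se = math.sqrt(94**2 / 50 + 80**2 / 50)

# Hypothetical true differences in mean life (hours), for illustration only.
for true_diff in (0, 20, 40, 74):
    power = two_sided_power(true_diff, se)
    print(f"true difference {true_diff:>2}: power = {power:.3f}, beta = {1 - power:.3f}")
```

When the true difference is 0, the "power" is just α, and it grows toward 1 as the true difference moves away from 0. So β depends on which alternative is actually true.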
Before going further, we need to distinguish between two types of hypotheses. A simple hypothesis is one where the value of the parameter under Ho is a specified constant and the value of the parameter under Ha is a different specified constant. For example, if you test Ho: μ = 0 vs Ha: μ = 10 then you have a simple hypothesis test. Here you have a particular value for Ho and a different particular value for Ha.
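For testing one simple hypothesis Ha against the simple hypothesis Ho, a ground-breaking result called the Neyman-Pearson lemma provides the most powerful test: reject Ho when

$$\lambda = \frac{L(\theta_a)}{L(\theta_0)} = \frac{\prod_{i=1}^{n} f(x_i;\theta_a)}{\prod_{i=1}^{n} f(x_i;\theta_0)} > k$$

where k is chosen to give the desired α. λ is a likelihood ratio with the likelihood at the Ha parameter value in the numerator and the likelihood at the Ho parameter value in the denominator. Clearly, any value of λ > 1 would favor the alternative hypothesis, while values less than 1 would favor the null hypothesis.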
Consider the following example of a test of two simple hypotheses. A coin is either fair or has P(H) = 2/3. Under Ho, P(H) = 1/2 and under Ha, P(H) = 2/3. The coin will be tossed 3 times and a decision will be made between the two hypotheses. Thus X = number of heads = 0, 1, 2, or 3. Now let’s look at how the decision will be made.
First, let’s look at the probability of Type 1 error α. In the table below, Ho ⇒ P(H) = 1/2 and Ha ⇒ P(H) = 2/3.

X     P(X|Ho)    P(X|Ha)
0     1/8        1/27
1     3/8        6/27
2     3/8        12/27
3     1/8        8/27

Now what should the critical region be?
Under Ho, if X = 0, α = 1/8. Under Ho, if X = 3, α = 1/8. So if either of these two values is chosen as the critical region, the probability of Type 1 error would be the same. Now what if Ha is true? If X = 0 is chosen as the critical region, the value of β = 26/27 because that is the probability that X ≠ 0. On the other hand, if X = 3 is chosen as the critical region, the value of β = 19/27 because that is the probability that X ≠ 3. Clearly, the better choice for the critical region is X = 3 because that is the region that minimizes β for fixed α. So this critical region provides the more powerful test.
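The coin example is small enough to verify exactly. The sketch below uses exact fractions to rebuild the two binomial distributions, then computes α and β for the two candidate critical regions. It also prints the likelihood ratio P(X|Ha)/P(X|Ho) for each outcome. The helper names `binom_pmf` and `alpha_beta` are illustrative only.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p):
    """Exact binomial probabilities P(X = 0), ..., P(X = n)."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

n = 3
p_h0 = binom_pmf(n, Fraction(1, 2))   # fair coin
p_ha = binom_pmf(n, Fraction(2, 3))   # biased coin

def alpha_beta(region):
    """alpha = P(X in region | Ho), beta = P(X not in region | Ha)."""
    alpha = sum(p_h0[x] for x in region)
    beta = 1 - sum(p_ha[x] for x in region)
    return alpha, beta

for region in ({0}, {3}):
    alpha, beta = alpha_beta(region)
    print(f"critical region {region}: alpha = {alpha}, beta = {beta}")

# Likelihood ratio for each outcome; larger values favor Ha.
ratios = [p_ha[x] / p_h0[x] for x in range(n + 1)]
for x, ratio in enumerate(ratios):
    print(f"X = {x}: likelihood ratio = {ratio}")
```

The outcome X = 3 has the largest likelihood ratio, which agrees with choosing it as the critical region.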
In discrete variable problems like this, it may not be possible to choose a critical region of the desired α. In this illustration, you simply cannot find a critical region where α = .05 or .01. This is seldom a problem in real-life experimentation because n is usually sufficiently large so that there is a wide variety of choices for critical regions.
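A short sketch shows how the available values of α fill in as n grows. For simplicity it considers only right-tail critical regions of the form X ≥ k for a fair coin. That restriction, the comparison value n = 30, and the function name `right_tail_sizes` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def right_tail_sizes(n, p=Fraction(1, 2)):
    """Exact P(X >= k | Ho) for k = 0, ..., n when X is binomial(n, p)."""
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    return [sum(pmf[k:]) for k in range(n + 1)]

# With 3 tosses, only a handful of alpha values are available.
sizes_3 = right_tail_sizes(3)
print("n = 3:", [str(a) for a in sizes_3])

# With 30 tosses, some right-tail region comes close to alpha = .05.
sizes_30 = right_tail_sizes(30)
closest = min(sizes_30, key=lambda a: abs(a - Fraction(1, 20)))
print("n = 30, right-tail size closest to .05:", float(closest))
```

With n = 3 the smallest nonzero size is 1/8, so α = .05 is out of reach. With n = 30 the achievable sizes are much more finely spaced.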
This problem, which illustrates the general method for selecting the best test, was easy to discuss because there was only a single alternative to Ho. Most problems involve more than a single alternative. Such hypotheses are called composite hypotheses.
Examples of composite hypotheses:
Ho: μ = 0 vs Ha: μ ≠ 0, which is a two-sided Ha.
A one-sided Ha can be written as
Ho: μ = 0 vs Ha: μ > 0
or
Ho: μ = 0 vs Ha: μ < 0
All of these hypotheses are composite because they include more than one value for Ha. And unfortunately, the size of β here depends on the particular alternative value of μ being considered.
In the composite case, it is necessary to compare Type 2 errors for all possible alternative values under Ha. The size of the Type 2 error is now a function of the alternative parameter value θ: β(θ) is the probability that the sample point will fall in the noncritical region when θ is the true value of the parameter.
Because it is more convenient to work with the critical region, the power function 1-β(θ) is usually used. The power function is the probability that the sample point will fall in the critical region when θ is the true value of the parameter. As an illustration of these points, consider the following continuous example.
Let X = the time that elapses between two successive trippings of a Geiger counter in studying cosmic radiation. It is assumed that the density function is $f(x;\theta) = \theta e^{-\theta x}$ for x > 0, where θ is a parameter which depends on experimental conditions. Under Ho, θ = 2. Now a physicist believes that θ < 2. So under Ha, θ < 2.
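Now one choice for the critical region is the right tail, X ≥ 1, for which

$$\alpha = \int_1^{\infty} 2e^{-2x}\,dx = e^{-2} = .135$$

Another choice is the left tail, X ≤ .07, for which α is also approximately .135 (the cutoff .07 is rounded from .0727). That is,

$$\alpha = \int_0^{.0727} 2e^{-2x}\,dx = 1 - e^{-.1454} = .135$$

Now let’s examine the power functions for the two competing critical regions.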
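The sizes of the two critical regions follow from integrating the exponential density under Ho (θ = 2). The right-tail probability beyond c is e^(−θc) and the left-tail probability below c is 1 − e^(−θc). The sketch below evaluates both, and also solves for the left-tail cutoff whose size matches the right tail exactly. Function names are illustrative.

```python
import math

theta_0 = 2.0  # value of theta under Ho

def right_tail_prob(c, theta):
    """P(X >= c) for the density theta * exp(-theta * x), x > 0."""
    return math.exp(-theta * c)

def left_tail_prob(c, theta):
    """P(X <= c) for the same density."""
    return 1.0 - math.exp(-theta * c)

alpha_right = right_tail_prob(1.0, theta_0)   # critical region X >= 1
alpha_left = left_tail_prob(0.07, theta_0)    # critical region X <= .07

# Left-tail cutoff that gives exactly the same size as X >= 1.
exact_cut = -math.log(1.0 - alpha_right) / theta_0

print(f"alpha for X >= 1:   {alpha_right:.4f}")
print(f"alpha for X <= .07: {alpha_left:.4f}")
print(f"left-tail cutoff matching X >= 1: {exact_cut:.4f}")
```

With the rounded cutoff .07 the left-tail size comes out slightly below .135, so the two regions are comparable in α but not identical.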
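For the critical region X > 1,

$$1 - \beta(\theta) = \int_1^{\infty} \theta e^{-\theta x}\,dx = e^{-\theta}$$

and for the critical region X < .07,

$$1 - \beta(\theta) = \int_0^{.07} \theta e^{-\theta x}\,dx = 1 - e^{-.07\theta}$$

The graphs of these two functions are called the power curves for the two critical regions.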
[Figure: power curves for the critical regions X > 1 and X < .07]

Note that for every θ < 2, the power function for the X > 1 region is higher than the power function for the X < .07 region, so the critical region X > 1 is superior.
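The comparison of the two critical regions can be tabulated directly. Integrating the density θe^(−θx) over X > 1 gives power e^(−θ), and integrating it over X < .07 gives power 1 − e^(−.07θ). The sketch below evaluates both on a small grid of θ values at or below 2. The grid and the function names are arbitrary choices for illustration.

```python
import math

def power_right(theta, c=1.0):
    """Power of the critical region X > c: P(X > c) when theta is true."""
    return math.exp(-theta * c)

def power_left(theta, c=0.07):
    """Power of the critical region X < c: P(X < c) when theta is true."""
    return 1.0 - math.exp(-theta * c)

# Illustrative grid of theta values under Ha (theta < 2), plus theta = 2 itself.
thetas = [0.25, 0.5, 1.0, 1.5, 2.0]

for theta in thetas:
    print(f"theta = {theta:4.2f}: power(X > 1) = {power_right(theta):.3f}, "
          f"power(X < .07) = {power_left(theta):.3f}")

# Check that the right-tail region is more powerful at every grid point.
right_wins = all(power_right(th) > power_left(th) for th in thetas)
print("X > 1 more powerful on the whole grid:", right_wins)
```

As θ decreases from 2, the power of the X > 1 region rises while the power of the X < .07 region falls, which is why the right tail is the better critical region for Ha: θ < 2.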