STAT 111 Introductory Statistics Lecture 11: Hypothesis Testing June 9, 2004
Today’s Topics Hypothesis testing continued Test statistics P-values Statistical Significance Testing a population mean Using and abusing hypothesis tests
Hypothesis Testing Terminology: Hypothesis: a statement about the parameters in a population or model Null hypothesis H0: claim which is initially favored or believed to be true; the claim we try to find evidence against. Usually a statement of “no effect or no difference.” Alternative hypothesis Ha: claim that we hope or suspect to be true instead of H0.
Hypothesis Testing Hypothesis testing is designed to assess the strength of the evidence against the null hypothesis. Hypotheses refer to some population or model, not to any particular outcome. Generally, we begin with the alternative hypothesis Ha and set up H0 as the statement that the hoped-for effect is not present.
Hypothesis Testing Alternative hypotheses can be either one-sided (ex., μ > 0), or two-sided (ex., μ ≠ 0). The alternative hypothesis should express the hopes or suspicions we bring to the data. Two-sided alternatives are generally used unless we have a specific direction firmly in mind beforehand.
Finding Evidence: Test Statistics Hypothesis testing is based on a statistic that estimates the parameter that appears in the hypotheses, usually the same estimate we use in a confidence interval for the parameter. When H0 is true, we expect the estimate to take a value near the parameter value specified by H0. Values of the estimate far from the parameter value specified by H0 give evidence against H0. The alternative hypothesis determines which directions count against H0.
Finding Evidence: Test Statistics A test statistic measures compatibility between the null hypothesis and the data. A test statistic is a random variable whose distribution is known when H0 is true. Once a sample is drawn from the population, we can observe a value for our test statistic. Question: How probable is it that our test statistic takes a value as extreme as or more extreme than that which we actually observed, if the null hypothesis is true?
Finding Evidence: P-values A significance test assesses the evidence against the null hypothesis in terms of probability. I.e., if the observed outcome is unlikely if the null hypothesis is true, but is more probable under the alternative hypothesis, the outcome we observe is evidence for Ha against H0. The less probable the outcome, the stronger the evidence that H0 is false. Not all test statistics are normal, so we translate the value of a test statistic into a probability.
Finding Evidence: P-values A test of significance finds the probability of getting an outcome as extreme or more extreme than the actually observed outcome. “Extreme” in this context means “far from what we would expect if H0 were true.” The direction is determined by Ha as well as H0.
Finding Evidence: P-values The P-value of a test is the probability, computed assuming that H0 is true, of the test statistic taking a value as extreme as or more extreme than that actually observed. The P-value of a test provides information about the amount of evidence that is in favor of the alternative hypothesis and against the null. The smaller the P-value of a test is, the stronger the evidence against the null hypothesis provided by the data.
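A minimal sketch of how an observed test statistic is translated into a P-value, assuming the statistic is standard normal when H0 is true (as in the z test later in this lecture). The function name p_value and the labels "greater", "less", and "two-sided" are illustrative choices, not notation from the slides.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """P-value for an observed z, computed assuming H0 is true."""
    phi = NormalDist().cdf  # standard normal cumulative distribution function
    if alternative == "greater":    # Ha: parameter > null value, large z is extreme
        return 1.0 - phi(z)
    if alternative == "less":       # Ha: parameter < null value, small z is extreme
        return phi(z)
    if alternative == "two-sided":  # Ha: parameter != null value, both tails count
        return 2.0 * min(phi(z), 1.0 - phi(z))
    raise ValueError("unknown alternative: " + alternative)

# The same observed value gives different P-values under different alternatives.
for alt in ("greater", "less", "two-sided"):
    print(alt, round(p_value(1.5, alt), 4))
```

The alternative hypothesis decides which tail (or tails) counts as "extreme", which is why the same z can be strong evidence under one Ha and no evidence under another.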
Finding Evidence: Significance Levels How should we draw conclusions about our hypothesis test based on the P-value? We need a cut-off point (decisive value) that we can compare our P-value to so that we can draw a conclusion or make a decision about our test. This cut-off point is a significance level. It is a number announced in advance and serves as a standard on how much evidence against H0 we need to reject H0. Usually denoted as α, and the corresponding test is called a level α test.
Statistical Significance When the P-value is as small as or smaller than the significance level, i.e., P-value ≤ α, we say that the data are statistically significant at level α. In other words, we have significant evidence against the null. Whether data are considered statistically significant or not depends on the significance level; data with a P-value of 0.03 are statistically significant at level 0.05, but not at level 0.01. The P-value itself is the smallest level α at which the data are significant.
Statistical Significance If the P-value is less than 0.01, there is overwhelming evidence against the null. If the P-value is between 0.01 and 0.05, there is strong evidence against the null. If the P-value is between 0.05 and 0.10, there is weak evidence against the null. If the P-value exceeds 0.10, we are led to believe that there is no real evidence against the null.
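The decision rule P-value ≤ α and the rough evidence scale can be sketched as two small functions. The function names are hypothetical, and the slides do not say which category the boundary values 0.01, 0.05, and 0.10 belong to, so the handling of exact boundaries below is an assumption.

```python
def is_significant(p, alpha):
    """Data are statistically significant at level alpha when P-value <= alpha."""
    return p <= alpha

def evidence_label(p):
    """Rough verbal scale for a P-value (boundary handling is an assumption)."""
    if p < 0.01:
        return "overwhelming"
    if p <= 0.05:
        return "strong"
    if p <= 0.10:
        return "weak"
    return "no real evidence"

p = 0.03
print(is_significant(p, 0.05))  # significant at level 0.05
print(is_significant(p, 0.01))  # not significant at level 0.01
print(evidence_label(p))
```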
General Procedures of Hypothesis Testing Step 1. State the null hypothesis H0 and alternative hypothesis Ha. Specify the significance level. Step 2. Calculate the value of the test statistic on which the test will be based. This statistic usually measures how far the data are from H0. Step 3. Find the P-value for the observed data. Step 4. State a conclusion. If the P-value is less than or equal to the significance level α, reject the null in favor of the alternative hypothesis; if it is greater than α, conclude that the data do not provide sufficient evidence to reject the null hypothesis.
z Test for a Population Mean Let X1, …, Xn be a simple random sample from N(μ, σ), where σ is known and μ is the unknown parameter of interest. The null hypothesis is H0: μ = μ0. The alternative hypothesis could be Ha: μ ≠ μ0, Ha: μ > μ0, or Ha: μ < μ0.
z Test for a Population Mean The sample mean X̄ is normally distributed with mean μ and standard deviation σ/√n. If H0 is true, then μ = μ0 and Z = (X̄ − μ0) / (σ/√n) has a standard normal distribution. Once an SRS is drawn, we will observe z = (x̄ − μ0) / (σ/√n). If H0 is true, z should be close to 0.
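The general four-step procedure, applied to the z statistic z = (x̄ − μ0) / (σ/√n), might be sketched as follows. The function name z_test, its argument names, and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha, alternative="two-sided"):
    """One-sample z test for a mean with known sigma.

    Step 1 (hypotheses and level) is expressed by mu0, alternative, and alpha.
    Returns (z, p_value, reject_h0).
    """
    # Step 2: test statistic, the standardized distance of xbar from mu0.
    z = (xbar - mu0) / (sigma / sqrt(n))

    # Step 3: P-value under H0, where Z is standard normal.
    phi = NormalDist().cdf
    if alternative == "greater":
        p = 1.0 - phi(z)
    elif alternative == "less":
        p = phi(z)
    elif alternative == "two-sided":
        p = 2.0 * min(phi(z), 1.0 - phi(z))
    else:
        raise ValueError("unknown alternative: " + alternative)

    # Step 4: reject H0 when the P-value is at most alpha.
    return z, p, p <= alpha

# Hypothetical data: n = 25, sample mean 10.4, H0: mu = 10, sigma = 2.
z, p, reject = z_test(xbar=10.4, mu0=10.0, sigma=2.0, n=25, alpha=0.05)
print(f"z = {z:.2f}, P-value = {p:.4f}, reject H0: {reject}")
```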
Example 1 A new billing system for a store will be cost effective only if the mean monthly account is more than $170. An SRS of 400 monthly accounts has a mean of $178. If the accounts are normally distributed with σ = $65, can we conclude that the new system will be cost-effective? Carry out a level 0.05 test.
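One way to work Example 1 numerically, reading the question as H0: μ = 170 against the one-sided alternative Ha: μ > 170 (the system is cost-effective only if the mean exceeds $170). The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma, n, alpha = 170.0, 178.0, 65.0, 400, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))  # 8 / 3.25
p = 1.0 - NormalDist().cdf(z)         # upper tail, because Ha: mu > 170
reject = p <= alpha

print(f"z = {z:.2f}, P-value = {p:.4f}, reject H0: {reject}")
```

Under this reading z is about 2.46 and the P-value is roughly 0.007, which is below 0.05, so the data give significant evidence that the mean monthly account exceeds $170.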
Example 2 A manufacturer of sprinkler systems used for fire protection in office buildings claims that the true average system-activation temperature is 130° F. A sample of n = 9 systems, when tested, yields a sample average activation temperature of 131.08° F. If the distribution of activation temperatures is normal with standard deviation 1.5° F, do the data contradict the manufacturer’s claim at significance level α = 0.01?
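A sketch of Example 2, assuming the natural two-sided setup H0: μ = 130 against Ha: μ ≠ 130, since the claim is contradicted by a mean that is either too high or too low. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma, n, alpha = 130.0, 131.08, 1.5, 9, 0.01

z = (xbar - mu0) / (sigma / sqrt(n))        # 1.08 / 0.5
p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # both tails, because Ha is two-sided
reject = p <= alpha

print(f"z = {z:.2f}, P-value = {p:.4f}, reject H0: {reject}")
```

Here z is 2.16 and the P-value is about 0.031. That would be significant at level 0.05 but not at the stated level 0.01, so under this setup the data do not contradict the claim at α = 0.01.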
Example 3 The melting point of each of 16 samples of a certain brand of vegetable oil was determined, with 94.32 the sample mean. Assume that the distribution of melting point is normal with σ = 1.20. Test H0: μ = 95 versus Ha: μ ≠ 95 using a two-tailed level 0.01 test.
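Example 3 can be checked the same way; this is a sketch with illustrative variable names, using the hypotheses stated in the example.

```python
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma, n, alpha = 95.0, 94.32, 1.20, 16, 0.01

z = (xbar - mu0) / (sigma / sqrt(n))   # -0.68 / 0.30
p = 2.0 * NormalDist().cdf(-abs(z))    # two-tailed P-value
reject = p <= alpha

print(f"z = {z:.2f}, P-value = {p:.4f}, reject H0: {reject}")
```

The statistic is about −2.27 with a two-tailed P-value near 0.023, which exceeds 0.01, so H0: μ = 95 is not rejected at this level.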
Rejection Region The rejection region is a range of values such that if the test statistic falls into that range, the null hypothesis is rejected in favor of the alternative hypothesis. To use the rejection region method: (1) State hypotheses and specify significance level. (2) Find corresponding rejection region. (3) Calculate test statistic. (4) Reject null hypothesis only if value of test statistic falls within rejection region; otherwise, do not reject null.
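For a z test, the rejection region can be found from standard normal critical values. The sketch below assumes a standard normal test statistic; the function name in_rejection_region and the alternative labels are hypothetical.

```python
from statistics import NormalDist

def in_rejection_region(z, alpha, alternative):
    """True when z falls in the level-alpha rejection region of a z test."""
    std_normal = NormalDist()
    if alternative == "greater":    # region: z >= upper critical value
        return z >= std_normal.inv_cdf(1.0 - alpha)
    if alternative == "less":       # region: z <= lower critical value
        return z <= std_normal.inv_cdf(alpha)
    if alternative == "two-sided":  # region: |z| >= critical value, alpha/2 per tail
        return abs(z) >= std_normal.inv_cdf(1.0 - alpha / 2.0)
    raise ValueError("unknown alternative: " + alternative)

# The same z = 1.8 is rejected by a one-sided test but not by a two-sided one.
print(in_rejection_region(1.8, 0.05, "greater"))
print(in_rejection_region(1.8, 0.05, "two-sided"))
```

This gives the same decisions as comparing the P-value with α; the region is simply computed once, before the data are examined.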
Example Bottles of a popular cola drink are supposed to contain 300 milliliters (ml) of cola, but there is some variation from bottle to bottle. The distribution of the contents is normal with standard deviation 3 ml. A student who suspects that the bottles are being under-filled measures the contents of six bottles. The results are 299.4; 297.7; 301.0; 298.9; 300.2; 297.0. Is this convincing evidence that the mean contents of cola bottles is less than the advertised 300 ml? Carry out the test at significance level 0.05.
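A sketch of the cola example using the rejection region method, with H0: μ = 300 against Ha: μ < 300 (the student suspects under-filling). The variable names are illustrative; the P-value is computed as well so the two approaches can be compared.

```python
from math import sqrt
from statistics import NormalDist, mean

contents = [299.4, 297.7, 301.0, 298.9, 300.2, 297.0]  # ml, from the example
mu0, sigma, alpha = 300.0, 3.0, 0.05

xbar = mean(contents)
z = (xbar - mu0) / (sigma / sqrt(len(contents)))

critical = NormalDist().inv_cdf(alpha)  # lower-tail critical value, about -1.645
reject = z <= critical                  # rejection region: z <= critical
p = NormalDist().cdf(z)                 # lower-tail P-value for comparison

print(f"xbar = {xbar:.2f}, z = {z:.2f}, critical = {critical:.3f}")
print(f"P-value = {p:.3f}, reject H0: {reject}")
```

The sample mean is about 299.03 ml and z is about −0.79, which is not in the rejection region, so these six bottles are not convincing evidence of under-filling at level 0.05.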
Duality between Confidence Intervals and Tests Suppose we construct a 95% confidence interval for the population mean μ. Then the values of μ that are not in our interval would seem to be incompatible with the data. This sounds like a significance test with α = 0.05. In particular, any level α two-sided significance test rejects a hypothesis H0: μ = μ0 exactly when the value of μ0 falls outside a level 1 – α confidence interval for μ.
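The duality can be checked numerically. The sketch below reuses the numbers from Example 3 (x̄ = 94.32, σ = 1.20, n = 16, α = 0.01); the function names and the candidate values of μ0 are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha):
    """Level 1 - alpha confidence interval for mu when sigma is known."""
    z_star = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin

def two_sided_rejects(xbar, mu0, sigma, n, alpha):
    """Decision of the level-alpha two-sided z test of H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = 2.0 * NormalDist().cdf(-abs(z))
    return p <= alpha

xbar, sigma, n, alpha = 94.32, 1.20, 16, 0.01
low, high = z_interval(xbar, sigma, n, alpha)
print(f"99% confidence interval: ({low:.3f}, {high:.3f})")

for mu0 in (93.0, 94.0, 95.0, 96.0):
    outside = not (low <= mu0 <= high)
    print(mu0, "outside interval:", outside,
          "| test rejects:", two_sided_rejects(xbar, mu0, sigma, n, alpha))
```

For each candidate μ0, the test rejects exactly when μ0 lies outside the interval. In particular 95 lies inside the 99% interval, matching the decision not to reject H0: μ = 95 at level 0.01.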
Using/Abusing Hypothesis Tests Carrying out a hypothesis test is simple; using a test wisely not quite so simple. Things to consider when using hypothesis tests: Choosing a level of significance What statistical significance does not mean Ignoring lack of significance Validity of statistical inference on some data sets Searching for significance
Choosing Significance Levels A significance test is designed to give a clear statement of the degree of evidence provided by the sample against the null hypothesis. Choosing a level α in advance makes sense if you need to make a decision, but not if you wish only to describe the strength of your evidence. Choose α by asking how much evidence is required to reject the null hypothesis. This depends on how plausible the null really is.
Choosing Significance Levels If the null is a widely-believed assumption, strong evidence will be needed to reject it. Level of evidence required to reject the null is affected by the consequences of such a decision. Standard levels of significance are 1%, 5%, and 10%, but there is no sharp border between “significant” and “insignificant.” For example, suppose one test yields P-value of 0.0501 and another yields P-value of 0.0499, and our chosen level is α = 0.05. Only the second result is formally significant, yet the two provide essentially the same amount of evidence against the null.
Statistical Significance Rejecting a null hypothesis at one of the usual levels suggests that there is good evidence that an effect is present. The magnitude of the effect may be extremely small. In particular, for large samples, even tiny deviations from the null will be significant. I.e., we will almost invariably reject the null. Statistical significance ≠ practical significance.
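To see how sample size drives significance, the sketch below holds a small observed deviation fixed and lets n grow. All numbers (μ0 = 100, x̄ = 100.1, σ = 1) are hypothetical and chosen only for illustration, and a two-sided z test with known σ is assumed.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(xbar, mu0, sigma, n):
    """Two-sided P-value of the z test of H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 2.0 * NormalDist().cdf(-abs(z))

# Same tiny deviation of 0.1 from the null value, increasing sample sizes.
for n in (10, 100, 1000, 10000):
    print(n, two_sided_p(xbar=100.1, mu0=100.0, sigma=1.0, n=n))
```

The deviation from the null value never changes, but the P-value shrinks as n grows because the standard deviation of the sample mean, σ/√n, shrinks. A deviation that is far from significant at n = 10 becomes highly significant at n = 1000, even though its practical size is the same.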
Statistical Significance Don’t attach too much importance to statistical significance – pay attention to actual experimental results. Examine plots of data – if effect you are seeking is not visible in plots, it might not be large enough to be practically important. Giving confidence intervals for parameters of interest is wise – size of effect is estimated rather than simply asking if it is too large to occur through chance alone.
Validity of Statistical Inference Badly designed surveys/experiments produce invalid results. Formal statistical inference cannot correct the flaws in a design. Hypothesis tests and confidence intervals are based on laws of probability, and randomization ensures the applicability of those laws. Not all data to analyze will arise from randomized samples/experiments. In such cases, inference rests on our confidence in a probability model for the data.
Searching for Significance Statistical significance is highly desired by researchers. The reasoning behind statistical significance works well if you decide what effect you are seeking, design an experiment or sample to search for it, and use a test of significance to weigh the evidence you get. Tempting to make significance itself the object of the search.
Searching for Significance Running many tests on the same data will almost always turn up some significant result. Not convincing to search for any effect or pattern and find one. Usual reasoning of statistical inference does not apply for a successful search for a pattern. Cannot legitimately perform a hypothesis test on the same data that first suggested that hypothesis.
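A short calculation shows why a search for significance tends to succeed. It assumes k independent tests in which every null hypothesis is actually true; the independence assumption and the function name are additions for illustration and are not stated in the slides.

```python
def prob_at_least_one_rejection(k, alpha):
    """Chance of at least one (false) rejection among k independent
    level-alpha tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 20, 100):
    print(k, round(prob_at_least_one_rejection(k, 0.05), 3))
```

With 20 such tests at level 0.05, the chance of at least one "significant" result is about 64% even though no effect exists anywhere, so a single significant finding from such a search carries little weight.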