Chapter 9: Inferences Based on Two Samples. Copyright © Cengage Learning. All rights reserved.
9.1 z Tests and Confidence Intervals for a Difference Between Two Population Means
z Tests and Confidence Intervals for a Difference Between Two Population Means The inferences discussed in this section concern a difference μ1 – μ2 between the means of two different population distributions. An investigator might, for example, wish to test hypotheses about the difference between true average breaking strengths of two different types of corrugated fiberboard.
z Tests and Confidence Intervals for a Difference Between Two Population Means One such hypothesis would state that μ1 – μ2 = 0, that is, that μ1 = μ2. Alternatively, it may be appropriate to estimate μ1 – μ2 by computing a 95% CI. Such inferences necessitate obtaining a sample of strength observations for each type of fiberboard.
z Tests and Confidence Intervals for a Difference Between Two Population Means The use of m for the number of observations in the first sample and n for the number of observations in the second sample allows for the two sample sizes to be different. Sometimes this is because it is more difficult or expensive to sample one population than another. In other situations, equal sample sizes may initially be specified, but for reasons beyond the scope of the experiment, the actual sample sizes may differ.
Test Procedures for Normal Populations with Known Variances
Example 9.1 Analysis of a random sample consisting of m = 20 specimens of cold-rolled steel to determine yield strengths resulted in a sample average strength of x̄ = 29.8. A second random sample of n = 25 two-sided galvanized steel specimens gave a sample average strength of ȳ = 34.7.
Example 9.1 cont’d Assuming that the two yield-strength distributions are normal with σ1 = 4.0 and σ2 = 5.0 (suggested by a graph in the article “Zinc-Coated Sheet Steel: An Overview,” Automotive Engr., Dec. 1984: 39–43), does the data indicate that the corresponding true average yield strengths μ1 and μ2 are different? Let’s carry out a test at significance level α = .01.
Example 9.1 cont’d 1. The parameter of interest is μ1 – μ2, the difference between the true average strengths for the two types of steel. 2. The null hypothesis is H0: μ1 – μ2 = 0. 3. The alternative hypothesis is Ha: μ1 – μ2 ≠ 0; if Ha is true, then μ1 and μ2 are different. 4. With Δ0 = 0, the test statistic value is z = (x̄ – ȳ)/√(σ1²/m + σ2²/n).
Example 9.1 cont’d 5. Substituting m = 20, x̄ = 29.8, σ1² = 16.0, n = 25, ȳ = 34.7, and σ2² = 25.0 into the formula for z yields z = (29.8 – 34.7)/√(16.0/20 + 25.0/25) = –4.9/1.342 = –3.65. That is, the observed value of x̄ – ȳ is more than 3 standard deviations below what would be expected were H0 true.
Example 9.1 cont’d 6. The ≠ inequality in Ha implies that a two-tailed test is appropriate. The P-value is 2[1 – Φ(3.65)] ≈ .0003, essentially 0.
Example 9.1 cont’d 7. Since P-value ≈ 0 ≤ .01 = α, H0 is therefore rejected at level .01 in favor of the conclusion that μ1 ≠ μ2. In fact, with a P-value this small, the null hypothesis would be rejected at any sensible significance level. The sample data strongly suggests that the true average yield strength for cold-rolled steel differs from that for galvanized steel.
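Steps 4–7 of Example 9.1 can be sketched in a few lines of Python (a sketch using scipy for the normal tail area; variable names are my own):

```python
from math import sqrt
from scipy.stats import norm

# Summary data from Example 9.1
m, xbar, var1 = 20, 29.8, 16.0    # cold-rolled steel (sigma1 = 4.0)
n, ybar, var2 = 25, 34.7, 25.0    # galvanized steel (sigma2 = 5.0)
delta0 = 0.0                      # null value of mu1 - mu2

# Two-sample z statistic for normal populations with known variances
z = (xbar - ybar - delta0) / sqrt(var1 / m + var2 / n)

# Two-tailed P-value: 2 * P(Z >= |z|)
p_value = 2 * norm.sf(abs(z))

print(round(z, 2))   # about -3.65
print(p_value)       # essentially 0
```

The same function carries over to any pair of normal populations with known variances; only the six summary numbers change.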
Using a Comparison to Identify Causality
Using a Comparison to Identify Causality Investigators are often interested in comparing either the effects of two different treatments on a response or the response after treatment with the response after no treatment (treatment vs. control). If the individuals or objects to be used in the comparison are not assigned by the investigators to the two different conditions, the study is said to be observational.
Using a Comparison to Identify Causality The difficulty with drawing conclusions based on an observational study is that although statistical analysis may indicate a significant difference in response between the two groups, that difference may be due to some underlying factor that had not been controlled rather than to any difference in treatments.
Example 9.2 A letter in the Journal of the American Medical Association (May 19, 1978) reported on 215 male physicians who were Harvard graduates and died between November 1974 and October 1977: the 125 in full-time practice lived an average of 48.9 years beyond graduation, whereas the 90 with academic affiliations lived an average of 43.2 years beyond graduation.
Example 9.2 cont’d Does the data suggest that the mean lifetime after graduation for doctors in full-time practice exceeds the mean lifetime for those who have an academic affiliation? (If so, those medical students who say that they are “dying to obtain an academic affiliation” may be closer to the truth than they realize; in other words, is “publish or perish” really “publish and perish”?)
Example 9.2 cont’d Let μ1 denote the true average number of years lived beyond graduation for physicians in full-time practice, and let μ2 denote the same quantity for physicians with academic affiliations. Assume the 125 and 90 physicians to be random samples from populations 1 and 2, respectively (which may not be reasonable if there is reason to believe that Harvard graduates have special characteristics that differentiate them from all other physicians; in this case inferences would be restricted just to the “Harvard populations”).
Example 9.2 cont’d The letter from which the data was taken gave no information about variances. So for illustration assume that σ1 = 14.6 and σ2 = 14.4. The hypotheses are H0: μ1 – μ2 = 0 versus Ha: μ1 – μ2 > 0, so Δ0 is zero.
Example 9.2 cont’d The computed value of the test statistic is z = (48.9 – 43.2)/√((14.6)²/125 + (14.4)²/90) = 5.70/2.00 = 2.85.
Example 9.2 cont’d The P-value for an upper-tailed test is 1 – Φ(2.85) = .0022. At significance level .01, H0 is rejected (because α > P-value) in favor of the conclusion that μ1 – μ2 > 0 (μ1 > μ2). This is consistent with the information reported in the letter.
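As a quick check, the z value and upper-tailed P-value above can be reproduced with a short script (a sketch; scipy's norm.sf gives the upper-tail area 1 – Φ(z)):

```python
from math import sqrt
from scipy.stats import norm

# Summary data from Example 9.2
m, xbar, s1 = 125, 48.9, 14.6   # full-time practice
n, ybar, s2 = 90, 43.2, 14.4    # academic affiliation

# z statistic for H0: mu1 - mu2 = 0 versus Ha: mu1 - mu2 > 0
z = (xbar - ybar) / sqrt(s1**2 / m + s2**2 / n)
p_value = norm.sf(z)            # upper-tailed test: 1 - Phi(z)

print(round(z, 2))        # 2.85
print(round(p_value, 4))  # 0.0022
```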
Example 9.2 cont’d This data resulted from a retrospective observational study; the investigator did not start out by selecting a sample of doctors and assigning some to the “academic affiliation” treatment and the others to the “full-time practice” treatment, but instead identified members of the two groups by looking backward in time (through obituaries!) to past records.
Example 9.2 cont’d Can the statistically significant result here really be attributed to a difference in the type of medical practice after graduation, or is there some other underlying factor (e.g., age at graduation, exercise regimens, etc.) that might also furnish a plausible explanation for the difference? Observational studies have been used to argue for a causal link between smoking and lung cancer.
Example 9.2 cont’d There are many studies that show that the incidence of lung cancer is significantly higher among smokers than among nonsmokers. However, individuals had decided whether to become smokers long before investigators arrived on the scene, and factors in making this decision may have played a causal role in the contraction of lung cancer.
Using a Comparison to Identify Causality A randomized controlled experiment results when investigators assign subjects to the two treatments in a random fashion. When statistical significance is observed in such an experiment, the investigator and other interested parties will have more confidence in the conclusion that the difference in response has been caused by a difference in treatments.
Large-Sample Tests
Example 9.4 What impact does fast-food consumption have on various dietary and health characteristics? The article “Effects of Fast-Food Consumption on Energy Intake and Diet Quality Among Children in a National Household Study” (Pediatrics, 2004:112–118) reported the accompanying summary data on daily calorie intake both for a sample of teens who said they did not typically eat fast food and another sample of teens who said they did usually eat fast food.
Example 9.4 cont’d Does this data provide strong evidence for concluding that true average calorie intake for teens who typically eat fast food exceeds by more than 200 calories per day the true average intake for those who don’t typically eat fast food? Let’s investigate by carrying out a test of hypotheses at a significance level of approximately .05.
Example 9.4 cont’d The parameter of interest is μ1 – μ2, where μ1 is the true average calorie intake for teens who don’t typically eat fast food and μ2 is the true average intake for teens who do typically eat fast food. The hypotheses of interest are H0: μ1 – μ2 = –200 versus Ha: μ1 – μ2 < –200. The alternative hypothesis asserts that true average daily intake for those who typically eat fast food exceeds that for those who don’t by more than 200 calories.
Example 9.4 cont’d The test statistic value is z = (x̄ – ȳ – (–200))/√(s1²/m + s2²/n). The inequality in Ha implies that the test is lower-tailed; H0 should be rejected if z ≤ –z.05 = –1.645. The calculated test statistic value is z = –2.20.
Example 9.4 cont’d The P-value is Φ(–2.20) = .0139. Since –2.20 ≤ –1.645 (equivalently, .0139 ≤ .05), the null hypothesis is rejected. At a significance level of .05, it does appear that true average daily calorie intake for teens who typically eat fast food exceeds by more than 200 the true average intake for those who don’t typically eat such food.
Example 9.4 cont’d However, the P-value is not small enough to justify rejecting H0 at significance level .01. Notice that if the label μ1 had instead been used for the fast-food condition and μ2 had been used for the no-fast-food condition, then 200 would have replaced –200 in both hypotheses and Ha would have contained the inequality >, implying an upper-tailed test. The resulting test statistic value would have been 2.20, giving the same P-value as before.
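The slide's summary-data table did not survive extraction, so the sample sizes, means, and standard deviations below are assumed values chosen to be consistent with the z = –2.20 reported in the example; the sketch shows how the large-sample test with a nonzero null value Δ0 = –200 is computed:

```python
from math import sqrt
from scipy.stats import norm

# Assumed summary values consistent with z = -2.20 in Example 9.4
m, xbar, s1 = 663, 2258.0, 1519.0   # no-fast-food sample
n, ybar, s2 = 413, 2637.0, 1138.0   # fast-food sample
delta0 = -200.0                     # null value of mu1 - mu2

# Large-sample z statistic (sample variances replace population ones)
z = (xbar - ybar - delta0) / sqrt(s1**2 / m + s2**2 / n)
p_value = norm.cdf(z)               # lower-tailed test

print(round(z, 2))        # about -2.20
print(round(p_value, 4))  # 0.0139
```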
9.2 The Two-Sample t Test and Confidence Interval
The Two-Sample t Test and Confidence Interval Assumptions: We could, for example, assume that both population distributions are members of the Weibull family or that they are both Poisson distributions. It shouldn’t surprise you to learn that normality is typically the most reasonable assumption.
Example: Among the n1 = 10 subjects who followed diet A, the mean weight loss was x̄1 = 4.5 lb with a standard deviation of s1 = 6.5 lb. Among the n2 = 10 subjects who followed diet B, the mean weight loss was x̄2 = 3.2 lb with a standard deviation of s2 = 4.5 lb. Test the claim that the mean weight loss of diet A is more than that of diet B. Assume the two populations have the same variance. Use α = 0.05.
Example The parameters about which the claim is made are μ1 and μ2, the true mean weight losses for diets A and B; the claim corresponds to H0: μ1 – μ2 = 0 versus Ha: μ1 – μ2 > 0. Assume equal population variances. Test statistic: t = (x̄1 – x̄2)/(sp√(1/n1 + 1/n2)), where sp² = [(n1 – 1)s1² + (n2 – 1)s2²]/(n1 + n2 – 2) = 31.25; thus t = 1.3/2.5 = 0.52 with n1 + n2 – 2 = 18 df.
Example P-value = 0.305 > α = 0.05. Technical conclusion: Do not reject H0 Final conclusion: There is not sufficient evidence to support the claim that the mean weight loss from diet A is more than the mean weight loss from diet B.
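The pooled two-sample t computation for this diet example can be sketched as follows (scipy's t.sf gives the upper-tail area):

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary data for the diet example
n1, x1, s1 = 10, 4.5, 6.5   # diet A
n2, x2, s2 = 10, 3.2, 4.5   # diet B

# Pooled variance, under the equal-population-variance assumption
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (x1 - x2) / sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

p_value = t_dist.sf(t, df)  # upper-tailed: Ha is mu1 - mu2 > 0

print(round(t, 2))          # 0.52
print(round(p_value, 3))    # 0.305
```

Since 0.305 > α = 0.05, the script agrees with the slide's conclusion not to reject H0.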
9.3 Analysis of Paired Data
Analysis of Paired Data We considered making an inference about a difference between two means μ1 and μ2. This was done by utilizing the results of a random sample X1, X2, …, Xm from the distribution with mean μ1 and a completely independent (of the X’s) sample Y1, …, Yn from the distribution with mean μ2. That is, either m individuals were selected from population 1 and n different individuals from population 2, or m individuals (or experimental objects) were given one treatment and another set of n individuals were given the other treatment.
Analysis of Paired Data In contrast, there are a number of experimental situations in which there is only one set of n individuals or experimental objects; making two observations on each one results in a natural pairing of values.
Analysis of Paired Data Assumptions
The Paired t Test
Example 9.9 Musculoskeletal neck-and-shoulder disorders are all too common among office staff who perform repetitive tasks using visual display units. The article “Upper-Arm Elevation During Office Work” (Ergonomics, 1996: 1221 – 1230) reported on a study to determine whether more varied work conditions would have any impact on arm movement.
Example 9.9 cont’d The accompanying data was obtained from a sample of n = 16 subjects.
Example 9.9 cont’d Each observation is the amount of time, expressed as a proportion of total time observed, during which arm elevation was below 30°. The two measurements from each subject were obtained 18 months apart. During this period, work conditions were changed, and subjects were allowed to engage in a wider variety of work tasks. Does the data suggest that true average time during which elevation is below 30° differs after the change from what it was before the change?
Example 9.9 cont’d Figure 9.5 shows a normal probability plot of the 16 differences; the pattern in the plot is quite straight, supporting the normality assumption. A normal probability plot from Minitab of the differences in Example 9 Figure 9.5
Example 9.9 cont’d A boxplot of these differences appears in Figure 9.6; the boxplot is located considerably to the right of zero, suggesting that perhaps D > 0 (note also that 13 of the 16 differences are positive and only two are negative). A boxplot of the differences in Example 9.9 Figure 9.6
Example 9.9 cont’d Let’s now test the appropriate hypotheses. 1. Let μD denote the true average difference between elevation time before the change in work conditions and time after the change. 2. H0: μD = 0 (there is no difference between true average time before the change and true average time after the change) 3. Ha: μD ≠ 0
Example 9.9 cont’d 4. The test statistic value is t = d̄/(sD/√n). 5. n = 16, Σdi = 108, and Σdi² = 1746, from which d̄ = 6.75, sD = 8.234, and t = 6.75/(8.234/√16) = 3.28 ≈ 3.3. 6. Appendix Table A.8 shows that the area to the right of 3.3 under the t curve with 15 df is .002. The inequality in Ha implies that a two-tailed test is appropriate, so the P-value is approximately 2(.002) = .004 (Minitab gives .0051).
Example 9.9 cont’d 7. Since .004 < .01, the null hypothesis can be rejected at either significance level .05 or .01. It does appear that the true average difference between times is something other than zero; that is, true average time after the change is different from that before the change.
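Steps 4–6 of the paired analysis can be reproduced directly from the summary quantities Σdi and Σdi² (a sketch using scipy's t distribution):

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary quantities from Example 9.9
n, sum_d, sum_d2 = 16, 108, 1746

dbar = sum_d / n                                  # 6.75
sD = sqrt((sum_d2 - sum_d**2 / n) / (n - 1))      # about 8.234

# Paired t statistic and two-tailed P-value with n - 1 df
t = dbar / (sD / sqrt(n))
p_value = 2 * t_dist.sf(abs(t), n - 1)

print(round(t, 2))          # 3.28
print(round(p_value, 4))    # about 0.0051, matching Minitab
```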
9.4 Inferences Concerning a Difference Between Population Proportions
Inferences Concerning a Difference Between Population Proportions Proposition
Example 9.11 The article “Aspirin Use and Survival After Diagnosis of Colorectal Cancer” (J. of the Amer. Med. Assoc., 2009: 649–658) reported that of 549 study participants who regularly used aspirin after being diagnosed with colorectal cancer, there were 81 colorectal cancer-specific deaths, whereas among 730 similarly diagnosed individuals who did not subsequently use aspirin, there were 141 colorectal cancer-specific deaths. Does this data suggest that the regular use of aspirin after diagnosis will decrease the incidence rate of colorectal cancer-specific deaths? Let’s test the appropriate hypotheses using a significance level of .05.
Example 9.11 cont’d The parameter of interest is the difference p1 – p2, where p1 is the true proportion of deaths for those who regularly used aspirin and p2 is the true proportion of deaths for those who did not use aspirin. The use of aspirin is beneficial if p1 < p2, which corresponds to a negative difference between the two proportions. The relevant hypotheses are therefore H0: p1 – p2 = 0 versus Ha: p1 – p2 < 0.
Example 9.11 cont’d Parameter estimates are p̂1 = 81/549 = .1475, p̂2 = 141/730 = .1932, and p̂ = (81 + 141)/(549 + 730) = .1736. A z test is appropriate here because all of mp̂1, m(1 – p̂1), np̂2, and n(1 – p̂2) are at least 10. The resulting test statistic value is z = (.1475 – .1932)/√((.1736)(.8264)(1/549 + 1/730)) = –2.14. The corresponding P-value for a lower-tailed z test is Φ(–2.14) = .0162.
Example 9.11 cont’d Because .0162 ≤ .05, the null hypothesis can be rejected at significance level .05. So anyone adopting this significance level would be convinced that the use of aspirin in these circumstances is beneficial. However, someone looking for more compelling evidence might select a significance level of .01 and then not be persuaded.
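The pooled two-proportion z test of Example 9.11 can be sketched as below. Note that carrying full precision in the proportions gives z ≈ –2.13; the slide's –2.14 comes from rounding the p̂'s to four decimal places first, so the two P-values differ slightly:

```python
from math import sqrt
from scipy.stats import norm

# Counts from Example 9.11
x1, n1 = 81, 549     # aspirin users: colorectal cancer-specific deaths
x2, n2 = 141, 730    # non-users

p1, p2 = x1 / n1, x2 / n2
phat = (x1 + x2) / (n1 + n2)          # pooled proportion estimate

# Pooled two-proportion z statistic for H0: p1 - p2 = 0
z = (p1 - p2) / sqrt(phat * (1 - phat) * (1 / n1 + 1 / n2))
p_value = norm.cdf(z)                 # lower-tailed test

print(round(z, 2))    # about -2.13 (slide's -2.14 rounds the p-hats first)
```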
9.5 Inferences Concerning Two Population Variances
The F Distribution
The F Distribution The F probability distribution has two parameters, denoted by v1 and v2. The parameter v1 is called the number of numerator degrees of freedom, and v2 is the number of denominator degrees of freedom; here v1 and v2 are positive integers. A random variable that has an F distribution cannot assume a negative value. Since the density function is complicated and will not be used explicitly, we omit the formula. There is an important connection between an F variable and chi-squared variables.
The F Distribution If X1 and X2 are independent chi-squared rv’s with v1 and v2 df, respectively, then the rv F = (X1/v1)/(X2/v2) (the ratio of the two chi-squared variables, each divided by its respective degrees of freedom) can be shown to have an F distribution. (9.8)
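Relation (9.8) can be checked by simulation. In the sketch below (numpy/scipy, with arbitrarily chosen degrees of freedom), the empirical distribution of the ratio of scaled chi-squared variables agrees closely with the F cdf:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
v1, v2 = 4, 10                     # arbitrary degrees of freedom
N = 100_000

# Independent chi-squared rv's, each divided by its df
x1 = rng.chisquare(v1, size=N) / v1
x2 = rng.chisquare(v2, size=N) / v2
ratio = x1 / x2                    # per (9.8), should follow F(v1, v2)

# Compare an empirical proportion with the theoretical F cdf
emp = (ratio <= 1.0).mean()
theo = f_dist.cdf(1.0, v1, v2)
print(round(emp, 3), round(theo, 3))  # close agreement
```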
The F Distribution Figure 9.7 illustrates the graph of a typical F density function. An F density curve and critical value Figure 9.7
The F Test for Equality of Variances
Example 9.14 A random sample of 200 vehicles traveling on gravel roads in a county with a posted speed limit of 35 mph on such roads resulted in a sample mean speed of 37.5 mph and a sample standard deviation of 8.6 mph, whereas another random sample of 200 vehicles in a county with a posted speed limit of 55 mph resulted in a sample mean and sample standard deviation of 35.8 mph and 9.2 mph, respectively (these means and standard deviations were reported in the article “Evaluation of Criteria for Setting Speed Limits on Gravel Roads” (J. of Transp. Engr., 2011: 57–63); the actual sample sizes result in dfs that exceed the largest of those in our F table).
Example 9.14 Let’s carry out a test at significance level .10 to decide whether the two population distribution variances are identical. 1. σ1² is the variance of the speed distribution on the 35 mph roads, and σ2² is the variance of the speed distribution on the 55 mph roads. 2. H0: σ1² = σ2² 3. Ha: σ1² ≠ σ2² 4. Test statistic: f = s1²/s2²
Example 9.14 cont’d 5. Calculation: f = (8.6)²/(9.2)² = .87 6. P-value determination: .87 lies in the lower tail of the F curve with 199 numerator df and 199 denominator df. A glance at the F table shows that F.10,199,199 ≈ F.10,200,200 ≈ 1.20 (consult the v1 = 120 and v1 = 1000 columns), implying F.90,199,199 ≈ 1/1.20 = .83 (these values are confirmed by software). That is, the area under the relevant F curve to the left of .83 is .10. Thus the area under the curve to the left of .87 exceeds .10, and so P-value > 2(.10) = .2 (software gives .342).
Example 9.14 7. The P-value clearly exceeds the mandated significance level. The null hypothesis therefore cannot be rejected; it is plausible that the two speed distribution variances are identical. The sample sizes in the cited article were 2665 and 1868, respectively, and the P-value reported there was .0008. So for the actual data, the hypothesis of equal variances would be rejected not only at significance level .10—in contrast to our conclusion—but also at level .05, .01, and even .001.
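Steps 5 and 6 can be reproduced with scipy's F distribution (a sketch using 199 df in each position, as in the example; the two-tailed P-value doubles the smaller tail area):

```python
from scipy.stats import f as f_dist

s1, s2 = 8.6, 9.2        # sample standard deviations
df1, df2 = 199, 199      # m - 1 and n - 1

fstat = s1**2 / s2**2    # about 0.87

# Two-tailed P-value: double the smaller of the two tail areas
lower = f_dist.cdf(fstat, df1, df2)
p_value = 2 * min(lower, 1 - lower)

print(round(fstat, 2))    # 0.87
print(round(p_value, 3))  # about 0.342, matching the software value
```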
Example 9.14 cont’d This illustrates again how quite large sample sizes can magnify a small difference in estimated values. Note also that the sample mean speed for the county with the lower posted speed limit was higher than for the county with the higher limit, a counterintuitive result that surprised the investigators; and because of the very large sample sizes, this difference in means is highly statistically significant.