Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 13: Inference About Comparing Two Populations
Comparing Two Populations…
Previously we looked at techniques to estimate and test parameters for one population: the population mean $\mu$, the population variance $\sigma^2$, and the population proportion $p$.
We will still consider these parameters when we are looking at two populations; however, our interest will now be:
The difference between two means, $\mu_1 - \mu_2$.
The ratio of two variances, $\sigma_1^2 / \sigma_2^2$.
The difference between two proportions, $p_1 - p_2$.
Difference of Two Means…
In order to test and estimate the difference between two population means, we draw random samples from each of two populations. Initially, we will consider independent samples, that is, samples that are completely unrelated to one another.
Population 1 has parameters $\mu_1$ and $\sigma_1^2$; a sample of size $n_1$ yields the statistics $\bar{x}_1$ and $s_1^2$. (Likewise, we consider $\mu_2$, $\sigma_2^2$, $n_2$, $\bar{x}_2$, and $s_2^2$ for Population 2.)
Because we are comparing two population means, we use the statistic $\bar{x}_1 - \bar{x}_2$.
Sampling Distribution of $\bar{x}_1 - \bar{x}_2$
1. $\bar{x}_1 - \bar{x}_2$ is normally distributed if the original populations are normal, or approximately normal if the populations are nonnormal and the sample sizes are large ($n_1, n_2 > 30$).
2. The expected value of $\bar{x}_1 - \bar{x}_2$ is $\mu_1 - \mu_2$.
3. The variance of $\bar{x}_1 - \bar{x}_2$ is $\sigma_1^2/n_1 + \sigma_2^2/n_2$, and the standard error is $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$.
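The three properties of the sampling distribution can be sketched in a few lines of Python. This is only an illustration: the function name `diff_means_sampling` and the population values are hypothetical and do not come from the chapter.

```python
import math

def diff_means_sampling(mu1, mu2, var1, var2, n1, n2):
    """Expected value, variance, and standard error of x-bar1 minus x-bar2
    for two independent samples."""
    expected = mu1 - mu2
    variance = var1 / n1 + var2 / n2
    return expected, variance, math.sqrt(variance)

# Hypothetical population values, chosen only for illustration.
expected, variance, std_error = diff_means_sampling(
    mu1=100.0, mu2=90.0, var1=16.0, var2=9.0, n1=4, n2=9)
print(expected, variance, std_error)
```

Note that the variances add even though the means are subtracted, because the two samples are independent.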
Making Inferences About $\mu_1 - \mu_2$
Since $\bar{x}_1 - \bar{x}_2$ is normally distributed if the original populations are normal, or approximately normal if the populations are nonnormal and the sample sizes are large ($n_1, n_2 > 30$), then
$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
is a standard normal (or approximately normal) random variable. We could use this to build test statistics or confidence interval estimators for $\mu_1 - \mu_2$…
…except that, in practice, the z statistic is rarely used since the population variances are unknown. Instead we use a t-statistic. We consider two cases for the unknown population variances: when we believe they are equal ($\sigma_1^2 = \sigma_2^2$) and when we believe they are not ($\sigma_1^2 \ne \sigma_2^2$).

When are variances equal?
How do we know when the population variances are equal? Since the population variances are unknown, we can't know for certain whether they're equal, but we can examine the sample variances and informally judge their relative values to decide whether it is reasonable to assume that the population variances are equal.
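Test Statistic for $\mu_1 - \mu_2$ (equal variances)
1) Calculate $s_p^2$, the pooled variance estimator, as
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
2) …and use it here:
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\,(1/n_1 + 1/n_2)}}$
with degrees of freedom $\nu = n_1 + n_2 - 2$.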
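A minimal sketch of the two-step calculation, assuming only summary statistics are available. The function name `pooled_t` and the sample figures are hypothetical, picked so the arithmetic is easy to follow by hand.

```python
import math

def pooled_t(mean1, mean2, var1, var2, n1, n2, hypothesized_diff=0.0):
    """Equal-variances t-statistic from summary statistics.

    var1 and var2 are the sample variances s1^2 and s2^2.
    Returns (pooled variance, t-statistic, degrees of freedom).
    """
    df = n1 + n2 - 2
    # Step 1: pooled variance estimator
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    # Step 2: t-statistic using the pooled variance
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = ((mean1 - mean2) - hypothesized_diff) / std_error
    return pooled_var, t_stat, df

# Hypothetical summary statistics, for illustration only.
pooled_var, t_stat, df = pooled_t(10.0, 8.0, 4.0, 6.0, 10, 10)
print(pooled_var, t_stat, df)
```

With equal sample sizes the pooled variance is simply the average of the two sample variances; with unequal sizes the larger sample gets more weight.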
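CI Estimator for $\mu_1 - \mu_2$ (equal variances)
The confidence interval estimator for $\mu_1 - \mu_2$ when the population variances are equal is given by
$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{s_p^2\,(1/n_1 + 1/n_2)}$
where $s_p^2$ is the pooled variance estimator and the degrees of freedom are $\nu = n_1 + n_2 - 2$.

Test Statistic for $\mu_1 - \mu_2$ (unequal variances)
The test statistic for $\mu_1 - \mu_2$ when the population variances are unequal is given by
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
with degrees of freedom
$\nu = \dfrac{(s_1^2/n_1 + s_2^2/n_2)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$
Likewise, the confidence interval estimator is
$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{s_1^2/n_1 + s_2^2/n_2}$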
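The unequal-variances case can be sketched the same way. The function name `unequal_variances_t` and the numbers are hypothetical; the degrees of freedom use the standard approximation for this test and generally come out as a non-integer.

```python
import math

def unequal_variances_t(mean1, mean2, var1, var2, n1, n2, hypothesized_diff=0.0):
    """Unequal-variances t-statistic and its approximate degrees of freedom.

    var1 and var2 are the sample variances s1^2 and s2^2.
    """
    a = var1 / n1
    b = var2 / n2
    t_stat = ((mean1 - mean2) - hypothesized_diff) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_stat, df

# Hypothetical summary statistics with very different sample variances.
t_stat, df = unequal_variances_t(10.0, 14.0, 4.0, 36.0, 10, 10)
equal_variances_df = 10 + 10 - 2
print(t_stat, df, equal_variances_df)
```

For these figures the unequal-variances degrees of freedom are about 11, well below the 18 that the equal-variances test would use on the same samples.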
Which case to use?
Equal variance or unequal variance? Whenever there is insufficient evidence that the variances are unequal, it is preferable to perform the equal-variances t-test. This is so because, for any two given samples:
(number of degrees of freedom for the equal-variances case) ≥ (number of degrees of freedom for the unequal-variances case)
Larger numbers of degrees of freedom have the same effect as having larger sample sizes.
Example 13.1… (IDENTIFY)
Do people who eat high-fiber cereal for breakfast consume, on average, fewer calories for lunch than people who do not eat high-fiber cereal for breakfast?
What are we trying to show? What is our research hypothesis?
The mean caloric intake of high-fiber cereal eaters ($\mu_1$) is less than that of non-consumers ($\mu_2$); i.e., is $\mu_1 < \mu_2$?
This translates to $\mu_1 - \mu_2 < 0$. Thus, $H_1: (\mu_1 - \mu_2) < 0$.
Hence our null hypothesis becomes $H_0: (\mu_1 - \mu_2) = 0$.
Phrase $H_0$ and $H_1$ as a "difference of means".
Example 13.1…
A sample of 150 people was randomly drawn. Each person was identified as a consumer or a non-consumer of high-fiber cereal. For each person the number of calories consumed at lunch was recorded.
The populations are independent (either you eat high-fiber cereal or you don't), and $n_1 + n_2 = 150$.
There is reason to believe the population variances are unequal… Recall $H_1: (\mu_1 - \mu_2) < 0$.

Example 13.1… (COMPUTE)
Thus, our test statistic is the unequal-variances t-statistic, with the corresponding number of degrees of freedom $\nu$. Since $H_1$ is a "less than" hypothesis, the rejection region is $t < -t_{\alpha,\nu}$.

Example 13.1… (COMPUTE)
Our rejection region: $t < -1.658$. Our test statistic: $t = -2.09$.
Since our test statistic (−2.09) is less than our critical value of t (−1.658), we reject $H_0$ in favor of $H_1$; that is, there is sufficient evidence to support the claim that high-fiber cereal eaters consume fewer calories at lunch.
Example 13.1… (COMPUTE)
Likewise, we can use Excel to do the calculations… Recall $H_0: (\mu_1 - \mu_2) = 0$.

Example 13.1… (INTERPRET)
…however, we still need to be able to interpret the Excel output. Compare the t-statistic with the critical value, or look at the p-value.
Beware! Excel gives a right-tail critical value, i.e. $t_{\alpha,\nu}$, whereas this left-tail test uses $-t_{\alpha,\nu}$.
Confidence Interval…
Suppose we wanted to compute a 95% confidence interval estimate of the difference between mean caloric intake for consumers and non-consumers of high-fiber cereals…
That is, we estimate that non-consumers of high-fiber cereal eat between 1.56 and more calories than consumers.

Confidence Interval…
Alternatively, you can use the Estimators workbook… values in boldface are calculated for you…
Example 13.2… (IDENTIFY)
Two methods are being tested for assembling office chairs. Assembly times are recorded (25 times for each method). At a 5% significance level, do the assembly times for the two methods differ?
That is, $H_1: (\mu_1 - \mu_2) \ne 0$. Hence, our null hypothesis becomes $H_0: (\mu_1 - \mu_2) = 0$.
Reminder: since our research hypothesis $H_1$ is a "not equals" type, it is a two-tailed test.

Example 13.2… (COMPUTE)
The assembly times for each of the two methods are recorded and preliminary data are prepared…
The sample variances are similar, hence we will assume that the population variances are equal…
Example 13.2… (COMPUTE)
Recall, we are doing a two-tailed test, hence the rejection region will be $t < -t_{\alpha/2,\nu}$ or $t > t_{\alpha/2,\nu}$.
The number of degrees of freedom is $\nu = n_1 + n_2 - 2 = 25 + 25 - 2 = 48$.
Hence our rejection region becomes $t < -t_{.025,48}$ or $t > t_{.025,48}$.

Example 13.2… (COMPUTE)
In order to calculate our t-statistic, we need to first calculate the pooled variance estimator $s_p^2$, followed by the t-statistic…
Example 13.2… (INTERPRET)
Since our calculated t-statistic does not fall into the rejection region, we cannot reject $H_0$ in favor of $H_1$; that is, there is not sufficient evidence to infer that the mean assembly times differ.

Example 13.2… (INTERPRET)
Excel, of course, also provides us with the information… Compare the t-statistic with the critical value, or look at the p-value.

Confidence Interval…
We can compute a 95% confidence interval estimate for the difference in mean assembly times.
That is, we estimate that the difference between the mean assembly times of the two methods lies between −0.36 and 0.96 minutes.
Note: zero is included in this confidence interval, which is consistent with our failure to reject $H_0$.
Terminology…
If all the observations in one sample appear in one column and all the observations of the second sample appear in another column, the data is unstacked.
If all the data from both samples is in the same column, the data is said to be stacked.
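The two layouts can be illustrated with plain Python lists. This sketch assumes that stacked data carries a second column of group codes (1 or 2) identifying the sample each observation came from; that code column and the function names `stack` and `unstack` are assumptions made for the illustration.

```python
def stack(sample1, sample2):
    """Unstacked (one column per sample) to stacked (value, group code) rows."""
    return [(x, 1) for x in sample1] + [(x, 2) for x in sample2]

def unstack(rows):
    """Stacked (value, group code) rows back to one list per sample."""
    sample1 = [x for x, group in rows if group == 1]
    sample2 = [x for x, group in rows if group == 2]
    return sample1, sample2

# Hypothetical observations, for illustration only.
method_a = [6.8, 5.0, 7.9]
method_b = [5.2, 6.7]
stacked = stack(method_a, method_b)
print(stacked)
print(unstack(stacked))
```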
Identifying Factors I…
Factors that identify the equal-variances t-test and estimator of $\mu_1 - \mu_2$:

Identifying Factors II…
Factors that identify the unequal-variances t-test and estimator of $\mu_1 - \mu_2$:
Matched Pairs Experiment…
Previously when comparing two populations, we examined independent samples.
If, however, an observation in one sample is matched with an observation in a second sample, this is called a matched pairs experiment.
To help understand this concept, let's consider Example 13.4.

Example 13.4…
Is there a difference between starting salaries offered to MBA grads going into Finance vs. Marketing careers? More precisely, are Finance majors offered higher salaries than Marketing majors?
In this experiment, MBAs are grouped by their GPA into 25 groups. Students from the same group (but with different majors) were selected and their highest salary offer recorded.
Here's how the data looks…
Example 13.4…
The numbers in black are the original starting salary data; the numbers in blue were calculated.
Although a student is either in Finance or in Marketing (i.e. independent), grouping the data in this fashion makes it a matched pairs experiment (i.e. the two students in group #1 are "matched" by their GPA range).
The difference of the means is equal to the mean of the differences, hence we will consider the "mean of the paired differences", $\mu_D$, as our parameter of interest.

Example 13.4… (IDENTIFY)
Do Finance majors have higher salary offers than Marketing majors?
Since $\mu_D = \mu_1 - \mu_2$ (Finance minus Marketing), we want to research this hypothesis: $H_1: \mu_D > 0$ (and our null hypothesis becomes $H_0: \mu_D = 0$).
Test Statistic for $\mu_D$
The test statistic for the mean of the population of differences ($\mu_D$) is
$t = \dfrac{\bar{x}_D - \mu_D}{s_D / \sqrt{n_D}}$
which is Student t distributed with $n_D - 1$ degrees of freedom, provided that the differences are normally distributed.
Thus our rejection region becomes $t > t_{\alpha,\,n_D - 1}$.
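A short sketch of the matched pairs calculation, using only the Python standard library. The function name `paired_t` and the four pairs of observations are hypothetical and are not the salary data from the example.

```python
import math
import statistics

def paired_t(sample1, sample2, hypothesized_mean_diff=0.0):
    """Matched pairs t-statistic computed from the paired differences.

    Returns (mean difference, std. dev. of differences, t-statistic, df).
    """
    diffs = [a - b for a, b in zip(sample1, sample2)]
    n_d = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 divisor)
    t_stat = (mean_d - hypothesized_mean_diff) / (sd_d / math.sqrt(n_d))
    return mean_d, sd_d, t_stat, n_d - 1

# Hypothetical matched observations, for illustration only.
first = [12, 15, 11, 14]
second = [10, 12, 10, 11]
mean_d, sd_d, t_stat, df = paired_t(first, second)

# The mean of the differences equals the difference of the means.
diff_of_means = statistics.mean(first) - statistics.mean(second)
print(mean_d, diff_of_means, sd_d, t_stat, df)
```

Although the two means give the same point estimate either way, the standard error here comes from the variation of the differences, which is what distinguishes the matched pairs test from the independent-samples tests.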
Example 13.4… (COMPUTE)
From the data, we calculate $\bar{x}_D$ and $s_D$, which in turn we use for our t-statistic, which we compare to our critical value of t, $t_{.05,24} = 1.711$.

Example 13.4… (INTERPRET)
Since our calculated value of t (3.81) is greater than our critical value of t (1.711), it falls in the rejection region, hence we reject $H_0$ in favor of $H_1$; that is, there is overwhelming evidence (since the p-value = .0004) that Finance majors do obtain higher starting salary offers than their peers in Marketing.
Confidence Interval Estimator for $\mu_D$
We can derive the confidence interval estimator for $\mu_D$ algebraically as
$\bar{x}_D \pm t_{\alpha/2}\,\dfrac{s_D}{\sqrt{n_D}}$
In the previous example, what is the 95% confidence interval estimate of the mean difference in salary offers between the two business majors?
The mean of the population of differences is estimated to lie between LCL = 2,321 and UCL = 7,809 dollars.
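As a consistency check, the reported interval can be worked backwards to the t-statistic of Example 13.4. The sketch below assumes 24 degrees of freedom (25 matched groups) and takes the critical value 2.064 from a standard t-table; that critical value is not stated in the slides.

```python
lcl, ucl = 2321.0, 7809.0   # confidence limits reported for Example 13.4
t_crit = 2.064              # assumed: t_{.025,24} from a standard t-table

mean_diff = (lcl + ucl) / 2       # the interval is centered on x-bar_D
half_width = (ucl - lcl) / 2      # equals t_crit * s_D / sqrt(n_D)
std_error = half_width / t_crit   # recovers s_D / sqrt(n_D)
t_stat = mean_diff / std_error    # test statistic under H0: mu_D = 0

print(mean_diff, round(std_error, 1), round(t_stat, 2))
```

The recovered statistic is roughly 3.81, in line with the value used in the hypothesis test, which shows that the test and the interval are built from the same point estimate and standard error.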
Identifying Factors…
Factors that identify the t-test and estimator of $\mu_D$:
Inference about the ratio of two variances
So far we've looked at comparing measures of central location, namely the means of two populations.
When looking at two population variances, we consider the ratio of the variances, i.e. the parameter of interest to us is $\sigma_1^2 / \sigma_2^2$.
The sampling statistic
$F = \dfrac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}$
is F distributed with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom.

Inference about the ratio of two variances
Our null hypothesis is always $H_0: \sigma_1^2 / \sigma_2^2 = 1$ (i.e. the variances of the two populations are equal, hence their ratio is one).
Therefore, our statistic simplifies to $F = s_1^2 / s_2^2$.
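Under the null hypothesis the statistic is just the ratio of the two sample variances, as this small sketch shows. The function name `variance_ratio_f` and the data are hypothetical.

```python
import statistics

def variance_ratio_f(sample1, sample2):
    """F-statistic s1^2 / s2^2 with its numerator and denominator df."""
    s1_sq = statistics.variance(sample1)  # sample variance (n - 1 divisor)
    s2_sq = statistics.variance(sample2)
    return s1_sq / s2_sq, len(sample1) - 1, len(sample2) - 1

# Hypothetical samples, for illustration only.
f_stat, df1, df2 = variance_ratio_f([2, 4, 6, 8], [1, 2, 3, 4, 5])
print(f_stat, df1, df2)
```

The order of the degrees of freedom matters when looking up critical values: the numerator degrees of freedom come first.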
Example 13.6… (IDENTIFY)
In Example 13.1, we looked at the sample variances for people who consumed high-fiber cereal and those who did not, and assumed the population variances were not equal. We can use the ideas just developed to test whether this is in fact the case.
We want to show $H_1: \sigma_1^2 / \sigma_2^2 \ne 1$ (the variances are not equal to each other).
Hence we have our null hypothesis $H_0: \sigma_1^2 / \sigma_2^2 = 1$.

Example 13.6… (CALCULATE)
Since our research hypothesis is $H_1: \sigma_1^2 / \sigma_2^2 \ne 1$, we are doing a two-tailed test, and our rejection region is $F > F_{\alpha/2,\nu_1,\nu_2}$ or $F < F_{1-\alpha/2,\nu_1,\nu_2}$.

Example 13.6… (CALCULATE)
Our test statistic is $F = s_1^2 / s_2^2$, and its value falls in the rejection region.
Hence there is sufficient evidence to reject the null hypothesis in favor of the alternative; that is, there is a difference in the variance between the two populations.
Example 13.6… (INTERPRET)
We may need to work with the Excel output before drawing conclusions…
Our research hypothesis $H_1: \sigma_1^2 / \sigma_2^2 \ne 1$ requires two-tail testing, but Excel only gives us values for one-tail testing…
If we double the one-tail p-value Excel gives us, we have the p-value of the test we're conducting.
Refer to the text and CD Appendices for more detail.
Example 13.6… (CALCULATE)
If we wanted to determine the 95% confidence interval estimate of the ratio of the two population variances in Example 13.1, we would proceed as follows…
The confidence interval estimator for $\sigma_1^2 / \sigma_2^2$ is
$\text{LCL} = \dfrac{s_1^2}{s_2^2}\cdot\dfrac{1}{F_{\alpha/2,\nu_1,\nu_2}}, \qquad \text{UCL} = \dfrac{s_1^2}{s_2^2}\cdot F_{\alpha/2,\nu_2,\nu_1}$

Example 13.6… (CALCULATE)
For the 95% confidence interval estimate of the ratio of the two population variances in Example 13.1, we estimate that $\sigma_1^2 / \sigma_2^2$ lies between .2388 and .6614.
Note that one (1.00) is not within this interval…

Identifying Factors
Factors that identify the F-test and estimator of $\sigma_1^2 / \sigma_2^2$:
Difference Between Two Population Proportions
We will now look at procedures for drawing inferences about the difference between populations whose data are nominal (i.e. categorical).
As mentioned previously, with nominal data we calculate proportions of occurrences of each type of outcome. Thus, the parameter to be tested and estimated in this section is the difference between two population proportions: $p_1 - p_2$.

Statistic and Sampling Distribution…
To draw inferences about the parameter $p_1 - p_2$, we take a sample from each population, calculate the sample proportions, and look at their difference.
With $x_1$ successes in a sample of size $n_1$ from population 1, the sample proportion is $\hat{p}_1 = x_1 / n_1$ (and likewise $\hat{p}_2 = x_2 / n_2$ for population 2).
$\hat{p}_1 - \hat{p}_2$ is an unbiased estimator for $p_1 - p_2$.
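Sampling Distribution
The statistic $\hat{p}_1 - \hat{p}_2$ is approximately normally distributed if the sample sizes are large enough so that $n_1 p_1$, $n_1(1 - p_1)$, $n_2 p_2$, and $n_2(1 - p_2)$ are all at least 5.
Since it's "approximately normal", we can describe the distribution in terms of its mean, $p_1 - p_2$, and its variance, $\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}$…
…hence this z-variable will also be approximately standard normally distributed:
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}}$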
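The sample-size condition is easy to check in code. In this sketch the threshold of 5 is the usual rule of thumb and is an assumption here, and because the population proportions are unknown the check uses the observed counts of successes and failures in their place. The function name `large_enough` and the counts are hypothetical.

```python
def large_enough(x1, n1, x2, n2, threshold=5):
    """Rough check that the normal approximation is reasonable.

    Uses observed successes and failures in each sample as stand-ins for
    n*p and n*(1 - p), since the population proportions are unknown.
    """
    counts = (x1, n1 - x1, x2, n2 - x2)
    return all(count >= threshold for count in counts)

# Hypothetical counts, for illustration only.
ok_large = large_enough(60, 200, 40, 200)   # plenty of successes and failures
ok_small = large_enough(2, 30, 10, 30)      # only 2 successes in sample 1
print(ok_large, ok_small)
```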
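Testing and Estimating $p_1 - p_2$…
Because the population proportions ($p_1$ and $p_2$) are unknown, the standard error
$\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}$
is unknown. Thus, we have two different estimators for the standard error of $\hat{p}_1 - \hat{p}_2$, which depend upon the null hypothesis. We'll look at these cases on the next slide…

Test Statistic for $p_1 - p_2$…
There are two cases to consider…
Case 1: $H_0: (p_1 - p_2) = 0$. The standard error is estimated with the pooled proportion $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$, giving
$z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\,(1/n_1 + 1/n_2)}}$
Case 2: $H_0: (p_1 - p_2) = D$ with $D \ne 0$. The standard error is estimated with the individual sample proportions, giving
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$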
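The choice between the two cases can be sketched as a single function that switches on the hypothesized difference. This is a minimal illustration using the standard formulas for each case; the function name `two_proportion_z` and the counts are hypothetical, and the p-value shown is for a right-tail ("greater than") alternative.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, hypothesized_diff=0.0):
    """z-statistic and right-tail p-value for H0: p1 - p2 = hypothesized_diff."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    if hypothesized_diff == 0:
        # Case 1: pooled proportion, since H0 says p1 = p2
        pooled = (x1 + x2) / (n1 + n2)
        std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    else:
        # Case 2: separate sample proportions, since H0 says p1 != p2
        std_error = math.sqrt(p1_hat * (1 - p1_hat) / n1
                              + p2_hat * (1 - p2_hat) / n2)
    z = ((p1_hat - p2_hat) - hypothesized_diff) / std_error
    return z, 1 - NormalDist().cdf(z)

# Hypothetical counts: 60 of 200 versus 40 of 200.
z_case1, p_case1 = two_proportion_z(60, 200, 40, 200)
z_case2, p_case2 = two_proportion_z(60, 200, 40, 200, hypothesized_diff=0.03)
print(z_case1, p_case1)
print(z_case2, p_case2)
```

Pooling is only justified in Case 1, because only there does the null hypothesis claim that both samples share one common proportion.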
Example 13.8… (IDENTIFY)
A consumer packaged goods (CPG) company is test marketing two new versions of soap packaging. Version one (bright colors) is distributed in one supermarket, while version two (simple colors) is in another. Since the first version is more expensive, it must outsell the other design; that is, its market share, $p_1$, must be greater than that of the other soap package design, $p_2$.
That is, we want to know: is $p_1 > p_2$? Or, using the language of statistics, $H_1: (p_1 - p_2) > 0$.
Hence our null hypothesis will be $H_0: (p_1 - p_2) = 0$ [case 1].

Example 13.8… (IDENTIFY)
Here is the summary data…
Our null hypothesis is $H_0: (p_1 - p_2) = 0$, i.e. this is a "case 1" type problem, hence we need to calculate the pooled proportion $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$.
Example 13.8… (CALCULATE)
At a 5% significance level, our rejection region is $z > z_{.05} = 1.645$.
The value of our z-statistic is $z = 2.90$.
Since 2.90 > 1.645, we reject $H_0$ in favor of $H_1$; that is, there is enough evidence to infer that the brightly colored design is more popular than the simple design.

Example 13.8… (CALCULATE)
In Excel, we can use the Z-Test: 2 Proportions tool in the Data Analysis Plus package to "crunch the numbers"… Compare the z-statistic with the critical value, or look at the p-value.
Example 13.9… (IDENTIFY)
Suppose in our test marketing of soap packages scenario that instead of just a difference between the two package versions, the brightly colored design had to outsell the simple design by at least 3%.
Our research hypothesis now becomes $H_1: (p_1 - p_2) > .03$, and so our null hypothesis is $H_0: (p_1 - p_2) = .03$.
Since the right-hand side of the $H_0$ equation is not zero, it's a "case 2" type problem.

Example 13.9… (IDENTIFY)
Same summary data as before. Since this is a "case 2" type problem, we don't need to calculate the pooled proportion; we can go straight to z.

Example 13.9… (INTERPRET)
Since our calculated z-statistic (1.15) does not fall into our rejection region, there is not enough evidence to infer that the brightly colored design outsells the other design by 3% or more.
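Confidence Intervals…
The confidence interval estimator for $p_1 - p_2$ is given by
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
and, as you may suspect, it is valid when $n_1\hat{p}_1$, $n_1(1 - \hat{p}_1)$, $n_2\hat{p}_2$, and $n_2(1 - \hat{p}_2)$ are all at least 5.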
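A minimal sketch of the interval calculation, using the unpooled standard error and the standard normal quantile from Python's standard library. The function name `two_proportion_ci` and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    std_error = math.sqrt(p1_hat * (1 - p1_hat) / n1
                          + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_crit * std_error, diff + z_crit * std_error

# Hypothetical counts: 60 of 200 versus 40 of 200.
lower, upper = two_proportion_ci(60, 200, 40, 200)
print(lower, upper)
```

The interval never uses the pooled proportion, because estimation makes no assumption that the two population proportions are equal.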
Example 13.10… (COMPUTE)
Create a 95% confidence interval for the difference between the two proportions of packaged soap sales from Example 13.8.

Identifying Factors…
Factors that identify the z-test and estimator for $p_1 - p_2$: