Chapter 11 Inference for Distributions of Categorical Data

1 Chapter 11 Inference for Distributions of Categorical Data
Section 11.1 Chi-Square Tests for Goodness of Fit

2 Chi-Square Tests for Goodness of Fit
STATE appropriate hypotheses and COMPUTE the expected counts and chi-square test statistic for a chi-square test for goodness of fit. STATE and CHECK the Random, 10%, and Large Counts conditions for performing a chi-square test for goodness of fit. CALCULATE the degrees of freedom and P-value for a chi-square test for goodness of fit. PERFORM a chi-square test for goodness of fit. CONDUCT a follow-up analysis when the results of a chi-square test are statistically significant.

3 Chi-Square Tests for Goodness of Fit
Here’s what the Consumer Affairs Department of Mars, Inc. says about the distribution of color for M&M’S® Milk Chocolate Candies produced at its Hackettstown, New Jersey, factory:

4 Chi-Square Tests for Goodness of Fit
Brown: 12.5%
Red: 12.5%
Yellow: 12.5%
Green: 12.5%
Orange: 25%
Blue: 25%

5 Stating Hypotheses
Jerome’s class wondered if the distribution of color in a large bag of M&M’S Milk Chocolate Candies differed from the claimed distribution.

6 Stating Hypotheses
Here are data from a random sample of 60 candies from that bag:
Brown: 12, Red: 3, Yellow: 7, Green: 9, Orange: 9, Blue: 20

7 Stating Hypotheses
$H_0$: The distribution of color in the large bag of M&M’S Milk Chocolate Candies is the same as the claimed distribution.

8 Stating Hypotheses
$H_a$: The distribution of color in the large bag of M&M’S Milk Chocolate Candies is not the same as the claimed distribution.

9 Stating Hypotheses
We can also write the hypotheses in symbols.

10 Stating Hypotheses
$H_0$: $p_{\text{brown}} = 0.125$, $p_{\text{red}} = 0.125$, $p_{\text{yellow}} = 0.125$, $p_{\text{green}} = 0.125$, $p_{\text{orange}} = 0.25$, $p_{\text{blue}} = 0.25$
$H_a$: At least two of the $p_i$’s are incorrect,
where $p_{\text{color}}$ = the true proportion of M&M’S Milk Chocolate Candies in the large bag of that color.

11 Stating Hypotheses
CAUTION: Don’t state the alternative hypothesis in a way that suggests that all the proportions in the hypothesized distribution are wrong.

12 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
How many candies of each color should Jerome’s class expect to find in their sample of 60 candies?

13 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
Brown: (60)(0.125) = 7.5
Red: (60)(0.125) = 7.5
Yellow: (60)(0.125) = 7.5

14 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
Green: (60)(0.125) = 7.5
Orange: (60)(0.25) = 15
Blue: (60)(0.25) = 15

15 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
Calculating Expected Counts in a Chi-Square Test for Goodness of Fit
The expected count for category i in the distribution of a categorical variable is $np_i$, where $p_i$ is the relative frequency for category i specified by the null hypothesis.
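The expected-count rule $np_i$ can be sketched in a few lines of Python. This is only an illustration: the names `claimed` and `expected_counts` are made up here, and the proportions are the ones Mars, Inc. states for this factory.

```python
# Claimed color distribution from Mars, Inc. (the null hypothesis proportions).
claimed = {"Brown": 0.125, "Red": 0.125, "Yellow": 0.125,
           "Green": 0.125, "Orange": 0.25, "Blue": 0.25}

def expected_counts(n, proportions):
    """Return the expected count n * p_i for every category."""
    return {category: n * p for category, p in proportions.items()}

# Jerome's class sampled n = 60 candies.
expected = expected_counts(60, claimed)
for color, count in expected.items():
    print(f"{color}: {count}")
```

Expected counts do not have to be whole numbers, and they always add up to the sample size.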

17 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
How likely is it that differences this large or larger would occur just by chance in random samples of size 60 from the population distribution claimed by Mars, Inc.?

18 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
The statistic we use to make the comparison is the chi-square test statistic $\chi^2$.

19 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
The chi-square test statistic is a measure of how far the observed counts are from the expected counts. The formula for the statistic is
$$\chi^2 = \sum \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}}$$
where the sum is over all possible values of the categorical variable.

20 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
For Jerome’s data, we add six terms, one for each color category:
$$\chi^2 = \frac{(12-7.5)^2}{7.5} + \frac{(3-7.5)^2}{7.5} + \frac{(7-7.5)^2}{7.5} + \frac{(9-7.5)^2}{7.5} + \frac{(9-15)^2}{15} + \frac{(20-15)^2}{15}$$

21 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
$$= 2.7 + 2.7 + 0.03 + 0.3 + 2.4 + 1.67 = 9.8$$
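The same arithmetic can be checked with a short Python sketch. The function name `chi_square_statistic` is made up for this illustration, and the counts are assumed to be listed in the order Brown, Red, Yellow, Green, Orange, Blue.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Counts in the order Brown, Red, Yellow, Green, Orange, Blue.
observed = [12, 3, 7, 9, 9, 20]
expected = [7.5, 7.5, 7.5, 7.5, 15, 15]

chi_sq = chi_square_statistic(observed, expected)
print(round(chi_sq, 2))  # 9.8
```

Each term is divided by its own expected count, so a difference of 4.5 candies counts for more in a category where only 7.5 were expected than it would where 15 were expected.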

22 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
Problem: Carrie made a 6-sided die in her ceramics class and rolled it 90 times to test if each side was equally likely to show up. The table summarizes the outcomes of her 90 rolls. (a) State the hypotheses that Carrie should test. (b) Calculate the expected count for each of the possible outcomes. (c) Calculate the value of the chi-square test statistic.

23 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
(a) $H_0$: The sides of Carrie’s die are equally likely to show up.
$H_a$: The sides of Carrie’s die are not equally likely to show up.

24 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
(b) If $H_0$ is true, each of the 6 sides should show up 1/6 of the time. The expected count is 90(1/6) = 15 for each side.

25 Comparing Observed and Expected Counts: The Chi-Square Test Statistic
(c) $$\chi^2 = \frac{(12-15)^2}{15} + \frac{(28-15)^2}{15} + \frac{(12-15)^2}{15} + \frac{(13-15)^2}{15} + \frac{(10-15)^2}{15} + \frac{(15-15)^2}{15}$$
$$= 0.60 + 11.27 + 0.60 + 0.27 + 1.67 + 0 = 14.41$$

26 The Chi-Square Distributions and P-Values
We used software to simulate taking 1000 random samples of size 60 from the population distribution of M&M’S Milk Chocolate Candies given by Mars, Inc. Here are the values of the chi-square test statistic for these 1000 samples.

28 The Chi-Square Distributions and P-Values
Larger values of $\chi^2$ give more convincing evidence against $H_0$ and in favor of $H_a$.

29 The Chi-Square Distributions and P-Values
Because 87 of the 1000 simulated samples gave a $\chi^2$ value of 9.8 or larger, our estimated P-value is 87/1000 = 0.087.
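A simulation like the one described can be sketched with Python's standard library. This is not the software used for the slides. The seed and the helper names are arbitrary choices, so the estimate will differ somewhat from 0.087 from run to run and seed to seed.

```python
import random

colors = ["Brown", "Red", "Yellow", "Green", "Orange", "Blue"]
claimed = [0.125, 0.125, 0.125, 0.125, 0.25, 0.25]
n = 60
expected = [n * p for p in claimed]

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_statistic(rng):
    """Draw one random sample of n candies under H0 and return its chi-square value."""
    sample = rng.choices(range(len(colors)), weights=claimed, k=n)
    observed = [sample.count(i) for i in range(len(colors))]
    return chi_square_statistic(observed, expected)

rng = random.Random(11)  # arbitrary seed, only for repeatability
observed_stat = chi_square_statistic([12, 3, 7, 9, 9, 20], expected)
trials = 1000
simulated = [simulate_statistic(rng) for _ in range(trials)]

# The small tolerance guards against floating-point ties with the observed value.
p_hat = sum(s >= observed_stat - 1e-9 for s in simulated) / trials
print(f"estimated P-value: {p_hat:.3f}")
```

The estimated P-value is just the fraction of simulated samples whose statistic is at least as large as the one Jerome's class observed.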

30 The Chi-Square Distributions and P-Values
A chi-square distribution is defined by a density curve that takes only nonnegative values and is skewed to the right. A particular chi-square distribution is specified by its degrees of freedom.

32 The Chi-Square Distributions and P-Values
As the degrees of freedom (df) increase, the density curves become less skewed.
The mean of a particular chi-square distribution is equal to its degrees of freedom.
For df > 2, the mode (peak) of the chi-square density curve is at df – 2.
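These two facts about the mean and the mode can be checked numerically. The sketch below uses the standard chi-square density formula, which is not given on the slides, and a simple grid. The function names, the grid size, and the upper limit of 100 are arbitrary choices for illustration.

```python
import math

def chi2_pdf(x, df):
    """Chi-square density: x^(df/2 - 1) * e^(-x/2) / (2^(df/2) * Gamma(df/2))."""
    if x <= 0:
        return 0.0
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

def numeric_mean_and_mode(df, upper=100.0, steps=100_000):
    """Approximate the mean (midpoint-rule integral) and the mode (grid search)."""
    width = upper / steps
    mean, mode, peak = 0.0, 0.0, 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        density = chi2_pdf(x, df)
        mean += x * density * width
        if density > peak:
            peak, mode = density, x
    return mean, mode

for df in (3, 5, 8):
    mean, mode = numeric_mean_and_mode(df)
    print(f"df={df}: mean is about {mean:.2f}, mode is about {mode:.2f}")
```

For each df the numerical mean should land near df and the numerical mode near df – 2, up to the resolution of the grid.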

33 The Chi-Square Distributions and P-Values
There are 6 color categories for M&M’S® Milk Chocolate Candies, so df = 6 – 1 = 5.

34 The Chi-Square Distributions and P-Values
The P-value is the probability of getting a $\chi^2$ value as large as or larger than 9.8 when $H_0$ is true.

36 The Chi-Square Distributions and P-Values
The P-value for a test based on Jerome’s data is between 0.05 and 0.10: in the df = 5 row of a chi-square table, 9.8 falls between the critical values 9.24 (tail area 0.10) and 11.07 (tail area 0.05).

38 The Chi-Square Distributions and P-Values
CAUTION: Failing to reject $H_0$ does not mean that the null hypothesis is true!

39 The Chi-Square Distributions and P-Values
Problem: Carrie made a 6-sided die in her ceramics class and rolled it 90 times to test if each side was equally likely to show up. The table summarizes the observed and expected counts. In the preceding example, we calculated $\chi^2$ = 14.41. Find the P-value using Table C. Then calculate a more precise value using technology.

40 The Chi-Square Distributions and P-Values
df = 6 – 1 = 5
Using Table C: the P-value is between 0.01 and 0.02.
Using technology: $\chi^2$cdf(lower: 14.41, upper: 10000, df: 5) ≈ 0.0132
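For readers without a calculator at hand, the right-tail area can also be sketched in plain Python. The helper `chi2_sf` below is a made-up name, not a calculator or library function. It uses the closed-form expressions for the chi-square upper tail that hold when df is a whole number. If SciPy happens to be installed, `scipy.stats.chi2.sf(x, df)` should give the same area.

```python
import math

def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j! terms.
        series = sum(half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * series
    # Odd df: a normal-tail term plus a finite sum with half-integer powers.
    series = sum(half ** (j + 0.5) / math.gamma(j + 1.5) for j in range(df // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series

p_carrie = chi2_sf(14.41, 5)  # Carrie's die, df = 5
p_jerome = chi2_sf(9.8, 5)    # Jerome's M&M'S sample, df = 5
print(f"Carrie: {p_carrie:.4f}")
print(f"Jerome: {p_jerome:.4f}")
```

Computing the tail area directly plays the same role as setting the calculator's upper bound to a very large number such as 10000.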

41 Conditions for Performing a Chi-Square Test for Goodness of Fit
Carrying Out a Test
Conditions for Performing a Chi-Square Test for Goodness of Fit
• Random: The data come from a random sample from the population of interest.
  ◦ 10%: When sampling without replacement, n < 0.10N.
• Large Counts: All expected counts are at least 5.

42 Conditions for Performing a Chi-Square Test for Goodness of Fit
AP® Exam Tip
When checking the Large Counts condition, be sure to examine the expected counts, not the observed counts. And make sure to write and label the expected counts on your paper or you won’t receive credit.
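The numerical conditions can be checked mechanically, as in the sketch below. The function `check_conditions` and the bag size of 1000 are hypothetical and are used only for illustration. The Random condition cannot be computed from the counts, so it is passed in as a yes/no judgment.

```python
def check_conditions(n, null_proportions, is_random_sample, population_size=None):
    """Return each condition's status for a chi-square goodness-of-fit test."""
    expected = [n * p for p in null_proportions]
    return {
        "Random": is_random_sample,
        # The 10% check only applies when sampling without replacement
        # from a population of known size N; otherwise it is reported as None.
        "10%": None if population_size is None else n < 0.10 * population_size,
        # Large Counts looks at the EXPECTED counts, never the observed ones.
        "Large Counts": min(expected) >= 5,
        "expected counts": expected,
    }

claimed = [0.125, 0.125, 0.125, 0.125, 0.25, 0.25]
# population_size=1000 is a made-up bag size used only for illustration.
print(check_conditions(60, claimed, True, population_size=1000))
print(check_conditions(30, claimed, True))  # smallest expected count is 3.75
```

With only 30 candies, the smallest expected count would be 30(0.125) = 3.75, so the Large Counts condition would fail even if every observed count were large.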

43 The Chi-Square Test for Goodness of Fit
Carrying Out a Test
The Chi-Square Test for Goodness of Fit
Suppose the conditions are met. To perform a test of
$H_0$: The stated distribution of a categorical variable in the population of interest is correct,
compute the chi-square test statistic
$$\chi^2 = \sum \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}}$$
where the sum is over the k different categories. The P-value is the area to the right of $\chi^2$ under the chi-square density curve with k – 1 degrees of freedom.

44 Carrying Out a Test
Problem: In his book Outliers, Malcolm Gladwell suggests that a hockey player’s birth month has a big influence on his chance to make it to the highest levels of the game. Specifically, because January 1 is the cut-off date for youth leagues in Canada [where many National Hockey League (NHL) players come from], players born in January will be competing against players up to 12 months younger. The older players tend to be bigger, stronger, and more coordinated, and hence get more playing time and more coaching and have a better chance of being successful. To see if birth date is related to success (judged by whether a player makes it into the NHL), a random sample of 80 NHL players from a recent season was selected and their birthdays were recorded. The one-way table summarizes the data on birthdays for these 80 players.
Jan–Mar: 32, Apr–Jun: 20, Jul–Sep: 16, Oct–Dec: 12
Do these data provide convincing evidence that the birthdays of NHL players are not uniformly distributed across the four quarters of the year?

46 Carrying Out a Test
STATE
$H_0$: The birthdays of all NHL players are uniformly distributed across the four quarters of the year.
$H_a$: The birthdays of all NHL players are not uniformly distributed across the four quarters of the year.
We’ll use α = 0.05.

47 Carrying Out a Test
PLAN
Chi-square test for goodness of fit
• Random: The data came from a random sample of NHL players ✓
  ◦ 10%: We must assume that 80 is less than 10% of all NHL players ✓
• Large Counts: All expected counts = 80(1/4) = 20 ≥ 5 ✓

48 Carrying Out a Test
DO
$$\chi^2 = \frac{(32-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(16-20)^2}{20} + \frac{(12-20)^2}{20} = 7.2 + 0 + 0.8 + 3.2 = 11.2$$
df = 4 – 1 = 3
Using Table C: The P-value is between 0.01 and 0.02.
Using technology: $\chi^2$cdf(lower: 11.2, upper: 10000, df: 3) = 0.011

49 Carrying Out a Test
CONCLUDE
Because the P-value of 0.011 < α = 0.05, we reject $H_0$. We have convincing evidence that the birthdays of NHL players are not uniformly distributed across the four quarters of the year.
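The DO and CONCLUDE steps can be retraced in a short Python sketch. Two things here are assumptions rather than part of the slides: the quarter labels Apr-Jun and Jul-Sep are inferred from the order of the counts, and `chi2_upper_tail_df3` is a made-up helper that uses a closed form valid only for df = 3.

```python
import math

# Observed birthdays by quarter for the 80 sampled NHL players.
observed = {"Jan-Mar": 32, "Apr-Jun": 20, "Jul-Sep": 16, "Oct-Dec": 12}
n = sum(observed.values())
expected = n / len(observed)  # uniform H0: 80(1/4) = 20 per quarter

chi_sq = sum((count - expected) ** 2 / expected for count in observed.values())
df = len(observed) - 1

def chi2_upper_tail_df3(x):
    """P(X >= x) for df = 3 only: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_upper_tail_df3(chi_sq)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {chi_sq:.1f}, df = {df}, P-value = {p_value:.3f}, {decision}")
```

The code only handles the arithmetic. Checking the conditions and writing the conclusion in context still have to be done by hand.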

53 Carrying Out a Test
Note: When you run the chi-square test for goodness of fit on the TI-84 calculator, the individual components of the chi-square statistic are stored in a list called CNTRB (for contribution).

54 Carrying Out a Test AP® Exam Tip
You can use your calculator to carry out the mechanics of a significance test on the AP® Statistics exam. But there’s a risk involved. If you just give the calculator answer with no work, and one or more of your values is incorrect, you will likely get no credit for the “Do” step. We recommend writing out the first few terms of the chi-square calculation followed by “. . .”. This approach might help you earn partial credit if you enter a number incorrectly. Be sure to name the procedure (chi-square test for goodness of fit) and to report the test statistic ($\chi^2$ = 11.2), degrees of freedom (df = 3), and P-value (0.011).

55 Follow-up Analysis
If the sample data lead to a statistically significant result, we can conclude that our variable has a distribution different from the one stated.

56 Follow-up Analysis
To investigate how the distribution is different, start by identifying the categories that contribute the most to the chi-square statistic.

59 Follow-up Analysis
The two biggest contributions to the chi-square statistic came from Jan–Mar and Oct–Dec.

60 Follow-up Analysis
In October through December, 8 fewer players were born than expected.

61 Follow-up Analysis
In January through March, 12 more players were born than expected.
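A follow-up analysis like this one amounts to listing each category's contribution, much like the calculator's CNTRB list. The sketch below is only illustrative. The quarter labels Apr-Jun and Jul-Sep are assumed from the order of the counts.

```python
quarters = ["Jan-Mar", "Apr-Jun", "Jul-Sep", "Oct-Dec"]
observed = [32, 20, 16, 12]
expected = [20, 20, 20, 20]

# One row per category: (label, observed - expected, contribution to chi-square).
rows = []
for quarter, o, e in zip(quarters, observed, expected):
    rows.append((quarter, o - e, (o - e) ** 2 / e))

# List the categories from largest to smallest contribution.
for quarter, diff, contribution in sorted(rows, key=lambda r: r[2], reverse=True):
    print(f"{quarter}: observed - expected = {diff:+d}, contribution = {contribution:.1f}")
```

The sign of observed minus expected shows the direction of each difference, which the contribution alone hides because it is squared.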

62 Section Summary
STATE appropriate hypotheses and COMPUTE the expected counts and chi-square test statistic for a chi-square test for goodness of fit. STATE and CHECK the Random, 10%, and Large Counts conditions for performing a chi-square test for goodness of fit. CALCULATE the degrees of freedom and P-value for a chi-square test for goodness of fit. PERFORM a chi-square test for goodness of fit. CONDUCT a follow-up analysis when the results of a chi-square test are statistically significant.

63 Assignment
11.1 p #2, 6, 10, 16, all
If you are stuck on any of these, look at the odd-numbered exercise before or after it and its answer in the back of your book. If you are still not sure, text a friend or me for help (before 8 pm). Tomorrow we will check homework and review for the 11.1 Quiz.

