Associations between Categorical Variables Chapter 10: Chi-Square Procedures.


1 Associations between Categorical Variables Chapter 10: Chi-Square Procedures

2 Three tests... Two procedures...
Chi-Square Test of Goodness of Fit: one-way tables, 1 variable
Chi-Square Test for Homogeneity: two-way table, 2 categorical variables; 2 or more populations
Chi-Square Test of Association/Independence: two-way table, 2 categorical variables, 1 population; exact same procedure as Homogeneity

3 M&M Mars Company claims... that you will receive, in each bag of milk chocolate candies M&M’s... 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow What is one way we can test this claim? Discuss for 1 minute...

4 Claim: Customers will receive 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow We could conduct a one-proportion z procedure/hypothesis test for each of the following:
H o: p Brown = 0.13; H a: p Brown ≠ 0.13
H o: p Red = 0.13; H a: p Red ≠ 0.13
H o: p Orange = 0.20; H a: p Orange ≠ 0.20
H o: p Blue = 0.24; H a: p Blue ≠ 0.24
H o: p Green = 0.16; H a: p Green ≠ 0.16
H o: p Yellow = 0.14; H a: p Yellow ≠ 0.14
Lots of work, time-consuming...

5 Claim: Customers will receive 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow We could conduct a one-proportion z procedure/hypothesis test six times... Lots of work, time-consuming... But more importantly, this process would not tell us how likely it is that all six sample proportions, taken together, differ from the claim made by M&M Mars. So, we use a single test to determine whether our observed sample distribution is significantly different in some way from the hypothesized population distribution of all M&M colors.

6 Chi-Square (X 2 ) Test for Goodness of Fit Are you more likely to have a motor vehicle collision when using a cell phone on a given day of the week? A study of 699 randomly-selected drivers who were using a cell phone when they were involved in a collision examined this question. These drivers made 26,798 cell phone calls during a 14-month study period. Each of the 699 collisions was classified in various ways. Here are the counts for each day of the week:

7 Are accidents equally likely to occur on any day of the week? Observation: In this sample, the counts are not equal... But is this just due to sampling variability OR are accidents truly not equally likely to occur on any day of the week? To answer this question, we must conduct a hypothesis test: the Chi-Square Goodness of Fit test.

8 SIDE NOTE... In this situation, the question is “Are accidents equally likely to occur on any day of the week?” Versus In the M&M situation, the question was “Will we receive, in each bag of milk chocolate candies M&M’s, 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow?”

9 H o: Motor vehicle accidents involving cell phone use are equally likely to occur on each of the 7 days of the week OR p sunday = p monday = p tuesday =... = p saturday = 1/7
H a: The probability of a motor vehicle accident involving cell phone use varies from day to day (not all the same; at least one day is different from another) OR At least one of the proportions differs from the stated value(s)

10 H o :P sunday = p monday = p tuesday =... = p saturday = 1/7 H a :At least one of the proportions differs from the stated value(s) GENERAL IDEA: Goodness of Fit Test compares observed (our sample) versus what we expect (our null hypothesis, H o ) Does our sample support our null hypothesis or does it support our alternative hypothesis?

11 H o: p sunday = p monday = p tuesday =... = p saturday = 1/7 H a: At least one of the proportions differs from the stated value(s) We have observed count data (our sample data above), now we need our expected count data 699 total accidents in the study Expected accidents each day (if H o is true) = 699 ÷ 7 = 99.86 each day (H o; null hypothesis; 1/7 of total each day)

12 H o :P sunday = p monday = p tuesday =... = p saturday = 1/7 H a :At least one of the proportions differs from the stated value(s) Expected accidents each day = 699 ÷ 7 = 99.86 each day (H o ; null hypothesis; 1/7 of total each day) NOTE: Observed is count data; no such thing as 3-1/2 accidents on a given day; either 3 or 4; BUT the EXPECTED accidents CAN be a decimal/fraction.

13 H o : P sunday = p monday = p tuesday =... = p saturday = 1/7 H a : At least one of the proportions differs from the stated value(s) Conditions: Random sample. Independent. “Large Sample.” Randomly selected – stated in problem Independent – assume one individual does not influence any other “Large Sample” – this is all about expected counts, not the actual sample counts; all individual expected counts must be at least 5

14 H o :P sunday = p monday = p tuesday =... = p saturday = 1/7 H a :At least one of the proportions differs from the stated value(s) Expected = 99.86 each day; therefore, all individual expected counts are at least 5; ✔

15 H o :P sunday = p monday = p tuesday =... = p saturday = 1/7 H a :At least one of the proportions differs from the stated value(s) Chi-Square Goodness of Fit procedure has degrees of freedom (like t procedures) degrees of freedom = k – 1 (where k is # of categories)

16 A little X 2 detour... properties of X 2 Distribution is skewed to the right; area under density curve still = 1 Degrees of freedom = k – 1 As df increase (as k increases), density curve becomes less right skewed P-value: use StatCrunch “Rejection zone” idea still same Default α still 5%
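The statement that the curve becomes less right-skewed as df increases can be checked numerically. The sketch below leans on a standard result that is not on the slide: a chi-square distribution with df degrees of freedom has skewness sqrt(8 / df). The function name is ours, chosen for illustration.

```python
import math

def chi_square_skewness(df: int) -> float:
    """Skewness of a chi-square distribution (standard result: sqrt(8 / df))."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    return math.sqrt(8 / df)

# Skewness shrinks toward 0 (a symmetric curve) as df increases.
for df in (1, 2, 6, 11, 30):
    print(f"df = {df:2d}  skewness = {chi_square_skewness(df):.3f}")
```

The df values 6 and 11 correspond to the day-of-week (k = 7) and astrological-sign (k = 12) examples in these slides.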

17 H o :P sunday = p monday = p tuesday =... = p saturday = 1/7 H a :At least one of the proportions differs from the stated value(s) Expected accidents each day = 699 ÷ 7 = 99.86 each day (H o ; null hypothesis; 1/7 of total each day) Need a method to measure how well the observed counts (“O”) fit the expected counts (“E”); i.e., are the differences between observed and expected just due to sampling variability or is H o incorrect?

18 H o: p sunday = p monday = p tuesday =... = p saturday = 1/7 H a: At least one of the proportions differs from the stated value(s) Need a method to measure how well the observed counts (“O”) fit the expected counts (“E”); i.e., are the differences between observed and expected just due to sampling variability or is H o incorrect? X 2 = Chi-Square = Σ (O – E)² / E, summed over all categories

19 H o: p sunday = p monday = p tuesday =... = p saturday = 1/7 H a: At least one of the proportions differs from the stated value(s) Calculations: StatCrunch, stat, goodness of fit, chi-square test X 2 = 208.84; df = k – 1 = 7 – 1 = 6; p-value ≈ 0 Remember, name of procedure somewhere in your problem (always)

20 H o :P sunday = p monday = p tuesday =... = p saturday = 1/7 H a :At least one of the proportions differs from the stated value(s) Interpretation: Reject H o. As p-value ≈ 0, which is less than any reasonable α, we conclude these types of accidents are not equally likely to occur on each of the seven days of the week (at least one of the proportions differs from the stated values).

21 Saving birds from windows... Many birds are injured or killed by flying into windows. It appears that birds see windows as open space. Can tilting windows down so that they reflect earth rather than sky reduce bird strikes? Suppose we randomly place three windows at the edge of a woods: one vertical, one tilted 20 degrees, and one tilted 40 degrees. During a randomly-selected four- month period, there are 53 bird strikes: 31 on the vertical window, 14 on the 20-degree window, and 8 on the 40- degree window. If tilting had no effect, we would expect strikes on all three windows to have equal probability. Test this null hypothesis. What do you conclude?

22 53 bird strikes; 31 on the vertical window, 14 on the 20 degree window, 8 on the 40 degree window We want to test H o : p v = p t20 = p t40 = 1/3 H a : At least one of these proportions of bird hits is different

23 53 bird strikes; 31 on the vertical window, 14 on the 20 degree window, 8 on the 40 degree window Conditions: Large Sample – 53 bird strikes total in our sample, so the expected count for each window is 53 ÷ 3 = 17.67; each expected count is at least 5 ✔ Random – stated in problem ✔ Independent – reasonable to assume population is at least (10)(53) ✔

24 53 bird strikes; 31 on the vertical window, 14 on the 20 degree window, 8 on the 40 degree window Calculations: StatCrunch, stat, goodness of fit, chi-square test X 2 = 16.11; df = 3 – 1 = 2; p-value ≈ 0
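As a cross-check on the StatCrunch output, the same calculation can be sketched in plain Python. The statistic is the sum of (O – E)² / E over the three windows. The p-value line uses a shortcut that holds only when df = 2, where the area to the right of x is exp(–x/2); variable names are ours.

```python
import math

observed = [31, 14, 8]            # vertical, 20-degree tilt, 40-degree tilt
total = sum(observed)             # 53 strikes
expected = [total / 3] * 3        # H o: each window has probability 1/3

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # k - 1 = 2

# For df = 2 only, the right-tail area has the closed form exp(-x / 2).
p_value = math.exp(-chi_sq / 2)

print(f"X^2 = {chi_sq:.2f}, df = {df}, p-value = {p_value:.5f}")
```

The statistic agrees with the slide (16.11), and the p-value comes out near 0.0003, which is what the slide rounds to “≈ 0.”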

25 53 bird strikes; 31 on the vertical window, 14 on the 20 degree window, 8 on the 40 degree window Interpretation Reject null hypothesis. Since the p-value ≈ 0 (less than α = 5%), there is evidence that at least one proportion of bird hits is different.

26 The University of Chicago's General Social Survey (GSS) is the nation's most important social science sample survey. The GSS regularly asks randomly-selected subjects their astrological sign. Above are the counts of responses in the most recent year this question was asked. If births were spread uniformly across the year, we would expect all 12 signs to be equally likely. Are they?

27 We want to test: Chi Square Goodness of Fit H o : p Aries = p Taurus = p Gemini =... = p Pisces = 1/12 H a : At least one of the proportions differs from 1/12

28 H o: p Aries = p Taurus = p Gemini =... = p Pisces = 1/12 H a: At least one of the proportions differs from 1/12 Conditions: Random, Expected Counts > 5, Independent Calculations: StatCrunch Interpretation: Fail to reject the null hypothesis. With a p-value of 0.21, which is larger than any reasonable α, there is not enough evidence to conclude that births are not uniformly spread throughout the year.

29 Housing... According to the Census Bureau, the distribution by ethnic background of the New York City population in a recent year was: Hispanic: 28%, Black: 24%, White: 35%, Asian: 12%, Other: 1% The manager of a large housing complex in the city wonders whether the distribution by race of the complex’s residents is consistent with the population distribution. To find out, she records data from a random sample of 800 residents. The following table displays the sample data.

30 Census bureau says: 28% Hispanic; 24% Black; 35% White; 12% Asian; 1% Other Are these data significantly different from the city’s distribution by race? Carry out an appropriate test at the alpha = 0.05 level to support your answer.
Race:  Hispanic  Black  White  Asian  Other  Total
Count: 212       202    270    94     22     800

31 Census bureau says: 28% Hispanic; 24% Black; 35% White; 12% Asian; 1% Other
H o: p Hispanic = 28%, p Black = 24%, p White = 35%, p Asian = 12%, p Other = 1%
H a: At least one population proportion is different
Conditions: Random, Expected Counts, Independent
Race:  Hispanic  Black  White  Asian  Other  Total
Count: 212       202    270    94     22     800

32 Census bureau says: 28% Hispanic; 24% Black; 35% White; 12% Asian; 1% Other
Random – stated in problem
Expected Counts – all expected counts must be > 5
Hispanic: (800)(.28) = 224 > 5 ✔
Black: (800)(.24) = 192 > 5 ✔
White: (800)(.35) = 280 > 5 ✔
Asian: (800)(.12) = 96 > 5 ✔
Other: (800)(.01) = 8 > 5 ✔
Independent – Reasonable to assume population at least (10)(800)
Race:  Hispanic  Black  White  Asian  Other  Total
Count: 212       202    270    94     22     800

33 Census bureau says: 28% Hispanic; 24% Black; 35% White; 12% Asian; 1% Other Calculations: StatCrunch, stat, goodness of fit, chi square test (expected, select column) X 2 = 26.06; p-value < 0.0001; df = 4 Reject Ho. With a p-value of about zero, and an alpha level of 5%, we have sufficient evidence to support that at least one proportion is different (or all residents like these in a housing complex do not follow the ethnic background distribution of New York City as a whole).
Race:  Hispanic  Black  White  Asian  Other  Total
Count: 212       202    270    94     22     800
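To see where the 26.06 comes from, the sketch below rebuilds the expected counts from the Census percentages and prints each category's (O – E)² / E contribution. It is a plain-Python illustration, not the StatCrunch procedure; the p-value line uses a closed form that is valid only for df = 4, and all variable names are ours.

```python
import math

claimed = {"Hispanic": 0.28, "Black": 0.24, "White": 0.35, "Asian": 0.12, "Other": 0.01}
observed = {"Hispanic": 212, "Black": 202, "White": 270, "Asian": 94, "Other": 22}
n = sum(observed.values())        # 800 residents

# Expected count and (O - E)^2 / E contribution for each category.
contributions = {}
for race, p in claimed.items():
    e = n * p
    contributions[race] = (observed[race] - e) ** 2 / e
    print(f"{race:9s} O = {observed[race]:3d}  E = {e:6.1f}  contribution = {contributions[race]:6.3f}")

chi_sq = sum(contributions.values())
df = len(claimed) - 1             # 4

# For df = 4 only, the right-tail area is exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi_sq / 2) * (1 + chi_sq / 2)
print(f"X^2 = {chi_sq:.2f}, df = {df}, p-value = {p_value:.6f}")
```

The “Other” category contributes about 24.5 of the total, so nearly all of the evidence against H o comes from that one group (22 observed versus 8 expected).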

34 Aw, Nuts! A company claims that each batch of its deluxe mixed nuts contains 52% cashews, 27% almonds, 13% macadamia nuts, and 8% brazil nuts. To test this claim, a quality control inspector takes a random sample of 150 nuts from the latest batch. The following table displays the sample data:
Nut:   Cashew  Almond  Mac  Brazil
Count: 83      29      20   18

35 Aw, nuts! Claim: 52% cashews, 27% almonds, 13% macadamia nuts, and 8% brazil nuts. Test the following null and alternative hypotheses:
Ho: p cashews = 52%, p almonds = 27%, p mac = 13%, p brazil = 8%
Ha: At least one proportion is different
Conditions: Random – stated in the problem
Nut:          Cashew  Almond  Mac  Brazil
Actual Count: 83      29      20   18

36 Aw, nuts! Test the following null and alternative hypotheses:
Ho: p cashews = 52%, p almonds = 27%, p mac = 13%, p brazil = 8%
Ha: At least one proportion is different
Conditions: Expected Counts must all be > 5
Cashews: (150)(.52) = 78 ✔
Almond: (150)(.27) = 40.5 ✔
Mac: (150)(.13) = 19.5 ✔
Brazil: (150)(.08) = 12 ✔
Nut:          Cashew  Almond  Mac  Brazil
Actual Count: 83      29      20   18

37 Aw, nuts! Test the following null and alternative hypotheses:
Ho: p cashews = 52%, p almonds = 27%, p mac = 13%, p brazil = 8%
Ha: At least one proportion is different
Conditions: Independent – reasonable to assume the batch contains at least (10)(150) = 1,500 nuts.
Nut:          Cashew  Almond  Mac  Brazil
Actual Count: 83      29      20   18

38 Aw, nuts! Test the following null and alternative hypotheses:
Ho: p cashews = 52%, p almonds = 27%, p mac = 13%, p brazil = 8%
Ha: At least one proportion is different
Calculations: StatCrunch, stat, goodness of fit, chi square test; X 2 = 6.598, p-value = 0.0858, df = 3
Nut:          Cashew  Almond  Mac  Brazil
Actual Count: 83      29      20   18
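The StatCrunch numbers for the nuts example can be reproduced with a short plain-Python sketch. The p-value line uses a closed form that applies only when df = 3 (it combines the complementary error function with an exponential term); for other df a statistics package is the safer route. Variable names are ours.

```python
import math

claimed = [0.52, 0.27, 0.13, 0.08]    # cashew, almond, macadamia, brazil
observed = [83, 29, 20, 18]
n = sum(observed)                     # 150 nuts

expected = [n * p for p in claimed]   # approximately 78, 40.5, 19.5, 12
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For df = 3 only: P(X^2 >= x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
p_value = (math.erfc(math.sqrt(chi_sq / 2))
           + math.sqrt(2 * chi_sq / math.pi) * math.exp(-chi_sq / 2))

print(f"X^2 = {chi_sq:.4f}, p-value = {p_value:.4f}")
```

The statistic is about 6.5988 (the slide shows it truncated to 6.598), and the p-value stays above α = 0.05, which matches the fail-to-reject decision.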

39 Aw, nuts! Test the following null and alternative hypotheses:
Ho: p cashews = 52%, p almonds = 27%, p mac = 13%, p brazil = 8%
Ha: At least one proportion is different
Fail to reject Ho. With a p-value of about 9% and an alpha level of 5%, we do not have sufficient evidence to conclude that the proportions of nuts are different from the claim.
Nut:          Cashew  Almond  Mac  Brazil
Actual Count: 83      29      20   18

40 M&M’s Activity... Mars claims: 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow in each bag of milk chocolate M&Ms Based on our sample data, do we have reason to doubt the color distribution claim made by M&M/Mars Company? Give appropriate statistical evidence to support your conclusion. Let’s randomly form groups and turn in one paper, start to finish, with all of your group’s names on the paper.

41 Mars claims: 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow in each bag of milk chocolate M&Ms H o : p red = 13% p brown = 13% p orange = 20% p blue = 24% p green = 16% p yellow = 14% H a : At least one of these proportions is different

42 Mars claims: 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow in each bag of milk chocolate M&Ms Conditions: All individual expected counts are at least 5
Red: (0.13)(Total)
Brown: (0.13)(Total)
Orange: (0.20)(Total)
Blue: (0.24)(Total)
Green: (0.16)(Total)
Yellow: (0.14)(Total)
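Before counting candies, it helps to know how large the sample has to be for the expected-count condition to hold. The sketch below computes the expected counts for any total and finds the smallest total for which every expected count is at least 5. The helper names are hypothetical, introduced only for this illustration.

```python
import math

claimed = {"red": 0.13, "brown": 0.13, "orange": 0.20,
           "blue": 0.24, "green": 0.16, "yellow": 0.14}

def expected_counts(total: int) -> dict:
    """Expected count per color if the claimed distribution is true."""
    return {color: p * total for color, p in claimed.items()}

def large_sample_ok(total: int) -> bool:
    """True when every expected count is at least 5."""
    return all(e >= 5 for e in expected_counts(total).values())

# Smallest total for which the rarest colors (13%) still have expected count >= 5.
min_total = math.ceil(5 / min(claimed.values()))

print(min_total, large_sample_ok(min_total), large_sample_ok(min_total - 1))
```

Under the claimed percentages, a group needs at least 39 candies in its sample; with 38, the red and brown expected counts fall just under 5.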

43 Mars claims: 13% red, 13% brown, 20% orange, 24% blue, 16% green, 14% yellow in each bag of milk chocolate M&Ms Interpretation: Reject/Fail to reject; reference α and p-value, in context of problem

44 Inference for 2-way tables...(categorical data) Two tests possible with 2-way tables (for categorical data) Test of Homogeneity – independent SRS’s from each of ‘c’ populations (2, 3, 4, etc. populations, treatments, etc.) OR Test of Independence/Association – one SRS, one population

45 Inference for 2-way tables... Same test (Chi-Square Test) used for either homogeneity or independence/association Test of Homogeneity – independent SRS’s from each of ‘c’ populations (2, 3, 4, etc. populations, treatments, etc.) OR Test of Independence/Association – one SRS, one population

46 Test of Homogeneity of Populations... two-way tables... 2 independent random samples 3 randomly selected groups

47 Chi-Square Test of Homogeneity of Populations... (2 or more populations) rows & columns r x c always; (never c x r) don’t count the ‘totals’ as a row or a column categories in rows and other categories in columns separate and independent SRS’s from each population

48 Chi-Square Test of Homogeneity of Populations... (2 or more populations) H o (null hypothesis): xxx (context) is independent of xxx (context) OR there is no association between xxx and xxx H a (alternative hypothesis): xxx (context) is dependent on xxx (context) OR there is an association between xxx and xxx Still use: X 2 = Σ (O – E)² / E, summed over all cells

49 Chi-Square Test of Homogeneity of Populations... (2 or more populations) Calculations for expected cell count, for EACH cell: expected count = (row total × column total) / table total So, in this table, you would have to calculate 15 expected values... lots of time and work...

50 Chi-Square Test of Homogeneity of Populations... (2 or more populations) degrees of freedom... up to now, general rule k – 1 now, 2-way tables, so df = (r – 1)(c – 1) p-value is area to the right, under the X 2 density curve
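The two ideas on this slide, df = (r – 1)(c – 1) and the p-value as the area to the right of X 2, can be sketched in plain Python. The tail-area function below uses standard closed forms for whole-number df; it is an illustrative stand-in for StatCrunch, and both function names are ours. The loop checks it against three results quoted later in these slides, with table sizes read off the problem descriptions (3 wine types by 3 music types, 5 responses by 3 parent groups, 2 by 2 for franchises).

```python
import math

def two_way_df(rows: int, cols: int) -> int:
    """Degrees of freedom for an r x c table (totals not counted)."""
    return (rows - 1) * (cols - 1)

def chi_square_p_value(x: float, df: int) -> float:
    """Area to the right of x under the chi-square curve with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: start from the df = 1 tail area and add correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total

# (label, X^2, rows, cols) for three examples quoted later in these slides.
for label, x, r, c in [("wine and music", 18.27, 3, 3),
                       ("schools", 22.43, 5, 3),
                       ("franchises", 5.91, 2, 2)]:
    df = two_way_df(r, c)
    print(f"{label}: df = {df}, p-value = {chi_square_p_value(x, df):.3f}")
```

To three decimal places the p-values agree with the StatCrunch results shown on the later slides.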

51 Chi-Square Test of Homogeneity of Populations... (2 or more populations) Conditions.... Random Samples Independent Measurements – each measurement on an individual is independent of all other measurements Large Sample - All individual expected counts must be ≥ 5; it’s all about expected

52 Chi-Square Test of Homogeneity of Populations... (2 or more populations) Market researchers know that background music can influence the mood and purchasing behavior of customers. One study in a supermarket in Northern Ireland compared three treatments: no music, French accordion music, and Italian string music. Under each condition, the researchers recorded the numbers of bottles of French, Italian, and other wine. We may assume each a SRS.

53 Chi-Square Test of Homogeneity of Populations... (2 or more populations) Are the distributions of wines selected different in all three populations of music types?

54 We want to use X 2 test of homogeneity to compare the distribution of types of wine selected for each type of music. Our hypotheses are: H 0 : The distributions of wine selected are the same (are not associated; are independent) in all three populations of music types. H a : The distributions of wine selected are not all the same (are dependent, are associated).

55 Conditions To use the chi-square test for homogeneity of populations: Random – stated in problem. Independence – Must assume that one wine purchaser did not influence any other; & that each population is at least (10) times each sample size. Large Samples - All expected cell counts are ≥ 5 (how can we check this?)

56 Calculations (& checking expected counts) Input data; then StatCrunch, stat, tables, contingency, with summary X 2 = 18.27; p = 0.00101; df = 4

57 Chi-Square Test of Homogeneity of Populations... (2 or more populations) H 0: The distributions of wine selected are the same (no association; independent) in all three populations of music types. H a: The distributions of wine selected are not all the same (dependent; associated). Interpretation: Reject null hypothesis. There is evidence to reject H o (p-value ≈ 0 < any reasonable α) and conclude that the type of music being played has an effect on wine sales (distributions of wine sales are not the same, are associated, are dependent).

58 How are schools doing? The nonprofit group Public Agenda conducted telephone interviews with 3 independent, randomly selected groups of parents of high school children. There were 202 black parents, 202 Hispanic parents, and 201 white parents. One question asked was “Are the high schools in your state doing an excellent, good, fair, or poor job, or don't you know enough to say?” Here are the survey results:

59 How are schools doing? Conduct a test of significance to determine if the distributions for these three populations are the same.

60 How are schools doing? H o :The distributions of responses to this question are the same for each group (independent, not associated). H a :The distributions of responses to this question are not the same (dependent, associated).

61 How are schools doing? Conditions: Data is from independent random samples ✔ All expected cell counts must be ≥ 5

62 How are schools doing? Conditions continued... all expected cell counts must be ≥ 5
Expected cell counts (one row per response category, one column per parent group):
22.70  22.70  22.59
68.45  68.45  68.12
65.44  65.44  65.12
24.04  24.04  23.92
21.36  21.37  21.26

63 How are schools doing? Calculations... X 2 = 22.43; p = 0.004; df = 8

64 How are schools doing? X 2 = 22.43; p = 0.004; df = 8 Interpretation: With a p-value of 0.004, which is less than any reasonable α level, reject H o and conclude that the 3 groups have different opinions about performance of high schools in their state (dependent, associated).

65 One more X 2 Test.... When do we use Chi-Square tests? Why haven’t we used them before now? Chi-Square Goodness of Fit Test Chi-Square Test of Homogeneity Now, Chi-Square Test of Association/Independence

66 Chi-Square Test of Association/Independence... Two-way table (like X 2 test of homogeneity) BUT this test involves only a single population/sample; single SRS, with each individual classified according to both of two categorical variables Bonus (or not....), we use the exact same procedure in the calculator for BOTH of these tests; conditions are the same as well

67 Chi-Square Test of Association/Independence... H o :No association between categorical variables OR Categorical variables are independent H a :There is an association (or they are not independent) Caution: Do not use ‘cause;’ the most we can say is that there is/is not a relation or they are/are not associated or they are/are not independent

68 Franchises that succeed... Many popular businesses, like McDonalds, are franchises. The owner of a local franchise benefits from the brand recognition, national advertising, and detailed guidelines provided by the franchise chain. In return, he or she pays fees to the franchise firm and agrees to follow its policies. The relationship between the local entrepreneur and the franchise firm is spelled out in a detailed contract. One clause that the contract may contain is the entrepreneur's right to an exclusive territory. This means that the new outlet will be the only representative of the franchise in a specified territory and will not have to compete with other outlets of the same chain. How does the presence of an exclusive-territory clause in the contract relate to the survival of the business? A study designed to address this question collected data from a random sample of 170 new franchise firms.

69 Franchises that succeed... How does the presence of an exclusive territory clause in a contract relate to the survival of a business? i.e., are these two qualities related/associated? Are they independent?

70 Franchises that succeed... H o :Success and exclusive-territory are independent (not associated). H a :Success and exclusive-territory are dependent (associated)

71 H o: Success and exclusive-territory are independent (not associated). H a: Success and exclusive-territory are dependent (associated) Conditions: SRS – stated in problem Independence – Must assume that one respondent does not influence any other; and that the population of new franchise firms is at least (10)(170) = 1,700 Expected cell counts – all must be ≥ 5

72 H o: Success and exclusive-territory are independent (not associated). H a: Success and exclusive-territory are dependent (associated) Conditions continued... Expected cell counts – all must be ≥ 5
Expected cell counts:
102.74  20.26
 39.26   7.74
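The expected cell counts above can be rebuilt from the table margins using the standard rule, expected = (row total × column total) / table total. The observed table is not reproduced in this transcript, so the sketch below recovers the margins by adding the expected counts themselves (each row and column of expected counts sums to the same total as the observed row or column). Variable names are ours.

```python
# Margins recovered by adding the expected counts shown on the slide:
# 102.74 + 20.26 = 123, 39.26 + 7.74 = 47, 102.74 + 39.26 = 142, 20.26 + 7.74 = 28.
row_totals = [123, 47]
col_totals = [142, 28]
n = sum(row_totals)               # 170 franchise firms

# Expected count for each cell = (row total * column total) / table total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print("  ".join(f"{e:7.2f}" for e in row))

# Large-sample condition: every expected cell count must be at least 5.
all_at_least_5 = all(e >= 5 for row in expected for e in row)
print("All expected counts >= 5:", all_at_least_5)
```

The smallest expected count is about 7.74, so the large-sample condition is met.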

73 H o: Success and exclusive-territory are independent (not associated). H a: Success and exclusive-territory are dependent (associated) Calculations: X 2 = 5.91; p = 0.015; df = 1

74 H o: Success and exclusive-territory are independent (not associated). H a: Success and exclusive-territory are dependent (associated) X 2 = 5.91; p = 0.015; df = 1 Interpretation: Reject H o. There is sufficient evidence against H o (p = 0.015) at the α = 0.05 level. Conclude success and exclusive territory clause are dependent (there is an association between the two qualities). Caution: This procedure does not show ‘cause’

75 Extracurricular Activities & Grades... North Carolina State University studied student performance in a course required by its chemical engineering major. One question of interest is the relationship between time spent in extracurricular activities and whether a student earned a C or better in the course. Here are the data from a SRS of 119 students who answered a question about extracurricular activities:

76 Extracurricular Activities & Grades... Is there statistically significant evidence of an association between the amount of time spent on extracurricular activities and grades earned in the course?

77 Extracurricular Activities & Grades... H o :There is no association (they are independent) between amount of time spent on extracurricular activities and grades earned in the course H a :There is an association (they are not independent)

78 Extracurricular Activities & Grades... Conditions: SRS: stated in the problem Independence: stated in problem Each expected cell count ≥ 5

79 Calculations: X 2 = 6.93; p = 0.031; df = 2 Interpretation: Reject H o. With a p-value of 0.031 and an α = 0.05, there is sufficient evidence supporting H a. Conclude that there is an association between extracurricular activities and grades earned in the course.

80 Let’s collect data...
                   Pizza   Nachos
Horror
Action/Adventure
Comedy

81 Chapter 10 HW Quiz... a little longer than usual... a little more challenging than usual...

