Copyright © Cengage Learning. All rights reserved. 14 Goodness-of-Fit Tests and Categorical Data Analysis.

3 We now study problems in which the data also consists of counts or frequencies, but the data table will now have I rows (I ≥ 2) and J columns, so IJ cells. There are two commonly encountered situations in which such data arises: 1. There are I populations of interest, each corresponding to a different row of the table, and each population is divided into the same J categories. A sample is taken from the ith population (i = 1,…, I) and the counts are entered in the cells in the ith row of the table.

4 Two-Way Contingency Tables For example, customers of each of I = 3 department-store chains might have available the same J = 5 payment categories: cash, check, store credit card, Visa, and MasterCard. 2. There is a single population of interest, with each individual in the population categorized with respect to two different factors. There are I categories associated with the first factor and J categories associated with the second factor.

5 Two-Way Contingency Tables A single sample is taken, and the number of individuals belonging in both category i of factor 1 and category j of factor 2 is entered in the cell in row i, column j (i = 1,…, I; j = 1,…, J). As an example, customers making a purchase might be classified according to both department in which the purchase was made, with I = 6 departments, and according to method of payment, with J = 5 as in (1) above.

6 Two-Way Contingency Tables Let $n_{ij}$ denote the number of individuals in the sample(s) falling in the (i, j)th cell (row i, column j) of the table, that is, the (i, j)th cell count. The table displaying the $n_{ij}$'s is called a two-way contingency table; a prototype is shown in Table 14.9. [Table 14.9: A Two-Way Contingency Table]

7 Two-Way Contingency Tables In situations of type 1, we want to investigate whether the proportions in the different categories are the same for all populations. The null hypothesis states that the populations are homogeneous with respect to these categories. In type 2 situations, we investigate whether the categories of the two factors occur independently of one another in the population.

8 Testing for Homogeneity

9 Suppose each individual in every one of the I populations belongs in exactly one of the same J categories. A sample of $n_i$ individuals is taken from the ith population; let $n = \sum_{i=1}^{I} n_i$ and
$n_{ij}$ = the number of individuals in the ith sample who fall into category j
$n_{\cdot j} = \sum_{i=1}^{I} n_{ij}$ = the total number of individuals among the n sampled who fall into category j

10 Testing for Homogeneity The $n_{ij}$'s are recorded in a two-way contingency table with I rows and J columns. The sum of the $n_{ij}$'s in the ith row is $n_i$, and the sum of entries in the jth column will be denoted by $n_{\cdot j}$. Let $p_{ij}$ = the proportion of the individuals in population i who fall into category j. Thus, for population 1, the J proportions are $p_{11}, p_{12}, \ldots, p_{1J}$ (which sum to 1), and similarly for the other populations.

11 Testing for Homogeneity The null hypothesis of homogeneity states that the proportion of individuals in category j is the same for each population and that this is true for every category; that is, for every j, $p_{1j} = p_{2j} = \cdots = p_{Ij}$. When $H_0$ is true, we can use $p_1, p_2, \ldots, p_J$ to denote the population proportions in the J different categories; these proportions are common to all I populations.

12 Testing for Homogeneity The expected number of individuals in the ith sample who fall in the jth category when $H_0$ is true is then $E(N_{ij}) = n_i \cdot p_j$. To estimate $E(N_{ij})$, we must first estimate $p_j$, the proportion in category j. Among the total sample of n individuals, $N_{\cdot j}$ fall into category j, so we use $\hat{p}_j = N_{\cdot j}/n$ as the estimator (this can be shown to be the maximum likelihood estimator of $p_j$).

13 Testing for Homogeneity Substitution of the estimate $\hat{p}_j$ for $p_j$ in $n_i p_j$ yields a simple formula for estimated expected counts under $H_0$:
$\hat{e}_{ij}$ = estimated expected count in cell (i, j) = $n_i \cdot \frac{n_{\cdot j}}{n} = \frac{(\text{ith row total})(\text{jth column total})}{n}$ (14.9)
The test statistic also has the same form as in previous problem situations.
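As a minimal sketch of formula (14.9), the Python function below computes the estimated expected counts from a table of observed counts. The function name `expected_counts` and the small 2 × 3 table are illustrative assumptions, not data from the text.

```python
def expected_counts(observed):
    """Estimated expected counts under H0: (row total)(column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts: 2 populations (rows), 3 categories (columns).
observed = [[20, 30, 50],
            [30, 30, 40]]
expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row of the result sums to that row's sample size, which is one quick way to check a hand calculation.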

14 Testing for Homogeneity The number of degrees of freedom comes from the general rule of thumb: df = (number of freely determined cell counts) – (number of independently estimated parameters). In each row of Table 14.9 there are J – 1 freely determined cell counts (each sample size $n_i$ is fixed), so there are a total of I(J – 1) freely determined cells. Parameters $p_1, \ldots, p_J$ are estimated, but because $\sum_j p_j = 1$, only J – 1 of these are independent. Thus df = I(J – 1) – (J – 1) = (J – 1)(I – 1).

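15 Testing for Homogeneity
Null hypothesis: $H_0$: $p_{1j} = p_{2j} = \cdots = p_{Ij}$, j = 1, 2,…, J
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value: $\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$
The test can safely be applied as long as $\hat{e}_{ij} \ge 5$ for all cells.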
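The procedure in this box can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name `chi2_two_way` and the 2 × 2 counts are hypothetical, and the function returns the smallest estimated expected count so the "at least 5" condition can be checked alongside the statistic.

```python
def chi2_two_way(observed):
    """Return (statistic, df, smallest estimated expected count)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    min_expected = float("inf")
    for i, row in enumerate(observed):
        for j, n_ij in enumerate(row):
            # Estimated expected count: (row total)(column total) / n
            e_ij = row_totals[i] * col_totals[j] / n
            stat += (n_ij - e_ij) ** 2 / e_ij
            min_expected = min(min_expected, e_ij)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, min_expected


# Hypothetical 2 x 2 table (not data from the text).
observed = [[10, 20],
            [20, 10]]
stat, df, min_e = chi2_two_way(observed)
print(round(stat, 3), df, min_e)
```

The statistic is then compared with the chi-squared critical value for the returned df at the chosen significance level.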

16 Example 13 A company packages a particular product in cans of three different sizes, each one using a different production line. Most cans conform to specifications, but a quality control engineer has identified the following reasons for nonconformance:
1. Blemish on can
2. Crack in can
3. Improper pull tab location
4. Pull tab missing
5. Other

17 Example 13 A sample of nonconforming units is selected from each of the three lines, and each unit is categorized according to reason for nonconformity, resulting in the following contingency table data: [contingency table not reproduced in this transcript]

18 Example 13 Does the data suggest that the proportions falling in the various nonconformance categories are not the same for the three lines? The parameters of interest are the various proportions, and the relevant hypotheses are
$H_0$: the production lines are homogeneous with respect to the five nonconformance categories; that is, $p_{1j} = p_{2j} = p_{3j}$ for j = 1,…, 5
$H_a$: the production lines are not homogeneous with respect to the categories

19 Example 13 The estimated expected frequencies (assuming homogeneity) must now be calculated. Consider the first nonconformance category for the first production line. When the lines are homogeneous, $\hat{e}_{11}$ = estimated expected number among the 150 selected units that are blemished = $n_1 \cdot n_{\cdot 1}/n$, with $n_1 = 150$ [numerical value not reproduced in this transcript]

20 Example 13 The contribution of the cell in the upper-left corner to $\chi^2$ is then $(n_{11} - \hat{e}_{11})^2/\hat{e}_{11}$ [numerical value not reproduced in this transcript]

21 Example 13 The other contributions are calculated in a similar manner. Figure 14.4 shows Minitab output for the chi-squared test. [Figure 14.4: Minitab output for the chi-squared test of Example 13]

22 Example 13 The observed count is the top number in each cell, and directly below it is the estimated expected count. The contribution of each cell to $\chi^2$ appears below the counts, and the test statistic value is $\chi^2 = 14.159$. All estimated expected counts are at least 5, so combining categories is unnecessary. The test is based on (3 – 1)(5 – 1) = 8 df. Appendix Table A.11 shows that the values that capture upper-tail areas of .08 and .075 under the 8 df curve are 14.06 and 14.26, respectively.

23 Example 13 Thus the P-value is between .075 and .08; Minitab gives P-value = .079. The null hypothesis of homogeneity should not be rejected at the usual significance levels of .05 or .01, but it would be rejected for the higher α of .10.
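The upper-tail area can also be computed directly rather than bracketed from a table. The sketch below is an illustration only: it relies on the standard closed form for the chi-squared upper tail when df is even, $P(\chi^2_{2m} \ge x) = e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$, and the function name `chi2_upper_tail_even_df` is a hypothetical helper, not something from the text or from Minitab.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-squared with even df >= x), using the Poisson-sum closed form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Test statistic and df reported in Example 13.
p_value = chi2_upper_tail_even_df(14.159, 8)
print(round(p_value, 3))
```

The result can be compared with the bracket read from Table A.11; for odd df a general routine (for instance one based on the incomplete gamma function) would be needed instead.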

24 Testing for Independence

25 Testing for Independence We focus now on the relationship between two different factors in a single population. Each individual in the population is assumed to belong in exactly one of the I categories associated with the first factor and exactly one of the J categories associated with the second factor. For example, the population of interest might consist of all individuals who regularly watch the national news on television, with the first factor being preferred network (ABC, CBS, NBC, or PBS, so I = 4) and the second factor political philosophy (liberal, moderate, or conservative, giving J = 3).

26 Testing for Independence For a sample of n individuals taken from the population, let $n_{ij}$ denote the number among the n who fall both in category i of the first factor and category j of the second factor. The $n_{ij}$'s can be displayed in a two-way contingency table with I rows and J columns. In the case of homogeneity for I populations, the row totals were fixed in advance, and only the J column totals were random.

27 Testing for Independence Now only the total sample size is fixed, and both the $n_{i\cdot}$'s and $n_{\cdot j}$'s are observed values of random variables. To state the hypotheses of interest, let
$p_{ij}$ = the proportion of individuals in the population who belong in category i of factor 1 and category j of factor 2 = P(a randomly selected individual falls in both category i of factor 1 and category j of factor 2)

28 Testing for Independence Then
$p_{i\cdot} = \sum_{j} p_{ij}$ = P(a randomly selected individual falls in category i of factor 1)
$p_{\cdot j} = \sum_{i} p_{ij}$ = P(a randomly selected individual falls in category j of factor 2)
Recall that two events, A and B, are independent if $P(A \cap B) = P(A) \cdot P(B)$.

29 Testing for Independence The null hypothesis here says that an individual's category with respect to factor 1 is independent of the category with respect to factor 2. In symbols, this becomes $p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$ for every pair (i, j). The expected count in cell (i, j) is $n \cdot p_{ij}$, so when the null hypothesis is true, $E(N_{ij}) = n \cdot p_{i\cdot} \cdot p_{\cdot j}$. To obtain a chi-squared statistic, we must therefore estimate the $p_{i\cdot}$'s (i = 1,…, I) and $p_{\cdot j}$'s (j = 1,…, J).
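To make the factorization condition concrete, the sketch below checks whether a table of joint population proportions equals the product of its row and column marginals in every cell. The function name `is_independent`, the tolerance, and both tables of proportions are hypothetical illustrations, not values from the text.

```python
def is_independent(joint, tol=1e-9):
    """True if every joint proportion equals (row marginal)(column marginal)."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    return all(
        abs(p_ij - row_marginals[i] * col_marginals[j]) <= tol
        for i, row in enumerate(joint)
        for j, p_ij in enumerate(row)
    )


# Hypothetical joint proportions; each table sums to 1.
factorizes = [[0.12, 0.28],
              [0.18, 0.42]]   # row marginals .4, .6; column marginals .3, .7
does_not = [[0.30, 0.10],
            [0.10, 0.50]]     # row marginals .4, .6; column marginals .4, .6

print(is_independent(factorizes), is_independent(does_not))
```

In practice the population proportions are unknown, which is why the test works with estimated expected counts rather than checking this equality directly.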

30 Testing for Independence The (maximum likelihood) estimates are
$\hat{p}_{i\cdot} = n_{i\cdot}/n$ = sample proportion for category i of factor 1
and
$\hat{p}_{\cdot j} = n_{\cdot j}/n$ = sample proportion for category j of factor 2
This gives estimated expected cell counts identical to those in the case of homogeneity.
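The claim that the estimated expected counts match the homogeneity case follows from one substitution. As a short worked step, writing $\hat{e}_{ij}$ for the estimated expected count in cell (i, j) and using the sample proportions (row total over n, column total over n) as the estimates:

```latex
\hat{e}_{ij}
  = n \cdot \hat{p}_{i\cdot} \cdot \hat{p}_{\cdot j}
  = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n}
  = \frac{n_{i\cdot} \, n_{\cdot j}}{n}
  = \frac{(\text{$i$th row total})(\text{$j$th column total})}{n}
```

This is the same (row total)(column total)/n form as in the homogeneity case, with the row total $n_{i\cdot}$ now random rather than fixed in advance.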

31 Testing for Independence The test statistic is also identical to that used in testing for homogeneity, as is the number of degrees of freedom. This is because the number of freely determined cell counts is IJ – 1, since only the total n is fixed in advance.

32 Testing for Independence There are I estimated $p_{i\cdot}$'s, but only I – 1 are independently estimated since $\sum_i p_{i\cdot} = 1$; similarly, J – 1 of the $p_{\cdot j}$'s are independently estimated, so I + J – 2 parameters are independently estimated. The rule of thumb now yields df = IJ – 1 – (I + J – 2) = IJ – I – J + 1 = (I – 1)(J – 1).

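33 Testing for Independence
Null hypothesis: $H_0$: $p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$, i = 1,…, I; j = 1,…, J
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value: $\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$
Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$
Again, P-value information can be obtained as described in Section 14.1. The test can safely be applied as long as $\hat{e}_{ij} \ge 5$ for all cells.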

34 Example 14 A study of the relationship between facility conditions at gasoline stations and aggressiveness in the pricing of gasoline (“An Analysis of Price Aggressiveness in Gasoline Marketing,” J. of Marketing Research, 1970: 36–42) reports the accompanying data based on a sample of n = 441 stations. [data table not reproduced in this transcript]

35 Example 14 At level .01, does the data suggest that facility conditions and pricing policy are independent of one another? Observed and estimated expected counts are given in Table 14.10. [Table 14.10: Observed and Estimated Expected Counts for Example 14]

36 Example 14 Thus $\chi^2$ = [computed value not reproduced in this transcript], and because this exceeds the critical value $\chi^2_{.01,4} = 13.277$, the hypothesis of independence is rejected. We conclude that knowledge of a station's pricing policy does give information about the condition of facilities at the station. In particular, stations with an aggressive pricing policy appear more likely to have substandard facilities than stations with a neutral or nonaggressive policy.
