1
ChiSq Tests: 1 Chi-Square Tests of Association and Homogeneity
2
Exploring the Association between 2 Variables
A common goal of many research studies is to investigate the association between 2 factors. For example, using the low birth weight dataset, we might ask:
Is low birth weight status associated with maternal age?
Is low birth weight status associated with smoking?
Is maternal pre-pregnancy weight associated with maternal age?
3
We have already learned one technique for dealing with the first question: Is low birth weight associated with maternal age? This can be re-worded as: "Is there a difference in age between women who have low birth weight babies and those with normal birth weight babies?"
4
We are evaluating the association between a categorical factor (low/normal birth weight) and a continuous factor (maternal age). We treat the categorical factor, low birth weight, as dividing our data into two samples of women (those with and those without low birth weight infants), and we compare the mean age of these 2 samples using a 2-sample t-test.
5
We will focus on:
The association of two categorical variables (such as smoking status and low birth weight)
The use of chi-square tests of independence or homogeneity
Later, we discuss:
The association of two continuous, numeric-scale variables (such as age and weight)
Correlation and regression analysis
6
The Chi-Square Test
Many widely used applications:
1. Tests of a hypothesis about a single variance
2. Tests of association (or independence)
3. Tests of homogeneity
4. Goodness-of-fit (will not be discussed in this course)
7
Suppose we do a study to determine the relationship between smoking and impairment of lung function, measured by forced vital capacity (FVC).
Suppose n = 100 people are selected for the study.
For each person we note smoking behavior (yes or no) and FVC (normal or abnormal).
We count the number with each combination of the two factors:
8
Such data are commonly displayed in a contingency table, where the count in each table cell is shown:

               FVC abnormal   FVC normal   Total
Smoke               a              b       a + b
Don't smoke         c              d       c + d
Total             a + c          b + d     n = a + b + c + d
9
Consider the following result:

               FVC abnormal   FVC normal   Total
Smoke              50              0         50
Don't smoke         0             50         50
Total              50             50        100

What can be said about the relationship between FVC and smoking? There is perfect association between the 2 factors: all smokers have abnormal FVC and all non-smokers have normal FVC.
10
What about this result?

               FVC abnormal   FVC normal   Total
Smoke              25             25         50
Don't smoke        25             25         50
Total              50             50        100

Half the smokers have normal FVC and half have abnormal FVC; the same is true for non-smokers. There appears to be no association between smoking status and FVC status. We also say that lung function is independent of smoking status.
11
We now want to define a test statistic that will clearly distinguish between these two situations. We will use the following notation:

               FVC abnormal   FVC normal   Total
Smoke              O_1            O_2      O_1 + O_2
Don't smoke        O_3            O_4      O_3 + O_4
Total           O_1 + O_3      O_2 + O_4   n = O_1 + O_2 + O_3 + O_4

$O_i$ is the "observed frequency" in the $i$th table cell, that is, the observed number in our sample.
12
Now, the hypothesis of interest is
$H_o$: There is no association between the two variables (they are independent)
vs. $H_a$: The two variables are associated
that is,
$H_o$: The proportion of smokers with normal FVC ($\pi_1$) is the same as the proportion of non-smokers with normal FVC ($\pi_2$)
vs. $H_a$: The proportions differ
that is,
$H_o$: $\pi_1 = \pi_2$ vs. $H_a$: $\pi_1 \neq \pi_2$
13
               FVC abnormal   FVC normal   Total
Smoke              O_1            O_2      O_1 + O_2
Don't smoke        O_3            O_4      O_3 + O_4
Total           O_1 + O_3      O_2 + O_4   n = O_1 + O_2 + O_3 + O_4

The observed proportions, or sample estimates of $\pi_1$ and $\pi_2$, are:
$P_1 = O_2/(O_1 + O_2)$ = (# smokers with normal FVC) / (# smokers)
and
$P_2 = O_4/(O_3 + O_4)$ = (# non-smokers with normal FVC) / (# non-smokers)
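As a quick illustration, the two sample proportions can be computed for the two made-up tables shown earlier. This is a minimal sketch; the function and variable names are illustrative choices, not part of the original slides.

```python
def proportion_normal(abnormal, normal):
    """Share of one row's subjects who have normal FVC."""
    return normal / (abnormal + normal)

def sample_proportions(table):
    """Return (P1, P2) for a table keyed by smoking status."""
    p1 = proportion_normal(*table["smoke"])        # P1 = O2 / (O1 + O2)
    p2 = proportion_normal(*table["dont_smoke"])   # P2 = O4 / (O3 + O4)
    return p1, p2

# Each entry is (abnormal FVC count, normal FVC count).
perfect_association = {"smoke": (50, 0), "dont_smoke": (0, 50)}
no_association = {"smoke": (25, 25), "dont_smoke": (25, 25)}

print(sample_proportions(perfect_association))  # (0.0, 1.0)
print(sample_proportions(no_association))       # (0.5, 0.5)
```

The further apart the two proportions are, the stronger the evidence against the hypothesis that they are equal.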
14
Now, suppose $H_o$: $\pi_1 = \pi_2$ is true. How many smokers would we expect to see with normal FVC?
Ignoring smoking status, the proportion in the overall sample with normal FVC is:
$P = (O_2 + O_4)/n$
When $H_o$ is true, we expect $\pi_1 = \pi_2 = \pi$, where $P$ is the sample estimate of the common proportion $\pi$.
That is, we expect the proportion with normal FVC among smokers and among non-smokers to be the same as the proportion with normal FVC in the overall sample.
15
               FVC abnormal   FVC normal   Total
Smoke           O_1 = 50       O_2 = 0       50
Don't smoke     O_3 = 0        O_4 = 50      50
Total              50             50        100

Our sample estimate of $\pi$, ignoring smoking status, is
$P = (O_2 + O_4)/n = 50/100 = .5$ or 50%
Then, if $H_o$ is true (no association, or independence), the proportion with normal FVC among smokers as well as among non-smokers should be the same as in the overall population, that is, about .5 in each group.
16
There are a total of 50 smokers:
we expect 50(.5) = 25 to have normal FVC
we expect 50(1 - .5) = 25 to have abnormal FVC
Similarly, there are a total of 50 non-smokers:
we expect 50(.5) = 25 to have normal FVC
we expect 50(1 - .5) = 25 to have abnormal FVC
17
Let $e_i$ = "expected" frequency in the $i$th cell when $H_o$ is true:

               FVC abnormal   FVC normal   Total
Smoke           e_1 = 25       e_2 = 25      50
Don't smoke     e_3 = 25       e_4 = 25      50
Total              50             50        100

Note: The row and column totals, known as "marginals," remain unchanged.
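The reasoning above (apply the overall proportion with normal FVC to each row total) can be sketched in a few lines of Python. The variable names are illustrative assumptions; the counts are those of the perfect-association table.

```python
# Observed counts: each row is (abnormal FVC, normal FVC).
observed = [[50, 0],    # smokers
            [0, 50]]    # non-smokers

n = sum(sum(row) for row in observed)
# Overall proportion with normal FVC, ignoring smoking status: P = (O2 + O4) / n
p_normal = sum(row[1] for row in observed) / n

expected = []
for row in observed:
    row_total = sum(row)
    # Under Ho each group splits according to the overall proportion.
    expected.append([row_total * (1 - p_normal), row_total * p_normal])

print(p_normal)   # 0.5
print(expected)   # [[25.0, 25.0], [25.0, 25.0]]
```

The expected counts in each row still add up to that row's total, which is why the marginals are unchanged.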
18
We now define our test statistic as
$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - e_i)^2}{e_i}$
When $H_o$ is true, this is approximately distributed as chi-square with 1 degree of freedom.
When the observed counts are close to the expected counts, $\chi^2$ will have a small value (fail to reject $H_o$).
When the observed counts are far from the expected counts, $\chi^2$ will have a large value (reject $H_o$).
19
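In our example (observed counts $o_i$, with expected counts $e_i$ in parentheses):

               FVC abnormal   FVC normal   Total
Smoke            50 (25)         0 (25)      50
Don't smoke       0 (25)        50 (25)      50
Total              50             50        100

$\chi^2 = \frac{(50-25)^2}{25} + \frac{(0-25)^2}{25} + \frac{(0-25)^2}{25} + \frac{(50-25)^2}{25} = 25 + 25 + 25 + 25 = 100$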
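The chi-square statistic sums (observed - expected)^2 / expected over the table cells. A minimal sketch of that arithmetic for both made-up tables follows; the function name `chi_square` is an illustrative choice.

```python
def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over paired observed and expected cell counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [25, 25, 25, 25]   # same expected counts for both made-up tables

perfect_association = chi_square([50, 0, 0, 50], expected)
no_association = chi_square([25, 25, 25, 25], expected)

print(perfect_association)  # 100.0
print(no_association)       # 0.0
```

The two extremes bracket what the statistic can do with these marginals: 0 when the observed counts match the expected counts exactly, and 100 when the association is perfect.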
20
We can compute the achieved significance from the $\chi^2$ distribution with 1 degree of freedom. The cumulative probability for 1 df is:

Chi-Square with 1 DF
       x   P( X <= x)
     100       1.0000
21
[Figure: $\chi^2$ distribution with an area of .05 marked. Reject $H_o$ for large values of $\chi^2$.]
Achieved significance (p-value):
$p = \Pr(\chi^2 > 100) = 1 - \Pr(\chi^2 \le 100) = 1 - 1.0000 \approx 0$
Since p << .05, we reject $H_o$ and conclude that smoking status and lung function are associated. (Recall this is a made-up case.)
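The cumulative probability of 1.0000 is a rounded value, so the p-value is extremely small rather than exactly zero. For 1 degree of freedom the upper-tail probability has a closed form, because a chi-square variable with 1 df is the square of a standard normal variable. The sketch below uses only the Python standard library; the function name `chi2_sf_1df` is an assumption made for illustration.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability Pr(chi-square > x), valid for 1 degree of freedom only."""
    # Pr(Z^2 > x) = Pr(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x / 2))

p = chi2_sf_1df(100)
print(p)                 # a positive number far below .05
print(round(1 - p, 4))   # 1.0, the rounded cumulative probability

# Sanity check against the familiar 5% critical value for 1 df.
print(round(chi2_sf_1df(3.841), 3))  # 0.05
```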
22
Now for the case made up to show "no association." The expected counts are the same, since the overall proportion in the sample with normal FVC is unchanged (observed counts $o_i$, with expected counts $e_i$ in parentheses):

               FVC abnormal   FVC normal   Total
Smoke            25 (25)        25 (25)      50
Don't smoke      25 (25)        25 (25)      50
Total              50             50        100

Every observed count equals its expected count, so $\chi^2 = 0$.
The achieved significance is 1: we fail to reject $H_o$. There is no evidence that FVC and smoking are related.
23
An easy way to compute the expected value for a given cell is
$e = \frac{(\text{row total})(\text{column total})}{n}$
For example:
$P = c_2/n$: expected proportion overall with normal FVC, under $H_o$
$e_4 = (r_2)(P) = (r_2)(c_2)/n$: expected frequency of non-smokers with normal FVC, under $H_o$

               FVC abnormal   FVC normal   Total
Smoke              O_1            O_2       r_1
Don't smoke        O_3            O_4       r_2
Total              c_1            c_2        n
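The row-total-times-column-total rule extends directly to every cell of a table. The sketch below is one way to write it; `expected_counts` is a hypothetical helper name, and the table is assumed to be a list of rows of counts.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total)(column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Perfect-association FVC table: rows are smokers and non-smokers,
# columns are abnormal and normal FVC.
print(expected_counts([[50, 0], [0, 50]]))  # [[25.0, 25.0], [25.0, 25.0]]
```

Because the function only uses the marginal totals, any two tables with the same marginals give the same expected counts.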
24
Example: To test the effectiveness of a new vaccine,
120 experimental animals were given a vaccine
180 control animals were not vaccinated
All 300 were then exposed to the disease
Among the vaccinated, 6 died from the disease
Among the controls, 18 died from the disease
Can we conclude that the vaccine affected the mortality rate?
25
This is an example of a test of "homogeneity." In the design of this study we pre-set the number of animals given the vaccine and the number not vaccinated. Then we observe the outcome for each animal: lived or died.
For a test of "association" or "independence," we take a random sample from a population and observe 2 characteristics (e.g., smoking status and lung function). We then ask whether these are associated.
26
We could, in fact, design a different study where we pre-set the number of smokers and non-smokers:
take a sample of $n_1$ smokers
take a sample of $n_2$ non-smokers
ask if these populations are "homogeneous" with respect to lung function: are the proportions with normal lung function the same in the two groups?
The analysis is the same for either study design.
27
Back to our example:
1. Our research question: Are the vaccinated and unvaccinated animals homogeneous with respect to mortality?
2. Assumptions:
We have independent random samples of animals.
The samples are large enough that the sample proportions are approximately normally distributed. [Most of the cells (>25%) should have expected frequencies of at least 5, or approximate normality of the sample proportions will not hold.]
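The expected-frequency condition can be checked before running the test. The following is a minimal sketch for the vaccine data (114 of 120 vaccinated animals lived, 162 of 180 controls lived); the helper names and the way the share of cells is reported are illustrative assumptions.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total)(column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def share_of_cells_at_least(expected, threshold=5):
    """Fraction of cells whose expected frequency is at least `threshold`."""
    cells = [e for row in expected for e in row]
    return sum(e >= threshold for e in cells) / len(cells)

# Rows: vaccinated, not vaccinated. Columns: lived, died.
observed = [[114, 6], [162, 18]]
expected = expected_counts(observed)

print(min(e for row in expected for e in row) >= 5)  # True
print(share_of_cells_at_least(expected))             # 1.0
```

Here every expected frequency is above 5, so the large-sample assumption looks reasonable for this table.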
28
3. $H_o$: The two groups of animals are homogeneous with respect to mortality, or the proportions dying in the 2 groups are the same.
$H_a$: The two groups are heterogeneous with respect to mortality, or the proportions dying in the 2 groups are different.
That is, $H_o$: $\pi_1 = \pi_2$ vs. $H_a$: $\pi_1 \neq \pi_2$
29
Or $H_o$: $\pi_1 = \pi_2 = \pi$, where
$\pi_1$ = % of vaccinated who die
$\pi_2$ = % of unvaccinated who die
$\pi$ = % of total who die
Here, our overall sample proportion dying is:
$P$ = (# deaths)/($n_1 + n_2$) = 24/300 = .08 = 8%
If $H_o$ is true, then we expect about 8% of the animals in each group to die.
30
               Lived   Died   Total
Vaccinated      114      6     120
Not vacc'd      162     18     180
Total           276     24     300

4. Our test statistic is
$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - e_i)^2}{e_i}$
5. I will use an $\alpha$ = .05 level of significance, and reject the null hypothesis for a p-value less than .05.
31
6. Computations:
$e_i$ = (row total)(col. total)/(grand total)
$e_1$ = 120(276)/300 = 110.4
$e_2$ = 120(24)/300 = 9.6
$e_3$ = 180(276)/300 = 165.6
$e_4$ = 180(24)/300 = 14.4
Or, once the first is found, the others can be computed by subtraction from the row and column totals.

               Lived   Died   Total
Vaccinated      114      6     120
Not vacc'd      162     18     180
Total           276     24     300
32
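Observed counts $o_i$, with expected counts $e_i$ in parentheses:

               Lived          Died         Total
Vaccinated     114 (110.4)     6 (9.6)      120
Not vacc'd     162 (165.6)    18 (14.4)     180
Total          276            24            300

$\chi^2 = \frac{(114-110.4)^2}{110.4} + \frac{(6-9.6)^2}{9.6} + \frac{(162-165.6)^2}{165.6} + \frac{(18-14.4)^2}{14.4} = 0.117 + 1.350 + 0.078 + 0.900 \approx 2.45$
Achieved significance (p-value):
$p = \Pr(\chi^2 > 2.45) = 1 - \Pr(\chi^2 \le 2.45) = 1 - .883 = .117$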
33
7. Our achieved significance is greater than $\alpha$ = .05. We therefore fail to reject the null hypothesis.
8. I conclude that the observed mortality rate among vaccinated animals of 6/120 = .05 = 5% is not significantly different from the mortality rate of 18/180 = .10 = 10% observed among unvaccinated animals. The vaccine does not appear to significantly improve survival rates.
34
Computer Analysis
To enter summary or table data in Minitab, enter only the table cells. DO NOT enter the column or row totals.
35
Select Stat > Tables > Chi-square test:
36
Chi-Square Test

Expected counts are printed below observed counts

          lived     died    Total
    1       114        6      120
         110.40     9.60
    2       162       18      180
         165.60    14.40
Total       276       24      300

Chi-Sq = 0.117 + 1.350 + 0.078 + 0.900 = 2.446
DF = 1, P-Value = 0.118
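The figures in this output can be reproduced by hand or with a few lines of code. The sketch below uses only the Python standard library and is not how Minitab computes the result internally; the closed-form tail probability it uses applies to 1 degree of freedom only, and all names are illustrative.

```python
import math

# Rows: vaccinated (1), not vaccinated (2). Columns: lived, died.
observed = [[114, 6], [162, 18]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# One (o - e)^2 / e term per cell, in the order Minitab prints them.
contributions = [(o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row)]
chi2 = sum(contributions)

# Upper-tail probability for 1 df: Pr(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print([round(c, 3) for c in contributions])  # [0.117, 1.35, 0.078, 0.9]
print(round(chi2, 3), round(p_value, 3))     # 2.446 0.118
```

Working from the unrounded statistic gives p = 0.118, in line with the Minitab output; rounding the statistic to 2.45 first gives the slightly smaller .117.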
37
Chi-square Tests of Association or Homogeneity
Not restricted to 2-by-2 tables
Apply to r-by-c tables, where r = # of rows and c = # of columns
[Diagram: table grid with rows labeled 1, 2, …, r and columns labeled 1, 2, …, c]
The chi-square test is applicable for evaluating the association between two categorical variables, each with any number of categories.
38
Q. How do we determine the appropriate number of degrees of freedom?
A. Note that in a 2x2 table, when computing expected frequencies, once we fill in any one cell, the other three are known by subtraction, since the marginals (row and column totals) are fixed.

            Col 1        Col 2               Total
Row 1         x          n_1 - x              n_1
Row 2      n_3 - x       n_2 - (n_3 - x)      n_2
Total        n_3          n_4                  n

We have "freedom" to fill in only one of the cells: 1 degree of freedom.
39
In general, then:
[Diagrams: tables with the freely chosen cells marked x, illustrating 2 d.f., 3 d.f., and 4 d.f.]
# df = (# rows - 1)(# cols - 1) = (r - 1)(c - 1)
2 x 2: (1)(1) = 1 d.f.
2 x 4: (1)(3) = 3 d.f.
3 x 3: (2)(2) = 4 d.f.
2 x 5: (1)(4) = 4 d.f.
and so on
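The degrees-of-freedom rule is simple enough to express as a one-line function. This is a small sketch with an illustrative function name, checked against the table sizes listed above.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi-square test on an r-by-c table."""
    return (n_rows - 1) * (n_cols - 1)

for shape in [(2, 2), (2, 4), (3, 3), (2, 5)]:
    print(shape, degrees_of_freedom(*shape))
# (2, 2) 1
# (2, 4) 3
# (3, 3) 4
# (2, 5) 4
```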
40
EXAMPLE 2: In the low birth weight study, let us study the relationship between RACE and LOW birth weight.
1. Research question: Is low birth weight associated with maternal race?
2. Assumptions:
We have a simple random sample from the population of interest.
n is large enough for approximate normality of sample proportions to hold.
41
3. Hypotheses:
$H_o$: There is no association between race and low birth weight (low birth weight and race are independent)
$H_a$: Race and low birth weight are associated (not independent)
4. Test statistic: $\chi^2 = \sum_i \frac{(O_i - e_i)^2}{e_i}$, summed over all table cells
5. Decision rule: Reject $H_o$ for achieved significance less than $\alpha$ = .05.
42
6. Computations: The following steps in Minitab will
produce a summary table
compute the test statistic and p-value.
Note that for a data set with individual values for each subject, the steps are a little different from those in the example for table or summarized data:
43
Stat > Tables > Cross Tabulation …
The 1st named variable defines the table rows; the 2nd defines the columns.
Check "chi-square analysis."
44
Tabulated Statistics

Rows: RACE   Columns: LOW

            0        1      All
  1        73       23       96
        66.03    29.97    96.00
  2        15       11       26
        17.88     8.12    26.00
  3        42       25       67
        46.08    20.92    67.00
All       130       59      189
       130.00    59.00   189.00

Chi-Square = 5.005, DF = 2, P-Value = 0.082

Cell Contents:  Count
                Exp Freq
45
The resulting $\chi^2$ value is 5.005, df = 2, and the achieved significance is p = .082.
7. Statistical decision: Since p > .05, I fail to reject the null hypothesis.
8. Conclusion: The proportion of mothers with low birth weight babies does not differ significantly by mother's race.
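The RACE by LOW result can be checked from the observed counts in the cross tabulation. The sketch below is a standard-library illustration rather than Minitab's own procedure; it relies on the fact that for exactly 2 degrees of freedom the chi-square upper-tail probability is exp(-x/2). Variable names are illustrative.

```python
import math

# Rows: RACE = 1, 2, 3. Columns: LOW = 0, 1.
observed = [[73, 23], [15, 11], [42, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid only when df == 2: Pr(chi-square > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(round(chi2, 3), df, round(p_value, 3))  # 5.005 2 0.082
```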
46
Comments:
Report the achieved significance: if our test were at the $\alpha$ = .10 level, our conclusion would be different.
When I see a relatively low p-value (.082) in an example like this, it may be worth exploring a 2-way comparison by combining groups or dropping a group from the analysis. For example, a secondary research question: is there a difference in the rate of low birth weight between white and black women?
47
Comments: This could be evaluated with the LOW birth weight data by:
Sub-setting the worksheet (under the MANIP menu) to exclude RACE=3 (other)
Repeating the chi-square test using this subset
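Outside Minitab, the same two steps (drop RACE=3, then repeat the test) can be sketched from the cross-tabulated counts. This is an illustration under stated assumptions: the slides only say that RACE=3 is "other," so treating codes 1 and 2 as the white and black groups is an assumption, and the helper names are hypothetical. The tail-probability formula applies to 1 degree of freedom only.

```python
import math

# Counts of (LOW = 0, LOW = 1) by RACE code, copied from the cross tabulation.
counts_by_race = {1: (73, 23), 2: (15, 11), 3: (42, 25)}

def chi_square_test(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, r in zip(observed, row_totals):
        for o, c in zip(row, col_totals):
            e = r * c / n              # expected count under Ho
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Step 1: subset to exclude RACE = 3 (other).
subset = [list(counts) for race, counts in counts_by_race.items() if race != 3]

# Step 2: repeat the chi-square test on the remaining 2-by-2 table.
chi2, df = chi_square_test(subset)
p_value = math.erfc(math.sqrt(chi2 / 2))   # valid because df == 1 here

print(subset, df)
print(round(chi2, 3), round(p_value, 3))
```

Because this comparison was suggested by looking at the data, its p-value is best read as exploratory rather than as a confirmatory test.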