Chi-Square (Association between categorical variables) Benjamin Kamala kamala8086@gmail.com
Different Scales, Different Measures of Association
Scale of Both Variables: Measure of Association
Nominal scale: Pearson chi-square (χ2)
Ordinal scale: Spearman’s rho
Interval or ratio scale: Pearson r
Chi-Square (χ2) and Frequency Data Today the data that we analyze consists of frequencies; that is, the number of individuals falling into categories. In other words, the variables are measured on a nominal scale. The test statistic for frequency data is Pearson Chi-Square. The magnitude of Pearson Chi-Square reflects the amount of discrepancy between observed frequencies and expected frequencies.
Steps in Test of Hypothesis
1. Determine the appropriate test
2. Establish the level of significance: α
3. Formulate the statistical hypothesis
4. Calculate the test statistic
5. Determine the degrees of freedom
6. Compare the computed test statistic against a tabled/critical value
1. Determine Appropriate Test Chi Square is used when both variables are measured on a nominal scale. It can be applied to interval or ratio data that have been categorized into a small number of groups. It assumes that the observations are randomly sampled from the population. All observations are independent (an individual can appear only once in a table and there are no overlapping categories). It does not make any assumptions about the shape of the distribution nor about the homogeneity of variances.
2. Establish Level of Significance α is a predetermined value. The conventional choices are α = .05, α = .01, or α = .001.
3. Determine The Hypothesis: Whether There is an Association or Not Ho : The two variables are independent Ha : The two variables are associated
When using chi-square, there is some vocabulary we must know:
Hypothesis = a proposed explanation of an observed phenomenon
Observed results = what you can observe during the course of an experiment
Expected results = what you expect to see based on your hypothesis (predictions)
4. Calculating Test Statistics The test statistic contrasts the observed frequency in each cell of a contingency table with the expected frequency. The expected frequencies represent the number of cases that would be found in each cell if the null hypothesis were true (i.e., the nominal variables are unrelated). The expected frequency of a cell for two unrelated variables is the product of its row frequency and its column frequency divided by the number of cases:
$f_e = \frac{f_r \, f_c}{N}$
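The expected-frequency rule can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name expected_frequencies and the 2 x 2 table of counts are hypothetical.

```python
def expected_frequencies(observed):
    """Return fe = (row total * column total) / N for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[fr * fc / n for fc in col_totals] for fr in row_totals]


# Hypothetical counts, chosen only to illustrate the rule.
observed = [[30, 20],
            [10, 40]]
expected = expected_frequencies(observed)
for row in expected:
    print([round(fe, 1) for fe in row])
```

Both rows of this made-up table have the same total, so they receive the same expected frequencies; the expected values always add up to the same row and column totals as the observed counts.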
The formula includes:
χ2 = chi-square
o = observed results
e = expected results
(o - e) = observed minus expected (sometimes you may see this represented with a d, which means the difference between the observed and expected results)
Σ = sum of
Our final formula:
$\chi^2 = \sum \frac{(o - e)^2}{e}$
4. Calculating Test Statistics (continued)
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
where $f_o$ = observed frequencies and $f_e$ = expected frequencies.
5. Determine Degrees of Freedom df = (R-1)(C-1), where R = number of levels in the row variable and C = number of levels in the column variable.
6. Compare computed test statistic against a tabled/critical value The computed value of the Pearson chi-square statistic is compared with the critical value to determine whether the computed value is improbable under the null hypothesis. The critical tabled values are based on the sampling distribution of the Pearson chi-square statistic. If the calculated χ2 is greater than the tabled χ2 value, reject Ho.
Example Suppose a researcher is interested in voting preferences on gun control issues. A questionnaire was developed and sent to a random sample of 90 voters. The researcher also collects information about the political party membership of the sample of 90 respondents.
Bivariate Frequency Table or Contingency Table
             Favor   Neutral   Oppose   f row
D            10      10        30       50
R            15      15        10       40
f column     25      25        40       n = 90
The cell entries are the observed frequencies; f row is the row frequency and f column is the column frequency.
1. Determine Appropriate Test Party membership (2 levels) is nominal and voting preference (3 levels) is nominal, so chi-square is the appropriate test.
2. Establish Level of Significance Alpha of .05
3. Determine The Hypothesis Ho : There is no difference between D & R in their opinions on the gun control issue. Ha : There is an association between responses to the gun control survey and party membership in the population.
4. Calculating Test Statistics
             Favor                 Neutral               Oppose                f row
Democrat     fo = 10, fe = 13.9    fo = 10, fe = 13.9    fo = 30, fe = 22.2    50
Republican   fo = 15, fe = 11.1    fo = 15, fe = 11.1    fo = 10, fe = 17.8    40
f column     25                    25                    40                    n = 90
For example, fe for Democrats in the Favor column = 50*25/90 = 13.9, and fe for Republicans in the Favor column = 40*25/90 = 11.1.
4. Calculating Test Statistics (continued)
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 11.03$
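As a check on the arithmetic, the statistic for the gun control example can be recomputed with a short Python sketch. One assumption is made here: the cell counts are the ones implied by the row totals (50 and 40), n = 90, and the expected frequencies quoted in the example. The variable names are illustrative.

```python
# Observed frequencies: rows = Democrat, Republican;
# columns = Favor, Neutral, Oppose.
# Assumption: counts implied by the row totals, n = 90 and the
# expected frequencies given in the example.
observed = [[10, 10, 30],
            [15, 15, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n   # expected frequency
        chi_square += (fo - fe) ** 2 / fe        # this cell's contribution

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_square:.3f}, df = {df}")
```

With unrounded expected frequencies the sum is 11.025, which agrees with the 11.03 reported above up to rounding.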
5. Determine Degrees of Freedom df = (R-1)(C-1) = (2-1)(3-1) = 2
6. Compare computed test statistic against a tabled/critical value
α = 0.05, df = 2
Critical tabled value = 5.991
The test statistic, 11.03, exceeds the critical value, so the null hypothesis is rejected.
Democrats & Republicans differ significantly in their opinions on gun control issues.
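The tabled value can also be reproduced directly. The sketch below relies on a property that holds only for df = 2: the upper-tail probability of the chi-square distribution is exp(-x/2), so the critical value is -2 ln(α). For other degrees of freedom a table or statistical software is still needed. Variable names are illustrative.

```python
import math

alpha = 0.05
chi_square = 11.03  # test statistic from the gun control example

# Valid for df = 2 only: P(X >= x) = exp(-x / 2).
critical_value = -2 * math.log(alpha)
p_value = math.exp(-chi_square / 2)

print(f"critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.4f}")
print("reject Ho" if chi_square > critical_value else "fail to reject Ho")
```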
SPSS Output for Gun Control Example
Example You want to look at the association between a certain disease and the sex of the person. You have collected the data shown in the table below.
1. Complete the table
2. Calculate the expected values
3. Calculate the chi-square
4. Calculate the degrees of freedom
5. Comment on the association at the 95% level of confidence.
Sick Healthy Total
Men 50 150
Women 60 940
The results of the clinical trial, in which the proportions of patients dying who received either treatment A or treatment B were compared, can be presented in a 2 x 2 table as follows:
             Outcome
Treatment    Died    Survived    Total
A            41      216         257
B            64      180         244
Total        105     396         501
Observed (O)   Expected (E)   O-E       (O-E)2/E
41             53.86          -12.86    3.07
216            203.14         12.86     0.81
64             51.14          12.86     3.23
180            192.86         -12.86    0.86
Total 501      501.00         0.00      7.97
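For a 2 x 2 table the same statistic can be cross-checked with the standard shortcut formula χ2 = N(ad - bc)2 / [(a+b)(c+d)(a+c)(b+d)], where a, b, c, d are the four cell counts. The sketch below applies it to the clinical trial data; the variable names are illustrative.

```python
a, b = 41, 216   # treatment A: died, survived
c, d = 64, 180   # treatment B: died, survived
n = a + b + c + d

# Shortcut formula for the Pearson chi-square of a 2 x 2 table.
chi_square = n * (a * d - b * c) ** 2 / (
    (a + b) * (c + d) * (a + c) * (b + d)
)
print(f"chi-square = {chi_square:.2f}")
```

The unrounded result is about 7.98; the 7.97 in the table comes from adding four terms that were each rounded to two decimals.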
Oral Hygiene among 10-year-olds, by type of school
The following data show a sample of 10-year-old children classified according to the state of oral hygiene and type of school attended.
                  Oral Hygiene
Type of School    Good    Fair+    Fair-    Poor    Total
Below average     62      103      57       11      233
Average           50      36       26       7       119
Above average     80      69       18       2       169
Total             192     208      101      20      521
Please calculate the χ2 value.
Question
A study of the factors affecting the utilization of antenatal clinics (ANC) found that 64% of the women who lived within 10 km of the clinic came for antenatal care, compared to only 47% of those who lived more than 10 km away. This suggests that antenatal care is used more often by women who live close to the clinics. The complete results are presented below:
Utilization of Antenatal Clinic by Women Living Far From and Near the Clinics
Distance from ANC    Used ANC    Did not use ANC    Total
Less than 10 km      51 (64%)    29 (36%)           80 (100%)
10 km or more        35 (47%)    40 (53%)           75 (100%)
Total                86          69                 155
From the table, there seems to be a difference in utilization of antenatal care between those who live close to and those who live far from the clinic. We want to know whether this observed difference is statistically significant. Please calculate the χ2 value.
Additional Information in SPSS Output
Exceptions that might distort χ2
Assumptions
Associations in some but not all categories
Low expected frequency per cell
Extent of association is not the same as statistical significance
Demonstrated through an example
Another Example: Heparin Lock Placement
Time: 1 = 72 hrs, 2 = 96 hrs (from Polit text, Table 8-1)
Hypotheses in Heparin Lock Placement
Ho: There is no association between complication incidence and length of heparin lock placement. (The variables are independent.)
Ha: There is an association between complication incidence and length of heparin lock placement. (The variables are related.)
More of SPSS Output (continued)
Pearson Chi-Square
Pearson chi-square = .250, p = .617. Since p > .05, we fail to reject the null hypothesis that the complication rate is unrelated to heparin lock placement time. A continuity correction is used in situations in which the expected frequency for any cell in a 2 by 2 table is less than 10.
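The reported p-value can be reproduced from the statistic alone. The sketch assumes the heparin table is 2 by 2, so df = 1; in that case the upper-tail probability of the chi-square distribution equals erfc(sqrt(x/2)), which Python's standard math module provides. The variable names are illustrative.

```python
import math

chi_square = 0.250  # Pearson chi-square from the SPSS output

# Valid for df = 1 only: P(X >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
print(f"p = {p_value:.3f}")
```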
More SPSS Output (continued)
Phi Coefficient Pearson chi-square provides information about the existence of a relationship between 2 nominal variables, but not about the magnitude of the relationship. The phi coefficient is a measure of the strength of the association.
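For a 2 by 2 table, phi is commonly computed as the square root of χ2 divided by the number of cases. The sketch below applies that standard definition to the clinical trial figures from earlier in the deck (χ2 = 7.97, N = 501); the slides themselves do not report a phi value for that example, so the result is illustrative only.

```python
import math

chi_square = 7.97  # clinical trial example
n = 501            # number of cases in that table

# Standard definition for a 2 x 2 table: phi = sqrt(chi-square / N).
phi = math.sqrt(chi_square / n)
print(f"phi = {phi:.3f}")
```

A value this close to zero indicates a weak association even though the chi-square test is significant at α = .05 (critical value 3.841 for df = 1), which illustrates why the extent of association is not the same as statistical significance.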
Cramer’s V When the table is larger than 2 by 2, a different index must be used to measure the strength of the relationship between the variables. One such index is Cramer’s V. If Cramer’s V is large, it means that there is a tendency for particular categories of the first variable to be associated with particular categories of the second variable.
$V = \sqrt{\frac{\chi^2}{N (k - 1)}}$
where N = number of cases and k = smallest of number of rows or columns.
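As an illustration, Cramer’s V can be computed for the gun control example (χ2 = 11.03, n = 90, a 2 by 3 table). The sketch uses the standard definition V = sqrt(χ2 / (N(k - 1))), with k the smaller of the number of rows and columns; the slides do not report V for this example, so treat the number as a worked illustration.

```python
import math

chi_square = 11.03   # gun control example
n = 90               # number of cases
rows, cols = 2, 3    # party membership x voting preference

k = min(rows, cols)  # smallest of number of rows or columns
cramers_v = math.sqrt(chi_square / (n * (k - 1)))
print(f"Cramer's V = {cramers_v:.3f}")
```

When k = 2, as here, the formula reduces to the phi coefficient.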
Take Home Lesson: How to Test Association between Frequency of Two Nominal Variables