Association between two categorical variables

Association between two categorical variables Dr. Ahmed Samir Al-Naaimi MBChB, MSc epid, PhD Assistant Professor / Department of Community Medicine Baghdad College of Medicine

Learning objectives
The student will analyze the relationship between two categorical variables with two or more categories.
Apply the Chi-square test for hypothesis testing.
Understand the concept of observed and expected frequencies.
Explore the link between the Chi-square test and the multiplication rule used for joint probability under the assumption of independence between 2 classification criteria.
Relate the magnitude of the difference between observed and expected frequencies to the resulting test statistic and the conclusion of statistical significance.
Evaluate the link between the Z test and the Chi-square test in the special condition of a 2 x 2 contingency table.
List the conditions for a valid Chi-square test.

Introduction
In descriptive epidemiology, we learned how to tabulate a frequency distribution for a categorical variable. Such a table shows how individuals are distributed across the categories of a variable. For example, in a rural community, a random sample of 200 people was distributed according to 3 categories of Socioeconomic Index Level (SEIL):

SEIL       N      %
Low        50     25
Average    110    55
High       40     20
Total      200    100

Introduction
In the same sample, the location of residence was also classified into 3 sectors: south, center and north.

Location   N      %
South      44     22
Center     96     48
North      60     30
Total      200    100

Introduction
When we examine the relationship between two categorical variables, we tabulate one against the other. This is a two-way table or cross-tabulation. The table shows the residence area against a second variable, socioeconomic index level (SEIL).

                Location
SEIL       South   Center   North   Total
Low        33      7        10      50
Average    9       81       20      110
High       2       8        30      40
Total      44      96       60      200

Interpretation of a two-way table
There is an association between two categorical variables if the distribution of one variable varies according to the value of the other. The question we are interested in is: “Does the Socioeconomic Index Level (SEIL) vary by place of residence?” To answer this question we need to examine a cross-tabulation and calculate relative frequencies (percentages).

Interpretation of a two-way table
To answer the question of interest, should we consider relative frequencies of the column totals or of the row totals? Because we ask how SEIL varies by place of residence, the table below uses column percentages.

                Place of residence
              South          Center         North
SEIL       n     %        n     %        n     %
Low        33    75.0     7     7.3      10    16.7
Average    9     20.5     81    84.4     20    33.3
High       2     4.5      8     8.3      30    50.0
Total      44    100.0    96    100.0    60    100.0

Interpretation of a two-way table
If the distribution of SEIL were the same in each place of residence, the column percentages would be the same for each place of residence. It appears that the percentage of low SEIL differs between places of residence, but the data are subject to sampling error, so we need to assess whether these differences in the sample proportions reflect differences in the populations. To do this, we need a hypothesis test.
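
The column percentages discussed above can be reproduced with a short Python sketch. The dictionary layout and the helper name `column_percentages` are illustrative assumptions, not part of the original slides; only the counts come from the cross-tabulation.

```python
# Observed counts from the cross-tabulation (rows: SEIL, columns: place of residence).
observed = {
    "Low":     {"South": 33, "Center": 7,  "North": 10},
    "Average": {"South": 9,  "Center": 81, "North": 20},
    "High":    {"South": 2,  "Center": 8,  "North": 30},
}


def column_percentages(table):
    """Express every cell as a percentage of its column total (hypothetical helper)."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        level: {c: 100 * row[c] / col_totals[c] for c in columns}
        for level, row in table.items()
    }


col_pct = column_percentages(observed)
for level, row in col_pct.items():
    # Print each SEIL level with its percentage in South, Center and North.
    print(level, {place: round(pct, 1) for place, pct in row.items()})
```

If SEIL did not vary by place of residence, the three percentages printed on each line would be roughly equal.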

Expected frequencies
If the null hypothesis is true and there is no association between SEIL and area of residence, the percentages for each level of SEIL in each area should be the same as the percentages in the total column. Equivalently, one can state the hypothesis as “the 2 methods of classification for people, SEIL and place of residence, are independent”. For example, the percentage of people with low SEIL in the total sample is 25%. If the null hypothesis is true, we should expect that 25% of people in any place of residence have low SEIL, so the expected frequency of people in the Center sector with low SEIL is 0.25 x 96 = 24.

Interpretation of a two-way table

                Place of residence
              South          Center         North          Total
SEIL       n     %        n     %        n     %        n      %
Low        33    75.0     7     7.3      10    16.7     50     25
Average    9     20.5     81    84.4     20    33.3     110    55
High       2     4.5      8     8.3      30    50.0     40     20
Total      44    100.0    96    100.0    60    100.0    200    100.0

Also, we should expect that 25% of people in the South have low SEIL, so the expected frequency (count) of people in the South sector with low SEIL is 0.25 x 44 = 11.

Expected frequencies
If there are no differences in the distribution of SEIL by place of residence, we should expect the relative frequency of people with low SEIL to be the same in each place of residence. Note that the expected frequencies do not have to be integers. Using the column and row totals, we can calculate the expected frequency (count) in each cell:
Expected frequency (E) = (row total x column total) / grand total

Expected frequencies
Under the null hypothesis of independence for 2 events, the joint probability is equal to the product of the probabilities of each event.
P (Low SEIL) = 50/200
P (South) = 44/200
P (Low SEIL and South) = 50/200 x 44/200
The expected frequency in (Low SEIL and South) is equal to P (Low SEIL and South) multiplied by the total sample size of 200:
Expected frequency (E) = 50/200 x 44/200 x 200 = (50 x 44) / 200 = 11
In general, E = (row total x column total) / grand total.

                Location
SEIL       South   Center   North   Total
Low        33                       50
Average                             110
High                                40
Total      44      96       60      200
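
As a minimal sketch of the rule E = (row total x column total) / grand total, the following Python computes the expected count for every cell of the SEIL by location table. The function name `expected_counts` and the list-of-lists layout are assumptions made for illustration.

```python
# Observed counts; rows are SEIL (Low, Average, High),
# columns are location (South, Center, North).
observed = [
    [33, 7, 10],
    [9, 81, 20],
    [2, 8, 30],
]


def expected_counts(table):
    """Expected count per cell under independence (hypothetical helper)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # E = row total * column total / grand total, for every cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

Each row and each column of `expected` adds up to the same marginal total as the observed table, which is a quick check that the calculation is right.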

Chi-square test
Expected frequencies are those that we should expect if the null hypothesis were true. To test the null hypothesis, we must compare the expected frequencies with the observed frequencies, using the following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O = observed frequency, E = expected frequency, and Σ = sum over all cells in the table. The resulting χ² statistic is referred to tables of the χ² distribution to obtain the p value. This is known as the Chi-square test.

Chi-Square test
From the formula we can see that if there is a large (or significant) difference between the observed and expected values, the calculated χ² (test statistic) will be large, while if there is a small (or statistically insignificant) difference between the observed and expected values, the resulting χ² will also be small.

Chi-Square test
If the calculated χ² (test statistic) is large, then the sample data provide enough evidence to reject the null hypothesis (Ho), because the observed values are not what we expect under the null hypothesis. If the calculated χ² is small in magnitude, then the sample data are consistent with the null hypothesis and we fail to reject Ho: the observed values are not significantly different from those expected under the null hypothesis of independence.

Chi-Square distribution
The values of the test statistic in the Chi-square distribution lie between zero and +∞. No negative values are possible, since the statistic is a sum of squared values. The Chi-square distribution is positively skewed, with a single (right) tail. The higher the df, the flatter the curve. Hypothesis testing with χ² is always one-tailed.

Chi-Square test
The χ² distribution is obtained from the sum of the squares of independent standard Normal variables. The number of independent variables in this sum is the “degrees of freedom”. For a contingency table, df = (r-1) x (c-1), where r is the number of rows in the table and c is the number of columns. The tabulated χ² for a 2 x 2 table with df=1 and alpha error = 0.05 is equal to (Z at 1-alpha/2) squared = (1.96)² = 3.84. This procedure is similar to the one used in other presentations, where we referred Z results to Normal distribution tables or t results to t distribution tables.
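
The link between the Z table and the χ² table at df = 1 can be checked numerically. This sketch uses only Python's standard library (`statistics.NormalDist`); the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05

# Two-sided critical value of the standard Normal distribution (about 1.96).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Squaring it gives the chi-square critical value for df = 1 (about 3.84).
chi2_crit_df1 = z_crit ** 2

print(round(z_crit, 2), round(chi2_crit_df1, 2))
```

This identity holds only for df = 1, where the χ² variable is the square of a single standard Normal variable.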

Chi-Square test

                Place of residence
              South          Center         North
SEIL       O     E        O     E        O     E        Total n
Low        33    11       7     24       10    15       50
Average    9     24.2     81    52.8     20    33       110
High       2     8.8      8     19.2     30    12       40
Total      44    44       96    96       60    60       200

Expected frequency = (row total x column total) / grand total. Example: the expected frequency in the first (upper left) cell of the table = (50 x 44) / 200 = 11, while the observed frequency is 33.

Chi-Square test

SEIL       Place of residence   Observed   Expected   O - E    (O-E)²    (O-E)²/E
Low        South                33         11         22       484       44.00
Low        Center               7          24         -17      289       12.04
Low        North                10         15         -5       25        1.67
Average    South                9          24.2       -15.2    231.04    9.55
Average    Center               81         52.8       28.2     795.24    15.06
Average    North                20         33         -13      169       5.12
High       South                2          8.8        -6.8     46.24     5.25
High       Center               8          19.2       -11.2    125.44    6.53
High       North                30         12         18       324       27.00
Total                                                                    126.2

Knowing the value of χ² and the degrees of freedom, we can obtain the probability of observing a result this extreme or more extreme if the null hypothesis were true. In the table of the χ² distribution, on the line for 4 degrees of freedom, the value obtained (126.2) lies far beyond the largest tabulated value, so p < 0.0001.

Steps for hypothesis testing
1. State the statistical hypothesis. Ho: There is no association between SEIL and residence location. HA: There is an association.
2. Fill in the observed frequencies for the contingency table.
3. Calculate the expected frequencies.
4. Calculate the test statistic (Chi-square).
5. Calculate the degrees of freedom: df = (r-1) x (c-1) = (3-1) x (3-1) = 2 x 2 = 4.
6. Get the tabulated χ² (decision rule) for the specified df. For df=4 and alpha = 0.05 it is 9.49.
7. Compare the test statistic (calculated χ²) with the decision rule. Since 126.2 > 9.49, reject Ho in favor of HA.
8. Conclusion: there is a statistically significant association between SEIL and residence location.
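
The steps above can be sketched in Python using only the standard library. Pairing each observed count with its own expected count and summing (O-E)²/E gives a statistic of roughly 126.2 for this table. The function names are illustrative assumptions. The p value uses the closed form of the χ² upper tail for 4 degrees of freedom, exp(-x/2) x (1 + x/2), which applies to df = 4 only; the critical value 9.488 is the tabulated value for df = 4 at alpha = 0.05.

```python
import math

# Observed counts; rows are SEIL (Low, Average, High),
# columns are location (South, Center, North).
observed = [
    [33, 7, 10],
    [9, 81, 20],
    [2, 8, 30],
]


def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_df4(x):
    """Upper-tail probability of the chi-square distribution, valid for df = 4 only."""
    return math.exp(-x / 2) * (1 + x / 2)


statistic, df = chi_square_statistic(observed)
critical_value = 9.488  # tabulated chi-square for df = 4, alpha = 0.05
p_value = p_value_df4(statistic)

print(round(statistic, 1), df, statistic > critical_value, p_value < 0.0001)
```

For tables with other degrees of freedom, a statistical library or printed χ² tables would be needed in place of `p_value_df4`.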

Chi-Square test in 2 x 2 tables
When both variables are binary (dichotomous), the cross-tabulation becomes a 2 x 2 table. The χ² test can be applied in the same way as for a table with a larger number of categories. This special case of χ² is very common in the medical literature. It gives the same result as the Z test for the difference between 2 proportions studied earlier in the biostatistics module. Remember that the decision rule for χ² at df=1 and alpha = 0.05 is 3.841, which is the square of the corresponding Z value, 1.96.

Example (2 x 2 table)
A study compared the bacteriological efficacy of clarithromycin vs. penicillin in children with acute pharyngo-tonsillitis caused by Streptococcus Beta Haemolytic Group A. The results are shown below.

Drug              Cure    Not cure   Total
Clarithromycin    91      9          100
Penicillin        82      18         100
Total             173     27         200

Example (2 x 2) table
Statistical hypothesis
Ho: There is no association between type of treatment and cure. In the case of the Z test we would say “There is no difference in bacteriological efficacy (response rate) between the two treatments against Streptococcus Beta Haemolytic Group A”.
HA: There is an association between type of treatment and the patient’s response to treatment.

                     Cure            Not cure
Drug              O      E        O      E        Total
Clarithromycin    91     86.5     9      13.5     100
Penicillin        82     86.5     18     13.5     100
Total             173             27              200

Example (2 x 2) table
Calculate the expected frequencies.
Calculate (O-E)²/E for each cell in the table and sum the results: χ² = 3.47.
df = (r-1) x (c-1) = (2-1) x (2-1) = 1 x 1 = 1.
Get the decision rule: χ² at df=1 is 3.841.

Drug              Effect      Observed   Expected   O - E    (O-E)²   (O-E)²/E
Clarithromycin    Cure        91         86.5       4.5      20.25    0.234
Clarithromycin    Not cure    9          13.5       -4.5     20.25    1.5
Penicillin        Cure        82         86.5       -4.5     20.25    0.234
Penicillin        Not cure    18         13.5       4.5      20.25    1.5
Total                                                                 3.47

Example (2 x 2) table
Compare the test statistic (3.47) with the decision rule (3.841). Since the test statistic is smaller than the decision rule, we fail to reject Ho.
Conclusion: There is no statistically significant association between the type of treatment and the patients’ response to treatment.
Try to solve this example by the Z test and compare the results obtained by both methods.
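
One possible way to work the suggested comparison is the pooled two-proportion Z test below, written with Python's standard library. The variable names are illustrative, and the Penicillin group size of 100 is inferred from the table totals (82 + 18). Squaring the resulting Z reproduces the χ² value of about 3.47.

```python
import math

# Cure counts and group sizes from the 2 x 2 table.
cured_clarithromycin, n_clarithromycin = 91, 100
cured_penicillin, n_penicillin = 82, 100

p1 = cured_clarithromycin / n_clarithromycin
p2 = cured_penicillin / n_penicillin

# Pooled cure proportion under Ho (no difference between treatments).
pooled = (cured_clarithromycin + cured_penicillin) / (n_clarithromycin + n_penicillin)
standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_clarithromycin + 1 / n_penicillin))

z = (p1 - p2) / standard_error

# Two-sided p value from the standard Normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2), round(z ** 2, 2), round(p_two_sided, 3))
```

Because Z is below 1.96, the Z test leads to the same decision as the χ² test: Ho is not rejected at alpha = 0.05.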

A quick formula for 2 x 2 tables
In the special case of a 2 x 2 table, χ² can be calculated without the need for expected frequencies, using only the observed frequencies and the marginal totals. If we label the cells and marginal totals as follows:

                Result
Exposure     Yes      No       Total
(row 1)      a        b        a + b
(row 2)      c        d        c + d
Total        a + c    b + d    N

then

$$\chi^2 = \frac{(ad - bc)^2 \, N}{(a+b)(c+d)(a+c)(b+d)}$$

When the sample size is small, we should diminish the difference between the observed and expected values in each cell of the table. This is obtained by a modification of the previous formula:

$$\chi^2 = \frac{\left(|ad - bc| - N/2\right)^2 N}{(a+b)(c+d)(a+c)(b+d)}$$

The vertical bars in |ad - bc| show that we take the absolute value of ad - bc: if this value is negative, we ignore the sign and take the positive value. N/2 is called the continuity correction (Yates' correction). The results of the formulae with and without the continuity correction differ because of this correction.
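
A small Python sketch of the quick formula, with and without the N/2 correction, applied to the clarithromycin vs. penicillin table (a = 91, b = 9, c = 82, d = 18). The function name `chi2_quick` and the `continuity` flag are illustrative assumptions. Clamping the corrected difference at zero is an added safeguard that the slides do not mention.

```python
def chi2_quick(a, b, c, d, continuity=False):
    """Chi-square for a 2 x 2 table from cell counts only (hypothetical helper)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        # Subtract N/2; never let the corrected difference go below zero
        # (assumption: a safeguard not stated in the original slides).
        diff = max(diff - n / 2, 0.0)
    return diff ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))


uncorrected = chi2_quick(91, 9, 82, 18)
corrected = chi2_quick(91, 9, 82, 18, continuity=True)

print(round(uncorrected, 2), round(corrected, 2))
```

The uncorrected value matches the 3.47 obtained from the cell-by-cell calculation, and the corrected value is smaller, as expected.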

Validity of Chi-Square tests
Chi-square tests are based on the assumption that the test statistic approximately follows the χ² distribution. This is reasonable for large samples, but for small ones we should use the following guidelines:
a) For 2 x 2 tables: If the total sample size is > 40, then χ² can be used. If n is between 20 and 40 and the smallest expected value is > 5, χ² can be used. Otherwise, use the Fisher exact test.
b) For r x c tables: The χ² test is valid if not more than 20% of expected values are less than 5 and none is less than 1.
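
The guidelines above could be encoded roughly as follows. This is a sketch under stated assumptions: the function names and the returned labels are hypothetical, and the thresholds are copied from the guidelines as written. The slides do not say what to do when an r x c table fails the rule, so the sketch only reports that the test is not valid.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def recommended_test(table):
    """Apply the validity guidelines to a table of observed counts (hypothetical helper)."""
    expected = [e for row in expected_counts(table) for e in row]
    n = sum(sum(row) for row in table)
    if len(table) == 2 and len(table[0]) == 2:
        # a) 2 x 2 tables
        if n > 40 or (20 <= n <= 40 and min(expected) > 5):
            return "chi-square"
        return "fisher exact"
    # b) r x c tables
    share_below_5 = sum(e < 5 for e in expected) / len(expected)
    if share_below_5 <= 0.20 and min(expected) >= 1:
        return "chi-square"
    return "chi-square not valid"


print(recommended_test([[91, 9], [82, 18]]))
print(recommended_test([[33, 7, 10], [9, 81, 20], [2, 8, 30]]))
print(recommended_test([[3, 1], [1, 3]]))
```

Note that the check is made on expected counts, not observed counts, so a table with a small observed cell can still satisfy the guidelines.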