

Analysis of Categorical Data

Types of Tests
o Data in 2 X 2 Tables (covered previously)
  * Comparing two population proportions using independent samples (Fisher’s Exact Test)
  * Comparing two population proportions using dependent samples (McNemar’s Test)
  * Relative Risk (RR), Odds Ratios (OR), Risk Difference, Attributable Risk (AR), & NNT/NNH
o Data in r X c Tables
  * Tests of Independence/Association and Homogeneity

Cervical Cancer and Age at First Pregnancy – 2 X 2 Data Table
These data come from a case-control study examining the potential relationship between age at first pregnancy and cervical cancer. We will compare, between cases and controls, the proportion of women who had their first pregnancy at or before the age of 25, because researchers suspected that an early age at first pregnancy leads to an increased risk of developing cervical cancer.

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy
Disease Status            Age ≤ 25   Age > 25   Row Totals
Cervical Cancer (Case)        42          7          49
Healthy (Control)            203        114         317
Column Totals                245        121         366

Previously
o We compared the proportions of women with the risk factor in the two groups ($p_1$ vs. $p_2$) using the z-test, a CI for ($p_1 - p_2$), and Fisher’s Exact Test.
o We computed the Odds Ratio (OR) and found a CI for the population OR.

Development of a Test Statistic to Measure Lack of Independence
One way to generalize the question of interest to the researchers is as follows:
Q: Is there an association between cervical cancer status and whether or not a woman had her 1st pregnancy at or before the age of 25?

Development of a Test Statistic to Measure Lack of Independence If there is not an association, we say that the variables are independent. In the probability notes we saw that two events A and B are said to be independent if P(A|B) = P(A).

Development of a Test Statistic to Measure Lack of Independence
In the context of our study this would mean P(Age ≤ 25 | Cancer Status) = P(Age ≤ 25), i.e. knowing a woman’s disease status tells you nothing about whether she has the risk factor of a first pregnancy at or before age 25.

Development of a Test Statistic to Measure Lack of Independence
When we condition on disease status, we see that the relationship required for independence does not hold for these data.
P(Age ≤ 25 | Cervical Cancer) = 42/49 = .8571
P(Age ≤ 25 | Healthy Control) = 203/317 = .6404
P(Age ≤ 25) = 245/366 = .6694
In this study 66.94% of the women sampled had their first pregnancy at or before the age of 25. Under independence, both conditional proportions should be equal to .6694.
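The comparison of conditional and unconditional proportions can be sketched in a few lines of Python. This is only an illustration: the four cell counts (42, 7, 203, 114) are not printed in the transcript and are reconstructed here from the margins implied by the expected frequencies shown on a later slide (49 cases, 317 controls, n = 366), so treat them as an assumption. The names `table` and `proportion_early` are hypothetical.

```python
# Cell counts reconstructed from the margins (assumption, see lead-in).
table = {
    "case": {"early": 42, "late": 7},        # cervical cancer
    "control": {"early": 203, "late": 114},  # healthy
}

def proportion_early(groups):
    """Proportion with first pregnancy at or before 25 among the given groups."""
    early = sum(table[g]["early"] for g in groups)
    total = sum(sum(table[g].values()) for g in groups)
    return early / total

p_case = proportion_early(["case"])                  # P(Age <= 25 | case)
p_control = proportion_early(["control"])            # P(Age <= 25 | control)
p_overall = proportion_early(["case", "control"])    # P(Age <= 25)

print(f"cases:    {p_case:.4f}")
print(f"controls: {p_control:.4f}")
print(f"overall:  {p_overall:.4f}")
```

If the two variables were independent, the first two proportions would both be close to the third.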

Development of a Test Statistic to Measure Lack of Independence
o Of course the observed differences could be due to random variation, and in truth it may be the case that disease and risk factor status are independent.
o Therefore we need a means of assessing how different the observed results are from what we would expect to see if these two factors were independent.

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy
Disease Status            Age ≤ 25   Age > 25   Row Totals
Cervical Cancer (Case)        a          b          R1
Healthy (Control)             c          d          R2
Column Totals                 C1         C2         n

Development of a Test Statistic to Measure Lack of Independence
From this table we can calculate the conditional probability of having the risk factor of early pregnancy given the disease status of the subject as follows:
$$P(\text{risk factor} \mid \text{case}) = \frac{a}{R_1}$$
The unconditional probability of risk factor presence for these data is given by:
$$P(\text{risk factor}) = \frac{C_1}{n}$$
and setting these equal we have
$$\frac{a}{R_1} = \frac{C_1}{n}$$

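Development of a Test Statistic to Measure Lack of Independence
Thus, under independence, we expect the frequency in the a cell to be equal to:
$$E_a = \frac{R_1 C_1}{n}$$
Similarly we find the following expected frequencies for the other cells making up the 2 X 2 table:
$$E_b = \frac{R_1 C_2}{n}, \quad E_c = \frac{R_2 C_1}{n}, \quad E_d = \frac{R_2 C_2}{n}$$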

Development of a Test Statistic to Measure Lack of Independence
In general we denote the observed frequency in the ith row and jth column as $O_{ij}$, or just O for short. We denote the expected frequency for the ith row and jth column as $E_{ij} = R_i C_j / n$, or just E for short.
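The rule "row total times column total divided by n" can be sketched as a small function. The function name `expected_counts` is illustrative, and the observed counts are the reconstructed cervical cancer table (an assumption based on the margins implied by the expected frequencies listed on a later slide).

```python
def expected_counts(observed):
    """Expected counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: case, control. Columns: Age <= 25, Age > 25 (reconstructed counts).
observed = [[42, 7], [203, 114]]
expected = expected_counts(observed)

for row in expected:
    print("  ".join(f"{e:7.2f}" for e in row))
```

The same function works unchanged for a general r X c table, since it only uses the row and column totals.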

Development of a Test Statistic to Measure Lack of Independence
o To measure how different our observed results are from what we would expect to see if the two variables in question were independent, we intuitively should look at the difference between the observed (O) and expected (E) frequencies, i.e. O – E, or more specifically $O_{ij} - E_{ij}$.
o However, this will give too much weight to differences where these frequencies are both large in size.

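Development of a Test Statistic to Measure Lack of Independence
o One test statistic that addresses the “size” of the frequencies issue is Pearson’s Chi-Square ($\chi^2$):
$$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
o Notice this test statistic still uses (O – E) as the basic building block. This statistic will be large when the observed frequencies do NOT match the expected values for independence.
o If the variables are independent, the statistic approximately follows a chi-square distribution with df = (r – 1)(c – 1).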

Chi-square Distribution
[Figure: the chi-square distribution with 4 degrees of freedom, with the area to the right of the observed $\chi^2$ marked as the p-value.]
The area to the right of Pearson’s chi-square statistic gives the p-value. The p-value is always the area to the right!

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy
Disease Status            Age ≤ 25   Age > 25   Row Totals
Cervical Cancer (Case)       O11        O12         R1
Healthy (Control)            O21        O22         R2
Column Totals                 C1         C2         n

Calculating Expected Frequencies
Cervical Cancer and Age at 1st Pregnancy (expected frequencies in parentheses)
Disease Status            Age ≤ 25        Age > 25        Row Totals
Cervical Cancer (Case)     42 (32.80)       7 (16.20)      R1 = 49
Healthy (Control)         203 (212.20)    114 (104.80)     R2 = 317
Column Totals             C1 = 245        C2 = 121         n = 366

Calculating the Pearson Chi-square
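The worked calculation for this slide did not survive in the transcript, so the following is a minimal sketch of how Pearson's statistic could be computed for the cervical cancer table. The observed counts are reconstructed from the margins implied by the expected frequencies (an assumption), and `pearson_chi_square` is an illustrative name, not something from the slides.

```python
def pearson_chi_square(observed):
    """Return (statistic, expected) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected counts under independence: E_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of (O - E)^2 / E over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected

observed = [[42, 7], [203, 114]]  # reconstructed counts (assumption)
chi2, expected = pearson_chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi2:.2f} on {df} df")
```

With these counts the statistic comes out at about 9.01 on 1 degree of freedom.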

Chi-square Probability Calculator in JMP
Enter the test statistic value and df, and the p-value is calculated automatically: p-value = P($\chi^2_{df}$ > observed test statistic).
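Outside JMP, the same right-tail area can be sketched with the Python standard library for the special case df = 1. This uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so the upper tail equals `erfc(sqrt(x / 2))`. The function name is illustrative and the cell counts are the reconstructed ones (an assumption); for other df a statistics library would be needed.

```python
import math

def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x), via chi2_1 = Z**2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

# 2 X 2 shortcut formula, algebraically equal to Pearson's sum over the four cells.
a, b, c, d = 42, 7, 203, 114  # reconstructed counts (assumption)
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

p_value = chi2_upper_tail_df1(chi2)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```

The result agrees with the p = .0027 reported in the conclusion for this example.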

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy
Conclusion: We have strong evidence to suggest that age at first pregnancy and cervical cancer status are NOT independent, i.e. that they are associated or related (p = .0027). In particular, we found that the proportion of women having their first pregnancy at or before the age of 25 was higher among women with cervical cancer than among those without.

Other things we could do…
o Odds Ratio (OR) and CI for OR - a case-control study means no RR.
o Fisher’s Exact Test - Pearson’s chi-square is an approximation that requires “large” sample sizes
  * typically we would like all $E_{ij}$ > 5
  * or at least 80% of cells should have $E_{ij}$ > 5
  * thus the approximation should be good here, as both of these conditions are met for this study.
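The sample-size rule of thumb above can be checked mechanically. The sketch below applies it to the expected frequencies listed for the cervical cancer table; `check_expected_counts` and its `threshold` parameter are hypothetical names used only for illustration.

```python
def check_expected_counts(expected, threshold=5.0):
    """Summarize how many expected counts exceed the threshold."""
    cells = [e for row in expected for e in row]
    above = [e > threshold for e in cells]
    return {
        "all_above": all(above),                  # every E_ij > threshold
        "share_above": sum(above) / len(cells),   # fraction of cells with E_ij > threshold
        "min_expected": min(cells),
    }

# Expected frequencies as listed on the "Calculating Expected Frequencies" slide.
expected = [[32.80, 16.20], [212.20, 104.80]]
summary = check_expected_counts(expected)
print(summary)

# Either condition from the slide is enough to trust the approximation.
approximation_ok = summary["all_above"] or summary["share_above"] >= 0.80
```

Here the smallest expected count is 16.20, so both versions of the rule are satisfied.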

Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease
In this study a random sample of 538 patients diagnosed with some form of Hodgkin’s Disease was taken. For each patient the histological type was recorded: nodular sclerosis (NS), mixed cellularity (MC), lymphocyte predominance (LP), or lymphocyte depletion (LD). The outcome from standard treatment was also recorded as none, partial, or complete remission.
Q: Is there an association between type of Hodgkin’s and response to treatment? If so, what is the nature of the relationship?

Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease
Type             None   Partial   Positive   Row Totals
LD                                    18          72
LP                                    74         104
MC                                   154         266
NS                                    68          96
Column Totals                        314      n = 538
Some Probabilities of Potential Interest
Probability of Positive Response to Treatment
P(positive) = 314/538 = .5836
Probability of Positive Response to Treatment Given Disease Type
P(positive|LD) = 18/72 = .2500
P(positive|LP) = 74/104 = .7115
P(positive|MC) = 154/266 = .5789
P(positive|NS) = 68/96 = .7083
Notice the conditional probabilities are not equal to the unconditional probability!
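The probabilities on this slide can be reproduced directly from the positive-response counts and row totals quoted in it. The dictionary names below are illustrative; only the counts come from the slide.

```python
# Positive (complete remission) counts and row totals quoted on the slide.
positive = {"LD": 18, "LP": 74, "MC": 154, "NS": 68}
row_totals = {"LD": 72, "LP": 104, "MC": 266, "NS": 96}

n = sum(row_totals.values())
p_positive = sum(positive.values()) / n  # unconditional P(positive)

# Conditional P(positive | type) for each histological type.
p_positive_given_type = {t: positive[t] / row_totals[t] for t in positive}

print(f"P(positive) = {p_positive:.4f}")
for t, p in p_positive_given_type.items():
    print(f"P(positive | {t}) = {p:.4f}")
```

Under independence every conditional value would be close to the unconditional one, which is clearly not the case for LD.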

Mosaic plot of the results: Response to Treatment vs. Histological Type
Clearly LP and NS respond most favorably to treatment, with over 70% of those sampled experiencing complete remission, whereas a majority (61.1%) of lymphocyte depletion patients had no response to treatment. A statistical test at this point seems unnecessary, as it seems clear that there is an association between the type of Hodgkin’s disease and the response to treatment; nonetheless we will proceed…


Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease
Expected frequencies under independence (n = 538):
Type      None       Partial     Positive
LD       (16.86)     (13.11)     (42.02)
LP       (24.36)     (18.94)     (60.69)
MC       (62.30)     (48.45)    (155.25)
NS       (22.48)     (17.49)     (56.03)
We have strong evidence of an association between the type of Hodgkin’s and response to treatment (p < .0001).
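The expected frequencies in parentheses can be recomputed from the margins alone. In this sketch the row totals and the Positive column total come from the earlier slide, while the None and Partial column totals (126 and 98) are not printed in the transcript and are inferred from the listed expected counts, so they are an assumption. Variable names are illustrative.

```python
row_totals = {"LD": 72, "LP": 104, "MC": 266, "NS": 96}
# None and Partial totals are inferred from the listed expected counts (assumption).
col_totals = {"None": 126, "Partial": 98, "Positive": 314}
n = sum(row_totals.values())

# E_ij = (row total) * (column total) / n for every cell of the 4 X 3 table.
expected = {
    t: {resp: r * c / n for resp, c in col_totals.items()}
    for t, r in row_totals.items()
}

for t, cells in expected.items():
    print(t, "  ".join(f"{resp}: {e:.2f}" for resp, e in cells.items()))

df = (len(row_totals) - 1) * (len(col_totals) - 1)  # (r - 1)(c - 1)
```

Every expected count is well above 5, and the test has (4 - 1)(3 - 1) = 6 degrees of freedom.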

Measures of Association Between Two Categorical Variables This can be applied to the cervical cancer case-control study.

Measures of Association Between Two Categorical Variables This can be used for general r x c tables. This can be used for the Hodgkin’s example:

Measures of Association Between Two Categorical Variables For the Hodgkin’s study

Measures of Association Between Two Categorical Variables
There are many other measures of association. When both variables are nominal, the previous measures are fine, and there are certainly many more. When both variables are ordinal, common measures include Kendall’s tau and Somers’ D. In some cases we wish to measure the degree of exact agreement between two nominal or ordinal variables measured using the same levels or scales, in which case we generally use Cohen’s Kappa (κ).

Measures of Association Between Two Categorical Variables
Cohen’s Kappa (κ) measures the degree of agreement between two variables on the same scales.
Example 3: Medicare Study – General health at baseline and 2-yr. follow-up, how well do they agree?
κ > .75: excellent agreement
.40 ≤ κ ≤ .75: good agreement
0 < κ < .40: marginal agreement
There is fairly good agreement between the general assessment of overall health at baseline and at follow-up. However, there appears to be some general trend for improvement as well.
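The kappa formula itself is not preserved in the transcript. The sketch below uses the standard definition, κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of exact agreement (the diagonal) and p_e is the agreement expected by chance from the margins. The 3 X 3 table is made up purely for illustration and is NOT the Medicare data; `cohens_kappa` is an illustrative name.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts (rows and columns on the same scale)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: share of observations on the main diagonal.
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: sum over categories of (row share) * (column share).
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts: rows = first rating, columns = second rating.
hypothetical = [
    [20, 5, 0],
    [3, 15, 2],
    [1, 4, 10],
]
kappa = cohens_kappa(hypothetical)
print(f"kappa = {kappa:.3f}")
```

For this made-up table the observed agreement is 0.75 and the chance agreement is 0.35, which gives a kappa of roughly 0.62.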

Testing for Lack of Symmetry
o Bowker’s Test of Symmetry is a generalization of McNemar’s Test to r x r tables where the row and column variables are on the same scale.
o The general health of the subjects in the Medicare study is an example of where this test could be used, as health at both baseline and follow-up is recorded using the same 5-point ordinal scale.

Bowker’s Test of Symmetry
                         Y
X                1      2     …      r     Row Totals
1               O11    O12    …     O1r
2               O21    O22    …     O2r
…                …      …     …      …
r               Or1    Or2    …     Orr
Column Totals
The test looks for the frequencies to be generally larger on one side of the diagonal than the other.

Bowker’s Test of Symmetry
When will this test statistic be “large”? If there were a general tendency for X > Y or for X < Y, then we would expect the off-diagonal cells of the table to be larger on one side of the diagonal than the other. For example, if Y tended to be larger than X, perhaps indicating an improvement in health, then we would expect the frequencies above the diagonal to be larger than those below.
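The test statistic is not shown in the transcript. A common form of Bowker's statistic compares each pair of mirror-image cells, χ² = Σ over i < j of (O_ij - O_ji)² / (O_ij + O_ji), with r(r - 1)/2 degrees of freedom. The sketch below assumes that form. The table is hypothetical (it is not the Medicare data) and `bowker_statistic` is an illustrative name.

```python
def bowker_statistic(table):
    """Bowker's symmetry statistic and df for a square table of counts."""
    r = len(table)
    statistic = 0.0
    for i in range(r):
        for j in range(i + 1, r):
            pair_total = table[i][j] + table[j][i]
            if pair_total > 0:  # empty pairs contribute nothing
                statistic += (table[i][j] - table[j][i]) ** 2 / pair_total
    df = r * (r - 1) // 2
    return statistic, df

# Hypothetical counts: rows = X (e.g. baseline), columns = Y (e.g. follow-up).
hypothetical = [
    [20, 10, 5],
    [4, 15, 8],
    [1, 2, 10],
]
stat, df = bowker_statistic(hypothetical)

# Totals above and below the diagonal show the direction of the asymmetry.
above = sum(hypothetical[i][j] for i in range(3) for j in range(i + 1, 3))
below = sum(hypothetical[j][i] for i in range(3) for j in range(i + 1, 3))
print(f"statistic = {stat:.2f}, df = {df}, above = {above}, below = {below}")
```

In this made-up table the counts above the diagonal (Y > X) outweigh those below, which is the pattern the slide describes as a tendency toward improvement. For a 2 X 2 table the statistic reduces to McNemar's (uncorrected) statistic.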

Bowker’s Test of Symmetry
Symmetry of Disagreement: Bowker’s test suggests the differences are asymmetric (p < .0001). Examining the percentages suggests that, in each group defined by baseline score, a majority of patients either stayed the same or improved. Therefore it is reasonable to state that we have evidence that, in general, subjects’ health stayed the same or, if it did change, it generally changed for the better (p < .0001).

Other Approaches
o Wilcoxon Signed-Rank Test for the paired differences in the ordinal health score (p < .0001).
o Direct examination of the distribution of the changes in general health score (Follow-up – Baseline).
There is a slight advantage for improvement vs. decline in health. The plot on the right shows the change in general health vs. baseline health. With the exception of those with the lowest health at baseline, a majority (50%+) of patients stayed the same. The shaded area for improvement is larger than the shaded area for health decline.

Other Tests for Categorical Data
o Chi-square Test for Trend in Binomial Proportions – tests whether or not $p_1 < p_2 < p_3 < \dots < p_k$, where 1, 2, …, k are levels of an ordinal variable, i.e. a 2 X k table.
o Chi-square Goodness-of-Fit Tests – used to test whether observations come from some hypothesized distribution.
o Cochran-Mantel-Haenszel Test – looks at whether or not there is a relationship in a 2 X 2 table situation, adjusting for the level of a third factor. For example, is there a relationship between heavy drinking (Y or N) and lung cancer (Y or N), adjusting for smoking status?