Statistical Package Usage Topic: Basic Categorical Data Analysis By Prof Kelly Fan, Cal. State Univ., East Bay
Outline
Only categorical variables are discussed here.
- Verify the hypothesized distribution
  - One-sample chi-square test
- Test the independence between two categorical variables
  - Chi-square test for two-way contingency table
  - McNemar's test for paired data
  - Measure the dependence (phi and kappa coefficients)
  - Odds ratios and relative risk
- Test the trend of a binary response
  - Chi-square test for trend
- Meta-analysis
Example: Hair Color Distribution
From a random sample of 246 children:

             Fair     Red   Medium    Dark   Black
Frequency      76      19       83      65       3
%           30.89    7.72    33.74   26.42    1.22
Test %         30      12       30      25       3
One-sample Chi-Square Test
- The data must be a random sample
- The sample size must be large enough so that expected frequencies are greater than or equal to 5 for 80% or more of the categories
One-sample Chi-Square Test
Test statistic: $\chi^2 = \sum_i \frac{(O_i - e_i)^2}{e_i}$
- $O_i$ = the observed frequency of the i-th category
- $e_i$ = the expected frequency of the i-th category
SAS Output: Chi-Square Test for Specified Proportions

Chi-Square     7.7602
DF                  4
Pr > ChiSq     0.1008
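The one-sample test above can be reproduced by hand as a check on the SAS output. The following is a minimal sketch using only the Python standard library. The hypothesized percentages used here are 30, 12, 30, 25, and 3; treat the Medium and Black values as an assumption, chosen because they make the percentages sum to 100 and reproduce the reported statistic. The helper `chi2_sf_even_df` is a hypothetical name, and its closed form is valid only for an even number of degrees of freedom.

```python
import math

# Observed hair-color counts from the random sample of 246 children.
observed = {"Fair": 76, "Red": 19, "Medium": 83, "Dark": 65, "Black": 3}
# Hypothesized percentages (the Medium and Black values are assumptions).
test_pct = {"Fair": 30, "Red": 12, "Medium": 30, "Dark": 25, "Black": 3}

n = sum(observed.values())
expected = {k: n * test_pct[k] / 100 for k in observed}

# Pearson statistic: sum over categories of (O_i - e_i)^2 / e_i.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


p_value = chi2_sf_even_df(chi2, df)
print(f"chi-square = {chi2:.4f}, df = {df}, p = {p_value:.4f}")
```

Under these assumptions the result should land close to the Chi-Square and Pr > ChiSq values in the SAS output, and every expected count is at least 5, so the sample-size condition holds.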
Two-way Contingency Tables
- Report frequencies on two variables
- Such tables are also called crosstabs
Contingency Tables (Crosstabs)
1991 General Social Survey (cell entries are frequencies)

                 Party Identification
Race     Democrat   Independent   Republican
White         341           105          405
Black         103            15           11
Crosstabs Analysis (SAS: p.88-90; SPSS: p.369-371)
Chi-square test for testing the independence between two variables. Under independence:
- The distribution of frequencies over the rows stays the same, whichever column is considered
- The distribution of frequencies over the columns stays the same, whichever row is considered
Crosstabs Analysis
- The phi coefficient measures the association between two categorical variables
- -1 < phi < 1
- |phi| indicates the strength of the association
- If the two variables are both ordinal, then the sign of phi indicates the direction of association
SAS Output

Statistic                      DF      Value     Prob
Chi-Square                      2    79.4310   <.0001
Likelihood Ratio Chi-Square     2    90.3311   <.0001
Mantel-Haenszel Chi-Square      1    79.3336   <.0001
Phi Coefficient                       0.2847
Contingency Coefficient               0.2738
Cramer's V                            0.2847

Sample Size = 980
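As a check on the crosstabs output, the Pearson chi-square and the association measures can be recomputed from the race by party identification table. This is a minimal sketch, not the SAS implementation. The formulas used for phi, the contingency coefficient, and Cramer's V are the usual chi-square based definitions, which is an assumption about what SAS reports here. The p-value shortcut `exp(-chi2/2)` is exact only for 2 degrees of freedom.

```python
import math

# Rows: White, Black. Columns: Democrat, Independent, Republican.
table = [[341, 105, 405],
         [103, 15, 11]]

row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
n = sum(row_tot)

# Pearson chi-square: expected count = row total * column total / n.
chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / n
        chi2 += (obs - exp) ** 2 / exp

n_rows, n_cols = len(table), len(table[0])
df = (n_rows - 1) * (n_cols - 1)
p_value = math.exp(-chi2 / 2)  # upper tail, exact for df = 2 only

phi = math.sqrt(chi2 / n)
contingency = math.sqrt(chi2 / (chi2 + n))
cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

print(f"chi-square = {chi2:.4f}, df = {df}, p = {p_value:.3g}")
print(f"phi = {phi:.4f}, contingency = {contingency:.4f}, V = {cramers_v:.4f}")
```

Because one dimension of the table has only two levels, Cramer's V coincides with phi, which agrees with the two identical values in the output.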
Fisher's Exact Test for Independence
- The chi-squared tests are for large samples
- The sample size must be large enough so that expected frequencies are greater than or equal to 5 for 80% or more of the categories
- When this condition is not met, Fisher's exact test can be used instead
SAS Output: Fisher's Exact Test

Table Probability (P)    3.823E-22
Pr <= P                  2.787E-20

Sample Size = 980
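The two numbers in the Fisher output can be illustrated by brute force on the 2x3 party identification table. The sketch below assumes the standard definition: with all margins fixed, each table has a multivariate hypergeometric probability, and the p-value sums the probabilities of all tables that are no more likely than the observed one. Exact integer arithmetic is used so that ties are compared without rounding. Variable names are illustrative only.

```python
from math import comb

# Rows: White, Black. Columns: Democrat, Independent, Republican.
table = [[341, 105, 405],
         [103, 15, 11]]

row1 = sum(table[0])
cols = [sum(col) for col in zip(*table)]
n = sum(cols)

# Precompute binomial coefficients C(column total, k) for each column.
c1 = [comb(cols[0], k) for k in range(cols[0] + 1)]
c2 = [comb(cols[1], k) for k in range(cols[1] + 1)]
c3 = [comb(cols[2], k) for k in range(cols[2] + 1)]
denom = comb(n, row1)  # number of ways to choose the first row's members

# Integer weight of the observed table; probability = weight / denom.
x = table[0]
w_obs = c1[x[0]] * c2[x[1]] * c3[x[2]]
table_prob = w_obs / denom

# Enumerate every first row compatible with the margins.
tail = 0
for x1 in range(min(cols[0], row1) + 1):
    for x2 in range(min(cols[1], row1 - x1) + 1):
        x3 = row1 - x1 - x2
        if x3 > cols[2]:
            continue
        w = c1[x1] * c2[x2] * c3[x3]
        if w <= w_obs:  # table is no more likely than the observed one
            tail += w
p_value = tail / denom

print(f"table probability = {table_prob:.3e}, Pr <= P = {p_value:.3e}")
```

Enumeration is feasible here because a 2x3 table with fixed margins has only two free cells. Larger tables would need a network algorithm or Monte Carlo sampling.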
Matched-pair Data
Comparing categorical responses for two “paired” samples, when either
- each sample has the same subjects (that is, the subjects are measured twice), or
- a natural pairing exists between each subject in one sample and a subject from the other sample (e.g., twins)
Example: Rating for Prime Minister

                    Second Survey
First Survey    Approve   Disapprove
Approve             794          150
Disapprove           86          570
Marginal Homogeneity
- The probabilities of “success” for both samples are identical
- E.g., the probability of approval is the same at the first and second surveys
McNemar Test (for 2x2 Tables only)
See SAS textbook Section 3.L
- Ho: marginal homogeneity
- Ha: no marginal homogeneity
- Exact p-value
- Approximate p-value (when n12 + n21 > 10)
SAS Output

McNemar's Test
Statistic (S)           17.3559
DF                            1
Asymptotic Pr > S        <.0001
Exact Pr >= S         3.716E-05

Simple Kappa Coefficient (level of agreement)
Kappa                    0.6996
ASE                      0.0180
95% Lower Conf Limit     0.6644
95% Upper Conf Limit     0.7348

Sample Size = 1600
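McNemar's statistic depends only on the two off-diagonal (discordant) counts, which is easy to see when it is computed by hand. The sketch below uses the prime minister rating table and assumes the usual definitions: S = (n12 - n21)^2 / (n12 + n21) with 1 degree of freedom, an exact p-value from a two-sided binomial test with success probability 0.5 on the discordant pairs, and Cohen's simple kappa for the level of agreement.

```python
import math
from math import comb

# Rows: first survey (Approve, Disapprove). Columns: second survey.
table = [[794, 150],
         [86, 570]]

n = sum(sum(row) for row in table)
n12, n21 = table[0][1], table[1][0]  # discordant pairs

# Asymptotic McNemar test (chi-square with 1 degree of freedom).
s = (n12 - n21) ** 2 / (n12 + n21)
p_asymptotic = math.erfc(math.sqrt(s / 2))  # chi-square df = 1 upper tail

# Exact test: two-sided binomial on the discordant pairs with p = 0.5.
m = n12 + n21
k = min(n12, n21)
p_exact = min(1.0, 2 * sum(comb(m, i) for i in range(k + 1)) / 2 ** m)

# Simple kappa: observed agreement corrected for chance agreement.
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
p_observed = (table[0][0] + table[1][1]) / n
p_chance = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"S = {s:.4f}, asymptotic p = {p_asymptotic:.3g}, exact p = {p_exact:.3g}")
print(f"kappa = {kappa:.4f}")
```

The 1,364 concordant respondents do not enter the test at all. They affect only kappa, which measures how closely the two surveys agree.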
Comparing Proportions in 2x2 Tables
- Difference of proportions: pi1 - pi2
- Relative risk: pi1/pi2
- Odds ratio: odds1/odds2, where odds1 = pi1/(1-pi1) and odds2 = pi2/(1-pi2)
Example: Aspirin vs. Heart Attack
Prospective sampling; row totals were fixed (cell entries are frequencies)

           Heart attack   No heart attack
Placebo             189             10845
Aspirin             104             10933
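The three comparison measures can be worked out directly from the aspirin table. This is a small sketch that treats the placebo row as group 1 and the aspirin row as group 2, which is an assumption about the intended orientation. Swapping the groups inverts the two ratios and flips the sign of the difference.

```python
# (heart attack, no heart attack) counts for each group.
placebo = (189, 10845)
aspirin = (104, 10933)


def proportion(events, non_events):
    """Sample proportion of events within one row of the table."""
    return events / (events + non_events)


pi1 = proportion(*placebo)  # estimated heart-attack risk on placebo
pi2 = proportion(*aspirin)  # estimated heart-attack risk on aspirin

difference = pi1 - pi2
relative_risk = pi1 / pi2

odds1 = pi1 / (1 - pi1)
odds2 = pi2 / (1 - pi2)
odds_ratio = odds1 / odds2

print(f"pi1 = {pi1:.4f}, pi2 = {pi2:.4f}")
print(f"difference = {difference:.4f}")
print(f"relative risk = {relative_risk:.3f}, odds ratio = {odds_ratio:.3f}")
```

Because heart attacks are rare in both groups, the odds ratio and the relative risk come out close to each other. The odds ratio also equals the cross-product ratio (189 x 10933) / (104 x 10845).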
Chi-square Test for Trend
- Situation: a binary response (success, failure) plus an ordinal explanatory variable
- Question: Is there a trend? Do the proportions of success increase or decrease linearly across the levels of the explanatory variable?
Example: Shoulder Harness Usage

Use?   Large Cars   Medium Cars   Small Cars
No            226           165          175
Yes            83            70           71

Question: Is the proportion of shoulder harness usage increasing or decreasing linearly as the car size gets larger?
SAS Output: Statistics for Table of response by car_size

Statistic                      DF     Value     Prob
Chi-Square                      2    0.6080   0.7379
Likelihood Ratio Chi-Square     2    0.6092   0.7374
Mantel-Haenszel Chi-Square      1    0.3073   0.5793
Phi Coefficient                      0.0277
Contingency Coefficient              0.0277
Cramer's V                           0.0277
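For the trend question, the relevant line of the output is the Mantel-Haenszel chi-square with 1 degree of freedom. A minimal sketch of that statistic follows, assuming it is computed as (N - 1) r^2, where r is the Pearson correlation between the response (No = 0, Yes = 1) and equally spaced car-size scores. The scores 1, 2, 3 are an assumption. Any equally spaced scores, in either direction, give the same value.

```python
import math

# (No, Yes) counts of shoulder harness use for each car size.
counts = {"Large": (226, 83), "Medium": (165, 70), "Small": (175, 71)}
# Assumed equally spaced scores for the ordinal explanatory variable.
scores = {"Large": 1, "Medium": 2, "Small": 3}

n = sum(no + yes for no, yes in counts.values())

# Sums for the correlation between score x and binary response y.
sum_x = sum(scores[k] * (no + yes) for k, (no, yes) in counts.items())
sum_x2 = sum(scores[k] ** 2 * (no + yes) for k, (no, yes) in counts.items())
sum_y = sum(yes for _, yes in counts.values())  # y is 0/1, so sum of y^2 = sum of y
sum_xy = sum(scores[k] * yes for k, (_, yes) in counts.items())

sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y - sum_y ** 2 / n

r = sxy / math.sqrt(sxx * syy)
m2 = (n - 1) * r ** 2
p_value = math.erfc(math.sqrt(m2 / 2))  # chi-square df = 1 upper tail

print(f"r = {r:.4f}, M^2 = {m2:.4f}, p = {p_value:.4f}")
```

The large p-value gives no evidence of a linear trend in usage across car sizes, in line with the output above.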
Meta Analysis
- Also known as the Mantel-Haenszel test or stratified analysis
- Situation: another variable (the strata) may “pollute” the effect of a categorical explanatory variable on a categorical response
- Goal: study the effect of the explanatory variable while controlling for the stratification variable
Example: Respiratory Improvement

Center   Treatment   Yes    No   Total
1        Test         29    16      45
         Placebo      14    31      45
         Total        43    47      90
2        Test         37     8      45
         Placebo      24    21      45
         Total        61    29      90
SAS Output: Statistics for Table 1 of trtmnt by response, Controlling for center=1

Statistic                      DF      Value     Prob
Chi-Square                      1    10.0198   0.0015
Likelihood Ratio Chi-Square     1    10.2162   0.0014
Continuity Adj. Chi-Square      1     8.7284   0.0031
Mantel-Haenszel Chi-Square      1     9.9085   0.0016
Phi Coefficient                       0.3337
Contingency Coefficient               0.3165
Cramer's V                            0.3337

Estimates of the Relative Risk (Row1/Row2)

Type of Study                 Value    95% Confidence Limits
Case-Control (Odds Ratio)    4.0134    1.6680    9.6564
Cohort (Col1 Risk)           2.0714    1.2742    3.3675
Cohort (Col2 Risk)           0.5161    0.3325    0.8011

Sample Size = 90
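The odds ratio and column 1 relative risk for center 1, with their 95% limits, can be reproduced from the four cell counts. The sketch below assumes the limits are large-sample (Wald) intervals built on the log scale, which is consistent with the numbers shown but is not confirmed by the slides.

```python
import math
from statistics import NormalDist

# Center 1: Test row (Yes, No) = (29, 16); Placebo row (Yes, No) = (14, 31).
a, b = 29, 16
c, d = 14, 31

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval


def log_scale_ci(estimate, se_log):
    """Wald interval computed on the log scale, then back-transformed."""
    center = math.log(estimate)
    return math.exp(center - z * se_log), math.exp(center + z * se_log)


# Odds ratio (cross-product ratio) and its log-scale standard error.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
or_ci = log_scale_ci(odds_ratio, se_log_or)

# Relative risk of "Yes" (column 1), Test versus Placebo.
rel_risk = (a / (a + b)) / (c / (c + d))
se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
rr_ci = log_scale_ci(rel_risk, se_log_rr)

print(f"odds ratio = {odds_ratio:.4f}, 95% CI = ({or_ci[0]:.4f}, {or_ci[1]:.4f})")
print(f"relative risk = {rel_risk:.4f}, 95% CI = ({rr_ci[0]:.4f}, {rr_ci[1]:.4f})")
```

Neither interval contains 1, which agrees with the small chi-square p-values for this center.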
SAS Output: Summary Statistics for trtmnt by response, Controlling for center

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic   Alternative Hypothesis    DF      Value     Prob
1           Nonzero Correlation        1    18.4106   <.0001
2           Row Mean Scores Differ     1    18.4106   <.0001
3           General Association        1    18.4106   <.0001
SAS Output: Estimates of the Common Relative Risk (Row1/Row2)

Type of Study                Method              Value    95% Confidence Limits
Case-Control (Odds Ratio)    Mantel-Haenszel    4.0288    2.1057    7.7084
                             Logit              4.0286    2.1057    7.7072
Cohort (Col1 Risk)           Mantel-Haenszel    1.7368    1.3301    2.2680
                             Logit              1.6760    1.2943    2.1703
Cohort (Col2 Risk)           Mantel-Haenszel    0.4615    0.3162    0.6737
                             Logit              0.4738    0.3264    0.6877

Breslow-Day Test for Homogeneity of the Odds Ratios

Chi-Square     0.0002
DF                  1
Pr > ChiSq     0.9900

Total Sample Size = 180
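The stratified analysis can be sketched for the two centers as follows. The code computes the Cochran-Mantel-Haenszel statistic without a continuity correction and the Mantel-Haenszel common odds ratio, using the textbook formulas for a set of 2x2 tables. One data value is an assumption: the center 2 Test row is taken as 37 Yes and 8 No, so that each center has 90 patients and the total matches the reported sample size of 180.

```python
import math

# Each stratum: ((Test Yes, Test No), (Placebo Yes, Placebo No)).
# The center 2 Test "No" count of 8 is an assumption (see the note above).
strata = {
    1: ((29, 16), (14, 31)),
    2: ((37, 8), (24, 21)),
}

diff_sum = 0.0  # sum over strata of (observed a - expected a)
var_sum = 0.0   # sum over strata of the hypergeometric variance of a
or_num = 0.0    # sum of a*d/n
or_den = 0.0    # sum of b*c/n

for (a, b), (c, d) in strata.values():
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    diff_sum += a - row1 * col1 / n
    var_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    or_num += a * d / n
    or_den += b * c / n

cmh = diff_sum ** 2 / var_sum
p_value = math.erfc(math.sqrt(cmh / 2))  # chi-square df = 1 upper tail
common_or = or_num / or_den

total_n = sum(a + b + c + d for (a, b), (c, d) in strata.values())
print(f"CMH = {cmh:.4f}, p = {p_value:.3g}, MH odds ratio = {common_or:.4f}")
```

Pooling in this way is reasonable here because the Breslow-Day test gives no evidence that the odds ratio differs between the two centers.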