Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013
Overview
Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal
Things not covered (but still fit into the topic)
Matched pairs/repeated measures
◦ McNemar’s Chi-Square
Reliability
◦ Cohen’s Kappa
◦ ROC
Poisson (Count) models
Categorical SEM
◦ Tetrachoric Correlation
Bernoulli Trials
Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative
Continuous/Quantitative
Nominal/Multinomial:
◦ Properties: Values arbitrary (no magnitude); no direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: Mode, relative frequency
Rank Order/Ordinal:
◦ Properties: Values semi-arbitrary (no magnitude?); have direction (ordering)
◦ Example: Likert scales (pronounced LICK-URT): 1-5, Strongly Disagree to Strongly Agree
◦ Measures: Mode, relative frequency, median. Mean?
Binary/Dichotomous/Binomial:
◦ Properties: 2 levels; special case of Ordinal or Multinomial
◦ Examples: Gender (Multinomial), Disease (Y/N)
◦ Measures: Mode, relative frequency. Mean?
Contingency Tables
Often called two-way tables or cross-tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables
Example 2 x 3 table: Gender (rows: Female, Male) by Age Group (columns: <40 Years, 40-50 Years, >50 Years)
Code 1.1
Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ²)
◦ Calculate χ²: $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
◦ Determine DF: (I-1) * (J-1)
◦ Compare to the χ² critical value for the given DF
Where: O_i = observed frequency, E_i = expected frequency, n = number of cells in the table
2 x 3 table: Gender (rows: Female, Male) by Age Group (columns: <40 Years, 40-50 Years, >50 Years)
◦ Row totals: R1=156, R2=664
◦ Column totals: C1=265, C2=331, C3=264
◦ N=820
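The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the function name is an assumption, and the cell counts are hypothetical values chosen for the example, not the data behind the slide's table. Expected frequencies are computed under independence as row total × column total / N.

```python
def chi_square_independence(observed):
    """Return (chi2, df, expected) for an I x J table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum (O - E)^2 / E over every cell of the table
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical 2 x 3 counts (rows: gender, columns: age group)
hypothetical_counts = [[30, 50, 20], [60, 70, 70]]
chi2, df, expected = chi_square_independence(hypothetical_counts)
print(f"chi2 = {chi2:.2f}, df = {df}")
```

The resulting statistic would then be compared with the χ² critical value for the computed DF, as in the last step on the slide.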
Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ²)
◦ H0: No association
◦ HA: Association... where, how?
Not appropriate when an expected cell frequency (E_i) is < 5
◦ Use Fisher’s Exact Test instead
2 x 3 table: Gender by Age Group (row totals R1=156, R2=664; column totals C1=265, C2=331, C3=264; N=820)
Code 1.2
Contingency Tables 2x2
                          Disorder (Outcome)
Risk Factor/Exposure      Yes      No       Total
Yes                       a        b        a+b
No                        c        d        c+d
Total                     a+c      b+d      a+b+c+d
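Contingency Tables: Measures of Association
                 Depression
Alcohol Use      Yes       No
Yes              a = 25    b = 10
No               c = 20    d = 45
Probability of depression: a/(a+b) = 25/35 = 0.714 for alcohol users; c/(c+d) = 20/65 = 0.308 for nonusers
Odds of depression: a/b = 25/10 = 2.50 for alcohol users; c/d = 20/45 = 0.444 for nonusers
Contrasting Probability (relative risk): (25/35)/(20/65) = 2.32. Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.
Contrasting Odds (odds ratio): (25 × 45)/(10 × 20) = 5.625. The odds of depression were 5.62 times as high in alcohol users as in nonusers.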
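The arithmetic behind these two contrasts can be checked with a short sketch. The function name is an assumption made for illustration; the value d = 45 is taken from the next slide's table (d = 45*i with i = 1).

```python
def two_by_two_measures(a, b, c, d):
    """Risk and odds contrasts for a 2x2 table laid out as on the slide.

    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    odds_exposed = a / b
    odds_unexposed = c / d
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "odds_exposed": odds_exposed,
        "odds_unexposed": odds_unexposed,
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": odds_exposed / odds_unexposed,
    }


measures = two_by_two_measures(a=25, b=10, c=20, d=45)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```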
Why Odds Ratios?
                 Depression
Alcohol Use      Yes       No          Total
Yes              a = 25    b = 10*i    25 + 10*i
No               c = 20    d = 45*i    20 + 45*i
Total            45        55*i        45 + 55*i
i = 1 to 45
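One way to read this table is that multiplying the "No" column by i makes depression progressively rarer without changing the odds ratio. The sketch below is an illustration of that reading, not the slide's own code, and the function name is an assumption. It shows the odds ratio staying at 5.625 while the relative risk changes with i.

```python
def relative_risk_and_odds_ratio(i):
    """Relative risk and odds ratio when the 'No' column is scaled by i."""
    a, b, c, d = 25, 10 * i, 20, 45 * i
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)  # the factor i cancels out here
    return relative_risk, odds_ratio


for i in (1, 5, 15, 45):
    rr, odds_ratio = relative_risk_and_odds_ratio(i)
    print(f"i = {i:2d}: relative risk = {rr:.3f}, odds ratio = {odds_ratio:.3f}")
```

As the outcome becomes rarer, the relative risk moves toward the odds ratio, which is why the odds ratio is often read as an approximation to the relative risk for rare outcomes.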
The Generalized Linear Model
General Linear Model (LM)
◦ Continuous Outcomes (DV)
◦ Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA
Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum Likelihood Estimation
◦ Continuous, Categorical, and Count outcomes
◦ Distribution Family and Link Functions: error distributions that are not normal
Logistic Regression
“This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)
Binary Response
Predicting Probability (related to the Probit model)
Assume (the usual):
◦ Independence
◦ NOT Homoscedasticity or Normal Errors
◦ Linearity (in the Log Odds)
◦ Also... adequate cell sizes.
Logistic Regression
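For reference, the standard binary logistic model can be written in the α (constant) and β (coefficient) notation used on the example slides. This is the textbook form with a single predictor x, given here as a sketch rather than as a reproduction of the slide:

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta x,
\qquad
p = P(Y = 1 \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
```

The left-hand form is linear in the log-odds, which is the linearity assumption listed earlier; the right-hand form gives the predicted probability.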
Logistic Regression: Example
The Output as Logits
◦ Logits: H0: β=0
Output table (Y = Depressed): columns Coef, SE, Z, P, CI; row α (_constant)
Frequency table: columns Freq., Percent; rows Not Depressed, Depressed
Code 2.1
Logistic Regression: Example
Output table as odds ratios (Y = Depressed): columns OR, SE, Z, P, CI; row α (_constant)
Frequency table: columns Freq., Percent; rows Not Depressed, Depressed
Code 2.2
Logistic Regression: Example
Output table (Y = Depressed): columns Coef, SE, Z, P, CI; rows α (_constant), β (age)
AS LOGITS:
Interpretation: A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm... I have no concept of what a log-odds is. Interpret as something else.
Logit > 0, so as age increases the risk of depression increases.
OR = e^0.013 = 1.013
For a 1 unit increase in age, the odds of depression are multiplied by 1.013.
We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].
Code 2.3
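The conversion from a logit coefficient to an odds ratio and a percent change in odds can be sketched as follows. The function names are assumptions for illustration; 0.013 is the age coefficient quoted on the slide.

```python
import math


def odds_ratio_from_logit(beta):
    """Odds ratio for a 1 unit increase in the predictor."""
    return math.exp(beta)


def percent_change_in_odds(beta):
    """Percent change in the odds for a 1 unit increase: (OR - 1) * 100."""
    return (math.exp(beta) - 1.0) * 100.0


beta_age = 0.013  # logit coefficient for age quoted on the slide
print(f"OR = {odds_ratio_from_logit(beta_age):.3f}")
print(f"% change in odds = {percent_change_in_odds(beta_age):.1f}%")
```

A negative coefficient gives an odds ratio below 1 and therefore a negative percent change.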
Logistic Regression: GOF
Overall Model Likelihood-Ratio Chi-Square
◦ Omnibus test for the model
◦ Overall model fit? Relative to other models
◦ Compares specified model with Null model (no predictors)
◦ χ² = -2*(LL0 - LL1), where LL0 and LL1 are the log-likelihoods of the null and specified models; DF = K, the number of predictor parameters estimated
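The likelihood-ratio statistic and its p-value for a single predictor can be sketched with the standard library alone. The function names and the log-likelihood values are hypothetical; the p-value helper is restricted to DF = 1, where the chi-square upper tail equals erfc(sqrt(χ²/2)). The value 2.47 is the L-R χ² reported for the age model later in the deck.

```python
import math


def lr_chi_square(ll_null, ll_model):
    """Likelihood-ratio chi-square: -2 * (LL0 - LL1)."""
    return -2.0 * (ll_null - ll_model)


def chi_square_p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic; valid for DF = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical log-likelihoods, for illustration only
print(lr_chi_square(ll_null=-100.0, ll_model=-98.0))

# L-R chi-square with df = 1 quoted for the age model
print(f"p = {chi_square_p_value_df1(2.47):.3f}")
```

For models with more than one predictor, a general chi-square tail function from a statistics library would be needed in place of the DF = 1 shortcut.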
Logistic Regression: GOF (Summary Measures)
Pseudo-R²
◦ Not the same meaning as in linear regression.
◦ There are many of them (Cox and Snell/McFadden)
◦ Only comparable within nested models of the same outcome.
Hosmer-Lemeshow
◦ Models with continuous predictors
◦ Do the observed outcomes agree with the model’s predicted probabilities? (χ² test)
◦ H0: Good fit for data, so we want p>0.05
◦ Order the predicted probabilities, group them (g=10) by quantiles, then compute a chi-square of Group * Outcome comparing observed with expected counts. DF=g-2
◦ Conservative (rarely rejects the null)
Pearson Chi-Square
◦ Models with categorical predictors
◦ Similar to Hosmer-Lemeshow
ROC-Area Under the Curve
◦ Predictive accuracy/Classification
Code 2.4
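The Hosmer-Lemeshow procedure described above (order, group, compare observed with expected) can be sketched in plain Python. This is a simplified illustration under stated assumptions: groups are formed by equal-sized slices of the sorted data rather than exact quantiles, the function names are hypothetical, and the p-value helper uses a closed form that holds only for even DF.

```python
import math


def hosmer_lemeshow(outcomes, probabilities, groups=10):
    """Return (statistic, df) for 0/1 outcomes and predicted probabilities."""
    # Order observations by predicted probability
    pairs = sorted(zip(probabilities, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        # Roughly equal-sized groups of consecutive ordered observations
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        if size == 0:
            continue
        expected = sum(prob for prob, _ in chunk)
        observed = sum(outcome for _, outcome in chunk)
        mean_prob = expected / size  # assumed to lie strictly between 0 and 1
        statistic += (observed - expected) ** 2 / (size * mean_prob * (1 - mean_prob))
    return statistic, groups - 2


def chi_square_p_value_even_df(chi2, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = chi2 / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Hypothetical data: 20 observations with identical predicted probability 0.5
statistic, df = hosmer_lemeshow([0, 1] * 10, [0.5] * 20, groups=2)
print(statistic, df)
```

With g = 10 the DF is 8, so the even-DF helper applies; for example, a statistic of 7.12 on 8 DF gives a p-value of about 0.52, which would not reject the null of good fit.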
Logistic Regression: GOF (Diagnostic Measures)
Outliers in Y (Outcome)
◦ Pearson Residuals: square root of the contribution to the Pearson χ²
◦ Deviance Residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model
Outliers in X (Predictors)
◦ Leverage (Hat Matrix/Projection Matrix): maps the influence of observed on fitted values
Influential Observations
◦ Pregibon’s Delta-Beta influence statistic
◦ Similar to Cook’s-D in linear regression
Detecting Problems
◦ Residuals vs Predictors
◦ Leverage vs Residuals
◦ Boxplot of Delta-Beta
Code 2.5
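For a single binary observation, the two residuals listed under "Outliers in Y" have simple closed forms. The sketch below uses the standard definitions for ungrouped 0/1 data, with the sign of each residual following the sign of y - p; the function names are assumptions, and the predicted probability is assumed to lie strictly between 0 and 1.

```python
import math


def pearson_residual(y, p):
    """(observed - predicted) scaled by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))


def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    log_likelihood = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * log_likelihood), y - p)


# An event (y = 1) and a non-event (y = 0), both with predicted probability 0.8
for y in (1, 0):
    print(y, round(pearson_residual(y, 0.8), 3), round(deviance_residual(y, 0.8), 3))
```

Squaring and summing the Pearson residuals gives the Pearson χ², and doing the same with the deviance residuals gives the model deviance.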
Logistic Regression: GOF
Output table (Y = Depressed): columns Coef, SE, Z, P, CI; rows α (_constant), β (age)
H-L GOF: Number of Groups: 10; H-L χ²: 7.12; DF: 8; P:
McFadden’s R²:
L-R χ² (df=1): 2.47, p=0.1162
Logistic Regression: Diagnostics
Linearity in the Log-Odds
◦ Use a lowess (loess) plot
◦ Depressed vs Age
Code 2.6
Logistic Regression: Example
Output table (Y = Depressed): columns OR, SE, Z, P, CI; rows α (_constant), β (male)
AS OR:
Interpretation: The odds of depression for males are 0.299 times the odds for females.
We could also say: The odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.
Or... why not just make males the reference so the OR is greater than 1?
Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34
Code 2.7
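Switching the reference category inverts the odds ratio, which on the logit scale is just a change of sign. A minimal sketch, with a hypothetical function name and the OR of 0.299 taken from the slide:

```python
import math


def switch_reference(odds_ratio):
    """Odds ratio after swapping the reference and comparison groups."""
    return 1.0 / odds_ratio


or_male_vs_female = 0.299  # OR for males vs. females, from the slide
or_female_vs_male = switch_reference(or_male_vs_female)

print(f"OR (females vs. males) = {or_female_vs_male:.2f}")
# On the logit scale the two coefficients differ only in sign
print(f"logit (males)   = {math.log(or_male_vs_female):.3f}")
print(f"logit (females) = {math.log(or_female_vs_male):.3f}")
```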
Ordinal Logistic Regression
Also called Ordered Logistic or Proportional Odds Model
Extension of Binary Logistic Model
>2 Ordered responses
New Assumption!
◦ Proportional Odds: the predictor’s effect on the outcome is the same across levels of the outcome.
Example: BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
◦ Bmi3grp (1 vs 2,3) = B(age)
◦ Bmi3grp (1,2 vs 3) = B(age), i.e. the same B(age) applies to both splits
Ordinal Logistic Regression
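A proportional odds model turns one linear predictor and a set of ordered thresholds into category probabilities. The sketch below assumes the common parameterization P(Y ≤ j) = logistic(cut_j − xβ), which is consistent with the Threshold/cut labels in the example output; the function names and the numeric values are hypothetical.

```python
import math


def logistic(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probabilities(linear_predictor, cuts):
    """Category probabilities from one linear predictor and ordered cut-points."""
    # Cumulative probabilities P(Y <= j); the last category closes at 1
    cumulative = [logistic(cut - linear_predictor) for cut in cuts] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical cut-points for a 3-category outcome such as bmi3grp
cuts = [-1.0, 1.0]
print(ordinal_probabilities(0.0, cuts))
print(ordinal_probabilities(1.0, cuts))  # larger predictor shifts mass upward
```

Because the same coefficient is used at every cut-point, a larger linear predictor raises the odds of being above each threshold by the same factor, which is the proportional odds assumption.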
Ordinal Logistic Regression Example
AS LOGITS: output table (Y = bmi3grp): columns Coef, SE, Z, P, CI; rows β1 (age), β2 (blood_press), Threshold1/cut1, Threshold2/cut2
Interpretation: For a 1 unit increase in blood pressure, the log-odds of being in a higher BMI category increase by β2.
AS OR: output table (Y = bmi3grp): columns OR, SE, Z, P, CI; rows β1 (age), β2 (blood_press), Threshold1/cut1, Threshold2/cut2
Interpretation: For a 1 unit increase in blood pressure, the odds of being in a higher BMI category are multiplied by the OR for blood pressure, e^β2.
Code 3.1
Ordinal Logistic Regression: GOF
Assessing Proportional Odds Assumptions
◦ Brant Test of Parallel Regression: H0: Proportional Odds, thus want p > 0.05. Tests each predictor separately and overall.
◦ Score Test of Parallel Regression: H0: Proportional Odds, thus want p > 0.05
◦ Approx Likelihood-ratio test: H0: Proportional Odds, thus want p > 0.05
Code 3.2
Ordinal Logistic Regression: GOF
Pseudo R²
Diagnostic Measures
◦ Performed on the j-1 binomial logistic regressions
Code 3.3
Multinomial Logistic Regression
Also called multinomial logit/polytomous logistic regression.
Same assumptions as the binary logistic model
>2 non-ordered responses
◦ Or you’ve failed to meet the proportional odds assumption of the Ordinal Logistic model
Multinomial Logistic Regression
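In a multinomial logit, each non-reference category gets its own α and β, and the reference category's linear predictor is fixed at 0. A minimal sketch of how those linear predictors become probabilities follows; the function name and the numeric values are hypothetical.

```python
import math


def multinomial_probabilities(linear_predictors):
    """Category probabilities, reference category first.

    linear_predictors holds alpha_k + beta_k * x for each non-reference category.
    """
    # The reference category has linear predictor 0, so exp(0) = 1
    exponentiated = [1.0] + [math.exp(eta) for eta in linear_predictors]
    total = sum(exponentiated)
    return [value / total for value in exponentiated]


# Hypothetical linear predictors for two non-reference categories
probabilities = multinomial_probabilities([0.2, 0.7])
print(probabilities)
# Odds of the last category versus the reference equal exp(0.7)
print(probabilities[2] / probabilities[0], math.exp(0.7))
```

The ratio of any category's probability to the reference category's probability is exp(α_k + β_k x), so exp(β_k) is an odds ratio relative to the reference, not a change in probability.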
Multinomial Logistic Regression Example
Does degree of supernatural belief indicate a religious preference?
AS OR: output table (Y = religion, ref = Catholic (1)): columns OR, SE, Z, P, CI
◦ Protestant (2): rows β (supernatural), α (_constant)
◦ Evangelical (3): rows β (supernatural), α (_constant)
Interpretation: For a 1 unit increase in supernatural belief, there is a 21.8% increase [(OR-1)*100 = % change] in the odds of being Evangelical rather than Catholic.
Code 4.1
Multinomial Logistic Regression GOF
Limited GOF tests.
◦ Look at LR Chi-square and compare nested models.
◦ “Essentially, all models are wrong, but some are useful” –George E.P. Box
Pseudo R²
Similar to Ordinal
◦ Perform tests on the j-1 binomial logistic regressions
Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing: