
1 Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013

2 Overview
Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal

3 Things not covered (but still fit into the topic)
Matched pairs/repeated measures
◦ McNemar’s Chi-Square
Reliability
◦ Cohen’s Kappa
◦ ROC
Poisson (count) models
Categorical SEM
◦ Tetrachoric Correlation
Bernoulli Trials

4 Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative
Nominal/Multinomial:
◦ Properties: values arbitrary (no magnitude), no direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: mode, relative frequency
Rank Order/Ordinal:
◦ Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
◦ Example: Likert scales (pronounced LICK-ert): 1-5, Strongly Disagree to Strongly Agree
◦ Measures: mode, relative frequency, median. Mean?
Binary/Dichotomous/Binomial:
◦ Properties: 2 levels; a special case of ordinal or multinomial
◦ Examples: Gender (multinomial), Disease (Y/N)
◦ Measures: mode, relative frequency. Mean?

5 Contingency Tables
Often called two-way tables or cross-tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables

2 x 3 Table: Gender by Age Group
Gender | <40 Years | 40-50 Years | >50 Years
Female |        25 |          68 |        63
Male   |       240 |         223 |       201

Code 1.1

6 Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ2)
◦ Calculate χ2 = Σ (Oi - Ei)^2 / Ei
◦ Determine DF: (I-1) * (J-1)
◦ Compare to the χ2 critical value for the given DF.
Where: Oi = observed frequency, Ei = expected frequency, n = number of cells in the table

2 x 3 Table: Gender by Age Group
Gender | <40 Years | 40-50 Years | >50 Years | Total
Female |        25 |          68 |        63 | R1=156
Male   |       240 |         223 |       201 | R2=664
Total  |    C1=265 |      C2=291 |    C3=264 | N=820
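For illustration, the same test is easy to reproduce with scipy (a minimal Python sketch using the counts from the table above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Gender (rows) by age group (columns), counts from the table above
table = np.array([[25, 68, 63],      # Female
                  [240, 223, 201]])  # Male

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
print("Expected cell frequencies:\n", expected.round(1))
```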

7 Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ2)
◦ H0: no association
◦ HA: association... where, how?
Not appropriate when any expected (Ei) cell frequency is < 5
◦ Use Fisher’s Exact Test instead

2 x 3 Table: Gender by Age Group
Gender | <40 Years | 40-50 Years | >50 Years | Total
Female |        25 |          68 |        63 | R1=156
Male   |       240 |         223 |       201 | R2=664
Total  |    C1=265 |      C2=291 |    C3=264 | N=820

Code 1.2
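Fisher's exact test is easiest to demonstrate on a 2 x 2 table (scipy implements only the 2 x 2 case; the counts below are hypothetical, chosen to have small expected frequencies):

```python
from scipy.stats import fisher_exact

# Hypothetical 2 x 2 table: rows = exposure (yes/no), columns = outcome (yes/no)
table = [[3, 7],
         [1, 12]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.2f}, exact p = {p:.4f}")
```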

8 Contingency Tables: 2 x 2

Risk Factor/Exposure | Disorder (Outcome): Yes | No  | Total
Yes                  | a                       | b   | a+b
No                   | c                       | d   | c+d
Total                | a+c                     | b+d | a+b+c+d

9 Contingency Tables: Measures of Association

Alcohol Use | Depression: Yes | No     | Total
Yes         | a = 25          | b = 10 | 35
No          | c = 20          | d = 45 | 65
Total       | 45              | 55     | 100

Probability: P(depression | alcohol use) = 25/35 = 0.71; P(depression | no alcohol use) = 20/65 = 0.31
Odds: odds(depression | alcohol use) = 25/10 = 2.50; odds(depression | no alcohol use) = 20/45 = 0.44
Contrasting probability (risk ratio): (25/35)/(20/65) = 2.32. Individuals who used alcohol were 2.32 times more likely to have depression than those who did not use alcohol.
Contrasting odds (odds ratio): (25/10)/(20/45) = 5.62. The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
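Both contrasts can be computed directly from the four cells (a short sketch using the counts above):

```python
a, b, c, d = 25, 10, 20, 45  # cells of the depression-by-alcohol table

risk_ratio = (a / (a + b)) / (c / (c + d))   # contrast of probabilities
odds_ratio = (a / b) / (c / d)               # contrast of odds, equal to (a*d)/(b*c)

print(f"Risk ratio = {risk_ratio:.2f}")      # about 2.32
print(f"Odds ratio = {odds_ratio:.2f}")      # about 5.62
```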

10 Why Odds Ratios?

Alcohol Use | Depression: Yes | No       | Total
Yes         | a = 25          | b = 10*i | 25 + 10*i
No          | c = 20          | d = 45*i | 20 + 45*i
Total       | 45              | 55*i     | 45 + 55*i

for i = 1 to 45
Scaling the "No depression" column by i (for example, by sampling more and more controls) changes the probabilities, so the risk ratio changes with i, but the odds ratio does not.
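A quick numeric check of that invariance (a sketch reusing the cells above):

```python
a, c = 25, 20
for i in (1, 5, 45):                # scale the "No depression" column
    b, d = 10 * i, 45 * i
    risk_ratio = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    print(f"i={i:2d}  risk ratio = {risk_ratio:.2f}  odds ratio = {odds_ratio:.2f}")
# The odds ratio stays at 5.62 for every i, while the risk ratio changes with the sampling.
```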

11 The Generalized Linear Model
General Linear Model (LM)
◦ Continuous outcomes (DV)
◦ Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum likelihood estimation
◦ Continuous, categorical, and count outcomes
◦ Distribution family and link functions
  - Error distributions that are not normal
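In statsmodels, for example, the distribution family (and, if desired, a non-default link) is specified explicitly when the model is built. A minimal sketch with simulated data (the variable names y and x are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: binary outcome y, continuous predictor x
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(-1.5 + 0.5 * df["x"]))))

# Binomial family with its default logit link; other links can be passed
# to the family through its link= argument
model = smf.glm("y ~ x", data=df, family=sm.families.Binomial())
print(model.fit().summary())   # fitted by maximum likelihood (IRLS)
```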

12 Logistic Regression
“This is the most important model for categorical response data” - Agresti (Categorical Data Analysis, 2nd Ed.)
Binary response
Predicting probability (related to the probit model)
Assume (the usual):
◦ Independence
◦ NOT homoscedasticity or normal errors
◦ Linearity (in the log-odds)
◦ Also... adequate cell sizes.

13 Logistic Regression
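The model relates the log-odds of the outcome linearly to the predictors; solving for the probability gives the familiar S-shaped curve:

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \alpha + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \frac{e^{\alpha + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\alpha + \beta_1 x_1 + \dots + \beta_k x_k}}
```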

14 Logistic Regression: Example
The output as logits
◦ Logits: H0: β = 0

Y = Depressed | Coef  | SE    | Z     | P      | CI
α (_constant) | -1.51 | 0.091 | -16.7 | <0.001 | -1.69, -1.34

              | Freq. | Percent
Not Depressed | 672   | 81.95
Depressed     | 148   | 18.05

Code 2.1
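The intercept-only fit and the logit-to-probability conversion can be reproduced as follows (a Python sketch; the 0/1 variable name depressed is an assumption, with 148 of 820 cases as in the frequency table):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 148 depressed out of 820, as in the frequency table above
df = pd.DataFrame({"depressed": [1] * 148 + [0] * 672})

fit = smf.logit("depressed ~ 1", data=df).fit(disp=False)
logit = fit.params["Intercept"]
print(f"intercept (logit) = {logit:.2f}")                        # about -1.51 = ln(148/672)
print(f"implied probability = {1 / (1 + np.exp(-logit)):.4f}")   # 148/820 = 0.1805
```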

15 Logistic Regression: Example
The same model with the output as odds ratios: OR = e^(-1.51) = 0.220, the observed odds of depression (148/672).

Y = Depressed | OR    | SE    | Z     | P      | CI
α (_constant) | 0.220 | 0.020 | -16.7 | <0.001 | 0.184, 0.263

              | Freq. | Percent
Not Depressed | 672   | 81.95
Depressed     | 148   | 18.05

Code 2.2

16 Logistic Regression: Example
AS LOGITS:

Y = Depressed | Coef  | SE    | Z     | P      | CI
α (_constant) | -2.24 | 0.489 | -4.58 | <0.001 | -3.20, -1.28
β (age)       | 0.013 | 0.009 | 1.52  | 0.127  | -0.004, 0.030

Interpretation: a 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm... I have no concept of what a log-odds is. Interpret it as something else.
Logit > 0, so as age increases the risk of depression increases.
OR = e^0.013 = 1.013
For a 1 unit increase in age, the odds of depression are multiplied by 1.013.
We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1)*100 = % change].

Code 2.3
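A sketch of the corresponding fit and conversions (the data are simulated here, and the column names depressed and age are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: binary depression indicator and age in years
rng = np.random.default_rng(1)
df = pd.DataFrame({"age": rng.uniform(40, 90, size=820)})
df["depressed"] = rng.binomial(1, 1 / (1 + np.exp(-(-2.24 + 0.013 * df["age"]))))

fit = smf.logit("depressed ~ age", data=df).fit(disp=False)
beta_age = fit.params["age"]
print(f"beta(age) = {beta_age:.3f}")                          # change in log-odds per year
print(f"OR = {np.exp(beta_age):.3f}")                         # multiplicative change in odds per year
print(f"% change in odds = {(np.exp(beta_age) - 1) * 100:.1f}%")
```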

17 Logistic Regression: GOF
Overall model likelihood-ratio chi-square
◦ Omnibus test for the model
◦ Overall model fit? Relative to other models
◦ Compares the specified model with the null model (no predictors)
◦ χ2 = -2*(LL0 - LL1), DF = K, the number of parameters estimated beyond the null model
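statsmodels reports this test directly (fit.llr and fit.llr_pvalue), but it is also a one-liner to compute by hand (a sketch; the log-likelihood values below are hypothetical):

```python
from scipy.stats import chi2

def lr_test(ll0, ll1, k):
    """Likelihood-ratio chi-square comparing a model with k extra parameters to the null model."""
    stat = -2 * (ll0 - ll1)
    return stat, chi2.sf(stat, df=k)

# ll0 = log-likelihood of the null model, ll1 = log-likelihood of the fitted model
stat, p = lr_test(ll0=-393.3, ll1=-392.1, k=1)   # hypothetical values
print(f"LR chi2(1) = {stat:.2f}, p = {p:.4f}")
```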

18 Logistic Regression: GOF (Summary Measures)
Pseudo-R2
◦ Not the same meaning as in linear regression.
◦ There are many of them (Cox and Snell, McFadden)
◦ Only comparable within nested models of the same outcome.
Hosmer-Lemeshow χ2
◦ Models with continuous predictors
◦ Is the model a better fit than the NULL model?
◦ H0: good fit to the data, so we want p > 0.05
◦ Order the predicted probabilities, group them (g=10) by quantiles, then compute a chi-square of group * outcome. DF = g - 2
◦ Conservative (rarely rejects the null)
Pearson chi-square
◦ Models with categorical predictors
◦ Similar to Hosmer-Lemeshow
ROC - Area Under the Curve
◦ Predictive accuracy/classification

Code 2.4
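Hosmer-Lemeshow is not built into statsmodels, but the grouping logic described above is short enough to sketch by hand (treat this as an illustration, not a validated implementation):

```python
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Group by quantiles of predicted risk, then compare observed vs expected
    events with a chi-square on g - 2 degrees of freedom."""
    d = pd.DataFrame({"y": y, "p": p_hat,
                      "grp": pd.qcut(p_hat, q=g, duplicates="drop")})
    obs = d.groupby("grp", observed=True)["y"].sum()
    exp = d.groupby("grp", observed=True)["p"].sum()
    n = d.groupby("grp", observed=True)["y"].count()
    stat = ((obs - exp) ** 2 / (exp * (1 - exp / n))).sum()
    return stat, chi2.sf(stat, df=len(n) - 2)

# Usage with a fitted statsmodels logit called `fit` and its outcome vector y:
# stat, p = hosmer_lemeshow(y, fit.predict())
```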

19 Logistic Regression: GOF (Diagnostic Measures)
Outliers in Y (outcome)
◦ Pearson residuals
  - Square root of the contribution to the Pearson χ2
◦ Deviance residuals
  - Square root of the contribution to the likelihood-ratio test statistic of a saturated vs fitted model
Outliers in X (predictors)
◦ Leverage (hat matrix/projection matrix)
  - Maps the influence of the observed values on the fitted values
Influential observations
◦ Pregibon’s delta-beta influence statistic
◦ Similar to Cook’s D in linear regression
Detecting problems
◦ Residuals vs predictors
◦ Leverage vs residuals
◦ Boxplot of delta-beta

Code 2.5
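Most of these quantities are available through statsmodels' GLM interface (a sketch; it assumes the simulated df with depressed and age from the earlier block, and attribute names from a reasonably recent statsmodels version):

```python
import statsmodels.api as sm

# Refit the logistic model as a GLM to access residuals and influence measures
glm_fit = sm.GLM.from_formula("depressed ~ age", data=df,
                              family=sm.families.Binomial()).fit()

pearson_resid = glm_fit.resid_pearson        # outliers in Y
deviance_resid = glm_fit.resid_deviance      # outliers in Y
influence = glm_fit.get_influence()
leverage = influence.hat_matrix_diag         # outliers in X (leverage)
dfbetas = influence.dfbetas                  # per-observation change in each coefficient (delta-beta analogue)
```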

20 Logistic Regression: GOF

Y = Depressed | Coef  | SE    | Z     | P      | CI
α (_constant) | -2.24 | 0.489 | -4.58 | <0.001 | -3.20, -1.28
β (age)       | 0.013 | 0.009 | 1.52  | 0.127  | -0.004, 0.030

L-R χ2 (df=1): 2.47, p=0.1162
H-L GOF: number of groups: 10, H-L χ2: 7.12, DF: 8, p: 0.5233
McFadden’s R2: 0.0030

21 Logistic Regression: Diagnostics
Linearity in the log-odds
◦ Use a lowess (loess) plot
◦ Depressed vs age

Code 2.6
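One way to draw this check (a sketch using statsmodels' lowess smoother and matplotlib, continuing the simulated df from the earlier blocks):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Smooth the 0/1 outcome against age, then plot the smoothed values on the logit scale
smoothed = lowess(df["depressed"], df["age"], frac=0.6, return_sorted=True)
age_grid = smoothed[:, 0]
p_hat = np.clip(smoothed[:, 1], 0.01, 0.99)   # keep the log-odds finite

plt.plot(age_grid, np.log(p_hat / (1 - p_hat)))
plt.xlabel("Age")
plt.ylabel("Empirical log-odds of depression")
plt.title("Lowess check of linearity in the log-odds")
plt.show()
```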

22 Logistic Regression: Example
AS OR:

Y = Depressed | OR    | SE    | Z     | P      | CI
α (_constant) | 0.545 | 0.091 | -3.63 | <0.001 | 0.392, 0.756
β (male)      | 0.299 | 0.060 | -5.99 | <0.001 | 0.202, 0.444

Interpretation: the odds of depression for males are 0.299 times the odds for females.
We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.
Or why not just make males the reference group so the OR is greater than 1? Or we could take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Code 2.7
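Flipping the reference category is a one-line change in the model formula (a sketch using patsy's Treatment coding; it assumes df has a 0/1 male column, which is not part of the simulated data above):

```python
import numpy as np
import statsmodels.formula.api as smf

# male coded 0 = female, 1 = male; make males the reference level instead
fit = smf.logit("depressed ~ C(male, Treatment(reference=1))", data=df).fit(disp=False)
print(np.exp(fit.params))   # the sex OR is now the inverse of before (1/0.299 = 3.34 on the slide's data)
```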

23 Ordinal Logistic Regression
Also called ordered logistic or the proportional odds model
Extension of the binary logistic model
>2 ordered responses
New assumption!
◦ Proportional odds
  - BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
  - The predictor's effect on the outcome is the same across levels of the outcome:
  - bmi3grp (1 vs 2,3) = β(age)
  - bmi3grp (1,2 vs 3) = β(age)
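statsmodels provides a proportional odds model through OrderedModel (a sketch with simulated data; the column names bmi3grp, age, and blood_press mirror the example on the following slides):

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical data: ordered 3-level BMI group, age, and blood pressure
rng = np.random.default_rng(2)
df = pd.DataFrame({"age": rng.uniform(40, 90, 500),
                   "blood_press": rng.normal(130, 15, 500)})
latent = -0.026 * df["age"] + 0.012 * df["blood_press"] + rng.logistic(size=500)
df["bmi3grp"] = pd.cut(latent, bins=3, labels=[1, 2, 3])   # ordered categorical outcome

fit = OrderedModel(df["bmi3grp"], df[["age", "blood_press"]],
                   distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())              # slopes as logits, followed by the thresholds
print(np.exp(fit.params[:2]))     # exponentiate the two slopes to get odds ratios
```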

24 Ordinal Logistic Regression
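In the proportional odds form, every cumulative split of the ordered outcome shares the same slopes; only the cutpoints differ (sign conventions for the cutpoints vary by software):

```latex
\operatorname{logit}\,P(Y > j) = \beta_1 x_1 + \dots + \beta_k x_k - \alpha_j,
\qquad j = 1, \dots, J - 1
```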

25 Ordinal Logistic Regression Example
AS LOGITS:

Y = bmi3grp      | Coef   | SE     | Z     | P      | CI
β1 (age)         | -0.026 | 0.006  | -4.15 | <0.001 | -0.038, -0.014
β2 (blood_press) | 0.012  | 0.005  | 2.48  | 0.013  | 0.002, 0.021
Threshold1/cut1  | -0.696 | 0.6678 |       |        | -2.004, 0.613
Threshold2/cut2  | 0.773  | 0.6680 |       |        | -0.536, 2.082

AS OR:

Y = bmi3grp      | OR     | SE     | Z     | P      | CI
β1 (age)         | 0.974  | 0.006  | -4.15 | <0.001 | 0.962, 0.986
β2 (blood_press) | 1.012  | 0.005  | 2.48  | 0.013  | 1.002, 1.022
Threshold1/cut1  | -0.696 | 0.6678 |       |        | -2.004, 0.613
Threshold2/cut2  | 0.773  | 0.6680 |       |        | -0.536, 2.082

As logits: for a 1 unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.
As OR: for a 1 unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Code 3.1

26 Ordinal Logistic Regression: GOF
Assessing the proportional odds assumption
◦ Brant test of parallel regression
  - H0: proportional odds, thus we want p > 0.05
  - Tests each predictor separately and overall
◦ Score test of parallel regression
  - H0: proportional odds, thus we want p > 0.05
◦ Approximate likelihood-ratio test
  - H0: proportional odds, thus we want p > 0.05

Code 3.2
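None of these tests ship with statsmodels, but the idea behind them can be sketched by fitting the J-1 cumulative binary logits and checking whether the slopes agree (an informal check, not the Brant test itself; continues the simulated ordinal df from the earlier block):

```python
import statsmodels.formula.api as smf

# Cumulative splits of the 3-level outcome: (1) vs (2,3) and (1,2) vs (3)
df["ge_overweight"] = (df["bmi3grp"].astype(int) >= 2).astype(int)
df["ge_obese"] = (df["bmi3grp"].astype(int) >= 3).astype(int)

for split in ["ge_overweight", "ge_obese"]:
    fit = smf.logit(f"{split} ~ age + blood_press", data=df).fit(disp=False)
    print(split, fit.params[["age", "blood_press"]].round(3).to_dict())
# Proportional odds implies the age and blood_press slopes should be similar across the two splits.
```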

27 Ordinal Logistic Regression: GOF
Pseudo R2
Diagnostic measures
◦ Performed on the J-1 binomial logistic regressions

Code 3.3

28 Multinomial Logistic Regression
Also called multinomial logit or polytomous logistic regression
Same assumptions as the binary logistic model
>2 non-ordered responses
◦ Or you've failed to meet the proportional (parallel) odds assumption of the ordinal logistic model
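A sketch with statsmodels' multinomial logit (the religion and supernatural column names mirror the example on the next slides; the data here are simulated, so the estimates will not match the slide):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 3-category religion (1=Catholic, 2=Protestant, 3=Evangelical)
# and a continuous supernatural-belief score
rng = np.random.default_rng(3)
df = pd.DataFrame({"supernatural": rng.normal(0, 1, 600),
                   "religion": rng.integers(1, 4, 600)})

# MNLogit uses the lowest outcome value (1 = Catholic) as the reference category
fit = smf.mnlogit("religion ~ supernatural", data=df).fit(disp=False)
print(fit.summary())
print(np.exp(fit.params))   # one column of ORs per non-reference outcome, vs Catholic
```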

29 Multinomial Logistic Regression
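Each non-reference category gets its own intercept and slopes, contrasted against a baseline category:

```latex
\ln\frac{P(Y = j)}{P(Y = \text{ref})} = \alpha_j + \beta_{j1} x_1 + \dots + \beta_{jk} x_k,
\qquad j \neq \text{ref}
```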

30 Multinomial Logistic Regression Example
Does degree of supernatural belief indicate a religious preference?
AS OR:

Y = religion (ref = Catholic (1)) | OR    | SE    | Z     | P      | CI
Protestant (2): β (supernatural)  | 1.126 | 0.090 | 1.47  | 0.141  | 0.961, 1.317
Protestant (2): α (_constant)     | 1.219 | 0.097 | 2.49  | 0.013  | 1.043, 1.425
Evangelical (3): β (supernatural) | 1.218 | 0.117 | 2.06  | 0.039  | 1.010, 1.469
Evangelical (3): α (_constant)    | 0.619 | 0.059 | -5.02 | <0.001 | 0.512, 0.746

Interpretation: for a 1 unit increase in supernatural belief, there is a [(OR - 1)*100 = % change] 21.8% increase in the odds of being Evangelical rather than Catholic.

Code 4.1

31 Multinomial Logistic Regression: GOF
Limited GOF tests.
◦ Look at the LR chi-square and compare nested models.
◦ “Essentially, all models are wrong, but some are useful” - George E. P. Box
Pseudo R2
Similar to ordinal
◦ Perform tests on the J-1 binomial logistic regressions

32 Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing: http://www.ats.ucla.edu/stat/

