Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013


Overview
Data Types
Contingency Tables
Logit Models
  ◦ Binomial
  ◦ Ordinal
  ◦ Nominal

Things not covered (but still fit into the topic)
Matched pairs/repeated measures
  ◦ McNemar’s Chi-Square
Reliability
  ◦ Cohen’s Kappa
  ◦ ROC
Poisson (Count) models
Categorical SEM
  ◦ Tetrachoric Correlation
Bernoulli Trials

Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative
Nominal/Multinomial:
  ◦ Properties: Values arbitrary (no magnitude); no direction (no ordering)
  ◦ Example: Race: 1=AA, 2=Ca, 3=As
  ◦ Measures: Mode, relative frequency
Rank Order/Ordinal:
  ◦ Properties: Values semi-arbitrary (no magnitude?); have direction (ordering)
  ◦ Example: Likert scales (pronounced LICK-URT): 1-5, Strongly Disagree to Strongly Agree
  ◦ Measures: Mode, relative frequency, median. Mean?
Binary/Dichotomous/Binomial:
  ◦ Properties: 2 levels; special case of Ordinal or Multinomial
  ◦ Examples: Gender (Multinomial), Disease (Y/N)
  ◦ Measures: Mode, relative frequency. Mean?

Contingency Tables
Often called Two-way tables or Cross-Tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables
Example 2 x 3 table: Gender (Female, Male) by Age Group (<40 Years, 40-50 Years, >50 Years)
Code 1.1

Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ²)
  ◦ Calculate χ²: $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
  ◦ Determine DF: (I-1) * (J-1)
  ◦ Compare to χ² critical value for given DF.
Where: O_i = Observed Freq, E_i = Expected Freq, n = number of cells in table
2 x 3 table: Gender (Female, Male) by Age Group (<40 Years, 40-50 Years, >50 Years)
  ◦ Column totals: C1=265, C2=331, C3=264
  ◦ Row totals: R1=156, R2=664
  ◦ N=820
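The three steps above (calculate χ², determine DF, compare) can be sketched in a few lines of plain Python. The cell counts below are hypothetical, since the individual cells of the slide's table are not available, and the closed-form p-value used here is valid only for DF = 2, which is what a 2 x 3 table gives.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and DF for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical counts (rows: Female, Male; columns: <40, 40-50, >50 years).
observed = [[30, 50, 20],
            [40, 40, 20]]
chi2, df = chi_square_independence(observed)

# For DF = 2 only, the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2={chi2:.3f}  df={df}  p={p_value:.3f}")
```

In practice the comparison step is usually delegated to a statistics package; the hand-rolled p-value is only there to keep the sketch free of dependencies.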

Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ²)
  ◦ H0: No Association
  ◦ HA: Association….where, how?
Not appropriate when an Expected (E_i) cell frequency is < 5
  ◦ Use Fisher’s Exact Test
Code 1.2
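When expected counts are small, the exact alternative can be computed directly from the hypergeometric distribution. The sketch below handles only the 2 x 2 case (larger tables need a generalization not shown here), uses the common "sum every table no more probable than the observed one" definition of the two-sided p-value, and runs on hypothetical counts.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Add up all tables that are at most as probable as the observed table.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small-sample table (not from the slides).
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(f"Fisher exact p = {p_value:.4f}")
```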

Contingency Tables 2x2
                         Disorder (Outcome)
Risk Factor/Exposure     Yes      No       Total
Yes                      a        b        a+b
No                       c        d        c+d
Total                    a+c      b+d      a+b+c+d

Contingency Tables: Measures of Association
                 Depression
Alcohol Use      Yes       No
Yes              a = 25    b = 10
No               c = 20    d = 45
Probability of depression: a/(a+b) for alcohol users, c/(c+d) for nonusers
Odds of depression: a/b for alcohol users, c/d for nonusers
Contrasting Probability: Individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.
Contrasting Odds: The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
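The contrast between probabilities and odds can be checked with a short calculation. This sketch assumes d = 45, the value that reproduces the quoted odds ratio of 5.62; the function and key names are illustrative only. The computed ratio of probabilities comes out close to the 2.31 quoted above.

```python
def risk_and_odds(a, b, c, d):
    """Probabilities, odds, and their ratios for the 2 x 2 table [[a, b], [c, d]]."""
    p_exposed = a / (a + b)        # P(outcome | exposed)
    p_unexposed = c / (c + d)      # P(outcome | unexposed)
    odds_exposed = a / b
    odds_unexposed = c / d
    return {
        "p_exposed": p_exposed,
        "p_unexposed": p_unexposed,
        "odds_exposed": odds_exposed,
        "odds_unexposed": odds_unexposed,
        "relative_risk": p_exposed / p_unexposed,
        "odds_ratio": odds_exposed / odds_unexposed,
    }

# Alcohol use (rows) by depression (columns); d = 45 is an assumption.
result = risk_and_odds(25, 10, 20, 45)
for name, value in result.items():
    print(f"{name:15s} {value:.3f}")
```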

Why Odds Ratios?
                 Depression
Alcohol Use      Yes       No          Total
Yes              a = 25    b = 10*i    25 + 10*i
No               c = 20    d = 45*i    20 + 45*i
Total            45        55*i        45 + 55*i
i = 1 to 45
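One way to read this slide is that scaling the "No" column by i changes the ratio of probabilities but leaves the odds ratio untouched. The sketch below evaluates both for a few values of i; the choice of i values is arbitrary.

```python
def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    return (a / b) / (c / d)

results = []
for i in (1, 5, 15, 45):
    # Only the non-depressed column is multiplied by i, as on the slide.
    a, b, c, d = 25, 10 * i, 20, 45 * i
    results.append((i, relative_risk(a, b, c, d), odds_ratio(a, b, c, d)))

for i, rr, or_value in results:
    print(f"i={i:2d}  RR={rr:.2f}  OR={or_value:.2f}")
```

The odds ratio stays fixed while the ratio of probabilities drifts toward it as the outcome becomes rarer, which is why the odds ratio remains usable when the proportion of cases in the sample is set by the study design.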

The Generalized Linear Model
General Linear Model (LM)
  ◦ Continuous Outcomes (DV)
  ◦ Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA
Generalized Linear Model (GLM)
  ◦ John Nelder and Robert Wedderburn
  ◦ Maximum Likelihood Estimation
  ◦ Continuous, Categorical, and Count outcomes.
  ◦ Distribution Family and Link Functions
      ▪ Error distributions that are not normal

Logistic Regression
“This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)
Binary Response
Predicting Probability (related to the Probit model)
Assume (the usual):
  ◦ Independence
  ◦ NOT Homoscedasticity or Normal Errors
  ◦ Linearity (in the Log Odds)
  ◦ Also….adequate cell sizes.

Logistic Regression
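For reference, the binary logistic model is conventionally written with a logit link, as below. Here p stands for P(Y = 1 | x), and α and β follow the intercept and slope labels used in the example output. This is the standard textbook form and may differ in notation from the original slide.

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta x,
\qquad
p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
```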

Logistic Regression: Example
The Output as Logits
  ◦ Logits: H0: β=0
Output table (Y=Depressed): Coef, SE, Z, P, CI for α (_constant)
Frequency table: Freq. and Percent for Not Depressed and Depressed
Code 2.1

Logistic Regression: Example
Output table (Y=Depressed): OR, SE, Z, P, CI for α (_constant)
Frequency table: Freq. and Percent for Not Depressed and Depressed
Code 2.2

Logistic Regression: Example
Output table (Y=Depressed): Coef, SE, Z, P, CI for α (_constant) and β (age)
AS LOGITS:
Interpretation: A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm….I have no concept of what a log-odds is. Interpret as something else.
Logit > 0, so as age increases the risk of depression increases.
OR = e^0.013 = 1.013
For a 1 unit increase in age, the odds of depression are multiplied by 1.013.
We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].
Code 2.3
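The conversion from a logit coefficient to an odds ratio and a percent change is a one-line calculation each. A minimal sketch, using the age coefficient of 0.013 from the interpretation above (the function names are illustrative):

```python
import math

def logit_to_odds_ratio(coef):
    """Exponentiate a log-odds coefficient to get an odds ratio."""
    return math.exp(coef)

def percent_change_in_odds(coef):
    """Percent change in the odds per 1 unit increase: (OR - 1) * 100."""
    return (math.exp(coef) - 1) * 100

beta_age = 0.013
odds_ratio = logit_to_odds_ratio(beta_age)
pct_change = percent_change_in_odds(beta_age)
print(f"OR = {odds_ratio:.3f}, change in odds = {pct_change:.1f}%")
```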

Logistic Regression: GOF
Overall Model Likelihood-Ratio Chi-Square
  ◦ Omnibus test for the model
  ◦ Overall model fit? Relative to other models
  ◦ Compares specified model with Null model (no predictors)
  ◦ χ² = -2*(LL0 - LL1), DF = K parameters estimated
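The likelihood-ratio statistic can be sketched as follows. The two log-likelihoods are hypothetical values chosen so that the statistic equals 2.47, the figure reported for the age model in the GOF output; the p-value formula used here holds only for DF = 1 (one added predictor).

```python
import math

def likelihood_ratio_test_df1(ll_null, ll_model):
    """LR chi-square for a model with one predictor versus the null model."""
    stat = -2 * (ll_null - ll_model)
    # Chi-square upper tail for DF = 1: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical log-likelihoods (LL0 for the null model, LL1 for the fitted model).
lr_stat, lr_p = likelihood_ratio_test_df1(ll_null=-400.000, ll_model=-398.765)
print(f"LR chi2 = {lr_stat:.2f}, p = {lr_p:.4f}")
```

For models with more than one added parameter, the upper-tail probability would come from a general chi-square routine in a statistics library rather than this DF = 1 shortcut.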

Logistic Regression: GOF (Summary Measures)
Pseudo-R²
  ◦ Not the same meaning as linear regression.
  ◦ There are many of them (Cox and Snell/McFadden)
  ◦ Only comparable within nested models of the same outcome.
Hosmer-Lemeshow
  ◦ Models with Continuous Predictors
  ◦ Is the model a better fit than the NULL model? (χ²)
  ◦ H0: Good Fit for Data, so we want p>0.05
  ◦ Order the predicted probabilities, group them (g=10) by quantiles, Chi-Square of Group * Outcome, using DF = g-2
  ◦ Conservative (rarely rejects the null)
Pearson Chi-Square
  ◦ Models with categorical predictors
  ◦ Similar to Hosmer-Lemeshow
ROC-Area Under the Curve
  ◦ Predictive accuracy/Classification
Code 2.4
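The Hosmer-Lemeshow recipe above (order, group by quantiles, chi-square of group by outcome, DF = g-2) can be sketched in plain Python. The data here are simulated, not taken from the slides, and the p-value helper uses a closed form that is valid for even DF only, which covers the default g = 10 (DF = 8).

```python
import math
import random

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, DF) given 0/1 outcomes y and fitted probabilities p."""
    pairs = sorted(zip(p, y))                  # order by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]   # quantile group
        size = len(chunk)
        expected = sum(prob for prob, _ in chunk)       # expected events
        observed = sum(outcome for _, outcome in chunk)  # observed events
        # Contributions from events and from non-events in this group.
        stat += (observed - expected) ** 2 / expected
        stat += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return stat, groups - 2

def chi2_sf_even_df(x, df):
    """Chi-square upper-tail probability; closed form for even DF only."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

# Simulated example: outcomes drawn from the stated probabilities, so the
# "model" is well calibrated by construction.
random.seed(0)
fitted = [random.uniform(0.05, 0.95) for _ in range(500)]
outcome = [1 if random.random() < prob else 0 for prob in fitted]

hl_stat, hl_df = hosmer_lemeshow(outcome, fitted)
hl_p = chi2_sf_even_df(hl_stat, hl_df)
print(f"H-L chi2 = {hl_stat:.2f}, DF = {hl_df}, p = {hl_p:.3f}")
```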

Logistic Regression: GOF (Diagnostic Measures)
Outliers in Y (Outcome)
  ◦ Pearson Residuals
      ▪ Square root of the contribution to the Pearson χ²
  ◦ Deviance Residuals
      ▪ Square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model.
Outliers in X (Predictors)
  ◦ Leverage (Hat Matrix/Projection Matrix)
      ▪ Maps the influence of observed on fitted values
Influential Observations
  ◦ Pregibon’s Delta-Beta influence statistic
  ◦ Similar to Cook’s-D in linear regression
Detecting Problems
  ◦ Residuals vs Predictors
  ◦ Leverage vs Residuals
  ◦ Boxplot of Delta-Beta
Code 2.5

Logistic Regression: GOF
Output table (Y=Depressed): Coef, SE, Z, P, CI for α (_constant) and β (age)
H-L GOF: Number of Groups: 10, H-L Chi²: 7.12, DF: 8, P:
McFadden’s R²:
L-R χ² (df=1): 2.47, p=0.1162

Logistic Regression: Diagnostics
Linearity in the Log-Odds
  ◦ Use a lowess (loess) plot
  ◦ Depressed vs Age
Code 2.6

Logistic Regression: Example
Output table (Y=Depressed): OR, SE, Z, P, CI for α (_constant) and β (male)
AS OR:
Interpretation: The odds of depression for males are 0.299 times the odds for females.
We could also say: The odds of depression are (1 - 0.299 = .701) 70.1% less in males compared to females.
Or…why not just make males the reference so the OR is greater than 1.
Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34
Code 2.7

Ordinal Logistic Regression
Also called Ordered Logistic or Proportional Odds Model
Extension of Binary Logistic Model
>2 Ordered responses
New Assumption!
  ◦ Proportional Odds
      ▪ BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
      ▪ The predictor’s effect on the outcome is the same across levels of the outcome.
      ▪ Bmi3grp (1 vs 2,3) = B(age)
      ▪ Bmi3grp (1,2 vs 3) = B(age)

Ordinal Logistic Regression
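For reference, a common way to write the proportional odds model for an outcome with J ordered categories is shown below. The cut points κ_j play the role of the thresholds reported in the example output, and the single set of slopes shared by every split of the outcome is the proportional odds assumption. The minus sign is a convention used by software that reports cut points, so that a positive β means higher categories are more likely; other texts use a plus sign, and the original slide's notation may differ.

```latex
\operatorname{logit} P(Y \le j \mid x)
  = \ln\!\left(\frac{P(Y \le j \mid x)}{P(Y > j \mid x)}\right)
  = \kappa_j - (\beta_1 x_1 + \beta_2 x_2),
\qquad j = 1, \dots, J-1
```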

Ordinal Logistic Regression Example
AS LOGITS: output table (Y=bmi3grp) with Coef, SE, Z, P, CI for β1 (age), β2 (blood_press), Threshold1/cut, Threshold2/cut
For a 1 unit increase in Blood Pressure, the log-odds of being in a higher BMI category increase by the β2 coefficient.
AS OR: output table (Y=bmi3grp) with OR, SE, Z, P, CI for the same terms
For a 1 unit increase in Blood Pressure, the odds of being in a higher BMI category are multiplied by the OR for blood_press.
Code 3.1

Ordinal Logistic Regression: GOF
Assessing Proportional Odds Assumptions
  ◦ Brant Test of Parallel Regression
      ▪ H0: Proportional Odds, thus want p > 0.05
      ▪ Tests each predictor separately and overall
  ◦ Score Test of Parallel Regression
      ▪ H0: Proportional Odds, thus want p > 0.05
  ◦ Approx Likelihood-ratio test
      ▪ H0: Proportional Odds, thus want p > 0.05
Code 3.2

Ordinal Logistic Regression: GOF
Pseudo R²
Diagnostic Measures
  ◦ Performed on the j-1 binomial logistic regressions
Code 3.3

Multinomial Logistic Regression
Also called multinomial logit/polytomous logistic regression.
Same assumptions as the binary logistic model
>2 non-ordered responses
  ◦ Or you’ve failed to meet the parallel odds assumption of the Ordinal Logistic model

Multinomial Logistic Regression
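For reference, the multinomial logit model is usually written as one binary-style equation per non-reference category, each with its own intercept and slope. The form below takes category 1 as the reference (Catholic in the example that follows); it is the standard textbook statement and may not match the original slide's notation.

```latex
\ln\!\left(\frac{P(Y = j \mid x)}{P(Y = 1 \mid x)}\right) = \alpha_j + \beta_j x,
\qquad j = 2, \dots, J

P(Y = j \mid x) = \frac{e^{\alpha_j + \beta_j x}}{1 + \sum_{k=2}^{J} e^{\alpha_k + \beta_k x}},
\qquad
P(Y = 1 \mid x) = \frac{1}{1 + \sum_{k=2}^{J} e^{\alpha_k + \beta_k x}}
```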

Multinomial Logistic Regression Example
Does degree of supernatural belief indicate a religious preference?
Output table (Y=religion, ref=Catholic (1)): OR, SE, Z, P, CI for β (supernatural) and α (_constant), reported separately for Protestant (2) and Evangelical (3)
AS OR: For a 1 unit increase in supernatural belief, there is a 21.8% increase [(OR-1)*100 = % change] in the odds of being an Evangelical compared to Catholic.
Code 4.1

Multinomial Logistic Regression GOF
Limited GOF tests.
  ◦ Look at LR Chi-square and compare nested models.
  ◦ “Essentially, all models are wrong, but some are useful” –George E.P. Box
Pseudo R²
Similar to Ordinal
  ◦ Perform tests on the j-1 binomial logistic regressions

Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing: