Discrete Multivariate Analysis
Analysis of Multivariate Categorical Data
References
Fienberg, S. (1980), The Analysis of Cross-Classified Categorical Data, MIT Press, Cambridge, Mass.
Fingleton, B. (1984), Models of Category Counts, Cambridge University Press.
Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.
Log Linear Model
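Two-way table
$\ln m_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(ij)}$
where $m_{ij}$ is the expected frequency in cell (i,j) and
$\sum_i u_{1(i)} = \sum_j u_{2(j)} = 0$, $\sum_i u_{12(ij)} = \sum_j u_{12(ij)} = 0$
Note: X and Y are independent if $m_{ij} = m_{i+} m_{+j} / N$, equivalently $u_{12(ij)} = 0$ for all i, j.
In this case the log-linear model becomes
$\ln m_{ij} = u + u_{1(i)} + u_{2(j)}$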
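The link between independence and a vanishing interaction term can be checked numerically. The sketch below is only an illustration: the function name `loglinear_terms` and the expected frequencies are hypothetical, and the sum-to-zero constraints are assumed as the identifying constraints on the u-terms.

```python
import math

def loglinear_terms(m):
    """Decompose ln(m_ij) into u, u1(i), u2(j), u12(i,j) under sum-to-zero constraints."""
    r, c = len(m), len(m[0])
    logm = [[math.log(v) for v in row] for row in m]
    u = sum(sum(row) for row in logm) / (r * c)           # grand mean of ln m
    u1 = [sum(logm[i]) / c - u for i in range(r)]         # row effects
    u2 = [sum(logm[i][j] for i in range(r)) / r - u for j in range(c)]  # column effects
    u12 = [[logm[i][j] - u - u1[i] - u2[j] for j in range(c)] for i in range(r)]
    return u, u1, u2, u12

# Hypothetical expected frequencies built as (row total * column total) / N,
# so X and Y are independent by construction.
row_tot, col_tot = [30.0, 70.0], [20.0, 50.0, 30.0]
N = sum(row_tot)
m_indep = [[ri * cj / N for cj in col_tot] for ri in row_tot]

u, u1, u2, u12 = loglinear_terms(m_indep)
max_interaction = max(abs(v) for row in u12 for v in row)
print(max_interaction)  # zero up to floating-point rounding
```

Replacing `m_indep` with a table that is not a product of its margins gives non-zero interaction terms.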
Three-way Frequency Tables
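Log-Linear model for three-way tables
Let $m_{ijk}$ denote the expected frequency in cell (i,j,k) of the table. Then in general
$\ln m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)}$
where each u-term sums to zero over each of its indices, e.g. $\sum_i u_{1(i)} = 0$, $\sum_i u_{12(ij)} = \sum_j u_{12(ij)} = 0$, $\sum_i u_{123(ijk)} = \sum_j u_{123(ijk)} = \sum_k u_{123(ijk)} = 0$.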
Hierarchical Log-linear models for categorical data
For three-way tables
The hierarchical principle: if an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.
Hierarchical Log-linear models for 3-way tables

Model           Description
[1][2][3]       Mutual independence between all three variables.
[1][23]         Independence of Variable 1 with variables 2 and 3.
[2][13]         Independence of Variable 2 with variables 1 and 3.
[3][12]         Independence of Variable 3 with variables 1 and 2.
[12][13]        Conditional independence between variables 2 and 3 given variable 1.
[12][23]        Conditional independence between variables 1 and 3 given variable 2.
[13][23]        Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]    Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]           Saturated model (all interactions and main effects).
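The bracket notation lists only the highest-order terms; the hierarchical principle supplies the rest. A minimal sketch of that expansion follows, where the helper name `expand_generating_class` and the string encoding of brackets (e.g. "12" for [12]) are illustrative choices, not part of the notation itself.

```python
from itertools import combinations

def expand_generating_class(generators):
    """Return every u-term implied by the hierarchical principle.

    Each generator is a string of variable labels, e.g. "12" for the
    bracket [12]. Every non-empty subset of a generator is included.
    """
    terms = set()
    for g in generators:
        for size in range(1, len(g) + 1):
            for combo in combinations(sorted(g), size):
                terms.add("".join(combo))
    # Sort by order of the term, then alphabetically
    return sorted(terms, key=lambda t: (len(t), t))

print(expand_generating_class(["12", "13"]))
# ['1', '2', '3', '12', '13']
print(expand_generating_class(["123"]))
# ['1', '2', '3', '12', '13', '23', '123']
```

The constant term u is always present and is not listed.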
Maximum Likelihood Estimation Log-Linear Model
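For any model it is possible to determine the maximum likelihood estimators of the parameters.
Example: two-way table, independence, multinomial model
Let $x_{ij}$ denote the observed frequency in cell (i,j) of an r x c table, $N = \sum_i \sum_j x_{ij}$, $m_{ij}$ the expected frequency and $\pi_{ij} = m_{ij}/N$. Then
$P[x_{11}, \dots, x_{rc}] = \frac{N!}{\prod_{i,j} x_{ij}!} \prod_{i,j} \pi_{ij}^{x_{ij}}$
or
$P[x_{11}, \dots, x_{rc}] = \frac{N!}{\prod_{i,j} x_{ij}!} \prod_{i,j} \left( \frac{m_{ij}}{N} \right)^{x_{ij}}$

Log-likelihood
$l = K + \sum_i \sum_j x_{ij} \ln m_{ij}$
where $K = \ln N! - \sum_i \sum_j \ln x_{ij}! - N \ln N$
With the model of independence
$\ln m_{ij} = u + u_{1(i)} + u_{2(j)}$
and
$l = K + N u + \sum_i x_{i+} u_{1(i)} + \sum_j x_{+j} u_{2(j)}$
with $\sum_i u_{1(i)} = 0$ and $\sum_j u_{2(j)} = 0$; also $\sum_i \sum_j m_{ij} = N$.

The sum-to-zero constraints only fix the parameterization and do not restrict the $m_{ij}$, so it is enough to maximize $l$ subject to $\sum_i \sum_j m_{ij} = N$.
Let
$g = N u + \sum_i x_{i+} u_{1(i)} + \sum_j x_{+j} u_{2(j)} - \lambda \left( \sum_i \sum_j e^{u + u_{1(i)} + u_{2(j)}} - N \right)$
Now
$\frac{\partial g}{\partial u} = N - \lambda \sum_i \sum_j m_{ij} = 0$
Since $\sum_i \sum_j m_{ij} = N$, this gives $\lambda = 1$.
Now
$\frac{\partial g}{\partial u_{1(i)}} = x_{i+} - \lambda m_{i+} = 0$ or $\hat{m}_{i+} = x_{i+}$
Similarly
$\hat{m}_{+j} = x_{+j}$
Finally, under independence $m_{ij} = m_{i+} m_{+j} / N$.
Hence
$\hat{m}_{ij} = \frac{x_{i+} x_{+j}}{N}$
Now
$\ln \hat{m}_{ij} = \ln x_{i+} + \ln x_{+j} - \ln N$
and applying the sum-to-zero constraints,
$\hat{u} = \frac{1}{r} \sum_i \ln x_{i+} + \frac{1}{c} \sum_j \ln x_{+j} - \ln N$
Hence
$\hat{u}_{1(i)} = \ln x_{i+} - \frac{1}{r} \sum_{i'} \ln x_{i'+}$ and $\hat{u}_{2(j)} = \ln x_{+j} - \frac{1}{c} \sum_{j'} \ln x_{+j'}$
Note
$\hat{u} + \hat{u}_{1(i)} + \hat{u}_{2(j)} = \ln \hat{m}_{ij}$ or $\hat{m}_{ij} = e^{\hat{u} + \hat{u}_{1(i)} + \hat{u}_{2(j)}}$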
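For the independence model in a two-way table, the maximum likelihood fitted values are (row total x column total) / N. A short sketch with a hypothetical 2 x 2 table of counts (the function name `fit_independence` is illustrative):

```python
def fit_independence(x):
    """Fitted values m_hat[i][j] = x_i+ * x_+j / N for a two-way table of counts."""
    r, c = len(x), len(x[0])
    row = [sum(x[i]) for i in range(r)]                        # x_i+
    col = [sum(x[i][j] for i in range(r)) for j in range(c)]   # x_+j
    n = sum(row)                                               # N
    return [[row[i] * col[j] / n for j in range(c)] for i in range(r)]

x = [[10, 20], [30, 40]]   # hypothetical observed frequencies
m_hat = fit_independence(x)
print(m_hat)  # [[12.0, 18.0], [28.0, 42.0]]
```

The fitted table reproduces the observed row and column totals, which is what the likelihood equations require.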
Comments
Maximum likelihood estimates can be computed for any hierarchical log-linear model (including models with more than 2 variables).
In certain situations the equations need to be solved numerically.
For the saturated model (all interactions and main effects), $\hat{m}_{ijk} = x_{ijk}$: the fitted values equal the observed frequencies.
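The model [12][13][23] is a standard case with no closed-form fitted values. One common numerical method is iterative proportional fitting (IPF), sketched below; the slides do not say which algorithm is intended, so this is an assumed choice. The 2 x 2 x 2 table is hypothetical and all fitted margins are assumed to be positive.

```python
def ipf_no_three_way(x, n_iter=200):
    """Iterative proportional fitting for the model [12][13][23].

    x[i][j][k] are observed counts. Each pass rescales the fitted table so
    that one two-way margin matches the observed margin. Assumes every
    two-way margin is positive (otherwise a division by zero occurs).
    """
    I, J, K = len(x), len(x[0]), len(x[0][0])
    m = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(n_iter):
        for i in range(I):                      # match the (1,2) margin
            for j in range(J):
                scale = sum(x[i][j]) / sum(m[i][j])
                for k in range(K):
                    m[i][j][k] *= scale
        for i in range(I):                      # match the (1,3) margin
            for k in range(K):
                obs = sum(x[i][j][k] for j in range(J))
                fit = sum(m[i][j][k] for j in range(J))
                for j in range(J):
                    m[i][j][k] *= obs / fit
        for j in range(J):                      # match the (2,3) margin
            for k in range(K):
                obs = sum(x[i][j][k] for i in range(I))
                fit = sum(m[i][j][k] for i in range(I))
                for i in range(I):
                    m[i][j][k] *= obs / fit
    return m

x = [[[10, 20], [30, 40]], [[15, 25], [35, 5]]]   # hypothetical counts
m_fit = ipf_no_three_way(x)
for i in range(2):
    for j in range(2):
        print(i, j, round(sum(m_fit[i][j]), 4), sum(x[i][j]))  # fitted vs observed (1,2) margin
```

A fixed iteration count keeps the sketch short; in practice one would stop when the largest change in the fitted values falls below a tolerance.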
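Goodness of Fit Statistics
These statistics can be used to check whether a log-linear model fits the observed frequency table.
The Chi-squared statistic: $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The Likelihood Ratio statistic: $G^2 = 2 \sum \text{Observed} \, \ln \left( \frac{\text{Observed}}{\text{Expected}} \right)$
Both sums run over all cells of the table.
d.f. = # cells - # parameters fitted
We reject the model if $\chi^2$ or $G^2$ is greater than $\chi^2_{\alpha}$, the upper $\alpha$ critical value of the chi-squared distribution with these d.f.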
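Both statistics are simple sums over the cells. The sketch below computes them for a hypothetical 2 x 2 table whose expected values come from the independence model; the function names are illustrative, and the 5% critical value 3.841 for 1 d.f. is the standard tabulated value.

```python
import math

def pearson_chi2(observed, expected):
    """Pearson chi-squared: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    """Likelihood ratio G^2: 2 * sum of O * ln(O / E) over cells.

    Cells with O == 0 contribute 0 (the limit of O * ln(O / E) as O -> 0).
    """
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical 2 x 2 table, cells listed row by row, with independence fit
observed = [10, 20, 30, 40]
expected = [12.0, 18.0, 28.0, 42.0]

chi2 = pearson_chi2(observed, expected)
g2 = likelihood_ratio_g2(observed, expected)
df = 4 - 3                 # 4 cells, 3 parameters fitted: u, u1(1), u2(1)
critical_5pct = 3.841      # upper 5% point of chi-squared with 1 d.f.
print(round(chi2, 3), round(g2, 3), df)
print("reject" if max(chi2, g2) > critical_5pct else "do not reject")
```

For this table both statistics are below the critical value, so independence would not be rejected.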
Example: Variables
Systolic Blood Pressure (B)
Serum Cholesterol (C)
Coronary Heart Disease (H)
Goodness of fit testing of Models

MODEL        DF   LIKELIHOOD-    PROB.    PEARSON    PROB.
                  RATIO CHISQ             CHISQ
-----        --   -----------   ------    -------   ------
B,C,H.       24       83.15     0.0000    102.00    0.0000
B,CH.        21       51.23     0.0002     56.89    0.0000
C,BH.        21       59.59     0.0000     60.43    0.0000
H,BC.        15       58.73     0.0000     64.78    0.0000
BC,BH.       12       35.16     0.0004     33.76    0.0007
BH,CH.       18       27.67     0.0673     26.58    0.0872   n.s.
CH,BC.       12       26.80     0.0082     33.18    0.0009
BC,BH,CH.     9        8.08     0.5265      6.56    0.6824   n.s.

Possible Models:
1. [BH][CH]: B and C independent given H.
2. [BC][BH][CH]: all two-factor interaction model
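The DF column follows from d.f. = # cells - # parameters fitted. The sketch below recomputes it under the assumption that B and C each have 4 levels and H has 2; these level counts are not stated in the text and are inferred only because they reproduce the tabulated degrees of freedom. The function name `model_df` is illustrative.

```python
from itertools import combinations
from math import prod

# Assumed numbers of categories (not given in the text)
levels = {"B": 4, "C": 4, "H": 2}

def model_df(generators, levels):
    """Degrees of freedom = number of cells - number of parameters fitted."""
    terms = set()
    for g in generators:                       # hierarchical principle:
        for size in range(1, len(g) + 1):      # include all lower-order terms
            terms.update(combinations(sorted(g), size))
    # One constant, plus the product of (levels - 1) for each u-term
    n_params = 1 + sum(prod(levels[v] - 1 for v in t) for t in terms)
    n_cells = prod(levels.values())
    return n_cells - n_params

models = {
    "B,C,H": ["B", "C", "H"], "B,CH": ["B", "CH"], "C,BH": ["C", "BH"],
    "H,BC": ["H", "BC"], "BC,BH": ["BC", "BH"], "BH,CH": ["BH", "CH"],
    "CH,BC": ["CH", "BC"], "BC,BH,CH": ["BC", "BH", "CH"],
}
df_table = {name: model_df(gens, levels) for name, gens in models.items()}
print(df_table)
# {'B,C,H': 24, 'B,CH': 21, 'C,BH': 21, 'H,BC': 15, 'BC,BH': 12, 'BH,CH': 18, 'CH,BC': 12, 'BC,BH,CH': 9}
```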
Model 1: [BH][CH]
Log-linear parameters: Heart Disease - Blood Pressure interaction
Multiplicative effect: Heart Disease - Blood Pressure interaction
Log-linear parameters: Heart Disease - Cholesterol interaction
Multiplicative effect: Heart Disease - Cholesterol interaction
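Model 2: [BC][BH][CH]
Log-linear parameters: Blood Pressure - Cholesterol interaction
Multiplicative effect: Blood Pressure - Cholesterol interaction
Log-linear parameters: Heart Disease - Blood Pressure interaction
Multiplicative effect: Heart Disease - Blood Pressure interaction
Log-linear parameters: Heart Disease - Cholesterol interaction
Multiplicative effect: Heart Disease - Cholesterol interaction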
Another Example
In this study the following were determined for N = 4353 males:
Occupation category
Educational Level
Academic Aptitude
Occupation categories
Self-employed, Business
Teacher/Education
Self-employed, Professional
Salaried Employed

Education levels
Low
Low/Med
Med
High/Med
High

Academic Aptitude
Low
Low/Med
High/Med
High
Self-employed, Business
                  Aptitude
Education     Low   LMed   HMed   High   Total
Low            42     55     22      3     122
LMed           72     82     60     12     226
Med            90    106     85     25     306
HMed           27     48     47      8     130
High            8     18     19      5      50
Total         239    309    233     53     834

Teacher Education
                  Aptitude
Education     Low   LMed   HMed   High   Total
Low             0      0      1     19      20
LMed            0      3      3     60      66
Med             1      4      5     86      96
HMed            0      0      2     36      38
High            0      0      1     14      15
Total           1      7     12    215     235

Self-employed, Professional
                  Aptitude
Education     Low   LMed   HMed   High   Total
Low             1      2      8     19      30
LMed            1      2     15     33      51
Med             2      5     25     83     115
HMed            2      2     10     45      59
High            0      0     12     19      31
Total           6     11     70    199     286

Salaried Employed
                  Aptitude
Education     Low   LMed   HMed   High   Total
Low           172    151    107     42     472
LMed          208    198    206     92     704
Med           279    271    331    191    1072
HMed           99    126    179     97     501
High           36     35     99     79     249
Total         794    781    922    501    2998
It is common to handle a multiway table by testing for independence in all two-way tables. This is similar to looking at all the bivariate correlations.
In this example we learn that:
Education is related to Aptitude
Education is related to Occupational category
Can we do better than this?
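The two-way approach can be sketched directly from the four occupation tables: collapse over one variable and compute the Pearson statistic for independence in the resulting margin. The layout below (rows = Education, columns = Aptitude within each occupation) follows the tables as given; the variable and function names are illustrative.

```python
# counts[occupation][education][aptitude]; Education rows: Low, LMed, Med, HMed, High
# Aptitude columns: Low, LMed, HMed, High
counts = {
    "Self-employed, Business": [[42, 55, 22, 3], [72, 82, 60, 12], [90, 106, 85, 25],
                                [27, 48, 47, 8], [8, 18, 19, 5]],
    "Teacher Education": [[0, 0, 1, 19], [0, 3, 3, 60], [1, 4, 5, 86],
                          [0, 0, 2, 36], [0, 0, 1, 14]],
    "Self-employed, Professional": [[1, 2, 8, 19], [1, 2, 15, 33], [2, 5, 25, 83],
                                    [2, 2, 10, 45], [0, 0, 12, 19]],
    "Salaried Employed": [[172, 151, 107, 42], [208, 198, 206, 92], [279, 271, 331, 191],
                          [99, 126, 179, 97], [36, 35, 99, 79]],
}

def pearson_independence(table):
    """Pearson chi-squared and d.f. for independence in a two-way table."""
    r, c = len(table), len(table[0])
    row = [sum(t) for t in table]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    n = sum(row)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row[i] * col[j] / n            # expected count under independence
            chi2 += (table[i][j] - e) ** 2 / e
    return chi2, (r - 1) * (c - 1)

# Collapse over Occupation: Education x Aptitude margin
edu_by_apt = [[sum(counts[o][e][a] for o in counts) for a in range(4)] for e in range(5)]
# Collapse over Aptitude: Education x Occupation margin
edu_by_occ = [[sum(counts[o][e]) for o in counts] for e in range(5)]

total = sum(sum(row) for row in edu_by_apt)
print(total)  # 4353
for name, table in [("Education x Aptitude", edu_by_apt), ("Education x Occupation", edu_by_occ)]:
    chi2, df = pearson_independence(table)
    print(name, round(chi2, 2), df)
```

Each statistic is referred to the chi-squared distribution with 12 d.f.; this looks at one pair of variables at a time and ignores the third.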
Fitting various log-linear models
Simplest model that fits is: [Apt,Ed][Occ,Ed]
This model implies conditional independence between Aptitude and Occupation given Education.
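A conditional independence model of this form has closed-form fitted values: with variable 2 as the conditioning variable (Education here), $\hat{m}_{ijk} = x_{ij+} x_{+jk} / x_{+j+}$. The sketch below uses a small hypothetical 2 x 2 x 2 table rather than the study data, and the function name is illustrative.

```python
def fit_conditional_independence(x):
    """Fitted values for [12][23]: variables 1 and 3 independent given variable 2.

    x[i][j][k] are observed counts; returns m_hat[i][j][k] = x_ij+ * x_+jk / x_+j+.
    Assumes every total x_+j+ is positive.
    """
    I, J, K = len(x), len(x[0]), len(x[0][0])
    x_ij = [[sum(x[i][j]) for j in range(J)] for i in range(I)]                       # x_ij+
    x_jk = [[sum(x[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]  # x_+jk
    x_j = [sum(x_jk[j]) for j in range(J)]                                            # x_+j+
    return [[[x_ij[i][j] * x_jk[j][k] / x_j[j] for k in range(K)]
             for j in range(J)] for i in range(I)]

# Hypothetical counts: index 1 = Aptitude, index 2 = Education, index 3 = Occupation
x = [[[10, 20], [30, 40]], [[15, 25], [35, 5]]]
m_hat = fit_conditional_independence(x)
for i in range(2):
    for j in range(2):
        print(i, j, [round(v, 3) for v in m_hat[i][j]])
```

The fitted table reproduces both observed two-way margins that involve the conditioning variable, so no iteration is needed for this model.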
Log-linear Parameters Aptitude – Education Interaction
Aptitude – Education Interaction (Multiplicative)
Occupation – Education Interaction
Occupation – Education Interaction (Multiplicative)