LOG-LINEAR MODELS FOR CONTINGENCY TABLES
Mohd Tahir Ismail
School of Mathematical Sciences, Universiti Sains Malaysia
INTRODUCTION
The Log-linear Analysis procedure analyzes the frequency counts of observations falling into each cross-classification category of a cross-tabulation or contingency table. Each cross-classification in the table constitutes a cell, and each categorical variable is called a factor. The ultimate goal of fitting a log-linear model is to estimate parameters that describe the relationships between the categorical variables.
INTRODUCTION
Specifically, for a set of categorical variables, log-linear models treat all variables as response variables by modelling the cell counts for all combinations of the levels of the categorical variables included in the model. Fitting a log-linear model is therefore appropriate when all of the variables are categorical and the researcher wants to understand how the count in a particular cell of the contingency table depends on the levels of the categorical variables that define that cell.
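To make "modelling the cell counts" concrete, here is the standard form of the model for a two-way table, a minimal statement using the lambda notation introduced on a later slide:

```latex
% Log-linear models for a two-way I x J table; mu_{ij} is the expected
% count in cell (i, j) of variables X and Y.
% Independence model (no association between X and Y):
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y}
% Saturated model (adds the two-way association term):
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY}
```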
INTRODUCTION
Logistic regression is concerned with modelling a single binary response variable as a function of covariates. There are many situations, however, where several factors interact simultaneously in a multivariate manner and the cause-and-effect relationship is unclear. Log-linear models were developed to analyze this type of data. Logistic regression is a special case of the log-linear model.
Coding of Variables in Log-linear Models
In general, the number of parameters in a log-linear model depends on the number of categories of the variables of interest. More specifically, the effect of a categorical variable with a total of C categories requires C − 1 unique parameters. For example, if variable X is gender (with two categories), then C = 2 and only one predictor, and thus one parameter, is needed to model the effect of X.
Coding of Variables in Log-linear Models
One of the simplest and most intuitive ways to code categorical variables is called "dummy coding." When dummy coding is used, the last category of the variable serves as the reference category. The parameter associated with the last category is therefore set to zero, and each of the remaining parameters of the model is interpreted relative to the last category.
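As a small illustration of this convention, here is a hedged Python sketch (pandas assumed available) that builds C − 1 dummy columns with the last category as the reference:

```python
# A minimal sketch of dummy coding with the last category as the
# reference, as described above; pandas is assumed available.
import pandas as pd

bmi = pd.Series(["Overweight", "Normal weight", "Overweight", "Normal weight"])

# get_dummies creates one indicator column per category; dropping the
# last column makes the final category the reference (its parameter is 0).
dummies = pd.get_dummies(bmi)   # C columns
dummies = dummies.iloc[:, :-1]  # keep C - 1 columns
print(dummies)
# A variable with C = 2 categories thus needs C - 1 = 1 parameter.
```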
Notation of Variables
Instead of representing the parameter associated with the i-th variable $X_i$ as $\beta_i$, in log-linear models this parameter is represented by the Greek letter lambda, $\lambda_i^{X}$, with the variable indicated in the superscript and the (dummy-coded) indicator of the variable in the subscript. For example, if the variable X has a total of I categories (i = 1, 2, …, I), $\lambda_i^{X}$ is the parameter associated with the i-th indicator (dummy variable) for X.
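In this notation, the saturated model for a three-way table of variables X, Y and Z is:

```latex
% Saturated log-linear model for a three-way table.
% mu_{ijk}: expected count in cell (i, j, k) of variables X, Y, Z.
\log \mu_{ijk} = \lambda
  + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
  + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}
  + \lambda_{ijk}^{XYZ}
```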
Using SPSS - Example
An investigator intends to assess the contribution of overweight and smoking to coronary artery disease. Data were collected on ECG reading, BMI, and smoking status for a sample of 188 people:

ECG        BMI             Smoker   Non-smoker
Abnormal   Overweight      47       10
Abnormal   Normal weight   14       12
Normal     Overweight      25       15
Normal     Normal weight   35       30
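For readers who prefer code to screenshots, a sketch of the same table as a weighted data set (one row per cell plus a count column, the layout used with SPSS's Weight Cases) might look like this in Python; pandas is assumed available:

```python
# The 2 x 2 x 2 table above as a weighted data set: one row per cell
# of the table plus a column of observed counts.
import pandas as pd

data = pd.DataFrame({
    "ECG":   ["Abnormal"] * 4 + ["Normal"] * 4,
    "BMI":   ["Overweight", "Overweight", "Normal weight", "Normal weight"] * 2,
    "Smoke": ["Smoker", "Non-smoker"] * 4,
    "Count": [47, 10, 14, 12, 25, 15, 35, 30],
})

assert data["Count"].sum() == 188  # matches the sample size quoted above
print(data)
```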
Input Data in SPSS
How to run Log-linear Analysis
Check Assumptions
- The data are categorical, and each categorical variable is called a factor.
- Every case should fall into only one cross-classification category.
- All expected frequencies should be greater than 1, and not more than 20% should be less than 5.
If the expected-frequency assumption is violated, the options are to:
1. collapse the data across one of the variables
2. collapse levels of one of the variables
3. collect more data
4. accept loss of power
5. add a constant (0.5) to all cells of the table
A quick check of the expected-frequency rule of thumb is sketched below.
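A minimal sketch of that check, using the expected counts reported later in these slides as an example:

```python
# Hedged sketch of the expected-frequency rule of thumb quoted above:
# all expected counts > 1, and no more than 20% below 5.
import numpy as np

def check_expected_frequencies(expected):
    expected = np.asarray(expected, dtype=float)
    all_above_one = bool((expected > 1).all())
    share_below_five = float((expected < 5).mean())
    return all_above_one and share_below_five <= 0.20

# Example, using the expected counts reported later in these slides:
print(check_expected_frequencies(
    [42.30, 14.69, 14.0, 12.0, 29.69, 10.3, 35.0, 30.0]))  # True
```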
From the Model Selection dialog box, use the mouse to select any variables that you want to include in the analysis.
Clicking the Model button opens a dialog box; check...
Clicking Options opens another dialog box. There are few options to change here (the defaults are fine). The only two things worth selecting are Parameter estimates, which produces a table of parameter estimates for each effect, and Association table, which produces chi-square statistics for all of the effects in the model.
Output from Log-linear Analysis The first table tells us that we have 188 cases. SPSS then lists all of the factors in the model and the number of levels they have
The second table gives us the observed and expected counts for each of the combinations of categories in our model.
The final part of this initial output gives two goodness-of-fit tests. In this context these tests assess whether the frequencies predicted by the model (the expected frequencies) differ significantly from the actual frequencies in our data (the observed frequencies); a non-significant result therefore indicates a well-fitting model. The next part of the output tells us which components of the model can be removed.
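For reference, both statistics are straightforward to compute by hand; a sketch in Python (scipy assumed), where df is the residual degrees of freedom of the fitted model:

```python
# Sketch of the two goodness-of-fit statistics; scipy is assumed.
import numpy as np
from scipy.stats import chi2

def goodness_of_fit(observed, expected, df):
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    pearson = ((o - e) ** 2 / e).sum()    # Pearson chi-square
    g2 = 2.0 * (o * np.log(o / e)).sum()  # likelihood ratio G^2
    return (pearson, chi2.sf(pearson, df)), (g2, chi2.sf(g2, df))

# Example with the final model's counts (that model has df = 2):
# G^2 comes out near 4.9 with p near 0.09, in line with the 0.089
# reported later in these slides (small rounding differences expected).
print(goodness_of_fit([47, 10, 14, 12, 25, 15, 35, 30],
                      [42.30, 14.69, 14.0, 12.0, 29.69, 10.3, 35.0, 30.0], 2))
```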
The likelihood ratio chi-square for the model with no parameters other than the mean is 49.596. The value when first-order effects are included is 31.093. The difference, 49.596 − 31.093 = 18.503, is displayed on the first line of the next table and measures how much the model improves when first-order effects are included. The very small P value (0.0000) means that the hypothesis of no first-order effects is rejected; in other words, there is a first-order effect.
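This difference test is itself a chi-square comparison; a sketch using the slide's numbers (the df of 3, one per first-order effect in a 2x2x2 table, is my assumption; SPSS prints it in the K-way effects table):

```python
# The first-order test as a chi-square difference, using the slide's
# numbers; the df of 3 is an assumption, as noted in the text above.
from scipy.stats import chi2

change = 49.596 - 31.093    # = 18.503
p = chi2.sf(change, df=3)
print(round(change, 3), p)  # p well below 0.05 -> first-order effect
```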
Similar reasoning applies to the second-order effects. Adding second-order effects improves the likelihood ratio chi-square by 28.656, which is also significant. Adding the third-order term, however, does not help: its P value is not significant. In log-linear analysis, the change in the likelihood ratio chi-square statistic when terms are removed from (or added to) the model indicates their contribution. We saw the same idea in multiple linear regression with R². The difference is that in linear regression large values of R² indicate good models, whereas in log-linear analysis the opposite holds: small values of the likelihood ratio chi-square mean a good model.
This simply breaks the table we have just looked at down into its component parts. For example, although we know from the previous output that removing all of the two-way interactions significantly affects the model, we do not know which of the two-way interactions is responsible.
Keep in mind, though, that regardless of the partial association tests, one must retain even non-significant lower-order terms if they are components of a significant higher-order term that is retained in the model. Thus, in the example above, one would retain ECG and BMI even though they are non-significant, because they are terms in the two significant two-way interactions, ECG*BMI and BMI*Smoke. The partial association tests therefore suggest dropping only the ECG*Smoke interaction.
The output above lists each main and interaction effect in the hierarchy of all effects generated by the highest-order interaction among the factors the researcher enters. The parameter estimate for the left-out (reference) category is not printed; it is the negative of the sum of the printed parameter estimates, since the estimates must sum to 0.
Backward Elimination Statistics
The purpose here is to find the unsaturated model that provides the best fit to the data. This is done by checking that the model currently being tested does not fit worse than its predecessor.

The procedure starts with the most complex model, in our case BMI * ECG * SMOKE. Its elimination produces a chi-square change of 1.389, with an associated significance level of 0.2386. Since this is greater than the criterion level of 0.05, the term is removed.

The procedure then moves to the next hierarchical level, described under step 1, where all two-way interactions between the three variables are tested. Removing ECG*BMI would produce a large change of 14.601 in the likelihood ratio chi-square, with a highly significant P value (prob = 0.0000). The smallest change (2.406) belongs to the ECG*SMOKE interaction, so it is removed next. The procedure continues until it reaches the final model, which contains the second-order interactions ECG*BMI and BMI*SMOKE. One such elimination step is sketched below.
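A hedged sketch of the first elimination step, fitting the log-linear model as a Poisson GLM (the classical GLM formulation of a log-linear model); statsmodels is assumed available, and `data` is the weighted table built earlier:

```python
# One backward-elimination step as a deviance comparison of nested
# Poisson GLMs; `data` is the weighted table built earlier.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

def deviance(formula):
    fit = smf.glm(formula, data=data, family=sm.families.Poisson()).fit()
    return fit.deviance

saturated = "Count ~ ECG * BMI * Smoke"            # deviance ~ 0
two_way_only = "Count ~ (ECG + BMI + Smoke) ** 2"  # drops the 3-way term

change = deviance(two_way_only) - deviance(saturated)  # ~1.39 per the slides
print(change, chi2.sf(change, df=1))  # p > 0.05 -> remove the 3-way term
```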
We conclude that being overweight and smoking each have a significant association with an abnormal cardiogram. In this particular group of subjects, however, being overweight is the more harmful of the two.
Estimate the model using Loglinear-General to print parameter estimates
From the General dialog box, use the mouse to select any variables that you want to include in the analysis.
Click the Model button to define the model. Since we are interested in a model with fewer terms, we must click the Custom button.
Click Continue and then the Options button
The Output
Recall that the best model generated by the Model Selection procedure was the full factorial model minus the ECG*Smoke interaction. The goodness-of-fit tests show that the fit is good: both goodness-of-fit statistics are non-significant.
The significance level of the likelihood ratio for this model is 0.089. This means the model is not significantly different from the saturated model in accounting for the distribution of data in the table. We accept this conditional independence model in preference to the saturated model because it is more parsimonious.
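The same model can be fitted outside SPSS as a Poisson GLM; a sketch (statsmodels assumed, `data` as built earlier). Note that statsmodels takes the first category of each factor as the reference by default, so the individual estimates need not match SPSS's table, although the fit statistics should:

```python
# Sketch of the final conditional-independence model (ECG and Smoke
# independent given BMI) fitted as a Poisson GLM.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

fit = smf.glm("Count ~ ECG * BMI + BMI * Smoke", data=data,
              family=sm.families.Poisson()).fit()

print(fit.deviance, fit.df_resid)           # deviance on 2 residual df
print(chi2.sf(fit.deviance, fit.df_resid))  # ~0.09, per the slide
print(fit.params)                           # the lambda estimates
```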
Looking at the significant parameter estimates, shown in red on the slide, we can analyze the relative importance of the different effects in the model. The relevant parameter estimates are summed and then exponentiated to give the expected values:

ECG  BMI  Smoke  Sum of parameter estimates                                   Expected frequency
1    1    1      3.401 + (−0.916) + (−1.068) + 0.154 + 1.27 + 0.904 = 3.745   42.30
1    1    2      3.401 + (−0.916) + (−1.068) + 1.27 = 2.687                   14.69
1    2    1      3.401 + (−0.916) + 0.154 = 2.639                             14
1    2    2      3.401 + (−0.916) = 2.485                                     12
2    1    1      3.401 + (−1.068) + 0.154 + 0.904 = 3.391                     29.69
2    1    2      3.401 + (−1.068) = 2.333                                     10.3
2    2    1      3.401 + 0.154 = 3.555                                        35
2    2    2      3.401                                                        30
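The arithmetic of the first row, sketched in Python for concreteness:

```python
# First row of the table above: sum the relevant lambda estimates,
# then exponentiate to obtain the expected cell count.
import math

lambdas = [3.401, -0.916, -1.068, 0.154, 1.27, 0.904]  # from the table
linear_predictor = sum(lambdas)        # = 3.745
expected = math.exp(linear_predictor)  # ~42.3, the table's 42.30
print(round(linear_predictor, 3), round(expected, 2))
```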
Each plot in the matrix above has 8 points because the factor space for this example has 8 cells. The observed-by-expected plot forming an almost 45-degree line indicates a well-fitting model. For the plots involving adjusted residuals, a random cloud (no pattern) is desirable; for these data there is no trend for the residuals to increase or decrease as the expected or observed count increases.
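A sketch of the observed-versus-expected panel of this plot (matplotlib assumed, counts taken from the tables on earlier slides):

```python
# Observed vs expected cell counts with a 45-degree reference line;
# points near the line indicate a well-fitting model.
import matplotlib.pyplot as plt

observed = [47, 10, 14, 12, 25, 15, 35, 30]
expected = [42.30, 14.69, 14.0, 12.0, 29.69, 10.3, 35.0, 30.0]

plt.scatter(expected, observed)
plt.axline((0, 0), slope=1, linestyle="--")  # 45-degree reference line
plt.xlabel("Expected count")
plt.ylabel("Observed count")
plt.title("Observed vs expected cell counts")
plt.show()
```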
Above, the residuals deviate slightly from normality but would probably be considered within the acceptable range.
References
Agresti, A. (2012). An Introduction to Categorical Data Analysis. Wiley: New York.
Eye, A.V. & Mun, E.Y. (2012). Log-linear Modeling: Concepts, Interpretation, and Application. Wiley: New York.
Field, A. (2005). Discovering Statistics Using SPSS. Sage Publications: London.
Everitt, B.S. (1992). The Analysis of Contingency Tables. Chapman & Hall: London.
SPSS 19 Online Help: Loglinear Analysis; Tutorial: Loglinear Modeling.
Thank You