LOG-LINEAR MODELS FOR CONTINGENCY TABLES
Mohd Tahir Ismail
School of Mathematical Sciences, Universiti Sains Malaysia
INTRODUCTION
The Log-linear Analysis procedure analyzes the frequency counts of observations falling into each cross-classification category of a cross-tabulation or contingency table. Each cross-classification in the table constitutes a cell, and each categorical variable is called a factor. The ultimate goal of fitting a log-linear model is to estimate parameters that describe the relationships between the categorical variables.
INTRODUCTION
Specifically, for a set of categorical variables, log-linear models treat all variables as response variables by modelling the cell counts for all combinations of the levels of the categorical variables included in the model. Fitting a log-linear model is therefore appropriate when all of the variables are categorical and the researcher wants to understand how the count in a particular cell of the contingency table depends on the levels of the categorical variables that define that cell.
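To make "modelling the cell counts" concrete, here is the standard form of the model for a two-way table, a minimal statement using the lambda notation introduced on a later slide:

```latex
% Log-linear models for a two-way I x J table; mu_{ij} is the expected
% count in cell (i, j) of variables X and Y.
% Independence model (no association between X and Y):
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y}
% Saturated model (adds the two-way association term):
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY}
```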
INTRODUCTION
Logistic regression is concerned with modelling a single binary response variable as a function of covariates. There are many situations, however, where several factors interact simultaneously in a multivariate manner and the cause-and-effect relationship is unclear. Log-linear models were developed to analyze this type of data. Logistic regression is a special case of the log-linear model.
Coding of Variables in Log-linear Models
In general, the number of parameters in a log-linear model depends on the number of categories of the variables of interest. More specifically, the effect of a categorical variable with a total of C categories requires C − 1 unique parameters. For example, if variable X is gender (with two categories), then C = 2 and only one predictor, and thus one parameter, is needed to model the effect of X.
Coding of Variables in Log-linear Models
One of the simplest and most intuitive ways to code categorical variables is called "dummy coding." When dummy coding is used, the last category of the variable serves as the reference category. The parameter associated with the last category is therefore set to zero, and each of the remaining parameters of the model is interpreted relative to the last category.
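As a small illustration of this convention, here is a hedged Python sketch (pandas assumed available) that builds C − 1 dummy columns with the last category as the reference:

```python
# A minimal sketch of dummy coding with the last category as the
# reference, as described above; pandas is assumed available.
import pandas as pd

bmi = pd.Series(["Overweight", "Normal weight", "Overweight", "Normal weight"])

# get_dummies creates one indicator column per category; dropping the
# last column makes the final category the reference (its parameter is 0).
dummies = pd.get_dummies(bmi)   # C columns
dummies = dummies.iloc[:, :-1]  # keep C - 1 columns
print(dummies)
# A variable with C = 2 categories thus needs C - 1 = 1 parameter.
```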
Notation of Variables
Instead of representing the parameter associated with the i-th variable $X_i$ as $\beta_i$, in log-linear models this parameter is represented by the Greek letter lambda, $\lambda_i^{X}$, with the variable indicated in the superscript and the (dummy-coded) indicator of the variable in the subscript. For example, if the variable X has a total of I categories (i = 1, 2, …, I), $\lambda_i^{X}$ is the parameter associated with the i-th indicator (dummy variable) for X.
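In this notation, the saturated model for a three-way table of variables X, Y and Z is:

```latex
% Saturated log-linear model for a three-way table.
% mu_{ijk}: expected count in cell (i, j, k) of variables X, Y, Z.
\log \mu_{ijk} = \lambda
  + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
  + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}
  + \lambda_{ijk}^{XYZ}
```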
Using SPSS - Example
An investigator intends to assess the contribution of overweight and smoking to coronary artery disease. Data were collected on ECG reading, BMI, and smoking status for a sample of 188 people:

ECG        BMI             Smoker   Non-smoker
Abnormal   Overweight      47       10
Abnormal   Normal weight   14       12
Normal     Overweight      25       15
Normal     Normal weight   35       30
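For readers who prefer code to screenshots, a sketch of the same table as a weighted data set (one row per cell plus a count column, the layout used with SPSS's Weight Cases) might look like this in Python; pandas is assumed available:

```python
# The 2 x 2 x 2 table above as a weighted data set: one row per cell
# of the table plus a column of observed counts.
import pandas as pd

data = pd.DataFrame({
    "ECG":   ["Abnormal"] * 4 + ["Normal"] * 4,
    "BMI":   ["Overweight", "Overweight", "Normal weight", "Normal weight"] * 2,
    "Smoke": ["Smoker", "Non-smoker"] * 4,
    "Count": [47, 10, 14, 12, 25, 15, 35, 30],
})

assert data["Count"].sum() == 188  # matches the sample size quoted above
print(data)
```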
Input Data in SPSS
How to run Log-linear Analysis
Check Assumptions
- The data are categorical, and each categorical variable is called a factor.
- Every case should fall into only one cross-classification category.
- All expected frequencies should be greater than 1, and not more than 20% should be less than 5.
If the expected-frequency assumption is violated, the options are to:
1. collapse the data across one of the variables
2. collapse levels of one of the variables
3. collect more data
4. accept loss of power
5. add a constant (0.5) to all cells of the table
A quick check of the expected-frequency rule of thumb is sketched below.
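A minimal sketch of that check, using the expected counts reported later in these slides as an example:

```python
# Hedged sketch of the expected-frequency rule of thumb quoted above:
# all expected counts > 1, and no more than 20% below 5.
import numpy as np

def check_expected_frequencies(expected):
    expected = np.asarray(expected, dtype=float)
    all_above_one = bool((expected > 1).all())
    share_below_five = float((expected < 5).mean())
    return all_above_one and share_below_five <= 0.20

# Example, using the expected counts reported later in these slides:
print(check_expected_frequencies(
    [42.30, 14.69, 14.0, 12.0, 29.69, 10.3, 35.0, 30.0]))  # True
```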
From the Model Selection dialog box, use the mouse to select any variables that you want to include in the analysis.
Clicking the Model button opens a dialog box; check...
Clicking Options opens another dialog box. There are few options to change here (the defaults are fine). The only two things worth selecting are Parameter estimates, which produces a table of parameter estimates for each effect, and Association table, which produces chi-square statistics for all of the effects in the model.
Output from Log-linear Analysis The first table tells us that we have 188 cases. SPSS then lists all of the factors in the model and the number of levels they have
The second table gives us the observed and expected counts for each of the combinations of categories in our model.
The final part of this initial output gives two goodness-of-fit tests. In this context these tests assess whether the frequencies predicted by the model (the expected frequencies) differ significantly from the actual frequencies in our data (the observed frequencies); a non-significant result therefore indicates a well-fitting model. The next part of the output tells us which components of the model can be removed.
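For reference, both statistics are straightforward to compute by hand; a sketch in Python (scipy assumed), where df is the residual degrees of freedom of the fitted model:

```python
# Sketch of the two goodness-of-fit statistics; scipy is assumed.
import numpy as np
from scipy.stats import chi2

def goodness_of_fit(observed, expected, df):
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    pearson = ((o - e) ** 2 / e).sum()    # Pearson chi-square
    g2 = 2.0 * (o * np.log(o / e)).sum()  # likelihood ratio G^2
    return (pearson, chi2.sf(pearson, df)), (g2, chi2.sf(g2, df))

# Example with the final model's counts (that model has df = 2):
# G^2 comes out near 4.9 with p near 0.09, in line with the 0.089
# reported later in these slides (small rounding differences expected).
print(goodness_of_fit([47, 10, 14, 12, 25, 15, 35, 30],
                      [42.30, 14.69, 14.0, 12.0, 29.69, 10.3, 35.0, 30.0], 2))
```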
The likelihood ratio chi-square for the model with no parameters other than the mean is 49.596. The value when first-order effects are included is 31.093. The difference, 49.596 − 31.093 = 18.503, is displayed on the first line of the next table and measures how much the model improves when first-order effects are included. The very small P value (0.0000) means that the hypothesis of no first-order effects is rejected; in other words, there is a first-order effect.
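This difference test is itself a chi-square comparison; a sketch using the slide's numbers (the df of 3, one per first-order effect in a 2x2x2 table, is my assumption; SPSS prints it in the K-way effects table):

```python
# The first-order test as a chi-square difference, using the slide's
# numbers; the df of 3 is an assumption, as noted in the text above.
from scipy.stats import chi2

change = 49.596 - 31.093    # = 18.503
p = chi2.sf(change, df=3)
print(round(change, 3), p)  # p well below 0.05 -> first-order effect
```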
Similar reasoning applies to the second-order effects. Adding second-order effects improves the likelihood ratio chi-square by 28.656, which is also significant. Adding the third-order term, however, does not help: its P value is not significant. In log-linear analysis, the change in the likelihood ratio chi-square statistic when terms are removed from (or added to) the model indicates their contribution. We saw the same idea in multiple linear regression with R². The difference is that in linear regression large values of R² indicate good models, whereas in log-linear analysis the opposite holds: small values of the likelihood ratio chi-square mean a good model.
This simply breaks the table we have just looked at down into its component parts. For example, although we know from the previous output that removing all of the two-way interactions significantly affects the model, we do not know which of the two-way interactions is responsible.
Keep in mind, though, that regardless of the partial association tests, one must retain even non-significant lower-order terms if they are components of a significant higher-order term that is retained in the model. Thus, in the example above, one would retain ECG and BMI even though they are non-significant, because they are terms in the two significant two-way interactions, ECG*BMI and BMI*Smoke. The partial association tests therefore suggest dropping only the ECG*Smoke interaction.
The output above lists each main and interaction effect in the hierarchy of all effects generated by the highest-order interaction among the factors the researcher enters. The parameter estimate for the left-out (reference) category is not printed; it is the negative of the sum of the printed parameter estimates, since the estimates must sum to 0.
Backward Elimination Statistics
The purpose here is to find the unsaturated model that provides the best fit to the data. This is done by checking that the model currently being tested does not fit worse than its predecessor.

The procedure starts with the most complex model, in our case BMI * ECG * SMOKE. Its elimination produces a chi-square change of 1.389, with an associated significance level of 0.2386. Since this is greater than the criterion level of 0.05, the term is removed.

The procedure then moves to the next hierarchical level, described under step 1, where all two-way interactions between the three variables are tested. Removing ECG*BMI would produce a large change of 14.601 in the likelihood ratio chi-square, with a highly significant P value (prob = 0.0000). The smallest change (2.406) belongs to the ECG*SMOKE interaction, so it is removed next. The procedure continues until it reaches the final model, which contains the second-order interactions ECG*BMI and BMI*SMOKE. One such elimination step is sketched below.
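A hedged sketch of the first elimination step, fitting the log-linear model as a Poisson GLM (the classical GLM formulation of a log-linear model); statsmodels is assumed available, and `data` is the weighted table built earlier:

```python
# One backward-elimination step as a deviance comparison of nested
# Poisson GLMs; `data` is the weighted table built earlier.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

def deviance(formula):
    fit = smf.glm(formula, data=data, family=sm.families.Poisson()).fit()
    return fit.deviance

saturated = "Count ~ ECG * BMI * Smoke"            # deviance ~ 0
two_way_only = "Count ~ (ECG + BMI + Smoke) ** 2"  # drops the 3-way term

change = deviance(two_way_only) - deviance(saturated)  # ~1.39 per the slides
print(change, chi2.sf(change, df=1))  # p > 0.05 -> remove the 3-way term
```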
We conclude that being overweight and smoking each have a significant association with an abnormal cardiogram. In this particular group of subjects, however, being overweight is the more harmful of the two.
Estimate the model using Loglinear-General to print parameter estimates
From the General dialog box, use the mouse to select any variables that you want to include in the analysis.
Click the Model button to define the model. Since we are interested in a model with fewer terms, we must click the Custom button.
Click Continue and then the Options button
The Output
Recall that the best model generated by the Model Selection procedure was the full factorial model minus the ECG*Smoke interaction. The goodness-of-fit tests show that the fit is good: both goodness-of-fit statistics are non-significant.
The significance level of the likelihood ratio for this model is 0.089. This means the model is not significantly different from the saturated model in accounting for the distribution of data in the table. We accept this conditional independence model in preference to the saturated model because it is more parsimonious.
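The same model can be fitted outside SPSS as a Poisson GLM; a sketch (statsmodels assumed, `data` as built earlier). Note that statsmodels takes the first category of each factor as the reference by default, so the individual estimates need not match SPSS's table, although the fit statistics should:

```python
# Sketch of the final conditional-independence model (ECG and Smoke
# independent given BMI) fitted as a Poisson GLM.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

fit = smf.glm("Count ~ ECG * BMI + BMI * Smoke", data=data,
              family=sm.families.Poisson()).fit()

print(fit.deviance, fit.df_resid)           # deviance on 2 residual df
print(chi2.sf(fit.deviance, fit.df_resid))  # ~0.09, per the slide
print(fit.params)                           # the lambda estimates
```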
Looking at the significant parameter estimates, shown in red on the slide, we can analyze the relative importance of the different effects in the model. The relevant parameter estimates are summed and then exponentiated to give the expected values:

ECG  BMI  Smoke  Sum of parameter estimates                                   Expected frequency
1    1    1      3.401 + (−0.916) + (−1.068) + 0.154 + 1.27 + 0.904 = 3.745   42.30
1    1    2      3.401 + (−0.916) + (−1.068) + 1.27 = 2.687                   14.69
1    2    1      3.401 + (−0.916) + 0.154 = 2.639                             14
1    2    2      3.401 + (−0.916) = 2.485                                     12
2    1    1      3.401 + (−1.068) + 0.154 + 0.904 = 3.391                     29.69
2    1    2      3.401 + (−1.068) = 2.333                                     10.3
2    2    1      3.401 + 0.154 = 3.555                                        35
2    2    2      3.401                                                        30
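The arithmetic of the first row, sketched in Python for concreteness:

```python
# First row of the table above: sum the relevant lambda estimates,
# then exponentiate to obtain the expected cell count.
import math

lambdas = [3.401, -0.916, -1.068, 0.154, 1.27, 0.904]  # from the table
linear_predictor = sum(lambdas)        # = 3.745
expected = math.exp(linear_predictor)  # ~42.3, the table's 42.30
print(round(linear_predictor, 3), round(expected, 2))
```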
Each plot in the matrix above has 8 points because the factor space for this example has 8 cells. The observed-by-expected plot forming an almost 45-degree line indicates a well-fitting model. For the plots involving adjusted residuals, a random cloud (no pattern) is desirable; for these data there is no trend for the residuals to increase or decrease as the expected or observed count increases.
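A sketch of the observed-versus-expected panel of this plot (matplotlib assumed, counts taken from the tables on earlier slides):

```python
# Observed vs expected cell counts with a 45-degree reference line;
# points near the line indicate a well-fitting model.
import matplotlib.pyplot as plt

observed = [47, 10, 14, 12, 25, 15, 35, 30]
expected = [42.30, 14.69, 14.0, 12.0, 29.69, 10.3, 35.0, 30.0]

plt.scatter(expected, observed)
plt.axline((0, 0), slope=1, linestyle="--")  # 45-degree reference line
plt.xlabel("Expected count")
plt.ylabel("Observed count")
plt.title("Observed vs expected cell counts")
plt.show()
```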
Above, the residuals deviate slightly from normality but would probably be considered within the acceptable range.
References
Agresti, A. (2012). An Introduction to Categorical Data Analysis. Wiley: New York.
Eye, A.V. & Mun, E.Y. (2012). Log-linear Modeling: Concepts, Interpretation, and Application. Wiley: New York.
Field, A. (2005). Discovering Statistics Using SPSS. Sage Publications: London.
Everitt, B.S. (1992). The Analysis of Contingency Tables. Chapman & Hall: London.
SPSS 19 Online Help: Loglinear Analysis; Tutorial: Loglinear Modeling.
Thank You