
LOG-LINEAR MODEL FOR CONTINGENCY TABLES Mohd Tahir Ismail School of Mathematical Sciences Universiti Sains Malaysia

INTRODUCTION  The Log-linear Analysis procedure analyzes the frequency counts of observations falling into each cross-classification category in a cross-tabulation or a contingency table  Each cross-classification in the table constitutes a cell, and each categorical variable is called a factor  The ultimate goal of fitting a log-linear model is to estimate parameters that describe the relationships between categorical variables

INTRODUCTION  Specifically, for a set of categorical variables, log-linear models treat all variables as response variables by modelling the cell counts for all combinations of the levels of the categorical variables included in the model  Therefore, fitting a log-linear model is appropriate when all of the variables are categorical in nature and a researcher is interested in understanding how a count within a particular cell of the contingency table depends on the different levels of the categorical variables that define that particular cell

INTRODUCTION  Logistic regression is concerned with modeling a single binary-valued response variable as a function of covariates  There are many situations, however, where several factors interact simultaneously in a multivariate manner and the cause-and-effect relationship is unclear  Log-linear models were developed to analyze this type of data  Logistic regression is a special case of log-linear models

Coding of Variables Log-linear Models  In general, the number of parameters in a log-linear model depends on the number of categories of the variables of interest  More specifically, in any log-linear model the effect of a categorical variable with a total of C categories requires (C – 1) unique parameters  For example, if variable X is gender (with two categories), then C=2 and only one predictor, thus one parameter, is needed to model the effect of X.

Coding of Variables Log-linear Models  One of the simplest and most intuitive ways to code categorical variables is called “dummy coding.”  When dummy coding is used, the last category of the variable is used as a reference category.  Therefore, the parameter associated with the last category is set to zero, and each of the remaining parameters of the model is interpreted relative to the last category.
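As an illustration of this coding scheme (not part of the original slides), here is a minimal Python/pandas sketch; the variable x and its categories A, B, C are hypothetical:

```python
import pandas as pd

# Hypothetical variable with C = 3 categories, so C - 1 = 2 dummies are needed.
x = pd.Series(["A", "B", "C", "A", "C"], name="x")

dummies = pd.get_dummies(x)           # one 0/1 column per category
dummies = dummies.drop(columns="C")   # drop the last category: the reference

# The parameter for "C" is implicitly zero; the parameters for "A" and "B"
# are interpreted relative to "C".
print(dummies)
```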

Notation of Variables  Instead of representing the parameter associated with the i-th variable (X_i) as β_i, in log-linear models this parameter is represented by the Greek letter lambda, λ, with the variable indicated in the superscript and the (dummy-coded) indicator of the variable in the subscript  For example, if the variable X has a total of I categories (i = 1, 2, …, I), λ_i^X is the parameter associated with the i-th indicator (dummy variable) for X
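To make the notation concrete, the standard form of the saturated log-linear model for a two-way table, written with these lambdas, is (implied by, but not printed on, the slide):

```latex
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY},
\qquad i = 1,\dots,I,\quad j = 1,\dots,J
```

where μ_ij is the expected count in cell (i, j); under dummy coding, the parameters involving the last category (λ_I^X, λ_J^Y, and any λ_ij^XY with i = I or j = J) are set to zero.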

Using SPSS - Example An investigator intends to assess the contribution of overweight and smoking to coronary artery disease. Data were collected on ECG reading, BMI, and smoking status for a sample of 188 people:

ECG        BMI             Smoker   Non-smoker
Abnormal   Overweight      47       10
Abnormal   Normal weight   14       12
Normal     Overweight      25       15
Normal     Normal weight   35       30
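For readers following along outside SPSS, the same table can be entered in long format in Python (a sketch, not part of the original slides):

```python
import pandas as pd

# The 2 x 2 x 2 table from the slide: one row per cell, with the observed
# count as the response for a Poisson log-linear model.
data = pd.DataFrame({
    "ecg":   ["Abnormal"] * 4 + ["Normal"] * 4,
    "bmi":   ["Overweight", "Overweight", "Normal weight", "Normal weight"] * 2,
    "smoke": ["Smoker", "Non-smoker"] * 4,
    "count": [47, 10, 14, 12, 25, 15, 35, 30],
})
assert data["count"].sum() == 188   # matches the sample size on the slide
```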

Input Data in SPSS

How to run Log-linear Analysis

Check Assumptions  All data are categorical, and each categorical variable is called a factor  Every case should fall into only one cross-classification category  All expected frequencies should be greater than 1, and no more than 20% should be less than 5 (a quick check is sketched below)  If the expected-frequency assumption is violated, the remedies are: 1. collapse the data across one of the variables 2. collapse levels of one of the variables 3. collect more data 4. accept the loss of power 5. add a constant (0.5) to all cells of the table
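One way to check the expected-frequency assumption outside SPSS is to fit a baseline log-linear model (a Poisson GLM with log link) and inspect its fitted cell frequencies. A hedged sketch using statsmodels and the `data` frame built earlier:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Mutual-independence log-linear model: main effects only, no interactions.
# Assumes the `data` frame from the earlier sketch.
indep = smf.glm("count ~ ecg + bmi + smoke", data=data,
                family=sm.families.Poisson()).fit()

expected = indep.fittedvalues
print("smallest expected frequency:", expected.min())
print("share of cells below 5:     ", (expected < 5).mean())
```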

From the Model Selection dialog box, select the variables that you want to include in the analysis with the mouse

Clicking on the Model button will open a dialog box; check...

Clicking on Options opens another dialog box. There are few options to play around with (the default options are fine). The only two things worth selecting are Parameter estimates, which will produce a table of parameter estimates for each effect, and Association table, which will produce chi-square statistics for all of the effects in the model

Output from Log-linear Analysis The first table tells us that we have 188 cases. SPSS then lists all of the factors in the model and the number of levels they have

The second table gives us the observed and expected counts for each of the combinations of categories in our model.

The final bit of this initial output gives us two goodness-of-fit tests. In this context these tests assess whether the frequencies predicted by the model (the expected frequencies) are significantly different from the actual frequencies in our data (the observed frequencies); a non-significant result therefore indicates a good fit. The next part of the output tells us something about which components of the model can be removed.

The likelihood ratio chi-square is reported first for the model with no parameters other than the mean, and then for the model with first-order effects. The difference between the two values is displayed on the first line of the next table; it measures how much the model improves when first-order effects are included. The significantly small P value (0.0000) means that the hypothesis of the first-order effects being zero is rejected. In other words, there is a first-order effect.

Similar reasoning is applied to the question of second-order effects. The addition of second-order effects produces a further significant improvement in the likelihood ratio chi-square. But the addition of a third-order term does not help: its P value is not significant. In log-linear analysis the change in the value of the likelihood ratio chi-square statistic when terms are removed from (or added to) the model is an indicator of their contribution. We saw something similar in multiple linear regression with regard to R². The difference is that in linear regression large values of R² are associated with good models; the opposite is the case with log-linear analysis, where small values of the likelihood ratio chi-square mean a good model.
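The same likelihood-ratio reasoning can be sketched in Python by comparing the deviances of nested Poisson GLMs (a hedged illustration; assumes the `data` frame from the earlier sketch):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Assumes the `data` frame from the earlier sketch.
first  = smf.glm("count ~ ecg + bmi + smoke", data=data,
                 family=sm.families.Poisson()).fit()       # first-order only
second = smf.glm("count ~ (ecg + bmi + smoke)**2", data=data,
                 family=sm.families.Poisson()).fit()       # + two-way terms

# The deviance is the likelihood ratio chi-square against the saturated
# model; the drop in deviance is the "change" the SPSS table reports.
change = first.deviance - second.deviance
df = first.df_resid - second.df_resid
print(f"LR chi-square change = {change:.3f}, df = {df}, "
      f"p = {stats.chi2.sf(change, df):.4f}")
```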

This simply breaks down the previous table that we’ve just looked at into its component parts. So, for example, although we know from the previous output that removing all of the two-way interactions significantly affects the model, we don’t know which of the two-way interactions is having the effect

Keep in mind, though, that regardless of the partial association test, one must retain even nonsignificant lower-order terms if they are components of a significant higher-order term which is to be retained in the model. Thus, in the example above, one would retain ECG and BMI even though they are non-significant, because they are terms in the two significant two-way interactions, ECG*BMI and BMI*Smoke. The partial association tests therefore suggest dropping only the ECG*Smoke interaction.

The output above lists each main and interaction effect in the hierarchy of all effects generated by the highest-order interaction in the set of factors the researcher enters. The parameter estimate that is not printed, for the left-out (reference) category, is the negative of the sum of the printed parameter estimates (since the estimates must add to 0).

Backward Elimination Statistics

The purpose here is to find the unsaturated model that provides the best fit to the data. This is done by checking that the model currently being tested does not give a worse fit than its predecessor. As a first step the procedure commences with the most complex model, in our case BMI*ECG*SMOKE. Its elimination produces a chi-square change of 1.389, with an associated significance level greater than the criterion level of 0.05, so the term is removed. The procedure then moves on to the next hierarchical level, described under step 1, where all two-way interactions between the three variables are tested. Removal of ECG*BMI would produce a large, highly significant change in the likelihood ratio chi-square, so that term is retained. The smallest change is associated with the ECG*SMOKE interaction, which is removed next. The procedure continues until it reaches the final model, which contains the second-order interactions ECG*BMI and BMI*SMOKE.
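One step of this backward elimination can be reproduced outside SPSS by comparing the saturated model with the model that drops the three-way term (a hedged sketch; assumes the `data` frame from the earlier sketch):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Assumes the `data` frame from the earlier sketch.
saturated = smf.glm("count ~ ecg * bmi * smoke", data=data,
                    family=sm.families.Poisson()).fit()
two_way   = smf.glm("count ~ (ecg + bmi + smoke)**2", data=data,
                    family=sm.families.Poisson()).fit()

# The saturated model fits perfectly (deviance ~ 0), so the change is
# essentially the two-way model's deviance; the slide reports 1.389.
change = two_way.deviance - saturated.deviance
df = two_way.df_resid - saturated.df_resid
print(f"chi-square change = {change:.3f}, p = {stats.chi2.sf(change, df):.3f}")
# p > .05, so the BMI*ECG*SMOKE term is removed.
```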

We conclude that being overweight and smoking each have a significant association with an abnormal cardiogram. However, in this particular group of subjects, being overweight is the more harmful of the two.

Estimate the model using Loglinear-General to print parameter estimates

From the General dialog box, select the variables that you want to include in the analysis with the mouse

Click the Model button to define the model. Since we are interested in a model with fewer terms, we must click the Custom button.

Click Continue and then the Options button

The Output Recall that the best model generated by the Model Selection procedure was the full factorial model minus the ECG*Smoke interaction. The goodness-of-fit tests show that the fit is very good: neither goodness-of-fit statistic is significant.

The significance level of the likelihood ratio for these data for this model is .089. This means this model is not significantly different from the saturated model in accounting for the distribution of data in the table. We accept this conditional independence model as a superior model to the saturated model because it is more parsimonious.

Looking at the significant parameter estimates, shown in red in the output below, we can analyze the relative importance of the different effects in the model. [Table on the slide: for each of the eight ECG x BMI x Smoke cells, the applicable parameter estimates (e.g. -0.916, -1.068, 1.27, 0.154, together with the constant) are summed, and the exponential of that sum gives the expected cell frequency.]
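The arithmetic in that table (sum the applicable lambdas, then exponentiate) can be verified in Python. A hedged sketch using the final model and the `data` frame from the earlier sketches; note that patsy's default reference category is the first level, whereas SPSS uses the last, so individual estimates will differ from the slide even though the fitted counts agree:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Final model from the backward elimination: ECG*BMI and BMI*SMOKE.
# Assumes the `data` frame from the earlier sketch.
final = smf.glm("count ~ ecg * bmi + bmi * smoke", data=data,
                family=sm.families.Poisson()).fit()

X = final.model.exog                           # dummy-coded design matrix
by_hand = np.exp(X @ final.params.to_numpy())  # sum the lambdas, exponentiate
assert np.allclose(by_hand, final.fittedvalues)  # matches the expected counts
```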

Each plot in the matrix above has 8 dots because the factor space for this example has 8 cells. The fact that the observed-by-expected count plots in the matrix form almost a 45-degree line indicates a well-fitting model. For the plots involving adjusted residuals, a random cloud (no pattern) is desirable. For these data there is no linear trend for residuals to increase or decline as the expected or observed count increases.
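The residuals described above can also be computed directly (a hedged sketch; assumes the `final` model from the previous sketch):

```python
# Pearson residuals: (observed - expected) / sqrt(expected).
# Assumes the `final` model fitted in the previous sketch.
resid = final.resid_pearson
print(resid.round(2))
# A patternless scatter of these against the expected counts is desirable.
```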

Above, the residuals deviate slightly from normality, but would probably be considered to be within an acceptable range.

References
Agresti, A. (2012). An Introduction to Categorical Data Analysis. Wiley: New York.
Everitt, B.S. (1992). The Analysis of Contingency Tables. Chapman & Hall: London.
Field, A. (2005). Discovering Statistics Using SPSS. Sage Publications: London.
von Eye, A. & Mun, E.Y. (2012). Log-linear Modeling: Concepts, Interpretation, and Application. Wiley: New York.
SPSS Online Help: Loglinear Analysis; Tutorial: Loglinear Modeling

Thank You