Discrete Multivariate Analysis


Discrete Multivariate Analysis Analysis of Multivariate Categorical Data

References
Fienberg, S. (1980), Analysis of Cross-Classified Data, MIT Press, Cambridge, Mass.
Fingleton, B. (1984), Models for Category Counts, Cambridge University Press, Cambridge.
Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.

Example 1 In this study we examine n = 1237 individuals, measuring X = Systolic Blood Pressure and Y = Serum Cholesterol.

Example 2 The following data was taken from a study of parole success involving 5587 parolees in Ohio between 1965 and 1972 (a ten percent sample of all parolees during this period).

The study involved a dichotomous response Y: Success (no major parole violation) or Failure (returned to prison either as a technical violator or with a new conviction), based on a one-year follow-up. The predictors of parole success included: type of offence committed (Person offense or Other offense), Age (25 or Older or Under 25), Prior Record (No prior sentence or Prior sentence), and Drug or Alcohol Dependency (No drug or alcohol dependency or Drug and/or alcohol dependency).

The data were randomly split into two parts. The counts for each part are displayed in the table, with those for the second part in parentheses. The second part of the data was set aside for a validation study of the model fitted to the first part.

Table

Multiway Frequency Tables: Two-Way (variables A, B)

Three-Way (variables A, B, C)

Three-Way (variables A, B, C; alternative layout)

Four-Way (variables A, B, C, D)

Analysis of a Two-way Frequency Table:

Frequency Distribution (Serum Cholesterol and Systolic Blood Pressure)

Joint and Marginal Distributions (Serum Cholesterol and Systolic Blood Pressure) The Marginal distributions allow you to look at the effect of one variable, ignoring the other. The joint distribution allows you to look at the two variables simultaneously.

Conditional Distributions ( Systolic Blood Pressure given Serum Cholesterol ) The conditional distribution allows you to look at the effect of one variable, when the other variable is held fixed or known.

Conditional Distributions (Serum Cholesterol given Systolic Blood Pressure)

GRAPH: Conditional distributions of Systolic Blood Pressure given Serum Cholesterol

Notation: Let xij denote the frequency (number of cases) for which X (the row variable) is i and Y (the column variable) is j.

Different Models The Multinomial Model: Here the total number of cases N is fixed and xij follows a multinomial distribution with parameters pij

The Product Multinomial Model: Here the row (or column) totals Ri are fixed and for a given row i, xij follows a multinomial distribution with parameters pj|i

The Poisson Model: In this case we observe over a fixed period of time and all counts in the table (including Row, Column and overall totals) follow a Poisson distribution. Let mij denote the mean of xij.

Independence

Multinomial Model: if X and Y are independent then pij = pi• p•j, where pi• = Σj pij and p•j = Σi pij. The estimated expected frequency in cell (i,j) in the case of independence is Eij = N (xi•/N)(x•j/N) = xi• x•j / N.

The same can be shown for the other two models, the Product Multinomial model and the Poisson model: the estimated expected frequency in cell (i,j) in the case of independence is Eij = xi• x•j / N. Standardized residuals are defined for each cell: rij = (xij − Eij) / √Eij.

The Chi-Square Statistic The chi-square test for independence uses χ2 = Σi Σj (xij − Eij)2 / Eij = Σi Σj rij2. Reject H0 (independence) if χ2 exceeds the critical value χ2α with (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns.

Table Expected frequencies, Observed frequencies, Standardized Residuals. χ2 = 20.85 (p = 0.0133)
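
A minimal sketch of this computation (expected frequencies under independence, standardized residuals, and the chi-square statistic), assuming numpy/scipy; the 4 x 4 table of counts is hypothetical and merely stands in for the cholesterol-by-blood-pressure frequencies.

import numpy as np
from scipy import stats

# hypothetical 4 x 4 table of counts (rows = cholesterol, columns = blood pressure)
x = np.array([[119, 124, 50, 26],
              [ 88, 100, 43, 23],
              [127, 220, 74, 49],
              [ 74, 111, 57, 44]], dtype=float)

N = x.sum()
row = x.sum(axis=1, keepdims=True)      # x_i.
col = x.sum(axis=0, keepdims=True)      # x_.j
E = row @ col / N                       # expected counts under independence
r = (x - E) / np.sqrt(E)                # standardized residuals
chi2 = (r ** 2).sum()                   # Pearson chi-square
df = (x.shape[0] - 1) * (x.shape[1] - 1)
print(chi2, df, stats.chi2.sf(chi2, df))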

Example In the example N = 57,407 cases in which individuals were victimized twice by crimes were studied. The crime of the first victimization (X) and the crime of the second victimization (Y) were noted. The data were tabulated on the following slide

Table 1: Frequencies

Table 2: Expected Frequencies (assuming independence)

Table 3: Standardized residuals

Table 4: Conditional distribution of second victimization given the first victimization (%)

Log Linear Model

Recall, if the two variables, rows (X) and columns (Y), are independent then pij = pi• p•j, and hence mij = N pij = N pi• p•j.

In general, let mij = E(xij); then ln mij = u + u1(i) + u2(j) + u12(i,j)   (1), where Σi u1(i) = 0, Σj u2(j) = 0 and Σi u12(i,j) = Σj u12(i,j) = 0. Equation (1) is called the log-linear model for the frequencies xij.

Note: X and Y are independent if u12(i,j) = 0 for all i and j. In this case the log-linear model becomes ln mij = u + u1(i) + u2(j).

Comment: The log-linear model for a two-way frequency table, ln mij = u + u1(i) + u2(j) + u12(i,j), is similar to the model for a two-factor experiment, yijk = μ + αi + βj + (αβ)ij + εijk.

Three-way Frequency Tables

Example Data from the Framingham longitudinal study of coronary heart disease (Cornfield [1962]). Variables: Systolic Blood Pressure (X): <127, 127-146, 147-166, 167+; Serum Cholesterol: <200, 200-219, 220-259, 260+; Heart Disease: Present, Absent. The data are tabulated on the next slide.

Three-way Frequency Table

Log-Linear model for three-way tables Let mijk denote the expected frequency in cell (i,j,k) of the table; then in general ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k), where each u-term sums to zero over any of its subscripts.

Hierarchical Log-linear Models for Categorical Data For three-way tables, the hierarchical principle: if an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

1. Model: (All Main effects model) ln mijk = u + u1(i) + u2(j) + u3(k) i.e. u12(i,j) = u13(i,k) = u23(j,k) = u123(i,j,k) = 0. Notation: [1][2][3] Description: Mutual independence between all three variables.

2. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) i.e. u13(i,k) = u23(j,k) = u123(i,j,k) = 0. Notation: [12][3] Description: Independence of Variable 3 with variables 1 and 2.

3. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) i.e. u12(i,j) = u23(j,k) = u123(i,j,k) = 0. Notation: [13][2] Description: Independence of Variable 2 with variables 1 and 3.

4. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u23(j,k) i.e. u12(i,j) = u13(i,k) = u123(i,j,k) = 0. Notation: [23][1] Description: Independence of Variable 1 with variables 2 and 3.

5. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) i.e. u23(j,k) = u123(i,j,k) = 0. Notation: [12][13] Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u23(j,k) i.e. u13(i,k) = u123(i,j,k) = 0. Notation: [12][23] Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) + u23(j,k) i.e. u12(i,j) = u123(i,j,k) = 0. Notation: [13][23] Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) i.e. u123(i,j,k) = 0. Notation: [12][13][23] Description: Pairwise relations among all three variables, with each two variable interaction unaffected by the value of the third variable.

9. Model: (the saturated model) ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k) Notation: [123] Description: No simplifying dependence structure.

Hierarchical log-linear models for a 3-way table:
[1][2][3]      Mutual independence between all three variables.
[1][23]        Independence of variable 1 with variables 2 and 3.
[2][13]        Independence of variable 2 with variables 1 and 3.
[3][12]        Independence of variable 3 with variables 1 and 2.
[12][13]       Conditional independence between variables 2 and 3 given variable 1.
[12][23]       Conditional independence between variables 1 and 3 given variable 2.
[13][23]       Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]   Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]          The saturated model.
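
A minimal sketch, assuming the pandas and statsmodels packages, of how the hierarchical models above can be fitted as Poisson log-linear regressions; the 2 x 2 x 2 table of counts is hypothetical, and the bracket notation maps directly to the interaction terms in each formula. The deviance of each Poisson fit is the likelihood-ratio statistic G2 discussed later.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical 2 x 2 x 2 frequency table, one row per cell
freq = pd.DataFrame({
    "A": ["a1"] * 4 + ["a2"] * 4,
    "B": (["b1"] * 2 + ["b2"] * 2) * 2,
    "C": ["c1", "c2"] * 4,
    "n": [20, 15, 12, 18, 25, 30, 10, 14],
})

pois = sm.families.Poisson()
m_ind  = smf.glm("n ~ A + B + C",      data=freq, family=pois).fit()  # [1][2][3]
m_12   = smf.glm("n ~ A*B + C",        data=freq, family=pois).fit()  # [12][3]
m_1213 = smf.glm("n ~ A*B + A*C",      data=freq, family=pois).fit()  # [12][13]
m_all2 = smf.glm("n ~ (A + B + C)**2", data=freq, family=pois).fit()  # [12][13][23]
m_sat  = smf.glm("n ~ A*B*C",          data=freq, family=pois).fit()  # [123]

# the deviance of each Poisson fit is the likelihood-ratio statistic G2
print(m_all2.deviance, m_all2.df_resid)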

Maximum Likelihood Estimation Log-Linear Model

For any model it is possible to determine the maximum likelihood estimators of the parameters. Example: the two-way table under independence with the multinomial model, where pij = pi• p•j and mij = N pij.

The log-likelihood is l = constant + Σi Σj xij ln pij. With the model of independence this becomes l = constant + Σi xi• ln pi• + Σj x•j ln p•j, subject to the constraints Σi pi• = 1 and Σj p•j = 1.

Maximizing with a Lagrange multiplier λ for the first constraint, setting the derivative with respect to pi• equal to zero gives xi•/pi• = λ, i.e. pi• = xi•/λ. Summing over i and using Σi pi• = 1 gives λ = Σi xi• = N.

Hence the maximum likelihood estimate of pi• is p̂i• = xi•/N, and similarly p̂•j = x•j/N.

Finally, the estimated expected frequency in cell (i,j) under independence is m̂ij = N p̂i• p̂•j = xi• x•j / N.

Comments Maximum likelihood estimates can be computed for any hierarchical log-linear model (i.e. with more than 2 variables). In certain situations the equations need to be solved numerically. For the saturated model (all interactions and main effects), the estimate of mijk… is xijk… .
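
For models whose likelihood equations have no closed-form solution (for example the no-three-factor-interaction model [12][13][23]), iterative proportional fitting is the classical numerical method. A minimal sketch, assuming a 3-way numpy array x of observed counts with all two-way margins positive; the function name is ours.

import numpy as np

def ipf_no_three_factor(x, n_iter=50):
    """Fit the [12][13][23] model by iterative proportional fitting.
    x is a 3-way numpy array of observed counts; returns fitted m_ijk."""
    m = np.ones_like(x, dtype=float)
    for _ in range(n_iter):
        # match each two-way margin of the fitted values to the observed margin
        m *= (x.sum(axis=2) / m.sum(axis=2))[:, :, None]   # [12] margin
        m *= (x.sum(axis=1) / m.sum(axis=1))[:, None, :]   # [13] margin
        m *= (x.sum(axis=0) / m.sum(axis=0))[None, :, :]   # [23] margin
    return m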

Discrete Multivariate Analysis Analysis of Multivariate Categorical Data

Multiway Frequency Tables Two-Way A

four -Way B A C D

Log Linear Model

Two-way table: ln mij = u + u1(i) + u2(j) + u12(i,j), where each u-term sums to zero over its subscripts. The multiplicative form: mij = γ γ1(i) γ2(j) γ12(i,j), with γ = exp(u), γ1(i) = exp(u1(i)), and so on.

Log-Linear model for three-way tables Let mijk denote the expected frequency in cell (i,j,k) of the table; then in general ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k), where each u-term sums to zero over any of its subscripts.

Log-Linear model for three-way tables Let mijk denote the expected frequency in cell (i,j,k) of the table; then in general ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k), or the multiplicative form mijk = γ γ1(i) γ2(j) γ3(k) γ12(i,j) γ13(i,k) γ23(j,k) γ123(i,j,k), with γ = exp(u), γ1(i) = exp(u1(i)), and so on.

Comments The log-linear model is similar to the ANOVA models for factorial experiments. The ANOVA models are used to understand the effects of categorical independent variables (factors) on a continuous dependent variable (Y). The log-linear model is used to understand dependence amongst categorical variables. The presence of interactions indicates dependence between the variables present in the interactions.

Hierarchical Log-linear Models for Categorical Data For three-way tables, the hierarchical principle: if an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

1. Model: (All Main effects model) ln mijk = u + u1(i) + u2(j) + u3(k) i.e. u12(i,j) = u13(i,k) = u23(j,k) = u123(i,j,k) = 0. Notation: [1][2][3] Description: Mutual independence between all three variables.

2. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) i.e. u13(i,k) = u23(j,k) = u123(i,j,k) = 0. Notation: [12][3] Description: Independence of Variable 3 with variables 1 and 2.

3. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) i.e. u12(i,j) = u23(j,k) = u123(i,j,k) = 0. Notation: [13][2] Description: Independence of Variable 2 with variables 1 and 3.

4. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u23(j,k) i.e. u12(i,j) = u13(i,k) = u123(i,j,k) = 0. Notation: [23][1] Description: Independence of Variable 1 with variables 2 and 3.

5. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) i.e. u23(j,k) = u123(i,j,k) = 0. Notation: [12][13] Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u23(j,k) i.e. u13(i,k) = u123(i,j,k) = 0. Notation: [12][23] Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) + u23(j,k) i.e. u12(i,j) = u123(i,j,k) = 0. Notation: [13][23] Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) i.e. u123(i,j,k) = 0. Notation: [12][13][23] Description: Pairwise relations among all three variables, with each two variable interaction unaffected by the value of the third variable.

9. Model: (the saturated model) ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k) Notation: [123] Description: No simplifying dependence structure.

Hierarchical log-linear models for a 3-way table:
[1][2][3]      Mutual independence between all three variables.
[1][23]        Independence of variable 1 with variables 2 and 3.
[2][13]        Independence of variable 2 with variables 1 and 3.
[3][12]        Independence of variable 3 with variables 1 and 2.
[12][13]       Conditional independence between variables 2 and 3 given variable 1.
[12][23]       Conditional independence between variables 1 and 3 given variable 2.
[13][23]       Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]   Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]          The saturated model.

Goodness of Fit Statistics These statistics can be used to check if a log-linear model will fit the observed frequency table

Goodness of Fit Statistics The chi-squared statistic: χ2 = Σ (x − e)2 / e, where x is the observed and e the expected (fitted) frequency in each cell. The likelihood ratio statistic: G2 = 2 Σ x ln(x/e). d.f. = # cells − # parameters fitted. We reject the model if χ2 or G2 is greater than the critical value χ2α with those degrees of freedom.
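
A minimal sketch of the two statistics, assuming numpy/scipy; x holds the observed counts, e the expected (fitted) counts under the model, and df the residual degrees of freedom (# cells − # parameters fitted).

import numpy as np
from scipy import stats

def goodness_of_fit(x, e, df, alpha=0.05):
    x, e = np.asarray(x, float), np.asarray(e, float)
    chi2 = ((x - e) ** 2 / e).sum()              # Pearson chi-square
    safe = np.where(x > 0, x, 1.0)               # zero cells contribute 0 to G2
    g2 = 2.0 * np.sum(x * np.log(safe / e))      # likelihood-ratio statistic G2
    crit = stats.chi2.ppf(1 - alpha, df)         # critical value
    return chi2, g2, crit, stats.chi2.sf(g2, df)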

Example: Variables: Systolic Blood Pressure (B), Serum Cholesterol (C), Coronary Heart Disease (H).

Goodness of fit testing of models

MODEL        DF   LIKELIHOOD-RATIO CHISQ   PROB.    PEARSON CHISQ   PROB.
B,C,H.       24         83.15              0.0000      102.00       0.0000
B,CH.        21         51.23              0.0002       56.89       0.0000
C,BH.        21         59.59              0.0000       60.43       0.0000
H,BC.        15         58.73              0.0000       64.78       0.0000
BC,BH.       12         35.16              0.0004       33.76       0.0007
BH,CH.       18         27.67              0.0673       26.58       0.0872   n.s.
CH,BC.       12         26.80              0.0082       33.18       0.0009
BC,BH,CH.     9          8.08              0.5265        6.56       0.6824   n.s.

Possible models:
1. [BH][CH] – B and C independent given H.
2. [BC][BH][CH] – the all two-factor interaction model.

Model 1: [BH][CH] Log-linear parameters Heart disease -Blood Pressure Interaction

Multiplicative effect Log-Linear Model

Heart Disease - Cholesterol Interaction

Multiplicative effect

Model 2: [BC][BH][CH] Log-linear parameters Blood pressure-Cholesterol interaction:

Multiplicative effect

Heart disease -Blood Pressure Interaction

Multiplicative effect

Heart Disease - Cholesterol Interaction

Multiplicative effect

Another Example In this study the following were determined for N = 4353 males: Occupation category, Educational level, Academic aptitude.

Occupation categories: Self-employed Business; Teacher/Education; Self-employed Professional; Salaried Employed. Education levels: Low, Low/Med, Med, High/Med, High.

Academic Aptitude: Low, Low/Med, High/Med, High.

Frequency tables of Aptitude (columns) by Education (rows) within each occupation category:

Self-employed, Business
Education   Low  LMed  HMed  High  Total
Low          42    55    22     3    122
LMed         72    82    60    12    226
Med          90   106    85    25    306
HMed         27    48    47     8    130
High          8    18    19     5     50
Total       239   309   233    53    834

Teacher/Education
Education   Low  LMed  HMed  High  Total
Low           0     0     1    19     20
LMed          0     3     3    60     66
Med           1     4     5    86     96
HMed          0     0     2    36     38
High          0     0     1    14     15
Total         1     7    12   215    235

Self-employed, Professional
Education   Low  LMed  HMed  High  Total
Low           1     2     8    19     30
LMed          1     2    15    33     51
Med           2     5    25    83    115
HMed          2     2    10    45     59
High          0     0    12    19     31
Total         6    11    70   199    286

Salaried Employed
Education   Low  LMed  HMed  High  Total
Low         172   151   107    42    472
LMed        208   198   206    92    704
Med         279   271   331   191   1072
HMed         99   126   179    97    501
High         36    35    99    79    249
Total       794   781   922   501   2998

It is common to handle a multiway table by testing for independence in all two-way tables. This is similar to looking at all the bivariate correlations. In this example we learn that: Education is related to Aptitude; Education is related to Occupational category. Can we do better than this?

Fitting various log-linear models Simplest model that fits is: [Apt,Ed][Occ,Ed] This model implies conditional independence between Aptitude and Occupation given Education.
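
A sketch of fitting this model to the occupation data above with pandas and statsmodels: the counts are arranged in long format and the formula contains the Apt:Ed and Occ:Ed interactions but no Apt:Occ term, which is exactly the conditional-independence statement [Apt,Ed][Occ,Ed].

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# counts from the tables above: occupation -> education (rows) x aptitude (columns)
counts = {
    "SE Business":     [[42,55,22,3],[72,82,60,12],[90,106,85,25],[27,48,47,8],[8,18,19,5]],
    "Teacher/Educ":    [[0,0,1,19],[0,3,3,60],[1,4,5,86],[0,0,2,36],[0,0,1,14]],
    "SE Professional": [[1,2,8,19],[1,2,15,33],[2,5,25,83],[2,2,10,45],[0,0,12,19]],
    "Salaried":        [[172,151,107,42],[208,198,206,92],[279,271,331,191],[99,126,179,97],[36,35,99,79]],
}
ed_levels  = ["Low", "LMed", "Med", "HMed", "High"]
apt_levels = ["Low", "LMed", "HMed", "High"]

rows = [(occ, ed, apt, counts[occ][i][j])
        for occ in counts
        for i, ed in enumerate(ed_levels)
        for j, apt in enumerate(apt_levels)]
occ_data = pd.DataFrame(rows, columns=["Occ", "Ed", "Apt", "n"])

# [Apt,Ed][Occ,Ed]: Aptitude and Occupation conditionally independent given Education
model = smf.glm("n ~ Apt*Ed + Occ*Ed", data=occ_data,
                family=sm.families.Poisson()).fit()
print(model.deviance, model.df_resid)   # likelihood-ratio G2 and residual d.f.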

Log-linear Parameters Aptitude – Education Interaction

Aptitude – Education Interaction (Multiplicative)

Occupation – Education Interaction

Occupation – Education Interaction (Multiplicative)

Conditional Test Statistics

Suppose that we are considering two Log-linear models and that Model 2 is a special case of Model 1. That is the parameters of Model 2 are a subset of the parameters of Model 1. Also assume that Model 1 has been shown to adequately fit the data.

In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation. The likelihood ratio chi-square statistic that achieves this goal is G2(2|1) = G2(2) − G2(1), with degrees of freedom equal to the difference in the degrees of freedom of the two models.
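
A minimal sketch of this conditional test, assuming the two models have been fitted as Poisson log-linear GLMs (as sketched earlier), so that each fit's deviance is its G2; the difference in deviances is G2(2|1), referred to a chi-square distribution on the difference in residual degrees of freedom.

from scipy import stats

def conditional_test(model1, model2):
    """model2 is the restricted (special-case) model; model1 is the fuller model."""
    g2_diff = model2.deviance - model1.deviance
    df_diff = model2.df_resid - model1.df_resid
    return g2_diff, df_diff, stats.chi2.sf(g2_diff, df_diff)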

Example

Goodness of Fit test for the all k-factor models Conditional tests for zero k-factor interactions

Conclusions The four-factor interaction is not significant: G2(3|4) = 0.7 (p = 0.705). The all three-factor model provides an adequate fit: G2(3) = 0.7 (p = 0.705). All the three-factor interactions are not significantly different from 0: G2(2|3) = 9.2 (p = 0.239). The all two-factor model provides an adequate fit: G2(2) = 9.9 (p = 0.359). There are significant two-factor interactions: G2(1|2) = 33.0 (p = 0.00083). Conclude that the model should contain main effects and some two-factor interactions.

There also may be a natural sequence of progressively more complicated models that one might want to identify. In the laundry detergent example the variables are: 1. Softness of laundry used; 2. Previous use of Brand M; 3. Temperature of laundry water used; 4. Preference of Brand X over Brand M.

A natural order for increasingly complex models which should be considered might be:
[1][2][3][4] – the all-main-effects model: independence amongst all four variables.
[1][3][24] – since previous use of Brand M may be highly related to preference for Brand M, first add the 2-4 interaction.
[1][34][24] – Brand M is recommended for hot water, so next add the 3-4 interaction.
[13][34][24] – Brand M is also recommended for soft laundry, so add the 1-3 interaction third.
[13][234], [134][234] – finally add some possible three-factor interactions.

Likelihood ratio G2 for various models

Model                      d.f.   G2
[1][3][24]                  17    22.4
[1][24][34]                 16    18
[13][24][34]                14    11.9
[13][23][24][34]            13    11.2
[12][13][23][24][34]        11    10.1
[1][234]                          14.5
[134][24]                   10    12.2
[13][234]                   12     8.4
[24][34][123]                9
[123][234]                   8     5.6

Discrete Multivariate Analysis Analysis of Multivariate Categorical Data

Log-Linear model for three-way tables Let mijk denote the expected frequency in cell (i,j,k) of the table; then in general ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k), where each u-term sums to zero over any of its subscripts.

Hierarchical Log-linear Models for Categorical Data For three-way tables, the hierarchical principle: if an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

Models for three-way tables

1. Model: (All Main effects model) ln mijk = u + u1(i) + u2(j) + u3(k) i.e. u12(i,j) = u13(i,k) = u23(j,k) = u123(i,j,k) = 0. Notation: [1][2][3] Description: Mutual independence between all three variables. Comment: For any model the parameters (u, u1(i) , u2(j) , u3(k)) can be estimated in addition to the expected frequencies (mijk) in each cell

2. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) i.e. u13(i,k) = u23(j,k) = u123(i,j,k) = 0. Notation: [12][3] Description: Independence of Variable 3 with variables 1 and 2.

3. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) i.e. u12(i,j) = u23(j,k) = u123(i,j,k) = 0. Notation: [13][2] Description: Independence of Variable 2 with variables 1 and 3.

4. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u23(j,k) i.e. u12(i,j) = u13(i,k) = u123(i,j,k) = 0. Notation: [23][1] Description: Independence of Variable 1 with variables 2 and 3.

5. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) i.e. u23(j,k) = u123(i,j,k) = 0. Notation: [12][13] Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u23(j,k) i.e. u13(i,k) = u123(i,j,k) = 0. Notation: [12][23] Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) + u23(j,k) i.e. u12(i,j) = u123(i,j,k) = 0. Notation: [13][23] Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) i.e. u123(i,j,k) = 0. Notation: [12][13][23] Description: Pairwise relations among all three variables, with each two variable interaction unaffected by the value of the third variable.

9. Model: (the saturated model) ln mijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k) Notation: [123] Description: No simplifying dependence structure.

Goodness of Fit Statistics The chi-squared statistic: χ2 = Σ (x − e)2 / e, where x is the observed and e the expected (fitted) frequency in each cell. The likelihood ratio statistic: G2 = 2 Σ x ln(x/e). d.f. = # cells − # parameters fitted. We reject the model if χ2 or G2 is greater than the critical value χ2α with those degrees of freedom.

Conditional Test Statistics

In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation. The likelihood ratio chi-square statistic that achieves this goal is G2(2|1) = G2(2) − G2(1), with degrees of freedom equal to the difference in the degrees of freedom of the two models.

Stepwise selection procedures Forward Selection Backward Elimination

Forward Selection: Starting with a model that underfits the data, log-linear parameters that are not in the model are added step by step until a model that does fit is achieved. At each step the log-linear parameter that is most significant is added to the model. To determine the significance of an added parameter we use the statistic G2(2|1) = G2(2) – G2(1), where Model 1 contains the parameter and Model 2 does not.

Backward Elimination: Starting with a model that overfits the data, log-linear parameters that are in the model are deleted step by step until a model that continues to fit the data and has the smallest number of significant parameters is achieved. At each step the log-linear parameter that is least significant is deleted from the model. To determine the significance of a deleted parameter we use the statistic G2(2|1) = G2(2) – G2(1), where Model 1 contains the parameter and Model 2 does not (a sketch of one such step is given below).
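
A sketch of one backward-elimination step, under the assumption that candidate models are written as statsmodels/patsy formulas and fitted as Poisson GLMs; base_terms (typically the main effects and any protected lower-order terms) are kept, and the least significant deletable term is returned. The function and argument names are ours.

import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

def backward_step(freq, base_terms, candidate_terms, alpha=0.05):
    """One backward-elimination step. base_terms are kept; candidate_terms are
    interaction terms eligible for deletion. Returns (term, p-value) to drop,
    or None if every deletion is significant."""
    def fit(terms):
        return smf.glm("n ~ " + " + ".join(terms), data=freq,
                       family=sm.families.Poisson()).fit()
    full = fit(base_terms + candidate_terms)
    best = None
    for term in candidate_terms:
        reduced = fit(base_terms + [t for t in candidate_terms if t != term])
        g2 = reduced.deviance - full.deviance       # G2(2|1)
        df = reduced.df_resid - full.df_resid
        p = stats.chi2.sf(g2, df)
        if p > alpha and (best is None or p > best[1]):
            best = (term, p)
    return best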

K = Knowledge, N = Newspaper, R = Radio, S = Reading, L = Lectures

Continuing after 10 steps

The final step

The best model was found at the previous step: [LN][KLS][KR][KN][LR][NR][NS]

Modelling of response variables Independent → Dependent

Logit Models To date we have not worried about whether any of the variables were dependent or independent variables. The logit model is used when we have a single binary dependent variable.

The variables: Type of seedling (T): Longleaf seedling, Slash seedling. Depth of planting (D): Too low, Too high. Mortality (M) (the dependent variable): Dead, Alive.

The Log-linear Model Note: mij1 = # dead when T = i and D = j; mij2 = # alive when T = i and D = j; mij1/mij2 = mortality ratio when T = i and D = j.

Hence ln(mij1/mij2) = ln mij1 − ln mij2. Since the terms of the log-linear model that do not involve M are the same in ln mij1 and ln mij2, they cancel, leaving only differences of u-terms involving M: ln(mij1/mij2) = [uM(1) − uM(2)] + [uTM(i,1) − uTM(i,2)] + [uDM(j,1) − uDM(j,2)].

The logit model: lij = ln(mij1/mij2) = w + wT(i) + wD(j), where w = uM(1) − uM(2) = 2uM(1), wT(i) = uTM(i,1) − uTM(i,2) = 2uTM(i,1), and wD(j) = uDM(j,1) − uDM(j,2) = 2uDM(j,1), since the u-terms sum to zero over the two levels of M.

Thus corresponding to a log-linear model there is a logit model predicting the log ratio of expected frequencies of the two categories of the dependent variable. Also, (k + 1)-factor interactions with the dependent variable in the log-linear model determine k-factor terms in the logit model: k + 1 = 1 gives the constant term in the logit model; k + 1 = 2 gives main effects in the logit model; and so on.
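
A small numeric sketch of the induced logit model, using a hypothetical 2 x 2 x 2 seedling table: the log-linear model [TD][TM][DM] is fitted as a Poisson GLM with statsmodels, and the fitted log mortality ratio ln(mij1/mij2) is computed for each (T, D) combination.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical 2 (type) x 2 (depth) x 2 (mortality) table of seedling counts
cells = pd.DataFrame({
    "T": ["long", "long", "long", "long", "slash", "slash", "slash", "slash"],
    "D": ["low", "low", "high", "high", "low", "low", "high", "high"],
    "M": ["dead", "alive"] * 4,
    "n": [41, 59, 11, 89, 30, 70, 5, 95],
})

# log-linear model [TD][TM][DM]
fit = smf.glm("n ~ T + D + M + T:D + T:M + D:M", data=cells,
              family=sm.families.Poisson()).fit()

# implied logit: log(m_dead / m_alive) for each (T, D) combination
wide = cells.assign(m=fit.fittedvalues).pivot_table(index=["T", "D"],
                                                    columns="M", values="m")
print(np.log(wide["dead"] / wide["alive"]))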

1 = Depth, 2 = Mort, 3 = Type

Log-Linear parameters for Model: [TM][TD][DM]

Logit Model for predicting the Mortality

The best model found by forward selection was [LN][KLS][KR][KN][LR][NR][NS]. To fit a logit model to predict K (Knowledge) we need to fit a log-linear model containing the important interactions with K, namely [LNRS][KLS][KR][KN]. The logit model will contain: main effects for L (Lectures), N (Newspapers), R (Radio) and S (Reading); a two-factor interaction effect for L and S.

The logit parameters for the model [LNSR][KLS][KR][KN] (multiplicative effects are given in brackets; logit parameters = 2 × log-linear parameters):
The constant term: -0.226 (0.798)
The main effects on Knowledge:
  Lectures:   Lect 0.268 (1.307)    None -0.268 (0.765)
  Newspaper:  News 0.324 (1.383)    None -0.324 (0.723)
  Reading:    Solid 0.340 (1.405)   Not -0.340 (0.712)
  Radio:      Radio 0.150 (1.162)   None -0.150 (0.861)
The two-factor interaction effect of Reading and Lectures on Knowledge:

Fitting a Logit Model with a Polytomous Response Variable

Example: NA – Not available

The variables: Race – white, black. Age – < 22, ≥ 22. Father's education – GS, some HS, HS grad, NA. Respondent's education – GS, some HS, HS grad – the response (dependent) variable.

Techniques for handling a polytomous response variable. Approaches: 1. Consider the categories two at a time, doing this for all possible pairs of categories. 2. Look at the continuation ratios: 1 vs 2; 1,2 vs 3; 1,2,3 vs 4; etc.
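
A sketch of the continuation-ratio approach, assuming one row per subject with the ordinal response coded 1, 2, 3, ... in a numeric column; each ratio is fitted as an ordinary binary logit on the subjects still "at risk". The column and predictor names are hypothetical.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def continuation_ratio_fits(df, response="ed", predictors="race + age + fed"):
    """Fit binary logits for 1 vs 2, (1,2) vs 3, (1,2,3) vs 4, ... assuming the
    response is coded with consecutive integers 1..K."""
    fits = []
    levels = sorted(df[response].unique())
    for k in levels[:-1]:
        sub = df[df[response] <= k + 1].copy()              # subjects at categories <= k+1
        sub["y"] = (sub[response] == k + 1).astype(int)     # reached category k+1?
        fits.append(smf.glm("y ~ " + predictors, data=sub,
                            family=sm.families.Binomial()).fit())
    return fits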

Causal or Path Analysis for Categorical Data

When the data are continuous, a causal pattern may be assumed to exist amongst the variables. The path diagram is a diagram summarizing causal relationships. A straight arrow is drawn from a variable that has some causal effect on another variable (X → Y). Curved double-sided arrows are drawn between variables that are simply correlated (X ↔ Y).

Example 1 The variables: Job Stress, Smoking, Heart Disease. The path diagram connects Job Stress, Smoking and Heart Disease. In path analysis for continuous variables, one is interested in determining the contribution along each path (the path coefficients).

Example 2 The variables: Job Stress, Alcoholic Drinking, Smoking, Heart Disease. The path diagram connects Job Stress, Drinking, Smoking and Heart Disease.

In the analysis of categorical data there are no path coefficients, but path diagrams can point to the appropriate logit analysis. Example: In this example the data consist of two-wave, two-variable panel data for a sample of n = 3398 schoolboys, looking at "membership" in and "attitude towards" the leading crowd.

The path diagram (variables A, B, C, D) suggests predicting B from A, then C from A and B, and finally D from A, B and C.

Example 2 In this example we are looking at: Socioeconomic Status (SES), Sex, IQ, Parental Encouragement for higher education (PE), College Plans (CP).

The path diagram (variables: Sex, SES, IQ, PE, CP)

The path diagram suggests predicting Parental Encouragement from Sex, Socioeconomic Status and IQ, then predicting College Plans from Parental Encouragement, Sex, Socioeconomic Status and IQ.

Logit Parameters: Model [ABC][ABD][ACD][BCD]

Two factor Interactions

Logit Parameters for Predicting College Plans Using Model 9: [ABCD][BCE][AE][DE]