Test of Independence in 3 Variables
Wednesday, September 21, 2016. Farrokh Alemi, PhD. This lecture focuses on possible ways variables can relate to each other. The lecture is based on slides prepared by Dr. Lin and modified by Dr. Alemi. It also draws on Alan Agresti's treatment of the topic in his book Categorical Data Analysis.
Independence in 3 Variables
If we are dealing with 3 or more variables, establishing independence becomes increasingly complex: all variables could be independent of each other, pairs of variables could be independent, or other possibilities may occur.
Independence in 3 Variables
Complete Joint Conditional Homogeneous Saturated A B C There are 5 unique ways 3 variables could be related to each other, and scientists have named these ways: complete independence, joint independence, conditional independence, homogeneous association, and the saturated model. In complete independence all variables are independent of each other. Here we show the 3 variables A, B, and C without any linkage between them to mark that they are independent of each other.
Independence in 3 Variables
Complete Joint Conditional Homogeneous Saturated A B C Joint independence is the situation where two variables are jointly independent of a third. In these graphs a bar between two variables indicates that they are dependent on each other, i.e. there is a relationship between them. There are 3 ways joint independence could occur among 3 variables: A and B could be related but independent of C, A and C could be related but independent of B, and B and C could be related to each other but independent of A.
Independence in 3 Variables
Complete Joint Conditional Homogeneous Saturated A B C A B C A B C Conditional independence refers to the situation where knowledge of one variable is sufficient to make two otherwise dependent variables independent. A directional bar indicates that two variables are dependent only through the third variable. There are 3 ways conditional independence could occur: knowing B could make A and C independent, knowing C could make A and B independent, and knowing A could make B and C independent.
Independence in 3 Variables
Complete Joint Conditional Homogeneous Saturated A B C C=k B=j A=i Homogeneous association is the situation where each pair of variables is related, but the relationship between any two variables is the same while holding the third variable at any fixed level. Another way of saying this is that the level of the third variable does not change the relationship between the other two variables. Thus the relationship between A and B remains the same at every level of C.
Independence in 3 Variables
Complete Joint Conditional Homogeneous Saturated A B C C=k B=j A=i Finally, the saturated model is the situation where all variables are dependent on each other.
Independence in 3 Variables
Complete Joint Conditional Homogeneous Saturated A B C C=k B=j A=i Parsimony These models are arranged in order of parsimony, with complete independence being the most parsimonious and the saturated model being the least. In science, when a more parsimonious model fits the data, it is preferred to a less parsimonious model. So if the data can be modeled with both the saturated model and conditional independence, then the conditional independence model is preferred. If the data are completely independent, then they can also be modeled by any of the other 4 models, but preference is given to the complete independence model.
How to test independence among 3 variables?
Observed Count, Expected Count. The remainder of this lecture focuses on ways these different types of independence can be recognized. The general approach is to use the model to estimate the count that should be expected in each cell of the table. The differences between the observed and expected values follow a chi-square distribution, so we can test whether the model fits the data.
How to test independence among 3 variables?
In each instance, we provide a model for estimating the expected count and the degrees of freedom for the chi-square test. Degrees of freedom are estimated as the number of cells in the table minus the number of parameters estimated from the data by the model. Chi-squared tests of independence merely indicate the degree of evidence for an association; they are inadequate for answering all questions about a data set. Rather than relying solely on the test results, one must investigate the nature of the association: study residuals, decompose the chi-squared statistic into components, and estimate parameters such as odds ratios that describe the strength of the association.
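The general recipe above can be sketched in a few lines of Python (a sketch; the function names are mine, not the lecture's):

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def residual_df(n_cells, n_free_params):
    """Degrees of freedom: cells in the table, minus 1 for the fixed total,
    minus the parameters the model estimates from the data."""
    return n_cells - 1 - n_free_params

# Complete independence in a 2x2x2 table estimates (2-1)+(2-1)+(2-1) = 3 parameters
df_complete = residual_df(8, 3)   # = 4
```

Each model in the rest of the lecture changes only how `expected` is computed and how many parameters it estimates.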
Notation: 3 Variables Suppose that we have three categorical variables, A, B, and C, where A takes possible values 1 through I, B takes possible values 1 through J, and C takes possible values 1 through K. For example, A could be the 2 physicians in our clinic, George and Smith; B could be the two nurses in our clinic, Jim and Jill; and C could be whether a patient has complained about the combined physician-nurse team, taking on the values Yes and No.
Notation: Count of Observations
Number of times A=i, B=j, C=k in a sample of n triplets of A, B, and C. If we collect the triplet (A, B, C) for each unit in a sample of n units, then the data can be summarized in a three-dimensional table. Let Y, with subscripts i, j, and k, be the count of units having A equal to i, B equal to j, and C equal to k. Then Y provides a count for each cell in the three-dimensional table of A, B, and C. When all variables are categorical, the data can be displayed as a multidimensional contingency table.
Notation: Count of Observations
Number of Patients MD = George, RN = Jill, Complaint = Yes For example, Y is the count of patients who were seen by the clinical team of George and Jill and who had complained. This is one cell value in a larger table.
Table 1: Satisfaction Across Teams
Three Variables

Table 1: Satisfaction Across Teams

A: Physician    B: Nurse    C: Complaint          Percent
                            Yes        No         Dissatisfied
George, MD      Jim, RN     53         424        11.11%
                Jill, RN    11         37         22.92%
Smith, MD       Jim, RN     0          16         0.00%
                Jill, RN    4          139        2.80%
Total           Jim, RN     53         440        10.75%
                Jill, RN    15         176        7.85%

Each cell in this table is one of the observed counts. In this table we see the distribution of Y for different clinical team and complaint combinations; that is, the counts for different combinations of A, B, and C.
Table 1: Satisfaction Across Teams
In the last two rows of Table 1 (highlighted in yellow on the slide), we see a marginal table of the complaint counts for just the nurses, ignoring the physicians. Note that a marginal table totals the values while removing one of the variables.
Table 1: Satisfaction Across Teams
The cross sections of the table are called partial tables. In a partial table, one of the variables is held constant. For example, in the partial table highlighted in yellow on the slide, the physician is held constant at George: all of the data in this partial table are about teams involving George.
Notation: Summation We will use "+" to indicate summation over a subscript. For example, here we sum over the subscript j to produce a marginal table involving the subscripts i and k; in this marginal table, variable B is ignored. Note that we drop the comma between the subscripts to save space.
Notation: Summation Here both the subscripts i and j are summed over; we are ignoring both variables A and B. We see the count associated with the k levels of variable C. This is equivalent to the totals of a marginal table of variable C.
Notation: Assumption If the n observations in the sample are independent and identically distributed, then the vector of cell counts Y has a multinomial distribution. Note that Y and the symbol pi are in bold: these constructs are vectors, i.e. collections of variables, not single variables.
Notation: Assumption We are assuming that across patients we have independent and identically distributed observations. This is appropriate in many situations where the data are obtained from different patients and there is no reason to believe that the values of one patient can affect the values of another. The assumption is not met when patients have infectious diseases, as the probability of infection for one patient is affected by the infection of others. It is also not valid when the underlying process has changed over the data collection period.
Notation: Estimate (the hat indicates an estimate) Under the multinomial assumption, the estimated probability for a cell is calculated by dividing the count in the cell by the total number of patients in the sample. Because this probability is estimated from the data, we show it with a hat on top of the p to emphasize that it is an estimate and not an observed value.
Probability Note that under this model the expected count for each cell equals the observed count: multiplying the sample size by the estimated probability of the cell gives the expected count, which is the same as the observed count. These expected counts can then be used in a chi-square test to check whether a model fits the data well.
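As a quick numeric check of this point, using one cell from Table 1 of this lecture:

```python
n = 684        # total patients in Table 1
y_cell = 53    # observed count for one cell (George, Jim, complaint = Yes)

p_hat = y_cell / n      # estimated cell probability (the "p with a hat")
expected = n * p_hat    # expected count; equals the observed count exactly
```

Dividing by n and then multiplying by n simply returns the observed count, which is why the saturated model always fits the data perfectly.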
A B C Saturated Model The first model we show how to test is the saturated model. The observed sample proportions, and by extension the observed sample counts, of each cell are often called the saturated model. In the saturated model the expected value for each cell is used to complete the chi-squared test. Fitting a saturated model might not reveal any special structure that may exist in the relationships among A, B, and C. To investigate these relationships, we propose simpler models and perform tests to see whether these simpler models fit the data.
A B C Saturated Model If one of the variables is being predicted, then the saturated model can be thought of as a logistic regression of C on the main effects of A and B as well as the interaction of A and B.
Homogeneous Model B C A=i
The model of homogeneous association says that the conditional relationship between any pair of variables, given the third one, is the same at each level of the third one. That is, there are no interactions. An interaction means that the relationship between two variables changes across the levels of a third. This is similar in spirit to the multivariate normal distribution for continuous variables, in which the conditional correlation between any two variables given a third is the same for all values of the third. Under the model of homogeneous association there are no closed-form estimates for the cell probabilities. In the two-by-two partial tables, the odds ratio can be calculated and a chi-square test of homogeneity can be carried out; the degrees of freedom for this test is I minus 1, where I is the number of levels of the fixed variable. The procedure then needs to be repeated holding each of the other variables fixed, examining homogeneity of the odds ratios across the levels of the other two variables. To implement the procedure effectively, the data need to be expressed as two-by-two tables.
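A minimal sketch of the odds-ratio step, using the two B x C partial tables from Table 1 of this lecture. The 0.5 continuity (Haldane) correction for the zero cell is my addition, not part of the slides:

```python
def odds_ratio(a, b, c, d, correction=0.0):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]; adding 0.5 to every
    cell (Haldane's correction) guards against zero cells."""
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# Nurse x complaint partial tables from Table 1, holding the physician fixed
or_george = odds_ratio(53, 424, 11, 37)                # about 0.42
or_smith  = odds_ratio(0, 16, 4, 139, correction=0.5)  # zero cell needs the correction
```

Homogeneous association asks whether these stratum-specific odds ratios are plausibly equal; the zero cell in Smith's stratum foreshadows the next slide's point about division by zero.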
Homogeneous Model If it is not possible to calculate the variance because one or more of the cell values are zero, these formulas provide a method of estimating the chi-square statistic that bypasses division by zero. First the data are organized into L strata, each stratum being a combination of two of the variables. The probability of the outcome in each stratum is estimated, and then the expected number of cases in each of the L strata. The observed and expected values are used to estimate a chi-square with L minus 1 degrees of freedom. Based on lesson 45 of the STAT 414 course at Penn State.
Conditional Independence
B A C Conditional Independence This model indicates that A and B are related and A and C are related, but B and C are related only through their mutual association with A. For a given level of A, B and C are independent. If A is used to stratify the data, then within each stratum B and C are independent. The test for conditional independence of B and C given A is equivalent to separating the table by the levels of A, going from 1 through I, and testing for independence within each level.
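A sketch of this stratify-and-test procedure applied to Table 1 of this lecture (the lecture does not carry out this particular test on the data; the helper name is mine):

```python
def chi_sq_independence_2x2(table):
    """Pearson chi-square for independence in one 2x2 partial table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = [a + b, c + d], [a + c, b + d]
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

# Nurse x complaint tables at each level of the physician (Table 1)
george = [[53, 424], [11, 37]]   # rows: Jim, Jill; columns: Yes, No
smith  = [[0, 16], [4, 139]]

# B and C conditionally independent given A: sum the per-stratum statistics;
# df = I * (J - 1) * (K - 1) = 2 * 1 * 1 = 2
chi_total = chi_sq_independence_2x2(george) + chi_sq_independence_2x2(smith)
```

The per-stratum statistics are simply added, and the degrees of freedom add as well, one per 2x2 stratum here.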
Joint Independence A B C
This model indicates that A and B are related but C is independent of both. The model is tested by creating marginal tables and using the counts in these tables to estimate the counts in the I by J by K cells of the observed data. A chi-squared test can then be used to test the model. Note that in three-variable models, one can test for joint independence of AB, AC, or BC from their complements; here we show how to test whether AB is independent of C.
Complete Independence
A B C Complete Independence Under the assumption of complete independence, the joint count can be estimated from the product of the marginal counts. Then the squared differences of the observed and expected counts, divided by the expected counts, follow a chi-square distribution, which can be tested using the corresponding degrees of freedom.
Table 1: Satisfaction Across Teams
Satisfaction Example

Table 1: Satisfaction Across Teams

Physician      Nurse      Complained
                          Yes     No
George, MD     Jim, RN    53      424
               Jill, RN   11      37
Smith, MD      Jim, RN    0       16
               Jill, RN   4       139

This is a 2 x 2 x 2 contingency table: two rows, two columns, and two layers. In this table we see how two physicians work with two nurses and whether their patients have complained. The data are hypothetical, but such data are easily available in complaint registries within most hospitals. The 684 patients classified in the table were patients at this hypothetical clinic. We will use this table to estimate the relationships among the three variables.
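This table can be sketched as a three-dimensional array, with the total and the nurse marginal checked against the lecture's numbers:

```python
# counts[physician][nurse][complaint]: George/Smith, Jim/Jill, Yes/No
counts = [[[53, 424], [11, 37]],    # George with Jim, George with Jill
          [[0, 16], [4, 139]]]      # Smith with Jim, Smith with Jill

n = sum(y for layer in counts for row in layer for y in row)   # 684 patients

# Marginal nurse x complaint table, summing over physicians (the "Total" rows)
nurse_marginal = [[counts[0][j][k] + counts[1][j][k] for k in range(2)]
                  for j in range(2)]
```

Each later test works from this same array, differing only in which marginals it uses to build expected counts.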
Test of Complete Independence
MD        RN      Complaint   Predicted   Observed
George    Jim     Yes         38          53
                  No          341         424
          Jill    Yes         15          11
                  No          132         37
Smith     Jim     Yes         11          0
                  No          103         16
          Jill    Yes         4           4
                  No          40          139

We used the formula to predict the counts within each cell and then calculated the chi-square statistic with 4 degrees of freedom. The chi-square statistic, approximately 426, is statistically significant, and therefore the complete independence model does not fit the data.
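The complete-independence calculation can be reproduced as a short script (a sketch; the expected counts are kept unrounded here, whereas the table above rounds them):

```python
counts = [[[53, 424], [11, 37]],
          [[0, 16], [4, 139]]]      # counts[i][j][k]: physician, nurse, complaint
n = 684

# One-way marginal totals n_{i++}, n_{+j+}, n_{++k}
a = [sum(counts[i][j][k] for j in range(2) for k in range(2)) for i in range(2)]
b = [sum(counts[i][j][k] for i in range(2) for k in range(2)) for j in range(2)]
c = [sum(counts[i][j][k] for i in range(2) for j in range(2)) for k in range(2)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        for k in range(2):
            # Expected count under complete independence: n * p_i * p_j * p_k
            e = n * (a[i] / n) * (b[j] / n) * (c[k] / n)
            chi2 += (counts[i][j][k] - e) ** 2 / e

df = (2 * 2 * 2 - 1) - ((2 - 1) + (2 - 1) + (2 - 1))   # = 4
```

The statistic far exceeds the 0.05 critical value for 4 degrees of freedom (about 9.49), so complete independence is rejected.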
Test of Joint Independence
A B C Test of Joint Independence

MD        RN      Complaint   Predicted   Observed
George    Jim     Yes         47          53
                  No          430         424
          Jill    Yes         5           11
                  No          43          37
Smith     Jim     Yes         2           0
                  No          14          16
          Jill    Yes         14          4
                  No          129         139

Using the test of joint independence, assuming that A and B are related to each other but independent of C, we obtain a chi-square statistic of approximately 20, which is statistically significant with 3 degrees of freedom. The fit to the data is better than before but still not good. This model assumes that the impact of the clinical team depends on who is working together: the combination of the team matters.
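The joint-independence calculation can likewise be reproduced (a sketch with unrounded expected counts):

```python
counts = [[[53, 424], [11, 37]],
          [[0, 16], [4, 139]]]      # counts[i][j][k]: physician, nurse, complaint
n = 684

# Marginal totals n_{ij+} for the AB pairs and n_{++k} for complaints
ab = [[sum(counts[i][j]) for j in range(2)] for i in range(2)]
ck = [sum(counts[i][j][k] for i in range(2) for j in range(2)) for k in range(2)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        for k in range(2):
            # Expected count when (A, B) is jointly independent of C
            e = ab[i][j] * ck[k] / n
            chi2 += (counts[i][j][k] - e) ** 2 / e

df = (2 * 2 - 1) * (2 - 1)   # (IJ - 1)(K - 1) = 3
```

The statistic is far smaller than the 426 from the complete-independence test, but it still exceeds the 0.05 critical value for 3 degrees of freedom (about 7.81), so this model also fails to fit.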
Other Tests? Parsimony Other tests can be carried out in the same fashion, each time calculating the chi-square statistic. Since we wish to rely on the most parsimonious model and we have already found a much lower chi-square, we can stop here. You may wish to carry out the remaining tests to see whether they provide additional insight into the structure of the data.
Test of Independence among 3 Variables
This lecture has shown how various tests can be carried out to gain insight into the relationships among 3 variables. Each test provides new insight into the data.