Log Linear Modeling of Independence

Slides:



Advertisements
Similar presentations
© Department of Statistics 2012 STATS 330 Lecture 32: Slide 1 Stats 330: Lecture 32.
Advertisements

1 Chapter 4 Experiments with Blocking Factors The Randomized Complete Block Design Nuisance factor: a design factor that probably has an effect.
Chapter 4 Randomized Blocks, Latin Squares, and Related Designs
Lecture 23: Tues., Dec. 2 Today: Thursday:
Log-linear and logistic models Generalised linear model ANOVA revisited Log-linear model: Poisson distribution logistic model: Binomial distribution Deviances.
Log-linear and logistic models
Discrete Probability Distributions
Categorical Data Prof. Andy Field.
Keith D. McCroan US EPA National Air and Radiation Environmental Laboratory.
N318b Winter 2002 Nursing Statistics Specific statistical tests Chi-square (  2 ) Lecture 7.
1 Chi-square Test Dr. T. T. Kachwala. Using the Chi-Square Test 2 The following are the two Applications: 1. Chi square as a test of Independence 2.Chi.
Analysis of variance Tron Anders Moger
Objectives (BPS chapter 12) General rules of probability 1. Independence : Two events A and B are independent if the probability that one event occurs.
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
Stats Methods at IC Lecture 3: Regression.
CHAPTER 12 More About Regression
Inferential Statistics
Dependent-Samples t-Test
Physics 114: Lecture 13 Probability Tests & Linear Fitting
Logistic Regression APKC – STATS AFAC (2016).
Statistical Analysis Professor Lynne Stokes
Applied Biostatistics: Lecture 2
The binomial applied: absolute and relative risks, chi-square
Chapter 12 Tests with Qualitative Data
Active Learning Lecture Slides
CHAPTER 11 Inference for Distributions of Categorical Data
Categorical Data Aims Loglinear models Categorical data
CHAPTER 12 More About Regression
Chapter 25 Comparing Counts.
Chapter 5 Sampling Distributions
12 Inferential Analysis.
Chapter 5 Sampling Distributions
Probability Distributions for Discrete Variables
Probability Calculus Farrokh Alemi Ph.D.
AP Stats Check In Where we’ve been… Chapter 7…Chapter 8…
AP Stats Check In Where we’ve been… Chapter 7…Chapter 8…
Chapter 5 Sampling Distributions
Stratification Matters: Analysis of 3 Variables
Saturday, August 06, 2016 Farrokh Alemi, PhD.
Test of Independence in 3 Variables
Test of Independence through Mutual Information
Propagation Algorithm in Bayesian Networks
CHAPTER 11 Inference for Distributions of Categorical Data
Probability Key Questions
Comparing two Rates Farrokh Alemi Ph.D.
When You See (This), You Think (That)
12 Inferential Analysis.
CHAPTER 11 Inference for Distributions of Categorical Data
Wednesday, September 21, 2016 Farrokh Alemi, PhD.
CHAPTER 12 More About Regression
Chapter 26 Comparing Counts.
Product moment correlation
CHAPTER 11 Inference for Distributions of Categorical Data
Improving Overlap Farrokh Alemi, Ph.D.
CHAPTER 11 Inference for Distributions of Categorical Data
HIMS 650 Homework set 5 Putting it all together
Section 11-1 Review and Preview
CHAPTER 12 More About Regression
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 26 Comparing Counts Copyright © 2009 Pearson Education, Inc.
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 26 Comparing Counts.
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
InferentIal StatIstIcs
Risk Adjusted P-chart Farrokh Alemi, Ph.D.
MGS 3100 Business Analysis Regression Feb 18, 2016
Chapter 5: Sampling Distributions
Categorical Data By Farrokh Alemi, Ph.D.
Presentation transcript:

Log Linear Modeling of Independence September 26, 2016 Farrokh Alemi, PhD. This lecture focuses on log linear modeling of occurrences of variables. The lecture is based on the treatment of the topic by Alan Agresti in the book on Categorical Data Analysis.

Test of Independence Chi-square Test Mutual Information Experts’ Judgment Log-linear Poisson Regression There are many ways to test independence of variables. One approach is to hand calculate the difference between observed joint probability and the product of marginal probabilities. The approach leads to a chi-square test of independence. Another approach is to use mutual information, a procedure where counts of co-occurrences are used to measure dependence. One approach is to rely on expert’s judgment of whether one piece of data provides any new information to another in predicting a specific outcome. Still another approach, one that is the focus of this lecture is to use log-linear Poisson regression.

Poisson (Log Linear) Regression Response Has a Poisson Distribution Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model.

Poisson Distribution Count of Rare Events Count of rare events has a Poisson distribution, when the number of trials is large and each trial is independent. Furthermore, the expected value of Poisson distribution is constant over time.

Examples of Poisson Distribution Count of Adverse Events For example, the number of adverse events is likely to have a Poisson distribution. We have a large number of patients and the probability of adverse event for each patient is quite rare.

Examples of Poisson Distribution Count of Combination of Events For a more general example, the count of combination of events is also likely to have a Poisson distribution, when there are large number of events and the probability of any combination is relatively small. This is the case in almost any contingency table with more than 5 variables. There are at least 2 to the power of 5 or 32 possible combination and if each combination was equally likely the probability of each combination will be 1 divided by 32 or 0.05, yielding a relatively small probability. So it is likely that the cell values in 5 or more variable contingency table has a Poisson distribution. Of course, whether or not it has must be tested empirically.

Binomial Converges to Poisson Observations gets larger Probability gets smaller If we have many events, each having a binominal distribution, then when the number of events goes to infinity and the probability of each event becomes progressively smaller then the combination of the events have a Poisson distribution.

Binomial Converges to Poisson Observations gets larger Probability gets smaller If we have many events, each having a binominal distribution, then when the number of events goes to infinity and the probability of each event becomes progressively smaller then the combination of the events has a Poisson distribution.

Poisson Distribution In a Poisson distribution, the probability of observing k items can be calculated from this formula. In this formula meu is the Greek letter and indicates the expected value of the probability distribution, in other words it is a number that shows what will happen in general. K is a constant and it typically is the count of an event that goes from 0 to infinity or if Poisson is used as an approximation a relatively large number. The letter e indicates the famous irrational number, where the first few digits are: 2.718. The Taylor series can be used to show that the numbers calculated in this fashion do in fact add up to one and therefore it is a probability distribution. It is surprising that the expected value of a Poisson distribution is the mean of the distribution, which provides us with an easy way to predict what will happen in the future based on a Poisson distribution.

Homogenous Poisson Regression Response In a Homogenous Poisson regression, the response variable is the log of count of combination of events, the independent variables are either main effects of each event or pair wise combination of events. When a pair of events have a statistically significant impact on the response variable, then these two events are occurring frequently together and holding all other variables constant are dependent on each other. Main Effects Pair Wise Interactions

Meaning of Parameters Holding all other variables constant, the pair are dependent on each other In Poisson regression when two events have a statistically significant impact on the count of co-occurrences, then these two events are occurring frequently together and holding all other variables constant are dependent on each other.

Association that can be excluded from the full model Parsimonious Model Association that can be excluded from the full model In Poisson regression, what really matters is the fit between the model and the data. This fit is reported typically as residual deviance or G squared. If we remove the association between two variables and the goodness of fit does not change significantly, then we have a more parsimonious model.

Data on Bundle Payments To demonstrate the use of log linear modeling we will explore the data provided here. These data indicate count of times a hospital faced above or below average bundled-payment adjustments. The hospital manager wants to know which of the various areas might have contributed to above average cost. Note that the data reported is counts. There are many observations but some cells have few counts. To analyze these data and estimate the homogenous association among any pair of the variables, we start with a model in which all homogenous variables are present.

All main & pair wise combinations Homogenous Model model = glm ( Y ~ (Orth + Rehab + Sev + SNF + Cost)^2, data=t, family=poisson) General Linear Model Count of Combinations All main & pair wise combinations Poisson Regression Here we see R code for a generalized linear model using Poisson distribution. The variable Y contains the count of combinations of the variables. This variable is regressed on 5 variables as well as pair wise combination of these 5 variables. This is referred to as the homogenous model as all possible pair wise associations are present.

Homogenous Model Rehab Severity SNF Bundle Cost Model G2 df All pairs Orthopedist Rehab Severity SNF Bundle Cost Model G2 df All pairs 15.34 16 Since all pair wise associations are included in the homogenous model, then the network depiction of the log linear model is one in which all variables are shown to be associated with each other. The homogenous model has a residual deviance, or G squared, of 15.34 with 16 degrees of freedom.

Homogenous Model but No Rehab & Cost Association df All pairs-RA 513.47 17 Orthopedist Rehab Severity SNF Bundle Cost If we try a model where all pair wise associations are included and the association between rehab and cost is not included then the G squared measure of goodness of fit is 513.47 with now 17 degrees of freedom. Comparing this model with the homogenous model, the exclusion of rehab and cost association has led to a significantly worse fit for an additional 1 degree of gain in degrees of freedom. Notice that this model fit implies that there is no link between rehab and cost as that association was not included in the model specification.

Homogenous Model without One Association df All pairs 15.34 16 All pairs - RS 15.78 17 If we continue in this fashion and try removing one association at a time from the homogenous fit, we find that the smallest loss in residual deviance is associated with removing the association between Rehab and Severity. In this plot the red line shows the performance of a model with all pairs of association and the blue shows the performance after removing one of the associations. The increase in G square is 0.44 for one additional gain in degrees of freedom, which seems worth it.

Homogenous Model But Not Rehab & Severity df All pairs - RS 15.78 17 Orthopedist Rehab Severity SNF Bundle Cost The network model now looks different. There is no connection between rehab and severity variables. But all other associations are still there. What this says is that Rehab and Severity are independent from each other conditional on the values of all other variables.

Homogenous Model But Not Rehab & Severity df All pairs 15.34 16 All pairs-RS-RN 16.74 18 All pairs-RS-RN-ON 22.02 19 All pairs-RS-RN-SA 19.09 We now look at models without the association of rehab and severity and compare it to the performance of the homogenous model, shown in red. The horizontal line shows the performance of the homogenous model. The blue bars show the performance of homogenous model minus key associations. We see that the smallest increase in residual deviance occurs when we remove the association between Rehab and the skilled nursing facility. All other removals seem to be too large for additional gains in degrees of freedom.

Final Model Rehab Severity Bundle Cost SNF Model G2 df All pairs-RS-RN 16.74 18 Orthopedist Rehab Severity SNF Bundle Cost This yields the final log linear Poisson regression for our data with the best possible fit. Note that in this network there are no arcs between rehab and either severity or skilled nursing facility. The interpretation of these missing links is that these pairs of variables are independent from each other, while holding all other variables constant.

Log Linear Poisson Regression Provides a Test of Association This lecture has shown how log linear Poisson regression provides a test of association among the variables.