Higher Order Contingency Tables and Logistic Regression
Predicting an Outcome A major goal of epidemiology is quantifying the relationship between sets of disease predictors and binary outcomes like diseased/disease-free. The first step in describing the relationship between your predictor(s) and your outcome is to do univariate analyses. That is, test for an association between each of your predictors and the outcome.
Predicting an Outcome After you assess the univariate relationships between your predictors and your outcome, you will want to look for effect modification (what everyone else calls interactions) and confounding. You may want to look for higher order interactions. You should look for all interactions that could plausibly be there based on subject matter knowledge. Do not test for interactions that you cannot explain (in your native language)! I can think about three-way interactions, but I cannot get my brain around four-way.
Predicting an Outcome Prior to doing any analyses, write out on a spreadsheet all of the effects that interest you. You then will test for those, and only those, effects.
Univariate Analysis: Is There an Association? The way you assess the relationship between your predictors and the outcome depends on your data. If you have a 2x2 table, you just look at the confidence interval for the odds ratio (OR). Otherwise:

Row Variable   Column Variable   Proc Freq Switch   Statistic
Nominal        Nominal           chisq              Pearson chi-square
Nominal        Ordinal           cmh2               Mean Score
Ordinal        Ordinal           cmh1               Mantel-Haenszel chi-square
Univariate Analysis (2) Strength of an Association If you are looking at a 2x2 table, you can assess the strength of the association with the odds ratio. Otherwise, you can get the appropriate measures from the /measures switch on the tables statement of proc freq:

Row Variable   Column Variable      Statistic
Nominal        Nominal or Ordinal   Uncertainty Coefficient C|R
Ordinal        Ordinal              Spearman Correlation
Sets of Univariate Statistics You can request all the univariate measures like this (list all your predictors inside the parentheses, and your outcome after the asterisk):

proc freq data = blah;
  tables (sex pih_total)*pre_term_l / chisq cmh measures;
run;
Sets of 2x2 Tables (1) Cochran Mantel-Haenszel You will need to do analyses where the relationship between the predictor and disease is (at least partially) influenced by a third factor. This third factor can be a stratification factor from your study design or a confounder that you did not block for (i.e., group on). Regardless of the source, you can use the Cochran Mantel-Haenszel method to neutralize the third variable.
Sets of 2x2 Tables (2) Confounding & Interaction Invoking the CMH technique is simple. You add the extra variable (potential confounder) to the left side of the tables statement and add /cmh to the end of the line: tables school * exposure * spots / cmh; This will cause SAS to print out a contingency table for each of the levels of the confounder and the common OR/RR. It does not print out the OR and RR for the subtables. To get them, add the measures switch with the CMH.
(2x)2x2 Ignoring Strata

data spots2;
  input school exposure $ health $ count;
  datalines;
1 Exposed diseased 38
1 Exposed healthy 4
1 NotExp diseased 10
1 NotExp healthy 21
2 Exposed diseased 20
2 Exposed healthy 57
2 NotExp diseased 10
2 NotExp healthy 17
;
run;

proc freq data = spots2;
  weight count;
  tables exposure*health / norow nocol chisq cmh;
run;
This is the crude odds ratio.
Sets of 2x2 Tables (3) Results: 2x2x2 Using Strata

proc freq data = spots2;
  weight count;
  tables school*exposure*health / norow nocol chisq cmh measures;
run;

All is not well! Do not use the summary table. The common odds ratio that the CMH output reports is the adjusted odds ratio.
Simpson’s Paradox It is possible to have significant (but opposed) effects in the levels of the covariate, and the overall CMH statistic will indicate NO effects. The moral is to always look at your partial tables.
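To see the paradox concretely, here is a made-up example (all counts below are invented for illustration): within each school the exposure effect is strong but runs in opposite directions, so the CMH summary statistic and the common odds ratio come out essentially null even though both partial tables are highly significant.

data simpson;
  input school exposure $ health $ count;
  datalines;
1 Exposed diseased 80
1 Exposed healthy 20
1 NotExp diseased 20
1 NotExp healthy 80
2 Exposed diseased 20
2 Exposed healthy 80
2 NotExp diseased 80
2 NotExp healthy 20
;
run;

proc freq data = simpson;
  weight count;
  tables school*exposure*health / norow nocol chisq cmh measures;
run;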
Exact Tests By default, SAS gives you approximate tests and p values for almost all statistics in proc freq. You can request exact measures. proc freq data = spots2; weight count; exact or; tables exposure*health/norow nocol chisq cmh; run;
Exact Tests (2) Exact tests take time and computer power but run them if you can.
Which CMH Summary? If you have a 2x2 table, then all of the CMH values will be the same. tables treat*response / chisq cmh; If you have a 2xN table, then use nonzero correlation or row mean scores differ. tables treat*response / cmh cmh2; If you have an Nx2 table, then use the nonzero correlation. tables treat*response / cmh cmh1;
Test for Trend Looking for a dose response in your predictor is important. If you would like to test for an increasing or decreasing trend in the binomial proportions across the levels of your ordinal variable, you can tell SAS to do a Cochran-Armitage test for trend. To do this, just include the keyword trend on the tables line: tables expLevel*hasCancer/cmh chisq measures trend;
Beyond Contingency Tables SAS provides you with powerful ways of analyzing contingency table data.
- Proc freq provides you with all the tools you need to analyze 2x2 tables.
- Proc freq becomes more and more awkward as your table sizes increase.
- Instead, you will use one of several (largely redundant) modeling procedures.
Predicting Outcomes In other disciplines, where outcomes are not dichotomous (e.g., alive or dead) or ordinal (e.g., high, medium, or low risk), predictions are regularly done using linear regression techniques:
Outcome = base level + some relationship of the predictors to the outcome.
Problems with Regression Ordinary (least squares) linear regression is not well suited to predicting a binary outcome, frequency counts, or percentages:
- values outside of the possible range
- non-integer values
- issues with the variance
Instead, epidemiologists typically use two other types of regression techniques: logistic or Poisson.
When to Use Logistic Regression You use LR when you want to predict a binary outcome, say diseased vs. not diseased, and you know that you have numeric covariates (confounding variables) that you want to account for. It is analogous to ANCOVA for continuous outcomes. You choose one outcome and call it the 'event.' Most people have a variable for each 'bad thing' in their data sets and code the event as a 1.
Age and Wisdom (1) Continuous Outcome Let’s say you have a complex measure of ‘wisdom’ and you want to predict it with age.
Age and Wisdom (2) Continuous Prediction Conceptually, you can see that a line predicts this data nicely. Percent wise = 1.63+age*.96
Age and Wisdom (3) Categorical Outcome If it is scored as a binary measure, no matter how well you place a line, your predictions are going to be way off.
Age and Wisdom (4) Categorical Prediction Ideally, you want some function that is close to a step function.
Logistic Fit With logistic regression you get the probability of going into the event group (which is the wise group in this case) expressed in terms of odds. Complete separation of groups is actually a problem… more on that later.
Odds and Probabilities I have a hard time thinking in terms of odds. Fortunately, it is easy to convert back and forth between probabilities and odds. prob = odds/(odds+1); odds = prob/(1-prob);
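A quick sanity check you can run in a throwaway data step (the numbers here are arbitrary): an odds of 3 corresponds to a probability of 0.75, and converting back recovers the odds of 3.

data check;
  odds = 3;
  prob = odds/(odds + 1);        * 3/4 = 0.75 ;
  odds_again = prob/(1 - prob);  * back to 3 ;
  put prob= odds_again=;
run;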
Why Odds Anyway? Odds are used to counteract the fact that linear regression produces probability values outside the range of 0 to 1. Working with odds takes care of the upper bound: an odds can be any positive number, yet the corresponding probability never exceeds 1. The lower bound is handled by taking the natural log of the odds, which can be any negative number while the probability never drops below 0.
Why Odds Anyway? (2) So whereas from ordinary linear regression you get: Probability = baseline + (predictor*weight value), e.g., wise = 1.63 + age*0.96. In logistic regression you calculate: LN(probability/(1 - probability)) = baseline + (predictor*weight value)
What Values Do You Want? With LS regression you get beta weights (parameter estimates) that tell you how much the outcome changes with each unit of the predictor: wise = 1.63 + age*0.96. With LR your parameter estimates are in log odds terms, which no one can understand, but if you exponentiate them (raise e to the estimates), the values make some sense: odds(wise) = e^(baseline + value*age) = e^baseline * (e^value)^age. In the LS fit, every unit of age increases your wisdom score by about 1. In the logistic fit, every unit of age multiplies your odds of being wise by e^value.
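To make the arithmetic concrete, here is a minimal sketch with invented logistic coefficients (the intercept of -4 and slope of 0.1 are made up; they are not the wisdom fit shown earlier): exponentiating the slope gives the odds ratio for one extra year of age, and the linear predictor can be pushed back to a probability.

data backout;
  b0 = -4;                       * invented intercept on the log-odds scale ;
  b1 = 0.1;                      * invented slope on the log-odds scale ;
  age = 50;
  logit = b0 + b1*age;           * the linear predictor ;
  or_per_year = exp(b1);         * odds multiply by this for each extra year ;
  odds = exp(logit);
  prob = odds/(odds + 1);        * probability of being in the event (wise) group ;
  put or_per_year= odds= prob=;
run;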
Enough! How Do I Do It? SAS provides you with five procedures that all do logistic regression:
- logistic – quick and friendly
- genmod – much more powerful
- probit – this is the only time I'll mention it…
- catmod – more than binary outcomes
- phreg – conditional logistic for matched case-control data
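For comparison, the same binary model in proc genmod might look like this (a sketch only; the dataset and variable names are placeholders): genmod makes you spell out the distribution and link, which is part of why it is more powerful but less friendly.

proc genmod data = blah descending;
  model outcome = predict1 predict2 / dist = binomial link = logit;
run;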
Fitting a Model Fitting a logistic model is easy with the logistic procedure. But there is one trick. For some (stupid) reason SAS wants to predict group membership into the lowest category (i.e., it wants events to be 0 and non-events to be 1). Typically people use the descending (abbreviated desc) option to make SAS call the events "1" and non-events "0."

proc logistic data = blah descending;
  model outcome = predict1 predict2;
run;
A Real Example The goal here is to predict who would get severe eclampsia using two of the mothers' blood chemistries. The primary hypothesis for the study says that these two factors are related to eclampsia. Later I will show you how to choose a good set of predictors from a large set. Notice the abbreviation of descending (desc).

proc logistic data = ana_temp desc;
  model severe_pre = dsl_igf dsl_insuli;
run;
A Real Example (2) Logistic regression uses a mathematical technique called maximum likelihood estimation, which is not guaranteed to produce a result. Rather, it tries to converge on a valid solution through successive approximations. If it fails to converge on an answer, you have a problem that statisticians like to call infinite parameters.
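Here is a tiny made-up example of how that happens: every subject with x above 5 is sick and every subject below is healthy, so the slope that maximizes the likelihood is infinite and proc logistic will typically warn about complete separation instead of converging.

data separated;            * invented data: x perfectly separates the two groups ;
  input x sick;
  datalines;
1 0
2 0
3 0
4 0
6 1
7 1
8 1
9 1
;
run;

proc logistic data = separated desc;
  model sick = x;          * expect a complete separation warning rather than a usable fit ;
run;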
A Real Example (3) For now, only pay attention to these two sections. Verify that your cases are listed first by looking at the frequency. Check the convergence.
A Real Example (4) This tests whether the model is any good at all. You want to reject the hypothesis of a worthless model. This tells you about the value of the predictors. The "point estimate" is e raised to the parameter estimate (e^estimate). It tells you the impact on the predicted odds of a one unit increase in the predictor. Notice that neither is a statistically significant predictor.
Beta = 0 Statistics These statistics test whether all of your predictors together are worthless (all betas equal zero). They are all asymptotically equivalent. If they are wildly different, like in this example, you probably have power problems. The Likelihood Ratio statistic (AKA: -2 Log L) is preferable for smaller samples. They usually do not differ.
Proc Logistic Improved Students don't like specifying descending because it is confusing. In modern versions of proc logistic you can specify the event explicitly:

proc logistic data = ana_temp;
  model severe_pre (event = "Sick") = dsl_igf dsl_insuli / plcl plrl;
  units dsl_igf = 10;
run;

Modern proc logistic also accepts a strata statement for stratified (conditional) analyses, e.g.:

model cancer = pack;
strata center;
Enterprise Guide
Categorical Predictors You interpret the exponentiated parameter estimates as the change in odds of an event associated with a one unit increase in the predictor. What happens when you have a categorical predictor? You want to have a model that tells you the change in your odds of an event when you are in a group relative to a referent group.
Categorical Predictors You can get SAS to give you the odds of an event in a category relative to a referent group. Say you have packs of cigarettes smoked per day as a variable called "packs" with the values: none, half, full, many.

proc logistic data = lung;
  class packs (ref="none") / param = ref;
  model cancer (event = "Sick") = packs;
run;
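For intuition about what param = ref does, here is a hand-rolled equivalent (a sketch only; the indicator variable names are invented, and the comparisons assume packs is stored exactly as "half", "full", "many", "none"): it builds dummy variables with "none" as the baseline.

data lung2;
  set lung;
  half = (packs = "half");
  full = (packs = "full");
  many = (packs = "many");   * "none" is the referent: all three indicators are 0 ;
run;

proc logistic data = lung2;
  model cancer (event = "Sick") = half full many;
run;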