1 Contingency Tables & Logistic Regression
© Scott Evans, Ph.D. and Lynne Peeples, M.S.


3 Two Weeks Ago… Counts and Proportions
Binary = dichotomous: mutually exclusive endpoints such as disease vs. no disease, success vs. failure, hit vs. no hit, heads vs. tails. We covered one- and two-sample tests of proportions for ONE variable, using both exact and normal-approximation methods.

4 Tonight… Relationships between Proportions
Is there an association between categorical variables? Chi-square and Fisher's exact tests. What are the magnitude and direction of this association? Odds ratios. Are there any intervening variables? Confounders and effect modifiers.

5 Contingency Tables
We are often interested in determining whether there is an association between two categorical variables. Note that association does not necessarily imply causality. In these cases, data may be represented in a two-dimensional table.

6 Contingency Tables

                     Smoking
                Smoker   Non-smoker
  Lung cancer
     Yes           a          c
     No            b          d

7 Contingency Tables
The categorical variables can have more than two levels. The variables may also be ordinal; however, this requires more advanced methods. For now, we consider the case in which both variables are nominal.

8 Contingency Table Example: Bike Helmets
Consider the following data on serious head injuries among cyclists, by helmet use:

                         Head injury
                         Yes      No     Total
    Wearing helmet        17     130       147
    No helmet            218     428       646

If we want to test whether the proportion of unprotected cyclists who have serious head injuries is higher than that of protected cyclists, we can carry out a test of hypothesis involving the two proportions p1 = 17/147 = 0.116 and p2 = 218/646 = 0.337.
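
Not part of the original slides: a minimal Python sketch (assuming NumPy and SciPy are available) that reproduces this pooled two-sample z-test of proportions from the counts above.

    # Two-sample (pooled) z-test of proportions for the bike-helmet data -- illustrative sketch.
    import numpy as np
    from scipy.stats import norm

    x1, n1 = 17, 147     # head injuries among helmeted cyclists
    x2, n2 = 218, 646    # head injuries among unhelmeted cyclists

    p1, p2 = x1 / n1, x2 / n2                   # 0.116 and 0.337
    p_pool = (x1 + x2) / (n1 + n2)              # pooled proportion under H0 (about 0.296)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se                          # about 5.3
    p_value = 2 * norm.sf(abs(z))               # two-sided p-value, far below 0.001
    print(f"z = {z:.2f}, p = {p_value:.1e}")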

9 Contingency Table Example: Bike Helmets
The two-sample test of proportions gives P < 0.001, and thus we reject the null hypothesis at the 0.05 significance level.

10 Chi-Square Test
An alternative technique to the test of two independent proportions. Hypothesis test: H0: no association; HA: association. Strategy: compare what is observed to what is expected if H0 is true (i.e., no association). If the difference is large, then there is evidence of association; if the difference is not large, then there is insufficient evidence to conclude an association.

11 Chi-Square Test
Some limitations: the test does not describe the magnitude or the direction of the association, and it relies on "large-sample theory" (an assumption), which means that the test may be invalid if expected cell counts are too small (<5). Thus, avoid its use under those conditions.

12 Contingency Table Example: Bike Helmets
Suppose that you wanted to determine whether there is any association between wearing helmets and the frequency of head injuries. Then we could perform the chi-square test (based on the χ² distribution). This test is set up as follows: H0: suffering a head injury is not associated with wearing a helmet; HA: there is an association between wearing a helmet and suffering a head injury.

13 Contingency Table Example: Bike Helmets
Now consider the implication of the null hypothesis: if the distinction between the two groups (helmet wearers and non-helmet wearers) is an artificial one, then the head-injury rate is better estimated by the overall proportion p̂ = (17 + 218)/(147 + 646) = 235/793 ≈ 0.2963.

14 Contingency Table Example: Bike Helmets
The expected number of injured protected cyclists is (0.2963)(147) = 43.6 injuries on average (versus the observed 17). Similarly, the expected number of injured unprotected cyclists is (0.2963)(646) = 191.4 (versus the observed 218). The expected number of uninjured helmeted cyclists is (1 − 0.2963)(147) = 103.4 (versus the observed 130), and the expected number of uninjured unprotected cyclists is (1 − 0.2963)(646) = 454.6 (versus the observed 428).
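
Not from the slides: these expected counts follow directly from the row and column totals (row total × column total / overall total), as in this short NumPy sketch.

    # Expected cell counts under independence for the bike-helmet 2x2 table -- sketch.
    import numpy as np

    observed = np.array([[17, 130],     # helmeted:   injured, not injured
                         [218, 428]])   # unhelmeted: injured, not injured

    row_totals = observed.sum(axis=1, keepdims=True)   # 147 and 646
    col_totals = observed.sum(axis=0, keepdims=True)   # 235 and 558
    n = observed.sum()                                 # 793

    expected = row_totals * col_totals / n             # [[43.6, 103.4], [191.4, 454.6]]
    print(expected.round(1))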

15 Chi-Square Test
The chi-square test is based on quantifying whether deviations from these expected numbers are serious enough to warrant rejection of the null hypothesis. In general, the chi-square statistic is χ² = Σ (Oi − Ei)² / Ei, where Ei is the expected count, Oi is the observed count, and the sum is over all r × c cells (r rows and c columns). Under the null hypothesis, this statistic is distributed according to the chi-square distribution with df = (r − 1)(c − 1) degrees of freedom. Critical percentiles of the chi-square distribution can be found in the appendix of your textbook (Table A.8).

16 Chi-Square Test
Remember: all expected cell counts must be ≥ 5.

  OBSERVED                     EXPECTED
            Exposed   Not                Exposed   Not
    Event     O11     O12        Event     E11     E12
    No        O21     O22        No        E21     E22

17 Contingency Table Example: Bike Helmets
Returning to our example, the chi-square statistic (with continuity correction) is χ² = Σ (|Oi − Ei| − 0.5)² / Ei ≈ 27.27.

18 Contingency Table Example: Bike Helmets
We compare this value to 3.84, the cutoff for the right 5% tail of the chi-square distribution with (2 − 1)(2 − 1) = 1 degree of freedom. Since 27.27 > 3.84, the null hypothesis is rejected. (Note: this calculation uses a continuity correction factor.)
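
For reference (not in the original slides), SciPy reproduces both versions of the statistic; correction=True applies the Yates continuity correction used in the hand calculation.

    # Pearson chi-square test for the bike-helmet table, with and without the
    # continuity correction -- illustrative sketch.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[17, 130],     # helmeted:   injured, not injured
                      [218, 428]])   # unhelmeted: injured, not injured

    chi2_cc, p_cc, df, expected = chi2_contingency(table, correction=True)
    chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

    print(f"with correction:    chi2 = {chi2_cc:.2f}, p = {p_cc:.1e}")    # about 27.2
    print(f"without correction: chi2 = {chi2_raw:.2f}, p = {p_raw:.1e}")  # about 28.3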

19 Contingency Table Example: Bike Helmets
[Figure: chi-square distribution with 1 degree of freedom; the 5% right tail lies to the right of 3.84.] When df = 1, χ² = Z². We rejected the null hypothesis because values as extreme as 27.27 or higher would have much less than a 5% probability of being observed if the null hypothesis were correct.

20 Contingency Table Example: Bike Helmets
STATA output: the p-value of the test is < 0.05, so we reject the null hypothesis and conclude that there is an association between wearing a helmet and head injury. STATA calculates the χ² a bit differently (approximately 28.26, the uncorrected Pearson statistic, instead of the 27.27 from our hand calculation). The conclusion is the same as that of the two-sample test of proportions.

21 Exact Tests
Exact tests do not rely on the assumption of large samples (i.e., they are fine with expected cell counts < 5); always use them when expected cell counts are small. Hypothesis test: H0: no association; HA: association. The test computes the "exact" probability of observing data like those in the given study if no association were present. It does not describe the magnitude or the direction of the association. It is often called "Fisher's exact test" for 2×2 tables; for more general dimensions, it is simply called an "exact test".

22 Exact Tests
Exact tests are computationally intensive (particularly for large datasets). For this reason, they have historically been used as a back-up for the chi-square test when samples were small. However, given the power of today's computers, an exact test is a recommended primary analysis strategy (instead of a chi-square test) whenever possible.

23 Exact Test Example
Where are different cars advertised?

24 Exact Test Example
p < 0.05 → there is a significant difference in where cars are advertised.

25 Odds Ratio (OR)
The chi-square (and exact) tests answer only the question of association; they do not comment on the magnitude or direction of the association. The OR is a measure of association that indicates both magnitude and direction. It is commonly used in epidemiology and ranges from 0 to ∞. It approximates how much more likely (or unlikely) it is for the outcome to be present among those with the "exposure" than among those without the exposure.

26 Odds Ratio (OR)
The odds of having the disease if exposed are P(disease|exposed) / [1 − P(disease|exposed)]. The odds of having the disease if unexposed are P(disease|unexposed) / [1 − P(disease|unexposed)]. The odds ratio (OR) is defined as the ratio of these two odds:
OR = {P(disease|exposed) / [1 − P(disease|exposed)]} / {P(disease|unexposed) / [1 − P(disease|unexposed)]}

27 Odds Ratio (OR)
Consider a 2×2 table of counts a, b, c, d laid out as in the earlier smoking/lung-cancer table (a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease). An estimate of the odds ratio is ÔR = ad / bc.
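
A quick numeric illustration (not from the slides), applying ad/bc to the bike-helmet table, with helmet use treated as the "exposure":

    # Sample odds ratio ad/bc for the bike-helmet table -- sketch.
    a, b = 17, 130    # helmeted:   injured (a), not injured (b)
    c, d = 218, 428   # unhelmeted: injured (c), not injured (d)

    odds_ratio = (a * d) / (b * c)   # about 0.26: helmet use is associated with lower odds of injury
    print(round(odds_ratio, 3))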

28 Odds Ratio (OR)
The OR is useful regardless of how the data were collected. OR ≈ RR when the disease is rare. RR (relative risk, or risk ratio) is the ratio of the risk of developing the disease if exposed relative to the risk of developing the disease if unexposed.

29 OR: Interpretation Example #1
Let y denote the presence (1) or absence (0) of lung cancer and x denote whether the person is a smoker (1 = smoker, 0 = non-smoker). An estimated odds ratio of 2 implies that lung cancer is roughly twice as likely to occur among smokers as among nonsmokers in the study population.

30 OR: Interpretation Example #2
Let y denote the presence (1) or absence (0) of heart disease and x denote whether the person engages in regular strenuous physical exercise (1 = exercise, 0 = no exercise). An estimated odds ratio of 0.5 implies that heart disease is roughly half as likely to occur among those who exercise as among those who do not.

31 Odds Ratio (OR)
If the odds of having the disease in the exposed and unexposed groups are equal, then the odds ratio should be close to 1. A test of this is constructed as follows: H0: there is no association between exposure and disease; HA: there is an association between exposure and disease. If the null hypothesis is true, the odds ratio should be close to 1, so the test answers the question: "How far from 1 is too far to warrant rejection of the null hypothesis?"

32 Odds Ratio (OR)
The OR is not distributed normally! Its sampling distribution is skewed to the right: if the denominator is larger, the OR lies in [0, 1); if the numerator is larger, the OR lies in (1, ∞). [Figure: right-skewed sampling distribution of the estimated OR.]

33 ln(Odds Ratio)
Fortunately, the natural logarithm (ln) of the OR is distributed approximately normally. In fact, the statistic [ln(ÔR) − ln(OR)] / SE(ln ÔR), where SE(ln ÔR) = √(1/a + 1/b + 1/c + 1/d), is approximately distributed according to the standard normal distribution. The log odds ratio ranges from −∞ to ∞, which allows us to derive tests and confidence intervals as usual; we then convert back to the original scale using the exponential function.

34 Hypothesis Testing w/ OR

35 OR Confidence Intervals
The 100(1 − α)% confidence interval for the log odds ratio is given by ln(ÔR) ± z(1 − α/2) · SE(ln ÔR). Thus, the 100(1 − α)% confidence interval for the true odds ratio is obtained by exponentiating both limits: exp[ln(ÔR) ± z(1 − α/2) · SE(ln ÔR)]. Note: this confidence interval can also be used to perform a hypothesis test by inspecting whether it covers 1 (the hypothesized OR value under the null hypothesis).
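
A minimal sketch of this interval (sometimes called Woolf's method), written as a small helper for a generic 2×2 table with cells a, b, c, d; it is not part of the original slides.

    # Point estimate and CI for an odds ratio via the normal approximation for ln(OR) -- sketch.
    import numpy as np
    from scipy.stats import norm

    def odds_ratio_ci(a, b, c, d, alpha=0.05):
        """OR = ad/bc with a (1 - alpha) confidence interval built on the log-odds scale."""
        or_hat = (a * d) / (b * c)
        se_log = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)      # SE of ln(OR)
        z = norm.ppf(1 - alpha / 2)                           # 1.96 for a 95% interval
        lo, hi = np.exp(np.log(or_hat) + np.array([-z, z]) * se_log)
        return or_hat, lo, hi

    # Bike-helmet table again: helmeted (17 injured / 130 not), unhelmeted (218 / 428)
    print(odds_ratio_ci(17, 130, 218, 428))   # OR about 0.26, 95% CI roughly (0.15, 0.44)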

36 OR Example: Electronic Fetal Monitoring
Consider data on the use of EFM (electronic fetal monitoring) and the frequency of Caesarean birth deliveries. The data are as follows:

                          Caesarean delivery
                          Yes       No      Total
    EFM exposure: Yes      358     2492      2850
    EFM exposure: No       229     2745      2974

37 OR Example: Electronic Fetal Monitoring
Our test of the null hypothesis of no association between EFM and Caesarean birth is based on the statistic z = ln(ÔR) / SE(ln ÔR) = ln(1.72)/0.089 ≈ 6.11. Since Z = 6.107 > Z0.025 = 1.96, we reject the null hypothesis. These data are consistent with a strong (positive) association between EFM and Caesarean birth.
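
Not part of the slides: the statistic and interval can be reproduced directly from the 2×2 counts shown on the previous slide.

    # z-test for ln(OR) = 0 using the EFM / Caesarean counts -- sketch.
    import numpy as np
    from scipy.stats import norm

    a, b = 358, 2492     # EFM monitored:     Caesarean, vaginal delivery
    c, d = 229, 2745     # not EFM monitored: Caesarean, vaginal delivery

    log_or = np.log((a * d) / (b * c))                  # ln(1.72), about 0.54
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)         # about 0.089
    z = log_or / se                                     # about 6.11
    ci = np.exp(log_or + np.array([-1.96, 1.96]) * se)  # about (1.45, 2.05)
    print(f"z = {z:.2f}, p = {2 * norm.sf(abs(z)):.1e}, 95% CI = {ci.round(2)}")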

38 OR Example: Electronic Fetal Monitoring
The 95% confidence interval is exp[ln(1.72) ± 1.96(0.089)] ≈ (1.45, 2.05), which is consistent with the result of the test of hypothesis above (since 1 is not included in this interval). The estimated odds of Caesarean delivery among women who were monitored via EFM are from 44% higher to roughly double the odds among women who were not monitored.

39 OR Example: Electronic Fetal Monitoring
In STATA: the odds ratio is 1.72 with a 95% confidence interval of (1.447, 2.050). Thus, the null hypothesis of no association is rejected, as both limits of the confidence interval are above 1.0.

40 OR Example: Coronary Heart Disease
A study of age and coronary heart disease (CHD): OR = 8.1, 95% CI = (2.9, 22.9). The study suggests that the odds of CHD among those aged 55 or over are an estimated 8.1 times the odds among those under 55, with the confidence interval indicating anywhere from 2.9 to 22.9 times the odds.

41 Sets of Contingency Tables: Intervening Variable
We may have a scenario in which we have a contingency table for each level (stratum) of a third (potentially confounding) factor. Example: we develop a contingency table to examine the association between coffee consumption and myocardial infarction (MI). We gather these data separately for smokers and non-smokers, as smoking status may confound our results.

42 Interaction and Confounding
Interaction (effect modification): there is an interaction between x and y when the effect of y on z depends upon the level of x. Example: if the effect of smoking on the risk of developing lung cancer differs between males and females, then there is an interaction between smoking and gender.

43 Interaction and Confounding
Confounding occurs when the effect of variable x on z is distorted because we fail to control for variable y. We say that y is a confounder of the effect of x on z. This is different from interaction.

44 Interaction and Confounding
Note: it can happen that, when groups are combined, the overall OR is substantially different from the individual ORs across groups, even if those ORs are deemed "homogeneous" (Simpson's paradox). Examples: baseball batting averages, the Electoral College, medical studies.

45 Sets of Contingency Tables: Coffee & Smoking Example
[Tables: coffee consumption by MI status, shown separately for smokers and for non-smokers.]

46 Sets of Contingency Tables: Coffee & Smoking Example
The question that naturally arises is whether we should combine the information in those two tables and use all the available data in order to ascertain the effect of coffee on the risk of myocardial infarction (MI). However, if the association between coffee and MI differed between smokers and non-smokers (effect modification), then such a combined analysis would be inappropriate.

47 Homogeneous ORs?
We must first determine whether the OR is homogeneous across strata. This can be done with a hypothesis test: H0: the OR is homogeneous across strata; HA: the OR is heterogeneous across strata. This is equivalent to testing for a statistical "interaction".

48 Homogeneous ORs?
If the ORs are heterogeneous across strata, then there is an interaction between the third variable and the association, and we need to perform separate analyses by subgroup.

49 Homogeneous ORs?
If the ORs are not heterogeneous across strata, then we may use the Mantel-Haenszel (MH) odds ratio and test. The MH OR is a measure of association that controls for the potentially confounding effect of a third variable; it is a weighted average of the individual ORs (i.e., an adjusted OR). We can obtain a CI for the MH OR, as well as perform hypothesis tests.

50 Mantel-Haenszel (MH) Methods
We utilize Mantel-Haenszel methods. Generalizing, we have g tables (i = 1, ..., g), each a 2×2 table with cells ai, bi, ci, di and total ni, constructed as in the previous example (where g = 2).

51 Mantel-Haenszel (MH) Methods
We employ the following strategy: (1) Analyze the tables separately, based on the individual estimates of the odds ratios. (2) Test the hypothesis that the odds ratios in the subgroups are sufficiently close to each other (i.e., that they are homogeneous). (3a) If the assumption of homogeneity is not rejected, perform an overall (combined) "stratified" analysis. (3b) If the homogeneity assumption is rejected, perform separate "subgroup" analyses (the association of the two factors is different in each subgroup).

52 MH Test of Homogeneity
The test of homogeneity is set up as follows: 1. H0: OR1 = OR2 (the two odds ratios do not have to be 1, just equal). 2. HA: OR1 ≠ OR2 (only two-sided alternatives are possible with the chi-square test). 3. The test statistic X²H = Σ wi (yi − Ȳ)² has an approximate chi-square distribution with g − 1 degrees of freedom, where yi = ln(ÔRi), the weights are wi = (1/ai + 1/bi + 1/ci + 1/di)⁻¹, and Ȳ = Σ wi yi / Σ wi.

53 MH Test of Homogeneity
We combine the individual log odds ratios into a weighted average, weighting each one inversely proportional to the square of its standard error (one over its variance) so that odds ratios with high variability are down-weighted. High variability means low information.

54 MH Test of Homogeneity
4. Rejection rule: reject the null hypothesis (and conclude that the subgroups are not homogeneous) if X²H exceeds the upper critical value of the chi-square distribution with g − 1 degrees of freedom. [Figure: chi-square distribution; for g = 2 strata and α = 0.05, the critical value is 3.84.]

55 MH Test of Homogeneity: Coffee & Smoking Example
Back to our coffee and smoking example:

56 MH Test of Homogeneity: Coffee & Smoking Example
By the rejection rule, X²H = 0.896 is not larger than any usual critical value (see the table in the Appendix). Thus, we do not reject the null hypothesis: there is no evidence of heterogeneity, and it is appropriate to proceed with a combined, stratified analysis.

57 Combined OR: Coffee & Smoking Example
The summary (Mantel-Haenszel) odds ratio is a weighted average of the odds ratios for the g separate strata: ÔR(MH) = Σ (ai di / ni) / Σ (bi ci / ni). So, after adjusting for smoking status, those who drink coffee have an estimated 2.18 times greater odds of experiencing nonfatal myocardial infarction than those who do not drink coffee.

58 Combined OR: Confidence Interval
The confidence interval for the overall odds ratio is constructed similarly to the single-table case; the only differences are the estimate of the overall (log) odds ratio and its associated standard error. In general, a 100(1 − α)% confidence interval based on the standard normal distribution is Y ± z(1 − α/2) / √(Σ wi), where Y = Σ wi yi / Σ wi is the weighted average of the stratum log odds ratios, yi = ln(ÔRi), and the wi are defined as before. Since Y estimates ln(OR), the 100(1 − α)% confidence interval for the common odds ratio is obtained by exponentiating both limits.

59 Combined OR: Coffee & Smoking Example CI
In the previous example, the 95% confidence interval computed this way does not contain 1. Thus, at the 0.05 significance level, coffee drinkers have from 73% higher odds of developing MI to almost triple the odds, compared to non-coffee drinkers, and we reject the null hypothesis of no (overall) association.

60 Mantel-Haenszel (MH) Test
Finally, we test whether this summary odds ratio is equal to 1. The Mantel-Haenszel test is based on the chi-square distribution and the simple idea that, if there is no association between "exposure" and "disease", then the number of exposed individuals ai contracting the disease in stratum i should not be too different from its expected value mi = (ai + bi)(ai + ci) / ni.

61 Mantel-Haenszel (MH) Test

62 Mantel-Haenszel (MH) Test
To see this, one must recall that under independence, P(A and B) = P(A) · P(B). If A = "subject has the disease" and B = "subject is exposed", then the expected count in the exposed-and-diseased cell is ni · P(A) · P(B), which is estimated by (ai + ci)(ai + bi) / ni.

63 Mantel-Haenszel (MH) Test
Thus, under the assumption of independence (no association), E(ai) = mi = (ai + bi)(ai + ci) / ni. A less obvious estimate of the variance of ai is Var(ai) = (ai + bi)(ci + di)(ai + ci)(bi + di) / [ni²(ni − 1)].

64 Mantel-Haenszel (MH) Test
The Mantel-Haenszel test statistic is constructed as X²MH = [Σ ai − Σ mi]² / Σ Var(ai), which is compared to the chi-square distribution with 1 degree of freedom. (Some texts subtract 0.5 from |Σ ai − Σ mi| as a continuity correction before squaring.)
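
A minimal sketch (not from the slides) using the per-stratum summaries reported on the next slide; whether a 0.5 continuity correction is applied before squaring varies by textbook.

    # Mantel-Haenszel chi-square test from per-stratum summaries -- sketch.
    import numpy as np
    from scipy.stats import chi2

    a = np.array([1011.0, 383.0])   # observed exposed cases (coffee drinkers with MI), by stratum
    m = np.array([981.3, 358.4])    # expected counts under no association
    v = np.array([29.81, 37.69])    # variances under no association

    x2_mh = (a.sum() - m.sum()) ** 2 / v.sum()   # about 43.7 (no continuity correction)
    print(f"X2_MH = {x2_mh:.2f}, p = {chi2.sf(x2_mh, df=1):.1e}")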

65 MH Methods: Coffee & Smoking Example
In the above example, a1 = 1,011, m1 = 981.3, σ²1 = 29.81, a2 = 383, m2 = 358.4, σ²2 = 37.69. Thus, X²MH = [(1,011 + 383) − (981.3 + 358.4)]² / (29.81 + 37.69) ≈ 43.7. Since this is much larger than 3.84, the 5% cutoff of the chi-square distribution with 1 degree of freedom, we reject the null hypothesis. It seems that coffee consumption has a significant effect on the risk of MI after accounting for smoking status.

66 MH Methods: Coffee & Smoking Example
STATA output: test of homogeneity: p = 0.334 (note: STATA's chi-square is 0.933, slightly higher than our hand calculation of 0.896). M-H OR test: p < 0.001.

67 MH Methods Summary: Coffee & Smoking Example
1. We analyzed the two tables separately and obtained the individual odds ratio estimates (2.46 among smokers). 2. Based on the individual estimates of the odds ratios, we tested the hypothesis that the odds ratios in the two subgroups are sufficiently close to each other (i.e., that they are homogeneous). The test of homogeneity ("test for heterogeneity" in STATA) has a p-value > 0.05, so we do not reject the hypothesis of homogeneity in the two groups.

68 MH Methods Summary: Coffee & Smoking Example
3a. Since the assumption of homogeneity was not rejected (p = 0.334), we performed an overall (combined) analysis. From this analysis, the hypothesis of no association between coffee consumption and myocardial infarction is rejected (M-H p-value < 0.001). By inspection of the combined Mantel-Haenszel estimate of the odds ratio (2.18), we see that the odds of MI for coffee drinkers (adjusting for smoking status) are over twice those of non-coffee drinkers.

69 One More OR Example: Low Birth Weight
Low birth weight by smoking status, stratified by race:
WHITE: OR = 19(40) / 4(33) = 760/132 = 5.76
BLACK: OR = 6(11) / 5(4) = 66/20 = 3.30
OTHER: OR = 5(35) / 20(7) = 175/140 = 1.25

70 One More OR Example: Low Birth Weight
We now have three groups, so we perform the test of homogeneity using a chi-square distribution with g − 1 = 2 degrees of freedom: X²H ≈ 3.02, p = 0.221. Despite apparent differences in the odds ratios between strata, they are within sampling variability of one another, so we can perform a combined analysis. The M-H odds ratio estimate is 3.09.
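
Not part of the original slides: these numbers can be reproduced from the three 2×2 tables on the previous slide (cells arranged as a, b, c, d with OR = ad/bc, matching the slide's arithmetic).

    # Homogeneity test and Mantel-Haenszel pooled odds ratio for the low-birth-weight
    # data, stratified by race -- sketch.
    import numpy as np
    from scipy.stats import chi2

    strata = np.array([[19, 33, 4, 40],    # White
                       [6, 4, 5, 11],      # Black
                       [5, 7, 20, 35]],    # Other
                      dtype=float)
    a, b, c, d = strata.T
    n = strata.sum(axis=1)

    # Woolf test of homogeneity of the three odds ratios (5.76, 3.30, 1.25)
    y = np.log(a * d / (b * c))
    w = 1.0 / (1 / a + 1 / b + 1 / c + 1 / d)
    x2_h = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)
    print(f"X2_H = {x2_h:.2f}, p = {chi2.sf(x2_h, df=len(n) - 1):.3f}")   # about 3.02, p about 0.22

    # Mantel-Haenszel pooled odds ratio
    or_mh = np.sum(a * d / n) / np.sum(b * c / n)
    print(f"MH odds ratio = {or_mh:.2f}")   # about 3.09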

71 Logistic Regression
Logistic regression extends MH methods to include multiple variables, including continuous confounders and exposures, and allows us to predict dichotomous outcomes. Why can't we simply use linear regression?

72 Logistic Regression
[Figure: scatterplot of a binary outcome against x; all outcomes are either y = 1 or y = 0.]

73 Logistic Regression
[Figure: estimated values of P(Y = 1) plotted against x with a fitted straight line.] A linear model is not appropriate: predicted probabilities must stay between 0 and 1.

74 Logistic Regression
[Figure: Y = ln(odds) plotted against x.] On the log-odds scale, the relationship is transformed to a linear model.

75 Logistic Regression
Assumptions: the responses are Bernoulli, and the parameters are linear on the logit scale: ln[p / (1 − p)] = β0 + β1x, where p = P(Y = 1).

76 Logistic Regression
We can solve for p, the probability that the response variable Y takes on the value 1: p = e^(β0 + β1x) / (1 + e^(β0 + β1x)).
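
Not from the slides: a tiny sketch of this logit / inverse-logit relationship using SciPy's logit and expit functions; the coefficients below are made-up numbers for illustration.

    # The logit transform and its inverse: expit maps the linear predictor back to a
    # probability between 0 and 1 -- sketch.
    import numpy as np
    from scipy.special import expit, logit

    p = 0.25
    eta = logit(p)             # ln(odds) = ln(0.25 / 0.75), about -1.10
    print(expit(eta))          # back to 0.25

    b0, b1 = -2.0, 1.5         # hypothetical coefficients
    x = np.array([0.0, 1.0])
    print(expit(b0 + b1 * x))  # about 0.12 when x = 0 and 0.38 when x = 1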

77 Logistic Regression
Apply simple linear regression techniques, but interpret the coefficients differently: β0 = log(odds when x = 0), so e^β0 is the odds when x = 0; β1 = log(odds ratio) = log(odds in group 1) − log(odds in group 0), so e^β1 is the odds ratio comparing group 1 to group 0.

78 Logistic Regression: Hypertension Example
A study of the relationship between blood pressure and blood lead levels. hypert = 1 for hypertensive and 0 otherwise; sex = 1 for males and 0 for females; lead = 1 for high blood lead levels and 0 for low blood lead levels.

79 Logistic Regression: Hypertension Example
Test of high vs. low blood lead levels: .logistic hypert lead
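
The slides show the Stata command; an equivalent fit in Python with statsmodels might look like the sketch below. The data frame here is simulated stand-in data (the course dataset is not reproduced in this transcript), so the column names hypert and lead are simply carried over from the slide.

    # Simple logistic regression of hypertension on high/low blood lead -- sketch on simulated data.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"lead": rng.integers(0, 2, size=200)})
    df["hypert"] = rng.binomial(1, np.where(df["lead"] == 1, 0.35, 0.20))  # made-up risk levels

    fit = smf.logit("hypert ~ lead", data=df).fit(disp=0)
    print(np.exp(fit.params["lead"]))             # exp(beta1) = estimated odds ratio for lead
    print(np.exp(fit.conf_int().loc["lead"]))     # 95% CI on the odds-ratio scale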

80 Multiple Logistic Regression
Extends simple logistic regression to include additional variables, both categorical and continuous predictors. It parallels the methods for multiple linear regression: we can estimate the effect of each variable while controlling for the effects of other (potentially confounding) variables in the model, and we can use indicator variables and interaction terms.

81 Multiple Logistic Regression
Hypertension example, continued. Now include both lead and sex in the model: .logistic hypert sex lead
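
And the two-predictor model from this slide, again sketched on simulated stand-in data; exponentiating the coefficients gives the lead odds ratio adjusted for sex.

    # Multiple logistic regression: lead effect adjusted for sex -- sketch on simulated data.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"lead": rng.integers(0, 2, 300), "sex": rng.integers(0, 2, 300)})
    risk = 1 / (1 + np.exp(-(-1.5 + 0.6 * df["lead"] + 0.4 * df["sex"])))  # made-up true model
    df["hypert"] = rng.binomial(1, risk)

    fit = smf.logit("hypert ~ sex + lead", data=df).fit(disp=0)
    print(np.exp(fit.params))   # adjusted odds ratios (the Intercept term is the baseline odds)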


