1
BIOST 536 Thompson
Modeling the association between a binary outcome, Y, and an “exposure”, X
Slides are from Research Professor M. Thompson
2
We might want to model p_X = P(Y=1 | X)
What are the characteristics of p_X?
- 0 ≤ p_X ≤ 1
- p_X possibly monotone in X
3
Model: g(p_X) = β_0 + β_1 X
4
Logistic regression with a single binary risk factor
Table A
        X=1   X=0   Total
Y=1     a     b     n1
Y=0     c     d     n0
Total   m1    m0    N
5
Cohort or cross-sectional study
        X=1   X=0   Total
Y=1     a     b     n1
Y=0     c     d     n0
Total   m1    m0    N
b/m0 estimates P(Y=1 | X=0)
a/m1 estimates P(Y=1 | X=1)
ad/(bc) estimates the odds ratio: $\Psi = \dfrac{P(Y=1 \mid X=1)/P(Y=0 \mid X=1)}{P(Y=1 \mid X=0)/P(Y=0 \mid X=0)}$
6
Under the logistic model: logit(P(Y=1 | X)) = β_0 + β_1 X
ln(OR) = ln(Ψ) = logit(P(Y=1 | X=1)) - logit(P(Y=1 | X=0)) = β_0 + β_1 - β_0 = β_1
i.e. Ψ = exp(β_1)
And: logit(P(Y=1 | X=0)) = β_0
P(Y=1 | X=0) is estimated by $e^{\hat\beta_0} / (1 + e^{\hat\beta_0})$
7
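The logistic equations:
$P(Y=1 \mid X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
For binary X:
$P(Y=1 \mid X=1) = \dfrac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}, \qquad P(Y=1 \mid X=0) = \dfrac{e^{\beta_0}}{1 + e^{\beta_0}}$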
8
Case-control study
Let Z = 1 if individual was sampled
      = 0 otherwise
Define π_1 = P(Z=1 | Y=1); π_0 = P(Z=1 | Y=0)
Let p_Z(X) = P(Y=1 | X, Z=1)
9
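We can model: logit(p_Z(X))
If sampling depends on disease status Y only, Bayes' rule gives:
$\text{logit}(p_Z(X)) = \ln(\pi_1/\pi_0) + \text{logit}(P(Y=1 \mid X))$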
10
If we model logit(p_Z(X)) = α + β_1 X
Then ln(Ψ) = β_1 or Ψ = exp(β_1) as before.
But: $\alpha = \beta_0 + \ln(\pi_1/\pi_0)$, so α does not estimate β_0 unless the sampling fractions π_1 and π_0 are known.
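The intercept shift under case-control sampling can be checked numerically. The sketch below is in Python (an assumption; the course itself uses Stata), and the values of beta0, beta1, pi1 and pi0 are hypothetical numbers chosen only for illustration. It assumes sampling depends on Y alone.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical population model and sampling fractions (illustration only).
beta0, beta1 = -3.0, 1.2   # logit(P(Y=1|X)) = beta0 + beta1*X
pi1, pi0 = 0.9, 0.05       # P(Z=1|Y=1), P(Z=1|Y=0)

def p_sampled(x):
    """P(Y=1 | X=x, Z=1) by Bayes' rule, with sampling depending on Y only."""
    p = expit(beta0 + beta1 * x)
    return pi1 * p / (pi1 * p + pi0 * (1.0 - p))

alpha = logit(p_sampled(0))                        # intercept among the sampled
slope = logit(p_sampled(1)) - logit(p_sampled(0))  # log odds ratio among the sampled

print(alpha - beta0, math.log(pi1 / pi0))  # the shift should equal ln(pi1/pi0)
print(slope, beta1)                        # the log odds ratio should be unchanged
```

Under these assumptions only the intercept moves; the slope, and hence the odds ratio, is the same in the sampled data as in the population.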
11
Parameter estimation: Maximum Likelihood
We choose the estimate of the parameters that makes the data most likely to have occurred.
Let's take the simple setting of a cross-sectional study where we want to estimate the prevalence of a disease. Say we take a random sample of N individuals and w of them have the disease.
The common-sense estimate of the prevalence of disease is: $\hat p = w/N$
12
The likelihood
Let w = number diseased in N independent individuals and let the true disease prevalence in the population be p. Then the likelihood of observing w diseased individuals in N is given by:
$L(p) = \binom{N}{w} p^{w} (1-p)^{N-w}$
13
We want to choose that value of p which maximizes the likelihood or, equivalently, the log of the likelihood:
$l(p) = \ln\binom{N}{w} + w \ln p + (N-w)\ln(1-p)$
Taking the derivative of l with respect to p:
$\dfrac{dl}{dp} = \dfrac{w}{p} - \dfrac{N-w}{1-p}$
Setting the derivative equal to zero and solving for p:
$\hat p = w/N$
14
In a study involving 53 men with prostate cancer, 20 of the men had nodal involvement.
How to estimate the chance of nodal involvement?
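As a rough illustration of maximum likelihood for this example, the binomial log likelihood can be maximized by brute force and compared with the closed form w/N. This is a Python sketch (an assumption; the course uses Stata), and a grid search is used only for transparency, not because fitting software works this way.

```python
import math

N, w = 53, 20  # sample size and number with nodal involvement (from the slide)

def log_lik(p):
    # Binomial log likelihood without the constant ln C(N, w),
    # which does not depend on p.
    return w * math.log(p) + (N - w) * math.log(1.0 - p)

# Crude grid search over p in (0, 1) in steps of 0.0001.
grid = [i / 10000 for i in range(1, 10000)]
p_hat_grid = max(grid, key=log_lik)

print(p_hat_grid, w / N)  # grid maximizer vs. closed form w/N
```

The grid maximizer should agree with 20/53 (about 0.38) up to the grid spacing.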
15
Using MLE in the logistic regression setting with a single covariate, X:
Say we have N observations (Y_i, X_i), i=1,2,…,N, where Y denotes disease status (0 = non-diseased, 1 = diseased) and X is a risk factor of interest. Let p(X) denote P(Y=1 | X). Then:
$P(Y_i = y_i \mid X_i) = p(X_i)^{y_i} \, (1 - p(X_i))^{1-y_i}$
16
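$L = \prod_{i=1}^{N} p(X_i)^{Y_i} \, (1 - p(X_i))^{1-Y_i}$
$l = \ln(L) = \sum_{i=1}^{N} \left[ Y_i \ln p(X_i) + (1-Y_i) \ln(1 - p(X_i)) \right]$
Alternative (Binomial) formulation: If X takes on n different values, X_j, j=1,2,…,n, and, for each X_j, there are n_j subjects, where $\sum_j n_j = N$, of whom y_j are “diseased”, we can represent the log likelihood as
$l = \sum_{j=1}^{n} \left[ y_j \ln p(X_j) + (n_j - y_j) \ln(1 - p(X_j)) \right]$
(up to an additive constant that does not involve the parameters).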
17
If we model logit(p(X)) = β_0 + β_1 X then, for a single dichotomous risk factor, X, as in Table A, the maximum likelihood estimate of
β_0 is ln(b/d)
β_1 is ln(ad/bc)
and hence the maximum likelihood estimate of P(Y=1 | X=1) is a/m_1 and of P(Y=1 | X=0) is b/m_0.
18
Hypothesis testing and confidence intervals
Say we want to establish whether tumor size affects the chance of nodal involvement in men with prostate cancer.

      Nodal |        Tumor
involvement |     large      small |     Total
------------+----------------------+----------
        Yes |        15          5 |        20
            |       56%        19% |       38%
------------+----------------------+----------
         No |        12         21 |        33
            |       44%        81% |       62%
------------+----------------------+----------
      Total |        27         26 |        53
19
Consider logit(P(nodal involvement | tumor size = X)) = β_0 + β_1 X
The maximum likelihood estimate of β_1 is ln(15×21/(5×12)) = ln(5.25) = 1.66
Hence the OR is estimated by e^1.66 = 5.25 (= 15×21/(5×12))
How do we test the statistical significance of the OR? Calculate a confidence interval?
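The closed-form estimates for a single binary risk factor can be reproduced directly from the four cell counts. The sketch below is in Python (an assumption; the course uses Stata) and uses the Table A layout with X=1 for a large tumor and Y=1 for nodal involvement.

```python
import math

# Cell counts in the layout of Table A.
a, b = 15, 5    # Y=1 (nodal involvement): X=1 (large), X=0 (small)
c, d = 12, 21   # Y=0 (no involvement):    X=1 (large), X=0 (small)

beta0_hat = math.log(b / d)              # logit of P(Y=1 | X=0) = b/(b+d)
beta1_hat = math.log((a * d) / (b * c))  # log odds ratio
odds_ratio = math.exp(beta1_hat)

print(round(beta0_hat, 3), round(beta1_hat, 3), round(odds_ratio, 2))
```

These values should agree with the coefficients Stata reports for this model (about -1.435 and 1.658, with OR 5.25).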
20
H0: β_1 = 0, equivalently H0: OR = Ψ = 1
21
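The deviance compares observed to predicted values via the likelihood:
$D = -2 \ln\left[ \dfrac{L(\text{fitted model})}{L(\text{saturated model})} \right]$
where the saturated model has as many parameters as observations, so that it fits the data exactly.
To assess the role of X in the logistic model: logit(P(Y=1 | X)) = β_0 + β_1 X
we can consider G = D(model without X) - D(model with X) = $-2 \ln\left[ \dfrac{L(\text{model without } X)}{L(\text{model with } X)} \right]$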
22
Let Y = nodal involvement in prostate cancer, X = tumor size
We estimate: logit(P(Y=1 | X)) = -1.44 + 1.66 X, and OR = Ψ = 5.25; ln L = -31.276
Under the null model: logit(P(Y=1)) = constant, then ln L = -35.126
Under the hypothesis H0: β_1 = 0, G has a χ² distribution with 1 degree of freedom
Here G = -2*(-35.126 + 31.276) = 7.7
LR test: P(χ²_1 > 7.7) = .0055
Score test: P(χ²_1 > 7.44) = .0064
Wald test: P(χ²_1 > 6.92) = .0090
STATA gives the LR test for the fitted model versus the null model
STATA does not do the Score test easily
STATA gives the single parameter Wald test
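The two log likelihoods and the LR statistic can be recomputed by hand from the nodal involvement counts (20 of 53 overall; 15 of 27 with large tumors; 5 of 26 with small tumors). This is a Python sketch (an assumption; the course uses Stata). It relies on the fact that, with one binary covariate, the fitted probabilities equal the observed proportions, and on the identity P(χ²_1 > x) = erfc(√(x/2)).

```python
import math

def binom_loglik(y, n):
    # Maximized log likelihood contribution of a group with y events in n subjects,
    # evaluated at the observed proportion y/n.
    p = y / n
    return y * math.log(p) + (n - y) * math.log(1.0 - p)

def chi2_1df_sf(x):
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(x / 2.0))

ll_null = binom_loglik(20, 53)                         # no X: one common probability
ll_model = binom_loglik(15, 27) + binom_loglik(5, 26)  # X = large / small tumor
G = -2.0 * (ll_null - ll_model)
p_lr = chi2_1df_sf(G)

print(round(ll_null, 3), round(ll_model, 3), round(G, 2), round(p_lr, 4))
```

The results should match the slide to rounding: ln L of about -35.126 and -31.276, G of about 7.70, and a p-value of about .0055.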
23
Stata code

. logistic node tumor

Logistic regression                               Number of obs   =         53
                                                  LR chi2(1)      =       7.70
                                                  Prob > chi2     =     0.0055
Log likelihood = -31.276312                       Pseudo R2       =     0.1096

------------------------------------------------------------------------------
        node | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       tumor |       5.25   3.310487     2.63   0.009      1.52552    18.06761
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
        node |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       tumor |   1.658228    .630569     2.63   0.009     .4223355    2.894121
       _cons |  -1.435085   .4976116    -2.88   0.004    -2.410385   -.4597837
------------------------------------------------------------------------------

Pseudo R2 = 1 - l_m / l_0, where l_m and l_0 are the log likelihoods of the fitted and null models
24
The information matrix
Maximum likelihood theory states that the variance estimators for estimates obtained from MLE can be derived from the matrix of second partial derivatives of the log likelihood. Minus this matrix is called the information matrix, I, and the estimated variances and covariances of the parameter estimates are obtained from its inverse, I^{-1}.
25
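Let X be the N × 2 design matrix whose i'th row is (1, X_i), let β = (β_0, β_1)', and let V = diag(p̂_i (1 - p̂_i)), the N × N diagonal matrix in which p̂_i is the fitted value of P(Y_i = 1 | X_i).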
26
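Then I = X'VX and it can be shown that, approximately, $\hat\beta \sim N(\beta, I^{-1})$
and so an approximate 95% CI for, e.g., β_1 is given by: $\hat\beta_1 \pm 1.96\,\widehat{se}(\hat\beta_1)$, where $\widehat{se}(\hat\beta_1)$ is the square root of the diagonal element of $I^{-1}$ corresponding to β_1,
and hence a 95% CI for the OR is obtained by exponentiation of the CI for β_1.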
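For the nodal involvement by tumor size example (15 of 27 with large tumors, 5 of 26 with small tumors), I = X'VX is a 2 × 2 matrix that can be built and inverted by hand. The following Python sketch (an assumption; the course uses Stata) accumulates I over the two covariate groups rather than forming the full N × N matrix V, which gives the same result.

```python
import math

# Grouped data: (x, events y, group size n); x=1 is a large tumor.
groups = [(1, 15, 27), (0, 5, 26)]

# Information matrix I = X'VX, accumulated group by group. With a single binary
# covariate the fitted probabilities equal the observed proportions y/n.
i00 = i01 = i11 = 0.0
for x, y, n in groups:
    p = y / n
    v = n * p * (1.0 - p)  # sum of p(1-p) over the n subjects in the group
    i00 += v
    i01 += v * x
    i11 += v * x * x

# Invert the symmetric 2x2 matrix [[i00, i01], [i01, i11]].
det = i00 * i11 - i01 * i01
var_b0 = i11 / det
var_b1 = i00 / det

b1 = math.log(15 * 21 / (5 * 12))
se_b0 = math.sqrt(var_b0)
se_b1 = math.sqrt(var_b1)
z = 1.959964  # 97.5th percentile of the standard normal
ci_beta = (b1 - z * se_b1, b1 + z * se_b1)
ci_or = (math.exp(ci_beta[0]), math.exp(ci_beta[1]))

print(round(se_b0, 4), round(se_b1, 4))
print([round(v, 3) for v in ci_beta], [round(v, 2) for v in ci_or])
```

The standard errors and intervals should agree closely with the Stata output for this model (standard errors near .4976 and .6306; OR interval near 1.53 to 18.07).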
27
Interpretation of coefficients
Dichotomous X (coded 0 or 1)
Here OR = exp(β_1), or ln(OR) = β_1
Interpretation of β_0 depends on study design.
28
Polytomous X

               Smoking (cigs/day)
CHD          >30   21-30    1-20       0
Present       39      50      70      98
Absent       253     355     735    1554
OR          2.44    2.23    1.51    1.00
29
Polytomous X with k categories
We define X_1, X_2, …, X_{k-1} dummy 0-1 design variables and consider the model:
logit(P(Y=1 | X)) = β_0 + β_1 X_1 + β_2 X_2 + … + β_{k-1} X_{k-1}
exp(β_j) is the odds ratio for the j'th category of X relative to the baseline category.
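With dummy variables for every non-baseline category, each exp(β_j) is just the cross-product ratio of that category against the baseline. The Python sketch below (an assumption; the course uses Stata) recomputes the odds ratios from the CHD by smoking counts, taking 0 cigs/day as the baseline.

```python
# CHD counts by smoking category (cigs/day), from the smoking table.
present = {">30": 39, "21-30": 50, "1-20": 70, "0": 98}
absent = {">30": 253, "21-30": 355, "1-20": 735, "0": 1554}

baseline = "0"
odds_ratios = {
    level: (present[level] * absent[baseline]) / (absent[level] * present[baseline])
    for level in present
}

for level, value in odds_ratios.items():
    print(level, round(value, 2))
```

The values should reproduce the OR row of the table (2.44, 2.23, 1.51, 1.00).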
30
Stata code:
. input chd smoke count
  1 3 39
  1 2 50
  1 1 70
  1 0 98
  0 3 253
  0 2 355
  0 1 735
  0 0 1554
. end
31
. xi: logit chd i.smoke [fweight = count]
i.smoke           _Ismoke_0-3         (naturally coded; _Ismoke_0 omitted)

Iteration 0:   log likelihood = -890.62187
Iteration 1:   log likelihood = -876.52013
Iteration 2:   log likelihood = -875.84853
Iteration 3:   log likelihood = -875.84738

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   .4122448   .1627693     2.53   0.011     .0932229    .7312667
   _Ismoke_2 |   .8035253   .1834786     4.38   0.000     .4439138    1.163137
   _Ismoke_3 |   .8937922   .2010989     4.44   0.000     .4996455    1.287939
       _cons |   -2.76362   .1041517   -26.53   0.000    -2.967754   -2.559486
------------------------------------------------------------------------------
32
. xi: logistic chd i.smoke [fweight=count]
i.smoke           _Ismoke_0-3         (naturally coded; _Ismoke_0 omitted)

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   1.510204   .2458148     2.53   0.011     1.097706    2.077711
   _Ismoke_2 |     2.2334   .4097812     4.38   0.000     1.558796    3.199955
   _Ismoke_3 |   2.444382   .4915626     4.44   0.000     1.648137    3.625307
------------------------------------------------------------------------------
33
. expand count
(3146 observations created)

. xi: logit chd i.smoke

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   .4122448   .1627693     2.53   0.011     .0932229    .7312667
   _Ismoke_2 |   .8035253   .1834786     4.38   0.000     .4439138    1.163137
   _Ismoke_3 |   .8937922   .2010989     4.44   0.000     .4996455    1.287939
       _cons |   -2.76362   .1041517   -26.53   0.000    -2.967754   -2.559486
------------------------------------------------------------------------------

. xi: logistic chd i.smoke

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   1.510204   .2458148     2.53   0.011     1.097706    2.077711
   _Ismoke_2 |     2.2334   .4097812     4.38   0.000     1.558796    3.199955
   _Ismoke_3 |   2.444382   .4915626     4.44   0.000     1.648137    3.625307
------------------------------------------------------------------------------
34
. lincom _Ismoke_2 - _Ismoke_1, or

 ( 1)  - _Ismoke_1 + _Ismoke_2 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.478873   .2900367     2.00   0.046     1.006916    2.172044
------------------------------------------------------------------------------

. lincom _Ismoke_3 - _Ismoke_2, or

 ( 1)  - _Ismoke_2 + _Ismoke_3 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.094466   .2505588     0.39   0.693      .698771    1.714234
------------------------------------------------------------------------------

. lincom _Ismoke_3 - _Ismoke_1, or

 ( 1)  - _Ismoke_1 + _Ismoke_3 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.618577   .3442644     2.26   0.024     1.066809    2.455728
------------------------------------------------------------------------------
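The first lincom comparison (21-30 versus 1-20 cigs/day) can be checked from the raw counts. This is a Python sketch (an assumption; the course uses Stata). It uses the fact that, in this model with one dummy per category, the variance of the difference of two log odds ratios reduces to the sum of reciprocals of the four cells being compared; the OR-scale standard error is obtained by the delta method.

```python
import math

# Counts (CHD present, CHD absent) for the two smoking levels being compared.
present_2, absent_2 = 50, 355   # 21-30 cigs/day (_Ismoke_2)
present_1, absent_1 = 70, 735   # 1-20 cigs/day  (_Ismoke_1)

# beta_2 - beta_1 on the log odds scale, and its standard error.
log_or = math.log((present_2 * absent_1) / (absent_2 * present_1))
se_log_or = math.sqrt(1/present_2 + 1/absent_2 + 1/present_1 + 1/absent_1)

z = 1.959964  # 97.5th percentile of the standard normal
or_21 = math.exp(log_or)
ci = (math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or))
se_or = or_21 * se_log_or  # delta-method standard error on the OR scale

print(round(or_21, 4), round(se_or, 4), [round(v, 3) for v in ci])
```

The odds ratio, its standard error, and the interval should agree with the first lincom table (about 1.479, .290, and 1.007 to 2.172).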
35
Continuous X
Here interpretation of β_1 depends on the units of X.
If the logit is linear in X, then β_1 represents the change in log odds for a 1 unit increase in X.
exp(β_1) is the odds ratio corresponding to a 1 unit increase in X.
36
Example: Effect of age on nodal involvement in prostate cancer

. logistic node age

Logit estimates                                   Number of obs   =         53
                                                  LR chi2(1)      =       1.09
                                                  Prob > chi2     =     0.2965
Log likelihood = -34.581125                       Pseudo R2       =     0.0155

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   .9526993   .0445086     -1.037   0.300       .8693389    1.044053
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   -.048456   .0467184     -1.037   0.300      -.1400223    .0431104
   _cons |   2.366605   2.770912      0.854   0.393      -3.064283    7.797493
------------------------------------------------------------------------------
37
NOTES
The OR for nodal involvement corresponding to a ten year age difference is exp(10 β_AGE), estimated by .953^10 = .62
The 95% CI for 10 β_AGE (the log of the 10-year OR) is given by: 10 × (-.1400, .0431) = (-1.400, .431)
Hence the 95% CI for the 10-year OR is given by: (e^{-1.400}, e^{.431}) = (.25, 1.54)
This OR is the same comparing 40 year olds with 30 year olds as comparing 60 year olds with 50 year olds, etc.
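The ten-year odds ratio and its interval follow from multiplying the per-year coefficient and its confidence limits by 10 before exponentiating. A small Python sketch (an assumption; the course uses Stata), using the coefficient and limits from the logit output for age:

```python
import math

beta_age = -0.048456               # log odds ratio per year of age
ci_beta = (-0.1400223, 0.0431104)  # 95% CI for beta_age

years = 10
or_10 = math.exp(years * beta_age)
ci_or_10 = tuple(math.exp(years * bound) for bound in ci_beta)

print(round(or_10, 2), [round(v, 2) for v in ci_or_10])
```

Scaling on the log odds scale first, and only then exponentiating, is what keeps the interval consistent with the linear-in-logit model.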
38
Multiple logistic regression
logit(P(Y=1 | X_1, X_2, …, X_k)) = β_0 + β_1 X_1 + β_2 X_2 + … + β_k X_k
39
Estimation
Assume we have N observations (Y_i, X_i1, X_i2, …, X_ik), i=1,2,…,N
As before, we can use maximum likelihood to obtain estimates of β_0, β_1, β_2, …, β_k that maximize the likelihood:
$L = \prod_{i=1}^{N} p_i^{Y_i} (1 - p_i)^{1-Y_i}$, where $p_i = P(Y=1 \mid X_{i1}, \ldots, X_{ik})$
and we can estimate the variances and covariances of the estimates from the inverse of the information matrix, I.
40
Hypothesis testing
The Wald, Likelihood Ratio and Score tests generalize to the case of k X variables.
In general
Full model: logit(p) = β_0 + β_1 X_1 + β_2 X_2 + … + β_k X_k
Reduced model: logit(p) = β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p, p < k
H0: β_{p+1} = β_{p+2} = … = β_k = 0
Ha: at least one of β_{p+1}, …, β_k ≠ 0
41
Likelihood ratio test
LR statistic = -2[ln L(reduced) - ln L(full)] = Deviance(reduced) - Deviance(full)
Approximate distribution under H0: χ² with k-p degrees of freedom
We must fit two models to calculate the LR statistic
Stata provides LR test of the current model relative to the null model: H0: β_1 = β_2 = … = β_k = 0
42
Score test
If H0 implies β = β* then
Score statistic = S(β*)' I^{-1} S(β*)
where S denotes the score vector (the first derivatives of the log likelihood) and I denotes the information matrix, both evaluated at β*
Approximate distribution under H0: χ² with k-p degrees of freedom
Only need to fit the reduced model to calculate the Score statistic
Stata does not perform the Score test easily.
43
Wald test
For a single parameter: $\hat\beta_j / \widehat{se}(\hat\beta_j) \sim N(0,1)$ under H0: β_j = 0.
The Wald test can be generalized to multiple parameters where it also follows a χ² distribution with k-p degrees of freedom under H0.
Most confidence intervals are based on the Wald test statistic
44
LR tests using Stata
In general:
Fit "full" model, then:
. est store A
  saves log-likelihood from most recently fitted model and labels it "A"
Fit reduced model, then:
. est store B
  saves log-likelihood from most recently fitted model and labels it "B"
Carry out the LR test comparing "full" model (A) with reduced model (B)
. lrtest A B, stats
45
Example: prostate cancer study

                       Tumor large        Tumor small
Nodal involvement     Xray+   Xray-      Xray+   Xray-
Yes                       9       6          2       3
No                        1      11          3      18
46
Fitting “full” model:

. logistic node tsize xray

Logistic regression                               Number of obs   =         53
                                                  LR chi2(2)      =      16.90
                                                  Prob > chi2     =     0.0002
Log likelihood = -26.676709                       Pseudo R2       =     0.2405

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   4.895297   3.426809      2.269   0.023       1.241425    19.30357
    xray |   8.326496   6.218498      2.838   0.005       1.926448     35.9888
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   1.588275   .7000206      2.269   0.023       .2162598     2.96029
    xray |   2.119443   .7468325      2.838   0.005       .6556779    3.583208
   _cons |  -2.044627   .6099686     -3.352   0.001      -3.240144   -.8491109
------------------------------------------------------------------------------

. est store A
47
Fitting “reduced” model:

. logistic node tsize

Logistic regression                               Number of obs   =         53
                                                  LR chi2(1)      =       7.70
                                                  Prob > chi2     =     0.0055
Log likelihood = -31.276312                       Pseudo R2       =     0.1096

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |       5.25   3.310487      2.630   0.009        1.52552    18.06761
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   1.658228    .630569      2.630   0.009       .4223355    2.894121
   _cons |  -1.435085   .4976116     -2.884   0.004      -2.410385   -.4597837
------------------------------------------------------------------------------

. est stor B
48
Likelihood ratio test: comparing models for nodal involvement with and without effect of xray

. lrtest A B, stats

Likelihood-ratio test                                 LR chi2(1)  =      9.20
(Assumption: B nested in A)                           Prob > chi2 =    0.0024

------------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+----------------------------------------------------------------
           B |     53   -35.12608   -31.27631      2     66.55262    70.49321
           A |     53   -35.12608   -26.67671      3     59.35342    65.26429
------------------------------------------------------------------------------

What hypothesis is this testing?
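The numbers in the lrtest table can be rebuilt from the two stored log likelihoods. The Python sketch below (an assumption; the course uses Stata) computes the LR statistic, its p-value, and the AIC and BIC columns using the usual definitions AIC = -2 ln L + 2 df and BIC = -2 ln L + df ln(N). The erfc shortcut for the chi-square tail applies only because the test has 1 degree of freedom.

```python
import math

ll_full, k_full = -26.67671, 3        # model A: tsize + xray + constant
ll_reduced, k_reduced = -31.27631, 2  # model B: tsize + constant
n_obs = 53

lr = -2.0 * (ll_reduced - ll_full)
df = k_full - k_reduced
p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square upper tail, valid for df = 1 only

def aic(ll, k):
    return -2.0 * ll + 2.0 * k

def bic(ll, k, n):
    return -2.0 * ll + k * math.log(n)

print(round(lr, 2), df, round(p_value, 4))
print(round(aic(ll_reduced, k_reduced), 3), round(bic(ll_reduced, k_reduced, n_obs), 3))
print(round(aic(ll_full, k_full), 3), round(bic(ll_full, k_full, n_obs), 3))
```

The statistic compares the model with xray against the model without it, so the single degree of freedom corresponds to the xray coefficient.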
49
Fitted probabilities in the “full” model:
. predict pnode, p
Fitted probability, with the raw proportion in parentheses:
P(node | tumor=0, xray=0) = .1146 (.1429)
P(node | tumor=1, xray=0) = .3879 (.3529)
P(node | tumor=0, xray=1) = .5187 (.4000)
P(node | tumor=1, xray=1) = .8407 (.9000)
Note: these are slightly different from what we would get if we used the raw data without modelling. Why?
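The fitted probabilities can be recomputed from the "full" model coefficients by applying the inverse logit, and set beside the raw proportions from the prostate cancer table. This is a Python sketch (an assumption; the course uses Stata's predict). One point relevant to the question on the slide: the model has three parameters for four covariate patterns, so it cannot in general reproduce all four raw proportions.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients from the fitted "full" model (logit scale).
b_cons, b_tsize, b_xray = -2.044627, 1.588275, 2.119443

# Raw counts (nodal involvement yes, no) keyed by (tumor, xray).
counts = {(0, 0): (3, 18), (1, 0): (6, 11), (0, 1): (2, 3), (1, 1): (9, 1)}

fitted = {}
observed = {}
for (tumor, xray), (yes, no) in counts.items():
    fitted[(tumor, xray)] = expit(b_cons + b_tsize * tumor + b_xray * xray)
    observed[(tumor, xray)] = yes / (yes + no)
    print(tumor, xray,
          round(fitted[(tumor, xray)], 4),
          round(observed[(tumor, xray)], 4))
```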
50
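Confidence intervals
A 100(1-α)% Likelihood Ratio based confidence region for β is given by:
$\{ \beta : 2[\,l(\hat\beta) - l(\beta)\,] \le \chi^2_{q,\,1-\alpha} \}$, where l is the log likelihood and q is the number of parameters in β
Stata provides Wald-based CIs for individual parameters
CIs for odds ratios can be obtained by exponentiation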
51
Confounding
“With ruin upon ruin, rout on rout, Confusion worse confounded.” Milton: Paradise Lost, ii. line 996.
A confounder, C, is a variable which, because of its relationship to disease, D, and the exposure of interest, E, distorts the disease-exposure relationship. Adjustment can remove its effects. We can adjust for the confounder explicitly by modeling or implicitly by stratification.
Note: C would not be considered a confounder if
- it lies in the causal pathway between E and D, or
- C is caused by both E and D
52
Example
Three 2x2 tables; the middle table is the cell-by-cell sum of the other two.

       E+    E-            E+    E-            E+    E-
D+     90    10     D+     91    19     D+      1     9
D-      9     1     D-     19    91     D-     10    90
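The point of the example can be seen by computing the odds ratio in each table. In the Python sketch below (an assumption; the course uses Stata), the first and third tables are treated as the two strata of the confounder and the middle table as their sum, which is a reading of the slide layout confirmed by the cell totals.

```python
def odds_ratio(a, b, c, d):
    # 2x2 table: D+ row = (a, b), D- row = (c, d); columns are E+, E-.
    return (a * d) / (b * c)

table_1 = (90, 10, 9, 1)
table_3 = (1, 9, 10, 90)
pooled = tuple(x + y for x, y in zip(table_1, table_3))

print(odds_ratio(*table_1), odds_ratio(*table_3))
print(pooled, round(odds_ratio(*pooled), 2))
```

Each stratum shows no association, while the pooled table shows a strong one: the crude association here is produced entirely by the stratifying variable.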
53
Effect modification
Effect modification occurs when the chosen summary of association differs in different strata.
In some cases there may be effect modification for one summary but not for another. For example, in clinical trials of cholesterol-lowering drugs it appears that the relative risk of coronary heart disease comparing the treatment to placebo is about the same in people with and without previous disease, but the risk difference is enormously greater in those with previous disease.
Effect modification may or may not be of interest. For example, pleconaril, an investigational antiviral drug, reduced the mean duration of symptoms in subjects with a common cold due to rhinoviruses but had no effect in subjects whose cold was due to some other agent. Here the effect modification was important in checking that the drug really worked by inhibiting rhinovirus. On the other hand, in clinical use of the drug it would typically not be possible to determine the infectious agent and so the average effectiveness across all colds would be a more important quantity.
54
Effect modification and confounding can exist separately or together:
- Confounding without effect modification. Here, the overall association is not the same as the causal effect of interest, but after stratification the association is the same within each stratum of the confounder. The ideal solution is to stratify and then to average the association across strata to regain the precision lost by stratifying.
- Effect modification without confounding. Here, the overall association correctly estimates the average effect of the exposure, but that effect is different in different subgroups. If the separate associations are of interest then a stratified analysis is called for. If the main scientific interest is in the average effect across the population then a stratified analysis is unnecessary.
- Both confounding and effect modification. That is, the overall association does not correctly estimate the average effect of exposure and after stratification the association is different in different subgroups. The confounding means that a stratified analysis is necessary. If the effect modification is scientifically uninteresting the estimates from separate strata can be combined as would be done in the absence of effect modification.
55
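Logistic models for a binary exposure (X_E) and binary covariate (X_C)
Consider the model: logit(P(Y=1 | X_E, X_C)) = β_0 + β_1 X_E + β_2 X_C + β_3 X_E X_C
Let Ψ_1 denote the odds ratio for exposure (X_E = 1 versus X_E = 0) when X_C = 0, and Ψ_2 the odds ratio for exposure when X_C = 1.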
56
Then, in a cohort study, under the logistic model:
ln(Ψ_1) = logit(P(Y=1 | X_E=1, X_C=0)) - logit(P(Y=1 | X_E=0, X_C=0)) = β_0 + β_1 - β_0 = β_1
ln(Ψ_2) = logit(P(Y=1 | X_E=1, X_C=1)) - logit(P(Y=1 | X_E=0, X_C=1)) = β_0 + β_1 + β_2 + β_3 - β_0 - β_2 = β_1 + β_3
β_0 estimates logit(P(Y=1 | X_E=0, X_C=0))
β_1 estimates ln(Ψ_1)
β_2 estimates logit(P(Y=1 | X_E=0, X_C=1)) - logit(P(Y=1 | X_E=0, X_C=0))
β_3 estimates ln(Ψ_2) - ln(Ψ_1)
57
Logistic models for two 2x2 tables
59
In a case-control study:
Let Z = 1 if an individual is sampled, Z = 0 otherwise
And let
π_11 = P(Z=1 | Y=1, X_C=1)
π_10 = P(Z=1 | Y=1, X_C=0)
π_01 = P(Z=1 | Y=0, X_C=1)
π_00 = P(Z=1 | Y=0, X_C=0)
Then logit(P(Y=1 | X_E, X_C=1, Z=1)) = log(π_11 / π_01) + logit(P(Y=1 | X_E, X_C=1)) etc.
60
Consider the model: logit(P(Y=1 | X_E, X_C, Z=1)) = β_0* + β_1 X_E + β_2* X_C + β_3 X_E X_C
Then:
log(Ψ_1) = β_1
log(Ψ_2) = β_1 + β_3
as before, but
β_0* = ln(π_10 / π_00) + β_0
β_2* = ln(π_11 / π_01) - ln(π_10 / π_00) + β_2
61
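Confidence intervals for linear combinations of parameters
100(1-α)% CI for Ψ_2:
$\exp\left[ \hat\beta_1 + \hat\beta_3 \pm z_{1-\alpha/2}\, \widehat{se}(\hat\beta_1 + \hat\beta_3) \right]$
where $\widehat{se}(\hat\beta_1 + \hat\beta_3) = \sqrt{ \widehat{var}(\hat\beta_1) + \widehat{var}(\hat\beta_3) + 2\,\widehat{cov}(\hat\beta_1, \hat\beta_3) }$
This can be obtained in Stata using the "lincom" command.
Or by using a different parameterization.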
62
Parameterizations
1. "Full" model with interaction (A):
logit(P(Y=1 | X_E, X_C)) = β_0 + β_1 X_E + β_2 X_C + β_3 X_E X_C
2. Reduced model without interaction (B):
logit(P(Y=1 | X_E, X_C)) = β_0 + β_1 X_E + β_2 X_C
63
3. Another "full" logistic model:
X_1 = 1 when X_E=1, X_C=0; = 0 otherwise
X_2 = 1 when X_E=1, X_C=1; = 0 otherwise
logit(P(Y=1 | X_1, X_2, X_C)) = β_0 + β_1 X_1 + β_2 X_C + γ X_2
Here
β_1 = log(Ψ_1)
β_2 = log(Ψ_{C|E-})
γ = log(Ψ_2)
This model allows us to estimate Ψ_2 directly, but it is more complicated to test for interaction. (No interaction => γ = β_1.)
64
4. A third "full" logistic model
X_1 = 1 when X_E=1, X_C=0; = 0 otherwise
X_2 = 1 when X_E=0, X_C=1; = 0 otherwise
X_3 = 1 when X_E=1, X_C=1; = 0 otherwise
logit(P(Y=1 | X_1, X_2, X_3)) = β_0 + β_1 X_1 + β_2 X_2 + η X_3
allows us to test each of the above groups against the baseline.
The interpretation of regression coefficients depends on the parameterization and what other variables are in the model.
65
Case-control study of esophageal cancer and alcohol consumption in France (Breslow & Day, Vol I, p137).
6 age strata; 2 exposure variables: daily alcohol consumption (4 categories), daily tobacco consumption (4 categories); 2 disease groups: cases and controls.

. infile age alcohol tobacco case using "p:\536\esoph.raw", clear
(975 observations read)
. * begin labelling data and variables
. label data "Esophageal Cancer Case-Control Study"
. label define agelabel 1 "25-34" 2 "35-44" 3 "45-54" 4 "55-64" 5 "65-74" 6 "75+"
. label values age agelabel
. label variable age "Age in years"
. label define alclabel 1 "0-39" 2 "40-79" 3 "80-119" 4 "120+"
. label values alcohol alclabel
. label variable alcohol "Alcohol g/day"
. label define toclabel 1 "0-9" 2 "10-19" 3 "20-29" 4 "30+"
. label values tobacco toclabel
. label variable tobacco "Tobacco g/day"
. label define caselab 0 "Control" 1 "Case"
. label values case caselab
. label variable case "Case-control status"
66
. * CREATE SOME SIMPLE TABLES TO LOOK AT DATA
. tabulate age case, col

    Age in | Case-control status
     years |   Control       Case |     Total
-----------+----------------------+----------
     25-34 |       115          1 |       116
           |     14.84       0.50 |     11.90
-----------+----------------------+----------
     35-44 |       190          9 |       199
           |     24.52       4.50 |     20.41
-----------+----------------------+----------
     45-54 |       167         46 |       213
           |     21.55      23.00 |     21.85
-----------+----------------------+----------
     55-64 |       166         76 |       242
           |     21.42      38.00 |     24.82
-----------+----------------------+----------
     65-74 |       106         55 |       161
           |     13.68      27.50 |     16.51
-----------+----------------------+----------
       75+ |        31         13 |        44
           |      4.00       6.50 |      4.51
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00
67
. tabulate alcohol case, col

   Alcohol | Case-control status
     g/day |   Control       Case |     Total
-----------+----------------------+----------
      0-39 |       386         29 |       415
           |     49.81      14.50 |     42.56
-----------+----------------------+----------
     40-79 |       280         75 |       355
           |     36.13      37.50 |     36.41
-----------+----------------------+----------
    80-119 |        87         51 |       138
           |     11.23      25.50 |     14.15
-----------+----------------------+----------
      120+ |        22         45 |        67
           |      2.84      22.50 |      6.87
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00
68
. tabulate tobacco case, col

   Tobacco | Case-control status
     g/day |   Control       Case |     Total
-----------+----------------------+----------
       0-9 |       447         78 |       525
           |     57.68      39.00 |     53.85
-----------+----------------------+----------
     10-19 |       178         58 |       236
           |     22.97      29.00 |     24.21
-----------+----------------------+----------
     20-29 |        99         33 |       132
           |     12.77      16.50 |     13.54
-----------+----------------------+----------
       30+ |        51         31 |        82
           |      6.58      15.50 |      8.41
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00
69
. table case alcohol tobacco

----------+-----------------------------------------------------------------
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------- 0-9 ------------   ------------ 10-19 -----------
status    |   0-39   40-79  80-119    120+     0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
  Control |    252     145      42       8       74      68      30       6
     Case |      9      34      19      16       10      17      19      12
----------+-----------------------------------------------------------------

----------+-----------------------------------------------------------------
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------ 20-29 -----------   ------------- 30+ ------------
status    |   0-39   40-79  80-119    120+     0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
  Control |     37      47      10       5       23      20       5       3
     Case |      5      15       6       7        5       9       7      10
----------+-----------------------------------------------------------------
70
BIOST 536 Thompson 70. table case alcohol tobacco, by(age) ----------+----------------------------------------------------------------- Age in | years and | Case-cont | Tobacco g/day and Alcohol g/day rol | ------------- 0-9 ------------ ------------ 10-19 ----------- status | 0-39 40-79 80-119 120+ 0-39 40-79 80-119 120+ ----------+----------------------------------------------------------------- 25-34 | Control | 40 27 2 1 10 7 1 Case | 1 ----------+----------------------------------------------------------------- 35-44 | Control | 60 35 11 1 13 20 6 3 Case | 2 1 3 ----------+----------------------------------------------------------------- 45-54 | Control | 45 32 13 18 17 8 1 Case | 1 6 3 4 4 6 3 ----------+----------------------------------------------------------------- 55-64 | Control | 47 31 9 5 19 15 7 1 Case | 2 9 9 5 3 6 8 6 ----------+----------------------------------------------------------------- 65-74 | Control | 43 17 7 1 10 7 8 1 Case | 5 17 6 3 4 3 4 1 ----------+----------------------------------------------------------------- 75+ | Control | 17 3 4 2 Case | 1 2 1 2 2 1 1 1 ----------+-----------------------------------------------------------------
71
BIOST 536 Thompson 71 ----------+----------------------------------------------------------------- Age in | years and | Case-cont | Tobacco g/day and Alcohol g/day rol | ------------ 20-29 ----------- ------------- 30+ ------------ status | 0-39 40-79 80-119 120+ 0-39 40-79 80-119 120+ ----------+----------------------------------------------------------------- 25-34 | Control | 6 4 1 5 7 2 2 Case | ----------+----------------------------------------------------------------- 35-44 | Control | 7 13 2 2 8 8 1 Case | 1 2 ----------+----------------------------------------------------------------- 45-54 | Control | 10 10 4 1 4 2 2 Case | 5 1 2 5 2 4 ----------+----------------------------------------------------------------- 55-64 | Control | 9 13 3 1 2 3 1 Case | 3 4 3 2 4 3 4 5 ----------+----------------------------------------------------------------- 65-74 | Control | 5 4 1 2 Case | 2 5 2 1 1 1 ----------+----------------------------------------------------------------- 75+ | Control | 3 2 Case | 1 1 ----------+-----------------------------------------------------------------
72
Some Stata language for recoding variables:

. generate agegp=recode(age,2,4,6)
. * All obsns with age <=2 have agegp=2, all with age >2 and <=4
. * have agegp=4 and all with age > 4 have agegp=6
. * Change the coding to 1,2,3
. recode agegp 2=1 4=2 6=3
(975 changes made)
. table age

----------+-----------
   Age in |
    years |      Freq.
----------+-----------
    25-34 |        116
    35-44 |        199
    45-54 |        213
    55-64 |        242
    65-74 |        161
      75+ |         44
----------+-----------
73
. drop agegp
. gen agegp=recode(age,2,4)
. table agegp

-------+-----------
 agegp |      Freq.
-------+-----------
     2 |        315
     4 |        660
-------+-----------

. * All observations that are not <= a number in the list are given the last
. * value in the list
. drop agegp
. gen agegp=1+(age>2)+(age>4)
. table agegp

----------+-----------
    agegp |      Freq.
----------+-----------
        1 |        315
        2 |        455
        3 |        205
----------+-----------
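The expression 1+(age>2)+(age>4) works because each comparison evaluates to 0 or 1. The same logic can be traced in Python (an assumption; the course uses Stata), applied to the age-group frequencies from the table of age, with age coded 1 to 6 as in the data.

```python
# Frequencies of the six coded age strata (1 = 25-34, ..., 6 = 75+).
age_freq = {1: 116, 2: 199, 3: 213, 4: 242, 5: 161, 6: 44}

agegp_freq = {}
for age, n in age_freq.items():
    agegp = 1 + (age > 2) + (age > 4)  # each comparison contributes 0 or 1
    agegp_freq[agegp] = agegp_freq.get(agegp, 0) + n

print(agegp_freq)
```

The three collapsed groups should carry the frequencies shown in the final Stata table (315, 455 and 205).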
74
Analysis with binary tobacco and alcohol variables

. gen binalc=alcohol>2
. gen bintob=tobacco>2
. * Start by looking at some crude and stratified analyses
. table case binalc bintob

----------+-------------------------
Case-cont |    bintob and binalc
rol       | ---- 0 ---   ---- 1 ---
status    |    0      1     0      1
----------+-------------------------
  Control |  539     86   127     23
     Case |   70     66    34     30
----------+-------------------------
75
. cc case binalc, by (bintob)

          bintob |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |   5.909302      3.94179   8.859986     7.910644 (Cornfield)
               1 |   4.872123     2.523999   9.408074     3.654206 (Cornfield)
-----------------+-------------------------------------------------
           Crude |   5.640085     4.003217    7.94673              (Cornfield)
    M-H combined |   5.581579     3.945401    7.89629
-----------------+-------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.24  Pr>chi2 = 0.6258

                   Test that combined OR = 1:
                                Mantel-Haenszel chi2(1) =    106.85
                                                Pr>chi2 =    0.0000
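The stratum-specific, crude, and Mantel-Haenszel odds ratios in the cc output can be reproduced from the cell counts in the table of case by binalc and bintob. This is a Python sketch (an assumption; the course uses Stata). It uses the standard Mantel-Haenszel estimator, the sum of ad/n over strata divided by the sum of bc/n; the confidence limits and test statistics are not reproduced here.

```python
# Each stratum: (exposed cases, unexposed cases, exposed controls, unexposed controls),
# where "exposed" means binalc = 1.
strata = {
    0: (66, 70, 86, 539),   # bintob = 0
    1: (30, 34, 23, 127),   # bintob = 1
}

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

stratum_or = {key: odds_ratio(*cells) for key, cells in strata.items()}

num = den = 0.0
mh_weight = {}
for key, (a, b, c, d) in strata.items():
    n = a + b + c + d
    num += a * d / n
    mh_weight[key] = b * c / n  # the "M-H Weight" column in the Stata output
    den += mh_weight[key]
or_mh = num / den

# Crude table: add the strata cell by cell.
crude = tuple(sum(cells) for cells in zip(*strata.values()))
or_crude = odds_ratio(*crude)

print(stratum_or, round(or_crude, 4), round(or_mh, 4))
```

Because the crude and M-H combined estimates are close (about 5.64 and 5.58), there is little sign here that the binary tobacco variable confounds the alcohol association.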