Published by Solomon Adams. Modified over 5 years ago.
1
Statistical Methods in Epidemiology. Academic Year 2018-2019
2
Lecture 6. CASE-CONTROL STUDIES
3
1. Introduction

In a cohort study, the relationship between exposure and disease incidence is investigated by following the entire cohort and measuring the rate of occurrence of new cases in the different exposure groups. The follow-up allows the investigator to register those subjects who develop the disease during the study period and to identify those who remain free of the disease.

In a case-control study, the subjects who develop the disease are registered by some mechanism other than follow-up, and a group of healthy subjects (the controls) is used to represent the subjects who did not develop the disease. This approach has two advantages:
- It removes the need to follow people over time, waiting to see who falls ill and who does not.
- It reduces the number of people without the outcome of interest who need to be studied (very important for very rare diseases).

Therefore a case-control study recruits some people because they have the disease (or outcome of interest) and some people because they do not have it.
4
2. Analysis of case-control studies
The retrospective approach: compare the distribution of exposure between cases and controls. Since the distribution of the outcome is fixed by design, we study the distribution of exposure (the reverse of what is done in a cohort study). For example:
- the proportion of cases who smoke compared to controls
- the proportion of cases who were vaccinated compared to controls
- the mean age of cases compared to controls (continuous exposure)

The prospective approach: compare the case/control ratio between exposed and not exposed.
5
Example: William Guy's study of TB (perhaps the first case-control study)

Level of physical activity (PA) in occupation   Tuberculosis (cases, D)   Other diseases (controls, H)   Case/control ratio
0 = Little                                      125                       385                            0.325
1 = Varied                                      41                        136                            0.301
2 = More                                        142                       630                            0.225
3 = Great                                       33                        167                            0.198

Retrospective: we compare the distribution of PA in occupation between cases and controls. The mean level of PA is about 1.24 for cases and about 1.44 for controls (means recomputed from the table above; p=0.0021). People who did not develop tuberculosis had, on average, a higher level of PA in occupation than people who developed tuberculosis.

Prospective: the prospective approach is given in the last column of the table. The case/control ratio decreases as we go from the first level of PA in occupation (little) to the last one (great). The odds of being a case are lower for people with high PA in occupation than for those with little PA, i.e., there is a steady increase in the odds of failure with decreasing level of PA.

Conclusion: the level of physical activity is inversely associated with developing tuberculosis.
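The two summaries of Guy's table can be reproduced with a few lines of Python. This is only an illustrative sketch: the counts are those of the table, while the variable names and the coding of PA as the scores 0 to 3 are assumptions made for the example.

```python
# Counts from William Guy's table: PA level -> (TB cases, controls with other diseases)
counts = {0: (125, 385), 1: (41, 136), 2: (142, 630), 3: (33, 167)}

# Prospective view: case/control ratio within each level of exposure
ratios = {level: d / h for level, (d, h) in counts.items()}

# Retrospective view: mean PA score among cases and among controls
n_cases = sum(d for d, _ in counts.values())
n_controls = sum(h for _, h in counts.values())
mean_cases = sum(level * d for level, (d, _) in counts.items()) / n_cases
mean_controls = sum(level * h for level, (_, h) in counts.items()) / n_controls

for level, ratio in ratios.items():
    print(f"PA level {level}: case/control ratio = {ratio:.3f}")
print(f"mean PA: cases = {mean_cases:.2f}, controls = {mean_controls:.2f}")
```

Both views point in the same direction: the ratios fall with increasing PA, and the cases have the lower mean PA score.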
6
3. The probability model in the population
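Every case-control study of incidence can be seen within the context of an underlying cohort (the study base) which supplies the cases on which the case-control study is based.

The probability tree has three stages:
- Exposure: a subject is exposed (E+) with probability 0.1 (prevalence of exposure) and unexposed (E-) with probability 0.9.
- Failure: an exposed subject fails (F) with probability π1 and survives (S) with probability 1-π1; an unexposed subject fails with probability π0 and survives with probability 1-π0.
- Selection: a subject who failed is selected into the study as a case with probability 0.97 (not selected: 0.03); a subject who survived is selected as a control with probability 0.01 (not selected: 0.99).

Group in the study        Probability
Exposed case (D1)         0.1 x π1 x 0.97
Exposed control (H1)      0.1 x (1-π1) x 0.01
Unexposed case (D0)       0.9 x π0 x 0.97
Unexposed control (H0)    0.9 x (1-π0) x 0.01

π1 = probability of being a case having been exposed.
π0 = probability of being a case not having been exposed.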
7
Chain rule: P(A,B,C) = P(C|A,B) P(A,B) = P(C|A,B) P(B|A) P(A)

With E = exposure, C = case and S = selected:
P(exposure, case, selected) = P(S|E,C) P(E,C) = P(S|E,C) P(C|E) P(E)
where P(exposure) = 0.1, P(case | exposure) = π1 and P(selection | exposure, case) = 0.97.
8
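4. The prospective model

The odds of being a case

Only four branches of the probability tree lead into the study: exposed cases, 0.1 x π1 x 0.97; exposed controls, 0.1 x (1-π1) x 0.01; unexposed cases, 0.9 x π0 x 0.97; unexposed controls, 0.9 x (1-π0) x 0.01. All other branches are not in the study.

Let ω1 be the odds in the study that an exposed subject is a case. Then
$$\omega_1 = \frac{0.1 \times \pi_1 \times 0.97}{0.1 \times (1-\pi_1) \times 0.01} = \frac{0.97}{0.01} \times \frac{\pi_1}{1-\pi_1}$$
Let ω0 be the odds in the study that an unexposed subject is a case. Then
$$\omega_0 = \frac{0.9 \times \pi_0 \times 0.97}{0.9 \times (1-\pi_0) \times 0.01} = \frac{0.97}{0.01} \times \frac{\pi_0}{1-\pi_0}$$
The odds ratio in the study is then
$$\frac{\omega_1}{\omega_0} = \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)}$$
i.e., the "true" odds ratio can be estimated by the study odds ratio, because the ratio of sampling fractions (0.97/0.01) cancels.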
9
5. ESTIMATION

D1/H1 estimates ω1 and D0/H0 estimates ω0. Then
$$\frac{D_1/H_1}{D_0/H_0} = \frac{D_1 H_0}{D_0 H_1}$$
estimates the odds ratio. Thus, although it is not possible to estimate π1 and π0 separately from a case-control study, it is possible to estimate the odds ratio.
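A small numerical check can make the cancellation of the sampling fractions concrete. The sketch below uses the exposure prevalence (0.1) and the selection probabilities (0.97 and 0.01) of the probability tree; the risks π1 = 0.02 and π0 = 0.01 and the function names are hypothetical, chosen only for illustration.

```python
def study_odds_ratio(pi1, pi0, p_exposed=0.1, f_case=0.97, f_control=0.01):
    """Odds ratio among study members, built from the tree probabilities."""
    d1 = p_exposed * pi1 * f_case                   # exposed cases
    h1 = p_exposed * (1 - pi1) * f_control          # exposed controls
    d0 = (1 - p_exposed) * pi0 * f_case             # unexposed cases
    h0 = (1 - p_exposed) * (1 - pi0) * f_control    # unexposed controls
    return (d1 / h1) / (d0 / h0)


def population_odds_ratio(pi1, pi0):
    """Odds ratio of failure in the underlying cohort."""
    return (pi1 / (1 - pi1)) / (pi0 / (1 - pi0))


pi1, pi0 = 0.02, 0.01  # hypothetical risks, for illustration only
print(study_odds_ratio(pi1, pi0))
print(population_odds_ratio(pi1, pi0))
# Changing the sampling fractions changes the odds, but not the odds ratio
print(study_odds_ratio(pi1, pi0, f_case=0.5, f_control=0.05))
```

The individual odds D1/H1 and D0/H0 depend heavily on the sampling fractions, which is why π1 and π0 cannot be recovered separately.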
10
6. The retrospective model
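We define a model for the conditional probabilities of exposure, given that the subject was a case (F) or a control (S) in our STUDY. The four branches of the probability tree that lead into the study are: exposed cases, 0.1 x π1 x 0.97; exposed controls, 0.1 x (1-π1) x 0.01; unexposed cases, 0.9 x π0 x 0.97; unexposed controls, 0.9 x (1-π0) x 0.01. All other branches are not in the study.

Let Ω1 (and Ω0) be the odds of having been exposed for a case (or for a control). From the figure, Ω1 and Ω0 are given as
$$\Omega_1 = \frac{0.1 \times \pi_1 \times 0.97}{0.9 \times \pi_0 \times 0.97} = \frac{0.1}{0.9} \times \frac{\pi_1}{\pi_0}, \qquad \Omega_0 = \frac{0.1 \times (1-\pi_1) \times 0.01}{0.9 \times (1-\pi_0) \times 0.01} = \frac{0.1}{0.9} \times \frac{1-\pi_1}{1-\pi_0}$$
so that the exposure odds ratio is
$$\frac{\Omega_1}{\Omega_0} = \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)}$$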
11
The value of Ω1 can be estimated by D1/D0, i.e., the ratio of exposed to unexposed cases, and Ω0 by H1/H0, i.e., the ratio of exposed to unexposed controls. Thus the odds ratio of disease in the population can be estimated by the odds ratio of exposure from the case-control study. The estimate of the true odds ratio is therefore the same whether we adopt the prospective or the retrospective approach.
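The equality of the two estimates is a one-line piece of algebra, written out here for completeness with the counts D1, H1, D0, H0 used throughout the lecture:

```latex
\underbrace{\frac{D_1/H_1}{D_0/H_0}}_{\text{prospective}}
  \;=\; \frac{D_1 H_0}{D_0 H_1}
  \;=\; \underbrace{\frac{D_1/D_0}{H_1/H_0}}_{\text{retrospective}}
```

Both routes lead to the same cross-product ratio of the 2x2 table.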
12
8. What is the «odds ratio»?
Note that when the disease of interest is rare, π is small and π/(1-π) is approximately equal to π [because (1-π) is almost 1]. So:
$$\text{«disease odds ratio»} = \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \approx \frac{\pi_1}{\pi_0}$$
which is the risk ratio. So, when the disease under investigation is rare, the cross-product ratio from a case-control study provides an estimate of the risk ratio. Traditionally, this has been the approach to the interpretation of case-control studies and is often referred to as the «rare disease assumption».

A more general result states that when the disease is rare (in the population under study, over the period of the study), the odds ratio, risk ratio and rate ratio are, for all practical purposes, equal. More recently it has been shown that, when the disease is not rare, which of these three measures is estimated by the (cross-product) odds ratio depends on the way in which controls are selected.

In general terms, we can state that the cross-product ratio obtained from a case-control study provides an estimate of how much more (or less) common the disease / outcome is among the exposed compared with the unexposed.
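The quality of the rare disease approximation can be checked numerically. In this sketch the risk ratio is held at 2 while the baseline risk π0 increases; the risk values and function names are hypothetical and serve only to illustrate the point.

```python
def odds_ratio(pi1, pi0):
    """Odds ratio of disease for exposed versus unexposed."""
    return (pi1 / (1 - pi1)) / (pi0 / (1 - pi0))


def risk_ratio(pi1, pi0):
    """Risk ratio of disease for exposed versus unexposed."""
    return pi1 / pi0


# Hypothetical baseline risks; the exposed risk is always twice the baseline
for pi0 in (0.001, 0.01, 0.1, 0.3):
    pi1 = 2 * pi0
    print(f"pi0 = {pi0}: risk ratio = {risk_ratio(pi1, pi0):.2f}, "
          f"odds ratio = {odds_ratio(pi1, pi0):.2f}")
```

For small π0 the two measures are almost identical; as the disease becomes common, the odds ratio moves further from 1 than the risk ratio.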
13
9. The likelihood for a model for a case-control study
We are now interested in the likelihood for a case-control study. We must calculate the likelihood for a model for the odds ratio, since this is the only parameter that can be estimated from such a study. The likelihood depends on which approach we adopt to analyze the data, prospective or retrospective. Fortunately, it can be shown that the results are exactly the same regarding the odds ratio estimate and its variance. Here we show only the prospective approach, since it is the one used for setting up logistic regression models (described below). The likelihood for the retrospective approach can be found in Clayton and Hills, page 166, where you can also see that the two approaches lead to identical results.
14
9.1 The prospective likelihood
Imagine the following case-control study:

Exposure     Cases   Controls   Total subjects
Exposed      D1      H1         N1 = D1 + H1
Unexposed    D0      H0         N0 = D0 + H0
Total        D       H          N = D + H

Recall that ω1 = p1/(1-p1) is the odds of being a case for the exposed given that the subject is in the study, where p1 is the probability of being a case for the exposed given that the subject is in the study. Similarly, ω0 = p0/(1-p0) is the odds of being a case for the unexposed given that the subject is in the study, where p0 is the probability of being a case for the unexposed given that the subject is in the study. In addition, define θ = ω1 / ω0.
15
[Reminder] The likelihood function describes the plausibility of a model parameter value, given specific observed data. The likelihood of a set of parameter values given the observed outcomes is the probability of those observed outcomes given the set of parameter values. It is given by the product of the probability functions evaluated at the specific data values (X) of each individual in the sample. It expresses how plausible the values of the parameter θ are given the sample (i.e., given the specific data values that we have).
16
To find the likelihood function, recall that the data here are binary responses (case vs. control) and come from two independent sets of Bernoulli trials, one for the exposed and one for the unexposed.

Recall that for N independent Bernoulli trials with D successes, where π is the probability of success, the likelihood function is given by
$$\text{Likelihood} = \pi^{D} (1-\pi)^{N-D}$$
and the log-likelihood is
$$\text{log-likelihood} = D \log(\pi) + (N-D) \log(1-\pi)$$
Recall that the odds of success are ω = π/(1-π), so that π = ω(1-π) and 1-π = 1/(1+ω). The log-likelihood can therefore be expressed as a function of ω:
$$\begin{aligned}
\text{log-likelihood} &= D \log(\omega(1-\pi)) + (N-D) \log(1-\pi) \\
&= D \log(\omega) + D \log(1-\pi) + N \log(1-\pi) - D \log(1-\pi) \\
&= D \log(\omega) + N \log(1-\pi) \\
&= D \log(\omega) + N \log\left(\frac{1}{1+\omega}\right) \\
&= D \log(\omega) - N \log(1+\omega) \qquad (1)
\end{aligned}$$
In a case-control study under the prospective approach, the log-likelihood for the data is then
$$\text{log-likelihood} = D_1 \log(p_1) + (N_1-D_1) \log(1-p_1) + D_0 \log(p_0) + (N_0-D_0) \log(1-p_0)$$
and from (1) above it follows that
$$\text{log-likelihood} = D_1 \log(\omega_1) - N_1 \log(1+\omega_1) + D_0 \log(\omega_0) - N_0 \log(1+\omega_0)$$
and, by substituting ω1 = θ ω0:
$$\text{log-likelihood} = D_1 \log(\theta\omega_0) - N_1 \log(1+\theta\omega_0) + D_0 \log(\omega_0) - N_0 \log(1+\omega_0)$$
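The final expression can be evaluated directly. The sketch below codes the log-likelihood as a function of θ and ω0 and checks numerically that it is highest at θ = D1 H0 / (D0 H1) and ω0 = D0/H0. The counts are those of the Mwanza example in section 11; the function name and the perturbation factors are assumptions made for this illustration.

```python
import math


def log_lik(theta, omega0, d1, h1, d0, h0):
    """Prospective log-likelihood written in terms of theta and omega0."""
    n1, n0 = d1 + h1, d0 + h0
    omega1 = theta * omega0
    return (d1 * math.log(omega1) - n1 * math.log(1 + omega1)
            + d0 * math.log(omega0) - n0 * math.log(1 + omega0))


# Counts from the Mwanza example: exposed = some education
d1, h1, d0, h0 = 140, 311, 49, 263

theta_hat = (d1 * h0) / (d0 * h1)   # cross-product ratio
omega0_hat = d0 / h0                # odds of being a case among the unexposed
best = log_lik(theta_hat, omega0_hat, d1, h1, d0, h0)

# Moving either parameter away from these values lowers the log-likelihood
for factor in (0.8, 0.9, 1.1, 1.25):
    print(factor,
          log_lik(theta_hat * factor, omega0_hat, d1, h1, d0, h0) - best,
          log_lik(theta_hat, omega0_hat * factor, d1, h1, d0, h0) - best)
```

Every printed difference is negative, which is what one expects if the cross-product ratio is the most likely value of θ.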
17
The unknown parameters here are θ and ω0. The exact solution of this function is very complicated. For approximate solutions, the likelihood is used to estimate θ by replacing ω0 with its most likely value for given θ; in this way we obtain a profile likelihood curve. This does not lead to a simple algebraic expression, but we can approximate it with a Gaussian distribution (see Clayton and Hills, p. 169, and the last section of this course). According to this, the most likely values of log(ω1) and log(ω0) are
$$M_1 = \log(D_1/H_1), \qquad M_0 = \log(D_0/H_0)$$
with corresponding standard deviations
$$S_1 = \sqrt{\frac{1}{D_1} + \frac{1}{H_1}}, \qquad S_0 = \sqrt{\frac{1}{D_0} + \frac{1}{H_0}}$$
The most likely value of log(θ) under the Gaussian approximation can be proved to be
$$M = M_1 - M_0 = \log\left(\frac{D_1/H_1}{D_0/H_0}\right)$$
and the standard deviation of the Gaussian approximation to the log-likelihood is
$$S = \sqrt{S_1^2 + S_0^2} = \sqrt{\frac{1}{D_1} + \frac{1}{H_1} + \frac{1}{D_0} + \frac{1}{H_0}}$$
so that a 95% confidence interval for the log of the odds ratio can be obtained as M ± 1.96 S.
18
$$\text{95\% CI for the odds ratio} = \exp(M \pm 1.96\,S)$$
i.e., the estimated odds ratio divided and multiplied by exp(1.96 S).

These results are exactly the same as those obtained using the retrospective approach. This continues to be the case for more complex patterns of exposure and, since the prospective approach is more convenient in these situations, it is to be preferred. This estimate of S is often referred to as Woolf's method. When the sample size is «sufficiently large» this method yields perfectly adequate approximations.

What constitutes «sufficiently large»? There is no established rule for assessing whether the data are adequate for the application of approximate methods. As a rough indication, approximate methods should provide reasonable results if each cell of the table (D1, D0, H1, H0) is greater than or equal to 10. When this condition does not hold, many authors suggest that one should present exact confidence limits. STATA can give you exact CIs.
19
10. Test of the null hypothesis that the true odds ratio = 1
The null hypothesis is tested with the score test
$$\frac{U^2}{V} \sim \chi^2 \text{ on 1 degree of freedom}$$
where
$$U = D_1 - E_1, \qquad E_1 = \frac{D\,N_1}{N}, \qquad V = \frac{D\,H\,N_1\,N_0}{N^2 (N-1)}$$
[E1 is the expected number of exposed cases under the null hypothesis; D is the total number of cases, exposed and unexposed; H is the total number of controls, exposed and unexposed; N0 is the total number of unexposed individuals, cases and controls; N1 is the total number of exposed individuals, cases and controls; N is the total number of individuals in the table.]

The only difference between this formula for the score test and the classical chi-squared formula is the replacement of N by (N-1). This version (which is used by STATA) is known as the Mantel-Haenszel chi-squared test (without continuity correction). An extension of this test is used to test the null hypothesis that the true odds ratio is 1 when the data are stratified to control for a confounding variable.
20
11. An example of a case-control study
Clayton and Hills provide an example of a case-control study that was carried out in Mwanza, Tanzania, to investigate risk factors for HIV infection among females aged 15 years or more. Cases were women found to be HIV-positive in a study conducted in 13 communities. Controls were women selected from those who were found to be HIV-negative. The question here is whether a woman's educational level is associated with the risk of HIV infection. The results are shown in the table below:

Educational level   Some education   No education   Total
Cases               140 (74%)        49             189
Controls            311 (54%)        263            574
Total               451              312            763

The odds ratio of HIV infection in this study is:
$$\mathrm{OR} = \frac{140 \times 263}{49 \times 311} = 2.42$$
To test the null hypothesis that the true odds ratio of HIV infection comparing educated with uneducated women is 1, we perform the score test we saw before: U²/V = 23.2 on 1 df, p < 0.0001.
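The odds ratio and the score statistic quoted above can be recomputed from the 2x2 table with the formulas of section 10. This is a minimal sketch; the function name `score_test` is an assumption made for the example.

```python
def score_test(d1, h1, d0, h0):
    """Score (Mantel-Haenszel) test of OR = 1 for a single 2x2 table."""
    d, h = d1 + d0, h1 + h0          # total cases, total controls
    n1, n0 = d1 + h1, d0 + h0        # total exposed, total unexposed
    n = n1 + n0
    e1 = d * n1 / n                  # expected exposed cases under the null
    u = d1 - e1
    v = d * h * n1 * n0 / (n ** 2 * (n - 1))
    return u, v, u ** 2 / v


# Mwanza data: exposed = some education, unexposed = no education
d1, h1, d0, h0 = 140, 311, 49, 263
odds_ratio = (d1 * h0) / (d0 * h1)
u, v, chi2 = score_test(d1, h1, d0, h0)
print(f"OR = {odds_ratio:.2f}, U = {u:.2f}, V = {v:.2f}, U^2/V = {chi2:.1f}")
```

The statistic is compared with the chi-squared distribution on 1 degree of freedom, whose 5% critical value is 3.84.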
21
This result indicates that the observed data are not compatible with the null hypothesis. Thus there is strong evidence of an association between schooling and risk of HIV infection in this population. This is consistent with the 95% confidence interval that can be calculated with the formula we saw before (you would find a lower confidence limit of 1.68).

This test does not perform well if the sample size is small (N < 20, or N < 40 and one of the expected numbers in the cells of the 2x2 table is less than 5). In such cases an «exact» test (Fisher's exact test) may be used.
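The confidence interval mentioned above follows from the Gaussian approximation of section 9.1. The sketch below applies it to the Mwanza table; the function name `woolf_ci` is an assumption made for the example.

```python
import math


def woolf_ci(d1, h1, d0, h0, z=1.96):
    """Odds ratio with an approximate 95% CI based on log(OR) and its SD."""
    m = math.log((d1 / h1) / (d0 / h0))               # M = log odds ratio
    s = math.sqrt(1 / d1 + 1 / h1 + 1 / d0 + 1 / h0)  # S = SD of log odds ratio
    return math.exp(m), math.exp(m - z * s), math.exp(m + z * s)


# Mwanza data: exposed = some education, unexposed = no education
odds_ratio, lower, upper = woolf_ci(140, 311, 49, 263)
print(f"OR = {odds_ratio:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

The interval excludes 1, in agreement with the result of the score test.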
22
12. Controlling for confounding
For the example we saw before, a possible explanation for the observed association between schooling and risk of HIV is the confounding effect of age. Women who have been most sexually active, and thus at highest risk of HIV infection, may be younger women (age may be associated with the outcome). In addition, school attendance rates for women in Tanzania may have increased over the last 30 years, so younger women have a higher probability of having attended school (age is associated with the exposure). Suppose women are classified into 4 age groups (15-19, 20-29, 30-44, and 45+). An analysis stratified by age group is shown below:
23
Age 15-19    Schooling: Yes   Schooling: No   Total
Cases        9 (69%)          4               13
Controls     78 (81%)         18              96
Total        87               22              109
OR1 = 0.52

Age 20-29    Schooling: Yes   Schooling: No   Total
Cases        82 (85%)         14              96
Controls     144 (75%)        48              192
Total        226              62              288
OR2 = 1.95

Age 30-44    Schooling: Yes   Schooling: No   Total
Cases        44 (70%)         19              63
Controls     77 (40%)         115             192
Total        121              134             255
OR3 = 3.46

Age 45+      Schooling: Yes   Schooling: No   Total
Cases        5 (29%)          12              17
Controls     12 (13%)         82              94
Total        17               94              111
OR4 = 2.85

By looking at these tables, is age a possible confounder? What is your conclusion?

Mantel-Haenszel weights for each stratum: the smallest weights are given to strata 1 and 4 (ages 15-19 and 45+), in which there are fewest subjects.
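The Mantel-Haenszel weights and the pooled odds ratio can be computed from the four age-specific tables. The sketch assumes the usual Mantel-Haenszel form, in which stratum i has weight D0i H1i / Ni and the pooled estimate is the sum of D1i H0i / Ni divided by the sum of the weights; the variable names are chosen for the example.

```python
# Age stratum -> (d1, h1, d0, h0): exposed cases, exposed controls,
# unexposed cases, unexposed controls (exposed = attended school)
strata = {
    "15-19": (9, 78, 4, 18),
    "20-29": (82, 144, 14, 48),
    "30-44": (44, 77, 19, 115),
    "45+": (5, 12, 12, 82),
}

weights = {}
numerator = 0.0
for age, (d1, h1, d0, h0) in strata.items():
    n = d1 + h1 + d0 + h0
    weights[age] = d0 * h1 / n      # Mantel-Haenszel weight of the stratum
    numerator += d1 * h0 / n

mh_odds_ratio = numerator / sum(weights.values())

for age, w in weights.items():
    d1, h1, d0, h0 = strata[age]
    print(f"{age}: OR = {(d1 * h0) / (d0 * h1):.2f}, weight = {w:.2f}")
print(f"Mantel-Haenszel OR = {mh_odds_ratio:.2f}")
```

The pooled estimate is a weighted average of the stratum-specific odds ratios, so the two middle age groups dominate it.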
24
In the same way as we saw for the rates, we can perform a test of the null hypothesis for the Mantel-Haenszel odds ratio (MHOR), i.e., after stratification, and calculate a 95% CI. We can, and should, also test for homogeneity of the ORs across strata before reporting the "adjusted" OR.
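A minimal sketch of the stratified test follows. It assumes the standard extension of the score test of section 10, in which U and V are computed within each age stratum and summed over strata before forming U²/V; the age-specific counts are those of the stratified tables, and the names are chosen for the example.

```python
# Age stratum -> (d1, h1, d0, h0): exposed cases, exposed controls,
# unexposed cases, unexposed controls (exposed = attended school)
strata = {
    "15-19": (9, 78, 4, 18),
    "20-29": (82, 144, 14, 48),
    "30-44": (44, 77, 19, 115),
    "45+": (5, 12, 12, 82),
}

u_total = 0.0
v_total = 0.0
for d1, h1, d0, h0 in strata.values():
    d, h = d1 + d0, h1 + h0
    n1, n0 = d1 + h1, d0 + h0
    n = n1 + n0
    u_total += d1 - d * n1 / n                        # observed minus expected
    v_total += d * h * n1 * n0 / (n ** 2 * (n - 1))   # variance within stratum

mh_chi2 = u_total ** 2 / v_total
print(f"U = {u_total:.2f}, V = {v_total:.2f}, U^2/V = {mh_chi2:.2f} on 1 df")
```

The statistic is again referred to the chi-squared distribution on 1 degree of freedom (5% critical value 3.84).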
25
13. Fitting models for the log odds
Models of the form log(ω) = α + βx can be fitted using logistic regression with outcome Y, which takes the value 1 for a case and 0 for a control. Recall that ω is the odds of being a case in the specific case-control study (prospective approach). Essentially, these are also models for log(π/(1-π)) because, as we saw, ω = K π/(1-π), where K is the ratio of the sampling fractions for cases and controls. Therefore log(ω) = log(K) + log(π/(1-π)), and log(K) cancels out when estimating odds ratios.

For example, imagine that we have an exposure with two levels, so that x is a dummy variable indicating the exposed subjects. Define ω1 as the odds of being a case in the study given exposure and ω0 as the odds of being a case in the study given non-exposure. Then, according to the logistic regression model:
log(ω1) - log(ω0) = (α + β) - α = β
and, in terms of the population:
log(ω1) - log(ω0) = log(K) + log(π1/(1-π1)) - log(K) - log(π0/(1-π0)) = log(π1/(1-π1)) - log(π0/(1-π0)) = log[(π1/(1-π1)) / (π0/(1-π0))]

Therefore β is an estimate of the log of the true odds ratio (in the population), and exp(β) estimates the odds ratio itself. More accurately, the β coefficients are log odds ratios of being a case in the study, which equal the log odds ratios of being a failure in the population. Note, however, that α does not have any meaning for the population: it just reflects the numbers of unexposed cases and controls in the design.
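For a single binary exposure the logistic model can be fitted with a few Newton-Raphson steps, which makes the meaning of α and β easy to verify. The sketch below is a bare-bones illustration rather than a substitute for a statistical package: the function name, the starting values and the fixed number of iterations are assumptions, and the counts are those of the Mwanza example in section 11.

```python
import math


def fit_logistic_2x2(d1, h1, d0, h0, iterations=25):
    """Newton-Raphson fit of log(omega) = alpha + beta * x for binary x."""
    groups = [(0, d0, d0 + h0), (1, d1, d1 + h1)]   # (x, cases, total)
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = i_aa = i_ab = i_bb = 0.0
        for x, d, n in groups:
            p = 1 / (1 + math.exp(-(alpha + beta * x)))
            w = n * p * (1 - p)
            g_a += d - n * p            # score for alpha
            g_b += x * (d - n * p)      # score for beta
            i_aa += w                   # information matrix entries
            i_ab += x * w
            i_bb += x * x * w
        det = i_aa * i_bb - i_ab ** 2
        alpha += (i_bb * g_a - i_ab * g_b) / det
        beta += (i_aa * g_b - i_ab * g_a) / det
    return alpha, beta


alpha, beta = fit_logistic_2x2(140, 311, 49, 263)
print(f"alpha = {alpha:.3f}  (log of D0/H0 = {math.log(49 / 263):.3f})")
print(f"beta  = {beta:.3f}  (exp(beta) = {math.exp(beta):.2f})")
```

The fitted α equals log(D0/H0), a quantity fixed largely by how many controls were sampled, while exp(β) reproduces the cross-product odds ratio.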
26
14. Exposure classified in >2 levels Example (continued)
In the study in Tanzania, suppose that we have also obtained data on the number of years of schooling. Let us assume that we are interested in the association between duration of schooling and risk of HIV infection. We categorize the variable years of schooling into the 4 categories shown in the table below. We then calculate odds ratios for each exposure group relative to a baseline group, the latter usually being the group of non-exposed. Note that it is sometimes preferable to use the largest group as the referent, because this gives the most stable odds ratio estimates.

Years of schooling      None             1-3          4-6          7+           Total
Cases                   49               24           110          6            189
Controls                263              51           255          5            574
Total                   312              75           365          11           763
Estimated odds ratio    1.0 (baseline)   2.53 (OR1)   2.32 (OR2)   6.44 (OR3)
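The odds ratios in the last row of the table are obtained by dividing the case/control ratio of each level by that of the baseline group. A short sketch, with variable names chosen for the example:

```python
# Years of schooling -> (cases, controls), from the table above
levels = {"None": (49, 263), "1-3": (24, 51), "4-6": (110, 255), "7+": (6, 5)}

base_cases, base_controls = levels["None"]
baseline_odds = base_cases / base_controls

# Odds ratio of each level relative to the group with no schooling
odds_ratios = {
    name: (cases / controls) / baseline_odds
    for name, (cases, controls) in levels.items()
}

for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
```

The estimate for 7+ years rests on only 11 subjects, which is why it is so much less stable than the others.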
27
Tests of null hypotheses and 95% confidence intervals may be obtained for the odds ratio at each level of exposure compared with the baseline (e.g. score test or Wald test), using the methods described for the simple case. This approach, however, does not answer the question of the overall effect of the exposure. Another approach is to apply an overall significance test of the null hypothesis that the true odds ratios for all exposure levels are equal to 1 (ORi = 1 for all i). Depending on the alternative hypothesis, we obtain two versions of the test:
- That at least one true ORi is not equal to 1 (LR test comparing the models with and without the exposure).
- That the true odds ratio increases (or decreases) with increasing level of exposure.

The second alternative, that there is a trend in the odds ratio, is usually called a dose-response effect.
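The first version of the overall test can be sketched without fitting software, because with a single categorical exposure both models have closed-form fitted probabilities: the model with the exposure fits the observed proportion of cases at each level, and the model without it fits the overall proportion. The counts are those of the years-of-schooling table; the function and variable names are assumptions made for the example.

```python
import math

# (cases, controls) at each level of schooling: None, 1-3, 4-6, 7+
levels = [(49, 263), (24, 51), (110, 255), (6, 5)]


def binomial_loglik(cases, controls, p):
    """Log-likelihood of `cases` and `controls` when P(case) = p."""
    return cases * math.log(p) + controls * math.log(1 - p)


# Model with exposure: a separate probability of being a case at each level
loglik_with = sum(binomial_loglik(d, h, d / (d + h)) for d, h in levels)

# Model without exposure: one common probability of being a case
total_cases = sum(d for d, _ in levels)
total_controls = sum(h for _, h in levels)
loglik_without = binomial_loglik(
    total_cases, total_controls, total_cases / (total_cases + total_controls)
)

lr_statistic = 2 * (loglik_with - loglik_without)
df = len(levels) - 1
print(f"LR statistic = {lr_statistic:.2f} on {df} df")
```

The statistic is compared with the chi-squared distribution on 3 degrees of freedom, whose 5% critical value is 7.81.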
28
[Figure: three ways of modelling an exposure: linear effect, factored (categorical) effect, quadratic effect]

A quadratic effect is parameterized by entering, in addition to the continuous variable, a new variable that equals the square of the original one. The quadratic effect is the simplest form of a non-linear effect. Recall from the GLM course that you can compare a model in which a variable is specified as a linear effect with models in which the same variable is specified as a quadratic effect or as categorical (nested models, LR test).
30
15. Interactions

With logistic regression we can test for effect modification in a very flexible way. When you fit an interaction term, you can test for effect modification by using the likelihood ratio statistic to test the null hypothesis of no interaction. Recall that the statistic is defined as twice the difference in log-likelihoods between the more complicated model (with the interaction) and the simpler model (without the interaction).

Note that when two factors with, say, K and L levels are modelled as categorical and we want to test for interaction between them (say one is the exposure and the other a potential confounder), including the interaction requires the estimation of (L-1) x (K-1) additional parameters, one for each new indicator variable. For large K and L this estimation becomes problematic and the power to detect an interaction will be low, let alone the difficulty of interpreting the results, especially in the presence of an interaction. The usual practice in these circumstances is to:
- Fit interactions with simpler factors, for example by reducing the number of categories in one or the other factor.
- Fit the interaction terms with a linear effect, even though the main effect may be fitted as an ordered categorical factor.
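In symbols, the test of no interaction between two categorical factors can be written as follows. The notation ℓ for the maximized log-likelihoods is an assumption introduced here, and the chi-squared reference distribution holds approximately in large samples:

```latex
\mathrm{LRS} \;=\; 2\left(\ell_{\text{with interaction}} - \ell_{\text{without interaction}}\right)
\;\sim\; \chi^2 \ \text{on}\ (K-1)(L-1)\ \text{degrees of freedom}
\quad \text{under the null hypothesis of no interaction.}
```

For example, an exposure with K = 4 levels and a confounder with L = 4 levels would need 9 interaction parameters, which illustrates why simpler codings are usually preferred.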
31
16. Logistic regression in case-control studies: Some remarks on interpretation
Assume x is an indicator variable for exposure, 0 (unexposed) / 1 (exposed), and z is a potential confounder with 3 levels, i.e.:
- z2 is an indicator variable for level 2 of the confounder: 1 (level 2) / 0 (otherwise)
- z3 is an indicator variable for level 3 of the confounder: 1 (level 3) / 0 (otherwise)

1) The basic model

log(π/(1-π)) = α + βx

Exposure   log(π/(1-π))
0          α (no interpretation for the population)
1          α + β (no interpretation for the population)

log(π/(1-π) | x=1) - log(π/(1-π) | x=0) = β (difference in the log odds of the disease between the exposed and the non-exposed)
32
2) The model with the confounder

log(π/(1-π)) = α + βx + γ1z2 + γ2z3

Stratum of the confounder        Exposure 0   Exposure 1    log(π/(1-π) | x=1, z) - log(π/(1-π) | x=0, z)
0 (level 1: z2 = 0, z3 = 0)      α            α + β         β
1 (level 2: z2 = 1)              α + γ1       α + γ1 + β    β
2 (level 3: z3 = 1)              α + γ2       α + γ2 + β    β

The entries in the two exposure columns are log odds and have no interpretation for the population. In each row, the last column is the difference in the log odds of the disease between the exposed and the non-exposed in that stratum of the confounder.

β = difference in the log odds of the disease between the exposed and the non-exposed, adjusted for the confounder.
33
In multiplicative form
With ω0 = exp(α), θ = exp(β), φ1 = exp(γ1) and φ2 = exp(γ2):

Stratum of the confounder        Exposure 0   Exposure 1   (π/(1-π) | x=1, z) / (π/(1-π) | x=0, z)
0 (level 1: z2 = 0, z3 = 0)      ω0           ω0 θ         θ
1 (level 2: z2 = 1)              ω0 φ1        ω0 θ φ1      θ
2 (level 3: z3 = 1)              ω0 φ2        ω0 θ φ2      θ

The entries in the two exposure columns are odds and have no interpretation for the population. In each row, the last column is the odds ratio of the disease for the exposed compared to the non-exposed in that stratum of the confounder.

θ = odds ratio of the disease for the exposed compared to the non-exposed, adjusted for the confounder.
34
3) Interactions (effect modification)
log(π/(1-π)) = α + βx + γ1z2 + γ2z3 + δ1(xz2) + δ2(xz3)

Stratum of the confounder        Exposure 0   Exposure 1         log(π/(1-π) | x=1, z) - log(π/(1-π) | x=0, z)
0 (level 1: z2 = 0, z3 = 0)      α            α + β              β
1 (level 2: z2 = 1)              α + γ1       α + γ1 + β + δ1    β + δ1
2 (level 3: z3 = 1)              α + γ2       α + γ2 + β + δ2    β + δ2

The entries in the two exposure columns are log odds and have no interpretation for the population. In each row, the last column is the difference in the log odds of the disease between the exposed and the non-exposed in that stratum of the confounder; it now differs from stratum to stratum.
35
In multiplicative form
With ω0 = exp(α), θ = exp(β), φ1 = exp(γ1), φ2 = exp(γ2), ρ1 = exp(δ1) and ρ2 = exp(δ2):

Stratum of the confounder        Exposure 0   Exposure 1      (π/(1-π) | x=1, z) / (π/(1-π) | x=0, z)
0 (level 1: z2 = 0, z3 = 0)      ω0           ω0 θ            θ
1 (level 2: z2 = 1)              ω0 φ1        ω0 θ φ1 ρ1      θ ρ1
2 (level 3: z3 = 1)              ω0 φ2        ω0 θ φ2 ρ2      θ ρ2

The entries in the two exposure columns are odds and have no interpretation for the population. In each row, the last column is the odds ratio of the disease for the exposed compared to the non-exposed in that stratum of the confounder.

No single adjusted odds ratio of the disease for the exposed compared to the non-exposed can be estimated, because the odds ratio differs between strata. Make sure that you know how to interpret the parameters φ1, φ2, ρ1, ρ2.
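To see how the stratum-specific odds ratios are read off a fitted interaction model, the following sketch converts coefficients to the multiplicative scale. The coefficient values are hypothetical, invented purely for illustration; they do not come from any data set in this lecture.

```python
import math

# Hypothetical fitted coefficients (log scale), for illustration only
beta = 0.70      # log odds ratio for exposure in stratum 0 of the confounder
delta1 = 0.30    # interaction: exposure x stratum 1
delta2 = -0.20   # interaction: exposure x stratum 2

theta = math.exp(beta)
rho1 = math.exp(delta1)
rho2 = math.exp(delta2)

# Odds ratio for exposed versus non-exposed within each stratum
stratum_odds_ratio = {0: theta, 1: theta * rho1, 2: theta * rho2}

for stratum, value in stratum_odds_ratio.items():
    print(f"stratum {stratum}: OR = {value:.2f}")
```

Here ρ1 and ρ2 are ratios of odds ratios: they say by what factor the exposure odds ratio in strata 1 and 2 differs from the one in stratum 0.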