TWO-STAGE CASE-CONTROL STUDIES USING EXPOSURE ESTIMATES FROM A GEOGRAPHICAL INFORMATION SYSTEM
Jonas Björk (1) & Ulf Strömberg (2)
(1) Competence Center for Clinical Research; (2) Occupational and Environmental Medicine; Lund University Hospital
OUTLINE OF TALK Previous project: What have we done? (Jonas Björk) Ongoing project: What shall we do? (Ulf Strömberg)
Two-stage procedure for case-control studies
1st stage: complete data obtained from registries
  Disease status
  General characteristics
  Group affiliation (e.g. occupation or residential area)
  Group-level exposure $X_G$
2nd stage: individual exposure data for a subset of the 1st stage sample
Exposure database and group-level exposure
JEM (Job Exposure Matrix): occupational group, proportion exposed
GIS: residential group (area), average concentration of an air pollutant
JEM - proportion exposed: most data typically in groups with low $X_G$
Linear Relation between Proportion Exposed and Relative Risk
No confounding between/within groups
Example: RR (exposed vs. unexposed) = 2.0
Proportion exposed $X_G$: average RR
  0%: 1.0
  10%: 0.90 × 1.0 + 0.10 × 2.0 = 1.1
  50%: 1.5
  100%: 2.0
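The arithmetic in the example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `average_rr` and its parameters are hypothetical, and it assumes, as the slide does, no confounding between or within groups.

```python
def average_rr(proportion_exposed, rr_exposed=2.0):
    """Average relative risk in a group, relative to a fully unexposed group.

    Assumes unexposed subjects have RR = 1 and exposed subjects share one RR.
    """
    unexposed_share = 1.0 - proportion_exposed
    return unexposed_share * 1.0 + proportion_exposed * rr_exposed


# Average RR rises linearly with the proportion exposed.
for x_g in (0.0, 0.10, 0.50, 1.0):
    print(f"X_G = {x_g:.0%}: average RR = {average_rr(x_g):.2f}")
```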
Linear OR model: $\mathrm{OR}(X_G) = 1 + \beta X_G$
$X_G$ = exposure proportion
OR for exposed vs. unexposed = $\mathrm{OR}(1) = 1 + \beta$
[Figure: OR plotted against $X_G$ from 0 to 1, rising linearly from 1 to OR(1)]
Most data typically in groups with low $X_G$
Confounding between groups
General confounders (e.g. gender and age) can normally be adjusted for
Assuming no confounding within groups and no effect modification in any stratum $s_k$:
$\mathrm{OR}(X_G; s_1, s_2, \ldots, s_k) = (1 + \beta X_G)\,\exp\left(\sum_k \gamma_k s_k\right)$
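As a minimal sketch, the linear OR model with multiplicative confounder adjustment could be evaluated as follows. The function name `odds_ratio` and the argument names `gammas` and `strata` are hypothetical, chosen only to mirror the symbols in the formula.

```python
import math


def odds_ratio(x_g, beta, gammas=(), strata=()):
    """OR(X_G; s_1..s_k) = (1 + beta * X_G) * exp(sum_k gamma_k * s_k).

    x_g    : group-level exposure proportion
    beta   : excess OR for exposed vs. unexposed, so OR(1) = 1 + beta
    gammas : confounder coefficients gamma_k (hypothetical name)
    strata : confounder values s_k (hypothetical name)
    """
    linear_part = 1.0 + beta * x_g
    confounder_part = math.exp(sum(g * s for g, s in zip(gammas, strata)))
    return linear_part * confounder_part


# With no confounders, OR(1) is simply 1 + beta.
print(odds_ratio(1.0, beta=1.0))
```

Keeping the exposure term linear and the confounder term log-linear follows the slide's assumption of no effect modification in any stratum.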
Combining 1st and 2nd stage data
Assumption: 2nd stage data are missing at random, conditional on disease status and 1st stage group affiliation
For subjects with missing 2nd stage data: use 1st stage data to calculate the expected number of exposed/unexposed
Expectation-maximization (EM) algorithm
EM-algorithm (Wacholder & Weinberg 1994)
1. Select a starting value, e.g. OR = 1
2. E-step: among the non-participants, calculate the expected number of exposed/unexposed cases and controls in each group
3. M-step: maximize the likelihood for the observed + expected cell frequencies using the chosen risk model for individual-level data (not necessarily linear), yielding a new OR estimate
4. Repeat 2. and 3. until convergence
E-step in our situation (Strömberg & Björk, submitted)
Complete the data in each group G:
$m_0$ controls with missing 2nd stage data: expected number of exposed = $m_0 X_G$
$m_1$ cases with missing 2nd stage data: expected number of exposed = $m_1 X_G \widehat{\mathrm{OR}} / [1 + (\widehat{\mathrm{OR}} - 1) X_G]$
$\widehat{\mathrm{OR}}$ = current OR estimate
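The E-step above, combined with the iteration from the EM slide, might be sketched as follows. This is not the authors' implementation. It assumes a single dichotomous exposure and no confounding between groups, so the M-step reduces to the cross-product ratio of the pooled, completed 2×2 table. The function names and the dictionary keys are hypothetical.

```python
def e_step(or_hat, x_g, m_cases, m_controls):
    """Expected number of exposed among subjects with missing 2nd stage data."""
    p_exposed_case = x_g * or_hat / (1.0 + (or_hat - 1.0) * x_g)
    return m_cases * p_exposed_case, m_controls * x_g


def em_odds_ratio(groups, or_start=1.0, tol=1e-8, max_iter=1000):
    """Iterate E- and M-steps until the OR estimate stabilizes.

    Each group is a dict with keys (all hypothetical): x_g, cases_exposed,
    cases_unexposed, controls_exposed, controls_unexposed, m_cases, m_controls.
    """
    or_hat = or_start
    for _ in range(max_iter):
        a1 = b1 = a0 = b0 = 0.0  # completed pooled 2x2 table
        for g in groups:
            exp_cases, exp_controls = e_step(
                or_hat, g["x_g"], g["m_cases"], g["m_controls"])
            a1 += g["cases_exposed"] + exp_cases
            b1 += g["cases_unexposed"] + g["m_cases"] - exp_cases
            a0 += g["controls_exposed"] + exp_controls
            b0 += g["controls_unexposed"] + g["m_controls"] - exp_controls
        # M-step (simplifying assumption): crude OR of the completed table.
        or_new = (a1 * b0) / (b1 * a0)
        if abs(or_new - or_hat) < tol:
            return or_new
        or_hat = or_new
    return or_hat
```

With no missing 2nd stage data the function returns the ordinary crude odds ratio, which is a convenient sanity check. A model with confounder adjustment would need a regression fit in the M-step instead of the closed-form ratio.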
Simulated case-control studies
400 cases, 1200 controls in the 1st stage
2nd stage participation: 75% of the cases, 25% of the controls
Selective participation of 2nd stage controls: Corr(Participation, $X_G$) = 0, > 0, or < 0
Repeated replications in each scenario
True OR = 3
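One replication of such a study could be generated roughly as below. The slide does not specify the data-generating details, so the following are assumptions of this sketch: equal-sized groups in the source population, a rare disease (controls reflect the population exposure), non-selective participation (the Corr = 0 scenario), and hypothetical group exposure proportions passed in as `x_groups`.

```python
import random


def simulate_study(x_groups, true_or=3.0, n_cases=400, n_controls=1200,
                   p_part_cases=0.75, p_part_controls=0.25, seed=0):
    """Simulate per-group 2nd stage counts and numbers with missing data."""
    rng = random.Random(seed)
    groups = [dict(x_g=x, cases_exposed=0, cases_unexposed=0,
                   controls_exposed=0, controls_unexposed=0,
                   m_cases=0, m_controls=0) for x in x_groups]
    # Cases come more often from high-exposure groups: weight = average OR.
    case_weights = [1.0 + (true_or - 1.0) * x for x in x_groups]
    control_weights = [1.0] * len(x_groups)
    plan = (("cases", n_cases, p_part_cases, case_weights),
            ("controls", n_controls, p_part_controls, control_weights))
    for label, n, p_part, weights in plan:
        for _ in range(n):
            g = rng.choices(groups, weights=weights)[0]
            x = g["x_g"]
            if label == "cases":
                p_exposed = x * true_or / (1.0 + (true_or - 1.0) * x)
            else:
                p_exposed = x
            exposed = rng.random() < p_exposed
            if rng.random() < p_part:  # participates in the 2nd stage
                status = "exposed" if exposed else "unexposed"
                g[f"{label}_{status}"] += 1
            else:                      # exposure stays unobserved
                g[f"m_{label}"] += 1
    return groups


study = simulate_study([0.05, 0.10, 0.30, 0.60])
```

Selective participation could be introduced by letting `p_part_controls` depend on the group's exposure proportion.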
Simulations - Results
Table (numerical entries not preserved): OR, SD and coverage for three analyses (1st stage data only, 2nd stage data only, EM method) under three participation scenarios: Corr(Part., $X_G$) = 0, Corr(Part., $X_G$) < 0, Corr(Part., $X_G$) > 0
SD = Empirical standard deviation of the ln(OR) estimates
Coverage = Coverage of 95% confidence intervals
Simulations - Conclusions
Combining 1st and 2nd stage data using the EM method can:
1. Improve precision
2. Remove bias from selective participation
Method is sensitive to errors in the (1st stage) external exposure data!
Simulations - Conclusions II
EM-method is sensitive to
1. Violations of the MAR assumption (conditional on disease status and 1st stage group affiliation)
2. Errors in the (1st stage) external exposure data
Ongoing methodological research project Focus on exposure estimates from a GIS
GIS data: NO2 (Scania)
Two-stage exposure assessment procedure
[Figure: geographical areas with group-level exposures, e.g. $X_G$ = 4.8 and $X_G$ = 10.1, and individual exposures $x_i$ within areas]
1st stage: $X_G$ represents mean exposure levels rather than proportion exposed
2nd stage: $x_i$ is a continuous, rather than a dichotomous, exposure variable
Assume a linear relation between $x_i$ and disease odds (cf. radon exposure and lung cancer [Weinberg et al., 1996]).
[Figure: odds plotted against $x_i$]
For the "only 1st stage" subjects: no bias is expected from using their $X_G$ values (Berkson errors), provided the 2nd stage data are missing at random within each group, independently of disease status.
EM method? Exposure variation in each group?
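The no-bias statement for Berkson errors can be motivated in one step. The derivation below is a sketch under assumptions not spelled out on the slide: the disease is rare enough that group-level odds are approximately the average of individual odds, and the individual deviation $e_i = x_i - X_G$ has mean zero within each group. The symbols $\alpha$ (baseline odds), $\beta$ (slope) and $e_i$ are introduced here for illustration.

```latex
\begin{aligned}
\mathrm{odds}(x_i) &= \alpha\,(1 + \beta x_i), \qquad x_i = X_G + e_i, \qquad E[e_i \mid G] = 0 \\
E[\mathrm{odds}(x_i) \mid G] &= \alpha\,\bigl(1 + \beta X_G + \beta\,E[e_i \mid G]\bigr) = \alpha\,(1 + \beta X_G)
\end{aligned}
```

Because the model is linear in $x_i$, replacing $x_i$ by the group mean $X_G$ leaves the slope $\beta$ unchanged. This would not hold in general for a log-linear model.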
Two-stage exposure assessment procedure – related work Multilevel studies with applications to a study of air pollution [Navidi et al., 1994]: pooling exposure effect estimates based on individual-level and group-level models, respectively
Collecting data on confounders or effect modifiers at 2nd stage
[Figure: geographical areas with group-level exposures, e.g. $X_G$ = 4.8 and $X_G$ = 10.1, and individual covariates $c_i$ within areas]
1st stage: $X_G$ = mean exposure levels
2nd stage: $c_i$ is a covariate, e.g. smoking history
Data on confounders or effect modifiers at 2nd stage - estimation of exposure effect
Confounder adjustment based on logistic regression: pseudo-likelihood approach [Cain & Breslow, 1988]
More general approach: EM method [Wacholder & Weinberg, 1994]
Design stage ("stage 0")
[Figure: Group 1, Group 2, Group 3, ..., with "Subjects?" marked for the groups]
1st stage: How many geographical areas (groups)?
2nd stage: Fractions of the 1st stage cases and controls?
Design stage – related work Two-stage exposure assessment: power depends more strongly on the number of groups than on the number of subjects per group [Navidi et al., 1994]
References I
Björk & Strömberg. Int J Epidemiol 2002;31:
Strömberg & Björk. "Incorporating group-level exposure information in case-control studies with missing data on dichotomous exposures". Submitted.
References II
Cain & Breslow. Am J Epidemiol 1988;128:
Navidi et al. Environ Health Perspect 1994;102(Suppl 8):
Wacholder & Weinberg. Biometrics 1994;50:
Weinberg et al. Epidemiology 1996;7:190-7.