1
Gene-Environment Case-Control Studies. Raymond J. Carroll, Department of Statistics, Faculty of Nutrition, Texas A&M University. http://stat.tamu.edu/~carroll
2
Outline. Problem: can more efficient inference be done assuming gene (G) and environment (X) independence? Gene-environment independence: the case-only method. Profile likelihood approach. Efficiency gains. Example. Conclusions.
3
Acknowledgment. This work is joint with Nilanjan Chatterjee, National Cancer Institute. Papers in: Biometrika, Genetic Epidemiology. http://dceg.cancer.gov/people/ChatterjeeNilanjan.html
4
Outline. Theoretical methods: with real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood. Key insight: this is equivalent to a device of pretending the study is a regular random sample subject to missing data. This allows generalization to any parametric model for G given X.
5
A Little Terminology. Epidemiologists: case-control sample. Econometricians: choice-based sample. These are exactly the same problem. Subjects have two choices (or disease states). Subjects have their covariates sampled conditional on their choices, i.e., a random sample from those with disease and a random sample from those without disease.
6
Basic Problem Formalized. Case-control sample: D = disease. Gene expression: G. Environment: X. Strata: S. We are interested in main effects for G and (X,S) along with their interaction.
7
Prospective Models. Simplest logistic model. General logistic model. The function m(G,X) is completely general.
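The two models named on this slide can be sketched as follows. The slide's own formulas did not survive, so the notation here is an assumption: H is the logistic distribution function, the β's are regression coefficients, β_1 is the parameter vector of m, and the strata S are left out for brevity.

```latex
% Assumed notation: H(u) = 1 / (1 + e^{-u}).
\begin{align*}
\text{Simplest logistic model:}\quad
  & \mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X), \\
\text{General logistic model:}\quad
  & \mathrm{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G, X; \beta_1)\}.
\end{align*}
```

The first line is the special case of the second with m linear in G, X and their product.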
8
Case-Control Data. Case-control data are not a random sample. We observe (G,X) given D, i.e., we observe the covariates given the response, not vice versa. If we had a random sample, linear logistic regression would be used to fit the model. Obvious idea: ignore the sampling plan and pretend you have a random sample.
9
Case-Control Data. Known fact: the intercept is not identified; the rest of the model is identified. Retrospective odds is given as
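A sketch of the retrospective odds, obtained from Bayes' rule under the general logistic model. The symbols β_0, β_1 and m(G, X; β_1) are assumed notation, not recovered from the slide.

```latex
\begin{align*}
\frac{\mathrm{pr}(G, X \mid D=1)}{\mathrm{pr}(G, X \mid D=0)}
  &= \frac{\mathrm{pr}(D=1 \mid G, X)}{\mathrm{pr}(D=0 \mid G, X)}
     \cdot \frac{\mathrm{pr}(D=0)}{\mathrm{pr}(D=1)} \\
  &= \exp\!\left\{\beta_0 + \log\frac{\mathrm{pr}(D=0)}{\mathrm{pr}(D=1)}
     + m(G, X; \beta_1)\right\}.
\end{align*}
```

The intercept β_0 enters only through its sum with log{pr(D=0)/pr(D=1)}, which is why it cannot be separated from the unknown disease rate, while β_1 is unaffected.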
10
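Alternative Derivation: Ignore Sampling Plan. Consider a prospective study. Let an indicator equal to 1 mean selection into the study. Pretend that selection depends only on disease status. Then compute the probability of disease given selection and the covariates.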
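The calculation can be sketched as below. The symbols are assumptions for illustration: Δ is the selection indicator, π_d = pr(Δ=1 | D=d) is the selection probability for disease state d, and H is the logistic distribution function.

```latex
% Pretend selection depends on (D, G, X) only through D:
%   pr(\Delta = 1 | D = d, G, X) = \pi_d.
\begin{align*}
\mathrm{pr}(D=1 \mid \Delta=1, G, X)
  &= \frac{\pi_1\, \mathrm{pr}(D=1 \mid G, X)}
          {\pi_1\, \mathrm{pr}(D=1 \mid G, X) + \pi_0\, \mathrm{pr}(D=0 \mid G, X)} \\
  &= H\{\beta_0 + \log(\pi_1/\pi_0) + m(G, X; \beta_1)\}.
\end{align*}
```

Only the intercept is shifted, by log(π_1/π_0); every other parameter is the same as in the prospective model.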
11
Case-Control Data. Fact: all parameters except the intercept can be estimated consistently while ignoring the sampling plan. Standard errors: those computed ignoring the sampling plan are asymptotically correct.
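A small simulation can illustrate this fact. It is a minimal sketch under assumed settings that are not from the talk: a single covariate X ~ N(0,1), true intercept -3, true slope 1, and 1000 cases plus 1000 controls. The function names are hypothetical.

```python
import math
import random


def expit(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-u))


def draw_case_control(n_cases, n_controls, b0, b1, rng):
    """Draw population members until both quotas are filled, so that X is
    effectively sampled given D (a case-control design)."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        x = rng.gauss(0.0, 1.0)
        d = 1 if rng.random() < expit(b0 + b1 * x) else 0
        if d == 1 and len(cases) < n_cases:
            cases.append(x)
        elif d == 0 and len(controls) < n_controls:
            controls.append(x)
    return [(x, 1) for x in cases] + [(x, 0) for x in controls]


def fit_logistic(data, iterations=25):
    """Newton-Raphson for logistic regression with an intercept and one slope,
    ignoring the sampling plan."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in data:
            p = expit(a + b * x)
            w = p * (1.0 - p)
            g0 += d - p            # score for the intercept
            g1 += (d - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(0)
data = draw_case_control(1000, 1000, b0=-3.0, b1=1.0, rng=rng)
intercept_hat, slope_hat = fit_logistic(data)
print("intercept:", round(intercept_hat, 2), "slope:", round(slope_hat, 2))
```

The fitted slope should land near the true value of 1, while the fitted intercept reflects the 1:1 case-control ratio rather than the population intercept of -3.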
12
Case-Control Data. The intercept is determined by pr(D=1) in the population, hence not identified from these data. Little-known fact: adding information about pr(D=1) adds no information about the remaining parameters.
13
Gene-Environment Independence. In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata. This assumption is often used in gene-environment interaction studies.
14
G-E Independence: Discussion. Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction. If false: possible severe bias (Albert, et al., 2001, our own simulations).
15
G-E Independence: Discussion. It is reasonable in many problems. Example: environment is a treatment in a randomized study under nested case-control sampling. Example: reasonable when exposure is not directly controlled by individual behavior, e.g., radiation exposure for A-bomb survivors, carcinogenic exposure of employees, pesticide exposure in a rural community.
16
Generalizations. I have phrased this problem as one where G and X are independent given strata. This makes sense contextually in genetic epidemiology. All the results I will describe go through if you can write down a probability model for G given (X,S): I do this in the Israeli Study.
17
Generalizations. If G is binary, it is natural to apply our approach. Posit a parametric or semiparametric model for G given (X,S). Consequences: more efficient estimation of G effects; much more efficient estimation of G and (X,S) interactions.
18
Gene-Environment Independence. Rare disease approximation: rare disease for all values of (G,X). May be unreasonable for important genes such as BRCA1/2. Case-only estimate of multiplicative interaction (Piegorsch, et al., 1994).
19
Gene-Environment Independence: Case-Only Analysis. Positive consequence: often much more powerful than the standard analysis. The power advantage of this method has often led researchers to discard information on controls. Negative consequence: no ability to estimate other risk parameters, which are often of greater interest (see example later). Restrictions: can only handle multiplicative interaction; requires rare disease for all values of (G,X).
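For binary G and binary X, the case-only estimate is the G-by-X odds ratio computed among cases alone. The sketch below assumes G-E independence and a rare disease, as the slides require. The function name and the counts are hypothetical, and the interval is an ordinary Wald interval on the log scale.

```python
import math


def case_only_interaction(n11, n10, n01, n00, z=1.96):
    """Case-only estimate of the multiplicative G-by-X interaction.

    n_gx is the number of CASES with gene status g and exposure x;
    controls are not used at all.
    Returns the odds ratio and a Wald confidence interval.
    """
    log_or = math.log((n11 * n00) / (n10 * n01))
    se = math.sqrt(1.0 / n11 + 1.0 / n10 + 1.0 / n01 + 1.0 / n00)
    interval = (math.exp(log_or - z * se), math.exp(log_or + z * se))
    return math.exp(log_or), interval


# Hypothetical counts among cases, for illustration only.
estimate, interval = case_only_interaction(n11=30, n10=20, n01=60, n00=120)
print(estimate, interval)
```

Because only the 2x2 table of cases enters, nothing here can recover the main effects of G or X, which is the negative consequence noted on the slide.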
20
Gene-Environment Independence. Fact: gain in power for inference about a multiplicative interaction. Consequence: there is thus (Fisher) information in the assumption. Conjecture: can handle general models and improve efficiency for all parameters. We do this via a semiparametric profile likelihood approach. We start, though, from a different likelihood.
21
Prentice-Pyke Calculation. Methodology: start with the retrospective likelihood. The distribution of (X,G) in the population is left unspecified. Semiparametric MLE is usual logistic regression.
22
Environment and Gene Expression. Methodology: start with the retrospective likelihood. Note how independence of G and X is used here, see the red expressions. We do not want to model the often multivariate distribution of X. Gene distribution model can be standard.
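A sketch of the retrospective likelihood under G-X independence, for one subject with disease status d. The notation is assumed for illustration: q(g; θ) is the parametric gene distribution and F is the unspecified distribution of X, with density or mass function f.

```latex
\begin{align*}
\mathrm{pr}(G=g, X=x \mid D=d)
  = \frac{\mathrm{pr}(D=d \mid g, x)\; q(g; \theta)\; f(x)}
         {\sum_{g'} \int \mathrm{pr}(D=d \mid g', x')\; q(g'; \theta)\; dF(x')}.
\end{align*}
```

Independence is what lets the joint distribution of (G, X) factor into q(g; θ) f(x) in both the numerator and the denominator.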
23
Environment and Gene Expression. Methodology: compute a profile estimate. Parametric/semiparametric distribution for G. Nonparametric distribution for X (possibly high dimensional). Result: explicit profile likelihood.
24
Environment and Gene Expression. Methodology: treat the masses of the nonparametric X distribution as distinct parameters. Let G have a parametric structure. Construct the profile likelihood, having estimated those masses as functions of the data and the other parameters. The result is a function of the remaining parameters: this function can be calculated explicitly!
25
Profile Likelihood Result:
26
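Alternative Derivation. Consider a prospective study. Let an indicator equal to 1 mean selection into the study. Pretend that selection depends only on disease status, then compute the resulting likelihood for the selected subjects. This is exactly our profile pseudo-likelihood!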
27
Alternative Derivation We compute: Standard approach computes It is this insight that allows us to greatly generalize the work past independence of G and X.
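The contrast can be sketched as follows. The formulas on the slide were lost, so this is a reconstruction under stated assumptions rather than the slide's own display: Δ is the selection indicator, π_d = pr(Δ=1 | D=d), and q(g; θ) is the gene model, with G independent of X.

```latex
\begin{align*}
\text{We compute:}\quad
  & \mathrm{pr}(D=d, G=g \mid \Delta=1, X=x)
    = \frac{\pi_d\; \mathrm{pr}(D=d \mid g, x)\; q(g; \theta)}
           {\sum_{d'} \sum_{g'} \pi_{d'}\; \mathrm{pr}(D=d' \mid g', x)\; q(g'; \theta)}, \\
\text{Standard approach computes:}\quad
  & \mathrm{pr}(D=d \mid \Delta=1, G=g, X=x).
\end{align*}
```

Replacing q(g; θ) with a model q(g | x; θ) gives the version that does not require independence.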
28
Computation. Intercept: the logistic intercept, and hence pr(D=1), is weakly identified by itself. Disease rate: if pr(D=1) is known, or a good bound for it is specified, there can be significant gains in efficiency. This does not happen for a regular case-control study.
29
Interesting Technical Point. Profile pseudo-likelihood acts like a likelihood. Information asymptotics are (almost) exact. Missing G data handled seamlessly (see next): missing genotype, unphased haplotype data.
30
Missing Data. We have a formal likelihood. If the gene is missing, this suggests a corresponding formal likelihood. Result: inference as if the data were a random sample with missing data.
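One way to picture this is with binary D and binary G. The sketch below is an illustration under assumptions, not the talk's implementation: the joint probability of (D, G) given selection and X is computed directly, and a subject with a missing genotype contributes the sum over both gene values. The function names, coefficients, gene frequency and selection probabilities are all hypothetical.

```python
import math


def expit(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-u))


def joint_prob(d, g, x, beta, theta, pi):
    """pr(D=d, G=g | selected, X=x) for binary D and binary G.

    beta  = (b0, bG, bX, bGX): hypothetical logistic risk coefficients
    theta = pr(G=1), with G assumed independent of X in the population
    pi    = (pi0, pi1): assumed selection probabilities for controls, cases
    """
    def weight(dd, gg):
        b0, bg, bx, bgx = beta
        p1 = expit(b0 + bg * gg + bx * x + bgx * gg * x)
        risk = p1 if dd == 1 else 1.0 - p1
        gene = theta if gg == 1 else 1.0 - theta
        return pi[dd] * risk * gene

    total = sum(weight(dd, gg) for dd in (0, 1) for gg in (0, 1))
    return weight(d, g) / total


def contribution(d, g, x, beta, theta, pi):
    """Likelihood contribution of one subject; g=None means G is missing."""
    if g is None:
        return sum(joint_prob(d, gg, x, beta, theta, pi) for gg in (0, 1))
    return joint_prob(d, g, x, beta, theta, pi)


# Hypothetical parameter values, for illustration only.
beta = (-3.0, 1.0, 0.5, 0.3)
theta = 0.1
pi = (0.01, 0.5)
probs = [joint_prob(d, g, 1.0, beta, theta, pi) for d in (0, 1) for g in (0, 1)]
missing = contribution(1, None, 1.0, beta, theta, pi)
print(sum(probs), missing)
```

The four joint probabilities sum to one for any x, which is the sense in which the selected subjects can be treated like a random sample with some genotypes missing.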
31
Measurement Error. The likelihood formulation also allows us to deal with measurement error in the environmental variables.
32
Advertisement
33
First Simulation. MSE efficiency of the profile method: 0.02 < pr(D=1) < 0.07
34
Israeli Ovarian Cancer Study. Population-based case-control study. Study the interplay of BRCA1/2 mutations (G) and two known risk factors (E or X) of ovarian cancer: oral contraceptive (OC) use and parity. Missing data: approximately 50% of the controls were not genotyped, and 10% of the cases.
35
Israeli Ovarian Cancer Study. Results reported in Modan et al., NEJM (2001). Their analysis involves: the assumption that parity and OC use are independent of BRCA1/2 mutation status; simple but approximate methods for exploiting the G and E independence assumption (including the case-only estimate of interaction); a risk model adjusted for age, race, family history, and history of gynecological surgery.
36
Israeli Ovarian Cancer Study. Disease risk model including the same covariates as Modan et al. (2001). In addition, we explicitly adjusted for the possibility of both G and E being related to S. FH = family history (breast cancer = 1, ovarian or >= 2 breast cancer = 2).
37
Israeli Ovarian Cancer Study. Question: can carriers be protected via OC use? The logarithm of the odds ratio is the sum of the main effect for OC use and the interaction term between OC use and being a carrier, i.e., the interaction between gene and environment. Note how this involves main effects and interactions.
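The sum described on this slide can be sketched in a few lines. The coefficient values, variances and covariance below are hypothetical placeholders, not results from the study, and the function name is invented for illustration. The standard error uses the usual variance of a sum of two estimates.

```python
import math


def carrier_odds_ratio(b_oc, b_int, var_oc, var_int, cov, z=1.96):
    """Odds ratio for OC use among carriers, with a Wald interval.

    log OR = main effect of OC use + OC-by-carrier interaction;
    var(log OR) = var_oc + var_int + 2 * cov.
    """
    log_or = b_oc + b_int
    se = math.sqrt(var_oc + var_int + 2.0 * cov)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical estimates, for illustration only.
or_hat, lower, upper = carrier_odds_ratio(
    b_oc=-0.05, b_int=0.09, var_oc=0.0004, var_int=0.0009, cov=-0.0003
)
print(or_hat, lower, upper)
```

Both a main effect and an interaction are needed, so a method that estimates only the interaction cannot produce this odds ratio.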
38
Israeli Ovarian Cancer Study. Question: is there a carrier/OC interaction? This is the only question the case-only method can answer.
39
Israeli Ovarian Cancer Study Interaction of OC and BRCA1/2:
40
Israeli Ovarian Cancer Study Main Effect of BRCA1/2:
41
Israeli Ovarian Cancer Study. Odds ratio for OC use among carriers = 1.04 (0.98, 1.09). No evidence for a protective effect. Not available from case-only analysis. Length of the interval is ½ the length of that from the usual analysis.
42
Features of the Method. Allows estimation of all parameters of the logistic regression model and can be used to examine interaction in alternative scales. Can be used to estimate the OR for non-rare diseases. Important for studying major genes such as BRCA1/2.
43
Features of the Method. Allows incorporation of external information on Pr(D=1). Unlike with logistic regression in case-control studies, this information improves efficiency of estimation.
44
Colorectal Adenoma Study. PLCO Study: 772 cases, 772 controls. Three SNPs in the calcium-sensing receptor region. HWE assumed. Interest in the interaction of number of copies of one haplotype (GCG) and calcium intake from diet.
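Under Hardy-Weinberg equilibrium, the number of copies of a haplotype follows a binomial distribution with two trials. A minimal sketch of that gene model is given below. The function name is hypothetical and the frequency 0.25 is an arbitrary illustrative value, not the GCG frequency from the study.

```python
def hwe_copy_probs(p):
    """Probabilities of carrying 0, 1 or 2 copies of a haplotype with
    population frequency p, assuming Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


# Arbitrary haplotype frequency, for illustration only.
probs = hwe_copy_probs(0.25)
print(probs)
```

A one-parameter model like this is the kind of parametric distribution for G that the profile likelihood approach takes as given.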
45
Colorectal Adenoma Study. Method #1: write down the prospective likelihood and apply missing data techniques. A standard analysis. If ignoring the case-control sampling scheme works for ordinary logistic regression, it should work for missing haplotype regression too, right? Wrong! Biased estimates and standard errors. Method #2: our method.
46
Colorectal Adenoma Study
47
Conclusions. Standard case-control (choice-based) studies. Specify a model for G given X, e.g., G-E independence in the population after conditioning on strata. No assumptions made about X (high dimensional). All parameters estimable, no rare-disease assumption. Handle missing G data. Large gains in efficiency versus the usual method. Large gains in efficiency for effects of environment given the gene.
48
Conclusions. Theoretical methods: with real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood. Key insight: this is equivalent to a device of pretending that the study is a regular random sample subject to missing data. This allows generalization to any parametric model for G given X.
49
Acknowledgment. Two graduate students have worked on this project: Iryna Lobach, Yale; Christie Spinka, U of Missouri.
50
Thanks! http://stat.tamu.edu/~carroll