Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Faculty of Nutrition Texas A&M University http://stat.tamu.edu/~carroll

Outline Problem: Can more efficient inference be done assuming gene (G) and environment (X) independence? Gene-Environment independence: the case-only method Profile likelihood approach Efficiency gains Example Conclusions

Acknowledgment This work is joint with Nilanjan Chatterjee, National Cancer Institute Papers in: Biometrika, Genetic Epidemiology http://dceg.cancer.gov/people/ChatterjeeNilanjan.html

Outline Theoretical Methods: With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data (This allows) generalization to any parametric model for G given X.

A Little Terminology Epidemiologists: Case control sample Econometricians: Choice-based sample These are exactly the same problems Subjects have two choices (or disease states) Subjects have their covariates sampled conditional on their choices, i.e., Random sample from those with disease Random sample from those without disease

Basic Problem Formalized Case control sample: D = disease Gene expression: G Environment: X Strata: S We are interested in main effects for G and (X,S) along with their interaction

Prospective Models Simplest logistic model General logistic model The function m(G,X,β1) is completely general
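The equations on this slide did not survive in the transcript. As a rough sketch only, the two models it names are usually written as follows; the symbols H, β0, β_G, β_X and β_GX are assumed notation, not taken from the slides.

```latex
\begin{align*}
\text{Simplest: } & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
   = \beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X, \\
\text{General: } & \mathrm{pr}(D=1 \mid G, X)
   = H\{\beta_0 + m(G, X, \beta_1)\},
   \qquad H(u) = \frac{1}{1 + e^{-u}} .
\end{align*}
```

In the simplest model, m(G,X,β1) is the linear predictor with β1 = (β_G, β_X, β_GX); the interaction of interest is β_GX.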

Case-Control Data Case-control data are not a random sample We observe (G,X) given D, i.e., we observe the covariates given the response, not vice-versa If we had a random sample, linear logistic regression would be used to fit the model Obvious idea: ignore the sampling plan and pretend you have a random sample

Case-Control Data Known Fact: The intercept is not identified, rest of the model is identified Retrospective odds is given as

Alternative Derivation: Ignore Sampling Plan Consider a prospective study Let Δ = 1 mean selection into the study Pretend Then compute

Case-Control Data Fact: all parameters except the intercept can be estimated consistently while ignoring the sampling plan Standard Errors: Those computed ignoring the sampling plan are asymptotically correct
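The fact above can be checked with a small simulation. The sketch below is illustrative only: the population, the parameter values and the helper `fit_logistic` are all hypothetical, and the logistic fit is a plain Newton-Raphson routine rather than any particular package. It draws a case-control sample, fits ordinary logistic regression as if the data were a random sample, and compares the fitted intercept with the usual offset β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)}.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(Z, y, iters=25):
    """Maximum likelihood for logistic regression by Newton-Raphson."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (y - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical population: binary G and continuous X, independent of each other.
N = 200_000
G = rng.binomial(1, 0.3, N)
X = rng.normal(size=N)
true_beta = np.array([-3.0, 0.5, 0.3, 0.4])  # intercept, G, X, G*X (assumed values)
Z = np.column_stack([np.ones(N), G, X, G * X])
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ true_beta)))

# Case-control sampling: fixed numbers of cases and controls.
n1 = n0 = 3000
cases = rng.choice(np.flatnonzero(D == 1), n1, replace=False)
controls = rng.choice(np.flatnonzero(D == 0), n0, replace=False)
idx = np.concatenate([cases, controls])

# Ignore the sampling plan and fit ordinary logistic regression.
beta_hat = fit_logistic(Z[idx], D[idx].astype(float))

# Intercept predicted by the offset beta0 + log(n1/n0) - log(pi1/pi0).
pi1 = D.mean()
shifted_intercept = true_beta[0] + np.log(n1 / n0) - np.log(pi1 / (1.0 - pi1))

print("slopes   :", beta_hat[1:], "truth    :", true_beta[1:])
print("intercept:", beta_hat[0], "predicted:", shifted_intercept)
```

The slopes should land near their true values while the intercept tracks the shifted value, which depends on pr(D=1) and so cannot be recovered from the case-control data alone.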

Case-Control Data The intercept is determined by pr(D=1) in the population, hence not identified from these data Little Known Fact: Adding information about pr(D=1) adds no information about

Gene-Environment Independence In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata This assumption is often used in gene-environment interaction studies

G-E Independence: Discussion Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction If False: Possible severe bias (Albert, et al., 2001, our own simulations)

G-E Independence: Discussion It is reasonable in many problems Example: Environment is a treatment in a randomized study under nested case-control sampling Example: Reasonable when exposure is not directly controlled by individual behavior Radiation exposure for A-bomb survivors Carcinogenic exposure of employees Pesticide exposure in a rural community

Generalizations I have phrased this problem as one where G and X are independent given strata This makes sense contextually in genetic epidemiology All the results I will describe go through if you can write down a probability model for G given (X,S): I do this in the Israeli Study.

Generalizations If G is binary, it is natural to apply our approach Posit a parametric or semiparametric model for G given (X,S) Consequences: More efficient estimation of G effects Much more efficient estimation of G and (X,S) interactions.

Gene-Environment Independence Rare Disease Approximation: Rare disease for all values of (G,X) May be unreasonable for important genes such as BRCA1/2 Case-only estimate of multiplicative interaction (Piegorsch, et al., 1994)

Gene-Environment Independence: Case-Only Analysis Positive Consequence: Often much more powerful than standard analysis The power advantage of this method has often led researchers to discard information on controls Negative Consequence: no ability to estimate other risk parameters, which are often of greater interest (see example later) Restrictions: Can only handle multiplicative interaction, requires rare disease for all values of (G,X)
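For binary G and X, the case-only idea reduces to a 2x2 table among cases: under G-X independence and a rare disease, pr(G,X | D=1) is approximately proportional to exp(β_G G + β_X X + β_GX GX) pr(G) pr(X), so the G-X odds ratio among cases estimates exp(β_GX). A minimal sketch follows; the function name and the counts are hypothetical, and the standard error is the usual Woolf formula.

```python
import math

def case_only_interaction(n11, n10, n01, n00):
    """Log odds ratio between G and X among cases, with Woolf standard error.

    n_gx is the number of cases with G = g and X = x (both binary).
    """
    log_or = math.log((n11 * n00) / (n10 * n01))
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return log_or, se

# Hypothetical case counts, for illustration only.
log_or, se = case_only_interaction(n11=30, n10=70, n01=100, n00=400)
ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
print(f"interaction OR = {math.exp(log_or):.2f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

No controls enter the calculation, which is why the main effects of G and X cannot be recovered from it.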

Gene-Environment Independence Fact: gain in power for inference about a multiplicative interaction Consequence: There is thus (Fisher) information in the assumption Conjecture: Can handle general models and improve efficiency for all parameters We do this via a semiparametric profile likelihood approach We start though from a different likelihood

Prentice-Pyke Calculation Methodology: Start with the retrospective likelihood The distribution of (X,G) in the population is left unspecified Semiparametric MLE is usual logistic regression

Environment and Gene Expression Methodology: Start with the retrospective likelihood Note how independence of G and X is used here, see the red expressions We do not want to model the often multivariate distribution of X Gene distribution model can be standard

Environment and Gene Expression Methodology: Compute a profile estimate Parametric/semiparametric distribution for G Nonparametric distribution for X (possibly high dimensional) Result: Explicit profile likelihood

Environment and Gene Expression Methodology: Treat as distinct parameters Let G have parametric structure: Construct the profile likelihood, having estimated the as functions of data and other parameters The result is a function of : this function can be calculated explicitly!

Profile Likelihood Result:

Alternative Derivation Consider a prospective study Let Δ = 1 mean selection into the study Pretend Then compute This is exactly our profile pseudo-likelihood!

Alternative Derivation We compute: Standard approach computes It is this insight that allows us to greatly generalize the work past independence of G and X.
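The two quantities being contrasted did not survive in the transcript. The sketch below is a reconstruction by Bayes' rule under stated assumptions, not a copy of the slide: Δ = 1 denotes selection, selection is taken to depend on disease status only with pr(Δ=1 | D=d) proportional to n_d / pr(D=d), q(g | θ) is an assumed parametric model for G (independent of X), H is the logistic function, and κ = β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)}.

```latex
\begin{align*}
\text{Standard approach: } &
  \mathrm{pr}(D=1 \mid \Delta=1, G, X) = H\{\kappa + m(G, X, \beta_1)\}, \\
\text{Here: } &
  \mathrm{pr}(D=d, G=g \mid \Delta=1, X=x)
  = \frac{S(d, g, x)}{\sum_{d^*=0}^{1} \sum_{g^*} S(d^*, g^*, x)}, \\
& S(d, g, x) = q(g \mid \theta)\,
  \frac{\exp[\, d\{\kappa + m(g, x, \beta_1)\}]}
       {1 + \exp\{\beta_0 + m(g, x, \beta_1)\}} .
\end{align*}
```

Conditioning on X alone, rather than on (G,X), is what lets the model for G contribute information. If G depends on X, q(g | θ) is replaced by a model q(g | x, θ) and the same calculation goes through.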

Computation Intercept: The logistic intercept, and hence pr(D=1), is weakly identified by itself Disease rate: If pr(D=1) is known, or a good bound for it is specified, can have significant gains in efficiency. This does not happen for a regular case-control study

Interesting Technical Point Profile pseudo-likelihood acts like a likelihood Information Asymptotics are (almost) exact Missing G data handled seamlessly (see next) Missing genotype Unphased haplotype data

Missing Data We have a formal likelihood: If gene is missing, suggests the formal likelihood Result: Inference as if the data were a random sample with missing data
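A minimal sketch of how a missing genotype could enter such a likelihood, assuming a binary G independent of X, the simple logistic model with a G*X interaction, and selection probabilities proportional to n_d / pr(D=d). All names and parameter values are hypothetical; `kappa` stands for β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)}. When G is missing, the numerator is summed over the possible genotypes.

```python
import math

def weight(d, g, x, beta0, beta_g, beta_x, beta_gx, kappa, p_g):
    """Unnormalised pr(D=d, G=g | X=x, selected) for binary G independent of X."""
    m = beta_g * g + beta_x * x + beta_gx * g * x
    q = p_g if g == 1 else 1.0 - p_g          # assumed model: pr(G=1) = p_g
    return q * math.exp(d * (kappa + m)) / (1.0 + math.exp(beta0 + m))

def log_contribution(d, g, x, **par):
    """Log pseudo-likelihood term for one subject; g=None means G is missing."""
    total = sum(weight(dd, gg, x, **par) for dd in (0, 1) for gg in (0, 1))
    genotypes = (0, 1) if g is None else (g,)  # sum over G when it is unobserved
    num = sum(weight(d, gg, x, **par) for gg in genotypes)
    return math.log(num / total)

# Hypothetical parameter values, for illustration only.
par = dict(beta0=-3.0, beta_g=0.5, beta_x=0.3, beta_gx=0.4, kappa=-0.3, p_g=0.3)

full = {(d, g): log_contribution(d, g, 1.2, **par) for d in (0, 1) for g in (0, 1)}
missing = log_contribution(1, None, 1.2, **par)
print(full, missing)
```

Summing these terms over subjects gives an objective in (β, p_g) that can be handed to a general-purpose optimizer; a case with missing G simply contributes the marginal term.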

Measurement Error The likelihood formulation also allows us to deal with measurement error in the environmental variables

First Simulation MSE Efficiency of Profile method: 0.02 < pr(D=1) < 0.07

Israeli Ovarian Cancer Study Population based case-control study Study the interplay of BRCA1/2 mutations (G) and two known risk factors (E or X) of ovarian cancer: oral contraceptive (OC) use parity. Missing Data: Approximately 50% of the controls were not genotyped, and 10% of the cases

Israeli Ovarian Cancer Study Results reported in Modan et al., NEJM (2001). Their analysis involves Assumption that parity and OC use are independent of BRCA1/2 mutation status Simple but approximate methods for exploiting the G and E independence assumption (including case-only estimate of interaction) Risk model adjusted for Age, Race, Family History, History of Gynecological Surgery

Israeli Ovarian Cancer Study Disease risk model including same covariates as Modan et al (2001) In addition, we explicitly adjusted for the possibility of both G and E being related to S FH = family history (breast cancer = 1, ovarian or >= 2 breast cancer = 2)

Israeli Ovarian Cancer Study Question: Can carriers be protected via OC-use? The logarithm of the odds ratio is the sum of The main effect for OC-use The interaction term between OC-use and being a carrier, i.e., interaction between gene and environment Note how this involves main effects and interactions
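In code, the sum described above looks like the sketch below. It assumes carriers are coded G = 1 and a model with a single OC-use main effect and a single OC-by-carrier interaction; the function name and every numeric input are hypothetical and are not estimates from the study. The covariance term matters because the two coefficients are estimated from the same fit.

```python
import math

def or_among_carriers(beta_x, beta_gx, var_x, var_gx, cov_x_gx, z=1.96):
    """Odds ratio for the exposure among carriers, with a Wald interval."""
    log_or = beta_x + beta_gx                       # main effect + interaction
    se = math.sqrt(var_x + var_gx + 2.0 * cov_x_gx)  # variance of a sum
    return math.exp(log_or), (math.exp(log_or - z * se), math.exp(log_or + z * se))

# Hypothetical coefficient estimates and (co)variances, for illustration only.
odds_ratio, (lower, upper) = or_among_carriers(
    beta_x=-0.4, beta_gx=0.3, var_x=0.01, var_gx=0.04, cov_x_gx=-0.005
)
print(f"OR among carriers = {odds_ratio:.2f} ({lower:.2f}, {upper:.2f})")
```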

Israeli Ovarian Cancer Study Question: Is there a carrier/OC interaction? The case-only method can answer only this question

Israeli Ovarian Cancer Study Interaction of OC and BRCA1/2:

Israeli Ovarian Cancer Study Main Effect of BRCA1/2:

Israeli Ovarian Cancer Study Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) No evidence for protective effect Not available from case-only analysis Length of interval is ½ the length of the usual analysis

Features of the Method Allows estimation of all parameters of logistic regression model and can be used to examine interaction in alternative scales Can be used to estimate OR for non-rare diseases Important for studying major genes such as BRCA1/2

Features of the Method Allows incorporation of external information on Pr(D=1) Unlike with logistic regression in case-control studies, this information improves efficiency of estimation

Colorectal Adenoma Study PLCO Study: 772 cases, 772 controls Three SNPs in the calcium-sensing receptor region HWE assumed Interest in the interaction of number of copies of one haplotype (GCG) and calcium intake from diet

Colorectal Adenoma Study Method #1: Write down the prospective likelihood and apply missing data techniques A standard analysis If ignoring the case-control sampling scheme works for ordinary logistic regression, it should work for missing haplotype regression too, right? Wrong! Biased estimates and standard errors Method #2: Our method

Colorectal Adenoma Study

Conclusions Standard case-control (choice-based) studies Specify a model for G given X, e.g., G-E independence in population after conditioning on strata No assumptions made about X (high dimensional) All parameters estimable, no rare-disease assumption Handle missing G data Large gains in efficiency versus usual method Large gains in efficiency for effects of environment given the gene

Conclusions Theoretical Methods: With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood (Key insight) Equivalent to a device of pretending that study is a regular random sample subject to missing data (This allows) generalization to any parametric model for G given X.

Acknowledgment Two graduate students have worked on this project Iryna Lobach, Yale; Christie Spinka, U of Missouri

Thanks! http://stat.tamu.edu/~carroll
