1
Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Faculties of Nutrition and Toxicology Texas A&M University http://stat.tamu.edu/~carroll
2
Outline Problem: Case-Control Studies with Gene-Environment relationships Efficient formulation when genes are observed Measurement errors in environmental variables Haplotype modeling and Robustness
3
Acknowledgment This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)
4
Acknowledgment Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan)
5
Software SAS and Matlab Programs Available at my web site under the software button Examples are given in the programs http://stat.tamu.edu/~carroll
6
Some Personal History I was born in Japan The coffee table is still in my house
7
Some Personal History My father lived in Seoul for 2 months in 1948 and 1 year in 1968 He took many photos of sights there, especially in 1948
8
Joonghwa moon at Deoksugung, 1948
9
Joonghwa moon at Deoksugung, today
10
The Prices of Drinks Were Pretty Low
11
Basic Problem Formalized Case control sample: D = disease Gene expression: G Environment, can include strata: X We are interested in main effects for G and X along with their interaction
12
Prospective Models Simplest logistic model General logistic model The function m(G, X, β_1) is completely general
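The two displays on this slide did not survive in the transcript. A sketch of what they plausibly look like, assuming scalar G and X; the coefficient names \beta_G, \beta_X and \beta_{GX} are hypothetical labels, not taken from the slides:

```latex
\begin{align*}
\text{Simplest logistic model:}\quad
  & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    = \beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X, \\
\text{General logistic model:}\quad
  & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    = \beta_0 + m(G, X, \beta_1).
\end{align*}
```

The first line is the special case of the second with m linear in G, X and their product.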
13
Likelihood Function The likelihood is the retrospective one, that of (X,G) given disease status D Note how the likelihood depends on two things: The distribution of (X,G) in the population The probability of disease in the population Neither can be estimated from the case-control study
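The display itself is missing from the transcript. A sketch obtained from Bayes' rule, assuming F(g,x) denotes the joint population distribution of (G,X) with density or mass function f (both symbols are introduced here for illustration):

```latex
\begin{align*}
\mathrm{pr}(G=g, X=x \mid D=d)
  &= \frac{\mathrm{pr}(D=d \mid G=g, X=x)\, f(g,x)}{\mathrm{pr}(D=d)}, \\
\mathrm{pr}(D=d)
  &= \int \mathrm{pr}(D=d \mid G=g, X=x)\, dF(g,x).
\end{align*}
```

The numerator carries the population distribution of (G,X) and the denominator the population disease probability, which are the two quantities the slide names.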
14
When G is observed The usual choice is ordinary logistic regression It is semiparametric efficient if nothing is known about the distribution of G, X in the population Why semiparametric: what is unknown is the distribution of (G,X) in the population
15
When G is observed Logistic regression is thus robust to any modeling assumptions about the covariates in the population Unfortunately it is not very efficient for understanding interactions
16
Gene-Environment Independence In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata This assumption is often used in gene-environment interaction studies
17
G-E Independence Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction Part of this talk is to model the distribution of G given X
18
Gene-Environment Independence If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. The reason is that you are putting a constraint on the retrospective likelihood
19
More Efficiency, G Observed A constraint on the population is to posit a parametric or semiparametric model for G given X Consequences: More efficient estimation of G effects Much more efficient estimation of G and (X,S) interactions.
20
The Formulation In the most general semiparametric setting, we have Question: What methods do we have to construct estimators?
21
Methodology We have developed two new ways of thinking about this problem In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation
22
The Hard Way Treat X as a discrete random variable whose mass points are the observed data points Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.
23
The Hard Way The maximization is not trivial to do correctly Result: an explicit profile likelihood that does not involve the distribution of X
24
Pretend Missing Data Formulation The following simple trick can be shown to be legitimate and semiparametric efficient Equivalently, we compute a semiparametric profiled likelihood Semiparametric because the distribution of X is not modeled
25
Pretend Missing Data Formulation The idea is to create a “pretend” study, which is one of random sampling with missing data We use an MAR regime. The “pretend” study mimics the case-control study
26
Pretend Missing Data Formulation Suppose you have a large but finite population of size N Then, there are N pr(D=1) with the disease There are N pr(D=0) without the disease
27
Pretend Missing Data Formulation In a case-control sample, we randomly select n_1 with the disease, and n_0 without. The fraction of people with disease status D=d that we observe is n_d / {N pr(D=d)}
28
Pretend Missing Data Formulation Then let’s make up a “pretend” study, that has random sampling with missing data I take a random sample I get to observe (D,X,G) when D=d with probability I will say that if I observe (D,X,G). Then
29
Pretend Missing Data Formulation In this pretend missing data formulation, ordinary logistic regression is simply We have a model for G given X, hence we compute This has a simple explicit form, as follows
30
Result Define This is the intercept that ordinary logistic regression actually estimates: it only gets the slope right
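A numerical sketch of why ordinary logistic regression recovers the slope but not the intercept. It assumes that a subject with D=d is selected with probability n_d / {N pr(D=d)}, as in the pretend study, and that the resulting shift is the usual case-control offset log(n_1/n_0) - log{pr(D=1)/pr(D=0)}. The function names and all numbers are illustrative and do not come from the slides.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def pr_case_given_selected(eta, n1, n0, pr_d1):
    """pr(D=1 | selected, G, X) in the pretend study.

    eta is the population linear predictor beta_0 + m(G, X, beta_1).
    A subject with D=d is selected with probability n_d / (N * pr(D=d));
    the population size N cancels in the ratio, so it is not an argument.
    """
    p1 = expit(eta)
    w1 = (n1 / pr_d1) * p1
    w0 = (n0 / (1.0 - pr_d1)) * (1.0 - p1)
    return w1 / (w1 + w0)


def offset(n1, n0, pr_d1):
    """Assumed intercept shift: log(n1/n0) - log{pr(D=1)/pr(D=0)}."""
    return math.log(n1 / n0) - math.log(pr_d1 / (1.0 - pr_d1))


# Hypothetical design: 500 cases, 1000 controls, 5% population disease rate.
n1, n0, pr_d1 = 500, 1000, 0.05
for eta in (-4.0, -3.0, -2.0):
    shifted = logit(pr_case_given_selected(eta, n1, n0, pr_d1))
    # The difference is the same for every eta: only the intercept moves.
    print(eta, round(shifted - eta, 6), round(offset(n1, n0, pr_d1), 6))
```

Because the shift does not depend on the linear predictor, any coefficient inside m(G, X, β_1) is unaffected by the sampling.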
31
Result Define Further define
32
Result Then, the semiparametric efficient profiled likelihood function is Trivial to compute.
33
Result In the rare disease case, we have the further simplification that
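The displays on the four Result slides are missing from the transcript. A sketch of what they plausibly contain, derived from the selection probabilities n_d / {N pr(D=d)} of the pretend study. The symbols \kappa, S, q(g \mid x, \theta) and \Omega are assumptions made here for illustration; q stands for the posited model for G given X and \Omega collects all parameters.

```latex
\begin{align*}
\kappa &= \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\}, \\
S(d, g, x, \Omega)
  &= q(g \mid x, \theta)\,
     \frac{\exp[d\{\kappa + m(g, x, \beta_1)\}]}
          {1 + \exp\{\beta_0 + m(g, x, \beta_1)\}}, \\
L(\Omega)
  &= \prod_{i}
     \frac{S(d_i, g_i, x_i, \Omega)}
          {\sum_{d^*} \sum_{g^*} S(d^*, g^*, x_i, \Omega)}, \\
\text{rare disease:}\quad
S(d, g, x, \Omega)
  &\approx q(g \mid x, \theta)\, \exp[d\{\kappa + m(g, x, \beta_1)\}].
\end{align*}
```

Each factor of L is pr(D=d_i, G=g_i | X=x_i, selected), so the distribution of X never enters. The rare-disease line follows because the denominator 1 + exp{β_0 + m} is close to 1 when disease is rare.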
34
Interesting Technical Point Profile pseudo-likelihood acts like a likelihood Information Asymptotics are (almost) exact
35
Typical Simulation Example MSE Efficiency of Profile method compared to ordinary logistic regression
36
Typical Empirical Example
37
Consequence #1 We have a formal likelihood: This is also a legitimate semiparametric profile likelihood Anything you can do with a likelihood you can do with a semiparametric profile likelihood
38
Consequences #2-#3 Measurement Error in the Gene: Handle misclassification of a covariate (the gene) as in any likelihood problem (see later) Measurement Error in the Environment: The structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.
39
Advertisement Lobach, et al., Biometrics, in press
40
Consequences #4-#5 Flexible Modeling of Covariate Effects: Modeling some components by penalized regression splines The LASSO and other likelihood-based methods apply Model Averaging: Can entertain/average various risk models Bayesian methods are asymptotically correct
41
Consequence #6 Model Robustness: One can model average/select/LASSO various models for the distribution of G given X Main Point: Our method results in a legitimate likelihood, hence can be treated as such
42
Modeling the Gene Now turn to models for the gene Given such models likelihood calculations can be used for model fitting We will consider haplotypes
43
Haplotypes Haplotypes consist of what we get from our mother and father at more than one site Mother gives us the haplotype h_m = (A_m, B_m) Father gives us the haplotype h_f = (a_f, b_f) Our diplotype is H_dip = {(A_m, B_m), (a_f, b_f)}
44
Haplotypes Unfortunately, we cannot presently observe the two haplotypes We can only observe genotypes Thus, if we were really H_dip = {(A_m, B_m), (a_f, b_f)}, then the data we would see would simply be the unordered set (A,a,B,b)
45
Missing Haplotypes Thus, if we were really H_dip = {(A_m, B_m), (a_f, b_f)}, then the data we would see would simply be the unordered set (A,a,B,b) However, this is also consistent with a different diplotype, namely H_dip = {(a_m, B_m), (A_f, b_f)} Note that the number of copies of the (a,b) haplotype differs in these two cases The true diploid (the haplotype pair) is missing
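The ambiguity can be made concrete by enumerating every haplotype pair that produces a given unphased genotype. This is a minimal sketch under an assumed coding: each site is biallelic, alleles are coded 0 (lower case) and 1 (upper case), and the genotype at a site is the number of upper-case copies. The function name is hypothetical.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype[k] is the number of copies (0, 1 or 2) of allele 1 at site k.
    Each haplotype is a tuple of 0/1 alleles, one entry per site.
    """
    n_sites = len(genotype)
    haplotypes = list(product((0, 1), repeat=n_sites))
    pairs = set()
    for h1 in haplotypes:
        for h2 in haplotypes:
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                # Sort so that {h1, h2} and {h2, h1} count as one diplotype.
                pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# Heterozygous at both sites: the observed data are the unordered set (A,a,B,b).
for diplotype in compatible_diplotypes((1, 1)):
    print(diplotype)
```

For a double heterozygote the enumeration returns two diplotypes, one containing a copy of the (a,b) haplotype, coded (0, 0), and one containing none. A genotype with at most one heterozygous site has a single compatible diplotype, so phase is only missing when two or more sites are heterozygous.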
46
Missing Haplotypes The likelihood in terms of the diploid is We observe the genotypes G The likelihood of the observed data is
47
Missing Haplotypes The likelihood of the observed data is Note how easy this was: it is really the profiled semiparametric likelihood of the observed data
48
Haplotypes Danyu Lin has a nice EM-based program for estimating haplotype frequencies It accepts data in text format with SAS missing data conventions The program is flexible, and for example it can assume Hardy-Weinberg equilibrium (HWE) http://www.bios.unc.edu/~lin/hapstat/
49
Haplotype Fitting Models that assume haplotype-environment independence are straightforward to fit via EM Danyu Lin’s program can do this as well as our SAS program The remaining issue is how to gain robustness against deviations from this assumed independence
50
Robustness We build robustness by specifying models for diplotypes given the environmental variables We first run a program to get a preliminary estimate of haplotype frequency We use the most frequent haplotype as a reference haplotype
51
Haplotypes Approach: Start with a logistic model for the unobserved haplotypes H given covariates X In practice, we collapse all rare haplotypes into the reference haplotype to eliminate many variables
52
Haplotypes Approach: Start with a logistic model for the unobserved haplotypes H given covariates X This gives us the model:
53
Haplotypes Since the diplotypes are not observed, for identifiability we need further constraints Example: One simple additive-type model is that
54
Haplotypes Further identification: Assume that the population as a whole is in HWE, so that
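The HWE display is missing from the transcript. Under HWE the two haplotypes of a person are independent draws from the population haplotype frequencies, so a diplotype has probability θ_h² when both haplotypes are h and 2 θ_h θ_h' when they differ. A minimal sketch of that calculation; the function name and the frequencies are made up for illustration.

```python
from itertools import combinations_with_replacement


def hwe_diplotype_probs(freqs):
    """Diplotype probabilities under Hardy-Weinberg equilibrium.

    freqs maps each haplotype label to its population frequency.
    Returns a dict keyed by unordered pairs (h1, h2) with h1 <= h2.
    """
    probs = {}
    for h1, h2 in combinations_with_replacement(sorted(freqs), 2):
        p = freqs[h1] * freqs[h2]
        # A heterozygous pair can arise in two orders (maternal/paternal).
        probs[(h1, h2)] = p if h1 == h2 else 2.0 * p
    return probs


# Hypothetical haplotype frequencies summing to one.
example = hwe_diplotype_probs({"AB": 0.5, "Ab": 0.3, "ab": 0.2})
for pair, prob in example.items():
    print(pair, round(prob, 4))
print("total", round(sum(example.values()), 10))
```

The probabilities sum to one over unordered pairs, which is a quick check that the factor of two for heterozygous pairs is in the right place.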
55
Haplotypes Summary: We have two models
56
Haplotypes Summary: The models are linked Let F(x) be the marginal distribution of X Then
57
Haplotypes In this set up, we have a particular form for hence is defined through them and the marginal distribution of X
58
Marginal Distributions of X Three approaches for estimating F(x) Profiled likelihood If pr(D=1) is known, weighted mixture of empirical cdf for cases and controls For rare disease, the empirical cdf for the controls
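The second and third approaches are simple enough to sketch for a scalar X. The code below assumes pr(D=1) is supplied by the user; the function names and the data are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Empirical cdf of a one-dimensional sample, returned as a function."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def population_cdf(case_x, control_x, pr_disease):
    """Estimate F(x) as pr(D=1) F_cases(x) + pr(D=0) F_controls(x)."""
    f_cases, f_controls = ecdf(case_x), ecdf(control_x)
    return lambda x: pr_disease * f_cases(x) + (1.0 - pr_disease) * f_controls(x)


# Made-up exposure values for four cases and four controls.
cases = [1.0, 2.0, 3.0, 4.0]
controls = [2.0, 4.0, 6.0, 8.0]

f_weighted = population_cdf(cases, controls, pr_disease=0.1)
f_rare = ecdf(controls)  # rare-disease approximation: controls only
print(f_weighted(3.0), f_rare(3.0))
```

As pr(D=1) shrinks toward zero the weighted mixture approaches the control-only estimate, which is why the rare-disease approach needs no knowledge of the disease rate.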
59
Summary Population model for the diplotypes, e.g., HWE Conditional model for diplotypes given environment Various estimates of marginal distribution of environment and the crucial link
60
Haplotypes Analysis The resulting method adds robustness EM-algorithms enable fast computation Explicit asymptotic theory (not trivial) The method is also semiparametric efficient
61
Haplotypes Analysis Simulations indicate the gain in robustness
62
The NAT2 Example Study of colorectal adenoma, a precursor to colon cancer 628 cases and 635 controls The gene NAT2 is known to be important in the metabolism of smoking-related carcinogens X: age, gender, whether one smokes or used to smoke 6 SNPs Haplotype 101010 is of interest
63
The NAT2 Example 7 Haplotypes had frequency > 0.5% The most frequent was treated as baseline, additive risk model for the diplotypes Interactions of smoking variable with the haplotype 101010 in the risk model Interactions of the smoking variable with the haplotypes in the gene model
64
The NAT2 Example Current smoking and 101010 haplotype interaction
Model          Estimate   s.e.   P-value
Independence   -0.29      0.18   0.109
Dependence     -0.56      0.27   0.039
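The reported p-values can be checked against the estimates and standard errors. The slide does not say which test was used, so this sketch assumes a two-sided Wald test based on the normal distribution.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: parameter = 0, using z = estimate / s.e."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


for label, est, se in (("Independence", -0.29, 0.18),
                       ("Dependence", -0.56, 0.27)):
    print(label, round(wald_p_value(est, se), 3))
```

The computed values land close to the reported 0.109 and 0.039. Small gaps are expected because the estimates and standard errors on the slide are rounded to two decimals.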
65
The NAT2 Example In this example, recognizing the possibility that the gene distribution may depend on the environment (smoking) changes the analysis Plus, we get a p-value < 0.05!
66
Further work There is another way to get robustness that we have just submitted The idea is that the haplotypes and the environment are independent given the genotypes That is, once you know the genotypes, the haplotypes are determined solely by random mating.
67
Further work We then have two estimates: Haplotype-environment unconditional independence Independence conditional on the genotype Then we do a penalized likelihood analysis: the likelihood is the conditional independence likelihood, and the penalty is the L1 distance from the unconditional independence estimate
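To see how an L1 penalty of this kind behaves, here is a one-parameter sketch. It replaces the conditional-independence log-likelihood by its quadratic approximation around its maximizer, which is an assumption made here for illustration and not the submitted method; the function name, the tuning constant lam and the numbers are hypothetical.

```python
def shrink_toward(beta_cond, beta_uncond, var_cond, lam):
    """Maximize -(b - beta_cond)**2 / (2 * var_cond) - lam * abs(b - beta_uncond).

    beta_cond   : estimate under independence conditional on the genotype
    beta_uncond : estimate under unconditional independence
    var_cond    : variance of beta_cond (curvature of the quadratic)
    lam         : penalty weight; lam = 0 returns beta_cond unchanged
    The maximizer is a soft-thresholding of the difference between the two.
    """
    delta = beta_cond - beta_uncond
    threshold = lam * var_cond
    if abs(delta) <= threshold:
        # The two estimates are close: snap to the unconditional estimate.
        return beta_uncond
    sign = 1.0 if delta > 0 else -1.0
    return beta_cond - sign * threshold


print(shrink_toward(1.0, 0.4, 0.04, 0.0))    # no penalty
print(shrink_toward(1.0, 0.4, 0.04, 5.0))    # partial shrinkage
print(shrink_toward(1.0, 0.4, 0.04, 100.0))  # heavy penalty
```

When the two estimates agree to within the threshold, the result is the unconditional independence estimate. When they disagree strongly, the result stays near the conditional independence estimate. This is the trade-off between efficiency and robustness that the slide describes.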
68
Further work The result is increased robustness and major gains in efficiency
69
Summary Fully flexible risk models Flexible models for genes/haplotypes given covariates Computable semiparametric efficient inference that is more powerful than ordinary logistic regression and more robust than gene-environment independence
70
Thanks! http://stat.tamu.edu/~carroll