Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Faculties of Nutrition and Toxicology Texas A&M University

Outline
Problem: Case-Control Studies with Gene-Environment relationships
Efficient formulation when genes are observed
Measurement errors in environmental variables
Haplotype modeling and Robustness

Acknowledgment This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

Acknowledgment Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan)

Software
SAS and Matlab programs are available at my web site under the software button
Examples are given in the programs
http://stat.tamu.edu/~carroll

Some Personal History I was born in Japan The coffee table is still in my house

Some Personal History My father lived in Seoul for 2 months in 1948 and 1 year in 1968 He took many photos of sights there, especially in 1948

Joonghwa moon at Deoksugung, 1948

Joonghwa moon at Deoksugung, today

The Prices of Drinks Were Pretty Low

Basic Problem Formalized
Case-control sample: D = disease
Gene expression: G
Environment, can include strata: X
We are interested in main effects for G and X along with their interaction

Prospective Models
Simplest logistic model
General logistic model
The function m(G, X, β_1) is completely general
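The two model formulas did not survive in this transcript. A plausible reconstruction is sketched below. It assumes H denotes the logistic distribution function, and the coefficient names β_0, β_G, β_X, β_GX are hypothetical labels, since only m(G, X, β_1) is named on the slide.

```latex
\begin{align*}
\text{Simplest:}\quad & \mathrm{pr}(D = 1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X), \\
\text{General:}\quad  & \mathrm{pr}(D = 1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\},
\qquad H(v) = \frac{1}{1 + e^{-v}}.
\end{align*}
```

The simplest model is the special case m(G, X, β_1) = β_G G + β_X X + β_GX G X of the general one.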

Likelihood Function
The likelihood is retrospective: pr(X,G | D=d) = pr(D=d | X,G) × pr(X,G) / pr(D=d)
Note how the likelihood depends on two things:
–The distribution of (X,G) in the population
–The probability of disease in the population
Neither can be estimated from the case-control study

When G is observed
The usual choice is ordinary logistic regression
It is semiparametric efficient if nothing is known about the distribution of (G,X) in the population
Why semiparametric: what is unknown is the distribution of (G,X) in the population

When G is observed
Logistic regression is thus robust to any modeling assumptions about the covariates in the population
Unfortunately, it is not very efficient for understanding interactions

Gene-Environment Independence
In many situations, it may be reasonable to assume that G and X are independently distributed in the underlying population, possibly after conditioning on strata
This assumption is often used in gene-environment interaction studies

G-E Independence
Does not always hold!
Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction
Part of this talk is to model the distribution of G given X

Gene-Environment Independence
If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained.
The reason is that you are putting a constraint on the retrospective likelihood

More Efficiency, G Observed
A constraint on the population is to posit a parametric or semiparametric model for G given X
Consequences:
–More efficient estimation of G effects
–Much more efficient estimation of G and (X,S) interactions

The Formulation In the most general semiparametric setting, we have Question: What methods do we have to construct estimators?

Methodology
We have developed two new ways of thinking about this problem
In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation

The Hard Way
Treat X as a discrete random variable whose mass points are the observed data points
Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.

The Hard Way
The maximization is not trivial to do correctly
Result: an explicit profile likelihood that does not involve the distribution of X

Pretend Missing Data Formulation
The following simple trick can be shown to be legitimate and semiparametric efficient
Equivalently, we compute a semiparametric profiled likelihood
Semiparametric because the distribution of X is not modeled

Pretend Missing Data Formulation
The idea is to create a “pretend” study, which is one of random sampling with missing data
We use a missing-at-random (MAR) regime.
The “pretend” study mimics the case-control study

Pretend Missing Data Formulation
Suppose you have a large but finite population of size N
Then, there are N × pr(D=1) with the disease
There are N × pr(D=0) without the disease

Pretend Missing Data Formulation
In a case-control sample, we randomly select n_1 with the disease, and n_0 without.
The fraction of people with disease status D=d that we observe is n_d / {N × pr(D=d)}

Pretend Missing Data Formulation
Then let’s make up a “pretend” study that has random sampling with missing data
I take a random sample
I get to observe (D,X,G) when D=d with probability n_d / {N × pr(D=d)}
I will say that the observation indicator equals 1 if I observe (D,X,G)

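Pretend Missing Data Formulation
In this pretend missing data formulation, ordinary logistic regression is simply the model for D given (G,X) among those observed
We have a model for G given X, hence we compute the joint probability of (D,G) given X among those observed
This has a simple explicit form, as follows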
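The calculation behind the pretend study can be sketched with Bayes' rule. The notation here is assumed rather than taken from the slides: Δ = 1 marks an observed subject, π_d = pr(D=d), and q(g | x) is the model for G given X.

```latex
\begin{align*}
\mathrm{pr}(\Delta = 1 \mid D = d, G, X) &= \frac{n_d}{N \pi_d}, \\[4pt]
\mathrm{pr}(D = d, G = g \mid X = x, \Delta = 1)
  &= \frac{(n_d / \pi_d)\, \mathrm{pr}(D = d \mid g, x)\, q(g \mid x)}
          {\sum_{d'} \sum_{g'} (n_{d'} / \pi_{d'})\, \mathrm{pr}(D = d' \mid g', x)\, q(g' \mid x)}.
\end{align*}
```

The population size N cancels between numerator and denominator, and the distribution of X never appears, which is why the resulting likelihood does not require a model for X.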

Result
Define κ = β_0 + log(n_1/n_0) − log{pr(D=1)/pr(D=0)}, where β_0 is the intercept of the prospective model
This is the intercept that ordinary logistic regression actually estimates
–It only gets the slope right

Result Define Further define

Result Then, the semiparametric efficient profiled likelihood function is Trivial to compute.
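The formulas on the three "Result" slides were lost in extraction. The block below is a reconstruction obtained by weighting the prospective model by the case-control sampling fractions n_d / pr(D=d). The symbols κ, S, Ω, θ, and q(g | x, θ) are assumed notation, and the display should be checked against the authors' published work before being relied upon.

```latex
\begin{align*}
\kappa &= \beta_0 + \log(n_1 / n_0) - \log\{\mathrm{pr}(D = 1) / \mathrm{pr}(D = 0)\}, \\[4pt]
S(d, g, x, \Omega) &= \frac{\exp[d\{\kappa + m(g, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(g, x, \beta_1)\}}\; q(g \mid x, \theta), \\[4pt]
L(\Omega) &= \prod_{i=1}^{n_0 + n_1} \frac{S(D_i, G_i, X_i, \Omega)}{\sum_{d} \sum_{g} S(d, g, X_i, \Omega)}.
\end{align*}
```

When the disease is rare, 1 + exp{β_0 + m(g, x, β_1)} is close to 1, so S reduces to exp[d{κ + m(g, x, β_1)}] q(g | x, θ).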

Result In the rare disease case, we have the further simplification that
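To illustrate how little computation is involved, here is a minimal Python sketch of a rare-disease profile log-likelihood. It is not the authors' SAS or Matlab code. It assumes a binary gene G distributed Bernoulli(theta) independently of a scalar X, and a risk function m(g, x, beta) = beta[0]*g + beta[1]*x + beta[2]*g*x; the function name and argument layout are hypothetical.

```python
import numpy as np

def profile_loglik(kappa, beta, theta, d, g, x):
    """Rare-disease profile log-likelihood for binary G and scalar X.

    Assumptions (not from the slides): G ~ Bernoulli(theta) independently
    of X, and m(g, x, beta) = beta[0]*g + beta[1]*x + beta[2]*g*x.
    """
    d, g, x = (np.asarray(a, dtype=float) for a in (d, g, x))

    def log_s(dd, gg):
        # log S(d, g, x) = d * {kappa + m(g, x, beta)} + log q(g | theta)
        m = beta[0] * gg + beta[1] * x + beta[2] * gg * x
        log_q = gg * np.log(theta) + (1 - gg) * np.log(1 - theta)
        return dd * (kappa + m) + log_q

    # Denominator: sum of S over all four (d, g) combinations at each x_i.
    all_terms = np.stack(
        [log_s(dd, gg) for dd in (0.0, 1.0) for gg in (0.0, 1.0)]
    )
    log_denom = np.logaddexp.reduce(all_terms, axis=0)
    return float(np.sum(log_s(d, g) - log_denom))
```

Each factor is a conditional probability of (D, G) given X, so for a fixed x the four possible contributions sum to one. The function can be passed to any general-purpose optimizer to estimate (kappa, beta, theta).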

Interesting Technical Point Profile pseudo-likelihood acts like a likelihood Information Asymptotics are (almost) exact

Typical Simulation Example MSE Efficiency of Profile method compared to ordinary logistic regression

Typical Empirical Example

Consequence #1
We have a formal likelihood:
This is also a legitimate semiparametric profile likelihood
Anything you can do with a likelihood you can do with a semiparametric profile likelihood

Consequences #2-#3
Measurement Error in the Gene: Handle misclassification of a covariate (the gene) as in any likelihood problem (see later)
Measurement Error in the Environment: The structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.

Advertisement Lobach, et al., Biometrics, in press

Consequences #4-#5
Flexible Modeling of Covariate Effects:
–Modeling some components by penalized regression splines
–The LASSO and other likelihood-based methods apply
Model Averaging:
–Can entertain/average various risk models
–Bayesian methods are asymptotically correct

Consequence #6
Model Robustness: One can model average/select/LASSO various models for the distribution of G given X
Main Point: Our method results in a legitimate likelihood, hence can be treated as such

Modeling the Gene
Now turn to models for the gene
Given such models, likelihood calculations can be used for model fitting
We will consider haplotypes

Haplotypes
Haplotypes consist of what we get from our mother and father at more than one site
Mother gives us the haplotype h_m = (A_m, B_m)
Father gives us the haplotype h_f = (a_f, b_f)
Our diplotype is H_dip = {(A_m, B_m), (a_f, b_f)}

Haplotypes
Unfortunately, we cannot presently observe the two haplotypes
We can only observe genotypes
Thus, if we were really H_dip = {(A_m, B_m), (a_f, b_f)}, then the data we would see would simply be the unordered set (A, a, B, b)

Missing Haplotypes
Thus, if we were really H_dip = {(A_m, B_m), (a_f, b_f)}, then the data we would see would simply be the unordered set (A, a, B, b)
However, this is also consistent with a different diplotype, namely H_dip = {(a_m, B_m), (A_f, b_f)}
Note that the number of copies of the (a, b) haplotype differs in these two cases
The true diplotype (haplotype pair) is missing
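The phase ambiguity described above can be made concrete by enumerating every diplotype that is consistent with an unphased genotype. The following Python sketch is illustrative only; the function name and the representation of a genotype as a tuple of allele pairs are assumptions.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """Return all diplotypes consistent with an unphased genotype.

    genotype: tuple of unordered allele pairs, one pair per site,
    for example (("A", "a"), ("B", "b")).
    Each diplotype is a frozenset of haplotypes (tuples of alleles).
    """
    diplotypes = set()
    # At each site, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(site[c] for site, c in zip(genotype, choice))
        h2 = tuple(site[1 - c] for site, c in zip(genotype, choice))
        diplotypes.add(frozenset((h1, h2)))
    return diplotypes
```

For the doubly heterozygous genotype (("A", "a"), ("B", "b")) this returns the two diplotypes named on the slide, {(A, B), (a, b)} and {(a, B), (A, b)}. A genotype that is heterozygous at no more than one site has a single compatible diplotype, so its phase is known.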

Missing Haplotypes
The likelihood in terms of the diplotype is
We observe the genotypes G
The likelihood of the observed data is

Missing Haplotypes
The likelihood of the observed data is
Note how easy this was: it is really the profiled semiparametric likelihood of the observed data

Haplotypes
Danyu Lin has a nice EM-based program for estimating haplotype frequencies
It accepts data in text format with SAS missing data conventions
The program is flexible, and for example it can assume Hardy-Weinberg equilibrium (HWE)
http://www.bios.unc.edu/~lin/hapstat/

Haplotype Fitting
Models that assume haplotype-environment independence are straightforward to fit via EM
Danyu Lin’s program can do this as well as our SAS program
The remaining issue is how to gain robustness against deviations from this assumed independence
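As a rough illustration of the EM idea, the Python sketch below estimates haplotype frequencies from unphased genotypes at two biallelic SNPs under HWE. It ignores disease status and covariates, so it is far simpler than the programs mentioned on this slide; the function name and the 0/1/2 genotype coding are assumptions.

```python
import numpy as np
from itertools import product

# The four possible haplotypes over two biallelic SNPs, coded 0/1 per site.
HAPS = np.array(list(product((0, 1), repeat=2)))

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies under HWE.

    genotypes: array of shape (n, 2) holding allele counts 0/1/2 per SNP.
    Returns the frequency vector and the log-likelihood at each iteration.
    """
    genotypes = np.asarray(genotypes)
    # compat[i, j, k] is True when the ordered pair (hap j, hap k) gives genotype i.
    pair_sum = HAPS[:, None, :] + HAPS[None, :, :]
    compat = (pair_sum[None] == genotypes[:, None, None, :]).all(axis=-1)
    theta = np.full(len(HAPS), 1.0 / len(HAPS))
    loglik_path = []
    for _ in range(n_iter):
        # Under HWE an ordered pair (j, k) has probability theta_j * theta_k.
        w = compat * np.outer(theta, theta)
        tot = w.sum(axis=(1, 2))
        loglik_path.append(np.log(tot).sum())
        post = w / tot[:, None, None]          # E-step: posterior over pairs
        counts = post.sum(axis=(0, 2)) + post.sum(axis=(0, 1))
        theta = counts / counts.sum()          # M-step: expected haplotype counts
    return theta, np.array(loglik_path)
```

Only the doubly heterozygous genotype (1, 1) is ambiguous here; the E-step splits it between its two compatible diplotypes in proportion to their current HWE probabilities.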

Robustness
We build robustness by specifying models for diplotypes given the environmental variables
We first run a program to get a preliminary estimate of haplotype frequency
We use the most frequent haplotype as a reference haplotype

Haplotypes
Approach: Start with a logistic model for the unobserved haplotypes H given covariates X
In practice, we collapse all rare haplotypes into the reference haplotype to eliminate many variables

Haplotypes
Approach: Start with a logistic model for the unobserved haplotypes H given covariates X
This gives us the model:

Haplotypes
Since the diplotypes are not observed, for identifiability we need further constraints
Example: One simple additive-type model is that

Haplotypes Further identification: Assume that the population as a whole is in HWE, so that

Haplotypes Summary: We have two models

Haplotypes
Summary: The models are linked
Let F(x) be the marginal distribution of X
Then
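The linking equation is missing from the transcript. Under the two models just summarized, it presumably takes the following form, where θ_j denotes the population frequency of haplotype h_j (assumed notation):

```latex
\begin{align*}
\mathrm{pr}(H_{dip} = \{h_j, h_k\})
  &= \int \mathrm{pr}(H_{dip} = \{h_j, h_k\} \mid X = x)\, dF(x), \\[4pt]
\text{where, under HWE,}\quad
\mathrm{pr}(H_{dip} = \{h_j, h_k\}) &=
  \begin{cases}
    \theta_j^2, & j = k, \\
    2\,\theta_j \theta_k, & j \neq k.
  \end{cases}
\end{align*}
```

The left-hand side is fixed by the population HWE model, so the parameters of the conditional model are constrained once F(x) is specified or estimated.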

Haplotypes In this set up, we have a particular form for hence is defined through them and the marginal distribution of X

Marginal Distributions of X
Three approaches for estimating F(x):
–Profiled likelihood
–If pr(D=1) is known, weighted mixture of empirical cdf for cases and controls
–For rare disease, the empirical cdf for the controls
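The second and third approaches are simple enough to sketch directly. The Python functions below are illustrative, with hypothetical names, and assume a scalar X; the profiled-likelihood approach is not shown.

```python
import numpy as np

def ecdf(sample, t):
    """Empirical cdf of `sample` evaluated at the point(s) `t`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, t, side="right") / sample.size

def marginal_cdf(x_cases, x_controls, t, pr_disease=None):
    """Estimate F(t) = pr(X <= t) in the population from case-control data.

    pr_disease given: pr(D=1) * F_cases(t) + pr(D=0) * F_controls(t).
    pr_disease None:  rare-disease approximation using the controls only.
    """
    f_controls = ecdf(x_controls, t)
    if pr_disease is None:
        return f_controls
    return pr_disease * ecdf(x_cases, t) + (1 - pr_disease) * f_controls
```

The mixture weights follow from the law of total probability: the population distribution of X is the disease-prevalence-weighted average of its distributions among cases and among controls.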

Summary
Population model for the diplotypes, e.g., HWE
Conditional model for diplotypes given environment
Various estimates of marginal distribution of environment and the crucial link

Haplotypes Analysis
The resulting method adds robustness
EM-algorithms enable fast computation
Explicit asymptotic theory (not trivial)
The method is also semiparametric efficient

Haplotypes Analysis Simulations indicate the gain in robustness

The NAT2 Example
Study of colorectal adenoma, a precursor to colon cancer
628 cases and 635 controls
The gene NAT2 is known to be important in the metabolism of smoking-related carcinogens
X: age, gender, whether one smokes or used to smoke
6 SNPs
Haplotype 101010 is of interest

The NAT2 Example
7 haplotypes had frequency > 0.5%
The most frequent was treated as baseline, with an additive risk model for the diplotypes
Interactions of the smoking variable with the haplotype 101010 in the risk model
Interactions of the smoking variable with the haplotypes in the gene model

The NAT2 Example
Current smoking and 101010 haplotype interaction
                Estimate    s.e.    P-value
Independence     -0.29      0.18     0.109
Dependence       -0.56      0.27     0.039
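The slides do not say how the p-values were computed. As a sanity check, a two-sided Wald test based on the normal approximation to estimate / s.e. gives values close to those in the table; small differences are expected because the estimates and standard errors shown are rounded. The helper name below is hypothetical.

```python
import math

def wald_p_value(estimate, se):
    """Two-sided p-value from a normal approximation to estimate / se."""
    z = abs(estimate / se)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)) for the standard normal cdf Phi.
    return math.erfc(z / math.sqrt(2.0))

p_independence = wald_p_value(-0.29, 0.18)
p_dependence = wald_p_value(-0.56, 0.27)
```

Under this approximation the independence analysis stays above the 0.05 level and the dependence analysis falls below it, in agreement with the table.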

The NAT2 Example
In this example, recognizing the possibility that the gene distribution may depend on the environment (smoking) changes the analysis
Plus, we get a p-value < 0.05!

Further work
There is another way to get robustness, described in work that we have just submitted
The idea is that the haplotypes and the environment are independent given the genotypes
That is, once you know the genotypes, the haplotypes are determined solely by random mating.

Further work
We then have two estimates:
–Haplotype-environment unconditional independence
–Independence conditional on the genotype
Then we do a penalized likelihood analysis
–Likelihood is the conditional independence likelihood
–The penalty is the L1 distance from the unconditional independence estimate
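To show what such a penalty does, the Python sketch below solves a deliberately simplified version of the problem. It replaces the conditional-independence log-likelihood with a separable quadratic approximation around its maximizer, which is an assumption made here for illustration and not the authors' algorithm. Under that approximation the penalized estimate has a closed form given by soft-thresholding.

```python
import numpy as np

def l1_shrink_toward(beta_cond, beta_uncond, var_cond, lam):
    """Coordinatewise minimizer of
        0.5 * (b - beta_cond)**2 / var_cond + lam * |b - beta_uncond|.

    beta_cond:   estimate under independence conditional on genotype
    beta_uncond: estimate under unconditional independence
    var_cond:    variance of beta_cond (quadratic approximation, assumed)
    lam:         penalty weight
    """
    beta_cond = np.asarray(beta_cond, dtype=float)
    beta_uncond = np.asarray(beta_uncond, dtype=float)
    diff = beta_cond - beta_uncond
    threshold = lam * np.asarray(var_cond, dtype=float)
    # Soft-threshold the gap between the two estimates.
    shrunk = np.sign(diff) * np.maximum(np.abs(diff) - threshold, 0.0)
    return beta_uncond + shrunk
```

With lam = 0 the result is the conditional-independence estimate. As lam grows, the result moves toward the unconditional-independence estimate and equals it exactly once lam * var_cond exceeds the gap between the two.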

Further work The result is increased robustness and major gains in efficiency

Summary
Fully flexible risk models
Flexible models for genes/haplotypes given covariates
Computable semiparametric efficient inference that is more powerful than ordinary logistic regression and more robust than gene-environment independence

Thanks! http://stat.tamu.edu/~carroll
