Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics.

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAA A A A A A AA

Advertising Training: We are finishing Year 08 of an NCI- funded R25T training program http://www.stat.tamu.edu/b3nc We train statistically and computationally oriented post-docs in the biology of nutrition and cancer Active seminar series

Outline Problem: Case-Control Studies with Gene- Environment relationships Efficient formulation when genes are observed Haplotype modeling and Robustness Applications

Acknowledgment This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

Software SAS and Matlab Programs Available at my web site under the software button Examples are given in the programs Paper are in Biometrika (2005), Genetic Epidemiology (2006), Biostatistics (2007), Biometrics (2008) and JASA (2009) R programs available from the NCI http://stat.tamu.edu/~carroll

Basic Problem Formalized Gene and Environment Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?

Basic Problem Formalized Gene and Environment Question: For people carrying a particular haplotype in the VDR pathway, does higher levels of serum Vitamin D protect against prostate cancer?

Basic Problem Formalized Gene and Environment Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?

Prospective and Retrospective Studies D = disease status (binary) X = environmental variables Smoking status Vitamin D Oral contraceptive use G = gene status Mutation or not Multiple or single SNP Haplotypes

Prospective and Retrospective Studies Prospective: Classic random sampling of a population You measure gene and environment on a cohort You then follow up people for disease occurrence

Prospective and Retrospective Studies Prospective Studies: Expensive: disease states are rare, so large sample sizes needed Time-consuming: you have to wait for disease to develop They Exist: Framingham Heart Study, NIH- AARP Diet and Health Study, Women’s Health Initiative, etc.

Prospective and Retrospective Studies Prospective Studies: Daunting Task: Only very large, very expensive prospective studies can find gene- environment interactions Data Access: Access to the Framingham Heart Study requires a university commitment to security

Prospective and Retrospective Studies Retrospective Studies: Usually called case- control studies Find a population of cases, i.e., people with a disease, and sample from it. Find a population of controls, i.e., people without the disease, and sample from it.

Prospective and Retrospective Studies Retrospective Studies: Because the gene G and the environment X are sample after disease status is ascertained Microarray studies on humans: most are case-control studies Genome Wide Association Studies (GWAS): most are case-control studies

Prospective and Retrospective Studies Case-control Studies: Fast: no need to wait for disease to develop Cheap: sample sizes are much smaller Subtle: The controls need to be representative of the population of people without the disease.

Basic Problem Formalized Case control sample: D = disease Gene expression: G Environment, can include strata: X We are interested in main effects for G and X along with their interaction as they affect development of disease

Basic Problem Formalized 99.9999% of analyses of case-control data use logistic regression Closely related to Fisher’s Linear Discriminant Analysis (LDA) Difference: we want to understand what targets affect disease, not just predict disease

Logistic Regression Logistic Function: The approximation works for rare diseases

Prospective Models Simplest logistic model without an interaction The effect of having a mutation (G=1) versus not (G=0) is

Prospective Models Simplest logistic model with an interaction The effect of having a mutation (G=1) versus not (G=0) is

Empirical Observations Logistic regression is in every statistical package Unfortunately, logistic regression is not efficient for understanding interactions Much larger sample sizes are required for interactions that for just gene effects Most gene-environment interaction case-control studies fail for this reason

Empirical Observations Statistical Theory: There is a lovely statistical theory available It says: ignore the fact that you have a case- control sample, and pretend you have a prospective study It all works out: don’t worry, be happy!

Empirical Observations Statistical Theory: Ordinary logistic regression applied to a case-control study makes no assumptions about the population distribution of (G,X) Remember: we do not have a sample from a population, only a case-control sample Logistic regression is robust: to assumptions about the population distribution of (G,X)

Likelihood Function The likelihood is Note how the likelihood depends on two things: The distribution of (X,G) in the population The probability of disease in the population Neither can be estimated from the case-control study

When G is observed Logistic regression is thus robust to any modeling assumptions about the covariates in the population Unfortunately it is not very efficient for understanding interactions

Gene-Environment Independence In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata This assumption is often used in gene- environment interaction studies

G-E Independence Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

Gene-Environment Independence If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. This is NOT TRUE for prospective studies, only true for retrospective studies.

Gene-Environment Independence The reason is that you are putting a constraint on the retrospective likelihood

Gene-Environment Independence Our Methodology: Is far more general than assuming that genetic status and environment are independent We have developed capacity for modeling the distribution of genetic status given strata and environmental factors I will skip this and just pretend G-E independence here

More Efficiency, G Observed Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium Consequences: More efficient estimation of G effects Much more efficient estimation of G-E interactions.

The Formulation Any logistic model works Question: What methods do we have to construct estimators?

Methodology I won’t give you the full methodology, but it works as follows. Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people

Pretend Missing Data Formulation Suppose you have a large but finite population of size N Then, there are with the disease There are without the disease

Pretend Missing Data Formulation In a case-control sample, we randomly select n 1 with the disease, and n 0 without. The fraction of people with disease status D=d that we observe is

Pretend Missing Data Formulation Pretend you randomly sample a population You observe a person who has D=d, and, with the probability Statisticians know how to deal with missing data, e.g., compute probabilities for what you actually see

Pretend Missing Data Formulation In this pretend missing data formulation, ordinary logistic regression is simply We have a model for G given X, hence we compute

Methodology Our method has an explicit form, i.e., no integrals or anything nasty It is easy to program the method to estimate the logistic model It is likelihood based. Technically, a semiparametric profile likelihood

Methodology We can handle missing gene data We can handle error in genotyping We can handle measurement errors in environmental variables, e.g., diet

Methodology Our method results in much more efficient statistical inference

More Data What does More efficient statistical inference mean? It means, effectively, that you have more data In cases that G is a simple mutation, our method is typically equivalent to having 3 times more data

How much more data: Typical Simulation Example The increase in effective sample size when using our methodology

Real Data Complexities The Israeli Ovarian Cancer Study G = BRCA1/2 mutation (very deadly) X includes age, ethnic status (below), parity, oral contraceptive use Family history Smoking Etc.

Real Data Complexities In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases Also, among Jewish citizens, Israel has two dominant ethnic types Ashkenazi (European) Shephardic (North African)

Real Data Complexities The gene mutation BRCA1/2 if frequent among the Ashkenazi, but rare among the Shephardic Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X Gene-Environment independence fails here What can be done? Model pr(G=1 | X) as binary with different probabilities!

Israeli Ovarian Cancer Study Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?

Typical Empirical Example

Israeli Ovarian Cancer Study Main Effect of BRCA1/2:

Israeli Ovarian Cancer Study Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) No evidence for protective effect Not available from case-only analysis Length of interval is ½ the length of the usual analysis

Haplotypes Haplotypes consist of what we get from our mother and father at more than one site Mother gives us the haplotype h m = (A m,B m ) Father gives us the haplotype h f = (a f,b f ) Our diplotype is H dip = {(A m,B m ), (a f,b f )}

Haplotypes Unfortunately, we cannot presently observe the two haplotypes We can only observe genotypes Thus, if we were really H dip = {(A m,B m ), (a f,b f )}, then the data we would see would simply be the unordered set (A,a,B,b)

Missing Haplotypes Thus, if we were really H dip = {(A m,B m ), (a f,b f )}, then the data we would see would simply be the unordered set (A,a,B,b) However, this is also consistent with a different diplotype, namely H dip = {(a m,B m ), (A f,b f )} Note that the number of copies of the (a,b) haplotype differs in these two cases The true diploid = haplotype pair is missing

Missing Haplotypes Our methods handle unphased diplotyes (missing haplotypes) with no problem. Standard EM-algorithm calculations can be used We assume that the haplotypes are in HWE, and have extended to cases of non-HWE

Robustness Robustness: We are making assumptions to gain efficiency = “get more data” What happens if the assumptions are wrong? Biases, incorrect conclusions, etc. How can we gain efficiency when it is warranted, and yet have valid inferences?

Two Likelihoods In our “pretend” missing data formulation, the model free estimator uses the likelihood The model-based estimator uses the likelihood

Two Likelihoods The two likelihoods lead to two estimators The former is robust but not efficient The latter is efficient but not robust What to do?

Empirical Bayes We chose an Empirical Bayes approach Let and Then is diagonal with elements

Comments on Empirical Bayes If the model fails, then the estimator converges to the model-free estimator If the model holds, the estimator estimates the right thing, but is much more efficient than the model-free estimator

Simulations Various simulations show the following If the model holds, EB is slightly less efficient that model-based much more efficient than model-free If the model fails, Model-based is badly biased EB and shrinkage eliminate most bias, at least as efficient as model-free

Example 1: Prostate Cancer G = SNPs in the Vitamin D Pathway X = Serum-level biomarker of vitamin D (diet and sun) The VDR gene is downstream in the pathway, hence unlikely to influence the level of X Gene-environment independence likely

Example I: Vitamin D

Example 2: Colorectal Adenoma G = SNPs in the NAT2 gene, which is important in the metabolism of X =Various measures of smoking history The NAT2 gene may make smokers more addicted Gene-environment independence unlikely

The NAT2 Example Current smoking and 101010 haplotype interaction coefficient Current smokers with this haplotype are 50% less likely to develop a colorectal adenoma MethodEstimates.e.p-value Model Free-0.630.170.014 Independence-0.330.160.048 Consistent EB1 -0.590.250.017

The VDR Example Serum Vitamin D and 000 haplotype interaction coefficient Men with 1 sd greater Serum vitamin D then the norm are 70% less likely to develop prostate cancer MethodEstimates.e.p-value Model Free-0.210.120.093 Independence-0.180.080.019 Consistent EB1 -0.190.080.021

Genome-Wide Association Studies These methods can be applied to GWAS My last two examples were actually from the PLCO GWAS Also, can call the environment = other SNP

Follow-up #1 Follow-up #2 Establish Loci Initial Study Identifying Genetic Markers for Prostate & Breast Cancer Fine Mapping Functional Studies Validate Plausible Variants Possible Clinical Testing Genome-Wide Analysis Public Health Problem Prostate (1 in 8 Men) Breast (1 in 9 Women) Analyze Long-Term Studies NCI PLCO Study Nurses’ Health Study http://cgems.cancer.gov

Identifying Genetic Markers for Prostate & Breast Cancer

Case-Control studies nested in prospective cohort used in CGEMS GWAS Non aggressive : stage <= 2 (non invasive) and Gleason score <= 6 Aggressive : stage > 2 (invasive) or Gleason score > 6 May 2004 32,826 eligible participants Post-menoposal Breast Cancer 1183 cases NHS cohort starts 1976 199520042000 Oct 2001 Oct 2003 28,521 eligible participants Aggressive ProstateCancer 737 cases Non-aggressive P. C. 1995 200220001998 1990 blood sample collection 1185 controls 2000 493 cases 1230 controls blood sample collection PLCO cohort starts 1994

Genome-Wide Association Studies The methodology I will describe is now the standard gene-environment analysis at the National Cancer Institute for GWAS There are now 500,000 SNP in a typical GWAS, and our method is fast enough to handle this

Genome-Wide Association Studies Typically, loci are identified initially for main effects, then followed up for gene-environment interactions My analyses have come from the PLCO study In some cases, the “environment” is other genes on different chromosomes, i.e., gene- gene interactions

Genome-Wide Association Studies Despite the fact that the genes are on different chromosomes, they are not always independent For example they might be in the same pathway

Genome-Wide Association Studies When genes on different chromosomes are independenjt, our methods give huge gains in efficiency = “more data” = smaller standard errors When they are not, our methods give, in effect, the robust method of ordinary logistic regression

Summary Case-control studies are the backbone of epidemiology in general, and genetic epidemiology in particular Their retrospective nature distinguishes them from random samples = prospective studies

Summary We start by assuming relationships between the genes and the “environment” in the population, e.g., independence This model can be fully flexible We also, where necessary, specify distributions for genes

Summary We calculated a new likelihood function, leading to more much more precise inferences The method can handle missing genes, genotyping errors, measurement errors in the environment Calculations are straightforward via the EM algorithm

Summary Forced to face the dilemma Lousy but robust method Great but not robust method We developed a fast, data adaptive, novel way of addressing this issue In cases where one can predict the outcome, the EB method works as desired

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics.

Similar presentations

Presentation on theme: "Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics.

Similar presentations

Presentation on theme: "Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics."— Presentation transcript:

Similar presentations

About project

Feedback