Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics.

Presentation transcript:

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University

Advertising Training: We are finishing Year 08 of an NCI-funded R25T training program. We train statistically and computationally oriented post-docs in the biology of nutrition and cancer. Active seminar series

Outline Problem: Case-Control Studies with Gene-Environment relationships. Efficient formulation when genes are observed. Haplotype modeling and Robustness. Applications

Acknowledgment This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

Software SAS and Matlab programs are available at my web site under the software button. Examples are given in the programs. Papers are in Biometrika (2005), Genetic Epidemiology (2006), Biostatistics (2007), Biometrics (2008) and JASA (2009). R programs are available from the NCI

Basic Problem Formalized Gene and Environment Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?

Basic Problem Formalized Gene and Environment Question: For people carrying a particular haplotype in the VDR pathway, do higher levels of serum Vitamin D protect against prostate cancer?

Basic Problem Formalized Gene and Environment Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?

Prospective and Retrospective Studies D = disease status (binary). X = environmental variables: smoking status, Vitamin D, oral contraceptive use. G = gene status: mutation or not, multiple or single SNP, haplotypes

Prospective and Retrospective Studies Prospective: Classic random sampling of a population You measure gene and environment on a cohort You then follow up people for disease occurrence

Prospective and Retrospective Studies Prospective Studies: Expensive: disease states are rare, so large sample sizes are needed. Time-consuming: you have to wait for disease to develop. They Exist: Framingham Heart Study, NIH-AARP Diet and Health Study, Women’s Health Initiative, etc.

Prospective and Retrospective Studies Prospective Studies: Daunting Task: Only very large, very expensive prospective studies can find gene-environment interactions. Data Access: Access to the Framingham Heart Study requires a university commitment to security

Prospective and Retrospective Studies Retrospective Studies: Usually called case-control studies. Find a population of cases, i.e., people with a disease, and sample from it. Find a population of controls, i.e., people without the disease, and sample from it.

Prospective and Retrospective Studies Retrospective Studies: So called because the gene G and the environment X are sampled after disease status is ascertained. Microarray studies on humans: most are case-control studies. Genome Wide Association Studies (GWAS): most are case-control studies

Prospective and Retrospective Studies Case-control Studies: Fast: no need to wait for disease to develop Cheap: sample sizes are much smaller Subtle: The controls need to be representative of the population of people without the disease.

Basic Problem Formalized Case control sample: D = disease Gene expression: G Environment, can include strata: X We are interested in main effects for G and X along with their interaction as they affect development of disease

Basic Problem Formalized Most analyses of case-control data use logistic regression. Closely related to Fisher’s Linear Discriminant Analysis (LDA). Difference: we want to understand what targets affect disease, not just predict disease

Logistic Regression Logistic Function: H(x) = exp(x) / {1 + exp(x)}. The approximation H(x) ≈ exp(x) works for rare diseases
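
As a minimal sketch, assuming the standard logistic function H(x) = exp(x) / {1 + exp(x)}, the following compares H(x) with exp(x). The function names are illustrative. The relative error of the approximation works out to exp(x) itself, so it is small exactly when the disease probability is small.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function H(x) = exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def relative_error_of_exp(x: float) -> float:
    """Relative error made when H(x) is approximated by exp(x)."""
    return (math.exp(x) - logistic(x)) / logistic(x)


# The more negative x is (the rarer the disease), the better the approximation.
for x in (-6.0, -4.0, -2.0, 0.0):
    print(
        f"x={x:5.1f}  H(x)={logistic(x):.5f}  exp(x)={math.exp(x):.5f}  "
        f"relative error={relative_error_of_exp(x):.4f}"
    )
```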

Prospective Models Simplest logistic model without an interaction. The effect of having a mutation (G=1) versus not (G=0) is a single odds ratio that does not depend on X

Prospective Models Simplest logistic model with an interaction. The effect of having a mutation (G=1) versus not (G=0) is an odds ratio that changes with the level of X
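
A sketch of the two prospective models in LaTeX, for binary G and a scalar X. The coefficient names (\beta_0, \beta_G, \beta_X, \beta_{GX}) are assumptions of this sketch, not notation taken from the slides.

```latex
\begin{align*}
\text{No interaction:}\quad
  \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    &= \beta_0 + \beta_G G + \beta_X X, \\
  \text{odds ratio for } G=1 \text{ versus } G=0
    &= \exp(\beta_G); \\[4pt]
\text{With interaction:}\quad
  \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    &= \beta_0 + \beta_G G + \beta_X X + \beta_{GX}\, G X, \\
  \text{odds ratio for } G=1 \text{ versus } G=0
    &= \exp(\beta_G + \beta_{GX} X).
\end{align*}
```

For a rare disease these odds ratios are approximately relative risks.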

Empirical Observations Logistic regression is in every statistical package. Unfortunately, logistic regression is not efficient for understanding interactions. Much larger sample sizes are required for interactions than for just gene effects. Most gene-environment interaction case-control studies fail for this reason

Empirical Observations Statistical Theory: There is a lovely statistical theory available. It says: ignore the fact that you have a case-control sample, and pretend you have a prospective study. It all works out: don’t worry, be happy!

Empirical Observations Statistical Theory: Ordinary logistic regression applied to a case-control study makes no assumptions about the population distribution of (G,X) Remember: we do not have a sample from a population, only a case-control sample Logistic regression is robust: to assumptions about the population distribution of (G,X)

Likelihood Function The likelihood is that of (G,X) given disease status D. Note how the likelihood depends on two things: the distribution of (X,G) in the population, and the probability of disease in the population. Neither can be estimated from the case-control study
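
Written out via Bayes' rule, a sketch of this retrospective likelihood is given below. The density symbol f_{G,X} is an assumed notation for the population distribution of (G,X).

```latex
\[
\mathrm{pr}(G=g, X=x \mid D=d)
  = \frac{\mathrm{pr}(D=d \mid G=g, X=x)\, f_{G,X}(g,x)}{\mathrm{pr}(D=d)},
\qquad
\mathrm{pr}(D=d)
  = \sum_{g^*} \int \mathrm{pr}(D=d \mid G=g^*, X=x^*)\, f_{G,X}(g^*,x^*)\, dx^* .
\]
```

The two quantities named on the slide are f_{G,X} in the numerator and pr(D=d) in the denominator.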

When G is observed Logistic regression is thus robust to any modeling assumptions about the covariates in the population Unfortunately it is not very efficient for understanding interactions

Gene-Environment Independence In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata. This assumption is often used in gene-environment interaction studies

G-E Independence Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

Gene-Environment Independence If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. This is NOT TRUE for prospective studies, only true for retrospective studies.

Gene-Environment Independence The reason is that you are putting a constraint on the retrospective likelihood

Gene-Environment Independence Our Methodology: Is far more general than assuming that genetic status and environment are independent We have developed capacity for modeling the distribution of genetic status given strata and environmental factors I will skip this and just pretend G-E independence here

More Efficiency, G Observed Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium Consequences: More efficient estimation of G effects Much more efficient estimation of G-E interactions.
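
A minimal sketch of the genetic model named above, assuming G counts the copies of the variant allele at a single biallelic SNP with allele frequency p. The function name is illustrative. Under G-E independence, these probabilities are also pr(G=g | X) for every value of X.

```python
from typing import Dict


def hwe_genotype_probs(p: float) -> Dict[int, float]:
    """Genotype probabilities under Hardy-Weinberg Equilibrium.

    p is the population frequency of the variant allele; the keys are the
    number of variant alleles carried (0, 1 or 2).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}


# Illustrative allele frequency, not taken from any study in the talk.
print(hwe_genotype_probs(0.1))
```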

The Formulation Any logistic model works Question: What methods do we have to construct estimators?

Methodology I won’t give you the full methodology, but it works as follows. Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people

Pretend Missing Data Formulation Suppose you have a large but finite population of size N. Then there are N pr(D=1) people with the disease and N pr(D=0) people without the disease

Pretend Missing Data Formulation In a case-control sample, we randomly select n_1 with the disease, and n_0 without. The fraction of people with disease status D=d that we observe is n_d / {N pr(D=d)}

Pretend Missing Data Formulation Pretend you randomly sample a population. You observe a person who has D=d with probability n_d / {N pr(D=d)}. Statisticians know how to deal with missing data, e.g., compute probabilities for what you actually see
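
The pretend sampling scheme can be sketched numerically. The population size, disease probability and sample sizes below are made-up illustrative values, and the function names are assumptions. The log ratio of the two observation fractions is the familiar amount by which the intercept of a logistic regression fitted to case-control data is shifted.

```python
import math
from typing import Tuple


def selection_fractions(
    N: int, p_disease: float, n1: int, n0: int
) -> Tuple[float, float]:
    """Fractions of diseased and non-diseased people that are observed.

    Returns (pi_1, pi_0) with pi_d = n_d / {N pr(D=d)}.
    """
    pi1 = n1 / (N * p_disease)
    pi0 = n0 / (N * (1.0 - p_disease))
    return pi1, pi0


def intercept_shift(N: int, p_disease: float, n1: int, n0: int) -> float:
    """log(pi_1 / pi_0): the offset added to the logistic intercept."""
    pi1, pi0 = selection_fractions(N, p_disease, n1, n0)
    return math.log(pi1 / pi0)


# Illustrative values: a 1% disease in a population of one million,
# with 500 cases and 500 controls sampled.
print(selection_fractions(1_000_000, 0.01, 500, 500))
print(intercept_shift(1_000_000, 0.01, 500, 500))
```

Note that N cancels in the ratio, so the shift depends only on n_1/n_0 and the disease odds in the population.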

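Pretend Missing Data Formulation In this pretend missing data formulation, ordinary logistic regression is simply the likelihood of D given (G,X) for a person who is observed, pr(D=d | G, X, observed). We have a model for G given X, hence we compute the joint likelihood of D and G given X, pr(D=d, G=g | X, observed)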
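
A sketch of the two likelihoods for a binary G and a scalar X, derived from the pretend sampling scheme rather than copied from the slides. Assumptions of this sketch: the risk model has the form beta_g*G + beta_x*X + beta_gx*G*X plus an intercept beta0; kappa denotes beta0 + log(n_1/n_0) - log{pr(D=1)/pr(D=0)}; and q(g, x) is a user-supplied model for pr(G=g | X=x). All names are illustrative.

```python
import math
from typing import Callable, Tuple

Beta = Tuple[float, float, float]  # (beta_g, beta_x, beta_gx), assumed layout


def risk_score(g: int, x: float, beta: Beta) -> float:
    """Linear predictor without the intercept."""
    b_g, b_x, b_gx = beta
    return b_g * g + b_x * x + b_gx * g * x


def model_free_lik(d: int, g: int, x: float, kappa: float, beta: Beta) -> float:
    """pr(D=d | G=g, X=x, observed): logistic regression with intercept kappa."""
    eta = kappa + risk_score(g, x, beta)
    return math.exp(d * eta) / (1.0 + math.exp(eta))


def model_based_lik(
    d: int, g: int, x: float, beta0: float, kappa: float, beta: Beta,
    q: Callable[[int, float], float],
) -> float:
    """pr(D=d, G=g | X=x, observed), using the model q(g, x) for G given X."""

    def s(d_: int, g_: int) -> float:
        m = risk_score(g_, x, beta)
        return math.exp(d_ * (kappa + m)) / (1.0 + math.exp(beta0 + m)) * q(g_, x)

    total = sum(s(d_, g_) for d_ in (0, 1) for g_ in (0, 1))
    return s(d, g) / total


# Under G-E independence, q ignores x. The carrier probability is illustrative.
def q_independent(g: int, x: float) -> float:
    return 0.05 if g == 1 else 0.95


print(model_based_lik(1, 1, 0.5, -4.0, 0.1, (0.5, 0.2, -0.3), q_independent))
```

Both functions are sums with no integrals, and conditioning the joint likelihood on G recovers the logistic regression likelihood, which is a useful sanity check on any implementation.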

Methodology Our method has an explicit form, i.e., no integrals or anything nasty It is easy to program the method to estimate the logistic model It is likelihood based. Technically, a semiparametric profile likelihood

Methodology We can handle missing gene data We can handle error in genotyping We can handle measurement errors in environmental variables, e.g., diet

Methodology Our method results in much more efficient statistical inference

More Data What does more efficient statistical inference mean? It means, effectively, that you have more data. In cases where G is a simple mutation, our method is typically equivalent to having 3 times more data

How much more data: Typical Simulation Example The increase in effective sample size when using our methodology
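
The "more data" reading of efficiency can be made concrete with a small calculation. This is a sketch that assumes standard errors shrink like one over the square root of the sample size; the function name is illustrative.

```python
import math


def effective_sample_multiplier(se_standard: float, se_new: float) -> float:
    """How many times more data the standard analysis would need to match
    the standard error of the new analysis (variance scales like 1/n)."""
    if se_standard <= 0.0 or se_new <= 0.0:
        raise ValueError("standard errors must be positive")
    return (se_standard / se_new) ** 2


# "3 times more data" corresponds to a standard error smaller by sqrt(3).
print(effective_sample_multiplier(math.sqrt(3.0), 1.0))
# Halving the standard error corresponds to 4 times more data.
print(effective_sample_multiplier(2.0, 1.0))
```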

Real Data Complexities The Israeli Ovarian Cancer Study G = BRCA1/2 mutation (very deadly) X includes age, ethnic status (below), parity, oral contraceptive use Family history Smoking Etc.

Real Data Complexities In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases. Also, among Jewish citizens, Israel has two dominant ethnic types: Ashkenazi (European) and Sephardic (North African)

Real Data Complexities The gene mutation BRCA1/2 is frequent among the Ashkenazi, but rare among the Sephardic. Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X. Gene-Environment independence fails here. What can be done? Model pr(G=1 | X) as binary with different probabilities!

Israeli Ovarian Cancer Study Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?

Typical Empirical Example

Israeli Ovarian Cancer Study Main Effect of BRCA1/2:

Israeli Ovarian Cancer Study Odds ratio for OC use among carriers = 1.04 (0.98, 1.09) No evidence for protective effect Not available from case-only analysis Length of interval is ½ the length of the usual analysis

Haplotypes Haplotypes consist of what we get from our mother and father at more than one site. Mother gives us the haplotype h_m = (A_m, B_m). Father gives us the haplotype h_f = (a_f, b_f). Our diplotype is H_dip = {(A_m, B_m), (a_f, b_f)}

Haplotypes Unfortunately, we cannot presently observe the two haplotypes. We can only observe genotypes. Thus, if we were really H_dip = {(A_m, B_m), (a_f, b_f)}, then the data we would see would simply be the unordered set (A, a, B, b)

Missing Haplotypes Thus, if we were really H_dip = {(A_m, B_m), (a_f, b_f)}, then the data we would see would simply be the unordered set (A, a, B, b). However, this is also consistent with a different diplotype, namely H_dip = {(a_m, B_m), (A_f, b_f)}. Note that the number of copies of the (a,b) haplotype differs in these two cases. The true diplotype (haplotype pair) is missing
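
The phase ambiguity can be sketched by enumerating every haplotype pair consistent with a set of unphased genotypes. The representation (one pair of alleles per site) and the function name are assumptions of this sketch.

```python
from itertools import product
from typing import List, Set, Tuple

Haplotype = Tuple[str, ...]


def compatible_diplotypes(
    genotypes: List[Tuple[str, str]],
) -> Set[Tuple[Haplotype, Haplotype]]:
    """All unordered haplotype pairs consistent with unphased genotypes.

    genotypes holds, for each site, the two alleles observed there,
    e.g. [("A", "a"), ("B", "b")].
    """
    diplotypes = set()
    # At each site, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotypes)):
        h1 = tuple(site[c] for site, c in zip(genotypes, choice))
        h2 = tuple(site[1 - c] for site, c in zip(genotypes, choice))
        # Sorting makes the pair unordered, so mirror images collapse.
        diplotypes.add(tuple(sorted((h1, h2))))
    return diplotypes


# A double heterozygote is consistent with two different diplotypes.
for pair in sorted(compatible_diplotypes([("A", "a"), ("B", "b")])):
    print(pair)
```

Only individuals heterozygous at two or more sites are ambiguous; with at most one heterozygous site the diplotype is determined by the genotypes.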

Missing Haplotypes Our methods handle unphased diplotypes (missing haplotypes) with no problem. Standard EM-algorithm calculations can be used. We assume that the haplotypes are in HWE, and have extended the methods to cases of non-HWE
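
A sketch of the kind of calculation an EM E-step performs here: given haplotype frequencies and HWE, weight each diplotype that is compatible with the observed genotypes by its population probability. This covers the weighting step only, not a full EM fit, and it ignores disease status. The haplotype frequencies are made-up illustrative values and the function names are assumptions.

```python
from typing import Dict, List, Tuple


def hwe_diplotype_prob(h1: str, h2: str, freqs: Dict[str, float]) -> float:
    """Population probability of the unordered haplotype pair {h1, h2} under HWE."""
    p = freqs[h1] * freqs[h2]
    return p if h1 == h2 else 2.0 * p


def phase_posterior(
    candidates: List[Tuple[str, str]], freqs: Dict[str, float]
) -> List[float]:
    """Posterior probability of each candidate diplotype given the genotypes."""
    weights = [hwe_diplotype_prob(h1, h2, freqs) for h1, h2 in candidates]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative haplotype frequencies for two biallelic sites.
freqs = {"AB": 0.4, "ab": 0.4, "Ab": 0.1, "aB": 0.1}
# The two diplotypes consistent with the unphased data (A, a, B, b).
print(phase_posterior([("AB", "ab"), ("Ab", "aB")], freqs))
```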

Robustness Robustness: We are making assumptions to gain efficiency = “get more data” What happens if the assumptions are wrong? Biases, incorrect conclusions, etc. How can we gain efficiency when it is warranted, and yet have valid inferences?

Two Likelihoods In our “pretend” missing data formulation, the model-free estimator uses the likelihood pr(D=d | G, X, observed). The model-based estimator uses the likelihood pr(D=d, G=g | X, observed)

Two Likelihoods The two likelihoods lead to two estimators The former is robust but not efficient The latter is efficient but not robust What to do?

Empirical Bayes We chose an Empirical Bayes approach Let and Then is diagonal with elements

Comments on Empirical Bayes If the model fails, then the estimator converges to the model-free estimator If the model holds, the estimator estimates the right thing, but is much more efficient than the model-free estimator
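
The behaviour described above can be illustrated with a generic componentwise shrinkage rule. This is only a sketch: the formulas on the preceding slide are not recoverable from the transcript, and the weight used here (squared difference over variance plus squared difference) is an assumption chosen to reproduce the two stated properties, not necessarily the authors' estimator. All names are illustrative.

```python
from typing import List, Sequence


def eb_shrinkage(
    beta_free: Sequence[float],
    beta_model: Sequence[float],
    var_free: Sequence[float],
) -> List[float]:
    """Shrink the model-free estimate toward the model-based estimate.

    ASSUMED weight on the model-free estimate for each component:
        w = diff^2 / (var_free + diff^2),  diff = beta_free - beta_model.
    A large disagreement (model fails) gives w near 1, i.e. the model-free
    estimate; a small disagreement (model holds) gives w near 0.
    """
    combined = []
    for b_f, b_m, v in zip(beta_free, beta_model, var_free):
        if v <= 0.0:
            raise ValueError("variances must be positive")
        diff2 = (b_f - b_m) ** 2
        w = diff2 / (v + diff2)
        combined.append(b_m + w * (b_f - b_m))
    return combined


# Illustrative inputs: the first component agrees, the second disagrees strongly.
print(eb_shrinkage([0.50, 2.00], [0.48, 0.20], [0.04, 0.04]))
```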

Simulations Various simulations show the following. If the model holds, EB is slightly less efficient than model-based, and much more efficient than model-free. If the model fails, model-based is badly biased; EB and shrinkage eliminate most bias and are at least as efficient as model-free

Example 1: Prostate Cancer G = SNPs in the Vitamin D Pathway X = Serum-level biomarker of vitamin D (diet and sun) The VDR gene is downstream in the pathway, hence unlikely to influence the level of X Gene-environment independence likely

Example I: Vitamin D

Example 2: Colorectal Adenoma G = SNPs in the NAT2 gene, which is important in smoking metabolism. X = Various measures of smoking history. The NAT2 gene may make smokers more addicted. Gene-environment independence unlikely

The NAT2 Example Current smoking and haplotype interaction coefficient. Current smokers with this haplotype are 50% less likely to develop a colorectal adenoma. Method Estimate s.e. p-value Model Free Independence Consistent EB

The VDR Example Serum Vitamin D and 000 haplotype interaction coefficient. Men with 1 sd greater Serum vitamin D than the norm are 70% less likely to develop prostate cancer. Method Estimate s.e. p-value Model Free Independence Consistent EB

Genome-Wide Association Studies These methods can be applied to GWAS. My last two examples were actually from the PLCO GWAS. Also, the “environment” can be another SNP

Identifying Genetic Markers for Prostate & Breast Cancer (flowchart). Public Health Problem: Prostate (1 in 8 Men), Breast (1 in 9 Women). Analyze Long-Term Studies: NCI PLCO Study, Nurses’ Health Study. Genome-Wide Analysis: Initial Study, Follow-up #1, Follow-up #2. Establish Loci. Fine Mapping. Validate Plausible Variants. Functional Studies. Possible Clinical Testing

Identifying Genetic Markers for Prostate & Breast Cancer

Case-control studies nested in prospective cohorts used in the CGEMS GWAS (study-flow diagram). Prostate cancer, PLCO cohort: aggressive = stage > 2 (invasive) or Gleason score > 6, 737 cases; non-aggressive = stage <= 2 (non-invasive) and Gleason score <= 6. Post-menopausal breast cancer, NHS cohort: 1183 cases. Control groups of 1185 and 1230 are shown. Blood samples were collected in both cohorts

Genome-Wide Association Studies The methodology I will describe is now the standard gene-environment analysis at the National Cancer Institute for GWAS. There are now 500,000 SNPs in a typical GWAS, and our method is fast enough to handle this

Genome-Wide Association Studies Typically, loci are identified initially for main effects, then followed up for gene-environment interactions. My analyses have come from the PLCO study. In some cases, the “environment” is other genes on different chromosomes, i.e., gene-gene interactions

Genome-Wide Association Studies Despite the fact that the genes are on different chromosomes, they are not always independent For example they might be in the same pathway

Genome-Wide Association Studies When genes on different chromosomes are independent, our methods give huge gains in efficiency = “more data” = smaller standard errors. When they are not, our methods give, in effect, the robust method of ordinary logistic regression

Summary Case-control studies are the backbone of epidemiology in general, and genetic epidemiology in particular Their retrospective nature distinguishes them from random samples = prospective studies

Summary We start by assuming relationships between the genes and the “environment” in the population, e.g., independence This model can be fully flexible We also, where necessary, specify distributions for genes

Summary We calculated a new likelihood function, leading to much more precise inferences. The method can handle missing genes, genotyping errors, and measurement errors in the environment. Calculations are straightforward via the EM algorithm

Summary We are forced to face a dilemma: a lousy but robust method versus a great but not robust method. We developed a fast, data-adaptive, novel way of addressing this issue. In cases where one can predict the outcome, the EB method works as desired