Gene-Environment Case-Control Studies

Presentation transcript:

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll

Note the Maroon color scheme! And the green MSU flag.

Apologies to Dr. Seuss

Michigan State Grads at TAMU Mohsen Pourahmadi Soumen Lahiri

Other Michigan State Contacts David Ruppert Anton Schick

Outline Problem: Case-Control Studies with Gene-Environment relationships Theme I: Logistic regression is lousy for understanding interactions. We make assumptions that can double or triple the effective sample size

Outline Problem: Case-Control Studies with Gene-Environment relationships Theme II: There is a lousy estimator, and a good one that makes more assumptions. How do you protect yourself if the assumptions fail, and you want to analyze 500,000 SNPs?

Outline Problem: Case-Control Studies with Gene-Environment relationships Theme III: How does all this work with actual data, as opposed to simulated data?

Software SAS and Matlab programs are available at my web site under the software button. R programs are available from the NCI. New Statistical Science paper: 2009, volume 24, 489-502. http://stat.tamu.edu/~carroll

Basic Problem Formalized Gene and Environment Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?

Basic Problem Formalized Gene and Environment Question: For people carrying a particular haplotype in the VDR pathway, do higher levels of serum Vitamin D protect against prostate cancer?

Basic Problem Formalized Gene and Environment Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?

Retrospective Studies D = disease status (binary). X = environmental variables: smoking status, Vitamin D, oral contraceptive use. G = gene status: mutation or not, multiple or single SNPs, haplotypes.

Prospective and Retrospective Studies Retrospective Studies: Usually called case-control studies Find a population of cases, i.e., people with a disease, and sample from it. Find a population of controls, i.e., people without the disease, and sample from it.

Prospective and Retrospective Studies Retrospective Studies: Because the gene G and the environment X are sampled after disease status is ascertained

Basic Problem Formalized Case control sample: D = disease Gene expression: G Environment, can include strata: X We are interested in main effects for G and X along with their interaction as they affect development of disease

Logistic Regression Logistic Function: The approximation works for rare diseases
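For reference, the standard logistic function and its rare-disease approximation can be written as follows. The symbol H is an assumed name for the logistic function; the slide's own notation is not shown in the transcript.

```latex
H(x) = \frac{\exp(x)}{1 + \exp(x)}, \qquad
H(x) \approx \exp(x) \quad \text{when } \exp(x) \text{ is small (rare disease).}
```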

Prospective Models Simplest logistic model without an interaction The effect of having a mutation (G=1) versus not (G=0) is
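A sketch of what such a model typically looks like, assuming the logistic function is called H and the coefficients are named \beta_0, \beta_g, \beta_x (these names are illustrative, not taken from the slide):

```latex
\Pr(D = 1 \mid G, X) = H(\beta_0 + \beta_g G + \beta_x X),
\qquad
\frac{\text{odds}(D = 1 \mid G = 1, X)}{\text{odds}(D = 1 \mid G = 0, X)} = \exp(\beta_g).
```

Without an interaction, the odds ratio for the mutation is the same at every level of X.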

Prospective Models Simplest logistic model with an interaction The effect of having a mutation (G=1) versus not (G=0) is
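A sketch of the interaction version, again with assumed names H, \beta_0, \beta_g, \beta_x and an assumed interaction coefficient \beta_{gx}:

```latex
\Pr(D = 1 \mid G, X) = H(\beta_0 + \beta_g G + \beta_x X + \beta_{gx} G X),
\qquad
\frac{\text{odds}(D = 1 \mid G = 1, X)}{\text{odds}(D = 1 \mid G = 0, X)} = \exp(\beta_g + \beta_{gx} X).
```

With an interaction, the effect of the mutation depends on the environment X, which is exactly the quantity the gene-environment questions ask about.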

Empirical Observations Statistical Theory: There is a lovely statistical theory available It says: ignore the fact that you have a case-control sample, and pretend you have a prospective study
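A small numerical sketch of why pretending the study is prospective is harmless for the gene effect. It uses expected cell counts (no random sampling) for a single binary G, and all parameter values and function names are illustrative assumptions. Sampling cases and controls at different rates shifts the intercept by the log ratio of the sampling fractions but leaves the log odds ratio unchanged.

```python
import math


def expit(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def population_counts(beta0, beta_g, p_carrier, n_pop):
    """Expected counts by (disease status d, gene status g) in the population."""
    counts = {}
    for g in (0, 1):
        n_g = n_pop * (p_carrier if g == 1 else 1.0 - p_carrier)
        risk = expit(beta0 + beta_g * g)
        counts[(1, g)] = n_g * risk
        counts[(0, g)] = n_g * (1.0 - risk)
    return counts


def case_control_counts(counts, frac_cases, frac_controls):
    """Keep a fraction of the cases and a (smaller) fraction of the controls."""
    return {(d, g): c * (frac_cases if d == 1 else frac_controls)
            for (d, g), c in counts.items()}


def logistic_fit_binary(counts):
    """Closed-form logistic MLE when the only covariate is a binary G."""
    beta0 = math.log(counts[(1, 0)] / counts[(0, 0)])
    beta_g = math.log(counts[(1, 1)] / counts[(0, 1)]) - beta0
    return beta0, beta_g


# Hypothetical values: rare disease, 10% carriers, all cases and 2% of controls sampled.
pop = population_counts(beta0=-4.0, beta_g=0.7, p_carrier=0.1, n_pop=1_000_000)
sample = case_control_counts(pop, frac_cases=1.0, frac_controls=0.02)

b0_pop, bg_pop = logistic_fit_binary(pop)
b0_cc, bg_cc = logistic_fit_binary(sample)
print("gene effect, population vs case-control:", bg_pop, bg_cc)
print("intercept shift vs log sampling ratio:", b0_cc - b0_pop, math.log(1.0 / 0.02))
```

Only the intercept is distorted by the biased selection, which is why the slope estimates from ordinary logistic regression remain usable.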

When G is observed Logistic regression is robust to any modeling assumptions about the covariates in the population. Unfortunately, it is not very efficient for understanding interactions. Much larger sample sizes are required for interactions than for just gene effects.

Gene-Environment Independence In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata This assumption is often used in gene-environment interaction studies

G-E Independence Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

Gene-Environment Independence If you are willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. This is NOT TRUE for prospective studies, only true for retrospective studies.

Gene-Environment Independence The reason is that you are putting a constraint on the retrospective likelihood
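To make the constraint concrete, here is the retrospective likelihood contribution for one subject written out by Bayes' rule. The notation (F_X for the distribution of X, sums over g) is assumed for illustration.

```latex
\Pr(G = g, X = x \mid D = d)
  = \frac{\Pr(D = d \mid g, x)\,\Pr(G = g \mid X = x)\, dF_X(x)}
         {\sum_{g'} \int \Pr(D = d \mid g', x')\,\Pr(G = g' \mid X = x')\, dF_X(x')} .
```

Under gene-environment independence, \Pr(G = g \mid X = x) = \Pr(G = g), so the factor linking G and X drops out of the numerator and the denominator. In a prospective likelihood, \Pr(D \mid G, X), the covariate distribution never appears, which is why the same assumption buys nothing there.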

Gene-Environment Independence Our Methodology: Is far more general than assuming that genetic status and environment are independent We have developed capacity for modeling the distribution of genetic status given strata and environmental factors I will skip this and just pretend G-E independence here

More Efficiency, G Observed Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium
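As a reminder of what a Hardy-Weinberg genetic model implies, the sketch below maps a variant allele frequency to genotype probabilities. The function name and the example frequency are illustrative assumptions.

```python
def hwe_genotype_probs(q):
    """Genotype probabilities under Hardy-Weinberg Equilibrium.

    q is the frequency of the variant allele; keys are the number of
    variant copies carried (0, 1, or 2).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return {0: (1.0 - q) ** 2, 1: 2.0 * q * (1.0 - q), 2: q ** 2}


probs = hwe_genotype_probs(0.1)  # hypothetical allele frequency
print(probs)
```

Under this model a single parameter q describes the whole genotype distribution, which is part of what makes the constrained likelihood tractable.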

The Formulation Any logistic model works Question: What methods do we have to construct estimators?

Methodology I won’t give you the full methodology, but it works as follows. Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people

Methodology (population and sample diagram): The total population has N people: Np1 cases and Np0 controls. The sample contains n1 cases and n0 controls. The missing cases number Np1-n1 and the missing controls number Np0-n0. The diagram also marks the % of cases observed and the % of controls observed.

Pretend Missing Data Formulation This means that there is a missing data problem. The selection into the case control study is biased: cases are vastly over-represented Ordinary logistic regression computes the probability of disease given the environment, given the gene, and given that the person was selected into the case control study

Pretend Missing Data Formulation This means that there is a missing data problem. Our method computes the probability of disease and the probability of gene given the environment and given that the person was selected into the case control study The selection into the case control study is biased: cases are vastly over-represented

Methodology Our method has an explicit form, i.e., no integrals or anything nasty It is easy to program the method to estimate the logistic model It is likelihood based. Technically, a semiparametric profile likelihood

Methodology We can handle missing gene data We can handle error in genotyping We can handle measurement errors in environmental variables, e.g., diet

Methodology Our method results in much more efficient statistical inference

More Data What does more efficient statistical inference mean? It means, effectively, that you have more data. In cases where G is a simple mutation, our method is typically equivalent to having 3 times more data.
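One common way to turn "more efficient" into "more data" is the squared ratio of standard errors, since variances shrink roughly in proportion to 1/n. The sketch below assumes that convention; the function name and the two standard errors are illustrative.

```python
def effective_sample_multiplier(se_robust, se_efficient):
    """How many times more data the robust method would need to match
    the efficient method's precision, assuming variance scales as 1/n."""
    if se_robust <= 0 or se_efficient <= 0:
        raise ValueError("standard errors must be positive")
    return (se_robust / se_efficient) ** 2


# Hypothetical standard errors for the same coefficient under the two methods.
multiplier = effective_sample_multiplier(se_robust=0.12, se_efficient=0.08)
print(f"equivalent to about {multiplier:.2f} times more data")
```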

How much more data: Typical Simulation Example The increase in effective sample size when using our methodology

Real Data Complexities The Israeli Ovarian Cancer Study. G = BRCA1/2 mutation (very deadly). X includes age, ethnic status (below), parity, oral contraceptive use, family history, smoking, etc.

Real Data Complexities In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases. Also, among Jewish citizens, Israel has two dominant ethnic types: Ashkenazi (European) and Sephardic (North African).

Real Data Complexities The gene mutation BRCA1/2 is frequent among the Ashkenazi, but rare among the Sephardic. Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X. Gene-Environment independence fails here. What can be done? Model pr(G=1 | X) as binary with different probabilities!

Israeli Ovarian Cancer Study Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?

Typical Empirical Example

Israeli Ovarian Cancer Study Main Effect of BRCA1/2:

Israeli Ovarian Cancer Study

Haplotypes Haplotypes consist of what we get from our mother and father at more than one site Mother gives us the haplotype hm = (Am,Bm) Father gives us the haplotype hf = (af,bf) Our diplotype is Hdip = {(Am,Bm), (af,bf)}

Haplotypes Unfortunately, we cannot presently observe the two haplotypes We can only observe genotypes Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b)

Missing Haplotypes Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b). However, this is also consistent with a different diplotype, namely Hdip = {(am,Bm), (Af,bf)}. Note that the number of copies of the (a,b) haplotype differs in these two cases. The true diplotype (the haplotype pair) is missing.
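The phase ambiguity can be enumerated mechanically. The sketch below lists every diplotype consistent with an unphased genotype; the representation (one unordered allele pair per site) and the function name are assumptions made for illustration.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Return the set of diplotypes consistent with an unphased genotype.

    genotype is a list with one allele pair per site, e.g.
    [("A", "a"), ("B", "b")]. Each diplotype is a frozenset of haplotypes,
    so the order of the two haplotypes does not matter.
    """
    diplotypes = set()
    # For each site, choose which allele of the pair sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(pair[c] for pair, c in zip(genotype, choice))
        hap2 = tuple(pair[1 - c] for pair, c in zip(genotype, choice))
        diplotypes.add(frozenset((hap1, hap2)))
    return diplotypes


double_het = compatible_diplotypes([("A", "a"), ("B", "b")])
single_het = compatible_diplotypes([("A", "A"), ("B", "b")])
for d in double_het:
    print(sorted(d))
```

A genotype that is heterozygous at both sites has two compatible diplotypes, while a genotype that is heterozygous at only one site is unambiguous.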

Missing Haplotypes Our methods handle unphased diplotypes (missing haplotypes) with no problem. Standard EM-algorithm calculations can be used. We assume that the haplotypes are in HWE, and have extended to cases of non-HWE.
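As a rough illustration of the kind of EM calculation involved, here is a textbook-style EM for haplotype frequencies at two biallelic sites under HWE. It is a generic sketch, not the authors' software: the 0/1 allele coding, the function names, and the toy data are all assumptions, and it ignores disease status entirely.

```python
# Haplotypes at two sites, alleles coded 0/1.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype pairs whose allele counts match the genotype.

    genotype = (count of allele 1 at site 1, count of allele 1 at site 2).
    """
    return [(h1, h2) for h1 in HAPLOTYPES for h2 in HAPLOTYPES
            if h1[0] + h2[0] == genotype[0] and h1[1] + h2[1] == genotype[1]]


def em_haplotype_freqs(genotypes, n_iter=200):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    freqs = {h: 0.25 for h in HAPLOTYPES}
    for _ in range(n_iter):
        expected = {h: 0.0 for h in HAPLOTYPES}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: under HWE a pair's probability is the product of frequencies.
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                expected[h1] += w / total
                expected[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        n_chromosomes = 2 * len(genotypes)
        freqs = {h: expected[h] / n_chromosomes for h in HAPLOTYPES}
    return freqs


# Toy data with no double heterozygotes, so phase is never ambiguous.
unambiguous = em_haplotype_freqs([(0, 0), (0, 0), (0, 0), (2, 2)])
# Toy data including double heterozygotes (1, 1), where phase is missing.
ambiguous = em_haplotype_freqs([(0, 0), (1, 1), (1, 1), (2, 2), (0, 1)])
print(unambiguous)
print(ambiguous)
```

Only the double heterozygotes contribute fractional counts in the E-step; every other genotype determines its haplotype pair exactly.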

Robustness Robustness: We are making assumptions to gain efficiency = “get more data” What happens if the assumptions are wrong? Biases, incorrect conclusions, etc. How can we gain efficiency when it is warranted, and yet have valid inferences?

Two Likelihoods The two likelihoods lead to two estimators The former is robust but not efficient The latter is efficient but not robust What to do?

Empirical Bayes The idea is to take a weighted average of the model free and model based estimators The weight depends on how different the estimators are Relative to how variable the difference is

Empirical Bayes You can actually formally test the hypothesis of whether the model fits the data It is just a t-test on the difference between the two estimators
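A minimal sketch of such a test, assuming the standard error of the difference between the two estimators is supplied by the user and using a normal approximation for the p-value. The function name and the numbers are illustrative.

```python
import math


def model_fit_test(est_model_free, est_model_based, se_difference):
    """Test statistic and two-sided normal p-value for the difference
    between the model-free and model-based estimates."""
    if se_difference <= 0:
        raise ValueError("standard error of the difference must be positive")
    z = (est_model_free - est_model_based) / se_difference
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical estimates and standard error of their difference.
z, p_value = model_fit_test(est_model_free=-0.50, est_model_based=-0.40,
                            se_difference=0.10)
print(z, p_value)
```

A large absolute statistic suggests the modeling assumptions behind the efficient estimator do not fit the data.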

Empirical Bayes If the difference is small relative to the variability, then this argues in favor of the model based approach

Empirical Bayes We chose an Empirical Bayes type-approach Let and Then
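The slide's formulas are not reproduced here, so the sketch below uses an assumed scalar shrinkage form that matches the verbal description: the weight on the model-free estimator grows with the squared difference between the two estimators, relative to the variance of that difference. The function name, the exact weight, and the numbers are illustrative, not necessarily the authors' estimator.

```python
def eb_shrinkage(est_model_free, est_model_based, var_difference):
    """Weighted average of the model-free and model-based estimates.

    Assumed weight on the model-free estimate: psi^2 / (V + psi^2), where
    psi is the difference between the estimates and V its variance.
    """
    if var_difference < 0:
        raise ValueError("variance must be non-negative")
    psi = est_model_free - est_model_based
    denom = var_difference + psi ** 2
    if denom == 0.0:
        return est_model_based  # the two estimates coincide exactly
    weight_free = psi ** 2 / denom
    return weight_free * est_model_free + (1.0 - weight_free) * est_model_based


# Small difference relative to its variability: stays near the model-based value.
close = eb_shrinkage(est_model_free=-0.21, est_model_based=-0.18, var_difference=0.01)
# Large difference relative to its variability: moves toward the model-free value.
far = eb_shrinkage(est_model_free=-0.60, est_model_based=-0.10, var_difference=0.01)
print(close, far)
```

As the sample grows, the variance of the difference shrinks, so a persistent difference pushes the weight toward the model-free estimator, while a vanishing difference keeps the efficient model-based estimator in play.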

Comments on Empirical Bayes If the model fails, then the estimator converges to the model-free estimator If the model holds, the estimator estimates the right thing, but is much more efficient than the model-free estimator

Example 1: Prostate Cancer G = SNPs in the Vitamin D Pathway X = Serum-level biomarker of vitamin D (diet and sun) The VDR gene is downstream in the pathway, hence unlikely to influence the level of X Gene-environment independence likely

Example 1: Prostate Cancer 3 age groups, 9 centers, two haplotype-serum Vitamin D interactions, three haplotype main effects

Example 2: Colorectal Adenoma G = SNPs in the NAT2 gene, which is important in the metabolism of X =Various measures of smoking history The NAT2 gene may make smokers more addicted Gene-environment independence unlikely

Example 2: Colorectal Adenoma Two genders, 4 age groups, 7 common haplotypes as main effects, one haplotype known to affect metabolism, current and former smoking interactions

The NAT2 Example Current smoking and 101010 haplotype interaction coefficient. Current smokers with this haplotype are 50% less likely to develop a colorectal adenoma.

Method           Estimate   s.e.   p-value
Model Free       -0.63      0.17   0.014
Independence     -0.33      0.16   0.048
Consistent EB1   -0.59      0.25   0.017
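The "50% less likely" reading can be checked by exponentiating the model-free coefficient, treating it as the multiplicative factor the interaction term contributes to the odds. This is a back-of-the-envelope sketch; the function name is illustrative.

```python
import math


def odds_multiplier(log_odds_coefficient):
    """Multiplicative change in odds implied by a logistic coefficient."""
    return math.exp(log_odds_coefficient)


nat2_multiplier = odds_multiplier(-0.63)  # model-free NAT2 interaction estimate
print(round(nat2_multiplier, 2))  # 0.53, i.e. roughly a halving of the odds
```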

The VDR Example Serum Vitamin D and 000 haplotype interaction coefficient. Men with 1 sd greater serum Vitamin D than the norm are 70% less likely to develop prostate cancer.

Method           Estimate   s.e.   p-value
Model Free       -0.21      0.12   0.093
Independence     -0.18      0.08   0.019
Consistent EB1   -0.19             0.021

Genome-Wide Association Studies These methods are routinely applied to GWAS. My last two examples were actually from the PLCO GWAS. Also, the "environment" can be another SNP.

Summary Case-control studies are the backbone of epidemiology in general, and genetic epidemiology in particular Their retrospective nature distinguishes them from random samples = prospective studies

Summary We start by assuming relationships between the genes and the “environment” in the population, e.g., independence This model can be fully flexible We also, where necessary, specify distributions for genes

Summary We calculated a new likelihood function, leading to much more precise inferences. The method can handle missing genes, genotyping errors, and measurement errors in the environment. Calculations are straightforward via the EM algorithm.

Summary Forced to face the dilemma Lousy but robust method Great but not robust method We developed a fast, data adaptive, novel way of addressing this issue In cases where one can predict the outcome, the EB method works as desired

Acknowledgments This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

Acknowledgments This work is supported by NCI-R27-CA057030, NHLBI RO1-HL091172 (P.I., N. Chatterjee), and the Texas A&M Institute of Applied Mathematics and Computational Science through KAUST (King Abdullah University of Science and Technology)