Gene-Environment Case-Control Studies. Raymond J. Carroll, Department of Statistics, Faculties of Nutrition and Toxicology, Texas A&M University.


Outline. Problem: case-control studies with gene-environment relationships. Efficient formulation when genes are observed. Measurement errors in environmental variables. Haplotype modeling and robustness.

Acknowledgment. This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica).

Acknowledgment. Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan).

Software. SAS and Matlab programs are available at my web site under the software button. Examples are given in the programs.

Some Personal History. I was born in Japan. The coffee table is still in my house.

Some Personal History. My father lived in Seoul for 2 months in 1948 and 1 year in 1968. He took many photos of sights there, especially in 1948.

Joonghwa moon at Deoksugung, 1948

Joonghwa moon at Deoksugung, today

The Prices of Drinks Were Pretty Low

Basic Problem Formalized. Case-control sample: D = disease. Gene expression: G. Environment, which can include strata: X. We are interested in main effects for G and X along with their interaction.

Prospective Models. Simplest logistic model: [equation omitted in transcript]. General logistic model: [equation omitted in transcript]. The function $m(G, X, \beta_1)$ is completely general.
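The two model formulas did not survive in the transcript. A hedged reconstruction, assuming the usual logistic form: the intercept $\beta_0$ and the coefficient names $\beta_g$, $\beta_x$, $\beta_{gx}$ are labels chosen here for illustration, not taken from the slides.

```latex
\begin{align*}
\text{Simplest:}\quad & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    = \beta_0 + \beta_g G + \beta_x X + \beta_{gx}\, G X \\
\text{General:}\quad  & \operatorname{logit}\,\mathrm{pr}(D=1 \mid G, X)
    = \beta_0 + m(G, X, \beta_1)
\end{align*}
```

The simplest model is the special case $m(G, X, \beta_1) = \beta_g G + \beta_x X + \beta_{gx} G X$.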

Likelihood Function. The likelihood is [equation omitted in transcript]. Note how the likelihood depends on two things: the distribution of (X, G) in the population, and the probability of disease in the population. Neither can be estimated from the case-control study.
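The likelihood formula is missing from the transcript. By Bayes' rule, the retrospective likelihood of the covariates given disease status takes the form sketched here; writing $F$ for the joint population distribution of $(G, X)$ is a notational assumption.

```latex
\begin{align*}
\mathrm{pr}(G = g, X = x \mid D = d)
  &= \frac{\mathrm{pr}(D = d \mid g, x)\, dF(g, x)}{\mathrm{pr}(D = d)}, \\
\mathrm{pr}(D = d)
  &= \int \mathrm{pr}(D = d \mid g, x)\, dF(g, x).
\end{align*}
```

Both quantities named on the slide appear explicitly: the covariate distribution $F$ in the numerator and the population disease probability $\mathrm{pr}(D = d)$ in the denominator.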

When G is observed. The usual choice is ordinary logistic regression. It is semiparametric efficient if nothing is known about the distribution of (G, X) in the population. Why semiparametric: what is unknown is the distribution of (G, X) in the population.

When G is observed. Logistic regression is thus robust to any modeling assumptions about the covariates in the population. Unfortunately, it is not very efficient for understanding interactions.

Gene-Environment Independence. In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata. This assumption is often used in gene-environment interaction studies.

G-E Independence. Does not always hold! Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction. Part of this talk is to model the distribution of G given X.

Gene-Environment Independence. If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. The reason is that you are putting a constraint on the retrospective likelihood.

More Efficiency, G Observed. A constraint on the population is to posit a parametric or semiparametric model for G given X. Consequences: more efficient estimation of G effects, and much more efficient estimation of G and (X, S) interactions.

The Formulation. In the most general semiparametric setting, we have [equation omitted in transcript]. Question: what methods do we have to construct estimators?

Methodology. We have developed two new ways of thinking about this problem. In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation.

The Hard Way. Treat X as a discrete random variable whose mass points are the observed data points. Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.

The Hard Way. The maximization is not trivial to do correctly. Result: an explicit profile likelihood that does not involve the distribution of X.

Pretend Missing Data Formulation. The following simple trick can be shown to be legitimate and semiparametric efficient. Equivalently, we compute a semiparametric profiled likelihood. It is semiparametric because the distribution of X is not modeled.

Pretend Missing Data Formulation. The idea is to create a “pretend” study, which is one of random sampling with missing data. We use a missing-at-random (MAR) regime. The “pretend” study mimics the case-control study.

Pretend Missing Data Formulation. Suppose you have a large but finite population of size N. Then there are [count omitted in transcript] with the disease, and there are [count omitted in transcript] without the disease.

Pretend Missing Data Formulation. In a case-control sample, we randomly select $n_1$ with the disease and $n_0$ without. The fraction of people with disease status D = d that we observe is [expression omitted in transcript].

Pretend Missing Data Formulation. Then let’s make up a “pretend” study that has random sampling with missing data. I take a random sample. I get to observe (D, X, G) when D = d with probability [expression omitted in transcript]. I will say that [indicator omitted in transcript] if I observe (D, X, G). Then [equation omitted in transcript].
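The expressions on the last three slides were lost in extraction. A sketch of what the argument appears to require, with $\pi_d = \mathrm{pr}(D = d)$, the counts $N_d$, and the observation indicator $\delta$ all introduced here as assumed notation:

```latex
\begin{align*}
N_1 &= N \pi_1, \qquad N_0 = N \pi_0
      && \text{(population counts with and without disease)} \\
\frac{n_d}{N_d} &= \frac{n_d}{N \pi_d}
      && \text{(fraction with } D = d \text{ that is observed)} \\
\mathrm{pr}(\delta = 1 \mid D = d, X, G) &= \frac{n_d}{N \pi_d}
      && \text{(selection in the pretend study)}
\end{align*}
```

Under this reading, selection depends on (D, X, G) only through D, which is what lets the pretend study mimic case-control sampling.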

Pretend Missing Data Formulation. In this pretend missing data formulation, ordinary logistic regression is simply [equation omitted in transcript]. We have a model for G given X, hence we compute [equation omitted in transcript]. This has a simple explicit form, as follows.

Result. Define [equation omitted in transcript]. This is the intercept that ordinary logistic regression actually estimates; it only gets the slope right.
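The definition itself is missing from the transcript. Applying Bayes' rule to a pretend study in which subjects with D = d are observed with probability proportional to n_d / pr(D = d) suggests the intercept kappa = beta_0 + log(n_1/n_0) - log(pi_1/pi_0), where pi_d = pr(D = d). That formula, the parameter values, and every function name below are assumptions made for illustration, not the talk's software. The sketch simulates a finite population, draws a case-control sample, and fits ordinary logistic regression with a hand-rolled Newton-Raphson routine.

```python
import numpy as np


def fit_logistic(Z, d, n_iter=25):
    """Ordinary logistic regression by Newton-Raphson; Z includes an intercept column."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        grad = Z.T @ (d - p)                              # score vector
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z       # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate_case_control(beta0=-3.0, beta_g=0.5, beta_x=0.4,
                          N=400_000, n1=3000, n0=3000, seed=0):
    """Return (estimates, kappa) for one simulated case-control study."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(1, 0.3, size=N).astype(float)        # binary gene, independent of X
    x = rng.normal(size=N)                                # continuous environment
    eta = beta0 + beta_g * g + beta_x * x
    d = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))       # disease status in the population
    pi1 = d.mean()                                        # realized pr(D = 1)

    # Case-control sampling: n1 cases and n0 controls, without replacement.
    cases = rng.choice(np.flatnonzero(d == 1), size=n1, replace=False)
    controls = rng.choice(np.flatnonzero(d == 0), size=n0, replace=False)
    idx = np.concatenate([cases, controls])

    Z = np.column_stack([np.ones(idx.size), g[idx], x[idx]])
    est = fit_logistic(Z, d[idx].astype(float))
    kappa = beta0 + np.log(n1 / n0) - np.log(pi1 / (1.0 - pi1))
    return est, kappa


if __name__ == "__main__":
    est, kappa = simulate_case_control()
    print("fitted intercept:", est[0], "kappa:", kappa)
    print("fitted slopes:", est[1:], "true slopes: [0.5, 0.4]")
```

In runs of this sketch the fitted intercept should track kappa rather than beta_0, while the slopes track the prospective values, which is the slide's point.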

Result. Define [equation omitted in transcript]. Further define [equation omitted in transcript].

Result. Then the semiparametric efficient profiled likelihood function is [equation omitted in transcript]. It is trivial to compute.

Result. In the rare disease case, we have the further simplification that [equation omitted in transcript].
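The formulas on the Result slides are missing. Working through the pretend-study argument gives one plausible form: with S(d, g, x) = q(g | x) exp[d{kappa + m(g, x, beta_1)}] / [1 + exp{beta_0 + m(g, x, beta_1)}], the contribution of a subject is S(d, g, x) divided by the sum of S over all (d*, g*). For a rare disease the denominator 1 + exp{beta_0 + m} is close to 1 and drops out. The sketch below assumes that form, a binary gene, and a linear-plus-interaction m; the function names and the model q for G given X are hypothetical.

```python
import math


def s_term(d, g, x, beta0, kappa, beta1, q, rare=False):
    """S(d, g, x) = q(g | x) * exp[d * (kappa + m)] / [1 + exp(beta0 + m)].

    With rare=True the denominator is replaced by 1 (rare-disease approximation).
    """
    m = beta1[0] * g + beta1[1] * x + beta1[2] * g * x    # assumed form of m(g, x, beta1)
    denom = 1.0 if rare else 1.0 + math.exp(beta0 + m)
    return q(g, x) * math.exp(d * (kappa + m)) / denom


def profile_likelihood(d, g, x, beta0, kappa, beta1, q,
                       g_values=(0, 1), rare=False):
    """Profile likelihood contribution of (d, g) given x: S normalized over (d*, g*)."""
    num = s_term(d, g, x, beta0, kappa, beta1, q, rare)
    den = sum(s_term(ds, gs, x, beta0, kappa, beta1, q, rare)
              for ds in (0, 1) for gs in g_values)
    return num / den


if __name__ == "__main__":
    # Gene-environment independence: q does not depend on x (carrier frequency 0.3).
    q_indep = lambda g, x: 0.3 if g == 1 else 0.7
    value = profile_likelihood(1, 1, 0.5, -3.0, 0.1, (0.4, 0.2, 0.3), q_indep)
    print(value)
```

Swapping `q_indep` for a function that depends on `x` is how a model for G given X, rather than independence, would enter this sketch.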

Interesting Technical Point. Profile pseudo-likelihood acts like a likelihood. Information asymptotics are (almost) exact.

Typical Simulation Example. MSE efficiency of the profile method compared to ordinary logistic regression.

Typical Empirical Example

Consequence #1. We have a formal likelihood: [equation omitted in transcript]. This is also a legitimate semiparametric profile likelihood. Anything you can do with a likelihood you can do with a semiparametric profile likelihood.

Consequences #2-#3. Measurement error in the gene: handle misclassification of a covariate (the gene) as in any likelihood problem (see later). Measurement error in the environment: the structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.

Advertisement. Lobach et al., Biometrics, in press.

Consequences #4-#5. Flexible modeling of covariate effects: some components can be modeled by penalized regression splines; the LASSO and other likelihood-based methods apply. Model averaging: one can entertain or average various risk models; Bayesian methods are asymptotically correct.

Consequence #6. Model robustness: one can model average, select, or LASSO various models for the distribution of G given X. Main point: our method results in a legitimate likelihood, hence can be treated as such.

Modeling the Gene. Now turn to models for the gene. Given such models, likelihood calculations can be used for model fitting. We will consider haplotypes.

Haplotypes. Haplotypes consist of what we get from our mother and father at more than one site. Mother gives us the haplotype $h_m = (A_m, B_m)$. Father gives us the haplotype $h_f = (a_f, b_f)$. Our diplotype is $H_{dip} = \{(A_m, B_m), (a_f, b_f)\}$.

Haplotypes. Unfortunately, we cannot presently observe the two haplotypes. We can only observe genotypes. Thus, if we were really $H_{dip} = \{(A_m, B_m), (a_f, b_f)\}$, then the data we would see would simply be the unordered set (A, a, B, b).

Missing Haplotypes. Thus, if we were really $H_{dip} = \{(A_m, B_m), (a_f, b_f)\}$, then the data we would see would simply be the unordered set (A, a, B, b). However, this is also consistent with a different diplotype, namely $H_{dip} = \{(a_m, B_m), (A_f, b_f)\}$. Note that the number of copies of the (a, b) haplotype differs in these two cases. The true diplotype (the haplotype pair) is missing.
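The ambiguity described on this slide can be made concrete by enumerating every haplotype pair compatible with an unphased genotype. The sketch below is an illustration under an assumed coding, not the talk's software: each SNP genotype is the count (0, 1, or 2) of the lowercase allele, and each haplotype is a tuple of 0/1 indicators for that allele. The function name is hypothetical.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """List the unordered haplotype pairs whose allele counts add up to `genotype`.

    genotype: tuple with one entry per SNP, each the count (0, 1, 2) of the
    lowercase allele. Haplotypes are tuples of 0/1 indicators of that allele.
    """
    pairs = set()
    for h1 in product((0, 1), repeat=len(genotype)):
        # The partner haplotype is forced by the genotype once h1 is fixed.
        h2 = tuple(count - allele for count, allele in zip(genotype, h1))
        if all(allele in (0, 1) for allele in h2):
            pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


if __name__ == "__main__":
    # Heterozygous at both sites, i.e. the unordered set (A, a, B, b).
    print(compatible_diplotypes((1, 1)))
    # -> [((0, 0), (1, 1)), ((0, 1), (1, 0))]
```

With 0 coding the uppercase allele, the two pairs are {(A, B), (a, b)} and {(A, b), (a, B)}: one copy of the (a, b) haplotype in the first, none in the second.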

Missing Haplotypes. The likelihood in terms of the diplotype is [equation omitted in transcript]. We observe the genotypes G. The likelihood of the observed data is [equation omitted in transcript].

Missing Haplotypes. The likelihood of the observed data is [equation omitted in transcript]. Note how easy this was: it is really the profiled semiparametric likelihood of the observed data.
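The observed-data likelihood is not preserved in the transcript. Since only the genotype is seen, the usual missing-data step is to sum the diplotype-level likelihood over all diplotypes compatible with it. In the sketch below, $\mathcal{H}(G)$ for that compatible set and $L(\cdot)$ for the diplotype-level profile likelihood are notation assumed here.

```latex
\begin{equation*}
L_{\mathrm{obs}}(D, G \mid X)
  = \sum_{h \,\in\, \mathcal{H}(G)} L(D, H_{dip} = h \mid X)
\end{equation*}
```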

Haplotypes. Danyu Lin has a nice EM-based program for estimating haplotype frequencies. It accepts data in text format with SAS missing data conventions. The program is flexible; for example, it can assume Hardy-Weinberg equilibrium (HWE). /~lin/hapstat/

Haplotype Fitting. Models that assume haplotype-environment independence are straightforward to fit via EM. Danyu Lin’s program can do this, as can our SAS program. The remaining issue is how to gain robustness against deviations from this assumed independence.
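To show the kind of EM iteration involved, here is a minimal sketch of the textbook EM for haplotype frequencies from unphased genotypes under HWE. It ignores disease status and covariates entirely, so it is not the talk's SAS program or Danyu Lin's software, and the genotype coding (count of the second allele per SNP) and function names are assumptions.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs whose allele counts add up to the genotype."""
    pairs = set()
    for h1 in product((0, 1), repeat=len(genotype)):
        h2 = tuple(count - allele for count, allele in zip(genotype, h1))
        if all(allele in (0, 1) for allele in h2):
            pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=200):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    haps = list(product((0, 1), repeat=len(genotypes[0])))
    theta = {h: 1.0 / len(haps) for h in haps}            # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for geno in genotypes:
            pairs = compatible_pairs(geno)
            # E-step: posterior weight of each compatible pair under HWE.
            weights = [theta[h1] * theta[h2] * (1.0 if h1 == h2 else 2.0)
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chrom = 2 * len(genotypes)
        theta = {h: counts[h] / n_chrom for h in haps}
    return theta


if __name__ == "__main__":
    data = [(0, 0)] * 6 + [(1, 1)] * 3 + [(2, 2)] * 1
    for hap, freq in em_haplotype_freqs(data).items():
        print(hap, round(freq, 3))
```

Only the doubly heterozygous genotypes contribute fractional counts; unambiguous genotypes contribute whole haplotype counts at every iteration.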

Robustness. We build robustness by specifying models for diplotypes given the environmental variables. We first run a program to get a preliminary estimate of the haplotype frequencies. We use the most frequent haplotype as the reference haplotype.

Haplotypes. Approach: start with a logistic model for the unobserved haplotypes H given covariates X. In practice, we collapse all rare haplotypes into the reference haplotype to eliminate many variables.

Haplotypes. Approach: start with a logistic model for the unobserved haplotypes H given covariates X. This gives us the model: [equation omitted in transcript].

Haplotypes. Since the diplotypes are not observed, for identifiability we need further constraints. Example: one simple additive-type model is that [equation omitted in transcript].

Haplotypes. Further identification: assume that the population as a whole is in HWE, so that [equation omitted in transcript].
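The HWE expression is missing from the transcript. The standard statement of Hardy-Weinberg equilibrium for diplotypes is sketched here, with $\theta_j$ denoting the population frequency of haplotype $h_j$ (a symbol assumed for illustration):

```latex
\begin{equation*}
\mathrm{pr}\{H_{dip} = (h_j, h_k)\} =
\begin{cases}
\theta_j^2, & j = k, \\
2\,\theta_j \theta_k, & j \neq k.
\end{cases}
\end{equation*}
```

For example, with two haplotypes at frequencies 0.7 and 0.3, the three diplotypes have probabilities 0.49, 0.42 and 0.09, which sum to 1.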

Haplotypes. Summary: we have two models. [equations omitted in transcript]

Haplotypes. Summary: the models are linked. Let F(x) be the marginal distribution of X. Then [equation omitted in transcript].
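The linking equation did not survive extraction. The law of total probability gives the natural candidate: the population (marginal) diplotype distribution is the conditional distribution given X averaged over F. This is a sketch of that identity, not a quotation from the slide.

```latex
\begin{equation*}
\mathrm{pr}(H_{dip} = h) = \int \mathrm{pr}(H_{dip} = h \mid X = x)\, dF(x)
\end{equation*}
```

The left side is constrained by the population model (for example HWE), and the integrand by the conditional model for diplotypes given the environment.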

Haplotypes. In this setup, we have a particular form for [expression omitted in transcript]; hence [expression omitted in transcript] is defined through them and the marginal distribution of X.

Marginal Distributions of X. Three approaches for estimating F(x): profiled likelihood; if pr(D=1) is known, a weighted mixture of the empirical cdfs for cases and controls; for a rare disease, the empirical cdf for the controls.
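The second and third approaches can be sketched directly. Assuming the mixture weights are pr(D=1) for cases and pr(D=0) for controls (the natural reading, though the slide does not spell it out), the estimate is F(x) = pr(D=1) F_cases(x) + pr(D=0) F_controls(x). Function and argument names below are hypothetical.

```python
import numpy as np


def ecdf(sample, x):
    """Empirical cdf of `sample` evaluated at x (scalar or array)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, x, side="right") / sample.size


def marginal_cdf(x, x_cases, x_controls, pr_disease):
    """Weighted mixture of the case and control empirical cdfs.

    pr_disease is the known population value of pr(D = 1). Setting it to 0
    gives the rare-disease version, which uses the controls only.
    """
    return (pr_disease * ecdf(x_cases, x)
            + (1.0 - pr_disease) * ecdf(x_controls, x))


if __name__ == "__main__":
    cases = [3.0, 4.0]
    controls = [1.0, 2.0, 3.0, 4.0]
    print(marginal_cdf(2.5, cases, controls, pr_disease=0.1))   # 0.1*0 + 0.9*0.5 = 0.45
```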

Summary. Population model for the diplotypes, e.g., HWE. Conditional model for diplotypes given environment. Various estimates of the marginal distribution of environment, and the crucial link.

Haplotypes Analysis. The resulting method adds robustness. EM algorithms enable fast computation. Explicit asymptotic theory (not trivial). The method is also semiparametric efficient.

Haplotypes Analysis. Simulations indicate the gain in robustness.

The NAT2 Example. Study of colorectal adenoma, a precursor to colon cancer. 628 cases and 635 controls. The gene NAT2 is known to be important in the metabolism of smoking-related carcinogens. X: age, gender, whether one smokes or used to smoke. 6 SNPs. The haplotype is of interest.

The NAT2 Example. 7 haplotypes had frequency > 0.5%. The most frequent was treated as baseline, with an additive risk model for the diplotypes. Interactions of the smoking variable with the haplotype in the risk model. Interactions of the smoking variable with the haplotypes in the gene model.

The NAT2 Example. Current smoking and haplotype interaction. Table columns: Estimate, s.e., P-value. Table rows: Independence, Dependence. [table values omitted in transcript]

The NAT2 Example. In this example, recognizing the possibility that the gene distribution may depend on the environment (smoking) changes the analysis. Plus, we get a p-value < 0.05!

Further work. There is another way to get robustness that we have just submitted. The idea is that the haplotypes and the environment are independent given the genotypes. That is, once you know the genotypes, the haplotypes are determined solely by random mating.

Further work. We then have two estimates: haplotype-environment unconditional independence, and independence conditional on the genotype. Then we do a penalized likelihood analysis: the likelihood is the conditional independence likelihood, and the penalty is the L1 distance from the unconditional independence estimate.
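One way to write the penalized criterion described here, as a sketch only: $\ell_{CI}$ for the conditional-independence log-likelihood, $\hat\beta_{UI}$ for the unconditional-independence estimate, and the tuning parameter $\lambda \ge 0$ are all notation assumed for illustration, and the slides do not say how $\lambda$ is chosen.

```latex
\begin{equation*}
\hat\beta_{\lambda}
  = \arg\max_{\beta}\;
    \Bigl\{ \ell_{CI}(\beta) - \lambda \,\bigl\lVert \beta - \hat\beta_{UI} \bigr\rVert_1 \Bigr\}
\end{equation*}
```

At $\lambda = 0$ this returns the conditional-independence estimate, and as $\lambda$ grows the solution is pulled toward $\hat\beta_{UI}$.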

Further work. The result is increased robustness and major gains in efficiency.

Summary. Fully flexible risk models. Flexible models for genes/haplotypes given covariates. Computable semiparametric efficient inference that is more powerful than ordinary logistic regression and more robust than gene-environment independence.

Thanks!