Bias Adjustment in Whole-Genome Scans


Bias Adjustment in Whole-Genome Scans
Fei Zou (fzou@bios.unc.edu)
Department of Biostatistics, Carolina Center for Genome Sciences, University of North Carolina at Chapel Hill

Bias correction for estimates of genetic risk
A number of recent papers (e.g. Garner, 2007) have observed that when genome-wide significance thresholds are applied in testing, estimates of risk are inflated (the "winner's curse" or "Beavis effect").
The magnitude of odds ratio estimates may be upwardly biased, posing a challenge for replication/confirmation or extension to additional populations: follow-up studies may be underpowered.
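The inflation described on this slide can be illustrated with a small simulation. This is only a sketch: it assumes a Wald-type statistic z that is normal with unit variance around a true standardized effect, and the values `mu_true = 3.0` and `threshold = 5.0` are hypothetical choices made for illustration, not numbers from the talk.

```python
import random
import statistics

def simulate_selected_z(mu, c, n_draws, seed=0):
    """Draw z ~ N(mu, 1) and keep only the draws declared significant (|z| > c)."""
    rng = random.Random(seed)
    draws = (rng.gauss(mu, 1.0) for _ in range(n_draws))
    return [z for z in draws if abs(z) > c]

mu_true = 3.0     # hypothetical true standardized effect
threshold = 5.0   # hypothetical genome-wide threshold on |z|
selected = simulate_selected_z(mu_true, threshold, n_draws=200_000)
mean_selected = statistics.fmean(selected)

# Every retained z exceeds the threshold, so the average reported effect
# overstates the true effect whenever the true effect sits below the threshold.
print(f"kept {len(selected)} draws; mean z among them = {mean_selected:.2f}")
```

Only the significant draws are averaged, which mimics reporting an effect estimate only when the scan declares a hit.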

Original versus Corrected Odds-Ratio Estimates for Three Published Genetic Association Studies

Some approaches have been proposed for improved estimation:
maximizing the conditional likelihood for genotype outcomes, given declared significance (Zöllner and Pritchard, 2007);
bootstrapping of genotype-phenotype values to provide an empirical correction (Sun and Bull, 2005; Yu et al., 2007).
Both require the original data and may be computationally prohibitive.

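An approximate conditional likelihood approach (Ghosh et al., 2008; Zhong and Prentice, 2008; similar results appear in the clinical trials literature)
Assume we have a parameter of interest $\beta$, with estimate $\hat\beta$.
Assume that $z = \hat\beta / \mathrm{SE}(\hat\beta)$, or a similar Wald-like statistic, is used to declare significance.
Declare multiple-test corrected significance if $|z| > c$, where $c$ is the significance threshold.
Defining $\mu = \beta / \mathrm{SE}(\hat\beta)$, we have, approximately, $z \sim N(\mu, 1)$.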

Example (figure panels: z = 5.2 and z = 5.33). Using c = 5.0 (similar to a genome scan threshold, with nominal $\alpha = 5.7 \times 10^{-7}$), the desired shrinkage is apparent. If the observed z is well above the threshold, the unconditional and conditional likelihoods are similar.
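The correspondence between the threshold c = 5.0 and the nominal level quoted on this slide can be checked with the two-sided standard normal tail probability. The function name `two_sided_p` is introduced here for illustration only.

```python
import math

def two_sided_p(z):
    """Two-sided tail probability P(|Z| > |z|) for Z ~ N(0, 1)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

alpha = two_sided_p(5.0)
print(f"{alpha:.2e}")  # about 5.7e-07, in line with the value on the slide
```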

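Thus we have defined a new "$\mu$ version" of the problem, for which we use
$L_c(\mu) = \phi(z - \mu) / [\Phi(-c + \mu) + \Phi(-c - \mu)]$
as an approximate conditional likelihood, where $\phi$ and $\Phi$ are the standard normal density and distribution function and $c$ is the significance threshold on $|z|$.
Here the m.l.e. of the conditional likelihood may not be optimal in any sense. It can be (theoretically) shown that no unbiased estimator of $\mu$ exists.
Estimators considered: the naïve estimator (equal to z); the conditional m.l.e.; a low m.s.e. estimator; a compromise estimator.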
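A minimal numerical sketch of these estimators follows, assuming z is normal with mean mu and unit variance and significance is declared when |z| > c, so that the conditional likelihood is the normal density at z divided by the probability of significance. The conditional m.l.e. follows directly from that likelihood. The formulas for the "low m.s.e." and "compromise" estimators did not survive in this transcript, so the forms used below (the mean of the normalized conditional likelihood, and the average of that mean with the conditional m.l.e.) are assumptions and should be checked against Ghosh et al. (2008). All function names are hypothetical.

```python
import math

def upper_tail(x):
    """P(Z > x) for standard normal Z; erfc keeps precision far in the tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def log_cond_lik(mu, z, c):
    """log of phi(z - mu) / P(|Z| > c) for Z ~ N(mu, 1)."""
    log_phi = -0.5 * (z - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)
    prob_significant = upper_tail(c - mu) + upper_tail(c + mu)
    return log_phi - math.log(prob_significant)

def conditional_mle(z, c, iters=100):
    """Maximize the conditional log-likelihood by ternary search.

    The truncated normal is an exponential family in mu, so the
    log-likelihood is concave and the maximizer lies in [-|z|, |z|].
    """
    if abs(z) <= c:
        raise ValueError("z must satisfy |z| > c (declared significant)")
    lo, hi = -abs(z), abs(z)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_cond_lik(m1, z, c) < log_cond_lik(m2, z, c):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def likelihood_mean(z, c, half_width=10.0, step=0.005):
    """Mean of the normalized conditional likelihood (ASSUMED form of the
    'low m.s.e.' estimator), computed as a Riemann sum on a uniform grid."""
    start = -(abs(z) + half_width)
    n = int(round(2.0 * (abs(z) + half_width) / step))
    num = den = 0.0
    for i in range(n + 1):
        mu = start + i * step
        weight = math.exp(log_cond_lik(mu, z, c))
        num += mu * weight
        den += weight
    return num / den

def shrinkage_estimates(z, c):
    mle = conditional_mle(z, c)
    mean = likelihood_mean(z, c)
    return {
        "naive": z,
        "conditional_mle": mle,
        "likelihood_mean": mean,           # assumed 'low m.s.e.' form
        "compromise": 0.5 * (mle + mean),  # assumed 'compromise' form
    }

estimates = shrinkage_estimates(5.2, 5.0)
for name, value in estimates.items():
    print(f"{name:16s} {value:.3f}")
```

For an observed z just above the threshold the conditional m.l.e. is pulled strongly toward zero, while for z far above the threshold it stays close to z.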

At any time we may convert back to $\beta$ using $\beta = \mu \cdot \mathrm{SE}(\hat\beta)$.
Clearly this approach can be applied in a variety of settings. In genome-wide association studies, we might have $\beta$ equal to the log odds ratio for a one-parameter genetic model, i.e., recessive, additive, or dominant action of the SNP genotype.
c will typically be in the range of 5-6.

Alternatively, Zhong and Prentice (2008) propose the standard likelihood-ratio (LR) approach for the confidence interval.

Performance (expectation and m.s.e.) and confidence intervals, $\mu$ version
95% confidence bounds, obtained by inverting the test procedure using the conditional density for z (given significance).

Confidence Interval, $\mu$ version
95% confidence bounds, obtained by inverting the test procedure using the conditional density for z (given significance).
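One way to invert a test based on the conditional density of z is sketched below. It assumes z is normal with mean mu and unit variance, conditioned on |z| > c, and uses an equal-tailed inversion: the bounds are the values of mu at which the conditional probability of a statistic at least as large as the observed z equals 0.025 and 0.975. The equal-tailed choice is an assumption made here; the acceptance regions used by the authors may differ. Function names are hypothetical.

```python
import math

def upper_tail(x):
    """P(Z > x) for standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def cond_upper_prob(z, mu, c):
    """P(Z >= z | |Z| > c) for Z ~ N(mu, 1); valid for z > c."""
    return upper_tail(z - mu) / (upper_tail(c - mu) + upper_tail(c + mu))

def solve_mu(z, c, target, lo, hi, iters=200):
    """Bisection for cond_upper_prob(z, mu, c) = target.

    The conditional family has monotone likelihood ratio in z, so the
    probability is increasing in mu and bisection applies.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cond_upper_prob(z, mid, c) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def conditional_ci(z, c, level=0.95):
    """Equal-tailed confidence bounds for mu given a significant z."""
    if abs(z) <= c:
        raise ValueError("z must satisfy |z| > c (declared significant)")
    if z < 0:  # use the symmetry of the problem in the sign of z
        lo, hi = conditional_ci(-z, c, level)
        return -hi, -lo
    tail = 0.5 * (1.0 - level)
    bracket = z + 40.0
    lower = solve_mu(z, c, tail, -bracket, bracket)
    upper = solve_mu(z, c, 1.0 - tail, -bracket, bracket)
    return lower, upper

near = conditional_ci(5.2, 5.0)   # z just above the threshold
far = conditional_ci(10.0, 5.0)   # z well above the threshold
print("z = 5.2 :", tuple(round(b, 2) for b in near))
print("z = 10.0:", tuple(round(b, 2) for b in far))
```

When z is well above the threshold the bounds approach the usual z ± 1.96; when z barely clears the threshold the interval extends much further toward zero.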

Confidence interval coverage shown to be accurate
All of the performance results in the idealized $\mu$ version of the problem carry over to the realistic version of the problem.
Ghosh et al. (2008) provide simulations under a variety of genetic models, under a "worst-case" scenario with 500 cases and 500 controls.

Figure: confidence interval coverage at the 95% nominal level. Panels: n = 1000, n = 5000, n = 10000. Dominant model, prevalence of disease = 0.01.

A related problem arises often in genomics applications.
For example: given a SNP that is significant in a genome scan for primary phenotype 1, we may want to perform inference about its effect on secondary phenotype 2 (which is correlated with phenotype 1), e.g. type II diabetes and obesity.
Another example: we run a genome scan for SNP effects (G), as well as environmental (E) and G×E effects. We only care about the E and G×E effects if the SNP is declared significant for G.

Two-$\mu$ version of the problem
Assume the corresponding z's are bivariate normal with correlation $\rho$.
Bias for $\mu_2$ is $\rho$ times the bias in $\mu_1$. Bias does not depend on $\mu_2$.
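The statement about the secondary bias can be made concrete under the slide's bivariate normal assumption with unit variances, where selection acts only on the first statistic (|z1| > c). The expected value of z2 given z1 is then mu2 + rho (z1 - mu1), so the selection bias in z2 is rho times the selection bias in z1 and does not involve mu2. The closed form for the primary bias below is the standard truncated-normal mean; function names and the example values are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def upper_tail(x):
    """P(Z > x) for standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def selection_bias_mu1(mu1, c):
    """E[z1 | |z1| > c] - mu1 for z1 ~ N(mu1, 1)."""
    prob_significant = upper_tail(c - mu1) + upper_tail(c + mu1)
    return (phi(c - mu1) - phi(c + mu1)) / prob_significant

def selection_bias_mu2(mu1, rho, c):
    """E[z2 | |z1| > c] - mu2 when (z1, z2) is bivariate normal with
    unit variances and correlation rho; note that mu2 does not appear."""
    return rho * selection_bias_mu1(mu1, c)

bias1 = selection_bias_mu1(3.0, 5.0)        # hypothetical mu1 and threshold
bias2 = selection_bias_mu2(3.0, 0.5, 5.0)   # hypothetical correlation 0.5
print(f"bias in z1: {bias1:.3f}, bias in z2: {bias2:.3f}")
```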

More generally

Typically estimated from likelihood.
Obtain multivariate versions of earlier point estimates.

Two Binary Traits
Two dichotomous traits Y1 and Y2 following the Palmgren (bivariate logistic) model, and a SNP having a dominant effect on each trait; X denotes the dichotomous SNP genotype.
We examined one SNP effect (log odds ratio) ranging from -0.7 (OR 0.5) to 0.7 (OR 2), with the other fixed at 0.3.
Disease prevalence = 0.1 for each trait; c = 5, MAF = 0.25.
Correlation between estimators can be determined from data.
An association parameter in the model is used to induce dependence between traits.

Dichotomous primary and continuous secondary traits
Y1: dichotomous; Y2: continuous; X denotes SNP genotype.
We examined one SNP effect ranging from -0.34 (OR 0.5) to 0.34 (OR 2), with the other fixed at 0.3.

Binary Y2

Quantitative Y2

The previous results were applied to the situation where the data were sampled prospectively. For genome-wide association studies, a much more common situation is one in which the data are sampled retrospectively based on phenotype Y1, typically case-control status.
Problem: under the retrospective sampling scheme, the relationship between (dichotomous) Y2 and X becomes complicated and is no longer logistic (Scott and Wild, 1995), even though the relationship between Y1 and X is still logistic.
An appropriate conditional likelihood with respect to the retrospective sampling scheme has to be considered when analyzing the secondary phenotype Y2 for retrospective data.
See Scott and Wild (1995) and Lee and Scott (1997); Lin and Zeng (2008) for GWAS secondary phenotype analysis.

Bias in Secondary Phenotype Analysis
Simulation setup: bivariate logistic; P(Y1=1) = 0.05 and P(Y2=1) = 0.2; MAF = 0.25.

We directly maximize the retrospective log-likelihood: assuming that is known.

Binary Y2 (retrospective)

Binary Y2 (prospective)

Quantitative Y2 (retrospective)

Quantitative Y2 (prospective)

Gene by environment interaction
A dichotomous trait Y, a single SNP having a dominant effect on the trait, environment, and their interaction.
E is dichotomous (0,1); G is dichotomous (0,1).
We examined one coefficient ranging from -0.7 (OR 0.5) to 0.7 (OR 2), with the other two fixed at 0.3 and 0.2.
Disease prevalence = 0.01; c = 5, MAF = 0.25, ncases = ncontrols = 500.
Correlation between coefficients can be determined from the likelihood.

Figure panels: Gene × Environment; Gene; Environment.

The use of our approximate conditional likelihood has additional advantages. The focus on the Wald statistic means that we can:
use other parameterizations (not necessarily odds ratios);
use the approach even when covariates are fitted in the model. This is a key advantage, as correction for population stratification is often performed using covariates;
apply the approach to published summary tables. We need only c, the estimate $\hat\beta$, and its standard error $\mathrm{SE}(\hat\beta)$. The last value, if not provided directly, can be inferred from z, the p-value, or a published odds ratio confidence interval.
What about the extra bias in reporting only the most significant SNP in a genome-wide association study?
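Recovering the inputs from a published summary table can be sketched as follows, assuming the reported interval is a symmetric Wald interval on the log odds ratio scale and the reported p-value is two-sided and normal-based. The odds ratio and interval used here are hypothetical numbers chosen for illustration, and the function names are not from the talk.

```python
import math
from statistics import NormalDist

def se_from_or_ci(ci_lower, ci_upper, level=0.95):
    """SE of log(OR) from a published Wald confidence interval for the OR."""
    z_crit = NormalDist().inv_cdf(0.5 + 0.5 * level)
    return (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z_crit)

def wald_z(odds_ratio, se_log_or):
    """Wald statistic for the log odds ratio."""
    return math.log(odds_ratio) / se_log_or

def abs_z_from_p(p_two_sided):
    """|z| implied by a two-sided normal p-value."""
    return -NormalDist().inv_cdf(0.5 * p_two_sided)

def or_from_mu(mu, se_log_or):
    """Convert a standardized effect mu back to the odds ratio scale,
    using beta = mu * SE."""
    return math.exp(mu * se_log_or)

# Hypothetical published result: OR 1.50 with 95% CI (1.29, 1.74).
se = se_from_or_ci(1.29, 1.74)
z = wald_z(1.50, se)
print(f"SE(log OR) = {se:.4f}, z = {z:.2f}")

# Any corrected estimate of mu maps back to an odds ratio the same way;
# plugging in z itself returns the published (uncorrected) odds ratio.
print(f"OR recovered from z: {or_from_mu(z, se):.2f}")
```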

References: Bias correction in risk estimates
Allison DB, Fernandez JR, Heo M, Zhu S, Etzel C, Beasley TM, Amos CI (2002) Bias in estimates of quantitative-trait-locus effect in genome scans: demonstration of the phenomenon and a method-of-moments procedure for reducing bias. Am J Hum Genet 70:575-585
Garner C (2007) Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol 31:288-295
Ghosh A, Zou F, Wright FA (2008) Estimating odds ratios in genome scans: an approximate conditional likelihood approach. Am J Hum Genet 82:1064-1074
Goring HH, Terwilliger JD, Blangero J (2001) Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69:1357-1369
Rothman N, Skibola CF, Wang SS, Morgan G, Lan Q, Smith MT, Spinelli JJ, et al. (2006) Genetic variation in TNF and IL10 and risk of non-Hodgkin lymphoma: a report from the InterLymph Consortium. Lancet Oncol 7:27-38
Sun L, Bull SB (2005) Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol 28:352-367
Wald A (1943) Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society 54:426-482
Wang WY, Barratt BJ, Clayton DG, Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109-118
Yu K, Chatterjee N, Wheeler W, Li Q, Wang S, Rothman N, Wacholder S (2007) Flexible design for following up positive findings. Am J Hum Genet 81:540-551
Zhong H, Prentice RL (2008) Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics Feb 28 2008 (Epub)
Zöllner S, Pritchard JK (2007) Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet 80:605-615

Collaborators: Arpita Ghosh, Fred A. Wright