Control of Population Stratification in Whole-Genome Scans. Fei Zou, Department of Biostatistics, Carolina Center for Genome Sciences.

Presentation transcript:

Control of Population Stratification in Whole-Genome Scans. Fei Zou, Department of Biostatistics, Carolina Center for Genome Sciences, University of North Carolina at Chapel Hill

Outline Introduction: –Genome-wide association study (GWAS) –Population stratification. Genomic control. Principal component analysis (PCA). Shrinkage PCA. EigenCorr. Remarks and conclusions.

Genome-wide association (GWA) study. A GWA study rapidly scans markers across the genomes of many people to find genetic variations associated with a particular disease or trait. Single nucleotide polymorphisms (SNPs): DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. High dimensional: –# of SNPs: 500K to 1M across the entire genome –# of samples: thousands to tens of thousands

Association Mapping (copied, with modifications, from psb.stanford.edu/psb06/presentations/association_mapping.pdf). Cases vs. controls: is there a significant difference in the SNP genotype distributions?
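
The case/control comparison on this slide can be illustrated with a single-SNP allelic test. The sketch below is only an illustration, not the test used in the talk: it assumes a plain 1-df Pearson chi-square on a 2x2 table of allele counts, and the function name and the counts are made up.

```python
import math

def allelic_chi2(case_counts, control_counts):
    """1-df Pearson chi-square on a 2x2 table of allele counts.

    Each argument is (minor_allele_count, major_allele_count).
    Returns (statistic, p_value). No continuity correction is applied.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: 500 cases and 500 controls, i.e. 1000 alleles per group
stat, p = allelic_chi2((300, 700), (240, 760))
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

In a genome scan the same test would be repeated at every SNP, which is why both multiple testing and stratification matter.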

GWA Study Examples Mar 2005: Age-related macular degeneration Feb 2007: Type 2 diabetes Apr 2007: Obesity …… provides a catalog of published GWA Studies. GWA study Database:

GWA Studies GWA studies are –susceptible to population stratification (Cardon & Palmer 2003; Knowler et al. 1988), which occurs when subpopulations differ in both disease prevalence and allele frequencies –prone, as a result, to spurious association (increased Type I error)

Population Stratification Example:

Control of Population Stratification –Genomic control and related methods attempt to find an average inflation factor to deal with overdispersion of test statistics due to stratification (Devlin and Roeder, 1999; Schork, 1999). –Structured association (Pritchard et al., 1999, 2000a, 2000b; Satten et al., 2001) attempts to infer population origin more directly and to perform stratified testing. –Principal components analysis (PCA): Zhang, Zhu and Zhao (2001) proposed using PCA to estimate genetic background covariates.
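
The genomic control idea of a single average inflation factor can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the common median-based estimator (observed median of the 1-df chi-square statistics divided by the theoretical median), floors the factor at 1 so that statistics are never inflated, and uses made-up input values. The function name is hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Estimate one inflation factor and rescale every statistic by it."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(1.0, lam)  # assumed convention: never inflate the statistics
    return lam, [s / lam for s in chi2_stats]

# Made-up statistics whose median is about twice the null median
lam, adjusted = genomic_control([0.2, 0.5, 0.91, 1.5, 4.0])
print(f"lambda = {lam:.3f}")
```

Because every SNP is divided by the same factor, this correction cannot distinguish markers that differ strongly between subpopulations from those that do not.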

Control of Population Stratification PCA-based methods are appealing –One disadvantage of the classical PCA approach is that the number of markers cannot exceed the number of subjects. Price et al. (2006) exploited the structure of rescaled genotype matrices to extend the PCA method to modern GWA studies, in which hundreds of thousands of SNPs are genotyped. –This approach (or a similar one) has become very popular for GWA studies.

Control of Population Stratification Let $g_{ij}$ represent the (i,j)th element of the genotype matrix g, corresponding to SNP i and individual j, i=1,…,M and j=1,…,N –The data are coded numerically (say, by the number of minor alleles) and typically take one of three values (0, 1, or 2). Each row i of g is (a) mean-centered and (b) variance-standardized to obtain the M x N matrix X. The principal component scores for the N individuals are used to infer ancestry and as covariates, e.g. in logistic regression –Singular value decomposition (SVD): $X = U D P^T$, where $D = \mathrm{diag}\{d_j\}$, U is the loading matrix and P is the normalized PC matrix. –The eigenvectors of $X^T X$ are proportional to the principal component scores. With K subpopulations mixed, we need K-1 PCs to represent the stratification (think of each SNP as having K different allele frequencies).
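
The row standardization and SVD described on this slide can be sketched with NumPy. The simulated data below (two subpopulations of 50 individuals, 2000 SNPs, allele-frequency differences drawn from a normal distribution) and the function name `pca_scores` are assumptions made only for illustration. With K = 2 subpopulations, the first PC alone should separate the groups.

```python
import numpy as np

def pca_scores(genotypes, n_pcs=2):
    """genotypes: M x N array of minor-allele counts (SNPs in rows).

    Returns the standardized matrix X, the top singular values and the
    corresponding columns of P (one row of scores per individual).
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=1, keepdims=True)   # (a) mean-center each SNP
    sd = centered.std(axis=1, keepdims=True)
    keep = sd[:, 0] > 0                            # drop monomorphic SNPs
    x = centered[keep] / sd[keep]                  # (b) variance-standardize
    u, d, pt = np.linalg.svd(x, full_matrices=False)  # X = U D P^T
    return x, d[:n_pcs], pt[:n_pcs].T

# Simulated example (all settings are illustrative assumptions)
rng = np.random.default_rng(0)
n_snps, n_per_pop = 2000, 50
freq_pop1 = rng.uniform(0.1, 0.9, n_snps)
freq_pop2 = np.clip(freq_pop1 + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
geno = np.hstack([
    rng.binomial(2, freq_pop1[:, None], (n_snps, n_per_pop)),
    rng.binomial(2, freq_pop2[:, None], (n_snps, n_per_pop)),
])
labels = np.repeat([0, 1], n_per_pop)

x, d, pcs = pca_scores(geno)
print("|corr(PC1, subpopulation)| =", abs(np.corrcoef(pcs[:, 0], labels)[0, 1]))
```

The columns of `pcs` are eigenvectors of the N x N matrix X^T X, which is why the decomposition stays cheap when M is far larger than N. They can then be entered as covariates in the per-SNP regression.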

Control of Population Stratification In principle, one can use the entire dataset for stratification control, in settings ranging from moderate-scale candidate gene studies to whole-genome scans. Unfortunately, using all available data also presents a problem: both structured association and PCA approaches can be heavily influenced by correlated markers. Patterson et al. (2006) used a regression approach to reduce the influence of correlated markers. Fellay et al. (2007) used a “thinning” approach in which only a subset of markers with low pairwise correlation is retained for stratification control. The criteria for thinning are somewhat arbitrary, and one may lose information.
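
A thinning step of the kind mentioned here can be sketched as a greedy pass along the marker order. This is not the procedure of Fellay et al.; the window size, the squared-correlation threshold and the function name are arbitrary assumptions, which also illustrates the slide's point that thinning criteria are somewhat arbitrary.

```python
import numpy as np

def thin_markers(x, window=50, r2_max=0.2):
    """Greedy thinning of an M x N genotype matrix (SNPs in rows).

    A SNP is kept only if its squared correlation with every previously
    kept SNP within `window` positions is at most `r2_max`.
    Assumes every SNP is polymorphic (non-constant row).
    """
    kept = []
    for i in range(x.shape[0]):
        recent = [k for k in kept if i - k <= window]
        if all(np.corrcoef(x[i], x[k])[0, 1] ** 2 <= r2_max for k in recent):
            kept.append(i)
    return kept

# Tiny example: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with both
a = [0, 1, 2, 0, 1, 2]
b = [0, 2, 0, 2, 0, 2]
print(thin_markers(np.array([a, a, b])))  # duplicate SNP is discarded
```

Note that the discarded SNPs are lost entirely, including any that lie in regions of interest for association.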

Example 1 A GWAS dataset. After filtering: 2,559 samples and 701,859 SNPs. Do these clumps really represent stratification?

Example 1, cont. (Figure: x-axis is SNP marker order; labeled regions are 2q, 8p, 6p and 17q.) 2q: lactase gene region; 6p: MHC region; 8p and 17q: inversion regions

In this dataset and many others, we find the same chromosomal regions showing up again and again. Some of them may be good to include (lactase gene), in the sense that they correspond to ancestry (North-South gradient in Europeans). Some may be bad (inversions on 8p, 17q) if they are evenly mixed into the population. Thinning of markers may be okay, but it might throw out entire regions considered very plausible for association (e.g. HLA). We want an approach that is less extreme than thinning but not too complicated.

The problem with dependent SNPs is that they exert large influence merely because they are correlated. Principal components analysis rewards correlation by finding directions in the data that have large variance. We propose a shrunken genotype method instead. Approach: create a new, row-weighted data matrix, where w is a diagonal weight matrix that “downweights” sets of correlated SNPs. Our choice of weights follows the logic that linear combinations of genotypes should exert influence determined by the amount of independent information.

We propose the following as weights for the ith SNP, where $r_{ii'}$ is the sample correlation of the genotype data between SNPs i and i’. We consider only nearby SNPs in a window (usually several hundred SNPs) whose correlation is above some minimum threshold.

This choice of weights has the following desirable properties: –When all markers are uncorrelated, no marker is downweighted. –If a group of M’ markers are perfectly correlated with each other, their (joint) influence on variance is reduced. –If all M markers have a common positive pairwise correlation, then the weights are equal, $w_i = c$ for a constant c, and we are back to standard PC analysis.
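
The weight formula itself is not preserved in this transcript. The sketch below therefore uses an assumed stand-in, $w_i = 1/\sqrt{\sum_{i'} r_{ii'}^2}$ with the sum over windowed SNPs above a correlation threshold. It was chosen only because it satisfies the three properties listed on this slide, and it should not be read as the authors' formula. The function name, window size and threshold are likewise hypothetical.

```python
import numpy as np

def shrinkage_weights(x, window=200, r_min=0.2):
    """Assumed stand-in weights: w_i = 1 / sqrt(sum_i' r_ii'^2).

    x: M x N genotype matrix (SNPs in rows, all polymorphic).
    Only SNPs within `window` positions and with |r| >= r_min contribute;
    the SNP itself always contributes r = 1, so w_i <= 1.
    Written for clarity, not speed.
    """
    x = np.asarray(x, dtype=float)
    m = x.shape[0]
    weights = np.empty(m)
    for i in range(m):
        lo, hi = max(0, i - window), min(m, i + window + 1)
        r = np.atleast_2d(np.corrcoef(x[lo:hi]))[i - lo]
        r = np.where(np.abs(r) >= r_min, r, 0.0)
        weights[i] = 1.0 / np.sqrt(np.sum(r ** 2))
    return weights

# Three perfectly correlated SNPs plus one uncorrelated SNP
a = [0, 1, 2, 0, 1, 2]
b = [0, 2, 0, 2, 0, 2]
x = np.array([a, a, a, b], dtype=float)
w = shrinkage_weights(x)
x_shrunk = w[:, None] * x  # row-weighted ("shrunken") data matrix
print(w)
```

Under this stand-in, a block of M’ perfectly correlated SNPs gets squared weights that sum to 1, so the block contributes about as much variance as a single independent SNP, while uncorrelated SNPs keep weight 1.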

Example 2. Cystic Fibrosis Gene Modifier Study (M. Knowles, PI): association of genotype with lung function. 81 ancestry-informative SNPs were used for stratification control in a candidate gene study. (Figure annotation: turned out to be self-reported African-American.)

Example 2, cont.

Example 1 revisited with shrunken genotypes

Example 1 revisited with shrunken genotypes, cont. (Figure: x-axis is SNP marker order.)

Example 3 With the HAP-SAMPLE software, we simulated 450 CEU samples, 50 YRI samples, and 50 JP+CH samples using the SNPs on the Affymetrix 100K array [Wright et al. 2007]. We then generated an additional 225 admixed individuals using our modified version of HAP-SAMPLE. HAP-SAMPLE generates data by resampling from existing phased HapMap datasets and therefore preserves the observed local LD structure.

Figure panels: standard, shrinkage, regression, thinning

Example 4 How do the methods perform under subtle population stratification? Data: Phase 3 CEU and TSI HapMap unrelated samples. We removed all children whose parents are also HapMap samples. Additionally, we excluded one CEU subject who had a very high estimated identity-by-descent (IBD) value (> 0.8) with another CEU subject. After filtering, the final dataset contained 185 individuals (108 CEU and 77 TSI samples). The CEU samples are known to have northern and western European ancestry, while the TSI samples represent Toscani individuals from Italy.

Figure panels: standard, shrinkage, regression, thinning

How Many PCs How many PCs for follow-up analyses? –Top 10 PCs (Price et al. 2006) –Top 7 PCs (Sullivan et al. 2008) –Tracy-Widom (TW) test (Patterson et al. 2006): may select over 100 PCs, e.g. the GAIN schizophrenia study (162 PCs with TW test P-values < 0.01). Considerations: power, genetic effect estimate, computing time.

Connection between GC and PCA Let be the jth column of P

Connection between GC and PCA Quantitative Trait: assuming linear model: with test statistic: By Theorem 1: which provides a direct relationship between the mean version of GC and the PC-phenotype correlations and eigenvalues.

Connection between GC and PCA Case-control Trait: –Model –Score test statistic: –Therefore: which again provides a direct relationship between the mean version of GC and the PC-phenotype correlations and eigenvalues.

Comparison Between GC and PCA GC and PCA are related but also fundamentally different –GC: the inflation factor is assumed constant across all null SNPs –PCA: can be viewed alternatively as control of inflation by locus-specific factors

Comparison Between GC and PCA Suppose PC1 fully recovers the two subpopulations: the test statistic $S_i$ at the ith SNP that does not acknowledge the stratification is approximately distributed as with mean where $u_{ij}$ is the (i,j)th element of the loading matrix U

EigenCorr: Eigenvalue and Correlation Based PC Selection Procedure EigenCorr score: reflects the effect of the jth PC on the mean of the test statistics. The null distribution of the EigenCorr scores can be directly estimated under the assumption that the PCs and the phenotype are uncorrelated.
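
The score formula is not preserved in this transcript, so the sketch below should be read only as an illustration of selecting PCs by eigenvalue and phenotype correlation together. It assumes a stand-in score (eigenvalue times the squared PC-phenotype correlation) and a permutation null in place of the direct estimate mentioned on the slide. All names, the simulated PCs and the eigenvalues are hypothetical.

```python
import numpy as np

def eigencorr_like_scores(pcs, eigenvalues, phenotype):
    """Assumed stand-in score: eigenvalue_j * corr(PC_j, phenotype)^2."""
    y = np.asarray(phenotype, dtype=float)
    r = np.array([np.corrcoef(pcs[:, j], y)[0, 1] for j in range(pcs.shape[1])])
    return np.asarray(eigenvalues, dtype=float) * r ** 2

def select_pcs(pcs, eigenvalues, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Keep PCs whose score exceeds a permutation-based threshold.

    Permuting the phenotype mimics the null in which PCs and phenotype
    are uncorrelated; the threshold is a quantile of the maximum score.
    """
    rng = np.random.default_rng(seed)
    observed = eigencorr_like_scores(pcs, eigenvalues, phenotype)
    null_max = np.array([
        eigencorr_like_scores(pcs, eigenvalues, rng.permutation(phenotype)).max()
        for _ in range(n_perm)
    ])
    threshold = np.quantile(null_max, 1.0 - alpha)
    return observed, np.flatnonzero(observed > threshold)

# Simulated stand-ins: orthonormal columns play the role of PCs
rng = np.random.default_rng(1)
n, k = 200, 5
pcs, _ = np.linalg.qr(rng.normal(size=(n, k)))
eigenvalues = np.array([50.0, 20.0, 10.0, 5.0, 2.0])
phenotype = 30.0 * pcs[:, 0] + rng.normal(size=n)  # phenotype tracks PC 0 only

scores, selected = select_pcs(pcs, eigenvalues, phenotype)
print("scores:", np.round(scores, 2), "selected PCs:", selected)
```

A rule of this kind can retain far fewer PCs than an eigenvalue-only test, because a PC with a large eigenvalue but no phenotype correlation receives a low score.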

Simulations Case 1: 1000 samples with 5 subpopulations (210 samples from each of the first 4 subpopulations and 160 from subpopulation 5); 20K unrelated SNPs with model

Simulation Set 2: schizophrenia GWAS; 1847 samples with 810K SNPs; population stratification is simulated via the following model: TW test: 162 PCs with P<0.01. On average, 4.95 PCs are picked by EigenCorr.

Conclusions/future directions Shrinkage of numeric-coded genotype data appears to offer an effective means to obtain meaningful principal components for stratification analysis. But what are the optimal weights? We find that PCs have a natural correspondence to inflation of association test statistics, i.e., PC-based covariate corrections are not arbitrary but are in some sense a “correct” way to handle the data. Even simple examinations of the results give information and insight about the genome. Software is available at

Collaborators: Seunggeun Lee, Fred Wright

References – stratification control
Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52: 506–516.
Devlin B, Roeder K. Genomic control for association studies. Biometrics 55:
Schork NJ, Fallin D, Xu X, Blumenfeld M, Cohen D. The future of genetic case-control studies. Am J Hum Genet 65: A86.
Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 65:
Pritchard JK, Stephens M, Donnelly P. 2000a. Inference of population structure using multilocus genotype data. Genetics 155:
Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. 2000b. Association mapping in structured populations. Am J Hum Genet 67:
Zhu X, Zhang SL, Zhao HY, Cooper RS. Association mapping using a mixture model for complex traits. Genet Epidemiol 23:
Zhang SL, Zhu XF, Zhao HY. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol 24:
Price et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904–909.
Fellay et al. (2007) A whole-genome association study of major determinants for host control of HIV-1. Science 317, 944

EXTRA SLIDES

Single-SNP genome-wide scan association analysis (analysis of simulated data from HapSample). Figure: evidence of association of case-control status with SNP marker genotype, plotted against base pair position.

Diagram: platforms by technology and statistics. Platforms: genotype association, transcript profiling, eQTL, metabolomics/proteomics, “pathway” analysis. Technology: reproducible, global vs. less reproducible, or not global. Statistics: testing vs. testing/inference. Naive pitfalls: multiple testing error. Hidden pitfalls: selection bias, unacknowledged dependence.

GWAS Simulation: 100K SNPs, moderate stratification, 1000 simulations. 1800 samples from population 1 and 200 samples from population 2, where disease risk varies by population (OR 2.5). 50K independent markers were simulated with minor allele frequency ranging from 0.05 to 0.5. Baseline $F_{st}$ values were simulated, and 20 SNPs with high $F_{st}$ values, simulated from U(0.1, 0.3), served as highly ancestry-informative markers. An additional 50K SNPs were simulated by using 5% of the SNPs as “seeds” within artificial LD blocks, with pairwise absolute correlations of 0.75 and above. Figure panels: No Adjustment, Traditional PCA, Shrinkage PCA. Axes: observed $F_{st}$; association P-value threshold. Annotations: Type I errors caused by the 20 highest $F_{st}$ SNPs alone; inflated Type I error, even at stringent thresholds.

GWAS simulation, cont.: results from one of the simulated datasets. Figure panels: before shrinkage, after shrinkage.