EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 6: Population stratification Peter Kraft Bldg 2 Rm
Population stratification
Confounding due to correlated differences in allele frequencies and disease risks across unobserved subpopulations
Extent and impact vary
–Likely to be negligible when the source population is made up of many subpopulations with small differences in allele freqs and disease risk [e.g. non-Hispanic European Americans]
–More likely to be appreciable when the source population is made up of [or is an admixture of] two subpopulations with larger differences in allele freqs and disease risks [e.g. African Americans, Mexicans, Puerto Ricans]
Classic example: Knowler et al. (1988) AJHG 43:520
When stratified by degree of Indian heritage, no evidence of association between Gm and diabetes was found: degree of Indian heritage is a confounder.
But differences in allele frequencies and disease rates do not always lead to population stratification bias…
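[Figure: bladder cancer incidence and NAT2 genotype frequencies in 8 European populations; panels: Men, Women. Adapted from Wacholder et al. (2000) JNCI 92:]
Degree of confounding depends on [formula not recoverable from the slide]
In special case of Armitage Trend Test [formula not recoverable from the slide]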
…for bias to result, need correlated differences
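A worked example of why the differences must be correlated. The sketch below computes the expected pooled odds ratio when genotype and disease are independent within every subpopulation. The function name and the subpopulation sizes, carrier frequencies and disease risks are hypothetical values chosen for illustration, not data from the lecture.

```python
def pooled_odds_ratio(strata):
    """Expected crude odds ratio after pooling subpopulations.

    strata: list of (size, carrier_freq, disease_risk) tuples.
    Within each stratum carrier status and disease are independent,
    so every stratum-specific odds ratio is exactly 1.
    """
    a = b = c = d = 0.0
    for n, p, r in strata:
        a += n * p * r              # carriers with disease
        b += n * p * (1 - r)        # carriers without disease
        c += n * (1 - p) * r        # non-carriers with disease
        d += n * (1 - p) * (1 - r)  # non-carriers without disease
    return (a * d) / (b * c)


# Hypothetical scenarios (illustrative numbers only)
both_differ = pooled_odds_ratio([(1000, 0.2, 0.1), (1000, 0.6, 0.3)])
only_freq_differs = pooled_odds_ratio([(1000, 0.2, 0.2), (1000, 0.6, 0.2)])
only_risk_differs = pooled_odds_ratio([(1000, 0.4, 0.1), (1000, 0.4, 0.3)])
print(both_differ, only_freq_differs, only_risk_differs)
```

With these inputs the pooled odds ratio is about 1.67 when both allele frequency and risk differ, and 1.0 when only one of them differs, even though the true within-population odds ratio is 1 in all three scenarios.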
Campbell et al. [Nat Genet 2005] found a correlation between alleles in the lactase gene (LCT) and height in a European-American sample
But they observed strong trends in both LCT allele frequency and height with respect to North-South European grandparental ancestry
The association between the LCT SNP and height in the total sample was strong (p < 10^-6). The association was weakened when data were stratified by grandparental ancestry. It disappeared when tested in two independent, ethnically homogeneous studies, one in Poland and one in Sweden (the latter being a family-based study).
Voila, population stratification bias in a European American population. But how common is this? Does it mean we should estimate and adjust for population stratification in studies of U.S. whites?
a) Evidence against population stratification
b) Potential for population stratification, but gradient in allele frequencies should follow gradient in phenotype
What’s missing from this argument?
[Three diagrams over the nodes G, X and D; arrows not recoverable from the slide]
–Population structure, no confounding
–No population structure, no confounding
–Population structure, potential confounding
Adjusting for X unnecessary or insufficient; can even reduce power
What to do?
Match on ethnic ancestry
–Self report may not be accurate
–Ethnicity [e.g. “race”] may not be good surrogate for ancestry
–Difficult to match mixed ancestry subjects
Adjust using multiple unlinked markers
–“Structured association”
  Use markers to test for and assign individuals to latent classes
  Most popular software: STRUCTURE (J Pritchard)
–“Genomic control”
  Estimate “test statistic inflation” and adjust accordingly
–Adjust for multiple random, unlinked markers
  Surrogate for genetic variation across subpopulation
  Most popular software: EIGENSTRAT (A Price, N Patterson)
Use family-based controls
–Siblings (conditional logistic)
–Case-parent “pseudocontrols” (TDT, FBAT etc.)
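As an illustration of the family-based option, here is a minimal sketch of the standard TDT statistic for a biallelic marker: a McNemar-type test on transmissions from heterozygous parents to affected offspring. The function name and the transmission counts are hypothetical; real analyses would use dedicated software such as FBAT.

```python
import math


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic marker.

    transmitted:   times heterozygous parents passed the allele of
                   interest to an affected child
    untransmitted: times they passed the other allele
    Returns (chi-square on 1 d.f., p-value).
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a 1-d.f. chi-square: P(chi2_1 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, for illustration only
stat, p_value = tdt(60, 40)
print(stat, p_value)
```

Because each parent's untransmitted allele serves as the control, the comparison is made within families, which is why this design is not subject to population stratification bias.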
Am. J. Hum. Genet., 76: , 2005 Clusters based on 326 microsatellites Self report of ethnicity is a good surrogate for gross differences in ancestry…
Selection of a set of SNPs for population stratification
Starting panels: ILLUMINA 550K, ILLUMINA 317K, AFFYMETRIX 500K [SNP counts at each step not recoverable from the slide]
–Remove SNPs with call rate < 90% on either Illumina or Affymetrix platform
–Remove untyped or monomorphic SNPs in YRI or (JPT+CHB)
–Remove SNPs with P-values for HW proportion < [threshold not recoverable]
–Select a set of SNPs with local pairwise r^2 < [threshold not recoverable]
Slide courtesy of G Thomas
A model of a structured population
Population studied:
–Europe: CEPH founders => 60 individuals (HapMap)
–African: YRI founders => 59 individuals (HapMap)
–Asian: CHB => 44 individuals (HapMap)
–Asian: JPT => 45 individuals (HapMap)
–Native American: Mexican => 30 individuals (Penn State U.*)
–Native American: Mayan => 25 individuals (Penn State U.*)
–African Americans => 15 individuals (Penn State U.*)
–"Latino" => 7 individuals (SNP500)
* Courtesy of X. Mao, E. Parra and M. Shriver
Total of 285 individuals
Slide courtesy of G Thomas
First to third components
[Figure: scatter plots of the first, second and third principal components; groups labeled CEU; CHB, JPT; YRI; Latino; Native American; African American]
Slide courtesy of G Thomas
But cannot capture within-ethnicity variation…
[Figure: plot of 1st and 2nd principal components of variation for ca. 10,000 self-described European(-descended) subjects in the CGEMS prostate cancer GWAS; axes: 1st PC, 2nd PC; ATBC=Finns]
Structured Association
Genotype multiple unlinked, anonymous markers
–Very unlikely to be (or be near) causal loci
–Best to choose ancestry informative markers (AIMs)
  Known for African, European, native American populations
  Not known to distinguish among European populations
To test for strat’n, sum disease-marker chi-squares
–This sum has d.f. = sum of individual tests’ d.f.
Use clustering algorithm to estimate structure
–STRUCTURE, ADMIXMAP based on pop’n genetics models
  STRUCTURE does not assume allele freqs in ancestral pop’ns known
  ADMIXMAP does
–Use estimated admixture as covariates or matching vars
Pritchard & Rosenberg (1999) AJHG 65:
Pritchard et al. (2000) Genetics 155:
Toy example
150 subjects from Pop’n 1
–Disease incidence 15%
–150 markers with allele freqs ~ Beta(1,10)
–Allele freqs, markers independent of disease
150 subjects from Pop’n 2
–Disease incidence 30%
–150 markers with allele freqs ~ Beta(1,10)
–Allele freqs, markers independent of disease
Adjustment = appropriate stratified analysis…
…but test for stratification
Sum chi-squares = [value lost in extraction] on 150 d.f., p = .065
Pop’n stratification appears to inflate Type I error rate…
Still, there is strong evidence that there are two distinct subpopulations.
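The toy example and the summed chi-square test for stratification can be sketched as a small simulation. This is one reading of the slide, not the lecturer's code: it assumes allele frequencies are drawn independently for each population, genotypes are binomial allele counts, the per-marker test is the 1-d.f. Armitage trend statistic N·r², and the tail probability uses the Wilson-Hilferty normal approximation. All function names and the seed are hypothetical.

```python
import math
import random


def trend_chi2(genos, status):
    """Armitage trend statistic N * r^2 (1 d.f.); None if undefined."""
    n = len(genos)
    mg, md = sum(genos) / n, sum(status) / n
    sgg = sum((g - mg) ** 2 for g in genos)
    sdd = sum((d - md) ** 2 for d in status)
    if sgg == 0 or sdd == 0:  # monomorphic marker or no cases/controls
        return None
    sgd = sum((g - mg) * (d - md) for g, d in zip(genos, status))
    return n * sgd * sgd / (sgg * sdd)


def chi2_upper_tail(x, df):
    """Wilson-Hilferty normal approximation to P(chi2_df > x)."""
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))


def simulate(seed=1, n_per_pop=150, n_markers=150, risks=(0.15, 0.30)):
    """Two populations; markers are independent of disease within each."""
    rng = random.Random(seed)
    status = []
    genos = [[] for _ in range(n_markers)]
    for risk in risks:
        # Assumption: each population draws its own Beta(1,10) allele freqs
        freqs = [rng.betavariate(1, 10) for _ in range(n_markers)]
        for _ in range(n_per_pop):
            status.append(1 if rng.random() < risk else 0)
            for m, f in enumerate(freqs):
                genos[m].append((rng.random() < f) + (rng.random() < f))
    stats = [trend_chi2(g, status) for g in genos]
    stats = [s for s in stats if s is not None]
    total = sum(stats)
    return total, len(stats), chi2_upper_tail(total, len(stats))


total, df, p = simulate()
print(f"sum of chi-squares = {total:.1f} on {df} d.f., p = {p:.3f}")
```

The summed statistic and its p-value depend on the seed, so a single run says little; repeating the simulation over many seeds shows how often the pooled test rejects. As a check on the approximation, a sum of 177 on 150 d.f. gives a tail probability of about 0.065.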
[Figure: admixture proportions for subjects from 1st and 2nd stages of CGEMS prostate scan, based on that same panel of 10,000 markers. Thomas et al, submitted]
In practice, STRUCTURE is applied to a “spiked” data set (your data plus three HapMap samples) to detect gross outliers or data handling errors
Drawbacks to structured association
–Computationally intensive
–Markers should be unlinked
–Model-based
–User has to specify number of ancestral populations
Genomic control
For modest pop’n stratification, the dist’n of the test stat $X$ is roughly $\lambda\chi^2_1$
So why not estimate $\lambda$ (the test statistic inflation factor) to get $X^* = X/\hat{\lambda}$?
$\hat{\lambda}$ = [sum of marker chi-squares]/150 = 1.18
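A minimal sketch of the genomic control adjustment for 1-d.f. test statistics. The mean-based estimate matches the slide's sum/150; the median-based alternative (median divided by the median of a 1-d.f. chi-square, about 0.455) is a commonly used robust variant and is included here as an assumption rather than something the slide specifies. Function names and the example statistic are hypothetical.

```python
import math
import statistics

CHI2_1_MEDIAN = 0.4549364  # median of a chi-square with 1 d.f.


def inflation_factor(null_stats, method="mean"):
    """Estimate lambda from 1-d.f. chi-squares at null (unlinked) markers."""
    if method == "mean":
        lam = statistics.fmean(null_stats)
    elif method == "median":
        lam = statistics.median(null_stats) / CHI2_1_MEDIAN
    else:
        raise ValueError("method must be 'mean' or 'median'")
    # Convention assumed here: never deflate, so lambda is floored at 1
    return max(lam, 1.0)


def gc_adjust(x, lam):
    """Return the adjusted statistic X* = X / lambda and its p-value."""
    x_star = x / lam
    # Upper tail of a 1-d.f. chi-square: P(chi2_1 > x) = erfc(sqrt(x / 2))
    return x_star, math.erfc(math.sqrt(x_star / 2))


# Hypothetical candidate-marker statistic, adjusted with lambda = 1.18
x_star, p_adj = gc_adjust(3.84, 1.18)
print(x_star, p_adj)
```

Note that the same divisor is applied to every marker, which is the sense in which genomic control treats all markers alike regardless of how strongly each one tracks ancestry.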
How many markers needed? Nature Genetics 36, (2004)
Likelihoods for inflation factors for studies with 1000 cases and controls Nature Genetics 36, (2004)
But setting practical problems aside, is genomic control the right thing to do?
Population stratification bias
–Under null, Armitage trend test $X^2$ is distributed as $\chi^2_1(\xi)$, where ξ = NΔ^2/(2 [symbol lost in extraction]^2)
Cryptic relatedness
–Under null, $\lambda^{-1}X^2$ is distributed as $\chi^2_1(0)$, where λ = 1 + (a_11 - a_01)^2 [term garbled in extraction: "2"] f N/(1+f)
Genomic control corrects this kind of distortion [a rescaling by λ]…
…but not this [a noncentrality shift ξ]
A Whittemore, unpublished MS
Adjusting for many random markers
Unlike genomic control:
–Doesn’t penalize the innocent for the sins of the guilty
–Does a better job penalizing the guilty
Variants:
–Price: use Principal Components to summarize many markers
  Can use clever computational trick (Tracy-Widom statistic: Patterson 2007 PLOS Genet) to decide how many components, or just eyeball PC plots
  Adjust for these structure-related PCs
–Wang/Balding: adjust for SNPs in non-candidate genes
–Epstein & Satten: Wang/Balding meets propensity score
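The principal-components variant can be sketched in a few functions. This is a simplified illustration in the spirit of EIGENSTRAT, not the EIGENSTRAT implementation: PCs are found by plain power iteration, and the test statistic is taken to be (N − K − 1) times the squared correlation between PC-adjusted genotype and PC-adjusted phenotype. All function names and the tiny data set are hypothetical.

```python
import math
import random


def standardize(col):
    """Center and scale one marker's allele counts (zeros if monomorphic)."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col] if sd > 0 else [0.0] * n


def top_pcs(markers, k=1, iters=100, seed=0):
    """Top-k subject-level PCs by power iteration with deflation.

    markers: list of markers, each a list of per-subject allele counts.
    Assumes the data contain at least k non-trivial directions.
    """
    z = [standardize(col) for col in markers]
    n = len(z[0])
    rng = random.Random(seed)
    pcs = []
    for _ in range(k):
        u = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iters):
            w = [sum(a * b for a, b in zip(col, u)) for col in z]  # Z u
            u = [sum(wm * col[i] for wm, col in zip(w, z)) for i in range(n)]
            for p in pcs:  # remove components already found
                dot = sum(a * b for a, b in zip(u, p))
                u = [a - dot * b for a, b in zip(u, p)]
            norm = math.sqrt(sum(v * v for v in u))
            u = [v / norm for v in u]
        pcs.append(u)
    return pcs


def residualize(v, pcs):
    """Remove the mean and the projection onto each (orthonormal) PC."""
    mean = sum(v) / len(v)
    r = [x - mean for x in v]
    for p in pcs:
        dot = sum(a * b for a, b in zip(r, p))
        r = [a - dot * b for a, b in zip(r, p)]
    return r


def assoc_stat(geno, pheno, pcs=()):
    """(N - K - 1) * r^2 between PC-adjusted genotype and phenotype."""
    g, d = residualize(geno, pcs), residualize(pheno, pcs)
    sgd = sum(a * b for a, b in zip(g, d))
    sgg, sdd = sum(a * a for a in g), sum(a * a for a in d)
    return (len(g) - len(pcs) - 1) * sgd ** 2 / (sgg * sdd)


# Hypothetical data: two populations of 20 that differ at 5 structure markers
structure_markers = [[0] * 20 + [2] * 20 for _ in range(5)]
pheno = [1] * 5 + [0] * 15 + [1] * 15 + [0] * 5   # risk 25% vs 75%
geno = ([1] + [0] * 4) + ([1] * 3 + [0] * 12) + ([1] * 3 + [2] * 12) + ([1] + [2] * 4)
pcs = top_pcs(structure_markers, k=1)
unadjusted = assoc_stat(geno, pheno)
adjusted = assoc_stat(geno, pheno, pcs)
print(unadjusted, adjusted)
```

In this constructed data set the test marker is independent of disease within each population, so the unadjusted statistic (about 7.8) is entirely due to structure and the PC-adjusted statistic is essentially zero. A marker whose frequency does not track the PCs would be left almost unchanged, in contrast to genomic control's uniform rescaling.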
Clear Population Stratification Bias
[Figure: Q-Q plot for NHS Hair Color Scan; axis: -log10 p-value. Black line: unadjusted (λ=1.24). Red line: adjusted for top four PCs (λ=1.02)]
No Clear Population Stratification Bias Q-Q plot for Prostate Cancer Scan
CGEMS prostate cancer example Not a surprise? –Empiric evidence of subtle genetic differences across region even in U.S. self-described whites is mounting… –…and there is some evidence of variation in prostate cancer rates across regions… –… but (a) the latter pattern is complex and its causes are unclear, and (b) the chance that the two patterns would coincide is small.
Caveats: Not a Foolproof Panacea
Rule of thumb: need at least 1,000 markers
–Many more are better!
Linked markers can distort PCs
Will not rescue poor design
Prostate Cancer - Population Structure BPC3 12 Sub-cohorts
7 White Sub-cohorts
1 Japanese Sub-cohort 1 Hawaiian Sub-cohort 1 Latino Sub-cohort 2 African American Sub-cohorts
Tough to fix using naïve application of EIGENSTRAT (better to match cases and controls on inferred ancestry, cf PLINK IBD matching or K Roeder [in preparation?]) Red=cases Black=Controls
Pop’n strat’n bias: to recap
A concern for recently admixed populations
Less of a concern for U.S. non-Hispanic Europeans
–Still, with large sample sizes small effects will be detected
–May affect many markers across genome
Good study design can avoid worst bias
Genomic control may help
–Difficult to calibrate for small p-value thresholds
–Can be too conservative or too anti-conservative, depending on (unknown) degree of pop’n strat’n
Structured association intuitive and effective
–But performance greatly enhanced by use of AIMs…
–…in absence of AIMs, degree of stratification overestimated

References
Pritchard & Rosenberg (1999) AJHG 65: Testing for/estimating structure
Pritchard et al. (2000) Genetics 155: Testing for/estimating structure
Devlin & Roeder (1999) Biometrics 55: Genomic control
Bacanu et al. (2000) AJHG 66: Genomic control
Wacholder et al. (2000) JNCI 92(14): Extent of pop’n strat’n
Reich et al. (2001) Genet Epidemiol 20:4-16 Genomic control
Thomas & Witte (2002) CEBP 11: Extent of pop’n strat’n
Wacholder et al. (2002) CEBP 11: Extent of pop’n strat’n
Freedman et al. (2004) Nat Genet 36: Genomic control
Marchini et al. (2004) Nat Genet 36: Extent of pop’n strat’n, genomic control
Tang et al. (2005) AJHG 76: Self-reported ethnicity and genetic structure