Population Stratification Qunyuan Zhang Division of Statistical Genomics GEMS Course M21-621 Computational Statistical Genetics Mar. 24, 2011 https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx
What is Population Stratification (PS)?
In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.
PS & False Positives
False positives (inflation): an association can arise from the underlying structure of the population even when there is no disease-locus association.
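An Example of PS-caused False Positive
(risk = cases / controls in each row)

Sub-population 1
            case   control   total   risk
  A           72         8      80   9/1
  a           18         2      20   9/1
  total       90        10     100   9/1

Sub-population 2
            case   control   total   risk
  A            3        27      30   1/9
  a            7        63      70   1/9
  total       10        90     100   1/9

Mixed population
            case   control   total   risk
  A           75        35     110   2.14
  a           25        65      90   0.38
  total      100       100     200   1.00

Within each sub-population there is no disease-locus association: both alleles carry the same risk.
The sub-populations differ in risk (9/1 versus 1/9).
The sub-populations differ in allele frequency (A: 80% versus 30%).
Pooling them produces a false disease-locus association in the mixed population: any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele.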
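The arithmetic behind this example can be checked with a short sketch. The counts are copied from the slide; the function names `odds_ratio` and `pool` are illustrative only, and the allelic odds ratio is used here as one convenient summary of association (the slide itself reports case/control ratios).

```python
# 2x2 counts per stratum: (A_case, A_control, a_case, a_control), copied from the slide.
STRATA = {
    "sub-population 1": (72, 8, 18, 2),
    "sub-population 2": (3, 27, 7, 63),
}

def odds_ratio(A_case, A_control, a_case, a_control):
    """Allelic odds ratio of allele A versus allele a."""
    return (A_case * a_control) / (A_control * a_case)

def pool(tables):
    """Add the cell counts of several 2x2 tables, ignoring stratum labels."""
    return tuple(sum(cells) for cells in zip(*tables))

within = {name: odds_ratio(*cells) for name, cells in STRATA.items()}
mixed = pool(STRATA.values())
mixed_or = odds_ratio(*mixed)

for name, value in within.items():
    print(f"{name}: OR = {value:.2f}")
print(f"mixed population {mixed}: OR = {mixed_or:.2f}")
```

Within each stratum the odds ratio is 1, while the pooled table gives an odds ratio well above 1, which is the false association described on the slide.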
Mantel-Haenszel Test for Stratification
(1) Adjusted RR
(2) Standard error
(3) Chi-square test
An Example
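The slide's own formulas did not survive extraction, so the following is only a sketch of a stratified analysis in the same spirit. It uses the textbook Mantel-Haenszel common odds ratio and the Cochran-Mantel-Haenszel chi-square (1 df, no continuity correction) as stand-ins; these may differ from the adjusted RR and standard error shown on the slide. The counts are those of the two-sub-population example, and all function names are illustrative.

```python
import math

# Each stratum: (a, b, c, d) = (A_case, A_control, a_case, a_control)
STRATA = [(72, 8, 18, 2), (3, 27, 7, 63)]

def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def cmh_chi_square(strata):
    """Cochran-Mantel-Haenszel statistic (1 df, no continuity correction)."""
    diff = 0.0
    var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n            # observed minus expected count
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff * diff / var

def chi_square_p(x):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

or_mh = mantel_haenszel_or(STRATA)
chi2_stratified = cmh_chi_square(STRATA)

# Same statistic on the pooled table, i.e. ignoring stratification.
pooled = tuple(sum(cells) for cells in zip(*STRATA))
chi2_pooled = cmh_chi_square([pooled])

print(f"MH odds ratio: {or_mh:.2f}")
print(f"stratified chi-square: {chi2_stratified:.2f}, p = {chi_square_p(chi2_stratified):.3g}")
print(f"pooled chi-square: {chi2_pooled:.2f}, p = {chi_square_p(chi2_pooled):.3g}")
```

The stratified test finds no association, whereas the same statistic computed on the pooled table is large; conditioning on the stratum is what removes the false positive.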
Linear Model
Model inputs: marker data and Q.
Q is the population structure variable. It is also described as a genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable.
Usually Q is unknown and needs to be estimated.
Estimating Q by Eigen-analysis
X = U S V^T

X (genotypes; rows are SNPs, columns are individuals):
          idv1   idv2   idv3
  snp1       0      2      1
  snp2       1      2      2
  snp3       0      0      1
  snp4       0      1      0
  snp5       2      0      0

S (singular values on the diagonal): 3.81, 2.05, 1.13
S^2 (eigenvalues): 14.51, 4.21, 1.28

U:
  snp1   -0.55    0.33    0.34
  snp2   -0.78   -0.10   -0.27
  snp3   -0.16    0.04   -0.71
  snp4   -0.20    0.14    0.52
  snp5   -0.15   -0.93    0.20

V (columns Q1, Q2, Q3 are the eigenvectors of COV(X)):
            Q1      Q2      Q3
  idv1   -0.28   -0.95    0.11
  idv2   -0.75    0.29    0.59
  idv3   -0.60    0.08   -0.80

References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)
Other software: SAS Proc PRINCOMP; R svd() and eigen()
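As a rough illustration of the eigen-analysis, the sketch below recovers the leading axis (Q1) by power iteration on X^T X, using only the Python standard library. Two assumptions should be stated clearly: the 5 x 3 genotype matrix is the one implied by multiplying out the slide's U, S and V values, and the matrix is used uncentered and unnormalized, as the slide's numbers suggest (EIGENSTRAT normalizes each SNP first). Power iteration is a swapped-in technique chosen to keep the example dependency-free; in practice a library SVD routine such as R's svd() would be used.

```python
import math

# Genotype matrix X: rows are SNPs, columns are individuals (assumed from the slide's U, S, V).
X = [
    [0, 2, 1],
    [1, 2, 2],
    [0, 0, 1],
    [0, 1, 0],
    [2, 0, 0],
]

def gram(matrix):
    """Return X^T X, the individual-by-individual cross-product matrix."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]

def leading_eigenpair(A, iterations=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    v = [1.0] * len(A)
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        eigenvalue = math.sqrt(sum(x * x for x in w))  # norm of A v converges to the top eigenvalue
        v = [x / eigenvalue for x in w]
    return eigenvalue, v

eigenvalue, q1 = leading_eigenpair(gram(X))
singular_value = math.sqrt(eigenvalue)

print(f"largest eigenvalue of X^T X: {eigenvalue:.2f}")
print(f"largest singular value of X: {singular_value:.2f}")
print("Q1 (sign is arbitrary):", [round(x, 2) for x in q1])
```

The sign of an eigenvector is arbitrary, so Q1 here comes out positive while the slide lists the same axis with negative entries.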
Eigen-analysis of HapMap Populations (figure: scatter plot with axes Q1 and Q2)
Estimating Q by MLE (for admixed population)
G: observed genotypes of admixed [and parental] populations
Q: allelic frequencies in parental populations
P: individual membership, to be estimated
Goal: obtain P that maximizes Pr(G|P,Q)
Step 0: assign initial values for Q (randomly, or estimated from parental population genotype data) and for P (randomly)
Step 1: solve for P(i)
Step 2: solve for Q(i)
Iterate Steps 1 and 2 until convergence.
Reference: Tang et al. Genetic Epidemiology, 2005(28): 289–301
Estimating Q by MCMC (for admixed population)
Observed G: genotypes of admixed [and parental] populations
Unknown Z: admixed individuals' membership from ancestral populations
Problem: how to estimate Z? Use Bayesian and Markov Chain Monte Carlo (MCMC) methods:
1. Assume the number of ancestral populations, K (see next slide)
2. Define the prior distribution Pr(Z) under K
3. Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z)
4. Average over a large number of MCMC samples to obtain an estimate of Z
Reference: Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE
Infer Population Number (K)
Linear Model (an example including m Q-variables)
Software: SAS Proc REG, Proc GENMOD; R lm(), glm()
Proc GENMOD and glm() fit generalized linear models, so they can also handle binary/categorical y.
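To make the role of the Q-variables concrete, here is a minimal sketch that fits y on a SNP with and without one Q covariate by ordinary least squares. The six-individual data set is entirely hypothetical, constructed so that the SNP has no effect within either sub-population, and the helper names `solve` and `ols` are illustrative. In practice one would use lm()/glm() or Proc REG as listed on the slide.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(row) for row in zip(*columns)]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data: Q = sub-population label, snp = genotype (0/1/2), y = quantitative trait.
Q   = [0, 0, 0, 1, 1, 1]
snp = [0, 0, 1, 1, 2, 2]
y   = [1.0, 1.2, 1.1, 3.1, 3.0, 3.2]

unadjusted = ols([snp], y)       # y ~ snp
adjusted = ols([snp, Q], y)      # y ~ snp + Q

print(f"SNP effect without Q: {unadjusted[1]:.3f}")
print(f"SNP effect with Q:    {adjusted[1]:.3f}")
print(f"Q effect:             {adjusted[2]:.3f}")
```

Without Q the SNP picks up the trait difference between sub-populations; once Q is in the model the SNP coefficient drops to zero and the difference is attributed to Q.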
Unified Mixed Model (more general)
Model terms: SNP(s), covariate(s), inferred population membership, and an ID matrix.
The resemblance among individuals is modeled through the variance-covariance matrix V = Z G Z' + R.
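Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model
Based on MVN, the likelihood of the trait (y) in matrix form is:
$L(y) = (2\pi)^{-n/2}\,|V|^{-1/2}\,\exp\left[-\tfrac{1}{2}\,(y-\mu)'\,V^{-1}\,(y-\mu)\right]$
where n is the number of individuals (in a pedigree), y is the phenotype vector, $\mu$ is the mean phenotype vector, and V is the n×n variance-covariance matrix.
V = Z G Z' + R, where the n×n kinship (IBD) matrix enters through G.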
Kinship
Inbreeding Coefficient: the inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD): two alleles are IBD if they are copies of the same ancestral allele.
Kinship/Coancestry: the inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then the inbreeding coefficient of Z = the coancestry between X and Y.
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), and others (these need pedigree and/or marker data)
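The relationship between coancestry and inbreeding can be sketched with the standard recursive pedigree algorithm. The five-person pedigree below is hypothetical (founders X and Y, their children Z1 and Z2, and W, the child of Z1 and Z2), and the function names are illustrative; the tools listed on the slide are what one would use on real pedigrees.

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (father, mother); founders have (None, None).
# Parents must be listed before their children.
PEDIGREE = {
    "X": (None, None),
    "Y": (None, None),
    "Z1": ("X", "Y"),
    "Z2": ("X", "Y"),
    "W": ("Z1", "Z2"),
}
ORDER = {name: rank for rank, name in enumerate(PEDIGREE)}

@lru_cache(maxsize=None)
def kinship(i, j):
    """Kinship (coancestry) coefficient between individuals i and j."""
    if i is None or j is None:
        return 0.0                      # unknown parents are treated as unrelated
    if ORDER[i] < ORDER[j]:
        i, j = j, i                     # recurse through the later-listed individual
    father, mother = PEDIGREE[i]
    if i == j:
        return 0.5 * (1.0 + kinship(father, mother))
    return 0.5 * (kinship(father, j) + kinship(mother, j))

def inbreeding(i):
    """Inbreeding coefficient = coancestry between the individual's parents."""
    father, mother = PEDIGREE[i]
    return kinship(father, mother)

print("kinship(X, Y)   =", kinship("X", "Y"))
print("kinship(Z1, Z2) =", kinship("Z1", "Z2"))
print("inbreeding(W)   =", inbreeding("W"))
print("kinship(W, W)   =", kinship("W", "W"))
```

Full siblings have kinship 0.25, so their child W has an inbreeding coefficient of 0.25, matching the rule stated on the slide.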
Kinship Matrix (expected probability of allele sharing among relatives)
Resources for Mixed Model with Kinship Matrix

  Software     Kinship          Mixed Model      Data
  SAS          Proc INBREED     Proc MIXED       Quantitative trait, pedigree data
  SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait, pedigree data
  R: kinship   makekinship()    lmekin()
  R: emma      emma.kinship()   emma.REML.t()    Uses marker data to calculate kinship
  EMMAX        emmax-kin        emmax            Uses marker data to calculate kinship
Diagnosis of Inflation of False Positives
Inflation: more false positives than expected under the null.
In GWAS, inflation is usually due to PS.
It can also be caused by inappropriate statistical methods, even with no PS.
Inflation may, but does not necessarily, indicate PS.
Theoretical Basis of Diagnosis
Under the null, p-values follow a uniform distribution on [0,1].
Figures: histogram of p-values and Q-Q plot of -log10(p), each shown with no inflation and with inflation.
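The coordinates of such a Q-Q plot can be computed as in the sketch below. It assumes the common plotting convention of expected quantiles (i - 0.5)/n for the i-th smallest of n p-values (other conventions exist), and the p-values are synthetic. The function name `qq_points` is illustrative.

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10(p), sorted from smallest to largest."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))

# Synthetic p-values: perfectly uniform, and an "inflated" set that is too small.
n = 20
uniform_p = [(i - 0.5) / n for i in range(n, 0, -1)]
inflated_p = [p ** 2 for p in uniform_p]

null_points = qq_points(uniform_p)
inflated_points = qq_points(inflated_p)

for (e, o_null), (_, o_infl) in zip(null_points[-3:], inflated_points[-3:]):
    print(f"expected {e:.2f}  observed(null) {o_null:.2f}  observed(inflated) {o_infl:.2f}")
```

Under the null the points fall on the diagonal; with inflation the observed -log10(p) values rise above it.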
Inflation Rate (IR)
Devlin et al. 2004: for binary trait; for continuous trait
Amin, Duijn, Aulchenko, 2007: for binary trait; for continuous trait
Genomic Control (by IR)
For binary trait; for continuous trait; or based on p-value
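One widely used version of this idea is sketched below. It assumes 1-df chi-square test statistics, estimates the inflation rate as the median statistic divided by the median of a 1-df chi-square distribution (about 0.455), and divides every statistic by that rate. This is a median-based convention and may differ in detail from the formulas on the slides; the test statistics are hypothetical and the function names are illustrative.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_rate(chi2_stats):
    """Median-based estimate of the inflation rate (lambda)."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def chi_square_p(x):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def genomic_control(chi2_stats):
    """Divide each statistic by lambda (never inflating when lambda < 1)."""
    lam = max(1.0, inflation_rate(chi2_stats))
    adjusted = [x / lam for x in chi2_stats]
    return lam, adjusted, [chi_square_p(x) for x in adjusted]

# Hypothetical 1-df chi-square statistics from a scan of seven SNPs.
observed = [0.2, 0.6, 0.91, 1.8, 3.5, 7.7, 12.0]
lam, adjusted, pvalues = genomic_control(observed)

print(f"inflation rate: {lam:.2f}")
for raw, adj, p in zip(observed, adjusted, pvalues):
    print(f"chi2 {raw:5.2f} -> {adj:5.2f}  (p = {chi_square_p(raw):.3g} -> {p:.3g})")
```

After correction the median of the adjusted statistics equals the null median, and every p-value is at least as large as before.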
Practice
1. Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
2. Ignore pedigree.csv; test each SNP in snp.csv for association with the trait in trait.csv.
3. Investigate the p-values to see if there is any inflation. Try to explain why.
4. List some possible methods to reduce or control the inflation.
5. Choose one method and apply it to the data. Does it work? Try to explain why.
Clearly document each step of your analysis. There is no standard answer; feel free to try anything you like!
Report back to linusan@wustl.edu and qunyuan@wustl.edu in one week. Thanks!