Population Stratification


Population Stratification Qunyuan Zhang Division of Statistical Genomics GEMS Course M21-621 Computational Statistical Genetics Mar. 24, 2011 https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx

What is Population Stratification (PS)? In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins; the term is used especially in the context of genetic association studies. Population stratification is also referred to as population structure. In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.

PS & False Positives: PS can produce false positives (inflation). An association could be due to the underlying structure of the population even when there is no disease-locus association.

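An Example of PS-caused False Positive

Sub-population 1
         case  control  total  risk (case/control)
A          72        8     80  9/1
a          18        2     20  9/1
total      90       10    100  9/1

Sub-population 2
         case  control  total  risk (case/control)
A           3       27     30  1/9
a           7       63     70  1/9
total      10       90    100  1/9

Mixed population
         case  control  total  risk (case/control)
A          75       35    110  2.14
a          25       65     90  0.38
total     100      100    200  1.00

Within each sub-population there is no disease-locus association: both alleles carry the same risk.
The risk differs between sub-populations (9/1 versus 1/9).
The allele frequency differs between sub-populations (A is 80% in sub-population 1 and 30% in sub-population 2).
The mixed population shows a false disease-locus association: any allele with higher frequency in the higher-risk sub-population appears to be a risk allele.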
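The arithmetic behind this example can be checked with a short sketch. It uses the slide's allele counts for the two sub-populations and pools them by simple addition; the function names `odds` and `pool` are illustrative, not part of the original presentation.

```python
# Case/control counts per allele, copied from the slide's example.
# Layout: {allele: (cases, controls)}
sub1 = {"A": (72, 8), "a": (18, 2)}
sub2 = {"A": (3, 27), "a": (7, 63)}

def odds(table):
    """Case:control odds for each allele (the slide's 'risk' column)."""
    return {allele: cases / controls for allele, (cases, controls) in table.items()}

def pool(*tables):
    """Add up counts across sub-populations, ignoring the stratum labels."""
    pooled = {}
    for table in tables:
        for allele, (cases, controls) in table.items():
            c0, c1 = pooled.get(allele, (0, 0))
            pooled[allele] = (c0 + cases, c1 + controls)
    return pooled

mixed = pool(sub1, sub2)
odds1, odds2, odds_mixed = odds(sub1), odds(sub2), odds(mixed)

print("sub-population 1:", odds1)       # same odds for A and a
print("sub-population 2:", odds2)       # same odds for A and a
print("mixed population:", odds_mixed)  # odds now differ between A and a
```

Within each stratum the two alleles have identical odds, yet the pooled table makes A look like a risk allele purely because A is more common in the higher-risk sub-population.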

Mantel-Haenszel Test for Stratification: (1) adjusted RR, (2) standard error, (3) chi-square test, followed by an example.
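The slide's formulas are not preserved in this transcript, so the following is only a sketch using the textbook Mantel-Haenszel pooled odds ratio and the Cochran-Mantel-Haenszel chi-square (1 df, no continuity correction); the slide itself may use a relative-risk form. The two strata are the sub-populations from the false-positive example, and all function names are illustrative.

```python
# Each stratum is a 2x2 table (a, b, c, d):
#   a = cases with allele A,  b = controls with allele A,
#   c = cases with allele a,  d = controls with allele a.
strata = [(72, 8, 18, 2), (3, 27, 7, 63)]

def crude_or(strata):
    """Odds ratio after collapsing all strata into one table."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return a * d / (b * c)

def mantel_haenszel_or(strata):
    """Stratum-adjusted (pooled) odds ratio."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def cmh_chi_square(strata):
    """Cochran-Mantel-Haenszel chi-square with 1 df, no continuity correction."""
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return (observed - expected) ** 2 / variance

print("crude OR:   ", crude_or(strata))
print("adjusted OR:", mantel_haenszel_or(strata))
print("CMH chi2:   ", cmh_chi_square(strata))
```

On these counts the crude odds ratio is well above 1 while the adjusted estimate is 1 and the stratified chi-square is 0, which is the point of stratifying by sub-population.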

Linear Model. Terms labeled on the slide: marker data, and Q, the population structure variable (also called the genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable). Usually Q is unknown and needs to be estimated.

Estimating Q by Eigen-analysis: X = U S VT, where S holds the singular values and S2 the eigenvalues. [Worked example on slide: genotype matrix X for three individuals (idv1 to idv3) and five SNPs (snp1 to snp5); singular values 3.81, 2.05, 1.13; eigenvalues 14.51, 4.21, 1.28.] Q1, Q2, Q3 are the eigenvectors of COV(X). References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT). Alternatives: SAS Proc PRINCOMP; R svd() and eigen().
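A minimal sketch of this decomposition in Python with NumPy follows. The genotype matrix is hypothetical (the slide's own entries are not recoverable here), and the orientation with SNPs in rows and individuals in columns is an assumption. EIGENSTRAT additionally centers and scales each SNP before decomposing; that step is omitted so the code mirrors the plain X = U S VT on the slide.

```python
import numpy as np

# Hypothetical genotype matrix (not the slide's): rows are SNPs, columns are
# individuals, entries are allele counts 0/1/2.
X = np.array([
    [2, 2, 0],
    [1, 2, 0],
    [2, 1, 1],
    [0, 0, 2],
    [0, 1, 2],
], dtype=float)

# Thin SVD: U is 5x3, s holds 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Squared singular values are the eigenvalues of X'X (individual by individual).
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]  # eigh returns ascending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Columns of Q hold one coordinate per individual: Q1, Q2, Q3.
Q = Vt.T

print("singular values:        ", s)
print("squared singular values:", s ** 2)
print("eigenvalues of X'X:     ", eigenvalues)
print("Q1 (one value per individual):", Q[:, 0])
```

The leading columns of `Q` are what would be carried forward as structure covariates; their signs are arbitrary, so only relative positions of individuals are meaningful.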

Eigen-analysis of HapMap Populations [Figure: plot with axes Q1 and Q2]

Estimating Q by MLE (for admixed population)
G: observed genotypes of admixed [and parental] populations
Q: allelic frequencies in parental populations
P: individual membership, to be estimated
Goal: obtain the P that maximizes Pr(G|P,Q)
Initialization: assign starting values for Q (randomly, or estimated from parental population genotype data) and for P (randomly)
Step 1: compute P(i) by solving the update equation for P
Step 2: compute Q(i) by solving the update equation for Q
Iterate Steps 1 and 2 until convergence.
Reference: Tang et al. Genetic Epidemiology, 2005(28): 289–301
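The update equations are not preserved in this transcript. As a hedged illustration of the idea only, the sketch below runs a generic EM update for one individual's membership vector P while holding the parental allele frequencies Q fixed (as if they had been estimated from parental genotype data). It is not claimed to be the exact procedure of Tang et al.; the model assumed is that each of an individual's two allele copies at a SNP comes from parental population k with probability P[k], and the data and function names are hypothetical.

```python
import math

def log_likelihood(g, p, Q):
    """log Pr(G | P, Q) for one individual, up to an additive constant."""
    total = 0.0
    for j, g_j in enumerate(g):
        f = sum(p[k] * Q[k][j] for k in range(len(p)))  # expected allele frequency
        total += g_j * math.log(f) + (2 - g_j) * math.log(1.0 - f)
    return total

def em_step(g, p, Q):
    """One EM update of the membership proportions p with Q held fixed."""
    K, J = len(p), len(g)
    new_p = [0.0] * K
    for j, g_j in enumerate(g):
        f = sum(p[k] * Q[k][j] for k in range(K))
        for k in range(K):
            # Expected number of this individual's allele copies from population k.
            new_p[k] += g_j * p[k] * Q[k][j] / f
            new_p[k] += (2 - g_j) * p[k] * (1.0 - Q[k][j]) / (1.0 - f)
    return [v / (2 * J) for v in new_p]

def estimate_membership(g, Q, n_iter=200):
    p = [1.0 / len(Q)] * len(Q)  # start from equal membership
    history = [log_likelihood(g, p, Q)]
    for _ in range(n_iter):
        p = em_step(g, p, Q)
        history.append(log_likelihood(g, p, Q))
    return p, history

# Hypothetical allele frequencies for 2 parental populations at 6 SNPs,
# and one admixed individual's genotypes (allele counts 0/1/2).
Q = [[0.9, 0.8, 0.7, 0.9, 0.2, 0.1],
     [0.1, 0.2, 0.3, 0.2, 0.8, 0.9]]
g = [2, 2, 1, 2, 0, 1]

p_hat, history = estimate_membership(g, Q)
print("estimated membership:", p_hat)
print("log-likelihood: start %.4f, end %.4f" % (history[0], history[-1]))
```

In a full analysis the same kind of update would alternate with a re-estimation of Q across all individuals, which is the Step 1 / Step 2 iteration described on the slide.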

Estimating Q by MCMC (for admixed population)
Observed G: genotypes of admixed [and parental] populations
Unknown Z: admixed individuals' membership from ancestral populations
Problem: how to estimate Z?
Approach: Bayesian and Markov Chain Monte Carlo (MCMC) methods
Assume the number of ancestral populations K (see next slide)
Define the prior distribution Pr(Z) under K
Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z)
Average over a large number of MCMC samples to obtain the estimate of Z
Reference: Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE

Infer Population Number (K)

Linear Model (an example including m Q-variables). Software: SAS Proc REG, Proc GENMOD; R lm(), glm(). The generalized versions (Proc GENMOD, glm()) can also fit a binary/categorical y.
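To illustrate what including Q in the model does, here is a small sketch using NumPy least squares in place of the SAS and R routines named on the slide. The data are a hypothetical toy set with a single Q variable: the trait depends only on sub-population membership, and the SNP is more common in one sub-population. The helper `fit` is illustrative.

```python
import numpy as np

# Toy data (hypothetical): Q codes sub-population membership, the SNP is more
# common in the second sub-population, and the trait depends on Q only.
Q = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
snp = np.array([0, 0, 0, 1, 1, 2, 2, 2], dtype=float)
y = 1.0 + 3.0 * Q

def fit(y, *columns):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones_like(y), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

naive = fit(y, snp)        # y ~ snp
adjusted = fit(y, snp, Q)  # y ~ snp + Q

print("SNP effect without Q:", naive[1])
print("SNP effect with Q:   ", adjusted[1])
print("Q effect:            ", adjusted[2])
```

Without Q the SNP picks up the sub-population difference as an apparent effect; with Q in the model the SNP coefficient collapses to zero. With m Q-variables, the extra columns would simply be passed to `fit` in the same way.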

Unified Mixed Model (more general). Terms labeled on the slide: inferred population membership, SNP(s), covariate(s), ID matrix. The variance structure models the resemblance among individuals: V = Z G Z' + R

Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model. Based on MVN, the likelihood of the trait (y) is written in matrix form using: the number of individuals (in a pedigree), n; the phenotype vector; the mean phenotype vector; and the n×n variance-covariance matrix V = Z G Z' + R, which involves the n×n kinship (IBD) matrix.
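The slide's likelihood formula is not preserved in the transcript; the sketch below evaluates the standard multivariate normal log-likelihood, log L = -1/2 [ n log(2π) + log|V| + (y - μ)' V⁻¹ (y - μ) ]. The way V is built here is an assumption made for illustration: Z is the identity, G is twice the kinship matrix times a genetic variance, and R is a residual variance times the identity. The trio, the variance values and the phenotypes are hypothetical.

```python
import numpy as np

def mvn_log_likelihood(y, mu, V):
    """Log-likelihood of phenotype vector y under N(mu, V)."""
    n = len(y)
    resid = y - mu
    _, logdet = np.linalg.slogdet(V)          # log|V|, numerically stable
    quad = resid @ np.linalg.solve(V, resid)  # (y - mu)' V^-1 (y - mu)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Hypothetical trio: father, mother, child (kinship coefficients).
kinship = np.array([[0.50, 0.00, 0.25],
                    [0.00, 0.50, 0.25],
                    [0.25, 0.25, 0.50]])

Z = np.eye(3)                    # assumption: one record per individual
sigma2_g, sigma2_e = 1.0, 0.5    # assumed genetic and residual variances
G = 2.0 * kinship * sigma2_g     # assumption: additive genetic covariance
R = sigma2_e * np.eye(3)
V = Z @ G @ Z.T + R

y = np.array([1.2, 0.4, 0.9])    # hypothetical phenotypes
mu = np.full(3, y.mean())
loglik = mvn_log_likelihood(y, mu, V)
print("V =\n", V)
print("log-likelihood:", loglik)
```

In practice the variance components would be estimated by maximizing this quantity (or its REML variant) rather than fixed in advance.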

Kinship
Inbreeding Coefficient: the inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD): two alleles are copies of the same ancestral allele.
Kinship/Coancestry: the inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then the inbreeding coefficient of Z = the coancestry between X and Y.
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (pedigree and/or marker data are needed)

Kinship Matrix (expected probability of allele sharing among relatives)
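As a rough sketch of how such a matrix can be computed from pedigree data alone, the code below applies the usual recursive rule: founders are unrelated and non-inbred, and a child's kinship with anyone listed earlier is the average of its parents' kinships with that person. The pedigree, the input format and the function name are hypothetical, and individuals with any missing parent are treated as founders.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients from a pedigree.

    pedigree: list of (id, father, mother) with parents listed before their
    children; founders have None for the parents (assumed unrelated).
    Returns a dict keyed by (id1, id2).
    """
    ids = [ind for ind, _, _ in pedigree]
    phi = {}
    for i, (ind, father, mother) in enumerate(pedigree):
        earlier = ids[:i]
        if father is None or mother is None:
            # Founder: unrelated to everyone listed so far, not inbred.
            for other in earlier:
                phi[ind, other] = phi[other, ind] = 0.0
            phi[ind, ind] = 0.5
        else:
            for other in earlier:
                value = 0.5 * (phi[father, other] + phi[mother, other])
                phi[ind, other] = phi[other, ind] = value
            # Self-kinship is (1 + inbreeding coefficient) / 2, and the
            # inbreeding coefficient is the kinship between the parents.
            phi[ind, ind] = 0.5 * (1.0 + phi[father, mother])
    return phi

# Hypothetical pedigree: two founders, two full sibs, and a child of the sibs.
pedigree = [
    ("F", None, None),
    ("M", None, None),
    ("C1", "F", "M"),
    ("C2", "F", "M"),
    ("I", "C1", "C2"),
]
phi = kinship_matrix(pedigree)
inbreeding_I = 2.0 * phi["I", "I"] - 1.0

print("parent-child:", phi["C1", "F"])
print("full sibs:   ", phi["C1", "C2"])
print("inbreeding coefficient of I:", inbreeding_I)
```

The last line illustrates the relation stated for kinship: the inbreeding coefficient of the child equals the coancestry between its parents.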

Resources for Mixed Model with Kinship Matrix

Software     Kinship          Mixed model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()
R: emma      emma.kinship()   emma.REML.t()    Uses marker data to calculate kinship
EMMAX        emmax-kin        emmax

Diagnosis of Inflation of False Positives
Inflation: more false positives than expected under the null.
In GWAS, inflation is usually due to PS.
It can also be caused by inappropriate statistical methods, even with no PS.
Inflation may, but does not necessarily, indicate PS.

Theoretical Basis of Diagnosis: under the null, p-values follow a uniform distribution on [0,1]. [Figures: p-value histograms and -log10(p) Q-Q plots, each shown with no inflation and with inflation]
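The Q-Q comparison can be sketched without plotting: sort the observed -log10(p) values and pair them with the values expected if the p-values were uniform on [0,1]. The expected quantiles (i + 0.5)/n used below are one common plotting convention, chosen here as an assumption, and the function name is illustrative.

```python
import math

def qq_points(p_values):
    """Pairs (expected, observed) of -log10(p), sorted in ascending order.

    Expected values assume uniform p-values and use the (i + 0.5) / n
    plotting positions.
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))

# p-values spread evenly over [0,1] fall on the diagonal (no inflation).
uniform_like = [(i + 0.5) / 10 for i in range(10)]
# Hypothetical p-values shifted toward zero rise above the diagonal.
inflated = [p ** 2 for p in uniform_like]

for label, ps in (("no inflation", uniform_like), ("inflation", inflated)):
    worst = max(obs - exp for exp, obs in qq_points(ps))
    print(label, "max excess over diagonal:", round(worst, 3))
```

Points lying consistently above the diagonal across most of the range, rather than only in the extreme tail, are the pattern that suggests inflation.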

Inflation Rate (IR). Two definitions, each given for a binary trait and for a continuous trait: Devlin et al. 2004; Amin, Duijn, Aulchenko, 2007.

Genomic Control (by IR): the correction is given for a binary trait and for a continuous trait, or can be based on the p-value.
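The slide's formulas are not preserved in this transcript. The sketch below therefore uses the widely used median-based estimator as an assumption: the inflation factor is the median of the observed 1-df chi-square statistics divided by the median of a chi-square distribution with 1 df (about 0.455), and each statistic is divided by that factor. Capping the factor at 1 is a common convention, also assumed here; the function names and test statistics are hypothetical.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_factor(chi2_stats):
    """Median-based inflation factor for 1-df chi-square statistics."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def chi2_1df_p_value(x):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def genomic_control(chi2_stats):
    """Divide every statistic by the inflation factor (never below 1)."""
    lam = max(inflation_factor(chi2_stats), 1.0)
    corrected = [x / lam for x in chi2_stats]
    return lam, corrected, [chi2_1df_p_value(x) for x in corrected]

# Hypothetical test statistics whose median is twice the null median.
stats = [0.2, 0.9098728, 5.0]
lam, corrected, p_values = genomic_control(stats)
print("inflation factor:", lam)
print("corrected statistics:", corrected)
print("corrected p-values:", p_values)
```

Because every statistic is scaled by the same factor, the ranking of SNPs is unchanged; only the significance levels are deflated.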

Practice
Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
Ignore pedigree.csv; test each SNP in snp.csv for association with the trait in trait.csv.
Investigate the p-values to see if there is any inflation, and try to explain why.
List some possible methods to reduce or control the inflation.
Choose one method and apply it to the data. Does it work? Try to explain why.
Clearly document each step of your analysis.
There is no standard answer; feel free to try anything you like!
Report back to linusan@wustl.edu and qunyuan@wustl.edu in one week. Thanks!