Population stratification

Population stratification: background & PLINK practical

Variation between and within populations
Any two humans differ at ~0.1% of their genome (about 1 in every ~1000 bp).
~8% of this variation is accounted for by the major continental racial groups.
The majority of variation is within groups, but genetic data can still be used to cluster individuals accurately, although the biological concept of "race" in this context is controversial.

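Stratified populations: Wahlund effect
Sub-population      1       2       1+2
A1                  0.1     0.9     0.5
A2                  0.9     0.1     0.5
A1A1                0.01    0.81    0.41 (0.25)
A1A2                0.18    0.18    0.18 (0.50)
A2A2                0.81    0.01    0.41 (0.25)
Values in parentheses are the Hardy-Weinberg expectations at the pooled allele frequency of 0.5: the pooled sample shows a deficit of heterozygotes.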
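
The pooled column of the Wahlund table can be checked numerically. The sketch below assumes the two sub-populations are each in Hardy-Weinberg equilibrium and are mixed in equal proportions (an assumption implied by the pooled allele frequency of 0.5); the function names are illustrative only.

```python
def hwe_genotype_freqs(p):
    """Return (A1A1, A1A2, A2A2) frequencies under HWE for A1 allele frequency p."""
    q = 1.0 - p
    return (p * p, 2 * p * q, q * q)


def pooled_genotype_freqs(allele_freqs, weights):
    """Mix the HWE genotype frequencies of each sub-population using the given weights."""
    pooled = [0.0, 0.0, 0.0]
    for p, w in zip(allele_freqs, weights):
        for i, g in enumerate(hwe_genotype_freqs(p)):
            pooled[i] += w * g
    return tuple(pooled)


# A1 frequencies 0.1 and 0.9 from the slide; equal mixing weights are assumed.
observed = pooled_genotype_freqs([0.1, 0.9], [0.5, 0.5])
expected = hwe_genotype_freqs(0.5)  # HWE expectation at the pooled allele frequency

print([round(x, 2) for x in observed])  # [0.41, 0.18, 0.41]
print([round(x, 2) for x in expected])  # [0.25, 0.5, 0.25]
```

The heterozygote deficit (0.18 observed versus 0.50 expected) arises purely from mixing, even though each sub-population is itself in equilibrium.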

Quantifying population structure
Expected average heterozygosity:
  in a random-mating subpopulation: HS
  in the total population: HT
From the previous example, HS = 0.18 and HT = 0.5.
Wright's fixation index: FST = (HT - HS) / HT = (0.5 - 0.18) / 0.5 = 0.64
Typical values: 0.01 - 0.05 for European populations; 0.1 - 0.3 for the most divergent populations.
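
As a small check of the FST arithmetic, the sketch below computes HS, HT and FST for a biallelic marker. It assumes random mating within each sub-population and mixing weights that sum to 1; the function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic marker with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(allele_freqs, weights):
    """Wright's FST = (HT - HS) / HT for one biallelic marker."""
    p_total = sum(w * p for p, w in zip(allele_freqs, weights))
    h_s = sum(w * heterozygosity(p) for p, w in zip(allele_freqs, weights))
    h_t = heterozygosity(p_total)
    return (h_t - h_s) / h_t


# Allele frequencies 0.1 and 0.9 in two equally weighted sub-populations.
print(round(fst([0.1, 0.9], [0.5, 0.5]), 2))  # 0.64
```

When all sub-populations share the same allele frequency, HS equals HT and FST is 0.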

Confounding due to unmeasured variables is a common issue in epidemiology ("Simpson's paradox").
Berkeley sex bias case: it was claimed that female graduate applicants were discriminated against, because 44% of men were accepted versus 35% of women.
Stratified by department, however, there were no intra-department differences (see figure); i.e. women were more likely to apply to departments that were harder to get into (for both males and females).
In genetic association studies:
  "accepted or not" -> disease or not
  "male/female" -> genetic variant
  "department" -> ancestry
This happens when both outcome and genotype frequencies vary between different ethnic groups in the sample.
[Figure, by department: male/female acceptance rate; of all applicants, % female.]
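
The same effect can be reproduced with a toy genetic example. The counts below are entirely hypothetical and were chosen so that the variant has no effect within either ancestry group, while disease risk and carrier frequency both differ between groups.

```python
def odds_ratio(carrier_cases, carrier_controls, noncarrier_cases, noncarrier_controls):
    """Odds ratio for disease given carrier status from a 2x2 table."""
    return (carrier_cases * noncarrier_controls) / (carrier_controls * noncarrier_cases)


# Hypothetical counts: (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
strata = {
    "ancestry_1": (20, 180, 80, 720),   # 10% disease risk, 20% carriers
    "ancestry_2": (240, 560, 60, 140),  # 30% disease risk, 80% carriers
}

for name, table in strata.items():
    print(name, round(odds_ratio(*table), 2))  # 1.0 in both strata

# Ignoring ancestry: add the two tables cell by cell.
pooled = tuple(sum(cells) for cells in zip(*strata.values()))
print("pooled", round(odds_ratio(*pooled), 2))  # 2.16
```

The pooled odds ratio of about 2.16 is a pure artefact of stratification: within each group the odds ratio is exactly 1.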

Approaches to detecting stratification using genome-wide SNP data
Genomic control: an average correction factor for test statistics, the ratio of the median observed chi-square to the median expected under the null (about 0.455 for 1 df, often quoted as 0.456).
Clustering approaches: assign individuals to groups; model-based and distance-based.
Principal components analysis, multidimensional scaling: continuous indices of ancestry.
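
A minimal sketch of the genomic control calculation, using only the Python standard library. It assumes 1-df chi-square test statistics from (approximately) unlinked markers; flooring lambda at 1 before adjusting is a common convention and is an assumption here, not something stated on the slide.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square: the square of the 75th percentile of a standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2  # about 0.455


def genomic_control_lambda(chisq_stats):
    """Ratio of the observed median test statistic to its null expectation."""
    return median(chisq_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chisq_stats):
    """Divide every statistic by lambda (never inflating statistics when lambda < 1)."""
    lam = max(1.0, genomic_control_lambda(chisq_stats))
    return [x / lam for x in chisq_stats]
```

A lambda close to 1 suggests little inflation; values well above 1 suggest stratification or another systematic bias.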

Genomic control
[Figure: χ² statistics at the test locus and at unlinked 'null' markers. Panel 1: no stratification. Panel 2: stratification -> adjust test statistic.]

Structured association
[Figure: LD observed under stratification among unlinked 'null' markers; Subpopulation A, Subpopulation B.]

Discrete subpopulation model
K sub-populations ("latent classes").
Sub-populations vary in allele frequencies.
Random mating within each subpopulation.
Within each subpopulation: Hardy-Weinberg and linkage equilibrium.
For the population as a whole: Hardy-Weinberg and linkage disequilibrium.

Worked example
Look at the Excel spreadsheet ~pshaun/pop-strat.xls
Scenario: two sub-populations, of equal frequency in the total population. We know the allele frequencies for 5 unlinked markers.
Problem: for a given individual with genotypes on these 5 markers, what is the probability of belonging to population 1 versus population 2?
Allele frequencies: [table not reproduced in the transcript]
Steps:
1) Class-specific allele frequencies -> class-specific genotype frequencies (HWE)
2) Single-locus -> multi-locus (5-marker) genotype frequencies (LE), P(G|C)
3) Prior probability of class, P(C). Hint: we are given this above.
4) Bayes theorem to give P(C|G) from P(G|C) and P(C)
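
The four steps of the worked example can be sketched as follows. The allele frequencies used here are hypothetical placeholders, since the original table is not available; genotypes are coded as the number of copies (0, 1 or 2) of allele A1.

```python
from math import prod


def genotype_prob(n_a1, p):
    """Step 1: HWE probability of carrying n_a1 copies of allele A1 with frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[n_a1]


def class_posteriors(genotypes, class_freqs, priors):
    """Steps 2-4: P(C | G) for each class, assuming HWE and LE within each class."""
    joint = [
        prior * prod(genotype_prob(g, p) for g, p in zip(genotypes, freqs))  # P(G|C) P(C)
        for freqs, prior in zip(class_freqs, priors)
    ]
    total = sum(joint)  # P(G), summed over classes
    return [j / total for j in joint]


# Hypothetical A1 frequencies at 5 unlinked markers in populations 1 and 2.
example_freqs = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.9, 0.8, 0.7, 0.6, 0.5],
]
# One individual's genotypes; equal priors as in the scenario.
example_post = class_posteriors([0, 0, 1, 1, 2], example_freqs, [0.5, 0.5])
print(example_post)
```

With these made-up frequencies the individual is assigned to population 1 with very high posterior probability, because the first two markers are strongly differentiated.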

Statistical approaches to uncover hidden population substructure
Goal: assign each individual to class C of K.
Key: conditional independence of genotypes G within classes (LE, HWE).
P(C): prior probabilities
P(G | C): class-specific allele/genotype frequencies
P(C | G): posterior probabilities
Bayes theorem: $P(C_i \mid G) = \dfrac{P(G \mid C_i)\,P(C_i)}{\sum_{j=1}^{K} P(G \mid C_j)\,P(C_j)}$ (sum over j = 1 to K classes)
Problem: in practice, we don't know P(G|C) or P(C) either!
Solution: EM algorithm (LPOP), or Bayesian approaches (STRUCTURE).

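E-M algorithm
Start from initial values of P(C) and P(G | C), then iterate:
E step: Bayes theorem, assuming conditional independence, gives P(C | G).
M step: counting individuals and alleles in classes (weighted by P(C | G)) updates P(C) and P(G | C).
Compute -2LL. Converged? If not, repeat.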
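
A minimal EM sketch for the discrete latent-class model, following the usual convention in which the E step is the Bayes-theorem update and the M step is the posterior-weighted counting. This is an illustration under stated assumptions (biallelic markers, HWE and LE within classes, no missing data, few enough markers that products do not underflow), not the LPOP or STRUCTURE implementation; the simulated data are hypothetical.

```python
import random
from math import log, prod


def genotype_prob(n_a1, p):
    """HWE probability of carrying n_a1 (0, 1 or 2) copies of allele A1 with frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[n_a1]


def em_latent_classes(genotypes, k=2, max_iter=200, tol=1e-8, seed=0):
    """Return (priors, allele freqs, posteriors from the last E step, -2LL)."""
    rng = random.Random(seed)
    n, n_markers = len(genotypes), len(genotypes[0])
    priors = [1.0 / k] * k
    freqs = [[rng.uniform(0.2, 0.8) for _ in range(n_markers)] for _ in range(k)]
    previous = None
    for _ in range(max_iter):
        # E step: P(C | G) for every individual via Bayes theorem.
        post, minus2ll = [], 0.0
        for g in genotypes:
            joint = [priors[c] * prod(genotype_prob(x, p) for x, p in zip(g, freqs[c]))
                     for c in range(k)]
            total = sum(joint)
            minus2ll -= 2.0 * log(total)
            post.append([j / total for j in joint])
        if previous is not None and abs(previous - minus2ll) < tol:
            break  # converged
        previous = minus2ll
        # M step: posterior-weighted counts of individuals and alleles per class.
        for c in range(k):
            weight = sum(row[c] for row in post)
            priors[c] = weight / n
            for m in range(n_markers):
                a1 = sum(row[c] * g[m] for row, g in zip(post, genotypes))
                # Keep frequencies away from 0 and 1 so probabilities stay positive.
                freqs[c][m] = min(max(a1 / (2.0 * weight), 1e-6), 1.0 - 1e-6)
    return priors, freqs, post, minus2ll


# Hypothetical data: 30 individuals from each of two divergent populations, 8 markers.
sim_rng = random.Random(1)


def simulate(marker_freqs, n_individuals):
    """Draw genotypes (A1 counts) under HWE for each marker."""
    return [[sum(sim_rng.random() < p for _ in range(2)) for p in marker_freqs]
            for _ in range(n_individuals)]


data = simulate([0.1] * 8, 30) + simulate([0.9] * 8, 30)
em_priors, em_freqs, em_post, em_minus2ll = em_latent_classes(data)
print([round(p, 2) for p in em_priors])
```

Class labels are arbitrary, so which estimated class corresponds to which simulated population depends on the random starting values.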

Stratification analysis in PLINK
Calculate IBS sharing between all pairs:
  "--genome" command; can take a long time, but can be parallelized easily
  generates a (large) .genome file
  can be used to spot sample duplicates
  also contains IBD estimates: these are only meaningful within a ~homogeneous sample
Given IBS data, perform clustering:
  complete linkage clustering
  can specify various constraints, e.g. PPC test, cluster size (e.g. 1:1 matching) or # of clusters
Given IBS data, perform MDS:
  extract first K components, e.g. 4-6
  plot each component, each pair of components
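
The workflow above might be scripted as in the following sketch, which only builds the command lines and does not run PLINK. The flags shown (--bfile, --genome, --read-genome, --cluster, --mds-plot, --ppc, --out) are taken from the PLINK 1.x documentation as I understand it and should be checked against the installed version; the file prefix "mydata" and the helper function are hypothetical.

```python
def plink_stratification_commands(bfile, n_components=4, ppc_threshold=None):
    """Build argument lists for the IBS and clustering/MDS steps (nothing is executed)."""
    # Step 1: pairwise IBS/IBD estimates, written to <bfile>.genome
    genome_cmd = ["plink", "--bfile", bfile, "--genome", "--out", bfile]
    # Step 2: reuse the .genome file for clustering and MDS
    cluster_cmd = [
        "plink", "--bfile", bfile,
        "--read-genome", bfile + ".genome",
        "--cluster",
        "--mds-plot", str(n_components),
        "--out", bfile,
    ]
    if ppc_threshold is not None:
        # Optional PPC constraint on which pairs may be merged into a cluster
        cluster_cmd += ["--ppc", str(ppc_threshold)]
    return [genome_cmd, cluster_cmd]


for cmd in plink_stratification_commands("mydata", n_components=4, ppc_threshold=0.01):
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run once the flags have been confirmed for the local PLINK build.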

Multidimensional scaling/PCA
Pairwise allele-sharing metric; hierarchical clustering.
[Figure labels: Han Chinese, Japanese, Reference; same population, different population.]
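
One simple pairwise allele-sharing metric is the average proportion of alleles shared identical-by-state (IBS). The sketch below assumes biallelic genotypes coded as allele counts (0, 1, 2) with None for missing calls; it illustrates the idea and is not necessarily the exact metric used for the figure.

```python
def ibs_similarity(g1, g2):
    """Average proportion of alleles shared IBS over markers genotyped in both individuals."""
    shared = 0
    n_markers = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip markers with a missing genotype
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared IBS
        n_markers += 1
    if n_markers == 0:
        raise ValueError("no markers genotyped in both individuals")
    return shared / (2.0 * n_markers)


print(ibs_similarity([0, 1, 2, 2], [0, 1, 0, 1]))  # 0.625
```

One minus this value gives a distance that can be fed to hierarchical clustering or MDS.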

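Distribution of IBS between and within HapMap subpopulations
[Figure: IBS distributions for between-population pairs (YRI/CEU, YRI/CHB, YRI/JPT, CEU/CHB, CEU/JPT, CHB/JPT), within-population pairs (YRI, CEU, CHB, JPT), and CEU (P/O), YRI (P/O).]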

Multidimensional scaling (MDS) analysis
HapMap data (equiv. to PCA)
[Figure: CEPH/European, Yoruba, Han Chinese, Japanese; ~2K SNPs]

[Figure: Han Chinese, Japanese; ~10K SNPs]

PPC (pairwise population concordance) test
For a pair { individual 1 , individual 2 }, count the SNPs with genotypes { AA , BB } and the SNPs with genotypes { AB , AB }.
Expected ratio { AA , BB } : { AB , AB } = 1 : 2 in individuals from the same population.
Significance test of a binomial proportion.
Note: requires the analysis to be of a subset of SNPs in approx. LE within sub-population. Would also be sensitive to inbreeding.
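
A sketch of the binomial-proportion test, assuming a one-sided normal approximation that looks for a deficit of { AB , AB } SNPs relative to the expected proportion of 2/3. That choice is an assumption, although it does reproduce the p-values quoted for the two example pairs in this transcript (counts 3451 : 6927 and 3484 : 6595).

```python
from statistics import NormalDist


def ppc_pvalue(n_aa_bb, n_ab_ab):
    """One-sided p-value for fewer { AB , AB } SNPs than the 1:2 ratio predicts."""
    n = n_aa_bb + n_ab_ab
    expected = 2.0 / 3.0                              # expected proportion of { AB , AB }
    observed = n_ab_ab / n
    se = (expected * (1.0 - expected) / n) ** 0.5     # binomial standard error
    z = (observed - expected) / se
    return NormalDist().cdf(z)                        # small when the ratio falls below 1:2


print(round(ppc_pvalue(3451, 6927), 3))  # 0.569
print(round(ppc_pvalue(3484, 6595), 3))  # 0.004
```

A small p-value indicates a ratio below 1:2, i.e. evidence that the two individuals come from different populations.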

Two example pairs (50K SNPs with 100% genotyping):
Ind1/Ind2   { AA , BB }   { AB , AB }   Ratio       p-value
CHB         3451          6927          1 : 2.007   0.569
JPT         3484          6595          1 : 1.892   0.004
Clusters:
 1  HCB1 HCB8 HCB26 HCB5 HCB15
 2  HCB2 HCB45 HCB12
 3  HCB3 HCB14 HCB32 HCB18 HCB27 HCB23 HCB30
 4  HCB4 HCB38 HCB39 HCB20
 5  HCB6 HCB21 HCB43
 6  HCB7 HCB29 HCB31 HCB11 HCB40 HCB24 HCB33
 7  HCB9 HCB16 HCB22
 8  HCB10 HCB44 HCB19 HCB41 HCB42 HCB35 HCB36
 9  HCB13 HCB17 HCB34 HCB25 HCB28 HCB37
10  JPT1 JPT19 JPT13 JPT16 JPT29 JPT36
11  JPT2 JPT28
12  JPT3 JPT17 JPT38 JPT44 JPT8 JPT23
13  JPT4 JPT18 JPT21 JPT27 JPT41 JPT43
14  JPT5 JPT30 JPT39 JPT42 JPT9
15  JPT6 JPT37 JPT24
16  JPT7 JPT12 JPT10 JPT25 JPT14 JPT26 JPT34 JPT33
17  JPT11 JPT31 JPT40 JPT15 JPT22
18  JPT20
19  JPT32*
20  JPT35
Proportion of all CHB-CHB pairs significant = 0.076
Proportion of all CHB-JPT pairs significant = 0.475
(Power for difference at p=0.05 level)

MDS analysis
It is often useful to treat each MDS component as a QT and perform WGAS (regress it on all SNPs), to ask:
  What is the genomic control lambda? If not >> 1, then the component probably does not represent true, major stratification.
  Which genomic regions load particularly strongly on the component (i.e. which regions show the largest frequency differences between the groups the component is distinguishing)?

Practical example: bipolar GWAS
[Figure: cases and controls from the USA and UK sites.]
Evaluated via permutation that, within site, the average case is as similar to the average control as to another case.

Fine-scale genetic variation reflects geography Novembre et al, Nature (2008)