EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 6: Population stratification Peter Kraft



Population stratification
Confounding due to correlated differences in allele frequencies and disease risks across unobserved subpopulations
Extent and impact vary
–Likely to be negligible when the source population is made up of many subpopulations with small differences in allele freqs and disease risk [e.g. non-Hispanic European Americans]
–More likely to be appreciable when the source population is made up of [or is an admixture of] two subpopulations with larger differences in allele freqs and disease risks [e.g. African Americans, Mexicans, Puerto Ricans]

Classic example: Knowler et al. (1988) AJHG 43:520

When stratified by degree of Indian heritage, no evidence of association between Gm and diabetes was found. The crude association arises because degree of Indian heritage is a confounder.

But differences in allele frequencies and disease rates do not always lead to population stratification bias…

[Figure: bladder cancer incidence and NAT2 genotype frequencies in 8 European populations; panels for men and women. Adapted from Wacholder et al. (2000) JNCI 92:]
Degree of confounding depends on
In special case of Armitage Trend Test

…for bias to result, need correlated differences
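A small numerical sketch may make the point concrete. The counts below are hypothetical (they are not the Knowler or Wacholder data), with no true genotype effect in either subpopulation. The pooled odds ratio is biased only when carrier frequency and disease risk both differ across subpopulations, and a stratified (Mantel-Haenszel) estimate removes the bias.

```python
def crude_odds_ratio(strata):
    """Odds ratio from the 2x2 table pooled over strata.

    Each stratum is (a, b, c, d): carrier cases, carrier non-cases,
    non-carrier cases, non-carrier non-cases.
    """
    a, b, c, d = (sum(s[i] for s in strata) for i in range(4))
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel summary odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts: 1000 subjects per subpopulation, odds ratio 1 within each.
# Scenario 1: carrier frequency (10% vs 60%) AND disease risk (40% vs 10%) differ.
both_differ = [(40, 60, 360, 540), (60, 540, 40, 360)]
# Scenario 2: disease risk differs (40% vs 10%) but carrier frequency is 30% in both.
risk_only = [(120, 180, 280, 420), (30, 270, 70, 630)]

print(crude_odds_ratio(both_differ))            # 0.375
print(mantel_haenszel_odds_ratio(both_differ))  # 1.0
print(crude_odds_ratio(risk_only))              # 1.0
```

In the second scenario the subpopulations differ in disease risk, yet the pooled analysis is unbiased because genotype frequencies do not track that difference.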

Campbell et al. [Nat Genet 2005] found a correlation between alleles in the lactase gene (LCT) and height in a European-American sample. But they also observed strong trends in both LCT allele frequency and height with respect to North-South European grandparental ancestry.
The association between the LCT SNP and height in the total sample was strong (p < 10^-6). It was weakened when the data were stratified by grandparental ancestry, and it disappeared when tested in two independent, ethnically homogeneous studies, one in Poland and one in Sweden (the latter a family-based study).
Voilà, population stratification bias in a European American population. But how common is this? Does it mean we should estimate and adjust for population stratification in studies of U.S. whites?

a) Evidence against population stratification
b) Potential for population stratification, but gradient in allele frequencies should follow gradient in phenotype
What’s missing from this argument?

[Figure: three diagrams relating G, X and D]
–Population structure, no confounding
–No population structure, no confounding
–Population structure, potential confounding
Adjusting for X unnecessary or insufficient; can even reduce power

What to do?
Match on ethnic ancestry
–Self report may not be accurate
–Ethnicity [e.g. “race”] may not be good surrogate for ancestry
–Difficult to match mixed ancestry subjects
Adjust using multiple unlinked markers
–“Structured association”
  Use markers to test for and assign individuals to latent classes
  Most popular software: STRUCTURE (J Pritchard)
–“Genomic control”
  Estimate “test statistic inflation” and adjust accordingly
–Adjust for multiple random, unlinked markers
  Surrogate for genetic variation across subpopulations
  Most popular software: EIGENSTRAT (A Price, N Patterson)
Use family-based controls
–Siblings (conditional logistic)
–Case-parent “pseudocontrols” (TDT, FBAT etc.)
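As one illustration of the family-based option, here is a minimal sketch of the basic TDT statistic in its McNemar form. It compares how often heterozygous parents transmit versus do not transmit the allele of interest to affected offspring. The function name and the counts are hypothetical, and the sketch ignores the refinements implemented in FBAT-type software.

```python
import math


def tdt(transmitted, untransmitted):
    """Basic TDT: (b - c)^2 / (b + c), referred to a chi-square with 1 d.f.

    b = heterozygous parents transmitting the allele to an affected child,
    c = heterozygous parents not transmitting it.
    """
    b, c = transmitted, untransmitted
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square, 1 d.f.
    return stat, p_value


# Hypothetical counts: 60 transmissions vs 40 non-transmissions.
stat, p = tdt(60, 40)
print(stat, round(p, 4))  # 4.0 0.0455
```

Because the comparison is made within families, differences in allele frequency between subpopulations cannot by themselves produce an association.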

Self report of ethnicity is a good surrogate for gross differences in ancestry…
[Figure: clusters based on 326 microsatellites. Am. J. Hum. Genet., 76: , 2005]

Selection of a set of SNPs for population stratification
Starting panels: ILLUMINA 550K, ILLUMINA 317K, AFFYMETRIX 500K
–Remove SNPs with call rate < 90% on either Illumina or Affymetrix platform
–Remove untyped or monomorphic SNPs in YRI or (JPT+CHB)
–Remove SNPs with P-values for HW proportion <
–Select a set of SNPs with local pairwise r² <
Slide courtesy of G Thomas
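The per-SNP quality filters can be sketched as follows. This is an illustration only: the Hardy-Weinberg threshold used on the slide is not recoverable, so `min_hwe_p` below is a hypothetical placeholder, the function names are made up, and the final pruning step on pairwise r² is not shown.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """1-d.f. chi-square test of Hardy-Weinberg proportions (polymorphic SNPs only)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 d.f.


def keep_snp(call_rate, genotype_counts, min_call_rate=0.90, min_hwe_p=1e-3):
    """Apply call-rate, untyped/monomorphic and HWE filters to one SNP.

    min_hwe_p is a hypothetical value, not the threshold from the slide.
    """
    n_aa, n_ab, n_bb = genotype_counts
    n = n_aa + n_ab + n_bb
    if call_rate < min_call_rate:
        return False
    if n_aa == n or n_bb == n:  # untyped (n == 0) or monomorphic
        return False
    return hwe_p_value(n_aa, n_ab, n_bb) >= min_hwe_p


print(keep_snp(0.95, (25, 50, 25)))  # True: well called, in HW proportions
print(keep_snp(0.85, (25, 50, 25)))  # False: call rate below 90%
print(keep_snp(0.99, (100, 0, 0)))   # False: monomorphic
print(keep_snp(0.99, (50, 0, 50)))   # False: no heterozygotes, far from HW proportions
```

In practice each filter would be applied within each reference panel and genotyping platform, as the slide indicates.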

A model of a structured population
Population studied:
-Europe: CEPH founders => 60 individuals, HapMap
-African: YRI founders => 59 individuals, HapMap
-Asian: CHB => 44 individuals, HapMap
-Asian: JPT => 45 individuals, HapMap
-Native American: Mexican => 30 individuals, Penn State U.*
-Native American: Mayan => 25 individuals, Penn State U.*
-African Americans => 15 individuals, Penn State U.*
-"Latino" => 7 individuals, SNP500
Total of 285 individuals
* Courtesy of X. Mao, E. Parra and M. Shriver
Slide courtesy of G Thomas

[Figure: principal component plots (first vs. second and first vs. third components) for CEU; CHB, JPT; YRI; Latino; Native American; and African American samples]
Slide courtesy of G Thomas

But cannot capture within-ethnicity variation…
[Figure: plot of 1st and 2nd principal components of variation for ca. 10,000 self-described European(-descended) subjects in the CGEMS prostate cancer GWAS. ATBC = Finns]

Structured Association
Genotype multiple unlinked, anonymous markers
–Very unlikely to be (or be near) causal loci
–Best to choose ancestry informative markers (AIMs)
  Known for African, European, Native American populations
  Not known to distinguish among European populations
To test for strat’n, sum disease-marker chi-squares
–This sum has d.f. = sum of individual tests’ d.f.
Use clustering algorithm to estimate structure
–STRUCTURE, ADMIXMAP based on pop’n genetics models
  STRUCTURE does not assume allele freqs in ancestral pop’ns known
  ADMIXMAP does
–Use estimated admixture as covariates or matching vars
Pritchard & Rosenberg (1999) AJHG 65:
Pritchard et al. (2000) Genetics 155:

Toy example
150 subjects from Pop’n 1
–Disease incidence 15%
–150 markers with allele freqs ~ Beta(1,10)
–Allele freqs, markers independent of disease
150 subjects from Pop’n 2
–Disease incidence 30%
–150 markers with allele freqs ~ Beta(1,10)
–Allele freqs, markers independent of disease
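The toy setup can be simulated in a few lines, together with the sum-of-chi-squares test for stratification. This is a sketch under stated assumptions: allele frequencies are drawn independently in each population, each marker is tested with a simple allelic 2x2 chi-square (the slides do not say which per-marker test was used), and the p-value uses the Wilson-Hilferty normal approximation. The seed and function names are arbitrary, so the numbers will not reproduce those on the slides.

```python
import math
import random


def simulate_population(rng, n_subjects, incidence, n_markers):
    """Disease flags and 0/1/2 genotypes for one subpopulation (markers are null)."""
    freqs = [rng.betavariate(1, 10) for _ in range(n_markers)]
    disease = [rng.random() < incidence for _ in range(n_subjects)]
    genotypes = [[(rng.random() < f) + (rng.random() < f) for f in freqs]
                 for _ in range(n_subjects)]
    return disease, genotypes


def allelic_chi_square(disease, genotypes):
    """Pearson chi-square (1 d.f.) on the allele-by-case-status 2x2 table."""
    case_alt = sum(g for d, g in zip(disease, genotypes) if d)
    ctrl_alt = sum(g for d, g in zip(disease, genotypes) if not d)
    n_case = 2 * sum(disease)
    n_ctrl = 2 * len(disease) - n_case
    case_ref, ctrl_ref = n_case - case_alt, n_ctrl - ctrl_alt
    margins = n_case * n_ctrl * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    if margins == 0:
        return None  # monomorphic marker, or no cases / no controls
    n = n_case + n_ctrl
    return n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / margins


def chi_square_upper_tail(x, df):
    """Approximate P(chi-square_df > x) via the Wilson-Hilferty transformation."""
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))


rng = random.Random(293)  # arbitrary seed
d1, g1 = simulate_population(rng, 150, 0.15, 150)
d2, g2 = simulate_population(rng, 150, 0.30, 150)
disease, genotypes = d1 + d2, g1 + g2  # pooled sample, subpopulation labels ignored

stats = []
for m in range(150):
    x = allelic_chi_square(disease, [row[m] for row in genotypes])
    if x is not None:
        stats.append(x)

total, df = sum(stats), len(stats)
print(f"sum of chi-squares = {total:.1f} on {df} d.f., "
      f"p = {chi_square_upper_tail(total, df):.3f}")
```

Rerunning with different seeds gives a feel for how variable this test is with only 150 markers.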

Pop’n stratification appears to inflate Type I error rate…
Adjustment = appropriate stratified analysis
… but test for stratification: sum of chi-squares on 150 d.f. gives p = .065

Still, there is strong evidence that there are two distinct subpopulations.

Admixture proportions for subjects from 1st and 2nd stages of CGEMS prostate scan, based on that same panel of 10,000 markers (Thomas et al, submitted)
In practice, STRUCTURE is applied to a “spiked” data set (your data plus three HapMap samples) to detect gross outliers or data handling errors

Drawbacks to structured association
–Computationally intensive
–Markers should be unlinked
–Model-based
–User has to specify number of ancestral populations

Genomic control
For modest pop’n stratification, test stat X dist’n is roughly λχ²
So why not estimate λ to get X* = X/λ?
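A minimal sketch of the adjustment, assuming 1-d.f. test statistics at markers believed to be null. The mean-based estimate divides the sum of chi-squares by the number of markers; the median-based variant is a common, more outlier-robust alternative. Truncating λ at 1 is a widely used convention adopted here as an assumption, and the function names and input statistics are hypothetical.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a central chi-square with 1 d.f.


def inflation_factor(null_stats, use_median=False):
    """Estimate lambda from 1-d.f. chi-square statistics at null markers."""
    if use_median:
        return statistics.median(null_stats) / CHI2_1DF_MEDIAN
    return sum(null_stats) / len(null_stats)  # mean of a chi-square (1 d.f.) is 1


def genomic_control(stat, lam):
    """Return the adjusted statistic X/lambda and its 1-d.f. p-value."""
    adjusted = stat / max(lam, 1.0)  # assumption: never inflate the statistic
    return adjusted, math.erfc(math.sqrt(adjusted / 2))


# Hypothetical null-marker statistics with mean 1.18.
null_stats = [0.2, 0.9, 1.5, 2.12]
lam = inflation_factor(null_stats)
adjusted, p = genomic_control(3.84, lam)
print(round(lam, 2), round(adjusted, 2), round(p, 3))
```

A statistic that is nominally significant at the 0.05 level before adjustment (3.84) no longer is once it has been deflated by λ.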

 = /150 = 1.18

How many markers needed? Nature Genetics 36, (2004)

Likelihoods for inflation factors for studies with 1000 cases and controls Nature Genetics 36, (2004)

Nature Genetics 36, (2004)

But setting practical problems aside, is genomic control the right thing to do?
Cryptic relatedness
–Under null, λ⁻¹X² is distributed as χ²₁(0), where λ = 1+(a₁₁−a₀₁)² 2fN/(1+f)
–Genomic control corrects this kind of distortion…
Population stratification bias
–Under null, Armitage trend test X² is distributed as χ²₁(ξ), where ξ = NΔ²/(2σ²)
–… but not this
A Whittemore, unpublished MS

Adjusting for many random markers
Unlike genomic control:
–Doesn’t penalize the innocent for the sins of the guilty
–Does a better job penalizing the guilty
Variants:
–Price: use Principal Components to summarize many markers
  Can use clever computational trick (Tracy-Widom statistic: Patterson 2006 PLoS Genet) to decide how many components, or just eyeball PC plots
  Adjust for these structure-related PCs
–Wang/Balding: adjust for SNPs in non-candidate genes
–Epstein & Satten: Wang/Balding meets propensity score
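The principal-components idea can be sketched without any special software. The code below is not EIGENSTRAT: it skips the per-marker variance normalization and the Tracy-Widom test, and it extracts only the leading component, by power iteration on the subject-by-subject Gram matrix. The two simulated subpopulations have deliberately exaggerated allele frequency differences (0.1 vs 0.9, a hypothetical choice) so that the separation is unmistakable.

```python
import math
import random


def first_principal_component(genotypes, n_iter=100):
    """Leading PC scores for subjects from an n_subjects x n_markers 0/1/2 matrix."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each marker (column) at its sample mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Subject-by-subject Gram matrix of the centered genotypes.
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centered]
            for ri in centered]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


rng = random.Random(6)  # arbitrary seed


def draw(freq, n_subjects, n_markers):
    """Simulate 0/1/2 genotypes with the same allele frequency at every marker."""
    return [[(rng.random() < freq) + (rng.random() < freq)
             for _ in range(n_markers)] for _ in range(n_subjects)]


# Hypothetical data: 20 subjects from each of two strongly differentiated groups.
genotypes = draw(0.1, 20, 50) + draw(0.9, 20, 50)
scores = first_principal_component(genotypes)
print([round(s, 2) for s in scores[:3]], [round(s, 2) for s in scores[-3:]])
```

The scores for the two groups fall on opposite sides of zero, so including the score as a covariate in the disease-marker regression adjusts for subpopulation membership without it ever having been recorded.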

Clear Population Stratification Bias
[Figure: Q-Q plot of -log10 p-values for NHS Hair Color Scan. Black line: unadjusted (λ=1.24). Red line: adjusted for top four PCs (λ=1.02)]

No Clear Population Stratification Bias Q-Q plot for Prostate Cancer Scan

CGEMS prostate cancer example
Not a surprise?
–Empiric evidence of subtle genetic differences across regions even in U.S. self-described whites is mounting…
–…and there is some evidence of variation in prostate cancer rates across regions…
–… but (a) the latter pattern is complex and its causes are unclear, and (b) the chance that the two patterns would coincide is small.

Caveats: Not a Foolproof Panacea
Rule of thumb: need at least 1,000 markers
–Many more are better!
Linked markers can distort PCs
Will not rescue poor design

Prostate Cancer - Population Structure BPC3 12 Sub-cohorts

7 White Sub-cohorts

1 Japanese Sub-cohort 1 Hawaiian Sub-cohort 1 Latino Sub-cohort 2 African American Sub-cohorts

Tough to fix using naïve application of EIGENSTRAT (better to match cases and controls on inferred ancestry, cf. PLINK IBD matching or K Roeder [in preparation?])
Red = cases, black = controls

Pop’n strat’n bias: to recap
A concern for recently admixed populations
Less of a concern for U.S. non-Hispanic Europeans
–Still, with large sample sizes small effects will be detected
–May affect many markers across genome
Good study design can avoid worst bias
Genomic control may help
–Difficult to calibrate for small p-value thresholds
–Can be too conservative or too anti-conservative, depending on (unknown) degree of pop’n strat’n
Structured association intuitive and effective
–But performance greatly enhanced by use of AIMs…
–…in absence of AIMs, degree of stratification overestimated

References
Pritchard & Rosenberg (1999) AJHG 65: [testing for/estimating structure]
Pritchard et al. (2000) Genetics 155: [testing for/estimating structure]
Devlin & Roeder (1999) Biometrics 55: [genomic control]
Bacanu et al. (2000) AJHG 66: [genomic control]
Wacholder et al. (2000) JNCI 92(14): [extent of pop’n strat’n]
Reich et al. (2001) Genet Epidemiol 20:4-16 [genomic control]
Thomas & Witte (2002) CEBP 11: [extent of pop’n strat’n]
Wacholder et al. (2002) CEBP 11: [extent of pop’n strat’n]
Freedman et al. (2004) Nat Genet 36: [genomic control]
Marchini et al. (2004) Nat Genet 36: [extent of pop’n strat’n, genomic control]
Tang et al. (2005) AJHG 76: [self-reported ethnicity and genetic structure]