Population genetics and whole-genome disease association studies. Alkes L. Price, Harvard Medical School & Broad Institute of MIT and Harvard. April 5, 2007.

Presentation transcript:

Outline
1. Introduction to population genetics
2. Whole-genome association studies (WGAS)
3. Applications of population genetics to WGAS:
   i. Linkage disequilibrium and haplotypes
   ii. Population stratification
   iii. Admixture association signals

What is population genetics? The study of how genetic variation is distributed within and across populations.

Are different human populations actually genetically different?
Slightly. 5-7% of worldwide human genetic variation is due to genetic differences between human populations. The remaining 93-95% of human genetic variation is due to differences within human populations (Rosenberg et al. 2002: Science 298).

What about hair / skin / eye color? These are exceptions, due to natural selection.

Why care about population differences?
  Use genetic data to decipher ancient history.
  Relevance to disease association studies.

International HapMap Project
HapMap genotyped 270 samples: Utah samples of N. European ancestry (CEU), Han Chinese (CHB), Japanese (JPT), and Yoruban samples from Nigeria (YRI).

International HapMap Project
HapMap genotyped the CEU, CHB, JPT, and YRI samples at 3.8 million single nucleotide polymorphisms (SNPs) (HapMap 2005: Nature 437).
How to quantify genetic differences between populations? Define the F_ST between two populations to be the proportion of overall variation attributable to differences between populations (Cavalli-Sforza et al. 1994: The History and Geography of Human Genes).

It follows from this definition of F_ST that the difference in allele frequency between the two populations, at a SNP with overall frequency p, has variance 2 F_ST p(1 - p).
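To make the variance formula concrete, here is a minimal Python sketch that turns an F_ST value and an overall allele frequency into a typical (one standard deviation) frequency difference. The function name and the input values are illustrative assumptions; they are not taken from the slides' F_ST table.

```python
import math

def freq_diff_sd(fst, p):
    """Standard deviation of the allele-frequency difference between two
    populations, from the variance formula 2 * F_ST * p * (1 - p)."""
    return math.sqrt(2.0 * fst * p * (1.0 - p))

# Illustrative inputs only (assumed, not read from the HapMap table).
for fst in (0.01, 0.10, 0.20):
    sd = freq_diff_sd(fst, 0.5)
    print(f"F_ST = {fst:.2f}, p = 0.50 -> typical difference of about {sd:.3f}")
```

The difference grows with the square root of F_ST, so even a small F_ST still implies frequency differences of several percentage points at common SNPs.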

International HapMap Project: F_ST values
[Table: pairwise F_ST values among CEU, CHB, JPT, and YRI. The only entry that survives in this transcript is 0.19, in the JPT row.]
e.g. p_CHB = 50%, p_YRI = 77%; p_CHB = 50%, p_JPT = 56%

PCA results on HapMap data

Discrete clusters or continuous axes?
“We identified six main genetic clusters of human populations” (Rosenberg et al. 2002: Science 298).
“Gradual variation, rather than major genetic discontinuities or ‘races’, is typical of global human genetic diversity” (Serre and Pääbo 2004: Genome Res 14).
Also see Rosenberg et al. 2005: PLoS Genet 1, e70.

Whole-genome association studies (WGAS)
Step 1. Obtain DNA samples from 1000 individuals with a specific disease (Cases) and 1000 healthy individuals (Controls).
Step 2. Genotype the 1000 Cases and 1000 Controls at 100,000 – 500,000 SNPs.
Step 3. Look for a SNP with significantly different frequency in Cases vs. Controls.
(Hirschhorn & Daly 2005: Nat Rev Genet 6)
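Step 3 can be sketched for a single SNP as a chi-square test on allele counts. This is a minimal illustration, assuming a simple allelic 2x2 test rather than whichever test a given study uses; the function name and the counts are hypothetical, and the code assumes no row or column of the table is empty.

```python
import math

def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.
    *_alt = copies of one allele; *_total = all chromosomes (2 per person)."""
    table = [
        [case_alt, case_total - case_alt],
        [ctrl_alt, ctrl_total - ctrl_alt],
    ]
    n = case_total + ctrl_total
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical SNP: 1000 cases and 1000 controls, i.e. 2000 chromosomes each,
# with allele frequency 35% in cases and 30% in controls.
chi2, p_value = allelic_chi2(700, 2000, 600, 2000)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

A p-value like this would look striking for one SNP in isolation, but it has to be judged against the number of SNPs tested in Step 2.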

Common Disease/Common Variants hypothesis
The Common Disease/Common Variants hypothesis suggests that genetic risk for common diseases arises from a large number (e.g. up to 10 or more) of common variants (e.g. SNPs with frequency 10-90%) which each confer modest disease risk (e.g. 1.5x larger risk of disease per copy of unfavorable allele) (Reich & Lander 2001: Trends Genet 17).
WGAS are aimed at detecting common variants (Hirschhorn & Daly 2005: Nat Rev Genet 6).

Successes of WGAS
WGAS have identified risk variants for:
  Age-related Macular Degeneration (Klein et al. 2005: Science 308, 385-9)
  Obesity (Herbert et al. 2006: Science 312)
  Inflammatory Bowel Disease (Duerr et al. 2006: Science 314)
  Type 2 diabetes (Sladek et al. 2007: Nature 445)

Advantages/disadvantages of WGAS
Advantages:
  Effective for common variants of modest risk
  No prior knowledge of disease pathways required
  Fine localization of disease variant
Disadvantages:
  Large number of hypotheses tested reduces power
  High cost
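The cost in power from testing many hypotheses can be illustrated with a per-SNP significance threshold. The sketch below assumes a Bonferroni correction at a study-wide level of 0.05; the slides do not name the correction method, so both choices are assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the study-wide error rate at alpha."""
    return alpha / n_tests

# Chip sizes taken from the range quoted for WGAS (100,000 to 500,000 SNPs).
for n_snps in (100_000, 500_000):
    print(f"{n_snps} SNPs -> per-SNP threshold {bonferroni_threshold(0.05, n_snps):.1e}")
```

A SNP must therefore reach a p-value several orders of magnitude below 0.05 before it counts as significant, which is why modest effects need large samples.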

Cost of WGAS Affymetrix 500K and Illumina 300K technologies: genotype hundreds of thousands of SNPs at a cost of about $500 per sample. Thus, a WGAS with 1000 Cases and 1000 Controls will incur about $1 million in genotyping costs.

Whole-genome association studies (WGAS)
Step 1. Obtain DNA samples from 1000 individuals with a specific disease (Cases) and 1000 healthy individuals (Controls).
Step 2. Genotype the 1000 Cases and 1000 Controls at 100,000 – 500,000 of 10 million SNPs total.
Step 3. Look for a SNP with significantly different frequency in Cases vs. Controls.
(Hirschhorn & Daly 2005: Nat Rev Genet 6)

LD and haplotypes: Recombination
[Figure labels: Mother, Father, Child]

LD and haplotypes: Recombination
[Figure labels: Population at time 0; Many generations later]

Linkage disequilibrium and haplotypes
[Figure: haplotypes, with SNP #1 (alleles A / G) and SNP #2 (alleles C / G) marked]
SNP #1 and SNP #2 are perfect proxies (perfect LD). The r^2 between SNP #1 and SNP #2 is 100%.
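The r^2 measure can be computed directly from haplotype counts. The following minimal sketch assumes phased two-SNP haplotypes and uses the alleles named on the slide (A / G at SNP #1, C / G at SNP #2); the haplotype counts are invented for illustration, and both SNPs are assumed to be polymorphic in the sample.

```python
def r_squared(haplotypes, a1, b1):
    """r^2 between two SNPs from phased two-SNP haplotypes.
    haplotypes: list of (allele at SNP #1, allele at SNP #2)
    a1, b1: the allele tracked at SNP #1 and at SNP #2."""
    n = len(haplotypes)
    p_a = sum(1 for a, _ in haplotypes if a == a1) / n
    p_b = sum(1 for _, b in haplotypes if b == b1) / n
    p_ab = sum(1 for a, b in haplotypes if a == a1 and b == b1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Perfect LD: A always travels with C, and G with G.
perfect = [("A", "C")] * 6 + [("G", "G")] * 4
# Imperfect LD: one recombinant A-G haplotype breaks the pattern.
imperfect = [("A", "C")] * 5 + [("A", "G")] * 1 + [("G", "G")] * 4

r2_perfect = r_squared(perfect, "A", "C")
r2_imperfect = r_squared(imperfect, "A", "C")
print(r2_perfect, r2_imperfect)
```

In the perfect case, knowing the allele at SNP #1 determines the allele at SNP #2, so genotyping either one is enough.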

Linkage disequilibrium and haplotypes
[Figure: haplotypes, with SNP #1 (alleles A / G) marked]
More generally, SNP #1 might be an imperfect proxy (imperfect LD) for all SNPs within 10,000 bp.
WGAS: choose a subset of […],000 tag SNPs so that all SNPs are in strong LD (r^2 > 0.8) with a tag SNP.
Haplotype association mapping: the causal SNP itself does not need to be genotyped.
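One simple way to pick tag SNPs under the r^2 > 0.8 criterion is a greedy set cover, sketched below. This is an assumption made for illustration, not the procedure used to design any particular genotyping chip; the SNP names and r^2 values are hypothetical.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Greedy set cover: repeatedly pick the SNP that tags the most untagged SNPs.
    r2[(a, b)] holds pairwise r^2 with a < b; missing pairs count as 0."""
    def ld(a, b):
        if a == b:
            return 1.0  # every SNP tags itself
        return r2.get((min(a, b), max(a, b)), 0.0)

    untagged = set(snps)
    tags = []
    while untagged:
        best = max(snps, key=lambda s: sum(1 for t in untagged if ld(s, t) > threshold))
        tags.append(best)
        untagged -= {t for t in untagged if ld(best, t) > threshold}
    return tags

# Hypothetical region with five SNPs and a few pairwise r^2 values.
snps = ["s1", "s2", "s3", "s4", "s5"]
r2 = {
    ("s1", "s2"): 0.95,
    ("s2", "s3"): 0.90,
    ("s1", "s3"): 0.70,
    ("s4", "s5"): 0.85,
}
tags = greedy_tag_snps(snps, r2)
print(tags)
```

Here two tag SNPs cover all five SNPs, which is the saving that makes genotyping a subset of the genome's SNPs worthwhile.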

Affymetrix 500K and Illumina 300K
Proportion of HapMap SNPs which are well tagged (r^2 > 0.8) by at least one of the tag SNPs in Affymetrix 500K or Illumina 300K, respectively:

              CEU    CHB+JPT    YRI
Affy 500K     65%    66%        41%
Illum 300K    75%    63%        28%

(Barrett & Cardon 2006: Nat Genet 38)

Population differences in extent of LD

Population         Extent of LD     History
West African       10,000 bp        no bottleneck
European           50,000 bp        out of Africa 50 kya
East Asian         50,000 bp        out of Africa 50 kya
Native American    >100,000 bp      Bering Strait 15 kya
Kosrae             >>100,000 bp     island settled 2 kya

Reich et al. 2001: Nature 411; Conrad et al. 2006: Nat Genet 38
Bonnen et al. 2006: Nat Genet 38; also see Service et al. 2006: Nat Genet 38

Future challenges
SNPs that are not in strong LD (r^2 > 0.8) with any of the tag SNPs in Affymetrix 500K (or Illumina 300K) may still be well captured using pairs of tag SNPs, or more generally, sets of n tag SNPs for some value of n (de Bakker et al. 2005: Nat Genet 37; also see Zaitlen et al. 2007: Am J Hum Genet 80).
However, the increased number of hypotheses tested may reduce power rather than increase it (Pe’er et al. 2006: Nat Genet 38, 663-7).
Related approach: impute all HapMap SNPs and then carry out WGAS using those imputed SNPs.

Outline
1. Introduction to population genetics
2. An unsolved problem in population genetics
3. Whole-genome association studies (WGAS)
4. Applications of population genetics to WGAS:
   i. Linkage disequilibrium and haplotypes
   ii. Population stratification
   iii. Admixture association signals

Whole-genome association studies

Phenotype    Ancestry     SNP
case         N. Europe    T
control      S. Europe    C

Association between phenotype and SNP: ???
Stratification: spurious associations due to ancestry differences between cases and controls.

Height association study in European Americans (Campbell et al. 2005: Nat Genet 37)

Phenotype    Ancestry     Lactase SNP (chr 2)
tall         N. Europe    T
short        S. Europe    C

Population stratification: spurious association due to stratification!

EIGENSTRAT: use PCA to correct for stratification
1. Apply principal components analysis to infer continuous axes of genetic variation.
Cavalli-Sforza et al. 1994 book; Cavalli-Sforza et al. 1993: Science 259; Patterson et al. 2006: PLoS Genet 2, e190; Price et al. 2006: Nat Genet 38, 904-9
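As a rough illustration of step 1, the sketch below finds the top axis of variation of a small genotype matrix by power iteration in plain Python. It is not the EIGENSTRAT implementation: the normalisation by sqrt(p(1-p)), the iteration count, the function name, and the toy genotypes are all assumptions, and the code assumes at least one SNP is polymorphic.

```python
import math

def top_axis(genotypes, n_iter=200):
    """Top principal component (one coordinate per sample) of a
    samples x SNPs matrix of 0/1/2 genotype counts, by power iteration."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Normalise each SNP: subtract its mean, divide by sqrt(p(1-p)).
    cols = []
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        p = mean / 2.0
        if p <= 0.0 or p >= 1.0:
            continue  # a monomorphic SNP carries no information
        scale = math.sqrt(p * (1.0 - p))
        cols.append([(g - mean) / scale for g in col])
    # Sample-by-sample matrix X X^T built from the normalised genotypes.
    cov = [[sum(c[i] * c[k] for c in cols) for k in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary start vector
    for _ in range(n_iter):
        w = [sum(cov[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: samples 0-2 and samples 3-5 come from two hypothetical populations.
toy = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 1, 1],
    [1, 2, 0, 0, 2],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 1, 1],
    [1, 0, 2, 2, 0],
]
axis = top_axis(toy)
print([round(x, 3) for x in axis])
```

The overall sign of the axis is arbitrary; what matters is that the two groups of samples fall on opposite sides of zero.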

EIGENSTRAT: use PCA to correct for stratification
1. Apply principal components analysis to infer continuous axes of genetic variation.
2. For each inferred axis, subtract from each genotype and each phenotype the amount attributable to ancestry along that axis.
3. Evaluate association between ancestry-adjusted genotypes and ancestry-adjusted phenotypes, using the Armitage trend test.
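Steps 2 and 3 can be sketched for one SNP as follows, taking the ancestry axis as already inferred. This is a minimal illustration rather than the EIGENSTRAT code: the statistic is assumed here to be (N - K - 1) times the squared correlation between adjusted genotype and adjusted phenotype, and the function names and toy data are hypothetical.

```python
def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def adjust(x, axes):
    """Remove from x the part predicted by each ancestry axis. Axes are assumed
    mutually orthogonal after centring, as principal components are."""
    resid = center(x)
    for axis in axes:
        a = center(axis)
        gamma = sum(ai * ri for ai, ri in zip(a, resid)) / sum(ai * ai for ai in a)
        resid = [ri - gamma * ai for ri, ai in zip(resid, a)]
    return resid

def trend_statistic(genotype, phenotype, axes=()):
    """(N - K - 1) * squared correlation of adjusted genotype and phenotype,
    where K is the number of ancestry axes (assumed form of the statistic)."""
    g = adjust(genotype, axes)
    y = adjust(phenotype, axes)
    denom = sum(v * v for v in g) * sum(v * v for v in y)
    if denom == 0.0:
        return 0.0
    r2 = sum(gi * yi for gi, yi in zip(g, y)) ** 2 / denom
    return (len(g) - len(axes) - 1) * r2

# Toy data: allele frequency and disease prevalence both differ between two
# ancestry groups, but genotype and phenotype are unrelated within each group.
genotype = [2, 2, 2, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 0, 0] * 10
phenotype = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0] * 10
ancestry = ([1] * 8 + [-1] * 8) * 10

unadjusted = trend_statistic(genotype, phenotype)
adjusted = trend_statistic(genotype, phenotype, axes=[ancestry])
print(unadjusted, adjusted)
```

In this toy example the unadjusted statistic is large purely because of ancestry, and it drops to zero once both genotype and phenotype are adjusted along the ancestry axis.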

Toy Example

Example of axis of variation
[Figure legend: +, 0, −]
Cavalli-Sforza et al. 1994 book; Cavalli-Sforza et al. 1993: Science 259

European American population structure: What’s inside the melting pot? ???

European American data set Brigham Rheumatoid Arthritis Sequential Study (BRASS): 488 European American samples with rheumatoid arthritis, genotyped on a 100K Affy chip (116,204 SNPs).

Results: top two axes of variation
[Figure labels: NW Europe, SE Europe]

Lactase persistence association study

Lactase persistent?    Ancestry     SNP
Yes                    N. Europe    T
No                     S. Europe    C

Association between lactase persistence and SNP: ???
Lactase persistence is inferred from the LCT gene on chr 2, which is known to perfectly predict lactase persistence (Enattah et al. 2002).
Many associated SNPs near LCT gene on chr 2.

Lactase persistence association study

Persistent?    Ancestry     SNP on chr 3 (rs[…])
Yes            N. Europe    G
No             S. Europe    A

P-value = […] (after correcting for 116,204 hypotheses tested) ?!?
Spurious association due to stratification!

Lactase persistence association study: correcting for stratification

Persistent?    Ancestry     SNP on chr 3 (rs[…])
Yes            N. Europe    G
No             S. Europe    A

Correcting for stratification (and for 116,204 hypotheses tested):
  Genomic Control P-value = […]
  EIGENSTRAT P-value = […]

Future challenges
Given genetic data (e.g. SNP data) from a set of samples of unknown ancestry: what is the best way to describe the “population structure” in the data, i.e. departures from the panmictic model of a single randomly mating population?
  Principal Components Analysis
  STRUCTURE model-based clustering program (Pritchard et al. 2000: Genetics 155; Falush et al. 2003: Genetics 164)
These methods both fail on HapMap data.

PCA results on HapMap data

The problem: none of the principal components is able to distinguish CHB from JPT, even when looking at lower principal components.

PCA results on CHB and JPT only

The problem: discernment between CHB and JPT requires analyzing CHB+JPT populations separately.

But what if population structure is continuous?

Admixture association references
Methodology and ANCESTRYMAP program: Patterson et al. 2004: Am J Hum Genet 74
Admixture mapping in African Americans: Smith et al. 2004: Am J Hum Genet 74
Successes of admixture mapping in African Americans: Reich et al. 2005: Nat Genet 37; Freedman et al. 2006: PNAS 103
Admixture mapping in Latino populations: Price et al. 2007: Am J Hum Genet, in press

Latino admixture creates a mosaic
[Figure: European chromosomes and Native American chromosomes shown 4, 3, 2, and 1 generations ago; mosaic European + Native American chromosomes today]

How does Latino admixture mapping work?
[Figure legend: European chromosome; Native American chromosome; disease locus; cases with disease]

The Signal of Latino Admixture Association
[Figure: % Native American ancestry (0% to 100%) against position on chromosome (cM)]
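The kind of signal plotted here can be sketched as a case-only scan: average the local ancestry of cases at each marker and compare it with their genome-wide average. This is a deliberately simplified illustration, not the ANCESTRYMAP method; it assumes local ancestry has already been estimated, and the function name and data are hypothetical.

```python
def ancestry_excess(local_ancestry):
    """local_ancestry[i][j]: estimated Native American ancestry proportion
    (0, 0.5 or 1) of case i at marker j. Returns, for each marker, the mean
    across cases minus the mean over all markers."""
    n_cases = len(local_ancestry)
    n_markers = len(local_ancestry[0])
    per_marker = [
        sum(row[j] for row in local_ancestry) / n_cases for j in range(n_markers)
    ]
    genome_wide = sum(per_marker) / n_markers
    return [m - genome_wide for m in per_marker]

# Hypothetical local ancestry for 4 cases at 5 markers along a chromosome.
cases = [
    [0.5, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.5],
    [0.5, 0.0, 0.5, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.0, 0.5],
]
excess = ancestry_excess(cases)
peak = max(range(len(excess)), key=lambda j: excess[j])
print(peak, [round(e, 3) for e in excess])
```

A marker where cases carry clearly more (or less) ancestry from one population than they do on average is a candidate location for a disease locus; a real analysis would attach a proper test statistic to that deviation.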

Admixture association: future challenges How to best integrate haplotype association and admixture association signals in a WGAS of an admixed population?

Acknowledgements Nick Patterson, Robert Plenge, Michael Weinblatt, Nancy Shadick, Fuli Yu, David Cox, Alicja Waliszewska, Gavin McDonald, Arti Tandon, Christine Schirmer, Julie Neubauer, Gabriel Bedoya, Constanza Duque, Alberto Villegas, Maria Catira Bortolini, Francisco Salzano, Carla Gallo, Guido Mazzotti, Marcela Tello-Ruiz, Laura Riba, Carlos Aguilar-Salinas, Samuel Canizales-Quinteros, Marta Menjivar, William Klitz, Brian Henderson, Chris Haiman, Cheryl Winkler, Teresa Tusie-Luna, Andres Ruiz-Linares, and David Reich