Population genetics and whole-genome disease association studies. Alkes L. Price, Harvard Medical School & Broad Institute of MIT and Harvard. April 5, 2007.

Outline
1. Introduction to population genetics
2. Whole-genome association studies (WGAS)
3. Applications of population genetics to WGAS:
   i. Linkage disequilibrium and haplotypes
   ii. Population stratification
   iii. Admixture association signals

What is population genetics? The study of how genetic variation is distributed within and across populations.

Are different human populations actually genetically different? Slightly. 5-7% of worldwide human genetic variation is due to genetic differences between human populations. The remaining 93-95% of human genetic variation is due to differences within human populations (Rosenberg et al. 2002: Science 298, 2381-5).

What about hair / skin / eye color? These are exceptions, due to natural selection.

Why care about population differences? Genetic data can be used to decipher ancient history, and population differences are relevant to disease association studies.

International HapMap Project. HapMap genotyped 270 samples:
  Utah samples of N. European ancestry (CEU)
  Han Chinese (CHB)
  Japanese (JPT)
  Yoruban samples from Nigeria (YRI)

International HapMap Project. HapMap genotyped the CEU, CHB, JPT and YRI samples at 3.8 million single nucleotide polymorphisms (SNPs) (HapMap 2005: Nature 437, 1299-1320). How to quantify genetic differences between populations? Define the F_ST between two populations to be the proportion of overall variation attributable to differences between populations (Cavalli-Sforza et al. 1994: The History and Geography of Human Genes).

International HapMap Project. With F_ST defined as the proportion of overall variation attributable to differences between populations, it follows that the difference in frequency between the two populations at a SNP with overall frequency p has variance 2 F_ST p(1-p).

International HapMap Project: F_ST values
        CEU     CHB     JPT     YRI
CEU             0.11            0.16
CHB                     0.007   0.19
JPT                             0.19
YRI

International HapMap Project: F_ST values, e.g.
  p_CHB = 50%, p_YRI = 77%
  p_CHB = 50%, p_JPT = 56%
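The example frequencies above can be checked against the variance formula 2 F_ST p(1-p) by taking a square root, which gives a typical size for the frequency difference. This is a minimal sketch only: the function name `typical_freq_diff` is hypothetical, and the overall frequency p = 0.5 is an assumption made for illustration.

```python
import math

def typical_freq_diff(fst: float, p: float) -> float:
    """Standard deviation of the allele-frequency difference between two
    populations, taken from the variance 2 * F_ST * p * (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a frequency between 0 and 1")
    return math.sqrt(2.0 * fst * p * (1.0 - p))

# F_ST values as given in the talk; p = 0.5 is an assumed overall frequency.
sd_chb_yri = typical_freq_diff(0.19, 0.5)    # roughly 0.31
sd_chb_jpt = typical_freq_diff(0.007, 0.5)   # roughly 0.06
```

Under these assumptions a typical CHB vs. YRI difference is a few tens of percentage points and a typical CHB vs. JPT difference is about 6, the same order as the example frequencies.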

PCA results on HapMap data

Discrete clusters or continuous axes?
“We identified six main genetic clusters of human populations” (Rosenberg et al. 2002: Science 298, 2381-5).
“Gradual variation, rather than major genetic discontinuities or ‘races’, is typical of global human genetic diversity” (Serre and Pääbo 2004: Genome Res 14, 1679-85).
Also see Rosenberg et al. 2005: PLoS Genet 1, e70.

Whole-genome association studies (WGAS)
Step 1. Obtain DNA samples from 1000 individuals with a specific disease (Cases) and 1000 healthy individuals (Controls).
Step 2. Genotype the 1000 Cases and 1000 Controls at 100,000-500,000 SNPs.
Step 3. Look for a SNP with significantly different frequency in Cases vs. Controls (Hirschhorn & Daly 2005: Nat Rev Genet 6, 95-108).
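Step 3 can be illustrated for a single SNP. The sketch below uses the Armitage trend test, computed as the sample size times the squared correlation between genotype and case status. The genotype counts are hypothetical, the function name `trend_test` is invented for this example, and the Bonferroni threshold of 0.05 divided by 500,000 tests is an assumption rather than something stated in the talk.

```python
import math

def trend_test(case_counts, control_counts):
    """Armitage trend test for one SNP.

    Each argument is (n0, n1, n2): the number of individuals carrying
    0, 1 or 2 copies of a chosen allele.
    Returns (chi-square statistic with 1 df, p-value).
    """
    n = sum(case_counts) + sum(control_counts)
    # Sums over individuals: genotype g in {0, 1, 2}, phenotype y in {0, 1}.
    sum_g = sum(g * (case_counts[g] + control_counts[g]) for g in range(3))
    sum_gg = sum(g * g * (case_counts[g] + control_counts[g]) for g in range(3))
    sum_y = sum(case_counts)                  # y = 1 for cases, so sum(y) = sum(y*y)
    sum_gy = sum(g * case_counts[g] for g in range(3))
    cov = sum_gy - sum_g * sum_y / n
    var_g = sum_gg - sum_g ** 2 / n
    var_y = sum_y - sum_y ** 2 / n
    if var_g == 0 or var_y == 0:
        return 0.0, 1.0                       # monomorphic SNP or one-group sample
    stat = n * cov ** 2 / (var_g * var_y)     # N times the squared correlation
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail, 1 df
    return stat, p_value

# Hypothetical genotype counts for 1000 cases and 1000 controls.
stat, p = trend_test((360, 480, 160), (420, 460, 120))
bonferroni_threshold = 0.05 / 500_000         # assumed: 500,000 SNPs tested
significant = p < bonferroni_threshold
```

With these made-up counts the single-SNP p-value is on the order of 10^-3, which looks notable in isolation but falls far short of a threshold near 10^-7 once every genotyped SNP counts as a hypothesis.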

Common Disease/Common Variants hypothesis. The hypothesis suggests that genetic risk for common diseases arises from a large number (e.g. up to 10 or more) of common variants (e.g. SNPs with frequency 10-90%), each of which confers modest disease risk (e.g. 1.5x larger risk of disease per copy of the unfavorable allele) (Reich & Lander 2001: Trends Genet 17, 502-10). WGAS are aimed at detecting common variants (Hirschhorn & Daly 2005: Nat Rev Genet 6, 95-108).

Successes of WGAS. WGAS have identified risk variants for:
  Age-related Macular Degeneration (Klein et al. 2005: Science 308, 385-9)
  Obesity (Herbert et al. 2006: Science 312, 279-83)
  Inflammatory Bowel Disease (Duerr et al. 2006: Science 314, 1461-3)
  Type 2 diabetes (Sladek et al. 2007: Nature 445, 828-30)

Advantages/disadvantages of WGAS
Advantages:
  Effective for common variants of modest risk
  No prior knowledge of disease pathways required
  Fine localization of disease variant
Disadvantages:
  Large number of hypotheses tested reduces power
  High cost

Cost of WGAS Affymetrix 500K and Illumina 300K technologies: genotype hundreds of thousands of SNPs at a cost of about $500 per sample. Thus, a WGAS with 1000 Cases and 1000 Controls will incur about $1 million in genotyping costs.

Whole-genome association studies (WGAS), Step 2 revisited: the 100,000-500,000 SNPs genotyped in the 1000 Cases and 1000 Controls are a subset of 10 million SNPs total (Hirschhorn & Daly 2005: Nat Rev Genet 6, 95-108).

LD and haplotypes: Recombination [Figure: chromosomes of Mother, Father and Child]

LD and haplotypes: Recombination [Figure: population at time 0, and many generations later]

Linkage disequilibrium and haplotypes

Linkage disequilibrium and haplotypes. SNP #1 has alleles A and G; SNP #2 has alleles C and G. SNP #1 and SNP #2 are perfect proxies (perfect LD): the r^2 between SNP #1 and SNP #2 is 100%.
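To make the r^2 statement concrete, the sketch below computes r^2 between two SNPs from a list of phased haplotypes, using the standard definition D^2 / (p1 (1-p1) p2 (1-p2)). The haplotype lists are invented for illustration and the function name `r_squared` is hypothetical.

```python
def r_squared(haplotypes, allele1, allele2):
    """r^2 between two SNPs from a list of phased haplotypes.

    haplotypes: list of (allele at SNP #1, allele at SNP #2) pairs.
    allele1, allele2: the allele tracked at each SNP.
    """
    n = len(haplotypes)
    p1 = sum(a == allele1 for a, _ in haplotypes) / n
    p2 = sum(b == allele2 for _, b in haplotypes) / n
    p12 = sum(a == allele1 and b == allele2 for a, b in haplotypes) / n
    if p1 in (0.0, 1.0) or p2 in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = p12 - p1 * p2                        # disequilibrium coefficient D
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))

# Hypothetical haplotypes: A always travels with C, and G with G.
perfect = [("A", "C")] * 6 + [("G", "G")] * 4
# One recombinant haplotype (A with G) breaks the perfect proxy.
imperfect = perfect + [("A", "G")]

r2_perfect = r_squared(perfect, "A", "C")
r2_imperfect = r_squared(imperfect, "A", "C")
```

In the first list knowing the allele at SNP #1 determines the allele at SNP #2, so r^2 comes out at 1 (100%); the single recombinant haplotype in the second list pulls it below 1.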

Linkage disequilibrium and haplotypes. More generally, SNP #1 (alleles A and G) might be an imperfect proxy (imperfect LD) for all SNPs within 10,000 bp. WGAS: choose a subset of 100,000-500,000 tag SNPs so that all SNPs are in strong LD (r^2 > 0.8) with a tag SNP. Haplotype association mapping: the causal SNP itself does not need to be genotyped.

Affymetrix 500K and Illumina 300K. Proportion of HapMap SNPs which are well tagged (r^2 > 0.8) by at least one of the tag SNPs in Affymetrix 500K or Illumina 300K, respectively:
              CEU    CHB+JPT   YRI
Affy 500K     65%    66%       41%
Illum 300K    75%    63%       28%
(Barrett & Cardon 2006: Nat Genet 38, 659-62)

Population differences in extent of LD
Population         Extent of LD    History
West African       10,000 bp       no bottleneck
European           50,000 bp       out of Africa 50kya
East Asian         50,000 bp       out of Africa 50kya
Native American    >100,000 bp     Bering strait 15kya
(Reich et al. 2001: Nature 411, 199-204; Conrad et al. 2006: Nat Genet 38, 1251-60)

Population differences in extent of LD (continued)
Kosrae             >>100,000 bp    island settled 2kya
(Bonnen et al. 2006: Nat Genet 38, 214-7; also see Service et al. 2006: Nat Genet 38, 556-560)

Future challenges. SNPs that are not in strong LD (r^2 > 0.8) with any of the tag SNPs in Affymetrix 500K (or Illumina 300K) may still be well captured using pairs of tag SNPs or, more generally, sets of n tag SNPs for some value of n (de Bakker et al. 2005: Nat Genet 37, 1217-23; also see Zaitlen et al. 2007: Am J Hum Genet 80, 683-91). However, the increased number of hypotheses tested may reduce power rather than increase it (Pe’er et al. 2006: Nat Genet 38, 663-7). Related approach: impute all HapMap SNPs and then carry out WGAS using those imputed SNPs.

Outline
1. Introduction to population genetics
2. An unsolved problem in population genetics
3. Whole-genome association studies (WGAS)
4. Applications of population genetics to WGAS:
   i. Linkage disequilibrium and haplotypes
   ii. Population stratification
   iii. Admixture association signals

Whole-genome association studies
Phenotype    Ancestry     SNP
case         N. Europe    T
control      S. Europe    C
???
Stratification: spurious associations due to ancestry differences between cases and controls.

Height association study (in European Americans)
Phenotype    Ancestry     Lactase SNP (chr 2)
tall         N. Europe    T
short        S. Europe    C
(phenotype and ancestry are linked by stratification)
(Campbell et al. 2005: Nat Genet 37, 868-72)

Population stratification: the association between height and the lactase SNP on chr 2 is a spurious association due to stratification! (Campbell et al. 2005: Nat Genet 37, 868-72)

EIGENSTRAT: use PCA to correct for stratification
1. Apply principal components analysis to infer continuous axes of genetic variation.
References: Cavalli-Sforza et al. 1994 book; Cavalli-Sforza et al. 1993: Science 259, 630-46; Patterson et al. 2006: PLoS Genet 2, e190; Price et al. 2006: Nat Genet 38, 904-9

EIGENSTRAT: use PCA to correct for stratification (continued)
2. For each inferred axis, subtract from each genotype and each phenotype the amount attributable to ancestry along that axis.
3. Evaluate association between ancestry-adjusted genotypes and ancestry-adjusted phenotypes, using the Armitage trend test.
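The three steps can be sketched with NumPy as follows. This is a simplified illustration and not the EIGENSTRAT program: the function name `eigenstrat_sketch`, the per-SNP scaling by the sample standard deviation, and the (n - n_axes - 1) factor applied to the squared correlation are assumptions of this sketch, and the simulated data are entirely hypothetical.

```python
import numpy as np

def eigenstrat_sketch(genotypes, phenotype, n_axes=2):
    """Simplified stratification correction.

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts.
    phenotype: (n_samples,) array, e.g. 1 for cases and 0 for controls.
    Returns one chi-square-like statistic per SNP.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = g.shape[0]

    # Step 1: principal components of the centered, scaled genotype matrix.
    g_centered = g - g.mean(axis=0)
    scale = g_centered.std(axis=0)
    scale[scale == 0] = 1.0                       # monomorphic SNPs stay at zero
    u, _, _ = np.linalg.svd(g_centered / scale, full_matrices=False)
    axes = u[:, :n_axes]                          # orthonormal sample coordinates

    # Step 2: subtract the part of each genotype and of the phenotype
    # that is predicted by position along the inferred axes.
    g_adj = g_centered - axes @ (axes.T @ g_centered)
    y_centered = y - y.mean()
    y_adj = y_centered - axes @ (axes.T @ y_centered)

    # Step 3: trend-test-like statistic from the squared correlation
    # between adjusted genotypes and adjusted phenotypes.
    num = (g_adj.T @ y_adj) ** 2
    den = (g_adj ** 2).sum(axis=0) * (y_adj ** 2).sum()
    r2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return (n - n_axes - 1) * r2

# Toy data (hypothetical): two populations and no causal SNP at all.
rng = np.random.default_rng(0)
pop = np.repeat([0, 1], 100)                          # 100 samples per population
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, size=500), 0.05, 0.95)
freq_a[0], freq_b[0] = 0.9, 0.1                       # one strongly differentiated SNP
freqs = np.where(pop[:, None] == 0, freq_a, freq_b)   # per-sample allele frequencies
geno = rng.binomial(2, freqs)
pheno = rng.binomial(1, np.where(pop == 0, 0.8, 0.2))  # cases enriched in population 0

uncorrected = eigenstrat_sketch(geno, pheno, n_axes=0)
corrected = eigenstrat_sketch(geno, pheno, n_axes=2)
```

Because the simulated phenotype depends only on population membership, the strongly differentiated SNP at index 0 should score highly when no axes are removed and much lower once the top axes are subtracted.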

Toy Example

Example of axis of variation [Figure; legend: +, 0, -] (Cavalli-Sforza et al. 1994 book; Cavalli-Sforza et al. 1993: Science 259, 630-46)

European American population structure: What’s inside the melting pot? ???

European American data set Brigham Rheumatoid Arthritis Sequential Study (BRASS): 488 European American samples with rheumatoid arthritis, genotyped on a 100K Affy chip (116,204 SNPs).

Results: top two axes of variation [Figure; labels: NW Europe, SE Europe]

Lactase persistence association study
Lactase persistent?    Ancestry     SNP
Yes                    N. Europe    T
No                     S. Europe    C
(persistence and ancestry are linked by stratification) ???
Lactase persistence is inferred from the LCT gene on chr 2 (known to perfectly predict lactase persistence).

Lactase persistence association study: lactase persistence is inferred from the LCT gene on chr 2 (Enattah et al. 2002). Many associated SNPs near the LCT gene on chr 2.

Lactase persistence association study
Persistent?    Ancestry     SNP on chr 3 (rs10511418)
Yes            N. Europe    G
No             S. Europe    A
(persistence and ancestry are linked by stratification)
P-value = 0.0000002 (after correcting for 116,204 hypotheses tested) ?!?

Lactase persistence association study: the association at rs10511418 on chr 3 is a spurious association due to stratification!

Lactase persistence association study: correcting for stratification. P-values for rs10511418 on chr 3 after correcting for stratification (and for 116,204 hypotheses tested):
  Genomic Control P-value = 0.0023
  EIGENSTRAT P-value = 1.0000
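For comparison with the EIGENSTRAT result, Genomic Control is commonly described as dividing every chi-square statistic by a single inflation factor estimated from the median statistic. The sketch below follows that description together with a Bonferroni correction for 116,204 tests; the statistics are invented, the helper names are hypothetical, and whether the talk's P-values were corrected in exactly this way is an assumption.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df

def genomic_control(stats):
    """Rescale chi-square statistics by the inflation factor lambda."""
    lam = max(1.0, median(stats) / CHI2_1DF_MEDIAN)
    return lam, [s / lam for s in stats]

def chi2_p_value(stat):
    """Upper-tail p-value of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(stat / 2))

def bonferroni(p, n_tests):
    """Correct a p-value for the number of hypotheses tested."""
    return min(1.0, p * n_tests)

# Hypothetical statistics: mostly null SNPs, inflated by stratification.
stats = [0.2, 0.5, 0.9, 1.1, 1.4, 2.0, 40.0]
lam, adjusted = genomic_control(stats)
p_before = bonferroni(chi2_p_value(stats[-1]), 116_204)
p_after = bonferroni(chi2_p_value(adjusted[-1]), 116_204)
```

The same factor is applied to every SNP here, whereas the PCA-based correction adjusts each SNP according to how strongly it varies along the inferred axes. That difference may explain why a SNP with unusually large ancestry differentiation can remain nominally significant under Genomic Control.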

Future challenges. Given genetic data (e.g. SNP data) from a set of samples of unknown ancestry, what is the best way to describe the “population structure” in the data, i.e. departures from the panmictic model of a single randomly mating population?
  Principal Components Analysis
  STRUCTURE model-based clustering program (Pritchard et al. 2000: Genetics 155, 945-59; Falush et al. 2003: Genetics 164, 1567-86)
These methods both fail on HapMap data.

PCA results on HapMap data

The problem: none of the principal components is able to distinguish CHB from JPT, even when looking at lower principal components.

PCA results on CHB and JPT only

The problem: distinguishing CHB from JPT requires analyzing the CHB+JPT samples separately from the other populations.

But what if population structure is continuous?

Admixture association references
Methodology and ANCESTRYMAP program: Patterson et al. 2004: Am J Hum Genet 74, 979-1000
Admixture mapping in African Americans: Smith et al. 2004: Am J Hum Genet 74, 1001-13
Successes of admixture mapping in African Americans: Reich et al. 2005: Nat Genet 37, 1113-8; Freedman et al. 2006: PNAS 103, 14068-73
Admixture mapping in Latino populations: Price et al. 2007: Am J Hum Genet, in press

Latino admixture creates a mosaic [Figure: European chromosomes and Native American chromosomes shown 4, 3, 2 and 1 generations ago; mixed European + Native American chromosomes today]

How does Latino admixture mapping work? [Figure: European and Native American chromosome segments around a disease locus, in cases with disease]

The Signal of Latino Admixture Association [Figure: % Native American ancestry (0% to 100%) against position on chromosome (cM)]

Admixture association: future challenges How to best integrate haplotype association and admixture association signals in a WGAS of an admixed population?

Acknowledgements Nick Patterson, Robert Plenge, Michael Weinblatt, Nancy Shadick, Fuli Yu, David Cox, Alicja Waliszewska, Gavin McDonald, Arti Tandon, Christine Schirmer, Julie Neubauer, Gabriel Bedoya, Constanza Duque, Alberto Villegas, Maria Catira Bortolini, Francisco Salzano, Carla Gallo, Guido Mazzotti, Marcela Tello-Ruiz, Laura Riba, Carlos Aguilar-Salinas, Samuel Canizales-Quinteros, Marta Menjivar, William Klitz, Brian Henderson, Chris Haiman, Cheryl Winkler, Teresa Tusie-Luna, Andres Ruiz-Linares, and David Reich
