The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.


1 The 1000 Genomes Project Gil McVean Department of Statistics, Oxford

2 Questions
Why do we need a comprehensive map of human genetic variation?
How will data from the 1000G project be used in medical genetics?
How can we start accessing the hard part of the genome?
What has the 1000G project told us about the genetic structure of populations?

3

4

5 The role of the 1000G Project in medical genetics
A representation of ‘normal’ variation
–95% of variants at 1% frequency in populations of interest
A resource for increasing the power of existing genome-wide association studies
A development platform for sequencing / statistical / computational technologies
A forum for establishing best practice and standards in genome sequencing
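To give a feel for why sample size matters for a target such as "95% of variants at 1% frequency", the sketch below computes the chance that a variant at a given population frequency is carried by at least one sampled chromosome. It is a back-of-envelope illustration, not the project's power calculation: it assumes independent sampling of chromosomes, the sample sizes (100 and 500 individuals) are illustrative choices, and the function name `prob_variant_sampled` is hypothetical. Being present in the sample is also only a necessary condition for discovery; whether a low-coverage design actually calls the variant is a separate question.

```python
def prob_variant_sampled(freq, n_individuals):
    """Probability that at least one of the 2N sampled chromosomes carries
    a variant with population frequency `freq` (assumes independence)."""
    n_chromosomes = 2 * n_individuals  # diploid samples
    return 1.0 - (1.0 - freq) ** n_chromosomes


if __name__ == "__main__":
    # Illustrative sample sizes only; not figures taken from the slides.
    for n in (100, 500):
        print(n, prob_variant_sampled(0.01, n))
```

Under these assumptions a 1% variant is present in roughly 87% of samples of 100 individuals, and in well over 99% of samples of 500, which is one reason to pool related subpopulations.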

6 Samples for the 1000 Genomes Project
Major population groups composed of subpopulations of c. 100 each
GBR FIN TSI IBS CEU JPT CHB CHS CDX KHV GWB GHN YRI MAB LWK MXL CLM ASW AJM ACB PEL PUR
Samples from S. Asia

7 Three key components to the 1000G Project design

8 1. Population-scale genome sequencing Haplotypes 2x 10x

9 2. Capturing diversity by sequencing related populations

10 3. Integrating data types Low coverage GW data Exome SNP genotype Array CGH Genome sequence

11 http://browser.1000genomes.org Open and unrestricted access

12 What did we learn from the pilot?

13 TSI* CEU JPT CHB CHS* YRI LWK* *Exon pilot only 270 genomes in the pilot project

14 The 1000G pilots 40x 2-4x 50x

15 Lesson 1. The low-coverage model works for variant discovery

16 A near complete record of common variants CEU

17 The low-coverage model works almost as well for discovery as exome sequencing
[Figure: number of sites against count of alternate allele in low coverage (in 688 shared samples); series: number of sites found in exome data, number of sites in low coverage data]

18 Low coverage sequencing can detect structural variants

19 Lesson 2. The low coverage model works for SNP genotyping

20 A set of accurate genotypes/haplotypes CEU

21 Marginal calling / Joint calling
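A toy illustration of why information beyond a single sample's reads can change a low-coverage genotype call. This is a minimal sketch, not the project's calling pipeline: it assumes a simple binomial read model with a 1% error rate, and it stands in for "joint" information with a Hardy-Weinberg prior built from a hypothetical population alternate-allele frequency of 5%. All function names and parameter values are assumptions made for the example.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error=0.01):
    """P(reads | genotype) for genotypes with 0, 1 and 2 alternate alleles."""
    n = ref_reads + alt_reads
    likelihoods = []
    for alt_copies in (0, 1, 2):
        # Chance that a single read shows the alternate allele.
        p_alt = (alt_copies / 2) * (1 - error) + (1 - alt_copies / 2) * error
        likelihoods.append(
            comb(n, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def hardy_weinberg_prior(alt_freq):
    """Genotype prior from a population alternate-allele frequency."""
    ref_freq = 1 - alt_freq
    return [ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2]


def call_genotype(likelihoods, prior):
    """Most probable alternate-allele count given likelihoods and a prior."""
    posterior = [lik * pr for lik, pr in zip(likelihoods, prior)]
    return max(range(3), key=lambda g: posterior[g])


if __name__ == "__main__":
    liks = genotype_likelihoods(ref_reads=0, alt_reads=2)  # only 2 reads
    print("flat prior:", call_genotype(liks, [1 / 3, 1 / 3, 1 / 3]))
    print("population prior:", call_genotype(liks, hardy_weinberg_prior(0.05)))
```

With only two reads, both showing the alternate allele, the flat-prior call is homozygous alternate, while the population-informed call is heterozygous: at 2x coverage, seeing one chromosome twice is quite likely.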

22 Lesson 3. The genome has a large grey area where variant calling is hard

23

24 Lesson 4. Joint calling of different variant types substantially improves the quality of calls

25

26 Where is the project now?

27 Phase 1 GBR FIN TSI CEU JPT CHB CHS YRI LWK MXL CLM ASW PUR IBS

28 New data types
Data type              Pilot               Phase 1 (now)
Deep genomes           6                   -
Low coverage genomes   179                 1,094
Deep exonic            697 (1,000 genes)   977 (full exomes)
Chip genotypes         -                   1,542 (OMNI2.5)

29 New variants
Variant        Pilot    Phase 1 (now)
Total SNP      15.2M    38.9M
Known SNP      6.8M     8.5M
Novel SNP      8.4M     30.4M
Short INDELs   1.3M     4.7M**
**Estimated from chromosome 20
ftp://ftp.1000genomes.ebi.ac.uk

30 Genotype accuracy is improving
             HomRef   Het     HomAlt   Overall
Error rate   0.16%    0.76%   0.39%    0.37%

31 We are beginning to tackle variant integration
[Figure labels: Deletion; SNPs (from LC, EX, OMNI); Indels]

32 How will people use the 1000G project data?

33 1. Screening for functional variants

34 Fraction of variant sites present in an individual that are NOT already represented in dbSNP
Date             Fraction not in dbSNP
February 2000    98%
February 2001    80%
April 2008       10%
February 2011    2%
May 2011         1%

35 Rates of individual genome variant ‘rediscovery’ c. 250 LOF / person c. 75 HGMD DM

36 USH2A
Mutations cause Usher syndrome
66 missense variants in dbSNP
2/3 detected in 1000 Genomes Pilot
One HGMD ‘disease-causing’ variant homozygous in 3 YRI
–Other reports indicate this is not a real disease-causing variant

37 2. Imputation into existing GWAS studies

38 IMPUTE
Reference panel (1000G):
… 11101010101011 …
… 00111110000111 …
… 11110000011101 …
… 00101011100101 …
Genotypes in additional samples from standard product:
… 1.2..1.0.0..22 …
Imputation
Imputed genotypes:
… 11220110200122 …
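The idea on this slide can be sketched in a few lines: find reference haplotypes that are consistent with the sparsely typed genotypes, then read the untyped sites off them. The example below is a deliberately crude stand-in, not IMPUTE's algorithm. It assumes the four binary strings on the slide are reference haplotypes, that `.` marks an untyped site, and that genotypes count alternate alleles (0, 1 or 2). It picks the single pair of whole haplotypes with the fewest mismatches, whereas real imputation methods use a probabilistic model that lets the copied haplotype switch along the chromosome. The function name `impute_best_pair` is hypothetical.

```python
from itertools import combinations_with_replacement


def impute_best_pair(observed, reference_haplotypes):
    """Fill untyped sites ('.') using the pair of reference haplotypes
    whose summed alleles best match the typed genotypes."""
    best_pair, best_mismatches = None, None
    for hap_a, hap_b in combinations_with_replacement(reference_haplotypes, 2):
        mismatches = sum(
            1
            for obs, a, b in zip(observed, hap_a, hap_b)
            if obs != "." and int(obs) != int(a) + int(b)
        )
        if best_mismatches is None or mismatches < best_mismatches:
            best_pair, best_mismatches = (hap_a, hap_b), mismatches
    hap_a, hap_b = best_pair
    # Keep typed genotypes as observed; fill the gaps from the chosen pair.
    return "".join(
        obs if obs != "." else str(int(a) + int(b))
        for obs, a, b in zip(observed, hap_a, hap_b)
    )


if __name__ == "__main__":
    reference_panel = [
        "11101010101011",
        "00111110000111",
        "11110000011101",
        "00101011100101",
    ]
    observed = "1.2..1.0.0..22"
    print(impute_best_pair(observed, reference_panel))
```

Because this sketch copies two whole haplotypes, its output need not equal the illustrative "imputed genotypes" row on the slide; it only shows the matching-then-filling logic.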

39

40

41 3. Designing new genome-wide genotyping platforms

42 Illumina

43 How can we access the hard part of the genome?
The paradigm of mapping reads to a reference fails when the genome contains sequence that is highly diverged from, or absent in, the reference
We have been developing de novo assembly algorithms using coloured de Bruijn graphs to identify complex variants and genotype them
For example, we can type classical HLA alleles from WGS data without read alignment

44 ACGCGTC ACGTGTC
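The two sequences on this slide differ at a single base, which makes them a convenient toy input for the colouring idea. The sketch below is a minimal illustration, not the authors' assembler: it assumes the first sequence is a reference and the second a sample, uses a hypothetical k-mer size of 3, and records only the nodes (k-mers) and their colours, leaving out the edges a full de Bruijn graph would also store. K-mers that carry a single colour mark where the two sequences diverge.

```python
from collections import defaultdict


def build_coloured_kmers(sequences, k):
    """Map each k-mer to the set of colours (sequence labels) containing it."""
    colours = defaultdict(set)
    for colour, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            colours[seq[i:i + k]].add(colour)
    return dict(colours)


def kmers_with_colours(graph, wanted):
    """Sorted k-mers whose colour set is exactly `wanted`."""
    return sorted(kmer for kmer, cols in graph.items() if cols == set(wanted))


# Labels are assumptions; the slide does not say which sequence is which.
sequences = {"reference": "ACGCGTC", "sample": "ACGTGTC"}
graph = build_coloured_kmers(sequences, k=3)

shared = kmers_with_colours(graph, ["reference", "sample"])
reference_only = kmers_with_colours(graph, ["reference"])
sample_only = kmers_with_colours(graph, ["sample"])

if __name__ == "__main__":
    print("shared:", shared)
    print("reference only:", reference_only)
    print("sample only:", sample_only)
```

The shared k-mers flank the difference and the single-colour k-mers form the two branches between them, which is the kind of "bubble" a variant caller working on such a graph would look for.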

45 3501/5703 from lab-typing Zam Iqbal

46 Lessons learnt about related populations

47 GBR FIN TSI CEU IBS

48 Closely related populations can have substantially different rare variants

49 Spatial heterogeneity in non-genetic risk can differentially confound association studies for rare and common variants Iain Mathieson

50 Thanks to the many...
Steering committee
–Co-chairs: Richard Durbin and David Altshuler
Samples and ELSI Committee
–Co-chairs: Aravinda Chakravarti and Leena Peltonen
Data Production Group
–Co-chairs: Elaine Mardis and Stacey Gabriel
Analysis Group
–Co-chairs: Gil McVean and Goncalo Abecasis
–Subgroups in gene-targeted sequencing (Richard Gibbs) and population genetics (Molly Przeworski)
Structural Variation Group
–Co-chairs: Matt Hurles, Charles Lee and Evan Eichler
DCC
–Co-chairs: Paul Flicek and Steve Sherry

51

