The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.

Questions
Why do we need a comprehensive map of human genetic variation?
How will data from the 1000G project be used in medical genetics?
How can we start accessing the hard part of the genome?
What has the 1000G project told us about the genetic structure of populations?

The role of the 1000G Project in medical genetics
A representation of ‘normal’ variation
– 95% of variants at 1% frequency in populations of interest
A resource for increasing the power of existing genome-wide association studies
A development platform for sequencing / statistical / computational technologies
A forum for establishing best practice and standards in genome sequencing

Samples for the 1000 Genomes Project
Major population groups composed of subpopulations of c. 100 each
GBR FIN TSI IBS CEU JPT CHB CHS CDX KHV GWB GHN YRI MAB LWK MXL CLM ASW AJM ACB PEL PUR
Samples from S. Asia

Three key components to the 1000G Project design

1. Population-scale genome sequencing
[Figure labels: Haplotypes, 2x, 10x]

2. Capturing diversity by sequencing related populations

3. Integrating data types
[Figure labels: Low coverage GW data, Exome, SNP genotype, Array CGH, Genome sequence]

http://browser.1000genomes.org Open and unrestricted access

What did we learn from the pilot?

270 genomes in the pilot project
TSI* CEU JPT CHB CHS* YRI LWK*
*Exon pilot only

The 1000G pilots
[Figure labels: 40x, 2-4x, 50x]

Lesson 1. The low-coverage model works for variant discovery
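A back-of-envelope calculation gives some intuition for why pooling many low-coverage genomes can work for discovery. The sketch below is an illustration only, not the project's calling model. It assumes that read counts are Poisson distributed, that each haplotype receives half of the per-sample depth, that there are no sequencing errors, and that a variant counts as discovered once a minimum number of reads carry it across all samples. The function name and its parameters are hypothetical.

```python
import math


def discovery_probability(allele_count, depth_per_sample, min_alt_reads=2):
    """Probability of seeing at least `min_alt_reads` reads carrying a variant.

    Assumptions (illustrative, not from the project): each of the
    `allele_count` variant-carrying haplotypes is covered by
    Poisson(depth_per_sample / 2) reads, with no sequencing errors.
    """
    # Expected number of variant-carrying reads summed over all carriers.
    lam = allele_count * depth_per_sample / 2.0
    # P(X < min_alt_reads) for X ~ Poisson(lam).
    p_too_few = sum(
        math.exp(-lam) * lam ** k / math.factorial(k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    # Compare a singleton with variants carried on several haplotypes at 4x.
    for count in (1, 2, 4, 8):
        p = discovery_probability(count, depth_per_sample=4)
        print(f"allele count {count}: P(discovered) = {p:.3f}")
```

Under these assumptions the probability rises quickly with the number of carrier haplotypes, which is the sense in which sample size can compensate for shallow per-sample depth.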

A near complete record of common variants CEU

The low-coverage model works almost as well for discovery as exome sequencing
[Figure: number of sites found in exome data and number of sites in low coverage data; axes: count of alternate allele in low coverage (in 688 shared samples), number of sites]

Low coverage sequencing can detect structural variants

Lesson 2. The low coverage model works for SNP genotyping

A set of accurate genotypes/haplotypes CEU

[Figure panels: Marginal calling, Joint calling]

Lesson 3. The genome has a large grey area where variant calling is hard

Lesson 4. Joint calling of different variant types substantially improves the quality of calls

Where is the project now?

Phase 1
GBR FIN TSI CEU JPT CHB CHS YRI LWK MXL CLM ASW PUR IBS

New data types
Data type              Pilot               Phase 1 (now)
Deep genomes           6                   -
Low coverage genomes   179                 1,094
Deep exonic            697 (1,000 genes)   977 (full exomes)
Chip genotypes         -                   1,542 (OMNI2.5)

New variants
Variant        Pilot    Phase 1 (now)
Total SNP      15.2M    38.9M
Known SNP      6.8M     8.5M
Novel SNP      8.4M     30.4M
Short INDELs   1.3M     4.7M**
**Estimated from chromosome 20
ftp://ftp.1000genomes.ebi.ac.uk

Genotype accuracy is improving
             HomRef   Het     HomAlt   Overall
Error rate   0.16%    0.76%   0.39%    0.37%

We are beginning to tackle variant integration
[Figure labels: Deletion, SNPs (from LC, EX, OMNI), Indels]

How will people use the 1000G project data?

1. Screening for functional variants

Fraction of variant sites present in an individual that are NOT already represented in dbSNP
Date            Fraction not in dbSNP
February 2000   98%
February 2001   80%
April 2008      10%
February 2011   2%
May 2011        1%

Rates of individual genome variant ‘rediscovery’
c. 250 LOF (loss-of-function) variants / person
c. 75 HGMD DM (disease-causing mutation) variants

USH2A
Mutations cause Usher syndrome
66 missense variants in dbSNP
2/3 detected in 1000 Genomes Pilot
One HGMD ‘disease-causing’ variant homozygous in 3 YRI
– Other reports indicate this is not a real disease-causing variant

2. Imputation into existing GWAS studies

IMPUTE
Reference panel (1000G)
… 11101010101011 …
… 00111110000111 …
… 11110000011101 …
… 00101011100101 …
Genotypes in additional samples from standard product
… 1.2..1.0.0..22 …
Imputation
Imputed genotypes
… 11220110200122 …
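The idea of filling in untyped sites from a reference panel can be sketched with the haplotype and genotype strings shown on the slide. This is a deliberately naive illustration, not the method used by IMPUTE: it simply picks the single pair of whole reference haplotypes that best matches the typed genotypes and copies that pair's alleles into the untyped sites. The function name `impute_best_pair` is hypothetical.

```python
from itertools import combinations_with_replacement

# Reference haplotypes and the sparse genotype string as shown on the slide.
REFERENCE = [
    "11101010101011",
    "00111110000111",
    "11110000011101",
    "00101011100101",
]
OBSERVED = "1.2..1.0.0..22"  # '.' marks an untyped site


def impute_best_pair(reference, observed):
    """Fill untyped sites from the best-matching pair of reference haplotypes."""
    best_pair, best_mismatches = None, None
    for hap_a, hap_b in combinations_with_replacement(reference, 2):
        # Genotype implied by this pair: alternate-allele count per site.
        implied = [int(a) + int(b) for a, b in zip(hap_a, hap_b)]
        mismatches = sum(
            1 for g, o in zip(implied, observed) if o != "." and int(o) != g
        )
        if best_mismatches is None or mismatches < best_mismatches:
            best_pair, best_mismatches = implied, mismatches
    # Keep typed genotypes as observed; fill only the untyped sites.
    return "".join(
        o if o != "." else str(g) for g, o in zip(best_pair, observed)
    )


if __name__ == "__main__":
    print(impute_best_pair(REFERENCE, OBSERVED))
```

Because this sketch copies whole haplotypes, its output need not match the imputed row on the slide. Practical imputation methods instead model each sample as a mosaic of reference haplotypes and report genotype probabilities.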

3. Designing new genome-wide genotyping platforms

Illumina

How can we access the hard part of the genome?
The paradigm of mapping reads to a reference fails when the genome contains sequence highly diverged from, or absent in, the reference.
We have been developing de novo assembly algorithms using coloured de Bruijn graphs to identify complex variants and genotype them.
For example, we can type classical HLA alleles from WGS data without read alignment.

ACGCGTC ACGTGTC
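The two sequences above differ at a single base, which makes them a convenient toy input for a coloured de Bruijn graph. The following is a minimal sketch, not the project's assembler: each k-mer is a node, consecutive k-mers are joined by an edge, and every node and edge records the set of colours (samples) it was seen in. Labelling the first sequence as the reference, the colour names, the choice of k=3, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the k-mers of `sequence` in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_coloured_graph(samples, k):
    """Build node and edge colour maps from {colour_name: sequence}."""
    node_colours = defaultdict(set)
    edge_colours = defaultdict(set)
    for colour, sequence in samples.items():
        path = kmers(sequence, k)
        for kmer in path:
            node_colours[kmer].add(colour)
        # Consecutive k-mers overlap by k-1 bases and are joined by an edge.
        for left, right in zip(path, path[1:]):
            edge_colours[(left, right)].add(colour)
    return node_colours, edge_colours


if __name__ == "__main__":
    # Which sequence is the reference is an assumption for this example.
    samples = {"reference": "ACGCGTC", "sample": "ACGTGTC"}
    nodes, edges = build_coloured_graph(samples, k=3)
    for kmer in sorted(nodes):
        print(kmer, sorted(nodes[kmer]))
```

Nodes and edges carrying only one colour mark where the two sequences diverge, and paths that split and later rejoin are the kind of structure a variant caller would look for in such a graph.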

3501/5703 from lab-typing
Zam Iqbal

Lessons learnt about related populations

GBR FIN TSI CEU IBS

Closely related populations can have substantially different rare variants

Spatial heterogeneity in non-genetic risk can differentially confound association studies for rare and common variants
Iain Mathieson

Thanks to the many...
Steering committee
– Co-chairs: Richard Durbin and David Altshuler
Samples and ELSI Committee
– Co-chairs: Aravinda Chakravarti and Leena Peltonen
Data Production Group
– Co-chairs: Elaine Mardis and Stacey Gabriel
Analysis Group
– Co-chairs: Gil McVean and Goncalo Abecasis
– Subgroups in gene-targeted sequencing (Richard Gibbs) and population genetics (Molly Przeworski)
Structural Variation Group
– Co-chairs: Matt Hurles, Charles Lee and Evan Eichler
DCC
– Co-chairs: Paul Flicek and Steve Sherry
