The 1000 Genomes Project Gil McVean Department of Statistics, Oxford
Questions Why do we need a comprehensive map of human genetic variation? How will data from the 1000G project be used in medical genetics? How can we start accessing the hard part of the genome? What has the 1000G project told us about the genetic structure of populations?
The role of the 1000G Project in medical genetics A representation of ‘normal’ variation –95% of variants at 1% frequency in populations of interest A resource for increasing the power of existing genome-wide association studies A development platform for sequencing / statistical / computational technologies A forum for establishing best practice and standards in genome sequencing
Samples for the 1000 Genomes Project Major population groups comprised of subpopulations of c. 100 each GBR FIN TSI IBS CEU JPT CHB CHS CDX KHV GWB GHN YRI MAB LWK MXL CLM ASW AJM ACB PEL PUR Samples from S. Asia
Three key components to the 1000G Project design
1. Population-scale genome sequencing Haplotypes 2x 10x
2. Capturing diversity by sequencing related populations
3. Integrating data types Low coverage GW data Exome SNP genotype Array CGH Genome sequence
Open and unrestricted access
What did we learn from the pilot?
TSI* CEU JPT CHB CHS* YRI LWK* *Exon pilot only 270 genomes in the pilot project
The 1000G pilots 40x 2-4x 50x
Lesson 1. The low-coverage model works for variant discovery
A near complete record of common variants CEU
The low-coverage model works almost as well for discovery as exome sequencing
[Figure. Series: number of sites found in exome data; number of sites in low coverage data. Axis labels: count of alternate allele in low coverage (in 688 shared samples); number of sites.]
Low coverage sequencing can detect structural variants
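A rough illustration of why Lesson 1 is plausible: pooling reads across many low-coverage samples gives good power to see an allele even when no single sample is deeply covered. The sketch below is a back-of-envelope model, not the project's calling method. Its assumptions are not from the slides: read depth per chromosome copy is Poisson, chromosomes carry the allele independently, there are no sequencing errors, and a site counts as discovered once a hypothetical threshold `min_alt_reads` of supporting reads is reached across the whole sample set.

```python
from math import exp, factorial

def poisson_at_least(min_reads, mean):
    """P(X >= min_reads) for X ~ Poisson(mean)."""
    below = sum(exp(-mean) * mean**i / factorial(i) for i in range(min_reads))
    return 1.0 - below

def detection_power(n_samples, depth, freq, min_alt_reads=2):
    """Probability that a variant at frequency `freq` (0 < freq < 1) is
    supported by at least `min_alt_reads` reads pooled over `n_samples`
    diploid samples, each sequenced to mean depth `depth`.

    Assumed model (illustrative only): the number of carrier chromosomes
    is Binomial(2N, freq), and each carrier chromosome contributes
    Poisson(depth / 2) reads showing the alternate allele.
    """
    n_chrom = 2 * n_samples
    p_carriers = (1.0 - freq) ** n_chrom  # P(0 carrier chromosomes)
    power = 0.0
    for carriers in range(n_chrom + 1):
        power += p_carriers * poisson_at_least(min_alt_reads, carriers * depth / 2)
        # Binomial recurrence: P(k + 1) = P(k) * (n - k) / (k + 1) * f / (1 - f)
        p_carriers *= (n_chrom - carriers) / (carriers + 1) * freq / (1.0 - freq)
    return power

# Example: 100 samples, 1% allele frequency, at 2x and at 4x coverage
power_2x = detection_power(100, 2, 0.01)
power_4x = detection_power(100, 4, 0.01)
```

Under this model a 1% variant is missed mostly when no sampled chromosome carries it at all, which depends on sample size rather than depth. That is the trade-off the low-coverage design makes.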
Lesson 2. The low coverage model works for SNP genotyping
A set of accurate genotypes/haplotypes CEU
Marginal calling Joint calling
Lesson 3. The genome has a large grey area where variant calling is hard
Lesson 4. Joint calling of different variant types substantially improves the quality of calls
Where is the project now?
Phase1 GBR FIN TSI CEU JPT CHB CHS YRI LWK MXL CLM ASW PUR IBS
New data types

Data type              Pilot               Phase 1 (now)
Deep genomes           6                   -
Low coverage genomes   179                 1,094
Deep exonic            697 (1,000 genes)   977 (full exomes)
Chip genotypes         -                   1,542 (OMNI2.5)
New variants

Variant        Pilot    Phase 1 (now)
Total SNP      15.2M    38.9M
Known SNP      6.8M     8.5M
Novel SNP      8.4M     30.4M
Short INDELs   1.3M     4.7M**

**Estimated from chromosome 20
ftp://ftp.1000genomes.ebi.ac.uk
Genotype accuracy is improving

             HomRef   Het     HomAlt   Overall
Error rate   0.16%    0.76%   0.39%    0.37%
We are beginning to tackle variant integration
[Figure: Deletion, SNPs (from LC, EX, OMNI), Indels]
How will people use the 1000G project data?
1. Screening for functional variants
Fraction of variant sites present in an individual that are NOT already represented in dbSNP

Date             Fraction not in dbSNP
February,        %
February,        %
April,           %
February, 2011   2%
May 2011         1%
Rates of individual genome variant ‘rediscovery’ c. 250 LOF / person c. 75 HGMD DM
USH2A Mutations cause Usher syndrome 66 missense variants in dbSNP 2/3 detected in 1000 Genomes Pilot One HGMD ‘disease-causing’ variant homozygous in 3 YRI –Other reports indicate this is not a real disease-causing variant
2. Imputation into existing GWAS studies
[Figure: IMPUTE combines genotypes in additional samples from a standard product with the reference panel (1000G); imputation yields imputed genotypes]
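To make the idea of imputation concrete, here is a deliberately simplified sketch. It is not IMPUTE's method, which models each study haplotype as a mosaic of reference haplotypes using a hidden Markov model. The toy version just copies untyped alleles from the single reference haplotype that agrees best at the typed sites. The function name `impute_haplotype`, the panel, and the typed sites are all made up for illustration.

```python
def impute_haplotype(typed, reference_panel):
    """Fill in untyped sites of one study haplotype (toy nearest-haplotype rule).

    typed: dict mapping site index -> allele (0 or 1) at genotyped sites.
    reference_panel: list of fully observed haplotypes (lists of 0/1).
    """
    def mismatches(ref_hap):
        # Count disagreements with the study haplotype at typed sites only
        return sum(ref_hap[site] != allele for site, allele in typed.items())

    best = min(reference_panel, key=mismatches)
    # Keep observed alleles; copy the rest from the best-matching haplotype
    return [typed.get(site, best[site]) for site in range(len(best))]

# Hypothetical reference panel: three haplotypes over six sites
panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 0],
]
# The study sample was genotyped at sites 0, 3 and 5 only
typed = {0: 1, 3: 1, 5: 1}
imputed = impute_haplotype(typed, panel)  # [1, 0, 0, 1, 0, 1]
```

A denser, more diverse reference panel gives more candidate haplotypes to copy from, which is why a resource such as 1000G can increase the power of existing array-based studies.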
3. Designing new genome-wide genotyping platforms
Illumina
How can we access the hard part of the genome? The paradigm of mapping reads to a reference fails when the genome contains sequence that is highly diverged from, or absent in, the reference. We have been developing de novo assembly algorithms using coloured de Bruijn graphs to identify complex variants and genotype them. For example, we can type classical HLA alleles from WGS data without read alignment.
ACGCGTC ACGTGTC
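The two sequences above differ at a single base. As a minimal sketch of the coloured de Bruijn graph idea, they can be treated as two samples (an assumption about what the slide illustrates), with each k-mer tagged by the set of samples ("colours") that contain it. K-mers seen in only one colour mark the two branches of the bubble created by the variant. The function name `coloured_kmers` and the choice k = 3 are illustrative. Real assemblers use much larger k, store edges, and handle reverse complements, all of which this sketch ignores.

```python
from collections import defaultdict

def coloured_kmers(sequences, k):
    """Map each k-mer to the set of colours (sequence indices) containing it."""
    colours = defaultdict(set)
    for colour, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            colours[seq[i:i + k]].add(colour)
    return dict(colours)

graph = coloured_kmers(["ACGCGTC", "ACGTGTC"], k=3)

# K-mers present in both samples flank the variant site
shared = sorted(kmer for kmer, cs in graph.items() if cs == {0, 1})

# K-mers private to one sample form that sample's branch of the bubble
private = {
    colour: sorted(kmer for kmer, cs in graph.items() if cs == {colour})
    for colour in (0, 1)
}
# shared  -> ['ACG', 'CGT', 'GTC']
# private -> {0: ['CGC', 'GCG'], 1: ['GTG', 'TGT']}
```

Because variants are read off the graph structure and not from alignments, the same approach can in principle reach sequence that is diverged from, or missing in, the reference.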
3501/5703 from lab-typing Zam Iqbal
Lessons learnt about related populations
GBR FIN TSI CEU IBS
Closely related populations can have substantially different rare variants
Spatial heterogeneity in non-genetic risk can differentially confound association studies for rare and common variants Iain Mathieson
Thanks to the many... Steering committee –Co-chairs: Richard Durbin and David Altshuler Samples and ELSI Committee –Co-chairs: Aravinda Chakravarti and Leena Peltonen Data Production Group –Co-chairs: Elaine Mardis and Stacey Gabriel Analysis Group –Co-Chairs: Gil McVean and Goncalo Abecasis –Subgroups in gene-targeted sequencing (Richard Gibbs) and population genetics (Molly Przeworski) Structural Variation Group –Co-chairs: Matt Hurles, Charles Lee and Evan Eichler DCC –Co-Chairs: Paul Flicek and Steve Sherry