The 1000 Genomes Project Gil McVean Department of Statistics, Oxford
What is the 1000 Genomes Project?
A catalogue of all types of genetic variation, including rare variants (c. 1% frequency), obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest
A large international collaboration – UK, USA, China, Germany
An exploration of the use of next-generation technologies for population-scale genome sequencing
A resource for accelerating the identification of disease mechanisms in the follow-up to disease-association studies
Samples for the main project
Major population groups, each made up of subpopulations of c. 100 individuals
[Map labels: UK, FIN, TSI, ESP, CEU, JPT, CHB, CHS, DAI, KVT, GMB, GHN, YRI, MLW, LWK, MXL, ASW, CMB, PRO; legend entry: new]
Population-scale genome sequencing
[Figure: haplotypes at 2x and 10x coverage]
Pilot experiments
Pilot 1 – Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and CHB+JPT
Pilot 2 – High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI)
Pilot 3 – Exons from 1000 genes to 20x in c samples (largely European)
Complete!
The 1000G Low Coverage Pilot
185 individuals from 4 populations – CEU (63), CHB (30), JPT (30), YRI (62)
Population   Technology    N Individuals   Mapped Bases (billions)   Mean Coverage / Individual
CEU          SLX, SOLiD
CHB          SLX
JPT          SLX
YRI          SLX, SOLiD
Combined                   185             1,
Even still, a lot of data isn’t much
In the Pilot 1 sample, 1 tera-basepair leaves the CEU with…
– 6% of genotypes with 0 reads
– 16% of genotypes with < 2 reads
– 29% of genotypes with < 3 reads
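The figures above can be set against a simple baseline. The sketch below assumes read depth at a site follows a Poisson distribution with a given mean coverage. That model is an assumption made here for illustration, not something stated on the slide, and real sequencing depth is typically more variable than Poisson, so observed fractions can differ. The function name prob_fewer_than is hypothetical.

```python
import math


def prob_fewer_than(k: int, mean_depth: float) -> float:
    """Return P(depth < k) when depth ~ Poisson(mean_depth).

    Assumption: reads land independently and uniformly, so per-site
    depth is Poisson. Real data are usually overdispersed.
    """
    return sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )


if __name__ == "__main__":
    # Compare the low-coverage range used in Pilot 1 (2x-4x).
    for depth in (2.0, 4.0):
        row = ", ".join(
            f"P(<{k} reads)={prob_fewer_than(k, depth):.3f}" for k in (1, 2, 3)
        )
        print(f"{depth:.0f}x: {row}")
```

Under this idealised model, a site sequenced at a mean of 2x has no reads with probability exp(-2), roughly 13.5%, which is one way to see why low-coverage genotypes cannot be called reliably one site at a time.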
ftp.1000genomes.ebi.ac.uk
Pilot release expected Nov/Dec 2009
ftp-trace.ncbi.nih.gov/1000genomes/ftp
What has the project already generated?
Over 9 million novel SNPs
Total 17.2M SNPs called
Previously ~12M SNPs “known” (dbSNP 129)
– 7.9M confirmed
– 9.2M novel
[Figure: total SNPs and novel SNPs for CEU, YRI and CHB+JPT]
Le Quang
A near complete record of common SNPs Durbin, Le Quang
A set of accurate genotypes
This is about where simulations suggest we should be with 2-4x on 60 samples
Note: this quality is much, much better than if calls were made marginally
Durbin, Le Quang
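To see why marginal calls struggle at 2-4x, here is a minimal sketch of a genotype call made from one sample's reads at one site, ignoring every other sample and site (the reading of "marginally" assumed here). The per-read error model, the flat prior over genotypes and the function name genotype_posteriors are all assumptions made for illustration. This is not the project's calling method.

```python
def genotype_posteriors(n_ref: int, n_alt: int, error: float = 0.01):
    """Posterior over genotypes (0, 1 or 2 alt copies) from read counts.

    Assumptions: reads are independent, each read is miscalled with
    probability `error`, and all three genotypes are equally likely
    a priori. The binomial coefficient is the same for every genotype,
    so it cancels and is omitted.
    """
    likelihoods = []
    for alt_copies in (0, 1, 2):
        # Probability that a single read shows the alternate allele.
        p_alt = (alt_copies / 2) * (1 - error) + (1 - alt_copies / 2) * error
        likelihoods.append((p_alt ** n_alt) * ((1 - p_alt) ** n_ref))
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


if __name__ == "__main__":
    # Two reads, both showing the reference allele.
    hom_ref, het, hom_alt = genotype_posteriors(n_ref=2, n_alt=0)
    print(f"hom-ref={hom_ref:.3f} het={het:.3f} hom-alt={hom_alt:.3f}")
```

With two reference reads and no alternate reads, the homozygous-reference posterior under these assumptions is only about 0.80, because a heterozygote would produce the same reads a quarter of the time. Borrowing information across samples and neighbouring sites is what lets the calls do better than this.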
Many novel indels and larger structural variants
Novel sequence from de novo assembly
Up to 50kb
Zam Iqbal
Some interesting biology - variation in SNP density
Some more interesting biology – high Fst SNPs Ryan Hernandez, Adam Auton
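For readers unfamiliar with Fst, the sketch below computes it for a single biallelic SNP from the allele frequencies in two populations, using the basic heterozygosity definition Fst = (H_T - H_S) / H_T with equal weighting of the populations. The estimator actually used for the slide is not stated, so this form and the function name fst_two_pops are assumptions.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """Fst for one biallelic SNP from two population allele frequencies.

    Assumptions: equal population weights, and frequencies treated as
    known exactly (no sample-size correction).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        # Site is monomorphic in both populations; Fst is undefined,
        # reported as 0.0 here for convenience.
        return 0.0
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    for p1, p2 in [(0.5, 0.5), (0.1, 0.9), (0.0, 1.0)]:
        print(f"p1={p1}, p2={p2}: Fst={fst_two_pops(p1, p2):.2f}")
```

A SNP at the same frequency in both populations gives Fst = 0, and a SNP fixed for different alleles gives Fst = 1. High-Fst SNPs are those near the second extreme.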
Even more interesting biology – loss of function mutations Daniel MacArthur
A robust and modular pipeline for analysis of population-scale sequence data
An efficient format for storing aligned reads and a set of tools to manipulate and view the files
SAM/BAM format for storing (aligned) reads
Bioinformatics (2009)
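As an illustration of what a SAM record holds, here is a minimal sketch that parses one tab-separated alignment line into its mandatory fields and decodes two FLAG bits. The example record is made up, and the names SamRecord and parse_sam_line are hypothetical. For real work the tools that accompany the format are the better choice.

```python
from dataclasses import dataclass

FLAG_UNMAPPED = 0x4   # read has no alignment
FLAG_REVERSE = 0x10   # read aligned to the reverse strand


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read bases
    qual: str    # base qualities (ASCII-encoded)

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & FLAG_UNMAPPED)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & FLAG_REVERSE)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Made-up record for illustration only.
EXAMPLE_LINE = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"

if __name__ == "__main__":
    print(parse_sam_line(EXAMPLE_LINE))
```

BAM stores the same records in a compressed binary form, so this text-level view applies only to SAM.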
An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files
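The slide does not name the format. Assuming it refers to VCF (Variant Call Format), the sketch below shows how one data line carries both site-level information and per-sample genotypes. The example line, the sample names and the function name parse_vcf_line are made up for illustration.

```python
import re


def parse_vcf_line(line: str, sample_names):
    """Parse one VCF data line into site fields and per-sample genotypes.

    Assumptions: the line has the eight fixed columns followed by a
    FORMAT column containing GT, then one column per sample.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    gt_index = fields[8].split(":").index("GT")
    genotypes = {}
    for name, sample in zip(sample_names, fields[9:]):
        gt = sample.split(":")[gt_index]
        # '|' separates phased alleles, '/' unphased; '.' is missing.
        alleles = [None if a == "." else int(a) for a in re.split(r"[|/]", gt)]
        genotypes[name] = {"alleles": alleles, "phased": "|" in gt}
    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id,
        "ref": ref, "alt": alt.split(","), "genotypes": genotypes,
    }


# Made-up record for illustration only.
EXAMPLE_LINE = "1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0|1\t1/1"

if __name__ == "__main__":
    record = parse_vcf_line(EXAMPLE_LINE, ["sampleA", "sampleB"])
    print(record)
```

The phased flag is what lets the same record hold either haplotype data (allele order is meaningful) or plain genotype data.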
Using the 1000G data now
IMPUTE
[Diagram: genotypes in additional samples from standard product + reference panel (1000G) -> imputation -> imputed genotypes]
Imputation performance across SNP types
From P1 (CEU), from Affy 500k
Annotation               # SNPs          Info measure
All                      414,
MAF < 5%                 102,
MAF > 5%                 312,
UCSC Genes               6,
Depth < 100              3,153 (0.7%)    0.611
SimpRpts                 25,
SimpRpts + Depth < 100   1,652 (6.5%)    0.671
SegDups                  24,
SegDups + Depth <        (2.7%)          0.388
Jonathan Marchini
Looking forward...
Already have data generated for c. 200 more Europeans
– Data generation largely complete by mid 2010
Much work still to be done on accurate inference of all types of variation from NGS data
Data already proven useful for a number of projects – please use it
Thanks to the many...