Published by Nathan Miller. Modified over 8 years ago.
1
The 1000 Genomes Project Gil McVean Department of Statistics, Oxford
2
Questions
Why do we need a comprehensive map of human genetic variation?
How will data from the 1000G project be used in medical genetics?
How can we start accessing the hard part of the genome?
What has the 1000G project told us about the genetic structure of populations?
5
The role of the 1000G Project in medical genetics
A representation of ‘normal’ variation
–95% of variants at 1% frequency in populations of interest
A resource for increasing the power of existing genome-wide association studies
A development platform for sequencing / statistical / computational technologies
A forum for establishing best practice and standards in genome sequencing
6
Samples for the 1000 Genomes Project
Major population groups, each made up of subpopulations of c. 100 samples
GBR FIN TSI IBS CEU JPT CHB CHS CDX KHV GWB GHN YRI MAB LWK MXL CLM ASW AJM ACB PEL PUR
Samples from S. Asia
7
Three key components to the 1000G Project design
8
1. Population-scale genome sequencing Haplotypes 2x 10x
9
2. Capturing diversity by sequencing related populations
10
3. Integrating data types Low coverage GW data Exome SNP genotype Array CGH Genome sequence
11
http://browser.1000genomes.org Open and unrestricted access
12
What did we learn from the pilot?
13
270 genomes in the pilot project
TSI* CEU JPT CHB CHS* YRI LWK*
*Exon pilot only
14
The 1000G pilots 40x 2-4x 50x
15
Lesson 1. The low-coverage model works for variant discovery
16
A near complete record of common variants CEU
17
The low-coverage model works almost as well for discovery as exome sequencing
[Figure: number of sites found in exome data and number of sites in low coverage data, plotted against the count of the alternate allele in low coverage (in 688 shared samples); y-axis: number of sites]
18
Low coverage sequencing can detect structural variants
19
Lesson 2. The low coverage model works for SNP genotyping
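As background to this lesson, here is a rough sketch of why genotyping one sample in isolation is hard at the 2-4x depths used in the low-coverage pilot. It assumes an idealised model that is not from the slides: exactly `depth` independent, error-free reads at a site, each drawn from either allele with probability 0.5. The function name is illustrative only.

```python
def prob_both_alleles_seen(depth: int) -> float:
    """Chance that reads at a heterozygous site include both alleles.

    Idealised assumption: `depth` independent, error-free reads, each
    sampling one of the two alleles with probability 0.5.
    """
    if depth < 2:
        return 0.0
    # All reads come from one allele with probability 2 * 0.5**depth.
    return 1.0 - 2.0 * 0.5 ** depth


for depth in (2, 4, 10, 40):
    print(f"{depth:>2}x  P(both alleles observed) = {prob_both_alleles_seen(depth):.4f}")
```

Under these assumptions a heterozygote shows both alleles only half the time at 2x, so per-sample reads alone leave many genotypes ambiguous at low depth.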
20
A set of accurate genotypes/haplotypes CEU
21
Marginal calling / Joint calling
22
Lesson 3. The genome has a large grey area where variant calling is hard
24
Lesson 4. Joint calling of different variant types substantially improves the quality of calls
26
Where is the project now?
27
Phase 1
GBR FIN TSI CEU JPT CHB CHS YRI LWK MXL CLM ASW PUR IBS
28
New data types
Data type              Pilot               Phase 1 (now)
Deep genomes           6                   -
Low coverage genomes   179                 1,094
Deep exonic            697 (1,000 genes)   977 (full exomes)
Chip genotypes         -                   1,542 (OMNI2.5)
29
New variants
Variant        Pilot    Phase 1 (now)
Total SNP      15.2M    38.9M
Known SNP      6.8M     8.5M
Novel SNP      8.4M     30.4M
Short INDELs   1.3M     4.7M**
**Estimated from chromosome 20
ftp://ftp.1000genomes.ebi.ac.uk
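The SNP counts on this slide can be cross-checked with a few lines of arithmetic. This is a small sketch using only the numbers quoted above (in millions); the helper name `novel_fraction` is made up for illustration.

```python
# SNP counts in millions, as quoted on the slide.
counts = {
    "Pilot": {"total": 15.2, "known": 6.8, "novel": 8.4},
    "Phase 1": {"total": 38.9, "known": 8.5, "novel": 30.4},
}


def novel_fraction(total: float, novel: float) -> float:
    """Share of discovered SNPs that were not previously known."""
    return novel / total


for phase, c in counts.items():
    # Known and novel counts should add up to the total.
    assert abs(c["known"] + c["novel"] - c["total"]) < 1e-9
    print(f"{phase}: {novel_fraction(c['total'], c['novel']):.1%} of SNPs are novel")
```

Known plus novel equals the total in both columns, and the novel share is higher in Phase 1 than in the pilot.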
30
Genotype accuracy is improving
             HomRef   Het     HomAlt   Overall
Error rate   0.16%    0.76%   0.39%    0.37%
31
We are beginning to tackle variant integration
[Figure labels: Deletion; SNPs (from LC, EX, OMNI); Indels]
32
How will people use the 1000G project data?
33
1. Screening for functional variants
34
Fraction of variant sites present in an individual that are NOT already represented in dbSNP
Date             Fraction not in dbSNP
February 2000    98%
February 2001    80%
April 2008       10%
February 2011    2%
May 2011         1%
35
Rates of individual genome variant ‘rediscovery’
c. 250 LOF / person
c. 75 HGMD DM
36
USH2A
Mutations cause Usher syndrome
66 missense variants in dbSNP
2/3 detected in 1000 Genomes Pilot
One HGMD ‘disease-causing’ variant homozygous in 3 YRI
–Other reports indicate this is not a real disease-causing variant
37
2. Imputation into existing GWAS studies
38
IMPUTE
Reference panel (1000G)
… 11101010101011 …
… 00111110000111 …
… 11110000011101 …
… 00101011100101 …
Genotypes in additional samples from standard product
… 1.2..1.0.0..22…
Imputation
Imputed genotypes
… 11220110200122 …
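To make the idea on this slide concrete, here is a deliberately simplified toy. It is not how IMPUTE works: real imputation tools model each sample as a mosaic of reference haplotypes, whereas this sketch only picks the single pair of whole reference haplotypes that best matches the typed sites and copies that pair's allele counts into the gaps. The function name and the `.` missing-data convention (taken from the slide) are the only interface assumed.

```python
from itertools import combinations_with_replacement


def impute_by_best_pair(reference, typed):
    """Fill '.' entries in `typed` from the best-matching haplotype pair.

    reference: list of equal-length strings of '0'/'1' (haplotypes).
    typed: string of '0'/'1'/'2' allele counts, with '.' for untyped sites.
    Returns (imputed_string, mismatch_at_typed_sites).
    Toy assumption: the sample is one unbroken pair of reference haplotypes.
    """
    best = None
    for hap_a, hap_b in combinations_with_replacement(reference, 2):
        dosage = [int(a) + int(b) for a, b in zip(hap_a, hap_b)]
        mismatch = sum(abs(d - int(t)) for d, t in zip(dosage, typed) if t != ".")
        if best is None or mismatch < best[0]:
            best = (mismatch, dosage)
    mismatch, dosage = best
    # Keep observed genotypes; fill only the untyped sites.
    imputed = "".join(t if t != "." else str(d) for d, t in zip(dosage, typed))
    return imputed, mismatch


# Panel and typed genotypes as printed on the slide.
panel = ["11101010101011", "00111110000111", "11110000011101", "00101011100101"]
typed = "1.2..1.0.0..22"
print(impute_by_best_pair(panel, typed))
```

Because the toy cannot switch between haplotypes along the sequence, its output need not match the imputed genotypes shown on the slide. That gap is the motivation for mosaic models.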
41
3. Designing new genome-wide genotyping platforms
42
Illumina
43
How can we access the hard part of the genome?
The paradigm of mapping reads to a reference fails when the genome contains sequence highly diverged from, or absent from, the reference
We have been developing de novo assembly algorithms using coloured de Bruijn graphs to identify complex variants and genotype them
For example, we can type classical HLA alleles from WGS data without read alignment
44
ACGCGTC ACGTGTC
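The two sequences above differ at a single base. As a minimal sketch of how a coloured de Bruijn graph exposes such a difference, the code below breaks each sequence into k-mers, records which sequence ("colour") each k-mer and each k-mer-to-k-mer edge came from, and lists the k-mers private to one colour. The choice of k = 3 and all names are illustrative assumptions. Real assemblers typically use much larger k and also handle reverse complements, which this toy ignores.

```python
from collections import defaultdict


def coloured_de_bruijn(sequences, k):
    """Build a toy coloured de Bruijn graph.

    Nodes are k-mers and edges join consecutive k-mers; each node and edge
    carries the set of colours (sequence indices) in which it occurs.
    """
    node_colours = defaultdict(set)
    edge_colours = defaultdict(set)
    for colour, seq in enumerate(sequences):
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            node_colours[kmer].add(colour)
        for left, right in zip(kmers, kmers[1:]):
            edge_colours[(left, right)].add(colour)
    return dict(node_colours), dict(edge_colours)


nodes, edges = coloured_de_bruijn(["ACGCGTC", "ACGTGTC"], k=3)

shared = sorted(kmer for kmer, cs in nodes.items() if cs == {0, 1})
private = {c: sorted(kmer for kmer, cs in nodes.items() if cs == {c}) for c in (0, 1)}

print("shared k-mers:", shared)
print("private to sequence 0:", private[0])
print("private to sequence 1:", private[1])
```

The paths of the two colours leave the shared node ACG along different edges and rejoin at GTC. Divergence followed by reconvergence is the kind of signal that can flag a variant without aligning reads to a reference.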
45
3501/5703 from lab-typing Zam Iqbal
46
Lessons learnt about related populations
47
GBR FIN TSI CEU IBS
48
Closely related populations can have substantially different rare variants
49
Spatial heterogeneity in non-genetic risk can differentially confound association studies for rare and common variants Iain Mathieson
50
Thanks to the many...
Steering committee
–Co-chairs: Richard Durbin and David Altshuler
Samples and ELSI Committee
–Co-chairs: Aravinda Chakravarti and Leena Peltonen
Data Production Group
–Co-chairs: Elaine Mardis and Stacey Gabriel
Analysis Group
–Co-chairs: Gil McVean and Goncalo Abecasis
–Subgroups in gene-targeted sequencing (Richard Gibbs) and population genetics (Molly Przeworski)
Structural Variation Group
–Co-chairs: Matt Hurles, Charles Lee and Evan Eichler
DCC
–Co-chairs: Paul Flicek and Steve Sherry