1
1000G Pilot 3 Progress
in silico analysis and comparison to experimental validation
Gabor Marth (Boston College) + A + L
Kiran Garimella (Broad Institute) + C
February 2, 2010
2
Acknowledgements
Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
WUGSC: Dan Koboldt, Bob Fulton
WTSI: Aarno Palotie
Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
Cornell: Andy Clark
Stanford: Simon Gravel, Carlos Bustamante
Michigan: Tom Blackwell
3
Data

Population           CEU     TSI     CHB     CHD     JPT     LWK     YRI
Number of samples    90      66      109     107     105     108     112
Per-sample coverage  78.20X  65.20X  45.40X  60.25X  52.79X  31.29X  58.12X

Sequencing technology (in slide order): SLX+454 | SLX | SLX+454 | 454 | SLX+454

Capture technologies:
– Nimblegen solid phase
– Agilent liquid phase
Sequencing technologies:
– SLX
– 454
Data producers:
– BCM
– BI
– WTSI
– WUGSC
Capture targets:
– Started with ~1,000 genes / ~10,000 exons / 2.3Mb
– 1.43Mb of total target length shared between 4 data centers used for this analysis
Samples:
– 697 total samples
– 7 populations
Sequence coverage:
– Goal was deep per-sample coverage
– Effective coverage somewhat reduced by fragment duplications
4
Pipelines

Processing step                 BC                                                     BI
Read mapping SW                 MOSAIK                                                 MAQ (SLX), SSAHA2 (454)
Duplicate filtering SW          Picard MarkDuplicates (SLX), BCMMarkduplicates (454)   Picard MarkDuplicates (SLX), Picard MarkDuplicates (454)
Base quality recalibration SW   GATK (SLX), None (454)                                 GATK (SLX), GATK (454)
SNP calling SW                  GigaBayes (BamBayes)                                   UnifiedGenotyper

SNP calling: union of all called sites in all 697 samples (CEU, TSI, CHB, CHD, JPT, LWK, YRI)
SNP statistics: segregating sites in each population sample (CEU, TSI, CHB, CHD, JPT, LWK, YRI)
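The two-level flow on this slide (one union call set over all samples, then per-population site lists) could be sketched roughly as below. This is an illustration only, not either center's pipeline: the genotype encoding (non-reference allele count, `None` for no call), the function name, and the toy data are all assumptions. It also simplifies "segregating" to "at least one non-reference allele observed in the population"; a stricter definition would exclude sites fixed for the non-reference allele.

```python
from collections import defaultdict

def variant_sites_by_population(genotypes, sample_pop):
    """Return {population: set of sites with at least one non-reference allele}.

    genotypes:  {site: {sample: non-reference allele count, or None if uncalled}}
                (hypothetical encoding, not taken from the slides)
    sample_pop: {sample: population label}
    """
    sites = defaultdict(set)
    for site, calls in genotypes.items():
        for sample, alt_count in calls.items():
            if alt_count:  # skips None (no call) and 0 (homozygous reference)
                sites[sample_pop[sample]].add(site)
    return dict(sites)

# Toy data only; site names and sample names are made up.
genotypes = {
    "chr1:100": {"s1": 1, "s2": 0, "s3": 0},
    "chr1:200": {"s1": 0, "s2": 0, "s3": 2},
    "chr1:300": {"s1": 0, "s2": None, "s3": 0},
}
sample_pop = {"s1": "CEU", "s2": "CEU", "s3": "YRI"}

per_pop = variant_sites_by_population(genotypes, sample_pop)
all_sites = set().union(*per_pop.values())  # union over populations
```

In this toy example the third site carries no non-reference allele in any sample, so it appears in neither the per-population sets nor the union.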
5
BC and BI call sets are converging

All called sites: number of sites (% of union)

#  BC call version  BC total calls   BC unique calls  BC & BI (intersection)  BC || BI (union)  BI unique calls  BI total calls   BI call version
1  2009/11/20       11,580 (55.96%)  733 (3.54%)      10,847 (54.34%)         20,695 (100%)     9,115 (44.04%)   19,962 (96.46%)  v2
2  2009/11/20       11,580 (65.75%)  1,480 (8.40%)    10,100 (62.60%)         17,613 (100%)     6,033 (34.25%)   16,133 (91.60%)  v3
3  2010/01/20       14,502 (79.35%)  2,144 (11.73%)   12,358 (76.60%)         18,277 (100%)     3,775 (20.65%)   16,133 (88.27%)  v3
4  2010/01/20       14,502 (72.91%)  1,741 (8.75%)    12,761 (64.16%)         19,890 (100%)     5,388 (27.01%)   18,149 (91.25%)  v4

Called sites per population (BC/BI intersection): intersection (% of union)

#  CEU             TSI             CHB             CHD             JPT             LWK             YRI
1  3,354 (73.87%)  3,168 (65.88%)  3,279 (66.23%)  3,226 (68.42%)  2,942 (47.79%)  4,922 (70.56%)  4,917 (72.08%)
2  3,036 (70.62%)  2,893 (69.34%)  2,938 (62.23%)  2,783 (60.58%)  2,545 (55.64%)  4,486 (65.33%)  4,253 (66.30%)
3  3,333 (74.63%)  3,155 (73.15%)  3,294 (66.80%)  3,201 (66.69%)  2,795 (58.40%)  5,165 (73.18%)  4,728 (71.29%)
4  3,489 (78.78%)  3,281 (69.32%)  3,415 (69.74%)  3,431 (72.81%)  2,900 (50.86%)  5,459 (78.55%)  5,175 (78.59%)
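The derived columns in a comparison like this follow from three numbers: the two call-set totals and the size of their intersection. A minimal sketch of that arithmetic, using the "% of union" convention named on the slide (the function name and dictionary keys are illustrative, not from the source):

```python
def compare_call_sets(bc_total, bi_total, intersection):
    """Derive unique and union counts, each paired with its percentage of the union."""
    union = bc_total + bi_total - intersection
    counts = {
        "bc_total": bc_total,
        "bc_unique": bc_total - intersection,
        "intersection": intersection,
        "union": union,
        "bi_unique": bi_total - intersection,
        "bi_total": bi_total,
    }
    return {name: (n, round(100.0 * n / union, 2)) for name, n in counts.items()}

# Totals and intersection reported for comparison #4.
result = compare_call_sets(bc_total=14502, bi_total=18149, intersection=12761)
```

For comparison #4 this gives a union of 19,890 sites, with the intersection at 64.16% of the union, matching the reported row.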
6
SNP calls (per population)

BC                    CEU    TSI    CHB    CHD    JPT    LWK    YRI
samples               90     66     109    107    105    108    112
called SNPs           4,102  3,729  4,340  4,262  3,883  6,039  5,891
dbSNPs                2,422  2,257  2,042  1,924  1,950  2,872  2,897
% dbSNP               59.04  60.53  47.05  45.14  50.22  47.56  49.18
Ts/Tv (called SNPs)   2.73   2.78   2.82   3.06   2.85   3.45   2.92
novel SNPs            1,680  1,472  2,298  2,338  1,933  3,167  2,994
Ts/Tv (novel SNPs)    2.05   2.10   2.44   2.81   2.43   3.44   2.56

BI                    CEU    TSI    CHB    CHD    JPT    LWK    YRI
samples               90     66     109    107    105    108    112
called SNPs           3,816  4,285  3,972  3,881  4,719  6,370  5,869
dbSNPs                2,352  2,200  1,827  1,753  1,710  2,825  2,856
% dbSNP               61.64  51.34  46.00  45.17  36.24  44.35  48.66
Ts/Tv (called SNPs)   3.14   2.38   3.15   3.16   1.83   3.17   3.15
novel SNPs            1,464  2,085  2,145  2,128  3,009  3,545  3,013
Ts/Tv (novel SNPs)    2.92   1.72   3.03   3.05   1.36   3.07   2.99
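The Ts/Tv rows are the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes) among the called SNPs. A minimal sketch of that calculation is below; the function name and the `(ref, alt)` input format are assumptions for illustration, not the format either caller emits.

```python
# The four transition substitutions; every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs. Returns Ts/Tv as a float."""
    ts = tv = 0
    for ref, alt in snps:
        pair = (ref.upper(), alt.upper())
        if pair[0] == pair[1]:
            raise ValueError("not a SNP: %r" % (pair,))
        if pair in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ts / tv

# Toy input: 3 transitions and 2 transversions.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = ts_tv_ratio(example)
```

Because each base has one possible transition and two possible transversions, substitutions drawn uniformly at random would give a ratio near 0.5, which is why the ratio is often used as a rough quality indicator for a call set.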
7
SNP calls (all samples)

                BC      BI
Samples         697     697
Called SNPs     14,502  18,149
dbSNPs          3,948   4,041
dbSNP fraction  27.22%  22.27%

Call set overlap (BC U BI = 19,890):

            SNPs    dbSNPs  dbSNP fraction
BC only     1,741   79      4.54%
BC and BI   12,761  3,869   30.32%
BI only     5,388   172     3.19%
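The per-partition dbSNP fractions and the per-caller totals on this slide are linked by simple addition: each caller's dbSNP count is its unique partition plus the shared partition. A short sketch of that check, using the counts reported on the slide (the variable and function names are illustrative):

```python
def dbsnp_fraction(n_snps, n_dbsnp):
    """Percentage of sites already present in dbSNP, rounded to two decimals."""
    return round(100.0 * n_dbsnp / n_snps, 2)

# (SNPs, dbSNPs) for each partition of the BC/BI overlap, as reported on the slide.
partitions = {
    "BC only": (1741, 79),
    "BC and BI": (12761, 3869),
    "BI only": (5388, 172),
}
fractions = {name: dbsnp_fraction(*counts) for name, counts in partitions.items()}

# Per-caller totals are the caller-unique partition plus the shared partition.
bc_snps = partitions["BC only"][0] + partitions["BC and BI"][0]
bc_dbsnp = partitions["BC only"][1] + partitions["BC and BI"][1]
bi_snps = partitions["BI only"][0] + partitions["BC and BI"][0]
bi_dbsnp = partitions["BI only"][1] + partitions["BC and BI"][1]
bc_fraction = dbsnp_fraction(bc_snps, bc_dbsnp)
bi_fraction = dbsnp_fraction(bi_snps, bi_dbsnp)
```

The shared partition has a far higher dbSNP fraction than either caller-unique partition, which is the pattern the slide highlights.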
8
Genotype call accuracy relative to HapMap3

                                                     CEU    TSI    CHB    CHD    JPT    LWK    YRI
FDR of variant genotypes in HapMap3 (%)         BC   0.96   0.23   2.61   1.42   3.60   0.47   0.57
                                                BI   1.41   0.45   2.99   1.82   3.56   0.66   1.25
Correct calls (%)                               BC   98.39  98.98  96.76  98.20  95.72  99.06  98.63
                                                BI   97.22  98.26  95.45  97.35  94.55  98.74  96.68
Accuracy of homozygote reference calls (%)      BC   99.20  99.81  97.52  98.62  96.42  99.64  99.59
                                                BI   98.79  99.62  97.07  98.21  96.33  99.48  99.07
Accuracy of heterozygote calls (%)              BC   97.50  97.72  97.98  99.12  96.81  98.37  96.53
                                                BI   94.49  95.43  94.19  97.37  92.30  97.89  90.81
Accuracy of homozygote non-reference calls (%)  BC   97.31  98.44  93.27  95.78  92.69  98.21  98.45
                                                BI   96.77  98.76  93.26  95.16  93.43  97.67  97.46

Data quality in CHB and JPT samples seems consistently lower
Statistics only include genotype calls at SNP sites in BC∩BI
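The slide does not spell out how each accuracy figure was computed. One plausible reading, sketched below, scores each sequencing genotype call against the array genotype at the same site and sample, and reports accuracy per called genotype class. The 0/1/2 encoding, the function name, and the choice to condition on the called class rather than the HapMap class are all assumptions.

```python
from collections import Counter

def genotype_accuracy(pairs):
    """pairs: iterable of (called, truth) genotypes coded 0, 1, 2
    (non-reference allele count; hypothetical encoding).

    Returns (overall % correct, {called class: % of those calls that match truth}).
    """
    total = Counter()
    correct = Counter()
    for called, truth in pairs:
        total[called] += 1
        if called == truth:
            correct[called] += 1
    per_class = {g: 100.0 * correct[g] / total[g] for g in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return overall, per_class

# Toy data: 3 hom-ref calls (2 right), 4 het calls (3 right), 1 hom-non-ref call (right).
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (2, 2)]
overall, per_class = genotype_accuracy(pairs)
```

Conditioning on the array genotype instead would answer a different question (how often a true heterozygote is recovered), so the two conventions should not be mixed when comparing callers.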
9
Genotype calls
BC vs. BI allele frequency comparison:
– All SNP sites considered: # SNP sites=3,489, r=0.9921
– Only SNP sites with >= 80% called genotypes: # SNP sites=3,075, r=0.9979
Filtering:
– BC filters on genotype call quality
– BI reports a genotype for any site covered by at least one read (the Broad caller does not filter on genotype quality)
Nominally, BI makes more calls than BC and has, on average, higher AF
Good allele frequency concordance between BC and BI
At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
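The comparison on this slide, allele frequency per site from each caller followed by a correlation with and without a call-rate filter, could be sketched as follows. The data layout, function names, and the choice to require the call-rate threshold in both callers are assumptions; the slide does not say which caller's call rate was used.

```python
import math

def allele_freq(genotypes):
    """Non-reference allele frequency from diploid genotypes (0/1/2), ignoring None."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called)) if called else None

def call_rate(genotypes):
    """Fraction of samples with a called (non-None) genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def af_concordance(calls_a, calls_b, min_call_rate=0.0):
    """calls_a, calls_b: {site: [genotype per sample]} for two callers (hypothetical layout).

    Returns (number of sites compared, Pearson r of the allele frequencies).
    """
    xs, ys = [], []
    for site in calls_a.keys() & calls_b.keys():  # shared sites only
        ga, gb = calls_a[site], calls_b[site]
        # Assumption: the threshold must be met in both call sets.
        if min(call_rate(ga), call_rate(gb)) < min_call_rate:
            continue
        fa, fb = allele_freq(ga), allele_freq(gb)
        if fa is not None and fb is not None:
            xs.append(fa)
            ys.append(fb)
    return len(xs), pearson_r(xs, ys)

# Toy data: caller A leaves most genotypes uncalled at site4.
a = {"site1": [0, 1, 0, 0], "site2": [1, 1, 2, 0], "site3": [2, 2, 1, 2], "site4": [0, None, None, None]}
b = {"site1": [0, 1, 0, 0], "site2": [1, 1, 2, 0], "site3": [2, 2, 1, 2], "site4": [1, 0, 0, 0]}
n_all, r_all = af_concordance(a, b)
n_filt, r_filt = af_concordance(a, b, min_call_rate=0.8)
```

In the toy data the sparsely called site is the only one where the two callers disagree on frequency, so dropping it raises the correlation, the same direction of effect as on the slide.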
10
1KG validation executive summary
Evaluated BI and BC calls against validation
– 1KG chip [1]
  312/697 samples across 7 populations represented
  ~300 sites (150 novel) overlap with Pilot 3 target region
Concordance with 1KG chip is very high
– Where covered (> 5 reads):
  302/312 (97%) of samples have >90% variant sensitivity
  269/312 (86%) of samples have >90% genotype sensitivity
– Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
  Later sequencing has far greater concordance with chip than earlier sequencing
[1] Details in Appendix
11
Nearly all samples in call-set overlap have high sensitivity and specificity
[Figure: per-individual plot; axis label: Pilot 3 individual (312 individuals total after eliminating low-coverage samples)]
– These 10 low-sensitivity samples have strange allele balances and are likely contaminated
– All but one of the samples with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)
12
Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations
[Figure: per-population panels]
N Samples: 69 13 27 102 69324
Panel labels (date, technology, centers): 8/2008 ILMN/454 All Ctrs; 8/2008 ILMN/454 All Ctrs; 8/2008 ILMN/454 BI/BCM; 1/2009 454 BCM; 8/2008 ILMN/454 BI/BCM; 10/2008 ILMN BI/SC; 2008/2009 ILMN/454 All Ctrs
13
Low-frequency / singleton validation: executive summary
Low-frequency Sequenom assay [1]
– Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
– Validated sites in those 46 individuals
  89/105 are true singletons
  16/105 are false-positive singletons (hom-refs and two non-singletons)
Concordance with low-frequency assay is very high
– Callsets today (January 2010)
  In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
  In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons
[1] Details in Appendix
14
Call Set     Loci Tested (after Sequenom filtering)  Overlap with Test Set  TP (PPV)   FP  True, but not Singleton
BC ∩ BI      105                                     71                     71 (100%)  0   0
BC ∪ BI      105                                     92                     89 (97%)   3   0
Whole Assay  105                                     -                      89 (85%)   16  2

Callers are able to detect most singletons with very low false-positive rate
Joint calls find every singleton in the assay, with exceedingly few false positives.
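The PPV column, and the recovery rates quoted in the executive summary, follow from the true-positive and false-positive counts together with the 89 validated singletons. A minimal sketch of that arithmetic (function names and dictionary keys are illustrative):

```python
def ppv(tp, fp):
    """Positive predictive value in percent: TP / (TP + FP)."""
    return 100.0 * tp / (tp + fp)

def sensitivity(tp, n_true):
    """Percent of validated true sites recovered by the call set."""
    return 100.0 * tp / n_true

TRUE_SINGLETONS = 89  # validated true singletons in the assay

# (TP, FP) per call set, as reported in the table.
rows = {"intersection": (71, 0), "union": (89, 3)}
summary = {
    name: (round(ppv(tp, fp)), round(sensitivity(tp, TRUE_SINGLETONS)))
    for name, (tp, fp) in rows.items()
}
```

The intersection trades sensitivity for precision (100% PPV, about 80% of singletons recovered), while the union recovers every assayed singleton at about 97% PPV.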
15
Conclusions / future directions
Data quality has improved significantly over the life of the project
Both BC and BI pipelines produce high-quality call sets
– Good agreement between call sets
– Intersection highly concordant with experimental validation data
– Estimated FP rate below 5%
The current Pilot 3 release is the BC∩BI (intersection) call set
We are proceeding with validations
– Dual focus: accuracy and functional classes
– Results will inform future releases
16
APPENDIX
17
Population spectrum of called SNPs
18
Population-spectrum of called SNPs

called SNPs  CEU    TSI    CHB    CHD    JPT    LWK    YRI    ALL
BC           4,102  3,729  4,340  4,262  3,883  6,039  5,891  14,502
BI           3,816  4,285  3,972  3,881  4,719  6,370  5,869  18,149

Observation: BC calls more SNPs at the population level, but fewer SNP sites overall
Reason: BC tends to call the same site in more populations…
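One way to quantify the stated reason is the average number of populations in which a called site appears: the sum of the per-population counts divided by the number of distinct sites overall. The sketch below assumes the ALL column is the union of the per-population site lists; the function and variable names are illustrative.

```python
POPULATIONS = ["CEU", "TSI", "CHB", "CHD", "JPT", "LWK", "YRI"]

# Called SNPs per population and overall, as reported in the table.
bc_per_pop = [4102, 3729, 4340, 4262, 3883, 6039, 5891]
bi_per_pop = [3816, 4285, 3972, 3881, 4719, 6370, 5869]
bc_all, bi_all = 14502, 18149

def mean_populations_per_site(per_pop_counts, n_sites_overall):
    """Each site is counted once per population that calls it, so the ratio
    is the mean number of populations sharing a site."""
    return sum(per_pop_counts) / n_sites_overall

bc_mean = mean_populations_per_site(bc_per_pop, bc_all)
bi_mean = mean_populations_per_site(bi_per_pop, bi_all)
```

With the reported counts, a BC site is called in about 2.2 populations on average against about 1.8 for BI, which is consistent with BC calling the same site in more populations.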
19
BC/BI SNP calls per population (more detail)
20
SNP calls (per population)

BC                    CEU    TSI    CHB    CHD    JPT    LWK    YRI
samples               90     66     109    107    105    108    112
called SNPs           4,102  3,729  4,340  4,262  3,883  6,039  5,891
dbSNPs                2,422  2,257  2,042  1,924  1,950  2,872  2,897
% dbSNP               59.04  60.53  47.05  45.14  50.22  47.56  49.18
Ts/Tv (called SNPs)   2.73   2.78   2.82   3.06   2.85   3.45   2.92
novel SNPs            1,680  1,472  2,298  2,338  1,933  3,167  2,994
Ts/Tv (novel SNPs)    2.05   2.10   2.44   2.81   2.43   3.44   2.56
singletons            1,378  1,264  1,654  1,686  1,284  1,430  1,457
Ts/Tv (singletons)    2.72   3.36   3.33   3.39   3.09   4.68   3.04

BI                    CEU    TSI    CHB    CHD    JPT    LWK    YRI
samples               90     66     109    107    105    108    112
called SNPs           3,816  4,285  3,972  3,881  4,719  6,370  5,869
dbSNPs                2,352  2,200  1,827  1,753  1,710  2,825  2,856
% dbSNP               61.64  51.34  46.00  45.17  36.24  44.35  48.66
Ts/Tv (called SNPs)   3.14   2.38   3.15   3.16   1.83   3.17   3.15
novel SNPs            1,464  2,085  2,145  2,128  3,009  3,545  3,013
Ts/Tv (novel SNPs)    2.92   1.72   3.03   3.05   1.36   3.07   2.99
singletons            1,240  1,911  1,555  1,500  2,347  1,692  1,489
Ts/Tv (singletons)    2.84   1.72   2.81   3.03   1.11   3.26   2.73
21
Broad & BC calls: CEU
Population: CEU (90 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   4,102 (2.73)  3,816 (3.14)
# dbSNP (Ts/Tv)         2,422 (3.40)  2,352 (3.28)
# novel SNPs (Ts/Tv)    1,680 (2.05)  1,464 (2.92)
# Singleton (Ts/Tv)     1,378 (2.72)  1,240 (2.84)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      613     122 (19.90%)    0.92
BC & Broad   3,489   2,300 (65.92%)  3.47
Broad only   327     52 (15.90%)     1.32
22
Broad & BC calls: CHB
Population: CHB (109 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   4,340 (2.82)  3,972 (3.15)
# dbSNP (Ts/Tv)         2,042 (3.37)  1,827 (3.30)
# novel SNPs (Ts/Tv)    2,298 (2.44)  2,145 (3.03)
# Singleton (Ts/Tv)     1,654 (3.33)  1,555 (2.81)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      925     247 (26.70%)    1.23
BC & Broad   3,415   1,795 (52.56%)  3.74
Broad only   557     32 (5.75%)      1.37
23
Broad & BC calls: CHD
Population: CHD (107 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   4,262 (3.06)  3,881 (3.16)
# dbSNP (Ts/Tv)         1,924 (3.40)  1,753 (3.30)
# novel SNPs (Ts/Tv)    2,338 (2.81)  2,128 (3.05)
# Singleton (Ts/Tv)     1,686 (3.39)  1,500 (3.03)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      831     200 (24.07%)    1.68
BC & Broad   3,431   1,724 (50.25%)  3.64
Broad only   450     31 (6.44%)      1.33
24
Broad & BC calls: JPT
Population: JPT (105 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   3,883 (2.85)  4,719 (1.83)
# dbSNP (Ts/Tv)         1,950 (3.39)  1,710 (3.31)
# novel SNPs (Ts/Tv)    1,933 (2.43)  3,009 (1.36)
# Singleton (Ts/Tv)     1,284 (3.09)  2,347 (1.11)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      983     271 (27.57%)    1.54
BC & Broad   2,900   1,679 (57.90%)  3.67
Broad only   1,819   31 (1.70%)      0.74
25
Broad & BC calls: LWK
Population: LWK (108 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   6,039 (3.45)  6,370 (3.17)
# dbSNP (Ts/Tv)         2,872 (3.46)  2,825 (3.31)
# novel SNPs (Ts/Tv)    3,167 (3.44)  3,545 (3.08)
# Singleton (Ts/Tv)     1,430 (4.68)  1,692 (3.26)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      580     136 (23.45%)    2.09
BC & Broad   5,459   2,736 (50.12%)  3.67
Broad only   911     89 (9.77%)      1.56
26
Broad & BC calls: TSI
Population: TSI (66 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   3,729 (2.78)  4,285 (2.39)
# dbSNP (Ts/Tv)         2,257 (3.42)  2,200 (3.40)
# novel SNPs (Ts/Tv)    1,472 (2.10)  2,085 (1.72)
# Singleton (Ts/Tv)     1,264 (3.36)  1,911 (1.72)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      448     105 (23.44%)    0.71
BC & Broad   3,281   2,152 (65.59%)  3.54
Broad only   1,004   48 (4.78%)      0.85
27
Broad & BC calls: YRI
Population: YRI (112 samples)

                        BC            Broad
# SNPs called (Ts/Tv)   5,891 (2.92)  5,869 (3.15)
# dbSNP (Ts/Tv)         2,897 (3.38)  2,856 (3.34)
# novel SNPs (Ts/Tv)    2,994 (2.56)  3,013 (2.99)
# Singleton (Ts/Tv)     1,489 (3.04)  1,457 (2.73)

Overlap      # SNPs  # dbSNP (%)     Ts/Tv
BC only      716     112 (15.64%)    0.95
BC & Broad   5,175   2,785 (53.82%)  3.56
Broad only   694     71 (10.23%)     1.48
28
BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set
29
BC/BI genotype calls (CHB & CHD)
All SNPs: CHB #sites=3415 r=0.9925; CHD #sites=3431 r=0.9941
SNPs with >= 80% called genotypes: #sites=3028 r=0.9993; #sites=3310 r=0.9991
30
BC/BI genotype calls (TSI & JPT)
All SNPs: TSI #sites=3281 r=0.9912; JPT #sites=2900 r=0.9922
SNPs with >= 80% called genotypes: TSI #sites=3108 r=0.9973; JPT #sites=2370 r=0.9991
31
BC/BI genotype calls (LWK & YRI)
All SNPs: LWK #sites=5459 r=0.9924; YRI #sites=5175 r=0.9917
SNPs with >= 80% called genotypes: LWK #sites=5337 r=0.9984; YRI #sites=4276 r=0.9978
32
Low frequency / singleton validation design
33
Per population PPV and sensitivity