Bioinformatics Pipeline for Fosmid-Based Molecular Haplotype Sequencing
Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)
(1) Max Planck Institute for Molecular Genetics, Berlin, Germany
(2) Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Mapping slides
MHC: Key Region for Common Diseases & Transplant Medicine
[Figure: map of the MHC with the MHC class I, MHC class III and MHC class II regions; axis positions 29.74, 31.59, 32.34, 33.21]
MHC: Variation amongst Haplotypes
Variation of MHC haplotypes against the PGF reference
[Figure: 7 further MHC haplotype sequences compared with the MHC PGF reference sequence across MHC class III and MHC class II; RCCX CNV and HLA-DRB CNV marked]
Variation amongst 8 MHC haplotypes: 37,451 substitutions; 7,093 short indels
Variation and annotation map for eight MHC haplotypes, Horton et al., Immunogenetics (2008) 60, 1-18
Experimental Approach
- 100 individuals, 100 libraries
- 3 x 96-well = 288 fosmid pools; 40 kb haploid molecules; 5000 fosmids in one pool
- SNP mapping for prioritization of MHC-informative pools
- Targeted enrichment
- SOLiD NGS platform: shotgunning complete 40 kb fosmids, complete fosmid pool
- Data analysis pipeline: identification of 40 kb fosmid sequences; phasing molecular fosmid sequences (Haplotype A, Haplotype B); contiguous MHC haplotype sequence
Data Analysis Pipeline Fosmid Detection Program Read Alignment against Genome Fosmid Specific Matching Algorithm Pairing Fosmid Sequences Based Phasing Consensus Calling SNP Analysis Visualization & MHC Database In House Project Specific Analysis Pipeline SOLiD Standard Pipeline
Mapping real data: pool of 15,000 fosmids; 22 million reads, 50 bp
SNP calls: Haploid fosmids vs. genomic DNA

gDNA
# cov  ref  consen  F3  coord
335  C  Y  177/17  62511614
3345  T  3191/56  62512095
875  G  A  862/25  62513689
1795  K  722/23  62513754
707  S  528/13  62515375
2643  1391/20  62517737
643  417/23  62518998
1074  R  554/21  62522445
606  226/21  62524689
639  M  167/15  62532474
158  89/14  62533464
1032  443/26  62534973
7  7/4  62537153
775  742/26  62540402
10  10/5  62540465
698  684/29  62541769
40  40/4  62542550
94  93/9  62542574
286  283/16  62543011
194  190/22  62543067

Fosmid
# cov  ref  consen  F3  coord
595  C  T  572/91  62511614
3418  3278/98  62512095
2089  G  A  2048/98  62513689
2238  2194/98  62513754
1134  1107/73  62515375
3104  2922/98  62517737
1033  1014/83  62518998
1799  1753/98  62522445
1053  1049/83  62524689
54  39/22  62527964
32  27/23  62529870
1374  1355/95  62532474
973  946/97  62533464
2850  2745/98  62534973
49  48/33  62537153
1888  1845/95  62540402
37  36/20  62540465
923  901/97  62541769
8411  W  2006/78  62542258
253  253/47  62542550
SNP Calling Accuracy in the MHC
Affymetrix genotype information for 1583 SNP positions as reference standard:
- Homozygous identical with reference: 957
- Heterozygous: 562
- Homozygous different from reference: 64
Compared to variants called from the SOLiD-sequenced genomic DNA sample (15x average read coverage):
- Percentage of error in genotype calling: 3.66%
- False positive rate: 0.1%
- False negative rate: 9.25%
Notes: The genotype calls from the 1583 Affymetrix SNPs (957 homozygous reference, 562 heterozygous, 64 homozygous non-reference) are compared with the NGS calls. FP: the proportion of homozygous-reference genotypes for which the NGS call is either heterozygous or homozygous non-reference, i.e. genotypes where the NGS call shows a variant and Affymetrix does not. FN: genotypes where Affymetrix calls a variant and NGS does not; one possible reason is that NGS does not have coverage at all Affymetrix genomic positions.
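The comparison described in the notes can be sketched in a few lines of Python. This is only an illustration: the genotype encoding (`hom_ref`, `het`, `hom_alt`), the function name and the choice of denominators (homozygous-reference sites for FP, variant sites for FN, called positions for the genotype error) are assumptions of this sketch, not details given on the slide.

```python
from collections import Counter

def genotype_concordance(array_calls, ngs_calls):
    """Compare NGS genotype calls with array genotypes, position by position.

    Both arguments map position -> 'hom_ref' | 'het' | 'hom_alt'
    (hypothetical encoding). A position missing from ngs_calls has no NGS call.
    """
    counts = Counter()
    for pos, truth in array_calls.items():
        call = ngs_calls.get(pos)  # None when NGS has no call here
        truth_is_variant = truth != "hom_ref"
        call_is_variant = call is not None and call != "hom_ref"
        counts["variant_sites" if truth_is_variant else "hom_ref_sites"] += 1
        if call_is_variant and not truth_is_variant:
            counts["false_pos"] += 1
        if truth_is_variant and not call_is_variant:
            counts["false_neg"] += 1  # also counts positions without coverage
        if call is not None:
            counts["called"] += 1
            if call != truth:
                counts["genotype_errors"] += 1
    # Denominators are assumptions of this sketch (see the lead-in).
    return {
        "fp_rate": counts["false_pos"] / max(counts["hom_ref_sites"], 1),
        "fn_rate": counts["false_neg"] / max(counts["variant_sites"], 1),
        "error_rate": counts["genotype_errors"] / max(counts["called"], 1),
    }
```

Counting uncovered positions as false negatives mirrors the remark in the notes that missing NGS coverage is one possible source of FN calls.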
Fosmids Detection
Fosmid Detection Algorithm
- Assign each read to a single 1 kb long bin.
- Select bins with more than 5 reads.
- Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls.
- Cluster adjacent bins as belonging to the same fosmid if the gap distance between them is less than 10 kb and there are no bins with heterozygous SNPs between them.
- Keep fosmids with lengths between 3 kb and 60 kb.
Notes: The slide shows (A) the HT mapping of 14 fosmids in the MHC region. (B) shows the same fosmids as detected through SOLiD sequencing, together with some landmark genes where HT mapping cannot reliably predict the presence of fosmids. (C) shows detailed read statistics for all 19 fosmids detected in this individual well: the start and end positions mark the beginning and end of each fosmid, the bins column gives the length in kb, the reads column lists the number of reads for the fosmid, and the cov/base column shows the average coverage per base.
UCSC Genome Browser, http://genome.ucsc.edu/; Kent et al. 2002, Genome Res. 12(6):996-1006.
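The detection steps listed on this slide can be sketched as follows. It is a minimal illustration, not the project's program: the function name and inputs are hypothetical, allele calling is assumed to happen elsewhere and to deliver the set of bins marked as heterozygous, and such bins are treated here as boundaries that are excluded from clusters.

```python
from collections import Counter

BIN_SIZE = 1_000                   # 1 kb bins
MIN_READS = 5                      # keep bins with more than 5 reads
MAX_GAP = 10_000                   # join bins separated by less than 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000   # accepted fosmid lengths

def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of putative fosmids on one chromosome.

    read_starts: mapped read start coordinates of one pool.
    het_bins: indices of bins marked as carrying heterozygous calls
              (hypothetical input, computed elsewhere).
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    covered = sorted(b for b, n in counts.items()
                     if n > MIN_READS and b not in het_bins)
    clusters = []
    for b in covered:
        if clusters:
            last = clusters[-1][-1]
            gap = (b - last - 1) * BIN_SIZE
            het_between = any(last < h < b for h in het_bins)
            if gap < MAX_GAP and not het_between:
                clusters[-1].append(b)
                continue
        clusters.append([b])  # start a new candidate fosmid
    fosmids = []
    for cluster in clusters:
        start, end = cluster[0] * BIN_SIZE, (cluster[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids
```

Splitting at heterozygous bins reflects the idea that a haploid fosmid should not show two alleles, so a heterozygous call within a pool suggests two overlapping molecules rather than one.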
Fosmids Detection: Size distribution of read-contigs
[Figure: size distribution of read-contigs; 20-50 kb fosmid-sized contigs]
Haplotyping

Locus  Event  Alleles Hap 1  Alleles Hap 2
1  SNV  T  C
2  Deletion  -
3  A  G
4  Insertion  GC

Locus  Event  Alleles
1  SNV  C,T
2  Deletion  C,-
3  A,G
4  Insertion  -,GC

The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.
Single Individual Haplotyping
Input: Matrix M of m fragments covering n loci
[Figure: matrix with rows f1, f2, f3, ..., fm and columns Locus 1, 2, 3, 4, 5, ..., n; entries include "-"]
ReFHap Problem Formulation For two alleles a1, a2 For two rows i1, i2 of M f1 - 1 f2 Score -1 s(M,1,2) = 1
ReFHap Problem Formulation For a cut I of rows of M
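The formulas on these two slides did not survive as text, so the following is a hedged sketch of one scoring scheme that fits the Max-Cut view of the problem. The sign convention is an assumption: two called alleles score +1 when they differ and -1 when they agree, and a "-" entry scores 0. The function names and the string encoding of fragments are hypothetical.

```python
def allele_score(a1, a2):
    """Score two alleles: +1 if they differ, -1 if they agree, 0 if either is '-'."""
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1

def row_score(M, i1, i2):
    """s(M, i1, i2): sum of allele scores over all loci of rows i1 and i2."""
    return sum(allele_score(a, b) for a, b in zip(M[i1], M[i2]))

def cut_score(M, I):
    """s(M, I): sum of row scores over pairs with one row in I and one outside."""
    I = set(I)
    outside = [i for i in range(len(M)) if i not in I]
    return sum(row_score(M, i, j) for i in I for j in outside)

# Hypothetical fragments over four loci; '-' marks an entry without a call.
M = ["011-", "-100", "0-11"]
print(row_score(M, 1, 2))   # rows 1 and 2 disagree at both shared loci
print(cut_score(M, {1}))    # row 1 on one side, rows 0 and 2 on the other
```

Under this convention a high cut score means that fragments placed on opposite sides mostly disagree, which is what one expects of fragments from different haplotypes.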
ReFHap Algorithm
- Reduce the problem to Max-Cut
- Solve Max-Cut
- Build haplotypes according to the cut
[Figure: example fragment matrix (fragments f1-f4, loci 1-5), the corresponding weighted graph, and the resulting haplotypes h1 = 00110, h2 = 11001]
ReFHap Algorithm
Build G = (V, E, w) from M
Sort E from largest to smallest weight
Init I with a random subset of V
For each e in the first k edges:
    I' ← GreedyInit(G, e)
    I' ← GreedyImprovement(G, I')
    If s(M, I) < s(M, I') then I ← I'
ReFHap Algorithm: Classical greedy algorithm
[Figure: example graph on vertices 1-4]
ReFHap Algorithm: Edge flipping
[Figure: example graph on vertices 1-4]
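The heuristic outlined on the ReFHap Algorithm slides can be sketched in Python as below. The slides name GreedyInit and GreedyImprovement without defining them, so their bodies here are assumptions: the initialization seeds the two sides with the endpoints of an edge, and the improvement step moves single vertices (classical greedy) or both endpoints of an edge (edge flipping) while the cut grows. Edge weights use the assumed +1/-1/0 allele scoring, alleles are assumed to be coded 0/1, and all function names are hypothetical.

```python
import random
from itertools import combinations

def score(a, b):
    """Assumed allele score: +1 differ, -1 agree, 0 if either is '-'."""
    if a == "-" or b == "-":
        return 0
    return 1 if a != b else -1

def build_graph(M):
    """w[i][j] = summed allele score of fragments i and j (0 means no edge)."""
    n = len(M)
    w = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = sum(score(a, b) for a, b in zip(M[i], M[j]))
    return w

def cut_weight(w, I):
    return sum(w[i][j] for i in I for j in range(len(w)) if j not in I)

def greedy_init(w, edge):
    """Seed both sides with the endpoints of `edge`, then place each remaining
    vertex on the side where it adds more weight to the cut."""
    u, v = edge
    side_a, side_b = {u}, {v}
    for x in range(len(w)):
        if x in (u, v):
            continue
        gain_a = sum(w[x][y] for y in side_b)  # gain if x joins side_a
        gain_b = sum(w[x][y] for y in side_a)  # gain if x joins side_b
        (side_a if gain_a >= gain_b else side_b).add(x)
    return side_a

def greedy_improvement(w, I):
    """Flip single vertices or both endpoints of an edge while the cut grows."""
    I, n = set(I), len(w)
    pairs = [(i, j) for i, j in combinations(range(n), 2) if w[i][j] != 0]
    moves = [(x,) for x in range(n)] + pairs
    improved = True
    while improved:
        improved = False
        for move in moves:
            J = I.symmetric_difference(move)
            if cut_weight(w, J) > cut_weight(w, I):
                I, improved = J, True
    return I

def refhap_like_cut(M, k=10, seed=0):
    """Follow the slide's loop: try the k heaviest edges as seeds, keep the best cut."""
    rng = random.Random(seed)
    w, n = build_graph(M), len(M)
    edges = sorted(((i, j) for i, j in combinations(range(n), 2) if w[i][j] != 0),
                   key=lambda e: w[e[0]][e[1]], reverse=True)
    best = {x for x in range(n) if rng.random() < 0.5}  # random initial subset
    for e in edges[:k]:
        cand = greedy_improvement(w, greedy_init(w, e))
        if cut_weight(w, best) < cut_weight(w, cand):
            best = cand
    return best

def haplotype_from_cut(M, I):
    """Majority vote per locus; fragments outside I vote with flipped alleles."""
    h1 = ""
    for j in range(len(M[0])):
        votes = sum((1 if row[j] == "1" else -1) * (1 if i in I else -1)
                    for i, row in enumerate(M) if row[j] != "-")
        h1 += "1" if votes > 0 else "0"
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2
```

A usage example with four hypothetical fragments over five loci: `M = ["001--", "-0110", "110--", "--001"]`, then `cut = refhap_like_cut(M, k=3)` followed by `haplotype_from_cut(M, cut)` yields the two complementary haplotypes 00110 and 11001 (in either order, depending on which side of the cut is returned).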
Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS
                           Mixed diploid                        Fosmid-based                   Fosmid vs diploid
Libraries                  Mate Pair & Paired End, Genomic DNA  Paired End, 16 Barcoded Pools
Uniquely Mapped            47 Gb                                15 Gb                          1/3rd
Number of Blocks           407                                  40                             1/10th
Av. Block Length           438 bp                               85 kb                          194 x
Max. Block Length          3.7 kb                               691 kb                         186 x
Total Length all Blocks    178 kb                               3.4 Mb                         19 x
% of Phased SNPs           12 %                                 66 %                           5 x
Phasing MHC: Preliminary Results
- Number of blocks: 8
- N50 block length: 793 kb
- Maximum block length: 1.6 Mb
- Total extent of all blocks: 3.8 Mb
- Fraction of MHC phased into haplotype blocks: 95%
- Number of heterozygous SNPs: 8030
- Fraction of SNPs phased: 86%
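For readers unfamiliar with the N50 statistic reported here, a minimal sketch of the usual definition follows: the largest length L such that blocks of length at least L together cover at least half of the total block length. The function name is hypothetical and the lengths in the example are made up for illustration; they are not the block lengths behind the results above.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths (0 for an empty input)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least half of the total is covered
            return length
    return 0

# Illustrative lengths in kb (hypothetical values).
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```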
Acknowledgements
The Life Tech Team: Margret Hoehe, Anita Suk, Thomas Hübsch, Roger Horton, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Gayle McEwen
The Life Tech Team: Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins
Thank You!
Comparison of mapping algorithms: COX haplotype, simulated reads
Phasing MHC