1
Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing
Jorge Duitama1,2, Thomas Huebsch1, Gayle McEwen1, Sabrina Schulz1, Eun-Kyung Suk1, Margret R. Hoehe1
1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
2
MHC: Key Region for Common Diseases & Transplant Medicine
[Figure: map of the MHC region divided into MHC class I, MHC class III and MHC class II, with boundary coordinates 29.74, 31.59, 32.34 and 33.21]
3
MHC: Variation amongst Haplotypes
Variation of MHC haplotypes against the PGF reference
[Figure: 7 further MHC haplotype sequences aligned to the PGF reference sequence across MHC class III and MHC class II, with the RCCX CNV and the HLA-DRB CNV marked]
Variation amongst 8 MHC Haplotypes: Substitutions 7.093, Short Indels
Source: Variation and annotation map for eight MHC haplotypes, Horton et al., Immunogenetics (2008) 60, 1-18
4
Experimental Approach
- 100 individuals, 100 libraries
- 3 x 96-well = 288 fosmid pools
- One pool: 5000 fosmids (40 kb haploid molecules)
- SNP mapping for prioritization of MHC-informative pools
- Targeted enrichment
- SOLiD NGS platform
- Shotgunning complete kb fosmids
- Complete fosmid pool
- Data analysis pipeline
- Identification of kb fosmid sequences
- Phasing molecular fosmid sequences (Haplotype A, Haplotype B)
- Contiguous MHC haplotype sequence
5
Data Analysis Pipeline
[Diagram: two tracks, the SOLiD Standard Pipeline and the In House Project Specific Analysis Pipeline, with the following boxes]
- Fosmid Detection Program
- Read Alignment against Genome
- Fosmid Specific
- Matching Algorithm
- Pairing
- Fosmid Sequences Based Phasing
- Consensus Calling
- SNP Analysis
- Visualization & MHC Database
7
Mapping real data
Pool of 15,000 fosmids, 22 million reads, 50 bp
9
SNP calls: Haploid fosmids vs. genomic DNA
gDNA (columns: # cov, ref, consen, F3 coord):
335 C Y 177/17 3345 T 3191/56 875 G A 862/25 1795 K 722/23 707 S 528/13 2643 1391/20 643 417/23 1074 R 554/21 606 226/21 639 M 167/15 158 89/14 1032 443/26 7 7/4 775 742/26 10 10/5 698 684/29 40 40/4 94 93/9 286 283/16 194 190/22

Fosmid (columns: # cov, ref, consen, F3 coord):
595 C T 572/91 3418 3278/98 2089 G A 2048/98 2238 2194/98 1134 1107/73 3104 2922/98 1033 1014/83 1799 1753/98 1053 1049/83 54 39/22 32 27/23 1374 1355/95 973 946/97 2850 2745/98 49 48/33 1888 1845/95 37 36/20 923 901/97 8411 W 2006/78 253 253/47
10
SNP Calling Accuracy in the MHC
Affymetrix genotype information for 1583 SNP positions as reference standard:
- Homozygous identical with reference: 957
- Heterozygous: 562
- Homozygous different from reference: 64

Compared to variants called from the SOLiD-sequenced genomic DNA sample (15x average read coverage):
- Percentage of error in genotype calling: 3.66%
- False positive rate: 0.1%
- False negative rate: 9.25%

Notes: The genotype calls from the 1583 Affymetrix SNPs are compared with the NGS calls (957 homozygous reference, 562 heterozygous, 64 homozygous non-reference). A false positive (FP) is a genotype where the NGS call shows a variant and Affymetrix does not; the FP rate is the proportion of homozygous-reference positions for which the NGS call is either heterozygous or homozygous non-reference. A false negative (FN) is a genotype where Affymetrix calls a variant and NGS does not. One possible reason for false negatives is that NGS does not have coverage at all Affymetrix genomic positions.
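The FP and FN definitions in the notes can be sketched in a few lines of Python. The genotype coding (0, 1, 2, None), the function name `concordance`, the choice of denominators for the FN and error rates, and the toy calls are all assumptions made for illustration; they are not the 1583 SNPs from the slide.

```python
from typing import Optional, Sequence


def concordance(array_calls: Sequence[int],
                ngs_calls: Sequence[Optional[int]]) -> dict:
    """Compare NGS genotypes with array genotypes at the same positions.

    Genotype coding (an assumption of this sketch):
    0 = homozygous reference, 1 = heterozygous, 2 = homozygous non-reference.
    None in ngs_calls means no NGS call (e.g. no coverage); it is counted
    here as "no variant called".
    """
    hom_ref = variants = fp = fn = errors = 0
    for array_gt, ngs_gt in zip(array_calls, ngs_calls):
        ngs_variant = ngs_gt is not None and ngs_gt != 0
        if array_gt == 0:
            hom_ref += 1
            if ngs_variant:          # NGS shows a variant, the array does not
                fp += 1
        else:
            variants += 1
            if not ngs_variant:      # array shows a variant, NGS does not
                fn += 1
        if ngs_gt != array_gt:       # any genotype disagreement
            errors += 1
    total = hom_ref + variants
    return {
        "fp_rate": fp / hom_ref if hom_ref else 0.0,
        "fn_rate": fn / variants if variants else 0.0,
        "error_rate": errors / total if total else 0.0,
    }


# Toy data, not taken from the study.
array = [0, 0, 0, 0, 1, 1, 2, 1]
ngs = [0, 0, 1, 0, 1, None, 2, 0]
print(concordance(array, ngs))
```

Treating a missing NGS call as a false negative mirrors the remark that missing coverage may explain part of the FN rate.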
12
Fosmid Detection Algorithm
- Assign each read to a single 1 kb long bin
- Select bins with more than 5 reads
- Perform allele calls for each heterozygous SNP
- Mark bins with heterozygous calls
- Cluster adjacent bins as belonging to the same fosmid if:
  - the gap distance between them is less than 10 kb, and
  - there are no bins with heterozygous SNPs between them
- Keep fosmids with lengths between 3 kb and 60 kb

Notes: The slide shows (A) the HT mapping of 14 fosmids in the MHC region. (B) shows the same fosmids as detected through SOLiD sequencing, together with some landmark genes in regions where HT mapping cannot reliably predict the presence of fosmids. (C) shows detailed read statistics for all 19 detected fosmids of this individual well. The start and end positions mark the beginning and end of each fosmid. The bins column shows the length in kb, the reads column lists the number of reads for the fosmid, and the cov/base column shows the average coverage per base for each fosmid.
UCSC Genome Browser: Kent et al., Genome Res. 12(6):
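The binning and clustering steps can be sketched in Python as follows. This is a minimal sketch, not the in-house program: the function name `detect_fosmids`, the use of read start positions for binning, and the choice to exclude heterozygous bins from every cluster are assumptions. Heterozygous bins are passed in as input, since allele calling is outside the scope of the example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

BIN_SIZE = 1_000                   # 1 kb bins
MIN_READS = 5                      # keep bins with more than 5 reads
MAX_GAP = 10_000                   # join bins separated by less than 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000   # accepted fosmid length range


def detect_fosmids(read_starts: Iterable[int],
                   het_bins: Set[int]) -> List[Tuple[int, int]]:
    """Return (start, end) intervals of candidate fosmids on one chromosome.

    read_starts: mapped start coordinate of each read; each read is assigned
                 to exactly one bin by its start position (assumption).
    het_bins:    indices of bins that contain a heterozygous allele call.
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    covered = sorted(b for b, n in counts.items() if n > MIN_READS)

    clusters: List[List[int]] = []
    for b in covered:
        if b in het_bins:
            continue  # heterozygous bins only act as separators here
        if clusters:
            last = clusters[-1][-1]
            gap = (b - last - 1) * BIN_SIZE
            het_between = any(last < h < b for h in het_bins)
            if gap < MAX_GAP and not het_between:
                clusters[-1].append(b)
                continue
        clusters.append([b])

    fosmids = []
    for cluster in clusters:
        start = cluster[0] * BIN_SIZE
        end = (cluster[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids


# Toy input: bins 100-170 and bin 300 each hold 6 reads; bin 140 is heterozygous.
reads = [b * BIN_SIZE + 10 * i
         for b in list(range(100, 171)) + [300] for i in range(6)]
print(detect_fosmids(reads, het_bins={140}))
```

In the toy input the heterozygous bin splits one covered stretch into a 40 kb and a 30 kb candidate, and the isolated 1 kb bin is discarded by the length filter.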
13
Size distribution of read-contigs
[Figure: size distribution of read-contigs from fosmid detection, annotated "20-50 kb fosmid sized contigs"]
15
Haplotyping
The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.

Alleles per locus (columns: Locus, Event, Alleles):
1 SNV C,T
2 Deletion C,-
3 A,G
4 Insertion -,GC

Alleles per haplotype (columns: Locus, Event, Alleles Hap 1, Alleles Hap 2):
1 SNV T C
2 Deletion -
3 A G
4 Insertion GC
16
Single Individual Haplotyping
Input: Matrix M of m fragments covering n loci
[Matrix illustration: rows f1, f2, f3, ..., fm; columns Locus 1, 2, 3, 4, 5, ..., n; "-" marks a locus not covered by a fragment]
20
ReFHap Problem Formulation
For two alleles a1, a2: [score formula]
For two rows i1, i2 of M: [score formula]
[Example: fragments f1 and f2 with per-locus scores, giving s(M,1,2) = 1]
21
ReFHap Problem Formulation
For a cut I of rows of M
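The formulas on these two slides did not survive in the transcript, so the following Python sketch rests on an assumed scoring: two alleles score +1 when both are called and differ, -1 when both are called and agree, and 0 when either is missing. A row score sums allele scores over all loci, and the score of a cut I sums row scores over all pairs with one row inside I and one outside. This choice is consistent with the Max-Cut reduction that follows, where fragments from different haplotypes should land on opposite sides of the cut, but the function names and the toy matrix are hypothetical.

```python
from typing import List, Set

GAP = "-"  # locus not covered by the fragment


def allele_score(a1: str, a2: str) -> int:
    """Assumed score: +1 if both calls exist and differ, -1 if they agree, else 0."""
    if a1 == GAP or a2 == GAP:
        return 0
    return 1 if a1 != a2 else -1


def row_score(M: List[str], i1: int, i2: int) -> int:
    """s(M, i1, i2): sum of allele scores over all loci of rows i1 and i2."""
    return sum(allele_score(a, b) for a, b in zip(M[i1], M[i2]))


def cut_score(M: List[str], I: Set[int]) -> int:
    """s(M, I): sum of row scores over pairs with one row in I and one outside."""
    outside = [i for i in range(len(M)) if i not in I]
    return sum(row_score(M, i, j) for i in I for j in outside)


# Toy matrix: rows 0 and 2 were drawn from haplotype 01101, rows 1 and 3 from 10010.
M = ["011--",
     "-0010",
     "--101",
     "100--"]
print(row_score(M, 0, 1))      # fragments from different haplotypes
print(cut_score(M, {0, 2}))    # the cut that separates the two haplotypes
print(cut_score(M, {0, 1}))    # a cut that mixes them
```

Under this scoring, the cut that separates the fragments by haplotype of origin scores higher than a cut that mixes them, which is what makes maximizing s(M, I) a sensible objective.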
22
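ReFHap Algorithm
- Reduce the problem to Max-Cut
- Solve Max-Cut
- Build haplotypes according to the cut
[Figure: example matrix with fragments f1, f2, f3, f4 over loci 1 to 5, the corresponding weighted graph on vertices 1 to 4, and the two resulting haplotypes]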
23
ReFHap Algorithm
Build G=(V,E,w) from M
Sort E from largest to smallest weight
Init I with a random subset of V
For each e in the first k edges:
    I' ← GreedyInit(G,e)
    I' ← GreedyImprovement(G,I')
    If s(M, I) < s(M, I') then I ← I'
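The loop above can be sketched in Python under several assumptions. The slides do not define GreedyInit or GreedyImprovement, so `greedy_init` (split the seed edge, then place each remaining vertex on the side that cuts more weight) and `greedy_improvement` (flip single vertices while the cut grows) are hypothetical stand-ins. The edge weight (+1 per differing call, -1 per agreeing call) is likewise assumed, and `consensus` is an illustrative majority vote for building haplotypes from the cut.

```python
import random
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

GAP = "-"                      # locus not covered by a fragment
Edge = Tuple[int, int]


def row_score(f1: str, f2: str) -> int:
    """Assumed weight: +1 per shared locus with different calls, -1 per equal call."""
    return sum((1 if a != b else -1)
               for a, b in zip(f1, f2) if a != GAP and b != GAP)


def build_graph(M: List[str]) -> Dict[Edge, int]:
    """G=(V,E,w): one vertex per fragment, one edge per overlapping pair."""
    w: Dict[Edge, int] = {}
    for i, j in combinations(range(len(M)), 2):
        if any(a != GAP and b != GAP for a, b in zip(M[i], M[j])):
            w[(i, j)] = row_score(M[i], M[j])
    return w


def cut_weight(w: Dict[Edge, int], I: Set[int]) -> int:
    """Sum of weights of edges with exactly one endpoint in I."""
    return sum(wt for (i, j), wt in w.items() if (i in I) != (j in I))


def greedy_init(w: Dict[Edge, int], n: int, seed: Edge) -> Set[int]:
    """Hypothetical GreedyInit: split the seed edge, then add vertices greedily."""
    u, v = seed
    I, placed = {u}, {u, v}
    for x in range(n):
        if x in placed:
            continue
        gain_in = gain_out = 0
        for (i, j), wt in w.items():
            if x not in (i, j):
                continue
            y = j if i == x else i
            if y in placed:
                if y in I:
                    gain_out += wt   # edge is cut if x stays outside I
                else:
                    gain_in += wt    # edge is cut if x joins I
        if gain_in > gain_out:
            I.add(x)
        placed.add(x)
    return I


def greedy_improvement(w: Dict[Edge, int], n: int, I: Set[int]) -> Set[int]:
    """Hypothetical GreedyImprovement: flip single vertices while the cut grows."""
    I, improved = set(I), True
    while improved:
        improved = False
        for x in range(n):
            J = I ^ {x}
            if cut_weight(w, J) > cut_weight(w, I):
                I, improved = J, True
    return I


def solve(M: List[str], k: int, rng: random.Random) -> Set[int]:
    """Follow the slide's loop: try the k heaviest edges as seeds, keep the best cut."""
    n, w = len(M), build_graph(M)
    edges = sorted(w, key=w.get, reverse=True)
    I = {x for x in range(n) if rng.random() < 0.5}   # random initial subset of V
    for e in edges[:k]:
        cand = greedy_improvement(w, n, greedy_init(w, n, e))
        if cut_weight(w, I) < cut_weight(w, cand):
            I = cand
    return I


def consensus(M: List[str], rows: Iterable[int]) -> str:
    """Majority allele per locus among the given fragments ('-' if uncovered)."""
    hap = []
    for column in zip(*(M[i] for i in rows)):
        calls = [a for a in column if a != GAP]
        hap.append(max(set(calls), key=calls.count) if calls else GAP)
    return "".join(hap)


M = ["011--", "-0010", "--101", "100--"]   # toy fragments, not real data
I = solve(M, k=2, rng=random.Random(0))
others = set(range(len(M))) - I
print(sorted([consensus(M, I), consensus(M, others)]))
```

Because a cut and its complement have the same weight, the sketch may return either side of the partition as I; the pair of haplotypes is the same in both cases.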
24
ReFHap Algorithm
Classical greedy algorithm
[Figure: two drawings of the example graph on vertices 1, 2, 3, 4]
25
ReFHap Algorithm
Edge flipping
[Figure: two drawings of the example graph on vertices 1, 2, 3, 4]
26
Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS
Metric | Mixed diploid | Fosmid-based | Ratio
Libraries | Mate Pair & Paired End, genomic DNA | Paired End, 16 barcoded pools |
Uniquely mapped | 47 Gb | 15 Gb | 1/3rd
Number of blocks | 407 | 40 | 1/10th
Av. block length | 438 bp | 85 kb | 194 x
Max. block length | 3.7 kb | 691 kb | 186 x
Total length of all blocks | 178 kb | 3.4 Mb | 19 x
% of phased SNPs | 12 % | 66 % | 5 x
27
Phasing MHC: Preliminary Results
- Number of blocks: 8
- N50 block length: 793 kb
- Maximum block length: 1.6 Mb
- Total extent of all blocks: 3.8 Mb
- Fraction of MHC phased into haplotype blocks: 95%
- Number of heterozygous SNPs: 8030
- Fraction of SNPs phased: 86%
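For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows the usual definition: the largest length L such that blocks of length at least L cover at least half of the total block length. The function name `n50` and the toy lengths are illustrative and are not the block lengths from this study.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length of the block at which the running total, taken from the
    longest block downwards, first reaches half of the summed length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no blocks given")


# Toy block lengths in kb, not taken from the study.
blocks = [10, 20, 30, 40, 50]
print(n50(blocks))
```

Unlike the average block length, N50 is dominated by the long blocks, so it is less sensitive to many very short blocks.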
28
Acknowledgements
Margret Hoehe, Anita Suk, Thomas Hübsch, Roger Horton, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Gayle McEwen
The Life Tech Team: Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins
Thank You!
29
Comparison of mapping algorithms
COX Haplotype simulated reads
30
Phasing MHC