Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing Jorge Duitama1,2, Thomas Huebsch1, Gayle McEwen1, Sabrina Schulz1, Eun-Kyung Suk1,

Presentation transcript:

Bioinformatics Pipeline for Fosmid-Based Molecular Haplotype Sequencing
Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)
1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Mapping slides

MHC: Key Region for Common Diseases & Transplant Medicine
(Figure: map of the MHC with the regions MHC class I, MHC class III and MHC class II; coordinate labels 29.74, 31.59, 32.34, 33.21)

MHC: Variation amongst Haplotypes
(Figure: variation of MHC haplotypes against the PGF reference; tracks for the PGF reference sequence and 7 further MHC haplotype sequences, with MHC class III, MHC class II, RCCX CNV and HLA-DRB CNV marked)
Variation amongst 8 MHC haplotypes: 37,451 substitutions, 7,093 short indels
Variation and annotation map for eight MHC haplotypes, Horton et al. Immunogenetics (2008) 60, 1-18

Experimental Approach
- 100 individuals, 100 libraries
- 3 x 96-well = 288 fosmid pools
- 40 kb haploid molecules; 5000 fosmids; one pool
- SNP mapping for prioritization of MHC informative pools
- Targeted enrichment
- SOLiD NGS platform
- Shotgunning complete 40 kb fosmids; complete fosmid pool
- Data analysis pipeline
- Identification of 40 kb fosmid sequences
- Phasing molecular fosmid sequences (Haplotype A, Haplotype B)
- Contiguous MHC haplotype sequence

Data Analysis Pipeline
Stages (flowchart): Fosmid Detection Program; Read Alignment against Genome; Fosmid Specific Matching Algorithm; Pairing; Fosmid Sequences Based Phasing; Consensus Calling; SNP Analysis; Visualization & MHC Database
Groups: In House Project Specific Analysis Pipeline; SOLiD Standard Pipeline

Mapping real data
Pool of 15,000 fosmids, 22 million reads, 50 bp

SNP calls: Haploid fosmids vs. genomic DNA

gDNA
# cov   ref consen   F3        coord
335     C Y          177/17    62511614
3345    T            3191/56   62512095
875     G A          862/25    62513689
1795    K            722/23    62513754
707     S            528/13    62515375
2643                 1391/20   62517737
643                  417/23    62518998
1074    R            554/21    62522445
606                  226/21    62524689
639     M            167/15    62532474
158                  89/14     62533464
1032                 443/26    62534973
7                    7/4       62537153
775                  742/26    62540402
10                   10/5      62540465
698                  684/29    62541769
40                   40/4      62542550
94                   93/9      62542574
286                  283/16    62543011
194                  190/22    62543067

Fosmid
# cov   ref consen   F3        coord
595     C T          572/91    62511614
3418                 3278/98   62512095
2089    G A          2048/98   62513689
2238                 2194/98   62513754
1134                 1107/73   62515375
3104                 2922/98   62517737
1033                 1014/83   62518998
1799                 1753/98   62522445
1053                 1049/83   62524689
54                   39/22     62527964
32                   27/23     62529870
1374                 1355/95   62532474
973                  946/97    62533464
2850                 2745/98   62534973
49                   48/33     62537153
1888                 1845/95   62540402
37                   36/20     62540465
923                  901/97    62541769
8411    W            2006/78   62542258
253                  253/47    62542550

SNP Calling Accuracy in the MHC
Affymetrix genotype information for 1583 SNP positions as reference standard:
- Homozygous identical with reference: 957
- Heterozygous: 562
- Homozygous different from reference: 64
Compared to variants called from the SOLiD sequenced genomic DNA sample (15x average read coverage):
- Percentage of error in genotype calling: 3.66%
- False positive rate: 0.1%
- False negative rate: 9.25%
Notes: The genotype calls from the 1583 Affymetrix SNPs are compared with the NGS calls (957 homozygous reference, 562 heterozygous, 64 homozygous non-reference). FP: proportion of homozygous-reference genotypes for which the NGS call is either heterozygous or homozygous non-reference, i.e. genotypes where the NGS call shows a variant and Affymetrix does not. FN: genotypes where Affymetrix calls a variant and NGS does not; one reason could be that NGS does not have coverage at all Affymetrix genomic positions.
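The FP and FN definitions in the notes above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the slide: the genotype labels, the function name `concordance`, the choice of denominators for the FN and error rates, and the rule that a position without NGS coverage counts as "no variant called" are all assumptions made here.

```python
HOM_REF, HET, HOM_ALT = "hom_ref", "het", "hom_alt"  # hypothetical genotype labels

def concordance(array_calls, ngs_calls):
    """Compare array genotypes (reference standard) with NGS genotypes.

    Both arguments map a genomic position to a genotype label.
    Assumption: a position missing from ngs_calls (no coverage) is
    treated as if NGS had called homozygous reference.
    """
    hom_ref_total = sum(1 for g in array_calls.values() if g == HOM_REF)
    variant_total = len(array_calls) - hom_ref_total
    errors = fp = fn = 0
    for pos, truth in array_calls.items():
        call = ngs_calls.get(pos, HOM_REF)
        if call != truth:
            errors += 1                      # any genotype mismatch
        if truth == HOM_REF and call != HOM_REF:
            fp += 1                          # NGS shows a variant, array does not
        if truth != HOM_REF and call == HOM_REF:
            fn += 1                          # array shows a variant, NGS does not
    return {
        "error_rate": errors / len(array_calls) if array_calls else 0.0,
        "fp_rate": fp / hom_ref_total if hom_ref_total else 0.0,
        "fn_rate": fn / variant_total if variant_total else 0.0,
    }

# Toy data, not taken from the slide.
array_calls = {1: HOM_REF, 2: HOM_REF, 3: HET, 4: HOM_ALT}
ngs_calls = {1: HOM_REF, 2: HET, 3: HET}     # position 4 has no coverage
print(concordance(array_calls, ngs_calls))
```

With other denominators (for example, restricting to positions that have NGS coverage) the rates would differ, so the sketch should not be expected to reproduce the percentages on the slide.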

Fosmids Detection
Fosmid Detection Algorithm
1. Assign each read to a single 1 kb long bin.
2. Select bins with more than 5 reads.
3. Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls.
4. Cluster adjacent bins as belonging to the same fosmid if:
   - the gap distance between them is less than 10 kb, and
   - there are no bins with heterozygous SNPs between them.
5. Keep fosmids with lengths between 3 kb and 60 kb.
Notes: The slide shows (A) the HT mapping of 14 fosmids in the MHC region. (B) shows the same fosmids as detected through SOLiD sequencing, together with some landmark genes where HT mapping is not able to reliably predict the presence of fosmids in that region. (C) shows detailed read statistics for all 19 detected fosmids of this individual well. The start and end positions mark the beginning and end of each fosmid. The bins column shows the length in kb, the reads column lists the number of reads for the fosmid, and the cov/base column shows the average coverage per base for each fosmid.
UCSC Genome Browser, http://genome.ucsc.edu/; Kent et al. 2002, Genome Res. 12(6):996-1006.
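The binning and clustering steps listed above can be sketched in Python roughly as follows. This is an illustrative reading of the slide, not the authors' program: the function name `detect_fosmids`, the input format (read start positions plus a precomputed set of bin indices with heterozygous calls), and the decision to leave heterozygous bins out of the reported fosmids are assumptions. Allele calling itself is outside the sketch.

```python
from collections import Counter

BIN_SIZE = 1_000                   # 1 kb bins
MIN_READS = 5                      # keep bins with more than 5 reads
MAX_GAP = 10_000                   # bins closer than 10 kb may be merged
MIN_LEN, MAX_LEN = 3_000, 60_000   # accepted fosmid length range

def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of putative fosmids on one chromosome.

    read_starts: iterable of read start coordinates.
    het_bins: bin indices marked as containing heterozygous calls
              (assumed to be computed elsewhere).
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    covered = sorted(b for b, c in counts.items() if c > MIN_READS)

    clusters, current = [], []
    for b in covered:
        if b in het_bins:
            continue               # assumption: het bins are never part of a fosmid
        if current:
            gap = (b - current[-1] - 1) * BIN_SIZE
            het_between = any(current[-1] < h < b for h in het_bins)
            if gap >= MAX_GAP or het_between:
                clusters.append(current)
                current = []
        current.append(b)
    if current:
        clusters.append(current)

    fosmids = []
    for c in clusters:
        start, end = c[0] * BIN_SIZE, (c[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids

# Toy data: 6 reads in each of bins 100..139, plus a short 2 kb island.
reads = [b * BIN_SIZE + i for b in range(100, 140) for i in range(6)]
reads += [b * BIN_SIZE + i for b in (200, 201) for i in range(6)]
print(detect_fosmids(reads))                  # [(100000, 140000)]
print(detect_fosmids(reads, het_bins={120}))  # [(100000, 120000), (121000, 140000)]
```

A heterozygous call inside a candidate interval suggests that two overlapping clones from different haplotypes are present in the pool, which is why the sketch splits the interval at such bins instead of merging across them.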

Fosmids Detection: Size distribution of read-contigs
(Figure: histogram of read-contig sizes; the 20 – 50 kb range is labeled as fosmid-sized contigs)

Haplotyping
The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.

Unphased alleles
Locus   Event       Alleles
1       SNV         C,T
2       Deletion    C,-
3       SNV         A,G
4       Insertion   -,GC

Phased alleles
Locus   Event       Alleles Hap 1 / Hap 2
1       SNV         T / C
2       Deletion    -
3       SNV         A / G
4       Insertion   GC

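Single Individual Haplotyping
Input: Matrix M of m fragments covering n loci
(Figure: matrix with one row per fragment f1, f2, f3, ..., fm and one column per locus 1, 2, 3, 4, 5, ..., n; "-" entries appear in the matrix)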

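ReFHap Problem Formulation
A score is defined for two alleles a1, a2, and a score s(M, i1, i2) for two rows i1, i2 of M.
(Figure: worked example with fragments f1 and f2, per-locus scores including -1, and the total s(M,1,2) = 1)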

ReFHap Problem Formulation (continued)
For a cut I of the rows of M, a score s(M, I) is defined.
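The score formulas on these two slides were images and are not present in the transcript. The Python sketch below shows one scoring scheme that is compatible with the surviving values (a per-locus score of -1 and a pairwise total s(M,1,2) = 1). It assumes that two called alleles score +1 when they differ and -1 when they agree, that a missing call ("-") scores 0, that s(M, i1, i2) sums these over all loci, and that s(M, I) sums the pairwise scores over all pairs of rows separated by the cut. The function names and the toy matrix are hypothetical.

```python
def allele_score(a1, a2):
    """Assumed per-locus score: +1 if both alleles are called and differ,
    -1 if both are called and equal, 0 if either is missing ('-')."""
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1

def fragment_score(f1, f2):
    """Assumed s(M, i1, i2): sum of per-locus scores of two rows."""
    return sum(allele_score(a, b) for a, b in zip(f1, f2))

def cut_score(M, I):
    """Assumed s(M, I): total score of all row pairs separated by the cut I."""
    I = set(I)
    outside = [j for j in range(len(M)) if j not in I]
    return sum(fragment_score(M[i], M[j]) for i in I for j in outside)

# Toy fragment matrix (rows are fragments, columns are loci); not from the slide.
M = ["0011-",
     "-1001",
     "110--",
     "--110"]
print(fragment_score(M[0], M[1]))  # 3
print(cut_score(M, {0, 3}))        # 10
```

Under this scheme a high pairwise score means two fragments disagree at most shared loci and therefore probably come from different chromosome copies, so a cut with a high score separates the fragments by haplotype.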

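ReFHap Algorithm
1. Reduce the problem to Max-Cut.
2. Solve Max-Cut.
3. Build haplotypes according to the cut.
(Figure: fragment matrix with rows f1 to f4 over loci 1 to 5, the corresponding graph with weighted edges, and the resulting haplotypes h1 = 00110, h2 = 11001)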
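The last step, building haplotypes from a cut, could look like the following Python sketch. The slide does not say how the consensus is formed, so the majority vote used here is an assumption, as are the function name `build_haplotypes` and the toy matrix. Fragments on one side of the cut vote for their own allele, and fragments on the other side vote for the opposite allele.

```python
def build_haplotypes(M, I):
    """Derive two complementary haplotypes from a fragment matrix and a cut.

    M: list of equal-length strings over '0', '1', '-' (one per fragment).
    I: indices of the fragments assigned to the first haplotype.
    Assumption: per-locus majority vote; ties and uncovered loci give '-'.
    """
    I = set(I)
    flip = {"0": "1", "1": "0"}
    h1 = []
    for j in range(len(M[0])):
        votes = {"0": 0, "1": 0}
        for i, frag in enumerate(M):
            allele = frag[j]
            if allele == "-":
                continue
            # Fragments outside the cut support the complementary allele.
            votes[allele if i in I else flip[allele]] += 1
        if votes["0"] == votes["1"]:
            h1.append("-")
        else:
            h1.append("0" if votes["0"] > votes["1"] else "1")
    h2 = [flip.get(a, "-") for a in h1]
    return "".join(h1), "".join(h2)

# Toy fragment matrix chosen for illustration; not the matrix from the slide.
M = ["0011-",
     "-1001",
     "110--",
     "--110"]
print(build_haplotypes(M, {0, 3}))  # ('00110', '11001')
```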

ReFHap Algorithm
1. Build G = (V, E, w) from M
2. Sort E from largest to smallest weight
3. Init I with a random subset of V
4. For each e in the first k edges:
   - I' ← GreedyInit(G, e)
   - I' ← GreedyImprovement(G, I')
   - If s(M, I) < s(M, I') then I ← I'
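The loop above can be sketched in Python as shown below. This is a simplified illustration, not the ReFHap implementation: the edge weights use an assumed scoring scheme (+1 per locus where two fragments disagree, -1 where they agree), `greedy_init` is a plain greedy Max-Cut construction seeded with one edge, and `greedy_improvement` only tries single-vertex moves (the edge-flipping move mentioned in the talk is left out). All function names, the parameter `seed`, and the toy matrix are hypothetical.

```python
import itertools
import random

def fragment_score(f1, f2):
    """Assumed edge weight: +1 per locus where both calls differ, -1 where equal."""
    return sum((1 if a != b else -1)
               for a, b in zip(f1, f2) if a != "-" and b != "-")

def build_graph(M):
    """Edges (i, j), i < j, between fragments sharing at least one called locus."""
    w = {}
    for i, j in itertools.combinations(range(len(M)), 2):
        if any(a != "-" and b != "-" for a, b in zip(M[i], M[j])):
            w[(i, j)] = fragment_score(M[i], M[j])
    return w

def weight(w, i, j):
    return w.get((min(i, j), max(i, j)), 0)

def cut_value(w, I):
    """Total weight of the edges crossing the cut I."""
    return sum(wt for (i, j), wt in w.items() if (i in I) != (j in I))

def greedy_init(n, w, edge):
    """Put the endpoints of `edge` on opposite sides, then place every other
    vertex on the side that adds more weight to the cut."""
    u, v = edge
    side = {u: True, v: False}            # True means "in I"
    for x in range(n):
        if x in side:
            continue
        gain_in = sum(weight(w, x, y) for y, s in side.items() if not s)
        gain_out = sum(weight(w, x, y) for y, s in side.items() if s)
        side[x] = gain_in >= gain_out
    return {x for x, s in side.items() if s}

def greedy_improvement(n, w, I):
    """Move single vertices across the cut while the cut value increases."""
    I = set(I)
    improved = True
    while improved:
        improved = False
        for x in range(n):
            J = I ^ {x}
            if cut_value(w, J) > cut_value(w, I):
                I, improved = J, True
    return I

def max_cut_heuristic(M, k=3, seed=0):
    rng = random.Random(seed)
    n, w = len(M), build_graph(M)
    edges = sorted(w, key=w.get, reverse=True)
    best = {i for i in range(n) if rng.random() < 0.5}   # random initial cut
    for e in edges[:k]:
        cand = greedy_improvement(n, w, greedy_init(n, w, e))
        if cut_value(w, best) < cut_value(w, cand):
            best = cand
    return best

# Toy fragment matrix; not taken from the slides.
M = ["0011-", "-1001", "110--", "--110"]
I = max_cut_heuristic(M)
print(sorted(I), cut_value(build_graph(M), I))
```

Seeding the greedy construction with the heaviest edges starts each run from a pair of fragments that very likely belong to different haplotypes; trying the first k edges gives k restarts, and the best cut is kept.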

ReFHap Algorithm: Classical greedy algorithm
(Figure: two drawings of a graph on vertices 1, 2, 3, 4)

ReFHap Algorithm: Edge flipping
(Figure: two drawings of a graph on vertices 1, 2, 3, 4)

Phasing the MHC: Mixed Diploid vs Fosmid-Based

                          Mixed Diploid             Fosmid-Based          Ratio
NGS Libraries             Mate Pair & Paired End,   Paired End,
                          Genomic DNA               16 Barcoded Pools
Uniquely Mapped           47 Gb                     15 Gb                 1/3rd
Number of Blocks          407                       40                    1/10th
Av. Block Length          438 bp                    85 kb                 194 x
Max. Block Length         3.7 kb                    691 kb                186 x
Total Length all Blocks   178 kb                    3.4 Mb                19 x
% of Phased SNPs          12 %                      66 %                  5 x

Phasing MHC: Preliminary Results
- Number of blocks: 8
- N50 block length: 793 kb
- Maximum block length: 1.6 Mb
- Total extent of all blocks: 3.8 Mb
- Fraction of MHC phased into haplotype blocks: 95%
- Number of heterozygous SNPs: 8030
- Fraction of SNPs phased: 86%
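For readers unfamiliar with the N50 statistic reported above, here is a small Python sketch of the usual definition: the largest length L such that blocks of length at least L together cover at least half of the total block length. The function name `n50` and the example lengths are hypothetical; the individual block lengths behind the slide's figures are not given in the transcript.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths."""
    if not lengths:
        raise ValueError("n50 of an empty collection is undefined")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the total length
            return length

# Toy block lengths in kb, for illustration only.
print(n50([10, 20, 30, 40]))  # 30
```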

Acknowledgements
Thank You!
The Life Tech Team: Margret Hoehe, Anita Suk, Thomas Hübsch, Roger Horton, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Gayle McEwen
The Life Tech Team: Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins

Comparison of mapping algorithms: COX haplotype, simulated reads

Phasing MHC