University of Connecticut School of Engineering

Hierarchical Genome Assembly
Anas Al-Okaily and Ion Măndoiu
Department of Computer Science and Engineering, University of Connecticut

INTRODUCTION

Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. Although high-quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the long-insert libraries are costly to generate. Recently, the GAGE-B study (Magoc et al., 2013) showed that remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage. In this poster, we introduce and empirically evaluate a novel hierarchical genome assembly (HGA) methodology that takes further advantage of such very high coverage by independently assembling disjoint subsets of reads, combining the assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads.

THE CHALLENGE OF ULTRA-DEEP DATA ASSEMBLY

Assembly quality delivered by current assemblers improves only marginally or gets worse for ultra-deep genome sequencing data.

[Figure panels: Magoc et al.; Lonardi et al.; HiSeq datasets (100bp); MiSeq datasets (250bp)]

EVALUATED ASSEMBLERS

Assembler: Reference
ABySS 1.5.1: Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., and Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome Research, 19(6), 1117–
CABOG 7.0: Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C., and Sutton, G. (2008). Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24), 2818–2824.
MIRA: Barthelson, R., McFarlin, A. J., Rounsley, S. D., and Young, S. (2011). Plantagora: modeling whole genome sequencing and assembly of plant genomes. PLoS One, 6(12), e
MaSuRCA: Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., and Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669–2677.
SGA: Simpson, J. T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research, 22(3), 549–556.
SOAPdenovo 2.04: Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1), 18.
SPAdes: Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5), 455–477.
Velvet: Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821–829.

DATASETS AND ACCURACY METRICS

Assemblies were evaluated using multiple metrics computed with QUAST (Gurevich et al., 2013), including:
- Number of contigs
- Number of known genes completely or partially covered by the contigs
- N50: the largest length L such that contigs of length at least L together cover at least 50% of the total length of the assembly
- NA50: computed like N50 after breaking misassembled contigs
- Genome fraction: percentage of genome bases aligned to at least one contig
- Duplication ratio: number of aligned contig bases divided by the number of reference bases aligned to at least one contig
- Number of global and local misassemblies
- Mismatches and indels per 100 kb
- Unaligned contig length

IDENTIFIED GENE RESULTS

[Figure panels: HiSeq datasets (100bp); MiSeq datasets (250bp)]

BEST HGA PARAMETERS

[Panel content not preserved in the transcript]

CONCLUSIONS

Empirical evaluation of this methodology for 8 leading assemblers on 7 GAGE-B bacterial datasets, consisting of 100bp Illumina HiSeq and 250bp Illumina MiSeq reads, shows that HGA leads to a significant improvement in assembly quality for all evaluated assemblers and all datasets. In ongoing work we are evaluating the HGA methodology on ultra-deep BAC sequencing data.

Availability: Version of HGA, implemented in Python, is available at

Acknowledgements: This work has been partially supported by the Agriculture and Food Research Initiative Competitive Grant No from the USDA National Institute of Food and Agriculture.

References

Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075.
Lonardi, S., Mirebrahim, H., Wanamaker, S., Alpert, M., Ciardo, G., Duma, D., and Close, T. J. (2015). When less is more: "slicing" sequencing data improves read decoding accuracy and de novo assembly quality. Bioinformatics, advance access.
Magoc, T., Pabinger, S., Canzar, S., Liu, X., Su, Q., Puiu, D., Tallon, L. J., and Salzberg, S. L. (2013). GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics, 29(14), 1718–1725.

CORRECTED N50 RESULTS

[Panel content not preserved in the transcript]

ASSEMBLY FLOWS

The proposed hierarchical assembly flows consist of the following steps:
1. Partitioning the reads into p disjoint parts, where p = 2, 4, or 8
2. Independently assembling each part using one of the 8 evaluated assemblers, with k-mer sizes from 21 to 101 in increments of 10
3. Merging the resulting contigs, or alternatively combining them using the Velvet assembler with k-mer size 31 and expected coverage = p
4. Reassembling the merged or combined contigs along with the original reads using SPAdes, again with k-mer sizes from 21 to 101 in increments of 10

For each assembler, reported HGA results are for the assembly with the largest (uncorrected) N50 over the tested values of p and k-mer sizes.
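The partitioning step and the parameter ranges of the assembly flows can be illustrated with a minimal Python sketch. The poster does not say how reads are assigned to parts, so the round-robin scheme below is an assumption, and the names `partition_reads`, `KMER_SIZES`, `PART_COUNTS`, and `parameter_grid` are hypothetical. Each list element is assumed to stand for one read pair, so that mates stay in the same part.

```python
from itertools import product

# k-mer sizes from 21 to 101 in increments of 10, as stated in the flows
KMER_SIZES = list(range(21, 102, 10))
# Numbers of disjoint parts evaluated on the poster
PART_COUNTS = (2, 4, 8)


def partition_reads(reads, p):
    """Split reads into p disjoint parts.

    Round-robin assignment is an assumption made for illustration;
    the poster only requires that the parts be disjoint.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    parts = [[] for _ in range(p)]
    for index, read in enumerate(reads):
        parts[index % p].append(read)
    return parts


def parameter_grid():
    """Enumerate every (p, k) combination covered by the flows."""
    return list(product(PART_COUNTS, KMER_SIZES))
```

Under these assumptions, every read lands in exactly one part and the part sizes differ by at most one, which keeps per-part coverage roughly equal to the original coverage divided by p.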
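The selection rule (keep the assembly with the largest uncorrected N50 over the tested values of p and k-mer size) can be sketched as follows. This is an illustrative reimplementation of the standard N50 definition, not the QUAST code used for the poster; the function names `n50` and `best_assembly` and the layout of the `results` dictionary are assumptions.

```python
def n50(contig_lengths):
    """Return N50: the largest length L such that contigs of length
    at least L cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def best_assembly(results):
    """Pick the (p, k) key whose contigs have the largest N50.

    `results` is a hypothetical mapping from (p, k) to the list of
    contig lengths produced by that run.
    """
    return max(results, key=lambda key: n50(results[key]))
```

For example, contigs of lengths 2 through 10 total 54 bases; the three longest (10, 9, 8) already sum to 27, so N50 is 8. NA50 would apply the same computation after misassembled contigs are broken at their breakpoints, which requires alignment to a reference and is left to QUAST.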