Apollo: A Sequencing-Technology-Independent, Scalable,

Slides:

Advertisements

Similar presentations

Marius Nicolae Computer Science and Engineering Department

Advertisements

MCB Lecture #15 Oct 23/14 De novo assemblies using PacBio.

… Hidden Markov Models Markov assumption: Transition model:

Hidden Markov Model Special case of Dynamic Bayesian network Single (hidden) state variable Single (observed) observation variable Transition probability.

Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.

Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.

BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG

Delon Toh. Pitfalls of 2 nd Gen Amplification of cDNA – Artifacts – Biased coverage Short reads – Medium ~100bp for Illumina – 700bp for 454.

Accelerating Read Mapping with FastHASH †† ‡ †† Hongyi Xin † Donghyuk Lee † Farhad Hormozdiari ‡ Samihan Yedkar † Can Alkan § Onur Mutlu † † † Carnegie.

Hidden Markov Models for Sequence Analysis 4

Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.

Protein Folding Programs By Asım OKUR CSE 549 November 14, 2002.

Current Challenges in Metagenomics: an Overview Chandan Pal 17 th December, GoBiG Meeting.

Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA Haixiang Shi Bertil Schmidt Weiguo Liu Wolfgang Müller-Wittig.

billion-piece genome puzzle

Scalable and Coordinated Scheduling for Cloud-Scale computing

Qq q q q q q q q q q q q q q q q q q q Background: DNA Sequencing Goal: Acquire individual’s entire DNA sequence Mechanism: Read DNA fragments and reconstruct.

DNA Sequences Analysis Hasan Alshahrani CS6800 Statistical Background : HMMs. What is DNA Sequence. How to get DNA Sequence. DNA Sequence formats. Analysis.

Spectral Algorithms for Learning HMMs and Tree HMMs for Epigenetics Data Kevin C. Chen Rutgers University joint work with Jimin Song (Rutgers/Palentir),

JERI DILTS SUZANNA KIM HEMA NAGRAJAN DEEPAK PURUSHOTHAM AMBILY SIVADAS AMIT RUPANI LEO WU Genome Assembly Final Results

Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.

Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping using Emerging Memory Technologies Jeremie Kim 1, Damla Senol 1, Hongyi.

1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.

Generalization Performance of Exchange Monte Carlo Method for Normal Mixture Models Kenji Nagata, Sumio Watanabe Tokyo Institute of Technology.

Hidden Markov Models BMI/CS 576

Learning, Uncertainty, and Information: Learning Parameters

FastHASH: A New Algorithm for Fast and Comprehensive Next-generation Sequence Mapping Hongyi Xin1, Donghyuk Lee1, Farhad Hormozdiari2, Can Alkan3, Onur.

What is a Hidden Markov Model?

Nanopore Sequencing Technology and Tools:

Quality Control & Preprocessing of Metagenomic Data

Reconstructing the Evolutionary History of Complex Human Gene Clusters

Introduction to programming

Author ：Shigeomi HARA Hiroshi DOUZONO Yoshio NOGUCHI

ASEN 5070: Statistical Orbit Determination I Fall 2014

A Fast Hybrid Short Read Fragment Assembly Algorithm

Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite.

Milad Hashemi, Onur Mutlu, and Yale N. Patt

Human Resource Management By Dr. Debashish Sengupta

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Department of Computer Science

Speaker: Jim-an tsai advisor: professor jia-lin koh

Genome Read In-Memory (GRIM) Filter Fast Location Filtering in DNA Read Mapping with Emerging Memory Technologies Jeremie Kim, Damla Senol, Hongyi Xin,

Nanopore Sequencing Technology and Tools:

Nanopore Sequencing Technology and Tools for Genome Assembly:

GateKeeper: A New Hardware Architecture

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Jeremie S. Kim Minesh Patel Hasan Hassan Onur Mutlu

Jin Zhang, Jiayin Wang and Yufeng Wu

CS 598AGB Genome Assembly Tandy Warnow.

Genome Read In-Memory (GRIM) Filter:

Research Methods in Behavior Change Programs

STEM Fair Graphs & Statistical Analysis

Hidden Markov Models Part 2: Algorithms

Objective of This Course

GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping

MapView: visualization of short reads alignment on a desktop computer

Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming Damla Senol Cali1 Zülal.

Soft Error Detection for Iterative Applications Using Offline Training

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

Statistical Analysis Error Bars

Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming Damla Senol Cali1, Zülal.

Graph-based Security and Privacy Analytics via Collective Classification with Joint Weight Learning and Propagation Binghui Wang, Jinyuan Jia, and Neil.

Artificial Neural Networks

Topological Signatures For Fast Mobility Analysis

(Top) Construction of synthetic long read clouds with 10× Genomics technology. (Top) Construction of synthetic long read clouds with 10× Genomics technology.

Mapping rates of different transcript sets to the P

DESIGN OF EXPERIMENTS by R. C. Baker

Visual Recognition of American Sign Language Using Hidden Markov Models 문현구 문현구.

BitMAC: An In-Memory Accelerator for Bitvector-Based

SneakySnake: A New Fast and Highly Accurate Pre-Alignment Filter

Presentation transcript:

Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm Can Firtina1, Jeremie S. Kim1,2, Mohammed Alser1, Damla Senol Cali2, A. Ercument Cicek3, Can Alkan3, and Onur Mutlu1,2,3 1 2 3 1: High Throughout Sequencing (HTS) 2: Error Correction 3: Problem HTS: Produces large amount of sequencing data at relatively low cost compared to first-generation sequencing methods. Two types of HTS technologies: Second-generation sequencing technologies (e.g., Illumina) generate the most accurate reads (e.g., 99.9% accuracy), but the length of these reads are short (e.g., 100-300 basepairs). Third-generation sequencing technologies (e.g., PacBio’s SMRT) produce long reads (e.g., up to 2M basepairs) at the cost of high error rate (e.g., an error rate of 10%). Motivation: Long reads make it more likely to generate chromosome-size contigs but also more challenging as the error-prone reads often result in an erroneous assembly. Error-prone assemblies can be corrected in two ways: Correcting the errors of long reads before generating the assembly (i.e., error correction), which requires either: Reads from multiple sequencing technologies (costly) or High coverage long reads (costly) Correcting the errors of the assembly using long or short reads (i.e., assembly polishing) that Mostly works with only reads from a limited set of sequencing technologies Cannot use multiple read sets within a single run Cannot scale well to polish large genome Both approaches can improve accuracy of an assembly The technology and genome-size dependency prevents state-of-the-art assembly polishing algorithms from either Using all available read sets from multiple HTS technologies Polishing large genomes (e.g., a human genome) 4: Our Goal Provide a universal algorithm to improve accuracy of genome assembly that Uses read sets from all available HTS technologies within a single run Scales well to polish large genomes 5: Key Observations 6: Apollo Walkthrough Sequencing errors are not entirely random A profile hidden Markov model (pHMM) graph is a good fit to represent a sequence and its error profile Read-to-assembly alignment: Aligning reads to a contig provides a clue about the differences between a contig and an aligned read Read-to-assembly alignment can be used to train a pHMM- graph to correct the errors in the assembly Based on these observations, we propose a machine learning- based universal technology-independent assembly polishing algorithm, called Apollo 7: Experimental Setup and Data Sets 8: Applicability of the Polishing Algorithms to Large Genomes We evaluate the polished assemblies based on: Aligned Bases: The percentage of bases of an assembly that align to its reference Accuracy: The fraction of identical portions between the aligned bases of an assembly and its reference Polishing Score: Accuracy x Aligned Bases Runtime and the peak memory usage We ran all the tools on a server with 192 GB of memory by assigning 45 threads for each run Apollo is compared with Nanopolish, Racon, Quiver, and Pilon We used E.coli K-12, E.coli O157, E.coli O157:H7, Yeast S288C, Human CHM1, and Human HG002 data sets in our experiments Ground truth: Highly accurate assemblies either from the same sample or a well-known reference of the species Racon, Pilon, and Quiver cannot polish the large genome assembly using high coverage read sets due to high computational resources they require Racon is only able to polish a large genome when using low coverage read sets Apollo is the only assembly polishing algorithm that can scale well to polish large genome assemblies 9: Using Read Sets from Multiple Sequencing Technologies 10: Conclusion Two major functionalities that are not possible with prior tools: Apollo scales well with polishing large genome assemblies Apollo is the best tool that can consistently construct the most reliable Canu-generated assemblies when reads from multiple sequencing technologies are used We show there is a dramatic difference between non-machine learning based algorithms and the machine learning based ones in terms of runtime As future work, it is possible to accelerate the calculation of the Forward-Backward algorithm and the Viterbi algorithm using Tensor cores, SIMD, and GPUs. Apollo generates the most accurate Canu assemblies for a species than running other polishing tools multiple times Apollo never generates an assembly with a polishing score lower than the original assembly whereas other polishing tools may produce such assemblies Running Apollo once is significantly slower than running other polishing tools multiple times