Nanopore Sequencing Technology and Tools:

Nanopore Sequencing Technology and Tools: Computational Analysis of the Current State, Bottlenecks and Future Directions

Damla Senol (1), Jeremie Kim (1,3), Saugata Ghose (1), Can Alkan (2) and Onur Mutlu (3,1)
(1) Carnegie Mellon University, Pittsburgh, PA, USA
(2) Bilkent University, Ankara, Turkey
(3) ETH Zürich, Zürich, Switzerland

Genome Sequencing

Genome sequencing is the process of determining the order of the nucleotides in an organism's genome.

Long Read Analysis

Long reads:
- Sequences with thousands of bases
- Sequences with higher error rates
- Suitable for de novo assembly

De novo assembly is the method of merging the reads in order to construct the original sequence, without the aid of a reference genome. Assembly quality can be improved by using longer reads, since they can cover long repetitive regions.

Nanopore Sequencing

Nanopore sequencing is an emerging DNA sequencing technology.
- Long read length
- Portable and low cost
- Produces data in real time

Nanopore sequencers rely solely on the electrochemical structure of the different nucleotides for identification. They measure the change in the ionic current as long strands of single-stranded DNA (ssDNA) pass through nano-scale protein pores.

Pipeline and Current Tools

[Figure: pipeline steps and the current tools for each step]

Problem & Our Goal

Problem: The tools used for nanopore sequence analysis are critically important for
- increasing the accuracy of the whole pipeline, to take better advantage of long reads, and
- increasing the speed of the whole pipeline, to enable real-time data analysis.

Our Goal:
- Comprehensively analyze current publicly available tools in the whole pipeline for nanopore sequence analysis, with a focus on understanding their advantages, disadvantages, and performance bottlenecks.
- Provide guidelines for determining the appropriate tools for each step of the pipeline.
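The de novo assembly idea described above (merging reads into the original sequence without a reference genome) can be illustrated with a toy greedy overlap merger. This is only a sketch under strong assumptions: the function names `overlap` and `greedy_assemble` are hypothetical, the reads are error-free, and overlaps are matched exactly. Nanopore reads have high error rates, so the tools evaluated in this poster rely on approximate overlap detection instead; this code does not reproduce any of them.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best exact overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest exact overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining sequences (the "contigs").
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical error-free reads sampled from one short sequence.
reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
contigs = greedy_assemble(reads)
print(contigs)
```

A greedy merger like this breaks down on repeats longer than the reads, which is one way to see why longer reads that span repetitive regions can improve assembly quality.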
Results and Analysis

| Pipeline | Step 1 Wall Clock Time (h:m:s) | Step 1 Memory Usage (GB) | Step 2 Wall Clock Time (h:m:s) | Step 2 Memory Usage (GB) | Step 3 Wall Clock Time (h:m:s) | Step 3 Memory Usage (GB) | Number of Contigs | Identity (%) | Coverage (%) |
|---|---|---|---|---|---|---|---|---|---|
| Metrichor + Canu | – | | | | 44:12:31 | 5.76 | 1 | 98.04 | 99.31 |
| Metrichor + Minimap + Miniasm | | | 2:15 | 12.30 | 1:19 | 1.96 | | 85.00 | 94.85 |
| Metrichor + Graphmap + Miniasm | | | 6:14 | 56.58 | 1:05 | 1.82 | 2 | 85.24 | 96.95 |
| Nanonet + Canu | 17:52:42 | 1.89 | | | 11:32:40 | 5.27 | | 97.92 | 98.71 |
| Nanonet + Minimap + Miniasm | | | 1:13 | 9.45 | 33 | 0.69 | | 85.50 | 93.72 |
| Nanonet + Graphmap + Miniasm | | | 3:18 | 29.16 | 32 | 0.65 | | 85.36 | 92.05 |
| Nanocall + Canu | 47:04:53 | 37.73 | | | | | | | |
| Nanocall + Minimap + Miniasm | | | 1:15 | 12.19 | 20 | 0.47 | 5 | 80.53 | 96.80 |
| Nanocall + Graphmap + Miniasm | | | 5:14 | 56.78 | 16 | 0.30 | 3 | 80.52 | 95.43 |
| Deepnano + Canu | 23:54:34 | 8.38 | | | 1:15:48 | 3.61 | 106 | 92.63 | 154.07 |
| Deepnano + Minimap + Miniasm | | | 1:50 | 11.71 | 1:03 | 1.31 | | 82.37 | 91.62 |
| Deepnano + Graphmap + Miniasm | | | 5:18 | 54.64 | 58 | 1.10 | | 82.39 | 91.60 |

Observation 1: Basecalling with Recurrent Neural Networks performs better than basecalling with Hidden Markov Models in terms of accuracy, speed, and memory usage. However, it has scalability limitations due to data sharing between threads.

Observation 2: Sharing the computation of a read between parallel threads provides constant and low memory usage, but data sharing between multiple sockets degrades the parallel speedup when the number of threads becomes large.

Observation 3: Storing minimizers instead of all k-mers does not affect the accuracy of the whole pipeline, and Minimap has lower memory usage and higher speed than GraphMap, since computation is decreased by shrinking the size of the dataset that needs to be considered.

Observation 4: Canu, an assembler with error correction, produces high-quality assemblies but is slow compared to Miniasm, an assembler without error correction. Miniasm is suitable for fast initial analysis, and the quality of its assembly can be increased with an additional polishing step.

Observation 5: Nanopolish is compatible only with reads basecalled by Metrichor. Polishing the draft assembly generated with Canu takes 5h52m and increases the accuracy from 98.04% to 99.46%. Polishing the draft assembly generated with Miniasm takes 5d2h54m and increases the accuracy from 85.00% to 92.31%.
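To make the minimizer idea in Observation 3 concrete, here is a minimal sketch of how keeping one representative k-mer per window shrinks the set of k-mers that has to be stored. The function names `kmers` and `minimizers`, the parameters `k` and `w`, and the lexicographic ordering are illustrative assumptions; real tools may order k-mers differently (for example by a hash value), and this is not the implementation used by Minimap or GraphMap.

```python
def kmers(seq: str, k: int) -> list[tuple[int, str]]:
    """All (position, k-mer) pairs of `seq`."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Keep, for every window of `w` consecutive k-mers, the smallest one.

    "Smallest" is lexicographic here (an assumption made for readability);
    ties are broken by the leftmost position. Adjacent windows often pick
    the same k-mer, so far fewer k-mers are kept than exist in `seq`.
    """
    all_kmers = kmers(seq, k)
    picked: set[tuple[int, str]] = set()
    for start in range(len(all_kmers) - w + 1):
        window = all_kmers[start:start + w]
        pos, kmer = min(window, key=lambda pk: (pk[1], pk[0]))
        picked.add((pos, kmer))
    return sorted(picked)


seq = "ACGTTGCATGCC"  # hypothetical toy sequence
every_kmer = kmers(seq, k=3)
sampled = minimizers(seq, k=3, w=4)
print(len(every_kmer), "k-mers in total,", len(sampled), "kept as minimizers")
print(sampled)
```

On this toy input, 5 of the 10 k-mers are kept. Since every window of `w` consecutive k-mers contributes its smallest k-mer, two sequences that share a long enough exact substring still share at least one stored k-mer, which is consistent with the observation that accuracy is preserved while the dataset shrinks.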