Accelerating Read Mapping with FastHASH †† ‡ †† Hongyi Xin † Donghyuk Lee † Farhad Hormozdiari ‡ Samihan Yedkar † Can Alkan § Onur Mutlu † † † Carnegie.

Slides:

Advertisements

Similar presentations

Indexing DNA Sequences Using q-Grams

Advertisements

Key idea: SHM identifies matching by incrementally shifting the read against the reference Mechanism: Use bit-wise XOR to find all matching bps. Then use.

International Symposium on Low Power Electronics and Design Qing Xie, Mohammad Javad Dousti, and Massoud Pedram University of Southern California ISLPED.

Thank you for your introduction.

Distributed Approximate Spectral Clustering for Large- Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY : BITA KAZEMI ZAHRANI 1.

Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture

Combinatorial Pattern Matching CS 466 Saurabh Sinha.

1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.

Lock-free Cuckoo Hashing Nhan Nguyen & Philippas Tsigas ICDCS 2014 Distributed Computing and Systems Chalmers University of Technology Gothenburg, Sweden.

Codes for Deletion and Insertion Channels with Segmented Errors Zhenming Liu Michael Mitzenmacher Harvard University, School of Engineering and Applied.

Memory Management 1 CS502 Spring 2006 Memory Management CS-502 Spring 2006.

CS-3013 & CS-502, Summer 2006 Memory Management1 CS-3013 & CS-502 Summer 2006.

Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas CMSC 838 Presentation.

Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.

SNAP: Fast, accurate sequence alignment enabling biological applications Ravi Pandya, Microsoft Research ASHG 10/19/2014.

Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks Vivek Seshadri Samihan Yedkar ∙ Hongyi Xin ∙ Onur Mutlu Phillip.

Face Detection using the Viola-Jones Method

Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.

Modularizing B+-trees: Three-Level B+-trees Work Fine Shigero Sasaki* and Takuya Araki NEC Corporation * currently with 1st Nexpire Inc.

SIGCOMM 2002 New Directions in Traffic Measurement and Accounting Focusing on the Elephants, Ignoring the Mice Cristian Estan and George Varghese University.

PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.

Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.

Previously Fetch execute cycle Pipelining and others forms of parallelism Basic architecture This week we going to consider further some of the principles.

Massively Parallel Mapping of Next Generation Sequence Reads Using GPUs Azita Nouri, Reha Oğuz Selvitopi, Özcan Öztürk, Onur Mutlu, Can Alkan Bilkent University,

DNA Computing BY DIVYA TADESERA. Contents  Introduction  History and its origin  Relevancy of DNA computing in 1. Hamilton path problem(NP problem)

Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.

Aligning Reads Ramesh Hariharan Strand Life Sciences IISc.

Module 5 Planning for SQL Server® 2008 R2 Indexing.

Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.

SIGNAL PROCESSING FOR NEXT-GEN SEQUENCING DATA RNA-seq CHIP-seq DNAse I-seq FAIRE-seq Peaks Transcripts Gene models Binding sites RIP/CLIP-seq.

SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces Stephen Rumble, Phil Lacroute, …, Arend Sidow, Michael Brudno.

Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.

BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.

Short Read Mapper Evan Zhen CS 124. Introduction Find a short sequence in a very long DNA sequence Motivation – It is easy to sequence everyone’s genome,

Ihab Mohammed and Safaa Alwajidi. Introduction Hash tables are dictionary structure that store objects with keys and provide very fast access. Hash table.

Hashing is a method to store data in an array so that sorting, searching, inserting and deleting data is fast. For this every record needs unique key.

Chapter 8 Lecture 1 Software Testing. Program testing Testing is intended to show that a program does what it is intended to do and to discover program.

UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.

Yu Cai Ken Mai Onur Mutlu

P.M. VanRaden and D.M. Bickhart Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA

Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features 王荣 14S

Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.

Qq q q q q q q q q q q q q q q q q q q Background: DNA Sequencing Goal: Acquire individual’s entire DNA sequence Mechanism: Read DNA fragments and reconstruct.

IP Routing table compaction and sampling schemes to enhance TCAM cache performance Author: Ruirui Guo a, Jose G. Delgado-Frias Publisher: Journal of Systems.

CSC 413/513: Intro to Algorithms Hash Tables. ● Hash table: ■ Given a table T and a record x, with key (= symbol) and satellite data, we need to support:

MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.

Carnegie Mellon University, *Seagate Technology

From Reads to Results Exome-seq analysis at CCBR

Parallelization of mrFAST on GPGPU Hongyi Xin, Donghyuk Lee Milestone II.

Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping using Emerging Memory Technologies Jeremie Kim 1, Damla Senol 1, Hongyi.

FastHASH: A New Algorithm for Fast and Comprehensive Next-generation Sequence Mapping Hongyi Xin1, Donghyuk Lee1, Farhad Hormozdiari2, Can Alkan3, Onur.

1Carnegie Mellon University

Inference and search for the propositional satisfiability problem

Genomic Data Clustering on FPGAs for Compression

Genome Read In-Memory (GRIM) Filter Fast Location Filtering in DNA Read Mapping with Emerging Memory Technologies Jeremie Kim, Damla Senol, Hongyi Xin,

Nanopore Sequencing Technology and Tools:

Nanopore Sequencing Technology and Tools for Genome Assembly:

GateKeeper: A New Hardware Architecture

Fault Injection: A Method for Validating Fault-tolerant System

Pyramid Sketch: a Sketch Framework

Genome Read In-Memory (GRIM) Filter:

GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping

Accelerator Architecture in Computational Biology and Bioinformatics, February 24th, 2018, Vienna, Austria Exploring Speed/Accuracy Trade-offs in Hardware.

CSC2431 February 3rd 2010 Alecia Fowler

Virtual Memory Hardware

By Nuno Dantas GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping Mohammed Alser §, Hasan Hassan †, Hongyi.

Apollo: A Sequencing-Technology-Independent, Scalable,

BitMAC: An In-Memory Accelerator for Bitvector-Based

SneakySnake: A New Fast and Highly Accurate Pre-Alignment Filter

Presentation transcript:

Accelerating Read Mapping with FastHASH †† ‡ †† Hongyi Xin † Donghyuk Lee † Farhad Hormozdiari ‡ Samihan Yedkar † Can Alkan § Onur Mutlu † † † Carnegie Mellon University § University of Washington ‡ ‡ University of California Los Angeles

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 2

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 3

Read Mapping A post-processing procedure after DNA sequencing Map many short DNA fragments (reads) to a known reference genome with some minor differences allowed 4 Reference genome Reads DNA, logically DNA, physically Mapping short reads to reference genome is challenging (billions of base pair reads)

Challenges Need to find many mappings of each read  A short read may map to many locations, especially with Next Generation DNA Sequencing  How can we find all mappings efficiently? Need to tolerate small variances/errors in each read  Each individual is different: Subject’s DNA may slightly differ from the reference (Mismatches, insertions, deletions)  How can we efficiently map each read with up to e errors present? Need to map each read very fast (i.e., performance is important)  Human DNA is 3.2 billion base pairs long  Millions to billions of reads (State-of-the-art mappers take weeks to map a human’s DNA)  How can we design a much higher performance read mapper? 5

Outline Read Mapping and its Challenges Hash Table-Based Mappers  Preprocess the reference into a Hash Table  Use Hash Table to map reads Problem and Goal Key Observations Mechanisms Results 6

Hash Table-Based Mappers [Alkan+ NG’09] AAAAAAAAAAAA AAAAAAAAAAAC AAAAAAAAAAAT CCCCCCCCCCCC TTTTTTTTTTTT NULL Reference genome k-mer or 12-mer Location list—where the k-mer occurs in reference gnome Once for a reference

Outline Read Mapping and its Challenges Hash Table-Based Mappers  Preprocess the reference into a Hash Table  Use Hash Table to map reads Problem and Goal Key Observations Mechanisms Results 8

12 Hash Table-Based Mappers [Alkan+ NG’09] AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTT CCCCCCCCCCCC TTTTTTTTTTTT Reference Genome Hash Table (HT) read k-mers AAAAAAAAAAAA CCCCCCCCCCCC TTTTTTTTTTTT …AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT… AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT AAAAAAAAAAAA AAAAAAAAAAAAAACGCTTCCACCTTAATCTGGTTG.. read ***..****************************************.. Invalid mapping 9 Valid mapping ✔ Verification/Local Alignment

Advantages of Hash Table Based Mappers + Guaranteed to find all mappings + Tolerate up to e errors 10

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 11

Problem and Goal Poor performance of existing read mappers: Very slow  Verification/alignment takes too long to execute  Verification requires a memory access for reference genome + many base-pair wise comparisons between the reference and the read Goal: Speed up the mapper by reducing the cost of verification 12 95%

Reducing the Cost of Verification We observe that most verification calculations are unnecessary  1 out of 1000 potential locations passes the verification process We also observe that we can get rid of unnecessary verification calculations by  Detecting and rejecting early invalid mappings  Reducing the number of potential mappings 13

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 14

Key Observations Observation 1  Adjacent k-mers in the read should also be adjacent in the reference genome  Hence, mapper can quickly reject mappings that do not satisfy this property Observation 2  Some k-mers are cheaper to verify than others because they have shorter location lists (they occur less frequently in the reference genome) Mapper needs to examine only e+1 k-mers’ locations to tolerate e errors  Hence, mapper can choose the cheapest e+1 k-mers and verify their locations 15

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 16

FastHASH Mechanisms Adjacency Filtering (AF): Rejects obviously invalid mapping locations at early stage to avoid unnecessary verifications Cheap K-mer Selection (CKS): Reduces the absolute number of potential mapping locations 17

Adjacency Filtering (AF) Goal: detect invalid mappings at early stage Key Insight: For a valid mapping, adjacent k-mers in the read are also adjacent in the reference genome Key Idea: search for adjacent locations in the k-mers’ location lists  If more than e k-mers fail—there must be more than e errors—invalid mapping 18 AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTT read Reference genome Valid mapping Invalid mapping

12 Adjacency Filtering (AF) AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTT CCCCCCCCCCCC TTTTTTTTTTTT Reference Genome Hash Table (HT) read k-mers AAAAAAAAAAAA CCCCCCCCCCCC TTTTTTTTTTTT …AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT… AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT AAAAAAAAAAAA ? 36? 336? *** ? ? ✗ 19

FastHASH Mechanisms Adjacency Filtering (AF): Rejects obviously invalid mapping locations at early stage to avoid unnecessary verifications Cheap K-mer Selection (CKS): Reduces the absolute number of potential mapping locations 20

Cheap K-mer Selection (CKS) Goal: Reduce the number of potential mappings Key insight:  K-mers have different cost to examine: Some k-mers are cheaper as they have fewer locations than others (occur less frequently in reference genome) Key idea:  Sort the k-mers based on their number of locations  Select the k-mers with fewest locations to verify 21

Cheap K-mer Selection e=2 (examine 3 k-mers) 22 AAGCTCAATTTC CCTCCTTAATTT TCCTCTTAAGAA GGGTATGGCTAG AAGGTTGAGAGC CTTAGGCTTACC read loc. 338 … … … … 1K loc. 376 … … … … 2K loc loc loc. 388 … … … … 1K loc. Previous work needs to verify: 3004 locations FastHASH verifies only: 8 locations Locations Number of Locations Cheapest 3 k-mers Expensive 3 k-mers

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 23

Methodology Implemented FastHASH on top of state-of-the-art mapper: mrFAST  New version mrFAST over mrFAST Tested with real read sets generated from Illumina platform  1M reads of a human (160 base pairs)  500K reads of a chimpanzee (101 base pairs)  500K reads of a orangutan (70 base pairs) Tested with simulated reads generated from reference genome  1M simulated reads of human (180 base pairs) Evaluation system  Intel Core i7 Sandy Bridge machine  16 GB of main memory 24

FastHASH Speedup 25 orangutan simulated human chimpanzee 19x With FastHASH, new mrFAST obtains up to 19x speedup over previous version, without losing valid mappings

Analysis Reduction of potential mappings with FastHASH 26 99% FastHASH filters out over 99% of the potential mappings without sacrificing any valid mappings

Other Key Results (In the paper) FastHASH finds all possible valid mappings Correctly mapped all simulated reads (with fewer than e artificially added errors) 27

Outline Read Mapping and its Challenges Hash Table-Based Mappers Problem and Goal Key Observations Mechanisms Results Conclusion 28

Conclusion Problem: Existing read mappers perform poorly in mapping billions of short reads to the reference genome, in the presence of errors Observation: Most of the verification calculations are unnecessary Key Idea: To reduce the cost of unnecessary verification  Reject invalid mappings early (Adjacency Filtering)  Reduce the number of possible mappings to examine (Cheap K-mer Selection) Key Result: FastHASH obtains up to 19x speedup over the state-of-the-art mapper without losing valid mappings 29

Acknowledgements Carnegie Mellon University (Hongyi Xin, Donghyuk Lee, Samihan Yedkar and Onur Mutlu, co-authors) Bilkent University (Can Alkan, co-author) University of Washington (Evan Eichler and Can Alkan) UCLA (Farhad Hormozdiari, co-author) NIH (National Institutes of Health) for financial support 30

Thank you! Questions? Download link to FastHASH You can find the slides on SAFARI group website: 

Accelerating Read Mapping with FastHASH †† ‡ †† Hongyi Xin † Donghyuk Lee † Farhad Hormozdiari ‡ Samihan Yedkar † Can Alkan § Onur Mutlu † † † Carnegie Mellon University § University of Washington ‡ ‡ University of California Los Angeles

Mapper Comparison: Number of Valid Mappings 33 Bowtie does not support error threshold larger than 3 FastHASH is able to find many more valid mappings than Bowtie and BWA

Mapper Comparison: Execution Time 34 Bowtie does not support error threshold larger than 3 FastHASH is slower for e <= 3, but is much more comprehensive (can find many more valid mappings)