BitMAC: An In-Memory Accelerator for Bitvector-Based Sequence Alignment of Both Short and Long Genomic Reads

Damla Senol Cali1, Gurpreet S. Kalsi2, Lavanya Subramanian3, Can Firtina4, Anant V. Nori2, Jeremie S. Kim4,1, Zülal Bingöl5, Rachata Ausavarungnirun6,1, Mohammed Alser4, Juan Gomez-Luna4, Amirali Boroumand1, Allison Scibisz1, Sreenivas Subramoney2, Can Alkan5, Saugata Ghose1, and Onur Mutlu4,1

1 Carnegie Mellon University, USA  2 Intel, USA  3 Facebook, USA  4 ETH Zürich, Switzerland  5 Bilkent University, Turkey  6 King Mongkut's University of Technology North Bangkok, Thailand

Problem

Read mapping is the critical first step of the genome sequence analysis pipeline. In read mapping, each read is aligned against a reference genome to verify whether a potential location results in an alignment for the read (i.e., read alignment). Read alignment can be viewed as an approximate (i.e., fuzzy) string matching problem. Approximate string matching is typically performed with an expensive quadratic-time dynamic programming algorithm, which consumes over 70% of the execution time of read alignment.

String Matching with the Bitap Algorithm

The Bitap algorithm (also known as the Shift-Or or Baeza-Yates–Gonnet algorithm) [1] performs exact string matching with fast and simple bitwise operations. Wu and Manber extended the algorithm [2] to perform approximate string matching.

Step 1 – Preprocessing: For each character in the alphabet (i.e., A, C, G, T), generate a pattern bitmask that records where the corresponding character appears in the pattern.

Step 2 – Searching: Compare all characters of the text against the pattern using the preprocessed bitmasks, a set of bitvectors that hold the status of partial matches, and bitwise operations.

[1] Baeza-Yates, Ricardo, and Gaston H. Gonnet. "A new approach to text searching." Communications of the ACM 35.10 (1992): 74-82.
[2] Wu, Sun, and Udi Manber.
"Fast text search allowing errors." Communications of the ACM 35.10 (1992): 83-91.

Limitations of Bitap

- The algorithm itself cannot be parallelized, due to data dependencies across loop iterations.
- Multiple searches can be performed in parallel, but are limited by the number of compute units available in a CPU.
- Even if many compute units are made available (e.g., on a GPU), Bitap is highly constrained by the amount of available memory bandwidth.
- The standard Bitap algorithm cannot perform the traceback step of alignment, and thus cannot work effectively for both short and long reads.

Our Goal and Contributions

Goal: Design a fast and efficient customized accelerator for approximate string matching, to enable faster read alignment and therefore faster genome sequence analysis for both short and long reads.

Contributions:
- A modified Bitap algorithm that performs efficient genome read alignment in memory. We parallelize Bitap by removing the data dependencies across loop iterations, add efficient support for long reads, and develop the first efficient Bitap-based algorithm for traceback.
- BitMAC, the first in-memory read alignment accelerator for both short, accurate reads and long, noisy reads: a hardware implementation of our modified Bitap algorithm, designed specifically to take advantage of the high internal bandwidth available in the logic layer of 3D-stacked DRAM chips.

3D-Stacked DRAM and Processing-in-Memory (PIM)

Recent technological advances in memory design allow architects to tightly couple memory and logic within the same chip, using high-bandwidth, low-latency, and energy-efficient vertical connections (e.g., 3D-stacked memories such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM)). A customizable logic layer enables fast, massively parallel operations on large sets of data, and provides the ability to run these operations near memory at high bandwidth and low latency.
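For concreteness, the two Bitap steps described above can be sketched in software. This is a minimal sequential Shift-Or illustration over the DNA alphabet, not BitMAC's modified, parallelized algorithm; function and variable names are illustrative.

```python
def build_masks(pattern, alphabet="ACGT"):
    # Step 1 - Preprocessing: one bitmask per alphabet character.
    # In the Shift-Or convention, bit i is 0 iff pattern[i] == c.
    masks = {c: ~0 for c in alphabet}
    for i, c in enumerate(pattern):
        masks[c] &= ~(1 << i)
    return masks

def shift_or_search(text, pattern):
    """Return the start positions of all exact matches of pattern in text."""
    m = len(pattern)
    masks = build_masks(pattern)
    # Step 2 - Searching: R holds the status of partial matches
    # (bit i is 0 iff pattern[0..i] matches the text ending here).
    R = ~0  # all ones: no partial match yet
    matches = []
    for j, c in enumerate(text):
        R = (R << 1) | masks.get(c, ~0)
        if R & (1 << (m - 1)) == 0:  # bit m-1 cleared: full match
            matches.append(j - m + 1)
    return matches

print(shift_or_search("GATGAT", "GAT"))  # start positions [0, 3]
```

Note that Python's arbitrary-precision integers sidestep the machine word size; in hardware, the bitvector width is fixed, which is one reason plain Bitap struggles with long reads.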
Together, these capabilities enable processing-in-memory (PIM): performing computation near where the data resides.

BitMAC: Efficient In-Memory Bitap

BitMAC is an implementation of our modified, PIM-friendly Bitap algorithm using a dedicated hardware accelerator that we add to the logic layer of a 3D-stacked DRAM chip. BitMAC:
- achieves high performance and energy efficiency with specialized compute units and high data locality,
- balances the compute resources and the available memory bandwidth per compute unit,
- scales linearly with the number of parallel compute units, and
- provides generic applicability by performing both edit distance calculation and traceback for short and long reads.

BitMAC consists of two components that work together to perform read alignment: (1) BitMAC-DC, which calculates the edit distance between the reference genome and the query read, and (2) BitMAC-TB, which performs traceback using the list of edit locations recorded by BitMAC-DC.

BitMAC-DC: A systolic-array-based design for an efficient implementation of edit distance calculation, which provides parallelism across multiple iterations of read alignment. BitMAC-DC (1) computes the edit distance between the whole reference genome and the input reads, (2) computes the edit distance between the candidate regions reported by an initial filtering step and the input reads, and (3) finds the candidate regions of the reference genome by running the accelerator in exact matching mode.

BitMAC-TB: Uses a low-power, general-purpose PIM core to perform the traceback step of read alignment, with a new, efficient traceback algorithm that exploits the bitvectors generated by BitMAC-DC. BitMAC-TB divides the matching region of the text (as identified by BitMAC-DC) into multiple windows, and then performs traceback in parallel on each window. This allows us to utilize the full memory bandwidth for the traceback step.
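As a software reference point for BitMAC-DC's edit-distance mode, the following is a minimal sketch of the Wu-Manber extension of Bitap — the sequential form whose loop-carried dependencies BitMAC removes, not BitMAC's modified algorithm. It uses the shift-AND convention (1 bits mark active partial matches), the complement of the Shift-Or convention.

```python
def bitap_approx(text, pattern, k):
    """Report end positions in text where pattern matches with edit
    distance <= k. Bit i of R[d] is 1 iff pattern[0..i] matches a
    suffix of the text seen so far with at most d edits."""
    m = len(pattern)
    # Preprocessing: bit i of B[c] is set iff pattern[i] == c.
    B = {c: 0 for c in "ACGT"}
    for i, c in enumerate(pattern):
        B[c] |= 1 << i
    # R[d] starts with its d lowest bits set (d leading deletions).
    R = [(1 << d) - 1 for d in range(k + 1)]
    ends = []
    for j, c in enumerate(text):
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & B.get(c, 0)
        for d in range(1, k + 1):
            prev = R[d]
            R[d] = ((((R[d] << 1) | 1) & B.get(c, 0))  # match
                    | (old << 1)                       # substitution
                    | old                              # insertion
                    | (R[d - 1] << 1)                  # deletion
                    | 1)
            old = prev  # old value of R[d], for level d + 1
        if R[k] & (1 << (m - 1)):
            ends.append(j)  # a match ends at text position j
    return ends

print(bitap_approx("GCT", "GAT", 1))  # [2]: "GAT" matches with 1 substitution
```

Each bitvector R[d] at text position j depends on R[d] at position j-1 and on R[d-1] at position j; these are exactly the data dependencies, across loop iterations, that the poster's modified algorithm targets.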
Results

When compared with the alignment steps of BWA-MEM and Minimap2, BitMAC achieves:

For simulated PacBio and ONT datasets:
- 4997× and 79× throughput improvement over the single-threaded baselines,
- 455× and 7× throughput improvement over the 12-threaded baselines, and
- 15.9× and 13.8× lower power consumption.

For simulated Illumina datasets:
- 102,558× and 16,867× throughput improvement over the single-threaded baselines, and
- 8162× and 1445× throughput improvement over the 12-threaded baselines.

Future Work

We believe it is promising to explore:
- coupling PIM-based filtering methods with BitMAC to reduce the number of required read alignments,
- enhancing the design of BitMAC to support different scoring schemes and affine gap penalties,
- analyzing the effects of a larger alphabet on BitMAC, and
- examining the benefit of using AVX-512 support for multiple pattern searches in parallel.