Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming

Damla Senol Cali (1), Zülal Bingöl (2), Jeremie S. Kim (1,3), Rachata Ausavarungnirun (1), Saugata Ghose (1), Can Alkan (2) and Onur Mutlu (3,1)
(1) Carnegie Mellon University, Pittsburgh, PA, USA
(2) Bilkent University, Ankara, Turkey
(3) ETH Zürich, Zürich, Switzerland

Presentation transcript:

Bitap Algorithm

The bitap algorithm (i.e., the Shift-Or algorithm, or the Baeza-Yates-Gonnet algorithm) [1] can perform exact string matching with fast and simple bitwise operations. Wu and Manber extended the algorithm [2] to perform approximate string matching.

Step 1 – Preprocessing: For each character in the alphabet (i.e., A, C, G, T), generate a pattern bitmask that stores information about the presence of the corresponding character in the pattern.

Step 2 – Searching: Compare all characters of the text with the pattern by using the preprocessed bitmasks, a set of bitvectors that hold the status of the partial matches, and bitwise operations.

[1] Baeza-Yates, Ricardo, and Gaston H. Gonnet. "A new approach to text searching." Communications of the ACM 35.10 (1992): 74-82.
[2] Wu, Sun, and Udi Manber. "Fast text searching allowing errors." Communications of the ACM 35.10 (1992): 83-91.

Problem & Our Goal

Problem: The operations used during bitap can be performed in parallel, but high-throughput parallel bitap computation requires a large amount of memory bandwidth that is currently unavailable to the processor. Read mapping is an application of the approximate string matching problem, and thus can benefit from existing techniques used to optimize general-purpose string matching.

Our Goal:
- Overcome the memory bottleneck of bitap by performing processing-in-memory, exploiting the high internal bandwidth available inside new and emerging memory technologies.
- Use SIMD programming to take advantage of the high amount of parallelism available in the bitap algorithm.

Processing-in-Memory

3D-stacked memory is a recent technology that tightly couples memory and logic vertically with very high-bandwidth connectors. Numerous Through-Silicon Vias (TSVs) connecting the layers enable higher bandwidth as well as lower latency and energy consumption. A customizable logic layer enables fast, massively parallel operations on large sets of data, and provides the ability to run these operations near memory to alleviate the memory bottleneck.

Acceleration of Bitap with PIM

Notes: 7k+2 bitwise operations are completed sequentially to compute a single character in a bin. However, multiple characters from different bins are computed in parallel with the help of multiple logic modules (i.e., PIM accelerators) in the logic layer. If D is the number of iterations needed to complete the computation of one memory row, the total number of bitwise operations per row is D * (7k+2), where D = (max # of accelerators) / (actual # of accelerators).
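To make the per-character update concrete, below is a minimal scalar C sketch of the Wu-Manber bitap step (a hedged illustration, not the poster's implementation: the names bitap_approx, shift1, and MAX_K are ours, and the pattern is limited to 64 characters so that each bitvector fits in one machine word). Counting the combined shift-and-set as a single operation, updating R[0] takes 2 bitwise operations and each of the k error levels takes 7 more, which is where the 7k+2 count in the notes above comes from. The poster's flow uses the mirrored Shift-Or orientation (0 = match, check the LSB of R[d]); this sketch uses the equivalent textbook Shift-And orientation (1 = match, check the high bit).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_K 8   /* illustrative bound on the error count k */

/* Advance every partial match by one pattern position; the 1 shifted in
 * encodes that the empty prefix always matches (Wu-Manber's Rshift). */
static inline uint64_t shift1(uint64_t r) { return (r << 1) | 1; }

/* Report all end positions in text where pattern (length <= 64) matches
 * with at most k edits (substitutions, insertions, deletions). */
static void bitap_approx(const char *pattern, const char *text, int k)
{
    int m = (int)strlen(pattern);
    int n = (int)strlen(text);
    if (k > MAX_K) k = MAX_K;   /* keep within the illustrative bound */

    /* Step 1 - Preprocessing: bit j of pm[c] is 1 iff pattern[j] == c. */
    uint64_t pm[256] = {0};
    for (int j = 0; j < m; j++)
        pm[(unsigned char)pattern[j]] |= 1ULL << j;

    /* Bit j of R[d] is 1 iff the length-(j+1) pattern prefix matches a
     * suffix of the text read so far with <= d edits. Initially the
     * length-d prefix matches the empty text via d deletions. */
    uint64_t R[MAX_K + 1];
    for (int d = 0; d <= k; d++)
        R[d] = (1ULL << d) - 1;

    /* Step 2 - Searching: 2 ops for R[0], 7 per error level -> 7k+2. */
    for (int i = 0; i < n; i++) {
        uint64_t mask = pm[(unsigned char)text[i]];
        uint64_t prev = R[0];                /* R[d-1] before this step */
        R[0] = shift1(R[0]) & mask;
        for (int d = 1; d <= k; d++) {
            uint64_t old = R[d];
            R[d] = (shift1(old) & mask)      /* match                   */
                 | shift1(prev)              /* substitution            */
                 | shift1(R[d - 1])          /* deletion (new R[d-1])   */
                 | prev;                     /* insertion (old R[d-1])  */
            prev = old;
        }
        if ((R[k] >> (m - 1)) & 1)
            printf("match ending at text[%d] with <= %d edits\n", i, k);
    }
}

int main(void)
{
    /* "GATTTACA" is "GATTACA" with one inserted T: within the budget. */
    bitap_approx("GATTACA", "xxGATTTACAxx", 1);
    return 0;
}
```

The one subtlety is the prev variable: the substitution and insertion terms need the value of R[d-1] from before the current text character, while the deletion term needs the freshly updated R[d-1].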
Acceleration of Bitap with SIMD

Notes: The Intel Xeon Phi coprocessor has a vector processing unit that uses the Advanced Vector Extensions (AVX) instruction set to perform efficient SIMD operations. Our current architecture is Knights Corner, which enables the use of 512-bit vectors performing 8 double-precision or 16 single-precision operations per cycle. The current system runs natively on a single MIC device, and the read length must be at most 128 characters.

1) Get 4 pairs of reads (p1..p4) and reference segments (t1..t4). Prepare the bitmasks of each read and assemble them into 512-bit vectors: each read pi is represented by a 512-bit vector holding its four 128-bit bitmasks <B[A], B[C], B[G], B[T]>.

2) Initialize the status vectors and start iterating over the 4 reference segments simultaneously. While iterating, mark the bitmask of each read that corresponds to the current text character as active and assemble the active bitmasks into one 512-bit vector (e.g., <p1(B[G]), p2(B[C]), p3(B[G]), p4(B[A])> when the current characters of t1..t4 are G, C, G, A). Perform the bitwise operations (shift, OR, and adjustment ops*) to obtain R[0].

3) Integrate the result R[0] with the insertion, deletion, and substitution status vectors (R[0] & insertion & deletion & substitution). Deactivate the 128-bit portion of R[0]..R[d] if the respective reference segment t ends, then perform the checking operations on that portion: if the LSB of R[d] is '0', the edit distance between the read and the reference segment is d.

*Adjustment ops: Since the system represents each entry with 128 bits and the instruction set supports only 64-bit shift operations, carry-bit operations must be performed (a scalar sketch of this carry handling appears at the end of this transcript).

Results - PIM

Assuming a row size of 8 kilobytes (65,536 bits) and a cache line size of 64 bytes (512 bits), there are 128 cache lines in a single row. Thus, Memory Latency (ML) = row miss latency + 127 * (row hit latency) ~ 914 cycles. ML is constant (i.e., independent of the number of accelerators). With human chromosome 1 as the text and 64 bp reads as the pattern, Bitap-PIM provides a 3.35x end-to-end speedup over Edlib [3], on average.

[3] Šošić, Martin, and Mile Šikić. "Edlib: A C/C++ Library for Fast, Exact Sequence Alignment Using Edit Distance." Bioinformatics 33.9 (2017): 1394–1395.

Results - SIMD

We perform the tests with read and reference segment pairs, each 100 bp long. The total number of tested mappings is 3,000,000.

Future Work

Bitap-PIM:
- Improving the logic module in the logic layer in order to decrease the number of operations performed within a DRAM cycle.
- Providing a backtracing extension in order to generate CIGAR strings.
- Comparing Bitap-PIM with the state-of-the-art read mappers for both short and long reads.

Bitap-SIMD:
- Extending the current system to work in offload mode in order to exploit 4 MIC devices simultaneously.
- Optimizing the expensive adjustment operations (i.e., carry-bit operations) to improve the performance of Bitap-SIMD.
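As referenced from the SIMD section above, here is a hedged scalar sketch of the adjustment (carry-bit) operations. The u128 type and shr1_carry helper are illustrative names, not the poster's code, and the real implementation applies the same idea to four 128-bit lanes packed into a 512-bit vector: because the instruction set shifts at most 64 bits at a time, each 128-bit entry is shifted as two 64-bit halves and the bit crossing the word boundary is carried over manually.

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* One 128-bit status-vector entry stored as two 64-bit words; the real
 * system packs four such lanes into a single 512-bit vector. */
typedef struct { uint64_t hi, lo; } u128;

/* Right-shift a 128-bit lane by one using only 64-bit shifts: the bit
 * leaving the high word must be carried into the low word by hand,
 * which is the "adjustment" the poster refers to. */
static inline u128 shr1_carry(u128 x)
{
    u128 r;
    r.lo = (x.lo >> 1) | (x.hi << 63);  /* bit 0 of hi -> bit 63 of lo */
    r.hi = x.hi >> 1;
    return r;
}

int main(void)
{
    u128 x = { .hi = 1, .lo = 0 };      /* only bit 64 of the lane set */
    u128 y = shr1_carry(x);
    assert(y.hi == 0 && y.lo == (1ULL << 63));  /* carried across words */
    printf("hi=%016llx lo=%016llx\n",
           (unsigned long long)y.hi, (unsigned long long)y.lo);
    return 0;
}
```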