Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming
Damla Senol Cali1, Zülal Bingöl2, Jeremie S. Kim1,3, Rachata Ausavarungnirun1, Saugata Ghose1, Can Alkan2, and Onur Mutlu3,1

Poster SEQ-15

Problem: The bitap algorithm performs approximate string matching using fast, simple bitwise operations. However, under the von Neumann architecture, the memory bus between the CPU and memory is the bottleneck for high-throughput parallel bitap computation.

Goals:
Overcome the memory bus bottleneck of approximate string matching by performing processing-in-memory (PIM), exploiting the high internal bandwidth available inside new and emerging memory technologies.
Use SIMD programming on the Xeon Phi to take advantage of the high degree of bit-parallelism available in the bitap algorithm.

[Figure: CPU connected to memory over a low-capacity memory bus.]

The bitwise operations used by bitap can be performed in parallel, but high-throughput parallel bitap computation requires more memory bandwidth than is currently available to the processor, so the memory bus becomes the bottleneck. This data movement bottleneck affects all approximate string matching algorithms, but bitap is especially PIM- and SIMD-friendly because it is built entirely from bitvectors and elementary bitwise operations.

PIM: perform the bitwise operations near memory, bypassing the memory bus; operating on many DRAM rows concurrently provides high throughput.
SIMD: use the Xeon Phi's local high-bandwidth memory and caches; the goal is again to bypass the memory bus in offload mode.
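To illustrate why bitap is a good fit for PIM and SIMD, here is a minimal Python sketch of the algorithm (the shift-and variant of bitap, here limited to substitutions/Hamming distance). The function name and structure are illustrative, not the poster's actual implementation; the point is that every state update uses only shifts, ANDs, and ORs over bitvectors — exactly the elementary bulk bitwise operations that in-memory accelerators and SIMD units execute efficiently.

```python
def bitap_hamming(text, pattern, k):
    """Find start positions where pattern matches text with <= k mismatches,
    using bitap's bit-parallel state vectors (shift-and formulation)."""
    m = len(pattern)
    # Real SIMD/PIM ports pack each state vector into fixed-width words;
    # Python's arbitrary-precision ints hide that detail here.
    assert 0 < m, "pattern must be non-empty"
    # Occurrence bitmasks: bit i of B[c] is set iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    hit = 1 << (m - 1)          # bit that signals a full-pattern match
    R = [0] * (k + 1)           # R[d]: prefixes matching with <= d mismatches
    matches = []
    for j, c in enumerate(text):
        mask = B.get(c, 0)
        prev = 0                # previous iteration's R[d-1]
        for d in range(k + 1):
            old = R[d]
            # Extend every tracked prefix by one character (shift + AND).
            R[d] = ((R[d] << 1) | 1) & mask
            if d > 0:
                # Or: treat the current character as a mismatch,
                # spending one of the d allowed errors (shift + OR).
                R[d] |= (prev << 1) | 1
            prev = old
        if R[k] & hit:
            matches.append(j - m + 1)   # start position of a match
    return matches

print(bitap_hamming("xabcy", "abc", 0))  # exact match at index 1
print(bitap_hamming("axc", "abc", 1))    # one-mismatch match at index 0
```

Note that the inner loop is branch-light and data-parallel: the k + 1 state vectors can be updated independently for many text segments at once, which is what makes the algorithm amenable to wide SIMD lanes and to row-parallel bitwise operations inside DRAM.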