Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming
Damla Senol Cali1, Zülal Bingöl2, Jeremie S. Kim1,3, Rachata Ausavarungnirun1, Saugata Ghose1, Can Alkan2, and Onur Mutlu3,1

Poster SEQ-15

Problem: The bitap algorithm performs approximate string matching using fast, simple bitwise operations. However, under the von Neumann architecture, the memory bus between the CPU and memory is the bottleneck for high-throughput parallel bitap computation.

Goals:
Overcome the memory bus bottleneck of approximate string matching by performing processing-in-memory (PIM), exploiting the high internal bandwidth available inside new and emerging memory technologies.
Use SIMD programming on the Xeon Phi to take advantage of the high degree of bit-parallelism available in the bitap algorithm.

[Figure: CPU connected to memory over a low-capacity memory bus.]

The bitwise operations used by bitap can be performed in parallel, but high-throughput parallel bitap computation requires more memory bandwidth than is currently available to the processor, so the memory bus becomes the bottleneck. This data movement bottleneck affects all approximate string matching algorithms, but bitap is especially PIM- and SIMD-friendly because it is built entirely from bitvectors and elementary bitwise operations.

PIM: perform the bitwise operations near memory, bypassing the memory bus; operating on many DRAM rows concurrently provides high throughput.
SIMD: use the Xeon Phi's local high-bandwidth memory and caches; the goal is again to bypass the memory bus in offload mode.
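To illustrate why bitap is a good fit for PIM and SIMD, here is a minimal Python sketch of the algorithm (the shift-and variant of bitap, here limited to substitutions/Hamming distance). The function name and structure are illustrative, not the poster's actual implementation; the point is that every state update uses only shifts, ANDs, and ORs over bitvectors — exactly the elementary bulk bitwise operations that in-memory accelerators and SIMD units execute efficiently.

```python
def bitap_hamming(text, pattern, k):
    """Find start positions where pattern matches text with <= k mismatches,
    using bitap's bit-parallel state vectors (shift-and formulation)."""
    m = len(pattern)
    # Real SIMD/PIM ports pack each state vector into fixed-width words;
    # Python's arbitrary-precision ints hide that detail here.
    assert 0 < m, "pattern must be non-empty"
    # Occurrence bitmasks: bit i of B[c] is set iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    hit = 1 << (m - 1)          # bit that signals a full-pattern match
    R = [0] * (k + 1)           # R[d]: prefixes matching with <= d mismatches
    matches = []
    for j, c in enumerate(text):
        mask = B.get(c, 0)
        prev = 0                # previous iteration's R[d-1]
        for d in range(k + 1):
            old = R[d]
            # Extend every tracked prefix by one character (shift + AND).
            R[d] = ((R[d] << 1) | 1) & mask
            if d > 0:
                # Or: treat the current character as a mismatch,
                # spending one of the d allowed errors (shift + OR).
                R[d] |= (prev << 1) | 1
            prev = old
        if R[k] & hit:
            matches.append(j - m + 1)   # start position of a match
    return matches

print(bitap_hamming("xabcy", "abc", 0))  # exact match at index 1
print(bitap_hamming("axc", "abc", 1))    # one-mismatch match at index 0
```

Note that the inner loop is branch-light and data-parallel: the k + 1 state vectors can be updated independently for many text segments at once, which is what makes the algorithm amenable to wide SIMD lanes and to row-parallel bitwise operations inside DRAM.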