Concurrent Data Structures for Near-Memory Computing


Concurrent Data Structures for Near-Memory Computing Zhiyu Liu (Brown) Irina Calciu (VMware Research) Maurice Herlihy (Brown) Onur Mutlu (ETH)

Concurrent Data Structures
- Used everywhere: kernels, libraries, applications
- Difficult to design and implement
- Data layout and the memory/cache hierarchy play a crucial role in performance (hot path vs. cold path: caches matter)
[Figure: threads accessing a data structure through the cache and memory hierarchy]

The Memory Wall
- Data movement is expensive: L1 cache < 10 ns, L2 cache tens of ns, L3 cache hundreds of ns; at 2.2 GHz, a CPU executes roughly 220K cycles during this data movement
- CPUs got faster for a long time while memory did not; and although CPUs are no longer actually getting faster, memory seems to be getting even slower (e.g., NVM), as vendors have been prioritizing bandwidth and scaling over speed
- Takeaway: minimize data movement by moving computation to where the data is
[Figure: CPU and the cache/memory hierarchy, with the data-movement path from memory highlighted]

Near-Memory Computing
- Also called Processing-In-Memory (PIM)
- Avoids data movement by performing computation in memory
- An old idea, but new advances in 3D integration and die-stacked memory make it viable in the near future

Near-Memory Computing: Architecture
- Vaults: memory partitions
- PIM cores: lightweight cores, each with fast access to its own vault
- Communication between a CPU and a PIM core, or between PIM cores, happens via messages sent to buffers
- This is one viable model: CPUs attached through a crossbar network to DRAM and to PIM memory
[Figure: CPUs connected through a crossbar network to DRAM and to PIM memory, where each vault is paired with a PIM core]
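
The slide's buffer-based communication can be made concrete with a small sketch. This is not the paper's interface; the buffer layout, field names, and polling protocol below are illustrative assumptions about how a CPU and a PIM core might exchange requests through a shared per-vault buffer.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the PIM core's local work (a real core would run the
 * data-structure operation against its own vault). */
static uint64_t execute_locally(uint32_t op, uint64_t key) {
    (void)op;
    return key;
}

/* One request/response buffer per vault. The CPU publishes a request;
 * the PIM core polls, executes, and publishes the answer. */
typedef struct {
    _Atomic uint32_t has_request; /* 1 = request pending, 0 = answered */
    uint32_t op;                  /* e.g., ADD, DELETE, CONTAINS */
    uint64_t key;
    uint64_t response;
} pim_buffer_t;

/* CPU side: costs roughly one L_MSG to deliver, plus the wait. */
uint64_t cpu_send_request(pim_buffer_t *buf, uint32_t op, uint64_t key) {
    buf->op = op;
    buf->key = key;
    atomic_store_explicit(&buf->has_request, 1, memory_order_release);
    while (atomic_load_explicit(&buf->has_request, memory_order_acquire))
        ;                         /* spin until the PIM core answers */
    return buf->response;
}

/* PIM core side: a sequential loop over its buffer. */
void pim_core_loop(pim_buffer_t *buf) {
    for (;;) {
        if (atomic_load_explicit(&buf->has_request, memory_order_acquire)) {
            buf->response = execute_locally(buf->op, buf->key);
            atomic_store_explicit(&buf->has_request, 0, memory_order_release);
        }
    }
}
```

The release/acquire pairing makes the request fields visible to the PIM core before it sees the flag, and the response visible to the CPU before the flag clears.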

Data Structures + Hardware
- Tight integration between algorithmic design and hardware characteristics
- Memory becomes an active component in managing data
- Managing data structures in PIM: prior work used pointer chasing for sequential data structures; our work targets concurrent data structures
Notes: This technology promises to revolutionize the interaction between computation and data by making memory an active component in managing it. It therefore invites a fundamental rethinking of basic data structures and a tighter coupling between algorithmic design and hardware characteristics. PIM is sequential (a single core per vault), so prior work considered only sequential data structures. We are the first to study concurrent data structures on PIM and to compare them with state-of-the-art concurrent data structures.

Goals: PIM Concurrent Data Structures
1. How do PIM data structures compare to state-of-the-art concurrent data structures?
2. How do we design efficient PIM data structures?
We study both ends of the contention spectrum: low-contention, pointer-chasing structures (e.g., a skiplist or linked list under a uniform key distribution) and high-contention, cacheable structures.

Pointer-Chasing Data Structures
[Figure: a skiplist over keys 1-16, with sparse upper levels (1, 10, 16) thickening down to the full bottom level (1, 3, 4, 6, 7, 8, 10, 11, 12, 14, 16)]
A skiplist search walks pointers level by level, so every step is a dependent memory access. This pointer chasing is exactly the pattern that suits PIM, whose core reads its vault at low latency, as the sketch below illustrates.
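
To make the pointer-chasing pattern concrete, here is a minimal sketch of a sequential skiplist search in C (the node layout and MAX_LEVEL are illustrative assumptions, not the paper's code).

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LEVEL 16

typedef struct node {
    long key;
    struct node *next[MAX_LEVEL]; /* next[i] = successor at level i */
} node_t;

/* Search top-down: move right while the next key is still too small,
 * then drop a level. Each n->next[lvl] dereference is a dependent
 * access: the address of the next read is known only after the
 * previous read completes, so on a large list every step is a cache
 * miss for the CPU but a cheap local vault access for a PIM core. */
bool skiplist_contains(const node_t *head, long key) {
    const node_t *n = head;
    for (int lvl = MAX_LEVEL - 1; lvl >= 0; lvl--) {
        while (n->next[lvl] != NULL && n->next[lvl]->key < key)
            n = n->next[lvl];
    }
    n = n->next[0];
    return n != NULL && n->key == key;
}
```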

Goals: PIM Concurrent Data Structures (recap)
Question 1: how do PIM data structures compare to state-of-the-art concurrent data structures? We start with the low-contention, pointer-chasing case, e.g., a skiplist or linked list under a uniform key distribution.

Naïve PIM Skiplist
The entire skiplist lives in a single vault. CPUs send their operations (e.g., add(30), delete(80)) to that vault's PIM core, which executes them one at a time with low-latency local memory accesses.
[Figure: several CPUs sending add/delete requests to one PIM core, which owns the skiplist stored in its vault]

Concurrent Data Structures (CPU baseline)
On a conventional system, CPUs run operations (add(30), add(5), delete(80)) concurrently against a skiplist in DRAM, paying high memory latency but exploiting parallelism. The sequential skiplist can be made concurrent with coarse-grained locks, fine-grained locks, or lock-free techniques; we compare against a lock-free skiplist.
[Figure: multiple CPUs operating concurrently on a skiplist in high-latency DRAM]

Flat Combining
One thread at a time acquires the combiner lock and executes everyone's pending operations (add(30), delete(80), ...), so the skiplist in DRAM is accessed sequentially, still at high memory latency.
[Figure: CPUs publishing requests; the combiner-lock holder applies them to the skiplist in DRAM]
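
For reference, a minimal flat-combining skeleton: a generic sketch of the technique of Hendler, Incze, Shavit, and Tzafrir, not the code used in the paper's experiments. Each thread publishes its request in a slot, and whichever thread wins the combiner lock applies all pending requests in one sequential pass; apply_sequential stands in for the sequential skiplist operation.

```c
#include <stdatomic.h>

#define MAX_THREADS 64

/* Stand-in for the sequential data-structure operation. */
static long apply_sequential(int op, long arg) {
    (void)op;
    return arg;
}

/* One publication slot per thread. */
typedef struct {
    _Atomic int pending; /* 1 while the request awaits execution */
    int op;
    long arg;
    long result;
} slot_t;

static slot_t slots[MAX_THREADS];
static atomic_flag combiner_lock = ATOMIC_FLAG_INIT;

long flat_combine(int tid, int op, long arg) {
    slots[tid].op = op;
    slots[tid].arg = arg;
    atomic_store(&slots[tid].pending, 1); /* publish the request */

    for (;;) {
        if (!atomic_flag_test_and_set(&combiner_lock)) {
            /* We won the lock: we are the combiner. Serve everyone's
             * pending requests in one cache-friendly sequential pass. */
            for (int i = 0; i < MAX_THREADS; i++) {
                if (atomic_load(&slots[i].pending)) {
                    slots[i].result = apply_sequential(slots[i].op, slots[i].arg);
                    atomic_store(&slots[i].pending, 0);
                }
            }
            atomic_flag_clear(&combiner_lock);
        }
        if (!atomic_load(&slots[tid].pending))
            return slots[tid].result; /* some combiner served us */
    }
}
```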

Skiplist Throughput
[Figure: throughput (ops/s) vs. number of threads for the lock-free skiplist and flat combining (FC), measured on an Intel Xeon E7-4850 v3 (28 hardware threads, 2.2 GHz)]

PIM Performance
Model parameters:
  N        = size of the skiplist
  p        = number of processes
  L_CPU    = latency of a memory access from the CPU
  L_LLC    = latency of an LLC access
  L_ATOMIC = latency of an atomic instruction (by the CPU)
  L_PIM    = latency of a memory access from the PIM core
  L_MSG    = latency of a message from the CPU to the PIM core
Notes: To compare the algorithms, we develop a simple model of their performance.

PIM Performance
Model assumptions: L_CPU = r1 · L_PIM = r2 · L_LLC and L_MSG = L_CPU, with r1 = r2 = 3 (so a PIM core's access to its vault costs about as much as an LLC access, and a CPU memory access or a message costs three times that).

PIM Performance
Algorithm           | Throughput
--------------------|--------------------------
Lock-free           | p / (B · L_CPU)
Flat combining (FC) | 1 / (B · L_CPU)
PIM                 | 1 / (B · L_PIM + L_MSG)
B = average number of nodes accessed during a skiplist operation
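
To see what the model predicts, one can plug in the slides' assumptions (L_PIM = L_LLC, L_CPU = L_MSG = 3 · L_LLC). The values of B and p below are hypothetical, chosen only to illustrate the comparison.

```c
#include <stdio.h>

int main(void) {
    /* Normalize latencies to one LLC access; the latency relations
     * follow the slides, while B and p are assumed example values. */
    const double L_LLC = 1.0;
    const double L_PIM = L_LLC;       /* r1 = r2 = 3 implies L_PIM = L_LLC */
    const double L_CPU = 3.0 * L_LLC;
    const double L_MSG = L_CPU;
    const double B = 20.0;            /* assumed avg nodes per operation */
    const double p = 28.0;            /* threads, as on the Xeon above */

    double lock_free = p / (B * L_CPU);     /* 28/60  ~ 0.467 */
    double fc = 1.0 / (B * L_CPU);          /* 1/60   ~ 0.017 */
    double pim = 1.0 / (B * L_PIM + L_MSG); /* 1/23   ~ 0.043 */

    printf("lock-free %.3f  FC %.3f  PIM %.3f (ops per LLC latency)\n",
           lock_free, fc, pim);
    /* With these numbers, PIM beats FC by ~2.6x, but the lock-free
     * skiplist overtakes PIM once p exceeds
     * B*L_CPU / (B*L_PIM + L_MSG) = 60/23 ~ 2.6 threads. */
    return 0;
}
```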

Skiplist Throughput
[Figure: the same throughput plot (ops/s vs. threads) with the model's expected PIM throughput added alongside the measured lock-free and FC curves]

Goals: PIM Concurrent Data Structures (recap)
Question 2: how do we design efficient PIM data structures? Still the low-contention, pointer-chasing case (skiplists and linked lists, uniform key distribution).

New PIM Algorithm: Exploit Partitioning
We need parallelism, so we exploit the structure of the hardware: each vault has its own PIM core, and operations on different vaults can proceed in parallel.
[Figure: the PIM architecture again, with CPUs and a crossbar network connecting DRAM and PIM memory, one PIM core per vault]

PIM Skiplist with Partitioning
The key space is range-partitioned across vaults (e.g., the ranges starting at keys 1, 10, and 20 go to vaults 1, 2, and 3), and each partition is a skiplist owned by its own PIM core. The CPUs cache the partitioning information and route each operation to the vault that owns its key.
[Figure: three CPUs routing operations through a cached index to three PIM cores, each holding one skiplist partition in its vault]
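
The CPU-side routing is cheap. Below is a minimal sketch assuming a fixed range partition of the key space across 16 vaults; the constants and the pim_send helper are illustrative assumptions, not the paper's interface.

```c
#include <stdint.h>

#define NUM_VAULTS 16
#define KEY_MIN 0ULL
#define KEY_MAX (1ULL << 20)   /* keys assumed in [KEY_MIN, KEY_MAX) */

enum { OP_ADD, OP_DELETE, OP_CONTAINS };

/* Hypothetical message-send helper, e.g., over one request buffer per
 * vault as in the earlier architecture sketch. */
extern void pim_send(int vault, int op, uint64_t key);

/* Route a key to a vault by range, so each PIM core owns a contiguous
 * sub-skiplist and operations on disjoint ranges run in parallel. */
static inline int vault_for_key(uint64_t key) {
    const uint64_t range = (KEY_MAX - KEY_MIN + NUM_VAULTS - 1) / NUM_VAULTS;
    return (int)((key - KEY_MIN) / range);
}

/* CPU side: the vault lookup is pure arithmetic on cached constants,
 * so the only memory traffic is the message itself (one L_MSG). */
void skiplist_add(uint64_t key) {
    pim_send(vault_for_key(key), OP_ADD, key);
}
```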

PIM Performance
Algorithm           | Throughput
--------------------|--------------------------
Lock-free           | p / (B · L_CPU)
Flat combining (FC) | 1 / (B · L_CPU)
PIM                 | 1 / (B · L_PIM + L_MSG)
FC + k partitions   | k / (B · L_CPU)
PIM + k partitions  | k / (B · L_PIM + L_MSG)
B = average number of nodes accessed during a skiplist operation

Skiplist Throughput
[Figure: throughput (ops/s) vs. threads for the lock-free skiplist and FC with 16 partitions]

Skiplist Throughput
[Figure: throughput (ops/s) vs. threads for the lock-free skiplist, FC, and FC with 4, 8, and 16 partitions, plus the expected throughput of PIM with 16 partitions]

Goals: PIM Concurrent Data Structures (recap)
Now the high-contention, cacheable case: e.g., a FIFO queue.

FIFO Queue
Enqueues hit the tail and dequeues hit the head, so all operations contend on two hot spots. Those hot nodes are cacheable, which seems to favor the CPU over PIM; but contention is exactly what hurts concurrent data structures, so PIM can still win.
[Figure: a FIFO queue with HEAD and TAIL pointers; enqueue at the tail, dequeue at the head]

PIM FIFO Queue
A dequeue proceeds in three steps: (1) the PIM core retrieves the CPU's Deq() request from its buffer, (2) it dequeues a node from the queue stored in its vault, and (3) it sends the node back to the requesting CPU.
[Figure: two CPUs issuing Deq() requests to the PIM core that owns the queue's vault]
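
A sketch of the PIM-side loop these three steps describe; the queue layout and the messaging helpers are assumptions for illustration, not the paper's code.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct qnode {
    uint64_t value;
    struct qnode *next;
} qnode_t;

/* Dequeue side only: this core's vault holds the head of the queue.
 * (The tail lives in another vault; see the later slide.) */
static qnode_t *head;

/* Hypothetical messaging helpers over the per-vault buffers. */
extern int  recv_dequeue_request(void);                   /* returns requester id */
extern void send_response(int requester, uint64_t value);

void pim_fifo_dequeue_loop(void) {
    for (;;) {
        int requester = recv_dequeue_request(); /* 1. retrieve the request */
        if (head != NULL) {
            qnode_t *n = head;                  /* 2. dequeue a node with a */
            head = n->next;                     /*    cheap local vault access */
            send_response(requester, n->value); /* 3. send the node back */
        } else {
            send_response(requester, UINT64_MAX); /* empty-queue marker */
        }
        /* Pipelining (next slide): step 1 of the following request can
         * overlap with step 3 here, since sending does not touch the vault. */
    }
}
```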

Pipelining
The PIM core can overlap the execution of the next request with the completion of the current one: while step 3 of one Deq() is in flight, step 1 of the next can begin.
[Figure: timeline showing the three steps of consecutive Deq() requests overlapping]

Parallelizing Enqueues and Dequeues
The head and tail live in different vaults, each managed by its own PIM core, so enqueues and dequeues proceed in parallel: CPUs send enqueues to the core holding the tail and dequeues to the core holding the head.
[Figure: CPUs sending enqueues to one PIM core/vault and dequeues to another]

Conclusion
- PIM is becoming feasible in the near future
- We investigate concurrent data structures (CDS) for PIM
- Results: naïve PIM data structures are less efficient than state-of-the-art CDS, despite the lower-latency accesses from the PIM core; but new PIM algorithms that leverage PIM features outperform efficient CDS while being simpler to design and implement
Notes: The memory wall prompts moving computation to the data. PIM, which is becoming feasible in the near future, makes memory an active component in managing data, and this results in a tight integration between algorithmic design and hardware characteristics.

Thank you! icalciu@vmware.com https://research.vmware.com/
