Tesseract: A Scalable Processing-in-Memory Accelerator


Tesseract: A Scalable Processing-in-Memory Accelerator. Seminar on Computer Architecture (Fall 2018). Paper: "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" (ISCA'15) by Junwhan Ahn, Sungpack Hong§, Onur Mutlu*, Sungjoo Yoo, Kiyoung Choi (Seoul National University, §Oracle Labs, *Carnegie Mellon University). Presented by Mauro Bringolf.

Background and Problem Big-data analytics requires processing ever-growing, large graphs; the slide illustrates the scale with examples of 2+ billion users, 300+ million users, and 45+ million pages. Conventional architectures are not well suited for graph processing: they rely on large on-chip caches and provide only small off-chip memory bandwidth.

Graph Processing Characteristics Frequent random memory accesses during neighbor traversals. Typically a small amount of computation per vertex. Neighbor traversal amounts to pointer chasing through large regions of memory. Source: [1] O. Mutlu, “A Scalable Processing-in-Memory…” (Slides)

Summary Problem: Memory bandwidth is the bottleneck for graph processing on conventional architectures. Goal: Ideally, performance should increase proportionally to the size of the graphs stored in a system. Key Mechanism: A new Processing-in-Memory architecture that increases available memory bandwidth by 10x, together with a programming model to use it efficiently. Results: In the evaluation, Tesseract achieves a 10x performance improvement and an 87% energy reduction over conventional architectures.

Example - PageRank It is difficult to hide access latency: the update in line 11 is independent of the other iterations, yet each iteration stalls on memory access latency. Cache locality is poor since the entire set of vertices is traversed. Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
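As a rough illustration of this slide, here is a minimal C sketch of the kind of loop the paper's PageRank listing describes. It is not the paper's exact pseudocode; the struct fields and variable names are illustrative. The point is that the neighbor update is independent across iterations, yet each iteration stalls on a random, long-latency load of the neighbor.

```c
/* Minimal sketch of the conventional PageRank inner loop
 * (illustrative names, not the paper's exact listing). */
typedef struct Vertex {
    double pagerank;
    double next_pagerank;
    int    out_degree;
    int   *successors;       /* indices of neighbors: effectively random */
    int    num_successors;
} Vertex;

void pagerank_iteration(Vertex *vertices, int num_vertices)
{
    for (int i = 0; i < num_vertices; i++) {
        Vertex *v = &vertices[i];
        if (v->out_degree == 0)
            continue;
        double delta = 0.85 * v->pagerank / v->out_degree;
        for (int e = 0; e < v->num_successors; e++) {
            /* Random access over a multi-GB array: likely a cache miss. */
            Vertex *w = &vertices[v->successors[e]];
            /* Independent across iterations (the slide's "line 11"),
             * but each update stalls on the miss above. */
            w->next_pagerank += delta;
        }
    }
}
```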

Key Ideas – Processing-in-Memory Apply the idea of Processing-in-Memory (PIM) using a Hybrid Memory Cube (HMC), which allows efficient stacking of memory and logic layers. Exploit the internal memory bandwidth of the HMC. Many workloads spend much of their time moving data around instead of computing, so move the computation inside memory to reduce latency and off-chip bandwidth pressure. This achieves memory-capacity-proportional bandwidth.

Architecture Tesseract is an HMC enhanced with one in-order processor (a Tesseract core) per vault. Tesseract memory is mapped to the non-cacheable address space of the host processor. Each processor has direct access only to its local vault but can communicate with other vaults via messages. - The host processor is responsible for distributing the graph across vaults - One HMC has 8 GB of capacity
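Since the host distributes the graph across vaults, a natural question is how vertices map to vaults. The paper leaves the concrete placement policy to software; the snippet below is only a hypothetical, simple scheme (modulo distribution over the 16 cubes x 32 vaults used in this talk), not the paper's mechanism.

```c
/* Hypothetical host-side placement sketch: distribute vertices over all vaults
 * by taking the vertex id modulo the number of vaults. Constants follow the
 * configuration used in the talk (16 HMCs x 32 vaults = 512 Tesseract cores). */
#define NUM_CUBES       16
#define VAULTS_PER_CUBE 32
#define NUM_VAULTS      (NUM_CUBES * VAULTS_PER_CUBE)

static inline int vault_of_vertex(long vertex_id)
{
    return (int)(vertex_id % NUM_VAULTS);
}
```

A smarter partitioner, for example one that minimizes cross-cube edges, is exactly what the GraphP criticism later in the talk is about.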

HMC Internal Memory Bandwidth Each vault is connected to the crossbar network via a 64-bit-wide (8-byte) interface at 2 G transfers/s, i.e., 16 GB/s per vault. One HMC consists of 32 vaults, which yields 512 GB/s of internal bandwidth versus 320 GB/s externally. The 512 GB/s results from 8 bytes per vault transferred at 2 GT/s: 8 * 2 * 32 = 512 GB/s. - Not a huge difference for one HMC, but the paper uses 16 of them, yielding about 8 TB/s of internal bandwidth - With 16 HMCs the external bandwidth is still 320 GB/s because the CPU has a limited number of pins for connecting memory - Note that this bandwidth is aggregated over all vaults, which hold different partitions of the data. Hence the question: how do we make sure it is well used? Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
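Worked out with this slide's numbers (a 64-bit, i.e., 8-byte, interface per vault at 2 G transfers/s):

$$8\,\text{B} \times 2\,\text{GT/s} \times 32\,\text{vaults} = 512\,\text{GB/s per cube}, \qquad 512\,\text{GB/s} \times 16\,\text{cubes} = 8192\,\text{GB/s} \approx 8\,\text{TB/s}.$$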

Key Ideas – Programming Model Apply a vertex-centric programming model to PIM. Use a message-passing mechanism to exploit data parallelism. - Instead of distributing independent computations onto threads, we distribute them across the HMC's vaults, which now contain processors - Each processor can access only its own DRAM; if it needs to read or write other data, it sends a message to the corresponding vault - Synchronization between vaults is handled via a global barrier instruction across the whole HMC

Programming Interface Blocking vs. non-blocking remote function calls. Message queue and interrupts for batch processing of messages. - There are two types of remote operations: blocking and non-blocking - Naturally, reads are blocking and writes are non-blocking - Writes are guaranteed to complete before the next barrier instruction

Prefetch Mechanisms Stride prefetcher for traversing the vertex list or the list of edges per vertex. Message-triggered prefetcher to hide access latency. Tesseract performs well but uses only about 1 TB/s of the 8 TB/s of available bandwidth. Message-triggered prefetching exploits the slack between message arrival time and message processing time, which hides access latency. Multiple ready messages are processed at once to minimize the overhead of interrupts and context switches. Message-triggered prefetching is exact since each message contains the address of the data it needs, supplied by software. Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
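To make the mechanism concrete, here is a small software model of the receiver side. It is a sketch under my own assumptions: the queue layout, threshold, and function names are hypothetical, and the real mechanism is implemented in hardware in each vault. The idea it illustrates is the one from the slide: prefetch the hinted address as soon as a message arrives, then drain several ready messages in one batch to amortize the interrupt cost.

```c
/* Software model of message-triggered prefetching at a receiving vault.
 * Purely illustrative: names, sizes, and the threshold are assumptions. */
#include <stddef.h>

#define QUEUE_CAPACITY  64
#define BATCH_THRESHOLD 8     /* drain the queue once this many messages are pending */

typedef void (*remote_fn)(void *arg);

typedef struct {
    remote_fn fn;             /* function to run on behalf of the sender */
    void     *arg;            /* argument carried by the message */
    void     *prefetch_addr;  /* address hint supplied by the sender */
} Message;

static Message queue[QUEUE_CAPACITY];
static int     head, tail, pending;

/* Called when a message arrives: enqueue it and issue the prefetch right away,
 * so the data is likely cached by the time the message is processed. */
void on_message_arrival(remote_fn fn, void *arg, void *prefetch_addr)
{
    Message *m = &queue[tail];
    tail = (tail + 1) % QUEUE_CAPACITY;        /* capacity checks omitted */
    m->fn = fn;
    m->arg = arg;
    m->prefetch_addr = prefetch_addr;
    __builtin_prefetch(prefetch_addr);         /* stand-in for the hardware prefetch */
    pending++;
}

/* Once enough messages are pending, interrupt the core once and process them
 * all, amortizing the interrupt/context-switch overhead over the batch. */
void maybe_process_batch(void)
{
    if (pending < BATCH_THRESHOLD)
        return;
    while (head != tail) {
        Message *m = &queue[head];
        head = (head + 1) % QUEUE_CAPACITY;
        m->fn(m->arg);
    }
    pending = 0;
}
```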

Example - PageRank - list_for is not a Tesseract command but an abbreviation for list_begin and list_end, which configure the stride prefetcher - The independent neighbor-update computations we identified in Figure 1 are now put operations sent to other vaults - The last argument of the put operation is the address to prefetch for the message-triggered prefetcher Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
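For reference, a compilable C sketch of this kernel follows. The primitive signatures are my own guesses modeled on the slide's description (the real Tesseract operations are hardware mechanisms, not C functions), and the stand-in implementations simply apply the remote function locally so the example runs on an ordinary machine.

```c
/* Sketch of the Tesseract-style PageRank kernel. All "primitives" below are
 * hypothetical single-machine stand-ins; argument lists are assumptions. */
#include <stddef.h>

typedef void (*remote_fn)(void *arg);

/* put: non-blocking remote function call; the last argument is the address the
 * message-triggered prefetcher should fetch. Stand-in: run it immediately. */
static void put(int target_vault, remote_fn fn, void *arg, size_t arg_size,
                void *prefetch_hint)
{
    (void)target_vault; (void)arg_size; (void)prefetch_hint;
    fn(arg);   /* real hardware copies arg into a message and queues it remotely */
}
static void barrier(void) { /* all puts are guaranteed complete before this returns */ }
static void list_begin(void *base, size_t n, size_t stride)   /* configure stride prefetcher */
{ (void)base; (void)n; (void)stride; }
static void list_end(void) {}

typedef struct Vertex {
    double pagerank, next_pagerank;
    int    out_degree;
    int   *successors;
    int    num_successors;
} Vertex;

typedef struct { Vertex *dst; double delta; } UpdateArg;

/* The function shipped to the vault that owns the destination vertex. */
static void add_delta(void *p)
{
    UpdateArg *u = (UpdateArg *)p;
    u->dst->next_pagerank += u->delta;
}

void pagerank_iteration(Vertex *vertices, int num_vertices)
{
    list_begin(vertices, num_vertices, sizeof(Vertex));
    for (int i = 0; i < num_vertices; i++) {
        Vertex *v = &vertices[i];
        if (v->out_degree == 0)
            continue;
        double delta = 0.85 * v->pagerank / v->out_degree;
        for (int e = 0; e < v->num_successors; e++) {
            Vertex *w = &vertices[v->successors[e]];
            UpdateArg arg = { w, delta };
            /* Non-blocking update sent to the vault owning w; w is also the
             * prefetch hint for the message-triggered prefetcher. */
            put(/* vault of w */ 0, add_delta, &arg, sizeof(arg), w);
        }
    }
    list_end();
    barrier();   /* all remote updates are visible after this point */
}
```

Compared to the conventional loop sketched earlier, the neighbor update has become a message to the vault that owns the neighbor, so the long-latency random access happens where the data lives.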

Evaluation The design is evaluated in simulation against conventional DDR3-based and HMC-based architectures. 16 HMCs are used, yielding 128 GB of main memory. Peak memory bandwidth: DDR3-based: 102.4 GB/s, HMC-based: 640 GB/s, Tesseract: 8 TB/s.

Workloads Five standard graph algorithms, including PageRank. Data sets are obtained from real internet applications, including Wikipedia. Input graphs contain a few million vertices and 100-200 million edges, and are 3-5 GB in size. - The shape of the input graph can have a large influence on runtime, so using real-world graphs is a good idea

Key Results - Speedup DDR3 is what you might find in a datacenter server and has 32 cores. HMC-OoO uses the same cores as DDR3, but its main memory consists of 16 HMCs. HMC-MC also has 16 HMCs but the same number of cores as Tesseract, i.e., 16 * 32 = 512. The HMC configurations represent potential future designs. Source: [1] O. Mutlu, “A Scalable Processing-in-Memory…” (Slides)

Key Results - Speedup Source: [1] O.Mutlu, “A Scalable Processing-in-Memory…” (Slides)

Key Results - Scalability 32 cores = 1 HMC; speedup is normalized to one cube. Using 4 cubes yields nearly ideal speedup. Going from 4 to 16 cubes scales less than ideally because of off-chip communication between cubes. This is still far better than conventional systems, where increasing the main memory capacity does not help much since memory bandwidth stays the same. Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”

Summary Problem: Memory bandwidth is the bottleneck for graph processing on conventional architectures. Goal: Ideally, performance should increase proportionally to the size of the graphs stored in a system. Key Mechanism: A new Processing-in-Memory architecture that increases available memory bandwidth by 10x, together with a programming model to use it efficiently. Results: In the evaluation, Tesseract achieves a 10x performance improvement and an 87% energy reduction over conventional architectures.

Strengths Combines two strong ideas, PIM and a parallel programming model, such that they benefit from each other. The performance analysis tries to isolate the contributions of the different parts of the design. Message-triggered prefetching is an intuitive idea with large performance benefits. The design is not overly specific to graph workloads.

Weaknesses Re-implementation of algorithms presents a tradeoff. The global synchronization barrier might be problematic for workloads that are imbalanced across vaults. The importance of graph distribution seems understated in the paper to me.

Effects of Graph Distribution From the GraphP paper (1.7x speedup): “In TESSERACT, data organization aspect is not treated as a primary concern and is subsequently determined by the presumed programming model” Source: [3] M. Zhang, Y. Zhuo et al., “GraphP: Reducing Communication…”

Takeaways Processing-in-Memory can be a viable solution to the memory bottleneck. A paradigm shift away from current conventional architectures, toward radically new system designs, can bring large improvements. Proven ideas from software can manifest themselves as new hardware designs.

Tesseract: A Scalable Processing-in-Memory Accelerator. Seminar on Computer Architecture (Fall 2018). Paper: "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" (ISCA'15) by Junwhan Ahn, Sungpack Hong§, Onur Mutlu*, Sungjoo Yoo, Kiyoung Choi (Seoul National University, §Oracle Labs, *Carnegie Mellon University). Presented by Mauro Bringolf.

Discussion Is there a better way to handle synchronization between vaults within one HMC? Is this design specific to graph workloads? Can you think of scenarios where it performs poorly? Do you think automatic translation of algorithms is difficult? - One alternative for synchronization: the receiver sends a confirmation message after processing each message and the sender keeps track of open requests, but this increases overhead a lot - I do not think the design is very specific to graphs and would like to see other workloads evaluated on Tesseract - Translation itself is not difficult, but identifying vertex-independent computations requires the programmer's annotations, so integration with existing software frameworks might be the best solution

References [1] O. Mutlu, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” (Slides) [2] J. Ahn et al., “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” (ISCA’15) [3] M. Zhang, Y. Zhuo et al., “GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition” (HPCA’18)

get Retrieve data from a remote core Blocking remote function call

put, copy Store data on a remote core Non-blocking remote function calls Guaranteed to finish before the next synchronization barrier

disable_interrupt, enable_interrupt Stop processing messages and only do local work Can be used to avoid data races

barrier A synchronization barrier across all Tesseract cores Can be used to avoid data races

list_begin, list_end Configure the prefetcher before doing a list traversal