Tesseract: A Scalable Processing-in-Memory Accelerator


1 Tesseract: A Scalable Processing-in-Memory Accelerator
Seminar on Computer Architecture (Fall 2018)
Tesseract: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (ISCA’15)
Junwhan Ahn, Sungpack Hong§, Onur Mutlu*, Sungjoo Yoo, Kiyoung Choi
Seoul National University, §Oracle Labs, *Carnegie Mellon University
Presented by Mauro Bringolf

2 Background and Problem
Big-data analytics requires processing of ever-growing, large graphs. Conventional architectures are not well suited for graph processing: they rely on large on-chip caches and provide only small off-chip memory bandwidth.
(Figure: real-world examples of graph scale — services with 2+ billion users, 300+ million users, and 45+ million pages.)

3 Graph Processing Characteristics
Frequent random memory accesses during neighbor traversals.
Typically a small amount of computation per vertex.
Neighbor traversal amounts to pointer chasing through large regions of memory.
Source: [1] O. Mutlu, “A Scalable Processing-in-Memory…” (Slides)

4 Summary
Problem: Memory bandwidth is the bottleneck for graph processing on conventional architectures.
Goal: Ideally, performance should increase proportionally to the size of the graphs stored in a system.
Key Mechanism: A new Processing-in-Memory architecture that increases available memory bandwidth by 10x, plus a programming model to use it efficiently.
Results: In evaluation, Tesseract achieves 10x performance and an 87% energy reduction over conventional architectures.

5 Example - PageRank
Difficult to hide access latency: line 11 (the per-neighbor update in the paper's code figure) is independent of other iterations, but stalls on access latency.
Poor cache locality, since the entire set of vertices is traversed. (A minimal sketch of this loop follows below.)
Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
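To make the access pattern concrete, here is a minimal C sketch of this loop; the vertex_t layout is a hypothetical illustration of the bottleneck, not the paper's code:

```c
#include <stddef.h>

/* Hypothetical vertex layout, for illustration only. */
typedef struct {
    double pagerank, next_pagerank;
    int    out_degree;
    int   *edges;        /* indices of successor vertices */
} vertex_t;

/* One PageRank iteration on a conventional core. The per-neighbor
 * update (cf. "line 11" in the paper's figure) is independent of
 * other iterations, but each access to w->next_pagerank is a
 * random DRAM access the core must stall on. */
void pagerank_step(vertex_t *vertices, size_t n) {
    for (size_t i = 0; i < n; i++) {
        vertex_t *v = &vertices[i];
        double value = 0.85 * v->pagerank / v->out_degree;
        for (int j = 0; j < v->out_degree; j++) {
            vertex_t *w = &vertices[v->edges[j]]; /* pointer chasing */
            w->next_pagerank += value;            /* random write, poor locality */
        }
    }
}
```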

6 Key Ideas – Processing-in-Memory
Apply the idea of Processing-in-Memory (PIM) using a Hybrid Memory Cube (HMC), which allows efficient stacking of memory and logic layers, and exploit the internal memory bandwidth of the HMC.
Many workloads spend most of their time moving data around instead of computing. Moving computation inside memory reduces latency and off-chip bandwidth pressure.
This achieves memory-capacity-proportional bandwidth.

7 Architecture
A Tesseract core is an HMC enhanced with one processor per vault.
Tesseract is memory-mapped to a non-cacheable region of the host processor's address space.
Each processor has direct access only to its local vault but can communicate with other vaults via messages.
The host processor is responsible for distributing the graph across vaults.
One HMC has 8 GB of capacity.

8 HMC Internal Memory Bandwidth
Each vault is connected to the crossbar network via a 64-bit-wide interface running at 2 G transfers/s.
One HMC consists of 32 vaults, which yields 512 GB/s of available internal bandwidth versus 320 GB/s externally: 8 bytes per transfer at 2 GT/s across 32 vaults gives 8 × 2 × 32 = 512 GB/s.
Not a huge difference for one HMC, but the paper uses 16 HMCs, yielding about 8 TB/s of internal bandwidth. External bandwidth stays at 320 GB/s, because the CPU has a limited number of pins for connecting memory.
Note that this internal bandwidth is aggregated over all vaults, which contain different partitions of the data. Hence the question: how do we make sure it is well used?
Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
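Worked out explicitly (reading the slide's 2 GB/s as 2 G transfers/s on the 8-byte-wide interface):

```latex
\begin{aligned}
\text{per-vault bandwidth} &= 8~\text{B/transfer} \times 2~\text{GT/s} = 16~\text{GB/s} \\
\text{one HMC (internal)}  &= 32~\text{vaults} \times 16~\text{GB/s} = 512~\text{GB/s} \\
\text{16 HMCs (internal)}  &= 16 \times 512~\text{GB/s} = 8.192~\text{TB/s} \approx 8~\text{TB/s}
\end{aligned}
```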

9 Key Ideas – Programming Model
Apply a vertex-centric programming model to PIM and use a message-passing mechanism to exploit data parallelism.
Instead of distributing independent computations onto threads, we distribute them across the HMC's vaults, which now contain processors.
Each processor can access only its own DRAM. If it needs to read or write other data, it sends a message to the corresponding vault.
Synchronization between vaults is handled via a global barrier instruction across the whole HMC.

10 Programming Interface
Blocking vs. non-blocking remote function calls.
A message queue and interrupts enable batch processing of messages.
There are two types of send operations, blocking and non-blocking. Naturally, reads are blocking and writes are non-blocking.
Writes are guaranteed to complete before the next barrier instruction. (A sketch of this interface follows below.)
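A hedged sketch of this remote-function-call interface in C. The signatures and names are approximations for illustration, not the paper's verbatim API (in hardware these are instructions, not library calls); update_t and apply_delta are hypothetical:

```c
#include <stddef.h>

/* Approximate Tesseract-style interface (hypothetical signatures). */
void get(int dst_vault, void (*fn)(void *), void *arg, size_t len);  /* blocking */
void put(int dst_vault, void (*fn)(void *), void *arg, size_t len,
         void *prefetch_hint);                                       /* non-blocking */
void barrier(void);  /* all outstanding puts complete before this returns */

/* Payload and handler for a remote update; both hypothetical. */
typedef struct { double *target; double delta; } update_t;

static void apply_delta(void *arg) {
    update_t *u = (update_t *)arg;
    *u->target += u->delta;  /* executed by the core owning *target */
}

/* Usage pattern: reads block, writes are fire-and-forget. The argument
 * is copied into the message, and completion is only guaranteed at the
 * next barrier. */
void send_update(int remote_vault, double *remote_addr, double delta) {
    update_t u = { remote_addr, delta };
    put(remote_vault, apply_delta, &u, sizeof u, remote_addr);
    /* ... local work overlaps with the remote update ... */
    barrier();
}
```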

11 Prefetch Mechanisms
Stride prefetcher for traversals of the vertex list or of a vertex's edge list.
Message-triggered prefetcher to hide access latency.
Tesseract performs well even without prefetching, but uses only about 1 TB/s of the available 8 TB/s, motivating these mechanisms.
Message-triggered prefetching exploits the slack between message arrival time and message processing time, and can therefore hide access latency.
Multiple ready messages are processed at once to minimize the overhead of interrupts and context switches.
Message-triggered prefetching is exact, since each message contains the address of the needed data, supplied by software.
Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”

12 Example - PageRank
list_for is not a Tesseract command but an abbreviation for list_begin and list_end, which configure the stride prefetcher.
The independent per-neighbor updates we identified in Figure 1 are now put operations sent to other vaults.
The last argument to a put operation is the address to prefetch, used by the message-triggered prefetcher. (See the sketch below.)
Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”
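A hedged reconstruction of the Tesseract PageRank kernel described on this slide, approximating the paper's code figure rather than quoting it. It reuses vertex_t, update_t, apply_delta, put, and barrier from the sketches above; owner_vault() and next_pagerank_addr() are hypothetical helpers mapping a vertex id to its vault and to the address of its next_pagerank field:

```c
void list_begin(void *base, size_t len);   /* configure the stride prefetcher */
void list_end(void);
int owner_vault(int vertex_id);            /* hypothetical helper */
double *next_pagerank_addr(int vertex_id); /* hypothetical helper */

void pagerank_step_tesseract(vertex_t *local, size_t n) {
    list_begin(local, n * sizeof(vertex_t));  /* the slide's "list_for" */
    for (size_t i = 0; i < n; i++) {
        vertex_t *v = &local[i];
        double value = 0.85 * v->pagerank / v->out_degree;
        for (int j = 0; j < v->out_degree; j++) {
            int w = v->edges[j];
            update_t u = { next_pagerank_addr(w), value };
            /* The independent update from Figure 1 becomes a non-blocking
             * put; the last argument is the prefetch hint consumed by the
             * message-triggered prefetcher. */
            put(owner_vault(w), apply_delta, &u, sizeof u,
                next_pagerank_addr(w));
        }
    }
    list_end();
    barrier();  /* all remote updates are complete past this point */
}
```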

13 Evaluation
The design is evaluated in simulation against conventional DDR3-based and HMC-based architectures.
16 HMCs are used, yielding 128 GB of main memory.
Aggregate memory bandwidth — DDR3-based: 102.4 GB/s, HMC-based: 640 GB/s, Tesseract: 8 TB/s.

14 Workloads
Five standard graph algorithms are used, including PageRank.
Data sets are obtained from real-world internet applications, including Wikipedia.
Input graphs contain a few million vertices, millions of edges, and are 3-5 GB in size.
The shape of the input graph can have a large influence on runtime, so using real-world graphs is a good idea.

15 Key Results - Speedup
DDR3 is what you might find in a datacenter server; it has 32 cores.
HMC-OoO uses the same cores as DDR3, but its main memory consists of 16 HMCs.
HMC-MC also has 16 HMCs but the same number of cores as Tesseract: 16 × 32 = 512.
The HMC configurations represent potential future designs.
Source: [1] O. Mutlu, “A Scalable Processing-in-Memory…” (Slides)

16 Key Results - Speedup
Source: [1] O. Mutlu, “A Scalable Processing-in-Memory…” (Slides)

17 Key Results - Scalability
32 cores = 1 HMC; speedup is normalized to one cube.
Using 4 cubes yields nearly ideal speedup.
Going from 4 to 16 cubes is less than ideal because of off-chip communication between cubes.
Still far better than conventional systems, where increasing main memory capacity does not help much, since memory bandwidth stays the same.
Source: [2] J. Ahn et al., “A Scalable Processing-in-Memory…”

18 Summary
Problem: Memory bandwidth is the bottleneck for graph processing on conventional architectures.
Goal: Ideally, performance should increase proportionally to the size of the graphs stored in a system.
Key Mechanism: A new Processing-in-Memory architecture that increases available memory bandwidth by 10x, plus a programming model to use it efficiently.
Results: In evaluation, Tesseract achieves 10x performance and an 87% energy reduction over conventional architectures.

19 Strengths
Combines two strong ideas, PIM and a parallel programming model, such that they benefit from each other.
The performance analysis tries to isolate the contributions of the different parts of the design.
Message-triggered prefetching is an intuitive idea with great performance benefits.
The design is not overly specific to graph workloads.

20 Weaknesses
Re-implementation of algorithms presents a tradeoff.
The global synchronization barrier might be problematic for workloads that are imbalanced across vaults.
The importance of graph distribution seems understated in the paper.

21 Effects of Graph Distribution
From the GraphP paper (which reports a 1.7x speedup over Tesseract): “In TESSERACT, data organization aspect is not treated as a primary concern and is subsequently determined by the presumed programming model”
Source: [3] M. Zhang, Y. Zhuo et al., “GraphP: Reducing Communication…”

22 Takeaways
Processing-in-Memory can be a viable solution to the memory bottleneck.
A paradigm shift away from current conventional architectures can yield great improvements through radically new system designs.
Proven ideas from software can manifest themselves as new hardware designs.

23 Tesseract: A Scalable Processing-in-Memory Accelerator
Seminar on Computer Architecture (Fall 2018)
Tesseract: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (ISCA’15)
Junwhan Ahn, Sungpack Hong§, Onur Mutlu*, Sungjoo Yoo, Kiyoung Choi
Seoul National University, §Oracle Labs, *Carnegie Mellon University
Presented by Mauro Bringolf

24 Discussion
Q: Is there a better way to handle synchronization between vaults within one HMC?
A: The receiver could send a confirmation message after processing each message, and the sender could keep track of open requests. This increases overhead a lot, though.
Q: Is this design specific to graph workloads? Can you think of scenarios where it performs poorly?
A: I don't think it is very specific to graphs, and I would like to see other workloads evaluated on Tesseract.
Q: Do you think automatic translation of algorithms is difficult?
A: The translation itself is not difficult, but identifying vertex-independent computations requires programmer annotation. Integration with existing software frameworks might therefore be the best solution.

25 References
[1] O. Mutlu, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” (Slides)
[2] J. Ahn et al., “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” (ISCA ’15)
[3] M. Zhang, Y. Zhuo et al., “GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition”

26 get
Retrieve data from a remote core.
Blocking remote function call.

27 put, copy
Store data on a remote core.
Non-blocking remote function calls.
Guaranteed to finish before the next synchronization barrier.

28 disable_interrupt, enable_interrupt
Stop processing incoming messages and only do local work.
Can be used to avoid data races. (Hypothetical usage sketch below.)
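A hypothetical usage sketch, reusing vertex_t from the earlier sketches; only the call names come from this slide:

```c
void disable_interrupt(void);  /* stop servicing incoming messages */
void enable_interrupt(void);   /* resume servicing incoming messages */

/* Prevent remote put handlers from interleaving with a local
 * read-modify-write on this vault's data. */
void finish_iteration(vertex_t *v) {
    disable_interrupt();
    v->pagerank = v->next_pagerank;  /* local critical section */
    v->next_pagerank = 0.0;
    enable_interrupt();
}
```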

29 barrier
A synchronization barrier across all Tesseract cores.
Can be used to avoid data races

30 list_begin, list_end
Configure the stride prefetcher before a list traversal.

