Heterogeneous Memory Subsystem for Natural Graph Analytics


Heterogeneous Memory Subsystem for Natural Graph Analytics
Abraham Addisie, Hiwot Kassa, Opeoluwa Matthews, Valeria Bertacco
University of Michigan
IISWC 2018, October 2, 2018, Raleigh, NC

Graph Applications: Challenges and Solutions

Challenges:
The memory subsystem is a huge bottleneck to performance.
The few hardware solutions available are inflexible.

Solutions:
Extract the locality that exists in common natural graphs.
Keep general-purpose cores and applications unmodified.

Graph applications are used in a wide range of areas. Search engines use algorithms like PageRank to provide high-quality search results. Social networks use graph analytics to analyze interactions among users, for instance to map an organized crime group. Medical centers use them to understand functional brain connectivity, aiding the diagnosis of brain-related diseases such as tumors. At the same time, the amount of data generated keeps growing: social networks see more online users each day, and high-resolution MRI devices generate ever-larger functional brain connectivity graphs. This growth creates two challenges for existing hardware solutions: the memory subsystem is becoming a huge bottleneck, and the few hardware solutions available today are inflexible. To address the first challenge, we design a specialized memory subsystem architecture that extracts the locality present in common natural graphs. To address the second, we keep the general-purpose cores intact and focus the optimization effort on the memory subsystem, so that many graph frameworks can benefit from our solution.

Background: Graph Algorithms

PageRank execution flow. Key operation (done atomically):
[atomic_]PR[dest] += PR[src]/src.outDegree

[Figure: example graph; each source vertex scatters its +PR contribution to its destination vertices.]

Key data structures:
vertex property (vtxProp): accessed randomly
edge list (edgelist): accessed sequentially
non-graph data (nongraph)
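To make the access pattern concrete, here is a minimal C++ sketch of the per-edge update above. The CSR layout and all names are illustrative assumptions, not OMEGA's actual code; the atomic add is done with a CAS loop because multiple threads may update the same destination.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct Graph {
    std::vector<uint32_t> rowPtr;   // CSR offsets, size = numVertices + 1
    std::vector<uint32_t> colIdx;   // destination vertex of each edge
    std::vector<uint32_t> outDeg;   // out-degree per vertex
};

// One PageRank sweep: scatter each source's contribution to its neighbors.
void pagerank_sweep(const Graph& g,
                    const std::vector<double>& currPR,           // vtxProp, read
                    std::vector<std::atomic<double>>& nextPR) {  // vtxProp, updated
    for (uint32_t src = 0; src < g.outDeg.size(); ++src) {
        if (g.outDeg[src] == 0) continue;
        double contrib = currPR[src] / g.outDeg[src];
        // Sequential walk over the edge list...
        for (uint32_t e = g.rowPtr[src]; e < g.rowPtr[src + 1]; ++e) {
            uint32_t dest = g.colIdx[e];   // ...but random access into vtxProp
            // atomic PR[dest] += PR[src]/src.outDegree
            double old = nextPR[dest].load();
            while (!nextPR[dest].compare_exchange_weak(old, old + contrib)) {}
        }
    }
}
```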

Background: Power-Law Distribution

The degree distribution of many graphs follows a power law, commonly due to preferential attachment.
[Plot: #vertices vs. #indegree edges. Figure: example graph; larger vertices are more highly connected.]
An instance of the power law: 20% of the vertices receive 80% of the indegree edges.
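The 20%/80% property is easy to check on any dataset: sort indegrees in descending order and sum the top fifth. A quick hypothetical sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Fraction of all in-edges received by the 20% most-connected vertices.
// Returns ~0.8 for power-law graphs, much less for e.g. road networks.
double top20_indegree_share(std::vector<uint64_t> indeg) {
    std::sort(indeg.begin(), indeg.end(), std::greater<uint64_t>());
    uint64_t total = std::accumulate(indeg.begin(), indeg.end(), uint64_t{0});
    size_t top = indeg.size() / 5;   // most-connected 20% of vertices
    uint64_t topSum = std::accumulate(indeg.begin(), indeg.begin() + top, uint64_t{0});
    return total ? double(topSum) / double(total) : 0.0;
}
```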

Graph Workload Characterization

Percentage of vtxProp accesses that go to the 20% most-connected vertices, per graph algorithm (PageRank, Breadth-First Search, Single-Source Shortest Path, Betweenness Centrality, Radii, Connected Components, Triangle Counting, K-Core) and graph dataset (slashdot, ca-AstroPh, rMat, orkut, enwiki, ljournal, indochina, uk, roadNet-PA, roadNet-CA, west-USA).

[Table: per-algorithm, per-dataset access percentages, ranging from roughly 17% (road networks) to 99.8%.]

Key observations:
For most graphs, >70% of vtxProp accesses target the 20% most-connected vertices.
Exceptions: road networks (e.g., roadNet-CA).

OMEGA Architecture

Baseline CMP vs. OMEGA. Key ideas:
Heterogeneous memory subsystem architecture:
Scratchpads (SPs) store the vtxProp of the most-connected vertices.
Caches store the edgelist, nongraph data, and the vtxProp of the least-connected vertices.
Processing in Scratchpad (PISC): computes atomic operations on vtxProp in situ.
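A minimal sketch of the placement policy this implies: after reordering (described later), vertices below a cutoff id keep their property in the scratchpads, and everything else goes through the cache hierarchy. The cutoff, the per-core scratchpad count, and the interleaving are assumptions for illustration.

```cpp
#include <cstdint>

constexpr uint32_t kNumScratchpads = 16;       // one per core (assumption)
constexpr uint32_t kSpCutoff = 1u << 20;       // #vertices whose vtxProp fits in the SPs (assumption)

struct Placement { bool inScratchpad; uint32_t spId; };

Placement locate_vtxprop(uint32_t vertexId) {
    if (vertexId < kSpCutoff)                          // most-connected vertices
        return {true, vertexId % kNumScratchpads};     // interleave across SPs
    return {false, 0};                                 // least-connected: via caches
}
```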

Execution of Atomic Operations without PISC

Example: Core0 runs PageRank starting at vertex V4.
V1's current value is read from the remote scratchpad (SP1); Core0 computes and writes back the updated value.
Costs: on-chip latency and traffic, locking overhead, core energy consumption.
[Figure: message sequence req 4 / res 4, req 1 / res 1.]

Offloading Atomic Operations to PISC

Example: Core0 runs PageRank starting at vertex V4.
The V1 update is offloaded to SP1's PISC.
Reduced: on-chip latency and traffic, locking overhead, core energy consumption.
[Figure: message sequence req 4 / res 4, upd 1; the update needs no response.]
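The sketch below contrasts the two paths. The memory-mapped register interface is a hypothetical stand-in for the actual hardware interface (addresses and names are invented for illustration); the key point is that the offload is a pair of plain stores with no read, no response, and no CAS loop on the core.

```cpp
#include <atomic>
#include <cstdint>

// Without PISC: the core reads the remote vtxProp, computes, and writes back,
// paying a round trip plus atomic/locking overhead.
void update_on_core(std::atomic<float>* pr, uint32_t dest, float contrib) {
    float old = pr[dest].load();
    while (!pr[dest].compare_exchange_weak(old, old + contrib)) {}
}

// With PISC: the core ships {contrib, dest} to the scratchpad controller via
// memory-mapped registers; the add happens in situ at the SP.
volatile float*    MMR_VALUE = reinterpret_cast<volatile float*>(0xF0000000);    // hypothetical address
volatile uint32_t* MMR_DEST  = reinterpret_cast<volatile uint32_t*>(0xF0000004); // hypothetical address

void update_via_pisc(uint32_t dest, float contrib) {
    *MMR_VALUE = contrib;   // stage the operand
    *MMR_DEST  = dest;      // writing dest triggers the in-scratchpad add; no response needed
}
```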

Source Vertex Buffer

Minimizes remote accesses to the source vertex's data.
SSSP atomic update function (accesses the source vertex):
ShortestLen[dest] = min(ShortestLen[dest], ShortestLen[src] + edgeLen)

Example: Core0 runs SSSP starting at vertex V3.
First V3 access: served from SP1 and cached in the buffer.
Subsequent V3 accesses: served from the buffer.
The buffer is read-only and not coherent with the other SPs.
[Figure: message sequence req 3 / res 3, upd 0, upd 1.]
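A software analogue of the buffering idea, as a hedged sketch with illustrative names: the source's distance is fetched once and reused for every outgoing edge, rather than re-read from the (possibly remote) scratchpad per edge. In OMEGA the min itself would be an atomic operation at the PISC; here it is plain code.

```cpp
#include <cstdint>
#include <vector>

void relax_out_edges(uint32_t src,
                     const std::vector<uint32_t>& rowPtr,
                     const std::vector<uint32_t>& colIdx,
                     const std::vector<uint32_t>& edgeLen,
                     std::vector<uint32_t>& shortestLen) {
    uint32_t srcLen = shortestLen[src];   // read once, like the source vertex buffer
    for (uint32_t e = rowPtr[src]; e < rowPtr[src + 1]; ++e) {
        uint32_t dest = colIdx[e];
        uint32_t cand = srcLen + edgeLen[e];
        // ShortestLen[dest] = min(ShortestLen[dest], ShortestLen[src] + edgeLen)
        if (cand < shortestLen[dest]) shortestLen[dest] = cand;  // atomic-min in hardware
    }
}
```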

Graph Preprocessing

Graph reordering identifies the most-connected vertices; indegree-based reordering is preferable.
[Figure: example graph before and after reordering.]
Most-connected vertices [V0-V1]: vtxProp mapped to the SPs of the OMEGA node.
Least-connected vertices [V2-V9]: vtxProp stored in the caches.
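A minimal sketch of indegree-based reordering, assuming the goal is simply to relabel vertices so the most-connected ones get the lowest ids (and therefore land in the scratchpads); function and variable names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns newId[oldId]: vertices ranked by descending indegree.
std::vector<uint32_t> reorder_by_indegree(const std::vector<uint64_t>& indeg) {
    std::vector<uint32_t> order(indeg.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](uint32_t a, uint32_t b) { return indeg[a] > indeg[b]; });
    std::vector<uint32_t> newId(indeg.size());
    for (uint32_t rank = 0; rank < order.size(); ++rank)
        newId[order[rank]] = rank;   // rank 0 = highest indegree
    return newId;
}
```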

High-Level Framework Modifications

A source-to-source transformation tool:
transforms the atomic update function;
extracts scratchpad and PISC configuration parameters (e.g., #vertices, atomic-operation type).

Transforming the atomic update function:
before: [atomic_]next_PR[dest] += curr_PR[src]/src.degree
after:  *mmr1 = curr_PR[src]/src.degree; *mmr2 = dest
(*mmr denotes a memory-mapped register)
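The extracted configuration parameters might be captured in a record like the following; this is a hypothetical sketch, since the talk names only the parameters, not their exact representation.

```cpp
#include <cstdint>

enum class AtomicOp { Add, Min, Max };   // PISC operation types (illustrative)

struct OmegaConfig {
    uint64_t numVertices;    // sizes the scratchpad partition of vtxProp
    uint32_t vtxPropBytes;   // bytes per vertex-property element
    AtomicOp op;             // atomic-operation type offloaded to the PISC
};
```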

Comparison with Prior Work

Compared against Beamer et al. [IISWC'15], Graphicionado [MICRO'16], Tesseract [ISCA'15], and GraphPIM [HPCA'17] along five axes: whether the design leverages the power law (yes/no), the memory subsystem (general-purpose, specialized, or heterogeneous), framework flexibility (yes/limited/no), on-chip traffic (high/low), and vtxProp access latency and energy (low/limited/high). OMEGA is the only design that leverages the power law with a heterogeneous memory subsystem while retaining full framework flexibility, low on-chip traffic, and low vtxProp access latency and energy.

Experimental Setup

Simulator: gem5.
Common: 16 cores, OoO, 8-wide; L1 I/D caches: 16KB, 4/8-way, private.
Baseline-specific: L2 cache: 2MB per core, 8-way, shared.
OMEGA-specific: scratchpad + L2 cache = baseline's L2 cache (L2 cache: 1MB per core, scratchpad: 1MB per core); PISC has insignificant area and power overhead (<<1%); source vertex buffer: 32 entries.
Graph framework: Ligra.

Workload Description

Graph algorithms: PageRank, BFS (Breadth-First Search), SSSP (Single-Source Shortest Path), BC (Betweenness Centrality), Radii, CC (Connected Components), TC (Triangle Counting).

Graph datasets:
Medium: rMat (synthetic), sd (slashdot, social network), ap (astroPh, collaboration network), rCA (roadNet-CA, road network), rPA (roadNet-PA, road network).
Large: ic (Indochina, web graph), wiki (enwiki), orkut (Orkut), lj (ljournal).

Algorithms and datasets from our workload characterization omitted here:
K-Core: has characteristics similar to TC.
Some datasets: long simulation times.

Performance Analysis

>2x speedup on average. Key observations:
Medium and large graphs achieve similar speedups.
Significant speedup even for the non-power-law graphs (rCA and rPA), which fit in the scratchpads.
[Chart: per-benchmark speedup, medium and large datasets.]

Comparison of Off-Chip and On-Chip Communication

PageRank: 2.28x off-chip bandwidth utilization relative to the CMP baseline.
PageRank: 3.2x on-chip traffic reduction over the CMP baseline.

Energy Analysis (PageRank)

2.5x energy savings over the baseline CMP. Main sources of energy savings:
higher speedup;
fewer DRAM accesses;
lower energy per access for scratchpads compared to caches.

Conclusions

A heterogeneous memory subsystem provides significant performance and energy improvements for power-law graphs. OMEGA:
provides over 2x speedup on average;
achieves over 2.5x energy savings on PageRank;
incurs no area overhead.

Backup Slides

Performance on Large-Scale Datasets

1.68x speedup for PageRank on twitter, storing only 5% of vtxProp in the scratchpads.
1.35x speedup for BFS on twitter, storing only 10% of vtxProp in the scratchpads.

Performance on Non-Power-Law Graphs

Only 1.15x speedup on a large non-power-law graph.

Comparison of On-Chip Storage Accesses

OMEGA's scratchpad + cache storage provides over a 70% hit rate for power-law graphs.