GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian Tsinghua University University of Southern California Stanford University
Outline: Motivation (graph applications; processing-in-memory; drawbacks of the current solution) · GraphP · Evaluation
Graph Applications: social network analytics, recommendation systems, bioinformatics, and more. In all of these, a graph (e.g., a social graph or an RDF graph) is the underlying data representation.
Challenges: graph processing has a high memory bandwidth requirement but only a small amount of computation per vertex, so data movement dominates. In conventional systems, data travels from memory through the cache hierarchy (L1/L2/L3) to the compute units; despite the many optimizations that have been proposed, this data movement overhead limits memory access performance.
PIM: Processing-In-Memory. Idea: place computation logic inside memory, avoiding the data movement overhead. Advantage: high memory bandwidth. Example: Hybrid Memory Cubes (HMC), with 320 GB/s intra-cube bandwidth and 4 × 120 GB/s inter-cube links.
HMC: Hybrid Memory Cubes. Intra-cube bandwidth is 320 GB/s; each inter-cube SerDes link provides 120 GB/s. With up to 4 links per cube, 4 cubes are easy to connect: a fully connected link per pair gives 120 GB/s between each cube. Scaling to 16 cubes this way is impossible, so cubes are organized as 4 groups of 4 in a Dragonfly topology: only one link connects each pair of groups, so inter-group bandwidth is 120 GB/s and the effective bandwidth between cubes in different groups is below 120 GB/s. Bottleneck: inter-cube communication.
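A back-of-envelope sketch of the resulting per-pair bandwidth (link numbers from the slide; modeling the single inter-group link as evenly shared among a group's four cubes is an assumption for illustration):

```python
# Effective bandwidth between cube pairs in the assumed 16-cube Dragonfly.
LINK_BW_GBPS = 120          # one SerDes link, per the slide
PER_GROUP = 4               # 4 groups of 4 cubes

intra_group_bw = LINK_BW_GBPS               # dedicated link within a group
inter_group_bw = LINK_BW_GBPS / PER_GROUP   # one link shared by a whole group
# cubes in different groups see well under 120 GB/s between them
```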
Outline: Motivation (graph applications; processing-in-memory; drawbacks of the current solution) · GraphP · Evaluation
Current Solution: Tesseract, the first PIM-based graph processing architecture. Programming model: vertex programs. Partition: based on the vertex program. [Ahn, J., Hong, S., Yoo, S., Mutlu, O., and Choi, K. A scalable processing-in-memory accelerator for parallel graph processing. ISCA 2015.]
PageRank in Vertex Program

for (v: vertices) {                        // iterate all vertices
  update = 0.85 * v.rank / v.out_degree;
  for (w: v.edges.destinations) {          // iterate all destinations (neighbors) of the source
    put(w.id, function() { w.next_rank += update; });
  }
}
barrier();

The implication: each put() to a vertex held by another cube becomes a cross-cube message.
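As a concrete sketch, the loop above can be run in plain Python (the 5-vertex edge list is an assumption for illustration, not the exact graph on the slide; `put` is modeled as a direct write into `next_rank`):

```python
# Tesseract-style vertex-program PageRank, one iteration.
# put(w.id, ...) is modeled as a direct write; in hardware it becomes a
# message whenever w lives in another cube.

edges = {1: [2], 2: [3, 4, 5], 3: [4], 4: [5], 5: []}  # assumed example graph
rank = {v: 1.0 for v in edges}
next_rank = {v: 0.15 for v in edges}

for v, dests in edges.items():               # iterate all vertices
    if not dests:                            # skip sinks (out_degree == 0)
        continue
    update = 0.85 * rank[v] / len(dests)     # len(dests) == v.out_degree
    for w in dests:                          # iterate all destinations of v
        next_rank[w] += update               # put(w.id, ...) on the destination
# barrier(): every cube finishes before ranks are swapped
rank, next_rank = next_rank, {v: 0.15 for v in edges}
```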
Graph Partition

We will use the same example graph throughout the talk: hmc0 holds vertices 1 and 2; hmc1 holds vertices 3, 4, and 5. An edge between vertices in the same cube is an intra-cube edge; an edge whose endpoints sit in different cubes is an inter-cube edge.

put(w.id, function() { w.next_rank += update; });

Each put() along an inter-cube edge becomes a message, so communication = # of cross-cube edges.
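The cost of a partition can be counted directly; a sketch with an assumed 5-vertex edge list split as on the slide (vertices 1–2 in hmc0, 3–5 in hmc1):

```python
# Count the messages one iteration generates under a Tesseract-style
# partition: one per edge whose endpoints sit in different cubes.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]  # assumed example graph
cube_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}                  # hmc0: {1, 2}; hmc1: {3, 4, 5}

cross_edges = sum(1 for (u, w) in edges if cube_of[u] != cube_of[w])
# communication per iteration == cross_edges
```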
Drawback of Tesseract: excessive data communication. Why? Tesseract fixes the programming model first; the programming model dictates the graph partition, and the partition in turn determines the data communication.
Outline Motivation GraphP Evaluation
GraphP considers graph partition first. Graph partition: source-cut. Programming model: two-phase vertex program. Result: reduced inter-cube communication.
Source-Cut Partition

This is the same example graph: hmc0 holds vertices 1 and 2; hmc1 holds vertices 3, 4, and 5. Under source-cut, a replica of vertex 2 is placed in hmc1 and vertex 2's cross-cube edges are stored with the replica, so every edge becomes an intra-cube edge; only the replica has to be kept in sync across cubes.
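Replica placement under source-cut can be sketched the same way (edge list and partition are assumptions for illustration): a vertex gets one replica in each remote cube that holds any of its destinations, and its cross-cube edges become local edges of that replica.

```python
# Source-cut: replicate each source vertex once per remote cube that
# holds one of its destinations; cross-cube traffic then shrinks from
# one message per cross-cube edge to one message per replica.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]  # assumed example graph
cube_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

replicas = {(u, cube_of[w]) for (u, w) in edges if cube_of[u] != cube_of[w]}
cross_edges = sum(1 for (u, w) in edges if cube_of[u] != cube_of[w])
# here, 3 cross-cube edges collapse into a single replica of vertex 2 in hmc1
```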
Two-Phase Vertex Program (apply replica updates)

for (r: replicas) {
  r.next_rank = 0.85 * r.next_rank / r.out_degree;  // apply updates from the previous iteration
}
Two-Phase Vertex Program (local gather)

for (v: vertices) {
  for (u: v.edges.sources) {    // all sources are local under source-cut
    update += u.rank;
  }
}
Two-Phase Vertex Program (replica synchronization)

for (r: replicas) {
  put(r.id, function() { r.next_rank = update; });
}
barrier();

Only replica updates cross the cube boundary: 1:1 replica communication.
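One full GraphP iteration can be sketched end to end in Python (edge list and partition are assumptions for illustration; the phases above are collapsed into one local pull per edge plus a replica sync, and cross-cube puts are simply counted):

```python
# Two-phase PageRank sketch: under source-cut every edge is local, so the
# gather/apply phase needs no messages; the sync phase sends exactly one
# message per replica.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]  # assumed example graph
cube_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
out_deg = {v: sum(1 for (u, _) in edges if u == v) for v in cube_of}
rank = {v: 1.0 for v in cube_of}

# Phase 1: local gather and apply (every source is local, possibly a replica).
next_rank = {v: 0.15 for v in cube_of}
for (u, w) in edges:
    next_rank[w] += 0.85 * rank[u] / out_deg[u]

# Phase 2: 1:1 replica synchronization across the cube boundary.
replicas = {(u, cube_of[w]) for (u, w) in edges if cube_of[u] != cube_of[w]}
messages = len(replicas)
```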
Benefits Strictly less data communication Enables architecture optimizations
Less Communication: [Figure: on the example graph, Tesseract sends one message per cross-cube edge of vertex 2, while GraphP sends a single message to vertex 2's replica.]
Broadcast Optimization

for (r: replicas) {
  put(r.id, function() { r.next_rank = update; });  // becomes a broadcast
}
barrier();

When a vertex has replicas in several cubes, the same update can be broadcast instead of sent as separate point-to-point messages.
Naïve Broadcast: with 16 cubes, broadcasting from one source requires 15 point-to-point messages. To reach a remote group of 4 cubes, 4 identical messages are sent over the single inter-group link.
Hierarchical Communication: the source sends only 3 inter-group messages, one per remote group; each message is then forwarded to the other cubes within the destination group over local links.
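Counting messages makes the gain concrete; a sketch assuming the slide's 16-cube, 4-group Dragonfly with cube IDs grouped consecutively:

```python
# Naive broadcast: the source sends a point-to-point copy to each of the
# other 15 cubes, so 4 identical copies cross every inter-group link.
# Hierarchical broadcast: one message per remote group, then local fan-out.
GROUPS, PER_GROUP = 4, 4
NUM_CUBES = GROUPS * PER_GROUP

def group_of(cube):
    return cube // PER_GROUP  # cubes 0-3 form group 0, 4-7 group 1, ...

src = 0
naive_total = NUM_CUBES - 1                      # point-to-point messages
naive_intergroup = sum(1 for d in range(NUM_CUBES)
                       if d != src and group_of(d) != group_of(src))
hier_intergroup = GROUPS - 1                     # one per remote group
```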
Other Optimizations Computation/communication overlap Leveraging low-power state of SerDes Please see the paper for more details
Outline Motivation GraphP Evaluation
Evaluation Methodology. Simulation infrastructure: zSim with HMC support; ORION for NoC energy modeling. Configurations (same as Tesseract): 16 HMCs; interconnect: Dragonfly and 2D mesh; 512 single-issue in-order cores at 1 GHz.
Workloads: 4 graph algorithms (Breadth-First Search, Single-Source Shortest Path, Weakly Connected Components, PageRank) on 5 real-world graphs: Wiki-Vote (WV), ego-Twitter (TT), soc-Slashdot0902 (SD), Amazon0302 (AZ), ljournal-2008 (LJ).
Performance: over Tesseract, the data partition alone contributes less than 1.1× speedup; combined with the better use of memory bandwidth, the speedup reaches 1.7×.
Communication Amount
Energy Consumption
Other results Bandwidth utilization Scalability Replication overhead Please see the paper for more details
Conclusions: We propose GraphP, a new PIM-based graph processing framework. Key contributions: treating data partition as a first-order design consideration; the source-cut partition; the two-phase vertex program; and the additional architecture optimizations these enable. GraphP drastically reduces inter-cube communication and improves energy efficiency.
Workload Size & Capacity: 16 HMCs provide 128 GB (16 × 8 GB) of memory, enough for roughly 16 billion edges. For comparison, the largest SNAP graphs have ~400 million edges (https://snap.stanford.edu/data/) and WebGraph datasets reach ~7 billion edges (http://law.di.unimi.it/datasets.php).
Two-Phase Vertex Program: equivalent in expressiveness to ordinary vertex programs.