1
GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition
Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian Tsinghua University University of Southern California Stanford University
2
Outline
Motivation
  Graph applications
  Processing-In-Memory
  The drawbacks of the current solution
GraphP
Evaluation
3
Graph Applications
Social network analytics
Recommendation systems
Bioinformatics
…
In all of these domains, a graph (e.g., a social graph or a Resource Description Framework graph) is the underlying data representation.
4
Challenges
High bandwidth requirement
Small amount of computation per vertex
Data movement overhead
Although many optimizations have been proposed for conventional computer systems, data still travels through the cache hierarchy (L1/L2/L3) from memory to the computation units, and this data movement overhead limits memory access performance.
5
PIM: Processing-In-Memory
Idea: computation logic inside memory
Advantage: high memory bandwidth, avoiding the data movement overhead
Example: Hybrid Memory Cube (HMC), with 320 GB/s intra-cube bandwidth and 4 x 120 GB/s inter-cube links
6
HMC: Hybrid Memory Cubes
Bandwidth (GB/s): intra-cube 320, inter-cube 120, inter-group 120.
Bottleneck: inter-cube communication.
Four cubes are easy to connect: a fully connected link joins each pair, so the bandwidth between any two cubes is 120 GB/s. Scaling to 16 cubes this way is impossible, because each cube supports at most 4 links. Instead, cubes are organized as groups of 4 in a Dragonfly topology, with only one link connecting each pair of groups. The bandwidth between two groups is 120 GB/s, so the effective bandwidth between cubes in different groups is less than 120 GB/s.
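To make the "bandwidth between cubes in different groups is less than 120 GB/s" point concrete, here is a back-of-the-envelope C++ sketch. The topology parameters (4 groups of 4 cubes, full connectivity within a group, one 120 GB/s link per group pair) follow this slide; the worst-case sharing assumption (all 16 cross-group cube pairs active at once) is an added illustration, not a number from the talk.

// Back-of-the-envelope model of the 16-cube Dragonfly organization described
// above (an illustrative sketch, not the exact topology used in the paper).
#include <iostream>

int main() {
    const int cubes = 16, groups = 4, cubes_per_group = cubes / groups;
    const double link_bw = 120.0;                 // GB/s per SerDes link
    const int links_per_cube = 4;                 // HMC provides up to 4 links

    // Inside a group, every cube connects directly to the other 3 cubes.
    int intra_links_used = cubes_per_group - 1;   // 3 of the 4 ports
    int spare_ports_per_group = (links_per_cube - intra_links_used) * cubes_per_group;
    int inter_group_links_needed = groups - 1;    // one link to each other group

    std::cout << "ports left per group for global links: " << spare_ports_per_group
              << " (need " << inter_group_links_needed << ")\n";

    // The single inter-group link is shared by every cube pair that crosses it,
    // so the effective per-pair bandwidth is far below 120 GB/s.
    int crossing_pairs = cubes_per_group * cubes_per_group;   // 4 x 4 = 16
    std::cout << "worst-case bandwidth per cross-group cube pair: "
              << link_bw / crossing_pairs << " GB/s\n";
    return 0;
}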
7
Outline
Motivation
  Graph applications
  Processing-In-Memory
  The drawbacks of the current solution
GraphP
Evaluation
8
Current Solution: Tesseract
First PIM-based graph processing architecture
Programming model: vertex program
Partition: determined by the vertex program
Ahn, J., Hong, S., Yoo, S., Mutlu, O., and Choi, K. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
9
PageRank in Vertex Program
for (v: vertices) {                          // iterate over all vertices
  update = 0.85 * v.rank / v.out_degree;
  for (w: edges.destination) {               // iterate over all destinations (neighbours) of v
    put(w.id, function{ w.next_rank += update; });
  }
}
barrier();
The implication of the put() call is discussed on the next slide.
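The vertex program above is pseudocode; as a concrete reference, here is a minimal single-machine C++ emulation of one such iteration. The 5-vertex edge list is an assumption made for illustration (the slides only sketch the example graph), and put() is modeled as a plain local accumulation rather than Tesseract's remote function call.

// A minimal, single-machine emulation of the put()-style PageRank iteration
// sketched above (illustrative only).
#include <cstdio>
#include <vector>

struct Vertex { double rank = 1.0, next_rank = 0.0; std::vector<int> out; };

int main() {
    // Hypothetical edge list standing in for the talk's 5-vertex example graph.
    std::vector<Vertex> g(5);
    g[0].out = {1};          // 1 -> 2
    g[1].out = {2, 3, 4};    // 2 -> 3, 2 -> 4, 2 -> 5
    g[2].out = {3};          // 3 -> 4
    g[3].out = {4};          // 4 -> 5

    for (auto& v : g) v.next_rank = 0.15;            // base rank term

    for (auto& v : g) {                               // for (v: vertices)
        if (v.out.empty()) continue;
        double update = 0.85 * v.rank / v.out.size();
        for (int w : v.out)                           // for (w: edges.destination)
            g[w].next_rank += update;                 // put(w.id, ...)
    }
    // barrier(); the ranks for the next iteration are now in next_rank
    for (size_t i = 0; i < g.size(); ++i)
        std::printf("vertex %zu next_rank = %.3f\n", i + 1, g[i].next_rank);
    return 0;
}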
10
Graph Partition
[Figure: the example graph, with vertices 1 and 2 assigned to HMC0 and vertices 3, 4 and 5 to HMC1; edges are marked as intra-cube or inter-cube.]
We will be using the same example graph throughout the talk.
Each put(w.id, function{ w.next_rank += update; }) whose destination vertex lives in another cube becomes an inter-cube message, so communication = # of cross-cube edges.
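To make the "communication = # of cross-cube edges" rule concrete, here is a small C++ sketch that counts those edges for the example partition. The edge list is the same illustrative one assumed in the previous sketch.

// Sketch of the Tesseract communication model stated above.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // edges as (source, destination), 1-indexed vertex ids (illustrative)
    std::vector<std::pair<int, int>> edges = {{1, 2}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {4, 5}};
    auto cube = [](int v) { return v <= 2 ? 0 : 1; };   // HMC0 owns 1-2, HMC1 owns 3-5

    int cross = 0;
    for (auto [src, dst] : edges)
        if (cube(src) != cube(dst)) ++cross;             // one put() message per such edge

    std::printf("cross-cube edges (Tesseract messages per iteration): %d\n", cross);
    return 0;
}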
11
Drawback of Tesseract
Excessive data communication. Why?
Programming model → graph partition → data communication: in Tesseract, the vertex program dictates the graph partition, which in turn determines the amount of data communication.
12
Outline
Motivation
GraphP
Evaluation
13
GraphP
Consider the graph partition first.
Graph partition: source-cut
Programming model: two-phase vertex program
Reduces inter-cube communication
14
Source-Cut Partition
[Figure: the same example graph under the source-cut partition: vertices 1 and 2 on HMC0, vertices 3, 4 and 5 on HMC1, plus a replica of vertex 2 on HMC1, so the former cross-cube edges become local edges from the replica.]
This is the same example graph as before.
15
Two-Phase Vertex Program
// Apply updates from the previous iteration
for (r: replicas) {
  r.next_rank = 0.85 * r.next_rank / r.out_degree;
}
16
Two-Phase Vertex Program
// Gather: every vertex accumulates the ranks of its (now local) source vertices
for (v: vertices) {
  for (u: edges.sources) {
    update += u.rank;
  }
}
17
Two-Phase Vertex Program
// Communicate: send each vertex's update to its replicas in other cubes
for (r: replicas) {
  put(r.id, function { r.next_rank = update; });
}
barrier();
Only the master-to-replica updates cross the cube boundary: 1:1 replica communication.
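Putting the three snippets together, here is a compact single-machine C++ emulation of one iteration under the source-cut partition. It is a sketch under the same illustrative edge list as before; the point is the structure: the gather step reads only cube-local data (through the replica of vertex 2), and only the master-to-replica put() crosses the cube boundary.

// Compact emulation of the two-phase structure for the example source-cut
// partition (vertices 1-2 on HMC0, 3-5 on HMC1, vertex 2 replicated on HMC1).
#include <cstdio>
#include <vector>

int main() {
    const int N = 5;                                    // vertices 1..5 (0-indexed below)
    std::vector<int> cube = {0, 0, 1, 1, 1};            // owner cube per vertex
    std::vector<double> rank(N, 1.0), next_rank(N, 0.0);
    std::vector<int> out_deg = {1, 3, 1, 1, 0};
    // in-edges: for each vertex, the list of its source vertices (illustrative)
    std::vector<std::vector<int>> src = {{}, {0}, {1}, {1, 2}, {1, 3}};
    // replica of vertex 2 kept on cube 1; it mirrors the master's rank
    double replica_of_2_on_cube1 = rank[1];

    // Gather: every vertex reads the ranks of its sources; a source in another
    // cube is read through its local replica, never over the network.
    for (int v = 0; v < N; ++v) {
        double update = 0.0;
        for (int u : src[v]) {
            double u_rank = (cube[u] == cube[v]) ? rank[u]
                          : replica_of_2_on_cube1;       // only vertex 2 is remote here
            update += 0.85 * u_rank / out_deg[u];
        }
        next_rank[v] = 0.15 + update;
    }

    // Communicate: one put() per (vertex, replica) pair crosses the cube boundary.
    int inter_cube_messages = 1;                         // vertex 2 -> its replica on cube 1
    replica_of_2_on_cube1 = next_rank[1];                // put(r.id, ...) + barrier(); used next iteration

    for (int v = 0; v < N; ++v)
        std::printf("vertex %d rank = %.3f\n", v + 1, next_rank[v]);
    std::printf("inter-cube messages this iteration: %d\n", inter_cube_messages);
    return 0;
}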
18
Benefits Strictly less data communication
Enables architecture optimizations
19
Less Communication
[Figure: cross-cube messages for the example graph: under Tesseract, vertex 2's update is sent once per cross-cube edge (to vertices 3, 4 and 5); under GraphP, it is sent once, to the replica of vertex 2.]
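A small C++ sketch of the comparison in the figure, using the same illustrative edge list: Tesseract sends one message per cross-cube edge, while GraphP sends one message per (vertex, remote cube) replica pair, which can never be larger.

// Why source-cut communicates strictly less on this example.
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<int, int>> edges = {{1, 2}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {4, 5}};
    auto cube = [](int v) { return v <= 2 ? 0 : 1; };    // HMC0 owns 1-2, HMC1 owns 3-5

    int tesseract = 0;                                    // one put() per cross-cube edge
    std::set<std::pair<int, int>> replicas;               // (source vertex, remote cube)
    for (auto [src, dst] : edges) {
        if (cube(src) != cube(dst)) {
            ++tesseract;
            replicas.insert({src, cube(dst)});            // GraphP: one replica update instead
        }
    }
    std::printf("Tesseract messages: %d, GraphP messages: %zu\n",
                tesseract, replicas.size());              // 3 vs. 1 on this example
    return 0;
}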
20
Broadcast Optimization
// Communicate phase revisited: the same update goes to every replica
for (r: replicas) {
  put(r.id, function { r.next_rank = update; });
}
barrier();
Because every replica receives the identical update, the per-replica put() calls can be replaced by a single broadcast (see the message-count sketch after the hierarchical-communication slide).
21
Naïve Broadcast
15 point-to-point messages, one per destination cube.
[Figure: the source cube sends a separate message to each of the 15 destination cubes.]
To send to a remote group of 4 cubes, 4 identical messages travel over the same inter-group link.
22
Hierarchical Communication
Only 3 inter-group messages: 1 inter-group message per remote group.
[Figure: the source cube sends a single message to each remote group; it is then forwarded to the other cubes within that group.]
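A quick C++ message-count sketch for the two broadcast schemes, assuming the 16-cube, 4-group organization described earlier and that every other cube needs the update:

// Message counts for broadcasting one replica update from a single source cube.
#include <cstdio>

int main() {
    const int groups = 4, cubes_per_group = 4;

    // Naive broadcast: one point-to-point message per destination cube.
    int naive_total = groups * cubes_per_group - 1;                  // 15
    int naive_inter_group = (groups - 1) * cubes_per_group;          // 12 cross the scarce links

    // Hierarchical broadcast: one message per remote group over the
    // inter-group link, then local forwarding inside each group.
    int hier_inter_group = groups - 1;                               // 3
    int hier_intra_group = (groups - 1) * (cubes_per_group - 1)      // forwarding in remote groups
                         + (cubes_per_group - 1);                    // plus the source's own group

    std::printf("naive: %d messages (%d inter-group)\n", naive_total, naive_inter_group);
    std::printf("hierarchical: %d inter-group + %d intra-group messages\n",
                hier_inter_group, hier_intra_group);
    return 0;
}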
23
Other Optimizations Computation/communication overlap
Leveraging the low-power state of SerDes links
Please see the paper for more details.
24
Outline
Motivation
GraphP
Evaluation
25
Evaluation Methodology
Simulation infrastructure
  zSim with HMC support
  ORION for NoC energy modeling
Configurations (same as Tesseract)
  16 HMCs
  Interconnection: Dragonfly and Mesh2D
  512 single-issue, in-order cores at 1 GHz
26
Workloads
4 graph algorithms
  Breadth First Search
  Single Source Shortest Path
  Weakly Connected Component
  PageRank
5 real-world graphs
  Wiki-Vote (WV)
  ego-Twitter (TT)
  soc-Slashdot0902 (SD)
  Amazon0302 (AZ)
  ljournal-2008 (LJ)
27
Performance
[Figure: speedup normalized to Tesseract; chart labels: <1.1x (data partition) and 1.7x (memory bandwidth).]
28
Communication Amount
29
Energy Consumption
30
Other results Bandwidth utilization Scalability Replication overhead
Please see the paper for more details
31
Conclusions
We propose GraphP, a new PIM-based graph processing framework.
Key contributions
  Data partition as a first-order design consideration
  Source-cut partition
  Two-phase vertex program
  Enabling additional architecture optimizations
GraphP drastically reduces inter-cube communication and improves energy efficiency.
32
GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition
Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian
Tsinghua University, University of Southern California, Stanford University
This talk is presented from USC; it is joint work with Tsinghua University and Stanford University.
33
Workload Size & Capacity
Total capacity: 128 GB (16 x 8 GB), roughly 16 billion edges
~400 million edges (SNAP graphs)
~7 billion edges (WebGraph datasets)
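For reference, the ~16-billion-edge figure is consistent with assuming roughly 8 bytes of storage per edge (an assumption, not a number stated on the slides): 128 GB / 8 B per edge = 16 x 10^9 edges.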
34
Two-phase vertex program
Equivalent expressiveness to vertex programs.