
1 GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition
Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian
Tsinghua University, University of Southern California, Stanford University

2 Outline
Motivation: Graph applications; Processing-In-Memory; The drawbacks of the current solution
GraphP
Evaluation

3 Graph Applications
Social network analytics
Recommendation system
Bioinformatics
Underlying representation: a graph (for example, a social graph or a Resource Description graph)

4 Challenges
High bandwidth requirement: small amount of computation per vertex
Data movement overhead
[Diagram: comp, L1, L2, L3, mem]
Notes: In conventional computer systems, data goes through the cache hierarchy from memory to the computation units. This data movement overhead limits memory access performance.

5 PIM: Processing-In-Memory
Idea: Computation logic inside memory
Advantage: High memory bandwidth
Example: Hybrid Memory Cubes (HMC), 320 GB/s intra-cube, 4 x 120 GB/s inter-cube
Notes: Computing inside the cube avoids the data movement overhead.

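6 HMC: Hybrid Memory Cubes
Bandwidth (GB/s): intra-cube 320, inter-cube 120, inter-group 120
Bottleneck: Inter-cube communication
Notes: Four cubes are easy to connect: they are fully connected, with one link between each pair, so the bandwidth between any two cubes is 120 GB/s. Scaling to 16 cubes the same way is impossible because each cube has at most 4 links. Instead, every 4 cubes form a group, and the groups are connected in a Dragonfly topology with only one link between each pair of groups. The bandwidth between groups is 120 GB/s, so the bandwidth between two cubes in different groups is less than 120 GB/s.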
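The link-count argument in the notes can be checked with a few lines of arithmetic. This is only a sketch: the limit of 4 links per cube is read off the "4 x 120 GB/s inter-cube" figure on the previous slide, and the function names are hypothetical.

```python
LINKS_PER_CUBE = 4  # assumption taken from "4 x 120 GB/s inter-cube"


def links_per_cube_fully_connected(num_cubes):
    """Links each cube needs if every pair of cubes is directly connected."""
    return num_cubes - 1


def spare_links_per_group(group_size, links_per_cube=LINKS_PER_CUBE):
    """Links left for inter-group traffic after fully connecting one group."""
    intra_group_links = group_size - 1
    return group_size * (links_per_cube - intra_group_links)


# 4 cubes: 3 links per cube are enough, so full connectivity fits.
print(links_per_cube_fully_connected(4) <= LINKS_PER_CUBE)
# 16 cubes: 15 links per cube would be needed, which exceeds the limit.
print(links_per_cube_fully_connected(16) <= LINKS_PER_CUBE)
# A fully connected group of 4 keeps 4 spare links for reaching other groups.
print(spare_links_per_group(4))
```

With 4 groups, each group has to reach 3 remote groups, which the spare links can cover with a single link per remote group.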

7 Outline
Motivation: Graph applications; Processing-In-Memory; The drawbacks of the current solution
GraphP
Evaluation

8 Current Solution: Tesseract
First PIM-based graph processing architecture
Programming model: Vertex program
Partition: Based on vertex program
Reference: Ahn, J., Hong, S., Yoo, S., Mutlu, O., & Choi, K. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

9 PageRank in Vertex Program
    for (v: vertices) {                      // iterate over all vertices
        update = 0.85 * v.rank / v.out_degree;
        for (w: edges.destination) {         // iterate over all destinations (neighbours) of the source
            put(w.id, function{ w.next_rank += update; });
        }
    }
    barrier();

10 Graph Partition
[Figure: the example graph partitioned across hmc0 and hmc1. Legend: vertex, intra edge, inter edge.]
Notes: We will be using the same example graph throughout the talk.
Each inter edge causes communication: put(w.id, function{ w.next_rank += update; });
communication = # of cross-cube edges
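As a rough illustration of "communication = # of cross-cube edges", the sketch below runs one iteration of the vertex program in plain Python and counts the put() calls that would cross a cube boundary. The 6-vertex graph, the two-cube assignment, and the function name are hypothetical and are not the graph drawn on the slide.

```python
def vertex_program_iteration(edges, cube_of, rank):
    """One PageRank iteration in the vertex-program style.

    Returns the next ranks and the number of remote put() calls,
    that is, updates whose source and destination are in different cubes.
    """
    out_degree = {v: 0 for v in cube_of}
    for src, _ in edges:
        out_degree[src] += 1

    next_rank = {v: 0.0 for v in cube_of}
    remote_puts = 0
    for src, dst in edges:
        update = 0.85 * rank[src] / out_degree[src]
        next_rank[dst] += update  # body of put(w.id, ...)
        if cube_of[src] != cube_of[dst]:
            remote_puts += 1  # this put crosses a cube boundary
    return next_rank, remote_puts


# Hypothetical example: vertices 0-2 on cube 0, vertices 3-5 on cube 1.
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
cube_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
rank = {v: 1.0 / len(cube_of) for v in cube_of}

next_rank, remote_puts = vertex_program_iteration(edges, cube_of, rank)
print(remote_puts)  # one remote put per cross-cube edge
```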

11 Drawback of Tesseract
Excessive data communication
Why?
[Diagram: Programming Model, Graph Partition, Data Communication. Rows: Tesseract, ?]

12 Outline Motivation GraphP Evaluation

13 GraphP
Consider graph partition first.
Graph Partition: Source-Cut
Programming model: Two-phase vertex program
Reduces inter-cube communication

14 Source-Cut Partition
[Figure: the same example graph under source-cut partition across hmc0 and hmc1, with a replica of vertex 2 shown next to vertices 3, 4, 5. Legend: vertex, intra edge, inter edge, replica.]
Notes: This is the same graph.
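One way to picture the source-cut layout is the sketch below. It assumes, based on the figure and the later "for (u: edges.sources)" loop, that each cube stores its own vertices together with their incoming edges and keeps a replica of every remote source vertex. The graph, the dictionary layout, and the function name are hypothetical.

```python
from collections import defaultdict


def source_cut_partition(edges, cube_of):
    """Group incoming edges by the destination's cube and replicate remote sources."""
    cubes = defaultdict(lambda: {"masters": set(), "replicas": set(), "in_edges": []})
    for vertex, cube in cube_of.items():
        cubes[cube]["masters"].add(vertex)
    for src, dst in edges:
        cube = cube_of[dst]
        cubes[cube]["in_edges"].append((src, dst))
        if cube_of[src] != cube:
            # The source lives elsewhere, so this cube keeps a local replica of it.
            cubes[cube]["replicas"].add(src)
    return dict(cubes)


# Hypothetical example: vertices 0-2 on cube 0, vertices 3-5 on cube 1.
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
cube_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

parts = source_cut_partition(edges, cube_of)
print(parts[1]["replicas"])  # only vertex 2 is replicated on cube 1
```

In this example three edges cross the cube boundary, but they all share the same source, so cube 1 needs a single replica.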

15 Two-Phase Vertex Program
    for (r: replicas) {
        // apply updates from previous iterations
        r.next_rank = 0.85 * r.next_rank / r.out_degree;
    }

16 Two-Phase Vertex Program
    for (v: vertices) {
        for (u: edges.sources) {
            update += u.rank;
        }
    }

17 Two-Phase Vertex Program
    for (r: replicas) {
        put(r.id, function { r.next_rank = update});
    }
    barrier();
Notes: Only the put crosses the cube boundary, as 1:1 replica communication.
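The three loops can be combined into a small single-process simulation. This is a sketch under stated assumptions rather than the GraphP implementation: it folds the 0.85 / out_degree scaling into the local sum instead of applying it at the replica, it uses the slide's update rule without a teleport term, and the graph, the `view` dictionary, and the function name are hypothetical.

```python
def two_phase_pagerank(edges, cube_of, iterations=1):
    """Simulate two-phase PageRank and count master-to-replica messages."""
    vertices = sorted(cube_of)
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    in_sources = {v: [] for v in vertices}
    replica_cubes = {v: set() for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_sources[dst].append(src)
        if cube_of[src] != cube_of[dst]:
            replica_cubes[src].add(cube_of[dst])

    # view[c][v]: rank of v as seen inside cube c (master or replica copy).
    view = {c: {} for c in set(cube_of.values())}
    for v in vertices:
        for c in {cube_of[v]} | replica_cubes[v]:
            view[c][v] = 1.0 / n

    messages = 0
    for _ in range(iterations):
        # Phase 1: every cube reads only its own masters and replicas.
        update = {}
        for v in vertices:
            local = view[cube_of[v]]
            update[v] = 0.85 * sum(local[u] / out_degree[u] for u in in_sources[v])
        # Phase 2: each master writes its value and sends it to its replicas.
        for v in vertices:
            view[cube_of[v]][v] = update[v]
            for c in replica_cubes[v]:
                view[c][v] = update[v]
                messages += 1  # one message per replica
        # barrier() would go here on real hardware.
    ranks = {v: view[cube_of[v]][v] for v in vertices}
    return ranks, messages


# Hypothetical example: vertices 0-2 on cube 0, vertices 3-5 on cube 1.
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
cube_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
ranks, messages = two_phase_pagerank(edges, cube_of, iterations=1)
print(messages)
```

In this sketch the inter-cube traffic per iteration equals the number of replicas, independent of how many edges each replica serves.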

18 Benefits
Strictly less data communication
Enables architecture optimizations

19 Less Communication
[Figure: inter-cube communication for the example graph, Tesseract vs. GraphP.]
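The comparison can be made concrete by counting messages per iteration under both schemes. The sketch assumes one message per cross-cube edge for the vertex program and one message per (source vertex, remote cube) replica for source-cut. The graph and function names are hypothetical.

```python
def vertex_program_messages(edges, cube_of):
    """One remote put per cross-cube edge."""
    return sum(1 for src, dst in edges if cube_of[src] != cube_of[dst])


def source_cut_messages(edges, cube_of):
    """One update per replica, i.e. per distinct (source, remote cube) pair."""
    replicas = {(src, cube_of[dst]) for src, dst in edges if cube_of[src] != cube_of[dst]}
    return len(replicas)


# Hypothetical example: vertices 0-2 on cube 0, vertices 3-5 on cube 1.
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
cube_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(vertex_program_messages(edges, cube_of), source_cut_messages(edges, cube_of))
```

Because every replica is created by at least one cross-cube edge, the second count can never exceed the first under these assumptions.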

20 Broadcast Optimization
    for (r: replicas) {
        put(r.id, function { r.next_rank = update});   // broadcast
    }
    barrier();

21 Naïve Broadcast
15 point-to-point messages from the source (src) to every destination (dst)
Notes: To send to a remote group of 4 cubes, 4 identical messages are sent over the inter-group link.

22 Hierarchical communication
3 inter-group messages
Notes: Only 1 inter-group message is sent per remote group.
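The message counts on the two broadcast slides follow from simple arithmetic, sketched below for the 16-cube, 4-cubes-per-group configuration. The function name is hypothetical.

```python
def broadcast_message_counts(num_cubes=16, group_size=4):
    """Return (naive point-to-point total, naive inter-group, hierarchical inter-group)."""
    remote_groups = num_cubes // group_size - 1
    naive_total = num_cubes - 1                    # one message per destination cube
    naive_intergroup = remote_groups * group_size  # identical copies cross each inter-group link
    hierarchical_intergroup = remote_groups        # one message per remote group
    return naive_total, naive_intergroup, hierarchical_intergroup


print(broadcast_message_counts())
```

For the default configuration this gives 15 point-to-point messages, 12 of which cross inter-group links in the naive scheme, against 3 inter-group messages with hierarchical communication.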

23 Other Optimizations
Computation/communication overlap
Leveraging low-power state of SerDes
Please see the paper for more details

24 Outline Motivation GraphP Evaluation

25 Evaluation Methodology
Simulation Infrastructure
    zSim with HMC support
    ORION for NOC energy modeling
Configurations
    Same as Tesseract
    16 HMCs
    Interconnection: Dragonfly and Mesh2D
    512 CPUs
    Single-issue in-order cores
    Frequency: 1 GHz

26 Workloads
4 graph algorithms
    Breadth First Search
    Single Source Shortest Path
    Weakly Connected Component
    PageRank
5 real-world graphs
    Wiki-Vote (WV)
    ego-Twitter (TT)
    Soc-Slashdot0902 (SD)
    Amazon0302 (AZ)
    ljournal-2008 (LJ)

27 Performance
[Figure: performance compared with Tesseract. Annotations: <1.1x, data partition, 1.7x, memory bandwidth.]

28 Communication Amount

29 Energy consumption

30 Other results
Bandwidth utilization
Scalability
Replication overhead
Please see the paper for more details

31 Conclusions
We propose GraphP
    A new PIM-based graph processing framework
Key contributions
    Data partition as first-order design consideration
    Source-cut partition
    Two-phase vertex program
    Enable additional architecture optimizations
GraphP drastically reduces inter-cube communication and improves energy efficiency.

32 GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition
Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian
Tsinghua University, University of Southern California, Stanford University
Notes: Presented from USC. This is joint work with Tsinghua University and Stanford University.

33 Workload Size & Capacity
128 GB (16 * 8 GB): ~16 billion edges
SNAP: ~400 million edges
WebGraph: ~7 billion edges

34 Two-phase vertex program
Equivalent expressiveness to vertex programs

