GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, Xuehai Qian Tsinghua University University of Southern California Stanford University
Outline: Motivation (graph applications; processing-in-memory; drawbacks of the current solution) · GraphP · Evaluation
Graph Applications: social network analytics, recommendation systems, bioinformatics, and more. In all of these, a graph (e.g., a social graph or an RDF graph) is the underlying data representation.
Challenges: graph processing has a high memory bandwidth requirement but only a small amount of computation per vertex, so data movement dominates. In conventional systems, data travels from memory through the cache hierarchy (L1/L2/L3) to the compute units; despite the many optimizations that have been proposed, this data movement overhead limits memory access performance.
PIM: Processing-In-Memory. Idea: place computation logic inside memory, avoiding the data movement overhead. Advantage: high memory bandwidth. Example: Hybrid Memory Cubes (HMC), with 320 GB/s intra-cube bandwidth and 4 × 120 GB/s inter-cube links.
HMC: Hybrid Memory Cubes. Intra-cube bandwidth is 320 GB/s; each inter-cube SerDes link provides 120 GB/s. With up to 4 links per cube, 4 cubes are easy to connect: a fully connected link per pair gives 120 GB/s between each cube. Scaling to 16 cubes this way is impossible, so cubes are organized as 4 groups of 4 in a Dragonfly topology: only one link connects each pair of groups, so inter-group bandwidth is 120 GB/s and the effective bandwidth between cubes in different groups is below 120 GB/s. Bottleneck: inter-cube communication.
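A back-of-envelope sketch of the resulting per-pair bandwidth (link numbers from the slide; modeling the single inter-group link as evenly shared among a group's four cubes is an assumption for illustration):

```python
# Effective bandwidth between cube pairs in the assumed 16-cube Dragonfly.
LINK_BW_GBPS = 120          # one SerDes link, per the slide
PER_GROUP = 4               # 4 groups of 4 cubes

intra_group_bw = LINK_BW_GBPS               # dedicated link within a group
inter_group_bw = LINK_BW_GBPS / PER_GROUP   # one link shared by a whole group
# cubes in different groups see well under 120 GB/s between them
```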
Outline: Motivation (graph applications; processing-in-memory; drawbacks of the current solution) · GraphP · Evaluation
Current Solution: Tesseract, the first PIM-based graph processing architecture. Programming model: vertex programs. Partition: based on the vertex program. [Ahn, J., Hong, S., Yoo, S., Mutlu, O., and Choi, K. A scalable processing-in-memory accelerator for parallel graph processing. ISCA 2015.]
PageRank in Vertex Program

for (v: vertices) {                        // iterate all vertices
  update = 0.85 * v.rank / v.out_degree;
  for (w: v.edges.destinations) {          // iterate all destinations (neighbors) of the source
    put(w.id, function() { w.next_rank += update; });
  }
}
barrier();

The implication: each put() to a vertex held by another cube becomes a cross-cube message.
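As a concrete sketch, the loop above can be run in plain Python (the 5-vertex edge list is an assumption for illustration, not the exact graph on the slide; `put` is modeled as a direct write into `next_rank`):

```python
# Tesseract-style vertex-program PageRank, one iteration.
# put(w.id, ...) is modeled as a direct write; in hardware it becomes a
# message whenever w lives in another cube.

edges = {1: [2], 2: [3, 4, 5], 3: [4], 4: [5], 5: []}  # assumed example graph
rank = {v: 1.0 for v in edges}
next_rank = {v: 0.15 for v in edges}

for v, dests in edges.items():               # iterate all vertices
    if not dests:                            # skip sinks (out_degree == 0)
        continue
    update = 0.85 * rank[v] / len(dests)     # len(dests) == v.out_degree
    for w in dests:                          # iterate all destinations of v
        next_rank[w] += update               # put(w.id, ...) on the destination
# barrier(): every cube finishes before ranks are swapped
rank, next_rank = next_rank, {v: 0.15 for v in edges}
```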
Graph Partition

We will use the same example graph throughout the talk: hmc0 holds vertices 1 and 2; hmc1 holds vertices 3, 4, and 5. An edge between vertices in the same cube is an intra-cube edge; an edge whose endpoints sit in different cubes is an inter-cube edge.

put(w.id, function() { w.next_rank += update; });

Each put() along an inter-cube edge becomes a message, so communication = # of cross-cube edges.
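The cost of a partition can be counted directly; a sketch with an assumed 5-vertex edge list split as on the slide (vertices 1–2 in hmc0, 3–5 in hmc1):

```python
# Count the messages one iteration generates under a Tesseract-style
# partition: one per edge whose endpoints sit in different cubes.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]  # assumed example graph
cube_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}                  # hmc0: {1, 2}; hmc1: {3, 4, 5}

cross_edges = sum(1 for (u, w) in edges if cube_of[u] != cube_of[w])
# communication per iteration == cross_edges
```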
Drawback of Tesseract: excessive data communication. Why? Tesseract fixes the programming model first; the programming model dictates the graph partition, and the partition in turn determines the data communication.
Outline Motivation GraphP Evaluation
GraphP considers graph partition first. Graph partition: source-cut. Programming model: two-phase vertex program. Result: reduced inter-cube communication.
Source-Cut Partition

This is the same example graph: hmc0 holds vertices 1 and 2; hmc1 holds vertices 3, 4, and 5. Under source-cut, a replica of vertex 2 is placed in hmc1 and vertex 2's cross-cube edges are stored with the replica, so every edge becomes an intra-cube edge; only the replica has to be kept in sync across cubes.
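Replica placement under source-cut can be sketched the same way (edge list and partition are assumptions for illustration): a vertex gets one replica in each remote cube that holds any of its destinations, and its cross-cube edges become local edges of that replica.

```python
# Source-cut: replicate each source vertex once per remote cube that
# holds one of its destinations; cross-cube traffic then shrinks from
# one message per cross-cube edge to one message per replica.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]  # assumed example graph
cube_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

replicas = {(u, cube_of[w]) for (u, w) in edges if cube_of[u] != cube_of[w]}
cross_edges = sum(1 for (u, w) in edges if cube_of[u] != cube_of[w])
# here, 3 cross-cube edges collapse into a single replica of vertex 2 in hmc1
```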
Two-Phase Vertex Program (apply replica updates)

for (r: replicas) {
  r.next_rank = 0.85 * r.next_rank / r.out_degree;  // apply updates from the previous iteration
}
Two-Phase Vertex Program (local gather)

for (v: vertices) {
  for (u: v.edges.sources) {    // all sources are local under source-cut
    update += u.rank;
  }
}
Two-Phase Vertex Program (replica synchronization)

for (r: replicas) {
  put(r.id, function() { r.next_rank = update; });
}
barrier();

Only replica updates cross the cube boundary: 1:1 replica communication.
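One full GraphP iteration can be sketched end to end in Python (edge list and partition are assumptions for illustration; the phases above are collapsed into one local pull per edge plus a replica sync, and cross-cube puts are simply counted):

```python
# Two-phase PageRank sketch: under source-cut every edge is local, so the
# gather/apply phase needs no messages; the sync phase sends exactly one
# message per replica.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]  # assumed example graph
cube_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
out_deg = {v: sum(1 for (u, _) in edges if u == v) for v in cube_of}
rank = {v: 1.0 for v in cube_of}

# Phase 1: local gather and apply (every source is local, possibly a replica).
next_rank = {v: 0.15 for v in cube_of}
for (u, w) in edges:
    next_rank[w] += 0.85 * rank[u] / out_deg[u]

# Phase 2: 1:1 replica synchronization across the cube boundary.
replicas = {(u, cube_of[w]) for (u, w) in edges if cube_of[u] != cube_of[w]}
messages = len(replicas)
```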
Benefits Strictly less data communication Enables architecture optimizations
Less Communication: [Figure: on the example graph, Tesseract sends one message per cross-cube edge of vertex 2, while GraphP sends a single message to vertex 2's replica.]
Broadcast Optimization

for (r: replicas) {
  put(r.id, function() { r.next_rank = update; });  // becomes a broadcast
}
barrier();

When a vertex has replicas in several cubes, the same update can be broadcast instead of sent as separate point-to-point messages.
Naïve Broadcast: with 16 cubes, broadcasting from one source requires 15 point-to-point messages. To reach a remote group of 4 cubes, 4 identical messages are sent over the single inter-group link.
Hierarchical Communication: the source sends only 3 inter-group messages, one per remote group; each message is then forwarded to the other cubes within the destination group over local links.
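Counting messages makes the gain concrete; a sketch assuming the slide's 16-cube, 4-group Dragonfly with cube IDs grouped consecutively:

```python
# Naive broadcast: the source sends a point-to-point copy to each of the
# other 15 cubes, so 4 identical copies cross every inter-group link.
# Hierarchical broadcast: one message per remote group, then local fan-out.
GROUPS, PER_GROUP = 4, 4
NUM_CUBES = GROUPS * PER_GROUP

def group_of(cube):
    return cube // PER_GROUP  # cubes 0-3 form group 0, 4-7 group 1, ...

src = 0
naive_total = NUM_CUBES - 1                      # point-to-point messages
naive_intergroup = sum(1 for d in range(NUM_CUBES)
                       if d != src and group_of(d) != group_of(src))
hier_intergroup = GROUPS - 1                     # one per remote group
```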
Other Optimizations Computation/communication overlap Leveraging low-power state of SerDes Please see the paper for more details
Outline Motivation GraphP Evaluation
Evaluation Methodology. Simulation infrastructure: zSim with HMC support; ORION for NoC energy modeling. Configurations (same as Tesseract): 16 HMCs; interconnect: Dragonfly and 2D mesh; 512 single-issue in-order cores at 1 GHz.
Workloads: 4 graph algorithms (Breadth-First Search, Single-Source Shortest Path, Weakly Connected Components, PageRank) on 5 real-world graphs: Wiki-Vote (WV), ego-Twitter (TT), soc-Slashdot0902 (SD), Amazon0302 (AZ), ljournal-2008 (LJ).
Performance: over Tesseract, the data partition alone contributes less than 1.1× speedup; combined with the better use of memory bandwidth, the speedup reaches 1.7×.
Communication Amount
Energy Consumption
Other results Bandwidth utilization Scalability Replication overhead Please see the paper for more details
Conclusions: We propose GraphP, a new PIM-based graph processing framework. Key contributions: treating data partition as a first-order design consideration; the source-cut partition; the two-phase vertex program; and the additional architecture optimizations these enable. GraphP drastically reduces inter-cube communication and improves energy efficiency.
Workload Size & Capacity: 16 HMCs provide 128 GB (16 × 8 GB) of memory, enough for roughly 16 billion edges. For comparison, the largest SNAP graphs have ~400 million edges (https://snap.stanford.edu/data/) and WebGraph datasets reach ~7 billion edges (http://law.di.unimi.it/datasets.php).
Two-Phase Vertex Program: equivalent in expressiveness to ordinary vertex programs.