A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms
Gurbinder Gill, Roshan Dathathri, Loc Hoang, Keshav Pingali

Graph Analytics
Applications: search engines, machine learning, social network analysis
- How to rank webpages? (PageRank)
- How to recommend movies to users? (Stochastic Gradient Descent)
- Who are the most influential persons in a social network? (Betweenness Centrality)
Datasets: unstructured graphs
- Billions of nodes and trillions of edges
- Do not fit in the memory of a single machine: need TBs of memory
Image credits: Wikipedia, SFL Scientific, MakeUseOf, Sentinel Visualizer

Distributed Graph Analytics
Distributed-memory clusters are used for in-memory processing of very large graphs
- D-Galois [PLDI'18], Gemini [OSDI'16], PowerGraph [OSDI'12], ...
The graph is partitioned among the machines in the cluster
- Many heuristics exist for graph partitioning (partitioning policies)
- Application performance is sensitive to the policy, yet there is no clear way for users to choose a policy or for systems to decide which policies to support
Main objectives of a good partitioning policy:
- Minimize synchronization overhead
- Balance computation load
This is the motivation for our study.

Existing Partitioning Studies
- Performed on small graphs and clusters
- Only considered metrics such as edges cut, average number of replicas (replication factor), etc.
- Did not consider work-efficient data-driven algorithms: only topology-driven algorithms were evaluated, even though nodes become active in a data-dependent way, so static metrics tell only part of the story
- Used frameworks that apply the same communication pattern to all partitioning policies, putting some policies at a disadvantage
- Suggested policies that minimize the number of replicas, or only looked at ways to control the replication factor

Contributions
Experimental study of partitioning strategies for work-efficient graph analytics applications:
- Largest publicly available web crawls, such as wdc12 (~1TB)
- Large KNL and Skylake clusters with up to 256 machines (~69K threads)
- Uses the state-of-the-art graph analytics system, D-Galois [PLDI'18]
- Evaluates various kinds of partitioning policies
- Analyzes partitioning policies using an analytical model and micro-benchmarking
- Presents a decision tree for selecting the best partitioning policy

Partitioning
[Figure: original example graph with nodes A-J]
Let's first see how the graph is partitioned among hosts. We use a small example graph of 10 nodes and 24 edges and partition it among 4 hosts, represented by 4 colored boxes.

Partitioning
[Figure: each edge of the original graph assigned to one of the 4 hosts]
First, each edge is assigned to a unique host.

Partitioning
[Figure: each host holds proxy nodes for the endpoints of the edges it owns]
Each edge is assigned to a unique host, and all edges connect proxy nodes on the same host: every host creates a proxy for each endpoint of the edges assigned to it.

Partitioning
[Figure: master and mirror proxies of each node across the 4 hosts]
Each edge is assigned to a unique host, and all edges connect proxy nodes on the same host. A node can have multiple proxies: one is the master proxy; the rest are mirror proxies.
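A minimal sketch of this scheme, not the CuSP implementation: edges are assigned to hosts by an arbitrary policy function, each host creates proxies for the endpoints of its edges, and one proxy per node is designated the master (here, by hashing the node id, which is an illustrative assumption).

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    using NodeID = uint64_t;
    using HostID = uint32_t;

    struct Edge { NodeID src, dst; };

    // Illustrative state kept per host after partitioning.
    struct LocalPartition {
      std::vector<Edge> edges;             // edges assigned to this host
      std::unordered_set<NodeID> proxies;  // nodes with a proxy on this host
    };

    // Any partitioning policy is just a function from edge to owning host.
    using EdgePolicy = HostID (*)(const Edge&, HostID numHosts);

    // Assign every edge to exactly one host; create a proxy on a host
    // for every endpoint of an edge that host owns.
    std::vector<LocalPartition> partitionEdges(const std::vector<Edge>& graph,
                                               HostID numHosts,
                                               EdgePolicy owner) {
      std::vector<LocalPartition> parts(numHosts);
      for (const Edge& e : graph) {
        HostID h = owner(e, numHosts);
        parts[h].edges.push_back(e);
        parts[h].proxies.insert(e.src);
        parts[h].proxies.insert(e.dst);
      }
      return parts;
    }

    // Pick one proxy of each node as the master (here: hash of the node id);
    // the proxies on all other hosts become mirrors.
    HostID masterHost(NodeID n, HostID numHosts) {
      return static_cast<HostID>(n % numHosts);
    }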

Synchronization
[Figure: reduce and broadcast between mirror and master proxies]
Reduce: mirror proxies send their updates to the master.
Broadcast: the master sends the canonical value back to its mirrors.
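A sketch of what each host does with one node label (e.g., a BFS distance) during these two phases; the actual message passing is omitted, and the min-reduction is just an example of an application's reduction operator.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using Label = uint32_t;  // e.g., BFS distance

    // Reduce: the master combines the updates received from its mirrors
    // with the application's reduction operator (min here).
    Label reduceOnMaster(Label masterLabel, const std::vector<Label>& mirrorUpdates) {
      for (Label u : mirrorUpdates)
        masterLabel = std::min(masterLabel, u);
      return masterLabel;
    }

    // Broadcast: the master's canonical value overwrites every mirror copy.
    void broadcastToMirrors(Label masterLabel, std::vector<Label>& mirrorCopies) {
      std::fill(mirrorCopies.begin(), mirrorCopies.end(), masterLabel);
    }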

Matrix View of a Graph
[Figure: adjacency-matrix representation of the example graph, with sources as rows and destinations as columns]
A graph can be viewed as an adjacency matrix: each row is a source node, each column is a destination node, and each edge is a non-zero entry.

Partitioning Policies & Communication Pattern
1D Partitioning: Outgoing Edge Cut (OEC) and Incoming Edge Cut (IEC)
[Figure: adjacency matrix partitioned into row blocks (OEC) or column blocks (IEC), and the resulting communication pattern; different colors represent different hosts]
We discuss four policies in this talk. OEC is a simple 1D policy: the adjacency matrix is partitioned along the rows, so each host owns all outgoing edges of its nodes. IEC partitions along the columns, so each host owns all incoming edges of its nodes.
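A hedged sketch of the two 1D edge-assignment rules under a simple contiguous-block node layout (an assumption; real systems may use different layouts):

    #include <cstdint>

    using NodeID = uint64_t;
    using HostID = uint32_t;

    // Nodes 0..numNodes-1 are split into contiguous blocks, one per host.
    HostID blockOf(NodeID n, uint64_t numNodes, HostID numHosts) {
      uint64_t blockSize = (numNodes + numHosts - 1) / numHosts;
      return static_cast<HostID>(n / blockSize);
    }

    // Outgoing Edge Cut: an edge lives with its source row, so all outgoing
    // edges of a node are on the host that owns that node.
    HostID oecOwner(NodeID src, NodeID /*dst*/, uint64_t numNodes, HostID numHosts) {
      return blockOf(src, numNodes, numHosts);
    }

    // Incoming Edge Cut: an edge lives with its destination column, so all
    // incoming edges of a node are on one host.
    HostID iecOwner(NodeID /*src*/, NodeID dst, uint64_t numNodes, HostID numHosts) {
      return blockOf(dst, numNodes, numHosts);
    }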

Partitioning Policies & Communication Pattern
2D Partitioning: Cartesian Vertex Cut (CVC); General Vertex Cut: Hybrid Vertex Cut (HVC)
[Figure: adjacency matrix partitioned into 2D blocks for CVC and by a degree-based heuristic for HVC, and the resulting communication patterns]
CVC is a 2D policy: the adjacency matrix is partitioned blocked along the rows and cyclic along the columns. General vertex cuts are the most flexible: edges are distributed without any constraints using some heuristic. HVC is one such policy; it treats nodes differently based on their degree.
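A sketch of the two edge-assignment rules. The CVC version assumes a pr x pc host grid with rows blocked and columns cyclic; the HVC version follows the common PowerLyra-style formulation (low in-degree nodes keep their in-edges at the destination, high in-degree nodes are cut), and the degree threshold is an assumption.

    #include <cstdint>

    using NodeID = uint64_t;
    using HostID = uint32_t;

    // Cartesian Vertex Cut: hosts form a pr x pc grid. The source picks the
    // grid row (blocked), the destination picks the grid column (cyclic),
    // so each host owns one block of the adjacency matrix.
    HostID cvcOwner(NodeID src, NodeID dst, uint64_t numNodes,
                    HostID pr, HostID pc) {
      uint64_t rowBlock = (numNodes + pr - 1) / pr;
      HostID gridRow = static_cast<HostID>(src / rowBlock);  // blocked rows
      HostID gridCol = static_cast<HostID>(dst % pc);        // cyclic columns
      return gridRow * pc + gridCol;
    }

    // Hybrid Vertex Cut (sketch): edges into low in-degree nodes are placed
    // by destination, edges into high in-degree nodes by source, so only
    // high-degree nodes get cut across hosts.
    HostID hvcOwner(NodeID src, NodeID dst, uint64_t inDegreeOfDst,
                    uint64_t threshold, HostID numHosts) {
      if (inDegreeOfDst <= threshold)
        return static_cast<HostID>(dst % numHosts);
      return static_cast<HostID>(src % numHosts);
    }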

Implementation
Partitioning: state-of-the-art graph partitioner, CuSP [IPDPS'19]
Processing: state-of-the-art distributed graph analytics system, D-Galois [PLDI'18]
- Bulk-synchronous parallel (BSP) execution
- Uses Gluon [PLDI'18] as the communication substrate:
  - Uses a partition-specific communication pattern
  - Only sends the updated node labels
  - Avoids sending metadata in each round of synchronization
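For orientation, a generic skeleton of one bulk-synchronous round; the four callables are placeholders for the application operator and the communication layer, not the D-Galois or Gluon API.

    #include <functional>

    // Repeatedly run a compute phase on the local partition, then the two
    // synchronization phases (reduce, broadcast), until no host has work.
    void bspExecute(const std::function<bool()>& computeRound,
                    const std::function<void()>& reduceMirrorsToMasters,
                    const std::function<void()>& broadcastMastersToMirrors,
                    const std::function<bool(bool)>& anyHostActive) {
      bool active = true;
      while (active) {
        bool localWork = computeRound();      // compute phase on local partition
        reduceMirrorsToMasters();             // sync: mirrors -> masters
        broadcastMastersToMirrors();          // sync: masters -> mirrors
        active = anyHostActive(localWork);    // global termination check
      }
    }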

Experimental Setup
Stampede2 at TACC: up to 256 hosts
- KNL hosts: 68 cores each (without hyperthreading)
- Skylake hosts: 24 cores each
- Two architectures show that the observations are independent of architecture
Applications: bc, bfs, cc, pagerank, sssp
Partitioning policies: XtraPulp (XEC), Edge Cut (EC), Cartesian Vertex Cut (CVC), Hybrid Vertex Cut (HVC)
Inputs and their properties:

                  kron30     clueweb12   wdc12
    |V|           1,073M     978M        3,563M
    |E|           10,791M    42,574M     128,736M
    Size on disk  136GB      325GB       986GB

The results focus on EC, HVC, and CVC and plot only execution time.

Execution Time
[Figure: execution time on a log-log scale; CVC performs best at large scale]
Only execution time is shown, without partitioning time; partitioning time can be amortized over repeated runs.

Compute Time
Compute time scales (mostly), so the differences in execution time arise mostly from communication.
Compute time is taken as the maximum across all hosts: we measure the computation time of each round on each host, take the maximum across hosts, and sum over rounds. Any load imbalance is therefore captured in the compute time.
BFS and SSSP do not scale well because they perform very little computation at that scale.
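A small sketch of this measurement (max across hosts per round, summed over rounds), assuming per-round, per-host timings have already been collected:

    #include <algorithm>
    #include <vector>

    // computeTimes[round][host] = seconds spent in the compute phase.
    // The reported compute time takes the slowest host in each round
    // (capturing load imbalance) and sums over rounds.
    double totalComputeTime(const std::vector<std::vector<double>>& computeTimes) {
      double total = 0.0;
      for (const auto& round : computeTimes) {
        if (round.empty()) continue;
        total += *std::max_element(round.begin(), round.end());
      }
      return total;
    }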

Communication Volume and Time
[Figure: communication volume and time per round, grouped into low, medium, and high volume]
The communication volume in work-efficient graph analytics applications changes over rounds, and in practice several rounds have low communication volume. Data-driven algorithms change the amount of volume per round, but the rounds still fall into these categories.

Best Partitioning Policy (clueweb12)
Execution time (sec):

                     32 hosts                            256 hosts
            XEC     EC      HVC     CVC         XEC     EC      HVC     CVC
    bc      136.2   439.1   627.9   539.1       413.7   420     1012.1  266.9
    bfs     10.4    8.9     17.4    19.2        36.4    27.1    46.5    14.6
    cc      OOM     16.9    7.5     19.6        43.0    84.7    8.4     7.3
    pr      272.6   219.6   193.5   217.9       286.7   82.3    97.5    60.8
    sssp    16.5    13.1    26.6    31.7                32.5    54.7    21.8

Decision Tree
Based on our study, we present a decision tree for selecting the best partitioning policy.
[Figure: decision tree, and the % difference in execution time between the policy chosen by the decision tree and the optimal policy]

Key Lessons
Choosing the best policy is not trivial: the best-performing policy depends on the application, the input, and the number of hosts (scale).
- EC performs well at small scale, but CVC wins at large scale (a toy selection rule is sketched below)
Graph analytics systems must support:
- Various partitioning policies
- Partition-specific communication patterns
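A toy policy chooser encoding only the lesson stated on this slide; the 32-host cut-over point is an illustrative assumption, not the paper's actual decision tree, which also considers application and input.

    #include <cstdint>
    #include <string>

    // Edge cuts tend to win at small host counts, CVC at large host counts.
    std::string choosePolicy(uint32_t numHosts) {
      const uint32_t kScaleThreshold = 32;  // assumed cut-over point
      return (numHosts <= kScaleThreshold) ? "EC" : "CVC";
    }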

Source Code
Frameworks used for the study:
- Graph partitioner: CuSP [IPDPS'19]
- Communication substrate: Gluon [PLDI'18]
- Distributed graph analytics framework: D-Galois [PLDI'18]
https://github.com/IntelligentSoftwareSystems/Galois
~ Thank you ~

Partitioning Time
CuSP [IPDPS'19]

Partitioning Time (clueweb12)
[Figure: partitioning time (sec) on 256 hosts for each policy]
Another important consideration when choosing a partitioning policy is the time it takes to partition the graph among hosts. Among these policies, CVC takes the longest: the more complex the policy, the more time it takes to partition the graph. Partitioning time is not included in the earlier results because it can be amortized by partitioning once and reusing the partitions over several runs.

Micro-benchmarking
CVC at 256 hosts gives a geomean speedup of 4.6x to 5.6x over EC and HVC.