DistTC: High Performance Distributed Triangle Counting

DistTC: High Performance Distributed Triangle Counting
Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali
The University of Texas at Austin
IEEE HPEC Graph Challenge 2019

Goal: Implement Efficient Distributed Triangle Counting

Challenges:
- The graph may not fit in the memory of a single machine (clueweb12 has 37 B edges)
- Most existing implementations target shared memory or a single GPU
- Distributed processing requires efficient partitioning and synchronization
- Triangle counting (TC) is not a vertex program

Solution: DistTC
- A novel graph partitioning policy
- Eliminates almost all communication for distributed TC
- Platform-agnostic
- Up to 1.6× faster than previous Graph Challenge champions
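The "TC is not a vertex program" point reflects how triangle counting is usually formulated: orient each edge from lower to higher vertex id and intersect sorted neighbor lists, so each triangle is counted exactly once. A minimal sequential sketch of that standard formulation (not DistTC's actual implementation) might look like:

```python
def count_triangles(adj):
    """Count triangles by intersecting sorted neighbor lists.

    adj: dict mapping each vertex to a sorted list of its neighbors
    (undirected graph). Each edge is oriented from the lower to the
    higher vertex id so every triangle u < v < w is counted once.
    """
    count = 0
    for u, nbrs in adj.items():
        higher = [v for v in nbrs if v > u]
        hs = set(higher)
        for v in higher:
            # a common higher neighbor w of u and v closes triangle u-v-w
            count += sum(1 for w in adj[v] if w > v and w in hs)
    return count

# Example: a single triangle 0-1-2
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(count_triangles(triangle))  # 1
```

Because each wedge check touches two adjacency lists, naively distributing this requires remote lookups; DistTC's partitioning policy is designed so that those lookups stay local.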

Partitioning Scheme for Distributed Triangle Counting

[Figure: an example graph with vertices A-F partitioned across Host 1 and Host 2, showing master and mirror copies of shared vertices]

Partitioned graph based on OEC (outgoing edge-cut) [CuSP, IPDPS'19]
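In edge-cut partitioning as implemented by CuSP, each vertex has one master copy on some host, and hosts holding edges incident to vertices mastered elsewhere keep local mirror copies. The sketch below illustrates only this master/mirror bookkeeping with an assumed modulo-hash master assignment; DistTC's actual partitioning policy is more specialized and is what allows triangle counting to proceed with almost no communication.

```python
def oec_partition(edges, num_hosts):
    """Illustrative outgoing edge-cut: all outgoing edges of a vertex
    are placed on the host that masters it; destination vertices
    mastered elsewhere become mirrors on that host.

    edges: list of directed (u, v) pairs with integer vertex ids.
    Master assignment by u % num_hosts is an assumption for this
    sketch, not CuSP's (configurable) policy.
    """
    parts = [{"edges": [], "masters": set(), "mirrors": set()}
             for _ in range(num_hosts)]
    for u, v in edges:
        h = u % num_hosts            # host mastering the source vertex
        part = parts[h]
        part["edges"].append((u, v))
        part["masters"].add(u)
        if v % num_hosts != h:
            part["mirrors"].add(v)   # v's master lives on another host
    return parts

# Example: a 4-vertex path split across 2 hosts
parts = oec_partition([(0, 1), (1, 2), (2, 3)], 2)
print(parts[0]["masters"], parts[0]["mirrors"])  # {0, 2} {1, 3}
```

Mirrors are where synchronization cost normally arises; eliminating updates that must flow between mirrors and masters is the source of DistTC's communication savings.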

Experimental Setup

Software: CuSP [Hoang et al., IPDPS 2019]
Hardware: Bridges supercomputer, 32 nodes × 2 NVIDIA P100 GPUs per node

Inputs:
Input       |V|      |E|
scale19     0.33 M   7.7 M
scale20     0.64 M   15.6 M
scale21     1.24 M   31.7 M
scale22     2.39 M   64.1 M
twitter40   41.6 M   1.2 B
friendster  66 M     1.8 B
gsh-2015    988 M    25 B
clueweb12   978 M    37 B

Results

Comparison on a single GPU (Graph 500 inputs):
Input    IrGL (sec)  DistTC (sec)
scale18  0.06        -
scale19  0.18        0.15
scale20  0.56        0.35
scale21  1.58        0.80
scale22  4.41        1.91

Comparison on distributed GPUs:
Input       GPUs  TriCore (sec)  DistTC (sec)
twitter40   8     6.5            6.7
friendster        2.1            4.0
gsh-2015    32    253.4          159.9
clueweb12   64    -              172.4

DistTC is up to 2.31× faster than IrGL and up to 1.58× faster than TriCore.

Summary
- Proposed a new graph partitioning policy that eliminates most of the synchronization
- Implemented distributed multi-GPU triangle counting
- Achieves up to 1.6× speedup over the 2018 champion

Questions?

Results: Distributed Multi-GPUs

Input       GPUs  Pre-Processing (sec)  Exec. Time (sec)  Total Time (sec)
rmat26      16    30.16 (29.15)         5.95              36.11
            32    26.92 (25.99)         3.63              30.55
            64    23.74 (22.89)         2.62              26.36
twitter40   16    24.90 (24.20)         3.92              28.82
            32    20.81 (20.20)         2.83              23.64
            64    18.73 (18.19)         2.35              21.08
friendster  16    52.13 (51.32)         2.49              54.62
            32    41.80 (41.19)         1.64              43.44
            64    36.13 (35.64)         1.50              37.63
uk2007      16    12.16 (11.63)         8.64              20.80
            32    12.06 (11.66)         6.52              18.58
            64    11.41 (11.05)         5.47              16.88
gsh-2015    32    143.43 (142.30)       16.44             159.87
            64    143.72 (142.97)       15.25             158.97
clueweb12   64    162.92 (162.61)       9.49              172.41