DistTC: High Performance Distributed Triangle Counting

DistTC: High Performance Distributed Triangle Counting
Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali
The University of Texas at Austin
IEEE HPEC Graph Challenge 2019

Goal: Implement Efficient Distributed Triangle Counting

Challenges:
- The graph may not fit in the memory of a single machine (clueweb12 has 37 B edges)
- Most existing implementations target shared memory or a single GPU
- Distributed processing requires efficient partitioning and synchronization
- Triangle counting (TC) is not a vertex program

Solution: DistTC
- A novel graph partitioning policy
- Eliminates almost all communication for distributed TC
- Platform-agnostic
- Up to 1.6× faster than previous Graph Challenge champions
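The "TC is not a vertex program" point reflects how triangle counting is usually formulated: orient each edge from lower to higher vertex id and intersect sorted neighbor lists, so each triangle is counted exactly once. A minimal sequential sketch of that standard formulation (not DistTC's actual implementation) might look like:

```python
def count_triangles(adj):
    """Count triangles by intersecting sorted neighbor lists.

    adj: dict mapping each vertex to a sorted list of its neighbors
    (undirected graph). Each edge is oriented from the lower to the
    higher vertex id so every triangle u < v < w is counted once.
    """
    count = 0
    for u, nbrs in adj.items():
        higher = [v for v in nbrs if v > u]
        hs = set(higher)
        for v in higher:
            # a common higher neighbor w of u and v closes triangle u-v-w
            count += sum(1 for w in adj[v] if w > v and w in hs)
    return count

# Example: a single triangle 0-1-2
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(count_triangles(triangle))  # 1
```

Because each wedge check touches two adjacency lists, naively distributing this requires remote lookups; DistTC's partitioning policy is designed so that those lookups stay local.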

Partitioning Scheme for Distributed Triangle Counting

[Figure: an example graph with vertices A-F partitioned across Host 1 and Host 2, showing master and mirror copies of shared vertices]

Partitioned graph based on OEC (outgoing edge-cut) [CuSP, IPDPS'19]
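In edge-cut partitioning as implemented by CuSP, each vertex has one master copy on some host, and hosts holding edges incident to vertices mastered elsewhere keep local mirror copies. The sketch below illustrates only this master/mirror bookkeeping with an assumed modulo-hash master assignment; DistTC's actual partitioning policy is more specialized and is what allows triangle counting to proceed with almost no communication.

```python
def oec_partition(edges, num_hosts):
    """Illustrative outgoing edge-cut: all outgoing edges of a vertex
    are placed on the host that masters it; destination vertices
    mastered elsewhere become mirrors on that host.

    edges: list of directed (u, v) pairs with integer vertex ids.
    Master assignment by u % num_hosts is an assumption for this
    sketch, not CuSP's (configurable) policy.
    """
    parts = [{"edges": [], "masters": set(), "mirrors": set()}
             for _ in range(num_hosts)]
    for u, v in edges:
        h = u % num_hosts            # host mastering the source vertex
        part = parts[h]
        part["edges"].append((u, v))
        part["masters"].add(u)
        if v % num_hosts != h:
            part["mirrors"].add(v)   # v's master lives on another host
    return parts

# Example: a 4-vertex path split across 2 hosts
parts = oec_partition([(0, 1), (1, 2), (2, 3)], 2)
print(parts[0]["masters"], parts[0]["mirrors"])  # {0, 2} {1, 3}
```

Mirrors are where synchronization cost normally arises; eliminating updates that must flow between mirrors and masters is the source of DistTC's communication savings.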

Experimental Setup

Software: CuSP [Hoang et al., IPDPS 2019]
Hardware: Bridges supercomputer, 32 nodes × 2 NVIDIA P100 GPUs per node

Inputs:
Input       |V|      |E|
scale19     0.33 M   7.7 M
scale20     0.64 M   15.6 M
scale21     1.24 M   31.7 M
scale22     2.39 M   64.1 M
twitter40   41.6 M   1.2 B
friendster  66 M     1.8 B
gsh-2015    988 M    25 B
clueweb12   978 M    37 B

Results

Comparison on a single GPU (Graph 500 inputs):
Input    IrGL (sec)  DistTC (sec)
scale18  0.06        -
scale19  0.18        0.15
scale20  0.56        0.35
scale21  1.58        0.80
scale22  4.41        1.91

Comparison on distributed GPUs:
Input       GPUs  TriCore (sec)  DistTC (sec)
twitter40   8     6.5            6.7
friendster        2.1            4.0
gsh-2015    32    253.4          159.9
clueweb12   64    -              172.4

DistTC is up to 2.31× faster than IrGL and up to 1.58× faster than TriCore.

Summary
- Proposed a new graph partitioning policy that eliminates most of the synchronization
- Implemented distributed multi-GPU triangle counting
- Achieves up to 1.6× speedup over the 2018 champion

Questions?

Results: Distributed Multi-GPUs

Input       GPUs  Pre-Processing (sec)  Exec. Time (sec)  Total Time (sec)
rmat26      16    30.16 (29.15)         5.95              36.11
            32    26.92 (25.99)         3.63              30.55
            64    23.74 (22.89)         2.62              26.36
twitter40   16    24.90 (24.20)         3.92              28.82
            32    20.81 (20.20)         2.83              23.64
            64    18.73 (18.19)         2.35              21.08
friendster  16    52.13 (51.32)         2.49              54.62
            32    41.80 (41.19)         1.64              43.44
            64    36.13 (35.64)         1.50              37.63
uk2007      16    12.16 (11.63)         8.64              20.80
            32    12.06 (11.66)         6.52              18.58
            64    11.41 (11.05)         5.47              16.88
gsh-2015    32    143.43 (142.30)       16.44             159.87
            64    143.72 (142.97)       15.25             158.97
clueweb12   64    162.92 (162.61)       9.49              172.41