Synchronization trade-offs in GPU implementations of Graph Algorithms


Synchronization trade-offs in GPU implementations of Graph Algorithms
Rashid Kaleem¹, Anand Venkat², Sreepathi Pai¹, Mary Hall², Keshav Pingali¹
¹University of Texas at Austin   ²University of Utah

Outline
- Recommendation systems
- Stochastic Gradient Descent
- Scheduling: just-in-time, runtime
- Evaluation
- Conclusions

Recommendation systems
- Predict missing entries on the basis of known entries!
[Figure: a users (u0-u5) x movies (m0-m3) rating matrix with known entries a-l and a missing entry for user Charles to be predicted]

Matrix Completion
- Approximate the rating matrix M with per-movie and per-user latent vectors (lv): the product of a movie's and a user's vectors reproduces the known ratings (e.g. m2 · u3 ≈ i) and predicts the missing ones.
- Processing a known rating, e.g. d on the edge between movie m0 and user u4, updates the latent vectors of both endpoints:
    Δ = abs((m0.lv, u4.lv) - d);
    m0.lv = updatelv(m0.lv, Δ);
    u4.lv = updatelv(u4.lv, Δ);
[Figure: bipartite graph of movies m0-m3 and users u0-u5 with one edge per known rating a-l]
A CUDA sketch of this per-edge update follows.
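The updatelv pseudocode above corresponds to the usual stochastic-gradient step for matrix factorization. Below is a minimal CUDA sketch of one per-edge update; the latent-vector length K, the learning rate, and the regularization constant are illustrative assumptions, not values from the paper.

```cuda
// Hedged sketch: one SGD update for a single rating edge (movie, user).
// K, LEARNING_RATE and LAMBDA are illustrative constants, and the update rule
// is the standard matrix-factorization SGD step, not the paper's exact updatelv.
#define K 16                  // latent-vector length (assumed)
#define LEARNING_RATE 0.001f  // assumed hyperparameter
#define LAMBDA 0.05f          // assumed regularization constant

__device__ void sgd_edge_update(float* movie_lv, float* user_lv, float rating) {
    // Predicted rating = dot product of the two latent vectors.
    float pred = 0.0f;
    for (int k = 0; k < K; ++k)
        pred += movie_lv[k] * user_lv[k];
    float err = rating - pred;   // the Δ of the slide's pseudocode

    // Both endpoint vectors are written back.
    for (int k = 0; k < K; ++k) {
        float m = movie_lv[k], u = user_lv[k];
        movie_lv[k] = m + LEARNING_RATE * (err * u - LAMBDA * m);
        user_lv[k]  = u + LEARNING_RATE * (err * m - LAMBDA * u);
    }
}
```

Because both endpoint vectors are written, two edges that share a movie or a user cannot safely run this update at the same time; managing that conflict is exactly what the schedules below are about.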

Schedules
- Cannot schedule edges that share nodes: the neighborhood of the active element (edge) overlaps with others.
- Find edges that do not share nodes (edge matching) and execute them concurrently.
[Figure: bipartite movie-user graph with edges a-l illustrating conflicting and independent edges]

Scheduling issues
- Parallelism
  - Just-in-time: compute the schedule after the input has been read, before the computation is performed
  - Runtime: discover independent edges dynamically
- Locality
  - Assign edges to threads: fine-grained scheduling
  - Assign nodes to threads: coarse-grained scheduling, but high-degree nodes become a problem

Scheduling strategies
- Just-in-time
  - Edge
  - Node: all-graph or sub-graph
- Runtime
  - Edge
  - Node
- Hybrid

Just-in-time schedules

Edge matching
- Matching decomposition: a set of matchings that covers all edges in the graph.
- No two edges in a matching set share nodes, so each matching can be executed in parallel.
- There are at least max-degree matching sets, and the sets must be processed serially.
Matching decomposition of the example graph: {a, g, i}, {b, f, j}, {c, h, k}, {d, l}, {e}.
A greedy construction of such a decomposition is sketched below.
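As mentioned above, a matching decomposition can be built greedily: give each edge the smallest set index not already used at either endpoint. This is a host-side sketch under assumed names (Edge, matching_decomposition); it is not the paper's implementation.

```cuda
// Hedged sketch (host code): greedy matching decomposition of the bipartite
// rating graph. Each edge gets the smallest set index not yet used at either
// of its endpoints, so no two edges in a set share a node.
#include <vector>
#include <unordered_set>

struct Edge { int movie; int user; };

// Returns, for each edge, the index of the matching set it belongs to.
std::vector<int> matching_decomposition(const std::vector<Edge>& edges,
                                        int num_movies, int num_users) {
    std::vector<std::unordered_set<int>> movie_used(num_movies), user_used(num_users);
    std::vector<int> set_of_edge(edges.size());
    for (size_t e = 0; e < edges.size(); ++e) {
        int c = 0;  // smallest set index free at both endpoints
        while (movie_used[edges[e].movie].count(c) ||
               user_used[edges[e].user].count(c))
            ++c;
        set_of_edge[e] = c;
        movie_used[edges[e].movie].insert(c);
        user_used[edges[e].user].insert(c);
    }
    return set_of_edge;  // produces at least max-degree many sets
}
```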

All-graph matching-edge (AGM-E)
- Compute the matching decomposition for the whole graph.
- Go over each matching and process its edges independently, in parallel.
- No locality: edges processed together touch unrelated nodes.
Schedule for two threads t0, t1 over the example decomposition: step 0: a, g; step 1: i; step 2: b, f; step 3: j; step 4: c, h; step 5: k; step 6: d; step 7: e.
A kernel-per-matching sketch follows.
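A minimal AGM-E sketch: each matching set becomes one kernel launch, with the launch boundary serving as the barrier between sets. It reuses sgd_edge_update and K from the earlier sketch; the per-set edge arrays (edge_movie, edge_user, edge_rating) and the host-side helpers in the comment are assumed names, not the paper's API.

```cuda
// Hedged sketch: process the edges of one matching set; they touch disjoint
// nodes, so no locks are needed.
__global__ void agm_e_process_matching(const int* edge_movie, const int* edge_user,
                                       const float* edge_rating, int num_edges,
                                       float* movie_lv, float* user_lv) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;  // edges of the current matching only
    sgd_edge_update(&movie_lv[edge_movie[e] * K],
                    &user_lv[edge_user[e] * K],
                    edge_rating[e]);
}

// Host side (illustrative names): one launch per matching set; the launch
// boundary acts as the barrier between sets, which are processed serially.
// for (int s = 0; s < num_sets; ++s)
//     agm_e_process_matching<<<num_blocks(s), 256>>>(movie_of[s], user_of[s],
//                                                    rating_of[s], count[s],
//                                                    d_movie_lv, d_user_lv);
```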

All-graph matching-node (AGM-N)
- Locality optimization: assign nodes to threads and reorder each node's edges in matching order.
- Execute on the device, with a barrier between matching steps.
- If there are more nodes than hardware threads, break them into chunks and process the chunks sequentially.
- The resulting schedule is overly conservative.
[Figure: per-node edge reordering and the resulting two-thread schedule: step 0: a, f; 1: b, g; 2: c, h; 3: d; 4: e; 5: i; 6: j, k; 7: l]

Sub-graph Matching (SGM)
- Partition the graph; the size of a partition equals the number of concurrent threads.
- Consider conflicts within a partition only, so there are fewer conflicts and fewer matching sets.
Example with partitions {m0, m1} and {m2, m3}: the first partition needs sets {a, f}, {b, g}, {c, h}, {d}, {e}; the second only {i, k} and {j, l}.
A per-partition variant of the matching decomposition is sketched below.
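A sketch of the sub-graph variant under the same assumptions: the edge list is cut into fixed-size partitions (one per batch of concurrent threads) and the greedy decomposition from the earlier sketch is run inside each partition, so only intra-partition conflicts generate extra sets. Edge and matching_decomposition are the assumed names defined above.

```cuda
// Hedged sketch (host code): per-partition matching decomposition for SGM.
#include <algorithm>
#include <vector>

std::vector<int> sgm_schedule(const std::vector<Edge>& edges,
                              int num_movies, int num_users, int partition_size) {
    std::vector<int> set_of_edge(edges.size());
    for (size_t begin = 0; begin < edges.size(); begin += partition_size) {
        size_t end = std::min(edges.size(), begin + partition_size);
        // Match only the edges inside this partition.
        std::vector<Edge> part(edges.begin() + begin, edges.begin() + end);
        std::vector<int> local = matching_decomposition(part, num_movies, num_users);
        std::copy(local.begin(), local.end(), set_of_edge.begin() + begin);
    }
    return set_of_edge;  // set indices are local to each partition
}
```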

Scheduling strategies
- Just-in-time
  - Edge: AGM-E
  - Node: all-graph (AGM-N) or sub-graph (SGM)
- Runtime
  - Edge
  - Node
- Hybrid

Runtime schedules

Runtime Schedules
- Resolve conflicts dynamically: associate locks with the nodes (movies and users).
- Try to acquire the locks on both nodes of an edge.
  - If successful: process the edge, mark it as done, release the locks.
  - Else: defer the edge to the next pass.
- Multiple passes are needed to process all edges.

Edge Lock (EL)
- Each thread processes an edge.
- Lock the src (movie) and the dst (user).
- Defer conflicting edges to the next pass.
[Figure: bipartite movie-user graph; edges that fail to take both locks are deferred]
A try-lock sketch using atomicCAS follows.
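A minimal sketch of one EL pass, assuming per-node lock arrays and a per-edge deferred flag (illustrative names); sgd_edge_update and K come from the earlier sketch. Threads try-lock both endpoints with atomicCAS and give up rather than spin, so conflicting edges simply wait for the next pass.

```cuda
// Hedged sketch: one Edge Lock pass. deferred[e] == 1 means edge e still has
// to be processed; the host keeps relaunching until no edge is deferred.
__global__ void edge_lock_pass(const int* edge_movie, const int* edge_user,
                               const float* edge_rating, int num_edges,
                               float* movie_lv, float* user_lv,
                               int* movie_lock, int* user_lock, int* deferred) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges || !deferred[e]) return;

    int m = edge_movie[e], u = edge_user[e];
    if (atomicCAS(&movie_lock[m], 0, 1) == 0) {        // lock src (movie)
        if (atomicCAS(&user_lock[u], 0, 1) == 0) {     // lock dst (user)
            sgd_edge_update(&movie_lv[m * K], &user_lv[u * K], edge_rating[e]);
            deferred[e] = 0;                            // mark as done
            __threadfence();                            // publish updates before unlock
            atomicExch(&user_lock[u], 0);               // release locks
        }
        atomicExch(&movie_lock[m], 0);
    }
    // Edges that failed to take a lock stay deferred for the next pass.
}
```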

Node Lock (NL)
- Each thread is assigned a movie and goes over all of its users: acquire the lock on the user, update, release.
- The number of locks is reduced: only users are locked.
- More opportunity for optimizations: improved data reuse, one step per round; conflicts are dealt with later.
[Figure: two threads t0 and t1, each owning one movie and walking its edges]
A sketch of the per-movie loop follows.
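The NL counterpart, again as a hedged sketch with assumed CSR-style arrays (row_start, edge_user, edge_rating): one thread owns a movie, so the movie's latent vector needs no lock and only the user of each edge is locked, which is why NL needs only one lock acquisition per edge.

```cuda
// Hedged sketch: one Node Lock pass over the movies.
__global__ void node_lock_pass(const int* row_start, const int* edge_user,
                               const float* edge_rating, int num_movies,
                               float* movie_lv, float* user_lv,
                               int* user_lock, int* edge_done) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= num_movies) return;

    // Walk this movie's adjacency list; the movie vector is private to this thread.
    for (int e = row_start[m]; e < row_start[m + 1]; ++e) {
        if (edge_done[e]) continue;
        int u = edge_user[e];
        if (atomicCAS(&user_lock[u], 0, 1) == 0) {     // only the user is locked
            sgd_edge_update(&movie_lv[m * K], &user_lv[u * K], edge_rating[e]);
            edge_done[e] = 1;
            __threadfence();
            atomicExch(&user_lock[u], 0);
        }
        // Conflicting edges are picked up in a later pass.
    }
}
```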

Schedules - summary
- Just-in-time
  - Edge: AGM-E, inspector-executor
  - Node: all-graph (AGM-N) or sub-graph (SGM)
- Runtime
  - Edge: EL
  - Node: NL
- Hybrid

Evaluation

Setup
- Software: Scientific Linux 6.6 (kernel 2.6.23), CUDA 7.0, OpenCL 1.2
- Hardware: Nvidia Tesla K40c (12 GB device memory), AMD R9-290X (8 GB device memory)
- Inputs: [table of input graphs not reproduced in the transcript]

Overall execution – K40
- EL is the best-performing schedule on scale-free inputs.
- Maximal-matching schedules perform well on road networks, which need a smaller number of matchings.
- Node schedules (AGM-N and NL) suffer from traversals of high-degree nodes; they perform better on road networks, where the maximum degree is much smaller.
[Chart: per-input runtimes on the K40; shorter bars are faster]

Overall execution – R9-290X
- Matching schedules perform best, because atomics are costly compared to the K40.
- AGM-N suffers from the large number of matchings on scale-free inputs.
[Chart: per-input runtimes on the R9-290X; shorter bars are faster]

Cost of atomics
- Micro-benchmark: atomic updates to a single location.
- Runtime schedules need atomics: EL uses 2 atomics per edge, NL uses 1 atomic per edge.
A minimal contention micro-benchmark is sketched below.
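A minimal sketch of the kind of micro-benchmark described above, assuming nothing beyond standard CUDA: every thread repeatedly performs an atomic update to the same location, which exposes the device's worst-case cost for contended atomics.

```cuda
// All threads update one shared counter; timing this kernel (e.g. with CUDA
// events) gives a rough per-device cost for contended atomic updates.
__global__ void atomic_contention(int* counter, int updates_per_thread) {
    for (int i = 0; i < updates_per_thread; ++i)
        atomicAdd(counter, 1);
}
```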

Hybrid schedule
- Combine runtime and matching schedules.
- Matching is expensive for high-degree nodes, so use EL on the highest-degree nodes; interleaving edges from different high-degree nodes reduces conflicts among them.
- Use SGM on the remaining nodes, which have a smaller maximum degree and therefore fewer matching sets.
A sketch of the degree-based split follows.
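A host-side sketch of the split, reusing the Edge type from the matching sketch; the degree threshold and the container names are illustrative parameters rather than the paper's policy.

```cuda
// Hedged sketch (host code): route edges touching the highest-degree movies to
// the runtime EL schedule and the remaining edges to SGM.
#include <vector>

void hybrid_split(const std::vector<Edge>& edges, const std::vector<int>& movie_degree,
                  int degree_threshold,
                  std::vector<Edge>& el_edges, std::vector<Edge>& sgm_edges) {
    for (const Edge& e : edges) {
        if (movie_degree[e.movie] >= degree_threshold)
            el_edges.push_back(e);      // high-degree: handled by EL
        else
            sgm_edges.push_back(e);     // remaining sub-graph has a smaller max
                                        // degree, hence fewer matching sets for SGM
    }
}
```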

Hybrid schedule performance
- Speedup relative to EL; higher is better.
- The K40 benefits more, even though EL is already the best single schedule there.
- On the AMD R9-290X the hybrid is better than EL but still not the best schedule, because of the cost of atomics.
[Chart: hybrid speedup over EL per input]

Conclusions
- The best schedule depends on the device (cost of atomics) and on graph properties (scale-free vs. road).
- With cheap atomics: Runtime/Hybrid on scale-free inputs, Just-in-time on road networks.
- Schedule taxonomy: Just-in-time – Edge: AGM-E / inspector-executor, Node: AGM-N / SGM; Runtime – Edge: EL, Node: NL.

Acknowledgments
Research supported by National Science Foundation grants CNS 1111766, CNS 1302663, CCF 1218568, XPS 1337281, and CNS 1406355; DARPA BRASS contract FA8750-16-2-0004; and DARPA grant FA8650-15-C-7563. Equipment used in this research was donated by Nvidia Corporation.

Questions?
Code available at https://bitbucket.org/arekay/sgd-release (soon at http://iss.ices.utexas.edu).
Contact: rashid@cs.utexas.edu