Synchronization trade-offs in GPU implementations of Graph Algorithms



Presentation on theme: "Synchronization trade-offs in GPU implementations of Graph Algorithms"— Presentation transcript:

1 Synchronization trade-offs in GPU implementations of Graph Algorithms
Rashid Kaleem¹, Anand Venkat², Sreepathi Pai¹, Mary Hall², Keshav Pingali¹
¹University of Texas at Austin ²University of Utah

2 Outline
- Stochastic Gradient Descent (recommendation systems)
- Scheduling: just-in-time and runtime
- Evaluation
- Conclusions
IPDPS'16

3 Recommendation systems
Predict missing entries on the basis of known entries!
[Figure: a ratings matrix with users u0-u5 and movies m0-m3; known ratings a-l fill some cells, and Charles's rating for one movie is unknown.]

4 Matrix Completion
Factor the ratings matrix M into latent vectors for users (U) and movies, so that the product of a movie's and a user's latent vectors approximates the known rating (e.g. m2.u3 ≈ i). Updating one known entry d = M[u4][m0]:

Δ = abs(dot(m0.lv, u4.lv) - d);
m0.lv = updatelv(m0.lv, Δ);
u4.lv = updatelv(u4.lv, Δ);
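The per-edge update can be made concrete. Below is a minimal sketch in Python/NumPy of one SGD step on a single known entry; the slides' `updatelv` helper is replaced here by an explicit gradient step, and the learning rate `lr` is an assumed tuning knob, not a value from the paper (whose actual kernels are CUDA/OpenCL):

```python
import numpy as np

def sgd_edge_update(movie_lv, user_lv, rating, lr=0.01):
    """One SGD step for a single known entry (edge) of the ratings matrix.

    movie_lv, user_lv: current latent-vector estimates for the two endpoints.
    rating: the known entry; the goal is dot(movie_lv, user_lv) ~= rating.
    """
    err = np.dot(movie_lv, user_lv) - rating    # signed prediction error
    new_movie = movie_lv - lr * err * user_lv   # gradient step for the movie
    new_user = user_lv - lr * err * movie_lv    # gradient step for the user
    return new_movie, new_user
```

Note that each update touches only the two endpoints' latent vectors, which is exactly why two edges sharing a node conflict and need scheduling.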

5 Schedules
- Cannot schedule edges that share nodes: the neighborhood of an active element (edge) overlaps with its neighbors'
- Find edges that do not share nodes (an edge matching) and execute them concurrently
IPDPS'16

6 Scheduling issues
- Parallelism
  - Just-in-time: after the input has been read, before computation is performed
  - Runtime: discover independent edges dynamically
- Locality
  - Fine-grained scheduling: assign edges to threads
  - Coarse-grained scheduling: assign nodes to threads (high-degree nodes are a concern)

7 Scheduling strategies
Schedules
- Just-in-time
  - Edge (all-graph)
  - Node (all-graph, sub-graph)
- Runtime
  - Edge
  - Node
- Hybrid

8 Just-in-time schedules

9 Edge matching
- Matching decomposition: a set of matchings covering all edges of the graph
- No two edges in a matching set share nodes, so each matching can execute in parallel
- There are at least max-degree matching sets, and the sets must be processed serially
Matching decomposition for the example graph: set 0 = {a, g, i}; set 1 = {b, f, j}; set 2 = {c, h, k}; set 3 = {d, l}; set 4 = {e}
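A matching decomposition can be built by repeatedly extracting a maximal matching from the remaining edges. The following is a sketch of one greedy construction, assuming edges are given as `(node, node)` pairs; the paper does not prescribe this particular algorithm, and a greedy pass yields maximal (not maximum) matchings, so it may need a few more sets than the minimum:

```python
def matching_decomposition(edges):
    """Split an edge list into matchings: within each matching, no two
    edges share an endpoint, so its edges can be processed in parallel.
    The matchings themselves must still be processed one after another."""
    remaining = list(edges)
    matchings = []
    while remaining:
        used, matching, deferred = set(), [], []
        for (u, v) in remaining:
            if u in used or v in used:
                deferred.append((u, v))   # conflicts: push to a later set
            else:
                used.update((u, v))
                matching.append((u, v))
        matchings.append(matching)
        remaining = deferred
    return matchings
```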

10 All-graph matching-edge (AGM-E)
- Compute the matching decomposition of the whole graph
- Go over each matching, processing its edges independently
- No locality
Schedule (threads t0, t1): round 0: a, g; round 1: i; round 2: b, f; round 3: j; round 4: c, h; round 5: k; round 6: d; round 7: e
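Turning a matching decomposition into the kind of per-round thread schedule shown above is straightforward. A sketch (`num_threads` stands in for the number of concurrent hardware threads; within a round every edge comes from the same matching, so the round is conflict-free):

```python
def agm_e_schedule(matchings, num_threads):
    """Flatten a matching decomposition into parallel rounds: each round
    assigns at most one edge per thread, and all edges in a round belong
    to the same matching (an AGM-E-style schedule sketch)."""
    rounds = []
    for matching in matchings:
        for i in range(0, len(matching), num_threads):
            rounds.append(matching[i:i + num_threads])  # one parallel round
    return rounds
```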

11 All-graph matching-node (AGM-N)
- Locality optimization: reorder each node's edges in matching order
- Execute on the device with a barrier between matchings
- If there are more nodes than hardware threads, break into chunks and process them sequentially
- Overly conservative schedule

12 Sub-graph Matching (SGM)
- Partition the graph; partition size = number of concurrent threads
- Consider conflicts within a partition only, giving fewer conflicts
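The sub-graph idea can be sketched as follows, partitioning the edge list and decomposing each partition independently; `partition_size` plays the role of the concurrent-thread count, and the greedy inner decomposition is an illustrative choice rather than the paper's exact construction:

```python
def sgm_schedule(edges, partition_size):
    """SGM sketch: conflicts are resolved only inside each partition of
    `partition_size` edges, so every partition's matching decomposition
    is short (bounded by the partition's own max degree, not the graph's)."""
    def decompose(part):
        remaining, matchings = list(part), []
        while remaining:
            used, matching, deferred = set(), [], []
            for (u, v) in remaining:
                if u in used or v in used:
                    deferred.append((u, v))
                else:
                    used.update((u, v))
                    matching.append((u, v))
            matchings.append(matching)
            remaining = deferred
        return matchings
    return [decompose(edges[i:i + partition_size])
            for i in range(0, len(edges), partition_size)]
```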

13 Scheduling strategies
Schedules
- Just-in-time
  - Edge: AGM-E
  - Node: all-graph (AGM-N), sub-graph (SGM)
- Runtime
  - Edge
  - Node
- Hybrid

14 Runtime schedules

15 Runtime Schedules
Resolve conflicts dynamically:
- Associate locks with nodes (movies and users)
- Try to acquire the locks on both nodes of an edge
- If successful: process the edge, mark it done, release the locks
- Else: defer the edge to the next pass
- Multiple passes are needed to process all edges
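The lock-and-defer scheme above can be sketched sequentially in Python; on the GPU the inner loop runs concurrently, one edge per thread, which is what makes the try-lock-and-defer dance necessary at all (sequentially, no acquire ever fails):

```python
import threading

def runtime_passes(edges, process):
    """Runtime-scheduling sketch: per-node try-locks, with conflicting
    edges deferred and retried in later passes. Returns the pass count."""
    locks = {}
    for u, v in edges:
        locks.setdefault(u, threading.Lock())
        locks.setdefault(v, threading.Lock())
    pending = list(edges)
    passes = 0
    while pending:
        passes += 1
        deferred = []
        for u, v in pending:          # on the GPU: one thread per edge
            if locks[u].acquire(blocking=False):
                if locks[v].acquire(blocking=False):
                    process(u, v)     # both locks held: safe to update
                    locks[v].release()
                    locks[u].release()
                else:                 # second lock busy: back off fully
                    locks[u].release()
                    deferred.append((u, v))
            else:
                deferred.append((u, v))
        pending = deferred
    return passes
```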

16 Edge Lock (EL)
- Each thread processes an edge
- Lock src (movie) and dst (user)
- Defer conflicting edges to the next pass

17 Node Lock (NL)
- Each thread is assigned a movie and goes over all its users: acquire the lock on the user, then update
- The number of locks is reduced: only users are locked
- More opportunity for optimizations, e.g. improved data reuse
- One step per round; conflicts are dealt with later
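A sketch of one NL-style pass, assuming the graph is given as a movie-to-users adjacency map; because exactly one worker owns each movie, only the user side needs a lock, halving the atomics relative to EL:

```python
import threading

def nl_pass(movie_adj, process):
    """One Node-Lock pass: each movie's worker walks its user list,
    locking only the user node (the movie is implicitly owned by its
    worker). Returns edges deferred because a user lock was busy."""
    user_locks = {}
    for users in movie_adj.values():
        for u in users:
            user_locks.setdefault(u, threading.Lock())
    deferred = []
    for movie, users in movie_adj.items():  # on the GPU: one thread per movie
        for user in users:
            lock = user_locks[user]
            if lock.acquire(blocking=False):
                process(movie, user)        # user lock held: safe to update
                lock.release()
            else:
                deferred.append((movie, user))
    return deferred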

18 Schedules - summary
Schedules
- Just-in-time
  - Edge: AGM-E, Inspector-Executor
  - Node: all-graph (AGM-N), sub-graph (SGM)
- Runtime
  - Edge: EL
  - Node: NL
- Hybrid

19 Evaluation

20 Setup
Software: Scientific Linux 6.6, CUDA 7.0, OpenCL 1.2
Hardware: Nvidia Tesla K40c (12 GB device memory), AMD R9-290X (8 GB device memory)
Inputs

21 Overall execution - K40
- EL is the best-performing schedule on scale-free inputs
- Maximal-matching schedules perform well on road networks, which have a smaller number of matchings
- Node schedules (AGM-N and NL) suffer from high-degree node traversals; they perform better on road networks, where the max degree is much smaller

22 Overall execution - R9-290X
- Matching schedules perform best, due to the high cost of atomics compared to the K40
- AGM-N suffers from the large number of matchings on scale-free inputs

23 Cost of atomics
- Micro-benchmark: atomic updates to a single location
- Runtime schedules need atomics: EL uses 2 atomics per edge, NL uses 1 atomic per edge

24 Hybrid schedule
Combine runtime + matching schedules:
- Matching is expensive for high-degree nodes, so use EL on the highest-degree nodes; interleaving edges from different high-degree nodes reduces conflicts between them
- Use SGM on the remaining nodes, which have a smaller max degree and hence a smaller number of matching sets
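The degree-based split can be sketched as follows; `degree_threshold` is an assumed tuning knob (the paper's actual cutoff criterion is not captured in this transcript):

```python
from collections import Counter

def hybrid_split(edges, degree_threshold):
    """Hybrid-schedule sketch: route edges touching a high-degree node to
    the runtime (EL) scheduler, and the rest to the just-in-time (SGM)
    scheduler, which then sees a smaller max degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    el_edges, sgm_edges = [], []
    for u, v in edges:
        if deg[u] > degree_threshold or deg[v] > degree_threshold:
            el_edges.append((u, v))   # high-degree endpoint: handle with EL
        else:
            sgm_edges.append((u, v))  # low-degree: cheap to match with SGM
    return el_edges, sgm_edges
```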

25 Hybrid schedule performance
Speedup compared to EL (higher is better):
- K40 benefits more, even though EL is already the best single schedule there
- AMD: better than EL, but still not best due to atomics

26 Conclusions
- The best schedule depends on device and graph properties; there are many different ways of scheduling graph applications
- Cheap atomics: Runtime/Hybrid schedules on scale-free inputs, Just-in-time on road networks
- Expensive atomics: Just-in-time schedules
- Schedule recap: just-in-time edge (AGM-E, Inspector-Executor) and node (AGM-N, SGM); runtime edge (EL) and node (NL)

27 Acknowledgments
Research supported by National Science Foundation grants CNS , CNS , CCF , XPS , CNS ; DARPA BRASS contract FA ; and DARPA grant FA C-7563.
Equipment used in this research was donated by Nvidia Corporation.

28 Questions?
Code available soon. Contact: rashid@cs.utexas.edu

