Synchronization Trade-offs in GPU Implementations of Graph Algorithms
Rashid Kaleem (1), Anand Venkat (2), Sreepathi Pai (1), Mary Hall (2), Keshav Pingali (1)
(1) University of Texas at Austin, (2) University of Utah
Outline
  Stochastic Gradient Descent and recommendation systems
  Scheduling: just-in-time and runtime
  Evaluation
  Conclusions
Recommendation systems
  Predict missing entries on the basis of known entries.
  [Figure: sparse user x movie ratings matrix (users u0-u5, movies m0-m3); known ratings are labelled a-l, and a missing entry, e.g. Charles's rating, must be predicted.]
Matrix Completion
  Factor the ratings matrix into a movie matrix M and a user matrix U of latent vectors, so that, for example, m2 · u3 ≈ i for the known rating i.
  View the ratings as a bipartite graph: movies m0-m3 and users u0-u5 are the nodes, and each known rating is an edge.
  Per-edge SGD update for the edge (m0, u4) with rating d:
    Δ = abs(dot(m0.lv, u4.lv) - d);
    m0.lv = updatelv(m0.lv, Δ);
    u4.lv = updatelv(u4.lv, Δ);
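A minimal CUDA sketch of the per-edge SGD update, matching the structure of the pseudocode above. The latent-vector length K, the learning rate LRATE, and the plain gradient step are illustrative assumptions, not the exact update used in the paper.

    // Per-edge SGD update for one rating; K and LRATE are assumed constants.
    #define K 16          // latent-vector length (assumption)
    #define LRATE 0.05f   // learning rate (assumption)

    __device__ void sgd_edge_update(float* mv, float* uv, float rating) {
        // predicted rating = dot(movie latent vector, user latent vector)
        float pred = 0.0f;
        for (int i = 0; i < K; i++) pred += mv[i] * uv[i];
        float err = rating - pred;
        // take a gradient step on both endpoints of the edge
        for (int i = 0; i < K; i++) {
            float m = mv[i], u = uv[i];
            mv[i] = m + LRATE * err * u;
            uv[i] = u + LRATE * err * m;
        }
    }

Every schedule discussed next differs only in when, and by which thread, this update is applied to each edge.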
Schedules
  Cannot schedule edges that share nodes: the neighborhood of an active element (edge) overlaps with its neighbors'.
  Find edges that do not share nodes (edge matching) and execute them concurrently.
  [Figure: the example bipartite rating graph; edges with no common endpoint can run in parallel.]
Scheduling issues
  Two concerns: parallelism and locality.
  Parallelism - when are independent edges found?
    Just-in-time: after the input has been read, before computation is performed.
    Runtime: discover independent edges dynamically.
  Locality - how is work assigned?
    Assign edges to threads: fine-grained scheduling.
    Assign nodes to threads: coarse-grained scheduling; high-degree nodes are a concern.
Scheduling strategies
  Schedules
    Just-in-time: All-graph (edge, node) and Sub-graph
    Runtime: edge and node
    Hybrid
Just-in-time schedules
Edge matching
  Matching decomposition: a set of matchings that together cover all edges of the graph.
  No two edges in a matching share a node, so each matching can be executed in parallel.
  The matchings themselves must be processed serially, and there are at least max-degree of them.
  Matching decomposition for the example graph (max degree 5, so five matchings):
    M0: a, g, i
    M1: b, f, j
    M2: c, h, k
    M3: d, l
    M4: e
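A host-side sketch of one way to build the matching decomposition, assuming a greedy edge-colouring (the talk does not spell out the construction). Each edge receives the smallest matching id not already used at either endpoint, so no two edges in a matching share a node.

    #include <set>
    #include <utility>
    #include <vector>

    // edges[e] = {movie node id, user node id}; node ids share one 0..num_nodes-1 space.
    // Returns the matching id assigned to each edge.
    std::vector<int> matching_decomposition(const std::vector<std::pair<int,int>>& edges,
                                            int num_nodes) {
        std::vector<std::set<int>> used(num_nodes);  // matching ids already taken at a node
        std::vector<int> match_id(edges.size());
        for (size_t e = 0; e < edges.size(); e++) {
            int m = edges[e].first, u = edges[e].second;
            int id = 0;
            while (used[m].count(id) || used[u].count(id)) id++;  // first id free at both ends
            match_id[e] = id;
            used[m].insert(id);
            used[u].insert(id);
        }
        return match_id;  // needs at least max-degree distinct ids
    }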
All-graph matching-edge (AGM-E)
  Compute the matching decomposition of the whole graph, then go over each matching and process its edges independently in parallel.
  No locality.
  Example schedule on two threads (t0, t1), one matching at a time:
    step 0: a, g   step 1: i
    step 2: b, f   step 3: j
    step 4: c, h   step 5: k
    step 6: d, l   step 7: e
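A sketch of the AGM-E execution loop: one kernel launch per matching, one thread per edge of that matching. It reuses sgd_edge_update and K from the earlier SGD sketch, and assumes the per-matching edge arrays (edge_m, edge_u, rating) were produced by the decomposition step.

    // Process all edges of one matching; edges in a matching never share nodes,
    // so no synchronization is needed inside the kernel.
    __global__ void process_matching(const int* edge_m, const int* edge_u,
                                     const float* rating, float* M, float* U,
                                     int num_edges) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e < num_edges)
            sgd_edge_update(&M[edge_m[e] * K], &U[edge_u[e] * K], rating[e]);
    }
    // Host side: launch once per matching; kernel launches on one stream are
    // ordered, which serializes the matchings.

The threads of one launch touch edges scattered over the whole graph, which is why AGM-E has no locality.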
All-graph matching-node (AGM-N)
  Locality optimization: reorder each node's edges in matching order, assign nodes to threads, and execute on the device with a barrier between matching levels.
  If there are more nodes than hardware threads, break the nodes into chunks and process the chunks sequentially.
  Example with two threads: t0 handles m0 (edges a-e) and t1 handles m1 (f-h), then the threads move on to m2 and m3; every node advances by one edge per barrier step.
  Overly conservative schedule.
Sub-graph Matching (SGM)
  Partition the edge list; the size of a partition equals the number of concurrent threads.
  Consider conflicts within a partition only, which yields fewer conflicts and fewer matching sets (a sketch follows below).
  [Figure: the example graph split into partitions, each with its own small matching decomposition.]
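A host-side sketch of how SGM could build its schedule, reusing matching_decomposition from the earlier sketch: the edge list is cut into chunks whose size equals the number of concurrent threads, and conflicts are resolved only inside each chunk. Names and chunking are assumptions for illustration.

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Returns, for each chunk, the matching id of each of its edges.
    std::vector<std::vector<int>> sgm_schedule(const std::vector<std::pair<int,int>>& edges,
                                               int num_nodes, int threads_per_round) {
        std::vector<std::vector<int>> per_chunk;
        for (size_t start = 0; start < edges.size(); start += threads_per_round) {
            size_t end = std::min(edges.size(), start + (size_t)threads_per_round);
            std::vector<std::pair<int,int>> chunk(edges.begin() + start, edges.begin() + end);
            per_chunk.push_back(matching_decomposition(chunk, num_nodes));  // conflicts inside the chunk only
        }
        return per_chunk;  // each chunk typically needs far fewer matchings than the whole graph
    }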
Scheduling strategies
  Schedules
    Just-in-time: All-graph (AGM-E, AGM-N) and Sub-graph (SGM)
    Runtime: edge and node
    Hybrid
Runtime schedules
Runtime Schedules
  Resolve conflicts dynamically: associate locks with the nodes (movies and users).
  Try to acquire the locks on both nodes of an edge.
  If successful: process the edge, mark it as done, release the locks.
  Else: defer the edge to the next pass.
  Multiple passes are needed to process all edges.
Edge Lock (EL)
  Each thread processes an edge: lock the src (movie) and the dst (user), then update.
  Defer conflicting edges to the next pass (see the kernel sketch below).
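A CUDA sketch of one EL pass under an assumed try-lock protocol: per-node lock words taken with atomicCAS and released with atomicExch, with no spinning, so a conflicting edge is simply left for the next pass. It reuses sgd_edge_update and K from the earlier sketch; the host relaunches the kernel until every entry of done is set.

    __global__ void edge_lock_pass(const int* edge_m, const int* edge_u,
                                   const float* rating, float* M, float* U,
                                   int* mlock, int* ulock,   // one lock word per movie / per user
                                   int* done, int num_edges) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= num_edges || done[e]) return;
        int m = edge_m[e], u = edge_u[e];
        if (atomicCAS(&mlock[m], 0, 1) == 0) {           // try the movie lock
            if (atomicCAS(&ulock[u], 0, 1) == 0) {       // try the user lock
                sgd_edge_update(&M[m * K], &U[u * K], rating[e]);
                done[e] = 1;
                atomicExch(&ulock[u], 0);                // release user
            }
            atomicExch(&mlock[m], 0);                    // release movie
        }
        // if either lock was busy the edge is deferred to the next pass
    }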
Node Lock (NL)
  Each thread is assigned a movie and goes over its users: acquire the lock on the user, then update.
  The number of locks is reduced: only users are locked.
  More opportunity for optimizations, e.g. improved data reuse.
  One step per round; conflicts are dealt with later (a sketch follows below).
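A CUDA sketch of one NL pass under the same assumptions, with each movie's edges stored CSR-style (row_start) and only the user node locked. The "1-step per round" variant in the talk would break out of the loop after the first attempted edge instead of sweeping the whole row; both readings are assumptions about the exact protocol.

    __global__ void node_lock_pass(const int* row_start, const int* edge_u,
                                   const float* rating, float* M, float* U,
                                   int* ulock, int* done, int num_movies) {
        int m = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per movie
        if (m >= num_movies) return;
        for (int e = row_start[m]; e < row_start[m + 1]; e++) {
            if (done[e]) continue;
            int u = edge_u[e];
            if (atomicCAS(&ulock[u], 0, 1) == 0) {       // only the user is locked
                sgd_edge_update(&M[m * K], &U[u * K], rating[e]);
                done[e] = 1;
                atomicExch(&ulock[u], 0);
            }
            // else: conflict, deal with it in a later pass
        }
    }

Keeping the movie fixed per thread is what enables the data-reuse opportunity noted above: the movie's latent vector can stay in registers across all of its edges.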
Schedules - summary
  Just-in-time: All-graph (AGM-E, AGM-N), Sub-graph (SGM), Inspector-Executor
  Runtime: Edge Lock (EL), Node Lock (NL)
  Hybrid
Evaluation
Setup
  Software: Scientific Linux 6.6, kernel 2.6.23, CUDA 7.0, OpenCL 1.2
  Hardware: Nvidia Tesla K40c (12 GB device memory), AMD R9-290X (8 GB device memory)
  Inputs: scale-free and road-network graphs
Overall execution – K40
  EL is the best performing schedule on scale-free inputs.
  Maximal-matching schedules perform well on road networks, which have a smaller number of matchings.
  Node schedules (AGM-N and NL) suffer from high-degree node traversals; they perform better on road networks, where the maximum degree is much smaller.
  [Plot: execution time per input and schedule, slower → faster.]
Overall execution – R9-290X
  Matching schedules perform best, due to the high cost of atomics compared to the K40.
  AGM-N suffers from the large number of matchings on scale-free inputs.
  [Plot: execution time per input and schedule, slower → faster.]
Cost of atomics
  Micro-benchmark: atomic updates to a single location (sketch below).
  Runtime schedules need atomics: EL uses 2 atomics per edge, NL uses 1 atomic per edge.
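A minimal CUDA sketch of such a micro-benchmark: every thread repeatedly does an atomicAdd to one global counter, and the kernel is timed on each device. The launch geometry and iteration count are arbitrary choices; the AMD numbers in the talk would come from an OpenCL equivalent.

    __global__ void atomic_hammer(int* counter, int iters) {
        for (int i = 0; i < iters; i++)
            atomicAdd(counter, 1);   // all threads target the same location
    }
    // e.g. atomic_hammer<<<256, 256>>>(d_counter, 1000); cudaDeviceSynchronize();
    // Time the kernel on the K40 and the R9-290X to compare their atomic costs.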
Hybrid schedule
  Combine runtime + matching schedules, because matching is expensive for high-degree nodes (a split sketch follows below).
  Use EL on the highest-degree nodes: interleaving edges from many high-degree nodes reduces the chance that two threads conflict on the same node.
  Use SGM on the remaining nodes: the smaller maximum degree means a smaller number of matching sets.
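A host-side sketch of the split, with an assumed degree cutoff DEG_HI (the talk only says "highest degree nodes" without giving a threshold): edges touching a high-degree node go to the EL runtime schedule, the rest to SGM.

    #include <utility>
    #include <vector>

    const int DEG_HI = 1024;   // assumed cutoff

    void hybrid_split(const std::vector<std::pair<int,int>>& edges,
                      const std::vector<int>& degree,            // degree per node id
                      std::vector<std::pair<int,int>>& el_edges,
                      std::vector<std::pair<int,int>>& sgm_edges) {
        for (const auto& e : edges) {
            if (degree[e.first] >= DEG_HI || degree[e.second] >= DEG_HI)
                el_edges.push_back(e);    // high-degree endpoint: resolve conflicts at runtime
            else
                sgm_edges.push_back(e);   // low max degree: SGM needs only a few matching sets
        }
    }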
Hybrid schedule performance
  Speedup compared to EL; higher is better.
  The K40 benefits more: EL is already the best single schedule there, so beating it makes the hybrid the overall best.
  On the AMD R9-290X the hybrid is better than EL but still not the best schedule, because of the cost of atomics.
  [Plot: speedup of the hybrid schedule over EL per input.]
Conclusions
  Different ways of scheduling graph applications; the best schedule depends on the device and on graph properties.
  With cheap atomics: Runtime/Hybrid schedules on scale-free inputs, Just-in-time schedules on road networks. With expensive atomics: Just-in-time schedules (matching performed best on the R9-290X).
  Schedule taxonomy: edge schedules are AGM-E / Inspector-Executor (just-in-time) and EL (runtime); node schedules are AGM-N / SGM (just-in-time) and NL (runtime).
Acknowledgments
  Research supported by National Science Foundation grants CNS 1111766, CNS 1302663, CCF 1218568, XPS 1337281, CNS 1406355, DARPA BRASS contract FA8750-16-2-0004, and DARPA grant FA8650-15-C-7563.
  Equipment used in this research was donated by Nvidia Corporation.
Questions?
  Code available at https://bitbucket.org/arekay/sgd-release (soon at http://iss.ices.utexas.edu)
  Contact: rashid@cs.utexas.edu