
1 Verification of Producer-Consumer Synchronization in GPU Programs
Rahul Sharma (Stanford), Michael Bauer (NVIDIA Research), Alex Aiken (Stanford)
June 15, 2015

2 Outline
- GPU background
- Motivating examples
- Verification algorithm and implementation
- Results

3 GPU background
A GPU consists of off-chip global memory and a set of streaming multiprocessors (SMs). Each SM has on-chip ALUs and a shared memory of up to 48 KB. Threads are grouped into warps of 32 threads, and warps into threadblocks (CTAs) of ~100s of threads. The typical kernel pattern is:
- Load data from global to shared memory
- __syncthreads() barrier
- Compute on data in shared memory
- __syncthreads() barrier
- Store data from shared to global memory

4 Named barriers
- Synchronization primitive built into hardware; 16 named barriers per SM
- Two instructions: sync (blocking) and arrive (non-blocking)
- Each use specifies a participating thread count
- __syncthreads() is a special case of named barriers: sync 0, N
- Named barriers encode producer-consumer patterns: e.g. a producer warp executes arrive 0,64 while a consumer warp executes sync 0,64 on named barrier 0
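
The semantics above can be sketched in Python (this is an illustrative model, not WEFT's implementation or the hardware's exact behavior; for brevity, participating counts here are in warps rather than threads). A barrier completes a "generation" once its count is reached; sync arrives and blocks, arrive only arrives:

```python
def run(programs, counts):
    """Round-robin interpreter for straight-line warp programs whose
    commands are ('sync', b) or ('arrive', b). Returns the number of
    completed generations per barrier; raises if the programs deadlock."""
    pcs = [0] * len(programs)
    blocked = [False] * len(programs)
    arrivals = {b: 0 for b in counts}
    gens = {b: 0 for b in counts}
    progress = True
    while progress:
        progress = False
        for t, prog in enumerate(programs):
            if blocked[t] or pcs[t] >= len(prog):
                continue
            op, b = prog[pcs[t]]
            arrivals[b] += 1
            if op == 'sync':
                blocked[t] = True       # waits for the generation to complete
            else:
                pcs[t] += 1             # 'arrive' is non-blocking
            progress = True
            if arrivals[b] == counts[b]:    # generation completes
                gens[b] += 1
                arrivals[b] = 0
                for u, p in enumerate(programs):
                    if blocked[u] and p[pcs[u]][1] == b:
                        blocked[u] = False
                        pcs[u] += 1
    if any(blocked):
        raise RuntimeError("deadlock: blocked warps can never be released")
    return gens

# Producer warp arrives (non-blocking); consumer warp syncs (blocks) on barrier 0.
print(run([[('arrive', 0)], [('sync', 0)]], {0: 2}))   # {0: 1}
```

With the roles crossed (warp 0: sync 0 then arrive 1; warp 1: sync 1 then arrive 0), neither barrier can ever reach its count and the model reports a deadlock, matching the challenge described on a later slide.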

5 CudaDMA library (SC 2011)
- Simple library to abstract data movement between GPU memories (global to shared, shared to global)
- Specializes warps: compute warps do math, DMA warps move data
- Uses named barriers to synchronize transfers, and more named barriers for double buffering
- The slide's handshake, with barrier 0 starting a transfer and barrier 1 finishing it:
  Compute warps: start_xfer (arrive 0,N); wait_finish (sync 1,N); compute on shared buffer; start_xfer (arrive 0,N); wait_finish (sync 1,N)
  DMA warps: wait_start (sync 0,N); load data into shared buffer; finish_xfer (arrive 1,N); wait_start (sync 0,N); load data into shared buffer; finish_xfer (arrive 1,N)
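
The handshake can be written down as command lists (a sketch of the pattern from the slide, not CudaDMA's actual API; the non-barrier entries are placeholders for the work done between synchronization points):

```python
# Barrier 0 means "transfer may start"; barrier 1 means "transfer finished".
compute_warps = [
    ('arrive', 0),   # start_xfer: tell DMA warps to begin loading
    ('sync', 1),     # wait_finish: block until the shared buffer is full
    'compute',       # compute on shared buffer
    ('arrive', 0),   # start_xfer for the next buffer (double buffering)
    ('sync', 1),     # wait_finish
]
dma_warps = [
    ('sync', 0),     # wait_start: block until compute warps request data
    'load',          # load data into shared buffer
    ('arrive', 1),   # finish_xfer: signal the buffer is ready
    ('sync', 0),
    'load',
    ('arrive', 1),
]

# Sanity check: every blocking wait has a matching non-blocking signal.
assert sum(c == ('arrive', 0) for c in compute_warps) == \
       sum(c == ('sync', 0) for c in dma_warps)
assert sum(c == ('arrive', 1) for c in dma_warps) == \
       sum(c == ('sync', 1) for c in compute_warps)
```

The design point is that the blocking side of each barrier is always in the warp that needs the data, while the signaling side never stalls the warp that produces it.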

6 Singe compiler (PPoPP 2014)
- DSL compiler for combustion chemistry; up to 4X speedup; kernels contain 10K lines
- Maps static dataflow graphs onto warps
- Uses shared memory for communication
- Assigns synchronization points to named barriers (analogous to register allocation)
- Manages passing of data through shared memory
(Slide figure: a dataflow graph of tasks A through J mapped onto warps 0 through 3, with named barriers assigned to the edges between them.)

7 Named barrier challenges
Three challenges:
- Named barrier reuse: must prove that it is safe to recycle named barriers; this needs a happens-before relationship, which must be self-consistent
- Deadlock: e.g. warp 0 executes sync 0; arrive 1 while warp 1 executes sync 1; arrive 0, so each warp blocks on a barrier the other never signals
- Shared memory races: two accesses to the same location with at least one being a write

8 WEFT architecture
A GPU kernel is compiled into n thread programs, one per thread of the threadblock. From these, WEFT builds a happens-before relation and checks for improper barrier recycling, deadlocks, and shared memory data races.

9 Thread programs
- Omit statements irrelevant to the properties being checked
- Straight-line programs: sequences of commands
- Commands: sync b [m], arrive b [m], read a, write a
- Restrictive, but followed by the majority of GPU code
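
The four-command language above is small enough to parse in a few lines. A minimal sketch (the textual syntax is taken from the slide; the tuple representation is my assumption, not WEFT's internal format):

```python
import re

def parse(line):
    """Parse one thread-program command: 'sync b [m]', 'arrive b [m]',
    'read a', or 'write a'. The participating count m is optional."""
    m = re.fullmatch(r'(sync|arrive)\s+(\d+)(?:\s+\[(\d+)\])?', line)
    if m:
        return (m.group(1), int(m.group(2)),
                int(m.group(3)) if m.group(3) else None)
    m = re.fullmatch(r'(read|write)\s+(\S+)', line)
    if m:
        return (m.group(1), m.group(2))
    raise ValueError(f"not a thread-program command: {line!r}")

print(parse("sync 0 [64]"))   # ('sync', 0, 64)
print(parse("write a"))       # ('write', 'a')
```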

10 Well synchronization
- "Synchronization pattern is deterministic": the same commands synchronize together in every execution, no command does double duty, and commands obey barrier generations
- Subsumes deadlock freedom and safe recycling
- Example (generation 1 of barrier 0, then generation 1 of barrier 1, then generation 2 of barrier 0):
  Producer: sync 0; write a; arrive 1; sync 0
  Consumer: sync 0; sync 1; read a; sync 0

11 Checking well synchronization
Need to know which commands synchronize together, and what the generation of the corresponding barrier is. First challenge: how to infer this information? Generations are invariant over all executions, so:
- Statically emulate one execution
- Record the synchronization it performs
- Check that all executions respect the recorded generations
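
The emulate-and-record step can be sketched by replaying explicit schedules: if the program is well synchronized, every legal schedule groups the same commands into the same generations. This is an illustrative model under the earlier simplifications (counts in warps, barrier commands only), not WEFT's implementation:

```python
def generations(programs, counts, schedule):
    """Replay an explicit schedule (a list of warp ids to step) over
    straight-line programs of ('sync', b) / ('arrive', b) commands.
    Returns, per barrier, the sequence of command sets (warp, pc)
    that formed each completed generation."""
    pcs = [0] * len(programs)
    blocked = [False] * len(programs)
    pending = {b: set() for b in counts}   # commands arrived this generation
    gens = {b: [] for b in counts}
    for t in schedule:
        if blocked[t] or pcs[t] >= len(programs[t]):
            continue                       # skip unrunnable choices
        op, b = programs[t][pcs[t]]
        pending[b].add((t, pcs[t]))
        if op == 'sync':
            blocked[t] = True              # waits for the generation
        else:
            pcs[t] += 1                    # 'arrive' is non-blocking
        if len(pending[b]) == counts[b]:   # generation completes
            gens[b].append(frozenset(pending[b]))
            pending[b] = set()
            for u in range(len(programs)):
                if blocked[u] and programs[u][pcs[u]][1] == b:
                    blocked[u] = False
                    pcs[u] += 1
    return gens

# The producer-consumer example from the previous slide, barriers only.
prod = [('sync', 0), ('arrive', 1), ('sync', 0)]
cons = [('sync', 0), ('sync', 1), ('sync', 0)]
g1 = generations([prod, cons], {0: 2, 1: 2}, [0, 1, 0, 1, 0, 1])
g2 = generations([prod, cons], {0: 2, 1: 2}, [1, 0, 0, 1, 1, 0])
print(g1 == g2)   # True: both schedules produce the same generations
```

Checking that all executions respect the recorded generations is the hard part; the point of the paper's main result is that this can be decided from a single recorded execution rather than by enumerating schedules.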

12 Happens before
- The HB relation is reachability: A happens before B if there is a path from A to B containing at least one synchronization (black) edge
- Check that successive generations of the same barrier have an HB relationship
- Main result: the HB relation is sound and precise
- Example (producer: write a; arrive 1; sync 0; consumer: sync 1; read a; sync 0): the arrive 1 / sync 1 pair of generation 1 orders write a before read a
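
Reachability over the HB graph can be sketched directly; this simplified model uses program-order edges within each warp plus an edge from each arrive to the matching sync of the same barrier generation, and omits the edge-coloring detail (node names are illustrative):

```python
def happens_before(edges, a, b):
    """True iff there is a path from a to b in the HB graph
    (iterative depth-first search over an adjacency dict)."""
    seen, stack = set(), [a]
    while stack:
        n = stack.pop()
        if n == b:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(edges.get(n, ()))
    return False

edges = {
    # Producer program order: write a -> arrive 1 -> sync 0,
    # plus the synchronization edge arrive 1 -> sync 1 (generation 1 of barrier 1).
    'P:write a': ['P:arrive 1'],
    'P:arrive 1': ['P:sync 0', 'C:sync 1'],
    # Consumer program order: sync 1 -> read a -> sync 0.
    'C:sync 1': ['C:read a'],
    'C:read a': ['C:sync 0'],
}
print(happens_before(edges, 'P:write a', 'C:read a'))   # True
print(happens_before(edges, 'C:read a', 'P:write a'))   # False
```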

13 Data races
- For every two commands that can race, check an HB relationship
- Sound and complete for race detection
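
The race check itself is then a pairwise scan: two accesses to the same shared memory location, at least one of them a write, must be HB-ordered one way or the other. A sketch, taking the (transitive) HB relation as a precomputed set of ordered pairs (my representation, not WEFT's):

```python
def find_races(accesses, hb):
    """accesses: list of (cmd, addr, kind) with kind 'read' or 'write';
    hb: set of (a, b) pairs meaning a happens before b.
    Returns the unordered conflicting pairs, i.e. the races."""
    races = []
    for i, (a, addr_a, kind_a) in enumerate(accesses):
        for b, addr_b, kind_b in accesses[i + 1:]:
            if addr_a == addr_b and 'write' in (kind_a, kind_b) \
               and (a, b) not in hb and (b, a) not in hb:
                races.append((a, b))
    return races

acc = [('P:write a', 'a', 'write'), ('C:read a', 'a', 'read')]
print(find_races(acc, {('P:write a', 'C:read a')}))   # [] (HB-ordered, no race)
print(find_races(acc, set()))   # [('P:write a', 'C:read a')] (unordered: race)
```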

14 Implementation

15 Evaluation (Singe kernels)

17 Discovered bugs
- Write-after-read bugs
- Benign data races
- All kernels were well synchronized

18 Conclusion
- GPUs are much more flexible than people realize; named barriers enable new ways of using them
- Named barriers can create many complications: deadlock, improper recycling, data races
- Providing good software verification is important, and necessary to make named barriers easy to use
- WEFT verifies code with named barriers: the algorithm is both sound and complete, and it handles real production code efficiently
- https://github.com/lightsighter/Weft

