
1 Weak Execution Ordering - Exploiting Iterative Methods on Many-Core GPUs
Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir and Jeff Ho University of Florida Lu Peng Louisiana State University

2 Outline
CUDA review & inter-block communication and synchronization
Host synchronization overhead
Applications with iterative PDE solver
Optimizations on inter-block communication
Performance results
Conclusion

3 CUDA Programming Model
The host invokes kernels (grids) to execute on the GPU; each kernel/grid consists of blocks, and each block consists of threads.
[Figure: the application alternates host execution with kernel launches; kernel 0 runs Block 0 through Block 3, kernel 1 runs Block 0 through Block N, each block containing many threads.]
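As a rough illustration of this model, the sketch below shows a host program launching a grid of blocks of threads and waiting for the kernel before continuing; the kernel, sizes, and data are invented for the example, not taken from the talk.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread handles one array element.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Host launches a grid of blocks; each block contains 256 threads.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scaleKernel<<<grid, block>>>(d_data, 2.0f, n);

    // Host execution resumes here, e.g. before launching the next kernel.
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```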

4 CUDA GPU Architecture
Blocks are assigned to Streaming Multiprocessors (SMs), each composed of 8 Streaming Processors (SPs) and a shared (local) memory.
The number of resident blocks is limited by SM resources; the scheduler holds the waiting blocks.
No synchronization among blocks! Block synchronization must go through the host.
Blocks can communicate through global memory (GM); shared-memory data is lost when the kernel returns to the host.
[Figure: GPU with SM 0 through SM 29, each containing SPs and shared memory, connected through an interconnect network to global memory; resident blocks run on the SMs while waiting blocks (Block 58 through Block N) queue in the scheduler.]
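A minimal sketch of these two constraints, with invented kernels standing in for real work: anything a block wants to pass on must be written to global memory before the kernel returns, and the only grid-wide barrier is the return to the host between kernel launches.

```cuda
#include <cuda_runtime.h>

// Each block writes its partial result to global memory (GM); shared-memory
// contents are lost when the kernel returns, so anything worth keeping must
// be stored back to GM.
__global__ void producePartials(float *partials)
{
    if (threadIdx.x == 0)
        partials[blockIdx.x] = (float)blockIdx.x;   // placeholder work
}

// A later kernel (a new set of blocks) reads all partials from GM.
__global__ void consumePartials(const float *partials, float *sum, int numBlocks)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (int b = 0; b < numBlocks; ++b) s += partials[b];
        *sum = s;
    }
}

int main()
{
    const int numBlocks = 60;
    float *d_partials, *d_sum;
    cudaMalloc(&d_partials, numBlocks * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));

    // No in-kernel barrier exists across blocks: the grid-wide
    // synchronization point is the kernel boundary enforced via the host.
    producePartials<<<numBlocks, 256>>>(d_partials);
    cudaDeviceSynchronize();   // host-side barrier: all blocks have finished
    consumePartials<<<numBlocks, 256>>>(d_partials, d_sum, numBlocks);
    cudaDeviceSynchronize();

    cudaFree(d_partials);
    cudaFree(d_sum);
    return 0;
}
```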

5 Example: Breadth-First Search (BFS)
Given a graph G(V,E) and a source node S, compute the number of steps to reach all other nodes.
Each thread computes one node.
Initially all nodes are inactive except the source node.
When a node is activated, visit it and activate its unvisited neighbors.
Nodes visited in the nth iteration need n-1 steps to reach.
Keep iterating until no node is active.
Synchronization is needed after each iteration.
[Figure: example graph with nodes S, A, B, C, D, E, showing inactive/active/visited states over the 1st, 2nd, and 3rd (final) iterations.]
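A hedged sketch of one host-synchronized BFS step with one thread per node; the CSR arrays (rowPtr, colIdx), the level/active flags, and the frontier encoding are assumptions made for this example, not the talk's code.

```cuda
// One BFS level: every thread owns one node. If the node is active in this
// iteration, visit it and activate its unvisited neighbors for the next one.
__global__ void bfsStep(const int *rowPtr, const int *colIdx,
                        int *level, bool *active, bool *nextActive,
                        bool *anyActive, int numNodes, int iter)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numNodes || !active[v]) return;

    active[v] = false;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        if (level[u] < 0) {              // unvisited neighbor
            level[u] = iter + 1;         // reached in the next iteration
            nextActive[u] = true;
            *anyActive = true;           // tells the host to keep iterating
        }
    }
}
```

The host relaunches bfsStep once per iteration, swaps active and nextActive, and copies anyActive back to decide whether to stop; that per-iteration launch and copy is the host-synchronization cost examined next.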

6 No-Host vs. Host Synchronization
Limit the number of nodes to fit in one block, to avoid host synchronization.
Host synchronization can be replaced by __syncthreads().
Avoids the overhead of multiple kernel invocations.
Data can stay in shared memory, reducing global-memory accesses for save/restore.
Reduces the transfer of intermediate partial data or the termination flag to the host during host synchronization.
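A hedged sketch of the no-host variant under the same assumptions, where the whole graph fits in one block and __syncthreads() serves as the per-iteration barrier:

```cuda
// Single-block BFS sketch: one thread per node, every iteration inside one
// kernel launch. __syncthreads() replaces the return-to-host barrier, and the
// frontier flags live in shared memory instead of global memory.
// Assumes numNodes <= blockDim.x <= 1024 and a CSR graph (rowPtr/colIdx).
__global__ void bfsNoHost(const int *rowPtr, const int *colIdx,
                          int *level, int numNodes, int source)
{
    __shared__ bool active[1024];
    __shared__ bool anyActive;

    int v = threadIdx.x;
    if (v < numNodes) {
        active[v] = (v == source);
        level[v]  = (v == source) ? 0 : -1;
    }
    __syncthreads();

    for (int iter = 0; ; ++iter) {
        if (v == 0) anyActive = false;
        __syncthreads();

        if (v < numNodes && active[v]) {
            active[v] = false;                    // this node is processed
            for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
                int u = colIdx[e];
                if (level[u] < 0) {               // unvisited neighbor
                    level[u]  = iter + 1;
                    active[u] = true;             // frontier for the next iteration
                    anyActive = true;
                }
            }
        }
        __syncthreads();
        if (!anyActive) break;                    // uniform exit: all threads agree
    }
}
```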

7 No-Host vs. Host Results
Graph generated by GTgraph with 3K nodes.
No-host uses __syncthreads() in each iteration.
[Chart: execution-time comparison showing 67% host overhead.]

8 Applications with Iterative PDE Solvers
Partial Differential Equation (PDE) solvers are widely used.
Weak execution ordering / chaotic PDE solving using iterative methods.
The accuracy of the solver is NOT critical.
Examples: Poisson Image Editing, 3D Shape from Shading.

9 Basic 3D-Shape in CUDA
New(x,y) = f(Old(x-1,y), Old(x,y-1), Old(x,y+1), Old(x+1,y))
Each block computes a sub-grid; nodes from neighboring blocks are needed to compute the boundary nodes.
Host synchronization: return to the host after each iteration.
But no exact ordering is needed!
[Figure: the grid in global memory partitioned into sub-grids for Block 0 through Block 5, with Block 2 and Block 5 holding their sub-grids in shared memory.]
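For concreteness, a hedged sketch of one iteration in this basic scheme: one thread per node, a 32x20 sub-grid per block, a halo ring in shared memory, and a four-neighbor average standing in for f(); names and sizes are illustrative. The host launches the kernel once per iteration (host synchronization) and swaps oldGrid and newGrid.

```cuda
#define TILE_X 32
#define TILE_Y 20

// One update step: New(x,y) = f(Old(x-1,y), Old(x,y-1), Old(x,y+1), Old(x+1,y)).
// Launch with dim3(width / TILE_X, height / TILE_Y) blocks of
// dim3(TILE_X, TILE_Y) threads; assumes width % TILE_X == 0, height % TILE_Y == 0.
__global__ void stencilStep(const float *oldGrid, float *newGrid,
                            int width, int height)
{
    __shared__ float tile[TILE_Y + 2][TILE_X + 2];   // sub-grid plus halo ring

    int gx = blockIdx.x * TILE_X + threadIdx.x;      // global coordinates
    int gy = blockIdx.y * TILE_Y + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;  // local (shared-memory) coordinates

    tile[ly][lx] = oldGrid[gy * width + gx];         // node owned by this thread

    // Edge threads also fetch the halo node just outside the tile
    // (these come from the neighboring blocks' sub-grids in global memory).
    if (threadIdx.x == 0          && gx > 0)          tile[ly][0]          = oldGrid[gy * width + gx - 1];
    if (threadIdx.x == TILE_X - 1 && gx < width - 1)  tile[ly][TILE_X + 1] = oldGrid[gy * width + gx + 1];
    if (threadIdx.y == 0          && gy > 0)          tile[0][lx]          = oldGrid[(gy - 1) * width + gx];
    if (threadIdx.y == TILE_Y - 1 && gy < height - 1) tile[TILE_Y + 1][lx] = oldGrid[(gy + 1) * width + gx];
    __syncthreads();

    // Averaging the four neighbors stands in for the solver's f().
    if (gx > 0 && gx < width - 1 && gy > 0 && gy < height - 1)
        newGrid[gy * width + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                            tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```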

10 Coarse Synchronization
Host synchronization every n iterations.
Between host synchronizations, blocks communicate updated boundary nodes with their neighbor blocks through global memory (see the sketch below).
[Figure: Block 2 and Block 5 sub-grids in shared memory exchanging boundary nodes through global memory.]
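A hedged sketch of the kernel under coarse synchronization, using the same illustrative assumptions as the previous sketch: each block iterates innerIters times before returning, exchanging boundary nodes with its neighbors through global memory without any precise ordering.

```cuda
#define TILE_X 32
#define TILE_Y 20

// Coarse synchronization sketch: each block runs innerIters iterations on the
// sub-grid it keeps in shared memory before returning to the host. Boundary
// values are published to / refreshed from global memory every inner
// iteration, so neighbor blocks may read slightly stale data -- tolerable for
// a chaotic (weakly ordered) iterative solver. Assumes width % TILE_X == 0,
// height % TILE_Y == 0, and in-place updates on `grid`.
__global__ void stencilCoarse(float *grid, int width, int height, int innerIters)
{
    __shared__ float tile[TILE_Y + 2][TILE_X + 2];

    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

    tile[ly][lx] = grid[gy * width + gx];            // load the owned sub-grid once

    for (int it = 0; it < innerIters; ++it) {
        // Refresh the halo from global memory (possibly stale neighbor values).
        if (threadIdx.x == 0          && gx > 0)          tile[ly][0]          = grid[gy * width + gx - 1];
        if (threadIdx.x == TILE_X - 1 && gx < width - 1)  tile[ly][TILE_X + 1] = grid[gy * width + gx + 1];
        if (threadIdx.y == 0          && gy > 0)          tile[0][lx]          = grid[(gy - 1) * width + gx];
        if (threadIdx.y == TILE_Y - 1 && gy < height - 1) tile[TILE_Y + 1][lx] = grid[(gy + 1) * width + gx];
        __syncthreads();

        float newVal = tile[ly][lx];
        if (gx > 0 && gx < width - 1 && gy > 0 && gy < height - 1)
            newVal = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                              tile[ly - 1][lx] + tile[ly + 1][lx]);
        __syncthreads();
        tile[ly][lx] = newVal;

        // Publish this block's boundary nodes so neighbor blocks can pick them up.
        if (threadIdx.x == 0 || threadIdx.x == TILE_X - 1 ||
            threadIdx.y == 0 || threadIdx.y == TILE_Y - 1)
            grid[gy * width + gx] = newVal;
    }
    grid[gy * width + gx] = tile[ly][lx];            // write the final sub-grid back
}
```

On the host, the solver loop launches stencilCoarse and synchronizes once per innerIters iterations instead of once per iteration.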

11 Coarse vs. Fine Host Synchronization
Coarse synchronization: less synchronization overhead, but more iterations to converge because boundary updates through inter-block communication are imprecise.
Reducing the inter-block communication overhead:
Overlap communication with computation
Neighbor communication: upper/lower only vs. all 4 neighbors
Block scheduling strategy: square vs. stripe

12 Overlap Communication and Computation
Separate communication threads overlap with the computation threads; no precise ordering is needed.
Computation threads:
Initialization phase: load the 32x20 data nodes into shared memory; __syncthreads()
Main phase: while < n iterations { compute iterations (no-host) }
Ending phase: store the new 32x20 data nodes to global memory; return
Communication threads:
Initialization phase: load the boundary nodes into shared memory
Main phase: store and load boundary values to/from global memory
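A hedged sketch of this role split, assuming a 32x21-thread block whose last thread row acts as the communication threads and only the upper/lower boundaries are exchanged; the names, sizes, and the averaging update are illustrative, not the paper's code.

```cuda
#define TILE_X 32
#define TILE_Y 20

// Role split inside one block (launch with dim3(TILE_X, TILE_Y + 1) threads):
//   threadIdx.y <  TILE_Y : computation threads, iterate on the shared-memory sub-grid
//   threadIdx.y == TILE_Y : communication threads, keep exchanging the upper/lower
//                           boundary rows with global memory in the background.
// No barrier separates the two roles inside the main phase: stale or racing
// boundary values are tolerated (weak execution ordering).
__global__ void stencilOverlap(float *grid, int width, int height, int iters)
{
    volatile __shared__ float tile[TILE_Y + 2][TILE_X + 2];   // sub-grid plus halo

    int bx = blockIdx.x * TILE_X, by = blockIdx.y * TILE_Y;
    int tx = threadIdx.x, ty = threadIdx.y;

    // Initialization phase: computation threads load the 32x20 sub-grid (plus the
    // left/right halo columns); communication threads load the upper/lower halo rows.
    if (ty < TILE_Y) {
        tile[ty + 1][tx + 1] = grid[(by + ty) * width + bx + tx];
        if (tx == 0 && bx > 0)                       tile[ty + 1][0]          = grid[(by + ty) * width + bx - 1];
        if (tx == TILE_X - 1 && bx + TILE_X < width) tile[ty + 1][TILE_X + 1] = grid[(by + ty) * width + bx + TILE_X];
    } else {
        if (by > 0)               tile[0][tx + 1]          = grid[(by - 1) * width + bx + tx];
        if (by + TILE_Y < height) tile[TILE_Y + 1][tx + 1] = grid[(by + TILE_Y) * width + bx + tx];
    }
    __syncthreads();

    // Main phase.
    if (ty < TILE_Y) {
        // Computation threads: chaotic in-place updates, no per-iteration barrier.
        int gx = bx + tx, gy = by + ty;
        for (int it = 0; it < iters; ++it) {
            if (gx > 0 && gx < width - 1 && gy > 0 && gy < height - 1)
                tile[ty + 1][tx + 1] = 0.25f * (tile[ty + 1][tx] + tile[ty + 1][tx + 2] +
                                                tile[ty][tx + 1] + tile[ty + 2][tx + 1]);
        }
    } else {
        // Communication threads: publish this block's top/bottom rows and refresh
        // the halo rows from the neighbor blocks, roughly once per iteration.
        for (int it = 0; it < iters; ++it) {
            grid[by * width + bx + tx]                = tile[1][tx + 1];
            grid[(by + TILE_Y - 1) * width + bx + tx] = tile[TILE_Y][tx + 1];
            if (by > 0)               tile[0][tx + 1]          = grid[(by - 1) * width + bx + tx];
            if (by + TILE_Y < height) tile[TILE_Y + 1][tx + 1] = grid[(by + TILE_Y) * width + bx + tx];
        }
    }
    __syncthreads();

    // Ending phase: computation threads store the new sub-grid back to global memory.
    if (ty < TILE_Y)
        grid[(by + ty) * width + bx + tx] = tile[ty + 1][tx + 1];
}
```

Because neither role waits for the other inside the main phase, halo values may lag behind by a few iterations, which the weakly ordered solver tolerates.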

13 Overlap Communication with Computation
Communication frequency affects both factors in:
Execution Time = Time per Iteration x Number of Iterations
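For illustration only (these numbers are invented, not measured): if exchanging boundaries every iteration costs 1.2 ms per iteration and converges in 300 iterations (360 ms total), while exchanging every 4th iteration costs 0.9 ms per iteration but needs 360 iterations (324 ms total), the less frequent communication still wins overall, which is why the communication frequency is worth tuning.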

14 Neighbor Communication
Communicating with only the upper and lower neighbors:
Less data communicated through global memory
Coalesced memory moves
Incomplete data communication, so slower convergence
Communicating with all four neighbors:
More data moves, and they are uncoalesced
May converge faster
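A rough sketch of why the two choices differ in memory behavior on a row-major grid (function and array names are invented for the illustration):

```cuda
// Row-major grid: element (x, y) lives at grid[y * width + x].

// Upper/lower boundary: 32 threads of a row read 32 consecutive addresses,
// so the loads coalesce into a few wide memory transactions.
__device__ void loadTopHaloRow(const float *grid, float *halo,
                               int width, int bx, int by)
{
    int tx = threadIdx.x;                          // 0..31
    halo[tx] = grid[(by - 1) * width + bx + tx];   // consecutive addresses
}

// Left/right boundary: 20 threads of a column read addresses that are `width`
// floats apart, so each load is a separate transaction (uncoalesced). This is
// the extra cost of communicating with all four neighbors.
__device__ void loadLeftHaloColumn(const float *grid, float *halo,
                                   int width, int bx, int by)
{
    int ty = threadIdx.y;                          // 0..19
    halo[ty] = grid[(by + ty) * width + bx - 1];   // strided addresses
}
```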

15 Block Scheduling
Blocks are scheduled in groups due to limited resources.
No updated data is available from inactive blocks.
Try to minimize the boundary nodes of the whole scheduled group.
[Figure: a 4x4 arrangement of blocks numbered under stripe scheduling (row by row) and under square scheduling (2x2 squares).]
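A hedged sketch of how a block's linear ID could be mapped to a tile position under the two schedules; the exact numbering in the slide's figure may differ, this only illustrates the idea that square scheduling keeps a co-scheduled group of blocks spatially compact.

```cuda
// Map a linear block ID to a 2-D tile position (illustrative; assumes the
// number of tiles per row is even for the square variant).
//   stripe: consecutive IDs fill one row of tiles at a time.
//   square: consecutive IDs fill 2x2 squares of tiles, so a group of blocks
//           scheduled together shares more internal boundaries and fewer
//           boundaries with inactive (not-yet-scheduled) blocks.
__device__ void blockToTile(int blockId, int tilesPerRow, bool square,
                            int *tileX, int *tileY)
{
    if (!square) {                       // stripe scheduling
        *tileX = blockId % tilesPerRow;
        *tileY = blockId / tilesPerRow;
    } else {                             // square (2x2) scheduling
        int sq       = blockId / 4;      // which 2x2 square
        int inSq     = blockId % 4;      // position inside that square
        int sqPerRow = tilesPerRow / 2;
        *tileX = (sq % sqPerRow) * 2 + (inSq % 2);
        *tileY = (sq / sqPerRow) * 2 + (inSq / 2);
    }
}
```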

16 Performance Results
[Chart: execution times; baseline ("Base") given in seconds.]

17 Conclusion
Inter-block synchronization is not supported on the GPU and has a significant impact on asynchronous PDE solvers.
Coarse synchronization and further optimizations improve the overall performance:
Separate communication threads to overlap communication with computation
Block scheduling and inter-block communication strategies
Speedup of 4-5x compared with fine-granularity host synchronization.

18 Thank You!! Questions?

