1
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2010 — Jih-Kwon Peir, Computer & Information Science & Engineering, University of Florida
2
Supplement 4: Handling Control Flow
3
Objective

- To understand the implications of control flow on:
  - Branch divergence overhead
  - SM execution resource utilization
- To learn better ways to write code with control flow (reduce branch divergence)
- To understand compiler/HW predication designed to reduce the impact of control flow
  - There is a cost involved
4
Quick terminology review
- Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA
- Warp: a group of threads executed physically in parallel on G80
- Block: a group of threads that are executed together and form the unit of resource assignment
- Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
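As a minimal sketch (the kernel name, data, and block size are illustrative, not from the slides), the grid/block/thread hierarchy shows up directly in a kernel launch:

    __global__ void scaleKernel(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index within the grid
        if (i < n)
            data[i] *= factor;
    }

    // Host side: a grid of blocks, each block a group of threads; the hardware
    // further splits each block into warps of 32 threads.
    //   dim3 block(256);                          // one block = 256 threads = 8 warps
    //   dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n elements
    //   scaleKernel<<<grid, block>>>(d_data, 2.0f, n);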
5
How thread blocks are partitioned
- Thread blocks are partitioned into warps
  - Thread IDs within a warp are consecutive and in increasing order
  - Warp 0 starts with thread ID 0
- Partitioning is always the same
  - Thus you can use this knowledge in control flow
  - However, the exact size of warps may change from generation to generation (details will be covered next)
- However, DO NOT rely on any ordering between warps (warps are independent)
  - If there are any dependencies between threads, you must use __syncthreads() to get correct results
6
How thread blocks are partitioned
For illustration, assume a 4x4 thread block partitioned into 2 warps of 8 threads each (a reduced warp size to keep the example small):

- Warp 0: threads (0,0), (1,0), (2,0), (3,0), (0,1), (1,1), (2,1), (3,1)
- Warp 1: threads (0,2), (1,2), (2,2), (3,2), (0,3), (1,3), (2,3), (3,3)
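The warp assignment follows the row-major linearization of the 2-D thread index; a small sketch of how a thread can compute its own warp and lane (variable names are illustrative):

    // Inside a kernel launched with a 2-D thread block: thread IDs are
    // linearized in row-major order (threadIdx.x fastest), then cut into warps.
    int linearId = threadIdx.y * blockDim.x + threadIdx.x;
    int warpId   = linearId / warpSize;   // warpSize is a built-in variable (32 on G80)
    int laneId   = linearId % warpSize;   // position of the thread within its warp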
7
Control Flow Instructions
- The main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized on G80
  - The control paths taken by the threads in a warp are traversed one at a time until there are no more
- A common case: avoid divergence when the branch condition is a function of thread ID (see the sketch below)
  - Example with divergence: if (threadIdx.x > 2) { }
    - This creates two different control paths for threads in a block
    - Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  - Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }
    - Also creates two different control paths for threads in a block
    - Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
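A sketch contrasting the two cases above in one kernel (the kernel and array names are illustrative assumptions):

    #define WARP_SIZE 32

    __global__ void branchExample(float *out)
    {
        int tid = threadIdx.x;

        // Divergent: threads 0-2 and threads 3-31 of warp 0 take different paths,
        // so that warp executes both paths serially.
        if (tid > 2)
            out[tid] = 1.0f;
        else
            out[tid] = 2.0f;

        // Non-divergent: the condition is constant within each warp
        // (warps 0-2 take one path, warps 3 and above take the other),
        // so no warp has to execute both paths.
        if (tid / WARP_SIZE > 2)
            out[tid] += 1.0f;
        else
            out[tid] += 2.0f;
    }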
8
Parallel Reduction

- Given an array of values, "reduce" them to a single value in parallel
- Examples:
  - Sum reduction: sum of all values in the array
  - Max reduction: maximum of all values in the array
- Typical parallel implementation:
  - Recursively halve the number of threads, adding two values per thread
  - Takes log(n) steps for n elements, requires n/2 threads
9
A Vector Reduction Example
- Sum an array using one thread block
- Assume an in-place reduction using shared memory
  - The original vector is in device global memory
  - The shared memory is used to hold a partial-sum vector
  - Each iteration brings the partial-sum vector closer to the final sum
  - The final solution will be in element 0
10
A simple implementation
Assume we have already loaded the array into shared memory:

    __shared__ float partialSum[BLOCK_SIZE];   // one element per thread (BLOCK_SIZE = threads per block)

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
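A minimal sketch of the surrounding kernel this fragment assumes; the kernel name, the choice of BLOCK_SIZE, and the global-memory load/store are assumptions added for completeness, not part of the slide:

    #define BLOCK_SIZE 512   // assumed threads per block; must equal blockDim.x

    __global__ void sumReduce(float *g_in, float *g_out)
    {
        __shared__ float partialSum[BLOCK_SIZE];

        unsigned int t = threadIdx.x;
        // Load one element per thread from global memory into shared memory.
        partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
        {
            __syncthreads();
            if (t % (2 * stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }

        // Thread 0 ends up with the block's sum; write it back to global memory.
        if (t == 0)
            g_out[blockIdx.x] = partialSum[0];
    }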
11
Vector Reduction with Shared Memory Bank Conflicts
[Figure: array elements 0-15 reduced in place over log2(n) iterations; iteration 1 forms the pairwise sums 0+1, 2+3, ..., 10+11; iteration 2 forms 0..3, 4..7, 8..11; iteration 3 forms 0..7 and 8..15.]
12
Vector Reduction with Branch Divergence
[Figure: the same reduction tree annotated with the owning threads; only even-numbered threads (Thread 0, 2, 4, 6, 8, 10, ...) are active in iteration 1, and fewer are active in each later iteration.]
13
Some Observations

- In each iteration, two control-flow paths are sequentially traversed for each warp
  - Threads that perform the addition and threads that do not
  - Threads that do not perform the addition may still cost extra cycles, depending on how divergence is implemented
- No more than half of the threads are executing at any time
  - All odd-index threads are disabled right from the beginning!
  - On average, fewer than 1/4 of the threads are active across all warps over time
  - After the 5th iteration, entire warps in each block are disabled: poor resource utilization, but no divergence
    - This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each warp that remains active has at most one active thread, until all warps retire
14
Shortcomings of the implementation

Assume we have already loaded the array into shared memory:

    __shared__ float partialSum[BLOCK_SIZE];   // one element per thread

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2 * stride) == 0)       // BAD: divergence due to interleaved branch decisions
            partialSum[t] += partialSum[t + stride];
    }
15
A better implementation
Assume we have already loaded the array into shared memory:

    __shared__ float partialSum[BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)                  // active threads are adjacent, packed at the low indices
            partialSum[t] += partialSum[t + stride];
    }
16
No Divergence until < 16 sub-sums
[Figure: first iteration on 32 partial sums; threads 0-15 compute 0+16, 1+17, ..., 15+31, so the active threads are contiguous and whole warps drop out together in later iterations.]
17
Some Observations About the New Implementation
- Only the last 5 iterations will have divergence
- Entire warps are shut down as the iterations progress
  - For a 512-thread block, 4 iterations shut down all but one warp in each block
- Better resource utilization; warps, and thus blocks, will likely retire faster
- Recall: no bank conflicts either
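To make these counts concrete for a 512-thread block (16 warps of 32 threads), the active threads and warps per iteration of the improved loop are:

    iteration   stride   active threads   active warps   divergence
    1           256      256              8              no
    2           128      128              4              no
    3            64       64              2              no
    4            32       32              1              no
    5-9          16..1    16..1           part of warp 0 yes (within warp 0 only)

After the first 4 iterations only warp 0 remains active, and only the last 5 iterations diverge, matching the observations above.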
18
A Potential Further Refinement (but a bad idea)

- For the last 6 iterations only one warp is active (i.e., tids 0..31)
- Shared-memory reads and writes are SIMD-synchronous within a warp
- So skip __syncthreads() and unroll the last 6 iterations:

    unsigned int tid = threadIdx.x;
    for (unsigned int d = n >> 1; d > 32; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
            shared[tid] += shared[tid + d];
    }
    if (tid < 32)   // unroll the last 6 predicated steps
    {
        shared[tid] += shared[tid + 32];
        shared[tid] += shared[tid + 16];
        shared[tid] += shared[tid + 8];
        shared[tid] += shared[tid + 4];
        shared[tid] += shared[tid + 2];
        shared[tid] += shared[tid + 1];
    }

This would not work properly if the warp size ever decreases; a __syncthreads() would then be needed between the statements. However, placing __syncthreads() inside a divergent if statement is problematic, since not all threads of the block reach it.
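For reference, when this warp-synchronous trick was used in practice (for example in NVIDIA's SDK reduction sample of that era), the shared-memory pointer was typically declared volatile so the compiler cannot keep intermediate sums in registers; note this only addresses the compiler hazard, not the warp-size assumption criticized above. A hedged sketch:

    // Assumes the warp size is exactly 32 and intra-warp execution is lockstep.
    __device__ void warpReduce(volatile float *shared, unsigned int tid)
    {
        shared[tid] += shared[tid + 32];
        shared[tid] += shared[tid + 16];
        shared[tid] += shared[tid + 8];
        shared[tid] += shared[tid + 4];
        shared[tid] += shared[tid + 2];
        shared[tid] += shared[tid + 1];
    }
    // Called from the kernel as:  if (tid < 32) warpReduce(partialSum, tid);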
19
Predicated Execution Concept (Handling Branch Divergence)
    <p1> LDR r1, r2, 0

- If p1 is TRUE, the instruction executes normally
- If p1 is FALSE, the instruction is treated as a NOP
20
Predication Example

Source code:

    if (x == 10)
        c = c + 1;

Predicated pseudo-assembly:

    LDR  r5, X
    p1 <- r5 eq 10
    <p1> LDR r1 <- C
    <p1> ADD r1, r1, 1
    <p1> STR r1 -> C
21
Predication very helpful for if-else
[Figure: control-flow graph with then-block B, else-block C, and join block D, shown before and after if-conversion.]
22
If-else example

Before scheduling, the two paths are predicated on complementary predicates:

    p1, p2 <- r5 eq 10
    <p1> inst 1 from B
    <p1> inst 2 from B
    <p1> ...
    <p2> inst 1 from C
    <p2> inst 2 from C
    <p2> ...

After scheduling, instructions from B and C can be interleaved:

    p1, p2 <- r5 eq 10
    <p1> inst 1 from B
    <p2> inst 1 from C
    <p1> inst 2 from B
    <p2> inst 2 from C
    ...

The cost is that extra instructions are issued each time the code executes; however, there is no branch divergence.
23
Instruction Predication in G80
- Comparison instructions set condition codes (CC)
- Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.)
- The compiler tries to predict whether a branch condition is likely to produce many divergent warps
  - If guaranteed not to diverge: predicates only if < 4 instructions
  - If not guaranteed: predicates only if < 7 instructions
  - May replace branches with instruction predication
- ALL predicated instructions take execution cycles
  - Those with false conditions do not write their output, nor do they invoke memory loads and stores
  - Saves branch instructions, so it can be cheaper than serializing divergent paths
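As an illustration (not taken from the slides, and the compiler makes the final decision), a branch with a tiny body like the one below is the kind that can be if-converted into predicated or select instructions, whereas the longer divergent branches shown earlier would be serialized:

    __global__ void clampKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = data[i];
            // A one-instruction body: a good candidate for predication/select
            // rather than an actual branch.
            if (v < 0.0f)
                v = 0.0f;
            data[i] = v;
        }
    }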
24
For more information on instruction predication
S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, "A Comparison of Full and Partial Predicated Execution Support for ILP Processors," Proceedings of the 22nd International Symposium on Computer Architecture, June 1995. Also available in Readings in Computer Architecture, edited by Hill, Jouppi, and Sohi, Morgan Kaufmann, 2000.
25
Performance with Prefetch and Unroll
26
Graph Algorithm on CUDA
27
Graph Representation on CUDA
- Compact edge-list (adjacency-list) representation:
  - A long vector of vertices and a long vector of edges, with the edges of vertex i followed by the edges of vertex i+1
  - Each entry in the vertex array points to the start of its edge list in the edge array
  - Less space is required, so larger graphs can be accommodated in GPU memory
  - Space complexity: O(V) for the vertex array and O(E) for the edge array, i.e., O(V+E) overall

[Figure: example graph with 5 vertices; the vertex array holds each vertex's starting offset into the edge array, and the edge array lists all edge endpoints consecutively.]
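A minimal sketch of this representation in code (the struct, array names, and the tiny example graph are illustrative assumptions, not from the slides):

    // Compact adjacency-list ("CSR-like") graph representation.
    // vertexStart has V+1 entries: the edges of vertex v are
    // edgeDest[vertexStart[v]] .. edgeDest[vertexStart[v+1] - 1].
    struct Graph {
        int  numVertices;
        int *vertexStart;   // size V+1, offsets into edgeDest
        int *edgeDest;      // size E, destination vertex of each edge
    };

    // Example: 3 vertices with edges 0->1, 0->2, 1->2, 2->0
    //   int vertexStart[] = {0, 2, 3, 4};
    //   int edgeDest[]    = {1, 2, 2, 0};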
28
Breadth First Search – An Example
- Problem: find the smallest number of edges needed to reach every vertex from a given source vertex
- Properties:
  - Proceeds level by level; once a level is visited it is not visited again (like an explosion)
  - The CPU (host) can be used for level synchronization; the GPU is used to exploit intra-level parallelism
  - Each vertex updates the cost (level) of its neighbors
- Synchronization issues: no per-write synchronization is required, since multiple writes of the same level value do not cause a problem
29
Breadth First Search (BFS) Example

- Given G(V,E) and a source vertex s, compute the number of steps needed to reach every other vertex
- Each thread computes one vertex
- Initially all vertices are inactive except the source vertex
- When a vertex is activated, it is visited and it activates its unvisited neighbors
- n-1 steps are needed to reach a vertex visited in the nth iteration
- Keep iterating until there are no active vertices
- Synchronization is needed after each iteration (level); see the host-side sketch below

[Figure: example graph with vertices S, A, B, C, D, E and states Inactive / Active / Visited; iteration 1 visits S and activates A, B, C; iteration 2 visits A, B, C and activates D, E; iteration 3 visits D, E and the search is done.]

Notes on host-side level synchronization:
1) The lifetime of shared-memory variables is confined to a single kernel invocation. Data in shared memory must therefore be saved to global memory before the kernel returns to the host if it is to be reused, and is typically restored into shared memory in the subsequent kernel invocation.
2) Besides the overhead of kernel initiation, intermediate results may need to be transferred back to the host to control the synchronization, i.e., to decide whether another level (iteration) is needed.
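A minimal sketch of the host-side level-synchronization loop implied above (the kernel, flag, and device-array names are illustrative assumptions; the kernel itself is sketched under the implementation-details slide below):

    // The host drives one kernel launch per BFS level; the kernel sets *d_done
    // to false whenever it activates at least one vertex for the next level.
    bool h_done = false;
    while (!h_done) {
        h_done = true;
        cudaMemcpy(d_done, &h_done, sizeof(bool), cudaMemcpyHostToDevice);

        bfsLevelKernel<<<numBlocks, blockSize>>>(numVertices, d_vertexStart,
                                                 d_edgeDest, d_frontier,
                                                 d_visited, d_cost, d_done);

        // Copying the flag back is the per-level synchronization cost
        // mentioned in note 2 above.
        cudaMemcpy(&h_done, d_done, sizeof(bool), cudaMemcpyDeviceToHost);
    }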
30
Breadth First Search CUDA Implementation Details
- One thread per vertex (for small graphs)
- Two Boolean arrays, Frontier and Visited, are used
- Each thread looks at its entry in the Frontier array
  - If its vertex is in the Frontier, it updates the cost (level) of its neighbors
  - It adds its neighbors to the Frontier if they are not already in the Visited array
  - It adds its neighbors to the Visited array
- The CPU initiates each kernel execution
- Execution stops when the Frontier is empty

[Figure: example graph with source S; level costs 1 and 2 propagate outward, the Frontier and Visited arrays fill in level by level, and execution stops when the Frontier is empty.]
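A minimal sketch of the per-level kernel these details describe, following the frontier/visited scheme of Harish et al. (array and kernel names are assumptions matching the host loop sketched earlier):

    __global__ void bfsLevelKernel(int numVertices, const int *vertexStart,
                                   const int *edgeDest, bool *frontier,
                                   bool *visited, int *cost, bool *done)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per vertex
        if (v < numVertices && frontier[v]) {
            frontier[v] = false;                          // leave the current frontier
            for (int e = vertexStart[v]; e < vertexStart[v + 1]; e++) {
                int n = edgeDest[e];
                if (!visited[n]) {
                    cost[n]     = cost[v] + 1;   // all writers store the same value: benign race
                    visited[n]  = true;
                    frontier[n] = true;          // neighbor joins the next frontier
                    *done       = false;         // at least one vertex was activated
                }
            }
        }
    }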
31
Breadth First Search results
P. Harish et al., IIIT Hyderabad, HiPC'07 — results on real-world road graphs (average degree 2-3):

    Graph       O(V)   O(E)   CPU BFS (ms)   GPU BFS (ms)
    New York    250K   730K    313            126
    Florida     1M     2.7M   1055           1143
    USA East    3M     8M     3844           4005
    USA West    6M     15M    6688           7853

The paper also reports results on random graphs with degree 6 per vertex, on random scale-free graphs with 0.1% high-degree vertices, and on 100K-vertex random graphs with varying degree per vertex.