Introduction to CUDA Programming: Scans
Andreas Moshovos, Winter 2009
Based on slides from Wen-mei Hwu (UIUC) and David Kirk (NVIDIA), and on the white paper/slides by Mark Harris (NVIDIA)
Scan / Parallel Prefix Sum
  Given an array A = [a0, a1, …, an-1] and a binary associative operator @ with identity I:
    scan(A) = [I, a0, (a0 @ a1), …, (a0 @ a1 @ … @ an-2)]
  This is the exclusive scan. We’ll focus on this one.
  Example (@ is +):
    input:  3 1 7 0 4 1 6 3
    output: 0 3 4 11 11 15 16 22
The inclusive scan
  Given an array A = [a0, a1, …, an-1] and a binary associative operator @ with identity I:
    scan(A) = [a0, (a0 @ a1), …, (a0 @ a1 @ … @ an-1)]
  Example (@ is +):
    input:  3 1 7 0 4 1 6 3
    output: 3 4 11 11 15 16 22 25
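The two definitions differ only in whether element i is folded into output i. A minimal sequential sketch in C, assuming a hypothetical function-pointer type `binop` for the operator @ (the names `binop`, `add`, `exclusive_scan` and `inclusive_scan` are illustrative and do not come from the slides):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator type: any associative binary operator "@". */
typedef int (*binop)(int, int);

int add(int a, int b) { return a + b; }

/* Exclusive scan: out[i] combines in[0..i-1]; out[0] is the identity. */
void exclusive_scan(int *out, const int *in, size_t n, binop op, int identity)
{
    assert(out != in);            /* this sketch is not safe in place */
    int acc = identity;
    for (size_t i = 0; i < n; ++i) {
        out[i] = acc;             /* everything strictly before i */
        acc = op(acc, in[i]);
    }
}

/* Inclusive scan: out[i] combines in[0..i]. */
void inclusive_scan(int *out, const int *in, size_t n, binop op, int identity)
{
    int acc = identity;
    for (size_t i = 0; i < n; ++i) {
        acc = op(acc, in[i]);
        out[i] = acc;
    }
}
```

Calling both with `add` and identity 0 on 3 1 7 0 4 1 6 3 should reproduce the two example rows; any other associative operator (max, multiplication) can be swapped in together with its identity.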
Applications of Scan
  Scan is used as a building block for many parallel algorithms:
    Radix sort
    Quicksort
    String comparison
    Lexical analysis
    Run-length encoding
    Histograms
    Etc.
  See: Guy E. Blelloch. “Prefix Sums and Their Applications”. In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/scandal/public/papers/CMU-CS-90-190.html
Scan Background
  Pre-GPU
    First proposed in APL by Iverson (1962)
    Used as a data parallel primitive in the Connection Machine (1990)
      Feature of C* and CM-Lisp
    Guy Blelloch used scan as a primitive for various parallel algorithms
      Blelloch, 1990, “Prefix Sums and Their Applications”
  GPU Computing
    O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
      Applied to Summed Area Tables by Hensley et al. (EG05)
    O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
    O(n) work & space GPU implementation by Harris et al. (2007)
      NVIDIA CUDA SDK and GPU Gems 3
      Applied to radix sort, stream compaction, and summed area tables
Sequential algorithm
      void scan(float *output, float *input, int length)
      {
          output[0] = 0;  // since this is a prescan (exclusive scan), not a scan
          for (int j = 1; j < length; ++j)
              output[j] = input[j-1] + output[j-1];
      }
  N additions
  Use it as a guide: we want the parallel algorithm to be work efficient, i.e., to do a similar amount of work
Naïve Parallel Algorithm
      for d := 1 to log2(n) do
          forall k in parallel do
              if k >= 2^(d-1) then
                  x[k] := x[k − 2^(d-1)] + x[k]
  Trace for the input 3 1 7 0 4 1 6 3, shifted right by one so that the result is an exclusive scan:
    start:              0 3 1 7 0 4 1 6
    d = 1, 2^(d-1) = 1: 0 3 4 8 7 4 5 7
    d = 2, 2^(d-1) = 2: 0 3 4 11 11 12 12 11
    d = 3, 2^(d-1) = 4: 0 3 4 11 11 15 16 22
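Need Double-Buffering
  Each step must behave as if all threads first read and only then all threads write
  But there are no ordering guarantees among threads on a GPU
  Solution:
    Use two arrays: input and output
    Alternate their roles at each step
  Example: one step reads 0 3 4 8 7 4 5 7 from one array and writes 0 3 4 11 11 12 12 11 to the other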
Double Buffering
  Two arrays: A and B
  Input in global memory
  Output in global memory
  [Figure: the input 3 1 7 0 4 1 6 3 is read from global memory; the rows 0 3 1 7 0 4 1 6, then 0 3 4 8 7 4 5 7, then 0 3 4 11 11 12 12 11, then 0 3 4 11 11 15 16 22 alternate between A and B; the last row is written back to global memory]
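The ping-pong between the two arrays can be checked on the CPU. The following is a sketch only, not the kernel itself: it assumes a hypothetical helper name `naive_scan_double_buffered` and an arbitrary bound `MAX_N`, and replaces the parallel threads with an ordinary loop over k.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 1024   /* assumed upper bound for this sketch */

/* CPU simulation of the double-buffered naive scan (exclusive). */
void naive_scan_double_buffered(float *out, const float *in, int n)
{
    assert(n > 0 && n <= MAX_N);
    float buf[2][MAX_N];
    int pout = 0, pin = 1;

    /* Shift right by one so the result is an exclusive scan. */
    for (int k = 0; k < n; ++k)
        buf[pout][k] = (k > 0) ? in[k - 1] : 0.0f;

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;              /* swap the roles of the two buffers */
        pin  = 1 - pout;
        /* Every "thread" k reads only buf[pin] and writes only buf[pout],
           so the order in which the k iterations run does not matter. */
        for (int k = 0; k < n; ++k) {
            buf[pout][k] = buf[pin][k];
            if (k >= offset)
                buf[pout][k] += buf[pin][k - offset];
        }
    }
    memcpy(out, buf[pout], (size_t)n * sizeof(float));
}
```

Because reads and writes never touch the same buffer within a step, the inner loop could run in any order, which is exactly the property the GPU version needs.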
Naïve Kernel in CUDA
      __global__ void scan_naive(float *g_odata, float *g_idata, int n)
      {
          extern __shared__ float temp[];  // 2*n floats, sized at kernel launch
          int thid = threadIdx.x, pout = 0, pin = 1;
          // exclusive scan: shift right by one and set the first element to 0
          temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
          for (int dd = 1; dd < n; dd *= 2)
          {
              pout = 1 - pout;  // swap the double-buffer indices
              pin = 1 - pout;
              int basein = pin * n, baseout = pout * n;
              __syncthreads();  // the previous step's writes are complete
              temp[baseout + thid] = temp[basein + thid];
              if (thid >= dd)
                  temp[baseout + thid] += temp[basein + thid - dd];
          }
          g_odata[thid] = temp[pout*n + thid];  // each thread writes the element it computed
      }
Analysis of naïve kernel
  This scan algorithm executes log(n) parallel iterations
  The steps do n-1, n-2, n-4, ..., n/2 adds each
  Total adds: O(n*log(n))
  This scan algorithm is NOT work efficient: the sequential scan algorithm does n adds
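The total can be written out explicitly. Assuming n is a power of two and that step d (for d = 1, …, log2 n) performs n − 2^(d−1) adds, as in the list of per-step counts:

```latex
\sum_{d=1}^{\log_2 n} \left( n - 2^{d-1} \right)
  = n \log_2 n - \sum_{d=1}^{\log_2 n} 2^{d-1}
  = n \log_2 n - (n - 1)
  = O(n \log n)
```

For n = 8 this gives 7 + 6 + 4 = 17 adds, against roughly n for the sequential loop.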
Improving Efficiency
  A common parallel algorithms pattern: balanced trees
    Build a balanced binary tree on the input data and sweep to and from the root
    The tree is conceptual, not an actual data structure
  For scan:
    Traverse from leaves to root, building partial sums at internal nodes; the root holds the sum of all leaves
    Traverse from root to leaves, building the scan from the partial sums
  Algorithm originally described by Blelloch (1990)
Balanced Tree-Based Scan Algorithm / Up-Sweep
Balanced Tree-Based Scan Algorithm / Down-Sweep
Up-Sweep Pseudo-Code
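The pseudo-code on this slide did not survive extraction. As a stand-in, here is a sequential C sketch of an in-place up-sweep (reduction); the function name `up_sweep` and the loop variable `stride` are assumptions, and n is assumed to be a power of two.

```c
#include <assert.h>

/* In-place up-sweep (reduce) phase; n is assumed to be a power of two. */
void up_sweep(int *x, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (int stride = 1; stride < n; stride *= 2) {
        /* On a GPU, all iterations of this inner loop run in parallel. */
        for (int k = 2 * stride - 1; k < n; k += 2 * stride)
            x[k] += x[k - stride];   /* right child += left child */
    }
}
```

After the call, x[n-1] holds the sum of all elements and the other odd-indexed positions hold partial sums of their subtrees.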
Down-Sweep Pseudo-Code
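The pseudo-code on this slide did not survive extraction either. A sequential C sketch of the in-place down-sweep follows; the name `down_sweep` is an assumption, the array is expected to already hold the result of an up-sweep, and n is assumed to be a power of two.

```c
#include <assert.h>

/* In-place down-sweep; x must hold the result of the up-sweep.
   n is assumed to be a power of two. */
void down_sweep(int *x, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    x[n - 1] = 0;                        /* root: nothing lies to its left */
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        /* On a GPU, all iterations of this inner loop run in parallel. */
        for (int k = 2 * stride - 1; k < n; k += 2 * stride) {
            int left = x[k - stride];    /* sum of the left subtree */
            x[k - stride] = x[k];        /* left child takes the parent's value */
            x[k] += left;                /* right child: parent's value + left sum */
        }
    }
}
```

Each node passes "the sum of everything to my left" to its left child unchanged and to its right child with the left subtree's sum added, which yields the exclusive scan at the leaves.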
Two phases
  Up-Sweep
    Essentially a reduction
    Produces many partial results
  Down-Sweep
    Propagates the partial results to all relevant elements
Up-Sweep
  Just a reduction:
    input:        1 2 2 5 6 3 8 2 4 1 5 2 7 9 3 5
    after step 1: 1 3 2 7 6 9 8 10 4 5 5 7 7 16 3 8
    after step 2: 1 3 2 10 6 9 8 19 4 5 5 12 7 16 3 24
    after step 3: 1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 36
    after step 4: 1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 65
Up-Sweep
  Now let’s see that this is a tree:
    leaves:  1 2 2 5 6 3 8 2 4 1 5 2 7 9 3 5
    level 1: 3 7 9 10 5 7 16 8
    level 2: 10 19 12 24
    level 3: 29 36
    root:    65
  Notice that only these nodes are left in our array; the rest were partial results:
    1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 65
Up-Sweep
  So, this is what’s left. Nodes without values don’t exist; they were partial results.
    leaves:  1 2 6 8 4 5 7 3
    level 1: 3 9 5 16
    level 2: 10 12
    level 3: 29
    root:    65
Down-Sweep
  For the second phase we need to think of:
    The edges in reverse
    The empty nodes as placeholders for partial results
  [Figure: the tree left after the up-sweep]
Down-Sweep
  Now let’s view the tree as a collection of subtrees
  The root of each subtree, where it is still present, contains the reduction of all subtree elements, i.e., the sum of all subtree elements
  [Figure: the tree left after the up-sweep]
Down-Sweep
  Let’s focus on the rightmost subtree
  [Figure: the tree left after the up-sweep, with the rightmost subtree highlighted]
Down-Sweep
  Before the last step of the down-sweep phase, the yellow element will contain the sum (57) of all elements to the left of the subtree
  [Figure: the rightmost subtree, with leaf 3 and root 57]
  The last step takes the following two actions:
    3 + 57 = 60 goes to the rightmost element. This is the sum of all elements including 3 but excluding the rightmost one.
    3 is overwritten with 57. This is the sum of all elements to the left of 3.
Down-Sweep
  In terms of the array stored in memory, these actions look like this:
    before: 3 57
    after:  57 60
  [Figure: the dark arrows represent addition; the red dotted arrow represents a move]
Down-Sweep
  Let’s now focus on the rightmost subtree that contains the last four nodes
  This is processed one step before the subtree we just discussed
  [Figure: the subtree with leaves 7 and 3 and inner node 16]
Down-Sweep
  Before the second-to-last step of the down-sweep phase, the green element will contain the sum (41) of all elements to the left of the subtree
  [Figure: the subtree with leaves 7 and 3, inner node 16 and root 41]
Down-Sweep
  The actions taken at this step are:
    16 + 41 = 57 is written as the root of the rightmost subtree. As we saw before, this is the sum of all elements to the left of the rightmost subtree.
    41 replaces 16. This is the sum of all elements to the left of the subtree rooted at 16.
Down-Sweep
  In terms of the array stored in memory, these actions look like this:
    before: 7 16 3 41
    after:  7 41 3 57
  [Figure: the dark arrows represent addition; the red dotted arrow represents a move]
Down-Sweep
  Now let’s go one step back and look at the complete right subtree (in green)
  [Figure: the right subtree with leaves 4 5 7 3, inner nodes 5 and 16, and root 12]
Down-Sweep
  Before this step, the root node will contain the sum (29) of all elements of the left subtree
  [Figure: the same subtree, with the root now holding 29]
Down-Sweep
  As before, we do two things:
    29 + 12 = 41 becomes the root of the rightmost subtree. This is the sum of all elements to the left of that subtree, as needed for the next step (which we saw previously).
    29 replaces 12, for the same reason: 29 is the sum of all elements to the left of the subtree rooted at what was 12.
Down-Sweep
  Let’s try to generalize what happens at every step of the down-sweep phase
  Let’s look at step 1: there is only one subtree (shown in purple)
  [Figure: the tree left after the up-sweep, with root 65]
Down-Sweep
  Before we process this tree as described before, the root node must contain the sum of all elements to the left of the tree
  There are no such elements, hence the root must be 0
Down-Sweep
  Now repeat the steps we saw before:
    29 + 0 = 29 becomes the root of the right subtree
    29 gets replaced by 0
Down-Sweep
  In terms of the array stored in memory, these actions look like this:
    before: 1 3 2 10 6 9 8 29 4 5 5 12 7 16 3 0
    after:  1 3 2 10 6 9 8 0 4 5 5 12 7 16 3 29
  [Figure: the dark arrows represent addition; the red dotted arrow represents a move]
CUDA Implementation
  Declarations & copying to shared memory
  Two elements per thread
      __global__ void prescan(float *g_odata, float *g_idata, int n)
      {
          extern __shared__ float temp[];  // sized at kernel launch
          int thid = threadIdx.x;
          int offset = 1;
          temp[2*thid]   = g_idata[2*thid];    // load input into shared memory
          temp[2*thid+1] = g_idata[2*thid+1];
CUDA Implementation
  Up-Sweep
          for (int d = n>>1; d > 0; d >>= 1)  // build sum in place up the tree
          {
              __syncthreads();
              if (thid < d)
              {
                  int ai = offset*(2*thid+1)-1;
                  int bi = offset*(2*thid+2)-1;
                  temp[bi] += temp[ai];
              }
              offset *= 2;
          }
  Same computation as before, but with a different assignment of threads to elements
Up-Sweep: Who does what
  For N = 16 (n = 16); each entry is (ai, bi):
    d  offset  thid 0   thid 1    thid 2    thid 3    thid 4   thid 5    thid 6    thid 7
    8  1       (0, 1)   (2, 3)    (4, 5)    (6, 7)    (8, 9)   (10, 11)  (12, 13)  (14, 15)
    4  2       (1, 3)   (5, 7)    (9, 11)   (13, 15)
    2  4       (3, 7)   (11, 15)
    1  8       (7, 15)
Down-Sweep
          // clear the last element
          if (thid == 0) { temp[n - 1] = 0; }
          // traverse down tree & build scan
          for (int d = 1; d < n; d *= 2)
          {
              offset >>= 1;
              __syncthreads();
              if (thid < d)
              {
                  int ai = offset*(2*thid+1)-1;
                  int bi = offset*(2*thid+2)-1;
                  float t = temp[ai];
                  temp[ai] = temp[bi];
                  temp[bi] += t;
              }
          }
          __syncthreads();
Down-Sweep: Who does what
  For n = 16; each entry is (ai, bi):
    d  offset  thid 0   thid 1    thid 2    thid 3    thid 4   thid 5    thid 6    thid 7
    1  8       (7, 15)
    2  4       (3, 7)   (11, 15)
    4  2       (1, 3)   (5, 7)    (9, 11)   (13, 15)
    8  1       (0, 1)   (2, 3)    (4, 5)    (6, 7)    (8, 9)   (10, 11)  (12, 13)  (14, 15)
Copy to output
  All threads pass the __syncthreads() that ends the down-sweep, then write their two results:
          // write results to global memory
          g_odata[2*thid]   = temp[2*thid];
          g_odata[2*thid+1] = temp[2*thid+1];
      }
Current scan implementation has many shared memory bank conflicts
  These really hurt performance on hardware
  They occur when multiple threads access the same shared memory bank with different addresses
  No penalty if all threads access different banks, or if all threads access the exact same address
  Access costs 2*M cycles if there is a conflict, where M is the max number of threads accessing a single bank
Loading from Global Memory to Shared
  Each thread loads two shared memory data elements
  The original code interleaves the loads:
      temp[2*thid]   = g_idata[2*thid];
      temp[2*thid+1] = g_idata[2*thid+1];
  Threads (0, 1, 2, …, 8, 9, 10, …) access banks (0, 2, 4, …, 0, 2, 4, …)
  Better to load one element from each half of the array:
      temp[thid]         = g_idata[thid];
      temp[thid + (n/2)] = g_idata[thid + (n/2)];
Bank Conflicts in the Tree Algorithm / Up-Sweep
  When we build the sums, each thread reads two shared memory locations and writes one
  Threads 0 and 8 access bank 0, threads 1 and 9 access bank 2, and so on
  First iteration: 2 threads access each of 8 banks
  [Figure: threads t0 to t9 mapped onto shared memory banks 0 to 15; each arrow corresponds to a single thread, and like-colored arrows represent simultaneous memory accesses]
Bank Conflicts in the Tree Algorithm / Down-Sweep
  2nd iteration: even worse, 4-way bank conflicts; for example, threads 0, 4, 8, 12 access bank 1, threads 1, 5, 9, 13 access bank 5, etc.
  2nd iteration: 4 threads access each of 4 banks
  [Figure: threads t0 to t4 mapped onto shared memory banks 0 to 15; each arrow corresponds to a single thread, and like-colored arrows represent simultaneous memory accesses]
Using Padding to Prevent Conflicts
  We can use padding to prevent bank conflicts
  Just add a word of padding every 16 words
  [Figure: banks 0 to 15 followed by a padding word P, repeated, with the accesses shown over time]
Using Padding to Remove Conflicts
  After you compute a shared memory address like this:
      address = 2 * stride * thid;
  add padding like this:
      address += (address >> 4);  // same as address / 16
  This removes most bank conflicts, but not all of them in the case of deep trees
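The effect of the padding formula can be checked by counting how many threads of a half-warp land in the same bank. This is a simplified CPU model under stated assumptions: 16 banks, all 16 threads of the half-warp active (which overstates conflicts near the root, where few threads work), and the hypothetical helper names `pad` and `conflict_degree`.

```c
#include <assert.h>

#define LOG_NUM_BANKS 4
#define NUM_BANKS (1 << LOG_NUM_BANKS)   /* 16 banks, as in the slides */
#define HALF_WARP 16                     /* threads served together */

/* One word of padding per NUM_BANKS words. */
int pad(int address) { return address + (address >> LOG_NUM_BANKS); }

/* Worst-case number of threads in a half-warp that hit the same bank when
   thread t accesses word 2 * stride * t (optionally padded). */
int conflict_degree(int stride, int use_padding)
{
    assert(stride > 0);
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < HALF_WARP; ++t) {
        int address = 2 * stride * t;
        if (use_padding)
            address = pad(address);
        int bank = address % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

In this model, strides 1 and 2 give 2-way and 4-way conflicts without padding and none with it, while a large stride such as 128 still maps every thread to bank 0 even after padding, which is the "deep trees" caveat.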
Scan Bank Conflicts (1)
  A full binary tree with 64 leaf nodes: multiple 2- and 4-way bank conflicts
  Shared memory cost for the whole tree:
    1 32-thread warp = 6 cycles per thread without conflicts, counting 2 shared memory reads and one write (s[a] += s[b])
    6 * (2+4+4+4+2+1) = 102 cycles
    36 cycles if there were no bank conflicts (6 * 6)
  [Figure: the tree levels, with a half-warp marked]
Scan Bank Conflicts (1)
  Look at step 1 (scale 1): 2-way conflicts
    1. the first 8 threads read s[a]: 2 cycles
    2. the next 8 threads read s[a]: 2 cycles
    3. the first 8 threads read s[b]: 2 cycles
    4. the next 8 threads read s[b]: 2 cycles
    5. the first 8 threads update s[a]: 2 cycles
    6. the next 8 threads update s[a]: 2 cycles
  Total is 12 cycles, or 6 (the no-bank-conflicts time) x 2 (the bank conflict ways)
Scan Bank Conflicts (1)
  Look at step 2 (scale 2): 4-way conflicts
    1. threads 0-3 read s[a]: 2 cycles
    2. threads 4-7 read s[a]: 2 cycles
    3. threads 8-11 read s[a]: 2 cycles
    4. threads 12-15 read s[a]: 2 cycles
    The read of s[b] and the update of s[a] are serialized the same way
  Total is 24 cycles, or 6 (the no-bank-conflicts time) x 4 (the bank conflict ways)
  Hence for all steps: 6 * (2+4+4+4+2+1) = 102 cycles
Scan Bank Conflicts (2)
  It’s much worse with bigger trees
  A full binary tree with 128 leaf nodes (only the last 6 iterations shown: root and 5 levels below)
  Cost for the whole tree: 12*2 + 6*(4+8+8+4+2+1) = 186 cycles
  48 cycles if there were no bank conflicts: 12*1 + (6*6)
  Note: two warps are needed for the first step, hence the 12 * 2; after step 1 only one warp is active
Scan Bank Conflicts (3)
  A full binary tree with 512 leaf nodes (only the last 6 iterations shown: root and 5 levels below)
  Cost for the whole tree: 48*2 + 24*4 + 12*8 + 6*(16+16+8+4+2+1) = 570 cycles
  120 cycles if there were no bank conflicts
Fixing Scan Bank Conflicts
  Insert padding every NUM_BANKS elements
      const int LOG_NUM_BANKS = 4;  // 16 banks
      int tid = threadIdx.x;
      int s = 1;
      // Traversal from leaves up to root
      for (int d = n>>1; d > 0; d >>= 1)
      {
          __syncthreads();
          if (tid < d)
          {
              int a = s*(2*tid);
              int b = s*(2*tid+1);
              a += (a >> LOG_NUM_BANKS);  // insert pad word
              b += (b >> LOG_NUM_BANKS);  // insert pad word
              shared[a] += shared[b];
          }
          s *= 2;  // double the stride for the next level
      }
Fixing Scan Bank Conflicts
  A full binary tree with 64 leaf nodes
  No more bank conflicts
  However, there are ~8 cycles of addressing overhead for each s[a] += s[b] (8 cycles/iter. * 6 iter. = 48 extra cycles)
  So just barely worth the overhead on a small tree: 84 cycles vs. 102 with conflicts vs. 36 optimal
  Note: 2 shifts and 2 adds per s[a] += s[b] (1 shift and 1 add for each of the two addresses). At 2 cycles each for a 32-thread warp, that’s 8 cycles overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. So (6+8)*6 = 84 cycles
Fixing Scan Bank Conflicts
  A full binary tree with 128 leaf nodes (only the last 6 iterations shown: root and 5 levels below)
  No more bank conflicts!
  Significant performance win: 106 cycles vs. 186 with bank conflicts vs. 48 optimal
  Note: 1 shift and 1 add per address computation. At 2 cycles each for a 32-thread warp, that’s 8 cycles overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. So (6+8)*7 = 98 cycles
Fixing Scan Bank Conflicts
  A full binary tree with 512 leaf nodes (only the last 6 iterations shown: root and 5 levels below)
  Wait, we still have bank conflicts
  Improved: 304 cycles vs. 570 with bank conflicts vs. 120 optimal
  Note: 1 shift and 1 add per address computation. At 2 cycles each for a 32-thread warp, that’s 8 cycles overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. But we still have 2-way bank conflicts on 4 tree levels (out of 9 total). So (6+8)*5 + (12+8)*4 = 150 cycles
Why are there bank conflicts?
  1st-level padding adds one element every 16, so the original address a becomes a + a / 16
  Recall that threads use a stride s = 2^n:
      int a = s*(2*tid);
      int b = s*(2*tid+1);
  So adjacent threads will try to access:
    thread i:   k * 2^n
    thread i+1: (k+2) * 2^n
  With our padding these become:
    k * 2^n + k * 2^n / 16
    (k+2) * 2^n + (k+2) * 2^n / 16
  What happens when n = 7? You are going over 16 pad words ;)
    k * 128 + k * 8
    (k+2) * 128 + (k+2) * 8
  Use k = 0, for example:
    0 is in bank 0
    256 + 16 = 272 is in bank 0
Fixing Scan Bank Conflicts
  It’s possible to remove all bank conflicts: just do multi-level padding
  Example: two-level padding
      const int LOG_NUM_BANKS = 4;  // 16 banks on G80
      int tid = threadIdx.x;
      int s = 1;
      // Traversal from leaves up to root
      for (int d = n>>1; d > 0; d >>= 1)
      {
          __syncthreads();
          if (tid < d)
          {
              int a = s*(2*tid);
              int b = s*(2*tid+1);
              int offset = (a >> LOG_NUM_BANKS);           // first level
              a += offset + (offset >> LOG_NUM_BANKS);     // second level
              offset = (b >> LOG_NUM_BANKS);               // first level
              b += offset + (offset >> LOG_NUM_BANKS);     // second level
              temp[a] += temp[b];
          }
          s *= 2;  // double the stride for the next level
      }
  Note: a and b are calculated so that both offset and offset >> LOG_NUM_BANKS are added to them
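The difference between the two padding schemes shows up on the addresses 0 and 256 used in the stride-128 case. A small C sketch, assuming 16 banks and the hypothetical helper names `pad1`, `pad2` and `bank_of`:

```c
#include <assert.h>

#define LOG_NUM_BANKS 4
#define NUM_BANKS (1 << LOG_NUM_BANKS)

/* One-level padding: one extra word per NUM_BANKS words. */
int pad1(int a) { return a + (a >> LOG_NUM_BANKS); }

/* Two-level padding: additionally one extra word per NUM_BANKS pad words. */
int pad2(int a)
{
    assert(a >= 0);
    int offset = a >> LOG_NUM_BANKS;
    return a + offset + (offset >> LOG_NUM_BANKS);
}

/* Bank that a (padded) word address falls into. */
int bank_of(int padded_address) { return padded_address % NUM_BANKS; }
```

With one level, 256 becomes 272 and stays in bank 0 together with address 0; with two levels it becomes 273 and moves to bank 1, at the price of the extra shift and add per address.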
Fixing Scan Bank Conflicts
  A full binary tree with 512 leaf nodes (only the last 6 iterations shown: root and 5 levels below)
  No bank conflicts, but an extra cycle of overhead per address calculation
  Not worth it: 440 cycles vs. 304 with 1-level padding
    With 1-level padding, bank conflicts only occur in warp 0
    Very small remaining cost due to bank conflicts
    Removing them hurts all other warps
  Note: 2 shifts and 1 add per address computation. At 2 cycles each for a 32-thread warp, that’s 12 cycles overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. (6+12)*9 = 162 cycles
Large Arrays
  So far: the array can be processed by a single block (1024 elements)
  Larger arrays?
    Divide into blocks
    Scan each with a block of threads, producing partial scans
    Scan the partial scans
    Add the corresponding scan result back to all elements of each block
  See Scan Large Array in the SDK
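The block-wise scheme can be sketched sequentially. This is an illustration only, not the SDK code: `BLOCK`, `scan_block` and `scan_large` are hypothetical names, the block size is kept tiny for readability, and the sketch assumes at most BLOCK blocks so that the per-block totals fit in a single block.

```c
#include <assert.h>

#define BLOCK 4   /* illustrative block size; a real kernel uses far more */

/* Exclusive scan of one block; returns the block's total. */
static int scan_block(int *out, const int *in, int len)
{
    int acc = 0;
    for (int i = 0; i < len; ++i) {
        out[i] = acc;
        acc += in[i];
    }
    return acc;
}

/* Exclusive scan of a larger array (out must not alias in), assuming
   n <= BLOCK * BLOCK so the block totals fit in one block. */
void scan_large(int *out, const int *in, int n)
{
    assert(n >= 0 && n <= BLOCK * BLOCK);
    int sums[BLOCK] = {0}, offsets[BLOCK] = {0};
    int nblocks = (n + BLOCK - 1) / BLOCK;

    /* 1. Scan each block independently and record its total. */
    for (int b = 0; b < nblocks; ++b) {
        int start = b * BLOCK;
        int len = (n - start < BLOCK) ? n - start : BLOCK;
        sums[b] = scan_block(out + start, in + start, len);
    }
    /* 2. Scan the per-block totals. */
    scan_block(offsets, sums, nblocks);
    /* 3. Add each block's offset to all of its elements. */
    for (int i = 0; i < n; ++i)
        out[i] += offsets[i / BLOCK];
}
```

On a GPU, steps 1 and 3 would each be one kernel launch over all blocks, and step 2 would reuse the single-block scan (recursively, if there are more totals than one block can hold).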
Application: Stream Compaction
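The figure for this slide did not survive extraction. One common way to build stream compaction on top of scan is flag, scan, scatter; the C sketch below follows that formulation under assumed names (`compact`, `is_nonzero`, `MAX_N`) and is not taken from the slides.

```c
#include <assert.h>

#define MAX_N 1024   /* assumed bound for this sketch */

/* Keep the elements of in[] for which keep(x) is non-zero, preserving order.
   Returns the number of elements written to out[]. */
int compact(int *out, const int *in, int n, int (*keep)(int))
{
    assert(n >= 0 && n <= MAX_N);
    int flag[MAX_N], pos[MAX_N];

    /* 1. Flag the elements to keep (a parallel map on a GPU). */
    for (int i = 0; i < n; ++i)
        flag[i] = keep(in[i]) ? 1 : 0;

    /* 2. An exclusive scan of the flags gives each survivor's output index. */
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        pos[i] = acc;
        acc += flag[i];
    }

    /* 3. Scatter: every kept element writes to its own slot. */
    for (int i = 0; i < n; ++i)
        if (flag[i])
            out[pos[i]] = in[i];

    return acc;
}

/* Example predicate (hypothetical): keep non-zero values. */
int is_nonzero(int x) { return x != 0; }
```

Because the scan assigns every surviving element a distinct index, the scatter step has no write conflicts and can run fully in parallel.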
Application: Radix Sort
  f(i) = how many values with a ‘0’ LSB have been seen before position i? Let’s call these values Falses
    f is the exclusive scan of e, where e(i) = 1 if element i is a False and 0 otherwise
  How many Falses are there?
    We did an exclusive scan, so f(max) counts all but the last element
    So we have to add e(max) to f(max): totalFalses = e(max) + f(max)
  f(i) can be used as the position at which to place a False in the output array
  We also need positions for the non-Falses: i - f(i) + totalFalses
    + totalFalses: non-Falses are placed after all the Falses
    i: one after the other, in their original order
    - f(i): but ignoring any Falses in between
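One pass of this split can be written out in C. A sketch under stated assumptions: the helper name `split_by_bit`, the bound `MAX_N` and the temporary `t` are illustrative, and the scan is done sequentially here rather than with the parallel algorithm.

```c
#include <assert.h>

#define MAX_N 1024   /* assumed bound for this sketch */

/* One pass of the split: stable partition by bit `bit`, zeros (Falses) first. */
void split_by_bit(unsigned *out, const unsigned *in, int n, int bit)
{
    assert(n > 0 && n <= MAX_N);
    int e[MAX_N], f[MAX_N];

    /* e(i) = 1 for a False (the bit is 0). */
    for (int i = 0; i < n; ++i)
        e[i] = ((in[i] >> bit) & 1u) ? 0 : 1;

    /* f = exclusive scan of e. */
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        f[i] = acc;
        acc += e[i];
    }

    int total_falses = e[n - 1] + f[n - 1];

    for (int i = 0; i < n; ++i) {
        int t = i - f[i] + total_falses;   /* destination for a non-False */
        int d = e[i] ? f[i] : t;           /* Falses go to f(i) */
        out[d] = in[i];
    }
}
```

Repeating the pass for bit 0, 1, 2, … while swapping the input and output buffers sorts the keys, since each pass is stable.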
Using Streams to Overlap Kernels with Data Transfers
  A stream is a queue of ordered CUDA requests
  By default all CUDA requests go to the same stream
  Create a stream: cudaStreamCreate(cudaStream_t *stream)
Overlapping Kernels
      cudaMemcpyAsync(dA, hA, sizeA, cudaMemcpyHostToDevice, streamA);
      cudaMemcpyAsync(dB, hB, sizeB, cudaMemcpyHostToDevice, streamB);
      Kernel<<<100, 512, 0, streamA>>>(dAo, dA, sizeA);
      Kernel<<<100, 512, 0, streamB>>>(dBo, dB, sizeB);
      cudaMemcpyAsync(hAo, dAo, sizeA, cudaMemcpyDeviceToHost, streamA);
      cudaMemcpyAsync(hBo, dBo, sizeB, cudaMemcpyDeviceToHost, streamB);
      cudaDeviceSynchronize();  // wait for both streams; replaces the deprecated cudaThreadSynchronize()