Parallel Prefix Sum (Scan) GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from articles taken from GPU Gems III
Scan Definition: The all-prefix-sums operation takes a binary associative operator ⊕ with identity I and an array of n elements [a_0, a_1, ..., a_{n-1}], and returns the array [I, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-2})]. Example: [ ] [ ]
Sequential Scan
    out[0] = 0;
    for (k = 1; k < n; k++)
        out[k] = in[k-1] + out[k-1];
Performs n-1 adds for an array of length n. Work complexity is O(n).
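The sequential loop above can be wrapped in a small C function to make the exclusive-scan behaviour concrete. This is a minimal sketch: the function name `exclusive_scan` and the sample input are illustrative choices, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive sum scan: out[k] = in[0] + ... + in[k-1], with out[0] = 0.
   `in` and `out` must not overlap. The name is a hypothetical helper. */
void exclusive_scan(const int *in, int *out, size_t n)
{
    assert(n > 0);
    out[0] = 0;                          /* identity element for addition */
    for (size_t k = 1; k < n; k++)
        out[k] = in[k - 1] + out[k - 1]; /* one add per remaining element */
}
```

With the input {3, 1, 7, 0, 4, 1, 6, 3} this produces {0, 3, 4, 11, 11, 15, 16, 22}: each output is the sum of everything strictly before it.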
Parallel Scan
Performs O(n log2 n) addition operations. Assumes there are as many processors as data elements.
    for (d = 1; d <= log2 n; d++)
        for all k in parallel
            if (k >= 2^(d-1))
                x[k] = x[k - 2^(d-1)] + x[k]
Parallel Scan (n = 8)
    Input:  x0         x1         x2         x3         x4         x5         x6         x7
    d = 1:  ∑(x0..x0)  ∑(x0..x1)  ∑(x1..x2)  ∑(x2..x3)  ∑(x3..x4)  ∑(x4..x5)  ∑(x5..x6)  ∑(x6..x7)
    d = 2:  ∑(x0..x0)  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x1..x4)  ∑(x2..x5)  ∑(x3..x6)  ∑(x4..x7)
    d = 3:  ∑(x0..x0)  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)  ∑(x0..x7)
Parallel Scan: What's the problem with this algorithm for the GPU?
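Parallel Scan
The GPU needs to double-buffer the array, because an in-place update can overwrite a value before another thread has read it:
    for (d = 1; d <= log2 n; d++)
        for all k in parallel
            if (k >= 2^(d-1))
                x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
            else
                x[out][k] = x[in][k]
        swap(in, out)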
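The double-buffered scheme can be simulated on the CPU by running each "for all k in parallel" step as an ordinary loop that only reads from one buffer and only writes to the other. The sketch below assumes a caller-supplied scratch buffer; the name `naive_inclusive_scan` is hypothetical. Note that this algorithm yields an inclusive scan (element k includes x[k] itself).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive O(n log n) scan with two buffers, simulated sequentially.
   x holds the input and receives the inclusive scan; scratch has n elements. */
void naive_inclusive_scan(int *x, int *scratch, size_t n)
{
    assert(n > 0);
    int *src = x, *dst = scratch;
    for (size_t offset = 1; offset < n; offset *= 2) {  /* offset = 2^(d-1) */
        for (size_t k = 0; k < n; k++)                  /* one "thread" per k */
            dst[k] = (k >= offset) ? src[k - offset] + src[k] : src[k];
        int *t = src; src = dst; dst = t;               /* swap in and out */
    }
    if (src != x)                                       /* result ended in scratch */
        memcpy(x, src, n * sizeof *x);
}
```

Because every pass reads only `src`, the order in which the k iterations run does not matter, which is the property the GPU version relies on.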
Issues with Current Implementation? Only works for 512 elements (one thread block). The GPU version has a work complexity of O(n log2 n), whereas the CPU version is O(n).
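To put numbers on the work gap, the adds can simply be counted. The sketch below is an illustration only; both function names are hypothetical. In the naive scan, the pass with offset 2^(d-1) performs one add for every k >= 2^(d-1), that is, n - 2^(d-1) adds.

```c
#include <assert.h>
#include <stddef.h>

/* Adds performed by the naive parallel scan over all of its passes. */
size_t naive_scan_adds(size_t n)
{
    assert(n > 0);
    size_t adds = 0;
    for (size_t offset = 1; offset < n; offset *= 2)
        adds += n - offset;     /* elements with k >= offset each do one add */
    return adds;
}

/* Adds performed by the sequential loop (k = 1 .. n-1). */
size_t sequential_scan_adds(size_t n)
{
    assert(n > 0);
    return n - 1;
}
```

For n = 512 the naive scan performs 4097 adds against 511 for the sequential loop, roughly a factor of log2 n more work.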
A Work-Efficient Parallel Scan: The goal is a parallel scan that performs O(n) work instead of O(n log2 n). Solution: balanced trees. Build a balanced binary tree on the input data and sweep it to and from the root. A binary tree with n leaves has d = log2 n levels, and each level d has 2^d nodes. One add is performed per node, so a single traversal of the tree performs O(n) adds.
Balanced Binary Trees [Figure: tree for n = 8, with levels d = 0 (root) through d = 3 (leaves)] Two-Phase Algorithm: 1. Up-sweep phase 2. Down-sweep phase
The Up-Sweep Phase
    for (d = 0; d <= log2 n - 1; d++)
        for all k = 0; k < n-1; k += 2^(d+1) in parallel
            x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
Where have we seen this before?
The Down-Sweep Phase
    x[n-1] = 0;
    for (d = log2 n - 1; d >= 0; d--)
        for all k = 0; k < n-1; k += 2^(d+1) in parallel
            t = x[k + 2^d - 1]
            x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
            x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
Trace for n = 8:
    After up-sweep:  x0  ∑(x0..x1)  x2         ∑(x0..x3)  x4         ∑(x4..x5)  x6         ∑(x0..x7)
    Zero the last:   x0  ∑(x0..x1)  x2         ∑(x0..x3)  x4         ∑(x4..x5)  x6         0
    d = 2:           x0  ∑(x0..x1)  x2         0          x4         ∑(x4..x5)  x6         ∑(x0..x3)
    d = 1:           x0  0          x2         ∑(x0..x1)  x4         ∑(x0..x3)  x6         ∑(x0..x5)
    d = 0:           0   x0         ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)
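The two phases can be combined into one in-place routine. The following is a sequential C sketch of the up-sweep and down-sweep, assuming n is a power of two; the function name `work_efficient_scan` is hypothetical, and the inner loops stand in for the "for all k in parallel" steps.

```c
#include <assert.h>
#include <stddef.h>

/* In-place exclusive sum scan using an up-sweep followed by a down-sweep.
   Assumes n is a power of two. */
void work_efficient_scan(int *x, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep: build partial sums in place; stride = 2^d. */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t k = 0; k < n; k += 2 * stride)
            x[k + 2 * stride - 1] += x[k + stride - 1];

    x[n - 1] = 0;   /* replace the total with the identity */

    /* Down-sweep: push prefix sums back towards the leaves. */
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t k = 0; k < n; k += 2 * stride) {
            int t = x[k + stride - 1];
            x[k + stride - 1] = x[k + 2 * stride - 1];
            x[k + 2 * stride - 1] += t;
        }
}
```

Each phase performs n - 1 adds in total, so the whole scan stays at O(n) work. For the input {3, 1, 7, 0, 4, 1, 6, 3} the array ends up as {0, 3, 4, 11, 11, 15, 16, 22}.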
Current Limitations: Array sizes are limited to 1024 elements. Array sizes must be a power of two.
Alterations for Arbitrary Sized Arrays
1. Divide the large array into blocks that can each be scanned by a single thread block.
2. Scan each block and write the total sum of each block to another array of block sums.
3. Scan the block sums, generating an array of block increments.
4. Add each block increment to every element of its respective block.
[Figure: initial array of values, scans of blocks 0 to 3, block sums, scan of block sums, final array of scanned values]
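The four steps can be sketched sequentially in C. This is an illustration under simplifying assumptions: each per-block scan is an ordinary loop rather than a thread block, the block sums are scanned in one pass (on a GPU that scan may itself need to be split into blocks), and the names `scan_block` and `blocked_exclusive_scan` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Exclusive scan of one block; returns the block's total sum. */
static int scan_block(const int *in, int *out, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i++) { out[i] = sum; sum += in[i]; }
    return sum;
}

/* Exclusive scan of an array of any length n, processed in blocks. */
void blocked_exclusive_scan(const int *in, int *out, size_t n, size_t block)
{
    assert(n > 0 && block > 0);
    size_t nblocks = (n + block - 1) / block;
    int *sums = malloc(nblocks * sizeof *sums);
    int *incr = malloc(nblocks * sizeof *incr);
    if (!sums || !incr) { free(sums); free(incr); abort(); }

    /* Steps 1 and 2: scan each block and record its total. */
    for (size_t b = 0; b < nblocks; b++) {
        size_t start = b * block;
        size_t len = (n - start < block) ? n - start : block;
        sums[b] = scan_block(in + start, out + start, len);
    }
    /* Step 3: scan the block sums to get per-block increments. */
    scan_block(sums, incr, nblocks);
    /* Step 4: add each block's increment to all of its elements. */
    for (size_t i = 0; i < n; i++)
        out[i] += incr[i / block];

    free(sums);
    free(incr);
}
```

The last block may be shorter than the others, which is how lengths that are not a multiple of the block size are handled here.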
Applications: Stream Compaction, Summed-Area Tables, Radix Sort
Stream Compaction Definition: Extracts the elements of interest from an array and places them contiguously in a new array. Uses: collision detection, sparse matrix compression. [Figure: an input array of lettered elements compacted into a shorter output array]
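Stream Compaction
[Figure: input array ABADDECFB with the elements to keep shaded gray, and the compacted output]
We want to preserve the gray elements.
1. Set a '1' in each gray input (and a '0' in every other input).
2. Scan.
3. Scatter the gray inputs to the output, using the scan result as the scatter address.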
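The flag, scan, scatter sequence can be sketched in C as follows. The function name `compact` is hypothetical, and the flags passed in (which elements count as "gray") are supplied by the caller; the slides do not say which letters are shaded, so any flags used with this sketch are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Copies in[i] to out for every i with flags[i] != 0, preserving order.
   Returns the number of elements kept. out must hold at least that many. */
size_t compact(const char *in, const int *flags, char *out, size_t n)
{
    assert(n > 0);
    size_t *addr = malloc(n * sizeof *addr);
    if (!addr) abort();

    /* Exclusive scan of the flags gives each kept element its output index. */
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        addr[i] = sum;
        sum += (flags[i] != 0);
    }
    /* Scatter: only flagged elements write, each to its own address. */
    for (size_t i = 0; i < n; i++)
        if (flags[i])
            out[addr[i]] = in[i];

    free(addr);
    return sum;
}
```

For example, with input "ABADDECFB" and the assumed flags {1, 1, 0, 0, 0, 0, 1, 0, 1}, the scan yields addresses 0, 1, 2 and 3 for the four kept elements and the output is "ABCB".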
Summed Area Tables Definition: A 2D table generated from an input image in which each entry stores the sum of all pixels between the entry location and the lower-left corner of the input image. Uses: Can be used to apply filters of different widths at every pixel in the image in constant time per pixel.
Summed Area Tables
1. Apply a sum scan to all rows of the image.
2. Transpose the image.
3. Apply a sum scan to all rows of the result.
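The row-scan, transpose, row-scan recipe can be sketched in C on a small row-major image. Several things here are assumptions rather than part of the slides: the scans are inclusive (each entry includes its own pixel), element (0, 0) is treated as the origin corner, and a final transpose is added so the table comes back in the original layout. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive sum scan along every row of a rows x cols row-major image. */
static void scan_rows(int *img, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 1; c < cols; c++)
            img[r * cols + c] += img[r * cols + c - 1];
}

/* dst (cols x rows) = transpose of src (rows x cols). */
static void transpose(const int *src, int *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* Overwrites img with its summed-area table; tmp holds rows*cols ints. */
void summed_area_table(int *img, int *tmp, size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    scan_rows(img, rows, cols);        /* step 1: scan rows */
    transpose(img, tmp, rows, cols);   /* step 2: transpose */
    scan_rows(tmp, cols, rows);        /* step 3: scan rows of the result */
    transpose(tmp, img, cols, rows);   /* extra step: restore the layout */
}
```

For the 2 x 3 image {1, 2, 3; 4, 5, 6} this gives {1, 3, 6; 5, 12, 21}; the last entry, 21, is the sum of all six pixels.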
Radix Sort [Figure: the initial array followed by its contents after each of five passes]
Radix Sort Using Scan
For the input array:
    b = least significant bit of each key
    e = 1 for each false sort key (b = 0), 0 otherwise
    f = scan of e
    Total Falses = e[n-1] + f[n-1]
    t = index - f + Total Falses
    d = b ? t : f
Scatter the input using d as the scatter address.
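The split step can be written out in C using the same symbols (b, e, f, t, d). This is a sequential sketch, not a GPU implementation; the function names `split_by_bit` and `radix_sort` and the use of 32-bit unsigned keys are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One split pass: keys whose chosen bit is 0 come first, order preserved. */
void split_by_bit(const uint32_t *in, uint32_t *out, size_t n, unsigned bit)
{
    assert(n > 0 && bit < 32);
    size_t *f = malloc(n * sizeof *f);
    if (!f) abort();

    size_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        size_t e = !((in[i] >> bit) & 1u);  /* e = 1 for "false" keys */
        f[i] = sum;                         /* f = exclusive scan of e */
        sum += e;
    }
    size_t total_falses = sum;              /* equals e[n-1] + f[n-1] */

    for (size_t i = 0; i < n; i++) {
        uint32_t b = (in[i] >> bit) & 1u;
        size_t t = i - f[i] + total_falses; /* address if the bit is 1 */
        size_t d = b ? t : f[i];            /* final scatter address */
        out[d] = in[i];
    }
    free(f);
}

/* Sorts keys by their lowest `bits` bits, one split per bit, LSB first. */
void radix_sort(uint32_t *keys, uint32_t *tmp, size_t n, unsigned bits)
{
    for (unsigned bit = 0; bit < bits; bit++) {
        split_by_bit(keys, tmp, n, bit);
        memcpy(keys, tmp, n * sizeof *keys);
    }
}
```

Because each split keeps the relative order of keys with equal bits, repeating it from the least significant bit upward yields a full sort. For example, one split of {4, 7, 2, 6, 3, 5, 1, 0} on bit 0 gives {4, 2, 6, 0, 7, 3, 5, 1}.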
Radix Sort Using GPU: A partial radix sort is performed once for each block. A scan needs to be performed once for each bit. The partial sorts are then sorted together using bitonic sort.
References: These slides are directly based upon the following resource and are meant for educational purposes only. GPU Gems III, Chapter 39, "Parallel Prefix Sum (Scan) with CUDA", Mark Harris, Shubhabrata Sengupta, John D. Owens.