LECTURE 3: INTRODUCTION TO PARALLEL COMPUTING USING CUDA Ken Domino, Domem Technologies May 16, 2011 IEEE Boston Continuing Education Program
Announcements Course website updates: PDF http links fixed. Lecture 3 – Counting 6’s – Ocelot working. Installation guide.
GPU is a coprocessor for the CPU (figure: CPU as master, GPU as slave).
CUDA Runtime API cudaMalloc(&d, len) (figure: device array d, elements 0 … len - 1)
CUDA Runtime API cudaMemcpy(d, …, …, cudaMemcpyHostToDevice) (figure: host array, elements 0 … len - 1, copied into device array d)
CUDA Runtime API kernel<<<nblocks, nthreads>>>(d); __global__ void kernel(int* d) { int idx = blockIdx.x * blockDim.x + threadIdx.x; d[idx] = idx; } (figure: each thread writes one element of device array d, elements 0 … len - 1)
CUDA Runtime API cudaMemcpy(…, d, …, cudaMemcpyDeviceToHost) (figure: device array d, elements 0 … len - 1, copied back to the host)
CUDA Runtime API cudaFree(d) (figure: device array d released)
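Putting the five calls above together, here is a minimal, self-contained host program; a sketch only (the length len, the launch configuration, and the printout are illustrative, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *d)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d[idx] = idx;                                   // each thread writes its global index
}

int main()
{
    const int len = 8;                              // illustrative array length
    int h[len];                                     // host copy of the array
    for (int i = 0; i < len; ++i) h[i] = 0;
    int *d = 0;                                     // device copy of the array

    cudaMalloc(&d, len * sizeof(int));                              // allocate GPU global memory
    cudaMemcpy(d, h, len * sizeof(int), cudaMemcpyHostToDevice);    // host -> device
    kernel<<<2, len / 2>>>(d);                                      // 2 blocks, len/2 threads per block
    cudaMemcpy(h, d, len * sizeof(int), cudaMemcpyDeviceToHost);    // device -> host
    cudaFree(d);                                                    // release GPU memory

    for (int i = 0; i < len; ++i) printf("%d ", h[i]);              // prints 0 1 2 ... len-1
    printf("\n");
    return 0;
}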
CUDA Thread Identification … kernel<<<nblocks, nthreads>>>(a); … __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = idx; } __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = threadIdx.x; } __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = blockIdx.x; } Built-in Variables threadIdx, blockDim, blockIdx, gridDim are defined when the kernel executes and are used to identify which thread is executing. (Execution Configuration)
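To make the three variants concrete, here is a hedged sketch (the kernels are renamed so they can coexist in one file; the result comments assume a launch of <<<2, 3>>>, i.e. 2 blocks of 3 threads, as in the warp example later):

// Assumed launch for all three: some_kernel<<<2, 3>>>(a);
__global__ void kernel_global_idx(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = idx;              // a = {0, 1, 2, 3, 4, 5}
}

__global__ void kernel_thread_idx(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;      // a = {0, 1, 2, 0, 1, 2}
}

__global__ void kernel_block_idx(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;       // a = {0, 0, 0, 1, 1, 1}
}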
Example: Counting 6’s in CUDA Host (CPU) code int main() { int size = 300, bsize = 100; int * h = (int*)malloc(size * sizeof(int)); for (int i = 0; i < size; ++i) h[i] = i % 10; int * d_in, * d_out; int blocks = size/bsize; int threads_per_block = 1; int rv1 = cudaMalloc(&d_in, size*sizeof(int)); int rv2 = cudaMalloc(&d_out, blocks*sizeof(int)); int rv3 = cudaMemcpy(d_in, h, size*sizeof(int), cudaMemcpyHostToDevice); c6<<<blocks, threads_per_block>>>(d_in, d_out, bsize); cudaThreadSynchronize(); int rv4 = cudaGetLastError(); int * r = (int*)malloc(blocks * sizeof(int)); int rv5 = cudaMemcpy(r, d_out, blocks*sizeof(int), cudaMemcpyDeviceToHost); int sum = 0; for (int i = 0; i < blocks; ++i) sum += r[i]; printf("Result = %d\n", sum); return 0; } Declare size of array, chunk size Declare CPU copy of array, initialize Declare device copies of input and output Declare blocks per grid, threads per block Allocate GPU global memory for input/output Copy host memory to GPU memory Call GPU, wait for threads to complete, get errors Copy from GPU to CPU Sum up per block count of number of 6’s
Example: Counting 6’s in CUDA Kernel (GPU) code __global__ void c6(int * d_in, int * d_out, int size) { int sum = 0; for (int i=0; i < size; ++i) { int val = d_in[i + blockIdx.x * size]; if (val == 6) sum++; } d_out[blockIdx.x] = sum; } Declare parameters to kernel Initial value of number of 6’s in the chunk Compute number of 6’s in chunk Set global memory to the sum
Optimizations To achieve better performance: Choose good kernel launch configuration Use constant and shared memories Avoid bank conflicts in shared memory use Use coalesced accesses to global memory Avoid branch divergence in a warp
Warps kernel<<<2, 3>>>(a); __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = idx; } 2 blocks, 3 threads per block; 2 warps, 32 threads per warp; 29 threads per warp unused. Under the covers, each block is divided into warps of 32 threads when it is scheduled onto a multiprocessor.
Want to maximize the number of active threads on all multiprocessors at all times. Occupancy is defined as the ratio of the number of resident warps to the maximum number of resident warps. It is a function of the warp residency, the registers in use, the amount of shared memory in use, and the type of GPU card. Occupancy
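A worked example (assuming the Fermi-class limits quoted on the next slide, i.e. at most 48 resident warps per SM): a block of 256 threads contains 256/32 = 8 warps, so if 6 such blocks are resident on one SM, 6 × 8 = 48 warps are resident and occupancy = 48/48 = 100%; a single resident block of 32 threads gives only 1/48 ≈ 2% occupancy.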
kernel<<<2, 3>>>(a); __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = idx; } 2 blocks, 3 threads per block. Maximum number of resident blocks per multiprocessor = 8. Maximum number of resident warps per multiprocessor = 48. Maximum number of resident threads per multiprocessor = 1536. Number of blocks = 2, one warp each, so possible occupancy = 2/48 ≈ 4%; the actual occupancy per SM is only 1/48 ≈ 2% because the two blocks were scheduled on two different SMs. Higher occupancy = better utilization of GPU.
Occupancy Use the CUDA Occupancy Calculator (an Excel spreadsheet) to compute potential occupancy.
Occupancy Use the CUDA Compute Visual Profiler to measure real occupancy.
Occupancy Generally, choose 32 threads per block, because 32 threads map onto exactly one warp, or choose a multiple of 32; other block sizes leave part of a warp unused.
GPU Memory Streaming Multiprocessors contain registers (per thread), shared memory (per block), and an L1 cache; the device also has an L2 cache and global memory (per grid). Micikevicius, P. Fundamental Optimizations. Supercomputing, New Orleans, Nov 14, 2010.
GPU Memory Memory access times depend on the memory class. Sun, W. and Ma, Z. Count Sort for GPU Computing. International Conference on Parallel and Distributed Systems (2009), IEEE Computer Society.
Use shared memory to increase the speed of computation because the latency for shared memory is very low. Shared memory
Counting 6’s using shared memory Kernel (GPU) code __global__ void c6d(int * d_in, int * d_out) { __shared__ int sum; if (threadIdx.x == 0) sum = 0; __syncthreads(); int val = d_in[threadIdx.x + blockIdx.x * blockDim.x]; if (val == 6) atomicAdd(&sum, 1); __syncthreads(); if (threadIdx.x == 0) d_out[blockIdx.x] = sum; } Declare parameters to kernel Initial value of number of 6’s in the chunk Synchronize to make sure all threads computed addition to sum Set global memory to the sum Compute number of 6’s in chunk
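Because this kernel uses one thread per element of the chunk (rather than one thread per block as before), the host launch must change as well. A minimal sketch, assuming the same host variables as the earlier Counting 6's host code (the helper name launch_c6d is hypothetical):

// Hypothetical helper: d_in and d_out are assumed to be already allocated on
// the device, exactly as in the earlier host code.
void launch_c6d(int *d_in, int *d_out, int size, int bsize)
{
    int blocks = size / bsize;          // e.g. 300 / 100 = 3 blocks
    int threads_per_block = bsize;      // one thread per element of each chunk
    c6d<<<blocks, threads_per_block>>>(d_in, d_out);
    cudaThreadSynchronize();            // wait for all blocks to finish (2011-era API)
}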
Shared memory paradigm Partition data into subsets that fit into shared memory
Handle each data subset with one thread block Shared memory paradigm
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism (prefetching) Shared memory paradigm
Perform the computation on the subset from shared memory Shared memory paradigm
Copy the result from shared memory back to global memory Shared memory paradigm
Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // each thread loads two elements from global memory int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } }
Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // what are the bandwidth requirements of this kernel? int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } } Two loads
Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // How many times does this kernel load input[i]? int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } } // once by thread i // again by thread i+1
Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // Idea: eliminate redundancy by sharing data int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } }
Example – shared variables // optimized version of adjacent difference __global__ void adj_diff(int *result, int *input) { // shorthand for threadIdx.x int tx = threadIdx.x; // allocate a __shared__ array, one element per thread __shared__ int s_data[BLOCK_SIZE]; // each thread reads one element to s_data unsigned int i = blockDim.x * blockIdx.x + tx; s_data[tx] = input[i]; // avoid race condition: ensure all loads // complete before continuing __syncthreads(); if(tx > 0) result[i] = s_data[tx] - s_data[tx-1]; else if(i > 0) { // handle thread block boundary result[i] = s_data[tx] - input[i-1]; } }
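A hedged host-side sketch of how adj_diff might be launched (the helper name run_adj_diff is hypothetical, BLOCK_SIZE is an assumed definition that must match the kernel's __shared__ array, and the input length n is assumed to be a multiple of BLOCK_SIZE so every thread owns one element):

#define BLOCK_SIZE 256                      // assumed value; must match the kernel

void run_adj_diff(const int *h_in, int *h_out, int n)
{
    int *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    adj_diff<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_out, d_in);   // one thread per element

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}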
Shared memory is organized into 32 banks. Addresses in shared memory are interleaved across the banks in 4-byte quantities, accessible in 2 cycles per warp. Problem with Shared Memory: Bank conflict
Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. Problem with Shared Memory: Bank conflict NVIDIA NVIDIA CUDA™ Programming Guide Version 3.0, 2010.
Shared Memory: without Bank Conflict (figure captions: one access per bank; one access per bank, with shuffling; all threads access the same address (broadcast); partial broadcast and skipping some banks) NVIDIA NVIDIA CUDA™ Programming Guide Version 3.0, 2010.
Shared Memory: with Bank Conflict access more than one address per bank
Bank conflict #define NUM_BANKS 32 #define NUM_THREADS 32 __global__ void bad_access(int *out, int iters) { __shared__ int mem [NUM_BANKS * NUM_THREADS]; int idx = threadIdx.x; int min = idx * NUM_BANKS; int max = (idx + 1) * NUM_BANKS; int inc = 1; for (int j = min; j < max; j += inc) mem[j] = 0; for (int i = 0; i < iters; i++) for (int j = min; j < max; j += inc) mem[j]++; for (int j = min; j < max; j += inc) out[j] = mem[j]; } Bank conflict here
Bank conflict (figure: MEM[i] across Bank 1 … Bank 32, one row per thread). In the first iteration of for (int j = min; j < max; j += inc) mem[j] = 0; j = 0, 32, 64, …, 928, 960, 992 for threads 1, 2, 3, … 32, so all 32 threads hit the same bank.
Bank conflict (figure: MEM[i] across Bank 1 … Bank 32, one row per thread). In the second iteration of for (int j = min; j < max; j += inc) mem[j] = 0; j = 1, 33, 65, …, 929, 961, 993 for threads 1, 2, 3, … 32, so again all 32 threads hit the same bank.
Bank conflict (figure: MEM[i] across Bank 1 … Bank 32, one row per thread). In the third iteration of for (int j = min; j < max; j += inc) mem[j] = 0; j = 2, 34, 66, …, 930, 962, 994 for threads 1, 2, 3, … 32, so again all 32 threads hit the same bank.
Bank conflict fixed #define NUM_BANKS 32 #define NUM_THREADS 32 __global__ void good_access(int *out, int iters) { __shared__ int mem [NUM_BANKS * NUM_THREADS]; int idx = threadIdx.x; int min = idx; int max = blockDim.x * NUM_BANKS; int inc = NUM_BANKS; for (int j = min; j < max; j += inc) mem[j] = 0; for (int i = 0; i < iters; i++) for (int j = min; j < max; j += inc) mem[j]++; for (int j = min; j < max; j += inc) out[j] = mem[j]; } Bank conflict fixed
Bank conflict fixed (figure: MEM[i] across Bank 1 … Bank 32, one row per thread). In the first iteration of for (int j = min; j < max; j += inc) mem[j] = 0; j = 0, 1, 2, …, 29, 30, 31 for threads 1, 2, 3, … 32, so each thread hits a different bank.
Bank conflict fixed (figure: MEM[i] across Bank 1 … Bank 32, one row per thread). In the second iteration of for (int j = min; j < max; j += inc) mem[j] = 0; j = 32, 33, 34, …, 61, 62, 63 for threads 1, 2, 3, … 32, so each thread hits a different bank.
Bank conflict fixed (figure: MEM[i] across Bank 1 … Bank 32, one row per thread). In the third iteration of for (int j = min; j < max; j += inc) mem[j] = 0; j = 64, 65, 66, …, 93, 94, 95 for threads 1, 2, 3, … 32, so each thread hits a different bank.
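To see the cost of the conflict, the two kernels can be timed with CUDA events; a hedged sketch (assuming the bad_access and good_access kernels above, plus <cstdio>, and launching one block of NUM_THREADS threads):

void time_bank_access(void)
{
    int *d_out = 0;
    cudaMalloc(&d_out, NUM_BANKS * NUM_THREADS * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_bad = 0.0f, ms_good = 0.0f;

    cudaEventRecord(start, 0);
    bad_access<<<1, NUM_THREADS>>>(d_out, 1000);    // at each step all 32 threads hit the same bank
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_bad, start, stop);

    cudaEventRecord(start, 0);
    good_access<<<1, NUM_THREADS>>>(d_out, 1000);   // at each step the 32 threads hit 32 different banks
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_good, start, stop);

    printf("bad_access: %f ms, good_access: %f ms\n", ms_bad, ms_good);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
}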
Coalesced access Fast access to global memory for older GeForce GPUs (compute capability 1.0 to 1.3). For newer GPUs (compute capability 2.0), the strict coalescing rules no longer apply, because global memory accesses are cached. A half warp can access one 32- (or 64-, 128-) byte memory quantity in one transaction, if three conditions are met.
Coalesced access To coalesce, the global memory request for a half-warp must satisfy the following conditions: 1) The size of the words accessed by the threads must be 4, 8, or 16 bytes; 2) If the word size is 4 bytes, all 16 words must lie in the same 64-byte segment; if 8 bytes, all 16 words must lie in the same 128-byte segment; if 16 bytes, the first 8 words must lie in one 128-byte segment and the last 8 words in the following 128-byte segment; 3) Threads must access the words in sequence: the kth thread in the half-warp must access the kth word.
Global Memory: Coalesced Access (figure captions: perfectly coalesced; coalesced, with some threads skipping the LD/ST) NVIDIA NVIDIA CUDA™ Programming Guide Version 3.0, 2010.
Global Memory: Non-Coalesced Access (figure captions: non-consecutive addresses; starting address not aligned to 128 bytes; non-consecutive addresses, stride larger than one word) NVIDIA NVIDIA CUDA™ Programming Guide Version 3.0, 2010.
Coalesced access
(Figure: thread id vs. &data[].) Threads 0-15 access data[] in 4 byte quantities consecutively, from 0 to 63. Therefore, this half warp is coalesced. Threads 16-31 access data[] in 4 byte quantities consecutively, from 64 to 127. Therefore, this half warp is also coalesced.
Coalesced access
(Figure: thread id vs. &data[].) Threads 1 and 2 do not access data[] consecutively. Therefore, this half warp is NOT coalesced. Threads 17 and 18 do not access data[] consecutively. Therefore, this half warp is NOT coalesced either.
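As a further illustration (not from the slides), two hedged kernel sketches: the first makes each half-warp read 16 consecutive 4-byte words, satisfying the coalescing rules above; the second reads with a stride of two words, so on 1.0-1.3 hardware each half-warp access is split into separate transactions:

__global__ void coalesced_read(float *out, const float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = data[i];          // thread k reads the k-th consecutive word: coalesced
}

__global__ void strided_read(float *out, const float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = data[2 * i];      // stride of two words: not coalesced on 1.x hardware
}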
BASIC ALGORITHMS The REDUCE and SCAN primitives
+/ ×/ ⋁/ \ ×\ =\ Time warp… What do these APL statements do? op/ is a REDUCE; op\ is a SCAN
Definition: The REDUCE operation takes a binary operator ⊕ with identity I, and an ordered set [a_0, a_1, ..., a_{n-1}] of n elements, and returns the value (((a_0 ⊕ a_1) ⊕ ...) ⊕ a_{n-1}). For our discussions, consider only associative, commutative operators. We want to perform operations in any order. What is REDUCE? Blelloch, G. Prefix Sums and Their Applications, 1990
Definition: The SCAN operation takes a binary operator ⊕, and an ordered set of n elements [a_0, a_1, ..., a_{n-1}], and returns the ordered set [a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-1})]. AKA “inclusive prefix”, “all-prefix-sums” (addition) Let’s only consider associative binary operators… What is SCAN? ngAPLwithAPLX.pdf
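For example, with ⊕ = + and the input [3, 1, 7, 0, 4, 1, 6, 3], REDUCE returns 25 and SCAN returns [3, 4, 11, 11, 15, 16, 22, 25].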
If ⊕ is addition, then REDUCE computes the summation. If ⊕ is the dyadic min function (i.e., min(x,y) = x > y ? y : x), then REDUCE finds the minimum of all the numbers. SCAN is used in many sorting algorithms. Why are REDUCE and SCAN important? Blelloch, G. Prefix Sums and Their Applications, 1990
Sequential implementation of REDUCE
Result = I
for i := 1 to n do
  Result = Result ⊕ x[i]
Naïve parallel implementation of REDUCE Blelloch, G. Prefix Sums and Their Applications, 1990
for d = 0 to log2(n) - 1 do
  for all P_i, 0 ≤ i < n / 2^(d+1), in parallel do
    a[(i+1) * 2^(d+1) - 1] = a[i * 2^(d+1) + 2^d - 1] + a[(i+1) * 2^(d+1) - 1]
end
Naïve parallel implementation of REDUCE (figure, animated over several slides: pairwise partial sums are combined level by level until the question “= 31?” is answered; the final sum is 31)
REDUCE using a tree Blelloch, G. Prefix Sums and Their Applications, 1990
CUDA implementation of REDUCE Naïve implementation: __global__ void reduce0(int *g_idata, int *g_odata) { unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; for (unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) g_idata[i] += g_idata[i + s]; __syncthreads(); } if (tid == 0) g_odata[blockIdx.x] = g_idata[i]; } Problems with this implementation: modifies its input; slow!
CUDA implementation of REDUCE
Better implementation: Uses parallelism in global data loading into shared memory Does not modify input data Problem with this implementation: highly divergent warps (the tid % (2*s) == 0 test makes threads within the same warp take different branches) CUDA implementation of REDUCE __global__ void reduce1(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element into shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for (unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s]; __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; }
CUDA implementation of REDUCE
Even better implementation: Solves highly divergent code CUDA implementation of REDUCE __global__ void reduce2(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element into shared mem unsigned int tid = threadIdx.x; unsigned int i= blockIdx.x * blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for (unsigned int s = blockDim.x / 2; s>0; s>>=1) { if (tid < s) sdata[tid] += sdata[tid + s]; __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; }
CUDA implementation of REDUCE (This is better, but there are still problems: on every pass only the first s threads do any work, so at least half of the threads in each block sit idle.)
Definition: The SCAN operation takes a binary associative, commutative operator ⊕, and an ordered set of n elements [a_0, a_1, ..., a_{n-1}], and returns the ordered set [a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-1})]. Definition: The PRESCAN operation takes a binary associative, commutative operator ⊕, identity I over ⊕, and an ordered set of n elements [a_0, a_1, ..., a_{n-1}], and returns the ordered set [I, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-2})]. AKA “exclusive scan” and “exclusive prefix”. What is SCAN? Blelloch, G. Prefix Sums and Their Applications, 1990
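Using the same example, with ⊕ = + and input [3, 1, 7, 0, 4, 1, 6, 3], SCAN returns [3, 4, 11, 11, 15, 16, 22, 25] while PRESCAN returns [0, 3, 4, 11, 11, 15, 16, 22]: the SCAN result shifted right by one, with the identity 0 in front.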
Sequential implementation of SCAN
for i := 1 to n - 1 do
  x[i] = x[i-1] ⊕ x[i]
Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009.
Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009.
Problem: “Algorithm 1” assumes all fetches and additions occur simultaneously in x[k - 2^(d-1)] + x[k] for a given d. In a Tesla GPU, this is not the case. Threads execute in a Warp, which has 32 threads, and Warps execute sequentially. Need to separate input from output: double buffering. Naïve parallel implementation of SCAN
for d := 1 to log2(n) do
  swap(xout, xin)
  forall k in parallel do
    if k >= 2^(d-1) then
      xout[k] = xin[k - 2^(d-1)] + xin[k]
    else
      xout[k] = xin[k]
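A single-block CUDA sketch of this double-buffered naive scan, in the style of the Harris (2009) paper cited on the surrounding slides (the same pattern reappears later as step 3 of the counting sort); the dynamic shared array must be allocated as 2*n ints at launch time:

// Inclusive scan of n elements by one block of n threads, double-buffered in
// shared memory so no thread reads a value another thread is overwriting.
__global__ void scan_naive(int *g_odata, const int *g_idata, int n)
{
    extern __shared__ int temp[];            // 2 * n ints: two buffers
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = g_idata[tid];     // load input into buffer 0
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;                     // swap the roles of the two buffers
        pin  = 1 - pin;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    g_odata[tid] = temp[pout * n + tid];     // write the inclusive prefix sums
}
// Launch sketch: scan_naive<<<1, n, 2 * n * sizeof(int)>>>(d_out, d_in, n);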
Naïve parallel implementation of SCAN WORKS BUT IS INEFFICIENT + SLOWER THAN SERIAL IMPLEMENTATION!
Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009.
Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009. Watch out for “free” software and algorithms!
Naïve parallel implementation of SCAN WORKS BUT DOES NOT SCALE
Blelloch SCAN
Step 1: Run naïve or Blelloch parallel PRESCAN on blocks. This computes the PRESCAN for each block without regard to other blocks. Step 2: Run naïve or Blelloch parallel SCAN on top values in all blocks. Consider an array F in which the values are the last value in each block in step 1. Compute SCAN for array F. Step 3: Update all values in all blocks with the values in array F. Implementation… see code at Scalable Parallel implementation of SCAN
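A hedged host-side sketch of the three steps (the kernel names scan_block, scan_tops, and add_offsets are hypothetical placeholders for the per-block scan, the scan of the block totals, and the final per-element update; B is the block size and n is assumed to be a multiple of B):

// Assumed kernels (declarations only; bodies follow the SCAN kernels above):
__global__ void scan_block(int *d_out, const int *d_in, int *d_tops);  // step 1
__global__ void scan_tops(int *d_tops, int nblocks);                   // step 2
__global__ void add_offsets(int *d_out, const int *d_tops);            // step 3

void full_scan(int *d_out, int *d_in, int *d_tops, int n, int B)
{
    int nblocks = n / B;

    // Step 1: scan each block independently; each block also writes its total
    // into d_tops[blockIdx.x] (this is the array "F" of the slide).
    scan_block<<<nblocks, B, 2 * B * sizeof(int)>>>(d_out, d_in, d_tops);

    // Step 2: scan the nblocks per-block totals with a single block.
    scan_tops<<<1, nblocks, 2 * nblocks * sizeof(int)>>>(d_tops, nblocks);

    // Step 3: add the scanned total of all preceding blocks to every element
    // of each block, producing the final scan of the whole array.
    add_offsets<<<nblocks, B>>>(d_out, d_tops);
}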
BASIC ALGORITHMS Sorting
Counting Sort
COUNTING_SORT (A, B, k)
for i ← 1 to k do c[i] ← 0
for j ← 1 to n do c[A[j]] ← c[A[j]] + 1
// c[i] now contains the number of elements equal to i
for i ← 2 to k do c[i] ← c[i] + c[i-1]
// c[i] now contains the number of elements ≤ i
for j ← n downto 1 do B[c[A[j]]] ← A[j]; c[A[j]] ← c[A[j]] - 1
Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill Book Company.
(Animated over several slides: the counts c are filled in, the prefix sums are formed, and the output array B is built from a small example array a.)
Counting Sort in CUDA COUNTING_SORT (A, B, k) for i ← 1 to k do c[i] ← 0 for j ← 1 to n do c[A[j]] ← c[A[j]] + 1 //c[i] now contains the number of elements equal to i for i ← 2 to k do c[i] ← c[i] + c[i-1] // c[i] now contains the number of elements ≤ i for j ← n downto 1 do B[c[A[j]]] ← A[j] c[A[j]] ← c[A[j]] - 1 Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill Book Company. __global__ void step1(int * c, int Kp, int n) { int idx = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (idx < 0) return; if (idx >= Kp) return; // Initialize c from 0 to K inclusive. c[idx] = 0; } For a high-level overview of the implementation, see Sun, W. and Ma, Z., Count Sort for GPU Computing. International Conference on Parallel and Distributed Systems (2009), IEEE Computer Society.
Counting Sort in CUDA COUNTING_SORT (A, B, k) for i ← 1 to k do c[i] ← 0 for j ← 1 to n do c[A[j]] ← c[A[j]] + 1 //c[i] now contains the number of elements equal to i for i ← 2 to k do c[i] ← c[i] + c[i-1] // c[i] now contains the number of elements ≤ i for j ← n downto 1 do B[c[A[j]]] ← A[j] c[A[j]] ← c[A[j]] - 1 Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill Book Company. __global__ void step2(int * c, int * a, int K, int n) { int idx = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (idx < 0) return; if (idx >= n) return; int w = a[idx]; if (a[idx] < 0) return; if (a[idx] > K) return; int * p = &c[w]; atomicAdd(p, 1); }
Counting Sort in CUDA COUNTING_SORT (A, B, k) for i ← 1 to k do c[i] ← 0 for j ← 1 to n do c[A[j]] ← c[A[j]] + 1 //c[i] now contains the number of elements equal to i for i ← 2 to k do c[i] ← c[i] + c[i-1] // c[i] now contains the number of elements ≤ i for j ← n downto 1 do B[c[A[j]]] ← A[j] c[A[j]] ← c[A[j]] - 1 Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill Book Company. __global__ void step3(int * g_odata, int * g_idata, int K) { extern __shared__ int temp[]; int tid = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (tid < 0) return; if (tid >= K) return; int pout = 0; int pin = 1; temp[pout*K + tid] = g_idata[tid]; __syncthreads(); for (int i = 1; i < K; i *= 2) { pout = 1 - pout; pin = 1 - pin; if (tid >= i) temp[pout*K + tid] = temp[pin*K + tid] + temp[pin*K + tid - i]; else temp[pout*K + tid] = temp[pin*K + tid]; __syncthreads(); } g_odata[tid] = temp[pout*K + tid]; } Note: this is our Naïve SCAN implementation. We can do much better!
Counting Sort in CUDA COUNTING_SORT (A, B, k) for i ← 1 to k do c[i] ← 0 for j ← 1 to n do c[A[j]] ← c[A[j]] + 1 //c[i] now contains the number of elements equal to i for i ← 2 to k do c[i] ← c[i] + c[i-1] // c[i] now contains the number of elements ≤ i for j ← n downto 1 do B[c[A[j]]] ← A[j] c[A[j]] ← c[A[j]] - 1 Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill Book Company. __global__ void step4(int * c, int * a, int * b, int Kp, int n) { extern __shared__ int temp[]; int tid = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (tid < 0) return; if (tid >= n) return; int w = a[tid]; int old = atomicAdd(&c[w], -1); b[old-1] = w; } What’s wrong with this algorithm????