LECTURE 3: INTRODUCTION TO PARALLEL COMPUTING USING CUDA Ken Domino, Domem Technologies May 16, 2011 IEEE Boston Continuing Education Program

Announcements: Course website updates (PDF/HTTP links fixed); Lecture 3 Counting 6’s example working under Ocelot; installation guide –

The GPU is a coprocessor for the CPU: the CPU is the master and the GPU is the slave.

CUDA Runtime API: cudaMalloc(&d, len) allocates device memory for the array d, holding elements 0 … len − 1.

CUDA Runtime API: cudaMemcpy(d, …, …, cudaMemcpyHostToDevice) copies elements 0 … len − 1 from host memory to the device array d.

CUDA Runtime API: kernel<<<blocks, threadsPerBlock>>>(d); __global__ void kernel(int* d) { int idx = blockIdx.x * blockDim.x + threadIdx.x; d[idx] = idx; } Each of the len threads writes its global index into the device array d.

CUDA Runtime API: cudaMemcpy(…, d, …, cudaMemcpyDeviceToHost) copies elements 0 … len − 1 from the device array d back to host memory.

CUDA Runtime API: cudaFree(d) releases the device memory.
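
Putting the five calls above together, here is a minimal end-to-end host program (a sketch: the array length of 8 elements and the <<<2, 4>>> launch configuration are illustrative choices, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *d)
{
    // Each thread writes its global index, as in the slides above.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d[idx] = idx;
}

int main()
{
    const int len = 8;          // illustrative array length
    int h[len] = { 0 };
    int *d = 0;

    cudaMalloc(&d, len * sizeof(int));                            // allocate device memory
    cudaMemcpy(d, h, len * sizeof(int), cudaMemcpyHostToDevice);  // copy host -> device
    kernel<<<2, 4>>>(d);                                          // 2 blocks of 4 threads = len threads
    cudaMemcpy(h, d, len * sizeof(int), cudaMemcpyDeviceToHost);  // copy device -> host
    cudaFree(d);                                                  // release device memory

    for (int i = 0; i < len; ++i)
        printf("%d ", h[i]);                                      // prints 0 1 2 3 4 5 6 7
    printf("\n");
    return 0;
}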

CUDA Thread Identification kernel<<<blocks, threadsPerBlock>>>(a); __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = idx; } __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = threadIdx.x; } __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = blockIdx.x; } Built-in variables threadIdx, blockDim, blockIdx, and gridDim are defined when the kernel executes and are used to identify which thread is executing. (Execution Configuration)

Example: Counting 6’s in CUDA Host (CPU) code int main() { int size = 300, bsize = 100; int * h = (int*)malloc(size * sizeof(int)); for (int i = 0; i < size; ++i) h[i] = i % 10; int * d_in, * d_out; int blocks = size/bsize; int threads_per_block = 1; int rv1 = cudaMalloc(&d_in, size*sizeof(int)); int rv2 = cudaMalloc(&d_out, blocks*sizeof(int)); int rv3 = cudaMemcpy(d_in, h, size*sizeof(int), cudaMemcpyHostToDevice); c6<<<blocks, threads_per_block>>>(d_in, d_out, bsize); cudaThreadSynchronize(); int rv4 = cudaGetLastError(); int * r = (int*)malloc(blocks * sizeof(int)); int rv5 = cudaMemcpy(r, d_out, blocks*sizeof(int), cudaMemcpyDeviceToHost); int sum = 0; for (int i = 0; i < blocks; ++i) sum += r[i]; printf("Result = %d\n", sum); return 0; } Declare the size of the array and the chunk size. Declare the CPU copy of the array and initialize it. Declare device copies of input and output. Declare blocks per grid and threads per block. Allocate GPU global memory for input/output. Copy host memory to GPU memory. Call the GPU, wait for the threads to complete, get errors. Copy from GPU to CPU. Sum up the per-block counts of the number of 6’s.

Example: Counting 6’s in CUDA Kernel (GPU) code __global__ void c6(int * d_in, int * d_out, int size) { int sum = 0; for (int i=0; i < size; ++i) { int val = d_in[i + blockIdx.x * size]; if (val == 6) sum++; } d_out[blockIdx.x] = sum; } Declare parameters to kernel. Initialize the count of 6’s in the chunk. Count the 6’s in the chunk. Store the sum in global memory.

Optimizations To achieve better performance: choose a good kernel launch configuration; use constant and shared memory; avoid bank conflicts when using shared memory; use coalesced accesses to global memory; avoid branch divergence within a warp.

Warps kernel<<<2, 3>>>(a); __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = idx; } 2 blocks, 3 threads per block become 2 warps of 32 threads each, with 29 threads per warp unused. Under the covers, each block is split into warps of 32 threads.

Occupancy We want to maximize the number of active threads on all multiprocessors at all times. Occupancy is defined as the ratio of the number of resident warps to the maximum number of resident warps. It is a function of warp residency, the number of registers in use, the amount of shared memory in use, and the type of GPU card.

kernel<<<2, 3>>>(a); __global__ void kernel(int* a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = idx; } 2 blocks, 3 threads per block. Maximum number of resident blocks per multiprocessor = 8. Maximum number of resident warps per multiprocessor = 48. Maximum number of resident threads per multiprocessor = 1536. Number of blocks per multiprocessor = 2, so possible occupancy = 2/48 ≈ 4%. Actual occupancy is even lower (1/48 per SM) when the two blocks are scheduled on two different SMs. Higher occupancy = better utilization of the GPU.

Occupancy Use the CUDA Occupancy Calculator (an Excel spreadsheet) to compute potential occupancy.

Occupancy Use the CUDA Compute Visual Profiler to measure real occupancy.

Occupancy Generally, choose 32 threads per block, because 32 threads map onto exactly one warp, or a multiple of 32.
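
Note: besides the spreadsheet and the profiler, newer CUDA toolkits (6.5 and later, so not the toolkit this 2011 lecture was written against) let you query occupancy from the runtime API. A minimal sketch, assuming a trivial kernel and a block size of 32:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = idx;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 32;        // one warp per block, as suggested above
    int numBlocksPerSM = 0;    // filled in by the occupancy query

    // Ask the runtime how many blocks of this kernel fit on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, kernel, blockSize, 0);

    int activeWarps = numBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %d/%d warps = %.1f%%\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}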

GPU Memory Streaming multiprocessors contain registers (per thread), shared memory (per block), and an L1 cache; the L2 cache and global memory (per grid) are shared by all SMs. Micikevicius, P. Fundamental Optimizations, Supercomputing, New Orleans, Nov 14, 2010.

GPU Memory Memory access times depend on the memory class. Sun, W. and Ma, Z., Count Sort for GPU Computing, in 15th International Conference on Parallel and Distributed Systems (2009), IEEE Computer Society.

Use shared memory to increase the speed of computation because the latency for shared memory is very low. Shared memory

Counting 6’s using shared memory Kernel (GPU) code __global__ void c6d(int * d_in, int * d_out) { __shared__ int sum; if (threadIdx.x == 0) sum = 0; __syncthreads(); int val = d_in[threadIdx.x + blockIdx.x * blockDim.x]; if (val == 6) atomicAdd(&sum, 1); __syncthreads(); if (threadIdx.x == 0) d_out[blockIdx.x] = sum; } Declare parameters to kernel. Initialize the count of 6’s in the chunk. Count the 6’s in the chunk. Synchronize to make sure all threads have added to sum. Store the sum in global memory.

Shared memory paradigm Partition data into subsets that fit into shared memory

Handle each data subset with one thread block Shared memory paradigm

Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism (prefetching) Shared memory paradigm

Perform the computation on the subset from shared memory Shared memory paradigm

Copy the result from shared memory back to global memory Shared memory paradigm

Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // each thread loads two elements from global memory int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } }

Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // what are the bandwidth requirements of this kernel? int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } } Two loads

Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // How many times does this kernel load input[i]? int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } } // once by thread i // again by thread i+1

Example – shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] - input[i-1] __global__ void adj_diff_naive(int *result, int *input) { // compute this thread’s global index unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; if(i > 0) { // Idea: eliminate redundancy by sharing data int x_i = input[i]; int x_i_minus_one = input[i-1]; result[i] = x_i - x_i_minus_one; } }

Example – shared variables // optimized version of adjacent difference __global__ void adj_diff(int *result, int *input) { // shorthand for threadIdx.x int tx = threadIdx.x; // allocate a __shared__ array, one element per thread __shared__ int s_data[BLOCK_SIZE]; // each thread reads one element to s_data unsigned int i = blockDim.x * blockIdx.x + tx; s_data[tx] = input[i]; // avoid race condition: ensure all loads // complete before continuing __syncthreads(); if(tx > 0) result[i] = s_data[tx] - s_data[tx-1]; else if(i > 0) { // handle thread block boundary result[i] = s_data[tx] - input[i-1]; } }
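
A minimal host-side launch sketch for the optimized kernel above (BLOCK_SIZE = 256 and an input length that is an exact multiple of BLOCK_SIZE are illustrative assumptions; d_result and d_input are device pointers):

#define BLOCK_SIZE 256

// Launch adj_diff with one thread per element; assumes n % BLOCK_SIZE == 0
// so that the static __shared__ array in the kernel matches the block size.
void launch_adj_diff(int *d_result, int *d_input, int n)
{
    dim3 grid(n / BLOCK_SIZE);
    dim3 block(BLOCK_SIZE);
    adj_diff<<<grid, block>>>(d_result, d_input);
}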

Problem with Shared Memory: Bank conflict. Shared memory is organized into 32 banks; addresses are interleaved across the banks in 4-byte quantities, accessible in 2 cycles per warp.

Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. Problem with Shared Memory: Bank conflict NVIDIA NVIDIA CUDA™ Programming Guide Version 3.0, 2010.

Shared Memory: without Bank Conflict (examples: one access per bank; one access per bank with shuffling; all threads access the same address, a broadcast; partial broadcast with some banks skipped) NVIDIA, NVIDIA CUDA™ Programming Guide Version 3.0, 2010.

Shared Memory: with Bank Conflict access more than one address per bank

Bank conflict #define NUM_BANKS 32 #define NUM_THREADS 32 __global__ void bad_access(int *out, int iters) { __shared__ int mem[NUM_BANKS * NUM_THREADS]; int idx = threadIdx.x; int min = idx * NUM_BANKS; int max = (idx + 1) * NUM_BANKS; int inc = 1; for (int j = min; j < max; j += inc) mem[j] = 0; for (int i = 0; i < iters; i++) for (int j = min; j < max; j += inc) mem[j]++; for (int j = min; j < max; j += inc) out[j] = mem[j]; } Bank conflict here: at every loop step, all 32 threads of the warp access addresses that fall in the same bank.

Bank conflict: in the first iteration of for (int j = min; j < max; j += inc) mem[j] = 0; threads 1, 2, 3, …, 32 access j = 0, 32, 64, …, 928, 960, 992, so every access falls in the same bank (Bank 1).

Bank conflict: in the second iteration of for (int j = min; j < max; j += inc) mem[j] = 0; threads 1, 2, 3, …, 32 access j = 1, 33, 65, …, 929, 961, 993, again all in one bank (Bank 2).

Bank conflict: in the third iteration of for (int j = min; j < max; j += inc) mem[j] = 0; threads 1, 2, 3, …, 32 access j = 2, 34, 66, …, 930, 962, 994, again all in one bank (Bank 3).

Bank conflict fixed #define NUM_BANKS 32 #define NUM_THREADS 32 __global__ void good_access(int *out, int iters) { __shared__ int mem[NUM_BANKS * NUM_THREADS]; int idx = threadIdx.x; int min = idx; int max = blockDim.x * NUM_BANKS; int inc = NUM_BANKS; for (int j = min; j < max; j += inc) mem[j] = 0; for (int i = 0; i < iters; i++) for (int j = min; j < max; j += inc) mem[j]++; for (int j = min; j < max; j += inc) out[j] = mem[j]; } Bank conflict fixed: at every loop step, each thread of the warp accesses an address in a different bank.

Bank conflict fixed: in the first iteration of for (int j = min; j < max; j += inc) mem[j] = 0; threads 1, 2, 3, …, 32 access j = 0, 1, 2, …, 29, 30, 31, so each access falls in a different bank.

Bank conflict fixed: in the second iteration, threads 1, 2, 3, …, 32 access j = 32, 33, 34, …, 61, 62, 63, again one access per bank.

Bank conflict fixed: in the third iteration, threads 1, 2, 3, …, 32 access j = 64, 65, 66, …, 93, 94, 95, again one access per bank.

Coalesced access Fast access to global memory for older GeForce GPUs (compute capability 1.0 to 1.3). On newer GPUs (compute capability 2.0) these per-half-warp coalescing rules no longer apply, because global memory accesses go through the cache hierarchy. A half warp can access one 32- (or 64-, or 128-) byte memory quantity in one transaction if three conditions are met.

Coalesced access To coalesce, the global memory request for a half-warp must satisfy the following conditions: 1) the size of the words accessed by the threads must be 4, 8, or 16 bytes; 2) if the word size is 4, all 16 words must lie in the same 64-byte segment; if 8, in the same 128-byte segment; if 16, the first 8 words must lie in one 128-byte segment and the last 8 words in the following 128-byte segment; 3) threads must access the words in sequence: the kth thread in the half-warp must access the kth word.

Global Memory: Coalesced Access (examples: perfectly sequential access; sequential access in which some threads skip their load/store) NVIDIA, NVIDIA CUDA™ Programming Guide Version 3.0, 2010.

Global Memory: Non-Coalesced Access (examples: non-consecutive addresses; starting address not aligned to 128 bytes; stride larger than one word) NVIDIA, NVIDIA CUDA™ Programming Guide Version 3.0, 2010.

Coalesced access

Thread id vs. &data[]: Threads 0-15 access data[] in 4-byte quantities consecutively, at offsets 0 to 63; therefore this half warp is coalesced. Threads 16-31 access data[] in 4-byte quantities consecutively, at offsets 64 to 127; therefore this half warp is coalesced.

Coalesced access

Thread id vs. &data[]: Threads 1 and 2 do not access data[] consecutively; therefore this half warp is NOT coalesced. Threads 17 and 18 do not access data[] consecutively; therefore this half warp is NOT coalesced.
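
To make the two cases concrete, here is a sketch of a coalesced copy kernel next to a strided one (the kernel names and the float element type are illustrative):

// Coalesced: thread k of a half-warp reads the k-th consecutive word, so the
// 16 accesses of the half-warp fall into one aligned segment (CC 1.0-1.3 rules).
__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Not coalesced: consecutive threads read words that are `stride` elements
// apart, so a half-warp touches many segments and needs many transactions.
__global__ void copy_strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}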

BASIC ALGORITHMS The REDUCE and SCAN primitives

Time warp… What do these APL expressions do: +/ ×/ ⋁/ and ×\ =\ ? In APL, op/ is a REDUCE and op\ is a SCAN.

What is REDUCE? Definition: The REDUCE operation takes a binary operator ⊕ with identity I, and an ordered set [a0, a1, …, an−1] of n elements, and returns the value (((a0 ⊕ a1) ⊕ …) ⊕ an−1). For our discussions, consider only associative, commutative operators, so that we may perform the operations in any order. Blelloch, G. Prefix Sums and Their Applications, 1990

What is SCAN? Definition: The SCAN operation takes a binary operator ⊕, and an ordered set of n elements [a0, a1, …, an−1], and returns the ordered set [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an−1)]. AKA “inclusive prefix” or “all-prefix-sums” (with addition). Let’s only consider associative binary operators…
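
For example, with ⊕ = + and the input [3, 1, 7, 0, 4, 1, 6, 3], REDUCE returns 25 and SCAN returns [3, 4, 11, 11, 15, 16, 22, 25].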

Why are REDUCE and SCAN important? If ⊕ is addition, then REDUCE computes the summation. If ⊕ is the dyadic min function (i.e., min(x,y) = x > y ? y : x), then REDUCE finds the minimum of all the numbers. SCAN is used in many sorting algorithms. Blelloch, G. Prefix Sums and Their Applications, 1990

Sequential implementation of REDUCE: Result = I; for i := 1 to n do Result = Result ⊕ x[i]

Naïve parallel implementation of REDUCE (Blelloch, G. Prefix Sums and Their Applications, 1990):
for d := 0 to log2(n) − 1 do
    for all i, 0 ≤ i < n / 2^(d+1), in parallel do
        a[(i+1)·2^(d+1) − 1] = a[i·2^(d+1) + 2^d − 1] + a[(i+1)·2^(d+1) − 1]
end

Naïve parallel implementation of REDUCE: worked example on an array whose elements sum to 31; each step combines pairs of elements in parallel until the single value 31 remains.

REDUCE using a tree Blelloch, G. Prefix Sums and Their Applications, 1990

CUDA implementation of REDUCE Naïve implementation. Problems with this implementation: it modifies its input, and it is slow! __global__ void reduce0(int *g_idata, int *g_odata) { unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; for (unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) g_idata[i] += g_idata[i + s]; __syncthreads(); } if (tid == 0) g_odata[blockIdx.x] = g_idata[i]; }

CUDA implementation of REDUCE

Better implementation: Uses parallelism in global data loading into shared memory Does not modify input data Problem with this implementation: Has bank conflicts CUDA implementation of REDUCE __global__ void reduce1(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element into shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for (unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s]; __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; }

CUDA implementation of REDUCE

Even better implementation: Solves highly divergent code CUDA implementation of REDUCE __global__ void reduce2(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element into shared mem unsigned int tid = threadIdx.x; unsigned int i= blockIdx.x * blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for (unsigned int s = blockDim.x / 2; s>0; s>>=1) { if (tid < s) sdata[tid] += sdata[tid + s]; __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; }

CUDA implementation of REDUCE (This is better, but there is still room for improvement: for example, half of the threads never perform an addition, and each thread does very little work.)

What is SCAN? Definition: The SCAN operation takes a binary associative, commutative operator ⊕, and an ordered set of n elements [a0, a1, …, an−1], and returns the ordered set [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an−1)]. Definition: The PRESCAN operation takes a binary associative, commutative operator ⊕ with identity I, and an ordered set of n elements [a0, a1, …, an−1], and returns the ordered set [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an−2)]. AKA “exclusive scan” or “exclusive prefix”. Blelloch, G. Prefix Sums and Their Applications, 1990
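
For example, with ⊕ = + (identity I = 0) and the input [3, 1, 7, 0, 4, 1, 6, 3], SCAN returns [3, 4, 11, 11, 15, 16, 22, 25], while PRESCAN returns [0, 3, 4, 11, 11, 15, 16, 22].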

Sequential implementation of SCAN (in place, with x indexed from 0): for i := 1 to n − 1 do x[i] = x[i−1] ⊕ x[i]

Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009.

Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009.

Problem: “Algorithm 1” assumes that all fetches and additions in x[k − 2^(d−1)] + x[k] occur simultaneously for a given d. On a Tesla GPU this is not the case: threads execute in warps of 32 threads, and warps execute sequentially. We need to separate input from output: double buffering. Naïve parallel implementation of SCAN

for d := 1 to log2(n) do
    swap(xout, xin)
    for all k in parallel do
        if k ≥ 2^(d−1) then
            xout[k] = xin[k − 2^(d−1)] + xin[k]
        else
            xout[k] = xin[k]
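
A single-block CUDA sketch of this double-buffered naïve scan (assuming n ≤ blockDim.x, n a power of two, and 2·n·sizeof(int) bytes of dynamic shared memory supplied at launch; it is essentially the same pattern as the step3 kernel in the counting-sort example later on):

// Naive inclusive scan for a single block, double-buffered in shared memory.
// Launch as: scan_naive<<<1, n, 2 * n * sizeof(int)>>>(g_out, g_in, n);
__global__ void scan_naive(int *g_out, const int *g_in, int n)
{
    extern __shared__ int temp[];   // two buffers of n ints each
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = g_in[tid];          // load input
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;                       // swap the roles of the buffers
        pin  = 1 - pin;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    g_out[tid] = temp[pout * n + tid];         // inclusive scan result
}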

Naïve parallel implementation of SCAN WORKS BUT IS INEFFICIENT + SLOWER THAN SERIAL IMPLEMENTATION!

Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009.

Naïve parallel implementation of SCAN Harris, M. Parallel prefix sum (scan) with CUDA, NVIDIA, 2009. Watch out for “free” software and algorithms!

Naïve parallel implementation of SCAN WORKS BUT DOES NOT SCALE

Blelloch SCAN
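
The Blelloch scan uses an up-sweep (reduce) phase followed by a down-sweep phase, doing O(n) work instead of O(n log n). A single-block kernel sketch in the style of Harris's scan paper (assumptions: n is a power of two that fits in shared memory, n/2 threads are launched with n·sizeof(int) bytes of dynamic shared memory, and bank conflicts are ignored):

// Work-efficient (Blelloch) exclusive scan (PRESCAN) for one block of n elements.
// Launch as: prescan_blelloch<<<1, n / 2, n * sizeof(int)>>>(g_out, g_in, n);
__global__ void prescan_blelloch(int *g_out, const int *g_in, int n)
{
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    int offset = 1;

    // Each of the n/2 threads loads two elements.
    temp[2 * tid]     = g_in[2 * tid];
    temp[2 * tid + 1] = g_in[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums in place.
    for (int d = n >> 1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Clear the last element (identity), then down-sweep to produce the prescan.
    if (tid == 0) temp[n - 1] = 0;
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads();
        if (tid < d)
        {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    g_out[2 * tid]     = temp[2 * tid];
    g_out[2 * tid + 1] = temp[2 * tid + 1];
}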

Scalable Parallel implementation of SCAN Step 1: Run the naïve or Blelloch parallel PRESCAN on each block; this computes the PRESCAN of each block without regard to other blocks. Step 2: Run the naïve or Blelloch parallel SCAN on the top values of all blocks: consider an array F whose values are the last value of each block from Step 1, and compute the SCAN of F. Step 3: Update all values in all blocks with the values in array F. Implementation… see code at
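
A host-side sketch of the three steps (the kernels prescan_block and add_block_offsets are hypothetical placeholders for Step 1 and Step 3; prescan_blelloch is the single-block kernel sketched above; B is the block size, and n = numBlocks · B with both powers of two):

// Host-side orchestration of the three-step scalable PRESCAN (sketch).
void scalable_prescan(int *d_out, const int *d_in, int n, int B)
{
    int numBlocks = n / B;
    int *d_block_sums;
    cudaMalloc(&d_block_sums, numBlocks * sizeof(int));

    // Step 1: prescan each block independently, writing each block's total
    // into d_block_sums (hypothetical kernel).
    prescan_block<<<numBlocks, B / 2, B * sizeof(int)>>>(d_out, d_in, d_block_sums, B);

    // Step 2: scan the per-block totals (assumes numBlocks fits in one block).
    prescan_blelloch<<<1, numBlocks / 2, numBlocks * sizeof(int)>>>(d_block_sums, d_block_sums, numBlocks);

    // Step 3: add each block's scanned offset to every element of that block
    // (hypothetical kernel).
    add_block_offsets<<<numBlocks, B>>>(d_out, d_block_sums);

    cudaFree(d_block_sums);
}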

BASIC ALGORITHMS Sorting

Counting Sort COUNTING_SORT (A, B, k) for i ← 1 to k do c[i] ← 0 for j ← 1 to n do c[A[j]] ← c[A[j]] + 1 // c[i] now contains the number of elements equal to i for i ← 2 to k do c[i] ← c[i] + c[i-1] // c[i] now contains the number of elements ≤ i for j ← n downto 1 do B[c[A[j]]] ← A[j] c[A[j]] ← c[A[j]] - 1 Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill Book Company.
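
For reference, a plain C version of the pseudocode above (a CPU sketch; A holds n values in the range 0..k and B receives the sorted output):

#include <stdlib.h>

// Counting sort as in CLRS: histogram, prefix sum, then stable placement.
void counting_sort(const int *A, int *B, int n, int k)
{
    int *c = (int *)calloc(k + 1, sizeof(int));
    for (int j = 0; j < n; ++j)        // c[v] = number of elements equal to v
        c[A[j]]++;
    for (int i = 1; i <= k; ++i)       // c[v] = number of elements <= v
        c[i] += c[i - 1];
    for (int j = n - 1; j >= 0; --j) { // walk backwards for stability
        B[c[A[j]] - 1] = A[j];
        c[A[j]]--;
    }
    free(c);
}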

Counting Sort in CUDA Step 1 implements “for i ← 1 to k do c[i] ← 0”: __global__ void step1(int * c, int Kp, int n) { int idx = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (idx < 0) return; if (idx >= Kp) return; // Initialize c from 0 to K inclusive. c[idx] = 0; } For an overview of the implementation, see Sun, W. and Ma, Z., Count Sort for GPU Computing, in 15th International Conference on Parallel and Distributed Systems (2009), IEEE Computer Society.

Counting Sort in CUDA Step 2 implements “for j ← 1 to n do c[A[j]] ← c[A[j]] + 1”: __global__ void step2(int * c, int * a, int K, int n) { int idx = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (idx < 0) return; if (idx >= n) return; int w = a[idx]; if (a[idx] < 0) return; if (a[idx] > K) return; int * p = &c[w]; atomicAdd(p, 1); }

Counting Sort in CUDA Step 3 implements the prefix sum “for i ← 2 to k do c[i] ← c[i] + c[i-1]”: __global__ void step3(int * g_odata, int * g_idata, int K) { extern __shared__ int temp[]; int tid = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (tid < 0) return; if (tid >= K) return; int pout = 0; int pin = 1; temp[pout*K + tid] = g_idata[tid]; __syncthreads(); for (int i = 1; i < K; i *= 2) { pout = 1 - pout; pin = 1 - pin; if (tid >= i) temp[pout*K + tid] = temp[pin*K + tid] + temp[pin*K + tid - i]; else temp[pout*K + tid] = temp[pin*K + tid]; __syncthreads(); } g_odata[tid] = temp[pout*K + tid]; } Note: this is our Naïve SCAN implementation. We can do much better!


Counting Sort in CUDA Step 4 implements “for j ← n downto 1 do B[c[A[j]]] ← A[j]; c[A[j]] ← c[A[j]] - 1”: __global__ void step4(int * c, int * a, int * b, int Kp, int n) { int tid = threadIdx.x + threadIdx.y * blockDim.x + blockIdx.x * blockDim.x * blockDim.y + blockIdx.y * blockDim.x * blockDim.y * gridDim.x; if (tid < 0) return; if (tid >= n) return; int w = a[tid]; int old = atomicAdd(&c[w], -1); b[old-1] = w; } What’s wrong with this algorithm????