Introduction to CUDA Programming: Scans
Andreas Moshovos, Winter 2009
Based on slides from Wen-Mei Hwu (UIUC) and David Kirk (NVIDIA), and the white paper/slides by Mark Harris (NVIDIA)

Scan / Parallel Prefix Sum
Given an array A = [a0, a1, …, an-1] and a binary associative operator ⊕ with identity I:
– exclusive scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]
This is the exclusive scan; we'll focus on it.

Inclusive Scan
Given an array A = [a0, a1, …, an-1] and a binary associative operator ⊕ with identity I:
– inclusive scan(A) = [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]
This is the inclusive scan.
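As a concrete example (the one used throughout GPU Gems 3, Ch. 39), take ⊕ = + over A = [3, 1, 7, 0, 4, 1, 6, 3]:
– exclusive scan(A) = [0, 3, 4, 11, 11, 15, 16, 22]
– inclusive scan(A) = [3, 4, 11, 11, 15, 16, 22, 25]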

Applications of Scan
Scan is used as a building block for many parallel algorithms:
– Radix sort
– Quicksort
– String comparison
– Lexical analysis
– Run-length encoding
– Histograms
– Etc.
See: Guy E. Blelloch, "Prefix Sums and Their Applications", in John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann.

Scan Background
Pre-GPU:
– First proposed in APL by Iverson (1962)
– Used as a data-parallel primitive on the Connection Machine (1990); a feature of C* and CM-Lisp
– Guy Blelloch used scan as a primitive for various parallel algorithms: Blelloch, 1990, "Prefix Sums and Their Applications"
GPU computing:
– O(n log n)-work GPU implementation by Daniel Horn (GPU Gems 2); applied to summed-area tables by Hensley et al. (EG05)
– O(n)-work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
– O(n) work & space GPU implementation by Harris et al. (2007): NVIDIA CUDA SDK and GPU Gems 3; applied to radix sort, stream compaction, and summed-area tables

Sequential Algorithm

    void scan(float* output, float* input, int length)
    {
        output[0] = 0; // exclusive scan (prescan): first element is the identity
        for (int j = 1; j < length; ++j)
            output[j] = input[j-1] + output[j-1];
    }

N−1 additions, i.e., O(N) work. Use this as the yardstick:
– We want the parallel version to be work-efficient
– i.e., to do a similar amount of work

Naïve Parallel Algorithm

    for d := 1 to log2(n) do
        forall k in parallel do
            if k >= 2^(d-1) then
                x[k] := x[k − 2^(d-1)] + x[k]

d = 1: 2^(d-1) = 1;  d = 2: 2^(d-1) = 2;  d = 3: 2^(d-1) = 4
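To see the doubling offsets in action, here is the trace for x = [3, 1, 7, 0, 4, 1, 6, 3] (note this pseudo-code computes the inclusive scan; the kernel below gets the exclusive result by shifting the input on load):

    d = 1 (offset 1): [3, 4, 8, 7, 4, 5, 7, 9]
    d = 2 (offset 2): [3, 4, 11, 11, 12, 12, 11, 14]
    d = 3 (offset 4): [3, 4, 11, 11, 15, 16, 22, 25]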

Need for Double Buffering
At each step, all threads must read their inputs before any thread overwrites them.
Solution:
– Use two arrays: input & output
– Alternate their roles at each step

Double Buffering
– Two arrays, A & B, in shared memory
– Input and output reside in global memory
[Diagram: global input → A → B → A → … → global output, alternating at each step]

Naïve Kernel in CUDA

    __global__ void scan_naive(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // 2*n floats, allocated on invocation
        int thid = threadIdx.x;
        int pout = 0, pin = 1;
        // load input shifted right by one: exclusive scan
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
        for (int dd = 1; dd < n; dd *= 2)
        {
            pout = 1 - pout; pin = 1 - pout; // swap the two buffers
            int basein = pin * n, baseout = pout * n;
            __syncthreads(); // previous step's writes must be visible
            temp[baseout + thid] = temp[basein + thid];
            if (thid >= dd)
                temp[baseout + thid] += temp[basein + thid - dd];
        }
        __syncthreads();
        g_odata[thid] = temp[pout*n + thid];
    }
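A minimal launch sketch for this kernel, assuming a single block and a power-of-two n no larger than the block-size limit (d_out and d_in are illustrative device-pointer names):

    // third launch parameter sizes the extern __shared__ array: two n-float buffers
    scan_naive<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);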

Analysis of the Naïve Kernel
This scan algorithm executes log2(n) parallel steps:
– The steps perform n−1, n−2, n−4, …, n/2 adds, respectively
– Total adds: O(n log n)
This scan algorithm is NOT work-efficient:
– The sequential scan algorithm does only n−1 adds

Improving Efficiency
A common parallel-algorithms pattern: balanced trees
– Build a balanced binary tree on the input data and sweep it to and from the root
– The tree is conceptual, not an actual data structure
For scan:
– Traverse from the leaves to the root, building partial sums at internal nodes; the root holds the sum of all leaves
– Traverse from the root to the leaves, building the scan from the partial sums
Algorithm originally described by Blelloch (1990)

Balanced Tree-Based Scan Algorithm / Up-Sweep

Balanced Tree-Based Scan Algorithm / Down-Sweep

Up-Sweep Pseudo-Code
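In its standard form (Blelloch 1990; GPU Gems 3, Ch. 39), the up-sweep (reduce) phase is:

    for d := 0 to log2(n) − 1 do
        forall k := 0 to n − 1 by 2^(d+1) in parallel do
            x[k + 2^(d+1) − 1] := x[k + 2^d − 1] + x[k + 2^(d+1) − 1]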

Down-Sweep Pseudo-Code
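In the same formulation, the down-sweep clears the root, then at each level passes each node's value to its left child while the right child receives the sum:

    x[n − 1] := 0
    for d := log2(n) − 1 downto 0 do
        forall k := 0 to n − 1 by 2^(d+1) in parallel do
            t := x[k + 2^d − 1]
            x[k + 2^d − 1] := x[k + 2^(d+1) − 1]
            x[k + 2^(d+1) − 1] := t + x[k + 2^(d+1) − 1]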

CUDA Implementation
Declarations & copying to shared memory; two elements per thread:

    __global__ void prescan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // n floats, allocated on invocation
        int thid = threadIdx.x;
        int offset = 1;
        temp[2*thid]   = g_idata[2*thid];   // load input into shared memory
        temp[2*thid+1] = g_idata[2*thid+1];

CUDA Implementation: Up-Sweep

        for (int d = n>>1; d > 0; d >>= 1) // build sums in place up the tree
        {
            __syncthreads();
            if (thid < d)
            {
                int ai = offset*(2*thid+1) - 1;
                int bi = offset*(2*thid+2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

Same computation as the pseudo-code, but a different assignment of work to threads.

Up-Sweep: Who Does What
[Diagram: threads t0–t7 mapped onto the tree levels d = 8, 4, 2, 1]

Up-Sweep: Who Does What
For n = 16 (ai is read, bi is read and written):

    thid   d=8, offset=1   d=4, offset=2   d=2, offset=4   d=1, offset=8
    0      ai 0,  bi 1     ai 1,  bi 3     ai 3,  bi 7     ai 7, bi 15
    1      ai 2,  bi 3     ai 5,  bi 7     ai 11, bi 15
    2      ai 4,  bi 5     ai 9,  bi 11
    3      ai 6,  bi 7     ai 13, bi 15
    4      ai 8,  bi 9
    5      ai 10, bi 11
    6      ai 12, bi 13
    7      ai 14, bi 15

Down-Sweep

        // clear the last element
        if (thid == 0)
            temp[n - 1] = 0;
        // traverse down the tree & build the scan
        for (int d = 1; d < n; d *= 2)
        {
            offset >>= 1;
            __syncthreads();
            if (thid < d)
            {
                int ai = offset*(2*thid+1) - 1;
                int bi = offset*(2*thid+2) - 1;
                float t  = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }

Down-Sweep: Who Does What
For n = 16:

    thid   d=1, offset=8   d=2, offset=4   d=4, offset=2   d=8, offset=1
    0      ai 7, bi 15     ai 3,  bi 7     ai 1,  bi 3     ai 0,  bi 1
    1                      ai 11, bi 15    ai 5,  bi 7     ai 2,  bi 3
    2                                      ai 9,  bi 11    ai 4,  bi 5
    3                                      ai 13, bi 15    ai 6,  bi 7
    4                                                      ai 8,  bi 9
    5                                                      ai 10, bi 11
    6                                                      ai 12, bi 13
    7                                                      ai 14, bi 15

Copy to Output
All threads do:

        __syncthreads();
        // write results to global memory
        g_odata[2*thid]   = temp[2*thid];
        g_odata[2*thid+1] = temp[2*thid+1];
    }
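Since each thread handles two elements, a launch sketch for a single block (d_out and d_in are illustrative names; n a power of two):

    // n/2 threads, n floats of shared memory
    prescan<<<1, n/2, n * sizeof(float)>>>(d_out, d_in, n);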

Bank Conflicts
The current scan implementation has many shared-memory bank conflicts:
– These really hurt performance on the hardware
– They occur when multiple threads access different addresses in the same shared-memory bank
– There is no penalty if all threads access different banks, or if all threads access the exact same address
– A conflicting access costs 2*M cycles, where M is the maximum number of threads accessing a single bank

Loading from Global Memory to Shared
Each thread loads two shared-memory data elements. The original code interleaves the loads:

    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

Threads (0, 1, 2, …, 8, 9, 10, …) hit banks (0, 2, 4, …, 0, 2, 4, …): two-way conflicts.
It is better to load one element from each half of the array:

    temp[thid]         = g_idata[thid];
    temp[thid + (n/2)] = g_idata[thid + (n/2)];

Bank Conflicts in the Tree Algorithm / Up-Sweep
When we build the sums, each thread reads two shared-memory locations and writes one:
– Threads 0 and 8 access bank 0
First iteration: 2 threads access each of 8 banks.
[Diagram: threads t0–t9; like-colored arrows represent simultaneous accesses to the same bank]

Bank Conflicts in the Tree Algorithm / Up-Sweep
– Threads 1 and 9 access bank 2, and so on
[Diagram: same first-iteration access pattern, highlighting threads t1 and t9]

Bank Conflicts in the Tree Algorithm / Up-Sweep (2nd Iteration)
The 2nd iteration is even worse:
– 4-way bank conflicts; for example, threads 0, 4, 8, 12 access bank 1, threads 1, 5, 9, 13 access bank 5, etc.
2nd iteration: 4 threads access each of 4 banks.
[Diagram: threads t0–t4; like-colored arrows represent simultaneous accesses to the same bank]

Using Padding to Prevent Conflicts
We can use padding to prevent bank conflicts:
– Just add a word of padding every 16 words
[Diagram: shared-memory layout with a pad word inserted after every 16 words, shifting subsequent words onto different banks]

Using Padding to Remove Conflicts
After you compute a shared-memory address like this:

    address = 2 * stride * thid;

add padding like this:

    address += (address >> 4);  // i.e., address / 16

This removes most bank conflicts:
– Not all, in the case of deep trees
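GPU Gems 3 (Ch. 39) packages this index adjustment as a macro; a sketch for 16-bank hardware, applied to the two loads per thread (the padded-sizing note is an assumption, not the SDK's exact code):

    #define NUM_BANKS 16
    #define LOG_NUM_BANKS 4
    #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

    int ai = thid, bi = thid + (n/2);
    temp[ai + CONFLICT_FREE_OFFSET(ai)] = g_idata[ai];
    temp[bi + CONFLICT_FREE_OFFSET(bi)] = g_idata[bi];

The shared array must then be sized with the pad words included, roughly n + n/NUM_BANKS floats, and every internal index adjusted the same way.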

Scan Bank Conflicts (1)
A full binary tree with 64 leaf nodes:
– Multiple 2- and 4-way bank conflicts
Shared-memory cost for the whole tree:
– 1 32-thread warp = 6 cycles per thread without conflicts, counting 2 shared-memory reads and one write (s[a] += s[b])
– 6 * (…) = 102 cycles with the conflicts above
– 36 cycles if there were no bank conflicts (6 * 6)

Scan Bank Conflicts (2)
It's much worse with bigger trees. A full binary tree with 128 leaf nodes:
– Only the last 6 iterations shown (root and the 5 levels below it)
Cost for the whole tree:
– 12*2 + 6*(…) = 186 cycles
– 48 cycles if there were no bank conflicts: 12*1 + (6*6)

Scan Bank Conflicts (3)
A full binary tree with 512 leaf nodes:
– Only the last 6 iterations shown (root and the 5 levels below it)
Cost for the whole tree:
– 48*2 + 24*4 + 12*8 + 6*(…) = 570 cycles
– 120 cycles if there were no bank conflicts!

Fixing Scan Bank Conflicts
Insert padding every NUM_BANKS elements:

    const int LOG_NUM_BANKS = 4; // 16 banks
    int tid = threadIdx.x;
    int s = 1;
    // traversal from the leaves up to the root
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int a = s*(2*tid);
            int b = s*(2*tid+1);
            a += (a >> LOG_NUM_BANKS); // insert pad word
            b += (b >> LOG_NUM_BANKS); // insert pad word
            shared[a] += shared[b];
        }
        s *= 2;
    }

Fixing Scan Bank Conflicts
A full binary tree with 64 leaf nodes:
– No more bank conflicts
– However, there is ~8 cycles of addressing overhead per s[a] += s[b] (8 cycles/iter. * 6 iter. = 48 extra cycles)
– So it's just barely worth the overhead on a small tree: 84 cycles vs. 102 with conflicts vs. 36 optimal

Fixing Scan Bank Conflicts
A full binary tree with 128 leaf nodes:
– Only the last 6 iterations shown (root and the 5 levels below it)
No more bank conflicts!
– A significant performance win: 106 cycles vs. 186 with bank conflicts vs. 48 optimal

Fixing Scan Bank Conflicts
A full binary tree with 512 leaf nodes:
– Only the last 6 iterations shown (root and the 5 levels below it)
Wait, we still have bank conflicts:
– The method is not foolproof, but still much improved
– 304 cycles vs. 570 with bank conflicts vs. 120 optimal

Fixing Scan Bank Conflicts
It's possible to remove all bank conflicts:
– Just do multi-level padding
– Example: two-level padding:

    const int LOG_NUM_BANKS = 4; // 16 banks on G80
    int tid = threadIdx.x;
    int s = 1;
    // traversal from the leaves up to the root
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int a = s*(2*tid);
            int b = s*(2*tid+1);
            int offset = (a >> LOG_NUM_BANKS);        // first level
            a += offset + (offset >> LOG_NUM_BANKS);  // second level
            offset = (b >> LOG_NUM_BANKS);            // first level
            b += offset + (offset >> LOG_NUM_BANKS);  // second level
            temp[a] += temp[b];
        }
        s *= 2;
    }

Fixing Scan Bank Conflicts
A full binary tree with 512 leaf nodes:
– Only the last 6 iterations shown (root and the 5 levels below it)
– No bank conflicts, but an extra cycle of overhead per address calculation
Not worth it: 440 cycles vs. 304 with 1-level padding:
– With 1-level padding, bank conflicts only occur in warp 0, so the remaining conflict cost is very small, and removing it slows down all the other warps

Large Arrays
So far, the array had to fit in a single block's shared memory (e.g., 1024 elements). For larger arrays (see the sketch after this list):
– Divide the array into blocks
– Scan each with a block of threads, producing partial scans and one sum per block
– Scan the per-block sums
– Add the corresponding scanned sum back to all elements of each block
See "Scan Large Array" in the SDK.
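A hedged host-side sketch of this three-phase structure; the kernel names, the add_block_sums helper, and the shared-memory sizes are illustrative, not the SDK's actual API:

    // 1. scan each block; each block also writes its total into d_blockSums
    prescan_blocks<<<numBlocks, blockSize/2, smemBytes>>>(d_out, d_in, d_blockSums);
    // 2. scan the per-block totals (assumes numBlocks fits in one block)
    prescan_blocks<<<1, numBlocks/2, smemBytes2>>>(d_sumsScanned, d_blockSums, 0);
    // 3. add each block's scanned total to every element of that block
    add_block_sums<<<numBlocks, blockSize>>>(d_out, d_sumsScanned);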

Large Arrays

Application: Stream Compaction
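Stream compaction uses scan directly: compute a 0/1 flag per element from a predicate, exclusive-scan the flags to get each kept element's output address, then scatter. A minimal scatter-kernel sketch, assuming the flags and their exclusive scan (addr) were computed beforehand; all names are illustrative:

    __global__ void compact(float *out, const float *in,
                            const int *flags, const int *addr, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i])      // addr[i] = exclusive scan of flags
            out[addr[i]] = in[i];   // scatter kept element to its slot
    }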

Application: Radix Sort
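Radix sort likewise builds on scan: each pass performs a stable split on one key bit, with every element's destination derived from a scan of bit flags. A sketch of one pass's address computation, following the split primitive in GPU Gems 3, Ch. 39 (names are illustrative):

    // e[i]   = 1 if bit b of key[i] is 0, else 0
    // f      = exclusive scan of e
    // totalFalses = e[n-1] + f[n-1]
    // dest[i] = e[i] ? f[i] : i - f[i] + totalFalses
    // scatter key[i] to dest[i]; repeat for each bit b, low to high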

Using Streams to Overlap Kernels with Data Transfers
Stream:
– A queue of ordered CUDA requests
By default, all CUDA requests go to the same stream.
Create a stream:
– cudaStreamCreate(cudaStream_t *stream)

Overlapping Kernels

    cudaMemcpyAsync(dA, hA, sizeA, cudaMemcpyHostToDevice, streamA);
    cudaMemcpyAsync(dB, hB, sizeB, cudaMemcpyHostToDevice, streamB);
    Kernel<<<grid, block, 0, streamA>>>(dAo, dA, sizeA);
    Kernel<<<grid, block, 0, streamB>>>(dBo, dB, sizeB);
    cudaMemcpyAsync(hAo, dAo, sizeA, cudaMemcpyDeviceToHost, streamA);
    cudaMemcpyAsync(hBo, dBo, sizeB, cudaMemcpyDeviceToHost, streamB);
    cudaThreadSynchronize();

Note: for the async copies to overlap with kernel execution, the host buffers must be page-locked (allocated with cudaMallocHost).