CUDA Lecture 7: Reductions and Their Implementation
© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Slide 2: Parallel Reductions
Simple array reductions reduce all of the data in an array to a single value that carries some information about the entire array.
- Sum, maximum element, minimum element, etc.
Used in lots of applications, although not always in parallel form.
- Matrix multiplication is essentially performing a sum reduction over the element-wise product of two vectors for each output element, but that sum is computed by a single thread.
Assumes that the operator used in the reduction is associative.
- Technically not true for things like addition on floating-point numbers, but it is common to pretend that it is.
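
As a concrete illustration of such a reduction (a sketch, not part of the original slides; the names blockReduceSum and partial are made up), each thread block can reduce one tile of the input in shared memory and write a single partial sum, assuming blockDim.x is a power of two:

    // Per-block sum reduction sketch. Threads past the end of the input load 0,
    // and each block writes one partial sum to blockSums.
    __global__ void blockReduceSum(const float *in, float *blockSums, int n)
    {
        extern __shared__ float partial[];          // one float per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0.0f;      // load one element per thread
        __syncthreads();

        // Tree reduction in shared memory: halve the number of active threads
        // each step until partial[0] holds the block's sum.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = partial[0];
    }

A launch such as blockReduceSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_blockSums, n) leaves one partial sum per block; combining those partial sums is revisited on slide 28.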

Slide 3: Parallel Prefix Sum (Scan)
Definition: the all-prefix-sums operation takes a binary associative operator ⊕ with identity I and an array of n elements [a0, a1, ..., a(n-1)], and returns the ordered set [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-2))].
Example: if ⊕ is addition, then scan on the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22]. (From Blelloch, 1990, "Prefix Sums and Their Applications".)
Each element of the result is the reduction of all of the elements that precede it.

Slide 4: Relevance of Scan
Scan is a simple and useful parallel building block. It converts recurrences from sequential form:
    for (j = 1; j < n; j++) out[j] = out[j-1] + f(j);
into parallel form:
    forall (j) { temp[j] = f(j); }
    scan(out, temp);
It is useful for many parallel algorithms: radix sort, quicksort, string comparison, lexical analysis, stream compaction, polynomial evaluation, solving recurrences, tree operations, histograms, etc.
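
As an illustrative sketch of that conversion (not from the slides; f here is a placeholder device function), the only parallel-unfriendly part of the loop is the dependence on out[j-1], and that is exactly what the scan primitive absorbs:

    __device__ float f(int j) { return (float)j; }   // placeholder for the real f

    // Step 1: evaluate f(j) for every j independently -- perfectly parallel.
    __global__ void evalF(float *temp, int n)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < n)
            temp[j] = f(j);
    }

    // Step 2: scanning temp reproduces the running sums that the sequential
    // loop out[j] = out[j-1] + f(j) would have built up one element at a time.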

Slide 5: Example: Application of Scan
(Table on slide: per-thread Tid, Cnt, and Scan values; the numbers did not survive extraction.)
Computing indexes into a global array where each thread needs space for a dynamic number of elements:
- Each thread computes the number of elements it will produce.
- The scan of the thread element counts then determines the beginning index for each thread.
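
An illustrative host-side version of this pattern (not from the slides; the counts are made-up example values):

    #include <stdio.h>

    int main(void)
    {
        int cnt[8] = {2, 0, 3, 1, 0, 4, 2, 1};   // elements each thread will produce
        int offset[8];

        offset[0] = 0;                            // exclusive scan of the counts
        for (int i = 1; i < 8; ++i)
            offset[i] = offset[i-1] + cnt[i-1];

        for (int i = 0; i < 8; ++i)
            printf("thread %d writes %d element(s) starting at index %d\n",
                   i, cnt[i], offset[i]);
        return 0;                                 // total output size = offset[7] + cnt[7]
    }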

Slide 6: Scan on the CPU
Just add each element to the sum of the elements before it. Trivial, but sequential. It performs exactly n-1 adds, essentially the minimum possible amount of work.

    void scan(float *scanned, float *input, int length)
    {
        scanned[0] = 0;
        for (int i = 1; i < length; ++i) {
            scanned[i] = input[i-1] + scanned[i-1];
        }
    }
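
A quick usage check of the routine above (illustrative; the input values match the scan example on slide 3):

    float in[8]  = {3, 1, 7, 0, 4, 1, 6, 3};
    float out[8];
    scan(out, in, 8);
    // out is now {0, 3, 4, 11, 11, 15, 16, 22}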

Slide 7: A First-Attempt Parallel Scan Algorithm
1. Read from the input into a temporary array we can work on in place. Set the first element to zero and shift the others right by one.
Each thread reads one value from the input array In in device memory into the shared memory array T0. Thread 0 writes 0 into the shared memory array.
(Figure: the input array In and the shifted shared-memory array T.)

Slide 8: A First-Attempt Parallel Scan Algorithm
Iteration #1, stride = 1. Active threads: those with index from stride to n-1 (n-stride threads). Thread j adds elements j and j-stride of T.
1. Read the input from device memory into a temporary array (shared memory in CUDA). Set the first element to zero and shift the others right.
2. Iterate log(n) times: threads from stride to n-1 add pairs of elements stride elements apart. Double the stride at each iteration.
(Figure: In and T before and after the stride-1 pass.)

Slide 9: A First-Attempt Parallel Scan Algorithm
Iteration #2, stride = 2.
1. Read the input from device memory into a temporary array (shared memory in CUDA). Set the first element to zero and shift the others right.
2. Iterate log(n) times: threads from stride to n-1 add pairs of elements stride elements apart. Double the stride at each iteration.
Note that threads need to synchronize before they read and before they write to T. Double-buffering can allow them to synchronize only after writing.
(Figure: In and T after the stride-1 and stride-2 passes.)

Slide 10: A First-Attempt Parallel Scan Algorithm
Iteration #3, stride = 4.
1. Read the input from device memory to shared memory. Set the first element to zero and shift the others right by one.
2. Iterate log(n) times: threads from stride to n-1 add pairs of elements stride elements apart. Double the stride at each iteration.
After each iteration, each array element contains the sum of the previous 2*stride elements of the original array.
(Figure: In and T after the stride-1, stride-2, and stride-4 passes.)

Slide 11: A First-Attempt Parallel Scan Algorithm
1. Read the input from device memory to shared memory. Set the first element to zero and shift the others right by one.
2. Iterate log(n) times: threads from stride to n-1 add pairs of elements stride elements apart. Double the stride at each iteration.
3. Write the output.
(Figure: In, the intermediate T arrays for each stride, and the output array Out.)
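
Collecting the three steps into one single-block kernel gives the following sketch (illustrative, not the slides' exact code; the name naiveScan is made up). It assumes n equals blockDim.x, n is a power of two, and the kernel is launched with 2*n*sizeof(float) bytes of dynamic shared memory so that the double-buffering mentioned on slide 9 fits:

    __global__ void naiveScan(const float *in, float *out, int n)
    {
        extern __shared__ float buf[];        // 2*n floats: two n-element buffers
        int tid = threadIdx.x;
        int pout = 0, pin = 1;

        // Step 1: load the shifted input; element 0 becomes the identity (0).
        buf[pout*n + tid] = (tid > 0) ? in[tid - 1] : 0.0f;
        __syncthreads();

        // Step 2: log(n) passes, doubling the stride each time. Double
        // buffering means threads only synchronize after writing.
        for (int stride = 1; stride < n; stride *= 2) {
            pout = 1 - pout;                  // swap the roles of the buffers
            pin  = 1 - pout;
            if (tid >= stride)
                buf[pout*n + tid] = buf[pin*n + tid] + buf[pin*n + tid - stride];
            else
                buf[pout*n + tid] = buf[pin*n + tid];
            __syncthreads();
        }

        // Step 3: write the output.
        out[tid] = buf[pout*n + tid];
    }

A launch of the form naiveScan<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n) scans one block's worth of data; its work-efficiency problems are the topic of the next slide.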

Slide 12: Work Efficiency Considerations
The first-attempt scan executes log(n) parallel iterations:
- The steps do at least n/2 operations each, for log(n) steps.
- Total adds: O(n*log(n)) work.
This scan algorithm is not very efficient on finite resources:
- Presumably, if you have n or more parallel processors, the number of steps matters more than the number of operations.
- For larger reductions, finite resources get their workload multiplied by a factor of log(n) compared to a sequential implementation. log(1024) = 10, so this gets bad very quickly.
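
To put numbers on this (an illustrative calculation, not on the original slide): for n = 1024 the naive scan performs at least (n/2) * log2(n) = 512 * 10 = 5120 adds, whereas the sequential scan needs only n - 1 = 1023 and the work-efficient scan developed next needs about 2 * (n - 1) = 2046.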

Slide 13: Improving Efficiency
A common parallel algorithm pattern: balanced trees.
- Build a balanced binary tree on the input data and sweep it to and from the root.
- The tree is not an actual data structure, but a concept used to determine what each thread does at each step.
For scan:
- Traverse down from the leaves to the root, building partial sums at internal nodes of the tree. The root holds the sum of all leaves.
- Traverse back up the tree, building the scan from the partial sums.

Slide 14: Build the Sum Tree
Assume the array is already in the temporary array T.
(Figure: the initial contents of T.)

Slide 15: Build the Sum Tree
Iteration 1: stride 1, n/2 threads.
Iterate log(n) times; each thread adds the value stride elements away to its own value. Each addition shown corresponds to a single thread.
(Figure: T before and after the stride-1 pass.)

Slide 16: Build the Sum Tree
Iteration 2: stride 2, n/4 threads.
Iterate log(n) times; each thread adds the value stride elements away to its own value. Each addition shown corresponds to a single thread.
(Figure: T after the stride-1 and stride-2 passes.)

Slide 17: Build the Sum Tree
Iteration log(n): 1 thread.
Iterate log(n) times; each thread adds the value stride elements away to its own value. After the step with stride k, the elements whose index i has i+1 divisible by 2k hold the partial sum of themselves and the preceding 2k-1 elements. Each addition shown corresponds to a single thread.
(Figure: T after the stride-1, stride-2, and stride-4 passes.)

Slide 18: Partial Sum Array
(Figure: the index row Idx, the partial-sum array T, and the total Sum.)
Index k holds the partial sum of the elements from j to k, where j is the greatest index less than or equal to k that is divisible by the greatest power of 2 by which k+1 is divisible, or 0 if there is none. Trust me, it works.
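
A quick check of that rule (an illustrative example, not on the slide), using 0-based indexes and n = 8: for k = 5, k+1 = 6, the greatest power of 2 dividing 6 is 2, and the greatest index <= 5 divisible by 2 is j = 4, so index 5 holds a4 + a5; for k = 3, k+1 = 4 gives j = 0, so index 3 holds a0 + a1 + a2 + a3. Both match what the sum-tree steps above actually compute.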

Slide 19: Zero the Last Element
(Figure: Idx, T with the last element set to zero, and the total Sum.)
It's an exclusive scan, so we don't need it; it will propagate back to the beginning in the upcoming steps.

Slide 20: Build Scan From Partial Sums
(Figure: T at the start of the down-sweep, with the last element zeroed.)

Slide 21: Build Scan From Partial Sums
Iteration 1: 1 thread, stride 4.
Iterate log(n) times; each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value. Each operation shown corresponds to a single thread.
(Figure: T before and after the stride-4 down-sweep step.)

Slide 22: Build Scan From Partial Sums
Iteration 2: 2 threads, stride 2.
Iterate log(n) times; each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value. Each operation shown corresponds to a single thread.
(Figure: T after the stride-4 and stride-2 down-sweep steps.)

Slide 23: Build Scan From Partial Sums
Iteration log(n): n/2 threads, stride 1.
Done! We now have a completed scan that we can write to the output.
Total steps: 2 * log(n). Total operations: 2 * (n-1) adds. Work efficient!
(Figure: T after the stride-4, stride-2, and stride-1 down-sweep steps.)
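
Pulling the up-sweep, zero-the-last-element, and down-sweep steps together gives the following single-block sketch (illustrative, not the slides' exact code; the name workEfficientScan is made up). It assumes n is a power of two, the kernel is launched with n/2 threads and n*sizeof(float) bytes of dynamic shared memory, and it ignores the bank-conflict padding discussed on the next slides:

    __global__ void workEfficientScan(const float *in, float *out, int n)
    {
        extern __shared__ float T[];
        int tid = threadIdx.x;

        // Each of the n/2 threads loads two elements.
        T[2*tid]     = in[2*tid];
        T[2*tid + 1] = in[2*tid + 1];

        // Up-sweep: build the sum tree in place (slides 14-17).
        int offset = 1;
        for (int d = n >> 1; d > 0; d >>= 1) {
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2*tid + 1) - 1;
                int bi = offset * (2*tid + 2) - 1;
                T[bi] += T[ai];
            }
            offset <<= 1;
        }

        // Zero the last element (the root) for an exclusive scan (slide 19).
        if (tid == 0)
            T[n - 1] = 0.0f;

        // Down-sweep: rebuild the scan from the partial sums (slides 20-23).
        for (int d = 1; d < n; d <<= 1) {
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2*tid + 1) - 1;
                int bi = offset * (2*tid + 2) - 1;
                float t = T[ai];
                T[ai]  = T[bi];     // pass the running prefix down to the left child
                T[bi] += t;         // right child gets prefix plus the left partial sum
            }
        }
        __syncthreads();

        out[2*tid]     = T[2*tid];
        out[2*tid + 1] = T[2*tid + 1];
    }

A launch of the form workEfficientScan<<<1, n/2, n * sizeof(float)>>>(d_in, d_out, n) matches the counts on this slide: 2 * log(n) steps and about 2 * (n-1) adds.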

Slide 24: Shared Memory Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast cases:
- All 16 threads of a half-warp access different banks: no bank conflict.
- All 16 threads of a half-warp access the same address: broadcast.
The slow case:
- Multiple threads in the same half-warp access different values in the same bank.
- The accesses must be serialized.
- Cost = the maximum number of values requested from any one of the 16 banks.

Slide 25: Bank Addressing Examples
(Figure: two thread-to-bank mappings. Linear addressing with stride 2 produces 2-way bank conflicts; linear addressing with stride 8 produces 8-way bank conflicts.)

Slide 26: Use Padding to Reduce Conflicts
This is a simple modification to the indexing. After you compute a shared memory address like this:
    Address = 2 * stride * thid;
add padding like this:
    Address += (Address / 16);   // divide by NUM_BANKS
This removes most bank conflicts. Not all of them, in the case of deep trees, but good enough for us.
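
A quick illustration (not on the slide): with 16 banks and stride 1, thid = 0 and thid = 8 compute Address = 0 and Address = 16, which both fall in bank 0; after adding Address / 16, thread 8 uses address 17, which falls in bank 1, and the conflict disappears.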

Slide 27: Fixing Scan Bank Conflicts
Insert padding every NUM_BANKS elements:

    const int LOG_NUM_BANKS = 4;          // 16 banks on G80
    int tid = threadIdx.x;
    int s = 1;
    // Traversal from the leaves up to the root
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int a = s * (2*tid);
            int b = s * (2*tid + 1);
            a += (a >> LOG_NUM_BANKS);    // insert pad word
            b += (b >> LOG_NUM_BANKS);    // insert pad word
            shared[a] += shared[b];
        }
        s <<= 1;
    }
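
One practical consequence (not stated on the slide): with this padding the shared array needs roughly n + n/16 entries rather than n, since the largest padded index is about (n-1) + (n-1)/16.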

Slide 28: What About Really Big Arrays?
What if the array doesn't fit in shared memory? After all, we care about parallelism because we have big problems, right?
Tiled reduction with global synchronization:
1. Tile the input and perform reductions on the tiles with individual thread blocks.
2. Store the intermediate result from each block back to global memory to be the input for the next kernel.
3. Repeat as necessary.
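
A host-side sketch of that loop (illustrative; it reuses the blockReduceSum kernel sketched after slide 2 and ping-pongs between two device buffers so no block reads data that another block is still writing):

    // d_in holds the n inputs; d_tmp must have room for one partial sum per
    // block. Returns the buffer whose element 0 holds the final total.
    float *reduceLarge(float *d_in, float *d_tmp, int n, int blockSize)
    {
        while (n > 1) {
            int numBlocks = (n + blockSize - 1) / blockSize;
            // One kernel per pass: ending the kernel is the global barrier
            // that makes every block's partial sum visible to the next pass.
            blockReduceSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(
                d_in, d_tmp, n);
            // The partial sums become the input of the next pass.
            float *swap = d_in; d_in = d_tmp; d_tmp = swap;
            n = numBlocks;
        }
        return d_in;   // d_in[0] on the device now holds the total
    }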

Slide 29: Building the Global Sum Tree
Iterate log(n) times; each thread adds the value stride elements away to its own value. After the step with stride k, the elements whose index i has i+1 divisible by 2k hold the partial sum of themselves and the preceding 2k-1 elements.
Compact the per-block results and globally synchronize, then continue.
(Figure: per-tile sum trees built with strides 1 and 2, then compacted into a smaller array T for the next pass.)

Slide 30: Global Synchronization in CUDA
Remember, there is no barrier synchronization between CUDA thread blocks.
- You can have some limited communication through atomic integer operations on global memory on newer devices.
- That doesn't conveniently address the global reduction problem, or lots of other problems.
To synchronize, you need to end the kernel (have all thread blocks complete), then launch a new one.