ME964 High Performance Computing for Engineering Applications CUDA Arithmetic Support A Software Design Exercise: Prefix Scan Oct. 14, 2008

Before we get started…

Last Time
- Gauging the extent to which you use hardware resources in CUDA
  - The occupancy calculator and the "--ptxas-options=-v" compile option
- Control flow in CUDA
  - The issue of divergent warps
  - Predicated instructions

Today
- CUDA arithmetic support
  - Performance implications
- A software design exercise: prefix scan
  - Preamble to the topic of software design patterns
  - Helpful with your assignment

2

Runtime Math Library

There are two types of runtime math operations:
- __func(): direct mapping to hardware
  - Fast but lower accuracy (see the programming guide for details)
  - Examples: __sinf(x), __expf(x), __powf(x,y)
- func(): compiles to multiple instructions
  - Slower but higher accuracy (5 ulp or less)
  - Examples: sinf(x), expf(x), powf(x,y)
- The -use_fast_math compiler option forces every func() to compile to __func()

ulp(x), Unit in the Last Place (or Unit of Least Precision): the gap between the two floating-point numbers closest to the value x. The amount of error in the evaluation of a floating-point operation is often expressed in ulp; an average error of 1 ulp is generally seen as tolerable. Equivalently, the results are multiplied by a power of the base such that an error in the last significant place is an error of 1.

3 HK-UIUC
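
To make the distinction concrete, here is a minimal kernel sketch (not from the original slides; the pointer names g_in, g_out_accurate and g_out_fast are placeholders) contrasting the accurate library call with the fast intrinsic:

__global__ void sin_accuracy_demo(float *g_out_accurate, float *g_out_fast,
                                  const float *g_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        g_out_accurate[i] = sinf(g_in[i]);    // multiple instructions, higher accuracy
        g_out_fast[i]     = __sinf(g_in[i]);  // maps to hardware, fast but less accurate
    }
}

When the code is built with -use_fast_math, the sinf() call above is itself compiled to __sinf(), so the two outputs coincide.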

Arithmetic Instruction Throughput in CUDA

- int and float add, shift, min, max and float mul, mad: 4 cycles per warp
  - int multiply (*) is by default 32-bit and requires multiple cycles per warp
  - Use the __mul24()/__umul24() intrinsics for a 4-cycle 24-bit int multiply
- Integer divide and modulo are expensive
  - The compiler will convert literal power-of-2 divides to shifts
  - Keep this in mind, and be explicit in cases where the compiler can't tell that the divisor is a power of 2!
  - Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below)

4 HK-UIUC
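
As a hedged illustration of the two tricks just mentioned (the kernel and buffer names are made up for this sketch), the 24-bit multiply and the power-of-2 modulo trick look like this in practice:

__global__ void index_tricks(int *g_out, const int *g_in, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int scaled = __mul24(g_in[tid], 3);  // 4-cycle 24-bit multiply (operands must fit in 24 bits)
        int lane   = tid & (256 - 1);        // tid % 256, written so no divide/modulo is emitted
        g_out[tid] = scaled + lane;
    }
}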

Arithmetic Instruction Throughput

- Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
  - These are the versions prefixed with "__"
  - Examples: __rcp(), __sin(), __exp()
- Other functions are combinations of the above
  - y / x == rcp(x) * y costs 20 cycles per warp
  - sqrt(x) == rcp(rsqrt(x)) costs 32 cycles per warp

5 HK-UIUC

Make your program float-safe!

- Some existing hardware (e.g., the GTX 280) already has double precision support; G80 is single-precision only
  - Double precision comes at a performance cost
  - Careless use of double or of undeclared types may run more slowly on cards with double precision support
- It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed
  - Add the 'f' suffix to float literals:
      foo = bar * 0.123;    // double assumed
      foo = bar * 0.123f;   // float explicit
  - Use the float version of standard library functions:
      foo = sin(bar);    // double assumed
      foo = sinf(bar);   // single precision explicit

6 HK-UIUC
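
A minimal float-safe kernel sketch (illustrative names, not from the slides) that puts the two rules together:

__global__ void scale_and_sin(float *g_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float bar = g_data[i];
        // float foo = bar * 0.123;   // double literal: would force a double-precision multiply
        float foo = bar * 0.123f;     // float literal: stays in single precision
        g_data[i] = sinf(foo);        // sinf, not sin: single-precision library call
    }
}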

Deviations from IEEE-754

- Addition and multiplication are IEEE 754 compliant
  - Maximum 0.5 ulp (units in the last place) error
  - However, they are often combined into a multiply-add (FMAD), whose intermediate result is truncated
- Division is non-compliant (2 ulp)
- Not all rounding modes are supported
- No mechanism to detect floating-point exceptions

(The error is measured in ulp, units in the last place: the values of the results are multiplied by a power of the base such that an error in the last significant place is an error of 1.)

7 HK-UIUC

GPU Floating Point Features (G80 vs. SSE, IBM Altivec, Cell SPE)

- Precision: IEEE 754 (all four)
- Rounding modes for FADD and FMUL: GPU round to nearest and round to zero; SSE all 4 IEEE modes (round to nearest, zero, +inf, -inf); Altivec round to nearest only; Cell SPE round to zero/truncate only
- Denormal handling: GPU flush to zero; SSE supported, at a cost of 1000's of cycles
- NaN support: GPU yes; Cell SPE no
- Overflow and Infinity support: GPU yes, only clamps to max norm; Cell SPE no, infinity only
- Flags: Cell SPE some
- Square root: GPU software only; SSE hardware
- Division:
- Reciprocal estimate accuracy: GPU 24 bit; SSE 12 bit
- Reciprocal sqrt estimate accuracy: GPU 23 bit
- log2(x) and 2^x estimates accuracy:

Streaming SIMD Extensions (SSE) is a SIMD (Single Instruction, Multiple Data) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow! (which had debuted a year earlier). SSE contains 70 new instructions.

8 HK-UIUC

End: CUDA Arithmetic Support
Begin: Software Design Exercise ~ Prefix Sum ~

9

Objective

- Putting your CUDA knowledge to work
  - The vehicle for today's software design exercise is the parallel implementation of a prefix sum operation
  - Recall the first assignment; it is also the topic of the current assignment
- Understand that:
  - Different algorithmic designs lead to different performance levels
  - Different constraints dominate in different applications and/or design solutions
  - Case studies help to establish intuition, idioms and ideas
- Point out parallel algorithm patterns that can result in superior performance
  - Understand that there are patterns and it's worth being aware of them
  - If you want, these are the tricks of the trade
  - When considering patterns, you can't lose sight of the underlying hardware

10

Software for Parallel Computers

Three broad approaches:
- You rely on the compiler to figure out the parallelism in a piece of code and then map it to the underlying hardware
  - VERY hard; the holy grail in parallel computing
- You rely on parallel libraries built for a specific underlying hardware
  - Very convenient; the way to go when such libraries are available
- You rely on language extensions to facilitate the process of generating a parallel executable
  - This is where you are with CUDA

11

Parallel Prefix Sum (Scan)

Definition: the all-prefix-sums operation takes a binary associative operator ⊕ with identity I, and an array of n elements
    [a_0, a_1, …, a_(n-1)]
and returns the ordered set
    [I, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_(n-2))].

Example: if ⊕ is addition, then scan on the set
    [3 1 7 0 4 1 6 3]
returns the set
    [0 3 4 11 11 15 16 22]

Exclusive scan: the last input element is not included in the result.

(From Blelloch, 1990, "Prefix Sums and Their Applications")

12

Applications of Scan

- Scan is a simple and useful parallel building block
- Convert recurrences from sequential:
      for(j=1; j<n; j++)
          out[j] = out[j-1] + f(j);
  into parallel:
      forall(j) in parallel
          temp[j] = f(j);
      scan(out, temp);
- Useful in the implementation of several parallel algorithms:
  - radix sort
  - quicksort
  - string comparison
  - lexical analysis
  - stream compaction (see the sketch below)
  - polynomial evaluation
  - solving recurrences
  - tree operations
  - histograms
  - etc.

13 HK-UIUC
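
As a quick illustration of one item on this list, here is a sequential sketch of stream compaction (the function name and the predicate are made up for this example): an exclusive scan of the 0/1 keep-flags gives each surviving element its output position, which is exactly what a parallel implementation exploits.

void compact(float *out, int *n_out, const float *in, int n)
{
    int flags[n], pos[n];                          // C99 variable-length arrays, for brevity
    for (int i = 0; i < n; ++i)
        flags[i] = (in[i] > 0.0f) ? 1 : 0;         // predicate: keep positive elements
    pos[0] = 0;                                    // exclusive scan of the flags
    for (int i = 1; i < n; ++i)
        pos[i] = pos[i-1] + flags[i-1];
    for (int i = 0; i < n; ++i)
        if (flags[i]) out[pos[i]] = in[i];         // scatter survivors to their slots
    *n_out = pos[n-1] + flags[n-1];                // number of surviving elements
}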

Scan on the CPU

void scan(float* scanned, float* input, int length)
{
    scanned[0] = 0;
    for (int i = 1; i < length; ++i)
        scanned[i] = scanned[i-1] + input[i-1];
}

- Just add each element to the sum of the elements before it
- Trivial, but sequential
- Exactly n-1 adds (n-1 rather than n): optimal in terms of work efficiency

14

Parallel Scan Algorithm: Solution One, Hillis & Steele (1986)

- Note that an implementation of the algorithm shown in the picture requires two buffers of length n (shown is the case n = 8 = 2^3)
- Assumption: the number n of elements is a power of 2: n = 2^M

15
Picture courtesy of Mark Harris

The Plain English Perspective

- First iteration: I go with stride 1 = 2^0
  - Start at the last element, x[2^M - 1], and apply this stride to all the array elements before it to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 1 additions
- Second iteration: I go with stride 2 = 2^1
  - This means that I have 2^M - 2^1 additions
- Third iteration: I go with stride 4 = 2^2
  - This means that I have 2^M - 2^2 additions
- … (and so on)

16

The Plain English Perspective

- Consider the kth iteration (k is some arbitrary valid integer): I go with stride 2^(k-1)
  - Start at the last element, x[2^M - 1], and apply this stride to all the array elements before it to find the mate of each of them. When looking for the mate, the stride should not land you before the beginning of the array. The sum replaces the element of higher index.
  - This means that I have 2^M - 2^(k-1) additions
- …
- Mth iteration: I go with stride 2^(M-1)
  - This means that I have 2^M - 2^(M-1) additions
- NOTE: there is no (M+1)th iteration, since applying an offset of 2^M to the last element, &x[2^M - 1], would place you right before the beginning of the array (not good…)

17

Hillis & Steele Parallel Scan Algorithm Algorithm looks like this: for d := 0 to M-1 do forall k in parallel do if k – 2d ≥0 then x[out][k] := x[in][k] + x[in][k − 2d] else x[out][k] := x[in][k] endforall swap(in,out) endfor Double-buffered version of the sum scan 18

Operation Count: Final Considerations

- The number of operations tally:
      (2^M - 2^0) + (2^M - 2^1) + … + (2^M - 2^k) + … + (2^M - 2^(M-1))
- Final operation count:
      M·2^M - (2^M - 1) = n·log2(n) - (n - 1)
- This is an algorithm with O(n·log(n)) work
  - This scan algorithm is not that work efficient
  - The sequential scan algorithm does n-1 adds
  - A factor of log(n) might hurt: about 20x more work for 10^6 elements!
  - The homework requires a scan of about 16 million elements
- A parallel algorithm can be slow when execution resources are saturated due to low algorithm efficiency

19

Hillis & Steele: Kernel Function __global__ void scan(float *g_odata, float *g_idata, int n) { extern __shared__ float temp[]; // allocated on invocation int thid = threadIdx.x; int pout = 0, pin = 1; // load input into shared memory. // Exclusive scan: shift right by one and set first element to 0 temp[thid] = (thid > 0) ? g_idata[thid-1] : 0; __syncthreads(); for( int offset = 1; offset < n; offset <<= 1 ) pout = 1 - pout; // swap double buffer indices pin = 1 - pout; if (thid >= offset) temp[pout*n+thid] += temp[pin*n+thid - offset]; else temp[pout*n+thid] = temp[pin*n+thid]; } g_odata[thid] = temp[pout*n+thid1]; // write output 20

Hillis & Steele: Kernel Function, Quick Remarks The kernel is very simple, which is good Note the nice trick that was used to swap the buffers The kernel only works when the entire array is processed by one block One block in CUDA has 512 threads, which means I can have up to 1024 elements (short of 16 million, which is your assignment) This needs to be improved upon, can’t limit solution to what’s been presented so far 21

Improving Efficiency

- A common parallel algorithm pattern: balanced trees
  - Build a balanced binary tree on the input data and sweep it to and then from the root
  - The tree is not an actual data structure, but a concept used to determine what each thread does at each step
- For scan:
  - Traverse down from the leaves to the root, building partial sums at internal nodes of the tree
  - The root holds the sum of all leaves (this is a reduction algorithm!)
  - Traverse back up the tree, building the scan from the partial sums

22 HK-UIUC

Picture and Pseudocode ~ Reduction Step ~

Index pattern for M = 3 (figure content; "-1" entries indicate no-ops):

i1 =
     1     3     5     7
     3     7    -1    -1
     7    -1    -1    -1

i2 =
     0     2     4     6
     1     5    -1    -1
     3    -1    -1    -1

for k = 0 to M-1
    offset = 2^k
    for j = 1 to 2^(M-k-1) in parallel do
        x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + x[j·2^(k+1) - 2^k - 1]
    endfor

23

Validate the Indices/Offset

function [offset, index1, index2] = coeffsReduce(M)
% MATLAB utility to validate my understanding of
% how the various strides and indices are to be
% constructed in the up-sweep phase of the
% Blelloch scan algorithm
offset = zeros(M,1);
offset = offset - 1;
index1 = zeros(M, 2^(M-1));
index1 = index1 - 1;
index2 = zeros(M, 2^(M-1));
index2 = index2 - 1;
for k = 0:M-1
    offset(k+1) = 2^k;
    for j = 1:2^(M-k-1)
        index1(k+1,j) = j*2^(k+1) - 1;
        index2(k+1,j) = j*2^(k+1) - 1 - 2^k;
    end
end
end

>> [offset, i1, i2] = coeffsReduce(3)

offset =
     1
     2
     4

i1 =
     1     3     5     7
     3     7    -1    -1
     7    -1    -1    -1

i2 =
     0     2     4     6
     1     5    -1    -1
     3    -1    -1    -1

24

Operation Count, Reduce Phase

for k = 0 to M-1
    offset = 2^k
    for j = 1 to 2^(M-k-1) in parallel do
        x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + x[j·2^(k+1) - 2^k - 1]
    endfor

By inspection, the number of additions is
    2^(M-1) + 2^(M-2) + … + 2 + 1 = 2^M - 1 = n - 1
Looks promising…

25

The Down-Sweep Phase

NOTE: this is just a mirror image of the reduction stage, so it is easy to come up with the indexing scheme:

for k = M-1 downto 0
    offset = 2^k
    for j = 1 to 2^(M-k-1) in parallel do
        dummy = x[j·2^(k+1) - 2^k - 1]
        x[j·2^(k+1) - 2^k - 1] = x[j·2^(k+1) - 1]
        x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + dummy
    endfor

26
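
A sequential sketch (again, not from the slides) that strings together the reduction step, the clearing of the last element (done at line 26 of the kernel two slides below), and the down-sweep above; it reproduces the exclusive scan [0 3 4 11 11 15 16 22] for the example input [3 1 7 0 4 1 6 3]:

void blelloch_scan_reference(float *x, int M)
{
    int n = 1 << M;
    // Up-sweep (reduce): x[n-1] ends up holding the total sum
    for (int k = 0; k < M; ++k)
        for (int j = 1; j <= (1 << (M - k - 1)); ++j)
            x[j * (1 << (k + 1)) - 1] += x[j * (1 << (k + 1)) - (1 << k) - 1];
    // Clear the root, then down-sweep to build the exclusive scan
    x[n - 1] = 0.0f;
    for (int k = M - 1; k >= 0; --k)
        for (int j = 1; j <= (1 << (M - k - 1)); ++j) {
            int hi = j * (1 << (k + 1)) - 1;
            int lo = hi - (1 << k);
            float dummy = x[lo];
            x[lo]  = x[hi];
            x[hi] += dummy;
        }
}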

Down-Sweep Phase, Remarks

- Number of operations for the down-sweep phase:
  - Additions: n-1
  - Swaps: n-1 (each swap shadows an addition)
- Total number of operations associated with this algorithm:
  - Additions: 2n-2
  - Swaps: n-1
- This looks very comparable with the work load of the sequential solution
- The algorithm is convoluted, though; it won't be easy to implement
  - The kernel is shown on the next slide

27

01| __global__ void prescan(float *g_odata, float *g_idata, int n)
02| {
03|     extern __shared__ float temp[];  // allocated on invocation
04|
05|
06|     int thid = threadIdx.x;
07|     int offset = 1;
08|
09|     temp[2*thid]   = g_idata[2*thid];  // load input into shared memory
10|     temp[2*thid+1] = g_idata[2*thid+1];
11|
12|     for (int d = n>>1; d > 0; d >>= 1)  // build sum in place up the tree
13|     {
14|         __syncthreads();
15|
16|         if (thid < d)
17|         {
18|             int ai = offset*(2*thid+1)-1;
19|             int bi = offset*(2*thid+2)-1;
20|
21|             temp[bi] += temp[ai];
22|         }
23|         offset *= 2;
24|     }
25|
26|     if (thid == 0) { temp[n - 1] = 0; }  // clear the last element
27|
28|     for (int d = 1; d < n; d *= 2)  // traverse down tree & build scan
29|     {
30|         offset >>= 1;
31|         __syncthreads();
32|
33|         if (thid < d)
34|         {
35|             int ai = offset*(2*thid+1)-1;
36|             int bi = offset*(2*thid+2)-1;
37|
38|             float t  = temp[ai];
39|             temp[ai] = temp[bi];
40|             temp[bi] += t;
41|         }
42|     }
43|
44|     __syncthreads();
45|
46|     g_odata[2*thid]   = temp[2*thid];  // write results to device memory
47|     g_odata[2*thid+1] = temp[2*thid+1];
48| }

28

Bank Conflicts

- The current implementation has many shared memory bank conflicts
  - These can significantly hurt performance on current GPU hardware
- The source of the conflicts: linear indexing with a stride that is a power-of-2 multiple of the thread id (the index j·2^(k+1) - 1 below):

for k = 0 to M-1
    offset = 2^k
    for j = 1 to 2^(M-k-1) in parallel do
        x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + x[j·2^(k+1) - 2^k - 1]
    endfor

- Simple modifications to the current memory addressing scheme can save a lot of cycles

29

Bank Conflicts

- Bank conflicts occur when multiple threads access the same shared memory bank with different addresses
  - In our case, the access index has the form 2^(k+1)·j - 1
    - k = 0: two-way bank conflict
    - k = 1: four-way bank conflict
    - …
- No penalty if all threads access different banks, or if all threads access the exact same address
- Recall that shared memory accesses with conflicts are serialized
  - N-way bank conflicts lead to a set of N successive shared memory transactions

30
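
A small host-side check (illustrative only) makes the conflict pattern visible: printing the bank hit by the write index j·2^(k+1) - 1 for the 16 threads of a half-warp shows each bank repeated twice for k = 0 and four times for k = 1.

#include <stdio.h>

int main(void)
{
    const int NUM_BANKS = 16;
    for (int k = 0; k <= 1; ++k) {
        printf("k = %d: banks ", k);
        for (int j = 1; j <= 16; ++j)                        // the 16 threads of a half-warp
            printf("%d ", (j * (1 << (k + 1)) - 1) % NUM_BANKS);
        printf("\n");
    }
    return 0;
}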

Initial Bank Conflicts on Load

- Each thread loads two shared memory data elements
- It is tempting to interleave the loads (see lines 9 & 10, and 46 & 47 of the kernel):
      temp[2*thid]   = g_idata[2*thid];
      temp[2*thid+1] = g_idata[2*thid+1];
  - Thread 0 accesses banks 0 and 1
  - Thread 1 accesses banks 2 and 3
  - …
  - Thread 8 accesses banks 16 and 17. Oops, that's 0 and 1 again…
  - Two-way bank conflict, can't be easily eliminated
- Better to load one element from each half of the array:
      temp[thid]         = g_idata[thid];
      temp[thid + (n/2)] = g_idata[thid + (n/2)];

(Dan: with the interleaved loads, threads 0, 1, 2, … hit banks 0, 2, 4, …, and the pattern repeats starting at thread 8; I don't think the claim is quite right…)

31 HK-UIUC

Bank Conflicts in the Tree Algorithm

- When we build the sums, during the first iteration of the algorithm each thread in a half-warp reads two shared memory locations and writes one:
  - Threads (0, 8) access bank 0
- (Figure: banks 0-15 with threads T0-T9 mapped onto them; like-colored arrows represent simultaneous memory accesses, each arrow corresponding to a single thread)
- First iteration: 2 threads access each of 8 banks

32 HK-UIUC

Bank Conflicts in the Tree Algorithm

- When we build the sums, each thread reads two shared memory locations and writes one:
  - Threads (1, 9) access bank 2, etc.
- (Figure: second frame of the same diagram; like-colored arrows represent simultaneous memory accesses, each arrow corresponding to a single thread)
- First iteration: 2 threads access each of 8 banks

33 HK-UIUC

Bank Conflicts in the Tree Algorithm

- 2nd iteration: even worse!
  - 4-way bank conflicts; for example, threads (0, 4, 8, 12) access bank 1, threads (1, 5, 9, 13) access bank 5, etc.
- (Figure: banks 0-15 with threads T0-T4 mapped onto them; like-colored arrows represent simultaneous memory accesses, each arrow corresponding to a single thread)
- 2nd iteration: 4 threads access each of 4 banks

34 HK-UIUC

Managing Bank Conflicts in the Tree Algorithm

- Use padding to prevent bank conflicts
  - Add a word of padding every 16 words
  - Now you work with a virtual 17-bank shared memory layout
  - Within a 16-thread half-warp, all threads access different banks
    - They are aligned to a 17-word memory layout
  - It comes at a price: you have memory words that are wasted
- Keep in mind: you should also load data from global into shared memory using the virtual memory layout of 17 banks (see the sketch below)

35
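
A sketch of how the padded loads might look (along the lines of the GPU Gems 3 scan chapter; the macro name follows that chapter's convention, and the kernel here only shows the loads):

#define LOG_NUM_BANKS 4                                 // 16 banks on G80
#define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)  // one pad word per 16 words

__global__ void padded_load_example(const float *g_idata, int n)
{
    extern __shared__ float temp[];   // sized at launch for n elements plus n/16 padding words
    int thid = threadIdx.x;
    int ai = thid;                    // element from the first half of the array
    int bi = thid + (n / 2);          // element from the second half of the array
    temp[ai + CONFLICT_FREE_OFFSET(ai)] = g_idata[ai];
    temp[bi + CONFLICT_FREE_OFFSET(bi)] = g_idata[bi];
    __syncthreads();
    // ... the tree phases then index temp[x + CONFLICT_FREE_OFFSET(x)] throughout
}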

Use Padding to Reduce Conflicts

- After you compute a shared memory address like this:
      address = 2 * stride * thid;
  add padding like this:
      address += (address >> 4);  // divide by NUM_BANKS (16)
- This removes most bank conflicts
  - Not all, in the case of deep trees
  - Material posted online will contain a discussion of this "deep tree" situation along with a proposed solution

36 HK-UIUC

Managing Bank Conflicts in the Tree Algorithm

- Original scenario (figure): threads T0, T1, … mapped onto the 16 physical banks (actual physical memory, true bank numbers 0-15)
- Modified scenario (figure): virtual 17-bank memory layout, with a padding word (P) inserted after every 16 words
- Note that only arrows with the same color happen simultaneously

37 HK-UIUC

Concluding Remarks, Parallel Scan

- Intuitively, the scan operation is not the type of procedure ideally suited for parallel computing
- Even if it doesn't fit like a glove, it leads to nice speedup:

# elements    CPU Scan (ms)    GPU Scan (ms)    Speedup
1024          0.002231         0.079492         0.03
32768         0.072663         0.106159         0.68
65536         0.146326         0.137006         1.07
131072        0.726429         0.200257         3.63
262144        1.454742         0.326900         4.45
524288        2.911067         0.624104         4.66
1048576       5.900097         1.118091         5.28
2097152       11.848376        2.099666         5.64
4194304       23.835931        4.062923         5.87
8388608       47.390906        7.987311         5.93
16777216      94.794598        15.854781        5.98

Source: 2007 paper of Harris, Sengupta, Owens

38

Concluding Remarks, Parallel Scan

- The Hillis-Steele (HS) implementation is simple, but suboptimal
- The Harris-Sengupta-Owens (HSO) solution is convoluted, but scales like O(n)
  - The complexity of the algorithm is due to an acute bank-conflict situation
- Michael Garland argues that there is a lot of serendipity in the HSO algorithm, and it's not clear whether it's worth pursuing
- Finally, we have not solved the problem yet: we only looked at the case when our array has up to 1024 elements
  - You will have to think about how to handle the 16,777,216 = 2^24 element case
  - Likewise, it would be fantastic if you also implemented the case when the number of elements is not a power of 2

39

Taking it one step further: completely eliminating the bank conflicts (individual reading) 40

Scan Bank Conflicts (1)

- A full binary tree with 64 leaf nodes:
  - Multiple 2- and 4-way bank conflicts
  - Shared memory cost for the whole tree:
    - One 32-thread warp takes 6 cycles per s[a] += s[b] without conflicts (counting 2 shared memory reads and one write)
    - 6 * (2+4+4+4+2+1) = 102 cycles
    - 36 cycles if there were no bank conflicts (6 * 6)

A 32-thread warp normally takes 6 cycles for the line s[a] += s[b]. This line should become 3 instructions:
    G2R rA, g[a]
    ADD rA, g[b], rA
    R2G g[a], rA
For 32 threads, these instructions take 2 cycles each. Let's assume that having fewer than 32 active threads does not reduce this cycle count per instruction. Without bank conflicts, it's 6 cycles total, times 6 levels of the tree = 36 cycles. With bank conflicts it's 6*(2+4+4+4+2+1) = 102 cycles.

41

Scan Bank Conflicts (2)

- It's much worse with bigger trees!
- A full binary tree with 128 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)
- Cost for the whole tree:
  - 12*2 + 6*(4+8+8+4+2+1) = 186 cycles
  - 48 cycles if there were no bank conflicts! (12*1 + 6*6)

For 32 threads, these instructions take 2 cycles each. Let's assume that having fewer than 32 active threads does not reduce this cycle count per instruction. Without bank conflicts, it's 6 cycles total, times 7 levels of the tree = 42 cycles (assuming that both 32-thread warps on the lowest level, 64 threads to process 128 elements, run simultaneously). With bank conflicts it's 6*(2+4+8+8+4+2+1) = 174 cycles. (WMH: I believe that the first 2 waves of accesses also have 8-way conflicts; they should take 8 rounds to resolve.)

42

Scan Bank Conflicts (3)

- A full binary tree with 512 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)
- Cost for the whole tree:
  - 48*2 + 24*4 + 12*8 + 6*(16+16+8+4+2+1) = 570 cycles
  - 120 cycles if there were no bank conflicts!

For 32 threads, these instructions take 2 cycles each. Let's assume that having fewer than 32 active threads does not reduce this cycle count per instruction. Without bank conflicts, it's 6 cycles total, times 9 levels of the tree = 54 cycles (assuming that all 8 32-thread warps on the lowest level, 256 threads to process 512 elements, run simultaneously). With bank conflicts it's 6*(2+4+8+16+16+8+4+2+1) = 366 cycles. (WMH: again, the first three should be all 16-way conflicts.)

43

Fixing Scan Bank Conflicts

Insert padding every NUM_BANKS elements:

const int LOG_NUM_BANKS = 4;  // 16 banks on G80
int tid = threadIdx.x;
int s = 1;
// Traversal from leaves up to root
for (int d = n>>1; d > 0; d >>= 1)
{
    if (tid < d)
    {
        int a = s*(2*tid);
        int b = s*(2*tid+1);
        a += (a >> LOG_NUM_BANKS);  // insert pad word
        b += (b >> LOG_NUM_BANKS);  // insert pad word
        shared[a] += shared[b];
    }
    s *= 2;           // stride doubles at each level
    __syncthreads();
}

44

Fixing Scan Bank Conflicts

- A full binary tree with 64 leaf nodes
  - No more bank conflicts!
  - However, there are ~8 cycles of overhead for addressing
    - For each s[a] += s[b] (8 cycles/iteration * 6 iterations = 48 extra cycles)
  - So just barely worth the overhead on a small tree
    - 84 cycles vs. 102 with conflicts vs. 36 optimal

2 shifts and 2 adds per address computation. At 2 cycles each for a 32-thread warp, that's 8 cycles of overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. So (6+8)*6 = 84 cycles.

45

Fixing Scan Bank Conflicts

- A full binary tree with 128 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)
- No more bank conflicts!
- Significant performance win:
  - 106 cycles vs. 186 with bank conflicts vs. 48 optimal

1 shift and 1 add per address computation. At 2 cycles each for a 32-thread warp, that's 8 cycles of overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. So (6+8)*7 = 98 cycles.

46

Fixing Scan Bank Conflicts

- A full binary tree with 512 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)
- Wait, we still have bank conflicts
  - The method is not foolproof, but still much improved
  - 304 cycles vs. 570 with bank conflicts vs. 120 optimal

1 shift and 1 add per address computation. At 2 cycles each for a 32-thread warp, that's 8 cycles of overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. But we still have 2-way bank conflicts on 4 tree levels (out of 9 total). So (6+8)*5 + (12+8)*4 = 150 cycles.

47

Fixing Scan Bank Conflicts

- It's possible to remove all bank conflicts
  - Just do multi-level padding
  - Example, two-level padding:

const int LOG_NUM_BANKS = 4;  // 16 banks on G80
int tid = threadIdx.x;
int s = 1;
// Traversal from leaves up to root
for (int d = n>>1; d > 0; d >>= 1)
{
    if (tid < d)
    {
        int a = s*(2*tid);
        int b = s*(2*tid+1);
        int offset = (a >> LOG_NUM_BANKS);        // first level
        a += offset + (offset >> LOG_NUM_BANKS);  // second level
        offset = (b >> LOG_NUM_BANKS);            // first level
        b += offset + (offset >> LOG_NUM_BANKS);  // second level
        temp[a] += temp[b];
    }
    s *= 2;           // stride doubles at each level
    __syncthreads();
}

The a and b calculations are arranged so that both offset and offset >> LOG_NUM_BANKS are added to each index.

48

Fixing Scan Bank Conflicts

- A full binary tree with 512 leaf nodes
  - Only the last 6 iterations shown (root and 5 levels below)
- No bank conflicts
  - But an extra cycle of overhead per address calculation
- Not worth it: 440 cycles vs. 304 with 1-level padding
  - With 1-level padding, bank conflicts only occur in warp 0
    - Very small remaining cost due to bank conflicts
    - Removing them hurts all other warps

2 shifts and 1 add per address computation. At 2 cycles each for a 32-thread warp, that's 12 cycles of overhead, plus the 6 cycles for the s[a] += s[b] without bank conflicts. (6+12)*9 = 162 cycles.

49