ECE 498AL Lecture 8: Bank Conflicts and Sample PTX Code
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Shared Memory Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast case:
- If all threads of a half-warp access different banks, there is no bank conflict.
- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).
The slow case:
- Bank conflict: multiple threads in the same half-warp access different addresses in the same bank.
- The accesses must be serialized.
- Cost = maximum number of simultaneous accesses to a single bank.
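A minimal kernel sketch of the two cases (the kernel name, buffer size, and the assumption of a single 128-thread block are not from the slides); on G80 there are 16 banks, each one 32-bit word wide, and bank = (word address) % 16:

    __global__ void bank_demo(float *out)
    {
        __shared__ float buf[256];
        int tid = threadIdx.x;                 // assume one block of 128 threads

        // Fast case: thread i touches word i -> bank (i % 16); each half-warp
        // covers 16 consecutive words, i.e. 16 distinct banks.
        buf[tid] = (float)tid;
        __syncthreads();

        // Slow case: stride-2 access; threads t and t+8 of the same half-warp
        // both hit bank (2*t) % 16, a 2-way conflict that serializes the accesses.
        float x = buf[2 * tid];
        out[tid] = x;
    }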
Linear Addressing
Given:
    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];
This is bank-conflict-free only if s shares no common factors with the number of banks; with 16 banks on G80, s must be odd.
[Figures: with s = 1 and with s = 3, threads 0-15 map to 16 distinct banks.]
Data Types and Bank Conflicts
This has no conflicts if the type of shared is 32 bits:
    foo = shared[baseIndex + threadIdx.x];
But not if the data type is smaller.
4-way bank conflicts:
    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];
2-way bank conflicts:
    __shared__ short shared[];
    foo = shared[baseIndex + threadIdx.x];
[Figures: four consecutive chars, or two consecutive shorts, fall in the same 32-bit bank.]
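One possible workaround, sketched here as an assumption rather than taken from the slides: pack four chars into one 32-bit word so each thread moves a full bank-width word, then unpack in registers (kernel name and sizes are hypothetical):

    __global__ void char_demo(const int *gin, int *gout)
    {
        __shared__ int packed[64];             // 64 ints hold 256 chars
        int tid = threadIdx.x;                 // assume one block of 64 threads

        packed[tid] = gin[tid];                // 32-bit accesses: one bank per thread
        __syncthreads();

        int w = packed[tid];
        int c0 =  w        & 0xff;             // unpack in registers:
        int c1 = (w >>  8) & 0xff;             // no extra shared-memory traffic
        int c2 = (w >> 16) & 0xff;
        int c3 = (w >> 24) & 0xff;
        gout[tid] = c0 + c1 + c2 + c3;
    }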
Structs and Bank Conflicts
Struct assignments compile into as many memory accesses as there are struct members:
    struct vector { float x, y, z; };
    struct myType { float f; int c; };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];
This has no bank conflicts for vector: the struct size is 3 words, so the 3 accesses per thread hit contiguous banks (3 has no common factor with 16):
    struct vector v = vectors[baseIndex + threadIdx.x];
This has 2-way bank conflicts for myType (2 accesses per thread):
    struct myType m = myTypes[baseIndex + threadIdx.x];
[Figure: threads 0-15 mapped to banks 0-15.]
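One way to remove the 2-way conflict for myType, shown as a hedged sketch (the padding trick, kernel name, and field names are assumptions, not the course's fix): pad the struct to 3 words so the element stride (3) has no common factor with 16:

    struct myTypePadded { float f; int c; int pad; };   // 3 words per element

    __global__ void struct_demo(float *out)
    {
        __shared__ struct myTypePadded myTypes[64];
        int tid = threadIdx.x;                 // assume one block of 64 threads

        myTypes[tid].f = (float)tid;           // member at word 3*tid + 0 -> 16 distinct banks
        myTypes[tid].c = tid;                  // member at word 3*tid + 1 -> 16 distinct banks
        __syncthreads();

        out[tid] = myTypes[tid].f + (float)myTypes[tid].c;
    }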
Common Array Bank Conflict Patterns (1D)
Each thread loads 2 elements into shared memory.
2-way-interleaved loads result in 2-way bank conflicts:
    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];
This pattern makes sense for traditional CPU threads: it gives locality in cache-line usage and reduces sharing traffic.
It does not make sense for shared memory, where there are no cache-line effects but there are banking effects.
[Figure: with stride 2, threads t and t+8 of a half-warp map to the same bank.]
Vector Reduction with Bank Conflicts
[Figure: reduction tree over array elements 1-11, steps 1-3; the interleaved access pattern causes bank conflicts.]
A Better Array Access Pattern
Each thread loads one element in every consecutive group of blockDim.x elements:
    shared[tid]              = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];
[Figure: threads 0-15 map one-to-one onto banks 0-15.]
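A minimal kernel sketch of this pattern (the kernel name, parameter names, and the 256-element-per-block cap are assumptions): each half-warp touches 16 consecutive words, so both loads are conflict-free:

    __global__ void load_two_per_thread(const float *global_in, float *out)
    {
        __shared__ float shared[2 * 256];      // assumes blockDim.x <= 256
        int tid  = threadIdx.x;
        int base = blockIdx.x * 2 * blockDim.x;

        shared[tid]              = global_in[base + tid];
        shared[tid + blockDim.x] = global_in[base + tid + blockDim.x];
        __syncthreads();

        out[blockIdx.x * blockDim.x + tid] = shared[tid] + shared[tid + blockDim.x];
    }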
No Bank Conflicts
[Figure: reduction over array elements 1-19, steps 1-3; no bank conflicts.]
Common Bank Conflict Patterns (2D)
Operating on a 2D array of floats in shared memory, e.g. image processing.
Example: 16x16 block.
Assume that each thread processes a row, so the threads in a block access all elements of a column simultaneously (example: row 1 in purple).
All 16 of these elements are mapped into the same bank.
16-way bank conflicts: rows all start at bank 0.
[Figure: bank indices without padding; every row spans banks 0-15, so each column lies entirely in one bank.]
Common Bank Conflict Patterns (2D)
Solution 1: pad the rows.
- Add one float to the end of each row.
Solution 2: transpose before processing.
- Suffer bank conflicts during the transpose, but possibly save them later.
With padding, all elements of the same column are in different banks.
[Figures: matrix view and memory-bank view of the bank indices with one float of padding per row.]
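A minimal sketch of Solution 1 (the kernel name and single 16-thread block are assumptions): one float of padding per row shifts row r's start to bank r, so a simultaneous column access hits 16 distinct banks:

    #define TILE 16

    __global__ void column_access_padded(const float *in, float *out)
    {
        __shared__ float tile[TILE][TILE + 1]; // +1 float of padding per row
        int tid = threadIdx.x;                 // assume one block of 16 threads

        for (int col = 0; col < TILE; ++col)   // thread tid loads row tid
            tile[tid][col] = in[tid * TILE + col];
        __syncthreads();

        // Without padding, tile[tid][0] would put all 16 threads in bank 0
        // (a 16-way conflict); with padding the bank is (tid * 17) % 16 = tid.
        out[tid] = 2.0f * tile[tid][0];
    }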
Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
All warps in a block access the same row of Ms: a broadcast, so no bank conflict.
All warps in a block access neighboring elements in a row of Ns as they walk through neighboring columns: different banks, so no bank conflict.
[Figures: bank indices 0-15 for the two 16x16 tiles.]
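A sketch of the tiled inner loop (Ms, Ns, and TILE_WIDTH follow the course's matrix-multiply example, but the kernel shown here is abbreviated and the tile-loading code is elided, so treat it as an illustration only):

    #define TILE_WIDTH 16

    __global__ void matmul_inner(float *Md, float *Nd, float *Pd, int Width)
    {
        __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];
        int tx = threadIdx.x, ty = threadIdx.y;
        float Pvalue = 0.0f;

        // ... load one tile of Md into Ms and one tile of Nd into Ns, then __syncthreads() ...

        for (int k = 0; k < TILE_WIDTH; ++k) {
            // Ms[ty][k]: all threads of a half-warp (same ty) read one address -> broadcast.
            // Ns[k][tx]: consecutive tx read consecutive words -> 16 distinct banks.
            Pvalue += Ms[ty][k] * Ns[k][tx];
        }

        // ... __syncthreads(), repeat over tiles, then store the result ...
        Pd[ty * Width + tx] = Pvalue;
    }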
Load/Store (Memory Read/Write) Clustering/Batching
Use LDs to hide LD latency (non-dependent LD ops only): use the same thread to help hide its own latency.
Instead of:
    LD 0 (long latency)
    Dependent MATH 0
    LD 1 (long latency)
    Dependent MATH 1
Do:
    LD 0 (long latency)
    LD 1 (long latency - hidden)
    MATH 0
    MATH 1
The compiler handles this! But you must have enough non-dependent LDs and math.
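A hedged CUDA-level illustration (hypothetical kernel and parameter names): issuing both independent loads before the dependent arithmetic gives the compiler and hardware room to overlap the second load's latency with the first multiply:

    __global__ void clustered_loads(const float *a, const float *b, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float x = a[i];      // LD 0 (long latency)
        float y = b[i];      // LD 1 (latency hidden behind MATH 0)

        float r0 = x * x;    // MATH 0, depends only on LD 0
        float r1 = y * y;    // MATH 1, depends only on LD 1
        out[i] = r0 + r1;
    }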
Sample CUDA / PTX Programs
PTX Virtual Machine and ISA
Parallel Thread eXecution (PTX):
- Virtual machine and ISA
- Programming model
- Execution resources and state
ISA (Instruction Set Architecture):
- Variable declarations
- Instructions and operands
Translator is an optimizing compiler:
- Translates PTX to target code
- Runs at program install time
Driver implements the VM runtime:
- Coupled with the translator
[Figure: C/C++ application -> C/C++ compiler (or ASM-level library programmer) -> PTX code -> PTX-to-target translator -> target code for G80 and other GPUs.]
Compiling CUDA to PTX
CUDA:
    float4 me = gx[gtid];
    me.x += me.y * me.z;
PTX:
    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    # 174 me.x += me.y * me.z;
    mad.f32 $f1, $f5, $f3, $f1;
The first assignment loads a float4 vector from global memory into 4 registers in one instruction. The notation "{reg0, reg1, ...}" is an immediate vector.
The second instruction is the MAD resulting from the CUDA code ("mad d, a, b, c;" computes d = a*b + c).
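A small wrapper around the slide's snippet (the kernel name and the index computation for gtid are assumptions); compiling it with "nvcc -ptx" emits the PTX so the ld.global.v4.f32 / mad.f32 sequence can be inspected directly:

    __global__ void float4_demo(float4 *gx)
    {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;

        float4 me = gx[gtid];   // one vector load fills four registers
        me.x += me.y * me.z;    // compiles to a single mad.f32
        gx[gtid] = me;          // store back so the computation is not dead code
    }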
CUDA Function
CUDA:
    __device__ void interaction(float4 b0, float4 b1, float3 *accel)
    {
        float3 r;
        r.x = b1.x - b0.x;
        r.y = b1.y - b0.y;
        r.z = b1.z - b0.z;
        float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
        float s = 1.0f / sqrt(distSqr);
        accel->x += r.x * s;
        accel->y += r.y * s;
        accel->z += r.z * s;
    }
PTX:
    sub.f32 $f18, $f1, $f15;
    sub.f32 $f19, $f3, $f16;
    sub.f32 $f20, $f5, $f17;
    mul.f32 $f21, $f18, $f18;
    mul.f32 $f22, $f19, $f19;
    mul.f32 $f23, $f20, $f20;
    add.f32 $f24, $f21, $f22;
    add.f32 $f25, $f23, $f24;
    rsqrt.f32 $f26, $f25;
    mad.f32 $f13, $f18, $f26, $f13;
    mov.f32 $f14, $f13;
    mad.f32 $f11, $f19, $f26, $f11;
    mov.f32 $f12, $f11;
    mad.f32 $f9, $f20, $f26, $f9;
    mov.f32 $f10, $f9;
PTX uses as many registers as it needs; a register allocator runs later and minimizes register use.
PTX does not necessarily generate MAD instructions immediately; MUL/ADD combinations are detected during optimization and combined.
I can't explain the MOV register-copying instructions (yet), but I know they go away during optimization.
Compiling a Loop That Calls a Function
CUDA (sx is shared; mx and accel are local):
    for (i = 0; i < K; i++) {
        if (i != threadIdx.x) {
            interaction(sx[i], mx, &accel);
        }
    }
PTX:
    mov.s32 $r12, 0;
    $Lt_0_26:
    setp.eq.u32 $p1, $r12, $r5;
    @$p1 bra $Lt_0_27;
    mul.lo.u32 $r13, $r12, 16;
    add.u32 $r14, $r13, $r1;
    ld.shared.f32 $f15, [$r14+0];
    ld.shared.f32 $f16, [$r14+4];
    ld.shared.f32 $f17, [$r14+8];
    [function body from the previous slide is inlined here]
    $Lt_0_27:
    add.s32 $r12, $r12, 1;
    mov.s32 $r15, 128;
    setp.ne.s32 $p2, $r12, $r15;
    @$p2 bra $Lt_0_26;
In PTX: set i = 0, test for i == threadIdx.x, jump past the body if equal; otherwise load the data and do the work.
The notation "@p" means "for threads with a true predicate p".