1
ECE 498AL Lecture 8: Bank Conflicts and Sample PTX Code
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
2
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast cases:
- If all threads of a half-warp access different banks, there is no bank conflict.
- If all threads of a half-warp access the identical address, there is no bank conflict (a broadcast).
The slow case:
- Bank conflict: multiple threads in the same half-warp access the same bank.
- The accesses must be serialized.
- Cost = max # of simultaneous accesses to a single bank.
3
Linear Addressing
Given:
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
This is bank-conflict-free only if s shares no common factors with the number of banks: 16 on G80, so s must be odd.
(Figure: threads 0-15 mapping onto banks 0-15 for stride s = 3.)
4
Data types and bank conflicts
There are no bank conflicts if the type of shared is 32 bits wide:
foo = shared[baseIndex + threadIdx.x];
But there are conflicts if the data type is smaller.
4-way bank conflicts:
extern __shared__ char shared[];
foo = shared[baseIndex + threadIdx.x];
2-way bank conflicts:
extern __shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];
(Figure: threads 0-15 mapping onto banks 0-15 for byte-wide and 16-bit-wide accesses.)
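One common workaround, sketched below as a hypothetical fragment (the names packed and baseWord are illustrative, not from the slides): pack four chars into each 32-bit word so that every thread performs one full-word load, landing consecutive threads in consecutive banks, and unpack the bytes in registers.

    // Hypothetical sketch: 4 chars per 32-bit word, one word per thread.
    __shared__ unsigned int packed[64];                  // holds 256 chars
    unsigned int word = packed[baseWord + threadIdx.x];  // conflict-free load
    char c0 = (char)( word        & 0xFF);               // unpack in registers
    char c1 = (char)((word >> 8)  & 0xFF);
    char c2 = (char)((word >> 16) & 0xFF);
    char c3 = (char)((word >> 24) & 0xFF);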
5
Structs and Bank Conflicts
Struct assignments compile into as many memory accesses as there are struct members:
struct vector { float x, y, z; };
struct myType { float f; int c; };
__shared__ struct vector vectors[64];
__shared__ struct myType myTypes[64];
This has no bank conflicts for vector: the struct is 3 words, so each thread makes 3 accesses to contiguous banks (3 shares no common factor with 16):
struct vector v = vectors[baseIndex + threadIdx.x];
This has 2-way bank conflicts for myType (2 accesses per thread):
struct myType m = myTypes[baseIndex + threadIdx.x];
(Figure: threads 0-15 mapping onto banks 0-15.)
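The same reasoning that makes vector conflict-free suggests one fix for myType, sketched here as an assumption rather than something from the slides: pad the 2-word struct to 3 words so the per-thread stride again shares no factor with 16.

    // Hypothetical fix: the pad member widens the struct to 3 words,
    // so consecutive threads start 3 banks apart (gcd(3, 16) = 1).
    struct myTypePadded { float f; int c; int pad; };
    __shared__ struct myTypePadded myTypesPadded[64];
    struct myTypePadded m = myTypesPadded[baseIndex + threadIdx.x];  // conflict-free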
6
Common Array Bank Conflict Patterns (1D)
Each thread loads 2 elements into shared memory; 2-way-interleaved loads result in 2-way bank conflicts:
int tid = threadIdx.x;
shared[2*tid] = global[2*tid];
shared[2*tid+1] = global[2*tid+1];
This pattern makes sense for traditional CPU threads: it exploits locality in cache-line usage and reduces sharing traffic. It does not help in shared memory, where there are no cache-line effects, only banking effects.
(Figure: interleaved threads mapping two-to-a-bank across banks 0-15.)
7
Vector Reduction with Bank Conflicts
(Figure: reduction steps 1-3 over the array elements, pairing elements at increasing stride; a sketch of this pattern follows below.)
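A minimal sketch of the kind of reduction loop the figure illustrates (the kernel name, the array name partialSum, and the 256-thread block size are assumptions, not from the slide):

    // Interleaved reduction: at stride s, active thread t touches
    // element 2*s*t. Once s reaches 8, 2*s*t is a multiple of 16,
    // so every active thread in a half-warp hits bank 0.
    __global__ void reduceWithConflicts(float *g_data)
    {
        __shared__ float partialSum[256];
        unsigned int t = threadIdx.x;
        partialSum[t] = g_data[blockIdx.x * blockDim.x + t];
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            unsigned int index = 2 * stride * t;
            if (index < blockDim.x)
                partialSum[index] += partialSum[index + stride];
        }
        if (t == 0) g_data[blockIdx.x] = partialSum[0];
    }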
8
A Better Array Access Pattern
Each thread loads one element from every consecutive group of blockDim.x elements:
shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];
(Figure: threads 0-15 mapping one-to-one onto banks 0-15.)
9
No Bank Conflicts
(Figure: reduction steps 1-3 over array elements 0-19; the active threads stay contiguous, so no two of them share a bank. A sketch follows below.)
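The conflict-free counterpart, again as a sketch with the same assumed names: the active threads stay contiguous and halve each step, so a half-warp always reads 16 consecutive elements, one per bank.

    // Sequential reduction: active threads 0..stride-1 read and write
    // contiguous elements, so each half-warp spans all 16 banks.
    __global__ void reduceNoConflicts(float *g_data)
    {
        __shared__ float partialSum[256];
        unsigned int t = threadIdx.x;
        partialSum[t] = g_data[blockIdx.x * blockDim.x + t];
        for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            __syncthreads();
            if (t < stride)
                partialSum[t] += partialSum[t + stride];
        }
        if (t == 0) g_data[blockIdx.x] = partialSum[0];
    }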
10
Common Bank Conflict Patterns (2D)
Operating on a 2D array of floats in shared memory, e.g. for image processing. Example: a 16x16 block where each thread processes a row, so the threads in a block access all elements of a column simultaneously (example: row 1 in purple). All 16 of these elements are mapped into the same bank: 16-way bank conflicts, because every row starts at bank 0 (sketched below).
(Figure: bank indices without padding; each row spans banks 0-15, so each column falls entirely within one bank.)
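A sketch of the conflicting access (the kernel, tile name, and loop structure are assumptions used for illustration):

    // Each thread sums its own row. At step k, all 16 threads read
    // tile[t][k]; the word offsets are t*16 + k, so every thread
    // lands in bank k % 16: a 16-way conflict on each iteration.
    __global__ void rowSums(const float *g_in, float *g_out)
    {
        __shared__ float tile[16][16];
        int t = threadIdx.x;                 // one thread per row
        for (int k = 0; k < 16; ++k)         // fill this thread's row
            tile[t][k] = g_in[t * 16 + k];
        __syncthreads();
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k)
            acc += tile[t][k];               // 16-way bank conflict
        g_out[t] = acc;
    }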
11
Common Bank Conflict Patterns (2D)
Solution 1: pad the rows by adding one float to the end of each row (see the sketch below).
Solution 2: transpose before processing; this suffers bank conflicts during the transpose, but may avoid them later.
(Figure: matrix view and memory-bank view of the bank indices with padding; each row starts one bank later than the row above it.)
Now all elements of the same column are in different banks.
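Solution 1 as a sketch, reusing the tile from the previous example: widening each row by one float skews successive rows by one bank, so swapping this declaration into the sketch above removes the conflicts.

    // 16 data columns + 1 pad column per row. tile[t][k] now sits at
    // word offset t*17 + k, and (t*17 + k) % 16 == (t + k) % 16, which
    // differs for each thread t in 0..15: the column spreads over all
    // 16 banks, so the reads are conflict-free.
    __shared__ float tile[16][17];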
12
Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
(Figure: bank indices 0-15 across a tile of Ms and a tile of Ns.)
All warps in a block access the same row of Ms: a broadcast, so no bank conflict. All warps in a block access neighboring elements in a row of Ns as they walk through neighboring columns, so consecutive threads fall in consecutive banks: no conflict there either. (A sketch of the inner loop follows below.)
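For reference, a sketch of the tiled inner loop in question (Ms follows the slide; Ns, TILE_WIDTH, tx, ty, and Pvalue are assumed from the course's tiled matrix-multiplication kernel):

    // Ms[ty][k]: the same address for every thread in the half-warp
    // (same ty, same k), so the read is a broadcast.
    // Ns[k][tx]: word offsets k*TILE_WIDTH + tx, so consecutive
    // threads hit consecutive banks. No conflicts in either case.
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];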
13
Load/Store (Memory read/write) Clustering/Batching
Use LDs to hide LD latency (non-dependent LD ops only): use the same thread to help hide its own latency.
Instead of:
LD 0 (long latency)
Dependent MATH 0
LD 1 (long latency)
Dependent MATH 1
Do:
LD 0 (long latency)
LD 1 (long latency - hidden)
MATH 0
MATH 1
The compiler handles this! But you must have enough non-dependent LDs and math.
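At the CUDA source level the same idea looks like the sketch below (the kernel and array names are illustrative): both independent loads are issued before any math that depends on them, so the second load's latency hides behind the first's.

    __global__ void sumOfSquares(const float *g_in, float *g_out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float a = g_in[i];          // LD 0 (long latency)
        float b = g_in[i + n];      // LD 1 (long latency, hidden)
        float r0 = a * a;           // Dependent MATH 0
        float r1 = b * b;           // Dependent MATH 1
        g_out[i] = r0 + r1;
    }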
14
Sample CUDA / PTX Programs
15
PTX Virtual Machine and ISA
Parallel Thread eXecution (PTX): a virtual machine and ISA.
- Programming model
- Execution resources and state
ISA - Instruction Set Architecture:
- Variable declarations
- Instructions and operands
The translator is an optimizing compiler:
- Translates PTX to target code
- Runs at program install time
The driver implements the VM runtime, coupled with the translator.
(Figure: a C/C++ application flows through the C/C++ compiler, alongside an ASM-level library from the programmer, into PTX code; the PTX-to-target translator then emits target code for G80 and other GPU targets.)
16
Compiling CUDA to PTX
CUDA:
float4 me = gx[gtid];
me.x += me.y * me.z;
PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
# me.x += me.y * me.z;
mad.f $f1, $f5, $f3, $f1;
The first assignment loads a float4 vector from global memory into 4 registers in one instruction. The notation "{reg0, reg1, …}" is an immediate vector. The second instruction is the MAD resulting from the CUDA code ("mad d, a, b, c;" computes d = a*b + c).
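As a usage note, listings like this can be regenerated by asking nvcc to stop at PTX (the file name kernel.cu is illustrative):

    nvcc -ptx kernel.cu -o kernel.ptx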
17
CUDA Function
CUDA:
__device__ void interaction(float4 b0, float4 b1, float3 *accel)
{
    float3 r;
    r.x = b1.x - b0.x;
    r.y = b1.y - b0.y;
    r.z = b1.z - b0.z;
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
    float s = 1.0f / sqrt(distSqr);
    accel->x += r.x * s;
    accel->y += r.y * s;
    accel->z += r.z * s;
}
PTX:
sub.f $f18, $f1, $f15;
sub.f $f19, $f3, $f16;
sub.f $f20, $f5, $f17;
mul.f $f21, $f18, $f18;
mul.f $f22, $f19, $f19;
mul.f $f23, $f20, $f20;
add.f $f24, $f21, $f22;
add.f $f25, $f23, $f24;
rsqrt.f $f26, $f25;
mad.f $f13, $f18, $f26, $f13;
mov.f $f14, $f13;
mad.f $f11, $f19, $f26, $f11;
mov.f $f12, $f11;
mad.f $f9, $f20, $f26, $f9;
mov.f $f10, $f9;
PTX uses as many registers as it needs; a register allocator runs later, minimizing register use. PTX does not necessarily generate MAD instructions immediately: MUL/ADD combinations are detected during optimization and combined. I can't explain the MOV register-copying instructions (yet), but I know they go away during optimization.
18
Compiling a loop that calls a function
CUDA (sx is shared; mx and accel are local):
for (i = 0; i < K; i++) {
    if (i != threadIdx.x) {
        interaction(sx[i], mx, &accel);
    }
}
PTX:
mov.s $r12, 0;
$Lt_0_26:
setp.eq.u $p1, $r12, $r5;
@$p1 bra $Lt_0_27;
mul.lo.u32 $r13, $r12, 16;
add.u $r14, $r13, $r1;
ld.shared.f $f15, [$r14+0];
ld.shared.f $f16, [$r14+4];
ld.shared.f $f17, [$r14+8];
[function body from the previous slide inlined here]
$Lt_0_27:
add.s $r12, $r12, 1;
mov.s $r15, 128;
setp.ne.s $p2, $r12, $r15;
@$p2 bra $Lt_0_26;
In PTX: set i = 0, test for i == threadIdx.x, and jump to the end if they are equal; otherwise, load the data and do the work. The notation @$p means "execute only for threads with a true predicate p".