ECE 498AL Lecture 8: Bank Conflicts and Sample PTX Code

Presentation transcript:

ECE 498AL Lecture 8: Bank Conflicts and Sample PTX Code © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Shared Memory Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast case:
- If all threads of a half-warp access different banks, there is no bank conflict.
- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).
The slow case:
- Bank conflict: multiple threads in the same half-warp access the same bank.
- The accesses must be serialized.
- Cost = max # of simultaneous accesses to a single bank.
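As an illustration of the three cases (my own sketch, not from the slides; the array name, the 256-thread block size, and the stride-2 access are assumptions), a kernel for a 16-bank part such as the G80 might read:

  __global__ void bankConflictDemo(float *out)
  {
      __shared__ float data[256];
      int tid = threadIdx.x;                 // assumes 256 threads per block
      data[tid] = (float)tid;
      __syncthreads();

      float a = data[tid];                   // fast: stride 1, each thread of a half-warp hits a different bank
      float b = data[0];                     // fast: identical address, serviced as a broadcast
      float c = data[(2 * tid) % 256];       // slow: stride 2, pairs of threads share a bank (2-way conflict)

      out[tid] = a + b + c;
  }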

Linear Addressing
Given:
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
This is bank-conflict-free only if s shares no common factor with the number of banks (16 on the G80), so s must be odd.
[Diagrams: thread-to-bank mappings, including the case s = 3.]
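To check a stride quickly, one can count how many threads of a half-warp land in each of the 16 banks; the small host-side helper below is my own sketch (function and variable names assumed), not part of the original material.

  #include <stdio.h>

  // Worst-case conflict degree for a half-warp accessing shared[base + s*tid]
  // with 32-bit elements and 16 banks; 1 means conflict-free.
  int conflict_degree(int s)
  {
      int hits[16] = {0};
      for (int tid = 0; tid < 16; ++tid)
          hits[(s * tid) % 16]++;
      int worst = 0;
      for (int b = 0; b < 16; ++b)
          if (hits[b] > worst) worst = hits[b];
      return worst;
  }

  int main(void)
  {
      for (int s = 1; s <= 8; ++s)
          printf("stride %d -> %d-way access\n", s, conflict_degree(s));
      return 0;
  }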

Data Types and Bank Conflicts
This has no conflicts if the type of shared is 32 bits:
foo = shared[baseIndex + threadIdx.x];
But not if the data type is smaller.
4-way bank conflicts:
__shared__ char shared[];
foo = shared[baseIndex + threadIdx.x];
2-way bank conflicts:
__shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];
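One common workaround (my own sketch; the kernel and array names are assumptions) is to stage the byte data as 32-bit words so each thread's load touches exactly one bank, and extract the byte afterwards:

  __global__ void byteLoadDemo(const int *gin, int *gout)
  {
      __shared__ int words[256];          // one 32-bit word per thread (assumes 256 threads per block)
      int tid = threadIdx.x;
      words[tid] = gin[tid];              // stride-1 word access: conflict-free
      __syncthreads();

      // Pull out the low byte of the word. Because the shared-memory access was
      // a word access, there is no 4-way conflict as with a __shared__ char array.
      char low = (char)(words[tid] & 0xFF);
      gout[tid] = (int)low;
  }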

Structs and Bank Conflicts
Struct assignments compile into as many memory accesses as there are struct members:
struct vector { float x, y, z; };
struct myType { float f; int c; };
__shared__ struct vector vectors[64];
__shared__ struct myType myTypes[64];
This has no bank conflicts for vector: the struct size is 3 words, so the 3 accesses per thread walk contiguous banks (3 shares no common factor with 16):
struct vector v = vectors[baseIndex + threadIdx.x];
This has 2-way bank conflicts for myType (2 accesses per thread, stride of 2 words):
struct myType m = myTypes[baseIndex + threadIdx.x];
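If the 2-way conflict matters, one fix is to pad the struct to an odd number of 32-bit words so the array stride is co-prime with the 16 banks; the padded type and demo kernel below are my own sketch, not from the slides.

  struct myTypePadded { float f; int c; int pad; };   // 3 words: stride 3 vs. 16 banks

  __global__ void paddedStructDemo(float *out)
  {
      __shared__ struct myTypePadded myTypes[64];
      int tid = threadIdx.x;                           // assumes blockDim.x == 64
      myTypes[tid].f   = (float)tid;
      myTypes[tid].c   = tid;
      myTypes[tid].pad = 0;
      __syncthreads();

      // Each member load now advances through the banks with stride 3, which
      // shares no factor with 16, so there are no conflicts.
      struct myTypePadded m = myTypes[tid];
      out[tid] = m.f + (float)m.c;
  }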

Common Array Bank Conflict Patterns (1D)
Each thread loads 2 elements into shared memory; 2-way-interleaved loads result in 2-way bank conflicts:
int tid = threadIdx.x;
shared[2*tid]   = global[2*tid];
shared[2*tid+1] = global[2*tid+1];
This makes sense for traditional CPU threads (locality in cache-line usage and reduced sharing traffic), but not for shared memory, where there are no cache-line effects, only banking effects.

Vector Reduction with Bank Conflicts
[Diagram: array elements combined over successive reduction steps with the interleaved access pattern, showing how threads collide in the same banks.]

A Better Array Access Pattern
Each thread loads one element from every consecutive group of blockDim.x elements:
shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];

No Bank Conflicts
[Diagram: the reduction performed with the sequential access pattern; active threads touch consecutive elements, so each half-warp falls into distinct banks.]
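For concreteness, a minimal reduction kernel in this conflict-free, sequential-addressing style (my own sketch, consistent with the slides but not taken from them; the block size of 256 is an assumption):

  // Sequential-addressing reduction: in each step the active threads read
  // consecutive shared-memory elements, so a half-warp spans 16 distinct banks.
  __global__ void reduceSequential(const float *gin, float *gout)
  {
      __shared__ float sdata[256];                 // assumes blockDim.x == 256
      unsigned int tid = threadIdx.x;
      unsigned int i   = blockIdx.x * blockDim.x + tid;
      sdata[tid] = gin[i];
      __syncthreads();

      for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
          if (tid < stride)
              sdata[tid] += sdata[tid + stride];
          __syncthreads();
      }
      if (tid == 0)
          gout[blockIdx.x] = sdata[0];
  }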

Common Bank Conflict Patterns (2D)
Operating on a 2D array of floats in shared memory, e.g. image processing.
Example: a 16x16 block where each thread processes a row, so the threads in a block access all elements of a column simultaneously.
All 16 of these elements are mapped into the same bank: 16-way bank conflicts, because every row starts at bank 0.
[Table: bank indices without padding; each row runs through banks 0, 1, 2, ..., 15, so a column lies entirely in one bank.]

Common Bank Conflict Patterns (2D)
Solution 1: pad the rows.
- Add one float to the end of each row.
Solution 2: transpose before processing.
- Suffer bank conflicts during the transpose, but possibly save them later.
[Diagrams: matrix view and memory-bank view of the bank indices with padding.]
Now all elements of the same column are in different banks.
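In code, the padding trick is just one extra column on the shared tile; the kernel below is my own sketch (TILE, the names, and the 16-thread block are assumptions) of the "each thread processes a row" scenario from the slide.

  #define TILE 16

  __global__ void rowProcessPadded(const float *gin, float *gout)
  {
      // The +1 column shifts each row by one bank, so the 16 threads that are
      // all at the same column index j hit 16 different banks.
      __shared__ float tile[TILE][TILE + 1];
      int tid = threadIdx.x;                       // assumes blockDim.x == 16

      for (int j = 0; j < TILE; ++j)               // load my row
          tile[tid][j] = gin[tid * TILE + j];
      __syncthreads();

      float sum = 0.0f;
      for (int j = 0; j < TILE; ++j)               // all threads sit at column j together
          sum += tile[tid][j];
      gout[tid] = sum;
  }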

Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
All warps in a block access the same row of Ms – broadcast!
All warps in a block access neighboring elements in a row as they walk through neighboring columns!
[Diagrams: bank indices of the two shared-memory tiles.]
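The inner loop of the familiar tiled multiplication kernel makes both access patterns concrete; the sketch below uses my own names (Ms, Ns, TILE = 16) and assumes width is a multiple of TILE, so treat it as an illustration rather than the course's exact kernel.

  #define TILE 16

  __global__ void tiledMatMul(const float *Md, const float *Nd, float *Pd, int width)
  {
      __shared__ float Ms[TILE][TILE];
      __shared__ float Ns[TILE][TILE];

      int tx = threadIdx.x, ty = threadIdx.y;
      int row = blockIdx.y * TILE + ty;
      int col = blockIdx.x * TILE + tx;
      float p = 0.0f;

      for (int m = 0; m < width / TILE; ++m) {
          Ms[ty][tx] = Md[row * width + m * TILE + tx];
          Ns[ty][tx] = Nd[(m * TILE + ty) * width + col];
          __syncthreads();

          for (int k = 0; k < TILE; ++k) {
              // Ms[ty][k]: every thread of a half-warp reads the same address -> broadcast, no conflict.
              // Ns[k][tx]: consecutive tx read consecutive words -> 16 different banks, no conflict.
              p += Ms[ty][k] * Ns[k][tx];
          }
          __syncthreads();
      }
      Pd[row * width + col] = p;
  }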

Load/Store (Memory Read/Write) Clustering/Batching
Use LD to hide LD latency (non-dependent LD ops only): use the same thread to help hide its own latency.
Instead of:
LD 0 (long latency)
Dependent MATH 0
LD 1 (long latency)
Dependent MATH 1
Do:
LD 0 (long latency)
LD 1 (long latency - hidden)
MATH 0
MATH 1
The compiler handles this! But you must have enough non-dependent LDs and math.
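A sketch of what this looks like in source (my own kernel and array names; the compiler will typically produce this schedule on its own, the point is that enough independent loads and math must exist for it to do so):

  __global__ void clusterLoads(const float *a, const float *b, float *out)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      float x = a[i];            // LD 0 (long latency)
      float y = b[i];            // LD 1 issued before MATH 0: its latency is hidden

      float m0 = x * x + 1.0f;   // MATH 0, depends only on LD 0
      float m1 = y * y + 2.0f;   // MATH 1, depends only on LD 1

      out[i] = m0 + m1;
  }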

Sample CUDA / PTX Programs

PTX Virtual Machine and ISA
Parallel Thread eXecution (PTX):
- Virtual machine and ISA
- Programming model
- Execution resources and state
ISA – Instruction Set Architecture:
- Variable declarations
- Instructions and operands
Translator is an optimizing compiler:
- Translates PTX to target code
- At program install time
Driver implements the VM runtime:
- Coupled with the translator
[Diagram: C/C++ application -> C/C++ compiler -> PTX code (plus an ASM-level library from the programmer) -> PTX-to-target translator -> target code for C, G80, ..., GPU.]

Compiling CUDA to PTX
CUDA:
float4 me = gx[gtid];
me.x += me.y * me.z;
PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
# 174 me.x += me.y * me.z;
mad.f32 $f1, $f5, $f3, $f1;
The first assignment loads a float4 vector from global memory into 4 registers in one instruction. The notation "{reg0, reg1, …}" is an immediate vector. The second instruction is the MAD resulting from the CUDA code ("mad d, a, b, c;" computes d = a*b + c).
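To reproduce PTX like this, one can wrap the two lines in a trivial kernel and ask the compiler for the intermediate (for example with nvcc -ptx); the kernel and parameter names below are mine, so treat it as a sketch.

  __global__ void float4Demo(float4 *gx)
  {
      int gtid = blockIdx.x * blockDim.x + threadIdx.x;

      float4 me = gx[gtid];      // one vector load: ld.global.v4.f32
      me.x += me.y * me.z;       // a single mad.f32

      gx[gtid] = me;             // keep the result live so the code is not eliminated
  }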

CUDA Function
CUDA:
__device__ void interaction(float4 b0, float4 b1, float3 *accel)
{
    float3 r;                      // displacement from b0 to b1
    r.x = b1.x - b0.x;
    r.y = b1.y - b0.y;
    r.z = b1.z - b0.z;
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
    float s = 1.0f / sqrt(distSqr);
    accel->x += r.x * s;
    accel->y += r.y * s;
    accel->z += r.z * s;
}
PTX:
sub.f32 $f18, $f1, $f15;
sub.f32 $f19, $f3, $f16;
sub.f32 $f20, $f5, $f17;
mul.f32 $f21, $f18, $f18;
mul.f32 $f22, $f19, $f19;
mul.f32 $f23, $f20, $f20;
add.f32 $f24, $f21, $f22;
add.f32 $f25, $f23, $f24;
rsqrt.f32 $f26, $f25;
mad.f32 $f13, $f18, $f26, $f13;
mov.f32 $f14, $f13;
mad.f32 $f11, $f19, $f26, $f11;
mov.f32 $f12, $f11;
mad.f32 $f9, $f20, $f26, $f9;
mov.f32 $f10, $f9;
PTX uses as many registers as it needs; a register allocator is run later to minimize register use. PTX does not necessarily generate MAD instructions immediately: MUL/ADD combinations are detected during optimization and combined. I can't explain the MOV register-copying instructions (yet), but I know they go away during optimization.

Compiling a Loop That Calls a Function
CUDA (sx is shared; mx and accel are local):
for (i = 0; i < K; i++) {
    if (i != threadIdx.x) {
        interaction(sx[i], mx, &accel);
    }
}
PTX:
mov.s32 $r12, 0;
$Lt_0_26:
setp.eq.u32 $p1, $r12, $r5;
@$p1 bra $Lt_0_27;
mul.lo.u32 $r13, $r12, 16;
add.u32 $r14, $r13, $r1;
ld.shared.f32 $f15, [$r14+0];
ld.shared.f32 $f16, [$r14+4];
ld.shared.f32 $f17, [$r14+8];
[function body from the previous slide inlined here]
$Lt_0_27:
add.s32 $r12, $r12, 1;
mov.s32 $r15, 128;
setp.ne.s32 $p2, $r12, $r15;
@$p2 bra $Lt_0_26;
In PTX: set i = 0, test for i == threadIdx.x, jump past the body if equal; otherwise load the data and do the work. The notation "@p" means "for threads with a true predicate p".
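To see both slides' PTX in one place, a hypothetical wrapper kernel such as the sketch below (NBODY, nbodyStep, gx, and gaccel are my names, not from the course code) provides sx, mx, and accel and calls interaction inside the loop:

  #define NBODY 128                            // matches the loop bound 128 seen in the PTX

  __global__ void nbodyStep(const float4 *gx, float3 *gaccel)
  {
      __shared__ float4 sx[NBODY];             // assumes blockDim.x == NBODY
      int gtid = blockIdx.x * blockDim.x + threadIdx.x;

      float4 mx    = gx[gtid];                 // this thread's own body
      float3 accel = make_float3(0.0f, 0.0f, 0.0f);

      sx[threadIdx.x] = mx;                    // stage the block's bodies in shared memory
      __syncthreads();

      for (int i = 0; i < NBODY; i++) {
          if (i != threadIdx.x) {
              interaction(sx[i], mx, &accel);  // the function from the previous slide
          }
      }

      gaccel[gtid] = accel;
  }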