Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008
Previously: Optimizing Instruction Throughput
– Low-throughput instructions
  Different versions of math functions
  Type conversions are costly
  Avoid warp divergence
– Accessing global memory is expensive
  Overlap memory ops with math ops
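As a reminder of the math-function trade-off, a minimal sketch (the kernel name and data layout are illustrative; __sinf() and sinf() are the actual CUDA functions):

    // __sinf() maps to a fast but less accurate hardware instruction;
    // sinf() is the slower, full single-precision version.
    // (nvcc's -use_fast_math flag switches such calls globally.)
    __global__ void sines(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = __sinf(in[i]);    // fast, reduced accuracy
        // out[i] = sinf(in[i]);   // accurate, lower throughput
    }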
Previously: Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  Global memory: coalesce accesses
  Local memory: coalesced automatically
  Constant memory: cached, cost proportional to #addresses read
  Texture memory: cached, optimized for 2D spatial locality
  Shared memory: on chip, fast, but avoid bank conflicts
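A minimal sketch of the coalescing rule for global memory (the kernel is illustrative):

    // Thread i reads element i: a half-warp touches one contiguous,
    // aligned segment, so its accesses coalesce into few transactions.
    __global__ void copyCoalesced(float *dst, const float *src)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];          // coalesced
        // dst[i] = src[2 * i];   // strided access breaks coalescing
    }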
Today: Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  Shared memory: on chip, fast, but avoid bank conflicts
  Registers
– Optimizing #threads per block
– Memory copies
– Texture vs. Global vs. Constant
– General optimizations
Shared Memory: Bank conflicts
– Shared memory is divided into 32-bit modules called banks
– Banks allow simultaneous reads
– N-way bank conflict if N threads try to read from the same bank
  Leads to serialization of the reads
  Not necessarily N serial reads
Shared Memory: Bank conflicts
– Broadcast mechanism
  One word is chosen as the broadcast word
  It is automatically passed to all other threads reading from that word
– The application cannot control which word is picked as the broadcast word (see the sketch below)
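A sketch of the three access patterns on compute 1.x hardware (16 banks, conflicts counted per half-warp); assumes the kernel is launched with 128 threads per block:

    __global__ void bankDemo(float *out)
    {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[2 * t]     = (float)t;   // fill the shared array
        s[2 * t + 1] = (float)t;
        __syncthreads();

        float a = s[t];       // stride 1: each thread hits its own bank, no conflict
        float b = s[2 * t];   // stride 2: two threads per bank, 2-way conflict,
                              // the reads are serialized
        float c = s[0];       // all threads read the same word: served by the
                              // broadcast mechanism, no conflict
        out[blockIdx.x * blockDim.x + t] = a + b + c;
    }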
Registers
– Generally 0 clock cycles: the time to access registers is included in the instruction time
– There can still be delays
Registers
– Delays may occur due to register memory bank conflicts
– Register memory banks are handled by the compiler and thread scheduler
  They try to schedule instructions to avoid conflicts
  This works best when #threads per block is a multiple of 64
– The application has no other control
Registers
– Delays may occur due to read-after-write dependencies
– These may be hidden if each SM has at least 192 active threads (6 warps, enough to cover the roughly 24 cycles of register read-after-write latency)
Optimizing #threads per block
– Aim for 2 or more blocks per SM
  A waiting block (thread sync, memory copy) can then be overlapped with running blocks
– Shared memory per block should therefore be less than half the shared memory per SM
Optimizing #threads per block
– A multiple of 32 threads per block fully populates warps
– A multiple of 64 threads per block allows the compiler and thread scheduler to avoid register memory bank conflicts
Optimizing #threads per block
– More threads per block means fewer registers available per thread of the kernel
– A compiler option reports the memory requirements of a kernel: --ptxas-options=-v
– #registers per device varies with compute capability
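For example (the flag is the real nvcc option; the file name is a placeholder):

    nvcc --ptxas-options=-v -c kernel.cu

ptxas then prints, per kernel, the number of registers used and the shared/constant/local memory consumption.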
Optimizing #threads per block
– When optimizing, go for a multiple of 64 threads per block
  192 or 256 threads are recommended
– Occupancy of an SM = (#active warps) / (max. active warps)
  The compiler tries to maximize occupancy
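A worked example, assuming a compute 1.0/1.1 device with at most 24 active warps (768 threads) per SM: one resident block of 192 threads is 6 warps, i.e. an occupancy of 6/24 = 25%; three such blocks resident on the same SM give 18/24 = 75%.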
Optimizing Memory Copies
– Host mem <-> device mem transfers have low bandwidth
– Higher bandwidth can be achieved using page-locked (pinned) memory
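A minimal sketch of a pinned-memory transfer (the buffer size is illustrative):

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t n = 1 << 20;
        float *h_buf, *d_buf;

        // cudaMallocHost() returns page-locked host memory; transfers from
        // it reach higher bandwidth than from malloc()'d pageable memory.
        cudaMallocHost((void**)&h_buf, n * sizeof(float));
        cudaMalloc((void**)&d_buf, n * sizeof(float));

        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }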
Optimizing Memory Copies
– Minimize such transfers
  Move more code to the device, even if it does not fully utilize parallelism
  Create intermediate data structures in device memory
  Group several small transfers into one large one
Texture fetches vs. reading Global/Constant mem
– Cached, optimized for spatial locality
– No coalescing constraints
– Address calculation latency is better hidden
– Data can be packed
– Optional conversion of integers to normalized floats in [0.0,1.0] or [-1.0,1.0]
Texture fetches vs. reading Global/Constant mem
– Additionally, for textures stored in CUDA arrays:
  Filtering
  Normalized texture coordinates
  Addressing modes
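A sketch of a 2D texture fetch using the texture reference API of this CUDA generation (kernel and variable names are illustrative; host-side setup is abridged):

    // Texture reference: must be declared at file scope.
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void fetch2D(float *out, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Cached fetch; any filtering/addressing modes set on texRef apply.
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    // Host side (abridged): allocate a cudaArray, copy the data into it,
    // then bind it with cudaBindTextureToArray(texRef, cuArray).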
General Guidelines
– Maximize parallelism
– Maximize memory bandwidth
– Maximize instruction throughput
Maximize Parallelism
– Build on data parallelism
– It is broken in case of thread dependencies
  For threads in the same block: synchronize with __syncthreads() and share data using shared memory
  For threads in different blocks: share data using global memory, with two kernel calls (the first to write the data, the second to read it; see the sketch below)
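A minimal sketch of the two-kernel pattern (kernel names and the produced values are illustrative):

    // The launch boundary between the two kernels acts as a global sync
    // point: every write of produce() is visible to every block of consume().
    __global__ void produce(float *buf)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[i] = (float)i;                    // write phase (placeholder values)
    }

    __global__ void consume(const float *buf, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = buf[i] + buf[(i + 1) % n];   // reads data written by other blocks
    }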
Maximize Parallelism
– Build on data parallelism
– Choose kernel parameters accordingly
– Clever device use: streams
– Clever host use: async kernels
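A sketch of stream use; names are illustrative, and h_a/h_b must be page-locked (cudaMemcpyAsync requires pinned host memory):

    #include <cuda_runtime.h>

    __global__ void scale(float *p) { p[threadIdx.x] *= 2.0f; }

    void twoStreams(float *h_a, float *h_b, float *d_a, float *d_b, size_t bytes)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Work queued in different streams may overlap on the device;
        // the host returns immediately from all of these calls.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        scale<<<1, 256, 0, s0>>>(d_a);
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
        scale<<<1, 256, 0, s1>>>(d_b);

        cudaThreadSynchronize();   // wait for all streams (CUDA 2.x API name)
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }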
Maximize Memory Bandwidth
– Minimize host <-> device memory copies
– Minimize device <-> device memory data transfer
  Use shared memory
– It might even be better to not copy at all
  Just recompute on the device
Maximize Memory Bandwidth
– Organize data for optimal memory access patterns
– This is crucial for accesses to global memory
Maximize Instruction Throughput
– For non-crucial cases, use higher-throughput arithmetic instructions
  Sacrifice accuracy for performance
  Replace double with float operations
– Pay attention to warp divergence
  Try to arrange diverging threads per warp, e.g. branch on a condition such as (threadIdx.x / warpSize) > n
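A sketch of that arrangement (the kernel and the threshold n are illustrative):

    // Branching on (threadIdx.x / warpSize) keeps every warp entirely on
    // one side of the branch, so no single warp diverges.
    __global__ void arranged(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x / warpSize > n)
            out[i] = 0.0f;    // whole warps above n take this path
        else
            out[i] = 1.0f;    // whole warps up to n take this path
        // By contrast, branching on (threadIdx.x % 2) would split every warp.
    }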
Final Projects: Time-line
– Thu, 20 Nov: float write-ups on the ideas of Jens & Waqar
– Tue, 25 Nov (today): suggest groups and topics
– Thu, 27 Nov: groups and topics assigned
– Tue, 2 Dec: last chance to change groups/topics; groups and topics finalized
All for today
– Next time: a full-fledged example project
On to exercises!