Introduction to CUDA Programming


Introduction to CUDA Programming: Optimizing for CUDA
Andreas Moshovos, Winter 2009 (updated 2012 for Fermi)
Most slides/material from the UIUC course by Wen-mei Hwu and David Kirk, and from Mark Harris (NVIDIA).

Hardware recap: Thread Processing Clusters
- 3 Stream Multiprocessors per cluster, sharing a Texture Cache
- Each Stream Multiprocessor: 32 Stream Processors, 4 Special Function Units, 16 double-precision units (which use all 32 Stream Processors), 16 LD/ST units
- 16K/48K shared memory: 32 banks, 32-bit interleaved
- 32K registers
- 32-thread warps
- Constant memory: 64KB in DRAM, cached
- Main memory: 1 GByte, 384-bit interface

WARP: the minimum execution unit
- 32 threads, all executing the same instruction
- Takes 2 cycles: 16 threads per cycle
- Think of memory as operating at half the speed: the first 16 threads go to memory in parallel, then the next 16 do the same
- Half-warp: coalescing possible

WARP: When a thread stalls
(Diagram: execution switches from stalled thread A to thread B, then back to A, hiding the stall.)

WARP
(Diagram: thread warp execution; memory references are issued per half-warp.)

Fermi Memory Architecture Overview

Shared Memory Accesses
- As if the full warp goes together
- 32 banks
- Accesses serialize when two different threads access different words in the same bank
- Different bytes or half-words within the same word are OK

Global Memory Accesses
Two types of loads:
- Caching (default mode): attempts to hit in L1, then L2, then GMEM; load granularity is a 128-byte line
- Non-caching (compile with -Xptxas -dlcm=cg option to nvcc): attempts to hit in L2, then GMEM; does not hit in L1 and invalidates the line if it is already in L1; load granularity is 32 bytes
Stores: invalidate L1, write back to L2

Limits on # of threads: grid and block dimension restrictions
- Grid: 64K x 64K x 64K blocks
- Block: 1K x 1K x 64 threads; max threads/block = 1K
- A block maps onto an SM; up to 8 blocks per SM; up to 1536 threads per SM
- Every thread uses registers: up to 32K registers per SM
- Every block uses shared memory: up to 16/48KB shared memory per SM
Example: 16x16 blocks of threads using 40 registers each, where each block uses 4K of shared memory:
- 256 threads x 40 registers = 10,240 registers/block, so 32K / 10,240 = 3.2 blocks/SM
- 4K shared memory/block, so 16K / 4K = 4 blocks/SM
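The limits above can also be checked at run time. Below is a minimal sketch (not from the original slides) that queries them through the CUDA runtime's cudaGetDeviceProperties; the printed fields are standard cudaDeviceProp members.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);      // query device 0
        printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
        printf("Max block dims:    %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dims:     %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("Registers/block:   %d\n", prop.regsPerBlock);
        printf("Shared mem/block:  %zu bytes\n", (size_t)prop.sharedMemPerBlock);
        return 0;
    }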

Understanding Control Flow Divergence

    if (in[i] == 0)
        out[i] = sqrt(x);
    else
        out[i] = 10;

(Diagram: within one warp, the threads with in[i] == 0 execute out[i] = sqrt(x) while the others sit idle, then the roles flip for out[i] = 10; the two paths execute serially over time.)

Control Flow Divergence Cont'd
(Diagram: bad scenario, the branch diverges within a warp, so some lanes idle on each path; good scenario, the branch splits cleanly across warp #1 and warp #2, so each warp takes a single path with no idle lanes.)
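A minimal sketch of the two scenarios, assuming a warp size of 32; the kernels and the warp-aligned condition are illustrative, not from the slides. The warp-aligned version only applies when the work can be arranged so the branch condition is uniform within each warp.

    #define WARP_SIZE 32

    // Bad scenario: the branch depends on per-thread data, so lanes of the
    // same warp can take different paths and execute them serially.
    __global__ void divergent(const int *in, float *out, float x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (in[i] == 0) out[i] = sqrtf(x);
        else            out[i] = 10.0f;
    }

    // Good scenario: the condition is the same for every lane of a warp,
    // so each warp takes exactly one path and no lanes sit idle.
    __global__ void warpAligned(float *out, float x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / WARP_SIZE) % 2 == 0) out[i] = sqrtf(x);
        else                                    out[i] = 10.0f;
    }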

Instruction Performance
Instruction processing steps per warp:
- Read input operands for all threads
- Execute the instruction
- Write the results back
For performance:
- Minimize use of low-throughput instructions
- Maximize use of available memory bandwidth
- Allow overlapping of memory accesses and computation: high compute/access ratio, many threads

Instruction Throughput (GTX280)
- 4 cycles: single-precision FP ADD, MUL, MAD; integer ADD, __mul24(x), __umul24(x); bitwise, compare, min, max, type conversion
- 16 cycles: reciprocal, reciprocal sqrt, __logf(x); 32-bit integer MUL (will be faster in future hardware)
- 20 cycles: __fdividef(x)
- 32 cycles: sqrt(x), computed as 1/sqrt(x) followed by a reciprocal; __sinf(x), __cosf(x), __expf(x)
- 36 cycles: single-precision FP division
- Many more: sinf() (10x if x > 48039), integer div/mod, ...
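As a hedged illustration of the table above (the kernel is made up, not from the deck), the intrinsics __sinf() and __fdividef() trade accuracy for the faster SFU paths; nvcc's -use_fast_math performs similar substitutions globally.

    __global__ void fastVsAccurate(const float *x, float *slow, float *fast, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            slow[i] = sinf(x[i]) / x[i];               // library sin + full-precision divide
            fast[i] = __fdividef(__sinf(x[i]), x[i]);  // fast intrinsics, reduced accuracy
        }
    }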

Optimization Steps
- Optimize algorithms for the GPU
- Optimize memory access ordering for coalescing
- Take advantage of on-chip shared memory
- Use parallelism efficiently

Optimize Algorithms for the GPU
- Maximize independent parallelism (we'll see more of this with examples)
- Avoid thread synchronization as much as possible
- Maximize arithmetic intensity (math/bandwidth): sometimes it's better to re-compute than to cache, since the GPU spends its transistors on ALUs, not memory
- Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to the host

Optimize Memory Access Ordering for Coalescing
- Coalesced access: a single memory transaction serves all requests in a warp
- Coalesced vs. non-coalesced global device memory access can differ by an order of magnitude in effective bandwidth
- Shared memory: avoid bank conflicts

Exploit the Shared Memory
- Hundreds of times faster than global memory: 2 cycles vs. 400-800 cycles
- Threads can cooperate via shared memory and __syncthreads()
- Use one / a few threads to load / compute data shared by all threads
- Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing (matrix transpose example later; a staging sketch follows below)
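A minimal staging sketch (the kernel and the final computation are placeholders, not from the slides): one coalesced load per thread into shared memory, a barrier, then block-wide reuse of the staged data.

    __global__ void stageAndUse(const float *g_in, float *g_out) {
        extern __shared__ float s[];                    // size given at launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = g_in[i];                       // coalesced load into SMEM
        __syncthreads();                                // data now visible block-wide
        // Placeholder use: combine this thread's element with another thread's.
        g_out[i] = 0.5f * (s[threadIdx.x] + s[blockDim.x - 1 - threadIdx.x]);
    }

    // Launch example: stageAndUse<<<blocks, threads, threads * sizeof(float)>>>(in, out);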

Use Parallelism Efficiently
- Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks
- Keep resource usage (registers, shared memory) low enough to support multiple active thread blocks per multiprocessor

Global Memory Reads/Writes
- Highest-latency instructions: 400-800 clock cycles
- Likely to be the performance bottleneck
- Optimizations can greatly increase performance: coalescing (up to 16x speedup), latency hiding (up to 2.5x speedup)

Global Memory Accesses (recap)
Two types of loads: caching (default; L1, then L2, then GMEM; 128-byte lines) and non-caching (-Xptxas -dlcm=cg; L2, then GMEM; 32-byte segments). Stores invalidate L1 and write back to L2.

Coalescing for Cached Loads
- As long as the addresses fall into one contiguous, aligned 128-byte region of memory: one access
- Otherwise they are split into multiple 128-byte region accesses

Caching Loads (default)
- Warp requests 32 aligned, consecutive 4-byte words
- Addresses fall within 1 cache line
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%

Caching Loads: permuted accesses
- Warp requests 32 aligned, permuted 4-byte words
- Addresses fall within 1 cache line
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%

Caching Load: misaligned contiguous region
- Warp requests 32 misaligned, consecutive 4-byte words
- Addresses fall within 2 cache lines
- Warp needs 128 bytes; 256 bytes move across the bus on misses
- Bus utilization: 50%

Caching Load: one word requested by all threads
- All threads in a warp request the same 4-byte word
- Addresses fall within a single cache line
- Warp needs 4 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 3.125%

Caching Load: worst-case scatter
- Warp requests 32 scattered 4-byte words
- Addresses fall within N cache lines
- Warp needs 128 bytes; N*128 bytes move across the bus on misses
- Bus utilization: 128 / (N*128)

Coalescing Experiment (GTX280)
Kernel: read a float, increment, write back. 3M floats (12MB), 12K blocks x 256 threads/block, times averaged over 10K runs.
- Coalesced: a[i]++ : 211 μs
- Coalesced, some threads don't participate (3 out of 4 participate): if ((i & 0x3) != 0) a[i]++ : 212 μs
- Coalesced, non-contiguous accesses (every two threads access the same word): a[i & ~1]++ : 212 μs
- Uncoalesced, outside the region (every 4th thread accesses a[0]): if ((i & 0x3) != 0) a[i]++; else a[0]++ : 5,182 μs (24.4x slowdown: 4x from not coalescing and another 8x from contention for a[0])
- Uncoalesced within the block: if ((i & 0x3) != 0) a[i]++; else a[startOfBlock]++ : 785 μs (4x slowdown from not coalescing)

Coalescing Experiment Code

    // Host timing loop (uses the old CUDA SDK cutil timers):
    for (int i = 0; i < TIMES; i++) {
        cutResetTimer(timer);
        cutStartTimer(timer);
        kernel<<<n_blocks, block_size>>>(a_d);
        cudaThreadSynchronize();
        cutStopTimer(timer);
        total_time += cutGetTimerValue(timer);
    }
    printf("Time %f\n", total_time / TIMES);

    __global__ void kernel(float *a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Each experiment used exactly one of the following statements:
        a[i]++;                                              // 211 μs
        if ((i & 0x3) != 0) a[i]++;                          // 212 μs
        a[i & ~1]++;                                         // 212 μs
        if ((i & 0x3) != 0) a[i]++; else a[0]++;             // 5,182 μs
        if ((i & 0x3) != 0) a[i]++;
        else a[blockIdx.x * blockDim.x]++;                   // 785 μs
    }

Uncoalesced float3 Access Code

    __global__ void accessFloat3(float3 *d_in, float3 *d_out)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        float3 a = d_in[index];
        a.x += 2;
        a.y += 2;
        a.z += 2;
        d_out[index] = a;
    }

Execution time: 1,905 μs (12M float3, averaged over 10K runs)

Naïve float3 Access Sequence
- float3 is 12 bytes, so each thread ends up executing three 32-bit reads
- sizeof(float3) = 12, so the warp region is 384 bytes
- Offsets: 0, 12, 24, ..., 372
- The warp reads three 128B non-contiguous regions

Coalescing float3 access

Coalescing float3 Strategy
Use shared memory to allow coalescing. Need sizeof(float3) * (threads/block) bytes of SMEM. Three phases (a sketch follows below):
- Phase 1: fetch data into shared memory. Each thread reads 3 scalar floats at offsets 0, (threads/block), 2*(threads/block). These will likely be processed by other threads, so synchronize.
- Phase 2: processing. Each thread retrieves its float3 from the SMEM array: cast the SMEM pointer to (float3*) and use the thread ID as index. The rest of the compute code does not change.
- Phase 3: write results back to global memory. Each thread writes 3 scalar floats.

Coalescing float3 Access Code
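The code on this slide appeared as an image in the original deck; below is a hedged reconstruction of the three-phase strategy, assuming 256 threads per block and viewing the float3 arrays as plain float arrays.

    #define BLOCK_SIZE 256   // assumed threads per block (blockDim.x == BLOCK_SIZE)

    __global__ void accessFloat3Shared(float *g_in, float *g_out) {
        __shared__ float s_data[3 * BLOCK_SIZE];
        int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;  // float index of this block's chunk

        // Phase 1: three coalesced float reads per thread, BLOCK_SIZE apart.
        s_data[threadIdx.x]                  = g_in[index];
        s_data[threadIdx.x +     BLOCK_SIZE] = g_in[index +     BLOCK_SIZE];
        s_data[threadIdx.x + 2 * BLOCK_SIZE] = g_in[index + 2 * BLOCK_SIZE];
        __syncthreads();

        // Phase 2: each thread grabs its float3 from SMEM and does the compute.
        float3 a = ((float3 *)s_data)[threadIdx.x];
        a.x += 2; a.y += 2; a.z += 2;
        ((float3 *)s_data)[threadIdx.x] = a;
        __syncthreads();

        // Phase 3: three coalesced float writes per thread back to global memory.
        g_out[index]                  = s_data[threadIdx.x];
        g_out[index +     BLOCK_SIZE] = s_data[threadIdx.x +     BLOCK_SIZE];
        g_out[index + 2 * BLOCK_SIZE] = s_data[threadIdx.x + 2 * BLOCK_SIZE];
    }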

Coalescing Experiment: float3 (GTX280)
Kernel: read a float3, increment each element, write back. 1M float3s (12MB), times averaged over 10K runs, 4K blocks x 256 threads:
- 648 μs, float3 uncoalesced: about 3x over the float code; every half-warp now ends up making three references
- 245 μs, float3 coalesced through shared memory: about the same as the float code

Global Memory Accesses (recap)
Two types of loads: caching (default; L1, then L2, then GMEM; 128-byte lines) and non-caching (-Xptxas -dlcm=cg; L2, then GMEM; 32-byte segments). Stores invalidate L1 and write back to L2.

Non-Caching Loads: aligned, consecutive words
- Warp requests 32 aligned, consecutive 4-byte words
- Addresses fall within 4 segments
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%

Non-Caching Load: aligned, permuted words
- Warp requests 32 aligned, permuted 4-byte words
- Addresses fall within 4 segments
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%

Non-Caching Load: misaligned contiguous region
- Warp requests 32 misaligned, consecutive 4-byte words
- Addresses fall within at most 5 segments
- Warp needs 128 bytes; at most 160 bytes move across the bus on misses
- Bus utilization: at least 80% (some misaligned patterns will fall within 4 segments, giving 100% utilization)

Non-Caching Load: one word requested by all threads
- All threads in a warp request the same 4-byte word
- Addresses fall within a single segment
- Warp needs 4 bytes; 32 bytes move across the bus on a miss
- Bus utilization: 12.5%

Non-Caching Load: worst-case scatter
- Warp requests 32 scattered 4-byte words
- Addresses fall within N segments
- Warp needs 128 bytes; N*32 bytes move across the bus on misses
- Bus utilization: 128 / (N*32)

Global Memory Coalescing Summary
- Coalescing greatly improves throughput; critical for small or memory-bound kernels
- Reading structures of size other than 4, 8, or 16 bytes will break coalescing: prefer Structures of Arrays (SoA) over Arrays of Structures (AoS)
- If SoA is not viable, read/write through SMEM (as in the float3 example; a SoA sketch follows below)
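A hedged illustration of the SoA recommendation (the struct and kernel names are made up): with a structure of arrays, consecutive threads touch consecutive 4-byte words of each member array, so every access coalesces.

    struct PointAoS { float x, y, z; };     // Array of Structures: 12-byte elements break coalescing

    struct PointsSoA {                      // Structure of Arrays: each member is contiguous
        float *x;
        float *y;
        float *z;
    };

    __global__ void scaleSoA(PointsSoA p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            p.x[i] *= s;                    // consecutive threads hit consecutive words: coalesced
            p.y[i] *= s;
            p.z[i] *= s;
        }
    }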

Shared Memory
- In a parallel machine, many threads access memory; therefore, memory is divided into banks, which is essential to achieve high bandwidth
- Each bank can service one address per two cycles
- A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized
(Diagram: banks 0 through 31.)

Shared Memory Conflicts
Uses:
- Inter-thread communication within a block
- Cache data to reduce redundant global memory accesses
- Use it to improve global memory access patterns
Organization:
- 32 banks, 4-byte wide banks
- Successive 4-byte words belong to different banks
Performance:
- 4 bytes per bank per 2 clocks per multiprocessor
- Shared memory accesses are issued per 32 threads (warp); per 16 threads for GPUs prior to Fermi
- Serialization: if n threads of the 32 access different 4-byte words in the same bank, the n accesses are executed serially
- Multicast: n threads accessing the same word are serviced in one fetch; they can even access different bytes within the same word (prior to Fermi, only broadcast was available, and sub-word accesses within the same bank caused serialization)

Shared Memory Accesses
(Diagram: thread warp execution; memory references are issued per half-warp.)

Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1 (thread i maps to bank i)
- No bank conflicts: random 1:1 permutation (each thread maps to a distinct bank)

Bank Addressing Examples
- 2-way bank conflicts: linear addressing, stride == 2
- 16-way bank conflicts: linear addressing, stride == 16

How Addresses Map to Banks (GF100)
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- GF100 has 32 banks, so bank = (32-bit word address) % 32
- This is the same as the size of a warp: no bank conflicts between different warps, only within a single warp

Shared Memory Bank Conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case: all threads of a warp access different banks, so there is no bank conflict; no two different words are accessed in the same bank
- The slow case: bank conflict, i.e., multiple threads in the same warp access different words within the same bank; the accesses must be serialized
- Cost = max # of simultaneous accesses to a single bank

Linear Addressing
Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks (32 on GF100), so s must be odd.
(Diagrams: s=1, each thread maps to its own bank; s=3, still a 1:1 mapping to banks, so no conflicts.)

Data Types and Bank Conflicts
This has no conflicts if the type of shared is 32-bit:

    foo = shared[baseIndex + threadIdx.x];

But not for smaller data types:
- 4-way bank conflicts:

    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];

- 2-way bank conflicts:

    __shared__ short shared[];

(On Fermi, accesses to different bytes within the same 32-bit word are multicast, as noted earlier, so this applies to pre-Fermi GPUs.)

Structs and Bank Conflicts
Struct assignments compile into as many memory accesses as there are struct members:

    struct vector { float x, y, z; };
    struct myType { float f; int c; };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];

This has no bank conflicts for vector: the struct size is 3 words, so each thread makes 3 accesses to contiguous banks (3 shares no common factor with 32):

    struct vector v = vectors[baseIndex + threadIdx.x];

This has 2-way bank conflicts for myType (2 accesses per thread):

    struct myType m = myTypes[baseIndex + threadIdx.x];

Common Array Bank Conflict Patterns (1D)
Each thread loads 2 elements into shared memory; 2-way-interleaved loads result in 2-way bank conflicts:

    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];

This makes sense for traditional CPU threads (locality in cache-line usage and reduced sharing traffic), but not for shared memory, where there are no cache-line effects, only banking effects.

A Better Array Access Pattern
Each thread loads one element from every consecutive group of blockDim.x elements:

    shared[tid]              = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];

Common Bank Conflict Patterns (2D)
- Operating on a 2D array of floats in shared memory, e.g., image processing
- Example: 32x32 block, where each thread processes a row, so the threads in a block access the elements in each column simultaneously
- 32-way bank conflicts: all rows start at bank 0
- Solution 1: pad the rows by adding one float to the end of each row, so consecutive rows start in consecutive banks (a sketch follows below)
- Solution 2: transpose before processing; this suffers bank conflicts during the transpose but possibly saves them later
(Diagram: bank indices of a row with and without padding.)
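A minimal sketch of Solution 1 (the kernel and the row-sum computation are illustrative, not from the slides), assuming a 32x32 thread block where each thread walks one row of the tile.

    #define TILE 32

    // Launch with dim3 block(TILE, TILE); width assumed to be a multiple of TILE.
    __global__ void rowSumsPadded(const float *in, float *out) {
        __shared__ float tile[TILE][TILE + 1];   // +1 column: row stride = 33 words, so the
                                                 // 32 rows of a column land in 32 different banks
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        int width = gridDim.x * TILE;

        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced, conflict-free load
        __syncthreads();

        // Thread threadIdx.x walks row threadIdx.x, so at each step a warp reads one
        // column of the tile: the conflict-prone pattern described above.
        float sum = 0.0f;
        for (int c = 0; c < TILE; ++c)
            sum += tile[threadIdx.x][c];

        if (threadIdx.y == 0)                    // one row-sum per tile row
            out[(blockIdx.y * gridDim.x + blockIdx.x) * TILE + threadIdx.x] = sum;
    }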

Matrix Transpose
SDK sample ("transpose"). Illustrates: coalescing, and avoiding shared memory bank conflicts.

Uncoalesced Transpose
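The naïve kernel on this slide was shown as an image; a hedged reconstruction follows (parameter names assumed), with coalesced reads but strided, uncoalesced writes.

    __global__ void transposeNaive(float *odata, const float *idata, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // input column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // input row
        if (x < width && y < height) {
            // Read: consecutive threads read consecutive addresses (coalesced).
            // Write: consecutive threads write addresses `height` apart (uncoalesced).
            odata[x * height + y] = idata[y * width + x];
        }
    }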

Uncoalesced Transpose: Memory Access Pattern
(Figures: the original figures show half-warp thread indices 0-15; read 15 as 31 for full Fermi warps.)

Coalesced Transpose
Conceptually partition the input matrix into square tiles.
Threadblock (bx, by):
- Read the (bx, by) input tile and store it into SMEM
- Write the SMEM data to the (by, bx) output tile, transposing the indexing into SMEM
Thread (tx, ty):
- Reads element (tx, ty) from the input tile
- Writes element (tx, ty) into the output tile
Coalescing is achieved if block/tile dimensions are multiples of 16.

Coalesced Transpose: Access Patterns
(Figures: as before, read 15 as 31 for full warps.)

Avoiding Bank Conflicts in Shared Memory
- Threads read SMEM with stride 32: 32-way bank conflicts, 32x slower than no conflicts
- Solution: allocate an "extra" column so the read stride becomes 33; threads then read from consecutive banks

Coalesced Transpose
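The coalesced kernel on this and the following slides was shown as images; below is a hedged reconstruction combining the tiling strategy and the extra-column trick, assuming 32x32 tiles and matrix dimensions that are multiples of the tile size.

    #define TILE_DIM 32

    // Launch with dim3 block(TILE_DIM, TILE_DIM), dim3 grid(width/TILE_DIM, height/TILE_DIM).
    __global__ void transposeCoalesced(float *odata, const float *idata, int width, int height) {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts on the
                                                         // column-wise reads below
        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];   // coalesced read of tile (bx, by)
        __syncthreads();

        // Write tile (by, bx) of the output with transposed SMEM indexing,
        // so global writes are again consecutive within each warp.
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }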


Global Loads
(Diagram: two schedules, issuing the global load before the independent "other work" vs. after it.)

Global Memory Loads: Launch ASAP
- Try to find independent work underneath a global load
- Issuing the global load first and then doing some other, independent work is better than doing the work first and then waiting on the load (a sketch follows below)
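A small sketch of the idea (the kernel is illustrative; in practice the compiler often performs this scheduling itself): both independent loads are issued before either value is needed, so the independent arithmetic overlaps their latency.

    __global__ void addScaled(const float *a, const float *b, float *c, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float va = a[i];          // issue both global loads up front...
            float vb = b[i];
            float t  = k * k + 1.0f;  // ...independent work executes while they are in flight
            c[i] = va * t + vb;       // first use of the loaded values
        }
    }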

Transpose Measurements (GTX280)
Averaged over 10K runs, 16x16 blocks:
- 128x128: optimized 17.5 μs vs. naïve 23 μs (1.3x)
- 512x512: optimized 108 μs vs. naïve 864.6 μs (8.0x)
- 1024x1024: optimized 423.2 μs vs. naïve 4300.1 μs (10x)

Transpose Detail (512x512)
- Naïve: 864.1 μs
- Optimized with shared memory: 430.1 μs
- Optimized with an extra float per row: 111.4 μs