Single Instruction Multiple Threads
Martin Kruliš (v1.1), 10.11.2016
SIMT Execution
Single Instruction Multiple Threads: all cores execute the same instruction, but each core has its own set of registers. (Diagram: a single instruction decoder feeding all cores, each core with its private registers.)
SIMT vs. SIMD
Single Instruction Multiple Threads: a width-independent programming model with serial-like code, achieved by the hardware with a little help from the compiler; code divergence is allowed.
Single Instruction Multiple Data: the width of the SIMD vector is exposed explicitly through special instructions, generated by the compiler or written directly by the programmer; code divergence is usually not supported.
Thread-Core Mapping
How are threads assigned to SMPs? Grid → block → warp → thread: all blocks of the grid run the same kernel, a block is assigned to an SMP, a warp runs simultaneously on the SM cores, and a thread maps to a core.
A warp is made of 32 consecutive threads of the block according to their thread ID (i.e., threads 0-31, 32-63, ...). Warp size is 32 for all compute capabilities. The thread ID is threadIdx.x in one dimension, threadIdx.x + threadIdx.y*blockDim.x in two dimensions, and threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y in three dimensions; a sketch of this computation is shown below. Multiple kernels may run simultaneously on the GPU (since Fermi) and multiple blocks may be assigned simultaneously to an SMP (if the registers, the schedulers, and the shared memory can accommodate them).
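The following is a minimal CUDA sketch (not part of the original slides; the kernel name threadMappingDemo and the launch configuration are illustrative) showing how the linear thread ID, warp index, and lane index are derived inside a kernel.

#include <cstdio>

__global__ void threadMappingDemo()
{
    // Linear thread ID within the block (3D case; degenerates to 2D/1D
    // when blockDim.z or blockDim.y is 1).
    unsigned tid = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;

    unsigned warpId = tid / warpSize;   // which warp of the block this thread belongs to
    unsigned lane   = tid % warpSize;   // position within the warp (0-31)

    if (lane == 0)
        printf("block %u, warp %u starts at thread %u\n", blockIdx.x, warpId, tid);
}

int main()
{
    threadMappingDemo<<<2, dim3(8, 4, 2)>>>();   // 64 threads per block = 2 warps
    cudaDeviceSynchronize();
    return 0;
}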
HW Revision
Fermi Architecture (CC 2.x): SM – Streaming (Symmetric) Multiprocessor
HW Revision
Kepler Architecture (CC 3.x): SMX – Streaming Multiprocessor (Next Generation)
HW Revision
Maxwell Architecture (CC 5.0): SMM – Streaming Multiprocessor (Maxwell)
Instruction Schedulers
Decomposition: each block assigned to the SMP is divided into warps, and the warps are assigned to schedulers. At every instruction cycle, each scheduler selects a warp that is ready to execute. The SMP instruction throughput depends on the compute capability:
1.x – 1 instruction per 4 cycles, 1 scheduler
2.0 – 1 instruction per 2 cycles, 2 schedulers
2.1 – 2 instructions per 2 cycles, 2 schedulers
3.x and 5.x – 2 instructions per cycle, 4 schedulers
The most common reason why a warp is not ready to execute is that the operands of its next instruction are not available yet (i.e., they are still being loaded from global memory, subject to a read-after-write dependency, ...). Maxwell (CC 5.0) also has 4 dual-issue schedulers (like Kepler), but only 128 cores (32 cores per scheduler), and the latency of math instructions has been improved.
Hiding Latency
Fast Context Switch: when a warp gets stalled (e.g., by a data load/store), the scheduler switches to the next active warp.
SIMT and Branches
Masking Instructions: in the case of data-driven branches (if-else conditions, while loops, ...), all branches are traversed and each thread masks out its execution in the branches it did not take (a complete kernel sketch follows below):
if (threadIdx.x % 2 == 0) {
    ... even threads code ...
} else {
    ... odd threads code ...
}
(Diagram: the same lanes 1, 2, 3, 4, ... appear in both branches, active in one and masked in the other.)
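A minimal sketch (assumed kernel and parameter names, not from the slides) of the divergent branch above as a complete kernel; both paths are traversed by the warp, with the inactive lanes masked in each.

__global__ void divergenceDemo(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0) {
        data[i] *= 2;     // even lanes execute this path (odd lanes are masked)
    } else {
        data[i] += 1;     // odd lanes execute this path (even lanes are masked)
    }
}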
Reducing Thread Divergence
Work Reorganization: when the workload is imbalanced, cheap balancing can lead to better occupancy.
Example: a matrix whose dimensions are not divisible by the warp size; item (i,j) has linear index i*width + j, and threads can be assigned along this linear index (see the sketch below).
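A minimal sketch (assumed kernel name and per-item work) of such reorganization: threads are mapped to the linear index i*width + j over the whole matrix, so only the very last warp can be partially filled, instead of every row ending with a partial warp.

__global__ void processMatrixLinear(float *matrix, int width, int height)
{
    int linear = blockIdx.x * blockDim.x + threadIdx.x;
    if (linear >= width * height) return;

    int i = linear / width;   // row
    int j = linear % width;   // column

    matrix[i * width + j] += 1.0f;   // placeholder for the per-item work
}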
SIMT Algorithms
SIMD → SIMT: a warp can be perceived as a 32-word SIMD engine.
PRAM: a parallel extension of the Random Access Machine; completely theoretical, with no direct relation to practical hardware.
Some PRAM algorithms can be used as a basis for creating SIMT algorithms (e.g., parallel reduction trees), but significant modifications are required. Neither approach is perfect; the right way is somewhere in between. A reduction-tree sketch follows below.
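A minimal sketch (assumed kernel name; not from the slides) of a PRAM-style parallel reduction tree adapted to SIMT: each block reduces blockDim.x elements in shared memory, halving the number of active threads in every step.

__global__ void blockReduce(const float *in, float *blockSums, int n)
{
    extern __shared__ float buf[];   // blockDim.x floats, passed as the third launch parameter
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction; blockDim.x is assumed to be a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = buf[0];   // one partial sum per block
}

// Example launch: blockReduce<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_sums, n);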
SIMT Algorithm Example
Prefix-sum Compaction: the new positions of the non-empty items are computed by a prefix sum over their occupancy flags. Is this better than the sequential approach? A sketch follows below.
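A minimal sketch (assumed kernel and array names) of prefix-sum compaction: non-empty items are flagged, an exclusive prefix sum over the flags (computed by a separate scan step, not shown) yields the new positions, and a scatter kernel writes the items there.

__global__ void markNonEmpty(const int *items, int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (items[i] != 0) ? 1 : 0;   // 0 denotes an empty item
}

__global__ void scatterCompacted(const int *items, const int *flags,
                                 const int *positions, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[positions[i]] = items[i];   // positions[] is the exclusive prefix sum of flags[]
}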
Synchronization
Memory Fences: CUDA uses a weakly ordered memory model. The order of writes into shared/global/host/peer memory is not necessarily the same as the order in which another thread observes them, and the order of read operations is not necessarily the same as the order of the read instructions in the code.
Example: let us have the variables
__device__ int X = 1, Y = 2;
__device__ void write() { X = 10; Y = 20; }
__device__ void read() { int A = X; int B = Y; }
A thread running read() concurrently with write() may observe the writes in either order; in particular, A == 1 combined with B == 20 is a possible outcome.
Synchronization
Memory Fences (a usage sketch follows below):
__threadfence_block();
__threadfence();
__threadfence_system();
__threadfence_block is the weakest, __threadfence_system the strongest; see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
Barrier: synchronization between warps in a block; __syncthreads is stronger than the thread fences, and the predicate variants require CC 2.0+:
__syncthreads();
__syncthreads_count(predicate);
__syncthreads_and(predicate);
__syncthreads_or(predicate);
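A minimal sketch (the function name writeOrdered is assumed; not from the slides) of how __threadfence() can restore the ordering needed by the write()/read() example above; a full flag/payload handshake would also need care on the reader side.

__device__ int X = 1, Y = 2;

__device__ void writeOrdered()
{
    X = 10;
    __threadfence();   // all preceding writes are observed before any following write
    Y = 20;            // a thread that observes Y == 20 is guaranteed to also observe X == 10
}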
Performance Fine-tuning
Work Partitioning: the amount of work assigned to each thread and to each thread block is perhaps the most important decision in the overall application design. Many things need to be considered: workload balance, core occupancy, utilization of registers and shared memory, and implicit and explicit synchronization.
Performance Fine-tuning
Selecting the Number and Size of the Blocks: the number of threads per block should be divisible by the warp size. As many threads as possible gives better occupancy and hides various latencies; as few threads as possible avoids register spilling and leaves more shared memory per thread.
Specifying Kernel Bounds (a usage sketch follows below):
__global__ void __launch_bounds__(maxThreads, minBlocks) myKernel(...) { ... }
maxThreads is the maximal allowed number of threads per block; minBlocks is the minimal desired number of blocks per multiprocessor (optional).
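A minimal usage sketch (assumed kernel name and bounds) of __launch_bounds__: the compiler is told the kernel never runs with more than 256 threads per block and that at least 4 blocks per multiprocessor are desired, which constrains its register allocation.

__global__ void
__launch_bounds__(256, 4)
scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host-side launch must respect the bound (block size <= 256, ideally a multiple of the warp size):
// scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);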
Call Stack
Intra-device Function Calling: the compiler attempts to inline functions; otherwise a simple, limited call stack is available (since CC 2.0). When the stack overflows, the kernel execution fails with an unspecified error. The stack size can be queried and set (a host-side sketch follows below):
cudaDeviceGetLimit(&res, cudaLimitStackSize);
cudaDeviceSetLimit(cudaLimitStackSize, limit);
Recursion is not recommended: it is dangerous, divergent, and has a high overhead.
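A minimal host-side sketch (the 4 KiB value is illustrative) of querying and enlarging the per-thread device call stack.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default stack size: %zu bytes per thread\n", stackSize);

    cudaDeviceSetLimit(cudaLimitStackSize, 4096);   // e.g., 4 KiB per thread for deeper call chains

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("new stack size: %zu bytes per thread\n", stackSize);
    return 0;
}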
Thread Block Switching
Non-preemptive Block Assignment (compute capability lower than 3.5): once a block is assigned to a multiprocessor, all its threads must finish before it leaves, which causes problems with irregular workloads.
Preemptive Block Assignment (compute capability 3.5 or greater): running blocks can be suspended and offloaded. This creates a basis for dynamic parallelism: blocks can spawn child blocks and wait for them to complete (a sketch follows below).
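A minimal sketch (assumed kernel names; requires CC 3.5+ and compilation with -rdc=true) of dynamic parallelism: a parent kernel launches a child grid and waits for it to complete.

__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

__global__ void parentKernel(int *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);   // launched from the device
        cudaDeviceSynchronize();   // wait for the child grid to finish
    }
}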
Discussion