1
Single Instruction Multiple Threads
Martin Kruliš (v1.1)
2
SIMT Execution – Single Instruction Multiple Threads
All cores execute the same instruction, issued by a shared instruction decoder.
Each core has its own set of registers.
3
SIMT vs. SIMD
Single Instruction Multiple Threads (SIMT)
Width-independent programming model
Serial-like code
Achieved by hardware with a little help from the compiler
Allows code divergence
Single Instruction Multiple Data (SIMD)
Explicitly exposes the width of the SIMD vector
Special instructions
Generated by the compiler or written directly by the programmer
Code divergence is usually not supported
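To make the "serial-like, width-independent" point concrete, here is a minimal sketch of a CUDA kernel (my own illustration, not from the slides; the kernel name and launch sizes are made up). Each thread runs scalar-looking code and the hardware maps threads onto warps of whatever width the architecture provides.

// Width-independent, serial-like SIMT code: no SIMD width appears in the source.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard when n is not a multiple of the block size
        c[i] = a[i] + b[i];
}

// Host-side launch (illustrative sizes):
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;
//   vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);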
4
Thread-Core Mapping – how are threads assigned to SMPs?
Grid – runs the same kernel
Block – assigned to an SMP
Warp – simultaneously run on SM cores
Thread – executed on a core
A warp is made of 32 consecutive threads of the block according to their thread ID (i.e., threads 0-31, 32-63, …). Warp size is 32 for all compute capabilities.
The linear thread ID is threadIdx.x in one dimension, threadIdx.x + threadIdx.y*blockDim.x in two dimensions, and threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y in three dimensions (see the sketch below).
Multiple kernels may run simultaneously on the GPU (since Fermi) and multiple blocks may be assigned simultaneously to an SMP (if the registers, the schedulers, and the shared memory can accommodate them).
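As a concrete illustration of the thread-ID linearization above (my own sketch, not from the slides; the helper names are made up):

// Linear thread ID within a block (covers the 1D, 2D, and 3D cases)
// and the warp index derived from it; warp size is 32.
__device__ unsigned linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ unsigned warpIndex()
{
    return linearThreadId() / 32;   // threads 0-31 form warp 0, 32-63 warp 1, ...
}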
5
HW Revision Fermi Architecture CC 2.x
SM – Streaming (Symmetric) Multiprocessor
6
HW Revision Kepler Architecture CC 3.x SMX
(Streaming Multiprocessor Next Generation)
7
HW Revision Maxwell Architecture CC 5.0 SMM
(Streaming Multiprocessor - Maxwell)
8
Instruction Schedulers
Decomposition
Each block assigned to the SMP is divided into warps and the warps are assigned to schedulers.
Schedulers
Select a warp that is ready to execute at every instruction cycle.
The SMP instruction throughput depends on CC:
1.x – 1 instruction per 4 cycles, 1 scheduler
2.0 – 1 instruction per 2 cycles, 2 schedulers
2.1 – 2 instructions per 2 cycles, 2 schedulers
3.x and 5.x – 2 instructions per cycle, 4 schedulers
The most common reason why a warp is not ready to execute is that the operands of its next instruction are not available yet (i.e., they are being loaded from global memory, subject to a read-after-write dependency, …).
Maxwell (CC 5.0) also has 4 dual-issue schedulers (like Kepler), but only 128 cores (32 cores per scheduler), and the latency of math instructions has been improved.
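The per-SM parameters mentioned above (warp size, number of multiprocessors, resident-thread limits) can be queried at runtime. A host-side sketch using the standard cudaGetDeviceProperties call (device 0 is an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    std::printf("SM count:              %d\n", prop.multiProcessorCount);
    std::printf("warp size:             %d\n", prop.warpSize);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}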
9
Hiding Latency – Fast Context Switch
When a warp gets stalled (e.g., by a data load/store), the scheduler switches to the next active warp.
10
SIMT and Branches – Masking Instructions
In case of data-driven branches (if-else conditions, while loops, …), all branches are traversed and threads mask their execution in the branches they did not take:
if (threadIdx.x % 2 == 0) {
    ... even threads code ...
} else {
    ... odd threads code ...
}
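The following sketch (my own illustration, not from the slides; names are made up) contrasts the divergent split above with a warp-uniform one, where the condition depends only on the warp index, so all 32 threads of a warp take the same branch. Note that the two versions assign work to data differently, which the surrounding algorithm has to account for.

// Divergent: within every warp, half the threads take each branch,
// so both paths are executed (with masking) by every warp.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;   // even threads
    else
        data[i] += 1.0f;   // odd threads
}

// Warp-uniform: the branch depends only on the warp index,
// so each warp executes exactly one of the two paths.
__global__ void warpUniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;
    if (warpId % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}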
11
Reducing Thread Divergence
Work Reorganization
In case the workload is imbalanced, cheap balancing can lead to better occupancy.
Example: a matrix with dimensions not divisible by the warp size; item (i,j) has linear index i*width + j (see the sketch below).
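One possible reading of the matrix example, sketched as my own illustration (not code from the slides): if every row gets its own warps, the last warp of each row is partially masked whenever width is not divisible by 32; treating the matrix as one linear array of width*height items keeps whole warps busy, and only the very last warp of the grid can be partial.

// Per-row mapping: one block per row, so the last warp of every row
// is partially idle when width % 32 != 0.
__global__ void perRow(float *m, int width)
{
    int i = blockIdx.x;    // row index
    int j = threadIdx.x;   // column index
    if (j < width)
        m[i * width + j] += 1.0f;
}

// Linearized mapping: items are processed in flat order i*width + j,
// so at most one warp in the whole grid is partially idle.
__global__ void linearized(float *m, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < width * height)
        m[idx] += 1.0f;
}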
12
SIMT Algorithms – SIMD → SIMT, PRAM
SIMD → SIMT
A warp can be perceived as a 32-word SIMD engine.
PRAM
A parallel extension of the Random Access Machine; completely theoretical, no relation to practice.
Some PRAM algorithms can be used as a basis for creating SIMT algorithms (e.g., parallel reduction trees), but significant modifications are required.
Neither approach is perfect; the right way is somewhere in between.
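As an illustration of a reduction tree adapted to SIMT (my own sketch, not code from the slides), the following sums one value per thread within a warp using the __shfl_down_sync intrinsic (shuffle is available since CC 3.0, the _sync form since CUDA 9):

// Warp-level reduction tree: after log2(32) = 5 steps, lane 0 holds the sum.
__device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // the result is valid in lane 0
}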
13
SIMT Algorithm Example
Prefix-sum Compaction
Computing new positions for non-empty items.
Is this better than the sequential approach?
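A minimal warp-level sketch of the idea (my own illustration; the slides give no code): a ballot collects which lanes hold non-empty items, and a population count over the lower lanes gives each thread its output position, which is exactly a warp-wide exclusive prefix sum of 0/1 flags.

// Compacts the non-zero items of one full warp to the beginning of `out`.
__global__ void warpCompact(const int *in, int *out)
{
    int lane = threadIdx.x & 31;
    int item = in[lane];
    unsigned ballot = __ballot_sync(0xffffffff, item != 0);   // one bit per non-empty lane
    int pos = __popc(ballot & ((1u << lane) - 1));            // exclusive prefix sum of the flags
    if (item != 0)
        out[pos] = item;
}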
14
Synchronization – Memory Fences
Weak-ordered memory model
The order of writes into shared/global/host/peer memory is not necessarily the order in which another thread observes them.
The order of read operations is not necessarily the same as the order of the read instructions in the code.
Example
Let us have variables
__device__ int X = 1, Y = 2;
__device__ void write() { X = 10; Y = 20; }
__device__ void read() { int A = X; int B = Y; }
If one thread calls write() and another calls read(), any combination of old and new values may be observed (e.g., A == 1 together with B == 20).
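A sketch of how a fence constrains this example (my own illustration, based on the documented __threadfence semantics; the volatile qualifier is added so the compiler does not cache the values): the writer's fence guarantees that no thread observes the new Y before the new X, and the reader reads Y first, so observing B == 20 implies A == 10.

__device__ volatile int X = 1, Y = 2;

__device__ void write()
{
    X = 10;
    __threadfence();   // orders the write to X before the write to Y
    Y = 20;
}

__device__ void read(int *A, int *B)
{
    *B = Y;
    __threadfence();   // orders the read of Y before the read of X
    *A = X;
}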
15
Synchronization – Memory Fences, Barrier
Memory fences
__threadfence_block();
__threadfence();
__threadfence_system();
__threadfence_block() is the weakest fence, __threadfence_system() the strongest.
Barrier
Synchronization between warps in a block:
__syncthreads();
__syncthreads_count(predicate);   // CC 2.0+
__syncthreads_and(predicate);     // CC 2.0+
__syncthreads_or(predicate);      // CC 2.0+
__syncthreads() is stronger than __threadfence_block() in that it also acts as an execution barrier; its visibility guarantees, however, cover the thread block only.
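As an illustration of the predicate barrier (my own sketch, not from the slides; it assumes the array length is a multiple of the block size), the kernel below counts how many threads of the block satisfy a condition; every thread receives the same count as the return value of __syncthreads_count.

// __syncthreads_count() acts as a barrier and returns, to every thread,
// the number of threads in the block whose predicate was non-zero.
__global__ void countPositive(const float *data, int *blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (data[i] > 0.0f);
    int count = __syncthreads_count(pred);
    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;
}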
16
Performance Fine-tuning
Work Partitioning
The amount of work assigned to each thread and to each thread block is perhaps the most important decision in the overall application design.
Many things to consider:
Workload balance
Core occupancy
Utilization of registers and shared memory
Implicit and explicit synchronization
17
Performance Fine-tuning
Selecting the Number and Size of the Blocks
The number of threads should be divisible by the warp size.
As many threads as possible – better occupancy, hiding of various latencies, …
As few threads as possible – avoids register spilling, more shared memory per thread.
Specifying Kernel Bounds
__global__ void __launch_bounds__(maxThreads, minBlocks) myKernel(...) { ... }
maxThreads – maximal allowed number of threads per block
minBlocks – minimal desired number of blocks per multiprocessor (optional)
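A concrete (illustrative) use of the qualifier above; the kernel name and the bounds are made up, but the effect is standard: the compiler limits register usage so that blocks of up to 256 threads can run and at least two such blocks can be resident per multiprocessor.

// At most 256 threads per block; keep register usage low enough
// for at least 2 resident blocks per multiprocessor.
__global__ void __launch_bounds__(256, 2)
scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Launching this kernel with more than 256 threads per block fails:
//   scaleKernel<<<blocks, 256>>>(d_data, 2.0f, n);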
18
Call Stack – Intra-device Function Calling
The compiler attempts to inline functions; a simple, limited call stack is used otherwise.
When the stack overflows, kernel execution fails with an unspecified error.
The stack size can be queried and set (since CC 2.0):
cudaDeviceGetLimit(&res, cudaLimitStackSize);
cudaDeviceSetLimit(cudaLimitStackSize, limit);
Recursion is not recommended – dangerous, divergent, high overhead.
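A host-side sketch of the two calls above (the 8 KiB value is an illustrative assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    std::printf("default per-thread stack: %zu bytes\n", stackSize);

    // Enlarge the per-thread stack, e.g., for a deeper call chain.
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    std::printf("new per-thread stack:     %zu bytes\n", stackSize);
    return 0;
}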
19
Thread Block Switching
Non-preemptive Block Assignment
Compute capability lower than 3.5.
Once a block is assigned to a multiprocessor, all its threads must finish before it leaves – a problem with irregular workloads.
Preemptive Block Assignment
Compute capability 3.5 or greater.
Computing blocks can be suspended and offloaded.
Creates a basis for dynamic parallelism – blocks can spawn child blocks and wait for them to complete (see the sketch below).
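A minimal dynamic-parallelism sketch (my own illustration, CC 3.5+, compiled with relocatable device code, e.g. nvcc -rdc=true, and linked against cudadevrt; kernel names and sizes are made up). Each parent thread launches a child grid sized to its chunk of work; synchronization is left to the host, which waits for the parent and all of its children.

__global__ void childKernel(float *chunk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        chunk[i] += 1.0f;
}

// Parent kernel: one thread per chunk reads how much work its chunk needs
// and launches a child grid of the appropriate size (device-side launch).
__global__ void parentKernel(float *data, const int *chunkSizes, int chunkStride)
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;
    int n = chunkSizes[chunk];
    if (n > 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data + chunk * chunkStride, n);
    }
}

// Host side:
//   parentKernel<<<grid, block>>>(d_data, d_chunkSizes, stride);
//   cudaDeviceSynchronize();   // completes only after all child grids finish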
20
Discussion