Single Instruction Multiple Threads


Single Instruction Multiple Threads
Martin Kruliš (v1.1), 10.11.2016

SIMT Execution
- Single Instruction Multiple Threads
  - All cores execute the same instruction at a time
  - Each core has its own set of registers
- [Diagram: a single instruction decoder feeding all cores, each core with private registers]

SIMT vs. SIMD
- Single Instruction Multiple Threads
  - Width-independent programming model, serial-like code
  - Achieved by the hardware with a little help from the compiler
  - Allows code divergence
- Single Instruction Multiple Data
  - Explicitly exposes the width of the SIMD vector
  - Special instructions, generated by the compiler or written directly by the programmer
  - Code divergence is usually not supported
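To make the contrast concrete, here is a minimal sketch of the SIMT style: the kernel is written as if for a single thread and no vector width appears in the source (the kernel name and the sizes in the launch are illustrative, not taken from the slides).

    // Hedged sketch: a width-independent SIMT kernel. The same serial-like
    // body runs in every thread; the hardware supplies the parallel width.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // threads past n simply do nothing
            c[i] = a[i] + b[i];
    }

    // Host-side launch (illustrative sizes); an explicit SIMD version would
    // instead hard-code the vector width in intrinsics:
    // vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);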

Thread-Core Mapping
- How are threads assigned to SMPs?
  - Grid: the same kernel, mapped to the whole GPU
  - Block: assigned to a streaming multiprocessor (SMP)
  - Warp: runs simultaneously on the SMP cores
  - Thread: executed by a single core
- A warp is made of 32 consecutive threads of the block according to their thread ID (i.e., threads 0-31, 32-63, …). The warp size is 32 for all compute capabilities.
- The thread ID is threadIdx.x in one dimension, threadIdx.x + threadIdx.y*blockDim.x in two dimensions, and threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y in three dimensions.
- Multiple kernels may run simultaneously on the GPU (since Fermi), and multiple blocks may be assigned simultaneously to an SMP (if the registers, the schedulers, and the shared memory can accommodate them).
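As a quick illustration, the linearized in-block thread ID and the warp index can be computed inside a kernel as follows (a minimal sketch; the helper and kernel names are ours, not from the slides).

    // Hedged sketch: computing the linearized thread ID and warp index.
    __device__ unsigned linearThreadId()
    {
        // Matches the ordering above: x is fastest, then y, then z.
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }

    __global__ void whoAmI()
    {
        unsigned tid  = linearThreadId();
        unsigned warp = tid / warpSize;   // warpSize is a built-in variable (32 on current GPUs)
        unsigned lane = tid % warpSize;   // position of the thread within its warp
        // ... tid, warp, and lane can now be used to organize the work ...
    }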

HW Revision
- Fermi architecture (CC 2.x)
  - SM: Streaming (Symmetric) Multiprocessor

HW Revision
- Kepler architecture (CC 3.x)
  - SMX: Streaming Multiprocessor Next Generation

HW Revision
- Maxwell architecture (CC 5.0)
  - SMM: Streaming Multiprocessor, Maxwell generation

Instruction Schedulers
- Decomposition
  - Each block assigned to the SMP is divided into warps, and the warps are assigned to schedulers
- Schedulers
  - Select a warp that is ready to execute at every instruction cycle
  - The SMP instruction throughput depends on the compute capability:
    - 1.x: 1 instruction per 4 cycles, 1 scheduler
    - 2.0: 1 instruction per 2 cycles, 2 schedulers
    - 2.1: 2 instructions per 2 cycles, 2 schedulers
    - 3.x and 5.x: 2 instructions per cycle, 4 schedulers
- The most common reason why a warp is not ready to execute is that the operands of its next instruction are not available yet (e.g., they are still being loaded from global memory, or there is a read-after-write dependency, …).
- Maxwell (CC 5.0) also has 4 dual-issue schedulers (like Kepler); however, there are only 128 cores (32 cores per scheduler) and the latency of math instructions has been improved.

Hiding Latency
- Fast context switch
  - When a warp gets stalled (e.g., by a data load/store), the scheduler switches to the next active warp

SIMT and Branches
- Masking instructions
  - In case of data-driven branches (if-else conditions, while loops, …), all branches are traversed and threads mask their execution in the invalid branches

    if (threadIdx.x % 2 == 0) {
        ... even threads code ...
    } else {
        ... odd threads code ...
    }

- [Diagram: warp lanes 1, 2, 3, 4, … alternately masked in the even and odd branches]
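Divergence only costs anything when both paths of a branch occur within the same warp. A hedged sketch of the difference (the kernel names are illustrative, and the second version assumes blockDim.x is a multiple of the warp size):

    // Divergent: even and odd lanes of every warp take different paths,
    // so each warp executes both branches (one half masked at a time).
    __global__ void divergentBranch(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) data[i] *= 2.0f;
        else            data[i] += 1.0f;
    }

    // Warp-aligned: the condition is uniform within each warp (all 32 lanes agree),
    // so every warp executes only one of the two branches.
    __global__ void warpAlignedBranch(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) % 2 == 0) data[i] *= 2.0f;
        else                         data[i] += 1.0f;
    }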

Reducing Thread Divergence
- Work reorganization
  - In case the workload is imbalanced, cheap balancing can lead to better occupancy
- Example
  - A matrix whose dimensions are not divisible by the warp size
  - Item (i,j) has linear index i*width + j
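A hedged sketch of the reorganization idea: with one thread per (i,j) position and a 2D guard, every row ends with a partially filled warp, whereas mapping threads onto the linearized index leaves at most the very last warp of the grid partially idle (kernel names and the width/height parameters are illustrative).

    // 2D mapping: each row ends with a partially filled warp when width % 32 != 0.
    __global__ void process2D(float *m, int width, int height)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (i < height && j < width)
            m[i * width + j] += 1.0f;
    }

    // Linearized mapping: items are addressed by their linear index i*width + j,
    // so warps are filled densely across row boundaries.
    __global__ void processLinear(float *m, int width, int height)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < width * height)
            m[idx] += 1.0f;
    }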

SIMT Algorithms
- SIMD → SIMT
  - A warp can be perceived as a 32-word SIMD engine
- PRAM
  - Parallel extension of the Random Access Machine
  - Completely theoretical, no relation to practice
  - Some PRAM algorithms can be used as a basis for creating SIMT algorithms (e.g., parallel reduction trees), but significant modifications are required
- Neither approach is perfect; the right way is somewhere in between.
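As an example of a PRAM-style reduction tree adapted to SIMT, here is a hedged sketch of a block-level sum in shared memory (assumes blockDim.x is a power of two; kernel and buffer names are illustrative).

    // Hedged sketch: tree reduction within one thread block.
    __global__ void blockSum(const float *in, float *blockSums, int n)
    {
        extern __shared__ float buf[];                  // one float per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Halve the number of active threads in each step.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = buf[0];             // partial sum of this block
    }

    // Launch (illustrative):
    // blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n);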

SIMT Algorithm Example
- Prefix-sum compaction
  - Computing new positions for the non-empty items
  - Is this better than the sequential approach?
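A hedged sketch of the scatter step of such a compaction, assuming an exclusive prefix sum of the "non-empty" flags has already been computed (e.g., by a separate scan kernel or a library such as Thrust or CUB); all names are illustrative.

    // Hedged sketch: scatter phase of stream compaction.
    // flags[i]     - 1 if item i is non-empty, 0 otherwise
    // positions[i] - exclusive prefix sum of flags (the new index of item i)
    __global__ void compact(const float *in, const int *flags,
                            const int *positions, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i])
            out[positions[i]] = in[i];   // every surviving item knows its target slot
    }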

Synchronization
- Memory fences
  - Weak-ordered memory model
    - The order of writes into shared/global/host/peer memory is not necessarily the order in which those writes are observed by another thread
    - The order of read operations is not necessarily the same as the order of the read instructions in the code
- Example
  - Let us have variables

    __device__ int X = 1, Y = 2;

    __device__ void write() {
        X = 10;
        Y = 20;
    }

    __device__ void read() {
        int A = X;
        int B = Y;
    }
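With this code, a thread calling read() concurrently with write() may observe B == 20 while A is still 1, because nothing orders the two writes. A hedged sketch of the fix, along the lines of the example in the CUDA programming guide (function names are ours):

    // Hedged sketch: enforcing ordering with __threadfence().
    // With the fences in place, if read observes B == 20 it is guaranteed
    // to also observe A == 10; without them, A == 1 and B == 20 is possible.
    __device__ int X = 1, Y = 2;

    __device__ void writeFenced() {
        X = 10;
        __threadfence();   // writes before the fence become visible before writes after it
        Y = 20;
    }

    __device__ void readFenced() {
        int B = Y;
        __threadfence();   // order the read of Y before the read of X
        int A = X;
        // if B == 20, then A is guaranteed to be 10
    }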

Synchronization
- Memory fences

    __threadfence();
    __threadfence_block();
    __threadfence_system();

  - __threadfence_block is the weakest, __threadfence_system is the strongest
  - https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
- Barrier
  - Synchronization between warps in a block
  - __syncthreads is stronger than the thread fences

    __syncthreads();
    __syncthreads_count(predicate);   // CC 2.0+
    __syncthreads_and(predicate);     // CC 2.0+
    __syncthreads_or(predicate);      // CC 2.0+
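A typical use of the barrier is staging data through shared memory, where every thread must wait until the whole tile has been loaded before anyone reads elements loaded by other threads (a hedged sketch; the kernel name and tile size are illustrative, and blockDim.x is assumed to equal TILE).

    // Hedged sketch: __syncthreads() separating the load and the use of a shared tile.
    #define TILE 256

    __global__ void reverseTile(const float *in, float *out)
    {
        __shared__ float tile[TILE];
        int i = blockIdx.x * TILE + threadIdx.x;

        tile[threadIdx.x] = in[i];               // each thread loads one element
        __syncthreads();                         // wait until the whole tile is loaded

        out[i] = tile[TILE - 1 - threadIdx.x];   // read an element loaded by another thread
    }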

Performance Fine-tuning
- Work partitioning
  - The amount of work assigned to each thread and to each thread block
  - Perhaps the most important decision in the overall application design
  - Many things to consider: workload balance, core occupancy, utilization of registers and shared memory, implicit and explicit synchronization

Performance Fine-tuning
- Selecting the number and size of the blocks
  - The number of threads should be divisible by the warp size
  - As many threads as possible: better occupancy, hiding various latencies, …
  - As few threads as possible: avoids register spilling, more shared memory per thread
- Specifying kernel bounds

    __global__ void __launch_bounds__(maxThreads, minBlocks)
    myKernel(...) { ... }

  - maxThreads: maximal allowed number of threads per block
  - minBlocks: minimal desired number of blocks per multiprocessor (optional)
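A hedged example of how the bounds might be combined with the occupancy API to pick a block size at run time; the kernel body and all names are illustrative, and cudaOccupancyMaxPotentialBlockSize is a CUDA runtime call (available since CUDA 6.5) that the slides do not mention.

    // Hedged sketch: constraining a kernel and letting the runtime suggest a block size.
    __global__ void __launch_bounds__(256, 2)   // at most 256 threads, aim for >= 2 blocks per SMP
    scaleKernel(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    void launchScale(float *d_data, float factor, int n)
    {
        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for a block size that maximizes occupancy for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
        int gridSize = (n + blockSize - 1) / blockSize;
        scaleKernel<<<gridSize, blockSize>>>(d_data, factor, n);
    }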

Call Stack
- Intra-device function calling
  - The compiler attempts to inline functions
  - A simple, limited call stack is available otherwise
  - When the stack overflows, the kernel execution fails with an unspecified error
  - The stack size can be queried and set (since CC 2.0)

    cudaDeviceGetLimit(&res, cudaLimitStackSize);
    cudaDeviceSetLimit(cudaLimitStackSize, limit);

- Recursion is not recommended
  - Dangerous, divergent, high overhead
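A hedged host-side sketch of querying the limit and raising it before launching a kernel that needs deeper calls (the function and variable names are ours):

    // Hedged sketch: doubling the per-thread device call stack.
    #include <cstdio>
    #include <cuda_runtime.h>

    void growDeviceStack()
    {
        size_t current = 0;
        cudaDeviceGetLimit(&current, cudaLimitStackSize);   // current limit in bytes per thread
        printf("Device stack size: %zu bytes per thread\n", current);

        cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, current * 2);
        if (err != cudaSuccess)
            printf("Failed to set stack size: %s\n", cudaGetErrorString(err));
    }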

Thread Block Switching
- Non-preemptive block assignment
  - Compute capability lower than 3.5
  - Once a block is assigned to a multiprocessor, all its threads must finish before it leaves
  - Problems with irregular workloads
- Preemptive block assignment
  - Compute capability 3.5 or greater
  - Computing blocks can be suspended and offloaded
  - Creates a basis for dynamic parallelism: blocks can spawn child blocks and wait for them to complete
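A hedged sketch of the dynamic parallelism this enables: a parent kernel launches a child grid and waits for it. Requires CC 3.5+ and relocatable device code (nvcc -rdc=true -lcudadevrt); the kernel names are illustrative, and note that device-side cudaDeviceSynchronize has been deprecated in recent CUDA releases.

    // Hedged sketch: dynamic parallelism (CC 3.5+).
    __global__ void childKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    __global__ void parentKernel(float *data, int n)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            // One thread launches a child grid over the whole data set.
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
            cudaDeviceSynchronize();   // wait for the child grid to complete
        }
    }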

Discussion