Single Instruction Multiple Threads
Martin Kruliš (v1.1), 10.11.2016
SIMT Execution
Single Instruction Multiple Threads: all cores execute the same instruction, but each core has its own set of registers. (Diagram: a single instruction decoder feeding all cores, each core with its private registers.)
SIMT vs. SIMD
Single Instruction Multiple Threads: a width-independent programming model with serial-like code, achieved by the hardware with a little help from the compiler; code divergence is allowed.
Single Instruction Multiple Data: the width of the SIMD vector is exposed explicitly through special instructions, generated by the compiler or written directly by the programmer; code divergence is usually not supported.
Thread-Core Mapping
How are threads assigned to SMPs? Grid → block → warp → thread: all blocks of the grid run the same kernel, a block is assigned to an SMP, a warp runs simultaneously on the SM cores, and a thread maps to a core.
A warp is made of 32 consecutive threads of the block according to their thread ID (i.e., threads 0-31, 32-63, ...). Warp size is 32 for all compute capabilities. The thread ID is threadIdx.x in one dimension, threadIdx.x + threadIdx.y*blockDim.x in two dimensions, and threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y in three dimensions; a sketch of this computation is shown below. Multiple kernels may run simultaneously on the GPU (since Fermi) and multiple blocks may be assigned simultaneously to an SMP (if the registers, the schedulers, and the shared memory can accommodate them).
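The following is a minimal CUDA sketch (not part of the original slides; the kernel name threadMappingDemo and the launch configuration are illustrative) showing how the linear thread ID, warp index, and lane index are derived inside a kernel.

#include <cstdio>

__global__ void threadMappingDemo()
{
    // Linear thread ID within the block (3D case; degenerates to 2D/1D
    // when blockDim.z or blockDim.y is 1).
    unsigned tid = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;

    unsigned warpId = tid / warpSize;   // which warp of the block this thread belongs to
    unsigned lane   = tid % warpSize;   // position within the warp (0-31)

    if (lane == 0)
        printf("block %u, warp %u starts at thread %u\n", blockIdx.x, warpId, tid);
}

int main()
{
    threadMappingDemo<<<2, dim3(8, 4, 2)>>>();   // 64 threads per block = 2 warps
    cudaDeviceSynchronize();
    return 0;
}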
HW Revision
Fermi Architecture (CC 2.x): SM – Streaming (Symmetric) Multiprocessor
HW Revision
Kepler Architecture (CC 3.x): SMX – Streaming Multiprocessor (Next Generation)
HW Revision
Maxwell Architecture (CC 5.0): SMM – Streaming Multiprocessor (Maxwell)
Instruction Schedulers
Decomposition: each block assigned to the SMP is divided into warps, and the warps are assigned to schedulers. At every instruction cycle, each scheduler selects a warp that is ready to execute. The SMP instruction throughput depends on the compute capability:
1.x – 1 instruction per 4 cycles, 1 scheduler
2.0 – 1 instruction per 2 cycles, 2 schedulers
2.1 – 2 instructions per 2 cycles, 2 schedulers
3.x and 5.x – 2 instructions per cycle, 4 schedulers
The most common reason why a warp is not ready to execute is that the operands of its next instruction are not available yet (i.e., they are still being loaded from global memory, subject to a read-after-write dependency, ...). Maxwell (CC 5.0) also has 4 dual-issue schedulers (like Kepler), but only 128 cores (32 cores per scheduler), and the latency of math instructions has been improved.
Hiding Latency
Fast Context Switch: when a warp gets stalled (e.g., by a data load/store), the scheduler switches to the next active warp.
SIMT and Branches
Masking Instructions: in the case of data-driven branches (if-else conditions, while loops, ...), all branches are traversed and each thread masks out its execution in the branches it did not take (a complete kernel sketch follows below):
if (threadIdx.x % 2 == 0) {
    ... even threads code ...
} else {
    ... odd threads code ...
}
(Diagram: the same lanes 1, 2, 3, 4, ... appear in both branches, active in one and masked in the other.)
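A minimal sketch (assumed kernel and parameter names, not from the slides) of the divergent branch above as a complete kernel; both paths are traversed by the warp, with the inactive lanes masked in each.

__global__ void divergenceDemo(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0) {
        data[i] *= 2;     // even lanes execute this path (odd lanes are masked)
    } else {
        data[i] += 1;     // odd lanes execute this path (even lanes are masked)
    }
}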
Reducing Thread Divergence
Work Reorganization: when the workload is imbalanced, cheap balancing can lead to better occupancy.
Example: a matrix whose dimensions are not divisible by the warp size; item (i,j) has linear index i*width + j, and threads can be assigned along this linear index (see the sketch below).
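A minimal sketch (assumed kernel name and per-item work) of such reorganization: threads are mapped to the linear index i*width + j over the whole matrix, so only the very last warp can be partially filled, instead of every row ending with a partial warp.

__global__ void processMatrixLinear(float *matrix, int width, int height)
{
    int linear = blockIdx.x * blockDim.x + threadIdx.x;
    if (linear >= width * height) return;

    int i = linear / width;   // row
    int j = linear % width;   // column

    matrix[i * width + j] += 1.0f;   // placeholder for the per-item work
}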
SIMT Algorithms
SIMD → SIMT: a warp can be perceived as a 32-word SIMD engine.
PRAM: a parallel extension of the Random Access Machine; completely theoretical, with no direct relation to practical hardware.
Some PRAM algorithms can be used as a basis for creating SIMT algorithms (e.g., parallel reduction trees), but significant modifications are required. Neither approach is perfect; the right way is somewhere in between. A reduction-tree sketch follows below.
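A minimal sketch (assumed kernel name; not from the slides) of a PRAM-style parallel reduction tree adapted to SIMT: each block reduces blockDim.x elements in shared memory, halving the number of active threads in every step.

__global__ void blockReduce(const float *in, float *blockSums, int n)
{
    extern __shared__ float buf[];   // blockDim.x floats, passed as the third launch parameter
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction; blockDim.x is assumed to be a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = buf[0];   // one partial sum per block
}

// Example launch: blockReduce<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_sums, n);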
SIMT Algorithm Example
Prefix-sum Compaction: the new positions of the non-empty items are computed by a prefix sum over their occupancy flags. Is this better than the sequential approach? A sketch follows below.
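A minimal sketch (assumed kernel and array names) of prefix-sum compaction: non-empty items are flagged, an exclusive prefix sum over the flags (computed by a separate scan step, not shown) yields the new positions, and a scatter kernel writes the items there.

__global__ void markNonEmpty(const int *items, int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (items[i] != 0) ? 1 : 0;   // 0 denotes an empty item
}

__global__ void scatterCompacted(const int *items, const int *flags,
                                 const int *positions, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[positions[i]] = items[i];   // positions[] is the exclusive prefix sum of flags[]
}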
Synchronization
Memory Fences: CUDA uses a weakly ordered memory model. The order of writes into shared/global/host/peer memory is not necessarily the same as the order in which another thread observes them, and the order of read operations is not necessarily the same as the order of the read instructions in the code.
Example: let us have the variables
__device__ int X = 1, Y = 2;
__device__ void write() { X = 10; Y = 20; }
__device__ void read() { int A = X; int B = Y; }
A thread running read() concurrently with write() may observe the writes in either order; in particular, A == 1 combined with B == 20 is a possible outcome.
Synchronization
Memory Fences (a usage sketch follows below):
__threadfence_block();
__threadfence();
__threadfence_system();
__threadfence_block is the weakest, __threadfence_system the strongest; see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
Barrier: synchronization between warps in a block; __syncthreads is stronger than the thread fences, and the predicate variants require CC 2.0+:
__syncthreads();
__syncthreads_count(predicate);
__syncthreads_and(predicate);
__syncthreads_or(predicate);
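A minimal sketch (the function name writeOrdered is assumed; not from the slides) of how __threadfence() can restore the ordering needed by the write()/read() example above; a full flag/payload handshake would also need care on the reader side.

__device__ int X = 1, Y = 2;

__device__ void writeOrdered()
{
    X = 10;
    __threadfence();   // all preceding writes are observed before any following write
    Y = 20;            // a thread that observes Y == 20 is guaranteed to also observe X == 10
}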
Performance Fine-tuning
Work Partitioning: the amount of work assigned to each thread and to each thread block is perhaps the most important decision in the overall application design. Many things need to be considered: workload balance, core occupancy, utilization of registers and shared memory, and implicit and explicit synchronization.
Performance Fine-tuning
Selecting the Number and Size of the Blocks: the number of threads per block should be divisible by the warp size. As many threads as possible gives better occupancy and hides various latencies; as few threads as possible avoids register spilling and leaves more shared memory per thread.
Specifying Kernel Bounds (a usage sketch follows below):
__global__ void __launch_bounds__(maxThreads, minBlocks) myKernel(...) { ... }
maxThreads is the maximal allowed number of threads per block; minBlocks is the minimal desired number of blocks per multiprocessor (optional).
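A minimal usage sketch (assumed kernel name and bounds) of __launch_bounds__: the compiler is told the kernel never runs with more than 256 threads per block and that at least 4 blocks per multiprocessor are desired, which constrains its register allocation.

__global__ void
__launch_bounds__(256, 4)
scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host-side launch must respect the bound (block size <= 256, ideally a multiple of the warp size):
// scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);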
Call Stack
Intra-device Function Calling: the compiler attempts to inline functions; otherwise a simple, limited call stack is available (since CC 2.0). When the stack overflows, the kernel execution fails with an unspecified error. The stack size can be queried and set (a host-side sketch follows below):
cudaDeviceGetLimit(&res, cudaLimitStackSize);
cudaDeviceSetLimit(cudaLimitStackSize, limit);
Recursion is not recommended: it is dangerous, divergent, and has a high overhead.
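A minimal host-side sketch (the 4 KiB value is illustrative) of querying and enlarging the per-thread device call stack.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default stack size: %zu bytes per thread\n", stackSize);

    cudaDeviceSetLimit(cudaLimitStackSize, 4096);   // e.g., 4 KiB per thread for deeper call chains

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("new stack size: %zu bytes per thread\n", stackSize);
    return 0;
}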
Thread Block Switching
Non-preemptive Block Assignment (compute capability lower than 3.5): once a block is assigned to a multiprocessor, all its threads must finish before it leaves, which causes problems with irregular workloads.
Preemptive Block Assignment (compute capability 3.5 or greater): running blocks can be suspended and offloaded. This creates a basis for dynamic parallelism: blocks can spawn child blocks and wait for them to complete (a sketch follows below).
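A minimal sketch (assumed kernel names; requires CC 3.5+ and compilation with -rdc=true) of dynamic parallelism: a parent kernel launches a child grid and waits for it to complete.

__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

__global__ void parentKernel(int *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);   // launched from the device
        cudaDeviceSynchronize();   // wait for the child grid to finish
    }
}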
Discussion