Martin Kruliš
3dfx Voodoo 1
◦ First 3D graphics accelerator for desktop PCs
NVIDIA GeForce 256
◦ First Transform & Lighting unit
NVIDIA GeForce2, ATI Radeon
GPU gets programmable parts
◦ DirectX, OpenGL – vertex and fragment shaders (v1.0)
DirectX 10, Windows Vista
◦ Unified shader architecture in HW
◦ Geometry shader added
NVIDIA CUDA
◦ First GPGPU solution, restricted to NVIDIA GPUs
AMD Stream SDK (previously CTM)
OpenCL, DirectCompute
◦ Mac OS X (Snow Leopard) first to implement OpenCL
2010 – OpenCL revision 1.1
◦ Stable implementations from AMD and NVIDIA
OpenCL revision 1.2, NVIDIA Kepler architecture
2013 – OpenCL revision 2.0
CPU
◦ Few cores per chip
◦ General-purpose cores
◦ Processing different threads
◦ Huge caches to reduce memory latency
  Locality of reference problem
GPU
◦ Many cores per chip
◦ Cores specialized for numeric computations
◦ SIMT thread processing
◦ Huge amount of threads and fast context switch
  Results in more complex memory transfers
Architecture Convergence
NVIDIA Fermi
◦ 16 SMP units
◦ 512 CUDA cores
◦ 768 kB L2 cache
Note that one CUDA core corresponds to one 5D AMD Stream Processor (VLIW5). Therefore, Radeon 5870 has 320 cores with 4-way SIMD capabilities and one SFU.
Fermi Multiprocessor (SMP)
◦ 32 CUDA cores
◦ 64 kB shared memory (or L1 cache)
◦ 1024 registers per core
◦ 16 load/store units
◦ 4 special function units
◦ 16 double-precision ops per clock
◦ 1 instruction decoder
  All cores run in lockstep
Data Parallelism
◦ Many data elements are processed concurrently by the same routine
◦ GPUs are designed for this particular paradigm
  They also have only limited means to express task parallelism
Threading Execution Model
◦ One function (the kernel) is executed by many threads (see the sketch below)
  Much more lightweight than CPU threads
◦ Threads are grouped into blocks/work groups of the same size
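A minimal OpenCL C sketch of this model (the kernel name and arguments are illustrative): each of the launched threads adds one pair of elements.

__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);  // unique index of this thread (work item)
    c[i] = a[i] + b[i];        // each thread processes exactly one element
}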
Single Instruction Multiple Threads
◦ All cores are executing the same instruction
◦ Each core has its own set of registers
Single Instruction Multiple Threads (SIMT)
◦ Width-independent programming model
◦ Serial-like code
◦ Achieved by hardware with a little help from the compiler
◦ Allows code divergence
Single Instruction Multiple Data (SIMD)
◦ Explicitly exposes the width of the SIMD vector
◦ Special instructions
◦ Generated by the compiler or written directly by the programmer
◦ Code divergence is usually not supported
How are threads assigned to SMPs
◦ Grid – the same kernel executed on the whole GPU
◦ Block – assigned to one SMP
◦ Warp – threads that simultaneously run on the SMP cores
◦ Thread – executed by a single core
Masking Instructions
◦ In case of data-driven branches
  if-else conditions, while loops, …
◦ All branches are traversed, threads mask their execution in invalid branches

if (threadId % 2 == 0) {
    ... even threads code ...
} else {
    ... odd threads code ...
}
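Where both branches are short, the divergence can often be avoided by a conditional assignment (as suggested later in the optimization notes). A hedged sketch; the kernel and buffer names are illustrative:

__kernel void pick(__global float *out,
                   __global const float *evenValues,
                   __global const float *oddValues)
{
    int i = get_global_id(0);
    // Conditional assignment instead of an if-else block; the compiler can
    // emit a select/predicated instruction, so the warp does not diverge.
    out[i] = (i % 2 == 0) ? evenValues[i] : oddValues[i];
}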
Memory architecture (diagram): the GPU device contains global memory (> 100 GBps) and a GPU chip with an L2 cache shared by all SMPs; each SMP has its own L1 cache and registers for its cores. The GPU is connected to the host CPU via PCI Express (16/32 GBps); host memory bandwidth is ~ 25 GBps. Note that details about host memory interconnection are platform specific.
PCIe Transfers
◦ Much slower than internal GPU data transfers
◦ Issued explicitly by host code
  As a bulk transfer operation
◦ Or issued implicitly by the HW/operating system
  When GPU memory is mapped into the host memory space
◦ The transfers have significant overhead
  Overlapping – up to 2 asynchronous transfers may run whilst the GPU is computing (see the sketch below)
◦ Data must be binary safe (no pointers)
  CUDA introduces unified memory addressing
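A host-side sketch of such overlapping, using the C++ wrapper API shown later in these slides; it assumes the context, devices, kernel, and array size n from that example already exist, the host arrays hostDataA/hostDataB are illustrative, and real overlap additionally requires page-locked (pinned) host memory:

// Two in-order queues: one for copies, one for kernel execution
cl::CommandQueue copyQueue(context, devices[0]);
cl::CommandQueue execQueue(context, devices[0]);

cl::Buffer bufA(context, CL_MEM_READ_ONLY, sizeof(cl_float) * n);
cl::Buffer bufB(context, CL_MEM_READ_ONLY, sizeof(cl_float) * n);

// Blocking upload of the first chunk
copyQueue.enqueueWriteBuffer(bufA, CL_TRUE, 0, sizeof(cl_float) * n, hostDataA);

// Non-blocking (CL_FALSE) upload of the second chunk ...
cl::Event copied;
copyQueue.enqueueWriteBuffer(bufB, CL_FALSE, 0, sizeof(cl_float) * n,
                             hostDataB, NULL, &copied);

// ... which may overlap with the kernel processing the first chunk
kernel.setArg(0, bufA);
execQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n),
                               cl::NullRange, NULL, NULL);
execQueue.finish();
copied.wait();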
Global Memory Properties
◦ Off-chip, but on the GPU device
◦ High bandwidth and high latency
  ~ 100 GBps, hundreds of clock cycles
◦ Operated in transactions
  Continuous aligned segments of 32 B, 64 B, or 128 B
  The number of transactions caused by a global memory access depends on the access pattern
  Certain access patterns are optimized
◦ Data are cached in L2
  On Fermi also cached in L1 cache
  Kepler/Maxwell use L1 for specific purposes
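A sketch contrasting the two access patterns (both kernels just copy data; the names are illustrative):

__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];           // consecutive threads access consecutive words
                              // -> one transaction per warp segment
}

__kernel void copy_strided(__global const float *in, __global float *out,
                           int stride)
{
    int i = get_global_id(0);
    out[i] = in[i * stride];  // threads of a warp scatter across segments
                              // -> up to one transaction per thread
}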
Memory Shared by the SMP
◦ Divided into banks
  Each bank can be accessed independently
  Consecutive 32-bit words are in consecutive banks
  Optionally, 64-bit word division is used (Kepler)
◦ Bank conflicts are serialized
  Except for reading the same address (broadcast)

Architecture   Mem. size   # of banks   Latency
Tesla          16 kB       16           32 bits / 2 cycles
Fermi          48 kB       32           32 bits / 2 cycles
Kepler         48 kB       32           64 bits / 1 cycle
Linear Addressing
◦ Each thread in a warp accesses a different memory bank
◦ No collisions
Linear Addressing with Stride
◦ Each thread accesses the 2*i-th item
◦ 2-way conflicts (2x slowdown) on Fermi and older
◦ No collisions on Kepler and newer
  Due to the 64 bits per cycle throughput
Linear Addressing with Stride
◦ Each thread accesses the 3*i-th item
◦ No collisions, since the number of banks is not divisible by the stride
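A hedged sketch of avoiding bank conflicts by padding, on the classic matrix transpose; it assumes a square matrix whose width is a multiple of 32 and 32×32 work groups. The local tile is padded to 33 columns, so a column-wise read hits 32 different banks:

__kernel void transpose(__global const float *in, __global float *out,
                        int width)
{
    __local float tile[32][33];                // +1 column of padding
    int x = get_global_id(0), y = get_global_id(1);
    int lx = get_local_id(0), ly = get_local_id(1);

    tile[ly][lx] = in[y * width + x];          // conflict-free, coalesced load
    barrier(CLK_LOCAL_MEM_FENCE);

    // Column-wise read of the tile; without the padding, all 32 threads of
    // a warp would hit the same bank and the reads would be serialized
    int tx = get_group_id(1) * 32 + lx;
    int ty = get_group_id(0) * 32 + ly;
    out[ty * width + tx] = tile[lx][ly];
}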
Registers
◦ One register pool per multiprocessor
  8–64k 32-bit registers (depending on the architecture)
  Registers are assigned to threads by the compiler
◦ As fast as the cores (no extra clock cycles)
◦ Read-after-write dependency
  24 clock cycles
  Can be hidden if there are enough active warps
◦ The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible
  The programmer has no direct control over them
Fast Context Switch
◦ When a warp gets stalled
  E.g., by a data load/store
◦ The scheduler switches to the next active warp
Kepler's Major Improvements
◦ Next-generation Streaming Multiprocessor (SMX)
  192 cores, 32 SFUs, 32 load/store units
  3 cores share a DP unit, 6 cores share an LD/ST unit and an SFU
◦ Dynamic Parallelism
  A kernel may spawn child kernels (up to a depth of 24)
  Implies the work group context-switch capability
◦ Hyper-Q
  Up to 32 simultaneous GPU-host connections
  Better throughput if multiple processes/threads use the GPU (concurrent connections are managed in HW)
Maxwell's Major Improvements
◦ Maxwell Streaming Multiprocessor (SMM)
  Many internal optimizations, better power efficiency
  Improved scheduling, increased occupancy
  Reduced arithmetic instruction latency
◦ Larger L2 cache (2 MB)
◦ Dedicated shared memory (separate L1 cache)
◦ Native shared memory atomics
◦ Better support for dynamic parallelism
AMD's Graphics Core Next (GCN)
◦ Abandons the VLIW4 architecture
  1 VLIW × 4 ALU ops => 4 SIMD × 1 ALU op
◦ 32 compute units (Radeon HD 7970)
◦ 4 SIMD units per CU (each processing 16 elements)
◦ 10 planned wavefronts per SIMD unit
◦ Emphasis on vector processing (instructions, registers, memory, …)
◦ OpenCL 1.2, DirectCompute 11.1, and C++ AMP compatibility
Data Parallelism
◦ SIMT execution model
◦ Load balancing needs to be carefully considered
Host-Device Transfers
◦ Data need to be transferred to the GPU and back
◦ Computations should overlap with data transfers
  There should be a sufficient amount of computations
Memory Architecture
◦ Various types of memories and caches
◦ Coalesced loads/stores from/to global memory
◦ Banking properties of the shared memory
Universal Framework for Parallel Computations
◦ Specification created by the Khronos group
◦ Multiple implementations exist (AMD, NVIDIA, Mac, …)
API for Different Parallel Architectures
◦ Multicore CPUs, manycore GPUs, IBM Cell cards, Xeon Phi, …
◦ The host runtime handles device detection, data transfers, and code execution
Extended Version of C99 for Programming Devices
◦ The code is compiled at runtime for the selected device
◦ Theoretically, we may choose the best device for our application dynamically
  However, we have to consider HW-specific optimizations…
Hardware Model
◦ Device (CPU die or GPU card)
◦ Compute unit (CPU core or GPU SMP)
◦ Processing element (slot in SSE registers or GPU core)
Logical Layers
◦ Platform
  An implementation of OpenCL
◦ Context
  Groups devices of a selected kind
  Buffers, programs, and other objects live in a context
◦ Device
◦ Command Queue
  Created for a device
  Controls data transfers, kernel execution, and synchronization

Example devices: Intel Core i7 (4 cores with HT), ATI Radeon 5870 (320 cores)
// List all platforms
std::vector<cl::Platform> platforms;
cl_int err = cl::Platform::get(&platforms);
if (err != CL_SUCCESS) return 1;

// Context of all GPUs on the 1st platform
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
    (cl_context_properties)(platforms[0]()), 0 };
cl::Context context(CL_DEVICE_TYPE_GPU, cps, NULL, NULL, &err);

// List all GPUs
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

// Allocate memory on the GPU
cl::Buffer buf(context, CL_MEM_READ_ONLY, sizeof(cl_float) * n);

// Create and compile the program from string source
cl::Program program(context, cl::Program::Sources(1,
    std::make_pair(source.c_str(), source.length())));
err = program.build(devices);

// Mark function as a kernel
cl::Kernel kernel(program, "function_name", &err);
err = kernel.setArg(0, buf);
// GPU (in-order) command queue
cl::CommandQueue cmdQueue(context, devices[0], 0, &err);

// Copy input data to the GPU buffer
cmdQueue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float) * n, data);

// Execute the kernel and wait for it to finish (not necessary for the
// following readBuffer operation, since the queue is in-order)
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
    cl::NDRange(n), cl::NDRange(grp), NULL, NULL);
cmdQueue.finish();

// Copy the results back from the GPU
cmdQueue.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(cl_float) * n, data);
A Kernel
◦ Written in OpenCL C (an extended version of C99)
◦ Compiled at runtime for the destination platform
  With a (possibly) high degree of optimization
Kernel Execution
◦ Task Parallelism
  Multiple kernels are enlisted in a command queue and executed concurrently
◦ Data Parallelism
  Multiple instances (threads) are created from a single kernel, each operating on distinct data
Data Parallelism
◦ Each kernel instance has its own ID
  A 1-3 dimensional vector of numbers from 0 to N-1
◦ The ID identifies the portion of data to be processed
◦ Threads form groups (blocks)
  Threads within one group are executed on one SMP
  Groups have IDs as well (see the sketch below)
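A small sketch of the built-in ID functions in OpenCL C (the kernel name is illustrative):

__kernel void show_ids(__global int *out)
{
    int gid = get_global_id(0);    // position within the whole index space
    int lid = get_local_id(0);     // position within the work group
    int grp = get_group_id(0);     // ID of the work group itself
    int lsz = get_local_size(0);   // size of the work group
    out[gid] = grp * lsz + lid;    // equals gid (when no global offset is used)
}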
Types of Memory
◦ private – memory that belongs to one thread (registers)
◦ local – memory shared by a work group (shared memory)
◦ global – memory of the device
◦ constant – read-only version of global memory
  More easily cached
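A sketch of how the corresponding address space qualifiers appear in a kernel (names are illustrative; the __local buffer is sized by the host when the argument is set):

__kernel void spaces(__global float *data,        // device (global) memory
                     __constant float *coeffs,    // read-only, easily cached
                     __local float *scratch)      // shared by one work group
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float tmp = coeffs[0] * data[gid];            // tmp is private (a register)
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);                 // make it visible to the group
    data[gid] = scratch[lid];
}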
Functions
◦ Other functions (besides the kernel) can be defined in the program (the kernel is just an entry point)
  Some devices may impose limitations on the call stack
◦ It is possible to call standard functions like printf()
  However, they do not work on GPUs
◦ There are many built-in functions
  Thread-related functions (e.g., get_global_id())
  Mathematical and geometrical functions
    Originally designed for graphics
    Some of them are translated into a single instruction
  Functions for asynchronous memory transfers
Restrictions and Optimization Issues
◦ Branching problem (if-else)
  A work group runs in SIMT, thus all branches are followed
  Use conditional assignment rather than branches
◦ For loops
  The compiler attempts to unroll them automatically
◦ While loops
  The same problem as branching
◦ Vector operations (see the sketch below)
  Translated into a single instruction if possible (e.g., SSE)
  The compiler attempts to generate them automatically
  Different efficiency on different architectures
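An illustrative sketch of the vector operations: OpenCL C vector types such as float4 map to SIMD instructions where the hardware supports them (the kernel name is illustrative):

__kernel void saxpy4(__global const float4 *x,
                     __global float4 *y,
                     float a)
{
    int i = get_global_id(0);
    // One float4 operation per thread; on CPUs this can become a single
    // SSE instruction, on scalar GPU cores it is unrolled into four ops.
    y[i] = a * x[i] + y[i];
}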
Global
◦ Explicit barriers added to the command queue
◦ Events and event dependencies
Within a Kernel
◦ Local barriers (for a work group)
◦ Memory fences
◦ Atomic operations (see the sketch below)
  Common operations (add, sub, xchg, cmpxchg, …)
  Extended – min, max, and, or, xor
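A hedged sketch combining local barriers and atomics: a per-work-group histogram with 256 bins, one input byte per thread (the names are illustrative):

__kernel void histogram(__global const uchar *data,
                        __global uint *bins)
{
    __local uint localBins[256];
    int lid = get_local_id(0);

    for (int i = lid; i < 256; i += get_local_size(0))
        localBins[i] = 0;                       // cooperative initialization
    barrier(CLK_LOCAL_MEM_FENCE);               // wait until all bins are zeroed

    atomic_inc(&localBins[data[get_global_id(0)]]);
    barrier(CLK_LOCAL_MEM_FENCE);               // wait until all updates landed

    for (int i = lid; i < 256; i += get_local_size(0))
        atomic_add(&bins[i], localBins[i]);     // merge into the global result
}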
Example

__kernel void mul_matrix(__global const float *A,
                         __global const float *B,
                         __global float *C)
{
    int width = get_global_size(0);
    int x = get_global_id(0);
    int y = get_global_id(1);
    float sum = 0;
    for (int i = 0; i < width; ++i)
        sum += A[y*width + i] * B[i*width + x];
    C[y*width + x] = sum;
}
Optimized Solution
◦ A work group computes a block of 16×16 results (see the sketch below)
◦ In each step, appropriate blocks of 16×16 numbers are loaded into local memory and the intermediate results are updated
  Exact numbers may vary a little
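A sketch of the optimized kernel under these assumptions (16×16 work groups, matrix width a multiple of 16, the same arguments as the previous example):

#define TILE 16

__kernel void mul_matrix_tiled(__global const float *A,
                               __global const float *B,
                               __global float *C)
{
    int width = get_global_size(0);
    int x = get_global_id(0), y = get_global_id(1);
    int lx = get_local_id(0), ly = get_local_id(1);

    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];
    float sum = 0;

    for (int t = 0; t < width; t += TILE) {
        // Each thread loads one element of each 16x16 block
        tileA[ly][lx] = A[y * width + (t + lx)];
        tileB[ly][lx] = B[(t + ly) * width + x];
        barrier(CLK_LOCAL_MEM_FENCE);           // blocks are fully loaded

        for (int i = 0; i < TILE; ++i)          // update intermediate results
            sum += tileA[ly][i] * tileB[i][lx];
        barrier(CLK_LOCAL_MEM_FENCE);           // before blocks are overwritten
    }
    C[y * width + x] = sum;
}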
OpenCL 2.0 (July 2013)
◦ First implementations from Intel and AMD
◦ Shared virtual memory
  For sharing complex data structures with pointers
◦ Dynamic parallelism
  Device kernels can enqueue other kernels (see the sketch below)
◦ Generic address space
  Pointer arguments of functions need not declare their address space
◦ A subset of C11 atomics was added
◦ Pipes – FIFO data structures which can be read and written by kernels
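A hedged OpenCL 2.0 sketch of device-side enqueue (dynamic parallelism); it assumes the host has created an on-device default queue, and the kernel name and sizes are illustrative:

__kernel void parent(__global int *data, int n)
{
    if (get_global_id(0) == 0) {
        // One work item spawns a child grid of n work items
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] += 1; });
    }
}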