Download presentation
Presentation is loading. Please wait.
Published byHollie Horn Modified over 9 years ago
1
Martin Kruliš 9. 4. 2015 Martin Kruliš1
2
1996 - 3Dfx Voodoo 1 ◦ First graphical (3D) accelerator for desktop PCs 1999 - NVIDIA GeForce 256 ◦ First Transform&Lightning unit 2000 - NVIDIA GeForce2, ATI Radeon 2001 - GPU has programmable parts ◦ DirectX – vertex and fragment shaders (v1.0) 2006 - OpenGL, DirectX 10, Windows Vista ◦ Unified shader architecture in HW ◦ Geometry shader added 9. 4. 2015 Martin Kruliš2
3
2007 - NVIDIA CUDA ◦ First GPGPU solution, restricted to NVIDIA GPUs 2007 - AMD Stream SDK (previously CTM) 2009 - OpenCL, Direct Compute ◦ Mac OS (Snow Leopard) first to implement OpenCL 2010 – OpenCL revision 1.1 ◦ Stable implementations from AMD and NVIDIA 2011 - OpenCL revision 1.2 2012 - NVIDIA Kepler Architecture 2013 – OpenCL revision 2.0 9. 4. 2015 Martin Kruliš3
4
CPU ◦ Few cores per chip ◦ General purpose cores ◦ Processing different threads ◦ Huge caches to reduce memory latency Locality of reference problem 9. 4. 2015 Martin Kruliš4 GPU ◦ Many cores per chip ◦ Cores specialized for numeric computations ◦ SIMT thread processing ◦ Huge amount of threads and fast context switch Results in more complex memory transfers Architecture Convergence
5
NVIDIA Fermi ◦ 16 SMP units ◦ 512 CUDA cores ◦ 786kB L2 cache 9. 4. 2015 Martin Kruliš5 Note that one CUDA core corresponds to one 5D AMD Stream Processor (VLIW5). Therefore Radeon 5870 has 320 cores with 4-way SIMD capabilities and one SFU. Note that one CUDA core corresponds to one 5D AMD Stream Processor (VLIW5). Therefore Radeon 5870 has 320 cores with 4-way SIMD capabilities and one SFU.
6
Fermi Multiprocessor (SMP) ◦ 32 CUDA cores ◦ 64kB shared memory (or L1 cache) ◦ 1024 registers per core ◦ 16 load/store units ◦ 4 special function units ◦ 16 double precision ops per clock ◦ 1 instruction decoder All cores are running in lockstep 9. 4. 2015 Martin Kruliš6
7
Data Parallelism ◦ Many data elements are processed concurrently by the same routine ◦ GPUs are designed under this particular paradigm Also have limited ways to express task parallelism Threading Execution Model ◦ One function (kernel) is executed in many threads Much more lightweight than the CPU threads ◦ Threads are grouped into blocks/work groups of the same size 9. 4. 2015 Martin Kruliš7
8
Single Instruction Multiple Threads ◦ All cores are executing the same instruction ◦ Each core has its own set of registers 9. 4. 2015 Martin Kruliš8
9
Single Instruction Multiple Threads ◦ Width-independent programming model ◦ Serial-like code ◦ Achieved by hardware with a little help from compiler ◦ Allows code divergence 9. 4. 2015 Martin Kruliš9 Single Instruction Multiple Data ◦ Explicitly expose the width of SIMD vector ◦ Special instructions ◦ Generated by compiler or directly written by programmer ◦ Code divergence is usually not supported
10
How are threads assigned to SMPs 9. 4. 2015 Martin Kruliš10 GPU Symmetric Multiprocessor Core GridBlockWarpThread Assigned to SMP Simultaneously run on SM cores The same kernel
11
Masking Instructions ◦ In case of data-driven branches if-else conditions, while loops, … ◦ All branches are traversed, threads mask their execution in invalid branches if (threadId % 2 == 0) {... even threads code... } else {... odd threads code... } 9. 4. 2015 Martin Kruliš11 01234 … 01234 …
12
9. 4. 2015 Martin Kruliš12 Global Memory L2 Cache L1 Cache Registers Core … L1 Cache Registers Core … … Host CPU Host Memory SMP GPU Chip GPU Device PCI Express (16/32 GBps) PCI Express (16/32 GBps) ~ 25 GBps Note that details about host memory interconnection are platform specific > 100 GBps
13
PCIe Transfers ◦ Much slower than internal GPU data transfers ◦ Issued explicitly by host code As a bulk transfer operation ◦ Or issued explicitly by the HW/operating system When GPU memory is mapped to host memory space ◦ The transfers have significant overhead Overlapping - up to 2 asynchronous transfers may run whilst the GPU is computing ◦ Data must be binary safe (no pointers) CUDA introduces unified memory addressing 9. 4. 2015 Martin Kruliš13
14
Global Memory Properties ◦ Off-chip, but on the GPU device ◦ High bandwidth and high latency ~ 100 GBps, 400-600 of clock cycles ◦ Operated in transactions Continuous aligned segments of 32 B - 128 B Number of transactions caused by global memory access depends on the pattern of the access Certain access patterns are optimized ◦ Data are cached in L2 On Fermi also cached in L1 cache Kepler/Maxwell uses L1 for specific purposes 9. 4. 2015 Martin Kruliš14
15
Memory Shared by the SMP ◦ Divided into banks Each bank can be accessed independently Consecutive 32-bit words are in consecutive banks Optionally, 64-bit words division is used (Kepler) ◦ Bank conflicts are serialized Except for reading the same address (broadcast) 9. 4. 2015 Martin Kruliš15 ArchitectureMem. size# of bankslatency Tesla16 kB1632bits / 2 cycles Fermi48 kB3232 bits / 2 cycles Kepler48 kB3264 bits / 1 cycle
16
Linear Addressing ◦ Each thread in warp access different memory bank ◦ No collisions 9. 4. 2015 Martin Kruliš16
17
Linear Addressing with Stride ◦ Each thread access 2*i -th item ◦ 2-way conflicts (2x slowdown) on Fermi and older ◦ No collisions on Kepler and newer Due to 64-bits per cycle throughput 9. 4. 2015 Martin Kruliš17
18
Linear Addressing with Stride ◦ Each thread access 3*i -th item ◦ No collisions, since the number of banks is not divisible by the stride 9. 4. 2015 Martin Kruliš18
19
Registers ◦ One register pool per multiprocessor 8-64k of 32-bit registers (depending on architecture) Registers assigned to threads by compiler ◦ As fast as the cores (no extra clock cycles) ◦ Read-after-write dependency 24 clock cycles Can be hidden if there are enough active warps ◦ Hardware scheduler (and compiler) attempts to avoid register bank conflicts whenever possible The programmer have no direct control over them 9. 4. 2015 Martin Kruliš19
20
Fast Context Switch ◦ When a warp gets stalled E.g., by data load/store ◦ Scheduler switches to the next active warp 9. 4. 2015 Martin Kruliš20
21
Kepler’s Major Improvements ◦ Streaming Processors Next Generation (SMX) 192 cores, 32 SFUs, 32 load/store units 3 cores share a DP unit, 6 cores share LD and SFU ◦ Dynamic Parallelism Kernel may spawn child kernels (to depth of 24) Implies the work group context-switch capability ◦ Hyper-Q Up to 32 simultaneous GPU-host connections Better throughput if multiple processes/threads use the GPU (concurrent connections are managed in HW) 9. 4. 2015 Martin Kruliš21
22
9. 4. 2015 Martin Kruliš22
23
Maxwell’s Major Improvements ◦ Maxwell Symmetric Multiprocessor (SMM) Many internal optimizations, better power efficiency Improved scheduling, increased occupancy Reduced arithmetic instruction latency ◦ Larger L2 cache (2MB) ◦ Dedicated shared memory (separate L1 cache) ◦ Native shared memory atomics ◦ Better support for dynamic parallelism 9. 4. 2015 Martin Kruliš23
24
9. 4. 2015 Martin Kruliš24
25
AMD’s Graphic Core Next (GCN) ◦ Abandoning the VLIW4 architecture 1VLIW x 4 ALU ops => 4 SIMD x 1 ALU op ◦ 32 compute units (Radeon HD7970) ◦ 4 SIMD units per CU (each processing 16 elements) ◦ 10 planned wavefronts per SIMD unit ◦ Emphasis on vector processing (instructions, registers, memory, …) ◦ OpenCL 1.2, DirectCompute 11.1 and C++ AMP compatibility 9. 4. 2015 Martin Kruliš25
26
9. 4. 2015 Martin Kruliš26
27
Data Parallelism ◦ SIMT execution model ◦ Load balancing needs to be carefully considered Host-device Transfers ◦ Data needs to be transferred to the GPU and back ◦ Computations should overlap with data transfers There should be sufficient amount of computations Memory Architecture ◦ Various types of memories and caches ◦ Coalesced loads/stores from/to global memory ◦ Banking properties of the shared memory 9. 4. 2015 Martin Kruliš27
28
9. 4. 2015 Martin Kruliš28
29
Martin Kruliš 9. 4. 2015 Martin Kruliš29
30
Universal Framework for Parallel Computations ◦ Specification created by Khronos group ◦ Multiple implementations exist (AMD, NVIDIA, Mac, …) API for Different Parallel Architectures ◦ Multicore CPUs, Manycore GPUs, IBM Cell cards, Xeon Phi, … ◦ Host runtime handles device detection, data transfers, and code execution Extended Version of C99 for Programming Devices ◦ The code is compiled at runtime for selected device ◦ Theoretically, we may chose best device for our application dynamically However, we have to consider HW-specific optimizations… 9. 4. 2015 Martin Kruliš30
31
Hardware Model ◦ Device (CPU die or GPU card) ◦ Compute unit (CPU core or GPU SMP) ◦ Processing element (slot in SSE registers or GPU core) 9. 4. 2015 Martin Kruliš31
32
Logical Layers ◦ Platform An implementation of OCL ◦ Context Groups devices of selected kind Buffers, programs, and other objects live in a context ◦ Device ◦ Command Queue Created for a device Controls data transfers, kernel execution and synchronization 9. 4. 2015 Martin Kruliš32 Intel Core i7 (4 cores with HT) ATI Radeon 5870 (320 cores)
33
std::vector platforms; cl_int err = cl::Platform::get(&platforms); if (err != CL_SUCCESS) return 1; cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0]()), 0}; cl::Context context(CL_DEVICE_TYPE_GPU, cps, NULL, NULL, &err); std::vector devices = context.getInfo (); cl::Buffer buf(context, CL_MEM_READ_ONLY, sizeof(cl_float)*n); cl::Program program(context, cl::Program::Sources(1, std::make_pair(source.c_str(), source.length())) ); err = program.build(devices); cl::Kernel kernel(program, "function_name", &err); err = kernel.setArg(0, buf); 9. 4. 2015 Martin Kruliš33 List all platforms Context of all GPUs on the 1 st platform List all GPUs Create and compile the program from string source Mark function as a kernel Allocate memory on the GPU
34
cl::CommandQueue cmdQueue(context, devices[0], 0, &err); cmdQueue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data); cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NDRange(grp), NULL, NULL); cmdQueue.finish(); cmdQueue.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data); 9. 4. 2015 Martin Kruliš34 GPU (in-order) command queue Copy input data to the GPU buffer Execute the kernel and wait for it to finish (not necessary for the following readBuffer operation since the queue is in-order). Copy the results back from the GPU
35
A Kernel ◦ Written in OpenCL C (extended version of C99) ◦ Compiled at runtime for destination platform With (possibly) high degree of optimization Kernel Execution ◦ Task Parallelism Multiple kernels are enlisted in command queue and executed concurrently ◦ Data Parallelism Multiple instances (threads) are created from single kernel, each operate on distinct data 9. 4. 2015 Martin Kruliš35
36
Data Parallelism ◦ Each kernel instance has its own ID A 1-3 dimensional vector of numbers from 0 to N-1 ◦ ID identifies the portion of data to be processed ◦ Threads form groups (blocks) Threads within one group are executed on one SMP Groups have IDs as well 9. 4. 2015 Martin Kruliš36
37
Types of Memory ◦ private – memory that belongs to one thread (registers) ◦ local – memory shared by workgroup (shared memory) ◦ global – memory of the device ◦ constant – read-only version of global memory More easily cached 9. 4. 2015 Martin Kruliš37
38
9. 4. 2015 Martin Kruliš38
39
Functions ◦ Other functions (beside the kernel) can be defined in the program (kernel is just an entry point) Some devices may impose limitations on call stack ◦ It is possible to call std. functions like printf() However, they do not work on GPUs ◦ There are many built-in functions Thread-related functions (e.g. get_global_id() ) Mathematical and geometrical functions Originally designed for graphics Some of them are translated into single instruction Functions for asynchronous memory transfers 9. 4. 2015 Martin Kruliš39
40
Restrictions And Optimization Issues ◦ Branching problem (if-else) Workgroup runs in SIMT, thus all branches are followed Use conditional assignment rather than branches ◦ For-cycles The compiler attempts to unroll them automatically ◦ While-cycles The same problem as branching ◦ Vector operations Translated into single instruction if possible (e.g., SSE) The compiler attempts to generate them automatically Different efficiency on different architectures 9. 4. 2015 Martin Kruliš40
41
Global ◦ Explicit barriers added to the command queue ◦ Events and event dependencies Within a Kernel ◦ Local barriers (for a work group) ◦ Memory fences ◦ Atomic operations Common operations (add, sub, xchg, cmpxhg, …) Extended – min, max, and, or, xor 9. 4. 2015 Martin Kruliš41
42
__kernel void mul_matrix (__global const float *A, __global const float *B, __global float *C) { int width = get_global_size(0); int x = get_global_id(0); int y = get_global_id(1); float sum = 0; for (int i = 0; i < width; ++i) sum += A[y*width + i] * B[i*width + x]; C[y*width + x] = sum; } 9. 4. 2015 Martin Kruliš42 Example
43
Optimized Solution ◦ Workgroup computes block of 16x16 results ◦ In each step, appropriate blocks of 16x16 numbers are loaded into local memory and intermediate results are updated 9. 4. 2015 Martin Kruliš43 16 Exact numbers may vary a little Example
44
Open CL 2.0 (July 2013) ◦ First implementations from Intel and AMD ◦ Shared virtual memory For sharing complex data structures with pointers ◦ Dynamic parallelism Device kernels can enqueue other kernels ◦ Generic address space Functions pointer arguments need not declare their address space ◦ Subset of C11 atomics were added ◦ Pipes - FIFO data structures, which can be read and written by kernels 9. 4. 2015 Martin Kruliš44
45
9. 4. 2015 Martin Kruliš45
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.