
1 Martin Kruliš (9. 4. 2015)

2 1996 - 3Dfx Voodoo 1
 ◦ First 3D graphics accelerator for desktop PCs
1999 - NVIDIA GeForce 256
 ◦ First Transform & Lighting (T&L) unit
2000 - NVIDIA GeForce2, ATI Radeon
2001 - GPUs gain programmable parts
 ◦ DirectX – vertex and fragment shaders (v1.0)
2006 - OpenGL, DirectX 10, Windows Vista
 ◦ Unified shader architecture in HW
 ◦ Geometry shader added

3 2007 - NVIDIA CUDA
 ◦ First GPGPU solution, restricted to NVIDIA GPUs
2007 - AMD Stream SDK (previously CTM)
2009 - OpenCL, DirectCompute
 ◦ Mac OS X (Snow Leopard) first to implement OpenCL
2010 - OpenCL revision 1.1
 ◦ Stable implementations from AMD and NVIDIA
2011 - OpenCL revision 1.2
2012 - NVIDIA Kepler architecture
2013 - OpenCL revision 2.0

4 Architecture Convergence
 CPU
 ◦ Few cores per chip
 ◦ General-purpose cores
 ◦ Processing different threads
 ◦ Huge caches to reduce memory latency
    Locality of reference problem
 GPU
 ◦ Many cores per chip
 ◦ Cores specialized for numeric computations
 ◦ SIMT thread processing
 ◦ A huge number of threads and fast context switching
    Results in more complex memory transfers

5  NVIDIA Fermi
 ◦ 16 SMP units
 ◦ 512 CUDA cores
 ◦ 768 kB L2 cache
Note that one CUDA core corresponds to one 5D AMD Stream Processor (VLIW5). Therefore, a Radeon 5870 has 320 cores with 4-way SIMD capabilities and one SFU.

6  Fermi Multiprocessor (SMP)
 ◦ 32 CUDA cores
 ◦ 64 kB shared memory (or L1 cache)
 ◦ 1024 registers per core
 ◦ 16 load/store units
 ◦ 4 special function units
 ◦ 16 double-precision ops per clock
 ◦ 1 instruction decoder
    All cores run in lockstep

7  Data Parallelism
 ◦ Many data elements are processed concurrently by the same routine
 ◦ GPUs are designed under this particular paradigm
    They also have limited means to express task parallelism
 Threading Execution Model
 ◦ One function (kernel) is executed in many threads (a minimal kernel sketch follows below)
    Much more lightweight than CPU threads
 ◦ Threads are grouped into blocks/work groups of the same size
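To make the threading execution model concrete, here is a minimal illustrative OpenCL C kernel (not part of the original slides): the same function is executed by many threads, and each thread selects its data element by its ID.

    // One function, many threads: each thread handles the element given by its global ID.
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        size_t i = get_global_id(0);   // unique index of this thread
        c[i] = a[i] + b[i];
    }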

8  Single Instruction Multiple Threads
 ◦ All cores execute the same instruction
 ◦ Each core has its own set of registers

9  Single Instruction Multiple Threads (SIMT)
 ◦ Width-independent programming model
 ◦ Serial-like code
 ◦ Achieved by hardware with a little help from the compiler
 ◦ Allows code divergence
 Single Instruction Multiple Data (SIMD)
 ◦ Explicitly exposes the width of the SIMD vector
 ◦ Special instructions
 ◦ Generated by the compiler or written directly by the programmer
 ◦ Code divergence is usually not supported

10  How threads are assigned to SMPs (figure)
 ◦ Grid → GPU: all blocks of a grid run the same kernel
 ◦ Block → Symmetric Multiprocessor: each block is assigned to one SMP
 ◦ Warp/Thread → Core: warps run simultaneously on the SM cores

11  Masking Instructions
 ◦ In case of data-driven branches
    if-else conditions, while loops, …
 ◦ All branches are traversed; threads mask their execution in invalid branches

    if (threadId % 2 == 0) {
        ... even threads code ...
    } else {
        ... odd threads code ...
    }

12 Memory hierarchy overview (figure)
 ◦ Host Memory ↔ Host CPU: ~25 GBps (details about the host memory interconnection are platform specific)
 ◦ Host ↔ GPU Device: PCI Express (16/32 GBps)
 ◦ GPU chip: Global Memory (> 100 GBps) → L2 Cache → per-SMP L1 Cache and Registers → Cores

13  PCIe Transfers
 ◦ Much slower than internal GPU data transfers
 ◦ Issued explicitly by host code
    As a bulk transfer operation
 ◦ Or issued implicitly by the HW/operating system
    When GPU memory is mapped into the host memory space
 ◦ The transfers have significant overhead
    Overlapping - up to 2 asynchronous transfers may run while the GPU is computing (see the sketch below)
 ◦ Data must be binary safe (no pointers)
    CUDA introduces unified memory addressing
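As a rough illustration of the overlapping mentioned above, the following host-side sketch uses the same C++ wrapper API as the later slides. The buffers, kernels, and sizes (bufNext, nextData, kernelNext, chunkBytes, chunkItems) are made-up names, and real overlap additionally requires pinned host memory and device support.

    cl::CommandQueue transferQueue(context, devices[0], 0, &err);
    cl::CommandQueue computeQueue(context, devices[0], 0, &err);

    cl::Event uploadDone;
    // Non-blocking (CL_FALSE) write: returns immediately, the copy runs asynchronously.
    transferQueue.enqueueWriteBuffer(bufNext, CL_FALSE, 0, chunkBytes, nextData,
                                     NULL, &uploadDone);

    // Meanwhile, the GPU computes on a previously uploaded chunk.
    computeQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                      cl::NDRange(chunkItems), cl::NullRange);

    // The next kernel must wait until the upload has finished.
    std::vector<cl::Event> deps(1, uploadDone);
    computeQueue.enqueueNDRangeKernel(kernelNext, cl::NullRange,
                                      cl::NDRange(chunkItems), cl::NullRange,
                                      &deps, NULL);
    computeQueue.finish();
    transferQueue.finish();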

14  Global Memory Properties
 ◦ Off-chip, but on the GPU device
 ◦ High bandwidth and high latency
    ~100 GBps, 400-600 clock cycles
 ◦ Operated in transactions
    Contiguous aligned segments of 32 B - 128 B
    The number of transactions caused by a global memory access depends on the access pattern
    Certain access patterns are optimized (see the sketch below)
 ◦ Data are cached in L2
    On Fermi also cached in the L1 cache
    Kepler/Maxwell use L1 for specific purposes
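The following illustrative kernels (not from the slides) contrast a coalesced access pattern with a strided one; on the architectures above, the first can serve a whole warp with a single transaction, while the second may require many.

    // Coalesced: consecutive threads read consecutive 4 B words of the same segment.
    __kernel void copy_coalesced(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i];
    }

    // Strided: consecutive threads read addresses 'stride' elements apart,
    // so a single warp may touch many 32-128 B segments.
    __kernel void copy_strided(__global const float *in, __global float *out, int stride)
    {
        size_t i = get_global_id(0);
        out[i] = in[i * stride];
    }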

15  Memory Shared by the SMP
 ◦ Divided into banks
    Each bank can be accessed independently
    Consecutive 32-bit words are in consecutive banks
    Optionally, a 64-bit word division is used (Kepler)
 ◦ Bank conflicts are serialized
    Except for reading the same address (broadcast)

    Architecture   Mem. size   # of banks   Latency
    Tesla          16 kB       16           32 bits / 2 cycles
    Fermi          48 kB       32           32 bits / 2 cycles
    Kepler         48 kB       32           64 bits / 1 cycle

16  Linear Addressing
 ◦ Each thread in a warp accesses a different memory bank
 ◦ No collisions

17  Linear Addressing with Stride
 ◦ Each thread accesses the 2*i-th item
 ◦ 2-way conflicts (2x slowdown) on Fermi and older
 ◦ No collisions on Kepler and newer
    Due to the 64 bits per cycle throughput

18  Linear Addressing with Stride
 ◦ Each thread accesses the 3*i-th item
 ◦ No collisions, since the number of banks is not divisible by the stride (a padding sketch for avoiding conflicts follows below)
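A classic way to avoid bank conflicts is to pad local arrays so that a column access hits different banks. The sketch below is illustrative, not part of the original slides; it assumes a square matrix whose width is a multiple of 32, a 32x32 work group, and a device with 32 banks.

    #define TILE 32

    __kernel void transpose(__global const float *in, __global float *out, int width)
    {
        // The +1 padding shifts each row by one bank, so reading a column
        // of the tile no longer maps all threads to the same bank.
        __local float tile[TILE][TILE + 1];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_group_id(0), gy = get_group_id(1);

        // Coalesced, conflict-free load of one tile.
        int x = gx * TILE + lx;
        int y = gy * TILE + ly;
        tile[ly][lx] = in[y * width + x];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Coalesced write of the transposed tile; tile[lx][ly] reads a column
        // of the local array, which would be serialized without the padding.
        x = gy * TILE + lx;
        y = gx * TILE + ly;
        out[y * width + x] = tile[lx][ly];
    }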

19  Registers
 ◦ One register pool per multiprocessor
    8-64k 32-bit registers (depending on the architecture)
    Registers are assigned to threads by the compiler
 ◦ As fast as the cores (no extra clock cycles)
 ◦ Read-after-write dependency
    24 clock cycles
    Can be hidden if there are enough active warps
 ◦ The hardware scheduler (and compiler) attempts to avoid register bank conflicts whenever possible
    The programmer has no direct control over them

20  Fast Context Switch
 ◦ When a warp gets stalled
    E.g., by a data load/store
 ◦ The scheduler switches to the next active warp

21  Kepler’s Major Improvements
 ◦ Next-generation streaming multiprocessor (SMX)
    192 cores, 32 SFUs, 32 load/store units
    3 cores share a DP unit, 6 cores share a load/store unit and an SFU
 ◦ Dynamic Parallelism
    A kernel may spawn child kernels (to a depth of 24)
    Implies the work-group context-switch capability
 ◦ Hyper-Q
    Up to 32 simultaneous GPU-host connections
    Better throughput if multiple processes/threads use the GPU (concurrent connections are managed in HW)

22 (figure)

23  Maxwell’s Major Improvements
 ◦ Maxwell Symmetric Multiprocessor (SMM)
    Many internal optimizations, better power efficiency
    Improved scheduling, increased occupancy
    Reduced arithmetic instruction latency
 ◦ Larger L2 cache (2 MB)
 ◦ Dedicated shared memory (separate L1 cache)
 ◦ Native shared memory atomics
 ◦ Better support for dynamic parallelism

24 (figure)

25  AMD’s Graphics Core Next (GCN)
 ◦ Abandons the VLIW4 architecture
    1 VLIW x 4 ALU ops => 4 SIMD x 1 ALU op
 ◦ 32 compute units (Radeon HD 7970)
 ◦ 4 SIMD units per CU (each processing 16 elements)
 ◦ 10 planned wavefronts per SIMD unit
 ◦ Emphasis on vector processing (instructions, registers, memory, …)
 ◦ OpenCL 1.2, DirectCompute 11.1, and C++ AMP compatibility

26 (figure)

27  Data Parallelism
 ◦ SIMT execution model
 ◦ Load balancing needs to be considered carefully
 Host-Device Transfers
 ◦ Data need to be transferred to the GPU and back
 ◦ Computations should overlap with data transfers
    There should be a sufficient amount of computation
 Memory Architecture
 ◦ Various types of memories and caches
 ◦ Coalesced loads/stores from/to global memory
 ◦ Banking properties of the shared memory

28 (figure)

29 Martin Kruliš (9. 4. 2015)

30  Universal Framework for Parallel Computations
 ◦ Specification created by the Khronos group
 ◦ Multiple implementations exist (AMD, NVIDIA, Mac, …)
 API for Different Parallel Architectures
 ◦ Multicore CPUs, manycore GPUs, IBM Cell cards, Xeon Phi, …
 ◦ The host runtime handles device detection, data transfers, and code execution
 Extended Version of C99 for Programming Devices
 ◦ The code is compiled at runtime for the selected device
 ◦ Theoretically, we may choose the best device for our application dynamically
    However, we have to consider HW-specific optimizations…

31  Hardware Model
 ◦ Device (CPU die or GPU card)
 ◦ Compute unit (CPU core or GPU SMP)
 ◦ Processing element (slot in SSE registers or GPU core)

32  Logical Layers
 ◦ Platform
    An implementation of OpenCL
 ◦ Context
    Groups devices of a selected kind
    Buffers, programs, and other objects live in a context
 ◦ Device
    E.g., Intel Core i7 (4 cores with HT) or ATI Radeon 5870 (320 cores)
 ◦ Command Queue
    Created for a device
    Controls data transfers, kernel execution, and synchronization

33
    std::vector<cl::Platform> platforms;
    cl_int err = cl::Platform::get(&platforms);           // List all platforms
    if (err != CL_SUCCESS) return 1;

    cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
        (cl_context_properties)(platforms[0]()), 0};
    cl::Context context(CL_DEVICE_TYPE_GPU, cps, NULL, NULL, &err);   // Context of all GPUs on the 1st platform

    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();   // List all GPUs

    cl::Buffer buf(context, CL_MEM_READ_ONLY, sizeof(cl_float)*n);    // Allocate memory on the GPU

    cl::Program program(context, cl::Program::Sources(1,
        std::make_pair(source.c_str(), source.length())));
    err = program.build(devices);                         // Create and compile the program from string source

    cl::Kernel kernel(program, "function_name", &err);    // Mark a function as the kernel
    err = kernel.setArg(0, buf);

34
    cl::CommandQueue cmdQueue(context, devices[0], 0, &err);   // GPU (in-order) command queue

    cmdQueue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data);   // Copy input data to the GPU buffer

    cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n),
                                  cl::NDRange(grp), NULL, NULL);
    cmdQueue.finish();   // Execute the kernel and wait for it to finish (not necessary before the
                         // following readBuffer operation, since the queue is in-order)

    cmdQueue.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data);    // Copy the results back from the GPU

35  A Kernel
 ◦ Written in OpenCL C (an extended version of C99)
 ◦ Compiled at runtime for the destination platform
    With a (possibly) high degree of optimization
 Kernel Execution
 ◦ Task Parallelism
    Multiple kernels are enlisted in a command queue and executed concurrently
 ◦ Data Parallelism
    Multiple instances (threads) are created from a single kernel, each operating on distinct data

36  Data Parallelism
 ◦ Each kernel instance has its own ID
    A 1-3 dimensional vector of numbers from 0 to N-1
 ◦ The ID identifies the portion of data to be processed
 ◦ Threads form groups (blocks)
    Threads within one group are executed on one SMP
    Groups have IDs as well (see the sketch below)
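As a small illustration (not from the slides), the built-in ID functions below expose the global, local, and group IDs mentioned above.

    __kernel void show_ids(__global int *out)
    {
        size_t gid   = get_global_id(0);    // unique ID within the whole index space
        size_t lid   = get_local_id(0);     // ID within the work group
        size_t group = get_group_id(0);     // ID of the work group
        size_t lsize = get_local_size(0);   // size of the work group

        // Usual invariant (with zero offset): gid == group * lsize + lid
        out[gid] = (int)(group * lsize + lid);
    }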

37  Types of Memory (see the sketch below)
 ◦ private – memory that belongs to one thread (registers)
 ◦ local – memory shared by a work group (shared memory)
 ◦ global – memory of the device
 ◦ constant – read-only version of global memory
    More easily cached
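The illustrative kernel below (names are made up) shows how the four address spaces appear in OpenCL C; the __local buffer is allocated per work group and its size is set from the host via setArg.

    __kernel void address_spaces(__global float *data,        // global memory
                                 __constant float *coeffs,    // constant memory
                                 __local float *scratch)      // local memory, one instance per work group
    {
        float tmp;                          // private memory (registers) of this thread
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        scratch[lid] = data[gid];           // stage data in the fast local memory
        barrier(CLK_LOCAL_MEM_FENCE);       // make it visible to the whole work group

        tmp = scratch[lid] * coeffs[0];
        data[gid] = tmp;
    }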

38 (figure)

39  Functions
 ◦ Other functions (besides the kernel) can be defined in the program (the kernel is just an entry point)
    Some devices may impose limitations on the call stack
 ◦ It is possible to call standard functions like printf()
    However, they do not work on GPUs
 ◦ There are many built-in functions (see the sketch below)
    Thread-related functions (e.g., get_global_id())
    Mathematical and geometrical functions
      Originally designed for graphics
      Some of them are translated into a single instruction
    Functions for asynchronous memory transfers
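A brief illustrative sketch (function names are made up): an ordinary device function called from a kernel, together with a thread-related built-in and a native_* math built-in that typically maps to a fast HW instruction.

    float sigmoid(float x)                       // a regular function, not a kernel
    {
        return 1.0f / (1.0f + native_exp(-x));   // native_* built-in: fast, lower-precision math
    }

    __kernel void activate(__global float *data)
    {
        size_t i = get_global_id(0);             // thread-related built-in
        data[i] = sigmoid(data[i]);
    }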

40  Restrictions and Optimization Issues
 ◦ Branching problem (if-else)
    The work group runs in SIMT, thus all branches are followed
    Use conditional assignment rather than branches (see the sketch below)
 ◦ For loops
    The compiler attempts to unroll them automatically
 ◦ While loops
    The same problem as branching
 ◦ Vector operations
    Translated into a single instruction if possible (e.g., SSE)
    The compiler attempts to generate them automatically
    Different efficiency on different architectures
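An illustrative sketch of the two recommendations above (not from the slides): a conditional assignment via the built-in select() instead of a divergent branch, and an explicit vector type that the compiler can map to SIMD instructions.

    __kernel void no_branching(__global const float4 *in, __global float4 *out)
    {
        size_t i = get_global_id(0);
        float4 v = in[i];                        // float4 may map to SIMD loads/ops on CPUs

        // Both "branches" are computed, but there is no divergence:
        // select(a, b, c) picks b where the condition c is true and a elsewhere.
        float4 doubled = v * 2.0f;
        float4 halved  = v * 0.5f;
        out[i] = select(halved, doubled, v > 0.0f);
    }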

41  Global
 ◦ Explicit barriers added to the command queue
 ◦ Events and event dependencies
 Within a Kernel
 ◦ Local barriers (for a work group)
 ◦ Memory fences
 ◦ Atomic operations (see the sketch below)
    Common operations (add, sub, xchg, cmpxchg, …)
    Extended – min, max, and, or, xor
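An illustrative sketch (not from the slides) combining a local barrier with atomic operations; 'counter' and 'localCount' are made-up names.

    __kernel void count_positive(__global const float *data,
                                 __global int *counter,
                                 __local int *localCount)
    {
        size_t lid = get_local_id(0);
        if (lid == 0)
            *localCount = 0;
        barrier(CLK_LOCAL_MEM_FENCE);            // everyone sees the initialized local counter

        if (data[get_global_id(0)] > 0.0f)
            atomic_inc(localCount);              // atomic increment in local memory
        barrier(CLK_LOCAL_MEM_FENCE);            // wait until all threads have contributed

        if (lid == 0)
            atomic_add(counter, *localCount);    // one atomic on global memory per work group
    }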

42 Example

    __kernel void mul_matrix(__global const float *A,
                             __global const float *B,
                             __global float *C)
    {
        int width = get_global_size(0);
        int x = get_global_id(0);
        int y = get_global_id(1);
        float sum = 0;
        for (int i = 0; i < width; ++i)
            sum += A[y*width + i] * B[i*width + x];
        C[y*width + x] = sum;
    }

43  Optimized Solution (Example)
 ◦ A work group computes a block of 16x16 results
 ◦ In each step, appropriate blocks of 16x16 numbers are loaded into local memory and the intermediate results are updated (a sketch follows below)
 ◦ Exact numbers may vary a little
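A hedged sketch of the optimized kernel described above, assuming the matrix width is a multiple of the tile size and the work group is 16x16; it is an illustration, not the author's exact implementation.

    #define TILE 16

    __kernel void mul_matrix_tiled(__global const float *A,
                                   __global const float *B,
                                   __global float *C)
    {
        __local float tileA[TILE][TILE];    // block of A cached in local memory
        __local float tileB[TILE][TILE];    // block of B cached in local memory

        int width = get_global_size(0);
        int x = get_global_id(0), y = get_global_id(1);
        int lx = get_local_id(0), ly = get_local_id(1);
        float sum = 0.0f;

        for (int t = 0; t < width; t += TILE) {
            // Each thread loads one element of each block (coalesced loads).
            tileA[ly][lx] = A[y*width + (t + lx)];
            tileB[ly][lx] = B[(t + ly)*width + x];
            barrier(CLK_LOCAL_MEM_FENCE);          // wait until both blocks are loaded

            for (int i = 0; i < TILE; ++i)         // update intermediate results from local memory
                sum += tileA[ly][i] * tileB[i][lx];
            barrier(CLK_LOCAL_MEM_FENCE);          // before the blocks are overwritten
        }
        C[y*width + x] = sum;
    }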

44  OpenCL 2.0 (July 2013)
 ◦ First implementations from Intel and AMD
 ◦ Shared virtual memory
    For sharing complex data structures with pointers
 ◦ Dynamic parallelism
    Device kernels can enqueue other kernels (see the sketch below)
 ◦ Generic address space
    Pointer function arguments need not declare their address space
 ◦ A subset of C11 atomics was added
 ◦ Pipes - FIFO data structures, which can be read and written by kernels
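A hedged sketch of device-side enqueue in OpenCL C 2.0 (assumes an OpenCL 2.0 device, a default device queue created by the host, and compilation with -cl-std=CL2.0; the kernel name is made up).

    __kernel void parent(__global int *data, int n)
    {
        if (get_global_id(0) == 0) {                  // let one thread do the enqueue
            queue_t q = get_default_queue();          // default on-device queue (created by the host)
            ndrange_t nd = ndrange_1D(n);
            // The block acts as a child kernel; each of its work items increments one element.
            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, nd,
                           ^{ data[get_global_id(0)] += 1; });
        }
    }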

45 (figure)

