Martin Kruliš
3dfx Voodoo 1
◦ First 3D graphics accelerator for desktop PCs
NVIDIA GeForce 256
◦ First Transform & Lighting unit
NVIDIA GeForce2, ATI Radeon
GPU gets programmable parts
◦ DirectX, OpenGL – vertex and fragment shaders (v1.0)
DirectX 10, Windows Vista
◦ Unified shader architecture in HW
◦ Geometry shader added
NVIDIA CUDA
◦ First GPGPU solution, restricted to NVIDIA GPUs
AMD Stream SDK (previously CTM)
OpenCL, DirectCompute
◦ Mac OS X (Snow Leopard) first to implement OpenCL
2010 – OpenCL revision 1.1
◦ Stable implementations from AMD and NVIDIA
OpenCL revision 1.2, NVIDIA Kepler architecture
2013 – OpenCL revision 2.0
CPU
◦ Few cores per chip
◦ General-purpose cores
◦ Processing different threads
◦ Huge caches to reduce memory latency
  Locality of reference problem
GPU
◦ Many cores per chip
◦ Cores specialized for numeric computations
◦ SIMT thread processing
◦ Huge amount of threads and fast context switch
  Results in more complex memory transfers
Architecture Convergence
NVIDIA Fermi
◦ 16 SMP units
◦ 512 CUDA cores
◦ 768 kB L2 cache
Note that one CUDA core corresponds to one 5D AMD Stream Processor (VLIW5). Therefore, Radeon 5870 has 320 cores with 4-way SIMD capabilities and one SFU.
Fermi Multiprocessor (SMP)
◦ 32 CUDA cores
◦ 64 kB shared memory (or L1 cache)
◦ 1024 registers per core
◦ 16 load/store units
◦ 4 special function units
◦ 16 double-precision ops per clock
◦ 1 instruction decoder
  All cores run in lockstep
Data Parallelism
◦ Many data elements are processed concurrently by the same routine
◦ GPUs are designed for this particular paradigm
  They also have only limited means to express task parallelism
Threading Execution Model
◦ One function (the kernel) is executed by many threads (see the sketch below)
  Much more lightweight than CPU threads
◦ Threads are grouped into blocks/work groups of the same size
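A minimal OpenCL C sketch of this model (the kernel name and arguments are illustrative): each of the launched threads adds one pair of elements.

__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);  // unique index of this thread (work item)
    c[i] = a[i] + b[i];        // each thread processes exactly one element
}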
Single Instruction Multiple Threads
◦ All cores are executing the same instruction
◦ Each core has its own set of registers
Single Instruction Multiple Threads (SIMT)
◦ Width-independent programming model
◦ Serial-like code
◦ Achieved by hardware with a little help from the compiler
◦ Allows code divergence
Single Instruction Multiple Data (SIMD)
◦ Explicitly exposes the width of the SIMD vector
◦ Special instructions
◦ Generated by the compiler or written directly by the programmer
◦ Code divergence is usually not supported
How are threads assigned to SMPs
◦ Grid – the same kernel executed on the whole GPU
◦ Block – assigned to one SMP
◦ Warp – threads that simultaneously run on the SMP cores
◦ Thread – executed by a single core
Masking Instructions
◦ In case of data-driven branches
  if-else conditions, while loops, …
◦ All branches are traversed, threads mask their execution in invalid branches

if (threadId % 2 == 0) {
    ... even threads code ...
} else {
    ... odd threads code ...
}
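Where both branches are short, the divergence can often be avoided by a conditional assignment (as suggested later in the optimization notes). A hedged sketch; the kernel and buffer names are illustrative:

__kernel void pick(__global float *out,
                   __global const float *evenValues,
                   __global const float *oddValues)
{
    int i = get_global_id(0);
    // Conditional assignment instead of an if-else block; the compiler can
    // emit a select/predicated instruction, so the warp does not diverge.
    out[i] = (i % 2 == 0) ? evenValues[i] : oddValues[i];
}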
Memory architecture (diagram): the GPU device contains global memory (> 100 GBps) and a GPU chip with an L2 cache shared by all SMPs; each SMP has its own L1 cache and registers for its cores. The GPU is connected to the host CPU via PCI Express (16/32 GBps); host memory bandwidth is ~ 25 GBps. Note that details about host memory interconnection are platform specific.
PCIe Transfers
◦ Much slower than internal GPU data transfers
◦ Issued explicitly by host code
  As a bulk transfer operation
◦ Or issued implicitly by the HW/operating system
  When GPU memory is mapped into the host memory space
◦ The transfers have significant overhead
  Overlapping – up to 2 asynchronous transfers may run whilst the GPU is computing (see the sketch below)
◦ Data must be binary safe (no pointers)
  CUDA introduces unified memory addressing
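A host-side sketch of such overlapping, using the C++ wrapper API shown later in these slides; it assumes the context, devices, kernel, and array size n from that example already exist, the host arrays hostDataA/hostDataB are illustrative, and real overlap additionally requires page-locked (pinned) host memory:

// Two in-order queues: one for copies, one for kernel execution
cl::CommandQueue copyQueue(context, devices[0]);
cl::CommandQueue execQueue(context, devices[0]);

cl::Buffer bufA(context, CL_MEM_READ_ONLY, sizeof(cl_float) * n);
cl::Buffer bufB(context, CL_MEM_READ_ONLY, sizeof(cl_float) * n);

// Blocking upload of the first chunk
copyQueue.enqueueWriteBuffer(bufA, CL_TRUE, 0, sizeof(cl_float) * n, hostDataA);

// Non-blocking (CL_FALSE) upload of the second chunk ...
cl::Event copied;
copyQueue.enqueueWriteBuffer(bufB, CL_FALSE, 0, sizeof(cl_float) * n,
                             hostDataB, NULL, &copied);

// ... which may overlap with the kernel processing the first chunk
kernel.setArg(0, bufA);
execQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n),
                               cl::NullRange, NULL, NULL);
execQueue.finish();
copied.wait();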
Global Memory Properties
◦ Off-chip, but on the GPU device
◦ High bandwidth and high latency
  ~ 100 GBps, hundreds of clock cycles
◦ Operated in transactions
  Continuous aligned segments of 32 B, 64 B, or 128 B
  The number of transactions caused by a global memory access depends on the access pattern
  Certain access patterns are optimized
◦ Data are cached in L2
  On Fermi also cached in L1 cache
  Kepler/Maxwell use L1 for specific purposes
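A sketch contrasting the two access patterns (both kernels just copy data; the names are illustrative):

__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];           // consecutive threads access consecutive words
                              // -> one transaction per warp segment
}

__kernel void copy_strided(__global const float *in, __global float *out,
                           int stride)
{
    int i = get_global_id(0);
    out[i] = in[i * stride];  // threads of a warp scatter across segments
                              // -> up to one transaction per thread
}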
Memory Shared by the SMP
◦ Divided into banks
  Each bank can be accessed independently
  Consecutive 32-bit words are in consecutive banks
  Optionally, 64-bit word division is used (Kepler)
◦ Bank conflicts are serialized
  Except for reading the same address (broadcast)

Architecture   Mem. size   # of banks   Latency
Tesla          16 kB       16           32 bits / 2 cycles
Fermi          48 kB       32           32 bits / 2 cycles
Kepler         48 kB       32           64 bits / 1 cycle
Linear Addressing
◦ Each thread in a warp accesses a different memory bank
◦ No collisions
Linear Addressing with Stride
◦ Each thread accesses the 2*i-th item
◦ 2-way conflicts (2x slowdown) on Fermi and older
◦ No collisions on Kepler and newer
  Due to the 64 bits per cycle throughput
Linear Addressing with Stride
◦ Each thread accesses the 3*i-th item
◦ No collisions, since the number of banks is not divisible by the stride
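A hedged sketch of avoiding bank conflicts by padding, on the classic matrix transpose; it assumes a square matrix whose width is a multiple of 32 and 32×32 work groups. The local tile is padded to 33 columns, so a column-wise read hits 32 different banks:

__kernel void transpose(__global const float *in, __global float *out,
                        int width)
{
    __local float tile[32][33];                // +1 column of padding
    int x = get_global_id(0), y = get_global_id(1);
    int lx = get_local_id(0), ly = get_local_id(1);

    tile[ly][lx] = in[y * width + x];          // conflict-free, coalesced load
    barrier(CLK_LOCAL_MEM_FENCE);

    // Column-wise read of the tile; without the padding, all 32 threads of
    // a warp would hit the same bank and the reads would be serialized
    int tx = get_group_id(1) * 32 + lx;
    int ty = get_group_id(0) * 32 + ly;
    out[ty * width + tx] = tile[lx][ly];
}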
Registers
◦ One register pool per multiprocessor
  8–64k 32-bit registers (depending on the architecture)
  Registers are assigned to threads by the compiler
◦ As fast as the cores (no extra clock cycles)
◦ Read-after-write dependency
  24 clock cycles
  Can be hidden if there are enough active warps
◦ The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible
  The programmer has no direct control over them
Fast Context Switch
◦ When a warp gets stalled
  E.g., by a data load/store
◦ The scheduler switches to the next active warp
Kepler's Major Improvements
◦ Next-generation Streaming Multiprocessor (SMX)
  192 cores, 32 SFUs, 32 load/store units
  3 cores share a DP unit, 6 cores share an LD/ST unit and an SFU
◦ Dynamic Parallelism
  A kernel may spawn child kernels (up to a depth of 24)
  Implies the work group context-switch capability
◦ Hyper-Q
  Up to 32 simultaneous GPU-host connections
  Better throughput if multiple processes/threads use the GPU (concurrent connections are managed in HW)
Maxwell's Major Improvements
◦ Maxwell Streaming Multiprocessor (SMM)
  Many internal optimizations, better power efficiency
  Improved scheduling, increased occupancy
  Reduced arithmetic instruction latency
◦ Larger L2 cache (2 MB)
◦ Dedicated shared memory (separate L1 cache)
◦ Native shared memory atomics
◦ Better support for dynamic parallelism
AMD's Graphics Core Next (GCN)
◦ Abandons the VLIW4 architecture
  1 VLIW × 4 ALU ops => 4 SIMD × 1 ALU op
◦ 32 compute units (Radeon HD 7970)
◦ 4 SIMD units per CU (each processing 16 elements)
◦ 10 planned wavefronts per SIMD unit
◦ Emphasis on vector processing (instructions, registers, memory, …)
◦ OpenCL 1.2, DirectCompute 11.1, and C++ AMP compatibility
Data Parallelism
◦ SIMT execution model
◦ Load balancing needs to be carefully considered
Host-Device Transfers
◦ Data need to be transferred to the GPU and back
◦ Computations should overlap with data transfers
  There should be a sufficient amount of computations
Memory Architecture
◦ Various types of memories and caches
◦ Coalesced loads/stores from/to global memory
◦ Banking properties of the shared memory
Universal Framework for Parallel Computations
◦ Specification created by the Khronos group
◦ Multiple implementations exist (AMD, NVIDIA, Mac, …)
API for Different Parallel Architectures
◦ Multicore CPUs, manycore GPUs, IBM Cell cards, Xeon Phi, …
◦ The host runtime handles device detection, data transfers, and code execution
Extended Version of C99 for Programming Devices
◦ The code is compiled at runtime for the selected device
◦ Theoretically, we may choose the best device for our application dynamically
  However, we have to consider HW-specific optimizations…
Hardware Model
◦ Device (CPU die or GPU card)
◦ Compute unit (CPU core or GPU SMP)
◦ Processing element (slot in SSE registers or GPU core)
Logical Layers
◦ Platform
  An implementation of OpenCL
◦ Context
  Groups devices of a selected kind
  Buffers, programs, and other objects live in a context
◦ Device
◦ Command Queue
  Created for a device
  Controls data transfers, kernel execution, and synchronization

Example devices: Intel Core i7 (4 cores with HT), ATI Radeon 5870 (320 cores)
// List all platforms
std::vector<cl::Platform> platforms;
cl_int err = cl::Platform::get(&platforms);
if (err != CL_SUCCESS) return 1;

// Context of all GPUs on the 1st platform
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
    (cl_context_properties)(platforms[0]()), 0 };
cl::Context context(CL_DEVICE_TYPE_GPU, cps, NULL, NULL, &err);

// List all GPUs
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

// Allocate memory on the GPU
cl::Buffer buf(context, CL_MEM_READ_ONLY, sizeof(cl_float) * n);

// Create and compile the program from string source
cl::Program program(context, cl::Program::Sources(1,
    std::make_pair(source.c_str(), source.length())));
err = program.build(devices);

// Mark function as a kernel
cl::Kernel kernel(program, "function_name", &err);
err = kernel.setArg(0, buf);
// GPU (in-order) command queue
cl::CommandQueue cmdQueue(context, devices[0], 0, &err);

// Copy input data to the GPU buffer
cmdQueue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float) * n, data);

// Execute the kernel and wait for it to finish (not necessary for the
// following readBuffer operation, since the queue is in-order)
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
    cl::NDRange(n), cl::NDRange(grp), NULL, NULL);
cmdQueue.finish();

// Copy the results back from the GPU
cmdQueue.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(cl_float) * n, data);
A Kernel
◦ Written in OpenCL C (an extended version of C99)
◦ Compiled at runtime for the destination platform
  With a (possibly) high degree of optimization
Kernel Execution
◦ Task Parallelism
  Multiple kernels are enlisted in a command queue and executed concurrently
◦ Data Parallelism
  Multiple instances (threads) are created from a single kernel, each operating on distinct data
Data Parallelism
◦ Each kernel instance has its own ID
  A 1-3 dimensional vector of numbers from 0 to N-1
◦ The ID identifies the portion of data to be processed
◦ Threads form groups (blocks)
  Threads within one group are executed on one SMP
  Groups have IDs as well (see the sketch below)
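A small sketch of the built-in ID functions in OpenCL C (the kernel name is illustrative):

__kernel void show_ids(__global int *out)
{
    int gid = get_global_id(0);    // position within the whole index space
    int lid = get_local_id(0);     // position within the work group
    int grp = get_group_id(0);     // ID of the work group itself
    int lsz = get_local_size(0);   // size of the work group
    out[gid] = grp * lsz + lid;    // equals gid (when no global offset is used)
}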
Types of Memory
◦ private – memory that belongs to one thread (registers)
◦ local – memory shared by a work group (shared memory)
◦ global – memory of the device
◦ constant – read-only version of global memory
  More easily cached
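A sketch of how the corresponding address space qualifiers appear in a kernel (names are illustrative; the __local buffer is sized by the host when the argument is set):

__kernel void spaces(__global float *data,        // device (global) memory
                     __constant float *coeffs,    // read-only, easily cached
                     __local float *scratch)      // shared by one work group
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float tmp = coeffs[0] * data[gid];            // tmp is private (a register)
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);                 // make it visible to the group
    data[gid] = scratch[lid];
}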
Functions
◦ Other functions (besides the kernel) can be defined in the program (the kernel is just an entry point)
  Some devices may impose limitations on the call stack
◦ It is possible to call standard functions like printf()
  However, they do not work on GPUs
◦ There are many built-in functions
  Thread-related functions (e.g., get_global_id())
  Mathematical and geometrical functions
    Originally designed for graphics
    Some of them are translated into a single instruction
  Functions for asynchronous memory transfers
Restrictions and Optimization Issues
◦ Branching problem (if-else)
  A work group runs in SIMT, thus all branches are followed
  Use conditional assignment rather than branches
◦ For loops
  The compiler attempts to unroll them automatically
◦ While loops
  The same problem as branching
◦ Vector operations (see the sketch below)
  Translated into a single instruction if possible (e.g., SSE)
  The compiler attempts to generate them automatically
  Different efficiency on different architectures
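An illustrative sketch of the vector operations: OpenCL C vector types such as float4 map to SIMD instructions where the hardware supports them (the kernel name is illustrative):

__kernel void saxpy4(__global const float4 *x,
                     __global float4 *y,
                     float a)
{
    int i = get_global_id(0);
    // One float4 operation per thread; on CPUs this can become a single
    // SSE instruction, on scalar GPU cores it is unrolled into four ops.
    y[i] = a * x[i] + y[i];
}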
Global
◦ Explicit barriers added to the command queue
◦ Events and event dependencies
Within a Kernel
◦ Local barriers (for a work group)
◦ Memory fences
◦ Atomic operations (see the sketch below)
  Common operations (add, sub, xchg, cmpxchg, …)
  Extended – min, max, and, or, xor
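A hedged sketch combining local barriers and atomics: a per-work-group histogram with 256 bins, one input byte per thread (the names are illustrative):

__kernel void histogram(__global const uchar *data,
                        __global uint *bins)
{
    __local uint localBins[256];
    int lid = get_local_id(0);

    for (int i = lid; i < 256; i += get_local_size(0))
        localBins[i] = 0;                       // cooperative initialization
    barrier(CLK_LOCAL_MEM_FENCE);               // wait until all bins are zeroed

    atomic_inc(&localBins[data[get_global_id(0)]]);
    barrier(CLK_LOCAL_MEM_FENCE);               // wait until all updates landed

    for (int i = lid; i < 256; i += get_local_size(0))
        atomic_add(&bins[i], localBins[i]);     // merge into the global result
}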
Example

__kernel void mul_matrix(__global const float *A,
                         __global const float *B,
                         __global float *C)
{
    int width = get_global_size(0);
    int x = get_global_id(0);
    int y = get_global_id(1);
    float sum = 0;
    for (int i = 0; i < width; ++i)
        sum += A[y*width + i] * B[i*width + x];
    C[y*width + x] = sum;
}
Optimized Solution
◦ A work group computes a block of 16×16 results (see the sketch below)
◦ In each step, appropriate blocks of 16×16 numbers are loaded into local memory and the intermediate results are updated
  Exact numbers may vary a little
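A sketch of the optimized kernel under these assumptions (16×16 work groups, matrix width a multiple of 16, the same arguments as the previous example):

#define TILE 16

__kernel void mul_matrix_tiled(__global const float *A,
                               __global const float *B,
                               __global float *C)
{
    int width = get_global_size(0);
    int x = get_global_id(0), y = get_global_id(1);
    int lx = get_local_id(0), ly = get_local_id(1);

    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];
    float sum = 0;

    for (int t = 0; t < width; t += TILE) {
        // Each thread loads one element of each 16x16 block
        tileA[ly][lx] = A[y * width + (t + lx)];
        tileB[ly][lx] = B[(t + ly) * width + x];
        barrier(CLK_LOCAL_MEM_FENCE);           // blocks are fully loaded

        for (int i = 0; i < TILE; ++i)          // update intermediate results
            sum += tileA[ly][i] * tileB[i][lx];
        barrier(CLK_LOCAL_MEM_FENCE);           // before blocks are overwritten
    }
    C[y * width + x] = sum;
}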
OpenCL 2.0 (July 2013)
◦ First implementations from Intel and AMD
◦ Shared virtual memory
  For sharing complex data structures with pointers
◦ Dynamic parallelism
  Device kernels can enqueue other kernels (see the sketch below)
◦ Generic address space
  Pointer arguments of functions need not declare their address space
◦ A subset of C11 atomics was added
◦ Pipes – FIFO data structures which can be read and written by kernels
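A hedged OpenCL 2.0 sketch of device-side enqueue (dynamic parallelism); it assumes the host has created an on-device default queue, and the kernel name and sizes are illustrative:

__kernel void parent(__global int *data, int n)
{
    if (get_global_id(0) == 0) {
        // One work item spawns a child grid of n work items
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] += 1; });
    }
}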