INF5063 – GPU & CUDA Håkon Kvale Stensland Department for Informatics / Simula Research Laboratory.

Slides:



Advertisements
Similar presentations
Intermediate GPGPU Programming in CUDA
Advertisements

INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.
Is There a Real Difference between DSPs and GPUs?
Home Exam 2: Video Encoding on GPUs using nVIDIA CUDA with Managed Memory Home Exam 2: Video Encoding on GPUs using nVIDIA CUDA with Managed Memory September.
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Lecture 6: Multicore Systems
GPU History CUDA. Graphics Pipeline Elements 1. A scene description: vertices, triangles, colors, lighting 2.Transformations that map the scene to a camera.
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,
Optimization on Kepler Zehuan Wang
INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.
Håkon Kvale Stensland Simula Research Laboratory
Cell Broadband Engine. INF5062, Carsten Griwodz & Pål Halvorsen University of Oslo Cell Broadband Engine Structure SPE PPE MIC EIB.
INF5062 – GPU & CUDA Håkon Kvale Stensland Simula Research Laboratory.
INF5063 – GPU & CUDA Håkon Kvale Stensland Simula Research Laboratory.
INF5063 – GPU & CUDA Håkon Kvale Stensland iAD-lab, Department for Informatics.
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
GRAPHICS AND COMPUTING GPUS Jehan-François Pâris
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
1 Threading Hardware in G80. 2 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA.
Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 April 4, 2013 © Barry Wilkinson CUDAIntro.ppt.
To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Dec 31, 2012 Emergence of GPU systems and clusters for general purpose High Performance Computing.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Computer Graphics Graphics Hardware
Extracted directly from:
Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"
NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.
Emergence of GPU systems and clusters for general purpose high performance computing ITCS 4145/5145 April 3, 2012 © Barry Wilkinson.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.
Lecture 8 : Manycore GPU Programming with CUDA Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture note.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Graphic Processing Processors (GPUs) Parallel.
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Oct 30, 2014.
Computer Engg, IIT(BHU)
Computer Graphics Graphics Hardware
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 July 12, 2012 © Barry Wilkinson CUDAIntro.ppt.
GPU Architecture and Its Application
CS427 Multicore Architecture and Parallel Computing
Graphics Processing Unit
ECE 498AL Spring 2010 Lectures 8: Threading & Memory Hardware in G80
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Nov 4, 2013.
Mattan Erez The University of Texas at Austin
NVIDIA Fermi Architecture
Computer Graphics Graphics Hardware
Introduction to Heterogeneous Parallel Computing
Mattan Erez The University of Texas at Austin
Graphics Processing Unit
6- General Purpose GPU Programming
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Presentation transcript:

INF5063 – GPU & CUDA Håkon Kvale Stensland Department for Informatics / Simula Research Laboratory

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo GPU – Graphics Processing Units

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Basic 3D Graphics Pipeline Application Scene Management Geometry Rasterization Pixel Processing ROP/FBI/Display Frame Buffer Memory Host GPU

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo PC Graphics Timeline  Challenges: −Render infinitely complex scenes −And extremely high resolution −In 1/60 th of one second (60 frames per second)  Graphics hardware has evolved from a simple hardwired pipeline to a highly programmable multiword processor DirectX 6 Multitexturing Riva TNT DirectX 8 SM 1.x GeForce 3Cg DirectX 9 SM 2.0 GeForceFX DirectX 9.0c SM 3.0 GeForce 6 DirectX 5 Riva 128 DirectX 7 T&L TextureStageState GeForce GeForce 7GeForce 8 SM 3.0SM 4.0 DirectX 9.0cDirectX 10

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Graphics in the PC Architecture  DMI (Direct Media Interface) between processor and chipset −Memory Control now integrated in CPU  The old “Northbridge” integrated onto CPU −Intel calls this part of the CPU “System Agent” −PCI Express 3.0 x16 bandwidth at 32 GB/s (16 GB in each direction)  “Southbridge” (X99) handles all other peripherals  All mainstream CPUs now come with integrated GPU −Same capabilities as discrete GPU’s −Less performance (limited by die space and power) Intel Haswell

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo GPUs not always for Graphics  GPUs are now common in HPC  Second largest supercomputer in November 2013 is the Titan at Oak Ridge National Laboratory − core Opteron processors −16688 Nvidia Kepler GPU’s −Theoretical: 27 petaflops  Before: Dedicated compute card released after graphics model  Now: Nvidia’s high-end Kepler GPU (GK110) released first as a compute product GeForce GTX Titan Black Tesla K40

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo High-end Compute Hardware  nVIDIA Kepler Architecture  The latest generation GPU, codenamed GK110  7,1 billion transistors  2688 Processing cores (SP) −IEEE Capable −Shared coherent L2 cache −Full C++ Support −Up to 32 concurrent kernels −6 GB memory −Supports GPU virtualization Tesla K40

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo High-end Graphics Hardware  nVIDIA Maxwell Architecture  The latest generation GPU, codenamed GM204  5,2 billion transistors  2048 Processing cores (SP) −IEEE Capable −Shared coherent L2 cache −Full C++ Support −Up to 32 concurrent kernels −4 GB memory −Supports GPU virtualization GeForce GTX 980

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo GeForce GM204 Architecture

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Lab Hardware  nVIDIA GeForce GTX 750 −All machines have the same GPU −Maxwell Architecture  Based on the GM107 chip −1870 million transistors −512 Processing cores (SP) at 1022 MHz (4 SMM) −1024 MB Memory with 80 GB/sec bandwidth −1044 GFLOPS theoretical performance −Compute version 5.0

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CPU and GPU Design Philosophy GPU Throughput Oriented Cores Chip Compute Unit Cache/Local Mem Registers SIMD Unit Threading CPU Latency Oriented Cores Chip Core Local Cache Registers SIMD Unit Control

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CPUs: Latency Oriented Design  Large caches −Convert long latency memory accesses to short latency cache accesses  Sophisticated control −Branch prediction for reduced branch latency −Data forwarding for reduced data latency  Powerful ALU −Reduced operation latency Cache ALU Control ALU DRAM CPU

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo GPUs: Throughput Oriented Design  Small caches −To boost memory throughput  Simple control −No branch prediction −No data forwarding  Energy efficient ALUs −Many, long latency but heavily pipelined for high throughput  Require massive number of threads to tolerate latencies DRAM GPU

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Think both about CPU and GPU…  CPUs for sequential parts where latency matters −CPUs can be 10+X faster than GPUs for sequential code  GPUs for parallel parts where throughput wins −GPUs can be 10+X faster than CPUs for parallel code

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo The Core: The basic processing block  The nVIDIA Approach: −Called Stream Processor and CUDA cores. Works on a single operation.  The AMD Approach: −VLIW5: The GPU work on up to five operations −VLIW4: The GPU work on up to four operations −GCN: 16-wide SIMD vector unit Graphics Core Next (GCN):

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo The Core: The basic processing block  The (failed) Intel Approach: −512-bit SIMD units in x86 cores −Failed because of complex x86 cores and software ROP pipeline  The (new) Intel Approach: −Used in Sandy Bridge, Ivy Bridge, Haswell & Broadwell −128 SIMD-8 32-bit registers

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo The nVIDIA GPU Architecture Evolving  Streaming Multiprocessor (SM) 2.0 on the Fermi Architecture (GF1xx)  32 CUDA Cores (Core)  4 Super Function Units (SFU)  Dual schedulers and dispatch units  1 to 1536 threads active  Local register (32k)  64 KB shared memory / Level 1 cache  2 operations per cycle  Streaming Multiprocessor (SM) 1.x on the Tesla Architecture  8 CUDA Cores (Core)  2 Super Function Units (SFU)  Dual schedulers and dispatch units  1 to 512 or 768 threads active  Local register (32k)  16 KB shared memory  2 operations per cycle

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo The nVIDIA GPU Architecture Evolving  Streaming Multiprocessor (SMX) 3.5 on the Kepler Architecture (Compute)  192 CUDA Cores (Core)  64 DP CUDA Cores (DP Core)  32 Super Function Units (SFU)  Four (simple) schedulers and eight dispatch units  1 to 2048 threads active  Local register (64k)  64 KB shared memory / Level 1 cache  48 KB Texture Cache  1 operation per cycle  Streaming Multiprocessor (SMX) 3.0 on the Kepler Architecture (Graphics)  192 CUDA Cores (CC)  8 DP CUDA Cores (DP Core)  32 Super Function Units (SFU)  Four (simple) schedulers and eight dispatch units  1 to 2048 threads active  Local register (32k)  64 KB shared memory / Level 1 cahce  1 operation per cycle

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Streaming Multiprocessor (SMM) – 4.0  Streaming Multiprocessor (SMM) on Maxwell (Graphics)  128 CUDA Cores (Core)  4 DP CUDA Cores (DP Core)  32 Super Function Units (SFU)  Four schedulers and eight dispatch units  1 to 2048 active threads  Software controlled scheduling  Local register (64k)  64 KB shared memory  24 KB Level 1 / Texture Cache  1 operation per cycle  GM107 / GM204

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Execution Pipes in SMM  Scalar MAD pipe −Float Multiply, Add, etc. −Integer operations −Conversions −Only one instruction per clock  Scalar SFU pipe −Special functions like Sin, Cos, Log, etc. Only one operation per four clocks  Load/Store pipe −CUDA has both global and local memory access through Load/Store −Access to shared memory  Texture Units −Access to Texture memory with texture cache (per SMM) I$ L1 Multithreaded Instruction Buffer R F C$ L1 Shared Mem Operand Select MADSFU

GPGPU Foils adapted from nVIDIA

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo What is really GPGPU?  Idea: Potential for very high performance at low cost Architecture well suited for certain kinds of parallel applications (data parallel) Demonstrations of X speedup over CPU  Early challenges: −Architectures very customized to graphics problems (e.g., vertex and fragment processors) −Programmed using graphics-specific programming models or libraries

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Previous GPGPU use, and limitations  Working with a Graphics API −Special cases with an API like Microsoft Direct3D or OpenGL  Addressing modes −Limited by texture size  Shader capabilities −Limited outputs of the available shader programs  Instruction sets −No integer or bit operations  Communication is limited −Between pixels Input Registers Fragment Program Output Registers Constants Texture Temp Registers per thread per Shader per Context FB Memory

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Heterogeneous computing is catching on… Financial Analysis Scientific Simulation Engineering Simulation Data Intensive Analytics Medical Imaging Digital Audio Processing Computer Vision Digital Video Processing Biomedical Informatics Electronic Design Automation Statistical Modeling Ray Tracing Rendering Interactive Physics Numerical Methods

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo nVIDIA CUDA  “Compute Unified Device Architecture”  General purpose programming model −User starts several batches of threads on a GPU −GPU is in this case a dedicated super-threaded, massively data parallel co-processor  Software Stack −Graphics driver, language compilers (Toolkit), and tools (SDK)  Graphics driver loads programs into GPU −All drivers from nVIDIA now support CUDA −Interface is designed for computing (no graphics ) −“Guaranteed” maximum download & readback speeds −Explicit GPU memory management

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Outline  The CUDA Programming Model −Basic concepts and data types  An example application: −The good old Motion JPEG implementation! −Profiling with CUDA −Make an example program!

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo The CUDA Programming Model  The GPU is viewed as a compute device that: −Is a coprocessor to the CPU, referred to as the host −Has its own DRAM called device memory −Runs many threads in parallel  Data-parallel parts of an application are executed on the device as kernels, which run in parallel on many threads  Differences between GPU and CPU threads −GPU threads are extremely lightweight Very little creation overhead −GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA C – Execution Model  Integrated host + device C program −Serial or modestly parallel parts in host C code −Highly parallel parts in device SPMD kernel C code Serial Code (host)‏... Parallel Kernel (device)‏ KernelA >>(args); Serial Code (host)‏ Parallel Kernel (device)‏ KernelB >>(args);

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Thread Batching: Grids and Blocks  A kernel is executed as a grid of thread blocks −All threads share data memory space  A thread block is a batch of threads that can cooperate with each other by: −Synchronizing their execution Non synchronous execution is very bad for performance! −Efficiently sharing data through a low latency shared memory  Two threads from two different blocks cannot cooperate Host Kernel 1 Kernel 2 Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Grid 2 Block (1, 1) Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0)

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo CUDA Device Memory Space Overview  Each thread can: −R/W per-thread registers −R/W per-thread local memory −R/W per-block shared memory −R/W per-grid global memory −Read only per-grid constant memory −Read only per-grid texture memory  The host can R/W global, constant, and texture memories (Device) Grid Constant Memory Texture Memory Global Memory Block (0, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Block (1, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Global, Constant, and Texture Memories  Global memory: −Main means of communicating R/W Data between host and device −Contents visible to all threads  Texture and Constant Memories: −Constants initialized by host −Contents visible to all threads (Device) Grid Constant Memory Texture Memory Global Memory Block (0, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Block (1, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Terminology Recap  device = GPU = Set of multiprocessors  Multiprocessor = Set of processors & shared memory  Kernel = Program running on the GPU  Grid = Array of thread blocks that execute a kernel  Thread block = Group of SIMD threads that execute a kernel and can communicate via shared memory MemoryLocationCachedAccessWho LocalOff-chipNoRead/writeOne thread SharedOn-chipN/A - residentRead/writeAll threads in a block GlobalOff-chipNoRead/writeAll threads + host ConstantOff-chipYesReadAll threads + host TextureOff-chipYesReadAll threads + host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Access Times  Register – Dedicated HW – Single cycle  Shared Memory – Dedicated HW – Single cycle  Local Memory – DRAM, no cache – “Slow”  Global Memory – DRAM, no cache – “Slow”  Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality  Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Terminology Recap  device = GPU = Set of multiprocessors  Multiprocessor = Set of processors & shared memory  Kernel = Program running on the GPU  Grid = Array of thread blocks that execute a kernel  Thread block = Group of SIMD threads that execute a kernel and can communicate via shared memory MemoryLocationCachedAccessWho LocalOff-chipNoRead/writeOne thread SharedOn-chipN/A - residentRead/writeAll threads in a block GlobalOff-chipNoRead/writeAll threads + host ConstantOff-chipYesReadAll threads + host TextureOff-chipYesReadAll threads + host

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Global Memory Bandwidth Ideal Reality

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Conflicting Data Accesses Cause Serialization and Delays  Massively parallel execution cannot afford serialization  Contentions in accessing critical data causes serialization

Some Information on the Toolkit

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Compilation  Any source file containing CUDA language extensions must be compiled with nvcc  nvcc is a compiler driver −Works by invoking all the necessary tools and compilers like cudacc, g++, etc.  nvcc can output: −Either C code That must then be compiled with the rest of the application using another tool −Or object code directly

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Linking & Profiling  Any executable with CUDA code requires two dynamic libraries: −The CUDA runtime library ( cudart ) −The CUDA core library ( cuda )  Several tools are available to optimize your application −nVIDIA CUDA Visual Profiler −nVIDIA Occupancy Calculator  NVIDIA Parallel Nsight for Visual Studio and Eclipse

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Before you start…  Four lines have to be added to your group users.bash_profile or.bashrc file PATH=$PATH:/usr/local/cuda-6.5/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda- 6.5/lib64:/lib export PATH export LD_LIBRARY_PATH  Code samples is installed with CUDA  Copy and build in your users home directory

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Some usefull resources NVIDIA CUDA Programming Guide NVIDIA CUDA C Best Practices Guide Tuning CUDA Applications for Maxwell Parallel Thread Execution ISA – Version NVIDIA CUDA Getting Started Guide for Linux

Example: Motion JPEG Encoding

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo 14 different MJPEG encoders on GPU Nvidia GeForce GPU Problems: Only used global memory To much synchronization between threads Host part of the code not optimized

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Profiling a Motion JPEG encoder on x86 A small selection of DCT algorithms: 2D-Plain: Standard forward 2D DCT 1D-Plain: Two consecutive 1D transformations with transpose in between and after 1D-AAN: Optimized version of 1D-Plain 2D-Matrix: 2D-Plain implemented with matrix multiplication Single threaded application profiled on a Intel Core i5 750

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Optimizing for GPU, use the memory correctly!! Several different types of memory on GPU: Global memory Constant memory Texture memory Shared memory First Commandment when using the GPUs: Select the correct memory space, AND use it correctly!

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo How about using a better algorithm?? Used CUDA Visual Profiler to isolate DCT performance 2D-Plain Optimized is optimized for GPU: Shared memory Coalesced memory access Loop unrolling Branch prevention Asynchronous transfers Second Commandment when using the GPUs: Choose an algorithm suited for the architecture!

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo Effect of offloading VLC to the GPU VLC (Variable Length Coding) can also be offloaded: One thread per macro block CPU does bitstream merge Even though algorithm is not perfectly suited for the architecture, offloading effect is still important!