GPU Performance Optimisation
Alan Gray, EPCC, The University of Edinburgh

Hardware Memory
NVIDIA accelerated system (diagram of host and device memories; not reproduced in this transcript)

GPU performance inhibitors
Copying data to/from device
Device under-utilisation / GPU memory latency
GPU memory bandwidth
Code branching
This lecture will address each of these
–And advise how to maximise performance
–Concentrating on NVIDIA, but many concepts will be transferable to e.g. AMD

Host – Device Data Copy
CPU (host) and GPU (device) have separate memories.
All data read/written on the device must be copied to/from the device (over the PCIe bus).
–This is very expensive
Must try to minimise copies
–Keep data resident on device
–May involve porting more routines to device, even if they are not computationally expensive
–Might be quicker to calculate something from scratch on the device instead of copying it from the host

Data copy optimisation example
Port the inexpensive routine to the device and move the data copies outside of the loop.

Original:
Loop over timesteps
  inexpensive_routine_on_host(data_on_host)
  copy data from host to device
  expensive_routine_on_device(data_on_device)
  copy data from device to host
End loop over timesteps

Optimised:
copy data from host to device
Loop over timesteps
  inexpensive_routine_on_device(data_on_device)
  expensive_routine_on_device(data_on_device)
End loop over timesteps
copy data from device to host
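
As a concrete illustration (not from the original slides), the same restructuring in CUDA C might look as follows; the routine names, placeholder computations and launch parameters are hypothetical:

/* Hypothetical sketch: keep data resident on the device across timesteps,
   copying only once before and once after the time loop. */
#include <cuda_runtime.h>

__global__ void inexpensive_routine_on_device(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          /* placeholder work */
}

__global__ void expensive_routine_on_device(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          /* placeholder work */
}

void run(float *data_on_host, int n, int nsteps) {
    float *data_on_device;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&data_on_device, bytes);

    /* single host-to-device copy, outside the timestep loop */
    cudaMemcpy(data_on_device, data_on_host, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int step = 0; step < nsteps; step++) {
        inexpensive_routine_on_device<<<blocks, threads>>>(data_on_device, n);
        expensive_routine_on_device<<<blocks, threads>>>(data_on_device, n);
    }

    /* single device-to-host copy, after the timestep loop */
    cudaMemcpy(data_on_host, data_on_device, bytes, cudaMemcpyDeviceToHost);
    cudaFree(data_on_device);
}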

Exposing parallelism
GPU performance relies on parallel use of many threads
–Degree of parallelism much higher than on a CPU
Effort must be made to expose as much parallelism as possible within the application
–May involve rewriting/refactoring
If significant sections of code remain serial, effectiveness of GPU acceleration will be limited (Amdahl’s law)
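
As an illustration (this worked example is not in the original slides): Amdahl's law gives an overall speed-up of 1 / ((1 - p) + p/s), where p is the fraction of runtime that is parallelised and s is the speed-up of that fraction. Even if 90% of the runtime is offloaded and accelerated arbitrarily well, the overall speed-up is bounded by 1 / (1 - 0.9) = 10x.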

Occupancy and Memory Latency Hiding
Programmer decomposes loops in code to threads
–Obviously, there must be at least as many total threads as cores, otherwise cores will be left idle. For best performance, we actually want #threads >> #cores
Accesses to GPU memory have several hundred cycles of latency
–When a thread stalls waiting for data, this latency can be hidden if another thread can switch in.
NVIDIA GPUs have very fast thread switching, and support many concurrent threads
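
A minimal sketch of oversubscribing the cores (not from the original slides; the kernel, array size and block size are hypothetical): the grid size is derived from the problem size, so far more threads are launched than there are cores.

/* Hypothetical sketch: one thread per element of a large array, so that
   stalled warps can be swapped for ready ones and latency is hidden. */
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int n = 1 << 22;                                   /* ~4 million elements */
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   /* 16384 blocks */
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_x, n); /* ~4M threads: far more than cores */
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}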

Exposing parallelism example

Original code:
Loop over i from 1 to 512
  Loop over j from 1 to 512
    independent iteration

1D decomposition (512 threads) ✖:
Calc i from thread/block ID
Loop over j from 1 to 512
  independent iteration

2D decomposition (262,144 threads) ✔:
Calc i & j from thread/block ID
independent iteration
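
A possible CUDA C realisation of the 2D decomposition (not from the original slides; the kernel name, data layout and block dimensions are illustrative):

/* Hypothetical sketch: map both loop indices onto a 2D grid of threads,
   giving 512 x 512 = 262,144 threads rather than 512. */
#include <cuda_runtime.h>

#define N 512

__global__ void iterate2d(float *data) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* former outer loop index */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* former inner loop index */
    if (i < N && j < N)
        data[i * N + j] += 1.0f;                     /* the "independent iteration" */
}

int main(void) {
    float *d_data;
    cudaMalloc(&d_data, N * N * sizeof(float));
    dim3 threadsPerBlock(16, 16);                    /* 256 threads per block */
    dim3 blocks(N / 16, N / 16);                     /* 32 x 32 blocks */
    iterate2d<<<blocks, threadsPerBlock>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}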

Memory coalescing
GPUs have high peak memory bandwidth
Maximum memory bandwidth is only achieved when data is accessed for multiple threads in a single transaction: memory coalescing
To achieve this, ensure that consecutive threads access consecutive memory locations
Otherwise, memory accesses are serialised, significantly degrading performance
–Adapting code to allow coalescing can dramatically improve performance

Memory coalescing example
Consecutive threads are those with consecutive threadIdx.x (C) or threadIdx%x (Fortran) values
Do consecutive threads access consecutive memory locations?

C:
index = blockIdx.x*blockDim.x + threadIdx.x;
output[index] = 2*input[index];

Fortran:
index = (blockIdx%x-1)*blockDim%x + threadIdx%x
result(index) = 2*input(index)

Coalesced ✔. Consecutive threadIdx values correspond to consecutive index values

Memory coalescing examples
Do consecutive threads read consecutive memory locations?
In C, the last (rightmost) index runs fastest in memory: j here

i = blockIdx.x*blockDim.x + threadIdx.x;
for (j=0; j<N; j++)
  output[i][j] = 2*input[i][j];
✖ Not coalesced. Consecutive threadIdx.x values correspond to consecutive i values

j = blockIdx.x*blockDim.x + threadIdx.x;
for (i=0; i<N; i++)
  output[i][j] = 2*input[i][j];
✔ Coalesced. Consecutive threadIdx.x values correspond to consecutive j values

Memory coalescing examples
Do consecutive threads read consecutive memory locations?
In Fortran, the first (leftmost) index runs fastest in memory: i here

j = (blockIdx%x-1)*blockDim%x + threadIdx%x
do i=1, 256
  output(i,j) = 2*input(i,j)
end do
✖ Not coalesced. Consecutive threadIdx%x values correspond to consecutive j values

i = (blockIdx%x-1)*blockDim%x + threadIdx%x
do j=1, 256
  output(i,j) = 2*input(i,j)
end do
✔ Coalesced. Consecutive threadIdx%x values correspond to consecutive i values

Memory coalescing examples
What about when using 2D or 3D CUDA decompositions?
–Same procedure: the x component of threadIdx is always the one that increments across consecutive threads
–E.g., for matrix addition, coalescing is achieved as follows:

C:
int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;
c[i][j] = a[i][j] + b[i][j];

Fortran:
i = (blockIdx%x-1)*blockDim%x + threadIdx%x
j = (blockIdx%y-1)*blockDim%y + threadIdx%y
c(i,j) = a(i,j) + b(i,j)
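
A possible host-side launch configuration for the C version above (not from the original slides; the matrix size, block shape and flattened row-major indexing are illustrative; the slide's c[i][j] notation assumes statically sized arrays):

/* Hypothetical sketch: 2D grid/block launch for coalesced matrix addition.
   Arrays are flattened row-major, so consecutive j means consecutive memory. */
#include <cuda_runtime.h>

__global__ void mat_add(const float *a, const float *b, float *c,
                        int rows, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* fast (column) index */
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* slow (row) index */
    if (i < rows && j < cols)
        c[i * cols + j] = a[i * cols + j] + b[i * cols + j];
}

void launch_mat_add(const float *d_a, const float *d_b, float *d_c,
                    int rows, int cols) {
    dim3 block(32, 8);                               /* 256 threads per block */
    dim3 grid((cols + block.x - 1) / block.x,        /* ceiling division */
              (rows + block.y - 1) / block.y);
    mat_add<<<grid, block>>>(d_a, d_b, d_c, rows, cols);
}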

Code Branching
On NVIDIA GPUs, there are fewer instruction scheduling units than cores
Threads are scheduled in groups of 32, called a warp
Threads within a warp must execute the same instruction in lock-step (on different data elements)
The CUDA programming model allows branching, but if threads within a warp diverge, the whole warp executes all branches
–With only the required results saved
–This is obviously suboptimal
Must avoid intra-warp branching wherever possible (especially in key computational sections)

Branching example
E.g. you want to split your threads into 2 groups:

i = blockIdx.x*blockDim.x + threadIdx.x;
if (i%2 == 0)
  …
else
  …
✖ Threads within a warp diverge

i = blockIdx.x*blockDim.x + threadIdx.x;
if ((i/32)%2 == 0)
  …
else
  …
✔ Threads within a warp follow the same path

On-chip memory
Global memory accesses can be avoided altogether if the fast on-chip memory can be utilised
–But it is relatively small
Shared memory: shared between threads in a block
–64kB on each SM in current devices
–__shared__ qualifier (C), shared attribute (Fortran)
–The same memory can be used by the system as an automatic L1 cache
–User can specify the L1/shared configuration (e.g. a 48/16 kB split)
Constant memory: 64kB of read-only memory
–__constant__ qualifier (C), constant attribute (Fortran); use cudaMemcpyToSymbol to copy from host (see programming guide)
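
A minimal sketch of using these qualifiers in CUDA C (not from the original slides; the kernel, staging pattern and sizes are illustrative):

/* Hypothetical sketch: stage a block's worth of data in __shared__ memory,
   and read a coefficient from __constant__ memory set with cudaMemcpyToSymbol. */
#include <cuda_runtime.h>

#define BLOCK 256

__constant__ float coeff;                    /* constant memory, set from the host */

__global__ void scale_with_shared(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];            /* shared between threads in a block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                         /* make the tile visible to all threads */
    if (i < n) out[i] = coeff * tile[threadIdx.x];
}

void run(const float *d_in, float *d_out, int n) {
    float c = 2.0f;
    cudaMemcpyToSymbol(coeff, &c, sizeof(float));   /* copy to constant memory */
    int blocks = (n + BLOCK - 1) / BLOCK;
    scale_with_shared<<<blocks, BLOCK>>>(d_in, d_out, n);
}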

CUDA Profiling
Simply set the COMPUTE_PROFILE environment variable to 1
A log file, e.g. cuda_profile_0.log, is created at runtime: timing information for kernels and data transfer
It is possible to output more metrics (cache misses etc.)
–See the doc/Compute_Profiler.txt file in the main CUDA installation

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M1060
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6e2e9ee8858
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ ] cputime=[ ]
method=[ memcpyHtoD ] gputime=[ ] cputime=[ ]
method=[ memcpyHtoD ] gputime=[ ] cputime=[ ]
method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ ] cputime=[ ] occupancy=[ ]
...

Conclusions
GPU architecture offers higher floating-point and memory bandwidth performance than leading CPUs
There are a number of factors which can inhibit application performance on the GPU
–And a number of steps which can be taken to circumvent these inhibitors
–Some of these may require significant development/tuning for real applications
It is important to have a good understanding of the application, the architecture and the programming model