GPU Programming with CUDA – Optimisation Mike Griffiths

1 GPU Programming with CUDA – Optimisation
Mike Griffiths
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/

2 Hardware Model
[Diagram: the host CPU, with its DRAM and I/O, is connected over the PCIe bus to the GPU, which has its own GDRAM, I/O, and streaming multiprocessors (SMs) with shared memory. The main program code runs on the CPU; the GPU kernel code runs on the device.]

3 Performance Inhibitors
- Data transfer to/from device memory
- Device under-utilisation
- GPU memory bandwidth
- Code branching

4 Performance Inhibitors
- Data transfer to/from device memory
- Device under-utilisation and occupancy
- GPU memory bandwidth
- Code branching

5 Data Transfer
- The CPU (host) and GPU (device) have separate, dedicated memory.
- All data read or written on the device must be copied across the PCIe bus, a very expensive operation.
- Optimisation technique: minimise data copies by keeping data resident on the device.
- This may mean moving some computation to the GPU even if it is not computationally expensive: it might be quicker to re-calculate data on the device than to copy it.

6 Data Transfer Example
Port the inexpensive routine to the device, then minimise transfers by moving the copies out of the loop.

Before:
  loop over timesteps
    inexpensive_routine_on_host(data_on_host)
    copy data from host to device
    expensive_routine_on_device(data_on_device)
    copy data from device to host
  end loop over timesteps

After:
  copy data from host to device
  loop over timesteps
    inexpensive_routine_on_device(data_on_device)
    expensive_routine_on_device(data_on_device)
  end loop over timesteps
  copy data from device to host
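A minimal CUDA C sketch of the restructured version; the kernel names, the array size n, and the launch configuration are illustrative assumptions, not part of the original slides:

  #include <cuda_runtime.h>

  __global__ void inexpensive_kernel(float *data, int n);  /* hypothetical */
  __global__ void expensive_kernel(float *data, int n);    /* hypothetical */

  void run_timesteps(float *data_host, float *data_dev, int n, int nsteps)
  {
      int threads = 256;
      int blocks = (n + threads - 1) / threads;

      /* One copy in before the loop... */
      cudaMemcpy(data_dev, data_host, n * sizeof(float), cudaMemcpyHostToDevice);

      for (int t = 0; t < nsteps; t++) {
          /* ...both routines now run on the device: no per-step transfers */
          inexpensive_kernel<<<blocks, threads>>>(data_dev, n);
          expensive_kernel<<<blocks, threads>>>(data_dev, n);
      }

      /* ...and one copy out after the loop. */
      cudaMemcpy(data_host, data_dev, n * sizeof(float), cudaMemcpyDeviceToHost);
  }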

7 Performance Inhibitors
- Data transfer to/from device memory
- Device under-utilisation and occupancy
- GPU memory bandwidth
- Code branching

8 Exposing Parallelism
- GPU performance relies on the use of many threads: the degree of parallelism must be much higher than on the CPU, with ideally many more threads than cores.
- Effort must be made to expose as much parallelism as possible, which may require re-engineering your problem.
- If significant sections of the code are serial, GPU acceleration will be limited (Amdahl's Law).

9 Memory Latency
- Access to GPU memory has several hundred cycles of latency; a thread waiting for data is stalled.
- GPUs have very fast context switching, so stalled threads can be switched with active threads. This switching hides memory latency, provided other threads are performing compute.
- It therefore requires many threads, ideally performing large amounts of computation.
- Optimisation technique: have lots of threads with high arithmetic intensity, defined as the ratio of arithmetic computation to memory accesses (see the sketch below).
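For illustration only (both kernels are invented, not from the slides), two kernels that move the same data but differ sharply in arithmetic intensity; the second gives the scheduler much more compute per memory access behind which to hide latency:

  /* Low arithmetic intensity: about one operation per element moved */
  __global__ void low_intensity(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = 2.0f * in[i];
  }

  /* High arithmetic intensity: many operations per element moved */
  __global__ void high_intensity(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          float x = in[i];
          for (int k = 0; k < 64; k++)   /* invented workload */
              x = x * 1.0001f + 0.5f;
          out[i] = x;
      }
  }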

10 Exposing Parallelism Example
Original code:
  loop over i from 1 to 512
    loop over j from 1 to 512
      independent iteration

1D decomposition (512 threads) ✖:
  calc i from thread/block ID
  loop over j from 1 to 512
    independent iteration

2D decomposition (262,144 threads) ✔:
  calc i and j from thread/block ID
  independent iteration

The 1D decomposition exposes only 512 threads; the 2D decomposition exposes 262,144 (see the sketch below).
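A minimal CUDA C sketch of the 2D decomposition; the kernel name, the flattened array layout, and the iteration body are illustrative assumptions:

  /* One thread per (i, j) pair: 512 x 512 = 262,144 threads in total */
  __global__ void update2d(float *data)
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row index */
      int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column index */
      if (i < 512 && j < 512)
          data[i * 512 + j] *= 2.0f;  /* stands in for the independent iteration */
  }

Launched with, for example, dim3 block(16, 16); dim3 grid(32, 32); to cover the full 512 x 512 iteration space.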

11 Performance Inhibitors
- Data transfer to/from device memory
- Device under-utilisation and occupancy
- GPU memory bandwidth
- Code branching

12 Memory Coalescing
- GPUs have high peak memory bandwidth, but maximum bandwidth is achieved only when data is accessed in large requests rather than many small requests.
- Large requests must come from multiple threads; otherwise memory accesses are serialised, degrading performance.
- Memory coalescing: consecutive threads accessing consecutive memory locations.
- Optimisation technique: coalesced memory accesses reduce the number of requests and achieve higher bandwidth.

13 Coalescing Example
Consecutive threads are those with consecutive threadIdx.x values.
Question: do consecutive threads access consecutive memory locations?

  index = blockIdx.x*blockDim.x + threadIdx.x;
  output[index] = 2*input[index];

14 Coalescing Example
Consecutive threads are those with consecutive threadIdx.x values.
Question: do consecutive threads access consecutive memory locations?

  index = blockIdx.x*blockDim.x + threadIdx.x;
  output[index] = 2*input[index];

Yes: consecutive threadIdx.x values lead to consecutive index values, so the accesses are coalesced.

15 Coalescing Example 2
Question: do consecutive threads access consecutive memory locations?

Version A:
  i = blockIdx.x*blockDim.x + threadIdx.x;
  for (j=0; j<N; j++)
      output[i][j] = 2*input[i][j];

Version B:
  j = blockIdx.x*blockDim.x + threadIdx.x;
  for (i=0; i<N; i++)
      output[i][j] = 2*input[i][j];

16 Coalescing Example 2
Version A: No. Consecutive threadIdx.x corresponds to consecutive i values; with row-major storage, consecutive threads access locations a whole row apart, so the accesses are not coalesced.

  i = blockIdx.x*blockDim.x + threadIdx.x;
  for (j=0; j<N; j++)
      output[i][j] = 2*input[i][j];

Version B: Yes. Consecutive threadIdx.x corresponds to consecutive j values, which are adjacent in memory, so the accesses are coalesced.

  j = blockIdx.x*blockDim.x + threadIdx.x;
  for (i=0; i<N; i++)
      output[i][j] = 2*input[i][j];

17 Memory Coalescing in 2D
What about 2D or 3D decompositions? Exactly the same principle applies: it is always threadIdx.x that increments across consecutive threads, so threadIdx.x should map to the fastest-varying (contiguous) array index. E.g. matrix addition:

  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  c[i][j] = a[i][j] + b[i][j];
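A fuller sketch of the same matrix addition as a complete kernel; the kernel name and the flattened row-major layout (usual for dynamically allocated device memory) are illustrative assumptions:

  __global__ void matrix_add(const float *a, const float *b, float *c, int n)
  {
      int j = blockIdx.x * blockDim.x + threadIdx.x;  /* fastest-varying index */
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      if (i < n && j < n)
          c[i * n + j] = a[i * n + j] + b[i * n + j];  /* coalesced in j */
  }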

18 Performance Inhibitors
- Data transfer to/from device memory
- Device under-utilisation and occupancy
- GPU memory bandwidth
- Code branching

19 Code Branching
- On NVIDIA GPUs there are fewer instruction scheduling units than cores.
- Threads are scheduled in groups of 32, called a warp.
- Threads within a warp execute the same instruction in lock-step: Single Instruction Multiple Data (SIMD).
- CUDA C kernels are free to specify branches, BUT every thread in a warp must step through all code paths taken within that warp.
- Optimisation technique: avoid divergent branching within a warp wherever possible.

20 Branching Example
You want to split your threads into two groups, each taking a different branch.

Divergent version: odd and even threads sit in the same warp, so every warp must execute both paths.
  i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i%2 == 0)
      …
  else
      …

Warp-aligned version: all 32 threads of a warp take the same path, so no warp diverges.
  i = blockIdx.x*blockDim.x + threadIdx.x;
  if ((i/32)%2 == 0)
      …
  else
      …
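A sketch of the warp-aligned version as a complete kernel; the kernel name and the two branch bodies are invented for illustration, and blockDim.x is assumed to be a multiple of 32 so that the i/32 test lines up with warp boundaries:

  __global__ void split_work(float *data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n)
          return;
      if ((i / 32) % 2 == 0)
          data[i] *= 2.0f;   /* group A: even-numbered warps */
      else
          data[i] += 1.0f;   /* group B: odd-numbered warps */
  }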

21 CUDA Profiling
- Set the COMPUTE_PROFILE environment variable to 1.
- A log file will be created at runtime, e.g. cuda_profile_0.log.
- It contains timing information for kernels and data transfers.
- It is possible to output more metrics (see doc/Compute_Profiler.txt).

  # CUDA_PROFILE_LOG_VERSION 2.0
  # CUDA_DEVICE 0 Tesla M1060
  # CUDA_CONTEXT 1
  # TIMESTAMPFACTOR fffff6e2e9ee8858
  method,gputime,cputime,occupancy
  method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
  method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
  method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
  method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
  ...

22 Conclusions
- GPUs offer higher floating-point and memory bandwidth performance than CPUs.
- A number of factors can inhibit execution performance, and a number of techniques can be applied to circumvent them.
- Some techniques may require re-engineering your problem; if your application cannot be adapted, GPU performance will not be good!
- It is important to have an understanding of the application, the architecture, and the programming model.

