Lecture 18: CUDA Program Implementation and Debugging
Kyu Ho Park, June 7, 2016
Ref: John Cheng, Max Grossman, Ty McKercher, Professional CUDA C Programming, WROX.
CUDA Debugging
CUDA debugging covers two areas:
(1) Kernel debugging: inspecting the flow and state of kernel execution on the fly. CUDA debugging tools let us examine the state of any variable, in any thread, at any code location on the GPU.
(2) Memory debugging: discovering odd program behavior such as invalid memory accesses and conflicting accesses to the same memory location.
Kernel Debugging
Three techniques for kernel debugging:
(1) cuda-gdb
(2) printf
(3) assert

(1) cuda-gdb
$nvcc -g -G foo.cu -o foo    (-g embeds host-side debug information, -G embeds device-side debug information)
$cuda-gdb foo
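As a concrete target for the commands above, here is a minimal foo.cu sketch (the kernel name sumArrays and the use of unified memory are illustrative assumptions, not from the lecture):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumArrays(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];                     /* a good line for a breakpoint */
}

int main(void)
{
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   /* unified memory, for brevity */
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    sumArrays<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}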
CUDA-gdb
Debugging commands: break, print, run, continue, next, step, quit
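A session on the foo binary built above might use these commands as follows (an illustrative sketch, not captured from a real run; sumArrays is the kernel from the earlier example):

$cuda-gdb foo
(cuda-gdb) break sumArrays      set a breakpoint at kernel entry
(cuda-gdb) run                  stops when sumArrays is launched
(cuda-gdb) next                 step over one source line
(cuda-gdb) print i              inspect a variable in the thread in focus
(cuda-gdb) continue             run to completion
(cuda-gdb) quit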
CUDA-gdb
A CUDA program may contain multiple host threads and many CUDA threads, but a cuda-gdb debugging session focuses on only a single thread at a time. We can use cuda-gdb to report information about the current focus, including the current device, block, and thread:
(cuda-gdb) cuda thread lane warp block sm grid device kernel
kernel 1026, grid 1027, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0
(cuda-gdb) cuda thread (128)
(cuda-gdb) help cuda
[Slides "Kernel debug 1" through "Kernel debug 4": screenshots of a cuda-gdb kernel debugging session; the images are not recoverable from this extraction.]
cuda printf
printf traditionally prints the state of the host, but starting with CUDA 4.0 NVIDIA added printf support on the device:

__global__ void kernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}
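Device printf output is buffered on the GPU and flushed at synchronization points, so the host must synchronize before the output appears. A minimal complete sketch around the kernel above (the launch configuration is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}

int main(void)
{
    kernel<<<2, 4>>>();            /* 2 blocks x 4 threads: 8 output lines */
    cudaDeviceSynchronize();       /* flushes the device-side printf buffer */
    return 0;
}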
CUDA assert
Device code can also call assert (supported on devices of compute capability 2.x and higher). When the asserted condition is false, the failing thread reports the file, line, and thread coordinates, the kernel stops, and the next CUDA API call returns cudaErrorAssert.
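A minimal sketch of device-side assert (the kernel name, data, and condition are illustrative assumptions):

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void checkKernel(const int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        assert(data[i] >= 0);      /* failing threads report file, line, and thread coordinates */
}

int main(void)
{
    const int n = 8;
    int h[8] = {0, 1, 2, -3, 4, 5, 6, 7};     /* the -3 trips the assertion */
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    checkKernel<<<1, n>>>(d, n);
    cudaError_t err = cudaDeviceSynchronize(); /* returns cudaErrorAssert here */
    printf("%s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}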
Memory Debugging
$cuda-memcheck [memcheck_options] app [app_options]
cuda-memcheck includes two separate utilities:
(1) The memcheck tool: checks for out-of-bounds and misaligned accesses in CUDA kernels.
(2) The racecheck tool: detects conflicting accesses to shared memory.
These tools are useful for debugging erratic kernel behavior caused by threads reading or writing unexpected locations.
memcheck
$nvcc -lineinfo -Xcompiler -rdynamic -o debug-segfault debug-segfault.cu
$cuda-memcheck ./debug-segfault
It checks for:
Memory access errors
Hardware exceptions
malloc/free errors
CUDA API errors
cudaMalloc memory leaks
Device heap memory leaks
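As a sketch of the kind of bug memcheck flags (a hypothetical debug-segfault.cu; kernel and variable names are illustrative), an unguarded index lets extra threads write past the allocation, which memcheck reports as invalid __global__ writes, with source lines when the program is built with -lineinfo:

#include <cuda_runtime.h>

__global__ void writeOutOfBounds(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;                           /* missing "if (i < n)" bounds check */
}

int main(void)
{
    const int n = 100;
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    writeOutOfBounds<<<1, 128>>>(d, n);    /* 128 threads, only 100 elements */
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}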
racecheck
$cuda-memcheck --tool racecheck --save racecheck.dump ./debug-hazards > log
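As a sketch of the kind of hazard racecheck detects (a hypothetical debug-hazards.cu; names are illustrative), a shared-memory read that races with writes from other threads because a barrier is missing:

#include <cuda_runtime.h>

__global__ void reverseShared(int *data)
{
    __shared__ int tile[64];
    int t = threadIdx.x;
    tile[t] = data[t];
    /* a __syncthreads() barrier belongs here; without it, the read below
       races with writes to tile[] made by other threads in the block */
    data[t] = tile[63 - t];
}

int main(void)
{
    int *d;
    cudaMalloc(&d, 64 * sizeof(int));
    reverseShared<<<1, 64>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}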
CUDA Code Compilation
The nvcc frontend splits sample.cu into host code and device code. The device compiler produces a fatbinary, which is embedded in the host code; the host C/C++ compiler then produces sample.o.
Compiling CUDA functions
CUDA provides two compilation methods:
(1) Whole-program compilation
(2) Separate compilation
Separate compilation for device code was introduced in CUDA 5.0.
Separate Compilation
a.cu and b.cu are each compiled by the frontend into object files containing relocatable device code (a.o, b.o); the device linker combines their device code into link.o, and the host linker combines a.o, b.o, c.o, and link.o into the executable.

$nvcc -arch=sm_20 -dc a.cu b.cu
/* the -dc option instructs nvcc to compile each input file into an object file containing relocatable device code */
$nvcc -arch=sm_20 -dlink a.o b.o -o link.o
$g++ -c c.cpp -o c.o
$g++ c.o link.o -o test -L<path> -lcudart
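What -dc enables: device code in one translation unit can call a __device__ function defined in another, with the call resolved at device-link time. A two-file sketch (file and function names are illustrative assumptions):

/* a.cu */
__device__ float scale(float x)
{
    return 2.0f * x;
}

/* b.cu */
extern __device__ float scale(float x);    /* defined in a.cu */

__global__ void applyScale(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = scale(v[i]);                /* cross-file device call, resolved by -dlink */
}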
Profile-Driven Optimization
Iterative approach:
1. Apply a profiler to the application to gather information
2. Identify application hotspots
3. Determine performance inhibitors
4. Optimize the code
5. Repeat the previous steps until the desired result is achieved
Performance inhibitors for a kernel:
1. Memory bandwidth
2. Instruction throughput
3. Latency
Optimization using nvprof
Command: nvprof [nvprof-options] <application> [application-arguments]
nvprof modes:
1. Summary mode: the default mode
2. Trace mode: nvprof-options = {--print-gpu-trace, --print-api-trace}
3. Event/metric summary mode: nvprof-options = {--events <event names>, --metrics <metric names>}
4. Event/metric trace mode: --aggregate-mode off [events|metrics]
nvprof
5. To query all built-in events and metrics: options = {--query-events, --query-metrics}
$nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram
Global Memory Access Pattern
Accesses to global memory should be aligned and coalesced for optimal execution.
- gld_efficiency: the ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: the same ratio for global memory stores.
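For example, strided access lowers these metrics because each warp loads full memory segments but uses only part of them (a sketch; the kernel names and stride parameter are illustrative, and with stride 2 gld_efficiency drops to roughly 50%):

__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          /* consecutive threads touch consecutive addresses */
}

__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];          /* with stride 2, half of every loaded segment is unused */
}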
CUDA C Development Process
APOD: Assessment, Parallelization, Optimization, Deployment
Performance Optimization
Paulius Micikevicius, "Performance Optimization", NVIDIA, 2011. (sc11-perf-optimization.pdf)
Future of GPUs (1)
John Ashley, "GPUs and the Future of Accelerated Computing", Emerging Technology Conference 2014, U. of Manchester. (NVIDIA_ManchesterEMiT.pdf)
Future of GPUs (2)
Timothy Lanfear, "GPU Computing: Past, Present, Future", NVIDIA. (TimLanfear.pdf)