Lecture 18 CUDA Program Implementation and Debugging


1 Lecture 18 CUDA Program Implementation and Debugging
Kyu Ho Park, June 7, 2016
Ref: John Cheng, Max Grossman, Ty McKercher, Professional CUDA C Programming, WROX

2 CUDA Debugging
CUDA debugging comprises kernel debugging and memory debugging.
(1) Kernel debugging: inspecting the flow and state of kernel execution on the fly. CUDA debugging tools let us examine the state of any variable, in any thread, at any code location on the GPU.
(2) Memory debugging: discovering odd program behavior, such as invalid memory accesses or conflicting accesses to the same memory location.

3 Kernel Debugging
Three techniques for kernel debugging:
(1) cuda-gdb
(2) printf
(3) assert
(1) cuda-gdb:
$ nvcc -g -G foo.cu -o foo
$ cuda-gdb foo

4 cuda-gdb Debugging Commands
break, print, run, continue, next, step, quit

5 cuda-gdb
A CUDA program may contain multiple host threads and many CUDA threads, but a cuda-gdb debugging session focuses on only a single thread at a time. We can use cuda-gdb to report information about the current focus, including the current device, current block, and current thread:
(cuda-gdb) cuda thread lane warp block sm grid device kernel
kernel 1, grid 1, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0
(cuda-gdb) cuda thread (128)
(cuda-gdb) help cuda
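The commands above can be combined into a short session. The following is a minimal sketch, assuming a hypothetical kernel named `fooKernel` and a device variable `tid` in `foo.cu`:

```
$ nvcc -g -G foo.cu -o foo            # -g: host debug info, -G: device debug info
$ cuda-gdb foo
(cuda-gdb) break fooKernel            # stop at kernel entry
(cuda-gdb) run
(cuda-gdb) cuda kernel block thread   # report the current focus
(cuda-gdb) cuda thread (128)          # switch focus to thread 128 of the current block
(cuda-gdb) print tid                  # inspect a device variable in the focused thread
(cuda-gdb) continue
```

Note that `-G` disables most device-code optimizations, so a debug build runs noticeably slower than a release build.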

6 Kernel debug 1

7 Kernel debug 2

8 Kernel debug 3

9 Kernel debug 4

10 cuda printf
printf traditionally prints the state of the host. Starting with CUDA 4.0, NVIDIA added printf support on the device:
__global__ void kernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}
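A complete host program around this kernel might look like the following sketch. Device-side printf output is buffered on the GPU, so the host must synchronize before the program exits or the output may never be flushed; the kernel and launch configuration here are illustrative:

```cuda
#include <cstdio>

__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}

int main() {
    helloKernel<<<2, 4>>>();     // 2 blocks of 4 threads: 8 lines of output
    cudaDeviceSynchronize();     // flush the device-side printf buffer
    return 0;
}
```

The 8 lines appear in no guaranteed order, since thread scheduling is nondeterministic.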

11 printf

12 CUDA assert
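Device-side assert works like host assert: a failing assertion stops the kernel and reports the file, line, block index, and thread index. A minimal sketch, assuming a device of compute capability 2.0 or higher (the data and kernel names are illustrative):

```cuda
#include <cassert>
#include <cstdio>

__global__ void checkKernel(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        assert(data[tid] >= 0);   // traps if any element is negative
}

int main() {
    int h[4] = {1, 2, -3, 4};     // -3 will trigger the assert in thread 2
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    checkKernel<<<1, 4>>>(d, 4);
    // After a failed device assert, synchronization returns an error code.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}
```

Asserts can be compiled out of release builds by defining NDEBUG, as on the host.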

13 Memory Debugging
$ cuda-memcheck [memcheck_options] app [app_options]
cuda-memcheck includes two separate utilities:
(1) The memcheck tool: checks for out-of-bounds and misaligned accesses in CUDA kernels.
(2) The racecheck tool: detects conflicting accesses to shared memory.
These tools are useful for debugging erratic kernel behavior caused by threads reading or writing unexpected locations.

14 memcheck
$ nvcc -lineinfo -Xcompiler -rdynamic -o debug-segfault debug-segfault.cu
$ cuda-memcheck ./debug-segfault
It checks for:
- Memory access errors
- Hardware exceptions
- malloc/free errors
- CUDA API errors
- cudaMalloc memory leaks
- Device heap memory leaks
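The kind of bug memcheck catches can be shown with a small sketch (the kernel name and sizes are illustrative): 64 threads write into a 32-element allocation, so the upper threads write out of bounds. memcheck reports an invalid global write of size 4, and with -lineinfo it also names the offending source line:

```cuda
__global__ void oobKernel(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = tid;              // no bounds check: threads with tid >= n write OOB
}

int main() {
    int *d;
    cudaMalloc(&d, 32 * sizeof(int));
    oobKernel<<<1, 64>>>(d, 32);  // 64 threads, only 32 elements allocated
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Adding the missing `if (tid < n)` guard makes the report disappear.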

15 racecheck
$ cuda-memcheck --tool racecheck --save racecheck.dump ./debug-hazards > log
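A typical hazard racecheck flags is unsynchronized conflicting writes to the same shared-memory location. In this illustrative sketch, every thread in the block writes the same shared variable before the barrier, a write-after-write hazard whose outcome depends on scheduling:

```cuda
__global__ void raceKernel(int *out) {
    __shared__ int s;
    s = threadIdx.x;              // conflicting writes: all threads target s
    __syncthreads();
    if (threadIdx.x == 0)
        *out = s;                 // which thread's value survives is undefined
}
```

Restricting the write to a single thread (e.g. `if (threadIdx.x == 0) s = ...;`) removes the hazard.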

16 racecheck

17 CUDA Code Compilation
sample.cu → nvcc frontend → host code + device code
Device code → device compiler → fatbinary
Host code + fatbinary → host C/C++ compiler → sample.o

18 Compiling CUDA functions
CUDA provides two compilation methods:
(1) Whole-program compilation
(2) Separate compilation
Separate compilation for device code was introduced in CUDA 5.0.

19 Separate Compilation
a.cu, b.cu → nvcc frontend → a.o, b.o; c.cpp → host compiler → c.o
a.o, b.o → device linker → dlink.o; host linker → executable
$ nvcc -arch=sm_20 -dc a.cu b.cu
/* The -dc option instructs nvcc to compile each input file into an object file that contains relocatable device code. */
$ nvcc -arch=sm_20 -dlink a.o b.o -o dlink.o
$ g++ -c c.cpp -o c.o
$ g++ c.o dlink.o -o test -L<path> -lcudart
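What makes relocatable device code necessary is a device function defined in one translation unit and called from another. A minimal sketch matching the a.cu/b.cu names above (the function and kernel are hypothetical):

```cuda
/* a.cu: defines a device function used from another file */
__device__ int square(int x) { return x * x; }

/* b.cu: declares and calls the device function defined in a.cu;
   this cross-file device call is what requires -dc and a device-link step */
extern __device__ int square(int x);

__global__ void squareKernel(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] = square(data[tid]);
}
```

Under whole-program compilation, this cross-file call would fail at compile time, since each .cu file's device code would be compiled in isolation.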


21 Profile-Driven Optimization
Iterative approach:
1. Apply a profiler to the application to gather information
2. Identify application hotspots
3. Determine performance inhibitors
4. Optimize the code
5. Repeat the previous steps until the desired result is achieved
Performance inhibitors for a kernel:
1. Memory bandwidth
2. Instruction throughput
3. Latency

22 Optimization using nvprof
Command: nvprof [nvprof-options] <application> [application-arguments]
nvprof modes:
1. Summary mode: the default mode
2. Trace mode: nvprof-options = {--print-gpu-trace, --print-api-trace}
3. Event/Metric summary mode: nvprof-options = {--events <event names>, --metrics <metric names>}
4. Event/Metric trace mode: --aggregate-mode off [events|metrics]

23 nvprof
5. To query all built-in events and metrics: options = {--query-events, --query-metrics}
$ nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram

24 Global Memory Access Pattern
$ nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram
Global memory accesses should be aligned and coalesced for optimal execution.
- gld_efficiency: the ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: the same ratio for global memory stores.
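The difference these metrics expose can be illustrated with two kernels (the names and the stride of 32 are illustrative). Profiling both with the nvprof command above would show gld_efficiency near 100% for the first and far lower for the second:

```cuda
// Coalesced: consecutive threads read consecutive 4-byte words,
// so each 128-byte memory transaction is fully used.
__global__ void coalescedRead(float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Strided: consecutive threads read words 32 elements apart, so each
// transaction delivers only one useful word and gld_efficiency drops.
__global__ void stridedRead(float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * 32];
}
```

Reorganizing data layout so that adjacent threads touch adjacent addresses is the usual fix for low load/store efficiency.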

25 CUDA C Development Process
APOD: Assessment, Parallelization, Optimization, Deployment

26 Performance Optimization
Paulius Micikevicius, "Performance Optimization", NVIDIA, sc11-perf-optimization.pdf

27 Future of GPUs (1)
John Ashley, "GPUs and the Future of Accelerated Computing", Emerging Technology Conference 2014, U. of Manchester, NVIDIA_ManchesterEMiT.pdf

28 Future of GPUs (2)
Timothy Lanfear, "GPU Computing: Past, Present, Future", NVIDIA, TimLanfear.pdf

