Lecture 18 CUDA Program Implementation and Debugging


1 Lecture 18 CUDA Program Implementation and Debugging
Kyu Ho Park, June 7, 2016
Ref: John Cheng, Max Grossman, Ty McKercher, Professional CUDA C Programming, WROX

2 CUDA Debugging
CUDA debugging comprises kernel debugging and memory debugging.
(1) Kernel debugging: inspecting the flow and state of kernel execution on the fly. CUDA debugging tools let us examine the state of any variable, in any thread, at any code location on the GPU.
(2) Memory debugging: discovering odd program behavior, such as invalid memory accesses or conflicting accesses to the same memory location.

3 Kernel Debugging
Three techniques for kernel debugging:
(1) cuda-gdb
(2) printf
(3) assert
(1) cuda-gdb:
$ nvcc -g -G foo.cu -o foo
$ cuda-gdb foo

4 cuda-gdb Debugging Commands
break, print, run, continue, next, step, quit

5 cuda-gdb
A CUDA program may contain multiple host threads and many CUDA threads, but a cuda-gdb debugging session focuses on only a single thread at a time. We can use cuda-gdb to report information about the current focus, including the current device, current block, and current thread:
(cuda-gdb) cuda thread lane warp block sm grid device kernel
kernel 1, grid 1, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0
(cuda-gdb) cuda thread (128)
(cuda-gdb) help cuda
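The commands above can be combined into a short session. The following is a minimal sketch, assuming a hypothetical kernel named `fooKernel` and a device variable `tid` in `foo.cu`:

```
$ nvcc -g -G foo.cu -o foo            # -g: host debug info, -G: device debug info
$ cuda-gdb foo
(cuda-gdb) break fooKernel            # stop at kernel entry
(cuda-gdb) run
(cuda-gdb) cuda kernel block thread   # report the current focus
(cuda-gdb) cuda thread (128)          # switch focus to thread 128 of the current block
(cuda-gdb) print tid                  # inspect a device variable in the focused thread
(cuda-gdb) continue
```

Note that `-G` disables most device-code optimizations, so a debug build runs noticeably slower than a release build.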

6 Kernel debug 1

7 Kernel debug 2

8 Kernel debug 3

9 Kernel debug 4

10 cuda printf
printf traditionally prints the state of the host. Starting with CUDA 4.0, NVIDIA added printf support on the device:
__global__ void kernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}
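A complete host program around this kernel might look like the following sketch. Device-side printf output is buffered on the GPU, so the host must synchronize before the program exits or the output may never be flushed; the kernel and launch configuration here are illustrative:

```cuda
#include <cstdio>

__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}

int main() {
    helloKernel<<<2, 4>>>();     // 2 blocks of 4 threads: 8 lines of output
    cudaDeviceSynchronize();     // flush the device-side printf buffer
    return 0;
}
```

The 8 lines appear in no guaranteed order, since thread scheduling is nondeterministic.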

11 printf

12 CUDA assert
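Device-side assert works like host assert: a failing assertion stops the kernel and reports the file, line, block index, and thread index. A minimal sketch, assuming a device of compute capability 2.0 or higher (the data and kernel names are illustrative):

```cuda
#include <cassert>
#include <cstdio>

__global__ void checkKernel(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        assert(data[tid] >= 0);   // traps if any element is negative
}

int main() {
    int h[4] = {1, 2, -3, 4};     // -3 will trigger the assert in thread 2
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    checkKernel<<<1, 4>>>(d, 4);
    // After a failed device assert, synchronization returns an error code.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}
```

Asserts can be compiled out of release builds by defining NDEBUG, as on the host.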

13 Memory Debugging
$ cuda-memcheck [memcheck_options] app [app_options]
cuda-memcheck includes two separate utilities:
(1) The memcheck tool: checks for out-of-bounds and misaligned accesses in CUDA kernels.
(2) The racecheck tool: detects conflicting accesses to shared memory.
These tools are useful for debugging erratic kernel behavior caused by threads reading or writing unexpected locations.

14 memcheck
$ nvcc -lineinfo -Xcompiler -rdynamic -o debug-segfault debug-segfault.cu
$ cuda-memcheck ./debug-segfault
It checks for:
- Memory access errors
- Hardware exceptions
- malloc/free errors
- CUDA API errors
- cudaMalloc memory leaks
- Device heap memory leaks
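The kind of bug memcheck catches can be shown with a small sketch (the kernel name and sizes are illustrative): 64 threads write into a 32-element allocation, so the upper threads write out of bounds. memcheck reports an invalid global write of size 4, and with -lineinfo it also names the offending source line:

```cuda
__global__ void oobKernel(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = tid;              // no bounds check: threads with tid >= n write OOB
}

int main() {
    int *d;
    cudaMalloc(&d, 32 * sizeof(int));
    oobKernel<<<1, 64>>>(d, 32);  // 64 threads, only 32 elements allocated
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Adding the missing `if (tid < n)` guard makes the report disappear.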

15 racecheck
$ cuda-memcheck --tool racecheck --save racecheck.dump ./debug-hazards > log
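A typical hazard racecheck flags is unsynchronized conflicting writes to the same shared-memory location. In this illustrative sketch, every thread in the block writes the same shared variable before the barrier, a write-after-write hazard whose outcome depends on scheduling:

```cuda
__global__ void raceKernel(int *out) {
    __shared__ int s;
    s = threadIdx.x;              // conflicting writes: all threads target s
    __syncthreads();
    if (threadIdx.x == 0)
        *out = s;                 // which thread's value survives is undefined
}
```

Restricting the write to a single thread (e.g. `if (threadIdx.x == 0) s = ...;`) removes the hazard.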

16 racecheck

17 CUDA Code Compilation
sample.cu → nvcc frontend → host code + device code
Device code → device compiler → fatbinary
Host code + fatbinary → host C/C++ compiler → sample.o

18 Compiling CUDA functions
CUDA provides two compilation methods:
(1) Whole-program compilation
(2) Separate compilation
Separate compilation for device code was introduced in CUDA 5.0.

19 Separate Compilation
a.cu, b.cu → nvcc frontend → a.o, b.o; c.cpp → host compiler → c.o
a.o, b.o → device linker → dlink.o; host linker → executable
$ nvcc -arch=sm_20 -dc a.cu b.cu
/* The -dc option instructs nvcc to compile each input file into an object file that contains relocatable device code. */
$ nvcc -arch=sm_20 -dlink a.o b.o -o dlink.o
$ g++ -c c.cpp -o c.o
$ g++ c.o dlink.o -o test -L<path> -lcudart
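What makes relocatable device code necessary is a device function defined in one translation unit and called from another. A minimal sketch matching the a.cu/b.cu names above (the function and kernel are hypothetical):

```cuda
/* a.cu: defines a device function used from another file */
__device__ int square(int x) { return x * x; }

/* b.cu: declares and calls the device function defined in a.cu;
   this cross-file device call is what requires -dc and a device-link step */
extern __device__ int square(int x);

__global__ void squareKernel(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] = square(data[tid]);
}
```

Under whole-program compilation, this cross-file call would fail at compile time, since each .cu file's device code would be compiled in isolation.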


21 Profile-Driven Optimization
Iterative approach:
1. Apply a profiler to the application to gather information
2. Identify application hotspots
3. Determine performance inhibitors
4. Optimize the code
5. Repeat the previous steps until the desired result is achieved
Performance inhibitors for a kernel:
1. Memory bandwidth
2. Instruction throughput
3. Latency

22 Optimization using nvprof
Command: nvprof [nvprof-options] <application> [application-arguments]
nvprof modes:
1. Summary mode: the default mode
2. Trace mode: nvprof-options = {--print-gpu-trace, --print-api-trace}
3. Event/Metric summary mode: nvprof-options = {--events <event names>, --metrics <metric names>}
4. Event/Metric trace mode: --aggregate-mode off [events|metrics]

23 nvprof
5. To query all built-in events and metrics: options = {--query-events, --query-metrics}
$ nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram

24 Global Memory Access Pattern
$ nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram
Global memory accesses should be aligned and coalesced for optimal execution.
- gld_efficiency: the ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: the same ratio for global memory stores.
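The difference these metrics expose can be illustrated with two kernels (the names and the stride of 32 are illustrative). Profiling both with the nvprof command above would show gld_efficiency near 100% for the first and far lower for the second:

```cuda
// Coalesced: consecutive threads read consecutive 4-byte words,
// so each 128-byte memory transaction is fully used.
__global__ void coalescedRead(float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Strided: consecutive threads read words 32 elements apart, so each
// transaction delivers only one useful word and gld_efficiency drops.
__global__ void stridedRead(float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * 32];
}
```

Reorganizing data layout so that adjacent threads touch adjacent addresses is the usual fix for low load/store efficiency.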

25 CUDA C Development Process
APOD: Assessment, Parallelization, Optimization, Deployment

26 Performance Optimization
Paulius Micikevicius, "Performance Optimization", NVIDIA, sc11-perf-optimization.pdf

27 Future of GPUs (1)
John Ashley, "GPUs and the Future of Accelerated Computing", Emerging Technology Conference 2014, U. of Manchester, NVIDIA_ManchesterEMiT.pdf

28 Future of GPUs (2)
Timothy Lanfear, "GPU Computing: Past, Present, Future", NVIDIA, TimLanfear.pdf

