1
Lecture 18 CUDA Program Implementation and Debugging
Kyu Ho Park, June 7, 2016
Ref: John Cheng, Max Grossman, Ty McKercher, Professional CUDA C Programming, Wrox
2
CUDA Debugging
(1) Kernel debugging
Inspects the flow and state of kernel execution on the fly. CUDA debugging tools let us examine the state of any variable in any thread at any code location on the GPU.
(2) Memory debugging
Focuses on discovering odd program behavior, such as invalid memory accesses and conflicting accesses to the same memory location.
3
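Kernel Debugging
Three techniques for kernel debugging:
(1) cuda-gdb
(2) printf
(3) assert
(1) cuda-gdb
Compile with -g (host debug information) and -G (device debug information), then start the debugger:
$ nvcc -g -G foo.cu -o foo
$ cuda-gdb foo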
4
CUDA-gdb
Debugging commands: break, print, run, continue, next, step, quit
5
CUDA-gdb
A CUDA program may contain multiple host threads and many CUDA threads, but a cuda-gdb debugging session focuses on a single thread at a time. We can use cuda-gdb to report information about the current focus, including the current device, block, and thread.
(cuda-gdb) cuda thread lane warp block sm grid device kernel
kernel 1026, grid 1027, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0
To switch the focus to another thread in the current block:
(cuda-gdb) cuda thread (128)
To list the available cuda commands:
(cuda-gdb) help cuda
6
Kernel debug 1
7
Kernel debug 2
8
Kernel debug 3
9
Kernel debug 4
10
CUDA printf
On the host, printf is a familiar way to print program state. Starting with CUDA 4.0, NVIDIA added printf support on the device as well.
__global__ void kernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}
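The kernel above needs a host side before it prints anything. The following is a minimal sketch, not the lecture's own code: the names helloKernel and runHello and the launch configuration are assumptions chosen for illustration. Device printf output is buffered, so the host synchronizes before exiting; the order of lines across threads is not guaranteed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its global index from device code.
__global__ void helloKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}

// Hypothetical helper: launches the kernel and returns the first CUDA error
// seen (cudaSuccess if none).
cudaError_t runHello(int blocks, int threadsPerBlock) {
    helloKernel<<<blocks, threadsPerBlock>>>();
    cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // waits for the kernel and flushes device printf
}

int main() {
    cudaError_t err = runHello(2, 4);       // 2 blocks of 4 threads: 8 lines of output
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Printing from every thread floods the terminal for realistic grid sizes, so in practice the printf is usually guarded by a condition on tid.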
11
printf
12
CUDA assert
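A device-side assert can be sketched as follows. This is an illustrative example under stated assumptions, not the code from the slide: the kernel doubleNonNegative, the helper runDouble, and the non-negativity invariant are all hypothetical. When a device assert fails, the kernel is halted and the error surfaces on the host at the next synchronizing call; here the inputs satisfy the invariant, so the run completes normally.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Doubles each element. The assert states the (assumed) invariant that
// every input element is non-negative.
__global__ void doubleNonNegative(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        assert(data[tid] >= 0);
        data[tid] *= 2;
    }
}

// Hypothetical helper: copies host data to the device, runs the kernel,
// and copies the result back. Returns the first CUDA error seen.
cudaError_t runDouble(int *host, int n) {
    int *dev = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        doubleNonNegative<<<blocks, threads>>>(dev, n);
        err = cudaGetLastError();                       // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // a failed assert is reported here
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}

int main() {
    int values[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    cudaError_t err = runDouble(values, 8);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < 8; ++i) printf("%d ", values[i]);
    printf("\n");
    return 0;
}
```

Defining NDEBUG before including the assert header compiles the checks out, as with host-side assert.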
13
Memory Debugging
$ cuda-memcheck [memcheck_options] app [app_options]
cuda-memcheck includes two separate utilities:
(1) The memcheck tool: checks for out-of-bounds and misaligned accesses in CUDA kernels.
(2) The racecheck tool: detects conflicting accesses to shared memory.
These tools are useful for debugging erratic kernel behavior caused by threads reading and writing unexpected locations.
14
memcheck
$ nvcc -lineinfo -Xcompiler -rdynamic -o debug-segfault debug-segfault.cu
$ cuda-memcheck ./debug-segfault
It checks for:
- Memory access errors
- Hardware exceptions
- malloc/free errors
- CUDA API errors
- cudaMalloc memory leaks
- Device heap memory leaks
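To make the out-of-bounds case concrete, here is a small sketch; it is not the debug-segfault.cu program named above, and fillGuarded and countMatches are hypothetical names. The grid is rounded up to whole blocks, so some threads have an index past the end of the array. The guard keeps them from writing; removing it would produce the kind of invalid global write that the memcheck tool is meant to report.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Writes `value` into out[0..n-1].
__global__ void fillGuarded(int *out, int n, int value) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Without this guard, threads with tid >= n would write past the end
    // of `out`: an out-of-bounds global memory access.
    if (tid < n) out[tid] = value;
}

// Hypothetical helper: fills a device array and returns how many elements
// hold `value` afterwards, or -1 on a CUDA error.
int countMatches(int n, int value) {
    int *dev = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // rounds up: may launch extra threads
    fillGuarded<<<blocks, threads>>>(dev, n, value);

    std::vector<int> host(n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)                     // cudaMemcpy waits for the kernel to finish
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);                              // freeing avoids a cudaMalloc leak
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int v : host) if (v == value) ++count;
    return count;
}

int main() {
    // 1000 elements with 256-thread blocks launches 1024 threads, 24 of them out of range.
    int matches = countMatches(1000, 7);
    printf("elements set: %d\n", matches);
    return matches == 1000 ? 0 : 1;
}
```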
15
racecheck
$ cuda-memcheck --tool racecheck --save racecheck.dump ./debug-hazards > log
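A shared-memory hazard of the kind racecheck looks for can be sketched as follows. This is an illustrative example, not the debug-hazards program named above; reverseBlock, reverseOnDevice, and the block size are assumptions. Each thread writes one element of a shared tile and then reads an element written by a different thread. The barrier between the two steps makes the code correct; removing it would leave a read-after-write hazard on the tile.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64

// Reverses BLOCK elements in place using a shared-memory tile.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    tile[t] = data[t];
    // Removing this barrier would let a thread read tile[BLOCK - 1 - t]
    // before the owning thread has written it.
    __syncthreads();
    data[t] = tile[BLOCK - 1 - t];
}

// Hypothetical helper: reverses a host array of exactly BLOCK ints on the
// device. Returns true on success.
bool reverseOnDevice(int *host) {
    int *dev = nullptr;
    size_t bytes = BLOCK * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaError_t err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverseBlock<<<1, BLOCK>>>(dev);        // one block, one thread per element
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}

int main() {
    int values[BLOCK];
    for (int i = 0; i < BLOCK; ++i) values[i] = i;
    if (!reverseOnDevice(values)) return 1;
    printf("first = %d, last = %d\n", values[0], values[BLOCK - 1]);
    return 0;
}
```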
16
racecheck
17
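CUDA Code Compilation
[Figure: nvcc compilation flow. The frontend splits sample.cu into host code and device code. The device compiler turns the device code into a fatbinary, and the host C/C++ compiler produces sample.o.]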
18
Compiling CUDA functions
CUDA provides two methods for compiling device code:
(1) Whole-program compilation
(2) Separate compilation
Separate compilation of device code was introduced in CUDA 5.0.
19
Separate Compilation
[Figure: a.cu and b.cu pass through the nvcc frontend to produce a.o and b.o, and c.cpp is compiled to c.o. The device linker combines the device code in a.o and b.o into dlink.o, and the host linker combines a.o, b.o, c.o, and dlink.o into the executable.]
$ nvcc -arch=sm_20 -dc a.cu b.cu
/* The -dc option instructs nvcc to compile each input file into an object file that contains relocatable device code. */
$ nvcc -arch=sm_20 -dlink a.o b.o -o link.o
$ g++ -c c.cpp -o c.o
$ g++ a.o b.o c.o link.o -o test -L<path> -lcudart
21
Profile Driven Optimization
Iterative approach:
1. Apply a profiler to the application to gather information
2. Identify application hotspots
3. Determine performance inhibitors
4. Optimize the code
5. Repeat the previous steps until the desired performance is achieved
Performance inhibitors for a kernel:
1. Memory bandwidth
2. Instruction throughput
3. Latency
22
Optimization using nvprof
Command: nvprof [nvprof-options] <application> [application-arguments]
nvprof modes:
1. Summary mode: the default mode
2. Trace mode: --print-gpu-trace, --print-api-trace
3. Event/metric summary mode: --events <event names>, --metrics <metric names>
4. Event/metric trace mode: --aggregate-mode off with --events or --metrics
23
nvprof
5. To query all built-in events and metrics: --query-events, --query-metrics
$ nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram
24
Global Memory Access Pattern
$ nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram
Global memory accesses should be aligned and coalesced for optimal execution.
- gld_efficiency: the ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: the same ratio for global memory stores.
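The difference between coalesced and uncoalesced access can be sketched with two copy kernels. This is an illustrative example under assumptions, not the lecture's sampleProgram: the kernel names, the helper copiesMatch, the array size, and the stride are hypothetical. Both kernels copy every element and produce the same output; they differ only in which element each thread touches, which is what the gld_efficiency and gst_efficiency metrics are meant to expose when the program is profiled.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Consecutive threads touch consecutive elements: coalesced loads and stores.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Consecutive threads touch elements `stride` apart: uncoalesced accesses.
// With n a power of two and stride odd, i -> (i * stride) % n is a permutation,
// so every element is still copied exactly once.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[j] = in[j];
    }
}

// Hypothetical helper: runs both kernels and checks that each reproduces the input.
// Assumes n is a power of two no larger than 2^24 and stride is odd.
bool copiesMatch(int n, int stride) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> src(n), a(n), b(n);
    for (int i = 0; i < n; ++i) src[i] = static_cast<float>(i);

    float *in = nullptr, *outA = nullptr, *outB = nullptr;
    bool ok = cudaMalloc(&in, bytes) == cudaSuccess &&
              cudaMalloc(&outA, bytes) == cudaSuccess &&
              cudaMalloc(&outB, bytes) == cudaSuccess &&
              cudaMemcpy(in, src.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        copyCoalesced<<<blocks, threads>>>(in, outA, n);
        copyStrided<<<blocks, threads>>>(in, outB, n, stride);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(a.data(), outA, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(b.data(), outB, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(in); cudaFree(outA); cudaFree(outB);   // cudaFree(nullptr) is a no-op
    return ok && a == src && b == src;
}

int main() {
    bool ok = copiesMatch(1 << 20, 33);
    printf("%s\n", ok ? "copies match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Running this program under the nvprof command shown above would report the two metrics per kernel, which makes the two access patterns directly comparable.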
25
CUDA C Development Process
APOD: Assessment, Parallelization, Optimization, Deployment
26
Performance Optimization
Paulius Micikevicius, "Performance Optimization", NVIDIA, sc11-perf-optimization.pdf
27
Future of GPUs (1)
John Ashley, "GPUs and the Future of Accelerated Computing", Emerging Technology Conference 2014, U. of Manchester, NVIDIA_ManchesterEMiT.pdf
28
Future of GPUs (2)
Timothy Lanfear, "GPU Computing: Past, Present, Future", NVIDIA, TimLanfear.pdf