Lecture 18: CUDA Program Implementation and Debugging
Kyu Ho Park, June 7, 2016
Ref: John Cheng, Max Grossman, Ty McKercher, Professional CUDA C Programming, WROX.
CUDA Debugging
CUDA debugging covers two areas:
(1) Kernel debugging: inspecting the flow and state of kernel execution on the fly. CUDA debugging tools let us examine the state of any variable, in any thread, at any code location on the GPU.
(2) Memory debugging: discovering odd program behavior such as invalid memory accesses and conflicting accesses to the same memory location.
Kernel Debugging
Three techniques for kernel debugging:
(1) cuda-gdb
(2) printf
(3) assert

(1) cuda-gdb
$nvcc -g -G foo.cu -o foo    (-g embeds host-side debug information, -G embeds device-side debug information)
$cuda-gdb foo
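As a concrete target for the commands above, here is a minimal foo.cu sketch (the kernel name sumArrays and the use of unified memory are illustrative assumptions, not from the lecture):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumArrays(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];                     /* a good line for a breakpoint */
}

int main(void)
{
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   /* unified memory, for brevity */
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    sumArrays<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}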
CUDA-gdb
Debugging commands: break, print, run, continue, next, step, quit
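A session on the foo binary built above might use these commands as follows (an illustrative sketch, not captured from a real run; sumArrays is the kernel from the earlier example):

$cuda-gdb foo
(cuda-gdb) break sumArrays      set a breakpoint at kernel entry
(cuda-gdb) run                  stops when sumArrays is launched
(cuda-gdb) next                 step over one source line
(cuda-gdb) print i              inspect a variable in the thread in focus
(cuda-gdb) continue             run to completion
(cuda-gdb) quit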
CUDA-gdb
A CUDA program may contain multiple host threads and many CUDA threads, but a cuda-gdb debugging session focuses on only a single thread at a time. We can use cuda-gdb to report information about the current focus, including the current device, block, and thread:
(cuda-gdb) cuda thread lane warp block sm grid device kernel
kernel 1026, grid 1027, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0
(cuda-gdb) cuda thread (128)
(cuda-gdb) help cuda
[Slides "Kernel debug 1" through "Kernel debug 4": screenshots of a cuda-gdb kernel debugging session; the images are not recoverable from this extraction.]
cuda printf
printf traditionally prints the state of the host, but starting with CUDA 4.0 NVIDIA added printf support on the device:

__global__ void kernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}
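Device printf output is buffered on the GPU and flushed at synchronization points, so the host must synchronize before the output appears. A minimal complete sketch around the kernel above (the launch configuration is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from CUDA thread %d\n", tid);
}

int main(void)
{
    kernel<<<2, 4>>>();            /* 2 blocks x 4 threads: 8 output lines */
    cudaDeviceSynchronize();       /* flushes the device-side printf buffer */
    return 0;
}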
CUDA assert
Device code can also call assert (supported on devices of compute capability 2.x and higher). When the asserted condition is false, the failing thread reports the file, line, and thread coordinates, the kernel stops, and the next CUDA API call returns cudaErrorAssert.
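A minimal sketch of device-side assert (the kernel name, data, and condition are illustrative assumptions):

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void checkKernel(const int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        assert(data[i] >= 0);      /* failing threads report file, line, and thread coordinates */
}

int main(void)
{
    const int n = 8;
    int h[8] = {0, 1, 2, -3, 4, 5, 6, 7};     /* the -3 trips the assertion */
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    checkKernel<<<1, n>>>(d, n);
    cudaError_t err = cudaDeviceSynchronize(); /* returns cudaErrorAssert here */
    printf("%s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}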
Memory Debugging
$cuda-memcheck [memcheck_options] app [app_options]
cuda-memcheck includes two separate utilities:
(1) The memcheck tool: checks for out-of-bounds and misaligned accesses in CUDA kernels.
(2) The racecheck tool: detects conflicting accesses to shared memory.
These tools are useful for debugging erratic kernel behavior caused by threads reading or writing unexpected locations.
memcheck
$nvcc -lineinfo -Xcompiler -rdynamic -o debug-segfault debug-segfault.cu
$cuda-memcheck ./debug-segfault
It checks for:
Memory access errors
Hardware exceptions
malloc/free errors
CUDA API errors
cudaMalloc memory leaks
Device heap memory leaks
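As a sketch of the kind of bug memcheck flags (a hypothetical debug-segfault.cu; kernel and variable names are illustrative), an unguarded index lets extra threads write past the allocation, which memcheck reports as invalid __global__ writes, with source lines when the program is built with -lineinfo:

#include <cuda_runtime.h>

__global__ void writeOutOfBounds(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;                           /* missing "if (i < n)" bounds check */
}

int main(void)
{
    const int n = 100;
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    writeOutOfBounds<<<1, 128>>>(d, n);    /* 128 threads, only 100 elements */
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}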
racecheck
$cuda-memcheck --tool racecheck --save racecheck.dump ./debug-hazards > log
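As a sketch of the kind of hazard racecheck detects (a hypothetical debug-hazards.cu; names are illustrative), a shared-memory read that races with writes from other threads because a barrier is missing:

#include <cuda_runtime.h>

__global__ void reverseShared(int *data)
{
    __shared__ int tile[64];
    int t = threadIdx.x;
    tile[t] = data[t];
    /* a __syncthreads() barrier belongs here; without it, the read below
       races with writes to tile[] made by other threads in the block */
    data[t] = tile[63 - t];
}

int main(void)
{
    int *d;
    cudaMalloc(&d, 64 * sizeof(int));
    reverseShared<<<1, 64>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}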
CUDA Code Compilation
The nvcc frontend splits sample.cu into host code and device code. The device compiler produces a fatbinary, which is embedded in the host code; the host C/C++ compiler then produces sample.o.
Compiling CUDA functions
CUDA provides two compilation methods:
(1) Whole-program compilation
(2) Separate compilation
Separate compilation for device code was introduced in CUDA 5.0.
Separate Compilation
a.cu and b.cu are each compiled by the frontend into object files containing relocatable device code (a.o, b.o); the device linker combines their device code into link.o, and the host linker combines a.o, b.o, c.o, and link.o into the executable.

$nvcc -arch=sm_20 -dc a.cu b.cu
/* the -dc option instructs nvcc to compile each input file into an object file containing relocatable device code */
$nvcc -arch=sm_20 -dlink a.o b.o -o link.o
$g++ -c c.cpp -o c.o
$g++ c.o link.o -o test -L<path> -lcudart
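What -dc enables: device code in one translation unit can call a __device__ function defined in another, with the call resolved at device-link time. A two-file sketch (file and function names are illustrative assumptions):

/* a.cu */
__device__ float scale(float x)
{
    return 2.0f * x;
}

/* b.cu */
extern __device__ float scale(float x);    /* defined in a.cu */

__global__ void applyScale(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = scale(v[i]);                /* cross-file device call, resolved by -dlink */
}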
Profile-Driven Optimization
Iterative approach:
1. Apply a profiler to the application to gather information
2. Identify application hotspots
3. Determine performance inhibitors
4. Optimize the code
5. Repeat the previous steps until the desired result is achieved
Performance inhibitors for a kernel:
1. Memory bandwidth
2. Instruction throughput
3. Latency
Optimization using nvprof
Command: nvprof [nvprof-options] <application> [application-arguments]
nvprof modes:
1. Summary mode: the default mode
2. Trace mode: nvprof-options = {--print-gpu-trace, --print-api-trace}
3. Event/metric summary mode: nvprof-options = {--events <event names>, --metrics <metric names>}
4. Event/metric trace mode: --aggregate-mode off [events|metrics]
nvprof
5. To query all built-in events and metrics: options = {--query-events, --query-metrics}
$nvprof --devices 0 --metrics gld_efficiency --metrics gst_efficiency ./sampleProgram
Global Memory Access Pattern
Accesses to global memory should be aligned and coalesced for optimal execution.
- gld_efficiency: the ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: the same ratio for global memory stores.
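For example, strided access lowers these metrics because each warp loads full memory segments but uses only part of them (a sketch; the kernel names and stride parameter are illustrative, and with stride 2 gld_efficiency drops to roughly 50%):

__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          /* consecutive threads touch consecutive addresses */
}

__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];          /* with stride 2, half of every loaded segment is unused */
}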
CUDA C Development Process
APOD: Assessment, Parallelization, Optimization, Deployment
Performance Optimization
Paulius Micikevicius, "Performance Optimization", NVIDIA, 2011. (sc11-perf-optimization.pdf)
Future of GPUs (1)
John Ashley, "GPUs and the Future of Accelerated Computing", Emerging Technology Conference 2014, U. of Manchester. (NVIDIA_ManchesterEMiT.pdf)
Future of GPUs (2)
Timothy Lanfear, "GPU Computing: Past, Present, Future", NVIDIA. (TimLanfear.pdf)