Published by Giles Skinner. Modified over 9 years ago.
1
Martin Kruliš
4 11. 2014 by Martin Kruliš (v1.0)
2
GPU
◦ “Independent” device
◦ Controlled by the host
◦ Used for “offloading”
Host Code
◦ Needs to be designed so that
    The GPU(s) are utilized efficiently
    The CPU is utilized while the GPU is working
    The CPU and GPU do not wait for each other
3
Bad Example
    cudaMemcpy(..., HostToDevice);
    Kernel1<<<...>>>(...);
    cudaDeviceSynchronize();
    cudaMemcpy(..., DeviceToHost);
    ...
    cudaMemcpy(..., HostToDevice);
    Kernel2<<<...>>>(...);
    cudaDeviceSynchronize();
    cudaMemcpy(..., DeviceToHost);
    ...
[Figure: CPU and GPU timelines, annotated "Device is working"]
4
Overlapping CPU and GPU work
◦ Kernels
    Started asynchronously
    Can be waited for (cudaDeviceSynchronize())
    A little more can be done with streams
◦ Memory transfers
    cudaMemcpy() is synchronous and blocking
    Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
    Can be synchronized the same way as a kernel
5
Using Asynchronous Transfers
    cudaMemcpyAsync(HostToDevice);
    Kernel1<<<...>>>(...);
    cudaMemcpyAsync(DeviceToHost);
    ...
    do_something_on_cpu();
    ...
    cudaDeviceSynchronize();
[Figure: CPU and GPU timelines]
Workload balance becomes an issue
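The pattern on this slide can be sketched as a complete function. This is a minimal illustration, not the author's code: the kernel `scale`, the helper `do_something_on_cpu`, and the function name `overlap_cpu_and_gpu` are hypothetical, and the host buffer is assumed to be pinned so that the asynchronous copies can actually return early.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical CPU-side work performed while the GPU is busy.
static double do_something_on_cpu(int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}

// Returns true if the GPU produced the expected result.
bool overlap_cpu_and_gpu(int n) {
    size_t bytes = n * sizeof(float);
    float *host = nullptr, *dev = nullptr;
    // Pinned host memory (assumption): lets cudaMemcpyAsync() return before the copy ends.
    if (cudaMallocHost(&host, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { cudaFreeHost(host); return false; }
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // All three commands are queued in the default stream and run in order.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost);

    double cpu = do_something_on_cpu(n);        // overlaps with the GPU work
    cudaError_t err = cudaDeviceSynchronize();  // wait for the GPU before reading results

    bool ok = (err == cudaSuccess) && (cpu >= 0.0);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 2.0f);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

If the CPU part finishes long before or long after the GPU part, one side still idles, which is the workload balance issue the slide mentions.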
6
CPU Threads
◦ Multiple CPU threads may use the GPU
GPU Overlapping Capabilities
◦ Multiple kernels may run simultaneously
    Since the Fermi architecture
    cudaDeviceProp.concurrentKernels
◦ Kernel execution may overlap with data transfers
    Or even multiple data transfers
    cudaDeviceProp.asyncEngineCount
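The two capability fields named above can be read at run time. A small sketch follows; the function name `print_overlap_capabilities` is an assumption made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints the overlap-related properties of one device.
// Returns false if the device cannot be queried.
bool print_overlap_capabilities(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    // concurrentKernels: nonzero if several kernels may run at the same time.
    // asyncEngineCount: number of copy engines that can work alongside kernels
    // (2 typically allows simultaneous transfers in both directions).
    printf("%s: concurrentKernels=%d, asyncEngineCount=%d\n",
           prop.name, prop.concurrentKernels, prop.asyncEngineCount);
    return true;
}
```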
7
Stream
◦ In-order GPU command queue (like in OpenCL)
    Asynchronous GPU operations are registered in the queue
        Kernel execution
        Memory data transfers
    Commands in different streams may overlap
    Provides means for explicit and implicit synchronization
◦ Default stream (stream 0)
    Always present, does not have to be created
    Global synchronization capabilities
8
Stream Creation
    cudaStream_t stream;
    cudaStreamCreate(&stream);
Stream Usage
    cudaMemcpyAsync(dst, src, size, kind, stream);
    kernel<<<..., stream>>>(...);
Stream Destruction
    cudaStreamDestroy(stream);
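Put together, the three steps might look like the sketch below. The kernel `add_one` and the function `run_in_stream` are hypothetical names. The stream is the fourth launch parameter, after the grid size, block size, and dynamic shared memory size.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Creates a stream, runs copy-kernel-copy in it, and destroys it.
bool run_in_stream(int n) {
    size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    float *host = nullptr, *dev = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    if (cudaMallocHost(&host, bytes) != cudaSuccess) { cudaStreamDestroy(stream); return false; }
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaFreeHost(host); cudaStreamDestroy(stream); return false;
    }
    for (int i = 0; i < n; ++i) host[i] = 0.0f;

    // Commands in one stream execute in the order they were issued.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 1.0f);

    cudaFree(dev);
    cudaFreeHost(host);
    cudaStreamDestroy(stream);
    return ok;
}
```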
9
Synchronization
◦ Explicit
    cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed
    cudaStreamQuery(stream) – a non-blocking test whether the stream has finished
◦ Implicit
    Operations in different streams cannot overlap if a special operation is issued between them
        Memory allocation
        A CUDA command to the default stream
        Switch between L1/shared memory configurations
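The non-blocking test can be used to keep the CPU busy until the stream drains. In this sketch the kernel `spin` and the function `poll_until_done` are hypothetical; `cudaStreamQuery()` returns `cudaErrorNotReady` while work is still pending and `cudaSuccess` once the stream is finished.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly the given number of clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns how many times the stream was polled, or -1 on error.
long poll_until_done() {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
    spin<<<1, 1, 0, stream>>>(1000000);

    long polls = 0;
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        ++polls;  // a real program would do useful CPU work here
    }
    cudaStreamDestroy(stream);
    return (status == cudaSuccess) ? polls : -1;
}
```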
10
Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prevents it
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(…HostToDevice, stream[i]);
        MyKernel<<<..., stream[i]>>>(...);
        cudaMemcpyAsync(…DeviceToHost, stream[i]);
    }
May have many implicit synchronizations, depending on the compute capability (CC) and hardware overlapping capabilities.
11
Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prevents it
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(…HostToDevice, stream[i]);
    for (int i = 0; i < 2; ++i)
        MyKernel<<<..., stream[i]>>>(...);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(…DeviceToHost, stream[i]);
Far fewer opportunities for implicit synchronization
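A runnable version of the second issue order (all uploads, then all kernels, then all downloads) might look as follows. The kernel `square`, the function `two_stream_breadth_first`, and the choice to give each stream one half of a single buffer are assumptions made for this sketch.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: squares every element.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

// Each of the two streams processes `half` elements of one shared buffer.
bool two_stream_breadth_first(int half) {
    const int S = 2;
    size_t bytes = half * sizeof(float);
    float *host = nullptr, *dev = nullptr;
    cudaStream_t stream[S];
    if (cudaMallocHost(&host, S * bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev, S * bytes) != cudaSuccess) { cudaFreeHost(host); return false; }
    for (int i = 0; i < S * half; ++i) host[i] = 3.0f;
    for (int i = 0; i < S; ++i) cudaStreamCreate(&stream[i]);

    for (int i = 0; i < S; ++i)   // all uploads first
        cudaMemcpyAsync(dev + i * half, host + i * half, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < S; ++i)   // then all kernels
        square<<<(half + 255) / 256, 256, 0, stream[i]>>>(dev + i * half, half);
    for (int i = 0; i < S; ++i)   // then all downloads
        cudaMemcpyAsync(host + i * half, dev + i * half, bytes,
                        cudaMemcpyDeviceToHost, stream[i]);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < S * half; ++i) ok = (host[i] == 9.0f);

    for (int i = 0; i < S; ++i) cudaStreamDestroy(stream[i]);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

Whether the two streams actually overlap depends on the device; the result is the same either way.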
12
Callbacks
◦ Callbacks are registered in streams by
    cudaStreamAddCallback(stream, fnc, data, 0);
◦ The callback function is invoked asynchronously after all preceding commands in the stream terminate
◦ A callback registered to the default stream is invoked after previous commands in all streams terminate
◦ Operations issued after the registration start after the callback returns
◦ The callback looks like
    void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t errorStatus, void *userData) { ... }
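A minimal sketch of registering a callback is shown below. The kernel `noop`, the callback body, and the function `callback_demo` are hypothetical; the callback only sets a flag, since callbacks must not call CUDA API functions themselves.

```cuda
#include <cuda_runtime.h>
#include <atomic>

// Hypothetical kernel with no work, used only to have a preceding command.
__global__ void noop() { }

// Invoked by the runtime once all earlier commands in the stream are done.
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t errorStatus, void *userData) {
    (void)stream;
    std::atomic<int> *flag = static_cast<std::atomic<int> *>(userData);
    flag->store(errorStatus == cudaSuccess ? 1 : -1);
}

// Returns true if the callback ran and reported success.
bool callback_demo() {
    std::atomic<int> flag(0);
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    noop<<<1, 1, 0, stream>>>();
    cudaError_t err = cudaStreamAddCallback(stream, MyCallback, &flag, 0);  // last argument must be 0

    // The callback is a command in the stream, so synchronizing waits for it as well.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return err == cudaSuccess && flag.load() == 1;
}
```

Newer CUDA releases also provide `cudaLaunchHostFunc()`, which the runtime documentation suggests when the callback does not need to run after a device error.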
13
Events
◦ Special markers that can be used for synchronization and performance monitoring
◦ The typical usage is
    Waiting until all commands before the marker finish
    Explicit synchronization between selected streams
    Measuring time between two events
◦ Example
    cudaEvent_t event;
    cudaEventCreate(&event);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
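The three typical usages can be combined in one sketch. The kernels `fill` and `add_one`, the stream names, and the function `event_demo` are assumptions made for illustration; `cudaStreamWaitEvent()` and `cudaEventElapsedTime()` are the runtime calls for inter-stream synchronization and timing.

```cuda
#include <cuda_runtime.h>

__global__ void fill(float *d, int n, float v) {   // hypothetical producer kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = v;
}
__global__ void add_one(float *d, int n) {         // hypothetical consumer kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Returns the producer's run time in milliseconds, or -1 on failure.
float event_demo(int n) {
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaStream_t producer, consumer;
    cudaEvent_t start, filled;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);
    cudaEventCreate(&start);
    cudaEventCreate(&filled);
    int blocks = (n + 255) / 256;

    cudaEventRecord(start, producer);
    fill<<<blocks, 256, 0, producer>>>(dev, n, 1.0f);
    cudaEventRecord(filled, producer);

    // The consumer stream does not start its kernel until `filled` has occurred.
    cudaStreamWaitEvent(consumer, filled, 0);
    add_one<<<blocks, 256, 0, consumer>>>(dev, n);

    cudaEventSynchronize(filled);                   // host waits for the marker
    float ms = -1.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, filled);

    float first = 0.0f;
    cudaMemcpyAsync(&first, dev, sizeof(float), cudaMemcpyDeviceToHost, consumer);
    if (cudaStreamSynchronize(consumer) != cudaSuccess) err = cudaErrorUnknown;

    cudaEventDestroy(start);
    cudaEventDestroy(filled);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(dev);
    return (err == cudaSuccess && first == 2.0f) ? ms : -1.0f;
}
```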
14
Making Good Use of Overlapping
◦ Split the work into smaller fragments
◦ Create a pipeline effect (load, process, store)
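One way to obtain the pipeline effect is to cut the data into chunks and hand them to a few streams in round-robin order, so that one chunk can be loaded while another is processed and a third is stored. The chunk count, stream count, kernel `halve`, and function `chunked_pipeline` below are illustrative assumptions, not values from the slides.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: halves every element.
__global__ void halve(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

// Processes n elements in `chunks` fragments spread over 4 streams.
bool chunked_pipeline(int n, int chunks) {
    const int S = 4;                               // assumed number of streams
    if (n <= 0 || chunks <= 0) return false;
    float *host = nullptr, *dev = nullptr;
    cudaStream_t stream[S];
    if (cudaMallocHost(&host, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(host); return false; }
    for (int i = 0; i < n; ++i) host[i] = 4.0f;
    for (int s = 0; s < S; ++s) cudaStreamCreate(&stream[s]);

    int chunkSize = (n + chunks - 1) / chunks;
    for (int c = 0; c < chunks; ++c) {
        int offset = c * chunkSize;
        if (offset >= n) break;
        int len = (n - offset < chunkSize) ? (n - offset) : chunkSize;
        cudaStream_t st = stream[c % S];
        // load, process, store: in order within one stream, overlappable across streams
        cudaMemcpyAsync(dev + offset, host + offset, len * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        halve<<<(len + 255) / 256, 256, 0, st>>>(dev + offset, len);
        cudaMemcpyAsync(host + offset, dev + offset, len * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 2.0f);
    for (int s = 0; s < S; ++s) cudaStreamDestroy(stream[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```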
15
Data Gather and Scatter Problem
[Figure: Input Data in Host Memory → Gather → GPU Memory → Kernel Execution → Scatter → Results in Host Memory]
Multiple cudaMemcpy() calls may be quite inefficient
16
Gather and Scatter
◦ Reducing overhead
◦ Performed by the CPU before/after cudaMemcpy
[Figure: Main Thread feeding Stream 0, Stream 1, …; each stream is labeled with the stages Gather, HtD copy, Kernel, DtH copy, Scatter]
The number of threads per GPU and the number of streams per thread depend on the workload structure
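The idea of gathering on the CPU so that a single transfer replaces many small ones can be sketched as follows. The scenario (input scattered over several `std::vector` pieces), the kernel `negate`, and the function `gather_process_scatter` are assumptions made for this example.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical kernel: negates every element.
__global__ void negate(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = -d[i];
}

// Gathers all pieces into one staging buffer, processes them with one
// upload and one download, and scatters the results back in place.
bool gather_process_scatter(std::vector<std::vector<float>> &pieces) {
    size_t total = 0;
    for (const auto &p : pieces) total += p.size();
    if (total == 0) return true;
    size_t bytes = total * sizeof(float);
    float *staging = nullptr, *dev = nullptr;
    if (cudaMallocHost(&staging, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { cudaFreeHost(staging); return false; }

    size_t off = 0;                                 // gather (CPU)
    for (const auto &p : pieces) {
        if (!p.empty()) std::memcpy(staging + off, p.data(), p.size() * sizeof(float));
        off += p.size();
    }

    cudaMemcpy(dev, staging, bytes, cudaMemcpyHostToDevice);   // one HtD copy
    negate<<<(unsigned)((total + 255) / 256), 256>>>(dev, (int)total);
    cudaError_t err = cudaMemcpy(staging, dev, bytes, cudaMemcpyDeviceToHost);  // one DtH copy

    off = 0;                                        // scatter (CPU)
    for (auto &p : pieces) {
        if (err == cudaSuccess && !p.empty())
            std::memcpy(p.data(), staging + off, p.size() * sizeof(float));
        off += p.size();
    }
    cudaFree(dev);
    cudaFreeHost(staging);
    return err == cudaSuccess;
}
```

In a multi-stream setting, the gather and scatter steps of one fragment could run on the CPU while another fragment's copies and kernel run on the GPU.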
17
Page-locked (Pinned) Host Memory
◦ Host memory that is prevented from swapping
◦ Created/released by
    cudaHostAlloc(), cudaFreeHost()
    cudaHostRegister(), cudaHostUnregister()
◦ Optionally with flags
    cudaHostAllocWriteCombined (optimized for writing, not cached on CPU)
    cudaHostAllocMapped
    cudaHostAllocPortable
◦ Copies between pinned host memory and the device can be performed asynchronously (cudaMemcpyAsync() needs pinned host memory to overlap with other work)
◦ Pinned memory is a scarce resource
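Both ways of obtaining pinned memory are shown in the sketch below: allocating it directly, and pinning a buffer that already exists. The function name `pinned_memory_demo` and the use of a `std::vector` as the existing buffer are assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Returns true if both pinning methods succeeded.
bool pinned_memory_demo(size_t n) {
    size_t bytes = n * sizeof(float);

    // 1) Allocate pinned memory directly (default flags).
    float *pinned = nullptr;
    if (cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault) != cudaSuccess)
        return false;
    pinned[0] = 1.0f;   // used like ordinary host memory

    // 2) Pin an existing allocation, then release the pinning again.
    std::vector<float> existing(n, 0.0f);
    bool ok = (cudaHostRegister(existing.data(), bytes, cudaHostRegisterDefault) == cudaSuccess);
    if (ok) ok = (cudaHostUnregister(existing.data()) == cudaSuccess);

    // Pinned memory is scarce, so free it as soon as it is no longer needed.
    cudaFreeHost(pinned);
    return ok;
}
```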
18
Device Memory Mapping
◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
    For both reading and writing
◦ The memory must be allocated/registered with the flag cudaHostAllocMapped
◦ The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
◦ The function cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer
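The steps listed above might be combined as in the following sketch. The kernel `increment` and the function `zero_copy_demo` are hypothetical, and device 0 is assumed. The flag is set before any other runtime call because older CUDA versions reject it once the context is active.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element of (mapped) memory.
__global__ void increment(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

// Returns true if the kernel modified host memory through the mapping.
bool zero_copy_demo(int n) {
    // May fail if a context already exists; clear the error and carry on,
    // since systems with unified addressing map pinned memory anyway.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) cudaGetLastError();

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return false;

    int *host = nullptr, *devView = nullptr;
    if (cudaHostAlloc((void **)&host, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) host[i] = i;

    // Translate the host pointer into a pointer usable inside kernels.
    bool ok = (cudaHostGetDevicePointer((void **)&devView, host, 0) == cudaSuccess);
    if (ok) {
        increment<<<(n + 255) / 256, 256>>>(devView, n);   // no cudaMemcpy involved
        ok = (cudaDeviceSynchronize() == cudaSuccess);     // make results visible to the host
    }
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == i + 1);
    cudaFreeHost(host);
    return ok;
}
```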
19
Asynchronous Errors
◦ An error may occur outside of a CUDA call
    In the case of asynchronous memory transfers or kernel execution
◦ The error is reported by a subsequent CUDA call
◦ To make sure all errors were reported, the device must synchronize (cudaDeviceSynchronize())
◦ Error handling functions
    cudaGetLastError()
    cudaPeekAtLastError()
    cudaGetErrorString(error)
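A common checking pattern is to query the last error right after a launch and to check the result of the next synchronization. The sketch below provokes a harmless launch error with an oversized block; the kernel `noop` and the function `error_reporting_demo` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() { }   // hypothetical empty kernel

// Returns true if the bad launch was reported and a later good launch succeeded.
bool error_reporting_demo() {
    // Far more threads per block than any device supports: the launch fails.
    noop<<<1, 1 << 20>>>();
    cudaError_t peeked  = cudaPeekAtLastError();   // reads the error, keeps it set
    cudaError_t fetched = cudaGetLastError();      // reads the error and resets it
    cudaError_t cleared = cudaGetLastError();      // now reports cudaSuccess
    printf("launch error: %s\n", cudaGetErrorString(fetched));

    // A valid launch: errors during execution would surface at synchronization.
    noop<<<1, 1>>>();
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr   = cudaDeviceSynchronize();

    return peeked != cudaSuccess && fetched == peeked && cleared == cudaSuccess &&
           launchErr == cudaSuccess && syncErr == cudaSuccess;
}
```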