Multi-GPU Programming
Martin Kruliš (v1.1), 05.01.2017
Multi-GPU Systems
- Connecting multiple GPUs to a host
  - Workload division and management
  - Sharing PCI-Express/host memory throughput
- Host architecture example (NUMA):
  [Diagram: two NUMA nodes, each with its own memory and CPU/chipset, interconnected by QPI; GPUs attached to the nodes via PCIe]
Multi-GPU Systems
- Detection and selection (a short example follows)
  - cudaGetDeviceCount(), cudaSetDevice()
  - Each device may be queried individually for its properties
    - cudaGetDeviceProperties(), cudaDeviceGetAttribute()
  - A stream may be created for each device
    - The streams then determine which device executes the work
  - Automatic selection of the optimal device
    - cudaChooseDevice(&device, props)
  - Selecting devices by their physical layout
    - cudaDeviceGetByPCIBusId(&device, pciId)
    - cudaDeviceGetPCIBusId()

Note: The devices visible to the application can be restricted by the CUDA_VISIBLE_DEVICES environment variable, which holds a list of integers specifying the visible devices. The application always enumerates the devices as 0…N-1, where N is the number of visible devices.
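A minimal enumeration-and-selection sketch using the runtime calls above (error handling omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    // Query each device for its basic properties.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, dev);
        printf("#%d: %s (CC %d.%d)\n", dev, props.name, props.major, props.minor);
    }

    // Make device 0 current; subsequent allocations, kernel launches,
    // and stream creations are bound to this device.
    cudaSetDevice(0);

    // A stream created now belongs to device 0; work issued into it
    // later implicitly targets that device.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaStreamDestroy(stream);
    return 0;
}
```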
Workload Division
- Task management
  - Similar on GPU and CPU; e.g., each task must have a sufficient size
  - Static task scheduling
    - Works only in special cases (e.g., all tasks have the same size and all GPUs are identical)
  - Dynamic task scheduling
    - Oversubscription: many more tasks than devices
    - Tasks are dispatched to devices as they become available
    - More complex on GPUs, since the copy-work-copy pipeline must be maintained (see the sketch below)
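A minimal dynamic-scheduling sketch with oversubscription, assuming a hypothetical processTask() kernel: one host thread per device pulls task indices from a shared atomic counter, so faster devices naturally take more tasks.

```cuda
#include <cuda_runtime.h>
#include <atomic>
#include <thread>
#include <vector>

__global__ void processTask(int taskId) { /* hypothetical task kernel */ }

std::atomic<int> nextTask{0};
const int taskCount = 1024;            // oversubscription: tasks >> devices

void deviceWorker(int dev) {
    cudaSetDevice(dev);                // bind this host thread to its device
    int task;
    while ((task = nextTask.fetch_add(1)) < taskCount) {
        // In a full pipeline, the host->device copy, the kernel, and the
        // device->host copy would be issued asynchronously into streams
        // and overlapped; here the task is run synchronously for clarity.
        processTask<<<1, 256>>>(task);
        cudaDeviceSynchronize();
    }
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < count; ++dev)
        workers.emplace_back(deviceWorker, dev);
    for (auto &w : workers) w.join();
    return 0;
}
```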
Peer-to-Peer Transfers
- Copying memory between devices
  - Special functions that copy memory directly between two devices
    - cudaMemcpyPeer(dst, dstDev, src, srcDev, size)
    - cudaMemcpyPeerAsync(…, stream)
  - The synchronous version is asynchronous with respect to the host, but ordered with other asynchronous operations
    - It works as a barrier on both devices
- Portable memory allocation
  - Page-locked host memory used by multiple GPUs
  - The cudaHostAllocPortable flag must be used
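A short sketch of a peer copy and a portable allocation, assuming two devices and an illustrative buffer size:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t size = 1 << 20;
    void *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, size);
    cudaSetDevice(1);
    cudaMalloc(&dst, size);

    // Direct device-to-device copy; asynchronous towards the host,
    // but ordered after all preceding work on both devices.
    cudaMemcpyPeer(dst, 1, src, 0, size);

    // Page-locked host memory usable from all devices, not just the
    // one that was current at allocation time.
    void *host;
    cudaHostAlloc(&host, size, cudaHostAllocPortable);

    cudaFreeHost(host);
    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```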
Peer-to-Peer Memory Access
- Direct inter-GPU data exchange
  - Possibly without staging the data in host memory
  - Since CC 2.0 (Tesla devices), 64-bit processes only
    - cudaDeviceCanAccessPeer()
    - cudaDeviceEnablePeerAccess()
- Unified virtual address space
  - Host and device buffers share a single virtual address space
    - The unifiedAddressing device property must be 1
    - cudaPointerGetAttributes()
  - Devices can directly use cudaHostAlloc() pointers
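A sketch of enabling peer access and dereferencing a remote pointer from a kernel; the addOne kernel and the buffer size are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 access device 1?
    if (!canAccess) return 1;

    cudaSetDevice(1);
    float *remote;
    const size_t n = 1 << 20;
    cudaMalloc(&remote, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);           // second argument (flags) must be 0

    // Thanks to unified virtual addressing, the pointer itself identifies
    // the owning device; the kernel on device 0 accesses device 1's
    // memory directly over the interconnect.
    addOne<<<(n + 255) / 256, 256>>>(remote, n);
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    cudaFree(remote);
    return 0;
}
```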
IPC
- Inter-process communication
  - CUDA resources are restricted to the process that created them
    - Device buffer pointers, events, …
  - Sharing across processes may nevertheless be necessary (e.g., when integrating multiple CUDA applications)
- The IPC API allows sharing these resources
  - cudaIpcGetMemHandle(), cudaIpcGetEventHandle()
    - Return a cudaIpcMemHandle_t or cudaIpcEventHandle_t handle
  - The handle can be transferred via ordinary IPC mechanisms
  - cudaIpcOpenMemHandle(), cudaIpcOpenEventHandle()
    - Open a handle passed on from another process
  - cudaIpcCloseMemHandle()
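A sketch of the exporting side of the IPC API; how the handle bytes travel between processes (pipe, socket, shared file) is left to the application:

```cuda
#include <cuda_runtime.h>

int main() {
    float *devBuf;
    cudaMalloc(&devBuf, 1024 * sizeof(float));

    // Export the device buffer as a handle another process can open.
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, devBuf);
    // ... send the raw bytes of `handle` to the other process ...

    // In the receiving process:
    //   float *theirPtr;
    //   cudaIpcOpenMemHandle((void **)&theirPtr, handle,
    //                        cudaIpcMemLazyEnablePeerAccess);
    //   ... use theirPtr ...
    //   cudaIpcCloseMemHandle(theirPtr);

    cudaFree(devBuf);
    return 0;
}
```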
Discussion