1
CUDA Odds and Ends
Joseph Kider
University of Pennsylvania
CIS 565 - Fall 2011
2
Sources
Patrick Cozzi, Spring 2011
NVIDIA CUDA Programming Guide
CUDA by Example
Programming Massively Parallel Processors
3
Agenda
Atomic Functions
Page-Locked Host Memory
Streams
Graphics Interoperability
4
Atomic Functions

__device__ unsigned int count = 0;
// ...
++count;

What is the value of count if 8 threads execute ++count?
5
Atomic Functions
Read-modify-write atomic operation
Guaranteed no interference from other threads
No guarantee on order
Shared or global memory
Requires compute capability 1.1 or higher (newer than G80)
See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements
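To see the race concretely, here is a minimal sketch (the kernel name racyIncrement and the host snippet are ours, not from the slides): two threads can both read the same old value of count, both write old + 1, and one increment is lost.

__device__ unsigned int count = 0;

__global__ void racyIncrement()
{
    // Non-atomic read-modify-write: each thread loads count,
    // adds 1, and stores the result back. Increments can be lost.
    ++count;
}

// Host side (sketch):
// racyIncrement<<<1, 8>>>();
// unsigned int result;
// cudaMemcpyFromSymbol(&result, count, sizeof(result));
// result may be anywhere from 1 to 8.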
6
Atomic Functions

What is the value of count if 8 threads execute the atomicAdd below?

__device__ unsigned int count = 0;
// ...
// atomic ++count
atomicAdd(&count, 1);

(Note: atomicInc(&count, 1) would not work here. CUDA's atomicInc computes ((old >= val) ? 0 : old + 1), wrapping to zero, so atomicAdd is the right call for a plain increment.)
7
Atomic Functions

How do you implement atomicAdd?

__device__ int atomicAdd(int *address, int val);
8
Atomic Functions

How do you implement atomicAdd?

__device__ int atomicAdd(int *address, int val)
{
    // Made-up keyword:
    __lock(address)
    {
        *address += val;
    }
}
9
Atomic Functions
How do you implement atomicAdd without locking?
10
Atomic Functions

How do you implement atomicAdd without locking?
What if you were given an atomic compare-and-swap?

int atomicCAS(int *address, int compare, int val);
11
Atomic Functions

atomicCAS pseudo-implementation:

int atomicCAS(int *address, int compare, int val)
{
    // Made-up keyword:
    __lock(address)
    {
        int old = *address;
        *address = (old == compare) ? val : old;
        return old;
    }
}
14
Atomic Functions

Example:

*addr = 1;

atomicCAS(addr, 1, 2);
atomicCAS(addr, 1, 3);
atomicCAS(addr, 2, 3);
15
Atomic Functions

Example:

*addr = 1;

atomicCAS(addr, 1, 2); // returns 1, *addr = 2
atomicCAS(addr, 1, 3);
atomicCAS(addr, 2, 3);
16
Atomic Functions

Example:

*addr = 1;

atomicCAS(addr, 1, 2); // returns 1, *addr = 2
atomicCAS(addr, 1, 3); // returns 2, *addr = 2
atomicCAS(addr, 2, 3);
17
Atomic Functions

Example:

*addr = 1;

atomicCAS(addr, 1, 2); // returns 1, *addr = 2
atomicCAS(addr, 1, 3); // returns 2, *addr = 2
atomicCAS(addr, 2, 3); // returns 2, *addr = 3
18
Atomic Functions

Again, how do you implement atomicAdd given atomicCAS?

__device__ int atomicAdd(int *address, int val);
19
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address, assumed, val + assumed);
    } while (assumed != old);
    return old;
}
20
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address, assumed, val + assumed);
    } while (assumed != old);
    return old;
}

Read the original value at *address.
21
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address, assumed, val + assumed);
    } while (assumed != old);
    return old;
}

If the value at *address didn't change, atomically add val to it.
22
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address, assumed, assumed + val);
    } while (assumed != old);
    return old;
}

Otherwise, loop until atomicCAS succeeds. The value of *address after this function returns is not necessarily the original value of *address plus val. Why?
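The same CAS loop works for types that lack a hardware atomic add. A sketch, assuming the CUDA built-ins __float_as_int() / __int_as_float() (the name atomicAddFloat is ours): reinterpret the float's bits as an int so atomicCAS can compare them.

__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do
    {
        assumed = old;
        // Compare-and-swap on the raw bits; redo the add if another
        // thread changed *address in the meantime.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}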
23
Atomic Functions

Lots of atomics:

// Arithmetic        // Bitwise
atomicAdd()          atomicAnd()
atomicSub()          atomicOr()
atomicExch()         atomicXor()
atomicMin()
atomicMax()
atomicInc()
atomicDec()
atomicCAS()

See B.10 in the NVIDIA CUDA C Programming Guide
24
Atomic Functions
How can threads from different blocks work together?
Use atomics sparingly. Why?
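One common way to use atomics sparingly is to combine results within a block first and issue a single atomic per block. A sketch (the kernel name sumAll and the fixed block size of 256 are our assumptions): a shared-memory tree reduction, then one atomicAdd by thread 0.

__global__ void sumAll(const int *in, int n, int *total)
{
    __shared__ int partial[256]; // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction within the block: no atomics needed here.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic per block instead of one per thread.
    if (tid == 0)
        atomicAdd(total, partial[0]);
}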
25
Page-Locked Host Memory
Page-locked memory: host memory that is essentially removed from virtual memory paging
Also called pinned memory
26
Page-Locked Host Memory
Benefit: overlap kernel execution and data transfers
See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements
[Timeline: normally, the data transfer and the kernel execution run one after the other; with page-locked memory, the data transfer overlaps kernel execution.]
27
Page-Locked Host Memory
Benefit: increased memory bandwidth for systems with a front-side bus
Up to ~2x throughput
Image from http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars
28
Page-Locked Host Memory
Benefit: write-combining memory
By default, page-locked memory is cacheable. Allocate it with the cudaHostAllocWriteCombined flag to:
Avoid polluting the L1 and L2 caches
Avoid snooping transfers across PCIe
Improve transfer performance by up to 40%, in theory
Reading from write-combining memory is slow! Only write to it from the host.
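A sketch of allocating and using a write-combined buffer (bytes, n, and devPtr are assumed to be set up elsewhere): fill it from the host, transfer it, and never read it back on the host.

float *wcBuffer;
cudaHostAlloc((void **)&wcBuffer, bytes, cudaHostAllocWriteCombined);

for (size_t i = 0; i < n; ++i)
    wcBuffer[i] = 0.0f; // host writes are fine; host reads would be slow

cudaMemcpy(devPtr, wcBuffer, bytes, cudaMemcpyHostToDevice);
cudaFreeHost(wcBuffer);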
29
Page-Locked Host Memory
Benefit: page-locked host memory can be mapped into the address space of the device on some systems
What systems allow this? What does this eliminate? What applications does this enable?
Call cudaGetDeviceProperties() and check the canMapHostMemory property
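A sketch of mapped (zero-copy) memory, assuming a device that reports canMapHostMemory (hostPtr, devPtr, and bytes are ours): the device flag must be set before any CUDA context is created.

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory)
{
    cudaSetDeviceFlags(cudaDeviceMapHost); // before the context is created

    float *hostPtr, *devPtr;
    cudaHostAlloc((void **)&hostPtr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    // Kernels can now access the host buffer through devPtr,
    // eliminating explicit cudaMemcpy calls.
    // kernel<<<grid, block>>>(devPtr);
}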
30
Page-Locked Host Memory
Usage:
cudaHostAlloc() / cudaMallocHost()
cudaFreeHost()
cudaMemcpyAsync()
See 3.2.5 in the NVIDIA CUDA C Programming Guide
31
Page-Locked Host Memory DEMO CUDA SDK Example: bandwidthTest
32
Page-Locked Host Memory
What's the catch?
Page-locked memory is scarce: allocations will start failing well before allocations in pageable memory do
It reduces the amount of physical memory available to the OS for paging: allocating too much will hurt overall system performance
33
Streams
Stream: a sequence of commands that execute in order
Streams may execute their commands out of order or concurrently with respect to other streams
[Diagram: Stream A and Stream B, each containing Command 0, Command 1, Command 2.]
34-38
Streams
[Quiz slides: each shows a candidate timeline interleaving the commands of Stream A and Stream B and asks "Is this a possible order?" Any interleaving of the two streams is possible, but commands within a single stream must execute in order, so the timeline that runs Command 2 before Command 1 inside a stream is not a possible order.]
39
Streams
In CUDA, what commands go in a stream?
Kernel launches
Host ↔ device memory transfers
40
Streams
Code example:
1. Create two streams
2. In each stream:
   a. Copy page-locked memory to the device
   b. Launch the kernel
   c. Copy memory back to the host
3. Destroy the streams
41
Stream Example (Step 1 of 3)

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
{
    cudaStreamCreate(&stream[i]);
}

float *hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
42
Stream Example (Step 1 of 3)

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
{
    cudaStreamCreate(&stream[i]);
}

float *hostPtr;
cudaMallocHost(&hostPtr, 2 * size);

Create two streams
43
Stream Example (Step 1 of 3)

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
{
    cudaStreamCreate(&stream[i]);
}

float *hostPtr;
cudaMallocHost(&hostPtr, 2 * size);

Allocate two buffers in page-locked memory
44
Stream Example (Step 2 of 3)

for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
    kernel<<</* ... */, stream[i]>>>(/* ... */);
    cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
}
45
Stream Example (Step 2 of 3)

for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
    kernel<<</* ... */, stream[i]>>>(/* ... */);
    cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
}

Commands are assigned to, and executed by, streams
46
Stream Example (Step 3 of 3)

for (int i = 0; i < 2; ++i)
{
    // Blocks until commands complete
    cudaStreamDestroy(stream[i]);
}
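Putting the three steps together, a minimal sketch (the kernel body, n, and the launch configuration are made up for illustration; devPtr is assumed to hold 2 * n floats of device memory):

__global__ void kernel(float *data, int n) { /* ... */ }

void runTwoStreams(float *devPtr, int n)
{
    const int size = n * sizeof(float);
    cudaStream_t stream[2];
    float *hostPtr;

    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);
    cudaMallocHost(&hostPtr, 2 * size);

    for (int i = 0; i < 2; ++i)
    {
        cudaMemcpyAsync(devPtr + i * n, hostPtr + i * n, size,
                        cudaMemcpyHostToDevice, stream[i]);
        kernel<<<(n + 255) / 256, 256, 0, stream[i]>>>(devPtr + i * n, n);
        cudaMemcpyAsync(hostPtr + i * n, devPtr + i * n, size,
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(stream[i]); // blocks until the stream's commands complete

    cudaFreeHost(hostPtr);
}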
47
Streams
Assume a device whose compute capability supports:
Overlap of data transfer and kernel execution
Concurrent kernel execution
Concurrent data transfer
How can the streams overlap?
See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities
48
Streams
[Timeline: Stream A and Stream B each run a host-to-device copy, a kernel execution, and a device-to-host copy; one stream's transfers overlap the other stream's kernel execution.]
Can we have more overlap than this?
49
Streams
[Timeline: a more aggressive schedule in which the two streams' transfers and kernel executions all overlap.]
Can we have this?
50
Streams
Implicit synchronization
An operation that requires a dependency check to see if a kernel finished executing blocks all kernel launches from any stream until the checked kernel is finished
See 3.2.6.5.3 in the NVIDIA CUDA C Programming Guide for all limitations
51
Streams
Can we have this?
[Timeline: Stream B's device-to-host copy is dependent on its kernel's completion, and Stream B's kernel launch is blocked until the kernel from Stream A completes.]
52
Streams
Performance advice:
Issue all independent commands before dependent ones
Delay synchronization (implicit or explicit) as long as possible
53
Streams

for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(/* ... */, stream[i]);
    kernel<<</* ... */, stream[i]>>>();
    cudaMemcpyAsync(/* ... */, stream[i]);
}

Rewrite this to allow concurrent kernel execution
54
Streams

for (int i = 0; i < 2; ++i) // to device
    cudaMemcpyAsync(/* ... */, stream[i]);

for (int i = 0; i < 2; ++i)
    kernel<<</* ... */, stream[i]>>>();

for (int i = 0; i < 2; ++i) // to host
    cudaMemcpyAsync(/* ... */, stream[i]);
55
Streams
Explicit synchronization
cudaThreadSynchronize(): blocks until commands in all streams finish
cudaStreamSynchronize(): blocks until commands in a given stream finish
See 3.2.6.5 in the NVIDIA CUDA C Programming Guide for more synchronization functions
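A short sketch of the difference (stream, hostPtr, and consumeResults are ours, continuing the earlier example): synchronize one stream to consume its results while the other keeps working, and synchronize everything only at the end.

// Wait only for stream[0]; stream[1] can still be executing.
cudaStreamSynchronize(stream[0]);
consumeResults(hostPtr); // hypothetical host-side consumer

// Wait for commands in all streams before shutting down.
cudaThreadSynchronize();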
56
Timing with Stream Events
Events can be added to a stream to monitor the device's progress
An event is completed when all commands in the stream preceding it complete
57
Timing with Stream Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
    // ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

// cudaEventDestroy(...)
58
Timing with Stream Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
    // ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

// cudaEventDestroy(...)

Create two events; each will record a timestamp
59
Timing with Stream Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
    // ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

// cudaEventDestroy(...)

Record events before and after each stream is assigned its work
60
Timing with Stream Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
    // ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

// cudaEventDestroy(...)

Delay additional commands until after the stop event completes
61
Timing with Stream Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
    // ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

// cudaEventDestroy(...)

Compute the elapsed time (in milliseconds)
62
Graphics Interoperability
What applications use both CUDA and OpenGL/Direct3D?
CUDA → GL
GL → CUDA
If CUDA and GL cannot share resources, what is the performance implication?
63
Graphics Interoperability
Graphics interop: map GL resources into the CUDA address space
Buffers
Textures
Renderbuffers
64
Graphics Interoperability
OpenGL buffer interop:
1. Assign the device with GL interop: cudaGLSetGLDevice()
2. Register the GL resource with CUDA: cudaGraphicsGLRegisterBuffer()
3. Map it: cudaGraphicsMapResources()
4. Get the mapped pointer: cudaGraphicsResourceGetMappedPointer()
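A sketch of the four steps for a GL buffer object (bufferObj is assumed to be a GLuint buffer created elsewhere with OpenGL calls; the unmap at the end is our addition):

cudaGLSetGLDevice(0); // 1. assign device with GL interop

cudaGraphicsResource *resource;
cudaGraphicsGLRegisterBuffer(&resource, bufferObj, // 2. register
                             cudaGraphicsMapFlagsNone);

cudaGraphicsMapResources(1, &resource, 0); // 3. map

float *devPtr;
size_t numBytes;
cudaGraphicsResourceGetMappedPointer((void **)&devPtr, // 4. get pointer
                                     &numBytes, resource);

// ... launch CUDA kernels that read/write devPtr ...

cudaGraphicsUnmapResources(1, &resource, 0); // unmap before GL uses the buffer again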