CUDA Odds and Ends Joseph Kider University of Pennsylvania CIS 565, Fall 2011

Sources: Patrick Cozzi, Spring 2011; NVIDIA CUDA Programming Guide; CUDA by Example; Programming Massively Parallel Processors

Agenda Atomic Functions Page-Locked Host Memory Streams Graphics Interoperability

Atomic Functions __device__ unsigned int count = 0; //... ++count; What is the value of count if 8 threads execute ++count?

Atomic Functions Read-modify-write atomic operation  Guaranteed no interference from other threads  No guarantee on order Shared or global memory Requires compute capability 1.1 (> G80) See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

Atomic Functions __device__ unsigned int count = 0; //... // atomic ++count atomicInc(&count, 1); What is the value of count if 8 threads execute this atomicInc?
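A minimal sketch of the difference (the kernel name incrementKernel is made up for illustration): with a plain ++count the read-modify-write races and the final value is unpredictable, while the atomic version always ends at 8. atomicAdd(&count, 1) is used in the sketch because atomicInc(addr, val) wraps back to 0 once the old value reaches val.

__device__ unsigned int count = 0;

__global__ void incrementKernel(bool useAtomic)
{
    if (useAtomic)
        atomicAdd(&count, 1u);  // serialized read-modify-write
    else
        ++count;                // racy: threads can overwrite each other's update
}

// Host side (sketch):
//   incrementKernel<<<1, 8>>>(false);  // count may end up anywhere from 1 to 8
//   incrementKernel<<<1, 8>>>(true);   // count grows by exactly 8
//   unsigned int h;
//   cudaMemcpyFromSymbol(&h, count, sizeof(h));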

Atomic Functions How do you implement atomicAdd ? __device__ int atomicAdd( int *address, int val);

Atomic Functions How do you implement atomicAdd ? __device__ int atomicAdd( int *address, int val) { // Made up keyword: __lock (address) { *address += val; } }

Atomic Functions How do you implement atomicAdd without locking?

Atomic Functions How do you implement atomicAdd without locking? What if you were given an atomic compare and swap? int atomicCAS(int *address, int compare, int val);

Atomic Functions atomicCAS pseudo implementation int atomicCAS(int *address, int compare, int val) { // Made up keyword __lock(address) { int old = *address; *address = (old == compare) ? val : old; return old; } }


Atomic Functions Example: *addr = 1; atomicCAS(addr, 1, 2); atomicCAS(addr, 1, 3); atomicCAS(addr, 2, 3);

Atomic Functions Example: *addr = 1; atomicCAS(addr, 1, 2); atomicCAS(addr, 1, 3); atomicCAS(addr, 2, 3); // returns 1 // *addr = 2

Atomic Functions Example: *addr = 1; atomicCAS(addr, 1, 2); atomicCAS(addr, 1, 3); atomicCAS(addr, 2, 3); // returns 2 // *addr = 2

Atomic Functions Example: *addr = 1; atomicCAS(addr, 1, 2); atomicCAS(addr, 1, 3); atomicCAS(addr, 2, 3); // returns 2 // *addr = 3

Atomic Functions Again, how do you implement atomicAdd given atomicCAS ? __device__ int atomicAdd( int *address, int val);

Atomic Functions __device__ int atomicAdd(int *address, int val) { int old = *address, assumed; do { assumed = old; old = atomicCAS(address, assumed, val + assumed); } while (assumed != old); return old; }

Atomic Functions __device__ int atomicAdd(int *address, int val) { int old = *address, assumed; do { assumed = old; old = atomicCAS(address, assumed, val + assumed); } while (assumed != old); return old; } Read original value at *address.

Atomic Functions __device__ int atomicAdd(int *address, int val) { int old = *address, assumed; do { assumed = old; old = atomicCAS(address, assumed, val + assumed); } while (assumed != old); return old; } If the value at *address didn’t change, increment it.

Atomic Functions __device__ int atomicAdd(int *address, int val) { int old = *address, assumed; do { assumed = old; old = atomicCAS(address, assumed, assumed + val); } while (assumed != old); return old; } Otherwise, loop until atomicCAS succeeds. The value of *address after this function returns is not necessarily the original value of *address + val, why?
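The same compare-and-swap loop generalizes to types with no native atomic. As a sketch, here is the double-precision atomicAdd from the CUDA C Programming Guide, needed on devices without a hardware double atomicAdd; the name atomicAddDouble is ours so it does not clash with the built-in on newer hardware.

__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits so atomicCAS (which works on integers) can be used
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread changed *address meanwhile
    return __longlong_as_double(old);
}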

Atomic Functions Lots of atomics: Arithmetic: atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS(). Bitwise: atomicAnd(), atomicOr(), atomicXor(). See B.10 in the NVIDIA CUDA C Programming Guide
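A common use is a histogram, where many threads may increment the same bin; a minimal sketch (kernel and parameter names are illustrative):

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // safe even when threads hit the same bin
}

A faster variant accumulates per-block counts in __shared__ memory and merges them into the global bins with one atomicAdd per bin per block, which is one way of using atomics sparingly.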

Atomic Functions How can threads from different blocks work together? Use atomics sparingly. Why?

Page-Locked Host Memory Page-locked Memory  Host memory that is essentially removed from virtual memory  Also called Pinned Memory

Page-Locked Host Memory Benefits  Overlap kernel execution and data transfers See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements. Timeline figure: normally data transfers and kernel executions run back-to-back; with page-locked memory a transfer can overlap kernel execution.

Page-Locked Host Memory Benefits  Increased memory bandwidth for systems with a front-side bus  Up to ~2x throughput

Page-Locked Host Memory Benefits  Write-Combined Memory  Page-locked memory is cacheable by default  Allocate with cudaHostAllocWriteCombined to  Avoid polluting L1 and L2 caches  Avoid snooping transfers across PCIe  Improve transfer performance up to 40% - in theory Reading from write-combined memory is slow!  Only write to it from the host
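A hedged sketch of staging an upload through a write-combined buffer; the helper name uploadThroughWriteCombined is hypothetical, and the point is that the host only ever writes into the buffer:

#include <cuda_runtime.h>
#include <cstring>

// Hypothetical helper: stage 'bytes' of host data through a write-combined,
// page-locked buffer and copy it to devPtr on the given stream.
void uploadThroughWriteCombined(void *devPtr, const void *src, size_t bytes,
                                cudaStream_t stream)
{
    void *wcBuf = NULL;
    cudaHostAlloc(&wcBuf, bytes, cudaHostAllocWriteCombined); // page-locked + write-combined
    memcpy(wcBuf, src, bytes);         // host writes only: fine for WC memory
    cudaMemcpyAsync(devPtr, wcBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);     // wait before freeing the staging buffer
    cudaFreeHost(wcBuf);
}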

Page-Locked Host Memory Benefits  Page-locked host memory can be mapped into the address space of the device on some systems What systems allow this? What does this eliminate? What applications does this enable?  Call cudaGetDeviceProperties() and check canMapHostMemory
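A minimal sketch of checking for and using mapped ("zero-copy") memory, assuming device 0 and that bytes, grid, and block are defined elsewhere; the kernel then accesses host memory directly over PCIe, eliminating the explicit copy:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory) {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the CUDA context is created
    float *hostPtr = NULL, *devAlias = NULL;
    cudaHostAlloc((void **)&hostPtr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devAlias, hostPtr, 0);
    // kernel<<<grid, block>>>(devAlias, ...);  // kernel reads/writes host memory in place
}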

Page-Locked Host Memory Usage: cudaHostAlloc() / cudaMallocHost() cudaFreeHost() cudaMemcpyAsync() See the NVIDIA CUDA C Programming Guide
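A minimal allocation/free sketch of those calls (bytes is assumed to be defined); note that both allocators are paired with cudaFreeHost():

float *a = NULL, *b = NULL;
cudaMallocHost((void **)&a, bytes);                       // page-locked, default flags
cudaHostAlloc((void **)&b, bytes, cudaHostAllocDefault);  // same memory, explicit flags
// ... use either buffer with cudaMemcpyAsync() so copies can overlap kernels ...
cudaFreeHost(a);
cudaFreeHost(b);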

Page-Locked Host Memory DEMO CUDA SDK Example: bandwidthTest

Page-Locked Host Memory What’s the catch?  Page-locked memory is scarce  Allocations will start failing well before allocations in pageable memory do  Reduces the amount of physical memory available to the OS for paging  Allocating too much will hurt overall system performance

Streams Stream: a sequence of commands that execute in order. Streams may execute their commands out-of-order or concurrently with respect to other streams. Figure: Stream A and Stream B each contain Command 0, Command 1, Command 2.

Streams Quiz: given Stream A and Stream B above, which orderings of their commands over time are possible? Any ordering that preserves each stream's internal order (Command 0, then 1, then 2) is possible, including interleaved or concurrent execution of the two streams; an ordering that runs a stream's Command 2 before its own Command 1 is not.

Streams In CUDA, what commands go in a stream?  Kernel launches  Host ↔ device memory transfers

Streams Code Example 1. Create two streams 2. Each stream: 1. Copy page-locked memory to device 2. Launch kernel 3. Copy memory back to host 3. Destroy streams

Stream Example (Step 1 of 3) cudaStream_t stream[2]; for (int i = 0; i < 2; ++i) { cudaStreamCreate(&stream[i]); } float *hostPtr; cudaMallocHost(&hostPtr, 2 * size);

Stream Example (Step 1 of 3) cudaStream_t stream[2]; for (int i = 0; i < 2; ++i) { cudaStreamCreate(&stream[i]); } float *hostPtr; cudaMallocHost(&hostPtr, 2 * size); Create two streams

Stream Example (Step 1 of 3) cudaStream_t stream[2]; for (int i = 0; i < 2; ++i) { cudaStreamCreate(&stream[i]); } float *hostPtr; cudaMallocHost(&hostPtr, 2 * size); Allocate two buffers in page-locked memory

Stream Example (Step 2 of 3) for (int i = 0; i < 2; ++i) { cudaMemcpyAsync(/*... */, cudaMemcpyHostToDevice, stream[i]); kernel<<</*... */, stream[i]>>>(/*... */); cudaMemcpyAsync(/*... */, cudaMemcpyDeviceToHost, stream[i]); }

Stream Example (Step 2 of 3) for (int i = 0; i < 2; ++i) { cudaMemcpyAsync(/*... */, cudaMemcpyHostToDevice, stream[i]); kernel<<</*... */, stream[i]>>>(/*... */); cudaMemcpyAsync(/*... */, cudaMemcpyDeviceToHost, stream[i]); } Commands are assigned to, and executed by, streams

Stream Example (Step 3 of 3) for (int i = 0; i < 2; ++i) { // Blocks until commands complete cudaStreamDestroy(stream[i]); }

Streams Assume the device's compute capability supports:  Overlap of data transfer and kernel execution  Concurrent kernel execution  Concurrent data transfer How can the streams overlap? See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities
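A hedged sketch of querying whether a given device actually supports these overlaps (device 0 assumed):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// prop.deviceOverlap     : copies can overlap kernel execution
// prop.concurrentKernels : kernels from different streams can run concurrently
// prop.asyncEngineCount  : 2 means host-to-device and device-to-host copies
//                          can also overlap each other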

Streams Timeline figure: Stream A and Stream B each perform a host → device copy, kernel execution, and a device → host copy, with only limited overlap between the two streams. Can we have more overlap than this?

Streams Timeline figure: the two streams' memory copies and kernel executions overlap even further. Can we have this?

Streams Implicit Synchronization  An operation that requires a dependency check to see if a kernel finished executing:  Blocks all kernel launches from any stream until the checked kernel is finished  See the NVIDIA CUDA C Programming Guide for all limitations

Streams Timeline figure: Can we have this? The device → host copy is dependent on kernel completion, so Stream B is blocked until the kernel from Stream A completes.

Streams Performance Advice  Issue all independent commands before dependent ones  Delay synchronization (implicit or explicit) as long as possible

Streams for (int i = 0; i < 2; ++i) { cudaMemcpyAsync(/*... */, stream[i]); kernel<<</*... */, stream[i]>>>(); cudaMemcpyAsync(/*... */, stream[i]); } Rewrite this to allow concurrent kernel execution

Streams for (int i = 0; i < 2; ++i) // to device cudaMemcpyAsync(/*... */, stream[i]); for (int i = 0; i < 2; ++i) kernel<<</*... */, stream[i]>>>(); for (int i = 0; i < 2; ++i) // to host cudaMemcpyAsync(/*... */, stream[i]);

Streams Explicit Synchronization  cudaThreadSynchronize() Blocks until commands in all streams finish  cudaStreamSynchronize() Blocks until commands in a stream finish  See the NVIDIA CUDA C Programming Guide for more synchronization functions
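A minimal sketch contrasting the two calls, reusing the stream[] array from the earlier example:

cudaStreamSynchronize(stream[0]);  // host waits only for stream 0's commands
cudaThreadSynchronize();           // host waits for every stream on the device
                                   // (later toolkits rename this cudaDeviceSynchronize())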

Timing with Stream Events Events can be added to a stream to monitor the device’s progress An event is completed when all commands in the stream preceding it complete.

Timing with Stream Events cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); for (int i = 0; i < 2; ++i) //... cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float elapsedTime; cudaEventElapsedTime(&elapsedTime, start, stop); // cudaEventDestroy(...)

Timing with Stream Events cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); for (int i = 0; i < 2; ++i) //... cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float elapsedTime; cudaEventElapsedTime(&elapsedTime, start, stop); // cudaEventDestroy(...) Create two events. Each will record the time

Timing with Stream Events cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); for (int i = 0; i < 2; ++i) //... cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float elapsedTime; cudaEventElapsedTime(&elapsedTime, start, stop); // cudaEventDestroy(...) Record events before and after each stream is assigned its work

Timing with Stream Events cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); for (int i = 0; i < 2; ++i) //... cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float elapsedTime; cudaEventElapsedTime(&elapsedTime, start, stop); // cudaEventDestroy(...) Delay additional commands until after the stop event completes

Timing with Stream Events cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); for (int i = 0; i < 2; ++i) //... cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float elapsedTime; cudaEventElapsedTime(&elapsedTime, start, stop); // cudaEventDestroy(...) Compute elapsed time

Graphics Interoperability What applications use both CUDA and OpenGL/Direct3D?  CUDA → GL  GL → CUDA If CUDA and GL cannot share resources, what is the performance implication?

Graphics Interoperability Graphics Interop: Map GL resource into CUDA address space  Buffers  Textures  Renderbuffers

Graphics Interoperability OpenGL Buffer Interop 1. Assign device with GL interop cudaGLSetGLDevice() 2. Register GL resource with CUDA cudaGraphicsGLRegisterBuffer() 3. Map it cudaGraphicsMapResources() 4. Get mapped pointer cudaGraphicsResourceGetMappedPointer()
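A hedged sketch of those four steps for an existing GL buffer object vbo (error checking omitted; vbo, grid, and block are assumed to exist, and cudaGLSetGLDevice() reflects the Fall 2011 toolkit and is deprecated in later releases):

#include <cuda_gl_interop.h>

cudaGLSetGLDevice(0);                                   // 1. assign device with GL interop

cudaGraphicsResource *resource = NULL;
cudaGraphicsGLRegisterBuffer(&resource, vbo,            // 2. register the GL buffer
                             cudaGraphicsRegisterFlagsNone);

cudaGraphicsMapResources(1, &resource, 0);              // 3. map it into CUDA
float *devPtr = NULL;
size_t numBytes = 0;
cudaGraphicsResourceGetMappedPointer((void **)&devPtr,  // 4. get the mapped device pointer
                                     &numBytes, resource);

// kernel<<<grid, block>>>(devPtr, ...);                // e.g. write vertices from CUDA
cudaGraphicsUnmapResources(1, &resource, 0);            // unmap before GL uses the buffer again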