1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 4, 2013, Streams.pptx

Page-Locked Memory and CUDA Streams

These notes introduce the use of multiple CUDA streams to overlap memory transfers with kernel computations. Page-locked memory is introduced first, since streams require it. These materials come from Chapter 10 of "CUDA by Example" by Jason Sanders and Edward Kandrot.

2 Page-locked host memory (also called "pinned" host memory)

Page-locked memory is not paged in and out of main memory by the OS; it remains resident. This allows:

Concurrent host/device memory transfers with kernel operations (compute capability 2.x)
Host memory to be mapped into the device address space (compute capability > 1.0)
Higher memory bandwidth, since real addresses are used rather than virtual addresses and no intermediate copy buffering is needed

3 Questions

What is paging? What are real and virtual addresses?

4 Paging and virtual memory recap

Page – a block of memory used with virtual memory. A process (application) is stored as one or more pages that may be distributed across main memory; pages are transferred to and from the hard drive (disk) to make space.
Real address – the actual physical address of the location.
Virtual address – the address allocated to a process by the paging/virtual memory mechanism, allowing its pages to reside anywhere.
Real-virtual address translation is done by a lookup table, partly in hardware (a translation lookaside buffer, TLB, for recently used pages) and partly in software.

[Figure: pages of one process spread between main memory and disk, e.g. the page at real address 0 holding virtual address 45, and the page at real address 2 holding virtual address 46.]

More information in undergraduate Computer Architecture and Operating Systems courses.

5 Note on using page-locked memory

Using page-locked memory reduces the memory available to the OS for paging, so be careful not to allocate too much of it.

6 Allocating page-locked memory

cudaMallocHost( void **ptr, size_t size )
Allocates page-locked host memory that is accessible to the device.

cudaHostAlloc( void **ptr, size_t size, unsigned int flags )
Allocates page-locked host memory that is accessible to the device; its flags argument offers more options.

Notes: "The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc()." (NVIDIA CUDA Runtime API documentation)

7 Freeing page-locked memory

cudaFreeHost( void *ptr )
"Frees the memory space pointed to by ptr, which must have been returned by a previous call to cudaMallocHost() or cudaHostAlloc()."
Parameters: ptr – pointer to the memory to free. (NVIDIA CUDA Runtime API documentation)
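
To make the allocate/free pairing concrete, here is a minimal sketch combining the two calls above (the buffer size and error check are illustrative additions, not from the slides):

// Minimal sketch: allocate page-locked host memory, use it, free it.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int *h_data;
    size_t size = 1024 * sizeof(int);
    // With cudaHostAllocDefault this is equivalent to cudaMallocHost()
    cudaError_t err = cudaHostAlloc((void**)&h_data, size, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    h_data[0] = 42;          // usable like ordinary host memory
    cudaFreeHost(h_data);    // must be freed with cudaFreeHost(), not free()
    return 0;
}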

8 Test of pinned memory

//Pinned memory test written by Barry Wilkinson, UNC-Charlotte. Feb 10,
#include <stdio.h>
#include <stdlib.h>

#define SIZE (10*1024*1024)              // number of bytes in arrays, 10 MBytes

int main(int argc, char *argv[]) {
   int i;                                // loop counter
   int *a;                               // host (CPU memory) pointer
   int *dev_a;                           // device (GPU memory) pointer

   cudaEvent_t start, stop;              // using cuda events to measure time
   cudaEventCreate(&start);              // create events
   cudaEventCreate(&stop);
   float elapsed_time_ms1, elapsed_time_ms3;

   /* ENTER INPUT PARAMETERS AND DATA */
   cudaMalloc((void**)&dev_a, SIZE);     // allocate memory on device

   /* COPY USING PINNED MEMORY */
   cudaHostAlloc((void**)&a, SIZE, cudaHostAllocDefault);  // allocate page-locked memory on host
   cudaEventRecord(start, 0);
   for(i = 0; i < 100; i++) {            // make transfer 100 times
      cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);  // copy to device: no address translation needed (no paging)
      cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);  // copy back to host
   }
   cudaEventRecord(stop, 0);             // instrument code to measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms1, start, stop);
   printf("Time to copy %d bytes of data 100 times on GPU, pinned memory: %f ms\n", SIZE, elapsed_time_ms1);  // exec. time

Note: should have used cudaFreeHost() here! Pointer a is re-used on the next slide.

9
   /* COPY USING REGULAR MEMORY */
   a = (int*) malloc(SIZE);              // allocate regular memory on host
   cudaEventRecord(start, 0);
   for(i = 0; i < 100; i++) {
      cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);  // copy to device
      cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);  // copy back to host
   }
   cudaEventRecord(stop, 0);             // instrument code to measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms3, start, stop);
   printf("Time to copy %d bytes of data 100 times on GPU: %f ms\n", SIZE, elapsed_time_ms3);  // exec. time

   /* SPEEDUP */
   printf("Speedup of using pinned memory = %f\n", (float) elapsed_time_ms3 / (float) elapsed_time_ms1);

   /* clean up */
   free(a);
   cudaFree(dev_a);
   cudaEventDestroy(start);
   cudaEventDestroy(stop);
   return 0;
}
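
As the note on slide 8 points out, the pinned allocation is never released before the pointer is reused. A minimal sketch of the corrected sequence (the fix is mine, not on the slide; names come from the example above):

   /* corrected clean-up: release the pinned allocation before reusing a */
   cudaFreeHost(a);           // memory from cudaHostAlloc() needs cudaFreeHost()
   a = (int*) malloc(SIZE);   // now safe to reuse the pointer for regular memory
   /* ... regular-memory timing as above ... */
   free(a);                   // regular memory is freed with free()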

10 My code

11 Using NVIDIA bandwidthTest

coit-grid06: ./bandwidthTest Starting... Running on... Device 0: Tesla C2050, Quick Mode. Reports Host to Device Bandwidth and Device to Host Bandwidth (1 device, paged memory) and Device to Device Bandwidth (1 device), each as Transfer Size (Bytes) against Bandwidth (MB/s). [bandwidthTest] - Test results: PASSED.

coit-grid07: same test and report format, also on a Tesla C2050. [bandwidthTest] - Test results: PASSED.

[The numeric transfer sizes and bandwidth figures did not survive transcription.]

12 CUDA Streams

A CUDA stream is a sequence of operations (commands) that are executed in order. Multiple CUDA streams can be created and executed together, interleaved with one another, although "program order" is always maintained within each stream. Streams provide a mechanism to overlap the memory transfers and computations of different streams for increased performance, if sufficient resources are available.

13 Creating a stream

Done by creating a stream object and associating it with a series of CUDA commands, which then become the stream. CUDA commands take a stream argument:

cudaStream_t stream1;
cudaStreamCreate(&stream1);

cudaMemcpyAsync(..., stream1);               // these three commands
MyKernel<<<grid, block, 0, stream1>>>(...);  // form stream
cudaMemcpyAsync(..., stream1);               // stream1

The regular cudaMemcpy cannot be used with streams; asynchronous commands are needed for concurrent operation (see next slide).

14 cudaMemcpyAsync(..., stream)

An asynchronous version of cudaMemcpy that copies data to/from the host and the device. It may return before the copy is complete. A stream argument is specified, and it needs "page-locked" memory.
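
A minimal self-contained sketch of the call pattern (the buffer size, names, and final synchronization are illustrative assumptions, not from the slide):

#include <cuda_runtime.h>

#define N (1024*1024)

int main(void) {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    int *h_buf, *d_buf;
    cudaHostAlloc((void**)&h_buf, N*sizeof(int), cudaHostAllocDefault);  // page-locked, as required
    cudaMalloc((void**)&d_buf, N*sizeof(int));

    // Returns immediately; the copy is queued in stream1.
    cudaMemcpyAsync(d_buf, h_buf, N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    // ... the host is free to do other work while the copy is in flight ...
    cudaStreamSynchronize(stream1);   // wait until all work queued in stream1 is done

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream1);
    return 0;
}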

15 Code example: one stream (from CUDA by Example, without error detection macros)

#define SIZE (N*20)
…
int main(void) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );

    cudaHostAlloc((void**)&a, SIZE*sizeof(int), cudaHostAllocDefault);  // page-locked
    cudaHostAlloc((void**)&b, SIZE*sizeof(int), cudaHostAllocDefault);
    cudaHostAlloc((void**)&c, SIZE*sizeof(int), cudaHostAllocDefault);

    for(int i = 0; i < SIZE; i++) {      // load data
        a[i] = rand();
        b[i] = rand();
    }

    for(int i = 0; i < SIZE; i += N) {   // loop over data in chunks
        cudaMemcpyAsync(dev_a, a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dev_b, b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream);
        kernel<<<N/256, 256, 0, stream>>>(dev_a, dev_b, dev_c);
        cudaMemcpyAsync(c+i, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);       // wait for stream to finish
    return 0;
}

16 Multiple streams

Assuming the device can support it (this can be checked in code if needed; see the sketch below), create two streams with:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

and then duplicate the stream code for each stream.
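
A minimal sketch of that support check, using the deviceOverlap field of cudaDeviceProp (querying device 0 is an assumption):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed)
    if (!prop.deviceOverlap) {
        // Device cannot overlap memory copies with kernel execution,
        // so multiple streams will give no speedup.
        printf("Device will not handle overlaps, so no speedup from streams\n");
        return 0;
    }
    printf("Device supports overlap of transfers and kernels\n");
    return 0;
}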

17 First attempt described in the book: concatenate the statements of each stream

int *dev_a1, *dev_b1, *dev_c1;   // stream 1 mem ptrs
int *dev_a2, *dev_b2, *dev_c2;   // stream 2 mem ptrs

//stream 1
cudaMalloc( (void**)&dev_a1, N * sizeof(int) );
cudaMalloc( (void**)&dev_b1, N * sizeof(int) );
cudaMalloc( (void**)&dev_c1, N * sizeof(int) );
//stream 2
cudaMalloc( (void**)&dev_a2, N * sizeof(int) );
cudaMalloc( (void**)&dev_b2, N * sizeof(int) );
cudaMalloc( (void**)&dev_c2, N * sizeof(int) );
…
for(int i = 0; i < SIZE; i += N*2) {  // loop over data in chunks
    // stream 1
    cudaMemcpyAsync(dev_a1, a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_b1, b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<N/256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
    cudaMemcpyAsync(c+i, dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1);
    //stream 2
    cudaMemcpyAsync(dev_a2, a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
    cudaMemcpyAsync(dev_b2, b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
    kernel<<<N/256, 256, 0, stream2>>>(dev_a2, dev_b2, dev_c2);
    cudaMemcpyAsync(c+i+N, dev_c2, N*sizeof(int), cudaMemcpyDeviceToHost, stream2);
}
cudaStreamSynchronize(stream1);       // wait for stream 1 to finish
cudaStreamSynchronize(stream2);       // wait for stream 2 to finish

18 Simply concatenating the statements of each stream does not work well because of the way the GPU schedules work. [Figure: page 206, CUDA by Example]

19 [Figure: page 207, CUDA by Example]

20 [Figure: page 208, CUDA by Example]

21 Second attempt described in the book: interleave the statements of each stream

for(int i = 0; i < SIZE; i += N*2) {  // loop over data in chunks
    // interleave stream 1 and stream 2
    cudaMemcpyAsync(dev_a1, a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_a2, a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
    cudaMemcpyAsync(dev_b1, b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_b2, b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
    kernel<<<N/256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
    kernel<<<N/256, 256, 0, stream2>>>(dev_a2, dev_b2, dev_c2);
    cudaMemcpyAsync(c+i, dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(c+i+N, dev_c2, N*sizeof(int), cudaMemcpyDeviceToHost, stream2);
}
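
For completeness, a minimal sketch of the clean-up that would follow the loop (assumed from the patterns earlier in these notes; the slide itself stops at the loop):

cudaStreamSynchronize(stream1);       // drain both streams
cudaStreamSynchronize(stream2);

cudaFreeHost(a);  cudaFreeHost(b);  cudaFreeHost(c);      // page-locked host buffers
cudaFree(dev_a1); cudaFree(dev_b1); cudaFree(dev_c1);     // stream 1 device buffers
cudaFree(dev_a2); cudaFree(dev_b2); cudaFree(dev_c2);     // stream 2 device buffers
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);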

22 [Figure: page 210, CUDA by Example]

Questions