1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 4, 2013 Zero-Copy Host Memory These notes will introduce “zero-copy” memory. “Zero-copy” memory requires page-locked (pinned) host memory. These materials come from Chapter 11 of “CUDA by Example” by Jason Sanders and Edward Kandrot.

2 Zero-copy memory Zero-copy refers to the GPU accessing host memory without explicitly copying the data from host memory to GPU memory, i.e. zero copying. Depending upon the hardware, the data may get copied anyway! Integrated GPUs that are part of the system chipset and share system memory do not copy (example: MacBook Pro). Discrete GPU cards with their own device memory do.

3 CUDA routines for zero-copy memory Use page-locked memory. Allocate with: cudaHostAlloc (void **ptr, size_t size, unsigned int flags) Allocates page-locked host memory that is accessible to the device. Set flags to: cudaHostAllocMapped -- map the allocation into the CUDA (device) address space. Reference: NVIDIA CUDA library
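A minimal sketch of this allocation step is shown below (the buffer size is illustrative). Note that before mapped memory can be used, mapping must be enabled with cudaSetDeviceFlags(cudaDeviceMapHost), and this call has to be made before the CUDA context is created by other runtime calls:

   #include <stdio.h>
   #include <cuda_runtime.h>

   int main(void) {
      float *h_ptr;                           // host pointer to pinned, mapped memory
      size_t nbytes = 1024 * sizeof(float);   // illustrative size

      // Enable mapping of host memory into the device address space.
      cudaSetDeviceFlags(cudaDeviceMapHost);

      // Allocate page-locked host memory that is mapped for device access.
      cudaError_t err = cudaHostAlloc((void **)&h_ptr, nbytes, cudaHostAllocMapped);
      if (err != cudaSuccess) {
         fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
         return 1;
      }

      cudaFreeHost(h_ptr);                    // pinned memory needs cudaFreeHost, not free()
      return 0;
   }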

4 Flags continued cudaHostAllocWriteCombined -- allocates the memory as “write-combined”, which can be transferred more quickly across the PCIe bus on some system configurations, but cannot be read efficiently by most CPUs. Use it for memory written by the CPU and read by the device via mapped pinned memory. Combining flags (bitwise OR): cudaHostAllocMapped | cudaHostAllocWriteCombined Reference: NVIDIA CUDA library
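For example, an input buffer that the CPU only writes and the GPU only reads might be allocated as in the following sketch (the buffer name and size are illustrative):

   float *h_in;                  // host buffer written by the CPU, read by the GPU
   size_t nbytes = 1 << 20;      // illustrative size

   // Note the bitwise OR: logical OR (||) would collapse both flags to the value 1.
   cudaHostAlloc((void **)&h_in, nbytes,
                 cudaHostAllocMapped | cudaHostAllocWriteCombined);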

5 Device pointer to allocated memory The device pointer to the memory is obtained by calling cudaHostGetDevicePointer(): “Passes back device pointer corresponding to mapped, pinned host buffer allocated by cudaHostAlloc() or …” Needed to account for the different address spaces. Parameters:

   cudaHostGetDevicePointer(
      void **pDevice,        // returned device pointer for mapped memory
      void *pHost,           // requested host pointer mapping
      unsigned int flags)    // flags for extensions (must be 0 for now)

Reference: NVIDIA CUDA library

6 Code to allocate memory and get pointer for device

Allocate pinned memory on host:

   int *a;            // host pointer
   int *dev_a;        // device pointer to host memory
   size = … ;         // number of bytes to allocate
   cudaHostAlloc( (void**)&a, size,
                  cudaHostAllocMapped | cudaHostAllocWriteCombined );  // write-combined flag if desired

Get device pointer to it:

   cudaHostGetDevicePointer(&dev_a, a, 0);

Now we do not need to copy memory from host to device.

7 Using pointer to host memory Simply use the returned pointer in the kernel call where one would otherwise have used a device memory pointer: MyKernel<<< B, T >>>(dev_a, … ); without needing to modify the kernel code at all!

8 Example: Vector addition without host-device transfers

   #include <stdio.h>
   #define N 32   // size of vectors

   __global__ void add(int *a, int *b, int *c) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < N) c[tid] = a[tid] + b[tid];
   }

   int main(int argc, char *argv[]) {
      int T = 32, B = 1;             // threads per block and blocks per grid
      int *a, *b, *c;                // host pointers
      int *dev_a, *dev_b, *dev_c;    // device pointers to host memory
      int size = N * sizeof(int);    // number of bytes to allocate
      cudaEvent_t start, stop;       // to measure time
      float elapsed_time_ms;

      cudaHostAlloc( (void**)&a, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
      cudaHostAlloc( (void**)&b, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
      cudaHostAlloc( (void**)&c, size, cudaHostAllocMapped );  // note flag: c is read back by the CPU, so not write-combined

      …                              // load arrays with some numbers

      // memory copy to device not needed now, but device pointers needed instead
      cudaHostGetDevicePointer(&dev_a, a, 0);
      cudaHostGetDevicePointer(&dev_b, b, 0);
      cudaHostGetDevicePointer(&dev_c, c, 0);

      …                              // start time
      add<<<B,T>>>(dev_a, dev_b, dev_c);
      cudaThreadSynchronize();       // copy back not needed, but now need thread synchronization
      …                              // end time
      …                              // print results
      printf("Time to calculate results: %f ms.\n", elapsed_time_ms);  // print out execution time

      cudaFreeHost(a);               // clean up -- the book seems to miss out this special free routine when using cudaHostAlloc
      cudaFreeHost(b);
      cudaFreeHost(c);
      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      return 0;
   }

9 [Diagram: Host (CPU) and Device (GPU). Host memory is allocated with cudaHostAlloc( (void**)&a, … ) and a device pointer to it obtained with cudaHostGetDevicePointer(&dev_a, a, 0). The kernel launched as MyKernel<<< B, T >>>(dev_a, … ) and defined as __global__ void add(int *a, … ) { … } dereferences dev_a, so the device accesses host memory directly.]

10 Code to determine whether the GPU has the capability for the features being used

Look at the device properties:

   cudaDeviceProp prop;
   int myDevice;
   cudaGetDevice(&myDevice);                   // returns the device executing the calling host thread
   cudaGetDeviceProperties(&prop, myDevice);   // returns a structure, see next slide
   if (prop.property_name != 1)                // various property names, see next slide
      printf("Feature not available\n");

11 Properties

   struct cudaDeviceProp {
      char name[256];
      size_t totalGlobalMem;
      size_t sharedMemPerBlock;
      int regsPerBlock;
      int warpSize;
      size_t memPitch;
      int maxThreadsPerBlock;
      int maxThreadsDim[3];
      int maxGridSize[3];
      size_t totalConstMem;
      int major;
      int minor;
      int clockRate;
      size_t textureAlignment;
      int deviceOverlap;
      int multiProcessorCount;
      int kernelExecTimeoutEnabled;
      int integrated;
      int canMapHostMemory;
      int computeMode;
      int concurrentKernels;
   };
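A short device-query sketch along these lines (the choice of fields printed is illustrative) can confirm whether the current GPU is integrated and whether it can map host memory:

   #include <stdio.h>
   #include <cuda_runtime.h>

   int main(void) {
      cudaDeviceProp prop;
      int myDevice;
      cudaGetDevice(&myDevice);
      cudaGetDeviceProperties(&prop, myDevice);

      printf("Device %d: %s\n", myDevice, prop.name);
      printf("  integrated GPU:      %s\n", prop.integrated ? "yes" : "no");
      printf("  can map host memory: %s\n", prop.canMapHostMemory ? "yes" : "no");
      printf("  compute capability:  %d.%d\n", prop.major, prop.minor);
      return 0;
   }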

12 Checking whether the device can map page-locked host memory into the device address space

   …
   cudaDeviceProp prop;
   int myDevice;
   cudaGetDevice(&myDevice);
   cudaGetDeviceProperties(&prop, myDevice);
   if (prop.canMapHostMemory != 1) {
      printf("Feature not available\n");
      return 0;
   }
   …

Very likely available, as it only needs compute capability > 1.0.

13 Integrated GPU systems Zero-copy memory is particularly interesting with integrated GPU systems, where system memory is shared between the CPU and GPU. For such systems, increased performance will always result from using zero-copy memory (according to the course textbook). [Diagram: CPU and GPU sharing main memory. Example: a 13” MacBook Pro with a 2.4 GHz Intel Core 2 Duo CPU and an integrated NVIDIA GeForce 320M GPU, 4 GB DDR3 SDRAM shared between CPU and GPU; on 15/17” models the Intel Graphics Media Accelerator (GMA) shares the bus.]

14 Using multiple GPUs on one system Each GPU needs to be controlled by a separate host thread: [Diagram: one program with Thread 1 controlling GPU 1 and Thread 2 controlling GPU 2.] So we need to write a multi-threaded program, using thread APIs/tools such as Pthreads, Win32 threads, OpenMP, …. A minimal sketch of the pattern is shown below.
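The sketch below uses Pthreads directly (the worker function, the assumption of two GPUs, and the empty kernel are illustrative); the key point is that each host thread selects its own GPU with cudaSetDevice before making any other CUDA calls:

   #include <pthread.h>
   #include <cuda_runtime.h>

   __global__ void work(void) { }          // placeholder kernel

   void *gpu_worker(void *arg) {
      int device = *(int *)arg;
      cudaSetDevice(device);               // bind this host thread to its own GPU
      work<<<1, 32>>>();                   // launch work on that GPU
      cudaDeviceSynchronize();
      return NULL;
   }

   int main(void) {
      pthread_t threads[2];
      int ids[2] = { 0, 1 };               // assumes two GPUs, devices 0 and 1
      for (int i = 0; i < 2; i++)
         pthread_create(&threads[i], NULL, gpu_worker, &ids[i]);
      for (int i = 0; i < 2; i++)
         pthread_join(threads[i], NULL);
      return 0;
   }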

15 Textbook utility routines for multi-threading

Found in ../common/book.h. Provides wrappers over Win32 threads for Windows or Pthreads for Linux.

thread = start_thread(funct, ptr)
   Used to start a new thread. Takes as arguments:
      void* funct (void*)
      void* ptr
   Returns a CUTThread type thread identifier.

To terminate a thread (join to main thread):
   end_thread(thread)

   …
   #if _WIN32
      //Windows threads.
      #include <windows.h>
      typedef HANDLE CUTThread;
      typedef unsigned (WINAPI *CUT_THREADROUTINE)(void *);
      #define CUT_THREADPROC unsigned WINAPI
      #define CUT_THREADEND return 0
   #else
      //POSIX threads.
      #include <pthread.h>
      typedef pthread_t CUTThread;
      typedef void *(*CUT_THREADROUTINE)(void *);
      #define CUT_THREADPROC void
      #define CUT_THREADEND
   #endif

   //Create thread.
   CUTThread start_thread( CUT_THREADROUTINE, void *data );
   //Wait for thread to finish.
   void end_thread( CUTThread thread );
   //Destroy thread.
   void destroy_thread( CUTThread thread );
   //Wait for multiple threads.
   void wait_for_threads( const CUTThread *threads, int num );

   #if _WIN32
      //Create thread
      CUTThread start_thread(CUT_THREADROUTINE func, void *data){
         return CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)func, data, 0, NULL);
      }
      //Wait for thread to finish
      void end_thread(CUTThread thread){
         WaitForSingleObject(thread, INFINITE);
         CloseHandle(thread);
      }
      //Destroy thread
      void destroy_thread( CUTThread thread ){
         TerminateThread(thread, 0);
         CloseHandle(thread);
      }
      //Wait for multiple threads
      void wait_for_threads(const CUTThread * threads, int num){
         WaitForMultipleObjects(num, threads, true, INFINITE);
         for(int i = 0; i < num; i++)
            CloseHandle(threads[i]);
      }
   #else
      //Create thread
      CUTThread start_thread(CUT_THREADROUTINE func, void * data){
         pthread_t thread;
         pthread_create(&thread, NULL, func, data);
         return thread;
      }
      //Wait for thread to finish
      void end_thread(CUTThread thread){
         pthread_join(thread, NULL);
      }
      //Destroy thread
      void destroy_thread( CUTThread thread ){
         pthread_cancel(thread);
      }
      //Wait for multiple threads
      void wait_for_threads(const CUTThread * threads, int num){
         for(int i = 0; i < num; i++)
            end_thread( threads[i] );
      }
   #endif
   …
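As a usage sketch of these wrappers (the routine name, the explicit cast, and the assumption of two GPUs are illustrative), one GPU can be driven from a new thread and the other from the main thread:

   // Illustrative per-GPU worker routine, written with the book.h macros.
   CUT_THREADPROC gpu_routine(void *arg) {
      int device = *(int *)arg;
      cudaSetDevice(device);          // this host thread now controls "device"
      // ... allocate memory, launch kernels, collect results for this GPU ...
      CUT_THREADEND;
   }

   // In main():
   int id0 = 0, id1 = 1;              // assumes two GPUs
   CUTThread t = start_thread((CUT_THREADROUTINE)gpu_routine, &id1);  // GPU 1 in a new thread
   gpu_routine(&id0);                 // GPU 0 handled by the main thread
   end_thread(t);                     // join the worker thread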

16 Pinned memory on multiple GPUs Pinned memory is only treated as pinned by the thread that allocated it. Other threads see it as pageable and access it more slowly; those threads also cannot use cudaMemcpyAsync, which requires pinned memory. “Portable” pinned memory Memory is allowed to move between host threads, and any thread sees it as pinned memory. Use cudaHostAlloc and include the cudaHostAllocPortable flag, as in the sketch below.
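A minimal allocation sketch (the buffer name and size are illustrative). Combining cudaHostAllocPortable with cudaHostAllocMapped makes the buffer both portable across host threads and mapped into the device address space:

   float *h_buf;                 // host buffer shared by several GPU-controlling threads
   size_t nbytes = 1 << 20;      // illustrative size

   // Portable: every host thread sees this allocation as pinned, not just the allocator.
   // Mapped:   the buffer is also mapped into the device address space (zero-copy).
   cudaHostAlloc((void **)&h_buf, nbytes,
                 cudaHostAllocPortable | cudaHostAllocMapped);

   // ... each thread can now use h_buf with cudaMemcpyAsync, or obtain its own
   //     device pointer with cudaHostGetDevicePointer() ...

   cudaFreeHost(h_buf);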

Questions More information – See Chapter 11 of “CUDA by Example” by Jason Sanders and Edward Kandrot, Addison-Wesley, 2011