Programming with CUDA WS 08/09 Lecture 8 Thu, 18 Nov, 2008
Previously
CUDA Runtime Component
–Common Component
  Data types, math functions, timing, textures
–Device Component
  Math functions, warp voting, atomic functions, synchronization function, texturing
–Host Component
  High-level runtime API
  Low-level driver API
Previously
CUDA Runtime Component
–Host Component APIs
  Mutually exclusive
  Runtime API is easier to program, hides some details from the programmer
  Driver API gives low-level control, harder to program
  Provide: device initialization, management of device, streams and events
Today
CUDA Runtime Component
–Host Component APIs
  Provide: management of memory & textures, OpenGL/Direct3D interoperability (NOT covered)
  Runtime API provides: emulation mode for debugging
  Driver API provides: management of contexts & modules, execution control
Final Projects
Host Runtime Component
Memory Management: Linear Memory
–CUDA Runtime API
  Declare: TYPE*
  Allocate: cudaMalloc, cudaMallocPitch
  Copy: cudaMemcpy, cudaMemcpy2D
  Free: cudaFree
–CUDA Driver API
  Declare: CUdeviceptr
  Allocate: cuMemAlloc, cuMemAllocPitch
  Copy: cuMemcpy, cuMemcpy2D
  Free: cuMemFree
Memory Management: Linear Memory
–Pitch (stride) – expected, but WRONG:
    // host code (not the real signature: the pitch output argument is missing)
    float *array2D;
    cudaMallocPitch ((void**) &array2D, width*sizeof (float), height);
    // device code
    int size = width * sizeof (float);   // wrongly assumes rows are packed without padding
    for (int r = 0; r < height; ++r) {
        float *row = (float*) ((char*)array2D + r*size);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
Memory Management: Linear Memory
–Pitch (stride) – CORRECT:
    // host code
    float *array2D;
    size_t pitch;   // row stride in bytes, filled in by cudaMallocPitch
    cudaMallocPitch ((void**) &array2D, &pitch, width*sizeof (float), height);
    // device code (array2D and pitch are passed to the kernel as arguments)
    for (int r = 0; r < height; ++r) {
        float *row = (float*) ((char*)array2D + r*pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
Memory Management: Linear Memory
–Pitch (stride) – why?
  Allocation using pitch functions appropriately pads memory for efficient transfer and copy
  Width of allocated rows may exceed width*sizeof(float)
  True width in bytes is given by pitch
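The padding rule can be made concrete with a host-only C++ sketch. It does not call the CUDA runtime. The helper names `paddedPitch` and `elementAt` are hypothetical, and the rounding rule is an assumption for illustration only: the actual pitch is whatever cudaMallocPitch returns for the device at hand.

```cpp
#include <cstddef>

// Hypothetical helper: round a row width in bytes up to a multiple of
// `alignment`. This rounding is an assumption used to illustrate padding;
// real code must use the pitch value returned by cudaMallocPitch.
std::size_t paddedPitch(std::size_t widthBytes, std::size_t alignment) {
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Address element (r, c) of a pitched 2D float array: step r rows of
// `pitch` bytes from the base, then index c floats into that row.
float* elementAt(void* base, std::size_t pitch, std::size_t r, std::size_t c) {
    char* row = static_cast<char*>(base) + r * pitch;
    return reinterpret_cast<float*>(row) + c;
}
```

With a row of 100 floats (400 bytes) and an assumed 256-byte alignment, the stride is 512 bytes, so stepping rows by width*sizeof(float) would land inside the previous row's data.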
Memory Management: CUDA Arrays
–CUDA Runtime API
  Declare: cudaArray*
  Channel: cudaChannelFormatDesc, cudaCreateChannelDesc
  Allocate: cudaMallocArray
  Copy (from linear): cudaMemcpy2DToArray
  Free: cudaFreeArray
Memory Management: CUDA Arrays
–CUDA Driver API
  Declare: CUarray
  Channel: CUDA_ARRAY_DESCRIPTOR object
  Allocate: cuArrayCreate
  Copy (from linear): CUDA_MEMCPY2D object
  Free: cuArrayDestroy
Memory Management: various other functions to copy from
–Linear memory to CUDA arrays
–Host to constant memory
–See Reference Manual
Texture Management
–Run-time API: texture type derived from
    struct textureReference {
        int normalized;
        enum cudaTextureFilterMode filterMode;
        enum cudaTextureAddressMode addressMode[3];
        struct cudaChannelFormatDesc channelDesc;
    };
–normalized: whether texture coordinates are normalized (0: false, otherwise true)
Texture Management
–filterMode:
  cudaFilterModePoint: no filtering, returned value is that of the nearest texel
  cudaFilterModeLinear: filters 2/4/8 neighbors for 1D/2D/3D texture, floats only
–addressMode: one entry per coordinate (x,y,z)
  cudaAddressModeClamp
  cudaAddressModeWrap: normalized coordinates only
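The two addressing modes can be sketched on the host for a 1D, point-sampled texture. This is a simplified model rather than the hardware's exact fetch rule, and the names `clampIndex`, `wrapCoord` and `fetchPoint` are hypothetical.

```cpp
#include <cmath>

// Clamp addressing: out-of-range texel indices map to the nearest edge texel.
int clampIndex(int i, int n) {
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

// Wrap addressing on a normalized coordinate: keep only the fractional part,
// so 1.25 behaves like 0.25 and -1.25 behaves like 0.75.
float wrapCoord(float x) {
    return x - std::floor(x);
}

// Point-sampled 1D fetch with a normalized coordinate x over n texels.
// The index is clamped in both modes, which also guards against rounding
// at the upper edge after wrapping.
float fetchPoint(const float* texels, int n, float x, bool wrap) {
    if (wrap) x = wrapCoord(x);
    int i = static_cast<int>(std::floor(x * n));
    return texels[clampIndex(i, n)];
}
```

For a four-texel texture, the coordinate 1.25 reads the last texel under clamp addressing and the same texel as 0.25 under wrap addressing.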
Texture Management
–channelDesc: texel type
    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };
  x,y,z,w: #bits per component
  f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat
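As a small illustration of the per-component bit counts, the following host-only sketch mirrors the layout of the descriptor and sums the bits to get the texel size. `ChannelDesc`, `ChannelKind` and `texelBytes` are hypothetical stand-ins, not the CUDA types themselves.

```cpp
// Hypothetical mirror of the channel format kinds listed above.
enum class ChannelKind { Signed, Unsigned, Float };

// Hypothetical mirror of cudaChannelFormatDesc: bits per component plus kind.
struct ChannelDesc {
    int x, y, z, w;
    ChannelKind f;
};

// Size of one texel in bytes: the sum of the four component widths.
int texelBytes(const ChannelDesc& d) {
    return (d.x + d.y + d.z + d.w) / 8;
}
```

Under this model a single-float texel is described by 32, 0, 0, 0 bits and a four-float texel by 32 bits in every component.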
Texture Management
–Run-time API: texture type derived from struct textureReference
–The normalized, filterMode and addressMode attributes apply only to texture references bound to CUDA arrays
Texture Management
–Binding a texture reference to a texture
  Runtime API:
    –Linear memory: cudaBindTexture
    –CUDA Array: cudaBindTextureToArray
  Driver API:
    –Linear memory: cuTexRefSetAddress
    –CUDA Array: cuTexRefSetArray
Runtime API: debugging using the emulation mode
–No native debug support for device code
–Code should be compiled either for device emulation OR execution: mixing not allowed
–In emulation mode, device code is compiled for the host
Runtime API: debugging using the emulation mode
–Features
  Each CUDA thread is mapped to a host thread, plus one master thread
  Each thread gets 256KB of stack
Runtime API: debugging using the emulation mode
–Advantages
  Can use host debuggers
  Can use otherwise disallowed functions in device code, e.g. printf
  Device and host memory are both readable from either device or host
Runtime API: debugging using the emulation mode
–Advantages
  Any device- or host-specific function can be called from either device or host code
  Runtime detects incorrect use of synchronization functions
Runtime API: debugging using the emulation mode
–Some errors may still remain hidden
  Memory access errors
  Out-of-context pointer operations
  Incorrect outcome of warp vote functions, as warp size is 1 in emulation mode
  Results of FP operations often differ between host and device
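The warp vote pitfall can be reproduced with a host-side model of an "all threads agree" vote. This is a sketch under the assumption that the vote returns true for a thread only when every thread in its warp passed a true predicate; `voteAll` is a hypothetical name and `warpSize` must be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side model of an "all" warp vote: each thread receives true only if
// every thread in its warp supplied a true predicate.
std::vector<bool> voteAll(const std::vector<bool>& pred, std::size_t warpSize) {
    std::vector<bool> result(pred.size());
    for (std::size_t start = 0; start < pred.size(); start += warpSize) {
        std::size_t end = std::min(start + warpSize, pred.size());
        bool all = true;
        for (std::size_t t = start; t < end; ++t) all = all && pred[t];
        for (std::size_t t = start; t < end; ++t) result[t] = all;
    }
    return result;
}
```

With 32 threads of which one supplies false, a warp size of 32 makes the vote fail for every thread, while a warp size of 1 lets each thread simply see its own predicate, so code that depends on the vote can behave differently in emulation.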
Driver API: Context management
–A context encapsulates all resources and actions performed within the driver API
–Almost all CUDA functions operate in a context, except those dealing with
  Device enumeration
  Context management
Driver API: Context management
–Each host thread can have only one current device context at a time
–Each host thread maintains a stack of current contexts
–cuCtxCreate()
  Creates a context
  Pushes it to the top of the stack
  Makes it the current context
Driver API: Context management
–cuCtxPopCurrent()
  Detaches the current context from the host thread, making it "uncurrent"
  The context is now floating
  It can be pushed to any host thread's stack
Driver API: Context management
–Each context has a usage count
  cuCtxCreate creates a context with a usage count of 1
  cuCtxAttach increments the usage count
  cuCtxDetach decrements the usage count
Driver API: Context management
–A context is destroyed when its usage count reaches 0
  cuCtxDetach, cuCtxDestroy
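The stack and usage-count rules can be modeled on the host without the driver. The sketch below is a simplified illustration, not the driver's implementation: `Context`, `HostThread`, `attach` and `detach` are hypothetical names, and `popCurrent` assumes the stack is not empty.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a driver context; only the usage count is modeled.
struct Context {
    int usageCount = 1;      // a newly created context starts with a count of 1
    bool destroyed = false;
};

// Hypothetical model of one host thread's stack of current contexts.
struct HostThread {
    std::vector<std::shared_ptr<Context>> stack;

    // Models cuCtxCreate: the new context is pushed and becomes current.
    std::shared_ptr<Context> create() {
        auto ctx = std::make_shared<Context>();
        stack.push_back(ctx);
        return ctx;
    }
    // Models cuCtxPopCurrent: the context leaves this thread and floats.
    std::shared_ptr<Context> popCurrent() {
        auto ctx = stack.back();
        stack.pop_back();
        return ctx;
    }
    // Models pushing a floating context onto this thread's stack.
    void pushCurrent(std::shared_ptr<Context> ctx) { stack.push_back(std::move(ctx)); }

    // The current context is the top of the stack, if any.
    Context* current() const { return stack.empty() ? nullptr : stack.back().get(); }
};

// Models of cuCtxAttach / cuCtxDetach acting on the usage count.
void attach(Context& ctx) { ++ctx.usageCount; }
void detach(Context& ctx) {
    if (--ctx.usageCount == 0) ctx.destroyed = true;
}
```

In this model a context created on one thread can be popped, pushed onto another thread, attached once more, and is only marked destroyed after the second detach.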
Driver API: Module management
–Modules are dynamically loadable packages of device code and data output by nvcc
  Similar to DLLs
Driver API: Module management
–Dynamically loading a module and accessing its contents
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "myModule.cubin");        // load compiled device code from file
    CUfunction cuFunction;
    cuModuleGetFunction(&cuFunction, cuModule, "myKernel");   // look up a kernel by name
Driver API: Execution control
–Set kernel parameters
  cuFuncSetBlockShape()
    –#threads/block for the function
    –How thread IDs are assigned
  cuFuncSetSharedSize()
    –Size of shared memory
  cuParam*()
    –Specify other parameters for next kernel launch
Driver API: Execution control
–Launch kernel
  cuLaunch(), cuLaunchGrid()
–Example in Programming Guide
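One piece of host arithmetic that typically precedes a launch is choosing how many blocks cover the data. The sketch below assumes a 1D problem of `n` elements with one thread per element; `blocksFor` is a hypothetical helper and makes no driver calls.

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n
// (ceiling division). Assumes threadsPerBlock > 0.
unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// In a real program the block width would go to cuFuncSetBlockShape and the
// block count to cuLaunchGrid; those calls are omitted here because they
// need a loaded module and a current context.
```

For 1000 elements and 256 threads per block this gives 4 blocks, so the kernel must ignore the 24 surplus threads in the last block.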
Final Projects
Ideas?
–DES cracker
–Image editor
  Resize and smooth an image
  Gamut mapping?
–3D shape matching
All for today
Next time
–Memory and instruction optimizations
On to exercises!