Programming with CUDA WS 08/09 Lecture 6 Thu, 11 Nov, 2008
Previously
  CUDA API
    – Extensions to C
        Function & variable type qualifiers
        Built-in variables
    – Compilation with NVCC
  Synchronization & optimization
  CUDA Runtime Component
Today
  CUDA API
    – CUDA Runtime Component
CUDA Runtime Component
  Common Component
  Device Component
  Host Component
Common Runtime Component
  Used by both host and device
  Built-in vector types
    – char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4, double2
    – Constructor functions of the form make_<type name>
        float a, b, c, d;
        float4 f4 = make_float4(a, b, c, d); // f4.x=a f4.y=b f4.z=c f4.w=d
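To make the component access concrete, here is a minimal sketch of the vector types used from code that compiles for both host and device. The helper names `sum4` and `swapped` are hypothetical and exist only for illustration; `make_float4` and `make_int2` are the runtime's own constructor functions.

```cuda
#include <cuda_runtime.h>  // built-in vector types and make_<type name> functions

// Hypothetical helper: sum of the four components of a float4.
// __host__ __device__ makes it callable from host and device code.
__host__ __device__ inline float sum4(float4 v)
{
    return v.x + v.y + v.z + v.w;
}

// Hypothetical helper: build an int2 and return it with components swapped.
__host__ __device__ inline int2 swapped(int x, int y)
{
    int2 p = make_int2(x, y);    // p.x = x, p.y = y
    return make_int2(p.y, p.x);  // components exchanged
}
```

Components are always named x, y, z, w in that order, so a two-component type only has x and y.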
Common Runtime Component
  Built-in vector types
    – dim3
        Based on uint3
        Unspecified components default to 1
  Math functions
    – Full listing in Appendix B of the programming guide
    – Single-precision and, on devices of compute capability >= 1.3, double-precision floating-point functions
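The default-to-1 behavior of dim3 is what lets a one-dimensional launch configuration be written with a single number. A small sketch, assuming a hypothetical helper `grid_for` that rounds the block count up so that n elements are covered:

```cuda
#include <cuda_runtime.h>  // dim3

// Hypothetical helper: number of blocks needed to cover n elements
// with the given block size, rounding up.
inline dim3 grid_for(unsigned int n, dim3 block)
{
    // Only x is given; y and z of the returned dim3 default to 1.
    return dim3((n + block.x - 1) / block.x);
}
```

For example, `dim3 block(256);` yields a 256 x 1 x 1 block, and `grid_for(1000, block)` yields a 4 x 1 x 1 grid.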
Common Runtime Component
  Timing: clock_t clock()
    – Returns the number of clock ticks since program launch
        clock_t start = clock();
        // execute A
        cudaThreadSynchronize();
        clock_t end = clock();
        double gap = (double)(end - start) / CLOCKS_PER_SEC;
    – Measures the clock cycles that elapsed while A executed, not necessarily the time actually spent executing A, which may be less because of time slicing
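The timing pattern can be sketched as a complete host function. Everything here apart from the runtime calls is an assumption made for illustration: the kernel `scale` stands in for "A", and `time_scale` is a hypothetical wrapper. It calls `cudaDeviceSynchronize()`, the current runtime's name for the blocking synchronization call, because kernel launches return before the kernel has finished. Note that on POSIX hosts `clock()` counts processor time of the calling process, so the result is only a rough measure.

```cuda
#include <cuda_runtime.h>
#include <ctime>

// Hypothetical kernel standing in for "A": doubles each element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical wrapper: returns the elapsed time in seconds for one
// launch of scale, or -1.0 if device memory could not be allocated.
double time_scale(int n)
{
    float *d = 0;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess)
        return -1.0;
    cudaMemset(d, 0, n * sizeof(float));

    clock_t start = clock();
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();  // wait until the kernel has finished
    clock_t end = clock();

    cudaFree(d);
    // Cast before dividing; integer division would truncate to 0.
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```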
Common Runtime Component
  Textures
    – GPU has texturing hardware for graphics output
    – CUDA exposes a subset of this hardware
    – Texture memory can be faster than global memory (reads are cached)
    – Texture fetch = reading texture memory
    – One parameter to a texture fetch is a texture reference
Common Runtime Component
  Texture reference
    – Defines the area of texture memory to be fetched
    – Bound by the host to some memory region, termed the texture
    – Textures may overlap
    – May be 1D, 2D or 3D, addressed by 1, 2 or 3 texture coordinates
    – Elements of a texture are called texels ("texture elements")
Common Runtime Component
  Texture reference
    – Declaration: texture<Type, Dim, ReadMode> texRef;
    – Type: texel type, e.g. int, float, or a vector type
    – Dim: dimensionality of the texture reference
        Optional, defaults to 1
        May be 1, 2 or 3
Common Runtime Component
  Texture reference
    – Declaration: texture<Type, Dim, ReadMode> texRef;
    – ReadMode: convert fetched texels?
        cudaReadModeElementType: no conversion
        cudaReadModeNormalizedFloat: integer types converted to floating point in
          – [0.0, 1.0] for unsigned integers
          – [-1.0, 1.0] for signed integers
Common Runtime Component
  Texture reference
    – Declaration: texture<Type, Dim, ReadMode> texRef;
    – ReadMode: convert fetched texels?
        cudaReadModeElementType
        cudaReadModeNormalizedFloat
        Optional, defaults to cudaReadModeElementType
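The effect of cudaReadModeNormalizedFloat on an unsigned 8-bit texel can be modeled in a few lines. This is a sketch of the arithmetic only, not of the texture hardware or the fetch API; the function name `normalized_from_uchar` is hypothetical, and the signed case is left out.

```cuda
#include <cuda_runtime.h>

// Hypothetical model of cudaReadModeNormalizedFloat for an unsigned
// 8-bit texel: the integer range 0..255 is mapped onto [0.0, 1.0].
__host__ __device__ inline float normalized_from_uchar(unsigned char texel)
{
    return texel / 255.0f;
}
```

With this mapping a texel of 0 reads as 0.0 and a texel of 255 (0xff) reads as 1.0.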
Common Runtime Component
  Texture reference
    – Texture coordinates can be unnormalized: [0, N) for a dimension of N texels
        Out-of-range coordinates are clamped
    – Texture coordinates can be normalized: [0, 1)
        Independent of actual texture size
        Clamped by default
        Also offers a "wrap" mode, good for periodic textures
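The two addressing modes can be illustrated with plain arithmetic. The sketch below is a model, not the hardware implementation, and both function names are hypothetical: `clamp_index` pins the texel index of an unnormalized coordinate to the valid range, and `wrap_coord` keeps only the fractional part of a normalized coordinate.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical model of clamp addressing for an unnormalized
// coordinate x on a texture of n texels: indices below 0 become 0,
// indices above n - 1 become n - 1.
__host__ __device__ inline int clamp_index(float x, int n)
{
    int i = (int)floorf(x);
    if (i < 0) return 0;
    if (i > n - 1) return n - 1;
    return i;
}

// Hypothetical model of wrap addressing for a normalized coordinate:
// only the fractional part is used, so the texture repeats.
__host__ __device__ inline float wrap_coord(float x)
{
    return x - floorf(x);
}
```

Under this model 1.25 wraps to 0.25 and -1.25 wraps to 0.75, which is why wrap mode suits periodic data.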
Common Runtime Component
  Texture reference
    – Declaration: texture<Type, Dim, ReadMode> texRef;
    – Linear texture filtering
        Returns a value interpolated between neighboring texels
        The texture must return floating-point data
        e.g. texture of size 4: normalized indices 0, 0.25, 0.5, 0.75 address the four texels; with filtering, the full range [0, 1) returns interpolated values
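A CPU-side model of one-dimensional linear filtering shows where the interpolated values come from. It follows the formula given in the programming guide's texture fetching appendix, tex(x) = (1 - a) T[i] + a T[i+1] with xB = x - 0.5, i = floor(xB), a = frac(xB), combined with clamp addressing. The function names are hypothetical, and the sketch uses full float precision for the weight a, whereas the hardware stores it in a low-precision fixed-point format.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical helper: read texel i with clamp addressing.
__host__ __device__ inline float fetch_clamped(const float *T, int n, int i)
{
    if (i < 0) i = 0;
    if (i > n - 1) i = n - 1;
    return T[i];
}

// Hypothetical model of a linearly filtered 1D fetch.
// T: texel array, n: number of texels, u: normalized coordinate.
__host__ __device__ inline float linear_fetch_1d(const float *T, int n, float u)
{
    float xb = u * n - 0.5f;   // shift so texel centers sit on integers
    float fl = floorf(xb);
    int   i  = (int)fl;        // index of the left neighbor
    float a  = xb - fl;        // interpolation weight in [0, 1)
    return (1.0f - a) * fetch_clamped(T, n, i)
         + a * fetch_clamped(T, n, i + 1);
}
```

For the texels {0, 10, 20, 30}, a fetch at u = 0.125 returns texel 0 exactly, while u = 0.25 lies halfway between texels 0 and 1 and returns 5.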
All for today
  Next time
    – Device Runtime Component
    – Host Runtime Component
    – Performance & Optimization
On to exercises!