The Missouri S&T CS GPU Cluster Cyriac Kandoth
Pretext NVIDIA ( ) is a manufacturer of graphics processor technologies that has begun to promote their GPUs as general purpose devices (GPGPUs) They have donated 8 Tesla GPUs to the Missouri S&T Computer Science Department This presentation will introduce you to the concept of creating applications that take advantage of the massive parallelism inherent in GPUs
Tech Specs 4 desktop-style cases, each housing: Intel Core i7 (Quad core with HyperThreading i.e. 8 logical processors) 8GB DDR3 (The Core i7 under-clocks the RAM to a speed that it supports) 500 GB 3.0Gbps 7200rpm SATA Hard disk drive Two Tesla C1060 cards on PCI-Express 2.0 x16 (A Tesla C1060 is a compute-only GPU that has no video output ports) ATI Radeon HD2400 on standard PCI (for display)
The Tesla C1060 Form Factor10.5" x 4.376", Dual Slot # of Streaming Processor Cores240 Frequency of processor cores1.3GHz Single Precision floating point performance (peak) 933 GFLOP/s Double Precision floating point performance (peak) 78 GFLOP/s Floating Point PrecisionIEEE 754 single & double Total Dedicated Memory4GB GDDR3 Memory Speed800MHz Memory Interface512-bit Memory Bandwidth102GB/sec Max Power Consumption200 W peak, 160 W typical System InterfacePCIe x16 Source:
eth0 Cluster networking gpu0gpu1gpu2gpu3 The Missouri S&T Network Gigabit Switch The 4 nodes are named gpu0 thru gpu3 gpu0 is the frontend that acts as the gateway into the cluster from the MST-USERS domain eth1
The CUDA programming model CUDA (Compute Unified Device Architecture) is a C programming model and API (Application Programming Interface) introduced by NVIDIA to enable software developers to code general purpose apps that run on the massively parallel hardware on GPUs. GPUs are optimal for data parallel apps aka SIMD (Single Instruction Multiple Data). CUDA allows us to also code MIMD apps, but at a reduced efficiency. Threads running in parallel use extremely fast shared memory for communication. There is no MPI_Send(), but the equivalent of MPI_Barrier() is __syncthreads().
The CUDA programming model In your code, you can create a kernel (a function) that will run many instances of itself on parallel threads on the GPU. Threads running in parallel are collectively known as a grid. Kernels are run on the device (GPU) while the rest of the code runs on the host (CPU).
The CUDA programming model A grid is organized into blocks, and each block is organized into threads. Only threads within the same block can communicate via shared memory; and sync. This type of organization helps the GPU parallelize thread execution using it’s inbuilt hardware protocols.
Built-in Variables accessible in a Kernel dim3 gridDim Contains the dimensions of the grid as specified during kernel invocation. gridDim.x, gridDim.y (.z is unused) uint3 blockIdx Contains the block index within the grid. blockIdx.x, blockIdx.y (.z is unused) dim3 blockDim Contains the dimensions of the block ( blockDim.x, blockDim.y, and blockDim.z ) uint3 threadIdx Contains the thread index within the block ( threadIdx.x, threadIdx.y, and threadIdx.z )
E.g. Host invokes kernel on a device // Kernel definition, runs a copy on every thread __global__ void vectorAdd( float* A, float* B, float* C ) {... } int main(int argc, char** argv) { dim3 blockSize(16, 16); // 256 threads per block (up to 3D) dim3 gridSize(4, 2); // 8 blocks in the grid (up to 2D) // Invoke the kernel on the device (GPU) vectorAdd >>(A, B, C);... // Continue running on host (CPU) when device is done }
CUDA Type Qualifiers Function type qualifiers __device__ Executed on the device Callable from the device only __global__ Executed on the device Callable from the host only __host__ Executed on the host Callable from the host only Default type if unspecified Variable type qualifiers __device__ global memory space Is accessible from all the threads within the grid __constant__ constant memory space Is accessible from all the threads within the grid __shared__ space of a thread block Is only accessible from all the threads within the block
Template of a typical main() int main(int argc, char** argv) { // Allocate memory on the host for input data - malloc() // Initialize input data from file, user input, etc. // Allocate memory on the device - cudaMalloc() // Send input data to the device - cudaMemcpy() // Set up grid and block dimensions - dim3 variables // Invoke the kernel on the device (GPU) - kernelName >>(input_params); // Copy results from device to host - cudaMemcpy() // Free up device memory - cudaFree() // Print results at the host, because device can’t. // printf() from kernel only works in emulation mode }
CUDA apps in emulation mode Compile the program with the emu parameter enabled: make emu=1 The program emulates a GPU on the host CPU. Usually much slower. Helps with debugging, because you are allowed to use printf() statements in device code (from CUDA kernels)
Types of shared memory Registers: Fastest form of memory on the GPU. Is only accessible by individual threads and has the lifetime of a thread. We don’t need to deal with it directly (but we can). Shared Memory: Can be as fast as a register when there are no bank conflicts (when threads read from the same address). Accessible by any thread of the block from which it was created. Has the lifetime of the block. Global memory: Potentially 150x slower than register or shared memory because of un-coalesced reads and writes. Accessible from either the host or device. Has the lifetime of the application. Read-only global memory is called constant memory. Local memory: Resides in global memory and can be 150x slower than register/shared memory. Is only accessible by the thread. Has the lifetime of the thread.
A few CUDA API functions cudaSetDevice(int dev) - Sets the device to run the kernel. __syncthreads() - Blocks execution of all threads within a block until they synchronize. cudaMalloc(void** devPtr, size_t count) - Allocates count bytes in GPU memory and returns a pointer to it in the parameter *devPtr. cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind) - copies count bytes from src to dst where kind is A complete listing of the CUDA API functions can be found in the Reference Manual.Reference Manual
Tips for speedy code Have the kernel use the whole card - Have a multiple of 32 threads per block and at least as many blocks as multiprocessors (240 on the Tesla C1060s). Access global memory properly. Coalescing - Memory read by consecutive threads are combined by the hardware into several, wide memory reads. Avoid shared memory bank conflicts. Have as few branching conditional loops as possible. Have small loops unrolled. Have no unnecessary __syncthreads() calls. See the CUDA Programming Guide for further discussion on all of the above.CUDA Programming Guide
Demo: helloWorld using CUDA Within the NVIDIA_CUDA_SDK projects folder, you will find the helloWorld project. Compile it in emulation mode using make emu=1 Execute the binary which gets stored at ~/NVIDIA_CUDA_SDK/bin/linux/emurelease/helloWorld Now go back to the code and take a closer look at it. Files: helloWorld.cu, helloWorld_kernel.cu Next week, we will see how to perform block matrix multiplication using CUDA (see the matrixMul project in the SDK)
Questions?