Using Shared Memory

These notes demonstrate the improvements achieved by using shared memory, with code and results from runs on coit-grid06.uncc.edu.

B. Wilkinson, Nov 10, 2014. SharedMem.ppt
Experiment

Load the thread ID (flattened global thread ID) into an array element so that one can tell which thread accesses which location when the array is printed out. Do this a large number of times to simulate a calculation.

For comparison purposes, the access is done:
    Using global memory only
    Using shared memory with copying back to global memory
The times of execution are compared.

GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 2-D 32 x 32 threads (1024 threads, the maximum for compute capability 2.x/3.x).
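The flattened global thread ID described above can be checked on the host before any GPU code is run. The following is a minimal sketch in plain C, not part of the original notes: it emulates the grid of 32 x 32 blocks with nested loops and uses the same col + row * N arithmetic as the kernels. The names BLOCK_SIZE, flattened_thread_id and fill_with_thread_ids are hypothetical, and the sketch assumes N is a multiple of 32.

```c
#include <assert.h>

#define BLOCK_SIZE 32   /* threads per block edge, as in the notes (hypothetical name) */

/* Flattened global thread ID of thread (tx, ty) in block (bx, by) for an
   n x n array. Mirrors col + row * N in the kernels, with blockDim fixed
   at BLOCK_SIZE in both dimensions. */
int flattened_thread_id(int bx, int by, int tx, int ty, int n)
{
    int col = tx + BLOCK_SIZE * bx;
    int row = ty + BLOCK_SIZE * by;
    return col + row * n;
}

/* Host emulation of the experiment: every emulated thread writes its own
   flattened ID into the array element it owns. h must hold n * n ints.
   Assumption: n is a positive multiple of BLOCK_SIZE. */
void fill_with_thread_ids(int *h, int n)
{
    assert(n > 0 && n % BLOCK_SIZE == 0);
    int blocks = n / BLOCK_SIZE;            /* blocks per grid edge */
    for (int by = 0; by < blocks; by++)
        for (int bx = 0; bx < blocks; bx++)
            for (int ty = 0; ty < BLOCK_SIZE; ty++)
                for (int tx = 0; tx < BLOCK_SIZE; tx++) {
                    int id = flattened_thread_id(bx, by, tx, ty, n);
                    h[id] = id;             /* index and thread ID coincide */
                }
}
```

Because the index and the thread ID are computed by the same expression, element i of the printed array should simply hold the value i; any other pattern would point to a mismatch between the grid dimensions and N.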
1. Using global memory only

    __global__ void gpu_WithoutSharedMem (int *h, int N, int T) {
        // Array loaded with global thread ID that accesses that location
        // Coalescing should be possible
        int col = threadIdx.x + blockDim.x * blockIdx.x;
        int row = threadIdx.y + blockDim.y * blockIdx.y;
        int threadID = col + row * N;
        int index = col + row * N;

        for (int t = 0; t < T; t++)     // to reduce other time effects
            h[index] = threadID;        // load array with global thread ID
    }
2. Using shared memory

    __global__ void gpu_SharedMem (int *h, int N, int T) {
        __shared__ int h_local[BlockSize][BlockSize];   // sh. mem. each block
        int col = threadIdx.x + blockDim.x * blockIdx.x;
        int row = threadIdx.y + blockDim.y * blockIdx.y;
        int threadID = col + row * N;
        int index = col + row * N;

        for (int t = 0; t < T; t++)
            h_local[threadIdx.y][threadIdx.x] = threadID;   // load array

        h[index] = h_local[threadIdx.y][threadIdx.x];       // copy back to global mem.
    }
Main program

    …
    /*------------------------- Allocate Memory-----------------------------------*/
    int size = N * N * sizeof(int);     // number of bytes in total in array
    int *h, *dev_h;                     // ptr to arrays holding numbers on host and device
    h = (int*) malloc(size);            // Array on host
    cudaMalloc((void**)&dev_h, size);   // allocate device memory

    /* ------------------------- GPU Computation without shared memory -----------------------------------*/
    gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);   // once outside timing
    cudaEventRecord( start, 0 );
    gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &elapsed_time_ms1, start, stop );
    cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);      // Get results to check
    printf("\nComputation without shared memory\n");
    printArray(h,N);
    printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computation 2 and 3 similar
Some results

A grid of one block and one iteration
Array 32 x 32
Shared memory speedup = 1.18
A grid of one block and 1,000,000 iterations
Array 32 x 32
Shared memory speedup = 1.24
Repeat just to check results are consistent
A grid of 16 x 16 blocks and 10,000 iterations
Array 512 x 512
Speedup = 1.74
Different numbers of iterations produce similar results.
Different Array Sizes

    Array size      Speedup
    32 x 32         1.24
    64 x 64         1.37
    128 x 128       1.36
    256 x 256       1.78
    512 x 512       1.75
    1024 x 1024     1.82
    2048 x 2048     1.79
    4096 x 4096     1.77

1000 iterations. Block size 32 x 32. Number of blocks chosen to suit the array size.
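The notes do not show how the number of blocks is derived from the array size or how the speedup figures are computed. The sketch below, in plain host-side C, illustrates one plausible way to do both. It assumes that speedup means the time without shared memory divided by the time with shared memory; the function names blocks_per_edge and speedup are hypothetical. The block count rounds up, but the kernels in these notes have no bounds check, so in practice N would need to be a multiple of 32.

```c
#include <assert.h>

#define BLOCK_SIZE 32   /* threads per block edge, as in the notes (hypothetical name) */

/* Blocks needed along one edge of the grid to cover an n x n array.
   Rounds up; for the array sizes in the table n is a multiple of 32,
   so the division is exact. */
int blocks_per_edge(int n)
{
    assert(n > 0);
    return (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Assumed definition of speedup: elapsed time without shared memory
   divided by elapsed time with shared memory (both in ms). */
double speedup(double t_without_ms, double t_with_ms)
{
    assert(t_with_ms > 0.0);
    return t_without_ms / t_with_ms;
}
```

With this arithmetic a 512 x 512 array needs 16 blocks per edge, which agrees with the 16 x 16 grid reported for that array size in the results.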
Questions