Using Shared Memory

These notes will demonstrate the improvements achieved by using shared memory, with code and results running on coit-grid06.uncc.edu.


1 Using Shared memory These notes will demonstrate the improvements achieved by using shared memory, with code and results running on coit-grid06.uncc.edu. B. Wilkinson, Nov 10, SharedMem.ppt

2 Experiment
Load the thread ID (the flattened global thread ID) into an array element, so that one can tell which thread accessed which location when the array is printed out. Do this a large number of times; this simulates a calculation.
For comparison purposes, the accesses are done:
1. Using global memory only
2. Using shared memory, with a copy back to global memory
and the times of execution are compared.
GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 32 x 32 threads (1024 threads, the maximum for compute capability 2.x/3.x).

3 1. Using global memory only
__global__ void gpu_WithoutSharedMem (int *h, int N, int T) {
   // Array loaded with global thread ID that accesses that location
   // Coalescing should be possible
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;
   int index = col + row * N;

   for (int t = 0; t < T; t++)   // repeated to reduce other time effects
      h[index] = threadID;       // load array with global thread ID
}

4 2. Using shared memory

__global__ void gpu_SharedMem (int *h, int N, int T) {
   __shared__ int h_local[BlockSize][BlockSize];  // shared memory, one array per block

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;
   int index = col + row * N;

   for (int t = 0; t < T; t++)
      h_local[threadIdx.y][threadIdx.x] = threadID;  // load shared array

   h[index] = h_local[threadIdx.y][threadIdx.x];     // copy back to global memory
}

5 Main program

/* Allocate memory */
int size = N * N * sizeof(int);    // number of bytes in total in array
int *h, *dev_h;                    // pointers to arrays holding numbers on host and device
h = (int*) malloc(size);           // array on host
cudaMalloc((void**)&dev_h, size);  // allocate device memory

/* GPU computation without shared memory */
gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);  // run once outside timing

cudaEventRecord( start, 0 );
gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);     // get results to check
printf("\nComputation without shared memory\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computations 2 and 3 are similar.

6 Some results
A grid of one block, one iteration, array 32 x 32.
Shared memory speedup = 1.18

7 A grid of one block and 1000000 iterations
Array 32 x 32. Shared memory speedup = 1.24

8 Repeated to check that the results are consistent

9 A grid of 16 x 16 blocks and 10000 iterations
Array 512 x 512. Speedup = 1.74
Different numbers of iterations produce similar results.

10 Different Array Sizes

Array size     Speedup
32 x 32        1.24
64 x 64        1.37
128 x 128      1.36
256 x 256      1.78
512 x 512      1.75
1024 x 1024    1.82
2048 x 2048    1.79
4096 x 4096    1.77

1000 iterations. Block size 32 x 32. Number of blocks chosen to suit the array size.

11 Questions

