Memory Coalescing

These notes will demonstrate the effects of memory coalescing, and the use of a matrix transpose to improve matrix multiplication performance.

B. Wilkinson, Nov 10, 2014, MemCoalescing.ppt
Memory coalescing is the combining of separate memory accesses into one combined access; it is done by the GPU when the locations are sequential locations in global memory. Consider setting the elements of a two-dimensional array to given data values. This could be done across rows or down columns. In the following code, we will demonstrate the effects of each approach.
Experiment

Load the thread ID (the flattened global thread ID) into each array element, so that one can tell which thread accessed which location when the array is printed out. Do it a large number of times; this simulates a calculation. For comparison purposes, the access is done two ways, as the kernels below show:

Access done across rows
Access done down columns

The execution times are then compared. In practice, the problem may dictate the access order.

GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 2-D 32 x 32 threads (1024 threads, the maximum for compute capability 2.x/3.x).
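A minimal sketch of the corresponding launch configuration (the example value of N and the assumption that N is a multiple of 32 are mine; Grid, Block, dev_h and T match the launch calls shown later):

int N = 512;                        // array dimension (example value)
dim3 Block(32, 32);                 // fixed 2-D block: 32 x 32 = 1024 threads
dim3 Grid(N / 32, N / 32);          // 2-D grid of blocks covering the N x N array
gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);   // dev_h and T as in the timing code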
One way

__global__ void gpu_Comput1 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;    // thread ID
   int index = col + row * N;       // array index
   for (int t = 0; t < T; t++)      // loop to reduce other time effects
      h[index] = threadID;          // load array with global thread ID
}

Consecutive threads in x (consecutive values of col) write to consecutive locations of h, so these accesses can be coalesced.
Another way

__global__ void gpu_Comput2 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;    // thread ID
   int index = row + col * N;       // array index
   for (int t = 0; t < T; t++)      // loop to reduce other time effects
      h[index] = threadID;          // load array with global thread ID
}

Consecutive threads in x now write to locations N apart, so these accesses cannot be coalesced.
/* ------------------------- GPU Computation 1 ----------------------------------- */

gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);   // launch kernel once outside timing

cudaEventRecord( start, 0 );
gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);
cudaEventRecord( stop, 0 );                    // measure end time
cudaEventSynchronize( stop );                  // wait for event recording
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);    // results to check
printf("\nComputation with memory coalescing possible\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computation 2 is similar.
Some results

A grid of one block and one iteration. Array 32 x 32.

No speedup recorded, because the time of the other operations dominates the execution time.
A grid of one block and 1,000,000 iterations. Array 32 x 32. Speedup = 17.16.
Repeated just to check that the results are consistent.
A grid of 16 x 16 blocks and 10,000 iterations. Array 512 x 512. Speedup = 12.08. Different numbers of iterations produce similar results.
Different Array Sizes

Array size       Speedup
32 x 32          16.70
64 x 64          15.11
128 x 128        15.12
256 x 256        11.85
512 x 512        12.03
1024 x 1024      11.75
2048 x 2048      11.80
4096 x 4096      11.90

1000 iterations. Block size 32 x 32. Number of blocks chosen to suit the array size.
Effects of memory access in matrix multiplication

One thread is responsible for computing one result Cij, and needs to access one row of A and one column of B. There are N² row/column combinations, so N² threads.
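A minimal sketch of the kind of kernel being discussed (the kernel name gpu_matrixmult, the int element type, and the bounds check are assumptions for illustration, not taken from the original code):

__global__ void gpu_matrixmult(int *a, int *b, int *c, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (row < N && col < N) {
      int sum = 0;
      for (int k = 0; k < N; k++)
         sum += a[row * N + k] * b[k * N + col];  // row of A, column of B
      c[row * N + col] = sum;                     // one thread computes one Cij
   }
}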
Seen another way: in the first time period, each thread accesses the first element in a row of A (Thread 0, ..., Thread i, ..., Thread N-1). Consider those threads that access different rows. Given the row-major order in which A is stored, those threads access locations that are not consecutive. Bad! Memory coalescing is not possible. Question: how many threads access the same location?
Next, each thread accesses the first element in a column of B (Thread 0, ..., Thread i, ..., Thread N-1). Consider those threads that access different columns. Given the row-major order in which B is stored, those threads access locations that are consecutive. Good! Memory coalescing is possible. Question: how many threads access the same location?
How can we get better memory accesses and memory coalescing?

Transpose one array: copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly. Personally, I have not found this to help, because of the overhead of doing the transpose sequentially.
Sequential code for a transpose using the same array:

for (i = 0; i < N; i++)
   for (j = 0; j < i; j++) {
      temp = B[i][j];
      B[i][j] = B[j][i];
      B[j][i] = temp;
   }

(In my code, I use separate arrays.) The transpose could be done on the host prior to copying to the device. How would the code look if done on the device?
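As one possible answer to that question, a minimal sketch of a device-side transpose using separate arrays (the kernel name gpu_transpose and the bounds check are my own; a tuned version would stage elements through shared-memory tiles so that both the reads and the writes coalesce):

__global__ void gpu_transpose(int *in, int *out, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (row < N && col < N)
      out[col * N + row] = in[row * N + col];  // reads coalesce; writes are strided
}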
/* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY ----- */

transposeArray(a, a_T, N);                 // transpose array on the host

cudaEventRecord(start, 0);                 // time measured from before the
                                           // host-device copy, but not the transpose
// cudaEventSynchronize(start);            // Needed?

cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

cudaEventRecord(stop, 0);                  // measure end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2);   // print out execution time
Some results: 8 x 8 array, 1 block of 8 x 8 threads. Speedup = 1.62 over not transposing the array.
Some results: 32 x 32 array, 1 block of 32 x 32 threads. Speedup = 1.17 over not transposing the array.
Some results: 256 x 256 array, 8 x 8 blocks of 32 x 32 threads. Speedup = 0.89!! over not transposing the array.
Some results: 1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads. Speedup = 0.93!! over not transposing the array.
Questions