
Prof. Zhang Gang School of Computer Sci. & Tech.


1 Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 10 Stride
Prof. Zhang Gang School of Computer Sci. & Tech. Tianjin University, Tianjin, P. R. China

2 Stride Stride is the distance separating elements in memory that will be adjacent in a vector register. Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
When an array is allocated memory, it is linearized and must be laid out in either row-major (as in C) or column-major (as in Fortran) order.
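The two layouts can be sketched as offset calculations. This is a minimal illustration, not part of the original slides; the function names `row_major_offset` and `col_major_offset` are hypothetical, and offsets are counted in elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset of entry (i, j) in a rows x cols array stored row by row
 * (the C convention): consecutive j values are adjacent in memory. */
size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return i * cols + j;
}

/* Element offset of entry (i, j) in a rows x cols array stored column by
 * column (the Fortran convention): consecutive i values are adjacent. */
size_t col_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return j * rows + i;
}
```

For a 100 x 100 array, stepping to the next row moves 100 elements in row-major order but only 1 element in column-major order, and stepping to the next column does the opposite.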

3 Stride This linearization means that either the elements in a row or the elements in a column are not adjacent in memory. For example, the C code above allocates its arrays in row-major order, so the elements of D that are accessed by successive iterations of the inner loop are separated by the row size times 8 (the number of bytes per entry), for a total of 800 bytes. This distance separating elements that are to be gathered into a single register is called the stride. For vector processors without caches, we need another technique to fetch elements of a vector that are not adjacent in memory.
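The 800-byte figure can be checked directly in C by measuring the distance between the addresses the inner loop touches. The sketch below assumes 8-byte doubles and 100 x 100 arrays, as in the slide; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 100 };              /* matrix dimension from the slide's loop nest */

static double B[N][N];
static double D[N][N];

/* Distance in bytes between two elements of the same array. */
static size_t byte_distance(const double *from, const double *to)
{
    return (size_t)((const char *)to - (const char *)from);
}

/* Inner loop reads B[i][k], B[i][k+1], ...: neighbours within one row. */
size_t b_inner_stride_bytes(void)
{
    assert(sizeof(double) == 8);   /* assumption stated in the slide */
    return byte_distance(&B[0][0], &B[0][1]);
}

/* Inner loop reads D[k][j], D[k+1][j], ...: same column, next row. */
size_t d_inner_stride_bytes(void)
{
    assert(sizeof(double) == 8);
    return byte_distance(&D[0][0], &D[1][0]);
}
```

Under these assumptions `b_inner_stride_bytes()` gives 8 (unit stride) and `d_inner_stride_bytes()` gives 800, that is, a stride of 100 elements.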

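4 Stride To vectorize the loop, we must vectorize the multiplication of rows of B with columns of D. The rows of B can be read with unit stride, but the columns of D require a non-unit stride (a stride greater than one). A bank conflict (stall) occurs when the same bank is accessed again before its bank busy time has elapsed. A strided access stream visits #banks / GCD(stride, #banks) distinct banks before returning to the first one, so a stall occurs when: #banks / GCD(stride, #banks) < bank busy time (in # of cycles). Memory operations: load/store operations move groups of data between registers and memory.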
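The bank-conflict condition can be sketched as a small predicate. This is an illustration under stated assumptions: the stride is measured in words, consecutive words are interleaved across banks, one access is issued per cycle, and the number of distinct banks a strided stream visits is #banks divided by the greatest common divisor of the stride and #banks. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks a strided access stream cycles through. */
unsigned banks_visited(unsigned stride, unsigned num_banks)
{
    assert(stride > 0 && num_banks > 0);
    return num_banks / gcd_u(stride, num_banks);
}

/* True when a bank is revisited before its busy time has elapsed,
 * assuming one access is issued per clock cycle. */
bool stride_stalls(unsigned stride, unsigned num_banks, unsigned busy_cycles)
{
    return banks_visited(stride, num_banks) < busy_cycles;
}
```

With 8 banks and a busy time of 6 cycles, a stride of 1 visits all 8 banks and does not stall, while a stride of 32 keeps hitting a single bank and stalls on every access.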

5 Stride Three types of addressing:
Unit stride: the fastest.
Non-unit (constant) stride.
Indexed (gather-scatter): the vector equivalent of register indirect addressing; good for sparse arrays of data; increases the number of programs that vectorize.
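The three addressing types can be modelled with scalar C loops that show which memory element lands in each vector register slot. This is only a sketch of the semantics, not how any particular vector instruction set implements them; the function names are hypothetical, `vl` stands for the vector length, and strides and indices are counted in elements.

```c
#include <assert.h>
#include <stddef.h>

/* Unit stride: slot i is filled from mem[i]. */
void load_unit(double *vreg, const double *mem, size_t vl)
{
    assert(vreg != NULL && mem != NULL);
    for (size_t i = 0; i < vl; i++)
        vreg[i] = mem[i];
}

/* Non-unit (constant) stride: slot i is filled from mem[i * stride]. */
void load_strided(double *vreg, const double *mem, size_t stride, size_t vl)
{
    assert(vreg != NULL && mem != NULL);
    for (size_t i = 0; i < vl; i++)
        vreg[i] = mem[i * stride];
}

/* Indexed load (gather): slot i is filled from mem[index[i]]. */
void load_gather(double *vreg, const double *mem, const size_t *index, size_t vl)
{
    assert(vreg != NULL && mem != NULL && index != NULL);
    for (size_t i = 0; i < vl; i++)
        vreg[i] = mem[index[i]];
}

/* Indexed store (scatter): slot i is written to mem[index[i]]. */
void store_scatter(double *mem, const size_t *index, const double *vreg, size_t vl)
{
    assert(vreg != NULL && mem != NULL && index != NULL);
    for (size_t i = 0; i < vl; i++)
        mem[index[i]] = vreg[i];
}
```

For a sparse array, `index` would hold the positions of the nonzero entries, so a gather packs them densely into the register and a scatter puts the results back.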

6 Example Suppose we have 8 memory banks with a bank busy time of 6 cycles and a total memory latency of 12 cycles. How long will it take to complete a 64-element vector load with a stride of 1? With a stride of 32? How should we understand the phrase "a total memory latency of 12 cycles"? Compare memory access time with memory access cycle. It should be understood as: 12 cycles of waiting elapse before the data read from memory is obtained. It should not be understood as: obtaining the data read from memory takes 12 cycles.

7 Example (Answer) Stride of 1: the number of banks (8) is greater than the bank busy time (6 cycles), so no bank is revisited while it is still busy, and the load takes 12 + 64 = 76 clock cycles, or 76/64 = 1.2 clock cycles per element. Stride of 32: the worst case occurs when the stride is a multiple of the number of banks, as it is here (32 = 4 × 8). Every access goes to the same bank and collides with the previous one, so the total time is 12 + 1 + 6 × 63 = 391 clock cycles, or 391/64 = 6.1 clock cycles per element.
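Both answers can be reproduced with a small cycle-counting model. It is a sketch under assumptions that are not spelled out in the slides: addresses are counted in words, word w lives in bank w mod #banks, at most one request is issued per cycle, a bank accepts a new request only after its busy time has elapsed, and the total time is the latency plus the cycle in which the last request is issued. The name `vector_load_cycles` and the limit `MAX_BANKS` are hypothetical.

```c
#include <assert.h>

enum { MAX_BANKS = 64 };   /* arbitrary upper bound for this sketch */

/* Cycles to complete an n-element strided vector load under the
 * simple interleaved-bank model described above. */
unsigned vector_load_cycles(unsigned n, unsigned stride, unsigned num_banks,
                            unsigned busy, unsigned latency)
{
    assert(n > 0 && num_banks > 0 && num_banks <= MAX_BANKS);

    unsigned bank_free[MAX_BANKS] = {0}; /* first cycle each bank is idle again */
    unsigned issue = 0;                  /* cycle in which the latest request issued */

    for (unsigned i = 0; i < n; i++) {
        unsigned bank = (i * stride) % num_banks;
        /* Next request can issue one cycle after the previous one at the
         * earliest, and not before its bank is free. */
        unsigned earliest = (i == 0) ? 0 : issue + 1;
        issue = (earliest > bank_free[bank]) ? earliest : bank_free[bank];
        bank_free[bank] = issue + busy;
    }
    /* Requests were issued in cycles 0..issue; add the memory latency. */
    return latency + issue + 1;
}
```

With the example's parameters, `vector_load_cycles(64, 1, 8, 6, 12)` gives 76 and `vector_load_cycles(64, 32, 8, 6, 12)` gives 391, matching the two answers above.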

8 Exercises
What is the meaning of stride?
What does row-major order mean when an array is stored in memory?
What does column-major order mean when an array is stored in memory?
What is the meaning of unit stride?
What is the meaning of non-unit stride?
When does a memory bank conflict occur?

