Published by Loreen Jackson. Modified over 6 years ago.
Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Topic 10: Stride
Prof. Zhang Gang School of Computer Sci. & Tech. Tianjin University, Tianjin, P. R. China
Stride
Stride is the distance separating elements in memory that will be adjacent in a vector register. Consider:
    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }
When an array is allocated memory, it is linearized and must be laid out in either row-major (as in C) or column-major (as in Fortran) order.
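To make the two layouts concrete, here is a minimal sketch of how a 2-D index could be mapped to a linear element offset under each ordering. The function names `row_major_offset` and `col_major_offset` are hypothetical helpers introduced for illustration and are not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset of [i][j] in a rows x cols array stored row by row (as in C).
   Hypothetical helper, for illustration only. */
size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols) {
    assert(i < rows && j < cols);
    return i * cols + j;   /* neighbours along a row are 1 element apart */
}

/* Offset of the same element when stored column by column (as in Fortran). */
size_t col_major_offset(size_t i, size_t j, size_t rows, size_t cols) {
    assert(i < rows && j < cols);
    return j * rows + i;   /* neighbours down a column are 1 element apart */
}
```

For a 100 x 100 array, stepping the row index by one moves 100 elements under row-major order but only 1 element under column-major order, and the reverse holds for the column index.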
Stride
This linearization means that either the elements in a row or the elements in a column are not adjacent in memory.
For example, the C code allocates in row-major order, so the elements of D that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry), for a total of 100 × 8 = 800 bytes.
This distance separating elements to be gathered into a single register is called the stride.
For vector processors without caches, we need another technique to fetch elements of a vector that are not adjacent in memory.
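The 800-byte figure can be checked directly with pointer arithmetic. The sketch below measures the byte distance between successive elements down a column of a 100 x 100 array of doubles (the access pattern of D[k][j] as k advances) and along a row (the pattern of B[i][k]). The helper names are hypothetical, and the 800-byte result assumes `sizeof(double)` is 8, as the slide does.

```c
#include <assert.h>
#include <stddef.h>

#define N 100
static double D[N][N];   /* row-major, as in the C loop nest */

/* Byte distance between two elements of the same array (hypothetical helper). */
size_t byte_distance(const double *from, const double *to) {
    assert(to >= from);
    return (size_t)((const char *)to - (const char *)from);
}

/* Stride when walking down a column: D[k][j] -> D[k+1][j]. */
size_t stride_down_column_bytes(void) {
    return byte_distance(&D[0][0], &D[1][0]);   /* N * sizeof(double) */
}

/* Stride when walking along a row: D[k][j] -> D[k][j+1]. */
size_t stride_along_row_bytes(void) {
    return byte_distance(&D[0][0], &D[0][1]);   /* sizeof(double) */
}
```

With 8-byte doubles, the column walk is the non-unit stride of 800 bytes and the row walk is the unit stride of 8 bytes.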
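Stride
Must vectorize multiplication of rows of B with columns of D.
Use non-unit stride (strides greater than one).
A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
    #banks / GCD(stride, #banks) < bank busy time (in cycles)
A strided access stream returns to the same bank every #banks / GCD(stride, #banks) accesses, so with one address issued per cycle the left-hand side is the number of cycles between successive hits on one bank.
Memory operations: load/store operations move groups of data between registers and memory.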
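As a rough check of the bank-conflict condition, the sketch below assumes that banks are interleaved by element, that one address is issued per cycle, and that a strided stream therefore returns to the same bank every #banks / GCD(stride, #banks) accesses. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of accesses between two hits on the same bank (assumed model). */
unsigned bank_revisit_interval(unsigned stride, unsigned banks) {
    assert(banks > 0);
    return banks / gcd(stride, banks);
}

/* True when a bank is revisited before its busy time has elapsed. */
bool has_bank_conflict(unsigned stride, unsigned banks, unsigned busy_cycles) {
    return bank_revisit_interval(stride, banks) < busy_cycles;
}
```

With 8 banks and a busy time of 6 cycles, a stride of 1 revisits a bank every 8 accesses (no conflict), while a stride of 32 revisits the same bank on every access (conflict).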
Stride
Three types of addressing:
Unit stride: fastest.
Non-unit (constant) stride.
Indexed (gather-scatter): the vector equivalent of register indirect addressing; good for sparse arrays of data; increases the number of programs that vectorize.
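The three addressing types can be sketched as scalar C loops that describe which memory elements end up adjacent in the destination. This is only an illustration of the access patterns, not vector code; the function names are hypothetical, and strides are counted in elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Unit stride: consecutive elements. */
void load_unit(double *dst, const double *src, size_t n) {
    assert(dst && src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Non-unit (constant) stride: every stride-th element. */
void load_strided(double *dst, const double *src, size_t n, size_t stride) {
    assert(dst && src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];
}

/* Indexed (gather): element positions come from an index vector. */
void load_gather(double *dst, const double *src, const size_t *idx, size_t n) {
    assert(dst && src && idx);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```

The gather form is what makes sparse data workable: the index vector names only the nonzero positions, so the loop body still operates on a dense destination.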
Example
Suppose we have 8 memory banks with a bank busy time of 6 cycles and a total memory latency of 12 cycles. How long will it take to complete a 64-element vector load with a stride of 1? With a stride of 32?
How should we understand the phrase "a total memory latency of 12 cycles"? Compare memory access time with memory access cycle: it should be understood as waiting 12 cycles before the data read from memory is obtained, not as taking 12 cycles to obtain the data read from memory.
Example (Answer)
Stride of 1: the number of banks (8) is greater than the bank busy time (6 cycles), so no access has to wait for a busy bank. The load takes 12 + 64 = 76 clock cycles, or 76/64 ≈ 1.2 clock cycles per element.
Stride of 32: the worst case occurs when the stride is a multiple of the number of banks, as it is here (32 = 4 × 8). Every access hits the same bank and collides with the previous one. The total time is therefore 12 + 1 + 6 × 63 = 391 clock cycles, or 391/64 ≈ 6.1 clock cycles per element.
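The two totals can be reproduced with a small simulation. This is a sketch under assumed rules that match the arithmetic in the answer: banks are interleaved by element, at most one access starts per cycle, an access waits until its bank is free, a bank stays busy for the bank busy time, and the load completes `latency + 1` cycles after the last access starts. The function name and `MAX_BANKS` are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 64

/* Cycles to complete an n-element strided vector load (assumed timing model). */
long vector_load_cycles(int n, long stride, int banks, int busy, int latency) {
    assert(n > 0 && stride >= 0 && banks > 0 && banks <= MAX_BANKS);

    long free_at[MAX_BANKS] = {0};  /* cycle at which each bank is free again */
    long next_issue = 0;            /* earliest cycle the next access may start */
    long last_start = 0;

    for (int i = 0; i < n; i++) {
        int bank = (int)(((long)i * stride) % banks);
        /* Start when both the issue slot and the bank are available. */
        long start = next_issue > free_at[bank] ? next_issue : free_at[bank];
        free_at[bank] = start + busy;
        next_issue = start + 1;
        last_start = start;
    }
    return last_start + latency + 1;
}
```

Calling `vector_load_cycles(64, 1, 8, 6, 12)` gives 76 and `vector_load_cycles(64, 32, 8, 6, 12)` gives 391, matching the hand calculation. Other strides, such as 2 or 4, can be tried the same way to see partial conflicts.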
Exercises
What is the meaning of stride?
What does row-major order mean when an array is stored in memory?
What does column-major order mean when an array is stored in memory?
What is the meaning of unit stride?
What is the meaning of non-unit stride?
When does a memory bank conflict occur?