Modeling and Evaluation for Gather/Scatter Operations in Vector-SIMD Architectures
Tan Hongbing, Liu Sheng†, Chen Haiyan
School of National University of Defense Technology
Good morning, everyone. Today I would like to present the paper "Modeling and Evaluation for Gather/Scatter Operations in Vector-SIMD Architectures". I am the corresponding author, Liu Sheng.
Presentation Outline
1. Introduction
2. Models and Verification
3. Evaluation and Results
4. Conclusion
The report consists of four parts. The first is the Introduction.
Gather-Scatter in Vector-SIMD architectures
Gathers: vector of addrs -> vector register
Scatters: vector register -> vector of addrs
- Reads and writes to different sub-banks are performed in parallel
- Multiple reads or writes to the same sub-bank address are combined into a single access
- Reads are overlapped across different gathers; writes are overlapped across different scatters
In the picture, we can see …
Gather-Scatter in applications
At present, vector-SIMD has become a popular architecture in modern processors, in which the vector memory provides sufficient data bandwidth through multiple memory banks. However, many applications, such as sparse-matrix computation, image warping, and image histograms, access locations that are random and cannot be predicted in advance. A vector memory that supports only the common access patterns will incur a long, unhidden memory latency. Gather/scatter instructions directly implement these random vector access modes, and so can satisfy the data requirements of these applications.
Definition of Gather-Scatter
Scatter:
Gather and scatter are dual operations. A scatter performs indexed writes to an array, and a gather performs indexed reads from an array. In the above equation, the array L contains distinct write locations for the scatter, or read locations for the gather.
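The indexed read and write described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the hardware behavior; the names `gather`, `scatter`, `memory`, and the sample values are assumptions made for this example, while `L` plays the role of the location array from the slide.

```python
def gather(src, idx):
    """Indexed read: result[i] = src[idx[i]]."""
    return [src[i] for i in idx]


def scatter(dst, idx, values):
    """Indexed write: dst[idx[i]] = values[i]. Returns an updated copy."""
    out = list(dst)
    for i, v in zip(idx, values):
        out[i] = v
    return out


memory = [10, 20, 30, 40, 50, 60, 70, 80]  # hypothetical memory contents
L = [6, 1, 3, 0]                           # distinct access locations

vreg = gather(memory, L)                   # read 4 scattered elements
updated = scatter(memory, L, [1, 2, 3, 4]) # write 4 elements back
print(vreg)
print(updated)
```

Because the locations in `L` are distinct, the order in which the scatter writes are applied does not affect the result.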
Gather/scatter has stochastic and complicated properties, and the hardware design of gather/scatter operations lacks theoretical analysis and modeling.
- What are the possible distributions of access locations for given PE and memory bank counts?
- What is the probability of each distribution?
- How can the hardware implementation be optimized in detail?
The proposed model gives the answers.
Next: Models and Verification.
Example
- Both the SIMD width and the number of (single-port) memory banks are 4
- The Maximum Conflicts Per Cycle (MCPC) is equal to the maximum element of the Distribution of Access Locations, which we call DAL for short
(1) MCPC=4, {4,0,0,0}
(2) MCPC=3, {3,1,0,0}
(3) MCPC=2, {2,2,0,0}
(4) MCPC=2, {2,1,1,0}
(5) MCPC=1, {1,1,1,1}
Callout: 4 access locations divided into 2 groups are distributed across two different memory banks
To facilitate understanding, we use a simple example to explain our model. We assume that both the SIMD width and the number of memory banks are 4. The MCPC is equal to the maximum element of the access location distribution. For this case, we list all 5 DALs.
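The five DALs of this example can be reproduced by brute force. The sketch below assumes that every PE issues exactly one access to one bank; it tries all 4^4 bank assignments and keeps the distinct per-bank counts, sorted in descending order. The function name `enumerate_dals` is chosen for this illustration only.

```python
from itertools import product


def enumerate_dals(num_pes, num_banks):
    """Return every distinct DAL, sorted so the largest MCPC comes first."""
    dals = set()
    for assignment in product(range(num_banks), repeat=num_pes):
        per_bank = [0] * num_banks
        for bank in assignment:          # bank hit by each PE
            per_bank[bank] += 1
        # A DAL ignores which bank is which, so store the sorted counts.
        dals.add(tuple(sorted(per_bank, reverse=True)))
    return sorted(dals, reverse=True)


dals = enumerate_dals(4, 4)
for dal in dals:
    print("MCPC =", max(dal), "DAL =", dal)
```

Brute force is only practical for small configurations, which is why a recursive relation is needed for larger numbers of banks.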
Relation among Distributions of Access Location (DAL)
f(4,2): f(4,1); {2,1,1,0}; {2,2,0,0}
f(4,3): f(4,2); {3,1,0,0}
f(4,4): f(4,3); {4,0,0,0}
f(7,7): f(7,1); {2,f(5,2)}; {3,f(4,3)}; {4,f(3,3)}; {5,f(2,2)}; {6,f(1,1)}; g(7)
f(a,b) is the set of all DALs for a PEs whose maximum element is less than or equal to b.
The case of 4 memory banks is simple, so we can list all DALs easily, but it is hard to list all DALs for a large number of memory banks. Luckily, we find some recursive relations among DALs. First, we define a function f: f(a,b) is the set of DALs whose MCPC<=b with a memory banks. It is easy to see that f(7,7) can be represented by the function f at smaller scales.
Modeling the DAL
f(a,b) is the set of all DALs for a PEs whose maximum element is less than or equal to b.
⌊A/B⌋ is the integer portion of the quotient of A divided by B.
According to the relations among different DALs, we can deduce an equation to calculate the DALs for different numbers of memory banks.
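A recursion in the same spirit can be sketched as follows. This is not the equation from the slide; it is a standard partition-counting recursion that assumes a DAL is an integer partition of the access count into at most `banks` parts. The name `count_dals` and its parameters are illustrative. The idea is to fix the largest element first and then count the ways to distribute the remaining accesses with elements no larger than it.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_dals(accesses, max_conflicts, banks):
    """Count DALs of `accesses` accesses over at most `banks` banks
    whose largest element (MCPC) is <= max_conflicts."""
    if accesses == 0:
        return 1                      # nothing left to place
    if banks == 0:
        return 0                      # accesses left but no bank available
    total = 0
    for largest in range(1, min(accesses, max_conflicts) + 1):
        # Put `largest` accesses in one bank; the rest must not exceed it.
        total += count_dals(accesses - largest, largest, banks - 1)
    return total


for n in (4, 8, 16, 32, 64):
    print(n, "PEs:", count_dals(n, n, n), "DALs")

# DALs with MCPC exactly 3 for 16 PEs and 16 banks
print(count_dals(16, 3, 16) - count_dals(16, 2, 16))
```

Under these assumptions the counts for 4, 8, 16, 32, and 64 PEs come out as 5, 22, 231, 8349, and 1741630, the same values listed in the model verification table.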
Modeling the Probability of Access Conflict (PAC)
(1) MCPC=4, {4,0,0,0}
(2) MCPC=3, {3,1,0,0}
(3) MCPC=2, {2,2,0,0}
(4) MCPC=2, {2,1,1,0}
(5) MCPC=1, {1,1,1,1}
Equation labels: all possible permutations of the j-th DAL; the probability of the j-th DAL
Through the DAL model we can obtain all the DALs, but we do not know the probability of each DAL. The PAC is the probability of each DAL. It can be derived with basic knowledge of permutations and combinations.
Modeling the PAC
The data used in this equation come from D:
- D[i,j] is the i-th element in the j-th row;
- O[j] is the number of non-zero elements in the j-th row;
- G(i,j) is the sum of the first i elements in the j-th row;
- M(j) is an intermediate variable for the calculation;
- F(m,j) is the number of elements equal to m in the j-th row.
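The probability of a single DAL can be sketched with elementary counting. The code below is not the equation from the slide; it assumes that every access independently picks a bank uniformly at random, and it counts (a) the ways to split the accesses into groups of the given sizes and (b) the ways to place those groups on banks. The name `dal_probability` and the local variables are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def dal_probability(dal, num_banks):
    """Exact probability of one DAL under uniformly random accesses."""
    num_accesses = sum(dal)
    padded = list(dal) + [0] * (num_banks - len(dal))

    # (a) multinomial: which accesses fall into which group
    split = factorial(num_accesses)
    for size in padded:
        split //= factorial(size)

    # (b) distinct ways to assign the group sizes to the banks;
    #     equal sizes (including empty banks) are interchangeable
    place = factorial(num_banks)
    for multiplicity in Counter(padded).values():
        place //= factorial(multiplicity)

    return Fraction(split * place, num_banks ** num_accesses)


dals = [(4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0), (2, 1, 1, 0), (1, 1, 1, 1)]
for dal in dals:
    print(dal, dal_probability(dal, 4))
```

For the 4-PE, 4-bank example the five probabilities are 4/256, 48/256, 36/256, 144/256, and 24/256 (printed in lowest terms), which sum to 1.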
Model verification (by MATLAB)
Validating the DAL model: all measured and estimated results are exactly the same.
Configuration           Number of DALs (measured and estimated)
4 PEs                   5
8 PEs                   22
16 PEs (total)          231
MCPC<=2 (16 PEs)        9
MCPC=3 (16 PEs)         21
MCPC=4 (16 PEs)         34
MCPC=5 (16 PEs)         37
MCPC=6 (16 PEs)         35
MCPC=7 (16 PEs)         28
MCPC=8 (16 PEs)
MCPC=9 (16 PEs)         15
MCPC=10 (16 PEs)        11
MCPC>=11 (16 PEs)       19
32 PEs                  8349
64 PEs                  1741630
Validating the PAC model: the average accuracy of our model on gather/scatter is over 98% (min: 97.3%, max: 100%) when the read/write locations are totally random.
We validate our model with a MATLAB program. For the DAL model, the measured and estimated results are exactly the same. For the PAC model, the average accuracy is over 98%.
Evaluation and Results
For hardware designers, two common methods can improve gather/scatter performance:
(1) Organizing memory banks into separate sub-banks
(2) Adding buffers to cache memory requests
Evaluation and Results
Analysis of MCPC as the PE:Bank ratio varies
Figure annotations: more than 80% of DALs have MCPC<=4; more than 90% of DALs have MCPC<=3; more than 90% of DALs have MCPC<=2
The performance of gather/scatter is closely related to the ratio of PEs to memory banks
Because the performance of gather/scatter is closely related to the depth of the buffer array and the ratio of PEs to memory banks, we use the proposed model to analyze the MCPC as the ratio of PE count to bank count varies. From the picture we can see:
- When PE:Bank = 1:1, more than 90% of DALs have MCPC<=4
- When PE:Bank = 1:2, more than 90% of DALs have MCPC<=3
- When PE:Bank = 1:4, more than 90% of DALs have MCPC<=2
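The kind of cumulative figure discussed above can be explored with a short exact computation. The sketch below is not the model from the paper; it assumes uniformly random accesses and uses a dynamic program over the banks to count the assignments in which no bank receives more than k accesses. The function name `prob_mcpc_at_most` and the choice of 16 PEs are assumptions made for illustration, so the printed values need not match the figure.

```python
from fractions import Fraction
from math import comb


def prob_mcpc_at_most(num_pes, num_banks, k):
    """Exact P(MCPC <= k) for uniformly random accesses."""
    # ways[n]: assignments of n labelled accesses to the banks seen so far
    # such that no bank holds more than k of them
    ways = [1] + [0] * num_pes
    for _ in range(num_banks):
        nxt = [0] * (num_pes + 1)
        for n in range(num_pes + 1):
            for c in range(min(k, n) + 1):   # c accesses go to the new bank
                nxt[n] += comb(n, c) * ways[n - c]
        ways = nxt
    return Fraction(ways[num_pes], num_banks ** num_pes)


for banks in (16, 32, 64):                   # PE:Bank = 1:1, 1:2, 1:4
    row = [float(prob_mcpc_at_most(16, banks, k)) for k in (1, 2, 3, 4)]
    print(f"16 PEs, {banks} banks:", [round(p, 3) for p in row])
```

Each row gives P(MCPC<=k) for k = 1 to 4, so the effect of adding banks at a fixed PE count can be read off directly.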
Evaluation and Results
Analysis for selecting the proper number of memory banks
Average Number of Access Conflicts (NAC); φ(k) stands for the DALs whose MCPC is k
Figure annotations: NAC=2.12, NAC=1.64, NAC=1.32; 2.12, 2.59, 3.05, 3.45, 3.76
The picture shows the average number of access conflicts as the ratio of the number of PEs to the number of memory banks varies. From the picture, we can draw two conclusions. First, if the ratio of PEs to memory banks is fixed, a larger number of memory banks results in more conflicts. Second, when the number of PEs is fixed, the conflicts decrease as PE:Bank decreases.
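As a cross-check on the trend described above, the average MCPC can also be estimated by simulation. The sketch below is a Monte Carlo experiment, not the analytical model; it assumes uniformly random accesses and reads the NAC as the expected MCPC, which is an interpretation of the slide rather than its stated formula. The function name `simulate_nac`, the trial count, and the seed are illustrative.

```python
import random


def simulate_nac(num_pes, num_banks, trials=100_000, seed=0):
    """Monte Carlo estimate of the average MCPC per gather/scatter."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        per_bank = [0] * num_banks
        for _ in range(num_pes):             # each PE picks a random bank
            per_bank[rng.randrange(num_banks)] += 1
        total += max(per_bank)               # MCPC of this access vector
    return total / trials


nac_4_4 = simulate_nac(4, 4)
nac_4_8 = simulate_nac(4, 8)
print("4 PEs, 4 banks:", round(nac_4_4, 3))
print("4 PEs, 8 banks:", round(nac_4_8, 3))
```

For 4 PEs and 4 banks the exact expectation under these assumptions is 544/256 = 2.125, so the estimate should land close to that value, and doubling the number of banks should lower it.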
Evaluation and Results
Figure annotations: 3.05, 1.98, 1.34
- The deeper the buffer array, the shorter the runtime
- The runtime decreases as the ratio of PEs to memory banks decreases
This picture shows the relation between the runtime, the buffer depth, and the memory bank count. The runtime drops considerably after adding a buffer array, and it decreases further as the ratio of PEs to memory banks decreases.
Evaluation and Results
The effect of buffer array depth on the performance improvement
Figure annotations: Dispersive; Very close
The picture shows the improvement of gather/scatter performance after adding buffers. The effect of buffer depth on the performance improvement becomes smaller as the ratio of PEs to memory banks decreases. When PE:Bank = 1:1, the three curves are far apart, which means the performance improvement differs greatly for buffer array depths of 2, 4, and 8. When PE:Bank = 1:4, the three curves are very close, because a buffer array depth of 2 is already enough.
Conclusion
- This model can give all the possible DALs, PACs, and so on for gather/scatter operations in various situations.
- This model can help users select the optimum number of memory banks and guide designers in selecting the proper number of buffers. (For example, if SIMD=16, banks consisting of 2 sub-banks each and a buffer depth of 4 are recommended.)
Thank you!