"Developing an Efficient Sparse Matrix Framework Targeting SSI Applications"
Diego Rivera and David Kaeli
The Center for Subsurface Sensing and Imaging Systems (CenSSIS)
Electrical and Computer Engineering Department, Northeastern University, Boston, MA
{drivera,

Introduction and Goals
A barrier encountered in many CenSSIS applications is the computational time required to generate a solution. This is most critical in sparse algorithms: because of indirect indexing, the number of non-contiguous memory system accesses increases, which can increase the number of memory stalls and the overall application execution time. Moving to the right data layout can significantly improve the performance of solving the system, but predicting and improving locality is complicated, because the reference patterns are irregular and the non-zero structure of the matrix is unknown a priori. In addition, selecting a preconditioner that is efficient for a particular application has become more difficult: there is a large number of options, and the people who will use these methods often have little knowledge of their characteristics.

We are working on the development and implementation of a framework that will allow us to:
- predict the best function to predict locality for distributed shared-memory machines
- predict execution times of computations and communications
- modify the data layout to increase locality
- choose the iterative method and preconditioner, including their dynamic parameters

Jagged Diagonal (JAD) format
Second cache hierarchy simulation:
- replacement policy: LRU
- level 1 (data and instructions): cache size 8 KB, line size 64 B, 4-way associative
- level 2: cache size 256 KB, line size 128 B, 8-way associative

Example 6×6 matrix (left) and the same matrix with its rows reordered by decreasing number of non-zeros (right):

a11 a12  .  a14  .   .        a42  .   .  a44 a45 a46
 .  a22 a23  .  a25  .        a11 a12  .  a14  .   .
a31  .  a33 a34  .   .         .  a22 a23  .  a25  .
 .  a42  .  a44 a45 a46       a31  .  a33 a34  .   .
 .   .   .   .  a55 a56        .   .   .   .  a55 a56
 .   .   .   .  a65 a66        .   .   .   .  a65 a66

Second set of matrices:
- Name: s3dkq4m2, N = 90449, Nnz = , WB = 615
- Name: Fidap011, N = 16614, Nnz = , WB = 9680

value list = a42 | a11 | a22 | a31 | a55 | a65
|| a44 | a12 | a23 | a33 | a56 | a66 || a45 | a14 | a25 | a34 || a46
column indexes = 2 | 1 | 2 | 1 | 5 | 5 || 4 | 2 | 3 | 3 | 6 | 6 || 5 | 4 | 5 | 4 || 6
start positions = 1 | 7 | 13 | 17 | 18
perm vector = 4 | 2 | 3 | 1 | 5 | 6

Permuting Y back to the original row order:
for i ← 1 to N
    Temp[perm vector[i]] := Y[i]
end for
for i ← 1 to N
    Y[i] := Temp[i]
end for

JAD sparse matrix-vector product:
disp := 0
for i ← 1 to num jdiag
    for j ← 1 to (start positions[i+1] − start positions[i])
        Y[j] := Y[j] + value list[disp] × X[column indexes[disp]]
        disp := disp + 1
    end for
end for

Distance vs. cache misses
[figure: measured cache misses plotted against the distance between pairs of rows]

Distances
Let nelems(x) be the number of non-zeros of row x and aelems(x, y) the number of column indexes that rows x and y share. For the illustrated pair of rows, aelems(x, y) = 2. Then:
D1(x, y) = max(nelems(x), nelems(y)) − aelems(x, y)
D2(x, y) = nelems(x) + nelems(y) − 2 × aelems(x, y)

Note: JAD uses indirect addressing on X through column indexes, so measuring distance by row is correct. But since JAD stores by jagged diagonal, the column indexes should also be compared by jagged diagonal: how many different indexes does each jagged diagonal contain? If
column indexes = 2 | 1 | 2 | 1 | 5 | 5 || 4 | 2 | 3 | 3 | 6 | 6 || 5 | 4 | 5 | 4 || 6
then
D3 = (6 − 3) + (6 − 2) + (4 − 2) + (1 − 0) = 10
where each term is the number of elements of a jagged diagonal minus its repeated column indexes, i.e. the number of different indexes in that diagonal. In the by-diagonal comparison of the example, aelems(x, y) = 1. (For comparison, CCS uses indirect addressing on Y through row indexes, so there distance is correctly measured by column.)

Clustering - Optimizing
[figure: rows grouped into clusters after optimization]

Framework flow: new matrix → feature extraction → data mining (models and rules, samples) → choosing the parameters and the function to predict locality, modifying locality, and choosing the iterative method and preconditioner, including their dynamic parameters → system solved.
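As a concrete check of the JAD layout and the distance functions above, here is a minimal Python sketch using NumPy and 0-based indexing (the poster uses 1-based). The numeric values (a_ij written as i + j/10) and the function names `jad_spmv`, `aelems`, `d1`, `d2` are illustrative assumptions, not part of the original framework:

```python
import numpy as np

# JAD arrays for the 6x6 example matrix, 0-based.
# Rows are stored in decreasing-nnz order: 4, 1, 2, 3, 5, 6 (1-based).
value_list = np.array([4.2, 1.1, 2.2, 3.1, 5.5, 6.5,   # 1st jagged diagonal
                       4.4, 1.2, 2.3, 3.3, 5.6, 6.6,   # 2nd
                       4.5, 1.4, 2.5, 3.4,             # 3rd
                       4.6])                           # 4th
column_indexes = np.array([1, 0, 1, 0, 4, 4,
                           3, 1, 2, 2, 5, 5,
                           4, 3, 4, 3,
                           5])
start_positions = np.array([0, 6, 12, 16, 17])
row_order = [3, 0, 1, 2, 4, 5]           # permuted position -> original row

def jad_spmv(values, cols, starts, order, x):
    """y = A @ x, one stride-1 pass per jagged diagonal."""
    n = len(order)
    y_perm = np.zeros(n)
    for d in range(len(starts) - 1):
        lo, hi = starts[d], starts[d + 1]
        # diagonal d holds one element for each of the first (hi - lo) rows
        y_perm[:hi - lo] += values[lo:hi] * x[cols[lo:hi]]
    y = np.zeros(n)
    y[order] = y_perm                    # undo the row permutation
    return y

# Distance functions on the sets of column indexes of two rows:
def aelems(cx, cy):                      # number of shared column indexes
    return len(set(cx) & set(cy))

def d1(cx, cy):
    return max(len(cx), len(cy)) - aelems(cx, cy)

def d2(cx, cy):
    return len(cx) + len(cy) - 2 * aelems(cx, cy)
```

For example, rows 1 and 3 of the matrix share columns {1, 4}, so aelems = 2, D1 = 3 − 2 = 1 and D2 = 3 + 3 − 2×2 = 2.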
Conclusions
- An indirect measure of locality is the reuse distance, which can be calculated, for example, by counting the number of matches between sets of columns/rows; how it is measured depends on the storage scheme selected.
- Minimizing a distance function leads to maximizing locality.
- Different distance functions can be defined for a specific format, but their accuracy in describing the locality of a specific matrix can depend on the non-zero structure of the matrix, because in order to measure locality these functions have to be defined over that structure.

Experiments and results
Studying locality: Compressed Column Storage format (CCS)
First cache hierarchy simulation:
- one level: cache size 16 KB, line size 64 B, 2-way associative, LRU replacement policy

CCS sparse matrix-vector product:
for i ← 1 to N
    for j ← start positions[i] to start positions[i+1] − 1
        Y[row indexes[j]] := Y[row indexes[j]] + value list[j] × X[i]
    end for
end for

Example matrix (non-zeros by row): a11 a12 a14 / a22 a23 a25 / a31 a33 a34 / a42 a44 a45 a46 / a55 a56 / a65 a66

First set of matrices (N = number of rows/columns, Nnz = number of non-zero elements):
- Name: Fidap008, N = 3096, Nnz = 90766, WB = 256
- Name: Bcsstk16, N = 4884, Nnz = , WB = 282

Generating permutations: 4000, 2000, 200, and 20 permutations.

Plans and future work: defining a framework that allows us to:
- predict the best function to predict locality for distributed shared-memory machines
- predict execution times of computations and communications
- modify the data layout to increase locality
- choose the iterative method and preconditioner, including their dynamic parameters
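The CCS product loop above can be sketched in Python with NumPy and 0-based indexing (the poster uses 1-based), storing the example matrix column by column. The numeric values (a_ij written as i + j/10) and the function name `ccs_spmv` are illustrative assumptions:

```python
import numpy as np

# CCS/CSC arrays for the 6x6 example matrix, 0-based.
value_list = np.array([1.1, 3.1,             # column 1
                       1.2, 2.2, 4.2,        # column 2
                       2.3, 3.3,             # column 3
                       1.4, 3.4, 4.4,        # column 4
                       2.5, 4.5, 5.5, 6.5,   # column 5
                       4.6, 5.6, 6.6])       # column 6
row_indexes = np.array([0, 2, 0, 1, 3, 1, 2, 0, 2, 3, 1, 3, 4, 5, 3, 4, 5])
start_positions = np.array([0, 2, 5, 7, 10, 14, 17])

def ccs_spmv(values, rows, starts, x):
    """y = A @ x for a matrix stored column by column."""
    n = len(starts) - 1
    y = np.zeros(n)
    for i in range(n):                       # for each column i
        for j in range(starts[i], starts[i + 1]):
            # indirect write on Y through row indexes: the access the
            # poster identifies as the source of conflict/capacity misses
            y[rows[j]] += values[j] * x[i]
    return y
```

The inner loop walks value list and row indexes with stride 1 (good spatial locality), while the scattered updates to Y are what the distance functions try to keep close together.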
value list = a11 | a31 | a12 | a22 | a42 | a23 | a33 | a14 | a34 | a44 | a25 | a45 | a55 | a65 | a46 | a56 | a66
row indexes = 1 | 3 | 1 | 2 | 4 | 2 | 3 | 1 | 3 | 4 | 2 | 4 | 5 | 6 | 4 | 5 | 6
start positions = 1 | 3 | 6 | 8 | 11 | 15 | 18

Reference counts: arrays Y, row indexes, and value list contribute Nnz references each, and arrays X and start positions contribute 2×N each; the total number of references is therefore 3×Nnz + 4×N.
- X and start positions have good spatial locality: only compulsory misses, and a small contribution, because usually N << Nnz.
- value list and row indexes have no temporal locality but good spatial locality: their compulsory misses are the major contribution (2×Nnz).
- Y[row indexes[j]] := Y[row indexes[j]] + value list[j] × X[i] shows complex behavior: conflict and capacity misses.

Remarks:
- The main source of cache misses is the accesses to the arrays value list and row indexes, but this can be solved with prefetching, assuming contiguous storage.
- Misses due to accesses to array Y are hard to estimate.
- Focusing on data (not instructions): a closer grouping of non-zeros within a particular row gives higher temporal locality on Y; a closer grouping of non-zeros in consecutive rows gives higher spatial locality on Y.

- Name: Bcsstk18, N = 11948, Nnz = , WB = 2488
- Name: E30R5000, N = 9661, Nnz = , WB = 686

Points to consider:
- the associativity and size of each level of the memory hierarchy
- the sparse format selected
- classic optimization techniques such as blocking and tiling

Value Added to CenSSIS
This research addresses an important problem: accelerating the execution of applications in the area of Subsurface Sensing and Imaging Systems. This has high value for the CenSSIS community.

References
D. Heras, V. Blanco, J. Cabaleiro, and F. Rivera. Modeling and improving locality for the sparse matrix-vector product on cache memories.
Future Generation Computer Systems, 18(1):55-67, 2001.
S. Xu and J. Zhang. A data mining approach to the matrix preconditioning problem. Laboratory for High Performance Scientific Computing and Computer Simulation, Department of Computer Science, University of Kentucky, March 2005.

Acknowledgement
This work is affiliated with CenSSIS, the Center for Subsurface Sensing and Imaging Systems. Support is provided by the National Science Foundation under Grant No. NSF ACR: