"Developing an Efficient Sparse Matrix Framework Targeting SSI Applications" Diego Rivera and David Kaeli The Center for Subsurface Sensing and Imaging.

Slides:

Advertisements

Similar presentations

1 Optimizing compilers Managing Cache Bercovici Sivan.

Advertisements

School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) Parallelism & Locality Optimization.

A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.

Simulations of Memory Hierarchy LAB 2: CACHE LAB.

" Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy" " Characterizing the Relationship between ILU-type Preconditioners.

CSCI 317 Mike Heroux1 Sparse Matrix Computations CSCI 317 Mike Heroux.

Data Locality CS 524 – High-Performance Computing.

1 Lecture 7: Caching in Row-Buffer of DRAM Adapted from “A Permutation-based Page Interleaving Scheme: To Reduce Row-buffer Conflicts and Exploit Data.

Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.

" Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy" " Characterizing the Relationship between ILU-type Preconditioners.

1/30/2003 BARC1 Profile-Guided I/O Partitioning Yijian Wang David Kaeli Electrical and Computer Engineering Department Northeastern University {yiwang,

Reuse Distance as a Metric for Cache Behavior Kristof Beyls and Erik D’Hollander Ghent University PDCS - August 2001.

A Single-Pass Cache Simulation Methodology for Two-level Unified Caches + Also affiliated with NSF Center for High-Performance Reconfigurable Computing.

Slide 1 Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse Kristof Beyls and Erik D’Hollander International Conference on Computational.

JAVA AND MATRIX COMPUTATION

A radiologist analyzes an X-ray image, and writes his observations on papers  Image Tagging improves the quality, consistency.  Usefulness of the data.

Computer Architecture Ch5-1 Ping-Liang Lai ( 賴秉樑 ) Lecture 5 Review of Memory Hierarchy (Appendix C in textbook) Computer Architecture 計算機結構.

Computer Organization CS224 Fall 2012 Lessons 45 & 46.

Scheduling MPI Workflow Applications on Computing Grids Juemin Zhang, Waleed Meleis, and David Kaeli Electrical and Computer Engineering Department, Northeastern.

IMP: Indirect Memory Prefetcher

1  2004 Morgan Kaufmann Publishers Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality:

1 Cache-Oblivious Query Processing Bingsheng He, Qiong Luo {saven, Department of Computer Science & Engineering Hong Kong University of.

Multipole-Based Preconditioners for Sparse Linear Systems. Ananth Grama Purdue University. Supported by the National Science Foundation.

An Analysis of the n- Queens problem Saleem Karamali.

Introduction to Machine Learning, its potential usage in network area,

Optimizing the Performance of Sparse Matrix-Vector Multiplication

University of California, Berkeley

Dr. Ofer Hadar Communication Systems Engineering Department

Auburn University

Auburn University

Two Dimensional Highly Associative Level-Two Cache Design

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Exploratory Decomposition Dr. Xiao Qin Auburn.

CS2100 Computer Organization

An Image Database Retrieval Scheme Based Upon Multivariate Analysis and Data Mining Presented by C.C. Chang Dept. of Computer Science and Information.

Ioannis E. Venetis Department of Computer Engineering and Informatics

Yuanrui Zhang, Mahmut Kandemir

Associativity in Caches Lecture 25

Fast Kernel-Density-Based Classification and Clustering Using P-Trees

Replacement Policy Replacement policy:

Less is More: Leveraging Belady’s Algorithm with Demand-based Learning

How will execution time grow with SIZE?

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.

The Hardware/Software Interface CSE351 Winter 2013

Section 7: Memory and Caches

Lecture: Cache Hierarchies

The University of Adelaide, School of Computer Science

Richard Dorrance Literature Review: 1/11/13

Morgan Kaufmann Publishers

5.2 Eleven Advanced Optimizations of Cache Performance

A Closer Look at Instruction Set Architectures

Cache Memory Presentation I

Probabilistic Miss Equations: Evaluating Memory Hierarchy Performance

Lecture: Cache Hierarchies

CSCE 990: Advanced Distributed Systems

CS61C : Machine Structures Lecture 6. 2

Lecture 21: Memory Hierarchy

Bojian Zheng CSCD70 Spring 2018

Linchuan Chen, Peng Jiang and Gagan Agrawal

Memory Hierarchies.

TLC: A Tag-less Cache for reducing dynamic first level Cache Energy

Lecture 23: Cache, Memory, Virtual Memory

Lecture 22: Cache Hierarchies, Memory

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Supported by the National Science Foundation.

Adapted from slides by Sally McKee Cornell University

Shared Memory Accesses

ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg

Cache - Optimization.

10/18: Lecture Topics Using spatial locality

Overview Problem Solution CPU vs Memory performance imbalance

Presentation transcript:

"Developing an Efficient Sparse Matrix Framework Targeting SSI Applications" Diego Rivera and David Kaeli The Center for Subsurface Sensing and Imaging Systems (CenSSIS) Electrical and Computer Engineering Department, Northeastern University, Boston, MA 02115 {drivera, kaeli}@ece.neu.edu, www.ece.neu.edu/students/drivera/tlg/tunlib.htm Introduction and Goals a barrier encountered in many CenSSIS applications is the computational time required to generate a solution this is most critical in sparse algorithms: because of indirect indexing, the number of non-contiguous memory system accesses increases it can increase the number of memory stalls and overall application execution time modifying to the right data layout can significantly improve the performance of solving the system but predicting and improving locality is complicated due to irregular patterns of references and the non-zeros structure of matrix is unknown a priori in addition, the selection of a preconditoner as an efficient solution for a particular application is now more difficult there are a large amount of options and often there is little knowledge of the characteristics of those methods by the people who will use them we are working on the development and implementation of a framework that will allow us to: predict the best function to predict locality for distributed shared memory machines, predict execution times of computations and communications modify data layout to increase locality choose Iterative method and preconditioner, including their dynamic parameters Jagged Diagonal format Second cache hierarchy simulation Second set of matrices replacement policy LRU level 1 (data and instructions) cache size 8 KB line size 64 B associativity 4-way level 2 cache size 256 KB line size 128 B associativity 8-way a11 a12 a14 a22 a23 a25 a31 a33 a34 a42 a44 a45 a46 a55 a56 a65 a66 a42 a44 a45 a46 a11 a12 a14 a22 a23 a25 a31 a33 a34 a55 a56 a65 a66 Name: s3dkq4m2 N = 90449 Nnz = 2455670 WB = 615 Name: Fidap011 N = 16614 Nnz = 1091362 WB = 9680 value list = a42 | a11 | a22 | a31 | a55 | a65 || a44 | a12 | a23 | a33 | a56 | a66 || a45 | a14 | a25 | a34 || a46 column indexes = 2 | 1 | 2 | 1 | 5 | 5 || 4 | 2 | 3 | 3 | 6 | 6 || 5 | 4 | 5 | 4 || 6 start positions = 1 | 7 | 13 | 17 | 18 perm vector = 4 | 2 | 3 | 1 | 5 | 6 for i ← 1 to N Temp[perm vector[i]] := Y [i] Y [i] := Temp[i] Distance vs. Cache misses disp := 0 for i ← 1 to num jdiag for j ← 1 to (start position[i+1] - start position[i] - 1) Y [j] := Y [j] + value list[disp] × X[column indexes[disp]] disp := disp + 1 end for Distances x y x x x x x y aelemes(x,y) = 2 D1(x,y) = max( nelems (x) , nelems(y) ) - aelemes(x,y) D2(x,y) = nelems (x) + nelems(y) - 2 × aelemes(x,y) Then: Note: JAD indirect addressing on X using column indexes, then distance by row is correct!. But since JAD storages by column, the column indexes should be compared by column. Clustering - Optimizing x x Choosing parameters and function to predict locality, also iterative method and preconditioner, including their dynamic parameters. New matrix Feature extraction Data Mining If column indexes = 2 | 1 | 2 | 1 | 5 | 5 || 4 | 2 | 3 | 3 | 6 | 6 || 5 | 4 | 5 | 4 || 6 then: Note: CSC indirect addressing on Y using row indexes, then distance by column is correct JAD indirect addressing on X using column indexes, then distance by row is correct! aelemes(x,y) = 1 D3 = ( 6 – 3 ) + ( 6 – 2 ) + ( 4 – 2 ) + ( 1 – 0 ) How many different indexes by Jagged diagonal are there? Modifying locality and choosing Iterative method and preconditioner, including their dynamic parameters. System solved Models and rules, samples. Conclusions Experiments and results an indirect measure of locality is the reuse distance, which can be calculated for example counting the number of matches between sets of columns/row; but the way that it is measured depends on the storage scheme selected minimizing a distance function lends to a locality maximization different distance functions can be definite for a specific format, but their accuracy for describing the locality for a specific matrix could depend on non-zeros structure of the matrix, because in order to measure locality, these functions have to be defined by that structure First set of matrices Studying locality First cache hierarchy simulation Compressed Column Storage format (CCS) one level cache size 16 KB line size 64 B associativity 2-way replacement policy LRU a11 a12 a14 for i ← 1 to N for j ← start positions [i] to start positions [i+1] -1 Y [row indexes [j]] := Y [row indexes [j]] + value list[j] × X[i] end for a22 a23 a25 a31 a33 a34 Name: Fidap008 N = 3096 Nnz = 90766 WB = 256 Name: Bcsstk16 N = 4884 Nnz = 290378 WB = 282 a42 a44 a45 a46 Generating permutations a55 a56 Where: N, number of row/columns Nnz, number of non-zero elements a65 a66 Plans and future work: Defining a framework that allows us to: predict the best function to predict locality for distributed shared memory machines, predict execution times of computations and communications modify data layout to increase locality choose Iterative method and preconditioner, including their dynamic parameters 4000 Permut. 2000 Permut. 200 Permut. 20 Permut. value list = a11 | a31 | a12 | a22 | a42 | a23 | a33 | a14 | a34 | a44 | a25 | a45 | a55 | a65 | a46 | a56 | a66 row indexes = 1 | 3 | 1 | 2 | 4 | 2 | 3 | 1 | 3 | 4 | 2 | 4 | 5 | 6 | 4 | 5 | 6 start positions = 1 | 3 | 6 | 8 | 11 | 15 | 18 arrays Y, row indexes and value list have Nnz references and arrays X and start positions have 2×N therefore total number of references 3×Nnz+4×N X and start positions have good spatial locality : compulsory misses, but poor contribution because usually N << Nnz arrays value list and row indexes have no temporal locality, but have good spatial locality : compulsory misses  mayor contribution (2×Nnz) Y [row indexes [j]] := Y [row indexes [j]] + value list[j] × X[i] complex behavior conflict and capacity misses Remarks: main source of cache misses are misses due to access arrays value list and row indexes, but it can be solved with prefetching; assuming continuous storage misses due to access array Y are hard to estimate focusing on data and not instructions a closer grouping of Nnz in a particular row  higher temporal locality on Y a closer grouping of Nnz in consecutive rows  higher spatial locality on Y Name: Bcsstk18 N = 11948 Nnz = 149098 WB = 2488 Name: E30R5000 N = 9661 Nnz = 306002 WB = 686 Points to consider: the associativity and size of each element of the memory hierarch the sparse format selected classic techniques of optimization such as blocking and tiling Value Added to CenSSIS This research addresses an important problem in accelerating the execution of applications in the area of Subsurface Sensing and Imaging System. This has a high value for the CenSSIS community. References D. Heras, V. Blanco, J. Cabaleiro, and F. Rivera. Modeling and improving locality for the sparse matrix-vector product on cache memories . Future Generation Computer Systems, 18(1):55-67, 2001 Shuting Xu, Jun Zhang, A Data Mining Approach to Matrix Preconditioning Problem, Laboratory for High Performance Scientific Computing and Computer Simulation, Department of Computer Science, University of Kentucky, March, 2005 Distance vs. Cache misses Acknowledgement This work is affiliated with CenSSIS, the Center for Subsurface Sensing and Imaging Systems. Support is provided by National Science Foundation under Grant No. NFS ACR:0342555.