Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems
AACEC 2010 – Heraklion, Crete, Greece
Jakob Siegel (1), Oreste Villa (2), Sriram Krishnamoorthy (2), Antonino Tumeo (2) and Xiaoming Li (1)
(1) University of Delaware
(2) Pacific Northwest National Laboratory
September 24th, 2010

Overview
- Introduction
- Cluster level
- Node level
- Results
- Conclusion
- Future Work

Sparse Matrix-Matrix Multiply - Challenges
The efficient implementation of sparse matrix-matrix multiplication on HPC systems poses several challenges:
- Large size of the input matrices, e.g. 10^6 × 10^6 with 30 × 10^6 nonzero elements
  - Compressed representation
  - Partitioning
- Density of the output matrices
- Load balancing: large differences in density and computation times
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at: http://www.cise.ufl.edu/davis/sparse.

Sparse Matrix-Matrix Multiply
Cross-cluster implementation:
- Partitioning
- Data distribution
- Load balancing
- Communication/scaling
- Result handling
In-node implementation:
- Multiple efficient SpGEMM algorithms
- CPU/GPU implementation
- Double buffering
- Exploiting heterogeneity

Sparse Matrix-Matrix Multiply - Cluster level
- Blocking: the block size depends on the sparsity of the input matrices and on the number of processing elements, with NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements.
- Data layout: which format and order allow easy and fast access?
- Communication and storage are implemented using Global Arrays (GA), which offers a set of primitives for non-blocking operations and for contiguous and non-contiguous data transfers.
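The blocking condition can be made concrete with a small calculation. The sketch below is only an illustration: the function name `choose_block_grid` and the `oversubscription` factor are assumptions and do not come from the slides, and the dependence of the block size on sparsity is not modeled.

```python
import math

def choose_block_grid(num_processing_elements, oversubscription=32):
    """Pick a square block grid with many more blocks than processing elements.

    `oversubscription` is a hypothetical tuning knob: the grid is sized so that
    blocks_x * blocks_y >= oversubscription * num_processing_elements.
    """
    target = oversubscription * num_processing_elements
    side = math.isqrt(target - 1) + 1  # integer ceil(sqrt(target)) for target >= 1
    return side, side

blocks_x, blocks_y = choose_block_grid(num_processing_elements=64)
```

With many more blocks than processing elements, a slow (dense) block delays only one worker while the others keep pulling further blocks.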

Sparse Matrix-Matrix Multiply - Data representation and Tiling
C = A × B
Blocked matrix representation: each block is stored in CSR* form.
     1 -1  0  0  0
     0  5  0  0  0
     0  0  4  6  0
    -2  0  2  7  0
     0  0  0  0  5
    data (1 -1 5 4 6 -2 2 7 5)
    col  (0 1 1 2 3 0 2 3 4)
    row  (0 2 3 5 8 9)
*CSR: Compressed Sparse Row
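As a minimal sketch of the CSR form used for each block, the following converts the 5×5 example block into the three arrays shown on the slide. The function name `dense_to_csr` is an assumption chosen for illustration.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    data, col, row = [], [], [0]
    for dense_row in matrix:
        for j, value in enumerate(dense_row):
            if value != 0:
                data.append(value)   # nonzero value
                col.append(j)        # its column index
        row.append(len(data))        # start of the next row in data/col
    return data, col, row

tile = [
    [1, -1, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 4, 6, 0],
    [-2, 0, 2, 7, 0],
    [0, 0, 0, 0, 5],
]
data, col, row = dense_to_csr(tile)
```

Row i of the block occupies the index range row[i] to row[i+1] in both `data` and `col`.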

Sparse Matrix-Matrix Multiply - Data representation and Tiling
C = A × B
Matrix A: the individual CSR tiles are stored serialized in the GA space.
(Figure: serialized layout data, column, row, data, col, … for Tile 0, Tile 2, …)
- Tile sizes and offsets are stored in a 2D array.
- Tiles with 0 nonzero elements are not represented in the GA dataset.
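The serialized layout can be sketched as follows. This is an illustration under assumptions, not the GA-based implementation: a plain Python list stands in for the GA space, values and indices share one buffer for simplicity, and the names `serialize_tiles` and `load_tile` are hypothetical.

```python
def serialize_tiles(tiles):
    """Pack a 2D grid of CSR tiles into one flat buffer.

    tiles[i][j] is a (data, col, row) triple, or None for an empty tile.
    Returns the buffer and a 2D index of (offset, size) per tile.
    """
    buffer = []
    index = [[(0, 0)] * len(tile_row) for tile_row in tiles]
    for i, tile_row in enumerate(tiles):
        for j, tile in enumerate(tile_row):
            if tile is None or len(tile[0]) == 0:
                index[i][j] = (len(buffer), 0)  # empty tile: nothing stored
                continue
            start = len(buffer)
            for array in tile:                  # data, then col, then row
                buffer.extend(array)
            index[i][j] = (start, len(buffer) - start)
    return buffer, index

def load_tile(buffer, index, i, j, tile_rows):
    """Recover the CSR arrays of tile (i, j); tile_rows is the tile height."""
    offset, size = index[i][j]
    if size == 0:
        return [], [], [0] * (tile_rows + 1)
    nnz = (size - (tile_rows + 1)) // 2
    chunk = buffer[offset:offset + size]
    return chunk[:nnz], chunk[nnz:2 * nnz], chunk[2 * nnz:]

t00 = ([1, -1, 5], [0, 1, 1], [0, 2, 3])   # 2x2 tile [[1, -1], [0, 5]]
t11 = ([4], [0], [0, 1, 1])                # 2x2 tile [[4, 0], [0, 0]]
buffer, index = serialize_tiles([[t00, None], [None, t11]])
```

Because the offset table records a size of 0 for empty tiles, a reader can skip them without touching the buffer.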

Sparse Matrix-Matrix Multiply - Data representation and Tiling
Matrix B: the tiles are serialized in transposed order. Depending on the algorithm used to compute the individual tiles, the data inside each tile can be stored transposed or not transposed. For the Gustavson algorithm, the data inside the tiles is not transposed.
Not transposed:
     1 -1  0  0  0
     0  5  0  0  0
     0  0  4  6  0
    -2  0  2  7  0
     0  0  0  0  5
Transposed:
     1  0  0 -2  0
    -1  5  0  0  0
     0  0  4  2  0
     0  0  6  7  0
     0  0  0  0  5

Sparse Matrix-Matrix Multiply - Tasking and Data Movement
(Figure: the blocks of C are numbered 0, 1, 2, … as tasks and are picked up by nodes 0, 1, …, N-1.)
- Each block in C represents a task.
- Nodes grab tasks, and the additional data they need, whenever they have computational power available.
- Results are stored locally.
- The metadata of the result blocks in each node is distributed to determine the offsets of the tiles in the GA space.
- Tiles are put into the GA space in the right order.
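The tasking scheme above can be sketched on a single machine. In this illustration threads stand in for nodes and a thread-safe queue stands in for the shared task pool; both are assumptions, as are the names `run_tasks` and `tile_offsets`. The second function mirrors the metadata exchange: once every result size is known, an exclusive prefix sum in task order yields each tile's offset.

```python
import queue
import threading

def run_tasks(num_tasks, num_workers, compute):
    """Workers pull task ids until the pool is empty; results stay local."""
    tasks = queue.Queue()
    for task in range(num_tasks):
        tasks.put(task)
    local_results = [dict() for _ in range(num_workers)]

    def worker(rank):
        while True:
            try:
                task = tasks.get_nowait()   # grab the next available task
            except queue.Empty:
                return
            local_results[rank][task] = compute(task)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return local_results

def tile_offsets(local_results):
    """Exclusive prefix sum of result sizes, in task order."""
    sizes = {}
    for node in local_results:
        for task, tile in node.items():
            sizes[task] = len(tile)
    offsets, position = {}, 0
    for task in sorted(sizes):
        offsets[task] = position
        position += sizes[task]
    return offsets

# Hypothetical workload: task t produces a result of length t % 3.
results = run_tasks(20, 4, lambda t: [t] * (t % 3))
offsets = tile_offsets(results)
```

Which worker ends up with which task varies from run to run, but the offsets depend only on the result sizes, so the final layout is deterministic.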

Sparse Matrix-Matrix Multiply - Tasking and Data Movement
C = A × B
Each node fetches the data needed by the task it handles. E.g. here, for task/tile 5, the node has to load the data of stripes s_a = 1 and s_b = 0.
(Figure: the stripes of A are numbered 0, 1, 2, …, S_a-1 and the stripes of B are numbered 0, 1, 2, …, S_b-1.)
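The mapping from a task to its two stripes can be written down directly. The sketch assumes that the tiles of C are numbered row by row and that there are five stripes of B, which is consistent with the task 5 example but is not stated explicitly on the slide; `task_to_stripes` is a hypothetical name.

```python
def task_to_stripes(task, num_b_stripes):
    """Map a row-major tile number of C to (s_a, s_b).

    s_a is the row stripe of A and s_b the column stripe of B that the
    tile needs. Row-major numbering is an assumption of this sketch.
    """
    s_a, s_b = divmod(task, num_b_stripes)
    return s_a, s_b

stripes_for_task_5 = task_to_stripes(5, num_b_stripes=5)
```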

Sparse Matrix-Matrix Multiply - Next Step: Locality-aware Tasking
C = A × B
- Assign tasks depending on how the global array is distributed over the cluster.
- The task queue should be aware of what data is already available in a node and assign the follow-up task based on that.
(Figure: the tasks that should have a higher priority to be assigned to the node that handled task 5.)
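One possible reading of this next step is a scoring rule that prefers tasks whose stripes a node already holds. The sketch below is purely hypothetical: the slides describe the goal, not a policy, and the function `pick_next_task`, the row-major task numbering and the tie-breaking rule are all assumptions.

```python
def pick_next_task(pending, cached_a, cached_b, num_b_stripes):
    """Choose the pending task that reuses the most locally cached stripes.

    cached_a / cached_b are the sets of A and B stripe indices already
    held by the node. Ties go to the lowest task number.
    """
    def score(task):
        s_a, s_b = divmod(task, num_b_stripes)
        return (s_a in cached_a) + (s_b in cached_b)

    return max(pending, key=lambda task: (score(task), -task))

# A node that just finished task 5 holds stripe 1 of A and stripe 0 of B.
next_task = pick_next_task([3, 6, 7, 10, 13], {1}, {0}, num_b_stripes=5)
```

Here tasks 6 and 7 share the A stripe and task 10 shares the B stripe, so any of them avoids one transfer compared with tasks 3 or 13.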

Sparse Matrix-Matrix Multiply - Gustavson
The algorithm is based on the equation
    $c_{i\cdot} = \sum_{v \,:\, a_{iv} \neq 0} a_{iv}\, b_{v\cdot}$
that is, the i-th row of C is a linear combination of the rows v of B for which a_iv is nonzero, where A has the dimensions p×q and B has the dimensions q×r.
A:
     2  3  0  0  0  0
     0 -1  0  2  3  0
     0  0 -3  1  0  0
     0  0  2  3  0  0
     1  0  0  2  2  0
     0  0  0  2 -1  4
    data (2,3,-1,2,3,-3,1,2,3,1,2,2,2,-1,4)
    col  (0,1,1,3,4,2,3,2,3,0,3,4,3,4,5)
    row  (0,2,5,7,9,12,15)
B:
     1 -1  0  0  0
     0  5  0  0  0
     0  0  4  6  0
    -2  0  0  7 -4
     0  1  0  0  5
     0  0  0  1  2
    data (1,-1,5,4,6,-2,7,-4,1,5,1,2)
    col  (0,1,1,2,3,0,3,4,1,4,3,4)
    row  (0,2,3,5,8,10,12)
Row i=1 of C, accumulated over the nonzero elements a_1v of A:
    i=1        ( 0  0  0  0  0)
    i=1, v=1   ( 0 -5  0  0  0)
    i=1, v=3   (-4 -5  0 14 -8)
    i=1, v=4   (-4 -2  0 14  7)
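The row-wise formulation can be sketched sequentially in a few lines. This is a plain Python illustration of Gustavson's scheme on CSR inputs with a dense accumulator per row, not the optimized CPU or CUDA code from the talk; the function name `gustavson_spgemm` is an assumption. The example data are the matrices A and B from the slide.

```python
def gustavson_spgemm(a, b, num_cols_b):
    """Multiply two CSR matrices row by row; returns C in CSR form."""
    a_data, a_col, a_row = a
    b_data, b_col, b_row = b
    c_data, c_col, c_row = [], [], [0]
    for i in range(len(a_row) - 1):
        acc = [0] * num_cols_b                      # dense accumulator for row i
        for k in range(a_row[i], a_row[i + 1]):     # nonzeros a_iv of row i
            v, a_iv = a_col[k], a_data[k]
            for m in range(b_row[v], b_row[v + 1]): # add a_iv times row v of B
                acc[b_col[m]] += a_iv * b_data[m]
        for j, value in enumerate(acc):             # compress the finished row
            if value != 0:
                c_data.append(value)
                c_col.append(j)
        c_row.append(len(c_data))
    return c_data, c_col, c_row

A = ([2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4],
     [0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5],
     [0, 2, 5, 7, 9, 12, 15])
B = ([1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2],
     [0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4],
     [0, 2, 3, 5, 8, 10, 12])
c_data, c_col, c_row = gustavson_spgemm(A, B, num_cols_b=5)
```

Only rows of B selected by a nonzero a_iv are ever read, which is what makes the scheme attractive for sparse inputs.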

Sparse Matrix-Matrix Multiply - Gustavson
In the CUDA implementation:
- Each result row c_i is handled by the 16 threads of a half-warp (1/2W).
- For each nonzero element a_iv in A, one 1/2W performs the multiplications for row v· of B in parallel.
- The results are kept in dense form until all calculations are complete.
- Then the results are compressed on the device.
Dense result buffer C for the matrices A and B of the previous slide, initialized to zero and filled row by row by half-warp 0, half-warp 1, half-warp 2, …:
     2  13   0   0   0
    -4  -2   0  14   7
    -2   0 -12 -11  -4
    -6   0   8  33 -12
    -3   1   0  14   2
    -4  -1   0  18  -5
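The final compression step can be sketched as count, scan and fill, a pattern that maps naturally to a parallel device. The Python below is a sequential stand-in, not the CUDA kernel from the talk, and the name `compress_dense_rows` is an assumption; the input is the product of the example matrices A and B.

```python
from itertools import accumulate

def compress_dense_rows(dense):
    """Compress dense result rows into CSR using count, scan and fill."""
    # 1. Count the nonzeros of each row (independent per row).
    counts = [sum(1 for value in row if value != 0) for row in dense]
    # 2. A prefix sum over the counts gives the row pointers.
    row_ptr = [0] + list(accumulate(counts))
    # 3. Each row writes its nonzeros into its own output slice.
    data = [0] * row_ptr[-1]
    col = [0] * row_ptr[-1]
    for i, row in enumerate(dense):
        k = row_ptr[i]
        for j, value in enumerate(row):
            if value != 0:
                data[k], col[k] = value, j
                k += 1
    return data, col, row_ptr

C_dense = [
    [2, 13, 0, 0, 0],
    [-4, -2, 0, 14, 7],
    [-2, 0, -12, -11, -4],
    [-6, 0, 8, 33, -12],
    [-3, 1, 0, 14, 2],
    [-4, -1, 0, 18, -5],
]
c_data, c_col, c_row_ptr = compress_dense_rows(C_dense)
```

Since the row pointers are known before the fill, every row can be written without coordinating with the others.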

Sparse Matrix-Matrix Multiply – Case Study
- Midsize matrix from the University of Florida Sparse Matrix Collection*
- 2D/3D problem
- Size: 72,000 × 72,000
- 28,715,634 nonzero elements
- Blocked into 5041 tiles
- Multiplying the matrix with itself
(Figure: darker colors represent higher densities of nonzero elements.)
*http://www.cise.ufl.edu/davis/sparse

Sparse Matrix-Matrix Multiply - Results
(Figure: scaling of SpGEMM with the different approaches.)

Sparse Matrix-Matrix Multiply - Results
- Even inside a node where different compute elements are used, the load balancing mechanism still performs well.
- The processes using the CUDA devices complete almost 5x more tasks than the pure CPU processes.

Sparse Matrix-Matrix Multiply - Conclusion
- We presented a parallel framework using a co-design approach that takes into account the characteristics of:
  - The selected application (here SpGEMM)
  - The underlying hardware (heterogeneous cluster)
- The difficulties of static partitioning approaches show that a global load balancing method is needed.
- Different optimized implementations of the Gustavson algorithm are presented and are used depending on the available compute element.
- For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.

Future Work – General Tasking Framework for Heterogeneous GPU Clusters
- More general task definition
- More flexibility in input and output data definition
- Exploring the limits imposed on tasks by a heterogeneous system
- Feedback loop during execution that allows more efficient assignment of tasks
- Introducing heterogeneous execution on GPU and CPU in one process/core
- Locality-aware task queue(s) and work stealing
- Task reinsertion or generation at the node level

Thank you
Questions?
