Generic Compressed Matrix Insertion
Peter Gottschling, SmartSoft/TUD
Dag Lindbo, Kungliga Tekniska Högskolan
SmartSoft, TU Dresden, Tel.: +49 (0)
Overview
- Software libraries
  - MTL4
  - FEniCS
- Compressed sparse matrices
- Insertion
- Benchmarks
- Vision
Matrix Template Library 4
- Generic library for high-performance numeric operations in mathematical notation
- Many new techniques, such as implicit enable-if and meta-tuning
- Most modern iterative solvers
- Focus on high-performance simulation: FEM/XFEM/FVM/FDM
- Commercial version in preparation
- Parallel version in progress
- Multi-core, GPU support and multigrid in the near future
Linear Equation Solver
Innovative product development through the finite element method (FEM)

    // Preconditioned conjugate gradient method for A*x = b
    template < class LinearOperator, class HilbertSpaceX, class HilbertSpaceB,
               class Preconditioner, class Iteration >
    int cg(const LinearOperator& A, HilbertSpaceX& x, const HilbertSpaceB& b,
           const Preconditioner& M, Iteration& iter)
    {
        // scalar type of the vector x
        typedef typename mtl::Collection<HilbertSpaceX>::value_type Scalar;
        Scalar rho, rho_1, alpha, beta;
        HilbertSpaceX p(size(x)), q(size(x)), r(size(x)), z(size(x));

        r = b - A*x;                     // initial residual
        while (! iter.finished(r)) {
            z = solve(M, r);             // apply the preconditioner
            rho = dot(r, z);
            if (iter.first())
                p = z;
            else {
                beta = rho / rho_1;
                p = z + beta * p;        // new search direction
            }
            q = A * p;
            alpha = rho / dot(p, q);
            x += alpha * p;              // update the solution
            r -= alpha * q;              // update the residual
            rho_1 = rho;
            ++iter;
        }
        return iter;
    }
FEniCS
- Free software for solving differential equations
- FFC, the FEniCS Form Compiler
  - High-level math language for formulating differential equations
  - Generates C++ code
- DOLFIN, a generic FEM kernel
  - C++ library for FEM cores: assembler, mesh and function abstraction
  - Interface to uBLAS, PETSc, Trilinos, and MTL4
- Paper focus: matrix assembly
Compressed Sparse Row Format
- Most common general-purpose sparse format
- Entries sorted
- A kind of run-length encoding on rows
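To make the layout concrete, the following is a minimal sketch of a CSR container and an element lookup. The names `CsrMatrix`, `row_start`, `col_index`, `values`, `at` and `example_matrix` are illustrative assumptions and are not taken from MTL4. The "run-length" aspect is that `row_start` stores only where each row begins instead of a row index per entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSR container; the names are hypothetical, not MTL4's.
struct CsrMatrix {
    std::size_t num_rows, num_cols;
    std::vector<std::size_t> row_start;  // row r occupies [row_start[r], row_start[r+1])
    std::vector<std::size_t> col_index;  // column of each entry, ascending within a row
    std::vector<double>      values;     // value of each entry, same order as col_index
};

// Look up A(r, c) by binary search within row r; absent entries read as 0.
inline double at(const CsrMatrix& A, std::size_t r, std::size_t c)
{
    assert(r < A.num_rows && c < A.num_cols);
    auto first = A.col_index.begin() + static_cast<std::ptrdiff_t>(A.row_start[r]);
    auto last  = A.col_index.begin() + static_cast<std::ptrdiff_t>(A.row_start[r + 1]);
    auto it = std::lower_bound(first, last, c);
    if (it == last || *it != c)
        return 0.0;
    return A.values[static_cast<std::size_t>(it - A.col_index.begin())];
}

// 3x5 matrix with entries (0,0)=1, (0,2)=2, (1,3)=3, (2,1)=4, (2,4)=5.
inline CsrMatrix example_matrix()
{
    return CsrMatrix{3, 5, {0, 2, 3, 5}, {0, 2, 3, 1, 4}, {1.0, 2.0, 3.0, 4.0, 5.0}};
}
```

Because the column indices are sorted within each row, a lookup costs a binary search over one row only.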
In-Flight Insertion
- Very simple to use, like dense matrices
- Simple to implement
- Extremely expensive: all entries after the insertion point are moved
- Quadratic complexity
- Example: A[0][1] = 6;
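The cost of in-flight insertion can be seen in a small sketch. It assumes a plain CSR layout with three arrays; the names `Csr` and `set_in_flight` are hypothetical. The function returns how many stored entries had to be shifted, which is the work that adds up to quadratic complexity when a matrix is filled entry by entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSR arrays (hypothetical names).
struct Csr {
    std::size_t num_rows, num_cols;
    std::vector<std::size_t> row_start;  // size num_rows + 1
    std::vector<std::size_t> col_index;  // ascending within each row
    std::vector<double>      values;
};

// Set A(r, c) = v directly in the compressed arrays.
// Returns the number of stored entries that had to be moved.
inline std::size_t set_in_flight(Csr& A, std::size_t r, std::size_t c, double v)
{
    assert(r < A.num_rows && c < A.num_cols);
    auto first = A.col_index.begin() + static_cast<std::ptrdiff_t>(A.row_start[r]);
    auto last  = A.col_index.begin() + static_cast<std::ptrdiff_t>(A.row_start[r + 1]);
    auto it = std::lower_bound(first, last, c);
    const std::size_t pos = static_cast<std::size_t>(it - A.col_index.begin());

    if (it != last && *it == c) {        // entry exists: overwrite, nothing moves
        A.values[pos] = v;
        return 0;
    }
    const std::size_t moved = A.values.size() - pos;
    A.col_index.insert(it, c);           // shifts every later entry by one position
    A.values.insert(A.values.begin() + static_cast<std::ptrdiff_t>(pos), v);
    for (std::size_t i = r + 1; i <= A.num_rows; ++i)
        ++A.row_start[i];                // all following rows start one later
    return moved;
}
```

Inserting A[0][1] = 6 into a matrix with five stored entries, two of them in row 0, moves four entries even though only one value is new.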
Two-phase Insertion
- Dedicated insertion phase
- Matrix is available only after the insertion phase has been terminated
- Later modification impossible
- Works for distributed matrices as well
- Used in PETSc, where it includes construction of the communication buffers for distributed sparse matrix-vector products (SpMVP)
- Janus derives its name from it (two faces)
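A two-phase scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the PETSc or Janus implementation: the class `TwoPhaseMatrix` and its members are hypothetical, entries are buffered as triplets, and duplicates at the same position are summed when the matrix is compressed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical two-phase container: buffer first, compress once.
class TwoPhaseMatrix {
public:
    TwoPhaseMatrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), assembled_(false), row_start_(rows + 1, 0) {}

    // Phase 1: record the entry; nothing is sorted or moved yet.
    void insert(std::size_t r, std::size_t c, double v)
    {
        assert(!assembled_);             // later modification is not allowed
        assert(r < rows_ && c < cols_);
        pending_.push_back(Entry{r, c, v});
    }

    // Phase 2: sort by (row, column), sum duplicates, build the CSR arrays.
    void assemble()
    {
        if (assembled_) return;
        std::sort(pending_.begin(), pending_.end(),
                  [](const Entry& a, const Entry& b) {
                      return std::tie(a.row, a.col) < std::tie(b.row, b.col);
                  });
        for (std::size_t i = 0; i < pending_.size(); ++i) {
            const Entry& e = pending_[i];
            if (i > 0 && pending_[i - 1].row == e.row && pending_[i - 1].col == e.col) {
                values_.back() += e.value;           // duplicate position
            } else {
                col_index_.push_back(e.col);
                values_.push_back(e.value);
                ++row_start_[e.row + 1];             // count entries per row
            }
        }
        for (std::size_t r = 0; r < rows_; ++r)      // turn counts into offsets
            row_start_[r + 1] += row_start_[r];
        pending_.clear();
        assembled_ = true;
    }

    bool assembled() const { return assembled_; }
    const std::vector<std::size_t>& row_start() const { return row_start_; }
    const std::vector<std::size_t>& col_index() const { return col_index_; }
    const std::vector<double>&      values()    const { return values_; }

private:
    struct Entry { std::size_t row, col; double value; };
    std::size_t rows_, cols_;
    bool assembled_;
    std::vector<Entry> pending_;
    std::vector<std::size_t> row_start_, col_index_;
    std::vector<double> values_;
};
```

The buffer makes insertion order irrelevant, at the price of holding every entry twice until `assemble()` has run.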
Inserter Concept in MTL4
- Inserter = object providing operations to set up other objects, e.g. matrices or vectors, efficiently
- Insertion phase lasts as long as the inserter lives
  - Insert within a scope (block, function)
  - Matrix is ready when the inserter is destroyed
- Later insertion possible with another inserter
- Extends to distributed matrices and vectors
- MTL4 inserters have minimal memory usage
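The lifetime idea behind the inserter concept can be illustrated without MTL4. The sketch below is an assumption-laden toy: `SimpleCsr` and `ScopedInserter` are hypothetical names, and the `std::map` buffer is chosen for brevity, so it does not reflect the minimal memory usage of the real MTL4 inserters. It only shows that the matrix becomes usable when the inserter goes out of scope, and that a second inserter can modify it later.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Toy CSR matrix (hypothetical, not MTL4's compressed2D).
struct SimpleCsr {
    SimpleCsr(std::size_t rows, std::size_t cols)
      : num_rows(rows), num_cols(cols), row_start(rows + 1, 0), ready(true) {}
    std::size_t num_rows, num_cols;
    std::vector<std::size_t> row_start, col_index;
    std::vector<double> values;
    bool ready;                          // false while an inserter is alive
};

// Hypothetical inserter: collects entries, writes them back on destruction.
class ScopedInserter {
public:
    explicit ScopedInserter(SimpleCsr& A) : A_(A)
    {
        A_.ready = false;
        // Take over existing entries so that a later inserter can extend the matrix.
        for (std::size_t r = 0; r < A_.num_rows; ++r)
            for (std::size_t k = A_.row_start[r]; k < A_.row_start[r + 1]; ++k)
                buffer_[std::make_pair(r, A_.col_index[k])] = A_.values[k];
    }
    ScopedInserter(const ScopedInserter&) = delete;
    ScopedInserter& operator=(const ScopedInserter&) = delete;

    void set(std::size_t r, std::size_t c, double v)
    {
        assert(r < A_.num_rows && c < A_.num_cols);
        buffer_[std::make_pair(r, c)] = v;
    }

    // End of the insertion phase: rebuild the compressed arrays.
    ~ScopedInserter()
    {
        A_.row_start.assign(A_.num_rows + 1, 0);
        A_.col_index.clear();
        A_.values.clear();
        for (const auto& kv : buffer_) {   // map iterates in (row, column) order
            ++A_.row_start[kv.first.first + 1];
            A_.col_index.push_back(kv.first.second);
            A_.values.push_back(kv.second);
        }
        for (std::size_t r = 0; r < A_.num_rows; ++r)
            A_.row_start[r + 1] += A_.row_start[r];
        A_.ready = true;
    }

private:
    SimpleCsr& A_;
    std::map<std::pair<std::size_t, std::size_t>, double> buffer_;
};
```

Tying the end of the insertion phase to a destructor means the user cannot forget to finalize the matrix; leaving the scope does it.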
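Using Inserters

    #include <iostream>
    #include <boost/numeric/mtl/mtl.hpp>

    using namespace mtl;

    int main(int argc, char* argv[])
    {
        compressed2D<double> A(3, 5);   // element type double assumed
        {
            // The insertion phase lasts as long as ins lives
            matrix::inserter<compressed2D<double> > ins(A);
            ins[0][0] << 1.0; ins[0][2] << 2.0;
            ins[1][3] << 3.0;
            ins[2][1] << 4.0; ins[2][4] << 5.0;
        }   // ins is destroyed here: A is ready
        std::cout << "A is\n" << A << '\n';
        return 0;
    }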
Direct Insertion
- Reserve s entries per row
- Find the insert position by linear or binary search
- Move the remainder of the row
  - Linear in s, i.e. constant for fixed s
- Example: A[0][1] = 6;
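The steps of direct insertion can be sketched as below. The class `SlotRows` and its members are hypothetical and only illustrate the idea of s reserved slots per row; the search here is linear, and a saturated row is reported to the caller instead of being handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical slot storage: every row owns `slots` preallocated positions.
class SlotRows {
public:
    SlotRows(std::size_t rows, std::size_t slots)
      : slots_(slots), used_(rows, 0), cols_(rows * slots, 0), vals_(rows * slots, 0.0) {}

    // Set A(r, c) = v inside the row's slots.
    // Returns false when the row is saturated and c is not yet stored.
    bool insert(std::size_t r, std::size_t c, double v)
    {
        assert(r < used_.size());
        const std::size_t base = r * slots_;
        const std::size_t n = used_[r];

        std::size_t k = 0;                           // linear search for the position
        while (k < n && cols_[base + k] < c)
            ++k;
        if (k < n && cols_[base + k] == c) {         // already stored: overwrite
            vals_[base + k] = v;
            return true;
        }
        if (n == slots_)
            return false;                            // no free slot left in this row

        for (std::size_t j = n; j > k; --j) {        // move the remainder of the row only
            cols_[base + j] = cols_[base + j - 1];
            vals_[base + j] = vals_[base + j - 1];
        }
        cols_[base + k] = c;
        vals_[base + k] = v;
        ++used_[r];
        return true;
    }

    std::size_t used(std::size_t r) const { return used_[r]; }
    std::size_t col(std::size_t r, std::size_t k) const { return cols_[r * slots_ + k]; }
    double value(std::size_t r, std::size_t k) const { return vals_[r * slots_ + k]; }

private:
    std::size_t slots_;
    std::vector<std::size_t> used_;   // number of occupied slots per row
    std::vector<std::size_t> cols_;   // column indices, row r starts at r * slots_
    std::vector<double>      vals_;
};
```

At most s entries are touched per insertion, and other rows are never affected.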
Indirect Insertion
- For saturated rows use a "spare" container
  - std::map keyed by the index pair
  - Logarithmic in the number of spare entries
  - Additional allocation
  - About 10 times slower than direct insertion
- Example: A[0][4] = 7;
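A minimal sketch of the overflow path follows. `SpareMatrix` is a hypothetical class; for brevity each row is a small sorted vector limited to `slots` entries rather than a slice of one contiguous array, which is an assumption of this example. Entries that do not fit into their row go into a `std::map` keyed by (row, column).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical matrix with per-row slots and a spare map for overflow.
class SpareMatrix {
public:
    typedef std::pair<std::size_t, double> Entry;   // (column, value)
    typedef std::vector<Entry> Row;

    SpareMatrix(std::size_t rows, std::size_t slots) : slots_(slots), rows_(rows)
    {
        for (auto& row : rows_)
            row.reserve(slots);
    }

    void insert(std::size_t r, std::size_t c, double v)
    {
        assert(r < rows_.size());
        Row& row = rows_[r];
        auto it = std::lower_bound(row.begin(), row.end(), c, before);
        if (it != row.end() && it->first == c) {     // stored in the row: overwrite
            it->second = v;
        } else if (row.size() < slots_) {            // free slot: direct insertion
            row.insert(it, Entry(c, v));
        } else {                                     // saturated row: indirect insertion
            spare_[std::make_pair(r, c)] = v;
        }
    }

    double get(std::size_t r, std::size_t c) const
    {
        assert(r < rows_.size());
        const Row& row = rows_[r];
        auto it = std::lower_bound(row.begin(), row.end(), c, before);
        if (it != row.end() && it->first == c)
            return it->second;
        auto sp = spare_.find(std::make_pair(r, c));
        return sp != spare_.end() ? sp->second : 0.0;
    }

    std::size_t spare_size() const { return spare_.size(); }

private:
    static bool before(const Entry& e, std::size_t col) { return e.first < col; }

    std::size_t slots_;
    std::vector<Row> rows_;
    std::map<std::pair<std::size_t, std::size_t>, double> spare_;
};
```

Choosing the slot count close to the expected number of non-zeros per row keeps most insertions on the fast direct path.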
Benchmark
- Assemble CRS matrix
- Row order is important, as is the order within each row
- Performance measure: number of non-zeros inserted per second
- Reassembly
- Three libraries: uBLAS (including vector-of-vector), MTL4, PETSc
- Ordinary workstation (Intel)
- All benchmarks run through a simple interface routine for each library, e.g.

    // Matrix stands for the matrix type of the respective library
    void insert_row(Matrix& A, int row_idx, int* cols_idx, double* a, int n)
    {
        for (int j = 0; j < n; j++)
            A(row_idx, cols_idx[j]) += a[j];
    }
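How such a rate could be measured is sketched below. This is not the authors' benchmark harness: `MapMatrix`, `AssemblyResult` and `time_ascending_rows` are hypothetical names, the banded column pattern is an assumption, and the column indices are generated inside the timed region for simplicity. Any matrix type that supports `A(i, j) += v` can be passed in.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a library matrix type.
struct MapMatrix {
    std::map<std::pair<int, int>, double> data;
    double& operator()(int r, int c) { return data[std::make_pair(r, c)]; }
};

// Interface routine in the style of the slide, written as a template here.
template <typename Matrix>
void insert_row(Matrix& A, int row_idx, const int* cols_idx, const double* a, int n)
{
    for (int j = 0; j < n; j++)
        A(row_idx, cols_idx[j]) += a[j];
}

struct AssemblyResult {
    std::size_t entries;   // number of non-zeros inserted
    double seconds;        // wall-clock time of the insertion loop
    double rate() const    // entries per second, 0 if the clock did not advance
    {
        return seconds > 0.0 ? static_cast<double>(entries) / seconds : 0.0;
    }
};

// Insert rows 0, 1, 2, ... in ascending order and time the loop.
template <typename Matrix>
AssemblyResult time_ascending_rows(Matrix& A, int rows, int nnz_per_row)
{
    assert(rows >= 0 && nnz_per_row >= 0);
    const std::size_t n = static_cast<std::size_t>(nnz_per_row);
    std::vector<int> cols(n);
    std::vector<double> vals(n, 1.0);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rows; ++r) {
        for (int j = 0; j < nnz_per_row; ++j)
            cols[static_cast<std::size_t>(j)] = r + j;   // ascending columns, banded
        insert_row(A, r, cols.data(), vals.data(), nnz_per_row);
    }
    const auto stop = std::chrono::steady_clock::now();

    AssemblyResult res;
    res.entries = static_cast<std::size_t>(rows) * n;
    res.seconds = std::chrono::duration<double>(stop - start).count();
    return res;
}
```

For meaningful numbers the problem size has to be large enough that the timed loop runs well above the clock resolution.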
Benchmark: Assembly rate with ascending rows
10,000 rows, 5 non-zeros/row
- MTL4: 46 million entries per second
- uBLAS: 5.9 million entries per second
- uBLAS (gov): 2 million entries per second
- PETSc: 22 million entries per second
Benchmark: Assembly rate with ascending rows
100,000 rows, 50 non-zeros/row
- MTL4: 29.6 million entries per second
- uBLAS: 6.5 million entries per second
- uBLAS (gov): 2.8 million entries per second
- PETSc: 32.3 million entries per second
Benchmark: Assembly rate with random rows
10,000 rows, 5 non-zeros/row
- MTL4: 41.4 million entries per second
- uBLAS: 31,300 entries per second
- uBLAS (gov): 1.9 million entries per second
- PETSc: 19.9 million entries per second
Benchmark: Assembly rate with random rows
100,000 rows, 50 non-zeros/row
- MTL4: 25.6 million entries per second
- uBLAS: measurement abandoned
- uBLAS (gov): 2.7 million entries per second
- PETSc: 25.6 million entries per second
Benchmark: Assembly rate with entirely random entries
10,000 rows, 5 non-zeros/row
- MTL4: 4.8 million entries per second
- uBLAS: 16,700 entries per second
- uBLAS (gov): 1.8 million entries per second
- PETSc: 15,900 entries per second
Benchmark: Assembly rate with random rows
10,000 rows, 50 non-zeros/row
- MTL4: 2.9 million entries per second
- uBLAS: 3,340 entries per second
- uBLAS (gov): 1.7 million entries per second
- PETSc: 13,400 entries per second
How to do Science in Silicon?
[Diagram: Graphics application, CPU, GPU]
Scientific Software
[Diagram: Scientific application, CPU, GPU, Multi-Core, Par. Arch., Scien. Proc.]
Conclusions
- Introduced a new approach for setting and modifying compressed sparse matrices
- Does not need a preparation phase
- Minimal memory footprint
- Optimal performance
- Tuned block insertion in progress
- Extends to distributed data structures