A computational loop

[Figure: nested loop with three stages: integration, Newton iteration, linear system solvers.]
SPIKE: A Parallel Banded System Solver (an introduction)

- Large sparse linear systems arise often in computational science and engineering applications.
- Banded systems, or low-rank perturbations of banded systems (dense or sparse within the band), are sometimes obtained after reordering.
- SPIKE is proposed as a parallel solver for banded systems, with the potential to exhibit multilevel parallelism.

[Figure: sparsity pattern after reverse Cuthill-McKee (RCM) reordering.]
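To illustrate how reordering can expose a band, the sketch below scrambles a tridiagonal matrix and recovers a narrow band with SciPy's `reverse_cuthill_mckee`. The test matrix, the `bandwidth` helper, and the variable names are assumptions made for this example; they are not taken from the slides.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    """Largest |i - j| over the nonzero entries of a dense matrix."""
    rows, cols = np.nonzero(M)
    return int(np.max(np.abs(rows - cols)))

# Hypothetical test case: a tridiagonal matrix whose rows and columns
# have been shuffled, so the band is hidden.
n = 12
T = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
shuffle = np.random.default_rng(0).permutation(n)
A = T[np.ix_(shuffle, shuffle)]

# RCM returns a permutation; applying it symmetrically narrows the band.
perm = reverse_cuthill_mckee(csr_matrix(A), symmetric_mode=True)
A_rcm = A[np.ix_(perm, perm)]

bw_before = bandwidth(A)
bw_after = bandwidth(A_rcm)
print("bandwidth before:", bw_before, "after:", bw_after)
```

The reordered matrix `A_rcm` is the kind of banded system a solver such as SPIKE would then be applied to.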
SPIKE design principles

- Reduce memory references and interprocessor communication, at the cost of extra arithmetic operations compared to LAPACK.
- Allow multiple levels of parallelism.
- Form a polyalgorithm: versions vary from direct to preconditioned iterative schemes.
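Next Generation Sparse Solvers: The SPIKE Algorithm

Solve $Ax = f$, where the banded matrix $A$ is partitioned into diagonal blocks $A_j$ with coupling blocks $B_j$ (above the diagonal) and $C_j$ (below the diagonal):

\[
\begin{pmatrix}
A_1 & B_1 &     &     \\
C_2 & A_2 & B_2 &     \\
    & C_3 & A_3 & B_3 \\
    &     & C_4 & A_4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \end{pmatrix}
\]

Factor $A = DS$ with $D = \mathrm{diag}(A_1, A_2, A_3, A_4)$, then:

1. Solve $Dy = f$.
2. Solve $Sx = y$.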
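The two-stage solve can be checked numerically on a small dense example. This is a minimal sketch, assuming a diagonally dominant random banded matrix with half-bandwidth `m` and `p = 4` equal partitions; it forms `S` densely for clarity, which a real implementation would not do. The helper name `banded_test_matrix` is hypothetical.

```python
import numpy as np

def banded_test_matrix(n, m, seed=0):
    """Random banded matrix (|i - j| <= m), strictly diagonally dominant."""
    rng = np.random.default_rng(seed)
    A = np.triu(np.tril(rng.uniform(-1.0, 1.0, (n, n)), m), -m)
    return A + 2.0 * (2 * m + 1) * np.eye(n)

n, m, p = 16, 2, 4          # order, half-bandwidth, number of partitions
nb = n // p                 # rows per partition (assumed nb >= m)
A = banded_test_matrix(n, m)
f = np.arange(1.0, n + 1)

# D = diag(A_1, ..., A_p): keep only the diagonal blocks of A.
D = np.zeros_like(A)
for j in range(p):
    s = slice(j * nb, (j + 1) * nb)
    D[s, s] = A[s, s]

# S = D^{-1} A, so that A = D S.
S = np.linalg.solve(D, A)

y = np.linalg.solve(D, f)   # stage 1: block-diagonal solve, independent per block
x = np.linalg.solve(S, y)   # stage 2: solve with the spike matrix

print("residual:", np.linalg.norm(A @ x - f))
```

Stage 1 needs no communication between partitions, since each block of `D` can be solved on its own; all coupling is deferred to stage 2.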
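The Spike Matrix "S"

$S = D^{-1}A$ has identity diagonal blocks. Off the diagonal it contains only the spikes $V_j$ (to the right of block $j$) and $W_j$ (to the left of block $j$), each $m$ columns wide:

\[
S =
\begin{pmatrix}
I   & V_1 &     &     \\
W_2 & I   & V_2 &     \\
    & W_3 & I   & V_3 \\
    &     & W_4 & I
\end{pmatrix},
\qquad Sx = y
\]

Reduced System

The rows of $Sx = y$ next to the partition interfaces form an independent reduced system of order $2m(p-1)$, where $p$ is the number of partitions.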
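The reduced system can be extracted and used to retrieve the full solution as sketched below. The example assumes a small diagonally dominant banded matrix, equal partitions, and a densely formed `S`; the index bookkeeping (`idx`, `S_red`) is an illustrative choice, not the layout of any particular SPIKE implementation.

```python
import numpy as np

n, m, p = 16, 2, 4                      # order, half-bandwidth, partitions
nb = n // p                             # rows per partition (assumed nb >= m)
rng = np.random.default_rng(1)
A = np.triu(np.tril(rng.uniform(-1.0, 1.0, (n, n)), m), -m)
A += 2.0 * (2 * m + 1) * np.eye(n)      # strictly diagonally dominant
f = rng.uniform(-1.0, 1.0, n)

# Block-diagonal factor D and spike matrix S = D^{-1} A.
D = np.zeros_like(A)
for j in range(p):
    s = slice(j * nb, (j + 1) * nb)
    D[s, s] = A[s, s]
S = np.linalg.solve(D, A)
y = np.linalg.solve(D, f)

# Interface unknowns: the bottom m rows of partition k and the top m rows
# of partition k+1, for each of the p-1 interfaces.
idx = np.array([i for k in range(1, p)
                  for i in range(k * nb - m, k * nb + m)])

# Reduced system of order 2m(p-1): the interface rows of S involve only
# interface columns.
S_red = S[np.ix_(idx, idx)]
x_red = np.linalg.solve(S_red, y[idx])

# Retrieve the full solution: every other row of S is the identity plus
# spike columns that multiply interface unknowns only.
x = y - (S - np.eye(n))[:, idx] @ x_red

print("reduced order:", S_red.shape[0],
      "residual:", np.linalg.norm(A @ x - f))
```

With these sizes the reduced system has order 2 * 2 * (4 - 1) = 12, compared with 16 for the full system; the saving grows as the partitions get larger relative to the bandwidth.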
SPIKE: A Polyalgorithm

Different choices are made depending on the properties of the matrix and the platform architecture (towards an adaptive library).

The diagonal blocks can be solved:
- Directly (LU, Cholesky, or sparse counterparts)
- Iteratively (with a preconditioning strategy)

The spikes can be computed:
- Explicitly (fully or partially)
- Approximately
- On the fly

The reduced system can be solved:
- Directly (Recursive SPIKE)
- Approximately (Truncated SPIKE)
- Iteratively (with a preconditioning scheme)
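SPIKE vs ScaLAPACK

ScaLAPACK: $AX = F$ with $A = LU$.

SPIKE: $AX = F$ with $A = DS$. $D$ holds the diagonal blocks $A_1, \dots, A_4$; the coupling blocks $B_1, B_2, B_3$ and $C_2, C_3, C_4$ give rise to the spikes $V_1, V_2, V_3$ and $W_2, W_3, W_4$ of the spike matrix $S$. The reduced system is solved first, then the full solution is retrieved.

[Figure: block structure of the ScaLAPACK $L$ and $U$ factors beside the SPIKE $D$ and $S$ factors.]

SPIKE:
- Algorithm design: no LU factorization, no reordering, no Schur complement.
- New banded primitives using BLAS-3.
- Polyalgorithm implementation.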
Multilevel Parallelism: SPIKE calling MKL-Pardiso for banded systems that are sparse within the band

SPIKE uses Pardiso on each cluster node.

[Figure: SPIKE spanning Node 1 to Node 4, with Pardiso running inside each node.]
SPIKE Options (dense within the band)

1. Solving the reduced system:
   R = recursive
   E = explicit
   F = on-the-fly
   T = truncated

2. Factorization (diagonal blocks):
   No pivoting (diagonal boosting, if necessary):
      L = LU
      U = LU & UL
      A = alternate LU or UL
   Pivoting:
      P = LU

3. Solution improvement:
   0 = direct solver only
   2 = iterative refinement
   3 = outer BiCGSTAB iterations
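The "iterative refinement" option can be illustrated generically: an approximate solve is applied repeatedly to the residual of the current iterate. In the sketch below a single-precision solve stands in for whatever approximate solver is being improved (in SPIKE that could be, for instance, a truncated scheme or a factorization with boosted diagonal); the matrix, the helper names, and the fixed count of three steps are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 3
A = np.triu(np.tril(rng.uniform(-1.0, 1.0, (n, n)), m), -m)
A += 2.0 * (2 * m + 1) * np.eye(n)       # well-conditioned banded matrix
f = rng.uniform(-1.0, 1.0, n)

A32 = A.astype(np.float32)               # low-accuracy copy of A

def approx_solve(b):
    """Stand-in approximate solver: solve in single precision."""
    return np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

def rel_residual(x):
    """Relative residual ||f - A x|| / ||f|| in double precision."""
    return np.linalg.norm(f - A @ x) / np.linalg.norm(f)

x0 = approx_solve(f)                     # "direct solver only"
x = x0.copy()
for _ in range(3):                       # iterative refinement steps
    r = f - A @ x                        # residual with the accurate A
    x = x + approx_solve(r)              # correct using the approximate solver

print("before:", rel_residual(x0), "after:", rel_residual(x))
```

The outer BiCGSTAB option follows the same idea, with the approximate solver acting as a preconditioner inside a Krylov iteration rather than in a fixed-point correction loop.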
The SPIKE algorithm: Hierarchy of Computational Modules

Level   Description
3       SPIKE
2       LAPACK; Pardiso, SuperLU, MUMPS; iterative solvers
1       Primitives for banded matrices (our own): banded triangular solve, banded UL;
        BLAS3 (dense matrix-matrix primitives); Sparse BLAS
SPIKE algorithms

Each variant is named by two letters: the algorithm followed by the factorization.

Algorithm:      E = explicit, R = recursive, T = truncated, F = on the fly
Factorization:  P = LU with pivoting, L = LU without pivoting, U = LU and UL without pivoting, A = alternate LU / UL

P (LU with pivoting)
   EP   Explicit generation of spikes; reduced system is solved iteratively with a preconditioner.
   RP   Explicit generation of spikes; reduced system is solved directly using recursive SPIKE.
   FP   Implicit generation of the reduced system, which is solved on the fly using an iterative method.

L (LU without pivoting)
   EL
   RL   Explicit generation of spikes; reduced system is solved directly using recursive SPIKE.
   TL   Truncated generation of spike tips: Vb is exact, Wt is approximate; reduced system is solved directly.
   FL

U (LU and UL without pivoting)
   TU   Truncated generation of spike tips: Vb and Wt are exact; reduced system is solved directly.
   FU   Implicit generation of the reduced system, which is solved on the fly using an iterative method with a preconditioner.

A (alternate LU / UL)
   EA   Explicit generation of spikes using new partitioning; reduced system is solved iteratively with a preconditioner.
   TA   Truncated generation of spikes using new partitioning; reduced system is solved directly.
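The two-letter naming scheme can be captured in a small lookup. The helper below is purely illustrative: `describe`, `REDUCED`, `FACTOR`, and `LISTED` are hypothetical names, not part of any SPIKE interface, and the set of accepted codes is simply the set of combinations that appear in the table.

```python
# Letter meanings as given in the table (first letter: algorithm,
# second letter: factorization of the diagonal blocks).
REDUCED = {
    "E": "explicit",
    "R": "recursive",
    "T": "truncated",
    "F": "on the fly",
}
FACTOR = {
    "P": "LU with pivoting",
    "L": "LU without pivoting",
    "U": "LU and UL without pivoting",
    "A": "alternate LU / UL",
}
# Only the combinations that appear in the table are accepted.
LISTED = {"EP", "RP", "FP", "EL", "RL", "TL", "FL", "TU", "FU", "EA", "TA"}

def describe(code):
    """Return (algorithm, factorization) for a listed two-letter code."""
    code = code.upper()
    if code not in LISTED:
        raise ValueError(f"{code!r} is not a combination listed in the table")
    return REDUCED[code[0]], FACTOR[code[1]]

print(describe("TU"))
print(describe("RP"))
```

Restricting the lookup to listed combinations reflects that not every algorithm is paired with every factorization; for example, the table shows no truncated variant with pivoting.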