A Factored Sparse Approximate Inverse software package (FSAIPACK) for the parallel preconditioning of linear systems
Massimiliano Ferronato, Carlo Janna, Giuseppe Gambolati, Flavio Sartoretto
Department ICEA
Sparse Days 2014, June 5-6

Outline
- Introduction: preconditioning techniques for high performance computing
- Approximate inverse preconditioning for Symmetric Positive Definite matrices: the FSAI-based approach
- FSAIPACK: a software package for high performance FSAI preconditioning
- Numerical results
- Conclusions and future work

Introduction: Preconditioning techniques for high performance computing
- The implementation of large models is becoming quite a popular effort in several applications, with the use of parallel computational resources almost mandatory
- One of the most expensive and memory-consuming tasks in any numerical application is the solution of large sparse linear systems
- Conjugate Gradient-like solution methods can be efficiently implemented on parallel computers, provided that an effective parallel preconditioner is available
- Algebraic preconditioners: robust algorithms that generate a preconditioner from the knowledge of the system matrix only, independently of the problem it arises from
- Most popular and successful classes of preconditioners:
  - Incomplete LU factorizations
  - Approximate inverses
  - Algebraic multigrid

Introduction: Preconditioning techniques for high performance computing
- For parallel computations the Factorized Sparse Approximate Inverse (FSAI) approach is quite attractive, as it is «naturally» parallel
- FSAIPACK: a parallel software package for high performance FSAI preconditioning in the solution of Symmetric Positive Definite linear systems
- A collection of routines that implement several different existing methods for computing an FSAI-based preconditioner
- Allows for a very flexible, user-specified construction of a parallel FSAI preconditioner
- A general purpose package, easy to include as an external library in any existing code
- Currently coded in FORTRAN90 with OpenMP directives for shared memory machines
- Freely available online at

The FSAI-based approach: FSAI definition
- Factorized Sparse Approximate Inverse (FSAI): an almost perfectly parallel factored preconditioner for SPD problems [Kolotilina & Yeremin, 1993], G A G^T ≈ I, with G a lower triangular matrix such that ||I - G L||_F is minimum over the set of matrices with a prescribed lower triangular sparsity pattern S_L, e.g. the pattern of A or A^2, where L is the exact Cholesky factor of A. L is not actually required for computing G!
- Computed via the solution of n independent small dense systems and applied via matrix-vector products
- Nice features: (1) ideally perfect parallel construction and application of the preconditioner; (2) preservation of the positive definiteness of the native matrix

The FSAI-based approach: FSAI definition
- The key property for the quality of any FSAI-based parallel preconditioner is the selection of the sparsity pattern S_L
- Historically, the first idea was to define S_L a priori, but more effective strategies can be developed by dynamically selecting the position of the non-zero entries in S_L
- Static FSAI: S_L is defined a priori, e.g., as the pattern of A^k, possibly after a sparsification of A [Huckle 1999; Chow 2000, 2001]
- Dynamic FSAI: S_L is defined dynamically during the computation of G using some optimization algorithm [Huckle 2003; Janna & Ferronato 2011]
- Recurrent FSAI: the FSAI factor G is defined as the product of several factors, computed either statically or dynamically [Wang & Zhang 2003; Bergamaschi & Martinez 2012]
- Post-filtration: it is generally recommended to apply an a posteriori sparsification of G, dropping the smallest entries [Kolotilina & Yeremin, 1999]

FSAIPACK: Static FSAI construction
- FSAIPACK is a software library that collects several different ways of computing an FSAI preconditioner in a shared memory environment and allows the construction techniques to be combined into original user-specified strategies
- Assuming that S_L is given, it is possible to compute G
- Static FSAI: denote by P_i the set of the m_i column indices belonging to the i-th row of S_L. Compute the vector x_i by solving the m_i × m_i dense linear system A[P_i, P_i] x_i = e_{m_i} (the last column of the m_i × m_i identity), and scale it to obtain the dense i-th row of G: [G]_{i, P_i} = x_i / sqrt([x_i]_{m_i})
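The row-by-row recipe above can be sketched in dense NumPy (a toy one-subdiagonal pattern stands in for S_L; `fsai_row` is an illustrative helper, not an FSAIPACK routine). The check at the end confirms that minimizing ||I - G L||_F over the pattern never actually requires L:

```python
import numpy as np

def fsai_row(A, Pi):
    """Static FSAI row on pattern Pi (diagonal index last): solve the small
    dense system A[Pi,Pi] x = e_m, then scale so that [G A G^T]_ii = 1."""
    m = len(Pi)
    x = np.linalg.solve(A[np.ix_(Pi, Pi)], np.eye(m)[:, -1])
    return x / np.sqrt(x[-1])

rng = np.random.default_rng(0)
n = 6
Braw = rng.standard_normal((n, n))
A = Braw @ Braw.T + n * np.eye(n)        # SPD test matrix
L = np.linalg.cholesky(A)                # exact Cholesky factor (checks only)

G = np.zeros((n, n))
for i in range(n):
    Pi = list(range(max(0, i - 1), i + 1))   # toy pattern: one subdiagonal
    G[i, Pi] = fsai_row(A, Pi)

# The scaling enforces diag(G A G^T) = 1, preserving positive definiteness.
assert np.allclose(np.diag(G @ A @ G.T), 1.0)

# Row-wise minimization of ||I - G L||_F over the same pattern yields the
# same rows up to a positive scalar: L is indeed never required.
for i in range(n):
    Pi = list(range(max(0, i - 1), i + 1))
    g_star = np.linalg.solve(A[np.ix_(Pi, Pi)], L[i, i] * np.eye(len(Pi))[:, -1])
    ratio = G[i, Pi] / g_star
    assert np.allclose(ratio, ratio[0]) and ratio[0] > 0
```

In the package each of the n small systems is independent, which is what makes the construction embarrassingly parallel.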

FSAIPACK: Static pattern generation
- Static pattern generation: S_L is the lower triangular pattern of a power A^k of A, or of a sparsified A in which the smallest entries are dropped according to a prescribed relative threshold τ
- User-specified parameters needed: k (integer), τ (real)
- The non-zero pattern for the Static FSAI computation can be generated with the aid of a recurrence on the structure of the (sparsified) matrix powers
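A structural sketch of this pattern generation, under the assumption (the slide's formulas are lost in the transcript) that sparsification drops a_ij when |a_ij| < τ·sqrt(a_ii·a_jj); `static_pattern` is an illustrative name:

```python
import numpy as np

def static_pattern(A, k=2, tau=0.0):
    """Lower triangular structure of Atilde^k (sketch), where Atilde drops
    a_ij with |a_ij| < tau*sqrt(a_ii*a_jj) (assumed sparsification rule)."""
    d = np.sqrt(np.diag(A))
    Bs = (np.abs(A) >= tau * np.outer(d, d)).astype(int) * (A != 0)
    P = Bs.copy()
    for _ in range(k - 1):          # structural power via P <- struct(P * Bs)
        P = ((P @ Bs) != 0).astype(int)
    return np.tril(P)

# Tridiagonal SPD example: the pattern of A^2 gains a second subdiagonal.
n = 6
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
P2 = static_pattern(A, k=2, tau=0.0)
assert P2[3, 1] == 1 and P2[3, 0] == 0

# A large threshold strips all off-diagonals: the pattern collapses to diag.
assert np.array_equal(static_pattern(A, k=2, tau=0.5), np.eye(n, dtype=int))
```

In FSAIPACK the same idea is realized on sparse storage, so only the structure (not the numerical power A^k) is ever formed.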

FSAIPACK: Dynamic FSAI construction
- For ill-conditioned problems, high values of k may be needed to properly decrease the iteration count, or even to allow for convergence at all, and the preconditioner construction and application can become quite heavy
- A more efficient option relies on selecting the pattern dynamically, by an adaptive procedure that uses, in some sense, the "best" available positions for the non-zero coefficients
- The Kaporin conditioning number β of an SPD matrix A of size n is defined as: β(A) = (tr(A)/n) / det(A)^(1/n), where β(A) >= 1, and β(A) = 1 iff A is a scalar multiple of the identity

FSAIPACK: Dynamic FSAI construction
- The Kaporin conditioning number of an FSAI preconditioned matrix reads [Janna & Ferronato 2011; Janna et al. 2014]: β(G̃ A G̃^T) = ((1/n) Σ_i ψ_i) / det(A)^(1/n), with G̃ the unit-diagonal FSAI factor, where ψ_i = g̃_i A g̃_i^T depends only on the non-zero entries in the i-th row g̃_i of G̃
- The scalar ψ_i is a quadratic form of A in g̃_i
- Idea for generating the pattern dynamically: for each row, select the non-zero positions in g̃_i providing the largest decrease in the ψ_i value
- Compute the gradient of ψ_i with respect to g̃_i, ∇ψ_i = 2 A g̃_i, and retain the positions containing the largest entries
- The procedure can be iterated until either a maximum number of iterations or some exit tolerance is met
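The Kaporin number can be checked numerically: β is the ratio of the arithmetic to the geometric mean of the eigenvalues, hence β ≥ 1 with equality only for scalar multiples of the identity. A minimal sketch (`kaporin` is an illustrative helper):

```python
import numpy as np

def kaporin(A):
    """Kaporin conditioning number: (trace(A)/n) / det(A)^(1/n)."""
    n = A.shape[0]
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0                      # defined for SPD matrices
    return (np.trace(A) / n) / np.exp(logdet / n)

rng = np.random.default_rng(1)
Braw = rng.standard_normal((5, 5))
A = Braw @ Braw.T + 5 * np.eye(5)

assert abs(kaporin(np.eye(5)) - 1.0) < 1e-12       # beta = 1 for I ...
assert abs(kaporin(3 * np.eye(5)) - 1.0) < 1e-12   # ... and for multiples of I
assert kaporin(A) >= 1.0                            # AM >= GM on eigenvalues
```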

FSAIPACK: Dynamic FSAI construction
- Adaptive FSAI: S_L is built dynamically row-by-row and G is immediately computed, choosing at each step the s positions in the i-th row with the largest gradient components of ψ_i, with a maximum number of k_max steps, until the exit tolerance ε on the decrease of ψ_i is achieved
- User-specified parameters needed: k_max (integer), s (integer), ε (real)
- The default initial guess G_0 is diag(A)^(-1/2), but any other user-specified lower triangular matrix is possible
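A minimal dense sketch of the adaptive search for one row, under the unit-diagonal convention g[i] = 1 (`psi_and_row` and `adaptive_fsai_row` are illustrative names, and the stopping rule on the relative decrease of ψ_i is an assumption):

```python
import numpy as np

def psi_and_row(A, i, P):
    """Row g with g[i] = 1 and support P ∪ {i} minimizing psi = g A g^T,
    together with the attained psi value."""
    g = np.zeros(A.shape[0])
    g[i] = 1.0
    if P:
        g[P] = np.linalg.solve(A[np.ix_(P, P)], -A[P, i])
    return g, g @ A @ g

def adaptive_fsai_row(A, i, kmax=10, s=2, eps=1e-3):
    """Adaptive pattern search (sketch): at each step add the s positions
    j < i with the largest |d psi / d g_j| = |2 (A g)_j|, recompute the row
    exactly, and stop after kmax steps or when the relative decrease of psi
    drops below eps."""
    P, (g, psi) = [], psi_and_row(A, i, [])
    for _ in range(kmax):
        grad = 2.0 * (A @ g)
        cand = [j for j in range(i) if j not in P]
        if not cand:
            break
        P = sorted(P + sorted(cand, key=lambda j: -abs(grad[j]))[:s])
        g, psi_new = psi_and_row(A, i, P)
        done = (psi - psi_new) < eps * psi
        psi = psi_new
        if done:
            break
    return g, psi

rng = np.random.default_rng(2)
Braw = rng.standard_normal((8, 8))
A = Braw @ Braw.T + 8 * np.eye(8)

g, psi = adaptive_fsai_row(A, 7, kmax=3, s=2)
g_full, psi_full = psi_and_row(A, 7, list(range(7)))
# Enlarging the pattern can only lower the minimum: a_77 >= psi >= full-pattern psi.
assert psi_full - 1e-9 <= psi <= A[7, 7] + 1e-9
```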

FSAIPACK: Dynamic FSAI construction
- Iterative FSAI: the i-th row of G is computed by minimizing ψ_i with an incomplete Steepest Descent method, retaining the s largest entries per row, for at most k_iter iterations or until the exit tolerance ε is achieved
- As ψ_i is a quadratic form of A in the i-th row of G, it can be minimized by using a gradient method
- This gives rise to an iterative construction of S_L and G, another kind of Dynamic FSAI
- User-specified parameters needed: k_iter (integer), s (integer), ε (real)
- The default initial guess G_0 is diag(A)^(-1/2), but any other user-specified lower triangular matrix is possible
- The use of an inner preconditioner M^(-1) is also allowed
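A minimal sketch of the incomplete Steepest Descent idea for one row (again with g[i] = 1 fixed; only "retain the s largest entries" comes from the slide, while the exact line search and the truncation point are assumptions):

```python
import numpy as np

def iterative_fsai_row(A, i, kiter=20, s=3):
    """Incomplete steepest descent (sketch) on psi = g A g^T with g[i] = 1:
    exact line-search gradient step restricted to positions j < i, then
    keep only the s largest-magnitude off-diagonal entries."""
    n = A.shape[0]
    g = np.zeros(n)
    g[i] = 1.0
    for _ in range(kiter):
        d = A @ g
        d[i:] = 0.0                          # only j < i may change
        dAd = d @ A @ d
        if dAd <= 1e-30:
            break
        g = g - ((d @ A @ g) / dAd) * d      # exact line search (quadratic psi)
        off = np.abs(g)
        off[i] = 0.0
        keep = np.argsort(off)[-s:]          # incomplete part: s largest entries
        mask = np.zeros(n, dtype=bool)
        mask[keep] = True
        mask[i] = True
        g = np.where(mask, g, 0.0)
    return g, g @ A @ g

rng = np.random.default_rng(3)
Braw = rng.standard_normal((8, 8))
A = Braw @ Braw.T + 8 * np.eye(8)

g, psi = iterative_fsai_row(A, 7, kiter=10, s=3)
P = list(range(7))
psi_min = A[7, 7] - A[P, 7] @ np.linalg.solve(A[np.ix_(P, P)], A[P, 7])
# Any unit-diagonal row is bounded below by the full-pattern minimum.
assert g[7] == 1.0 and psi >= psi_min - 1e-9
```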

FSAIPACK: Recurrent FSAI construction
- Recurrent FSAI: the final factor G is obtained as the product of n_l factors, G = G_{n_l} ⋯ G_2 G_1, where G_k is the k-level preconditioning factor for A_{k-1}, with A_k = G_k A_{k-1} G_k^T, A_0 = A and G_0 = I. Even if each factor is very sparse and computationally very cheap, the resulting preconditioner is actually very dense and is never formed explicitly
- This amounts to an implicit construction of the sparsity pattern S_L, writing the FSAI preconditioner as a product of factors
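A dense sketch of the recurrent construction (the tridiagonal pattern per level is an arbitrary choice for illustration, and the product is accumulated explicitly only for checking, which the package never does). The final check relies on the β-optimality of FSAI over its pattern, so each extra level cannot worsen the Kaporin number:

```python
import numpy as np

def fsai(A, bandwidth=1):
    """Static FSAI with a banded lower pattern (sketch helper)."""
    n = A.shape[0]
    G = np.zeros((n, n))
    for i in range(n):
        Pi = list(range(max(0, i - bandwidth), i + 1))
        x = np.linalg.solve(A[np.ix_(Pi, Pi)], np.eye(len(Pi))[:, -1])
        G[i, Pi] = x / np.sqrt(x[-1])
    return G

def recurrent_fsai(A, nl=3):
    """Recurrent FSAI (sketch): G = G_nl ... G_1 with A_k = G_k A_{k-1} G_k^T
    and A_0 = A; each level uses a very sparse (tridiagonal) pattern."""
    Ak, G = A.copy(), np.eye(A.shape[0])
    for _ in range(nl):
        Gk = fsai(Ak)
        Ak = Gk @ Ak @ Gk.T
        G = Gk @ G                # explicit product, for checking only
    return G

def kaporin(M):
    n = M.shape[0]
    return (np.trace(M) / n) / np.exp(np.linalg.slogdet(M)[1] / n)

rng = np.random.default_rng(5)
Braw = rng.standard_normal((10, 10))
A = Braw @ Braw.T + 10 * np.eye(10)

G1, G3 = recurrent_fsai(A, nl=1), recurrent_fsai(A, nl=3)
# Extra levels can only improve the Kaporin number of the preconditioned matrix.
assert kaporin(G3 @ A @ G3.T) <= kaporin(G1 @ A @ G1.T) + 1e-9
```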

FSAIPACK: Numerical results
- Analysis of the properties of each single method (Static FSAI, Adaptive FSAI, Iterative FSAI, Recurrent FSAI) on a structural test case (size = 190,581; no. of non-zeroes = 7,531,389). (Per-method performance plots not reproduced in the transcript.)

FSAIPACK: Numerical results
- Comparison between the different methods (Static, Adaptive, Iterative, Recurrent) on a Linux cluster with 24 processors, for preconditioner densities ρ_G = 0.50, 1.00 and 2.00, reporting T_p [s] and # iter. for each case (table values not recoverable from the transcript)
- The most efficient option is to combine the different methods so as to maximize the pros and minimize the cons
- FSAIPACK implements all the methods for building an FSAI-based preconditioner following a user-specified strategy that can be prescribed through a pseudo-programming language

FSAIPACK: Numerical results
- Examples and numerical results (Linux cluster, 24 processors) for EMILIA (reservoir mechanics): size = 923,136; non-zeroes = 41,005,206
- Strategies tested: Static (k=3, τ=1e-2); Adaptive (k_max=10, s=5, ε=1e-2); Iterative (k_iter=20, s=10); Static + Adaptive; Iterative + Static + Adaptive. For each strategy the table reports # iter., T_p [s], T_s [s], T_t [s] and the preconditioner density (values not recoverable from the transcript)
- Note: post-filtration is used in all cases

FSAIPACK: Numerical results
- STOCF (porous media flow): size = 1,465,137; non-zeroes = 21,005,389
- Strategies tested: Static (k=4, τ=1e-2); Adaptive (k_max=20, s=1, ε=1e-3); Iterative (k_iter=10, s=10); Iterative + Static + S.P. Iterative; Static + S.P. Iterative + Adaptive. For each strategy the table reports # iter., T_p [s], T_s [s], T_t [s] and the preconditioner density (values not recoverable from the transcript)
- Note: post-filtration is used in all cases

FSAIPACK: Numerical results
- MECH (structural mechanics): size = 1,102,614; non-zeroes = 48,987,558
- Strategies tested: Static (k=3, τ=1e-2); Adaptive (k_max=25, s=2, ε=1e-3); Iterative (k_iter=30, s=10); Static + S.P. Iterative + Adaptive; Iterative + Adaptive. For each strategy the table reports # iter., T_p [s], T_s [s], T_t [s] and the preconditioner density (values not recoverable from the transcript)
- Note: post-filtration is used in all cases

FSAIPACK: Numerical results
- Example of a strategy prescribed using the pseudo-programming language:

> MK_PATTERN [ A : patt ] -t -k 1e-2 2
> STATIC_FSAI [ A, patt : F ]
> TRANSP_FSAI [ F : Ft ]
> PROJ_FSAI [ A, F, Ft : F ] -n -s -e e-8
> ADAPT_FSAI [ A : F ] -n -s -e e-3
> POST_FILT [ A : F ] -t 0.01
> TRANSP_FSAI [ F : Ft ]
> APPEND_FSAI [ F, Ft : PREC ]

- Even complex strategies are easy to manage

FSAIPACK: Numerical results
- FSAIPACK scalability on the largest example: test on an IBM BlueGene/Q node equipped with 16 cores
- Between 16 and 64 threads the ideal profile is flat, because all physical cores are already saturated
- Using more threads than cores is nonetheless convenient, as it hides memory access latencies

Conclusions: Results…
- FSAI-based approaches are attractive preconditioners for the efficient solution of SPD linear systems on parallel computers
- The traditional static pattern generation is fast and cheap, but can give rise to poor preconditioners
- The dynamic pattern generation can considerably improve the FSAI quality, especially in ill-conditioned problems, but its cost typically increases quite rapidly with the density of the preconditioner
- FSAIPACK is a high performance software package for building an FSAI-based preconditioner using a user-specified strategy that combines different methods for selecting the sparsity pattern
- A smart combination of static and dynamic pattern generation techniques is probably the most efficient way to build an effective preconditioner, even for very ill-conditioned problems

Conclusions: … and future work
- Generalizing the results to non-symmetric linear systems: difficulties with the existence and uniqueness of the preconditioner, and with an efficient dynamic pattern generation
- Implementing the FSAIPACK library also for distributed memory computers and GPU accelerators, mixing OpenMP, MPI and CUDA
- Studying the Iterative FSAI construction in more detail:
  - Analysis of the theoretical properties of incomplete gradient methods
  - Replacing the Incomplete Steepest Descent method with an Incomplete Self-Preconditioned Conjugate Gradient method
  - Understanding why the pattern is generally good, even though the computed coefficients can be inaccurate
- FSAIPACK is freely available online at

Department ICEA
Thank you for your attention
Sparse Days 2014, June 5-6