MUMPS: A Multifrontal Massively Parallel Solver

IMPLEMENTATION
- Distributed multifrontal solver
- MPI / F90 based (C user interface also available)
- Stability based on partial pivoting
- Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments
- Use of BLAS, LAPACK, ScaLAPACK

Main features:
MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T.
- Symmetric or unsymmetric matrices (partial pivoting)
- Parallel factorization and solve phases (uniprocessor version also available)
- Iterative refinement and backward error analysis
- Various matrix input formats:
    - assembled format
    - distributed assembled format
    - sum of elemental matrices
- Null space functionalities (experimental): rank detection and null space basis
- Partial factorization and Schur complement matrix
- Version for complex arithmetic

Figure: A fully asynchronous distributed solver (VAMPIR trace, 8 processors).

AVAILABILITY
MUMPS is available free of charge for non-commercial use. It has been used on a number of platforms (Cray T3E, Origin 2000, IBM SP, Linux clusters, …) by a few hundred current users (finite elements, chemistry, simulation, aeronautics, …). If you are interested in obtaining MUMPS for your own use, please refer to the MUMPS home page.

Figure: BMW car body, with its numbers of unknowns and nonzeros (MSC.Software).

Competitive performance
The MUMPS package performs well relative to other parallel sparse solvers; for example, the table below compares it with the SuperLU code from Demmel and Li. These results are taken from “Analysis and comparison of two general solvers for distributed memory computers”, ACM TOMS, 27.

CURRENT RESEARCH: ACTIVE RESEARCH IS FEEDING THE MUMPS SOFTWARE PLATFORM

The MUMPS package has been partially supported by the Esprit IV Project PARASOL and by CERFACS, ENSEEIHT-IRIT, INRIA Rhône-Alpes, LBNL-NERSC, PARALLAB and RAL. The authors are Patrick Amestoy, Jean-Yves L’Excellent, Iain Duff and Jacko Koster.
Functionalities related to rank-revealing were first implemented by M. Tuma (Institute of Computer Science, Academy of Sciences of the Czech Republic) while he was at CERFACS. We are also grateful to C. Bousquet, C. Daniel, A. Guermouche, G. Richard, S. Pralet and C. Vömel, who have worked on specific parts of this software.

Table: Factorisation time in seconds of large matrices on the CRAY T3E (1 proc = not enough memory). Columns: Matrix, Ordering, Solver, Number of processors. Rows: bbmat and ecl32, each with the AMD and ND (METIS) orderings, each solved with MUMPS and SuperLU.

Reorderings and optimization of the memory usage
MUMPS uses state-of-the-art reordering techniques (AMD, AMF, ND, SCOTCH, PORD, METIS). These techniques have a strong impact on the parallelism and on the number of operations, and we are currently studying their impact on the dynamic memory usage of MUMPS. In particular, we designed algorithms to optimize the memory occupation of the multifrontal stack. Future work includes dynamic memory load balancing and the design of an out-of-core version.

Table: Best decrease obtained using our algorithm to decrease the stack for each reordering technique. Columns: Matrix, Reordering, Percentage of memory decrease. Rows: thermal (PORD), THREAD (METIS), af23560 (AMF), xenon2 (SCOTCH), rma10 (AMD). Results obtained by A. Guermouche (PhD student in the INRIA ReMaP project).

Mixing dynamic and static scheduling strategies
MUMPS uses a completely dynamic approach with distributed scheduling and scales well up to around 100 processors. Introducing more static information helps reduce the cost of the dynamic decisions and makes MUMPS more scalable.

Platforms with heterogeneous network (clusters of SMP)
In the MUMPS scheduling, work is given to processors according to their load. Penalizing the load of processors on a distant node keeps tasks with high communication on the same node and improves performance, as shown in the table below.
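The load penalty described above can be illustrated with a small sketch. This is not the MUMPS scheduler, which is written in F90 and whose exact rule is not given here; it is a toy model under stated assumptions. The names `Processor`, `effective_load` and `select_workers` are hypothetical, the penalty is assumed to be a multiplicative factor, and the loads are made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    rank: int    # process identifier
    node: int    # SMP node hosting the process
    load: float  # current workload estimate (arbitrary units)


def effective_load(proc, master_node, penalty):
    """Load as seen by the master: processes on another node look busier."""
    if proc.node == master_node:
        return proc.load
    return proc.load * penalty  # assumed multiplicative penalty (hypothetical)


def select_workers(procs, master, count, penalty=1.0):
    """Return the `count` least loaded candidates, excluding the master."""
    candidates = [p for p in procs if p.rank != master.rank]
    # Ties are broken by rank so that the choice is deterministic.
    candidates.sort(key=lambda p: (effective_load(p, master.node, penalty), p.rank))
    return candidates[:count]


# Two nodes of four processes each; loads are made-up numbers.
procs = [
    Processor(0, 0, 5.0), Processor(1, 0, 4.0),
    Processor(2, 0, 6.0), Processor(3, 0, 7.0),
    Processor(4, 1, 3.0), Processor(5, 1, 3.5),
    Processor(6, 1, 9.0), Processor(7, 1, 8.0),
]
master = procs[0]

plain = select_workers(procs, master, 2)               # load only
aware = select_workers(procs, master, 2, penalty=2.0)  # distant node penalized

print([p.rank for p in plain])  # [4, 5]: both workers sit on the distant node
print([p.rank for p in aware])  # [1, 2]: both workers share the master's node
```

With a penalty of 1.0 the master picks the two least loaded processes, which happen to be on the other node; with a penalty of 2.0 the same loads lead it to pick local processes, so the task's communication stays inside one node.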
                                          Standard MUMPS   Modified MUMPS
Time for factorization                    49.2 seconds     44.0 seconds
Total volume of communication             3957 MB          3600 MB
Total volume of internode communication   2017 MB          1004 MB

Table: Effect of taking the hybrid network into account. Matrix PRE2, SCOTCH, 2 nodes of 16 processors of an IBM SP. Results obtained by S. Pralet (PhD, CERFACS).

Figure: Effect of injecting more static information into the dynamic scheduling of MUMPS. Rectangular grids of increasing size, ND. Results obtained by C. Vömel (PhD, CERFACS) on a CRAY T3E.

Mixing MPI and OpenMP on clusters of SMP
We report below on a preliminary experiment with hybrid parallelism on one node (16 procs) of an IBM SP. The best results are obtained when using 8 MPI processes with 2 OpenMP threads each.

Figure: Regular problem from an 11-point discretization (cubic grid 64x64x64), ND used. Results obtained by S. Pralet (PhD, CERFACS).

This poster was prepared by Jean-Yves L’Excellent.