Slide 1
PaStiX: how to reduce memory overhead
ASTER meeting, Bordeaux, Nov 12-14, 2007
PaStiX team
LaBRI, UMR CNRS 5800, Université Bordeaux I
Projet ScAlApplix, INRIA Futurs
Slide 2
PaStiX solver
Current development team (SOLSTICE):
- P. Hénon (researcher, INRIA)
- F. Pellegrini (assistant professor, LaBRI/INRIA)
- P. Ramet (assistant professor, LaBRI/INRIA)
- J. Roman (professor, leader of the ScAlApplix INRIA project)
PhD student & engineer:
- M. Faverge (NUMASIS project)
- X. Lacoste (INRIA)
Other contributors since 1998:
- D. Goudin (CEA-DAM)
- D. Lecas (CEA-DAM)
Main users:
- Electromagnetism & structural mechanics codes at CEA-DAM CESTA
- MHD plasma instabilities for ITER at CEA Cadarache (ASTER)
- Fluid mechanics at MAB, Bordeaux
Slide 3
PaStiX solver
Functionalities:
- LLt, LDLt, LU factorization (symmetric pattern) with a supernodal implementation
- Static pivoting (maximum weight matching) + iterative refinement / CG / GMRES
- 1D/2D block distribution + full BLAS3
- Supports external ordering libraries (Scotch ordering provided)
- MPI/threads implementation (SMP node / cluster / multi-core / NUMA)
- Single/double precision + real/complex operations
- Requires only C + MPI + POSIX threads
- Multiple right-hand sides (direct factorization)
- Incomplete factorization ILU(k) preconditioner
Available on the INRIA Gforge:
- All-in-one source code
- Easy to install on Linux or AIX systems
- Simple API (WSMP-like); see the sketch below
- Thread safe (can be called from multiple threads in multiple MPI communicators)
Current work:
- Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
- Dynamic scheduling inside SMP nodes (static mapping)
- Out-of-core implementation
- Generic finite element assembly (domain decomposition associated to the matrix distribution)
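To make the "Simple API" bullet concrete, here is a minimal C sketch of the kind of input such a WSMP-like, all-in-one interface consumes: a symmetric matrix in compressed sparse column (CSC) format plus a right-hand side. The array layout follows common CSC conventions; the commented-out solver call is purely illustrative and is not the actual PaStiX prototype (see the PaStiX documentation for that).

```c
/* Minimal sketch: a small SPD matrix in compressed sparse column (CSC)
 * format with 1-based indices, i.e. the kind of input a WSMP-like,
 * single-call solver interface consumes.  Lower triangle stored:
 *   [ 4 -1  0  0 ]
 *   [-1  4 -1  0 ]
 *   [ 0 -1  4 -1 ]
 *   [ 0  0 -1  4 ]                                                   */
#include <stdio.h>

int main(void)
{
    int    n        = 4;                      /* matrix order            */
    int    colptr[] = {1, 3, 5, 7, 8};        /* start of each column    */
    int    rowind[] = {1, 2, 2, 3, 3, 4, 4};  /* row index of each value */
    double values[] = {4., -1., 4., -1., 4., -1., 4.};
    double rhs[]    = {1., 1., 1., 1.};       /* right-hand side b       */

    /* A real run would now hand (n, colptr, rowind, values, rhs) plus the
     * integer/double parameter arrays to the solver in a single call
     * (ordering + symbolic + numeric factorization + solve), for example
     *     solver(n, colptr, rowind, values, rhs, iparm, dparm);  <- illustrative
     * and read the solution back from rhs.                               */
    printf("n = %d, nnz(lower) = %d, b[0] = %g\n", n, colptr[n] - 1, rhs[0]);
    (void)rowind;
    (void)values;
    return 0;
}
```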
Slide 4
pastix.gforge.inria.fr
Latest publication (to appear in Parallel Computing): "On finding approximate supernodes for an efficient ILU(k) factorization".
For more publications, see: http://www.labri.fr/~ramet/
Slide 5
Direct solver chain (in PaStiX)
Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorization & solve)
Data flow: graph → partition → symbolic matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution
Analysis (Scotch, Fax, Blend) consists of sequential steps; factorization and solve are parallel.
Slide 6
Direct solver chain (in PaStiX) — Scotch (ordering & amalgamation)
Sparse matrix ordering (minimizes fill-in).
Scotch uses a hybrid algorithm: incomplete nested dissection, with the resulting subgraphs ordered by an Approximate Minimum Degree method under constraints (HAMD); see the sketch below.
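A toy sketch of the incomplete nested dissection idea on the simplest possible graph, a 1D chain: pick a middle separator, order both halves first and the separator last, and stop recursing on small subgraphs, which are then ordered directly (standing in here for the constrained Approximate Minimum Degree used by Scotch). This is only an illustration; Scotch works on general graphs with real separator heuristics.

```c
/* Toy nested dissection on a 1D chain graph 0 - 1 - ... - n-1.
 * Vertices of a separator are numbered AFTER both halves, so the
 * corresponding rows/columns are eliminated last (less fill-in).
 * Small subgraphs (< LEAF vertices) are ordered directly, standing in
 * for the constrained AMD step used on real graphs.                  */
#include <stdio.h>

#define LEAF 4

static int next_label;            /* next elimination rank to assign   */

static void dissect(int lo, int hi, int *perm)
{
    int len = hi - lo + 1;
    if (len <= 0)
        return;
    if (len < LEAF) {             /* leaf: order the subchain directly */
        for (int v = lo; v <= hi; v++)
            perm[v] = next_label++;
        return;
    }
    int sep = lo + len / 2;       /* middle vertex separates the chain */
    dissect(lo, sep - 1, perm);   /* order left part first             */
    dissect(sep + 1, hi, perm);   /* then right part                   */
    perm[sep] = next_label++;     /* separator is ordered last         */
}

int main(void)
{
    enum { N = 15 };
    int perm[N];                  /* perm[v] = elimination rank of v   */

    next_label = 0;
    dissect(0, N - 1, perm);
    for (int v = 0; v < N; v++)
        printf("vertex %2d eliminated at rank %2d\n", v, perm[v]);
    return 0;
}
```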
Slide 7
Partition and block elimination tree → domain decomposition
Slide 8
Direct solver chain (in PaStiX) — Fax (block symbolic factorization)
The block symbolic factorization relies on Q(G, P)* = Q(G*, P), so it runs in time linear in the number of blocks.
Dense block structures → only a few extra pointers are needed to store the matrix (see the sketch below).
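To illustrate the "only a few extra pointers" remark: after block symbolic factorization, the factor is described by column blocks (supernodes), each owning a short list of dense off-diagonal blocks given by row intervals and an offset into one coefficient array. The struct and field names below are illustrative, not PaStiX's internal ones.

```c
/* Illustrative block storage produced by a block symbolic factorization:
 * each column block (supernode) owns a few dense off-diagonal blocks,
 * each described only by a row interval and an offset into one big
 * coefficient array -- a handful of integers per block, nothing per entry. */
#include <stdio.h>

typedef struct {
    int first_row;     /* first row of the dense block            */
    int last_row;      /* last row (inclusive)                    */
    int coef_offset;   /* where its coefficients start in coefs[] */
} Block;

typedef struct {
    int    first_col, last_col;  /* column interval of the supernode */
    int    nblocks;              /* number of blocks (incl. diagonal) */
    Block *blocks;
} ColumnBlock;

int main(void)
{
    /* One supernode spanning columns 0..2, with its diagonal block 0..2
     * and two off-diagonal blocks covering rows 5..7 and 10..11.        */
    Block blocks[] = { {0, 2, 0}, {5, 7, 9}, {10, 11, 18} };
    ColumnBlock cblk = {0, 2, 3, blocks};

    int width = cblk.last_col - cblk.first_col + 1;
    int nnz = 0;
    for (int b = 0; b < cblk.nblocks; b++)
        nnz += (cblk.blocks[b].last_row - cblk.blocks[b].first_row + 1) * width;

    printf("supernode of width %d: %d blocks, %d stored coefficients,\n"
           "described by only %zu bytes of structure\n",
           width, cblk.nblocks, nnz,
           sizeof(ColumnBlock) + cblk.nblocks * sizeof(Block));
    return 0;
}
```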
Slide 9
Direct solver chain (in PaStiX) — Blend (refinement & mapping)
[Figure: block structure of a small 8×8 example.]
Blend provides a CPU time prediction and the exact memory resources (see the sketch below).
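Because the block structure is known exactly before any numerical work, Blend can estimate the cost and the memory of every task from block dimensions alone. A rough sketch of such an estimate, assuming a simple flops-over-sustained-GEMM-rate time model (the rate and block shapes below are made up for illustration):

```c
/* Rough sketch of static cost prediction from the symbolic structure only:
 * a block update C -= A * B^T with A (m x k) and B (n x k) costs about
 * 2*m*n*k flops, and a factor block of shape m x k occupies m*k doubles,
 * so both time and memory can be budgeted before factorization starts.   */
#include <stdio.h>

#define GEMM_RATE 4.0e9   /* assumed sustained flop/s per core (illustrative) */

static double update_flops(int m, int n, int k) { return 2.0 * m * n * k; }

int main(void)
{
    /* a few (m, n, k) block-update shapes taken from an imaginary
     * symbolic factorization                                        */
    int shapes[][3] = { {512, 256, 128}, {1024, 512, 256}, {256, 128, 64} };
    double total_flops = 0.0, total_bytes = 0.0;

    for (int i = 0; i < 3; i++) {
        int m = shapes[i][0], n = shapes[i][1], k = shapes[i][2];
        total_flops += update_flops(m, n, k);
        total_bytes += (double)m * k * sizeof(double);  /* block of the factor */
    }
    printf("predicted time  : %.3f s\n", total_flops / GEMM_RATE);
    printf("predicted memory: %.1f MB (known exactly from block sizes)\n",
           total_bytes / 1.0e6);
    return 0;
}
```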
Slide 10
Direct solver chain (in PaStiX)
Modern architecture management (SMP nodes): hybrid threads/MPI implementation (all processors of the same SMP node work directly in shared memory).
Less MPI communication and a lower parallel memory overhead.
Slide 11
Ways to reduce memory overhead
Partial aggregation:
- Aims at reducing the memory overhead due to the aggregations
- Additional step at the end of pre-processing, using a given memory upper bound
- Anticipated emission of partially aggregated blocks (see the sketch below)
- On most test cases: 50% memory → 30% factorization time
Need more? Out-of-core implementation:
- Manage disk I/O / computation overlap
- The static scheduling gives a good prefetch I/O scheme
- Needs an efficient SMP implementation first, to share and schedule I/O requests
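A minimal sketch of the partial aggregation idea, under assumed data structures: contributions destined for a remote block are normally accumulated in a local buffer and sent once, but when the total size of these buffers reaches the user-given memory bound, the fullest buffer is emitted early and reused, trading extra messages for a lower memory peak.

```c
/* Sketch of partial aggregation under a memory bound: contributions headed
 * for a remote block are accumulated in a local buffer and normally sent
 * once (full aggregation); when the buffers would exceed the user-given
 * bound, the fullest one is emitted early and reused.  Sending is only
 * simulated by a printf here.                                             */
#include <stdio.h>
#include <string.h>

#define NBUF        4
#define BUF_ENTRIES 1000
#define MEM_BOUND   (2 * BUF_ENTRIES * sizeof(double))  /* user-given bound */

typedef struct {
    double data[BUF_ENTRIES]; /* aggregated contributions for one remote block */
    size_t used;              /* entries currently accumulated                 */
} AggBuffer;

static AggBuffer buf[NBUF];

static size_t bytes_in_use(void)
{
    size_t total = 0;
    for (int i = 0; i < NBUF; i++)
        total += buf[i].used * sizeof(double);
    return total;
}

static void flush(int i)      /* anticipated emission of a partial aggregate */
{
    printf("early send of partial aggregate %d (%zu entries)\n", i, buf[i].used);
    buf[i].used = 0;          /* buffer can be reused */
}

static void add_contribution(int target, double value)
{
    if (buf[target].used == BUF_ENTRIES) {
        flush(target);                            /* this buffer is full      */
    } else if (bytes_in_use() + sizeof(double) > MEM_BOUND) {
        int big = 0;                              /* memory bound reached:    */
        for (int i = 1; i < NBUF; i++)            /* send the fullest partial */
            if (buf[i].used > buf[big].used)      /* aggregate early          */
                big = i;
        flush(big);
    }
    buf[target].data[buf[target].used++] = value;
}

int main(void)
{
    memset(buf, 0, sizeof(buf));
    for (int k = 0; k < 5000; k++)                /* fake stream of updates   */
        add_contribution(k % NBUF, 1.0);
    printf("still buffered at the end: %zu bytes\n", bytes_in_use());
    return 0;
}
```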
Slide 12
MPI/threads implementation for SMP clusters
Mapping by processor:
- Static scheduling by processor
- Each processor owns its local part of the matrix (private user space)
- Message passing (MPI or MPI shared memory) between any processors
- Aggregation of all contributions is done per processor
- Data coherency ensured by MPI semantics
Mapping by SMP node:
- Static scheduling by thread
- All the processors of a same SMP node share a local part of the matrix (shared user space)
- Message passing (MPI) between processors on different SMP nodes
- Direct access to shared memory (pthreads) between processors of a same SMP node
- Aggregation of non-local contributions is done per node
- Data coherency ensured by explicit mutexes (see the sketch below)
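A small, self-contained pthread sketch of the "shared user space" variant: threads of the same SMP node add their contributions directly into a shared block, with coherency ensured by an explicit mutex (the surrounding MPI layer, one process per node, is omitted). In the real solver a lock would protect a whole block update rather than a single entry; this only illustrates the mechanism.

```c
/* Sketch of intra-node aggregation in shared memory: every thread of an
 * SMP node adds its contributions directly into the shared block, with
 * data coherency ensured by an explicit mutex (no MPI message and no
 * aggregation buffer is needed inside the node).                        */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define BLOCK_SIZE 1024
#define NUPDATES   100000

static double          shared_block[BLOCK_SIZE];   /* block of the factor */
static pthread_mutex_t block_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int k = 0; k < NUPDATES; k++) {
        pthread_mutex_lock(&block_lock);            /* explicit coherency */
        shared_block[k % BLOCK_SIZE] += 1.0;        /* local contribution */
        pthread_mutex_unlock(&block_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    double sum = 0.0;
    for (int i = 0; i < BLOCK_SIZE; i++)
        sum += shared_block[i];
    printf("accumulated %.0f contributions (expected %d)\n",
           sum, NTHREADS * NUPDATES);
    return 0;
}
```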
Slide 13
MPI only
Processors 1 and 2 belong to the same SMP node.
Data exchanges when only MPI processes are used in the parallelization.
Slide 14
MPI/threads
Threads 1 and 2 are created by one MPI process.
Data exchanges when there is one MPI process per SMP node and one thread per processor:
- much fewer MPI communications (only between SMP nodes)
- no aggregation for modifications of blocks belonging to processors inside the same SMP node
Slide 15
AUDI: 943·10³ unknowns (symmetric)
Requires MPI_THREAD_MULTIPLE! (see the snippet below)
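Since several threads of one MPI process may issue MPI calls concurrently when they talk to other SMP nodes, the MPI library must be initialized with full thread support. A standard way to request and verify it:

```c
/* Request full thread support from MPI and check that the library
 * actually provides it; MPI_THREAD_MULTIPLE lets any thread of the
 * process issue MPI calls concurrently.                              */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI library provides thread level %d only, "
                        "MPI_THREAD_MULTIPLE is required\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_THREAD_MULTIPLE available\n");

    MPI_Finalize();
    return EXIT_SUCCESS;
}
```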
Slide 16
AUDI: 943·10³ unknowns (symmetric)

SP4: 32-way Power4+ nodes (64 GB)

  Threads / MPI process    64 procs    32 procs
          4                  51.02      100.4
          8                  47.28       96.31
         16                  47.80       93.14
         32                  60.03       94.21

SP5: 16-way Power5 nodes (32 GB)

  Threads / MPI process    32 procs    16 procs
          2                    -         95
          4                  55.3        99.7
          8                  57.7        98.6
         16                  59.7        91

(Dashes: allocation problem or configuration without meaning; further timings shown on the slide were not recoverable.)
Slide 17
MHD1: 485·10³ unknowns (unsymmetric)

SP4: 32-way Power4+ nodes (64 GB)

  Threads / MPI process    64 procs    32 procs
          4                 117.80      202.68
          8                 115.79      197.99
         16                 111.99      187.89
         32                 115.97      199.17

SP5: 16-way Power5 nodes (32 GB): the remaining timings were not recoverable.
Slide 18
How to reduce memory resources
Goal: adapt a (supernodal) parallel direct solver (PaStiX) to build an incomplete block factorization and benefit from all the features it already provides:
- algorithms based on linear algebra kernels (BLAS)
- load balancing and task scheduling based on a fine modeling of computation and communication
- modern architecture management (SMP nodes): hybrid threads/MPI implementation
Slide 19
Main steps of the incomplete factorization algorithm
1. Find the partition P0 induced by the supernodes of A.
2. Compute the block symbolic incomplete factorization Q(G, P0)^k = Q(G^k, P0).
3. Find the exact supernode partition in Q(G, P0)^k.
4. Given an extra fill-in tolerance α, construct an approximate supernode partition Pα to improve the block structure of the incomplete factors (see the sketch below).
5. Apply a block incomplete factorization using the parallelization techniques developed for the direct solver PaStiX.
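A toy sketch of the α-tolerance idea in step 4, assuming made-up supernode sizes and merge costs: supernodes are amalgamated as long as the extra (logically zero) entries this forces into the block structure stay within a budget of α times the nonzeros of the exact ILU(k) factor. The actual criterion and merge order are described in the paper cited on slide 4; only the budget principle is shown here.

```c
/* Toy sketch of alpha-amalgamation: walk a chain of supernodes and merge a
 * supernode with the next one whenever the extra zeros this stores stay
 * within the remaining budget  alpha * nnz(exact ILU(k) factor).          */
#include <stdio.h>

int main(void)
{
    /* toy data: 6 supernodes, their nonzero counts, and the number of
     * extra zeros that merging supernode i with supernode i+1 would store */
    int    nnz[]   = {400, 350, 500, 300, 450, 600};
    int    extra[] = { 60,  90,  40, 120,  30 };   /* cost of merging i, i+1 */
    int    nsnodes = 6;
    double alpha   = 0.10;                         /* 10% extra fill allowed */

    int nnz_exact = 0;
    for (int i = 0; i < nsnodes; i++)
        nnz_exact += nnz[i];

    double budget = alpha * nnz_exact;
    int    merges = 0;
    for (int i = 0; i < nsnodes - 1; i++) {
        if (extra[i] <= budget) {       /* merge keeps us under the alpha bound */
            budget -= extra[i];
            merges++;
            printf("merge supernodes %d and %d (cost %d, budget left %.0f)\n",
                   i, i + 1, extra[i], budget);
        }
    }
    printf("%d merges, %d supernodes left\n", merges, nsnodes - merges);
    return 0;
}
```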
Slide 24
Parallel time: AUDI (Power5)

  K   α      Fact (1 proc)   TR solv (1 proc)   Total (1 proc)   Fact (16 procs)   TR solv (16 procs)   Total (16 procs)
  5   40%        258.1             7.80              679.3             21.2               0.78                63.3
  5   20%        518.5             8.86             1058.9             52.3               1.16               123.1
  3   40%        194.6             7.57              732.0             18.6               0.66                65.7
  3   20%        331.1             7.97              936.8             39.2               0.91               108.7
  1   40%         56.4             4.44              620.3             12.7               0.42                67.0
  1   20%         74.5             4.59              690.1             21.4               0.51                91.5

(K = level of fill, α = fill-in tolerance; Fact = factorization time, TR solv = triangular solve time, Total = total time.)
Slide 25
ASTER project (ANR CIS 2006)
Adaptive MHD Simulation of Tokamak ELMs for ITER (ASTER)
- The ELM (Edge Localized Mode) is an MHD instability localized at the boundary of the plasma.
- The energy losses induced by ELMs within several hundred microseconds are a real concern for ITER.
- The non-linear MHD simulation code JOREK is under development at CEA to study the evolution of the ELM instability.
- To simulate the complete cycle of the ELM instability, a large range of time scales needs to be resolved:
  - the evolution of the equilibrium pressure gradient (~seconds)
  - the ELM instability itself (~hundreds of microseconds)
- A fully implicit time evolution scheme is therefore used in the JOREK code. This leads to a large sparse matrix system to be solved at every time step (see the sketch below).
- This scheme is made possible by the recent advances in the parallel direct solution of general sparse matrices (MUMPS, PaStiX, SuperLU, ...).
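A toy illustration, unrelated to JOREK's actual equations, of why a fully implicit scheme produces one large sparse system per time step: backward Euler applied to the 1D heat equation yields a tridiagonal system (I − dt·L) u_new = u_old at every step, solved below with the Thomas algorithm as a stand-in for the parallel sparse direct solver.

```c
/* Toy illustration of a fully implicit time scheme: backward Euler for the
 * 1D heat equation u_t = u_xx produces, at every time step, a tridiagonal
 * system (I - dt*L) u_new = u_old.  The Thomas algorithm below stands in
 * for the parallel sparse direct solver used for real 2D/3D MHD systems.  */
#include <stdio.h>

#define N      50        /* interior grid points          */
#define NSTEPS 100       /* number of implicit time steps */

int main(void)
{
    double dx = 1.0 / (N + 1), dt = 1.0e-3;
    double r  = dt / (dx * dx);
    double u[N], diag[N], rhs[N];

    for (int i = 0; i < N; i++)              /* initial condition: a bump */
        u[i] = (i > N / 3 && i < 2 * N / 3) ? 1.0 : 0.0;

    for (int step = 0; step < NSTEPS; step++) {
        /* assemble (1 + 2r) u_i - r u_{i-1} - r u_{i+1} = u_i^old          */
        for (int i = 0; i < N; i++) {
            diag[i] = 1.0 + 2.0 * r;
            rhs[i]  = u[i];
        }
        /* Thomas algorithm: forward elimination ...                        */
        for (int i = 1; i < N; i++) {
            double m = -r / diag[i - 1];
            diag[i] -= m * (-r);
            rhs[i]  -= m * rhs[i - 1];
        }
        /* ... and back substitution                                        */
        u[N - 1] = rhs[N - 1] / diag[N - 1];
        for (int i = N - 2; i >= 0; i--)
            u[i] = (rhs[i] + r * u[i + 1]) / diag[i];
    }
    printf("after %d implicit steps, u at the center = %f\n", NSTEPS, u[N / 2]);
    return 0;
}
```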