1 Bubbles in PaStiX ("Des bulles dans le PaStiX") - NUMASIS meeting, Mathieu Faverge, ScAlApplix project, INRIA Futurs Bordeaux, November 29, 2006
2 Introduction
Objectives:
- Scheduling for NUMA architectures
- Used with both the direct and the incomplete solver
- Relax the static scheduling on each SMP node
- Integrate the MPICH-Madeleine library
- Work on the out-of-core aspect
Application: PaStiX
3 Direct Factorization techniques
PaStiX key points:
- Load balancing and scheduling are based on a fine modeling of computation and communication
- Modern architecture management (SMP nodes): hybrid Threads/MPI implementation
- Control of the memory overhead due to the aggregation of contributions in the supernodal block solver
Tool chain: Scotch (ordering & amalgamation), Fax (block symbolic factorization), Blend (refinement & mapping), Sopalin (factorizing & solving)
Data flow: graph -> partition -> symbolic matrix -> distributed solver matrix -> distributed factorized solver matrix -> distributed solution
4 Matrix partitioning and mapping
- Manage the parallelism induced by sparsity (block elimination tree).
- Split and distribute the dense blocks in order to take into account the potential parallelism induced by dense computations.
- Use an optimal block size for pipelined BLAS3 operations (a toy splitting sketch follows).
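As a toy illustration of the splitting step, here is a minimal sketch, not the actual Blend heuristic: it cuts a supernode into chunks close to a target block size, and both the target value and the rounding rule are assumptions.

```c
/* Illustrative only: split a supernode of 'width' columns into chunks close to
 * a target block size suited to pipelined BLAS3 kernels.  The rounding rule is
 * an assumption, not the actual Blend splitting heuristic. */
static int split_supernode(int width, int target_blocksize, int *chunk /* out */)
{
    int nb   = (width + target_blocksize - 1) / target_blocksize; /* number of chunks */
    int base = width / nb;
    int rem  = width % nb;
    for (int i = 0; i < nb; i++)
        chunk[i] = base + (i < rem ? 1 : 0);  /* spread the remainder evenly */
    return nb;
}
```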
5 Supernodal Factorization Algorithm
FACTOR(k): factorize diagonal block k
  Factorize A_kk into L_kk L_kk^t
BDIV(j,k): update L_jk (BLAS3)
  Solve L_kk L_jk^t = A_jk^t
BMOD(i,j,k): compute the contribution of L_ik and L_jk for block L_ij (BLAS3)
  A_ij = A_ij - L_ik L_jk^t
[Figure: blocks L_ik, L_jk and A_ij within column-blocks k and j]
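For illustration, a minimal sketch of the three dense kernels written with standard LAPACK/BLAS calls; this is not PaStiX code, and the column-major layout, block dimensions and symmetric positive definite assumption for the diagonal block are assumptions of the sketch.

```c
/* Illustrative sketch of the three supernodal kernels using LAPACKE/CBLAS.
 * Not PaStiX code: names, layouts and sizes are assumptions. */
#include <lapacke.h>
#include <cblas.h>

/* FACTOR(k): A_kk <- L_kk L_kk^t (Cholesky of the n x n diagonal block) */
static int factor_diag(double *Akk, int n)
{
    return LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, Akk, n);
}

/* BDIV(j,k): solve L_jk L_kk^t = A_jk, i.e. A_jk <- A_jk L_kk^{-t} (m x n block) */
static void bdiv(const double *Lkk, double *Ajk, int m, int n)
{
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                m, n, 1.0, Lkk, n, Ajk, m);
}

/* BMOD(i,j,k): A_ij <- A_ij - L_ik L_jk^t (BLAS3 rank-n update) */
static void bmod(const double *Lik, const double *Ljk, double *Aij,
                 int mi, int mj, int n)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                mi, mj, n, -1.0, Lik, mi, Ljk, mj, 1.0, Aij, mi);
}
```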
6 Parallel Factorization Algorithm
COMP1D(k): factorize column-block k and compute all contributions to the column-blocks in BCol(k)
  Factorize A_kk into L_kk L_kk^t
  Solve L_kk L_*k^t = A_*k^t
  For j in BCol(k) Do
    Compute C_[j] = L_[j]k L_jk^t
    If map([j], j) == p Then
      A_[j]j = A_[j]j - C_[j]
    Else
      AUB_[j]j = AUB_[j]j + C_[j]
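The key decision in the inner loop is whether a contribution is applied locally or accumulated into an aggregate update block (AUB) for a remote owner. A minimal, self-contained sketch of that decision is given below; the function name, the dense storage and the owner test are illustrative assumptions, not the PaStiX API.

```c
#include <stddef.h>

/* Illustrative only: apply a contribution C (m x n, column-major) either to the
 * locally owned block A, or accumulate it into the aggregate update block AUB
 * kept for the remote owner.  Names and the ownership test are assumptions. */
static void apply_contribution(double *A, double *AUB, const double *C,
                               size_t m, size_t n, int owner, int my_rank)
{
    if (owner == my_rank) {
        for (size_t i = 0; i < m * n; i++)
            A[i] -= C[i];            /* A_[j]j = A_[j]j - C_[j]                   */
    } else {
        for (size_t i = 0; i < m * n; i++)
            AUB[i] += C[i];          /* AUB_[j]j = AUB_[j]j + C_[j], sent later   */
    }                                /* to the owner as one aggregated message    */
}
```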
Local aggregation of block updates
- Column-blocks k1 and k2 are mapped on processor P1
- Column-block j is mapped on processor P2
- Contributions from processor P1 to the block A_ij owned by processor P2 are locally summed in AUB_ij
- Processors communicate using aggregate update blocks only
- Critical memory overhead, in particular for 3D problems
Matrix Partitioning
Inputs: tasks graph, block symbolic matrix, cost modeling (computation/communication), number of processors
Mapping and scheduling produce: local data, task scheduling, communication scheme
These feed the parallel factorization and solver, together with an estimate of the computation time and of the memory allocation during factorization
9 Matrix partitioning and mapping
10 Hybrid 1D/2D Block Distribution
- Yields both 1D and 2D block distributions
- 1D: BLAS efficiency on small supernodes
- 2D: scalability on larger supernodes
- Switching criterion between the two (an illustrative rule is sketched below)
[Figure: 1D block distribution / 2D block distribution]
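As one possible form of such a criterion, the sketch below switches to a 2D distribution once a supernode is wide enough; both the threshold parameter and the rule itself are assumptions for illustration, not the criterion actually used in Blend.

```c
/* Illustrative 1D/2D switching rule: distribute a supernode in 2D once its
 * column width reaches a threshold.  Threshold and names are assumptions. */
enum distrib { DISTRIB_1D, DISTRIB_2D };

static enum distrib choose_distribution(int supernode_width, int width_2d_min)
{
    return (supernode_width >= width_2d_min) ? DISTRIB_2D : DISTRIB_1D;
}
```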
11 "2D" to "2D" communication scheme
A dynamic technique is used to improve the "1D" to "1D/2D" communication scheme
MPI/Thread for SMP implementation
Mapping by processor:
- Static scheduling by processor
- Each processor owns its local part of the matrix (private user space)
- Message passing (MPI or MPI shared memory) between any processors
- Aggregation of all contributions is done per processor
- Data coherency ensured by MPI semantics
Mapping by SMP node:
- Static scheduling by thread
- All the processors of a same SMP node share a local part of the matrix (shared user space)
- Message passing (MPI) between processors on different SMP nodes
- Direct access to shared memory (pthread) between processors on a same SMP node
- Aggregation of non-local contributions is done per node
- Data coherency ensured by explicit mutexes (see the sketch below)
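A minimal sketch of the second setup, assuming one MPI process per SMP node, one worker thread per processor, and a node-local matrix part shared by the threads and protected by an explicit mutex; the structure and names are illustrative, not the PaStiX implementation.

```c
/* Sketch of an MPI + pthreads hybrid: one MPI process per SMP node, one worker
 * thread per processor.  Illustrative only; names and layout are assumptions. */
#include <mpi.h>
#include <pthread.h>

#define THREADS_PER_NODE 4          /* one thread per processor of the node   */

typedef struct {
    int              rank;          /* MPI rank of the SMP node               */
    int              thread_id;     /* local thread index                     */
    double          *local_matrix;  /* node-local matrix part, shared         */
    pthread_mutex_t *lock;          /* explicit mutex for data coherency      */
} worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *arg = p;
    /* ... statically scheduled factorization tasks of this thread ...        */
    pthread_mutex_lock(arg->lock);
    /* aggregate a non-local contribution into the shared node buffer         */
    pthread_mutex_unlock(arg->lock);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    /* threads may issue MPI calls, so request a threaded MPI level           */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_t       tid[THREADS_PER_NODE];
    worker_arg_t    arg[THREADS_PER_NODE];

    for (int t = 0; t < THREADS_PER_NODE; t++) {
        arg[t] = (worker_arg_t){ rank, t, NULL, &lock };
        pthread_create(&tid[t], NULL, worker, &arg[t]);
    }
    for (int t = 0; t < THREADS_PER_NODE; t++)
        pthread_join(tid[t], NULL);

    MPI_Finalize();
    return 0;
}
```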
MPI only: processors 1 and 2 belong to the same SMP node. Data exchanges when only MPI processes are used in the parallelization.
MPI/Threads: threads 1 and 2 are created by one MPI process. Data exchanges when there is one MPI process per SMP node and one thread per processor.
15 MPICH-Madeleine
- A communication support for clusters and multi-clusters
- Multiple network protocols: MPI, TCP, SCI, Myrinet, ...
- Priority management
- Dynamic aggregation of transfers
- Packet reordering (non-deterministic)
16 Marcel library
- User-level thread library from the PM2 execution environment, developed by the Runtime project
- Portable, modular, efficient, extensible, monitorable
- Based on a user-level scheduler guided by criteria defined by the application (memory affinity, load, task priority, ...)
17 Future SMP implementation
Current: mapping by SMP node, static scheduling by thread
- No localisation of threads on the SMP nodes
- One task = reception + computation + emission
- Aggregation of non-local contributions is done per node
Future: mapping by SMP node, dynamic scheduling by Marcel threads
- Threads spread over the SMP node by memory affinity or other criteria
- Separation of communication and computation tasks
- Aggregation done at the MPICH-Madeleine level
- Use of an adapted scheduling
18 Matrix partitioning and mapping: 1 SMP node, 1 thread
19 Thread Mapping
20 Perspectives for the use of Madeleine
- Aggregation done by Madeleine for a same destination
- Asynchronous management
- Adaptive packet size
- Management of communication priorities
- Separation of communication and computation tasks
21 Out-Of-Core Sparse Direct Solvers
- Sparse direct methods have large memory requirements
- Memory becomes the bottleneck
- Use out-of-core techniques to treat larger problems
- Write currently unneeded data to disk and prefetch it when needed
- Design efficient I/O mechanisms and sophisticated prefetching schemes
[Figure: sparse solver on top of a prefetching layer and an I/O abstraction layer (fread/fwrite, read/write, asynchronous I/O); the computation thread and the I/O thread are coupled through a synchronization layer and an I/O request queue]
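To make the synchronization layer concrete, here is a minimal producer/consumer sketch of an I/O request queue between the computation thread and a dedicated I/O thread; the names, the queue structure and the request fields are assumptions, not the actual prefetching layer.

```c
/* Illustrative producer/consumer sketch of the I/O request queue coupling the
 * computation thread and the I/O thread.  Names and structure are assumptions. */
#include <pthread.h>
#include <stdio.h>

typedef struct io_req {
    int             block_id;   /* which block to write out or prefetch        */
    int             write;      /* 1 = write to disk, 0 = prefetch (read)      */
    struct io_req  *next;
} io_req_t;

static io_req_t        *queue_head = NULL;
static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cond = PTHREAD_COND_INITIALIZER;

/* Computation thread: post a request and keep computing. */
void post_io_request(io_req_t *req)
{
    pthread_mutex_lock(&queue_lock);
    req->next  = queue_head;
    queue_head = req;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* I/O thread: serve requests while the computation proceeds. */
void *io_thread(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        io_req_t *req = queue_head;
        queue_head = req->next;
        pthread_mutex_unlock(&queue_lock);

        if (req->write)
            printf("write block %d to disk\n", req->block_id);      /* fwrite(...) */
        else
            printf("prefetch block %d from disk\n", req->block_id); /* fread(...)  */
    }
    return NULL;
}
```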
22 Links
- Scotch :
- PaStiX :
- MUMPS :
- ScAlApplix :
- RUNTIME :
- ANR CIGC NUMASIS
- ANR CIS Solstice & Aster