October 2008 Integrated Predictive Simulation System for Earthquake and Tsunami Disaster CREST/Japan Science and Technology Agency (JST) Integrated Predictive Simulation System for Earthquake and Tsunami Disaster Robust & Efficient Parallel Preconditioning Methods in “Multi-Core Era” Introduction In this work, parallel preconditioning methods based on “Hierarchical Interface Decomposition (HID)” and hybrid parallel programming models were applied to finite-element based simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the “T2K Open Super Computer (Todai Combined Cluster)” using up to 512 cores. Preconditioners based on HID provide a scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. Performance of Hybrid 4x4 parallel programming model is competitive with that of Flat MPI. Parallel Programming Models on Multi-core Clusters In order to achieve minimal parallelization overheads on SMP (symmetric multiprocessors) and multi-core clusters, a multi-level hybrid parallel programming model is often employed. In this method, coarse-grained parallelism is achieved through domain decomposition by message passing among nodes, while fine-grained parallelism is obtained via loop-level parallelism inside each node using compiler-based thread parallelization techniques, such as OpenMP. Another often-used programming model is the single-level flat MPI model, in which separate single-threaded MPI processes are executed on each core. Generally, the efficiency of each programming model depends on hardware performance, application features, and problem size. Fig.1 Hybrid and Flat MPI Parallel Programming Models core memory core memory core memory Target Application In the present work, linear elasticity problems in simple cube geometries of media with heterogeneous material properties are solved using a parallel finite-element method (FEM). Poisson’s ratio is set to 0.25 for all elements, while a heterogeneous distribution of Young’s modulus in each element is calculated by a sequential Gauss algorithm, which is widely used in the area of geostatistics. The minimum and maximum values of Young’s modulus are and 10 3, respectively, with an average value of 1.0. The GPBi-CG (Generalized Product-type methods based on Bi-CG) solver with SGS (Symmetric Gauss-Seidel) preconditioner (SGS/GPBi-CG) was applied. The code is based on the framework for parallel FEM procedures of GeoFEM, and the GeoFEM’s local data structure is applied. Fig.2 Heterogeneous Distribution of Material Property HID (Hierarchical Interface Decomposition) Localized block Jacobi ILU/IC preconditioners are widely used for parallel iterative solvers. They provide excellent parallel performance for well-defined problems, although the number of iterations required for convergence gradually increases according to the number of processors. However, this preconditioning technique is not robust for ill-conditioned problems with many processors, because it ignores the global effect of external nodes in other domains. The Parallel Hierarchical Interface Decomposition Algorithm (PHIDAL) [Henon & Saad 2007] provides robustness and scalability for parallel ILU/IC preconditioners. PHIDAL is based on defining “hierarchical interface decomposition (HID)”. The HID process starts with a partitioning of the graph, with one layer of overlap. The “levels” are defined from this partitioning, with each level consisting of a set of vertex groups. Each vertex group of a given level is a separator for vertex groups of a lower level. (to be continued to the back page) core
October 2008 Integrated Predictive Simulation System for Earthquake and Tsunami Disaster CREST/Japan Science and Technology Agency (JST) HID (Hierarchical Interface Decomposition) (cont.) If the unknowns are reordered according to their level numbers, from the lowest to highest, the block structure of the reordered matrix is as shown in Fig.3. This block structure leads to a natural parallelism if ILU/IC decompositions or forward/backward substitution processes are applied.. (a) Domain Decomposition Fig.3 Domain/block decomposition of the matrix according to the HID reordering (b) Matrix Block(c) Forward Substitution Proc. of SGS T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo) The developed code has been tested on the “T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)” at the University of Tokyo. The “T2K/Tokyo” was developed by Hitachi under “T2K Open Supercomputer Alliance”. T2K/Tokyo is an AMD Quad-core Opteron-based combined cluster system with 952 nodes, 15,232 cores and 31TB memory. Total peak performance is TFLOPS. Each node includes four “sockets” of AMD Quad-core Opteron processors (2.3GHz), as shown in Fig.4. Each node is connected via Myrinet-10G network. In the present work, 32 nodes of the system have been evaluated. Because T2K/Tokyo is based on CC/NUMA architecture, careful design of software and data structure is required for efficient access to local memory. Fig.4 Overview of T2K/Tokyo (Entire System and Each Node) Results & Future Works Preliminary tests of the developed code have been conducted on the T2K/Tokyo using up to 512 cores of T2K/Tokyo system. Preconditioners based on HID provide a scalable performance and robustness in comparison to conventional localized block Jacobi preconditioners. Performance of Hybrid 4x4 parallel programming model is competitive with that of Flat MPI.HID-based preconditioning with the hybrid parallel programming model is expected to be a good choice for excellent scalable performance and robustness for implementations on more than 10 4 cores on clusters of multi-core processors, if each MPI process is assigned to each socket, as is in Hybrid 4x4 in this work. In the present work, no fill-in processes have been considered in HID procedures. The results of using HID in comparison with conventional localized block Jacobi preconditioning have therefore not been so significant. More robust preconditioning methods based on HID may be developed by considering fill-ins inside and between connectors for realistic applications with ill-conditioned matrices. Moreover, the developed methods may further be evaluated on various types of clusters with more cores. Fig.5 Strong Scaling Test up to 512 cores of T2K/Tokyo Relative Performance of HID normalized by results of Localized Block Jacobi Relative Performance of three parallel programming models (Flat MPI, Hybrid 4x4 and Hybrid 8x2), normalized by results of Flat MPI