Iterative Partition Improvement using Mesh Topology for Parallel Adaptive Analysis M.S. Shephard, C. Smith, M. Zhou Scientific Computation Research Center, RPI K.E. Jansen University of Colorado Outline Status of Parallel Adaptive Simulation Tools to support parallel mesh operations Partition improvement via mesh adjacencies
Parallel Adaptive Analysis Components Solver Form the system of equations Solve the system of equations Parallel mesh representation/manipulation Partitioned mesh management Mesh adaptation procedure Driven by error estimates and/or correction indicators Mesh improvement procedures Dynamic load balancing Need to regain load balance All components must operate in parallel Prefer to exercise parallel control of all operations with same overall parallel structure – improves scalability initial mesh adapted mesh
Parallel Adaptive Analysis Tools Used An implicit stabilized finite element code (PHASTA) Supports simulations on general unstructured meshes including boundary layers and anisotropic meshes Stabilized methods can provide high order convergence while maintaining good numerical properties Equations solved using iterative methods Parallel mesh representation/manipulation ITAPS iMeshP/FMDB Utilities for partition management and predictive load balancing Mesh adaptation based on mesh modification Avoid the accuracy and data problems associated with full remeshing With sufficient number of modification operators and control algorithms mesh modification can support general changes to the mesh “Graph-based” dynamic load balancing Classic partition graph for initial distribution Graph defined by mesh topology for partition improvement
Implicit non-linear FEM solver with two phases of computation: Equation formation (Eqn. form.) – depends on elements Equation solution – depends on degrees-of-freedom (dofs): NS Flow Solver Weak form – Defined as sum of element integrals Quadrature – Simply adds loop inside element loop Assembly – As elements processes element matrices and vectors processes Iterative solver …
Parallel strategy: Both compute stages operate off the same mesh partition – critical to scalability Partition defines inter-part relations (part-to-part comm.) Parallelization PartA PartB PartC PartA PartB PartC Eqn. form. Eqn. sol. Locally, incomplete values ( in b, A, q, etc.) for shared dofs. Apply communications to complete values/entries (in b, q only) during Eqn. form. during Eqn. sol.
Communications to complete values/entries and norms (same overall calculations as methods based on global matrix): Current Approach – Parallelization values accumulated on ownersvalues updated on to non-owners dofs are shared between parts PartA PartB PartC control relationship between images (solid dots indicate owner images) complete b complete q (for on-part q=Ap) on-part norm for q and all-reduce (use complete q)
7 Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA) 105 Million Elements Cores (avg. elems./core) IBM BG/L RPI-CCNI t (secs.)scale factor 512 (204800) (base) 1024 (102400) (51200) (25600) (12800) (6400) (3200) K parts show modest degradation due to 15% node imbalance (with only about 600 mesh-nodes/part) Rgn./elem. ratio i = rgns i /avg_rgns Node ratio i = nodes i /avg_nodes
Strong Scaling – 1B Mesh up to 160k Cores AAA 1B elements: further scaling analysis (t tot =t comm +t comp ) comm-to-comp ratio increases
AAA 1B elements: three supercomputers up to full-system scale Strong Scaling – 1B Mesh up to 288k Cores on XT5 and on BGP secs for 20 time steps
AAA 5B elements: full-system scale on Jugene (IBM BG/P) Strong Scaling – 5B Mesh up to 288k Cores without PIMA strong scaling factor is 0.88 (time is 70.5 secs), for production runs savings can be in 43 cpu-years
initial mesh (20,067 tets) adapted mesh (approx. 2M tets) Mesh Adaptation by Local Mesh Modification Controlled application of mesh modification operations including dealing with curved geometries, anisotropic meshes Base operators swap collapse Split move Compound operators chain single step operators Double split collapse operator Swap(s) followed by collapse operator Split, then move the created vertex Etc. Edge collapse Edge split face split Double split collapse to remove the red sliver
Matching Boundary Layer and Interior Mesh A modification operation on any layer is propagated through the stack and interface elements to preserve attributes of layered elements.
Parallelization of refinement: perform on each part and synchronize at inter-part boundaries. Parallelization of coarsening and swapping: migrate cavity (on- the-fly) and perform operation locally on one part. Support for parallel mesh modification requires update of evolving communication-links between parts and dynamic mesh partitioning. Parallel Mesh Adaptation
Tightly coupled Adv: Computationally efficient if done well Disadv: More complex code development Example: Explicit solution of cannon blasts Loosely coupled Adv: Ability to use existing analysis codes Disadv: Overhead of multiple structures and data conversion Example: Implicit high-order FE code developed by DOE for EM simulations of linear accelerators t=0.0 t=2e-4 t=5e-4 Adaptive Loop Construction
The initial mesh has 7.1 million regions The local size mesh size is between 0.03cm and 0.1cm The initial mesh is isotropic Initial mesh BL with 0.004cm as the height of each Patient Specific Vascular Surgical Planning
4 mesh adaptation iterations The adapted mesh:42.8 million regions 7.1M->10.8M->21.2M->33.0M->42.8M Boundary layer based mesh adaptation Mesh is anisotropic The minimum local size: 0.004cm, maximum local size: 0.8cm, and the height of the boundary layer: 0.004cm. Note: the inflow diameter is 3cm, and the total model length is more than 150cm. Mesh adaptation driven by 2 nd derivatives of appropriate solution field (velocity and pressure in current case) Anisotropic adapted mesh Brain-left Spleen SMA Patient Specific Vascular Surgical Planning
Parallel Boundary Layer Adaptation Initial mesh of 450k elements Final mesh of 4.5m elements
Parallel Boundary Layer Adaptation Initial mesh of 450k elements Final mesh of 4.5m elements
Parallel Boundary Layer Adaptation Initial mesh of 450k elements Final mesh of 4.5m elements
Example of Anisotropic Adaptation
Tools to Support Parallel Mesh Operations Unstructured mesh tools used to support parallel adaptive analysis on massively parallel computers Distributed mesh representation Mesh migration Dynamic load balancing Multiple parts per process Predictive load balancing Partition improvement
Distributed Mesh Mesh divided into parts for distribution onto nodes/cores Part P i consists of mesh entities assigned to the i th part. Partition Object Basic unit assigned to a part A mesh entity to be partitioned An entity set (where the set of entities are required to be kept on the same part). Residence Part Operator P [M i d ] returns a set of part id’s where M i d exists. (e.g. P [M 1 0 ] = {P 0, P 1, P 2 } ) Distributed Mesh Data Structure i M 0 j M 1 1 P 0 P 2 P partition boundary
A Partition Model Interoperable partition model implementation - iMeshP Defined by the DOE SciDAC center on Interoperable tools for Advanced Petascale Simulations Supports unstructured meshes on distributed memory parallel computers Focuses on supporting parallel interactions of serial (iMesh) on part meshes Multiple implementations underway Implementation that support parallel mesh adaptation used here.
Mesh Migration Movement of mesh entities between parts Dictated by operation - in swap and collapse it’s the mesh entities on other parts needed to complete the mesh modification cavity Information determined based on mesh adjacencies Complexity of mesh migration a function of mesh representation Complete representation can provide any adjacency without mesh traversal - a requirement for satisfactory efficiency Both full and reduced representations can be complete Full representation - all mesh entities explicitly represented Reduced representation - requires all mesh vertices and mesh entities of the same order as the model (or partition) entity they are classified on
Dynamic Partitioning of Unstructured Meshes Load balance is lost during mesh modification Need to rebalance in parallel on already distributed mesh Want equal “work load” with minimum inter-processor communications Graph based methods are flexible and provide best results Zoltan Dynamic Services ( Supports multiple dynamic partitioners General control of the definition of part objects and weights supported Under active development
Multiple Parts per Process Support more that one part per process Key to changing number of parts Also used to deal with problem with current graph-based partitioners that tend to fail on really large numbers of processors 1 Billion region mesh starts as well balanced mesh on 2048 parts Each part splits to 64 parts - get 128K parts via ‘local’ partitioning Region imbalance: Time usage < 3mins on Kraken This is the partition used for scaling test on Intrepid
Multiple Parts per Process Supports global and local graphs to create mesh partitions at extreme scale (>=100K parts): Mesh and its initial partition (partition from 3 parts to 6 parts) Global graph Local graphLocal partition Global partition
Predictive Load Balancing Mesh modification before load balancing can lead to memory problems Employ predictive load balancing to avoid the problem Assign weights based on what will be refined/coarsened Apply dynamic load balancing using those weights Perform mesh modifications May want to do some local migration
Algorithm Mesh metric field at any point P is decomposed to three unit direction (e 1,e 2,e 3 ) and desired length (h 1,h 2,h 3 ) in each corresponding direction. The volume of desired element (tetrahedron) : h 1 h 2 h 3 /6 Estimate number of elements to be generated: “num” is scaled to a good range before it is specified as a weight to graph nodes Predictive Load Balancing ? # regions ?
Initial: 8.7M; Adapted: 80.5M PredL run on 128 cores Initial: 8.7M; Adapted: 187M Test run out of memory without PredLB on 200 cores Predictive Load Balancing
Partitioning Improvement via Mesh Adjacencies Observations on graph-based dynamic balancing Parallel construction and balancing of graph with small cuts takes reasonable time In the case of unstructured meshes, the graph defined in terms of Graph nodes as mesh entity type to be balanced (e.g., regions, faces, edges or vertices) Graph edges based on mesh adjacencies taken into account (e.g., graph edge between regions sharing a vertex, between regions sharing a face, etc.) Accounting for multiple criteria and or multiple interactions is not obvious Hypergraphs allows edges to connect more that two vertices may be of use – has been used to help account for migration communication costs
Partition Improvement via Mesh Adjacencies Mesh adjacencies for a complete representation can be viewed as a complete graph All entities can be “graph-nodes” at will Any desired graph edge obtained in O(1) time Possible advantages Avoid graph construction (assuming you have needed adjacencies) Account for multiple entity types – important for the solve process - typically the most computationally expensive step Disadvantage Lack of well developed algorithms for parallel graph operations Easy to use simple iterative diffusive procedures Not ideal for “global” balancing Can be fast for improving partitions
Partition Improvement via Mesh Topology (PIMA) Basic idea is to perform iterations exchanging mesh entities between neighboring parts to improve scaling Current algorithm focused on gaining better node balance to improve scalability of the solve by accounting for multiple criteria Example: Reduce down the small number of node imbalance peaks at the cost of some increase in element imbalance Similar approaches can be used to: Improve balance when using multiple parts per process - may be as good as full rebalance for lower total cost Improve balance during mesh adaptation – likely want extensions past simple diffusive methods
Diffusive PIMA-advantages Imbalance in partitions obtained by graph/hyper-graph based methods, even when considering multiple criteria is limited to a small number of heavily loaded parts, referred to as spikes, which limit the scalability of applications Uses mesh adjacencies - Richer information provides chance to provide better multi-criteria partitions All adjacencies are obtainable in O(1) operations (not a function of mesh size) – requirement for efficiency Takes advantages of neighborhood communications - Work well on massively parallel computations, since more controlled communication used even at extreme scale Diffusive PIMA is designed to migrate a small number of mesh entities on inter-part boundaries from heavily loaded parts to lightly loaded neighbors to improve load balance
PIMA: algorithm Input: Types of mesh entities need to be balanced (Rgn, Face, Edge, Vtx) The relative importance (priority) between them (= or >) e.g., “Vtx=Edge>Rgn” or “Rgn>Face=Edge>Vtx”, etc. The balance of entities not specified in the input is not explicitly improved or preserved Mesh with complete representation Algorithm: From high to low priority if separated by “>” (different groups) From low to high dimensions based on entities topologies if separated by “=” (same group) Improve partition balance while reducing edge-cut and data migration costs for current mesh entity e.g., “Rgn>Face=Edge>Vtx” is the user’s input Step 1: improve balance for mesh regions Step 2.1: improve balance for mesh edges Step 2.2: improve balance for mesh faces Step 3: improve balance for mesh vertices
PIMA: algorithm What is a good choice for α ?? Can migrate via p2p communications between procs i and j – does FMDB use collective communication to do migration?
PIMA: Candidate Parts
PIMA: Candidate Mesh Entities
The gain(I) is the total gain of the part boundary; likewise for density(I) [line 5] A region is only marked for migration if it has above average gain and density [line 5] The average gain and density is updated with the definition of the part boundary as entities are marked for migration [line 8] The gain is updated for regions adjacent to the region marked for migration [line 11]
PIMA: Candidate mesh entities Vertex: The vertices on inter-part boundaries bounding a small number of regions on source part P0; tips of ‘spikes’ Edge: The edges on inter-part boundaries bounding a small number of faces; ‘ridge’ edges with (a) 2 bounding faces, and (b) 3 bounding faces on source part P0 Face/Region: Regions which have two or three faces on inter-part boundaries; (a) ‘spike’ region (b) region on a ‘ridge’
Partition Improvement Worst loaded part dictates the performance Element-based partitioning results in spikes of dofs worst dof imbalance: 14.7% 0
Partition Improvement Heaviest loaded part dictates the performance Element-based partitioning results in spikes of dofs Parallel local iterative partition modifications Local mesh migration is applied from relatively heavy to light parts dof imbalance reduced from 14.7% to 4.92% element imbalance increased from 2.64% to 4.54%
Strong Scaling – 1B Mesh up to 160k Cores AAA 1B elements: effective partitioning at extreme scale with and without partition modification (PIMA) Full system Without PIMAwith PIMA PMod (see graph)
PIMA-Tests 133M region mesh on 16k parts Table 1: Users input Table 2:Balance of partitions Table 3: Time usage and iterations (tests on Jaguar Cray XT5 system)
Closing Remarks ***** Needs update Parallel adaptive for unstructured mesh simulations is progressing toward petascale Solver scales well (on good machines) Mesh adaptation can run on petascale machines - How well can we make them scale? How well do they need to scale? Execution on massively parallel computers Introduces several new considerations Specific tools being developed to address these issues