Iterative Partition Improvement using Mesh Topology for Parallel Adaptive Analysis

M.S. Shephard, C. Smith, M. Zhou – Scientific Computation Research Center, RPI
K.E. Jansen – University of Colorado

Outline:
- Status of parallel adaptive simulation
- Tools to support parallel mesh operations
- Partition improvement via mesh adjacencies

Parallel Adaptive Analysis

Components:
- Solver: forms the system of equations and solves it
- Parallel mesh representation/manipulation, including partitioned mesh management
- Mesh adaptation procedure, driven by error estimates and/or correction indicators
- Mesh improvement procedures
- Dynamic load balancing, needed to regain load balance after adaptation

All components must operate in parallel. Preferably, parallel control of all operations is exercised with the same overall parallel structure – this improves scalability. (Figures: initial mesh, adapted mesh.)

Parallel Adaptive Analysis Tools Used

- An implicit stabilized finite element code (PHASTA)
  - Supports simulations on general unstructured meshes, including boundary layers and anisotropic meshes
  - Stabilized methods can provide high-order convergence while maintaining good numerical properties
  - Equations are solved using iterative methods
- Parallel mesh representation/manipulation: ITAPS iMeshP/FMDB
  - Utilities for partition management and predictive load balancing
- Mesh adaptation based on mesh modification
  - Avoids the accuracy and data problems associated with full remeshing
  - With a sufficient number of modification operators and control algorithms, mesh modification can support general changes to the mesh
- "Graph-based" dynamic load balancing
  - Classic partition graph for the initial distribution
  - Graph defined by mesh topology for partition improvement

NS Flow Solver

Implicit non-linear FEM solver with two phases of computation:
- Equation formation (Eqn. form.) – depends on elements
- Equation solution (Eqn. sol.) – depends on degrees of freedom (dofs)

- Weak form – defined as a sum of element integrals
- Quadrature – simply adds a loop inside the element loop
- Assembly – as elements are processed, the element matrices and vectors are assembled
- Iterative solver …
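A minimal sketch of this loop structure – element loop, quadrature loop inside it, assembly of the element vector into a part-local residual. The names and the dummy integrand are illustrative placeholders, not PHASTA's actual code:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// A part-local element with its dof connectivity.
struct Element {
  std::vector<int> dofs;  // part-local dof indices
};

// Placeholder weak-form integrand at quadrature point qp for local dof a;
// stands in for N_a * f * w_qp * detJ.
double integrand(const Element& e, int qp, std::size_t a) {
  (void)e; (void)qp; (void)a;
  return 0.25;
}

// Equation formation: element loop, quadrature loop inside it, then
// assembly of each element vector into the part-local residual b.
void formResidual(const std::vector<Element>& elems, std::vector<double>& b) {
  const int nQuadPts = 4;  // e.g., a 4-point rule on a linear tet
  for (const Element& e : elems) {               // element loop
    std::vector<double> be(e.dofs.size(), 0.0);  // element vector
    for (int qp = 0; qp < nQuadPts; ++qp)        // quadrature loop
      for (std::size_t a = 0; a < e.dofs.size(); ++a)
        be[a] += integrand(e, qp, a);
    for (std::size_t a = 0; a < e.dofs.size(); ++a)
      b[e.dofs[a]] += be[a];                     // assembly
  }
}

int main() {
  std::vector<Element> elems = {{{0, 1, 2, 3}}, {{1, 2, 3, 4}}};
  std::vector<double> b(5, 0.0);
  formResidual(elems, b);
  for (double v : b) std::printf("%g ", v);
  std::printf("\n");
}
```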

Parallelization

Parallel strategy: both compute stages operate off the same mesh partition – critical to scalability. The partition defines the inter-part relations (part-to-part communication). (Figure: parts A, B, C during equation formation and equation solution.)

Locally, values for shared dofs are incomplete (in b, A, q, etc.). Communications are applied to complete the values/entries (in b and q only): in b during equation formation, in q during equation solution.

Current Approach – Parallelization

Communications complete values/entries and norms (the same overall calculations as methods based on a global matrix). Dofs are shared between parts, with a control relationship between the images of a shared dof (solid dots in the figure indicate owner images):
- complete b: values are accumulated on owners, then the completed values are updated to non-owners
- complete q (for the on-part product q = Ap)
- on-part norm for q and an all-reduce (using the complete q)
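A toy sketch of the accumulate-on-owner / update-to-non-owners pattern for a single shared dof, using plain MPI point-to-point calls. The single-dof setup and rank-0 ownership are assumptions for illustration, not the solver's actual communication layer:

```cpp
#include <mpi.h>
#include <cstdio>

// Each rank holds a partial (incomplete) value for one dof shared by all
// ranks and owned by rank 0. Non-owners send partials to the owner, which
// accumulates them and sends the complete value back.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  double partial = 1.0 + rank;  // this part's incomplete contribution
  const int owner = 0;

  if (rank == owner) {
    double complete = partial;
    for (int src = 0; src < size; ++src) {
      if (src == owner) continue;
      double v;
      MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      complete += v;  // accumulate on the owner image
    }
    for (int dst = 0; dst < size; ++dst)
      if (dst != owner)
        MPI_Send(&complete, 1, MPI_DOUBLE, dst, 1, MPI_COMM_WORLD);
    std::printf("owner completed value: %g\n", complete);
  } else {
    MPI_Send(&partial, 1, MPI_DOUBLE, owner, 0, MPI_COMM_WORLD);
    double complete;
    MPI_Recv(&complete, 1, MPI_DOUBLE, owner, 1, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);  // update the non-owner image
  }
  MPI_Finalize();
  return 0;
}
```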

Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA), 105 million elements, IBM BG/L (RPI-CCNI):

Cores (avg. elems./core)   t (secs.)   scale factor
512 (204,800)              –           (base)
1,024 (102,400)            –           –
2,048 (51,200)             –           –
4,096 (25,600)             –           –
8,192 (12,800)             –           –
16,384 (6,400)             –           –
32,768 (3,200)             –           –

32K parts show modest degradation due to 15% node imbalance (with only about 600 mesh nodes per part).

Rgn./elem. ratio_i = rgns_i / avg_rgns; node ratio_i = nodes_i / avg_nodes

Strong Scaling – 1B Mesh up to 160K Cores

AAA 1B elements: further scaling analysis (t_tot = t_comm + t_comp); the communication-to-computation ratio increases with core count.

Strong Scaling – 1B Mesh up to 288K Cores

AAA 1B elements: three supercomputers up to full-system scale, on XT5 and on BG/P; times in seconds for 20 time steps.

Strong Scaling – 5B Mesh up to 288K Cores

AAA 5B elements: full-system scale on Jugene (IBM BG/P). Without PIMA the strong scaling factor is 0.88 (time is 70.5 secs); for production runs the savings can be 43 cpu-years.

Mesh Adaptation by Local Mesh Modification

Controlled application of mesh modification operations, including dealing with curved geometries and anisotropic meshes. (Figures: initial mesh, 20,067 tets; adapted mesh, approx. 2M tets.)

Base operators:
- split (edge split, face split)
- collapse (edge collapse)
- swap
- move

Compound operators chain single-step operators:
- double split collapse operator (e.g., to remove a sliver element, as in the figure)
- swap(s) followed by a collapse operator
- split, then move the created vertex
- etc.

Matching Boundary Layer and Interior Mesh

A modification operation on any layer is propagated through the stack and interface elements to preserve the attributes of layered elements.

Parallel Mesh Adaptation

- Parallelization of refinement: perform on each part and synchronize at inter-part boundaries.
- Parallelization of coarsening and swapping: migrate the cavity (on-the-fly) and perform the operation locally on one part.
- Support for parallel mesh modification requires updating the evolving communication links between parts and dynamic mesh partitioning.

Adaptive Loop Construction

- Tightly coupled
  - Advantage: computationally efficient if done well
  - Disadvantage: more complex code development
  - Example: explicit solution of cannon blasts (snapshots at t=0.0, t=2e-4, t=5e-4)
- Loosely coupled
  - Advantage: ability to use existing analysis codes
  - Disadvantage: overhead of multiple structures and data conversion
  - Example: implicit high-order FE code developed by DOE for EM simulations of linear accelerators

Patient Specific Vascular Surgical Planning

Initial mesh:
- The initial mesh has 7.1 million regions
- The local mesh size is between 0.03 cm and 0.1 cm
- The initial mesh is isotropic
- Boundary layer with 0.004 cm as the height of each layer

Patient Specific Vascular Surgical Planning

- 4 mesh adaptation iterations; the adapted mesh has 42.8 million regions (7.1M -> 10.8M -> 21.2M -> 33.0M -> 42.8M)
- Boundary-layer-based mesh adaptation; the mesh is anisotropic
- The minimum local size is 0.004 cm, the maximum local size is 0.8 cm, and the height of the boundary layer is 0.004 cm. Note: the inflow diameter is 3 cm, and the total model length is more than 150 cm.
- Mesh adaptation is driven by 2nd derivatives of appropriate solution fields (velocity and pressure in the current case)

(Figure: anisotropic adapted mesh; labels Brain-left, Spleen, SMA.)

Parallel Boundary Layer Adaptation

Initial mesh of 450k elements; final mesh of 4.5M elements.


Example of Anisotropic Adaptation

Tools to Support Parallel Mesh Operations

Unstructured mesh tools used to support parallel adaptive analysis on massively parallel computers:
- Distributed mesh representation
- Mesh migration
- Dynamic load balancing
- Multiple parts per process
- Predictive load balancing
- Partition improvement

Distributed Mesh Data Structure

- Distributed mesh: the mesh is divided into parts for distribution onto nodes/cores. Part P_i consists of the mesh entities assigned to the i-th part.
- Partition object: the basic unit assigned to a part; either a mesh entity to be partitioned, or an entity set (where the set of entities is required to be kept on the same part).
- Residence part operator: P[M_i^d] returns the set of part ids where M_i^d exists (e.g., P[M_1^0] = {P_0, P_1, P_2}).

(Figure: entities M_0^i and M_1^j on the partition boundary among parts P_0, P_1, P_2.)
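A minimal sketch of the residence part operator: each partition-boundary entity stores the set of part ids holding a copy, so P[M] is a constant-time lookup. The types and class are illustrative, not the iMeshP/FMDB API:

```cpp
#include <cstdio>
#include <set>
#include <unordered_map>

using PartId = int;
using EntityId = long;  // e.g., (dimension, index) packed into one id

class Partition {
  std::unordered_map<EntityId, std::set<PartId>> residence_;
  PartId self_;
 public:
  explicit Partition(PartId self) : self_(self) {}

  // P[M]: the set of parts on which entity M exists. Interior entities
  // have no stored entry and live only on the local part.
  std::set<PartId> residenceParts(EntityId m) const {
    auto it = residence_.find(m);
    if (it != residence_.end()) return it->second;
    return {self_};
  }

  // Record that entity m also exists on remote part p.
  void addRemoteCopy(EntityId m, PartId p) {
    auto& s = residence_[m];
    s.insert(self_);
    s.insert(p);
  }
};

int main() {
  Partition part(0);
  part.addRemoteCopy(42, 1);  // entity 42 also exists on part 1
  part.addRemoteCopy(42, 2);  // ... and on part 2
  std::printf("P[42] has %zu parts\n", part.residenceParts(42).size());  // 3
  std::printf("P[7] has %zu parts\n", part.residenceParts(7).size());    // 1
}
```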

A Partition Model

Interoperable partition model implementation – iMeshP:
- Defined by the DOE SciDAC center on Interoperable Tools for Advanced Petascale Simulations
- Supports unstructured meshes on distributed-memory parallel computers
- Focuses on supporting parallel interactions of serial (iMesh) operations on part meshes
- Multiple implementations are underway; the implementation that supports parallel mesh adaptation is used here

Mesh Migration

- Movement of mesh entities between parts
  - Dictated by the operation – in swap and collapse, it is the mesh entities on other parts needed to complete the mesh modification cavity
  - The information is determined based on mesh adjacencies
- The complexity of mesh migration is a function of the mesh representation
  - A complete representation can provide any adjacency without mesh traversal – a requirement for satisfactory efficiency
  - Both full and reduced representations can be complete
  - Full representation: all mesh entities are explicitly represented
  - Reduced representation: requires all mesh vertices and the mesh entities of the same order as the model (or partition) entity they are classified on

Dynamic Partitioning of Unstructured Meshes

- Load balance is lost during mesh modification
  - Need to rebalance, in parallel, the already distributed mesh
  - Want equal "work load" with minimum inter-processor communication
- Graph-based methods are flexible and provide the best results
- Zoltan Dynamic Services
  - Supports multiple dynamic partitioners
  - General control of the definition of part objects and weights is supported
  - Under active development

Multiple Parts per Process

- Support more than one part per process
  - Key to changing the number of parts
  - Also used to deal with a problem with current graph-based partitioners, which tend to fail at really large numbers of processors
- A 1-billion-region mesh starts as a well balanced mesh on 2048 parts
  - Each part splits into 64 parts, giving 128K parts via 'local' partitioning
  - Region imbalance:
  - Time usage < 3 mins on Kraken
  - This is the partition used for the scaling test on Intrepid
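A sketch of the 'local' partitioning step: each process splits its own part into 64 subparts with a serial partitioner, with no global graph involved. localPartition is a hypothetical stand-in (round-robin here; a real run would call a serial graph partitioner on the part-local element graph):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Part { std::vector<long> elements; };

// Placeholder for a serial partitioner call; returns a target subpart
// id per element. Round-robin here purely for illustration.
std::vector<int> localPartition(const Part& p, int nSubparts) {
  std::vector<int> target(p.elements.size());
  for (std::size_t i = 0; i < target.size(); ++i)
    target[i] = static_cast<int>(i % nSubparts);
  return target;
}

// Split this process's single part into nSubparts locally hosted parts;
// with 2048 processes and nSubparts = 64 this yields 128K parts total.
std::vector<Part> splitPart(const Part& p, int nSubparts) {
  std::vector<Part> sub(nSubparts);
  std::vector<int> target = localPartition(p, nSubparts);
  for (std::size_t i = 0; i < p.elements.size(); ++i)
    sub[target[i]].elements.push_back(p.elements[i]);
  return sub;
}

int main() {
  Part p;
  for (long e = 0; e < 640; ++e) p.elements.push_back(e);
  std::vector<Part> sub = splitPart(p, 64);
  std::printf("%zu subparts, %zu elems each\n",
              sub.size(), sub[0].elements.size());
}
```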

Multiple Parts per Process

Supports global and local graphs to create mesh partitions at extreme scale (>= 100K parts). (Figures: a mesh and its initial partition, repartitioned from 3 parts to 6 parts; global graph and global partition; local graph and local partition.)

Predictive Load Balancing

Mesh modification before load balancing can lead to memory problems. Predictive load balancing is employed to avoid the problem:
- Assign weights based on what will be refined/coarsened
- Apply dynamic load balancing using those weights
- Perform the mesh modifications
- May want to do some local migration afterwards

Predictive Load Balancing

Algorithm:
- The mesh metric field at any point P is decomposed into three unit directions (e_1, e_2, e_3) and the desired length (h_1, h_2, h_3) in each corresponding direction.
- The volume of the desired element (tetrahedron) is h_1 h_2 h_3 / 6.
- Estimate the number of elements to be generated.
- "num" is scaled to a good range before it is specified as a weight on graph nodes.
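A sketch of turning the estimate into a graph-node weight, under the assumption (one plausible reading of the slide) that a region produces roughly its current volume divided by the desired element volume h1·h2·h3/6; the scale factor is likewise illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Desired lengths (h1,h2,h3) along the metric directions (e1,e2,e3) at a point.
struct Metric { double h1, h2, h3; };

// Volume of the desired tetrahedron from the slide: h1*h2*h3/6.
double desiredElemVolume(const Metric& m) { return m.h1 * m.h2 * m.h3 / 6.0; }

// Assumed estimate: the number of elements a region will produce is its
// current volume over the desired element volume at that location.
double estimatedNumElems(double regionVolume, const Metric& m) {
  return regionVolume / desiredElemVolume(m);
}

// "num" scaled into a modest integer range before use as a graph-node
// weight; the scale factor here is illustrative.
int toGraphWeight(double num, double scale) {
  return std::max(1, static_cast<int>(std::lround(num / scale)));
}

int main() {
  Metric m{0.01, 0.01, 0.04};  // anisotropic desired sizes
  double num = estimatedNumElems(1e-3, m);
  std::printf("est. elems: %g, weight: %d\n", num, toGraphWeight(num, 10.0));
}
```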

Predictive Load Balancing

- Initial: 8.7M; adapted: 80.5M. Predictive load balancing run on 128 cores.
- Initial: 8.7M; adapted: 187M. The test ran out of memory without predictive load balancing on 200 cores.

Partitioning Improvement via Mesh Adjacencies

Observations on graph-based dynamic balancing:
- Parallel construction and balancing of a graph with small cuts takes reasonable time
- For unstructured meshes, the graph is defined in terms of:
  - Graph nodes: the mesh entity type to be balanced (e.g., regions, faces, edges or vertices)
  - Graph edges: based on the mesh adjacencies taken into account (e.g., a graph edge between regions sharing a vertex, between regions sharing a face, etc.)
- Accounting for multiple criteria and/or multiple interactions is not obvious
  - Hypergraphs, which allow edges to connect more than two vertices, may be of use – they have been used to help account for migration communication costs

Partition Improvement via Mesh Adjacencies

The mesh adjacencies of a complete representation can be viewed as a complete graph:
- All entities can serve as "graph nodes" at will
- Any desired graph edge is obtained in O(1) time

Possible advantages:
- Avoids graph construction (assuming the needed adjacencies are available)
- Accounts for multiple entity types – important for the solve process, typically the most computationally expensive step

Disadvantage:
- Lack of well-developed algorithms for parallel graph operations
- Easy to use simple iterative diffusive procedures
  - Not ideal for "global" balancing
  - Can be fast for improving partitions

Partition Improvement via Mesh Topology (PIMA)

The basic idea is to perform iterations exchanging mesh entities between neighboring parts to improve scaling. The current algorithm focuses on gaining better node balance to improve the scalability of the solve by accounting for multiple criteria. Example: reduce the small number of node-imbalance peaks at the cost of some increase in element imbalance.

Similar approaches can be used to:
- Improve balance when using multiple parts per process – may be as good as a full rebalance, for lower total cost
- Improve balance during mesh adaptation – likely wants extensions past simple diffusive methods

Diffusive PIMA – Advantages

Diffusive PIMA is designed to migrate a small number of mesh entities on inter-part boundaries from heavily loaded parts to lightly loaded neighbors to improve load balance.

- The imbalance in partitions obtained by graph/hypergraph-based methods, even when considering multiple criteria, is limited to a small number of heavily loaded parts, referred to as spikes, which limit the scalability of applications
- Uses mesh adjacencies – richer information provides the chance to produce better multi-criteria partitions
- All adjacencies are obtainable in O(1) operations (not a function of mesh size) – a requirement for efficiency
- Takes advantage of neighborhood communications – works well in massively parallel computations, since more controlled communication is used even at extreme scale
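A minimal sketch of one diffusive step, under the assumption that load is counted in entities: a part whose load exceeds the average sends a small slice of its surplus boundary entities to its lightest neighbor. Names and the surplus fraction are illustrative; the real procedure migrates entities through FMDB:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Neighbor { int partId; double load; };
struct MigrationPlan { int destPart = -1; std::vector<long> entities; };

// One diffusive step for a part with load 'myLoad' (entity count): if
// this part is a 'spike', push part of the surplus toward the lightest
// neighbor; otherwise do nothing.
MigrationPlan diffuseStep(double myLoad, double avgLoad,
                          const std::vector<Neighbor>& nbrs,
                          const std::vector<long>& boundaryEntities) {
  MigrationPlan plan;
  if (myLoad <= avgLoad || nbrs.empty()) return plan;  // not a spike
  auto lightest = std::min_element(
      nbrs.begin(), nbrs.end(),
      [](const Neighbor& a, const Neighbor& b) { return a.load < b.load; });
  if (lightest->load >= myLoad) return plan;  // no lighter neighbor
  // Migrate only part of the surplus per iteration (diffusive behavior).
  std::size_t n = std::min(boundaryEntities.size(),
                           static_cast<std::size_t>((myLoad - avgLoad) / 2));
  plan.destPart = lightest->partId;
  plan.entities.assign(boundaryEntities.begin(),
                       boundaryEntities.begin() + n);
  return plan;
}

int main() {
  std::vector<Neighbor> nbrs = {{1, 90.0}, {2, 70.0}};
  std::vector<long> bnd = {10, 11, 12, 13, 14, 15, 16, 17};
  MigrationPlan p = diffuseStep(110.0, 100.0, nbrs, bnd);
  std::printf("send %zu entities to part %d\n", p.entities.size(), p.destPart);
}
```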

PIMA: Algorithm

Input:
- The types of mesh entities that need to be balanced (Rgn, Face, Edge, Vtx)
- The relative importance (priority) between them ('=' or '>'), e.g., "Vtx=Edge>Rgn" or "Rgn>Face=Edge>Vtx"
  - The balance of entities not specified in the input is not explicitly improved or preserved
- A mesh with a complete representation

Algorithm:
- Process from high to low priority where separated by '>' (different groups)
- Process from low to high dimension, based on entity topologies, where separated by '=' (same group)
- Improve partition balance while reducing edge cut and data migration costs for the current mesh entity

E.g., with the user's input "Rgn>Face=Edge>Vtx":
- Step 1: improve balance for mesh regions
- Step 2.1: improve balance for mesh edges
- Step 2.2: improve balance for mesh faces
- Step 3: improve balance for mesh vertices
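A small sketch of parsing this priority input, reproducing the ordering above ('>' groups processed high to low, '=' members processed low to high dimension). The parsing interface is an assumption for illustration, not ParMA's actual one:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Topological dimension of each entity type keyword.
int dimOf(const std::string& t) {
  static const std::map<std::string, int> d{
      {"Vtx", 0}, {"Edge", 1}, {"Face", 2}, {"Rgn", 3}};
  return d.at(t);
}

// Split on '>' into priority groups (high to low), then split each group
// on '=' and order its members by increasing dimension.
std::vector<std::vector<std::string>> parsePriorities(const std::string& in) {
  std::vector<std::vector<std::string>> groups;
  std::stringstream byGt(in);
  std::string group;
  while (std::getline(byGt, group, '>')) {
    std::vector<std::string> types;
    std::stringstream byEq(group);
    std::string t;
    while (std::getline(byEq, t, '=')) types.push_back(t);
    std::sort(types.begin(), types.end(),
              [](const std::string& a, const std::string& b) {
                return dimOf(a) < dimOf(b);
              });
    groups.push_back(types);
  }
  return groups;
}

int main() {
  // Prints: Rgn / Edge Face / Vtx — matching steps 1, 2.1, 2.2, 3 above.
  for (const auto& g : parsePriorities("Rgn>Face=Edge>Vtx")) {
    for (const auto& t : g) std::cout << t << ' ';
    std::cout << '\n';
  }
}
```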

PIMA: Algorithm

Open questions noted on the slide: What is a good choice for α? Migration can be done via point-to-point communication between processes i and j – does FMDB use collective communication to do migration?

PIMA: Candidate Parts

PIMA: Candidate Mesh Entities

- The gain(I) is the total gain of the part boundary; likewise for density(I) [line 5]
- A region is only marked for migration if it has above-average gain and density [line 5]
- The average gain and density are updated with the definition of the part boundary as entities are marked for migration [line 8]
- The gain is updated for regions adjacent to the region marked for migration [line 11]
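A sketch of that marking rule; the meanings of gain and density and the neighbor-gain update are simplified placeholders for the quantities defined in the algorithm listing:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct BoundaryRegion {
  double gain;     // placeholder: benefit of migrating this region
  double density;  // placeholder: boundary concentration near this region
  bool marked = false;
  std::vector<int> adjacent;  // indices of adjacent boundary regions
};

// Mark a region only if both its gain and density exceed the current
// part-boundary averages; update the averages and the gains of adjacent
// regions as regions are marked.
void markForMigration(std::vector<BoundaryRegion>& rgns) {
  double gainSum = 0, densSum = 0;
  for (const auto& r : rgns) { gainSum += r.gain; densSum += r.density; }
  double n = static_cast<double>(rgns.size());
  for (std::size_t i = 0; i < rgns.size(); ++i) {
    BoundaryRegion& r = rgns[i];
    if (r.marked || r.gain <= gainSum / n || r.density <= densSum / n)
      continue;
    r.marked = true;  // above-average gain and density
    gainSum -= r.gain; densSum -= r.density; n -= 1;  // update averages
    for (int j : r.adjacent)
      rgns[j].gain += 1.0;  // illustrative increment; the real update
                            // depends on faces shared with the marked region
  }
}

int main() {
  std::vector<BoundaryRegion> rgns = {
      {3.0, 2.0, false, {1}}, {1.0, 1.0, false, {0}}, {2.5, 1.8, false, {}}};
  markForMigration(rgns);
  for (std::size_t i = 0; i < rgns.size(); ++i)
    std::printf("region %zu marked: %d\n", i, int(rgns[i].marked));
}
```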

PIMA: Candidate Mesh Entities

- Vertex: vertices on inter-part boundaries bounding a small number of regions on the source part P0; the tips of 'spikes'
- Edge: edges on inter-part boundaries bounding a small number of faces; 'ridge' edges with (a) 2 bounding faces, or (b) 3 bounding faces, on the source part P0
- Face/Region: regions that have two or three faces on inter-part boundaries; (a) a 'spike' region, (b) a region on a 'ridge'
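A sketch of selecting candidate vertices by this criterion; the threshold value and data layout are assumptions, not the ones used by PIMA:

```cpp
#include <cstdio>
#include <vector>

// A part-boundary vertex with the number of regions it bounds on the
// source part.
struct BoundaryVertex { long id; int numBoundedRegions; };

// Vertices bounding at most 'maxRegions' regions on the source part are
// the tips of 'spikes' and are cheap to migrate.
std::vector<long> candidateVertices(const std::vector<BoundaryVertex>& vtx,
                                    int maxRegions = 2) {
  std::vector<long> out;
  for (const BoundaryVertex& v : vtx)
    if (v.numBoundedRegions <= maxRegions) out.push_back(v.id);
  return out;
}

int main() {
  std::vector<BoundaryVertex> vtx = {{1, 1}, {2, 6}, {3, 2}};
  for (long id : candidateVertices(vtx))
    std::printf("candidate vertex %ld\n", id);  // vertices 1 and 3
}
```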

Partition Improvement

- The worst-loaded part dictates the performance
- Element-based partitioning results in spikes of dofs; worst dof imbalance: 14.7%

Partition Improvement

- The heaviest-loaded part dictates the performance
- Element-based partitioning results in spikes of dofs
- Parallel local iterative partition modifications: local mesh migration is applied from relatively heavy to light parts
  - dof imbalance reduced from 14.7% to 4.92%
  - element imbalance increased from 2.64% to 4.54%

Strong Scaling – 1B Mesh up to 160K Cores

AAA 1B elements: effective partitioning at extreme scale, with and without partition modification (PIMA). (Graph: full-system results without PIMA and with PIMA.)

PIMA Tests

133M-region mesh on 16K parts (tests on the Jaguar Cray XT5 system).
(Tables on the slide: Table 1, user's input; Table 2, balance of partitions; Table 3, time usage and iterations.)

Closing Remarks (***** needs update)

- Parallel adaptive unstructured mesh simulation is progressing toward the petascale
  - The solver scales well (on good machines)
  - Mesh adaptation can run on petascale machines – how well can we make it scale? How well does it need to scale?
- Execution on massively parallel computers
  - Introduces several new considerations
  - Specific tools are being developed to address these issues