Infrastructure for Parallel Adaptive Unstructured Mesh Simulations


Infrastructure for Parallel Adaptive Unstructured Mesh Simulations
M.S. Shephard, C.W. Smith, E.S. Seol, D.A. Ibanez, Q. Lu, O. Sahni, M.O. Bloomfield, B.N. Granzow: Scientific Computation Research Center, Rensselaer Polytechnic Institute
G. Hansen, K. Devine and V. Leung: Sandia National Laboratories
K.E. Jansen, M. Rasquin and K.C. Chitale: University of Colorado
M.W. Beall and S. Tendulkar: Simmetrix Inc.

Presentation Outline
- Meshes with many millions of elements are needed even when adaptive methods are used
- Simulations must run on massively parallel computers with the mesh distributed at all times
- An effective parallel mesh infrastructure, with associated utilities, is needed to manage the mesh and its adaptation
- Presentation outline
  - Unstructured meshes on massively parallel computers
  - Representations and support of a distributed mesh
  - Dynamic load balancing
  - Mesh adaptation using parallel mesh modification
  - Component-based infrastructure for parallel adaptive analysis
  - Albany computational mechanics environment and testbed
  - Comments on hands-on session materials

Parallel Adaptive Analysis Components
- Scalable FE or FV analysis
  - Form the system of equations
  - Solve the system of equations
- Parallel unstructured mesh infrastructure
  - Including means to move entities
- Mesh adaptation procedure
  - Driven by error estimates and/or correction indicators
  - Maintain geometric fidelity
  - Support analysis needs (e.g., maintain boundary layer structure)
- Dynamic load balancing
  - Rebalance as needed
  - Support predictive methods to control memory use and/or load
  - Fast partition improvement (considering multiple entities)
- All components must operate in parallel
  - Scalability requires using the same parallel control structure, the partitioned mesh, for all steps

Geometry-Based Analysis
- Geometry and attributes: analysis domain
- Mesh: 0-3D topological entities and adjacencies
- Field: distribution of the solution over the mesh
- Common requirements: data traversal, arbitrarily attachable user data, data grouping, etc.
- Complete representation: store sufficient entities and adjacencies to get any adjacency in O(1) time
- [Figure: geometric model and mesh; a mesh part with its regions, faces, edges and vertices]

Parallel Unstructured Mesh Infrastructure (PUMI)
- Capability to partition the mesh into multiple parts per process
- [Figure: geometric model, partition model and distributed mesh, with inter-process and intra-process part boundaries between parts on processes i and j]

Distributed Mesh Data Structure
- Each part Pi is assigned to a process
  - Consists of the mesh entities assigned to the ith part
  - Entities are uniquely identified by handle, or by id plus part number
  - Treated as a serial mesh with the addition of part boundaries
- Part boundary: groups of mesh entities on shared links between parts
- Part boundary entity: an entity duplicated on every part where it bounds higher-dimension mesh entities
- Remote copy: the duplicate of an entity on a non-local part
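The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the class `Entity`, the field `remote_copies` and the helper `link_copies` are hypothetical names chosen for the sketch, not PUMI's API.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A mesh entity on one part, identified by (part id, local id)."""
    part: int
    local_id: int
    dim: int  # 0 = vertex, 1 = edge, 2 = face, 3 = region
    # Remote copies: part id -> local id of the duplicate on that part.
    remote_copies: dict = field(default_factory=dict)

    def is_part_boundary(self) -> bool:
        # An entity is on a part boundary when a copy exists on another part.
        return bool(self.remote_copies)

    def residence_parts(self) -> set:
        # All parts on which this entity exists, including its own.
        return {self.part} | set(self.remote_copies)


def link_copies(a: Entity, b: Entity) -> None:
    """Record that a and b are copies of the same entity on different parts."""
    a.remote_copies[b.part] = b.local_id
    b.remote_copies[a.part] = a.local_id


# A vertex shared by parts 0 and 1, and an interior vertex on part 0.
v0 = Entity(part=0, local_id=7, dim=0)
v1 = Entity(part=1, local_id=3, dim=0)
link_copies(v0, v1)
interior = Entity(part=0, local_id=8, dim=0)
```

Storing the remote copies on the entity itself means a part can tell which neighbors to contact for a boundary entity without a global lookup.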

Mesh Migration
- Purpose: moving mesh entities between parts
  - Dictated by the operation: in swap and collapse, it is the mesh entities on other parts needed to complete the mesh modification cavity
  - Entities to migrate are determined based on adjacencies
- Issues
  - A function of the mesh representation w.r.t. adjacencies, P-sets and arbitrary user data attached to them
  - A complete mesh representation can provide any adjacency without mesh traversal, a requirement for satisfactory performance
- Performance issues
  - Synchronization, communication, load balance and scalability
  - How to benefit from on-node thread communication (all threads in a process share the same memory address space)
Notes: The mesh migration procedure moves mesh entities from part to part. It is called thousands of times during a simulation, either for load balancing or for local migration that supports mesh modification operations on or near part boundaries, so an efficient migration algorithm is critical to the performance of parallel mesh applications. Based on the distributed mesh data structure and the partition model, we developed an efficient mesh migration algorithm. The algorithm shown is based on a complete representation. First, … Later this algorithm will be extended to work with any representation option. The four issues we confront with mesh migration are:
- synchronization: the process by which two or more processes (threads in a multi-core system) coordinate their activities
- communication: the bandwidth and latency issues associated with exchanging data between processes (or threads)
- load balancing: the distribution of mesh data across multiple processes (or threads) so that they all perform roughly the same amount of work
- scalability: the challenge of making efficient use of a larger number of processes (or threads) when the software is run on more capable systems

Ghosting
- Goal: localize off-part mesh data to avoid inter-process communication during computations
- Ghost: read-only, duplicate entity copies, including their tag data, that are not shared part boundary entities
- Ghosting rule: triplet (ghost dim, bridge dim, # layers)
  - Ghost dim: entity dimension to be ghosted
  - Bridge dim: entity dimension used to obtain entities to be ghosted through adjacency
  - # layers: the number of ghost layers measured from the part boundary
  - E.g., to get two layers of region entities in the ghost layer, measured from faces on the part boundary, use ghost_dim=3, bridge_dim=2 and # layers=2
- Stale ghosts are updated during mesh synchronization
- Ghost creation exploits neighborhood communication
- Efficiency = (time_base * (1 + increase in #E/ghosted)) / time_test
  - time_base: time to execute the test with 1 ghost layer
  - time_test: time to execute the test with the nth ghost layer
- Applications with ghosting
  - SPR error estimator in parallel
    - Nodal patch recovery needs a complete cavity on part boundaries
    - One layer of ghost regions is created (bridge: vertices)
    - Solution field data is also ghosted
    - The on-part SPR procedure is carried out with the ghosted data
  - Mesquite mesh smoothing
    - Improves quality through nodal repositioning
    - In parallel, data must be coordinated between parts; without coordination, invalid meshes can be created
    - Uses ghosting to coordinate the data
    - Creates ghost regions with bridge vertices
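The ghosting rule can be illustrated with a serial Python sketch that, given the elements on one part and a map from each element to its bridge entities, collects the requested number of ghost layers. The function name `ghost_layers` and the dictionary-based mesh are assumptions made for the example; a real implementation works on the distributed mesh and exchanges data with neighboring parts.

```python
def ghost_layers(elem_to_bridge, owned, num_layers):
    """Return off-part elements within num_layers of the part, by bridge adjacency.

    elem_to_bridge: dict element id -> set of bridge entity ids (e.g. vertex ids)
    owned: set of element ids on the local part
    """
    # Invert the map: bridge entity -> elements that touch it.
    bridge_to_elem = {}
    for e, bridges in elem_to_bridge.items():
        for b in bridges:
            bridge_to_elem.setdefault(b, set()).add(e)

    covered = set(owned)   # elements already local or ghosted
    ghosts = set()
    frontier = set(owned)
    for _ in range(num_layers):
        next_layer = set()
        for e in frontier:
            for b in elem_to_bridge[e]:
                next_layer |= bridge_to_elem[b]
        next_layer -= covered      # keep only newly reached elements
        ghosts |= next_layer
        covered |= next_layer
        frontier = next_layer
    return ghosts


# A strip of four 1D "elements" joined end to end; vertices are the bridge.
strip = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 4}}
one_layer = ghost_layers(strip, {0}, 1)
two_layers = ghost_layers(strip, {0}, 2)
```

With element 0 on the local part, one layer reaches element 1 and two layers reach elements 1 and 2, mirroring how the # layers entry of the triplet widens the ghost region.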

Two-Level Partitioning to Use MPI and Threads
- Exploit hybrid architecture of BG/Q, Cray XE6, etc.
- Reduced memory usage
- Approach
  - Partition mesh to processes, then partition to threads
  - Message passing, via MPI, between processes
  - Shared memory, via pthreads, within a process
  - Transparent-to-application use of pthreads
- [Figure: parts Pi mapped to pthreads inside MPI processes 1 to 4, with intra-process and inter-process part boundaries between processes i and j]
Notes on node architectures:
- BG/Q: 16 A2 cores/node, 16 GB/node
- XE6: 2 sockets, 12 Magny-Cours cores/socket, 32 GB/node
- XK7: 16 Interlagos cores/node, 32 GB/node (Blue Waters); NVIDIA GPU: K10 (Blue Waters) | K20 (Titan)
- Phi: 61 cores, 4 threads/core, plus host sockets/processors, 8 GB memory
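A small Python sketch shows how a global part id might map onto the two levels. The block assignment of consecutive parts to one process, and the names `locate_part` and `boundary_kind`, are assumptions for illustration; the slides do not state how parts are numbered.

```python
def locate_part(part_id, threads_per_process):
    """Map a global part id to (MPI rank, thread index) under block assignment."""
    return divmod(part_id, threads_per_process)


def boundary_kind(part_a, part_b, threads_per_process):
    """Classify the boundary between two parts as intra- or inter-process."""
    rank_a, _ = locate_part(part_a, threads_per_process)
    rank_b, _ = locate_part(part_b, threads_per_process)
    # Parts on the same rank share memory; otherwise messages are needed.
    return "intra-process" if rank_a == rank_b else "inter-process"


# With 16 threads per process, parts 0..15 live on rank 0, 16..31 on rank 1.
where = locate_part(35, 16)
```

Under this mapping, data crossing an intra-process boundary can be exchanged through shared memory, while an inter-process boundary requires message passing.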

Blue Gene/Q Two-Level Partition Results
- AAA mesh: 2M tets, 32 parts, 2 nodes
- SLAC mesh: 17M tets, 64 parts, 4 nodes
- Torus mesh: 610M tets, 4096 parts, 256 nodes
- Test: local migration, all MPI vs. 1 MPI rank with 16 threads per node
- Speedup: up to 27%

Dynamic Load Balancing
- Purpose: to rebalance a load-imbalanced mesh during mesh modification
  - Equal "work load" with minimum inter-process communication
- Two tools being used
  - Zoltan Dynamic Services, supporting multiple dynamic partitioners with general control of partition objects and weights
  - ParMA: Partitioning using Mesh Adjacencies

Dynamic Repartitioning (Dynamic Load Balancing)
- [Flowchart: Initialize Application, Partition Data, Redistribute Data, Compute Solutions & Adapt, Output & End]
- Dynamic repartitioning (load balancing) in an application
  - Data partition is computed
  - Data are distributed according to the partition map
  - Application computes and, perhaps, adapts
  - Process repeats until the application is done
- Ideal partition
  - Processor idle time is minimized
  - Inter-processor communication costs are kept low
  - Cost to redistribute data is also kept low

Static vs. Dynamic: Usage and Implementation
- Dynamic
  - Must run side-by-side with the application
  - Must be implemented in parallel
  - Must be fast and scalable
  - Library application interface required
  - Should be easy to use
  - Incremental algorithms preferred
    - Small changes in input result in small changes in partitions
    - Explicit or implicit incrementality acceptable
- Static
  - Pre-processor to the application
  - Can be implemented serially
  - May be slow, expensive
  - File-based interface acceptable
  - No consideration of existing decomposition required

Zoltan Toolkit: Suite of Partitioners
- Recursive Coordinate Bisection (Berger, Bokhari)
- Recursive Inertial Bisection (Taylor, Nour-Omid)
- Space Filling Curves (Peano, Hilbert)
- Refinement-tree Partitioning (Mitchell)
- Graph Partitioning
  - ParMETIS (Karypis, Schloegel, Kumar)
  - Jostle (Walshaw)
- Hypergraph Partitioning & Repartitioning (Catalyurek, Aykanat, Boman, Devine, Heaphy, Karypis, Bisseling)
  - PaToH (Catalyurek)

Geometric Partitioners
- Goal: create parts containing physically close data
  - RCB/RIB: compute cutting planes that recursively divide work
  - SFC: partition the linear ordering of data given by a space-filling curve
- Advantages
  - Conceptually simple; fast and inexpensive
  - Effective when connectivity info is not available (e.g., in particle methods)
  - All processors can inexpensively know the entire decomposition
  - RCB: regular subdomains useful in structured or unstructured meshes
  - SFC: linear ordering may improve cache performance
- Disadvantages
  - No explicit control of communication costs
  - Can generate disconnected subdomains for complex geometries
  - Geometric coordinates needed
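Recursive coordinate bisection is simple enough to sketch in serial Python. The version below assumes unit weights, a power-of-two part count and a median cut across the longest axis; the function name `rcb` is chosen for the example and this is not Zoltan's implementation.

```python
def rcb(points, num_parts):
    """Recursive coordinate bisection: assign each point index a part id.

    points: list of coordinate tuples; num_parts: a power of two (assumed).
    """
    assignment = [0] * len(points)

    def split(indices, first_part, parts):
        if parts == 1 or not indices:
            for i in indices:
                assignment[i] = first_part
            return
        dims = range(len(points[0]))
        # Cut across the axis with the largest extent.
        axis = max(dims, key=lambda d: max(points[i][d] for i in indices)
                                       - min(points[i][d] for i in indices))
        ordered = sorted(indices, key=lambda i: points[i][axis])
        mid = len(ordered) // 2   # median cut balances unit-weight points
        split(ordered[:mid], first_part, parts // 2)
        split(ordered[mid:], first_part + parts // 2, parts // 2)

    split(list(range(len(points))), 0, num_parts)
    return assignment


pts = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (3.0, 0.1),
       (4.0, 0.0), (5.0, 0.1), (6.0, 0.0), (7.0, 0.1)]
parts = rcb(pts, 4)
```

Only coordinates are consulted, which is why the method is cheap and needs no connectivity, and also why it gives no explicit control over the communication across the cuts.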

Topology-based Partitioners
- Goal: balance work while minimizing data dependencies between parts
  - Represent data with vertices of a graph/hypergraph
  - Represent dependencies with graph/hypergraph edges
- Advantages
  - High quality partitions for many applications
  - Explicit control of communication costs
  - Much software available
    - Serial: Chaco, METIS, Scotch, PaToH, Mondriaan
    - Parallel: Zoltan, ParMETIS, PT-Scotch, Jostle
- Disadvantages
  - More expensive than geometric approaches
  - Require explicit dependence info

Partitioning using Mesh Adjacencies (ParMA)
- Mesh and partition model adjacencies represent application data more completely than the standard partitioning graph
  - All mesh entities can be considered, while graph-partitioning models use only a subset of mesh adjacency information
  - Any adjacency can be obtained in O(1) time (assuming use of a complete mesh adjacency structure)
- Advantages
  - Directly accounts for multiple entity types, which is important for the solve process, the most computationally expensive step
  - Avoids graph construction
  - Easy to use with diffusive procedures
- Applications to date
  - Partition improvement to account for multiple entity types, giving improved scalability of solvers
  - Improving partitions on very large meshes

ParMA: Multi-Criteria Partition Improvement
- Improves scalability of the solve by accounting for the balance of multiple entity types, eliminating spikes
- Input
  - Priority list of mesh entity types to be balanced (region, face, edge, vertex)
  - Partitioned mesh with communication, computation and migration weights for each entity
- Algorithm
  - From high to low priority if separated by '>' (different groups)
  - From low to high dimension entity types if separated by '=' (same group)
  - Compute migration schedule (collective)
  - Select regions for migration (embarrassingly parallel)
  - Migrate selected regions (collective)
- Example: "Rgn>Face=Edge>Vtx" is the user's input
  - Step 1: improve balance for mesh regions
  - Step 2.1: improve balance for mesh edges
  - Step 2.2: improve balance for mesh faces
  - Step 3: improve balance for mesh vertices
- [Figure: mesh element selection]
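The ordering rule for the priority list can be made concrete with a short Python sketch that expands a priority string into the sequence of entity types to balance. The parser `balance_order` and the `DIM` table are hypothetical helpers written for this example, not part of ParMA.

```python
# Entity dimension for each name used in the priority string.
DIM = {"Vtx": 0, "Edge": 1, "Face": 2, "Rgn": 3}


def balance_order(priority):
    """Expand a priority string such as 'Rgn>Face=Edge>Vtx' into ordered steps."""
    order = []
    for group in priority.split(">"):          # high to low priority
        names = [n.strip() for n in group.split("=")]
        names.sort(key=lambda n: DIM[n])       # low to high dimension within a group
        order.extend(names)
    return order


steps = balance_order("Rgn>Face=Edge>Vtx")
```

For the slide's example this yields regions, then edges, then faces, then vertices, matching Steps 1, 2.1, 2.2 and 3.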

ParMA Application: Partition Improvement
- Example of C0, linear shape function finite elements
  - Assembly is sensitive to mesh element imbalances
  - Sensitive to vertex imbalances, since the vertices hold the dofs
  - The most heavily loaded part dictates solver performance
- Element-based partitioning results in spikes of dofs
- Diffusive application of ParMA knocks the spikes down; a 10% increase in strong scaling is common
  - Element imbalance increased from 2.64% to 4.54%
  - Dof imbalance reduced from 14.7% to 4.92%

Predictive Load Balancing
- Mesh modification before load balancing can lead to memory problems; it is common to see a 400% increase on some parts
- Employ predictive load balancing to avoid the problem
  - Assign weights based on what will be refined/coarsened
  - Apply dynamic load balancing using those weights
  - Perform mesh modifications
  - May want to do some local migration
- [Figure: histogram of element imbalance in a 1024-part adapted mesh on the Onera M6 wing if no load balancing is applied prior to adaptation; 120 parts have ~30% of the average load, ~20 parts have > 200% imbalance, and peak imbalance is ~430%]

Predictive Load Balancing Algorithm
- The mesh metric field at any point P is decomposed into three orthogonal directions (e1, e2, e3) and a desired length (h1, h2, h3) in each corresponding direction
- The volume of the desired element (tetrahedron): h1h2h3/6
- Estimate the number of elements to be generated
- "num" is scaled to a good range before it is specified as a weight on graph nodes
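The formula for the estimate is not reproduced in the transcript. One plausible reading, shown here strictly as an assumption, is that the predicted count for a current element is its volume divided by the desired element volume h1h2h3/6. The helper names `tet_volume` and `predicted_elements` are illustrative.

```python
def tet_volume(a, b, c, d):
    """Volume of a tetrahedron from its four vertex coordinates."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    # Scalar triple product u . (v x w).
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0


def predicted_elements(volume, h):
    """Estimated element count after adaptation for one current element.

    Assumption: count = current volume / desired volume, with the desired
    volume h1*h2*h3/6 taken from the slide.
    """
    h1, h2, h3 = h
    return volume / (h1 * h2 * h3 / 6.0)


unit_tet = tet_volume((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
num = predicted_elements(unit_tet, (0.5, 0.5, 0.5))
```

Halving the desired size in all three directions predicts roughly eight elements where there was one; such a value would then be scaled before being used as a partitioning weight, as the slide notes.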

General Mesh Modification for Mesh Adaptation
- Goal is the flexibility of remeshing with added advantages
  - Supports general changes in mesh size, including anisotropy
  - Can deal with any level of geometric domain complexity
  - Can obtain the level of accuracy desired
  - Solution transfer can be applied incrementally
- Given the "mesh size field"
  - Drive the mesh modification loop at the element level
    - Look at element edge lengths and shape (in transformed space)
    - If not satisfied, select the "best" modification
    - Elements with edges that are too long must have edges split or swapped out
    - Short edges are eliminated
  - Continue until size and shape are satisfied or no more improvement is possible
- Determination of the "best" mesh modification
  - Selection of mesh modifications based on satisfaction of the element requirements
  - Appropriate consideration of neighboring elements
  - Choosing the "best" mesh modification

Mesh Adaptation by Local Mesh Modification
- Controlled application of mesh modification operations, including dealing with curved geometries and anisotropic meshes
- Base operators
  - Swap
  - Collapse
  - Split
  - Move
- Compound operators chain single-step operators
  - Double split collapse operator
  - Swap(s) followed by collapse operator
  - Split, then move the created vertex
  - Etc.
- [Figures: edge split; face split; edge collapse; double split collapse to remove the red sliver; initial mesh (20,067 tets) and adapted mesh (approx. 2M tets)]

Accounting for Curved Domains During Refinement
- Moving refinement vertices to the boundary requires mesh modification (see IJNME paper, vol. 58, pp. 247-276, 2003)
- [Figures: coarse initial mesh and the mesh after multiple refinement/coarsening; operations to move refinement vertices]
Notes: The example is a complex mechanical part. The left picture shows the Parasolid geometry model, the middle one the initial mesh, and the right one the refined mesh with all boundary vertices snapped. The table lists the local mesh modifications applied: the second column gives the number of vertices to be snapped, the third the vertices snapped by repositioning, the fourth the vertices snapped by mesh modification, and the last the vertices snapped by remeshing. Most vertices are snapped by repositioning, most of the remainder are snapped after mesh modifications are applied, and a few must be snapped by remeshing.

Matching Boundary Layer and Interior Mesh A modification operation on any layer is propagated through the stack and interface elements to preserve attributes of layered elements.

Curved Elements for Higher-Order Methods
- Requirements
  - Coarse, strongly graded meshes with curved elements
  - Must ensure the validity of curved elements
  - Shape measure for curved elements
    - Standard straight-sided measure in 0-1 format
    - 0-1 curved measure (det. of Jacobian variation)
  - Element geometric order and level of geometric approximation need to be related to geometric shape order
- Steps in the procedure (for optimum convergence rate)
  - Automatic identification and linear mesh at singular features
  - Generate coarse surface mesh accounting for the boundary layers
  - Curve coarse surface mesh to boundary
  - Curve graded linear feature isolation mesh
  - Generate coarse linear interior mesh
  - Modify interior linear mesh to ensure validity with respect to the curved surface and graded linear feature isolation mesh

Example p-Version Mesh Isolation on model edges Straight-sided mesh with gradient Curved mesh with gradient

Parallel Mesh Adaptation
- Parallelization of refinement: perform on each part and synchronize at inter-part boundaries
- Parallelization of coarsening and swapping: migrate the cavity (on-the-fly) and perform the operation locally on one part
- Support for parallel mesh modification requires updating the evolving communication links between parts and dynamic mesh partitioning

Boundary Layer Mesh Adaptation
- Boundary layer stacks are stored in P-sets
  - Mesh entities contained in a set are unique and are not part of the boundary of any higher-dimension mesh entities
  - A set and its constituent entities migrate to another part together
- [Figures: heat exchanger manifold model, in which a large flow rate enters through the large tube and dumps into a thin rectangular geometry where the flow is distributed into 20 smaller pipes; trapwing]
Notes: The whole mesh adaptation process uses SCOREC FMDB and the mesh adaptation software package, with mesh sets defining the small stacks of mesh entities.

Parallel Boundary Layer Adaptation
- Initial mesh of 2k elements
- Refinement and node repositioning with limited coarsening and swapping
- Final mesh of 210k elements

Mesh Adaptation to an Anisotropic Mesh Size Field
- Define the desired element size and shape distribution following a mesh metric
- Transformation matrix field T(x,y,z)
  - An ellipsoid in physical space is transformed to a normalized sphere
  - Unit vectors associated with the three principal directions
  - Desired mesh edge lengths in these directions
- Decomposition of boundary layers into layer surfaces (2D) and a thickness (1D) mesh
- In-plane adaptation uses the projected Hessian; thickness adaptation is based on BL theory
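A short Python sketch shows how an edge can be measured in the transformed space, where an edge of exactly the desired size has length 1. It assumes the three directions are orthonormal; the function names and the split/collapse thresholds are illustrative assumptions, not values from the presentation.

```python
import math


def metric_length(edge, directions, sizes):
    """Edge length in the transformed space where desired edges have length 1."""
    total = 0.0
    for e, h in zip(directions, sizes):
        component = sum(x * y for x, y in zip(edge, e))  # projection onto e
        total += (component / h) ** 2
    return math.sqrt(total)


def classify(length, too_long=1.5, too_short=0.5):
    # Thresholds are illustrative assumptions, not values from the slides.
    if length > too_long:
        return "split"
    if length < too_short:
        return "collapse"
    return "ok"


axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
sizes = (1.0, 0.1, 1.0)   # fine resolution requested in the y direction

along_x = metric_length((1.0, 0.0, 0.0), axes, sizes)
along_y = metric_length((0.0, 1.0, 0.0), axes, sizes)
```

The same physical edge length is acceptable along x but ten times too long along y, which is the situation an anisotropic size field is meant to express.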

Example 2 – M6 Wing Overall mesh Close-up to see adaptation in the boundary layer including intersection with shock

Example of Anisotropic Adaptation

Example Surface of adapted mesh for human abdominal aorta

Component-Based Construction of Adaptive Loops
- Builds on the unstructured mesh infrastructure
- Employs a component-based approach, with components interacting through functional interfaces
- Being used to construct parallel adaptive loops for codes
- Recently used for a 92B element mesh on ¾ million cores
- [Figure: overall geometry and slice plane shown for an 11B element mesh]

Component-Based Unstructured Mesh Infrastructure
[Diagram of components and the data they exchange]
- Components: Physics and Model Parameters; Input Domain Definition; Complete Domain Definition; Mesh Generation and Adaptation; Parallel Infrastructure (Domain Topology, Mesh Topology and Partition Control, Dynamic Load Balancing); Mesh-Based Analysis; Solution Transfer; Correction Indicator; Postprocessing/visualization
- Data exchanged: process parameters; physical parameters; PDEs and discretization methods; non-manifold model construction; attributed non-manifold topology; geometric interrogation; mesh with fields; mesh size field; solution fields; discretization parameters; attributed mesh and fields

In-Memory Adaptive Loop
- Mapping data between component data structures and executing memory management
- Components integrated using functional interfaces
- Change or add components with minimal development costs
- Comparison of file-based and in-memory transfer for PHASTA, 85M element mesh on Hopper
  - On 512 cores, file-based took 49 sec and in-memory 2 sec
  - On 2048 cores, file-based took 91 sec and in-memory 1 sec
- [Diagram: Adaptive Loop Driver connecting PHASTA (compact mesh and solution data) and Mesh Adaptation (mesh database, solution fields) through the Field API, with control and field data flows]

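Active Flow Control Simulations
[Component diagram instantiated for active flow control]
- Physics and model parameters: actuator parameters, physical parameters
- Domain definition: Parasolid
- Physics and discretization: NS with turbulence, finite elements
- Mesh generation and adaptation: MeshSim and MeshAdapt, or MeshSim Adapt
- Parallel infrastructure: PUMI (GMI; FMDB and Partition Model; Zoltan and ParMA)
- Correction indicator: anisotropic correction indication
- Solution fields: flow field
- Discretization parameters: element order, boundary layer info
- Mesh-based analysis: PHASTA
- Postprocessing/visualization: ParaView
- Data exchanged: attributed non-manifold topology, geometric interrogation, mesh with fields, mesh size field, attributed mesh and fields, Solution Transfer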

Example of Scalable Solver: PHASTA
- Excellent strong scaling
- Implicit time integration
- Employs the partitioned mesh for system formulation and solution
- Specific number of ALL-REDUCE communications also required
- [Figure: strong scaling results]

Mesh Adaptivity for Synthetic Jets (O. Sahni)
- f_act = 2,300 Hz, = 00, Re ~ O(100,000)

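Aerodynamics Simulations
[Component diagram instantiated for aerodynamics]
- Physics and model parameters: process parameters, physical parameters
- Domain definition: Parasolid; Parasolid or GeomSim
- Physics and discretization: NS, finite volumes
- Mesh generation and adaptation: MeshSim and MeshAdapt, or MeshSim Adapt
- Parallel infrastructure: PUMI and/or Simmetrix (GMI or GeomSim; FMDB and Partition Model or MeshSim; Zoltan and ParMA)
- Correction indicator: goal-oriented error estimation
- Solution fields: flow field
- Discretization parameters: FV method, boundary layer info
- Mesh-based analysis: FUN3D from NASA
- Postprocessing/visualization: ParaView
- Data exchanged: attributed non-manifold topology, geometric interrogation, mesh with fields, mesh size field, attributed mesh and fields, Solution Transfer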

Application Result - Scramjet Engine Initial Mesh Adapted Mesh

Adaptive Two-Phase Flow
[Component diagram instantiated for two-phase flow]
- Physics and model parameters: actuator parameters, physical parameters
- Domain definition: Parasolid
- Physics and discretization: NS and level sets, finite elements
- Mesh Generation and Adaptation
- Parallel infrastructure: PUMI (GMI; FMDB and Partition Model; Zoltan and ParMA)
- Correction indicator: anisotropic correction indication
- Solution fields: flow field, zero level set
- Mesh-based analysis: PHASTA
- Postprocessing/visualization: ParaView
- Data exchanged: attributed non-manifold topology, geometric interrogation, mesh with fields, mesh size field, attributed mesh and fields, Solution Transfer

Adaptive Simulation of Two-Phase Flow Two-phase modeling using level-sets coupled to structural activation Adaptive mesh control – reduces mesh required from 20 million elements to 1 million elements

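Electromagnetics Analysis
[Component diagram instantiated for electromagnetics]
- Physics and model parameters: process parameters, physical parameters
- Domain definition: ACIS
- Physics and discretization: electromagnetics, edge elements
- Mesh generation and adaptation: MeshSim and MeshAdapt, or MeshSim Adapt
- Parallel infrastructure: PUMI and/or Simmetrix (GMI or GeomSim; FMDB and Partition Model or MeshSim; Zoltan and ParMA)
- Correction indicator: projection-based method
- Fields and discretization parameters: element order, integration rule, stresses
- Mesh-based analysis: ACE3P from SLAC
- Postprocessing/visualization: ParaView
- Data exchanged: attributed non-manifold topology, geometric interrogation, mesh with fields, mesh size field, attributed mesh and fields, Solution Transfer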

Adaptive Control Coupled with PIC Method
- Adaptation based on
  - Tracking particles (needs fine mesh)
  - Discretization errors
- Full accelerator models
  - Approaching 100 cavities
  - Substantial internal structure
  - Meshes with several hundred million elements

Albany Multiphysics Code Targets Several Objectives
- A finite element based application development environment containing the "typical" building blocks needed for rapid deployment and prototyping
- A mechanism to drive and demonstrate our Agile Components rapid software development vision and the use of template-based generic programming (TBGP) for the construction of advanced analysis tools
- A Trilinos demonstration application; Albany uses ~98 Sandia packages/libraries
- Provides an open-source computational mechanics environment and serves as a test-bed for algorithms under development by the Laboratory of Computational Mechanics (LCM) destined for Sandia's production codes

Albany: Agile Component Architecture
[Architecture diagram; labels recovered from the slide, in the order they appear]
- Software Quality Tools; Libraries; Interfaces; Existing Apps; Analysis Tools; Main; Version Control; Optimization; Build System; Input Parser; Regression Testing; UQ; Application; Nonlinear Model; Problem Discretization; Mesh Tools; Mesh Adapt; Load Balancing; Nonlinear Solvers; Albany Glue Code; Nonlinear; Transient; Linear Solve; ManyCore Node; PDE Assembly; Linear Solvers; Node Kernels; Field Manager; Iterative; PDE Terms; Multi-Core; Discretization; Multi-Level; Accelerators
- Change: OLD: FMDB Database; NEW: PUMI

Agile Toolbox: Capabilities
Part of roadmap: individual capabilities that can be written as reusable libraries. Much of this exists.
[Capability diagram; labels as extracted: Analysis Tools (black-box), Composite Physics, MultiPhysics Coupling, Optimization, Solution Control, UQ (sampling), System Models, Parameter Studies, Utilities, System UQ, PostProcessing, V&V, Calibration, Input File Parser, Visualization, OUU, Reliability, Mesh Tools, Parameter List, Verification, Mesh I/O, Memory Management, Analysis Tools (embedded), Feature Extraction, Inline Meshing, I/O Management, Model Reduction, Nonlinear Solver, Partitioning, Communicators, Time Integration, Load Balancing, Runtime Compiler, Mesh Database, Continuation, Adaptivity, MultiCore Parallelization Tools, Mesh Database, Sensitivity Analysis, Remeshing, Geometry Database, Stability Analysis, Grid Transfers, Solution Database, Constrained Solves, Quality Improvement, Software Quality, Modification Journal, Optimization, Search, Version Control, Checkpoint/Restart, UQ Solver, DOF map, Regression Testing, Build System, Linear Algebra, Local Fill, Backups, Data Structures, Discretizations, Verification Tests, Iterative Solvers, Physics Fill, Discretization Library, Mailing Lists, Direct Solvers, PDE Eqs, Field Manager, Unit Testing, Eigen Solver, Material Models, Bug Tracking, Preconditioners, Derivative Tools, Phys-Based Prec., Performance Testing, Matrix Partitioning, Sensitivities, Objective Function, Code Coverage, Derivatives, Constraints, Architecture-Dependent Kernels, Porting, Adjoints, Error Estimates, Web Pages, UQ / PCE Propagation, Multi-Core, MMS Source Terms, Release Process, Accelerators.]

Structural Analysis for Integrated Circuits on BG/Q
[Workflow diagram. Analysis: solid mechanics with finite elements, Albany/Trilinos. Geometry path: gdsII layout/process data, gds2 to Parasolid, Parasolid to GeomSim. Tools: PUMI and/or Simmetrix (GMI or GeomSim; FMDB and Partition Model, or MeshSim); MeshSim and MeshAdapt, or MeshSim Adapt; Zoltan and ParMA; ParaView. Steps: solution transfer, projection-based method. Data exchanged: physics and model parameters, process parameters, physical parameters, attributed non-manifold topology, geometric interrogation, mesh size field, element order, integration rule, stresses, mesh with fields, attributed mesh and fields.]

From Design Data to Geometry for Meshing
- A complete non-manifold solid model is needed for: automatic mesh generation; supporting high-level problem specification; maintaining geometric fidelity during mesh adaptation
- Tool to take design/process data and create the solid model: basic design data is in 2-D layouts (gdsII/OASIS); the 3rd dimension must be added; process “knowledge” is critical for constructing the full geometry; a set of structures and methods builds the solid model using modeling kernel operations

Parallel Mesh Generation
All procedures are fully automatic; the user is not required to partition.
- Surface Meshing: distributes model faces between processes. Requires # model faces > # processors to scale; in practice this is not an issue.
- Volume Meshing: load balancing is done through spatial decomposition. The mesh interior to each part is created, then repartitioning is done to mesh the unmeshed areas between part boundaries.
- Mesh Improvement: local operations are done on each part; local migrations are done between parts to improve elements on part boundaries.
Machine notes: Q: 16 A2 cores/node, 16 GB/node. XE6: 2 sockets, 12 Magny-Cours cores/socket, 32 GB/node. XK7: 16 Interlagos cores/node, 32 GB/node (Blue Waters); NVIDIA GPU: K10 (Blue Waters) | K20 (Titan).
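The surface-meshing point above (model faces are distributed between processes, so scaling needs more faces than processes) can be illustrated with a small sketch. This is not Simmetrix's algorithm, which the slides do not describe; it is a generic greedy "largest cost first" assignment, and the function name `distribute_faces` and the per-face cost estimates are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence


def distribute_faces(face_costs: Sequence[float], nprocs: int) -> Dict[int, List[int]]:
    """Greedy sketch: give each model face to the least-loaded process.

    face_costs[i] is an assumed meshing-cost estimate for face i
    (for example its area); it is illustrative, not a real API input.
    """
    assignment: Dict[int, List[int]] = {rank: [] for rank in range(nprocs)}
    # Min-heap of (accumulated load, rank) so the lightest process pops first.
    heap = [(0.0, rank) for rank in range(nprocs)]
    heapq.heapify(heap)
    # Place expensive faces first; sorted() is stable, so ties keep index order.
    order = sorted(range(len(face_costs)), key=lambda f: -face_costs[f])
    for face in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(face)
        heapq.heappush(heap, (load + face_costs[face], rank))
    return assignment


if __name__ == "__main__":
    print(distribute_faces([5, 4, 3, 3, 3], nprocs=2))
    # With fewer faces than processes, some processes receive no work.
    print(distribute_faces([1, 1], nprocs=4))
```

With only two faces and four processes, two processes stay idle, which is the situation the "# model faces > # processors" requirement rules out.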

Parallel Geometry
- Problems: CAD kernels are not available on computers like BlueGene. Even if they were, keeping the full geometric model on each processor does not scale.
- Simmetrix' solutions: a geometry representation that can be used anywhere. Geometry can be distributed in parallel. Only the model entities needed for the mesh on each processor are on that processor. Model entities migrate with the mesh. Both discrete and CAD geometry are supported.

Parallel Mesh Generation Results
- Scaling parallel mesh generation is difficult: there is no a priori knowledge of how to partition; partitioning must be determined as meshing proceeds
- Results for volume meshing

Large Deformation Non-Linear Materials
- Mesh adaptation to control: mesh discretization errors; element shapes due to large deformation

Introduction to the Hands-On Session

Closing Remarks
A set of tools to support parallel unstructured mesh adaptation has been developed:
- Parallel mesh infrastructure
- Dynamic load balancing
- Mesh adaptation
- Support for heterogeneous parallel computers under development
Tools used to develop parallel adaptive simulations:
- Both unstructured mesh finite element and finite volume procedures being developed
- Multiple problem areas: CFD, MHD, EM, solids
- Can account for semi-structured mesh regions, evolving geometry, high-order curved meshes
More Information: shephard@rpi.edu

Granularity: Computation / Communication Ratio
In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. Periods of computation are typically separated from periods of communication by synchronization events.
- Fine-grain Parallelism: Relatively small amounts of computational work are done between communication events. Low computation to communication ratio. Facilitates load balancing. Implies high communication overhead and less opportunity for performance enhancement. If granularity is too fine, the overhead required for communications and synchronization between tasks can take longer than the computation.
- Coarse-grain Parallelism: Relatively large amounts of computational work are done between communication/synchronization events. High computation to communication ratio. Implies more opportunity for performance increase. Harder to load balance efficiently.
- Which is Best? The most efficient granularity depends on the algorithm and the hardware environment in which it runs. In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity. Fine-grain parallelism can help reduce overheads due to load imbalance.
Hardware factors play a significant role in scalability. Examples: memory-CPU bus bandwidth on an SMP machine; communications network bandwidth; amount of memory available on any given machine or set of machines; processor clock speed.
Parallel Overhead: The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as: task start-up time; synchronizations; data communications; software overhead imposed by parallel compilers, libraries, tools, operating system, etc.; task termination time.
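The granularity trade-off above can be made concrete with a toy cost model. The model is an assumption for illustration only: each task has a fixed amount of compute time, split evenly across a number of communication events, each event costs a fixed time, and load balance is perfect. The function names are hypothetical.

```python
def comp_comm_ratio(total_compute: float, n_events: int, comm_cost: float) -> float:
    """Computation time per communication event divided by the cost of one event."""
    grain = total_compute / n_events  # work done between two communication events
    return grain / comm_cost


def efficiency(total_compute: float, n_events: int, comm_cost: float) -> float:
    """Fraction of wall time spent on useful work in the toy model."""
    overhead = n_events * comm_cost  # time spent communicating/synchronizing
    return total_compute / (total_compute + overhead)


if __name__ == "__main__":
    # Same 100 s of work per task, coarse grain (10 events) vs fine grain (1000 events).
    for events in (10, 1000):
        print(events,
              round(comp_comm_ratio(100.0, events, 1.0), 3),
              round(efficiency(100.0, events, 1.0), 3))
```

In this model the coarse-grain run has a ratio of 10 and spends most of its time computing, while the fine-grain run has a ratio of 0.1 and is dominated by overhead. The model ignores the load-balancing benefit of fine grains that the slide also mentions.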

PUMI: Parallel Unstructured Mesh Infrastructure
Capabilities:
- Unstructured 3D meshes w/ mixed element topology
- Support for higher order elements
- Direct relation to geometric model: Parasolid, ACIS, and discrete models supported
- Solution based mesh adaptation
- Parallel capabilities: static and dynamic partitioning; integration with Zoltan and ParMA; ghosting
- Functional interfaces for coupling to analysis codes: existing coupling with PHASTA, Albany/Trilinos, NASA FUN-3D, and SLAC ACE3P
Download: https://redmine.scorec.rpi.edu/projects/pumi
More Information: https://www.scorec.rpi.edu/pumi

Zoltan Toolkit: Suite of Partitioners
Capabilities:
- Common interface to multiple partitioning techniques/tools
- Geometric Partitioners: Recursive Coordinate Bisection (Berger, Bokhari); Recursive Inertial Bisection (Taylor, Nour-Omid); Space Filling Curves (Peano, Hilbert); Refinement-tree Partitioning (Mitchell)
- Topology-based Partitioners: ParMETIS (Karypis, Schloegel, Kumar); Jostle (Walshaw); Hypergraph Partitioning & Repartitioning (Catalyurek, Aykanat, Boman, Devine, Heaphy, Karypis, Bisseling); PaToH (Catalyurek)
- Coupled with PUMI
Download: http://www.cs.sandia.gov/~web1400/1400_download.html
More Information: http://www.cs.sandia.gov/Zoltan/
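As an illustration of the geometric partitioners listed above, the following is a minimal serial sketch of recursive coordinate bisection. It is not Zoltan's implementation or API; the function name `rcb` and the choice to cut along the axis of largest extent at a count-weighted median are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def rcb(points: Sequence[Point], ids: List[int], nparts: int) -> List[List[int]]:
    """Split the point ids into nparts groups of near-equal size.

    At each level the ids are sorted along the coordinate axis with the
    largest extent and cut so that part sizes stay proportional.
    """
    if nparts == 1:
        return [list(ids)]
    if not ids:
        # More parts than points: the remaining parts are empty.
        return [[] for _ in range(nparts)]
    dim = len(points[ids[0]])
    extents = [
        max(points[i][d] for i in ids) - min(points[i][d] for i in ids)
        for d in range(dim)
    ]
    axis = extents.index(max(extents))
    ordered = sorted(ids, key=lambda i: points[i][axis])
    left_parts = nparts // 2
    cut = len(ordered) * left_parts // nparts
    return (rcb(points, ordered[:cut], left_parts)
            + rcb(points, ordered[cut:], nparts - left_parts))


if __name__ == "__main__":
    # Eight element centroids along a line, split into four parts.
    centroids = [(float(x), 0.0) for x in range(8)]
    print(rcb(centroids, list(range(8)), 4))
```

Because it uses only coordinates, a partitioner of this kind needs no graph of the mesh, at the price of ignoring connectivity when placing the cuts.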

ParMA: Partitioning Using Mesh Adjacencies
Capabilities:
- Dynamic partitioning procedures using mesh adjacencies and partition model information. Any mesh adjacency can be obtained in O(1) time (assuming use of a complete mesh adjacency structure).
- Partition improvement to account for multiple entity types: improved scalability of solvers by reducing peak entity imbalance(s); avoids graph construction, so memory cost is low
- Predictive load balancing for mesh adaptation: avoids memory exhaustion
- Coupled with PUMI
Download (as part of PUMI): https://redmine.scorec.rpi.edu/projects/pumi
More Information: https://redmine.scorec.rpi.edu/projects/parma
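The "peak entity imbalance" that partition improvement targets can be sketched as follows. The definition used here, the largest per-part entity count divided by the average count, is a common convention and an assumption on my part; the slides do not give ParMA's exact metric, and the function names are hypothetical rather than part of any ParMA interface.

```python
from typing import Dict, Sequence


def peak_imbalance(counts: Sequence[int]) -> float:
    """Largest per-part entity count divided by the mean count (1.0 is perfect)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


def imbalance_by_type(parts: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Peak imbalance for each entity type, given per-part counts per type."""
    return {etype: peak_imbalance(counts) for etype, counts in parts.items()}


if __name__ == "__main__":
    # Regions are perfectly balanced, yet vertices are not: balancing one
    # entity type does not guarantee balance of the others.
    counts = {"region": [400, 400, 400, 400], "vertex": [90, 100, 110, 100]}
    print(imbalance_by_type(counts))
```

The example shows why balance has to be tracked per entity type: a partition balanced by regions can still leave one part with 10% more vertices than the average.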

MeshAdapt
Capabilities:
- Parallel adaptation of unstructured 3D meshes w/ mixed element topology
- Supports general changes in mesh size including anisotropy; typically driven by a solution field based size field
- Can deal with any level of geometric domain complexity
- Can obtain level of accuracy desired
- Solution transfer can be applied incrementally; callbacks for application defined transfer procedures
- Coupled with PUMI
Download: https://redmine.scorec.rpi.edu/projects/pumi
More Information: https://www.scorec.rpi.edu/meshadapt/

Albany
Capabilities:
- A FEM based application development environment containing the "typical" building blocks needed for rapid deployment and prototyping
- Open-source computational mechanics environment
- A Trilinos demonstration application that leverages ~98 Sandia packages/libraries. These packages support: uncertainty quantification and optimization; nonlinear and linear solvers; multi-physics coupling; kernels for multi-core and accelerator architectures
- Coupled with PUMI
Download: https://software.sandia.gov/albany
More Information: https://software.sandia.gov/albany/gettingStarted.pdf

Hands-on Exercise Outline
- Simmetrix Mesh Generation: video demonstrating Simmetrix mesh generation tools
- PUMI: video demonstrating mesh adaptation
- Mesh Partitioning via Zoltan: geometric and graph based (ParMETIS); ParMA partition improvement
- Albany: deformation of a cylindrical bar; visualization with ParaView; adaptive elastic deformation; preconditioner control