Parallel Algorithm Oriented Mesh Database
Jean-François Remacle, Mark S. Shephard & Joseph E. Flaherty
Scientific Computation Research Center, Rensselaer Polytechnic Institute
remacle@scorec.rpi.edu

Outline
- Algorithm Oriented Mesh Data Structure (AOMD)
- Parallel AOMD
- Technical issues
- Parallel adaptive example (DG)

http://www.scorec.rpi.edu/AOMD

Motivations of AOMD and PAOMD
- The aim of AOMD is to provide services to mesh users
  - Geometry-based analysis: the relation of mesh to model is maintained
  - Support of dynamic mesh adjacencies
  - Parallel services: message passing, adaptivity and load balancing capabilities (callback pattern)
  - No MPI calls visible to users: PAOMD hides parallel issues
- AOMD and PAOMD are a toolbox
  - Standard C++
  - Iterators, generic design
  - 3000 lines of code for the serial part
  - 1000 lines of code for the parallel part
  - Compiles in 5 minutes with gcc 3.0
  - Open source (BSD): http://www.scorec.rpi.edu/AOMD

Basics of the Algorithm Oriented Mesh Database
- A mesh entity is described by a set of lower-dimension entities
  - All vertices are always required
  - Vertices are atomic mesh entities and must be differentiated (using IDs, coordinates or anything else consistent)
- Two entities are equal if their sets of vertices are equal
  - Not absolutely general, but key to a practical implementation
  - Allows mesh entities to be compared (<, >, =) independently of their representation
  - Some associative containers for mesh entities: add, remove, search
- Minimum information
  - Mesh entities classified on model entities of the same dimension must be present
  - That is: all vertices, all regions, all edges classified on model edges and all faces classified on model faces
  - This is a sufficient minimum: no geometrical checks

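The equality rule above can be sketched in a few lines of C++. This is only an illustration, assuming integer vertex IDs: the `Entity` type and the `EntitySet` alias are hypothetical names and are not taken from AOMD.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mesh entity: it is identified only by the IDs
// of its vertices (an assumption made for this sketch).
struct Entity {
    std::vector<int> verts;  // kept sorted, so vertex order does not matter

    explicit Entity(std::vector<int> v) : verts(std::move(v)) {
        assert(!verts.empty());  // every entity needs at least one vertex
        std::sort(verts.begin(), verts.end());
    }
};

// Two entities are equal when their vertex sets are equal.
inline bool operator==(const Entity& a, const Entity& b) {
    return a.verts == b.verts;
}

// Strict weak ordering: fewer vertices first, then lexicographic on sorted IDs.
inline bool operator<(const Entity& a, const Entity& b) {
    if (a.verts.size() != b.verts.size()) return a.verts.size() < b.verts.size();
    return a.verts < b.verts;
}

// Associative container that supports add, remove and search.
using EntitySet = std::set<Entity>;
```

With this ordering, a face entered as vertices (3, 1, 2) is found again when searched as (1, 2, 3), whatever representation produced it.
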
Basics of the Parallel AOMD
- Partition boundaries are treated like model boundaries
  - Equal order mesh entities must exist on partition boundaries (partition faces, edges and vertices)
- Mesh vertices must be differentiated among partitions
  - Same IDs
  - Same coordinates
  - ...
- On processor: serial AOMD
- Implementation aspects
  - Simplicity: no master, no owner
  - Round of communication standardized, no MPI calls visible, messages automatically packed

Parallel AOMD - Mesh Adaptation
- Target is transient applications with thousands of mesh adaptation steps
  - Want fast and simple adaptation
  - Need efficient inter-processor communications
- Mesh refinement
  - Apply templates
  - Includes support of non-conforming meshes and multigrid
  - Refined entities with remote copies must be split on all partitions
- Mesh coarsening
  - Collect all mesh entities involved onto one partition
  - Carry out the operation using serial operators on processor

Dynamic Load Balancing and Mesh Migration
- Dynamic load balancing is needed after mesh adaptation
- Procedures build on the balancing procedures in Zoltan (from Sandia)
- The load balancing procedure indicates which mesh entities are to be migrated to which processor
- PAOMD only migrates the minimum set, unless the user specifically asks to migrate other entities
[Figures: classification after load balancing and before migration; configuration after migration]

Mesh Migration
- Steps in process
  - Collect the mesh entities to be migrated to another partition
  - Determine needed higher order mesh entities to be migrated (use AOMD to determine minimal set needed)
  - Collect entities and any user attached data
  - Perform communications to send entities and update links
- Message passing
  - At the PAOMD operator level it appears that messages are sent one at a time
  - Sent literally that way, this would lead to unacceptable communication costs
  - Message packing is used: AUTOPACK (from Argonne)
    - Automatically controls the message packing process
    - Includes information and tools to optimize message size for the network architecture used

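The idea behind message packing can be sketched as follows. This is not the Autopack API: `MessagePacker` and `unpack` are hypothetical names, and the length-prefixed layout is an assumption chosen for illustration. The caller issues one small send per entity, while only one buffer per destination processor would actually go on the network.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical packer (NOT the Autopack API): small messages are appended to
// one buffer per destination instead of being sent individually.
class MessagePacker {
public:
    // Append one small message for dest_proc to that processor's buffer.
    void send(int dest_proc, const void* data, std::uint32_t size) {
        assert(dest_proc >= 0);
        std::vector<char>& buf = buffers_[dest_proc];
        const char* hdr = reinterpret_cast<const char*>(&size);
        buf.insert(buf.end(), hdr, hdr + sizeof size);  // length prefix
        const char* p = static_cast<const char*>(data);
        buf.insert(buf.end(), p, p + size);             // payload
    }

    // One packed buffer per destination: this is what would really be sent.
    const std::map<int, std::vector<char>>& buffers() const { return buffers_; }

private:
    std::map<int, std::vector<char>> buffers_;
};

// Receiver side: split a packed buffer back into the individual messages.
inline std::vector<std::string> unpack(const std::vector<char>& buf) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos + sizeof(std::uint32_t) <= buf.size()) {
        std::uint32_t size = 0;
        std::memcpy(&size, buf.data() + pos, sizeof size);
        pos += sizeof size;
        assert(pos + size <= buf.size());  // buffer must not be truncated
        out.emplace_back(buf.data() + pos, size);
        pos += size;
    }
    return out;
}
```

Three calls to `send` addressed to two processors produce two buffers, so the number of network messages depends on the number of neighbours rather than on the number of entities migrated.
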
Implementation issues
- Design
  - C++ and generic programming
  - STL, efficient hashing function
  - AOMD::iterators follow the C++ standard; std:: algorithms (>100) may be applied in combination with AOMD::iterators
  - AOMD::algorithms available: adjacency creation, building a graph, building a tree in a mesh, edge collapsing…
  - Some OO patterns: Singleton, Visitor, Memento...
- Tradeoff efficiency vs. flexibility
  - We believe there is no tradeoff
  - Templates, functors, inlining…: C++ can be efficient
  - Classical example, quick sort: std::sort is twice as fast (with VC6) as C qsort
- External libraries for parallel
  - Autopack, automatic message packing
  - Zoltan, dynamic load balancing and partitioning

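The quick sort comparison rests on how the comparator reaches the sorting routine. The sketch below shows the two styles side by side; the helper names (`compareInts`, `LessInt`, `sortWithQsort`, `sortWithStdSort`) are made up for this example, and no timing is claimed here since that depends on compiler and platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// C-style comparator: qsort reaches it through a function pointer and
// handles the elements as untyped bytes.
int compareInts(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// Functor comparator: its type is a template argument of std::sort, so the
// compiler is free to inline the comparison.
struct LessInt {
    bool operator()(int a, int b) const { return a < b; }
};

inline std::vector<int> sortWithQsort(std::vector<int> v) {
    if (!v.empty()) std::qsort(v.data(), v.size(), sizeof(int), compareInts);
    return v;
}

inline std::vector<int> sortWithStdSort(std::vector<int> v) {
    std::sort(v.begin(), v.end(), LessInt());
    assert(std::is_sorted(v.begin(), v.end(), LessInt()));
    return v;
}
```

Both helpers return the same ordering. The difference is that the functor version gives the compiler the comparator's type at compile time, which is the "templates, functors, inlining" argument made above.
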
Mesh refinement
- Conformal or not (hanging nodes or mixed meshes)

    class myAOMD_RefCallback
    {
    public:
      int operator () (const meshEntity *);
      void callback (std::list<meshEntity *> &before,
                     std::list<meshEntity *> &after);
    };

- The Algorithm

    AOMD::RefUnref(theMesh, myAOMD_RefCallback);

Communications
- Messages are packed (autopack)

    class myAOMD_RoundOfComm
    {
    public:
      char * sendBuffer (const meshEntity *, int dest_proc, size_t &sizebuf) const;
      void recvBuffer (const meshEntity *, int src_proc, char *buf) const;
    };

- The Algorithm

    AOMD::roundOfComm(theMesh, myAOMD_RoundOfComm);

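To make the callback pattern concrete, here is a minimal sketch of a user class with the two member signatures shown on this slide, driven by a mock loop. Everything else is an assumption: the `meshEntity` struct is a stand-in, `ValueExchange` is a hypothetical user class, and `mockRoundOfComm` is not `AOMD::roundOfComm`. The mock hands each buffer back to the same process so the example runs without MPI. Who owns the buffer returned by `sendBuffer` is not stated in the source; here the driver releases it with `delete[]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

// Hypothetical stand-in; the real AOMD meshEntity is much richer.
struct meshEntity { int id; };

// User callback: exchanges one double attached to each boundary entity.
class ValueExchange {
public:
    ValueExchange(const std::map<int, double>& local,
                  std::map<int, double>& received)
        : local_(local), received_(received) {}

    // Pack the local value of the entity into a freshly allocated buffer.
    char* sendBuffer(const meshEntity* e, int /*dest_proc*/,
                     std::size_t& sizebuf) const {
        sizebuf = sizeof(double);
        char* buf = new char[sizebuf];
        const double v = local_.at(e->id);
        std::memcpy(buf, &v, sizebuf);
        return buf;
    }

    // Unpack the remote value and accumulate it on the entity.
    void recvBuffer(const meshEntity* e, int /*src_proc*/, char* buf) const {
        double v = 0.0;
        std::memcpy(&v, buf, sizeof v);
        received_[e->id] += v;
    }

private:
    const std::map<int, double>& local_;
    std::map<int, double>& received_;
};

// Mock driver (NOT AOMD::roundOfComm): loops every buffer back to the sender.
template <class Callback>
void mockRoundOfComm(const std::vector<meshEntity>& boundary,
                     const Callback& cb) {
    for (const meshEntity& e : boundary) {
        std::size_t size = 0;
        char* buf = cb.sendBuffer(&e, 0, size);
        assert(buf != nullptr && size > 0);
        cb.recvBuffer(&e, 0, buf);
        delete[] buf;  // assumption: the driver owns the buffer after sending
    }
}
```

The user code contains no communication calls at all: it only says what to pack for an entity and what to do with the bytes that arrive, which is the point of the standardized round of communication.
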
Load balancing
- Messages are packed (autopack)

    class myAOMD_LBCallback
    {
    public:
      char * sendBuffer (const meshEntity *, int dest_proc, size_t &sizebuf) const;
      void recvBuffer (const meshEntity *, int src_proc, char *buf) const;
    };

- The Algorithm

    AOMD::LB(theMesh, myAOMD_LBCallback);

Demonstration of Load Balancing Available on http://www.scorec.rpi.edu/AOMD
2-D Animation of Instability
- Linear DG elements, 30,000 to 800,000 dof
- Atwood number, A = 1/3
- 10 Fourier modes in “random” distribution
- Time for the bubble to reach the top of the window (y = 0.5): 5 sec
- This calculation: a = 0.06
- Experiments: a = 0.058 - 0.065
- Theory (Glimm et al.): a = 0.045 - 0.06

Refined 3-D Meshes for Rayleigh Taylor Instability
- Non-conforming hexahedron mesh
[Figures: light fluid and heavy fluid; meshes after 24, 72 and 104 steps of refinement]

Conclusions
- PAOMD advantages
  - Quite small piece of software, documented
  - Focused: mesh management only
  - Asks for minimum user knowledge about parallel issues
  - Efficient implementation
- Future work
  - Terascale computers
  - PAOMD concepts are theoretically scalable
  - Hardware heterogeneity: machine and network models have to be added in partitioners
64 Procs, 40 GDof