Predictive Load Balancing Using Mesh Adjacencies for Mesh Adaptation
Cameron Smith, Onkar Sahni, Mark S. Shephard
Scientific Computation Research Center, Rensselaer Polytechnic Institute
SIAM CSE 2013, MS105: Frameworks, Algorithms and Scalable Technologies for Mathematics on Next-generation Computers
February 26, 2013

Outline
- Simulation Based Engineering Workflow
- Parallel Unstructured Mesh Infrastructure (PUMI)
- ParMA Predictive Load Balancing
  - Estimating change in mesh size
  - Part merging
  - Part splitting
- Multi-criteria Partition Improvement
- Closing Remarks

Simulation Based Engineering Workflow
Problem Definition -> Mesh Generation -> PDE Analysis -> Adaptation -> Post Processing

Parallel Unstructured Mesh Infrastructure (PUMI)
- Geometric model: analysis domain
- Partition model: distribution of mesh across computing resources
- Mesh: 0-3D topological entities and adjacencies
- Fields: distribution of solution over mesh
- Parallel Control: communication utilities

PUMI Mesh Representation
- Mesh entities: vertex (0D), edge (1D), face (2D), or region (3D)
- Adjacencies: how the mesh entities connect to each other
- Complete representation: store sufficient entities and adjacencies to get any adjacency in O(1) time
- Partition model: topological partition information and its relationship to mesh entities (classification)
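The idea of a complete representation can be sketched in a few lines: store both downward and upward adjacencies so that any adjacency query is a constant-time lookup. The entity names and the two-triangle mesh below are purely illustrative, not PUMI's API.

```python
# Two triangles f0 and f1 sharing edge e2, with downward adjacency
# (face -> bounding edges) stored explicitly.
downward = {
    "f0": ["e0", "e1", "e2"],
    "f1": ["e2", "e3", "e4"],
}

# Build the upward adjacency (edge -> faces) once, when the mesh is loaded.
upward = {}
for face, edges in downward.items():
    for e in edges:
        upward.setdefault(e, []).append(face)

def faces_adjacent_to_edge(edge):
    """O(1) upward adjacency query: which faces share this edge?"""
    return upward.get(edge, [])
```

With both directions stored, a query such as `faces_adjacent_to_edge("e2")` returns both triangles without any graph traversal, which is what makes adjacency-driven partitioning cheap.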

Partitioning using Mesh Adjacencies (ParMA)
Mesh and partition model adjacencies represent application data more completely than standard graph-partitioning models: all mesh entities can be considered, while graph-partitioning models use only a subset of mesh adjacency information, and any adjacency can be obtained in O(1) time (assuming use of a complete mesh adjacency structure).
Advantages:
- Avoids graph construction (given a complete representation)
- Directly accounts for multiple entity types, which is important for the solve process, typically the most computationally expensive step
- Easy to use with diffusive procedures
Disadvantage:
- Lack of well-developed algorithms for more global parallel partitioning operations directly from mesh adjacencies

Predictive Load Balancing for Mesh Adaptation
Parallel unstructured mesh adaptation typically generates parts with 400% or more imbalance on non-trivial geometries due to local coarsening and refinement. Refining and then repartitioning can exceed available memory on some processes, even if the system's total memory is sufficient.
Goal: a fast and scalable redistribution procedure to prevent memory exhaustion.
Histogram of element imbalance in a 1024-part adapted mesh on the Onera M6 wing when no load balancing is applied prior to adaptation: ~120 parts with ~30% of the average load, ~20 parts with >200% imbalance, and a peak imbalance of ~430%.
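The imbalance percentages quoted throughout the talk can be read as each part's load relative to the mean, so "200% imbalance" means twice the average element count. A hypothetical helper making this metric concrete:

```python
def imbalance(part_loads):
    """Return per-part imbalance and the peak, as fractions of the mean load."""
    avg = sum(part_loads) / len(part_loads)
    per_part = [load / avg for load in part_loads]
    return per_part, max(per_part)

# Four parts; the heaviest carries 50 / 20 = 2.5x the average load.
per_part, peak = imbalance([10, 10, 10, 50])
```

Under this reading, the Onera M6 histogram's "peak imbalance of ~430%" means the heaviest part holds roughly 4.3x the average element count.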

Simulation Based Engineering Workflow
PDE Analysis -> Redistribute -> Adapt Mesh -> Application Specific Load Balancing

Graph-Based Predictive Load Balancing
Procedure:
- Estimate the change in mesh size and set element weights quantifying the change
- Run Zoltan graph/hypergraph-based partitioning on the weighted mesh
- Adapt the mesh
- Run Zoltan graph/hypergraph-based partitioning and/or a partition improvement procedure to attain an optimal partition
Observations:
- Parallel construction and balancing of a graph with small cuts takes reasonable time
- Re-partitioning such that adaptation does not exceed per-process memory limits should be cheaper than a full re-partition
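The element-weighting step above can be sketched under an isotropic size-field assumption: an element whose current size h will be adapted to h_new is expected to yield roughly (h / h_new)^d elements in d dimensions. The function below is an illustrative assumption, not Zoltan's weighting API.

```python
def predicted_weight(h_current, h_desired, dim=3):
    """Predicted number of elements this element becomes after adaptation."""
    return (h_current / h_desired) ** dim

# A 3D element refined to half its size yields ~8 elements,
# so it receives 8x the weight before repartitioning.
w = predicted_weight(1.0, 0.5)
```

Partitioning the weighted mesh then balances the *predicted* post-adaptation load rather than the current element counts.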

ParMA: Predictive Load Balancing (PLB)
Mesh adjacency-based procedure:
- Estimate the change in mesh size
- Merge parts that will be coarsened to create some empty parts
- Split parts with substantial refinement into the empty parts to remove imbalance spikes in the refined mesh
- Refine/coarsen the mesh
- Apply ParMA's multi-criteria partition improvement

ParMA:PLB – Estimating Mesh Size
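The per-part estimate driving the merge and split decisions on the next two slides can be sketched as the sum of predicted element counts over a part's elements. The size-field ratio below is the same illustrative assumption as before, not ParMA's actual estimator.

```python
def predicted_part_load(element_sizes, desired_sizes, dim=3):
    """Predicted element count of a part after adaptation, given each
    element's current size and the size field's desired size."""
    return sum((h / h_new) ** dim
               for h, h_new in zip(element_sizes, desired_sizes))

# Four 3D elements: two to be coarsened to 2x size, two refined to half size.
load = predicted_part_load([1, 1, 1, 1], [2, 2, 0.5, 0.5])
```

A part dominated by coarsening ends up with a small predicted load (a merge candidate), while one dominated by refinement gets a large one (a split candidate).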

ParMA:PLB – Part Merging
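The merging step can be sketched as a greedy pass: migrate a lightly loaded part's elements into a neighbor whenever their combined predicted load stays under the target, freeing the donor part to receive a split later. This greedy rule is an assumption for illustration, not ParMA's actual algorithm.

```python
def plan_merges(loads, neighbors, target):
    """Return {donor_part: receiver_part} merge decisions."""
    merges = {}
    for part in sorted(loads, key=loads.get):  # visit lightest parts first
        if part in merges or part in merges.values():
            continue  # already involved in a merge
        for nbr in neighbors[part]:
            if nbr in merges or nbr in merges.values():
                continue
            if loads[part] + loads[nbr] <= target:
                merges[part] = nbr  # part's elements migrate to nbr
                break
    return merges

# Parts 0 and 1 will mostly be coarsened; 2 and 3 will be heavily refined.
loads = {0: 2, 1: 3, 2: 9, 3: 8}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
plan = plan_merges(loads, neighbors, target=6)
```

Here part 0 merges into part 1, leaving part 0 empty and available for the splitting step.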

ParMA:PLB – Part Splitting
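The splitting step can be sketched as assigning parts emptied by merging to the heaviest parts until each falls back under the target load. The ceil-based split count is an illustrative assumption.

```python
import math

def plan_splits(loads, target, empty_parts):
    """Return {heavy_part: [empty parts it splits into]}."""
    free = list(empty_parts)
    plan = {}
    for part, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        extra_needed = math.ceil(load / target) - 1  # extra parts required
        taken, free = free[:extra_needed], free[extra_needed:]
        if taken:
            plan[part] = taken
    return plan

# Part 2's predicted load is 3x the target, so it claims two freed parts;
# part 3 is already under the target and claims none.
plan = plan_splits({2: 18, 3: 5}, target=6, empty_parts=[0, 1])
```

Splitting into already-allocated but empty parts avoids changing the part count, so the analysis can resume on the same number of processes.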

ParMA:PLB – Status: Initial Testing and Ongoing Developments
- Predictive load balancing tests run on 1024 parts on the RPI Blue Gene/Q (64 nodes, 16 ranks/node)
- Analytic mesh adaptation size field: 17e6 regions to 214e6 regions
- Prevents exhaustion of process memory; reduces peak predicted imbalance from 200% to 150%
- 0.4 seconds to make merging and splitting decisions
- Optimized migration procedure for part merging, with simpler logic than the current 'generalized' migration procedure

ParMA: Multi-Criteria Partition Improvement
Parallel simulation requires that the mesh be distributed with equal work-load and minimal inter-part communication.
Observations on graph-based dynamic balancing:
- Parallel construction and balancing of a graph with small cuts takes reasonable time
- Graph/hypergraph partitioners are powerful for unstructured meshes; however, they use one order (as in 0, 1, 2, 3) of mesh entity as the graph nodes, so the balance of the other mesh entities may not be optimal
- Accounting for multiple criteria and/or multiple interactions is not obvious
- Hypergraphs allow edges to connect more than two vertices and have been used to help account for migration communication costs
- Schloegel and Karypis (2002) discuss an effective optimization method for three or fewer constraints

ParMA: Multi-Criteria Partition Improvement
- Improve scaling of applications by reducing imbalances through the exchange of mesh regions between neighboring parts
- Current algorithm focuses on improved scalability of the solve by accounting for the balance of multiple entity types
- Imbalance is limited to a small number of heavily loaded parts, referred to as spikes, which limit the scalability of applications
- Application-defined priority list of entity types, such that the imbalance of high-priority types is not increased when balancing lower-priority types

ParMA: Multi-Criteria Partition Improvement
Input:
- Priority list of mesh entity types to be balanced (region, face, edge, vertex)
- Partitioned mesh with a complete representation and communication, computation, and migration weights for each entity
Algorithm: iterate from high to low priority over groups separated by '>', and from low to high dimension over entity types within a group (separated by '='); for each entity type:
- Compute migration schedule (collective)
- Select regions for migration (embarrassingly parallel)
- Migrate selected regions (collective)
Example: for the user input "Rgn>Face=Edge>Vtx":
- Step 1: improve balance for mesh regions
- Step 2.1: improve balance for mesh edges
- Step 2.2: improve balance for mesh faces
- Step 3: improve balance for mesh vertices
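The ordering rule on this slide can be sketched as a small parser for the priority string: groups split on '>' run high-to-low priority, and within an '=' group entity types run low-to-high dimension. The string format follows the slide's example; the parser itself is an illustrative sketch.

```python
DIM = {"Vtx": 0, "Edge": 1, "Face": 2, "Rgn": 3}

def balancing_order(priority):
    """Expand a priority string into the sequence of entity types to balance."""
    order = []
    for group in priority.split(">"):          # high -> low priority groups
        order.extend(sorted(group.split("="),  # low -> high dimension in group
                            key=DIM.get))
    return order

order = balancing_order("Rgn>Face=Edge>Vtx")
```

This reproduces the slide's steps: regions, then edges, then faces, then vertices.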

ParMA: Multi-Criteria Partition Improvement – Results
- (Zhou) 133M region mesh on 16k parts: entity imbalances and run time measured on Jaguar (Cray XT5)
- (Rasquin) PHASTA CFD on 288k cores of JuGene BG/P with a 1B region mesh: ParMA with application input "Vtx>Rgn" improves strong scaling from 0.88 to 0.95

ParMA: Multi-Criteria Partition Improvement – Status
- Testing weight-based imbalance for predictive load balancing
- Partition improvement for PHASTA meshes with billions of elements and more than half a million parts
- Applying ParMA merging and splitting techniques to reduce vertex imbalance spikes, with careful consideration of part surface area changes
- Effect of disconnected parts: sacrifice communication for improved computation balance
- Diffusion across partition model edges: account for load and communication balance changes when entities are classified on partition model edges and selected for migration

Closing Remarks
- Successfully leveraging mesh adjacency and partition model information: making quick decisions, with demonstrated partition improvement
- Future ParMA developments: supporting two-level partitioning by finding an efficient combination of procedures to rebalance across compute nodes and then within each node
- See Dan Ibanez's MS120 talk at 10 am in Hancock (Lobby Level) for details on the MPI + Pthreads approach for PUMI
More Information Online: PUMI: ParMA: SCOREC: FASTMath:

Thank You
For More Information Contact: