An Evaluation of Partitioners for Parallel SAMR Applications
Sumir Chandra & Manish Parashar
ECE Dept., Rutgers University
Submitted to Euro-Par 2001: European Conference on Parallel Computing

Introduction
AMR – Adaptive Mesh Refinement
AMR is used for solving PDEs in dynamic applications
Challenges involved:
- Dynamic resource allocation
- Dynamic data distribution and load balancing
- Communication and coordination
- Partitioning of the adaptive grid hierarchy
This work evaluates dynamic domain-based partitioning strategies with an application-centric approach

Motivation & Goal
Even for a single application, the most suitable partitioning technique depends on the input parameters and the application's run-time state
Goal: an application-centric characterization of partitioners as a function of the number of processors, problem size, and granularity
Enable run-time selection of partitioners based on input parameters and application state

Adaptive Mesh Refinement
Start with a base coarse grid with the minimum acceptable resolution
Tag regions in the domain requiring additional resolution, cluster the tagged cells, and fit finer grids over these clusters
Proceed recursively: regions on the finer grid requiring more resolution are similarly tagged, and even finer grids are overlaid on these regions
The resulting grid structure is a dynamic adaptive grid hierarchy

The Berger-Oliger Algorithm:

Recursive Procedure Integrate(level)
    If (RegridTime) Regrid
    Step Δt on all grids at level "level"
    If (level + 1 exists)
        Integrate(level + 1)
        Update(level, level + 1)
    End If
End Recursion

level = 0
Integrate(level)

Partitioning Adaptive Grid Hierarchies
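As an illustration, here is a minimal runnable sketch of the recursion above in Python. The Hierarchy class is a hypothetical stand-in for the real SAMR data structures (regridding and coarse-grid updates are stubbed out); only the control flow mirrors the Berger-Oliger pseudocode, and a factor-2 space-time refinement is assumed, matching the experimental setup later in the deck.

# Minimal runnable sketch of the Berger-Oliger recursion.
# Hierarchy is a hypothetical stand-in for real SAMR machinery.
class Hierarchy:
    def __init__(self, num_levels, regrid_interval=4):
        self.num_levels = num_levels
        self.regrid_interval = regrid_interval
        self.steps_taken = [0] * num_levels   # per-level step counters

    def time_to_regrid(self, level):
        return self.steps_taken[level] % self.regrid_interval == 0

    def regrid(self, level):
        pass  # retag cells, recluster, refit finer grids (omitted)

    def step(self, level, dt):
        self.steps_taken[level] += 1          # advance all grids at this level by dt

    def update(self, coarse, fine):
        pass  # inject the fine solution back into the coarse grid (omitted)

REF = 2  # assumed space-time refinement factor between levels

def integrate(h, level, dt):
    if h.time_to_regrid(level):
        h.regrid(level)
    h.step(level, dt)                         # "Step Δt on all grids at level"
    if level + 1 < h.num_levels:
        for _ in range(REF):                  # finer level takes REF smaller substeps
            integrate(h, level + 1, dt / REF)
        h.update(level, level + 1)

h = Hierarchy(num_levels=3)
for _ in range(10):                           # ten coarse time steps
    integrate(h, level=0, dt=1.0)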

SAMR 2-D Grid Hierarchy
Figure: snapshots of the 2-D grid hierarchy at time steps 0, 40, 80, 120, 160, and 182, with refinement levels 0 through 4 shown in the legend

Partitioning Techniques
Static or dynamic techniques
Geometric or non-geometric techniques
Dynamic partitioning – global or local approaches
Partitioners for SAMR grid applications:
- Patch-based
- Domain-based
- Hybrid

Partitioners Evaluated
SFC: Space-Filling Curve based partitioning
G-MISP: Geometric Multi-level Inverse Space-filling curve Partitioning
G-MISP+SP: Geometric Multi-level Inverse Space-filling curve Partitioning with Sequence Partitioning
pBD-ISP: p-way Binary Dissection Inverse Space-filling curve Partitioning
SP-ISP: "Pure" Sequence Partitioning with Inverse Space-filling curve Partitioning
WD: Wavefront Diffusion based on global workload

SFC
Recursive linear representation of the multi-dimensional grid hierarchy using space-filling mappings (N-dimensional to 1-D mapping)
Computational load determined by segment length and recursion level
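A minimal sketch of the idea, assuming a Morton (Z-order) curve for simplicity (the actual partitioner may use a different space-filling curve): each block is mapped to a 1-D key by bit interleaving, the blocks are sorted along the curve, and the resulting list is cut into contiguous, roughly load-balanced segments. Because the curve preserves spatial locality, consecutive segments map to compact regions of the domain.

# Sketch of SFC-style domain-based partitioning using a Morton (Z-order) key.
def morton2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a single 1-D Morton index."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def sfc_partition(blocks, loads, nprocs):
    """blocks: list of (x, y) block coordinates; loads: workload per block."""
    order = sorted(range(len(blocks)), key=lambda i: morton2d(*blocks[i]))
    target = sum(loads) / nprocs              # ideal load per processor
    parts = [[] for _ in range(nprocs)]
    current, acc = 0, 0.0
    for i in order:
        if acc >= target and current < nprocs - 1:
            current, acc = current + 1, 0.0   # cut the curve: start next segment
        parts[current].append(blocks[i])
        acc += loads[i]
    return parts

# 4x4 grid of blocks, heavier (made-up) load where the mesh is refined
blocks = [(x, y) for x in range(4) for y in range(4)]
loads = [4.0 if x < 2 and y < 2 else 1.0 for (x, y) in blocks]
print(sfc_partition(blocks, loads, nprocs=4))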

G-MISP & G-MISP+SP
G-MISP:
- Multi-level algorithm that views the matrix of workloads from the SAMR grid hierarchy as a one-vertex graph, refined recursively
- Gains speed at the expense of load balance
G-MISP+SP:
- "Smarter" variant of G-MISP – uses sequence partitioning to assign consecutive portions of the one-dimensional list to processors
- Load balance improves, but the scheme is computationally more expensive (see the sketch below)
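A minimal sketch of the sequence-partitioning step, assuming a simple greedy cut (the actual scheme may place cuts more carefully): the 1-D workload list produced by the inverse space-filling curve ordering is divided into p consecutive pieces with sums as even as the greedy rule allows.

# Greedy sequence partitioning of an already curve-ordered 1-D workload list.
def sequence_partition(loads, p):
    """Cut `loads` into p consecutive pieces with roughly equal sums."""
    target = sum(loads) / p
    pieces, piece, acc = [], [], 0.0
    for w in loads:
        piece.append(w)
        acc += w
        if acc >= target and len(pieces) < p - 1:
            pieces.append(piece)              # close this processor's piece
            piece, acc = [], 0.0
    pieces.append(piece)                      # last processor takes the remainder
    return pieces

print(sequence_partition([1, 1, 4, 4, 1, 1, 2, 2], p=4))
# -> [[1, 1, 4], [4], [1, 1, 2], [2]]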

pBD-ISP
Generalization of binary dissection – the domain is partitioned into p partitions
Each split divides the load as evenly as possible, taking into account the number of processors assigned to each half
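A minimal sketch of p-way binary dissection over the 1-D curve ordering, with hypothetical weights: each split divides the workload in proportion to the number of processors assigned to each half, so odd processor counts are handled naturally.

# Recursive p-way binary dissection of a curve-ordered workload list.
def binary_dissect(loads, p):
    """Return p consecutive pieces of `loads` produced by recursive bisection."""
    if p == 1:
        return [loads]
    p_left = p // 2                            # processors for the left half
    frac = p_left / p                          # share of the load it should get
    total, acc, cut = sum(loads), 0.0, 0
    for i, w in enumerate(loads):
        acc += w
        cut = i + 1
        if acc >= frac * total:                # split as evenly as possible
            break
    cut = max(p_left, min(cut, len(loads) - (p - p_left)))  # keep halves non-empty
    return binary_dissect(loads[:cut], p_left) + binary_dissect(loads[cut:], p - p_left)

print(binary_dissect([1, 1, 4, 4, 1, 1, 2, 2], p=4))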

SP-ISP
Domain sub-divided into p*b equally sized blocks
Dual-level algorithm – separate parameter settings for each level
Fine granularity scheme: good load balance, but increased overhead and higher communication and computational costs
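A minimal sketch of the dual-level idea, with made-up block loads: the domain is first cut into p*b equally sized blocks (b per processor on average), the blocks are ordered along the inverse space-filling curve, and consecutive runs of blocks are then handed to processors.

# Assign p*b curve-ordered blocks to p processors by consecutive runs.
def sp_isp(block_loads, p):
    """block_loads: per-block work, already in curve order (length p*b)."""
    target = sum(block_loads) / p
    assignment, proc, acc = [], 0, 0.0
    for w in block_loads:
        if acc >= target and proc < p - 1:
            proc, acc = proc + 1, 0.0        # next processor takes over
        assignment.append(proc)
        acc += w
    return assignment

p, b = 4, 4                                  # 4 processors, 4 blocks each
loads = [3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3]
print(sp_isp(loads, p))                      # processor id per block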

WD
Part of the ParMetis suite; based on global workload
Used for repartitioning graphs with scattered refinements
Results in fine-grain partitionings with jagged boundaries, increasing communication costs and overheads
Metis integration proved extremely expensive; the dedicated SAMR partitioners performed much better
Two extra steps are needed to use Metis in our interface: a Metis graph is generated from the grid before partitioning, and clustering is used to regenerate grid blocks from the graph partitions after partitioning
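Wavefront Diffusion itself is implemented inside ParMetis; the sketch below only illustrates the underlying diffusion idea on a processor graph (the graph, diffusion rate, and loads are illustrative assumptions): overloaded processors repeatedly shed work to less loaded neighbors until the loads level out.

# Generic diffusion-style load balancing on a processor graph (illustrative
# only; not ParMetis's actual Wavefront Diffusion implementation).
def diffuse(loads, neighbors, alpha=0.25, sweeps=50):
    """Repeatedly move load along graph edges toward less loaded neighbors."""
    loads = loads[:]
    for _ in range(sweeps):
        flow = [0.0] * len(loads)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                flow[i] += alpha * (loads[j] - loads[i])  # pull from heavier nbrs
        loads = [l + f for l, f in zip(loads, flow)]
    return loads

# 4 processors in a ring; processor 0 starts heavily overloaded
neighbors = [[1, 3], [0, 2], [1, 3], [2, 0]]
print(diffuse([10.0, 2.0, 2.0, 2.0], neighbors))  # converges toward 4.0 each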

Experimental Setup
Application – RM3D:
- 3-D "real world" compressible turbulence application solving the Richtmyer-Meshkov instability
- A fingering instability which occurs at a material interface accelerated by a shock wave
Machine – NPACI IBM SP2 Blue Horizon at SDSC:
- Teraflop-scale Power3-based SMP cluster
- 1152 processors and 512 GB of main memory
- AIX operating system
- Peak bi-directional data transfer rate of approx. 115 MBps

Experimental Setup (contd.)
Base coarse grid – 128 * 32 * 32
3 levels of factor-2 space-time refinement
Application run for 150 coarse-level time steps
Experiments varied:
- Partitioner (from the set of evaluated partitioners)
- Number of processors (16 – 128)
- Granularity, i.e. the atomic unit (2*2*2 – 8*8*8)
Metrics used – total run-time, maximum load imbalance, AMR efficiency
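The slides do not define the imbalance metric precisely; one common convention, assumed here, reports the percentage by which the most loaded processor exceeds the average load.

# Assumed definition of maximum load imbalance (not given on the slides).
def max_load_imbalance(loads):
    """Percentage excess of the heaviest processor over the mean load."""
    avg = sum(loads) / len(loads)
    return 100.0 * (max(loads) - avg) / avg

print(max_load_imbalance([12.0, 10.0, 9.0, 9.0]))  # 20.0 (% over the mean)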

Experimental Results
Table: total run-time (s), maximum load imbalance (%), and AMR efficiency (%) for the SFC, G-MISP, G-MISP+SP, pBD-ISP, and SP-ISP partitioners, for the RM3D application on 16 processors with granularity 2

Run-times

Max. Load Imbalance

AMR Efficiency

Experimental Evaluation
RM3D requires rapid refinement and efficient redistribution
pBD-ISP, G-MISP+SP, and SFC are best suited for RM3D – fast partitioners with low imbalance that maintain good communication patterns
pBD-ISP is the fastest but yields only average load imbalance
G-MISP+SP and SFC generate the lowest imbalance but are relatively slower
The evaluated partitioning techniques scale reasonably well

Evaluation (contd.)
Coarse granularity produces high load imbalance
Fine granularity leads to greater synchronization and coordination overheads and higher execution times
Choosing the partitioning granularity therefore requires a trade-off between execution speed and load imbalance
For the RM3D application, a granularity of 4 gives the lowest execution time with acceptable load imbalance

Conclusions
Experimental evaluation of dynamic domain-based partitioning and load-balancing techniques using the RM3D compressible turbulence application
Demonstrated the effect of the choice of partitioner and of granularity on execution time
Formulated an application-centric characterization of the partitioners as a function of the number of processors, problem size, and partitioning granularity