SciDAC Collaborations on Partitioning and Dynamic Load Balancing
Lie-Quan Lee (SLAC), with E. Boman, K. Devine, L. A. Riesen (Sandia) and U. Catalyurek (Ohio St. U.)
COMPASS All-Hands Meeting, September 2007

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Partitioning and Load Balancing

Assignment of application data to processors for parallel computation. Applied to elements, particles, matrix rows, grid points, etc.
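To make the balance goal concrete, here is a minimal illustrative sketch (not from the slides, and not how a production partitioner works): a greedy "heaviest object to lightest part" assignment of weighted objects to parts. The weights and part count below are made up.

```cpp
// Minimal sketch: greedy assignment of weighted objects (elements,
// particles, matrix rows, ...) to parts so that part weights stay balanced.
// Illustrative only; real partitioners also account for communication.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

int main() {
    std::vector<double> weights = {4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0, 1.0};
    const int num_parts = 3;

    // Sort objects by descending weight (longest-processing-time heuristic).
    std::vector<int> order(weights.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return weights[a] > weights[b]; });

    // Min-heap of (current part weight, part id): always fill the lightest part.
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> parts;
    for (int p = 0; p < num_parts; ++p) parts.push({0.0, p});

    std::vector<int> part_of(weights.size());
    for (int obj : order) {
        auto [w, p] = parts.top();
        parts.pop();
        part_of[obj] = p;
        parts.push({w + weights[obj], p});
    }
    for (size_t i = 0; i < part_of.size(); ++i)
        std::printf("object %zu (weight %.1f) -> part %d\n",
                    i, weights[i], part_of[i]);
}
```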

Collaborations and Goals

Collaborations within SciDAC:
– CSCAPES Institute: Combinatorial Scientific Computing And Petascale Simulations (A. Pothen, Old Dominion U.)
– ITAPS Center for Enabling Technology: Interoperable Tools for Advanced Petascale Simulations (L. Diachin, LLNL)

Goals:
– Deployment in applications of scalable partitioners on existing architectures.
– Development of specialized partitioners for particle-in-cell methods.
– Research into partitioning approaches for state-of-the-art architectures (e.g., multicore, 100K processors).

Partitioning in Applications

Static partitioning: the partition is computed once and used throughout the simulation (Initialize Application → Partition Data → Distribute Data → Compute Solutions → Output & End). Goals: balance workloads; keep communication costs low.

Dynamic repartitioning (a.k.a. load balancing): the partition is recomputed as the application changes dynamically (Initialize Application → Partition Data → Redistribute Data → Compute Solutions & Adapt → repeat → Output & End). Goals: balance workloads; keep communication costs low; keep redistribution costs low.
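The dynamic case is essentially a loop around the static one. A hedged sketch of such a driver loop, where all names (Mesh, compute_step, measure_imbalance, repartition, migrate) are hypothetical stand-ins for application code and a partitioning toolkit:

```cpp
// Sketch of a dynamic-repartitioning driver loop. Every name below is a
// hypothetical stand-in for application code or a toolkit such as Zoltan.
struct Mesh {};  // placeholder application state

double measure_imbalance(const Mesh&) { return 1.0; }  // max load / avg load (stub)
void compute_step(Mesh&) {}  // application work; may adapt the mesh
void repartition(Mesh&) {}   // compute a new part assignment
void migrate(Mesh&) {}       // redistribute data to match it

void run(Mesh& mesh, int num_steps, double imbalance_tol) {
    repartition(mesh);  // static partition up front
    migrate(mesh);
    for (int step = 0; step < num_steps; ++step) {
        compute_step(mesh);
        // Repartition only when imbalance exceeds the tolerance, since
        // redistribution itself has a cost that must be amortized.
        if (measure_imbalance(mesh) > imbalance_tol) {
            repartition(mesh);
            migrate(mesh);
        }
    }
}

int main() {
    Mesh mesh;
    run(mesh, /*num_steps=*/100, /*imbalance_tol=*/1.1);
}
```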

The Zoltan Toolkit: Suite of Partitioning Tools

Geometric (coordinate-based) partitioners:
– Recursive Coordinate Bisection (Berger, Bokhari)
– Recursive Inertial Bisection (Taylor, Nour-Omid)
– Space-Filling Curve Partitioning (Warren & Salmon, et al.)
– Refinement-tree Partitioning (Mitchell)

Hypergraph and graph (connectivity-based) partitioners:
– Zoltan Graph Partitioning; interfaces to ParMETIS (U. Minnesota) and Jostle (U. Greenwich)
– Hypergraph Partitioning and Repartitioning; interface to PaToH (Catalyurek & Aykanat)
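As a concrete illustration of how an application drives Zoltan, below is a heavily abridged sketch of the Zoltan C API called from C++, following the query-function pattern described in the Zoltan User's Guide. Exact signatures should be checked against the release you use; the callbacks serve a made-up four-point example, and error checking is omitted.

```cpp
// Abridged sketch of driving Zoltan's RCB partitioner through its C API.
// Check exact signatures against your Zoltan release; error handling omitted.
#include <mpi.h>
#include <zoltan.h>

// Toy application data: four points on a line.
static double coords[4][3] = {{0,0,0}, {1,0,0}, {2,0,0}, {3,0,0}};

static int num_obj(void*, int* ierr) { *ierr = ZOLTAN_OK; return 4; }

static void obj_list(void*, int num_gid, int num_lid, ZOLTAN_ID_PTR gids,
                     ZOLTAN_ID_PTR lids, int, float*, int* ierr) {
    for (int i = 0; i < 4; ++i) { gids[i * num_gid] = i; lids[i * num_lid] = i; }
    *ierr = ZOLTAN_OK;
}

static int num_geom(void*, int* ierr) { *ierr = ZOLTAN_OK; return 3; }

static void geom_multi(void*, int, int, int nobj, ZOLTAN_ID_PTR,
                       ZOLTAN_ID_PTR lids, int ndim, double* vec, int* ierr) {
    for (int i = 0; i < nobj; ++i)
        for (int d = 0; d < ndim; ++d) vec[i * ndim + d] = coords[lids[i]][d];
    *ierr = ZOLTAN_OK;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    float ver;
    Zoltan_Initialize(argc, argv, &ver);
    Zoltan_Struct* zz = Zoltan_Create(MPI_COMM_WORLD);
    Zoltan_Set_Param(zz, "LB_METHOD", "RCB");  // or "RIB", "HSFC", "HYPERGRAPH", ...
    Zoltan_Set_Num_Obj_Fn(zz, num_obj, nullptr);
    Zoltan_Set_Obj_List_Fn(zz, obj_list, nullptr);
    Zoltan_Set_Num_Geom_Fn(zz, num_geom, nullptr);
    Zoltan_Set_Geom_Multi_Fn(zz, geom_multi, nullptr);

    int changes, ngid, nlid, nimp, nexp;
    ZOLTAN_ID_PTR imp_g, imp_l, exp_g, exp_l;
    int *imp_p, *imp_to, *exp_p, *exp_to;
    Zoltan_LB_Partition(zz, &changes, &ngid, &nlid,
                        &nimp, &imp_g, &imp_l, &imp_p, &imp_to,
                        &nexp, &exp_g, &exp_l, &exp_p, &exp_to);
    // The export lists describe which local objects move to which part;
    // the application migrates its data accordingly.
    Zoltan_LB_Free_Part(&imp_g, &imp_l, &imp_p, &imp_to);
    Zoltan_LB_Free_Part(&exp_g, &exp_l, &exp_p, &exp_to);
    Zoltan_Destroy(&zz);
    MPI_Finalize();
}
```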

Geometric Partitioners

Examples: Recursive Coordinate Bisection (RCB) and Space-Filling Curve Partitioning (SFC). Regions of geometric space (and all objects in them) are assigned to processors. To maintain load balance, regions are computed so that they contain equal object weight. Only object coordinates and weights are needed to compute the partition.
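A minimal serial sketch of the RCB idea (illustrative, not Zoltan's implementation): recursively split the longest bounding-box axis at the weighted median until the requested number of parts is reached. The coordinates and weights below are made up.

```cpp
// Minimal serial sketch of Recursive Coordinate Bisection (RCB).
// No parallelism or cut-smoothing; the split is the weighted median
// along the longest axis of the bounding box.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Obj { double x[3]; double w; int part = 0; };

void rcb(std::vector<Obj*>& objs, int first_part, int num_parts) {
    if (num_parts <= 1 || objs.size() <= 1) {
        for (Obj* o : objs) o->part = first_part;
        return;
    }
    // Pick the longest axis of the bounding box.
    double lo[3] = {1e300, 1e300, 1e300}, hi[3] = {-1e300, -1e300, -1e300};
    for (Obj* o : objs)
        for (int d = 0; d < 3; ++d) {
            lo[d] = std::min(lo[d], o->x[d]);
            hi[d] = std::max(hi[d], o->x[d]);
        }
    int axis = 0;
    for (int d = 1; d < 3; ++d)
        if (hi[d] - lo[d] > hi[axis] - lo[axis]) axis = d;

    // Sort along that axis and cut at the weighted median, so each side
    // gets weight proportional to the number of parts assigned to it.
    std::sort(objs.begin(), objs.end(),
              [axis](Obj* a, Obj* b) { return a->x[axis] < b->x[axis]; });
    int left_parts = num_parts / 2;
    double total = 0, acc = 0;
    for (Obj* o : objs) total += o->w;
    double target = total * left_parts / num_parts;
    size_t cut = 0;
    while (cut < objs.size() && acc + objs[cut]->w <= target) acc += objs[cut++]->w;

    std::vector<Obj*> left(objs.begin(), objs.begin() + cut);
    std::vector<Obj*> right(objs.begin() + cut, objs.end());
    rcb(left, first_part, left_parts);
    rcb(right, first_part + left_parts, num_parts - left_parts);
}

int main() {
    std::vector<Obj> data = {{{0,0,0},1}, {{1,0,0},1}, {{2,1,0},2},
                             {{3,1,0},1}, {{4,2,0},1}, {{5,2,0},2}};
    std::vector<Obj*> ptrs;
    for (Obj& o : data) ptrs.push_back(&o);
    rcb(ptrs, 0, 4);
    for (const Obj& o : data)
        std::printf("(%.0f,%.0f) w=%.0f -> part %d\n", o.x[0], o.x[1], o.w, o.part);
}
```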

Geometric Partitioners: Advantages and Disadvantages

Advantages:
– Conceptually simple; fast and inexpensive.
– All processors can inexpensively know the entire partition (e.g., for global search in contact detection).
– No connectivity info needed (e.g., particle methods).
– Good on specialized geometries.

Disadvantages:
– No explicit control of communication costs.
– Mediocre partition quality.
– Coordinate information is required.

Example: SLAC's 55-cell linear accelerator with couplers: a one-dimensional RCB partition reduced runtime by up to 68% on a 512-processor IBM SP3. (Wolf, Ko)

Connectivity-based Partitioners

Graph partitioning (through interfaces to ParMETIS) and hypergraph partitioning and repartitioning. Relationships between data are represented as hypergraph/graph edges. To maintain load balance, partitioners assign equal object weight to processors. To reduce application communication costs, partitioners minimize the weight of edges crossing subdomain boundaries.
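The quantity being minimized is worth making precise. For hypergraphs, the standard connectivity (lambda − 1) metric charges each hyperedge once for every extra part it touches, which models communication volume exactly for operations like sparse matrix-vector products. A small sketch of the metric, with made-up edges and part assignments:

```cpp
// Sketch of the hypergraph connectivity (lambda-1) cut metric: a hyperedge
// of weight w touching lambda parts contributes w * (lambda - 1). This
// models communication volume more faithfully than the graph edge cut.
#include <cstdio>
#include <set>
#include <vector>

int main() {
    // Made-up example: 6 vertices assigned to 3 parts.
    std::vector<int> part_of = {0, 0, 1, 1, 2, 2};
    // Hyperedges (vertex lists), unit weights; e.g., rows of a sparse matrix.
    std::vector<std::vector<int>> hyperedges = {
        {0, 1},        // internal to part 0: contributes 0
        {1, 2, 3},     // touches parts {0,1}: contributes 1
        {0, 2, 4, 5},  // touches parts {0,1,2}: contributes 2
    };

    long long cut = 0;
    for (const auto& e : hyperedges) {
        std::set<int> parts;
        for (int v : e) parts.insert(part_of[v]);
        cut += (long long)parts.size() - 1;  // (lambda - 1) per unit weight
    }
    std::printf("connectivity cut (comm volume model) = %lld\n", cut);  // 3
}
```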

Connectivity-based Partitioners: Advantages and Disadvantages

Advantages:
– Highly successful model for mesh-based PDE problems.
– Explicit control of communication volume gives higher partition quality than geometric methods.
– Excellent software available.
  Serial: Chaco (SNL), PaToH (Ohio St. U.), Jostle (U. Greenwich), hMETIS (U. Minn.), METIS (U. Minn.), Mondriaan (Utrecht U.), Party (U. Paderborn), Scotch (U. Bordeaux).
  Parallel: Zoltan (SNL), ParMETIS (U. Minn.), PJostle (U. Greenwich).

Disadvantages:
– More expensive than geometric methods.
– Connectivity info needed; often a limitation for particle methods.

Performance Results

Experiments on Sandia's Thunderbird cluster:
– Dual 3.6 GHz Intel EM64T processors with 6 GB RAM.
– InfiniBand network.

Compare RCB, SFC, hypergraph, and graph (ParMETIS) partitioners. Measure:
– Amount of communication induced by the partition.
– Partitioning time.

Datasets include finite elements, circuit simulations, DNA electrophoresis, sensor placement, etc. Only the SLAC datasets are shown here; results for the other datasets are available on request.

Comparing Partitioners: SLAC Test Data

– SLAC LCLS Radio Frequency Gun: 6.0M-element mesh; 11.7M graph edges.
– SLAC Linear Accelerator: 2.9M-element mesh; 5.7M graph edges.

Communication Volume: Lower is Better

(Charts comparing RCB and SFC (geometric) against graph and hypergraph (connectivity-based) partitioners on the SLAC 6.0M LCLS and SLAC 2.9M Linear Accelerator meshes; number of parts = number of processors.)

Partitioning Time: Lower is Better

(Charts of partitioning time for the same four partitioners on the SLAC 6.0M LCLS and SLAC 2.9M Linear Accelerator meshes; 1024 parts, varying number of processors.)

Repartitioning Experiments

Experiments with 64 parts on 64 processors. Dynamically adjust the weights in the data to simulate, say, adaptive mesh refinement; then repartition. Measure repartitioning time and total communication volume:

Total communication volume = data redistribution volume + application communication volume.

Best Algorithms Paper Award at IPDPS'07: "Hypergraph-based Dynamic Load Balancing for Adaptive Scientific Computations," Catalyurek, Boman, Devine, Bozdag, Heaphy & Riesen.
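A small sketch of how the two terms of that metric can be tallied, given old and new part assignments and per-object data sizes (all values made up for illustration); the application term is the connectivity metric from the earlier sketch, evaluated on the new partition:

```cpp
// Sketch: total communication volume for dynamic repartitioning =
// data redistribution volume (objects that change parts) +
// application communication volume (connectivity metric of the new parts).
#include <cstdio>
#include <set>
#include <vector>

int main() {
    std::vector<int> old_part = {0, 0, 1, 1, 2, 2};
    std::vector<int> new_part = {0, 1, 1, 1, 2, 0};   // after repartitioning
    std::vector<int> obj_size = {4, 4, 8, 8, 4, 4};   // migration cost per object

    long long redist = 0;
    for (size_t i = 0; i < old_part.size(); ++i)
        if (old_part[i] != new_part[i]) redist += obj_size[i];

    // Application volume: lambda-1 metric over hyperedges in the new partition.
    std::vector<std::vector<int>> hyperedges = {{0, 1}, {1, 2, 3}, {0, 2, 4, 5}};
    long long app = 0;
    for (const auto& e : hyperedges) {
        std::set<int> parts;
        for (int v : e) parts.insert(new_part[v]);
        app += (long long)parts.size() - 1;
    }
    std::printf("redistribution=%lld application=%lld total=%lld\n",
                redist, app, redist + app);
}
```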

Repartitioning Results: Lower is Better

(Charts for the SLAC 6.0M LCLS mesh: repartitioning time (secs), data redistribution volume, and application communication volume.)

Aiming for Petascale

Reducing message latency for the application:
– Minimize the maximum number of neighboring parts (with Kumfert, LLNL).
– Balance both computation and communication (Pinar & Hendrickson).

Reducing communication overhead for the application (see the mapping sketch below):
– Map parts onto processors to take advantage of the network topology.
– Minimize the distance messages travel in the network.

Developing specialized partitioning strategies for particle-in-cell applications.
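To make "minimize the distance messages travel" concrete, here is an illustrative sketch (not from the slides) that scores a mapping of parts onto a 3D-mesh machine by weighted hop count, the kind of objective a topology-aware mapper would minimize. The machine dimensions, traffic matrix, and candidate mappings are all made up.

```cpp
// Sketch: evaluating a part-to-processor mapping on a 3D mesh network by
// weighted hop count: sum over communicating part pairs of
// traffic * Manhattan distance. A topology-aware mapper minimizes this.
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };

// Processor coordinates on a 4x4x4 mesh, indexed by MPI-style rank.
Coord coord_of(int rank) { return {rank % 4, (rank / 4) % 4, rank / 16}; }

int hops(Coord a, Coord b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y) + std::abs(a.z - b.z);
}

int main() {
    // traffic[i][j]: bytes part i exchanges with part j per step (made up).
    std::vector<std::vector<int>> traffic = {
        {0, 9, 1, 0}, {9, 0, 9, 1}, {1, 9, 0, 9}, {0, 1, 9, 0}};

    // Two candidate mappings of 4 parts onto ranks of the 4x4x4 mesh.
    std::vector<int> naive = {0, 21, 42, 63};   // scattered across the machine
    std::vector<int> aware = {0, 1, 2, 3};      // heavy neighbors placed close

    for (const auto& mapping : {naive, aware}) {
        long long cost = 0;
        for (size_t i = 0; i < mapping.size(); ++i)
            for (size_t j = i + 1; j < mapping.size(); ++j)
                cost += (long long)traffic[i][j] *
                        hops(coord_of(mapping[i]), coord_of(mapping[j]));
        std::printf("weighted hop count = %lld\n", cost);  // 93 vs. 31
    }
}
```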

Aiming for Petascale

Hierarchical partitioning in Zoltan v3 (entire system → processor → core; see the sketch after this list):
– Partition for multicore/manycore architectures: partition hierarchically with respect to chips and then cores, similar to strategies for clusters of SMPs (Teresco, Faik); treat core-level parts as separate threads or MPI processes.
– Support 100Ks of processors: reduce collective communication operations during partitioning; allow more localized partitioning on subsets of processors.
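A minimal sketch of the two-level idea (illustrative, not Zoltan's hierarchical implementation): first split objects across nodes, then independently split each node's share across its cores. A trivial contiguous weighted split stands in for a real partitioner; sizes are made up.

```cpp
// Sketch of hierarchical (two-level) partitioning: split work across nodes
// first, then split each node's share across its cores. The "partitioner"
// is a trivial contiguous weighted split standing in for a real method
// (RCB, hypergraph, ...).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Split object indices into `parts` contiguous chunks of roughly equal weight.
std::vector<std::vector<int>> split(const std::vector<int>& objs,
                                    const std::vector<double>& w, int parts) {
    double total = 0;
    for (int o : objs) total += w[o];
    std::vector<std::vector<int>> out(parts);
    double acc = 0;
    for (int o : objs) {
        int p = std::min(parts - 1, (int)(acc / total * parts));
        out[p].push_back(o);
        acc += w[o];
    }
    return out;
}

int main() {
    const int num_nodes = 2, cores_per_node = 4;
    std::vector<double> w(16, 1.0);  // 16 unit-weight objects
    std::vector<int> all(w.size());
    std::iota(all.begin(), all.end(), 0);

    // Level 1: across nodes. Level 2: within each node, across its cores.
    auto node_parts = split(all, w, num_nodes);
    for (int n = 0; n < num_nodes; ++n) {
        auto core_parts = split(node_parts[n], w, cores_per_node);
        for (int c = 0; c < cores_per_node; ++c)
            std::printf("node %d core %d: %zu objects\n", n, c,
                        core_parts[c].size());
    }
}
```

Core-level parts can then run as threads sharing a node or as separate MPI processes, as the slide notes.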

Aiming for Petascale

Improving the scalability of partitioning algorithms:
– Hybrid partitioners (particularly for mesh-based applications) combining fast geometric methods with higher-quality connectivity-based methods.
– Refactored partitioners for bigger data sets and processor arrays.

For More Information…

Zoltan web page:
– Download Zoltan v3 (open-source software).
– Tutorial and User's Guide.

ITAPS interface to Zoltan:

Test Data

– SLAC LCLS Radio Frequency Gun: 6.0M x 6.0M matrix, 23.4M nonzeros.
– Xyce 680K ASIC stripped circuit simulation: 680K x 680K matrix, 2.3M nonzeros.
– Cage15 DNA electrophoresis: 5.1M x 5.1M matrix, 99M nonzeros.
– SLAC Linear Accelerator: 2.9M x 2.9M matrix, 11.4M nonzeros.

Communication Volume: Lower is Better

(Charts for all four datasets — Cage15 5.1M electrophoresis, Xyce 680K circuit, SLAC 6.0M LCLS, and SLAC 2.9M Linear Accelerator — with series RCB, SFC, graph, and hypergraph; number of parts = number of processors.)

Partitioning Time: Lower is Better

(Charts of partitioning time for the same four datasets and partitioners; 1024 parts, varying number of processors.)

Repartitioning Results: Lower is Better

(Charts for the Xyce 680K circuit and SLAC 6.0M LCLS datasets: repartitioning time (secs), data redistribution volume, and application communication volume.)