Presentation transcript:

Partitioning, Load Balancing, and Ordering for Petascale Applications
Erik Boman, Cedric Chevalier, Karen Devine, Bruce Hendrickson (Sandia National Labs); Umit Çatalyürek (Ohio State University); Michael Wolf (UIUC and Sandia)
CSCAPES Workshop, Santa Fe, June 10-13, 2008
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Performance History & Projections
[Chart: peak performance growing from 1 Gflop/s with O(1) thread, through 1 Tflop/s with O(10^3) threads and 1 Pflop/s with O(10^6) threads, to a projected 1 Eflop/s with O(10^9) threads. Source: Dongarra, top500.org]

Interesting Times
Supercomputers are becoming more complex:
–Hierarchical: nodes (CMP), processors, cores.
–Heterogeneous: accelerators, e.g., Cell, GPU.
–How do we deal with millions of cores (threads)?
Multicore impacts everything from desktops to HPC:
–Flops are cheap; data throughput and latency are key.
–Rethink algorithms, software, and libraries.
Programming model/environment uncertainty:
–Is MPI enough? PGAS languages? Hybrid?
–How can applications use these computers efficiently?
Data distribution is increasingly important.

Partitioning and Load Balancing
The assignment of application data to processors for parallel computation. Applied to grid points, elements, matrix rows, particles, ….

Static Partitioning
Static partitioning in an application:
–The data partition is computed.
–Data are distributed according to the partition map.
–The application computes.
Ideal partition:
–Processor idle time is minimized.
–Inter-processor communication costs are kept low.
[Flow: Initialize Application → Partition Data → Distribute Data → Compute Solutions → Output & End]

Dynamic Repartitioning (a.k.a. Dynamic Load Balancing)
Dynamic repartitioning (load balancing) in an application:
–The data partition is computed.
–Data are distributed according to the partition map.
–The application computes and, perhaps, adapts.
–The process repeats until the application is done.
Ideal partition:
–Processor idle time is minimized.
–Inter-processor communication costs are kept low.
–The cost to redistribute data is also kept low.
[Flow: Initialize Application → Partition Data → Redistribute Data → Compute Solutions & Adapt → (repeat) → Output & End]
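The loop above can be made concrete with a small simulation. The 5% tolerance, the random growth model, and the idealized rebalancing rule below are illustrative assumptions, not Zoltan behavior:

import random

# Toy simulation of the dynamic repartitioning loop. "Work" per
# processor grows unevenly, as under adaptive refinement; we
# repartition whenever imbalance (max load / average load) exceeds
# a tolerance, and pay a redistribution cost for moved work.

P = 8          # processors
TOL = 1.05     # rebalance when max load exceeds average by 5%
loads = [100.0] * P

random.seed(1)
for step in range(20):
    loads = [w * random.uniform(1.0, 1.2) for w in loads]  # compute & adapt
    avg = sum(loads) / P
    imbalance = max(loads) / avg
    if imbalance > TOL:
        # redistribution cost: work moved off overloaded processors
        moved = sum(max(w - avg, 0.0) for w in loads)
        loads = [avg] * P  # idealized repartition to perfect balance
        print(f"step {step:2d}: imbalance {imbalance:.3f}, moved {moved:.1f}")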

What makes a partition “good,” especially at petascale?
Balanced work loads.
–Even small imbalances waste many processors! Execution time is set by the most-loaded processor, so on 100,000 processors, one processor 5% over the average workload is equivalent to roughly 4760 idle processors with the rest perfectly balanced.
Low interprocessor communication costs.
–Processor speeds are increasing faster than network speeds.
–Partitions with minimal communication costs are critical.
Scalable partitioning time and memory use.
–Scalability is especially important for dynamic partitioning.
Low data redistribution costs for dynamic partitioning.
–(Umit will discuss dynamic repartitioning.)
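A quick check of the arithmetic behind that claim:

# Execution time is set by the most-loaded processor, so parallel
# efficiency is (average load / max load); the wasted fraction of
# the machine is P * (1 - efficiency).
P = 100_000
overload = 1.05                      # max load / average load
efficiency = 1.0 / overload          # = average / max
idle_equivalent = P * (1.0 - efficiency)
print(f"efficiency = {efficiency:.4f}")               # 0.9524
print(f"idle-equivalent = {idle_equivalent:.0f}")     # ~4762 processors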

Zoltan Toolkit
No single method is best in all cases:
–Zoltan provides a collection of algorithms.
Data-structure-neutral interface:
–Application callbacks.
Fully parallel:
–Based on MPI.
–Successfully used on up to 20K cores.
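The data-structure neutrality comes from the callback design: the partitioner queries the application through registered functions instead of reading its data structures directly. The real Zoltan interface is C (e.g., Zoltan_Set_Fn and related calls); the Python toy below is only an analogy of that inversion of control, with every name invented for illustration:

# Toy analogy of a callback-driven, data-structure-neutral
# partitioner: it never sees the application's mesh directly,
# only the registered query functions.

class ToyPartitioner:
    def __init__(self):
        self.callbacks = {}

    def set_fn(self, name, fn):
        self.callbacks[name] = fn          # cf. Zoltan_Set_Fn in the C API

    def partition(self, num_parts):
        ids = self.callbacks["obj_list"]()
        weights = self.callbacks["obj_weights"]()
        # trivial strategy: greedy assignment to the lightest part
        loads = [0.0] * num_parts
        parts = {}
        for obj, w in sorted(zip(ids, weights), key=lambda t: -t[1]):
            p = loads.index(min(loads))
            parts[obj] = p
            loads[p] += w
        return parts

# The application exposes its own data structures via closures:
mesh = {"e1": 3.0, "e2": 1.0, "e3": 2.0, "e4": 2.0}
zz = ToyPartitioner()
zz.set_fn("obj_list", lambda: list(mesh))
zz.set_fn("obj_weights", lambda: list(mesh.values()))
print(zz.partition(2))   # {'e1': 0, 'e3': 1, 'e4': 1, 'e2': 0}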

Partitioning Algorithms in the Zoltan Toolkit
Geometric (coordinate-based) methods:
–Recursive Coordinate Bisection (Berger, Bokhari)
–Recursive Inertial Bisection (Taylor, Nour-Omid)
–Space-Filling Curve Partitioning (Warren & Salmon, et al.)
–Refinement-tree Partitioning (Mitchell)
Hypergraph and graph (connectivity-based) methods:
–Zoltan Graph Partitioning; ParMETIS (U. Minnesota); PT-Scotch (Pellegrini & Chevalier)
–Hypergraph Partitioning; Hypergraph Repartitioning; PaToH (Catalyurek & Aykanat)

Geometric Partitioning: Recursive Coordinate Bisection
Developed by Berger & Bokhari (1987) for adaptive mesh refinement.
Idea:
–Divide the work into two equal parts using a cutting plane orthogonal to a coordinate axis.
–Recursively cut the resulting subdomains.
[Figure: a domain divided by a 1st, 2nd, and 3rd cut.]
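A minimal serial sketch of the idea, assuming unit object weights and a power-of-two part count; Zoltan's RCB is parallel, weighted, and supports general part counts:

import numpy as np

def rcb(points, ids, num_parts):
    """Split at the median along the longest axis; recurse on halves."""
    if num_parts == 1:
        return {i: 0 for i in ids}
    axis = np.argmax(points.max(axis=0) - points.min(axis=0))
    order = np.argsort(points[:, axis])
    half = len(ids) // 2
    lo, hi = order[:half], order[half:]
    left = rcb(points[lo], [ids[i] for i in lo], num_parts // 2)
    right = rcb(points[hi], [ids[i] for i in hi], num_parts // 2)
    # offset part numbers of the "upper" half
    return {**left, **{i: p + num_parts // 2 for i, p in right.items()}}

pts = np.random.rand(1000, 2)
parts = rcb(pts, list(range(len(pts))), 4)   # 250 points per part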

Applications of Geometric Methods
–Parallel volume rendering
–Crash simulations and contact detection
–Adaptive mesh refinement
–Particle simulations

Graph Partitioning
Represent the problem as a weighted graph:
–Vertices = objects to be partitioned.
–Edges = dependencies between two objects.
–Weights = work load or amount of dependency.
Partition the graph so that…
–parts have equal vertex weight, and
–the weight of edges cut by part boundaries is small.
Kernighan, Lin, Schweikert, Fiduccia, Mattheyses, Simon, Hendrickson, Leland, Kumar, Karypis, et al.
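Both objectives can be evaluated directly. A small sketch, using a four-vertex cycle, shows two partitions with identical balance but very different cut weights:

# Vertex-weight balance and cut-edge weight in the graph model.
# `part` maps vertex -> part id; edges is a weighted edge list.

def balance_and_cut(vwgt, edges, ewgt, part, num_parts):
    loads = [0.0] * num_parts
    for v, w in vwgt.items():
        loads[part[v]] += w
    imbalance = max(loads) * num_parts / sum(loads)   # 1.0 = perfect
    cut = sum(w for (u, v), w in zip(edges, ewgt) if part[u] != part[v])
    return imbalance, cut

vwgt = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
ewgt = [1, 1, 1, 1]
print(balance_and_cut(vwgt, edges, ewgt, {"a": 0, "b": 0, "c": 1, "d": 1}, 2))  # (1.0, 2)
print(balance_and_cut(vwgt, edges, ewgt, {"a": 0, "b": 1, "c": 0, "d": 1}, 2))  # (1.0, 4)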

Applications using Graph Partitioning
–Linear solvers & preconditioners for Ax = b (square, structurally symmetric systems)
–Finite element analysis
–Multiphysics and multiphase simulations

Hypergraph Partitioning
Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al.
Hypergraph model:
–Vertices = objects to be partitioned.
–Hyperedges = dependencies between two or more objects.
Partitioning goal: assign equal vertex weight while minimizing hyperedge cut weight.
[Figures: a graph partitioning model vs. a hypergraph partitioning model.]
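The usual hyperedge cut metric is "connectivity minus one": a hyperedge spanning lambda parts costs lambda - 1 units, which counts the communication volume of operations like parallel sparse matrix-vector multiply exactly, where the graph edge cut only approximates it. A small sketch:

# Connectivity-1 metric: each hyperedge (e.g., a matrix row's
# nonzero columns) pays (number of parts it touches - 1).

def connectivity_minus_1(hyperedges, part):
    total = 0
    for pins in hyperedges:
        parts_touched = {part[v] for v in pins}
        total += len(parts_touched) - 1
    return total

hyperedges = [("a", "b", "c"), ("c", "d"), ("a", "d")]
part = {"a": 0, "b": 0, "c": 0, "d": 1}
print(connectivity_minus_1(hyperedges, part))   # 0 + 1 + 1 = 2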

Hypergraph Applications
–Circuit simulations
–Linear programming for sensor placement
–Linear solvers & preconditioners for Ax = b (no restrictions on matrix structure)
–Finite element analysis
–Multiphysics and multiphase simulations
–Data mining
[Figure: a circuit netlist with a voltage source, resistors, capacitors, and inductors.]

Hypergraph Partitioning: Advantages and Disadvantages
Advantages:
–Communication volume is reduced 30-38% on average over graph partitioning (Catalyurek & Aykanat), with a 5-15% reduction for mesh-based applications.
–A more accurate communication model than graph partitioning; better representation of highly connected and/or non-homogeneous systems.
–Greater applicability than the graph model; can represent rectangular systems and non-symmetric dependencies.
Disadvantages:
–More expensive than graph partitioning.

Performance Results
Experiments on Sandia’s Thunderbird cluster:
–Dual 3.6 GHz Intel EM64T processors with 6 GB RAM.
–Infiniband network.
Compare RCB, graph (ParMETIS), and hypergraph (Zoltan) methods.
Measure:
–The amount of communication induced by the partition.
–Partitioning time.

Test Data
–SLAC LCLS Radio Frequency Gun: 6.0M x 6.0M, 23.4M nonzeros.
–SLAC Linear Accelerator: 2.9M x 2.9M, 11.4M nonzeros.
–Xyce 680K ASIC stripped circuit simulation: 680K x 680K, 2.3M nonzeros.
–Cage15 DNA electrophoresis: 5.1M x 5.1M, 99M nonzeros.

Communication Volume: Lower is Better
[Charts comparing RCB, graph, and hypergraph partitioning for Cage15 5.1M electrophoresis, Xyce 680K circuit, SLAC 6.0M LCLS, and SLAC 2.9M Linear Accelerator; number of parts = number of processors. Results thanks to Karen Devine.]

Partitioning Time: Lower is Better
[Charts comparing RCB, graph, and hypergraph partitioning time for Cage15 5.1M electrophoresis, Xyce 680K circuit, SLAC 6.0M LCLS, and SLAC 2.9M Linear Accelerator; 1024 parts with a varying number of processors.]

Aiming for Petascale
Hierarchical partitioning in Zoltan v3:
–Partition for multicore/manycore architectures: partition hierarchically with respect to chips and then cores, similar to strategies for clusters of SMPs (Teresco, Faik); treat core-level partitions as separate threads or MPI processes (the application decides).
–Support 100Ks of processors (millions of cores): reduce collective communication operations during partitioning; allow more localized partitioning on subsets of processors.
[Diagrams: entire system → subsystems → processors; entire system → processors → cores.]

Aiming for Petascale
Reducing communication costs for applications:
–Reducing communication volume: two-dimensional sparse matrix partitioning (Catalyurek, Bisseling, Boman, Wolf) partitions the nonzeros of the matrix rather than whole rows/columns.
–Reducing message latency: minimize the maximum number of neighboring parts (messages), as in the sketch below.
–Balancing both computation and communication (Pinar & Hendrickson): the balance criterion is a complex function of the partition rather than a simple sum of object weights.
–Exploiting hierarchical structure and topology: map parts onto processors to take advantage of the network topology.
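The latency metric in the second bullet can be computed directly from the partitioned hyperedges. A small sketch of the metric itself, not of any particular Zoltan routine:

from collections import defaultdict

# Maximum number of distinct neighboring parts (i.e., messages)
# any one part must exchange, derived from shared hyperedges.

def max_num_neighbors(hyperedges, part):
    neighbors = defaultdict(set)
    for pins in hyperedges:
        touched = {part[v] for v in pins}
        for p in touched:
            neighbors[p] |= touched - {p}
    return max((len(s) for s in neighbors.values()), default=0)

hyperedges = [("a", "b"), ("b", "c"), ("c", "d")]
part = {"a": 0, "b": 1, "c": 2, "d": 2}
print(max_num_neighbors(hyperedges, part))  # part 1 talks to parts 0 and 2 -> 2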

Expanding the Scope of Zoltan
Zoltan now supports three combinatorial problems:
–Partitioning
–Ordering
–Coloring
The focus is on large problems and parallel scalability.

Sparse Matrix Ordering
Fill-reducing ordering for Ax = b.
–SPD case: nested dissection.
–Provided in Zoltan via external libraries: PT-Scotch and ParMETIS.
–PT-Scotch is better for large core counts (Chevalier): CEA (France) solved a 3D system of order 45 million in ~30 minutes on their TERA computer.
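The nested-dissection idea: find a small separator, order the two halves first and the separator last, and recurse; eliminating separator vertices last limits fill. A toy version on a 1-D chain graph, where the middle vertex is a perfect separator (real codes such as ParMETIS and PT-Scotch find small separators in general graphs via multilevel methods):

# Toy nested dissection on a chain graph (vertices 0..n-1, edges
# between consecutive vertices): recursively order each half,
# then place the separator vertex last.

def nested_dissection_order(vertices):
    if len(vertices) <= 2:
        return list(vertices)
    mid = len(vertices) // 2
    sep = [vertices[mid]]                      # separator: one vertex
    left = nested_dissection_order(vertices[:mid])
    right = nested_dissection_order(vertices[mid + 1:])
    return left + right + sep                  # separator ordered last

print(nested_dissection_order(list(range(7))))  # [0, 2, 1, 4, 6, 5, 3]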

New Unsymmetric Method: HUND
HUND = Hypergraph Unsymmetric Nested Dissection (Grigori, Boman, Donfack, Davis, ’08).
–Reduces fill in LU but still allows pivoting.
–Based on recursive bisection using hypergraphs.
–Works directly on the unsymmetric structure.
–More robust than COLAMD.
–Will be implemented in Zoltan and made available in SuperLU.

HUND Results
[Performance profiles of memory usage (higher is better!) for SuperLU and UMFPACK on 27 unsymmetric test matrices from the UF collection.]

Zoltan Integration into SciDAC
–ITAPS: dynamic services for meshes.
–Trilinos: matrix partitioning via Isorropia.
–PETSc: unstructured mesh partitioning in Sieve.
–SuperLU: matrix ordering (planned).
–COMPASS/SLAC: accelerator modeling, PIC.
Wanted: more application collaborations!

Questions to (Potential) Users
–How can we make our software tools easier to use?
–What are your current bottlenecks?
–What new features do you need?

For More Information…
CSCAPES and Zoltan web pages:
–Download Zoltan v3 (open-source software).
–Read the User’s Guide and try the examples.
Zoltan tutorial: Thursday, 8:30 am.

Thanks
S. Attaway (SNL), C. Aykanat (Bilkent U.), A. Bauer (RPI), R. Bisseling (Utrecht U.), D. Bozdag (Ohio St. U.), T. Davis (U. Florida), J. Faik (RPI), J. Flaherty (RPI), L. Grigori (INRIA), R. Heaphy (SNL), M. Heroux (SNL), D. Keyes (Columbia), K. Ko (SLAC), G. Kumfert (LLNL), L.-Q. Lee (SLAC), V. Leung (SNL), G. Lonsdale (NEC), X. Luo (RPI), L. Musson (SNL), S. Plimpton (SNL), L.A. Riesen (SNL), J. Shadid (SNL), M. Shephard (RPI), C. Silva (SNL), J. Teresco (Mount Holyoke), C. Vaughan (SNL).
Funding: SciDAC CSCAPES Institute; NNSA ASC Program.

Geometric Repartitioning
Geometric methods implicitly achieve low data redistribution costs: for small changes in the data, the cuts move only slightly, resulting in little data redistribution.

RCB: Advantages and Disadvantages
Advantages:
–Conceptually simple; fast and inexpensive.
–All processors can inexpensively know the entire partition (e.g., for global search in contact detection).
–No connectivity information is needed (e.g., for particle methods).
–Good on specialized geometries: for SLAC’s 55-cell linear accelerator with couplers, a one-dimensional RCB partition reduced runtime up to 68% on a 512-processor IBM SP3 (Wolf, Ko).
Disadvantages:
–No explicit control of communication costs.
–Mediocre partition quality.
–Can generate disconnected subdomains for complex geometries.
–Needs coordinate information.

Graph Partitioning: Advantages and Disadvantages
Advantages:
–A highly successful model for mesh-based PDE problems.
–Explicit control of communication volume gives higher partition quality than geometric methods.
–Excellent software is available. Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Scotch (U. Bordeaux). Parallel: Zoltan (SNL), ParMETIS (U. Minn.), PJostle (U. Greenwich), PT-Scotch (U. Bordeaux).
Disadvantages:
–More expensive than geometric methods.
–The edge-cut model only approximates communication volume.

Graph Repartitioning
Diffusive strategies (Cybenko, Hu, Blake, Walshaw, Schloegel, et al.), sketched below:
–Shift work from highly loaded processors to less loaded neighbors.
–Local communication keeps data redistribution costs low.
Multilevel partitioners that account for data redistribution costs when refining partitions (Schloegel, Karypis):
–A parameter weights application communication vs. redistribution communication.
[Multilevel flow: coarsen graph → partition coarse graph → refine partition, accounting for current part assignments.]
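A minimal sketch of the diffusive idea, in the style of Cybenko's first-order diffusion; the ring topology, step size, and iteration count are illustrative choices:

# First-order diffusive load balancing: each processor repeatedly
# exchanges a fraction of its load difference with each neighbor,
# so work flows locally from overloaded to underloaded processors.
# alpha must be small enough (relative to vertex degree) for
# stability; total load is conserved at every step.

def diffuse(loads, neighbor_lists, alpha=0.25, steps=50):
    for _ in range(steps):
        flows = [
            alpha * sum(loads[j] - loads[i] for j in neighbor_lists[i])
            for i in range(len(loads))
        ]
        loads = [w + f for w, f in zip(loads, flows)]
    return loads

# 4 processors in a ring, one heavily overloaded:
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
print(diffuse([10.0, 2.0, 2.0, 2.0], ring))  # converges toward [4, 4, 4, 4]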

Hypergraph Repartitioning
Augment the hypergraph with data redistribution costs:
–Account for the data’s current processor assignments.
–Weight dependencies by their size and frequency of use.
Hypergraph partitioning then attempts to minimize the total communication volume:
total communication volume = data redistribution volume + application communication volume (see the sketch below).
This gives lower total communication volume than geometric and graph repartitioning.
Best Algorithms Paper Award at IPDPS’07: “Hypergraph-based Dynamic Load Balancing for Adaptive Scientific Computations,” Catalyurek, Boman, Devine, Bozdag, Heaphy, & Riesen.
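A sketch of that combined objective, reusing the connectivity-1 metric for the application volume; all inputs are illustrative:

# Total communication volume for a repartitioning step:
# objects that change parts pay their data size (redistribution),
# and cut hyperedges pay connectivity-1 (application communication).

def total_volume(sizes, old_part, new_part, hyperedges):
    redistribution = sum(
        sizes[v] for v in sizes if old_part[v] != new_part[v]
    )
    application = sum(
        len({new_part[v] for v in pins}) - 1 for pins in hyperedges
    )
    return redistribution + application

sizes = {"a": 5, "b": 5, "c": 5, "d": 5}
old = {"a": 0, "b": 0, "c": 1, "d": 1}
new = {"a": 0, "b": 1, "c": 1, "d": 1}      # object b migrates
hyperedges = [("a", "b"), ("c", "d")]
print(total_volume(sizes, old, new, hyperedges))  # 5 (move b) + 1 (cut a-b) = 6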

Communication Volume: Lower is Better
[Charts comparing RCB, graph, and hypergraph partitioning for Cage15 5.1M electrophoresis, Xyce 680K circuit, SLAC 6.0M LCLS, and SLAC 2.9M Linear Accelerator; 1024 parts with a varying number of processors.]

Partitioning Time: Lower is Better
[Charts comparing RCB, graph, and hypergraph partitioning time for Cage15 5.1M electrophoresis, Xyce 680K circuit, SLAC 6.0M LCLS, and SLAC 2.9M Linear Accelerator; number of parts = number of processors.]

Repartitioning Experiments
–Experiments with 64 parts on 64 processors.
–Dynamically adjust the weights in the data to simulate, say, adaptive mesh refinement; then repartition.
–Measure repartitioning time and total communication volume, where total communication volume = data redistribution volume + application communication volume.

Repartitioning Results: Lower is Better
[Charts for Xyce 680K circuit and SLAC 6.0M LCLS showing repartitioning time (secs), data redistribution volume, and application communication volume.]

Aiming for Petascale
Improving the scalability of the partitioning algorithms:
–Refactor code and algorithms for bigger data sets and processor arrays.
–Hybrid partitioners (for mesh-based applications): use inexpensive geometric methods for the initial partitioning, then refine with hypergraph/graph-based algorithms at the boundaries; use geometric information for fast coarsening in multilevel hypergraph/graph-based partitioners.
[Multilevel flow: coarsen data using geometric information → partition coarse data → refine partition.]