Sandia National Laboratories Graph Partitioning Workshop, Oct. 15, 1999
Load Balancing Myths, Fictions & Legends
Bruce Hendrickson
Sandia National Laboratories, Parallel Computing Sciences Dept.

Introduction
- Two happy occurrences:
  - (1) Good graph partitioning tools & software.
  - (2) Good parallel efficiencies for many applications.
- Is the latter due to the former?
- Yes, but also no.
- We have been lucky!
  - Wrong objectives.
  - Models insufficiently general.
  - Software tools often poorly designed.

Myth 1: The Edge Cut Deceit
- Generally believed that "Edge Cuts = Communication Cost".
- This assumption is behind the use of graph partitioning.
- In reality:
  - Edge cuts are not equal to communication volume.
  - Communication volume is not equal to communication cost.

Edge Cuts Versus Volume
- Edge cuts = 10.
- Communication volume:
  - 8 (from left partition to right partition).
  - 7 (from right partition to left partition).

Communication Volume
- Assume graph edges reflect data dependencies.
- Correct accounting of communication volume is:
  - Number of vertices on boundary of partition.
- Elegant alternative is hypergraph model of Çatalyürek, Aykanat, Pinar and Pinar.
  - Volume is number of cut hyperedges.
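To make the distinction concrete, here is a minimal sketch (plain Python, not from the talk) that computes both quantities for a toy partitioned graph: the edge cut, and the boundary-vertex accounting of communication volume described above.

    # Minimal sketch: edge cuts vs. true communication volume.
    # `adj` is an adjacency list; `part` maps each vertex to its partition.

    def edge_cuts(adj, part):
        """Count edges whose endpoints lie in different partitions."""
        return sum(1 for u, nbrs in adj.items()
                     for v in nbrs
                     if u < v and part[u] != part[v])

    def comm_volume(adj, part):
        """Each vertex must be sent once to every *other* partition owning
        at least one of its neighbors (the boundary-vertex accounting)."""
        return sum(len({part[v] for v in nbrs if part[v] != part[u]})
                   for u, nbrs in adj.items())

    # Star example: vertex 0 on the left, vertices 1-3 on the right.
    adj  = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    part = {0: 0, 1: 1, 2: 1, 3: 1}
    print(edge_cuts(adj, part))    # 3 cut edges
    print(comm_volume(adj, part))  # 4 units: 1 left->right, 3 right->left
    # The naive "edge cuts = communication" proxy would suggest 2*3 = 6 units.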

Communication Cost
- Cost of single message involves volume and latency.
- Cost of multiple messages involves congestion.
- Cost within application depends only on slowest processor.
- Our models don't optimize the right metrics!
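A companion sketch (again mine, not the speaker's) of the metric that actually matters: tally communication per partition and look at the maximum, since the application waits for the slowest processor.

    # Per-partition communication summary: max() matters more than sum().
    from collections import defaultdict

    def per_partition_comm(adj, part):
        """Send volume and number of distinct destination partitions
        (a message-count proxy), tallied per partition."""
        volume = defaultdict(int)
        peers = defaultdict(set)
        for u, nbrs in adj.items():
            dests = {part[v] for v in nbrs if part[v] != part[u]}
            volume[part[u]] += len(dests)
            peers[part[u]].update(dests)
        return volume, peers

    adj  = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
    part = {0: 0, 1: 1, 2: 1, 3: 2}
    volume, peers = per_partition_comm(adj, part)
    print(max(volume.values()))                 # largest send volume on any one partition
    print(max(len(d) for d in peers.values()))  # largest number of messages sent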

Why Does Graph Partitioning Work?
- Vast majority of applications are computational meshes.
  - Geometric properties ensure that good partitions exist.
    - Communication/Computation = n^{1/2} in 2D, n^{2/3} in 3D.
    - Runtime is dominated by computation.
  - Vertices have bounded numbers of neighbors.
    - Error in edge cut metric is bounded.
  - Homogeneity ensures all processors have similar subdomains.
    - No processor has dramatically more communication.
- Other applications aren't so forgiving.
  - E.g. interior point methods, latent semantic indexing, etc.
- We have been lucky!
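A back-of-the-envelope illustration of the surface-to-volume point, with numbers of my own choosing rather than the talk's:

    # With n cells per processor, a compact subdomain has ~n**(1/2) boundary
    # cells in 2D and ~n**(2/3) in 3D, so communication is a small fraction
    # of computation.
    n = 10**6
    print(n**0.5 / n)       # 2D: 0.001  (about 0.1% of the work faces communication)
    print(n**(2 / 3) / n)   # 3D: ~0.01  (about 1%)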

Myth 2: Simple Graphs are Sufficient
- Graphs are widely used to encode data dependencies.
  - Vertex weights reflect computational cost.
  - Edge weights encode volume of data transfer.
- Graph partitioning determines data decomposition.
- However, many problems are not easily expressed this way!
  - Complex relationships or constraints on partitioning.
    - E.g. computation on nodes and on elements.
    - DRAMA has a uniquely rich model for this problem.
  - Dependencies are directed (e.g. unsymmetric matvec).
  - Computation consists of multiple phases.
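For reference, the standard model this slide criticizes fits in a few lines. The sketch below is generic (my own, not tied to any particular tool): vertex weights carry computational cost, edge weights carry data volume, and per-partition load and weighted cut fall out immediately.

    # Standard weighted-graph model: vertex weight = work,
    # edge weight = data volume between the two computations.
    from collections import defaultdict

    vwgt = {0: 2.0, 1: 1.0, 2: 1.5, 3: 1.0}                 # computational cost
    ewgt = {(0, 1): 4, (0, 2): 1, (1, 3): 2, (2, 3): 3}      # transfer volume
    part = {0: 0, 1: 0, 2: 1, 3: 1}

    load = defaultdict(float)
    for v, w in vwgt.items():
        load[part[v]] += w
    cut = sum(w for (u, v), w in ewgt.items() if part[u] != part[v])

    print(dict(load))   # {0: 3.0, 1: 2.5} -- per-partition work
    print(cut)          # 3 -- weighted edge cut

Note that this encoding is inherently undirected and single-phase, which is exactly where the myth breaks down.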

Alternative Graph Models
- Hypergraph model (Aykanat, et al.)
  - Vertices represent computations.
  - Hyperedge connects all objects which produce/use datum.
  - Handles directed dependencies.
- Bipartite graph model (H. & Kolda)
  - Directed graph replaced by equivalent bipartite graph.
  - Handles directed dependencies.
  - Can model two-phase calculations.
- Multi-Objective, Multi-Constraint (Schloegel, Karypis & Kumar)
  - Each vertex/edge can have multiple weights.
  - Can model multiple-phase calculations.
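Of these, the hypergraph model is easy to make concrete. The sketch below is my own construction in the spirit of the column-net idea, not code from any library: for y = A*x with a square unsymmetric A, one net per column connects the owner of x[j] to every row that reads x[j], and the earlier slide's volume measure is the number of cut nets. (The refined accounting sums connectivity - 1 over nets.)

    def column_net_hypergraph(nonzeros, n):
        """Square sparse A given as (i, j) nonzero pairs; vertex k owns both
        row k (computes y[k]) and x[k].  Net j connects x[j]'s owner to every
        row that reads x[j]."""
        nets = {j: {j} for j in range(n)}
        for i, j in nonzeros:
            nets[j].add(i)
        return nets

    def num_cut_nets(nets, part):
        """Nets spanning more than one partition."""
        return sum(1 for members in nets.values()
                     if len({part[v] for v in members}) > 1)

    # Tiny 4x4 unsymmetric example, 2-way partition {0,1} vs {2,3}.
    nnz  = [(0, 0), (0, 2), (1, 1), (2, 2), (3, 1), (3, 3)]
    part = {0: 0, 1: 0, 2: 1, 3: 1}
    nets = column_net_hypergraph(nnz, 4)
    print(num_cut_nets(nets, part))   # 2: the nets for columns 1 and 2 are cut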

Myth 3: Partition Quality is Paramount
- Partitioners compete on edge cuts, and (sometimes) runtime.
- This isn't the full story, particularly for dynamic load balancing!

Habits of Highly Effective Dynamic Load Balancers
- Balance the load
  - Balance each phase or some combination?
  - How is load measured?
- Minimize communication cost
  - Volume? Number of messages? Something else?
- Run fast in parallel
- Be incremental
  - Make new partition similar to current one
- Not use too much memory
- Support determination of new communication pattern
- Be easy to use
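The "be incremental" habit is easy to quantify. One toy measure (my own, not from the talk) is the weighted fraction of objects that change owner between the old and new partitions, which is what drives migration time:

    def migration_fraction(old_part, new_part, weight=None):
        """Weighted fraction of objects whose owning partition changes."""
        w = weight or {v: 1.0 for v in old_part}
        moved = sum(w[v] for v in old_part if new_part[v] != old_part[v])
        return moved / sum(w.values())

    old = {0: 0, 1: 0, 2: 1, 3: 1}
    new = {0: 0, 1: 1, 2: 1, 3: 1}
    print(migration_fraction(old, new))   # 0.25: one of four equal-weight objects moves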

Performance Tradeoffs
- From Touheed, Selwood, Jimack and Berzins.
  - ~1 million elements with adaptive refinement.
  - 32 processors of SGI Origin.
- Timing data for different partitioning algorithms.
  - Repartitioning time per invocation: seconds.
  - Migration time per invocation: seconds.
  - Explicit solve time per timestep: seconds.
- Observations:
  - Migration time more important than partitioner runtime.
  - Importance of quality depends on:
    - Frequency of rebalancing.
    - Cost of solver.

Myth 4: Existing Tools Solve the Problem
- Lots of good software exists.
- Static partitioning is fairly mature.
  - However, it addresses an imperfect model.
- Dynamic partitioning is more complicated.
  - Applications differ in need for cost, quality or incrementality.
  - No algorithm is uniformly best.
  - Good library should support several (e.g. DRAMA).
  - Subroutine interface requires good software engineering.
    - E.g. Zoltan?

Myth 5: The Key is Finding the Right Partition
- Assignment of work to processors is the key to scalability.
- However, this needn't be a single partition.
- Example: parallel crash simulations.
  - In each timestep:
    - (1) Do finite element analysis & predict new deformations.
    - (2) Search for grid intersections (contact detection).
    - (3) If found, correct deformations & forces.
  - Each stage has different objects & different data dependencies.
  - Very difficult to balance them all with one decomposition.
    - But most work on this problem has taken this approach.

Multiple Decompositions
- Finite element analysis:
  - Topological
  - Static
- Contact detection:
  - Geometric
  - Dynamic
- Key Idea:
  - Use graph partitioning for finite element phase
  - Use geometric partitioner for contact detection
    - We use recursive coordinate bisection (RCB)
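The geometric partitioner named here, recursive coordinate bisection, is simple enough to sketch. This is a toy serial version of my own (ignoring weights and parallelism; production partitioners do both): recursively split the coordinate direction with the largest spread at the median.

    # Bare-bones recursive coordinate bisection (RCB) over point coordinates.
    def rcb(points, ids, nparts):
        """points: list of coordinate tuples; ids: matching object ids.
        Returns {id: partition}."""
        if nparts == 1:
            return {i: 0 for i in ids}
        # Cut along the coordinate direction with the largest spread.
        dims = len(points[0])
        spans = [max(p[d] for p in points) - min(p[d] for p in points)
                 for d in range(dims)]
        d = spans.index(max(spans))
        order = sorted(range(len(points)), key=lambda k: points[k][d])
        half = len(order) // 2
        left = rcb([points[k] for k in order[:half]],
                   [ids[k] for k in order[:half]], nparts // 2)
        right = rcb([points[k] for k in order[half:]],
                    [ids[k] for k in order[half:]], nparts - nparts // 2)
        part = dict(left)
        part.update({i: p + nparts // 2 for i, p in right.items()})
        return part

    pts = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.7), (0.8, 0.1)]
    print(rcb(pts, [10, 11, 12, 13], 2))   # {10: 0, 12: 0, 13: 1, 11: 1}

Because the cuts move only as far as the geometry does, small deformations from one timestep to the next produce a similar decomposition, which is the incrementality exploited on the next slide.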

Parallel Crash Simulations
- Outline of parallel timestep in Pronto code:
  1. Finite element analysis in static, graph-based decomposition.
  2. Map contact objects to RCB decomposition.
  3. Update RCB decomposition.
  4. Communicate objects crossing processor boundaries.
  5. Each processor searches for contacts.
  6. Communicate contacts back to finite element decomposition.
  7. Correct deformations, etc.
- Observations:
  - Reuse serial contact software.
  - RCB is incremental & facilitates determination of communication.
  - Each phase should scale well.
  - Cost of mapping between decompositions?

Can Crush Example
(Figure panels: finite element decomposition; RCB decomposition as the can gets crushed; RCB on the undeformed geometry.)

Scalability
- Can crush problem, 3800 elements per compute node, run to 100 ms.
- A 13.8-million element simulation on 3600 compute nodes ran at 0.32 seconds per time step (120.4 Gflops/s).
(Chart legend: contacts, FEM, overhead.)

Sandia National Laboratories Graph Partitioning Workshop Oct. 15, Micro-mechanics of Foam

Myth 6: All the Problems are Solved
- Biggest & most damaging myth of all!
- Already discussed need for:
  - More accurate & expressive models.
  - Algorithms for partitioning new models.
  - More user-friendly tools.
- Lots of other open problems.

Open Problems
- Partitioning for multiple goals.
  - Examples:
    - Multiple phases in a calculation.
    - Minimize communication volume and number of messages.
  - Multi-objective/constraint work is a partial answer.
    - Only models costs on graph vertices or edges.
    - Can't minimize number of messages.
  - New ideas are needed.
- Partition to balance work of sparse solve on each subdomain.
  - Applications to FETI preconditioner, parallel sparse solvers, etc.
  - Complicated dependence on topology of submesh.
  - Can't predict cost by standard edge/vertex weights.
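As a small illustration of the multi-phase goal (a sketch under my own toy setup, not an algorithm from the talk): give each vertex one weight per phase and check the imbalance of every phase separately. A partition that balances the combined weight can still be badly imbalanced in an individual phase.

    # Each vertex carries one weight per phase; a single partition must keep
    # every phase balanced at once (the multi-constraint formulation).
    def imbalance_per_phase(weights, part, nparts):
        nphases = len(next(iter(weights.values())))
        loads = [[0.0] * nparts for _ in range(nphases)]
        for v, w in weights.items():
            for ph in range(nphases):
                loads[ph][part[v]] += w[ph]
        return [max(l) / (sum(l) / nparts) for l in loads]   # 1.0 = perfect balance

    # Two phases, two parts: balanced in total weight, imbalanced per phase.
    weights = {0: (4, 1), 1: (1, 4), 2: (4, 1), 3: (1, 4)}
    part    = {0: 0, 2: 0, 1: 1, 3: 1}
    print(imbalance_per_phase(weights, part, 2))   # [1.6, 1.6]: each phase 60% over average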

More Open Problems
- Partitioning for heterogeneous parallel architectures.
  - E.g. clusters of SMPs, Beowulf machines, etc.
  - How to model heterogeneous parallel machines?
  - How to adapt partitioners to address non-standard architectures?
  - (See Teresco, Beall, Flaherty & Shephard.)
- Partitioning to minimize congestion.
  - Communication is limited by most heavily used wire.
  - How can we predict and avoid contention for wires?
  - (See Pellegrini or H., Leland & Van Driessche.)

More Information
- Contact info:
- Collaborators:
  - Rob Leland, Karen Devine, Tammy Kolda
  - Steve Plimpton, Bill Mitchell
- Work supported by DOE's MICS program
- Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed-Martin Company, for the US DOE under contract DE-AC04-94AL85000.