Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers
Gengbin Zheng, Esteban Meneses, Abhinav Bhatele, and Laxmikant V. Kale

Presentation transcript:

Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers
Gengbin Zheng, Esteban Meneses, Abhinav Bhatele, and Laxmikant V. Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign

Motivations
- Load balancing is key to scalability on very large supercomputers
- Load balancing becomes challenging:
  - Increasing machine and problem sizes lead to more complex and costly load balancing algorithms
  - A considerably large amount of resources is needed
- We must scale load balancing itself

Periodic Load Balancing
- Perform load balancing periodically, e.g. in a stop-and-go scheme
- Suited to persistent tasks
- Pay the load balancing cost only when it is needed
- Tasks and data migrate as needed

Charm++
- Parallel C++: objects with methods that can be called remotely
- Migratable objects
- Dynamic load balancing
- Fault tolerance
- An MPI implementation runs on top of Charm++
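To make remotely invocable, migratable objects concrete, here is a minimal illustrative Charm++ fragment, assuming the standard chare-array, PUP, and AtSync interfaces; the module and class names (worker, WorkChunk) are made up for this sketch, and the driving mainchare is omitted.

```cpp
// worker.ci (interface file), assumed contents:
//   module worker {
//     array [1D] WorkChunk {
//       entry WorkChunk();
//       entry void iterate();
//     };
//   };

// worker.C
#include <vector>
#include "pup_stl.h"        // PUP support for STL containers
#include "worker.decl.h"

class WorkChunk : public CBase_WorkChunk {
  std::vector<double> data;           // per-object state; migrates with the object
public:
  WorkChunk() : data(1000, 0.0) {
    usesAtSync = true;                // participate in measurement-based load balancing
  }
  WorkChunk(CkMigrateMessage* m) {}   // constructor used when the object migrates in

  void pup(PUP::er& p) {              // (de)serialize state so the runtime can move it
    CBase_WorkChunk::pup(p);
    p | data;
  }

  void iterate() {
    // ... one step of computation; the runtime measures its cost ...
    AtSync();                         // reached a point where migration is safe
  }

  void ResumeFromSync() {             // invoked after load balancing (and any migration)
    thisProxy[thisIndex].iterate();   // continue with the next step
  }
};

#include "worker.def.h"
```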

Principle of Persistence
- Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
- This holds in spite of dynamic behavior:
  - Abrupt, large, but infrequent changes (e.g. AMR)
  - Slow and small changes (e.g. particle migration)
- The parallel analog of the principle of locality
- A heuristic that holds for most CSE applications

Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation (the LB database) records communication volume and computation time
- Measurement-based load balancers use the database periodically to make new decisions
- Many alternative strategies can use the database:
  - Centralized vs. distributed
  - Greedy vs. refinement
  - Taking communication into account
  - Taking dependencies into account (more complex)
  - Topology-aware
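To make the idea of the LB database concrete, here is a minimal illustrative sketch (not the actual Charm++ data structure) of the kind of per-object record such instrumentation might accumulate; all names are assumptions made for this example.

```cpp
#include <cstdint>
#include <vector>

// One record per migratable object, as a measurement-based balancer might see it.
struct ObjectStats {
    int      objectId;     // which object
    int      currentPe;    // processor it currently lives on
    double   cpuTime;      // measured computation time since the last LB step (seconds)
    uint64_t bytesSent;    // measured communication volume since the last LB step
};

// The "LB database" is then essentially a collection of such records,
// refreshed each time the application reaches a load balancing point.
using LBDatabase = std::vector<ObjectStats>;
```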

Load Balancing Strategies
- Centralized:
  - Object load data are sent to processor 0
  - Integrated into a complete object graph
  - Migration decisions are broadcast from processor 0
  - Requires a global barrier
- Distributed:
  - Load balancing among neighboring processors
  - Builds only a partial object graph
  - Migration decisions are sent to neighbors
  - No global barrier
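As a concrete illustration of a simple centralized, greedy strategy (a sketch of the general technique, not the Charm++ implementation), the idea is: sort objects by measured load in decreasing order and repeatedly assign the heaviest remaining object to the currently least-loaded processor, tracked with a min-heap.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Obj { int id; double load; };

// Greedy mapping: heaviest object first, always onto the least-loaded processor.
// Returns mapping[objectId] = processor (assumes ids are 0..N-1).
std::vector<int> greedyAssign(std::vector<Obj> objs, int numPes) {
    std::sort(objs.begin(), objs.end(),
              [](const Obj& a, const Obj& b) { return a.load > b.load; });

    // Min-heap of (current processor load, processor id).
    using PE = std::pair<double, int>;
    std::priority_queue<PE, std::vector<PE>, std::greater<PE>> peHeap;
    for (int p = 0; p < numPes; ++p) peHeap.push({0.0, p});

    std::vector<int> mapping(objs.size());
    for (const Obj& o : objs) {
        auto [load, pe] = peHeap.top();
        peHeap.pop();
        mapping[o.id] = pe;
        peHeap.push({load + o.load, pe});
    }
    return mapping;
}
```

A refinement strategy, by contrast, would start from the existing mapping and move only a small number of objects off overloaded processors, trading some solution quality for lower decision cost and fewer migrations.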

Limitations of Centralized Strategies
- Consider an application with 1M objects on 64K processors
- Centralized strategies are inherently not scalable:
  - The central node becomes a memory and communication bottleneck
  - Decision-making algorithms tend to be very slow

Load Balancing Execution Time
(Figure: execution time of load balancing algorithms on 1M tasks)

Limitations of Distributed Strategies
- Each processor periodically exchanges load information and migrates objects among neighboring processors
- Performance improves only slowly:
  - Lack of global information
  - Difficult to converge quickly to as good a solution as a centralized strategy
- (Figure: results with NAMD on 256 processors)

A Hybrid Load Balancing Strategy
- Divide processors into independent groups; groups are organized into hierarchies (decentralized)
- Aggressive load balancing within sub-groups, combined with refinement-based cross-group load balancing
- Each group has a leader (its central node) which performs centralized load balancing
- Reuses existing centralized load balancing strategies

Hierarchical Tree (an Example)
- A 64K-processor hierarchical tree: at level 0, processors are divided into groups of 1024 (0-1023, 1024-2047, ..., 63488-64511, 64512-65535); the 64 group leaders form level 1, under a single root at level 2
- Apply different strategies at each level:
  - A more aggressive strategy at the lower level, taking advantage of faster communication within a group
  - A less aggressive, refinement-based algorithm at the higher level
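A tiny sketch of the arithmetic behind such a tree, assuming the group size from the example above (1024 processors per level-0 group) and that each group's leader is its lowest-numbered processor; function and constant names are illustrative.

```cpp
#include <cstdio>

// Example tree: 64K processors, level-0 groups of 1024,
// whose 64 leaders report to a single root.
constexpr int kGroupSize = 1024;

int groupOf(int pe)       { return pe / kGroupSize; }           // which level-0 group
int groupLeaderOf(int pe) { return groupOf(pe) * kGroupSize; }  // e.g. 0, 1024, ..., 64512

int main() {
    for (int pe : {0, 1023, 2047, 64512, 65535})
        std::printf("pe %5d -> group %2d, leader %5d\n",
                    pe, groupOf(pe), groupLeaderOf(pe));
    return 0;
}
```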

Issues
- Load data reduction: semi-centralized load balancing scheme
- Reducing data movement: token-based local balancing
- Topology-aware tree construction

Token-Based HybridLB Scheme
(Figure: load data flows from processors 0-1023, 1024-2047, ..., 64512-65535 up to their group leaders at 1024, ..., 63488, 64512; greedy-based load balancing is applied within groups and refinement-based load balancing at the root; the legend distinguishes tokens from objects)
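A sketch of one plausible reading of the token idea (our interpretation; all names are illustrative): during decision-making, only lightweight tokens carrying an object's id and measured load are shuffled between levels of the tree, and the real object data migrates once, directly to its final destination, after all decisions are made.

```cpp
#include <cstddef>
#include <vector>

// A token is a lightweight stand-in for a migratable object: it carries just
// enough information (id, measured load, current location) to make decisions.
struct Token {
    int    objectId;
    double load;
    int    homePe;   // where the real object currently lives
};

// Final migration orders, produced once token placement is settled.
struct Decision { int objectId; int fromPe; int toPe; };

// The decision phase moves tokens only; real (possibly large) object data is
// untouched until the final assignment is known.
std::vector<Decision> finalizeDecisions(const std::vector<Token>& tokens,
                                        const std::vector<int>& assignedPe) {
    std::vector<Decision> out;
    for (std::size_t i = 0; i < tokens.size(); ++i)
        if (tokens[i].homePe != assignedPe[i])   // move only if placement changed
            out.push_back({tokens[i].objectId, tokens[i].homePe, assignedPe[i]});
    return out;  // each real object then migrates once, directly to its new processor
}
```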

Performance Study with a Synthetic Benchmark
(Figure: lb_test benchmark on the Ranger cluster, 1M objects)

Load Balancing Time (lb_test)
(Figure: lb_test benchmark on the Ranger cluster)

Performance (lb_test)
(Figure: lb_test benchmark on the Ranger cluster)

Performance Study with a Synthetic Benchmark
(Figure: lb_test benchmark on Blue Gene/P, 1M objects)

NAMD Hierarchical LB
- NAMD implements its own specialized load balancing strategies, based on the Charm++ load balancing framework
- We extended NAMD's comprehensive and refinement-based strategies to work on subsets of processors

NAMD LB Time

NAMD LB Time (Comprehensive)

NAMD LB Time (Refinement)

NAMD Performance

Conclusions
- Load balancing is challenging and potentially costly on very large machines
- Hierarchical load balancing is effective:
  - Demonstrated on 64K cores with a synthetic benchmark
  - And on 16K cores with a real application (NAMD)

Thank you! Any questions?