Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010.

Slides:



Advertisements
Similar presentations
A Workflow Engine with Multi-Level Parallelism Supports Qifeng Huang and Yan Huang School of Computer Science Cardiff University
Advertisements

Resource Management §A resource can be a logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.
The Charm++ Programming Model and NAMD Abhinav S Bhatele Department of Computer Science University of Illinois at Urbana-Champaign
Dynamic Load Balancing for VORPAL Viktor Przebinda Center for Integrated Plasma Studies.
Serverless Network File Systems. Network File Systems Allow sharing among independent file systems in a transparent manner Mounting a remote directory.
Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.
Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.
An Evaluation of a Framework for the Dynamic Load Balancing of Highly Adaptive and Irregular Parallel Applications Kevin J. Barker, Nikos P. Chrisochoides.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
Adaptive MPI Chao Huang, Orion Lawlor, L. V. Kalé Parallel Programming Lab Department of Computer Science University of Illinois at Urbana-Champaign.
1 Introduction to Load Balancing: l Definition of Distributed systems. Collection of independent loosely coupled computing resources. l Load Balancing.
Efficient Parallelization for AMR MHD Multiphysics Calculations Implementation in AstroBEAR.
Architectural Design Principles. Outline  Architectural level of design The design of the system in terms of components and connectors and their arrangements.
Topology Aware Mapping for Performance Optimization of Science Applications Abhinav S Bhatele Parallel Programming Lab, UIUC.
Dynamic Load Sharing and Balancing Sig Freund. Outline Introduction Distributed vs. Traditional scheduling Process Interaction models Distributed Systems.
Design and Implementation of a Single System Image Operating System for High Performance Computing on Clusters Christine MORIN PARIS project-team, IRISA/INRIA.
1 Reasons for parallelization Can we make GA faster? One of the most promising choices is to use parallel implementations. The reasons for parallelization.
A Framework for Collective Personalized Communication Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan.
Charm++ Load Balancing Framework Gengbin Zheng Parallel Programming Laboratory Department of Computer Science University of Illinois at.
1CPSD NSF/DARPA OPAAL Adaptive Parallelization Strategies using Data-driven Objects Laxmikant Kale First Annual Review October 1999, Iowa City.
Application-specific Topology-aware Mapping for Three Dimensional Topologies Abhinav Bhatelé Laxmikant V. Kalé.
Scalable Algorithms for Structured Adaptive Mesh Refinement Akhil Langer, Jonathan Lifflander, Phil Miller, Laxmikant Kale Parallel Programming Laboratory.
Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant.
林俊宏 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang.
Adaptive MPI Milind A. Bhandarkar
Supporting Multi-domain decomposition for MPI programs Laxmikant Kale Computer Science 18 May 2000 ©1999 Board of Trustees of the University of Illinois.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
Meta Scheduling Sathish Vadhiyar Sources/Credits/Taken from: Papers listed in “References” slide.
1 Multiprocessor and Real-Time Scheduling Chapter 10 Real-Time scheduling will be covered in SYSC3303.
October 18, 2005 Charm++ Workshop Faucets A Framework for Developing Cluster and Grid Scheduling Solutions Presented by Esteban Pauli Parallel Programming.
Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC.
Supercomputing ‘99 Parallelization of a Dynamic Unstructured Application using Three Leading Paradigms Leonid Oliker NERSC Lawrence Berkeley National Laboratory.
NIH Resource for Biomolecular Modeling and Bioinformatics Beckman Institute, UIUC NAMD Development Goals L.V. (Sanjay) Kale Professor.
Overcoming Scaling Challenges in Bio-molecular Simulations Abhinav Bhatelé Sameer Kumar Chao Mei James C. Phillips Gengbin Zheng Laxmikant V. Kalé.
Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.
CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.
Motivation: Sorting is among the fundamental problems of computer science. Sorting of different datasets is present in most applications, ranging from.
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
CS 484 Load Balancing. Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.
Partitioning using Mesh Adjacencies  Graph-based dynamic balancing Parallel construction and balancing of standard partition graph with small cuts takes.
Parallelizing Spacetime Discontinuous Galerkin Methods Jonathan Booth University of Illinois at Urbana/Champaign In conjunction with: L. Kale, R. Haber,
Data Structures and Algorithms in Parallel Computing Lecture 7.
Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.
Parallelization Strategies Laxmikant Kale. Overview OpenMP Strategies Need for adaptive strategies –Object migration based dynamic load balancing –Minimal.
A uGNI-Based Asynchronous Message- driven Runtime System for Cray Supercomputers with Gemini Interconnect Yanhua Sun, Gengbin Zheng, Laximant(Sanjay) Kale.
Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana- Champaign.
Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab University of Illinois at Urbana-Champaign.
Motivation: dynamic apps Rocket center applications: –exhibit irregular structure, dynamic behavior, and need adaptive control strategies. Geometries are.
Predictive Load Balancing Using Mesh Adjacencies for Mesh Adaptation  Cameron Smith, Onkar Sahni, Mark S. Shephard  Scientific Computation Research Center.
Load Balancing : The Goal Given a collection of tasks comprising a computation and a set of computers on which these tasks may be executed, find the mapping.
Dynamic Load Balancing Tree and Structured Computations.
FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.
COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dr. Xiao Qin Auburn University
COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dynamic Mapping Dr. Xiao Qin Auburn University
Department of Computer Science, Johns Hopkins University Lecture 7 Finding Concurrency EN /420 Instructor: Randal Burns 26 February 2014.
Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.
Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn.
OpenMosix, Open SSI, and LinuxPMI
Parallel Programming By J. H. Wang May 2, 2017.
Performance Evaluation of Adaptive MPI
Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab
Component Frameworks:
Title Meta-Balancer: Automated Selection of Load Balancing Strategies
Communication and Memory Efficient Parallel Decision Tree Construction
Milind A. Bhandarkar Adaptive MPI Milind A. Bhandarkar
CS 584.
BigSim: Simulating PetaFLOPS Supercomputers
Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kale
Parallel Programming in C with MPI and OpenMP
An Orchestration Language for Parallel Objects
Presentation transcript:

Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010

Outline Dynamic Load Balancing framework in Charm++ Motivations Hierarchical load balancing strategy 2Charm++ Workshop 2010

Charm++ Dynamic Load- Balancing Framework One of the most popular reasons to use Charm++/AMPI Fully automatic Adaptive Application independent Modular, and extendable 3Charm++ Workshop 2010

Principle of Persistence Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time In spite of dynamic behavior Abrupt and large,but infrequent changes (e.g. AMR) Slow and small changes (e.g. particle migration) Parallel analog of principle of locality Heuristics, that holds for most CSE applications 4Charm++ Workshop 2010

Measurement Based Load Balancing Based on Principle of persistence Runtime instrumentation (LB Database) communication volume and computation time Measurement based load balancers Use the database periodically to make new decisions Many alternative strategies can use the database Centralized vs distributed Greedy vs refinement Taking communication into account Taking dependencies into account (More complex) Topology-aware 5Charm++ Workshop 2010

Load Balancer Strategies Centralized Object load data are sent to processor 0 Integrate to a complete object graph Migration decision is broadcasted from processor 0 Global barrier Distributed Load balancing among neighboring processors Build partial object graph Migration decision is sent to its neighbors No global barrier 6Charm++ Workshop 2010

Limitations of Centralized Strategies Now consider an application with 1M objects on 64K processors Limitations (inherently not scalable) Central node - memory/communication bottleneck Decision-making algorithms tend to be very slow We demonstrate these limitations using the simulator we developed 7Charm++ Workshop 2010

Memory Overhead (simulation results with lb_test) Lb_test benchmark is a parameterized program that creates a specified number of communicating objects in 2D-mesh. Run on Lemieux 64 processors 8Charm++ Workshop 2010

Load Balancing Execution Time Execution time of load balancing algorithms on 64K processor simulation 9Charm++ Workshop 2010

10 Limitations of Distributed Strategies Each processor periodically exchange load information and migrate objects among neighboring processors Performance improved slowly Lack of global information Difficult to converge quickly to as good a solution as a centralized strategy Result with NAMD on 256 processors

A Hybrid Load Balancing Strategy Dividing processors into independent sets of groups, and groups are organized in hierarchies (decentralized) Aggressive load balancing in sub-groups, combined with Refinement-based cross-group load balancing Each group has a leader (the central node) which performs centralized load balancing Reuse existing centralized load balancing 11Charm++ Workshop 2010

Hierarchical Tree (an example) 0 … … 1024 … … … K processor hierarchical tree Apply different strategies at each level Level 0 Level 1 Level Charm++ Workshop 2010

Issues Load data reduction Semi-centralized load balancing scheme Reducing data movement Token-based local balancing Topology-aware tree construction Charm++ Workshop

Token-based HybridLB Scheme 0 … … 1024 … … … Load Data (OCG) Refinement-based Load balancing Greedy-based Load balancing Load Data token object 14Charm++ Workshop 2010

Performance Study with Synthetic Benchmark lb_test benchmark on Ranger Cluster (1M objects) 15Charm++ Workshop 2010

Load Balancing Time (lb_test) lb_test benchmark on Ranger Cluster 16Charm++ Workshop 2010

Performance (lb_test) lb_test benchmark on Ranger Cluster 17Charm++ Workshop 2010

NAMD Hierarchical LB NAMD implements its own specialized load balancing strategies Based on Charm++ load balancing framework Extended NAMD comprehensive and refinement-based solution Work on subset of processors Charm++ Workshop

NAMD LB Time Charm++ Workshop

NAMD LB Time (Comprehensive) Charm++ Workshop

NAMD LB Time (Refinement) Charm++ Workshop

NAMD Performance Charm++ Workshop

Conclusions Scalable LBs are needed due to large machines like BG/P Avoid memory and communication bottleneck Achieve similar result to the more expensive centralized load balancer Take processor topology into account 23Charm++ Workshop 2010