Dynamic Load Balancing in Charm++
Abhinav S Bhatele
Parallel Programming Lab, UIUC

Outline
- Dynamic Load Balancing framework in Charm++
- Measurement Based Load Balancing
- Examples:
  - Hybrid Load Balancers
  - Topology-aware Load Balancers
- User Control and Flexibility
- Future Work

Dynamic Load-Balancing
- Task of load balancing (LB)
  - Given a collection of migratable objects and a set of processors
  - Find a mapping of objects to processors
    - Almost the same amount of computation on each processor
  - Additional constraints
    - Keep communication between processors to a minimum
    - Take the topology of the machine into consideration
- Dynamic mapping of chares to processors
  - Load on processors keeps changing during the actual execution
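
The "almost the same amount of computation on each processor" goal can be made concrete with a simple imbalance measure. The sketch below is illustrative only and is not part of Charm++: the function names and the max-over-average ratio are assumptions chosen for the example. A ratio of 1.0 means a perfectly even mapping.

```python
def processor_loads(object_load, mapping, num_procs):
    """Sum the load of every object onto the processor it is mapped to."""
    loads = [0.0] * num_procs
    for obj, proc in mapping.items():
        loads[proc] += object_load[obj]
    return loads


def imbalance_ratio(loads):
    """Ratio of the most loaded processor to the average load (hypothetical metric)."""
    avg = sum(loads) / len(loads)
    return max(loads) / avg if avg > 0 else 1.0


# Three objects on two processors: an even mapping versus everything on processor 0.
object_load = {"a": 4.0, "b": 2.0, "c": 2.0}
even = processor_loads(object_load, {"a": 0, "b": 1, "c": 1}, 2)
skewed = processor_loads(object_load, {"a": 0, "b": 0, "c": 0}, 2)
print(imbalance_ratio(even), imbalance_ratio(skewed))
```

Because the slowest processor determines the time of an iteration, a ratio well above 1.0 suggests that remapping objects may pay off.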

Load-Balancing Approaches
- A rich set of strategies in Charm++
- Two main ideas
  - No correlation between successive iterations
    - Fully dynamic
    - Seed load balancers
  - Load varies slightly over iterations
    - CSE, Molecular Dynamics simulations
    - Measurement-based load balancers

Principle of Persistence
- Object communication patterns and computational loads tend to persist over time
  - In spite of dynamic behavior
    - Abrupt and large, but infrequent changes (e.g. AMR)
    - Slow and small changes (e.g. particle migration)
- Parallel analog of the principle of locality
  - A heuristic that holds for most CSE applications

Measurement Based Load Balancing
- Based on principle of persistence
- Runtime instrumentation (LB Database)
  - communication volume and computation time
- Measurement based load balancers
  - Use the database periodically to make new decisions
  - Many alternative strategies can use the database
    - Centralized vs. distributed
    - Greedy improvements vs. complete reassignment
    - Topology-aware
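
As a rough illustration of a measurement-based strategy that performs a complete reassignment, the sketch below takes per-object loads measured in earlier iterations and places the heaviest objects first on the currently least-loaded processor. It is a minimal sketch under stated assumptions, not the Charm++ implementation: `measured_load` stands in for the LB database, communication volume is ignored, and object identifiers are assumed to be sortable (strings here).

```python
import heapq


def greedy_assign(measured_load, num_procs):
    """Map each object to a processor: heaviest object first, least-loaded processor first."""
    # Min-heap of (accumulated load, processor id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    mapping = {}
    # Sort by decreasing load; ties are broken by object id for determinism.
    for obj in sorted(measured_load, key=lambda o: (-measured_load[o], o)):
        load, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (load + measured_load[obj], proc))
    return mapping


# Loads measured in the previous iterations (hypothetical values).
measured = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}
print(greedy_assign(measured, 2))
```

The principle of persistence is what justifies feeding past measurements into the next mapping: if loads changed arbitrarily between iterations, the measured values would say little about the future.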

Load Balancer Strategies
- Centralized
  - Object load data are sent to processor 0
  - Integrate to a complete object graph
  - Migration decision is broadcast from processor 0
  - Global barrier
- Distributed
  - Load balancing among neighboring processors
  - Build partial object graph
  - Migration decision is sent to its neighbors
  - No global barrier

Load Balancing on Large Machines
- Existing load balancing strategies don’t scale on extremely large machines
- Limitations of centralized strategies:
  - Central node: memory/communication bottleneck
  - Decision-making algorithms tend to be very slow
- Limitations of distributed strategies:
  - Difficult to achieve well-informed load balancing decisions

Simulation Study - Memory Overhead
- The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh
- Simulation performed with the performance simulator BigSim

Load Balancing Execution Time
[Figure: execution time of load balancing algorithms on a 64K processor simulation]

Hierarchical Load Balancers
- Hierarchical distributed load balancers
  - Divide into processor groups
  - Apply different strategies at each level
  - Scalable to a large number of processors

Hierarchical Tree (an example)
[Figure: processor hierarchical tree, with node labels 0 … 1024 … and levels Level 0, Level 1, …]
- Apply different strategies at each level

An Example: Hybrid LB
- Processors are divided into independent sets of groups, and the groups are organized in hierarchies (decentralized)
- Each group has a leader (the central node) which performs centralized load balancing
- A particular hybrid strategy that works well
- Gengbin Zheng, PhD Thesis, 2005
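
The group-and-leader idea can be sketched with a two-level toy model. Everything in the sketch is an assumption made for illustration: the function names, the fixed `group_size`, the choice of a refinement-style step between groups and a greedy step inside each group, and the omission of communication costs. It is not the scheme from the thesis cited above, only one simple instance of "different strategies at each level".

```python
import heapq


def greedy_assign(load, procs):
    """Group leader: heaviest object first onto the least-loaded processor of the group."""
    heap = [(0.0, p) for p in procs]
    heapq.heapify(heap)
    mapping = {}
    for obj in sorted(load, key=lambda o: (-load[o], o)):
        total, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (total + load[obj], proc))
    return mapping


def hybrid_balance(load, current, num_procs, group_size):
    """Two-level balancing: refine across groups, then balance greedily within each group."""
    groups = [list(range(s, min(s + group_size, num_procs)))
              for s in range(0, num_procs, group_size)]
    # Objects currently owned by each group, keyed by object id.
    members = [{} for _ in groups]
    for obj, proc in current.items():
        members[proc // group_size][obj] = load[obj]

    def per_proc(g):
        return sum(members[g].values()) / len(groups[g])

    # Top level: move the lightest object from the most loaded group to the
    # least loaded one while that still helps (bounded number of moves).
    for _ in range(len(load)):
        hi = max(range(len(groups)), key=per_proc)
        lo = min(range(len(groups)), key=per_proc)
        if hi == lo or not members[hi]:
            break
        obj = min(members[hi], key=lambda o: (members[hi][o], o))
        weight = members[hi][obj]
        if (sum(members[lo].values()) + weight) / len(groups[lo]) >= per_proc(hi):
            break
        members[lo][obj] = members[hi].pop(obj)

    # Lower level: each group leader balances only its own objects.
    mapping = {}
    for g, procs in enumerate(groups):
        mapping.update(greedy_assign(members[g], procs))
    return mapping


load = {"a": 4, "b": 4, "c": 4, "d": 4}
start = {obj: 0 for obj in load}  # every object starts on processor 0
print(hybrid_balance(load, start, num_procs=4, group_size=2))
```

The point of the split is that no single node ever holds the load data for all objects: the top level sees only per-group totals and a few migrating objects, and each leader sees only its own group.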

Our HybridLB Scheme
[Figure: processor tree with node labels 0 … 1024 …; labels: Load Data (OCG), Refinement-based Load balancing, Greedy-based Load balancing, Load Data, token object]

Memory Overhead
[Figure: simulation of lb_test (for 64K processors)]

Total Load Balancing Time
[Table: columns indexed by number of processors (N procs)]
Memory: 6.8 MB | 22.57 MB | 22.63 MB
lb_test benchmark’s actual run on BG/L at IBM (512K objects)

Load Balancing Quality

Topology-aware LB
- Problem
  - Map tasks to processors connected in a topology, such that:
    - Compute load on processors is balanced
    - Communicating chares (objects) are placed on nearby processors

Mapping Model
- Task graph:
  - $G_t = (V_t, E_t)$
  - Weighted graph, undirected edges
  - Nodes represent chares; $w(v_a)$ is the computation of $v_a$
  - Edges represent communication; $c_{ab}$ is the bytes exchanged between $v_a$ and $v_b$
- Topology graph:
  - $G_p = (V_p, E_p)$
  - Nodes represent processors
  - Edges represent direct network links
  - Ex: 3D-Torus, 2D-Mesh, Hypercube

Model (Contd.)
- Task Mapping
  - Assigns tasks to processors
  - $P : V_t \to V_p$
- Hop-Bytes
  - Hop-bytes serve as the communication cost
  - Weigh inter-processor communication by distance on the network
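
One common way to write the hop-bytes metric is as the sum, over all task-graph edges, of the bytes exchanged times the number of network hops between the two processors the tasks are mapped to: $\sum_{(v_a, v_b) \in E_t} c_{ab} \cdot d(P(v_a), P(v_b))$. The sketch below computes it for a 3D torus. The function names, the coordinate-tuple representation of processors, and shortest-path distance with wraparound links are assumptions made for the example.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between two coordinates on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(edges, placement, dims):
    """Total of bytes times hops over all communicating task pairs.

    edges:     {(task_a, task_b): bytes exchanged}
    placement: {task: processor coordinate}, playing the role of the mapping P
    """
    return sum(nbytes * torus_hops(placement[u], placement[v], dims)
               for (u, v), nbytes in edges.items())


dims = (4, 4, 4)
placement = {"u": (0, 0, 0), "v": (3, 0, 0), "w": (2, 2, 0)}
edges = {("u", "v"): 100, ("u", "w"): 10}
# u and v are 1 hop apart through the wraparound link; u and w are 4 hops apart.
print(hop_bytes(edges, placement, dims))  # 100*1 + 10*4 = 140
```

A topology-aware balancer would compare candidate placements by this value, preferring the mapping with fewer hop-bytes among those with similar compute balance.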

Load Balancing Framework in Charm++
- Issues of mapping and decomposition are separated
- User has full control over mapping
- Many choices
  - Initial static mapping
  - Mapping at run-time as newer objects are created
  - Write a new load balancing strategy: inherit from CentralLB or NeighborLB

Future Work
- Hybrid Model-based Load Balancers
  - User gives a model to the LB
  - Combine it with a measurement-based load balancer
- Multicast-aware Load Balancers
  - Try to place targets of a multicast on the same processor

Conclusions
- Measurement based LBs are good for most cases
- Need scalable LBs in the future due to large machines like BG/L
  - Hybrid LBs
  - Topology aware LBs
  - Model based LBs
  - Communication sensitive LBs