Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.

Outline
- Dynamic load-balancing framework in Charm++
- Load balancing on large machines
- Scalable load balancers
- Topology-aware load balancers

Dynamic Load-Balancing Framework in Charm++
The load-balancing task in Charm++:
- Given a collection of migratable objects and a set of processors connected in a certain topology,
- find a mapping of objects to processors such that:
  - each processor gets almost the same amount of computation, and
  - communication between processors is minimized.
- The mapping of chares to processors is dynamic.
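As a minimal sketch of the computation-balance half of this problem (plain Python, not the Charm++ API; the object loads are made up for illustration), a greedy heaviest-first assignment:

```python
import heapq

def greedy_map(object_loads, num_procs):
    """Assign each object (heaviest first) to the currently
    least-loaded processor; returns object-index -> processor."""
    # Min-heap of (current load, processor id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    mapping = {}
    # Sort objects by decreasing load for a better greedy balance.
    for obj in sorted(range(len(object_loads)),
                      key=lambda o: -object_loads[o]):
        load, p = heapq.heappop(heap)
        mapping[obj] = p
        heapq.heappush(heap, (load + object_loads[obj], p))
    return mapping

loads = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]
m = greedy_map(loads, 2)
per_proc = [sum(loads[o] for o in m if m[o] == p) for p in range(2)]
print(per_proc)  # both processors end up with 8.0
```

Note that this ignores communication entirely; the topology-aware strategies later in the talk address that second half.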

Load-Balancing Approaches
Two major approaches:
- No predictability of load patterns (fully dynamic)
  - Early work on state-space search, branch-and-bound, ...
  - Seed load balancers
- Some predictability (CSE, molecular dynamics simulations)
  - Measurement-based load-balancing strategies

Principle of Persistence
- Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time,
- even in spite of dynamic behavior:
  - abrupt, large, but infrequent changes (e.g., AMR)
  - slow, small changes (e.g., particle migration)
- The parallel analog of the principle of locality
- A heuristic that holds for most CSE applications

Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation (the LB database) records communication volume and computation time
- Measurement-based load balancers use the database periodically to make new decisions
- Many alternative strategies can use the database:
  - centralized vs. distributed
  - greedy improvements vs. complete reassignments
  - taking communication into account
  - taking dependencies into account (more complex)
  - topology-aware
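As a toy illustration of how measured data can drive such decisions (plain Python, not the Charm++ API; the dict standing in for the LB database and all names are hypothetical), a refinement-style pass that migrates objects off overloaded processors:

```python
def refine(measured, mapping, num_procs, tolerance=1.05):
    """Move objects off overloaded processors until every processor
    is within `tolerance` of the average measured load."""
    proc_load = [0.0] * num_procs
    for obj, p in mapping.items():
        proc_load[p] += measured[obj]
    avg = sum(proc_load) / num_procs
    migrations = []
    # Consider the heaviest objects first (snapshot of the mapping).
    for obj, p in sorted(mapping.items(), key=lambda kv: -measured[kv[0]]):
        if proc_load[p] <= tolerance * avg:
            continue  # this object's processor is not overloaded
        target = min(range(num_procs), key=lambda q: proc_load[q])
        if proc_load[target] + measured[obj] < proc_load[p]:
            proc_load[p] -= measured[obj]
            proc_load[target] += measured[obj]
            mapping[obj] = target
            migrations.append((obj, p, target))
    return migrations

# Measured times from a previous iteration (principle of persistence).
measured = {"a": 5.0, "b": 1.0, "c": 1.0, "d": 1.0}
mapping = {"a": 0, "b": 0, "c": 0, "d": 1}
moves = refine(measured, mapping, num_procs=2)
```

The key point is that the strategy consults only previously measured times, trusting that the next iteration will look similar.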

Load Balancer Strategies
- Centralized
  - Object load data are sent to processor 0
  - Integrated into a complete object graph
  - Migration decisions are broadcast from processor 0
  - Requires a global barrier
- Distributed
  - Load balancing among neighboring processors
  - Builds partial object graphs
  - Migration decisions are sent only to neighbors
  - No global barrier

Load Balancing on Very Large Machines – New Challenges
- Existing load-balancing strategies don't scale to extremely large machines
- Consider an application with 1M objects on 64K processors
- Limiting factors and issues:
  - the decision-making algorithm: it is difficult to make well-informed load-balancing decisions at this scale
  - resource limitations

Limitations of Centralized Strategies
- Effective on a small number of processors; easy to achieve good load balance
- Limitations (inherently not scalable):
  - the central node becomes a memory/communication bottleneck
  - decision-making algorithms tend to be very slow
- We demonstrate these limitations using the simulator we developed

Memory Overhead (simulation results with lb_test)
The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on Lemieux, 64 processors.

Load Balancing Execution Time
Execution time of load-balancing algorithms in a 64K-processor simulation

Why Hierarchical LB?
- Centralized load balancer
  - Communication bottleneck on processor 0
  - Memory constraint
- Fully distributed load balancer
  - Neighborhood balancing, without global load information
- Hierarchical distributed load balancer
  - Divide processors into groups
  - Apply different strategies at each level
  - Scalable to a large number of processors

A Hybrid Load-Balancing Strategy
- Divide processors into independent groups, with the groups organized into hierarchies (decentralized)
- Each group has a leader (its central node) that performs centralized load balancing within the group
- A particular hybrid strategy that works well: Gengbin Zheng, PhD thesis, 2005

Hierarchical Tree (an example)
[Figure: a 64K-processor hierarchical tree, with processors 0, 1024, ... serving as group leaders; a different strategy can be applied at each level (level 0, level 1, ...).]

Our HybridLB Scheme
[Figure: the HybridLB scheme on the same tree — load data (OCG) is aggregated up to the group leaders; greedy-based load balancing is applied at the upper level and refinement-based load balancing below, with tokens and objects migrating to their new processors.]
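The grouping underlying this scheme can be sketched as follows (illustrative Python; the branching factor is a stand-in for the real configuration, e.g. groups of 1024 on a 64K-processor machine):

```python
def build_tree(num_procs, branching):
    """Group processors under leaders, level by level; each leader
    performs centralized balancing only for its own group, so no
    node ever holds the full object graph."""
    level = list(range(num_procs))
    tree = []
    while len(level) > 1:
        groups = [level[i:i + branching]
                  for i in range(0, len(level), branching)]
        tree.append(groups)
        level = [g[0] for g in groups]  # first member becomes the leader
    return tree

tree = build_tree(8, 2)
# level 0 groups: [[0,1],[2,3],[4,5],[6,7]] with leaders 0, 2, 4, 6
# level 1 groups: [[0,2],[4,6]] with leaders 0, 4
# level 2 group:  [[0,4]] with root leader 0
```

Because each leader only sees data for its subtree, memory and decision time per node stay bounded as the machine grows.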

Simulation Study - Memory Usage Simulation of lb_test benchmark with the performance simulator

Total Load Balancing Time

Load Balancing Quality

Topology-Aware Mapping of Tasks
Problem: map tasks to processors connected in a topology such that:
- the compute load on the processors is balanced, and
- communicating chares (objects) are placed on nearby processors.

Mapping Model
- Task graph: G_t = (V_t, E_t)
  - A weighted graph with undirected edges
  - Nodes are chares; w(v_a) is the computation load of v_a
  - Edges are communication; c_ab is the number of bytes exchanged between v_a and v_b
- Topology graph: G_p = (V_p, E_p)
  - Nodes are processors
  - Edges are direct network links
  - Examples: 3D torus, 2D mesh, hypercube

Model (Cont.)
- Task mapping: assigns tasks to processors, P : V_t → V_p
- Hop-bytes as a measure of communication cost:
  - hop-bytes = Σ_{(a,b) ∈ E_t} c_ab · d(P(v_a), P(v_b)), i.e., each inter-processor communication weighted by the distance it travels on the network
  - the more links a message uses, the higher the cost it imposes on the network

Metric
Minimize hop-bytes or, equivalently, hops-per-byte: the average number of hops traveled by a byte under a task mapping.
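The metric is straightforward to compute from a mapping (sketch in Python; the 2D-mesh Manhattan distance is one illustrative choice of topology):

```python
def hop_bytes(edges, placement, dist):
    """Sum over task-graph edges of bytes * network hops between
    the processors the two endpoints are placed on."""
    return sum(c_ab * dist(placement[a], placement[b])
               for a, b, c_ab in edges)

def mesh_dist(p, q):
    """Manhattan distance between two (x, y) processors on a 2D mesh."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Two tasks exchanging 100 bytes, placed 3 hops apart -> 300 hop-bytes.
edges = [("t0", "t1", 100)]
placement = {"t0": (0, 0), "t1": (2, 1)}
hb = hop_bytes(edges, placement, mesh_dist)
print(hb)  # 300
```

Here hops-per-byte is 300 / 100 = 3; moving t1 next to t0 would bring both down.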

TopoLB: Topology-Aware LB
Overview:
- First coalesce the task graph to n nodes (n = number of processors)
  - MetisLB is used because it reduces inter-group communication
  - GreedyLB, GreedyCommLB, etc. can also be used
- Repeat n times:
  - pick a task t and a processor p
  - place t on p (P(t) ← p)
Tarun Agarwal, MS thesis, 2005

Picking t, p
- t is the task whose placement in this iteration is most critical
- p is the processor where placing t costs least
- The cost of placing t on p is approximated by the hop-bytes it incurs with already-placed tasks:
    cost(t, p) ≈ Σ_{t' already placed} c(t, t') · d(p, P(t'))

Picking t, p (Cont.)
- Criticality of placing t in this iteration: by how much would the cost of placing t increase if it were deferred?
- Future cost: t would end up on some (effectively random) processor in a later iteration
- Criticality of t ≈ (expected cost of a random future placement) − (cost of its best placement now)

Putting it together
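One iteration of this loop can be sketched as follows (simplified Python; the criticality function here is an illustrative approximation — expected cost on a random free processor minus the cost on the best one — not the exact function from the thesis):

```python
def topolb_step(unplaced, placement, free, comm, dist):
    """One TopoLB-style iteration: pick the task whose deferral is
    most costly, and place it on its cheapest free processor.
    comm[(a, b)] = bytes between tasks a and b."""
    def cost(t, p):
        # Hop-bytes incurred between t on p and already-placed tasks.
        return sum(v * dist(p, placement[b if a == t else a])
                   for (a, b), v in comm.items()
                   if (a == t and b in placement) or
                      (b == t and a in placement))

    def criticality(t):
        costs = [cost(t, q) for q in free]
        return sum(costs) / len(costs) - min(costs)

    t = max(unplaced, key=criticality)
    p = min(free, key=lambda q: cost(t, q))
    return t, p

# Task "a" is already on processor 0; "b" talks to it, "c" talks to no one.
comm = {("a", "b"): 10}
placement = {"a": 0}
free = [1, 5]  # 1D processor line; distance is |p - q|
t, p = topolb_step(["b", "c"], placement, free, comm, lambda x, y: abs(x - y))
```

Here "b" is critical (placing it far away later would cost 50 instead of 10 hop-bytes), so it is placed first, on the nearby processor 1.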

TopoCentLB: A Faster Topology-Aware LB
- Coalesce the task graph to n nodes (n = number of processors)
- Picking task t and processor p:
  - t is the unplaced task with the maximum total communication to already-assigned tasks
  - p is the processor where placing t costs least

TopoCentLB (Cont.)
- Differences from TopoLB:
  - No notion of criticality
  - Considers only past mappings; does not look into the future
- Running complexity:
  - TopoLB: depends on the criticality function — between O(p·|E_t|) and O(p^3)
  - TopoCentLB: O(p·|E_t|), with a smaller constant than TopoLB
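A compact sketch of this greedy rule (plain Python; the task names, the symmetric comm dictionary, and the 1D distance in the example are all illustrative):

```python
def topocent_lb(tasks, comm, procs, dist):
    """Greedy topology-aware mapping: repeatedly take the unplaced
    task with the most communication to already-placed tasks and put
    it on the free processor minimizing hop-bytes to its neighbors.
    comm[(a, b)] = bytes between tasks a and b."""
    placement = {}
    free = set(procs)

    def bytes_to_placed(t):
        return sum(v for (a, b), v in comm.items()
                   if (a == t and b in placement) or
                      (b == t and a in placement))

    def cost(t, p):
        return sum(v * dist(p, placement[b if a == t else a])
                   for (a, b), v in comm.items()
                   if (a == t and b in placement) or
                      (b == t and a in placement))

    unplaced = list(tasks)
    # Seed: place the first task on an arbitrary free processor.
    first = unplaced.pop(0)
    p0 = min(free)
    placement[first] = p0
    free.remove(p0)
    while unplaced:
        t = max(unplaced, key=bytes_to_placed)
        p = min(free, key=lambda q: cost(t, q))
        placement[t] = p
        free.remove(p)
        unplaced.remove(t)
    return placement

# A 3-task chain on a 1D processor line: t0 -10- t1 -1- t2.
comm = {("t0", "t1"): 10, ("t1", "t2"): 1}
placement = topocent_lb(["t0", "t1", "t2"], comm,
                        [0, 1, 2], lambda p, q: abs(p - q))
```

Because it never reasons about unplaced tasks, each pick is cheap, which is where the smaller constant comes from.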

Results
- Strategies compared: TopoLB, TopoCentLB, and random placement
- Charm++ LB simulation mode (2D-Jacobi-like benchmark, LeanMD): reduction in hop-bytes
- BlueGene/L (2D-Jacobi-like benchmark): reduction in running time

Simulation Results
2D-mesh pattern on a 3D-torus topology (same size)

Simulation Results
LeanMD on a 3D torus

Experimental Results: BlueGene
2D-mesh pattern on a 3D torus (message size: 100 KB)

Experimental Results: BlueGene
2D-mesh pattern on a 3D mesh (message size: 100 KB)

Conclusions
- Large machines such as BG/L create the need for scalable load balancers
- Hybrid load balancers: a distributed approach that keeps communication localized within a neighborhood
- Efficient topology-aware task-mapping strategies that reduce hop-bytes also yield:
  - lower network latencies
  - better tolerance to contention and bandwidth constraints