Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC
Outline
- Dynamic load-balancing framework in Charm++
- Load balancing on large machines
- Scalable load balancers
- Topology-aware load balancers
Dynamic Load-Balancing Framework in Charm++
- The load-balancing task in Charm++: given a collection of migratable objects and a set of processors connected in a certain topology, find a mapping of objects to processors such that
  - each processor has almost the same amount of computation, and
  - communication between processors is minimized
- The mapping of chares to processors is dynamic: objects can migrate between processors at runtime
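To make "migratable objects" concrete, here is a minimal sketch of a Charm++ chare array element; the class name, its contents, and the assumed worker.ci interface file (declaring something like "array [1D] Worker { entry Worker(); };") are illustrative, not part of the original slides. The pup() routine is what lets the runtime pack the element's state and move it to another processor.

```cpp
#include <vector>
#include "worker.decl.h"   // generated from the assumed worker.ci interface file
#include "pup_stl.h"       // PUP support for std::vector

class Worker : public CBase_Worker {
  std::vector<double> state;              // per-chare data that migrates with the object
public:
  Worker() : state(1024, 0.0) {}
  Worker(CkMigrateMessage *m) {}          // migration constructor required by the runtime

  void pup(PUP::er &p) {                  // serialize/deserialize the chare's state
    CBase_Worker::pup(p);
    p | state;
  }
};

#include "worker.def.h"
```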
Load-Balancing Approaches
- Two major approaches:
  - No predictability of load patterns: fully dynamic schemes; early work on state-space search, branch and bound, etc.; seed load balancers
  - Some predictability (e.g., CSE, molecular dynamics simulations): measurement-based load-balancing strategies
Principle of Persistence
- Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
- This holds in spite of dynamic behavior:
  - abrupt and large, but infrequent, changes (e.g., AMR)
  - slow and small changes (e.g., particle migration)
- The parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation (the LB database) records communication volume and computation time
- Measurement-based load balancers use the database periodically to make new decisions
- Many alternative strategies can use the database:
  - centralized vs. distributed
  - greedy improvements vs. complete reassignments
  - taking communication into account
  - taking dependencies into account (more complex)
  - topology-aware
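As a hedged sketch of how an application opts into this: usesAtSync, AtSync(), and ResumeFromSync() are the actual Charm++ hooks, while Worker, iter, LB_PERIOD, doWork(), and nextStep() are illustrative names assumed to be declared by the chare class from the earlier sketch (its constructor would also set usesAtSync = true so the runtime records the object's load and communication in the LB database).

```cpp
void Worker::step() {
  doWork();                      // CPU time and message volume are recorded by the runtime
  if (++iter % LB_PERIOD == 0)
    AtSync();                    // reach a safe point and hand control to the load balancer
  else
    nextStep();
}

void Worker::ResumeFromSync() {  // invoked by the runtime once balancing (and migration) is done
  nextStep();
}
```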
Load Balancer Strategies
- Centralized:
  - object load data are sent to processor 0 and integrated into a complete object graph
  - migration decisions are broadcast from processor 0
  - requires a global barrier
- Distributed:
  - load balancing among neighboring processors, using a partial object graph
  - migration decisions are sent only to the neighbors
  - no global barrier
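For reference, the core of a centralized greedy reassignment (in the spirit of Charm++'s GreedyLB, but simplified and ignoring communication) can be sketched as below; the struct and function names are illustrative, and object ids are assumed to be 0..n-1.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

struct Obj  { int id; double load; };
struct Proc { int id; double load;
              bool operator>(const Proc &o) const { return load > o.load; } };

// Heaviest objects first, each placed on the currently least-loaded processor.
std::vector<int> greedyAssign(std::vector<Obj> objs, int nprocs) {
  std::sort(objs.begin(), objs.end(),
            [](const Obj &a, const Obj &b) { return a.load > b.load; });
  std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> procs;  // min-heap by load
  for (int p = 0; p < nprocs; ++p) procs.push({p, 0.0});

  std::vector<int> mapping(objs.size());     // object id -> processor
  for (const Obj &o : objs) {
    Proc p = procs.top(); procs.pop();       // least-loaded processor so far
    mapping[o.id] = p.id;
    p.load += o.load;
    procs.push(p);
  }
  return mapping;
}
```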
Load Balancing on Very Large Machines: New Challenges
- Existing load-balancing strategies don't scale on extremely large machines
- Consider an application with 1M objects on 64K processors
- Limiting factors and issues:
  - the decision-making algorithm
  - the difficulty of making well-informed load-balancing decisions
  - resource limitations
Limitations of Centralized Strategies
- Effective on a small number of processors: easy to achieve good load balance
- Limitations (inherently not scalable):
  - the central node becomes a memory/communication bottleneck
  - decision-making algorithms tend to be very slow
- We demonstrate these limitations using the simulator we developed
Memory Overhead (simulation results with lb_test)
- lb_test is a parameterized benchmark that creates a specified number of communicating objects arranged in a 2D mesh
- Run on 64 processors of Lemieux
Load Balancing Execution Time
- Execution time of the load-balancing algorithms in a 64K-processor simulation
Why Hierarchical LB?
- Centralized load balancer:
  - communication bottleneck on processor 0
  - memory constraints
- Fully distributed load balancer:
  - neighborhood balancing without global load information
- Hierarchical distributed load balancer:
  - divide the processors into groups and apply different strategies at each level
  - scalable to a large number of processors
A Hybrid Load-Balancing Strategy
- Divide the processors into independent groups, with the groups organized in a hierarchy (decentralized)
- Each group has a leader (the group's central node) that performs centralized load balancing within the group
- A particular hybrid strategy that works well: Gengbin Zheng, PhD thesis, 2005
Hierarchical Tree (an example)
- [Figure: a processor hierarchical tree, with processors 0, ..., 1024, ... grouped under leaders at Level 0, Level 1, ...]
- Apply different strategies at each level
Our HybridLB Scheme
- [Figure: load data (the object communication graph, OCG) flows up the tree; refinement-based load balancing is applied across groups at the upper level and greedy-based load balancing within each group; token objects stand in for migrating objects while decisions are made, so only load data, not object data, moves during balancing]
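A self-contained sketch of the cross-group step at the top of the tree appears below. It is illustrative only: plain doubles stand in for the LB database, one fixed "object" granularity is assumed, and the real HybridLB exchanges token objects within the runtime rather than operating on a vector in main().

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  // Total load reported by each group leader (e.g., by each 1024-processor group).
  std::vector<double> groupLoad = {120.0, 80.0, 95.0, 105.0};
  double avg = std::accumulate(groupLoad.begin(), groupLoad.end(), 0.0) / groupLoad.size();
  const double objLoad = 5.0;    // granularity of one migrated token/object

  // Refinement: repeatedly shift one object's worth of load from the heaviest
  // group to the lightest one while the heaviest is noticeably above average.
  while (true) {
    int hi = std::max_element(groupLoad.begin(), groupLoad.end()) - groupLoad.begin();
    int lo = std::min_element(groupLoad.begin(), groupLoad.end()) - groupLoad.begin();
    if (groupLoad[hi] - avg < objLoad) break;
    groupLoad[hi] -= objLoad;    // leader of group hi picks an object to give away
    groupLoad[lo] += objLoad;    // a token for it is sent to group lo's leader
  }

  for (std::size_t g = 0; g < groupLoad.size(); ++g)
    std::printf("group %zu: %.1f (average %.1f)\n", g, groupLoad[g], avg);
  return 0;
}
```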
Simulation Study - Memory Usage Simulation of lb_test benchmark with the performance simulator
Total Load Balancing Time
Load Balancing Quality
Topology-Aware Mapping of Tasks
- Problem: map tasks to processors connected in a topology such that
  - the compute load on the processors is balanced, and
  - communicating chares (objects) are placed on nearby processors
Mapping Model
- Task graph G_t = (V_t, E_t): a weighted graph with undirected edges
  - nodes are chares; w(v_a) is the computation of v_a
  - edges are communication; c_ab is the number of bytes exchanged between v_a and v_b
- Topology graph G_p = (V_p, E_p):
  - nodes are processors
  - edges are direct network links (e.g., 3D torus, 2D mesh, hypercube)
Model (Cont.)
- Task mapping: assigns tasks to processors, P : V_t → V_p
- Hop-bytes: a measure of communication cost
  - the cost imposed on the network is higher when more links are used
  - so weigh inter-processor communication by its distance (in hops) on the network
Metric
- Minimize hop-bytes, or equivalently hops-per-byte
- Hops-per-byte: the average number of hops traveled by a byte under a task mapping (formulas below)
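In symbols, using the task graph G_t and topology graph G_p defined above, the quantity being minimized can be written as

\[
\mathrm{HopBytes}(P) \;=\; \sum_{(v_a, v_b)\in E_t} c_{ab}\, d\big(P(v_a), P(v_b)\big),
\qquad
\mathrm{HopsPerByte}(P) \;=\; \frac{\mathrm{HopBytes}(P)}{\sum_{(v_a, v_b)\in E_t} c_{ab}},
\]

where d(p, q) is the number of network links (hops) between processors p and q in G_p.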
TopoLB: Topology-Aware LB
- Overview:
  - First coalesce the task graph to n nodes (n = number of processors)
    - MetisLB is used because it reduces inter-group communication; GreedyLB, GreedyCommLB, etc. could also be used
  - Repeat n times: pick a task t and a processor p, and place t on p (P(t) ← p)
- Tarun Agarwal, MS thesis, 2005
Picking t, p
- t is the task whose placement in this iteration is most critical
- p is the processor on which placing t costs least
- The cost of placing t on p is approximated from t's communication with the tasks already placed (one reading of the formulas is sketched below)
Picking t, p (Cont.)
- Criticality of placing t in this iteration: by how much would the cost of placing t increase if it were deferred?
- Future cost: t would then be placed on some effectively random free processor in a later iteration
- The criticality of t compares this expected future cost with the best cost available now (one concrete reading is sketched below)
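One way to write the cost and criticality that is consistent with the description above (a sketch under these assumptions, not necessarily the exact definitions used in the thesis):

\[
\mathrm{cost}(t,p) \;\approx\; \sum_{u\ \text{already placed}} c_{tu}\, d\big(p, P(u)\big),
\qquad
\mathrm{crit}(t) \;=\; \overline{\mathrm{cost}}(t) \;-\; \min_{p\ \text{free}} \mathrm{cost}(t,p),
\]

where \(\overline{\mathrm{cost}}(t)\) averages cost(t, p) over the free processors, i.e., the expected cost if t were instead placed on a random processor in a later iteration. Each iteration then places the task with the largest criticality on its cheapest free processor.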
Putting it together
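Putting the pieces together, here is a runnable sketch of the placement loop on a toy instance, following the cost/criticality reading sketched above (illustrative only, not the thesis implementation). C holds the bytes exchanged between the n coalesced tasks and D the hop distances between the n processors.

```cpp
#include <cstdio>
#include <limits>
#include <vector>

int main() {
  const int n = 4;
  // Toy inputs: a 4-task ring mapped onto a 4-processor line.
  double C[n][n] = {{0,10,0,10},{10,0,10,0},{0,10,0,10},{10,0,10,0}};
  double D[n][n] = {{0,1,2,3},{1,0,1,2},{2,1,0,1},{3,2,1,0}};

  std::vector<int>  place(n, -1);                  // task -> processor
  std::vector<bool> freeProc(n, true), unplaced(n, true);

  for (int iter = 0; iter < n; ++iter) {           // "repeat n times"
    int bestT = -1, bestP = -1;
    double bestCrit = -1.0, bestCost = 0.0;
    for (int t = 0; t < n; ++t) {
      if (!unplaced[t]) continue;
      double minCost = std::numeric_limits<double>::max(), sumCost = 0.0;
      int minP = -1, nfree = 0;
      for (int p = 0; p < n; ++p) {                // cost of t on every free processor
        if (!freeProc[p]) continue;
        double cost = 0.0;
        for (int u = 0; u < n; ++u)
          if (!unplaced[u]) cost += C[t][u] * D[p][place[u]];
        if (cost < minCost) { minCost = cost; minP = p; }
        sumCost += cost; ++nfree;
      }
      double crit = sumCost / nfree - minCost;     // expected loss from deferring t
      if (crit > bestCrit) { bestCrit = crit; bestT = t; bestP = minP; bestCost = minCost; }
    }
    place[bestT] = bestP;                          // P(t) <- p
    unplaced[bestT] = false;
    freeProc[bestP] = false;
    std::printf("iteration %d: task %d -> processor %d (cost %.1f)\n",
                iter, bestT, bestP, bestCost);
  }
  return 0;
}
```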
TopoCentLB: A Faster Topology-Aware LB
- Coalesce the task graph to n nodes (n = number of processors)
- Picking task t and processor p:
  - t is the task with the maximum total communication with the already-assigned tasks
  - p is the processor on which placing t costs least
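For contrast, TopoCentLB's choice of t can be sketched as below (same caveats as the previous sketch; here the communication matrix is a vector of vectors); p is then chosen as before, as the free processor on which t costs least.

```cpp
#include <cstddef>
#include <vector>

// Pick the unplaced task with the maximum total communication to the tasks that
// are already placed (no look-ahead into future placements).
int pickTaskTopoCent(const std::vector<std::vector<double>> &C,
                     const std::vector<bool> &unplaced) {
  int best = -1;
  double bestComm = -1.0;
  for (std::size_t t = 0; t < C.size(); ++t) {
    if (!unplaced[t]) continue;
    double comm = 0.0;
    for (std::size_t u = 0; u < C.size(); ++u)
      if (!unplaced[u]) comm += C[t][u];
    if (comm > bestComm) { bestComm = comm; best = static_cast<int>(t); }
  }
  return best;
}
```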
TopoCentLB (Cont.)
- Differences from TopoLB:
  - no notion of criticality as in TopoLB
  - considers only the mapping done so far; does not look into the future
- Running complexity:
  - TopoLB: depends on the criticality function, ranging from O(p·|E_t|) to O(p^3)
  - TopoCentLB: O(p·|E_t|), with a smaller constant than TopoLB
Results
- Strategies compared: TopoLB, TopoCentLB, and random placement
- Charm++ LB simulation mode (2D-Jacobi-like benchmark, LeanMD): reduction in hop-bytes
- BlueGene/L (2D-Jacobi-like benchmark): reduction in running time
Simulation Results 2D-Mesh pattern on a 3D-Torus topology (same size)
Simulation Results LeanMD on 3D-Torus
Experimental Results: BlueGene/L
- 2D-mesh communication pattern on a 3D torus (message size: 100 KB)
Experimental Results: BlueGene/L
- 2D-mesh communication pattern on a 3D mesh (message size: 100 KB)
Conclusions
- Scalable load balancers are needed for future large machines such as BG/L
- Hybrid load balancers: a distributed approach that keeps communication localized within a neighborhood
- Efficient topology-aware task-mapping strategies that reduce hop-bytes also lead to
  - lower network latencies, and
  - better tolerance of contention and bandwidth constraints