Published by Miles Banks. Modified over 9 years ago.
1
Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications
  Jonathan Lifflander, UIUC
  Sriram Krishnamoorthy, PNNL*
  Laxmikant Kale, UIUC
  HPDC 2012
2
Dynamic load balancing on 100,000 processor cores and beyond
3
HPDC’12, Delft
Iterative Applications
  Applications repeatedly executing the same computation
  Static or slowly evolving execution characteristics
  Execution characteristics preclude static balancing
    Application characteristics (comm. pattern, sparsity, …)
    Execution environment (topology, asymmetry, …)
  Challenge: Load-balancing such applications
4
Overdecomposition
  Expose greater levels of concurrency than the hardware supports
  Middleware (runtime) dynamically maps the concurrent tasks to hardware resources
  Abstraction supports continuous optimization and adaptation
    Improvements to load balancing
    New metrics (power, energy, graceful degradation, …)
    New features: fault tolerance, power/energy-awareness
5
Problem Statement
  Scalable load balancers for iterative overdecomposed applications
  We consider two alternatives:
    Persistence-based load balancing
    Work stealing
  How do these algorithms behave at scale?
  How do they compare?
6
Related Work
  Overdecomposition is a widely used approach
  Inspector-executor approaches employ start-time load balancers
  Earlier hierarchical load balancers typically do not consider localization
  Scalability of work stealing is not well understood – largest prior demonstration was on 8192 cores
  No comparative evaluation of the two schemes
7
TASCEL: Task Scheduling Library
  Runtime library for task-parallel programs
  Manages task collections for execution on distributed memory machines
  Compatible with native MPI programs
  Phase-based switch between SPMD and non-SPMD modes of execution
8
TASCEL Execution
  Task: basic unit of migratable execution
  Typical workflow:
    Create a task collection
    Seed it with one or more tasks
    Process tasks in the collection until termination is detected
  Processing of task collections
    Manages concurrency, faults, …
  Trade-offs exposed through implementation specializations
    Dynamic load balancing schemes
    Fault tolerance protocols
    …
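The typical workflow on this slide (create, seed, process until termination) can be sketched in a few lines. This is a single-process Python illustration only, not the TASCEL API: `TaskCollection`, `add_task`, `process`, and the example task `count_down` are hypothetical names, and termination detection is reduced to "the queue is empty".

```python
from collections import deque


class TaskCollection:
    """Hypothetical stand-in for a task collection (not the TASCEL API)."""

    def __init__(self):
        self._tasks = deque()
        self.executed = 0

    def add_task(self, fn, *args):
        # Used both to seed the collection and to let a task spawn more work.
        self._tasks.append((fn, args))

    def process(self):
        # Run until no tasks remain. A distributed runtime would instead
        # run a termination-detection protocol across all workers.
        while self._tasks:
            fn, args = self._tasks.popleft()
            fn(self, *args)
            self.executed += 1


def count_down(tc, n, results):
    # Example task: record n, then spawn a follow-up task until n reaches 0.
    results.append(n)
    if n > 0:
        tc.add_task(count_down, n - 1, results)


results = []
tc = TaskCollection()                 # 1. create a task collection
tc.add_task(count_down, 3, results)   # 2. seed it with one task
tc.process()                          # 3. process until termination
```

The seed task spawns three further tasks, so the collection drains after four executions.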
9
Load Balancers
  Greedy localized hierarchical persistence-based load balancing
  Retentive work stealing
10
Greedy Localized Hierarchical Persistence-based LB
  [Figure: diagram with elements labeled 0 to 5]
  Intuition: Satisfy local imbalance first
12
Retentive Work Stealing
  [Figure: Proc 1, Proc 2, Proc 3, …, Proc n; labels: Local Queues, Work Pool]
13
Retentive Work Stealing
  [Figure: task queue with markers head, split, and stail; regions labeled Local and Remote]
14
Retentive Work Stealing
  [Figure: task queue with markers head, split, and stail; label: Buffer of locally executed tasks]
  addTask(): add task to local region
  getTask(): remove task from local region
15
Retentive Work Stealing
  [Figure: task queue with markers head, split, and stail]
  releaseToShared(): move to shared portion
  acquireFromShared(): move to local portion
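The four queue operations named on the last two slides can be illustrated with a small split queue. This is a hedged, single-threaded Python sketch, not the TASCEL implementation: the index layout (shared portion in `[stail, split)`, local portion in `[split, head)`), the `steal` helper, and the snake_case method names are assumptions made for illustration, and all synchronization is omitted.

```python
class SplitQueue:
    """Illustrative split queue: thieves take from stail, the owner from head."""

    def __init__(self):
        self.buf = []    # task storage; indices grow toward head
        self.stail = 0   # first task still available to thieves
        self.split = 0   # boundary between shared and local portions

    @property
    def head(self):
        return len(self.buf)

    def add_task(self, task):
        # addTask(): add a task to the local region.
        self.buf.append(task)

    def get_task(self):
        # getTask(): remove a task from the local region, if any.
        if self.split == self.head:
            return None
        return self.buf.pop()

    def release_to_shared(self, n):
        # releaseToShared(): move up to n local tasks to the shared portion.
        n = min(n, self.head - self.split)
        self.split += n
        return n

    def acquire_from_shared(self, n):
        # acquireFromShared(): move up to n shared tasks back to the local portion.
        n = min(n, self.split - self.stail)
        self.split -= n
        return n

    def steal(self, n):
        # Assumed helper: a thief takes up to n tasks starting at stail.
        n = min(n, self.split - self.stail)
        stolen = self.buf[self.stail:self.stail + n]
        self.stail += n
        return stolen


q = SplitQueue()
for task_id in range(6):
    q.add_task(task_id)
released = q.release_to_shared(4)   # tasks 0..3 become stealable
stolen = q.steal(2)                 # a thief takes tasks 0 and 1
```

Moving only the `split` marker means the owner can expose or reclaim work without copying tasks, which is presumably why the slides describe the operations in terms of regions rather than separate queues.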
16
Retentive Work Stealing
  [Figure: task queue with markers head, split, stail, itail, and ctail; final state stail == itail == ctail]
  Markers:
    stail: beginning of tasks available to be stolen
    itail: number of tasks that have finished transfer
    ctail: past this marker it is safe to use buffer
  Steal protocol:
    1. Mark tasks stolen at stail and begin transfer
    2. Atomically increment itail on completion of transfer
    3. Worker updates ctail when stail == itail
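The three-step marker protocol can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the authors' code: the class `StealMarkers`, its method names, and the `shared_end` bound (standing in for the split marker) are hypothetical, and a `threading.Lock` stands in for the atomic operations a real active-message runtime would use.

```python
import threading


class StealMarkers:
    """Tracks claimed (stail), transferred (itail), and reusable (ctail) positions."""

    def __init__(self, shared_end):
        self.stail = 0                 # next position a thief may claim
        self.itail = 0                 # number of tasks whose transfer finished
        self.ctail = 0                 # buffer before this position is reusable
        self.shared_end = shared_end   # assumed bound of the shared portion
        self._lock = threading.Lock()

    def begin_steal(self, n):
        # Step 1: mark up to n tasks as stolen at stail; the transfer starts now.
        with self._lock:
            n = min(n, self.shared_end - self.stail)
            start = self.stail
            self.stail += n
        return start, n

    def complete_transfer(self, n):
        # Step 2: atomically add the finished tasks to itail.
        with self._lock:
            self.itail += n

    def update_ctail(self):
        # Step 3: the worker advances ctail only when no transfer is in flight.
        with self._lock:
            if self.stail == self.itail:
                self.ctail = self.stail
            return self.ctail


m = StealMarkers(shared_end=8)
first = m.begin_steal(3)    # thief A claims positions 0..2
second = m.begin_steal(2)   # thief B claims positions 3..4
m.complete_transfer(2)      # thief B finishes first
early = m.update_ctail()    # stail (5) != itail (2): ctail stays at 0
m.complete_transfer(3)      # thief A finishes
late = m.update_ctail()     # stail == itail == 5: ctail moves to 5
```

Because `itail` is a count rather than a position, transfers may complete in any order; the worker only needs to know when every claimed task has left the buffer.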
17
Retentive Work Stealing
  [Figure: two rows of Proc 1, Proc 2, Proc 3, …, Proc n, labeled Seeded Local Queues and Actual Executed Tasks]
  Intuition: Stealing indicates poor initial balance
18
Retentive Work Stealing
  Active-message-based work stealing optimized for distributed memory
  Exploit persistence across work stealing iterations
  In each work stealing phase:
    Track the tasks executed by this worker in this iteration
    Seed the next iteration with the tasks this worker executed
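The retention idea (track what each worker executed, then seed the next iteration with it) can be shown with a toy simulation. Everything here is an assumption made for illustration rather than the policy from the talk: the function `run_iteration` is hypothetical, an idle worker steals a single task from the worker with the most queued work, and steals cost no time.

```python
import heapq


def run_iteration(seeds, cost):
    """Simulate one work-stealing phase.

    seeds: one list of task ids per worker (the initial local queues).
    cost:  dict mapping task id to execution time.
    Returns (executed, steals, makespan); executed[w] lists worker w's tasks.
    """
    queues = [list(q) for q in seeds]
    executed = [[] for _ in seeds]
    steals = 0
    makespan = 0.0
    # Min-heap of (time at which the worker becomes idle, worker id).
    idle = [(0.0, w) for w in range(len(seeds))]
    heapq.heapify(idle)

    def queued_work(v):
        return sum(cost[t] for t in queues[v])

    while idle:
        now, w = heapq.heappop(idle)
        if not queues[w]:
            # Assumed policy: steal one task from the most loaded worker.
            victim = max(range(len(queues)), key=queued_work)
            if not queues[victim]:
                continue  # no queued work anywhere: this worker retires
            queues[w].append(queues[victim].pop())
            steals += 1
        task = queues[w].pop(0)
        executed[w].append(task)   # track tasks executed by this worker
        finish = now + cost[task]
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, w))
    return executed, steals, makespan


cost = {t: 1 for t in range(6)}
seeds = [[0, 1, 2, 3, 4, 5], []]                     # poor initial balance
executed1, steals1, span1 = run_iteration(seeds, cost)
# Retention: the next iteration is seeded with what each worker actually ran.
executed2, steals2, span2 = run_iteration(executed1, cost)
```

In this small case the first iteration needs three steals to balance the two workers, while the second iteration, seeded from the first, needs none and finishes in the same simulated time.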
19
Experimental Setup
  Multi-threaded MPI; one core per node for active messages
  “Flat” execution – each core is an independent worker

  System             No. nodes  Cores per node  Memory per node  Max cores in queue
  Hopper (Cray XE6)       6384              24             32GB              146400
  Intrepid (BG/P)        40960               4              4GB              163840
  Titan (Cray XK6)       18688              16             32GB              298592
20
Hartree-Fock Benchmark
  Basis for several electronic structure theories
  Two-electron contribution
  Schwarz-screening: data dependent sparsity screening at runtime
  Tasks vary in size from milliseconds to seconds

                  HF-Be512 (20)  HF-Be512 (40)
  Total tasks     2.2x10^10      1.4x10^9
  Non-null tasks  9.1x10^6       8.6x10^5
21
Hopper: Performance
  Persistence-based load balancing “converges” faster
  Retentive stealing also improves efficiency
  Stealing effective even with limited parallelism
  [Figure: two panels, Persistence-based load balancing and Retentive Stealing; efficiency vs. core count, with avg. tasks per core]
22
Intrepid: Performance
  Much worse performance for the first iteration
  Converges to a better efficiency than on Hopper
  [Figure: two panels, Persistence-based load balancing and Retentive Stealing; efficiency vs. core count, with avg. tasks per core]
23
Titan: Performance
  Behavior similar to that on Intrepid
  [Figure: two panels, Persistence-based load balancing and Retentive Stealing; efficiency vs. core count, with avg. tasks per core]
24
Intrepid: Num. Steals
  Retentive stealing stabilizes stealing costs
  Similar trends on all systems
  [Figure: two panels, Attempted steals and Successful steals; num. steals vs. core count]
25
Utilization
  HF-Be256 on 9600 cores on Hopper
  Initial stealing has high costs during ramp-down
  Retentive stealing does a better job reducing this cost
  [Figure: three panels, Steal (13.6 secs), StealRet-final (12.6 secs), and PLB (12.2 secs); utilization (%) vs. time]
26
Summary of Insights
  Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan
  Retentive stealing and persistence-based load balancing perform comparably
  Retentive stealing incrementally improves balance
  Number of steals does not grow substantially with scale
  Greedy hierarchical persistence-based load balancer achieves good load balance quality as compared to a centralized scheme (details in paper)