Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, Sriram Krishnamoorthy, Laxmikant V. Kale.


Abstract

Applications iteratively executing identical or slowly evolving calculations require incremental rebalancing to improve load balance across iterations. The work to be performed is overdecomposed into tasks, enabling automatic rebalancing by the middleware. We consider the design and evaluation of two traditionally disjoint approaches to rebalancing: work stealing and persistence-based load balancing. We apply the principle of persistence to work stealing to design an optimized retentive work stealing algorithm. We demonstrate high efficiencies on the full NERSC Hopper (146,400 cores) and ALCF Intrepid (163,840 cores) systems, and on up to 128,000 cores of OLCF Titan.

Persistence-based Load Balancing

Applications that retain their computation balance over iterations, with only gradual change, are said to adhere to the principle of persistence. This behavior can be exploited by measuring the performance profile in a previous iteration and using those measurements to rebalance the tasks. We present a scalable hierarchical persistence-based load balancing algorithm that greedily redistributes "excess" load to localize the rebalancing operations and the migration of tasks. In this algorithm, the cores are organized in a tree in which every core is a leaf.

Retentive Work Stealing

In work stealing, every core executes tasks from a local queue until the queue is empty. It then becomes a thief that randomly searches for work until it either finds additional work or termination is detected. In our distributed-memory implementation, we use a bounded-buffer circular deque with a fixed-size array to reduce data transfer costs and limit synchronization overhead. The following figures illustrate the operations on the queue.
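The queue operations can be sketched as a split deque: the owner adds and gets tasks at the private end without synchronization, releases tasks into a shared section for thieves, and reacquires them when the private section runs dry, while thieves steal from the shared end under synchronization. The following is a minimal single-node Python illustration only; a coarse lock stands in for the atomic operations of the actual distributed-memory implementation, and all names are illustrative, not from the paper.

```python
import threading

class SplitDeque:
    """Bounded-buffer split deque sketch. Indices grow monotonically:
    [head, split) is the shared section, [split, tail) the private section."""

    def __init__(self, capacity=1024):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0    # steal end of the shared section
        self.split = 0   # boundary between shared and private sections
        self.tail = 0    # push/pop end of the private section (owner only)
        self.lock = threading.Lock()  # guards head and split

    def add_task(self, task):
        """Owner: push onto the private section (no lock needed)."""
        assert self.tail - self.head < self.capacity, "queue full"
        self.buf[self.tail % self.capacity] = task
        self.tail += 1
        if self.split == self.head:   # shared section empty: offer work
            self.release()

    def release(self):
        """Owner: move half of the private tasks into the shared section."""
        with self.lock:
            self.split += (self.tail - self.split) // 2

    def acquire(self):
        """Owner: reclaim tasks from the shared section when private is empty."""
        with self.lock:
            k = (self.split - self.head + 1) // 2
            self.split -= k
            return k > 0

    def get_task(self):
        """Owner: pop from the private end; refill from shared if needed."""
        if self.tail == self.split and not self.acquire():
            return None
        self.tail -= 1
        return self.buf[self.tail % self.capacity]

    def steal(self):
        """Thief: take the oldest shared task under the lock."""
        with self.lock:
            if self.split > self.head:
                task = self.buf[self.head % self.capacity]
                self.head += 1
                return task
            return None
```

Keeping the owner's common path (add/get on the private end) synchronization-free is what limits the overhead that thieves impose on busy cores.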
[Figures: queue operations — Adding a task; Releasing tasks to the shared section; Acquiring tasks from the shared section; Getting a task; Steps in a steal operation. A red dotted arrow indicates that a lock or atomic operation is involved.]

The hierarchical persistence-based algorithm proceeds in three phases:
- Leaves send their excess load (any load above a constant times the average load) to their parent, selecting work units from shortest duration to longest.
- Each core receives excess load from its children and retains work for underloaded children until their load is above a constant times the average load. The remaining tasks are sent to the parent.
- Starting at the root, the excess load is distributed to each child by applying the centralized greedy algorithm: the maximum-duration task is assigned to the minimum-loaded child.

Experimental Results

[Figures: TCE-Hopper, HF-Hopper, HF-Intrepid, HF-Titan]
- Execution times for the first and last iteration (x-axis: number of cores; y-axis: execution time in seconds).
- Efficiency of persistence-based load balancing across iterations for the three system sizes, relative to the ideal anticipated speedup (x-axis: number of cores; y-axis: efficiency).
- Efficiency of retentive work stealing across iterations, relative to the ideal anticipated speedup, together with tasks per core (x-axis: core count; left y-axis: efficiency; right y-axis: tasks per core; error bars: std. dev.).

Challenges
- On distributed memory, the cost of work stealing is significant due to communication latency and task migration costs.
- Traditional work stealing ignores prior rebalancing, incurring repeated costs across iterations.
- Work stealing algorithms that randomly redistribute work may disrupt locality- or topology-optimized distributions.
- Persistence-based load balancers cannot adapt immediately to load imbalance: they require initial profiling, and rebalancing occurs only at the end of a phase.
- Persistence-based load balancers incur periodic rebalancing overheads.

Primary Contributions
- An active-message-based retentive work stealing algorithm that is highly optimized for distributed memory.
- A hierarchical persistence-based load balancing algorithm that performs greedy, localized rebalancing.
- The most scalable demonstration of work stealing: on up to 163,840 cores of BG/P, 146,400 cores of XE6, and 128,000 cores of XK6.
- The first comparative evaluation of two traditionally disjoint approaches, work stealing and persistence-based load balancing, for iterative scientific applications.
- A demonstration of the benefits of retentive stealing in incrementally rebalancing iterative applications.
- Experimental data demonstrating that the number of steals (successful or otherwise) does not grow substantially with scale.
- A comparison of the execution time and quality of load balance between a centralized and a hierarchical persistence-based load balancer using a micro-benchmark.

Presentation: Session 9: Thursday, June 10:20am
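The centralized greedy step that the persistence-based balancer applies at the root (maximum-duration task to minimum-loaded child) can be sketched as follows; this is an illustration of the stated rule, not the paper's implementation, and the function name is ours.

```python
import heapq

def greedy_distribute(excess_durations, child_loads):
    """Assign each excess task, from longest duration to shortest,
    to the currently least-loaded child; return per-child task lists."""
    # Min-heap of (current load, child index) so the lightest child pops first.
    heap = [(load, child) for child, load in enumerate(child_loads)]
    heapq.heapify(heap)
    assignment = [[] for _ in child_loads]
    for duration in sorted(excess_durations, reverse=True):
        load, child = heapq.heappop(heap)        # least-loaded child
        assignment[child].append(duration)
        heapq.heappush(heap, (load + duration, child))
    return assignment
```

For example, distributing tasks of duration [5, 3, 2] over two idle children yields [[5], [3, 2]]: placing the longest tasks first keeps the final per-child loads close to balanced.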