Meta-Balancer: Automated Selection of Load Balancing Strategies


Presenter: Kavitha Chandrasekar
Work done by: Harshitha Menon, Kavitha Chandrasekar, Laxmikant V. Kale

Outline
- Motivation: the need for dynamic load balancing and strategy selection
- Implementation in Charm++: the Meta-Balancer framework
- Discussion of application results

Need for Load Balancing
Several HPC applications exhibit load imbalance, for example:
- Dynamic changes in computation, as in adaptive mesh refinement (AMR)
- Variability in particles per computation unit in molecular dynamics
Load balancing achieves:
- High performance
- High resource utilization
(Figure: load imbalance in an AMR application. Reference: AMR mini-app in Charm++)

Charm++ Programming Model and RTS
- Asynchronous, message-driven execution model
- Supports over-decomposition: computation is divided into more units than there are hardware processing units
- Work units are migratable, which makes load balancing possible
- Provides a suite of load balancing strategies
(Figure: over-decomposition of work into Charm++ objects, mapping of Charm++ objects to processors, and the Charm++ RTS view of over-decomposition)

Need for Adaptive Load Balancing Selection
- HPC applications have dynamically varying load imbalance
- Applications exhibit varying characteristics:
  - varying over input sizes
  - varying over phases within an application run
  - varying over core counts
- Load imbalance characteristics vary as a result
- Hence the need to choose the optimal load balancing strategy
- Previous work: Meta-Balancer chooses the best load balancing period
(Figure: Lassen's propagating wave front. Reference: Lassen from LLNL)

Differences in Load Balancing Strategies
- Different applications require different load balancing strategies
- Communication-intensive applications require comm-aware load balancers, such as graph-partitioning-based LBs
- Computation-based load imbalance may require from-scratch or refinement-based load balancers
- Applications may require different strategies at different stages of the run

Load Balancing Strategies in Charm++
Available load balancing strategies:
Centralized:
- GreedyLB: from-scratch re-mapping of objects to PEs
- RefineLB: re-mapping of objects that takes the current mapping into account
Hierarchical:
- HybridLB: performs GreedyLB at sub-trees and RefineLB at higher levels
Distributed:
- DistributedLB: a distributed strategy that propagates load statistics to map chares from overloaded to underloaded processors
Communication-aware:
- MetisLB: objects are mapped based on partitioning of the communication graph
- ScotchLB: communication-graph partitioning that also takes load imbalance into account
- ScotchRefineLB: similar to ScotchLB, but takes the existing mapping of chares into account
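The idea behind GreedyLB can be sketched in a few lines: repeatedly hand the heaviest remaining object to the currently least-loaded PE. This is an illustrative Python sketch of the greedy scheme, not the actual Charm++ implementation:

```python
import heapq

def greedy_lb(obj_loads, num_pes):
    """From-scratch greedy mapping: assign the heaviest remaining
    object to the currently least-loaded PE."""
    # Min-heap of (accumulated load, PE id), so the lightest PE pops first
    pes = [(0.0, pe) for pe in range(num_pes)]
    heapq.heapify(pes)
    mapping = {}
    # Consider objects in decreasing order of load
    for obj, load in sorted(obj_loads.items(), key=lambda kv: -kv[1]):
        pe_load, pe = heapq.heappop(pes)
        mapping[obj] = pe
        heapq.heappush(pes, (pe_load + load, pe))
    return mapping

mapping = greedy_lb({"a": 5.0, "b": 3.0, "c": 2.0, "d": 1.0}, num_pes=2)
# Heaviest object "a" gets one PE; the rest fill in greedily
```

A refinement strategy like RefineLB differs in that it starts from the existing mapping and moves only a few objects off overloaded PEs, keeping migration cost low.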

Performance of Load Balancing Strategies
- Different strategies perform differently across applications, datasets, core counts, etc.
- The figure shows the performance of Lassen on varying numbers of cores:
  - HybridLB shows better speedup at smaller core counts
  - DistributedLB performs better at large core counts
- Even for the same application under weak scaling, we observe variations in performance
(Figure: performance of Lassen with different LBs)

Meta-Balancer: Automated Strategy Selection
How do we select a strategy from the set of strategies?
- Train a Random Forest machine learning model offline with training data (perform multiple runs, then train the model)
- At application runtime, capture application characteristics
- An adaptive runtime system component:
  - identifies when to balance load and which strategy to use
  - invokes the LB and continues to monitor the application's load imbalance characteristics
(Figure: capture features, feed data to the trained model, generate the load balancing strategy decision, invoke the load balancer)

Meta-Balancer: Statistics Collection
- Objects deposit load information at the iteration boundary
- Object load statistics are aggregated at each PE
- AtSync application calls are used to collect statistics on each PE
  - Unlike regular load balancing at AtSync, there is no waiting on a barrier, so these AtSync calls have very low overhead
- Tree-based reductions collect the statistics: a double array of about 30 features is aggregated by reduction
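The tree-based reduction above combines per-PE statistics vectors elementwise in log2(P) rounds. A minimal sketch, assuming a toy four-field vector (total load, max load, min load, object count) rather than the real ~30-feature array, and ignoring Charm++'s actual reduction framework:

```python
def combine(a, b):
    """Elementwise combiner for two statistics vectors:
    (total_load, max_load, min_load, obj_count)."""
    return (a[0] + b[0],       # total load: sum
            max(a[1], b[1]),   # max PE load
            min(a[2], b[2]),   # min PE load
            a[3] + b[3])       # total object count

def tree_reduce(per_pe_stats):
    """Pairwise tree reduction: halve the contributions each round."""
    level = list(per_pe_stats)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # carry an odd straggler upward
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Four PEs contribute (load, load, load, num_objs) tuples
stats = tree_reduce([(2.0, 2.0, 2.0, 3), (4.0, 4.0, 4.0, 5),
                     (1.0, 1.0, 1.0, 2), (3.0, 3.0, 3.0, 4)])
```

The point of the tree shape is that no single PE receives all P contributions; each round combines pairs, so the root sees only O(log P) message latency.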

Random Forest Machine Learning Technique
- A Random Forest predicts the best load balancing strategy: a classification task
- We use 100 trees for good prediction accuracy
- Internal nodes in each tree represent features with associated threshold values
- Leaf nodes are load balancing strategy classes with assigned probabilities
- Tunable model parameters for accuracy: tree depth, type of classifier, etc.

Random Forest Machine Learning Technique
Why not use a single decision tree?
- It is hard to come up with the right threshold for each internal node
- A single tree can overfit
We use a decision tree to:
- provide a general idea of which scenarios suit which LB strategies
- decide on features that can be used to train a model
Random Forest model:
- uses several trees and picks the LB with the highest probability
- trees are constructed by considering random subsets of features
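The "vote over several trees" step can be sketched as follows. The three stump-like trees here are hypothetical hand-written stand-ins for trained trees, and the feature names and thresholds are made up for illustration; a real forest would have 100 learned trees of greater depth:

```python
from collections import Counter

# Hypothetical stand-ins for trained trees: each maps a feature
# dict to a load-balancer class label.
def tree1(f): return "GreedyLB" if f["pe_load_imb"] > 1.5 else "NoLB"
def tree2(f): return "DistributedLB" if f["num_pes"] > 4096 else "GreedyLB"
def tree3(f): return "MetisLB" if f["comm_vs_compute"] > 0.5 else "GreedyLB"

def forest_predict(trees, features):
    """Majority vote over the trees' class labels."""
    votes = Counter(t(features) for t in trees)
    return votes.most_common(1)[0][0]

features = {"pe_load_imb": 2.0, "num_pes": 1024, "comm_vs_compute": 0.1}
lb = forest_predict([tree1, tree2, tree3], features)
```

Because each tree sees a random subset of features, individual trees disagree; the vote averages out any one tree's bad threshold, which is exactly the overfitting protection a single decision tree lacks.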

Decision Tree: How Strategies Vary with Load Imbalance Characteristics
(Figure: a decision tree over test data. Internal nodes test for high load imbalance (none means no LB is needed), high communication load relative to compute load, rate of change of imbalance, topology sensitivity, migration cost, and core counts; the leaves are CommRefineLB, ScotchLB, MetisLB, TopoAware(Refine)LB, GreedyLB, RefineLB, and DistributedLB)
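The branch structure of that figure can be written as nested conditionals. This is one hedged reading of the tree; the exact branch ordering and outcomes are guesses for illustration, not the tree actually used in this work:

```python
def choose_lb(f):
    """Illustrative decision logic over boolean features; the branch
    arrangement is a plausible reconstruction, not the paper's tree."""
    if not f["high_load_imbalance"]:
        return "NoLB"                       # balanced: do nothing
    if f["high_comm_vs_compute"]:
        if f["topology_sensitive"]:
            return "TopoAware(Refine)LB"    # comm-bound and topology matters
        # comm-bound: graph-partitioning LBs
        return "MetisLB" if f["high_rate_of_change"] else "ScotchLB"
    if not f["migration_cost_low"]:
        return "RefineLB"                   # expensive moves: refine in place
    if f["large_core_counts"]:
        return "DistributedLB"              # centralized LB too costly at scale
    # compute-bound, cheap migration, modest scale
    return "GreedyLB" if f["high_rate_of_change"] else "RefineLB"

lb = choose_lb({"high_load_imbalance": True, "high_comm_vs_compute": False,
                "migration_cost_low": True, "large_core_counts": True})
```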

Statistics Collected
- PE stats: number of PEs; PE load imbalance; PE utilization (min, max, avg)
- Object stats: number of objects per PE; object load (min, max, avg)
- Migration overhead
- Communication stats: total messages, total bytes, external messages fraction, external bytes fraction, avg hops, avg hop bytes, avg comm neighbors
- Machine stats: latency, bandwidth
- Derived: communication cost over compute cost; overhead over benefit of a centralized strategy

Features Generated from Statistics
- PE stats:
  - PE load imbalance = PE max load / PE avg load
  - Relative rate of change of load (max load, avg load)
- Communication cost over compute cost:
  - communication cost is based on latency and bandwidth
  - computation cost is the sum of the time taken to execute work
- Overhead over benefit of a centralized strategy:
  - the overhead of the centralized GreedyLB strategy is O(num_objs * log(p))
  - the benefit of the centralized strategy is based on the assumption that it achieves perfect balance
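The feature definitions above are directly computable. A minimal sketch, assuming per-PE load samples and pre-measured communication and compute costs as inputs (the function name and argument layout are illustrative):

```python
import math

def derive_features(pe_loads, num_objs, comm_cost, compute_cost):
    """Compute the derived features from raw statistics."""
    max_load = max(pe_loads)
    avg_load = sum(pe_loads) / len(pe_loads)
    pe_load_imb = max_load / avg_load            # max / avg
    comm_vs_compute = comm_cost / compute_cost
    # Centralized GreedyLB overhead model: O(num_objs * log2(P))
    overhead = num_objs * math.log2(len(pe_loads))
    # Benefit under the perfect-balance assumption: max load drops to avg
    benefit = max_load - avg_load
    return pe_load_imb, comm_vs_compute, overhead, benefit

feats = derive_features([4.0, 2.0, 2.0, 0.0], num_objs=8,
                        comm_cost=1.0, compute_cost=4.0)
# pe_load_imb = 4.0 / 2.0 = 2.0, i.e. the busiest PE does twice the average work
```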

Training Set
The synthetic benchmark lb_test is used to study different behavior by varying:
- communication pattern
- computation load
- migration cost
- message sizes
These parameters let us introduce application characteristics:
- load imbalance and dynamic load imbalance
- compute-intensive vs. communication-intensive behavior
(Figures: communication-intensive synthetic run; computation-intensive synthetic run)

Predictions on Test Data
- 75 configurations were used in the training set and 25 in the test set
- Varying the parameters simulates different application characteristics, e.g. communication-intensive or compute-intensive
- score = T_predicted_lb / T_best_lb - 1
- The larger the score, the worse the prediction's performance
(Figure: prediction scores)
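The score formula is a relative slowdown against the best available strategy, expressed here as a small helper (name and usage are illustrative):

```python
def lb_score(t_predicted, t_best):
    """score = T_predicted / T_best - 1.
    0 means the predicted LB matched the best one; larger is worse."""
    return t_predicted / t_best - 1

# If the predicted LB runs in 12 s and the best LB runs in 10 s,
# the prediction costs a 20% slowdown
s = lb_score(12.0, 10.0)
```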

Histogram of Features
- Histogram of features used for sample test data
- The classifier traverses 100 trees to find the load balancer with maximum probability
- We output how often each feature is used to make a decision along the path
- Some key features: PE_load_Imb, Avg_utilization, Overhead_vs_benefit, Comm_vs_Compute
(Figure: histogram of features used by the Random Forest for synthetic test data)

Using the Trained Model for Applications
The Random Forest model trained on synthetic runs was used with:
- Lassen
- PRK Particle-in-Cell simulation
- NPB BT-MZ
- PDES
- LeanMD
Some of the mini-application results are presented here.

Lassen: Wave Propagation Application
- Used to study detonation shock dynamics
- Suffers from varying load imbalance
- Meta-Balancer chooses a good load balancer: in most cases, at least the 2nd best
- On 1024 cores, Meta-Balancer predicts DistributedLB, which improves performance by 2x over the worst LB (RefineLB at 1024 cores)
(Figure: application times for all LBs; Lassen prediction for LB strategies)

Prediction for Particle-in-Cell Simulation
- A Charm++ implementation of PRK PIC, used to simulate plasma particles
- Load imbalance results from an imbalanced distribution of particles and from particle motion
- Meta-Balancer predicts GreedyLB, which improves performance by more than 2x over the worst LB
(Figure: PRK PIC prediction for LB strategies)

Prediction for NPB BT-MZ
- The Random Forest algorithm predicts the load balancer for NPB BT-MZ, an AMPI benchmark
- Uneven zone sizes cause load imbalance; the benchmark is communication-intensive
- Input classes used: C and D
- For 128 and 256 cores, the prediction is the second best
(Figure: NPB BT-MZ prediction for LB strategies)

Prediction for NPB BT-MZ
- Load balancing score for NPB BT-MZ: the score of the prediction with respect to performance speedup is shown
- For 128 and 256 cores, the prediction is the second best
- Smaller core counts may show variability due to short run times
- ScotchLB could dominate for larger datasets
(Figure: load balancing score for NPB BT-MZ)

Tree Path Description for BT-MZ
Result analysis: which features matter? The figure shows the tree path for BT-MZ's LB prediction:
- LB_Gain (overhead_vs_benefit) is > 0.09, so a centralized strategy is acceptable
- NUM_PEs is low, so GreedyLB is acceptable (instead of HybridLB)
- Migration overhead is low, so GreedyLB is preferable to RefineLB
- The rate of change of imbalance is low, so GreedyLB is acceptable, since it can be called infrequently with low overhead
(Figure: Random Forest tree path for BT-MZ's LB prediction)

Summary: Meta-Balancer for Strategy Selection
- A Random Forest model is trained offline and predicts a suitable load balancing strategy with high accuracy
- An adaptive runtime component makes the load balancing strategy selection decision
- Several applications benefit from strategy selection
- Our result analysis examines why a load balancing strategy was predicted with high probability

Thank You