Charm++ Load Balancing Framework
Gengbin Zheng
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2 Motivation
Irregular or dynamic applications
- Initial static load balancing
- Application behavior changes dynamically
- Difficult to implement with good parallel efficiency
Versatile, automatic load balancers
- Application independent
- Little or no user effort needed for load balancing
- Based on Charm++ and Adaptive MPI
3 Parallel Objects, Adaptive Runtime System, Libraries and Tools
Molecular Dynamics, Computational Cosmology, Rocket Simulation, Protein Folding, Quantum Chemistry (QM/MM), Crack Propagation, Dendritic Growth, Space-time meshes
4 Load Balancing in Charm++
- View an application as a collection of communicating objects
- Object migration is the mechanism for adjusting load
- Measurement-based strategy
  - Principle of persistent computation and communication structure
  - Instrument CPU usage and communication
  - Identify overloaded vs. underloaded processors
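A minimal sketch of the measurement-based idea, under a simplified model (this is not the Charm++ API; `PE`, `peLoad`, `classify`, and the 5% tolerance are illustrative assumptions): the runtime instruments how much CPU time each object consumed since the last balancing step, and by the principle of persistence that recent load predicts future load, so each processor can be classified relative to the average.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Illustrative model only: a PE holds measured per-object CPU times.
struct PE { std::vector<double> objLoads; };

// Total measured load on a PE since the last balancing step.
double peLoad(const PE& pe) {
    return std::accumulate(pe.objLoads.begin(), pe.objLoads.end(), 0.0);
}

// +1 = overloaded, -1 = underloaded, 0 = near the average load.
int classify(const PE& pe, double avgLoad, double tol = 0.05) {
    double l = peLoad(pe);
    if (l > avgLoad * (1.0 + tol)) return +1;
    if (l < avgLoad * (1.0 - tol)) return -1;
    return 0;
}
```

A strategy can then move objects from processors classified +1 toward those classified -1.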
5 Load Balancing – Graph Partitioning
(Diagram: Charm++ maps objects to PEs; the LB view sees this mapping as a weighted object graph to be partitioned.)
6 Load Balancing Framework
(Diagram of the LB framework.)
7 Centralized vs. Distributed Load Balancing
Centralized
- Object load data are sent to processor 0
- Integrated into a complete object graph
- Migration decisions are broadcast from processor 0
- Requires a global barrier
Distributed
- Load balancing among neighboring processors
- Builds a partial object graph
- Migration decisions are sent only to neighbors
- No global barrier
8 Load Balancing Strategies
9 Strategy Example - GreedyCommLB
- Greedy algorithm: put the heaviest object on the most underloaded processor
- An object's load is its CPU load plus its communication cost
- Communication cost is modeled as α + βm for a message of m bytes
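The strategy above can be sketched as a standalone program; the `Obj` struct, the `greedyAssign` helper, and the sample α and β values are assumptions for illustration, not the actual GreedyCommLB implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative object record: measured CPU load plus communication volume.
struct Obj { double cpuLoad, commBytes; int numMsgs; };

// Greedy placement: take objects heaviest-first, put each on the currently
// least loaded PE.  Weight = CPU load + alpha per message + beta per byte
// (the alpha + beta*m cost model).
std::vector<double> greedyAssign(std::vector<Obj> objs, int numPEs,
                                 double alpha, double beta) {
    auto weight = [&](const Obj& o) {
        return o.cpuLoad + alpha * o.numMsgs + beta * o.commBytes;
    };
    // Heaviest objects first.
    std::sort(objs.begin(), objs.end(),
              [&](const Obj& a, const Obj& b) { return weight(a) > weight(b); });

    // Min-heap of (current load, PE id): top() is the most underloaded PE.
    using PE = std::pair<double, int>;
    std::priority_queue<PE, std::vector<PE>, std::greater<PE>> heap;
    for (int p = 0; p < numPEs; ++p) heap.push({0.0, p});

    std::vector<double> load(numPEs, 0.0);
    for (const auto& o : objs) {
        auto [l, p] = heap.top();
        heap.pop();
        load[p] = l + weight(o);   // heaviest remaining -> lightest PE
        heap.push({load[p], p});
    }
    return load;
}
```

The min-heap makes each placement O(log P), so the whole pass is O(N log N + N log P) for N objects on P processors.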
13 Comparison of Strategies
Jacobi1D program with 2048 chares on 64 PEs and on 1024 PEs
(Table: min, max, and average PE load for GreedyRefLB, GreedyCommLB, RecBisectBfLB, MetisLB, RefineLB, RefineCommLB, and OrbLB on 64 and 1024 processors.)
14 Comparison of Strategies
NAMD ATPase benchmark; number of chares: 31811 (migratable: 31107)
(Table: min, max, and average PE load for GreedyLB, GreedyRefLB, GreedyCommLB, RefineLB, RefineCommLB, and OrbLB.)
15 User Interfaces
Fully automatic load balancing
- Nothing needs to be changed in application code
- Load balancing happens periodically and transparently
- +LBPeriod controls the load balancing interval
User-controlled load balancing
- Insert AtSync() calls at points where the object is ready for load balancing (a hint)
- The LB passes control back via ResumeFromSync() after migration finishes
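In code, user-controlled balancing looks roughly like this (a sketch, not compilable on its own: `MyChare`, `iterate`, `nextStep`, `iter`, and `LB_PERIOD` are illustrative names; `usesAtSync`, `AtSync()`, and `ResumeFromSync()` are the interface described above):

```cpp
// In the chare's constructor: opt in to AtSync-based load balancing.
MyChare::MyChare() { usesAtSync = true; }

void MyChare::iterate() {
  doWork();
  if (++iter % LB_PERIOD == 0)
    AtSync();        // hint: this object is ready for load balancing
  else
    nextStep();
}

// The LB passes control back here after migration finishes.
void MyChare::ResumeFromSync() { nextStep(); }
```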
16 Migrating Objects
Moving data
- The runtime packs object data into a message and sends it to the destination
- The runtime unpacks the data and recreates the object
- The user writes a pup function for packing/unpacking object data
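The pup idea can be illustrated with a simplified stand-in (this is not the real Charm++ PUP::er API; `Pup`, `Mode`, and `Particle` are invented for the sketch): the object writes one routine that lists its data once, and the runtime drives that same routine in sizing, packing, and unpacking modes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

enum class Mode { Sizing, Packing, Unpacking };

// Simplified stand-in for a pup'er: one traversal routine, three modes.
struct Pup {
    Mode mode;
    std::vector<char> buf;   // the "message" holding packed object data
    std::size_t pos = 0;

    void bytes(void* p, std::size_t n) {
        switch (mode) {
        case Mode::Sizing:    pos += n; break;                                  // count only
        case Mode::Packing:   buf.insert(buf.end(), (char*)p, (char*)p + n); break;
        case Mode::Unpacking: std::memcpy(p, buf.data() + pos, n); pos += n; break;
        }
    }
    template <class T> void operator|(T& v) { bytes(&v, sizeof v); }
};

// A migratable object writes one pup() that lists all of its state.
struct Particle {
    double x, y, z;
    int id;
    void pup(Pup& p) { p | x; p | y; p | z; p | id; }
};
```

One routine serves all three phases, so packing and unpacking can never get out of sync with each other.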
17 Compiler Interface
Link-time options
- -module: link load balancers as modules
- Multiple modules can be linked into one binary
Runtime options
- +balancer: choose which load balancer to invoke
- Multiple load balancers can be given, e.g. +balancer GreedyCommLB +balancer RefineLB
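Put together, a hypothetical build and run might look like this (the program name `jacobi` and the processor count are placeholders):

```shell
# Link time: pull in the balancers as modules.
charmc -o jacobi jacobi.o -module GreedyCommLB -module RefineLB

# Run time: pick one or more balancers with +balancer.
./charmrun ./jacobi +p64 +balancer GreedyCommLB +balancer RefineLB
```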
18 NAMD Case Study
- Molecular dynamics: atoms move slowly
- Initial load balancing can be as simple as round-robin
- Load balancing is needed only once in a while, typically once every thousand steps
- Greedy balancer followed by a Refine strategy
19 Load Balancing Steps
(Timeline: regular timesteps, instrumented timesteps, detailed/aggressive load balancing, then refinement load balancing.)
20 Processor Utilization against Time on (a) 128 and (b) 1024 processors
On 128 processors a single load balancing step suffices, but on 1024 processors a "refinement" step is also needed.
21 Processor Utilization across processors after (a) greedy load balancing and (b) refinement
Note that underloaded processors are left underloaded (they do not impact performance); refinement deals only with the overloaded ones.
22 Profile view of a 3000 processor run of NAMD (White shows idle time)
23 Load Balance Research with Blue Gene
Centralized load balancer
- Communication bottleneck on processor 0
- Memory constraint
Fully distributed load balancer
- Neighborhood balancing
- No global load information
Hierarchical distributed load balancer
- Divide processors into groups
- Different strategies at each level
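The hierarchical structure can be sketched as follows (`makeGroups` and the fixed group size are illustrative assumptions, not the actual hierarchical balancer): detailed object data stays inside a group, and only aggregate group loads cross group boundaries, avoiding both the processor-0 communication bottleneck and its memory constraint.

```cpp
#include <cassert>
#include <vector>

// Divide PEs into fixed-size groups; a different strategy can then run at
// each level (e.g. a greedy balancer within each group, a cheaper
// refinement step across group roots).
std::vector<std::vector<int>> makeGroups(int numPEs, int groupSize) {
    std::vector<std::vector<int>> groups;
    for (int p = 0; p < numPEs; ++p) {
        if (p % groupSize == 0) groups.emplace_back();  // start a new group
        groups.back().push_back(p);
    }
    return groups;
}
```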