Dynamic Load Balancing in Scientific Simulation Angen Zheng

Static Load Balancing. Distribute the load evenly across processing units (PUs). Is this good enough? It depends! If there is no data dependency, no communication among the PUs, and the load distribution remains unchanged throughout the computation, then a one-time static distribution suffices. (Figure: an initially balanced load across PU 1, PU 2, and PU 3 stays balanced through the computation steps.)

Static Load Balancing. When PUs need to communicate with each other to carry out the computation, evenly distributing the load is no longer enough: the partitioner must also minimize inter-processing-unit communication. (Figure: the same initially balanced, unchanged load distribution across PU 1, PU 2, and PU 3, but now with communication among the PUs during the computation.)

Dynamic Load Balancing. When the load distribution changes as the computation proceeds, an initially balanced distribution drifts into imbalance and must be rebalanced by repartitioning. The goals become: distribute the load evenly across processing units; minimize inter-processing-unit communication; and minimize data migration among processing units. (Figure: an initially balanced distribution across PU 1, PU 2, and PU 3 becomes imbalanced over the iterative computation steps and is restored by repartitioning.)

(Hyper)graph Partitioning. Given a (hyper)graph G = (V, E), partition V into k parts P_0, P_1, …, P_{k-1}, such that all parts are:
• Disjoint: P_0 ∪ P_1 ∪ … ∪ P_{k-1} = V and P_i ∩ P_j = ∅ for i ≠ j.
• Balanced: |P_i| ≤ (|V| / k) · (1 + ε).
• Minimal edge-cut: the number of edges crossing different parts is minimized (B_comm = 3 in the figure's example).
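These constraints can be checked mechanically. Below is a minimal Python sketch, assuming the graph is given as an edge list and the partition as a vertex-to-part dictionary; the example graph and all names are illustrative, not taken from the slides. (For hypergraphs, the cut metric generalizes to, e.g., connectivity minus one, which is not shown here.)

```python
from collections import Counter

# edge_cut counts edges whose endpoints fall in different parts (B_comm);
# is_balanced checks |P_i| <= (|V| / k) * (1 + eps) for every part.
def edge_cut(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

def is_balanced(part, k, eps):
    sizes = Counter(part.values())
    limit = (len(part) / k) * (1 + eps)
    return all(size <= limit for size in sizes.values())

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4)]   # made-up graph
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}                # 3-way partition
print(edge_cut(edges, part), is_balanced(part, k=3, eps=0.1))   # -> 3 True
```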

(Hyper)graph Repartitioning. Given a partitioned (hyper)graph G = (V, E) and a partition vector P, repartition V into k parts P_0, P_1, …, P_{k-1}, such that all parts are:
• Disjoint.
• Balanced.
• Minimal edge-cut.
• Minimal migration: the amount of data that moves between the old and the new partition is minimized (B_comm = 4 and B_mig = 2 in the figure's example).
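Migration is the new objective relative to plain partitioning. A hedged sketch of how B_mig could be measured, given the old and new partition vectors (the function name and the optional per-vertex weight are illustrative):

```python
# Illustrative B_mig: count (optionally weighted) vertices whose part
# changed between the old and the new partition vector.
def migration_cost(old_part, new_part, weight=None):
    w = weight or (lambda v: 1)
    return sum(w(v) for v in old_part if old_part[v] != new_part[v])

old = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
new = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 0}   # vertices 1 and 5 moved
print(migration_cost(old, new))              # -> 2, i.e. B_mig = 2
```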

(Hyper)graph-Based Dynamic Load Balancing. The overall workflow: build the initial (hyper)graph; compute an initial partitioning across PU1, PU2, and PU3; run the iterative computation steps; update the (hyper)graph to reflect the new load; and repartition the updated (hyper)graph. (Figure: the load distribution after repartitioning.)

(Hyper)graph-Based Dynamic Load Balancing: Cost Model. T_comm and T_mig depend on architecture-specific features, such as the network topology and the cache hierarchy. T_compu is usually implicitly minimized (the balance constraint evens it out), and T_repart is commonly negligible.
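The equation itself was lost with the slide graphics; a common way to write this cost model (along the lines of the repartitioning model in [4], with α assumed to be the number of computation iterations between two consecutive rebalancing steps) is:

    T_total = α · (T_compu + T_comm) + T_repart + T_mig

Under the observations above, the practical objective of the repartitioner is then to minimize α · T_comm + T_mig.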

(Hyper)graph-Based Dynamic Load Balancing: NUMA Effect

(Hyper)graph-Based Dynamic Load Balancing: NUCA Effect. (Figure: the same pipeline as before, initial (hyper)graph, initial partitioning across PU1, PU2, and PU3, iterative computation steps, updated (hyper)graph, and rebalancing, with data migrated once after repartitioning.)

Hierarchical Topology-Aware (Hyper)graph-Based Dynamic Load Balancing.
• NUMA-Aware Inter-Node Repartitioning:
  – Goal: group the most-communicating data onto compute nodes close to each other.
  – Main idea: regrouping, then repartitioning, then refinement.
• NUCA-Aware Intra-Node Repartitioning:
  – Goal: group the most-communicating data onto cores sharing more levels of cache.
  – Solution #1: hierarchical repartitioning.
  – Solution #2: flat repartitioning.

Hierarchical Topology-Aware (Hyper)graph-Based Dynamic Load Balancing.
• Motivations:
  – Inter- and intra-node communication are heterogeneous: network topology vs. cache hierarchy, with different cost metrics and varying impact on performance.
• Benefits:
  – Fully aware of the underlying topology.
  – Allows different cost models and repartitioning schemes for inter- and intra-node repartitioning.
  – Repartitioning the (hyper)graph at the node level first offers more freedom in deciding which objects to migrate and which partition each object should migrate to.

NUMA-Aware Inter-Node (Hyper)graph Repartitioning: Regrouping. (Figure: partitions P1–P4, assigned to Node#0 and Node#1 by the partition assignment, are regrouped into one coarse part per node before inter-node repartitioning.)
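A minimal sketch of the regrouping step, assuming we are given the current partition vector and a partition-to-node assignment table (all names and values below are illustrative, not from the slides):

```python
# Illustrative regrouping: collapse the per-PU partition vector into a
# per-node vector using the partition -> node assignment table.
def regroup(part, part_to_node):
    return {v: part_to_node[p] for v, p in part.items()}

part = {0: 'P1', 1: 'P1', 2: 'P2', 3: 'P3', 4: 'P4', 5: 'P4'}
part_to_node = {'P1': 0, 'P2': 0, 'P3': 1, 'P4': 1}   # Node#0 and Node#1
print(regroup(part, part_to_node))   # vertices now labeled by node, not PU
```

After regrouping, the inter-node repartitioner works on one coarse part per compute node, so its decisions are made at node granularity.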

NUMA-Aware Inter-Node (Hyper)graph Repartitioning: Repartitioning. (Figure: the regrouped, node-level (hyper)graph is repartitioned across the compute nodes, producing a new balanced node-level partition.)

NUMA-Aware Inter-Node (Hyper)graph Repartitioning: Refinement. Refine the new partition by taking the current partition-to-compute-node assignment into account: in the figure's example, the raw repartitioning has a migration cost of 4 and a communication cost of 3, while the refined assignment keeps the communication cost at 3 and drives the migration cost down to 0. See the sketch below.
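One way to realize this refinement, sketched under the assumption that relabeling parts does not change the edge-cut (it only changes where data lives): try every part-to-node assignment and keep the one that migrates the least. The function below is illustrative, not the actual refinement algorithm from the slides.

```python
from itertools import permutations

# Illustrative refinement: keep the newly computed parts, but choose the
# part -> node assignment that migrates the least data.
def best_assignment(old_node, new_part, nodes):
    parts = sorted(set(new_part.values()))
    best, best_moved = None, float('inf')
    for perm in permutations(nodes, len(parts)):
        mapping = dict(zip(parts, perm))    # candidate part -> node map
        moved = sum(1 for v in new_part
                    if mapping[new_part[v]] != old_node[v])
        if moved < best_moved:
            best, best_moved = mapping, moved
    return best, best_moved

old_node = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # where each vertex lives
new_part = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
print(best_assignment(old_node, new_part, nodes=[0, 1]))
# -> ({'A': 0, 'B': 1}, 0): current layout kept, zero migration
```

Since the edge-cut is invariant under relabeling, the communication cost stays the same (3 in the slide's example) while migration drops (from 4 to 0 in the slide's example).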

Hierarchical NUCA-Aware Intra-Node (Hyper)graph Repartitioning. Main idea: repartition the subgraph assigned to each node hierarchically, following the cache hierarchy, as sketched below.
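A hedged sketch of the hierarchical idea, assuming the cache hierarchy is given as a nested list (e.g., sockets containing cores) and that some k-way partitioner is available; naive_split is only a stand-in for a real (hyper)graph partitioner such as METIS or PaToH:

```python
# Split a node's subgraph recursively along the cache tree. 'cache_tree' is
# a nested list mirroring the hierarchy; a leaf is a physical core id.
# 'partition_fn(vertices, k)' is any k-way partitioner returning k lists.
def hierarchical_partition(vertices, cache_tree, partition_fn):
    if not isinstance(cache_tree, list):      # leaf: assign to this core
        return {v: cache_tree for v in vertices}
    groups = partition_fn(vertices, len(cache_tree))
    mapping = {}
    for subtree, group in zip(cache_tree, groups):
        mapping.update(hierarchical_partition(group, subtree, partition_fn))
    return mapping

def naive_split(vertices, k):
    """Stand-in for a real partitioner (e.g., METIS, PaToH)."""
    vs = list(vertices)
    return [vs[i::k] for i in range(k)]

# Hypothetical node: two sockets, each with two cores sharing an L2.
print(hierarchical_partition(range(8), [[0, 1], [2, 3]], naive_split))
```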

Flat NUCA-Aware Intra-Node (Hyper)graph Repartitioning. Main idea:
• Repartition the subgraph assigned to each compute node directly into k parts from scratch, where k equals the number of cores per node.
• Explore all possible partition-to-physical-core mappings to find the one with minimal cost, as sketched below.
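A sketch of the exhaustive mapping search, assuming a per-part communication-volume table and a per-core distance table derived from the cache hierarchy (the numbers below are hypothetical: distance 1 between cores sharing an L2, 2 across the L3). A real implementation would also fold the migration term of the cost model into the objective; it is omitted here for brevity.

```python
from itertools import permutations

# 'comm[(p, q)]' is the communication volume between parts p and q;
# 'dist[(c, d)]' is the assumed per-unit cost between cores c and d.
def best_core_mapping(parts, cores, comm, dist):
    best, best_cost = None, float('inf')
    for perm in permutations(cores, len(parts)):
        m = dict(zip(parts, perm))          # one candidate part -> core map
        cost = sum(vol * dist[m[p], m[q]] for (p, q), vol in comm.items())
        if cost < best_cost:
            best, best_cost = m, cost
    return best, best_cost

# Hypothetical 4-core node: cores 0/1 share an L2, cores 2/3 share another,
# and all four cores share the L3.
dist = {(a, b): 0 if a == b else (1 if a // 2 == b // 2 else 2)
        for a in range(4) for b in range(4)}
comm = {('P1', 'P2'): 5, ('P2', 'P3'): 1, ('P3', 'P4'): 4}
print(best_core_mapping(['P1', 'P2', 'P3', 'P4'], list(range(4)), comm, dist))
```

The k! search is affordable here because k is only the number of cores per node.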

Flat NUCA-Aware Intra-Node (Hyper)graph Repartitioning. (Figure: the old partition, with parts P1, P2, P3 assigned to Core#0, Core#1, Core#2.)

Flat NUCA-Aware Intra-Node (Hyper)graph Repartitioning. (Figure: the old partition, parts P1–P3 on Core#0–Core#2, alongside a new four-part partition P1–P4 mapped to Core#0–Core#3 under one candidate assignment, #M1.)

Major References
[1] K. Schloegel, G. Karypis, and V. Kumar, "Graph Partitioning for High Performance Scientific Simulations," Army High Performance Computing Research Center, 2000.
[2] B. Hendrickson and T. G. Kolda, "Graph Partitioning Models for Parallel Computing," Parallel Computing, vol. 26, no. 12, pp. 1519–1534, 2000.
[3] K. D. Devine, E. G. Boman, R. T. Heaphy, R. H. Bisseling, and U. V. Catalyurek, "Parallel Hypergraph Partitioning for Scientific Computing," in Proc. 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.
[4] U. V. Catalyurek, E. G. Boman, K. D. Devine, D. Bozdag, R. T. Heaphy, and L. A. Riesen, "A Repartitioning Hypergraph Model for Dynamic Load Balancing," Journal of Parallel and Distributed Computing, vol. 69, no. 8, pp. 711–724, 2009.
[5] E. Jeannot, E. Meneses, G. Mercier, F. Tessier, G. Zheng, et al., "Communication and Topology-Aware Load Balancing in Charm++ with TreeMatch," in IEEE Cluster, 2013.
[6] L. L. Pilla, C. P. Ribeiro, D. Cordeiro, A. Bhatele, P. O. Navaux, J.-F. Mehaut, L. V. Kale, et al., "Improving Parallel System Performance with a NUMA-Aware Load Balancer," INRIA-Illinois Joint Laboratory on Petascale Computing, Urbana, IL, Tech. Rep. TR-JLPC-11-02, 2011.

Thanks!