Integrated Runtime of Charm++ and OpenMP

Integrated Runtime of Charm++ and OpenMP. Seonmyeong Bak, Ph.D. student, Parallel Programming Laboratory, CS @ Illinois.

Table of contents
- Motivation: load imbalance/performance analysis of periodic LB and overdecomposition
- Implementation: integrated runtime of OpenMP and Charm++/AMPI using user-level threads
- Evaluation: scaling results on KNL with applications (Lassen, Kripke, ChaNGa)

Motivation
- Load imbalance is becoming a more critical issue as the number of cores per node increases rapidly.
- Parallel applications have variable problem sizes that cannot be evenly distributed among nodes and cores.

Ways to resolve load imbalance within a node
- Coarse-grained periodic load balancing: load information from all processes is collected and a decision is made about which objects to migrate. Periodic load balancing incurs overhead, so it cannot be called frequently.
- Hybrid programming models such as MPI+X: help with load imbalance through fine-grained task-level parallelism at low cost, but each process has its own thread pool, which limits load balancing to within a process (see the sketch below).
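A minimal MPI+OpenMP sketch of the hybrid model this slide refers to (illustrative only; work() and n_local are hypothetical names, not from the talk). Each rank owns its own OpenMP thread pool, so dynamic scheduling can only balance load among that rank's threads, never across ranks:

    #include <mpi.h>
    #include <omp.h>

    static void work(int i) { (void)i; /* hypothetical per-item kernel */ }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int n_local = 1000;  // work items assigned to this rank
        // Dynamic scheduling balances load only among this rank's threads.
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_local; ++i)
            work(i);
        MPI_Finalize();
        return 0;
    }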

Analysis of periodic load balancing / overdecomposition
- Application: Lassen, a wave-propagation proxy application that is highly load imbalanced
- Machine: NERSC Cori (Haswell, 2 sockets x 16 cores per node); 2 hardware threads per core are used, giving 64 threads on a single node
- Measure the load imbalance of Lassen and its performance
- Metric: a larger value means the processing elements and nodes have more load imbalance (assumed formula below)
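The slide does not state the formula for this metric; a commonly used load-imbalance factor, assumed here for concreteness, is

    \lambda = \frac{\max_{1 \le i \le P} L_i}{\frac{1}{P} \sum_{i=1}^{P} L_i}

where L_i is the load of PE (or node) i and P is the number of PEs (or nodes). \lambda = 1 means perfect balance; larger values mean more load imbalance.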

Analysis of periodic load balancing / overdecomposition: four load balancing strategies compared (a generic Charm++ usage sketch follows the list)
- GreedyLB: move load from the most heavily loaded PE to the least loaded PE
- GreedyRefineLB: similar to GreedyLB, but minimizes migration of objects
- RefineLB: compute the average load of the PEs and migrate chares so that each PE approaches the average
- HybridLB: hierarchical load balancing, with RefineLB at the first level of the PE hierarchy and GreedyRefineLB at the second level
- NoLB: baseline, periodic load balancing is not called
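A minimal sketch of how periodic load balancing is usually driven in a Charm++ application (generic Charm++ usage, not the presenter's code; the chare class Block, computeStep(), and LB_FREQ are hypothetical). The strategy itself is chosen at launch time, e.g. ./charmrun +p64 ./lassen +balancer GreedyRefineLB:

    // Assumes a matching .ci declaration, e.g.:
    //   array [1D] Block { entry Block(); entry void iterate(); };
    #include "block.decl.h"                    // generated from the hypothetical block.ci

    constexpr int LB_FREQ = 25;                // LB frequency used on the slides

    class Block : public CBase_Block {
      int step = 0;
    public:
      Block() { usesAtSync = true; }           // opt into AtSync load balancing
      Block(CkMigrateMessage*) {}              // migration constructor
      void iterate() {
        computeStep();                         // hypothetical per-iteration work
        ++step;
        if (step % LB_FREQ == 0)
          AtSync();                            // runtime may migrate chares here
        else
          thisProxy[thisIndex].iterate();      // go straight to the next iteration
      }
      void ResumeFromSync() {                  // invoked once migration is done
        thisProxy[thisIndex].iterate();
      }
      void pup(PUP::er& p) {                   // migrated state must be pup'ed
        CBase_Block::pup(p);
        p | step;
      }
    private:
      void computeStep() { /* ... */ }
    };

    #include "block.def.h"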

More chares per PE -> load imbalance reduced
- With LB frequency fixed at 25: more chares per PE -> load imbalance is reduced
- With the decomposition ratio fixed at 2: more frequent LB -> load imbalance is reduced
- Does the reduced load imbalance lead to a performance gain? No!

Performance with varying decomposition ratio and LB frequency
- With LB frequency fixed at 25: performance gets worse beyond 8 chares per PE; more chares increase the surface-to-volume ratio and scheduling overhead and hurt locality
- With the decomposition ratio fixed at 2: calling LB more often than every 10 iterations does not improve performance, due to load balancing overhead and migration cost

Detailed timing results of Lassen
- More frequent LB -> LB overhead increases
- Higher decomposition ratio -> application time increases (lack of locality, scheduling overhead, increased surface-to-volume ratio)
- Periodic LB has limited potential for performance gain

Implementation
- Create implicit tasks (OpenMP threads) as user-level threads (ULTs) on each chare; these ULTs can migrate within a node
- An idle PE steals ULTs from busy PEs, which reduces load imbalance within a node at marginal cost
- Heuristics minimize unnecessary creation of ULTs (see the sketch after this list):
  - An atomic counter records how many PEs are idle within the node
  - A history vector on each PE tracks how many of the tasks created on that PE have been stolen
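A hedged sketch of the task-creation heuristic described above (the names idle_pes_in_node, PEState, and should_split are my assumptions, as is the exact policy): a PE only pays the cost of splitting an OpenMP region into stealable ULTs when some PE in the node is currently idle or its recently created tasks have actually been stolen.

    #include <atomic>
    #include <numeric>
    #include <vector>

    // Node-wide count of idle PEs, bumped by a PE whose scheduler runs dry.
    std::atomic<int> idle_pes_in_node{0};

    struct PEState {
      // Per recent parallel region: how many of the tasks this PE created
      // were stolen by other PEs.
      std::vector<int> steal_history;

      bool should_split(int num_iterations) const {
        if (num_iterations < 2) return false;   // nothing worth sharing
        int recently_stolen =
            std::accumulate(steal_history.begin(), steal_history.end(), 0);
        // Split only if it is likely to help: someone is idle right now,
        // or tasks created here were actually stolen in the recent past.
        return idle_pes_in_node.load(std::memory_order_relaxed) > 0 ||
               recently_stolen > 0;
      }
    };

    int main() {
      PEState pe;
      pe.steal_history = {0, 2, 1};             // e.g. 3 tasks stolen recently
      bool split = pe.should_split(64);         // 64 loop iterations to share
      return split ? 0 : 1;
    }

When should_split() returns false, the parallel region would simply run inline on the encountering PE, avoiding ULT creation overhead.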

Implementation – Integrated OpenMP on Charm++

Implementation
- All barriers are implemented with atomic reference counters (sketch below); suspend/resume cannot be used in this integrated scheme, because chares that create implicit tasks cannot be suspended
- Unnecessary internal barriers in the LLVM OpenMP runtime are minimized, allowing more overlap internally
- Works with ICC, Clang, and GNU compilers on Charm++ x86 and IBM Power targets
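A hedged sketch of a counter-based barrier of the kind this slide describes (illustrative only; the actual changes to the LLVM OpenMP runtime differ). Instead of suspending, a waiting implicit task keeps yielding to the ULT scheduler so its PE can keep running other tasks:

    #include <atomic>

    // Single-use barrier, kept minimal for clarity; a real runtime would reset it.
    struct CounterBarrier {
      std::atomic<int> remaining;
      explicit CounterBarrier(int n) : remaining(n) {}

      template <typename YieldFn>
      void arrive_and_wait(YieldFn yield_to_scheduler) {
        if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1)
          return;                               // last arrival opens the barrier
        while (remaining.load(std::memory_order_acquire) > 0)
          yield_to_scheduler();                 // keep the PE scheduling other ULTs
      }
    };

    int main() {
      CounterBarrier barrier(1);                // trivial single-participant demo
      barrier.arrive_and_wait([] { /* yield to the ULT scheduler here */ });
      return 0;
    }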

Evaluation setup
- Applications:
  - Lassen: wave-propagation mini-app from LLNL, written in Charm++, highly load imbalanced
  - Kripke: LLNL proxy app for parallel deterministic transport codes, written in MPI+OpenMP
  - ChaNGa: N-body cosmology simulation, written in Charm++
- Machines: NERSC Cori (Haswell, 2 x 16 cores per node) and ANL Theta (KNL, 64 cores per node); 2 hardware threads per core are used on both machines
- Cori was used only for Lassen; Theta was used for all three applications
- Compiler: icc 17.0

LB factor & performance on a single node – Lassen
- On both Haswell and KNL, the LB factor and performance are improved
- The improved load balance translates into a performance gain, indicating that the overhead is marginal

Scaling results – Lassen (strong scaling)
- 32.5% improvement on Cori (Haswell, 512-core run) and 45.9% on Theta (KNL, 2K-core run)

Scaling results – Kripke (weak scaling)
- Kripke was run with both standard MPI+OpenMP and AMPI+OpenMP
- With AMPI+OpenMP, all the threads within a node can be used by all MPI ranks and their OpenMP threads, giving better performance
- 11% improvement over the best MPI configuration (illustrative launch lines below)
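Illustrative launch lines for the two configurations compared on this slide (assumed syntax and counts, not the exact commands used in the study; kripke.ampi is a hypothetical binary name). With AMPI, the virtual MPI ranks (+vp) are user-level threads multiplexed over the PEs (+p), so work from any rank can run on any PE in the node:

    # Standard MPI+OpenMP: each rank gets a fixed, private thread pool
    OMP_NUM_THREADS=16 mpirun -n 4 ./kripke ...

    # AMPI+OpenMP: 64 virtual ranks scheduled over 64 PEs on one node
    ./charmrun +p64 ./kripke.ampi +vp64 ...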

Scaling results – ChaNGa
- Input set: Dwf1.2048, which has a strong-scaling limit
- The best LB strategy was applied in both experiments
- Improvements: 25.7% on a single node, 16.7% on two nodes, 9.4% on 4 nodes, and 4% on 8 nodes

Publications & future work
- The integrated OpenMP runtime has been publicly available since Charm++ 6.8; the initial version was published and presented at ESPM2 '17:
  Seonmyeong Bak, Harshitha Menon, Sam White, Matthias Diener, and Laxmikant Kale. Integrating OpenMP into the Charm++ Programming Model. In ESPM2 '17: Third International Workshop on Extreme Scale Programming Models and Middleware, November 12-17, 2017, Denver, CO, USA.
- The version of the integrated OpenMP runtime described in this work will be available in Charm++ 6.9
- All results in this presentation are based on a paper accepted to CCGrid '18:
  Seonmyeong Bak, Harshitha Menon, Sam White, Matthias Diener, and Laxmikant Kale. Multi-level Load Balancing with an Integrated Runtime Approach. In CCGrid '18: Eighteenth IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 1-4, 2018, Washington DC, USA (to appear).
- A standalone user-level-thread-based OpenMP runtime is in progress for accelerating existing OpenMP applications and libraries; it is similar to BOLT from Argonne National Laboratory, but with different scheduling strategies and limitations

Summary
- Periodic LB and overdecomposition offer limited performance gains due to the overhead they incur: scheduling overhead, LB overhead, loss of locality, and an increased surface-to-volume ratio
- The integrated OpenMP runtime creates implicit tasks that handle load imbalance within a node at marginal cost
- The benefit of the integrated runtime was evaluated with three applications on Xeon and KNL machines