Performance Evaluation of Adaptive MPI


Performance Evaluation of Adaptive MPI
Chao Huang (1), Gengbin Zheng (1), Sameer Kumar (2), Laxmikant Kale (1)
(1) University of Illinois at Urbana-Champaign
(2) IBM T. J. Watson Research Center
PPoPP 06

Motivation
Challenges:
- Applications with a dynamic nature: shifting workload, adaptive refinement, etc.
- Traditional MPI implementations offer limited support for such dynamic applications
Adaptive MPI:
- Virtual processes (VPs) implemented via migratable objects
- A powerful run-time system that offers various novel features and performance benefits
- A mature implementation that is being used in real-world applications

Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion

Processor Virtualization
Basic idea of processor virtualization:
- The user specifies the interaction between objects (VPs)
- The RTS maps VPs onto physical processors (see the sketch below)
- Typically the number of VPs >> P, to allow for various optimizations
Advantages:
- Run a program on any given number of processors
- Adaptive overlap of computation and communication
- Automatic load balancing, shrink/expand, etc.

[Figure: the user's view (interacting VPs) vs. the system implementation (VPs mapped onto physical processors)]

Speaker notes: Before going into the details of Adaptive MPI, this slide introduces the concept of processor virtualization. The idea is realized in Charm++, an object-based parallel programming framework developed by our group; a few of its advantages are listed here.
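To make the mapping concrete, here is a deliberately minimal sketch in C, assuming nothing beyond the slide: the kind of static round-robin VP-to-processor assignment a runtime might start from. It is purely illustrative; Charm++'s actual mapping and load balancers are far more sophisticated.

    /* Illustrative only: a static round-robin VP-to-processor map.
     * A real RTS starts from something like this, then migrates
     * objects at run time based on measured load. */
    int initial_pe(int vp, int num_pes) {
        return vp % num_pes;   /* VPs 0,1,2,... dealt out cyclically */
    }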

AMPI: MPI with Virtualization
- Each AMPI virtual process is implemented by a user-level thread embedded in a migratable object (see the example below)
- The RTS takes advantage of its freedom in mapping VPs onto processors

[Figure: MPI "processes" implemented as virtual processes mapped onto real processors]

Speaker notes: The threads are lightweight, with a context-switch time of about 1 microsecond. We use the threads for things like blocking a process, and each Charm++ object communicates with other Charm++ objects on behalf of its thread; this is how point-to-point message passing is implemented.
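As an example of what this means for the programmer: a plain MPI program such as the sketch below (ours, not from the talk) runs unchanged under AMPI, except that each rank is a migratable user-level thread rather than an OS process.

    /* A plain MPI program: under AMPI, each rank below is a user-level
     * thread inside a migratable object, not an OS process. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("hello from MPI process (VP) %d\n", rank);
        MPI_Finalize();
        return 0;
    }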

Outline Motivation Design and Implementation Features and Benefits Adaptive Overlapping Automatic Load Balancing Communication Optimizations Flexibility and Overhead Conclusion 9/16/2018 PPoPP 06

Adaptive Overlap
- Problem: the gap between completion time and CPU overhead
- Solution: overlap communication with computation (see the sketch below)

[Figure: completion time and CPU overhead of a 2-way ping-pong program on the Turing (Apple G5) cluster]
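An MPI code typically exposes this overlap with nonblocking operations, as in the generic halo-exchange sketch below (ours; the helper functions are hypothetical). Under AMPI there is a second, adaptive layer of overlap: a VP blocked in MPI_Waitall yields the processor to other VPs, so one rank's communication is hidden behind another rank's computation.

    #include <mpi.h>

    static void compute_interior(void) { /* hypothetical local work */ }
    static void compute_boundary(double *halo) { (void)halo; /* hypothetical */ }

    void exchange_and_compute(double *send, double *recv, int n, int peer) {
        MPI_Request reqs[2];
        /* post the halo exchange without blocking */
        MPI_Irecv(recv, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        compute_interior();              /* needs no remote data: overlaps */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(recv);          /* needs the received halo */
    }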

Adaptive Overlap

[Figure: timelines of a 3D stencil calculation with 1, 2, and 4 VPs per processor]

Speaker notes: These three graphs are timelines of a 3D stencil calculation. The first graph shows the timeline without virtualization; with more VPs per processor, gaps left by communication are filled with computation from other VPs. Sometimes, when VP/P is not big enough, the overhead is the dominant factor.

Automatic Load Balancing
Challenge:
- Dynamically varying applications
- Load imbalance impacts overall performance
Solution: measurement-based load balancing
- Scientific applications are typically iteration-based, so the principle of persistence applies
- The RTS collects the CPU and network usage of VPs
- Load balancing shifts work dynamically by migrating threads (VPs): threads can be packed and shipped as needed, and heaps and stacks are migrated with them (see the sketch below)
- Many variations of load balancing strategy: centralized, distributed, aggressive, refinement-based, random, etc.
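In AMPI, the application opts in with a single call at an iteration boundary; the AMPI papers of this period spell it MPI_Migrate() (later releases renamed it AMPI_Migrate). A sketch of the usual usage, with the loop body and period ours:

    #include <mpi.h>

    void do_physics(int step);   /* hypothetical application kernel */

    #define LB_PERIOD 20         /* rebalance every 20 steps (tunable) */

    void timestep_loop(int nsteps) {
        for (int step = 1; step <= nsteps; step++) {
            do_physics(step);
            if (step % LB_PERIOD == 0)
                MPI_Migrate();   /* AMPI extension: a safe point at which
                                    the RTS may migrate this VP */
        }
    }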

Automatic Load Balancing
Application: Fractography3D
- Models fracture propagation in a material

Automatic Load Balancing
- As a wave propagates through the bar of material, elements turn from elastic to plastic, so the load shifts over time
- Without load balancing, CPU utilization follows a zigzag pattern; with load balancing the run finishes early: 16 hours versus 22 hours

[Figure: CPU utilization of Fractography3D without vs. with load balancing]

Communication Optimizations
The AMPI run-time is capable of:
- Observing communication patterns
- Applying communication optimizations accordingly
- Switching between communication algorithms automatically
Examples:
- Streaming strategy for point-to-point communication
- Collectives optimizations

Streaming Strategy
- Combines short messages to reduce per-message overhead (see the sketch below)

[Figure: streaming strategy for point-to-point communication on the NCSA IA-64 cluster]
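The runtime applies this transparently; a hand-rolled equivalent of the idea (ours, purely to show what is being automated) would buffer small records and emit one combined message:

    #include <mpi.h>
    #include <string.h>

    #define BATCH  64                  /* records per combined message */
    #define RECLEN 16                  /* bytes per record */

    static char buf[BATCH * RECLEN];
    static int  nbuf = 0;

    /* For simplicity, assumes every record goes to the same rank. */
    void send_record(const char rec[RECLEN], int dest) {
        memcpy(buf + (size_t)nbuf * RECLEN, rec, RECLEN);
        if (++nbuf == BATCH) {         /* batch full: one send, not 64 */
            MPI_Send(buf, (int)sizeof buf, MPI_CHAR, dest, 0,
                     MPI_COMM_WORLD);
            nbuf = 0;
        }
    }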

Optimizing Collectives
- A number of optimizations were developed to improve collective communication performance
- An asynchronous collective interface allows higher CPU utilization during collectives: computation is only a small proportion of the elapsed time of an all-to-all, so the rest can be overlapped (see the sketch below)

[Figure: time breakdown of an all-to-all operation using the Mesh library]
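AMPI provided nonblocking collectives years before MPI-3 standardized them. The sketch below uses the MPI-3 spelling MPI_Ialltoall; AMPI's asynchronous interface at the time of the talk was equivalent in shape, though its exact spelling is not given on this slide.

    #include <mpi.h>

    void independent_work(void);   /* hypothetical work that does not
                                      depend on the all-to-all result */

    void transpose_overlap(const double *sendbuf, double *recvbuf,
                           int blk, MPI_Comm comm) {
        MPI_Request req;
        MPI_Ialltoall(sendbuf, blk, MPI_DOUBLE,
                      recvbuf, blk, MPI_DOUBLE, comm, &req);
        independent_work();                  /* overlaps the collective */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* recvbuf is now complete */
    }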

Virtualization Overhead
- Compared with the performance benefits, the overhead is very small
  - It is usually offset by the caching effect alone
  - Performance is better when the features above are applied
  - The exception is very fine-grained applications

[Figure: performance of point-to-point communication on the NCSA IA-64 cluster]

Flexibility
- Runs on an arbitrary number of processors
  - A job written for a specific number of MPI processes can run on any processor count
  - Big runs fit on a few processors (see the sketch below)

[Figure: 3D stencil calculation of size 240³ run on Lemieux]
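Concretely, the rank count is decoupled from the physical processor count at launch. In the sketch below (ours), MPI_Comm_size reports the number of VPs, not processors; the +vp launch flag follows the AMPI manual, though the exact invocation is an assumption here.

    /* Under AMPI, the number of MPI "processes" is the number of VPs
     * chosen at launch, independent of physical processors, e.g.
     *     ./charmrun ./stencil +p19 +vp240
     * runs 240 ranks on 19 processors (launch line per the AMPI manual). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nvp;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nvp);   /* = +vp value, not +p */
        if (rank == 0) printf("%d virtual processors\n", nvp);
        MPI_Finalize();
        return 0;
    }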

Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion

Conclusion
Adaptive MPI provides the following benefits:
- Adaptive overlap
- Automatic load balancing
- Communication optimizations
- Flexibility
- Automatic checkpoint/restart mechanism
- Shrink/expand
AMPI is being used in real-world parallel applications and frameworks:
- Rocket simulation at CSAR
- FEM Framework
It is portable to a variety of HPC platforms.

Speaker notes: AMPI is a mature and portable system.

Future Work
Performance improvement:
- Reducing overhead
- Intelligent communication strategy substitution
- Machine-topology-specific load balancing
Performance analysis:
- More direct support for AMPI programs

Thank You!
AMPI is available for download at: http://charm.cs.uiuc.edu/
Parallel Programming Lab, University of Illinois