Support for Adaptivity in ARMCI Using Migratable Objects

Slides:

Advertisements

Similar presentations

Program Analysis and Tuning The German High Performance Computing Centre for Climate and Earth System Research Panagiotis Adamidis.

Advertisements

The Charm++ Programming Model and NAMD Abhinav S Bhatele Department of Computer Science University of Illinois at Urbana-Champaign

Adaptive MPI Chao Huang, Orion Lawlor, L. V. Kalé Parallel Programming Lab Department of Computer Science University of Illinois at Urbana-Champaign.

A Framework for Collective Personalized Communication Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan.

Charm++ Load Balancing Framework Gengbin Zheng Parallel Programming Laboratory Department of Computer Science University of Illinois at.

Towards Efficient Load Balancing in Structured P2P Systems Yingwu Zhu, Yiming Hu University of Cincinnati.

Reference: / Parallel Programming Paradigm Yeni Herdiyeni Dept of Computer Science, IPB.

Tutorial 6 Memory Management

Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.

Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.

Adaptive MPI Milind A. Bhandarkar

Supporting Multi-domain decomposition for MPI programs Laxmikant Kale Computer Science 18 May 2000 ©1999 Board of Trustees of the University of Illinois.

Boston, May 22 nd, 2013 IPDPS 1 Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q Sameer Kumar* IBM T J Watson Research.

AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER Author ： Y. Zhao 、 C. Hu 、 S. Wang 、 S. Zhang Source ： Proceedings of the 2nd IASTED.

Introduction, background, jargon Jakub Yaghob. Literature T.G.Mattson, B.A.Sanders, B.L.Massingill: Patterns for Parallel Programming, Addison- Wesley,

A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.

NIH Resource for Biomolecular Modeling and Bioinformatics Beckman Institute, UIUC NAMD Development Goals L.V. (Sanjay) Kale Professor.

Overcoming Scaling Challenges in Bio-molecular Simulations Abhinav Bhatelé Sameer Kumar Chao Mei James C. Phillips Gengbin Zheng Laxmikant V. Kalé.

1CPSD Software Infrastructure for Application Development Laxmikant Kale David Padua Computer Science Department.

Fault Tolerant Extensions to Charm++ and AMPI presented by Sayantan Chakravorty Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi.

HPC Components for CCA Manoj Krishnan and Jarek Nieplocha Computational Sciences and Mathematics Division Pacific Northwest National Laboratory.

1 ©2004 Board of Trustees of the University of Illinois Computer Science Overview Laxmikant (Sanjay) Kale ©

Memory-Aware Scheduling for LU in Charm++ Isaac Dooley, Chao Mei, Jonathan Lifflander, Laxmikant V. Kale.

Parallelization Strategies Laxmikant Kale. Overview OpenMP Strategies Need for adaptive strategies –Object migration based dynamic load balancing –Minimal.

Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana- Champaign.

1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.

Motivation: dynamic apps Rocket center applications: –exhibit irregular structure, dynamic behavior, and need adaptive control strategies. Geometries are.

Is MPI still part of the solution ? George Bosilca Innovative Computing Laboratory Electrical Engineering and Computer Science Department University of.

FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.

Embedded Real-Time Systems

INTRODUCTION TO HIGH PERFORMANCE COMPUTING AND TERMINOLOGY.

Teragrid 2009 Scalable Interaction with Parallel Applications Filippo Gioachin Chee Wai Lee Laxmikant V. Kalé Department of Computer Science University.

Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn.

Application of Design Patterns to Geometric Decompositions V. Balaji, Thomas L. Clune, Robert W. Numrich and Brice T. Womack.

Flexibility and Interoperability in a Parallel MD code Robert Brunner, Laxmikant Kale, Jim Phillips University of Illinois at Urbana-Champaign.

Chapter 3: Windows7 Part 5.

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S

Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming

Current Generation Hypervisor Type 1 Type 2.

Accelerating Large Charm++ Messages using RDMA

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S

Parallel Objects: Virtualization & In-Process Components

Task Scheduling for Multicore CPUs and NUMA Systems

Parallel Algorithm Design

Performance Evaluation of Adaptive MPI

AMPI: Adaptive MPI Tutorial

Implementing Simplified Molecular Dynamics Simulation in Different Parallel Paradigms Chao Mei April 27th, 2006 CS498LVK.

Scalable Fault Tolerance Schemes using Adaptive Runtime Support

Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab

AMPI: Adaptive MPI Celso Mendes & Chao Huang

Department of Computer Science University of California, Santa Barbara

Component Frameworks:

Milind A. Bhandarkar Adaptive MPI Milind A. Bhandarkar

Lecture 3: Main Memory.

Parallelization of CPAIMD using Charm++

Faucets: Efficient Utilization of Multiple Clusters

Projections Overview Ronak Buch & Laxmikant (Sanjay) Kale

High Performance Computing

Charisma: Orchestrating Migratable Parallel Objects

BigSim: Simulating PetaFLOPS Supercomputers

Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kale

Chao Huang Parallel Programming Laboratory University of Illinois

IXPUG, SC’16 Lightning Talk Kavitha Chandrasekar*, Laxmikant V. Kale

Chapter 01: Introduction

Department of Computer Science University of California, Santa Barbara

An Orchestration Language for Parallel Objects

RUN-TIME STORAGE Chuen-Liang Chen Department of Computer Science

Higher Level Languages on Adaptive Run-Time System

Laxmikant (Sanjay) Kale Parallel Programming Laboratory

Page Main Memory.

Presentation transcript:

Support for Adaptivity in ARMCI Using Migratable Objects Chao Huang, Chee Wai Lee, Laxmikant Kale Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Motivation Different programming paradigms fit different algorithms and applications Adaptive Run-Time System (ARTS) offers performance benefits Goal: to support ARMCI and global address space languages on ARTS 5/30/2019

Common RTS Motivations for common run-time system Support concurrent composibility Support common functions: load-balancing, checkpoint 5/30/2019

Outline Motivation Adaptive Run-Time System Adaptive ARMCI Implementation Preliminary Results Microbenchmarks Checkpoint/Restart Application Performance: LU Future Work 5/30/2019

ARTS with Migratable Objects Programming model User decomposes work to parallel objects (VPs) RTS maps VPs onto physical processors Typically, number of VPs >> P, to allow for various optimizations System Implementation User View 5/30/2019

Features and Benefits of ARTS Adaptive overlap Automatic load balancing Automatic checkpoint/restart Communication optimizations Software engineering benefits 5/30/2019

Adaptive Overlap Challenge: Gap between completion time and CPU overhead Solution: Overlap between communication and computation Natural fit for ARTS Completion time and CPU overhead of 2-way ping-pong communication on Apple G5 Cluster 5/30/2019

Automatic Load Balancing Challenge Dynamically varying applications Load imbalance impacts overall performance Solution Measurement-based load balancing Scientific applications are typically iteration-based The Principle of Persistence RTS collects CPU and network usage of VPs Load balancing by migrating threads (VPs) Threads can be packed and shipped as needed Different variations of load balancing strategies Eg. communication-aware, topology-based 5/30/2019

Features and Benefits of ARTS Adaptive overlap Automatic load balancing Automatic checkpoint/restart Communication optimizations Software engineering benefits 5/30/2019

Outline Motivation Adaptive Run-Time System Adaptive ARMCI Implementation Preliminary Results Microbenchmarks Checkpoint/Restart Application Performance: LU Future Work 5/30/2019

ARMCI Aggregate Remote Memory Copy Interface (ARMCI) Remote memory access (RMA) operations (one-sided communication) Contiguous and noncontiguous (strided, vector); blocking and non-blocking Supporting various global-address space models Global Array, Co-Array Fortran compiler, Adlib Built on top of MPI or PVM Now on Charm++ Well designed noncontiguous interface 5/30/2019

Virtualizing ARMCI Processes Each ARMCI virtual process is implemented by a light-weight, user-level thread embedded in a migratable object Real Processors Virtual Processes Virtual Processes 5/30/2019

Isomalloc Memory Isomalloc approach for migratable threads Same iso-address area in all nodes’ virtual address space Separate regions globally reserved for each VP Memory allocated locally Thread data moved, without pointer or address update ... ... i i j j ... ... m m n n Iso-address is between process stack and heap setup at startup ... ... P0 P1 5/30/2019

Performance of contiguous operation on IA64 Cluster Microbenchmarks Performance of contiguous operation on IA64 Cluster 5/30/2019

Performance of strided operation on IA64 Cluster Microbenchmarks Performance of strided operation on IA64 Cluster 5/30/2019

On-disk checkpoint time of LU, on 2 to 32 PEs on IA64 Cluster Checkpoint/restart automated at run-time level User inserts simple function calls Possible NFS bottleneck for on-disk scheme Alternative: in-memory scheme P Total Data (MB) Time (ms) Bandwidth (MB/s) 2 20.05 221 90.8 4 22.29 249 89.7 8 26.5 303 87.6 16 35.43 366 96.9 32 53.27 533 100 On-disk checkpoint time of LU, on 2 to 32 PEs on IA64 Cluster 5/30/2019

Application Performance Performance of LU application on IA64 Cluster 5/30/2019

Application Performance Performance of LU-Block application on IA64 Cluster 5/30/2019

Future Work Performance Optimization Performance Tuning Reduce overheads Performance Tuning Visualization and analysis tools Port other GAS languages GA and CAF compiler RDMA Projections collaboration 5/30/2019