Support for Adaptivity in ARMCI Using Migratable Objects


Support for Adaptivity in ARMCI Using Migratable Objects
Chao Huang, Chee Wai Lee, Laxmikant Kale
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

Motivation
- Different programming paradigms fit different algorithms and applications
- An Adaptive Run-Time System (ARTS) offers performance benefits
- Goal: support ARMCI and global address space languages on an ARTS

Common RTS
Motivations for a common run-time system:
- Support concurrent composability
- Support common functions: load balancing, checkpointing

Outline
- Motivation
- Adaptive Run-Time System
- Adaptive ARMCI Implementation
- Preliminary Results
  - Microbenchmarks
  - Checkpoint/Restart
  - Application Performance: LU
- Future Work

ARTS with Migratable Objects
Programming model:
- The user decomposes the work into parallel objects (virtual processes, VPs)
- The RTS maps VPs onto physical processors
- Typically the number of VPs >> P, to allow for various optimizations
[Figure: user view of VPs vs. the system's implementation, which maps VPs onto physical processors]
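To make the over-decomposition idea concrete, here is a minimal C sketch (not from the slides): the problem is split into many VPs, and an initial, static map assigns each VP to a processor. The names NUM_VPS, NUM_PES and initial_map are illustrative only; a real ARTS replaces this static map with a dynamic, measurement-based one.

```c
#include <stdio.h>

#define NUM_VPS 64   /* number of virtual processes; VPs >> P */
#define NUM_PES 8    /* number of physical processors         */

/* Simple static round-robin map from VP to processor (illustrative only). */
static int initial_map(int vp) { return vp % NUM_PES; }

int main(void) {
    for (int vp = 0; vp < NUM_VPS; ++vp)
        printf("VP %2d -> PE %d\n", vp, initial_map(vp));
    return 0;
}
```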

Features and Benefits of ARTS
- Adaptive overlap
- Automatic load balancing
- Automatic checkpoint/restart
- Communication optimizations
- Software engineering benefits

Adaptive Overlap
- Challenge: gap between the completion time and the CPU overhead of communication
- Solution: overlap communication with computation, a natural fit for the ARTS
[Figure: completion time and CPU overhead of 2-way ping-pong communication on an Apple G5 cluster]
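A hedged sketch of communication/computation overlap using ARMCI's non-blocking one-sided operations (the ARTS additionally overlaps work across VPs co-located on a processor). ARMCI_NbGet, ARMCI_INIT_HANDLE and ARMCI_Wait are standard ARMCI calls; local_buf, remote_buf, peer and compute_local() are placeholders, and remote_buf is assumed to be the peer's buffer address obtained via ARMCI_Malloc.

```c
#include <armci.h>

extern void compute_local(void);   /* placeholder for useful local work */

void overlap_example(double *local_buf, double *remote_buf, int n, int peer) {
    armci_hdl_t handle;
    ARMCI_INIT_HANDLE(&handle);

    /* Start the one-sided get, but do not wait for it yet. */
    ARMCI_NbGet(remote_buf, local_buf, n * (int)sizeof(double), peer, &handle);

    compute_local();               /* useful work while data is in flight */

    ARMCI_Wait(&handle);           /* block only when the data is needed  */
    /* ... use local_buf ... */
}
```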

Automatic Load Balancing
- Challenge
  - Dynamically varying applications
  - Load imbalance impacts overall performance
- Solution
  - Measurement-based load balancing
    - Scientific applications are typically iteration-based (the Principle of Persistence)
    - The RTS collects CPU and network usage of VPs
  - Load balancing by migrating threads (VPs)
    - Threads can be packed and shipped as needed
  - Different variations of load balancing strategies, e.g. communication-aware, topology-based
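From the application's side, measurement-based load balancing usually only requires marking safe migration points at iteration boundaries; the RTS then decides whether to move threads. The sketch below illustrates that usage pattern; the hook name ARMCI_Migrate() is an assumption for illustration and is not confirmed by the slides.

```c
extern void ARMCI_Migrate(void);        /* assumed RTS rebalancing hook */
extern void compute_iteration(void);    /* placeholder application work */
extern void exchange_boundaries(void);  /* placeholder communication    */

void timestep_loop(int nsteps, int lb_period) {
    for (int step = 0; step < nsteps; ++step) {
        compute_iteration();
        exchange_boundaries();
        if (step % lb_period == 0)
            ARMCI_Migrate();   /* safe point: let the RTS migrate VPs */
    }
}
```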

Features and Benefits of ARTS
- Adaptive overlap
- Automatic load balancing
- Automatic checkpoint/restart
- Communication optimizations
- Software engineering benefits

Outline
- Motivation
- Adaptive Run-Time System
- Adaptive ARMCI Implementation
- Preliminary Results
  - Microbenchmarks
  - Checkpoint/Restart
  - Application Performance: LU
- Future Work

ARMCI
- Aggregate Remote Memory Copy Interface (ARMCI)
- Remote memory access (RMA) operations (one-sided communication)
  - Contiguous and noncontiguous (strided, vector); blocking and non-blocking
- Supports various global address space models
  - Global Arrays, the Co-Array Fortran compiler, Adlib
- Built on top of MPI or PVM; now also on Charm++
- Well-designed noncontiguous interface
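A hedged sketch of typical ARMCI usage (generic ARMCI, not specific to the Charm++ port): a contiguous ARMCI_Put and a 2-D strided ARMCI_PutS into memory allocated collectively with ARMCI_Malloc. All calls shown are standard ARMCI/MPI; error checking is omitted and the sizes are illustrative.

```c
#include <mpi.h>
#include <armci.h>

#define N 8   /* each process owns an N x N block of doubles */

int main(int argc, char **argv) {
    int me, nproc;
    MPI_Init(&argc, &argv);            /* ARMCI runs over a messaging layer */
    ARMCI_Init();
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Collectively allocate one remotely accessible block per process. */
    void *base[nproc];
    ARMCI_Malloc(base, N * N * sizeof(double));

    double local[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            local[i][j] = (double)me;

    int peer = (me + 1) % nproc;

    /* Contiguous one-sided put of the whole block to the neighbor. */
    ARMCI_Put(local, base[peer], N * N * (int)sizeof(double), peer);

    /* Strided put: only the first N/2 elements of each row. */
    int src_stride[1] = { (int)(N * sizeof(double)) };  /* bytes between rows */
    int dst_stride[1] = { (int)(N * sizeof(double)) };
    int count[2] = { (int)((N / 2) * sizeof(double)),   /* bytes per row chunk */
                     N };                               /* number of rows      */
    ARMCI_PutS(local, src_stride, base[peer], dst_stride, count, 1, peer);

    ARMCI_Barrier();
    ARMCI_Free(base[me]);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}
```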

Virtualizing ARMCI Processes
Each ARMCI virtual process is implemented by a lightweight, user-level thread embedded in a migratable object.
[Figure: virtual processes mapped onto real processors]

Isomalloc Memory
Isomalloc approach for migratable threads:
- The same iso-address area is reserved in every node's virtual address space (set up at startup, between the process stack and the heap)
- Separate regions are globally reserved for each VP
- Memory is allocated locally
- Thread data can be moved without any pointer or address updates
[Figure: identical iso-address slots (i, j, ..., m, n) reserved on processors P0 and P1]
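A conceptual C sketch of the isomalloc idea (not the actual Charm++ implementation): each node maps a given VP's slot at the same fixed virtual address, so a migrated thread's stack and heap remain valid without pointer fix-ups. ISO_BASE and SLOT_SIZE are illustrative values, not real runtime constants.

```c
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#define ISO_BASE  ((uintptr_t)0x100000000000ULL)  /* start of reserved area */
#define SLOT_SIZE ((size_t)1 << 30)               /* 1 GB slot per VP       */

/* Map VP 'vp's slot at the same virtual address on every node. */
static void *iso_slot_attach(int vp) {
    void *addr = (void *)(ISO_BASE + (uintptr_t)vp * SLOT_SIZE);
    return mmap(addr, SLOT_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```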

Microbenchmarks
[Figure: performance of contiguous operations on an IA64 cluster]

Microbenchmarks
[Figure: performance of strided operations on an IA64 cluster]

Checkpoint/Restart
- Checkpoint/restart is automated at the run-time level; the user inserts simple function calls
- Possible NFS bottleneck for the on-disk scheme
- Alternative: in-memory scheme

On-disk checkpoint time of LU, on 2 to 32 PEs of an IA64 cluster:

  P    Total Data (MB)   Time (ms)   Bandwidth (MB/s)
  2         20.05           221           90.8
  4         22.29           249           89.7
  8         26.5            303           87.6
 16         35.43           366           96.9
 32         53.27           533          100
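A hedged sketch of the "simple function calls" the user adds for automated checkpointing. The call name ARMCI_Checkpoint() and its directory argument are assumptions for illustration; the actual interface of the adaptive ARMCI layer may differ (for example, an in-memory variant would take no directory).

```c
extern void ARMCI_Checkpoint(const char *dirname);  /* assumed extension */
extern void compute_iteration(void);                /* placeholder work  */

void run(int nsteps, int ckpt_period) {
    for (int step = 0; step < nsteps; ++step) {
        compute_iteration();
        if (step % ckpt_period == 0)
            ARMCI_Checkpoint("ckpt");  /* on-disk scheme; NFS can be a bottleneck */
    }
}
```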

Application Performance
[Figure: performance of the LU application on an IA64 cluster]

Application Performance
[Figure: performance of the LU-Block application on an IA64 cluster]

Future Work
- Performance optimization
  - Reduce overheads
  - RDMA
- Performance tuning
  - Visualization and analysis tools (Projections)
- Port other GAS languages
  - GA and the CAF compiler