Supporting Multi-domain Decomposition for MPI Programs
Laxmikant Kale, Computer Science
18 May 2000
©1999 Board of Trustees of the University of Illinois

2. Adaptive Strategies
- Dynamic behavior of rocket simulation components: burning away of solid fuel; adaptive refinement
- Possible external interference on timeshared clusters
- Requires adaptive load balancing strategies
- Idea: automatically adapt to variations via multi-domain decomposition
- Minimal influence on component applications: they should be written as if stand-alone, with minimal recoding/effort to integrate

3. Multi-domain decomposition
- Divide the computation into a large number of small pieces
- Typically many more pieces than processors
- Let the system map the pieces (chunks) to processors

4. Multi-domain decomposition: advantages
- Separation of concerns: the system automates what it can do best
- The system can remap chunks to deal with load imbalances and external interference
- Preserves and encourages modularity: multiple codes can interleave

5. Multi-domain decomposition
[Figure: many chunks mapped by the runtime system (RTS) onto processors PE0-PE3]

6. Need for data-driven execution support
- There are multiple chunks on each processor: how should execution interleave between them? Who gets to run next?
- Charm++ provides this support: parallel C++ with data-driven objects
  - Object groups: a global object with a "representative" on each PE
  - Object arrays and object collections
  - Asynchronous method invocation
  - Prioritized scheduling
- Mature, robust, portable
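The following is a minimal sketch, in modern Charm++ syntax, of what such overdecomposition looks like: a 1D chare array of many chunks, created and invoked asynchronously, with the runtime choosing the mapping. Module, class, and method names (sim, Chunk, doStep) are illustrative, not from the talk; the interface (.ci) file is shown in comments.

```cpp
// sim.ci -- Charm++ interface file (illustrative):
//   mainmodule sim {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry [reductiontarget] void done();
//     };
//     array [1D] Chunk {
//       entry Chunk();
//       entry void doStep();
//     };
//   };

// sim.C -- compiled with charmc
#include "sim.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    int numChunks = 8 * CkNumPes();               // overdecomposition: chunks >> PEs
    CProxy_Chunk chunks = CProxy_Chunk::ckNew(numChunks);
    chunks.doStep();                               // asynchronous broadcast to all chunks
  }
  void done() { CkPrintf("All chunks finished.\n"); CkExit(); }
};

class Chunk : public CBase_Chunk {
 public:
  Chunk() {}
  Chunk(CkMigrateMessage*) {}                      // needed for migratable objects
  void doStep() {
    // ... compute on this chunk's piece of the domain ...
    contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
  }
};

#include "sim.def.h"
```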

7. Data-driven execution
[Figure: each processor runs a scheduler that picks the next message from a message queue and invokes the corresponding object's method]

8. Object-based load balancing
- Application-induced imbalances in CSE apps are typically abrupt but infrequent, or slow and cumulative; frequent, large changes are rare
- Principle of persistence (an extension of the principle of locality): the behavior of objects, including computational load and communication patterns, tends to persist over time
- Measurement-based load balancing: periodic or triggered, and automatic
- A framework with strategies that exploit this principle automatically has already been implemented

9. Object-based load balancing
- Automatic instrumentation via the Charm RTS: per-object times, and the object communication graph (number and total size of messages between every pair of objects)
- Suite of remapping strategies: several centralized strategies for the periodic case; distributed strategies for frequent load balancing and for workstation clusters
- In productive use for several applications, within CSAR and outside
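As a hedged sketch of how an array element might participate in such periodic, measurement-based balancing (extending the illustrative Chunk array above; the balancing period and the exact pup contents are assumptions, not from the talk):

```cpp
class Chunk : public CBase_Chunk {
  int step_ = 0;
  static const int kLBPeriod = 50;        // illustrative balancing period
 public:
  Chunk() { usesAtSync = true; }          // opt in to synchronized load balancing
  Chunk(CkMigrateMessage*) {}

  void pup(PUP::er& p) {                  // serialize state so the object can migrate
    CBase_Chunk::pup(p);
    p | step_;
    // (this chunk's application data would be packed/unpacked here as well)
  }

  void doStep() {
    // ... compute and communicate for this timestep ...
    if (++step_ % kLBPeriod == 0)
      AtSync();                           // let the RTS measure, decide, and migrate
    else
      thisProxy[thisIndex].doStep();      // continue asynchronously
  }
  void ResumeFromSync() { thisProxy[thisIndex].doStep(); }  // resume after remapping
};
// The strategy is chosen at run time, e.g.:  ./charmrun ./sim +p64 +balancer RefineLB
```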

10. Example of Recent Speedup Results: Molecular Dynamics on ASCI Red

11. Adaptive objects: how can they be used in more traditional (MPI) programs?
- AMPI: adaptive, array-based MPI
- Being used for the Rocket Simulation component applications
- What other consequences and benefits follow?
  - Automatic checkpointing scheme
  - Timeshared parallel machines and clusters

12. Using the Adaptive Objects Framework
- The framework requires Charm++, while the rocket simulation components use MPI
- No problem! We have implemented AMPI on top of Charm++: user-level threads embedded in objects, multiple threads per processor, and thread migration!
- Multiple threads per processor means no global variables:
  - MPI programs must be changed a bit to encapsulate globals (see the sketch below)
  - Automatic conversion is possible: future work with Hoeflinger/Padua
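A small illustrative sketch of the "encapsulate globals" change (all names are hypothetical): former file-scope globals move into a per-rank struct that is passed explicitly, so each user-level MPI "process" keeps its own copy even when several of them share one OS process.

```cpp
// Before (ordinary MPI): file-scope globals, one copy per OS process.
//   static double* field;
//   static int     nlocal;

// After (AMPI-friendly): one instance per virtual MPI process.
#include <mpi.h>
#include <vector>

struct RankState {                 // all former globals live here
  std::vector<double> field;
  int nlocal = 0;
};

static void timestep(RankState& s, MPI_Comm comm) {
  // ... update s.field and exchange boundary data over comm ...
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  RankState s;                     // state private to this rank's thread
  s.nlocal = 1000;
  s.field.assign(s.nlocal, 0.0);
  for (int step = 0; step < 10; ++step)
    timestep(s, MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
```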

13. AMPI
- Array-based Adaptive MPI
- Each MPI "process" runs in a user-level thread embedded in a Charm++ object
- All MPI calls (will be) supported
- Threads are migratable! They migrate under the control of the load balancer/RTS
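A hedged sketch of how such a code might be built and run with more virtual MPI ranks than processors; the command lines and the migration call reflect early AMPI usage and may differ between versions.

```cpp
// Illustrative build/run lines for AMPI:
//   ampicc -o rocprog rocprog.C           # compile the MPI code against AMPI
//   ./charmrun ./rocprog +p8 +vp64        # 64 virtual MPI ranks on 8 processors
#include <mpi.h>

static void compute_step(MPI_Comm comm) {
  // ... one timestep of the application (placeholder) ...
  (void)comm;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  for (int step = 0; step < 1000; ++step) {
    compute_step(MPI_COMM_WORLD);
    // Periodically give the runtime a chance to measure load and migrate this
    // thread; early AMPI exposed this as an extension call such as MPI_Migrate():
    // if (step % 50 == 0) MPI_Migrate();
  }
  MPI_Finalize();
  return 0;
}
```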

14. Making threads migrate
- Thread migration is difficult: pointers within the stack; need to minimize user intervention
- Solution strategy I: multiple threads (and stacks) on each PE, but use only one stack region; copy the entire stack in and out on each context switch!
- Advantages and disadvantages:
  - Pointers are safe: migrated threads run in a stack at the same virtual address
  - Higher overhead at every context switch, even when there is no migration; how much is this overhead?

15. Context switching overhead

16. Low-overhead migratable threads
- Challenge: if the stack size is large, the overhead is large even without migration
- Possible remedy: reduce the stack size
- Isomalloc idea (due to a group at ENS Lyon, France): when a thread is created, allocate its virtual address space on all processors
  - Now context switching is faster, and migration is still possible, as before
  - Scalability optimization: map and unmap, to avoid virtual-memory clutter
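A rough sketch of the isomalloc idea under stated assumptions (addresses, sizes, and function names are illustrative, and this is not the actual Charm++ implementation): each thread's stack is mapped at a virtual address slot reserved for it on every processor, so a migrated stack keeps valid internal pointers.

```cpp
#include <sys/mman.h>
#include <cstdint>
#include <cstddef>

static const uintptr_t kSlotBase = 0x100000000000ULL;  // start of the reserved region
static const size_t    kSlotSize = 1 << 20;            // 1 MB stack slot per thread

// Map thread t's stack at the same virtual address on whichever PE it runs on.
// (MAP_FIXED assumes the region was set aside for this purpose; a real
// implementation must coordinate the reservation across processors.)
void* map_thread_stack(int thread_id) {
  void* addr = reinterpret_cast<void*>(kSlotBase + (uintptr_t)thread_id * kSlotSize);
  void* p = mmap(addr, kSlotSize, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  return (p == MAP_FAILED) ? nullptr : p;
}

// Unmap the slot when the thread migrates away, to avoid cluttering the
// address space with unused mappings (the "map and unmap" optimization).
void unmap_thread_stack(int thread_id) {
  void* addr = reinterpret_cast<void*>(kSlotBase + (uintptr_t)thread_id * kSlotSize);
  munmap(addr, kSlotSize);
}
```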

17. Application Performance: ROCFLO

18. ROCSOLID: scaled problem size

19. Other applications of adaptive objects
- Adaptive objects are a versatile idea
- Cache performance optimization
- Out-of-core codes: automatically prefetch each object and its parameters before the method executes
- Flexible checkpointing
- Timeshared parallel machines: effective utilization of resources

20. Checkpointing
- When running on a large configuration, e.g., on a 4096-PE system, if even one PE is down you can't restart from the last checkpoint!
- In some applications the number of PEs must be a power of two (or a square, ...)
- A solution based on objects: checkpoint objects, not processors
  - Note: the number of objects needs to be a power of two, not the number of PEs
  - Restart on a different number of processors, and let the adaptive balancer migrate objects to restore efficiency
- This requires several innovations, but objects open up the potential
  - We have a preliminary system implemented for this: it can restart on fewer or more processors

21. Timeshared parallel machines
- Need the ability to shrink and expand a job to the available number of processors
- Again, we can do that with objects: completely vacate processors when needed, or use a fraction of the power of a desktop workstation
- Already available for Charm++ and AMPI
- Need quality-of-service contracts: min/max PEs, deadlines, performance profile
- Selection of a machine by negotiation: the "Faucets" project is in progress and will link up with Globus

22. Contributors
- AMPI library: Milind Bhandarkar
- Timeshared clusters: Sameer Kumar
- Checkpointing: Sameer Paranjpye
- Charm++ support: Robert Brunner, Orion Lawlor
- ROCFLO/ROCSOLID development and conversion to Charm: Prasad Alavilli, Eric de Sturler, Jay Hoeflinger, Jim Jiao, Fady Najar, Ali Namazifard, David Padua, Dennis Parsons, Zhe Zhang, Jie Zheng

23. Contact
Laxmikant (Sanjay) V. Kale
Dept. of Computer Science and Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
3304 Digital Computer Laboratory
1304 West Springfield Avenue
Urbana, IL, USA