Published by Holly Day. Modified over 9 years ago.
1
Supporting Multi-domain Decomposition for MPI Programs
Laxmikant Kale, Computer Science
18 May 2000
©1999 Board of Trustees of the University of Illinois
2
Adaptive Strategies
- Dynamic behavior of rocket simulation components:
  - Burning away of solid fuel
  - Adaptive refinement
- Possible external interference on timeshared clusters
- Requires adaptive load balancing strategies
- Idea: automatically adapt to variations
  - Multi-domain decomposition
  - Minimal influence on component applications:
    - They should be written as if they were stand-alone codes
    - Minimal recoding effort to integrate
©2000 Board of Trustees of the University of Illinois
3
Multi-domain decomposition
- Divide the computation into a large number of small pieces
  - The number of pieces is typically much larger than the number of processors
- Let the system map pieces (chunks) to processors
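As a rough illustration of over-decomposition, the sketch below splits the work into many more chunks than processors and computes an initial chunk-to-PE placement. The function name `block_map` and the contiguous-block policy are assumptions made for this example; the slides do not say how the run-time system chooses its initial mapping.

```python
def block_map(num_chunks, num_pes):
    """Return a dict chunk -> PE that places chunks in contiguous blocks.

    Hypothetical helper: it stands in for whatever initial mapping the
    run-time system would actually pick.
    """
    if num_pes <= 0 or num_chunks < num_pes:
        raise ValueError("expect at least one chunk per PE")
    return {chunk: chunk * num_pes // num_chunks for chunk in range(num_chunks)}


def chunks_per_pe(mapping, num_pes):
    """Count how many chunks each PE holds under a given mapping."""
    counts = [0] * num_pes
    for pe in mapping.values():
        counts[pe] += 1
    return counts
```

Because the application only sees chunks, the system remains free to change this mapping later without any change to the application code.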
4
Multi-domain decomposition: advantages
- Separation of concerns: the system automates what can best be done by the system
- The system can remap chunks to deal with load imbalances and external interference
- Preserves and encourages modularity: multiple codes can interleave
5
Multi-domain decomposition
[Figure: numbered chunks of the decomposed domain mapped by the run-time system (RTS) onto processors PE0 to PE3]
6
Need Data Driven Execution Support
- There are multiple chunks on each processor
  - How should execution interleave between them?
  - Who gets the chance to run next?
- Charm++ provides this support:
  - Parallel C++ with data driven objects
  - Object groups: a global object with a “representative” on each PE
  - Object arrays and object collections
  - Asynchronous method invocation
  - Prioritized scheduling
  - Mature, robust, portable
  - http://charm.cs.uiuc.edu
7
Data driven execution
[Figure: scheduler and message queue]
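The scheduler and message queue can be pictured with a small sketch: every asynchronous method invocation becomes a message in a per-processor queue, and the scheduler repeatedly picks the next message and runs the target object's method. This is a toy model in Python, not Charm++ code; the class names `Scheduler` and `Chunk` and the convention that a smaller priority value runs first are assumptions for this example.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler with a prioritized message queue."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority

    def send(self, obj, method, *args, priority=0):
        """Asynchronous invocation: enqueue a message and return at once."""
        heapq.heappush(self._queue, (priority, next(self._seq), obj, method, args))

    def run(self):
        """Deliver messages until the queue is empty."""
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Chunk:
    """Hypothetical data driven object: it runs only when a message arrives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def compute(self, step):
        self.log.append((self.name, step))
```

No chunk ever blocks the processor while waiting for data: whichever object has a message available is the one that runs next.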
8
Object based load balancing
- Application induced imbalances in CSE apps:
  - Abrupt but infrequent, or
  - Slow and cumulative
  - Rarely: frequent, large changes
- Principle of persistence
  - Extension of the principle of locality
  - Behavior of objects, including computational load and communication patterns, tends to persist over time
- Measurement based load balancing:
  - Periodic or triggered
  - Automatic
  - A framework is already implemented, with strategies that exploit this principle automatically
9
Object based load balancing
- Automatic instrumentation via the Charm RTS
  - Object times
  - Object communication graph: number and total size of messages between every pair of objects
- Suite of remapping strategies
  - Centralized: for the periodic case
    - Several strategies
  - Distributed strategies
    - For frequent load balancing
    - For workstation clusters
- In fruitful use for several applications
  - CSAR and outside
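To make measurement based remapping concrete, here is a minimal sketch of one possible centralized strategy. It relies on the principle of persistence by treating each object's measured time as a prediction of its future load, then assigns the heaviest objects first to the currently least loaded PE. The greedy heuristic and the names `greedy_remap` and `pe_loads` are assumptions for illustration; the slides do not name the strategies in the actual suite, and this sketch ignores the communication graph.

```python
import heapq


def greedy_remap(object_loads, num_pes):
    """Map objects to PEs, heaviest first, always onto the least loaded PE.

    object_loads: dict object id -> measured time (used as predicted load).
    Returns a dict object id -> PE.
    """
    heap = [(0.0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    placement = {}
    # Sort by decreasing load; break ties by object id for determinism.
    for obj, load in sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        pe_load, pe = heapq.heappop(heap)
        placement[obj] = pe
        heapq.heappush(heap, (pe_load + load, pe))
    return placement


def pe_loads(object_loads, placement, num_pes):
    """Total predicted load per PE under a given placement."""
    totals = [0.0] * num_pes
    for obj, pe in placement.items():
        totals[pe] += object_loads[obj]
    return totals
```

A strategy that also used the measured communication graph would additionally try to keep heavily communicating objects on the same PE.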
10
Example of Recent Speedup Results: Molecular Dynamics on ASCI Red
11
Adaptive objects
- How can they be used in more traditional (MPI) programs?
  - AMPI: Adaptive array-based MPI
  - Being used for Rocket Simulation component applications
- What other consequences and benefits follow?
  - Automatic checkpointing scheme
  - Timeshared parallel machines and clusters
12
Using Adaptive Objects Framework
- The framework requires Charm++, while Rocket simulation components use MPI
- No problem! We have implemented AMPI on top of Charm++
  - User-level threads embedded in objects
  - Multiple threads per processor
  - Thread migration!
- Multiple threads per processor: no global variables
  - MPI programs must be changed a bit to encapsulate globals
  - Automatic conversion is possible: future work with Hoeflinger/Padua
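The reason globals must be encapsulated is that several MPI "processes" now share one address space, so a global variable would be shared by all of them. The sketch below shows the shape of the change using ordinary Python threads as a stand-in: state that used to be global is gathered into one object per rank and passed explicitly. The real component codes are MPI programs, not Python; `RankState`, `solver_step`, and the dummy residual update are hypothetical names and logic invented for this illustration.

```python
import threading


class RankState:
    """Holds what used to be global variables, one instance per MPI 'process'."""

    def __init__(self, rank):
        self.rank = rank
        self.iteration = 0
        self.residual = 0.0


def solver_step(state):
    """Touches only the state passed in, never a module-level global."""
    state.iteration += 1
    state.residual = 1.0 / (state.rank + state.iteration)  # dummy update


def run_rank(rank, steps, results):
    state = RankState(rank)  # private to this thread
    for _ in range(steps):
        solver_step(state)
    results[rank] = (state.iteration, state.residual)


def run_all(num_ranks, steps):
    """Run several 'ranks' as threads inside one process."""
    results = {}
    threads = [
        threading.Thread(target=run_rank, args=(r, steps, results))
        for r in range(num_ranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Had `iteration` been a module-level variable, all threads would have incremented the same counter and each rank would have observed the others' updates.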
13
AMPI: Array-based Adaptive MPI
- Each MPI “process” runs in a user-level thread embedded in a Charm++ object
- All MPI calls (will be) supported
- Threads are migratable!
  - They migrate under the control of the load balancer/RTS
14
Making threads migrate
- Thread migration is difficult:
  - Pointers within the stack
  - Need to minimize user intervention
- Solution strategy I:
  - Multiple threads (stacks) on each PE
  - Use only one stack
  - Copy the entire stack in and out on each context switch!
- Advantages and disadvantages:
  - Pointers are safe: migrated threads run in the stack at the same virtual address
  - Higher overhead at context switch, even when there is no migration
  - How much is this overhead?
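The stack-copying strategy can be modeled in a few lines. In this toy simulation a single shared buffer plays the role of the one real stack, every thread keeps a saved copy of its own stack contents, and each context switch copies the outgoing thread out and the incoming thread in. The class name `StackCopyScheduler` and the byte counter are assumptions added to make the copying cost visible; real thread packages do this with machine-level context switches, not Python objects.

```python
class StackCopyScheduler:
    """Toy model: all threads execute in one shared stack region."""

    def __init__(self, stack_size):
        self.shared_stack = bytearray(stack_size)  # the single real stack
        self.saved = {}        # thread id -> saved stack contents
        self.running = None    # thread currently occupying the shared stack
        self.used = 0          # live bytes of the running thread
        self.bytes_copied = 0  # total copying cost so far

    def add_thread(self, tid, initial_contents):
        if len(initial_contents) > len(self.shared_stack):
            raise ValueError("thread stack larger than shared stack")
        self.saved[tid] = bytes(initial_contents)

    def switch_to(self, tid):
        """Context switch: copy the old stack out, then the new stack in."""
        if self.running is not None:
            self.saved[self.running] = bytes(self.shared_stack[:self.used])
            self.bytes_copied += self.used
        incoming = self.saved[tid]
        self.shared_stack[:len(incoming)] = incoming
        self.used = len(incoming)
        self.bytes_copied += len(incoming)
        self.running = tid
```

Since every thread always runs at the same addresses, pointers into the stack stay valid after a migration, but the copy cost is paid on every switch and grows with the stack size.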
15
Context switching overhead
16
Low overhead migratable threads
- Challenge: if the stack size is large, the overhead is large even without migration
  - Possible remedy: reduce stack size
- iso_malloc idea
  - Due to a group at ENS Lyon, France
  - Allocate virtual address space for each thread on all processors when the thread is created
- Now:
  - Context switching is faster
  - Migration is still possible, as before
- Scalability optimizations:
  - Map and unmap, to avoid virtual memory clutter
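The address arithmetic behind the iso_malloc idea can be sketched as follows: every thread is given its own slot of virtual addresses, computed the same way on every processor, so its stack can be moved to another processor and still live at the same addresses. The base address, the slot size, and the function names here are hypothetical values chosen for the example, not those of any real implementation.

```python
def isomalloc_slot(thread_id, base_address=0x4000_0000_0000, slot_size=1 << 20):
    """Return the (start, end) virtual address range reserved for a thread.

    The result depends only on the thread id, so every processor computes
    the same range. base_address and slot_size are illustrative assumptions.
    """
    if thread_id < 0:
        raise ValueError("thread_id must be non-negative")
    start = base_address + thread_id * slot_size
    return start, start + slot_size


def slots_overlap(a, b):
    """True if two half-open address ranges [start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]
```

With disjoint slots no stack has to be copied at a context switch. Only the slots of threads actually resident on a processor need to be mapped there, which is presumably what the map and unmap optimization refers to.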
17
Application Performance: ROCFLO
18
ROCSOLID: scaled problem-size
19
Other applications of adaptive objects
- Adaptive objects is a versatile idea
  - Cache performance optimization
  - Out-of-core codes: automatic prefetch before execution of each object method and parameters
  - Flexible checkpointing
  - Timeshared parallel machines: effective utilization of resources
20
Checkpointing
- When running on a large configuration, e.g. 4096 processors of a 4096 PE system:
  - If one PE is down, can't restart from the last checkpoint!
  - In some applications, the number of PEs must be a power of two (or a square, ...)
- A solution based on objects
  - Checkpoint objects
  - Note: it is the number of objects that needs to be a power of two, not the number of PEs
  - Restart on a different number of processors!
  - Let the adaptive balancer migrate objects to restore efficiency
- Requires several innovations, but the potential is open with objects
- We have a preliminary system implemented for this
  - Can restart on fewer or more processors
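A minimal sketch of why object based checkpoints are independent of the processor count: the checkpoint records state per object rather than per PE, so a restart can redistribute the same objects over however many processors are available. The JSON format, the round-robin restart placement, and the names `checkpoint` and `restart` are assumptions for this example and do not describe the preliminary system mentioned in the slide.

```python
import json


def checkpoint(objects_by_pe):
    """Flatten per-PE object state into a PE-independent checkpoint string.

    objects_by_pe: dict PE -> dict object id (str) -> JSON-serializable state.
    """
    snapshot = {}
    for objs in objects_by_pe.values():
        snapshot.update(objs)
    return json.dumps(snapshot, sort_keys=True)


def restart(data, num_pes):
    """Rebuild the objects on num_pes processors (round-robin placement)."""
    snapshot = json.loads(data)
    placement = {pe: {} for pe in range(num_pes)}
    for i, obj_id in enumerate(sorted(snapshot)):
        placement[i % num_pes][obj_id] = snapshot[obj_id]
    return placement
```

After such a restart the placement may be unbalanced, which is where the adaptive load balancer would take over and migrate objects.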
21
Time shared parallel machines
- Need the ability to shrink and expand a job to the available number of processors
  - Again, we can do that with objects
  - Completely vacating processors when needed
  - Use a fraction of the power on a desktop workstation
  - Already available for Charm++ and AMPI
- Need quality of service contracts
  - Min/max PEs
  - Deadlines
  - Performance profile
- Selection of machine: negotiation
  - “Faucets” project in progress
  - Will link up with Globus
22
Contributors
- AMPI library: Milind Bhandarkar
- Timeshared cluster: Sameer Kumar
- Checkpointing: Sameer Paranjpye
- Charm++ support: Robert Brunner, Orion Lawlor
- ROCFLO/ROCSOLID development and conversion to Charm: Prasad Alavilli, Eric de Sturler, Jay Hoeflinger, Jim Jiao, Fady Najar, Ali Namazifard, David Padua, Dennis Parsons, Zhe Zhang, Jie Zheng
23
Laxmikant (Sanjay) V. Kale
Dept. of Computer Science and Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
3304 Digital Computer Laboratory
1304 West Springfield Avenue
Urbana, IL 61801 USA
kale@cs.uiuc.edu
http://charm.cs.uiuc.edu/~kale
telephone: 217-244-0094
fax: 217-333-1910