Slide 1: Adaptive MPI
Chao Huang, Orion Lawlor, L. V. Kalé
Parallel Programming Lab, Department of Computer Science
University of Illinois at Urbana-Champaign
Slide 2: Motivation
http://charm.cs.uiuc.edu

Challenges:
- New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
- Typical MPI implementations are not naturally suited to dynamic applications
- The set of available processors may not match the number required by the algorithm

Adaptive MPI:
- Virtual MPI Processors (VPs)
- Solutions and optimizations
Slide 3: Outline
- Motivation
- Implementations
- Features and Experiments
- Current Status
- Future Work
Slide 4: Processor Virtualization
- Basic idea: the user specifies the interaction between objects (VPs); the runtime system (RTS) maps VPs onto physical processors
- Typically, the number of virtual processors exceeds the number of physical processors
- (Figure: user view vs. system implementation)
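The VP-to-processor mapping above can be sketched as a simple block map (an illustrative sketch only: the function name is invented here, and the actual Charm++ RTS supports several mapping strategies and remaps dynamically):

```python
def block_map(num_vps, num_pes):
    """Map each virtual processor (VP) to a physical PE in contiguous
    blocks, so VP i lands on PE i * num_pes // num_vps.
    Illustrative only; the Charm++ RTS offers multiple strategies."""
    return [vp * num_pes // num_vps for vp in range(num_vps)]

# With more VPs than PEs (virtualization ratio 8, as in the stencil runs
# later in this deck), each PE hosts several migratable VPs.
mapping = block_map(num_vps=64, num_pes=8)
```

Because each PE hosts many VPs, when one VP blocks on communication the RTS can schedule another, which is the basis of the adaptive overlap shown on the next slides.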
Slide 5: AMPI: MPI with Virtualization
- Each virtual process is implemented as a user-level thread embedded in a Charm++ object
- MPI "processes" are thus virtual processes (user-level, migratable threads) mapped onto the real processors
Slide 6: Adaptive Overlap
- Problem setup: 3D stencil calculation of size 240³ run on Lemieux
- Run with virtualization ratios 1 (p = 8, vp = 8) and 8 (p = 8, vp = 64)
Slide 7: Benefit of Adaptive Overlap
- Problem setup: 3D stencil calculation of size 240³ run on Lemieux
- Shows AMPI with virtualization ratios of 1 and 8
Slide 8: Comparison with Native MPI
- Problem setup: 3D stencil calculation of size 240³ run on Lemieux
- Flexibility: AMPI runs on any number of PEs (e.g. 19, 33, 105); native MPI needs a cube number of processors for this problem
- Flexibility enables big runs on a few processors and fits the nature of the algorithms
- Performance: slightly worse without optimization; being improved
Slide 9: Automatic Load Balancing
Problems:
- Applications vary dynamically, and load imbalance impacts overall performance
- It is difficult to move jobs between processors

Implementation:
- Load balancing by migrating objects (VPs)
- The RTS collects CPU and network usage of VPs and computes a new mapping from the collected information
- Threads are packed up and shipped as needed
- Different variations of strategies are available
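One member of the family of strategies mentioned above can be sketched as a greedy rebalance: sort VPs by measured load and repeatedly assign the heaviest remaining VP to the currently least-loaded PE. This is a conceptual sketch only; the actual Charm++ load balancers use measured CPU and communication data and more refined algorithms.

```python
import heapq

def greedy_rebalance(vp_loads, num_pes):
    """Assign VPs to PEs greedily: heaviest VP first, always onto the
    currently least-loaded PE. Returns {vp_index: pe}.
    Conceptual sketch of a greedy strategy, not the actual Charm++ code."""
    # Min-heap of (accumulated load, pe) picks the least-loaded PE cheaply.
    heap = [(0.0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    mapping = {}
    # Placing heavy VPs first gives the greedy scheme a better balance.
    for vp in sorted(range(len(vp_loads)), key=lambda v: -vp_loads[v]):
        load, pe = heapq.heappop(heap)
        mapping[vp] = pe
        heapq.heappush(heap, (load + vp_loads[vp], pe))
    return mapping
```

For example, VP loads [4, 3, 2, 1] on two PEs end up perfectly balanced at 5 units each.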
Slide 10: Load Balancing Example
- AMR application; refinement happens at step 25
- The load balancer is activated at time steps 20, 40, 60, and 80
Slide 11: Collective Operations
Problems with collective operations:
- Complex: they involve many processors
- Time consuming: designed as blocking calls in MPI
(Figure: time breakdown of a 2D FFT benchmark [ms])
Slide 12: Collective Communication Optimization
- Time breakdown of an all-to-all operation using the Mesh library
- Computation is only a small proportion of the elapsed time
- A number of optimization techniques were developed to improve collective communication performance
Slide 13: Asynchronous Collectives
- Time breakdown of the 2D FFT benchmark [ms]
- With VPs implemented as threads, computation overlaps with the waiting time of collective operations
- Total completion time is reduced
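The overlap described above can be mimicked in a self-contained sketch: a background thread stands in for the in-flight collective while the caller keeps computing, and the caller blocks only when the result is needed. This is purely conceptual; AMPI exposes the pattern as nonblocking collective calls followed by a wait, and the class and method names below are invented for illustration.

```python
import threading

class FakeCollective:
    """Stand-in for a nonblocking collective: runs the 'communication'
    in a background thread so the caller can overlap computation."""
    def __init__(self, contributions):
        self.result = None
        self._thread = threading.Thread(target=self._run,
                                        args=(contributions,))
        self._thread.start()

    def _run(self, contributions):
        # Stands in for the actual message exchange (e.g. an all-to-all).
        self.result = sum(contributions)

    def wait(self):
        # Block only at the point where the result is actually needed.
        self._thread.join()
        return self.result

# Start the collective, overlap independent local work, then wait.
coll = FakeCollective([1, 2, 3, 4])
local = sum(i * i for i in range(10))  # independent local computation
total = coll.wait()
```

The saving comes from the work done between starting the collective and calling `wait()`, which is exactly the overlap the FFT time breakdown illustrates.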
Slide 14: Shrink/Expand
- Problem: the set of available processors may change during a run
- The application is re-fit onto the platform by object migration
- Example: time per step for the million-row CG solver on a 16-node cluster, with an additional 16 nodes becoming available at step 600
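The expand step can be sketched as recomputing the VP-to-PE map for the new processor count and migrating the VPs whose assignment changed (a hedged illustration with an invented helper name; the real mechanism packs up and ships the underlying user-level threads, and practical strategies try to minimize how many VPs move):

```python
def remap_on_resize(num_vps, old_pes, new_pes):
    """Recompute a naive block mapping when the processor count changes
    and report which VPs must migrate. Illustrative sketch only; a naive
    block remap moves far more VPs than a migration-minimizing strategy."""
    old_map = [vp * old_pes // num_vps for vp in range(num_vps)]
    new_map = [vp * new_pes // num_vps for vp in range(num_vps)]
    migrants = [vp for vp in range(num_vps) if old_map[vp] != new_map[vp]]
    return new_map, migrants

# Doubling from 16 to 32 PEs: each PE's share of the 64 VPs drops
# from 4 to 2, which is why time per step falls after the expand.
new_map, migrants = remap_on_resize(num_vps=64, old_pes=16, new_pes=32)
```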
Slide 15: Summary of Advantages
- Run MPI programs on any number of PEs
- Overlap computation with communication
- Automatic load balancing
- Asynchronous collective operations
- Shrink/expand
- Automatic checkpoint/restart mechanism
- Cross-communicators
- Interoperability
Slide 16: Application Example: GEN2
- Components built on MPI/AMPI: Rocpanda, Rocblas, Rocface, Rocman, Roccom, Rocflo-MP, Rocflu-MP, Rocsolid, Rocfrac, Rocburn2D (ZN, APN, PY)
- Mesh and grid tools: Truegrid, Tetmesh, Metis, Gridgen, Makeflo
Slide 17: Future Work
- Performance prediction via direct simulation: performance tuning without continuous access to a large machine
- Support for visualization: facilitating debugging and performance tuning
- Support for the MPI-2 standard: ROMIO as the parallel I/O library; one-sided communication
Slide 18: Thank You!
A free download of AMPI is available at http://charm.cs.uiuc.edu/
Parallel Programming Lab, University of Illinois