Published by Leon Malone. Modified over 9 years ago.
1
Umpire: Making MPI Programs Safe
Bronis R. de Supinski and Jeffrey S. Vetter
Center for Applied Scientific Computing
August 15, 2000
2
Umpire
- Writing correct MPI programs is hard
- Unsafe or erroneous MPI programs
  - Deadlock
  - Resource errors
- Umpire
  - Automatically detects MPI programming errors
  - Dynamic software testing
  - Shared memory implementation
3
Umpire Architecture
[Diagram: the MPI application's tasks (Task 0, Task 1, Task 2, ..., Task N-1) run on the MPI runtime system. Umpire interposes on each task using the MPI profiling layer. The tasks send transactions via shared memory to the Umpire Manager, which runs the verification algorithms.]
4
Collection system
- Calling task
  - Use MPI profiling layer
  - Perform local checks
  - Communicate with manager if necessary
    - Call parameters
    - Return program counter (PC)
    - Call-specific information (e.g. buffer checksum)
- Manager
  - Allocate Unix shared memory
  - Receive transactions from calling tasks
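The transaction contents listed above can be sketched as a small record. This is a minimal illustration, not Umpire's actual layout: the struct `umpire_txn`, its field names, and the helper `make_txn` are hypothetical, and the FNV-1a hash stands in for whatever buffer checksum Umpire really computes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transaction record; names and layout are illustrative only. */
typedef struct {
    int       task;       /* rank of the calling task */
    int       op;         /* code identifying the MPI call */
    uintptr_t return_pc;  /* return program counter of the call site */
    uint32_t  checksum;   /* call-specific information: buffer checksum */
} umpire_txn;

/* FNV-1a hash over the buffer bytes (an assumed stand-in checksum). */
uint32_t buffer_checksum(const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Build the record a calling task would hand to the manager. */
umpire_txn make_txn(int task, int op, uintptr_t pc,
                    const void *buf, size_t len) {
    umpire_txn t = { task, op, pc, buffer_checksum(buf, len) };
    return t;
}
```

Comparing the checksum recorded when a send is posted with one taken when it completes is one plausible way to notice that a buffer changed in between.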
5
Manager
- Detects global programming errors
- Unix shared memory communication
- History queues
  - One per MPI task
  - Chronological lists of MPI operations
- Resource registry
  - Communicators
  - Derived datatypes
  - Required for message matching
- Performs verification algorithms
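A per-task history queue can be pictured as a fixed-size FIFO of operation codes. The sketch below is an assumption about the shape of such a queue, not the Manager's real data structure; `history_queue`, `hq_push`, `hq_pop`, and the capacity are all hypothetical.

```c
#include <stddef.h>

#define QUEUE_CAP 64  /* arbitrary capacity chosen for this sketch */

typedef struct {
    int    ops[QUEUE_CAP];  /* operation codes in chronological order */
    size_t head;            /* index of the oldest entry */
    size_t count;           /* number of queued entries */
} history_queue;

/* Append an operation; returns 0 on success, -1 if the queue is full. */
int hq_push(history_queue *q, int op) {
    if (q->count == QUEUE_CAP) return -1;
    q->ops[(q->head + q->count) % QUEUE_CAP] = op;
    q->count++;
    return 0;
}

/* Remove the oldest operation into *op; returns -1 if the queue is empty. */
int hq_pop(history_queue *q, int *op) {
    if (q->count == 0) return -1;
    *op = q->ops[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}
```

A manager built this way would keep one such queue per MPI task.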
6
Configuration Dependent Deadlock
- Unsafe MPI programming practice
- Code result depends on:
  - MPI implementation limitations
  - User input parameters
- Classic example code: Task 0 and Task 1 each call MPI_Send, then MPI_Recv
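Why the outcome depends on configuration can be shown with a toy model that needs no MPI installation. It assumes a blocking send returns immediately when the message fits in an implementation buffer and otherwise waits until the peer has posted its receive. The function `exchange_completes` and the parameter `eager_limit` are hypothetical names for this model, not part of MPI or Umpire.

```c
#include <stddef.h>

enum op { SEND, RECV, DONE };

/* Simulate two tasks that each send, then receive.
   Returns 1 if both tasks finish, 0 if they deadlock. */
int exchange_completes(size_t msg_bytes, size_t eager_limit) {
    enum op prog[2][3] = { {SEND, RECV, DONE}, {SEND, RECV, DONE} };
    int pc[2] = {0, 0};      /* next operation of each task */
    int inbox[2] = {0, 0};   /* messages available to each task */
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int t = 0; t < 2; t++) {
            int peer = 1 - t;
            enum op cur = prog[t][pc[t]];
            if (cur == SEND) {
                int buffered = msg_bytes <= eager_limit;
                int peer_receiving = prog[peer][pc[peer]] == RECV;
                if (buffered || peer_receiving) {
                    inbox[peer]++;   /* message leaves the sender */
                    pc[t]++;
                    progress = 1;
                }
            } else if (cur == RECV && inbox[t] > 0) {
                inbox[t]--;          /* message consumed */
                pc[t]++;
                progress = 1;
            }
        }
    }
    return prog[0][pc[0]] == DONE && prog[1][pc[1]] == DONE;
}
```

In this model a small message completes, while a message larger than the buffer leaves both tasks stuck in their sends, which is the behaviour the slide calls configuration dependent.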
7
Mismatched Collective Operations
- Erroneous MPI programming practice
- Simple example code:

  Tasks 0, 1, & 2    Task 3
  MPI_Bcast          MPI_Barrier
  MPI_Barrier        MPI_Bcast

- Possible code results:
  - Deadlock
  - Correct message matching
  - Incorrect message matching
  - Mysterious error messages
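One simple way to flag the mismatch in the example above is to compare, position by position, the collective each task calls. This sketch assumes all calls use the same communicator and that the full call sequences are known; `first_collective_mismatch` and the enum are hypothetical, not Umpire's verification algorithm.

```c
#include <stddef.h>

enum collective { BCAST, BARRIER };

/* calls is an ntasks x ncalls row-major table of collective codes.
   Returns the first call index at which some task disagrees with
   task 0, or -1 if every task issues the same sequence. */
int first_collective_mismatch(const int *calls, size_t ntasks, size_t ncalls) {
    for (size_t i = 0; i < ncalls; i++) {
        for (size_t t = 1; t < ntasks; t++) {
            if (calls[t * ncalls + i] != calls[i]) {
                return (int)i;
            }
        }
    }
    return -1;
}
```

For the four-task example on the slide, the first mismatch is at the very first collective call.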
8
Deadlock detection
- MPI history queues
  - One per task in Manager
  - Track MPI messaging operations
    - Items added through transactions
    - Removed when safely matched
- Automatically detect deadlocks
  - MPI operations only
  - Wait-for graph
  - Recursive algorithm
  - Invoked when queue head changes
- Also support timeouts
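A recursive search over a wait-for graph can be sketched as a depth-first cycle check. This is a generic textbook formulation under the assumption that each edge means "this task cannot proceed until that task acts"; it is not Umpire's actual algorithm, and `has_deadlock`, `visit`, and `MAX_TASKS` are hypothetical names.

```c
#define MAX_TASKS 8  /* arbitrary bound for this sketch */

/* state: 0 = unvisited, 1 = on the current search path, 2 = finished. */
int visit(int waits_for[MAX_TASKS][MAX_TASKS], int n, int task, int *state) {
    state[task] = 1;
    for (int next = 0; next < n; next++) {
        if (!waits_for[task][next]) continue;
        if (state[next] == 1) return 1;  /* back edge closes a cycle */
        if (state[next] == 0 && visit(waits_for, n, next, state)) return 1;
    }
    state[task] = 2;
    return 0;
}

/* Returns 1 if the wait-for graph over tasks 0..n-1 contains a cycle. */
int has_deadlock(int waits_for[MAX_TASKS][MAX_TASKS], int n) {
    int state[MAX_TASKS] = {0};
    for (int t = 0; t < n; t++) {
        if (state[t] == 0 && visit(waits_for, n, t, state)) return 1;
    }
    return 0;
}
```

A tool following the slide's design would rerun a check of this kind whenever the head of a history queue changes.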
9
Deadlock Detection Example
[Diagram: per-task columns for Task 0, Task 1, Task 2, and Task 3 holding Bcast and Barrier entries. Transaction labels shown: Task 1: MPI_Bcast; Task 0: MPI_Bcast; Task 0: MPI_Barrier; Task 2: MPI_Bcast; Task 3: MPI_Barrier; Task 2: MPI_Barrier; Task 1: MPI_Barrier. The diagram flags the result with "ERROR! Report it!"]
10
Resource Tracking Errors
- Many MPI features require resource allocations
  - Communicators, datatypes and requests
  - Detect “leaks” automatically
- Simple “lost request” example:

  MPI_Irecv (..., &req);
  MPI_Wait (&req,…)

- Complicated by assignment
- Also detect errant writes to send buffers
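Request tracking can be illustrated with a small registry that records which requests are outstanding. The sketch assumes a request counts as lost when its handle is overwritten before the request completes; that reading, and the names `req_registry`, `track_start`, `track_complete`, and `outstanding`, are assumptions for illustration rather than Umpire's implementation. Request ids are assumed to lie in the range 0 to MAX_REQS - 1.

```c
#define MAX_REQS 32  /* arbitrary bound for this sketch */

typedef struct {
    int active[MAX_REQS];  /* 1 while a request id is outstanding */
    int lost;              /* handles overwritten while still active */
} req_registry;

/* Called when a non-blocking operation stores new_id into *handle.
   A negative handle means "no request". */
void track_start(req_registry *r, int *handle, int new_id) {
    if (*handle >= 0 && r->active[*handle]) {
        r->lost++;  /* the old request can no longer be completed */
    }
    r->active[new_id] = 1;
    *handle = new_id;
}

/* Called when the request behind *handle completes. The handle is reset,
   mirroring how MPI_Wait sets a completed request to MPI_REQUEST_NULL. */
void track_complete(req_registry *r, int *handle) {
    if (*handle >= 0) r->active[*handle] = 0;
    *handle = -1;
}

/* Number of requests still outstanding, e.g. at program exit. */
int outstanding(const req_registry *r) {
    int n = 0;
    for (int i = 0; i < MAX_REQS; i++) n += r->active[i];
    return n;
}
```

Starting two operations on the same handle and waiting once leaves one request outstanding, which a tool could report as a leak.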
11
Conclusion
- First automated MPI debugging tool
  - Detect deadlocks
  - Eliminate resource leaks
  - Assure correct non-blocking sends
- Performance
  - Low overhead (21% for sPPM)
  - Located deadlock in code set-up
- Limitations
  - MPI_Waitany and MPI_Cancel
  - Shared memory implementation
  - Prototype only
12
Future Work
- Further prototype testing
- Improve user interface
- Handle all MPI calls
- Tool distribution
  - LLNL application group testing
  - Exploring mechanisms for wider availability
- Detection of other errors
  - Datatype matching
  - Others?
- Distributed memory implementation
13
Work performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.
UCRL-VG-139184