Published by Jayson Esarey. Modified over 10 years ago.
1
Use Cases for Fault Tolerance Support in MPI
Rich Graham
Oak Ridge National Laboratory
2
Working Assumption
- MPI provides the hooks into the communications and process control system to allow others to implement fault-tolerant algorithms
- This working group will address process fault tolerance (FT), not network FT (network FT does not need changes to the standard)
3
Process Failure
- Scenario: a running parallel application with process count N loses one or more processes due to a failure not related to the application
- Recovery scenarios:
  - Communicator(s) abort
  - Application uses some sort of checkpoint/restart (CPR) to continue
    - May want to quiet the communications system
    - May want to log messages
  - Application continues to run with M processes, where M <= N (the application chooses whether it can continue with M < N)
    - Application expects a dense rank index
    - Application expects a sparse rank index
4
MPI Implications
Scenario: Communicator abort (current state in MPI)
- Process control: terminate processes in the failed intra-communicator
- Communications: discard traffic associated with the failed communicators
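The "discard traffic" step above can be sketched in plain C. This is a minimal illustration, not how any MPI library stores its queues: the `pending_msg` record, its fields, and the function `discard_comm_traffic` are all hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for a queued message; real MPI libraries differ. */
typedef struct {
    int comm_id;   /* identifier of the communicator the message belongs to */
    int src_rank;  /* rank of the sender within that communicator */
    int tag;       /* message tag */
} pending_msg;

/*
 * Drops every queued message that belongs to failed_comm.
 * The queue is compacted in place and the order of the remaining
 * entries is preserved. Returns the new queue length.
 */
size_t discard_comm_traffic(pending_msg *queue, size_t len, int failed_comm)
{
    assert(queue != NULL || len == 0);

    size_t kept = 0;
    for (size_t i = 0; i < len; i++) {
        if (queue[i].comm_id != failed_comm) {
            queue[kept++] = queue[i];
        }
    }
    return kept;
}
```

Filtering on a communicator identifier keeps traffic on unaffected communicators untouched, which matches the slide's wording that only traffic associated with the failed communicators is discarded.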
5
MPI Implications
Scenario: Some sort of CPR method in use
- Process control: need to re-establish communications with the restarted processes
- Communications:
  - May need to quiet the communications system to get into a state the CPR system can handle
  - May need to replay messages
  - May need to quiet communications until the parallel application is fully restarted (after failure)
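One way to picture "may need to replay messages" is a sender-side log from which messages are re-selected after a peer restarts. The sketch below assumes a scheme the slides do not specify: each sender numbers its messages with an increasing sequence number, and the restarted process reports the last sequence number covered by its checkpoint. The type `logged_msg` and the function `select_for_replay` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sender-side log entry (assumed scheme, not from the slides). */
typedef struct {
    int dest_rank;      /* rank the message was sent to */
    unsigned long seq;  /* per-sender sequence number, increasing */
    int payload;        /* stand-in for the message data */
} logged_msg;

/*
 * Copies into out[] every logged message that was sent to restarted_rank
 * after the point captured by its checkpoint (seq > last_seq_in_checkpoint).
 * out[] must have room for len entries. Returns the number of messages
 * selected, in their original log order.
 */
size_t select_for_replay(const logged_msg *log, size_t len,
                         int restarted_rank,
                         unsigned long last_seq_in_checkpoint,
                         logged_msg *out)
{
    assert(log != NULL || len == 0);
    assert(out != NULL || len == 0);

    size_t selected = 0;
    for (size_t i = 0; i < len; i++) {
        if (log[i].dest_rank == restarted_rank &&
            log[i].seq > last_seq_in_checkpoint) {
            out[selected++] = log[i];
        }
    }
    return selected;
}
```

The selection is only meaningful once communications have been quieted, since new sends arriving during replay would otherwise interleave with the replayed ones.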
6
MPI Implications
Scenario: Application continues to run with M processes (M <= N)
- Process control: may need to re-index processes within affected communicators
- Communications:
  - May need to (re-)establish communications
  - May need to handle communications to non-existent processes
  - May need to discard data
    - All outstanding traffic
    - Only traffic associated with failed processes
- Groups and communicators: may change during the life cycle of these objects
- Collective communications:
  - How are collective optimizations impacted?
  - What happens with outstanding collective operations?
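The difference between a dense and a sparse rank index after a failure can be made concrete with a small mapping routine. This is an illustrative sketch in plain C, not an MPI interface: `RANK_FAILED`, `reindex_dense`, and `reindex_sparse` are hypothetical names, and the `alive` array is assumed to come from some failure detector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker, not part of MPI: the old rank has no surviving process. */
#define RANK_FAILED (-1)

/*
 * Dense re-indexing: the M survivors are renumbered 0..M-1 in their
 * original order. alive[i] != 0 means old rank i survived.
 * old_to_new[i] receives the new rank, or RANK_FAILED.
 * Returns M, the number of survivors.
 */
int reindex_dense(const int *alive, int n, int *old_to_new)
{
    assert(alive != NULL && old_to_new != NULL && n >= 0);

    int next = 0;
    for (int i = 0; i < n; i++) {
        old_to_new[i] = alive[i] ? next++ : RANK_FAILED;
    }
    return next;
}

/*
 * Sparse re-indexing: survivors keep their original rank, and failed
 * ranks become holes that callers must skip.
 * Returns M, the number of survivors.
 */
int reindex_sparse(const int *alive, int n, int *old_to_new)
{
    assert(alive != NULL && old_to_new != NULL && n >= 0);

    int survivors = 0;
    for (int i = 0; i < n; i++) {
        if (alive[i]) {
            old_to_new[i] = i;
            survivors++;
        } else {
            old_to_new[i] = RANK_FAILED;
        }
    }
    return survivors;
}
```

With N = 5 and old ranks 1 and 4 failed, the dense mapping yields new ranks 0, 1, 2 for old ranks 0, 2, 3, while the sparse mapping leaves 0, 2, 3 in place. An application that computes neighbors as `rank + 1` works unchanged with the dense index but must check for holes with the sparse one.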