Published by Florence Charles. Modified over 9 years ago.
1
Fault Tolerant Extensions to Charm++ and AMPI
  Presented by Sayantan Chakravorty
  Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi
2
Parallel Programming Laboratory, University of Illinois, U-C
Outline
  Motivation
  Background
  Solutions
    Co-ordinated Checkpointing
    In-memory double checkpoint
    Sender based Message Logging
    Processor Evacuation in response to fault prediction: New Work
3
Motivation
  As machines grow in size, MTBF decreases
  Applications have to tolerate faults
  Applications need fast, low cost and scalable fault tolerance support
  Modern hardware is making fault prediction possible
    Temperature sensors, PAPI-4, SMART
    Paper on detection tomorrow
4
Background
  Checkpoint based methods
    Coordinated – Blocking [Tamir84], Non-blocking [Chandy85]
      Co-check, Starfish, Clip – fault tolerant MPI
    Uncoordinated – suffers from rollback propagation
    Communication-induced – [Briatico84], doesn’t scale well
  Log-based
    Pessimistic – MPICH-V1 and V2, SBML [Johnson87]
    Optimistic – [Strom85] unbounded rollback, complicated recovery
    Causal Logging – [Elnozahy93] complicated causality tracking and recovery, Manetho, MPICH-V3
5
Multiple Solutions in Charm++
  Reactive: react to a fault
    Disk based Checkpoint/Restart
    In-memory Double Checkpointing/Restart
    Sender based Message Logging
  Proactive: react to a fault prediction
    Evacuate processors that are warned
6
Checkpoint/Restart Mechanism
  Blocking co-ordinated checkpoint
    State of chares is checkpointed to disk
    Collective call MPI_Checkpoint(DIRNAME)
  The entire job is restarted
    Virtualization allows restarting on a different number of PEs
    Runtime option
      > ./charmrun pgm +p4 +vp16 +restart DIRNAME
  Simple but effective for common cases
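The point about restarting on a different number of PEs can be illustrated with a small sketch. This is not Charm++ or AMPI code: the file layout, the block mapping in `place`, and all function names are assumptions made for illustration. It only shows why checkpointing per virtual processor, rather than per physical PE, makes the restart independent of the PE count.

```python
import pickle
import tempfile
from pathlib import Path


def place(num_vps, num_pes):
    """Block-map virtual processor ranks onto physical PEs (assumed mapping)."""
    return {vp: vp * num_pes // num_vps for vp in range(num_vps)}


def checkpoint(states, dirname):
    """Write one file per virtual processor, independent of the PE count."""
    for vp, state in states.items():
        (Path(dirname) / f"vp{vp}.ckpt").write_bytes(pickle.dumps(state))


def restart(dirname, num_vps, num_pes):
    """Reload every virtual processor and place it on the new set of PEs."""
    mapping = place(num_vps, num_pes)
    per_pe = {pe: {} for pe in range(num_pes)}
    for vp in range(num_vps):
        data = (Path(dirname) / f"vp{vp}.ckpt").read_bytes()
        per_pe[mapping[vp]][vp] = pickle.loads(data)
    return per_pe


def demo():
    # 16 virtual processors, as with +vp16; state contents are made up.
    states = {vp: {"rank": vp, "step": 40} for vp in range(16)}
    with tempfile.TemporaryDirectory() as dirname:
        checkpoint(states, dirname)                      # job ran on 4 PEs
        return restart(dirname, num_vps=16, num_pes=2)   # restarted on 2 PEs


restored = demo()
```

Because the checkpoint never records which PE held a virtual processor, only the placement function changes between the original run and the restart.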
7
Drawbacks
  Disk based coordinated checkpointing is slow
  Job needs to be restarted
    Requires user intervention
  Impractical in the case of frequent faults
8
In-memory Double Checkpoint
  In-memory checkpoint
    Faster than disk
  Co-ordinated checkpoint
    Simple
    User can decide what makes up useful state
  Double checkpointing
    Each object maintains 2 checkpoints on:
      Local physical processor
      Remote “buddy” processor
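A minimal sketch of the double-checkpoint placement follows. The ring-style buddy choice in `buddy_of` and the dictionary layout are assumptions, not the runtime's actual scheme. The sketch checks the property the slide relies on: after any single processor is lost, a checkpoint of every object still exists in some surviving processor's memory.

```python
def buddy_of(pe, num_pes):
    """Hypothetical buddy choice: the next PE in a ring."""
    return (pe + 1) % num_pes


def double_checkpoint(objects_by_pe):
    """Return per-PE memory holding a local copy and the copy kept for a buddy.

    objects_by_pe must be keyed by PE numbers 0..n-1.
    """
    num_pes = len(objects_by_pe)
    memory = {pe: {"local": {}, "remote": {}} for pe in range(num_pes)}
    for pe, objects in objects_by_pe.items():
        memory[pe]["local"] = dict(objects)
        memory[buddy_of(pe, num_pes)]["remote"].update(objects)
    return memory


def surviving_copies(memory, crashed_pe):
    """Collect object ids still checkpointed somewhere after one PE is lost."""
    ids = set()
    for pe, store in memory.items():
        if pe != crashed_pe:
            ids |= set(store["local"]) | set(store["remote"])
    return ids


objects_by_pe = {0: {"a": 1, "b": 2}, 1: {"c": 3}, 2: {"d": 4}, 3: {"e": 5}}
memory = double_checkpoint(objects_by_pe)
```

The cost is visible in the same structure: every object is held twice in checkpoint memory, in addition to its live copy.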
9
Restart
  A “Dummy” process is created:
    Need not have application data or checkpoint
    Necessary for runtime
    Starts recovery on all other PEs
  Other processors:
    Remove all chares
    Restore checkpoints lost on the crashed PE
    Restore chares from local checkpoints
  Load balance after restart
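The restart steps above can be sketched as follows. This is a simplified model with hypothetical names: it restores the crashed PE's objects from the copy held by its buddy and rolls surviving PEs back to their local checkpoints, but it does not re-create the checkpoint copies that were stored on the crashed PE, and the round-robin balancer is only a stand-in for a real load balancer.

```python
def load_balance(placement):
    """Toy balancer: deal all objects round-robin over the PEs."""
    items = sorted((k, v) for objs in placement.values() for k, v in objs.items())
    pes = sorted(placement)
    balanced = {pe: {} for pe in pes}
    for i, (key, value) in enumerate(items):
        balanced[pes[i % len(pes)]][key] = value
    return balanced


def restart_after_crash(local_ckpt, buddy_ckpt, crashed_pe):
    """Return object placement after recovery.

    local_ckpt[pe]: objects checkpointed in pe's own memory
    buddy_ckpt[pe]: copy of pe's objects held by its buddy (survives pe's crash)
    """
    num_pes = len(local_ckpt)
    # The replacement ("dummy") process starts with no application data.
    placement = {pe: {} for pe in range(num_pes)}
    # Surviving PEs drop their live chares and roll back to local checkpoints.
    for pe in range(num_pes):
        if pe != crashed_pe:
            placement[pe] = dict(local_ckpt[pe])
    # Objects of the crashed PE come back from the copy held by its buddy.
    placement[crashed_pe] = dict(buddy_ckpt[crashed_pe])
    return load_balance(placement)


local_ckpt = {0: {"a": 1, "b": 2, "c": 3}, 1: {"d": 4}, 2: {"e": 5, "f": 6}}
buddy_ckpt = {pe: dict(objs) for pe, objs in local_ckpt.items()}
after = restart_after_crash(local_ckpt, buddy_ckpt, crashed_pe=1)
```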
10
Overhead Evaluation
  Jacobi (200 MB data size) on up to 128 processors
  8 checkpoints in 100 steps
11
Recovery Performance
  LeanMD application
  10 crashes
  128 processors
  Checkpoint every 10 time steps
12
Drawbacks
  High memory overhead
  Checkpoint/Rollback doesn’t scale
    All nodes are rolled back just because 1 crashed
    Even nodes independent of the crashed node are restarted
  Restart cost is similar to checkpoint period
  Blocking co-ordinated checkpoint requires user intervention
13
Sender based Message Logging
  Message logging
    Store message logs on the sender
    Asynchronous checkpoints
      Each processor has a buddy processor
      Stores its checkpoint in the buddy’s memory
  Restart: processor from an extra pool
    Recreate only objects on crashed processor
    Playback logged messages
    Restores state to that after the last processed message
  Processor virtualization can speed it up
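A toy illustration of recovery by playback, under stated assumptions: a single sender, a single receiving object, and a made-up `Counter` class whose state depends on message order. Nothing here is Charm++ API. It shows that a checkpoint held by the buddy plus the log held by the sender is enough to rebuild the lost object without rolling back anyone else.

```python
class Counter:
    """Toy object whose state depends on the order of processed messages."""

    def __init__(self, value=0, processed=0):
        self.value = value
        self.processed = processed  # number of messages handled so far

    def process(self, msg):
        op, operand = msg
        if op == "add":
            self.value += operand
        else:  # "mul"
            self.value *= operand
        self.processed += 1


sender_log = []          # kept in the sender's memory, survives the crash
receiver = Counter()
buddy_checkpoint = None  # receiver's checkpoint, held in its buddy's memory

for i, msg in enumerate([("add", 2), ("mul", 5), ("add", 1), ("mul", 3)]):
    sender_log.append(msg)  # log before sending
    receiver.process(msg)
    if i == 1:              # checkpoint taken after two messages
        buddy_checkpoint = (receiver.value, receiver.processed)

state_before_crash = receiver.value
del receiver                # the receiver's processor crashes

# Recovery: recreate only the lost object, then play back logged messages
# that were processed after the checkpoint.
recovered = Counter(*buddy_checkpoint)
for msg in sender_log[recovered.processed:]:
    recovered.process(msg)
```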
14
Message Logging
  State of an object is determined by
    Messages processed
    Sequence of processed messages
  Protocol
    Sender logs message and requests a TN (ticket number) from the receiver
    Receiver sends back TN
    Sender stores TN with log and sends message
    Receiver processes messages in order of TN
  Processor virtualization complicates message logging
    Messages to objects on the same processor need to be logged remotely
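The four protocol steps can be sketched in a few lines. The class and method names are hypothetical, and the network is replaced by direct method calls; the sketch only shows how the TN fixes a processing order at the receiver even when messages from several senders arrive out of order.

```python
class Receiver:
    def __init__(self):
        self.next_tn = 1     # next TN to hand out
        self.expected = 1    # next TN to process
        self.pending = {}    # messages that arrived ahead of their turn
        self.processed = []

    def request_tn(self):
        tn = self.next_tn
        self.next_tn += 1
        return tn

    def deliver(self, tn, payload):
        self.pending[tn] = payload
        while self.expected in self.pending:  # process strictly in TN order
            self.processed.append(self.pending.pop(self.expected))
            self.expected += 1


class Sender:
    def __init__(self):
        self.log = []  # entries kept in the sender's memory for replay

    def prepare(self, receiver, payload):
        entry = {"payload": payload, "tn": None}
        self.log.append(entry)              # sender logs the message
        entry["tn"] = receiver.request_tn() # receiver sends back a TN
        return entry["tn"], payload         # message is now ready to send


rx = Receiver()
s1, s2 = Sender(), Sender()
m1 = s1.prepare(rx, "x")  # TN 1
m2 = s2.prepare(rx, "y")  # TN 2
m3 = s1.prepare(rx, "z")  # TN 3
for tn, payload in (m3, m2, m1):  # delivered in a different order
    rx.deliver(tn, payload)
```

After a crash, replaying the logged messages in TN order reproduces the same sequence, which is why the TN is stored alongside each log entry.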
15
Parallel Restart
  Message logging allows fault-free processors to continue with their execution
  However, sooner or later some processors start waiting for the crashed processor
  Virtualization allows us to move work from the restarted processor to waiting processors
    Chares are restarted in parallel
    Restart cost can be reduced
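A rough sketch of why spreading the crashed processor's chares over waiting processors reduces restart cost. The replay costs are invented numbers, and the greedy longest-first assignment is an assumption chosen for simplicity, not the strategy used by the runtime.

```python
def recovery_time(replay_cost, helpers):
    """Return the time until the last chare has finished replaying.

    replay_cost: chare id -> time needed to replay its logged messages
    helpers: number of processors sharing the work (1 = restarted PE only)
    """
    loads = [0.0] * helpers
    # Greedy: give the most expensive remaining chare to the least loaded PE.
    for cost in sorted(replay_cost.values(), reverse=True):
        loads[loads.index(min(loads))] += cost
    return max(loads)


replay_cost = {"c0": 4.0, "c1": 3.0, "c2": 3.0, "c3": 2.0}
serial = recovery_time(replay_cost, helpers=1)    # one PE replays everything
parallel = recovery_time(replay_cost, helpers=4)  # chares spread over 4 PEs
```

The gain is bounded by the most expensive single chare, so a higher virtualization ratio (more, smaller chares) leaves more room for parallel restart.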
16
Present Status
  Most of Charm++ has been ported
  Support for migration has not yet been implemented in the fault tolerant protocol
  AMPI ported
  Parallel restart not yet implemented
17
Recovery Performance
  Execution time with increasing number of faults on 8 processors (checkpoint period 30 s)
18
Pros and Cons
  Low overhead for jobs with low communication
  Currently high overhead for jobs with high communication
  Should be tested with high virtualization ratio to reduce the message logging overhead
19
Processor evacuation
  Modern hardware can be used to predict faults
  Runtime system response
    Low response time
    No new processors should be required
    Efficiency loss should be proportional to loss in computational power
20
Solution
  Migrate Charm++ objects off processor
    Requires remapping of “home” PEs of objects
    Point-to-point message delivery continues to work efficiently
  Collective operations cope with loss of processors
    Rewire reduction tree around a warned processor
    Can deal with multiple simultaneous warnings
  Load balance after an evacuation
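One way to picture the remapping of home PEs is a function that every processor can evaluate locally once it knows the set of warned processors. The rule below (default home is the index modulo the PE count, then walk forward past evacuated PEs) is purely illustrative and is not claimed to be the Charm++ scheme.

```python
def home_pe(obj_index, num_pes, evacuated):
    """Return the home PE of an object, skipping evacuated PEs.

    Default home is obj_index mod num_pes; if that PE has been evacuated,
    walk forward to the next valid PE (illustrative rule only).
    """
    if len(evacuated) >= num_pes:
        raise ValueError("no valid processors left")
    pe = obj_index % num_pes
    while pe in evacuated:
        pe = (pe + 1) % num_pes
    return pe


# Two simultaneous warnings on an 8-PE run.
evacuated = {4, 5}
homes = {idx: home_pe(idx, 8, evacuated) for idx in range(16)}
```

Because the result depends only on the object index and the evacuated set, senders can still find an object's home without an extra lookup message, which is the property point-to-point delivery needs.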
21
Rearrange the reduction tree
  Do not rewire tree while reduction is going on
    Stop reductions
    Rewire tree
    Continue reductions
  Affects only parent and children of a node
  Unbalances tree: could be solved by recreating tree
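The rewiring step can be sketched on a small binary tree. The tree layout, the function names, and the choice to attach the warned node's children directly to its parent are assumptions for illustration; evacuating the root is deliberately left unhandled. The sketch assumes no reduction is in flight while `rewire` runs, matching the stop/rewire/continue sequence above.

```python
def build_tree(num_pes, fanout=2):
    """Return parent[pe] for a k-ary reduction tree rooted at PE 0."""
    return {pe: (None if pe == 0 else (pe - 1) // fanout) for pe in range(num_pes)}


def rewire(parent, warned):
    """Remove `warned` by attaching its children to its parent."""
    if parent[warned] is None:
        raise ValueError("evacuating the root needs a new root; not handled here")
    new_parent = dict(parent)
    grandparent = new_parent.pop(warned)
    for pe, p in new_parent.items():
        if p == warned:
            new_parent[pe] = grandparent
    return new_parent


def reduce_sum(parent, contributions):
    """Sum contributions up the tree; return the total seen at the root."""
    totals = dict(contributions)

    def depth(pe):
        d = 0
        while parent[pe] is not None:
            pe, d = parent[pe], d + 1
        return d

    # Visit deeper nodes first so children are folded in before their parents.
    for pe in sorted(parent, key=depth, reverse=True):
        if parent[pe] is not None:
            totals[parent[pe]] += totals[pe]
    root = next(pe for pe, p in parent.items() if p is None)
    return totals[root]


tree = build_tree(7)       # 0 -> (1, 2), 1 -> (3, 4), 2 -> (5, 6)
rewired = rewire(tree, 1)  # PE 1 is warned; PEs 3 and 4 now report to PE 0
total = reduce_sum(rewired, {pe: pe for pe in rewired})
```

Only the entries for the warned node's children change, and the root ends up with three children instead of two, which is the imbalance the slide mentions.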
22
Response time
  Evacuation time for a Sweep3d execution on the 150^3 case
  Total ~500 MB of data
  Pessimistic estimate of evacuation time
23
Performance after evacuation
  Iteration time of Sweep3d on 32 processors for 150^3 problem with 1 warning
24
Processor Utilization after evacuation
  Iteration time of Sweep3d on 32 processors for 150^3 problem with both processors on node 3 (processors 4 and 5) being warned simultaneously
25
Conclusions
  Available in Charm++ and AMPI
    Checkpoint/Restart
    In-memory Checkpoint/Restart
    Proactive fault tolerance
  Under development
    Sender based message logging
      Deal with migration, deletion
      Parallel Restart
  Abstraction layers in Charm++/AMPI make it suitable for implementing fault tolerance protocols