A Fault Tolerant Protocol for Massively Parallel Machines
Sayantan Chakravorty, Laxmikant Kale
University of Illinois, Urbana-Champaign
Outline
- Motivation
- Background
- Design
- Protocols
- Results
- Summary
- Future Work
Motivation
- As machines grow in size, MTBF decreases, so applications have to tolerate faults
- Checkpoint/rollback doesn't scale:
  - All nodes are rolled back just because one crashed
  - Even nodes independent of the crashed node are restarted
  - Restart cost is similar to the checkpoint period
Requirements
- Fast and scalable checkpoints
- Fast restart:
  - Only the crashed processor should be restarted
  - Minimize the effect on fault-free processors
  - Restart cost less than the checkpoint period
- Low fault-free runtime overhead
- Transparent to the user
Background
- Checkpoint-based methods:
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]; Co-check, Starfish, and Clip are fault-tolerant MPIs
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], doesn't scale well
- Log-based methods:
  - Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  - Optimistic: [Strom85], unbounded rollback and complicated recovery
  - Causal logging: Manetho [Elnozahy93], complicated causality tracking and recovery
Design
- Message logging:
  - Sender-side message logging
- Asynchronous checkpoints:
  - Each processor has a buddy processor
  - Each processor stores its checkpoint in the buddy's memory
- Processor virtualization:
  - Speeds up restart
System Model
- Processors are fail-stop
- All communication is through messages
- The piecewise deterministic assumption holds
- The machine has a fault detection system
- The network doesn't guarantee delivery order
- There are no fully reliable nodes in the system
- The idea of processor virtualization is used
Processor Virtualization
- User view vs. system implementation
- Charm++:
  - Parallel C++ with data-driven objects (chares)
  - The runtime maps objects to physical processors
  - Asynchronous method invocation
- Adaptive MPI (AMPI):
  - Implemented on Charm++
  - Multiple virtual processors on a physical processor
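The data-driven model is easiest to see in code. Below is a minimal, illustrative Charm++ program (module, file, and class names are invented for this sketch, not taken from the talk): a chare is declared in a .ci interface file, invoked asynchronously through a proxy, and placed on a physical processor by the runtime.

    // hello.ci -- Charm++ interface file (illustrative)
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      chare Worker {
        entry Worker();
        entry void work(int n);          // asynchronous entry method
      };
    };

    // hello.C
    #include "hello.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        // The runtime, not the programmer, picks the physical processor.
        CProxy_Worker w = CProxy_Worker::ckNew();
        w.work(42);                      // async invocation: returns immediately
        delete m;
      }
    };

    class Worker : public CBase_Worker {
    public:
      Worker() {}
      void work(int n) {
        CkPrintf("working on %d on PE %d\n", n, CkMyPe());
        CkExit();
      }
    };

    #include "hello.def.h"

Because invocation is asynchronous and objects are location-independent, the runtime can overlap communication with computation and migrate chares, which the following slides exploit.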
Benefits of Virtualization
- Latency tolerance
- Adaptive overlap of communication and computation
- Supports migration of virtual processors
Message Logging Protocol
- Correctness: messages should be processed in the same order before and after the crash
- Problem: messages A, B, and C that arrived in one order before the crash may be redelivered in a different order after it
Message Logging (contd.)
- Solution: fix an order the first time and always follow it:
  - The receiver gives each message a ticket number
  - Messages are processed in order of ticket number
- Each message contains:
  - Sender ID: who sent it
  - Receiver ID: to whom it was sent
  - Sequence number (SN): together with the sender and receiver IDs, identifies a message
  - Ticket number (TN): decides the order of processing
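As a rough illustration of these fields (all names here are hypothetical, not from the actual implementation), the following C++ sketch shows the envelope and a receiver-side ticketing table. Crucially, if the same (sender ID, SN) pair is presented again, the receiver hands back the TN it issued the first time, which is what fixes the processing order across a crash.

    #include <cstdint>
    #include <map>
    #include <utility>

    // Hypothetical envelope carrying the fields named above.
    struct Envelope {
        int      senderId;    // who sent it
        int      receiverId;  // to whom it was sent
        uint64_t sn;          // with sender and receiver IDs, identifies a message
        uint64_t tn;          // ticket number: fixes the processing order
    };

    // Receiver-side ticketing table.
    class Ticketing {
        uint64_t nextTn = 1;
        std::map<std::pair<int, uint64_t>, uint64_t> issued;  // (senderId, sn) -> tn
    public:
        uint64_t ticketFor(int senderId, uint64_t sn) {
            auto key = std::make_pair(senderId, sn);
            auto it = issued.find(key);
            if (it != issued.end())
                return it->second;           // seen earlier: reuse the old TN
            return issued[key] = nextTn++;   // new message: issue a fresh TN
        }
    };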
Message to Remote Chares
- Chare P is the sender, chare Q the receiver; before sending the message, P requests a ticket from Q
- If the (sender ID, SN) pair has been seen earlier, the stored TN is returned and the message is marked as received
- Otherwise Q creates a new TN and stores the (sender ID, SN, TN) entry
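A hedged sketch of the sender's side of this exchange (class and method names are invented; the real Charm++ implementation differs in its details): the message is logged first, a ticket is requested from the receiver, and the payload is transmitted once the TN arrives.

    #include <cstdint>
    #include <vector>

    struct Envelope { int senderId, receiverId; uint64_t sn, tn; };  // as in the earlier sketch

    class Sender {
        int self;                                      // this processor's rank
        uint64_t nextSn = 1;
        struct Logged { Envelope env; std::vector<char> payload; };
        std::vector<Logged> log;                       // sender-side message log
    public:
        explicit Sender(int pe) : self(pe) {}
        void send(int dest, std::vector<char> payload) {
            log.push_back({{self, dest, nextSn, 0}, std::move(payload)});
            requestTicket(log.back().env);             // receiver runs its ticketing table
            ++nextSn;
        }
        void onTicketReply(uint64_t sn, uint64_t tn) { // receiver answered with a TN
            for (Logged &m : log)
                if (m.env.sn == sn) { m.env.tn = tn; transmit(m.env, m.payload); }
        }
    private:
        void requestTicket(const Envelope &) {}        // transport stub
        void transmit(const Envelope &, const std::vector<char> &) {}  // transport stub
    };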
Message to Local Chare
- Multiple chares can live on one processor, so a message between chares P and Q on processor R never leaves R
- If R crashes, all trace of such a local message is lost, yet after restart the message should get the same TN
- The (sender ID, SN, TN) entry is therefore stored on R's buddy, which acknowledges it before the message is processed
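A minimal sketch of this hold-until-acknowledged step, under the same hypothetical names as the earlier sketches: the local message is buffered, its (sender ID, SN, TN) entry is shipped to the buddy, and processing happens only after the buddy's ack, since the TN then survives a crash of this processor.

    #include <cstdint>
    #include <map>

    struct Envelope { int senderId, receiverId; uint64_t sn, tn; };  // as in the earlier sketch

    class LocalDelivery {
        std::map<uint64_t, Envelope> pending;       // TN -> message awaiting the buddy's ack
    public:
        void deliverLocal(const Envelope &e) {
            pending[e.tn] = e;
            sendEntryToBuddy(e);                    // persist (senderId, sn, tn) remotely
        }
        void onBuddyAck(uint64_t tn) {              // buddy has stored the entry
            auto it = pending.find(tn);
            if (it != pending.end()) { process(it->second); pending.erase(it); }
        }
    private:
        void sendEntryToBuddy(const Envelope &) {}  // transport stub
        void process(const Envelope &) {}           // hand to the receiver chare
    };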
Checkpoint Protocol
- A processor asynchronously decides to checkpoint
- It packs up the state of all its chares and sends it to the buddy
- Message logs are part of a chare's state
- Message logs on senders can then be garbage collected
- Deciding when to checkpoint is an interesting problem
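A rough sketch of the packing step (the Checkpointer class and its transport are hypothetical): all chare state on the processor, message logs included, is serialized into one image and shipped to the buddy's memory.

    #include <vector>

    // Serialized state of one chare, including its part of the message log.
    struct ChareState { std::vector<char> bytes; };

    class Checkpointer {
        int buddyPe;                                  // rank of the buddy processor
    public:
        explicit Checkpointer(int buddy) : buddyPe(buddy) {}
        void checkpoint(const std::vector<ChareState> &chares) {
            std::vector<char> image;                  // one in-memory checkpoint image
            for (const ChareState &c : chares)        // pack every chare on this processor
                image.insert(image.end(), c.bytes.begin(), c.bytes.end());
            sendToBuddy(buddyPe, image);              // the buddy keeps it in memory
            // Once the buddy acknowledges, senders can garbage-collect logged
            // messages that this checkpoint already captures.
        }
    private:
        void sendToBuddy(int, const std::vector<char> &) {}  // transport stub
    };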
Reliability
- There is only one scenario in which the protocol fails:
  - Processor X (the buddy of Y) crashes and restarts, so Y's checkpoint is lost
  - Y then crashes before saving a new checkpoint
- This is the price of not assuming reliable nodes for storing checkpoints
- The protocol still increases reliability by orders of magnitude
- The probability can be minimized by having Y checkpoint as soon as X crashes and restarts
Basic Restart Protocol
- After a crash, a Charm++ process is restarted on a new processor
- It gets the checkpoint and the local message log from the buddy
- The chares are restored, and the other processors are informed of the restart
- Logged messages for chares on the restarted processor are resent; each sender also sends the highest TN it has seen from a crashed chare
- The restarted chares reprocess the messages
- Local messages first check for a stored TN in the restored local message log
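The replay step can be sketched as follows (again with hypothetical names). The key property is that resent messages are buffered and processed strictly in TN order up to the highest pre-crash ticket, so the restarted chare repeats exactly its pre-crash execution.

    #include <cstdint>
    #include <map>

    struct Envelope { int senderId, receiverId; uint64_t sn, tn; };  // as in the earlier sketch

    class Recovery {
        std::map<uint64_t, Envelope> buffered;     // TN -> resent message
        uint64_t nextTn = 1;                       // next ticket to process
        uint64_t highestTn;                        // highest TN issued before the crash
    public:
        explicit Recovery(uint64_t highest) : highestTn(highest) {}
        void onResent(const Envelope &e) {
            buffered[e.tn] = e;
            // Replay every message whose turn has come, in ticket order.
            while (!buffered.empty() && buffered.begin()->first == nextTn) {
                process(buffered.begin()->second); // deterministic re-execution
                buffered.erase(buffered.begin());
                ++nextTn;
            }
        }
        bool caughtUp() const { return nextTn > highestTn; }  // past the crash point
    private:
        void process(const Envelope &) {}          // run the message handler again
    };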
Parallel Restart
- Message logging allows fault-free processors to continue with their execution
- However, sooner or later some processors start waiting for the crashed processor
- Virtualization allows us to move work from the restarted processor to the waiting processors:
  - Chares are restarted in parallel
  - Restart cost can be reduced
Present Status
- Most of Charm++ has been ported
- Support for migration has not yet been implemented in the fault tolerant protocol
- Simple AMPI programs work; barriers remain to be done
- Parallel restart is not yet implemented
Experimental Evaluation
- The NAS benchmarks could not be used
- Used a 5-point stencil computation with a 1-D decomposition
- Hardware: a cluster of 8 quad-processor 500 MHz Pentium III nodes with 500 MB of RAM per node, connected by Ethernet
Overhead
- Measurement of overhead for an application with a low communication-to-computation ratio
Overhead (contd.)
- Measurement of overhead for an application with a high communication-to-computation ratio
Recovery Performance
- Execution time with an increasing number of faults on 8 processors (checkpoint period: 30 s)
Summary
- Designed a fault tolerant protocol that:
  - Performs fast checkpoints
  - Performs fast parallel restarts
  - Doesn't depend on any completely reliable node
  - Supports multiple faults
  - Minimizes the effect of a crash on fault-free processors
- The protocol is partially implemented
Future Work
- Include support for migration in the protocol
- Implement parallel restart
- Extend to AMPI
- Test with the NAS benchmarks
- Study the tradeoffs involved in deciding the checkpoint period