A Fault Tolerant Protocol for Massively Parallel Machines
Sayantan Chakravorty, Laxmikant Kale
University of Illinois, Urbana-Champaign
Outline
- Motivation
- Background
- Design
- Protocols
- Results
- Summary
- Future Work
Motivation
- As machines grow in size, MTBF decreases, so applications have to tolerate faults
- Checkpoint/rollback doesn't scale:
  - All nodes are rolled back just because one crashed
  - Even nodes independent of the crashed node are restarted
  - Restart cost is similar to the checkpoint period
Requirements
- Fast and scalable checkpoints
- Fast restart:
  - Only the crashed processor should be restarted
  - Minimize the effect on fault-free processors
  - Restart cost less than the checkpoint period
- Low fault-free runtime overhead
- Transparent to the user
Background
- Checkpoint-based methods:
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]; Co-check, Starfish, and Clip are fault-tolerant MPIs
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], doesn't scale well
- Log-based methods:
  - Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  - Optimistic: [Strom85], unbounded rollback and complicated recovery
  - Causal logging: Manetho [Elnozahy93], complicated causality tracking and recovery
Design
- Message logging:
  - Sender-side message logging
- Asynchronous checkpoints:
  - Each processor has a buddy processor
  - Each processor stores its checkpoint in the buddy's memory
- Processor virtualization:
  - Speeds up restart
System Model
- Processors are fail-stop
- All communication is through messages
- The piecewise deterministic assumption holds
- The machine has a fault detection system
- The network doesn't guarantee delivery order
- There are no fully reliable nodes in the system
- The idea of processor virtualization is used
Processor Virtualization
- User view vs. system implementation
- Charm++:
  - Parallel C++ with data-driven objects (chares)
  - The runtime maps objects to physical processors
  - Asynchronous method invocation
- Adaptive MPI (AMPI):
  - Implemented on Charm++
  - Multiple virtual processors on a physical processor
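The data-driven model is easiest to see in code. Below is a minimal, illustrative Charm++ program (module, file, and class names are invented for this sketch, not taken from the talk): a chare is declared in a .ci interface file, invoked asynchronously through a proxy, and placed on a physical processor by the runtime.

    // hello.ci -- Charm++ interface file (illustrative)
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      chare Worker {
        entry Worker();
        entry void work(int n);          // asynchronous entry method
      };
    };

    // hello.C
    #include "hello.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        // The runtime, not the programmer, picks the physical processor.
        CProxy_Worker w = CProxy_Worker::ckNew();
        w.work(42);                      // async invocation: returns immediately
        delete m;
      }
    };

    class Worker : public CBase_Worker {
    public:
      Worker() {}
      void work(int n) {
        CkPrintf("working on %d on PE %d\n", n, CkMyPe());
        CkExit();
      }
    };

    #include "hello.def.h"

Because invocation is asynchronous and objects are location-independent, the runtime can overlap communication with computation and migrate chares, which the following slides exploit.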
Benefits of Virtualization
- Latency tolerance
- Adaptive overlap of communication and computation
- Supports migration of virtual processors
Message Logging Protocol
- Correctness: messages should be processed in the same order before and after the crash
- Problem: messages A, B, and C that arrived in one order before the crash may be redelivered in a different order after it
Message Logging (contd.)
- Solution: fix an order the first time and always follow it:
  - The receiver gives each message a ticket number
  - Messages are processed in order of ticket number
- Each message contains:
  - Sender ID: who sent it
  - Receiver ID: to whom it was sent
  - Sequence number (SN): together with the sender and receiver IDs, identifies a message
  - Ticket number (TN): decides the order of processing
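As a rough illustration of these fields (all names here are hypothetical, not from the actual implementation), the following C++ sketch shows the envelope and a receiver-side ticketing table. Crucially, if the same (sender ID, SN) pair is presented again, the receiver hands back the TN it issued the first time, which is what fixes the processing order across a crash.

    #include <cstdint>
    #include <map>
    #include <utility>

    // Hypothetical envelope carrying the fields named above.
    struct Envelope {
        int      senderId;    // who sent it
        int      receiverId;  // to whom it was sent
        uint64_t sn;          // with sender and receiver IDs, identifies a message
        uint64_t tn;          // ticket number: fixes the processing order
    };

    // Receiver-side ticketing table.
    class Ticketing {
        uint64_t nextTn = 1;
        std::map<std::pair<int, uint64_t>, uint64_t> issued;  // (senderId, sn) -> tn
    public:
        uint64_t ticketFor(int senderId, uint64_t sn) {
            auto key = std::make_pair(senderId, sn);
            auto it = issued.find(key);
            if (it != issued.end())
                return it->second;           // seen earlier: reuse the old TN
            return issued[key] = nextTn++;   // new message: issue a fresh TN
        }
    };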
Message to Remote Chares
- Chare P is the sender, chare Q the receiver; before sending the message, P requests a ticket from Q
- If the (sender ID, SN) pair has been seen earlier, the stored TN is returned and the message is marked as received
- Otherwise Q creates a new TN and stores the (sender ID, SN, TN) entry
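A hedged sketch of the sender's side of this exchange (class and method names are invented; the real Charm++ implementation differs in its details): the message is logged first, a ticket is requested from the receiver, and the payload is transmitted once the TN arrives.

    #include <cstdint>
    #include <vector>

    struct Envelope { int senderId, receiverId; uint64_t sn, tn; };  // as in the earlier sketch

    class Sender {
        int self;                                      // this processor's rank
        uint64_t nextSn = 1;
        struct Logged { Envelope env; std::vector<char> payload; };
        std::vector<Logged> log;                       // sender-side message log
    public:
        explicit Sender(int pe) : self(pe) {}
        void send(int dest, std::vector<char> payload) {
            log.push_back({{self, dest, nextSn, 0}, std::move(payload)});
            requestTicket(log.back().env);             // receiver runs its ticketing table
            ++nextSn;
        }
        void onTicketReply(uint64_t sn, uint64_t tn) { // receiver answered with a TN
            for (Logged &m : log)
                if (m.env.sn == sn) { m.env.tn = tn; transmit(m.env, m.payload); }
        }
    private:
        void requestTicket(const Envelope &) {}        // transport stub
        void transmit(const Envelope &, const std::vector<char> &) {}  // transport stub
    };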
Message to Local Chare
- Multiple chares can live on one processor, so a message between chares P and Q on processor R never leaves R
- If R crashes, all trace of such a local message is lost, yet after restart the message should get the same TN
- The (sender ID, SN, TN) entry is therefore stored on R's buddy, which acknowledges it before the message is processed
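A minimal sketch of this hold-until-acknowledged step, under the same hypothetical names as the earlier sketches: the local message is buffered, its (sender ID, SN, TN) entry is shipped to the buddy, and processing happens only after the buddy's ack, since the TN then survives a crash of this processor.

    #include <cstdint>
    #include <map>

    struct Envelope { int senderId, receiverId; uint64_t sn, tn; };  // as in the earlier sketch

    class LocalDelivery {
        std::map<uint64_t, Envelope> pending;       // TN -> message awaiting the buddy's ack
    public:
        void deliverLocal(const Envelope &e) {
            pending[e.tn] = e;
            sendEntryToBuddy(e);                    // persist (senderId, sn, tn) remotely
        }
        void onBuddyAck(uint64_t tn) {              // buddy has stored the entry
            auto it = pending.find(tn);
            if (it != pending.end()) { process(it->second); pending.erase(it); }
        }
    private:
        void sendEntryToBuddy(const Envelope &) {}  // transport stub
        void process(const Envelope &) {}           // hand to the receiver chare
    };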
Checkpoint Protocol
- A processor asynchronously decides to checkpoint
- It packs up the state of all its chares and sends it to the buddy
- Message logs are part of a chare's state
- Message logs on senders can then be garbage collected
- Deciding when to checkpoint is an interesting problem
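A rough sketch of the packing step (the Checkpointer class and its transport are hypothetical): all chare state on the processor, message logs included, is serialized into one image and shipped to the buddy's memory.

    #include <vector>

    // Serialized state of one chare, including its part of the message log.
    struct ChareState { std::vector<char> bytes; };

    class Checkpointer {
        int buddyPe;                                  // rank of the buddy processor
    public:
        explicit Checkpointer(int buddy) : buddyPe(buddy) {}
        void checkpoint(const std::vector<ChareState> &chares) {
            std::vector<char> image;                  // one in-memory checkpoint image
            for (const ChareState &c : chares)        // pack every chare on this processor
                image.insert(image.end(), c.bytes.begin(), c.bytes.end());
            sendToBuddy(buddyPe, image);              // the buddy keeps it in memory
            // Once the buddy acknowledges, senders can garbage-collect logged
            // messages that this checkpoint already captures.
        }
    private:
        void sendToBuddy(int, const std::vector<char> &) {}  // transport stub
    };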
Reliability
- There is only one scenario in which the protocol fails:
  - Processor X (the buddy of Y) crashes and restarts, so Y's checkpoint is lost
  - Y then crashes before saving a new checkpoint
- This is the price of not assuming reliable nodes for storing checkpoints
- The protocol still increases reliability by orders of magnitude
- The probability can be minimized by having Y checkpoint as soon as X crashes and restarts
Basic Restart Protocol
- After a crash, a Charm++ process is restarted on a new processor
- It gets the checkpoint and the local message log from the buddy
- The chares are restored, and the other processors are informed of the restart
- Logged messages for chares on the restarted processor are resent; each sender also sends the highest TN it has seen from a crashed chare
- The restarted chares reprocess the messages
- Local messages first check for a stored TN in the restored local message log
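The replay step can be sketched as follows (again with hypothetical names). The key property is that resent messages are buffered and processed strictly in TN order up to the highest pre-crash ticket, so the restarted chare repeats exactly its pre-crash execution.

    #include <cstdint>
    #include <map>

    struct Envelope { int senderId, receiverId; uint64_t sn, tn; };  // as in the earlier sketch

    class Recovery {
        std::map<uint64_t, Envelope> buffered;     // TN -> resent message
        uint64_t nextTn = 1;                       // next ticket to process
        uint64_t highestTn;                        // highest TN issued before the crash
    public:
        explicit Recovery(uint64_t highest) : highestTn(highest) {}
        void onResent(const Envelope &e) {
            buffered[e.tn] = e;
            // Replay every message whose turn has come, in ticket order.
            while (!buffered.empty() && buffered.begin()->first == nextTn) {
                process(buffered.begin()->second); // deterministic re-execution
                buffered.erase(buffered.begin());
                ++nextTn;
            }
        }
        bool caughtUp() const { return nextTn > highestTn; }  // past the crash point
    private:
        void process(const Envelope &) {}          // run the message handler again
    };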
Parallel Restart
- Message logging allows fault-free processors to continue with their execution
- However, sooner or later some processors start waiting for the crashed processor
- Virtualization allows us to move work from the restarted processor to the waiting processors:
  - Chares are restarted in parallel
  - Restart cost can be reduced
Present Status
- Most of Charm++ has been ported
- Support for migration has not yet been implemented in the fault tolerant protocol
- Simple AMPI programs work; barriers remain to be done
- Parallel restart is not yet implemented
Experimental Evaluation
- The NAS benchmarks could not be used
- Used a 5-point stencil computation with a 1-D decomposition
- Hardware: a cluster of 8 quad-processor 500 MHz Pentium III nodes with 500 MB of RAM per node, connected by Ethernet
Overhead
- Measurement of overhead for an application with a low communication-to-computation ratio
Overhead (contd.)
- Measurement of overhead for an application with a high communication-to-computation ratio
Recovery Performance
- Execution time with an increasing number of faults on 8 processors (checkpoint period: 30 s)
Summary
- Designed a fault tolerant protocol that:
  - Performs fast checkpoints
  - Performs fast parallel restarts
  - Doesn't depend on any completely reliable node
  - Supports multiple faults
  - Minimizes the effect of a crash on fault-free processors
- The protocol is partially implemented
Future Work
- Include support for migration in the protocol
- Implement parallel restart
- Extend to AMPI
- Test with the NAS benchmarks
- Study the tradeoffs involved in deciding the checkpoint period