Automatic Fault Tolerance in ProActive
Christian Delbé
OASIS Team
INRIA -- CNRS - I3S -- Univ. of Nice Sophia-Antipolis
November

Fault Tolerance
A system is said to be fault tolerant if it can continue operating properly in the event of failure of some of its parts.
New requirements for Grid Computing
  Large scale
  High failure rate
    Simultaneous failures
  Heterogeneous software
    Portability
  Heterogeneous hardware
    Different dependability characteristics in each group

Fault Tolerance in Java
Rollback-Recovery approach
  Each process periodically takes a checkpoint
  Based on the availability of a stable storage
  Checkpoints are used to recover the application in a correct state
But Java threads are not checkpointable!
Provide checkpointability with specific tools?
  System level, Virtual Machine level, Compiler level
Unfortunately...
  Loss of portability / efficiency
  Unique and non-standard implementation

Fault Tolerance in ProActive
New Communication-Induced Checkpointing protocol (CIC)
Pessimistic Message-Logging protocol (PML)
Non-intrusive
  100% standard Java, based on serialization
Transparent for the programmer
  Fault tolerance settings in deployment descriptors
Based on a Fault Tolerance Server
  Checkpoint storage
  Failure detection
  Resource service (deployed nodes or P2P infrastructure)
  Localization service

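The slide says checkpoints rely on standard Java serialization rather than on thread-level checkpointing. The following is a minimal sketch of that idea only, not ProActive's implementation: `JacobiState` and `SerializationCheckpoint` are hypothetical names, and a byte array stands in for the checkpoint storage of the Fault Tolerance Server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical application state; not a ProActive class.
class JacobiState implements Serializable {
    private static final long serialVersionUID = 1L;
    int iteration;
    double[] values;

    JacobiState(int iteration, double[] values) {
        this.iteration = iteration;
        this.values = values;
    }
}

// Hypothetical helper: checkpoint and recovery through plain Java serialization.
class SerializationCheckpoint {

    // Serialize the state into bytes that could be sent to stable storage.
    static byte[] checkpoint(Serializable state) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        }
        return bytes.toByteArray();
    }

    // Rebuild the state from a stored checkpoint.
    static Object recover(byte[] stored) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stored))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        JacobiState state = new JacobiState(3, new double[] {1.0, 2.0});
        byte[] stored = checkpoint(state);
        state.iteration = 7; // work done after the checkpoint is lost on failure
        JacobiState restored = (JacobiState) recover(stored);
        System.out.println("restored iteration = " + restored.iteration);
    }
}
```

The restored object reflects the state at checkpoint time, so any work done afterwards has to be redone or replayed by the recovery protocol.
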
CIC Protocol Overview
Creation of a consistent global snapshot
Non-blocking synchronization: low failure-free overhead
After a failure, the entire system restarts
Recovery time increases with system size
[Figure: processes p1, p2, p3, p4]

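To illustrate how a consistent global snapshot can be built without blocking, here is a sketch of a generic index-based communication-induced checkpointing rule. It is an assumption made for illustration, not the specific CIC protocol of ProActive: each message carries the sender's checkpoint index, and a receiver that is behind takes a forced checkpoint before delivering the message. `CicMessage` and `CicProcess` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical message type: the payload plus the sender's checkpoint index.
class CicMessage {
    final String payload;
    final int senderIndex;

    CicMessage(String payload, int senderIndex) {
        this.payload = payload;
        this.senderIndex = senderIndex;
    }
}

// Hypothetical process model for a generic index-based CIC scheme.
class CicProcess {
    int checkpointIndex = 0;
    final List<Integer> checkpoints = new ArrayList<>();
    final List<String> delivered = new ArrayList<>();

    // Periodic local checkpoint: move to the next index.
    void takeLocalCheckpoint() {
        checkpointIndex++;
        checkpoints.add(checkpointIndex);
    }

    // Piggyback the current checkpoint index on every outgoing message.
    CicMessage send(String payload) {
        return new CicMessage(payload, checkpointIndex);
    }

    // If the sender is ahead, take a forced checkpoint before delivery,
    // so the message is not received "before" a checkpoint it was sent "after".
    void receive(CicMessage message) {
        if (message.senderIndex > checkpointIndex) {
            checkpointIndex = message.senderIndex;
            checkpoints.add(checkpointIndex);
        }
        delivered.add(message.payload);
    }

    public static void main(String[] args) {
        CicProcess p1 = new CicProcess();
        CicProcess p2 = new CicProcess();
        p1.takeLocalCheckpoint();      // p1 moves to index 1
        p2.receive(p1.send("m1"));     // p2 is forced to checkpoint at index 1
        System.out.println("p2 checkpoint index = " + p2.checkpointIndex);
    }
}
```

In this scheme, the checkpoints that share an index across all processes form the global snapshot, and no process ever waits for another, which matches the low failure-free overhead noted on the slide.
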
PML Protocol Overview
Independent checkpoints
All messages must be logged
Failure-free overhead increases with message rate
After a failure, only the faulty process restarts
Recovery time is independent of system size
[Figure: processes p1, p2, p3, p4 and message m1]

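The logging rule above can be sketched as follows. This is a simplified single-process model written for illustration, not ProActive's PML code: `PmlProcess` is a hypothetical class, and an in-memory list stands in for the stable storage that survives the failure. Each message is logged before it is delivered, so a restarted process can reload its own checkpoint and replay the messages logged since then without involving any other process.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical process model for pessimistic message logging.
class PmlProcess {
    // Stand-ins for stable storage: these fields survive the simulated failure.
    private final List<Integer> stableLog = new ArrayList<>();
    private int checkpointedSum = 0;
    private int checkpointedLogSize = 0;

    // Volatile application state, lost on failure.
    int sum = 0;

    // Pessimistic rule: log the message first, deliver it second.
    void receive(int message) {
        stableLog.add(message);
        sum += message;
    }

    // Independent checkpoint: no coordination with other processes.
    void checkpoint() {
        checkpointedSum = sum;
        checkpointedLogSize = stableLog.size();
    }

    // Recovery: reload the checkpoint, then replay messages logged after it.
    void failAndRecover() {
        sum = checkpointedSum;
        for (int i = checkpointedLogSize; i < stableLog.size(); i++) {
            sum += stableLog.get(i);
        }
    }

    public static void main(String[] args) {
        PmlProcess p = new PmlProcess();
        p.receive(1);
        p.receive(2);
        p.checkpoint();
        p.receive(3);
        p.failAndRecover();
        System.out.println("sum after recovery = " + p.sum);
    }
}
```

The cost of this design is visible in `receive`: every message pays for a write to the log, which is why the failure-free overhead grows with the message rate.
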
Performance Comparison: CIC vs PML
Jacobi iteration (SPMD iterative reduction of a matrix) on matrices of two different sizes
As system size increases
  Checkpoint size decreases
  Message rate increases

Mixing CIC and PML
Based on Recovery Groups
  Independent groups linked with PML
  After a failure, only the affected group has to restart
  Fault Tolerance Servers are independent
Groups
  Dynamically created on a common stable server
[Figure labels: CIC, PML]

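The recovery-group idea can be sketched as two decisions: which processes restart after a failure, and which messages need logging. The class below is a hypothetical illustration under the assumption that CIC runs inside each group and PML applies to messages crossing group boundaries; `RecoveryGroups` and the process and group names are not taken from ProActive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of recovery groups for a mixed CIC/PML scheme.
class RecoveryGroups {
    private final Map<String, String> groupOf = new HashMap<>();

    void assign(String process, String group) {
        groupOf.put(process, group);
    }

    // After a failure, every process in the failed process's group restarts;
    // processes in other groups keep running.
    Set<String> processesToRestart(String failed) {
        String group = groupOf.get(failed);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> entry : groupOf.entrySet()) {
            if (entry.getValue().equals(group)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Only messages that cross a group boundary need to be logged.
    boolean mustLog(String sender, String receiver) {
        return !Objects.equals(groupOf.get(sender), groupOf.get(receiver));
    }

    public static void main(String[] args) {
        RecoveryGroups groups = new RecoveryGroups();
        groups.assign("p1", "A");
        groups.assign("p2", "A");
        groups.assign("p3", "B");
        groups.assign("p4", "B");
        System.out.println(groups.processesToRestart("p4"));
        System.out.println(groups.mustLog("p2", "p3"));
    }
}
```

With two groups of two processes, a failure of `p4` restarts only `p3` and `p4`, and only traffic such as `p2` to `p3` pays the logging cost.
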
Rollback on Grid requirements
Large scale
  + Divide-and-conquer approach
High failure rate
  + Failure impact limited to the group
  + Can handle multiple failures
Heterogeneous software
  + Only standard Java
Heterogeneous hardware
  + Can apply the best-suited settings in each group

Performance Comparison: CIC vs Mixed
Jacobi iteration on two matrix sizes
Two groups mapped on two clusters of Grid5000

Performance Comparison: CIC vs Mixed
Jacobi iteration on two matrix sizes
Two groups mapped on two clusters of Grid nodes

Fault Tolerance in ProActive
Automatic and Transparent Fault Tolerance
  Easy to use
  Configured at deployment time
Three protocols: CIC, PML, Mixed
  Choice depends on hardware and application properties
Next release 3.2 will include the Mixed protocol
[Figure axes: Failure Frequency (- to +), Communication Rate (- to +)]

Performance of the Mixed protocol
Jacobi iteration on a matrix
Groups mapped on 4 to 6 clusters of Grid5000

CIC Performance Evaluation
Jacobi iteration (SPMD iterative reduction of a matrix)
CG NAS Parallel Benchmark (Conjugate Gradient)