Christian Delbé, OASIS Team, INRIA -- CNRS - I3S -- Univ. of Nice Sophia-Antipolis. November 29, 2006. Automatic Fault Tolerance in ProActive.

Presentation transcript:

1 Automatic Fault Tolerance in ProActive
Christian Delbé, OASIS Team, INRIA -- CNRS - I3S -- Univ. of Nice Sophia-Antipolis
November 29, 2006

2 Fault Tolerance
A system is said to be fault tolerant if it can continue operating properly in the event of failure of some of its parts.
New requirements for Grid Computing:
  Large scale
  High failure rate: simultaneous failures
  Heterogeneous software: portability
  Heterogeneous hardware: different dependability characteristics in each group

3 Fault Tolerance in Java
Rollback-recovery approach:
  Each process periodically takes a checkpoint
  Based on the availability of stable storage
  Checkpoints are used to recover the application in a correct state
But Java threads are not checkpointable!
Provide checkpointability with specific tools?
  System level, virtual machine level, compiler level
Unfortunately:
  Loss of portability / efficiency
  Unique and non-standard implementation

4 Fault Tolerance in ProActive
New Communication-Induced Checkpointing protocol (CIC)
Pessimistic Message-Logging protocol (PML)
Non-intrusive:
  100% standard Java, based on serialization
  Transparent for the programmer
  Fault tolerance settings in deployment descriptors
Based on a Fault Tolerance Server:
  Checkpoint storage
  Failure detection
  Resource service (deployed nodes or P2P infrastructure)
  Localization service
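To make "100% standard Java, based on serialization" concrete, here is a minimal sketch of checkpointing an object's state with plain Java serialization. It is not ProActive's API: the class `WorkerState`, the methods `checkpoint` and `restore`, and the in-memory byte array standing in for stable storage are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class SerializationCheckpoint {

    /** Hypothetical application state; any Serializable object graph would do. */
    static class WorkerState implements Serializable {
        private static final long serialVersionUID = 1L;
        int iteration;
        List<Double> partialResults = new ArrayList<>();
    }

    /** Serializes the state to bytes. A real system would ship these bytes to stable storage. */
    static byte[] checkpoint(WorkerState state) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(state);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds the state exactly as it was when the checkpoint was taken. */
    static WorkerState restore(byte[] snapshot) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (WorkerState) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        WorkerState state = new WorkerState();
        state.iteration = 3;
        state.partialResults.add(1.5);

        byte[] snapshot = checkpoint(state);

        // Work done after the checkpoint; it is lost if the process fails now.
        state.iteration = 7;
        state.partialResults.add(2.5);

        WorkerState recovered = restore(snapshot);
        System.out.println(recovered.iteration + " " + recovered.partialResults); // prints "3 [1.5]"
    }
}
```

Because only the object state is serialized, and not the thread stack, this approach stays within standard Java, which is the portability argument the slide makes.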

5 CIC Protocol Overview
Creation of a consistent global snapshot
Non-blocking synchronization: low failure-free overhead
[Figure: execution timelines of processes p1, p2, p3, p4]

6 CIC Protocol Overview
Creation of a consistent global snapshot
Non-blocking synchronization: low failure-free overhead
After a failure, the entire system restarts
Recovery time increases with system size
[Figure: execution timelines of processes p1, p2, p3, p4]
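The slides do not detail how the CIC protocol builds its global snapshot. As an illustration of the general idea only, the sketch below uses a classic index-based scheme: each message piggybacks the sender's checkpoint index, and a receiver that sees a higher index takes a forced checkpoint before delivering the message. This is a generic textbook technique, not necessarily the protocol implemented in ProActive, and every class and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexBasedCic {

    /** A message carrying the sender's checkpoint index (the piggybacked control data). */
    static class Message {
        final String payload;
        final int senderIndex;

        Message(String payload, int senderIndex) {
            this.payload = payload;
            this.senderIndex = senderIndex;
        }
    }

    static class Process {
        final String name;
        int checkpointIndex = 0;
        final List<Integer> checkpointsTaken = new ArrayList<>();
        final List<String> delivered = new ArrayList<>();

        Process(String name) {
            this.name = name;
            checkpointsTaken.add(0); // the initial state counts as checkpoint 0
        }

        /** Periodic local checkpoint, taken without waiting for any other process. */
        void takeLocalCheckpoint() {
            checkpointIndex++;
            checkpointsTaken.add(checkpointIndex);
        }

        Message send(String payload) {
            return new Message(payload, checkpointIndex);
        }

        /** A message from a "later" checkpoint interval forces a checkpoint before delivery. */
        void receive(Message m) {
            if (m.senderIndex > checkpointIndex) {
                checkpointIndex = m.senderIndex;
                checkpointsTaken.add(checkpointIndex);
            }
            delivered.add(m.payload);
        }
    }

    /** Lowest index reached by every process: the whole system rolls back to this line. */
    static int recoveryLine(List<Process> processes) {
        int line = Integer.MAX_VALUE;
        for (Process p : processes) {
            line = Math.min(line, p.checkpointIndex);
        }
        return line;
    }

    public static void main(String[] args) {
        Process p1 = new Process("p1");
        Process p2 = new Process("p2");

        p1.takeLocalCheckpoint();      // p1 moves to index 1
        p2.receive(p1.send("m1"));     // p2 is forced to index 1 before delivering m1

        System.out.println(p2.checkpointsTaken);          // prints "[0, 1]"
        System.out.println(recoveryLine(List.of(p1, p2))); // prints "1"
    }
}
```

No process ever blocks waiting for another, which matches the "non-blocking synchronization" point; the cost is that recovery involves every process, in line with the slide's remark that recovery time grows with system size.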

7 PML Protocol Overview
Independent checkpoints
All messages must be logged
Failure-free overhead increases with message rate
[Figure: execution timelines of processes p1, p2, p3, p4 with message m1]

8 PML Protocol Overview
Independent checkpoints
All messages must be logged
Failure-free overhead increases with message rate
After a failure, only the faulty process restarts
Recovery time is independent of system size
[Figure: execution timelines of processes p1, p2, p3, p4 with message m1]
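The principle of pessimistic message logging can be sketched as follows: every message is written to stable storage before it is delivered, so a failed process can be rebuilt from its own last checkpoint plus a replay of the logged messages, without involving its peers. The classes `StableStorage` and `Receiver` are hypothetical, the storage is a plain in-memory object, and the process state is reduced to a running sum; this is a sketch of the technique, not ProActive code.

```java
import java.util.ArrayList;
import java.util.List;

public class PessimisticLogging {

    /** Stands in for stable storage (hypothetical, in-memory only). */
    static class StableStorage {
        int checkpointedSum = 0;
        final List<Integer> messageLog = new ArrayList<>();
    }

    /** A process whose whole state is the sum of the messages it has received. */
    static class Receiver {
        int sum = 0;
        final StableStorage storage;

        Receiver(StableStorage storage) {
            this.storage = storage;
        }

        /** Pessimistic rule: the message is logged before it can affect the state. */
        void deliver(int message) {
            storage.messageLog.add(message);
            sum += message;
        }

        /** Independent checkpoint: messages received before it no longer need replaying. */
        void checkpoint() {
            storage.checkpointedSum = sum;
            storage.messageLog.clear();
        }

        /** Recovery touches only this process: restore the checkpoint, then replay the log. */
        static Receiver recover(StableStorage storage) {
            Receiver restarted = new Receiver(storage);
            restarted.sum = storage.checkpointedSum;
            for (int message : storage.messageLog) {
                restarted.sum += message; // replayed messages are already logged
            }
            return restarted;
        }
    }

    public static void main(String[] args) {
        StableStorage storage = new StableStorage();
        Receiver p4 = new Receiver(storage);

        p4.deliver(1);
        p4.deliver(2);
        p4.checkpoint();
        p4.deliver(3);
        p4.deliver(4);

        // p4 fails here; a fresh instance is rebuilt from stable storage alone.
        Receiver restarted = Receiver.recover(storage);
        System.out.println(restarted.sum); // prints "10"
    }
}
```

Each `deliver` call pays for a write to storage, which is why the failure-free overhead grows with the message rate, while recovery cost does not depend on how many other processes exist.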

9 Performance comparison: CIC vs PML
Jacobi iteration (SPMD iterative reduction of matrix) on matrices of size 3000² and 5000²
As system size increases: checkpoint size decreases, message rate increases

10 Mixing CIC and PML
Based on recovery groups:
  Independent groups linked with PML
  After a failure, only the affected group has to restart
  Fault Tolerance Servers are independent
Groups are dynamically created on a common stable server
[Figure: diagram of groups with labels CIC and PML]
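The effect of recovery groups on the restart set can be sketched in a few lines: given a mapping from process to group, a failure triggers the restart of the members of one group only. The mapping, the group names "A" and "B", and the method `processesToRestart` are hypothetical and only illustrate the containment property described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RecoveryGroups {

    /** Returns the processes that must roll back when {@code failed} crashes. */
    static Set<String> processesToRestart(Map<String, String> groupOf, String failed) {
        String group = groupOf.get(failed);
        Set<String> restart = new TreeSet<>();
        for (Map.Entry<String, String> entry : groupOf.entrySet()) {
            // Members of the failed process's group roll back together;
            // other groups keep running.
            if (entry.getValue().equals(group)) {
                restart.add(entry.getKey());
            }
        }
        return restart;
    }

    public static void main(String[] args) {
        Map<String, String> groupOf = new LinkedHashMap<>();
        groupOf.put("p1", "A");
        groupOf.put("p2", "A");
        groupOf.put("p3", "B");
        groupOf.put("p4", "B");

        System.out.println(processesToRestart(groupOf, "p4")); // prints "[p3, p4]"
    }
}
```

With a single group this degenerates to a full-system restart, and with one process per group only the failed process restarts, so the group size is the knob that trades recovery scope against logging overhead between groups.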

11 Rollback on Grid requirements
Large scale
  + Divide-and-conquer approach
High failure rate
  + Failure impact limited to the group
  + Can handle multiple failures
Heterogeneous software
  + Only standard Java
Heterogeneous hardware
  + Can apply the best-suited settings in each group

12 Performance Comparison: CIC vs Mixed
Jacobi iteration on 7000² and 9000² matrices
Two groups mapped on two clusters of Grid5000

13 Performance Comparison: CIC vs Mixed
Jacobi iteration on 7000² and 9000² matrices
Two groups mapped on two clusters of Grid5000
[Figure: plot with an axis labelled "nodes"]

14 Fault Tolerance in ProActive
Automatic and transparent fault tolerance:
  Easy to use
  Configured at deployment time
Three protocols (CIC, PML, Mixed): the choice depends on hardware and application properties
Next release 3.2 will include the Mixed protocol
[Figure: protocols placed along two axes, failure frequency (- to +) and communication rate (- to +)]

15 Performance of the Mixed protocol
Jacobi iteration on a 16000 matrix
Groups mapped on 4 to 6 clusters of Grid5000

16 CIC Performance Evaluation
Jacobi iteration (SPMD iterative reduction of matrix)
CG NAS Parallel Benchmark (Conjugate Gradient)

17 CIC Performance Evaluation
Jacobi iteration (SPMD iterative reduction of matrix)
CG NAS Parallel Benchmark (Conjugate Gradient)

