ERLANGEN REGIONAL COMPUTING CENTER
08.09.2015, 1st International Workshop on Fault Tolerant Systems, IEEE Cluster '15
Building a fault tolerant application using the GASPI communication layer

2 Motivation
- Today, growth in computational capacity comes mainly from extreme levels of hardware parallelism.
- On future machines, the mean time to failure is expected to be on the order of minutes to hours.
- Without a fault tolerant environment, valuable data is at risk.
- The lack of a well-defined fault tolerant environment is the first big challenge in developing fault tolerant applications.

3 Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection:
   i. Who detects the failure?
   ii. Failure information propagation
   iii. Consensus about failed processes
2. Processes and communicator recovery
   i. Shrink
   ii. Spawn
   iii. Spare
3. Lost data recovery
…

4 Failure detection approaches:
1. Ping based all-to-all: within each iteration, the health of all processes is checked.
2. Ping based neighbor-level: after neighbor failure detection -> check all-to-all health
3. Unsuccessful communication: after failure detection -> check all-to-all health
4. Dedicated failure detection process(es):
   - Pings all other processes
   - Has a global view of process health
   - Propagates the failure info to the remaining processes
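To compare how much probing traffic the four approaches generate, the following Python sketch counts the pings issued in one health-check round. The function name, the `neighbors_per_proc` parameter, and its default of 2 are illustrative assumptions and do not come from the slides; the follow-up all-to-all check that approaches 2 and 3 run after a detection is not counted.

```python
def pings_per_round(num_procs: int, neighbors_per_proc: int = 2) -> dict:
    """Count the pings issued in one failure-free health-check round."""
    return {
        # 1. every process pings every other process
        "all_to_all": num_procs * (num_procs - 1),
        # 2. every process pings only its communication neighbors
        "neighbor_level": num_procs * neighbors_per_proc,
        # 3. failures are noticed through regular traffic, no extra pings
        "unsuccessful_communication": 0,
        # 4. one dedicated detector pings all other processes
        "dedicated_detector": num_procs - 1,
    }


counts = pings_per_round(256)
```

Under these assumptions, the dedicated detector keeps the probing cost linear in the number of processes and keeps it off the worker processes.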

5 Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection: fault-detector process
2. Processes and communicator recovery: spare nodes
3. Lost data recovery: "neighbor" node-level checkpoint/restart
[Figure: process 0 is the fault-detector process, processes 1-3 are spare nodes, processes 4-9 form the worker communicator]

6 Fault tolerance in GASPI: Introduction (I)
- GASPI: developed by Fraunhofer ITWM, Kaiserslautern, Germany
- Based on the PGAS programming model
- Two memory parts:
  - Local: visible only to the GASPI process (and its threads)
  - Global: available to other processes for reading and writing
- Enables fault tolerance:
  - In case of a single node failure, the rest of the nodes stay up and running
  - Provides a TIMEOUT for every communication call
- Return values: GASPI_SUCCESS, GASPI_TIMEOUT, GASPI_ERROR
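One way an application might react to the three return values is sketched below in Python. This is a simulation, not GASPI code: the `Ret` enum only mirrors the names GASPI_SUCCESS, GASPI_TIMEOUT, and GASPI_ERROR, and `call_with_retry` and `max_retries` are hypothetical names introduced for illustration.

```python
from enum import Enum

# Stand-ins for the three GASPI return values named on the slide.
Ret = Enum("Ret", ["SUCCESS", "TIMEOUT", "ERROR"])


def call_with_retry(op, max_retries=3):
    """Call op() again while it times out; stop on success or error."""
    for _ in range(max_retries):
        ret = op()
        if ret is not Ret.TIMEOUT:
            return ret  # SUCCESS or ERROR: the caller decides what to do
    return Ret.TIMEOUT  # still no answer after max_retries attempts


# Simulated communication call that times out twice, then succeeds.
results = iter([Ret.TIMEOUT, Ret.TIMEOUT, Ret.SUCCESS])
outcome = call_with_retry(lambda: next(results))
```

Because a call returns instead of blocking forever, the caller regains control after a timeout and can, for example, check whether a failure has been reported before retrying.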

7 Fault tolerance in GASPI: Introduction (II)
- What GASPI provides:
  - gaspi_proc_ping(): a process can check the state of any specific process by pinging it. The return value of the ping is either 0 or 1 (healthy or dead).
- User side:
  - Deleting the old communicator, creating the new communicator, setting up the new communication structure, and checkpoint/restart are the user's responsibility.

8 Failure detector (I):
[Figure: the fault detector process (0) checks the idle processes and the processes of the worker communicator with gaspi_proc_ping(), followed by return_val = gaspi_wait()]
return_val is one of:
1) GASPI_SUCCESS
2) GASPI_TIMEOUT
3) GASPI_ERROR

9 Failure detector (II):
[Figure: for a failed process, gaspi_proc_ping() followed by return_val = gaspi_wait() yields GASPI_ERROR. Failed proc(s) IDs: 6, 7. Rescue proc(s) IDs: 1, 2.]
- The detector process informs every process about the failure details via gaspi_write().
return_val is one of:
1) GASPI_SUCCESS
2) GASPI_TIMEOUT
3) GASPI_ERROR
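The detector's bookkeeping on this slide can be sketched as a small Python simulation. The `ping` callable stands in for the gaspi_proc_ping()/gaspi_wait() pair and simply returns True for a healthy process; `detector_scan` and the layout of the notice are hypothetical, not the authors' implementation.

```python
def detector_scan(workers, spares, ping):
    """One scan: ping every worker and pair each failed one with a spare."""
    failed = [rank for rank in workers if not ping(rank)]
    if len(failed) > len(spares):
        raise RuntimeError("not enough spare processes left")
    rescue = spares[:len(failed)]
    remaining_spares = spares[len(failed):]
    # Each rescue process takes over the slot of the worker it replaces.
    replacement = dict(zip(failed, rescue))
    new_workers = [replacement.get(rank, rank) for rank in workers]
    # This notice is what the detector would write to every live process.
    notice = {"failed": failed, "rescue": rescue}
    return notice, new_workers, remaining_spares


dead = {6, 7}  # simulated node failure
notice, new_workers, remaining_spares = detector_scan(
    workers=[4, 5, 6, 7, 8, 9],
    spares=[1, 2, 3],
    ping=lambda rank: rank not in dead,
)
```

With the failure pattern from the figure, the notice lists processes 6 and 7 as failed and processes 1 and 2 as their rescue processes, leaving process 3 as the only remaining spare.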

10 Automatic Fault Tolerance Application: Program flow

11 Asynchronous in-memory checkpointing
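The slide's figure is not part of this transcript, so the following Python sketch only illustrates the general idea of asynchronous in-memory checkpointing to a neighbor: the data is snapshotted, and a background thread transfers the snapshot while computation continues. `NeighborStore` and `checkpoint_async` are hypothetical names, and the thread stands in for whatever one-sided transfer the real application uses.

```python
import threading


class NeighborStore:
    """Stands in for the memory of the neighbor node (hypothetical)."""

    def __init__(self):
        self.data = {}


def checkpoint_async(rank, iteration, vectors, neighbor):
    """Snapshot the vectors now, hand them to the neighbor in the background."""
    # Copy first, so the compute loop may overwrite its working vectors.
    snapshot = [list(v) for v in vectors]

    def _write():
        neighbor.data[rank] = (iteration, snapshot)

    worker = threading.Thread(target=_write)
    worker.start()
    return worker  # join() before the next checkpoint or before recovery


neighbor = NeighborStore()
vectors = [[1.0, 2.0], [3.0, 4.0]]
handle = checkpoint_async(rank=5, iteration=500, vectors=vectors, neighbor=neighbor)
vectors[0][0] = -1.0  # computation continues and modifies its data
handle.join()
```

After `join()` returns, the neighbor holds the state as it was at iteration 500, unaffected by the later modification.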

12 Benchmarks (I): Test bed
- Lanczos algorithm:
  - Checkpoint data structure:
    - After startup: every process stores the matrix communication data structure once.
    - The two most recent Lanczos vectors are stored at each checkpoint iteration.
    - The most recently calculated eigenvalues.
- Test cluster:
  - LiMa (RRZE, Erlangen): 500 nodes, Xeon 5650 "Westmere" chips (12 cores + SMT), 2.66 GHz, 24 GB RAM, QDR InfiniBand
- Checkpoint data: v_j, v_{j+1}, metadata
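Why the two most recent Lanczos vectors are enough to resume can be seen in a small serial Python sketch. It is an assumption-laden toy, not the benchmark code: it runs on a 5x5 matrix, stores the recurrence coefficients (alphas, betas) as the metadata instead of eigenvalues, and keeps the checkpoint in a plain dict.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def lanczos(A, state, n_iters, chkpt_freq, store):
    """Run Lanczos from `state`; checkpoint every `chkpt_freq` iterations."""
    v_prev, v, alphas, betas, j = state
    v_prev, v = list(v_prev), list(v)
    alphas, betas = list(alphas), list(betas)
    while j < n_iters:
        w = matvec(A, v)
        alpha = sum(wi * vi for wi, vi in zip(w, v))
        beta_prev = betas[-1] if betas else 0.0
        w = [wi - alpha * vi - beta_prev * pi
             for wi, vi, pi in zip(w, v, v_prev)]
        beta = math.sqrt(sum(x * x for x in w))  # assumes no breakdown (beta > 0)
        alphas.append(alpha)
        betas.append(beta)
        v_prev, v = v, [x / beta for x in w]
        j += 1
        if j % chkpt_freq == 0:
            # Checkpoint: v_j, v_{j+1}, and metadata.
            store["chkpt"] = (list(v_prev), list(v), list(alphas), list(betas), j)
    return alphas, betas


A = [[1.0, 1.0, 0.0, 0.0, 0.0],
     [1.0, 2.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 3.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 0.0, 1.0, 5.0]]
start = ([0.0] * 5, [1.0, 0.0, 0.0, 0.0, 0.0], [], [], 0)

reference = lanczos(A, start, 4, 2, {})          # failure-free run
store = {}
lanczos(A, start, 3, 2, store)                   # run "fails" after iteration 3
recovered = lanczos(A, store["chkpt"], 4, 2, store)  # restart from iteration 2
```

The restarted run redoes iteration 3 from the checkpoint taken at iteration 2 and arrives at the same coefficients as the failure-free run.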

13 Benchmarks (II):
- Failure-detector process: weak scaling of ping scan, failure detection, and acknowledgement time.
- Average ping time per process: ~5-6 µs

14 Benchmarks (III):
- Num. of nodes = 256, threads-per-process = 12
- # iters. = 3500, checkpoint freq. = 500
- Failure detection + acknowledgement + re-init = 11 sec.
[Figure: runtime split into computation and failure detection + re-init + redo-work; label: 64s]

15 Remarks:
- Worker processes remain undisturbed in a failure-free application run.
- Overhead arises only in case of worker failure(s).
- Redo-work after failure recovery
- Checkpoint frequency
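The link between redo-work and checkpoint frequency can be made concrete with a short Python calculation. The checkpoint frequency of 500 iterations is the benchmark setting; the failure iteration 1740 is a hypothetical value chosen for illustration, and `redo_iterations` is not a name from the slides.

```python
def redo_iterations(fail_iter: int, chkpt_freq: int) -> int:
    """Iterations lost since the last checkpoint, which must be recomputed."""
    last_chkpt = (fail_iter // chkpt_freq) * chkpt_freq
    return fail_iter - last_chkpt


chkpt_freq = 500
redo = redo_iterations(fail_iter=1740, chkpt_freq=chkpt_freq)  # hypothetical failure point
worst_case = chkpt_freq - 1  # failure just before the next checkpoint
```

A shorter checkpoint interval lowers the redo-work after a failure but increases the number of checkpoints taken during a failure-free run.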

16 Outlook:
- Related work:
  - FT communication:
    › MPICH-V
    › User-Level Failure Mitigation - MPI (ULFM)
    › Fault Tolerant Messaging Interface (FMI)
  - Node-level checkpoint/restart:
    › Fault Tolerance Interface (FTI)
    › Scalable Checkpoint/Restart (SCR)
- Future work:
  - Having multiple failure detector processes
  - Adding redundancy for failure detector processes
  - Comparative study: ULFM, SCR

17 Thank you! Questions?
http://blogs.fau.de/essex/
https://bitbucket.org/essex/ghost
