ERLANGEN REGIONAL COMPUTING CENTER
1st International Workshop on Fault Tolerant Systems, IEEE Cluster '15
Building a fault tolerant application using the GASPI communication layer
2 Motivation
Nowadays, increasing computational capacity comes mainly from extreme levels of hardware parallelism.
On future machines, the mean time to failure is expected to be in the range of minutes to hours.
Without a fault tolerant environment, precious data is put at risk.
The lack of a well-defined fault tolerant environment is the first big challenge in developing a fault tolerant application.
3 Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection:
   i. Who detects the failure?
   ii. Failure information propagation
   iii. Consensus about failed processes
2. Process and communicator recovery:
   i. Shrink
   ii. Spawn
   iii. Spare
3. Lost data recovery
   ...
4 Failure detection approaches
1. Ping-based, all-to-all: within each iteration, the health of all processes is checked.
2. Ping-based, neighbor-level: after a neighbor failure is detected, check all-to-all health.
3. Unsuccessful communication: after a failure is detected, check all-to-all health.
4. Dedicated failure detection process(es):
   › Pings all other processes
   › Keeps a global view of process health
   › Propagates the failure information to the remaining processes
5 Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection: fault-detector process
2. Process and communicator recovery: spare nodes
3. Lost data recovery: "neighbor" node-level checkpoint/restart
[Figure: worker communicator and spare nodes, plus a dedicated fault-detector process (rank 0)]
6 Fault tolerance in GASPI: Introduction (I)
GASPI is developed by Fraunhofer ITWM, Kaiserslautern, Germany, and is based on the PGAS programming model.
Two memory parts:
Local: visible only to the GASPI process (and its threads).
Global: available to other processes for reading and writing.
GASPI enables fault tolerance:
In case of a single node failure, the rest of the nodes stay up and running.
A TIMEOUT can be given for every communication call.
Return values: GASPI_SUCCESS, GASPI_TIMEOUT, GASPI_ERROR
7 Fault tolerance in GASPI: Introduction (II)
What GASPI provides:
gaspi_proc_ping(): a process can check the state of any specific process by pinging it. The return value of the ping can be either 0 or 1 (healthy or dead).
User side: deletion of the old communicator, creation of a new communicator, a new communication structure, and (checkpoint/restart) remain the user's responsibility.
8 Failure detector (I)
[Figure: the fault-detector process issues gaspi_proc_ping() followed by return_val = gaspi_wait() against each process in the worker communicator and against the idle (spare) processes.]
return_val can be:
1) GASPI_SUCCESS
2) GASPI_TIMEOUT
3) GASPI_ERROR
9 Failure detector (II)
On GASPI_ERROR, the detector process informs every remaining process about the failure details via gaspi_write().
[Figure: the detector writes a table pairing failed process IDs (e.g. 6, 7) with rescue process IDs (e.g. 1, 2) to the worker communicator and the idle processes.]
10 Automatic Fault Tolerance Application: Program flow
[Figure: program flow diagram]
11 Asynchronous in-memory checkpointing
12 Benchmarks (I): Test bed
Lanczos algorithm. Checkpoint data structure:
After startup, every process stores the matrix communication data structure once.
The two most recent Lanczos vectors (v_j, v_j+1) plus metadata are stored at each checkpoint iteration.
Recently calculated eigenvalues.
Test cluster: LiMa (RRZE, Erlangen): 500 nodes, Xeon 5650 "Westmere" chips (12 cores + SMT), 2.66 GHz, 24 GB RAM, QDR InfiniBand.
13 Benchmarks (II): Failure-detector process
Weak scaling of ping scan, failure detection, and acknowledgement time.
Average ping time per process: ~5-6 µs.
[Figure: weak-scaling plot]
14 Benchmarks (III)
Number of nodes = 256, threads per process = 12, # iterations = 3500, checkpoint frequency = 500.
Failure detection + acknowledgement + re-initialization = 11 s.
Failure detection + re-initialization + redo-work = 64 s.
[Figure: runtime breakdown of computation vs. recovery]
15 Remarks
Worker processes remain undisturbed in a failure-free application run.
Overhead occurs only in case of worker failure(s).
The redo-work after failure recovery depends on the checkpoint frequency.
16 Outlook
Related work:
FT communication:
› MPICH-V
› User-Level Failure Mitigation for MPI (ULFM)
› Fault Tolerant Messaging Interface (FMI)
Node-level checkpoint/restart:
› Fault Tolerance Interface (FTI)
› Scalable Checkpoint/Restart (SCR)
Future work:
› Multiple failure detector processes
› Redundancy for the failure detector processes
› Comparative study: ULFM, SCR
17 Thank you! Questions?