Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by Stephen L. Scott and Christian Engelmann
Computer Science and Mathematics Division
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Slide 2: Research and development goals
- Develop techniques to enable HPC systems to run computational jobs 24x7
- Develop proof-of-concept prototypes and production-type RAS solutions
- Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
- Eliminate many of the numerous single points of failure and control in today's HPC systems
Slide 3: MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultra-scale high-end computers
- Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
- MOLAR is a collaborative research effort
Slide 4: Active/Standby Head Nodes with Shared Storage
- Single active head node
- Backup to shared storage
- Simple checkpoint/restart
- Fail-over to standby node
- Possible corruption of backup state when failing during backup
- Introduces a new single point of failure
- No guarantee of correctness and availability
- Examples: Simple Linux Utility for Resource Management (SLURM), metadata servers of Parallel Virtual File System and Lustre
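The corruption risk noted above comes from overwriting the only backup in place. As a rough illustration, here is a minimal Python sketch, with an assumed shared-storage path and state format (not from the original material), of checkpointing head-node state to shared storage; writing to a temporary file and renaming it atomically is one common way to narrow that window, while fail-over still rolls back to the last completed checkpoint.

    # Sketch of head-node state backup to shared storage (hypothetical layout).
    # Writing directly over the previous checkpoint means a crash mid-write
    # leaves a corrupt backup; writing a temp file and renaming it keeps the
    # previous checkpoint intact until the new one is complete.

    import json, os, tempfile

    SHARED_CHECKPOINT = "/shared/head-node/state.json"  # assumed shared-storage path

    def checkpoint(state: dict, path: str = SHARED_CHECKPOINT) -> None:
        directory = os.path.dirname(path)
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems

    def restore(path: str = SHARED_CHECKPOINT) -> dict:
        # On fail-over, the standby restarts from the last completed checkpoint;
        # anything accepted after that checkpoint is lost.
        with open(path) as f:
            return json.load(f)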
Slide 5: Active/Standby Head Nodes
- Single active head node
- Backup to standby node
- Simple checkpoint/restart
- Fail-over to standby node
- Idle standby head node
- Rollback to backup
- Service interruption for fail-over and restore-over
- Examples: HA-OSCAR, Torque on Cray XT
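A minimal sketch of the fail-over side of this design, assuming a hypothetical heartbeat interface and injected restore/start callables: the standby idles until the active head node is declared failed, then restores the last backup and takes over, which is exactly where the rollback and the service interruption come from.

    # Illustrative active/standby fail-over loop (all names hypothetical).
    # The standby idles, watches the active head node, and on missed
    # heartbeats restores the last replicated backup and takes over.
    # Requests accepted after that backup are lost (rollback), and the
    # service is unavailable while the standby restores and starts up.

    import time

    HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats
    MISSED_LIMIT = 3           # heartbeats missed before declaring failure

    def standby_loop(heartbeat_ok, restore_backup, start_services):
        """heartbeat_ok, restore_backup, start_services are injected callables."""
        missed = 0
        while True:
            time.sleep(HEARTBEAT_INTERVAL)
            if heartbeat_ok():
                missed = 0
                continue
            missed += 1
            if missed >= MISSED_LIMIT:
                state = restore_backup()   # roll back to the last backup
                start_services(state)      # service interruption ends here
                return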
Slide 6: Asymmetric Active/Active Head Nodes
- Many active head nodes
- Work-load distribution
- Optional fail-over to standby head node(s) (n+1 or n+m)
- No coordination between active head nodes
- Service interruption for fail-over and restore-over
- Loss of state without standby
- Limited use cases, such as high-throughput computing
- Prototype based on HA-OSCAR
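A small sketch of the asymmetric case, with hypothetical data structures: jobs are spread across independent head nodes that do not coordinate, so a failed head node without a standby simply loses the work queued on it.

    # Sketch of asymmetric active/active operation (hypothetical structures):
    # jobs are distributed across independent head nodes that do not
    # coordinate, so when a head node fails without a standby, the jobs
    # queued on it are lost.

    head_nodes = {"head1": [], "head2": [], "head3": []}   # per-node job queues

    def submit(job: str) -> str:
        # Pick the currently least-loaded head node (one simple distribution policy).
        node = min(head_nodes, key=lambda n: len(head_nodes[n]))
        head_nodes[node].append(job)
        return node

    def fail(node: str) -> list[str]:
        # Without a standby, this node's queued jobs (its state) are lost.
        return head_nodes.pop(node)

    for i in range(6):
        submit(f"job-{i}")
    print("lost jobs:", fail("head2"))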
Slide 7: Symmetric Active/Active Head Nodes
- Many active head nodes
- Work-load distribution
- Symmetric replication between head nodes
- Continuous service
- Always up to date
- No fail-over necessary
- No restore-over necessary
- Virtual synchrony model
- Complex algorithms
- JOSHUA prototype for Torque
Slide 8: Symmetric active/active replication
[Figure: replication pipeline with three stages: input replication, virtually synchronous processing, and output unification]
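A heavily simplified Python sketch of the three stages in the figure (the class names and the single sequence counter are illustrative assumptions; the real system relies on a group-communication layer providing virtual synchrony): every replica receives each request in the same total order, applies it deterministically, and the identical replies are unified into one answer.

    # Simplified model of symmetric active/active replication:
    #   1. Input replication: every request is delivered to all replicas in
    #      the same total order (here imposed by one sequence counter).
    #   2. Virtually synchronous processing: each replica applies the request
    #      deterministically to its copy of the state.
    #   3. Output unification: identical replies from the replicas are
    #      collapsed into a single answer to the client.

    class Replica:
        def __init__(self, name: str):
            self.name = name
            self.jobs: list[str] = []            # replicated service state

        def process(self, seq: int, request: str) -> str:
            self.jobs.append(request)            # deterministic state update
            return f"ack {seq}: queued {request} ({len(self.jobs)} jobs)"

    replicas = [Replica("head1"), Replica("head2"), Replica("head3")]
    sequence = 0

    def submit(request: str) -> str:
        global sequence
        sequence += 1                            # input replication in total order
        replies = {r.process(sequence, request) for r in replicas}
        assert len(replies) == 1                 # output unification: replies agree
        return replies.pop()

    print(submit("job-A"))
    print(submit("job-B"))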
Slide 9: Symmetric active/active high availability for head and service nodes
- A_component = MTTF / (MTTF + MTTR)
- A_system = 1 - (1 - A_component)^n
- T_down = 8760 hours * (1 - A_system)
- Single-node MTTF: 5000 hours
- Single-node MTTR: 72 hours

Nodes | Availability  | Est. annual downtime
1     | 98.58%        | 5d 4h 21m
2     | 99.98%        | 1h 45m
3     | 99.9997%      | 1m 30s
4     | 99.999996%    | 1s

- Single-site redundancy for 7 nines does not mask catastrophic events
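The table above follows directly from the formulas and the single-node MTTF/MTTR given on this slide; a short Python sketch that reproduces the numbers (the helper names are illustrative):

    # Availability of n redundant head nodes, using the formulas from the slide:
    #   A_component = MTTF / (MTTF + MTTR)
    #   A_system    = 1 - (1 - A_component)^n
    #   T_down      = 8760 hours * (1 - A_system)

    MTTF = 5000.0  # single-node mean time to failure, hours
    MTTR = 72.0    # single-node mean time to repair, hours

    def system_availability(n: int) -> float:
        a_component = MTTF / (MTTF + MTTR)
        return 1.0 - (1.0 - a_component) ** n

    def annual_downtime_hours(n: int) -> float:
        return 8760.0 * (1.0 - system_availability(n))

    for n in range(1, 5):
        a = system_availability(n)
        down_seconds = annual_downtime_hours(n) * 3600.0
        print(f"{n} node(s): availability {a:.6%}, "
              f"downtime {down_seconds:,.1f} s/year")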
Slide 10: High-availability framework for HPC
- Pluggable component framework: communication drivers, group communication, virtual synchrony, applications
- Interchangeable components
- Adaptation to application needs, such as level of consistency
- Adaptation to system properties, such as network and system scale

Framework layers (top to bottom):
- Applications: Scheduler, MPI Runtime, File System, SSI
- Virtual Synchrony: Replicated Memory, Replicated File, Replicated State-Machine, Replicated Database, Replicated RPC/RMI, Distributed Control
- Group Communication: Membership Management, Failure Detection, Reliable Multicast, Atomic Multicast
- Communication Driver: Singlecast, Failure Detection, Multicast
- Network: Ethernet, Myrinet, Elan+, InfiniBand, ...
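A minimal Python sketch of how such a pluggable framework can be organized; the interfaces and class names below are illustrative assumptions, not the project's actual API. The point is that a communication driver or group-communication policy can be swapped without touching the layers above it.

    # Illustrative sketch of a pluggable HA framework: interchangeable
    # communication drivers and group-communication modules behind fixed
    # interfaces, so components can be swapped to match application needs
    # (e.g., consistency level) and system properties (e.g., scale).
    # All class and method names here are hypothetical.

    from abc import ABC, abstractmethod

    class CommunicationDriver(ABC):
        @abstractmethod
        def send(self, node: str, message: bytes) -> None: ...
        @abstractmethod
        def multicast(self, nodes: list[str], message: bytes) -> None: ...

    class GroupCommunication(ABC):
        def __init__(self, driver: CommunicationDriver):
            self.driver = driver
        @abstractmethod
        def atomic_multicast(self, group: list[str], message: bytes) -> None: ...

    class EthernetDriver(CommunicationDriver):
        def send(self, node, message):
            print(f"unicast to {node}: {message!r}")
        def multicast(self, nodes, message):
            for node in nodes:
                self.send(node, message)

    class SequencerGroupComm(GroupCommunication):
        """Totally ordered delivery via a simple sequencer (one possible policy)."""
        def __init__(self, driver):
            super().__init__(driver)
            self.sequence = 0
        def atomic_multicast(self, group, message):
            self.sequence += 1
            self.driver.multicast(group, f"[{self.sequence}] ".encode() + message)

    # A different driver (e.g., for InfiniBand) or ordering policy could be
    # plugged in here without changing the layers above.
    gcomm = SequencerGroupComm(EthernetDriver())
    gcomm.atomic_multicast(["head1", "head2"], b"job-submit")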
Slide 11: Scalable, fault-tolerant membership for MPI tasks on HPC systems
- Scalable approach to reconfiguring the communication infrastructure
- Decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults
- Resilience against multiple node failures, even during reconfiguration
- Response time:
  - Hundreds of microseconds over MPI on a 1024-node Blue Gene/L
  - Single-digit milliseconds over TCP on a 64-node Gigabit Ethernet Linux cluster (XTORC)
- Integration with the Berkeley Lab Checkpoint/Restart (BLCR) mechanism to handle node failures without restarting an entire MPI job
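As a rough illustration only (the actual protocol is decentralized, peer-to-peer, and tolerates failures during reconfiguration, none of which this sketch models), the basic idea is a numbered membership view that surviving nodes update consistently when a failure is detected; all names below are assumptions.

    # Simplified illustration of fault-tolerant membership maintenance:
    # every node keeps a numbered view of the active nodes, and when a
    # failure is detected the survivors install the same next view that
    # excludes the failed node. The real protocol on this slide reaches
    # that agreement peer-to-peer and handles faults during the update.

    from dataclasses import dataclass

    @dataclass
    class MembershipView:
        view_id: int
        members: frozenset[str]

    @dataclass
    class Node:
        name: str
        view: MembershipView

        def on_failure_detected(self, failed: str) -> MembershipView:
            # Install a new view without the failed node; in the real system
            # this step is agreed upon collectively by the surviving peers.
            self.view = MembershipView(
                view_id=self.view.view_id + 1,
                members=self.view.members - {failed},
            )
            return self.view

    initial = MembershipView(view_id=1, members=frozenset({"n0", "n1", "n2", "n3"}))
    nodes = [Node(name, initial) for name in sorted(initial.members)]

    # All survivors detect the failure of "n2" and move to the same next view.
    for node in nodes:
        if node.name != "n2":
            node.on_failure_detected("n2")
    print(nodes[0].view)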
Slide 12: Stabilization time over MPI on BG/L
[Figure: time for stabilization in microseconds vs. number of nodes (log scale); experimental results compared against the distance model and the base model]
Slide 13: Stabilization time over TCP on XTORC
[Figure: time for stabilization in microseconds vs. number of nodes; experimental results compared against the distance model and the base model]
Slide 14: ORNL contacts
Stephen L. Scott
Network and Cluster Computing
Computer Science and Mathematics
(865)

Christian Engelmann
Network and Cluster Computing
Computer Science and Mathematics
(865)