FastOS, Santa Clara CA, June
Scalable Fault Tolerance: Xen Virtualization for PGAS Models on High-Performance Networks
Daniele Scarpazza, Oreste Villa, Fabrizio Petrini, Jarek Nieplocha, Vinod Tipparaju, Manoj Krishnan (Pacific Northwest National Laboratory)
Radu Teodorescu, Jun Nakano, Josep Torrellas (University of Illinois)
Duncan Roweth (Quadrics)
In collaboration with Patrick Mullaney (Novell) and Wayne Augsburger (Mellanox)

Project Motivation
Component count in high-end systems has been growing
How do we utilize large ( processor) systems for solving complex science problems?
Fundamental problems
– Scalability to massive processor counts
– Application performance on a single processor given the increasingly complex memory hierarchy
– Hardware and software failures
[Figure: MTBF as a function of system size]
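
The relation between MTBF and system size can be sketched with a simple model. This is an assumption, not the data behind the slide's figure: components fail independently with constant failure rates, so the rates add and the system MTBF is the component MTBF divided by the component count. The function name `system_mtbf_hours` and the 1,000,000-hour component MTBF are hypothetical, chosen only for illustration.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """MTBF of a system that fails whenever any one component fails.

    Assumes independent components with constant (exponential) failure
    rates, so failure rates add: MTBF_system = MTBF_component / N.
    """
    if component_count < 1:
        raise ValueError("component_count must be at least 1")
    return component_mtbf_hours / component_count


# Hypothetical component MTBF, for illustration only.
COMPONENT_MTBF_HOURS = 1_000_000.0

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} components: {system_mtbf_hours(COMPONENT_MTBF_HOURS, n):.1f} hours")
```

Under this model the system MTBF shrinks in inverse proportion to the component count, which is why failures become a first-order concern as systems grow.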

Multiple FT Techniques
Application drivers
– The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
– Applications place increasingly differing demands on system resources: disk, network, memory, and CPU
– Some of them have natural fault resiliency and require very little support
System drivers
– Different I/O configurations, programmable or simple/commodity NICs, proprietary/custom/commodity operating systems
– Tradeoffs between acceptable rates of failure and cost
– Cost effectiveness is the main constraint in HPC
Therefore, it is not cost-effective or practical to rely on a single fault tolerance approach for all applications and systems

Key Elements of SFT
IBA/QsNET: Virtualization of High Performance Network Interfaces and Protocols
ReVive (I/O): ReVive and ReVive I/O provide efficient CR capability for shared memory servers (cluster node)
BCS: Buffered Coscheduling provides global coordination of system activities, communication, CR
FT ARMCI: Fault-Tolerance module for ARMCI runtime system
XEN: Hypervisor to enable virtualization of compute node environment including OS (external dependency)
Focus of the talk

Transparent System-level CR of PGAS Applications on InfiniBand and QsNET
We explored a new approach to cluster fault tolerance by integrating Xen with the latest generations of InfiniBand and Quadrics high-performance networks
Focus on Partitioned Global Address Space (PGAS) programming models
– Most existing work has focused on MPI
Design goals
– Low-overhead and transparent migration

Main Contributions
Integration of Xen and InfiniBand
– Enhanced Xen's kernel modules to fully support user-level InfiniBand protocols and IP over IB with minimal overhead
Support for Partitioned Global Address Space (PGAS) programming models
– Emphasis on ARMCI
Automatic detection of a Global Recovery Line and coordinated migration
– Perform a live migration without any change to user applications
Experimental evaluation

Xen Hypervisor
On each machine, Xen allows the creation of a privileged virtual machine (Dom0) and one or more non-privileged VMs (DomUs)
Xen provides the ability to pause, un-pause, checkpoint and resume DomUs
Xen employs para-virtualization
– Non-privileged domains run a modified operating system featuring guest device drivers
– Their requests are forwarded to the native device driver in Dom0 using a split driver model
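
The DomU operations listed above map onto the `xm` management tool that shipped with Xen 3.x. The sketch below only builds the command lines and does not run them. The helper name `xm_command`, the domain name, the checkpoint path, and the target host are hypothetical, and whether the authors drove Xen through `xm` is an assumption.

```python
from typing import List


def xm_command(action: str, domain: str = "", target: str = "") -> List[str]:
    """Build (but do not execute) an xm command line for a DomU operation.

    `target` is the checkpoint file for save/restore and the destination
    host for migrate. This helper is illustrative, not part of Xen.
    """
    if action == "pause":
        return ["xm", "pause", domain]
    if action == "unpause":
        return ["xm", "unpause", domain]
    if action == "save":        # checkpoint the domain image to a file
        return ["xm", "save", domain, target]
    if action == "restore":     # resume a domain from a checkpoint file
        return ["xm", "restore", target]
    if action == "migrate":     # live migration to another host
        return ["xm", "migrate", "--live", domain, target]
    raise ValueError(f"unknown action: {action}")


# Hypothetical domain, file, and host names.
print(" ".join(xm_command("save", "domU-1", "/tmp/domU-1.chk")))
print(" ".join(xm_command("migrate", "domU-1", "node2")))
```

On a Xen 3.x host, the returned list could be passed to `subprocess.run` from Dom0.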

InfiniBand Device Driver
The driver is implemented in two sections
– A paravirtualized section for slow-path control operations (e.g., queue pair creation)
– A direct-access section for fast-path data operations (transmit/receive)
Based on the Ohio State/IBM implementation
The driver was extended to support additional CPU architectures and InfiniBand adapters
Added a proxy layer to allow subnet and connection management from guest VMs
Suspend/resume is propagated to the applications, not only to kernel modules
Several stability improvements
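
The two-section design can be illustrated with a toy model. This is a sketch of the idea only, not the actual driver: the classes `Dom0Backend` and `GuestDriver` and their methods are hypothetical. Control operations are forwarded to Dom0, while data operations touch the queue directly from the guest.

```python
class Dom0Backend:
    """Stands in for the native driver in Dom0 (toy model, hypothetical)."""

    def __init__(self):
        self.forwarded_requests = 0   # counts slow-path requests seen by Dom0
        self._next_qp = 0

    def create_queue_pair(self) -> int:
        self.forwarded_requests += 1
        self._next_qp += 1
        return self._next_qp


class GuestDriver:
    """Stands in for the guest (DomU) side of the split driver."""

    def __init__(self, backend: Dom0Backend):
        self.backend = backend
        self.send_queues = {}         # queue pair id -> list of posted payloads

    def create_queue_pair(self) -> int:
        # Slow path: the control operation is forwarded to Dom0.
        qp = self.backend.create_queue_pair()
        self.send_queues[qp] = []
        return qp

    def post_send(self, qp: int, payload: bytes) -> None:
        # Fast path: the guest posts directly to the queue, bypassing Dom0.
        self.send_queues[qp].append(payload)


backend = Dom0Backend()
guest = GuestDriver(backend)
qp = guest.create_queue_pair()
for _ in range(1000):
    guest.post_send(qp, b"data")
print(backend.forwarded_requests, len(guest.send_queues[qp]))
```

In this model a thousand sends cost a single Dom0 round trip (the queue pair creation), which is the intuition behind keeping only control operations on the paravirtualized path.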

Software Stack

Xen/InfiniBand Device Driver: Communication Performance

Xen/InfiniBand Device Driver: Communication Performance (continued)

Parallel Programming Models
Single Threaded
– Data Parallel, e.g. HPF
Multiple Processes
– Partitioned-Local Data Access: MPI
– Uniform-Global-Shared Data Access: OpenMP
– Partitioned-Global-Shared Data Access: Co-Array Fortran
– Uniform-Global-Shared + Partitioned Data Access: UPC, Global Arrays, X10

Fault Tolerance in PGAS Models
Implementation considerations
– 1-sided communication, perhaps some 2-sided and collectives
– Special considerations in the implementation of the global recovery line
– Memory operations need to be synchronized for checkpoint/restart
Memory is a combination of local and global (globally visible)
– Global memory could be shared from the OS view
– Pinned and registered with the network adapter
[Figure: SMP node 0 through SMP node n attached to the network; each process (P) has local (L) and global (G) memory]

Xen-enabled ARMCI
Runtime system for one-sided communication
– Global Arrays, Rice Co-Array Fortran, GPSHMEM, IBM X10 port under way
Portable high-performance remote memory copy interface
– Asynchronous remote memory access (RMA)
– Fast collective operations
– Zero-copy protocols, explicit NIC support
– “Pure” non-blocking communication % overlap
Data locality
– Shared memory within an SMP node and RMA across nodes
High performance delivered on a wide range of platforms
Multi-protocol and multi-method implementation
[Figure: Fundamental Communication Models in HPC: message passing (2-sided model, send/receive between P0 and P1), remote memory access (RMA) (1-sided model, put), shared memory load/stores (0-sided model, A=B)]
[Figure: Examples of data transfers optimized in ARMCI]
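
The three communication models named in the figure can be contrasted with a toy in-process model. This is not the ARMCI API: `Process`, `send`, `receive`, and `put` are hypothetical names used only to show which side has to act in each model.

```python
from collections import deque


class Process:
    """Toy process with private memory and a message mailbox (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.memory = {}
        self.mailbox = deque()


def send(dst: Process, value) -> None:
    # 2-sided, sender half: the data waits until the receiver acts.
    dst.mailbox.append(value)


def receive(proc: Process, key: str) -> None:
    # 2-sided, receiver half: the receiver must post a matching receive.
    proc.memory[key] = proc.mailbox.popleft()


def put(dst: Process, key: str, value) -> None:
    # 1-sided: the initiator writes remote memory; the target does nothing.
    dst.memory[key] = value


p0, p1 = Process("P0"), Process("P1")
p0.memory["B"] = 7

send(p1, p0.memory["B"])      # 2-sided: both P0 and P1 participate
receive(p1, "A")

put(p1, "A", p0.memory["B"])  # 1-sided: only P0 participates

shared = {"B": 7}             # 0-sided: memory both processes can address
shared["A"] = shared["B"]     # a plain load/store, no communication call

print(p1.memory["A"], shared["A"])
```

In the 2-sided case the value is not visible in the target's memory until the target calls `receive`; in the 1-sided case it is visible as soon as `put` returns.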

Global Recovery Lines (GRLs)
A GRL is required before each checkpoint / migration
A GRL is required for InfiniBand networks because
– IBA does not allow location-independent layer 2 and 3 addresses
– IBA hardware maintains stateful connections not accessible by software
The protocol that enforces a GRL has
– A drain phase, which completes any ongoing communication
– Followed by a global silence, where it is possible to perform node migration
– And a resume phase, where the processing nodes acquire knowledge of the new network topology
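
The three phases above can be sketched as a small single-process simulation. It is a minimal sketch under stated assumptions, not the authors' protocol implementation: the `Node` class, the `global_recovery_line` function, and the address strings are hypothetical, and real coordination across nodes is replaced by simple loops.

```python
class Node:
    """Toy compute node taking part in a GRL (hypothetical model)."""

    def __init__(self, name: str, address: str):
        self.name = name
        self.address = address
        self.in_flight = []    # communication operations not yet completed
        self.peers = {}        # peer name -> network address
        self.quiesced = False

    def drain(self) -> None:
        # Drain phase: complete any ongoing communication, then go quiet.
        self.in_flight.clear()
        self.quiesced = True

    def resume(self, topology: dict) -> None:
        # Resume phase: learn the (possibly changed) addresses of all peers.
        self.peers = {n: a for n, a in topology.items() if n != self.name}
        self.quiesced = False


def global_recovery_line(nodes, migrate) -> dict:
    for node in nodes:                       # 1. drain
        node.drain()
    if not all(n.quiesced and not n.in_flight for n in nodes):
        raise RuntimeError("network is not silent")
    migrate(nodes)                           # 2. global silence: safe to migrate
    topology = {n.name: n.address for n in nodes}
    for node in nodes:                       # 3. resume with the new topology
        node.resume(topology)
    return topology


nodes = [Node("n0", "addr-1"), Node("n1", "addr-2")]
nodes[0].in_flight.append("put to n1")


def move_n1(all_nodes):
    all_nodes[1].address = "addr-7"          # n1 lands on a different host


print(global_recovery_line(nodes, move_n1))
```

The migration callback runs only while every node is quiesced, which mirrors the requirement that no communication is in flight when addresses change.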

GRL and Resource Management

Experimental Evaluation
Our experimental testbed is a cluster of 8 Dell PowerEdge 1950 servers
Each node has two dual-core Intel Xeon Woodcrest processors at 3.0 GHz, with 8 GB of memory
The cluster is interconnected by Mellanox InfiniHost III 4X HCA adapters
SUSE Linux Enterprise Server 10.0
Xen 3.0.2

Timing of a GRL

Scalability of Network Drain

Scalability of Network Resume

Save and Restore Latencies

Migration Latencies

Conclusion
We have presented a novel software infrastructure that allows completely transparent checkpoint/restart
We have implemented a device driver that enhances the existing Xen/InfiniBand drivers
Support for PGAS programming models
Minimal overhead, tens of milliseconds
Most of the time is spent saving/restoring the node image