Scalable Fault Tolerance: Xen Virtualization for PGAS Models on High-Performance Networks
FastOS, Santa Clara, CA, June 18, 2007

Scalable Fault Tolerance: Xen Virtualization for PGAS Models on High-Performance Networks
Daniele Scarpazza, Oreste Villa, Fabrizio Petrini, Jarek Nieplocha, Vinod Tipparaju, Manoj Krishnan – Pacific Northwest National Laboratory
Radu Teodorescu, Jun Nakano, Josep Torrellas – University of Illinois
Duncan Roweth – Quadrics
In collaboration with Patrick Mullaney (Novell) and Wayne Augsburger (Mellanox)

Project Motivation
– Component count in high-end systems has been growing
– How do we utilize very large systems (massive processor counts) for solving complex science problems?
– Fundamental problems:
  – Scalability to massive processor counts
  – Application performance on a single processor, given the increasingly complex memory hierarchy
  – Hardware and software failures
(Figure: MTBF as a function of system size)

Multiple FT Techniques
– Application drivers:
  – The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
  – Applications place increasingly differing demands on system resources: disk, network, memory, and CPU
  – Some applications have natural fault resiliency and require very little support
– System drivers:
  – Different I/O configurations; programmable or simple/commodity NICs; proprietary/custom/commodity operating systems
  – Tradeoffs between acceptable failure rates and cost; cost effectiveness is the main constraint in HPC
– Therefore, it is neither cost-effective nor practical to rely on a single fault-tolerance approach for all applications and systems

Key Elements of SFT
– IBA/QsNET: virtualization of high-performance network interfaces and protocols
– ReVive / ReVive I/O: efficient checkpoint/restart (CR) capability for shared-memory servers (cluster nodes)
– BCS: Buffered Coscheduling provides global coordination of system activities, communication, and CR
– FT ARMCI: fault-tolerance module for the ARMCI runtime system
– XEN: hypervisor to enable virtualization of the compute-node environment, including the OS (external dependency)
– Focus of this talk: the virtualization of high-performance network interfaces and protocols under Xen

Transparent System-Level CR of PGAS Applications on InfiniBand and QsNET
– We explored a new approach to cluster fault tolerance by integrating Xen with the latest generations of InfiniBand and Quadrics high-performance networks
– Focus on Partitioned Global Address Space (PGAS) programming models; most existing work has focused on MPI
– Design goals: low overhead and transparent migration

Main Contributions
– Integration of Xen and InfiniBand: enhanced Xen's kernel modules to fully support user-level InfiniBand protocols and IP over IB with minimal overhead
– Support for Partitioned Global Address Space (PGAS) programming models, with emphasis on ARMCI
– Automatic detection of a Global Recovery Line and coordinated migration: perform a live migration without any change to user applications
– Experimental evaluation

Xen Hypervisor
– On each machine, Xen allows the creation of a privileged virtual machine (Dom0) and one or more non-privileged VMs (DomUs)
– Xen provides the ability to pause, un-pause, checkpoint, and resume DomUs
– Xen employs para-virtualization: non-privileged domains run a modified operating system featuring guest device drivers; their requests are forwarded to the native device driver in Dom0 using a split-driver model
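To make the pause/checkpoint/resume capability concrete, here is a minimal sketch, assuming a Xen 3.x host, that drives the xm management tool from C to pause, save, restore, and live-migrate a DomU. The domain name compute-domU, the checkpoint path, and the target host node7 are hypothetical placeholders, not names used by the project.

    /* Minimal sketch: controlling a DomU through Xen 3.x's xm tool.
     * The domain name, checkpoint path, and target host are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>

    static int run(const char *cmd)
    {
        printf("+ %s\n", cmd);
        int rc = system(cmd);                         /* invoke the xm management tool in Dom0 */
        if (rc != 0)
            fprintf(stderr, "command failed (rc=%d): %s\n", rc, cmd);
        return rc;
    }

    int main(void)
    {
        run("xm pause compute-domU");                 /* freeze the guest's virtual CPUs */
        run("xm unpause compute-domU");               /* let it run again */
        run("xm save compute-domU /ckpt/domU.img");   /* checkpoint guest state to disk (guest stops) */
        run("xm restore /ckpt/domU.img");             /* recreate the guest from the checkpoint */
        run("xm migrate --live compute-domU node7");  /* live-migrate the running guest to another host */
        return 0;
    }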

InfiniBand Device Driver
– The driver is implemented in two sections:
  – a paravirtualized section for slow-path control operations (e.g., queue-pair creation), and
  – a direct-access section for fast-path data operations (transmit/receive)
– Based on the Ohio State/IBM implementation
– The driver was extended to support additional CPU architectures and InfiniBand adapters
– Added a proxy layer to allow subnet and connection management from guest VMs
– Suspend/resume is propagated to the applications, not only to kernel modules
– Several stability improvements
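As an illustration of the slow-path/fast-path split, the sketch below uses the standard libibverbs API: the resource-creation calls are the slow-path control operations that a paravirtualized front-end would forward to Dom0, while posting a work request is the fast-path operation that goes directly to the HCA. This is only a sketch of the generic verbs interface, not the driver code described in the talk; buffer sizes and queue depths are arbitrary.

    /* Sketch using the standard libibverbs API.  Calls tagged "slow path"
     * are the control operations a paravirtualized front-end forwards to
     * Dom0; the work-request posting at the end is the "fast path" that
     * bypasses Dom0.  Illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);           /* slow path */
        if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);             /* slow path */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                          /* slow path */
        struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);      /* slow path */

        static char buf[4096];
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),            /* slow path: pin and register memory */
                                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

        struct ibv_qp_init_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.send_cq = cq;
        attr.recv_cq = cq;
        attr.qp_type = IBV_QPT_RC;
        attr.cap.max_send_wr  = 16;
        attr.cap.max_recv_wr  = 16;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);                   /* slow path: queue-pair creation */

        /* Fast path: posting a send work request writes directly to the
         * HCA; no trip through Dom0.  (The QP would first have to be
         * connected to a remote peer for the send to complete.) */
        struct ibv_sge sge = { .addr = (uintptr_t)buf,
                               .length = sizeof(buf),
                               .lkey = mr->lkey };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.opcode  = IBV_WR_SEND;
        ibv_post_send(qp, &wr, &bad);                                   /* fast path */

        /* Teardown omitted for brevity. */
        ibv_free_device_list(devs);
        return 0;
    }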

Software Stack (figure)

Xen/InfiniBand Device Driver: Communication Performance (figure)

Xen/InfiniBand Device Driver: Communication Performance (figure)

Parallel Programming Models
– Single-threaded: data parallel, e.g. HPF
– Multiple processes:
  – Partitioned-local data access: MPI
  – Uniform-global-shared data access: OpenMP
  – Partitioned-global-shared data access: Co-Array Fortran
  – Uniform-global-shared + partitioned data access: UPC, Global Arrays, X10

Fault Tolerance in PGAS Models
– Implementation considerations:
  – 1-sided communication, perhaps some 2-sided and collectives
  – Special considerations in the implementation of a global recovery line
  – Memory operations need to be synchronized for checkpoint/restart
– Memory is a combination of local and global (globally visible) regions:
  – Global memory could be shared from the OS point of view
  – Pinned and registered with the network adapter
(Figure: SMP nodes 0..n connected by the network, each with processes (P), local (L), and global (G) memory regions)

Xen-Enabled ARMCI
– ARMCI: runtime system for one-sided communication, used by Global Arrays, Rice Co-Array Fortran, GPSHMEM; IBM X10 port under way
– Portable, high-performance remote memory copy interface:
  – Asynchronous remote memory access (RMA)
  – Fast collective operations
  – Zero-copy protocols, explicit NIC support
  – “Pure” non-blocking communication (…% overlap)
– Data locality: shared memory within an SMP node, RMA across nodes
– High performance delivered on a wide range of platforms; multi-protocol and multi-method implementation
(Figure: fundamental communication models in HPC, as data transfers optimized in ARMCI – 2-sided message passing (send/receive between P0 and P1), 1-sided remote memory access (put), and 0-sided shared-memory loads/stores (A = B))
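To show the one-sided model that ARMCI provides, here is a small sketch using standard ARMCI calls on top of MPI; the buffer size, the choice of peer, and the neighbor-exchange pattern are illustrative only and not taken from the talk.

    /* Illustrative one-sided transfer with the ARMCI API; buffer size and
     * peer choice are arbitrary, and error handling is trimmed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <armci.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int me, nproc;
        MPI_Init(&argc, &argv);            /* ARMCI typically runs on top of MPI */
        ARMCI_Init();
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Collectively allocate one globally visible segment per process;
         * the runtime pins and registers this memory with the NIC. */
        void *seg[nproc];
        ARMCI_Malloc(seg, N * sizeof(double));

        double *local = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) local[i] = me;

        int peer = (me + 1) % nproc;

        /* One-sided put: deposit our data into the peer's segment without
         * any matching receive on the peer side. */
        ARMCI_Put(local, seg[peer], N * sizeof(double), peer);
        ARMCI_Fence(peer);                 /* wait for remote completion of the put */
        MPI_Barrier(MPI_COMM_WORLD);

        /* One-sided get: read the peer's segment back (it now holds the
         * values we just wrote, so we expect to read our own rank). */
        ARMCI_Get(seg[peer], local, N * sizeof(double), peer);
        printf("rank %d read %g from rank %d\n", me, local[0], peer);

        MPI_Barrier(MPI_COMM_WORLD);
        ARMCI_Free(seg[me]);
        free(local);
        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }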

Global Recovery Lines (GRLs)
– A GRL is required before each checkpoint/migration
– A GRL is required on InfiniBand networks because:
  – IBA does not allow location-independent layer-2 and layer-3 addresses
  – IBA hardware maintains stateful connections not accessible by software
– The protocol that enforces a GRL has three phases (sketched below):
  – a drain phase, which completes any ongoing communication,
  – followed by a global silence, during which it is possible to perform node migration,
  – and a resume phase, in which the processing nodes acquire knowledge of the new network topology
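A minimal sketch of how the three phases could be sequenced by a coordinator is shown below. Every helper function in it (drain_network, checkpoint_or_migrate, and so on) is hypothetical and invented for illustration; the actual protocol described in the talk lives inside the ARMCI runtime and the Xen tooling.

    /* Hypothetical coordinator-side sequencing of a Global Recovery Line.
     * All helpers below are stubs invented for illustration. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool drain_network(void)         { puts("drain: flush in-flight messages");       return true; }
    static bool all_nodes_quiesced(void)    { return true; }   /* stub: silence reached immediately */
    static bool checkpoint_or_migrate(void) { puts("silence: checkpoint/migrate the DomUs"); return true; }
    static bool redistribute_topology(void) { puts("resume: publish the new topology");      return true; }
    static bool resume_communication(void)  { puts("resume: re-establish IBA connections");  return true; }

    static int enforce_grl(void)
    {
        /* Phase 1: drain - complete any ongoing communication. */
        if (!drain_network())
            return -1;
        while (!all_nodes_quiesced())
            ;                              /* wait for global silence */

        /* Phase 2: global silence - safe to checkpoint or migrate nodes,
         * because no stateful IBA connection has traffic in flight. */
        if (!checkpoint_or_migrate())
            return -1;

        /* Phase 3: resume - nodes learn the new network topology and
         * re-create their connections before the application continues. */
        if (!redistribute_topology() || !resume_communication())
            return -1;
        return 0;
    }

    int main(void)
    {
        return enforce_grl() == 0 ? 0 : 1;
    }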

GRL and Resource Management (figure)

Experimental Evaluation
– Testbed: a cluster of 8 Dell PowerEdge 1950 servers
– Each node has two dual-core Intel Xeon (Woodcrest) 3.0 GHz processors and 8 GB of memory
– The cluster is interconnected with Mellanox InfiniHost III 4X HCA adapters
– Software: SUSE Linux Enterprise Server 10, Xen 3.0.2

Timing of a GRL (figure)

Scalability of Network Drain (figure)

Scalability of Network Resume (figure)

Save and Restore Latencies (figure)

Migration Latencies (figure)

Conclusion
– We have presented a novel software infrastructure that allows completely transparent checkpoint/restart
– We have implemented a device driver that enhances the existing Xen/InfiniBand drivers, with support for PGAS programming models
– Minimal overhead, on the order of tens of milliseconds; most of the time is spent saving/restoring the node image