Proactive Fault Tolerance for HPC using Xen Virtualization
Arun Babu Nagarajan, Frank Mueller
North Carolina State University

2 Problem Statement
Trends in HPC: high-end systems with thousands of processors
— Increased probability of a node failure: MTBF becomes shorter
MPI widely accepted in scientific computing
— Problem with MPI: the standard provides no recovery from faults
Fault tolerance exists today, but…
— only reactive: process checkpoint/restart
— must restart the entire job – inefficient if only one (or a few) node(s) fail
— overhead due to redoing some of the work
— issue: at what frequency should checkpoints be taken?
— a 100-hour job will run for an additional 150 hours on a petaflop machine due to checkpointing alone (without any failure) [I. Philp, 2005]

3 Our Solution
Proactive FT
— anticipates node failure
— takes preventive action instead of reacting to a failure
– migrate the whole OS to a healthier physical node
– entirely transparent to the application (indeed, to the guest OS itself)
— hence avoids the high overhead of reactive schemes (the overhead associated with our scheme is very small)

4 Design Space
1. A mechanism to predict/anticipate the failure of a node
— OpenIPMI
— lm_sensors (more system-specific: x86 Linux)
2. A mechanism to identify the best target node
— Custom centralized approaches – do not scale and are unreliable
— Scalable distributed approach – Ganglia
3. Most importantly, a mechanism (for the preventive action) that supports relocation of the running application with
— its state preserved
— minimum overhead on the application itself
— Xen virtualization with live migration support [C. Clark et al., May 2005] – open source

5 Mechanisms Explained
1. Health monitoring with OpenIPMI
— Baseboard Management Controller (BMC) equipped with sensors to monitor properties such as temperature, fan speed, and voltage on each node
— IPMI (Intelligent Platform Management Interface)
– increasingly common in HPC
– standard message-based interface to monitor hardware
– raw messaging is harder to use and debug
— OpenIPMI: open source, higher-level abstraction over the raw IPMI message-response system to communicate with the BMC (i.e., to read sensors)
— We use OpenIPMI to gather health information about the nodes (a simplified sketch follows below)
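The slides use the OpenIPMI library on the privileged VM to talk to the BMC; as a rough illustration of the same step, the sketch below shells out to the ipmitool CLI instead (a simpler but functionally similar path) and parses its tabular sensor listing. The BMC hostname and credentials are placeholders, not values from the slides.

```python
# Hypothetical stand-in for the OpenIPMI-based health monitor: read all
# sensors that the node's BMC exposes by invoking ipmitool and parsing its
# pipe-separated "sensor" listing.
import subprocess

def read_ipmi_sensors(bmc_host, user, password):
    """Return {sensor name: numeric reading} for the given BMC."""
    out = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "sensor"],
        capture_output=True, text=True, check=True).stdout
    readings = {}
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or fields[1] in ("", "na"):
            continue
        try:
            readings[fields[0]] = float(fields[1])  # e.g. "CPU Temp" -> 47.0
        except ValueError:
            pass  # skip discrete (non-numeric) sensors in this sketch
    return readings

if __name__ == "__main__":
    sensors = read_ipmi_sensors("node01-bmc", "admin", "admin")
    for name, value in sensors.items():
        if "Temp" in name or "Fan" in name:
            print(name, value)
```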

6 Mechanisms Explained
2. Ganglia
— widely used, scalable, distributed load-monitoring tool
— every node in the cluster runs a Ganglia daemon, and each node has an approximate view of the entire cluster
— UDP is used to transfer messages
— measures CPU usage, memory usage, and network usage by default
– we use Ganglia to identify the least loaded node → migration target (sketched below)
— also extended to distribute the IPMI sensor data
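To make the target-selection step concrete, here is a minimal sketch that asks a gmond daemon for its XML view of the cluster (gmond serves it on TCP port 8649 by default) and returns the host with the lowest one-minute load. The excluded hostname is an assumption for illustration, and the authors' PFT daemon may query Ganglia differently.

```python
# Sketch: pick the least loaded node from Ganglia's cluster-wide XML report.
import socket
import xml.etree.ElementTree as ET

def least_loaded_node(gmond_host="localhost", gmond_port=8649, exclude=()):
    """Return the name of the host with the lowest load_one metric."""
    with socket.create_connection((gmond_host, gmond_port)) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    best_host, best_load = None, float("inf")
    for host in root.iter("HOST"):
        name = host.get("NAME")
        if name in exclude:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                load = float(metric.get("VAL"))
                if load < best_load:
                    best_host, best_load = name, load
    return best_host

if __name__ == "__main__":
    print(least_loaded_node(exclude={"node01"}))  # skip the failing node
```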

7 Mechanisms Explained
[Diagram: a Xen VMM hosting a privileged VM and a guest VM running the MPI tasks, on top of the hardware]
3. Fault tolerance with Xen
— para-virtualized environment – OS modified, application unchanged
— privileged VM and guest VM run on the Xen hypervisor/VMM
— guest VMs can live migrate to other hosts → little overhead (invocation sketched below)
– state of the VM is preserved
– the VM is halted only for an insignificant period of time
– migration phases:
– phase 1: send guest image → destination node, application running
– phase 2: send repeated diffs → destination node, application still running
– phase 3: commit final diffs → destination node, OS/application frozen
– phase 4: activate guest on destination, application running again
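The preventive action maps onto Xen's user-land tools: `xm migrate --live` performs the live migration described above. A minimal wrapper might look like the following; the guest name and target host are placeholders, and the destination's xend must be configured to accept relocation requests.

```python
# Sketch: trigger Xen live migration of a guest domain via the xm tool.
import subprocess

def live_migrate(guest_name, target_host):
    """Live-migrate the named guest VM to target_host (state preserved)."""
    subprocess.run(["xm", "migrate", "--live", guest_name, target_host],
                   check=True)

if __name__ == "__main__":
    live_migrate("hpc-guest-01", "node07")  # placeholder names
```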

8 Overall Set-up of the Components
[Diagram: four nodes, each with hardware + BMC (Baseboard Management Controller), the Xen VMM, and a privileged VM running Ganglia and the PFT daemon; three nodes also run a guest VM with MPI tasks, while one is a stand-by Xen host with no guest]
Deteriorating health → migrate the guest (with its MPI application) to the stand-by host

9 Overall Set-up of the Components
[Diagram: the same set-up after migration – the guest VM with its MPI tasks now runs on the formerly stand-by host]
Deteriorating health → migrate the guest (with its MPI application) to the stand-by host
The destination host generates an unsolicited ARP reply advertising that the guest VM's IP has moved to a new location [C. Clark et al., 2005]
— this causes peers to resend packets to the new host

10 Proactive Fault Tolerance (PFT) Daemon
Runs on the privileged VM (host)
Initialization (see the sketch below):
— read safe thresholds from a config file (CPU temperature, fan speeds; extensible to corrupt sectors, network errors, voltage fluctuations, …)
— initialize the connection with the IPMI BMC using authentication parameters and hostname
— gather a listing of the sensors available in the system and validate it against our list
[Flowchart: Initialize → Health Monitor (reads the IPMI Baseboard Management Controller) → Threshold breach? → if no, keep monitoring; if yes, Load Balance via Ganglia and raise an alarm / schedule maintenance of the system]
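A hedged sketch of the initialization step, under an assumed one-line-per-sensor config format (the slides do not show the authors' actual file format): read the safe thresholds and keep only those sensors the BMC actually reports.

```python
# Sketch of PFTd initialization: load thresholds and validate them against
# the sensors the BMC exposes (see the IPMI sketch above). The config format
# "<sensor name> = <threshold>" is an assumption for illustration.
def load_thresholds(path="pftd.conf"):
    thresholds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition("=")
            thresholds[name.strip()] = float(value)
    return thresholds

def validate_sensors(thresholds, available):
    """Drop configured sensors that this BMC does not actually report."""
    missing = set(thresholds) - set(available)
    if missing:
        print("warning: sensors not found on this BMC:",
              ", ".join(sorted(missing)))
    return {k: v for k, v in thresholds.items() if k not in missing}
```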

11 PFT Daemon
Health monitoring (a simplified loop is sketched below)
— interacts with the IPMI BMC (via OpenIPMI) to read sensors
— periodic sampling of data (event-driven operation is also supported)
— threshold exceeded → control is handed over to load balancing
PFTd determines the migration target by contacting Ganglia
— load-based selection (lowest load)
— load obtained via the /proc file system
— invokes Xen live migration for the guest VM
Xen user-land tools (at the VM/host)
— command-line interface for live migration
— PFT daemon initiates migration of the guest VM
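Putting the pieces together, a heavily simplified monitoring loop in the spirit of the PFT daemon might look as follows. It reuses the hypothetical helpers sketched above (read_ipmi_sensors, least_loaded_node, live_migrate, load_thresholds, validate_sensors); the sampling period, the simple upper-bound breach test, and all names are assumptions rather than the authors' implementation.

```python
# Simplified PFTd main loop: sample sensors periodically, and on a threshold
# breach pick the least loaded node via Ganglia and live-migrate the guest.
# Assumes the helper functions from the previous sketches are in scope.
import time

def pftd_loop(bmc_host, user, password, guest_name, this_node, period_s=10):
    thresholds = validate_sensors(load_thresholds(),
                                  read_ipmi_sensors(bmc_host, user, password))
    while True:
        readings = read_ipmi_sensors(bmc_host, user, password)
        breached = [s for s, limit in thresholds.items()
                    if readings.get(s, 0.0) > limit]   # upper bounds only
        if breached:
            target = least_loaded_node(exclude={this_node})
            print("threshold breach on", breached, "-> migrating to", target)
            live_migrate(guest_name, target)
            break  # node is handed over for maintenance after migration
        time.sleep(period_s)  # periodic sampling (event-driven also possible)
```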

12 Experimental Framework
Cluster of 16 nodes (dual-core, dual Opteron 265, 1 Gbps Ethernet)
Xen VMM; privileged and guest VMs run a ported Linux kernel
Guest VM:
— same configuration as the privileged VM
— has 1 GB of RAM
— booted on the VMM with PXE netboot via NFS (see the config sketch below)
— has access to NFS (same as the privileged VM)
Ganglia runs on the privileged VM (and also the guest VM) on all nodes
Node sensors read via OpenIPMI
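For concreteness, a guest VM definition for such a diskless, NFS-backed set-up might resemble the sketch below (Xen 3 domU configuration files are evaluated as Python). Every path, address, and name here is a placeholder, not the authors' actual configuration.

```python
# Hypothetical Xen 3 guest configuration with an NFS root, roughly matching
# the 1 GB guest described on this slide. All values are placeholders.
kernel     = "/boot/vmlinuz-xenU"                 # ported guest kernel image
memory     = 1024                                 # 1 GB of RAM
name       = "hpc-guest-01"
vif        = ["mac=00:16:3e:00:00:01, bridge=xenbr0"]
root       = "/dev/nfs"
nfs_server = "10.0.0.1"                           # cluster NFS server
nfs_root   = "/export/guest-root"
extra      = "ip=dhcp"                            # network config at boot
```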

13 Experimental Framework
NAS Parallel Benchmarks run on the guest virtual machines
MPICH-2 with an MPD ring on n guest VMs (no job pause required!)
Process on the privileged domain
— monitors the MPI task runs
— issues the migration command (NFS used for synchronization; see the sketch below)
Measured:
— wallclock time with and without migration
— actual downtime + migration overhead (modified Xen migration)
Benchmarks run 10 times; results report the average
NPB V3.2.1: BT, CG, EP, LU, and SP benchmarks
— IS runs are too short
— MG requires > 1 GB of memory for class C
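As a rough illustration of the measurement harness, the privileged-domain process could synchronize with the guest through a flag file on the shared NFS export and then time the migration command; the file name and flag protocol here are invented for illustration, not taken from the slides.

```python
# Sketch: wait for an NFS-visible flag written by the guest once the MPI
# benchmark is running, then issue and time the live migration.
# Assumes live_migrate() from the earlier sketch is in scope.
import os
import time

def wait_and_migrate(flag_file, guest_name, target_host, poll_s=1.0):
    while not os.path.exists(flag_file):      # flag dropped by the guest
        time.sleep(poll_s)
    start = time.time()
    live_migrate(guest_name, target_host)
    print("migration command returned after %.1f s" % (time.time() - start))

if __name__ == "__main__":
    wait_and_migrate("/nfs/shared/npb_started", "hpc-guest-01", "node07")
```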

14 Experimental Results
1. Single node failure  2. Double node failure
[Charts: NPB Class B / 4 nodes and NPB Class C / 4 nodes]
Single node failure – overhead of 1-4% over the total wallclock time
Double node failure – overhead of 2-8% over the total wallclock time

15 Experimental Results
3. Behavior of problem scaling
[Chart: NPB, 4 nodes]
Generally, overhead increases with problem size (CG is an exception)
The chart depicts only the overhead portion
The dark region represents the time for which the VM was halted
The light region represents the delay incurred due to migration (diff operations, etc.)

16 Experimental Results
4. Behavior of task scaling
[Chart: NPB Class C]
Generally, we expect overhead to decrease as the number of nodes increases
Some discrepancies are observed for BT and LU (the migration duration is 40 s, but here we see 60 s)

17 Experimental Results
5. Migration duration
[Charts: NPB, 4 nodes and NPB, 4/8/16 nodes]
Minimum of 13 s needed to transfer a 1 GB VM without any active processes
Maximum of 40 s needed before migration is initiated
Depends on the network bandwidth, RAM size, and the application

18 Experimental Results
6. Scalability (total execution time)
[Chart: NPB Class C]
Speedup is largely unaffected

19 Related Work
FT – the reactive approach is more common
Automatic
— checkpoint/restart (e.g., BLCR – Berkeley Lab Checkpoint/Restart) [S. Sankaran et al., LACSI '03], [G. Stellner, IPPS '96]
— log-based (message logging + temporal ordering) [G. Bosilca et al., Supercomputing 2002]
Non-automatic
— explicit invocation of checkpoint routines [R. T. Aulwes et al., IPDPS 2004], [G. E. Fagg and J. J. Dongarra, 2000]
Virtualization in HPC incurs little or no overhead [W. Huang et al., ICS '06]
To make virtualization competitive for message-passing environments, VMM-bypass I/O in VMs has been explored [J. Liu et al., USENIX '06]
Network virtualization can be optimized [A. Menon et al., USENIX '06]

20 Conclusion
In contrast to the currently available reactive FT schemes, we have developed a proactive system with much lower overhead
Transparent and automatic FT for arbitrary MPI applications
Ideally suited to long-running MPI jobs
A proactive system complements reactive systems well, helping to reduce the high overhead associated with purely reactive schemes