
1 Proactive Fault Tolerance for HPC using Xen Virtualization
Arun Babu Nagarajan, Frank Mueller
North Carolina State University

2 Problem Statement
— Trends in HPC: high-end systems with thousands of processors
 – increased probability of a node failure: MTBF becomes shorter
— MPI is widely accepted in scientific computing
 – problem: the MPI standard provides no recovery from faults
— Fault tolerance exists today, but it is only reactive: process checkpoint/restart
 – the entire job must be restarted: inefficient if only one (or a few) node(s) fail
 – overhead due to redoing part of the work
 – open issue: at what frequency should checkpoints be taken?
 – a 100-hour job would run for an additional 150 hours on a petaflop machine due to checkpointing, even without a failure [I. Philp, 2005]

3 Our Solution
Proactive fault tolerance
— anticipates node failure
— takes preventive action instead of reacting to a failure
 – migrates the whole OS to a healthier physical node
 – entirely transparent to the application (it is the guest OS itself that is relocated)
— hence avoids the high overhead of reactive schemes (the overhead associated with our scheme is very small)

4 Design Space
1. A mechanism to predict/anticipate the failure of a node
 — OpenIPMI
 — lm_sensors (more system-specific: x86 Linux)
2. A mechanism to identify the best target node
 — custom centralized approaches: do not scale and are unreliable
 — scalable distributed approach: Ganglia
3. More importantly, a mechanism (the preventive action) that supports relocation of the running application with
 — its state preserved
 — minimum overhead on the application itself
 — Xen virtualization with live migration support [C. Clark et al., 2005], open source

5 Mechanisms Explained
1. Health monitoring with OpenIPMI
— The Baseboard Management Controller (BMC) is equipped with sensors to monitor properties such as temperature, fan speed and voltage of each node
— IPMI (Intelligent Platform Management Interface)
 – increasingly common in HPC
 – standard message-based interface to monitor hardware
 – raw messaging is hard to use and debug
— OpenIPMI: open source, higher-level abstraction over the raw IPMI message/response system for communicating with the BMC (i.e., for reading sensors)
— We use OpenIPMI to gather health information of the nodes (see the sketch below)
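The slides contain no code; the following is a minimal, hypothetical polling sketch in Python. It substitutes the ipmitool command-line utility for the OpenIPMI library that the authors actually used, and the sensor name and threshold are invented for illustration.

    import subprocess

    # Hypothetical sensor name and upper limit; the real daemon reads its
    # thresholds from a config file and talks to the BMC via OpenIPMI.
    THRESHOLDS_C = {"CPU Temp": 70.0}

    def read_sensors():
        """Return {sensor_name: numeric_reading} parsed from `ipmitool sensor` output."""
        out = subprocess.run(["ipmitool", "sensor"],
                             capture_output=True, text=True, check=True).stdout
        readings = {}
        for line in out.splitlines():
            fields = [f.strip() for f in line.split("|")]
            if len(fields) < 2 or fields[1] in ("na", ""):
                continue
            try:
                readings[fields[0]] = float(fields[1])
            except ValueError:
                pass          # skip discrete (non-numeric) sensors
        return readings

    def breached(readings):
        """Yield (sensor, value) pairs that exceed their configured limit."""
        for name, limit in THRESHOLDS_C.items():
            if name in readings and readings[name] > limit:
                yield name, readings[name]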

6 Mechanisms Explained
2. Ganglia
— widely used, scalable distributed load-monitoring tool
— every node in the cluster runs a Ganglia daemon, and each node has an approximate view of the entire cluster
— UDP is used to exchange monitoring messages
— measures CPU usage, memory usage and network usage by default
 – we use Ganglia to identify the least loaded node as the migration target (see the sketch below)
— also extended to distribute the IPMI sensor data
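As a hedged illustration (not the authors' code): a Ganglia gmond daemon exports its cluster view as XML on TCP port 8649 by default, and load_one is a standard Ganglia metric, so a target with the lowest one-minute load could be picked roughly as follows. Error handling is omitted.

    import socket
    import xml.etree.ElementTree as ET

    def least_loaded_host(gmond_host="localhost", gmond_port=8649):
        """Query a gmond daemon and return the host name with the lowest load_one."""
        with socket.create_connection((gmond_host, gmond_port)) as s:
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        root = ET.fromstring(b"".join(chunks))
        best_host, best_load = None, float("inf")
        for host in root.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == "load_one":
                    load = float(metric.get("VAL"))
                    if load < best_load:
                        best_host, best_load = host.get("NAME"), load
        return best_host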

7 Mechanisms Explained
[Figure: Xen VMM hosting a privileged VM and a guest VM with MPI tasks on top of the hardware]
3. Fault tolerance with Xen
— para-virtualized environment: the OS is modified, the application is unchanged
— a privileged VM and guest VMs run on the Xen hypervisor/VMM
— guest VMs can live-migrate to other hosts with little overhead
 – the state of the VM is preserved
 – the VM is halted only for an insignificant period of time
 – migration phases (see the command sketch below):
  – phase 1: send the guest's memory image to the destination node, application still running
  – phase 2: send repeated diffs of dirtied pages to the destination node, application still running
  – phase 3: commit the final diffs to the destination node, OS/application frozen
  – phase 4: activate the guest on the destination, application running again
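For reference, live migration in the Xen 3.x tool stack is driven from the privileged VM with the xm user-land tool. A minimal wrapper (guest and host names below are placeholders, not taken from the slides) amounts to a single command:

    import subprocess

    def live_migrate(guest_name, target_host):
        """Ask the Xen tool stack to live-migrate a guest to another physical host.
        Equivalent to running:  xm migrate --live <guest_name> <target_host>"""
        subprocess.run(["xm", "migrate", "--live", guest_name, target_host], check=True)

    # Hypothetical usage: move the VM named "guestvm1" to the stand-by node "node16".
    live_migrate("guestvm1", "node16")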

8 Overall Set-up of the Components
[Figure: four Xen hosts, each with its hardware and a Baseboard Management Controller (BMC), running the Xen VMM with a privileged VM (Ganglia + PFT daemon); three hosts also run a guest VM with MPI tasks, while one is a stand-by Xen host with no guest]
Deteriorating health: migrate the guest VM (with its MPI application) to the stand-by host

9 Overall Set-up of the Components
[Figure: the same set-up after the guest VM has been migrated to the stand-by host]
Deteriorating health: migrate the guest VM (with its MPI application) to the stand-by host
— The destination host generates an unsolicited ARP reply advertising that the guest VM's IP address has moved to a new location [C. Clark et al., 2005]
— This causes peers to resend packets to the new host

10 Proactive Fault Tolerance (PFT) Daemon
Runs on the privileged VM (host)
Initialization:
— reads the safe thresholds from a config file
 – CPU temperature, fan speeds
 – extensible (corrupt sectors, network, voltage fluctuations, …)
— initializes the connection with the IPMI BMC using the authentication parameters and hostname
— gathers a listing of the sensors available in the system and validates it against our list
[Flowchart: Initialize, then the health monitor reads the IPMI BMC; if no threshold is breached, keep monitoring; on a threshold breach, raise an alarm for maintenance of the system and hand control to load balancing (Ganglia)]

11 PFT Daemon
Health monitoring
— interacts with the IPMI BMC (via OpenIPMI) to read the sensors
— periodic sampling of the data (event-driven operation is also supported)
— when a threshold is exceeded, control is handed over to load balancing
Load balancing
— the PFTd determines the migration target by contacting Ganglia
 – load-based selection (lowest load)
 – load obtained via the /proc file system
— then invokes Xen live migration for the guest VM
Xen user-land tools (at the VM/host)
— command-line interface for live migration
— the PFT daemon initiates migration of the guest VM (a condensed sketch of this loop follows)
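Putting the pieces together, a condensed, hypothetical rendering of the PFTd main loop might look like the following. The real daemon is not shown in the slides and was not written in Python; the guest name, threshold, and polling period are invented, ipmitool stands in for OpenIPMI, and least_loaded_host() is the Ganglia helper sketched earlier.

    import subprocess
    import time

    THRESHOLDS_C = {"CPU Temp": 70.0}   # illustrative; read from a config file in reality
    GUEST, POLL_SECS = "guestvm1", 10   # hypothetical guest name and sampling period

    def health_ok():
        """True while no monitored sensor exceeds its limit."""
        out = subprocess.run(["ipmitool", "sensor"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            fields = [f.strip() for f in line.split("|")]
            if len(fields) < 2 or fields[1] in ("na", ""):
                continue
            limit = THRESHOLDS_C.get(fields[0])
            if limit is not None and float(fields[1]) > limit:
                return False
        return True

    def main():
        while True:                              # health-monitoring loop
            if not health_ok():                  # threshold breached: take preventive action
                target = least_loaded_host()     # Ganglia-based selection (sketch above)
                subprocess.run(["xm", "migrate", "--live", GUEST, target], check=True)
                break                            # alarm / maintenance handling omitted
            time.sleep(POLL_SECS)

    if __name__ == "__main__":
        main()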

12 Experimental Framework
— Cluster of 16 nodes (dual-core, dual Opteron 265, 1 Gbps Ethernet)
— Xen 3.0.2-3 VMM
— Privileged and guest VMs run a ported Linux kernel, version 2.6.16
— Guest VM:
 – the very same configuration as the privileged VM
 – has 1 GB of RAM
 – booted on the VMM with PXE netboot via NFS
 – has access to NFS (the same as the privileged VM)
— Ganglia on the privileged VM (and also on the guest VM) of every node
— Node sensors read via OpenIPMI
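The slides give no configuration files. Purely as a rough illustration, a Xen 3.0 guest (domU) with 1 GB of RAM and an NFS root could be described by a config file along these lines (xm config files use Python syntax; the paths, MAC address, and server IP below are hypothetical, not taken from the experiments):

    # /etc/xen/guestvm1.cfg -- illustrative Xen domU configuration (assumed, not from the slides)
    kernel  = "/boot/vmlinuz-2.6.16-xenU"     # ported guest kernel
    memory  = 1024                            # 1 GB of RAM, as in the experiments
    name    = "guestvm1"
    vif     = ["mac=00:16:3e:00:00:01, bridge=xenbr0"]
    # Root file system over NFS (the guests are netbooted and NFS-rooted)
    root       = "/dev/nfs"
    nfs_server = "10.0.0.1"
    nfs_root   = "/export/guestvm1"
    extra      = "ip=dhcp"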

13 Experimental Framework
— NAS Parallel Benchmarks run on the guest virtual machines
— MPICH-2 with an MPD ring over n guest VMs (no job pause required!)
— A process on the privileged domain
 – monitors the MPI task runs
 – issues the migration command (NFS used for synchronization)
— Measured:
 – wall-clock time with and without migration
 – actual downtime + migration overhead (modified Xen migration)
— Each benchmark is run 10 times; results report the average
— NPB V3.2.1: BT, CG, EP, LU and SP benchmarks
 – IS runs are too short
 – MG requires more than 1 GB of memory for class C
(An illustrative launch sequence follows.)
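Purely to illustrate the measurement setup (the host file contents, process count, and binary name are assumptions; NPB MPI binaries conventionally follow the bt.C.4 naming scheme), launching a class C BT run on four guest VMs with MPICH-2's MPD ring might look like this:

    import subprocess

    HOSTS_FILE = "mpd.hosts"        # one guest-VM host name per line (assumed)
    NPROCS     = 4

    # Bring up the MPD ring across the guest VMs, then launch the benchmark.
    subprocess.run(["mpdboot", "-n", str(NPROCS), "-f", HOSTS_FILE], check=True)
    subprocess.run(["mpiexec", "-n", str(NPROCS), "./bt.C.4"], check=True)
    subprocess.run(["mpdallexit"], check=True)   # tear the ring down afterwards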

14 Experimental Results
1. Single-node failure
2. Double-node failure
[Charts: NPB class B / 4 nodes and NPB class C / 4 nodes]
— Single-node failure: overhead of 1-4 % of total wall-clock time
— Double-node failure: overhead of 2-8 % of total wall-clock time

15 Experimental Results
3. Behavior under problem scaling
[Chart: NPB, 4 nodes, overhead breakdown]
— Overhead generally increases with problem size (CG is an exception)
— The chart depicts only the overhead portion
 – the dark region represents the time for which the VM was halted
 – the light region represents the additional delay incurred by migration (the diff operations, etc.)

16 Experimental Results
4. Behavior under task scaling (NPB class C)
— We generally expect the overhead to decrease as the number of nodes increases
— Some discrepancies are observed for BT and LU (the migration duration is 40 s, but here we see 60 s of overhead)

17 Experimental Results
5. Migration duration
[Charts: NPB on 4 nodes and on 4/8/16 nodes]
— A minimum of 13 s is needed to transfer a 1 GB VM without any active processes
— At most 40 s elapse from initiating a migration until it completes, i.e., this is the lead time required of the failure prediction
— Depends on the network bandwidth, the RAM size and the application
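As a back-of-the-envelope check (our arithmetic, not from the slides): pushing a 1 GB memory image over 1 Gbps Ethernet takes at least 8 Gbit / 1 Gbit/s = 8 s of raw transfer time, so the observed 13 s minimum for an idle 1 GB guest is plausible once protocol overhead and Xen's iterative pre-copy rounds are added. An active application keeps dirtying pages and lengthens the pre-copy phase, which is why the duration also depends on the application.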

18 Experimental Results
6. Scalability: total execution time (NPB class C)
— Speedup is not much affected by the proactive FT mechanism

19 Related Work
Reactive approaches to FT are more common:
— automatic checkpoint/restart (e.g., BLCR, Berkeley Lab Checkpoint/Restart) [S. Sankaran et al., LACSI '03], [G. Stellner, IPPS '96]
— log-based (message logging + temporal ordering) [G. Bosilca et al., Supercomputing 2002]
— non-automatic: explicit invocation of checkpoint routines [R. T. Aulwes et al., IPDPS 2004], [G. E. Fagg and J. J. Dongarra, 2000]
Virtualization in HPC incurs little or no overhead [W. Huang et al., ICS '06]
To make virtualization competitive for message-passing environments, VMM-bypass I/O in VMs has been investigated [J. Liu et al., USENIX '06]
Network virtualization can be optimized [A. Menon et al., USENIX '06]

20 Conclusion
— In contrast to the currently available reactive FT schemes, we have developed a proactive system with much lower overhead
— Transparent and automatic FT for arbitrary MPI applications
— Ideally complements long-running MPI jobs
— A proactive system greatly complements reactive systems by helping to reduce their high overhead

