Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by Stephen L. Scott and Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division
Research and development goals
- Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
- Eliminate many of the numerous single points of failure and control in today's HPC systems
- Develop techniques to enable HPC systems to run computational jobs 24/7
- Develop proof-of-concept prototypes and production-type RAS solutions
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- Addresses the challenges for operating and runtime systems of running large applications efficiently on future ultrascale high-end computers
- Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
- MOLAR is a collaborative research effort (www.fastos.org/molar)
Symmetric active/active redundancy
- Many active head nodes with workload distribution
- Symmetric replication between head nodes
- Continuous service: always up to date, no fail-over necessary, no restore-over necessary
- Virtual synchrony model (complex algorithms)
- Prototypes for Torque and the Parallel Virtual File System (PVFS) metadata server
[Diagram: multiple active/active head nodes serving the compute nodes]
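The active/active design hinges on every head node applying the same totally ordered stream of state changes. The minimal Python sketch below illustrates that idea under simplifying assumptions: it is not the Torque or PVFS metadata-server prototype, and the Sequencer class is a stand-in for the virtual-synchrony group communication layer that provides the total order in the real system.

```python
# Minimal sketch of symmetric active/active replication: every head-node
# replica applies the same totally ordered stream of state-changing
# requests, so all replicas hold identical state and any of them can
# serve clients.  A single Sequencer stands in for the group
# communication (virtual synchrony) layer purely for illustration.

class HeadNodeReplica:
    """One active head node holding replicated metadata state."""

    def __init__(self, name):
        self.name = name
        self.metadata = {}          # replicated state
        self.last_applied = 0       # sequence number of last applied op

    def apply(self, seq, op, key, value=None):
        # Operations must arrive in total order with no gaps.
        assert seq == self.last_applied + 1, "out-of-order delivery"
        if op == "create":
            self.metadata[key] = value
        elif op == "remove":
            self.metadata.pop(key, None)
        self.last_applied = seq


class Sequencer:
    """Stand-in for the group communication layer: assigns a global
    total order and delivers each operation to every live replica."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.seq = 0

    def broadcast(self, op, key, value=None):
        self.seq += 1
        for replica in self.replicas:
            replica.apply(self.seq, op, key, value)

    def fail(self, replica):
        # A failed replica is simply dropped; the survivors already
        # hold current state, so no fail-over or restore-over is needed.
        self.replicas.remove(replica)


if __name__ == "__main__":
    a, b, c = (HeadNodeReplica(n) for n in "ABC")
    group = Sequencer([a, b, c])
    group.broadcast("create", "/scratch/job42", {"owner": "alice"})
    group.fail(b)                       # head node B dies
    group.broadcast("remove", "/scratch/job42")
    print(a.metadata == c.metadata)     # surviving replicas agree: True
```

Because every surviving replica is always up to date, removing a failed head node requires no fail-over and no restore-over, which is the property the slide emphasizes.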
Symmetric active/active Parallel Virtual File System metadata server

  Nodes   Availability   Est. annual downtime
  1       98.58%         5d, 4h, 21m
  2       99.97%         1h, 45m
  3       99.9997%       1m, 30s

[Figures: writing and reading throughput (requests/sec) versus number of clients (1 to 32), comparing baseline PVFS with 1, 2, and 4 active/active metadata servers]
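The availability column follows the usual parallel-redundancy arithmetic: with n active/active head nodes that fail independently, service is lost only when all n are down at once. A small sketch of that calculation, taking the 98.58% single-node availability from the table (independent failures are an assumption of the model):

```python
# Availability of n symmetric active/active head nodes, assuming
# independent failures: the service is down only if all n nodes are
# down simultaneously.  Single-node availability is taken from the slide.

HOURS_PER_YEAR = 365 * 24

def group_availability(single, n):
    return 1.0 - (1.0 - single) ** n

def annual_downtime_hours(avail):
    return (1.0 - avail) * HOURS_PER_YEAR

single = 0.9858  # one head node (98.58%)
for n in (1, 2, 3):
    a = group_availability(single, n)
    print(f"{n} node(s): {a:.6%} available, "
          f"~{annual_downtime_hours(a):.2f} h downtime/year")
# 1 node(s): ~124 h/year  (about 5 days, 4 hours)
# 2 node(s): ~1.8 h/year  (about 1 hour, 45 minutes)
# 3 node(s): ~0.025 h/year (about 1.5 minutes)
```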
Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
- Operational nodes: pause
  - BLCR reuses existing processes
  - LAM/MPI reuses existing connections
  - Restore partial process state from the checkpoint
- Failed nodes: migrate
  - Restart the process on a new node from the checkpoint
  - Reconnect with the paused processes
- Scalable MPI membership management for low overhead
- Efficient, transparent, and automatic failure recovery
[Diagram: paused MPI processes on live nodes keep their existing connections; the MPI process from the failed node is migrated from shared checkpoint storage to a spare node and new connections are established]
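To make the pause/migrate sequence concrete, here is a hedged control-flow sketch in Python. It is not the LAM/MPI+BLCR implementation; the CheckpointStore and Node classes are hypothetical stand-ins for BLCR checkpoint images and compute nodes, used only to show the order of steps: pause and roll back the surviving ranks in place, restart the lost rank from its checkpoint on a spare node, then reconnect and resume.

```python
# Control-flow sketch of the job-pause recovery scheme on this slide.
# The classes below are illustrative stand-ins, not BLCR or LAM/MPI.

class CheckpointStore:
    """Shared storage holding the last checkpoint image per MPI rank."""
    def __init__(self):
        self.images = {}
    def save(self, rank, state):
        self.images[rank] = dict(state)
    def load(self, rank):
        return dict(self.images[rank])

class Node:
    def __init__(self, name):
        self.name, self.rank, self.state = name, None, None
    def run(self, rank, state):
        self.rank, self.state = rank, state
    def pause_and_rollback(self, store):
        # Surviving node: the existing process and its connections are
        # reused; only the checkpointed partial state is restored.
        self.state = store.load(self.rank)
    def restart_from_checkpoint(self, rank, store):
        # Spare node: restart the lost rank from its checkpoint image.
        self.run(rank, store.load(rank))

def recover(live, failed, spares, store):
    for node in live:                           # 1. pause surviving ranks
        node.pause_and_rollback(store)
    for dead, spare in zip(failed, spares):     # 2. migrate failed ranks
        spare.restart_from_checkpoint(dead.rank, store)
    # 3. membership update, reconnection of paused and migrated ranks,
    #    and job continuation without a full restart (not modelled here).

if __name__ == "__main__":
    store = CheckpointStore()
    n0, n1, spare = Node("n0"), Node("n1"), Node("spare")
    for rank, node in enumerate((n0, n1)):
        node.run(rank, state={"iteration": 100})
        store.save(rank, {"iteration": 90})     # last checkpoint taken
    recover(live=[n0], failed=[n1], spares=[spare], store=store)
    print(n0.state, spare.rank, spare.state)    # both back at iteration 90
```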
LAM/MPI+BLCR job-pause performance
- 3.4% overhead compared to a full job restart, but:
  - No LAM reboot overhead
  - Transparent continuation of execution
  - No requeue penalty
  - Less staging overhead
[Figure: run time in seconds for the BT, CG, EP, FT, LU, MG, and SP benchmarks under job pause-and-migrate, LAM reboot, and job restart]
Proactive fault tolerance for HPC using Xen virtualization
- Novel fault-tolerance scheme that acts before a failure impacts the system
- Standby Xen host (spare node without a guest VM)
- On deteriorating health, migrate the guest VM to the spare node
- The new host generates an unsolicited ARP reply, indicating that the guest VM has moved and telling peers to resend to the new host
[Diagram: each node runs the Xen VMM with a privileged VM hosting Ganglia and the PFT daemon (monitoring hardware health via the BMC) plus a guest VM running MPI tasks; the guest VM of a deteriorating node is migrated to the standby host]
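The per-node PFT daemon reduces to a simple monitor-and-migrate loop. The sketch below is illustrative only: the temperature threshold, polling interval, and health source are assumptions, and the `xm migrate --live` invocation stands in for whatever live-migration interface the Xen toolstack in use provides; the prototype obtains its health data through Ganglia and the BMC.

```python
# Sketch of a proactive fault tolerance (PFT) daemon loop: watch node
# health and live-migrate the guest VM to a standby host before the
# node fails.  Threshold, interval, and health source are assumptions.

import itertools
import subprocess
import time

CPU_TEMP_LIMIT_C = 70.0    # assumed "deteriorating health" threshold
POLL_INTERVAL_S = 10

def migrate_guest(guest_vm, spare_host):
    """Live-migrate the guest VM (and the MPI tasks inside it) to the
    standby Xen host; after the move, the new host's unsolicited ARP
    reply redirects peers to the new location."""
    subprocess.run(["xm", "migrate", "--live", guest_vm, spare_host],
                   check=True)

def pft_daemon(read_health, guest_vm, spare_host):
    """read_health is whatever callable reports node health, e.g. a CPU
    temperature obtained from Ganglia or from the BMC via IPMI."""
    while True:
        if read_health() > CPU_TEMP_LIMIT_C:
            migrate_guest(guest_vm, spare_host)   # act before failure
            return
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    # Simulated health readings standing in for Ganglia/BMC queries.
    fake_temps = itertools.chain([55.0, 62.0, 68.0], itertools.repeat(74.0))
    pft_daemon(lambda: next(fake_temps), "guest-vm-3", "spare-node-17")
```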
VM migration performance impact
- Single node failure: 0.5–5% additional cost over total wall-clock time
- Double node failure: 2–8% additional cost over total wall-clock time
[Figures: run time in seconds for the BT, CG, EP, LU, and SP benchmarks without migration, with one migration (single node failure), and with two migrations (double node failure)]
HPC reliability analysis and modeling
- Programming paradigm and system scale impact reliability
- Reliability analysis:
  - Estimate the mean time to failure (MTTF)
  - Obtain the failure distribution: exponential, Weibull, gamma, etc.
  - Feed results back into fault-tolerance schemes for adaptation

  Distribution   Negative log-likelihood value
  Exponential    2653.3
  Weibull        3532.8
  Lognormal      2604.3
  Gamma          2627.4

[Figure: cumulative probability versus time between failures (TBF, 0 to 300) for the fitted distributions]
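The distribution-fitting step can be reproduced with standard tools. The sketch below uses SciPy to fit the candidate distributions by maximum likelihood, compare their negative log-likelihoods, and derive an MTTF estimate; the time-between-failure data are synthetic placeholders, not the failure logs behind the numbers on the slide.

```python
# Sketch of the reliability-analysis step: fit candidate failure
# distributions to observed times between failures (TBF), compare them
# by negative log-likelihood, and estimate the MTTF from the best fit.
# The TBF data below are synthetic placeholders.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tbf_hours = rng.gamma(shape=1.5, scale=30.0, size=500)   # synthetic TBF data

candidates = {
    "exponential": stats.expon,
    "weibull":     stats.weibull_min,
    "lognormal":   stats.lognorm,
    "gamma":       stats.gamma,
}

fits = {}
for name, dist in candidates.items():
    params = dist.fit(tbf_hours, floc=0)              # fix location at zero
    nll = -np.sum(dist.logpdf(tbf_hours, *params))    # negative log-likelihood
    fits[name] = (params, nll)
    print(f"{name:12s} negative log-likelihood = {nll:.1f}")

best_name, (best_params, _) = min(fits.items(), key=lambda kv: kv[1][1])
mttf = candidates[best_name].mean(*best_params)       # MTTF from the best fit
print(f"best fit: {best_name}, estimated MTTF = {mttf:.1f} h")
```

The resulting distribution and MTTF estimate are what the fault-tolerance schemes above consume when adapting, for example, checkpoint intervals or migration thresholds.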
Contacts regarding RAS research

Stephen L. Scott
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3144
scottsl@ornl.gov

Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3132
engelmannc@ornl.gov