A Proactive Resiliency Approach for Large-Scale HPC Systems System Research Team Presented by Geoffroy Vallee Oak Ridge National Laboratory Welcome to HPCVirt 2009

Goal of the Presentation Can we anticipate failures and avoid their impact on application execution?

Introduction  Traditional Fault Tolerance Policies in HPC Systems  Reactive policies  Other approach: proactive fault tolerance  Two critical capabilities to make proactive FT successful  Failure prediction (anomaly detection)  Application migration  Proactive policy  Testing / Experimentation  Is proactive fault tolerance the solution?

Failure Detection & Prediction  System monitoring  Live monitoring  Study non-intrusive monitoring techniques  Postmortem failure analysis  System log analysis  Live analysis for failure prediction  Postmortem analysis  Anomaly analysis  Collaboration with George Ostrouchov  Statistical tool for anomaly detection
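As an illustration of what live log analysis for failure prediction can look like in practice, here is a minimal sketch (not the team's actual tool): it tails a syslog-style file and raises an alert when hardware-related messages cluster in a short window. The log path, the regular expressions, and the window/threshold values are assumptions chosen for the example.

    # Minimal sketch of live system-log analysis for failure prediction: tail a
    # syslog-style file and alert when hardware-related messages cluster in a
    # short window. The log path, patterns, window, and threshold are
    # illustrative assumptions.
    import re
    import time
    from collections import deque

    PRECURSOR_PATTERNS = [
        re.compile(r"EDAC.*(CE|UE)", re.IGNORECASE),            # memory ECC events
        re.compile(r"I/O error", re.IGNORECASE),                # disk trouble
        re.compile(r"temperature above threshold", re.IGNORECASE),
    ]
    WINDOW_SECONDS = 600    # sliding window length
    ALERT_THRESHOLD = 3     # precursor events in the window that trigger an alert

    def follow(path):
        """Yield lines appended to a log file, like 'tail -f'."""
        with open(path) as log:
            log.seek(0, 2)                      # start at the end of the file
            while True:
                line = log.readline()
                if not line:
                    time.sleep(1.0)
                    continue
                yield line

    def watch(path="/var/log/messages"):
        events = deque()                        # timestamps of recent precursors
        for line in follow(path):
            if any(p.search(line) for p in PRECURSOR_PATTERNS):
                now = time.time()
                events.append(now)
                while events and now - events[0] > WINDOW_SECONDS:
                    events.popleft()
                if len(events) >= ALERT_THRESHOLD:
                    print("ALERT: failure precursors clustering; "
                          "candidate node for proactive migration")

    if __name__ == "__main__":
        watch()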

Anomaly Detection  Anomaly Analyzer (George Ostrouchov)  Ability to view groups of components as statistical distributions  Identify anomalous components  Identify anomalous time periods  Based on numeric data, with no expert knowledge needed for grouping  Scalable approach: relies only on statistical properties of simple summaries  Power comes from examining high-dimensional relationships  Visualization utility used to explore the data  Implementation uses  R project for statistical computing  GGobi visualization tool for high-dimensional data exploration  With good failure data, could be used for failure prediction
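The analyzer itself is built on R and GGobi, but the core idea can be sketched briefly: summarize each component's metrics and flag the components whose summaries sit far from the rest of the group. The sketch below is ours, not the project's code; the metric names, the robust z-score, the 3.5 cut-off, and the toy data are illustrative assumptions.

    # Sketch of group-level anomaly detection from simple per-node summaries:
    # flag the nodes whose summaries are extreme relative to the rest of the
    # group, with no expert knowledge about the metrics. The project's analysis
    # uses R/GGobi; the cut-off and the toy data here are assumptions.
    import statistics

    def robust_z(values):
        """Z-scores based on median and MAD, which stay stable in the
        presence of the very outliers we are trying to find."""
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1e-9
        return [(v - med) / (1.4826 * mad) for v in values]

    def anomalous_nodes(summaries, cutoff=3.5):
        """summaries: {node: {metric: summary value over a time window}}.
        Returns the nodes that are extreme for at least one metric."""
        nodes = sorted(summaries)
        metrics = sorted(next(iter(summaries.values())))
        flagged = {}
        for m in metrics:
            for node, z in zip(nodes, robust_z([summaries[n][m] for n in nodes])):
                if abs(z) > cutoff:
                    flagged.setdefault(node, []).append((m, round(z, 1)))
        return flagged

    # Toy example: node0 runs extra services, so its CPU and memory summaries
    # stand out from the rest of the group (compare the idle-XTORC observation
    # about the head node later in the talk).
    data = {f"node{i}": {"cpu": 5.0 + 0.1 * i, "mem": 20.0} for i in range(8)}
    data["node0"] = {"cpu": 55.0, "mem": 70.0}
    print(anomalous_nodes(data))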

Anomaly Detection Prototype  Monitoring / data collection  Prototype developed using XTORC  Ganglia monitoring system  Standard metrics, e.g., memory/CPU utilization  lm_sensors data, e.g., CPU/motherboard temperature  Leveraged the RRD reader from Ovis v1.1.1
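Ganglia stores each metric as a per-host round-robin database (RRD) file, so data collection can be as simple as walking those files. The sketch below shells out to rrdtool fetch; it is not the OVIS reader itself, and the directory layout, cluster name, and metric name are assumptions that depend on the local Ganglia configuration.

    # Sketch of pulling a node's metric history out of Ganglia's per-host RRD
    # files by shelling out to rrdtool. This is not the OVIS RRD reader; the
    # directory layout, cluster name, and metric name are assumptions that
    # depend on the local Ganglia configuration.
    import subprocess

    def fetch_metric(host, metric, start="-1h",
                     rrd_root="/var/lib/ganglia/rrds/cluster"):
        rrd = f"{rrd_root}/{host}/{metric}.rrd"
        out = subprocess.run(
            ["rrdtool", "fetch", rrd, "AVERAGE", "--start", start],
            capture_output=True, text=True, check=True).stdout
        samples = []
        for line in out.splitlines():
            if ":" not in line:
                continue                         # skip the header line
            ts, value = line.split(":", 1)
            value = value.split()[0]             # first (and only) data source
            if value not in ("nan", "-nan"):
                samples.append((int(ts), float(value)))
        return samples

    # Example: the last hour of CPU temperature readings gathered via lm_sensors.
    # print(fetch_metric("node13", "cpu_temp"))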

Proactive Fault Tolerance Mechanisms  Goal: move the application away from the component that is about to fail  Migration  Pause/unpause  Major proactive FT mechanisms  Process-level migration  Virtual machine migration  In our context  Do not care about the underlying mechanism  We can easily switch between solutions
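Since the policy layer is deliberately agnostic about the mechanism, a thin wrapper is enough to switch between process-level and VM-level migration. The sketch below hides BLCR (cr_checkpoint/cr_restart) and Xen (xm migrate --live) behind one interface; the command names are the standard ones for those tools, but the class layout, the shared checkpoint path, and the ssh-based invocation are our own illustrative assumptions.

    # Sketch of a mechanism-neutral migration interface: the proactive policy
    # only ever calls migrate(), and the chosen backend decides whether that
    # means process-level (BLCR) or virtual-machine (Xen) migration. The class
    # layout, shared checkpoint path, and ssh-based invocation are illustrative
    # assumptions; only the cr_checkpoint/cr_restart and xm commands themselves
    # are standard.
    import subprocess

    class MigrationMechanism:
        def migrate(self, workload_id, src_host, dst_host):
            raise NotImplementedError

    class ProcessLevelMigration(MigrationMechanism):
        """Checkpoint the process with BLCR on the source, restart it on the
        target (checkpoint file assumed to live on shared storage)."""
        def migrate(self, pid, src_host, dst_host):
            ctx = f"/shared/context.{pid}"
            subprocess.run(["ssh", src_host, "cr_checkpoint", "--term",
                            "-f", ctx, str(pid)], check=True)
            subprocess.run(["ssh", dst_host, "cr_restart", ctx], check=True)

    class VirtualMachineMigration(MigrationMechanism):
        """Live-migrate the whole guest with Xen, keeping the application up."""
        def migrate(self, domain, src_host, dst_host):
            subprocess.run(["ssh", src_host, "xm", "migrate", "--live",
                            domain, dst_host], check=True)

    def evacuate(mechanism, workload_id, failing_host, spare_host):
        # Called when the predictor flags failing_host as suspect.
        mechanism.migrate(workload_id, failing_host, spare_host)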

System and application resilience  What policy should be used for proactive FT?  Modular framework  Virtual machine checkpoint/restart and migration  Process-level checkpoint/restart and migration  Implementation of new policies via our SDK  Feedback loop  Policy simulator  Eases the initial study of new policies  Simulation results match experimental results obtained with virtualization
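To give a feel for what "implementing a new policy via the SDK" can mean, here is an illustrative policy interface: the framework pushes health events in, the policy hands actions back, and the same object can be driven either by the simulator or by live monitoring. This is our reading of the idea, not the project's actual API; the event fields and action names are assumptions.

    # Illustrative sketch of a proactive-FT policy interface: the framework
    # pushes health events into the policy, and the policy returns an action
    # (migrate, checkpoint, or nothing). This is our reading of the idea, not
    # the project's actual SDK; event fields and action names are assumptions.
    from dataclasses import dataclass

    @dataclass
    class HealthEvent:
        node: str
        metric: str       # e.g. "cpu_temp" or "ecc_errors"
        value: float
        timestamp: float

    class Policy:
        def on_event(self, event: HealthEvent):
            """Return ("migrate", node), ("checkpoint", None), or None."""
            raise NotImplementedError

    class ThresholdPolicy(Policy):
        """Simplest possible policy: migrate when a metric crosses its limit."""
        def __init__(self, limits):
            self.limits = limits               # {metric: maximum allowed value}

        def on_event(self, event):
            limit = self.limits.get(event.metric)
            if limit is not None and event.value > limit:
                return ("migrate", event.node)
            return None

    # The same Policy object can be driven by the simulator (replaying a log)
    # or by the live monitoring feed, which is what closes the feedback loop.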

Type 1 Feedback-Loop Control Architecture  Alert-driven coverage  Basic failures  No evaluation of application health history or context  Prone to false positives  Prone to false negatives  Prone to miss the real-time window  Prone to decrease application health through migration  No correlation of health context or history

Type 2 Feedback-Loop Control Architecture  Trend-driven coverage  Basic failures  Fewer false positives/negatives  No evaluation of application reliability  Prone to miss the real-time window  Prone to decrease application health through migration  No correlation of health context or history

Type 3 Feedback-Loop Control Architecture  Reliability-driven coverage  Basic and correlated failures  Fewer false positives/negatives  Able to maintain the real-time window  Does not decrease application health through migration  Correlation of short-term health context and history  No correlation of long-term health context or history  Unable to match system and application reliability patterns

Type 4 Feedback-Loop Control Architecture  Reliability-driven coverage of failures and anomalies  Basic and correlated failures, anomaly detection  Less prone to false positives  Less prone to false negatives  Able to maintain the real-time window  Does not decrease application health through migration  Correlation of short- and long-term health context & history
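To make the difference between the alert-driven (Type 1) and trend-driven (Types 2-4) architectures above concrete, the sketch below contrasts a raw threshold trigger with one that only fires on a sustained drift past the threshold, which is where the reduction in false positives comes from. The window length, threshold, and toy metric series are assumptions.

    # Sketch contrasting an alert-driven trigger (Type 1: act on any reading
    # over a limit) with a trend-driven trigger (Types 2+: act only when recent
    # history shows a sustained drift past the limit). Window length, threshold,
    # and the toy metric series are illustrative assumptions.
    from collections import deque

    class AlertDrivenTrigger:
        def __init__(self, limit):
            self.limit = limit

        def update(self, value):
            return value > self.limit          # fires on every spike

    class TrendDrivenTrigger:
        def __init__(self, limit, window=10):
            self.limit = limit
            self.history = deque(maxlen=window)

        def update(self, value):
            self.history.append(value)
            if len(self.history) < self.history.maxlen:
                return False                   # not enough history yet
            ordered = list(self.history)
            rising = all(b >= a for a, b in zip(ordered, ordered[1:]))
            return rising and ordered[-1] > self.limit

    # A single noisy spike fires the alert-driven trigger (a false positive)
    # but not the trend-driven one; a steady climb past the limit fires both.
    noisy = [60, 61, 95, 60, 61, 60, 62, 61, 60, 61]
    climb = list(range(60, 100, 4))
    for name, series in (("noisy spike", noisy), ("steady climb", climb)):
        alert, trend = AlertDrivenTrigger(90), TrendDrivenTrigger(90)
        print(name,
              "-> alert fires:", any(alert.update(v) for v in series),
              " trend fires:", any(trend.update(v) for v in series))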

Testing and Experimentation  How to evaluate a failure prediction mechanism?  Failure injection  Anomaly detection  How to evaluate the impact of a given proactive policy?  Simulation  Experimentation

Fault Injection / Testing  First purpose: testing our research  Inject failures at different levels: system, OS, application  Framework for fault injection  Controller: Analyzer, Detector & Injector  System-level and user-level targets  Testing of failure prediction/detection mechanisms  Mimic the behavior of other systems  “Replay” a failure sequence on another system  Based on system logs, we can evaluate the impact of different policies
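The "replay" idea — take a failure sequence observed on one system and re-issue it, through the injector, on another — can be sketched with a small driver. The CSV trace format, the speed-up factor, and the inject() callback below are assumptions for illustration, not the framework's actual interfaces.

    # Sketch of replaying a recorded failure sequence: read (offset_seconds,
    # node, fault_type) records from a trace and hand them to an injector at
    # the right (optionally compressed) time. The CSV trace format, the
    # speed-up factor, and the inject() callback are assumptions.
    import csv
    import time

    def replay(trace_path, inject, speedup=60.0):
        """Replay a failure trace; speedup=60 turns one recorded hour into
        one minute of testing."""
        with open(trace_path, newline="") as f:
            events = [(float(offset), node, fault)
                      for offset, node, fault in csv.reader(f)]
        start = time.time()
        for offset, node, fault in sorted(events):
            delay = offset / speedup - (time.time() - start)
            if delay > 0:
                time.sleep(delay)
            inject(node, fault)

    # Example with a hypothetical trace file and a print-only injector:
    # replay("failures.csv",
    #        inject=lambda node, fault: print("inject", fault, "on", node))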

Fault Injection  Example faults/errors  Bit-flips - CPU registers/memory  Memory errors - memory corruption/leaks  Disk faults - read/write errors  Network faults - packet loss, etc.  Important characteristics  Representative failures (fidelity)  Transparency and low overhead  Detection and injection are linked  Existing work  Techniques: hardware vs. software  Software FI can leverage performance/debug hardware  Not many publicly available tools
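As the simplest possible software example from the first category, the snippet below flips one randomly chosen bit in an application buffer to mimic a silent memory error; anything deeper (CPU registers, another process's memory, disk or network faults) needs OS or hardware support, as the slide notes. The buffer and reporting format are illustrative.

    # Sketch of the simplest software fault injector on the list: flip one
    # randomly chosen bit in an application buffer to mimic a silent memory
    # error. Register, cross-process, disk, and network faults need OS or
    # hardware support; the buffer and reporting here are illustrative.
    import random

    def flip_random_bit(buf: bytearray, rng=random):
        byte_index = rng.randrange(len(buf))
        bit_index = rng.randrange(8)
        buf[byte_index] ^= 1 << bit_index
        return byte_index, bit_index      # report the injection site (detection)

    data = bytearray(b"resilient payload")
    print("injected at", flip_random_bit(data), "->", data)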

Simulator  Based on system logs  Currently uses logs from LLNL ASCI White  Evaluates the impact of  Alternate policies  System/FT mechanism parameters (e.g., checkpoint cost)  Enables study and evaluation of different configurations before actual deployment
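A stripped-down version of the same accounting might look like the sketch below: replay failure times taken from a log and total up checkpoint overhead, lost rework, and restart or migration costs under a chosen policy. The failure list here is synthetic and every cost/recall parameter is an assumption; the real simulator is driven by the ASCI White logs.

    # Sketch of a log-driven resilience simulator: given failure times taken
    # from a system log, estimate the time lost under periodic checkpointing,
    # with and without a proactive layer that deflects some failures through
    # migration. The failure list is synthetic and every cost/recall value is
    # an assumption; the accounting is a simplification of what a real
    # simulator would do.
    import random

    random.seed(0)  # reproducible illustration

    def lost_time(failure_times, interval, ckpt_cost, restart_cost,
                  prediction_recall=0.0, migration_cost=0.0):
        horizon = max(failure_times)
        lost = (horizon // interval) * ckpt_cost          # checkpoint overhead
        for t in failure_times:
            if random.random() < prediction_recall:
                lost += migration_cost                    # failure avoided
            else:
                lost += (t % interval) + restart_cost     # rework + restart
        return lost

    failures = [3600 * h for h in (5.2, 11.7, 26.1, 40.9, 47.5)]
    print("reactive only:",
          lost_time(failures, interval=3600, ckpt_cost=300, restart_cost=600))
    print("with 70% prediction recall:",
          lost_time(failures, interval=3600, ckpt_cost=300, restart_cost=600,
                    prediction_recall=0.7, migration_cost=60))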

Anomaly Detection: Experimentation on “XTORC”  Hardware  Compute nodes: ~ GHz  Head node: 1 (1.7 GHz)  Service/log server: 1 (1.8 GHz)  Network: 100 Mb Ethernet  Software  Operating systems span RedHat 9, Fedora Core 4 & 5  RH9: node53  FC4: node4, 58, 59, 60  FC5: node1-3, 5-52, 61  RH9 is Linux 2.4  FC4/5 is Linux 2.6  NFS exports ‘/home’

XTORC Idle 48-hr Results  Data classified and grouped automatically  However, those results were manually interpreted (admin & statistician)  Observations  Node 0 is the most different from the rest, particularly hours 13, 37, 46, and 47. This is the head node where most services are running.  Node 53 runs the older Red Hat 9 (all others run Fedora Core 4/5).  It turned out that nodes 12, 31, 39, 43, and 63 were all down.  Node 13 … and particularly its hour 47!  Node 30 hour 7 … ?  Node 1 & Node 5 … ?  Three groups emerged in data clustering  1. temperature/memory related, 2. CPU related, 3. I/O related

Anomaly Detection - Next Steps  Data  Reduce overhead in data gathering  Monitor more fields  Investigate methods to aid data interpretation  Identify significant fields for given workloads  Heterogeneous nodes  Different workloads  Base (no/low work)  Loaded (benchmark/app work)  Loaded + Fault Injection  Working toward links between anomalies and failures

Prototypes - Overview  Proactive & reactive fault tolerance  Process level: BLCR + LAM-MPI  Virtual machine level: Xen + any MPI implementation  Detection  Monitoring framework: based on Ganglia  Anomaly detection tool  Simulator  Based on system logs  Enables customization of policies and system/application parameters

Is proactive the answer?  Most of the time, prediction accuracy is not good enough; we may lose all the benefit of proactive FT  No “one-size-fits-all” solution  Combination of different policies  “Holistic” fault tolerance  Example: decrease the checkpoint frequency by combining proactive and reactive FT policies (see the sketch below)  Optimization of existing policies  Leverage existing techniques/policies  Tuning  Customization
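The checkpoint-frequency example can be made concrete with Young's classic approximation for the optimal checkpoint interval, T ≈ sqrt(2 · C · M): every failure the proactive layer deflects raises the mean time between failures that the reactive layer actually sees, so checkpoints can be taken less often. The cost and recall values below are assumptions for illustration.

    # Back-of-the-envelope illustration with Young's approximation: the optimal
    # checkpoint interval is roughly T = sqrt(2 * C * M), where C is the
    # checkpoint cost and M the mean time between failures. If the proactive
    # layer deflects a fraction p of failures, only (1 - p) of them reach the
    # reactive layer, so the effective M grows to M / (1 - p) and checkpoints
    # can be taken less often. C, M, and p below are assumed values.
    from math import sqrt

    def young_interval(ckpt_cost, mtbf):
        return sqrt(2 * ckpt_cost * mtbf)

    C = 600.0            # checkpoint cost: 10 minutes (assumed)
    M = 24 * 3600.0      # system MTBF: one day (assumed)
    for p in (0.0, 0.5, 0.7):
        T = young_interval(C, M / (1 - p))
        print(f"prediction recall {p:.0%}: checkpoint every {T / 3600:.1f} h")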

Resource Contacts Geoffroy Vallee

Performance Prediction  Important variance between different runs of the same experiment  Only a few studies address the problem  “System noise”  Critical for scaling up  Scientists want precise answers  What are the problems?  Lack of tools?  VMMs are too big/complex?  Not enough VMM-bypass/optimization?

Fault Tolerance Mechanisms  FT mechanisms are not yet mainstream (out of the box)  But different solutions are starting to become available (BLCR, Xen, etc.)  Support as many mechanisms as possible  Reactive FT mechanisms  Process-level checkpoint/restart  Virtual machine checkpoint/restart  Proactive FT mechanisms  Process-level migration  Virtual machine migration

Existing System Level Fault Injection  Virtual Machines  FAUmachine  Pro: focused on FI & experiments, code available  Con: older project, lots of dependencies, slow  FI-QEMU (patch)  Pro: works with ‘qemu’ emulator, code available  Con: patch for the ARM architecture, limited capabilities  Operating System  Linux (>= )  Pro: extensible, kernel & user level targets, maintained by the Linux community  Con: immature, focused on testing Linux
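For the Linux framework in the last bullet, configuration goes through debugfs knobs when the kernel is built with CONFIG_FAULT_INJECTION; the sketch below enables occasional slab-allocation failures. The knob names follow the kernel's fault-injection documentation, but their availability and exact semantics depend on the kernel build, so treat the paths and values as assumptions (and run it only on a test machine, as root).

    # Sketch of driving the Linux kernel fault-injection framework through its
    # debugfs knobs: make a small fraction of slab allocations fail in order to
    # exercise error paths. Requires a kernel built with CONFIG_FAULT_INJECTION
    # and CONFIG_FAILSLAB, debugfs mounted, and root privileges; knob names
    # follow the kernel's fault-injection documentation but may vary with the
    # kernel build, so treat this as an assumption-laden illustration.
    from pathlib import Path

    FAILSLAB = Path("/sys/kernel/debug/failslab")

    def configure_failslab(probability=1, interval=100, times=50):
        settings = {
            "probability": probability,   # likelihood of injection, in percent
            "interval": interval,         # spacing between injected failures
            "times": times,               # stop after this many injections
            "verbose": 1,                 # log each injection to the kernel log
        }
        for knob, value in settings.items():
            (FAILSLAB / knob).write_text(f"{value}\n")

    if __name__ == "__main__":
        configure_failslab()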

Future Work  Implementation of the RAS framework  Ultimately have an “end-to-end” solution for system resilience  From initial studies based on the simulator  To deployment and testing on computing platforms  Using different low-level mechanisms (process level versus virtual machine level mechanisms)  Adapting the policies to both the platform and the applications