
Experimental Evaluation of a SIFT Environment for Parallel Spaceborne Applications K. Whisnant, Z. Kalbarczyk, R.K. Iyer, P. Jones Center for Reliable and High-Performance Computing University of Illinois; Urbana, IL D. Rennels University of California; Los Angeles, CA R. Some Jet Propulsion Laboratory; Pasadena, CA

University of Illinois at Urbana-Champaign

Outline
- Overview of the REE project.
- Description of the SIFT environment.
- Description of the REE testbed.
- Goals of the experiment.
- Experimental methodology.
- Results of the experiment.
- Conclusion.

REE Project Overview
- Remote Exploration and Experimentation (REE) project:
  - Put a cluster of COTS processors in space: cost and performance benefits.
  - Execute scientific MPI applications on the COTS cluster.
  - Send only results back to Earth.
- Spaceborne computers have traditionally been radiation-hardened.
- COTS components are not designed for the harsh space environment.
- Protect MPI applications through SIFT (software-implemented fault tolerance) solutions.

[Slide diagram: scientific applications run inside the SIFT environment on COTS processors; a rad-hard platform hosts the Spacecraft Control Computer.]

ARMOR-based SIFT Environment
- The SIFT environment consists of ARMOR processes:
  - Adaptive Reconfigurable Mobile Objects of Reliability.
  - Execute on the COTS processors.
  - Provide error detection and recovery services to applications.
- ARMOR processes are reconfigurable:
  - Structured around elements that provide elementary functions or services.
  - ARMOR functionality can be reconfigured by changing the elements that make up the ARMOR process.
- The SIFT environment is designed to be fault-tolerant:
  - Error detection and recovery responsibilities are distributed throughout a hierarchy of ARMOR processes.
  - ARMOR state is protected through microcheckpointing.
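The element structure and microcheckpointing described above can be sketched as follows. This is a minimal illustration, not the ARMOR implementation: the `Element`/`Armor` classes, method names, and message format are all invented, and the key idea shown is only that a checkpoint snapshot covers the single element whose state changed, not the whole process.

```python
import copy

class Element:
    """One pluggable unit of ARMOR functionality (illustrative)."""
    def __init__(self, name):
        self.name = name
        self.state = {}

    def handle(self, message):
        # Element-specific processing mutates only this element's state.
        self.state[message["key"]] = message["value"]

class Armor:
    """Hypothetical ARMOR process: a set of elements plus a
    microcheckpoint buffer, one entry per element."""
    def __init__(self, elements):
        self.elements = {e.name: e for e in elements}
        self.checkpoint = {}  # element name -> last committed state copy

    def deliver(self, target, message):
        self.elements[target].handle(message)
        # Microcheckpointing: snapshot only the element that changed,
        # rather than checkpointing the entire process state.
        self.checkpoint[target] = copy.deepcopy(self.elements[target].state)

    def recover(self, target):
        # Restore a single corrupted element from its last microcheckpoint.
        self.elements[target].state = copy.deepcopy(self.checkpoint[target])

armor = Armor([Element("detector"), Element("recovery")])
armor.deliver("detector", {"key": "hb", "value": 1})
armor.elements["detector"].state["hb"] = 999   # simulated state corruption
armor.recover("detector")
print(armor.elements["detector"].state["hb"])  # restored value: 1
```

Because each element checkpoints independently, recovery can be as fine-grained as the element that was corrupted, which is what makes per-element assertion checks (discussed later) worthwhile.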

SIFT Environment on the REE Testbed
- Execution ARMORs:
  - Directly oversee MPI application processes.
  - Detect crash failures through operating-system calls.
  - Detect hang failures via progress indicators from the application.
- Daemons:
  - Detect ARMOR crashes and hangs.
  - Route ARMOR-to-ARMOR messages.
- Fault Tolerance Manager (FTM):
  - Recovers from ARMOR, application, and node failures.
  - Interfaces with the Spacecraft Control Computer.
  - Instructs Execution ARMORs to launch applications.
- Heartbeat ARMOR:
  - Detects and recovers from FTM failures.

[Slide diagram: testbed of PowerPC processing nodes (128 MB RAM, LynxOS 3.0.1) plus a Sun SPARC workstation with disk.]
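Crash detection "through operating-system calls" can be illustrated with a short sketch: spawn the application and inspect the exit status the OS reports, much as an Execution ARMOR would observe a child process via `wait()`/`waitpid()`. The function name and return strings are ours, not the ARMOR API.

```python
import subprocess
import sys

def detect_crash(cmd):
    """Run an application process and classify its termination.
    A negative return code means the child was killed by a signal."""
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        return "crash: killed by signal %d" % -proc.returncode
    if proc.returncode != 0:
        return "crash: nonzero exit status %d" % proc.returncode
    return "ok"

# A clean run versus a process that exits with an error status.
print(detect_crash([sys.executable, "-c", "pass"]))               # ok
print(detect_crash([sys.executable, "-c", "raise SystemExit(3)"]))
```

Hang detection needs a different mechanism, since a hung process never reaches `wait()`; that is the role of the progress indicators sketched after the SIGINT/SIGSTOP results below.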

Error Injection Methodology
- Goals:
  - Stress the detection and recovery capabilities of the ARMOR runtime environment.
  - Assess the performance impact of the SIFT environment on the application.
- Target processes for error injection: FTM, Heartbeat ARMOR, Execution ARMOR, application.
- Three sets of injections progressively stress the system:
  - SIGINT/SIGSTOP injections mimic clean crash/hang failures (no error propagation).
  - Register and text-segment injections expand the failure scenarios by introducing the possibility of error propagation and checkpoint corruption.
  - Targeted heap injections into dynamic data investigate error propagation and the effectiveness of internal assertion checks in preventing it.
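The basic injection primitive behind the register, text-segment, and heap campaigns is a single-bit flip at a uniformly chosen location. A minimal sketch, operating on a `bytearray` standing in for a region of a target process image (the real injector operated on live process state):

```python
import random

def inject_bit_flip(image, rng):
    """Flip one uniformly chosen bit in `image`, modeling a
    single-event upset. Returns the (byte, bit) location hit."""
    byte_index = rng.randrange(len(image))
    bit = rng.randrange(8)
    image[byte_index] ^= 1 << bit
    return byte_index, bit

rng = random.Random(2024)
image = bytearray(16)        # an all-zero "memory" makes the flip visible
inject_bit_flip(image, rng)
flipped = sum(bin(b).count("1") for b in image)
print(flipped)               # exactly one bit now differs: 1
```

Drawing the injection time uniformly over the application's run (as the slide states) is what lets the campaign sample errors across all phases of execution rather than only at startup.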

SIGINT/SIGSTOP Injections
- Recovered from all crash/hang failures.
- Process interaction can lead to correlated failures:
  - Application-FTM correlated failure.
  - Application-Execution ARMOR correlated failure: the application blocks waiting for a reply from the failed Execution ARMOR and stops updating its progress indicator; after the Execution ARMOR is restarted, the missing progress-indicator updates cause a hang to be detected and application recovery to be triggered.
- All correlated failures were successfully recovered.

[Slide diagrams: timeline of the Application-Execution ARMOR correlated failure, and the testbed layout with a Daemon, Execution ARMORs, and OTIS processes (ranks 0 and 1) on Nodes 3 and 4 behind the SIFT interface.]
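The progress-indicator mechanism that catches the blocked application in the scenario above can be sketched in a few lines. This is a simplified model with an invented class name: the application periodically bumps a counter, and the monitor declares a hang if the counter has not advanced between two consecutive checks.

```python
class HangDetector:
    """Progress-indicator hang detection (simplified sketch)."""
    def __init__(self):
        self.last = None

    def check(self, indicator):
        # No advance since the previous check means the application is
        # blocked (e.g., waiting on a reply that will never arrive).
        hung = self.last is not None and indicator == self.last
        self.last = indicator
        return "hang detected" if hung else "ok"

mon = HangDetector()
print(mon.check(1))   # ok: first observation
print(mon.check(2))   # ok: application made progress
# The application now blocks waiting for a reply from a crashed
# Execution ARMOR, so the indicator stops advancing.
print(mon.check(2))   # hang detected
```

In the real system the check interval matters (too short risks false hangs on slow phases, too long delays recovery), but the stale-counter test itself is this simple.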

Impact of Recovery on the Application
- Measured application execution time while errors were injected into the various target processes.
- Less than 5% average overhead from recovery of ARMOR processes.
- Negligible overhead when correlated-failure scenarios are discounted.
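The overhead figure above is a relative slowdown against a failure-free baseline run. A one-line helper makes the arithmetic explicit; the timings in the example are illustrative only, not measurements from the experiment.

```python
def recovery_overhead_pct(failure_free_s, with_recovery_s):
    """Percent slowdown of a run that included ARMOR recovery,
    relative to a failure-free run of the same application."""
    return 100.0 * (with_recovery_s - failure_free_s) / failure_free_s

# e.g., a 100 s baseline run that takes 104 s when recovery occurs:
print(round(recovery_overhead_pct(100.0, 104.0), 1))  # 4.0 (< 5%)
```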

Register and Text-Segment Injections
- Random injections into registers and into the code of frequently executed functions.
- Injections uniformly distributed in time.
- Injected until an error manifested or the application completed normally.
- Most errors caused crash failures (segmentation fault or illegal-instruction exception).
- Errors propagated to other ARMOR processes or to an ARMOR's checkpoint in 11 of the 700+ observed failures:
  - The error prevented the application from starting, or
  - The application completed, but the SIFT environment failed to recognize the successful completion.
  - None of these propagated errors affected an executing application.
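Why most text-segment flips surface as illegal-instruction crashes can be shown with a toy model. The "instruction set" here is invented (three valid opcode bytes), so the exact crash ratio is not the paper's; the point is only the classification: a flip that produces an unrecognized opcode crashes, while a flip that lands on another valid opcode can let the run complete normally.

```python
import random

def run_with_text_injection(text, valid_opcodes, rng):
    """Flip one bit in a copy of a toy 'text segment' and classify
    the outcome the way the experiment classifies failures."""
    corrupted = bytearray(text)
    corrupted[rng.randrange(len(corrupted))] ^= 1 << rng.randrange(8)
    for op in corrupted:
        if op not in valid_opcodes:
            return "crash: illegal instruction"
    return "completed normally"

rng = random.Random(1)
text = bytes([0x01, 0x02, 0x03] * 5)
outcomes = [run_with_text_injection(text, {0x01, 0x02, 0x03}, rng)
            for _ in range(50)]
print(outcomes.count("crash: illegal instruction"), "of 50 runs crashed")
```

In a real ISA the density of valid encodings (and of flips that decode to a different but still-executable instruction) determines how often errors stay silent instead of crashing, which is exactly what makes propagation to checkpoints possible.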

Heap Injections
- Goal: investigate error propagation and the effectiveness of internal assertions in preventing it.
- Injected a single bit flip into the heap of one ARMOR process per run.
- Pointers were not injected, in order to maximize the possibility of error propagation.
- Targeted elements in the FTM with substantial state.
- Assertion checks within ARMORs: element-specific checks, plus generic checks performed by the common ARMOR infrastructure.
- Impact of errors in dynamic heap data:
  - The error propagated to another ARMOR process or checkpoint without being detected by an assertion check.
  - An assertion check detected the data error, but not before the error propagated; attempted local recovery was unsuccessful.
  - An assertion check detected the data error, and recovery restored the system to a clean state.
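The kind of internal assertion check being evaluated can be sketched as a sanity pass over an element's heap state, run before the state is acted on or checkpointed. The field names and bounds below are invented for illustration; the real ARMORs combine element-specific checks like these with generic checks from the common infrastructure.

```python
def element_assertions(state):
    """Sanity-check a (hypothetical) FTM element's heap state.
    Returns a list of violations; empty means the state looks clean."""
    violations = []
    if not 0 <= state["node_count"] <= 64:
        violations.append("node_count out of range")
    if state["phase"] not in ("init", "running", "recovering"):
        violations.append("unknown phase")
    return violations

state = {"node_count": 4, "phase": "running"}
print(element_assertions(state))   # clean state: []
state["node_count"] ^= 1 << 30     # single bit flip in heap data
print(element_assertions(state))   # check fires: ['node_count out of range']
```

A detected violation lets the element trigger recovery from its microcheckpoint before the corrupted value is sent to another ARMOR or written into a checkpoint, which is the propagation path the heap campaign measures.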

Conclusions
- Progressive fault injections stressed the error detection and recovery services of the SIFT environment.
- The SIFT environment imposes negligible overhead during failure-free execution and less than 5% overhead during recovery of ARMOR processes.
- Correlated failures involving the application and ARMOR processes can impact application availability.
- Correlated failures were successfully recovered thanks to hierarchical error detection and recovery.
- Targeted heap injections show that internal assertion checks and microcheckpointing are useful in preventing error propagation.