Experimental Evaluation of a SIFT Environment for Parallel Spaceborne Applications

K. Whisnant, Z. Kalbarczyk, R.K. Iyer, P. Jones
Center for Reliable and High-Performance Computing, University of Illinois; Urbana, IL

D. Rennels
University of California; Los Angeles, CA

R. Some
Jet Propulsion Laboratory; Pasadena, CA

University of Illinois at Urbana-Champaign

Outline
• Overview of REE project.
• Description of the SIFT environment.
• Description of the REE testbed.
• Goals of the experiment.
• Experimental methodology.
• Results of experiment.
• Conclusion.

REE Project Overview
• Remote Exploration and Experimentation (REE) project:
  – Put a cluster of commercial off-the-shelf (COTS) processors in space: cost, performance benefits.
  – Execute scientific MPI applications on the COTS cluster.
  – Only send results back to Earth.
• Spaceborne computers are traditionally radiation-hardened.
• COTS components are not designed for the harsh space environment.
• Protect MPI applications through software-implemented fault tolerance (SIFT) solutions.
[Figure labels: SIFT Environment; Rad-hard platform; Spacecraft Control Computer; Scientific Applications; COTS processors]

ARMOR-based SIFT Environment
• SIFT environment consists of ARMOR processes:
  – Adaptive Reconfigurable Mobile Objects of Reliability.
  – Execute on COTS processors.
  – Provide error detection and recovery services to applications.
• ARMOR processes are reconfigurable:
  – Structured around elements that provide elementary functions or services.
  – ARMOR functionality can be reconfigured by changing the elements that make up the ARMOR process.
• SIFT environment designed to be fault-tolerant:
  – Error detection and recovery responsibilities distributed throughout a hierarchy of ARMOR processes.
  – ARMOR state protected through microcheckpointing.
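
The element structure and microcheckpointing described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names `Element` and `Armor`, the message-type routing, and the choice to checkpoint an element's state right after it handles a message are hypothetical and are not taken from the ARMOR implementation.

```python
import copy


class Element:
    """Hypothetical element: a named unit of private state plus a handler."""

    def __init__(self, name, handles, state):
        self.name = name
        self.handles = set(handles)   # message types this element accepts
        self.state = state            # element-private state (a dict)

    def process(self, msg_type, payload):
        # Toy behaviour: count messages and remember the last payload.
        self.state["count"] = self.state.get("count", 0) + 1
        self.state["last"] = payload


class Armor:
    """Hypothetical ARMOR process assembled from replaceable elements."""

    def __init__(self):
        self.elements = {}
        self.checkpoint = {}          # element name -> saved copy of its state

    def add_element(self, element):
        # Reconfiguration: functionality changes by adding/removing elements.
        self.elements[element.name] = element
        self.checkpoint[element.name] = copy.deepcopy(element.state)

    def remove_element(self, name):
        del self.elements[name]
        del self.checkpoint[name]

    def deliver(self, msg_type, payload):
        for element in self.elements.values():
            if msg_type in element.handles:
                element.process(msg_type, payload)
                # Assumed microcheckpoint: save only the state of the
                # element that just ran, not the whole process image.
                self.checkpoint[element.name] = copy.deepcopy(element.state)

    def restore(self):
        # Recovery: roll every element back to its last saved state.
        for name, element in self.elements.items():
            element.state = copy.deepcopy(self.checkpoint[name])
```

In this sketch, corrupting an element's state after a message has been delivered and then calling `restore()` returns the element to the state saved after that message.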

SIFT Environment on REE Testbed
• Execution ARMORs:
  – Directly oversee MPI application processes.
  – Detect crash failures through operating system calls.
  – Detect hang failures by progress indicators from the application.
• Daemons:
  – Detect ARMOR crashes and hangs.
  – Route ARMOR-to-ARMOR messages.
• Fault Tolerance Manager (FTM):
  – Recover from ARMOR, application, and node failures.
  – Interface with Spacecraft Control Computer.
  – Instruct Execution ARMORs to launch applications.
• Heartbeat ARMOR:
  – Detect and recover from FTM failures.
[Figure: REE testbed. Labels: Sun SPARC workstation; Disk; Processing nodes: PowerPC 750, 366 MHz, 128 MB RAM, LynxOS 3.0.1]
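
Hang detection through progress indicators can be sketched as a watchdog that declares a hang when the application's indicator has not changed for longer than a timeout. The class name `ProgressMonitor`, the integer counter, and the timeout policy are assumptions for illustration; the slides do not say how the Execution ARMOR implements this check.

```python
class ProgressMonitor:
    """Hypothetical watchdog over an application's progress counter."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_counter = None   # most recent counter value seen
        self.last_change = None    # time at which the counter last changed

    def observe(self, counter, now):
        """Record the progress counter read at time `now` (seconds)."""
        if counter != self.last_counter:
            self.last_counter = counter
            self.last_change = now

    def is_hung(self, now):
        """True if the counter has been unchanged for more than the timeout."""
        if self.last_change is None:
            return False           # nothing observed yet
        return now - self.last_change > self.timeout_s
```

Time is passed in explicitly so the logic can be exercised without real delays; a deployed monitor would read a clock instead.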

Error Injection Methodology
• Goals:
  – Stress the detection and recovery capabilities of the ARMOR runtime environment.
  – Assess performance impact of SIFT environment on application.
• Target processes for error injection: FTM, Heartbeat ARMOR, Execution ARMOR, application.
• Three sets of injections progressively stress the system:
  – SIGINT/SIGSTOP injections mimic clean crash/hang failures (no error propagation).
  – Register and text segment injections expand the failure scenarios by introducing possibility for error propagation and checkpoint corruption.
  – Targeted heap injections into dynamic data investigate error propagation and the effectiveness of internal assertion checks in preventing error propagation.
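
The first injection set can be sketched on a POSIX system with standard signal delivery. This is an illustration only: the sleeping child process stands in for a real target, the helper names are hypothetical, and the slides do not describe the injector actually used on the testbed.

```python
import os
import signal
import subprocess
import sys


def start_target():
    """Launch a stand-in target process that simply sleeps (hypothetical)."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def inject_hang(proc):
    """SIGSTOP suspends the process: it stays alive but makes no progress."""
    os.kill(proc.pid, signal.SIGSTOP)


def resume(proc):
    """SIGCONT lets a stopped process run again."""
    os.kill(proc.pid, signal.SIGCONT)


def inject_crash(proc):
    """SIGINT mimics a clean crash: by default it terminates the process
    (a Python target sees it as KeyboardInterrupt and exits)."""
    os.kill(proc.pid, signal.SIGINT)
```

Because these signals act on the process as a whole and corrupt no data, they model the clean crash and hang cases in which no error propagates.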

SIGINT/SIGSTOP Injections
• Recovered from all crash/hang failures.
• Process interaction leads to possible correlated failures:
  – Application-FTM correlated failure.
  – Application-Execution ARMOR correlated failure:
[Timeline figure: Execution ARMOR, App rank 0, App rank 1. Event labels: failure; recovery; app blocks waiting for reply; progress indicator; missing progress indicator update; hang detected]
• All correlated failures successfully recovered.
[Figure: Node 3 and Node 4. Labels: Daemon; Execution ARMOR (two instances); SIFT Interface and OTIS Process (rank 0); SIFT Interface and OTIS Process (rank 1)]

Impact of Recovery on Application
• Measured application execution time when various target processes were injected.
• < 5% average overhead from recovery of ARMOR processes.
• Negligible overhead when correlated failure scenarios are discounted.
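
One plausible way to compute an average overhead figure of this kind is to compare mean execution time across injected runs with mean failure-free execution time. The slides do not give the exact formula, so the definition below is an assumption, and the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def overhead_percent(baseline_times_s, injected_times_s):
    """Relative increase (in percent) of mean execution time under injection
    over mean failure-free execution time. Assumed definition."""
    base = mean(baseline_times_s)
    return 100.0 * (mean(injected_times_s) - base) / base
```

With made-up times (not measurements from the experiment), failure-free runs of 100 s and injected runs averaging 104 s would give an overhead of 4%.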

Register and Text-Segment Injections
• Random injections into registers and code of frequently executed functions.
• Injections uniformly distributed in time.
• Inject until error manifests or application completes normally.
• Most errors caused crash failures (segmentation fault or illegal instruction exception).
• Errors propagated to other ARMOR processes or to the ARMOR's checkpoint in 11 of 700+ observed failures:
  – Error prevented application from starting.
  – Application completed, but SIFT environment was not able to recognize the successful completion.
• None affected an executing application.
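
The injection policy above (random bit flips, times drawn uniformly, repeat until the error manifests or the run ends) can be sketched as follows. The `manifests` callback is a hypothetical stand-in for observing the real target, and the register width and injection count are parameters invented for this sketch.

```python
import random


def flip_random_bit(value, width, rng):
    """Return `value` with one randomly chosen bit (of `width` bits) inverted."""
    bit = rng.randrange(width)
    return value ^ (1 << bit)


def schedule_injections(run_length_s, count, rng):
    """Draw `count` injection times uniformly over the run, in time order."""
    return sorted(rng.uniform(0.0, run_length_s) for _ in range(count))


def inject_until_manifest(run_length_s, count, manifests, rng):
    """Apply injections in time order; stop at the first one that manifests.

    `manifests(t)` is a hypothetical callback returning True when the error
    injected at time `t` becomes visible (for example, as a crash).
    Returns (time of manifestation or None, number of injections performed).
    """
    performed = 0
    for t in schedule_injections(run_length_s, count, rng):
        performed += 1
        if manifests(t):
            return t, performed
    # No injection manifested: the application completed normally.
    return None, performed
```

Passing a seeded `random.Random` instance as `rng` makes a campaign repeatable.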

Heap Injections
• Goal: investigate error propagation and effectiveness of internal assertions in preventing error propagation.
• Inject one single bit flip into heap of ARMOR process per run.
• Pointers not injected in order to maximize possibility of error propagation.
• Targeted elements in FTM with substantial state.
• Assertion checks within ARMORs: element-specific checks, generic checks done by common ARMOR infrastructure.
• Impact of errors in dynamic heap data:
  – Error propagated to another ARMOR process or checkpoint without being detected by assertion check.
  – Assertion check detected data error, but not before error propagated. Attempted local recovery not successful.
  – Assertion check detected data error, and recovery able to restore system to clean state.
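
A toy model shows why a single bit flip in heap data may or may not be caught by an assertion check. The table layout (one byte per application rank holding a node id), the range check, and the outcome labels are all assumptions made for illustration; they are not the FTM's actual data structures or checks.

```python
def flip_bit(buf, byte_index, bit_index):
    """Invert one bit of a bytearray in place."""
    buf[byte_index] ^= 1 << bit_index


def check_node_table(buf, num_nodes):
    """Hypothetical element-specific assertion: every entry is a valid node id."""
    return all(b < num_nodes for b in buf)


def run_one_injection(byte_index, bit_index, num_nodes=4):
    """Inject one bit flip into a toy heap and classify the outcome."""
    heap = bytearray([0, 1, 2, 3])   # toy "heap": app rank -> node id
    checkpoint = bytes(heap)         # clean copy taken before the injection
    flip_bit(heap, byte_index, bit_index)
    if not check_node_table(heap, num_nodes):
        heap[:] = checkpoint         # recovery: restore the clean state
        return "detected_and_recovered", bytes(heap)
    # The corrupted value is still a plausible node id, so the check passes
    # and the error is free to propagate.
    return "undetected", bytes(heap)
```

In this model, flipping a high-order bit produces an out-of-range node id that the check catches, while flipping a low-order bit yields another valid-looking id that escapes detection.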

Conclusions
• Progressive fault injections to stress error detection and recovery services of SIFT environment.
• SIFT environment imposes negligible overhead during failure-free execution and < 5% overhead during recovery of ARMOR processes.
• Correlated failures involving application and ARMOR processes can impact application availability.
• Successful recovery of correlated failures due to hierarchical error detection and recovery.
• Targeted heap injections show internal assertion checks and microcheckpointing useful in preventing error propagation.
