A New Approach to Software-Implemented Fault Tolerance

Slides:



Advertisements
Similar presentations
An Overview of ABFT in cloud computing
Advertisements

Fault-Tolerant Systems Design Part 1.
725/ASP-DAC Using Loop Invariants to Fight Soft Errors in Data Caches Sri Hari Krishna N., Seung Woo Son, Mahmut Kandemir, Feihui Li Department of.
Programming Types of Testing.
Fault Detection in a HW/SW CoDesign Environment Prepared by A. Gaye Soykök.
Embedded Systems Laboratory Informatics Institute Federal University of Rio Grande do Sul Porto Alegre – RS – Brazil SRC TechCon 2005 Portland, Oregon,
Fehlererkennung in SW David Rigler. Overview Types of errors detection Fault/Error classification Description of certain SW error detection techniques.
ED 4 I: Error Detection by Diverse Data and Duplicated Instructions Greg Bronevetsky.
Design of SCS Architecture, Control and Fault Handling.
Roza Ghamari Bogazici University.  Current trends in transistor size, voltage, and clock frequency, future microprocessors will become increasingly susceptible.
Instituto de Informática and Dipartimento di Automatica e Informatica Universidade Federal do Rio Grande do Sul and Politecnico di Torino Porto Alegre,
Concurrency, Mutual Exclusion and Synchronization.
SiLab presentation on Reliable Computing Combinational Logic Soft Error Analysis and Protection Ali Ahmadi May 2008.
Presenter: Jyun-Yan Li A hybrid approach to the test of cache memory controllers embedded in SoCs’ W. J. Perez, J. Velasco Universidad del Valle Grupo.
LOGO Soft-Error Detection Through Software Fault-Tolerance Techniques by Gökhan Tufan İsmail Yıldız.
Architectural Optimizations Ed Carlisle. DARA: A LOW-COST RELIABLE ARCHITECTURE BASED ON UNHARDENED DEVICES AND ITS CASE STUDY OF RADIATION STRESS TEST.
Fault-Tolerant Systems Design Part 1.
European Test Symposium, May 28, 2008 Nuno Alves, Jennifer Dworak, and R. Iris Bahar Division of Engineering Brown University Providence, RI Kundan.
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
Synthesis Of Fault Tolerant Circuits For FSMs & RAMs Rajiv Garg Pradish Mathews Darren Zacher.
Yun-Chung Yang SimTag: Exploiting Tag Bits Similarity to Improve the Reliability of the Data Caches Jesung Kim, Soontae Kim, Yebin Lee 2010 DATE(The Design,
Error Detection in Hardware VO Hardware-Software-Codesign Philipp Jahn.
CprE 458/558: Real-Time Systems
Fault-Tolerant Systems Design Part 1.
Using Loop Invariants to Detect Transient Faults in the Data Caches Seung Woo Son, Sri Hari Krishna Narayanan and Mahmut T. Kandemir Microsystems Design.
1 CzajkowskiMAPLD 2005/138 Radiation Hardened, Ultra Low Power, High Performance Space Computer Leveraging COTS Microelectronics With SEE Mitigation D.
Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan,
A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.
Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM M. Rebaudengo, M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica.
18/05/2006 Fault Tolerant Computing Based on Diversity by Seda Demirağ
Week#3 Software Quality Engineering.
MAPLD 2005/213Kakarla & Katkoori Partial Evaluation Based Redundancy for SEU Mitigation in Combinational Circuits MAPLD 2005 Sujana Kakarla Srinivas Katkoori.
1 ”MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs” John A. Stratton, Sam S. Stone and Wen-mei W. Hwu Presentation for class TDT24,
Memory Management.
Basic Computer Organization and Design
Computer Architecture Chapter (14): Processor Structure and Function
John D. McGregor Session 9 Testing Vocabulary
The Development Process of Web Applications
Distributed Shared Memory
Soft-Error Detection through Software Fault-Tolerance Techniques
COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE
G.Anuradha Reference: William Stallings
DATA COMMUNICATION AND NETWORKINGS
Introduction to microprocessor (Continued) Unit 1 Lecture 2
CSC 4250 Computer Architectures
nZDC: A compiler technique for near-Zero silent Data Corruption
Concurrency Control.
Introduction of microprocessor
A Closer Look at Instruction Set Architectures
Fault Tolerance In Operating System
John D. McGregor Session 9 Testing Vocabulary
Database Performance Tuning and Query Optimization
John D. McGregor Session 9 Testing Vocabulary
Hwisoo So. , Moslem Didehban#, Yohan Ko
Sequential circuits and Digital System Reliability
Test Case Purification for Improving Fault Localization
Fault Tolerance Distributed Web-based Systems
Soft Error Detection for Iterative Applications Using Offline Training
Design of a ‘Single Event Effect’ Mitigation Technique for Reconfigurable Architectures SAJID BALOCH Prof. Dr. T. Arslan1,2 Dr.Adrian Stoica3.
Information Redundancy Fault Tolerant Computing
Fundamentals of Data Representation
Chapter 1 Introduction(1.1)
Chapter 15 : Concurrency Control
2/23/2019 A Practical Approach for Handling Soft Errors in Iterative Applications Jiaqi Liu and Gagan Agrawal Department of Computer Science and Engineering.
Chapter 11 Database Performance Tuning and Query Optimization
Hardware Assisted Fault Tolerance Using Reconfigurable Logic
Programming with Shared Memory Specifying parallelism
Fault Tolerant Systems in a Space Environment
COMP755 Advanced Operating Systems
Seminar on Enterprise Software
Presentation transcript:

A New Approach to Software-Implemented Fault Tolerance Presented by Mustafa Özgür Akduran Boğaziçi University İstanbul, 2007

Outline Introduction Software-Based Fault Tolerance Detecting and Correcting Transient Faults Affecting Data Detecting and Correcting Transient Faults Affecting Code Experimental Results Conclusion and Future Work CMPE_516 SPRING 2007

Introduction Increasing popularity of low-cost safety-critical computer-based applications in new areas requires the availability of new methods for designing dependable systems. Computer-based dependable systems are currently being introduced but the cost is a major concern and adaptation of commercial hardware (Off-The-Shelf products) is a common practice. Such as automotive, biomedical, telecontrol… Cost  hence the design and development time CMPE_516 SPRING 2007

Introduction So, for this class of applications software fault tolerance is an extremely attractive solution. It allows the implementation of dependable systems without incurring the high cost coming from designing custom hardware or using hardware redundancy. On the other side, relying on software techniques for obtaining dependability often means accepting some overhead in terms of increased code size and reduced performance. However, in many applications, memory and performance constraints are relatively loose and the idea of trading off reliability and speed is often easily acceptable. CMPE_516 SPRING 2007

Introduction The classical approach to provide fault tolerance capabilities to a program whose high level code is available is so-called N-version programming technique which consists in instruction triplication and majority voting. The purposed approach in this paper is to harden programs against transient (not regular or permanent) faults without incurring the high overhead triplication implies. The triplication of the program data segment as well as the triplication of the code segment is required in order to repeat each operation three times and then vote the results. CMPE_516 SPRING 2007

Introduction The hardening strategy proposed here is intended for tolerating single bit-flips in the processor-based system memory. This model is used to represent Single Event Upsets (SEUs) in memory bits, which are likely to happen due to the decrease in the magnitude of the electric charges used to carry and store information. CMPE_516 SPRING 2007

Introduction Several transient phenomena, like electromagnetic interference, power disturbances and highly energized particles (like alpha or neutron) may cause temporary (transient) alteration to the data that programs manipulate or to the expected execution flow of programs. In this paper, we focus only the effects of highly energized particles that interacting with the silicon substrate of memory devices. Firstly ** The approach presented in this paper is intended to face the consequences of errors originating from transient faults, in particular those caused by charged particles hitting the circuit. This kind of fault is increasingly likely to occur in any integrated device due to the continuous improvements in the VLSI technology which reduces the size of the capacitance storing information and increases the operating frequency of circuits. The result is a significant increase in the chance that particles hitting the circuit can introduce a misbehavior. CMPE_516 SPRING 2007

Introduction Fault detection is performed by checking for coherence the two copies of a variable after every read operation on the variable. When a mismatch is found, an error recovering procedure is executed that corrects the error, by relying on the duplicated variables and the stored checksum. As a result, we are able to provide tolerance against transient faults without incurring in the high memory overhead that a straightforward approach such as triplication implies. The approach is thus able to detect and correct transient faults affecting the data and program execution flow. The proposed approach is also able to reduce the performance overhead with respect to triplication. The checksum computation it exploits is based on the XOR boolean function, which is normally computed in an efficient way by any processor. Transient faults in the processor internal memory elements like program counter and stack pointer. CMPE_516 SPRING 2007

Software-Based Fault Tolerance The program to be hardened has three phases Initialization phase Data Manipulation phase Result Representation phase Code transformation rules providing fault detection and correction are automatically applied to the program source code and are classified in two categories. 1. during which the data to be elaborated are acquired. 2. where an algorithm is executed over the acquired data. 3. at the end of the computation So the code transformation takes place during the data manipulation phase. CMPE_516 SPRING 2007

Detecting and Correcting Transient Faults Affecting Data Rules Every variable x must be duplicated After each read operation on x, the two copies will be checked Checksum c is associated to one set of variables Before every write op. on x, the checksum is re-computed After every write op. on x, the checksum is updated Memory hardening is performed according to the following rules 1. 2. If an inconsistency is detected a recovery procedure is activated. 3. The initial value of the checksum is equal to the XOR of all the already initialized variables in associated set. The recovery procedure re- computes the XOR on the set of variables associated to the checksum (set 0 for example) and compares it with the stored one. Then if the re-computed checksum matches the stored one, the associated set of variables is copied over the other one; otherwise the second set is copied over the first one. set0 is copied over set 1 otherwise vice versa function chk() computes the XOR (exclusive) of all the variables in the set of a zero and b zero CMPE_516 SPRING 2007

Detecting and Correcting Transient Faults Affecting Code The first technique consists in executing any operation twice and then verify the coherency of the resulting execution flow. Since most operations are already duplicated due to the application of the rules introduced in the previous section (Detecting and Correcting Transient Faults Affecting Data), this idea mainly requires the duplication of the jump instructions. Example of application of transformation rules targeting faults affecting data CMPE_516 SPRING 2007

Detecting and Correcting Transient Faults Affecting Code Duplication of the jump instructions can be achieved by repeating twice the evaluation of the condition. So, it verifies the coherency of the resulting execution flow. Example of application of transformation rule targeting faults affecting conditional statements CMPE_516 SPRING 2007

Detecting and Correcting Transient Faults Affecting Code The second technique aims at detecting those faults modifying the code so that incorrect jumps are executed, resulting in a faulty execution flow. This is obtained by associating an identifier to each basic block in the code. Example of application of transformation rule targeting faults introducing faulty jumps in the execution flow An additional instruction is added at the beginning of each block of instructions. The added instruction writes the identifier associated to the block in an ad hoc variable, whose value is then checked for consistency at the end of the block. CMPE_516 SPRING 2007

Detecting and Correcting Transient Faults Affecting Code The recovery procedure consists in a rollback scheme: as soon as a fault affecting the program execution flow is detected, the program is restarted from the data manipulation phase. Thus; Detect and correct transient faults located in the processor internal memory elements (program counter, stack pointer, stack memory) that temporarily modify the program execution flow Detect transient faults originated in the processor code segment (where the program binary code is stored) that permanently modify the program execution flow CMPE_516 SPRING 2007

Software-Based Fault Tolerance As soon as a SEU hits the program code memory, the bit-flip it produces is indeed permanently stored in the memory, causing permanent modification to the program binary code. Restarting the program execution when such a kind of fault is detected is insufficient for removing the fault from the system. As a result, the program enters in an end-less loop, since it is restarted every time the fault is detected. This situation can be easily identified by a watchdog timer which brings the system back into normal operation. There is an important point that we have to concern CMPE_516 SPRING 2007

Experimental Results Firstly, we obtained fault tolerant versions of the benchmark programs by applying proposed code transformation rules. Secondly, evaluated the area overhead our method introduces by measuring the size of the code and data segments. Moreover, we measured the time overhead the method introduces. Secondly, …………… .. …….. of the fault tolerant versions and by relating them with those of the unhardened ones. Moreover, …. …. ……… , as the ratio between the number of clock cycles needed for executing the fault tolerant programs and the unhardened ones. DURING the experiments randomly bit-flips are injected in the program data and code (binary codes the processor executes) segments. CMPE_516 SPRING 2007

Experimental Results Fault affects are classified according to the following categories; Wrong answer Effect-less Time-out Corrected Detected Wrong answer: the results produced by the faulty processor are different than those produced by the fault-free processor. Effect-less: the results produced by the faulty processor are equal to those produced by the faulty-free processor. Time-out: the faulty processor is not able to produce the expected result after a given amount of time Corrected: the fault is detected and corrected, and thus the processor is able to produce the expected results. Detected: the fault is detected but can not be corrected. CMPE_516 SPRING 2007

Experimental Results In experiment three benchmark programs are used Sieve of Eratosthenes Bubble sort Matrix Intel 8051 instruction set is used with 128-bytes internal memory the Sieve of Eratosthenes is a simple, ancient algorithm for finding all prime numbers up to a specified integer. CMPE_516 SPRING 2007

Experimental Results We can state that the proposed approach is able to provide fault tolerance while reducing the memory overhead with respect to the TMR approach, while the performance penalties introduced by the two methods are comparable. Data Segment Increase Code Segment Increase CPU Time Increase Proposed Method 2.2 3.5 3.1 TMR 3.0 While evaluating the overhead our approach introduces, we recorded an increase factor for the data segment of all the benchmarks of 2.2, an average increase factor of 3.5 for the code segment and an average time overhead of 3.1 To compare these figures with a reference approach, we implemented the TMR version of the considered benchmarks, obtaining an average data segment overhead of 3.5, an average code segment increase of 3.0 and an average time overhead of 3.1 CMPE_516 SPRING 2007

Experimental Results The results we gathered during fault injection experiments are reported in Table 1 and 2 where transient faults injected in the unhardened programs are categorized according to their affects and then compared with those injected in the fault tolerant versions. The figures show that the proposed method is able to significantly improve the error detection and correction capabilities of a given application. As far as faults inside the data segment are considered, it is observed that the method provides complete fault coverage: the number of wrong answer is indeed reduced from 3054 (total) to zero for the hardened programs. CMPE_516 SPRING 2007

Experimental Results The same result was observed when faults affecting the code segment were analyzed, where wrong answers are reduced to zero . From Table 1 and 2, we can also observe that many faults exist that can only be detected. Most of them are provoked by SEUs hitting the memory area storing the result of the program at the very beginning of the program execution. Many faults hitting the code area are classified as time-out. These are faults that let the program enter in an end-less loop as described before and that triger the watchdog timer. CMPE_516 SPRING 2007

Conclusion and Future Work The paper presented a software-implemented approach for tolerating the effects of transient faults affecting both the data and the code segments of unhardened processors. The major novelty of the approach is the possibility of correcting SEU effects before they result in program errors. Fault detection and correction is obtained by resorting to variable duplication and checksum computation for guaranteeing the coherency of the replicated information. CMPE_516 SPRING 2007

Conclusion and Future Work Moreover, control flow instruction duplication in conjunction with a rollback scheme is used to detect and correct faults modifying the program execution flow. It has seen that the method is able to tolerate all the injected faults, at a cost of about 2.5 times increase in the program execution time. As future work, the method would be extend to cope with other fault type, namely permanent faults and faults provoked by electromagnetic interference. CMPE_516 SPRING 2007

References P.Cheynet, B. Nicolescu, R. Velazco, M. Rebaudengo, M. Sonza Reorda, and M. Violante, “Experimentally Evaluating an Automatic Approach for Generating Safety-Critical Software with Respect to Transient Errors”, IEEE Trans. On Nuclear Science, vol.47, no.6, pp. 2231-2236, 2000. M. Rebaudengo, M. Sonza Reorda, and M. Violante, “A New Software-Based Technique for Low-Cost Fault-Tolerant Application”, IEEE Reliability and Maintainability Symposium, pp. 25-28, 2003. P.L. Civera, L. Macchiarulo, M. Rebaudengo, M. Sonza and M. Violante, “Exploiting Circuit Emulation for Fast Hardness Evaluation”, IEEE Transactions on Nuclear Science, Vol.48, No.6, pp.2210-2216, 2001 M. Rebaudengo, M. Sonza Reorda, M. Torchiano, and M. Violante, “A source-to-Source Compiler for Generating Dependable Software”, IEEE Intl. Workshop on Source Code Analysis and Manipulation, pp. 33-42, 2001 N. Oh, S. Mitra, and E.J. McCluskey, “ED4I: Error Detection by Divers Data and Duplicated Instructions”, IEEE Trans. on Computer, vol. 51, no.2, pp. 180-199, 2002. N. Oh, P.P. Shirvani, and E.J. McCluskey, “Control Flow Checking by Software Signature”, IEEE Trans. on Reliability, vol. 51, no. 1, pp. 111-122, 2002. CMPE_516 SPRING 2007

Any Questions? CMPE_516 SPRING 2007