Enhancing The Fault-Tolerance of Nonmasking Programs Sandeep S. Kulkarni and Ali Ebnenasir Software Engineering and Network Systems Laboratory Computer.

Presentation transcript:

Enhancing The Fault-Tolerance of Nonmasking Programs Sandeep S. Kulkarni and Ali Ebnenasir Software Engineering and Network Systems Laboratory Computer Science and Engineering Department Michigan State University

Acknowledgement This work is partially sponsored by: NSF, DARPA NEST, ONR URI, and Michigan State University

Motivation
- Programs are subject to unanticipated faults
  - When new classes of faults are encountered, add the corresponding fault-tolerance
- How to add fault-tolerance?
  - Develop from scratch (expensive approach)
  - Incrementally add fault-tolerance
    - Reuse the behaviors of the fault-intolerant program
    - Potential to preserve properties that are hard to specify (e.g., efficiency)
- How to ensure correctness?
  - After-the-fact verification
  - Automatic addition of fault-tolerance (correct by construction)

Motivation (Continued)
- Problem: complexity of automatic addition
  - Automatic addition of fault-tolerance to distributed programs is NP-hard [FTRTFT00], [ICDCS02]
- How do we deal with this complexity?
  - Develop heuristics
  - Identify the boundary of polynomial-time addition
  - Step-wise addition (weaker forms of fault-tolerance)
- The goal of this paper
  - Enhance the fault-tolerance of nonmasking programs
  - Partially automate the design of fault-tolerant programs

Outline
- Preliminary Concepts
- Enhancement Problem
- Enhancement in High Atomicity Model
- Enhancement for Distributed Programs
- Example: Byzantine Agreement Program
- Conclusion and Future Work

Preliminary Concepts: Programs and Faults
- Finite state space S_p
- Invariant S, fault-span T ⊆ S_p
- Program p, fault f, safety ⊆ { (s0, s1) | (s0, s1) ∈ S_p × S_p }
- Fault-tolerance: failsafe, nonmasking, masking
[Figure: regions S, T, and S_p, with program and fault transitions]
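A minimal Python sketch of these definitions, assuming states are small integers and that p and f are sets of (s0, s1) pairs. The helper names is_closed and reachable and the toy program are illustrative assumptions, not taken from the slides. Computing T as the set of states reachable from S under p ∪ f is one common way to obtain a fault-span and is not claimed to be the authors' construction.

```python
from typing import Set, Tuple

State = int
Transition = Tuple[State, State]


def is_closed(pred: Set[State], transitions: Set[Transition]) -> bool:
    """True if no transition that starts inside pred ends outside pred."""
    return all(t in pred for (s, t) in transitions if s in pred)


def reachable(start: Set[State], transitions: Set[Transition]) -> Set[State]:
    """All states reachable from start by following the given transitions."""
    seen = set(start)
    frontier = list(start)
    while frontier:
        s = frontier.pop()
        for (a, b) in transitions:
            if a == s and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


# Toy instance (hypothetical): state space S_p = {0, 1, 2, 3}.
p = {(0, 1), (1, 0), (2, 0)}   # program transitions
f = {(1, 2)}                   # fault transitions
S = {0, 1}                     # invariant: closed under p alone
T = reachable(S, p | f)        # candidate fault-span: closed under p and f
```

Here S is closed under p but not under p ∪ f, while T is closed under both, which is the relationship between an invariant and a fault-span that the slide describes.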

Step-Wise Addition
[Figure: intolerant program, failsafe fault-tolerant program, nonmasking fault-tolerant program, and masking fault-tolerant program, connected by addition steps labeled "This paper", [FTRTFT00], and [ICDCS02]]

Enhancement Problem
- Input to the synthesis algorithm: nonmasking program p, specification Spec, invariant S, faults f
- Output: masking program p', invariant S', fault-span T'
- Requirements: only fault-tolerance is added; no new functional behavior is added
- S' = T' ∩ S
[Figure: regions S, T, and S_p with fault transitions f, and the new fault-span T']

Enhancement in High Atomicity Model

High Atomicity Model
- Each process can read/write all program variables
- ms: states from where safety will be violated by fault transitions f
[Figure: regions T, S, and ms]
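One way to compute ms is a backward fixpoint over the fault transitions. The sketch below assumes the safety specification is given as a set of bad transitions, and it also computes mt as the set of transitions that either violate safety or reach ms. The function names compute_ms and compute_mt and the toy data are hypothetical.

```python
from typing import Set, Tuple

Transition = Tuple[int, int]


def compute_ms(faults: Set[Transition], bad: Set[Transition]) -> Set[int]:
    """States from which one or more fault transitions alone violate safety."""
    ms = {s for (s, t) in faults if (s, t) in bad}
    changed = True
    while changed:
        changed = False
        for (s, t) in faults:
            # A fault leading into ms makes its source state unsafe as well.
            if t in ms and s not in ms:
                ms.add(s)
                changed = True
    return ms


def compute_mt(transitions: Set[Transition], bad: Set[Transition],
               ms: Set[int]) -> Set[Transition]:
    """Transitions that violate safety directly or that reach a state in ms."""
    return {(s, t) for (s, t) in transitions if (s, t) in bad or t in ms}


# Toy data (hypothetical): the fault (3, 4) violates safety, and the
# fault (2, 3) leads to state 3, so both 2 and 3 belong to ms.
faults = {(2, 3), (3, 4)}
bad = {(3, 4)}
program = {(0, 1), (1, 0), (1, 2)}
ms = compute_ms(faults, bad)
mt = compute_mt(program, bad, ms)
```

The program transition (1, 2) ends up in mt because it reaches ms, so a masking program would have to give it up.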

Enhancement in High Atomicity Model (Continued)
- Deadlock states appear due to removing some transitions
- Find a state predicate T' such that:
  - T' is closed in the computations of the program in the presence of faults
  - The specification is satisfied from every state of T' (i.e., no deadlocks)
- Construct p' such that for every (s0, s1) ∈ p':
  - (s0, s1) does not violate safety
  - s0 ∈ T' ⇒ s1 ∈ T'
[Figure: regions T, S, ms, T', and S']
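The conditions on T' and p' can be approximated by a shrinking fixpoint. The sketch below is an illustration under stated assumptions and is not the authors' ConstructFaultSpan: it assumes safety is given as a set of bad transitions, that terminating states of the original program are modeled with self-loops (so any state without an outgoing transition is a deadlock), and it does not check recovery to the invariant. The name enhance_high_atomicity and the toy data are hypothetical.

```python
from typing import Set, Tuple

Transition = Tuple[int, int]


def enhance_high_atomicity(p: Set[Transition], f: Set[Transition],
                           T: Set[int], bad: Set[Transition]):
    """Return (T1, p1): a shrunken fault-span and the transitions kept in it."""
    # ms: states from which fault transitions alone violate safety.
    ms = {s for (s, t) in f if (s, t) in bad}
    grew = True
    while grew:
        grew = False
        for (s, t) in f:
            if t in ms and s not in ms:
                ms.add(s)
                grew = True
    # Keep only program transitions that neither violate safety nor reach ms.
    safe_p = {(s, t) for (s, t) in p if (s, t) not in bad and t not in ms}
    T1 = set(T) - ms
    while True:
        # States from which a fault would leave T1 break closure.
        leak = {s for (s, t) in f if s in T1 and t not in T1}
        # States with no safe transition staying inside T1 are deadlocks.
        live = {s for (s, t) in safe_p if s in T1 and t in T1}
        remove = leak | (T1 - live)
        if not remove:
            break
        T1 -= remove
    p1 = {(s, t) for (s, t) in safe_p if s in T1 and t in T1}
    return T1, p1


# Toy instance (hypothetical): fault (3, 4) violates safety, so ms = {3}.
p = {(0, 0), (1, 0), (2, 1), (2, 3), (3, 3), (5, 3)}
f = {(0, 2), (3, 4)}
T = {0, 1, 2, 3, 4, 5}
bad = {(2, 3), (3, 4)}
T1, p1 = enhance_high_atomicity(p, f, T, bad)
```

An empty T1 corresponds to the case where the sketch finds no masking program. States 4 and 5 are dropped here because they become deadlocked once the unsafe transitions are removed.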

Enhancement:
HighAtomicityEnhancement(p, f: transitions, T: StatePredicate, specification spec)
{
  1. Calculate ms; Calculate mt;
  2. T' = ConstructFaultSpan( );
  3. if (T' = {})
       declare no masking f-tolerant program exists; exit;
     else
       Construct the transitions of p';
}

Addition:
AddMasking(p, f: transitions, S: StatePredicate, specification spec)
{
  1. Calculate ms; Calculate mt;
  repeat
    4-1) ...
    T := ConstructFaultSpan( );
    4-4) ...
    if (S = {} \/ T = {})
      declare no masking f-tolerant program exists; exit;
  until (ExitConditionHolds);
  5. Remove cycles outside the invariant in T;
  6. Construct the transitions of p';
}

[Figure: fault-intolerant program, nonmasking program, masking program; labels: Manual, Automatic: Enhancement, Partial Automation, [FTRTFT00]]

Enhancement For Distributed Programs

Difficulties with Distribution
- Read/write restrictions (low atomicity model)
- Example: a program p with two processes j and k and two Boolean variables a and b; process j cannot read b
- Can we include the following transition?
    (a=0, b=0) → (a=1, b=0)
- Only if we also include the transition
    (a=0, b=1) → (a=1, b=1)
- Groups of transitions (instead of individual transitions) must be chosen
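The grouping rule can be made concrete with a short Python sketch. It assumes a state is a dict from variable names to values and that a process never writes a variable it cannot read. The function transition_group is a hypothetical helper, not part of any tool mentioned in the slides.

```python
from itertools import product
from typing import Dict, Iterable, Set, Tuple

Frozen = Tuple[Tuple[str, int], ...]


def freeze(state: Dict[str, int]) -> Frozen:
    """Hashable form of a state: sorted (variable, value) pairs."""
    return tuple(sorted(state.items()))


def transition_group(src: Dict[str, int], dst: Dict[str, int],
                     unreadable: Iterable[str],
                     domains: Dict[str, Tuple[int, ...]]) -> Set[Tuple[Frozen, Frozen]]:
    """All transitions the process cannot tell apart from src -> dst."""
    names = sorted(unreadable)
    if any(src[v] != dst[v] for v in names):
        raise ValueError("a process may not change a variable it cannot read")
    group = set()
    # Every valuation of the unreadable variables yields one group member.
    for values in product(*(domains[v] for v in names)):
        hidden = dict(zip(names, values))
        group.add((freeze({**src, **hidden}), freeze({**dst, **hidden})))
    return group


# The example from the slide: process j cannot read b.
domains = {"a": (0, 1), "b": (0, 1)}
group = transition_group({"a": 0, "b": 0}, {"a": 1, "b": 0}, {"b"}, domains)
```

For the slide's example the group has exactly two members, the transition with b=0 and the one with b=1, so a synthesis algorithm must include or exclude both together.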

Enhancement of Nonmasking Distributed Programs
[Flowchart]
1. Start
2. Calculate T'_high
3. Calculate S'_init = S'_low
4. Calculate S_reachable from S'_low by fault/program transitions
5. If S_reachable = {}: set T' = S'_low, calculate the transitions of p', and stop
6. Otherwise, calculate S_recovery, from where recovery is possible to S'_low (search in T'_high - S'_low under the distribution restrictions)
7. If S_recovery = {}: declare failure
8. Otherwise, set S'_low = S'_low ∪ S_recovery and return to step 4

A High Atomicity Fault-Span
- T'_high: the largest possible domain for the states that can be included in the fault-span of the distributed program
- S'_high = S ∩ T'_high
[Figure: regions T, S, ms, and T'_high]

The Initial Low Atomicity Invariant
- Remove states from where an outgoing transition crosses the boundary of S'_high (e.g., s0)
- Removal is a non-deterministic choice when there is more than one state to remove
[Figure: regions T'_high, S'_high, and S'_init, with state s0]

Single-Step Reachable States
- Reachable by a fault/program transition (denoted S_reachable)
[Figure: regions T'_high, S_reachable, S'_low, and S'_init; states s0, s1, s2, s3; fault transition f]

Single-Step Recovery States
- Safer recovery in a single step (denoted S_recovery)
- Goal: infinite computations are possible from all states in S'_low
- s0 represents a typical recovery state
[Figure: regions T'_high, S_recovery, S'_init, and S'_low; states s0, s2, s3]

Enhancement of Nonmasking Distributed Programs
[Flowchart, shown again] Start; calculate T'_high; calculate S'_init = S'_low; calculate S_reachable from S'_low by fault/program transitions. If S_reachable = {}: T' = S'_low, calculate p' transitions, stop. Otherwise calculate S_recovery, from where recovery is possible to S'_low. If S_recovery = {}: declare failure. Otherwise S'_low = S'_low ∪ S_recovery and repeat.
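The control flow of this loop can be sketched in Python. This is a simplified illustration only: it ignores the read/write (grouping) restrictions that the actual algorithm must respect, it treats a state as recoverable when a single program transition leads from it into S'_low, and it returns None for the "declare failure" branch. The function grow_low_invariant and the toy data are hypothetical.

```python
from typing import Optional, Set, Tuple

Transition = Tuple[int, int]


def grow_low_invariant(T_high: Set[int], S_init: Set[int],
                       program: Set[Transition],
                       faults: Set[Transition]) -> Optional[Set[int]]:
    """Grow S_low until nothing outside it is reachable in one step."""
    S_low = set(S_init)
    steps = program | faults
    while True:
        # Single-step reachable states outside the current S_low.
        S_reachable = {t for (s, t) in steps if s in S_low and t not in S_low}
        if not S_reachable:
            return S_low          # closed: use T' = S_low
        # Single-step recovery states, searched inside T_high - S_low.
        S_recovery = {
            s for s in S_reachable & T_high
            if any(a == s and b in S_low for (a, b) in program)
        }
        if not S_recovery:
            return None           # declare failure
        S_low |= S_recovery       # strictly grows, so the loop terminates


program = {(0, 0), (1, 0), (2, 1)}
T_high = {0, 1, 2, 3}
ok = grow_low_invariant(T_high, {0}, program, {(0, 1), (1, 2)})
failed = grow_low_invariant(T_high, {0}, program, {(0, 3)})
```

In the first call, states 1 and 2 are added in successive iterations. In the second call, state 3 is reachable by a fault but has no recovery transition, so the sketch reports failure, mirroring the sound-but-not-complete behavior noted in the conclusion.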

Example: Byzantine Agreement
- Why this example?
  - It was used to illustrate the addition of masking fault-tolerance in [SRDS01]
  - Manual enhancement has already been applied [TSE98]
- Processes: a general g and three non-generals j, k, and l
- Variables:
  - d.g : {0, 1}
  - d.j, d.k, d.l : {0, 1, ⊥}
  - b.g, b.j, b.k, b.l : {0, 1}
  - f.j, f.k, f.l : {0, 1}
- Safety specification:
  - Agreement: no two non-Byzantine non-generals can finalize with different decisions
  - Validity: if g is not Byzantine, no process can finalize with a decision different from that of g
  - A finalized process should not execute any transition
[Figure: general g connected to non-generals j, k, and l]
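The safety specification can be written as executable predicates. The sketch below assumes d, b, and f are dicts keyed by process name, and that validity and the no-change rule constrain only non-Byzantine non-generals, which is an assumption beyond the wording on the slide. All function names are hypothetical.

```python
NON_GENERALS = ("j", "k", "l")


def agreement_holds(d, b, f) -> bool:
    """No two non-Byzantine non-generals have finalized different decisions."""
    decisions = {d[p] for p in NON_GENERALS if not b[p] and f[p]}
    return len(decisions) <= 1


def validity_holds(d, b, f) -> bool:
    """If g is not Byzantine, finalized decisions match the general's."""
    if b["g"]:
        return True
    return all(d[p] == d["g"] for p in NON_GENERALS if not b[p] and f[p])


def finalized_unchanged(d0, b0, f0, d1, f1) -> bool:
    """A finalized, non-Byzantine process keeps its decision and its flag."""
    return all((d0[p], f0[p]) == (d1[p], f1[p])
               for p in NON_GENERALS if f0[p] and not b0[p])


# Hypothetical state: j and l have finalized with different decisions.
d = {"g": 1, "j": 1, "k": 1, "l": 0}
f = {"j": 1, "k": 0, "l": 1}
honest = {"g": False, "j": False, "k": False, "l": False}
l_byzantine = {**honest, "l": True}
```

With all processes honest this state violates both agreement and validity. Once l is marked Byzantine, its decision no longer counts and both predicates hold.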

Example: Byzantine Agreement
- Read/write restrictions
  - Process j can read b.j, d.j, f.j, d.g, d.k, d.l
  - Process j can write d.j, f.j
- Dijkstra's guarded commands
  - Guard → Statement denotes { (s0, s1) | Guard holds at s0 and atomic execution of Statement yields s1 }
- Nonmasking fault-tolerant program transitions
    d.j = ⊥ ∧ f.j = 0 → d.j := d.g
    d.j ≠ ⊥ ∧ f.j = 0 → f.j := 1
    d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
    d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
- Fault transitions
    ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
    b.j → d.j := 0|1
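The set-of-transitions reading of a guarded command can be enumerated directly over a small state space. The Python sketch below does this for the first program action, restricted to the three variables that the action mentions. The helper transitions_of, the variable names d_g, d_j, f_j, and the use of None for ⊥ are assumptions made for illustration.

```python
from itertools import product

BOT = None  # stands for the undefined decision value ⊥


def transitions_of(guard, statement, variables):
    """Enumerate { (s0, s1) | guard holds at s0 and statement yields s1 }.

    variables maps each variable name to its finite domain; states are
    returned as tuples of values in sorted variable-name order.
    """
    names = sorted(variables)
    result = set()
    for values in product(*(variables[n] for n in names)):
        s0 = dict(zip(names, values))
        if guard(s0):
            s1 = dict(s0)
            statement(s1)  # atomic execution: update a copy in place
            result.add((tuple(s0[n] for n in names),
                        tuple(s1[n] for n in names)))
    return result


variables = {"d_g": (0, 1), "d_j": (0, 1, BOT), "f_j": (0, 1)}


def guard(s):
    # d.j = ⊥ ∧ f.j = 0
    return s["d_j"] is BOT and s["f_j"] == 0


def copy_decision(s):
    # d.j := d.g
    s["d_j"] = s["d_g"]


copy_transitions = transitions_of(guard, copy_decision, variables)
```

Each state tuple is ordered as (d_g, d_j, f_j). The action yields two transitions, one per value of d.g, which shows how a single guarded command stands for a group of state transitions.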

Example: Byzantine Agreement (Continued)
[Figure: state-transition diagram over states s0 to s4]
- s0: d.j = d.k = ⊥, d.g = 1, d.l = 1, f.l = 0
- s1: d.j = d.k = ⊥, d.g = 1, d.l = 1, f.l = 1
- s2 or s3: d.j = d.k = ⊥, d.g = 0, d.l = 1, f.l = 1
- s4: d.j = d.k = 0, d.g = 0, d.l = 1, f.l = 1
- Labels in the figure: a good transition inside the invariant; premature finalization; fault transition; b.g = 1; a deadlock state
- Why is enhancement easier?

Example: Byzantine Agreement (Continued)
- Masking fault-tolerant program
    d.j = ⊥ ∧ f.j = 0 → d.j := d.g
    d.j ≠ ⊥ ∧ f.j = 0 → f.j := 1
    d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
    d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
  Additional guard fragment shown on the slide: ∧ ((d.j = d.k) ∨ (d.j = d.l)) ∧ (f.j = 0)
- High atomicity reasoning: synthesize a masking program in high atomicity and then refine it to a distributed program

Enhancement vs. Addition
- Reuse the computations of the nonmasking program
- Reasoning in the high atomicity model has the potential to reduce the complexity of addition

Synthesis Framework
- Development of a synthesis framework
  - Developers of fault-tolerance can interactively add fault-tolerance to fault-intolerant programs
  - Partial automation helps us reap the benefits of automation as much as possible
  - Enhancement identifies programs where partial automation is possible
- Implementation of enhancement algorithms in the synthesis framework

Conclusion and Future Work
- Enhancement simplifies automated design of masking programs
  - Lower asymptotic complexity
  - Polynomial-time enhancement in the low atomicity model (in the state space of the nonmasking program)
  - Sound, but not complete
- Reasoning in high atomicity simplifies the synthesis of masking distributed programs
- Future work: a polynomial-time sound and complete enhancement algorithm for a restricted class of programs and specifications

Thank You! Questions?

Example: Triple Modular Redundancy
- Processes: three processes j, k, and l
- Variables and their domains
  - in.j, in.k, and in.l are Boolean variables
  - out belongs to {0, 1, ⊥}
- Nonmasking program (+ is addition modulo 3):
    N1: (out = ⊥) → out := in.j
    N2: (out != ⊥) /\ (out != in.j) /\ ((in.j = in.k) \/ (in.j = in.l)) → out := in.j
- Faults:
    F: (in.j = in.k) /\ (in.j = in.l) → in.j := 0|1
- Safety specification:
  - Do not reach states where out differs from the majority of the inputs
  - out should not be changed after it is assigned a value

Example: Triple Modular Redundancy
- Invariant: S = ((out = ⊥) /\ (in.j = in.k = in.l)) \/ (out = in.j = in.k) \/ (out = in.j = in.l) \/ (out = in.k = in.l)
- Fault-span: T = ((in.j = in.k = in.l) => ((out = ⊥) \/ (out = in.j = in.k = in.l)))
- Enhancement algorithm:
  - Compute ms: ms = { }
  - Remove bad transitions: {t: t violates safety} and {t: t reaches ms}
  - Construct a new fault-span T': T' = T - { s: (out != ⊥) /\ (out is not equal to the majority of inputs) }
- Masking program:
    M1: (out = ⊥) /\ ((in.j = in.k) \/ (in.j = in.l)) → out := in.j
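The difference between N1 and M1 can be checked by brute force over the 24 states of this example. The Python sketch below encodes a state as (in_j, in_k, in_l, out), uses None for ⊥, and treats a state as bad when out is assigned and differs from the majority of the inputs. The encoding and the helper names are assumptions made for illustration.

```python
from itertools import product

BOT = None  # stands for the unassigned output value ⊥


def majority(j, k, l):
    """Majority value of three Boolean inputs."""
    return 1 if j + k + l >= 2 else 0


def is_bad(state):
    """out is assigned and differs from the majority of the inputs."""
    j, k, l, out = state
    return out is not BOT and out != majority(j, k, l)


def n1(state):
    """N1: (out = ⊥) → out := in.j; returns None when the guard is false."""
    j, k, l, out = state
    return (j, k, l, j) if out is BOT else None


def m1(state):
    """M1: (out = ⊥) /\\ ((in.j = in.k) \\/ (in.j = in.l)) → out := in.j."""
    j, k, l, out = state
    if out is BOT and (j == k or j == l):
        return (j, k, l, j)
    return None


STATES = list(product((0, 1), (0, 1), (0, 1), (BOT, 0, 1)))


def unsafe_sources(action):
    """States from which the action leads directly to a bad state."""
    return [s for s in STATES if action(s) is not None and is_bad(action(s))]
```

In this encoding N1 reaches a bad state exactly when in.j has been corrupted to the minority value while out is still unassigned, whereas M1 never does, because its guard only fires when in.j agrees with at least one other input.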

Enhancement of Nonmasking Distributed Programs
[Flowchart repeated from the earlier slide of the same title] S'_init = S'_low at the first iteration.