Langley Research Center SPIDER Formal Models–Where are we now? Paul S. Miner In collaboration with: Alfons Geser (NIA), Jeff Maddalon,


Langley Research Center SPIDER Formal Models–Where are we now? Paul S. Miner In collaboration with: Alfons Geser (NIA), Jeff Maddalon, and Lee Pike Internal Formal Methods Workshop NASA Langley Research Center September 3, 2015

Langley Research Center, October 22, 2003, SPIDER Update

What is SPIDER?
A family of fault-tolerant IMA architectures
  – Architecture concept due to Paul Miner, Mahyar Malekpour, and Wilfredo Torres-Pomales
Inspired by several earlier designs
  – Main concept inspired by Palumbo’s Fault-tolerant processing system (U.S. Patent 5,533,188)
    – Developed as part of Fly-By-Light/Power-By-Wire project
  – Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; …

SPIDER Architecture
N general purpose Processing Elements (PEs) logically connected via a Reliable Optical BUS (ROBUS)
  – A PE could be a general purpose processor, remote data concentrator, sensor, actuator, or any other device that needs to reliably communicate with other PEs
SPIDER must be sufficiently reliable to support several aircraft functions
  – Persistent loss of single function could be catastrophic
The ROBUS is an ultra-reliable unit providing basic fault-tolerant communication services
ROBUS contains no software

Logical view of SPIDER (Sample Configuration)
[Figure: logical view of a sample SPIDER configuration; the only recoverable label is ROBUS]

Design Objectives
FT-IMA Architecture proven to survive a bounded number of physical faults
  – Both permanent and transient
  – Must survive Byzantine faults
Capability to survive or quickly recover from massive correlated transient failure (e.g. in response to HIRF)

Byzantine Faults
Characterized by asymmetric error manifestations
  – different manifestations to different fault-free observers
  – including dissimilar values
Can cause redundant computations to diverge
If not properly handled, single Byzantine fault can defeat several layers of redundancy
Many architectures neglect this class of fault
  – Assumed to be rare or even impossible

Byzantine faults are real
Several examples cited in Byzantine Faults: From Theory to Reality, Driscoll, et al. (to appear in SAFECOMP 2003)
  – Byzantine failures nearly grounded a large fleet of aircraft
  – Quad-redundant system failed in response to a single fault
  – Typical cases are faulty transmitters (resulting in indeterminate voltage levels at receivers) or faults that cause timing violations (so that multiple observers perceive the same event differently)
Heavy Ion fault-injection results for TTP/C (Sivencrona, et al.)
  – more than 1 in 1000 of observed errors had Byzantine manifestations
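The faulty-transmitter case can be made concrete with a small sketch. It assumes two fault-free receivers whose input thresholds differ slightly; the voltages, thresholds, and receiver names are hypothetical values chosen for illustration and do not describe any real bus.

```python
# Hypothetical illustration: one marginal signal, two fault-free receivers.
# All voltages and thresholds are made-up values, not taken from a real bus.

def decode_bit(voltage: float, threshold: float) -> int:
    """Return the logic level a receiver perceives for a given input voltage."""
    return 1 if voltage >= threshold else 0

# Two fault-free receivers whose thresholds differ slightly within tolerance.
receiver_thresholds = {"receiver_a": 1.45, "receiver_b": 1.55}

healthy_signal = 3.3   # a clean logic 1
marginal_signal = 1.5  # a faulty transmitter stuck at an indeterminate level

healthy_view = {
    name: decode_bit(healthy_signal, t) for name, t in receiver_thresholds.items()
}
marginal_view = {
    name: decode_bit(marginal_signal, t) for name, t in receiver_thresholds.items()
}

print(healthy_view)   # both receivers perceive the same bit
print(marginal_view)  # the receivers perceive different bits
```

Neither receiver is faulty here, yet they disagree about what was sent. That is the asymmetric manifestation the slide describes.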

SPIDER Advantages
Fault-Tolerance independent of applications
Tolerates more failures
  – including any single Byzantine fault (and some combinations)
  – including many combinations of less severe failures
  – Hybrid fault model: good, asymmetric, symmetric, benign
Does not require that nodes fail silent
  – But can take advantage when they do
Simpler, stronger protocols with stronger assurance
Can gracefully evolve to accommodate parts obsolescence
  – Off-the-shelf processors and low-level communication
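One way to read the four classes of the hybrid fault model is by what fault-free receivers observe from a source in each class. The sketch below is an illustration under that reading, not the SPIDER PVS definitions: `FaultClass`, `observations`, and the specific wrong values are hypothetical, and an asymmetric source may in general behave arbitrarily rather than follow the fixed pattern used here.

```python
from enum import Enum
from typing import Dict, List, Optional


class FaultClass(Enum):
    GOOD = "good"
    BENIGN = "benign"
    SYMMETRIC = "symmetric"
    ASYMMETRIC = "asymmetric"


def observations(
    fault: FaultClass, intended: int, receivers: List[str]
) -> Dict[str, Optional[int]]:
    """Return what each fault-free receiver observes from one source."""
    if fault is FaultClass.GOOD:
        # Every receiver sees the intended value.
        return {r: intended for r in receivers}
    if fault is FaultClass.BENIGN:
        # Every receiver detects the fault; None stands for "no usable value".
        return {r: None for r in receivers}
    if fault is FaultClass.SYMMETRIC:
        # Every receiver sees the same wrong value.
        return {r: intended + 100 for r in receivers}
    # Asymmetric (Byzantine): receivers may see different wrong values.
    return {r: intended + 1 + i for i, r in enumerate(receivers)}


receivers = ["r1", "r2", "r3"]
for fault in FaultClass:
    print(fault.value, observations(fault, 7, receivers))
```

Only the asymmetric class lets good receivers end up with differing views, which is why the protocols treat it as the most severe case.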

Failures contained by ROBUS
Arbitrary failure in any attached Processing Element
  – Hardware or Software
  – Converts potential asymmetric error manifestations to symmetric
  – ROBUS provides a partitioning mechanism between PEs
Must also operate correctly if a bounded number of internal hardware devices fail
Cannot tolerate design error within ROBUS

Design Assurance Strategy
Fault-tolerance protocols and reliability models use the same fault classifications
Reliability analysis using SURE (Butler & White)
  – Calculates P(enough good hardware)
Formal proof of fault-tolerance protocols using PVS (SRI)
  – enough good hardware => correct operation
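The quantity P(enough good hardware) can be illustrated with a much simpler calculation than the one SURE performs. The sketch below is not SURE: it assumes independent nodes with a constant failure rate, ignores fault classes and recovery, and uses a hypothetical failure rate. The function names are illustrative only.

```python
import math


def p_node_failed(rate_per_hour: float, mission_hours: float) -> float:
    """P(a node has failed by the end of the mission), constant failure rate."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for small x.
    return -math.expm1(-rate_per_hour * mission_hours)


def p_not_enough_good(n: int, min_good: int, p_fail: float) -> float:
    """P(fewer than min_good of n independent nodes are still good)."""
    # Fewer than min_good good nodes means more than n - min_good failed.
    return sum(
        math.comb(n, k) * p_fail**k * (1.0 - p_fail) ** (n - k)
        for k in range(n - min_good + 1, n + 1)
    )


# Hypothetical numbers: 1e-5 failures per hour over a 10 hour mission.
p = p_node_failed(1e-5, 10.0)
simplex = p_not_enough_good(1, 1, p)   # a single node must survive
triplex = p_not_enough_good(3, 2, p)   # any 2 of 3 nodes must survive
print(f"simplex: {simplex:.3e}  triplex: {triplex:.3e}")
```

The formal proofs then carry the other half of the argument: whenever enough hardware is good, the protocols operate correctly.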

Strength of Formal Verification
Proofs equivalent to testing the protocols
  – for all specified ROBUS configurations
  – for all combinations of faults that satisfy the maximum fault assumption for each specified ROBUS configuration
  – for all specified message values
The PVS proofs provide verification coverage equivalent to an infinite number of test cases.
  – Provided that the PVS model of the protocols is faithful to the VHDL design

ROBUS Characteristics
All good nodes agree on communication schedule
  – Currently bus access schedule statically determined
    – similar to SAFEbus, Time-Triggered Architecture (TTA)
  – Architecture supports on-the-fly schedule updates
    – similar to FTPP
    – Preliminary capability will be in our next prototype
Some fault-tolerance capabilities must be provided by processing elements
  – Analogous to Fault Tolerance Layer in TTA
Processing Elements need not be uniform
  – Some support for dissimilar architectures

Logical View of ROBUS
ROBUS operates as a time-division multiple access broadcast bus
ROBUS strictly enforces write access
  – no babbling idiots (prevented by ROBUS topology)
Processing nodes may be grouped to provide differing degrees of fault-tolerance
  – PEs cannot exhibit Byzantine errors (prevented by ROBUS topology)
  – Simple N-modular redundancy strategies sufficient for PEs
  – Redundancy management for these groupings done by the PEs
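The logical behavior of a time-division bus with enforced write access can be sketched as a small behavioral model. This is only a model of the visible effect: the real ROBUS contains no software and enforces access through its topology. `TdmaBus`, `run_slot`, and the PE names are hypothetical.

```python
from typing import Dict, List, Optional


class TdmaBus:
    """Behavioral model of a broadcast bus with a static slot schedule."""

    def __init__(self, schedule: List[str]) -> None:
        if not schedule:
            raise ValueError("schedule must name at least one slot owner")
        self.schedule = schedule
        self.slot = 0

    def run_slot(self, attempts: Dict[str, str]) -> Optional[str]:
        """Broadcast only the slot owner's message; ignore every other sender."""
        owner = self.schedule[self.slot % len(self.schedule)]
        self.slot += 1
        return attempts.get(owner)


bus = TdmaBus(["PE1", "PE2", "PE3"])

# PE2 is a babbling node: it tries to write in every slot.
slot_attempts = [
    {"PE1": "a", "PE2": "noise"},
    {"PE2": "noise"},
    {"PE3": "c", "PE2": "noise"},
]
delivered = [bus.run_slot(attempts) for attempts in slot_attempts]
print(delivered)  # PE2 is heard only in its own slot
```

The babbling node still gets its own slot, but it cannot corrupt the slots that belong to PE1 and PE3.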

SPIDER Topology
[Figure: SPIDER topology. Labels: PE 1, PE 2, PE 3, PE N; BIU 1, BIU 2, BIU 3, BIU N; RMU 1, RMU 2, RMU M; ROBUS N,M]

First ROBUS Prototype

First SPIDER Prototype
[Figure labels: PE & BIU 1, PE & BIU 2, PE & BIU 3, RMU 1, RMU 2, RMU 3]
Picture provided by Derivation Systems, Inc. (

ROBUS Requirements
All fault-free PEs receive identical message sequences
  – If the source is also fault-free, they receive the message sent
ROBUS provides a reliable time source (RTS)
  – The PEs are synchronized relative to this RTS
ROBUS provides correct and consistent ROBUS diagnostic information to all fault-free PEs
For 10 hour mission, P(ROBUS Failure) <

Other Requirements
Primary focus is on fault-tolerance requirements
  – Other requirements unspecified
    – Message format/encoding
    – Performance
  – These are implementation dependent
Product Family
  – capable of range of performance
  – trade-off performance and reliability
  – Formal analysis valid for any instance

ROBUS Protocols
Interactive Consistency (Byzantine Agreement)
  – loop unrolling of classic Oral Messages algorithm
  – Inspired by Draper FTP
Distributed Diagnosis (Group Membership)
  – Initially adapted MAFT algorithm to SPIDER topology
    – Depends on Interactive Consistency protocol
  – Verification process suggested more efficient protocol
    – Improved protocol due to Alfons Geser
    – Suggested further generalizations
Clock Synchronization
  – adaptation of Srikanth & Toueg protocol to SPIDER topology
  – Corresponds to Davies & Wakerly approach
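The idea of an unrolled, single-round Oral Messages exchange can be sketched as "relay, then vote". The code below is a simplified illustration, not the verified ROBUS protocol: it assumes a good source, three relays (named after RMUs), three receivers (named after BIUs), a strict-majority vote, and one relay that misbehaves asymmetrically. All function names and values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority(values: List[str]) -> Optional[str]:
    """Return the value held by a strict majority, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def relay_and_vote(
    source_value: str,
    relays: List[str],
    receivers: List[str],
    faulty: Dict[str, Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Good relays forward the source value; faulty ones send faulty[relay][receiver]."""
    results = {}
    for rx in receivers:
        received = [
            faulty[relay][rx] if relay in faulty else source_value
            for relay in relays
        ]
        results[rx] = majority(received)
    return results


relays = ["RMU1", "RMU2", "RMU3"]
receivers = ["BIU1", "BIU2", "BIU3"]
# RMU2 is asymmetric-faulty: it tells each receiver something different.
faulty = {"RMU2": {"BIU1": "X", "BIU2": "Y", "BIU3": "go"}}

print(relay_and_vote("go", relays, receivers, faulty))
```

With two good relays out of three, every receiver's vote recovers the value the good source sent, whatever the faulty relay reports.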

Recap from last year
All SPIDER fault-tolerance requirements may be realized using a repeated execution of single abstract protocol
Basic operation is single stage middle value select
  – Useful for readmission of failed nodes
Two stage middle value select ensures validity and agreement properties for Interactive Consistency, Distributed Diagnosis, and Clock Synchronization
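A two-stage middle value select can be sketched as follows. The sketch assumes three sources, three middle nodes, and three receivers, with one asymmetric-faulty source and all middle nodes good. The stage structure and the names `mvs` and `two_stage` are illustrative assumptions, not the PVS formulation.

```python
from typing import List, Sequence


def mvs(values: Sequence[int]) -> int:
    """Middle value select: the median of an odd number of values."""
    if len(values) % 2 == 0:
        raise ValueError("expected an odd number of values")
    return sorted(values)[len(values) // 2]


def two_stage(sent: List[List[int]], n_receivers: int) -> List[int]:
    """sent[i][j] is the value that source i delivers to middle node j."""
    n_middle = len(sent[0])
    # Stage 1: each middle node selects the middle of what the sources sent it.
    stage1 = [mvs([row[j] for row in sent]) for j in range(n_middle)]
    # Stage 2: the middle nodes (assumed good) broadcast their selections,
    # so every receiver selects the middle of the same list.
    return [mvs(stage1) for _ in range(n_receivers)]


sent = [
    [10, 10, 10],  # good source
    [12, 12, 12],  # good source
    [0, 11, 99],   # asymmetric-faulty source: a different value to each node
]
print(two_stage(sent, n_receivers=3))
```

After stage 1 the middle nodes may hold different values, but each lies in the range of the good sources. Stage 2 has no asymmetric senders in this scenario, so all receivers select the same value.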

Single Stage Middle Value Select
[Figure: sources x, y, z; receivers compute mvs(x,y,z)]
mvs(a,b,c) selects middle value from set {a, b, c}
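As a minimal sketch, the three-input middle value select can be written two equivalent ways: by sorting, or with only min and max operations. The min/max form is offered as an illustration and is not claimed to be how the ROBUS hardware computes it. Integer inputs are an assumption made here for simplicity.

```python
from itertools import product


def mvs(a: int, b: int, c: int) -> int:
    """Select the middle value of three inputs by sorting."""
    return sorted((a, b, c))[1]


def mvs_minmax(a: int, b: int, c: int) -> int:
    """Same selection using only min and max operations."""
    return max(min(a, b), min(max(a, b), c))


# Check that the two formulations agree on a small exhaustive domain.
assert all(
    mvs(a, b, c) == mvs_minmax(a, b, c)
    for a, b, c in product(range(-3, 4), repeat=3)
)

print(mvs(3, 1, 2), mvs(5, 5, 9))  # prints "2 5"
```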

Single Stage Middle Value Select Properties
Validity: If there is a majority of good sources, then all good receivers select a value in the range of the good sources
Agreement Propagation: If all good sources agree, and form a majority, then all good receivers will agree
Agreement Generation: If there are no asymmetric-faulty sources, then all good receivers will agree
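The three properties can be checked by brute force on a small case: two good sources x and z, one faulty source, and three good receivers. This is a bounded test over a hypothetical value domain, far weaker than the PVS proofs, and it models a faulty source only as one that always delivers some value.

```python
from itertools import product
from typing import List, Sequence


def mvs(a: int, b: int, c: int) -> int:
    """Select the middle value of three inputs."""
    return sorted((a, b, c))[1]


def receivers_select(x: int, z: int, faulty_outputs: Sequence[int]) -> List[int]:
    """Each receiver hears good sources x and z plus its own copy of the faulty value."""
    return [mvs(x, f, z) for f in faulty_outputs]


VALUES = range(0, 6)  # small hypothetical value domain

for x, z in product(VALUES, repeat=2):
    for outputs in product(VALUES, repeat=3):
        selected = receivers_select(x, z, outputs)
        # Validity: every selection lies in the range of the good sources.
        assert all(min(x, z) <= s <= max(x, z) for s in selected)
        # Agreement propagation: if the good sources agree, so do the receivers.
        if x == z:
            assert set(selected) == {x}
        # Agreement generation: a symmetric fault sends one value to everyone.
        if len(set(outputs)) == 1:
            assert len(set(selected)) == 1

# An asymmetric fault can still split the receivers within the valid range.
print(receivers_select(1, 4, [0, 2, 5]))  # prints [1, 2, 4]
```

The last line shows why a single stage is not enough on its own: validity holds, but the receivers disagree.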

Single Stage Middle Value Select (Validity)
[Figure: sources x, "Any Fault", z; receivers compute mvs(x,a,z), mvs(x,b,z), mvs(x,c,z)]
min(x,z) ≤ mvs(x,?,z) ≤ max(x,z)
No guarantee of agreement!

Single Stage Middle Value Select (Agreement Propagation)
[Figure: sources x, "Any fault", x]
mvs(x,?,x) = x

Single Stage Middle Value Select (Agreement Generation)
[Figure: sources x, "Symmetric", z; receivers compute mvs(x,a,z)]

Current Efforts
Constructing new PVS proofs of all protocols based on generalized middle value select
  – Have to address conflict between mathematical generality and engineering utility
  – Exploiting structure to further generalize diagnosis protocol
    – Support a flexible group membership policy
    – Non-existence of ideal policy established this summer by Beth Latronico (NIA Intern)
Adding transient fault recovery capabilities
  – to protocols, reliability model, and formal proofs
  – to lab prototypes

Current Efforts (2)
Evaluating commercial embedded real-time operating systems for use on SPIDER Processing Elements
Evolving requirements for Processing Elements
  – Adapt/extend existing embedded real-time operating system
    – Time and Space Partitioning
    – Fault-tolerance middleware
  – Dynamic computation of communication schedules

Current Efforts (3)
Building up PVS library of reusable fault-tolerance results
  – SPIDER protocols expressed within this framework
    – Framework supports other network topologies
  – Improved generic clock synchronization properties
    – Improved accuracy results (tighter bounds)
    – Cleaner structure for precision results
  – Proof framework for general approximate agreement protocols (clock synchronization is special case)
  – Results generalized to accommodate weaker fault assumptions (including Azadmanesh & Kieckhafer model of strictly omissive asymmetric faults)
    – Preliminary support for wireless fault models

Additional Resources
A Conceptual Design for a Reliable Optical BUS (ROBUS); Paul Miner, Mahyar Malekpour, and Wilfredo Torres; in Proceedings 21st Digital Avionics Systems Conference (DASC) 2002
A New On-Line Diagnosis Protocol for the SPIDER Family of Byzantine Fault Tolerant Architectures, Alfons Geser and Paul Miner, NASA/TM
A Comparison of Bus Architectures for Safety-Critical Embedded Systems, John Rushby, NASA/CR