Automatic Data Structure Repair for Self-Healing Systems Brian Demsky Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology.


Motivation
(Figure: a broken data structure.)
Errors:
- Missing elements
- Inappropriate sharing
- Dangling references
- Out-of-bounds array indices
- Inconsistent values


Goal
(Figure: the repair algorithm transforms a broken data structure into a consistent data structure, guided by consistency properties from the developer.)

What Does the Repair Algorithm Produce?
A data structure that:
- satisfies the consistency properties, and
- is heuristically close to the broken data structure.
It is not necessarily the same data structure that a (hypothetical) correct program would produce, but it is enough to keep the program operating successfully.

Precursors
Data structure repair has historically appeared, as a key component, in systems with extreme reliability goals:
- 5ESS switch: hand-coded audit routines
- IBM MVS operating system: hand-coded failure recovery routines

Where Is This Likely To Be Useful?
- Not for systems with slack, which can just reboot. Rebooting works only if:
  - the cause of the error goes away after the reboot,
  - it is OK to lose volatile state, and
  - it is OK to wait for the reboot.
- Persistent data structures (file systems, application files)
- Autonomous and/or safety-critical systems that monitor or control unstable physical phenomena
- Systems with largely independent subcomputations
- Systems with a moving time window

Architecture
Broken bits → (model definition & translation) → broken abstract model → (internal consistency properties) → repaired abstract model → (external consistency properties) → repaired bits

Architecture Rationale
Why go through the abstract model?
- Simple, uniform structure: sets of objects and relations between objects
- Simplifies both the expression of consistency properties and the repair algorithm
- Enables the system to support the full range of efficient, heavily encoded data structures

File System Example
(Figure: directory entries "abst" and "intro" and the disk blocks they reference.)
    struct Entry { byte name[Length]; int firstBlock; }
    struct Block { int nextBlock; byte data[BlockSize]; }
    struct Disk { Entry dir[NumEntries]; Block block[NumBlocks]; }
    Disk D;

Model Definition
- Sets of objects:
    set blocks of integer : partition used | free;
- Relations between objects (values of object fields, referencing relationships between objects):
    relation next : used, used;
(Figure: the set blocks, partitioned into used and free, with the relation next between used blocks.)

Model Translation
Bits are translated to sets and relations in the abstract model using statements of the form:
    Quantifiers, Condition ⇒ Inclusion Constraint
Rules for the example:
    for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and D.dir[i].firstBlock < NumBlocks ⇒ D.dir[i].firstBlock in used
    for b in used, 0 ≤ D.block[b].nextBlock and D.block[b].nextBlock < NumBlocks ⇒ ⟨b, D.block[b].nextBlock⟩ in next
    for ⟨b, n⟩ in next, true ⇒ n in used
    for b in 0..NumBlocks, not (b in used) ⇒ b in free
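The translation rules can be read as a fixed-point computation over the disk image. The following C sketch is a hypothetical illustration, not the system's actual translator: it assumes small concrete values for Length, BlockSize, NumEntries, and NumBlocks (the slides leave them open), represents the sets used and free as boolean arrays, and stores the relation next as an array of pairs. `Model`, `Pair`, and `buildModel` are names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes (assumed; the slides do not give values). */
#define Length 8
#define BlockSize 16
#define NumEntries 2
#define NumBlocks 4

typedef unsigned char byte;
typedef struct { byte name[Length]; int firstBlock; } Entry;
typedef struct { int nextBlock; byte data[BlockSize]; } Block;
typedef struct { Entry dir[NumEntries]; Block block[NumBlocks]; } Disk;

/* Hypothetical model representation: sets as flags, the relation as pairs. */
typedef struct { int from, to; } Pair;
typedef struct {
    bool usedSet[NumBlocks];
    bool freeSet[NumBlocks];
    Pair next[NumBlocks];   /* each block has one nextBlock, so at most one pair */
    size_t nextCount;
} Model;

static bool validBlock(int b) { return 0 <= b && b < NumBlocks; }

/* Apply the four translation rules until nothing new is added. */
void buildModel(const Disk *D, Model *m) {
    bool hasPair[NumBlocks] = {false};
    *m = (Model){0};

    /* Rule 1: a valid firstBlock is in used. */
    for (int i = 0; i < NumEntries; i++)
        if (validBlock(D->dir[i].firstBlock))
            m->usedSet[D->dir[i].firstBlock] = true;

    /* Rules 2 and 3: a used block b with a valid nextBlock n adds <b, n>
       to next, and n then joins used. Repeat until a fixed point. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 0; b < NumBlocks; b++) {
            int n = D->block[b].nextBlock;
            if (!m->usedSet[b] || hasPair[b] || !validBlock(n)) continue;
            assert(m->nextCount < NumBlocks);
            m->next[m->nextCount++] = (Pair){b, n};
            hasPair[b] = true;
            m->usedSet[n] = true;
            changed = true;
        }
    }

    /* Rule 4: every block not in used is in free. */
    for (int b = 0; b < NumBlocks; b++)
        m->freeSet[b] = !m->usedSet[b];
}
```

With directory entries starting at blocks 0 and 2 and nextBlock values 1, -1, 1, -5, this sketch yields used = {0, 1, 2}, free = {3}, and next = {⟨0, 1⟩, ⟨2, 1⟩}: two blocks share block 1 as a successor, and block 3's out-of-range value is never examined because block 3 is not in used.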

Model in Example
(Figure: the file system's directory entries and disk blocks beside the abstract model built from them: the sets used and free and the relation next.)

Internal Consistency Properties
Form: Quantifiers, Body
The body is a first-order property of basic propositions:
- Inequality constraints on values of numeric fields: V.R = E and other comparisons of V.R with E
- Presence of a required number of objects: size(S) = C, size(S) ≤ C, size(S) ≥ C
- Topology of the region surrounding each object: size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C; size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
- Inclusion constraints: V in S, V1 in V2.R, ⟨V1, V2⟩ in R
Example: for b in used, size(next.b) ≤ 1

Internal Consistency Violations
Evaluate the consistency properties and find violations:
    for b in used, size(next.b) ≤ 1   is false for b = 1
(Figure: the model's sets used and free and the relation next.)
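For this property, finding violations amounts to counting each used block's predecessors in next. A minimal C sketch, assuming used is stored as a flag array and next as an array of pairs (both representation choices are ours); `predecessors` and `findViolations` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NumBlocks 4   /* illustrative size (assumed) */

typedef struct { int from, to; } Pair;

/* size(next.b): the number of pairs <x, b> in next, i.e. predecessors of b. */
static size_t predecessors(const Pair *next, size_t count, int b) {
    size_t k = 0;
    for (size_t i = 0; i < count; i++)
        if (next[i].to == b) k++;
    return k;
}

/* Evaluate "for b in used, size(next.b) <= 1". Each violation supplies a
   binding for b; the bindings are written to out and their count returned. */
size_t findViolations(const bool used[NumBlocks], const Pair *next,
                      size_t count, int out[NumBlocks]) {
    size_t v = 0;
    for (int b = 0; b < NumBlocks; b++)
        if (used[b] && predecessors(next, count, b) > 1) {
            assert(v < NumBlocks);
            out[v++] = b;
        }
    return v;
}
```

With used = {0, 1, 2} and next = {⟨0, 1⟩, ⟨2, 1⟩}, the only binding reported is b = 1.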

Repairing Violations of Internal Consistency Properties
- The violation provides a binding for the quantified variables.
- Convert the body to disjunctive normal form, $(p_1 \wedge \dots \wedge p_n) \vee \dots \vee (q_1 \wedge \dots \wedge q_m)$, where $p_1, \dots, p_n, q_1, \dots, q_m$ are basic propositions.
- Choose a conjunction to satisfy.
- Repair the violated basic propositions in that conjunction.
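The slide does not say how the conjunction is chosen. The C sketch below shows one hypothetical way to organize the step: each basic proposition carries a test function and a repair function, and the chosen conjunction is the one with the fewest currently violated propositions. That cost heuristic, the `Prop` and `Conjunction` types, and the toy integer state are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical representation: a basic proposition can test and repair itself. */
typedef struct {
    bool (*holds)(void *state);
    void (*repair)(void *state);
} Prop;

/* One disjunct of the DNF body: p1 && ... && pn. */
typedef struct { const Prop *props; size_t n; } Conjunction;

static size_t violatedCount(const Conjunction *c, void *state) {
    size_t k = 0;
    for (size_t i = 0; i < c->n; i++)
        if (!c->props[i].holds(state)) k++;
    return k;
}

/* Body = c[0] || ... || c[nc-1]. Pick the conjunction with the fewest
   violated propositions (assumed heuristic) and repair only those. */
void repairBody(const Conjunction *c, size_t nc, void *state) {
    assert(nc > 0);
    size_t best = 0, bestCost = violatedCount(&c[0], state);
    for (size_t j = 1; j < nc; j++) {
        size_t cost = violatedCount(&c[j], state);
        if (cost < bestCost) { best = j; bestCost = cost; }
    }
    if (bestCost == 0) return;   /* some conjunction already holds */
    for (size_t i = 0; i < c[best].n; i++)
        if (!c[best].props[i].holds(state))
            c[best].props[i].repair(state);
}

/* Toy example: state is int s[2]; body is (s[0] == 0 && s[1] == 0) || s[0] > 5. */
static bool xIsZero(void *s) { return ((int *)s)[0] == 0; }
static void setXZero(void *s) { ((int *)s)[0] = 0; }
static bool yIsZero(void *s) { return ((int *)s)[1] == 0; }
static void setYZero(void *s) { ((int *)s)[1] = 0; }
static bool xAboveFive(void *s) { return ((int *)s)[0] > 5; }
static void setXSix(void *s) { ((int *)s)[0] = 6; }

static const Prop bothZero[] = {{xIsZero, setXZero}, {yIsZero, setYZero}};
static const Prop xLarge[] = {{xAboveFive, setXSix}};
static const Conjunction exampleBody[] = {{bothZero, 2}, {xLarge, 1}};
```

For the state {3, 4}, both propositions of the first conjunction fail but only one of the second does, so the sketch repairs the second conjunction by setting s[0] to 6. For {0, 9}, the costs tie and the first conjunction wins, so only s[1] is reset.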

Repairing Violations of Basic Propositions
- Inequality constraints on values of numeric fields (V.R = E and other comparisons of V.R with E): compute the value of the expression and assign the field.
- Presence of a required number of objects (size(S) = C, size(S) ≤ C, size(S) ≥ C): remove objects from, or insert objects into, the set.
- Topology of the region surrounding each object (size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C; size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C): remove pairs from, or insert pairs into, the relation.
- Inclusion constraints (V in S, V1 in V2.R, ⟨V1, V2⟩ in R): remove the object or pair from, or add it to, the set or relation.

Repair in Example
    for b in used, size(next.b) ≤ 1   is false for b = 1
- Must repair size(next.1) ≤ 1.
- Can remove either ⟨0, 1⟩ or ⟨2, 1⟩ from next.
(Figure: the model's sets used and free and the relation next.)
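Removing pairs is the repair action for an upper bound of the form size(R.V) ≤ C. A minimal C sketch, assuming the relation is stored as an array of pairs. Which pair to drop is a policy choice the slide leaves open, and this hypothetical `repairMaxPredecessors` simply keeps the earliest pairs.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int from, to; } Pair;

/* Repair size(next.b) <= limit by deleting pairs <x, b> once 'limit' of
   them have been kept (keep-earliest is an assumed policy). The array is
   compacted in place; the new pair count is returned. */
size_t repairMaxPredecessors(Pair *next, size_t count, int b, size_t limit) {
    size_t kept = 0, out = 0;
    for (size_t i = 0; i < count; i++) {
        if (next[i].to == b) {
            if (kept == limit) continue;   /* drop this pair */
            kept++;
        }
        next[out++] = next[i];
    }
    assert(out <= count);
    return out;
}
```

Applied to next = {⟨0, 1⟩, ⟨2, 1⟩} with b = 1 and limit 1, the sketch removes ⟨2, 1⟩, one of the two choices listed above.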


Acyclic Repair Dependences
Questions:
- Isn't it possible for the repair of one constraint to invalidate another constraint?
- What about infinite repair loops?
- What about unsatisfiable specifications?
Answer:
- We require specifications to have no cyclic repair dependences between constraints.
- So all repair sequences terminate.
- Repair can fail only because of resource limitations.

External Consistency Constraints
Form: Quantifiers, Condition ⇒ Body, where the body has the form V = E, V.F = E, or V.F[I] = E
Example:
    for b in free, true ⇒ D.block[b].nextBlock = -2
    for ⟨i, j⟩ in next, true ⇒ D.block[i].nextBlock = j
    for b in used, size(b.next) = 0 ⇒ D.block[b].nextBlock = -1
Repair simply performs the assignments, translating model repairs into bit repairs.
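Because the external constraints are plain assignments, writing the repaired model back to the bits is a short loop. This C sketch is hypothetical: it assumes the model is stored as flag arrays for used and free plus an array of pairs for next, and it keeps only the nextBlock field of each block. `writeBack` is a name introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NumBlocks 4   /* illustrative size (assumed) */

typedef struct { int from, to; } Pair;
typedef struct { int nextBlock; } Block;   /* data field omitted (simplification) */

/* Apply the three external constraints by plain assignment. */
void writeBack(Block block[NumBlocks], const bool used[NumBlocks],
               const bool isFree[NumBlocks], const Pair *next, size_t count) {
    bool hasSuccessor[NumBlocks] = {false};
    for (size_t k = 0; k < count; k++) {
        int i = next[k].from, j = next[k].to;
        assert(0 <= i && i < NumBlocks);
        block[i].nextBlock = j;                  /* <i, j> in next => nextBlock = j */
        hasSuccessor[i] = true;
    }
    for (int b = 0; b < NumBlocks; b++) {
        if (isFree[b])
            block[b].nextBlock = -2;             /* b in free => -2 */
        else if (used[b] && !hasSuccessor[b])
            block[b].nextBlock = -1;             /* size(b.next) = 0 => -1 */
    }
}
```

For a repaired model with used = {0, 1, 2}, free = {3}, and next = {⟨0, 1⟩}, the sketch writes the nextBlock values 1, -1, -1, and -2.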

Repair in Example
(Figure: the inconsistent file system beside the repaired file system, each showing the directory entries "abst" and "intro" and their disk blocks.)

When to Test for Consistency and Repair
- Persistent data structures:
  - repair can be an independent activity, or
  - repair when data is written out or read in
- Volatile data structures in a running program, under programmer control:
  - Transaction-based approach: identify transaction start and end; repair at the start, the end, or both
  - Failure-based approach: wait until the program fails, then repair and restart from the latest safe point

Experience
- We acquired four benchmarks (written in C/C++):
  - CTAS (air-traffic control tool)
  - Simplified Linux file system
  - Freeciv interactive game
  - Microsoft Word files
- We developed specifications for all four:
  - very little development time (days, not weeks)
  - most of the time was spent figuring out Freeciv and CTAS
- Each benchmark has a workload and a fault insertion methodology
- We ran the benchmarks with and without repair

CTAS
- Set of air-traffic control tools: traffic management, arrival planning, flow visualization, shortcut planning
- Deployed in centers around the country (Dallas/Ft. Worth, Los Angeles, Denver, Miami, Minneapolis/St. Paul, Atlanta, Oakland)
- Approximately 1 million lines of C/C++ code

CTAS Screen Shot

Results
- Workload: recorded radar feed from DFW
- Fault insertion: simulate an error in flight plan processing (a bad airport index in the flight plan data structure)
- Without repair: the system crashes with a segmentation fault
- With repair: the aircraft has a different origin or destination, the system continues to execute, and the anomaly is eventually flushed from the system

Aspects of CTAS
- Lots of independent subcomputations:
  - the system processes hundreds of aircraft, and a problem with one should not affect the others
  - it is a multipurpose system (visualization, arrival planning, shortcuts, …), and a problem in one purpose should not affect the others
- Sliding time window: anomalies are eventually flushed
- Rebooting is ineffective: the system will crash again as soon as it sees the problematic flight plan

Simplified Linux File System
(Figure: disk layout with a super block, a group block, an inode bitmap, a block bitmap, inode blocks holding inodes, and disk blocks, including a directory block with the entry "intro".)
Some consistency properties:
- The inode bitmap is consistent with inode usage
- The block bitmap is consistent with block usage
- Directory entries refer to valid inodes
- Files contain only valid blocks
- Files do not share blocks
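The bitmap properties can be restored by recomputing each bit from actual usage. The C sketch below is a simplification, not ext2 code: the `Inode` layout, the `BlocksPerInode` limit, the use of -1 for an empty block slot, and ignoring metadata blocks (super block, bitmaps, inode blocks) are all assumptions. It covers only the block bitmap; the inode bitmap could be handled the same way.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sizes and layout (assumed). */
#define NumInodes 4
#define NumBlocks 16
#define BlocksPerInode 3

typedef struct {
    bool inUse;
    int block[BlocksPerInode];   /* -1 marks an empty slot (assumed) */
} Inode;

static bool getBit(const unsigned char *bm, int i) {
    return (bm[i / 8] >> (i % 8)) & 1;
}

static void setBit(unsigned char *bm, int i, bool v) {
    if (v) bm[i / 8] |= (unsigned char)(1u << (i % 8));
    else   bm[i / 8] &= (unsigned char)~(1u << (i % 8));
}

/* Make the block bitmap agree with block usage: a block is marked allocated
   iff some in-use inode references it. Returns the number of bits changed. */
int repairBlockBitmap(unsigned char *bm, const Inode *inodes, int ninodes) {
    bool referenced[NumBlocks] = {false};
    assert(0 <= ninodes && ninodes <= NumInodes);
    for (int n = 0; n < ninodes; n++) {
        if (!inodes[n].inUse) continue;
        for (int k = 0; k < BlocksPerInode; k++) {
            int b = inodes[n].block[k];
            if (0 <= b && b < NumBlocks) referenced[b] = true;
        }
    }
    int changed = 0;
    for (int b = 0; b < NumBlocks; b++)
        if (getBit(bm, b) != referenced[b]) {
            setBit(bm, b, referenced[b]);
            changed++;
        }
    return changed;
}
```

If a free inode's old block 5 is still marked allocated while blocks 2 and 3 of a live inode are not, the sketch changes three bits. Recomputing from usage, rather than patching individual bits, avoids reasoning about how the bitmap became stale.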

Results
- Workload: write and verify several files
- Fault insertion: crash the file system, leaving inode and block bitmap errors and partially initialized directory and inode entries
- Without repair: incorrect file contents because of inode and disk block sharing
- With repair: the bitmaps are repaired, preventing illegal sharing, and the file contents are correct

Freeciv
(Figure: a terrain grid with rows POMM, OOMP, POMM, PPMP, and city structures with locations loc: 3,0 and loc: 2,3; O = Ocean, P = Plain, M = Mountain.)
Consistency properties:
- Tiles have valid terrain values
- Cities are not in the ocean
- Each city has exactly one reference from the city location grid
- City locations are consistent between the city structures and the tile grid
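A repair pass for these four properties might look like the following C sketch. The `Tile` and `City` types are hypothetical stand-ins, not Freeciv's real data structures, and two repair choices are ours: an invalid terrain value becomes plain, and a city on an ocean tile has its tile changed to plain rather than being moved. The sketch also assumes city locations are in bounds and distinct.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sizes and types (assumed, not Freeciv's). */
#define W 4
#define H 4
#define MaxCities 8

enum Terrain { OCEAN, PLAIN, MOUNTAIN, NUM_TERRAINS };

typedef struct { int terrain; int city; } Tile;   /* city: index or -1 */
typedef struct { int x, y; } City;

/* Returns the number of individual repairs performed. */
int repairMap(Tile grid[H][W], const City *cities, int ncities) {
    int fixes = 0;
    assert(0 <= ncities && ncities <= MaxCities);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            Tile *t = &grid[y][x];
            /* Tiles have valid terrain values (default to plain: our choice). */
            if (t->terrain < 0 || t->terrain >= NUM_TERRAINS) {
                t->terrain = PLAIN;
                fixes++;
            }
            /* A grid reference must name a city located on this tile. */
            int c = t->city;
            if (c != -1 && (c < 0 || c >= ncities ||
                            cities[c].x != x || cities[c].y != y)) {
                t->city = -1;
                fixes++;
            }
        }
    for (int c = 0; c < ncities; c++) {
        assert(0 <= cities[c].x && cities[c].x < W);
        assert(0 <= cities[c].y && cities[c].y < H);
        Tile *t = &grid[cities[c].y][cities[c].x];
        /* Exactly one reference: stray ones were cleared above. */
        if (t->city != c) { t->city = c; fixes++; }
        /* Cities are not in the ocean. */
        if (t->terrain == OCEAN) { t->terrain = PLAIN; fixes++; }
    }
    return fixes;
}
```

A corrupted terrain value, a stray grid reference, and a city sitting on ocean each cost one repair in this sketch.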

Results
- Workload: the Freeciv software plays against itself
- Fault insertion: randomly corrupt terrain values
- Without repair: the program fails with a segmentation fault
- With repair: the game runs just fine, but plays out differently because of the different terrain values

Microsoft Word Files
- Files consist of a sequence of streams
- Streams are stored using a FAT-based data structure
Consistency properties:
- FAT blocks exist and contain valid entries
- FAT streams are properly terminated
- Free blocks are properly marked
- Streams contain valid blocks
- No sharing of blocks between streams
(Figure: directory entries, the FAT, and disk blocks.)

Results
- Workload: several Microsoft Word files
- Fault insertion: scramble the FAT
- Without repair: if the blocks containing the FAT were incorrectly marked as free, Word successfully loads the file; otherwise, Word reports “The document name or path is not valid”
- With repair: Word loads all files

Extensions
Elimination of external consistency constraints:
- Eliminates problems with translating repairs on the abstract model to the actual data structure
- The repair algorithm analyzes the model definition rules to generate repair actions for the actual data structure

Extensions
Support for doubly linked data structures: enables the repair algorithm to regenerate back links
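Regenerating back links amounts to trusting the forward links and rewriting each prev pointer from them. A minimal C sketch under that assumption; the `dnode` type and `regenerateBackLinks` are hypothetical names, and the next chain is assumed to be acyclic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doubly linked node: next links trusted, prev links repaired. */
struct dnode {
    int value;
    struct dnode *next;
    struct dnode *prev;
};

/* Rewrite every prev pointer from the next chain (assumed acyclic).
   Returns how many prev pointers had to change. */
size_t regenerateBackLinks(struct dnode *head) {
    size_t fixed = 0;
    struct dnode *before = NULL;
    for (struct dnode *n = head; n != NULL; n = n->next) {
        if (n->prev != before) {
            n->prev = before;
            fixed++;
        }
        /* Invariant being restored: n->prev is NULL or links back to n. */
        assert(n->prev == NULL || n->prev->next == n);
        before = n;
    }
    return fixed;
}
```

For a three-node list whose second back link is missing and whose third points to the head, the sketch fixes both pointers and leaves the head's NULL back link alone.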

Extensions
Compilation and optimization of consistency checking:
- Achieved significant speedups (n x) by compiling the specification
- Achieved further speedups () by partially optimizing away the construction of the abstract model

Related Work
- Hand-coded repair: Lucent 5ESS switch, IBM MVS operating system
- Self-stabilizing algorithms
- Log-based recovery for database systems
- Recovery-oriented computing
- Recursive restartability
- Undo framework

Conclusion
- Data structure repair is an interesting way to (potentially) improve reliability
- The specification-based approach promises to make the technique more widely applicable
- Moving towards a more robust, probabilistic, continuous concept of system behavior

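Formalizing Repair Dependences: Constraint Dependence Graph
(Figure: constraints P, Q, and T with the conjunctions of their DNF bodies: P = $(a_1 \wedge \dots \wedge a_n) \vee (b_1 \wedge \dots \wedge b_n)$, Q = $(c_1 \wedge \dots \wedge c_n) \vee (d_1 \wedge \dots \wedge d_n)$, T = $(e_1 \wedge \dots \wedge e_n) \vee (f_1 \wedge \dots \wedge f_n)$.)
- Nodes: constraints, and the conjunctions from their DNF
- Edges:
  - from each constraint to its conjunctions
  - from a conjunction to dependent propositions, if repairing the conjunction could falsify the proposition or could increase the scope of a quantifier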

Consistency Properties
- The FAT blocks exist
- The FAT contains only valid values:
  - -1 terminates FAT streams
  - -2 indicates free blocks
  - a valid disk block index gives the next block in the stream
- FAT streams are properly terminated
- Free blocks are properly marked
- Streams contain only valid blocks
- Streams do not share blocks
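A single pass over the streams can enforce several of these properties at once. This C sketch is hypothetical: it uses the sentinel values listed on the slide (-1 ends a stream, -2 marks a free block), takes each stream's first block from a `starts` array assumed to be valid and distinct, and handles invalid values, cycles, and sharing by terminating the stream at the offending entry, which is only one possible repair policy.

```c
#include <assert.h>
#include <stdbool.h>

#define NumBlocks 8            /* illustrative size (assumed) */
#define END_OF_STREAM (-1)     /* sentinel values as listed on the slide */
#define FREE_BLOCK (-2)

/* Returns the number of FAT entries rewritten. */
int repairFat(int fat[NumBlocks], const int *starts, int nstreams) {
    bool owned[NumBlocks] = {false};
    int fixes = 0;
    for (int s = 0; s < nstreams; s++) {
        int b = starts[s];
        assert(0 <= b && b < NumBlocks && !owned[b]);   /* assumed valid, distinct */
        for (;;) {
            owned[b] = true;
            int n = fat[b];
            if (n == END_OF_STREAM) break;              /* properly terminated */
            if (n < 0 || n >= NumBlocks || owned[n]) {
                /* invalid value, free marker inside a stream, cycle, or
                   sharing with another stream: terminate the stream here */
                fat[b] = END_OF_STREAM;
                fixes++;
                break;
            }
            b = n;
        }
    }
    /* Blocks owned by no stream must be marked free. */
    for (int b = 0; b < NumBlocks; b++)
        if (!owned[b] && fat[b] != FREE_BLOCK) {
            fat[b] = FREE_BLOCK;
            fixes++;
        }
    return fixes;
}
```

With streams starting at blocks 0 and 3 and a FAT of {1, 2, -1, 4, 1, -2, 7, -2}, the second stream tries to share block 1, so the sketch terminates it at block 4 and marks the orphaned block 6 as free.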

Formalizing Repair Dependences: Constraint Dependence Graph
(Figure: the dependence graph over constraints P, Q, and T and their conjunctions $(a_1 \wedge \dots \wedge a_n)$ through $(f_1 \wedge \dots \wedge f_n)$.)
- Absence of cycles implies a valid repair schedule
- Cycles can be eliminated by removing conjunctions (at least one conjunction must remain per constraint)
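Whether a dependence graph is acyclic can be checked with a topological sort, which also yields an order for the nodes whenever one exists. The C sketch below uses Kahn's algorithm on an adjacency matrix. The node numbering and the edge encoding (edge[u][v] set when repairing u could affect v) are assumptions, and conjunction removal is not implemented.

```c
#include <assert.h>
#include <stdbool.h>

#define MaxNodes 16   /* illustrative bound (assumed) */

/* Kahn's algorithm: writes a topological order of nodes 0..n-1 into order[]
   and returns true, or returns false if the graph contains a cycle. */
bool topologicalOrder(int n, bool edge[][MaxNodes], int order[]) {
    int indeg[MaxNodes] = {0};
    int queue[MaxNodes];
    int head = 0, tail = 0, count = 0;
    assert(0 <= n && n <= MaxNodes);

    for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
            if (edge[u][v]) indeg[v]++;
    for (int v = 0; v < n; v++)
        if (indeg[v] == 0) queue[tail++] = v;

    /* Each node is enqueued at most once, so tail never exceeds n. */
    while (head < tail) {
        int u = queue[head++];
        order[count++] = u;
        for (int v = 0; v < n; v++)
            if (edge[u][v] && --indeg[v] == 0) queue[tail++] = v;
    }
    return count == n;   /* nodes never dequeued lie on or behind a cycle */
}
```

Adding a single back edge to a three-node chain makes every node's in-degree nonzero, so the sketch reports a cycle, which is the situation conjunction removal is meant to resolve.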

Formalizing Repair Dependences: Constraint Dependence Graph
(Figure: the graph after conjunction $(f_1 \wedge \dots \wedge f_n)$ has been removed from T, leaving T = $(e_1 \wedge \dots \wedge e_n)$.)
- Absence of cycles implies a valid repair schedule
- Cycles can be eliminated by removing conjunctions (at least one conjunction must remain per constraint)

Pointers
Sets in the model can include:
- primitive types (int, char, …)
- structs (identified by a pointer to the struct)
Standard linked list example:
    struct node { int value; node *next; }
    set nodes of node;
    relation next : node, node;
    for n in nodes, true ⇒ n.next in nodes
    for n in nodes, true ⇒ ⟨n, n.next⟩ in next
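The two rules, plus an assumed rule that places a root pointer in nodes, can be evaluated with a worklist. A minimal C sketch; `ListModel`, `NodePair`, and `buildListModel` are hypothetical names, and the fixed capacity is an assumption. It skips null pointers, which do not appear in the model, and it terminates on cyclic lists because a node enters the set only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node { int value; struct node *next; };

#define MaxModelNodes 64   /* illustrative capacity (assumed) */

typedef struct { struct node *from, *to; } NodePair;
typedef struct {
    struct node *nodes[MaxModelNodes];   /* the set nodes */
    size_t nnodes;
    NodePair next[MaxModelNodes];        /* the relation next */
    size_t nnext;
} ListModel;

static bool contains(const ListModel *m, const struct node *p) {
    for (size_t i = 0; i < m->nnodes; i++)
        if (m->nodes[i] == p) return true;
    return false;
}

void buildListModel(struct node *root, ListModel *m) {
    m->nnodes = m->nnext = 0;
    if (root == NULL) return;
    m->nodes[m->nnodes++] = root;            /* assumed root rule */
    for (size_t i = 0; i < m->nnodes; i++) { /* worklist grows as nodes are added */
        struct node *n = m->nodes[i];
        if (n->next == NULL) continue;       /* null pointers stay out of the model */
        assert(m->nnext < MaxModelNodes);
        m->next[m->nnext++] = (NodePair){n, n->next};   /* <n, n.next> in next */
        if (!contains(m, n->next)) {                    /* n.next in nodes */
            assert(m->nnodes < MaxModelNodes);
            m->nodes[m->nnodes++] = n->next;
        }
    }
}
```

For a list a → b → c whose last node points back to b, the model contains three nodes and three pairs, including ⟨c, b⟩, so the cycle is visible to later consistency checks.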

What About Corrupted Pointers?
- The system only allows valid structs in the model:
  - a struct must lie completely in valid memory
  - one struct may be nested inside another struct (but they must agree on the memory format)
- If the system encounters an invalid or null pointer, the (invalid) struct does not appear in the model
- The implementation must track operations that affect the valid regions of the address space: malloc, free, mmap, munmap
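One way to track valid memory is to wrap the allocator. The C sketch below is hypothetical (`trackedMalloc`, `trackedFree`, and `structIsValid` are our names): it records only heap blocks from malloc and free, omitting mmap, munmap, stack, and static memory, and it uses a fixed-size region table. A struct may enter the model only if it lies entirely inside one recorded region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MaxRegions 128   /* illustrative capacity (assumed) */

typedef struct { uintptr_t start; size_t len; } Region;
static Region regions[MaxRegions];
static size_t nregions;

/* Hypothetical allocator wrappers that keep the region table current. */
void *trackedMalloc(size_t len) {
    assert(nregions < MaxRegions);
    void *p = malloc(len);
    if (p != NULL) regions[nregions++] = (Region){(uintptr_t)p, len};
    return p;
}

void trackedFree(void *p) {
    for (size_t i = 0; i < nregions; i++)
        if (regions[i].start == (uintptr_t)p) {
            regions[i] = regions[--nregions];   /* unordered removal */
            break;
        }
    free(p);
}

/* True iff [p, p + sz) lies entirely inside one tracked region.
   The comparison is arranged to avoid unsigned overflow. */
bool structIsValid(const void *p, size_t sz) {
    if (p == NULL) return false;
    uintptr_t a = (uintptr_t)p;
    for (size_t i = 0; i < nregions; i++)
        if (a >= regions[i].start && sz <= regions[i].len &&
            a - regions[i].start <= regions[i].len - sz)
            return true;
    return false;
}
```

A 16-byte struct at offset 16 of a 32-byte allocation passes the check, while a 17-byte struct at the same address does not, because it would extend past the end of valid memory.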

CTAS in Action
(Figure panels: TMA at Fort Worth Center; FAST at DFW TRACON.)

Usage Scenarios
- Reduced development effort: invest less effort in finding and fixing bugs, and rely on repair to deliver a reliable system
- Afraid to fix a bug
- Cheap insurance policy: no good quantitative justification, but repair seems like a good idea

Issues
- Unclear relationship between repaired bits and the bits from a “correct” execution of the program
- Identifying results involving repaired data
- Characterizing likely errors:
  - data races in multithreaded programs
  - failure to update correlated data structures
  - caching inconsistencies
  - unanticipated failures/exit points
- Constraint language expressivity:
  - coverage of desired properties
  - limitations from the acyclicity requirement
- When to test for consistency and repair

What About Corrupted Pointers?
- Sets may contain pointers to structs
- The system only allows valid structs in the model:
  - a struct must lie completely in valid memory
  - one struct may be nested inside another struct (but they must agree on the memory format)
(Figure: valid and invalid memory; structs lying entirely in valid memory are valid, while a struct extending into invalid memory is invalid.)

Interesting Nuggets
- Small specifications
- Global invariant advantages