ARMOR-based Hierarchical Fault/Error Management


Z. Kalbarczyk, K. Whisnant, Q. Liu, R.K. Iyer
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801

Networked/Distributed Systems: Key Questions
- How do we integrate components with varying fault tolerance (detection and recovery) characteristics into a coherent high availability networked system?
- How do you guarantee reliable communications?
- How do you synchronize actions of dispersed processors and processes?
- How do you contain errors (or achieve fail-silent behavior of components) to prevent error propagation?
- How do you reconfigure the system in response to failures?

Failure Categories
- The system must cope with machine (node), process, and network failures.
- A component specification defines what output should be produced in response to any sequence of inputs, as well as the real-time interval within which this output should occur.
- A failure is a deviation from this specification.

Failure Categories (cont.)
- Crash failures are a proper subclass of omission failures.
  - A crash failure occurs when, after a first omission to send or receive a message, a process systematically omits to send or receive messages.
- Omission failures are a proper subclass of timing failures.
  - A process that suffers an omission failure can be understood as having an infinite response time.
- Timing failures are a proper subclass of incorrect computation failures.
  - A timing failure occurs when a process takes some action too soon or too late.
- Incorrect computation failures are a proper subclass of the class of all possible failures, the Byzantine or malicious failures.
  - A faulty process may send spurious messages to other processes, may lie, and may not respond correctly to received messages.
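
The nesting above can be written down as an ordered type. This is a minimal sketch, not part of the original slides: the names `FailureClass` and `covers` are hypothetical, and it assumes that a mechanism designed for one class also handles every class nested inside it, which follows from the subclass relation.

```python
from enum import IntEnum


class FailureClass(IntEnum):
    """Failure classes ordered from most to least restrictive."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    INCORRECT_COMPUTATION = 4
    BYZANTINE = 5


def covers(designed_for: FailureClass, observed: FailureClass) -> bool:
    """True if a mechanism designed for one class also covers the observed class.

    Because each class is a proper subclass of the next, a smaller value
    is always contained in a larger one.
    """
    return observed <= designed_for


timing_covers_crash = covers(FailureClass.TIMING, FailureClass.CRASH)
crash_covers_omission = covers(FailureClass.CRASH, FailureClass.OMISSION)
```

Here a mechanism built for timing failures also covers crashes, while one built only for crashes does not cover general omission failures.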

What Do We Propose in Approaching the Problems?
An ARMOR-based programming environment that provides:
- A process architecture that
  - offers flexibility in assigning functionality, including error detection and recovery techniques, to specific processes, which can be reconfigured according to dependability and application requirements;
  - scales according to the number of nodes available and the number of applications simultaneously executing in the system.
- A runtime environment that
  - provides external process management to applications;
  - allows fine-tuning of the fault tolerance services provided to and embedded in the application.
- A hierarchy of error detection and recovery
  - to avoid single points of failure;
  - to protect not only the applications but also the entities supporting detection and recovery services.

What are ARMORs?
- Adaptive Reconfigurable Mobile Objects of Reliability:
  - Multithreaded processes composed of replaceable building blocks.
  - Provide error detection and recovery services to user applications via three levels of interaction.
- A hierarchy of ARMOR processes forms the runtime environment:
  - System management, error detection, and error recovery services are distributed across ARMOR processes.
  - The ARMOR runtime environment is self-checking.
- ARMOR support for the application:
  - Completely transparent and external support.
  - Transparent extension of standard libraries.
  - Instrumentation with the ARMOR API.

ARMOR Configuration
[Figure: a repository of elements and ARMORs built on the ARMOR microkernel. Elements shown: HB, checkpoint, progress indicator, data dependency checking, assertion check, checksum, text-segment signature, control flow signature, range check.]

ARMOR Computation Model
[Figure: an element with actions opAction1, opAction2, opAction3.]
- Elements are invoked through events called operations.
- A thread consists of a sequence of operations that execute.
- In response to an operation, an element can:
  - Read/write thread variables that serve as input/output for the operation.
  - Read/write element state.
  - Generate additional operations to be processed within the thread.
- Element-based detection and recovery:
  - A monitor generates an operation when it detects an error.
  - Policy elements subscribe to the notification operation and generate a sequence of operations to effect recovery.
  - Service elements carry out individual recovery steps.
  - The response to errors can be reconfigured by changing policy elements in ARMORs.
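
The computation model above can be sketched as a small dispatcher. This is an illustrative sketch only, not the ARMOR implementation: the class names, the operation names (`error_detected`, `kill_process`, `restart_process`), and the dictionary used for thread variables are all assumptions made for the example.

```python
from collections import deque


class Microkernel:
    """Delivers operations to the elements subscribed to them (hypothetical API)."""

    def __init__(self):
        self.subscribers = {}  # operation name -> list of handlers

    def subscribe(self, operation, handler):
        self.subscribers.setdefault(operation, []).append(handler)

    def run_thread(self, first_operation, variables):
        """Process a sequence of operations; handlers may generate more."""
        pending = deque([first_operation])
        trace = []
        while pending:
            operation = pending.popleft()
            trace.append(operation)
            for handler in self.subscribers.get(operation, []):
                pending.extend(handler(variables) or [])
        return trace


class RestartPolicy:
    """Policy element: turns an error notification into recovery steps."""

    def __init__(self):
        self.errors_seen = 0  # element state

    def on_error(self, variables):
        self.errors_seen += 1
        return ["kill_process", "restart_process"]


class ProcessService:
    """Service element: carries out the individual recovery steps."""

    def kill(self, variables):
        variables["alive"] = False

    def restart(self, variables):
        variables["alive"] = True
        variables["restarts"] = variables.get("restarts", 0) + 1


kernel = Microkernel()
policy = RestartPolicy()
service = ProcessService()
kernel.subscribe("error_detected", policy.on_error)
kernel.subscribe("kill_process", service.kill)
kernel.subscribe("restart_process", service.restart)

# A monitor would generate "error_detected"; here it is injected directly.
variables = {"alive": True}
trace = kernel.run_thread("error_detected", variables)
```

Swapping `RestartPolicy` for a different policy element changes the response to errors without touching the monitor or the service element, which is the reconfiguration property the slide describes.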

ARMOR Runtime Environment
[Figure: a multi-node solution and a single-node solution. Labels: Node, network, Manager ARMOR, Heartbeat ARMOR, Daemon ARMOR, Primary ARMOR, Backup ARMOR, Exec. ARMOR, App.]
- Various kinds of ARMORs execute in the environment, depending upon requirements.
- Distributing detection and recovery responsibilities makes the environment resilient to ARMOR failures.
- Solutions scale down to a one-node configuration.

Daemons
- Each node in the runtime environment executes a daemon.
- Daemons provide services to local ARMORs:
  - Install ARMORs on the local node.
  - Detect ARMOR process crash/hang failures.
  - Act as a channel for ARMOR-to-ARMOR communications.
[Figure: daemons on Node 1, Node 2, and Node 3 connected by a network. Daemon elements on the ARMOR microkernel: Detection Policy, Process Mgmt., Named Pipe Mgmt. (local ARMORs), TCP Connection Mgmt. (remote daemons).]
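
One common way for a daemon to detect a hung or crashed local process is a heartbeat timeout. The sketch below assumes that mechanism for illustration; the slides do not say how the daemon detects crash/hang failures, and `HeartbeatMonitor`, its methods, and the timeout value are hypothetical. Times are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each local ARMOR (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # ARMOR id -> time of last heartbeat

    def heartbeat(self, armor_id, now):
        """Record a heartbeat (also used to register a new ARMOR)."""
        self.last_seen[armor_id] = now

    def suspects(self, now):
        """Return the ARMORs whose last heartbeat is older than the timeout."""
        return sorted(
            armor_id
            for armor_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("exec-armor", now=0.0)
monitor.heartbeat("manager-armor", now=0.0)
monitor.heartbeat("manager-armor", now=4.0)  # only the manager keeps responding

suspected = monitor.suspects(now=6.0)
```

At time 6.0 only `exec-armor` has been silent for longer than the timeout, so it is the only suspect.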

Managers
- Manage a group of ARMORs.
- Responsible for recovering failed ARMORs.
- Contain information about each ARMOR:
  - Location in the network.
  - Current configuration.
  - Recovery policy.
  - Associated application.
- Detect and recover from node failures.
- Allocate nodes (including spares) for the application and for ARMOR processes.
- Interface with the user.
- Manager functionality can be consolidated into one Manager ARMOR or distributed across a hierarchy of Manager ARMORs.
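
The per-ARMOR information a manager holds can be sketched as a simple registry. This is an assumption-laden illustration: the record fields mirror the four items listed above, but `ArmorRecord`, `ManagerRegistry`, the node names, and the policy strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArmorRecord:
    armor_id: str
    node: str              # location in the network
    configuration: list    # names of the elements currently installed
    recovery_policy: str
    application: str


class ManagerRegistry:
    """Bookkeeping a manager might keep about the ARMORs it oversees."""

    def __init__(self):
        self.records = {}

    def register(self, record):
        self.records[record.armor_id] = record

    def on_node(self, node):
        """ARMORs affected if the given node fails."""
        return sorted(r.armor_id for r in self.records.values() if r.node == node)

    def migrate(self, armor_id, spare_node):
        """Record that an ARMOR has been restarted on a spare node."""
        self.records[armor_id].node = spare_node


registry = ManagerRegistry()
registry.register(ArmorRecord("exec-1", "node1", ["heartbeat"], "restart", "app-A"))
registry.register(ArmorRecord("exec-2", "node2", ["checkpoint"], "rollback", "app-B"))

# Suppose node1 fails: find the affected ARMORs and move them to a spare.
affected = registry.on_node("node1")
for armor_id in affected:
    registry.migrate(armor_id, "spare1")
```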

Hierarchy of Error Detection & Recovery – Attributes (1)
Adaptivity and composability of individual levels:
- The composition and invocation of detection and recovery at the individual levels should be customizable to meet
  - the application's needs,
  - the types of faults being experienced in the system,
  - the reliability characteristics desired.
- Applications with varying availability requirements should coexist in the same environment.
- Detection levels should be
  - selectively switchable on or off,
  - independent, so that they can be composed in various ways.

Hierarchy of Error Detection & Recovery – Attributes (2)
- Intra-level interactions:
  - Interactions between techniques placed within each level should be evaluated taking into account cost, coverage, and intrusiveness.
  - For example, where assertion checks are placed at certain points of the application code, it may not be necessary to generate control flow signatures for that portion of the code.
- Inter-level interactions:
  - Interactions between error detection and recovery levels should be carefully defined to eliminate redundant invocation of multiple detection mechanisms.
  - Errors that escape a given level should be detectable by higher levels.
- Recovery responsibilities:
  - An appropriate recovery strategy should be selected based upon the failure and the circumstances of the failure event.
  - Avoid competition during error recovery: make sure that one and only one entity is responsible for recovering a failed process or node.
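
The "one and only one entity" rule can be illustrated with a claim table. This is a single-process sketch under stated assumptions: `RecoveryArbiter` and the entity names are hypothetical, and a real distributed system would need the recovering entities to agree on the owner, which this example does not address.

```python
class RecoveryArbiter:
    """Grants recovery of a failed entity to the first claimant only."""

    def __init__(self):
        self.owners = {}  # failed entity -> entity responsible for recovery

    def claim(self, failed_entity, recoverer):
        """Return True if this recoverer owns the recovery."""
        owner = self.owners.setdefault(failed_entity, recoverer)
        return owner == recoverer

    def release(self, failed_entity, recoverer):
        """Give up ownership once recovery has finished."""
        if self.owners.get(failed_entity) == recoverer:
            del self.owners[failed_entity]


arbiter = RecoveryArbiter()
daemon_owns = arbiter.claim("app-1", "daemon-A")    # first claimant wins
manager_owns = arbiter.claim("app-1", "manager-1")  # competing claim is refused
```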

Hierarchy of Error Detection & Recovery
- Techniques are encapsulated in separate elements.
- Techniques can be selectively turned on or off, inserted or removed.
- Techniques are arranged in a hierarchy of layers.

ARMOR Error Detection & Recovery layers (figure annotation: increasing overhead across the layers):
- Layer 1: Process (inside the ARMOR process)
  - Detection: watchdog timer (livelock detection); built-in assertion checks; control and data flow checks.
  - Recovery: restart a process/thread; hardware reset.
- Layer 2: Node (at the node)
  - Detection: progress indicator; smart heartbeats; data audits; OS detection.
  - Recovery: checkpointing/rollback; process restart on the same node.
- Layer 3: Network (between ARMORs)
  - Detection: signature exchange between processes for consistency check; global heartbeats.
  - Recovery: checkpointing/rollback; process migration/restart; masking.
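
A layered detector of this kind might be composed as follows. This is a minimal sketch, not the ARMOR design: `DetectionLayer`, `detect`, and the observation keys are hypothetical, and trying the layers from process to network is an assumption based on the "increasing overhead" annotation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DetectionLayer:
    name: str
    check: Callable[[dict], bool]  # returns True when the layer detects an error
    enabled: bool = True           # layers can be switched on or off


def detect(layers, observation) -> Optional[str]:
    """Return the name of the first enabled layer that detects an error."""
    for layer in layers:
        if layer.enabled and layer.check(observation):
            return layer.name
    return None


layers = [
    DetectionLayer("process", lambda obs: obs.get("assertion_failed", False)),
    DetectionLayer("node", lambda obs: obs.get("progress_stalled", False)),
    DetectionLayer("network", lambda obs: obs.get("heartbeat_missed", False)),
]

# A stalled process that passed its assertions escapes layer 1
# and is caught at the node layer.
caught_by = detect(layers, {"progress_stalled": True})
```

Setting `enabled = False` on a layer removes it from the composition without changing the others, so an error it would have caught is left for a higher layer.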

ARMOR Applications
- Base station controller: protecting the call-processing application and database of a digital mobile telephone network controller.
- Embedded wireless applications:
  - Protecting the wireless communication channel through ARMOR-based proxies.
  - Providing automated detection and recovery to wireless telephones and servers.
- Network services: DHCP (Dynamic Host Configuration Protocol).
- Spaceborne applications: runtime environment for protecting distributed spaceborne applications.

ARMOR-Based Fault Management in RTES Environment – Design Options

Manager on DSP
- Local Managers:
  - Execute on a dedicated DSP per board.
  - Detect and recover from errors localized to the board.
- Regional Managers:
  - Execute on Linux clusters.
  - Handle recovery that spans multiple boards.
[Figure labels: Level 1 DSP Farm (Mgr, App, com, Board); Level 2/3 Linux Farm (Daemon, App, Exec ARMOR, Region Mgr., Node).]

Manager on PC
- Local Managers execute on a PC assigned to a board or group of boards.
[Figure labels: Mgr, App, com, Linux/Win32, Board, Daemon, Exec ARMOR, Region Mgr., Node.]

ARMOR-based Manager – Design Details (1)
- Local Managers are ARMOR processes:
  - Reconfigurable monitoring functionality, detection policy, and recovery policy.
  - Communicate with the Linux farm through the common ARMOR infrastructure.
[Figure labels: Linux/Win32, Local Manager ARMOR (ARMOR Microkernel, ARMOR Interface, App Recovery Policy, DSP Interface), com, Board, App, Daemon, Exec ARMOR, Region Mgr., Node.]

ARMOR-based Manager – Design Details (2)
- All functionality is found in replaceable elements.
- Individual ARMORs can be customized based upon the role they play in the system:
  - Local Manager ARMORs include an element to interface with the DSP.
  - Daemon ARMORs contain elements to communicate with local ARMORs.
  - Execution ARMORs contain elements to oversee the user application.
- Every ARMOR is built on a "microkernel" used to add elements, remove elements, and communicate among elements.
- Each element is found in a separate shared library:
  - Elements are explicitly loaded by the microkernel through dlopen() and dlsym().
  - Dynamic reconfiguration can be done on demand.
- Elements subscribe to the event messages that they are designed to process.
- A Tcl interface is used to construct messages that are sent to ARMORs.
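
The load-by-name pattern can be illustrated in Python, where loading a module from a file plays the role of dlopen() and looking up a named attribute plays the role of dlsym(). This is an analogy only, not the ARMOR code: `ElementLoader`, the `create_element` factory symbol, and the element file written to a temporary directory are all assumptions made for the example.

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for an element shipped as its own loadable unit (hypothetical).
ELEMENT_SOURCE = '''
def create_element():
    return {"name": "heartbeat", "subscribes_to": ["hb_tick"]}
'''


class ElementLoader:
    """Adds and removes elements at run time (hypothetical microkernel part)."""

    def __init__(self):
        self.elements = {}

    def add_element(self, path, symbol="create_element"):
        path = Path(path)
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)    # load the unit, like dlopen()
        factory = getattr(module, symbol)  # resolve a symbol, like dlsym()
        element = factory()
        self.elements[element["name"]] = element
        return element

    def remove_element(self, name):
        return self.elements.pop(name)


with tempfile.TemporaryDirectory() as tmp:
    element_path = Path(tmp) / "heartbeat_element.py"
    element_path.write_text(ELEMENT_SOURCE)
    loader = ElementLoader()
    loaded = loader.add_element(element_path)
```

Because elements are located by path and symbol name rather than linked in, an ARMOR's configuration can change on demand without rebuilding the process.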