Real-Time Fault Tolerant CORBA


Tom Bracewell, Senior Principal Software Engineer, Raytheon IDS, Sudbury, MA (bracewell@raytheon.com)
Dr. Priya Narasimhan, Asst. Professor of ECE and CS, Carnegie Mellon University, Pittsburgh, PA (priya@cs.cmu.edu)

Motivation
- Growing need for middleware that supports both dependability and real-time QoS
- Two CORBA standards
  - Real-Time (RT) CORBA standard
  - Fault-Tolerant (FT) CORBA standard
  - Applications that need both are left out in the cold
- Our focus
  - Why real-time and fault tolerance aren’t an easy mix
  - How to overcome these issues and support both needs

OA Resource Management
- RT FT CORBA standard
  - Works within and without OA (stand-alone middleware)
  - Includes some resource management
- Resource Management infrastructure
  - Works with RT FT CORBA and other middleware
  - Who handles fault detection, isolation & recovery
  - RM not easily decoupled from RT FT CORBA
- Further issues
  - Support multi-level security

RT FT CORBA
- Application-transparent fault tolerance
- Bounded predictable real-time recovery times
- Multilevel fault tolerance
- Support for real-time QoS
- Scalable
- Basic CORBA benefits

Two Goals
- Real-time CORBA
  - End-to-end predictability
  - Scheduling entities (threads)
  - Assigning priorities to tasks
  - Managing process, storage and communication resources
- Fault tolerant CORBA
  - Strong replica consistency
  - Replicating entities (CORBA objects or processes)
  - Managing and distributing replicas
  - Logging messages, checkpointing and recovery

Conflicting Worlds

What RT FT Middleware Must Do
- Handle RT-FT tradeoffs
  - Order operations to meet RT and FT requirements
  - Resolve non-deterministic conflicts (timers, multithreading)
- Lessen the impact of RT and FT on one another
  - Faults and fault recovery impact real-time performance
  - Schedule and bound recovery to avoid missed deadlines
- Support system scalability
  - Scalable fault detection and recovery
  - Consider nested (multi-tiered) applications
  - Tolerate partitioning faults

Faults to Tolerate / Reduce
- Crash faults
  - Hardware and/or OS crashes in isolation
  - Process and/or object crashes
- Omission faults
  - Missed deadline in a real-time system
- Communication faults
  - Message loss and message corruption
  - Network partitioning
- Malicious faults
  - Processor/process/object maliciously subverted
- Design faults
  - Correlated software/programming/design errors

Architectural Approach
- Replicate to protect
  - Application objects
  - RT FT middleware (scheduler, global resource manager)
- Objective
  - Keep replicas state-consistent despite faults, missed deadlines, recovery and non-determinism in system
- RT-FT Scheduler
  - Performs real-time resource-aware scheduling
  - Fault-tolerant-aware - decides when to initiate recovery
- Resource Manager
  - Hierarchical: local resource managers feed global RM
  - Coordinates with RT-FT scheduler to meet objective

Resource-Aware Middleware
- Predict and control resource usage
- Input from resource managers
  - Limits and usage of CPUs, memory, network, etc.
- Proactive actions
  - Predict and perform new resource allocations
  - Place resource-hogging objects on idle machines
- Reactive actions
  - Respond to overload conditions and transients
  - Migrate replicas of offending objects to idle machines
- Policies determine best tradeoff if limits are met
  - Know which to relax or recover first - QoS or dependability
  - Supports graceful degradation of ‘ilities’

Interceptors offer Transparency
- Extend CORBA features with Interceptors
- Interceptor – a user-level extension to OS
- Works with
  - unmodified operating systems
  - unmodified ORBs and JVMs
  - unmodified applications
- Enhance application at run time with monitoring, security, protocol adaptation, fault tolerance
- Interceptors are middleware-agnostic
  - Can make our RT-FT middleware strategy more portable

RT FT CORBA Architecture
[Figure: architecture diagram. Labeled components: Host, Application, Interceptor, Operating System, Local Fault Detector, Resource Monitor, Replication Manager, RT-FT Scheduler]

Proactive Dependability
- Rejuvenate replicas and OS resources, avoid faults
- Fault predictor
  - Know when and what types of common faults are likely to occur
  - Predict resource exhaustion, network congestion
- Recovery predictor
  - Offline: Analyze source code for worst-case recovery time
    - Look at object’s data structures, ORB interactions, etc.
    - Can’t predict dynamic memory allocation
  - Runtime: Profile object’s execution and memory allocation
    - Intercept and observe runtime memory allocations
  - Prepare for worst-case replica recovery times

Fault-Tolerance Advisor
- Delivers deployment / run-time advice on
  - Number of replicas
  - Replication style
  - Checkpoint rate
  - Fault detection rate
- Inputs to Fault-Tolerance Advisor
  - Application (size of state, quiescent points, resource usage)
  - System (reliability, recovery times, platform, OS, network)
- Advisor works with other components
  - Enforces reliability advice
  - Sustains system reliability in the presence of faults
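
The slides do not say how the advisor arrives at a replica count. As one illustrative possibility only, the sketch below picks the smallest number of replicas whose combined availability meets a target, under the simplifying assumption that replicas fail independently (correlated faults would break it). The function name and parameters are hypothetical.

```python
def replicas_needed(replica_availability, target_availability, max_replicas=16):
    """Smallest n with 1 - (1 - a)^n >= target, assuming independent replica failures."""
    if not 0.0 < replica_availability < 1.0:
        raise ValueError("replica availability must be strictly between 0 and 1")
    for n in range(1, max_replicas + 1):
        # The service is unavailable only when all n replicas are down at once.
        if 1.0 - (1.0 - replica_availability) ** n >= target_availability:
            return n
    raise ValueError("target not reachable within max_replicas")
```

For example, with replicas that are each available 90% of the time, a 99.5% target needs three replicas under this model.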

Offline RT-FT Hazard Analysis
- Applications may be non-deterministic
  - Multithreading
  - Direct access to I/O devices
  - Local timers
- Hazard oracle sifts through source code offline
  - To pinpoint sources of non-determinism
  - To insert code to sanitize/wrap non-deterministic code
  - To determine size of state and recovery time
- Reduce hazards, feed recovery times to scheduler

Measure the Benefits
- Why component-level software fault tolerance
  - Fine-grained fault tolerance offers fast failover
  - Reduce total system hardware, power, weight, cost
  - Extend mission life - support graceful degradation
- Use fault injection to measure benefits
  - Inject software faults into applications, ORBs
  - Measure real-time predictability, recoverability, survivability
  - Evaluate how RT FT ORB affects system availability
  - Several fault injectors (CMU, LAAS, Georgia Tech)

Summary
- Strategy
  - Order tasks to meet replica consistency and deadlines
  - Bound fault detection and recovery times
  - Plan worst-case performance during fault recovery
  - Support proactive dependability
  - Take the guesswork out of configuring for reliability
  - Detect and reduce non-determinism
  - Measure dependability gains
- Ongoing efforts
  - DARPA PCES program
  - OMG standardization effort - RT FT CORBA RFP

Backup

End-to-End Predictability
- Most important property in RT-CORBA
- Priorities attached to threads and invocations
  - Maps to native priorities on the operating system
- Bounds on temporal properties of application
  - Bounded message transmission latency across network
  - Bounded message processing time within ORB and
- Task schedules computed ahead of time (offline)
  - Schedule respects task priorities and task deadlines
  - Fixed-priority scheduling
- Priority banding
  - Multiple client-to-server connections, each at a different priority
  - Client-dictated or server-dictated
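
To make the priority-mapping and priority-banding bullets concrete, here is a minimal sketch. RT-CORBA priorities run from 0 to 32767; the native range of 1 to 99, the linear mapping, and both function names are assumptions chosen for illustration, not the mapping any particular ORB uses.

```python
CORBA_MAX_PRIORITY = 32767  # RT-CORBA priorities range from 0 to 32767


def to_native(corba_priority, native_min=1, native_max=99):
    """Linearly map a CORBA priority onto an assumed native OS priority range."""
    if not 0 <= corba_priority <= CORBA_MAX_PRIORITY:
        raise ValueError("CORBA priority out of range")
    span = native_max - native_min
    return native_min + (corba_priority * span) // CORBA_MAX_PRIORITY


def pick_band(corba_priority, bands):
    """Return the index of the (low, high) band, i.e. the connection, for a priority."""
    for index, (low, high) in enumerate(bands):
        if low <= corba_priority <= high:
            return index
    raise LookupError("no band covers this priority")
```

With two bands such as `[(0, 9999), (10000, 32767)]`, an invocation at priority 20000 would travel over the second connection.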

Strong Replica Consistency
- Most important property of an FT-CORBA system
- Requires deterministic behavior in application
- Message transmission and delivery guarantees
  - Same sequence of messages in the same order
  - No loss of messages over the communication medium
  - No delivery of duplicate invocations or responses
- Transfer state to new and recovering replicas
- Both active and passive replication need it
  - Passive replication cannot cure non-determinism
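
The delivery guarantees above (same order, no duplicates) can be sketched with a per-replica inbox. This assumes some mechanism outside the sketch has already assigned each message a global sequence number; the class name `ReplicaInbox` is hypothetical.

```python
class ReplicaInbox:
    """Delivers messages in sequence-number order, each exactly once."""

    def __init__(self):
        self.next_seq = 0    # next sequence number to deliver
        self.pending = {}    # out-of-order messages waiting for a gap to fill
        self.delivered = []  # messages handed to the replica, in delivery order

    def receive(self, seq, payload):
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        # Deliver every message that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

Two replicas that receive the same messages in different arrival orders, with retransmitted duplicates, end up with identical `delivered` lists.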

Determinism
- Determinism in the real-time sense
  - Equivalent to predictability
  - Real-time invocation is deterministic if execution and processing times are bounded and predictable ahead of time
  - Lack of RT determinism can result in missed deadlines
- Determinism in the fault tolerance sense
  - Equivalent to reproducibility
  - Fault-tolerant invocation is deterministic if execution on replicas, starting in same state on different processors, produces the same state changes and responses
  - Lack of FT determinism can result in replica inconsistency

Multithreading
- Real-time systems use multithreading
  - To allow concurrent tasks to execute simultaneously
- Multithreading a problem in fault tolerant systems
  - Unrestricted multi-threading leads to non-determinism
    - Server runs replicas on two different processors
    - Each replica runs two tasks on different concurrent threads
    - Threads modifying shared state lead to inconsistency
  - Shared state exists in the ORB - even if application is stateless
  - Scheduler needed to enforce single-threading for determinism
- Task management conflict
  - Multithreading for task scheduling vs. single-threading for determinism
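
One way to picture "enforce single-threading for determinism" is a replica that funnels every operation through a single worker thread, so state changes are applied in submission order. This is only a sketch of the idea, not the scheduler the slides describe; `SingleThreadedReplica` is a hypothetical name.

```python
import queue
import threading


class SingleThreadedReplica:
    """Applies operations to its state one at a time, in submission order."""

    def __init__(self):
        self.state = 0
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def _run(self):
        while True:
            op = self._requests.get()
            if op is None:  # sentinel: stop the worker
                break
            self.state = op(self.state)

    def submit(self, op):
        self._requests.put(op)

    def close(self):
        """Stop accepting work and wait until all queued operations are applied."""
        self._requests.put(None)
        self._worker.join()
```

Operations that do not commute (add one, then double) yield the same final state on every replica that receives them in the same order, at the cost of the concurrency that real-time task scheduling wants.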

Time
- Real-time systems use the notion of wall-clock time
  - Timeouts and timers used to finesse real-time consensus issues
  - Clients can run a timeout if server doesn’t respond in time
- Wall-clock time is problematic in fault-tolerant systems
  - Timeouts and timers can lead to non-determinism & inconsistency
  - Replicated (middle-tier) client with two replicas C1 and C2
  - C1’s and C2’s timeouts might expire at different times
  - C1 might think operation missed its deadline; C2 might think otherwise
- Fault-tolerant systems use clock synchronization & global time service
- Time management
  - Maintain determinism without making global time service a bottleneck
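
The C1/C2 scenario can be reproduced in a few lines. The clock values are made up, and `agreed_time` is a hypothetical stand-in for a global time service: it assumes both replicas see the same set of clock readings and adopt the smallest.

```python
def timed_out(now, sent_at, timeout):
    """True if the reply deadline has passed according to the clock reading `now`."""
    return now - sent_at > timeout


def agreed_time(readings):
    """Hypothetical global time: every replica deterministically adopts the smallest reading."""
    return min(readings)


sent_at, timeout = 0.0, 10.0
c1_clock, c2_clock = 10.03, 9.98  # local clocks have drifted slightly apart

# Each replica consults its own clock: the two decisions disagree.
local = [timed_out(c1_clock, sent_at, timeout), timed_out(c2_clock, sent_at, timeout)]

# Both replicas consult the agreed time: the decisions match.
shared = agreed_time([c1_clock, c2_clock])
agreed = [timed_out(shared, sent_at, timeout) for _ in ("C1", "C2")]
```

Here `local` holds two different answers while `agreed` holds the same answer twice; the price is an extra agreement step on every time-dependent decision, which is the bottleneck concern raised above.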

Ordering
- Ordering in the real-time sense
  - Tasks and invocations ordered to meet application deadlines
- Ordering in the fault tolerance sense
  - Tasks and invocations ordered to meet replica consistency
- What if the two orders conflict?
  - Processor P1 hosts replicas of objects A, B and C
  - Processor P2 hosts replicas of objects A and D
  - Schedules on the two processors might vary with current resources
  - P1’s replica of A and P2’s replica of A might see different orders
- What if different machines need different task mixtures?
  - Some tasks ordered a la real-time, others ordered a la fault tolerance

Synchronous Operation
- Real-time assumes mostly synchronous operation
  - Events, tasks, operations known ahead of time
  - Bounded latencies, bounded response time
- Fault tolerance considers asynchronous environment
  - Distributed asynchronous system
  - Unbounded latency & response time, unreliable fault detection
- Fault tolerance assumes inherent unpredictability
  - Faults cannot be predicted ahead of time; they are asynchronous events
  - What if faults “upset” the pre-computed real-time schedule?
- Can we get synchronous operation in an asynchronous setting?
  - Especially in the presence of transient faults

Fault Detection and Recovery
- Real-time requires bounded operation time
  - What about operations such as fault detection and recovery?
- Time-consuming fault detection
  - What can we do about common-mode (correlated) faults?
  - Crash of processor hosting 100 objects can lead to 100 fault reports
- Time-consuming recovery
  - Recovery must account for ORB, application and infrastructure state
  - Recovery of trivial objects is straightforward (state=simple data structure)
  - What if recovery involves object instantiation?
  - Recovery of a process that requires 100 objects to be instantiated
- FT CORBA supports object-centric recovery; shared state calls for process-centric recovery
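
One possible response to the "100 fault reports" problem, offered purely as an illustration and not as something the slides propose, is to collapse per-object reports into a single host-level report when every object on a host has been reported failed. The function name and data shapes are hypothetical.

```python
from collections import defaultdict


def aggregate_reports(reports, objects_on_host):
    """Collapse per-object fault reports into host-level reports where possible.

    reports: iterable of (host, object_id) pairs.
    objects_on_host: dict mapping host -> set of object ids it hosts.
    """
    failed = defaultdict(set)
    for host, obj in reports:
        failed[host].add(obj)

    summary = []
    for host in sorted(failed):
        hosted = objects_on_host.get(host)
        if hosted and failed[host] >= hosted:
            # Every object on the host failed: treat it as one host crash.
            summary.append(("host", host))
        else:
            summary.extend(("object", host, obj) for obj in sorted(failed[host]))
    return summary
```

A crash of a processor hosting 100 objects then produces one entry instead of 100, while an isolated object crash on another host is still reported individually.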