Real-Time Fault Tolerant CORBA

Real-Time Fault Tolerant CORBA
Tom Bracewell Senior Principal Software Engineer Raytheon IDS, Sudbury, MA Dr. Priya Narasimhan Asst. Professor of ECE and CS Carnegie Mellon University, Pittsburgh, PA

Motivation Growing need for middleware that supports both dependability and real-time QoS Two CORBA standards Real-Time (RT) CORBA standard Fault-Tolerant (FT) CORBA standard Applications that need both are left out in the cold Our focus Why real-time and fault tolerance aren’t an easy mix How to overcome these issues and support both needs

OA Resource Management
RT FT CORBA standard work within and without OA (stand-alone middleware) Includes some resource management Resource Management infrastructure Work with RT FT CORBA and other middleware who handles fault detection, isolation & recovery RM not easily decoupled from a RT FT CORBA Further issues support multi-level security

RT FT CORBA Application-transparent fault tolerance
Bounded predictable real-time recovery times Multilevel fault tolerance Support for real-time QOS Scalable Basic CORBA benefits

Two Goals Real-time CORBA Fault tolerant CORBA
End-to-end predictability Scheduling entities (threads) Assigning priorities to tasks Managing process, storage and communication resources Fault tolerant CORBA Strong replica consistency Replicating entities (CORBA objects or processes) Managing and distributing replicas Logging messages, checkpointing and recovery

Conflicting Worlds

What RT FT Middleware Must Do
Handle RT- FT tradeoffs Order operations to meet RT and FT requirements Resolve non-deterministic conflicts (timers, multithreading) Lessen the impact of RT and FT on one other Faults and fault recovery impact real-time performance Schedule and bound recovery to avoid missed deadlines Support system scalability Scalable fault detection and recovery Consider nested (multi-tiered) applications Tolerate partitioning faults

Faults to Tolerate / Reduce
Crash faults Hardware and/or OS crashes in isolation Process and/or object crashes Omission faults missed deadline in a real-time system Communication faults Message loss and message corruption Network partitioning Malicious faults Processor/process/object maliciously subverted Design faults correlated software/programming/design errors

Architectural Approach
Replicate to protect Application objects RT FT middleware (scheduler, global resource manager) Objective Keep replicas state-consistent despite faults, missed deadlines, recovery and non-determinism in system RT-FT Scheduler Performs real-time resource-aware scheduling Fault-tolerant-aware - decides when to initiate recovery Resource Manager Hierarchical: local resource managers feed global RM Coordinates with RT-FT scheduler to meet objective

Resource-Aware Middleware
Predict and control resource usage Input from resource managers Limits and usage of CPUs, memory, network, etc. Proactive actions Predict and perform new resource allocations Place resource-hogging objects on idle machines Reactive actions Respond to overload conditions and transients Migrate replicas of offending objects to idle machines Policies determine best tradeoff if limits are met Know which to relax or recover first - QoS or dependability Supports graceful degradation of ‘ilities’

Interceptors offer Transparency
Extend CORBA features with Interceptors Interceptor – a user-level extension to OS Works with unmodified operating systems unmodified ORBs and JVMs unmodified applications Enhance application at run time with monitoring, security, protocol adaptation, fault tolerance Interceptors are middleware-agnostic can make our RT-FT middleware strategy more portable

RT FT CORBA Architecture
Local Fault Detector Interceptor Operating System Application Resource Monitor HOST Replication Manager RT-FT Scheduler

Proactive Dependability
Rejuvenate replicas and OS resources, avoid faults Fault predictor know when and what types of common faults are likely to occur predict resource exhaustion, network congestion Recovery predictor Offline: Analyze source code for worst-case recovery time Look at object’s data structures, ORB interactions, etc. Can’t predict dynamic memory allocation Runtime: Profile object’s execution and memory allocation Intercept and observe runtime memory allocations Prepare for worst-case replica recovery times

Fault-Tolerance Advisor
Delivers deployment / run-time advice on Number or replicas Replication style Checkpoint rate Fault detection rate Inputs to Fault-Tolerance Advisor Application (size of state, quiescent points, resource usage) System (reliability, recovery times, platform, OS, network) Advisor works with other components Enforces reliability advice Sustains system reliability in the presence of faults

Offline RT-FT Hazard Analysis
Applications may be non-deterministic Multithreading Direct access to I/O devices Local timers Hazard oracle sifts through source code offline To pinpoint sources of non-determinism To insert code to sanitize/wrap non-deterministic code To determine size of state and recovery time Reduce hazards, feed recovery times to scheduler

Measure the Benefits Why component-level software fault tolerance
fine-grained fault tolerance offers fast failover reduce total system hardware, power, weight, cost extend mission life - support graceful degradation Use fault injection to measure benefits Inject software faults into applications, ORBs Measure real-time predictability, recoverability, survivability Evaluate how RT FT ORB affects system availability Several fault injectors (CMU, LAAS, Georgia Tech)

Summary Strategy Ongoing efforts
Order tasks to meet replica consistency and deadlines Bound fault detection and recovery times Plan worst-case performance during fault recovery Support proactive dependability Take the guesswork out of configuring for reliability Detect and reduce non-determinism Measure dependability gains Ongoing efforts DARPA PCES program OMG standardization effort - RT FT CORBA RFP

Backup

End-to-End Predictability
Most important property in RT-CORBA Priorities attached to threads and invocations Maps to native priorities on the operating system Bounds on temporal properties of application Bounded message transmission latency across network Bounded message processing time within ORB and Task schedules computed ahead of time (offline) Schedule respects task priorities and task deadlines Fixed-priority scheduling Priority banding Multiple client-to-server connections, each at a different priority Client-dictated or server-dictated

Strong Replica Consistency
Most important property of an FT-CORBA system Requires deterministic behavior in application Message transmission and delivery guarantees Same sequence of messages in the same order No loss of messages over the communication medium No delivery of duplicate invocations or responses Transfer state to new and recovering replicas Both active and passive replication need it Passive replication cannot cure non-determinism

Determinism Determinism in the real-time sense
Equivalent to predictability Real-time invocation is deterministic if execution and processing times are bounded and predictable ahead of time Lack of RT determinism can result in missed deadlines Determinism in the fault tolerance sense Equivalent to reproducibility Fault-tolerant invocation is deterministic if execution on replicas, starting in same state on different processors, produces the same state changes and responses Lack of FT determinism can result in replica inconsistency

Multithreading Real-time systems use multithreading
To allow concurrent tasks to execute simultaneously Multithreading a problem in fault tolerant systems Unrestricted multi-threading leads to non-determinism server runs replicas on two different processors each replica runs two tasks on different concurrent threads threads modifying shared state lead to inconsistency Shared state exists in the ORB - even if application is stateless Scheduler needed to enforce single-threading for determinism Task management conflict Multithreading for task scheduling vs. single-threading for determinism

Time Real-time systems use the notion of wall-clock time
Timeouts and timers used to finesse real-time consensus issues Clients can run a timeout if server doesn’t respond in time Wall-clock time is problematic in fault-tolerant systems Timeouts and timers can lead to non-determinism & inconsistency Replicated (middle-tier) client with two replicas C1 and C2 C1’s and C2’s timeouts might expire at different times C1 might think operation missed its deadline; C2 might think otherwise Fault-tolerant systems use clock synchronization & global time service Time management Maintain determinism without making global time service a bottleneck

Ordering Ordering in the real-time sense
Tasks and invocations ordered to meet application deadlines Ordering in the fault tolerance sense Tasks and invocations ordered to meet replica consistency What if the two orders conflict? Processor P1 hosts replicas of objects A, B and C Processor P2 hosts replicas of objects A and D Schedules on the two processors might vary with current resources P1’s replica of A and P2’s replica of A might see different orders What if different machines need different task mixtures? Some tasks ordered a la real-time, others ordered a la fault tolerance

Synchronous Operation
Real-time assumes mostly synchronous operation Events, tasks, operations known ahead of time Bounded latencies, bounded response time Fault tolerance considers asynchronous environment Distributed asynchronous system Unbounded latency & response time, unreliable fault detection Fault tolerance assumes inherent unpredictability Faults cannot be predicted ahead of time; they are asynchronous events What if faults “upset” the pre-computed real-time schedule? Can we get synchronous operation in an asynchronous setting? Especially in the presence of transient faults

Fault Detection and Recovery
Real-time requires bounded operation time What about operations such as fault detection and recovery? Time-consuming fault detection What can we do about common-mode (correlated) faults? Crash of processor hosting 100 objects can lead to 100 fault reports Time-consuming recovery Recovery must account for ORB, application and infrastructure state Recovery of trivial objects is straightforward (state=simple data structure) What if recovery involves object instantiation? Recovery of a process that requires 100 objects to be instantiated FT CORBA supports object-centric recovery; shared state calls for process-centric recovery

Real-Time Fault Tolerant CORBA

Similar presentations

Presentation on theme: "Real-Time Fault Tolerant CORBA"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Real-Time Fault Tolerant CORBA

Similar presentations

Presentation on theme: "Real-Time Fault Tolerant CORBA"— Presentation transcript:

Similar presentations

About project

Feedback