Fault Tolerant CORBA (FT-CORBA) - Modeling and Analysis
István Majzik
Budapest University of Technology and Economics
Department of Measurement and Information Systems
June 2000
Introduction
  Basis:
    – FT-CORBA specification
    – UML-based automatic dependability modeling
  Topics:
    – Support to construct optimal FT-CORBA schemes
    – Evaluate existing architectures
  Part I: The FT-CORBA proposal
  Part II: UML-based dependability analysis
  Part III: Dependability modeling of FT-CORBA
Part I: The FT-CORBA Proposal
CORBA
  OMG CORBA: standard for open OO systems
    – Provides transparent access to services of remote objects (like local method calls)
    – ORB: Object Request Broker
        communication of requests/responses (location, activation, parameter passing etc.)
        IOR: interoperable object reference
        GIOP: general inter-ORB protocol
        IIOP: Internet inter-ORB protocol
    – IDL: Interface definition language
        consistency between client and server interfaces
FT-CORBA
  Goal: Fault tolerance in CORBA environment
  History:
    – April 1998: Request for Proposal issued
    – October 1998: Initial submissions
    – December 1999: Joint revised submission by Ericsson, Inprise, Iona, Lucent, Oracle, Sun,...
    – April 2000: Final adopted specification
FT-CORBA Concepts
  Avoiding a single point of failure (SPOF) in single (server) objects
  Fault tolerance by entity redundancy, fault detection and recovery
    – creation of (server) object groups
    – infrastructure to maintain object replicas
  Basic properties:
    – replication transparency (access independent of number/location)
    – failure transparency (access independent of faulty server objects)
Fault Tolerance Domains
  FT domain:
    – Object groups of server object replicas
    – Single Replication Manager
  Object groups:
    – different hosts
    – single object per host
  Replication Manager:
    – Creation and management of object groups
    – Support of application-controlled management
Fault Tolerance Domain
  [Figure: domains, object groups, hosts and replicas]
Architecture Overview
  Set of CORBA objects to support FT
    – Replication Manager
    – Fault Detector
    – Fault Notifier
    – Fault Analyzer
  ORB extensions
    – logging mechanism
    – recovery mechanism
  Commercial implementations?
Fault Tolerance Infrastructure
Replication Management
  Infrastructure-controlled case:
    – application: calls the create_object() method of the RM
    – RM: invokes local factory objects on hosts
    – RM manages membership, consistency
  Application-controlled case:
    – application's responsibility to manage replicas
  Parameters:
    ReplicationStyle: stateless, cold / warm passive, active
    MembershipStyle
    ConsistencyStyle
    InitialNumberReplicas, MinimumNumberReplicas
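The parameters listed above can be pictured as one property record per object group. The following is only a sketch: the class names, the enum `ControlStyle` (two values mirroring the infrastructure-controlled and application-controlled cases) and the validation rule that the initial number of replicas is not below the minimum are assumptions made for illustration, not the FT-CORBA IDL.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationStyle(Enum):
    # The four styles named on the slide.
    STATELESS = "stateless"
    COLD_PASSIVE = "cold passive"
    WARM_PASSIVE = "warm passive"
    ACTIVE = "active"


class ControlStyle(Enum):
    # Assumption: membership and consistency are each either
    # infrastructure controlled or application controlled.
    INFRASTRUCTURE = "infrastructure controlled"
    APPLICATION = "application controlled"


@dataclass(frozen=True)
class GroupProperties:
    """Hypothetical record of the per-group parameters."""
    replication_style: ReplicationStyle
    membership_style: ControlStyle
    consistency_style: ControlStyle
    initial_number_replicas: int
    minimum_number_replicas: int

    def __post_init__(self) -> None:
        # Assumed sanity checks, not taken from the specification.
        if self.minimum_number_replicas < 1:
            raise ValueError("at least one replica is required")
        if self.initial_number_replicas < self.minimum_number_replicas:
            raise ValueError("initial number of replicas is below the minimum")


props = GroupProperties(
    replication_style=ReplicationStyle.WARM_PASSIVE,
    membership_style=ControlStyle.INFRASTRUCTURE,
    consistency_style=ControlStyle.INFRASTRUCTURE,
    initial_number_replicas=3,
    minimum_number_replicas=2,
)
print(props.replication_style.value, props.initial_number_replicas)
```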
Fault Detection and Notification
  Fault model:
    – object crash (incorrect results are not tolerated)
  Fault detection by polling
    – application objects inherit the PullMonitorable interface: is_alive() method
    – Fault Detector invokes it periodically
    – hierarchy of fault detectors
  Fault notification and fault analysis
  Parameters:
    – FaultMonitoring (Style, Granularity, IntervalAndTimeout)
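Polling-based detection as described above can be sketched in a few lines. This is an illustrative, in-process model only: `Replica`, `FaultDetector`, `poll_once` and the `notify` callback are hypothetical names, a crash is modeled as an exception from `is_alive()`, and the periodic timer and timeout are left out.

```python
from typing import Callable, Dict, List


class Replica:
    """Hypothetical application object exposing an is_alive() probe."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.crashed = False

    def is_alive(self) -> bool:
        if self.crashed:
            # A crashed object cannot answer; modeled here as an exception.
            raise ConnectionError(f"{self.name} does not respond")
        return True


class FaultDetector:
    """Polls monitored objects and reports crashes to a notifier callback."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.notify = notify
        self.monitored: Dict[str, Replica] = {}

    def monitor(self, replica: Replica) -> None:
        self.monitored[replica.name] = replica

    def poll_once(self) -> List[str]:
        """Run one monitoring round; return the names reported as faulty."""
        faulty = []
        for name, replica in list(self.monitored.items()):
            try:
                alive = replica.is_alive()
            except ConnectionError:
                alive = False
            if not alive:
                faulty.append(name)
                del self.monitored[name]  # report each crash only once
                self.notify(name)
        return faulty


reports: List[str] = []
detector = FaultDetector(notify=reports.append)
s1, s2 = Replica("S1"), Replica("S2")
detector.monitor(s1)
detector.monitor(s2)

detector.poll_once()   # both replicas answer
s2.crashed = True
detector.poll_once()   # S2 is detected and reported
print(reports)         # ['S2']
```

In a real deployment `poll_once` would be driven by a timer set to the monitoring interval, and the callback would stand for the Fault Notifier.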
Logging and Recovery
  Application objects inherit:
    – Checkpointable interface: get_state(), set_state()
    – Updateable interface: get_update(), set_update()
  Logging Mechanism:
    – storing GIOP messages
    – periodically storing state of the objects
  Recovery Mechanism:
    – restore object state and retrieve stored messages
  Parameters:
    – CheckpointInterval
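The interplay of checkpointing and message logging can be illustrated with a small sketch. It assumes a hypothetical `Counter` object offering `get_state()` and `set_state()`, and a hypothetical `Log` class that stores requests as (method, arguments) pairs in place of GIOP messages; incremental updates via `get_update()`/`set_update()` are not modeled.

```python
from typing import Any, Dict, List, Optional, Tuple


class Counter:
    """Hypothetical server object with get_state()/set_state()."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int) -> int:
        self.value += amount
        return self.value

    def get_state(self) -> Dict[str, Any]:
        return {"value": self.value}

    def set_state(self, state: Dict[str, Any]) -> None:
        self.value = state["value"]


class Log:
    """Keeps the last checkpoint and the requests received after it."""

    def __init__(self) -> None:
        self.checkpoint: Optional[Dict[str, Any]] = None
        self.messages: List[Tuple[str, tuple]] = []

    def record_request(self, method: str, args: tuple) -> None:
        self.messages.append((method, args))

    def take_checkpoint(self, obj: Counter) -> None:
        self.checkpoint = obj.get_state()
        self.messages.clear()  # older requests are covered by the checkpoint

    def recover(self, obj: Counter) -> None:
        """Restore the checkpoint, then replay the logged requests in order."""
        if self.checkpoint is not None:
            obj.set_state(self.checkpoint)
        for method, args in self.messages:
            getattr(obj, method)(*args)


def invoke(obj: Counter, log: Log, method: str, *args: Any) -> Any:
    log.record_request(method, args)   # log first, then execute
    return getattr(obj, method)(*args)


primary, log = Counter(), Log()
invoke(primary, log, "add", 5)
log.take_checkpoint(primary)           # state {"value": 5} is stored
invoke(primary, log, "add", 2)
invoke(primary, log, "add", 3)

backup = Counter()                     # fresh replica after a crash
log.recover(backup)
print(primary.value, backup.value)     # 10 10
```

A shorter checkpoint interval means fewer messages to replay on recovery, which is why CheckpointInterval influences the repair latency.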
Client Failover
  Identification of object groups:
    – IOGR: interoperable object group reference
    – multiple IIOP profiles addressing object group members or gateways
  Basic mechanisms of the client ORB:
    – retry all alternative IIOP profiles
    – transparent reinvocation of requests ("at most once" execution semantics at the server)
    – heartbeating of the server
  [Figure label: IIOP]
Part II: Dependability Modeling of Object-Oriented Systems Described in UML
Dependability Analysis
  Approach by A. Bondavalli, I. Majzik, I. Mura
  HIDE - High-level Integrated Design Environment for Dependability
    ESPRIT Open LTR No
  From UML-based models (class, object, deployment diagrams) to Timed Petri Nets
    standard PN evaluation tools can be used
  Supports
    – comparison of design choices
    – identification of bottlenecks
  System-wide, structural model
Modeling Approach
  1. UML model: Diagrams with extensions
       stereotypes to identify roles (variant, tester,...)
       tagged values to assign parameters
  2. Intermediate model: Simplified structure
       elements: software, hardware, with/without states
       dependencies: "uses the service of", "is composed of"
       class based redundancy
       fault tree
  3. Dependability model: Timed Petri net
       sub-nets for elements and dependencies
Failure/Propagation Sub-models
  [Figure: UML model elements (objects O1, O2) and the corresponding Petri net modules]
Repair Sub-model
  [Figure: UML model (object O1) and the corresponding Petri net module]
Redundancy Sub-models
  [Figure: UML model (RM, V1, V2), fault tree and Petri net]
Part III: Dependability Modeling of FT-CORBA Architectures
Approach
  UML models:
    – identification of elements/structures
    – additional parameters
    support of automatic modeling
  Tailoring to FT-CORBA
    – subnets for specific mechanisms
    – based on the parameters
  Restrictions:
    – non-replicated client, static structure
    – infrastructure-controlled replication management
UML Modeling
  Identification of elements/structures
    – Fault Tolerance Domain: package
        independent of deployment
    – Object groups: sub-package
    – Roles: stereotypes
  FT-CORBA properties as tagged values
    – ReplicationStyle
    – MembershipStyle
    – ConsistencyStyle
    – FaultMonitoring (Style, Granularity, Interval)
    – (Initial, Minimum) NumberReplicas
Overall Structure
  [Figure labels: FT Domain Alpha, Domain1, Domain2, FTI, RM, FN, FD, OG1, OG2, OG3, OG4, S11, S12, FD1, C1, C2]
Modularity
  Available building blocks:
    – failure subnet
    – propagation subnet
    – repair subnet
    – fault tree
  Sub-models in FT-CORBA:
    1. Client failover
    2. Server object failure
    3. Fault management (detection and notification)
    4. Recovery (replication management)
Client Failover
  Semantics:
    – Primary is tried first
    – Failover conditions: "crash"
        Communication failure
        No response
      No failover: erroneous response
    – No failure exception until all profiles have been tried
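The failover semantics above can be sketched as a retry loop over the profiles of a group reference. The sketch makes several assumptions: each profile is represented by a plain callable, crash-like conditions surface as `ConnectionError` or `TimeoutError`, and the names `invoke_with_failover` and `AllProfilesFailed` are hypothetical. Duplicate suppression at the server ("at most once" execution) is not modeled.

```python
from typing import Callable, List

# Hypothetical stand-in for one profile of an object group reference.
Profile = Callable[[str], str]


class AllProfilesFailed(Exception):
    """Raised only after every profile has been tried."""


def invoke_with_failover(profiles: List[Profile], request: str) -> str:
    for profile in profiles:          # the primary is first in the list
        try:
            # Any reply, even an erroneous one, is returned as-is:
            # the client cannot tell it apart from a correct reply.
            return profile(request)
        except (ConnectionError, TimeoutError):
            continue                  # crash-like condition: try the next one
    raise AllProfilesFailed(f"no profile answered request {request!r}")


def crashed_primary(request: str) -> str:
    raise ConnectionError("primary unreachable")


def healthy_backup(request: str) -> str:
    return f"reply to {request}"


result = invoke_with_failover([crashed_primary, healthy_backup], "ping")
print(result)  # reply to ping
```

The exception is raised only after the loop is exhausted, matching the rule that no failure is reported until all profiles have been tried.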
Dependability Sub-model
  Fault tree (passive replication):
    – Top event: Client failure
    – Basic events:
        Server object crash
        Server object erroneous response
    – Composite events (OR), for n profiles:
        S1 (primary) erroneous
        S1 crash AND S2 erroneous
        S1 crash AND S2 crash AND S3 erroneous
        ...
        S1 crash AND S2 crash AND ... AND Sn crash
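As a rough numerical illustration of this fault tree, the sketch below evaluates the top event for fixed probabilities. It assumes that the servers fail independently and that each server is in exactly one of three states (correct, crashed, erroneous), so the OR branches are mutually exclusive and their probabilities can be summed. The function name and the sample numbers are hypothetical; the actual analysis in the approach is done on the Petri net, not with static probabilities.

```python
from typing import Sequence


def client_failure_probability(crash: Sequence[float],
                               erroneous: Sequence[float]) -> float:
    """P(top event) for profiles S1..Sn tried in order.

    crash[i] and erroneous[i] are the probabilities that server i+1 has
    crashed or answers erroneously (assumed independent across servers).
    """
    if len(crash) != len(erroneous):
        raise ValueError("one crash and one erroneous probability per server")
    total = 0.0
    all_previous_crashed = 1.0
    for c, e in zip(crash, erroneous):
        # Branch: S1..S(k-1) crashed AND Sk erroneous.
        total += all_previous_crashed * e
        all_previous_crashed *= c
    # Last branch: S1 crash AND ... AND Sn crash.
    return total + all_previous_crashed


p = client_failure_probability(crash=[0.1, 0.1], erroneous=[0.01, 0.01])
print(round(p, 6))
```

For two servers this is 0.01 + 0.1 * 0.01 + 0.1 * 0.1, that is, about 0.021.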
Server Object Failure
  Distinction of failures:
    – Crash
        Failover in client
        Error detected in the object group
    – Erroneous response (commission fault)
        Propagated to clients, application-specific error detection
Dependability Sub-model
  Failure process:
    – failure subnet
    – distinguished cases: crash/erroneous response
  Propagation subnets
    – standard subnets (toward the client fault tree)
Fault Management
  Fault detection+notification: Chain of events
    – Source: Fault Detector
        latency = MonitoringInterval
        coverage depends on MonitoringGranularity:
          – each member / single per host / single per host and type
    – Propagation: Fault Notifier(s)
        communication failures
    – Destination: Replication Manager
  Hierarchy of Fault Detectors
  Infrastructure objects: Replication is possible
Dependability Sub-model
  Error detection delay
    – timed PN transition
  Fault notification subsystem
    – fault tree (AND)
  Replicated infrastructure objects
    – local fault trees (AND)
Recovery in the Object Group
  Triggered by the Fault Notifier in the Replication Manager
  Goal: Maintain the number of replicas
    – crashed object is removed
    – creation of new replica, restoring state
    – only a single replica on a given host!
  Repair is possible if:
    – current host is fault-free
    – current host is faulty, but there are available hosts
        i.e. number of hosts >= NumberReplicas
Dependability Sub-model
  Repair subnet: Explicit repair
    – latency: CheckpointInterval, ReplicationStyle
  Recovery of the replica:
    – Static deployment: Standard repair subnet
    – Pool of identical hosts: Logic condition for repair
        Free hosts (PN place)
          marking increased by host repair and server object crash
          marking decreased by host crash and server object repair
        Guard on the transition for explicit repair
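The bookkeeping of the "free hosts" place and the guard on the repair transition can be mimicked with a counter. This is only a sketch with hypothetical names (`HostPool` and its methods); it tracks the marking of one place and ignores timing. It also assumes that a host crash removes a token only when a free host is the one that crashes.

```python
class HostPool:
    """Token count of the 'free hosts' place, with a guarded repair."""

    def __init__(self, free_hosts: int) -> None:
        if free_hosts < 0:
            raise ValueError("marking cannot be negative")
        self.free_hosts = free_hosts

    # Events that increase the marking.
    def host_repaired(self) -> None:
        self.free_hosts += 1

    def server_object_crashed(self) -> None:
        # The host of the crashed replica no longer carries a replica.
        self.free_hosts += 1

    # Events that decrease the marking.
    def host_crashed(self) -> None:
        if self.free_hosts == 0:
            raise RuntimeError("no free host left that could crash")
        self.free_hosts -= 1

    def repair_enabled(self) -> bool:
        """Guard of the explicit repair transition."""
        return self.free_hosts > 0

    def server_object_repaired(self) -> bool:
        """Fire the repair transition if the guard holds."""
        if not self.repair_enabled():
            return False
        self.free_hosts -= 1
        return True


pool = HostPool(free_hosts=0)
first_try = pool.server_object_repaired()   # guard blocks: no free host
pool.host_repaired()                        # a spare host becomes available
second_try = pool.server_object_repaired()  # replica is placed on it
print(first_try, second_try, pool.free_hosts)  # False True 0
```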
Overall Structure of Subnets
  [Figure labels: Notification, Prop., Client Fault Tree, S1 err, S1 crash, Recovery, Repair; parameters: NumberReplica, FaultMonitoringGranularity, FaultMonitoringInterval, ReplicationStyle, CheckpointInterval]
System-wide Dependability Model
  Analysis of the Petri-net:
    – standard tools (SPNP, PANDA,...)
  Sensitivity analysis
    – system-wide reliability, availability
  Optimal selection of FT-CORBA parameters
    – replication (membership, consistency) styles
    – number of replicas
    – monitoring granularity, interval
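To give a flavor of a sensitivity analysis over one parameter, the sketch below sweeps the number of replicas in a deliberately simplified static model: identical, independent servers tried in order, with the client failing on the first erroneous response or when every server has crashed. The function `client_failure` and the probabilities 0.1 and 0.01 are assumptions for illustration; the approach itself evaluates the full Petri net with the tools named above.

```python
def client_failure(n: int, crash: float, erroneous: float) -> float:
    """Client failure probability for n identical servers tried in order."""
    total, all_previous_crashed = 0.0, 1.0
    for _ in range(n):
        total += all_previous_crashed * erroneous  # k-th server answers wrongly
        all_previous_crashed *= crash
    return total + all_previous_crashed            # every server has crashed


crash, erroneous = 0.1, 0.01
previous = None
for n in range(1, 6):
    p = client_failure(n, crash, erroneous)
    gain = "" if previous is None else f"  improvement {previous - p:.6f}"
    print(f"replicas={n}  P(client failure)={p:.6f}{gain}")
    previous = p

# Erroneous responses are not masked by failover, so under these
# assumptions the result cannot fall below this floor as n grows.
floor = erroneous / (1.0 - crash)
print(f"floor for large n: {floor:.6f}")
```

In this simplified setting the benefit of each additional replica shrinks quickly, which is the kind of observation that guides the choice of the number of replicas.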