High Availability and Fault-Tolerance in Real-Time Databases. Jan Lindström, University of Helsinki, Department of Computer Science.


1 High Availability and Fault-Tolerance in Real-Time Databases
Jan Lindström, University of Helsinki, Department of Computer Science

2 Overview
- The causes of downtime
- Availability solutions
- CASE 1: Clustra
- CASE 2: TelORB
- CASE 3: RODAIN

3 The Causes of Downtime
- Planned downtime
  - Hardware expansion
  - Database software upgrades
  - Operating system upgrades
- Unplanned downtime
  - Hardware failure
  - OS failure
  - Database software bugs
  - Power failure
  - Disaster
  - Human error

4 Traditional Availability Solutions
- Replication
- Failover
- Primary restart

5 CASE 1: Clustra
- Developed for telephony applications such as mobility management and intelligent networks.
- Relational database with location and replication transparency.
- Real-time data is locked in main memory, and the API provides precompiled transactions.
- NOT a real-time database!

6 Clustra hardware architecture

7 Data distribution and replication

8 How Clustra Handles Failures
- Real-time failover: hot-standby data is up to date, so failover occurs in milliseconds.
- Automatic restart and takeback: restart of the failed node and takeback of operations are automatic and transparent to users and operators.
- Self-repair: if a node fails completely, data is copied from the complementary node to a standby. This is also automatic and transparent.
- Limited failure effects
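The failover and self-repair bullets above can be sketched roughly as follows. This is a minimal illustration, not Clustra's implementation: the class `ReplicatedFragment`, its methods, and the node names are hypothetical, and writing to both copies synchronously is an assumption used to keep the hot standby up to date.

```python
class ReplicatedFragment:
    """Hypothetical model: one data fragment with a primary and a hot-standby copy."""

    def __init__(self, primary, hot_standby, spares):
        self.primary = primary
        self.hot_standby = hot_standby
        self.spares = list(spares)            # idle nodes available for self-repair
        self.data = {primary: {}, hot_standby: {}}

    def write(self, key, value):
        # Assumption: every write reaches both copies, so the standby is always current.
        for node in (self.primary, self.hot_standby):
            if node is not None:
                self.data[node][key] = value

    def read(self, key):
        return self.data[self.primary][key]

    def node_failed(self, node):
        if node is None or node not in (self.primary, self.hot_standby):
            return
        del self.data[node]                   # the failed node's copy is gone
        if node == self.primary:
            # Failover: the up-to-date standby takes over immediately.
            self.primary, self.hot_standby = self.hot_standby, None
        else:
            self.hot_standby = None
        if self.spares and self.primary is not None:
            # Self-repair: copy the surviving replica to a spare node.
            spare = self.spares.pop(0)
            self.data[spare] = dict(self.data[self.primary])
            self.hot_standby = spare
```

With nodes `"n1"` (primary), `"n2"` (hot standby) and a spare `"n3"`, a failure of `"n1"` promotes `"n2"` and rebuilds the standby copy on `"n3"` without losing committed writes.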

9 How Clustra Handles Upgrades
- Hardware, operating system, and database software upgrades without ever going down.
  - Process called "rolling upgrade"
    - I.e. the required changes are performed node by node.
    - Each upgraded node catches up to the state of its complementary node.
    - When this is completed, the operation is performed on the next node.
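A rolling upgrade of this kind can be sketched as a loop over node pairs. The sketch below is illustrative only: `Node`, `serve`, and `rolling_upgrade` are hypothetical names, and the `applied` counter is an assumed stand-in for the node's data state.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: int
    online: bool = True
    applied: int = 0  # assumed stand-in for how many updates the node has seen


def serve(pair):
    """Simulated workload: every online node in the pair applies one update."""
    online = [n for n in pair if n.online]
    assert online, "a pair must never be fully offline"
    for n in online:
        n.applied += 1


def rolling_upgrade(pairs, new_version):
    """Upgrade one node at a time while its complementary node keeps serving."""
    for pair in pairs:
        for i, node in enumerate(pair):
            partner = pair[1 - i]
            node.online = False             # take a single node out of service
            node.version = new_version      # perform the required change
            serve(pair)                     # the partner handles traffic meanwhile
            node.applied = partner.applied  # catch up to the complementary node
            node.online = True              # only now move on to the next node
```

The key property is that at most one node of each pair is offline at any time, so the service as a whole never goes down.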

10 CASE 2: TelORB
Characteristics
- Very high availability (HA), robustness implemented in SW
- (Soft) real time
- Scalability by using loosely coupled processors
Openness
- Hardware: Intel/Pentium
- Language: C++, Java
- Interoperability: CORBA/IIOP, TCP/IP, Java RMI
- 3rd-party SW: Java

11 TelORB Availability
- Real-time object-oriented DBMS supporting
  - Distributed transactions
  - ACID properties expected from a DBMS
  - Data replication (providing redundancy)
- Network redundancy
- Software configuration control
- Automatic restart, on the working processors, of processes that originally executed on a faulty processor
- Self-healing
- In-service upgrade of software with no disturbance to operation
- Hot replacement of faulty processors
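The automatic restart of processes on working processors can be pictured as a reassignment step. The slides do not say how TelORB chooses the target processor, so the least-loaded policy below is purely an assumption, and `reassign` is a hypothetical helper.

```python
def reassign(placement, failed):
    """Return a new placement with the failed processor's processes moved elsewhere.

    placement maps a processor name to the list of process names it runs.
    Assumed policy: each orphaned process goes to the least-loaded survivor
    (ties broken by processor name).
    """
    survivors = {p: list(procs) for p, procs in placement.items() if p != failed}
    if not survivors:
        raise RuntimeError("no working processor left")
    for proc in placement.get(failed, []):
        target = min(survivors, key=lambda p: (len(survivors[p]), p))
        survivors[target].append(proc)   # "restart" the process on a working processor
    return survivors
```

For example, if `p1` runs `a` and `b`, `p2` runs `c`, and `p3` is idle, a failure of `p1` moves `a` to `p3` and `b` to `p2`.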

12 Automatic Reconfiguration reloading

13 Software upgrade
- Smooth software upgrade when old and new versions of the same process can coexist
- Possibility for the application to arrange for state transfer between the old and new static process (unless the important state is already stored in the database)

14 Partitioning: Types and Data

15 Advantages
- Standard interfaces through CORBA
- Standard languages: C++, Java
- Based on commercial hardware
- (Soft) real-time OS
- Fault tolerance implemented in software
- Fully scalable architecture
- Includes powerful middleware: a database management system and functions for software management
- Fully compatible simulated environment for development on Unix/Linux/NT workstations

16 CASE 3: RODAIN
- Real-Time Object-Oriented Database Architecture for Intelligent Networks
- Real-time main-memory database system
- Runs on a real-time OS: Chorus/ClassiX (and Linux)

17 Rodain Cluster

18 Rodain Database Node
[Figure: a Database Primary Unit and a Database Mirror Unit connected to a shared disk. Each unit contains a User Request Interpreter Subsystem, a Distributed Database Subsystem, a Fault-Tolerance and Recovery Subsystem, a Watchdog Subsystem, and an Object-Oriented Database Management Subsystem.]

19 RODAIN Database Node II
[Figure: a Database Primary Unit and a Database Mirror Unit connected to a shared disk. Each unit contains a User Request Interpreter Subsystem, a Distributed Database Subsystem, a Fault-Tolerance and Recovery Subsystem, a Watchdog Subsystem, and an Object-Oriented Database Management Subsystem.]

20 ORD Architecture
[Figure labels: TRP, FTRS, DDS, ORD, OCC, Data, Index]

21 Fault-Tolerance
- Based on logs and mirroring
- Logs are sent to the Mirror
- Mirror stores the logs on disk in SSS
- Mirror maintains a copy of the main-memory database
- Mirror makes disk copies of its database image
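The log-and-mirror scheme listed above can be sketched as follows. This is a simplified model under stated assumptions, not RODAIN code: the classes `Primary` and `Mirror`, the key/value log records, and the `image_lsn` counter are all hypothetical, and in-memory lists stand in for the disk in SSS.

```python
class Mirror:
    """Hypothetical mirror node: stores logs, keeps a database copy, checkpoints it."""

    def __init__(self):
        self.mmdb = {}          # the mirror's copy of the main-memory database
        self.disk_log = []      # stands in for the log stored on disk in SSS
        self.disk_image = {}    # last database image written to disk
        self.image_lsn = 0      # number of log records reflected in disk_image

    def receive(self, record):
        self.disk_log.append(record)   # store the log record on disk
        key, value = record
        self.mmdb[key] = value         # keep the in-memory copy current

    def checkpoint(self):
        self.disk_image = dict(self.mmdb)
        self.image_lsn = len(self.disk_log)


class Primary:
    """Hypothetical primary node: applies writes and ships a log record per write."""

    def __init__(self, mirror):
        self.mmdb = {}
        self.mirror = mirror

    def commit(self, writes):
        for key, value in writes.items():
            self.mmdb[key] = value
            self.mirror.receive((key, value))   # logs are sent to the Mirror
```

After any sequence of commits the mirror's `mmdb` equals the primary's, while `disk_image` lags behind until the next `checkpoint()`; the gap is covered by the log records after `image_lsn`.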

22 Recovery
- Based on role switching
- When the Primary fails
  - Mirror brings its MMDB up to date
  - Mirror starts acting as the new Primary
  - Active transactions are restarted or lost
- When the Mirror fails
  - Primary stores logs directly to SSS
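Role switching as described above might look like the following sketch. All names (`Node`, `NodePair`, `unapplied`, `sss_log`) are hypothetical, and the assumption that the mirror applies received logs lazily is made only so that "brings its MMDB up to date" has something to do.

```python
class Node:
    def __init__(self, name, role):
        self.name = name
        self.role = role
        self.mmdb = {}
        self.unapplied = []   # assumed: log records received but not yet applied


class NodePair:
    """Hypothetical primary/mirror pair sharing a log in SSS."""

    def __init__(self):
        self.primary = Node("A", "primary")
        self.mirror = Node("B", "mirror")
        self.sss_log = []     # (role of the writer, log record)
        self.active = set()   # transactions begun but not yet committed

    def begin(self, txn):
        self.active.add(txn)

    def commit(self, txn, writes):
        self.active.discard(txn)
        self.primary.mmdb.update(writes)
        record = (txn, dict(writes))
        if self.mirror is not None:
            self.mirror.unapplied.append(record)       # logs go to the Mirror,
            self.sss_log.append(("mirror", record))    # which stores them in SSS
        else:
            self.sss_log.append(("primary", record))   # no Mirror: write directly

    def primary_failed(self):
        if self.mirror is None:
            raise RuntimeError("no mirror available to take over")
        new_primary = self.mirror
        for _txn, writes in new_primary.unapplied:     # bring the MMDB up to date
            new_primary.mmdb.update(writes)
        new_primary.unapplied.clear()
        new_primary.role = "primary"                   # role switch
        self.primary, self.mirror = new_primary, None
        interrupted, self.active = self.active, set()
        return interrupted                             # to be restarted, or lost

    def mirror_failed(self):
        self.mirror = None
```

`primary_failed()` returns the set of transactions that were active at the time of the failure; whether the caller restarts them or reports them as lost is left open, as on the slide.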

23 Recovery II
- During recovery the failed Node
  - always starts as a mirror node
  - loads the most recent database image from disks in SSS
  - applies the log tail to the loaded image
  - receives the logs from the primary node
  - continues as a normal mirror node
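The recovery steps above can be sketched as a single function. This is an illustration under assumptions: `recover_as_mirror` is a hypothetical name, log records are simple key/value pairs, and `image_lsn` (the number of log records already reflected in the disk image) is an assumed way of locating the log tail.

```python
def recover_as_mirror(disk_image, image_lsn, disk_log, live_logs):
    """Rebuild a failed node's main-memory database and return (role, mmdb).

    disk_image: most recent database image stored on disk in SSS
    image_lsn:  assumed count of log records already contained in disk_image
    disk_log:   full log stored in SSS, as (key, value) records
    live_logs:  records received from the primary node during recovery
    """
    role = "mirror"                          # a recovering node always starts as a mirror
    mmdb = dict(disk_image)                  # load the most recent image
    for key, value in disk_log[image_lsn:]:  # apply the log tail to the loaded image
        mmdb[key] = value
    for key, value in live_logs:             # apply the logs received from the primary
        mmdb[key] = value
    return role, mmdb                        # continue as a normal mirror node
```

For instance, an image `{"a": 1}` taken after the first log record, a disk log of `("a", 1), ("a", 2), ("b", 3)`, and one live record `("c", 4)` yield a mirror holding `a=2, b=3, c=4`.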

24 Further reading
- Bratsberg, Humborstad: Online Scaling in a Highly Available Database, Proceedings of the 27th VLDB Conference, Rome, Italy, pp. 451-460, 2001.
- Clustra Database: Technical Overview, http://www.clustra.com
- Björnerstedt, Ketoja, Sintorn, Sköld: Replication between Geographically Separated Clusters - An Asynchronous Scalable Replication Mechanism for Very High Availability, Proceedings of the International Workshop on Databases in Telecommunications II, LNCS vol. 2209, pp. 102-115, 2001.
- Lindström, Niklander, Porkka, Raatikainen: A Distributed Real-Time Main-Memory Database for Telecommunications, Proceedings of the International Workshop on Databases in Telecommunications, LNCS vol. 1819, pp. 158-173, 2000.

