Operating System Reliability

Presentation transcript:

Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems

Some Axioms
- Some simple systems, designed from scratch, sometimes work
- A complex system that works is invariably found to have evolved from a simple system that works
- A complex system designed from scratch never works

Failure-Mode Theorems
- Complex systems usually operate in failure mode
- A system should have safe behaviors when encountering failures
- When a “fail-safe” system fails, it fails by failing to fail safe

Some Definitions
- Failure: occurs when the system does not perform its services in the manner specified
  - Failures can be subtle (e.g., performance fault)
- Fault: an anomalous physical condition
  - Includes system specification/implementation mistakes
- Error: the part of the system state that differs from its intended value

Classification of Failures
- Process failures
- System failures
- Secondary storage failures
- Communication medium failures

Process Failures
- Examples
  - Computation results in incorrect outcome
  - System state deviates from specification
  - Process fails to progress
- Errors leading to failure
  - Deadlock, timeout, protection violation
  - Bad input, consistency violation
- Ignoring malicious behavior

System Failures
- Processor fails to execute
  - Software error, hardware error (CPU, bus, etc.)
- Fail-stop behavior assumed
- Failure types
  - Amnesia
  - Partial amnesia
  - Pause
  - Halting

Secondary Storage Failures
- Stored data inaccessible
  - Parity error
  - Head crash
  - Contaminated medium
- Reconstructable from archive + log, maybe
- Mirrored disks (independent failure mode)

Communication Medium Failures
- Site can’t communicate with another site
- Causes
  - Switching node failure
    - Hardware failure
    - Software failure
    - Congestion
  - Link failure
    - Hardware
    - Implementation failure
- Network partitions can result

Recovery
- Restart process/processor
- Reclaim resources
- Undo/finish incomplete transactions
- Concurrency makes things harder

Forward Error Recovery
- Goal: to restore the system from an erroneous state to an error-free state
- If the nature of the error is completely known
  - Remove error from state
  - Proceed with execution from error-free state
- Rarely possible to do

Backward Error Recovery
- Used when the error source is unknown
- Restore state to a previous error-free state; restart
- Independent of the particular fault and of the errors it caused
- Problems
  - Performance penalty
  - No guarantee the fault will not recur
  - Possible unrecoverable component of state
- Recovery point: the saved state used to replace the erroneous state

Backward Error Recovery
- Basic approaches
  - Operation-based
    - Logs
      - Update-in-place
      - Write-ahead log
  - State-based

Update-in-Place
- Every update to an object also writes a log record containing
  - Name of object
  - Old and new states of object
- A recoverable update operation is implemented as
  - Do, undo, redo operations
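The do/undo/redo idea can be sketched in a few lines of Python. This is an illustration only, not the course's implementation: `RecoverableStore`, `LogRecord`, and the `MISSING` sentinel are hypothetical names, and the log lives in memory rather than on stable storage.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

MISSING = object()  # hypothetical sentinel: the object did not exist before the update


@dataclass(frozen=True)
class LogRecord:
    name: str  # name of the object
    old: Any   # state before the update (used by undo)
    new: Any   # state after the update (used by redo)


class RecoverableStore:
    """In-memory stand-in for objects updated in place with a log."""

    def __init__(self) -> None:
        self.objects: Dict[str, Any] = {}
        self.log: List[LogRecord] = []

    def do(self, name: str, new: Any) -> LogRecord:
        record = LogRecord(name, self.objects.get(name, MISSING), new)
        self.objects[name] = new   # update the object in place
        self.log.append(record)    # and record the update in the log
        return record

    def undo(self, record: LogRecord) -> None:
        # Restore the old state noted in the log record.
        if record.old is MISSING:
            self.objects.pop(record.name, None)
        else:
            self.objects[record.name] = record.old

    def redo(self, record: LogRecord) -> None:
        # Reapply the new state noted in the log record.
        self.objects[record.name] = record.new
```

Because each record carries both the old and the new state, undo and redo are idempotent: applying either one twice leaves the object in the same state as applying it once.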

Write-ahead Log
- Update-in-place has a problem if a crash occurs between the update and the log record reaching stable storage
- Update the object only after the undo log is recorded
- Before committing updates, record both redo and undo logs
- Expensive to write log to stable storage
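A minimal sketch of the write-ahead rule, under simplifying assumptions: `log` and `disk` are plain Python containers standing in for stable storage, each log record holds both undo (old) and redo (new) information, and transactions are assumed not to touch the same object concurrently. The names `write`, `commit`, `recover`, and `MISSING` are hypothetical.

```python
from typing import Any, NamedTuple

MISSING = object()  # hypothetical sentinel: object did not exist before the update


class Update(NamedTuple):
    txn: str
    name: str
    old: Any  # undo information
    new: Any  # redo information


class Commit(NamedTuple):
    txn: str


def write(log, disk, txn, name, new):
    # Write-ahead rule: the log record is appended before the object changes.
    log.append(Update(txn, name, disk.get(name, MISSING), new))
    disk[name] = new


def commit(log, txn):
    log.append(Commit(txn))


def recover(log, disk):
    committed = {rec.txn for rec in log if isinstance(rec, Commit)}
    # Undo pass (newest first): erase effects of uncommitted transactions.
    for rec in reversed(log):
        if isinstance(rec, Update) and rec.txn not in committed:
            if rec.old is MISSING:
                disk.pop(rec.name, None)
            else:
                disk[rec.name] = rec.old
    # Redo pass (oldest first): reapply effects of committed transactions.
    for rec in log:
        if isinstance(rec, Update) and rec.txn in committed:
            disk[rec.name] = rec.new
```

After a crash, `recover(log, disk)` can be run against whatever mix of old and new values survived on `disk`; running it a second time changes nothing.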

State-Based Recovery
- Save entire process state at recovery point
  - Recovery point called checkpoint
  - Rolling back process: restoring to checkpoint
- Tradeoff: frequent checkpoints vs. completion delay
- Shadow pages
  - Save unmodified page copy on stable storage
  - Update only volatile copy; discard on rollback
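Both ideas on this slide, whole-state checkpoints and shadow pages, can be sketched briefly. The class names below are hypothetical, and ordinary Python objects stand in for stable storage; this is an illustration under those assumptions rather than a real implementation.

```python
import copy


class CheckpointedProcess:
    """Saves the entire process state at each recovery point."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # A deep copy stands in for writing the state to stable storage.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)


class ShadowPagedStore:
    """Keeps unmodified pages intact and updates only a volatile copy."""

    def __init__(self, pages):
        self.stable = dict(pages)  # unmodified copies (stand-in for stable storage)
        self.volatile = {}         # modified pages only

    def read(self, number):
        return self.volatile.get(number, self.stable[number])

    def write(self, number, data):
        self.volatile[number] = data  # the stable copy is never touched here

    def rollback(self):
        self.volatile.clear()         # discard the volatile copies

    def commit(self):
        self.stable.update(self.volatile)
        self.volatile.clear()
```

The cost difference is visible in the sketch: `checkpoint()` copies everything on every call, while the shadow-page store copies only the pages that were written.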

Concurrent Systems Recovery
- Rollback issues
  - Orphan messages
  - Domino effect
  - Lost messages
  - Livelocks

Orphan Messages (a message prior to a checkpoint is sent to the future)
[Figure: timelines of processes X, Y, and Z with recovery points x1, x2, y1, y2, z1, z2 and a message m; "[" marks a recovery point]

Domino Effect
- Suppose Y rolls back to y2
- Suppose Z rolls back to z2
- m is an orphan message
- Process Y must roll back to y1
- Suppose Z rolls back to z2
- Y rolls back to y1
- Now a message from the future is sent to the past, prior to a checkpoint
- Forcing Z to roll back to z1
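The cascade can be simulated with per-checkpoint message counters. The sketch below assumes, purely for illustration, that every checkpoint stores how many messages the process had sent to and received from each peer, and that checkpoint 0 is the initial state with all counters at zero. A receiver whose restored state has seen more messages from a sender than the sender's restored state has sent holds an orphan and must roll back further. The function name and data layout are hypothetical.

```python
def consistent_recovery_line(checkpoints, start):
    """Cascade rollbacks until no restored state contains an orphan message.

    checkpoints[p] is a list of (sent, received) dicts, oldest first;
    start[p] is the index of the checkpoint p restores initially.
    Returns the checkpoint index each process ends up at.
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for receiver in checkpoints:
            _, received = checkpoints[receiver][line[receiver]]
            for sender, n_received in received.items():
                sent, _ = checkpoints[sender][line[sender]]
                if n_received > sent.get(receiver, 0):
                    # Orphan: receipt recorded, but the send was undone.
                    if line[receiver] == 0:
                        raise RuntimeError("no consistent set of checkpoints")
                    line[receiver] -= 1  # roll back one more checkpoint
                    changed = True
                    break
    return line
```

In a two-process example where X sends to Y after X's checkpoint 1, Y checkpoints, Y replies, and X checkpoints again, a failure of Y drags X back to checkpoint 1 and then Y back to its initial state, which is the domino effect in miniature.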

Lost Messages
[Figure: timelines of processes X and Z with recovery points x1 and z1, a message m, and a failure; "[" marks a recovery point]

Livelocks
[Figure: timelines of processes X and Z with recovery points x1 and z1 and a repeated failure; "[" marks a recovery point]

Concurrent Recovery
- Coordination is required either
  - At the time of establishing checkpoints, or
  - At the beginning of recovery

Checkpoint Assumptions
- Communication via messages
  - Unreliable FIFO channels
- Higher-level end-to-end protocols assumed
  - Subsumes rollback-caused message loss
- No network partitions from communication failures

Checkpoint Algorithm Concepts
- Permanent and tentative checkpoints
  - Saved on stable storage
  - Permanent: part of a known consistent global checkpoint
  - Tentative: becomes permanent on successful termination of the checkpoint algorithm
- Rolls back only to permanent checkpoints

Synchronous Checkpoint Algorithms
- Two-phase commit
- Problems:
  - Message overhead for synchronization
  - Synchronization delays
  - Costly when failures are rare
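The two-phase structure, with tentative checkpoints that are either all promoted to permanent or all discarded, might look like the following sketch. It is a single-threaded illustration with hypothetical names (`Participant`, `coordinated_checkpoint`, the `can_checkpoint` flag); it leaves out the message exchange, timeouts, and any handling of application messages sent between the two phases that a real protocol needs.

```python
import copy


class Participant:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint     # hypothetical switch to simulate a "no" vote
        self.permanent = copy.deepcopy(state)    # last permanent checkpoint
        self.tentative = None

    def take_tentative(self):
        """Phase 1: try to save a tentative checkpoint and vote."""
        if not self.can_checkpoint:
            return False
        self.tentative = copy.deepcopy(self.state)
        return True

    def make_permanent(self):
        """Phase 2 (success): promote the tentative checkpoint."""
        if self.tentative is not None:
            self.permanent = self.tentative
            self.tentative = None

    def discard_tentative(self):
        """Phase 2 (failure): drop the tentative checkpoint."""
        self.tentative = None

    def rollback(self):
        # Processes roll back only to permanent checkpoints.
        self.state = copy.deepcopy(self.permanent)


def coordinated_checkpoint(participants):
    votes = [p.take_tentative() for p in participants]  # phase 1
    if all(votes):
        for p in participants:                          # phase 2: commit
            p.make_permanent()
        return True
    for p in participants:                              # phase 2: abort
        p.discard_tentative()
    return False
```

Even in this toy form the cost is visible: every checkpoint round touches every participant twice, whether or not a failure ever occurs.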

Asynchronous Checkpointing
- Local checkpoints taken independently
- Log all incoming messages on stable storage
  - Minimizes undone computation
  - Allows reprocessing of messages after rollback
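A small sketch of why logging incoming messages limits lost work: after a rollback to the local checkpoint, the process replays the logged messages instead of waiting for senders to resend them. The sketch assumes message handling is deterministic; `LoggingProcess` and its fields are hypothetical names, and a Python list stands in for the stable-storage log.

```python
class LoggingProcess:
    def __init__(self):
        self.total = 0            # volatile process state
        self.message_log = []     # stand-in for the log on stable storage
        self._ckpt_total = 0      # state saved at the last local checkpoint
        self._ckpt_log_len = 0    # log position at the last local checkpoint

    def _apply(self, value):
        # Deterministic message handler (assumed for replay to be valid).
        self.total += value

    def receive(self, value):
        self.message_log.append(value)  # log the message before processing it
        self._apply(value)

    def checkpoint(self):
        self._ckpt_total = self.total
        self._ckpt_log_len = len(self.message_log)

    def recover(self):
        # Roll back to the checkpoint, then reprocess messages logged since.
        self.total = self._ckpt_total
        for value in self.message_log[self._ckpt_log_len:]:
            self._apply(value)
```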

Asynchronous Checkpointing Assumptions
- Reliable FIFO communication channels
- Infinite buffers
- Event-driven computation
  - A process is idle until a message is received
  - It processes the message and changes state
  - It sends zero or more messages
- Each event can be identified with a monotonically increasing counter

Event-Driven Computation
[Figure: timelines of processes X, Y, and Z with events x1, x2, y1, y2, z1, z2]

Asynchronous Checkpointing
- Basic idea
  - Save states, messages sent at each event
  - Volatile logging
- Each processor notes the number of messages sent to others, and received from others
- Use counters to determine orphan messages
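The counter bookkeeping can be sketched as follows. Each event (receive a message, change state, send zero or more messages) appends an entry to a log with the current sent and received counts. When a sender rolls back and reports how many messages its restored state has sent, the receiver discards events until its received count no longer exceeds that number. The names `Process`, `Event`, `handle`, and `roll_back_for` are hypothetical, and the integer state is only a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Event:
    state: int
    sent: dict      # messages sent to each peer so far
    received: dict  # messages received from each peer so far


class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.sent = {}
        self.received = {}
        self.log = [Event(0, {}, {})]  # event 0 is the initial state

    def handle(self, sender, payload, replies=()):
        """One event: receive a message, change state, send zero or more messages."""
        self.received[sender] = self.received.get(sender, 0) + 1
        self.state += payload
        for dest in replies:
            self.sent[dest] = self.sent.get(dest, 0) + 1
        self.log.append(Event(self.state, dict(self.sent), dict(self.received)))

    def roll_back_for(self, sender, sender_sent):
        """Undo events until no message from `sender` is an orphan.

        sender_sent is the number of messages the sender's restored state
        has sent to this process. Returns the index of the restored event.
        """
        while self.log[-1].received.get(sender, 0) > sender_sent:
            self.log.pop()
        last = self.log[-1]
        self.state = last.state
        self.sent = dict(last.sent)
        self.received = dict(last.received)
        return len(self.log) - 1
```

Rolling back may in turn lower this process's own sent counts, so its peers have to repeat the same check; that repetition is where domino effects come from.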

Summary
- Failures caused by errors
- Can remove errors by forward/backward error recovery
- Backward error recovery more costly, more general
- Synchronous checkpoints helpful, costly
- Asynchronous checkpoints messier, domino effects