Operating System Reliability


1 Operating System Reliability
Andy Wang COP 5611 Advanced Operating Systems

2 Some Axioms
Some simple systems, designed from scratch, sometimes work
A complex system that works is invariably found to have evolved from a simple system that works
A complex system designed from scratch never works

3 Failure-Mode Theorems
Complex systems usually operate in failure mode
A system should have safe behaviors when encountering failures
When a “fail-safe” system fails, it fails by failing to fail safe

4 Some Definitions
A failure occurs when the system does not perform its services in the manner specified
  Failures can be subtle (e.g., a performance fault)
A fault is an anomalous physical condition
  Includes system specification and implementation mistakes
An error is a part of the system state that differs from its intended value

5 Classification of Failures
Process failures
System failures
Secondary storage failures
Communication medium failures

6 Process Failures
Examples
  Computation results in incorrect outcome
  System state deviates from specification
  Process fails to progress
Errors leading to failure
  Deadlock, timeout, protection violation
  Bad input, consistency violation
Ignoring malicious behavior

7 System Failures
Processor fails to execute
  Software error, hardware error (CPU, bus, etc.)
Fail-stop behavior assumed
Failure types
  Amnesia
  Partial-amnesia
  Pause
  Halting

8 Secondary Storage Failures
Stored data inaccessible
  Parity error
  Head crash
  Contaminated medium
Reconstructable from archive + log, maybe
Mirrored disks (independent failure mode)

9 Communication Medium Failures
Site can’t communicate with another site
Causes
  Switching node failure
    Hardware failure
    Software failure
    Congestion
  Link failure
    Hardware
    Implementation failure
Network partitions can result

10 Recovery
Restart process/processor
Reclaim resources
Undo/finish incomplete transactions
Concurrency makes things harder

11 Forward Error Recovery
Goal: to restore system from erroneous state to error-free state
If nature of error is completely known
  Remove error from state
  Proceed with execution from error-free state
Rarely possible to do

12 Backward Error Recovery
When error source unknown
Restore state to previous error-free state; restart
Independent of the fault and of the errors it caused
Problems
  Performance penalty
  No guarantee the fault will not recur
  Possible unrecoverable component of state
Recovery point: state used to replace the erroneous state

13 Backward Error Recovery
Basic approaches
  Operation-based
    Logs
      Update-in-place
      Write-ahead log
  State-based

14 Update-in-Place
Every update to an object also writes a log record
  Name of object
  Old and new states of object
Recoverable update operation implemented as
  Do, undo, redo operations
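The do/undo/redo idea can be sketched as follows. This is only an illustration under simple assumptions: the class and method names (`RecoverableStore`, `LogRecord`, `MISSING`) are hypothetical, and in-memory Python containers stand in for the object store and the log.

```python
from dataclasses import dataclass
from typing import Any

MISSING = object()  # hypothetical sentinel: the object did not exist before the update


@dataclass
class LogRecord:
    name: str  # name of the object
    old: Any   # state before the update
    new: Any   # state after the update


class RecoverableStore:
    def __init__(self):
        self.objects = {}  # objects, updated in place
        self.log = []      # one record per update

    def do(self, name, new):
        """Apply the update in place, then record it in the log."""
        old = self.objects.get(name, MISSING)
        self.objects[name] = new
        self.log.append(LogRecord(name, old, new))

    def undo(self, record):
        """Restore the old state stored in the record."""
        if record.old is MISSING:
            self.objects.pop(record.name, None)
        else:
            self.objects[record.name] = record.old

    def redo(self, record):
        """Reapply the new state stored in the record."""
        self.objects[record.name] = record.new
```

Undoing records in reverse log order walks an object back through its earlier states; redoing them in log order rebuilds the latest state.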

15 Write-Ahead Log
Update-in-place has a problem if a crash occurs between the update and the log record reaching stable storage
Update object only after undo log is recorded
Before committing updates, record both redo and undo logs
Expensive to write log to stable storage
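A minimal sketch of the write-ahead rule, assuming transactions run one at a time and that a Python list can stand in for the log on stable storage. The names (`WriteAheadStore`, `write`, `commit`, `recover`) are hypothetical and not taken from the lecture.

```python
MISSING = object()  # hypothetical sentinel: the object did not exist before


class WriteAheadStore:
    def __init__(self):
        self.log = []      # stand-in for the log on stable storage
        self.objects = {}  # stand-in for objects updated in place

    def write(self, txid, name, new):
        old = self.objects.get(name, MISSING)
        # The record with undo (old) and redo (new) information goes first...
        self.log.append(("update", txid, name, old, new))
        # ...and only then is the object itself changed.
        self.objects[name] = new

    def commit(self, txid):
        self.log.append(("commit", txid))

    def recover(self):
        """After a crash: undo uncommitted updates, redo committed ones."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in reversed(self.log):  # undo in reverse log order
            if rec[0] == "update" and rec[1] not in committed:
                _, _, name, old, _ = rec
                if old is MISSING:
                    self.objects.pop(name, None)
                else:
                    self.objects[name] = old
        for rec in self.log:  # redo in log order
            if rec[0] == "update" and rec[1] in committed:
                self.objects[rec[2]] = rec[4]
```

Because every update is logged before it is applied, recovery always finds the information it needs, whichever point the crash interrupted.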

16 State-Based Recovery
Save entire process state at recovery point
  Recovery point called checkpoint
  Rolling back process: restoring to checkpoint
Tradeoff: frequent checkpoints vs. completion delay
Shadow pages
  Save unmodified page copy on stable storage
  Update only volatile copy; discard on rollback
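Checkpoint and rollback can be illustrated with a small sketch. It assumes the whole process state fits in one Python object and that a deep copy held in memory stands in for the copy on stable storage; `CheckpointedProcess` is a hypothetical name.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state  # volatile working state
        # Stand-in for the checkpoint saved on stable storage.
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save the entire current state as the new recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the volatile state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)
```

All work done since the last `checkpoint()` call is lost on `rollback()`, which is the cost side of the checkpoint-frequency tradeoff.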

17 Concurrent Systems Recovery
Rollback issues
  Orphan messages
  Domino effect
  Lost messages
  Livelocks

18 Orphan Messages (a message prior to a checkpoint is sent to the future)
[Figure: timelines of processes X, Y, and Z with recovery points x1, x2, y2, z1, z2 and a message m]

19 Messages from Future Sent to the Past
[Figure: timelines of processes X, Y, and Z with recovery points x1, x2, y2, z1, z2 and a message m]

20 Messages from Future Sent to the Past
[Figure: timelines of processes X, Y, and Z with recovery points x1, x2, y2, z1, z2 and a message m]

21 Domino Effect Completed
[Figure: timelines of processes X, Y, and Z with recovery points x1, x2, y2, z1, z2 and a message m]

22 Lost Messages
[Figure: timelines of processes X and Z with recovery points x1 and z1, a message m, and a failure]

23 Livelocks
[Figure: timelines of processes X and Z with recovery points x1 and z1 and a repeated failure]

24 Concurrent Recovery
Coordination required at either
  Time of establishing checkpoints
  Beginning of recovery

25 Checkpoint Assumptions
Communication via messages
Unreliable FIFO channels
Higher-level end-to-end protocols assumed
  Subsumes rollback-caused message loss
No network partitions from communication failures

26 Checkpoint Algorithm Concepts
Permanent and tentative checkpoints
  Saved on stable storage
  Permanent: part of known consistent global checkpoint
  Tentative: until successful termination of checkpoint algorithm
Rolls back only to permanent checkpoints

27 Synchronous Checkpoint Algorithms
Two-phase commit
Problems
  Message overhead for synchronization
  Synchronization delays
  Costly when failures are rare
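A two-phase checkpoint round with tentative and permanent checkpoints might look roughly like this. It is a simplified sketch, not the algorithm from the lecture: message passing is replaced by direct method calls, blocking of application messages between the phases is omitted, and the names (`Participant`, `coordinated_checkpoint`, `can_checkpoint`) are hypothetical.

```python
import copy


class Participant:
    def __init__(self, state, can_checkpoint=True):
        self.state = state
        self.can_checkpoint = can_checkpoint    # simulates a "no" vote
        self.permanent = copy.deepcopy(state)   # last permanent checkpoint
        self.tentative = None                   # pending tentative checkpoint

    def take_tentative(self):
        """Phase 1: take a tentative checkpoint and vote yes or no."""
        if not self.can_checkpoint:
            return False
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        """Phase 2, all voted yes: tentative becomes permanent."""
        self.permanent = self.tentative
        self.tentative = None

    def abort(self):
        """Phase 2, someone voted no: drop the tentative checkpoint."""
        self.tentative = None


def coordinated_checkpoint(participants):
    votes = [p.take_tentative() for p in participants]  # phase 1
    if all(votes):
        for p in participants:  # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:      # phase 2: abort everywhere
        p.abort()
    return False
```

Every round costs two exchanges with every participant even when nothing ever fails, which is the overhead listed under the problems above.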

28 Asynchronous Checkpointing
Local checkpoints taken independently
Log all incoming messages on stable storage
  Minimizes undone computation
  Allows reprocessing of messages after rollback
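The effect of logging incoming messages can be sketched with a toy process whose state is a running sum. The assumptions are that message handling is deterministic and that a Python list stands in for the stable message log; `LoggingProcess` is a hypothetical name.

```python
class LoggingProcess:
    def __init__(self):
        self.total = 0         # volatile state: sum of received values
        self.message_log = []  # stand-in for the log on stable storage
        self._ckpt = (0, 0)    # (saved state, number of messages logged so far)

    def receive(self, value):
        self.message_log.append(value)  # log the message first
        self.total += value             # then process it

    def checkpoint(self):
        """Local checkpoint, taken independently of other processes."""
        self._ckpt = (self.total, len(self.message_log))

    def recover(self):
        """Restore the checkpoint, then reprocess messages logged after it."""
        self.total, replay_from = self._ckpt
        for value in self.message_log[replay_from:]:
            self.total += value
```

Replaying the log brings the process back to the state it had just before the failure, so little computation stays undone even if checkpoints are infrequent.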

29 Asynchronous Checkpointing Assumptions
Reliable FIFO communication channels
Infinite buffers
Event-driven computation
  A process is idle until a message is received
  Processes message and changes state
  Sends zero or more messages
Can identify each event with monotonically increasing counter

30 Event-Driven Computation
[Figure: timelines of processes X, Y, and Z with events x1, x2, y1, y2, z1, z2]

31 Asynchronous Checkpointing
Basic idea
  Save states, messages sent at each event
  Volatile logging
  Each processor notes number of messages sent to others, and received from others
  Use counters to determine orphan messages
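One way to use the counters is sketched below. It is an illustration under stated assumptions, not the specific algorithm from the lecture: each process keeps a list of restorable snapshots (oldest first), the first snapshot is the initial state with all counters at zero, and the function name `recovery_line` and the snapshot layout are hypothetical. A process has orphan messages when it has received more messages from a peer than that peer's restored state has sent.

```python
def recovery_line(history):
    """Pick one snapshot index per process so that no orphan messages remain.

    history[p] is a list of snapshots, oldest first. Each snapshot is a dict
    {"sent": {peer: count}, "recv": {peer: count}}.
    """
    # Start from the most recent restorable snapshot of every process.
    line = {p: len(snaps) - 1 for p, snaps in history.items()}
    changed = True
    while changed:
        changed = False
        for p in history:
            for q in history:
                if p == q:
                    continue
                sent = history[p][line[p]]["sent"].get(q, 0)
                # q holds orphans while it received more from p than p sent.
                while history[q][line[q]]["recv"].get(p, 0) > sent:
                    line[q] -= 1  # roll q back one snapshot
                    changed = True
    return line


# Example: X fails and can only restore its initial state.
init = {"sent": {}, "recv": {}}
history = {
    "X": [init],
    "Y": [init, {"sent": {"Z": 1}, "recv": {"X": 1}}],
    "Z": [init, {"sent": {}, "recv": {"Y": 1}}],
}
line = recovery_line(history)  # Y rolls back, which forces Z back too
```

In the example the rollback of X makes the message received by Y an orphan, and rolling Y back in turn orphans the message received by Z, a small instance of the domino effect.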

32 Summary
Failures caused by errors
Can remove errors by forward/backward error recovery
Backward error recovery more costly, more general
Synchronous checkpoints helpful, costly
Asynchronous checkpoints messier, domino effects

