1
Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures
Presented by: Daniel Taylor
2
Outline
- Motivation
- Approaches to surviving failures
- Rx Approach
- Rx Design
- Experimental Results
- Future Work
- Evaluation
- Discussion
3
Motivation
System Availability
- Gartner report: 1 hour of downtime = $6 million
- Affected by software failures
  - Software defects cause up to 40% of system failures
  - Memory-related and concurrency bugs account for over 60% of system vulnerabilities
4
Motivation
Treat bugs as allergies
Examples of environmental bugs:
- Memory management
  - Buffer overflows
  - Dangling pointers
- Timing
  - Races
  - Message ordering
- User requests
  - Malicious users
  - Bad requests
5
Approaches to surviving failures
1) Rebooting/System restart
- Designed for hardware failures
- Fails to fix deterministic bugs
- Unavailability
- Warm-up period
- Micro-rebooting
6
Approaches to surviving failures
2) Checkpointing and recovery
- Checkpoint, roll back on failure, re-execute
- Designed for hardware failures
- Fails to fix deterministic bugs
- Progressive retry: a method that re-orders messages
  - Only works for message-ordering bugs
- N-version programming: run a different implementation on re-execution
  - Requires extra software development
7
Approaches to surviving failures
3) Application-specific recovery
- Multi-process model
  - Spawn new processes if old ones fail
  - Cannot deal with deterministic bugs
  - Cannot deal with shared data corruption
- Exception handling
  - Programmer must anticipate the failures
8
Approaches to surviving failures
4) Non-conventional methods
- Failure-oblivious computing
  - Artificial values for buffer overflows
- Reactive immune systems
  - Speculative error code for crashed functions
- Unsafe methods, not appropriate for critical applications
- Hard to debug if the “fix” does something strange
9
Rx Approach
Treat bugs like real-life allergies
- Remove the allergen to see if it helps
Goals:
- Comprehensive: survive software bugs
- Safe: no uncertainty or introduced errors
- Noninvasive: no modifications
- Efficient: good performance, reduced downtime
- Informative: help diagnose bugs
10
Rx Approach
- Keep checkpoints
- Fail > Rollback > Change Environment > Re-Execute
- Disable the environmental changes once re-execution succeeds
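The loop on this slide can be sketched roughly as follows. This is a toy Python illustration, not Rx's implementation: `run_with_rx`, `ENV_CHANGES`, and the `handle` function are hypothetical names, and a deep copy of a dictionary stands in for a real process checkpoint.

```python
import copy

def run_with_rx(state, request, handler, env_changes):
    """Run handler normally; on failure roll back and retry under each change."""
    checkpoint = copy.deepcopy(state)        # stand-in for a process checkpoint
    try:
        return handler(state, request, env={}), None
    except Exception:
        pass                                 # failure detected, start recovery
    for name, env in env_changes:
        state.clear()
        state.update(copy.deepcopy(checkpoint))        # rollback
        try:
            result = handler(state, request, env=env)  # re-execute, changed env
            return result, name              # caller resumes with env={} afterwards
        except Exception:
            continue                         # this change did not help
    raise RuntimeError("no environmental change avoided the failure")

def handle(state, request, env):
    """Toy request handler with a fixed 4-character buffer (hypothetical)."""
    capacity = 4 + env.get("padding", 0)
    state["handled"] = state.get("handled", 0) + 1
    if len(request) > capacity:
        raise IndexError("buffer overflow")
    state["last"] = request
    return len(request)

ENV_CHANGES = [("zero_fill", {"zero_fill": True}), ("pad_buffers", {"padding": 8})]

state = {"handled": 0}
result, applied = run_with_rx(state, "abcdef", handle, ENV_CHANGES)
```

In this run the first change does not help, the second does, and the rollback before each attempt keeps the failed attempts from leaking into `state`.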
11
Rx Approach
Execution Environment
- Anything external to the application that affects it
  - Low level: hardware
  - Middle level: OS kernel (scheduling, VM system, FS, drivers, etc.)
  - High level: libraries
- A change must be:
  - Correctness-preserving: follows the APIs and does the same thing
  - Able to avoid bugs: potentially works around a software defect
12
Rx Approach Environmental changes and bugs
13
Rx Design
5 parts:
- Sensors
- Checkpoint and Rollback (CR)
- Environment Wrappers
- Proxy
- Control Unit
14
Rx Design: Sensors
- Detect failures and inform the control unit
- Two types of sensors:
  - Detect software errors
  - Detect bugs before they cause crashes
  - Only the first type is implemented
- Provide information about the type of exception, memory address, and stack signature
15
Rx Design: Checkpoint and Rollback
- Checkpoints are taken automatically and transparently
- Application memory, accessed files, and file pointers are copied using copy-on-write
- Kept in memory (no disk accesses); old checkpoints can be written to disk
- Rolling back to a checkpoint that is too far back takes too long
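A rough sketch of the bookkeeping described here, keeping a few recent checkpoints in memory and moving older ones aside. `CheckpointStore` and its fields are hypothetical, the `spilled` list only stands in for writing to disk, and `copy.deepcopy` is an eager copy rather than the copy-on-write mechanism Rx uses.

```python
import copy
from collections import deque

class CheckpointStore:
    def __init__(self, in_memory=3):
        self.in_memory = in_memory
        self.recent = deque()   # (checkpoint_id, snapshot), newest last
        self.spilled = []       # stand-in for old checkpoints written to disk
        self.next_id = 0

    def take(self, state):
        """Snapshot the state and age out the oldest in-memory checkpoints."""
        self.recent.append((self.next_id, copy.deepcopy(state)))
        self.next_id += 1
        while len(self.recent) > self.in_memory:
            self.spilled.append(self.recent.popleft())
        return self.next_id - 1

    def rollback(self, steps_back=0):
        """Return a copy of the snapshot `steps_back` checkpoints before the newest."""
        checkpoint_id, snapshot = self.recent[-1 - steps_back]
        return checkpoint_id, copy.deepcopy(snapshot)

store = CheckpointStore(in_memory=3)
state = {"n": 0}
for i in range(5):
    state["n"] = i
    store.take(state)

latest = store.rollback()               # newest checkpoint
older = store.rollback(steps_back=2)    # oldest checkpoint still in memory
```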
16
Rx Design: Checkpoint and Rollback
- Based on earlier work: Flashback (2004)
- Rx does not require deterministic re-execution, so it avoids the overhead of enforcing it
17
Rx Design: Environment Wrappers
- Carry out the environment changes during re-execution
- Memory Wrapper
  - Intercepts memory library calls (malloc, free, etc.)
  - Supports 4 environmental changes:
    - Delaying free
    - Padding buffers
    - Allocation isolation
    - Zero-filling
  - Safe: no changes to the API
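To make three of these changes concrete, here is a toy simulated heap in Python. It is only an illustration: `ToyHeap` and its parameters are hypothetical, it does not intercept real `malloc`/`free` calls, and allocation isolation is not modeled.

```python
class ToyHeap:
    """Simulated allocator; padding, zero_fill and delay_free model the changes."""

    def __init__(self, padding=0, zero_fill=False, delay_free=0):
        self.padding = padding
        self.zero_fill = zero_fill
        self.delay_free = delay_free
        self.blocks = {}       # handle -> bytearray
        self.pending = []      # freed handles whose memory is still retained
        self.next_handle = 1

    def malloc(self, size):
        fill = 0x00 if self.zero_fill else 0xAA    # 0xAA models stale garbage
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = bytearray([fill]) * (size + self.padding)
        return handle

    def free(self, handle):
        self.pending.append(handle)
        while len(self.pending) > self.delay_free:  # release only the oldest frees
            del self.blocks[self.pending.pop(0)]

    def write(self, handle, data):
        block = self.blocks[handle]   # KeyError models a use-after-free crash
        if len(data) > len(block):
            raise MemoryError("write past end of block")
        block[:len(data)] = data

normal = ToyHeap()
h1 = normal.malloc(4)
try:
    normal.write(h1, b"abcdef")
    overflowed = False
except MemoryError:
    overflowed = True

padded = ToyHeap(padding=8)
h2 = padded.malloc(4)
padded.write(h2, b"abcdef")           # the overflow lands in the padding

delayed = ToyHeap(delay_free=2)
h3 = delayed.malloc(4)
delayed.free(h3)
delayed.write(h3, b"ok")              # dangling write hits memory that is still retained

zeroed = ToyHeap(zero_fill=True)
h4 = zeroed.malloc(4)
```

The caller-visible interface (`malloc`, `free`, `write`) is identical in every configuration, which mirrors the "no changes to API" point.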
18
Rx Design: Environment Wrappers
- Message Wrapper
  - Implemented in the proxy; controls message ordering
  - Changes include:
    - Shuffling requests
    - Randomized packet sizes
  - Helps avoid non-deterministic bugs
  - No change to execution: the server should not expect any particular ordering or size
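Both message changes can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that order only has to be preserved within a single connection, which the slide does not state.

```python
import random

def repacketize(stream, rng, max_size=8):
    """Split a byte stream into randomly sized packets; the content is unchanged."""
    chunks = []
    pos = 0
    while pos < len(stream):
        size = rng.randint(1, max_size)
        chunks.append(stream[pos:pos + size])
        pos += size
    return chunks

def shuffle_requests(queues, rng):
    """Interleave connections in random order, keeping each connection's own order."""
    remaining = {conn: list(reqs) for conn, reqs in queues.items() if reqs}
    order = []
    while remaining:
        conn = rng.choice(sorted(remaining))
        order.append((conn, remaining[conn].pop(0)))
        if not remaining[conn]:
            del remaining[conn]
    return order

rng = random.Random(0)
stream = b"GET /index.html HTTP/1.1\r\n\r\n"
chunks = repacketize(stream, rng)

queues = {"conn1": ["a1", "a2"], "conn2": ["b1", "b2", "b3"]}
order = shuffle_requests(queues, rng)
```

A server that reads until it has a full request sees the same bytes regardless of how they were packetized, which is why this change is correctness-preserving.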
19
Rx Design: Environment Wrappers
- Process Scheduling
  - Change priority
- Signal Delivery
  - Signals are recorded and can be delivered at random times
- Dropping User Requests
  - Binary search narrows down the possibly bad user request, which is then dropped
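The binary search for a bad request might look like the sketch below. It assumes exactly one request triggers the failure and that replay is repeatable; `find_bad_request` and `replay_fails` are hypothetical names, with `replay_fails` standing in for "roll back, replay these requests, and see whether the failure recurs".

```python
def find_bad_request(requests, replay_fails):
    """Return the index of the single request whose removal avoids the failure."""
    lo, hi = 0, len(requests)          # the culprit lies in requests[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        kept = requests[:lo] + requests[mid:]   # drop requests[lo:mid] and replay
        if replay_fails(kept):
            lo = mid                   # still fails: culprit is in the second half
        else:
            hi = mid                   # failure gone: culprit was among the dropped
    return lo

requests = ["GET /a", "GET /b", "BAD", "GET /d", "GET /e"]
replays = []

def replay_fails(kept):
    replays.append(list(kept))         # record each replay for inspection
    return "BAD" in kept

bad_index = find_bad_request(requests, replay_fails)
```

Each replay halves the candidate range, so the number of re-executions grows with the logarithm of the number of logged requests.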
20
Rx Design: Proxy
- Records messages and replays them on re-execution
- Simply forwards messages during normal execution
- On recovery, the proxy:
  - Replays requests
  - Carries out message-related environment changes
  - Buffers incoming messages until failure recovery completes
  - Keeps track of which requests have received responses
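A toy version of this behavior, for illustration only: `ToyProxy`, `make_server`, and the in-process "server" callable are hypothetical, and real recovery would roll the server back to a checkpoint rather than swap in a new function.

```python
from collections import deque

class ToyProxy:
    def __init__(self, server):
        self.server = server       # callable: request -> response
        self.log = []              # [request, answered?] pairs since the checkpoint
        self.buffered = deque()    # requests that arrive during recovery
        self.recovering = False

    def forward(self, request):
        if self.recovering:
            self.buffered.append(request)    # hold it until recovery finishes
            return None
        entry = [request, False]
        self.log.append(entry)
        response = self.server(request)      # may raise if the server fails
        entry[1] = True
        return response

    def recover(self, server):
        """Replay logged requests, sending only responses clients never received."""
        self.server = server
        sent = []
        for entry in self.log:
            response = self.server(entry[0])
            if not entry[1]:
                sent.append(response)
                entry[1] = True
        self.recovering = False
        while self.buffered:
            sent.append(self.forward(self.buffered.popleft()))
        return sent

def make_server(fail_on=None):
    def server(request):
        if request == fail_on:
            raise RuntimeError("crash")
        return request.upper()
    return server

proxy = ToyProxy(make_server(fail_on="c"))
replies = [proxy.forward("a"), proxy.forward("b")]
try:
    proxy.forward("c")
except RuntimeError:
    proxy.recovering = True
held = proxy.forward("d")                  # arrives mid-recovery, so it is buffered
recovered = proxy.recover(make_server())   # stand-in for re-execution in a changed environment
```

Tracking the answered flag is what lets the replay stay invisible to clients: "a" and "b" are re-executed but their responses are not sent twice.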
21
Rx Design: Control Unit
- Coordinates the other components and performs 3 functions:
  - Directs checkpointing and requests rollbacks
  - Diagnoses failures based on symptoms and past experience, and chooses which changes to use
  - Produces an information report for programmers
- Keeps a failure table that records how well each environmental change works, for future reference
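One way to picture the failure table is as a per-symptom score for each environmental change. The scoring rule below (+1 for a success, -1 for a failure) and the names `FailureTable`, `record`, and `ranked` are assumptions made for illustration; the slide does not say how Rx scores changes.

```python
from collections import defaultdict

class FailureTable:
    def __init__(self, changes):
        self.changes = list(changes)   # default order in which changes are tried
        # symptom -> change -> score
        self.scores = defaultdict(lambda: defaultdict(int))

    def record(self, symptom, change, succeeded):
        """Remember whether a change avoided the failure with this symptom."""
        self.scores[symptom][change] += 1 if succeeded else -1

    def ranked(self, symptom):
        """Changes to try for this symptom, best score first; ties keep default order."""
        table = self.scores[symptom]
        return sorted(self.changes, key=lambda change: -table[change])

table = FailureTable(["pad_buffers", "delay_free", "drop_request"])
table.record("SIGSEGV in free", "pad_buffers", succeeded=False)
table.record("SIGSEGV in free", "delay_free", succeeded=True)

next_time = table.ranked("SIGSEGV in free")
unseen = table.ranked("SIGBUS in memcpy")
```

With a table like this, a repeat of a known symptom starts with the change that worked before instead of walking the default list again.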
22
Rx Design
Multi-threaded process checkpointing
- Threads must be at user level before a checkpoint is taken, because of kernel locks and state issues
- A signal makes threads exit blocked calls so the checkpoint can be taken, then Rx retries those calls
- This method interferes heavily with I/O, so the checkpoint interval cannot be set too short
23
Experimental Results
4 different sets of tests:
- Surviving failures
- Performance overhead
- Malicious requests
- Learning from previous failures
Tested with 4 real applications:
- Apache httpd: web server
- Squid: proxy server
- MySQL: database server
- CVS: version control server
6 bugs: data race, buffer overflow, uninitialized read, dangling pointer, stack overflow, double free
24
Experimental Results
- Alternatives compared: whole-program restart, and rollback with plain re-execution
- Rx provides availability and is faster than the restart methods, except for very simple programs (CVS)
- If the bug is deterministic, restarting will likely lead to the same crash again
25
Experimental Results
26
Experimental Results
27
Experimental Results
28
Future Work
- Inter-server communication
  - If Rx is on all systems, it can roll back whichever ones it needs to when a failure occurs
  - Coordinated checkpoints
- Unavoidable bugs/failures
  - Memory leaks: require a whole-program restart
  - Deadlocks
  - Semantic bugs that have nothing to do with the environment
- Undetectable bugs: need better sensors
- Implement the proxy at the kernel level
29
Evaluation
- Safe, fast recovery from certain bugs, but not all bugs
- Masks failures from users and provides availability
- Rx was only tested on I/O-bound applications; overhead may be larger for computation-bound applications
30
Discussion Questions?