Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging
  Sudarshan M. Srinivasan, Srikanth Kandula, Christopher R. Andrews, and Yuanyuan Zhou
  Review by M. Kozuch
Motivation
  Better debugging
  Reproducing errors is hard because:
    1. Bug setup may require significant time
    2. Exact inputs may be difficult to reproduce
    3. Some bugs are Heisenbugs
  Solution: provide a mechanism for deterministic replay after a software error is detected.
Related Work
  Compile-time static checking
  Run-time dynamic checking
  Hardware support
Overview
  State = checkpoint()
  discard(State)
  replay(State)
Main Idea Part I
  Use a fork()-like mechanism to create a shadow copy of a process
    Copy-on-write memory image
    Register values
    Some process state
  Not exactly fork()
    Shadow is not runnable
    Not all reference counts are incremented
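
The fork()-based idea can be approximated in user space. The sketch below is not Flashback's kernel mechanism: it assumes a Unix-like system, uses Python's os.fork(), and inverts the roles so that the waiting parent acts as the non-runnable shadow while the child executes forward. The names run_from_checkpoint and buggy_step are hypothetical.

```python
import os


def run_from_checkpoint(state, step):
    """Run step(state) in a forked child; return True if it succeeded.

    The parent blocks in waitpid() and so plays the role of the shadow:
    its copy of memory is untouched by the child's writes.
    (Hypothetical helper; a user-level approximation only.)
    """
    pid = os.fork()
    if pid == 0:
        # Child: execute forward from the checkpoint.
        code = 1
        try:
            step(state)
            code = 0
        finally:
            # Always terminate the child here, even if step() raised.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def buggy_step(state):
    state["counter"] += 41  # this mutation happens only in the child
    raise RuntimeError("simulated software error")


state = {"counter": 1}
succeeded = run_from_checkpoint(state, buggy_step)
# The parent still holds the pre-checkpoint state and could re-run the step.
```

After the failed run, the parent's state is exactly what it was at the checkpoint, which is the property the shadow copy is meant to provide.
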
Memory Image
  Copy-on-write semantics
    Reduces cost of checkpoint()
    Reduces memory footprint
    Reduces impact of multiple checkpoints
    Reduces cost of replay()
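
To see why copy-on-write makes both checkpoint() and replay() cheap, a toy page store may help. This is an illustrative model only, assuming a single outstanding checkpoint and 4-byte "pages"; the class CowMemory and its fields are hypothetical and unrelated to the real kernel data structures.

```python
class CowMemory:
    """Toy page store with one copy-on-write checkpoint (illustrative only)."""

    def __init__(self, num_pages):
        self.pages = [bytearray(4) for _ in range(num_pages)]
        self.saved = None      # page index -> pre-write copy; None = no checkpoint
        self.copies_made = 0

    def checkpoint(self):
        # Constant time: no page is copied until it is first written.
        self.saved = {}

    def write(self, page, offset, value):
        if self.saved is not None and page not in self.saved:
            # First write to this page since the checkpoint: keep the original.
            self.saved[page] = bytes(self.pages[page])
            self.copies_made += 1
        self.pages[page][offset] = value

    def replay(self):
        if self.saved is None:
            raise RuntimeError("no checkpoint to roll back to")
        # Restore only the pages dirtied since the checkpoint.
        for page, original in self.saved.items():
            self.pages[page][:] = original
        self.saved = {}

    def discard(self):
        self.saved = None


mem = CowMemory(num_pages=1000)
mem.checkpoint()
mem.write(7, 0, 42)
mem.write(7, 1, 43)  # same page, so no second copy is made
mem.replay()
```

Only one of the 1000 pages is ever copied here, and rollback touches only that page.
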
Multithreaded Processes
  Option 1: Roll back the process
    “Trivially” ensures consistency
  Option 2: Roll back the thread
    Requires an ordering log for memory and files
    Roll back the set of threads with inter-dependencies
    Problems:
      Logic adds overhead and is error-prone
      Data races may require watching all threads
      Overhead is paid even when there are no errors
  Multithreaded state capture is still hard if some of the threads can be in a different context (i.e., in the kernel).
Main Idea Part II
  Can re-execute code
  Must replay inputs (e.g. from read())
  After checkpoint(), log syscall return values*
  After replay(), replay the log
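
The record-then-replay step can be sketched at user level. The example below is a simplified illustration, not Flashback's in-kernel logging: it assumes every input arrives through a wrapper, and the names SyscallLog, call, start_replay, and checksum are hypothetical.

```python
import os


class SyscallLog:
    """Record results of input calls, then feed them back during replay."""

    def __init__(self):
        self.entries = []
        self.cursor = None  # None = recording; int = position while replaying

    def start_replay(self):
        self.cursor = 0

    def call(self, func, *args):
        if self.cursor is None:
            result = func(*args)  # real call; remember what it returned
            self.entries.append(result)
            return result
        result = self.entries[self.cursor]  # replay: do not re-execute
        self.cursor += 1
        return result


def checksum(log, fd):
    """Sum all bytes readable from fd, going through the log for each read."""
    total = 0
    while True:
        chunk = log.call(os.read, fd, 4)
        if not chunk:
            return total
        total += sum(chunk)


r, w = os.pipe()
os.write(w, b"flashback")
os.close(w)

log = SyscallLog()
first = checksum(log, r)   # consumes the pipe and records each read()
os.close(r)

log.start_replay()
second = checksum(log, r)  # the fd is closed; inputs now come from the log
```

The second run never touches the pipe, yet it sees the same inputs in the same order, which is what makes the re-execution deterministic.
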
Log Weaknesses
  Shared memory
    Regions must be identified, and all accesses set to generate #PF (page fault)
    Not currently handled
  Signals
    Replay of asynchronous events is challenging
    Not currently handled
uBenchmark Performance I
  Checkpoint() is us
  Discard()/Replay() is /7500us
uBenchmark Performance II
  read()/write() syscalls
  Used cold caches “for consistency”
Application Performance Experiments
  One application
  Network bound
  Logging disabled
  Conclusion: the overhead of checkpointing is low.