Execution Replay and Debugging
Contents
Introduction
  Parallel program: a set of co-operating processes
  Co-operation using
    –shared variables
    –message passing
  Developing parallel programs is considered difficult:
    –normal errors, as in sequential programs
    –synchronisation errors (deadlock, races)
    –performance errors
  We need good development tools
Debugging of parallel programs
  Most used technique: cyclic debugging
  Requires repeatable, equivalent executions
  This is a problem for parallel programs: a lot of non-determinism is present
  Solution: an execution replay mechanism
    –record phase: trace information about the non-deterministic choices
    –replay phase: force an equivalent re-execution using the trace, allowing the use of intrusive debugging techniques
Non-determinism
  Classes:
    –external vs. internal non-determinism
    –desired vs. undesired non-determinism
  Important: the amount of non-determinism depends on the abstraction level. E.g. a semaphore P()-operation can be fully deterministic while consisting of a number of non-deterministic spinlocking operations.
Causes of Non-determinism
  In sequential programs:
    –program code (self-modifying code?)
    –program input (disk, keyboard, network, ...)
    –certain system calls (gettimeofday())
    –interrupts, signals, ...
  In parallel programs:
    –accesses to shared variables: race conditions (synchronisation races and data races)
  In distributed programs:
    –promiscuous receive operations
    –test operations for non-blocking message operations
Main Issues in Execution Replay
  recorded execution = original execution:
    –trace as little as possible in order to limit the overhead in time and in space
  replayed execution = recorded execution:
    –faithful re-execution: trace enough
Execution Replay Methods
  Two types: content-based vs. ordering-based
    –content-based: force each process to read the same value or to receive the same message as during the original execution
    –ordering-based: force each process to access the variables or to receive the messages in the same logical order as during the original execution
Logical Clocks for Ordering-based Methods
  A clock C() attaches a timestamp C(x) to an event x
  Used for tracing the logical order of events
  Clock condition: $a \to b \Rightarrow C(a) < C(b)$ (where $\to$ is the happened-before relation)
  Clocks are strongly consistent if $a \to b \Leftrightarrow C(a) < C(b)$
  New timestamp is the increment of the maximum of the old timestamps of the process and the object
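Scalar Clocks
  Aka Lamport clocks
  Simple and fast update algorithm: $C_p := C_o := \max(C_p, C_o) + 1$ when process p accesses object o
  Scales very well with the number of processes
  Provides only limited information: $a \to b \Rightarrow C(a) < C(b)$, but not the converse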
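The update rule for scalar clocks can be sketched as follows. This is a minimal illustration, not code from the slides: the names `ScalarClock` and `access` are hypothetical, and it assumes every process and every shared object carries one counter that starts at 0.

```python
class ScalarClock:
    """A Lamport-style scalar clock: one non-negative counter (assumed to start at 0)."""

    def __init__(self):
        self.value = 0


def access(process_clock, object_clock):
    """Timestamp one access of a process to a shared object.

    The new timestamp is the maximum of the two old timestamps plus one,
    and both the process and the object adopt it.
    """
    stamp = max(process_clock.value, object_clock.value) + 1
    process_clock.value = stamp
    object_clock.value = stamp
    return stamp


# Two processes touch the same object one after the other.
p1, p2, obj = ScalarClock(), ScalarClock(), ScalarClock()
first = access(p1, obj)    # p1 accesses the object first
second = access(p2, obj)   # p2 accesses it afterwards, so it gets a larger timestamp
```

The second access is ordered after the first purely through the counter stored with the object, which is why the scheme needs only one integer per process and per object.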
Vector Clocks
  A vector clock for a program using N processes consists of N scalar values
  Such a clock is strongly consistent: by comparing vector timestamps one can deduce concurrency information: $a \parallel b \Leftrightarrow \neg(C(a) < C(b)) \wedge \neg(C(b) < C(a))$
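Comparing vector timestamps can be sketched as below. The function names are hypothetical, timestamps are assumed to be tuples of equal length, and `tick` assumes the usual vector form of the update rule (component-wise maximum, then increment of the accessing process's own component).

```python
def happened_before(a, b):
    """True if timestamp a is ordered before b: a <= b component-wise and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither timestamp is ordered before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def tick(process_vc, object_vc, pid):
    """New timestamp for process pid (0-based index) accessing an object."""
    merged = [max(x, y) for x, y in zip(process_vc, object_vc)]
    merged[pid] += 1
    return tuple(merged)


ordered = happened_before((4, 0), (4, 2))   # second timestamp dominates the first
racing = concurrent((5, 1), (4, 2))         # each is larger in one component
merged = tick((4, 0), (0, 1), 0)            # process 0 accesses an object stamped (0, 1)
```

With scalar clocks the second comparison would be impossible: two integers are always ordered, so concurrency cannot be read off the timestamps.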
An Example Program
  A parallel program with two threads, communicating using the shared variables A, B, MA and MB. Local variables are x and y.
  A variable M is used as a mutex by means of an atomic swap operation provided by the CPU: swap(M, v) stores v in M and returns the old value of M.
An Example Program (II)
  Lock operation L(M) on a mutex M is implemented (in a library) as: repeat swap(M,1) until it returns 0
  Unlock operation U(M) on a mutex M is implemented as: M=0
  All variables are initially 0
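The lock and unlock operations can be sketched as follows. Python has no hardware swap instruction, so this sketch simulates the atomicity of `swap` with an internal `threading.Lock`; the names `Cell` and `_atomic` are assumptions made for the illustration, and the spin loop mirrors the repeated `swap(M,1)` calls visible in the low-level execution view.

```python
import threading
import time

_atomic = threading.Lock()  # assumption: stands in for the CPU's atomicity guarantee


class Cell:
    """A shared memory location; all variables are initially 0."""

    def __init__(self, value=0):
        self.value = value


def swap(cell, new):
    """Atomically store `new` in the cell and return the old value."""
    with _atomic:
        old = cell.value
        cell.value = new
        return old


def L(m):
    """Lock: spin until swap returns 0, i.e. until the mutex was free."""
    while swap(m, 1) != 0:
        time.sleep(0)  # yield to other threads while spinning


def U(m):
    """Unlock: a plain store of 0."""
    m.value = 0


M = Cell()
L(M)               # the first swap returns 0, so the lock is acquired at once
busy = swap(M, 1)  # a competing swap now sees 1 and would have to retry
U(M)
```

Each failed `swap` is itself a memory operation whose outcome depends on timing, which is why a deterministic high-level lock can consist of a non-deterministic number of low-level operations.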
An Example Program (III)
  The example program:
    Thread 1: L(MA); A=8; U(MA); L(MB); B=7; U(MB);
    Thread 2: B=6; L(MB); x=B; U(MB); L(MA); y=A; U(MA);
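A Possible Execution: Low Level View
  [Figure: timeline of both threads at the level of individual memory operations; each swap is shown with its return value]
    Thread 1: swap(MA,1)→0; A=8; MA=0; swap(MB,1)→0; B=7; MB=0
    Thread 2: B=6; swap(MB,1)→1; swap(MB,1)→0; x=B; MB=0; swap(MA,1)→0; y=A; MA=0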
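A Possible Execution: High Level View
  [Figure: the same execution at the level of lock and unlock operations, drawn along a time axis]
    Thread 1: L(MA); A=8; U(MA); L(MB); B=7; U(MB)
    Thread 2: B=6; L(MB); x=B; U(MB); L(MA); y=A; U(MA)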
Recap
  A content-based replay method: the value read by each load operation is stored
  Trace generation of 1 MB/s was measured on a VAX 11/780
  Impractical method: the time needed to record the large amount of trace information perturbs the original execution
  One advantage: it is possible to replay a subset of the processes in isolation
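The content-based idea can be sketched as follows: every load is logged during recording, and during replay the loads are served from the log instead of from shared memory. This is an illustration only, not Recap's implementation; `Recorder`, `Replayer` and `thread2` are hypothetical names, and the shared memory is modelled as a dictionary.

```python
class Recorder:
    """Record phase: serve loads from shared memory and log every value read."""

    def __init__(self, memory):
        self.memory = memory
        self.log = []

    def load(self, name):
        value = self.memory[name]
        self.log.append(value)
        return value


class Replayer:
    """Replay phase: serve loads from the log; shared memory is not consulted."""

    def __init__(self, log):
        self._values = iter(log)

    def load(self, name):
        return next(self._values)


def thread2(mem):
    """The loads of thread 2 in the example program (locking omitted)."""
    x = mem.load("B")
    y = mem.load("A")
    return x, y


shared = {"A": 8, "B": 7}               # state after thread 1 has finished
recorder = Recorder(shared)
original = thread2(recorder)            # recorded run
replayed = thread2(Replayer(recorder.log))  # thread 2 replayed without thread 1
```

The replayed thread sees the same values without thread 1 running at all, which is the isolation advantage; the cost is one log entry per load.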
Recap: Example
  [Figure: the low-level view of the example execution, with the same operations as in "A Possible Execution: Low Level View"]
Instant Replay
  First ordering-based replay method
  Developed for CREW (concurrent-read-exclusive-write) algorithms
  Each shared object receives a version number that is updated or logged at each CREW-operation:
    –read: the version number is logged
    –write: the version number is incremented and the number of preceding read operations is logged
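The version-number bookkeeping described above can be sketched like this. It is a simplified single-object illustration under stated assumptions: the names are hypothetical, each process keeps its own log, and `may_proceed` is an assumed replay check that makes a reader wait for the logged version and a writer wait for the logged number of reads.

```python
class VersionedObject:
    """A shared object with an Instant Replay style version number."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0  # incremented by every write
        self.reads = 0    # reads seen since the last write


def read(obj, log):
    """Read: log the version number that was read."""
    log.append(("read", obj.version))
    obj.reads += 1
    return obj.value


def write(obj, value, log):
    """Write: log the number of preceding reads, then increment the version."""
    log.append(("write", obj.reads))
    obj.version += 1
    obj.reads = 0
    obj.value = value


def may_proceed(obj, entry):
    """Replay check: has the object reached the state that was logged?"""
    kind, logged = entry
    return obj.version == logged if kind == "read" else obj.reads == logged


# Record phase: thread 1 writes A, then thread 2 reads it.
A = VersionedObject()
log1, log2 = [], []
write(A, 8, log1)
y = read(A, log2)

# Replay phase on a fresh object: the reader has to wait for the write.
A2 = VersionedObject()
reader_too_early = may_proceed(A2, log2[0])
write(A2, 8, [])
reader_ready = may_proceed(A2, log2[0])
```

Only accesses that go through `read` and `write` are ordered this way; an access made outside the CREW protocol leaves no log entry.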
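Instant Replay: Example
  [Figure: the high-level execution with read and write locks, annotated with the logged information; one annotation reads PROBLEM]
    Thread 1: Lw(MA); A=8; Uw(MA); Lw(MB); B=7; Uw(MB)
    Thread 2: B=6; Lr(MB); x=B; Ur(MB); Lr(MA); y=A; Ur(MA)
    Annotations: "version: 1, log 0 reads" (twice); "log version 1"; "PROBLEM"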
Netzer
  Widely cited method
  Attaches a vector clock to each process. The clocks attach a timestamp to each memory operation.
  Uses vector clocks to detect concurrent (racing) memory operations
  Automatically traces the transitive reduction of the dependencies
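The tracing decision can be sketched as follows: a dependency on the previous access to a location is traced only if it is not already implied by the accessing process's vector clock. This is a simplified illustration, not Netzer's published algorithm: the names are hypothetical, every pair of accesses to the same location is assumed to conflict (reads and writes are not distinguished), and process ids are 0-based indices.

```python
def access(pid, loc, clocks, last, trace):
    """Timestamp one memory operation and trace it if it races.

    clocks: vector timestamp per process
    last:   per location, (pid, timestamp) of the previous access
    trace:  list of (earlier timestamp, later timestamp) pairs to enforce on replay
    """
    clock = list(clocks[pid])
    prev = last.get(loc)
    racing = False
    if prev is not None:
        owner, stamp = prev
        # The order is already implied if this process has seen the earlier access.
        racing = owner != pid and stamp[owner] > clock[owner]
        clock = [max(c, s) for c, s in zip(clock, stamp)]
    clock[pid] += 1
    new_stamp = tuple(clock)
    if racing:
        trace.append((prev[1], new_stamp))
    clocks[pid] = new_stamp
    last[loc] = (pid, new_stamp)
    return new_stamp


clocks = {0: (0, 0), 1: (0, 0)}
last, trace = {}, []
access(0, "B", clocks, last, trace)           # thread 1: B=7
access(0, "MB", clocks, last, trace)          # thread 1: MB=0
access(1, "MB", clocks, last, trace)          # thread 2: swap(MB,1), races with MB=0
final = access(1, "B", clocks, last, trace)   # thread 2: x=B, already ordered
```

Only the dependency on MB ends up in the trace. The ordering of B=7 before x=B follows from it transitively, so it is not traced.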
Netzer: Basic Idea
  [Figure: the operations B=6, swap(MB,1)→0 and B=7, with the question "Is this order guaranteed?"]
Netzer: Transitive Reduction
  [Figure: the operations B=7, MB=0, swap(MB,1)→0 and x=B]
Netzer: Example
  [Figure: the low-level view of the example execution, with the same operations as in "A Possible Execution: Low Level View"]
Netzer: Example
  [Figure: the low-level view of the example execution, annotated with vector timestamps]
    Timestamps shown: (1,0) (2,0) (3,0) (4,0) (5,1) (6,4) (0,1) (4,2) (4,3) (4,4) (6,5) (6,6) (6,7) (6,8) (6,9) (6,10)
Netzer: Problems
  The size of a vector clock grows with the number of processes
    –the method doesn't scale well
    –programs that create threads dynamically?
  A vector timestamp has to be attached to all shared memory locations: huge space overhead.
  The method basically detects all data and synchronisation races and replays them.
ROLT
  Attaches a Lamport clock to each process. The clocks attach a timestamp to each memory operation.
  Does not detect racing operations, but merely re-executes them in the same order.
  Also automatically traces the transitive reduction of the dependencies
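Recording and replaying with scalar timestamps can be sketched as below. This is an illustration under stated assumptions, not the ROLT implementation: the function names are hypothetical, the trace is assumed to hold a (before, after) pair whenever a process clock advances by more than one, and replay is modelled as executing the operations in timestamp order.

```python
def record(schedule):
    """Stamp operations given as (pid, object name) in the order they ran.

    Returns the stamped operations and, per process, the clock jumps
    (old value, new value) caused by another process's earlier access.
    """
    proc, obj, stamped, jumps = {}, {}, [], {}
    for pid, name in schedule:
        old = proc.get(pid, 0)
        new = max(old, obj.get(name, 0)) + 1
        if new > old + 1:
            jumps.setdefault(pid, []).append((old, new))
        proc[pid] = obj[name] = new
        stamped.append((new, pid, name))
    return stamped, jumps


def replay_order(stamped):
    """Replay in timestamp order; ties only occur between independent operations."""
    return sorted(stamped)


def per_object_order(ops):
    """For every object, the sequence of processes that accessed it."""
    order = {}
    for _, pid, name in ops:
        order.setdefault(name, []).append(pid)
    return order


schedule = [(1, "MB"), (2, "A"), (1, "B"), (1, "MB"), (2, "MB"), (2, "B")]
stamped, jumps = record(schedule)
same_order = per_object_order(replay_order(stamped)) == per_object_order(stamped)
```

Sorting by one integer imposes a total order on all operations, including pairs that never interacted, and no race detection takes place along the way.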
ROLT: Example
  [Figure: the low-level view of the example execution, with the same operations as in "A Possible Execution: Low Level View"]
ROLT: Example
  [Figure: the low-level view of the example execution, with the traced information]
    Traced: (5,8) (1,5), (7,9)
ROLT using three phases
  Problem: high overhead due to the tracing of all memory operations
  Solution: only record/replay the synchronisation operations (a subset of all race conditions)
  Problem: no correct replay is possible if the execution contains a data race
  Solution: add a third phase for detecting the data races
ROLT using three phases
  Phase 1: record the order of the synchronisation races
  Phase 2: replay the synchronisation races while using intrusive data race detection techniques
  Phase 3: replay the synchronisation races and use cyclic debugging techniques to find the 'normal' errors
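ROLT: Example
  [Figure: the high-level view of the example execution, with the traced information]
    Operations shown: A=8 L(MA) U(MA) L(MB) B=7 U(MB) x=B L(MB) U(MB) L(MA) y=A MA=0 B=
    Traced: (0,5)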
ROLT
  ROLT replays synchronisation races and detects data races.
  The method scales well and has a small space and time overhead.
  Produces small trace files.
  A total order is imposed, which introduces artificial dependencies.
Conclusions