AADEBUG MUNCHEN Non-intrusive on-the-fly data race detection using execution replay Michiel Ronsse - Koen De Bosschere Ghent University - Belgium

Contents
- Introduction
- Non-determinism & data races
- RecPlay
  - Method
  - Implementation
  - Example
- Experimental Evaluation
- Conclusions

Introduction
Developing parallel programs for multiprocessors with shared memory is considered difficult:
- a number of threads run simultaneously
- co-operation & synchronisation happen through shared memory:
  - too much synchronisation: deadlock
  - too little synchronisation: race condition
- cyclic debugging is impossible due to the non-deterministic nature of most parallel programs: program execution is not repeatable

Causes of non-determinism
- Sequential programs: input (keyboard, disk, network), signals, interrupts, certain system calls (gettimeofday(), …)
- Parallel programs: race conditions: two threads access the same shared variable (memory location) in an unsynchronised way and at least one thread modifies the variable
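
Example code
    #include <pthread.h>
    #include <stdio.h>

    unsigned global = 5;

    /* Both threads update global without synchronisation: a data race. */
    void *thread1(void *arg) { global = global + 6; return NULL; }
    void *thread2(void *arg) { global = global + 7; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("global=%u\n", global);
        return 0;
    }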

Possible executions
[Figure: interleavings of the load (L) and store (S) operations of the two threads on global, ending with global=12, global=18 or global=11]
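
The three outcomes can be reproduced without real threads. The sketch below is an illustration, not part of RecPlay: it assumes each thread of the example program performs exactly one load and one store on global, and it enumerates every order in which those four steps can interleave. The names run_schedule and print_all_schedules are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: each thread does one load followed by one store on global.
 * Bit k of `schedule` selects which thread runs step k
 * (0 = thread1, 1 = thread2). A schedule is valid only if each thread
 * gets exactly two of the four steps. Returns 1 for a valid schedule. */
static int run_schedule(unsigned schedule, unsigned *result)
{
    unsigned global = 5;
    unsigned reg[2] = {0, 0};       /* value each thread loaded */
    unsigned add[2] = {6, 7};       /* thread1 adds 6, thread2 adds 7 */
    int steps_done[2] = {0, 0};

    assert(result != NULL);
    for (int k = 0; k < 4; k++) {
        int t = (int)((schedule >> k) & 1u);
        if (steps_done[t] == 0)
            reg[t] = global;                /* load */
        else if (steps_done[t] == 1)
            global = reg[t] + add[t];       /* store */
        else
            return 0;                       /* third step: invalid schedule */
        steps_done[t]++;
    }
    *result = global;
    return 1;
}

/* Prints the final value of every valid interleaving and returns
 * the number of valid interleavings. */
static int print_all_schedules(void)
{
    int valid = 0;
    for (unsigned s = 0; s < 16; s++) {
        unsigned result;
        if (run_schedule(s, &result)) {
            printf("schedule %2u -> global=%u\n", s, result);
            valid++;
        }
    }
    return valid;
}
```

Calling print_all_schedules() from a main function lists six interleavings. Only the two in which one thread finishes before the other starts give 18; the other four lose an update and end in 11 or 12.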

Race conditions
Two types:
- synchronisation races:
  - prevent cyclic debugging
  - are not a bug: desired non-determinism
- data races:
  - prevent cyclic debugging
  - are a bug: undesired non-determinism
- the distinction is a matter of abstraction
Automatic detection of data races is possible:
- collect all memory references
- check parallel references

Detecting data races
- Static methods: check the source code for all possible executions with all possible inputs
  - NP-complete: not feasible
- Dynamic methods: check during an actual execution
  - only detect data races that occur during this execution
- Removal requires cyclic debugging

Dynamic data race detection
- A segment is the piece of code between two consecutive synchronisation operations
- For every segment i of every thread we collect two sets, L(i) and S(i), with the addresses of all load and store operations
- For all parallel segments i and j, ((L(i) ∪ S(i)) ∩ S(j)) ∪ ((L(j) ∪ S(j)) ∩ S(i)) gives the list of conflicting addresses
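
A minimal sketch of the conflict check for two parallel segments. It assumes the usual rule that follows from the race definition earlier in the talk: an address conflicts when both segments access it and at least one of them stores to it. The 64-word address space, the type segment_sets and the function names are assumptions chosen to keep the example small; they are not RecPlay's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (assumption): the address space has 64 words, so a set of
 * addresses fits in one 64-bit mask (bit a set = address a accessed). */
typedef struct {
    uint64_t loads;    /* L(i) */
    uint64_t stores;   /* S(i) */
} segment_sets;

static void record_load(segment_sets *s, unsigned addr)
{
    assert(addr < 64);
    s->loads |= UINT64_C(1) << addr;
}

static void record_store(segment_sets *s, unsigned addr)
{
    assert(addr < 64);
    s->stores |= UINT64_C(1) << addr;
}

/* Conflicting addresses of two parallel segments i and j:
 * ((L(i) | S(i)) & S(j)) | ((L(j) | S(j)) & S(i)).
 * Addresses that are only loaded by both segments never conflict. */
static uint64_t conflicts(const segment_sets *i, const segment_sets *j)
{
    return ((i->loads | i->stores) & j->stores) |
           ((j->loads | j->stores) & i->stores);
}
```

For the example program, both threads load and store the address of global, so that address appears in the result. A non-zero result only signals a race when the two segments are parallel; that ordering test is a separate step.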

Existing race detection methods
- Huge overhead, causing the probe effect and Heisenbugs
- Only detect the existence of a data race (and the variable), not the instructions involved
- A data race is a bug, so we need cyclic debugging!

RecPlay
- Synchronisation races: execution replay
- Data races: detect
- RecPlay also enables cyclic debugging
- Allows you to detect/remove the first data race
- Three phases:
  - record the order of the synchronisation operations
  - replay the synchronisation operations and check for data races
  - normal replay, without checking for data races

Overview
[Figure: flow chart with the steps Choose input, Record, Replay + detect, Replay + ident., Replay + debug, Choose new input and The end; the legend distinguishes automatic steps from steps that require user intervention]

Instrumentation
- JiTI (Just in Time Instrumentation) was developed especially for RecPlay, but it is a generic instrumentation tool
- Instruments memory and synchronisation operations
- Deals correctly with data in code, code in data and self-modifying code
- Clones processes: the original process is used for the data and the instrumented clone is used for the code
- No need for recompilation, relinking or instrumentation of files

Execution replay
- ROLT (Reconstruction of Lamport Timestamps) is used for tracing/replaying the synchronisation operations
- Attaches a scalar Lamport timestamp to each synchronisation operation
- Delaying each synchronisation operation until all operations with a smaller timestamp have been executed suffices for a correct replay
- We only need to log a small subset of all operations
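
A simplified sketch of the two ideas on this slide: stamping synchronisation operations with a scalar Lamport clock, and the replay rule that an operation waits for operations with smaller timestamps. It is not the ROLT implementation; in particular it does not show how ROLT reduces the log to a small subset. All type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { unsigned clock; } lamport_thread;   /* per-thread scalar clock */
typedef struct { unsigned clock; } lamport_object;   /* clock of one semaphore or mutex */

/* Record phase: stamp one synchronisation operation of thread t on
 * object o. The new timestamp is larger than everything the thread
 * and the object have seen so far. */
static unsigned stamp_sync_op(lamport_thread *t, lamport_object *o)
{
    unsigned latest = t->clock > o->clock ? t->clock : o->clock;
    t->clock = latest + 1;
    o->clock = t->clock;
    return t->clock;
}

/* Replay phase: thread `self` wants to execute an operation stamped
 * my_ts. next_ts[k] holds the timestamp of the next operation thread k
 * still has to execute (a very large value once thread k is done).
 * The operation may proceed when no other thread has a pending
 * operation with a smaller timestamp. */
static int may_proceed(unsigned my_ts, const unsigned *next_ts,
                       int nthreads, int self)
{
    assert(self >= 0 && self < nthreads);
    for (int k = 0; k < nthreads; k++)
        if (k != self && next_ts[k] < my_ts)
            return 0;
    return 1;
}
```

Operations with equal timestamps are allowed to proceed together in this sketch: under the stamping rule above they can only occur on different objects and in different threads, so their relative order does not matter.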

Collecting memory operations
- We need two lists of addresses per segment i: L(i) and S(i)
- A multilevel bitmap is used:
  - low memory consumption
  - comparing two bitmaps is easy
- We lose information: two accesses to the same variable are counted once. This is, however, no problem for data race detection

Memory bitmap
[Figure: layout of the multilevel memory bitmap; the surviving labels are "9 bit" and "14 bit"]
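
A sketch of a multilevel bitmap over 32-bit addresses. How the 9-bit and 14-bit fields of the slide are actually used cannot be recovered from the text, so the split below (9-bit root index, 14-bit second-level index, 9-bit position in a leaf bitmap) is an assumption, as are all names. The sketch illustrates the two properties claimed above: tables are allocated only for address ranges that are touched, and two bitmaps are compared by walking only the parts both have allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ROOT_BITS 9    /* assumed split: 9 + 14 + 9 = 32 address bits */
#define MID_BITS  14
#define LEAF_BITS 9

typedef struct { uint8_t bits[(1u << LEAF_BITS) / 8]; } bm_leaf;
typedef struct { bm_leaf *leaves[1u << MID_BITS]; } bm_mid;
typedef struct { bm_mid *mids[1u << ROOT_BITS]; } ml_bitmap;

/* Mark addr as accessed; repeated accesses leave the bitmap unchanged. */
static void bitmap_set(ml_bitmap *b, uint32_t addr)
{
    uint32_t r = addr >> (MID_BITS + LEAF_BITS);
    uint32_t m = (addr >> LEAF_BITS) & ((1u << MID_BITS) - 1);
    uint32_t l = addr & ((1u << LEAF_BITS) - 1);

    if (!b->mids[r]) {
        b->mids[r] = calloc(1, sizeof(bm_mid));
        assert(b->mids[r] != NULL);      /* sketch-level error handling */
    }
    if (!b->mids[r]->leaves[m]) {
        b->mids[r]->leaves[m] = calloc(1, sizeof(bm_leaf));
        assert(b->mids[r]->leaves[m] != NULL);
    }
    b->mids[r]->leaves[m]->bits[l / 8] |= (uint8_t)(1u << (l % 8));
}

/* Returns 1 if addr was marked, 0 otherwise. */
static int bitmap_test(const ml_bitmap *b, uint32_t addr)
{
    uint32_t r = addr >> (MID_BITS + LEAF_BITS);
    uint32_t m = (addr >> LEAF_BITS) & ((1u << MID_BITS) - 1);
    uint32_t l = addr & ((1u << LEAF_BITS) - 1);
    const bm_mid *mp = b->mids[r];

    if (!mp || !mp->leaves[m])
        return 0;
    return (mp->leaves[m]->bits[l / 8] >> (l % 8)) & 1;
}

/* Returns 1 if the two bitmaps share at least one marked address.
 * Subtrees missing in either bitmap are skipped without inspection. */
static int bitmap_intersects(const ml_bitmap *a, const ml_bitmap *b)
{
    for (uint32_t r = 0; r < (1u << ROOT_BITS); r++) {
        if (!a->mids[r] || !b->mids[r])
            continue;
        for (uint32_t m = 0; m < (1u << MID_BITS); m++) {
            const bm_leaf *x = a->mids[r]->leaves[m];
            const bm_leaf *y = b->mids[r]->leaves[m];
            if (!x || !y)
                continue;
            for (size_t k = 0; k < sizeof x->bits; k++)
                if (x->bits[k] & y->bits[k])
                    return 1;
        }
    }
    return 0;
}
```

A zero-initialised ml_bitmap is an empty set. Freeing the tables is omitted to keep the sketch short.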

Detecting parallel segments
- A vector clock is attached to each segment
- All segment information (two bitmaps + vector timestamp) is kept on a list L
- Each new segment is compared against the segments on list L

Detecting obsolete segments
- Obsolete segments should be removed from list L
- We use a snooped matrix clock to detect these segments

Detecting obsolete segments
[Figure: timeline of segments; legend: segment on list L, obsolete segment, segment in execution, point of execution, the future]
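
A simplified sketch of how a matrix clock can identify obsolete segments. It assumes that row k of the matrix is the latest vector clock observed for thread k, and that entry known[k][t] counts the segments of thread t that thread k's current segment is ordered after. Under that assumption, a finished segment is obsolete once every other thread is already ordered after it, because no segment still to come can be parallel to it. The names and the counting convention are hypothetical, and the snooping of other threads' clocks is not shown.

```c
#include <assert.h>

#define NTHREADS 3   /* hypothetical fixed number of threads */

/* known[k][t]: number of segments of thread t that the current
 * segment of thread k is ordered after (assumed convention). */
typedef struct { unsigned known[NTHREADS][NTHREADS]; } matrix_clock;

/* Segment number seg_index (counted from 0) of thread t is obsolete
 * when every other thread is already ordered after it. */
static int segment_obsolete(const matrix_clock *m, int t, unsigned seg_index)
{
    assert(t >= 0 && t < NTHREADS);
    for (int k = 0; k < NTHREADS; k++)
        if (k != t && m->known[k][t] <= seg_index)
            return 0;   /* thread k may still run code parallel to it */
    return 1;
}
```

Removing a segment from list L as soon as this test succeeds keeps the list short, which bounds both the memory use and the number of comparisons per new segment.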

Identification phase
- If a data race is detected, we know:
  - the address involved
  - the type of operations involved (load or store)
  - the threads involved
  - the segments containing the racing instructions
- We need another replayed execution to find the racing instructions themselves (+ call stack, …)
- This replay executes at full speed until the racing segments start executing

An Example
[Figure: animated example built up step by step over 20 slides; surviving labels: B2, A1, C4, C A+B, A3 and the synchronisation operations P(S1), V(S1), P(S2), V(S2), P(S3), V(S3)]

Experimental Evaluation
- RecPlay has been implemented for Solaris running on SPARC multiprocessors
- Tested on a SUN SparcServer 1000 with 4 processors
- SPLASH-2 was used as a benchmark: a number of multithreaded numeric applications, such as a fast Fourier transform, a raytracer, ...
- Several data races were found, including in SPLASH-2

Basic performance of RecPlay
[Table or figure not preserved]

Segments with memory accesses
[Table or figure not preserved]

Efficiency of the ROLT mechanism
[Table or figure not preserved]

Conclusions
- RecPlay is a practical and efficient tool for detecting and removing data races
- RecPlay also makes cyclic debugging possible
- Three types of clocks (scalar, vector and matrix) are used to enable a fast and memory-efficient implementation
- Data races have been found