FastTrack: Efficient and Precise Dynamic Race Detection [FlFr09]
Cormac Flanagan and Stephen N. Freund
GNU OS Lab. 23-Jun-16 Ok-kyoon Ha
[FlFr09] PLDI’09
Contents
  Introduction
  Background
  The FastTrack Algorithm
  Implementation
  Evaluation
  Conclusions
Introduction
  Motivation
    vector clocks (VCs) are expensive: each VC requires O(n) storage space, and each VC operation requires O(n) time
    FastTrack is motivated in part by these performance limitations of vector clocks
  Limitations
    imprecise dynamic race detectors and static race detectors can report false alarms
    precise race detectors never produce false alarms, but they are limited by the performance overhead of VCs
  The full generality of vector clocks is not actually necessary in most cases
    the vast majority of data in multithreaded programs is either thread-local, lock-protected, or read-shared
    FastTrack can provide constant-time fast paths for these common cases without any loss of precision or correctness in the general case
FastTrack Overview
  uses epochs
    an epoch is a pair of a clock and a thread identifier
  for write accesses: records information only about the very last write to x
    all writes to x are totally ordered by the happens-before relation (as long as no race has been detected)
  for read accesses: records only the epoch of the last read of x when reads are totally ordered
    read operations on thread-local and lock-protected data are totally ordered
  reduces the overhead of almost all monitored operations
    analysis time: from O(n) to O(1), where n is the number of threads in the program
    space: from O(n) to O(1) for thread-local and lock-protected data
Background
  Multithreaded Program Traces
    a thread t can perform the following operations
      rd(t, x) and wr(t, x): read a value from x and write a value to x
      acq(t, m) and rel(t, m): acquire and release a lock m
      fork(t, u): fork a new thread u
      join(t, u): block until thread u terminates
  Happens-before relation <α
    the smallest transitively-closed relation over the operations in a trace α
    a <α b holds if a occurs before b in α and one of the following holds: program order, locking, or fork-join
  Race condition
    two operations in a trace are concurrent if they are not related by the happens-before relation
    a trace has a race condition if it has two concurrent conflicting accesses
Review: the DJIT+ Algorithm
  based on vector clocks
  maintains an additional vector clock Lm for each lock m
  to identify conflicting accesses, keeps two vector clocks for each variable x, one for reads and one for writes (Wx)
  Figure: vector clocks C0, C1, Lm, and Wx over the trace wr(0, x), rel(0, m), acq(1, m), wr(1, x)
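The vector-clock bookkeeping reviewed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the starting clocks <4,0> and <0,8>, and the restriction to the write-write check (the read vector clock is omitted) are all assumptions made for the example.

```python
from typing import Dict, List

VC = List[int]  # a vector clock: one integer entry per thread

def vc_join(a: VC, b: VC) -> VC:
    """Pointwise maximum of two vector clocks, O(n)."""
    return [max(x, y) for x, y in zip(a, b)]

def vc_leq(a: VC, b: VC) -> bool:
    """a <= b iff every entry of a is <= the matching entry of b, O(n)."""
    return all(x <= y for x, y in zip(a, b))

def release(C: List[VC], L: Dict[str, VC], t: int, m: str) -> None:
    """rel(t, m): copy C_t into L_m, then advance thread t's own clock."""
    L[m] = list(C[t])
    C[t][t] += 1

def acquire(C: List[VC], L: Dict[str, VC], t: int, m: str) -> None:
    """acq(t, m): C_t becomes the join of C_t and L_m."""
    if m in L:
        C[t] = vc_join(C[t], L[m])

def write(C: List[VC], W: Dict[str, VC], t: int, x: str) -> bool:
    """wr(t, x): return True if all earlier writes happen before this one."""
    w = W.setdefault(x, [0] * len(C[t]))
    ordered = vc_leq(w, C[t])   # full O(n) comparison W_x <= C_t
    w[t] = C[t][t]              # record the clock of t's latest write
    return ordered

def run(with_lock: bool) -> List[bool]:
    """Replay wr(0,x) [rel(0,m) acq(1,m)] wr(1,x) with assumed start clocks."""
    C = [[4, 0], [0, 8]]        # assumed starting clocks for threads 0 and 1
    L: Dict[str, VC] = {}
    W: Dict[str, VC] = {}
    results = [write(C, W, 0, "x")]
    if with_lock:
        release(C, L, 0, "m")
        acquire(C, L, 1, "m")
    results.append(write(C, W, 1, "x"))
    return results

synchronized = run(with_lock=True)     # second write is ordered via lock m
unsynchronized = run(with_lock=False)  # second write is concurrent: a race
```

With the lock hand-off, thread 1 inherits thread 0's clock through Lm, so the second write passes the check; without it the comparison fails and a write-write race would be reported.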
The FastTrack Algorithm
  Empirical data gathered from race detection runs
    a full VC is not necessary for almost all read and write operations
    a lightweight representation of the happens-before relation can be used instead
    only a small fraction of operations need full vector clock operations
  How to catch each type of race condition?
    each race condition is one of
      a read-write race: a read concurrent with a later write to the same variable, e.g. rd(0, x) then wr(1, x)
      a write-read race: a write concurrent with a later read, e.g. wr(0, x) then rd(1, x)
      a write-write race: two concurrent writes, e.g. wr(0, x) then wr(1, x)
Detecting write-write races
  as long as no race has been detected, all writes to x are totally ordered
  so it suffices to record an epoch for the last write
    an epoch is a pair of a clock c and a thread t
  epochs reduce the space and analysis overhead of the write-write check to O(1)
  Figure: C0, C1, Lm, and Wx (initially ⊥e) over the trace wr(0, x), rel(0, m), acq(1, m), wr(1, x)
Detecting write-read races
  uses the epoch of Wx and the current vector clock Ct
  checks that the read happens after the last write
  the comparison Wx ≤ Ct needs only O(1) time
  Figure: C0, C1, Lm, and Wx (initially ⊥e) with the operations wr(0, x), rel(0, m), acq(1, m), wr(1, x), rd(0, x)
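The constant-time comparison of an epoch against a vector clock can be sketched as follows. The `Epoch` type, the function name `epoch_leq`, and the concrete clock values are assumptions chosen for illustration; they are not taken from the slides.

```python
from typing import List, NamedTuple

class Epoch(NamedTuple):
    """An epoch: the clock value of one access plus the thread that made it."""
    clock: int
    tid: int

def epoch_leq(e: Epoch, C: List[int]) -> bool:
    """Epoch e happens before vector clock C iff e.clock <= C[e.tid].

    Only one entry of C is inspected, so the check is O(1) regardless of
    the number of threads.
    """
    return e.clock <= C[e.tid]

# Assumed values: thread 0 wrote x at clock 4, then released lock m;
# thread 1 acquired m, so its vector clock became <4, 8>.
W_x = Epoch(clock=4, tid=0)
C1_after_acquire = [4, 8]
write_is_ordered = epoch_leq(W_x, C1_after_acquire)   # 4 <= 4

# Thread 1 now writes x at clock 8. A read by thread 0 that has not
# synchronized with thread 1 still has vector clock <5, 0>.
W_x = Epoch(clock=8, tid=1)
C0_unsynchronized = [5, 0]
read_is_ordered = epoch_leq(W_x, C0_unsynchronized)   # 8 <= 0 fails
```

The same function serves both the write-write and the write-read check, since in each case the last write is summarized by a single epoch.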
Detecting read-write races
  detecting read-write races is more difficult
    reads are not necessarily totally ordered, so a write could potentially conflict with the last read performed by any other thread
    in general, the analysis needs to record an entire VC that holds, for each thread t, the clock of the last read from x by t
  common situations in which an epoch suffices (reads are totally ordered in practice)
    Thread-local data: only one thread accesses a variable, and hence these accesses are totally ordered (program order)
    Lock-protected data: a protecting lock is held on each access to a variable, and hence all accesses are totally ordered (program order or synchronization order)
  reads are typically unordered only when data is read-shared
  FastTrack uses an adaptive representation for tracking the read history
Analysis Details
  an online algorithm that maintains an analysis state σ
    σ = (Ct, Lm, Rx, Wx)
    Rx: identifies either the epoch of the last read of x (when all other reads are ordered before it) or a vector clock that is the join of all reads of x
  reads: 82.3% of all operations
    require O(n) time for shared reads: 0.1% of reads
    require O(1) time for other reads
  writes: 14.5% of all operations
    require O(n) time for shared writes: 0.1% of writes
    require O(1) time for other writes
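The adaptive read representation and the read and write checks can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: the class name `FastTrackSketch`, the race messages, the starting clocks, and the example interleaving are all hypothetical, locks are omitted, and each thread's own clock entry is assumed to start at 1 or higher so that it never equals the minimal epoch.

```python
from typing import Dict, List, NamedTuple, Union

class Epoch(NamedTuple):
    clock: int
    tid: int

BOTTOM = Epoch(0, 0)  # the minimal epoch, written ⊥e on the slides

class FastTrackSketch:
    def __init__(self, clocks: List[List[int]]) -> None:
        self.C = [list(c) for c in clocks]               # C_t per thread
        self.R: Dict[str, Union[Epoch, List[int]]] = {}  # R_x: epoch or VC
        self.W: Dict[str, Epoch] = {}                    # W_x: always an epoch
        self.races: List[str] = []

    def _epoch(self, t: int) -> Epoch:
        return Epoch(self.C[t][t], t)

    def _leq(self, e: Epoch, t: int) -> bool:
        return e.clock <= self.C[t][e.tid]               # O(1) epoch check

    def read(self, t: int, x: str) -> None:
        r = self.R.get(x, BOTTOM)
        if r == self._epoch(t):                          # same epoch: no work
            return
        if not self._leq(self.W.get(x, BOTTOM), t):
            self.races.append(f"write-read race on {x}")
        if isinstance(r, list):                          # already read-shared
            r[t] = self.C[t][t]
        elif self._leq(r, t):                            # reads still ordered
            self.R[x] = self._epoch(t)
        else:                                            # switch epoch -> VC
            vc = [0] * len(self.C)
            vc[r.tid], vc[t] = r.clock, self.C[t][t]
            self.R[x] = vc

    def write(self, t: int, x: str) -> None:
        w = self.W.get(x, BOTTOM)
        if w == self._epoch(t):                          # same epoch: no work
            return
        if not self._leq(w, t):
            self.races.append(f"write-write race on {x}")
        r = self.R.get(x, BOTTOM)
        if isinstance(r, list):                          # O(n) check, rare
            if not all(c <= ct for c, ct in zip(r, self.C[t])):
                self.races.append(f"read-write race on {x}")
            self.R[x] = BOTTOM                           # back to epoch mode
        elif not self._leq(r, t):
            self.races.append(f"read-write race on {x}")
        self.W[x] = self._epoch(t)

    def fork(self, t: int, u: int) -> None:
        self.C[u] = [max(a, b) for a, b in zip(self.C[u], self.C[t])]
        self.C[t][t] += 1

    def join(self, t: int, u: int) -> None:
        self.C[t] = [max(a, b) for a, b in zip(self.C[t], self.C[u])]
        self.C[u][u] += 1

# Race-free run: the join orders thread 1's read before the second write.
ok = FastTrackSketch([[7, 0], [0, 1]])   # assumed starting clocks
ok.write(0, "x"); ok.fork(0, 1); ok.read(1, "x"); ok.read(0, "x")
shared_reads = list(ok.R["x"])           # R_x has become a vector clock
ok.join(0, 1); ok.write(0, "x"); ok.read(0, "x")

# Racy run: thread 0 writes again without joining thread 1 first.
bad = FastTrackSketch([[7, 0], [0, 1]])
bad.write(0, "x"); bad.fork(0, 1); bad.read(1, "x"); bad.write(0, "x")
```

In the race-free run, Rx switches from an epoch to a vector clock only when the two reads are concurrent, and the following ordered write resets it to an epoch, which is what keeps the common cases at O(1).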
An Example of FT
  Figure: evolution of C0, C1, Wx, and Rx (Wx and Rx initially ⊥e) for a trace in which thread 0 performs wr(0, x), fork(0, 1), rd(0, x), join(0, 1), wr(0, x), rd(0, x) and thread 1 performs rd(1, x)
Implementation
  FT Instrumentation State and Code
    represents an epoch as a 32-bit integer
      the top 8 bits store the thread identifier t
      the bottom 24 bits store the clock c
    associates with each thread a ThreadState object containing a unique thread identifier tid and a vector clock C
    for instrumentation: t.C[t.tid]
  Granularity
    supports two levels of granularity for analyzing memory locations
      fine-grain analysis (default) and coarse-grain analysis
    coarse-grain analysis reduces the memory footprint but may produce false alarms if two fields of an object are protected by different locks
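The 32-bit epoch layout described above (thread identifier in the top 8 bits, clock in the bottom 24 bits) can be sketched with plain integer arithmetic. The helper names `pack_epoch`, `tid_of`, and `clock_of` are hypothetical and only illustrate the bit layout.

```python
TID_BITS = 8
CLOCK_BITS = 24
CLOCK_MASK = (1 << CLOCK_BITS) - 1   # 0x00FFFFFF

def pack_epoch(tid: int, clock: int) -> int:
    """Pack a thread identifier and a clock into one 32-bit epoch value."""
    if not 0 <= tid < (1 << TID_BITS):
        raise ValueError("thread identifier does not fit in 8 bits")
    if not 0 <= clock <= CLOCK_MASK:
        raise ValueError("clock does not fit in 24 bits")
    return (tid << CLOCK_BITS) | clock

def tid_of(epoch: int) -> int:
    """Extract the thread identifier from the top 8 bits."""
    return epoch >> CLOCK_BITS

def clock_of(epoch: int) -> int:
    """Extract the clock from the bottom 24 bits."""
    return epoch & CLOCK_MASK

e = pack_epoch(tid=3, clock=5)
```

One consequence of this layout is a limit of 256 thread identifiers and 2^24 clock values per thread; in exchange, two epochs of the same thread can be ordered with a single integer comparison.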
Extensions
  supports additional synchronization primitives
    wait and notify, volatile variables, and barriers
  wait and notify
    models a wait operation on lock m as a release of m followed by a later acquire of m, so it does not need additional analysis rules
    a notify operation can be ignored
  volatile variables
    the memory model guarantees that a write of vx happens before every subsequent read of vx
    extends the L component to map volatile variables to the VC of the last write
    volatile reads and writes modify the analysis state analogously to lock acquire and release
  barriers
    considers a release operation barrier_rel(T) for the set of threads T leaving a barrier
    the first post-barrier step of each thread happens after all pre-barrier steps, but is unordered with respect to the next steps taken by the other threads
Evaluation
  Precision and Performance
    compares the precision and performance of 7 dynamic analyses
      Empty, FastTrack, Eraser, DJIT+, MultiRace, Goldilocks, and BasicVC
    all tools were implemented on top of RoadRunner
  Benchmark Configuration
    performed experiments on 16 benchmarks
    reports at most one race for each field of each class and each array access
  Summary of Results
    FT outperforms the other tools
    provides almost a 10x speedup over BasicVC and a 2.3x speedup even over the DJIT+ algorithm
    provides a substantial increase in precision over Eraser without loss in performance
Conclusion
  FastTrack is a new precise race detection algorithm
    uses an adaptive lightweight representation for the happens-before relation that reduces both space and time overheads
    despite its efficiency, it is a comparatively simple algorithm that is straightforward to implement
    contains optimized constant-time fast paths that handle upwards of 96% of the operations in the benchmarks
    provides a 2.3x performance improvement over the DJIT+ algorithm, and incurs less than half the memory overhead of DJIT+