Efficient On-the-Fly Data Race Detection in Multithreaded C++ Programs. Eli Pozniansky & Assaf Schuster. PADTAD, Nice, April 2003.


Efficient On-the-Fly Data Race Detection in Multithreaded C++ Programs Eli Pozniansky & Assaf Schuster

2 What is a Data Race?
Two concurrent accesses to a shared location, at least one of them for writing. Indicative of a bug.

Thread 1        Thread 2
X++             T=Y
Z=2             T=X

3 How Can Data Races be Prevented?
Explicit synchronization between threads: Locks, Critical Sections, Barriers, Mutexes, Semaphores, Monitors, Events, etc.

Thread 1        Thread 2
Lock(m)
X++
Unlock(m)
                Lock(m)
                T=X
                Unlock(m)

4 Is This Sufficient?
Yes! No! It is programmer dependent:
- Correctness: the programmer may forget to synchronize. We need tools to detect data races.
- Efficiency: to achieve correctness, the programmer may overdo it, and synchronization is expensive. We need tools to remove excessive synchronization.

5 Where is Waldo?

#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    lock(g_lock);
    int i;
    for (i = 0; i < number; i++)
        if (obj == g_stack[i])
            break;          // Found!!!
    if (i == number)
        i = -1;             // Not found ... return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );
}

6 Can You Find the Race?

#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;          // write
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    lock(g_lock);
    int i;
    for (i = 0; i < number; i++)
        if (obj == g_stack[i])
            break;          // Found!!!
    if (i == number)
        i = -1;             // Not found ... return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );   // read
}
```

The race: the single-argument find reads g_counter without holding g_lock, while popAll may concurrently write it under the lock. A similar problem was found in java.util.Vector.

7 Detecting Data Races?
NP-hard [Netzer & Miller 1990]:
- Input size = # instructions performed
- Even for only 3 threads
- Even with no loops/recursion
- Execution orders/schedulings: (# threads)^thread_length
- # inputs
- Detection code's side effects
- Weak memory, instruction reordering, atomicity

8 Apparent Data Races
- Based only on the behavior of the explicit synchronization, not on program semantics
- Easier to locate
- Less accurate
- Exist iff a "real" (feasible) data race exists
- Detection is still NP-hard

Initially: grades = oldDatabase; updated = false;

Thread T.A.                 Thread Lecturer
grades = newDatabase;
updated = true;
                            while (updated == false);
                            X := grades.gradeOf(lecturersSon);

9 Detection Approaches
- Restricted programming model: usually fork-join
- Static: Emrath & Padua 88; Balasundaram & Kennedy 89; Mellor-Crummey 93; Flanagan & Freund 01
- Postmortem: Netzer & Miller 90, 91; Adve & Hill 91
- On-the-fly: Dinning & Schonberg 90, 91; Savage et al. 97; Itskovitz et al. 99; Perkovic & Keleher 00; Choi 02
Issues: programming model, synchronization method, memory model, accuracy, overhead, granularity, coverage.

10 MultiRace Approach
- On-the-fly detection of apparent data races
- Two detection algorithms (improved versions):
  - Lockset [Savage, Burrows, Nelson, Sobalvarro, Anderson 97]
  - Djit+ [Itzkovitz, Schuster, Zeev-ben-Mordechai 99]
- Correct even for weak memory systems
- Flexible detection granularity: variables and objects; especially suited for OO programming languages
- Source-code (C++) instrumentation + memory mappings
- Transparent
- Low overhead

11 Djit+ [Itskovitz et al. 1999] Apparent Data Races
Lamport's happens-before partial order:
- a, b are concurrent if neither a hb→ b nor b hb→ a. Two concurrent accesses form an apparent data race.
- Otherwise, they are "synchronized".
Djit+ basic idea: check each access performed against all "previously performed" accesses.

Thread 1        Thread 2
a
Unlock(L)
                Lock(L)
                b

Here a hb→ b.

12 Djit+ Local Time Frames (LTF)
The execution of each thread is split into a sequence of time frames. A new time frame starts on each unlock. For every access there is a timestamp = a vector of LTFs known to the thread at the moment the access takes place.

Thread          LTF
x = 1           1
lock( m1 )      1
z = 2           1
lock( m2 )      1
y = 3           1
unlock( m2 )    2
z = 4           2
unlock( m1 )    3
x =             3

13 Djit+ Vector Time Frames

Thread 1            Thread 2                Thread 3
(1 1 1) write X
release( m1 )
                    acquire( m1 )
                    (2 1 1) read Z
                    (2 1 1) read Y
                    release( m2 )
                    (2 2 1) write X
                                            acquire( m2 )
                                            (2 2 1) write X

14 Djit+ Local Time Frames
Claim 1: Let a in thread ta and b in thread tb be two accesses, where a occurs at time frame Ta, and the release in ta corresponding to the latest acquire in tb which precedes b occurs at time frame Tsync in ta. Then a hb→ b iff Ta < Tsync.

[Figure: a possible sequence of release-acquire pairs between ta and tb, with a at frame Ta, its matching release at Trelease, and the release corresponding to tb's latest acquire at Tsync.]

15 Djit+ Local Time Frames
Proof:
- If Ta < Tsync then a hb→ release, and since release hb→ acquire and acquire hb→ b, we get a hb→ b.
- If a hb→ b, then since a and b are in distinct threads, by definition there exists a pair of corresponding release and acquire such that a hb→ release and acquire hb→ b. It follows that Ta < Trelease ≤ Tsync.

16 Djit+ Checking Concurrency
P(a,b) ≜ ( a.type = write ⋁ b.type = write ) ⋀
         ( a.ltf ≥ b.timestamp[a.thread_id] ) ⋀
         ( b.ltf ≥ a.timestamp[b.thread_id] )

P returns TRUE iff a and b are racing.
Problem: too much logging, too many checks.

17 Djit+ Checking Concurrency
P(a,b) ≜ ( a.type = write ⋁ b.type = write ) ⋀
         ( a.ltf ≥ b.timestamp[a.thread_id] )

Given that a was logged earlier than b, and given sequential consistency of the log (a hb→ b ⟹ a logged before b ⟹ not b hb→ a),
P returns TRUE iff a and b are racing. There is no need to log the full vector timestamp!

18 Djit+ Which Accesses to Check?
a in thread t1; b and c in thread t2 in the same ltf, with b preceding c in program order. If a and b are synchronized, then a and c are synchronized as well.
⟹ It is sufficient to record only the first read access and the first write access to a variable in each ltf.

Thread 1            Thread 2
lock( m )
write X
read X              (no logging)
unlock( m )
                    read X
                    lock( m )
                    write X             (no logging)
                    unlock( m )
lock( m )
read X
write X             <- race with the unprotected read X
unlock( m )

19 Djit+ Which LTFs to Check?
a occurs in t1; b and c "previously" occur in t2. If a is synchronized with c, then it must also be synchronized with b.
⟹ It is sufficient to check a "current" access only against the "most recent" accesses in each of the other threads.

[Figure: in t2, b occurs inside lock(m)...unlock(m) and c occurs after a later unlock; a occurs afterwards in t1.]

20 Djit+ Access History
For every variable v, for each of the threads:
- The last ltf in which the thread read from v (r-ltf1, r-ltf2, ..., r-ltfn)
- The last ltf in which the thread wrote to v (w-ltf1, w-ltf2, ..., w-ltfn)
On each first read and first write to v in an ltf, the thread updates the access history of v:
- If the access to v is a read, the thread checks all recent writes by other threads to v.
- If the access is a write, the thread checks all recent reads as well as all recent writes by other threads to v.

21 Djit+ Pros and Cons
+ No false alarms
+ No missed races (in a given scheduling)
- Very sensitive to differences in scheduling
- Requires an enormous number of runs; yet it cannot prove the tested program is race free.
Can be extended to support other synchronization primitives, such as barriers, counting semaphores, messages, ...

22 Lockset [Savage et al. 1997] Locking Discipline
A locking discipline is a programming policy that ensures the absence of data races. A simple yet common locking discipline is to require that every shared variable is protected by a mutual-exclusion lock. The Lockset algorithm detects violations of the locking discipline. Its main drawback is a possibly excessive number of false alarms.

23 Lockset (2) What is the Difference?

Thread 1                Thread 2
Y = Y + 1;   [1]
Lock( m );
V = V + 1;
Unlock( m );
                        Lock( m );
                        V = V + 1;
                        Unlock( m );
                        Y = Y + 1;   [2]

[1] hb→ [2], yet there is a feasible data race under a different scheduling.

Thread 1                Thread 2
Y = Y + 1;   [1]
Lock( m );
Flag = true;
Unlock( m );
                        Lock( m );
                        T = Flag;
                        Unlock( m );
                        if ( T == true )
                            Y = Y + 1;   [2]

No locking discipline on Y, yet [1] and [2] are ordered under all possible schedulings.

24 Lockset (3) The Basic Algorithm
For each shared variable v, let C(v) be the set of locks that have protected v during the computation so far. Let locks_held(t) at any moment be the set of locks held by thread t at that moment.
The Lockset algorithm:
- for each v, init C(v) to the set of all possible locks
- on each access to v by thread t:
  - C(v) <- C(v) ∩ locks_held(t)
  - if C(v) = ∅, issue a warning

25 Lockset (4) Explanation Clearly, a lock m is in C(v) if in execution up to that point, every thread that has accessed v was holding m at the moment of access. The process, called lockset refinement, ensures that any lock that consistently protects v is contained in C(v). If some lock m consistently protects v, it will remain in C(v) till the termination of the program.

26 Lockset (5) Example
The locking discipline for v is violated since no lock protects it consistently.

Program             locks_held      C(v)
                    { }             {m1, m2}
Lock( m1 );         {m1}
v = v + 1;                          {m1}
Unlock( m1 );       { }
Lock( m2 );         {m2}
v = v + 1;                          { }  <- warning
Unlock( m2 );       { }

27 Lockset (6) Improving the Locking Discipline
The locking discipline described above is too strict. Three very common programming practices violate it, yet are free from data races:
- Initialization: shared variables are usually initialized without holding any locks.
- Read-shared data: some shared variables are written during initialization only and are read-only thereafter.
- Read-write locks: read-write locks allow multiple readers to access a shared variable, but only a single writer.

28 Lockset (7) Initialization
When initializing newly allocated data there is no need to lock it, since other threads cannot hold a reference to it yet. Unfortunately, there is no easy way of knowing when initialization is complete. Therefore, a shared variable is considered initialized when it is first accessed by a second thread. As long as a variable is accessed by a single thread, reads and writes don't update C(v).

29 Lockset (8) Read-Shared Data There is no need to protect a variable if it’s read-only. To support unlocked read-sharing, races are reported only after an initialized variable has become write-shared by more than one thread.

30 Lockset (9) Initialization and Read-Sharing
Newly allocated variables begin in the Virgin state. As various threads read and write the variable, its state changes according to the transitions below. Races are reported only for variables in the Shared-Modified state. The algorithm becomes more dependent on the scheduler.

Virgin      --(wr by first thr)-->      Exclusive
Exclusive   --(rd/wr by first thr)-->   Exclusive
Exclusive   --(rd by new thr)-->        Shared
Exclusive   --(wr by new thr)-->        Shared-Modified
Shared      --(rd by any thr)-->        Shared
Shared      --(wr by any thr)-->        Shared-Modified

31 Lockset (10) Initialization and Read-Sharing
The states are:
- Virgin: the data is new and has not been referenced by any other thread.
- Exclusive: entered after the data is first accessed (by a single thread). Subsequent accesses by that thread don't update C(v) (handles initialization).
- Shared: entered after a read access by a new thread. C(v) is updated, but data races are not reported. In this way, multiple threads can read the variable without causing a race to be reported (handles read-sharing).
- Shared-Modified: entered when more than one thread accesses the variable and at least one access is a write. C(v) is updated and races are reported as in the original algorithm.

32 Lockset (11) Read-Write Locks
Many programs use Single Writer/Multiple Readers (SWMR) locks as well as simple locks. The basic algorithm doesn't correctly support this style of synchronization.
Definition: For a variable v, some lock m protects v if m is held in write mode for every write of v, and m is held in some mode (read or write) for every read of v.

33 Lockset (12) Read-Write Locks – Final Refinement
When the variable enters the Shared-Modified state, the checking is different:
- Let locks_held(t) be the set of locks held in any mode by thread t.
- Let write_locks_held(t) be the set of locks held in write mode by thread t.

34 Lockset (13) Read-Write Locks – Final Refinement
The refined algorithm (for Shared-Modified):
- for each v, initialize C(v) to the set of all locks
- on each read of v by thread t:
  - C(v) <- C(v) ∩ locks_held(t)
  - if C(v) = ∅, issue a warning
- on each write of v by thread t:
  - C(v) <- C(v) ∩ write_locks_held(t)
  - if C(v) = ∅, issue a warning
Since locks held purely in read mode don't protect against data races between the writer and other readers, they are not considered when a write occurs and are thus removed from C(v).

35 Lockset (14) Still False Alarms
The refined algorithm will still produce a false alarm in the following simple case:

Thread 1                Thread 2                C(v)
                        Lock( m1 );
                        Lock( m2 );
                        v = v + 1;              {m1, m2}
                        Unlock( m2 );
                        Unlock( m1 );
Lock( m1 );
v = v + 1;                                      {m1}
Unlock( m1 );
Lock( m2 );
v = v + 1;                                      { }  <- false alarm
Unlock( m2 );

36 Lockset (15) Additional False Alarms
Additional possible false alarms:
- A queue that implicitly protects its elements by accessing the queue through locked head and tail fields.
- A thread that passes arguments to a worker thread. Since the main thread and the worker thread never access the arguments concurrently, they do not use any locks to serialize their accesses.
- Privately implemented SWMR locks, which don't communicate with Lockset.
- True data races that don't affect the correctness of the program (for example "benign" races):

if (f == 0) {
    lock(m);
    if (f == 0)
        f = 1;
    unlock(m);
}

37 Lockset (16) Results
Lockset was implemented in a full-scale testing tool called Eraser, which is used in industry (not "on paper only").
+ Eraser was found to be quite insensitive to differences in thread interleaving (when applied to programs that are "deterministic enough").
- Since a superset of the apparent data races is located, false alarms are inevitable.
- It still requires an enormous number of runs to ensure the tested program is race free, yet cannot prove it.
- The measured slowdowns are by a factor of 10 to 30.

38 Lockset (17) Which Accesses to Check?
If a and b are in the same thread and the same time frame, with a preceding b, then Locks_a(v) ⊆ Locks_b(v), where Locks_u(v) is the set of locks held during access u to v.
⟹ Only first accesses need be checked in every time frame.
⟹ Lockset can use the same logging (access history) as Djit+.

Thread              Locks(v)
unlock ...
lock(m1)
a: write v          {m1}
write v             {m1} = {m1}
lock(m2)
b: write v          {m1, m2} ⊇ {m1}
unlock(m2)
unlock(m1)

39 Lockset Pros and Cons
+ Less sensitive to scheduling
+ Detects a superset of all apparently raced locations in an execution of a program: races cannot be missed
- Lots (and lots) of false alarms
- Still dependent on scheduling: cannot prove the tested program is race free

40 Combining Djit+ and Lockset
[Venn diagram: within S, all shared locations in program P, Lockset's violations L form a superset of A, all apparently raced locations in P, which in turn contains F, all raced locations in P; D, the raced locations detected by Djit+, lies within A.]

- Lockset can detect suspected races in more execution orders.
- Djit+ can filter out the spurious warnings reported by Lockset.
- Lockset can help reduce the number of checks performed by Djit+: if C(v) is not yet empty, Djit+ need not check v for races.
- The implementation overhead comes mainly from the access logging mechanism, which can be shared by the two algorithms.

41 Implementing Access Logging: Recording First LTF Accesses
[Figure: the physical shared memory holding X and Y is mapped into each thread's virtual memory through one of three views: No-Access, Read-Only, or Read-Write.]
An access attempt with wrong permissions generates a fault. The fault handler activates the logging and the detection mechanisms, and switches views.

42 Swizzling Between Views
[Figure: a thread's accesses move its view of a page among No-Access, Read-Only, and Read-Write: a read of x under the No-Access view causes a read fault and a switch to Read-Only; a write of x causes a write fault and a switch to Read-Write; on unlock(m) the view reverts, so the first accesses of the next time frame fault again.]

43 Detection Granularity
A minipage (= detection unit) can contain:
- Objects of primitive types: char, int, double, etc.
- Objects of complex types: classes and structures
- Entire arrays of complex or primitive types
An array can be placed on a single minipage or split across several minipages. The array still occupies contiguous addresses.

44 Playing with Detection Granularity to Reduce Overhead
- Larger minipages ⟹ reduced overhead (fewer faults)
- A minipage should be refined into smaller minipages when suspicious alarms occur. Replay technology can help (if available).
- When the suspicion is resolved, regroup. Detection may be disabled on the accesses involved.

45 Detection Granularity

46 Example of Instrumentation

void func( Type* ptr, Type& ref, int num ) {
    for ( int i = 0; i < num; i++ ) {
        ptr->smartPointer()->data += ref.smartReference().data;
        ptr++;
    }
    Type* ptr2 = new(20, 2) Type[20];
    memset( ptr2->write( 20*sizeof(Type) ), 0, 20*sizeof(Type) );
    ptr = &ref;
    ptr2[0].smartReference() = *ptr->smartPointer();
    ptr->member_func( );    // No change!!!
}

Currently, the desired granularity is specified by the user through source-code annotation.

Reporting Races in MultiRace

Benchmark Specifications (2 threads)

Benchmark | Input Set                          | Shared Memory | # Minipages / # Write-Read Faults / # Timeframes / Time in sec (no DR)
FFT       | 2^8 * 2^8                          | 3MB           | 49/
IS        | 2^23 numbers, 2^15 values          | 128KB         | 360/
LU        | 1024*1024 matrix, block size 32*32 | 8MB           | 5127/
SOR       | 1024*2048 matrices, 50 iterations  | 8MB           | 2202/
TSP       | 19 cities, recursion level 12      | 1MB           | 92792/
WATER     | 512 molecules, 15 steps            | 500KB         | 315438/

Benchmark Overheads (4-way IBM Netfinity server, 550MHz, Win-NT)

Overhead Breakdown
Numbers above the bars are # write/read faults. Most of the overhead comes from page faults; the overhead due to the detection algorithms themselves is small.

51 The End