Model Checking of a lock-free stack
Wael Yehia
York University
March 31, 2010
Main Components of the Stack
- The stack itself is simply a top pointer.
- Each thread has a ThreadInfo object that uniquely identifies it.
- Two arrays are used for collision and (lock-free) synchronization purposes:
    - AtomicIntegerArray collision
    - AtomicReferenceArray<ThreadInfo<T>> location
- ThreadInfo<T> has the fields:
    - final int id
    - OP op
    - cell<T> cell
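The components listed above could be laid out roughly as follows. This is only a sketch: the wrapper class StackComponentsSketch, the State class, the Cell fields (data, next), the OP values, and the constructor parameters are assumptions made for illustration and are not taken from the original code.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical wrapper so the sketch compiles as one unit.
final class StackComponentsSketch {
    // Operation a thread is currently attempting (values assumed).
    enum OP { PUSH, POP }

    // Stack node; the field names are assumptions.
    static final class Cell<T> {
        final T data;
        final Cell<T> next;
        Cell(T data, Cell<T> next) { this.data = data; this.next = next; }
    }

    // Per-thread descriptor with the three fields named on the slide.
    static final class ThreadInfo<T> {
        final int id;   // uniquely identifies the thread
        OP op;          // operation being attempted
        Cell<T> cell;   // cell being pushed, or cell received by a pop
        ThreadInfo(int id) { this.id = id; }
    }

    // The shared state: a top pointer plus the two arrays.
    static final class State<T> {
        final AtomicReference<Cell<T>> top = new AtomicReference<>();
        final AtomicIntegerArray collision;
        final AtomicReferenceArray<ThreadInfo<T>> location;

        State(int maxThreads, int collisionSlots) {
            collision = new AtomicIntegerArray(collisionSlots);
            location = new AtomicReferenceArray<>(maxThreads);
        }
    }
}
```

In this layout, location is indexed by thread id, so a thread announces itself with location.set(info.id, info).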
JPF Testing
We ran JPF (Java PathFinder) on our test cases from assignment 2, lowering the number of threads and operations. We found:
- No deadlocks
- 1 data race (not fixed, so there may be more)
- 3 uncaught exceptions (all fixed)
Data Races Found
- We tested with different numbers of threads and of operations per thread.
- For 2 threads, no data races were found.
    - Numbers of operations tested: 2, 3, 4, 5
- For 3 threads, a data race was found.
    - It could not be fixed.
    - The race is related to the same problem that causes the NullPointerException discussed later.
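A driver with the thread and operation counts as parameters makes it easy to shrink a test until the model checker can explore it. The sketch below is an assumption about what such a driver might look like, not the assignment's test code: SmallDriver and run are hypothetical names, and java.util.concurrent.ConcurrentLinkedDeque stands in for the stack under test.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical driver; ConcurrentLinkedDeque is a stand-in for the stack under test.
final class SmallDriver {
    // Runs `threads` workers, each doing `opsPerThread` push/pop pairs.
    // Returns true if every pushed element was either popped or is still in the stack.
    static boolean run(int threads, int opsPerThread) {
        ConcurrentLinkedDeque<Integer> stack = new ConcurrentLinkedDeque<>();
        AtomicInteger popped = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * opsPerThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    stack.push(base + i);
                    // pollFirst returns null instead of throwing on an empty stack
                    if (stack.pollFirst() != null) {
                        popped.incrementAndGet();
                    }
                }
            });
        }
        for (Thread w : workers) {
            w.start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("driver interrupted", e);
        }
        int pushed = threads * opsPerThread;
        return popped.get() + stack.size() == pushed;
    }
}
```

Calling SmallDriver.run(2, 2) through SmallDriver.run(2, 5) and then SmallDriver.run(3, 2) would mirror the configurations listed above.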
Uncaught Exceptions Found
- One NullPointerException
- Two AssertionError exceptions
- The NullPointerException and one of the assertion errors appear to be related.
    - They occur in the same scenario that causes the data race.
- When that problem was fixed, so was the second assertion error.
The Untested Scenario
The null pointer and data race problems arise in the following situation, which is not accounted for in the paper. Let p stand for Thread p, q for Thread q, and qInfo for q’s ThreadInfo. Thread p runs p.pop() and thread q runs q.push():
  1. p collides with q.
  2. q sees that someone has collided with it.
  3. q exits normally.
  4. q starts another operation (either operation alters qInfo.cell).
  5. p reads qInfo.cell.
The Untested Scenario
In general, the scenario is as follows:
  1. Two threads collide (p.pop() and q.push()).
  2. The pushing thread finishes first and exits.
  3. It then executes another stack operation before the popping thread has read any data from it.
  4. The popping thread wakes up and starts reading the data from q’s ThreadInfo, which no longer holds the pushed value.
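The interleaving can be replayed deterministically in a single thread to see why the late read goes wrong. This is a simplified sketch, not the stack's actual code: StaleReadReplay and valueSeenByP are hypothetical names, Cell holds a bare int, and it assumes q reuses one ThreadInfo object across operations and that p collides by clearing q's slot with a compare-and-set.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Single-threaded replay of the problematic interleaving (illustrative only).
final class StaleReadReplay {
    static final class Cell {
        final int data;
        Cell(int data) { this.data = data; }
    }

    static final class ThreadInfo {
        final int id;
        Cell cell;
        ThreadInfo(int id) { this.id = id; }
    }

    // cellOfNextOp is what q's next operation writes into qInfo.cell;
    // null models an operation that clears the field (an assumption).
    static int valueSeenByP(Cell cellOfNextOp) {
        AtomicReferenceArray<ThreadInfo> location = new AtomicReferenceArray<>(2);
        ThreadInfo qInfo = new ThreadInfo(1);

        // q.push(42): q announces itself in its slot.
        qInfo.cell = new Cell(42);
        location.set(qInfo.id, qInfo);

        // p.pop(): p collides with q by clearing q's slot.
        if (!location.compareAndSet(qInfo.id, qInfo, null)) {
            throw new IllegalStateException("collision should succeed in this replay");
        }

        // q sees the collision, exits, and starts another operation
        // that reuses (and overwrites) qInfo before p has read it.
        qInfo.cell = cellOfNextOp;

        // p finally reads q's data: too late, the pushed 42 is gone.
        return qInfo.cell.data;
    }
}
```

Passing new StaleReadReplay.Cell(7) makes p receive 7 instead of 42; passing null makes the final read throw a NullPointerException.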
Solution
- The problem occurs when one thread (the popping thread p) is slow in reading the data from the second thread (the pushing thread q).
- The fast thread cannot wait for the slow thread, so it has to store its data somewhere.
Quick reminder of the collision process:
- Two processes cannot be colliding with the same process, so the collision relation looks like this: …. q p r ….
  (diagram: neighbouring threads linked by "collide with" arrows)
- State of the location array (which holds the ThreadInfo objects) during a collision, for 3 threads: q pushing, p popping, r popping, with p.id = 0, q.id = 1, r.id = 2:

    Slot (thread id)     0 (p)              1 (q)              2 (r)
    Before collision     ThreadInfo of p    ThreadInfo of q    ThreadInfo of r
    After collision      ThreadInfo of q    null               don’t care
Solution (Cont’d)
- Solution part (a), when q starts the collision: instead of saving its own ThreadInfo in the popping thread’s slot, q creates and stores a new dummy ThreadInfo holding the data.
- Solution part (b), when p starts the collision: before attempting the collision, p saves q’s data locally. If the collision succeeds, p uses the saved data; otherwise it discards it.
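Both parts of the fix can be sketched as two small helpers. This is a hedged illustration under simplifying assumptions, not the code that was actually checked with JPF: CollisionFixSketch, pushCollide, and popCollide are hypothetical names, and the ThreadInfo constructor is invented for the sketch.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative sketch of solution parts (a) and (b); all names are assumptions.
final class CollisionFixSketch {
    enum OP { PUSH, POP }

    static final class Cell<T> {
        final T data;
        Cell(T data) { this.data = data; }
    }

    static final class ThreadInfo<T> {
        final int id;
        OP op;
        Cell<T> cell;
        ThreadInfo(int id, OP op, Cell<T> cell) {
            this.id = id;
            this.op = op;
            this.cell = cell;
        }
    }

    // Part (a): the pusher q starts the collision with the popper p.
    // q publishes a fresh dummy ThreadInfo, so later changes to qInfo
    // cannot affect what p reads from its own slot.
    static <T> boolean pushCollide(AtomicReferenceArray<ThreadInfo<T>> location,
                                   ThreadInfo<T> qInfo, ThreadInfo<T> pInfo) {
        ThreadInfo<T> dummy = new ThreadInfo<>(qInfo.id, OP.PUSH, qInfo.cell);
        return location.compareAndSet(pInfo.id, pInfo, dummy);
    }

    // Part (b): the popper p starts the collision with the pusher q.
    // p saves q's cell before the compare-and-set and keeps it only on success.
    static <T> Cell<T> popCollide(AtomicReferenceArray<ThreadInfo<T>> location,
                                  ThreadInfo<T> qInfo) {
        Cell<T> saved = qInfo.cell;                 // read before colliding
        if (location.compareAndSet(qInfo.id, qInfo, null)) {
            return saved;                           // collision succeeded: use it
        }
        return null;                                // collision failed: discard it
    }
}
```

In both helpers the popper ends up holding data that was captured before the pusher could move on to its next operation.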
Conclusion
- JPF helped us find and understand the problem more clearly.
- JPF caught the exceptions within seconds, or minutes at most, whereas during our own testing they appeared only once in millions of operations executed by many threads concurrently.
- The described scenario causes the algorithm presented in the paper to fail.
- The data race has yet to be fixed.