VYRD: VerifYing Concurrent Programs by Runtime Refinement-Violation Detection

Tayfun Elmas, Serdar Tasiran
Koç University, Istanbul, Turkey
Shaz Qadeer
Microsoft Research, Redmond, WA

26/04/19 PLDI 2005

Hi all. I’m Tayfun Elmas from Koç University. In this talk I’ll present a technique for detecting concurrency errors by watching for refinement violations at runtime. This is joint work with my advisor Serdar Tasiran and with Shaz Qadeer from Microsoft Research.
Verifying Concurrent Data Structures: Motivation

Widely-used software systems are built on concurrent data structures
    File systems, databases, internet services
    Standard Java and C# class libraries
Intricate synchronization mechanisms to improve performance
    Prone to concurrency errors
Concurrency errors
    Data loss/corruption
    Difficult to detect and reproduce through testing

Many widely-used software applications are built on concurrent data structures. Examples are file systems, databases, internet services and some standard Java and C# class libraries. These systems frequently use intricate synchronization mechanisms to get better performance in a concurrent environment, which makes them prone to concurrency errors. Concurrency errors can have serious consequences, such as data loss or corruption. Unfortunately, these errors are typically hard to detect and reproduce through purely testing-based techniques.
Our Approach: Refinement as Correctness Criterion

Refinement
    For each execution of the implementation (Impl) there exists an “equivalent”, atomic execution of Spec
Linearizability, atomicity (by reduction)
    For each execution of Impl there exists an “equivalent” atomic execution of Impl
Refinement is less restrictive
    Rules out fewer implementations
    Example: a more permissive Spec allows exceptional method termination in a way not possible in an atomic execution of Impl

Keywords: Linearizability and atomicity are more restrictive. The flexibility in the spec gives us a more powerful method to prove correctness of some tricky implementations.

Correctness conditions like linearizability and atomicity require that for each execution of the impl in a concurrent environment there exists an equivalent atomic execution of the same impl. Refinement instead uses a separate specification: for each execution of the impl, it requires the existence of an equivalent atomic execution of this spec. The specification we use is more permissive than the impl. For example, the spec allows methods to terminate exceptionally, to model failure due to resource contention in a concurrent environment. An atomic execution of the impl would not allow some of these method executions to fail.
Our Approach: Runtime Checking of Refinement

Refinement
    For each execution of Impl there exists an “equivalent”, atomic execution of Spec
Use refinement as correctness criterion
    More thorough than assertions
    More observability than pure testing
Runtime verification: check refinement using execution traces
    Can handle industrial-scale programs
    Intermediate between testing and exhaustive verification

In our approach to verifying concurrent data structures we use refinement as the correctness criterion. The benefits of this choice are that refinement is a more thorough condition than method-local assertions and that it provides more observability than pure testing. We check refinement at runtime using execution traces of the implementation. We do this in order to be able to handle industrial-scale programs. With respect to how much of the execution space is explored, our approach can be regarded as intermediate between testing and exhaustive verification.
Outline

Example
Refinement
    I/O-refinement
    View-refinement
The VYRD tool
Experience
Conclusions

Here is the outline of my talk. First I’ll give a motivating data structure example and explain how our technique applies to this example. Then I’ll talk about two different notions of refinement called ... I’ll introduce our runtime verification tool Vyrd and the experience we had applying Vyrd to industrial-scale software.
Multiset: Implementation of LookUp

Multiset data structure
    M = { 2, 3, 3, 3, 9, 8, 8, 5 }
Represented by A[1..n]
    content: the element
    valid: is it in the set?

LookUp (x)
    for i = 1 to n
        acquire(A[i])
        if (A[i].content == x && A[i].valid)
            release(A[i])
            return true
        else
            release(A[i])
    return false

[Figure: array A with a content field and a valid field per slot, showing one possible representation of M]

Our motivating data structure is a multiset. Here is an example of a multiset. Notice that several copies of the same integer can be in the multiset, like 3 and 8 in this example. The implementation represents the multiset by an array A whose elements have two fields. The content field stores the integer element, and the Boolean valid field tells us whether the element is to be included in the multiset or not. For example, one representation of the multiset above could be the one at the bottom. On the right you see the implementation of the LookUp method. LookUp queries whether a given integer x is in the multiset. It traverses the array A linearly, locking elements one by one and checking whether the content is x and the valid field is set.
Multiset: Implementation of FindSlot

FindSlot: helper routine for InsertPair
    For space allocation
    Does not set the valid field
    x not in multiset yet

FindSlot (x)
    for i = 1 to n
        acquire(A[i])
        if (A[i].content == null)
            A[i].content = x
            release(A[i])
            return i
        else
            release(A[i])
    return 0

FindSlot is a helper method for an insertion method I will tell you about in the next slide. Given an integer x, it looks for an empty slot to put x in. If it finds one, it allocates the slot for x by setting its content field to x and returns the index; otherwise it returns 0. Notice that it doesn’t set the valid field, so x is not in the multiset yet. Thus a LookUp method that checks this slot will not treat x as being in the set.
Multiset: Implementation of InsertPair

InsertPair(x, y)
    Refinement violation if only one of x, y is inserted
    Two separate calls to FindSlot
        To allocate space for x and y
InsertPair allows exceptional termination
    Example: multiset array of size 2
        2 concurrent InsertPair’s both find slots for their x’s
        both fail to find slots for their y’s
    Not possible in an atomic execution

InsertPair (x, y)
    i = FindSlot (x)
    if (i == 0)
        return failure
    j = FindSlot (y)
    if (j == 0)
        A[i].content = null
        return failure
    acquire(A[i])
    acquire(A[j])
    A[i].valid = true
    A[j].valid = true
    release(A[i])
    release(A[j])
    return success

InsertPair has an interesting exceptional termination that is not possible in the sequential case. Using a separate spec, we do not rule out this exceptional execution.

Multiset has an InsertPair method that inserts a pair of integers x, y into the contents. The implementation of InsertPair is given on the right. InsertPair makes the multiset example interesting because it is representative of methods in real concurrent systems that first reserve several resources and then complete their operation on all of those resources atomically. It is considered an error if one of x or y is inserted but not the other. To prevent this error, InsertPair first makes two calls to FindSlot to allocate slots for x and y. If both FindSlot calls succeed, it includes x and y in the multiset atomically, in a protected block, by setting their valid bits to true. Then it returns success.

InsertPair returns failure if either of the FindSlot calls fails. This can happen because of resource contention with other concurrent InsertPair executions. For example, imagine we have an empty multiset of size n. n concurrent InsertPair’s running on this multiset can all find free slots for their x’s, but then they may be unable to find slots for their y’s because there are no more empty slots. This causes all the InsertPair’s to return failure even though at the beginning there was space for some of them to succeed.
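The three routines shown so far (LookUp, FindSlot, InsertPair) can be put together as a small runnable sketch. This is only an illustration under stated assumptions, not the authors' code: the class and method names are hypothetical, indices are 0-based with -1 standing in for the slides' "return 0", and the rollback of A[i].content is done under the slot lock, which the slide does not show.

```python
import threading

class Slot:
    def __init__(self):
        self.lock = threading.Lock()
        self.content = None   # the element, or None if the slot is free
        self.valid = False    # True once the element is in the multiset

class Multiset:
    def __init__(self, n):
        self.A = [Slot() for _ in range(n)]

    def look_up(self, x):
        # Lock slots one at a time, as in the slide.
        for slot in self.A:
            with slot.lock:
                if slot.content == x and slot.valid:
                    return True
        return False

    def find_slot(self, x):
        # Reserve a free slot for x without setting valid.
        # Returns the index, or -1 on failure (assumption: 0-based indices).
        for i, slot in enumerate(self.A):
            with slot.lock:
                if slot.content is None:
                    slot.content = x
                    return i
        return -1

    def insert_pair(self, x, y):
        i = self.find_slot(x)
        if i < 0:
            return "failure"
        j = self.find_slot(y)
        if j < 0:
            with self.A[i].lock:        # locking here is an assumption
                self.A[i].content = None
            return "failure"
        # i != j always holds, because slot i is already occupied by x.
        with self.A[i].lock, self.A[j].lock:
            self.A[i].valid = True
            self.A[j].valid = True
        return "success"
```

The contention scenario from the slide can be replayed by hand on a `Multiset(2)`: two `find_slot` calls for the two x values take both slots, after which any `find_slot` for a y value returns -1, so both InsertPair executions would report failure.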
Multiset: Specification

Spec state
    M: set of integers
Each method
    Atomic, deterministic state update/observation
    Given current state, arguments and method return value (if one exists), specifies new Spec state

INSERTPAIR (x, y, retval)
    if (retval == success)
        M = M ∪ {x, y}
    return retval

LOOKUP (x)
    if (x ∈ M)
        return true
    else
        return false

DELETE (x)
    M = M \ {x}

NOTE: Coordinate the bullet “Given...” with the InsertPair method.

Here we give the specification for the multiset. The state of the spec is represented by a set M of integers. Each method of the specification specifies an atomic, deterministic update or observation of the specification state. A mutator method, given the current state and the arguments, specifies what the next state will be. Notice that some methods also take a return value that affects the behavior of the method. For example, InsertPair takes two integers and a return value. If the return value is success, it specifies a new state with x and y included. If the return value is not success, it leaves the current state unchanged. We let the return value affect the state transition in order to model the InsertPair executions in the impl that fail due to concurrency. There is also the Delete method, which removes an integer from the multiset, and the LookUp method, which queries the set for a given integer.
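As a rough executable rendering of this specification, the sketch below keeps M as a `collections.Counter`. The class name is hypothetical, and treating Delete as removing a single copy of x is an assumption; the slide only writes M = M \ {x}.

```python
from collections import Counter

class MultisetSpec:
    """Atomic, deterministic spec; each method is one state update or observation."""

    def __init__(self):
        self.M = Counter()   # element -> number of copies

    def insert_pair(self, x, y, retval):
        # The return value observed in the Impl decides the transition.
        if retval == "success":
            self.M[x] += 1
            self.M[y] += 1
        return retval

    def look_up(self, x):
        return self.M[x] > 0

    def delete(self, x):
        # Assumption: remove one copy of x if present.
        if self.M[x] > 0:
            self.M[x] -= 1
            if self.M[x] == 0:
                del self.M[x]
```

Passing `"failure"` as `retval` leaves M untouched, which is how the spec admits InsertPair executions that fail only because of contention.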
Outline

Example
Refinement
    I/O-refinement
    View-refinement
The VYRD tool
Experience
Conclusions

After introducing the multiset, I’ll now explain the first notion of refinement, called I/O-refinement.
Multiset: I/O Refinement

[Figure: an Impl trace and the Spec trace derived from it. The Impl trace contains the method calls and returns (Call Insert(3), Return “success”, Call LookUp(3), Return “true”, Call Insert(4), Call Delete(3)) interleaved with low-level actions (A[0].elt=3, Unlock A[0], A[1].elt=4, Unlock A[1], read A[0], A[0].elt=null). The witness ordering is given by the commit actions: Commit Insert(3), Commit LookUp(3), Commit Insert(4), Commit Delete(3). Driven in that order, the Spec executes M = M ∪ {3}, Check 3 ∈ M, M = M ∪ {4}, M = M \ {3}, so its state goes through M = Ø, {3}, {3, 4}, {4}.]

In this slide we’ll explain how we check I/O-refinement. Again we use the Insert operation instead of InsertPair to simplify the picture. On the right you see
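The check in the figure can be sketched in a few lines. Everything here is an illustrative assumption rather than Vyrd's actual interface: the log format (tuples tagged "call", "commit", "return" with a hypothetical method id), the spec class, and the convention that Delete returns "success". The checker drives the spec in commit order and compares each spec return value with the one logged by the Impl.

```python
from collections import Counter

class InsertSpec:
    """Atomic spec for the single-element Insert/LookUp/Delete of the figure."""
    def __init__(self):
        self.M = Counter()
    def insert(self, x, retval):
        if retval == "success":
            self.M[x] += 1
        return retval
    def look_up(self, x, retval):
        return self.M[x] > 0      # observers compute their own answer
    def delete(self, x, retval):
        if self.M[x] > 0:
            self.M[x] -= 1
        return retval

def check_io_refinement(log):
    """Return the id of the first method whose return value the spec cannot
    reproduce in the witness ordering, or None if the trace passes."""
    spec = InsertSpec()
    calls = {e[1]: (e[2], e[3]) for e in log if e[0] == "call"}
    rets = {e[1]: e[2] for e in log if e[0] == "return"}
    for e in log:
        if e[0] != "commit":
            continue
        mid = e[1]
        name, arg = calls[mid]
        if getattr(spec, name)(arg, rets[mid]) != rets[mid]:
            return mid
    return None

# Hypothetical log in the spirit of the figure: (kind, method id, ...).
good = [
    ("call", 1, "insert", 3), ("call", 2, "look_up", 3),
    ("commit", 1), ("return", 1, "success"),
    ("call", 3, "insert", 4),
    ("commit", 2), ("return", 2, True),
    ("commit", 3), ("return", 3, "success"),
    ("call", 4, "delete", 3), ("commit", 4), ("return", 4, "success"),
]
```

Here `check_io_refinement(good)` reports no violation. If the logged return value of LookUp(3) were `False` instead, the spec, which has already committed Insert(3), would answer `True`, and the checker would flag method 2.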
I/O-refinement: Selecting Commit Actions

Commit points
    Determine the witness ordering
    Drive the Spec
    Hints to refinement checking tools
For each method
    Designate lines in source code
    Multiple lines may be annotated as commit
For each method execution
    Only one line should get executed as the commit action
No formal procedure
    Intuitively, where the new data structure state becomes visible to other threads

Example: InsertPair

InsertPair (x, y)
    i = FindSlot (x)
    if (i == 0)
        return failure
    j = FindSlot (y)
    if (j == 0)
        A[i].content = null
        return failure
    acquire(A[i])
    acquire(A[j])
    A[i].valid = true
    A[j].valid = true
    release(A[i])   // commit
    release(A[j])
    return success

Put IO refinement slide with commit points beforehand.

Commit points are really hints that the user gives to refinement checking tools; they help determine the witness ordering in which the spec trace is constructed. For each public method of the Impl, we designate lines in the source code whose execution corresponds to a commit action. There may be multiple lines annotated as commit. However, for each execution of a method, exactly one action must be executed as the commit action, and its execution brings the method execution to its commit point. There is no formal procedure for deciding on the commit points. Intuitively, the point where the modified state of the data structure becomes visible to other threads should be the primary candidate for a commit point. For example, the commit point for InsertPair is where the lock of A[i] is released after both x and y have been inserted into the set. Even though InsertPair changes some shared state beforehand by calling FindSlot, the elements in the allocated slots are not observed as being in the set by other threads, so the commit point is where methods run by other threads can first see x in the set.
Outline

Example
Refinement
    I/O-refinement
    View-refinement
The VYRD tool
Experience
Conclusions

Now I’ll introduce another notion of refinement, called view-refinement.
View-refinement: Need for More Observability

T1: InsertPair(5,7)    T2: InsertPair(6,8)

FINDSLOT (x)   // Buggy
    for i = 1 to n
        if (A[i].content == null)
            acquire(A[i])
            A[i].content = x
            release(A[i])
            return i
    return 0

[Figure: interleaving of T1 and T2 on array A (fields elt and valid), starting from an empty array. Both threads read A[0].elt = null. T1 stores 5 and 7 and sets their valid bits; at this point LookUp(5)=true and LookUp(7)=true. T2 then stores 6 into A[0] (Overwrites 5!), reads A[2].elt = null and stores 8; afterwards LookUp(6)=true and LookUp(8)=true. The final array holds 6, 7, 8, so a later LookUp(5) would return false.]

It would be caught if LookUp(5) got interleaved here.

I/O-refinement is still not sufficient to catch some errors that do not show up in method calls and return values. Consider the buggy implementation of FindSlot on the left. It does not lock the elements before reading their content fields; it acquires the lock just before starting the modification. As a result, as you see on the right, two concurrently executed InsertPair’s can get interleaved in such a way that both read the first slot in the array as empty and think that it is available for an insertion, but only one of them succeeds in inserting its x into this slot. In this case, the integer 5 inserted by the first thread is overwritten by the second thread. Each thread checks the InsertPair it ran by calling LookUp, after the InsertPair finishes, with the same arguments the InsertPair had. If the LookUp’s are scheduled in this way, they all return true although the final state is inconsistent with the termination status of the methods. The error is there and is observable from the state, but I/O-refinement is unable to detect it in this scenario because the LookUp’s are not scheduled in the right places.
View-refinement: I/O-refinement May Miss Errors

T1: InsertPair(5,7)    T2: InsertPair(6,8)

I/O-refinement may miss errors
    If observer methods don’t get interleaved in the right place
    Source of bug too far in the past when the I/O-refinement violation happens

[Figure: the same interleaving of T1 and T2 on array A (fields elt and valid). Both threads read A[0].elt = null; T1 stores 5 and 7; LookUp(5)=true, LookUp(7)=true; T2 stores 6 into A[0] (Overwrites 5!) and 8 into A[2]; LookUp(6)=true, LookUp(8)=true.]

Do not talk about the first bullet.

With the buggy FindSlot, the two InsertPair’s both read the first slot as empty, and the 5 inserted by the first thread is overwritten by the second. If the LookUp’s are scheduled as shown, they all return true although the final state is inconsistent with the termination status of the methods. The error is observable from the state, but I/O-refinement is unable to detect it in this scenario because the LookUp’s are not scheduled in the right places.
View-refinement: More Observability

I/O-refinement may miss errors
Our solution: view-refinement
    I/O-refinement + “correspondence” between states of Impl and Spec at commit points
    Catches state discrepancy right when it happens
    Early warnings for possible I/O-refinement violations

As we saw in the previous example with two InsertPair’s, I/O-refinement can miss refinement errors. The problem is that I/O-refinement relies on observer methods: if the observer methods do not get interleaved in the right place along the trace, I/O-refinement may miss errors. In the extreme case, if there are no observer methods, I/O-refinement trivially passes any execution. Moreover, when a refinement violation is detected, the source of the bug may be far in the past, so the trace may have to be analyzed a long way back.

Our solution is view-refinement. View-refinement augments I/O-refinement with a new condition that requires correspondence between the states of the impl and the spec at commit points. To accomplish this, we add commit actions to the set λ and label them with state information when they are executed. View-refinement catches state discrepancies right when they happen. In fact, these state discrepancies are early warnings of future I/O-refinement violations.
View-refinement: View Variables

State correspondence
    Hypothetical “view” variables must match at commit points
“view” variable
    Extracts abstract data structure state
    Updated atomically once by each method
viewImpl: state information for Impl
    For A[1..n], extract content if valid = true
viewSpec: state information for Spec
    Elements of the multiset
    viewSpec = M (nothing to abstract)
    Other Spec’s may have state to be abstracted

[Figure: array A with content and valid fields; viewImpl = {3, 3, 5, 5, 8, 8, 9}]

The state correspondence is obtained by matching the view variables of the impl and the spec at commit points. A view variable is a hypothetical variable that extracts an abstract state of the data structure. This abstract state is updated or observed atomically by each method. The view variables that carry the state information of the impl and the spec are denoted by viewImpl and viewSpec respectively. The view for the multiset data structure is the collection of integers stored in the multiset. The view variable for the multiset impl extracts the elements in content fields whose corresponding valid fields are true. Thus the view for the multiset in the figure does not contain the first 5 and the 6. The view variable for the spec gets its elements from the set M.
View-refinement: View Variables for Multiset

viewImpl: computed using the abstraction function
View is a canonical representation
    Canonizes state for view: exact match not required
Abstraction function (for checking view-refinement)
    An extra method of the data structure
    For the current data structure state, computes the current state of the view variable

AbstractionFunction (A)
    view = Ø
    for i = 1 to n
        if (A[i].content != null && A[i].valid == true)
            view = view ∪ {A[i].content}
    return view

[Figure: two arrays A with content and valid fields that hold the same elements in different orders, one of them with an extra allocated slot; for both, viewImpl = {1, 3, 5, 6}]

There may be state variables to be abstracted away in the spec. Later we will see an example in which the spec is also a program. The abstraction function is given by the user.

The view is a canonical representation of the abstract state. View computation canonizes the state, so even though the internal representations of the data structure state are different for two multiset instances, the view variable may be identical for both of them; an exact match between the data structure states is not required. For example, the views for the two representations in the figure are the same although the order of the elements is different and there are extra allocated slots. Since the spec already has an abstract representation, its state has nothing to abstract, so the view variable for a spec instance is the canonized version of the spec state. However, this does not mean a spec cannot have details to abstract. Our method accepts specs at different levels of detail, so, as we will see later, the spec can be a program that requires an abstraction function.

For the multiset example, the view variable carries information about which integers are stored in the multiset. The canonical representation of the view for the multiset discards the order of the elements. For the multiset spec you do not need to do any abstraction; the view is the entire spec state. For the multiset impl the view variable must be extracted from the data structure state. The abstraction function for the multiset impl is given on the right. It traverses the array A and abstracts away the valid fields of the elements: it includes in the view only the content fields whose valid field is true. For example, if the abstraction function traversed a multiset with extra allocated slots, like the one on the right, it would not include the elements whose valid fields are not set.
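The abstraction function above translates almost directly into code. The following is a minimal sketch in which the array is modelled as a list of (content, valid) pairs and the canonical view is a `Counter`, so that element order and unused slots disappear; both modelling choices are assumptions made for illustration, as are the two sample arrays (in particular, which slot is invalid).

```python
from collections import Counter

def abstraction_function(A):
    """Compute the view of a multiset array A, given as (content, valid) pairs.
    Only slots that hold an element and have valid == True contribute."""
    view = Counter()
    for content, valid in A:
        if content is not None and valid:
            view[content] += 1
    return view

# Two hypothetical representations: different element order, and the first
# one has an extra slot (7) that is allocated but not yet valid.
a1 = [(1, True), (3, True), (7, False), (6, True), (5, True)]
a2 = [(6, True), (1, True), (5, True), (3, True)]
```

Both arrays yield the same view, `Counter({1: 1, 3: 1, 5: 1, 6: 1})`, which is why an exact match of the underlying states is not required.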
View-refinement: Checking Refinement

[Figure: the Impl trace and Spec trace from the I/O-refinement slide, with the commit actions in the witness ordering (Commit Insert(3), Commit LookUp(3), Commit Insert(4), Commit Delete(3)) annotated with view variables. On the Impl side, after A[0].elt=3, A[1].elt=4 and A[0].elt=null, viewImpl = {3}, {3,4}, {4}. On the Spec side, after M = M ∪ {3}, M = M ∪ {4} and M = M \ {3}, viewSpec = {3}, {3,4}, {4}.]

The views are computed by running the abstraction function at these points. The checking procedure is similar to checking I/O-refinement.
View-refinement: Catching FindSlot Bug

T1: InsertPair(5,7)    T2: InsertPair(6,8)

FINDSLOT (x)   // Buggy
    for i = 1 to n
        if (A[i].content == null)
            acquire(A[i])
            A[i].content = x
            release(A[i])
            return i
    return 0

[Figure: the same interleaving of T1 and T2 on array A (fields elt and valid). Both threads read A[0].elt = null; T1 stores 5 and 7; LookUp(5)=true, LookUp(7)=true; T2 stores 6 into A[0] (Overwrites 5!) and 8 into A[2]; LookUp(6)=true, LookUp(8)=true.]

Recall the buggy FindSlot: it reads the content fields without locking and acquires the lock only just before the modification, so the 5 inserted by the first thread is overwritten by the second. With the LookUp’s scheduled as shown, they all return true, and I/O-refinement is unable to detect the error in this scenario.
View-refinement: Catching FindSlot Bug

[Figure: Implementation trace and Specification trace side by side. Implementation: Call InsertP(5,7), Call InsertP(6,8), both threads Read A[0].elt, then A[0].content=5, A[1].content=7, A[0].valid=true, A[1].valid=true, InsertP(5,7) Returns “success”, A[0].content=6, A[0].valid=true, Read A[2].elt, A[2].content=8, A[2].valid=true, InsertP(6,8) Returns “success”. Specification: M = Ø, then {5, 7} after Commit InsertPair(5,7), then {5, 6, 7, 8} after Commit InsertPair(6,8).]

                 initially    Commit InsertPair(5,7)    Commit InsertPair(6,8)
    viewSpec     Ø            {5, 7}                    {5, 6, 7, 8}
    viewImpl     Ø            {5, 7}                    {6, 7, 8}

Suppose we are checking the execution trace produced with the buggy FindSlot implementation. First we drive the spec according to the witness ordering of the commit points. Then we track the valuations of the view variables for the impl and the spec at commit points and compare them for equivalence. Here, at the commit point of the second InsertPair, 5 disappears from the view variable of the impl but it is still there in the view variable of the spec. A refinement error is then signalled, saying that an element was overwritten between the last two commit actions.
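The comparison in the table can be reproduced with a short sketch that replays a log of low-level writes and commit actions, recomputing both views at every commit. The log format and all names below are hypothetical and chosen only for this illustration; `buggy_log` hand-encodes the interleaving in which T2 overwrites the 5.

```python
from collections import Counter

def view_impl(A):
    # Abstraction function: valid, non-empty slots only.
    return Counter(s["content"] for s in A
                   if s["content"] is not None and s["valid"])

def check_view_refinement(log, n):
    """Replay `log` on an n-slot array and a spec multiset M.
    Returns the first commit at which viewImpl != viewSpec, else None."""
    A = [{"content": None, "valid": False} for _ in range(n)]
    M = Counter()
    for ev in log:
        if ev[0] == "write":
            _, i, field, value = ev
            A[i][field] = value
        elif ev[0] == "commit":
            _, x, y, retval = ev
            if retval == "success":       # drive the spec at the commit point
                M.update([x, y])
            if view_impl(A) != M:
                return ev
    return None

buggy_log = [
    ("write", 0, "content", 5), ("write", 1, "content", 7),
    ("write", 0, "valid", True), ("write", 1, "valid", True),
    ("commit", 5, 7, "success"),
    ("write", 0, "content", 6),           # T2 overwrites 5
    ("write", 2, "content", 8),
    ("write", 0, "valid", True), ("write", 2, "valid", True),
    ("commit", 6, 8, "success"),
]
```

On `buggy_log` the check returns the second commit, since the views are {6, 7, 8} and {5, 6, 7, 8} at that point; no LookUp call is needed to expose the discrepancy.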
Outline

Example
Refinement
    I/O-refinement
    View-refinement
The VYRD tool
Experience
Conclusions

After introducing the two notions of refinement, we now come to our refinement checking tool, Vyrd.
The VYRD Tool: Architecture

Instrument Impl in order to log actions in the order they happen
    Commit actions annotated by user
Write abstraction function
Log enables online/offline checking

[Figure: the test harness drives Impl, which writes to the log (entries such as Call LookUp(3), Call Insert(3), A[0].elt=3, Unlock A[0], Call Delete(3), Call Insert(4), read A[0], A[1].elt=4, Return “true”, Unlock A[1], Return “success”, A[0].elt=null). The replay mechanism reads from the log, executes the logged actions on Implreplay and runs the methods on Spec in the witness ordering. The refinement checker compares the resulting traceImpl and traceSpec.]

Vyrd analyzes execution traces of the impl generated by test programs. Vyrd uses two separate threads for this process. The testing thread runs a test harness that generates test programs. A test program makes concurrent method calls to the impl. During the run of the test program, the corresponding execution trace is recorded in a shared sequential log. The verification thread reads the execution trace from the log. Since the verification thread follows the testing thread from behind, it cannot access the instantaneous state of the impl. Thus the replaying module re-executes actions from the log on a separate instance of the impl, called Implreplay, and executes atomic methods on the spec at commit points. During replay, the replaying mechanism also computes the view variables when it reaches a commit point and annotates the commit actions along the traces with the view variables. The refinement checker module checks the resulting λ-traces of the impl and the spec for I/O- and view-refinement. These threads can run in an online or offline setting. In online checking both threads run simultaneously, while in offline checking the verification thread runs after the whole test program has finished its work.
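The two-thread arrangement described in these notes can be sketched with Python's standard `queue` and `threading` modules. This is a toy model under heavy assumptions, not Vyrd's architecture in detail: the "Impl" is a single locked list with one `insert` method, the log entries are hypothetical tuples, and each action is logged while the lock is held so that the log order matches the order in which actions happen.

```python
import queue
import threading
from collections import Counter

DONE = object()   # sentinel marking the end of the log

class LoggedImpl:
    """Toy instrumented Impl: every write and commit is appended to the log."""
    def __init__(self, log):
        self.log = log
        self.lock = threading.Lock()
        self.items = []

    def insert(self, x):
        with self.lock:
            self.items.append(x)
            self.log.put(("write", x))
            self.log.put(("commit", x, "success"))
        return "success"

def verify(log, violations):
    """Verification thread: replay the log on Impl_replay and on the Spec."""
    replay = []          # separate replay instance of the Impl state
    spec = Counter()     # Spec state
    while True:
        ev = log.get()
        if ev is DONE:
            return
        if ev[0] == "write":
            replay.append(ev[1])
        else:
            _, x, retval = ev
            if retval == "success":
                spec[x] += 1
            if Counter(replay) != spec:   # compare views at the commit point
                violations.append(ev)

def run_online(workloads):
    log = queue.Queue()                   # shared sequential log
    impl = LoggedImpl(log)
    violations = []
    checker = threading.Thread(target=verify, args=(log, violations))
    checker.start()

    def worker(values):
        for x in values:
            impl.insert(x)

    testers = [threading.Thread(target=worker, args=(w,)) for w in workloads]
    for t in testers:
        t.start()
    for t in testers:
        t.join()
    log.put(DONE)
    checker.join()
    return impl.items, violations
```

Calling `run_online([[1, 2, 3], [4, 5, 6]])` runs the checker concurrently with the test threads; an offline variant would simply start the checker after the testers have been joined.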
The VYRD Tool: Atomized Impl as Spec

Spec: atomized version of Impl
    Fully synchronized methods
    Use single global lock
    Separates checking concurrency errors from sequential verification
Slight modification
    Return value from Impl method is an additional argument to Spec methods
    More permissive than Impl
    Can handle failure return values
Exact state match at commit points not required
    Match view variables only
    Different from “commit atomicity”

INSERTPAIR (x, y, retval)
    acquire(global_lock)
    if (retval == failure)
        release(global_lock)
        return failure
    i = FindSlot (x)
    ..........
    j = FindSlot (y)
    acquire(A[i])
    acquire(A[j])
    A[i].valid = true
    A[j].valid = true
    release(A[i])
    release(A[j])
    release(global_lock)
    return success

The global lock serializes the methods; this separates checking concurrency errors from sequential verification. It is easy to use: there is no need to write a separate spec.

Our approach can employ specifications in different forms and at different levels of detail. However, it is straightforward to obtain an executable specification from the atomized version of the impl. To accomplish this, we use a single global lock to fully synchronize the method bodies. You can see the modified version of the InsertPair method for the spec. A slight modification is needed to make the methods model the nondeterministic failures of the impl that are due to concurrency: the spec methods accept, as an input parameter, the return value from the impl trace. When a method is given the return value “failure”, it does nothing, even though without this check it could complete its job with “success” as the return value. Although they use the same method implementations, the impl and the spec can have different states at a commit point. But we compare the canonized forms of the views, so an exact match between the impl and spec states is not required.
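A minimal sketch of such an atomized spec is given below. The class name, the 0-based indices and the `view` helper are assumptions made for this illustration, and the part elided on the slide is replaced here by a plain error check, which may differ from what the original does. The per-slot locks are dropped because the single global lock already serializes every method.

```python
import threading

class AtomizedSpec:
    """Sequential multiset code run under one global lock, with the Impl's
    return value passed in as an extra argument."""

    def __init__(self, n):
        self.global_lock = threading.Lock()
        self.content = [None] * n
        self.valid = [False] * n

    def _find_slot(self, x):
        for i, c in enumerate(self.content):
            if c is None:
                self.content[i] = x
                return i
        return -1

    def insert_pair(self, x, y, retval):
        with self.global_lock:             # released on every return path
            if retval == "failure":
                return "failure"           # model a contention failure: no-op
            i = self._find_slot(x)
            j = self._find_slot(y)
            if i < 0 or j < 0:
                # Assumption: the spec cannot justify a logged "success".
                raise RuntimeError("spec has no room for a successful insert")
            self.valid[i] = True
            self.valid[j] = True
            return "success"

    def view(self):
        with self.global_lock:
            return sorted(c for c, v in zip(self.content, self.valid) if v)
```

Using `with self.global_lock` ensures the lock is released on the early failure return as well as on the success path.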
Outline

Example
Refinement
    I/O-refinement
    View-refinement
The VYRD tool
Experience
Conclusions

In this part of the talk I’ll tell you about our experience using the Vyrd tool.
Experience: The Boxwood Project

[Figure: Boxwood architecture. BLINKTREE MODULE: a tree with a Root Pointer Node at the Root Level, Internal Pointer Nodes at Levels n+1 and n, Leaf Pointer Nodes at Level 0, and Data Nodes below them. CACHE MODULE: a Cache holding Dirty Cache Entries and Clean Cache Entries, serving Reads and Writes. CHUNK MANAGER MODULE: Global Disk Allocator and Replicated Disk Manager, serving Reads and Writes from the cache.]

We verified all modules of this system. We caught an interesting, tricky error that had gone undetected.

We ran Vyrd on the Boxwood project from Microsoft. The goal of the Boxwood project is to build a distributed abstract storage infrastructure for applications with high data storage and retrieval requirements. Here you see a high-level picture of Boxwood. Boxwood has a concurrent BLinkTree implementation in the BLinkTree module. The BLinkTree module uses a cache module to store and retrieve the data at its tree nodes quickly. The cache module makes its data persistent using a chunk manager module. The chunk manager implements a distributed storage system and abstracts that storage system from the upper layers.
Experience: Experimental Results

Scalable method: caught bugs in industrial-scale designs
    Boxwood (30K LOC)
    Scan Filesystem (Windows NT)
    Java Libraries with known bugs
Moderate instrumentation effort
    Several lines for each method
I/O-refinement
    Low logging and verification overhead
    BLinkTree: logging 17% over testing, refinement check 27%
View-refinement
    BLinkTree: logging 20% over testing, refinement check 137%
    More effective in catching errors
    Cache: view-refinement: 26 random methods before error
           I/O-refinement: 539 random methods before error

Remove tables. Explain why cache bug is tricky.

Here are the experimental results from applying Vyrd to the BLinkTree and cache modules. The overall results show that Vyrd can handle industrial-scale designs with modest logging and verification costs. I/O-refinement requires only method call, commit and return actions to be logged, so its logging overhead is much less than what view-refinement requires. Note that the logging overhead for view-refinement includes the logging for I/O-refinement. But view-refinement is more effective in catching bugs: the first table shows the big difference in the number of method calls made before the error is detected. The overhead of logging the actions for view-refinement may become large as the granularity of the actions gets finer.
Experience: Concurrency Bug in Cache

Very similar to bug found in Scan file system
Had not been caught by developers
    Current version does not contain bug
Bug manifestation
    Cache entry is correct
    Permanent storage has corrupted data
Cause of bug: concurrent execution of Write and Flush on the same entry
    Write to a dirty entry not locked properly
    Flush writes corrupted data to Chunk Manager
    Marks entry clean
Hard to catch through testing
    As long as Read’s hit in Cache, return value correct
    Caught through testing only if
        Cache fills, clean entry in Cache is evicted
        No “Write”s to entry in the meantime
        Entry read after eviction
    Very unlikely
Conclusions

Runtime refinement checking
    Powerful technique with reasonable computational cost
    Effective for complex industrial-scale software
Key novelty: improves observability of testing
Future work: improving coverage/controllability
    Reducing manual instrumentation by limited use of model checking
    Tayfun Elmas, Serdar Tasiran. VyrdMC: Driving Runtime Refinement Checking with Model Checkers. (To appear in) Fifth Workshop on Runtime Verification (RV'05), The University of Edinburgh, Scotland, UK, July 12, 2005.

In this talk we introduced a runtime checking technique for refinement. It is a powerful verification technique with reasonable computational cost. Although it may not be exhaustive, complex industrial-scale software can be checked effectively with our method. As future work we plan to improve the coverage of testing by using model checkers to explore the interleavings more systematically. Our current work in this direction will appear at the Runtime Verification workshop, RV 2005.
Questions

VYRD: VerifYing Concurrent Programs by Runtime Refinement-Violation Detection

Tayfun Elmas, Serdar Tasiran
College of Engineering, Koç University, Istanbul, Turkey
Shaz Qadeer
Microsoft Research, Redmond, U.S.