Predicting problems caused by component upgrades This is a technique for helping programmers to cope with inadequately documented programs. Michael Ernst MIT Lab for Computer Science Joint work with Stephen McCamant
An upgrade problem You are a happy user of Stucco Photostand You install Microswat Monopoly Photostand stops working Why? Step 2 upgraded winulose.dll Photostand is not compatible with the new version
Upgrade safety System S uses component C A new version C’ is released Might C’ cause S to misbehave? (This question is undecidable.)
Previous solutions Integrate new component, then test Resource-intensive Vendor tests new component Impossible to anticipate all uses User, not vendor, must make upgrade decision (We require this) Static analysis to guarantee identical or subtype behavior Difficult and inadequate
Behavioral subtyping Subtyping guarantees type compatibility No information about behavior Behavioral subtyping [Liskov 94] guarantees behavioral compatibility Provable properties of supertype are provable about subtype Operates on human-supplied specifications Ill-matched to the component upgrade problem
Behavioral subtyping is too strong and too weak Too strong: OK to change APIs that the application does not call … or other aspects of APIs that are not depended upon Too weak: Application may depend on implementation details Example: Component version 1 returns elements in order Application depends on that detail Component version 2 returns elements in a different order Who is at fault in this example? It doesn’t matter!
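The ordering example can be made concrete with a small sketch. Everything here is hypothetical and only for illustration: elementsV1 and elementsV2 stand in for two versions of a component, and smallest is application code that quietly relies on the order version 1 happens to produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class OrderDependenceDemo {
    // Hypothetical component, version 1: happens to return elements in ascending order.
    static List<Integer> elementsV1(List<Integer> stored) {
        List<Integer> out = new ArrayList<>(stored);
        Collections.sort(out);
        return out;
    }

    // Hypothetical component, version 2: same elements, but in descending order.
    static List<Integer> elementsV2(List<Integer> stored) {
        List<Integer> out = new ArrayList<>(stored);
        out.sort(Collections.reverseOrder());
        return out;
    }

    // Application code that silently assumes the first element is the minimum.
    static int smallest(List<Integer> elements) {
        return elements.get(0);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 1, 2);
        System.out.println(smallest(elementsV1(data))); // prints 1
        System.out.println(smallest(elementsV2(data))); // prints 3
    }
}
```

Both versions could satisfy a specification that says only "returns the stored elements", yet the application's result changes.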
Features of our solution Application-specific Can warn before integrating, testing Minimal disruption to the development process Requires no source code Requires no formal specification Warns regardless of who is at fault Accounts for internal and external behaviors Caveat emptor: no guarantee of (in)compatibility!
Run-time behavior comparison Compare run-time behaviors of components Old component, in context of the application New component, in context of vendor test suite Compatible if the vendor tests all the functionality that the application uses Consider comparing test suites “Behavioral subtesting” Vendor tests must produce the same results, of course!
Reasons for behavioral differences Differences between application and test suite use of component require human judgment True incompatibility Change in behavior might not affect application Change in behavior might be a bug fix Vendor test suite might be deficient It may be possible to work around the incompatibility
Operational abstraction Abstraction of run-time behavior of component Set of program properties – mathematical statements about component behavior Syntactically identical to formal specification
Dynamic invariant detection Goal: recover invariants from programs Technique: run the program, examine values Artifact: Daikon Experiments demonstrate accuracy, usefulness
Goal: recover invariants Detect invariants (as in asserts or specifications) x > abs(y) x = 16*y + 4*z + 3 array a contains no duplicates for each node n, n = n.child.parent graph g is acyclic if ptr ≠ null then *ptr > i
Uses for invariants Write better programs [Gries 81, Liskov 86] Document code Check assumptions: convert to assert Maintain invariants to avoid introducing bugs Locate unusual conditions Validate test suite: value coverage Provide hints for higher-level profile-directed compilation [Calder 98] Bootstrap proofs [Wegbreit 74, Bensalem 96] [This is a laundry list of potential uses for invariants. I'm trying to motivate why one would care about invariants.] Better programs when invariants are used in their design. But here I will focus on uses of invariants in already-constructed programs. Documentation is guaranteed to be up-to-date; for assert, is machine-checkable. Invariants established at one point may be depended upon elsewhere. This was the original motivation for the system. There’s evidence that maintenance introduces many bugs (and anecdotal evidence that some are due to violating invariants). The next four are arguably more useful with dynamic than with static invariants: Unusual (exceptional) conditions may indicate bugs or special cases that should be brought to a programmer’s attention. (But finding bugs isn’t my focus or what I consider the tool’s strength, though it was effective at that task. I would rather prevent a bug than have to find it any day.) Value coverage may be inadequate even if a test suite covers every line of a program. [Example: only small values.] Use invariants like profile-directed compilation, but at a higher and possibly more valuable level (cf “Value Profiling and Optimization” by Calder et al). Bootstrap proofs: or, formally prove the invariants. In theorem-proving, many consider it harder to determine what to prove than to actually do the proof. [Don’t say “’turn the crank”; it belittles theorem-proving.] [If you prove the property, the system becomes sound: but probably don’t raise this red flag yet.] Now, I’ll try to convince you that this is at least possible.
Ways to obtain invariants Programmer-supplied Static analysis: examine the program text [Cousot 77, Gannod 96] properties are guaranteed to be true pointers are intractable in practice Dynamic analysis: run the program complementary to static techniques [Joke] I have heard that some programs lack exhaustively-specified invariants; we’ll assume for the rest of the talk that we have such invariants. Claim: programmers have these in mind, consciously or unconsciously, when they write code. We want access to that hidden part of the design space. We might want to check/refine/strengthen programmer-supplied invariants, and the program may be modified/maintained. (Leveson has noted that program self-checks are often ineffective.) May want to check/strengthen programmer-supplied assertions. Static analysis: For instance, abstract interpretation. The results of a conservative, sound analysis are guaranteed to be true. Static techniques can’t report true but undecidable properties. (or properties of a program context) Pointers: problematic to represent program state and the heap; must approximate, resulting in too-weak results. Dynamic techniques are complementary to static ones.
Dynamic invariant detection Look for patterns in values the program computes: Instrument the program to write data trace files Run the program on a test suite Invariant engine reads data traces, generates potential invariants, and checks them [No setup needed; dive right into this.] [Remain excited! (Here and on next slide.)] All of these steps are automatic. Instrumenters are source-to-source translators. Define the front end here, or say “instrumenter” instead of “front end”. Possibly define back end here; but I think I don’t use that term.
Checking invariants For each potential invariant: instantiate (determine constants like a and b in y = ax + b) check for each set of variable values stop checking when falsified This is inexpensive: many invariants, each cheap All three steps are inexpensive: instantiation, checking, and number of checks (most checks fail quickly). There are many invariants, but each one is cheap. Can think of the property as a generalization from the first few values; then check the rest of the values. Example: 2 degrees of freedom in y=ax+b, so two (x,y) pairs uniquely determine a and b. (I don’t actually do it that way for reasons of numerical stability, but that is the idea.)
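A minimal sketch of the instantiate-then-check idea for the y = ax + b example. This is not Daikon's implementation (the notes above say the real tool avoids this approach for numerical stability). The class name, the integer-only restriction, and the assumption that the first two samples have distinct x values are all simplifications made for illustration.

```java
final class LinearInvariantSketch {
    /**
     * Each sample is {x, y}. Returns {a, b} if y = a*x + b holds for every
     * sample, or null if the candidate cannot be instantiated or is falsified.
     */
    static long[] detect(long[][] samples) {
        if (samples.length < 2) return null;          // too few values to instantiate
        long x0 = samples[0][0], y0 = samples[0][1];
        long x1 = samples[1][0], y1 = samples[1][1];
        if (x1 == x0) return null;                    // simplification: need distinct x values
        if ((y1 - y0) % (x1 - x0) != 0) return null;  // simplification: integer slope only
        long a = (y1 - y0) / (x1 - x0);               // instantiate the constants
        long b = y0 - a * x0;
        for (int k = 2; k < samples.length; k++) {    // check the remaining values
            if (samples[k][1] != a * samples[k][0] + b) {
                return null;                          // falsified: stop checking
            }
        }
        return new long[] {a, b};
    }

    public static void main(String[] args) {
        long[] ab = detect(new long[][] {{1, 5}, {2, 7}, {4, 11}});
        System.out.println("y = " + ab[0] + "*x + " + ab[1]); // prints y = 2*x + 3
        System.out.println(detect(new long[][] {{1, 5}, {2, 7}, {4, 12}}) == null); // prints true
    }
}
```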
Improving invariant detection Add desired invariants: implicit values, unused polymorphism Eliminate undesired invariants: unjustified properties, redundant invariants, incomparable variables Traverse recursive data structures Conditionals: compute invariants over subsets of data (if x>0 then yz) [Make it clear that this is all completed work; I have solved these problems.] A naive implementation wouldn’t (and didn’t) produce that exact output. The simplified picture I have given you so far isn’t enough; it wouldn’t produce all the output you saw in the initial examples, and it would produce other undesirable output. I will now give some additional details. Discuss what relevance is. Unused polymorphism: overly wide types. I will discuss each of these in turn. Collectively/overall, how much improvement? I can’t say, because these are not just quantitative but qualitative improvements.
Testing upgrade compatibility User computes operational abstraction of old component, in context of application Vendor computes operational abstraction of new component, over its test suite Vendor supplies operational abstraction along with new component User compares operational abstractions OAapp for old component OAtest for new component
New operational abstraction must be stronger
Approximate test: OAtest ⇒ OAapp
OA consists of precondition and postcondition
Per behavioral subtyping: Preapp ⇒ Pretest and Posttest ⇒ Postapp
Sufficient, but not necessary
[Diagram: Application (Preapp, Postapp) beside Test suite (Pretest, Posttest); labels “goal” and “known”]
Define “precondition”?
Comparing operational abstractions
Sufficient but not necessary: Preapp ⇒ Pretest and Posttest ⇒ Postapp
Sufficient and necessary: Preapp & Posttest ⇒ Postapp
Example (increment component):
  Application: precondition: x is even; postcondition: x’ = x + 1, x’ is odd
  Test suite: precondition: x is an integer; postcondition: x’ = x + 1
Introduce the x’ notation.
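The increment example can be checked by brute force over a small range. This is only an illustrative sketch: the class, interface, and constant names are made up, the assignment of "x’ is odd" to the application side is one reading of the slide, and bounded enumeration is not a proof (the implementation status slide says the real comparison uses a theorem prover).

```java
final class IncrementComparison {
    interface Pre { boolean holds(int x); }
    interface Post { boolean holds(int x, int xNew); }

    // Checks (pre(x) && premise(x, x')) => conclusion(x, x') for all enumerated pairs.
    static boolean implies(Pre pre, Post premise, Post conclusion) {
        for (int x = -20; x <= 20; x++) {
            for (int xNew = -21; xNew <= 21; xNew++) {
                if (pre.holds(x) && premise.holds(x, xNew) && !conclusion.holds(x, xNew)) {
                    return false; // counterexample found
                }
            }
        }
        return true;
    }

    static final Pre ANY = x -> true;                          // no restriction on x
    static final Pre PRE_APP = x -> x % 2 == 0;                // x is even
    static final Post POST_TEST = (x, xNew) -> xNew == x + 1;  // x' = x + 1
    static final Post POST_APP = (x, xNew) -> xNew % 2 != 0;   // x' is odd

    public static void main(String[] args) {
        // Posttest alone does not imply Postapp (for example x = 1, x' = 2).
        System.out.println(implies(ANY, POST_TEST, POST_APP));     // prints false
        // Adding Preapp as a premise makes the implication hold.
        System.out.println(implies(PRE_APP, POST_TEST, POST_APP)); // prints true
    }
}
```

The weaker-looking test succeeds here because the application only ever calls increment on even values.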
Sorting application
// Sort the argument into ascending order
static void bubble_sort(int[] a) {
  for (int x = a.length - 1; x > 0; x--) {
    // Compare adjacent elements in a[0..x]
    for (int y = 0; y < x; y++) {
      if (a[y] > a[y+1])
        swap(a, y, y+1);
    }
  }
}
Swap component
// Exchange the two array elements at i and j
static void swap(int[] a, int i, int j) {
  int temp = a[i];
  a[i] = a[j];
  a[j] = temp;
}
There is a “serious” problem with this implementation: it uses extra space!
Upgrade to swap component
// Exchange the two array elements at i and j
static void swap(int[] a, int i, int j) {
  a[i] ^= a[j];
  a[j] ^= a[i];
  a[i] ^= a[j];
}
This upgrade uses less temporary space!
Compare abstractions
Preconditions
  bubble_sort application: a != null; 0 <= i < size(a[])-1; 1 <= j <= size(a[])-1; i < j; j == i + 1; a[i] == a[j-1]; a[i] > a[j]
  swap test suite: a != null; 0 <= i <= size(a[])-1; 0 <= j <= size(a[])-1; i != j
Postconditions
  bubble_sort application: a’[i] == a[j]; a’[j] == a[i]; a’[i] == a’[j-1]; a’[j] == a[j-1]; a’[i] < a’[j]
  swap test suite: a’[i] == a[j]; a’[j] == a[i]
Compare abstractions: Preapp ⇒ Pretest (same bubble_sort and swap test suite abstractions as on the previous slide)
Compare abstractions: Preapp & Posttest ⇒ Postapp (same abstractions)
Compare abstractions: Upgrade succeeds (same abstractions)
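The precondition step for bubble_sort can be sanity-checked by enumeration. This sketch covers only the index-related properties (the element-value properties such as a[i] > a[j] are omitted), and the class and method names are invented for illustration. It is a bounded search, not the theorem-prover check the talk describes.

```java
final class PreconditionCheck {
    // Index-related part of the precondition observed for swap inside bubble_sort.
    static boolean preApp(int n, int i, int j) {
        return 0 <= i && i < n - 1 && 1 <= j && j <= n - 1 && i < j && j == i + 1;
    }

    // Index-related part of the precondition observed over the swap test suite.
    static boolean preTest(int n, int i, int j) {
        return 0 <= i && i <= n - 1 && 0 <= j && j <= n - 1 && i != j;
    }

    // Searches arrays of size 0..maxSize for a call allowed by preApp but not by preTest.
    // Returns {n, i, j} for the first such call, or null if none is found.
    static int[] counterexample(int maxSize) {
        for (int n = 0; n <= maxSize; n++) {
            for (int i = -1; i <= n; i++) {
                for (int j = -1; j <= n; j++) {
                    if (preApp(n, i, j) && !preTest(n, i, j)) {
                        return new int[] {n, i, j};
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(counterexample(8) == null
                ? "no counterexample found"
                : "counterexample found");
    }
}
```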
Another sorting application
// Sort the argument into ascending order
static void selection_sort(int[] a) {
  for (int x = 0; x <= a.length - 2; x++) {
    // Find the smallest element in a[x..]
    int min = x;
    for (int y = x; y < a.length; y++) {
      if (a[y] < a[min])
        min = y;
    }
    swap(a, x, min);
  }
}
Compare abstractions
Preconditions
  selection_sort application: a != null; 0 <= i < size(a[])-1; i <= j <= size(a[])-1; a[i] >= a[j]
  Test suite: a != null; 0 <= i <= size(a[])-1; 0 <= j <= size(a[])-1; i != j
Postconditions
  selection_sort application: a’[i] == a[j]; a’[j] == a[i]; a’[i] <= a’[j]
  Test suite: a’[i] == a[j]; a’[j] == a[i]
Compare abstractions: Upgrade fails (same selection_sort and test suite abstractions as on the previous slide; Preapp ⇒ Pretest does not hold, because the application's i <= j permits i == j while the test suite only observed i != j)
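One way to see what the warning is protecting against: selection_sort can call swap with i == j, and a three-step XOR swap zeroes the element in that case. The sketch below assumes the upgraded component is the usual three-step XOR swap. The class name and the camel-case method names are chosen for illustration.

```java
import java.util.Arrays;

final class XorSwapDemo {
    // Assumed form of the upgraded component: three-step XOR swap.
    // Correct when i != j; when i == j the first step sets a[i] to 0.
    static void xorSwap(int[] a, int i, int j) {
        a[i] ^= a[j];
        a[j] ^= a[i];
        a[i] ^= a[j];
    }

    // Selection sort as on the slide, calling the XOR-based swap.
    static void selectionSort(int[] a) {
        for (int x = 0; x <= a.length - 2; x++) {
            int min = x;
            for (int y = x; y < a.length; y++) {
                if (a[y] < a[min]) min = y;
            }
            xorSwap(a, x, min); // min == x whenever a[x] is already the smallest
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 2};
        selectionSort(a);
        System.out.println(Arrays.toString(a)); // prints [0, 2, 3], not [1, 2, 3]
    }
}
```

The bubble_sort application never hits this case because it only swaps adjacent elements (j == i + 1).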
Currency case study Application: Math-Currency Component: Math-BigInt (versions 1.40, 1.42) Both from Comprehensive Perl Archive Network Our technique is needed: a wrong version of BigInt induces two errors in Currency
Downgrade from BigInt 1.42 to 1.40 (Why downgrade? Fix bugs, porting.) Inconsistency is discovered: In 1.42, bcmp returns -1, 0, or 1 In 1.40, bcmp returns any integer Do not downgrade without further examination Application might do (a <=> b) == (c <=> d) (This change is not reflected in the documentation.)
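A sketch of why the range of the comparison result matters for an expression like (a <=> b) == (c <=> d). The two comparison functions below are Java stand-ins invented for illustration, not the Perl Math::BigInt code: NORMALIZED returns only -1, 0, or 1, and ANY_INTEGER returns some integer with the right sign.

```java
import java.util.function.IntBinaryOperator;

final class CompareResultDemo {
    // Stand-in for a comparison that returns only -1, 0, or 1.
    static final IntBinaryOperator NORMALIZED =
            (a, b) -> a < b ? -1 : (a > b ? 1 : 0);

    // Stand-in for a comparison that may return any integer with the right sign.
    // Subtraction can overflow for large inputs; the example uses small values only.
    static final IntBinaryOperator ANY_INTEGER = (a, b) -> a - b;

    // Application-style check: are a and b ordered the same way as c and d?
    static boolean sameOrder(IntBinaryOperator cmp, int a, int b, int c, int d) {
        return cmp.applyAsInt(a, b) == cmp.applyAsInt(c, d);
    }

    public static void main(String[] args) {
        // 1 < 5 and 2 < 3, so the pairs are ordered the same way.
        System.out.println(sameOrder(NORMALIZED, 1, 5, 2, 3));  // prints true
        System.out.println(sameOrder(ANY_INTEGER, 1, 5, 2, 3)); // prints false (-4 vs -1)
    }
}
```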
Upgrade from BigInt 1.40 to 1.42 Inconsistency: In 1.40, bcmp($1.67, $1.75) returns 0 In 1.42, bcmp($1.67, $1.75) returns -1 Our system did not discover this property … … but it discovered differences in behavior of other components that interacted with it Do not upgrade without further examination The difference comes from a change to the mantissa during the right-shift that truncated to whole-dollar values. We don’t know whether this change introduces or fixes a bug.
Getting to Yes: Limits of the technique Rejecting an upgrade is easier than approving it Application postconditions may be hard to prove Can check the reason for the rejection Key problem is limits of the theorem prover Adjust grammar of operational abstractions Stronger or weaker properties may be provable Weak properties may depend on strong ones
Implementation status Operational abstractions are automatically generated (by the Daikon invariant detector) In Currency case study, operational abstractions were compared by hand Operational abstractions are automatically compared (by the Simplify theorem prover) Requires background theory for each property
Contributions New technique for early detection of upgrade problems Compares run-time behavior of old & new components Technique is Application-specific Pre-integration Lightweight Source-free Specification-free Blame-neutral Output-independent Unvalidated
