Mining Specifications
Glenn Ammons, Dept. of Computer Science, University of Wisconsin
Rastislav Bodik, Computer Science Division, University of California, Berkeley
James R. Larus, Microsoft Research
POPL 2002

Motivation
Formal verification is a promising alternative to software testing.
But verifiers are of little use without enough correctness specifications to verify.

The Assumption
Common behavior is (often) correct behavior.
If we can identify common behavior, we can produce correct specifications, even from programs that contain errors.

A Program Using the Socket API
    int s = socket(AF_INET, SOCK_STREAM, 0);
    …
    bind(s, &serv_addr, sizeof(serv_addr));
    …
    listen(s, 5);
    …
    while (1) {
        int ns = accept(s, &addr, &len);
        if (ns < 0) break;
        do {
            read(ns, buffer, 255);
            …
            write(ns, buffer, size);
            if (cond1) return;
        } while (cond2);
        close(ns);
    }
    close(s);

An Example Trace
 1 socket(domain = 2, type = 1, proto = 0, return = 7)
 2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)
 3 listen(so = 7, backlog = 5, return = 0)
 4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)
 5 read(fd = 8, buf = 0x400320, len = 255, return = 12)
 6 write(fd = 8, buf = 0x400320, len = 12, return = 12)
 7 read(fd = 8, buf = 0x400320, len = 255, return = 7)
 8 write(fd = 8, buf = 0x400320, len = 7, return = 7)
 9 close(fd = 8, return = 0)
10 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 10)
11 read(fd = 10, buf = 0x400320, len = 255, return = 13)
12 write(fd = 10, buf = 0x400320, len = 13, return = 13)
13 close(fd = 10, return = 0)
14 close(fd = 7, return = 0)

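Each event in the trace has the shape name(attribute = value, ...). As a minimal sketch of how such a line could be turned into a record for later analysis, the Python below parses one event. The function name `parse_event` and the (name, attributes) record layout are assumptions made for illustration; the slides do not describe the tracer's actual output format beyond the printed lines.

```python
import re

# Optional leading event number, then name(attr = value, ...).
EVENT_RE = re.compile(r"^\s*(?:\d+\s+)?(\w+)\((.*)\)\s*$")

def parse_event(line):
    """Parse one trace line into (name, attrs); values are kept as strings."""
    match = EVENT_RE.match(line)
    if match is None:
        raise ValueError(f"not a trace event: {line!r}")
    name, body = match.groups()
    attrs = {}
    for field in body.split(","):
        if field.strip():
            key, _, value = field.partition("=")
            attrs[key.strip()] = value.strip()
    return name, attrs

event = parse_event(
    "4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)"
)
```

Keeping values as strings avoids guessing whether an attribute is a descriptor, a length, or an address; the later stages only compare values for equality.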
Design Decisions
1. Learn from traces, not from source: traces contain fewer bugs.
2. Take a “vote” on what the common program behavior is: the high-probability core encodes the frequently followed protocol.

Mining System
[Figure: pipeline of the mining system. Program → Tracer → Instrumented program → Run (with test inputs) → Traces → Flow dependence annotator → Annotated traces → Scenario extractor (with scenario seed) → Abstract scenario strings → Automaton learner → Specifications]

The (unsolvable) Problem
I - the set of all traces of interaction with an API or ADT.
C ⊆ I - the set of all correct traces of interaction.
T - an unlabelled training set of interaction traces.
Find an automaton A that generates exactly the traces in C.

Restriction 1
C must be a regular language.
– Model checkers require finite-state specifications.
– Algorithms for learning finite-state automata are relatively well developed.

Interaction Scenarios
[Figure: a trace of LinkedList(n) interleaves malloc and free calls on n objects O_1, O_2, …, O_n.]
Split by object, the trace yields one scenario per object:
O_1: malloc(return = O_1) free(p = O_1)
O_2: malloc(return = O_2) free(p = O_2)
O_n: malloc(return = O_n) free(p = O_n)
After renaming each object to a standard name, all scenarios are identical:
O_1: malloc(return = O_std) free(p = O_std)
O_2: malloc(return = O_std) free(p = O_std)
O_n: malloc(return = O_std) free(p = O_std)

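The splitting and renaming on this slide can be sketched in a few lines of Python. This is an illustration only: the (call, object) pair encoding and the function name `per_object_scenarios` are assumptions, not part of the mining system described in the talk.

```python
def per_object_scenarios(trace):
    """Group calls by the object they touch and rename it to O_std."""
    scenarios = {}
    for call, obj in trace:
        scenarios.setdefault(obj, []).append((call, "O_std"))
    # Dicts keep insertion order, so scenarios come out by first appearance.
    return list(scenarios.values())

# A hypothetical two-object trace with interleaved allocations.
trace = [("malloc", "O1"), ("malloc", "O2"), ("free", "O2"), ("free", "O1")]
scenarios = per_object_scenarios(trace)
```

However many objects the program allocates, every scenario collapses to the same short string, which is what makes a finite-state description possible.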
The Problem – Take 2
I_S - the set of all interaction scenarios with an API or ADT that manipulate no more than k data objects.
C_S ⊆ I_S - the regular set of all correct scenarios.
T_S - an unlabelled training set of interaction scenarios from I_S.
Find a finite-state automaton A_S that generates exactly the scenarios in C_S.

Restriction 2 - Linking T_S and C_S
Let T_S = c_0, c_1, … be an infinite sequence of elements from C_S in which each element of C_S occurs at least once.
For each n > 0, the learner maps the prefix c_0, c_1, …, c_n to an automaton A_S^n.
If, for some N ≥ 0, A_S^N generates exactly the scenarios in C_S and A_S^n = A_S^N for all n ≥ N, then A_S^0, A_S^1, … identifies C_S in the limit.

The Probabilistic Approach
I_S - as before.
M - a target PFSA (probabilistic finite-state automaton); P_M - the distribution over I_S that M generates.
“Efficiently” find a PFSA M' such that its distribution P_M' is an ε-good approximation of P_M.

Tracer
1. C stdio replacement (requires recompilation)
2. Executable editing
Example output:
1 socket(domain = 2, type = 1, proto = 0, return = 7)
2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)
3 listen(so = 7, backlog = 5, return = 0)
4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)
Skeleton: interaction(attribute_0, …, attribute_n)

Flow Dependence
[Figure: Traces → Dependence analysis → Untyped trace with dependencies → Type inference → Annotated traces]

Dependence Analysis
Takes a manually created list of attributes that define or use objects.
Creates a flow dependence between users and definers.
Definers: socket.return, bind.so, listen.so, accept.return, close.fd
Users: bind.so, listen.so, accept.so, read.fd, write.fd, close.fd

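One way to picture this step is the Python sketch below, which links each use of a value to the most recent earlier definition of the same value. That "most recent definer" rule is an assumption made for illustration; the slide only says that dependences are created between users and definers. The function name `flow_dependences` and the (name, attributes) event encoding are likewise hypothetical.

```python
DEFINERS = {("socket", "return"), ("bind", "so"), ("listen", "so"),
            ("accept", "return"), ("close", "fd")}
USERS = {("bind", "so"), ("listen", "so"), ("accept", "so"),
         ("read", "fd"), ("write", "fd"), ("close", "fd")}

def flow_dependences(trace):
    """Return edges ((definer index, attr), (user index, attr))."""
    last_def = {}  # value -> (event index, attribute) of the latest definer
    edges = []
    for index, (name, attrs) in enumerate(trace):
        # Record uses before definitions, so an attribute that both uses and
        # defines (such as bind.so) depends on the earlier definer, not itself.
        for attr, value in attrs.items():
            if (name, attr) in USERS and value in last_def:
                edges.append((last_def[value], (index, attr)))
        for attr, value in attrs.items():
            if (name, attr) in DEFINERS:
                last_def[value] = (index, attr)
    return edges

# A shortened version of the example trace, keeping only object attributes.
trace = [
    ("socket", {"return": 7}),
    ("bind", {"so": 7}),
    ("listen", {"so": 7}),
    ("accept", {"so": 7, "return": 8}),
    ("read", {"fd": 8}),
    ("close", {"fd": 8}),
]
edges = flow_dependences(trace)
```

Here the accept event uses descriptor 7 through `so` and defines descriptor 8 through `return`, so it connects the listening socket's events to the connection's events.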
Type Inference
If a flow dependence exists between two attributes, typing gives them the same type.
Type(socket.return) = T0
Type(bind.so) = T0
Type(listen.so) = T0
Type(accept.so) = T0
Type(accept.return) = T0
Type(read.fd) = T0
Type(write.fd) = T0
Type(close.fd) = T0

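The typing rule amounts to merging attributes into equivalence classes. A minimal sketch, assuming a union-find structure (a choice made here for illustration, not something the slides specify) and a hypothetical list of dependences between attribute names:

```python
def infer_types(dependences):
    """Give attributes linked by a flow dependence the same type name."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for definer, user in dependences:
        parent[find(definer)] = find(user)

    names = {}  # class representative -> type name, in order of appearance
    types = {}
    for item in parent:
        root = find(item)
        names.setdefault(root, f"T{len(names)}")
        types[item] = names[root]
    return types

# Hypothetical dependences consistent with the socket example.
dependences = [
    ("socket.return", "bind.so"),
    ("bind.so", "listen.so"),
    ("listen.so", "accept.so"),
    ("accept.return", "read.fd"),
    ("accept.return", "write.fd"),
    ("accept.return", "close.fd"),
    ("listen.so", "close.fd"),  # close is also applied to the listening socket
]
types = infer_types(dependences)
```

The last dependence is what merges the listening socket and the accepted connections into the single type T0: close.fd is used on both kinds of descriptor.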
Scenario Extraction
[Figure: Annotated traces and scenario seeds → Extraction → scenarios → Simplification → simplified scenarios → Standardization → abstract scenario strings]

Extraction
A scenario is a set of interactions related by flow dependences.
1 socket(domain = 2, type = 1, proto = 0, return = 7)
2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)
3 listen(so = 7, backlog = 5, return = 0)
4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)
5 read(fd = 8, buf = 0x400320, len = 255, return = 12)
6 write(fd = 8, buf = 0x400320, len = 12, return = 12)
7 read(fd = 8, buf = 0x400320, len = 255, return = 7)
8 write(fd = 8, buf = 0x400320, len = 7, return = 7)
9 close(fd = 8, return = 0)

Simplification
Eliminate all interaction attributes that do not carry a flow dependence.
1 socket(return = 7)
2 bind(so = 7)
3 listen(so = 7)
4 accept(so = 7, return = 8) [seed]
5 read(fd = 8)
6 write(fd = 8)
7 read(fd = 8)
8 write(fd = 8)
9 close(fd = 8)

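Simplification is a filter over attributes. The Python sketch below keeps only attributes that appear in a given set of dependence-carrying attributes; the name `simplify`, the `CARRIERS` set, and the event encoding are assumptions for illustration.

```python
# Attributes that define or use objects in the socket example.
CARRIERS = {("socket", "return"), ("bind", "so"), ("listen", "so"),
            ("accept", "so"), ("accept", "return"),
            ("read", "fd"), ("write", "fd"), ("close", "fd")}

def simplify(scenario, carriers=CARRIERS):
    """Drop every attribute that does not carry a flow dependence."""
    return [
        (name, {a: v for a, v in attrs.items() if (name, a) in carriers})
        for name, attrs in scenario
    ]

scenario = [
    ("accept", {"so": 7, "addr": 0x400200, "addr_len": 0x400240, "return": 8}),
    ("read", {"fd": 8, "buf": 0x400320, "len": 255, "return": 12}),
]
simplified = simplify(scenario)
```

Buffer addresses, lengths, and byte counts vary from run to run, so removing them leaves only the part of each interaction that identifies which object it touches.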
Standardization
1. Naming: replaces attribute values with symbolic variables.
2. Reordering
1 socket(return = x0:T0)
2 bind(so = x0:T0)
3 listen(so = x0:T0)
4 accept(so = x0:T0, return = x1:T0) [seed]
5 read(fd = x1:T0)
7 read(fd = x1:T0)
6 write(fd = x1:T0)
8 write(fd = x1:T0)
9 close(fd = x1:T0)
[Figure labels: (A) (B) (C) (D) (E) (F) (G)]

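The naming step can be sketched as follows: concrete values are replaced by variables x0, x1, … numbered in order of first appearance. This Python fragment covers naming only, not reordering, and it assumes a single type T0 as in the socket example; the function name `name_objects` is hypothetical.

```python
def name_objects(scenario, type_name="T0"):
    """Replace concrete attribute values with symbolic, typed variables."""
    names = {}  # concrete value -> variable name
    renamed_scenario = []
    for name, attrs in scenario:
        renamed = {}
        for attr, value in attrs.items():
            names.setdefault(value, f"x{len(names)}")
            renamed[attr] = f"{names[value]}:{type_name}"
        renamed_scenario.append((name, renamed))
    return renamed_scenario

scenario = [
    ("socket", {"return": 7}),
    ("bind", {"so": 7}),
    ("accept", {"so": 7, "return": 8}),
    ("read", {"fd": 8}),
]
named = name_objects(scenario)
```

Two scenarios that differ only in descriptor numbers map to the same symbolic string after this step, so the learner sees them as repetitions of one behavior.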
Automaton Learning
1. An off-the-shelf (OTS) learner learns a PFSA.
2. A corer removes infrequently traversed edges and converts the PFSA into an NFA.
[Figure: automaton with start and final states]

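To make the corer's role concrete, here is a rough Python sketch that drops edges below a traversal-count threshold and forgets the counts, leaving an NFA. The fixed threshold is a stand-in chosen for illustration; the slides do not state the criterion the actual corer uses. The edge counts, state names, and the functions `core` and `accepts` are all hypothetical.

```python
def core(pfsa_counts, min_count):
    """Keep edges traversed at least min_count times; return NFA transitions."""
    nfa = {}
    for (src, label, dst), count in pfsa_counts.items():
        if count >= min_count:
            nfa.setdefault((src, label), set()).add(dst)
    return nfa

def accepts(nfa, start, finals, labels):
    """Simulate the NFA on a sequence of labels."""
    states = {start}
    for label in labels:
        states = {d for s in states for d in nfa.get((s, label), ())}
    return bool(states & finals)

# Made-up traversal counts: a rare edge allows a second malloc before free.
counts = {
    ("start", "malloc", "q1"): 100,
    ("q1", "free", "final"): 98,
    ("q1", "malloc", "q1"): 2,
}
nfa = core(counts, min_count=10)
ok = accepts(nfa, "start", {"final"}, ["malloc", "free"])
rare = accepts(nfa, "start", {"final"}, ["malloc", "malloc", "free"])
```

With the rare edge removed, the common malloc-then-free scenario is still accepted while the infrequent variant is rejected, which is the "vote" on common behavior in miniature.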
Specification Automaton for the Socket Protocol
[Figure: automaton with edges labelled socket(return = x), bind(so = x), listen(so = x), accept(so = x, return = y), read(fd = y), write(fd = y), close(fd = y), close(fd = x)]

Experimental Results
Analyzed traces from programs that use the Xlib and X Toolkit Intrinsics libraries for the X11 windowing system.
Traces were generated manually.
Compared the mined specification to the rules in the Interclient Communication Conventions Manual (ICCCM).

Experimental Results
A small and buggy training set prevented the miner from discovering the rule.
Solution: an expert chooses correct traces as the training set.

Benefits
Exploits the massive programmer effort that is reflected in the code (and nowhere else).
Offers convenience and insights.
It is easier to approve a mined formal specification than to write one.

Conclusion
Introduced a semi-automatic machine-learning approach for discovering formal specifications.
Reduced the problem to learning regular languages.
Initial experience is promising.