ESP: Program Verification Of Millions of Lines of Code

Manuvir Das, Researcher, PPRC Reliability Team, Microsoft Research

Motivation
No Buffer Overruns! No Resource Leaks! No Privilege Misuse!
Don't like this box; want new packaging. Many OSs have these problems. We know Windows doesn't. We would like to verify the code and prove it, so we can put rubber stamps on the box. We're not done yet; otherwise we'd be on a beach. We think we've invented the technology that will get us there.

Approach
Redundancy is good
Redundancy exposes inconsistency
Inconsistency points to errors
Compare what the programmer should do with what her code actually does

Lightweight specifications
Rules: describe correct behavior; readable/writable by programmers; specify limited properties, not total correctness/verification
Compare rules against code

Types are rules
Programmers use types to: document interface syntax; represent program abstractions
Types are written, read and checked: a routine part of the development process
Why are types successful? Types are lightweight specifications; type checking is fast and routine; errors are found early, at compile time

Can we extend this approach?
Specify and check other properties: languages to express rules; tools to check that code obeys rules
Goal is partial correctness: detect and report important classes of errors; no guarantee of program correctness
Systematic tools of various flavors: compile-time verifiers and bug-finders; run-time monitors and fault injectors; document generators

Program Analysis Engine
[Diagram: Rule-based programming. Labels: Rules, Development, Testing, Static Verification Tool, Read for understanding, Drive testing tools, Precise Rules, New API rules, 100% path coverage, Defects, Program Analysis Engine, Source Code]

Path-sensitive Dataflow Analysis
[Diagram: ESP. Labels: Rules, ESP, OPAL Rules, 100% path coverage, Defects, Path-sensitive Dataflow Analysis, C/C++ Code]

Requirements
Scalability: millions of lines of code
Complete coverage: all features of C/C++
Usability: low number of false positives; simple rule description language; informative error reports

The bottom line
Can ESP verify a million lines of code?
We're not sure … yet
We've done 150 KLOC in 70s and 50MB
So, we're cautiously optimistic
Segue: Verification has been around a long while. More than a few lines of code is difficult. How can we even contemplate large programs?

Are we running into a wall?
Verification demands precision: need to minimize false error reports; must analyze each execution path
Big programs demand scalability: exponentially/infinitely many paths; cannot analyze each execution path; must use approximate analysis
PREfix/Metal drop some paths on the floor. We can't do this if we want to rubber stamp. We're up against a well-known problem in program analysis.

Research problem
Can we invent a verification method that is always conservative, is always scalable, is almost always precise, and matches our intuition?
Yes, for a certain class of rules: finite state, temporal safety properties
This is my last marketing slide. Time to roll up our sleeves and look at some good old-fashioned C code.

Finite state safety properties
Property is described by an FSA
As the program executes, a monitor: tracks the current state of the FSA; updates the current state; signals an error when the FSA transitions into special error states
Goal of verification: Is there some execution path that would cause the monitor to signal an error?
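The monitor described on this slide can be sketched in a few lines of C. This is an illustration only, not ESP code: the type and function names (`FsaState`, `Event`, `fsa_step`, `monitor_signals_error`) are hypothetical, and the transition table is an assumption pieced together from the stdio rules shown elsewhere in the deck (Open always yields Opened, Print or Close on a file that is not open is an error, and Error is absorbing).

```c
#include <stddef.h>

/* Property FSA states and events for the stdio rule (illustrative names). */
typedef enum { ST_CLOSED, ST_OPENED, ST_ERROR } FsaState;
typedef enum { EV_OPEN, EV_PRINT, EV_CLOSE } Event;

/* One monitor step: the FSA state after event `ev` occurs in state `s`. */
FsaState fsa_step(FsaState s, Event ev) {
    if (s == ST_ERROR) return ST_ERROR;      /* assumed: Error is absorbing */
    switch (ev) {
    case EV_OPEN:  return ST_OPENED;         /* assumed: Open always yields Opened */
    case EV_PRINT: return s == ST_OPENED ? ST_OPENED : ST_ERROR;
    case EV_CLOSE: return s == ST_OPENED ? ST_CLOSED : ST_ERROR;
    }
    return ST_ERROR;                         /* not reached for valid events */
}

/* Run the monitor over one execution trace; returns 1 if it signals an error. */
int monitor_signals_error(const Event *trace, size_t n) {
    FsaState s = ST_CLOSED;                  /* initial state */
    for (size_t i = 0; i < n; i++) {
        s = fsa_step(s, trace[i]);
        if (s == ST_ERROR) return 1;
    }
    return 0;
}
```

A run-time monitor like this checks a single trace. The verification question on the slide is whether any trace the program can produce makes `monitor_signals_error` return 1.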

Example: stdio usage in gcc
    void main () {
      if (dump) fil = fopen(dumpFile,"w");
      if (p) x = 0; else x = 1;
      if (dump) fclose(fil);
    }

    void main () {
      if (dump) Open;
      if (p) x = 0; else x = 1;
      if (dump) Close;
    }

[FSA diagram: states Closed, Opened, Error; edge labels Open, Print, Close, Print/Close, *]
The blue code shows interactions with the stdio library. Let's call these events, and name them. There are rules we must follow: we can only close a file if we've opened it and haven't already closed it. Partial program verification: specify these rules using a state machine; verify code against these rules. Program execution drives the FSM; transitions to $error represent violations.

Path-sensitive property analysis
Symbolically evaluate the program
Track FSA state and execution state
At branch points: does the execution state imply the branch direction? Yes: process the appropriate branch. No: split the state and process both branches.
We can use static analysis to verify the code. One option is to perform a very precise path-by-path analysis.

Dataflow property analysis
Track only FSA state
Ignore non-state-changing code
At control flow join points: accumulate FSA states
This is like a standard compiler dataflow analysis, but the only data we're tracking is the FSM state.

Example
[Control-flow graph: entry, dump, Open, p, x = 0, x = 1, Close, exit; branch edges labeled T and F. FSA state sets annotated along the graph, in slide order: {Closed}, {Closed,Opened}, {Closed,Opened}, {Error,Closed,Opened}]
This analysis is imprecise. In this case the imprecision matters, because it causes a false error report. Very efficient, unlike regular dataflow. So these are the two previously known endpoints on the curve. How do we find the right point in between?
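To make the false report concrete, here is a minimal sketch of the dataflow property analysis on the running example, with the set of FSA states stored as a bitmask. It assumes the final Close is guarded by dump (that is what the annotated state sets imply) and that Open always yields Opened. All names (`StateSet`, `transfer`, `join`, `analyze_example`) are hypothetical.

```c
/* A set of property FSA states, one bit per state. */
typedef unsigned StateSet;
enum { ST_CLOSED = 1, ST_OPENED = 2, ST_ERROR = 4 };
typedef enum { EV_OPEN, EV_CLOSE } Event;

/* Transfer function: apply one event to every FSA state in the set. */
StateSet transfer(StateSet in, Event ev) {
    StateSet out = in & ST_ERROR;                 /* assumed: Error is absorbing */
    if (ev == EV_OPEN) {
        if (in & (ST_CLOSED | ST_OPENED)) out |= ST_OPENED;
    } else {                                      /* EV_CLOSE */
        if (in & ST_OPENED) out |= ST_CLOSED;
        if (in & ST_CLOSED) out |= ST_ERROR;      /* closing a file that is not open */
    }
    return out;
}

/* Join at a control-flow merge point: accumulate (union) the FSA states. */
StateSet join(StateSet a, StateSet b) { return a | b; }

/* Analyze: if (dump) Open; if (p) x = 0; else x = 1; if (dump) Close; */
StateSet analyze_example(void) {
    StateSet entry = ST_CLOSED;
    /* First branch: the T arm performs Open, the F arm does nothing. */
    StateSet after_open = join(transfer(entry, EV_OPEN), entry);
    /* Second branch: x = 0 and x = 1 do not change the FSA state. */
    StateSet after_p = join(after_open, after_open);
    /* Third branch: the T arm performs Close, the F arm does nothing. */
    return join(transfer(after_p, EV_CLOSE), after_p);
}
```

Because the join forgets which arm produced which state, the Close is applied to Closed as well as Opened, and `analyze_example()` returns a set containing the Error bit even though no real execution reaches Error.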

Why is this code correct?
    void main () {
      if (dump) Open;
      if (p) x = 0; else x = 1;
      if (dump) Close;
    }

[FSA diagram: states Closed, Opened, Error; edge labels Open, Print, Close, Print/Close, *]
We must be precise, so we must understand correct code. Why is this code correct? Simple answer: Open and Close execute under the same conditions. Examine more closely: at the end of the code, the programmer must close the file, but only if the FSM is in the Opened state. The type system of the language does not allow her to write this, so she relies on the correlation between dump and the FSM. The analysis must track this correlation after the 1st branch. There is no correlation with p.

When is a branch relevant?
Precise answer: when the value of the branch condition determines the property FSA state
Heuristic answer: when the property FSA is driven to different states along the arms of the branch statement
The precise answer is undecidable. Using this heuristic, we have designed an algorithm called property simulation.

Property simulation
Modification of path-sensitive analysis
At control flow join points: do the states agree on the property FSA state? Yes: merge the states. No: process the states separately.

Example
[Control-flow graph: entry, dump, Open, p, x = 0, x = 1, Close, exit; branch edges labeled T and F. Simulation states annotated along the graph, in slide order: [Closed], [Opened|dump=T], [Closed|dump=F], [Opened|dump=T,p=T,x=0], [Opened|dump=T,p=F,x=1], [Opened|dump=T], [Closed|dump=F], [Closed|dump=T], [Closed|dump=F], [Closed|dump=T], [Closed]]
At every point, there is a map from FSM state to simulation state. Hence the name. The 2nd branch causes a temporary blowup. Then the effect is merged away, so it doesn't come into play downstream. Think of downstream code being analyzed in 2 configurations rather than 4 states. Much more in the paper. Great, so now we can go implement this and start rubber stamping, right?
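The join rule can be sketched as a map from FSA state to execution state, where two incoming states merge only when they agree on the FSA state. This is a simplified illustration, not the ESP implementation. It assumes the final Close is guarded by dump, tracks only what is known about dump and p (the value of x is ignored), and uses hypothetical names throughout (`Sim`, `Exec`, `sim_add`, `branch`).

```c
typedef enum { ST_CLOSED, ST_OPENED, ST_ERROR, NUM_FSA } Fsa;
typedef enum { TRI_UNKNOWN, TRI_T, TRI_F } Tri;   /* what is known about a branch variable */
enum { VAR_DUMP, VAR_P, NUM_VARS };

/* Simulation state: a map from FSA state to the execution state reaching it. */
typedef struct { int live; Tri fact[NUM_VARS]; } Exec;
typedef struct { Exec at[NUM_FSA]; } Sim;

typedef Fsa (*EventFn)(Fsa);
Fsa no_event(Fsa q) { return q; }
Fsa on_open(Fsa q)  { return q == ST_ERROR ? ST_ERROR : ST_OPENED; }  /* assumed rule */
Fsa on_close(Fsa q) { return q == ST_OPENED ? ST_CLOSED : ST_ERROR; }

/* Join: states that agree on the FSA state are merged, keeping only the
   facts both sides agree on; states that disagree stay in separate slots. */
void sim_add(Sim *s, Fsa q, Exec e) {
    Exec *slot = &s->at[q];
    if (!slot->live) { *slot = e; slot->live = 1; return; }
    for (int v = 0; v < NUM_VARS; v++)
        if (slot->fact[v] != e.fact[v]) slot->fact[v] = TRI_UNKNOWN;
}

/* "if (var) <event>;": follow only the arm the execution state implies;
   if the state implies neither arm, split and process both. */
Sim branch(const Sim *in, int var, EventFn then_event) {
    Sim out = {0};
    for (int q = 0; q < NUM_FSA; q++) {
        Exec e = in->at[q];
        if (!e.live) continue;
        if (e.fact[var] != TRI_F) { Exec t = e; t.fact[var] = TRI_T; sim_add(&out, then_event((Fsa)q), t); }
        if (e.fact[var] != TRI_T) { Exec f = e; f.fact[var] = TRI_F; sim_add(&out, (Fsa)q, f); }
    }
    return out;
}

/* Simulate: if (dump) Open; if (p) x = 0; else x = 1; if (dump) Close; */
Sim simulate_example(void) {
    Sim s = {0};
    Exec init = { 1, { TRI_UNKNOWN, TRI_UNKNOWN } };
    sim_add(&s, ST_CLOSED, init);
    s = branch(&s, VAR_DUMP, on_open);    /* states differ: kept separate */
    s = branch(&s, VAR_P, no_event);      /* states agree: the split on p is merged away */
    s = branch(&s, VAR_DUMP, on_close);   /* dump is known, so only one arm is taken */
    return s;
}
```

In this sketch the result has a single live entry, Closed, and no Error entry, so the false report from the plain dataflow analysis does not appear. The branch on p never produces more than two entries after its join, which is the "2 configurations rather than 4 states" effect described in the notes.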

Making property simulation work
Real programs are complex: multiple FSAs; aliasing
Real code bases are very large: well beyond a million lines
ESP = Property Simulation + Multiple FSAs + Aliasing + Component-wise Analysis

Problem: Multiple FSAs
    void main () {
      if (dump1) fil1 = fopen(dumpFile1,"w");
      if (dump2) fil2 = fopen(dumpFile2,"w");
      fclose(fil1);
      fclose(fil2);
    }

    Transition    Source code pattern
    Open          e = fopen(_)
    Close         fclose(e)

    void main () {
      if (dump1) Open(fil1);
      if (dump2) Open(fil2);
      Close(fil1);
      Close(fil2);
    }

[FSA diagram: states Closed, Opened, Error; edge labels Open, Print, Close, Print/Close, *]
Wrong: real code is complex. There is more than one file handle, so there are multiple FSMs. Let's steal a very cool idea from Engler: use syntactic patterns to identify FSMs. e is the file handle. Events are parameterized. Track as before. Oops: we have a vector of states now, so the property state is exponential. Not good.

Property simulation, bit by bit
Problem: property state can be exponential
Solution: track one FSA at a time

    void main () {
      if (dump1) Open;
      if (dump2) ID;
      Close;
    }

    void main () {
      if (dump1) ID;
      if (dump2) Open;
      Close;
    }

Let's steal an idea from optimizing compilers: bit vector analysis. Analyze one FSM at a time through the code. Only events for that FSM remain.
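The projection onto a single FSA can be illustrated on an event trace: when tracking one handle, events on every other handle act as the identity (the ID statements above). This sketch only shows the projection idea on a concrete trace; it is not a static analysis. The integer handle ids and the names `Event`, `step` and `run_for_handle` are hypothetical.

```c
#include <stddef.h>

typedef enum { ST_CLOSED, ST_OPENED, ST_ERROR } Fsa;
typedef enum { EV_OPEN, EV_CLOSE } Kind;

/* A parameterized event: which transition, and on which file handle. */
typedef struct { Kind kind; int handle; } Event;

Fsa step(Fsa q, Kind k) {
    if (q == ST_ERROR) return ST_ERROR;             /* assumed: Error is absorbing */
    if (k == EV_OPEN) return ST_OPENED;             /* assumed: Open always yields Opened */
    return q == ST_OPENED ? ST_CLOSED : ST_ERROR;   /* EV_CLOSE */
}

/* Track only `handle`: events on other handles are treated as ID. */
Fsa run_for_handle(const Event *trace, size_t n, int handle) {
    Fsa q = ST_CLOSED;
    for (size_t i = 0; i < n; i++)
        if (trace[i].handle == handle)
            q = step(q, trace[i].kind);
    return q;
}
```

Each call carries one FSA state instead of a vector with one entry per handle, which is where the exponential property state is avoided. The calls for different handles are independent of each other.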

Property simulation, bit by bit
One FSA at a time
+ Avoids exponential property state
+ Fewer branches are relevant
+ Lifetimes are often short
+ Smaller memory footprint
+ Embarrassingly parallel
− Cannot correlate FSAs; amplifies effect of merge heuristic
Lifetime: the reason for its use in compilers. Cannot express "A is opened and B is closed". So, we beat that problem, we're done.

Problem: Aliasing

    void main () {
      if (dump1) fil1 = fopen(dumpFile1,"w");
      if (dump2) fil2 = fopen(dumpFile2,"w");
      fil3 = fil1;
      fclose( fil3 );
      fclose( fil2 );
    }

Not quite! Same code, but with an assignment from fil1 to fil3. Open/close have syntactically different params. This will lead to false errors and/or missed errors. We need to look beyond syntax, at semantics. ESP semantic model: track one value at a time. When we come to an event parameterized by a name, we need to decide if there is a state change on the value we're tracking.

ESP Model: Values Have State
During execution, the program: creates stateful values; changes the state of stateful values
The programmer defines: how values are created (syntactic patterns); how values change state (syntactic patterns)
Syntactic expressions are aliases for values

OPAL Rule Descriptions
Object Property Automata Language

    State Closed
    State Opened
    State Error
    Initial Event Open { _object_ ASTFUNCTIONCALL { ASTSYMBOL "fopen" } { _anyargs_ } }
    Event Close { ASTFUNCTIONCALL { ASTSYMBOL "fclose" } { _object_ } }
    Transition _ -> Opened on Open
    Transition Opened -> Closed on Close
    Transition Closed -> Error on Close "File already closed"

Parameterized transitions
    void main () {
      if (dump1) {
        t1 = fopen(dumpFile1,"w"); Open(t1);
        fil1 = t1;
      }
      if (dump2) {
        t2 = fopen(dumpFile2,"w"); Open(t2);
        fil2 = t2;
      }
      fil3 = fil1;
      fclose( fil3 ); Close(fil3);
      fclose( fil2 ); Close(fil2);
    }

Expressions are value aliases
    void main () {
      if (dump1) {
        t1 = fopen(dumpFile1,"w"); Open(t1);
        fil1 = t1;
      }
      if (dump2) {
        t2 = fopen(dumpFile2,"w"); Open(t2);
        fil2 = t2;
      }
      fil3 = fil1;
      fclose( fil3 ); Close(fil3);
      fclose( fil2 ); Close(fil2);
    }

Value-alias analysis
Is expression e an alias for value v?
ESP uses GOLF to answer this query
Generalized One Level Flow: context-sensitive; largely flow-insensitive; millions of lines of code, in seconds
How? Time to steal another idea. This time, from ourselves. So, we've invented some ideas and stolen others; now we can put it all together.

Putting it all together
Property simulation: identify and track relevant execution state
Syntactic patterns + value-alias analysis: identify and isolate individual FSAs
One FSA at a time: bit vector analysis for safety properties
These ideas form the basis of ESP, and distinguish it from other projects. The implemented system is much more complex.

Case study: stdio usage in gcc
cc1 from gcc version (Spec95)
Does cc1 always print to opened files?
cc1 is a complex program: 140K non-blank, non-comment lines of C; 2149 functions, 66 files, 1086 globals; call graph includes one 450-function SCC
What have we done with ESP? cc1 is the gcc front end (parser etc.). It prints output to many files depending on what the user wants. Anyone who has written a recursive descent parser should understand where the big SCC comes from.

Skeleton of cc1 source

    FILE *f1, … , *f15;
    int p1, … , p15;

    void compileFile() {
      if (p1) f1 = fopen(…);
      …
      if (p15) f15 = fopen(…);
      restOfComp();
      fclose(f1);
      fclose(f15);
    }

    void restOfComp() {
      if (p1) printRtl(f1);
      …
      if (p15) printRtl(f15);
      restOfComp();
    }

    void printRtl(FILE *f) { fprintf(f); }

For each compilation unit: conditionally open 15 files; compile code; conditionally close files
Path-sensitive analysis would lead to 2^15 states
In ESP, only one branch is relevant for each file handle; code is processed in two configs
Path-sensitivity + context-sensitive value flow are needed

OPAL rules for stdio usage
    State Uninit
    State Closed
    State Opened
    State Error
    Initial Event Decl {ASTDECLARATION {_object_ ASTSYMBOL _any_}}
    Initial Event Open {_object_ ASTFUNCTIONCALL {ASTSYMBOL "fopen"} {_anyargs_}}
    Event Print {ASTFUNCTIONCALL {ASTSYMBOL "fprintf"} {_object_,_anyargs_}}
    Event Close {ASTFUNCTIONCALL {ASTSYMBOL "fclose"} {_object_}}
    Transition _ -> Uninit on Decl
    Transition _ -> Opened on Open
    Transition Uninit -> Error on Print "File not opened"
    Transition Opened -> Opened on Print
    Transition Closed -> Error on Print "Printing to closed file"
    Transition Opened -> Closed on Close
    Transition Closed -> Error on Close "File already closed"
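One way to read the Transition lines is as a rule table searched in order. The C sketch below encodes the seven transitions that way. It is an illustration, not how ESP interprets OPAL: the names (`State`, `Rule`, `RULES`, `apply_event`) are hypothetical, `ST_ANY` stands in for the `_` wildcard, and leaving the state unchanged when no rule matches is an assumption of the sketch.

```c
#include <stddef.h>

typedef enum { ST_UNINIT, ST_CLOSED, ST_OPENED, ST_ERROR, ST_ANY } State;  /* ST_ANY: the "_" wildcard */
typedef enum { EV_DECL, EV_OPEN, EV_PRINT, EV_CLOSE } Event;

typedef struct { State from; State to; Event on; const char *msg; } Rule;

/* The seven Transition lines, in the order they are listed. */
static const Rule RULES[] = {
    { ST_ANY,    ST_UNINIT, EV_DECL,  NULL },
    { ST_ANY,    ST_OPENED, EV_OPEN,  NULL },
    { ST_UNINIT, ST_ERROR,  EV_PRINT, "File not opened" },
    { ST_OPENED, ST_OPENED, EV_PRINT, NULL },
    { ST_CLOSED, ST_ERROR,  EV_PRINT, "Printing to closed file" },
    { ST_OPENED, ST_CLOSED, EV_CLOSE, NULL },
    { ST_CLOSED, ST_ERROR,  EV_CLOSE, "File already closed" },
};

/* Apply the first matching rule and report its message through `msg`
   (which may be NULL). If no rule matches, the state is left unchanged;
   that fallback is an assumption of this sketch. */
State apply_event(State s, Event ev, const char **msg) {
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        if (RULES[i].on == ev && (RULES[i].from == ST_ANY || RULES[i].from == s)) {
            if (msg) *msg = RULES[i].msg;
            return RULES[i].to;
        }
    }
    if (msg) *msg = NULL;
    return s;
}
```

For example, `apply_event(ST_UNINIT, EV_PRINT, &m)` returns `ST_ERROR` and sets `m` to the "File not opened" message, while a Print in the Opened state leaves the handle in Opened with no message.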

Experimental results
Precision: verification succeeds for every file handle; no transitions to Error; no false errors
Scalability: ave. per handle: 72.9 seconds, 49.7 MB; single 1GHz PIII laptop with 512 MB RAM
We have proved that: each of the 646 calls to fprintf in the source code prints to a valid, open file
We've run on parts of the Windows code base. Of course, there are no bugs to find. Instead, we're using the code in Windows as a test suite, to find bugs in our own implementation. Which is backwards, isn't it?

Ongoing research
Path-sensitive value-alias analysis
Value-alias sets: expressions that hold the tracked value
Track value-alias sets during simulation: add value-alias sets to property state; when things get complicated, use GOLF
Component-wise analysis: identify and analyze components; link using less precise analysis
