ESP: Program Verification Of Millions of Lines of Code

Presentation transcript:

ESP: Program Verification Of Millions of Lines of Code Manuvir Das Researcher PPRC Reliability Team Microsoft Research

Motivation No Buffer Overruns! No Resource Leaks! No Privilege Misuse! Don't like this box, want new packaging Many OSs have these problems We know Windows doesn't Would like to verify the code and prove it so we can put rubber stamps on the box Not done yet, would be on a beach We think we've invented the technology that will get us there

Approach Redundancy is good Redundancy exposes inconsistency Inconsistency points to errors Compare what the programmer should do against what her code actually does

Lightweight specifications Rules Describe correct behavior Readable/writable by programmers Specify limited properties not total correctness/verification Compare rules against code

Types are rules Programmers use types to document interface syntax represent program abstractions Types are written, read and checked routine part of development process Why are types successful? types are lightweight specifications type checking is fast & routine errors are found early, at compile-time

Can we extend this approach? Specify and check other properties languages to express rules tools to check that code obeys rules Goal is partial correctness detect and report important classes of errors no guarantee of program correctness Systematic tools of various flavors compile-time verifiers and bug-finders run-time monitors and fault injectors document generators

Rule-based programming [Diagram labels: Rules; Development (read for understanding); Testing (drive testing tools); Static Verification Tool (precise rules, new API rules); Program Analysis Engine; Source Code; Defects; 100% path coverage]

ESP [Diagram labels: OPAL Rules; C/C++ Code; Path-sensitive Dataflow Analysis; Defects; 100% path coverage]

Requirements Scalability: millions of lines of code. Complete coverage: all features of C/C++. Usability: low number of false positives, simple rule description language, informative error reports.

The bottom line Can ESP verify a million lines of code? We’re not sure …. yet We’ve done 150 KLOC in 70 seconds and 50 MB So, we’re cautiously optimistic Segue: Verification has been around a long while More than a few lines of code is difficult How can we even contemplate large programs?

Are we running into a wall? Verification demands precision Need to minimize false error reports Must analyze each execution path Big programs demand scalability Exponentially/infinitely many paths Cannot analyze each execution path Must use approximate analysis PREfix/Metal Drop some paths on the floor Can't do this if we want to rubber stamp We're up against a well known problem in program analysis

Research problem Can we invent a verification method that is always conservative, is always scalable, is almost always precise, and matches our intuition? Yes, for a certain class of rules Finite state, temporal safety properties This is my last marketing slide Time to roll up our sleeves and look at some good old fashioned C code

Finite state safety properties Property is described by an FSA As the program executes, a monitor tracks the current state of the FSA updates the current state signals an error when the FSA transitions into special error states Goal of verification: Is there some execution path that would cause the monitor to signal an error?

Example: stdio usage in gcc void main () { if (dump) fil = fopen(dumpFile,"w"); if (p) x = 0; else x = 1; if (dump) fclose(fil); } void main () { if (dump) Open; if (p) x = 0; else x = 1; if (dump) Close; } [FSA diagram: states Closed, Opened, Error; transitions labeled Open, Print, Close, Print/Close, *] Blue code shows interactions with stdio library Let's call these events, and name them There are rules we must follow Can only close a file if we've opened it, and we haven't already closed it Partial program verification Specify these rules using a state machine Verify code against these rules Program execution drives the FSM Transitions to $error represent violations
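The monitor described on these two slides can be sketched as a small state machine in C. This is only an illustration, not ESP's implementation: the names `FileState`, `FileEvent`, `fsa_step` and `monitor_run` are hypothetical, and the handling of `Open` in the Opened state is an assumption, since the diagram labels do not say what happens there.

```c
#include <assert.h>
#include <stddef.h>

/* Property FSA for stdio usage: three states, three events. */
typedef enum { CLOSED, OPENED, ERROR } FileState;
typedef enum { EV_OPEN, EV_PRINT, EV_CLOSE } FileEvent;

/* One FSA transition. */
FileState fsa_step(FileState q, FileEvent ev) {
    if (q == ERROR) return ERROR;        /* Error is absorbing (the `*` edge) */
    if (ev == EV_OPEN) return OPENED;    /* assumption: Open is accepted in Closed and Opened */
    if (q == OPENED) return ev == EV_CLOSE ? CLOSED : OPENED;
    return ERROR;                        /* Print or Close while Closed */
}

/* Drive the monitor over one execution's event trace; returns the final state. */
FileState monitor_run(const FileEvent *trace, size_t n) {
    assert(trace != NULL || n == 0);
    FileState q = CLOSED;                /* the file starts out Closed */
    for (size_t i = 0; i < n && q != ERROR; i++)
        q = fsa_step(q, trace[i]);
    return q;
}
```

Verification then asks whether any feasible path through the program yields a trace for which `monitor_run` would return `ERROR`; the monitor itself only checks one trace at a time.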

Path-sensitive property analysis Symbolically evaluate the program Track FSA state and execution state At branch points: Execution state implies branch direction? Yes: process appropriate branch No: split state and process both branches Can use static analysis to verify the code One option is to perform a very precise path-by-path analysis

Example T F entry dump p x = 0 x = 1 Open Close exit [Closed] [Closed|dump=T] [Opened|dump=T] [Opened|dump=T,p=T] [Opened|dump=T,p=F] [Opened|dump=T,p=T,x=0] [Opened|dump=T,p=F,x=1] Shows the state just before the program point No errors 4 different states after 2nd branch So downstream code is analyzed 4 times [Opened|dump=T,p=T,x=0] [Opened|dump=T,p=F,x=1] [Closed|dump=T,p=T,x=0] [Closed|dump=T,p=F,x=1]

Dataflow property analysis Track only FSA state Ignore non-state-changing code At control flow join points: Accumulate FSA states This is like a standard compiler dataflow analysis, but the only data we're tracking is the FSM state

Example T F entry dump p x = 0 x = 1 Open Close exit {Closed} {Closed,Opened} {Closed,Opened} This analysis is imprecise In this case, the imprecision matters, because it causes a false error report Very efficient, unlike regular dataflow So these are the two previously known endpoints on the curve How do we find the right point in between? {Error,Closed,Opened}
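The imprecision can be reproduced with a few lines of C. The sketch below is hypothetical (the names `transfer` and `analyze_example` are not from ESP) and assumes the close is guarded by the same `dump` test as the open, which is what the notes on the next slide imply ("Open and Close execute under the same conditions"). A set of FSA states is a bitmask, and a join is a union.

```c
#include <assert.h>

/* One bit per property FSA state, so a set of states is a bitmask. */
enum { CLOSED = 1u << 0, OPENED = 1u << 1, ERROR = 1u << 2 };
enum { EV_OPEN, EV_PRINT, EV_CLOSE };

/* Transition for a single state (same FSA as the stdio example). */
static unsigned step_one(unsigned q, int ev) {
    if (q == ERROR) return ERROR;
    if (ev == EV_OPEN) return OPENED;
    if (q == OPENED) return ev == EV_CLOSE ? CLOSED : OPENED;
    return ERROR;                         /* Print or Close while Closed */
}

/* Dataflow transfer function: apply the event to every state in the set. */
unsigned transfer(unsigned set, int ev) {
    assert(ev == EV_OPEN || ev == EV_PRINT || ev == EV_CLOSE);
    unsigned out = 0;
    for (unsigned q = CLOSED; q <= ERROR; q <<= 1)
        if (set & q) out |= step_one(q, ev);
    return out;
}

/* if (dump) Open;  if (p) x = 0; else x = 1;  if (dump) Close; */
unsigned analyze_example(void) {
    unsigned s = CLOSED;                  /* {Closed} at entry */
    s = transfer(s, EV_OPEN) | s;         /* join of the Open arm and the empty arm */
    /* the branch on p raises no events, so the set is unchanged */
    s = transfer(s, EV_CLOSE) | s;        /* join of the Close arm and the empty arm */
    return s;
}
```

Because the join forgets which arm produced which state, the Close arm is analyzed with Closed still in the set, and Error appears at exit even though no real execution reaches it.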

Why is this code correct? void main () { if (dump) Open; if (p) x = 0; else x = 1; if (dump) Close; } [FSA diagram: states Closed, Opened, Error; transitions labeled Open, Print, Close, Print/Close, *] Must be precise. So, must understand correct code Why is this code correct? Simple answer: Open and Close execute under the same conditions Examine more closely: at end of code, must close file, but only if FSM is in Opened state Type system of language does not allow her to write this So, relies on correlation between dump & FSM Analysis must track correlation after 1st branch No correlation with p

When is a branch relevant? Precise answer When the value of the branch condition determines the property FSA state Heuristic answer When the property FSA is driven to different states along the arms of the branch statement Precise answer is undecidable Using this heuristic, we have designed an algorithm called property simulation

Property simulation Modification of path-sensitive analysis At control flow join points: States agree on property FSA state? Yes: merge states No: process states separately

Example T F entry dump p x = 0 x = 1 Open Close exit [Closed] [Opened|dump=T] [Closed|dump=F] At every point, there is a map from FSM state to simulation state. Hence the name. 2nd branch causes temporary blowup. Then effect is merged away so it doesn't come into play downstream. Think of downstream code being analyzed in 2 configurations rather than 4 states Much more in the paper Great, so now we can go implement this and start rubber stamping, right? [Opened|dump=T,p=T,x=0] [Opened|dump=T,p=F,x=1] [Opened|dump=T] [Closed|dump=F] [Closed|dump=T] [Closed|dump=F] [Closed|dump=T] [Closed]
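The merge rule can be sketched in C as a map from FSA state to at most one simulation state. Everything here is a simplified illustration under stated assumptions, not ESP's algorithm as implemented: the execution state records only what is known about `dump` and `p` (not `x`), the close is assumed to be guarded by `if (dump)`, and all names (`Sim`, `SymState`, `branch`, `simulate_example`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { CLOSED, OPENED, ERROR, NUM_STATES };   /* property FSA states */
enum { EV_OPEN, EV_PRINT, EV_CLOSE };         /* events */
enum { UNKNOWN, IS_FALSE, IS_TRUE };          /* abstract boolean value */
enum { VAR_DUMP, VAR_P, NUM_VARS };           /* tracked branch variables */

typedef struct {
    bool present;                             /* is this FSA state reachable here? */
    int  var[NUM_VARS];                       /* execution state: facts per variable */
} Sim;

typedef struct { Sim at[NUM_STATES]; } SymState;  /* FSA state -> simulation state */

static int fsa_step(int q, int ev) {
    if (q == ERROR) return ERROR;
    if (ev == EV_OPEN) return OPENED;
    if (q == OPENED) return ev == EV_CLOSE ? CLOSED : OPENED;
    return ERROR;                             /* Print or Close while Closed */
}

/* Add a simulation state. If its FSA state is already present, merge the two
   by keeping only the facts they agree on. */
static void add(SymState *s, int q, const int var[NUM_VARS]) {
    Sim *slot = &s->at[q];
    for (int i = 0; i < NUM_VARS; i++) {
        if (!slot->present) slot->var[i] = var[i];
        else if (slot->var[i] != var[i]) slot->var[i] = UNKNOWN;
    }
    slot->present = true;
}

/* Control flow join: states with equal FSA state merge, others stay separate. */
static SymState join(SymState a, const SymState *b) {
    for (int q = 0; q < NUM_STATES; q++)
        if (b->at[q].present) add(&a, q, b->at[q].var);
    return a;
}

static SymState apply(const SymState *in, int ev) {
    SymState out = {0};
    for (int q = 0; q < NUM_STATES; q++)
        if (in->at[q].present) add(&out, fsa_step(q, ev), in->at[q].var);
    return out;
}

/* Split on `if (v)`: a state enters an arm unless its facts rule that arm out. */
static void branch(const SymState *in, int v, SymState *t, SymState *f) {
    assert(v >= 0 && v < NUM_VARS);
    SymState empty = {0};
    *t = empty;
    *f = empty;
    for (int q = 0; q < NUM_STATES; q++) {
        if (!in->at[q].present) continue;
        int facts[NUM_VARS];
        for (int i = 0; i < NUM_VARS; i++) facts[i] = in->at[q].var[i];
        if (in->at[q].var[v] != IS_FALSE) { facts[v] = IS_TRUE;  add(t, q, facts); }
        if (in->at[q].var[v] != IS_TRUE)  { facts[v] = IS_FALSE; add(f, q, facts); }
    }
}

/* if (dump) Open;  if (p) x = 0; else x = 1;  if (dump) Close; */
SymState simulate_example(void) {
    SymState s = {0}, t, f;
    int nothing_known[NUM_VARS] = { UNKNOWN, UNKNOWN };
    add(&s, CLOSED, nothing_known);           /* [Closed] at entry */

    branch(&s, VAR_DUMP, &t, &f);
    t = apply(&t, EV_OPEN);
    s = join(t, &f);                          /* [Opened|dump=T] [Closed|dump=F] */

    branch(&s, VAR_P, &t, &f);                /* neither arm touches the FSA ... */
    s = join(t, &f);                          /* ... so the facts about p merge away */

    branch(&s, VAR_DUMP, &t, &f);             /* dump=T goes to Close, dump=F skips it */
    t = apply(&t, EV_CLOSE);
    return join(t, &f);                       /* [Closed]; Error is never reached */
}
```

The branch on `dump` survives the first join because its arms end in different FSA states, while the branch on `p` does not, so the code after it is simulated in two configurations rather than four.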

Loop example T T F F entry [Closed] new = old [Closed|new=old+1] Open [Opened|new=old] * T T Close F new++ [Closed|new=old+1] new != old [Opened|new=old] F Close exit [Closed|new=old]

Making property simulation work Real programs are complex Multiple FSAs Aliasing Real code bases are very large Well beyond a million lines ESP = Property Simulation + Multiple FSAs + Aliasing + Component-wise Analysis

Problem: Multiple FSAs void main () { if (dump1) fil1 = fopen(dumpFile1,"w"); if (dump2) fil2 = fopen(dumpFile2,"w"); fclose(fil1); fclose(fil2); } Transition / Source code pattern: Open: e = fopen(_); Close: fclose(e). void main () { if (dump1) Open(fil1); if (dump2) Open(fil2); Close(fil1); Close(fil2); } [FSA diagram: states Closed, Opened, Error; transitions labeled Open, Print, Close, Print/Close, *] Wrong - real code is complex More than one file handle, so multiple FSMs Let's steal a very cool idea from Engler Use syntactic patterns to identify FSMs e is the file handle Events are parameterized Track as before Oops, vector of states now, so exponential property state. Not good.

Property simulation, bit by bit Problem: property state can be exponential Solution: track one FSA at a time void main () { if (dump1) Open; if (dump2) ID; Close; } void main () { if (dump1) ID; if (dump2) Open; Close; } Let's steal an idea from optimizing compilers: bit vector analysis Analyze one FSM at a time through the code Only events for that FSM remain

Property simulation, bit by bit One FSA at a time + Avoids exponential property state + Fewer branches are relevant + Lifetimes are often short + Smaller memory footprint + Embarrassingly parallel − Cannot correlate FSAs Amplifies effect of merge heuristic Lifetime - reason for use in compilers Cannot express "A is opened and B is closed" So, we beat that problem, we're done
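The projection step, where events on other handles become ID, can be illustrated with a short C sketch. It replays a single concrete event trace rather than performing the static analysis, and the names `Event`, `track_one` and `count_errors` are hypothetical; handles are assumed to be numbered from 0.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CLOSED, OPENED, ERROR } State;
typedef enum { EV_OPEN, EV_CLOSE } Kind;
typedef struct { Kind kind; int handle; } Event;   /* event parameterized by a handle id */

static State step(State q, Kind k) {
    if (q == ERROR) return ERROR;
    if (k == EV_OPEN) return OPENED;
    return q == OPENED ? CLOSED : ERROR;           /* Close while Closed is an error */
}

/* Run the property FSA for one handle; events on other handles act as ID. */
State track_one(const Event *path, size_t n, int tracked) {
    assert(path != NULL || n == 0);
    State q = CLOSED;
    for (size_t i = 0; i < n; i++)
        if (path[i].handle == tracked) q = step(q, path[i].kind);
    return q;
}

/* Check each handle separately; returns how many handles reach Error. */
size_t count_errors(const Event *path, size_t n, int num_handles) {
    size_t errors = 0;
    for (int h = 0; h < num_handles; h++)
        if (track_one(path, n, h) == ERROR) errors++;
    return errors;
}
```

Each pass keeps a single FSA state instead of a vector with one entry per handle, and the passes are independent of one another, which is where the smaller footprint and the parallelism come from. The cost is the one named on the slide: no pass can see the state of another handle.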

Problem: Aliasing void main () { if (dump1) fil1 = fopen(dumpFile1,"w"); if (dump2) fil2 = fopen(dumpFile2,"w"); fil3 = fil1; fclose( fil3 ); fclose( fil2 ); } Not quite! Same code, assign from fil1 to fil3 Open/close have syntactically different params Will lead to false errors and/or missed errors Need to look beyond syntax, at semantics ESP semantic model Track one value at a time When we come to an event parameterized by a name, we need to decide if there is a state change on the value we're tracking

ESP Model: Values Have State During execution, the program creates stateful values changes the state of stateful values The programmer defines how values are created (syntactic patterns) how values change state (syntactic patterns) Syntactic expressions are aliases for values

OPAL Rule Descriptions Object Property Automata Language State Closed State Opened State Error Initial Event Open { _object_ ASTFUNCTIONCALL { ASTSYMBOL "fopen" } { _anyargs_ } } Event Close { ASTFUNCTIONCALL { ASTSYMBOL "fclose" } { _object_ } } Transition _ -> Opened on Open Transition Opened -> Closed on Close Transition Closed -> Error on Close "File already closed"

Parameterized transitions void main () { if (dump1) { t1 = fopen(dumpFile1,"w"); Open(t1); fil1 = t1; } if (dump2) { t2 = fopen(dumpFile2,"w"); Open(t2); fil2 = t2; } fil3 = fil1; fclose( fil3 ); Close(fil3); fclose( fil2 ); Close(fil2); } How? Time to steal another idea. This time, from ourselves So, we've invented some ideas, stolen others, now we can put it all together

Expressions are value aliases void main () { if (dump1) { t1 = fopen(dumpFile1,"w"); Open(t1); fil1 = t1; } if (dump2) { t2 = fopen(dumpFile2,"w"); Open(t2); fil2 = t2; } fil3 = fil1; fclose( fil3 ); Close(fil3); fclose( fil2 ); Close(fil2); }

Value-alias analysis Is expression e an alias for value v? ESP uses GOLF to answer this query Generalized One Level Flow Context-sensitive Largely flow-insensitive Millions of lines of code, in seconds

Putting it all together Property simulation Identify and track relevant execution state Syntactic patterns + value-alias analysis Identify and isolate individual FSAs One FSA at a time Bit vector analysis for safety properties These ideas form the basis of ESP, and distinguish it from other projects The implemented system is much more complex

Case study: stdio usage in gcc cc1 from gcc version 2.5.3 (Spec95) Does cc1 always print to opened files? cc1 is a complex program: 140K non-blank, non-comment lines of C 2149 functions, 66 files, 1086 globals Call graph includes one 450-function SCC What have we done with ESP? cc1 is the gcc front end (parser etc) Prints output to many files depending on what the user wants Anyone who has written a recursive descent parser should understand where the big SCC comes from

Skeleton of cc1 source FILE *f1, … , *f15; int p1, … , p15; void compileFile() { if (p1) f1 = fopen(…); … if (p15) f15 = fopen(…); restOfComp(); fclose(f1); fclose(f15); } void restOfComp() { if (p1) printRtl(f1); … if (p15) printRtl(f15); restOfComp(); } void printRtl(FILE *f) { fprintf(f); } For each compilation unit Conditionally open 15 files Compile code Conditionally close files Path-sensitive analysis would lead to 2^15 states In ESP, only one branch is relevant for each file handle Code is processed in two configs Path-sensitivity + context-sensitive value flow are needed

OPAL rules for stdio usage State Uninit State Closed State Opened State Error Initial Event Decl {ASTDECLARATION {_object_ ASTSYMBOL _any_}} Initial Event Open {_object_ ASTFUNCTIONCALL {ASTSYMBOL "fopen"} {_anyargs_}} Event Print {ASTFUNCTIONCALL {ASTSYMBOL "fprintf"} {_object_,_anyargs_}} Event Close {ASTFUNCTIONCALL {ASTSYMBOL "fclose"} {_object_}} Transition _ -> Uninit on Decl Transition _ -> Opened on Open Transition Uninit -> Error on Print "File not opened" Transition Opened -> Opened on Print Transition Closed -> Error on Print "Printing to closed file" Transition Opened -> Closed on Close Transition Closed -> Error on Close "File already closed"
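The transition part of these rules maps directly onto a lookup table. The C sketch below is one hypothetical encoding, not how ESP stores OPAL rules: the names `Rule`, `RULES` and `apply_rule` are invented, the `_` wildcard is taken literally as "any state", and a (state, event) pair with no listed transition is assumed to leave the state unchanged.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { UNINIT, CLOSED, OPENED, ERROR } State;
typedef enum { DECL, OPEN, PRINT, CLOSE } Event;
enum { ANY = -1 };                        /* stands for the `_` wildcard */

typedef struct {
    int from;                             /* source state, or ANY */
    Event on;                             /* triggering event */
    State to;                             /* target state */
    const char *message;                  /* error text, or NULL */
} Rule;

/* The seven Transition lines from the slide, in order. */
static const Rule RULES[] = {
    { ANY,    DECL,  UNINIT, NULL },
    { ANY,    OPEN,  OPENED, NULL },
    { UNINIT, PRINT, ERROR,  "File not opened" },
    { OPENED, PRINT, OPENED, NULL },
    { CLOSED, PRINT, ERROR,  "Printing to closed file" },
    { OPENED, CLOSE, CLOSED, NULL },
    { CLOSED, CLOSE, ERROR,  "File already closed" },
};

/* Apply the first matching rule; *message receives the error text, if any. */
State apply_rule(State q, Event ev, const char **message) {
    assert(message != NULL);
    *message = NULL;
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        if (RULES[i].on == ev && (RULES[i].from == ANY || RULES[i].from == (int)q)) {
            *message = RULES[i].message;
            return RULES[i].to;
        }
    }
    return q;   /* assumption: unlisted pairs leave the state unchanged */
}
```

The Event patterns, which bind `_object_` to a tracked value by matching the syntax tree, are not modeled here; only the state machine is.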

Experimental results Precision: verification succeeds for every file handle; no transitions to Error; no false errors. Scalability: average per handle 72.9 seconds, 49.7 MB, on a single 1 GHz PIII laptop with 512 MB RAM. We have proved that each of the 646 calls to fprintf in the source code prints to a valid, open file. We've run on parts of the Windows code base. Of course, there are no bugs to find. Instead, we're using the code in Windows as a test suite, to find bugs in our own implementation. Which is backwards, isn't it?

Ongoing research Path-sensitive value-alias analysis Value-alias sets Expressions that hold tracked value Track value-alias sets during simulation Add value-alias sets to property state When things get complicated, use GOLF Component-wise analysis Identify and analyze components Link using less precise analysis