Program Analysis via Graph Reachability Thomas Reps University of Wisconsin Joint work with S. Horwitz, M. Sagiv, G. Rosay, D. Melski, M. Benedikt, and P. Godefroid
1987 Slicing & Applications CFL Reachability 1993 Dataflow Analysis Demand Algorithms 1994 Structure- Transmitted Dependences 1995 Set Constraints 1996 1997 1998
More Recently Flow-insensitive points-to analysis An undecidability result context-sensitive structure-transmitted data-dependence analysis Model checking of recursive hierarchical finite-state machines “infinite”-state systems CFL-reachability/circularity queries linear-, quadratic-, and cubic-time algorithms
Outline Interprocedural slicing Interprocedural dataflow analysis Demand-driven analysis Linear . . . cubic . . . undecidable Model-checking of recursive HFSMs
Backward slice with respect to statement “printf(“%d\n”,i)” int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Backward slice with respect to statement “printf(“%d\n”,i)”
Backward slice with respect to statement “printf(“%d\n”,i)” int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Backward slice with respect to statement “printf(“%d\n”,i)”
Backward slice with respect to statement “printf(“%d\n”,i)” Slice Extraction int main() { int i = 1; while (i < 11) { i = i + 1; } printf(“%d\n”,i); Backward slice with respect to statement “printf(“%d\n”,i)”
Forward slice with respect to statement “sum = 0” int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Forward slice with respect to statement “sum = 0”
Forward slice with respect to statement “sum = 0” int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Forward slice with respect to statement “sum = 0”
What Are Slices Useful For? Understanding Programs What is affected by what? Restructuring Programs Isolation of separate “computational threads” Program Specialization and Reuse Slices = specialized programs Only reuse needed slices Program Differencing Compare slices to identify changes Testing What new test cases would improve coverage? What regression tests must be rerun after a change?
Control Flow Graph int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Enter F sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T sum = sum + i i = i + i
Flow Dependence Graph Flow dependence p q Value of variable int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Flow dependence p q Value of variable assigned at p may be used at q. Enter sum = 0 i = 1 while(i < 11) printf(sum) printf(i) sum = sum + i i = i + i
Control Dependence Graph int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Control dependence q is reached from p if condition p is true (T), not otherwise. p q T Similar for false (F). p q F Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Program Dependence Graph (PDG) int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Control dependence Flow dependence Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Program Dependence Graph (PDG) int main() { int i = 1; int sum = 0; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Opposite Order Same PDG Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Backward Slice int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Backward Slice (2) int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Backward Slice (3) int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Backward Slice (4) int main() { int sum = 0; int i = 1; while (i < 11) { sum = sum + i; i = i + 1; } printf(“%d\n”,sum); printf(“%d\n”,i); Enter T T T T T T sum = 0 i = 1 while(i < 11) printf(sum) printf(i) T T sum = sum + i i = i + i
Slice Extraction int main() { int i = 1; while (i < 11) { i = i + 1; } printf(“%d\n”,i); Enter T T T T i = 1 while(i < 11) printf(i) T i = i + i
CodeSurfer
Interprocedural Slice int main() { int sum = 0; int i = 1; while (i < 11) { sum = add(sum,i); i = add(i,1); } printf(“%d\n”,sum); printf(“%d\n”,i); int add(int x, int y) { return x + y; } Backward slice with respect to statement “printf(“%d\n”,i)”
Interprocedural Slice int main() { int sum = 0; int i = 1; while (i < 11) { sum = add(sum,i); i = add(i,1); } printf(“%d\n”,sum); printf(“%d\n”,i); int add(int x, int y) { return x + y; } Backward slice with respect to statement “printf(“%d\n”,i)”
Interprocedural Slice int main() { int sum = 0; int i = 1; while (i < 11) { sum = add(sum,i); i = add(i,1); } printf(“%d\n”,sum); printf(“%d\n”,i); int add(int x, int y) { return x + y; } Superfluous components included by Weiser’s slicing algorithm [TSE 84] Left out by algorithm of Horwitz, Reps, & Binkley [PLDI 88; TOPLAS 90]
System Dependence Graph (SDG) Enter main Call p Call p Enter p
SDG for the Sum Program xin = sum yin = i sum = xout xin = i yin= 1 Enter main sum = 0 i = 1 while(i < 11) printf(sum) printf(i) Call add Call add xin = sum yin = i sum = xout xin = i yin= 1 i = xout Enter add x = xin y = yin x = x + y xout = x
Interprocedural Backward Slice Enter main Call p Call p Enter p
Interprocedural Backward Slice (2) Enter main Call p Call p Enter p
Interprocedural Backward Slice (3) Enter main Call p Call p Enter p
Interprocedural Backward Slice (4) Enter main Call p Call p Enter p
Interprocedural Backward Slice (5) Enter main Call p Call p Enter p
Interprocedural Backward Slice (6) Enter main Call p Call p ) ( [ ] Enter p
Matched-Parenthesis Path ) ( ) [
Interprocedural Backward Slice (6) Enter main Call p Call p Enter p
Interprocedural Backward Slice (7) Enter main Call p Call p Enter p
Slice Extraction Enter main Call p Enter p
Slice of the Sum Program Enter main i = 1 while(i < 11) printf(i) Call add xin = i yin= 1 i = xout Enter add x = xin y = yin x = x + y xout = x
CFL-Reachability [Yannakakis 90] G: Graph L: A context-free language L-path from s to t iff Running time: O(N 3)
Ordinary Graph Reachability ( e [ ] ) matched | e | [ matched ] | ( matched ) | matched matched CFL-Reachability ( t ) e [ ] e e [ e ] [ ] e e s t Ordinary Graph Reachability s t s t s
CFL-Reachability via Dynamic Programming Graph Grammar A B C B C A
Interprocedural Slicing via CFL-Reachability Graph: System dependence graph L: L(matched) Node m is in the slice w.r.t. n iff there is an L(matched)-path from m to n
Asymptotic Running Time [Reps, Horwitz, Sagiv, & Rosay 94] CFL-reachability System dependence graph: N nodes, E edges Running time: O(N 3) System dependence graph Special structure Running time: O(E + CallSites % MaxParams3)
Outline Interprocedural slicing Interprocedural dataflow analysis Demand-driven analysis Linear . . . cubic . . . undecidable Model-checking of recursive HFSMs
Precise Intraprocedural Analysis start n
( ) ] ( start p(a,b) start main if . . . x = 3 b = a p(x,y) p(a,b) return from p return from p printf(y) printf(b) exit main exit p
Precise Interprocedural Analysis ret start n ( ) [Sharir & Pnueli 81]
Representing Dataflow Functions b c Identity Function a b c Constant Function
Representing Dataflow Functions b c “Gen/Kill” Function a b c Non-“Gen/Kill” Function
x y a b start p(a,b) start main if . . . x = 3 b = a p(x,y) p(a,b) return from p return from p printf(y) printf(b) exit main exit p
Composing Dataflow Functions b c a b c a a b c
( ) ( ] YES! NO! x y start p(a,b) a b start main if . . . x = 3 Might y be uninitialized here? Might b be uninitialized here? b = a p(x,y) p(a,b) return from p return from p printf(y) printf(b) exit main exit p
Off Limits! matched matched matched | (i matched )i 1 i CallSites | edge | stack ) ( stack Off Limits!
Off Limits! unbalLeft matched unbalLeft | (i unbalLeft 1 i CallSites | stack ) ( stack Off Limits! (
Interprocedural Dataflow Analysis via CFL-Reachability Graph: Exploded control-flow graph L: L(unbalLeft) Fact d holds at n iff there is an L(unbalLeft)-path from
Asymptotic Running Time [Reps, Horwitz, & Sagiv 95] CFL-reachability Exploded control-flow graph: ND nodes Running time: O(N3D3) Exploded control-flow graph Special structure Running time: O(ED3) Typically: E l N, hence O(ED3) l O(ND3) “Gen/kill” problems: O(ED)
Outline Interprocedural slicing Interprocedural dataflow analysis Demand-driven analysis Linear . . . cubic . . . undecidable Model-checking of recursive HFSMs
Exhaustive Versus Demand Analysis Exhaustive analysis: All facts at all points Optimization: Concentrate on inner loops Program-understanding tools: Only some facts are of interest Demand analysis: Does a given fact hold at a given point? Which facts hold at a given point? At which points does a given fact hold?
All “appropriate” demands x y a b YES! ( ) start p(a,b) “Semi-exhaustive”: All “appropriate” demands start main Might b be uninitialized here? Might y be uninitialized here? if . . . x = 3 b = a p(x,y) p(a,b) return from p return from p printf(y) printf(b) NO! exit main exit p
Experimental Results [Horwitz , Reps, & Sagiv 1995] 53 C programs (200-6,700 lines) For a single fact of interest: Demand algorithm always better than exhaustive algorithm All “appropriate” demands beats exhaustive when percentage of “yes” answers is high Live variables Truly live variables Constant predicates . . .
Outline Interprocedural slicing Interprocedural dataflow analysis Demand-driven analysis Linear . . . cubic . . . undecidable Model-checking of recursive HFSMs
Structure-Transmitted Dependences [Reps1995] McCarthy’s equations: car(cons(x,y)) = x cdr(cons(x,y)) = y w = cons(x,y); v = car(w); v w y x
Dependences + Matched Paths? Enter main x y hd hd-1 ( ) w=cons(x,y) Call p Call p w w Enter p w v = car(w)
Undecidable! [Reps, TOPLAS 00] hd hd-1 ( ) Interleaved Parentheses!
Outline Interprocedural slicing Interprocedural dataflow analysis Demand-driven analysis Linear . . . cubic . . . undecidable Model-checking of recursive HFSMs
Model-Checking of Recursive HFSMs [Benedikt, Godefroid, & Reps (in prep.)] Non-recursive HFSMs [Alur & Yannakakis 98] Ordinary FSMs T-reachability/circularity queries Recursive HFSMs Matched-parenthesis T-reachability/circularity Key observation: Linear-time algorithms for matched-parenthesis T-reachability/cyclicity Single-entry/multi-exit [or multi-entry/single-exit] Deterministic, multi-entry/multi-exit
T-Cyclicity in Hierarchical Kripke Structures SN/SX SN/MX MN/SX MN/MX non-rec: O(|k|) non-rec: O(|k|) ? ? rec: O(|k3|) rec: ? SN/SX SN/MX MN/SX MN/MX O(|k|) O(|k|) O(|k|) O(|k3|) O(|k2|) [lin rec] O(|k|) [det]
Recursive HFSMs: Data Complexity SN/SX SN/MX MN/SX MN/MX LTL non-rec: O(|k|) non-rec: O(|k|) ? ? rec: P-time rec: ? CTL O(|k|) bad ? bad CTL* O(|k2|) [L2] bad ? bad
Recursive HFSMs: Data Complexity SN/SX SN/MX MN/SX MN/MX LTL O(|k|) O(|k|) O(|k|) O(|k3|) O(|k2|) [lin rec] O(|k|) [det] CTL O(|k|) bad O(|k|) bad CTL* O(|k|) bad O(|k|) bad Not Dual Problems!
CFL-Reachability: Scope of Applicability Static analysis Slicing, DFA, structure-transmitted dep., points-to analysis Formal-language theory CF-, 2DPDA-, 2NPDA-recognition Attribute-grammar analysis Verification Model-checking recursive HFSMs “Ping-pong” protocols [Dolev, Even, & Karp 83]
CFL-Reachability: Benefits Algorithms Demand & exhaustive Complexity Linear-, quadratic-, cubic-time algorithms PTIME-completeness Variants that are undecidable Complementary to Equations Set constraints Types . . .
But . . . Model checking Dataflow analysis Huge graphs (10100 reachable states) Reachability/circularity queries Represent implicitly (OBDDs) Dataflow analysis Large graphs e.g., Stmts Vars ( 1011) CFL-reachability queries [Reps,Horwitz,Sagiv 95] OBDDs blew up [Siff & Reps 95 (unpub.)] . . . yes, we tried the usual tricks . . .
Most Significant Contributions: 1987-2000 Asymptotically fastest algorithms Interprocedural slicing Interprocedural dataflow analysis Demand algorithms All “appropriate” demands may beat exhaustive Tool for slicing and browsing ANSI C Slices programs as large as 60,000 lines University research distribution Commercial product: CodeSurfer (GrammaTech, Inc.) CFL-reachability as unifying conceptual model [Kou 77], [Holley&Rosen 81], [Cooper&Kennedy 88], [Callahan 88], [Horwitz,Reps,&Binkley 88], . . . Identifies fundamental bottlenecks (e.g., cubic-time “barrier”)