Program Analysis via Graph Reachability Thomas Reps University of Wisconsin http://www.cs.wisc.edu/~reps/ See http://www.cs.wisc.edu/wpis/papers/tr1386.ps
PLDI 00 Registration Form Tutorial (morning): …………… $ ____ Tutorial (afternoon): ………….. $ ____ Tutorial (evening): ……………. $ – 0 –
1987 Slicing & Applications CFL Reachability 1993 Dataflow Analysis Demand Algorithms 1994 Structure-Transmitted Dependences 1995 Set Constraints 1996 1997 1998
Applications Program optimization Software engineering Program understanding Reengineering Static bug-finding Security (information flow)
Collaborators Susan Horwitz Mooly Sagiv Genevieve Rosay David Melski David Binkley Michael Benedikt Patrice Godefroid
Themes Harnessing CFL-reachability Exhaustive alg. Demand alg.
Backward slice with respect to "printf("%d\n",i)"
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n",sum);
    printf("%d\n",i);
}
Statements in the slice: int i = 1; while (i < 11); i = i + 1; printf("%d\n",i);
Slice Extraction: backward slice with respect to "printf("%d\n",i)"
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n",i);
}
Forward slice with respect to "sum = 0" (same program as in the backward-slice example)
Statements in the slice: int sum = 0; sum = sum + i; printf("%d\n",sum);
What Are Slices Useful For?
  Understanding Programs
    What is affected by what?
  Restructuring Programs
    Isolation of separate "computational threads"
  Program Specialization and Reuse
    Slices = specialized programs
    Only reuse needed slices
  Program Differencing
    Compare slices to identify changes
  Testing
    What new test cases would improve coverage?
    What regression tests must be rerun after a change?
Line-Character-Count Program
void line_char_count(FILE *f) {
    int lines = 0;
    int chars;
    BOOL eof_flag = FALSE;
    int n;
    extern void scan_line(FILE *f, BOOL *bptr, int *iptr);
    scan_line(f, &eof_flag, &n);
    chars = n;
    while (eof_flag == FALSE) {
        lines = lines + 1;
        scan_line(f, &eof_flag, &n);  /* re-read; without this call the loop never terminates */
        chars = chars + n;
    }
    printf("lines = %d\n", lines);
    printf("chars = %d\n", chars);
}
Character-Count Program: the same code shown under the name char_count(FILE *f).
Line-Count Program: the same code shown under the name line_count(FILE *f), declaring and calling scan_line2(f, &eof_flag, &n) in place of scan_line.
Specialization Via Slicing wc -lc wc -c wc -l
Control Flow Graph (for the sum program)
Nodes: Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)
Edges: Enter → sum = 0 → i = 1 → while(i < 11); while(i < 11) →T sum = sum + i → i = i + 1 → while(i < 11); while(i < 11) →F printf(sum) → printf(i)
Control Dependence Graph (for the sum program)
Control dependence p →T q: q is reached from p if condition p is true (T), not otherwise. Similar for false (F): p →F q.
Nodes: Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)
T-labeled edges run from Enter to the top-level statements and from while(i < 11) to the loop-body statements.
Flow Dependence Graph (for the sum program)
Flow dependence p → q: value of variable assigned at p may be used at q.
Nodes: Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)
Program Dependence Graph (PDG) (for the sum program)
Edges: control dependence and flow dependence
Nodes: Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)
Program Dependence Graph (PDG)
Opposite order of the first two statements (int i = 1; int sum = 0;), same PDG
Nodes: Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)
Backward Slice, steps (1)-(4) [Figure sequence on the PDG of the sum program: PDG edges are followed backward from the slicing criterion, marking each node reached]
Slice Extraction
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n",i);
}
PDG nodes in the slice: Enter, i = 1, while(i < 11), i = i + 1, printf(i)
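The backward-slice steps above amount to backward reachability in the PDG. The following is a minimal sketch of that idea in C, not the deck's implementation: the node names and the hand-written edge list are assumptions read off the sum program (control dependences from Enter and the while predicate, flow dependences from each assignment to its possible uses).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PDG nodes of the sum program (names are illustrative). */
enum { ENTER, SUM0, I1, WHILE, SUM_ADD, I_INC, PRINT_SUM, PRINT_I, NUM_NODES };

typedef struct { int from, to; } Edge;

/* Dependence edges written out by hand (an assumption, not tool output). */
static const Edge pdg_edges[] = {
    /* control dependence */
    {ENTER, SUM0}, {ENTER, I1}, {ENTER, WHILE}, {ENTER, PRINT_SUM},
    {ENTER, PRINT_I}, {WHILE, SUM_ADD}, {WHILE, I_INC},
    /* flow dependence */
    {SUM0, SUM_ADD}, {SUM0, PRINT_SUM},
    {I1, WHILE}, {I1, SUM_ADD}, {I1, I_INC}, {I1, PRINT_I},
    {SUM_ADD, SUM_ADD}, {SUM_ADD, PRINT_SUM},
    {I_INC, WHILE}, {I_INC, SUM_ADD}, {I_INC, I_INC}, {I_INC, PRINT_I},
};
static const size_t pdg_edge_count = sizeof pdg_edges / sizeof pdg_edges[0];

/* Sets in_slice[v] for every node v from which `criterion` is reachable. */
void backward_slice(const Edge *edges, size_t num_edges, int criterion,
                    bool in_slice[NUM_NODES]) {
    int stack[NUM_NODES];   /* each node is pushed at most once */
    int top = 0;

    assert(criterion >= 0 && criterion < NUM_NODES);
    for (int v = 0; v < NUM_NODES; v++) in_slice[v] = false;
    in_slice[criterion] = true;
    stack[top++] = criterion;

    while (top > 0) {
        int v = stack[--top];
        /* Follow every edge that ends at v back to its source. */
        for (size_t k = 0; k < num_edges; k++) {
            int u = edges[k].from;
            if (edges[k].to == v && !in_slice[u]) {
                in_slice[u] = true;
                stack[top++] = u;
            }
        }
    }
}
```

Calling backward_slice(pdg_edges, pdg_edge_count, PRINT_I, s) marks Enter, i = 1, while(i < 11), i = i + 1, and printf(i), which are the nodes kept by Slice Extraction. The edge scan is quadratic for brevity; adjacency lists would give the usual linear-time traversal.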
CodeSurfer
Browsing a Dependence Graph Pretend this is your favorite browser What does clicking on a link do? You get a new page Or you move to an internal tag
Interprocedural Slice
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum,i);
        i = add(i,1);
    }
    printf("%d\n",sum);
    printf("%d\n",i);
}
int add(int x, int y) {
    return x + y;
}
Backward slice with respect to "printf("%d\n",i)"
Interprocedural Slice (same program: main with sum = add(sum,i); i = add(i,1); and add(x,y))
Superfluous components included by Weiser's slicing algorithm [TSE 84]
Left out by algorithm of Horwitz, Reps, & Binkley [PLDI 88; TOPLAS 90]
System Dependence Graph (SDG) Enter main Call p Call p Enter p
SDG for the Sum Program
main: Enter main, sum = 0, i = 1, while(i < 11), printf(sum), printf(i)
First call site: Call add, xin = sum, yin = i, sum = xout
Second call site: Call add, xin = i, yin = 1, i = xout
add: Enter add, x = xin, y = yin, x = x + y, xout = x
Interprocedural Backward Slice, steps (1)-(7) [Figure sequence on a schematic SDG: Enter main, two Call p sites, Enter p. In step (6) the interprocedural edges carry the labels ( ) [ ]]
Matched-Parenthesis Path [Figure: a path whose edge labels are drawn from ( ) [ ]]
Slice Extraction Enter main Call p Enter p
Slice of the Sum Program Enter main i = 1 while(i < 11) printf(i) Call add xin = i yin= 1 i = xout Enter add x = xin y = yin x = x + y xout = x
CFL-Reachability [Yannakakis 90]
G: Graph (N nodes, E edges)
L: A context-free language
L-path from s to t iff the word spelled by the edge labels along the path is in L
Running time: $O(N^3)$
Interprocedural Slicing via CFL-Reachability Graph: System dependence graph L: L(matched) [roughly] Node m is in the slice w.r.t. n iff there is an L(matched)-path from m to n
Asymptotic Running Time [Reps, Horwitz, Sagiv, & Rosay 94]
CFL-reachability
  System dependence graph: N nodes, E edges
  Running time: $O(N^3)$
System dependence graph
  Special structure
  Running time: $O(E + \mathit{CallSites} \times \mathit{MaxParams}^3)$
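Ordinary Graph Reachability vs. CFL-Reachability
$\mathit{matched} \rightarrow \varepsilon \mid e \mid [\ \mathit{matched}\ ] \mid (\ \mathit{matched}\ ) \mid \mathit{matched}\ \mathit{matched}$
[Figure, CFL-Reachability: a graph from s to t whose edges are labeled e, (, ), [, ]; t is reachable from s only along a path whose labels spell a word of L(matched)]
[Figure, Ordinary Graph Reachability: the same graph with labels ignored; any path from s to t counts]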
Regular-Language Reachability [Yannakakis 90]
G: Graph (N nodes, E edges)
L: A regular language
L-path from s to t iff the word spelled by the edge labels along the path is in L
Running time: $O(N+E)$ vs. $O(N^3)$ for CFL-reachability
Ordinary reachability (= transitive closure)
  Label each edge with e
  L is e*
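CFL-Reachability via Dynamic Programming
Grammar: $A \rightarrow B\ C$
Graph: a B-edge followed by a C-edge induces an A-edge from the source of the B-edge to the target of the C-edge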
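The dynamic-programming rule above (a B-edge followed by a C-edge yields an A-edge) can be run to a fixed point with a worklist. The sketch below is one possible rendering in C under stated assumptions: the grammar is pre-normalized so that every right-hand side has at most two symbols, the helper nonterminals P1 and B1 and the small fixed bound MAXN are illustrative choices, and the grammar used is the matched-parenthesis grammar over e ( ) [ ].

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAXN 8   /* max graph nodes (arbitrary small bound for the sketch) */

/* Terminals e ( ) [ ] and nonterminals M (matched), P1, B1 (helpers). */
enum { E, LP, RP, LB, RB, M, P1, B1, NUM_SYMS };

/* lhs -> r1 r2; r2 < 0 means lhs -> r1; r1 < 0 means lhs -> epsilon. */
typedef struct { int lhs, r1, r2; } Prod;

static const Prod matched_prods[] = {
    {M, -1, -1}, {M, E, -1}, {M, M, M},
    {P1, LP, M}, {M, P1, RP},      /* M -> ( M ) */
    {B1, LB, M}, {M, B1, RB},      /* M -> [ M ] */
};
enum { NUM_MATCHED_PRODS = sizeof matched_prods / sizeof matched_prods[0] };

typedef struct {
    int n;                                /* number of nodes in use */
    bool has[NUM_SYMS][MAXN][MAXN];       /* has[A][u][v]: A-edge u -> v */
    int work[NUM_SYMS * MAXN * MAXN][3];  /* each edge is queued once */
    int work_len;
} Cfl;

void cfl_init(Cfl *g, int n) {
    assert(n > 0 && n <= MAXN);
    memset(g, 0, sizeof *g);
    g->n = n;
}

void cfl_add_edge(Cfl *g, int sym, int u, int v) {
    if (g->has[sym][u][v]) return;
    g->has[sym][u][v] = true;
    g->work[g->work_len][0] = sym;
    g->work[g->work_len][1] = u;
    g->work[g->work_len][2] = v;
    g->work_len++;
}

/* Adds every edge derivable from the current edges under `prods`. */
void cfl_close(Cfl *g, const Prod *prods, int np) {
    for (int p = 0; p < np; p++)           /* A -> epsilon: self-loops */
        if (prods[p].r1 < 0)
            for (int u = 0; u < g->n; u++) cfl_add_edge(g, prods[p].lhs, u, u);

    for (int head = 0; head < g->work_len; head++) {
        int b = g->work[head][0], u = g->work[head][1], v = g->work[head][2];
        for (int p = 0; p < np; p++) {
            const Prod *q = &prods[p];
            if (q->r1 == b && q->r2 < 0) cfl_add_edge(g, q->lhs, u, v);
            if (q->r2 < 0) continue;
            for (int w = 0; w < g->n; w++) {
                /* new edge is the left part: look for r2 leaving v */
                if (q->r1 == b && g->has[q->r2][v][w]) cfl_add_edge(g, q->lhs, u, w);
                /* new edge is the right part: look for r1 entering u */
                if (q->r2 == b && g->has[q->r1][w][u]) cfl_add_edge(g, q->lhs, w, v);
            }
        }
    }
}
```

For example, with edges 0 -(-> 1, 1 -e-> 2, 2 -)-> 3, and 2 -]-> 4, closing under matched_prods yields an M-edge from 0 to 3 but none from 0 to 4, because "( e ]" is not a matched string. Each (symbol, source, target) triple is processed once and scanned against all nodes, which is where the cubic bound in the number of nodes comes from.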
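Degenerate Case: CFL-Recognition
$\mathit{exp} \rightarrow \mathit{id} \mid \mathit{exp} + \mathit{exp} \mid \mathit{exp} * \mathit{exp} \mid (\ \mathit{exp}\ )$
Is "(a + b) * c" $\in L(\mathit{exp})$? [Figure: the string laid out as a linear graph from s to t, one edge per token: ( a + b ) * c]
Is "a + b) * c +" $\in L(\mathit{exp})$? [Figure: the same construction for the tokens a + b ) * c +]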
Program Chopping Given source S and target T, what program points transmit effects from S to T? S T Intersect forward slice from S with backward slice from T, right?
Non-Transitivity and Slicing
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum,i);
        i = add(i,1);
    }
    printf("%d\n",sum);
    printf("%d\n",i);
}
int add(int x, int y) {
    return x + y;
}
Shown in turn on this program:
  Forward slice with respect to "sum = 0"
  Backward slice with respect to "printf("%d\n",i)"
  Forward slice with respect to "sum = 0" together with backward slice with respect to "printf("%d\n",i)"
  Chop with respect to "sum = 0" and "printf("%d\n",i)"
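Non-Transitivity and Slicing [Figure: SDG of the sum program with a path that enters add through one call site, labeled "(", and leaves through the other call site, labeled "]"]
main: Enter main, sum = 0, i = 1, while(i < 11), printf(sum), printf(i)
First call site: Call add, xin = sum, yin = i, sum = xout
Second call site: Call add, xin = i, yin = 1, i = xout
add: Enter add, x = xin, y = yin, x = x + y, xout = x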
Program Chopping
Given source S and target T, what program points transmit effects from S to T?
"Precise interprocedural chopping" [Reps & Rosay FSE 95]
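Why intersecting the two slices is not a precise chop can be checked on the sum program. The sketch below is an illustration under explicit assumptions, not the algorithm of Reps & Rosay: it uses a hand-written, simplified SDG (data-dependence spine only, control dependences and the constant actual yin = 1 omitted), hard-codes the summary edges instead of computing them, and computes each slice with a two-phase traversal in the style of Horwitz, Reps, & Binkley.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified SDG nodes for the sum program with add() (names illustrative). */
enum { SUM0, I1, S1_XIN, S1_YIN, S1_OUT, S2_XIN, S2_OUT,
       F_X, F_Y, BODY, F_OUT, PRINT_SUM, PRINT_I, NUM_NODES };

typedef enum { INTRA, CALL, RET } Kind;  /* INTRA includes summary edges */
typedef struct { int from, to; Kind kind; } Edge;

static const Edge sdg[] = {
    {SUM0, S1_XIN, INTRA}, {I1, S1_YIN, INTRA}, {I1, S2_XIN, INTRA},
    {I1, PRINT_I, INTRA},
    {S1_XIN, F_X, CALL}, {S1_YIN, F_Y, CALL}, {S2_XIN, F_X, CALL},
    {F_X, BODY, INTRA}, {F_Y, BODY, INTRA}, {BODY, F_OUT, INTRA},
    {F_OUT, S1_OUT, RET}, {F_OUT, S2_OUT, RET},
    {S1_OUT, S1_XIN, INTRA}, {S1_OUT, PRINT_SUM, INTRA},
    {S2_OUT, S2_XIN, INTRA}, {S2_OUT, S1_YIN, INTRA}, {S2_OUT, PRINT_I, INTRA},
    /* summary edges, hard-coded here (normally computed) */
    {S1_XIN, S1_OUT, INTRA}, {S1_YIN, S1_OUT, INTRA}, {S2_XIN, S2_OUT, INTRA},
};
static const size_t sdg_len = sizeof sdg / sizeof sdg[0];

/* Grows `set` to a fixed point along all edges except those of kind `skip`. */
static void close_over(bool set[NUM_NODES], bool backward, Kind skip) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t k = 0; k < sdg_len; k++) {
            if (sdg[k].kind == skip) continue;
            int src = backward ? sdg[k].to : sdg[k].from;
            int dst = backward ? sdg[k].from : sdg[k].to;
            if (set[src] && !set[dst]) { set[dst] = true; changed = true; }
        }
    }
}

static void clear(bool set[NUM_NODES]) {
    for (int v = 0; v < NUM_NODES; v++) set[v] = false;
}

void backward_slice(int target, bool out[NUM_NODES]) {
    assert(target >= 0 && target < NUM_NODES);
    clear(out); out[target] = true;
    close_over(out, true, RET);   /* phase 1: do not descend into callees */
    close_over(out, true, CALL);  /* phase 2: do not ascend to callers */
}

void forward_slice(int source, bool out[NUM_NODES]) {
    assert(source >= 0 && source < NUM_NODES);
    clear(out); out[source] = true;
    close_over(out, false, CALL); /* phase 1: do not descend into callees */
    close_over(out, false, RET);  /* phase 2: do not ascend to callers */
}

/* The tempting but imprecise "chop": intersection of the two slices. */
void intersect_slices(int source, int target, bool out[NUM_NODES]) {
    bool fwd[NUM_NODES], bwd[NUM_NODES];
    forward_slice(source, fwd);
    backward_slice(target, bwd);
    for (int v = 0; v < NUM_NODES; v++) out[v] = fwd[v] && bwd[v];
}
```

On this graph the forward slice from sum = 0 does not contain printf(i), so nothing transmits an effect from sum = 0 to printf(i). Even so, intersect_slices(SUM0, PRINT_I, ...) reports the nodes of add (x = xin, x = x + y, xout = x), because both slices pass through add's body in different calling contexts.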
Dataflow Analysis Goal: For each point in the program, determine a superset of the “facts” that could possibly hold during execution Examples Constant propagation Reaching definitions Live variables Possibly uninitialized variables
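Possibly Uninitialized Variables [Figure: control-flow graph with nodes Start, x = 3, if . . ., y = x, y = w, w = 8, printf(y); program points are annotated with sets of possibly uninitialized variables such as {w,x,y}, {w,y}, {w}, {}]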
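Precise Intraprocedural Analysis
For a path p from start to n whose edges carry the dataflow functions $f_1, f_2, \ldots, f_k$:
$pf_p = f_k \circ f_{k-1} \circ \cdots \circ f_2 \circ f_1$
$\mathrm{MOP}[n] = \bigsqcap_{p \in \mathrm{PathsTo}[n]} pf_p(C)$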
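For possibly-uninitialized variables the path functions distribute over set union, so a standard iterative fixed-point computation yields the same answer as the combination over all paths. The following sketch illustrates this on a small hypothetical CFG (start; x = 3; if; y = x or y = w; printf(y)); the statement encoding, the variable set {w, x, y}, and the graph itself are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum { W, X, Y, NUM_VARS };
#define BIT(v) (1u << (v))

/* "lhs = rhs"; rhs < 0 means a constant; lhs < 0 means no assignment. */
typedef struct { int lhs, rhs; } Stmt;

/* Transfer function: lhs is possibly uninitialized afterwards only if it
   was assigned from a possibly uninitialized variable. */
unsigned transfer(Stmt s, unsigned in) {
    if (s.lhs < 0) return in;
    unsigned out = in & ~BIT(s.lhs);
    if (s.rhs >= 0 && (in & BIT(s.rhs))) out |= BIT(s.lhs);
    return out;
}

/* Computes in[n], the set of possibly uninitialized variables on entry to
   node n. Node 0 is the start node; sets are combined by union. */
void possibly_uninit(const Stmt *stmts, int num_nodes,
                     const int (*edges)[2], int num_edges, unsigned in[]) {
    assert(num_nodes > 0);
    for (int n = 0; n < num_nodes; n++) in[n] = 0;
    in[0] = BIT(NUM_VARS) - 1;            /* everything uninitialized at start */

    bool changed = true;
    while (changed) {
        changed = false;
        for (int e = 0; e < num_edges; e++) {
            int src = edges[e][0], dst = edges[e][1];
            unsigned merged = in[dst] | transfer(stmts[src], in[src]);
            if (merged != in[dst]) { in[dst] = merged; changed = true; }
        }
    }
}

/* Hypothetical example: 0 start, 1 x = 3, 2 if, 3 y = x, 4 y = w, 5 printf(y) */
static const Stmt cfg_stmts[] = {
    {-1, -1}, {X, -1}, {-1, -1}, {Y, X}, {Y, W}, {-1, -1},
};
static const int cfg_edges[][2] = {
    {0, 1}, {1, 2}, {2, 3}, {2, 4}, {3, 5}, {4, 5},
};
```

With these inputs the set on entry to printf(y) is {w, y}: y is initialized along the y = x branch but picks up w's uninitialized status along the y = w branch.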
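[Figure: interprocedural control-flow graph of main and p(a,b), with call and return edges labeled ( ) and ( ]. Nodes: start main, x = 3, p(x,y), return from p, printf(y), exit main; start p(a,b), if . . ., b = a, p(a,b), return from p, printf(b), exit p]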
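Precise Interprocedural Analysis [Sharir & Pnueli 81]
[Figure: a path from start to n that passes through a call "(" and the matching return ")"]
$\mathrm{MOMP}[n] = \bigsqcap_{p \in \mathrm{MatchedPathsTo}[n]} pf_p(C)$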
Representing Dataflow Functions [Figures: graph representations, over the facts a, b, c, of four kinds of function: Identity Function, Constant Function, "Gen/Kill" Function, Non-"Gen/Kill" Function]
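[Figure: the same graph of main and p(a,b), exploded so that each node carries the facts x, y (in main) or a, b (in p). Nodes: start main, x = 3, p(x,y), return from p, printf(y), exit main; start p(a,b), if . . ., b = a, p(a,b), return from p, printf(b), exit p]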
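Composing Dataflow Functions [Figure: the graph representations of $f_1$ and $f_2$ over a, b, c placed end to end; $f_2 \circ f_1(\{a,c\})$ is read off by following paths through both graphs]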
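Representing a distributive function as a graph makes composition a matter of chaining edges. The sketch below shows one way to encode this in C, assuming the usual relation-based encoding with an extra "0" node that stands for the empty set; the two example functions (one that kills a and generates b, one that models the assignment b = a) are hypothetical stand-ins, since the slide's own $f_1$ and $f_2$ are not recoverable from the text.

```c
#include <assert.h>

enum { ZERO, A, B, C, NUM_FACTS };   /* ZERO is the extra "0" node */
#define BIT(d) (1u << (d))

/* row[d] is the set of facts that fact d has an edge to. */
typedef struct { unsigned row[NUM_FACTS]; } Rel;

/* f(S): union of the targets of every fact in S plus the 0 node. */
unsigned rel_apply(const Rel *f, unsigned set) {
    assert((set >> NUM_FACTS) == 0);
    unsigned in = set | BIT(ZERO), out = 0;
    for (int d = 0; d < NUM_FACTS; d++)
        if (in & BIT(d)) out |= f->row[d];
    return out & ~BIT(ZERO);
}

/* Relational composition: the graph of "first, then second". */
Rel rel_compose(const Rel *first, const Rel *second) {
    Rel r;
    for (int d = 0; d < NUM_FACTS; d++) {
        r.row[d] = 0;
        for (int m = 0; m < NUM_FACTS; m++)
            if (first->row[d] & BIT(m)) r.row[d] |= second->row[m];
    }
    return r;
}

/* Hypothetical gen/kill function: S -> (S - {a}) + {b} */
static const Rel kill_a_gen_b = {{ BIT(ZERO) | BIT(B), 0, 0, BIT(C) }};

/* Hypothetical non-gen/kill function modeling "b = a":
   b is in the result exactly when a is in the argument. */
static const Rel b_gets_a = {{ BIT(ZERO), BIT(A) | BIT(B), 0, BIT(C) }};
```

With $f_1$ = kill_a_gen_b and $f_2$ = b_gets_a, applying rel_compose(&kill_a_gen_b, &b_gets_a) to {a, c} gives {c}, the same result as applying the two functions one after the other. Because the composed graph has the same shape as its parts, the cost of composition is bounded by the number of facts, which is the source of the $D^3$ factor in the running-time bounds.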
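[Figure: the exploded graph of main and p(a,b) over the facts x, y and a, b, with call/return edges labeled ( ) and ( ]. Two queries are posed: "Might y be uninitialized here?" and "Might b be uninitialized here?"; the slide marks one answer YES! and the other NO!. Nodes: start main, x = 3, p(x,y), return from p, printf(y), exit main; start p(a,b), if . . ., b = a, p(a,b), return from p, printf(b), exit p]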
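Off Limits!
$\mathit{matched} \rightarrow \mathit{matched}\ \mathit{matched} \mid (_i\ \mathit{matched}\ )_i \ \ (1 \le i \le \mathit{CallSites}) \mid \mathit{edge} \mid \varepsilon$
[Figure: a path with mismatched call and return edges, pictured with the call stack, is marked Off Limits!]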
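Off Limits!
$\mathit{unbalLeft} \rightarrow \mathit{matched}\ \mathit{unbalLeft} \mid (_i\ \mathit{unbalLeft} \ \ (1 \le i \le \mathit{CallSites}) \mid \varepsilon$
[Figure: a path may leave calls pending, with "(" still on the stack, but a mismatched return remains Off Limits!]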
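Membership of a single path in L(matched) or L(unbalLeft) can be checked with an explicit call stack, which is the intuition behind the stack pictures above. The sketch below assumes a simple encoding chosen for illustration: 0 for an ordinary edge, +i for the call edge "(_i", and -i for the return edge ")_i"; the names and the depth bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64   /* arbitrary bound on pending calls for the sketch */

typedef enum { PATH_INVALID, PATH_MATCHED, PATH_UNBAL_LEFT } PathClass;

/* labels[k] == 0: ordinary edge; +i: call edge of site i; -i: return edge
   of site i. PATH_MATCHED paths are also members of L(unbalLeft). */
PathClass classify_path(const int *labels, size_t len) {
    int stack[MAX_DEPTH];
    size_t depth = 0;

    for (size_t k = 0; k < len; k++) {
        int l = labels[k];
        if (l > 0) {                       /* call: remember the site */
            assert(depth < MAX_DEPTH);
            stack[depth++] = l;
        } else if (l < 0) {                /* return: must match the top */
            if (depth == 0 || stack[depth - 1] != -l) return PATH_INVALID;
            depth--;
        }
    }
    return depth == 0 ? PATH_MATCHED : PATH_UNBAL_LEFT;
}
```

A path labeled (_1 e )_1 is matched; (_1 e )_2 is invalid because it returns to the wrong call site; (_1 e (_2 e )_2 is in L(unbalLeft) only, since the call from site 1 is still pending. CFL-reachability answers the same question for all paths at once instead of one path at a time.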
Interprocedural Dataflow Analysis via CFL-Reachability
Graph: Exploded control-flow graph
L: L(unbalLeft)
Fact d holds at n iff there is an L(unbalLeft)-path from the start node of main to the node for fact d at n
Asymptotic Running Time [Reps, Horwitz, & Sagiv 95]
CFL-reachability
  Exploded control-flow graph: ND nodes
  Running time: $O(N^3 D^3)$
Exploded control-flow graph
  Special structure
  Running time: $O(E D^3)$
  Typically: $E \approx N$, hence $O(E D^3) \approx O(N D^3)$
"Gen/kill" problems: $O(E D)$
Why Bother? "We're only interested in million-line programs"
  Know thy enemy!
    "Any" algorithm must do these operations
    Avoid pitfalls (e.g., claiming $O(N^2)$ algorithm)
  The essence of "context sensitivity"
  Special cases
    "Gen/kill" problems: $O(E D)$
  Compression techniques
    Basic blocks
    SSA form, sparse evaluation graphs
  Demand algorithms
Unifying Conceptual Model for Dataflow-Analysis Literature
  Linear-time gen-kill [Hecht 76], [Kou 77]
  Path-constrained DFA [Holley & Rosen 81]
  Linear-time GMOD [Cooper & Kennedy 88]
  Flow-sensitive MOD [Callahan 88]
  Linear-time interprocedural gen-kill [Knoop & Steffen 93]
  Linear-time bidirectional gen-kill [Dhamdhere 94]
  Relationship to interprocedural DFA [Sharir & Pnueli 81], [Knoop & Steffen 92]
Themes Harnessing CFL-reachability Exhaustive alg. Demand alg.
Exhaustive Versus Demand Analysis Exhaustive analysis: All facts at all points Optimization: Concentrate on inner loops Program-understanding tools: Only some facts are of interest
Exhaustive Versus Demand Analysis Does a given fact hold at a given point? Which facts hold at a given point? At which points does a given fact hold? Demand analysis via CFL-reachability single-source/single-target CFL-reachability single-source/multi-target CFL-reachability multi-source/single-target CFL-reachability
"Semi-exhaustive": All "appropriate" demands [Figure: the exploded graph of main and p(a,b) over the facts x, y and a, b, with the queries "Might b be uninitialized here?" and "Might y be uninitialized here?"; the slide marks one answer YES! and the other NO!. Nodes: start main, x = 3, p(x,y), return from p, printf(y), exit main; start p(a,b), if . . ., b = a, p(a,b), return from p, printf(b), exit p]
Experimental Results [Horwitz, Reps, & Sagiv 1995]
  53 C programs (200-6,700 lines)
  For a single fact of interest: demand always better than exhaustive
  All "appropriate" demands beats exhaustive when percentage of "yes" answers is high
    Live variables
    Truly live variables
    Constant predicates
    . . .
A Related Result [Sagiv, Reps, & Horwitz 1996] [Uses a generalized analysis technique] 38 C programs (300-6,000 lines) copy-constant propagation linear-constant propagation All “appropriate” demands always beats exhaustive factor of 1.14 to about 6
Exhaustive Versus Demand Analysis Demand algorithms for Interprocedural dataflow analysis Set constraints Points-to analysis
Most Significant Contributions: 1987-2000 Asymptotically fastest algorithms Interprocedural slicing Interprocedural dataflow analysis Demand algorithms Interprocedural dataflow analysis [CC94,FSE95] All “appropriate” demands beats exhaustive Tool for slicing and browsing ANSI C Slices programs as large as 75,000 lines University research distribution Commercial product: CodeSurfer (GrammaTech, Inc.)
References
Papers by Reps and collaborators: http://www.cs.wisc.edu/~reps/
CFL-reachability
  Yannakakis, M., Graph-theoretic methods in database theory, PODS 90.
  Reps, T., Program analysis via graph reachability, Inf. and Softw. Tech. 98.
Slicing, chopping, etc.
  Horwitz, Reps, & Binkley, TOPLAS 90
  Reps, Horwitz, Sagiv, & Rosay, FSE 94
  Reps & Rosay, FSE 95
Dataflow analysis
  Reps, Horwitz, & Sagiv, POPL 95
  Horwitz, Reps, & Sagiv, FSE 95, TR-1283