Adv. UNIX: Profile/151

Advanced UNIX
Special Topics in Comp. Eng. 2, Semester 2

Profiling

• Objectives
    – introduce profiling based on execution times and line counts
Contents
    1. What is Profiling?
    2. Profiling Primes Programs
    3. Primes v.1 (p1.c)
    4. Primes v.2 (p2.c)
    5. Primes v.3 (p3.c)
    6. Primes v.4 (p4.c)
    7. Primes v.5 (p5.c)
    8. Care with Timings
    9. Quick Timings
    10. Function Call Trees
What is Profiling?

• Profiling a program involves collecting numerical data about its execution
    – e.g. the total running time, the running time for each function, the number of times a function/statement is executed
• This information can be used for speed optimisation, and for debugging.
Profiling Primes Programs

• We will profile five versions of a primes program
    – it prints all the prime numbers between 2 and 70,000
• Two types of profiling are carried out:
    – time profiling of the functions
    – counting the number of times statements are executed
Time Profiling

Detailed timing information for a program (e.g. foo.c) is obtained in three steps:
    – $ gcc -pg -o foo foo.c     (compile with profiling support)
    – $ foo                      (run it; the profile data is written to gmon.out)
    – $ gprof -b foo             (analyse the data)
-b switches off the explanation of the results format.

gprof reports:
    – execution times for each function
    – info. on how functions call each other
        – this includes the number of times a function has been called
gprof Information

• man gprof
• The Web page: "GNU gprof"
    – info/gprof.html
    – explains some of the undocumented features of gprof, and gives examples of its use
Line Counting Profiling

• Count the number of times lines in the program have been executed:
    – $ gcc -fprofile-arcs -ftest-coverage -o foo foo.c
    – $ foo
    – $ gcov foo

gcov generates a modified source listing of foo.c, stored in foo.c.gcov
    – the listing includes execution counts for the lines

Running foo again adds to the stored execution counts; a further call to gcov foo then updates foo.c.gcov with the new totals.
gcov Information

• man gcov
    – very brief
• The Web page: "gcov: a Test Coverage Program"
    – gcov_1.html
Primes v.1 (p1.c)

This program calculates the primes between 2 and 70,000 (MAXPRIME) by calling prime() for each integer.
The primes are printed in rows, 9 (NUMCOLS) primes per row.
p1.c

    #include <stdio.h>

    #define NUMCOLS 9
    #define MAXPRIME 70000

    int prime (int n);

    int main ()
    {
      int i;
      int colCount = 0;

      for (i = 2; i <= MAXPRIME; i++)
        if (prime (i)) {
          colCount++;
          if (colCount%NUMCOLS == 0) {
            printf ("%5d\n", i);
            colCount = 0;
          }
          else
            printf ("%5d ", i);
        }
      putchar('\n');
      return 0;
    }

    int prime (int n)
    /* Is n a prime? Return 1 if yes, 0 otherwise */
    {
      int i;
      for (i = 2; i < n; i++)
        if (n % i == 0)
          return 0;
      return 1;
    }
p1.c Timings

    $ gcc -pg -o p1 p1.c
    $ p
    $ gprof -b p1
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
                                                          prime
     :

The final cumulative seconds value is the total running time.

    Call graph

    granularity: each sample hit covers 4 byte(s) for 0.04% of seconds

    index % time    self  children    called     name
                                      /69999         main [2]
    [1]                                          prime [1]

    [2]                                          main [2]
                                      /69999         prime [1]

    Index by function name

       [1] prime
p1.c Line Counts

    $ gcc -fprofile-arcs -ftest-coverage -o p1 p1.c
    $ p1 > /dev/null
    $ gcov p
    % of 20 source lines executed in file p1.c
    Creating p1.c.gcov.
    $ cat p1.c.gcov
                #include <stdio.h>

                #define NUMCOLS 9
                #define MAXPRIME 70000

                int prime (int n);

                int main ()
           1    {
                  int i;
           1      int colCount = 0;

                  for (i = 2; i <= MAXPRIME; i++)
                    if (prime (i)) {
        6935          colCount++;
        6935          if (colCount%NUMCOLS == 0) {
         770            printf ("%5d\n", i);
         770            colCount = 0;
         770          }
                      else
        6165            printf ("%5d ", i);
                    }
           1      putchar('\n');
           1      return 0;
           1    }

                int prime (int n)
                {
                  int i;
                  for (i = 2; i < n; i++)
                    if (n % i == 0)
                      return 0;
        6935      return 1;
                }

    – 6935 is the number of primes
    – no. of non-primes (63064) + no. of primes (6935) = total range (69999)
    – the expensive operations in prime() are the loop and factor test
Primes v.2 (p2.c)

The analysis of p1.c shows its "hot spots" are the loop and if-test inside prime().
    – speeding these up would be good

• Mathematical theory says that if n has a factor other than 1 and itself, then at least one such factor lies between 2 and the square root of n (n^0.5)
    – we can use this to reduce the loop range to 2..root(n)
p2.c

    #include <stdio.h>
    #include <math.h>

    #define NUMCOLS 9
    #define MAXPRIME 70000

    int prime (int n);
    int root(int n);

    int main ()
    {
      int i;
      int colCount = 0;

      for (i = 2; i <= MAXPRIME; i++)
        if (prime (i)) {
          colCount++;
          if (colCount%NUMCOLS == 0) {
            printf ("%5d\n", i);
            colCount = 0;
          }
          else
            printf ("%5d ", i);
        }
      putchar('\n');
      return 0;
    }

    int prime (int n)
    {
      int i;
      for (i = 2; i <= root(n); i++)
        if (n % i == 0)
          return 0;
      return 1;
    }

    int root(int n)
    {
      return (int) sqrt( (float)n );
    }
p2.c Timings

    $ gcc -pg -o p2 p2.c -lm
    $ p
    $ gprof -b p2
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
                                                          root
                                                          prime
     :

The final cumulative seconds value is the total running time.

    Call graph

    granularity: each sample hit covers 4 byte(s) for 1.79% of 0.56 seconds

    index % time    self  children    called     name
                                      /69999         main [2]
    [1]                                          prime [1]
                                      /              root [3]

    [2]                                          main [2]
                                      /69999         prime [1]

                                      /              prime [1]
    [3]                                          root [3]

    Index by function name

       [1] prime                   [3] root
Some Observations

p2.c is very much faster than p1.c:
    – 0.56 secs in total, about 50 times faster

The reason is that prime() takes far less time in p2.c (0.28 secs) than in p1.c
    – to see why, it helps to know the line counts inside prime()
p2.c Line Counts

    $ gcc -fprofile-arcs -ftest-coverage -o p2 p2.c -lm
    $ p2 > /dev/null
    $ gcov p
    % of 22 source lines executed in file p2.c
    Creating p2.c.gcov.
    $ cat p2.c.gcov
                #include <stdio.h>
                #include <math.h>

                #define NUMCOLS 9
                #define MAXPRIME 70000

                int prime (int n);
                int root(int n);

                int main ()
           1    {
                  int i;
           1      int colCount = 0;

                  for (i = 2; i <= MAXPRIME; i++)
                    if (prime (i)) {
        6935          colCount++;
        6935          if (colCount%NUMCOLS == 0) {
         770            printf ("%5d\n", i);
         770            colCount = 0;
         770          }
                      else
        6165            printf ("%5d ", i);
                    }
           1      putchar('\n');
           1      return 0;
           1    }

                int prime (int n)
                {
                  int i;
                  for (i = 2; i <= root(n); i++)
                    if (n % i == 0)
                      return 0;
        6935      return 1;
                }

                int root(int n)
                {
                  return (int) sqrt( (float)n );
                }

    – same number of primes as in p1.c
    – same numbers of non-primes and primes as in p1.c
Some Observations

The loop and if-test in p2.c's prime() are executed far fewer times than in p1.c
    – p2.c loop: 1,682,490    if-test: 1,675,555
    – p1.c loop: 229,394,196  if-test: 229,387,261

The calls to root() are very expensive:
    – 50% of the total execution time (0.28 secs)
    – 1,682,490 calls to sqrt() inside root(), one for every evaluation of the loop test
Primes v.3 (p3.c)

This version is very similar to p2.c, but the call to root() inside prime() has been moved outside of the loop.
p3.c

    #include <stdio.h>
    #include <math.h>

    #define NUMCOLS 9
    #define MAXPRIME 70000

    int prime (int n);
    int root(int n);

    int main ()
    {
      int i;
      int colCount = 0;

      for (i = 2; i <= MAXPRIME; i++)
        if (prime (i)) {
          colCount++;
          if (colCount%NUMCOLS == 0) {
            printf ("%5d\n", i);
            colCount = 0;
          }
          else
            printf ("%5d ", i);
        }
      putchar('\n');
      return 0;
    }

    int prime (int n)
    {
      int i, bound;
      bound = root(n);
      for (i = 2; i <= bound; i++)
        if (n % i == 0)
          return 0;
      return 1;
    }

    int root(int n)
    {
      return (int) sqrt( (float)n );
    }
p3.c Timings

    $ gcc -pg -o p3 p3.c -lm
    $ p
    $ gprof -b p3
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
                                                          prime
                                                          root

    Call graph

    granularity: each sample hit covers 4 byte(s) for 6.67% of 0.15 seconds

    index % time    self  children    called     name
                                      /69999         main [2]
    [1]                                          prime [1]
                                      /69999         root [3]

    [2]                                          main [2]
                                      /69999         prime [1]

                                      /69999         prime [1]
    [3]                                          root [3]

    Index by function name

       [1] prime                   [3] root
Some Observations

p3.c is faster than p2.c:
    – 0.15 secs compared to 0.56 secs (~4 times)

The speed-up is due to root():
    – p3.c root() time: 0.15 secs, no. calls: 69,999
    – p2.c root() time: 0.28 secs, no. calls: 1,682,490
p3.c Line Counts

    $ gcc -fprofile-arcs -ftest-coverage -o p3 p3.c -lm
    $ p3 > /dev/null
    $ gcov p
    % of 23 source lines executed in file p3.c
    Creating p3.c.gcov.
    $ cat p3.c.gcov
                #include <stdio.h>
                #include <math.h>

                #define NUMCOLS 9
                #define MAXPRIME 70000

                int prime (int n);
                int root(int n);

                int main ()
           1    {
                  int i;
           1      int colCount = 0;

                  for (i = 2; i <= MAXPRIME; i++)
                    if (prime (i)) {
        6935          colCount++;
        6935          if (colCount%NUMCOLS == 0) {
         770            printf ("%5d\n", i);
         770            colCount = 0;
         770          }
                      else
        6165            printf ("%5d ", i);
                    }
           1      putchar('\n');
           1      return 0;
           1    }

                int prime (int n)
                {
                  int i, bound;
                  bound = root(n);
                  for (i = 2; i <= bound; i++)
                    if (n % i == 0)
                      return 0;
        6935      return 1;
                }

                int root(int n)
                {
                  return (int) sqrt( (float)n );
                }

    – same number of primes as before
    – same numbers of non-primes and primes as before
    – sqrt() is called much less often
Primes v.4 (p4.c)

This version makes two modifications to the code in p3.c:
    – add divisibility tests for 2, 3, 5 into prime(), to filter out numbers before the expensive loop
        – in the process we add some bugs!
    – have the for-loop start at 7, and increment in steps of 2
p4.c

    #include <stdio.h>
    #include <math.h>

    #define NUMCOLS 9
    #define MAXPRIME 70000

    int prime (int n);
    int root(int n);

    int main ()
    {
      int i;
      int colCount = 0;

      for (i = 2; i <= MAXPRIME; i++)
        if (prime (i)) {
          colCount++;
          if (colCount%NUMCOLS == 0) {
            printf ("%5d\n", i);
            colCount = 0;
          }
          else
            printf ("%5d ", i);
        }
      putchar('\n');
      return 0;
    }

    int prime (int n)
    {
      int i, bound;
      if (n%2 == 0)
        return 0;
      if (n%3 == 0)
        return 0;
      if (n%5 == 0)
        return 0;
      bound = root(n);
      for (i = 7; i <= bound; i = i+2)
        if (n % i == 0)
          return 0;
      return 1;
    }

    int root(int n)
    {
      return (int) sqrt( (float)n );
    }
p4.c Timings

    $ gcc -pg -o p4 p4.c -lm
    $ p
    $ gprof -b p4
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
                                                          prime
                                                          main
                                                          root

Where are 2, 3, and 5?

    Call graph

    granularity: each sample hit covers 4 byte(s) for 7.69% of 0.13 seconds

    index % time    self  children    called     name
    [1]                                          main [1]
                                      /69999         prime [2]

                                      /69999         main [1]
    [2]                                          prime [2]
                                      /18665         root [3]

                                      /18665         prime [2]
    [3]                                          root [3]

    Index by function name

       [1] main                    [2] prime                   [3] root
Some Observations

p4.c is a tiny bit faster than p3.c:
    – 0.13 secs compared to 0.15 secs

• There is an error somewhere, since primes 2, 3, and 5 were not printed.

root() is called far fewer times:
    – p4.c root() calls: 18,665
    – p3.c root() calls: 69,999
    – but both execution times are close to 0 secs
p4.c Line Counts

    $ gcc -fprofile-arcs -ftest-coverage -o p4 p4.c -lm
    $ p4 > /dev/null
    $ gcov p
    % of 29 source lines executed in file p4.c
    Creating p4.c.gcov.
    $ cat p4.c.gcov
                #include <stdio.h>
                #include <math.h>

                #define NUMCOLS 9
                #define MAXPRIME 70000

                int prime (int n);
                int root(int n);

                int main ()
           1    {
                  int i;
           1      int colCount = 0;

                  for (i = 2; i <= MAXPRIME; i++)
                    if (prime (i)) {
        6932          colCount++;
        6932          if (colCount%NUMCOLS == 0) {
         770            printf ("%5d\n", i);
         770            colCount = 0;
         770          }
                      else
        6162            printf ("%5d ", i);
                    }
           1      putchar('\n');
           1      return 0;
           1    }

                int prime (int n)
                {
                  int i, bound;
                  if (n%2 == 0)
                    return 0;
                  if (n%3 == 0)
                    return 0;
                  if (n%5 == 0)
        4667        return 0;
                  bound = root(n);
                  for (i = 7; i <= bound; i = i+2)
                    if (n % i == 0)
                      return 0;
        6932      return 1;
                }

                int root(int n)
                {
                  return (int) sqrt( (float)n );
                }

    – not the same number of primes as before (6935)
Some Observations

prime() is returning three fewer primes (6932) than it should
    – the error is in the return statements of the divisibility tests for 2, 3, and 5
    – instead of always returning 0, they should return 1 when n is 2, 3, or 5
• The extra tests filter out many numbers (51,334, ~73% of range):
    – test for 2: 35,000 (half of input)
    – test for 3: 11,667 (1/3 of what's left)
    – test for 5: 4,667 (1/5 of what's left)
• The for-loop executes about 55% fewer times:
    – p4.c for-loop count: 767,154
    – p3.c for-loop count: 1,682,490
Primes v.5 (p5.c)

This final version fixes the divisibility bugs in the p4.c code.
Also, the root() function is replaced by a much faster multiplication test (i*i <= n)
    – there is no longer any need for the maths library
p5.c

    #include <stdio.h>

    #define NUMCOLS 9
    #define MAXPRIME 70000

    int prime (int n);

    int main ()
    {
      int i;
      int colCount = 0;

      for (i = 2; i <= MAXPRIME; i++)
        if (prime (i)) {
          colCount++;
          if (colCount%NUMCOLS == 0) {
            printf ("%5d\n", i);
            colCount = 0;
          }
          else
            printf ("%5d ", i);
        }
      putchar('\n');
      return 0;
    }

    int prime (int n)
    {
      int i;
      if (n%2 == 0)
        return (n == 2);
      if (n%3 == 0)
        return (n == 3);
      if (n%5 == 0)
        return (n == 5);
      for (i = 7; i*i <= n; i = i+2)
        if (n % i == 0)
          return 0;
      return 1;
    }
p5.c Timings

    $ gcc -pg -o p5 p5.c
    $ p
    $ gprof -b p5
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
                                                          prime
                                                          main

All the primes are printed again.

    Call graph

    granularity: each sample hit covers 4 byte(s) for 8.33% of 0.12 seconds

    index % time    self  children    called     name
    [1]                                          main [1]
                                      /69999         prime [2]

                                      /69999         main [1]
    [2]                                          prime [2]

    Index by function name

       [1] main                    [2] prime
Some Observations

p5.c is a tiny bit faster than p4.c:
    – 0.12 secs compared to 0.13 secs
p5.c Line Counts

    $ gcc -fprofile-arcs -ftest-coverage -o p5 p5.c
    $ p5 > /dev/null
    $ gcov p
    % of 26 source lines executed in file p5.c
    Creating p5.c.gcov.
    $ cat p5.c.gcov
                #include <stdio.h>

                #define NUMCOLS 9
                #define MAXPRIME 70000

                int prime (int n);

                int main ()
           1    {
                  int i;
           1      int colCount = 0;

                  for (i = 2; i <= MAXPRIME; i++)
                    if (prime (i)) {
        6935          colCount++;
        6935          if (colCount%NUMCOLS == 0) {
         770            printf ("%5d\n", i);
         770            colCount = 0;
         770          }
                      else
        6165            printf ("%5d ", i);
                    }
           1      putchar('\n');
           1      return 0;
           1    }

                int prime (int n)
                {
                  int i;
                  if (n%2 == 0)
                    return (n == 2);
                  if (n%3 == 0)
                    return (n == 3);
                  if (n%5 == 0)
        4667        return (n == 5);
                  for (i = 7; i*i <= n; i = i+2)
                    if (n % i == 0)
                      return 0;
        6932      return 1;
                }

    – the right number of primes
Some Observations

The divisibility tests and for-loop in prime() work as in p4.c
    – the tests filter out about 73% of the numbers
    – the for-loop executes about 55% fewer times than in p3.c
Care with Timings

• The five primes programs have total execution times:

    Program   Time (secs)   Speed-up over previous version
    p1.c
    p2.c      0.56          ~50 times
    p3.c      0.15          ~4 times
    p4.c      0.13          ~1.2 times
    p5.c      0.12          ~1.1 times
• What can be concluded?
    – the optimisations speed things up, but there is a trade-off of speed versus coding complexity
• The figures are based on only one run of each program
    – the programs should be run many times, and averages taken
• The optimisation techniques should be tested on a range of inputs
    – in this case, we should run the programs for different ranges, not just 2..70,000
• Timing values are affected by the machine load, so timings should be collected at different times of day/night.
• Timings are affected by the machine type
    – e.g. SparcStation, Pentium 100
• Timings are affected by the OS version
    – e.g. BSD, Solaris, Linux
    – e.g. older UNIXes implemented the maths library differently
• Timings are affected by the accuracy of the clock:
    – very small execution times will be measured as 0.00 secs
    – these execution times could become important for larger/different data
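Quick Timings

• To obtain the total CPU time used by the program, add:
    – printf("%f secs running\n", (double) clock() / CLOCKS_PER_SEC);
  at the end of the program.
    – clock() and CLOCKS_PER_SEC are declared in <time.h>
    – cast before dividing: clock_t is usually an integer type, so clock()/CLOCKS_PER_SEC on its own would discard the fractional seconds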
Bash contains a time command:

    $ time p2
    /* p2's output */
    real 0m1.366s
    user 0m1.360s
    sys  0m0.000s
    $

    – real: total elapsed time (wall clock time)
    – user: user CPU time, i.e. time executing user code (and parts of libraries)
    – sys: system CPU time, i.e. time spent inside the UNIX kernel
There is a more detailed time command, which also prints information on other system resources:

    $ /usr/bin/time p2
    /* p2's output */
    1.40user 0.02system 0:01.80elapsed 78%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (101major+12minor)pagefaults 0swaps
    $
Function Call Trees

• A function call tree shows which functions call others
    – good for showing large program structure
    – shows what functions will be affected by a change in a function
• An inverted flow graph is a related idea
    – for each function, it shows what other functions call it
Example: curl.c

    $ cflow curl.c
    1 main {curl.c 429}
    2 init_strs {curl.c 3214}
    3 strsbld {curl.c 3155}
    4 strs_hash {curl.c 3160}
    5 tolower {}
    6 strs_insert {curl.c 3172}
    7 strcmp {}
    8 mk_strs {curl.c 3200}
    9 malloc {}
    10 strlen {}
    11 strcpy {}
    12 process_args {curl.c 486}
    13 fopen {}
    :

Relationship shown: calls
Inverted Flow Graph for curl.c

    $ cflow -i curl.c
    1 _IO_getc {}
    2 get_gifnm {curl.c 1600}
    3 get_link {curl.c 1562}
    4 mod_newfile {curl.c 3081}
    5 mod_picfile {curl.c 2984}
    :
    14 a_or_p {curl.c 2774}
    15 delete_lineref {curl.c 2601}
    16 add_path {curl.c 896}
    17 announce_old {curl.c 2217}
    18 nodup_kids {curl.c 1108}
    19 str_add_path {curl.c 905}
    20 add_path_html {curl.c 917}
    21 next_cnt {curl.c 1920}
    :

Relationship shown: is called by