Published by Jack Bell. Modified over 9 years ago.
1
Code Tuning and Optimization
Doug Sondak, sondak@bu.edu
Boston University, Scientific Computing and Visualization
2
Outline
- Introduction
- Example code
- Timing
- Profiling
- Cache
- Tuning
Information Services & Technology 2 10/6/2015
3
Introduction
- Timing
  - Where is most time being used?
- Tuning
  - How to speed it up
  - Often as much art as science
- Parallel Performance
  - How to assess how well parallelization is working
4
Example Code
5
Example Code
- Simulation of the response of the eye to stimuli
  - Response is affected by adjacent inputs
  - A dark area next to a bright area makes the bright area look brighter
- Based on the Grossberg & Todorovic paper
  - Appendix in the paper contains all equations
  - There are errors in eqns. (A4) and (A5): cross out "log2"
- Paper contains 6 levels of response
  - Our code only contains levels 1 through 5
  - Level 6 takes a long time to compute, and would skew our timings!
6
Example Code (cont’d)
- All calculations are done on a square array
- Array size and other constants are defined in gt.h (C) or in the "mods" module at the top of the code (Fortran)
- Due to the nature of the algorithm, the array is padded on all sides
  - npad is the size of the padding
7
Example Code - Level 1
- Luminance (input) distribution
- Paper (and code) use the "yin-yang square"
- Array I
  - magnitude of "bright" is ihigh
  - magnitude of "dark" is ilow
[Figure: Fig. 4 in paper, with bright and dark regions labeled]
8
Example Code - Level 2
- Level 2: Circular Concentric On and Off Units
- Excitation and inhibition vary with distance
[Figure: Fig. 5 in paper]
9
Level 2 Equations
- I_pq = initial input (yin-yang)
10
Example Code - Level 3
- Oriented Direction-of-Contrast-Sensitive Units
- Respond to angle
  - 12 discrete angles
- Respond to direction of contrast, i.e., light-to-dark or dark-to-light
[Figure: Fig. 6(d) in paper]
11
Level 3 Equations
12
Example Code - Level 4
- Oriented Direction-of-Contrast-Insensitive Units
- Respond to angle
- Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light
[Figure: Fig. 8(a) in paper]
13
Level 4 Equations
14
Example Code - Level 5
- Level 5: Boundary Contour Units
- Pool nearby excitations
[Figure: Fig. 8(d) in paper]
15
Level 5 Equation
16
Timing
17
Timing
- When tuning or parallelizing a code, you need to assess the effectiveness of your efforts
- Can time the whole code and/or specific sections
- Some types of timers
  - unix time command
  - function/subroutine calls
  - profiler
18
CPU Time or Wall-Clock Time?
- CPU time
  - How much time the CPU is actually crunching away
  - User CPU time
    - Time spent executing your source code
  - System CPU time
    - Time spent in system calls such as i/o
- Wall-clock time
  - What you would measure with a stopwatch
19
CPU Time or Wall-Clock Time? (cont’d)
- Both are useful
- For serial runs without interaction from the keyboard, CPU and wall-clock times are usually close
  - If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not
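The difference between the two clocks can be seen with a small C sketch. It assumes a POSIX system; the helper names cpu_seconds, wall_seconds and measure_idle are made up for this illustration, and a short sleep stands in for waiting on keyboard input.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep */
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/* CPU time used by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Wall-clock time in seconds since the epoch. */
double wall_seconds(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1.0e-6 * t.tv_usec;
}

/* Hypothetical helper: idle for `ms` milliseconds and report how much
   CPU time and wall-clock time elapsed in the meantime. */
void measure_idle(long ms, double *cpu, double *wall)
{
    assert(ms >= 0 && cpu != NULL && wall != NULL);
    struct timespec nap = { .tv_sec = ms / 1000,
                            .tv_nsec = (ms % 1000) * 1000000L };
    double c0 = cpu_seconds();
    double w0 = wall_seconds();
    nanosleep(&nap, NULL);        /* stands in for waiting on input */
    *cpu  = cpu_seconds()  - c0;
    *wall = wall_seconds() - w0;
}
```

Calling measure_idle(200, &cpu, &wall) should report a wall-clock time of about 0.2 s while the CPU time stays close to zero, since a sleeping process does no computation.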
20
CPU Time or Wall-Clock Time? (3)
- Parallel runs
  - Want wall-clock time, since CPU time will be about the same or even increase as the number of procs. is increased
  - Wall-clock time may not be accurate if sharing processors
    - Wall-clock timings should always be performed in batch mode
21
Unix Time Command
- easiest way to time code
- simply type time before your run command
- output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)
22
Unix Time Command (cont’d)
    twister:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
Fields, left to right:
- 1.570u: user CPU time (s)
- 0.010s: system CPU time (s)
- 0:01.77: wall-clock time (s)
- 89.2%: (u+s)/wc
- 75+1450k: avg. shared + unshared text space
- 0+0io: input + output operations
- 64pf+0w: page faults + no. of times proc. was swapped
23
Unix Time Command (3)
- Bourne shell results
    $ time mycode
    Real   1.62    (wall-clock time, s)
    User   1.57    (user CPU time, s)
    System 0.03    (system CPU time, s)
24
Exercise 1
- Copy files from /scratch/sondak/gt
    cp /scratch/sondak/gt/* .
- Choose C (gt.c) or Fortran (gt.f90)
- Compile with no optimization:
    pgcc -O0 -o gt gt.c
    pgf90 -O0 -o gt gt.f90
  - -O0 is a capital oh followed by a zero; -o is a small oh
- Submit rungt script to batch queue
    qsub rungt
25
Exercise 1 (cont’d)
- Check status
    qstat -u username
- After the run has completed, a file will appear named rungt.o??????, where ?????? represents the process number
- File contains the result of the time command
- Write down the wall-clock time
- Re-compile using -O3
- Re-run and check time
26
Function/Subroutine Calls
- often need to time part of a code
- timers can be inserted in source code
- language-dependent
27
cpu_time
- intrinsic subroutine in Fortran
- returns user CPU time (in seconds)
  - no system time is included
- 0.01 sec. resolution on p-series

    real :: t1, t2
    call cpu_time(t1)
    ... do stuff to be timed ...
    call cpu_time(t2)
    print*, 'CPU time = ', t2-t1, ' sec.'
28
system_clock
- intrinsic subroutine in Fortran
- good for measuring wall-clock time
- on p-series:
  - resolution is 0.01 sec.
  - max. time is 24 hr.
29
system_clock (cont’d)
- t1 and t2 are tic counts
- count_rate is an optional argument containing tics/sec.

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)
    ... do stuff to be timed ...
    call system_clock(t2)
    print*, 'wall-clock time = ', &
            real(t2-t1)/real(count_rate), ' sec'
30
times
- can be called from C to obtain CPU time
- 0.01 sec. resolution on p-series
- can also get system time with tms_stime

    #include <stdio.h>
    #include <sys/times.h>   /* times, struct tms */
    #include <unistd.h>      /* sysconf */

    int main(void){
        long tics_per_sec;
        clock_t tic1, tic2;
        struct tms timedat;
        tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
        times(&timedat);
        tic1 = timedat.tms_utime;              /* user CPU time, in tics */
        /* ... do stuff to be timed ... */
        times(&timedat);
        tic2 = timedat.tms_utime;
        printf("CPU time = %5.2f\n",
               (double)(tic2-tic1)/(double)tics_per_sec);
        return 0;
    }
31
gettimeofday
- can be called from C to obtain wall-clock time
- microsecond resolution on p-series

    #include <stdio.h>
    #include <sys/time.h>    /* gettimeofday, struct timeval */

    int main(void){
        struct timeval t;
        double t1, t2;
        gettimeofday(&t, NULL);
        t1 = t.tv_sec + 1.0e-6*t.tv_usec;      /* seconds + microseconds */
        /* ... do stuff to be timed ... */
        gettimeofday(&t, NULL);
        t2 = t.tv_sec + 1.0e-6*t.tv_usec;
        printf("wall-clock time = %5.3f\n", t2-t1);
        return 0;
    }
32
MPI_Wtime
- convenient wall-clock timer for MPI codes
- microsecond resolution on p-series
33
MPI_Wtime (cont’d)
Fortran:
    double precision t1, t2
    t1 = mpi_wtime()
    ... do stuff to be timed ...
    t2 = mpi_wtime()
    print*, 'wall-clock time = ', t2-t1
C:
    double t1, t2;
    t1 = MPI_Wtime();
    /* ... do stuff to be timed ... */
    t2 = MPI_Wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
34
omp_get_wtime
- convenient wall-clock timer for OpenMP codes
- resolution available by calling omp_get_wtick()
- 0.01 sec. resolution on p-series
35
omp_get_wtime (cont’d)
Fortran:
    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()
    ... do stuff to be timed ...
    t2 = omp_get_wtime()
    print*, 'wall-clock time = ', t2-t1
C:
    double t1, t2;
    t1 = omp_get_wtime();
    /* ... do stuff to be timed ... */
    t2 = omp_get_wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
36
Timer Summary
             CPU         Wall
    Fortran  cpu_time    system_clock
    C        times       gettimeofday
    MPI      -           MPI_Wtime
    OpenMP   -           omp_get_wtime
37
Exercise 2
- Put a wall-clock timer around each "level" in the example code
- Print the time for each level
- Compile and run
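One possible way to structure the timers for this exercise in C is sketched below. It is only an illustration: time_section, wall_seconds and dummy_level are hypothetical names, and dummy_level is a placeholder for one "level" of the real code, whose actual routine names are not shown in these slides.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, based on gettimeofday. */
double wall_seconds(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1.0e-6 * t.tv_usec;
}

/* Run fn() once and return the elapsed wall-clock time in seconds. */
double time_section(void (*fn)(void))
{
    assert(fn != NULL);
    double t0 = wall_seconds();
    fn();
    return wall_seconds() - t0;
}

/* Placeholder workload standing in for one level of the example code. */
volatile double sink;   /* keeps the optimizer from deleting the loop */

void dummy_level(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++)
        s += 1.0 / i;
    sink = s;
}
```

Each level could then be timed with a line such as printf("level 1: %.3f s\n", time_section(dummy_level)); replacing dummy_level with the routine of interest. Wrapping a level in a function is a design choice for this sketch; inserting the two timer calls directly around the existing code works equally well.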
38
PROFILING
39
Profilers
- profile tells you how much time is spent in each routine
- gives a level of granularity not available with previous timers
  - e.g., function may be called from many places
- various profilers available, e.g.
  - gprof (GNU)
  - pgprof (Portland Group)
  - Xprofiler (AIX)
40
gprof
- compile with -pg
- file gmon.out will be created when you run
- gprof executable > myprof
- for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
41
gprof (cont’d)
    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

      %   cumulative    self                 self     total
     time   seconds   seconds      calls   ms/call  ms/call  name
     20.5     89.17     89.17         10   8917.00 10918.00  .conduct [5]
      7.6    122.34     33.17        323    102.69   102.69  .getxyz [8]
      7.5    154.77     32.43                                .__mcount [9]
      7.2    186.16     31.39     189880      0.17     0.17  .btri [10]
      7.2    217.33     31.17                                .kickpipes [12]
      5.1    239.58     22.25  309895200      0.00     0.00  .rmnmod [16]
      2.3    249.67     10.09        269     37.51    37.51  .getq [24]
42
gprof (3)
    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                      called/total     parents
    index  %time    self descendents  called+self   name       index
                                      called/total     children

                    0.00      340.50       1/1      .__start [2]
    [1]     78.3    0.00      340.50       1        .main [1]
                    2.12      319.50      10/10     .contrl [3]
                    0.04        7.30      10/10     .force [34]
                    0.00        5.27       1/1      .initia [40]
                    0.56        3.43       1/1      .plot3da [49]
                    0.00        1.27       1/1      .data [73]
43
pgprof
- compile with Portland Group compiler
  - pgf90 (pgf95, etc.)
  - pgcc
  - -Mprof=func
    - similar to -pg
- run code
- pgprof -exe executable
- pops up window with flat profile
44
pgprof (cont’d)
45
pgprof (3)
- To save profile data to a file:
  - re-run pgprof using the -text flag
  - at the command prompt type p > filename
    - filename is the name you want to give the profile file
  - type quit to get out of the profiler
46
Exercise 3
- Use pgprof to profile the code
  - compile using -Mprof=func
  - run code
  - create profile using pgprof -exe gt
- Note which routines use the most time
- Please close pgprof when you’re through
  - Leaving the window open ties up a license
47
Line-Level Profiling
- Times individual lines
- For pgprof, compile with the flag -Mprof=line
- Optimizer will re-order lines
  - profiler will lump lines in some loops or other constructs
  - may want to compile without optimization, may not
- In flat profile, double-click on a function to get line-level data
48
Line-Level Profiling (cont’d)
49
Exercise 4
- Compile code with -Mprof=line and -O0 and run
  - will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization
- Examine the line-level profile for the most time-consuming routine
  - Note lines with longest time consumption
- Save your profile data to a file (we will need it later)
  - re-run pgprof using the -text flag
  - at the command prompt type p > prof
50
CACHE
51
Cache
- Cache is a small chunk of fast memory between the main memory and the registers
[Figure: memory hierarchy showing registers, primary cache, secondary cache, main memory]
52
Cache (cont’d)
- If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
- Variables are moved from main memory to cache in lines
- L1 cache line sizes on our machines
  - Opteron (katana cluster): 64 bytes
  - Xeon (katana cluster): 64 bytes
  - Power4 (p-series): 128 bytes
  - PPC440 (Blue Gene): 32 bytes
  - Pentium III (linux cluster): 32 bytes
53
Cache (3)
- Why not just make the main memory out of the same stuff as cache?
  - Expensive
  - Runs hot
  - This was actually done in Cray computers
    - Liquid cooling system
54
Cache (4)
- Cache hit
  - Required variable is in cache
- Cache miss
  - Required variable is not in cache
  - If cache is full, something else must be thrown out (sent back to main memory) to make room
  - Want to minimize the number of cache misses
55
Cache (5)
    for(i=0; i<10; i++) x[i] = i;
- Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …
- "Mini" cache holds 2 lines, 4 words each
56
Cache (6)
    for(i=0; i<10; i++) x[i] = i;
- will ignore i for simplicity
- need x[0], not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits
- Cache now holds: x[0] x[1] x[2] x[3]
57
Cache (7)
    for(i=0; i<10; i++) x[i] = i;
- need x[4], not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits
- Cache now holds: x[0] x[1] x[2] x[3] and x[4] x[5] x[6] x[7]
58
Cache (8)
    for(i=0; i<10; i++) x[i] = i;
- need x[8], not in cache: cache miss
- load line from memory into cache
- no room in cache!
- replace old line
- Cache now holds: x[4] x[5] x[6] x[7] and x[8] x[9] a b
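The walk-through on the last four slides can be replayed with a toy model in C. This is a sketch under stated assumptions, not a description of real hardware: the function name count_misses is made up, addresses are counted in words, x[0] is assumed to sit at the start of a line, and the oldest line is the one replaced when the cache is full.

```c
#include <assert.h>

#define MAX_LINES 16

/* Count cache misses for a sequence of word addresses in a tiny cache
   with `nlines` lines of `words_per_line` words each. Any memory line
   may go in any cache line; the oldest line is replaced first. */
int count_misses(const int *addr, int n, int nlines, int words_per_line)
{
    assert(nlines > 0 && nlines <= MAX_LINES && words_per_line > 0);
    int tag[MAX_LINES];   /* which memory line each cache line holds */
    int used = 0;         /* cache lines filled so far */
    int next = 0;         /* line to replace once the cache is full */
    int misses = 0;

    for (int k = 0; k < n; k++) {
        int line = addr[k] / words_per_line;
        int hit = 0;
        for (int c = 0; c < used; c++) {
            if (tag[c] == line) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            misses++;
            if (used < nlines) {
                tag[used++] = line;          /* free line available */
            } else {
                tag[next] = line;            /* throw out the oldest line */
                next = (next + 1) % nlines;
            }
        }
    }
    return misses;
}
```

For the loop on the slides (addresses 0 through 9, 2 lines of 4 words) the model gives 3 misses, at x[0], x[4] and x[8], and 7 hits. Visiting addresses 0, 4, 8, 0, 4, 8 instead misses on every access, which is the kind of pattern contiguous access avoids.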
59
Cache (9)
- Contiguous access is important
- In C, a multidimensional array is stored in memory as
    a[0][0] a[0][1] a[0][2] …
60
Cache (10)
- In Fortran and Matlab, a multidimensional array is stored the opposite way:
    a(1,1) a(2,1) a(3,1) …
61
Cache (11)
- Rule: Always order your loops appropriately
  - will usually be taken care of by the optimizer
  - suggestion: don’t rely on the optimizer
C:
    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }
Fortran:
    do j = 1, n
        do i = 1, n
            a(i,j) = 1.0
        enddo
    enddo
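To see the effect of loop order in C, the two traversals can be written as separate functions and timed with one of the wall-clock timers from the Timing section. The sketch below is illustrative; sum_contiguous and sum_strided are hypothetical names, and the array is stored as a flat row-major block so that element (i,j) is a[i*n + j].

```c
#include <assert.h>
#include <stddef.h>

/* Rightmost index in the inner loop: consecutive iterations touch
   adjacent memory locations, so each cache line is fully used. */
double sum_contiguous(const double *a, size_t n)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Loops swapped: each access jumps n elements ahead, so for large n
   nearly every access lands on a different cache line. */
double sum_strided(const double *a, size_t n)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}
```

Both functions visit the same elements, only in a different order. For arrays much larger than the cache, the strided version is typically the slower one, although an optimizing compiler may interchange the loops itself.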
62
TUNING TIPS
63
Tuning Tips
- Some of these tips will be taken care of by compiler optimization
  - It’s best to do them yourself, since compilers vary
- Two important rules
  - minimize number of operations
  - access cache contiguously
64
Tuning Tips (cont’d)
- Access arrays in contiguous order
  - For multi-dimensional arrays, the rightmost index varies fastest for C and C++, the leftmost for Fortran and Matlab
Bad (leftmost index varies fastest in the inner loop):
    for(j=0; j<N; j++){
        for(i=0; i<N; i++){
            a[i][j] = 1.0;
        }
    }
Good (rightmost index varies fastest in the inner loop):
    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }
65
Tuning Tips (3)
- Eliminate redundant operations in loops
Bad:
    for(i=0; i<N; i++){
        x = 10;
        ...
    }
Good:
    x = 10;
    for(i=0; i<N; i++){
        ...
    }
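A slightly less trivial case of the same tip is an expression that is recomputed on every pass although it does not depend on the loop index. The C sketch below is only an illustration; the function names and the particular expression are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Redundant work: the factor is recomputed on every iteration even
   though it does not depend on i. Assumes a + b is nonzero. */
void scale_naive(double *y, const double *x, size_t n, double a, double b)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        double factor = (a * a + b * b) / (a + b);
        y[i] = factor * x[i];
    }
}

/* Loop-invariant factor computed once, before the loop. */
void scale_hoisted(double *y, const double *x, size_t n, double a, double b)
{
    assert(x != NULL && y != NULL);
    const double factor = (a * a + b * b) / (a + b);
    for (size_t i = 0; i < n; i++)
        y[i] = factor * x[i];
}
```

Both versions produce identical results. Many optimizers perform this hoisting automatically, but writing it out by hand removes the dependence on the compiler.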
66
Tuning Tips (4)
- Minimize if statements within loops
  - They may inhibit pipelining
    for(i=0; i<N; i++){
        if(i==0)
            perform i=0 calculations
        else
            perform i>0 calculations
    }
67
Tuning Tips (5)
- Better way:
    perform i=0 calculations
    for(i=1; i<N; i++){
        perform i>0 calculations
    }
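As a concrete instance of this transformation, the C sketch below computes first differences of an array, where only the first element needs special treatment. The function names are hypothetical and the computation was chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Branch inside the loop: i == 0 is tested on every iteration. */
void diff_branch(double *d, const double *x, size_t n)
{
    assert(d != NULL && x != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i == 0)
            d[i] = x[i];
        else
            d[i] = x[i] - x[i - 1];
    }
}

/* First iteration handled before the loop, so the loop body has no
   branch. The n == 0 check keeps the empty case safe. */
void diff_peeled(double *d, const double *x, size_t n)
{
    assert(d != NULL && x != NULL);
    if (n == 0)
        return;
    d[0] = x[0];
    for (size_t i = 1; i < n; i++)
        d[i] = x[i] - x[i - 1];
}
```

One detail that is easy to miss when moving the first iteration out of the loop is the empty case: the version with the branch does nothing when n is 0, so the rewritten version needs an explicit guard to behave the same way.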
68
Tuning Tips (6)
- Divides are expensive
- Intel x86 clock cycles per operation
    add       3-6
    multiply  4-8
    divide    32-45
Bad:
    for(i=0; i<N; i++)
        x[i] = y[i]/scalarval;
Good:
    qs = 1.0/scalarval;
    for(i=0; i<N; i++)
        x[i] = y[i]*qs;
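One caveat with this rewrite is that multiplying by a reciprocal is not always bit-for-bit identical to dividing, because 1.0/scalarval is itself rounded. The C sketch below, with hypothetical function names, puts the two forms side by side so that their results can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* One divide per element. */
void scale_div(double *x, const double *y, size_t n, double s)
{
    assert(x != NULL && y != NULL && s != 0.0);
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] / s;
}

/* One divide in total, then one multiply per element. */
void scale_mul(double *x, const double *y, size_t n, double s)
{
    assert(x != NULL && y != NULL && s != 0.0);
    const double qs = 1.0 / s;   /* reciprocal computed once */
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] * qs;
}
```

The two results can differ in the last bit or so of each element. For most numerical work that difference is negligible, but it explains why a compiler will usually not make this substitution on its own unless relaxed floating-point options are enabled.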
69
Tuning Tips (7)
- There is overhead associated with a function call
Bad:
    for(i=0; i<N; i++)
        myfunc(i);
Good:
    myfunc();

    void myfunc(){
        for(int i=0; i<N; i++){
            /* do stuff */
        }
    }
70
Tuning Tips (8)
- Minimize calls to math functions
Bad:
    for(i=0; i<N; i++)
        z[i] = log(x[i]) + log(y[i]);
Good:
    for(i=0; i<N; i++){
        z[i] = log(x[i] * y[i]);
    }
71
Tuning Tips (9)
- recasting may be costlier than you think
Bad:
    sum = 0.0;
    for(i=0; i<N; i++)
        sum += (float) i;
Good:
    isum = 0;
    for(i=0; i<N; i++)
        isum += i;
    sum = (float) isum;
72
Exercise 5
- The example code that has been provided is written in a clear, readable style that also happens to violate lots of the tuning tips that we have just reviewed.
- Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?
- We will discuss options as a group
  - come up with a strategy
  - modify code
  - re-compile and run
  - compare timings
- Re-examine the line-level profile, come up with another strategy, repeat the procedure, etc.
73
Survey
- Please fill out the survey for this tutorial at http://scv.bu.edu/survey/tutorial_evaluation.html