Code Tuning and Optimization

Presentation transcript:

Code Tuning and Optimization Kadin Tseng Boston University Scientific Computing and Visualization

Code Tuning and Optimization Outline
Introduction
Timing
Example Code
Profiling
Cache
Tuning
Parallel Performance

Code Tuning and Optimization Introduction
Timing: where is most time being used?
Tuning: how to speed it up; often as much art as science
Parallel Performance: how to assess how well parallelization is working

Code Tuning and Optimization Timing

Code Tuning and Optimization Timing
When tuning/parallelizing a code, need to assess effectiveness of your efforts
Can time whole code and/or specific sections
Some types of timers:
  unix time command
  function/subroutine calls
  profiler

Code Tuning and Optimization CPU Time or Wall-Clock Time?
CPU time: how much time the CPU is actually crunching away
  User CPU time: time spent executing your source code
  System CPU time: time spent in system calls such as i/o
Wall-clock time: what you would measure with a stopwatch
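As a quick illustration of the difference (not part of the original slides), here is a minimal C sketch that measures both CPU time via clock() and wall-clock time via gettimeofday() around a sleep call; during sleep() wall-clock time advances while CPU time barely does.

#include <stdio.h>
#include <time.h>       /* clock, CLOCKS_PER_SEC */
#include <sys/time.h>   /* gettimeofday */
#include <unistd.h>     /* sleep */

int main(void){
   struct timeval w0, w1;
   clock_t c0, c1;

   gettimeofday(&w0, NULL);          /* start wall clock */
   c0 = clock();                     /* start CPU clock  */

   sleep(2);                         /* idle: wall-clock time passes, almost no CPU time used */

   c1 = clock();                     /* stop CPU clock  */
   gettimeofday(&w1, NULL);          /* stop wall clock */

   double wall = (w1.tv_sec - w0.tv_sec) + 1.0e-6*(w1.tv_usec - w0.tv_usec);
   double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
   printf("wall-clock time = %5.2f s, CPU time = %5.2f s\n", wall, cpu);
   return 0;
}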

Code Tuning and Optimization CPU Time or Wall-Clock Time? (cont’d)
Both are useful
For serial runs without interaction from the keyboard, CPU and wall-clock times are usually close
If you prompt for keyboard input, wall-clock time will keep accumulating while you get a cup of coffee, but CPU time will not

Code Tuning and Optimization CPU Time or Wall-Clock Time? (3)
Parallel runs
  want wall-clock time, since CPU time will be about the same or even increase as the number of procs. is increased
  wall-clock time may not be accurate if sharing processors
  wall-clock timings should always be performed in batch mode

Code Tuning and Optimization Unix Time Command
easiest way to time a code
simply type time before your run command
output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)

Code Tuning and Optimization Unix Time Command (cont’d)
C-shell output:

katana:~ % time mycode
1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

  1.570u    user CPU time (s)
  0.010s    system CPU time (s)
  0:01.77   wall-clock time (s)
  89.2%     (u+s)/wc
  75+1450k  avg. shared + unshared text space
  0+0io     input + output operations
  64pf+0w   page faults + no. times proc. was swapped

Code Tuning and Optimization Unix Time Command (3)
Bourne shell results:

$ time mycode
real 0m1.62s    wall-clock time
user 0m1.57s    user CPU time
sys  0m0.03s    system CPU time

Code Tuning and Optimization Example Code

Code Tuning and Optimization Example Code
Simulation of the response of the eye to stimuli (CNS Dept.)
Based on the Grossberg & Todorovic paper
Contains 6 levels of response
  our code only contains levels 1 through 5
  level 6 takes a long time to compute, and would skew our timings!

Code Tuning and Optimization Example Code (cont’d)
All calculations are done on a square array
Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)

Code Tuning and Optimization Level 1 Equations
Computational domain is a square
Defines square array I over the domain (initial condition)
(slide shows the Level 1 equations and a bright/dark stimulus figure)

Code Tuning and Optimization Level 2 Equations
I_pq = initial condition (equations shown on slide)

Code Tuning and Optimization Level 3 Equations

Code Tuning and Optimization Level 4 Equations

Code Tuning and Optimization Level 5 Equation

Code Tuning and Optimization Exercise 1
Copy files from the /scratch disk:
  katana% cp /scratch/kadin/tuning/* .
Choose C (gt.c and gt.h) or Fortran (gt.f90)
Compile with no optimization ("small oh" for the output flag, "capital oh zero" for the optimization level):
  pgcc -O0 -o gt gt.c
  pgf90 -O0 -o gt gt.f90
Submit the rungt script to the batch queue:
  katana% qsub -b y rungt

Code Tuning and Optimization Exercise 1 (cont’d)
Check status: qstat -u username
After the run has completed, a file will appear named rungt.o??????, where ?????? represents the job number
  the file contains the result of the time command
Write down the wall-clock time
Re-compile using -O3
Re-run and check the time

Code Tuning and Optimization Function/Subroutine Calls
often need to time part of a code
timers can be inserted in the source code
language-dependent

Code Tuning and Optimization cpu_time
intrinsic subroutine in Fortran
returns user CPU time (in seconds)
  no system time is included

real :: t1, t2
call cpu_time(t1)   ! start timer
... perform computation here ...
call cpu_time(t2)   ! stop timer
print*, 'CPU time = ', t2-t1, ' sec.'

Code Tuning and Optimization system_clock intrinsic subroutine in Fortran good for measuring wall-clock time

Code Tuning and Optimization system_clock (cont’d)
t1 and t2 are tic counts
count_rate is an optional argument containing tics/sec.

integer :: t1, t2, count_rate
call system_clock(t1, count_rate)   ! start clock
... perform computation here ...
call system_clock(t2)               ! stop clock
print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'

Code Tuning and Optimization times
can be called from C to obtain CPU time

#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>
int main(void){
   int tics_per_sec;
   float tic1, tic2;
   struct tms timedat;
   tics_per_sec = sysconf(_SC_CLK_TCK);
   times(&timedat);              // start clock
   tic1 = timedat.tms_utime;
   ... perform computation here ...
   times(&timedat);              // stop clock
   tic2 = timedat.tms_utime;
   printf("CPU time = %5.2f\n",
          (float)(tic2-tic1)/(float)tics_per_sec);
   return 0;
}

can also get system time with tms_stime

Code Tuning and Optimization gettimeofday
can be called from C to obtain wall-clock time

#include <stdio.h>
#include <sys/time.h>
int main(void){
   struct timeval t;
   double t1, t2;
   gettimeofday(&t, NULL);   // start clock
   t1 = t.tv_sec + 1.0e-6*t.tv_usec;
   ... perform computation here ...
   gettimeofday(&t, NULL);   // stop clock
   t2 = t.tv_sec + 1.0e-6*t.tv_usec;
   printf("wall-clock time = %5.3f\n", t2-t1);
   return 0;
}

Code Tuning and Optimization MPI_Wtime convenient wall-clock timer for MPI codes

Code Tuning and Optimization MPI_Wtime (cont’d)

Fortran:
double precision t1, t2
t1 = mpi_wtime()   ! start clock
... perform computation here ...
t2 = mpi_wtime()   ! stop clock
print*, 'wall-clock time = ', t2-t1

C:
double t1, t2;
t1 = MPI_Wtime();   // start clock
... perform computation here ...
t2 = MPI_Wtime();   // stop clock
printf("wall-clock time = %5.3f\n", t2-t1);

Code Tuning and Optimization omp_get_wtime
convenient wall-clock timer for OpenMP codes
resolution available by calling omp_get_wtick()

Code Tuning and Optimization omp_get_wtime (cont’d)

Fortran:
double precision t1, t2, omp_get_wtime
t1 = omp_get_wtime()   ! start clock
... perform computation here ...
t2 = omp_get_wtime()   ! stop clock
print*, 'wall-clock time = ', t2-t1

C:
double t1, t2;
t1 = omp_get_wtime();   // start clock
... perform computation here ...
t2 = omp_get_wtime();   // stop clock
printf("wall-clock time = %5.3f\n", t2-t1);

Code Tuning and Optimization Timer Summary

          CPU        Wall-clock
Fortran   cpu_time   system_clock
C         times      gettimeofday
MPI                  MPI_Wtime
OpenMP               omp_get_wtime

Code Tuning and Optimization Exercise 2 Put wall-clock timer around each “level” in the example code Print time for each level Compile and run

Code Tuning and Optimization Profiling

Code Tuning and Optimization Profilers
a profile tells you how much time is spent in each routine
gives a level of granularity not available with the previous timers
  e.g., a function may be called from many places
various profilers are available, e.g.
  gprof (GNU) -- function-level profiling
  pgprof (Portland Group) -- function- and line-level profiling

Code Tuning and Optimization gprof
compile with -pg
when you run the executable, the file gmon.out will be created
gprof executable > myprof
  this processes gmon.out into myprof
for multiple processes (MPI), copy or link gmon.out.n to gmon.out, then run gprof

Code Tuning and Optimization gprof (cont’d)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %   cumulative    self               self     total
 time   seconds   seconds     calls   ms/call  ms/call  name
 20.5     89.17     89.17        10   8917.00 10918.00  .conduct [5]
  7.6    122.34     33.17       323    102.69   102.69  .getxyz [8]
  7.5    154.77     32.43                               .__mcount [9]
  7.2    186.16     31.39    189880      0.17     0.17  .btri [10]
  7.2    217.33     31.17                               .kickpipes [12]
  5.1    239.58     22.25 309895200      0.00     0.00  .rmnmod [16]
  2.3    249.67     10.09       269     37.51    37.51  .getq [24]

Code Tuning and Optimization gprof (3)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                  called/total     parents
index  %time   self  descendents  called+self   name          index
                                  called/total     children
               0.00      340.50      1/1        .__start [2]
[1]     78.3   0.00      340.50      1          .main [1]
               2.12      319.50     10/10       .contrl [3]
               0.04        7.30     10/10       .force [34]
               0.00        5.27      1/1        .initia [40]
               0.56        3.43      1/1        .plot3da [49]
               0.00        1.27      1/1        .data [73]

Code Tuning and Optimization pgprof
compile with a Portland Group compiler: pgf90 (pgf95, etc.) or pgcc
  use -Mprof=func (similar to -pg)
run the code
pgprof -exe executable
  pops up a window with a flat profile

Code Tuning and Optimization pgprof (cont’d)

Code Tuning and Optimization pgprof (3)
To save profile data to a file:
  re-run pgprof using the -text flag
  at the command prompt type p > filename
    filename is the name you want to give the profile file
  type quit to get out of the profiler
Close pgprof as soon as you’re through
  leaving the window open ties up a license (only a few available)

Code Tuning and Optimization Line-Level Profiling
Times individual lines
For pgprof, compile with the flag -Mprof=line
The optimizer will re-order lines
  the profiler will lump lines in some loops or other constructs
  you may want to compile without optimization, or you may not
In the flat profile, double-click on a function to get line-level data

Code Tuning and Optimization Line-Level Profiling (cont’d)

Code Tuning and Optimization Cache

Code Tuning and Optimization Cache
Cache is a small chunk of fast memory between the main memory and the registers
(memory hierarchy: registers -> primary cache -> secondary cache -> main memory)

Code Tuning and Optimization Cache (cont’d)
If variables are used repeatedly, the code will run faster, since cache memory is much faster than main memory
Variables are moved from main memory to cache in lines
L1 cache line sizes on our machines:
  Opteron (katana cluster)      64 bytes
  Xeon (katana cluster)         64 bytes
  Power4 (p-series)            128 bytes
  PPC440 (Blue Gene)            32 bytes
  Pentium III (linux cluster)   32 bytes
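On Linux systems with glibc you can often query the L1 line size at run time; a minimal sketch (not from the original slides, and the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc-specific assumption, so it may not exist elsewhere):

#include <stdio.h>
#include <unistd.h>

int main(void){
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
   long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);   /* bytes per L1 data-cache line */
   if (line > 0)
      printf("L1 data cache line size = %ld bytes\n", line);
   else
      printf("L1 line size not reported by this system\n");
#else
   printf("_SC_LEVEL1_DCACHE_LINESIZE not available on this system\n");
#endif
   return 0;
}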

Code Tuning and Optimization Cache (3)
Why not just make the main memory out of the same stuff as cache?
  expensive
  runs hot
  this was actually done in Cray computers (with a liquid cooling system)

Code Tuning and Optimization Cache (4)
Cache hit: the required variable is in cache
Cache miss: the required variable is not in cache
  if the cache is full, something else must be thrown out (sent back to main memory) to make room
Want to minimize the number of cache misses

Code Tuning and Optimization Cache (5)
Example: a “mini” cache that holds 2 lines, 4 words each

for(i=0; i<10; i++)
   x[i] = i;

(diagram: main memory holds x[0] through x[9] plus variables a and b; the cache starts out empty)

Code Tuning and Optimization Cache (6)
(ignoring i for simplicity)
need x[0], not in cache: cache miss
load the line x[0] x[1] x[2] x[3] from memory into cache
the next 3 loop indices result in cache hits

Code Tuning and Optimization Cache (7)
need x[4], not in cache: cache miss
load the line x[4] x[5] x[6] x[7] from memory into cache
the next 3 loop indices result in cache hits

Code Tuning and Optimization Cache (8)
need x[8], not in cache: cache miss
no room in cache! replace an old line with the line containing x[8] x[9]
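To make the walkthrough concrete, here is a small sketch (not from the original slides) that simulates the 2-line, 4-word mini cache for this loop and counts hits and misses; it assumes x[0] starts on a line boundary and that the older line is evicted when the cache is full.

#include <stdio.h>

#define WORDS_PER_LINE  4
#define NUM_CACHE_LINES 2

int main(void){
   int cached[NUM_CACHE_LINES];   /* which memory line each cache line holds */
   int oldest = 0;                /* index of the line to evict next         */
   int hits = 0, misses = 0;

   for (int j = 0; j < NUM_CACHE_LINES; j++) cached[j] = -1;   /* cache starts empty */

   for (int i = 0; i < 10; i++) {          /* the loop: x[i] = i           */
      int line = i / WORDS_PER_LINE;       /* memory line containing x[i]  */
      int hit = 0;
      for (int j = 0; j < NUM_CACHE_LINES; j++)
         if (cached[j] == line) hit = 1;
      if (hit) {
         hits++;
      } else {
         misses++;                         /* load line, evicting the oldest */
         cached[oldest] = line;
         oldest = (oldest + 1) % NUM_CACHE_LINES;
      }
   }
   printf("hits = %d, misses = %d\n", hits, misses);   /* expect 7 hits, 3 misses */
   return 0;
}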

Code Tuning and Optimization Cache (9) Contiguous access is important In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2] …

Code Tuning and Optimization Cache (10) In Fortran and Matlab, multidimensional array is stored the opposite way: a(1,1) a(2,1) a(3,1) …

Code Tuning and Optimization Cache (11)
Rule: always order your loops appropriately
  will usually be taken care of by the optimizer
  suggestion: don’t rely on the optimizer

C:
for(i=0; i<N; i++){
   for(j=0; j<N; j++){
      a[i][j] = 1.0;
   }
}

Fortran:
do j = 1, n
   do i = 1, n
      a(i,j) = 1.0
   enddo
enddo
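As a rough illustration (not part of the original slides; the array size N is arbitrary), the following sketch times row-order versus column-order initialization of a C array using gettimeofday, as introduced earlier; on most cache-based machines the row-order loop is noticeably faster.

#include <stdio.h>
#include <sys/time.h>

#define N 2000

static double a[N][N];

/* wall-clock seconds, as in the gettimeofday slide */
static double wtime(void){
   struct timeval t;
   gettimeofday(&t, NULL);
   return t.tv_sec + 1.0e-6*t.tv_usec;
}

int main(void){
   double t1, t2;
   int i, j;

   t1 = wtime();                      /* row order: contiguous access in C */
   for(i=0; i<N; i++)
      for(j=0; j<N; j++)
         a[i][j] = 1.0;
   t2 = wtime();
   printf("row order:    %6.3f s\n", t2-t1);

   t1 = wtime();                      /* column order: strided access */
   for(j=0; j<N; j++)
      for(i=0; i<N; i++)
         a[i][j] = 1.0;
   t2 = wtime();
   printf("column order: %6.3f s\n", t2-t1);
   return 0;
}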

Code Tuning and Optimization Tuning Tips

Code Tuning and Optimization Tuning Tips
Some of these tips will be taken care of by compiler optimization
  it’s best to do them yourself, since compilers vary
Two important rules:
  minimize the number of operations
  access cache contiguously

Code Tuning and Optimization Tuning Tips (cont’d)
Access arrays in contiguous order
  for multi-dimensional arrays, the rightmost index varies fastest for C and C++, the leftmost for Fortran and Matlab

Bad:
for(j=0; j<N; j++){
   for(i=0; i<N; i++){
      a[i][j] = 1.0;
   }
}

Good:
for(i=0; i<N; i++){
   for(j=0; j<N; j++){
      a[i][j] = 1.0;
   }
}

Code Tuning and Optimization Tuning Tips (3)
Eliminate redundant operations in loops

Bad:
for(i=0; i<N; i++){
   x = 10;
   ...
}

Good:
x = 10;
for(i=0; i<N; i++){
   ...
}

Code Tuning and Optimization Tuning Tips (4)
Eliminate or minimize if statements within loops
  an if may inhibit pipelining

Bad:
for(i=0; i<N; i++){
   if(i == 0)
      perform i=0 calculations
   else
      perform i>0 calculations
}

Good:
perform i=0 calculations
for(i=1; i<N; i++){
   perform i>0 calculations
}

Code Tuning and Optimization Tuning Tips (5)
Divides are expensive
Intel x86 clock cycles per operation:
  add       3-6
  multiply  4-8
  divide   32-45

Bad:
for(i=0; i<N; i++) {
   x[i] = y[i]/scalarval;
}

Good:
qs = 1.0/scalarval;
for(i=0; i<N; i++) {
   x[i] = y[i]*qs;
}

Code Tuning and Optimization Tuning Tips (6)
There is overhead associated with a function call

Bad:
for(i=0; i<N; i++)
   myfunc(i);

Good:
void myfunc( ){
   for(int i=0; i<N; i++){
      do stuff
   }
}
...
myfunc( );

Code Tuning and Optimization Tuning Tips (7)
Minimize calls to math functions (e.g., use log(a) + log(b) = log(a*b) to replace two log calls with one)

Bad:
for(i=0; i<N; i++)
   z[i] = log(x[i]) + log(y[i]);

Good:
for(i=0; i<N; i++)
   z[i] = log(x[i] * y[i]);

Code Tuning and Optimization Tuning Tips (8)
Recasting may be costlier than you think

Bad:
sum = 0.0;
for(i=0; i<N; i++)
   sum += (float) i;

Good:
isum = 0;
for(i=0; i<N; i++)
   isum += i;
sum = (float) isum;

Code Tuning and Optimization Exercise 3 (not in class)
The example code provided is written in a clear, readable style that also happens to violate many of the tuning tips we have just reviewed.
Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?
We will discuss options as a group:
  come up with a strategy
  modify the code
  re-compile and run
  compare timings
Re-examine the line-level profile, come up with another strategy, repeat the procedure, etc.

Code Tuning and Optimization Speedup Ratio and Parallel Efficiency
S is the ratio of T1 over TN, the elapsed times with 1 and N workers; f is the fraction of T1 due to code sections that are not parallelizable:
  S = T1 / TN = 1 / (f + (1 - f)/N)
Amdahl’s Law above states that a code with its parallelizable component comprising 90% of total computation time (f = 0.1) can at best achieve a 10X speedup with lots of workers. A code that is 50% parallelizable speeds up two-fold with lots of workers.
The parallel efficiency is E = S / N
A program that scales linearly (S = N) has parallel efficiency 1.
A task-parallel program is usually more efficient than a data-parallel program.
Parallel codes can sometimes achieve super-linear behavior due to efficient cache usage per worker.
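A small sketch (not from the original slides; the speedup helper and N = 1000 as a stand-in for “lots of workers” are illustrative choices) that evaluates the Amdahl formula above for the two cases mentioned, f = 0.1 and f = 0.5:

#include <stdio.h>

/* Amdahl's Law: speedup with nworkers when a fraction f of the
   single-worker time is not parallelizable */
double speedup(double f, double nworkers){
   return 1.0 / (f + (1.0 - f)/nworkers);
}

int main(void){
   double fvals[] = {0.1, 0.5};        /* serial fractions from the slide */
   double n = 1000.0;                  /* "lots of workers"               */
   for (int i = 0; i < 2; i++){
      double s = speedup(fvals[i], n);
      printf("f = %.1f: S = %6.2f, efficiency E = S/N = %.4f\n",
             fvals[i], s, s/n);
   }
   return 0;
}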

Code Tuning and Optimization Example of Speedup Ratio & Parallel Efficiency