Code Tuning and Optimization Doug Sondak Boston University Scientific Computing and Visualization

Outline

- Introduction
- Example code
- Timing
- Profiling
- Cache
- Tuning

Introduction

- Timing
  - Where is most time being used?
- Tuning
  - How to speed it up
  - Often as much art as science
- Parallel performance
  - How to assess how well parallelization is working

Example Code

Example Code  Simulation of response of eye to stimuli  Response is affected by adjacent inputs  A dark area next to a bright area makes the bright area look brighter  Based on Grossberg & Todorovic paper  Appendix in paper contains all equations  errors in eqns (A4) and (A5) – cross out “log2”  Paper contains 6 levels of response  Our code only contains levels 1 through 5  Level 6 takes a long time to compute, and would skew our timings! Information Services & Technology 5 10/6/2015

Example Code (cont'd)

- All calculations are done on a square array
- The array size and other constants are defined in gt.h (C) or in the "mods" module at the top of the code (Fortran)
- Due to the nature of the algorithm, the array is padded on all sides
  - npad is the size of the padding

Example Code – Level 1

- Luminance (input) distribution
- The paper (and code) use the "yin-yang square": a bright region next to a dark region (Fig. 4 in the paper)
- Array I
  - magnitude of "bright" is ihigh
  - magnitude of "dark" is ilow
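For orientation, here is a minimal C sketch of this kind of Level 1 input: a padded square array holding a bright region (ihigh) next to a dark region (ilow). The dimensions NSIZE, NPAD, and NTOT and the two magnitudes are assumed values for illustration only; the real constants live in gt.h, and the actual input is the yin-yang square of Fig. 4 rather than a simple left/right split.

  #include <stdio.h>

  #define NSIZE 64                    /* assumed array dimension (real value is in gt.h) */
  #define NPAD   4                    /* assumed padding width (npad in the text)        */
  #define NTOT  (NSIZE + 2*NPAD)      /* padded dimension                                */

  static double I[NTOT][NTOT];        /* luminance (input) distribution                  */

  int main(void){
     double ihigh = 1.0, ilow = 0.1;  /* assumed "bright" and "dark" magnitudes          */
     int i, j;

     /* interior of the padded array: bright on the left, dark on the right */
     for(i=NPAD; i<NTOT-NPAD; i++)
        for(j=NPAD; j<NTOT-NPAD; j++)
           I[i][j] = (j < NTOT/2) ? ihigh : ilow;

     printf("I[%d][%d] = %g\n", NPAD, NPAD, I[NPAD][NPAD]);
     return 0;
  }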

Example Code – Level 2

- Level 2 – Circular Concentric On and Off Units
- Excitation and inhibition vary with distance (Fig. 5 in the paper)

Level 2 Equations

(See the paper's appendix; I_pq is the initial input, the yin-yang square.)

Example Code – Level 3

- Oriented Direction-of-Contrast-Sensitive Units (Fig. 6(d) in the paper)
- Respond to angle
  - 12 discrete angles
- Respond to direction of contrast, i.e., light-to-dark or dark-to-light

Level 3 Equations

(See the paper's appendix.)

Example Code – Level 4

- Oriented Direction-of-Contrast-Insensitive Units (Fig. 8(a) in the paper)
- Respond to angle
- Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light

Level 4 Equations

(See the paper's appendix.)

Example Code – Level 5

- Level 5 – Boundary Contour Units (Fig. 8(d) in the paper)
- Pool nearby excitations

Level 5 Equation

(See the paper's appendix.)

Timing

Timing  When tuning/parallelizing a code, need to assess effectiveness of your efforts  Can time whole code and/or specific sections  Some types of timers  unix time command  function/subroutine calls  profiler Information Services & Technology 17 10/6/2015

CPU Time or Wall-Clock Time?  CPU time  How much time the CPU is actually crunching away  User CPU time  Time spent executing your source code  System CPU time  Time spent in system calls such as i/o  Wall-clock time  What you would measure with a stopwatch Information Services & Technology 18 10/6/2015

CPU Time or Wall-Clock Time? (cont'd)

- Both are useful
- For serial runs without interaction from the keyboard, CPU and wall-clock times are usually close
- If you prompt for keyboard input, wall-clock time will accumulate while you get a cup of coffee, but CPU time will not
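A small stand-alone C sketch (not part of the example code) makes the distinction concrete: the process sleeps for a few seconds, as if waiting at a keyboard prompt, so wall-clock time accumulates while CPU time stays near zero. It uses the standard clock() and time() calls rather than the timers introduced later.

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  int main(void){
     clock_t c1 = clock();            /* CPU time used so far                  */
     time_t  w1 = time(NULL);         /* wall-clock time (1-second resolution) */

     sleep(3);                        /* stand-in for "get a cup of coffee"    */

     clock_t c2 = clock();
     time_t  w2 = time(NULL);

     printf("CPU time  = %.2f s\n", (double)(c2 - c1)/CLOCKS_PER_SEC);
     printf("wall time = %.0f s\n", difftime(w2, w1));
     return 0;
  }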

CPU Time or Wall-Clock Time? (3)

- Parallel runs
  - want wall-clock time, since CPU time will stay about the same or even increase as the number of processors is increased
  - wall-clock time may not be accurate if you are sharing processors
  - wall-clock timings should therefore always be performed in batch mode

Unix Time Command  easiest way to time code  simply type time before your run command  output differs between c-type shells (cshell, tcshell) and Bourne-type shells (bsh, bash, ksh) Information Services & Technology 21 10/6/2015

Unix Time Command (cont'd)

C-shell output:

  twister:~ % time mycode
  1.570u 0.010s 0:... ...% ...k 0+0io 64pf+0w

Fields, from left to right:

- user CPU time (s)
- system CPU time (s)
- wall-clock time (s)
- (u+s)/wc
- avg. shared + unshared text space
- input + output operations
- page faults + no. of times the process was swapped

Unix Time Command (3)

Bourne-shell results:

  $ time mycode

  Real    1.62      wall-clock time (s)
  User    1.57      user CPU time (s)
  System  0.03      system CPU time (s)

Exercise 1  Copy files from /scratch/sondak/gt cp /scratch/sondak/gt/*.  Choose C (gt.c) or Fortran (gt.f90)  Compile with no optimization: pgcc –O0 –o gt gt.cc pgf90 –O0 –o gt gt.f90 Submit rungt script to batch queue qsub rungt Information Services & Technology 24 10/6/2015 capital oh small ohzero

Exercise 1 (cont'd)

- Check the status:
    qstat -u username
- After the run has completed, a file will appear named rungt.o??????, where ?????? represents the process number
- The file contains the result of the time command
- Write down the wall-clock time
- Re-compile using -O3
- Re-run and check the time

Function/Subroutine Calls

- Often need to time part of a code
- Timers can be inserted in the source code
- Language-dependent

cpu_time  intrinsic subroutine in Fortran  returns user CPU time (in seconds)  no system time is included  0.01 sec. resolution on p-series Information Services & Technology 27 10/6/2015 real :: t1, t2 call cpu_time(t1)... do stuff to be timed... call cpu_time(t2) print*, 'CPU time = ', t2-t1, ' sec.'

system_clock  intrinsic subroutine in Fortran  good for measuring wall-clock time  on p-series:  resolution is 0.01 sec.  max. time is 24 hr. Information Services & Technology 28 10/6/2015

system_clock (cont'd)

- t1 and t2 are tic counts
- count_rate is an optional argument containing tics/sec.

  integer :: t1, t2, count_rate
  call system_clock(t1, count_rate)
  ... do stuff to be timed ...
  call system_clock(t2)
  print*, 'wall-clock time = ', &
          real(t2-t1)/real(count_rate), ' sec'

times  can be called from C to obtain CPU time  0.01 sec. resolution on p-series  can also get system time with tms_stime Information Services & Technology 30 10/6/2015 #include void main(){ int tics_per_sec; float tic1, tic2; struct tms timedat; tics_per_sec = sysconf(_SC_CLK_TCK); times(&timedat); tic1 = timedat.tms_utime; … do stuff to be timed … times(&timedat); tic2 = timedat.tms_utime; printf("CPU time = %5.2f\n", (float)(tic2-tic1)/(float)tics_per_sec); }

gettimeofday  can be called from C to obtain wall-clock time   sec resolution on p-series Information Services & Technology 31 10/6/2015 #include void main(){ struct timeval t; double t1, t2; gettimeofday(&t, NULL); t1 = t.tv_sec + 1.0e-6*t.tv_usec; … do stuff to be timed … gettimeofday(&t, NULL); t2 = t.tv_sec + 1.0e-6*t.tv_usec; printf(“wall-clock time = %5.3f\n", t2-t1); }

MPI_Wtime  convenient wall-clock timer for MPI codes   sec resolution on p-series Information Services & Technology 32 10/6/2015

MPI_Wtime (cont'd)

Fortran:

  double precision t1, t2
  t1 = mpi_wtime()
  ... do stuff to be timed ...
  t2 = mpi_wtime()
  print*, 'wall-clock time = ', t2-t1

C:

  double t1, t2;
  t1 = MPI_Wtime();
  ... do stuff to be timed ...
  t2 = MPI_Wtime();
  printf("wall-clock time = %5.3f\n", t2-t1);

omp_get_time  convenient wall-clock timer for OpenMP codes  resolution available by calling omp_get_wtick()  0.01 sec. resolution on p-series Information Services & Technology 34 10/6/2015

omp_get_wtime (cont'd)

Fortran:

  double precision t1, t2, omp_get_wtime
  t1 = omp_get_wtime()
  ... do stuff to be timed ...
  t2 = omp_get_wtime()
  print*, 'wall-clock time = ', t2-t1

C:

  double t1, t2;
  t1 = omp_get_wtime();
  ... do stuff to be timed ...
  t2 = omp_get_wtime();
  printf("wall-clock time = %5.3f\n", t2-t1);

Timer Summary

             CPU        Wall
  Fortran    cpu_time   system_clock
  C          times      gettimeofday
  MPI                   MPI_Wtime
  OpenMP                omp_get_wtime

Exercise 2  Put wall-clock timer around each “level” in the example code  Print time for each level  Compile and run Information Services & Technology 37 10/6/2015

PROFILING

Profilers  profile tells you how much time is spent in each routine  gives a level of granularity not available with previous timers  e.g., function may be called from many places  various profilers available, e.g.  gprof (GNU)  pgprof (Portland Group)  Xprofiler (AIX) Information Services & Technology 39 10/6/2015

gprof  compile with -pg  file gmon.out will be created when you run  gprof executable > myprof  for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof Information Services & Technology 40 10/6/2015

gprof (cont'd)

Flat profile:

  ngranularity: Each sample hit covers 4 bytes. Time: ... seconds

    %   cumulative   self             self    total
  time    seconds   seconds  calls  ms/call  ms/call  name
   ...                                                conduct    [5]
   ...                                                getxyz     [8]
   ...                                                __mcount   [9]
   ...                                                btri       [10]
   ...                                                kickpipes  [12]
   ...                                                rmnmod     [16]
   ...                                                getq       [24]

gprof (3)

Call-graph profile:

  ngranularity: Each sample hit covers 4 bytes. Time: ... seconds

                                    called/total     parents
  index  %time  self  descendents   called+self      name        index
                                    called/total     children

                                        .../1        __start     [2]
  [1]    ...                                         main        [1]
                                        .../10       contrl      [3]
                                        .../10       force       [34]
                                        .../1        initia      [40]
                                        .../1        plot3da     [49]
                                        .../1        data        [73]

pgprof  compile with Portland Group compiler  pgf90 (pgf95, etc.)  pgcc  –Mprof=func  similar to –pg  run code  pgprof –exe executable  pops up window with flat profile Information Services & Technology 43 10/6/2015

pgprof (cont'd)

(Screenshot of the pgprof flat-profile window.)

pgprof (3)  To save profile data to a file:  re-run pgprof using –text flag  at command prompt type p > filename  filename is the name you want to give the profile file  type quit to get out of profiler Information Services & Technology 45 10/6/2015

Exercise 3  Use pgprof to profile code  compile using –Mprof=func  run code  create profile using pgprof –exe gt  Note which routines use most time  Please close pgprof when you’re through  Leaving window open ties up a license Information Services & Technology 46 10/6/2015

Line-Level Profiling  Times individual lines  For pgprof, compile with the flag –Mprof=line  Optimizer will re-order lines  profiler will lump lines in some loops or other constructs  may want to compile without optimization, may not  In flat profile, double-click on function to get line-level data Information Services & Technology 47 10/6/2015

Line-Level Profiling (cont'd)

(Screenshot of the pgprof line-level profile view.)

Exercise 4  Compile code with –Mprof=line and –O0 and run  will take about 5 minutes to run due to overhead from line- level profiling and lack of optimization  Examine line-level profile for most time-consuming routine  Note lines with longest time consumption  Save your profile data to a file (we will need it later)  re-run pgprof using –text flag  at command prompt type p > prof Information Services & Technology 49 10/6/2015

CACHE

Cache  Cache is a small chunk of fast memory between the main memory and the registers Information Services & Technology 51 10/6/2015 secondary cache registers primary cache main memory

Cache (cont'd)

- If variables are used repeatedly, the code will run faster, since cache memory is much faster than main memory
- Variables are moved from main memory to cache in lines
- L1 cache line sizes on our machines:
  - Opteron (katana cluster)       64 bytes
  - Xeon (katana cluster)          64 bytes
  - Power4 (p-series)             128 bytes
  - PPC440 (Blue Gene)             32 bytes
  - Pentium III (linux cluster)    32 bytes
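To see what these line sizes mean for array code, here is a small illustrative sketch (assuming 8-byte doubles) that computes how many array elements travel together in one line. With unit-stride access over a 64-byte line, for example, one miss is followed by seven hits.

  #include <stdio.h>

  int main(void){
     /* elements per L1 line for the line sizes listed above, assuming 8-byte doubles */
     int line_sizes[] = {32, 64, 128};
     int n = sizeof(line_sizes)/sizeof(line_sizes[0]);
     for(int i=0; i<n; i++)
        printf("%3d-byte line: %d doubles per line\n",
               line_sizes[i], line_sizes[i]/(int)sizeof(double));
     return 0;
  }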

Cache (3)

- Why not just make the main memory out of the same stuff as cache?
  - expensive
  - runs hot
- This was actually done in Cray computers
  - liquid cooling system

Cache (4)

- Cache hit
  - required variable is in cache
- Cache miss
  - required variable is not in cache
  - if the cache is full, something else must be thrown out (sent back to main memory) to make room
- Want to minimize the number of cache misses

Cache (5)

Consider a "mini" cache that holds 2 lines of 4 words each, and a main memory holding the array x[0], x[1], ..., x[9] along with other variables a, b, ...:

  for(i=0; i<10; i++)
     x[i] = i;

Cache (6)

- (We will ignore i itself for simplicity)
- Need x[0]; it is not in cache: cache miss
- Load a line (x[0] through x[3]) from memory into cache
- The next 3 loop indices result in cache hits

Cache (7)

- Need x[4]; it is not in cache: cache miss
- Load a line (x[4] through x[7]) from memory into cache
- The next 3 loop indices result in cache hits

Cache (8)

- Need x[8]; it is not in cache: cache miss
- Load a line from memory into cache
- No room in cache! The old line must be replaced

Cache (9)

- Contiguous access is important
- In C, a multidimensional array is stored in memory as

    a[0][0] a[0][1] a[0][2] ...

Cache (10)

- In Fortran and Matlab, a multidimensional array is stored the opposite way:

    a(1,1) a(2,1) a(3,1) ...

Cache (11)

- Rule: always order your loops appropriately
  - will usually be taken care of by the optimizer
  - suggestion: don't rely on the optimizer

C:

  for(i=0; i<N; i++){
     for(j=0; j<N; j++){
        a[i][j] = 1.0;
     }
  }

Fortran:

  do j = 1, n
     do i = 1, n
        a(i,j) = 1.0
     enddo
  enddo
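To measure the effect of loop ordering on your own machine, here is a self-contained C sketch (not part of the example code) that times both orderings with gettimeofday(), introduced earlier; the array size N is an arbitrary choice, and the actual timings will depend on cache size and optimization level.

  #include <stdio.h>
  #include <sys/time.h>

  #define N 2000

  static double a[N][N];

  /* wall-clock seconds, as in the gettimeofday example above */
  static double wtime(void){
     struct timeval t;
     gettimeofday(&t, NULL);
     return t.tv_sec + 1.0e-6*t.tv_usec;
  }

  int main(void){
     double t1, t2;
     int i, j;

     t1 = wtime();                    /* contiguous order for C: i outer, j inner */
     for(i=0; i<N; i++)
        for(j=0; j<N; j++)
           a[i][j] = 1.0;
     t2 = wtime();
     printf("contiguous: %8.4f s\n", t2-t1);

     t1 = wtime();                    /* strided order for C: j outer, i inner */
     for(j=0; j<N; j++)
        for(i=0; i<N; i++)
           a[i][j] = 1.0;
     t2 = wtime();
     printf("strided   : %8.4f s\n", t2-t1);

     return 0;
  }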

TUNING TIPS

Tuning Tips  Some of these tips will be taken care of by compiler optimization  It’s best to do them yourself, since compilers vary  Two important rules  minimize number of operations  access cache contiguously Information Services & Technology 63 10/6/2015

Tuning Tips (cont'd)

- Access arrays in contiguous order
- For multi-dimensional arrays, the rightmost index varies fastest for C and C++, the leftmost for Fortran and Matlab

Bad (strided access in C):

  for(j=0; j<N; j++){
     for(i=0; i<N; i++){
        a[i][j] = 1.0;
     }
  }

Good (contiguous access in C):

  for(i=0; i<N; i++){
     for(j=0; j<N; j++){
        a[i][j] = 1.0;
     }
  }

Tuning Tips (3)

- Eliminate redundant operations in loops

Bad:

  for(i=0; i<N; i++){
     x = 10;
     ...
  }

Good:

  x = 10;
  for(i=0; i<N; i++){
     ...
  }

Tuning Tips (4)

- Minimize if statements within loops
  - they may inhibit pipelining

  for(i=0; i<N; i++){
     if(i==0)
        perform i=0 calculations
     else
        perform i>0 calculations
  }

Tuning Tips (5)

- Better way:

  perform i=0 calculations
  for(i=1; i<N; i++){
     perform i>0 calculations
  }
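A concrete (and purely illustrative) instance of this pattern: the i = 0 case is computed once before the loop, so the loop body carries no branch. The running-sum calculation and the names x and dx are assumptions for the sake of the example, not code from the example program.

  #include <stdio.h>

  #define N 10

  int main(void){
     double x[N], dx = 0.5;           /* illustrative data                       */
     int i;

     x[0] = 0.0;                      /* the i = 0 calculation, done once        */
     for(i=1; i<N; i++)
        x[i] = x[i-1] + dx;           /* only the i > 0 calculation in the loop  */

     printf("x[%d] = %g\n", N-1, x[N-1]);
     return 0;
  }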

Tuning Tips (6)

- Divides are expensive
- Intel x86 clock cycles per operation:
  - add        3-6
  - multiply   4-8
  - divide     considerably more

Bad:

  for(i=0; i<N; i++)
     x[i] = y[i]/scalarval;

Good:

  qs = 1.0/scalarval;
  for(i=0; i<N; i++)
     x[i] = y[i]*qs;

Tuning Tips (7)

- There is overhead associated with a function call

Bad:

  for(i=0; i<N; i++)
     myfunc(i);

Good:

  myfunc();

  void myfunc(){
     for(int i=0; i<N; i++){
        do stuff
     }
  }

Tuning Tips (8)

- Minimize calls to math functions

Bad:

  for(i=0; i<N; i++)
     z[i] = log(x[i]) + log(y[i]);

Good:

  for(i=0; i<N; i++)
     z[i] = log(x[i] * y[i]);

Tuning Tips (9)

- Recasting may be costlier than you think

Bad:

  sum = 0.0;
  for(i=0; i<N; i++)
     sum += (float) i;

Good:

  isum = 0;
  for(i=0; i<N; i++)
     isum += i;
  sum = (float) isum;

Exercise 5  The example code that has been provided is written in a clear, readable style, that also happens to violate lots of the tuning tips that we have just reviewed.  Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?  We will discuss options as a group  come up with a strategy  modify code  re-compile and run  compare timings  Re-examine line level profile, come up with another strategy, repeat procedure, etc. Information Services & Technology 72 10/6/2015

Survey  Please fill out the survey for this tutorial at Information Services & Technology 73 10/6/2015