
Code Tuning and Parallelization on Boston University's Scientific Computing Facility Doug Sondak, Boston University Scientific Computing and Visualization

Outline Introduction Timing Profiling Cache Tuning Timing/profiling exercise Parallelization

Introduction Tuning –Where is most time being used? –How to speed it up Often as much art as science Parallelization –After serial tuning, try parallel processing –MPI –OpenMP

Timing

When tuning/parallelizing a code, need to assess effectiveness of your efforts Can time whole code and/or specific sections Some types of timers –unix time command –function/subroutine calls –profiler

CPU or Wall-Clock Time? both are useful for parallel runs, really want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased CPU time doesn't account for wait time wall-clock time may not be accurate if sharing processors –wall-clock timings should always be performed in batch mode

Unix Time Command easiest way to time code simply type time before your run command output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)

Unix Time Command (contd) tcsh results:
twister:~ % time mycode
1.570u 0.010s 0:… …% …k 0+0io 64pf+0w
Field meanings: user CPU time (s), system CPU time (s), wall-clock time (s), (u+s)/wall-clock, avg. shared + unshared text space, input + output operations, page faults + no. of times the process was swapped

Unix Time Command (3) bsh results:
$ time mycode
Real 1.62 (wall-clock time, s)
User 1.57 (user CPU time, s)
System 0.03 (system CPU time, s)

Function/Subroutine Calls often need to time part of code timers can be inserted in source code language-dependent

cpu_time intrinsic subroutine in Fortran returns user CPU time (in seconds) –no system time is included 0.01 sec. resolution on p-series
real :: t1, t2
call cpu_time(t1)
! ... do stuff to be timed ...
call cpu_time(t2)
print*, 'CPU time = ', t2-t1, ' sec.'

system_clock intrinsic subroutine in Fortran good for measuring wall-clock time on p-series: –resolution is 0.01 sec. –max. time is 24 hr.

system_clock (contd) t1 and t2 are tic counts count_rate is optional argument containing tics/sec.
integer :: t1, t2, count_rate
call system_clock(t1, count_rate)
! ... do stuff to be timed ...
call system_clock(t2)
print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec.'

times can be called from C to obtain CPU time 0.01 sec. resolution on p-series can also get system time with tms_stime
#include <sys/times.h>
#include <unistd.h>
#include <stdio.h>
void main(){
  int tics_per_sec;
  float tic1, tic2;
  struct tms timedat;
  tics_per_sec = sysconf(_SC_CLK_TCK);
  times(&timedat);
  tic1 = timedat.tms_utime;
  /* ... do stuff to be timed ... */
  times(&timedat);
  tic2 = timedat.tms_utime;
  printf("CPU time = %5.2f\n",
         (float)(tic2-tic1)/(float)tics_per_sec);
}

gettimeofday can be called from C to obtain wall-clock time 10⁻⁶ sec resolution on p-series
#include <sys/time.h>
#include <stdio.h>
void main(){
  struct timeval t;
  double t1, t2;
  gettimeofday(&t, NULL);
  t1 = t.tv_sec + 1.0e-6*t.tv_usec;
  /* ... do stuff to be timed ... */
  gettimeofday(&t, NULL);
  t2 = t.tv_sec + 1.0e-6*t.tv_usec;
  printf("wall-clock time = %5.3f\n", t2-t1);
}

MPI_Wtime convenient wall-clock timer for MPI codes 10⁻⁶ sec resolution on p-series

MPI_Wtime (contd)
Fortran:
double precision t1, t2
t1 = mpi_wtime()
! ... do stuff to be timed ...
t2 = mpi_wtime()
print*, 'wall-clock time = ', t2-t1
C:
double t1, t2;
t1 = MPI_Wtime();
/* ... do stuff to be timed ... */
t2 = MPI_Wtime();
printf("wall-clock time = %5.3f\n", t2-t1);

omp_get_wtime convenient wall-clock timer for OpenMP codes resolution available by calling omp_get_wtick() 0.01 sec. resolution on p-series

omp_get_wtime (contd)
Fortran:
double precision t1, t2, omp_get_wtime
t1 = omp_get_wtime()
! ... do stuff to be timed ...
t2 = omp_get_wtime()
print*, 'wall-clock time = ', t2-t1
C:
double t1, t2;
t1 = omp_get_wtime();
/* ... do stuff to be timed ... */
t2 = omp_get_wtime();
printf("wall-clock time = %5.3f\n", t2-t1);

Timer Summary
          CPU        Wall
Fortran   cpu_time   system_clock
C         times      gettimeofday
MPI                  MPI_Wtime
OpenMP               omp_get_wtime

Profiling

Profilers profile tells you how much time is spent in each routine various profilers available, e.g. –gprof (GNU) –pgprof (Portland Group) –Xprofiler (AIX)

gprof compile with -pg file gmon.out will be created when you run the code gprof executable > myprof for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof

gprof (contd) call-graph profile: ngranularity: Each sample hit covers 4 bytes. Time: seconds. Columns: index, %time, self, descendents, called/total (parents/children), called+self, name. Sample entries: main [1], called from __start [2]; children of main: contrl [3], force [34], initia [40], plot3da [49], data [73]

gprof (3) flat profile: ngranularity: Each sample hit covers 4 bytes. Time: seconds. Columns: %time, cumulative seconds, self seconds, calls, self ms/call, total ms/call, name. Sample entries (most expensive first): conduct [5], getxyz [8], __mcount [9], btri [10], kickpipes [12], rmnmod [16], getq [24]

pgprof compile with Portland Group compiler –pgf95 (pgf90, etc.) –pgcc –use –Mprof=func (similar to –pg) run code pgprof –exe executable pops up window with flat profile

pgprof (contd)

pgprof (3) line-level profiling –use –Mprof=line –optimizer will re-order lines –profiler will lump lines in some loops or other constructs –may want to compile without optimization, may not –in flat profile, double-click on function

pgprof (4)

xprofiler AIX (twister) has a graphical interface to gprof compile with -g -pg -Ox –Ox represents whatever level of optimization you're using (e.g., -O5) run code –produces gmon.out file type xprofiler mycode –mycode is your code run command

xprofiler (contd)

xprofiler (3) filled boxes represent functions or subroutines fences represent libraries left-click a box to get function name and timing information right-click on box to get source code or other information

xprofiler (4) can also get same profiles as from gprof by using menus –report flat profile –report call graph profile

Cache

Cache is a small chunk of fast memory between the main memory and the registers (hierarchy: registers – primary cache – secondary cache – main memory)

Cache (contd) Variables are moved from main memory to cache in lines –L1 cache line sizes on our machines Opteron (katana cluster) 64 bytes Power4 (p-series) 128 bytes PPC440 (Blue Gene) 32 bytes Pentium III (linux cluster) 32 bytes If variables are used repeatedly, code will run faster since cache memory is much faster than main memory

Cache (contd) Why not just make the main memory out of the same stuff as cache? –Expensive –Runs hot –This was actually done in Cray computers Liquid cooling system

Cache (contd) Cache hit –Required variable is in cache Cache miss –Required variable not in cache –If cache is full, something else must be thrown out (sent back to main memory) to make room –Want to minimize number of cache misses

Cache example mini cache holds 2 lines, 4 words each for(i=0; i<10; i++) x[i] = i; Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …

Cache example (contd) for(i=0; i<10; i++) x[i] = i; (we will ignore i for simplicity) need x[0], not in cache cache miss load line from memory into cache (cache now holds x[0] x[1] x[2] x[3]) next 3 loop indices result in cache hits

Cache example (contd) for(i=0; i<10; i++) x[i] = i; need x[4], not in cache cache miss load line from memory into cache (cache now holds x[0] x[1] x[2] x[3] and x[4] x[5] x[6] x[7]) next 3 loop indices result in cache hits

Cache example (contd) for(i=0; i<10; i++) x[i] = i; need x[8], not in cache cache miss load line from memory into cache no room in cache! replace old line (cache now holds x[8] x[9] a b and x[4] x[5] x[6] x[7])

Cache (contd) Contiguous access is important In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2] …

Cache (contd) In Fortran and Matlab, multidimensional array is stored the opposite way: a(1,1) a(2,1) a(3,1) …

Cache (contd) Rule: Always order your loops appropriately –will usually be taken care of by optimizer –suggestion: don't rely on optimizer!
C:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}
Fortran:
do j = 1, n
  do i = 1, n
    a(i,j) = 1.0
  enddo
enddo
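To tie the loop-ordering rule back to the timers above, here is a small illustrative C sketch (the array size N and the use of gettimeofday are choices made for this example, not taken from the slides) that times contiguous versus strided traversal of a 2-D array; compile without aggressive optimization so the compiler does not interchange the loops for you:

#include <stdio.h>
#include <sys/time.h>

#define N 2000                        /* example size; ~32 MB of doubles */

static double a[N][N];

static double now(void) {             /* wall-clock seconds via gettimeofday */
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1.0e-6 * t.tv_usec;
}

int main(void) {
    double t1, t2;

    t1 = now();                       /* contiguous: rightmost index varies fastest */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    t2 = now();
    printf("row order:    %f s\n", t2 - t1);

    t1 = now();                       /* strided: leftmost index varies fastest */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2.0;
    t2 = now();
    printf("column order: %f s\n", t2 - t1);

    return 0;
}

On machines where the array does not fit in cache, the column-order loop typically runs several times slower.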

Tuning Tips

Some of these tips will be taken care of by compiler optimization –It's best to do them yourself, since compilers vary

Tuning Tips (contd) Access arrays in contiguous order –For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab
Bad:
for(j=0; j<N; j++){
  for(i=0; i<N; i++){
    a[i][j] = 1.0;
  }
}
Good:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}

Tuning Tips (3) Eliminate redundant operations in loops
Bad:
for(i=0; i<N; i++){
  x = 10;
  …
}
Good:
x = 10;
for(i=0; i<N; i++){
  …
}

Tuning Tips (4) Eliminate if statements within loops They may inhibit pipelining for(i=0; i<N; i++){ if(i==0) perform i=0 calculations else perform i>0 calculations }

Tuning Tips (5) Better way perform i=0 calculations for(i=1; i<N; i++){ perform i>0 calculations }

Tuning Tips (6) Divides cost far more than multiplies or adds –Often order of magnitude difference!
Bad:
for(i=0; i<N; i++)
  x[i] = y[i]/scalarval;
Good:
qs = 1.0/scalarval;
for(i=0; i<N; i++)
  x[i] = y[i]*qs;

Tuning Tips (7) There is overhead associated with a function call
Bad:
for(i=0; i<N; i++)
  myfunc(i);
Good:
myfunc();

void myfunc(){
  for(int i=0; i<N; i++){
    /* do stuff */
  }
}

Tuning Tips (8) There is overhead associated with a function call Minimize calls to math functions
Bad:
for(i=0; i<N; i++)
  z[i] = log(x[i]) + log(y[i]);
Good:
for(i=0; i<N; i++)
  z[i] = log(x[i]*y[i]);

Tuning Tips (9) Recasting may be costlier than you think
Bad:
sum = 0.0;
for(i=0; i<N; i++)
  sum += (float) i;
Good:
isum = 0;
for(i=0; i<N; i++)
  isum += i;
sum = (float) isum;

Parallelization

Introduction MPI & OpenMP Performance metrics Amdahl's Law

Introduction Divide and conquer! –divide operations among many processors –perform operations simultaneously –if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right? not so easy, of course

Introduction (contd) problem – some calculations depend upon previous calculations –can't be performed simultaneously –sometimes tied to the physics of the problem, e.g., time evolution of a system want to maximize amount of parallel code –occasionally easy –usually requires some work

Introduction (3) method used for parallelization may depend on hardware –distributed memory: proc 0, proc 1, proc 2, proc 3, each with its own mem 0, mem 1, mem 2, mem 3 –shared memory: proc 0, proc 1, proc 2, proc 3 all connected to a single mem –mixed memory: proc 0 and proc 1 share mem 0; proc 2 and proc 3 share mem 1

Introduction (4) distributed memory –e.g., katana, Blue Gene –each processor has own address space –if one processor needs data from another processor, must be explicitly passed shared memory –e.g., p-series IBM machines –common address space –no message passing required

Introduction (5) MPI –for both distributed and shared memory –portable –freely downloadable OpenMP –shared memory only –must be supported by compiler (most do) –usually easier than MPI –can be implemented incrementally

MPI Computational domain is typically decomposed into regions –One region assigned to each processor Separate copy of program runs on each processor

MPI (contd) Discretized domain to solve flow over airfoil System of coupled PDEs solved at each point

MPI (3) Decomposed domain for 4 processors

MPI (4) Since points depend on adjacent points, must transfer information after each iteration This is done with explicit calls in the source code
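As a rough illustration of what those explicit calls can look like (a sketch, not code from the original slides; the array u, its size, and the 1-D decomposition are assumptions made for the example), a halo exchange between neighboring MPI processes in C might be written as follows and built with mpicc:

#include <mpi.h>
#include <stdio.h>

#define N 8                         /* local interior points per process (assumed) */

int main(int argc, char **argv) {
    int rank, nprocs;
    double u[N+2];                  /* interior u[1..N] plus one ghost cell on each side */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N+2; i++) u[i] = 0.0;   /* ghost cells default to 0 */
    for (int i = 1; i <= N; i++)  u[i] = rank;  /* fill local interior points */

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* after each iteration, exchange boundary values with the neighbors */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,   /* send my right edge to the right  */
                 &u[0], 1, MPI_DOUBLE, left,  0,   /* receive left neighbor's edge     */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  1, /* send my left edge to the left    */
                 &u[N+1], 1, MPI_DOUBLE, right, 1, /* receive right neighbor's edge    */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %g, right ghost = %g\n", rank, u[0], u[N+1]);

    MPI_Finalize();
    return 0;
}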

MPI (5) Diminishing returns –Sending messages can get expensive –Want to maximize ratio of computation to communication

OpenMP Usually loop-level parallelization An OpenMP directive is placed in the source code before the loop –Assigns subset of loop indices to each processor –No message passing since each processor can see the whole domain for(i=0; i<N; i++){ do lots of stuff }
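For illustration (the loop body and array here are invented for the example, not taken from the slides), a minimal C sketch of such a directive, compiled with an OpenMP-enabled flag such as GCC's -fopenmp, looks like this:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double a[N];

    /* the directive before the loop assigns a subset of the i values to each thread */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;             /* iterations must be independent of each other */
    }

    printf("a[N-1] = %f (max threads = %d)\n", a[N-1], omp_get_max_threads());
    return 0;
}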

OpenMP (contd) Can't guarantee order of operations Example of how to do it wrong: parallelize this loop on 2 processors
for(i = 0; i < 7; i++) a[i] = 1;
for(i = 1; i < 7; i++) a[i] = 2*a[i-1];
Serially, a[i] becomes 1, 2, 4, 8, 16, 32, 64 If the second loop is split between two processors (e.g., i = 1 to 3 on proc. 0 and i = 4 to 6 on proc. 1), proc. 1 may read a[3] before proc. 0 has updated it, giving wrong results

Quantify performance Two common methods –parallel speedup –parallel efficiency

Parallel Speedup S_n = T_1 / T_n where S_n = parallel speedup, n = number of processors, T_1 = time on 1 processor, T_n = time on n processors
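For example (illustrative numbers, not from the original slides): if a run takes T_1 = 100 s on one processor and T_4 = 30 s on four processors, the speedup is S_4 = 100/30 ≈ 3.3.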

Parallel Speedup (2)

Parallel Efficiency η_n = T_1 / (n · T_n) = S_n / n where η_n = parallel efficiency, T_1 = time on 1 processor, T_n = time on n processors, n = number of processors
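Continuing the illustrative numbers above: S_4 ≈ 3.3 on four processors gives η_4 ≈ 3.3/4 ≈ 0.83, i.e., about 83% parallel efficiency.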

Parallel Efficiency (2)

Parallel Efficiency (3) What is a reasonable level of parallel efficiency? depends on –how much CPU time you have available –when the paper is due can think of (1-η) as wasted CPU time my personal rule of thumb ~60%

Parallel Efficiency (4) Superlinear speedup –parallel efficiency > 1.0 –sometimes quoted in the literature –generally attributed to cache issues subdomains fit entirely in cache, entire domain does not this is very problem-dependent be suspicious!

Amdahl's Law let fraction of code that can execute in parallel be denoted p let fraction of code that must execute serially be denoted s let T = time, n = number of processors

Amdahl's Law (2) Noting that p = (1-s), the time on n processors is T_n = T_1 (s + (1-s)/n), so the parallel speedup (don't confuse S_n with s) is S_n = T_1/T_n = 1 / (s + (1-s)/n), which is Amdahl's Law

Amdahl's Law (3) can also be expressed as parallel efficiency by dividing by n: η_n = S_n/n = 1 / (s·n + (1-s))

Amdahl's Law (4) suppose s = 0; then S_n = 1 / ((1-0)/n) = n => linear speedup

Amdahl's Law (5) suppose s = 1; then S_n = 1 / (1 + 0) = 1 => no speedup
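As a quick worked example (numbers chosen for illustration, not from the original slides): with s = 0.1 and n = 16, S_16 = 1/(0.1 + 0.9/16) = 1/0.15625 = 6.4, and η_16 = 6.4/16 = 40%; even a 10% serial fraction wastes most of 16 processors.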

Amdahl's Law (6)

Amdahl's Law (7) Should we despair? –No! –bigger machines => bigger computations => smaller value of s if you want to run on a large number of processors, try to minimize s

Recommendations

Add timers to your code –As you make changes and/or run new cases, they may give you an indication of a problem Profile your code –Sometimes results are surprising –Review tuning tips –See if you can speed up functions that are consuming the most time Try highest levels of compiler optimization

Recommendations (contd) Once you're comfortable that you're getting reasonable serial performance, parallelize If portability is an issue, MPI is a good choice If you'll always be running on a shared-memory machine (e.g., multicore PC), consider OpenMP For parallel code, plot parallel efficiency vs. number of processors –Choose appropriate number of processors