Parallel Computing Explained Timing and Profiling


Parallel Computing Explained: Timing and Profiling
Slides prepared from the CI-Tutor courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/) by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall Clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information

Timing and Profiling Now that your program has been ported to the new computer, you will want to know how fast it runs. This chapter describes how to measure the speed of a program using various timing routines. The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program.

Timing
In the following sections, we discuss timers and review the profiling tools ssrun and prof on the Origin, and vprof and gprof on the Linux clusters. The specific timing functions described are:

Timing a section of code
  FORTRAN: etime, dtime, cpu_time for CPU time; time and f_time for wall clock time
  C: clock for CPU time; gettimeofday for wall clock time
Timing an executable
  time a.out
Timing a batch run
  busage, qstat, qhist

CPU Time: etime
A section of code can be timed using etime. It returns the elapsed CPU time in seconds since the program started.

      real*4 tarray(2),time1,time2,timeres
      ...                     ! beginning of program
      time1=etime(tarray)
      ...                     ! start of section of code to be timed
      ...                     ! lots of computation
      ...                     ! end of section of code to be timed
      time2=etime(tarray)
      timeres=time2-time1

CPU Time: dtime
A section of code can also be timed using dtime. It returns the elapsed CPU time in seconds since the last call to dtime, so a second call at the end of the section yields the section's time directly.

      real*4 tarray(2),timeres
      ...                     ! beginning of program
      timeres=dtime(tarray)
      ...                     ! start of section of code to be timed
      ...                     ! lots of computation
      ...                     ! end of section of code to be timed
      timeres=dtime(tarray)
      ...                     ! rest of program

CPU Time: The etime and dtime Functions
User time. This is returned as the first element of tarray. It's the CPU time spent executing user code.
System time. This is returned as the second element of tarray. It's the time spent executing system calls on behalf of your program.
Sum of user and system time. This is the function value that is returned, and it's the time that is usually reported.
Metric. Timings are reported in seconds and are accurate to 1/100th of a second.

CPU Time: Timing Comparison Warnings
For the SGI computers: the etime and dtime functions return the MAX time over all threads for a parallel program. This is the time of the longest thread, which is usually the master thread.
For the Linux clusters: the etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library, include the compiler flag -Vaxlib.
Another warning: do not put calls to etime and dtime inside a do loop; the overhead is too large.

CPU Time: cpu_time
The cpu_time routine is available only on the Linux clusters, as it is a component of the Intel FORTRAN compiler library. It provides substantially higher resolution and substantially lower overhead than the older etime and dtime routines. It can be used as an elapsed timer.

      real*8 time1, time2, timeres
      ...                     ! beginning of program
      call cpu_time(time1)
      ...                     ! start of section of code to be timed
      ...                     ! lots of computation
      ...                     ! end of section of code to be timed
      call cpu_time(time2)
      timeres=time2-time1
      ...                     ! rest of program

CPU Time: clock
C programmers can call the cpu_time routine through a FORTRAN wrapper, or use the standard library function clock to determine elapsed CPU time.

      #include <time.h>
      static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
      double time1, time2, timeres;
      ...
      time1=(clock()*iCPS);
      /* do some work */
      time2=(clock()*iCPS);
      timeres=time2-time1;

Wall Clock Time: time
For the Origin, the function time returns the time since 00:00:00 GMT, January 1, 1970. It is a means of getting the elapsed wall clock time, reported in integer seconds.

      external time
      integer*4 time1,time2,timeres
      ...                     ! beginning of program
      time1=time()
      ...                     ! start of section of code to be timed
      ...                     ! lots of computation
      ...                     ! end of section of code to be timed
      time2=time()
      timeres=time2-time1

Wall Clock Time: f_time
For the Linux clusters, the appropriate FORTRAN function for elapsed wall clock time is f_time.

      integer*8 f_time
      external f_time
      integer*8 time1,time2,timeres
      ...                     ! beginning of program
      time1=f_time()
      ...                     ! start of section of code to be timed
      ...                     ! lots of computation
      ...                     ! end of section of code to be timed
      time2=f_time()
      timeres=time2-time1

As with etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library, include the compiler flag -Vaxlib.

Wall Clock Time: gettimeofday
For C programmers, wall clock time can be obtained by using the very portable routine gettimeofday. Note that it must be called again after the timed work to obtain the second timestamp.

      #include <stddef.h>   /* definition of NULL */
      #include <sys/time.h> /* timeval struct and prototype of gettimeofday */
      double t1,t2,elapsed;
      struct timeval tp;
      int rtn;
      ...
      rtn=gettimeofday(&tp, NULL);
      t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
      /* do some work */
      rtn=gettimeofday(&tp, NULL);
      t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
      elapsed=t2-t1;

Timing an Executable
To time an executable, use the time command (if using a csh or tcsh shell, explicitly call /usr/bin/time):

      time [options] a.out

where options can be -p for simple output, or -f format, which lets the user display more than just time-related information. Consult the man pages on the time command for format options.

Timing a Batch Job
To get the time of a batch job, running or completed:

Origin:
      busage jobid
Linux clusters:
      qstat jobid      # for a running job
      qhist jobid      # for a completed job

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall Clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information

Profiling
Profiling determines where a program spends its time. It detects the computationally intensive parts of the code. Use profiling when you want to focus attention and optimization efforts on the loops responsible for the bulk of the computational load. Most codes follow the 90-10 rule: 90% of the computation is done in 10% of the code.

Profiling Tools on the Origin
On the SGI Origin2000 computer there are profiling tools named ssrun and prof. Used together, they perform profiling, also called hot spot analysis, and are useful for generating timing profiles.

ssrun: collects performance data for an executable that you specify. The performance data is written to a file named "executablename.exptype.id".
prof: analyzes the data file created by ssrun and produces a report.

Example:
      ssrun -fpcsamp a.out
      prof -h a.out.fpcsamp.m12345 > prof.list

Profiling Tools on the Linux Clusters
On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to ssrun, prof, and perfex.

gprof: basic profiling information can be generated using the OS utility gprof. First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level), or -pg for the GNU compiler. Second, run the program. Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.

      efc -O -qp -g -o foo foo.f
      ./foo
      gprof foo gmon.out

Profiling Tools on the Linux Clusters: vprof
On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library. Instrumenting the whole application requires recompiling and linking against the vprof and PAPI libraries.

      setenv VMON PAPI_TOT_CYC
      ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
      ./md
      /usr/apps/tools/vprof/bin/cprof -e md vmon.out

Profile Listings on the Origin: prof Output, First Listing
The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.

      Cycles    %      Cum%   Secs  Proc
      --------  -----  -----  ----  ----
      42630984  58.47  58.47  0.57  VSUB
       6498294   8.91  67.38  0.09  PFSOR
       6141611   8.42  75.81  0.08  PBSOR
       3654120   5.01  80.82  0.05  PFSOR1
       2615860   3.59  84.41  0.03  VADD
       1580424   2.17  86.57  0.02  ITSRCG
       1144036   1.57  88.14  0.02  ITSRSI
        886044   1.22  89.36  0.01  ITJSI
        861136   1.18  90.54  0.01  ITJCG

Profile Listings on the Origin: prof Output, Second Listing
The second listing gives the number of cycles per source code line. The lines are listed in descending order of cycle count.

      Cycles    %      Cum%   Line  Proc
      --------  -----  -----  ----  ----
      36556944  50.14  50.14  8106  VSUB
       5313198   7.29  57.43  6974  PFSOR
       4968804   6.81  64.24  6671  PBSOR
       2989882   4.10  68.34  8107  VSUB
       2564544   3.52  71.86  7097  PFSOR1
       1988420   2.73  74.59  8103  VSUB
       1629776   2.24  76.82  8045  VADD
        994210   1.36  78.19  8108  VSUB
        969056   1.33  79.52  8049  VADD
        483018   0.66  80.18  6972  PFSOR

Profile Listings on the Linux Clusters: gprof Output, First Listing
The listing gives a 'flat' profile of the functions and routines encountered, sorted by 'self seconds', the number of seconds accounted for by each function alone.

Flat profile: Each sample counts as 0.000976562 seconds.

      %      cumulative  self                 self      total
      time   seconds     seconds  calls       us/call   us/call    name
      -----  ----------  -------  ----------  --------  ---------  -----------
      38.07        5.67     5.67         101  56157.18  107450.88  compute_
      34.72       10.84     5.17    25199500      0.21       0.21  dist_
      25.48       14.64     3.80                                   SIND_SINCOS
       1.25       14.83     0.19                                   sin
       0.37       14.88     0.06                                   cos
       0.05       14.89     0.01       50500      0.15       0.15  dotr8_
       0.05       14.90     0.01         100     68.36      68.36  update_
       0.01       14.90     0.00                                   f_fioinit
       0.01       14.90     0.00                                   f_intorange
       0.01       14.90     0.00                                   mov
       0.00       14.90     0.00           1      0.00       0.00  initialize_

Profile Listings on the Linux Clusters: gprof Output, Second Listing
The second listing gives a 'call-graph' profile of the functions and routines encountered. The definitions of the columns are specific to the line in question; detailed information is contained in the full output from gprof.

Call graph:

      index  % time  self  children  called              name
      [1]      72.9  0.00     10.86                      main [1]
                     5.67      5.18  101/101                 compute_ [2]
                     0.01      0.00  100/100                 update_ [8]
                     0.00      0.00  1/1                     initialize_ [12]
      ---------------------------------------------------------------------
                     5.67      5.18  101/101             main [1]
      [2]      72.8  5.67      5.18  101                 compute_ [2]
                     5.17      0.00  25199500/25199500       dist_ [3]
                     0.01      0.00  50500/50500             dotr8_ [7]
      ---------------------------------------------------------------------
                     5.17      0.00  25199500/25199500   compute_ [2]
      [3]      34.7  5.17      0.00  25199500            dist_ [3]
      ---------------------------------------------------------------------
                                     <spontaneous>
      [4]      25.5  3.80      0.00                      SIND_SINCOS [4]
      ...

Profile Listings on the Linux Clusters: vprof Listing
The listing below (generated using the -e option to cprof) displays not only the cycles consumed by functions (a flat profile), but also the lines in the code that contribute to those functions. Columns correspond to the following events: PAPI_TOT_CYC - Total cycles (1956 events).

      File Summary:
       100.0% /u/ncsa/gbauer/temp/md.f
      Function Summary:
        84.4% compute
        15.6% dist
      Line Summary:
        67.3% /u/ncsa/gbauer/temp/md.f:106
        13.6% /u/ncsa/gbauer/temp/md.f:104
         9.3% /u/ncsa/gbauer/temp/md.f:166
         2.5% /u/ncsa/gbauer/temp/md.f:165
         1.5% /u/ncsa/gbauer/temp/md.f:102
         1.2% /u/ncsa/gbauer/temp/md.f:164
         0.9% /u/ncsa/gbauer/temp/md.f:107
         0.8% /u/ncsa/gbauer/temp/md.f:169
         0.8% /u/ncsa/gbauer/temp/md.f:162
         0.8% /u/ncsa/gbauer/temp/md.f:105

Profile Listings on the Linux Clusters: vprof Listing (cont.)

         0.7% /u/ncsa/gbauer/temp/md.f:149
         0.5% /u/ncsa/gbauer/temp/md.f:163
         0.2% /u/ncsa/gbauer/temp/md.f:109
         0.1% /u/ncsa/gbauer/temp/md.f:100
      ...
      100    0.1%   do j=1,np
      101             if (i .ne. j) then
      102    1.5%        call dist(nd,box,pos(1,i),pos(1,j),rij,d)
      103    !           attribute half of the potential energy to particle 'j'
      104   13.6%        pot = pot + 0.5*v(d)
      105    0.8%        do k=1,nd
      106   67.3%          f(k,i) = f(k,i) - rij(k)*dv(d)/d
      107    0.9%        enddo
      108             endif
      109    0.2%   enddo

Profiling Analysis
The program analyzed in the previous Origin example has approximately 10,000 source code lines and consists of many subroutines. The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine. The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation. Going back to the source code, line 8106 is a line inside a do loop. By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part. Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect. To view the rearranged source, use the options:

      f90 ... -FLIST:=ON
      cc  ... -CLIST:=ON

For the Intel compilers, the appropriate options are:

      ifort ... -E
      icc   ... -E

Further Information
SGI Irix:
      man etime
      man 3 time
      man busage
      man timers
      man ssrun
      man prof
      Origin2000 Performance Tuning and Optimization Guide
Linux clusters:
      man 3 clock
      man 2 gettimeofday
      man 1 gprof
      man 1B qstat
      Intel Compilers documentation
      Vprof on NCSA Linux Cluster