Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.

Presentation transcript:

Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida International University March 2009 (Additional Slides by Javier Delgado) Parallel Computing Explained Timing and Profiling

Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 6.1 Timing Timing a Section of Code CPU Time Wall clock Time Timing an Executable Timing a Batch Job 6.2 Profiling Profiling Tools Profile Listings Profiling Analysis 6.3 Further Information

Timing and Profiling Now that your program has been ported to the new computer, you will want to know how fast it runs. This chapter describes how to measure the speed of a program using various timing routines. The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program. This chapter also gives a summary of some available profiling tools.

Timing In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
Timing a section of code
  FORTRAN: etime, dtime, cpu_time for CPU time; time and f_time for wallclock time
  C: clock for CPU time; gettimeofday for wallclock time
Timing an executable: time a.out
Timing a batch run: busage, qstat, qhist

CPU Time etime A section of code can be timed using etime. It returns the elapsed CPU time in seconds since the program started.
    real*4 tarray(2),time1,time2,timeres
    … beginning of program
    time1=etime(tarray)
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    time2=etime(tarray)
    timeres=time2-time1

CPU Time dtime A section of code can also be timed using dtime. It returns the elapsed CPU time in seconds since the last call to dtime.
    real*4 tarray(2),timeres
    … beginning of program
    timeres=dtime(tarray)
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    timeres=dtime(tarray)
    … rest of program

CPU Time The etime and dtime Functions
User time. This is returned as the first element of tarray. It's the CPU time spent executing user code.
System time. This is returned as the second element of tarray. It's the time spent executing system calls on behalf of your program.
Sum of user and system time. This is the function value that is returned. It's the time that is usually reported.
Metric. Timings are reported in seconds. Timings are accurate to 1/100th of a second.

CPU Time Timing Comparison Warnings For the SGI computers: The etime and dtime functions return the MAX time over all threads for a parallel program. This is the time of the longest thread, which is usually the master thread. For the Linux Clusters: The etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib. Another warning: Do not put calls to etime and dtime inside a do loop. The overhead is too large.

CPU Time cpu_time The cpu_time routine is available only on the Linux clusters as it is a component of the Intel FORTRAN compiler library. It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines. It can be used as an elapsed timer.
    real*8 time1, time2, timeres
    … beginning of program
    call cpu_time(time1)
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    call cpu_time(time2)
    timeres=time2-time1
    … rest of program

CPU Time clock C programmers can either call the cpu_time routine through a FORTRAN wrapper or call the standard library function clock, which can be used to determine elapsed CPU time.
    #include <time.h>
    static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
    double time1, time2, timeres;
    …
    time1=(clock()*iCPS);
    … /* do some work */
    time2=(clock()*iCPS);
    timeres=time2-time1;

Wall clock Time time For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970. It is a means of getting the elapsed wall clock time. The wall clock time is reported in integer seconds.
    external time
    integer*4 time1,time2,timeres
    … beginning of program
    time1=time( )
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    time2=time( )
    timeres=time2 - time1

Wall clock Time f_time For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.
    integer*8 f_time
    external f_time
    integer*8 time1,time2,timeres
    … beginning of program
    time1=f_time()
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    time2=f_time()
    timeres=time2 - time1
As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.

Wall clock Time gettimeofday For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday.
    #include <stddef.h>   /* definition of NULL */
    #include <sys/time.h> /* definition of timeval struct and prototyping of gettimeofday */
    double t1,t2,elapsed;
    struct timeval tp;
    int rtn;
    ....
    rtn=gettimeofday(&tp, NULL);
    t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    .... /* do some work */
    ....
    rtn=gettimeofday(&tp, NULL);
    t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    elapsed=t2-t1;

Timing an Executable To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):
    time …options… a.out
where options can be '-p' for a simple output or '-f format', which allows the user to display more than just time-related information. Consult the man pages on the time command for format options.

Timing a Batch Job Time of a batch job running or completed.
Origin:
    busage jobid
Linux clusters:
    qstat jobid   # for a running job
    qhist jobid   # for a completed job


Profiling Profiling determines where a program spends its time. It detects the computationally intensive parts of the code. Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load. Most codes follow the 90-10 Rule. That is, 90% of the computation is done in 10% of the code.

Profiling Tools Profiling Tools on the Origin On the SGI Origin2000 computer there are profiling tools named ssrun and prof. Used together they do profiling, or what is called hot spot analysis. They are useful for generating timing profiles.
ssrun: The ssrun utility collects performance data for an executable that you specify. The performance data is written to a file named "executablename.exptype.id".
prof: The prof utility analyzes the data file created by ssrun and produces a report.
Example:
    ssrun -fpcsamp a.out
    prof -h a.out.fpcsamp.m12345 > prof.list

Profiling Tools Profiling Tools on the Linux Clusters On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun and perfex tools.
gprof: Basic profiling information can be generated using the OS utility gprof. First, compile the code with the compiler flags -p -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler. Second, run the program. Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.
    efc -O -p -g -o foo foo.f
    ./foo
    gprof foo gmon.out

The Performance API (PAPI)
Provides an interface to the hardware performance counters integrated in the CPU
Provides more in-depth details about resource utilization, e.g. cache misses, instructions per second
Used by perfex, mpitrace, perfsuite, and other profiling tools
Requires a kernel patch to deploy on Linux

Profiling Tools Profiling Tools on the Linux Clusters
vprof: On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library. To instrument the whole application requires recompiling and linking to the vprof and PAPI libraries.
    setenv VMON PAPI_TOT_CYC
    ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
    ./md
    /usr/apps/tools/vprof/bin/cprof -e md vmon.out

Profile Listings Profile Listings on the Origin Prof Output First Listing The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.
[Listing: columns Cycles, %, Cum%, Secs, Proc; procedures shown, in order: VSUB, PFSOR, PBSOR, PFSOR, VADD, ITSRCG, ITSRSI, ITJSI, ITJCG]

Profile Listings Profile Listings on the Origin Prof Output Second Listing The second listing gives the number of cycles per source code line. The lines are listed in descending order of cycle count.
[Listing: columns Cycles, %, Cum%, Line, Proc; procedures shown, in order: VSUB, PFSOR, PBSOR, VSUB, PFSOR, VSUB, VADD, VSUB, VADD, PFSOR]

Profile Listings Profile Listings on the Linux Clusters gprof Output First Listing The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.
[Listing: flat profile with columns % time, cumulative seconds, self seconds, calls, self us/call, total us/call, name; functions shown, in order: compute_, dist_, SIND_SINCOS, sin, cos, dotr8_, update_, f_fioinit, f_intorange, mov, initialize_]

Profile Listings Profile Listings on the Linux Clusters gprof Output Second Listing The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.
[Listing: call graph with columns index, % time, self, children, called, name; primary entries shown for main [1], compute_ [2], dist_ [3], and SIND_SINCOS [4]; main calls compute_ [2] (/101), update_ [8] (/100), and initialize_ [12] (/1); compute_ calls dist_ [3] and dotr8_ [7] (/50500)]

Profile Listings Profile Listings on the Linux Clusters vprof Listing
    Columns correspond to the following events:
      PAPI_TOT_CYC - Total cycles (1956 events)
    File Summary:
      100.0% /u/ncsa/gbauer/temp/md.f
    Function Summary:
      84.4% compute
      15.6% dist
    Line Summary:
      67.3% /u/ncsa/gbauer/temp/md.f:…
      …
      …% /u/ncsa/gbauer/temp/md.f:105
The above listing (produced using the -e option to cprof) displays not only cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.

Profile Listings Profile Listings on the Linux Clusters vprof Listing (cont.)
      0.7% /u/ncsa/gbauer/temp/md.f:…
      …
      …% /u/ncsa/gbauer/temp/md.f:100
    …
    100    do j=1,np
    101      if (i .ne. j) then
    102        call dist(nd,box,pos(1,i),pos(1,j),rij,d)
    103        ! attribute half of the potential energy to particle 'j'
    104        pot = pot + 0.5*v(d)
    105        do k=1,nd
    106          f(k,i) = f(k,i) - rij(k)*dv(d)/d
    107        enddo
    108      endif
    109    enddo

Profiling Analysis The program being analyzed in the previous Origin example is large (the profile refers to source line 8106) and consists of many subroutines. The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine. The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation. Going back to the source code, line 8106 is a line inside a do loop. By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part. Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you only an area of the code to inspect. To view the rearranged source, use the options
    f90 … -FLIST:=ON
    cc … -CLIST:=ON
For the Intel compilers, the appropriate options are
    ifort … -E …
    icc … -E …

MPE and Jumpshot
MPE is a tracing library that comes with MPI
Jumpshot is a graphical application for analyzing the MPE output
MPE requires inserting code at specific locations to be analyzed
Display options are specified in the code (e.g. "Show MPI_Broadcast events in dotted blue lines")

Jumpshot

Perfsuite
Collection of tools, utilities, and libraries for software performance analysis
Intel architectures only
Provides many in-depth statistics: operations per cycle, cache miss/hit data, etc.
Not difficult to use (but may be difficult to compile)
    mpiexec -np $NN psrun wrf.exe
    psprocess wrf.exe.NN_n.xml
Requires PAPI kernel patch for showing most information

Perfsuite + Graphical App

CEPBA Tools
Developed at the European Center for Parallelism of Barcelona (CEPBA)
Currently not free
Provide text-based and graphical applications for execution analysis and optimization, and for execution prediction
3 main tools: Mpitrace, Dimemas, Paraver

CEPBA Tools Powerful, but complex Requires PAPI kernel patch for showing most information May require application to be recompiled Very large trace files for long executions and/or high number of processors (e.g. over 10GB)

CEPBA Tools Source: Barcelona SuperComputing Center –

Visualizing with Paraver Process:
1. (Compile application with mpitrace libraries linked)
2. Execute application (and preload mpitrace libraries if not linked to the application)
3. Convert individual trace files to a Paraver file
4. "Chop" Paraver trace file, if it is too big

Paraver Screenshots

Dimemas Estimate impact of code changes without changing the code Estimate execution time on slightly different architectures

Further Information
SGI Irix: man etime, man 3 time, man 1 time, man busage, man timers, man ssrun, man prof; Origin2000 Performance Tuning and Optimization Guide
Linux Clusters: man 3 clock, man 2 gettimeofday, man 1 time, man 1 gprof, man 1B qstat; Intel Compilers; Vprof on NCSA Linux Cluster