Performance Analysis Tools

Performance Analysis Goal

Once we have a working parallel program, we want to tune it to run faster.

Hot spot – an area of code that uses a significant amount of CPU time.
Bottleneck – an area of code that uses resources inefficiently and slows the program down (e.g., communication).

Timers

One way to identify hot spots and bottlenecks is to use timers. So far we have used a timer to measure the elapsed time of the entire algorithm, but timers can also measure the time spent in individual parts of the algorithm.

Timers

Timer           Usage         Wallclock / CPU  Resolution      Languages
time            Shell command Both             1/100th second  Any
gettimeofday    Subroutine    Wallclock        Microseconds    C/C++
read_real_time  Subroutine    Wallclock        Nanoseconds     C/C++ on IBM AIX systems
MPI_Wtime       Subroutine    Wallclock        Microseconds    C/C++, Fortran

Time command

Usage:
  time mpirun -np # command

Result:
  real  0m1.071s
  user  0m0.177s
  sys   0m0.045s

Time command

Meaning:
Real time – the total wall-clock (start to finish) time your program took to load, execute, and exit.
User time – the total amount of CPU time your program took to execute.
System time – the amount of CPU time spent on operating system calls while executing your program.

gettimeofday

gettimeofday is a system call that returns a structure giving the time since the Epoch (January 1, 1970):

  int gettimeofday(struct timeval *tv, struct timezone *tz);

The timeval structure has seconds and microseconds:

  struct timeval {
      time_t      tv_sec;   /* seconds */
      suseconds_t tv_usec;  /* microseconds */
  };

gettimeofday

Usage:

  #include <sys/time.h>

  struct timeval tv1, tv2;
  ...
  gettimeofday(&tv1, NULL);
  ...                          /* work to be timed */
  gettimeofday(&tv2, NULL);

  /* Convert time to seconds */
  elapsed_time = (tv2.tv_sec - tv1.tv_sec)
               + (tv2.tv_usec - tv1.tv_usec) / 1000000.0;

MPI_Wtime

Returns a single double-precision value: the number of seconds since some time in the past (the reference point is implementation-dependent). MPI also provides MPI_Wtick(), which returns the timer's resolution in seconds (often microseconds).

MPI_Wtime

Usage:

  #include "mpi.h"
  ...
  double start, end, resolution;

  MPI_Init(&argc, &argv);
  start = MPI_Wtime();        /* start time */
  ...                         /* work to be timed */
  end = MPI_Wtime();          /* end time */
  resolution = MPI_Wtick();
  printf("elapsed= %e resolution= %e\n", end - start, resolution);

MPI_Wtime

Sample output:

  Wallclock times (secs): start= 1384864000.067529 end= 1384864000.074005
  elapsed= 6.475925e-03 resolution= 1.000000e-06

The elapsed time is in seconds, accurate to microseconds.

read_real_time

read_real_time is a system call that returns a structure giving the time since the Epoch (January 1, 1970):

  int read_real_time(timebasestruct_t *t, size_t size_of_timebasestruct_t);

It is designed to measure time accurate to nanoseconds, and it guarantees correct time units across different IBM RS/6000 architectures.

read_real_time

Usage:

  #include <sys/time.h>
  ...
  timebasestruct_t start, finish;
  int secs, n_secs;

  read_real_time(&start, TIMEBASE_SZ);
  /* do some work */
  read_real_time(&finish, TIMEBASE_SZ);

  /* Make sure both values are in seconds and nanoseconds */
  time_base_to_time(&start, TIMEBASE_SZ);
  time_base_to_time(&finish, TIMEBASE_SZ);

read_real_time

Usage continued:

  /* Subtract the starting time from the ending time */
  secs   = finish.tb_high - start.tb_high;
  n_secs = finish.tb_low  - start.tb_low;

  /* Fix carry from low-order to high-order during the measurement */
  if (n_secs < 0) {
      secs--;
      n_secs += 1000000000;
  }

  printf("Sample time was %d seconds %d nanoseconds\n", secs, n_secs);

Profilers

prof
gprof
Xprofiler
mpiP

prof

Compile your program with the -p option:

  gcc -p <program>.c -o <program>

Run the program; a profile file called mon.out is created. Then run:

  prof -m mon.out

Sample output from prof

  Name          %Time  Seconds  Cumsecs  #Calls  msec/call
  .fft           51.8     0.59     0.59    1024      0.576
  .main          40.4     0.46     1.05       1    460.
  .bit_reverse    7.9     0.09     1.14             0.088
  .cos            0.0     0.00              256
  (remaining entries – .sin, .catopen, .setlocale, ._doprnt, ._flsbuf,
  ._xflsbuf, ._wrtchk, ._findbuf, ._xwrite, .free, .free_y, .write,
  .exit, .memchr, .atoi, .__nl_langinfo_std, .gettimeofday, .printf –
  each take negligible time)

gprof

Compile your program with the -pg option:

  gcc -pg <program>.c -o <program>

Run the program; a profile file called gmon.out is created. Then run:

  gprof <program> gmon.out

Sample output from gprof

  granularity: Each sample hit covers 4 bytes. Time: 1.17 seconds

                                    called/total     parents
  index  %time  self  descendents   called+self   name        index
                                    called/total     children

                0.44      0.72         1/1        .__start [2]
  [1]    99.1   0.44      0.72         1          .main [1]
                0.59      0.13      1024/1024     .fft [3]
                0.00      0.00       256/256      .cos [6]
                0.00      0.00       256/256      .sin [7]
                0.00      0.00         8/8        .gettimeofday [11]
                0.00      0.00         7/7        .printf [16]
                0.00      0.00         1/1        .atoi [31]
                0.00      0.00         1/1        .exit [33]

xprofiler

An X Windows profiler based on gprof. Compile and run the program as you would for gprof, then run:

  xprofiler <program> gmon.out

It provides a graphical representation of the program's execution.

Library View

Function View

mpiP

Compile an MPI program with -g and link against the mpiP library:

  mpcc -g <program>.c -o <program> -L/usr/local/tools/mpiP/lib -lmpiP -lbfd

Run the MPI program as usual. A file is created called <program>.N.XXXXX.mpiP, where N is the number of processors and XXXXX is the collector task's process id.

Sample output from mpiP

  @ mpiP
  @ Command           : sphot
  @ Version           : 0.9
  @ Build date        : Mar 8 2001, 16:22:46
  @ Start time        : 2001 04 11 16:04:23
  @ Stop time         : 2001 04 11 16:04:51
  @ Number of tasks   : 4
  @ Collector Rank    : 0
  @ Collector PID     : 30088
  @ Event Buffer Size : 100000
  @ Final Trace Dir   : .
  @ Local Trace Dir   : /usr/tmp
  @ Task Map          : 0 blue333.pacific.llnl.gov 0
  @ Task Map          : 1 blue334.pacific.llnl.gov 0
  @ Task Map          : 2 blue335.pacific.llnl.gov 0
  @ Task Map          : 3 blue336.pacific.llnl.gov 0

Sample output from mpiP

  @--- MPI Time (seconds) ----------------------------------------
  Task  AppTime  MPITime   MPI%
     0     27.9     7.18  25.73
     1     27.9     7.5   26.89
     2     27.9     7.78  27.90
     3     27.9     7.73  27.72
     *      112     30.2  27.06

Sample output from mpiP

  @--- Callsites: 38 ---------------------------------------
  ID  MPICall    ParentFunction  Filename    Line  PC
   1  Barrier    copyglob        copyglob.f    65  10000b9c
   2  Barrier    copypriv@OL@1   copypriv.f   195  10001cd4
   3  Barrier    copypriv@OL@2   copypriv.f   237  1000213c
   4  Barrier    copypriv@OL@3   copypriv.f   279  10002624
   5  Barrier    copypriv@OL@4   copypriv.f   324  10002b04
   6  Barrier    sphot           sphot.f      269  10008f2c
   7  Bcast      rdopac          rdopac.f      49  10008638
   8  Comm_rank  copyglob        copyglob.f    13  100003a8
   9  Comm_rank  copypriv        copypriv.f    75  10000c38
  10  Comm_rank  genxsec         genxsec.f     37  1000503c
  11  Comm_rank  rdinput         rdinput.f     17  100071d4
  ...

Sample output from mpiP

  @--- Aggregate Time (top twenty, descending, milliseconds) ------
  Call       Site      Time   App%   MPI%
  Bcast         7  1.54e+04  13.79  50.95
  Barrier       1  1.42e+04  12.73  47.03
  Barrier       2       563   0.50   1.87
  Waitall      34      25.7   0.02   0.09
  Reduce       25       7.4   0.01   0.02
  Barrier       5      2.54   0.00   0.01
  Barrier       6      1.55   0.00   0.01
  Barrier       4      1.44   0.00   0.00
  Comm_rank    13      1.22   0.00   0.00
  Barrier       3      1.01   0.00   0.00
  Comm_rank     9     0.967   0.00   0.00
  ...

Sample output from mpiP

  @--- Callsite statistics (all, milliseconds): 102 -----------------------
  Name     Site  Rank  Count       Max      Mean       Min   App%   MPI%
  Barrier     1     0      1     0.087     0.087     0.087   0.00   0.00
  Barrier     1     1      1      12.7      12.7      12.7   0.05   0.17
  Barrier     1     2      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.17
  Barrier     1     3      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.75
  Barrier     1     *      4  7.09e+03  3.55e+03     0.087  12.73  47.03
  Barrier     2     0      1      0.12      0.12      0.12   0.00   0.00
  Barrier     2     1      1      0.29      0.29      0.29   0.00   0.00
  Barrier     2     2      1       307       307       307   1.10   3.95
  Barrier     2     3      1       255       255       255   0.92   3.31
  Barrier     2     *      4       307       141      0.12   0.50   1.87
  Send       31     1      1     0.169     0.169     0.169   0.00   0.00
  Send       31     2      1     0.341     0.341     0.341   0.00   0.00
  Send       31     3      1     0.184     0.184     0.184   0.00   0.00
  Send       31     *      3     0.341     0.231     0.169   0.00   0.00
  ...

Questions