Published by Britton Mitchell. Modified over 9 years ago.
1
11/17/02
PAPI and Dynaprof
Application Signatures and Performance Analysis of Scientific Applications
Philip J. Mucci
Innovative Computing Laboratory, UTK
Performance Evaluation Research Center, LBL
mucci@cs.utk.edu
http://icl.cs.utk.edu/~mucci/dynaprof/snapshots/sc2002.ppt
2
Goals
● Understanding the behavior of the application
  – Identification of bottlenecks.
  – Usage of the hardware resources.
  – Effects of that usage on performance.
● Using Dynaprof to achieve that goal
  – Command line usage
  – 3 Dynaprof probes
    ● Wallclock Time
    ● Hardware performance counters
    ● Resource usage traces
3
Motivation
● Optimize the application's performance.
● Evaluate the algorithm's efficiency.
● Generate an application signature.
  – A collection of data that represents the major terms in the performance model.
● Develop a performance model.
4
Overview of Hardware Counters
● Data is NOT PORTABLE, but PAPI is...
● Small number of registers dedicated to performance monitoring functions.
  – AMD Athlon, 4 counters
  – Pentium <= III, 2 counters
  – Pentium IV, 18 counters
  – IA64, 4 counters
  – Alpha 21x64, 2 counters
  – Power 3, 8 counters
  – Power 4, 8 counters to a group
  – UltraSparc II, 2 counters
  – MIPS R14K, 2 counters
5
Applications used in this Tutorial
● Serial:
  – FSPX: A binary alloy solidification benchmark.
  – SWIM: The SPEC shallow water benchmark.
● Parallel (MPI):
  – Ex19 from the PETSc distribution.
  – Solves the nonlinear driven cavity problem with multigrid: a 2D driven cavity problem solved in a velocity-vorticity formulation.
6
FSPX Execution Environment
● Intel PIII, 1.2 GHz
  – FP Results/Clock: 1 (1.2 Gflips)
    ● 4 SP/clk with SSE, 2 DP/clk with SSE2
  – Caches: 16K/16K, 256K
● G77 version 2.96 -g -O -malign-double -mpentiumpro -funroll-loops -fexpensive-optimizations
● Execution time:
  > /bin/time fspx
  115.370u 0.030s 1:58.17 97.6% 0+0k 0+0io 162pf+0w
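The peak rate quoted next to "FP Results/Clock" is the clock frequency multiplied by the number of floating-point results per clock. A minimal Python sketch of that arithmetic follows; the function name `peak_gflips` is hypothetical and is not part of PAPI or Dynaprof.

```python
def peak_gflips(clock_ghz: float, fp_results_per_clock: float) -> float:
    """Peak floating-point rate in Gflips: clock (GHz) times FP results per clock."""
    return clock_ghz * fp_results_per_clock


# Figures taken from the execution-environment slides.
print("Pentium III:", peak_gflips(1.2, 1))
print("Power 3:", peak_gflips(0.375, 4))
```

Comparing a measured rate against this peak gives a first rough indication of how well the floating-point units are being used.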
7
swim Execution Environment
● IBM Nighthawk, 16-way Power 3, 375 MHz
  – FP Results/Clock: 4 (1.5 Gflips)
  – Caches: 32K/64K, 8MB
  – MPI over TCP/IP via switch
● Xlc 5.0.2.1 built with -g -O3 -qstrict -qarch=pwr3 -qtune=pwr3
● Execution time:
  > /bin/time poe swim -procs 2
  0.4u 0.0s 0:15 3% 217+3933k 0+0io 1pf+0w
8
ex19 Execution Environment
● IBM Nighthawk, 16-way Power 3, 375 MHz
  – FP Results/Clock: 4 (1.5 Gflips)
  – Caches: 32K/64K, 8MB
● Xlc 5.0.2.1 built with -g
● Execution time:
  > /bin/time poe ex19 -procs 2 -da_grid_x 56 -da_grid_y 56
  0.520u 0.200s 0:44.18 1.6% 297+3580k 0+0io 0pf+0w
9
Gprof
● Gathers timer interrupts vs. text address.
● Recompile with the -pg option.
● A Gprof profile is useful for a high-level overview.
● Does it tell us why?
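"Timer interrupts vs. text address" means the profiler periodically records the program counter and later attributes each sample to the function whose address range contains it. The Python sketch below illustrates that idea only; it is not gprof's implementation, and the addresses and the helper name `flat_profile` are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, List, Tuple


def flat_profile(samples: List[int], symbols: List[Tuple[int, str]]) -> Dict[str, float]:
    """Attribute sampled program-counter values to functions.

    symbols is a list of (start_address, name) pairs sorted by start address.
    Returns the fraction of attributed samples that landed in each function.
    """
    starts = [start for start, _ in symbols]
    hits: Counter = Counter()
    for pc in samples:
        idx = bisect_right(starts, pc) - 1  # last symbol starting at or before pc
        if idx >= 0:                        # samples below the first symbol are dropped
            hits[symbols[idx][1]] += 1
    total = sum(hits.values())
    return {name: count / total for name, count in hits.items()}


# Hypothetical symbol table and samples.
symbols = [(0x1000, "flux"), (0x2000, "pde")]
samples = [0x1004, 0x1FF0, 0x2010, 0x0500]
print(flat_profile(samples, symbols))
```

A profile like this shows where time goes, but not why; that is the gap the hardware-counter probes are meant to fill.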
10
Gprof Profile of FSPX
11
FSPX: Top 4 functions
● Top 4 functions make up 50% of execution time
● In module update.F
  – flux
  – proflux
  – pde
● In module phase.F
  – phase
● Use the list command to explore modules and functions
12
Gprof Profile of SWIM
13
Gprof Profile of ex19
14
Dynaprof Environment Variables
● LD_LIBRARY_PATH: Colon-separated list of directories to search for shared libraries. We need to find:
  – DynInst library
  – PAPI library
  – Any dependencies of the above (libperfctr.so, libcpc.so).
● DYNINSTAPI_RT_LIB: Full pathname of the DynInst runtime library.
● No settings necessary for the AIX/DPCL port
15
Running Dynaprof
● Usage: dynaprof [-d] [serial_application]
● -d enables debugging output
● Specifying an application automatically loads it into the tool immediately after initialization.
16
Command Line Interface
● Uses GNU Readline library for input
● Full-featured command line editing
  – File and command completion:
  – History: /
● Settings, macros and aliases in ~/.inputrc
● Allows Emacs or VI style bindings
  – set editing-mode emacs
  – set editing-mode vi
● See man page, TexInfo file or home page.
17
Load command
● Starts the application and stops it at the first instruction.
● Usage: load [args]
  > dynaprof
  (dynaprof) load tests/fspx
18
Poeload command
● For use with MPI applications on AIX and DPCL.
  – DPCL < 3.2.5 requires full path
● Usage: poeload [args]
  (dynaprof) poeload tests/swim -procs 2
19
Mpiload command
● For use with MPI applications.
● Stops the application after it calls PMPI_Init().
● Mostly useful for script-driven execution of MPI jobs
● Usage: mpiload [args]
  (dynaprof) mpiload tests/mpicount
20
Attach command
● Attaches to a running application (or poe process) and stops it.
● Usage: attach
  (dynaprof) ^Z
  > tests/fspx &
  [2] 17500
  > fg
  (dynaprof) attach tests/fspx 17500
21
Poeattach Command
● For use with MPI applications on AIX and DPCL.
  – DPCL < 3.2.5 requires full path
● Usage: poeattach
  (dynaprof) ^Z
  > poe ex19 -da_grid_x 56 -da_grid_y 56 -procs 2 &
  [2] 17500
  > fg
  (dynaprof) poeattach ex19 17500
22
List command
● list – List all modules in process
● list – List all matching modules
● list – List all functions in module
● list – List all matching functions in module
● list – List instrumentable points in function
23
Exploring FSPX
● G77's Fortran runtime support. Code compiled with g77 without -g ends up in the DEFAULT_MODULE.
● Application code
● Shared libraries
24
Exploring FSPX 2
● G77's Fortran runtime support. Code compiled with g77 without -g ends up in the DEFAULT_MODULE.
25
Exploring FSPX 3
● Function calls
26
Use command
● Loads a probe shared library into the address space
  (dynaprof) use [probe [args]]
● Use by itself displays the current probe.
● To change options, respecify the probe.
● 4 probes in this release
  – Wallclock: Real time clock
  – PAPI: Hardware metrics
  – Perfometer: Real-time visualization of streaming hardware metrics
27
Instr command
● instr – List all instrumented functions
● instr module [arg] – Instrument all functions in modules matching pattern
● instr function [arg] – Instrument all functions matching pattern in module
28
Threads and Dynaprof Probes
● For threaded code, use the same probe!
● Dynaprof detects threads and loads a special version of the probe library.
● Each probe specifies what to do when a new thread is discovered.
● Each thread gets the same instrumentation.
29
Probe Warning
● Instrumentation is not free.
● Consider granularity of region being measured.
● Overhead for PAPI 2.3 is O(100) cycles.
  – Between 500 and 2000 cycles for a 2 counter read.
● Overhead for Wallclock is O(100) cycles.
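The granularity warning can be made concrete by estimating what share of a measured region is probe overhead. This is a rough Python sketch under the assumption that a fixed overhead is added once per instrumented call; the function name `overhead_fraction` and the region sizes are hypothetical.

```python
def overhead_fraction(region_cycles: float, probe_cycles: float) -> float:
    """Fraction of the observed cycles that is probe overhead rather than real work."""
    return probe_cycles / (region_cycles + probe_cycles)


# 2000 cycles is the upper bound quoted for a 2 counter read.
for region in (2_000, 100_000, 10_000_000):  # hypothetical region sizes in cycles
    share = overhead_fraction(region, 2_000)
    print(f"region of {region} cycles: {share:.2%} overhead")
```

Short functions called many times are the ones where the overhead dominates, so instrumenting at a coarser level usually gives more trustworthy numbers.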
30
Wallclock Probe
● High resolution, low latency timer
● Usage: use wallclockprobe
● Reports time in microseconds, $1.0 \times 10^{-6}$ s.
31
PAPI Probe
● Count PAPI Presets or Native Events
● Usage: use papiprobe [event,event,...]
● Default argument is PAPI_FP_INS, or PAPI_TOT_INS if the architecture doesn't support PAPI_FP_INS.
● Available events can be obtained by using: papi_avail -a
32
PAPI Probe and Multiplexing
● Requesting more metrics than there are physical counters automatically enables multiplexing.
● Instrumented regions must run for a minimum time, so that all virtual counters get a chance to run at least once.
  $\text{run-time}_{\min} = \text{num\_events} \times 0.01\ \text{s}$
● Automatic warning functionality is being rolled into PAPI.
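The minimum-runtime rule above is simple enough to check before instrumenting a region. A minimal Python sketch follows; the helper names are hypothetical, and the 0.01 s slice is the figure quoted on the slide rather than a value queried from PAPI.

```python
MULTIPLEX_SLICE_S = 0.01  # per-event time slice quoted on the slide


def needs_multiplexing(num_events: int, physical_counters: int) -> bool:
    """Multiplexing kicks in when more events are requested than counters exist."""
    return num_events > physical_counters


def min_region_runtime(num_events: int) -> float:
    """Shortest region runtime (seconds) that lets every virtual counter run once."""
    return num_events * MULTIPLEX_SLICE_S


# Example: 6 events on a machine with 2 physical counters (Pentium III class).
events, counters = 6, 2
if needs_multiplexing(events, counters):
    print(f"region should run at least {min_region_runtime(events):.2f} s")
```

Regions shorter than this bound may report zero or badly scaled values for some events.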
33
PAPI Native Events
● Look in the PAPI distribution
● See the README file for your architecture in the src directory
● See the example program tests/native.c in the src/tests directory
34
Power 3 Events
35
Power 3 Events 2
36
Power 4 Events
37
Pentium III Events
38
Intel Pentium IV Events (Arguments to perfex -e from PerfCtr distribution)
39
Sun UltraSparc II Events
40
Sun UltraSparc III Events
41
MIPS R12K Events
42
Alpha/DADD 21264 Events
43
Perfometer Probe
● Sends a stream of performance data every N seconds to the Perfometer GUI.
● Functions can be colored at instrumentation time.
  – Default color is white, 0xFFFFFF
● Usage:
  use perfometerprobe [0xRRGGBB]
  instr
44
Perfometer Probe 2
● Perfometer GUI is NOT launched automatically.
● showrgb in X11 lists colors and names.
● Run the Java GUI
  – java -jar Perfometer.jar
● Connect to the specified hostname and port.
45
Instrumenting SWIM with perfometerprobe
46
Instrumenting FSPX for Instructions Per Cycle
47
Instrumenting SWIM for Instructions Per Cycle
48
Reporting Probe Data
● The wallclock and PAPI probes produce very similar data.
● Both use a parsing script written in Perl.
  – wallclockrpt
  – papiproberpt
● Produce 3 profiles
  – Inclusive: $T_{function} = T_{self} + T_{children}$
  – Exclusive: $T_{function} = T_{self}$
  – 1-Level Call Tree: $T_{child} = \text{Inclusive } T_{function}$
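The three profiles differ only in how time is attributed across the call tree. The Python sketch below shows the arithmetic on a tiny made-up example; it is not the logic of wallclockrpt or papiproberpt, the function names and times are hypothetical, and it assumes a call tree without recursion.

```python
from typing import Dict, List


def inclusive_time(func: str, self_time: Dict[str, float],
                   children: Dict[str, List[str]]) -> float:
    """Inclusive time: the function's own time plus the inclusive time of each callee."""
    return self_time[func] + sum(
        inclusive_time(child, self_time, children)
        for child in children.get(func, [])
    )


def one_level_call_tree(func: str, self_time: Dict[str, float],
                        children: Dict[str, List[str]]) -> Dict[str, float]:
    """1-level call tree: each direct child reported with its inclusive time."""
    return {child: inclusive_time(child, self_time, children)
            for child in children.get(func, [])}


# Hypothetical exclusive (self) times in seconds and a hypothetical call tree.
self_time = {"main": 1.0, "solve": 4.0, "update": 2.5}
children = {"main": ["solve"], "solve": ["update"]}

for name in self_time:
    print(name, "exclusive:", self_time[name],
          "inclusive:", inclusive_time(name, self_time, children))
print("children of main:", one_level_call_tree("main", self_time, children))
```

The exclusive profile points at the code that actually consumes the resource, while the inclusive profile shows which call paths lead there.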
49
Fspx Cycles & Instrs.
50
fspx IPC
Function  IPC
proflux   0.61
phase     0.63
flux      0.49
pde       0.46
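IPC is the ratio of completed instructions to elapsed cycles, which with PAPI presets would presumably be PAPI_TOT_INS divided by PAPI_TOT_CYC. A short Python sketch of the calculation follows; the raw counts are invented to reproduce one of the ratios above and are not the measured values.

```python
def instructions_per_cycle(total_instructions: int, total_cycles: int) -> float:
    """IPC: completed instructions divided by elapsed cycles."""
    if total_cycles <= 0:
        raise ValueError("cycle count must be positive")
    return total_instructions / total_cycles


# Hypothetical counter readings for one function.
ins, cyc = 610_000_000, 1_000_000_000
print(f"IPC = {instructions_per_cycle(ins, cyc):.2f}")
```

An IPC well below 1 on a processor that can issue several instructions per cycle suggests stalls, which is the cue to look at cache and TLB events next.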
51
Swim Cycles & Instrs.
52
Swim IPC
Function  IPC
calc2     0.59
calc1     0.53
calc3     0.46
53
Perfometer Screenshot
54
Dynaprof 0.8 SC Release
● Binary distribution for 4 platforms on the website
  – AIX 3.x / DPCL 3.2.5 on Power 3
  – Linux / DynInst 3.0 on Pentium <= III
  – Solaris 2.8 / DynInst 3.0 on UltraSparc II/III
  – IRIX / DynInst 3.0 on MIPS R10/12/14k
  – Power 4 and Pentium 4 are coming...
● Xdynaprof Java/Swing GUI included
● perfometerprobe and GUI included
● Updated documentation
55
References
● The Dynaprof Homepage: http://www.cs.utk.edu/~mucci/dynaprof
● The PAPI Homepage: http://icl.cs.utk.edu/projects/papi
● The DynInst Homepage: http://www.dyninst.org
● The DPCL Homepage: http://oss.software.ibm.com/developerworks/opensource/dpcl
● The Vprof Homepage: http://aros.ca.sandia.gov/~cljanss/perf/vprof
● The GNU Readline Homepage: http://cnswww.cns.cwru.edu/~chet/readline/rltop.html