Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3 Philip Mucci Shirley Moore


1 Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3
Philip Mucci mucci@cs.utk.edu
Shirley Moore shirley@cs.utk.edu
Daniel Terpstra terpstra@cs.utk.edu

2 October 11, 2001 ScicomP4 Knoxville, TN
Hardware Counters
Small set of registers that count events, which are occurrences of specific signals related to the processor’s function
Monitoring these events facilitates correlation between the structure of the source/object code and the efficiency of the mapping of that code to the underlying architecture.

3 Goals of PAPI
Solid foundation for cross platform performance analysis tools
Free tool developers from re-implementing counter access
Standardization between vendors, academics and users
Encourage vendors to provide hardware and OS support for counter access
Reference implementations for a number of HPC architectures
Well documented and easy to use

4 Overview of PAPI
Performance Application Programming Interface
The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Parallel Tools Consortium project http://www.ptools.org/

5 PAPI Counter Interfaces
PAPI provides three interfaces to the underlying counter hardware:
1. The low level interface manages hardware events in user defined groups called EventSets.
2. The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.
3. Graphical tools to visualize information.

6 PAPI Implementation
[Figure: layered architecture. Portable Layer: Tools, PAPI High Level, PAPI Low Level. Machine Specific Layer: PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counter.]

7 PAPI Preset Events
Proposed standard set of events deemed most relevant for application performance tuning
Defined in papiStdEventDefs.h
Mapped to native events on a given platform
–Run tests/avail to see list of PAPI preset events available on a platform

8 High-level Interface
Meant for application programmers wanting coarse-grained measurements
Not thread safe
Calls the lower level API
Allows only PAPI preset events
Easier to use and less setup (additional code) than low-level

9 High-level API
C interface
–PAPI_start_counters
–PAPI_read_counters
–PAPI_stop_counters
–PAPI_accum_counters
–PAPI_num_counters
–PAPI_flops
Fortran interface
–PAPIF_start_counters
–PAPIF_read_counters
–PAPIF_stop_counters
–PAPIF_accum_counters
–PAPIF_num_counters
–PAPIF_flops

10 Using the High-level API
int PAPI_num_counters(void)
–Initializes PAPI (if needed)
–Returns number of hardware counters
int PAPI_start_counters(int *events, int len)
–Initializes PAPI (if needed)
–Sets up an event set with the given counters
–Starts counting in the event set
int PAPI_library_init(int version)
–Low-level routine implicitly called by above

11 Controlling the Counters
PAPI_stop_counters(long_long *vals, int alen)
–Stop counters and put counter values in array
PAPI_accum_counters(long_long *vals, int alen)
–Accumulate counters into array and reset
PAPI_read_counters(long_long *vals, int alen)
–Copy counter values into array and reset counters
PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)
–Wallclock time, process time, FP ins since start,
–Mflop/s since last call

12 PAPI_flops
int PAPI_flops(float *real_time, float *proc_time, long_long *flpins, float *mflops)
–Only two calls needed, PAPI_flops before and after the code you want to monitor
–real_time is the wall-clock time between the two calls
–proc_time is the “virtual” time or time the process was actually executing between the two calls (not as fine grained as real_time but better for longer measurements)
–flpins is the total floating point instructions executed between the two calls
–mflops is the Mflop/s rating between the two calls
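The Mflop/s figure described above is a simple ratio, and it can help to see the arithmetic on its own. The sketch below is not PAPI code: mflops_rate is a hypothetical helper that assumes the floating point instruction count and the elapsed time in seconds have already been obtained, and it does not claim to reproduce how PAPI_flops computes its result internally.

```c
#include <assert.h>

/* Hypothetical helper, not part of PAPI.
 * Converts a floating point instruction count and an elapsed time in
 * seconds into millions of floating point instructions per second. */
double mflops_rate(long long flpins, double seconds)
{
    assert(seconds > 0.0);               /* a rate needs a positive interval */
    return (double)flpins / seconds / 1.0e6;
}
```

For example, 2,000,000 floating point instructions in one second of process time gives a rate of 2 Mflop/s. Using process time rather than wall-clock time as the divisor keeps time spent descheduled out of the rate.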

13 PAPI High-level Example
    long long values[NUM_EVENTS];
    int Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC};
    int retval;
    /* Start the counters */
    retval = PAPI_start_counters(Events,NUM_EVENTS);
    /* The code we are monitoring */
    do_work();
    /* Stop the counters and store the results in values */
    retval = PAPI_stop_counters(values,NUM_EVENTS);
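Once the example above has stopped the counters, values[0] holds total instructions and values[1] holds total cycles. A natural derived metric is instructions per cycle. The helper below is a hypothetical illustration in plain C (the name ipc is an assumption, not a PAPI routine) and only does the division, with a guard for a zero cycle count.

```c
#include <assert.h>

/* Hypothetical helper, not part of PAPI.
 * Instructions per cycle from two raw counter totals.
 * Returns 0.0 when no cycles were counted, to avoid dividing by zero. */
double ipc(long long instructions, long long cycles)
{
    assert(instructions >= 0);
    if (cycles <= 0)
        return 0.0;
    return (double)instructions / (double)cycles;
}
```

With the arrays from the slide this would be called as ipc(values[0], values[1]).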

14 Return Codes

15 Low-level Interface
Increased efficiency and functionality over the high level PAPI interface
About 40 functions
Obtain information about the executable and the hardware
Thread-safe
Fully programmable
Callbacks on counter overflow

16 Low-level Functionality
Library initialization
–PAPI_library_init, PAPI_thread_init, PAPI_shutdown
Timing functions
–PAPI_get_real_usec, PAPI_get_virt_usec
–PAPI_get_real_cyc, PAPI_get_virt_cyc
Inquiry functions
Management functions
Simple lock
–PAPI_lock/PAPI_unlock

17 Event sets
The event set contains key information
–What low-level hardware counters to use
–Most recently read counter values
–The state of the event set (running/not running)
–Option settings (e.g., domain, granularity, overflow, profiling)
Event sets can overlap if they map to the same hardware counter set-up.
–Allows inclusive/exclusive measurements

18 Event set operations
Event set management
–PAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset
Event set control
–PAPI_start, PAPI_stop, PAPI_read, PAPI_accum
Event set inquiry
–PAPI_query_event, PAPI_list_events,...

19 Simple example
    #include "papi.h"
    #define NUM_EVENTS 2
    int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;
    int retval;
    long_long values[NUM_EVENTS];
    /* Initialize the Library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    /* Allocate space for the new eventset and do setup */
    retval = PAPI_create_eventset(&EventSet);
    /* Add floating point instructions and total cycles to the eventset */
    retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS);
    /* Start the counters */
    retval = PAPI_start(EventSet);
    do_work(); /* What we want to monitor */
    /* Stop counters and store results in values */
    retval = PAPI_stop(EventSet,values);

20 Overlapping counters
    retval = PAPI_start(InclEventSet);
    retval = PAPI_start(OthersEventSet);
    ...
    retval = PAPI_reset(OthersEventSet);
    do_flops(NUM_FLOPS); /* Function call */
    retval = PAPI_accum(OthersEventSet,Othersvalues);
    ...
    retval = PAPI_stop(InclEventSet,Inclvalues);
    printf("Counts: %12lld %12lld\n", Inclvalues[0], Inclvalues[0]-Othersvalues[0]);
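The printf on this slide subtracts one event set's count from the other's. Inclvalues covers the whole measured span, while Othersvalues is reset just before do_flops and accumulated just after it, so (assuming Othersvalues started at zero) the difference is the count for everything outside that call. The sketch below shows the subtraction for a whole array of counters; exclusive_counts is a hypothetical helper in plain C, not a PAPI function.

```c
#include <assert.h>

/* Hypothetical helper, not part of PAPI.
 * Given inclusive totals for a whole span and the totals measured for an
 * inner region, writes the counts that fall outside the inner region. */
void exclusive_counts(const long long *incl, const long long *inner,
                      long long *excl, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        excl[i] = incl[i] - inner[i];    /* whole span minus inner region */
}
```

The subtraction is only meaningful when both event sets count the same events in the same order, which is the overlap condition described on the Event sets slide.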

21 Counter Domains
int PAPI_set_domain(int domain);
–PAPI_DOM_USER User context counted
–PAPI_DOM_KERNEL Kernel/OS context counted
–PAPI_DOM_OTHER Exception/transient mode
–PAPI_DOM_ALL All contexts counted
–PAPI_DOM_MIN The smallest available context
–PAPI_DOM_MAX The largest available context
Requires OS support; not all domains are available on all platforms
Default is PAPI_DOM_USER

22 Using PAPI with Threads
After PAPI_library_init, a function that returns a unique thread identifier must be registered
For Pthreads
    retval=PAPI_thread_init(pthread_self, 0);
For OpenMP
    retval=PAPI_thread_init(omp_get_thread_num, 0);
Each thread is responsible for the creation, start, stop and read of its own counters

23 Using PAPI with Multiplexing
Multiplexing allows simultaneous use of more counters than are supported by the hardware.
Some platforms support multiplexing in the OS (e.g., SGI IRIX); on those that don’t, PAPI implements multiplexing in software.
The more events you multiplex and the shorter the measurement interval, the more likely it is that the reported counts are not representative.
See John May’s excellent whitepaper
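To see why short runs with many multiplexed events lose accuracy, it helps to look at how a time-sliced count is usually turned into a full-run estimate. The sketch below is a generic illustration of that scaling and is not taken from PAPI's implementation; mpx_estimate and the notion of equal-length slices are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper, not part of PAPI.
 * An event was counted only during active_slices out of total_slices
 * equal time slices. Scale the observed count up to a full-run estimate,
 * assuming the event rate was the same in the unobserved slices. */
long long mpx_estimate(long long observed, int active_slices, int total_slices)
{
    assert(active_slices > 0 && active_slices <= total_slices);
    return observed * total_slices / active_slices;
}
```

An event observed 1000 times during 4 of 16 slices is estimated at 4000. The estimate is only as good as the assumption that the unobserved slices behaved like the observed ones, which is least likely to hold when each event gets few slices.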

24 Issues with Multiplexing
PAPI_multiplex_init()
–Should be called after PAPI_library_init() to initialize multiplexing
PAPI_set_multiplex( int *EventSet );
–Used after the eventset is created to turn on multiplexing for that eventset
Then use PAPI like normal

25 Native Events
An event countable by the CPU can be counted even if there is no matching preset PAPI event
Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition
/usr/lpp/pmtoolkit/lib/POWER3-II.evs

26 Power3 Native Event Example
    native = (35 << 8) | 1; /* FPU1CMPL */
    PAPI_add_event(&EventSet,native);
See for list of native events.
The encoding for native events is:
–Lower 8 bits indicate which counter number: 0 – 7
–Bits 8-16 indicate which event number: 0-50
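The encoding on this slide can be written as a pair of small pack and unpack routines. This is a sketch in plain C with hypothetical names (native_encode, native_counter, native_event); it follows the bit layout stated above and treats everything above the low 8 bits as the event number, which is an assumption about the unused high bits.

```c
#include <assert.h>

/* Hypothetical helpers, not part of PAPI.
 * Pack an event number and a counter number into one native event code:
 * counter number in the low 8 bits, event number starting at bit 8. */
unsigned int native_encode(unsigned int event, unsigned int counter)
{
    assert(counter <= 0xFFu);            /* counter must fit in 8 bits */
    return (event << 8) | counter;
}

/* Extract the counter number from the low 8 bits. */
unsigned int native_counter(unsigned int code)
{
    return code & 0xFFu;
}

/* Extract the event number from the bits above the counter field. */
unsigned int native_event(unsigned int code)
{
    return code >> 8;
}
```

native_encode(35, 1) gives the same value as the slide's (35 << 8) | 1, which is 8961, and decoding that value returns counter 1 and event 35.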

27 Power 3 General Events
1. INST_CMPL (1)
2. FPU1_CMPL (35)
3. LD_MISS_L1 (5)
4. LD_CMPL (5)
5. FPU0_CMPL (5)
6. CYC (12)
7. FMA (9)
8. TLB (0)

28 Power 3 False Sharing Events
1. L1_M_TO_E_OR_S (25)
2. E_TO_S (23)
3. L2_E_OR_S_TO_I (21)
4. L2_M_TO_E_OR_S (27)
5. L2_M_TO_I (10)
6. PUSH_INT (8)
7. 0INST_DISP (6)
8. TLB (0)

29 Callbacks on Counter Overflow
PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold.
For systems that do not support counter overflow at the OS level, PAPI sets up a high resolution interval timer and installs a timer interrupt handler.

30 PAPI_overflow
int PAPI_overflow(int EventSet, int EventCode, int threshold, int flags, PAPI_overflow_handler_t handler)
Sets up an EventSet such that when it is PAPI_start()’d, it begins to register overflows
The EventSet may contain multiple events, but only one may be an overflow trigger.
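The threshold-and-handler idea can be illustrated without any counter hardware. The sketch below is a software simulation in plain C: sim_counter, sim_counter_init, sim_counter_add and count_overflows are all hypothetical names, the handler signature is invented for the example and is not PAPI_overflow_handler_t, and nothing here reflects how PAPI delivers overflows internally.

```c
#include <assert.h>

/* Hypothetical handler type for this simulation only. */
typedef void (*sim_handler_t)(long long count, void *ctx);

/* A simulated event counter that fires a handler every `threshold` events. */
typedef struct {
    long long count;         /* events seen so far */
    long long threshold;     /* events between handler calls */
    long long next_trigger;  /* count at which the handler fires next */
    sim_handler_t handler;
    void *ctx;               /* opaque pointer passed back to the handler */
} sim_counter;

void sim_counter_init(sim_counter *c, long long threshold,
                      sim_handler_t handler, void *ctx)
{
    assert(threshold > 0);
    c->count = 0;
    c->threshold = threshold;
    c->next_trigger = threshold;
    c->handler = handler;
    c->ctx = ctx;
}

/* Record `events` new events and call the handler once per threshold crossed. */
void sim_counter_add(sim_counter *c, long long events)
{
    c->count += events;
    while (c->count >= c->next_trigger) {
        c->handler(c->count, c->ctx);
        c->next_trigger += c->threshold;
    }
}

/* Example handler: counts how many times it was called. */
void count_overflows(long long count, void *ctx)
{
    (void)count;
    ++*(int *)ctx;
}
```

With a threshold of 100, adding 250 events calls the handler twice, and adding 50 more calls it a third time when the count reaches 300. A handler that records the interrupted program counter instead of a tally is the basis of statistical profiling.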

31 Statistical Profiling
PAPI provides support for execution profiling based on any counter event.
PAPI_profil() creates a histogram by text address of overflow counts for a specified region of the application code.
Used in vprof tool from Sandia Livermore
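A histogram by text address amounts to mapping each sampled program counter to a bucket. The sketch below shows that mapping in plain C under simplifying assumptions: sim_profile and sim_profile_record are hypothetical names, buckets have a fixed byte width, and the scale-factor arithmetic that a real profil-style interface uses is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure for this illustration, not a PAPI type. */
typedef struct {
    unsigned long start;         /* first text address covered */
    unsigned long bucket_bytes;  /* bytes of code per histogram bucket */
    size_t nbuckets;             /* number of buckets in hits[] */
    unsigned int *hits;          /* overflow counts per bucket */
} sim_profile;

/* Record one overflow sample taken at address pc.
 * Returns 1 if the sample fell inside the profiled region, 0 otherwise. */
int sim_profile_record(sim_profile *p, unsigned long pc)
{
    assert(p->bucket_bytes > 0);
    if (pc < p->start)
        return 0;                          /* below the region */
    size_t idx = (pc - p->start) / p->bucket_bytes;
    if (idx >= p->nbuckets)
        return 0;                          /* above the region */
    p->hits[idx]++;
    return 1;
}
```

After enough samples, buckets with the highest counts point at the code where the chosen event occurs most often, which a tool can then map back to source lines.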

32 PAPI Release Platforms
Linux/x86, Windows 2000
–Requires patch to Linux kernel, driver for Windows
Linux/IA-64
Sun Solaris 2.8/Ultra I/II
IBM AIX 4.3+/Power
–Contact IBM for pmtoolkit
SGI IRIX/MIPS
Compaq Tru64/Alpha Ev6 & Ev67
–Requires OS device driver from Compaq
Cray T3E/Unicos

33 For More Information
http://icl.cs.utk.edu/projects/papi/
–Software and documentation
–Reference materials
–Papers and presentations
–Third-party tools
–Mailing lists

34 PERC: Performance Evaluation Research Center
Developing a science for understanding performance of scientific applications on high-end computer systems.
Developing engineering strategies for improving performance on these systems.
DOE Labs: ANL, LBNL, LLNL, ORNL
Universities: UCSD, UI-UC, UMD, UTK
Funded by SciDAC: Scientific Discovery through Advanced Computing

35 PERC: Real-World Applications
High Energy and Nuclear Physics
–Shedding New Light on Exploding Stars: Terascale Simulations of Neutrino-Driven SuperNovae and Their NucleoSynthesis
–Advanced Computing for 21st Century Accelerator Science and Technology
Biology and Environmental Research
–Collaborative Design and Development of the Community Climate System Model for Terascale Computers
Fusion Energy Sciences
–Numerical Computation of Wave-Plasma Interactions in Multi-dimensional Systems
Advanced Scientific Computing
–Terascale Optimal PDE Solvers (TOPS)
–Applied Partial Differential Equations Center (APDEC)
–Scientific Data Management (SDM)
Chemical Sciences
–Accurate Properties for Open-Shell States of Large Molecules
…and more…

36 Parallel Climate Transition Model
Components for Ocean, Atmosphere, Sea Ice, Land Surface and River Transport
Developed by Warren Washington’s group at NCAR
POP: Parallel Ocean Program from LANL
CCM3: Community Climate Model 3.2 from NCAR including LSM: Land Surface Model
ICE: CICE from LANL and CCSM from NCAR
RTM: River Transport Module from UT Austin

37 PCTM: The Code
132K lines of code
Mostly Fortran 90 with some Fortran 77 and C
Each component runs sequentially
Each component is parallel
The overall model must run as pure MPI
Code is already instrumented with timing library in flux coupler
Add additional PAPI code to measure 8 hardware metrics for each timer

38 PCTM: Parallel Climate Transition Model
[Figure: Flux Coupler with Land Surface Model, Ocean Model, Atmosphere Model, Sea Ice Model and River Model. Sequential Execution of Parallelized Modules.]

39 PCTM: Performance Questions
Are we cache, functional unit or bound?
How does shared memory MPI affect performance?
How many tasks on a 4-way node?

40 PCTM: Answers
No answers yet… Wait for SC2001
–Measure coarse regions of code for
  Instructions per Cycle
  Stall Cycles
  0 Instruction Dispatch Cycles
–Measure those regions with multiplexing and inspect code
–Statistically profile candidates for improvement, line by line

