PAPI Deployment, Evaluation, and Extensions

Shirley Moore, Daniel Terpstra, Kevin London, and Philip Mucci, University of Tennessee-Knoxville
Patricia J. Teller, Leonardo Salayandia, Alonso Bayona, and Manuel Nieto, University of Texas-El Paso

Computer Science Department, University of Texas at El Paso
PCAT (Performance Counter Assessment Team) and PAPI Development Team
UGC 2003, Bellevue, WA – June 9-13, 2003

Main Objectives
Provide DoD users with a set of portable tools and accompanying documentation that enables them to easily collect, analyze, and interpret hardware performance data that is highly relevant for analyzing and improving the performance of applications on HPC platforms.

PAPI: Performance Application Programmer Interface
A cross-platform interface to hardware performance counters, layered between users (application programmers, tool developers, advanced users) and the vendor interface to the performance monitoring hardware.

PAPI high-level interface:
- Routines to start, read, and stop counters
- Specific list of event counts
- Obtain performance data
- Thread-safe

PAPI low-level interface:
- Fully programmable
- All native events and counting modes
- Callbacks on counter overflow
- SVR4-compatible profiling
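For reference, a minimal sketch of driving the low-level interface from C follows. The choice of PAPI_TOT_CYC and PAPI_TOT_INS and the timed loop are illustrative assumptions, and error handling is abbreviated.

```c
/* Minimal low-level PAPI usage sketch (illustrative, not from the slides).
 * Assumes PAPI_TOT_CYC and PAPI_TOT_INS are available on the platform. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);                              /* library/header mismatch */

    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_CYC);  /* total cycles */
    PAPI_add_event(event_set, PAPI_TOT_INS);  /* instructions completed */

    PAPI_start(event_set);
    /* ... region of interest (a dummy loop here) ... */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    PAPI_stop(event_set, counts);

    printf("cycles: %lld  instructions: %lld\n", counts[0], counts[1]);
    return 0;
}
```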

PAPI Standard Event Set
A common set of events that are considered most relevant for application performance analysis. Included in this set are:
- Cycle and instruction counts
- Functional unit status
- Cache and memory access events
- SMP cache coherence events
Many PAPI events are mapped directly to native platform events; some are derived from two or more native events. Run avail to find out what standard events are available on a given platform.
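Availability of the standard (preset) events can also be checked programmatically with PAPI_query_event; the particular events queried in the sketch below are illustrative assumptions.

```c
/* Sketch: query a few PAPI standard (preset) events programmatically.
 * Which events exist depends on the platform, as the avail utility reports. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int presets[] = { PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_FP_INS,
                      PAPI_L1_DCM, PAPI_TLB_DM };
    const char *names[] = { "PAPI_TOT_CYC", "PAPI_TOT_INS", "PAPI_FP_INS",
                            "PAPI_L1_DCM", "PAPI_TLB_DM" };

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    for (int i = 0; i < 5; i++)
        printf("%-14s %s\n", names[i],
               PAPI_query_event(presets[i]) == PAPI_OK
                   ? "available" : "not available");
    return 0;
}
```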

Deployment: Objectives
- Develop a portable and efficient interface to the performance monitoring hardware on DoD platforms.
- Install and support PAPI software and related tools on DoD HPC Center machines.
- Collaborate with vendors and users to add additional features and extensions.

Deployment: Methodology - 1
Investigation of statistical sampling to reduce instrumentation overhead:
- PAPI substrate for HP AlphaServer based on DADD/DCPI sampling interface
- 2-3% overhead vs. 30% using counting
- Investigating similar approach on other platforms
Efficient counter allocation algorithm based on bipartite graph matching – improved allocation on IBM POWER platforms (see the sketch below).
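Counter allocation can be viewed as matching requested events to the hardware counters able to count them. The sketch below is a generic augmenting-path bipartite matching (Kuhn's algorithm) for illustration only; it is not PAPI's actual allocator, and the event/counter compatibility table is hypothetical.

```c
/* Illustrative sketch: assign requested events to compatible hardware
 * counters via augmenting-path bipartite matching.  Not PAPI's allocator;
 * the compatibility table below is hypothetical. */
#include <stdio.h>
#include <string.h>

#define NEVENTS   3
#define NCOUNTERS 4

/* compat[e][c] != 0 means event e can be counted on hardware counter c. */
static const int compat[NEVENTS][NCOUNTERS] = {
    {1, 1, 0, 0},   /* event 0: counters 0 or 1 */
    {1, 0, 0, 0},   /* event 1: counter 0 only  */
    {0, 1, 1, 1},   /* event 2: counters 1-3    */
};

static int match_counter[NCOUNTERS];   /* counter -> event, or -1 if free */

static int try_assign(int e, int *visited)
{
    for (int c = 0; c < NCOUNTERS; c++) {
        if (!compat[e][c] || visited[c])
            continue;
        visited[c] = 1;
        /* Take a free counter, or displace the current holder if it can move. */
        if (match_counter[c] < 0 || try_assign(match_counter[c], visited)) {
            match_counter[c] = e;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    memset(match_counter, -1, sizeof match_counter);
    int assigned = 0;
    for (int e = 0; e < NEVENTS; e++) {
        int visited[NCOUNTERS] = {0};
        assigned += try_assign(e, visited);
    }
    printf("assigned %d of %d events\n", assigned, NEVENTS);
    for (int c = 0; c < NCOUNTERS; c++)
        if (match_counter[c] >= 0)
            printf("counter %d <- event %d\n", c, match_counter[c]);
    return 0;
}
```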

Deployment: Methodology - 2
Need easy-to-use end-user tools for collecting and analyzing PAPI data:
- TAU (Tuning and Analysis Utilities) from the University of Oregon provides automatic instrumentation for profiling and tracing, with profiling based on time and/or PAPI data.
- PAPI is being incorporated into VAMPIR.
- The perfometer graphical analysis tool provides real-time display and/or tracefile capture and replay.
- papirun command-line utility under development (similar to perfex and ssrun on SGI IRIX)
- dynaprof dynamic instrumentation tool under development

TAU Performance System Architecture
[Architecture diagram; labeled components include EPILOG and Paraver.]

Vampir v3.x: Hardware Counter Data
[Screenshot: counter timeline display.]

Perfometer Parallel Interface
[Screenshot.]

Deployment: Methodology - 3
Memory utilization extensions allow users to obtain static and dynamic memory utilization information. Routines added to the low-level API:
- PAPI_get_memory_info()
- PAPI_get_dmem_info()
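A hedged sketch of querying dynamic memory information follows; the structure type and field names (size, resident, high_water_mark) follow later PAPI releases and are assumptions here, since the 2003-era interface may have differed.

```c
/* Sketch: query dynamic memory utilization through PAPI.
 * Type and field names follow later PAPI releases and are assumptions;
 * the interface described in these slides may differ. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    PAPI_dmem_info_t dmem;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    if (PAPI_get_dmem_info(&dmem) == PAPI_OK)
        printf("size: %lld  resident: %lld  high_water_mark: %lld\n",
               dmem.size, dmem.resident, dmem.high_water_mark);
    return 0;
}
```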

Deployment: Results
PAPI implementations for the following DoD platforms:
- IBM POWER3/4
- HP AlphaServer
- SGI Origin
- Sun UltraSparc
- Cray T3E
- Linux clusters
Cray X1 implementation underway.
PAPI installed at all four MSRCs and at ARSC and MHPCC; plan to install at additional DCs.
PAPI widely used by DoD application developers and the HPCMO benchmarking team.
Encouraging vendors to take over responsibility for implementing and supporting the PAPI machine-dependent substrate.

Evaluation: Objectives
- Understand and explain counts obtained for various PAPI metrics
- Determine reasons why counts may be different from what is expected
- Calibrate counts, excluding PAPI overhead
- Work with vendors and/or the PAPI team to fix errors
- Provide DoD users with information that will allow them to effectively use collected performance data

Evaluation: Methodology
1. Micro-benchmark: design and implement a micro-benchmark that facilitates event count prediction
2. Prediction: predict event counts using tools and/or mathematical models
3. Data collection-1: collect hardware-reported event counts using PAPI
4. Data collection-2: collect predicted event counts using a simulator (not always necessary or possible)

Evaluation: Methodology (continued)
5. Comparison: compare predicted and hardware-reported event counts
6. Analysis: analyze results to identify and possibly quantify differences
7. Alternate approach: when analysis indicates that prediction is not possible, use an alternate means to either verify reported event count accuracy or demonstrate that the reported event count seems reasonable

Example Findings - 1
- Some hardware-reported event counts mirror expected behavior, e.g., the number of floating-point instructions on the MIPS R10K and R12K.
- Other hardware-reported event counts can be calibrated to mirror expected behavior by subtracting the part of the count associated with the interface (overhead or bias error), e.g., the number of load instructions on the MIPS and POWER processors and instructions completed on the POWER3.
- In some cases, compiler optimizations affect event counts, e.g., the number of floating-point instructions on the IBM POWER platforms.

Example Findings - 2
- Very long instruction words can affect event counts, e.g., on the Itanium architecture the numbers of instruction cache misses and instructions retired are dilated by no-ops used to compose very long instruction words.
- The definition of the event count may be non-standard and, thus, the associated performance data may be misleading, e.g., instruction cache hits on the POWER3.
- The complexity of hardware features and a lack of documentation can make it difficult to understand how to tune performance based on information gleaned from event counts, e.g., data prefetching and the TLB walker.

Example Findings - 3
- Although we have not been able to determine the algorithms used for prefetching, the ingenuity of these mechanisms is striking.
- In some cases, more instructions are completed than issued on the R10K.
- The DTLB miss count on the POWER3 varies depending upon the method used to allocate memory (i.e., static, calloc, or malloc).
- Hardware SQRT on the POWER3 is not counted in total floating-point operations unless combined with another floating-point operation.

Calibration Example - 1
Instructions completed; PAPI overhead: 139 on the POWER3-II.
[Table: number of C-level instructions (0, base case), predicted count, mean reported count, standard deviation, reported minus predicted = 139; the remaining numeric values were not captured in the transcript.]

Calibration Example - 2
Instructions completed; PAPI overhead: 141 for small micro-benchmarks.
[Table: number of C-level instructions (0, base case), predicted count, mean reported count, standard deviation, reported minus predicted; the numeric values were not captured in the transcript.]
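One way to estimate such interface overhead is sketched below, under the assumption that counting an empty region isolates the cost of the PAPI start/stop calls themselves; this illustrates the calibration idea rather than PCAT's exact procedure.

```c
/* Sketch: estimate the PAPI interface overhead on an instructions-completed
 * count by measuring an empty region many times and taking the mean.
 * Illustration of the calibration idea, not PCAT's exact method. */
#include <stdio.h>
#include <papi.h>

#define TRIALS 100

int main(void)
{
    int event_set = PAPI_NULL;
    long long count, sum = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_INS);   /* instructions completed */

    for (int i = 0; i < TRIALS; i++) {
        PAPI_start(event_set);
        /* empty region: everything counted here is interface overhead */
        PAPI_stop(event_set, &count);
        sum += count;
    }
    printf("mean overhead: %lld instructions\n", sum / TRIALS);
    /* Subtract this mean from counts reported for real micro-benchmarks
     * to calibrate them, as in the examples above. */
    return 0;
}
```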

RIB/OKC for Evaluation Resources
- Object-oriented data model to store benchmarks, results, and analyses
- Information organized for ease of use by colleagues external to PCAT
- To be web-accessible to members
- Objects linked to each other as appropriate
Object types:
- Benchmark: general description of a benchmark
- Machine: description of a platform
- Case: specific implementation and results
- Organization: contact information

PCAT RIB/OKC Data Repository Example
Benchmark name: DTLB misses
Development date: 12/2002
Benchmark type: Array
Abstract: Code traverses through an array of integers once at regular strides of PAGESIZE. The intention is to create compulsory misses on each array access. Input parameters are page size (bytes) and array size (bytes). The number of misses normally expected is Array Size / Page Size.
Files included: dtlbmiss.c, dtlbmiss.pl
About included files:
- dtlbmiss.c: benchmark source code in C; requires pagesize and arraysize parameters as input and outputs the PAPI event count.
- dtlbmiss.pl: Perl script that executes the benchmark 100 times for increasing arraysize parameters and saves benchmark output to a text file. The script should be customized for the pagesize parameter and arraysize range.
Links to files
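A hedged reconstruction of what such a benchmark looks like is given below; it is not the actual dtlbmiss.c, and the parameter handling and traversal details are assumptions based on the abstract above.

```c
/* Sketch of a DTLB-miss micro-benchmark in the spirit of dtlbmiss.c
 * (not the actual PCAT source).  Touch an integer array once per page so
 * that roughly arraysize / pagesize data TLB misses are expected. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s pagesize_bytes arraysize_bytes\n", argv[0]);
        return 1;
    }
    size_t pagesize  = (size_t)atol(argv[1]);
    size_t arraysize = (size_t)atol(argv[2]);
    size_t stride    = pagesize / sizeof(int);
    volatile int *a  = malloc(arraysize);
    long long misses;
    int event_set = PAPI_NULL;

    if (!a || PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TLB_DM);      /* data TLB misses */

    PAPI_start(event_set);
    for (size_t i = 0; i < arraysize / sizeof(int); i += stride)
        a[i]++;                                  /* one access per page */
    PAPI_stop(event_set, &misses);

    printf("expected ~%zu, reported %lld\n", arraysize / pagesize, misses);
    free((void *)a);
    return 0;
}
```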

PCAT RIB/OKC Example Case Object
Name: DTLB misses on Itanium
Date: 12/2002
Compiler and options: gcc ver (Red Hat Linux ) -O0
PAPI event: PAPI_TLB_DM, data TLB misses
Native event: DTLB_MISSES
Experimental methodology: ran the benchmark 100 times with the Perl script; averages and standard deviations reported
Input parameters used: page size = 16K, array size = 16K – 160M (increments by multiples of 10)
Platform used: HP01.cs.utk.edu (Itanium)
Developed by: PCAT
Benchmark used: DTLB misses
Links to other objects

PCAT RIB/OKC Example Case Object (continued)
Results summary: Reported counts closely match the predicted counts, showing differences close to 0% even in the cases with a small number of data references, which may be more susceptible to external perturbation. The counts indicate that prefetching is not performed at the DTLB level.
Included files and description:
- dtlbmiss.itanium.c: source code of the benchmark, instrumented with PAPI to count PAPI_TLB_DM
- dtlbmiss.itanium.pl: Perl script used to run the benchmark
- dtlbmiss.itanium.txt: raw data obtained; each column contains results for a particular array size, and each case is run 100 times (i.e., 100 rows included)
- dtlbmiss.itanium.xls: includes raw data, averages of runs, standard deviations, and a graph of the % difference between reported and predicted counts
- dtlbmiss.itanium.pdf: same as dtlbmiss.itanium.xls

Contributions
- Infrastructure that facilitates user access to hardware performance data that is highly relevant for analyzing and improving the performance of their applications on HPC platforms.
- Information that allows users to effectively use the data with confidence.

QUESTIONS?