1 SciDAC High-End Computer System Performance: Science and Engineering
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee http://
2 Four Components for the University of Tennessee
- Performance capturing tools: PAPI
- Self-adapting numerical software, automatic performance enhancement: SANS/AEOS/ATLAS
- Performance repository for apps, kernels, machines, etc.: NETLIB, Repository in a Box (RIB)
- Modeling, predictability
3 Tools for Performance Evaluation
- Timing and performance evaluation has been an art
  - Resolution of the clock
  - Issues about cache effects
  - Different systems
- Can be cumbersome and inefficient with traditional tools
- Situation about to change
  - Today’s processors have internal counters
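The clock-resolution and cache-effect issues above can be made concrete with a small sketch. It is only an illustration under stated assumptions: it uses Python's standard `time.perf_counter`, and the helper names `estimate_clock_resolution` and `time_with_repeats` are hypothetical, not part of any tool named in these slides.

```python
import time


def estimate_clock_resolution(clock=time.perf_counter, samples=1000):
    """Return the smallest nonzero step observed between successive clock reads."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        # Spin until the clock visibly advances past t0.
        while t1 == t0:
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest


def time_with_repeats(func, repeats=5):
    """Time func several times; the first run typically includes cold-cache cost."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    print("approximate clock resolution (s):", estimate_clock_resolution())
    print("repeated timings (s):", time_with_repeats(lambda: sum(range(100_000))))
```

A measured interval that is not much larger than the reported resolution is mostly noise, and the spread between the first and later timings hints at how much cache state can distort a single measurement.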
4 Performance Counters
- Almost all high-performance processors include hardware performance counters.
- Some are easy to access; others are not available to users.
- On most platforms the APIs, if they exist, are neither well documented nor appropriate for the end user.
- Existing performance counter APIs
  - Compaq Alpha EV6 & EV67
  - SGI MIPS R10000
  - IBM Power Series
  - Cray T3E
  - Sun Solaris
  - Pentium Linux and Windows
  - IA-64
  - HP PA-RISC
  - Hitachi
  - Fujitsu
  - NEC
5 Overview of PAPI
- Performance Application Programming Interface
- The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
6 Performance Data from PAPI
- Execution rate (MIPS, Flop/s)
- Bandwidth utilization
  - Main memory
  - L2 cache
  - L1 cache
- Cache miss statistics: Icache, Dcache, and L2 cache
- TLB misses
- Mispredicted branches
- Instruction mix (FP, branch, LD/ST, other)
- Load/store instruction issue rate
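As a rough illustration of how raw counter values turn into the rates listed above, the sketch below derives MIPS, MFlop/s, a cache miss rate, and a branch misprediction rate from a dictionary of counts. The function name and the dictionary keys are hypothetical placeholders, not PAPI event names.

```python
def derived_metrics(counts, elapsed_seconds):
    """Compute common derived rates from raw event counts.

    `counts` uses hypothetical keys chosen for this example only.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        # Millions of instructions per second.
        "mips": counts["instructions"] / elapsed_seconds / 1e6,
        # Millions of floating-point operations per second.
        "mflops": counts["fp_ops"] / elapsed_seconds / 1e6,
        # Fraction of L1 data cache accesses that missed.
        "l1_dcache_miss_rate": counts["l1_dcache_misses"] / counts["l1_dcache_accesses"],
        # Fraction of branches that were mispredicted.
        "branch_mispredict_rate": counts["branch_mispredictions"] / counts["branches"],
    }


if __name__ == "__main__":
    sample = {
        "instructions": 100_000_000,
        "fp_ops": 20_000_000,
        "l1_dcache_accesses": 10_000,
        "l1_dcache_misses": 1_000,
        "branches": 4_000,
        "branch_mispredictions": 200,
    }
    print(derived_metrics(sample, elapsed_seconds=2.0))
```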
7 Implementation
- Counters exist as a small set of registers that count events.
- PAPI provides three interfaces to the underlying counter hardware:
  1. The low-level interface manages hardware events in user-defined groups called EventSets.
  2. The high-level interface simply provides the ability to start, stop, and read the counters for a specified list of events.
  3. Graphical tools to visualize information.
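The idea of grouping events into a set that is started, read, and stopped can be sketched with a toy model. This is not PAPI's API: `ToyCounterHardware` and `ToyEventSet` are hypothetical classes invented for this example, and the "hardware" is a simulated set of free-running counters.

```python
class ToyCounterHardware:
    """Simulated free-running counters standing in for hardware registers."""

    def __init__(self):
        self.registers = {"cycles": 0, "instructions": 0, "fp_ops": 0}

    def run(self, cycles, instructions, fp_ops):
        # Pretend some work executed and advanced every counter.
        self.registers["cycles"] += cycles
        self.registers["instructions"] += instructions
        self.registers["fp_ops"] += fp_ops


class ToyEventSet:
    """A user-defined group of events that is started, read, and stopped together."""

    def __init__(self, hardware):
        self.hardware = hardware
        self.events = []
        self.baseline = None

    def add_event(self, name):
        if name not in self.hardware.registers:
            raise KeyError(f"unknown event: {name}")
        self.events.append(name)

    def start(self):
        # Remember the counter values at start so reads report deltas.
        self.baseline = {e: self.hardware.registers[e] for e in self.events}

    def read(self):
        if self.baseline is None:
            raise RuntimeError("event set is not running")
        return {e: self.hardware.registers[e] - self.baseline[e] for e in self.events}

    def stop(self):
        values = self.read()
        self.baseline = None
        return values


if __name__ == "__main__":
    hw = ToyCounterHardware()
    events = ToyEventSet(hw)
    events.add_event("cycles")
    events.add_event("fp_ops")
    events.start()
    hw.run(cycles=100, instructions=40, fp_ops=8)
    print(events.stop())
```

In this model a high-level interface would amount to a thin wrapper that builds one such set from a list of event names and exposes only start, read, and stop.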
8 PAPI - Supported Processors
- Intel Pentium, Pro, II, III, 4: Linux 2.4, 2.2, 2.0 and perf kernel patch
- IBM Power 3, 604, 604e: for AIX 4.3 and pmtoolkit (in available)
- Sun UltraSparc I, II, & III: Solaris 2.8
- MIPS R10K, R12K
- AMD Athlon: Linux 2.4 and perf kernel patch
- Cray T3E, SV1, SV2
- Soon: Windows 2K, Compaq Alpha EV6 & 67 and Intel IA-64
9 Go To Demo
10 PAPI’s Parallel Interface
11 PAPI Development
- Extensions to PAPI to support collection and analysis of hardware performance counter data in the context of shared and distributed memory parallel programs
  - Allowing for straightforward instrumentation of multithreaded and multiprocessor applications
- Tools will include graphical tools extended with dynamic instrumentation capabilities
  - Framework for using Dyninst with parallel programs, the Free Probe Class Server (FPCS) and IBM’s Dynamic Probe Class Library (DPCL)
- Port PAPI to Compaq Alpha and HP machines
- Summary information on problem spots within applications
- Integration with other tools: SvPablo, Dyninst, etc.
- Help with setting up PAPI at various sites
12 Repository Development
- Repository of tools and data on performance evaluation
- A network-based catalog that will serve as a “road map” to important performance evaluation enabling technologies
- A methodology for evaluation and measurement of the success of the tools
- SciDAC outreach: start a community effort for the collection and dissemination of performance data
13 Self-Adapting Numerical Software (SANS)
- Today’s processors can achieve high performance, but this requires extensive machine-specific hand tuning
  - Simple operations like matrix-vector ops require many man-hours per platform
  - Software lags far behind hardware introduction
  - Only done if financial incentive is there
- Compilers not up to optimization challenge
- Hardware, compilers, and software have a large design space with many parameters
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
- Need for quick/dynamic deployment of optimized routines
- ATLAS - Automatically Tuned Linear Algebra Software
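The general idea of searching a parameter space empirically, rather than tuning by hand, can be sketched as follows. This is a toy illustration and not ATLAS's actual search: it times a blocked matrix multiply for a few candidate blocking sizes and keeps the fastest. The helper names `make_matrices`, `blocked_matmul`, and `pick_best_block` are hypothetical.

```python
import time


def make_matrices(n):
    """Build two small n x n test matrices with integer-valued entries."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    return a, b


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices, iterating over block x block tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def pick_best_block(n, candidates, repeats=3):
    """Time each candidate blocking size and return the fastest one."""
    a, b = make_matrices(n)
    best_block, best_time = None, float("inf")
    for block in candidates:
        fastest = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            fastest = min(fastest, time.perf_counter() - start)
        if fastest < best_time:
            best_block, best_time = block, fastest
    return best_block


if __name__ == "__main__":
    print("selected block size:", pick_best_block(48, [4, 8, 16, 48]))
```

The selected value depends on the machine and even on the run, which is the point: the choice is made by measurement on the target platform instead of being fixed in the source.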
14 SANS Extensions
- BLAS
- Sparse matrix operations
- Message passing
- Algorithm selection at a higher level
15 Repository In a Box (RIB)
- Metadata objects are stored in repositories.
- A repository automatically generates a web site for displaying customizable views of its metadata: search, browse, join, etc.
- Metadata objects are also made available to network applications via the RIB API.
16 Repository Interoperation
[Diagram labels: My Repository, Your Repository, Our Virtual Repository, Metadata objects, HTML Catalog]
17 Tools Integration
- PAPI, Dyninst, SvPablo
- Intelligent Adaptation: Rose and SANS (ATLAS)
- Repository-in-a-Box effort provides a toolkit for building and maintaining metadata repositories
18 Interaction with Other Efforts
- SciDAC - TOPS: David Keyes, ICASE/ODU/LLNL
- SciDAC - Astrophysics: Tony Mezzacappa, ORNL
- DOE - Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis: Bart Miller, U Wisc; Jeff H., U of Maryland
19 High-End Computer System Performance: Science and Engineering
Activities for UTennessee
- Performance capturing tools: PAPI
- Automatic performance enhancement: SANS/AEOS/ATLAS
- Performance repository for apps, kernels, machines, etc.: NETLIB, RIB
- Modeling, predictability