Lecture 4: Parallel Tools Landscape – Part 1 Allen D. Malony Department of Computer and Information Science.



Tools and Technologies
• PAPI
• Dyninst
  ❍ binary instrumenter, StackwalkerAPI, MRNet
• Scalasca and Score-P
• Periscope
• mpiP
• Open|SpeedShop, Stat
• Prophesy and MuMMI
• HPCToolkit
• Vtune
• Active Harmony
• Orio and Pbound
• PerfExpert
• MAQAO
• Vampir
• Paraver
Parallel Performance Tools: A Short Course, Beihang University, December 2-4,

Timer: gettimeofday()
• UNIX function
• Returns wall-clock time in seconds and microseconds
• Actual resolution is hardware-dependent
• Base value is 00:00 UTC, January 1, 1970
• Some implementations also return the timezone

    #include <sys/time.h>

    struct timeval tv;
    double walltime; /* seconds */

    gettimeofday(&tv, NULL);
    walltime = tv.tv_sec + tv.tv_usec * 1.0e-6;

Timer: clock_gettime()
• POSIX function
• For clock_id CLOCK_REALTIME returns wall-clock time in seconds and nanoseconds
• More clocks may be implemented but are not standardized
• Actual resolution is hardware-dependent

    #include <time.h>

    struct timespec tv;
    double walltime; /* seconds */

    clock_gettime(CLOCK_REALTIME, &tv);
    walltime = tv.tv_sec + tv.tv_nsec * 1.0e-9;
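
Since the actual resolution is hardware-dependent, it can help to ask the system what it reports. The following is a minimal sketch, not part of the original slides: the helper names `wall_seconds` and `wall_resolution` are made up for illustration, and `clock_getres()` is the POSIX call assumed to be available alongside `clock_gettime()`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request the POSIX clock functions */
#endif
#include <time.h>

/* Wall-clock time in seconds since the epoch, read from CLOCK_REALTIME.
   Returns a negative value if the clock cannot be read.
   (Hypothetical helper name.) */
double wall_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}

/* Resolution the system reports for CLOCK_REALTIME, in seconds.
   Returns a negative value on failure. (Hypothetical helper name.) */
double wall_resolution(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_REALTIME, &res) != 0)
        return -1.0;
    return (double)res.tv_sec + (double)res.tv_nsec * 1.0e-9;
}
```

Calling `wall_seconds()` before and after a code region and subtracting the two values gives the elapsed wall-clock time; differences smaller than `wall_resolution()` should not be trusted.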

Timer: getrusage()
• UNIX function
• Provides a variety of different information
  ❍ Including user time, system time, memory usage, page faults, etc.
  ❍ Information provided is system-dependent!

    #include <sys/resource.h>

    struct rusage ru;
    double usrtime; /* seconds */
    long memused;   /* ru_maxrss is a long; units are system-dependent */

    getrusage(RUSAGE_SELF, &ru);
    usrtime = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1.0e-6;
    memused = ru.ru_maxrss;
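
The same call also reports system time, so user and system CPU time can be combined into one number. This is a small sketch under assumptions: the helper names `cpu_seconds` and `peak_rss` are invented here, and the unit of `ru_maxrss` differs between systems (kilobytes on Linux, for example), so the value is returned unconverted.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds. */
static double tv_to_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* User plus system CPU time consumed by this process, in seconds.
   Returns a negative value if getrusage() fails.
   (Hypothetical helper name.) */
double cpu_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return tv_to_seconds(ru.ru_utime) + tv_to_seconds(ru.ru_stime);
}

/* Peak resident set size as reported by the system, in system-dependent
   units. Returns -1 on failure. (Hypothetical helper name.) */
long peak_rss(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}
```

Comparing the difference of two `cpu_seconds()` readings with the wall-clock time of the same region gives a rough indication of how much of the elapsed time was spent waiting.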

Timer: Others
• MPI provides portable MPI wall-clock timer
  ❍ Not required to be consistent/synchronized across ranks!

    #include <mpi.h>

    double walltime; /* seconds */
    walltime = MPI_Wtime();

• OpenMP 2.0 also provides a library function

    #include <omp.h>

    double walltime; /* seconds */
    walltime = omp_get_wtime();

• Hybrid MPI/OpenMP programming?
  ❍ Interactions between both standards (yet) undefined

Timer: Others
• Fortran 90/95 intrinsic subroutines
  ❍ cpu_time() (Fortran 95)
  ❍ system_clock()
• Hardware Counter Libraries
  ❍ Vendor APIs
    ◆ PMAPI, HWPC, libhpm, libpfm, libperf, …
  ❍ PAPI

What Are Performance Counters
• Extra processor logic inserted to count specific events
• Updated at every cycle
• Strengths
  ❍ Non-intrusive
  ❍ Very accurate
  ❍ Low overhead
• Weaknesses
  ❍ Provides only hard counts
  ❍ Specific for each processor
  ❍ Access is neither appropriate for the end user nor well documented
  ❍ Lack of standard on what is counted

Hardware Counter Issues
• Kernel level
  ❍ Handling of overflows
  ❍ Thread accumulation
  ❍ Thread migration
  ❍ State inheritance
  ❍ Multiplexing
  ❍ Overhead
  ❍ Atomicity
• Multi-platform interfaces
  ❍ Performance API (PAPI)
    ◆ University of Tennessee, USA
  ❍ LIKWID
    ◆ University of Erlangen, Germany
[Figure: layered stack, top to bottom: multi-platform interface, kernel, hardware counters]
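
One reason overflow handling appears in the list above is that hardware counters have a finite width and wrap around. The sketch below illustrates how a difference between two raw readings can still be recovered after a single wrap. It is an illustration only: the function name and the `width_bits` parameter are assumptions, and the actual counter width depends on the processor.

```c
#include <stdint.h>

/* Number of events between two raw readings of a counter that is
   width_bits wide and wraps to zero on overflow. Correct as long as the
   counter wrapped at most once between the two readings.
   (Hypothetical helper; width_bits is processor-specific.) */
uint64_t counter_delta(uint64_t before, uint64_t after, unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64)
                        ? UINT64_MAX
                        : ((UINT64_C(1) << width_bits) - 1);
    /* Unsigned subtraction wraps modulo 2^64; masking reduces the result
       modulo 2^width_bits, which undoes a single counter wrap. */
    return (after - before) & mask;
}
```

If the counter can wrap more than once between readings, the kernel (or the library on top of it) has to catch the overflow interrupt and accumulate into a wider software counter instead.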

Hardware Measurement
• Typical measured events account for:
  ❍ Functional unit status
    ◆ floating-point operations
    ◆ fixed-point operations
    ◆ loads/stores
  ❍ Access to memory hierarchy
  ❍ Cache coherence protocol events
  ❍ Cycle and instruction counts
  ❍ Speculative execution information
    ◆ instructions dispatched
    ◆ branches mispredicted

Hardware Metrics

  Typical hardware counter        Useful derived metrics
  ❍ Cycles / Instructions         IPC
  ❍ Floating point instructions   FLOPS
  ❍ Integer instructions          computation intensity
  ❍ Load/stores                   instructions per load/store
  ❍ Cache misses                  load/stores per cache miss
  ❍ Cache misses                  cache hit rate
  ❍ Cache misses                  loads per load miss
  ❍ TLB misses                    loads per TLB miss

• Derived metrics allow users to correlate the behavior of the application to hardware components
• Define threshold values acceptable for metrics and take actions regarding optimization when below/above thresholds
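
A few of the derived metrics above can be written down directly as ratios of raw counts. The following sketch assumes the raw counts have already been collected; all function names are made up for illustration, and the threshold check is only one possible way to act on a metric.

```c
/* Instructions per cycle; returns 0 when no cycles were counted. */
double ipc(long long instructions, long long cycles)
{
    return cycles > 0 ? (double)instructions / (double)cycles : 0.0;
}

/* Floating-point operations per second over a measured interval. */
double flop_rate(long long fp_ops, double seconds)
{
    return seconds > 0.0 ? (double)fp_ops / seconds : 0.0;
}

/* Fraction of cache accesses that hit, from access and miss counts. */
double cache_hit_rate(long long accesses, long long misses)
{
    return accesses > 0 ? 1.0 - (double)misses / (double)accesses : 0.0;
}

/* Loads issued per load miss; returns 0 when there were no misses. */
double loads_per_miss(long long loads, long long load_misses)
{
    return load_misses > 0 ? (double)loads / (double)load_misses : 0.0;
}

/* Returns 1 when a metric falls below its acceptable threshold,
   signalling that the code region deserves a closer look. */
int below_threshold(double value, double threshold)
{
    return value < threshold;
}
```

For example, `below_threshold(ipc(instr, cyc), 1.0)` flags regions that retire fewer than one instruction per cycle; the value 1.0 is an arbitrary illustration, since acceptable thresholds depend on the processor and the application.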

Accuracy Issues
• Granularity of the measured code
  ❍ If not sufficiently large, overhead of the counter interfaces may dominate
• Pay attention to what is not measured:
  ❍ Out-of-order processors
  ❍ Sometimes speculation is included
  ❍ Lack of standard on what is counted
    ◆ Microbenchmarks can help determine accuracy of the hardware counters
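
The granularity point can be made concrete with a microbenchmark of the measurement interface itself. The sketch below uses a timer read as a stand-in for a counter read, which is an assumption made so that the example needs no counter library; the function names and the `max_fraction` parameter are likewise invented for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request the POSIX clock functions */
#endif
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}

/* Average cost of one read of the measurement interface, in seconds,
   estimated from `calls` back-to-back reads. */
double timer_read_overhead(int calls)
{
    struct timespec ts;
    if (calls <= 0)
        return 0.0;
    double start = now_seconds();
    for (int i = 0; i < calls; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    double end = now_seconds();
    return (end - start) / calls;
}

/* A region counts as large enough when the two reads that bracket it
   cost no more than max_fraction of the region's own duration. */
int region_is_measurable(double region_seconds, double overhead_per_read,
                         double max_fraction)
{
    return 2.0 * overhead_per_read <= max_fraction * region_seconds;
}
```

With a per-read overhead of 100 ns and a tolerated distortion of 1%, for instance, a region shorter than about 20 microseconds would fail the check and should be measured at a coarser granularity.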

Hardware Counters Access on Linux
• Linux had not defined an out-of-the-box interface to access the hardware counters!
  ❍ Linux Performance Monitoring Counters Driver (PerfCtr) by Mikael Pettersson from Uppsala (x86 + x86-64)
    ◆ Needs kernel patching!
  ❍ Perfmon by Stephane Eranian from HP (IA64)
    ◆ It was being evaluated to be added to Linux
• Linux
  ❍ Performance Counter subsystem provides an abstraction of special performance counter hardware registers
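
As a rough illustration of what the kernel abstraction looks like from user space, the sketch below counts retired user-level instructions around a function call. It assumes a Linux system whose kernel exposes the `perf_event_open` system call and its `<linux/perf_event.h>` header; the slide itself does not name this call. The helpers `busy_loop` and `count_instructions` are hypothetical, and the function returns -1 when counters are unavailable (for example in a virtual machine or when access is restricted).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for the syscall() declaration */
#endif
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

/* Thin wrapper: the C library does not provide one for this system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Example workload (hypothetical). */
void busy_loop(void)
{
    volatile long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i;
}

/* Count user-level instructions retired while work() runs in the calling
   thread. Returns -1 if the counter cannot be opened or read. */
long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    long long count = 0;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* count user-level code only */
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        count = -1;
    close(fd);
    return count;
}
```

A call such as `count_instructions(busy_loop)` yields either an instruction count or -1; portable tools are generally better served by a library such as PAPI than by using the system call directly.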

Utilities to Count Hardware Events
• There are utilities that start a program and at the end of the execution provide overall event counts
  ❍ hpmcount (IBM)
  ❍ CrayPat (Cray)
  ❍ pfmon from HP (part of Perfmon for IA64)
  ❍ psrun (NCSA)
  ❍ cputrack, har (Sun)
  ❍ perfex, ssrun (SGI)
  ❍ perf (Linux)

PAPI – Performance API
• Middleware to provide a consistent programming interface for the performance counter hardware in microprocessors
• Countable events are defined in two ways:
  ❍ Platform-neutral preset events
  ❍ Platform-dependent native events
• Presets can be derived from multiple native events
• Two interfaces to the underlying counter hardware:
  ❍ High-level interface simply provides the ability to start, stop and read the counters for a specified list of events
  ❍ Low-level interface manages hardware events in user defined groups called EventSets
• Events can be multiplexed if counters are limited
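
When events are multiplexed, each event occupies a physical counter for only part of the run, so its raw count has to be extrapolated to the whole interval. The sketch below shows that extrapolation idea in its simplest form. It does not reproduce PAPI's own implementation, which the slide does not describe; the function name and the two time parameters are assumptions.

```c
#include <stdint.h>

/* Estimate the full-interval count of a multiplexed event.
   raw_count    : events counted while the event held a hardware counter
   time_enabled : total time the event was requested
   time_running : portion of that time the event was actually counting
   (Hypothetical helper; both times must use the same unit.) */
double scale_multiplexed(uint64_t raw_count, uint64_t time_enabled,
                         uint64_t time_running)
{
    if (time_running == 0)
        return 0.0;   /* the event was never scheduled, nothing to scale */
    return (double)raw_count *
           ((double)time_enabled / (double)time_running);
}
```

The estimate assumes the event rate is roughly uniform over the interval, so multiplexed counts for short or highly irregular regions should be treated with caution.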

PAPI Architecture
[Figure: layered diagram. Tools sit on top of the Portable Layer (PAPI High Level and PAPI Low Level). Below it, the Machine Specific Layer consists of the PAPI Machine Dependent Substrate, Kernel Extensions, the Operating System, and the Hardware Performance Counters.]

PAPI Predefined Events
• Common set of events deemed relevant and useful for application performance tuning (wish list)
  ❍ papiStdEventDefs.h
  ❍ Accesses to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit and pipeline status
  ❍ Run PAPI papi_avail utility to determine which predefined events are available on a given platform
  ❍ Semantics may differ on different platforms!
• PAPI also provides access to native events on all supported platforms through the low-level interface
  ❍ Run PAPI papi_native_avail utility to determine which native events are available on a given platform

papi_avail Utility

    % papi_avail -h
    This is the PAPI avail program.
    It provides availability and detail information for PAPI preset and native events.
    Usage:
        papi_avail [options] [event name]
        papi_avail TESTS_QUIET
    Options:
        -a            display only available PAPI preset events
        -d            display PAPI preset event info in detailed format
        -e EVENTNAME  display full detail for named preset or native event
        -h            print this help message
        -t            display PAPI preset event info in tabular format (default)

High Level API
• Meant for application programmers wanting simple but accurate measurements
• Calls the lower level API
• Allows only PAPI preset events
• Eight functions:
  ❍ PAPI_num_counters
  ❍ PAPI_start_counters, PAPI_stop_counters
  ❍ PAPI_read_counters
  ❍ PAPI_accum_counters
  ❍ PAPI_flops
  ❍ PAPI_flips, PAPI_ipc (New in Version 3.x)
• Not thread-safe (Version 2.x)

Low Level API
• Increased efficiency and functionality over the high level PAPI interface
• 54 functions
• Access to native events
• Obtain information about the executable, the hardware, and memory
• Set options for multiplexing and overflow handling
• System V style sampling (profil())
• Thread safe

Component PAPI
[Figure: the PAPI framework exposes the Low Level User API and High Level User API to applications and a Developer API to components. Each PAPI component (CPU, Network, System Health) sits on its own operating system interface and counter hardware.]

PAPI CUDA Component
• HW performance counter measurement technology for NVIDIA CUDA platform
• Access to HW counters inside the GPUs
• Based on CUPTI (CUDA Profiling Tools Interface)
• PAPI CUDA component can provide detailed performance counter info regarding execution of GPU kernel
  ❍ Initialization, device management and context management are enabled by CUDA driver API
  ❍ Domain and event management are enabled by CUPTI
• Names of events are established by the following hierarchy: Component.Device.Domain.Event
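
The four-level naming hierarchy can be taken apart with ordinary string handling. The sketch below splits a dotted name into its parts. The struct, the function name, the 31-character field limit, and the sample name used in the note that follows are all hypothetical; real event names should be taken from the PAPI utilities on the target system.

```c
#include <stdio.h>
#include <string.h>

/* The four levels of a Component.Device.Domain.Event name.
   (Hypothetical type; field sizes are arbitrary.) */
struct event_name {
    char component[32];
    char device[32];
    char domain[32];
    char event[32];
};

/* Split a dotted event name into its four levels.
   Returns 0 on success, -1 if the name has fewer than four parts. */
int parse_event_name(const char *name, struct event_name *out)
{
    memset(out, 0, sizeof *out);
    /* %31[^.] reads up to 31 characters that are not a dot; the final
       %31s takes whatever remains up to the first whitespace. */
    int fields = sscanf(name, "%31[^.].%31[^.].%31[^.].%31s",
                        out->component, out->device,
                        out->domain, out->event);
    return fields == 4 ? 0 : -1;
}
```

For a made-up name such as `CUDA.Tesla_C870.domain_a.some_event`, the function fills in `CUDA`, `Tesla_C870`, `domain_a` and `some_event`, which makes it straightforward to group events by device or domain.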

Portion of CUDA Events on Tesla C870