NUG Meeting Performance Profiling Using hpmcount, poe+ & libhpm Richard Gerber NERSC User Services 510-486-6820.

Presentation transcript:

NUG Meeting Performance Profiling Using hpmcount, poe+ & libhpm Richard Gerber NERSC User Services

Introduction
- How to obtain performance numbers
- Tools based on IBM’s PMAPI
- Relevant for FY2003 ERCAP

Agenda
- Low-level PAPI interface
- HPM Toolkit
  - hpmcount
  - poe+
- libhpm: hardware performance library

Overview
- These tools are used for performance measurement
- All can be used to tune applications and measure performance
- Needed for FY 2003 ERCAP applications

Vocabulary
- PMAPI: IBM’s low-level interface
- PAPI: Performance API (portable)
- hpmcount, poe+: report overall code performance
- libhpm: can be used to instrument portions of code

PAPI
- Standard application programming interface
- Portable; don’t confuse it with IBM’s low-level PMAPI interface
- Can access hardware counter info
- V2.1 at NERSC
- See – –

Using PAPI
- PAPI is available through a module
  - module load papi
- You place calls in source code
  - xlf -O3 source.F $PAPI

#include "fpapi.h"
      …
      integer*8 values(2)
      integer counters(2), ncounters, irc
      …
!     Initialize the library, passing in the expected PAPI version
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
!     Count FMA instructions and floating point instructions
      counters(1)=PAPI_FMA_INS
      counters(2)=PAPI_FP_INS
      ncounters=2
      CALL papif_start_counters(counters,ncounters,irc)
      …
!     Stop counting and read the totals into values
      call papif_stop_counters(values,ncounters,irc)
      write(6,*) 'Total FMA ',values(1), ' Total FP ', values(2)
      …

hpmcount
- Easy to use
- Does not affect code performance
- Profiles entire code
- Uses hardware counters
- Reports flip (floating point instruction) rate and many other quantities

hpmcount usage
- Serial
  - % hpmcount executable
- Parallel
  - % poe hpmcount executable -nodes n -procs np
- Gives performance numbers for each task
- Prints output to STDOUT (or use -o filename)
- Beware! These profile the poe command:
  - hpmcount poe executable
  - hpmcount executable (if compiled with mp* compilers)

hpmcount example
ex1.f - Unoptimized matrix-matrix multiply

    % xlf90 -o ex1 -O3 -qstrict ex1.f
    % hpmcount ./ex1
    hpmcount (V 2.3.1) summary
    Total execution time (wall clock time): seconds
    ######## Resource Usage Statistics ########
    Total amount of time in user mode : seconds
    Total amount of time in system mode : seconds
    Maximum resident set size : 3116 Kbytes
    Average shared memory use in text segment : 6900 Kbytes*sec
    Average unshared memory use in data segment : Kbytes*sec
    Number of page faults without I/O activity : 785
    Number of page faults with I/O activity : 1
    Number of times process was swapped out : 0
    Number of times file system performed INPUT : 0
    Number of times file system performed OUTPUT : 0
    Number of IPC messages sent : 0
    Number of IPC messages received : 0
    Number of signals delivered : 0
    Number of voluntary context switches : 1
    Number of involuntary context switches : 1727
    ####### End of Resource Statistics ########

hpmcount output
ex1.f - Unoptimized matrix-matrix multiply

    % xlf90 -o ex1 -O3 -qstrict ex1.f
    % hpmcount ./ex1
    PM_CYC (Cycles) :
    PM_INST_CMPL (Instructions completed) :
    PM_TLB_MISS (TLB misses) :
    PM_ST_CMPL (Stores completed) :
    PM_LD_CMPL (Loads completed) :
    PM_FPU0_CMPL (FPU 0 instructions) :
    PM_FPU1_CMPL (FPU 1 instructions) :
    PM_EXEC_FMA (FMAs executed) :
    Utilization rate : %
    Avg number of loads per TLB miss :
    Load and store operations : M
    Instructions per load/store :
    MIPS :
    Instructions per cycle :
    HW Float points instructions per Cycle :
    Floating point instructions + FMAs : M
    Float point instructions + FMA rate : Mflip/s
    FMA percentage : %
    Computation intensity : 1.008

Floating point measures
- PM_FPU0_CMPL (FPU 0 instructions) and PM_FPU1_CMPL (FPU 1 instructions)
  - The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel.
  - Each FPU can start a new instruction every cycle.
  - These counters give the number of floating point instructions (add, multiply, subtract, divide, multiply+add) executed by each FPU.
- PM_EXEC_FMA (FMAs executed)
  - The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).

Total flop rate
- Float point instructions + FMA rate
  - "Float point instructions + FMAs" gives the number of floating point operations. An FMA is already counted once as a floating point instruction, so adding the FMA count accounts for the 2 floating point operations each FMA performs.
  - The rate gives the code’s Mflops.
  - The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops per FMA instruction).
  - Our example: 22 Mflops.
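The figures on this slide can be written out as a short worked calculation. The rate formula is a reading of the slide text, with T standing for the elapsed run time in seconds (an assumption; the slide does not say which time hpmcount divides by). The fraction of peak is computed here and was not quoted in the talk.

$$
\text{Mflip/s} \approx \frac{\texttt{PM\_FPU0\_CMPL} + \texttt{PM\_FPU1\_CMPL} + \texttt{PM\_EXEC\_FMA}}{T \times 10^{6}}
$$

$$
R_{\text{peak}} = 375\ \text{MHz} \times 2\ \text{FPUs} \times 2\ \tfrac{\text{flops}}{\text{FMA}} = 1500\ \text{Mflops},
\qquad
\frac{22}{1500} \approx 1.5\%\ \text{of peak}
$$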

Memory access
- Average number of loads per TLB miss
  - Memory addresses that are in the Translation Lookaside Buffer (TLB) can be accessed quickly.
  - Each time a TLB miss occurs, a new page (4 KB, i.e. 512 8-byte elements) is brought into the buffer.
  - A value of ~500 means each element is accessed ~1 time while the page is in the buffer.
  - A small value indicates that needed data is stored in widely separated places in memory; a redesign of data structures may help performance significantly.
  - Our example: 2.0
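A rough worked reading of this metric, assuming it is the ratio of the two counters its name suggests (PM_LD_CMPL over PM_TLB_MISS) and that the data are 8-byte reals:

$$
\frac{4096\ \text{bytes per page}}{8\ \text{bytes per element}} = 512\ \text{elements per page},
\qquad
\text{loads per TLB miss} \approx \frac{\texttt{PM\_LD\_CMPL}}{\texttt{PM\_TLB\_MISS}}
$$

With a measured value of 2.0, only about 2 of the 512 elements on a page (roughly 0.4%) are loaded before the code moves on to an address outside the pages the TLB currently maps.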

Cache hits
- The -sN option to hpmcount specifies a different statistics set
- -s2 will include the L1 data cache hit rate
  - 33.4% for our example
- See for more options and descriptions.

Optimizing the code
Original code fragment

      DO I=1,N
        DO K=1,N
          DO J=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

Optimizing the code
“Optimized” code: move I to inner loop

      DO J=1,N
        DO K=1,N
          DO I=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
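To try the effect of the loop reordering outside of hpmcount, the two fragments can be placed in one small free-form Fortran program and timed. This is only a sketch: the matrix size, the use of system_clock for timing, the array names z1 and z2, and the consistency check are illustrative choices and are not taken from the talk.

```fortran
! Sketch: time the two loop orders from the slides on identical data.
! Matrix size, timer, and the consistency check are illustrative assumptions.
program loop_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 512
  real(dp), allocatable :: x(:,:), y(:,:), z1(:,:), z2(:,:)
  integer :: i, j, k
  integer(i8) :: t0, t1, ticks_per_sec

  allocate(x(n,n), y(n,n), z1(n,n), z2(n,n))
  call random_number(x)
  call random_number(y)
  z1 = 0.0_dp
  z2 = 0.0_dp

  ! Original order: J is innermost, so Z(I,J) and Y(K,J) are touched
  ! with a stride of N elements (Fortran stores arrays column by column).
  call system_clock(t0, ticks_per_sec)
  do i = 1, n
    do k = 1, n
      do j = 1, n
        z1(i,j) = z1(i,j) + x(i,k) * y(k,j)
      end do
    end do
  end do
  call system_clock(t1)
  print '(a,f10.3,a)', 'I-K-J (J innermost): ', &
        real(t1 - t0, dp) / real(ticks_per_sec, dp), ' s'

  ! Reordered: I is innermost, so Z(I,J) and X(I,K) are accessed contiguously.
  call system_clock(t0)
  do j = 1, n
    do k = 1, n
      do i = 1, n
        z2(i,j) = z2(i,j) + x(i,k) * y(k,j)
      end do
    end do
  end do
  call system_clock(t1)
  print '(a,f10.3,a)', 'J-K-I (I innermost): ', &
        real(t1 - t0, dp) / real(ticks_per_sec, dp), ' s'

  ! Both orders accumulate the same products, so the results should agree.
  print '(a,es10.3)', 'max |z1 - z2| = ', maxval(abs(z1 - z2))
  if (maxval(abs(z1 - z2)) > 1.0e-8_dp) error stop 'loop orders disagree'
end program loop_order
```

The program can be built with any Fortran compiler that accepts free-form source. An optimizing compiler may interchange the loops on its own, in which case the two timings can come out close to each other.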

Optimized results
- Float point instructions + FMA rate
  - 461 vs. 22 Mflips (ESSL: 933)
- Avg number of loads per TLB miss
  - 20,877 vs. 2.0 (ESSL: 162)
- L1 cache hit rate
  - 98.9% vs. 33.4%
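Put in relative terms, using the 1500 Mflops POWER3 peak quoted earlier in the talk (the ratios below are computed here and were not stated on the slide):

$$
\frac{461}{22} \approx 21\times\ \text{speedup},
\qquad
\frac{461}{1500} \approx 31\%\ \text{of peak},
\qquad
\frac{933}{1500} \approx 62\%\ \text{of peak (ESSL)}
$$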

Using libhpm
- libhpm can instrument code sections
- Embed calls into source code
  - Fortran, C, C++
- Contained in the hpmtoolkit module
  - module load hpmtoolkit
- Compile with $HPMTOOLKIT
  - xlf -O3 source.F $HPMTOOLKIT
- Execute program normally

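libhpm example

      …
#include "f_hpm.h"
      …
!     Initialize libhpm before the first instrumented section
      CALL f_hpminit(0,"someid")
!     Start counting for section 1, labeled for the output report
      CALL f_hpmstart(1,"matrix-matrix multiply")
      DO J=1,N
        DO K=1,N
          DO I=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
!     Stop counting for section 1 and shut down libhpm
      CALL f_hpmstop(1)
      CALL f_hpmterminate(0)
      …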

Parallel programs
- poe hpmcount executable -nodes n -procs np
  - Will print output to STDOUT separately for each task
- poe+ executable -nodes n -procs np
  - Will print aggregate numbers to STDOUT
- libhpm
  - Writes output to a separate file for each task
- Do not do these!
  - hpmcount poe executable …
  - hpmcount executable (if compiled with mp* compiler)

Summary
- Utilities to measure performance
  - PAPI
  - hpmcount
  - poe+
  - libhpm
- You need to quote performance data in your ERCAP application

Where to Get More Information
- NERSC Website: hpcf.nersc.gov
- PAPI –
- hpmcount, poe+ – –
- hpmlib –