
Using parallel tools on the SDSC IBM DataStar: DataStar Overview, HPM, Perf, IPM, VAMPIR, TotalView

DataStar Overview
- P655  :: 8-way, 16 GB   :: 176 nodes
- P655+ :: 8-way, 32 GB   :: 96 nodes
- P690  :: 32-way, 64 GB  :: 2 nodes
- P690  :: 32-way, 128 GB :: 4 nodes
- P690  :: 32-way, 256 GB :: 2 nodes
Total: 280 nodes, 2,432 processors

Batch/Interactive Computing
Batch job queues:
- Job queue manager: LoadLeveler (IBM tool)
- Job queue scheduler: Catalina (SDSC internal tool)
- Job queue monitoring: various tools (commands)
- Job accounting: job filter (SDSC internal Perl scripts)
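A minimal LoadLeveler job script for an MPI run might look like the sketch below. The class name, node counts, time limit, and executable are placeholders rather than DataStar-specific values; check the SDSC documentation for the exact keywords required by each queue.

  #!/bin/csh
  #@ job_type         = parallel
  #@ class            = normal            # queue/class name (placeholder)
  #@ node             = 2                 # number of nodes
  #@ tasks_per_node   = 8                 # MPI tasks per node
  #@ wall_clock_limit = 01:00:00
  #@ network.MPI      = sn_all,not_shared,us
  #@ output           = myjob.$(jobid).out
  #@ error            = myjob.$(jobid).err
  #@ queue
  poe ./a.out

The script is submitted with llsubmit (e.g. "llsubmit myjob.cmd") and monitored with llq.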

DataStar Access
Three login nodes (platform :: usage mode):
- dslogin.sdsc.edu  :: Production runs (P690, 32-way, 64 GB)
- dspoe.sdsc.edu    :: Test/debug runs (P655, 8-way, 16 GB)
- dsdirect.sdsc.edu :: Special needs (P690, 32-way, 256 GB)
Note: this division of usage modes is not strict.

Test/debug runs (usage from dspoe)
[dspoe.sdsc.edu :: P655, 8-way, 16 GB]

Queue/Class  Node type           Memory limit  Max wall clock  Max nodes
interactive  P655 nodes (8-CPU)  16 GB         2 hrs           3
express      P655 nodes (8-CPU)  16 GB         2 hrs           4

Access to two queues:
- P655 nodes [shared]
- P655 nodes [not shared]
These queues go through the job filter + LoadLeveler only (very fast).
Jobs can be submitted with a special command line as well as with a job script.
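For a quick test run from dspoe, the command-line submission takes roughly the form below, mirroring the poe invocation shown later in the VAMPIR section; the node count, pool, and network options are illustrative:

  poe ./a.out -nodes 1 -tasks_per_node 8 -rmpool 1 -euilib us -euidevice sn_all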

Production runs (usage from dslogin)
[dslogin.sdsc.edu :: P690, 32-way, 64 GB]
Used for data transfer, source editing, compilation, etc.

Queue/Class  Node type            Memory limit   Max wall clock  Max nodes
normal       P655 nodes (8-CPU)   16 GB & 32 GB  18 hrs          265
normal32     P690 nodes (32-CPU)  128 GB         18 hrs          5

Two queues:
- Onto P655/P655+ nodes [not shared]
- Onto P690 nodes [shared]
These queues go through the job filter + LoadLeveler + Catalina (slow updates).

Special needs (usage from dsdirect)
[dsdirect.sdsc.edu :: P690, 32-way, 256 GB]
- All visualization needs
- All post-processing / data analysis needs
- Shared node (with 256 GB of memory)
- Process accounting in place
- Fully interactive (a.out) usage
- No job filter, no LoadLeveler, no Catalina

IBM Hardware Performance Monitor (hpm)

What is Performance?
Where is time spent, and how is it spent?
- MIPS: millions of instructions per second
- MFLOPS: millions of floating-point operations per second
- Run time / CPU time
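As a quick illustration (the numbers are made up for this example):

  MFLOPS = floating-point operations / (run time in seconds x 10^6)

so a code that performs 5.0e9 floating-point operations in 20 seconds of wall clock time sustains 5.0e9 / (20 x 10^6) = 250 MFLOPS.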

What is a Performance Monitor?
Provides detailed processor/system data.
Processor monitors:
- Typically a group of registers
- Special-purpose registers keep track of programmable events
- Non-intrusive counts result in "accurate" measurement of processor events
- Typical events counted: instructions, floating-point instructions, cache misses, etc.
System-level monitors:
- Can be hardware or software
- Intended to measure system activity
- Examples: a bus monitor measures memory traffic and can analyze cache-coherency issues in a multiprocessor system; a network monitor measures network traffic and can analyze web traffic internally and externally

Hardware Counter Motivations
Goal: understand the execution behavior of application code.
Why not use software?
- Strengths: simple, GUI interface
- Weaknesses: large overhead, intrusive, higher level of abstraction
How about using a simulator?
- Strengths: control, low-level, accurate
- Weaknesses: limit on the size of code, difficult to implement, time-consuming to run
When should we use hardware counters directly?
- When software tools and simulators are not available or not sufficient
- Strengths: non-intrusive, instruction-level analysis, moderate control, very accurate, low overhead
- Weaknesses: not typically reusable, requires OS kernel support

Ptools Projects
PMAPI project:
- Common standard API for the industry
- Supported by IBM, Sun, SGI, Compaq, etc.
PAPI project:
- Standard application programming interface
- Portable, available through a module
- Can access hardware counter information
HPM Toolkit:
- Easy to use
- Does not affect code performance
- Uses hardware counters
- Designed specifically for IBM SPs and POWER processors
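The slides do not include code, but a minimal sketch of the classic PAPI high-level interface in C might look as follows; event availability and the exact API version vary by platform, and PAPI_TOT_CYC / PAPI_FP_INS are standard preset events used here purely for illustration:

  #include <stdio.h>
  #include "papi.h"

  int main(void)
  {
      int events[2] = { PAPI_TOT_CYC, PAPI_FP_INS };  /* cycles and FP instructions */
      long_long values[2];
      double a = 0.0;
      int i;

      if (PAPI_start_counters(events, 2) != PAPI_OK)  /* start the hardware counters */
          return 1;

      for (i = 0; i < 1000000; i++)                   /* the work being measured */
          a += i * 0.5;

      if (PAPI_stop_counters(values, 2) != PAPI_OK)   /* stop and read the counters */
          return 1;

      printf("cycles = %lld, fp instructions = %lld (a = %g)\n",
             values[0], values[1], a);
      return 0;
  }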

Problem Set
Should we collect all events all the time?
- Not necessary, and wasteful
What counts should be used?
- Gather only what you need: cycles, committed instructions, loads, stores, L1/L2 misses, L1/L2 stores, committed floating-point instructions, branches, branch misses, TLB misses, cache misses

IBM HPM Toolkit (High Performance Monitor)
Developed for performance measurement of applications running on IBM POWER3 systems. It consists of:
- A utility (hpmcount)
- An instrumentation library (libhpm)
- A graphical user interface (hpmviz)
Requires the PMAPI kernel extensions to be loaded.
Works on IBM 630 and 604e processors.
Based on IBM's PMAPI low-level interface.
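To give a feel for libhpm, an instrumented C code section typically looks something like the sketch below; the function names follow the HPM Toolkit's C interface, but the header name, link flags, and exact signatures should be checked against the toolkit documentation on your system, and compute() is a hypothetical work routine:

  #include "libhpm.h"                     /* HPM instrumentation library (header name may vary) */

  static void compute(double *x, int n)   /* hypothetical work routine */
  {
      int i;
      for (i = 0; i < n; i++)
          x[i] = x[i] * 2.5 + 1.0;
  }

  int main(void)
  {
      double x[1000] = {0.0};

      hpmInit(0, "myprog");               /* task id and program name */
      hpmStart(1, "compute loop");        /* begin instrumented section 1 */
      compute(x, 1000);
      hpmStop(1);                         /* end instrumented section 1 */
      hpmTerminate(0);                    /* flush counter data and write the report */
      return 0;
  }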

hpmcount
- A utility for performance measurement of applications
- Uses extra logic inserted into the processor to count specific events, updated every cycle
- Provides a summary output at the end of the execution:
  - Wall clock time
  - Resource usage statistics
  - Hardware performance counter information
  - Derived hardware metrics
- Works for serial and parallel codes; gives performance numbers for each task
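In practice the utility wraps the executable on the command line, roughly as follows; the flags and the poe integration vary between installations, so treat this as a sketch:

  hpmcount ./a.out                                    (serial run: summary printed at exit)
  poe hpmcount ./a.out -nodes 1 -tasks_per_node 4     (parallel run: performance numbers per task)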

Timers
Timers usually report three metrics:
User time
- The time used by your code on the CPU (also called CPU time)
- Total time in user mode = cycles / processor frequency
System time
- The time used by your code while running kernel code (doing I/O, writing to disk, printing to the screen, etc.)
- It is worth minimizing system time by speeding up disk I/O, doing I/O in parallel, or doing I/O in the background while the CPU computes in the foreground
Wall clock time
- Total execution time: user time plus system time plus time spent idle (waiting for resources)
- In parallel performance tuning, only wall clock time counts
- Interprocessor communication can consume a significant amount of execution time, and user/system time usually does not account for it, so rely on wall clock time for all the time consumed by the job
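Most Unix shells provide a time command that reports all three metrics; the exact output format depends on the shell, and the numbers below are purely illustrative:

  time ./a.out
    real   12.4s    (wall clock time)
    user   10.9s    (user/CPU time)
    sys     0.9s    (system time)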

Floating Point Measures
PM_FPU0_CMPL (FPU 0 instructions)
- The POWER3 processor has two floating point units (FPUs) which operate in parallel, and each FPU can start a new instruction every cycle. This counter shows the number of floating point instructions that have been executed by the first FPU.
PM_FPU1_CMPL (FPU 1 instructions)
- This counter shows the number of floating point instructions (add, multiply, subtract, divide, multiply & add) that have been processed by the second FPU.
PM_EXEC_FMA (FMAs executed)
- The number of floating point multiply & add (FMA) instructions. An FMA performs a computation of the form x = s * a + b, so two floating point operations are done in one instruction. The compiler generates this instruction as often as possible to speed up the program, but sometimes additional manual optimization is needed to replace separate multiply and add instructions with one FMA.
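For example, a loop of the following shape maps naturally onto FMA instructions, one per iteration (illustrative C code, not from the slides):

  /* y[i] = s * a[i] + y[i] is one multiply plus one add, fused into a single FMA */
  void daxpy(int n, double s, const double *a, double *y)
  {
      int i;
      for (i = 0; i < n; i++)
          y[i] = s * a[i] + y[i];
  }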

Total Flop Rate
Floating point instructions + FMA rate
- This is the most often quoted performance index, the MFLOPS rate.
- The peak performance of the POWER3-II processor is 1500 MFLOPS (375 MHz clock x 2 FPUs x 2 flops per FMA instruction).
- Many applications do not reach more than 10 percent of this peak performance.
Average number of loads per TLB miss
- This value is the ratio PM_LD_CMPL / PM_TLB_MISS. After a TLB miss has been processed, fast access to a new page of data is possible. Small values for this metric indicate that the program has poor data locality; redesigning the program's data structures may result in significant performance improvements.
Computation intensity
- Computational intensity is the ratio of floating point operations to load and store operations.
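As an illustrative, made-up example: a kernel that performs 2.0e9 floating point operations while issuing 1.0e9 loads and 0.5e9 stores has a computation intensity of 2.0e9 / (1.0e9 + 0.5e9) ≈ 1.3 flops per load/store; values well below 1 usually indicate a memory-bound code.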

PERF

The perf utility provides a succinct code performance report to help get the most out of HPM output or MPI_Trace output. It can help make your case for an allocation request.

Trace Libraries
The IBM trace libraries are a set of libraries used for MPI performance instrumentation. They can measure the amount of time spent in each routine, which function was used, and how many bytes were sent.
To use a library:
- Compile your code with the -g flag
- Relink your object files; for example, for mpitrace: -L/usr/local/apps/mpitrace -lmpiprof
- Make sure your code exits through MPI_Finalize
The run will produce mpi_profile.task_number output files.
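Putting those steps together, a compile-and-link line might look like the following; the program name is a placeholder, and the library path is the one given above:

  mpcc -g -o myprog myprog.c -L/usr/local/apps/mpitrace -lmpiprof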

Perf
The perf utility provides a succinct code performance report to help get the most out of HPM output or MPI_Trace output. It can help make your case for an allocation request.
To use perf:
- Add /usr/local/apps/perf/perf to your path, OR alias it in your .cshrc file: alias perf '/usr/local/apps/perf/perf \!*'
- Then run it in the same directory as your output files: perf hpm_out > perf_summary

Example of perf_summary
Computation performance measured for all 4 CPUs:
  Execution wall clock time = seconds
  Total FPU arithmetic results = 5.381e+09 (31.2% of these were FMAs)
  Aggregate flop rate = Gflop/s
  Average flop rate per cpu = Mflop/s = 2.6% of 'peak'
Communication wall clock time for 4 CPUs:
  max = seconds
  min = seconds
Communication took 0.17% of total wall clock time.

IPM - Integrated Performance Monitoring

Integrated Performance Monitoring (IPM)
Integrated Performance Monitoring (IPM) is a tool that allows users to obtain a concise summary of the performance and communication characteristics of their codes. IPM is invoked by the user at the time a job is run. By default, a short, text-based summary of the code's performance is provided, along with a more detailed Web page. More details at:
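Usage is typically as simple as linking against the IPM library (or loading an ipm module where one is provided) and running the job as usual; the fragment below is a generic sketch rather than SDSC-specific documentation, and the IPM_HOME path is a placeholder:

  mpcc -g -o myprog myprog.c -L$IPM_HOME/lib -lipm     # link against IPM (path is illustrative)
  poe ./myprog -nodes 1 -tasks_per_node 4              # run as usual; IPM reports at MPI_Finalize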

VAMPIR – Visualization and Analysis of MPI Programs

VAMPIR
Parallel programs are much harder to debug and tune than sequential ones, and the reasons for performance problems, in particular, are notoriously hard to find. Suppose the performance is disappointing: initially, the programmer has no idea where to look, or what to look for, to identify the performance bottleneck.

VAMPIR converts the trace information into a variety of graphical views, e.g.: timeline displays showing state changes and communication, communication statistics indicating data volumes and transmission rates, and more.

Setting the Vampir path and variables:
  setenv PAL_LICENSEFILE /usr/local/apps/vampir/etc/license.dat
  set path = ($path /usr/local/apps/vampir/bin)
Compile:
  mpcc -o parpi -L/usr/local/apps/vampirtrace/lib -lVT -lm -lld parpi.c
Run:
  poe parpi -nodes 1 -tasks_per_node 4 -rmpool 1 -euilib us -euidevice sn_all
Calling Vampir:
  vampir parpi.stf

TotalView

Discovering TotalView
The Etnus TotalView® debugger is a powerful, sophisticated, and programmable tool that allows you to debug, analyze, and tune the performance of complex serial, multiprocessor, and multithreaded programs. If you want to jump in and get started quickly, go to the Website and select TotalView's "Getting Started" area (it's the blue oval link on the right, near the bottom).
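On an IBM system with POE, an MPI job is typically started under TotalView with a command along the following lines; the executable name and task counts are taken from the VAMPIR example above and are placeholders, so consult the local documentation for the exact launch syntax:

  totalview poe -a ./parpi -nodes 1 -tasks_per_node 4 -rmpool 1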