
So, You Need to Look at a New Application …
Scenarios:
 New application development
 Analyze/optimize an external application
 Suspected bottlenecks
First goal: overview of …
 Communication frequency and intensity
 Types and complexity of communication
 Source code locations of expensive MPI calls
 Differences between processes

Basic Principle of Profiling MPI
Intercept all MPI API calls
 Using wrappers for all MPI calls
Aggregate statistics over time
 Number of invocations
 Data volume
 Time spent during function execution
Multiple aggregation options/granularities
 By function name or type
 By source code location (call stack)
 By process rank

mpiP: Efficient MPI Profiling
Open-source MPI profiling library
 Developed at LLNL, maintained by LLNL & ORNL
 Available from SourceForge
 Works with any MPI library
Easy-to-use and portable design
 Relies on PMPI instrumentation
 No additional tool daemons or support infrastructure
 Single text file as output
 Optional: GUI viewer

Running with mpiP 101 / Experimental Setup
mpiP works on binary files
 Uses standard development chain
 Use of “-g” recommended
Run option 1: relink
 Specify libmpi.a/.so on the link line
 Portable solution, but requires object files
Run option 2: library preload
 Set preload variable (e.g., LD_PRELOAD) to mpiP
 Transparent, but only on supported systems

Running with mpiP 101 / Running

bash-3.2$ srun –n4 smg2000
mpiP:
mpiP: mpiP V3.1.2 (Build Dec /17:31:26)
mpiP: Direct questions and errors to mpip-
mpiP:
Running with these driver parameters:
  (nx, ny, nz) = (60, 60, 60)
  (Px, Py, Pz) = (4, 1, 1)
  (bx, by, bz) = (1, 1, 1)
  (cx, cy, cz) = ( , , )
  (n_pre, n_post) = (1, 1)
  dim = 3
  solver ID = 0
=============================================
Struct Interface:
  wall clock time = seconds
  cpu clock time = seconds
=============================================
Setup phase times:
SMG Setup:
  wall clock time = seconds
  cpu clock time = seconds
=============================================
Solve phase times:
SMG Solve:
  wall clock time = seconds
  cpu clock time = seconds
Iterations = 7
Final Relative Residual Norm = e-07
mpiP:
mpiP: Storing mpiP output in [./smg2000-p mpiP].
mpiP:
bash-3.2$

(The mpiP lines are the header; the last mpiP line names the output file. Numeric values were lost in transcription.)

mpiP 101 / Output – Header

Command             : ./smg2000-p -n
Version             :
MPIP Build date     : Dec
Start time          :
Stop time           :
Timer Used          :
MPIP env var        :
Collector Rank      :
Collector PID       :
Final Output Dir    :
Report generation   :
MPI Task Assignment : 0
MPI Task Assignment : 1
MPI Task Assignment : 2
MPI Task Assignment : 3 hera31

(Most field values were lost in transcription.)

mpiP 101 / Output – Overview

MPI Time (seconds)
Columns: Task, AppTime, MPITime, MPI%
One row per task plus an aggregate “*” row. (Numeric values were lost in transcription.)

mpiP 101 / Output – Callsites

Callsites:
ID Lev File/Address       Line Parent_Funct                  MPI_Call
 1   0 communication.c    1405 hypre_CommPkgUnCommit         Type_free
 2   0 timing.c            419 hypre_PrintTiming             Allreduce
 3   0 communication.c     492 hypre_InitializeCommunication Isend
 4   0 struct_innerprod.c  107 hypre_StructInnerProd         Allreduce
 5   0 timing.c            421 hypre_PrintTiming             Allreduce
 6   0 coarsen.c           542 hypre_StructCoarsen           Waitall
 7   0 coarsen.c           534 hypre_StructCoarsen           Isend
 8   0 communication.c    1552 hypre_CommTypeEntryBuildMPI   Type_free
 9   0 communication.c    1491 hypre_CommTypeBuildMPI        Type_free
10   0 communication.c     667 hypre_FinalizeCommunication   Waitall
11   0 smg2000.c           231 main                          Barrier
12   0 coarsen.c           491 hypre_StructCoarsen           Waitall
13   0 coarsen.c           551 hypre_StructCoarsen           Waitall
14   0 coarsen.c           509 hypre_StructCoarsen           Irecv
15   0 communication.c    1561 hypre_CommTypeEntryBuildMPI   Type_free
16   0 struct_grid.c       366 hypre_GatherAllBoxes          Allgather
17   0 communication.c    1487 hypre_CommTypeBuildMPI        Type_commit
18   0 coarsen.c           497 hypre_StructCoarsen           Waitall
19   0 coarsen.c           469 hypre_StructCoarsen           Irecv
20   0 communication.c    1413 hypre_CommPkgUnCommit         Type_free
21   0 coarsen.c           483 hypre_StructCoarsen           Isend
22   0 struct_grid.c       395 hypre_GatherAllBoxes          Allgatherv
23   0 communication.c     485 hypre_InitializeCommunication Irecv

mpiP 101 / Output – per Function Timing

Aggregate Time (top twenty, descending, milliseconds)
Columns: Call, Site, Time, App%, MPI%, COV
Entries, in order: Waitall, Isend, Irecv, Waitall, Type_commit, Type_free, Waitall, Type_free, Type_free, Type_free, Isend, Isend, Type_free, Irecv, Irecv. (Numeric values were lost in transcription.)

mpiP 101 / Output – per Function Message Size

Aggregate Sent Message Size (top twenty, descending, bytes)
Columns: Call, Site, Count, Total, Avrg, Sent%
Entries, in order: Isend, Isend, Isend, Allreduce, Allgatherv, Allreduce, Allreduce, Allgather. (Numeric values were lost in transcription.)

mpiP 101 / Output – per Callsite Timing

Callsite Time statistics (all, milliseconds):
Columns: Name, Site, Rank, Count, Max, Mean, Min, App%, MPI%
One row per rank plus an aggregate “*” row for each callsite: Allgather (site 16), Allgatherv (site 22), Allreduce (site 2). (Numeric values were lost in transcription.)

mpiP 101 / Output – per Callsite Message Size

Callsite Message Sent statistics (all, sent bytes)
Columns: Name, Site, Rank, Count, Max, Mean, Min, Sum
One row per rank plus an aggregate “*” row for each callsite: Allgather (site 16), Allgatherv (site 22), Allreduce (site 2). (Numeric values were lost in transcription.)

Fine Tuning the Profile Run
mpiP advanced features
 User-controlled stack trace depth
 Reduced output for large-scale experiments
 Application control to limit scope
 Measurements for MPI I/O routines
Controlled by the MPIP environment variable
 Set by the user before the profile run
 Command-line-style argument list
 Example: MPIP="-c -o -k 4"

mpiP Parameters

Param   Description                              Default
-c      Concise output / no callsite data
-f dir  Set output directory
-k n    Set callsite stack traceback size to n   1
-l      Use less memory for data collection
-n      Do not truncate pathnames
-o      Disable profiling at startup
-s n    Set hash table size                      256
-t x    Print threshold                          0.0
-v      Print concise & verbose output

Controlling the Stack Trace
Callsites are determined using stack traces
 Starting from the current call stack, going backwards
 Useful to avoid MPI wrappers
 Helps to distinguish library invocations
Tradeoff: stack trace depth
 Too short: can’t distinguish invocations
 Too long: extra overhead / too many callsites
User can set the stack trace depth
 -k parameter

Concise Output
The output file contains many details
 Users are often only interested in the summary
 Per-callsite/task data harms scalability
Option to provide concise output
 Same basic format
 Omits collection of per-callsite/task data
User controls the output format through parameters
 -c = concise output only
 -v = provide concise and full output files

Limiting Scope
By default, mpiP measures the entire execution
 Any event between MPI_Init and MPI_Finalize
Optional: controlling mpiP from within the application
 Disable data collection at startup (-o)
 Enable using MPI_Pcontrol(x)
▪ x=0: Disable profiling
▪ x=1: Enable profiling
▪ x=2: Reset profile data
▪ x=3: Generate full report
▪ x=4: Generate concise report

Limiting Scope / Example

for (i = 1; i < 10; i++) {
  switch (i) {
  case 5:
    MPI_Pcontrol(2);   /* reset & start in iteration 5 */
    MPI_Pcontrol(1);
    break;
  case 6:
    MPI_Pcontrol(0);   /* stop & report in iteration 6 */
    MPI_Pcontrol(4);
    break;
  default:
    break;
  }
  /* ... compute and communicate for one timestep ... */
}

mpiP Case Study
First-principles molecular dynamics code at LLNL
 Tuning at nodes on BG/L
Two main observations
 A few MPI_Bcasts dominate
 Better node mappings
▪ Up to 64% speedup
 Significant time spent in datatype operations
▪ Hidden in underlying support libraries
 Simplified datatype management
▪ 30% performance improvement

Platforms
Highly portable design
 Built on top of PMPI, which is part of any MPI
 Very few dependencies
Tested on many platforms, including
 Linux (x86, x86-64, IA-64, MIPS64)
 BG/L & BG/P
 AIX (Power 3/4/5)
 Cray XT3/4/5 with Catamount and CNL
 Cray X1E

mpiP Installation
Download from
 Current release version:
 CVS access to the development version
Autoconf-based build system with options to
 Disable I/O support
 Pick a timing option
 Choose a name-demangling scheme
 Build on top of the suitable stack tracer
 Set maximal stack trace depth

mpiPView: The GUI for mpiP
 Optional: displays mpiP data in a GUI
 Implemented as part of the Tool Gear project
 Reads mpiP output files
 Provides connections to source files
Usage process
 First: select input metrics
 Hierarchical view of all callsites
 Source panel once a callsite is selected
 Ability to remap source directories

Example