Performance Analysis with Parallel Performance Wizard
Prashanth Prakash, Research Assistant
Dr. Vikas Aggarwal, Research Scientist
Vrishali Hajare, Research Assistant
Professor Alan D. George, Principal Investigator
HCS Research Laboratory, University of Florida

Outline
- Introduction talk (~20 minutes)
- Hands-on
  - PPW basics
  - Performance data collection
  - Performance analysis
  - Automatic analysis
- Feel free to ask questions during the talk or hands-on

Parallel Performance Analysis
The need for performance analysis
- High-performance computing has performance as an explicit, fundamental goal
- "I just got my parallel program working, and… my program does NOT yield the expected performance. Why is this? How do I fix my program?"
The challenge of performance analysis
- Understanding the performance of sequential applications can already be challenging
- The complexity of parallel computing makes it even more difficult to understand program performance without performance analysis tools

Performance Analysis Approaches
Three general performance analysis approaches
- Analytical modeling
  - Mostly predictive methods; can also be used in conjunction with experimental performance measurement
  - Pros: easy to use, fast, can be performed without running the program
  - Cons: usually not very accurate
- Simulation
  - Pros: allows performance estimation of a program on various system architectures
  - Cons: slow, not generally applicable for regular UPC/SHMEM users
- Experimental performance measurement
  - Strategy used by most modern performance analysis tools
  - Uses measurement of actual events to perform the analysis
  - Pros: most accurate
  - Cons: can be time-consuming (iterative tuning process)

Role of a Performance Analysis Tool
[Diagram: Original Application -> Runtime Performance Data Gathering -> Data Processing and Analysis -> Data and Result Presentation -> Optimized Application]

Performance Analysis Stages
- Instrumentation: insertion of code to facilitate performance measurement
- Measurement: collection of performance data at runtime
- Analysis: examination and processing of performance data to find and potentially resolve bottlenecks
- Presentation: display of analyzed data to the tool user
- Optimization: modifying the application to remove performance problems

Instrumentation Techniques
Runtime/compiler instrumentation
- Provides the most detailed information about the user's program
- Requires vendor cooperation (modifications to the compiler/runtime)
Source instrumentation
- Directly modifies the user's source code
- Can provide much information, but may interfere with compiler optimizations
Interposition ("wrapper libraries")
- No recompilation needed, only relinking
- Only gets information about library calls
- Can be difficult to get source-level information
- Relies on alternate function entry points or dynamic-linker hacks
- A minimal wrapper sketch follows this slide
Binary instrumentation
- Most of the benefits of source instrumentation without the need for recompilation
- Can be difficult to get source-level information
- Highly platform-specific; existing toolkits lack support for some platforms (e.g., Cray)
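To illustrate the interposition idea, here is a minimal, hypothetical wrapper around upc_memget using the GNU linker's --wrap option. The timing and reporting code is a placeholder, not PPW's actual measurement interface, and whether a given UPC compiler routes upc_memget through a plain link-time symbol is an assumption.

```c
/* wrap_memget.c -- hypothetical interposition wrapper (not PPW's real code).
 * Link the application with -Wl,--wrap=upc_memget so that calls to
 * upc_memget are redirected to __wrap_upc_memget below. */
#include <stdio.h>
#include <time.h>
#include <upc.h>

/* The linker resolves this symbol to the real library implementation. */
void __real_upc_memget(void *dst, shared const void *src, size_t n);

void __wrap_upc_memget(void *dst, shared const void *src, size_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    __real_upc_memget(dst, src, n);          /* forward to the real call */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6
              + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    fprintf(stderr, "[thread %d] upc_memget(%zu bytes) took %.1f us\n",
            MYTHREAD, n, us);                /* placeholder "measurement" */
}
```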

Measurement Techniques
Profiling
- Records statistical information about execution time and/or hardware counter values (PAPI)
- Relates the information to basic blocks (functions, upc_forall loops) in the source code
- Important concept: inclusive vs. exclusive time (total vs. self); a toy example follows this slide
Tracing
- Records a full log of when events happen at runtime and how long they take
- Gives very complete information about what happened at runtime
- Requires much more storage space than profiling!
Sampling
- Special low-overhead mode of profiling that attributes performance information via indirect measurement (samples)
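A toy example of the inclusive/exclusive distinction; the function names and the timings in the comments are made up for illustration, not taken from any real profile.

```c
/* inclusive_vs_exclusive.c -- illustrative only.
 * Suppose a profiler reports for solve():
 *   inclusive (total) time: 10.0 s  (everything spent while solve() is on the stack)
 *   exclusive (self)  time:  2.0 s  (time in solve()'s own statements only)
 * The remaining 8.0 s is the inclusive time of its callee, exchange(). */

void exchange(void)
{
    /* stand-in for communication: its time (say 8.0 s) is both its
       exclusive and its inclusive time, since it calls nothing else */
}

void solve(void)
{
    /* stand-in for local computation (say 2.0 s): solve()'s exclusive time */
    exchange();  /* counted in solve()'s inclusive time, not its exclusive time */
}

int main(void)
{
    solve();
    return 0;
}
```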

Parallel Performance Wizard (PPW)
Performance analysis tool developed in the HCS Lab here at UF
- Designed for partitioned global-address-space (PGAS) programming models (UPC and SHMEM)
- Also supports MPI; other support is in the works
Features
- Uses the experimental measurement approach
- Provides profiling and tracing support
- Has numerous visualizations and advanced automated analysis
Overarching design goals
- Be user-friendly
- Enhance productivity
- Aim for portability

PPW Hands-on…

Hands-on
Boot the liveDVD in a VM or directly on hardware
Initial setup
- Export the PATH variable to include the recent releases of PPW and UPC:
  export PATH=/usr/local/packages/ppw /bin/:/usr/local/packages/bupc /bin/:$PATH
- All applications we use today are in the directory:
  cd /home/livetau/workshop-point/UPC_apps
- You can download these slides (the following slides have the necessary commands and will come in handy)

Programming in UPC (bupc)
Compiling a UPC program
- upcc hello.c -o hello
Execution
- upcrun -n 4 hello
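For reference, a minimal hello.c that the commands above could compile and run; the file name comes from the slide, while the body is a typical UPC hello-world rather than the actual workshop file.

```c
/* hello.c -- minimal UPC program for the upcc/upcrun commands above */
#include <upc.h>
#include <stdio.h>

int main(void)
{
    /* Each UPC thread prints its rank and the total thread count. */
    printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);

    upc_barrier;  /* wait for all threads before exiting */
    return 0;
}
```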

Using PPW in a Nutshell
Recompile the application (Instrumentation)
- Use ppwupcc instead of upcc
- ppwshmemcc (for SHMEM) and ppwmpicc (for MPI)
Run the application (Measurement)
- ppwrun
View performance data (Analysis + Presentation)
- ppw file.par
Change the code (Optimization), recompile, repeat

PPW (for UPC) in a Nutshell
Recompile the application (Instrumentation)
- ppwupcc CAMEL_upc.c -o camel
Run the application (Measurement)
- ppwrun --output=file.par upcrun -n 4 camel abcd1234
View performance data (Analysis + Presentation)
- ppw file.par
Change the code (Optimization), recompile, repeat
Note: PPW should be built --with-upc, and Berkeley UPC should be built with --with-multiconf=+opt_inst

PPW Useful Options
Tracking user function entry and exit
- Pass --inst-functions to ppwupcc
Communication matrix
- Pass --comm-stats to ppwrun
Just open the .par file using ppw to find all the collected data
- ppw file.par
Source archive (.sar file)
- Required during execution
- Keep the .sar file in the same directory as the executable

NPB 2.4
Compiling
- cd NPB2.4/FT
- make CLASS=X NP=N, where X can be S, A, B, or C (preferably use S or A)
Execution is the same as before
The UPC NPB2.4 benchmarks are developed and maintained by George Washington University (upc.gwu.edu)

Tracing
Compilation is the same as before, using ppwupcc
Pass the --trace option to request tracing
- ppwrun --trace --output=a.par upcrun -n 4 ft.A.4
Convert to SLOG-2 using ppw (or par2slog2)
- File -> Export ->
Use Jumpshot to view the trace
- jumpshot ft.slog2

Export: Convert to Other Popular Formats
The .par file can be exported to several popular performance data formats, including:
- TAU profile
- CUBE profile
- OTF trace file (Vampir)
- SLOG-2 (Jumpshot)

Case Study: Analyzing FT of NPB2.4
NPB2.4 FT benchmark (class=A, np=4) executed on an InfiniBand cluster with 1 thread per node
You can download the par file and slog2 file

Case Study: FT
Identify the bottleneck
- Sort by total time and look for bottlenecks: upc_memget at ft.c:1950
- This cannot be confirmed from the profile alone, so take a look at the trace
Observe the trace output and the behavior of the code section from ft.c:1943 to ft.c:1953
- Serialization of the upc_memget calls, which is unnecessary in this case (the sketch below shows the general pattern)
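To make the serialization concrete, here is an illustrative sketch of a blocking all-to-all gather of the kind involved; this is not the actual ft.c code, and the buffer names and sizes are invented. Each blocking upc_memget must complete before the next one starts, so the transfers are serialized on every thread.

```c
/* gather_blocking.c -- illustrative only, not NPB FT source code. */
#include <upc.h>

#define BLOCK 1024
/* Hypothetical shared source buffer, BLOCK elements with affinity to each thread. */
shared [BLOCK] double src[BLOCK * THREADS];

/* dst is a private buffer of BLOCK * THREADS doubles provided by the caller. */
void gather_blocking(double *dst)
{
    for (int t = 0; t < THREADS; t++) {
        /* Each call blocks until BLOCK doubles have arrived from thread t,
         * so no two transfers can overlap. */
        upc_memget(&dst[t * BLOCK], &src[t * BLOCK], BLOCK * sizeof(double));
    }
}
```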

Case Study: FT
How to fix?
- Use bupc_memget_async, the Berkeley UPC extension for asynchronous memget (a sketch follows this slide)
Did it improve performance?
- Download the par file generated after the changes to ft.c
- Observe the changes in the profile data
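A sketch of the same gather restructured with Berkeley UPC's asynchronous memget extension. This is not the actual ft.c fix; the header name and the exact extension API are assumptions and should be checked against the Berkeley UPC documentation for the installed version.

```c
/* gather_async.c -- illustrative only; verify the bupc_* API against the
 * Berkeley UPC documentation before use. */
#include <upc.h>
#include <bupc_extensions.h>   /* Berkeley UPC extensions (assumed header name) */

#define BLOCK 1024
shared [BLOCK] double src[BLOCK * THREADS];

void gather_async(double *dst)
{
    bupc_handle_t handles[THREADS];   /* one handle per outstanding transfer */

    /* Initiate all gets up front so the transfers can proceed concurrently. */
    for (int t = 0; t < THREADS; t++)
        handles[t] = bupc_memget_async(&dst[t * BLOCK], &src[t * BLOCK],
                                       BLOCK * sizeof(double));

    /* Wait for every outstanding transfer to complete. */
    for (int t = 0; t < THREADS; t++)
        bupc_waitsync(handles[t]);
}
```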

Automatic Analysis
Why do we need automatic analysis?
- The increasing size of performance data sets makes it hard to identify and resolve bottlenecks manually
What will automatic analyses do?
- Automatically detect, diagnose, and possibly resolve bottlenecks

Automatic Analysis
Application analyses
- Deal with a single run and include:
  - Bottleneck detection
  - Cause analysis
  - High-level analysis
Experiment set analyses
- Compare the performance of related runs:
  - Scalability analysis
  - Revision analysis

Application Analysis
Bottleneck detection
- Examines profile data and identifies the bottleneck profiling entries
- Uses a baseline-comparison and deviation-evaluation method (an illustrative sketch follows this slide)
Cause analysis
- Identifies the reason for bottlenecks; requires trace data to complete the analysis
High-level analysis
- Mainly used to detect bottleneck nodes that, when optimized, could improve the application performance for a single experiment
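As a rough illustration of what baseline comparison and deviation evaluation can mean, the toy program below flags a profiling entry whose per-thread time deviates strongly from the average. The threshold, the data, and the rule itself are invented and do not describe PPW's actual algorithm.

```c
/* deviation_sketch.c -- toy illustration, not PPW's detection algorithm. */
#include <stdio.h>

#define NTHREADS 4

/* Hypothetical per-thread total times (seconds) for one profiling entry. */
static const double entry_time[NTHREADS] = { 1.1, 1.0, 3.9, 1.2 };

int main(void)
{
    double avg = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        avg += entry_time[t];
    avg /= NTHREADS;

    const double factor = 2.0;   /* flag entries more than 2x the average */
    for (int t = 0; t < NTHREADS; t++)
        if (entry_time[t] > factor * avg)
            printf("thread %d: %.1f s vs. %.1f s average -> potential bottleneck\n",
                   t, entry_time[t], avg);
    return 0;
}
```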

Application Analysis
- Analysis -> Run Application Analysis

Experiment Set Analyses
Scalability analysis
- Plots the scaling factor (relative speedup) values against the ideal scaling value (one common formulation follows this slide)
- A scaling factor of 1.00 implies perfect scalability
- Analysis -> Run Scalability Analysis
Revision analysis
- Compares and evaluates different versions of the same application
- Profile Charts -> Total Times by Function
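For reference, one common way to define these quantities (PPW's exact computation may differ): with $T(p)$ the measured runtime on $p$ threads and $p_0$ the smallest thread count in the experiment set,

$$ S(p) = \frac{T(p_0)}{T(p)}, \qquad \text{scaling factor}(p) = \frac{S(p)}{p / p_0} $$

so a scaling factor of 1.00 means the relative speedup $S(p)$ matches the ideal value $p/p_0$.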

For More Information on PPW
Visit the PPW website
The website has
- An overview of the tool
- Links to a detailed online/printable user manual
- Downloadable source code for the entire tool
- Workstation GUI installers
  - Windows installer
  - Linux packages
- Publications covering PPW and related research