A Framework for Online Performance Analysis and Visualization of Large-Scale Parallel Applications
Kai Li, Allen D. Malony, Robert Bell, Sameer Shende
Department of Computer and Information Science, Computational Science Institute, NeuroInformatics Center, University of Oregon

Outline
 Problem description
 Scaling and performance observation
 Interest in online performance analysis
 General online performance system architecture
 Access models
 Profiling issues and control issues
 Framework for online performance analysis
 TAU performance system
 SCIRun computational and visualization environment
 Experiments
 Conclusions and future work

Problem Description
 Need for parallel performance observation
 Instrumentation, measurement, analysis, visualization
 In general, there is a concern about intrusion
 Seen as a tradeoff with accuracy of performance diagnosis
 Scaling complicates observation and analysis
 Issues of data size, processing time, and presentation
 Online approaches add capabilities as well as problems
 Performance interaction, but at what cost?
 Tools for large-scale performance observation online
 Supporting performance system architecture
 Tool integration, effective usage, and portability

Scaling and Performance Observation
 Consider "traditional" measurement methods
 Profiling: summary statistics calculated during execution
 Tracing: time-stamped sequence of execution events
 More parallelism → more performance data overall
 Performance specific to each thread of execution
 Possible increase in the number of interactions between threads
 Harder to manage the data (memory, transfer, storage, ...)
 More parallelism / performance data → harder analysis
 More time-consuming to analyze
 More difficult to visualize (meaningful displays)
 Need techniques to address scaling at all levels
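To make the profiling/tracing contrast concrete, here is a minimal sketch of the two measurement forms; the types are hypothetical illustrations, not TAU's actual internals:

```cpp
// Minimal sketch of the two measurement forms; the types are
// hypothetical illustrations, not TAU's internal data structures.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Profiling: one summary record per event, updated in place.
struct ProfileEntry {
    uint64_t calls = 0;           // number of invocations
    double   inclusive_usec = 0;  // time including callees
    double   exclusive_usec = 0;  // time excluding callees
};
std::map<std::string, ProfileEntry> profile;  // size ~ number of events

// Tracing: one time-stamped record appended per event occurrence.
struct TraceRecord {
    double   timestamp_usec;  // when the event occurred
    uint32_t event_id;        // which routine or message
    bool     is_entry;        // entry or exit
};
std::vector<TraceRecord> trace;  // grows with execution time

// A profile stays bounded by the number of distinct events; a trace
// grows with run time and parallelism, which is the scaling problem.
```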

Why Complicate Matters with Online Methods?
 Adds interactivity to the performance analysis process
 Opportunity for dynamic performance observation
 Instrumentation change
 Measurement change
 Allows for control of performance data volume
 Post-mortem analysis may be "too late"
 View of the status of long-running jobs
 Allows for early termination
 Computation steering to achieve "better" results
 Performance steering to achieve "better" performance
 Online performance observation may be intrusive

Related Ideas
 Computational steering
 Falcon (Schwan, Vetter): computational steering
 Dynamic instrumentation and performance search
 Paradyn (Miller): online performance bottleneck analysis
 Adaptive control and performance steering
 Active Harmony (Hollingsworth): automated decision control
 Autopilot (Reed): actuator/sensor performance steering
 Scalable monitoring
 Peridot (Gerndt): automatic online performance analysis
 MRNet (Miller): multicast reduction for access / control
 Scalable analysis and visualization
 VNG (Brunst): parallel trace analysis

General Online Performance Observation System
[Architecture diagram with components: performance instrumentation, performance measurement, performance data, performance analysis, performance visualization, and performance control.]

Models of Performance Data Access (Monitoring)
 Push model
 Producer/consumer style of access and transfer
 Application decides when/what/how much data to send
 External analysis tools only consume performance data
 Availability of new data is signaled passively or actively
 Pull model
 Client/server style of performance data access and transfer
 Application is a performance data server
 Access decisions are made externally by analysis tools
 Two-way communication is required
 Push/pull models
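A schematic contrast of the two access models; the interfaces below are hypothetical, intended only to show where the control decision sits:

```cpp
// Hypothetical interfaces contrasting the two access models; only the
// placement of the control decision matters here.
#include <string>

struct ProfileSnapshot { std::string data; };

// Push model: the application decides when to publish; the analysis
// tool passively consumes whatever arrives (via file, stream, ...).
class PushConsumer {
public:
    void onSnapshot(const ProfileSnapshot& snapshot) {
        // analyze the data the application chose to send
        (void)snapshot;
    }
};

// Pull model: the application is a performance data server; the tool
// decides when and what to request, so two-way communication is needed.
class PerformanceDataServer {
public:
    ProfileSnapshot query(const std::string& what) {
        ProfileSnapshot s;
        // gather exactly the data the tool asked for
        (void)what;
        return s;
    }
};
```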

Online Profiling Issues
 Profiles are summary statistics of performance
 Kept with respect to some unit of parallel execution
 Profiles are distributed across the machine (in memory)
 Must be gathered and delivered to the profile analysis tool
 Profile merging must take place (possibly in parallel)
 Consistency checking of profile data
 Callstack must be updated to generate correct profile data
 Correct communication statistics may require completion
 Event identification (not necessary if event names are saved)
 A sequence of profile samples allows interval analysis
 Interval frequency depends on profile collection delay
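A minimal sketch of the merging step, assuming profiles keyed by event name and reusing the hypothetical ProfileEntry type from the earlier sketch; merging by name sidesteps the event-identification issue noted above:

```cpp
// Minimal sketch of profile merging: per-thread profiles keyed by event
// name are combined by summing their summary statistics. Reuses the
// hypothetical ProfileEntry type from the earlier sketch.
#include <map>
#include <string>
#include <vector>

using Profile = std::map<std::string, ProfileEntry>;

Profile mergeProfiles(const std::vector<Profile>& perThread) {
    Profile merged;
    for (const Profile& p : perThread) {
        for (const auto& [eventName, entry] : p) {
            // Keying by event name avoids requiring consistent numeric
            // event identifiers across threads.
            ProfileEntry& m = merged[eventName];
            m.calls          += entry.calls;
            m.inclusive_usec += entry.inclusive_usec;
            m.exclusive_usec += entry.exclusive_usec;
        }
    }
    return merged;
}
```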

Performance Control
 Instrumentation control
 Dynamic instrumentation
 Inserts / removes instrumentation at runtime
 Measurement control
 Dynamic measurement
 Enabling / disabling / changing of measurement code
 Dynamic instrumentation or measurement variables
 Data access control
 Selection of what performance data to access
 Control of the frequency of access
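A generic sketch of measurement control, in which measurement code checks a runtime flag that an external controller can toggle; this illustrates the idea, not TAU's actual control API:

```cpp
// Generic sketch of measurement control: measurement code checks a
// runtime flag, so an external controller can enable or disable it
// without re-instrumenting. Reuses the hypothetical ProfileEntry type;
// this is an illustration of the idea, not TAU's control API.
#include <atomic>
#include <chrono>

std::atomic<bool> measurementEnabled{true};  // toggled by the control path

struct ScopedTimer {
    bool active;
    std::chrono::steady_clock::time_point start;
    ProfileEntry* entry;

    explicit ScopedTimer(ProfileEntry* e)
        : active(measurementEnabled.load(std::memory_order_relaxed)),
          entry(e) {
        if (active) start = std::chrono::steady_clock::now();
    }
    ~ScopedTimer() {
        if (!active) return;  // measurement disabled: near-zero cost
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        entry->calls += 1;
        entry->inclusive_usec += static_cast<double>(us);
    }
};
```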

TAU Performance System Framework
 Tuning and Analysis Utilities (a.k.a. Tools Are Us)
 Performance system framework for scalable parallel and distributed high-performance computing
 Targets a general complex-system computation model
 Nodes / contexts / threads
 Multi-level: system / software / parallelism
 Measurement and analysis abstraction
 Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
 Portable performance profiling/tracing facility
 Open software approach
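For context, typical TAU source instrumentation of an MPI/C++ code looks roughly like the following; the macros come from TAU's C/C++ API, though exact usage may vary by TAU version, so treat this as an illustrative sketch:

```cpp
// Typical TAU source instrumentation for an MPI/C++ code. The macros
// come from TAU's C/C++ API (TAU.h); exact usage may vary by version.
#include <TAU.h>
#include <mpi.h>

void compute() {
    TAU_PROFILE("compute()", "", TAU_USER);  // scoped timer for this routine
    // ... work ...
}

int main(int argc, char** argv) {
    TAU_PROFILE_INIT(argc, argv);
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    TAU_PROFILE_SET_NODE(rank);  // map the MPI rank to a TAU node

    compute();

    MPI_Finalize();
    return 0;
}
```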

TAU Performance System Architecture
[Architecture diagram of the TAU performance system, with trace and profile output feeding tools such as EPILOG, Paraver, and ParaProf.]

Online Profile Measurement and Analysis in TAU
 Standard TAU profiling
 Per node / context / thread
 Profile "dump" routine
 Context level
 One profile file per thread in the context
 Appends to the profile file
 Selective event dumping
 Analysis tools access the files through a shared file system
 Application-level profile "access" routine
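A sketch of how a time-stepped application might trigger periodic profile dumps; TAU_DB_DUMP() is TAU's profile dump macro, while the loop structure and dump interval are illustrative assumptions:

```cpp
// Sketch of periodic profile dumping from a time-stepped simulation.
// TAU_DB_DUMP() is TAU's profile dump macro; the loop structure and
// dump interval are illustrative assumptions.
#include <TAU.h>

void run(int numSteps) {
    const int dumpInterval = 10;  // assumed steps between samples
    for (int step = 0; step < numSteps; ++step) {
        // ... one simulation time step ...
        if (step % dumpInterval == 0) {
            TAU_DB_DUMP();  // write the current profile to the file system
        }
    }
}
```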

Online Performance Analysis and Visualization
[Architecture diagram: the application, measured by the TAU performance system, writes performance data streams to the file system; a performance data reader and performance data integrator (handling sample sequencing, reader synchronization, and accumulated samples) feed the performance analyzer and performance visualizer built in SCIRun (Univ. of Utah), with performance steering back to the application.]
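Reader synchronization over the shared file system might look like the following sketch: poll until every expected profile file for a sample is present before reading. TAU names profile files profile.&lt;node&gt;.&lt;context&gt;.&lt;thread&gt;; the single-context assumption and the polling loop are illustrative, not the paper's exact protocol:

```cpp
// Sketch of reader-side synchronization over the shared file system:
// poll until every expected profile file for a sample is present before
// reading. TAU names profile files profile.<node>.<context>.<thread>;
// the single-context assumption and polling loop are illustrative.
#include <chrono>
#include <filesystem>
#include <string>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

std::vector<fs::path> waitForSample(const fs::path& dir, int numRanks) {
    for (;;) {
        std::vector<fs::path> files;
        for (int r = 0; r < numRanks; ++r) {
            fs::path f = dir / ("profile." + std::to_string(r) + ".0.0");
            if (fs::exists(f)) files.push_back(f);
        }
        if (static_cast<int>(files.size()) == numRanks) return files;
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}
```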

Profile Sample Data Structure in SCIRun
[Diagram: profile samples organized hierarchically by node, context, and thread.]
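The hierarchy maps naturally onto a nested container; a hypothetical layout for illustration, reusing the Profile type from the merging sketch above:

```cpp
// The node/context/thread hierarchy as a nested container; hypothetical
// layout for illustration, reusing the Profile type from the merging
// sketch above.
#include <map>
#include <string>

// node id -> context id -> thread id -> per-thread profile
using ProfileSample = std::map<int, std::map<int, std::map<int, Profile>>>;

double exclusiveTime(const ProfileSample& s, int node, int context,
                     int thread, const std::string& event) {
    return s.at(node).at(context).at(thread).at(event).exclusive_usec;
}
```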

Performance Analysis/Visualization in SCIRun
[Screenshot of the SCIRun dataflow program used for performance analysis and visualization.]

Uintah Computational Framework (UCF)
 University of Utah
 UCF analysis
 Scheduling
 MPI library
 Components
 500 processes
 Use for online and offline visualization
 Apply SCIRun steering

"Terrain" Performance Visualization
[Figure: "terrain" surface visualization of performance data.]

Scatterplot Displays
 Each point's coordinates are determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome
 Min/max value range
 Effective for cluster analysis
 Relation between MPI_Recv and MPI_Waitsome
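The coordinate mapping is plain min/max normalization of each metric; a small illustrative sketch with the three MPI event times as axes:

```cpp
// Sketch of the scatterplot coordinate mapping: each process becomes a
// 3D point whose axes are its times in three MPI events (MPI_Reduce,
// MPI_Recv, MPI_Waitsome), each normalized to its min/max range across
// processes. Illustrative only.
#include <algorithm>
#include <array>
#include <vector>

struct Point3 { double x, y, z; };

// metrics[p] = {reduce, recv, waitsome} time for process p
std::vector<Point3> toScatter(const std::vector<std::array<double, 3>>& metrics) {
    if (metrics.empty()) return {};
    std::array<double, 3> lo = metrics[0], hi = metrics[0];
    for (const auto& m : metrics) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], m[a]);
            hi[a] = std::max(hi[a], m[a]);
        }
    }
    std::vector<Point3> pts;
    for (const auto& m : metrics) {
        auto norm = [&](int a) {
            double range = hi[a] - lo[a];
            return range > 0.0 ? (m[a] - lo[a]) / range : 0.0;
        };
        pts.push_back({norm(0), norm(1), norm(2)});
    }
    return pts;
}
```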

Online Uintah Performance Profiling
 Demonstration of online profiling capability
 Colliding elastic disks
 Test material point method (MPM) code
 Executed on 512 processors of ASCI Blue Pacific at LLNL
 Example 1 (terrain visualization)
 Exclusive execution time across event groups
 Multiple time steps
 Example 2 (bargraph visualization)
 MPI execution time and performance mapping
 Example 3 (domain visualization)
 Task time allocation to "patches"

Example 1 (Event Groups)
[Terrain visualization: exclusive execution time across event groups over multiple time steps.]

Example 2 (MPI Performance)
[Bargraph visualization: MPI execution time and performance mapping.]

Example 3 (Domain-Specific Visualization)
[Domain visualization: task time allocation to "patches".]

ParaProf Framework Architecture
 Portable, extensible, and scalable tool for profile analysis
 Offers "best of breed" capabilities to performance analysts
 Built as a profile analysis framework for extensibility

ParaProf Profile Display (VTF)
 Virtual Testshock Facility (VTF), Caltech ASCI Center
 Dynamic measurement, online analysis, visualization

Full Profile Display (SAMRAI++)
 Structured AMR toolkit (SAMRAI++), LLNL
 512 processes

Evaluation of Experimental Approaches
 Currently only supporting the push model
 File system solution for moving performance data
 Is this a scalable solution?
 Robust solution that can leverage high-performance I/O
 May result in high intrusion
 However, does not require IPC
 Should be relatively portable
 Analysis and visualization run only sequentially

Possible Improvements
 Profile merging at the context level to reduce the number of files
 Merging at the node level may require explicit processing
 Concurrent trace merging could also reduce files
 Hierarchical merge tree
 Will require explicit processing
 Could consider IPC transfer (see the sketch after this list)
 MPI (e.g., as used in mpiP for profile merging)
 Create our own communicators
 Sockets or PACX between compute server and analyzer
 Leverage large-scale systems infrastructure
 Parallel profile analysis
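For the MPI option, the key idea is giving the measurement layer its own communicator so merging traffic cannot collide with application messages; a sketch (names hypothetical, approach similar to mpiP's):

```cpp
// Sketch of IPC transfer via MPI: give the measurement layer its own
// communicator so merging traffic cannot collide with application
// messages (mpiP uses a similar approach). Names are hypothetical.
#include <mpi.h>

static MPI_Comm toolComm = MPI_COMM_NULL;

void toolInit() {
    // Same process group as the application, but a separate
    // communication context: tool and application messages never mix.
    MPI_Comm_dup(MPI_COMM_WORLD, &toolComm);
}

void toolReduceEventTime(double localTime) {
    double total = 0.0;
    // Aggregate one metric on rank 0 over the tool's own communicator;
    // a full merge would gather serialized profiles (e.g., MPI_Gatherv).
    MPI_Reduce(&localTime, &total, 1, MPI_DOUBLE, MPI_SUM, 0, toolComm);
}
```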

Concluding Remarks
 Interest in online performance monitoring, analysis, and visualization for large-scale parallel systems
 Need to use it intelligently
 Benefit from other scalability considerations of the system software and system architecture
 See it as an extension of the parallel system architecture
 Avoid solutions that have portability difficulties
 In part, this is an engineering problem
 Need to work with the system configuration you have
 Need to understand whether the approach is applicable to the problem
 Not clear that there is a single solution

Future Work
 Build online support into the TAU performance system
 Extend to support pull-model capabilities
 Develop hierarchical data access solutions
 Performance studies of the full system
 Latency analysis
 Bandwidth analysis
 Integration with other performance tools
 System performance monitors
 ParaProf parallel profile analyzer
 Development of a 3D visualization library
 Portability focus