TAUmon: Scalable Online Performance Data Analysis in TAU
Chee Wai Lee, Allen D. Malony, Alan Morris
Performance Research Laboratory, Department of Computer and Information Science, University of Oregon
PROPER 2010

Outline
- Motivation
- Brief review of prior work
- TAUmon design and objectives
- Scalable analysis operations
- Transports
  - MRNet
  - MPI
- TAUmon experiments
  - Perspectives on understanding applications
  - Experiments
  - Scaling results
- Remarks

Motivation
- Performance problem analysis is increasingly complex
  - Multi-core, heterogeneous, and extreme-scale computing
  - Adaptive algorithms and runtime application tuning
  - Performance dynamics: variability within/between executions
- Neo-performance measurement and analysis perspective
  - Static, offline analysis -> dynamic, online analysis
  - Scalable runtime analysis of parallel performance data
  - Performance feedback to the application for adaptive control
  - Integrated performance monitoring (measurement + query)
  - Co-allocation of additional (tool-specific) system resources
- Goal: scalable, integrated parallel performance monitoring

Parallel Performance Measurement and Data
- Parallel performance tools measure locally and concurrently
  - Scaling dictates "local" measurements (profile, trace)
  - Save data with "local context" (processes or threads)
  - Done without synchronization or central control
- Parallel performance state is globally distributed as a result
  - Logically part of the application's global data space
  - Offline: output data at execution end for post-mortem analysis
  - Online: access to the performance state for runtime analysis
- Definition: monitoring
  - Online access to the parallel performance (data) state
  - May or may not involve runtime analysis

Monitoring for Performance Dynamics
- Runtime access to parallel performance data
  - Must be scalable and lightweight
  - Raises concerns of overhead and intrusion
  - Support for performance-adaptive, dynamic applications
- Alternative 1: extend existing performance measurement
  - Create our own integrated monitoring infrastructure
  - Disadvantage: must maintain our own monitoring framework
- Alternative 2: couple with other monitoring infrastructure
  - Leverage scalable middleware from other supported projects
  - Challenge: measurement system / monitor integration

Performance Dynamics: Parallel Profile Snapshots
- Profile snapshots are parallel profiles recorded at runtime
- Show performance profile dynamics (all profile types allowed)
- [Figure: information vs. overhead trade-off, placing profiles, profile snapshots, and traces along the spectrum]
- A. Morris, W. Spear, A. Malony, and S. Shende, "Observing Performance Dynamics using Parallel Profile Snapshots," European Conference on Parallel Processing (EuroPar), 2008.

Parallel Profile Snapshots of FLASH 3.0 (UIC)
- Simulation of astrophysical thermonuclear flashes
- Snapshots show profile differences since the last snapshot
- Capture all events since the beginning, per thread
- Mean profile calculated post-mortem
- Highlight changes in performance per iteration and at checkpointing
- [Figure: snapshot timeline with initialization, checkpointing, and finalization phases marked]

FLASH 3.0 Performance Dynamics (Periodic)
- [Figure: periodic performance dynamics of FLASH 3.0, highlighting the INTRFC event]

Prior Performance Monitoring Work
- TAUoverSupermon (UO, Los Alamos National Laboratory)
- TAUg (UO)
- TAUoverMRNet (UO, University of Wisconsin-Madison)
References:
- A. Nataraj, M. Sottile, A. Morris, A. Malony, and S. Shende, "TAUoverSupermon: Low-overhead Online Parallel Performance Monitoring," EuroPar.
- A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller, "A Framework for Scalable, Parallel Performance Monitoring using TAU and MRNet," Concurrency and Computation: Practice and Experience, 22(6):720–735, 2009, special issue on Scalable Tools for High-End Computing.
- A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller, "In Search of Sweet-Spots in Parallel Performance Monitoring," IEEE International Conference on Cluster Computing (Cluster 2008).
- K. Huck, A. Malony, and A. Morris, "TAUg: Runtime Global Performance Data Access using MPI," EuroPVM/MPI, 2006.

TAUmon: Design
- Design a transport-neutral application monitoring framework
  - Based on prior / existing work with various transport systems: Supermon, MRNet, MPI
  - Enable efficient development of monitoring functionality
- Objectives
  - Scalable access to a running application's performance
    - at the end of the application (before parallel teardown)
    - while the application is still running
  - Support for scalable performance data analysis: reduction, statistical evaluation
  - Feedback (data, control) to the application
  - Monitoring engineering and performance efficiency issues

TAUmon: Architecture
- [Figure: MPI processes 0 .. k .. P-1, each with threads and TAU profiles, connected through TAUmon to the monitoring infrastructure]

TAUmon: Current Usage
- TAU_ONLINE_DUMP() collective operation in the application (see the usage sketch below)
  - Called by all threads / processes (originally to output profiles)
  - Arguments will specify the data analysis operation (future)
- Appropriate version of TAU selected for the transport system
  - TAUmonMRNet: TAUmon using the MRNet infrastructure
  - TAUmonMPI: TAUmon using the MPI infrastructure
- User instruments the application with TAU support for the desired monitoring transport system (temporary)
- User submits the instrumented application to the parallel job system
  - Other launch systems must be submitted along with the application to the job scheduler as needed
  - Different machine-specific job-submission scripts
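A minimal usage sketch, assuming the application is already built with TAU instrumentation and linked against a TAUmon-enabled TAU library. Only TAU_ONLINE_DUMP() comes from the slides; the header name and loop structure are illustrative, not a prescribed recipe.

```c
/* Minimal sketch: an MPI application triggering TAUmon each timestep.
 * Assumes TAU instrumentation/linking is handled by the build. */
#include <mpi.h>
#include <TAU.h>   /* assumed TAU header providing TAU_ONLINE_DUMP() */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    for (int iter = 0; iter < 100; iter++) {
        /* ... one timestep of application work ... */

        /* Collective call: every process reaches this point, and the current
         * performance state is handed to the monitoring transport. */
        TAU_ONLINE_DUMP();
    }
    MPI_Finalize();
    return 0;
}
```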

TAUmon: Parallel Profile Data Analysis
- Total parallel profile data size depends on:
  - # events * size per event * # execution threads
  - Event size depends on # metrics
  - Example: 200 events * 100 bytes * 64,000 threads = 1.28 GB
- Monitoring operations
  - Periodic profile data output (à la profile snapshots)
  - Event unification
  - Basic statistics: mean, min/max, standard deviation, ...
  - Advanced statistics: histogram, clustering, ...
- Strong motivation to implement these operations in parallel

Profile Event Unification
- TAU creates events for each process individually
  - Assigns event identifiers locally
  - The same event can have different identifiers on each process
- Analysis requires event identifiers to be unified
- Currently done offline
  - TAU must output full event information from each process
  - The output format stores event names, leading to redundancy
  - Inflates the storage requirements (e.g., 1.28 GB -> 5 GB)
- Implement online parallel event unification
  - Two-phase process (see the sketch below)
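A sketch of a two-phase unification, assuming fixed-width event names and MPI as the transport: phase 1 gathers each rank's local event-name table to rank 0, which builds a global name -> id map; phase 2 broadcasts the global table so every rank can translate its local event ids. This is not TAU's actual implementation; all names and sizes are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAME 64

/* Translate this rank's 'n_local' event names into global ids in 'global_id'. */
void unify_events(char local_names[][MAX_NAME], int n_local, int *global_id,
                  MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Phase 1: gather every rank's name table to the root. */
    int *counts = NULL, *displs = NULL;
    char *all_names = NULL;
    if (rank == 0) counts = malloc(size * sizeof(int));
    MPI_Gather(&n_local, 1, MPI_INT, counts, 1, MPI_INT, 0, comm);

    int total = 0;
    if (rank == 0) {
        displs = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) {
            displs[i] = total * MAX_NAME;  /* byte offset for rank i */
            counts[i] *= MAX_NAME;         /* convert name count to bytes */
            total += counts[i] / MAX_NAME;
        }
        all_names = malloc((size_t)total * MAX_NAME);
    }
    MPI_Gatherv(local_names, n_local * MAX_NAME, MPI_CHAR,
                all_names, counts, displs, MPI_CHAR, 0, comm);

    /* Root builds a deduplicated global table (linear scan for brevity). */
    int n_global = 0;
    char *global_names = NULL;
    if (rank == 0) {
        global_names = malloc((size_t)total * MAX_NAME);
        for (int i = 0; i < total; i++) {
            int found = 0;
            for (int j = 0; j < n_global && !found; j++)
                found = (strncmp(all_names + i * MAX_NAME,
                                 global_names + j * MAX_NAME, MAX_NAME) == 0);
            if (!found)
                memcpy(global_names + (n_global++) * MAX_NAME,
                       all_names + i * MAX_NAME, MAX_NAME);
        }
    }

    /* Phase 2: broadcast the global table; each rank resolves its local ids. */
    MPI_Bcast(&n_global, 1, MPI_INT, 0, comm);
    if (rank != 0) global_names = malloc((size_t)n_global * MAX_NAME);
    MPI_Bcast(global_names, n_global * MAX_NAME, MPI_CHAR, 0, comm);
    for (int i = 0; i < n_local; i++)
        for (int j = 0; j < n_global; j++)
            if (strncmp(local_names[i], global_names + j * MAX_NAME, MAX_NAME) == 0)
                { global_id[i] = j; break; }

    free(global_names);
    if (rank == 0) { free(counts); free(displs); free(all_names); }
}
```

The point is the gather / broadcast structure of the two phases; a production version would avoid the quadratic deduplication and fixed-width names.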

Parallel Profile Merging
- TAU creates a file for every thread of execution
- Profile merging reduces the number of files generated
  - Profiles from each thread are sent to a root process
  - The root process concatenates them into a single file
  - Prerequisite: event unification
- Event unification combined with profile merging leads to more compact storage
  - PFLOTRAN example:
    - 16K cores: 1.5 GB reduced to 300 MB merged
    - 131K cores: 27 GB reduced to 600 MB merged

Basic Statistics
- Mean profile
  - Averaged values for all events and metrics across all threads
  - Easily created using simple reduction (summation) operations (see the sketch below)
  - Other basic statistics can be generated in the same way
- Parallel statistical reduction of profile events can be very fast
- Supports time-series observations
  - Significant events by mean value
- [Figure: FLASH Sod 2D, N=1024, MPI_Allreduce mean over time; sudden spike at iteration 100]
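A minimal sketch of the mean-profile reduction, assuming each rank contributes a dense vector of unified event values for one metric. Function and variable names are illustrative, not TAU's API.

```c
#include <mpi.h>
#include <stdlib.h>

/* Returns a freshly allocated mean profile on rank 0, NULL elsewhere. */
double *mean_profile(const double *local_values, int n_events, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *sums = (rank == 0) ? malloc(n_events * sizeof(double)) : NULL;

    /* Element-wise sum of every rank's event vector, delivered to rank 0. */
    MPI_Reduce(local_values, sums, n_events, MPI_DOUBLE, MPI_SUM, 0, comm);

    if (rank == 0)
        for (int e = 0; e < n_events; e++)
            sums[e] /= size;   /* sum -> mean */
    return sums;
}
```

Min/max follow the same pattern with MPI_MIN / MPI_MAX, and standard deviation needs only a second reduction over squared values.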

Histogramming
- Determine the distribution of threads per event
  - Divide the range of values into a number of bins
  - Determine the number of threads with event values in each bin
- Prerequisites: min/max values and number of bins
- Implementation (see the sketch below):
  - Broadcast min/max and # bins to each node
  - Each node decides which bins to increment based on its own values
  - Partial bin increments from each node are summed via a reduction tree to the root
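A sketch of the histogram step for a single event value per rank, assuming the global min/max were already obtained (e.g. via MPI_MIN / MPI_MAX reductions) and every rank agrees on the number of bins. Names are illustrative, not TAU's code.

```c
#include <mpi.h>
#include <stdlib.h>

/* After the call, 'counts' (length n_bins) on rank 0 holds how many ranks
 * fell into each bin for this event. */
void histogram_event(double value, double vmin, double vmax, int n_bins,
                     long *counts, MPI_Comm comm) {
    /* Rank 0 broadcasts the binning range so every rank bins identically. */
    double range[2] = { vmin, vmax };
    MPI_Bcast(range, 2, MPI_DOUBLE, 0, comm);

    /* Each rank increments the bin its own value falls into... */
    long *local = calloc(n_bins, sizeof(long));
    double width = (range[1] - range[0]) / n_bins;
    int bin = (width > 0.0) ? (int)((value - range[0]) / width) : 0;
    if (bin < 0) bin = 0;
    if (bin >= n_bins) bin = n_bins - 1;   /* clamp the maximum into the top bin */
    local[bin] = 1;

    /* ...and the partial counts are summed up the reduction tree to rank 0. */
    MPI_Reduce(local, counts, n_bins, MPI_LONG, MPI_SUM, 0, comm);
    free(local);
}
```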

Histogramming (continued)
- Histograms are useful for highlighting changes in the distribution of threads for a single event over time
- [Figure: FLASH Sod 2D, N=1024, MPI_Allreduce histogram (number of ranks per bin) over time]

Basic K-Means Clustering
- Discover K equivalence classes of thread behavior
  - A thread's behavior is defined as the vector of all its event values over a single metric
  - Differences in behavior are measured by the Euclidean distance between these vectors in E-dimensional space, where E is the number of events
- [Figure: Euclidean distance over 2 dimensions, e.g. exclusive time of MPI_Allreduce vs. exclusive time of foo()]

K-Means Clustering (continued)
- Parallel K-means clustering algorithm (Root)
  - Root-1: Choose initial K centroids (event-value vectors)
  - Root-2: Broadcast the initial centroids to each Node
  - Root-3: While not converged:
    - 3a: Receive the vector of changes from each Node
    - 3b: Apply the change vector to the K centroids
    - 3c: If there is no change to the centroids or centroid membership, set converged to true
    - 3d: Otherwise, broadcast the new centroids to each Node
  - Root-4: Broadcast the convergence notice to each Node

K-Means Clustering (continued)
- Parallel K-means clustering algorithm (Node)
  - Node-1: While not converged:
    - 1a: Receive the latest K centroid vectors from the Root
    - 1b: For each thread t's event vector, determine which centroid it is closest to
    - 1c: If t's closest centroid changes from k to k', subtract t's event vector from k's entry in the change vector and add the same value to k''s entry
    - 1d: Send the change vector through the reduction tree to the Root
    - 1e: Receive the convergence notification from the Root
- The algorithm produces K mean profiles, one for each cluster
- Clustering reduces data volume and can discover performance trends (a simplified sketch follows)
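A simplified parallel k-means sketch. The slides describe a root/node protocol exchanging change vectors; for brevity this version recomputes centroids each iteration with an allreduce of per-cluster sums and counts, which gives the same result. It assumes each rank contributes one E-dimensional event-value vector; all names are illustrative, not TAU's implementation.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>

void kmeans(const double *vec, int E, int K, int max_iters,
            double *centroids /* K*E doubles, initialized on rank 0 */,
            MPI_Comm comm) {
    MPI_Bcast(centroids, K * E, MPI_DOUBLE, 0, comm);

    int my_cluster = -1;
    for (int iter = 0; iter < max_iters; iter++) {
        /* Assign this rank's vector to the nearest centroid (squared distance). */
        int best = 0;
        double best_d = DBL_MAX;
        for (int k = 0; k < K; k++) {
            double d = 0.0;
            for (int e = 0; e < E; e++) {
                double diff = vec[e] - centroids[k * E + e];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = k; }
        }

        /* Converged when no rank changes cluster membership. */
        int changed = (best != my_cluster), any_changed = 0;
        MPI_Allreduce(&changed, &any_changed, 1, MPI_INT, MPI_LOR, comm);
        my_cluster = best;
        if (!any_changed) break;

        /* Sum member vectors and member counts per cluster across all ranks,
         * then every rank recomputes the centroids identically. */
        double *sums = calloc((size_t)K * (E + 1), sizeof(double));
        memcpy(&sums[best * (E + 1)], vec, E * sizeof(double));
        sums[best * (E + 1) + E] = 1.0;   /* member count in the last slot */
        MPI_Allreduce(MPI_IN_PLACE, sums, K * (E + 1), MPI_DOUBLE, MPI_SUM, comm);
        for (int k = 0; k < K; k++) {
            double n = sums[k * (E + 1) + E];
            if (n > 0.0)
                for (int e = 0; e < E; e++)
                    centroids[k * E + e] = sums[k * (E + 1) + e] / n;
        }
        free(sums);
    }
}
```

On convergence, every rank holds the K centroid vectors, i.e. one mean profile per cluster, which matches the output described on the slide.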

TAUmonMRNet (a.k.a. ToM Revisited)
- TAU over MRNet (ToM)
  - Previously working with MRNet 2.1 (Cluster 2008 paper)
  - 1-phase and 3-phase filters
  - Explored overlay networks with different fan-outs (nodes)
- TAUmon re-engineered for MRNet 3.0 (released last week!)
  - Re-implement ToM functionality
  - Use new MRNet support
  - Current implementation uses the pre-release MRNet 3.0 version
  - Testing with the released version

MRNet Network Configuration
- Scripts are used to set up the MRNet network configuration
- Given P = number of cores for the application, the user chooses an appropriate N = number of tree nodes and K = fan-out to decide how to allocate sufficient computing resources for both the application and MRNet
  - The number of network leaves can be computed as (N/K)*(K-1), as in the sketch below
- Probe processes discover and partition computing resources between the application and MRNet
- The mrnet_topgen utility writes a topology file given K and N and a list of processor hosts available exclusively for MRNet
- The TAU frontend reads the topology file to create the MRNet tree and then writes a new file to inform the application how it can connect to the leaves of the tree
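A quick illustrative check of the resource split using the slide's formula; the numbers chosen for P, N, and K are examples, not values from the experiments, and this helper is not part of the TAU/MRNet tooling.

```c
#include <stdio.h>

int main(void) {
    int P = 4096;          /* application cores (example value) */
    int N = 585, K = 8;    /* MRNet tree nodes and fan-out (example values) */

    /* Leaves of the MRNet tree per the slide's formula. */
    int leaves = (N / K) * (K - 1);
    printf("MRNet leaves: %d\n", leaves);
    printf("Application processes per leaf: about %d\n",
           (leaves > 0) ? (P + leaves - 1) / leaves : 0);
    return 0;
}
```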

Monitoring Operation with MRNet
- The application collectively invokes TAU_ONLINE_DUMP() to start monitoring operations on the current performance information
- TAU data is accessed and sent through MRNet's communication API via streams and filters
- Filters perform the appropriate aggregation operations on the data
- The TAU frontend is responsible for collecting the data, storing it, and eventually delivering it to a consumer

TAUmonMPI
- Uses MPI-based transport
  - No separate launch mechanisms
  - Parallel gather operations implemented as a binomial heap with staged MPI point-to-point calls (rank 0 serves as the root; see the sketch below)
- Current limitations:
  - The application shares parallel resources with the monitoring transport
  - Monitoring operations may cause performance intrusion
  - No user control of the transport network configuration
- Potential advantages:
  - Easy to use
  - Could be more robust overall
- [Figure: binomial gather tree rooted at rank 0]
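A sketch of a binomial-tree gather built from staged point-to-point calls, in the spirit of the "binomial heap" gather the slide mentions. It is not TAU's exact code: it assumes a power-of-two communicator size and fixed-size per-rank payloads for brevity.

```c
#include <mpi.h>

/* Each rank contributes 'bytes' bytes at the start of 'buf'; rank 0 ends up
 * with all contributions concatenated in rank order. On rank 0 'buf' must be
 * large enough to hold size * bytes. */
void binomial_gather(char *buf, int bytes, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int held = 1;   /* number of rank-contributions currently held locally */
    for (int step = 1; step < size; step <<= 1) {
        if (rank & step) {
            /* Send everything gathered so far to the partner one step "up". */
            MPI_Send(buf, held * bytes, MPI_BYTE, rank - step, 0, comm);
            return;   /* this rank is done after sending */
        } else if (rank + step < size) {
            /* Receive the partner's block and append it after our own data. */
            MPI_Recv(buf + held * bytes, held * bytes, MPI_BYTE,
                     rank + step, 0, comm, MPI_STATUS_IGNORE);
            held += held;
        }
    }
    /* Only rank 0 reaches here, holding all 'size' contributions in 'buf'. */
}
```

A production version would handle non-power-of-two sizes and variable payloads (profile sizes differ per rank), but the staging pattern up the tree is the same.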

TAUmon Experiments: PFLOTRAN
- Predictive modeling of subsurface reactive flows
- Machines
  - ORNL Jaguar and UTK Kraken, Cray XT5
- Processor counts
  - 16,380 cores and 131K cores; 12K (interactive)
- Scaling
- Instrumentation (source, PMPI)
  - Full: 1131 events total, lots of small routines
  - Partial: 1% exclusive + all MPI, 68 events total (44 MPI, 19 PETSc)
  - With and without callpaths
- Measurements (PAPI)
  - Execution time (TOT_CYC)
  - Counters: FP_OPS, TOT_INS, L1_DCA/DCM, L2_TCA/TCM, RES_STL

TAUmonMPI Event Unification (Cray XT5)
- [Figure: TAU unification and merge time]

TAUmonMPI Scaling (PFLOTRAN, Cray XT5)
- [Figure: scaling results, including new histogram timings at 12288 and 24576 cores]

TAUmonMRNet Scaling (PFLOTRAN, Cray XT5)
- [Figure: scaling results]

TAUmonMPI Scaling (PFLOTRAN, BG/P)
- [Figure: scaling results]

TAUmonMRNet: Snapshot (PFLOTRAN, Cray XT5)
- 4104 cores running, with 374 extra cores for the MRNet transport
- Each bar shows the mean profile of an iteration

TAUmonMRNet: Snapshot (PFLOTRAN, Cray XT5)
- Frames (iterations) 12, 17, and 21 of a 12K-core PFLOTRAN execution
- Shifts in the order of events sorted by average value over time

TAUmonMRNet Snapshot (FLASH, Cray XT5)
- Sod 2D, 1,536 Cray XT5 cores
- Over 200 iterations; 15 maximum levels of refinement
- MPI_Alltoall plateaus correspond to AMR refinement

TAUmonMRNet Clustering (FLASH, Cray XT5)
- [Figure: cluster mean profiles; dominant events include MPI_Init, MPI_Alltoall, MPI_Allreduce, COMPRESS_LIST, and DRIVER_COMPUTEDT]

Validating Performance Monitoring Operations
- Build a parallel program that pre-loads parallel profiles
  - Used to quickly validate monitoring operation algorithms
  - Monitoring operation performance can be quickly observed, analyzed, and optimized
  - With real pre-generated profiles, there is no need to pay the repeated cost of running applications to a desired point in time
- Currently developing the TAUmon validation tool

Conclusion
- Scalable performance monitoring will be important
  - Reduce the volume of performance data output
  - Take advantage of parallel analysis
  - Provide online feedback to the application
  - Requires scalable infrastructure and integration
- TAUmon developed to support TAU monitoring
  - Targets two transport infrastructures: MRNet and MPI
  - Demonstrated with scalable applications
  - Prototype shows good analysis efficiency
- Add support for application feedback
- Release of TAUmon with the TAU distribution before SC10

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
  - ASC/NNSA
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
- NSF Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich
- Argonne National Laboratory
- Technical University Dresden
- ParaTools, Inc.
- NVIDIA