TAUmon: Scalable Online Performance Data Analysis in TAU


1 TAUmon: Scalable Online Performance Data Analysis in TAU
Chee Wai Lee, Allen D. Malony, Alan Morris Department of Computer and Information Science Performance Research Laboratory University of Oregon

2 Outline
Motivation
Brief review of prior work
TAUmon design and objectives
Scalable analysis operations
Transports: MRNet, MPI
TAUmon experiments
Perspectives on understanding applications
Experiments and scaling results
Remarks

3 Motivation
Performance problem analysis is increasingly complex
Multi-core, heterogeneous, and extreme-scale computing
Adaptive algorithms and runtime application tuning
Performance dynamics: variability within/between executions
Neo-performance measurement and analysis perspective
Static, offline analysis → dynamic, online analysis
Scalable runtime analysis of parallel performance data
Performance feedback to the application for adaptive control
Integrated performance monitoring (measurement + query)
Co-allocation of additional (tool-specific) system resources
Goal: scalable, integrated parallel performance monitoring

4 Parallel Performance Measurement and Data
Parallel performance tools measure locally and concurrently
Scaling dictates "local" measurements (profile, trace): save data with "local context" (processes or threads)
Done without synchronization or central control
Parallel performance state is globally distributed as a result
Logically part of the application's global data space
Offline: output data at execution end for post-mortem analysis
Online: access to the performance state for runtime analysis
Definition: monitoring is online access to the parallel performance (data) state; it may or may not involve runtime analysis
Here we make the claim that there are clear arguments for having runtime performance information. We also point out that the application's needs come first.

5 Monitoring for Performance Dynamics
Runtime access to parallel performance data
Scalable and lightweight; raises concerns of overhead and intrusion
Support for performance-adaptive, dynamic applications
Alternative 1: extend existing performance measurement
Create own integrated monitoring infrastructure
Disadvantage: maintain own monitoring framework
Alternative 2: couple with other monitoring infrastructure
Leverage scalable middleware from other supported projects
Challenge: measurement system / monitor integration

6 Performance Dynamics: Parallel Profile Snapshots
Profile snapshots are parallel profiles recorded at runtime
Shows performance profile dynamics (all profile types allowed)
(Figure: profiles, profile snapshots, and traces positioned along information versus overhead axes)
A. Morris, W. Spear, A. Malony, and S. Shende, "Observing Performance Dynamics using Parallel Profile Snapshots," European Conference on Parallel Processing (EuroPar), 2008.

7 Parallel Profile Snapshots of FLASH 3.0 (UIC)
Simulation of astrophysical thermonuclear flashes
Snapshots show profile differences since the last snapshot
Captures all events since the beginning, per thread
Mean profile calculated post-mortem
Highlights change in performance per iteration and at checkpointing
(Figure: snapshot profiles labeled "Initialization", "Finalization", and checkpointing phases)
This slide shows a four-processor FLASH run showing how the snapshots differ. Before the main loop begins, I write a snapshot to mark the performance data up to that point; it shows up as "Initialization". At the end of each loop, a snapshot is written. At the end of execution, a final snapshot is written, labeled "Finalization". I'm 99% sure the four main spikes are the checkpointing phases, which involve a lot of I/O. The big purple region on top is "Other"; the display is limited to the top 20 functions plus "Other".

8 FLASH 3.0 Performance Dynamics (Periodic)
(Figure: INTRFC, differential number of calls per iteration, line chart)

9 Prior Performance Monitoring Work
TAUoverSupermon (UO, Los Alamos National Laboratory)
TAUg (UO)
TAUoverMRNet (UO, University of Wisconsin, Madison)
Some uses of online monitoring:
0. Real-time visualization
1. Application performance steering
2. Steering of the performance measurement itself
A. Nataraj, M. Sottile, A. Morris, A. Malony, and S. Shende, "TAUoverSupermon: Low-Overhead Online Parallel Performance Monitoring," EuroPar, 2007.
K. Huck, A. Malony, and A. Morris, "TAUg: Runtime Global Performance Data Access using MPI," EuroPVM/MPI, 2006.
A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller, "A Framework for Scalable, Parallel Performance Monitoring using TAU and MRNet," Concurrency and Computation: Practice and Experience, 22(6):720–735, 2009, special issue on Scalable Tools for High-End Computing.
A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller, "In Search of Sweet-Spots in Parallel Performance Monitoring," Conference on Cluster Computing (Cluster 2008).

10 TAUmon: Design
Design a transport-neutral application monitoring framework
Base on prior / existing work with various transport systems: Supermon, MRNet, MPI
Enable efficient development of monitoring functionality
Objectives:
Scalable access to a running application's performance data, either at the end of the application (before parallel teardown) or while the application is still running
Support for scalable performance data analysis: reduction, statistical evaluation
Feedback (data, control) to the application
Monitoring engineering and performance efficiency issues

11 TAUmon: Architecture
(Figure: TAUmon monitoring infrastructure collecting TAU profiles from the threads of MPI processes 0 through P-1)

12 TAUmon: Current Usage
TAU_ONLINE_DUMP(): a collective operation in the application
Called by all threads / processes (originally to output profiles)
Arguments specify the data analysis operation to perform
Appropriate version of TAU selected for the transport system
TAUmonMRnet: TAUmon using the MRNet infrastructure
TAUmonMPI: TAUmon using the MPI infrastructure
User instruments the application with TAU support for the desired monitoring transport system (temporary)
User submits the instrumented application to the parallel job system; other launch systems must be submitted along with the application to the job scheduler as needed, using different machine-specific job-submission scripts
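A minimal usage sketch of the collective dump call named above, assuming TAU_ONLINE_DUMP() is available from TAU.h when the application is built with a monitoring-enabled TAU configuration; the exact argument list for selecting an analysis operation is not shown on the slide, so none is passed here.

```c
/* Minimal sketch: periodically trigger TAUmon from an MPI application's main loop. */
#include <mpi.h>
#include <TAU.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    for (int iter = 0; iter < 100; iter++) {
        /* ... application time step ... */

        /* Collective call: every process/thread participates, triggering the
         * monitoring transport to aggregate the current performance state. */
        TAU_ONLINE_DUMP();
    }

    MPI_Finalize();
    return 0;
}
```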

13 TAUmon: Parallel Profile Data Analysis
Total parallel profile data size depends on: # events * size of event * # execution threads
Event size depends on # metrics
Example: 200 events * 100 bytes * 64,000 threads = 1.28 GB
Monitoring operations:
Periodic profile data output (à la profile snapshots)
Event unification
Basic statistics: mean, min/max, standard deviation, ...
Advanced statistics: histogram, clustering, ...
Strong motivation to implement the operations in parallel
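A back-of-the-envelope sketch of the data-volume formula above, using the slide's example values (event count, bytes per event, thread count):

```c
#include <stdio.h>

int main(void)
{
    long events_per_thread = 200;    /* profiled events */
    long bytes_per_event   = 100;    /* grows with the number of metrics */
    long threads           = 64000;  /* execution threads */

    double total_gb = (double)events_per_thread * bytes_per_event * threads / 1e9;
    printf("profile data per snapshot: %.2f GB\n", total_gb);  /* prints 1.28 GB */
    return 0;
}
```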

14 Profile Event Unification
TAU creates events for each process individually
Assigns event identifiers locally
The same event can have different identifiers on each process
Analysis requires event identifiers to be unified
Currently done offline: TAU must output full event information from each process
The output format stores event names, leading to redundancy
Inflates the storage requirements (e.g., beyond the 1.28 GB of profile data)
Implement online parallel event unification as a two-phase process
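An illustrative sketch of a two-phase unification, which is an assumption about the process named above rather than TAU's actual implementation: phase 1 gathers each rank's local event-name table to rank 0, which assigns one global id per distinct name; phase 2 broadcasts the unified table so every rank can translate its local event ids. Fixed-width names and an identical per-rank event count keep the MPI calls short.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN   64
#define MAX_EVENTS 4096

/* On return, local_to_global[i] holds the global id of local event i. */
void unify_events(char (*local_names)[NAME_LEN], int n_local,
                  int *local_to_global, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Phase 1: gather all local name tables to rank 0
     * (equal counts assumed only to keep the sketch short). */
    char (*gathered)[NAME_LEN] = NULL;
    if (rank == 0)
        gathered = malloc((size_t)size * n_local * NAME_LEN);
    MPI_Gather(local_names, n_local * NAME_LEN, MPI_CHAR,
               gathered,    n_local * NAME_LEN, MPI_CHAR, 0, comm);

    /* Rank 0 deduplicates the gathered names into a global table. */
    static char global_names[MAX_EVENTS][NAME_LEN];
    int n_global = 0;
    if (rank == 0) {
        for (int i = 0; i < size * n_local; i++) {
            int found = 0;
            for (int g = 0; g < n_global && !found; g++)
                found = (strncmp(gathered[i], global_names[g], NAME_LEN) == 0);
            if (!found)
                strncpy(global_names[n_global++], gathered[i], NAME_LEN);
        }
        free(gathered);
    }

    /* Phase 2: broadcast the unified table; every rank maps local -> global ids. */
    MPI_Bcast(&n_global, 1, MPI_INT, 0, comm);
    MPI_Bcast(global_names, n_global * NAME_LEN, MPI_CHAR, 0, comm);
    for (int i = 0; i < n_local; i++)
        for (int g = 0; g < n_global; g++)
            if (strncmp(local_names[i], global_names[g], NAME_LEN) == 0)
                local_to_global[i] = g;
}
```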

15 Parallel Profile Merging
TAU creates a file for every thread of execution
Profile merging reduces the number of files generated
Profiles from each thread are sent to a root process
The root process concatenates them into a single file
Prerequisite: event unification
Event unification combined with profile merging leads to more compact (reduced) storage
PFLOTRAN examples:
16K cores: 1.5 GB reduced to 300 MB merged
131K cores: 27 GB reduced to 600 MB merged
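An illustrative sketch of the merging step described above (an assumption about the approach, not TAU's code): each rank sends its serialized profile buffer to rank 0 with MPI_Gatherv, and rank 0 writes one concatenated file instead of one file per thread. The output file name is hypothetical.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void merge_profiles(const char *my_profile, int my_len, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *lens = NULL, *displs = NULL;
    char *merged = NULL;
    if (rank == 0) {
        lens   = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
    }

    /* Collect the per-rank buffer sizes, then the buffers themselves. */
    MPI_Gather(&my_len, 1, MPI_INT, lens, 1, MPI_INT, 0, comm);

    int total = 0;
    if (rank == 0) {
        for (int i = 0; i < size; i++) { displs[i] = total; total += lens[i]; }
        merged = malloc(total);
    }
    MPI_Gatherv((void *)my_profile, my_len, MPI_CHAR,
                merged, lens, displs, MPI_CHAR, 0, comm);

    /* Root concatenates everything into a single file. */
    if (rank == 0) {
        FILE *f = fopen("tauprofile.merged", "wb");  /* hypothetical file name */
        fwrite(merged, 1, total, f);
        fclose(f);
        free(merged); free(lens); free(displs);
    }
}
```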

16 Basic Statistics
Mean profile: averaged values for all events and metrics across all threads
Easily created using simple reduction (summation) operations
Can generate other basic statistics in the same way
Parallel statistical reduction of profile events can be very fast
Supports time-series observations
(Figure: significant events by mean value, FLASH Sod 2D, N=1024; a sudden spike in MPI_Allreduce at iteration 100)
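A sketch of the mean-profile reduction described above, assuming each rank holds one value per globally unified event for a single metric; TAU's real reduction covers all metrics and produces min/max/standard deviation in the same pass.

```c
#include <mpi.h>

/* values[n_events]: this rank's exclusive time (or other metric) per event.
 * mean[n_events]: valid on rank 0 after the call. */
void mean_profile(const double *values, double *mean, int n_events, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Sum the per-event values across all ranks onto rank 0... */
    MPI_Reduce((void *)values, mean, n_events, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* ...then divide by the rank count to get the mean profile. Min/max come
     * from MPI_MIN/MPI_MAX reductions; standard deviation can be formed from
     * an additional sum-of-squares reduction. */
    if (rank == 0)
        for (int e = 0; e < n_events; e++)
            mean[e] /= size;
}
```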

17 Histogramming
Determine the distribution of threads per event
Divide the range of values into a number of bins
Determine the number of threads with event values in each bin
Prerequisites: min/max values and number of bins
Implementation (see the sketch below):
Broadcast min/max and # bins to each node
Each node decides which bins to increment based on its own values
Partial bin increments from each node are summed via a reduction tree to the root
(Figure: example per-node bin counts being combined up the reduction tree)
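A sketch of the histogramming step for one event, assuming min/max have already been obtained (e.g., from the basic-statistics reductions) and each rank contributes one value per thread for the chosen event/metric:

```c
#include <mpi.h>
#include <string.h>

void event_histogram(const double *my_values, int n_local_threads,
                     double min, double max, int nbins,
                     int *global_bins /* size nbins, valid on rank 0 */,
                     MPI_Comm comm)
{
    /* Root broadcasts the binning parameters (min, max, # bins). */
    double params[2] = { min, max };
    MPI_Bcast(params, 2, MPI_DOUBLE, 0, comm);
    MPI_Bcast(&nbins, 1, MPI_INT, 0, comm);

    /* Each rank bins its own values locally. */
    int local_bins[256];                          /* nbins <= 256 assumed */
    memset(local_bins, 0, nbins * sizeof(int));
    double width = (params[1] - params[0]) / nbins;
    for (int t = 0; t < n_local_threads; t++) {
        int b = (int)((my_values[t] - params[0]) / width);
        if (b >= nbins) b = nbins - 1;            /* clamp the maximum value */
        local_bins[b]++;
    }

    /* Partial counts are summed to the root (MPI builds the reduction tree). */
    MPI_Reduce(local_bins, global_bins, nbins, MPI_INT, MPI_SUM, 0, comm);
}
```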

18 Histogramming (continued)
Histograms are useful for highlighting changes in the thread distribution of a single event over time
(Figure: histogram of MPI_Allreduce across ranks over time, FLASH Sod 2D, N=1024; y-axis: number of ranks)

19 Basic K-Means Clustering
Discover K equivalence classes of thread behavior
A thread's behavior is defined as the vector of all its event values over a single metric
Differences in behavior are measured by computing the Euclidean distance between these vectors in E-dimensional space, where E is the number of events
(Figure: Euclidean distance over 2 dimensions, e.g. exclusive time of MPI_Allreduce versus exclusive time of foo())

20 K-means Clustering (continued)
Parallel K-means clustering algorithm (Root); a combined sketch follows the node-side algorithm on the next slide
Root-1: Choose initial K centroids (event-value vectors)
Root-2: Broadcast the initial centroids to each Node
Root-3: While not converged:
3a: Receive the vector of changes from each Node
3b: Apply the change vector to the K centroids
3c: If there is no change to the centroids and centroid membership, set converged to true
3d: Otherwise, broadcast the new centroids to each Node
Root-4: Broadcast the convergence notice to each Node

21 K-means Clustering (continued)
Parallel K-means clustering algorithm (Node)
Node-1: While not converged:
1a: Receive the latest K centroid vectors from Root
1b: For each thread t's event vector, determine which centroid it is closest to
1c: If t's closest centroid changes from k to k-prime, subtract t's event vector from k's entry in the change vector and add the same value to k-prime's entry
1d: Send the change vector through the reduction tree to Root
1e: Receive the convergence notification from Root
The algorithm produces K mean profiles, one for each cluster
Clustering reduces data and can discover performance trends
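A simplified sketch of parallel K-means over per-thread event vectors. It uses MPI_Allreduce of per-cluster sums and counts rather than the incremental change-vector protocol on the slides, so it is a functional stand-in under stated assumptions (fixed E and K, a fixed iteration count) rather than the actual TAUmon algorithm.

```c
#include <mpi.h>
#include <float.h>

#define E 8   /* number of events (vector dimension) */
#define K 3   /* number of clusters */

static double dist2(const double *a, const double *b)
{
    double d = 0.0;
    for (int e = 0; e < E; e++) d += (a[e] - b[e]) * (a[e] - b[e]);
    return d;
}

/* vectors[n_local][E]: event vectors of the threads owned by this rank.
 * centroids[K][E]: initial centroids on entry (same on every rank), final on exit. */
void kmeans(double (*vectors)[E], int n_local, double (*centroids)[E],
            int iters, MPI_Comm comm)
{
    for (int it = 0; it < iters; it++) {
        double local_sum[K][E] = {{0}}, global_sum[K][E];
        double local_cnt[K] = {0}, global_cnt[K];

        /* Assign each local thread vector to its nearest centroid. */
        for (int t = 0; t < n_local; t++) {
            int best = 0;
            double bestd = DBL_MAX;
            for (int k = 0; k < K; k++) {
                double d = dist2(vectors[t], centroids[k]);
                if (d < bestd) { bestd = d; best = k; }
            }
            for (int e = 0; e < E; e++) local_sum[best][e] += vectors[t][e];
            local_cnt[best] += 1.0;
        }

        /* Combine assignments across all ranks and recompute the centroids. */
        MPI_Allreduce(local_sum, global_sum, K * E, MPI_DOUBLE, MPI_SUM, comm);
        MPI_Allreduce(local_cnt, global_cnt, K, MPI_DOUBLE, MPI_SUM, comm);
        for (int k = 0; k < K; k++)
            if (global_cnt[k] > 0.0)
                for (int e = 0; e < E; e++)
                    centroids[k][e] = global_sum[k][e] / global_cnt[k];
    }
}
```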

22 TAUmonMRNet (a.k.a. ToM Revisited)
TAU over MRNet (ToM)
Previously working with MRNet 2.1 (Cluster 2008 paper)
1-phase and 3-phase filters
Explored overlay networks with different span-out (nodes)
TAUmon re-engineered for MRNet 3.0 (released last week!)
Re-implements ToM functionality using the new MRNet support
Current implementation uses a pre-release MRNet 3.0 version; testing with the released version

23 MRNet Network Configuration
Scripts are used to set up the MRNet network configuration
Given P (the number of cores for the application), the user chooses an appropriate N (number of tree nodes) and K (fanout) to allocate sufficient computing resources for both the application and MRNet
The number of network leaves can be computed as (N/K)*(K-1)
Probe processes discover and partition computing resources between the application and MRNet
The mrnet_topgen utility writes a topology file given K, N, and a list of processor hosts available exclusively for MRNet
The TAU frontend reads the topology file to create the MRNet tree, then writes a new file to inform the application how it can connect to the leaves of the tree
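A tiny sketch of the leaf-count arithmetic from the slide: for N tree nodes with fanout K, the number of network leaves is (N/K)*(K-1). The values below are purely illustrative.

```c
#include <stdio.h>

static int mrnet_leaves(int n_tree_nodes, int fanout)
{
    return (n_tree_nodes / fanout) * (fanout - 1);
}

int main(void)
{
    int N = 16, K = 4;   /* hypothetical tree configuration */
    printf("N=%d, K=%d -> %d leaves\n", N, K, mrnet_leaves(N, K));  /* 12 */
    return 0;
}
```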

24 Monitoring Operation with MRNet
The application collectively invokes TAU_ONLINE_DUMP() to start monitoring operations using the current performance information
TAU data is accessed and sent through MRNet's communication API via streams and filters
Filters perform the appropriate aggregation operations on the data
The TAU frontend is responsible for collecting the data, storing it, and eventually delivering it to a consumer

25 Monitoring Operation with MRNet

26 TAUmonMPI
Uses an MPI-based transport: no separate launch mechanisms
Parallel gather operations implemented as a binomial heap with staged MPI point-to-point calls, with rank 0 serving as root (see the sketch below)
Current limitations:
Application shares parallel resources with the monitoring transport
Monitoring operations may cause performance intrusion
No user control of the transport network configuration
Potential advantages:
Easy to use
Could be more robust overall
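A sketch of a binomial-tree gather built from staged point-to-point messages, as an illustration of the kind of staged gather described above with rank 0 as root; TAUmonMPI's actual message layout may differ. Each stage doubles the amount of data a surviving rank holds, and a power-of-two communicator size is assumed to keep the sketch short.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Gathers 'len' doubles from every rank into 'all' on rank 0, in rank order. */
void binomial_gather(const double *mine, int len, double *all, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Working buffer: this rank's own block plus blocks received so far. */
    double *buf = malloc((size_t)size * len * sizeof(double));
    memcpy(buf, mine, len * sizeof(double));
    int have = 1;                                 /* blocks currently held */

    for (int stage = 1; stage < size; stage <<= 1) {
        if (rank & stage) {
            /* Senders forward everything they hold to their stage partner. */
            MPI_Send(buf, have * len, MPI_DOUBLE, rank - stage, 0, comm);
            break;                                /* done after sending */
        } else if (rank + stage < size) {
            /* Receivers append the partner's blocks after their own. */
            MPI_Recv(buf + (size_t)have * len, have * len, MPI_DOUBLE,
                     rank + stage, 0, comm, MPI_STATUS_IGNORE);
            have *= 2;
        }
    }

    if (rank == 0)
        memcpy(all, buf, (size_t)size * len * sizeof(double));
    free(buf);
}
```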

27 TAUmon Experiments: PFLOTRAN
Predictive modeling of subsurface reactive flows
Machines: ORNL Jaguar and UTK Kraken, Cray XT5
Processor counts (scaling): 16,380 cores and 131K cores, 12K (interactive)
Instrumentation (source, PMPI):
Full: 1131 events total, lots of small routines
Partial: 1% exclusive + all MPI, 68 events total (44 MPI, 19 PETSc), with and without callpaths
Measurements (PAPI):
Execution time (TOT_CYC)
Counters: FP_OPS, TOT_INS, L1 DCA/DCM, L2 TCA/TCM, RES_STL

28 TAUmonMRnet Event Unification (Cray XT5)
TAU unification and merge time

29 TAUmonMPI Scaling (PFLOTRAN, Cray XT5)

30 TAUmonMRnet Scaling (PFLOTRAN, Cray XT5)

31 TAUmonMPI Scaling (PFLOTRAN, BG/P)

32 TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)
4,104 cores running with 374 extra cores for the MRNet transport
Each bar shows the mean profile of an iteration

33 TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)
Frames (iterations) 12, 17, and 21 of a 12K-core PFLOTRAN execution
Shifts in the order of events, sorted by average value, over time

34 TAUmonMRnet Snapshot (FLASH, Cray XT5)
Sod 2D, 1,536 Cray XT5 cores, over 200 iterations, 15 maximum levels of refinement
MPI_Alltoall plateaus correspond to AMR refinement

35 TAUmonMRnet Clustering (FLASH, Cray XT5)
(Figure: clustering results; labeled events include MPI_Init, MPI_Alltoall, COMPRESS_LIST, MPI_Allreduce, and DRIVER_COMPUTEDT)

36 TAUmonMRnet Timings (PFLOTRAN, Cray XT5)
Only exclusive time is being monitored

XT5 Nodes   Cores    Unification Time (per iteration)   Mean Aggregation Time   Histogram Generation Time   Total Operation Time
      342    4,104                 –                              –                     2.339 s                   2.384 s
      512    6,144                 –                              –                     2.06 s                    2.115 s
      683    8,196                 –                              –                     3.651 s                   4.278 s
    1,024   12,288                 –                              –                     0.8643 s                  0.9676 s
    1,366   16,392                 –                              –                     1.861 s                   3.053 s
    2,048   24,576                 –                              –                     0.6238 s                  0.6921 s

37 Validating Performance Monitoring Operations
Build a parallel program that pre-loads parallel profiles
Used to quickly validate monitoring operation algorithms
Monitoring operation performance can be quickly observed, analyzed, and optimized
With pre-generated real profiles, there is no need to pay the repeated cost of running applications to a desired point in time

38 Conclusion

39 Support Acknowledgements
Department of Energy (DOE) Office of Science ASC/NNSA Department of Defense (DoD) HPC Modernization Office (HPCMO) NSF Software Development for Cyberinfrastructure (SDCI) Research Centre Juelich Argonne National Laboratory Technical University Dresden ParaTools, Inc. NVIDIA

