Multi-Experiment Performance Data Management and Data Mining
Allen D. Malony
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon

Outline of Talk
- Performance problem solving
  - Scalability, productivity, and performance technology
  - Application-specific and autonomic performance tools
- TAU parallel performance system
- Performance data management and data mining
  - Performance Data Management Framework (PerfDMF)
  - PerfExplorer
- Multi-experiment case studies
  - Comparative analysis (PERC tool study)
  - Clustering analysis
- Future work and concluding remarks

Research Motivation
- Tools for performance problem solving
  - Empirical-based performance optimization process
  - Performance technology concerns
[Figure: empirical optimization cycle of performance experimentation, observation, diagnosis, and tuning (hypotheses, properties, characterization), supported by performance technology for instrumentation, measurement, analysis, visualization, experiment management, and performance storage]

Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
  - Process may depend on scale of parallel system
- Standard approaches deliver a lot of data with little value
  - What are the important events and performance metrics?
  - Tied to application structure and computational model
- Process and tools can be more application-aware
  - Tools have poor support for application-specific aspects
- What are the significant issues that will affect the technology used to support the process?
  - Enhance application development and benchmarking
  - New paradigm in performance process and technology

Role of Automation and Knowledge Discovery
- Scale forces the process to become more intelligent
  - Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
  - Build autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
- Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise

TAU Performance System
- Tuning and Analysis Utilities (13+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- University of Oregon, Research Center Jülich, LANL

TAU Performance System Architecture

Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?

Performance Problem Solving Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations
  - use to predict application performance
- High-level performance data spanning dimensions
  - machine, applications, code revisions, data sets
  - examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

Automatic Performance Analysis Tool (Concept)
- PSU: Kathryn Mohror, Karen Karavanic
- UO: Kevin Huck
- LLNL: John May, Brian Miller (CASC)
[Figure: PerfTrack performance database concept]

Performance Data Management Framework

ParaProf Performance Profile Analysis
[Figure: raw profile files from HPMToolkit, MpiP, and TAU are imported into the PerfDMF-managed database, with metadata organized by Application, Experiment, and Trial]
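
To make the Application / Experiment / Trial organization above concrete, here is a minimal Python sketch of how imported profile data could be grouped in that hierarchy; the class and field names are illustrative assumptions, not the actual PerfDMF schema.

# Hedged sketch (not the actual PerfDMF schema): grouping profile data into
# the Application -> Experiment -> Trial hierarchy, with metadata attached at
# the trial level. All field names here are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Trial:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)    # e.g., machine, processor count
    profiles: Dict[str, float] = field(default_factory=dict)  # event name -> exclusive time

@dataclass
class Experiment:
    name: str
    trials: List[Trial] = field(default_factory=list)

@dataclass
class Application:
    name: str
    experiments: List[Experiment] = field(default_factory=list)

# Example: one GYRO trial imported from a raw profile (values made up).
gyro = Application("GYRO", [
    Experiment("B1-std on Cheetah", [
        Trial("16 processors",
              metadata={"machine": "Cheetah", "procs": "16"},
              profiles={"NL": 310.5, "Coll_tr": 42.1}),
    ]),
])
print(gyro.experiments[0].trials[0].profiles["NL"])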

PerfExplorer (K. Huck, UO)
- Performance knowledge discovery framework
  - Use the existing TAU infrastructure
  - TAU instrumentation data, PerfDMF
  - Client-server based system architecture
- Data mining analysis applied to parallel performance data
- Technology integration
  - Relational Database Management Systems (RDBMS)
  - Java API and toolkit
  - R-project / Omegahat statistical analysis
  - Web-based client
  - Jakarta web server and Struts (for a thin web-client)

PerfExplorer Architecture
- Client is a traditional Java application with a Swing GUI
- Server accepts multiple client requests and returns results
- PerfDMF Java API used to access the DBMS via JDBC (see the sketch below)
- Server supports R data mining operations built using RSJava
- Analyses can be scripted, parameterized, and monitored
- Browsing of analysis results via automatic web page creation and thumbnails
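
As a rough illustration of the database-backed access path above, the following Python sketch runs a query over a hypothetical relational layout with application, experiment, trial, and profile tables. The table and column names are assumptions, sqlite3 merely stands in for the JDBC/RDBMS connection, and this is not the PerfDMF Java API.

# Hypothetical sketch: the tables and columns below are illustrative, not the
# real PerfDMF schema; sqlite3 stands in for a JDBC/RDBMS connection.
import sqlite3

def mean_exclusive_time(db_path, app_name, experiment_name, event_name):
    """Average exclusive time for one event across all trials of an experiment."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        SELECT AVG(p.exclusive_time)
        FROM profile p
        JOIN trial t ON p.trial_id = t.id
        JOIN experiment e ON t.experiment_id = e.id
        JOIN application a ON e.application_id = a.id
        WHERE a.name = ? AND e.name = ? AND p.event_name = ?
        """,
        (app_name, experiment_name, event_name),
    )
    (avg_time,) = cur.fetchone()
    conn.close()
    return avg_time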

PERC Tool Requirements and Evaluation
- Performance Evaluation Research Center (PERC)
  - DOE SciDAC
  - Evaluation methods/tools for high-end parallel systems
- PERC tools study (led by ORNL, Pat Worley)
  - In-depth performance analysis of select applications
  - Evaluate performance analysis requirements
  - Test tool functionality and ease of use
- Applications
  - Start with fusion code – GYRO
  - Repeat with other PERC benchmarks
  - Continue with SciDAC codes

GYRO Execution Parameters
- Three benchmark problems
  - B1-std: 16n processors, 500 timesteps
  - B2-cy: 16n processors, 1000 timesteps
  - B3-gtc: 64n processors, 100 timesteps
- Test different methods to evaluate nonlinear terms:
  - Direct method
  - FFT ("nl2" for B1 and B2, "nl1" for B3)
- Task affinity enabled/disabled (p690 only)
- Memory affinity enabled/disabled (p690 only)
- Filesystem location (Cray X1 only)

Primary Evaluation Machines
- Phoenix (ORNL – Cray X1)
  - 512 multi-streaming vector processors
- Ram (ORNL – SGI Altix, 1.5 GHz Itanium2)
  - 256 total processors
- TeraGrid
  - ~7,738 total processors on 15 machines at 9 sites
- Cheetah (ORNL – p690 cluster, 1.3 GHz, HPS)
  - 864 total processors on 27 compute nodes
- Seaborg (NERSC – IBM SP3)
  - 6080 total processors on 380 compute nodes

Communication Regions (Events) of Interest
- Total program is measured, plus specific code regions
- NL: nonlinear advance
- NL_tr*: transposes before/after nonlinear advance
- Coll: collisions
- Coll_tr*: transposes before/after main collision routine
- Lin_RHS: compute right-hand side of the electron and ion GKEs (GyroKinetic (Vlasov) Equations)
- Field: explicit or implicit advance of fields and solution of explicit Maxwell equations
- I/O, extras

Data Collected Thus Far…
- User timer data
  - Self-instrumentation in the GYRO application
  - Outputs aggregate data per N timesteps
    - N = 50 (B1, B3)
    - N = 125 (B2)
- HPM (Hardware Performance Monitor) data
  - IBM platform (p690) only
- MPICL profiling/tracing
  - Cray X1 and IBM p690
- TAU (all platforms, profiling/tracing, in progress)
- Data processed by hand into Excel spreadsheets

PerfExplorer Analysis of Self-Instrumented Data
- PerfExplorer
  - Focus on comparative analysis
  - Apply to PERC tool evaluation study
- Look at user timer data
  - Aggregate data
    - no per-process data
    - process clustering analysis is not applicable
  - Timings output every N timesteps
    - some phase analysis possible
- Goal
  - Recreate manually generated performance reports

Comparative Analysis
- Supported analysis
  - Timesteps per second
  - Relative speedup and efficiency (see the sketch below)
    - For entire application (compare machines, parameters, etc.)
    - For all events (on one machine, one set of parameters)
    - For one event (compare machines, parameters, etc.)
  - Fraction of total runtime for one group of events
  - Runtime breakdown (as a percentage)
- Initial analysis implemented as scalability study
- Future analysis
  - Arbitrary organization
  - Parametric studies
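
The relative speedup and efficiency analyses listed above reduce to simple ratios against a base case. The sketch below shows one way to compute them from total runtimes keyed by processor count; the timings are made up, and the smallest run (here 16 processors, as in the GYRO study) is used as the base case.

# Minimal sketch (not PerfExplorer code): relative speedup and parallel
# efficiency computed from total runtimes keyed by processor count.

def relative_speedup_and_efficiency(runtimes):
    """runtimes: dict mapping processor count -> total runtime in seconds."""
    base_procs = min(runtimes)
    base_time = runtimes[base_procs]
    results = {}
    for procs in sorted(runtimes):
        speedup = base_time / runtimes[procs]        # relative to the base case
        efficiency = speedup / (procs / base_procs)  # ideal speedup = procs / base_procs
        results[procs] = (speedup, efficiency)
    return results

# Example with invented timings for a 16-processor base case:
print(relative_speedup_and_efficiency({16: 1200.0, 32: 640.0, 64: 360.0}))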

PerfExplorer Interface
[Screenshot: select experiments and trials of interest; data organized in an application / experiment / trial structure (arbitrary organization planned for the future); experiment metadata shown]

PerfExplorer Interface
[Screenshot: selecting an analysis]

Timesteps per Second
- Cray X1 is the fastest to solution in all three tests
- FFT (nl2) improves time for B3-gtc only
- TeraGrid faster than p690 for B1-std?
- Plots generated automatically
[Plots: timesteps per second for B1-std, B2-cy, and B3-gtc, including TeraGrid]

Relative Efficiency (B1-std)
- By experiment (B1-std)
  - Total runtime (Cheetah in red)
- By event for one experiment
  - Coll_tr (blue) is significant
- By experiment for one event
  - Shows how Coll_tr behaves for all experiments
[Plots: relative efficiency with a 16-processor base case; Cheetah and Coll_tr highlighted]

Relative Speedup (B2-cy)
- By experiment (B2-cy)
  - Total runtime (X1 in blue)
- By event for one experiment
  - NL_tr (orange) is significant
- By experiment for one event
  - Shows how NL_tr behaves for all experiments

Fraction of Total Runtime (Communication)
- IBM SP3 (cyan) has the highest fraction of total time spent in communication for all three benchmarks
- Cray X1 has the lowest fraction in communication
[Plots: communication fraction for B1-std, B2-cy, and B3-gtc]

Runtime Breakdown on IBM SP3
- Communication grows as a percentage of total runtime as the application scales (colors match across graphs)
- Both Coll_tr (blue) and NL_tr (orange) scale poorly
- I/O (green) scales poorly, but its percentage of total runtime is small

Clustering Analysis
- "Scalable Analysis Techniques for Microprocessor Performance Counter Metrics," Ahn and Vetter, SC2002
  - Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
- Cluster analysis and F-ratio
  - Agglomerative hierarchical method: a dendrogram identified groupings of master and slave threads in sPPM (see the sketch below)
  - K-means clustering and F-ratio: differences between master and slave threads related to communication and management
- Factor analysis
  - Shows highly correlated metrics fall into peer groups
- Combined techniques (applied recursively) lead to observations of application behavior that are hard to identify otherwise
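
As a hedged sketch of the agglomerative hierarchical technique referenced above (not Ahn and Vetter's code), the following Python example clusters a few synthetic per-thread metric vectors and draws the dendrogram that would expose master/worker groupings; all data values are invented.

# Illustrative sketch: agglomerative hierarchical clustering of per-thread
# metric vectors, with a dendrogram used to spot groupings such as master
# vs. worker threads. The data is synthetic.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# One row per thread, one column per metric (e.g., normalized PAPI counts).
thread_metrics = np.array([
    [0.90, 0.05, 0.05],   # thread 0: master-like event mix
    [0.88, 0.07, 0.05],   # thread 1
    [0.30, 0.40, 0.30],   # thread 2: worker-like event mix
    [0.32, 0.38, 0.30],   # thread 3
])

# Average-linkage agglomerative clustering; long branches in the dendrogram
# separate the two natural groups.
Z = linkage(thread_metrics, method="average")
dendrogram(Z, labels=[f"t{i}" for i in range(len(thread_metrics))])
plt.title("Thread similarity dendrogram (synthetic data)")
plt.show()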

Similarity Analysis
- Can we recreate Ahn and Vetter's results?
- Apply techniques from phase analysis (Sherwood)
  - Threads of execution can be compared for similarity
  - Threads with abnormal behavior show up as less similar
- Each thread is represented as a vector V of dimension n
  - n is the number of functions in the application: V = [f_1, f_2, ..., f_n] (represents the event mix)
  - Each value is the percentage of time spent in that function, normalized to the range 0.0 to 1.0
- Distance calculated between vectors U and V (see the sketch below):
  ManhattanDistance(U, V) = Σ_{i=1..n} |u_i - v_i|
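
A minimal sketch of this similarity measure, assuming nothing beyond the definition on the slide: each thread's per-function times are normalized to fractions of total time and compared with the Manhattan (L1) distance. The thread profiles below are hypothetical.

def manhattan_distance(u, v):
    """Sum of absolute per-function differences between two event-mix vectors."""
    assert len(u) == len(v)
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

def normalize(times):
    """Convert per-function times into fractions of total time (0.0 to 1.0)."""
    total = sum(times)
    return [t / total for t in times]

thread_a = normalize([120.0, 30.0, 50.0])   # hypothetical per-function times
thread_b = normalize([115.0, 35.0, 50.0])
thread_c = normalize([40.0, 100.0, 60.0])   # an "abnormal" thread

print(manhattan_distance(thread_a, thread_b))  # small: similar behavior
print(manhattan_distance(thread_a, thread_c))  # larger: dissimilar behavior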

sPPM on Blue Horizon (64x4, OpenMP+MPI)
[Figure: TAU profiles of 10 events, managed by PerfDMF, shown for threads 32-47]

sPPM on MCR (total instructions, 16x2)
[Figure: TAU/PerfDMF data for 120 events; master threads (even ranks) vs. worker threads (odd ranks)]

sPPM on MCR (PAPI_FP_INS, 16x2)
[Figure: TAU profiles via PerfDMF; master vs. worker threads show higher vs. lower floating-point instruction counts; same result as Ahn and Vetter]

sPPM on Frost (PAPI_FP_INS, 256 threads)
- A view of fewer than half of the threads of execution fits on the screen at one time
- Three groups are obvious:
  - Lower-ranking threads
  - One unique thread
  - Higher-ranking threads (3% more floating-point instructions)
- Finding subtle differences is difficult with this view

sPPM on Frost (PAPI_FP_INS, 256 threads)
- Dendrogram shows 5 natural clusters:
  - Unique thread
  - High-ranking master threads
  - Low-ranking master threads
  - High-ranking worker threads
  - Low-ranking worker threads
[Figure workflow callouts: TAU profiles, PerfDMF, direct access from R, R routine applied across all threads]

sPPM on MCR (PAPI_FP_INS, 16x2 threads)
[Figure: clusters of master threads and slave threads]

sPPM on Frost (PAPI_FP_INS, 256 threads)
- After K-means clustering into 5 clusters (see the sketch below)
- Similar clusters are formed (seeded with group means)
- Each cluster's performance characteristics analyzed
- Dimensionality reduction (256 threads to 5 clusters!)
[Chart legend: SPPM, INTERF, DIFUZE, DINTERF, Barrier [OpenMP:runhyd3.F]]
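
The sketch below illustrates the kind of K-means reduction described above, clustering 256 synthetic event-mix vectors into 5 clusters whose centroids can then be analyzed in place of individual threads. It is not the PerfExplorer implementation, and it uses k-means++ initialization rather than seeding with group means.

# Illustrative sketch: k-means clustering of per-thread event-mix vectors into
# 5 clusters, reducing hundreds of threads to a handful of representative
# behaviors. Synthetic data stands in for real PAPI_FP_INS profiles.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# 256 threads x 5 events: three behavioral groups plus noise (made up).
group_means = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],   # e.g., master-like threads
    [0.30, 0.30, 0.20, 0.10, 0.10],   # worker-like threads
    [0.10, 0.10, 0.10, 0.35, 0.35],   # one unusual behavior
])
assignments = rng.integers(0, 3, size=256)
threads = group_means[assignments] + rng.normal(0.0, 0.02, size=(256, 5))

# Cluster into 5 groups; the centroids give the mean event mix per cluster,
# which is what gets analyzed in place of the individual threads.
centroids, labels = kmeans2(threads, k=5, minit="++", seed=1)
print("cluster sizes:", np.bincount(labels, minlength=5))
print("cluster centroids (mean event mix per cluster):")
print(np.round(centroids, 3))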

Current and Future Work
- ParaProf
  - Developing 3D performance displays
- PerfDMF
  - Adding new database backends and distributed support
  - Building support for user-created tables
- PerfExplorer
  - Extending comparative and clustering analysis
  - Adding new data mining capabilities
  - Building in scripting support
- Performance regression testing tool (PerfRegress)
- Integration into the Eclipse Parallel Tools Platform (PTP)

Concluding Discussion
- Performance tools must be used effectively
  - More intelligent performance systems for productive use
  - Evolve to application-specific performance technology
- Deal with scale by "full range" performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - More automatically controlled and efficient use
- Develop next-generation tools and deliver them to the community

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- NSF
  - High-End Computing Grant
- Research Centre Jülich
  - John von Neumann Institute
  - Dr. Bernd Mohr
- Los Alamos National Laboratory