1
Multi-Experiment Performance Data Management and Data Mining
Allen D. Malony (malony@cs.uoregon.edu)
Department of Computer and Information Science
Performance Research Laboratory, University of Oregon
2
Outline of Talk
- Performance problem solving
  - Scalability, productivity, and performance technology
  - Application-specific and autonomic performance tools
- TAU parallel performance system
- Performance data management and data mining
  - Performance Data Management Framework (PerfDMF)
  - PerfExplorer
- Multi-experiment case studies
  - Comparative analysis (PERC tool study)
  - Clustering analysis
- Future work and concluding remarks
3
Research Motivation
- Tools for performance problem solving
  - Empirical-based performance optimization process
  - Performance technology concerns
[Diagram: performance optimization cycle of Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation (hypotheses, properties, characterization), supported by performance technology for instrumentation, measurement, analysis, visualization, experiment management, and performance storage]
4
Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
  - Process may depend on the scale of the parallel system
  - Standard approaches deliver a lot of data with little value
- What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Process and tools can be more application-aware
  - Tools have poor support for application-specific aspects
- What are the significant issues that will affect the technology used to support the process?
  - Enhance application development and benchmarking
  - New paradigm in performance process and technology
5
Role of Automation and Knowledge Discovery
- Scale forces the process to become more intelligent
- Even with intelligent, application-specific tools, deciding what to analyze is difficult and can become intractable
- More automation and knowledge-based decision making
  - Build autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise
6
TAU Performance System
- Tuning and Analysis Utilities (13+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- University of Oregon, Research Center Jülich, LANL
7
TAU Performance System Architecture
8
TAU Performance System Architecture
9
Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?
10
Performance Problem Solving Goals
- Answer questions at multiple levels of interest
  - Data from low-level measurements and simulations used to predict application performance
  - High-level performance data spanning dimensions (machine, application, code revision, data set) used to examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
11
Automatic Performance Analysis Tool (Concept)
- PSU: Kathryn Mohror, Karen Karavanic
- UO: Kevin Huck
- LLNL: John May, Brian Miller (CASC)
- PerfTrack performance database
12
Performance Data Management Framework
13
ParaProf Performance Profile Analysis
[Diagram: raw profile files from HPMToolkit, mpiP, and TAU are imported into the PerfDMF-managed database, with metadata organized as Application → Experiment → Trial]
14
PerfExplorer (K. Huck, UO)
- Performance knowledge discovery framework
  - Uses the existing TAU infrastructure: TAU instrumentation data, PerfDMF
  - Client-server based system architecture
  - Data mining analysis applied to parallel performance data
- Technology integration
  - Relational Database Management Systems (RDBMS)
  - Java API and toolkit
  - R-project / Omegahat statistical analysis
  - Web-based client: Jakarta web server and Struts (for a thin web client)
15
PerfExplorer Architecture
- Client is a traditional Java application with a GUI (Swing)
- Server accepts multiple client requests and returns results
- PerfDMF Java API used to access the DBMS via JDBC
- Server supports R data mining operations built using RSJava
- Analyses can be scripted, parameterized, and monitored
- Browsing of analysis results via automatic web page creation and thumbnails
16
PERC Tool Requirements and Evaluation
- Performance Evaluation Research Center (PERC)
  - DOE SciDAC
  - Evaluation methods/tools for high-end parallel systems
- PERC tools study (led by ORNL, Pat Worley)
  - In-depth performance analysis of selected applications
  - Evaluate performance analysis requirements
  - Test tool functionality and ease of use
- Applications
  - Start with fusion code: GYRO
  - Repeat with other PERC benchmarks
  - Continue with SciDAC codes
17
GYRO Execution Parameters
- Three benchmark problems
  - B1-std: 16n processors, 500 timesteps
  - B2-cy: 16n processors, 1000 timesteps
  - B3-gtc: 64n processors, 100 timesteps
- Test different methods to evaluate nonlinear terms: direct method vs. FFT ("nl2" for B1 and B2, "nl1" for B3)
- Task affinity enabled/disabled (p690 only)
- Memory affinity enabled/disabled (p690 only)
- Filesystem location (Cray X1 only)
18
Primary Evaluation Machines
- Phoenix (ORNL, Cray X1): 512 multi-streaming vector processors
- Ram (ORNL, SGI Altix, 1.5 GHz Itanium2): 256 total processors
- TeraGrid: ~7,738 total processors on 15 machines at 9 sites
- Cheetah (ORNL, p690 cluster, 1.3 GHz, HPS): 864 total processors on 27 compute nodes
- Seaborg (NERSC, IBM SP3): 6080 total processors on 380 compute nodes
19
Communication Regions (Events) of Interest
- Total program is measured, plus specific code regions:
  - NL: nonlinear advance
  - NL_tr*: transposes before/after nonlinear advance
  - Coll: collisions
  - Coll_tr*: transposes before/after main collision routine
  - Lin_RHS: compute right-hand side of the electron and ion GKEs (gyrokinetic (Vlasov) equations)
  - Field: explicit or implicit advance of fields and solution of explicit Maxwell equations
  - I/O, extras
20
Data Collected Thus Far
- User timer data
  - Self-instrumentation in the GYRO application
  - Outputs aggregate data per N timesteps: N = 50 (B1, B3), N = 125 (B2)
- HPM (Hardware Performance Monitor) data: IBM platform (p690) only
- MPICL profiling/tracing: Cray X1 and IBM p690
- TAU: all platforms, profiling/tracing, in progress
- Data processed by hand into Excel spreadsheets
21
PerfExplorer Analysis of Self-Instrumented Data
- PerfExplorer
  - Focus on comparative analysis
  - Apply to PERC tool evaluation study
- Look at user timer data
  - Aggregate data only; no per-process data, so per-process clustering analysis is not applicable
  - Timings output every N timesteps, so some phase analysis is possible
- Goal: recreate manually generated performance reports
22
Comparative Analysis
- Supported analysis
  - Timesteps per second
  - Relative speedup and efficiency (see the definitions below)
    - For entire application (compare machines, parameters, etc.)
    - For all events (on one machine, one set of parameters)
    - For one event (compare machines, parameters, etc.)
  - Fraction of total runtime for one group of events
  - Runtime breakdown (as a percentage)
- Initial analysis implemented as a scalability study
- Future analysis
  - Arbitrary organization
  - Parametric studies
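For reference, relative speedup and efficiency can be stated in the usual way with respect to a base processor count p_0 (for example, the 16-processor base case noted for B1-std). This is the standard textbook formulation, given here as an assumption about what the scalability plots compute rather than a statement of PerfExplorer internals:

```latex
% Assumed definitions: T(p) is the measured total runtime on p processors,
% and p_0 is the processor count of the base run.
S(p) = \frac{T(p_0)}{T(p)},
\qquad
E(p) = \frac{S(p)}{p / p_0} = \frac{p_0 \, T(p_0)}{p \, T(p)}
```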
23
PerfExplorer Interface
- Select experiments and trials of interest
- Data organized in an application / experiment / trial structure (will allow arbitrary organization in the future)
- Experiment metadata shown
24
PerfExplorer Interface
- Select analysis
25
Timesteps per Second
- Cray X1 is the fastest to solution in all 3 tests
- FFT (nl2) improves time for B3-gtc only
- TeraGrid faster than p690 for B1-std?
- Plots generated automatically
[Plots: timesteps per second for B1-std, B2-cy, and B3-gtc across machines, including TeraGrid]
26
Relative Efficiency (B1-std)
- By experiment (B1-std): total runtime (Cheetah in red)
- By event for one experiment (Cheetah): Coll_tr (blue) is significant
- By experiment for one event: shows how Coll_tr behaves for all experiments
- 16-processor base case
27
Relative Speedup (B2-cy)
- By experiment (B2-cy): total runtime (X1 in blue)
- By event for one experiment: NL_tr (orange) is significant
- By experiment for one event: shows how NL_tr behaves for all experiments
28
Fraction of Total Runtime (Communication)
- IBM SP3 (cyan) has the highest fraction of total time spent in communication for all three benchmarks
- Cray X1 has the lowest fraction in communication
[Plots: communication fraction of total runtime for B1-std, B2-cy, and B3-gtc]
29
Runtime Breakdown on IBM SP3
- Communication grows as a percentage of total runtime as the application scales (colors match across graphs)
- Both Coll_tr (blue) and NL_tr (orange) scale poorly
- I/O (green) scales poorly, but its percentage of total runtime is small
30
Clustering Analysis
- "Scalable Analysis Techniques for Microprocessor Performance Counter Metrics," Ahn and Vetter, SC2002
- Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
- Cluster analysis and F-ratio
  - Agglomerative hierarchical method: dendrogram identified groupings of master and slave threads in sPPM (see the sketch below)
  - K-means clustering and F-ratio: differences between master and slave threads related to communication and management
- Factor analysis: highly correlated metrics fall into peer groups
- Combining these techniques (recursively) leads to observations of application behavior that are hard to identify otherwise
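To make the dendrogram step concrete, here is a minimal sketch of agglomerative hierarchical clustering over per-thread counter vectors. It uses SciPy rather than the R routines PerfExplorer actually drives, and the thread_vectors data is hypothetical (random placeholders standing in for per-function PAPI fractions):

```python
# Sketch: agglomerative hierarchical clustering of per-thread metric vectors.
# Assumes NumPy/SciPy; PerfExplorer itself calls equivalent routines in R.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical data: one row per thread, one column per instrumented event.
rng = np.random.default_rng(0)
thread_vectors = rng.random((256, 10))

# Build the hierarchy (average linkage over Manhattan/"cityblock" distances).
Z = linkage(thread_vectors, method="average", metric="cityblock")

# The dendrogram exposes natural groupings (e.g. master vs. worker threads).
dendrogram(Z, no_plot=True)  # set no_plot=False in an interactive session

# Cut the tree into a fixed number of clusters, e.g. the 5 groups seen for sPPM on Frost.
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes
```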
31
Similarity Analysis
- Can we recreate Ahn and Vetter's results?
- Apply techniques from phase analysis (Sherwood)
  - Threads of execution can be compared for similarity
  - Threads with abnormal behavior show up as less similar
- Each thread is represented as a vector V of dimension n, where n is the number of functions in the application
  - V = [f_1, f_2, ..., f_n] (represents the event mix)
  - Each value is the fraction of time spent in that function, normalized to the range 0.0 to 1.0
- Distance calculated between vectors U and V (see the sketch below):
  ManhattanDistance(U, V) = Σ_{i=1}^{n} |u_i − v_i|
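A minimal sketch of this similarity computation, assuming hypothetical per-thread timing data: each thread's per-function times are normalized to fractions, and pairs of threads are compared with the Manhattan distance defined above.

```python
# Sketch: per-thread similarity via Manhattan distance on normalized time fractions.
import numpy as np

def event_mix(times_per_function):
    """Normalize per-function times to fractions in [0.0, 1.0]."""
    v = np.asarray(times_per_function, dtype=float)
    return v / v.sum()

def manhattan_distance(u, v):
    """ManhattanDistance(U, V) = sum_i |u_i - v_i|."""
    return np.abs(u - v).sum()

# Hypothetical example: two threads, times for n = 4 instrumented functions.
thread_a = event_mix([12.0, 3.0, 1.0, 4.0])
thread_b = event_mix([11.5, 3.2, 1.1, 4.2])
print(manhattan_distance(thread_a, thread_b))  # small value => similar behavior
```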
32
sPPM on Blue Horizon (64x4, OpenMP+MPI)
- TAU profiles, 10 events, via PerfDMF
- Threads 32-47 shown
33
sPPM on MCR (total instructions, 16x2)
- TAU/PerfDMF, 120 events
- Master threads (even ranks), worker threads (odd ranks)
34
sPPM on MCR (PAPI_FP_INS, 16x2)
- TAU profiles via PerfDMF
- Master/worker threads separate into higher/lower counts
- Same result as Ahn/Vetter
35
sPPM on Frost (PAPI_FP_INS, 256 threads)
- Fewer than half of the threads of execution fit on the screen at one time
- Three groups are obvious:
  - Lower-ranking threads
  - One unique thread
  - Higher-ranking threads (3% more FP instructions)
- Finding subtle differences is difficult with this view
36
sPPM on Frost (PAPI_FP_INS, 256 threads)
- Dendrogram shows 5 natural clusters:
  - Unique thread
  - High-ranking master threads
  - Low-ranking master threads
  - High-ranking worker threads
  - Low-ranking worker threads
[Workflow: TAU profiles → PerfDMF → direct access from R → R clustering routine → thread dendrogram]
37
sPPM on MCR (PAPI_FP_INS, 16x2 threads)
[Plot: masters vs. slaves]
38
sPPM on Frost (PAPI_FP_INS, 256 threads)
- After K-means clustering into 5 clusters
- Similar clusters are formed (seeded with group means; see the sketch below)
- Each cluster's performance characteristics analyzed
- Dimensionality reduction (256 threads to 5 clusters!)
[Chart: per-cluster event breakdown over SPPM, INTERF, DIFUZE, DINTERF, and Barrier [OpenMP:runhyd3.F]]
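A minimal K-means sketch under the same assumptions as the earlier clustering sketch (hypothetical thread_vectors, scikit-learn instead of PerfExplorer's R backend). The centroids are seeded with the means of previously identified groups, mirroring the "seed with group means" note above:

```python
# Sketch: K-means clustering of 256 thread vectors into 5 clusters,
# seeding the centroids with the means of previously identified groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
thread_vectors = rng.random((256, 10))   # hypothetical per-thread metric vectors

# Hypothetical prior grouping (e.g. from the hierarchical clustering dendrogram).
prior_labels = rng.integers(0, 5, size=256)
seeds = np.vstack([thread_vectors[prior_labels == k].mean(axis=0) for k in range(5)])

km = KMeans(n_clusters=5, init=seeds, n_init=1).fit(thread_vectors)

# Dimensionality reduction: 256 threads summarized by 5 cluster centroids.
for k, centroid in enumerate(km.cluster_centers_):
    members = np.flatnonzero(km.labels_ == k)
    print(f"cluster {k}: {members.size} threads, mean vector {centroid.round(3)}")
```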
39
Current and Future Work
- ParaProf: developing 3D performance displays
- PerfDMF
  - Adding new database backends and distributed support
  - Building support for user-created tables
- PerfExplorer
  - Extending comparative and clustering analysis
  - Adding new data mining capabilities
  - Building in scripting support
- Performance regression testing tool (PerfRegress)
- Integrate into the Eclipse Parallel Tools Project (PTP)
40
Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems for productive use
  - Evolve to application-specific performance technology
  - Deal with scale by "full range" performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - They should be more automatically controlled and more efficiently used
- Develop next-generation tools and deliver them to the community
41
Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- NSF High-End Computing Grant
- Research Centre Juelich
  - John von Neumann Institute
  - Dr. Bernd Mohr
- Los Alamos National Laboratory