Knowledge Support for Mining Parallel Performance Data
Allen D. Malony, Kevin Huck
Department of Computer and Information Science
Performance Research Laboratory, University of Oregon
APART, November 11

Outline
- Why mine parallel performance data?
- Our first attempt
  - PerfDMF
  - PerfExplorer
- How did we do?
- Why knowledge-driven data mining?
- PerfExplorer v2
  - Analysis process automation
  - Metadata encoding and incorporation
  - Inference engine
  - Object persistence and provenance
- Analysis examples

Motivation for Performance Data Mining
- High-end parallel applications and systems evolution
  - More sophisticated, integrated, heterogeneous operation
  - Higher levels of abstraction
  - Larger scales of execution
- Evolution trends change the performance landscape
  - Parallel performance data becomes more complex
  - Multivariate, higher dimensionality, heterogeneous
  - Greater scale and larger data size
  - Standard analysis techniques are overwhelmed
- Need data management and analysis automation
  - Provides a foundation for performance analytics

Performance Data Mining Objectives
- Conduct parallel performance analysis in a systematic, collaborative, and reusable manner
  - Manage performance data and complexity
  - Discover performance relationships and properties
  - Automate the performance investigation process
  - Multi-experiment performance analysis
- Large-scale performance data reduction
  - Summarize characteristics of large processor runs
- Implement an extensible analysis framework
  - Abstraction/automation of data mining operations
  - Interface to existing analysis and data mining tools

Performance Data Management (PerfDMF)
K. Huck, A. Malony, R. Bell, A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," ICPP 2005.
(Architecture figure: profile data from gprof, cube, mpiP, O|SS, psrun, HPMToolkit, and other formats is imported into PerfDMF.)

Analysis Framework (PerfExplorer)
- Leverage existing TAU infrastructure
  - Focus on parallel profiles
  - Build on PerfDMF
- Support large-scale performance analysis
  - Multiple experiments
  - Parametric studies
- Apply data mining operations
  - Comparative, clustering, correlation, dimension reduction, ...
- Interface to existing tools (Weka, R)
- Abstraction/automation

Performance Data Mining (PerfExplorer)
K. Huck and A. Malony, "PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing," SC 2005.

Relative Comparisons
Data: GYRO on various architectures
- Total execution time
- Timesteps per second
- Relative efficiency, and relative efficiency per event (reference formulas below)
- Relative speedup, and relative speedup per event
- Group fraction of total
- Runtime breakdown
- Correlate events with total runtime
- Relative efficiency per phase
- Relative speedup per phase
- Distribution visualizations
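For reference, a minimal sketch of the standard definitions behind the relative-speedup and relative-efficiency charts, where the baseline is the smallest run in the study rather than a serial run (illustrative Jython, not PerfExplorer internals):

def relative_speedup(t_base, t_p):
    # speedup of a p-process run relative to the smallest (baseline) run
    return t_base / t_p

def relative_efficiency(t_base, p_base, t_p, p):
    # efficiency relative to the baseline: 1.0 means perfect scaling
    return (t_base * p_base) / (t_p * p)

# example: baseline of 64 processes at 100s, comparison of 128 processes at 55s
print relative_speedup(100.0, 55.0)              # ~1.82
print relative_efficiency(100.0, 64, 55.0, 128)  # ~0.91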

Cluster Analysis
Data: sPPM on Frost (LLNL), 256 threads
(Figure: cluster counts, PCA scatterplot, virtual topology view, and min/avg/max cluster profiles, computed from PerfDMF databases.)

Correlation Analysis
Data: FLASH on BG/L (LLNL), 64 nodes
- Strong negative linear correlation between CALC_CUT_BLOCK_CONTRIBUTIONS and MPI_Barrier (see the sketch below)
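The correlation shown here is an ordinary linear (Pearson) correlation between two events' per-process measurements; a minimal sketch of the computation (illustrative only, not PerfExplorer's Correlation component):

def pearson(xs, ys):
    # linear correlation between two per-process measurement vectors
    n = len(xs)
    mx = float(sum(xs)) / n
    my = float(sum(ys)) / n
    cov = sum([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    vx = sum([(x - mx) ** 2 for x in xs])
    vy = sum([(y - my) ** 2 for y in ys])
    return cov / ((vx * vy) ** 0.5)

# e.g. exclusive times per process: as one event's time rises, the other's
# falls, giving a value near -1 (a strong negative linear correlation)
print pearson([1.0, 2.0, 3.0, 4.0], [8.1, 6.2, 3.9, 2.0])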

4-D Visualization
Data: FLASH on BG/L (LLNL), 1024 nodes
- Four "significant" events are selected
- Clusters and correlations are visible

PerfExplorer Critique (Describe vs. Explain)
- Specific parametric study support (not general)
- No way to capture the analysis processes
  - No analysis history: how were these results generated?
- PerfExplorer just redescribed the performance results
- PerfExplorer should explain performance phenomena
  - What are the causes for the performance observed?
  - What are the factors and how do they interrelate?
  - Performance analytics, forensics, and decision support
- Automated analysis needs good, informed feedback
  - Iterative tuning, performance regression testing
  - Performance model generation requires interpretation

How to Explain Behavior? Add Knowledge!
- Offline parallel performance tools should not have to treat the application and system as a "black box"
- Need to add knowledge to do more intelligent things
- Where does it come from?
  - Experiment context
  - Application-specific information
  - System-specific performance
  - General performance expertise
- We need better methods and tools for
  - Integrating meta-information
  - Knowledge-based performance problem solving

Performance Knowledge: Metadata and Knowledge Role in Analysis
(Figure: source code, build environment, run environment, and execution produce the performance result. "You have to capture these... to understand this." Application knowledge, machine knowledge, and known performance problems provide the context.)

Example: Sweep3D Domain Decomposition
Data: Sweep3D on a Linux cluster, 16 processes
- Wavefront evaluation with a recursion dependence in all 3 grid directions
- Center cells: 4 neighbors
- Edge cells: 3 neighbors
- Corner cells: 2 neighbors
- Communication is affected (see the sketch below)
(Figure: a 4x4 decomposition example labeling center, edge, and corner cells by neighbor count.)
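A tiny check of these neighbor counts on a P x Q logical process grid (illustrative only):

def neighbors(i, j, P, Q):
    # count the north/south/east/west neighbors of cell (i, j) in a P x Q grid
    count = 0
    for di, dj in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        if 0 <= i + di < P and 0 <= j + dj < Q:
            count = count + 1
    return count

# the 4x4 example from the slide
print neighbors(0, 0, 4, 4)   # corner cell: 2 neighbors
print neighbors(0, 1, 4, 4)   # edge cell:   3 neighbors
print neighbors(1, 1, 4, 4)   # center cell: 4 neighbors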

PerfExplorer v2 – Requirements and Features
- Component-based analysis process
  - Analysis operations implemented as modules
  - Linked together in analysis processes and workflows
- Scripting
  - Provides process/workflow development and automation
- Metadata input, management, and access
- Inference engine
  - Reasoning about causes of performance phenomena
  - Analysis knowledge captured in expert rules
- Persistence of intermediate results
- Provenance
  - Provides a historical record of analysis results

PerfExplorer v2 – Design
(Architecture diagram; the new components are highlighted.)

Component Interaction
(Diagram of the interactions among the components.)

Analysis Components for Scripting
- Analysis components implement data mining operations
- Support easy-to-use interfaces (Java)

Basic statistics            | Extract events  | Top X events
Copy                        | Extract metrics | Top X percent events
Correlation                 | Extract phases  | ANOVA
Correlation with metadata   | Extract rank    | Linear regression*
Derive metrics              | k-means         | Non-linear regression*
Difference                  | Merge trials    | Backward elimination*
Extract callpath events     | PCA             | Correlation elimination*
Extract non-callpath events | Scalability     |

(* future development)

Embedded Scripting
- Jython (a.k.a. JPython) scripting provides API access to the Java analysis components
- Makes new analyses and processes easier to program
- Allows for repeatable analysis processing
- Provides for automation
  - Multiple datasets
  - Tuning iteration
- Supports workflow creation
- Could use other scripting languages (JRuby, Jacl, ...)

Example script:

print "JPython test script start"

# create a rulebase for processing
ruleHarness = RuleHarness.useGlobalRules("rules/GeneralRules.drl")
ruleHarness.addRules("rules/ApplicationRules.drl")
ruleHarness.addRules("rules/MachineRules.drl")

# load the trial and get the metadata
Utilities.setSession("apart")
trial = Utilities.getTrial("sweep3d", "jaguar", "16")
trialResult = TrialResult(trial)
trialMetadata = TrialThreadMetadata(trial)

# extract the top 5 events
getTop5 = TopXEvents(trial, trial.getTimeMetric(), AbstractResult.EXCLUSIVE, 5)
top5 = getTop5.processData().get(0)

# correlate the event data with metadata
correlator = CorrelateEventsWithMetadata(top5, trialMetadata)
output = correlator.processData().get(0)
RuleHarness.getInstance().assertObject(output)

# process rules and output result
RuleHarness.getInstance().processRules()

print "JPython test script end"

Loading Metadata into PerfDMF
- Three ways to incorporate metadata:
  - Measured hardware/system information (TAU, PERI-DB)
    - CPU speed, memory in GB, MPI node IDs, ...
  - Application instrumentation (application-specific)
    - TAU_METADATA() is used to insert any name/value pair
    - Application parameters, input data, domain decomposition
  - PerfDMF data management tools can incorporate an XML file of additional metadata
    - Compiler flags, submission scripts, input files, ...
- Metadata can be imported from / exported to PERI-DB
  - PERI SciDAC project (UTK, NERSC, UO, PSU, TAMU)
  - Performance data and metadata integration

Metadata into PerfDMF
(Figure: performance measurements and auto-collected metadata arrive with the profile data; user-specified metadata, such as build, runtime, submission, and other metadata, is supplied through a metadata.xml file. All of it is stored in PerfDMF.)

Metadata in PerfExplorer
- GTC on 1024 processors of Jaguar (Cray XT3/4)

Inference Engine
- Metadata and analysis results are asserted as facts
  - Examples: number of processors, an input parameter, a derived metric, a speedup measurement
- Analysis rules with encoded "expert knowledge" of performance process the assertions
  - Example: "When the processor count increases by a factor of X, runtime should reduce by a factor of Y" (expectation)
  - Example: "When an event's cache hit ratio is less than the overall ratio, alert the user to the event" (criticality)
- Processed rules can assert new facts, which can fire new rules; this provides a declarative programming environment (a toy sketch of this cycle follows)
- JBoss Rules engine for Java
  - Implements the efficient Rete algorithm
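A toy illustration of that assert-fire-assert cycle (plain Jython for exposition only; the real engine is JBoss Rules with the Rete algorithm):

# facts asserted from metadata; rules fire until nothing new can be derived
facts = {"procs": (64, 128)}

def expectation_rule(f):
    # "when the processor count doubles, expect the runtime ratio to halve"
    if "procs" in f and "expected_ratio" not in f:
        base, comp = f["procs"]
        f["expected_ratio"] = float(base) / comp    # 64/128 -> 0.5
        return True                                 # rule fired, facts changed
    return False

rules = [expectation_rule]
fired = True
while fired:
    # keep processing until no rule asserts a new fact
    fired = False
    for rule in rules:
        fired = rule(facts) or fired

print facts["expected_ratio"]   # 0.5, as in the GTC scalability output later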

Metadata and Inference Rules

Category           | Metadata examples                                               | Possible rules
Machines           | processor speed/type, memory size, number of cores              | CPU A faster than CPU B
Components         | MPI implementation, linear algebra library, runtime components  | component A faster than B
Input              | problem size, input data, problem decomposition                 | smaller problem means faster execution
Algorithms         | FFT vs. DFT, sparse matrix vs. dense matrix                     | algorithm A faster than B for problem size > X
Configurations     | number of processors, runtime parameters, number of iterations  | more processors means faster execution; domain decomposition affects execution
Compiler           | compiler choice, options, versions, code transformations        | execution time: -O0 > -O2 > -O3
Code relationships | call order, send-receive partners, concurrency, functionality   | code region has expected concurrency of X
Code changes       | code change between revisions                                   | newer code expected to be no slower

Inference Rules with Application Metadata
- JBoss Rules has a Java-like syntax
- In the "when" clause, conditions are tested and variables are bound
- If the conditions are met, the "then" block executes, with full access to public Java objects
- removeFact() de-asserts (removes) the fact

rule "Differences in Particles Per Cell"
when
    // there exists a difference operation between metadata collections
    d : DifferenceMetadataOperation()
    f : FactWrapper( factName == "input:micell", factType == DifferenceMetadataOperation.NAME )
then
    String[] values = (String[])d.getDifferences().get("input:micell");
    System.out.println("Differences in particles per cell... " + values[0] + ", " + values[1]);
    double tmp = Double.parseDouble(values[0]) / Double.parseDouble(values[1]);
    // an increase in particles per cell means an increase in time
    d.setExpectedRatio(d.getExpectedRatio() / tmp);
    System.out.println("New Expected Ratio: " + d.getExpectedRatio());
    d.removeFact("input:micell");
end

Data Persistence and Provenance
- Analysis results should include where they came from
- Data persistence captures intermediate and final analysis results and saves them for later access
  - Persistence allows analysis results to be reused
  - Some analysis operations can take a long time
  - Breadth-wise inference analysis and cross-workflow analysis
- Storing all operations in the analysis workflow generates full provenance of the intermediate and final results
  - Supports confirmation and validation of analysis results
  - The inference engine may need this as well
- Persistence and provenance are used to create a "chain of evidence"

Example: GTC (Gyrokinetic Toroidal Code)
Data: GTC on XT3/XT4 with 512 processes
- Particle-in-cell simulation
- Fortran 90, MPI
- Main events:
  - PUSHI: update ion locations
  - CHARGEI: calculate ion gather-scatter coefficients
  - SHIFTI: redistribute ions across processors
- Executed on ORNL Jaguar
  - Cray XT3/XT4, 512 processors
- Problem:
  - ions are accessed regularly
  - grid cells have poor cache reuse
- Scripted analysis, inference rules
(Figure: the turbulent electrostatic potential from a GYRO simulation of plasma microturbulence in the DIII-D tokamak.)

Example: Workflow
Load data -> extract non-callpath data -> extract top 10 events -> extract main event -> merge events -> derive metrics -> compare events to main -> process inference rules.
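A hedged Jython sketch of this workflow, in the style of the embedded-scripting example. Utilities, TrialResult, TopXEvents, AbstractResult, and RuleHarness appear earlier in this talk; the derive and compare operation names below are assumptions patterned on the component list:

# set up the rulebase, as in the embedded-scripting example
ruleHarness = RuleHarness.useGlobalRules("rules/GeneralRules.drl")
ruleHarness.addRules("rules/ApplicationRules.drl")
ruleHarness.addRules("rules/MachineRules.drl")

# load the GTC trial
Utilities.setSession("apart")
trial = Utilities.getTrial("gtc", "jaguar", "512")

# top 10 events by exclusive time (the workflow first drops callpath data
# and also extracts and merges in the main event)
top10 = TopXEvents(trial, trial.getTimeMetric(), AbstractResult.EXCLUSIVE, 10).processData().get(0)

# derive cache hit rates and MFLOP/s, compare each event to the main event,
# then assert the result for the inference rules
# (DeriveMetrics and CompareToMain are hypothetical names for these steps)
derived = DeriveMetrics(top10).processData().get(0)
compared = CompareToMain(derived).processData().get(0)
RuleHarness.getInstance().assertObject(compared)
RuleHarness.getInstance().processRules()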

Example: Output

doing single trial analysis for gtc on jaguar
Loading Rules...
Reading rules: rules/GeneralRules.drl... done.
Reading rules: rules/ApplicationRules.drl... done.
Reading rules: rules/MachineRules.drl... done.
loading the data...
Getting top 10 events (sorted by exclusive time)...
Firing rules...
The event SHIFTI [{shifti.F90} {1,12}] has a lower than average L2 hit rate.
Try improving cache reuse for improved performance.
Average L2 hit rate: , Event L2 hit rate:
Percentage of total runtime: 06.10%
The event SHIFTI [{shifti.F90} {1,12}] has a lower than average L1 hit rate.
Try improving cache reuse for improved performance.
Average L1 hit rate: , Event L1 hit rate:
Percentage of total runtime: 06.10%
The event PUSHI [{pushi.f90} {1,12}] has a higher than average FLOP rate.
This appears to be a computationally dense region. If this is not a main
computation loop, try performing fewer calculations in this event for
improved performance.
Average MFLOP/second: , Event MFLOP/second:
Percentage of total runtime: 50.24%
The event CHARGEI [{chargei.F90} {1,12}] has a lower than average L1 hit rate.
Try improving cache reuse for improved performance.
Average L1 hit rate: , Event L1 hit rate:
Percentage of total runtime: 37.70%
...done with rules.

(Annotations: the rules identified poor cache reuse in SHIFTI and CHARGEI, and identified PUSHI as the main computation.)
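The derived metrics in this output come from PAPI counters; one common way to compute them (an illustrative sketch, since the exact derivations PerfExplorer uses are not shown in this talk):

def l1_hit_rate(l1_tca, l1_tcm):
    # fraction of total L1 cache accesses (PAPI_L1_TCA) that did not
    # miss (PAPI_L1_TCM)
    return 1.0 - float(l1_tcm) / l1_tca

def mflops(fp_ins, seconds):
    # millions of floating-point instructions (PAPI_FP_INS) per second
    return fp_ins / seconds / 1.0e6

print l1_hit_rate(1000000, 50000)       # 0.95
print mflops(2.5e9, 10.0)               # 250.0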

Example: Sweep3D
Data: Sweep3D on XT3/XT4 with 256 processes
- ASCI benchmark code
- 256 processors, with an 800x800x1000 problem
- Problem: the number of logical neighbors in the decomposition determines communication performance
- Problem: XT3/XT4 imbalances
- Scripted analysis, metadata, inference rules

Example: Workflow
Load data -> extract non-callpath data -> extract top 5 events -> load metadata -> correlate events with metadata -> process inference rules. (This is essentially the Jython script shown on the Embedded Scripting slide.)

Example: Output

JPython test script start
doing single trial analysis for sweep3d on jaguar
Loading Rules...
Reading rules: rules/GeneralRules.drl... done.
Reading rules: rules/ApplicationRules.drl... done.
Reading rules: rules/MachineRules.drl... done.
loading the data...
Getting top 10 events (sorted by exclusive time)...
Firing rules...
MPI_Recv(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
MPI_Send(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
MPI_Send(): "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is correlated with the metadata field "total neighbors". The correlation is (moderate).
SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is (very high).
SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is (very high).
...done with rules
JPython test script end

(Annotations: the rules identified hardware differences and correlated communication behavior with metadata.)

Example: GTC Scalability
Data: GTC on XT3/XT4 with 64 through 512 processes
- Weak scaling example
- Comparing 64 and 128 processes
- Superlinear speedup observed
- Can PerfExplorer detect and explain why?
- Uses scripted analysis, metadata, inference rules

Example: Workflow
For each of the two trials, load the data and its metadata and extract the top 5 non-callpath events; then compare the metadata, do the scalability comparison, and process the inference rules.
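A hedged Jython sketch of the two-trial comparison. DifferenceMetadataOperation appears in the rule example earlier, but its constructor arguments here, and the ScalabilityComparison operation, are assumptions:

Utilities.setSession("apart")
baseline = Utilities.getTrial("gtc", "jaguar", "64")
comparison = Utilities.getTrial("gtc", "jaguar", "128")

# difference the metadata (processor count, particles per cell, ...) so rules
# like "Differences in Particles Per Cell" can adjust the expected ratio
# (constructor arguments assumed)
diff = DifferenceMetadataOperation(TrialThreadMetadata(baseline),
                                   TrialThreadMetadata(comparison))
RuleHarness.getInstance().assertObject(diff)

# compare runtimes and hardware counters between the two trials
# (ScalabilityComparison is a hypothetical name for this step)
result = ScalabilityComparison(TrialResult(baseline), TrialResult(comparison)).processData().get(0)
RuleHarness.getInstance().assertObject(result)
RuleHarness.getInstance().processRules()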

Example: Output

JPython test script start
doing single trial analysis for gtc on jaguar
> Firing rules...
Differences in processes... 64, 128
New Expected Ratio: 0.5
Differences in particles per cell... 100, 200
New Expected Ratio: 1.0
The comparison trial has superlinear speedup, relative to the baseline trial.
Expected ratio: 1.0, Actual ratio:
Ratio = baseline / comparison
Event / metric combinations which may contribute:
  CHARGEI [{chargei.F90} {1,12}] PAPI_L1_TCM
  CHARGEI [{chargei.F90} {1,12}] PAPI_L2_TCM
  MPI_Allreduce() PAPI_L1_TCA
  MPI_Allreduce() PAPI_TOT_INS
  MPI_Sendrecv() PAPI_L2_TCM
  MPI_Sendrecv() PAPI_TOT_INS
  PUSHI [{pushi.f90} {1,12}] PAPI_L1_TCM
  PUSHI [{pushi.f90} {1,12}] PAPI_L2_TCM
JPython test script end

(Annotations: comparing the 64- and 128-process runs, the rules identified weak scaling, identified superlinear speedup, and identified possible reasons.)

Current Status
- Working now:
  - Several analysis operations are written
  - Metadata collection is available
  - Scripting and the inference engine are in place
- To be developed:
  - Results persistence
  - Provenance capture and storage

Conclusion
- Mining performance data will depend on adding knowledge to analysis frameworks
  - Application, hardware, environment metadata
  - Analysis processes and workflows
  - Rules for inferencing and analysis search
- Expert knowledge combined with performance results can explain performance phenomena
  - The redesigned PerfExplorer framework is one approach
- Community performance knowledge engineering
  - Developing inference rules
  - Constructing analysis processes
  - Application-specific metadata and analysis

Acknowledgments
- US Department of Energy (DOE)
  - Office of Science
  - MICS, Argonne National Lab
- NSF
  - Software and Tools for High-End Computing
- PERI SciDAC
  - PERI-DB project
- TAU and PerfExplorer demos: NNSA / ASC, booth #1617, various times daily
- PERI-DB demo: RENCI booth #3215, Wednesday at 2:30 PM