Slide 1: Performance Tools for Empirical Autotuning

Allen D. Malony, Nick Chaimov, Kevin Huck, Scott Biersdorff, Sameer Shende
{malony,nchaimov,khuck,scott,sameer}@cs.uoregon.edu
University of Oregon
DOE SciDAC CScADS, August 13-14, 2012
Slide 2: Outline

- Motivation: performance engineering and autotuning
- Performance tool integration with the autotuning process
- TAU performance system overview
- Performance database (TAUdb)
- Framework for empirical-based performance tuning
- Integration with CHiLL and Active Harmony
- Integration with Orio (preliminary)
- Conclusions and future directions
Slide 3: Parallel Performance Engineering

- Scalable, optimized applications deliver on the promise of HPC
- Optimization through a performance engineering process: understand performance complexity and inefficiencies; tune the application to run optimally on high-end machines
- How do we make the process more effective and productive?
  - What is the nature of the performance problem solving?
  - What performance technology should be applied?
- Performance tool efforts have focused on performance observation, analysis, and problem diagnosis
- Application development and optimization productivity: programmability, reusability, portability, robustness
- Performance technology is part of a larger programming system
Slide 4: Parallel Performance Engineering Process

[Diagram: traditionally an empirically based cycle of observation, experimentation, diagnosis, and tuning, with performance technology developed for each level:
- Performance observation (characterization): instrumentation, measurement, analysis, visualization
- Performance experimentation (hypotheses): experiment management, performance storage
- Performance diagnosis (properties): data mining, models, expert systems
- Performance tuning]
Slide 5: Parallel Performance Diagnosis

[Figure: parallel performance diagnosis workflow]
Slide 6: "Extreme" (Progressive) Performance Engineering

- Increased performance complexity forces the engineering process to be more intelligent and automated: automated performance data analysis / mining / learning; automated performance problem identification
- Performance engineering tools and practice must incorporate a performance knowledge discovery process
- Model-oriented knowledge: computational semantics of the application; symbolic models for algorithms; performance models for system architectures / components
- Application developers can be more directly involved in the performance engineering process
Slide 7: "Extreme" Performance Engineering

- Empirical performance data evaluated with respect to performance expectations at various levels of abstraction
Slide 8: Autotuning Is a Performance Engineering Process

- Autotuning methodology incorporates aspects common to "traditional" application performance engineering: empirical performance observation; experiment-oriented
- Autotuning embodies progressive engineering techniques: automated experimentation and performance testing; optimization guided by (intelligent) search-space exploration; model-based (domain-specific) computational semantics
- Autotuning is a different style of performance engineering, with its own approach to performance diagnosis
- There are shared objectives for performance technology and opportunities for tool integration
Slide 9: TAU Performance System® (http://tau.uoregon.edu)

- Parallel performance framework and toolkit
- Supports all HPC platforms, compilers, and runtime systems
- Provides portable instrumentation, measurement, and analysis
Slide 10: TAU Components

- Instrumentation: Fortran, C, C++, UPC, Chapel, Python, Java; source, compiler, library wrapping, binary rewriting; automatic instrumentation
- Measurement: MPI, OpenSHMEM, ARMCI, PGAS; Pthreads, OpenMP, other thread models; GPU (CUDA, OpenCL, OpenACC); performance data (timing, counters) and metadata; parallel profiling and tracing
- Analysis: performance database technology (TAUdb, formerly PerfDMF); parallel profile analysis (ParaProf); performance data mining / machine learning (PerfExplorer)
Slide 11: TAU Performance Database (TAUdb)

- Started in 2004 as the Performance Data Management Framework, PerfDMF (Huck et al., ICPP 2005): database schema and Java API; profile parsing; database queries; conversion utilities (parallel profiles from other tools)
- Provides database support for TAU profile analysis tools: ParaProf, PerfExplorer, Eclipse PTP
- Used as TAU's regression testing database and as a performance regression database
- Ported to several DBMSs: PostgreSQL, MySQL, H2, Derby, Oracle, DB2
Slide 12: TAUdb Database Schema

- Parallel performance profiles: timer and counter measurements with 5 dimensions:
  - Physical location: process / thread
  - Static code location: function / loop / block / line
  - Dynamic location: current callpath and context (parameters)
  - Time context: iteration / snapshot / phase
  - Metric: time, hardware counters, derived values
- Measurement metadata: properties of the experiment; anything from name:value pairs to nested, structured data; a single value for the whole experiment or for a full context (tuple of thread, timer, iteration, timestamp)
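To make the five-dimensional measurement model concrete, here is a minimal Python sketch of the data shape it describes. This is illustrative only; the field names are invented and do not reflect the actual TAUdb schema or API.

    from dataclasses import dataclass, field

    # One TAUdb-style measurement value, indexed by the five dimensions
    # listed above. All names here are hypothetical.
    @dataclass(frozen=True)
    class MeasurementKey:
        process: int      # physical location: process rank
        thread: int       # physical location: thread id
        timer: str        # static code location: function/loop/block/line
        callpath: str     # dynamic location: callpath and parameters
        iteration: int    # time context: iteration/snapshot/phase
        metric: str       # e.g. "TIME" or a hardware counter name

    @dataclass
    class Trial:
        measurements: dict = field(default_factory=dict)  # MeasurementKey -> float
        metadata: dict = field(default_factory=dict)      # name:value or nested data

    trial = Trial()
    trial.metadata["CPU Cores"] = 16
    key = MeasurementKey(0, 0, "matmult", "main => matmult", 0, "TIME")
    trial.measurements[key] = 1.25e6  # e.g. inclusive time in microseconds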
Slide 13: TAUdb Programming APIs

- Java: the original API; basis for in-house analysis tool support; command-line tools for batch loading into the database; parses 15+ profile formats (TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, ...); supports Java embedded databases (H2, Derby)
- C: programming interface under development; PostgreSQL support first, others as requested; query prototype developed; full-featured API planned (query, insert, and update); evaluating SQLite support
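As an illustration of how an analysis script might pull results out of a TAUdb-backed PostgreSQL database, here is a hedged Python sketch. The table and column names below are simplified assumptions for illustration, not the real TAUdb schema.

    import psycopg2  # PostgreSQL driver; TAUdb's first-supported C-side backend

    # Hypothetical, simplified schema: actual TAUdb tables and columns differ.
    conn = psycopg2.connect(dbname="taudb", user="tau", host="localhost")
    cur = conn.cursor()
    cur.execute(
        """
        SELECT t.name, m.value
        FROM timer t JOIN measurement m ON m.timer_id = t.id
        WHERE m.metric = %s
        ORDER BY m.value DESC
        """,
        ("TIME",),
    )
    for name, value in cur.fetchall():
        print(f"{name}: {value:.1f}")
    conn.close()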
Slide 14: TAUdb Tool Support

- ParaProf: parallel profile viewer / analyzer; 2D and 3D visualizations; single-experiment analysis
- PerfExplorer: data mining framework; clustering, correlation; multi-experiment analysis; scripting engine; expert system
Slide 15: PerfExplorer Architecture

[Figure: PerfExplorer architecture, layered on a DBMS (TAUdb)]
Slide 16: TAU Integration with CHiLL and Active Harmony

Major goals:
- Integrate TAU with existing autotuning frameworks
- Use TAU to gather performance data for autotuning/specialization
- Store performance data, tagged with metadata about the execution environment and input, in a centralized database
- Use machine learning and data mining techniques to increase the level of automation of autotuning and specialization

Using TAU in two ways:
- Multi-parameter-based profiling support to generate separate profiles based on function parameters (or outlined code)
- TAU metrics stored in PerfDMF/TAUdb as performance measures in optimization
Slide 17: Components

- ROSE outliner: ROSE is a compiler with built-in support for source-to-source transformations; given a reference to an AST node, the outliner extracts the node into its own function or file
- CHiLL: provides a domain-specific language for specifying transformations on loops
- Active Harmony: searches the space of parameters to transformation recipes
- TAU: performance instrumentation and measurement
Slide 18: Multi-Parameter Profiling

- Added multi-parameter-based profiling to TAU to support specialization
- The user selects which parameters are of interest using a selective instrumentation file
- Consider a matrix multiply function: we can generate profiles based on the dimensions of the matrices encountered during execution. For void matmult(float **c, float **a, float **b, int L, int M, int N), parameterize using L, M, N
Slide 19: Using Parameterized Profiling in TAU

A selective instrumentation file for the C routine int matmult(float **, float **, float **, int, int, int):

    BEGIN_INCLUDE_LIST
    matmult
    END_INCLUDE_LIST

    BEGIN_INSTRUMENT_SECTION
    loops file="foo.c" routine="matrix#"
    param file="foo.c" routine="matmult" param="L" param="M" param="N"
    END_INSTRUMENT_SECTION
Slide 20: Parameterized Profiling / Autotuning with TAUdb

[Figure: workflow connecting parameterized profiles to TAUdb]
Slide 21: Autotuning with TAUdb Methodology

- Each time the program executes a code variant, we store metadata in the performance database indicating how the variant was produced: source function; name of the CHiLL recipe; parameters to the CHiLL recipe
- The database also contains metadata on the parameters with which the variant was called, and on the execution environment: OS name, version, release, native architecture; CPU vendor, ID, clock speed, cache sizes, number of cores; memory size; any metadata specified by the end user
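A stored record might look like the following Python sketch. The keys mirror the fields listed above, but the names and values are illustrative, not TAUdb's actual metadata keys.

    # Illustrative metadata record for one code-variant execution.
    variant_metadata = {
        "source_function": "matmult",
        "chill_recipe": "matmult_tile_unroll.script",   # hypothetical recipe name
        "chill_parameters": {"tile": 64, "unroll": 4},  # hypothetical values
        "call_parameters": {"L": 10, "M": 10, "N": 10},
        "os_name": "Linux",
        "os_release": "2.6.32",
        "cpu_vendor": "GenuineIntel",
        "cpu_mhz": 2600,
        "cpu_cores": 16,
        "cache_sizes_kb": [32, 256, 20480],
        "memory_gb": 64,
    }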
Slide 22: Machine Learning

- Given all these data stored in TAUdb (OS name, OS release, CPU type, CPU MHz, CPU cores, call parameters, CHiLL script, metric, ...), we can build a decision tree that selects the best-performing code variant given the information available at run time
Slide 23: Decision Tree

- PerfExplorer already has an interface to Weka
- Use Weka to generate decision trees based on the data in the performance database
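The deck uses Weka through PerfExplorer; as a stand-in, here is the same idea expressed with scikit-learn (a deliberate substitution, not the tool used in this work). The features and variant labels are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Training data: one row per observation, labeled with the code variant
    # that performed best there. Features: [cpu_mhz, cores, L, M, N].
    X = [
        [2600, 16, 10, 10, 10],
        [2600, 16, 500, 500, 500],
        [2100, 8, 10, 10, 10],
        [2100, 8, 500, 500, 500],
    ]
    y = ["small_variant", "tiled_variant", "small_variant", "tiled_variant"]

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=["cpu_mhz", "cores", "L", "M", "N"]))

    # At run time, the wrapper evaluates the tree on the current context:
    print(tree.predict([[2600, 16, 12, 12, 12]])[0])  # -> "small_variant"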
Slide 24: Wrapper Generation

- Use a ROSE-based tool to generate a wrapper function that carries out the decisions in the decision tree and executes the best code variant
- The decision-tree code generation tool takes a Weka-generated decision tree and a set of decision functions as input
- Decision functions for metadata automatically collected by TAU are provided; if using custom metadata, the user must provide a custom decision function
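The generated wrapper is C emitted by the ROSE-based tool; purely to illustrate its shape, here is a hand-written Python sketch of the dispatch logic such a wrapper encodes. Variant names, thresholds, and the decision function are all invented.

    import os

    def cpu_cores() -> int:
        # Stand-in for a TAU-provided decision function over collected metadata.
        return os.cpu_count() or 1

    def _variant(name):
        def run(c, a, b, L, M, N):
            print(f"running {name} variant")
        return run

    matmult_small = _variant("small")
    matmult_tiled = _variant("tiled")
    matmult_tiled_parallel = _variant("tiled+parallel")

    def matmult_wrapper(c, a, b, L, M, N):
        # Decision tree: split on input size, then on available cores.
        if max(L, M, N) <= 16:
            return matmult_small(c, a, b, L, M, N)
        if cpu_cores() >= 8:
            return matmult_tiled_parallel(c, a, b, L, M, N)
        return matmult_tiled(c, a, b, L, M, N)

    matmult_wrapper(None, None, None, 10, 10, 10)  # -> runs the small variant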
Slide 25: Example: Sparse MM Specialization with CHiLL

- Previous study: CHiLL and Active Harmony were used to specialize matrix multiply in Nek5000 (a fluid dynamics solver) for small matrices based on input size
- Limitations: the histogram of input sizes was generated manually, and the code to evaluate input data and select the specialized variant was written manually
- We can automate these processes with parameterized profiling and machine learning over the collected data
- Replicated the small-matrix specialization study using TAU and TAUdb
Slide 26: Introduction to Orio

- Orio is an annotation-based empirical performance tuning framework
- Source code annotations allow Orio to generate a set of low-level performance optimizations
- After each optimization (or transformation) is applied, the kernel is run
- The set of optimizations is searched for the best transformations to apply for a given kernel
- The first effort to integrate Orio with TAU was to collect performance data about each experiment that Orio runs: move performance data from Orio into TAUdb; Orio could read from TAUdb in the future
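The empirical search described above can be summarized by the loop below, a minimal Python sketch of generate, run, measure, keep the best. It is a schematic of the idea, not Orio's code or API, and exhaustive enumeration stands in for Orio's actual search strategies.

    import itertools
    import time

    def empirical_search(make_variant, run_kernel, param_space):
        """Try every point in a (small) parameter space; keep the fastest."""
        best_params, best_time = None, float("inf")
        for values in itertools.product(*param_space.values()):
            point = dict(zip(param_space.keys(), values))
            variant = make_variant(point)   # apply the transformations
            elapsed = run_kernel(variant)   # empirical measurement
            if elapsed < best_time:
                best_params, best_time = point, elapsed
        return best_params, best_time

    def run_kernel(variant):
        t0 = time.perf_counter()
        variant()
        return time.perf_counter() - t0

    # Toy usage: "variants" just sleep differently; the search finds the fastest.
    space = {"unroll": [1, 2, 4], "tile": [16, 32]}
    make = lambda p: (lambda: time.sleep(0.001 / p["unroll"]))
    print(empirical_search(make, run_kernel, space))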
Slide 27: TAU's GPU Measurement Library

- Focused on Orio's CUDA kernel transformations
- TAU uses NVIDIA's CUPTI interface to gather information about GPU execution: memory transfers; kernels (runtime performance, performance attributes); GPU counters
- Using the CUPTI interface does not require any recompiling or relinking of the application
Slide 28: Orio and TAU Integration

[Figure: Orio and TAU integration workflow]
Slide 29: PerfExplorer and Weka

- PerfExplorer is TAU's performance data mining tool
- Features: database management (both local and remote databases); easy generation of charts for common performance studies; clustering and correlation analysis, in addition to custom scripted analysis
- The Weka machine learning component is useful when analyzing autotuning data, e.g., correlating tuning transformations with execution time
Slide 30: Orio Tuning of Vector Multiplication

- Orio tuning of a simple 3D vector multiplication
- 2,048 experiments fed into TAUdb
- Used TAU's PerfExplorer with Weka for component analysis
[Chart: correlation of the tuning parameters (threads per block, number of blocks, preferred L1 size, unroll factor, CFLAGS, warps per SM) with kernel execution time; some parameters show only a small correlation with runtime, while others are better correlated]
Slide 31: Number of Threads per Kernel

- GPU occupancy (number of warps) increases with a larger number of threads
- Greater occupancy improves memory latency hiding, resulting in faster execution time
[Chart: kernel execution time (us) and GPU occupancy (warps) vs. number of threads]
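To make the occupancy claim concrete, here is a simplified back-of-the-envelope calculation in Python. It ignores register and shared-memory limits, which also cap occupancy on real GPUs, and the per-SM warp limit is device-dependent.

    WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
    MAX_WARPS_PER_SM = 48   # device-dependent; e.g. Fermi-class GPUs

    def occupancy(threads_per_block: int, blocks_per_sm: int) -> float:
        """Fraction of the SM's warp slots kept busy (simplified model)."""
        warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
        active_warps = min(warps_per_block * blocks_per_sm, MAX_WARPS_PER_SM)
        return active_warps / MAX_WARPS_PER_SM

    # More threads per block -> more active warps -> better latency hiding.
    for tpb in (32, 128, 512):
        print(tpb, f"{occupancy(tpb, blocks_per_sm=2):.0%}")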
Slide 32: Autotuning Measurements in MPI Applications

- How is the execution context of an MPI application captured and used for autotuning experiments? Is this a problem?
- Avoid running the entire MPI application for one variant: how about integrating autotuning into the MPI application itself? Each process runs code variants in vivo; each variant is measured separately and stored in the profile
- How are code variants generated and selected during the run? Code transformations, compiler options, runtime parameters, ...
- What assumptions are made about the MPI application? Functional consistency: does it continue to work properly? Execution regularity: is the context stable between iterations?
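A minimal sketch of the in-vivo idea using mpi4py (illustrative only; this is not TAU's or Active Harmony's API): each rank times a different variant per iteration and records the result, so variants are measured inside the running application.

    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Hypothetical code variants of the same kernel (stubs for illustration).
    def variant_a(): sum(i * i for i in range(50_000))
    def variant_b(): sum(map(lambda i: i * i, range(50_000)))
    variants = [variant_a, variant_b]

    timings = {}  # variant index -> per-iteration times (the "profile")
    for it in range(10):
        v = (rank + it) % len(variants)  # rotate variants across iterations
        t0 = time.perf_counter()
        variants[v]()                    # run this iteration's variant in vivo
        timings.setdefault(v, []).append(time.perf_counter() - t0)

    # Gather per-rank measurements so a tuner can pick the best variant.
    all_timings = comm.gather(timings, root=0)
    if rank == 0:
        print(all_timings)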
Slide 33: Autotuning Measurements in MPI Application

[Figure: in-vivo autotuning measurement workflow in an MPI application]
Slide 34: Conclusions

- Autotuning IS a performance engineering process, complementary with performance engineering for empirical performance diagnosis and optimization
- There are opportunities to integrate parallel application performance tools with autotuning methodologies: performance experiment instrumentation and measurement; performance data/metadata, analysis, data mining
- Knowledge engineering is key (at all levels): performance + parallel computation + system, represented in a form the tools can reason about
- Bridge application performance characterization methodology with autotuning methodology; integration will help to explain performance
Slide 35: Future Work

- DOE SUPER SciDAC project
- Integration of TAU with autotuning frameworks (CHiLL, Active Harmony, Orio); apply the tools for end-to-end application performance
- Build performance databases: enable exploration and understanding of search spaces; enable association of multi-dimensional data/metadata; relate performance across compilers, platforms, ...; feed back semantic information to explain performance
- Explore data mining and machine learning techniques: discovery of relationships, factors, latent variables, ...; create performance models and feed them back to autotuning; learn optimal algorithm parameters for specific scenarios
- Bridge between model-based and experiment-based approaches; create a knowledge-based integrated performance system
Slide 36: Support Acknowledgements

- Department of Energy (DOE): Office of Science; ASC/NNSA
- Department of Defense (DoD): HPC Modernization Office (HPCMO)
- NSF: Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich
- Argonne National Laboratory
- Technical University Dresden
- ParaTools, Inc.
- NVIDIA