Machine Learning-based Autotuning with TAU and Active Harmony. Nicholas Chaimov, University of Oregon. Paradyn Week 2013, April 29, 2013.


Outline
Brief introduction to TAU
Motivation
Relevant TAU tools: TAUdb, PerfExplorer
Using TAU in an autotuning workflow
Machine learning with PerfExplorer
Future work

Motivation
Goals:
Generate code that adapts to changes in the execution environment and input datasets.
Avoid spending large amounts of time on search when autotuning code.
Method: learn from past performance data in order to
automatically generate code that selects a variant at runtime based on properties of the execution environment and input dataset, and
learn classifiers that select search parameters (such as the initial configuration) to speed up the search process.

TAU Performance System®
Tuning and Analysis Utilities (20+ year project)
Performance problem-solving framework for HPC
Integrated, scalable, flexible, portable
Targets all parallel programming / execution paradigms
Integrated performance toolkit:
Multi-level performance instrumentation
Flexible and configurable performance measurement
Widely ported performance profiling / tracing system
Performance data management and data mining
Open source (BSD-style license)
Broad use in complex software, systems, and applications

TAU Organization
Parallel performance framework and toolkit
Supports all HPC platforms, compilers, and runtime systems
Provides portable instrumentation, measurement, and analysis

TAU Components
Instrumentation:
Fortran, C, C++, UPC, Chapel, Python, Java
Source, compiler, library wrapping, binary rewriting
Automatic instrumentation
Measurement:
MPI, OpenSHMEM, ARMCI, PGAS
Pthreads, OpenMP, other thread models
GPU: CUDA, OpenCL, OpenACC
Performance data (timing, counters) and metadata
Parallel profiling and tracing
Analysis:
Performance database technology (TAUdb, formerly PerfDMF)
Parallel profile analysis (ParaProf)
Performance data mining / machine learning (PerfExplorer)

TAU Instrumentation Mechanisms
Source code:
Manual (TAU API, TAU component API)
Automatic (robust): C, C++, F77/90/95 via the Program Database Toolkit (PDT); OpenMP via directive rewriting (Opari, POMP2 spec)
Object code:
Compiler-based instrumentation (-optCompInst)
Pre-instrumented libraries (e.g., MPI using PMPI)
Statically- and dynamically-linked library wrapping (tau_wrap)
Executable code:
Binary rewriting and dynamic instrumentation (DyninstAPI, U. Wisconsin / U. Maryland)
Virtual machine instrumentation (e.g., Java using JVMPI)
Interpreter-based instrumentation (Python)
Kernel-based instrumentation (KTAU)

Instrumentation: Rewriting Binaries
Support for both static and dynamic executables
Specify the list of routines to instrument or exclude from instrumentation
Specify the TAU measurement library to be injected
Simplifies the usage of TAU:
To instrument: % tau_run a.out -o a.inst
To perform measurements, execute the application: % mpirun -np 8 ./a.inst
To analyze the data: % paraprof

DyninstAPI 8.1 Support in TAU
TAU supports DyninstAPI v8.1
Improved support for static rewriting
Integration for static binaries in progress
Support for loop-level instrumentation
Selective instrumentation at the routine and loop level

TAUdb: Framework for Managing Performance Data

TAU Performance Database – TAUdb
Started in 2004 (Huck et al., ICPP 2005) as the Performance Data Management Framework (PerfDMF)
Database schema and Java API:
Profile parsing
Database queries
Conversion utilities (parallel profiles from other tools)
Provides database support for TAU profile analysis tools: ParaProf, PerfExplorer, Eclipse PTP
Used as a regression testing database for TAU and as a performance regression database
Ported to several DBMSs: PostgreSQL, MySQL, H2, Derby, Oracle, DB2

TAUdb Database Schema
Parallel performance profiles:
Timer and counter measurements with 5 dimensions:
Physical location: process / thread
Static code location: function / loop / block / line
Dynamic location: current callpath and context (parameters)
Time context: iteration / snapshot / phase
Metric: time, hardware counters, derived values
Measurement metadata:
Properties of the experiment
Anything from name:value pairs to nested, structured data
Single value for the whole experiment, or per full context (tuple of thread, timer, iteration, timestamp)
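As a rough illustration of the five dimensions and the metadata listed above, a single timer measurement could be pictured as the record below. This is a hypothetical Python view for explanation only; the field names are not the actual TAUdb table or column names.

# Hypothetical view of one TAUdb timer measurement and its metadata.
# Field names are illustrative only and do not match the real TAUdb schema.
measurement = {
    "process": 12, "thread": 0,                    # physical location
    "timer": "matmult [{foo.c}]",                  # static code location
    "callpath": "main => compute => matmult",      # dynamic location (callpath/context)
    "iteration": 3,                                # time context
    "metric": "PAPI_FP_OPS",                       # metric (time, HW counter, derived)
    "value": 1.84e9,
}
metadata = {
    "CPU Vendor": "GenuineIntel",                  # simple name:value pairs ...
    "CPU Cores": 16,
    "Memory Size": "64 GB",                        # ... or nested, structured data
}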

TAUdb Programming APIs
Java:
Original API; basis for in-house analysis tool support
Command-line tools for batch loading into the database
Parses 15+ profile formats: TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, ...
Supports Java embedded databases (H2, Derby)
C:
Programming interface under development
PostgreSQL support first, others as requested
Query prototype developed
Plan for a full-featured API: query, insert, and update
Evaluating SQLite support

TAUdb Tool Support
ParaProf:
Parallel profile viewer / analyzer
2D and 3D visualizations
Single-experiment analysis
PerfExplorer:
Data mining framework
Clustering, correlation
Multi-experiment analysis
Scripting engine
Expert system

PerfExplorer
[Architecture diagram: PerfExplorer layered on the DBMS (TAUdb)]

PerfExplorer – Relative Comparisons
Total execution time
Timesteps per second
Relative efficiency
Relative efficiency per event
Relative speedup
Relative speedup per event
Group fraction of total
Runtime breakdown
Correlate events with total runtime
Relative efficiency per phase
Relative speedup per phase
Distribution visualizations
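For reference, the relative metrics above presumably follow the standard scaling definitions, taken with respect to the smallest measured processor count p_0, with T(p) the total execution time on p processes:

S(p) = \frac{T(p_0)}{T(p)}, \qquad E(p) = \frac{p_0 \, T(p_0)}{p \, T(p)}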

PerfExplorer – Correlation Analysis
Data: FLASH on BG/L (LLNL), 64 nodes
Strong negative linear correlation between CALC_CUT_BLOCK_CONTRIBUTIONS and MPI_Barrier

PerfExplorer – Correlation Analysis
The correlation coefficient indicates a strong, negative relationship: as the execution time of CALC_CUT_BLOCK_CONTRIBUTIONS() increases, that of MPI_Barrier() decreases.
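The coefficient here is presumably the Pearson correlation over per-process times x_i and y_i of the two events, where a value near -1 indicates a strong negative linear relationship:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}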

PerfExplorer – Cluster Analysis

PerfExplorer – Cluster Analysis
Four significant events automatically selected
Clusters and correlations are visible

PerfExplorer – Performance Regression

Usage Scenarios: Evaluate Scalability

PerfExplorer Scripting Interface
Control PerfExplorer analyses with Python scripts:
Perform built-in PerfExplorer analyses
Call machine learning routines in Weka
Export data to R for analysis

Example script:

from edu.uoregon.tau.perfexplorer.glue import *  # PerfExplorer analysis classes (Jython); import path may vary by TAU version

Utilities.setSession("peri_s3d")
trial = Utilities.getTrial("S3D", "hybrid-study", "hybrid")
result = TrialResult(trial)
reducer = TopXEvents(result, 10)
reduced = reducer.processData().get(0)
for metric in reduced.getMetrics():
    k = 2
    while k <= 10:
        kmeans = KMeansOperation(reduced, metric, AbstractResult.EXCLUSIVE, k)
        kmeans.processData()
        k += 1

Using TAU in an Autotuning Workflow
Active Harmony proposes a variant.
Instrument the code variant with TAU:
Captures time measurements and hardware performance counters (interfaces for PAPI, CUPTI, etc.)
Captures metadata describing the execution environment: OS name, version, release, native architecture, CPU vendor, ID, clock speed, cache sizes, number of cores, memory size, etc., plus user-defined metadata
Save performance profiles into TAUdb:
Profiles are tagged with provenance metadata describing which parameters produced the data.
Repeat autotuning across machines/architectures and/or datasets.
Analyze stored profiles with PerfExplorer.
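A minimal, self-contained sketch of this loop, with stand-ins for every external piece: a random search replaces Active Harmony, a synthetic cost function replaces the TAU-instrumented run, and a plain list replaces TAUdb. All names here are made up for illustration, not real TAU or Active Harmony APIs.

# Sketch of the autotuning workflow described above (stand-ins throughout).
import random

def run_with_tau(params):
    # Stand-in for compiling and running a TAU-instrumented variant; returns a
    # fake execution time plus the metadata TAU would record for the run.
    fake_time = abs(params["tile"] - 64) * 0.01 + abs(params["unroll"] - 4) * 0.05
    return {"total_time": fake_time,
            "metadata": {"cpu": "example-cpu", "cores": 16},
            "provenance": dict(params)}

taudb = []    # stand-in for profiles stored in TAUdb, tagged with provenance
best = None
for _ in range(50):                                    # Active Harmony's role: propose variants
    params = {"tile": random.choice([16, 32, 64, 128]),
              "unroll": random.choice([1, 2, 4, 8])}
    profile = run_with_tau(params)                     # instrumented measurement
    taudb.append(profile)                              # keep the profile for later PerfExplorer analysis
    if best is None or profile["total_time"] < best["total_time"]:
        best = profile

print("best parameters found:", best["provenance"])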

Multi-Parameter Profiling
Added multi-parameter-based profiling in TAU to support specialization.
The user can select which parameters are of interest using a selective instrumentation file.
Consider a matrix multiply function: we can generate profiles based on the dimensions of the matrices encountered during execution. For example, for void matmult(float **c, float **a, float **b, int L, int M, int N), parameterize using L, M, and N.

Using Parameterized Profiling in TAU
Selective instrumentation file:

BEGIN_INCLUDE_LIST
matmult
END_INCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops file=foo.c routine=matrix#
param file=foo.c routine=matmult param=L param=M param=N
END_INSTRUMENT_SECTION

Routine specification: int matmult(float **, float **, float **, int, int, int) C

Specialization using Decision-Tree Learning
For a matrix multiply kernel: given a dataset containing matrices of different sizes, where some matrix sizes are more common than others,
automatically generate a function to select specialized variants at runtime based on matrix dimensions.

Specialization using Decision-Tree Learning
For a matrix multiply kernel: given a dataset containing matrices of different sizes, where some matrices are small enough to fit in cache while others are not,
automatically generate a function to select specialized variants at runtime based on matrix dimensions.
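A minimal sketch of the idea, with scikit-learn standing in for the Weka routines PerfExplorer actually calls and a synthetic timing table standing in for parameterized profiles from TAUdb: train a decision tree that maps matrix dimensions to the fastest observed variant, then query it at runtime.

# Illustrative only: scikit-learn replaces Weka, and the table below replaces
# parameterized profiles pulled from TAUdb. Variant names are made up.
from sklearn.tree import DecisionTreeClassifier

# (L, M, N) -> fastest variant observed for those dimensions (synthetic data)
observations = [
    ((16, 16, 16), "small"),
    ((32, 32, 32), "small"),
    ((256, 256, 256), "tiled"),
    ((1024, 1024, 1024), "tiled"),
    ((2048, 2048, 2048), "blocked"),
]
X = [list(dims) for dims, _ in observations]
y = [variant for _, variant in observations]

tree = DecisionTreeClassifier().fit(X, y)

def matmult_dispatch(L, M, N):
    # Runtime selection function of the kind the learned tree would be turned into.
    return tree.predict([[L, M, N]])[0]

print(matmult_dispatch(20, 20, 20))        # expected: "small"
print(matmult_dispatch(1500, 1500, 1500))  # expected: "tiled" or "blocked"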

Initial Configuration Selection
Speed up the autotuning search process by learning a classifier that selects an initial configuration.
When starting to autotune a new code:
Use the default initial configuration
Capture performance data into TAUdb
Once sufficient data has been collected:
Generate a classifier
On subsequent autotuning runs:
Use the classifier to propose an initial configuration for the search
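A rough sketch of this step, again with scikit-learn standing in for Weka and hand-written records standing in for TAUdb metadata: learn a mapping from execution-environment features to the best known starting configuration. Feature choices and configuration labels below are invented for illustration.

# Illustrative only: a 1-nearest-neighbour lookup stands in for the learned
# classifier; the records stand in for profiles and metadata stored in TAUdb.
from sklearn.neighbors import KNeighborsClassifier

# (streaming multiprocessors, shared memory per block in KB) -> best known
# starting configuration on that device (all values made up)
past_runs = [
    ((30, 16), "TI=16,TJ=16,unroll=2"),
    ((14, 48), "TI=32,TJ=32,unroll=4"),
    ((16, 48), "TI=32,TJ=32,unroll=4"),
]
X = [list(features) for features, _ in past_runs]
y = [config for _, config in past_runs]

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Propose a starting point for the search on a previously unseen device:
print(model.predict([[15, 48]])[0])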

Initial Configuration Selection Example
Matrix multiplication kernel in C
CUDA code generated using CUDA-CHiLL
Tuned on several different NVIDIA GPUs: S1070, C2050, C2070, GTX 480
Learn on data from three GPUs, test on the remaining one.
Results in a reduction in the number of evaluations required to converge.

Ongoing Work: Guided Search
We chose to supply an initial configuration largely because it was easy to implement: Active Harmony already provided the functionality to specify it.
With the Active Harmony plugin interface, we could provide input beyond the first step of the search, e.g., at each step, incorporate newly acquired data into the classifier and select a new proposal.
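A compact sketch of what per-step guidance might look like, with a toy search loop in place of the Active Harmony plugin interface (whose actual API is not modeled here): the model is refit on all data gathered so far, and its current best guess biases the next proposal.

# Toy guided search: refit a small regression model on every measurement so far
# and let its prediction bias the next proposal, keeping some random exploration.
import random
from sklearn.tree import DecisionTreeRegressor

def measure(tile):
    # Stand-in for a TAU-instrumented run of one code variant.
    return abs(tile - 64) * 0.01 + random.uniform(0.0, 0.05)

candidates = [8, 16, 32, 48, 64, 96, 128, 192, 256]
start = random.choice(candidates)                 # default initial configuration
history = [(start, measure(start))]

for step in range(15):
    # Incorporate all data acquired so far into the model ...
    model = DecisionTreeRegressor().fit([[t] for t, _ in history],
                                        [cost for _, cost in history])
    # ... and use its current best guess when selecting the next proposal.
    guess = min(candidates, key=lambda t: model.predict([[t]])[0])
    proposal = guess if random.random() < 0.7 else random.choice(candidates)
    history.append((proposal, measure(proposal)))

best_tile, best_cost = min(history, key=lambda rec: rec[1])
print("best tile size found:", best_tile)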

Ongoing Work: Real Applications
So far we have only used kernels in isolation.
Currently working on tuning OpenCL derived-field generation routines in the VisIt visualization tool.
Cross-architecture: x86, NVIDIA GPU, AMD GPU, Intel Xeon Phi

Acknowledgments – U. Oregon
Prof. Allen D. Malony, Professor of CIS and Director, NeuroInformatics Center
Dr. Sameer Shende, Director, Performance Research Lab
Dr. Kevin Huck
Wyatt Spear
Scott Biersdorff
David Ozog
David Poliakoff
Dr. Robert Yelle
Dr. John Linford, ParaTools, Inc.

Support Acknowledgements
Department of Energy (DOE): Office of Science, ASC/NNSA
Department of Defense (DoD): HPC Modernization Office (HPCMO)
NSF Software Development for Cyberinfrastructure (SDCI)
Research Centre Juelich
Argonne National Laboratory
Technical University Dresden
ParaTools, Inc.
NVIDIA