Tool Integration and Autotuning for SUPER Performance Optimization
Allen D. Malony, Nick Chaimov (University of Oregon); Mary Hall (University of Utah); Jeff Hollingsworth (University of Maryland); Boyana Norris (Argonne National Laboratory); Gabriel Marin (Oak Ridge National Laboratory)

Presentation transcript:

Tools: Active Harmony, CHiLL, Orio, MIAMI, TAU integration.

Introduction
Methods for performance optimization of petascale applications must address the growing complexity of new HPC hardware/software environments, which limits the effectiveness of manual performance problem triage, software transformation, and static/dynamic parameter configuration. Greater performance-tool integration and tuning-process automation are necessary to manage and share performance information, to generate correct multiple code variants, to conduct controlled performance experiments, and to efficiently search for and discover high-performing solutions, thereby improving performance portability overall. SUPER is advancing autotuning capabilities through the coupling of performance measurement, analysis, and database tools, compiler and program translators, and autotuning frameworks.

Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program, funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research.

Active Harmony
[Figure: web-based user interface showing selected configurations and performance evolution.]
Auto-tuning search engine:
- Supports multiple search algorithms: Parallel Rank Order, Nelder-Mead, exhaustive, and random.
- Allows parallel search: each node can run a different configuration.
Tuna: a command-line tool that makes auto-tuning setup easy.

    ./tuna -i=tile,1,10,1 -i=unroll,2,12,2 -n=25 matrix_mult -t % -u %

- Parameter 1 (tile): low 1, high 10, step 1.
- Parameter 2 (unroll): low 2, high 12, step 2.
- The remainder is the command line to run; each % is replaced by a parameter value.
Case study: hierarchical-grid electronic-structure solver.
- 7-parameter search space: coarse-grain grid {x, y, z}, processes per coarse-grain grid node, fine-grain grid {x, y, z}.
- 64-node suitability tests; wall time recorded for 14 runs: average 3:48:09, standard deviation 0:56:21.

Orio
Example annotation: sparse matrix-vector product (diagonal storage) in a structured-grid PDE solution, with a CUDA loop transform.

    int nrows = m*n;
    int ndiags = nos;
    begin Loop(transform CUDA(threadCount=TC, blockCount=BC, preferL1Size=PL, unrollInner=UIF))
    for (i = 0; i <= nrows-1; i++) {
      for (j = 0; j <= ndiags-1; j++) {
        col = i + offsets[j];
        if (col >= 0 && col < nrows)
          y[i] += A[i + j*nrows] * x[col];
      }
    }

[Figure: performance of the autotuned sparse matrix-vector product on an Intel Xeon (dual quad-core E5462, 2.8 GHz) and an NVIDIA Fermi C2070 GPU.]
A. Mametjanov, D. Lowell, C.-C. Ma, and B. Norris, "Autotuning stencil-based computations on GPUs," Proceedings of IEEE Cluster.

Key ideas
DSLs enable:
- generation of high-performance implementations targeting specific architectures;
- a larger set of transformations than general-purpose languages;
- programming productivity and portable performance;
- integration of tuned code back into legacy software.
Autotuning requires numerical optimization:
- the empirical tuning search space cannot be searched exhaustively (see the sketch just below of the brute-force sweep these tools avoid);
- ad-hoc techniques are unlikely to find a high-quality minimum.
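To make concrete what the search engine automates, the following is a minimal C sketch of the brute-force alternative: an exhaustive sweep over the same tile/unroll space used in the Tuna example above. The matrix_mult command and its -t/-u flags come from that example; the timing wrapper and the use of one wall-clock run per point as the objective are assumptions made here for illustration. This is exactly the kind of loop that Tuna and Active Harmony's parallel search algorithms replace when the space is too large to enumerate (here 60 points, versus the 25 evaluations requested with -n=25).

    /* Illustrative only: exhaustive sweep over tile = 1..10 (step 1) and
     * unroll = 2..12 (step 2), assuming a matrix_mult binary that accepts
     * -t <tile> -u <unroll> as in the Tuna example above. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double best = 1e30;
        int best_tile = -1, best_unroll = -1;
        char cmd[128];

        for (int tile = 1; tile <= 10; tile += 1) {
            for (int unroll = 2; unroll <= 12; unroll += 2) {
                snprintf(cmd, sizeof(cmd), "./matrix_mult -t %d -u %d", tile, unroll);
                double t0 = now_sec();
                if (system(cmd) != 0) continue;   /* skip failed configurations */
                double elapsed = now_sec() - t0;
                printf("tile=%2d unroll=%2d  %.3f s\n", tile, unroll, elapsed);
                if (elapsed < best) {
                    best = elapsed;
                    best_tile = tile;
                    best_unroll = unroll;
                }
            }
        }
        printf("best: tile=%d unroll=%d (%.3f s)\n", best_tile, best_unroll, best);
        return 0;
    }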
MIAMI (Machine Independent Application Models for performance Insight)
[Toolchain diagram: x86 object code is instrumented with PIN (CFGs, edge counts) and decoded with XED into the MIAMI code IR (instructions / micro-ops / registers); a machine model written in a machine description language (MDL) drives the analysis; derived information includes the loop nesting structure, dependence graphs at loop level customized for machine instruction latencies and idiom replacement, and memory reuse distance analysis yielding set-associative cache miss predictions and data reuse insight; a modulo scheduler and binutils help map metrics back to source code and data structures; outputs are performance predictions, performance limiters, and potential for performance improvement, stored as an XML performance database / tuning recipes and viewable in hpcviewer.]

Approach:
- Extract static and dynamic characteristics from optimized x86-64 application binaries: frequency of executed paths through the binary; control flow graph and loop nesting structure; instruction mixes and instruction schedule dependencies; data reuse and memory access patterns.
- Use a machine description language to define models of target architectures.
- Map the application model onto an architecture model to understand resource mismatches.

Impact:
- Automatic construction of machine-independent application performance models.
- Pinpointing and ranking of opportunities for application performance tuning.
- Computation of code transformation recipes for performance optimization.

Case study: understanding the performance bottlenecks of a naive matrix-multiply code.
Computing application tuning recipes:
- Analyze full application binaries.
- Reschedule application instructions for the target architecture; understand inefficiencies due to low ILP, failed vectorization, and resource contention.
- Perform memory reuse simulation (a minimal reuse-distance sketch appears at the end of this section); compute "loop balance" and compare it with peak bandwidth; determine whether instruction schedule inefficiencies are on the critical path.
- Analyze data reuse patterns to look for improvement opportunities; suggest code transformations.
Findings for the naive matrix multiply:
- In the absence of cache misses, the main performance bottleneck is the issue bandwidth of the load/store units: apply tiling for register reuse.
- About 98% of cache misses and 49% of TLB misses are due to long reuse within the level-3 loop.
- The level-1 loop carries most of the cache misses, which occur on accesses to array 'b': either move the level-1 loop to an inner position, or block the level-2 loop and move the loop over blocks outside the level-1 loop (an illustrative blocked loop nest appears at the end of this section).

TAU integration
Approach:
- Integrate TAU measurement into the autotuning workflow (CHiLL + Active Harmony, Orio).
- Use ROSE outlining to instrument code variants and TAU parameter profiling for specialization.
- Capture performance experiments with code variants; store data and metadata in a database (TAUdb).
- Mine the performance data and feed results directly back into the autotuning search.
- Apply machine learning to generate decision trees, and apply the decision-tree knowledge to specialize the application.
[Figures: TAU integration with Orio; TAU integration with CHiLL and Active Harmony; decision tree analysis and specialization.]

CHiLL
CHiLL is a transformation and code generation framework designed specifically for compiler-based auto-tuning; CUDA-CHiLL generates CUDA for NVIDIA GPUs. A layered design permits users to reach into the compiler's code-mapping algorithms to explore a search space of possible implementations of a computation, making it possible to achieve performance comparable to manual tuning:
- Transformation recipes describe code optimizations to be composed; a set of parameterized recipes forms a search space.
- High-level recipes can target lower-level transformation sequences implemented by an expert user, facilitating encapsulation for less-expert users.
- An underlying polyhedral code generator and transformation system robustly implements the parameterized code transformations.
[Figure: performance gains from applying CHiLL to optimize PETSc functions in the context of three scalable applications run on Hopper at NERSC.]
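For readers unfamiliar with memory reuse distance, the quantity behind MIAMI's memory reuse simulation, here is a minimal, self-contained C sketch of the classic LRU stack-distance computation over a small, made-up address trace. It is illustrative only: MIAMI's analysis operates on instrumented binaries and feeds set-associative cache miss predictions, and the trace, sizes, and function names below are this sketch's own assumptions.

    /* Minimal LRU stack-distance (reuse distance) sketch over an address trace.
     * Illustrative only; not MIAMI's implementation. */
    #include <stdio.h>

    #define MAX_DISTINCT 1024

    /* Returns the reuse distance of 'addr' (number of distinct addresses touched
     * since its previous use), or -1 for a cold first-time access, and updates
     * the LRU stack in place. */
    static int reuse_distance(unsigned long addr, unsigned long *stack, int *depth) {
        int pos = -1;
        for (int i = 0; i < *depth; i++)
            if (stack[i] == addr) { pos = i; break; }

        if (pos >= 0) {
            /* Move addr to the top; its previous position is the reuse distance. */
            for (int i = pos; i > 0; i--) stack[i] = stack[i - 1];
            stack[0] = addr;
            return pos;
        }
        /* Cold access: push on top (dropping the oldest entry if the stack is full). */
        int d = (*depth < MAX_DISTINCT) ? (*depth)++ : MAX_DISTINCT - 1;
        for (int i = d; i > 0; i--) stack[i] = stack[i - 1];
        stack[0] = addr;
        return -1;
    }

    int main(void) {
        unsigned long trace[] = { 0xA, 0xB, 0xC, 0xA, 0xB, 0xD, 0xA };
        unsigned long stack[MAX_DISTINCT];
        int depth = 0;
        for (int i = 0; i < (int)(sizeof trace / sizeof trace[0]); i++)
            printf("addr %#lx -> distance %d\n",
                   trace[i], reuse_distance(trace[i], stack, &depth));
        return 0;
    }

Short reuse distances indicate accesses likely to hit in cache; long distances (such as the reuse of array 'b' across the level-1 loop in the matrix-multiply case study) indicate data that is evicted before it is reused.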
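The loop-restructuring recommendation from the matrix-multiply case study can be illustrated with a small C sketch. The naive triple loop and the blocked variant below are this section's own illustration of the suggested direction (tiling for register and cache reuse, moving the miss-carrying loop inward); they are not the exact code or recipe produced by MIAMI, and the problem size and block size are placeholder values.

    /* Illustration of the suggested transformation: block (tile) the loops and
     * move the miss-carrying loop inward so reuse of a[] and b[] is captured in
     * cache and registers.  Sketch only; assumes N is a multiple of BS. */
    #define N  1024
    #define BS 64            /* block (tile) size; normally a tuned parameter */

    /* Naive version: c += a * b */
    void matmul_naive(const double a[N][N], const double b[N][N], double c[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* Blocked version: iterate over BSxBS tiles so the working set of a, b, and c
     * stays resident while it is being reused. */
    void matmul_blocked(const double a[N][N], const double b[N][N], double c[N][N]) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double aik = a[i][k];      /* register reuse of a[i][k] */
                            for (int j = jj; j < jj + BS; j++)
                                c[i][j] += aik * b[k][j];
                        }
    }

In practice the block sizes and loop order would themselves be exposed as tuning parameters and explored with the frameworks described above (Active Harmony, Orio, CHiLL) rather than fixed by hand.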