SUPER: SciDAC-3 Institute for Sustained Performance, Energy, and Resilience. Bob Lucas, University of Southern California, Sept. 23, 2011

Presentation transcript:

SUPER: SciDAC-3 Institute for Sustained Performance, Energy, and Resilience
Bob Lucas, PI, University of Southern California
Argonne PI: Boyana Norris
Sept. 23, 2011
Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program, funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research.

SciDAC Performance Efforts
• SciDAC has always had an effort focused on performance
• Performance Evaluation Research Center (PERC)
– Benchmarking, modeling, and understanding
• Performance Engineering Research Institute (PERI)
– Performance engineering, modeling, and engagement
– Three SciDAC-e projects
• Institute for Sustained Performance, Energy, and Resilience (SUPER)
– Performance engineering
– Energy minimization
– Resilient applications
– Optimization of the above

Outline
• SUPER Team
• Research Directions
• Broader Engagement

SUPER Team
• ORNL: Gabriel Marin, Philip Roth, Patrick Worley
• LBNL: David Bailey, Lenny Oliker, Sam Williams
• LLNL: Bronis de Supinski, Daniel Quinlan
• UCSD: Laura Carrington, Ananta Tiwari
• UMD: Jeffrey Hollingsworth
• UTK: Dan Terpstra
• Oregon: Allen Malony, Sameer Shende
• Utah: Mary Hall
• ANL: Paul Hovland, Boyana Norris, Stefan Wild
• USC: Jacque Chame, Pedro Diniz, Bob Lucas (PI)
• UNC: Rob Fowler, Allan Porterfield
• UTEP: Shirley Moore

Broadly Based Effort
• All PIs have independent research projects
– SUPER funding alone isn't enough to support any of its investigators
– SUPER leverages other work and funding
• SUPER's contribution is integration, producing results beyond the reach of any one group
– Follows the successful PERI model (tiger teams and autotuning)
– Collaboration extends to others with similar research goals:
John Mellor-Crummey at Rice is an active collaborator
Other likely collaborators include LANL, PNNL, Portland State, and UT San Antonio
Perhaps Juelich and Barcelona too

SUPER Outline SUPER Team Research Directions Broader Engagement SUPER

SUPER Research Objectives
• Performance optimization of DOE applications
– Automatic performance tuning
– Energy minimization
– Resilient computing
– Optimization of the above
• Collaboration with SciDAC-3 Institutes
• Engagement with SciDAC-3 applications

Optimization of Particle-to-Grid Methods on Multicore Systems (Oliker and Williams, also supported by XStack)
Goals and Objectives
• Particle-to-grid methods are very common, e.g., in fusion, astrophysics, chemistry, and fluid mechanics; their performance on modern multicores is challenged by memory and bandwidth constraints. Our objective is performance enhancement of particle-to-grid applications, working in the context of a heart simulation and a fusion code.
• Develop multicore-specific algorithms for a broad spectrum of particle-mesh applications.
• Explore optimizations that must contend with complex data hazards and irregular data access.
• Demonstrate a unique synchronization and decomposition approach that improves performance on many multicore platforms.
Progress & Accomplishments
• Our methodology substantially reduced memory usage while simultaneously improving performance.
• Our application-specific combination of optimizations achieved 4x and 5.6x improvements for the heart and fusion particle-to-grid simulations, respectively.
• This work is a critical step toward designing autotuning frameworks that will extend the results to numerous particle methods and platforms.
• Published in IEEE Transactions on Parallel and Distributed Systems (TPDS).
[Figure: speedup of the optimized Heart Code and GTC versus the reference parallel code on Nehalem, Opteron, and Niagara platforms.]
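The data hazard named above is easiest to see in a toy deposition loop. The following is an illustrative sketch, not the paper's algorithm: many particles update the same grid cell, so a naive parallel loop races; giving each thread a private copy of the grid that is reduced at the end removes the race, which is the flavor of synchronization/decomposition tradeoff the slide describes.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NGRID 1024
#define NPART 100000

int main(void) {
    static double grid[NGRID];
    static double pos[NPART], q[NPART];
    for (int p = 0; p < NPART; p++) {            /* synthetic particles */
        pos[p] = (double)rand() / RAND_MAX * (NGRID - 1);
        q[p] = 1.0;
    }
    /* Naive "grid[cell] += q[p]" in parallel is a data race because many
       particles land in the same cell. An array reduction gives each
       thread a private grid that is summed at the end (OpenMP 4.5+). */
#pragma omp parallel for reduction(+:grid[:NGRID])
    for (int p = 0; p < NPART; p++) {
        int cell = (int)pos[p];
        grid[cell] += q[p];
    }
    double total = 0.0;
    for (int i = 0; i < NGRID; i++) total += grid[i];
    printf("total deposited charge = %g (expect %d)\n", total, NPART);
    return 0;
}

Grid replication trades memory for synchronization; the TPDS work cited above explores that tradeoff far more carefully across platforms.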

Performance Engineering Research
• Measurement and monitoring
– Adopting the University of Oregon's TAU system
– Building on UTK's PAPI measurement library
– Also collaborating with Rice and its HPCToolkit
• Performance database
– Extending TAUdb to enable online collection and analysis
• Performance modeling
– PBound and roofline models to bound performance expectations
– MIAMI to model the impact of architectural variation
– PSINS to model communication
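As a small illustration of the measurement layer, the sketch below reads two hardware counters around a kernel using PAPI's low-level C API. Event availability (e.g., PAPI_FP_OPS) is platform-dependent; check with the papi_avail utility.

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    long long counts[2];
    int evset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    PAPI_create_eventset(&evset);
    /* Floating-point operations and total cycles; substitute events
       that exist on your machine if these are unavailable. */
    PAPI_add_event(evset, PAPI_FP_OPS);
    PAPI_add_event(evset, PAPI_TOT_CYC);

    PAPI_start(evset);
    for (int i = 0; i < N; i++)            /* kernel under measurement */
        a[i] = 2.0 * b[i] + 1.0;
    PAPI_stop(evset, counts);

    printf("FP ops: %lld, cycles: %lld\n", counts[0], counts[1]);
    return 0;
}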

Generating Performance Models with PBound
• Use source code analysis to generate performance bounds (best- and worst-case scenarios)
• Can be used for:
– Understanding performance problems on current architectures, for example when presented in the context of roofline models (introduced by Sam Williams, LBNL)
– Projecting performance onto hypothetical architectures

Example Roofline Model: Locality Walls
• Remember, memory traffic includes more than just compulsory misses.
• As such, actual arithmetic intensity may be substantially lower.
• Walls are unique to the architecture-kernel combination.
AI = FLOPs / (Compulsory + Capacity + Conflict + Write-allocation traffic)
[Figure: roofline for an Opteron 2356 (Barcelona), attainable GFLOP/s versus actual FLOP:byte ratio. Compute ceilings: peak DP, then successively without SIMD, with mul/add imbalance, and without ILP; bandwidth ceilings: Stream Bandwidth, then without SW prefetch and without NUMA. Vertical "locality walls" mark the arithmetic intensity with only compulsory miss traffic, then adding write-allocation, capacity-miss, and conflict-miss traffic; modeled versus observed performance is overlaid.]
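For reference, the quantitative model behind the figure is the standard roofline bound (Williams et al.): attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity, with the locality walls entering through the traffic terms in the denominator of AI.

\[
P_{\mathrm{attainable}} \;=\; \min\!\bigl(P_{\mathrm{peak}},\; \mathrm{AI}\times B_{\mathrm{peak}}\bigr),
\qquad
\mathrm{AI} \;=\; \frac{\mathrm{FLOPs}}{\mathrm{Compulsory} + \mathrm{Write\text{-}allocation} + \mathrm{Capacity} + \mathrm{Conflict\ bytes}}
\]

Each additional traffic term moves the wall to the left, which is why the measured FLOP:byte ratio can sit well below the compulsory-miss estimate.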

Sample PBound Output
#include "pbound_list.h"

/* PBound-generated variant: in place of the loop body, a logging call
   records the operation counts PBound derived for axpy4. The name and
   leading argument of that call were lost in transcription; the
   surviving count expressions are preserved verbatim below. */
void axpy4(int n, double *y, double a1, double *x1, double a2, double *x2,
           double a3, double *x3, double a4, double *x4) {
#ifdef pbound_log
  /* ...( ? * ((n - 1) + 1) + 4,
         1 * ((n - 1) + 1),
         3 * ((n - 1) + 1) + 1,
         4 * ((n - 1) + 1) ); */
#endif
}

/* Original kernel analyzed by PBound: */
void axpy4(int n, double *y, double a1, double *x1, double a2, double *x2,
           double a3, double *x3, double a4, double *x4) {
  register int i;
  for (i = 0; i <= n - 1; i++)
    y[i] = y[i] + a1*x1[i] + a2*x2[i] + a3*x3[i] + a4*x4[i];
}

Autotuning
• Led by Mary Hall, University of Utah
• Extend the PERI autotuning system for future architectures
– New TAU front end for triage
– CUDA-CHiLL and Orio to target GPUs
– OpenMP-CHiLL for SMP multicores
– Active Harmony provides the search engine (see the sketch after this slide):
drives empirical autotuning experimentation;
balances threads and MPI ranks in hybrid OpenMP/MPI codes;
extends to surface-to-volume (halo size) experiments
– Targeted autotuning:
domain-specific languages; users write simple code and leave the tuning to us
– Whole-program autotuning:
parameters, algorithm choice, libraries linked, etc.
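As a concrete picture of what "drives empirical autotuning experimentation" means, here is a minimal sketch of the outer search loop in C. The variant_time() stand-in is hypothetical: a real driver compiles and runs a generated code variant and returns its wall time, and Active Harmony replaces this exhaustive sweep with a smarter parallel search.

#include <stdio.h>

/* Hypothetical stand-in cost model so the sketch runs; a real driver
   would build and execute one generated variant per (tile, unroll). */
static double variant_time(int tile, int unroll) {
    return 1.0 / tile + 0.01 * unroll + (tile > 64 ? 0.002 * tile : 0.0);
}

int main(void) {
    int best_tile = 0, best_unroll = 0;
    double best = 1e30;
    /* Enumerate candidate parameters, time each variant, keep the best. */
    for (int tile = 8; tile <= 128; tile *= 2)
        for (int unroll = 1; unroll <= 8; unroll *= 2) {
            double t = variant_time(tile, unroll);
            if (t < best) { best = t; best_tile = tile; best_unroll = unroll; }
        }
    printf("best: tile=%d unroll=%d (%.4f s)\n", best_tile, best_unroll, best);
    return 0;
}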

Autotuning PETSc Solvers on GPUs with Orio (B. Norris and A. Mametjanov, Argonne)
Objectives
• Generate high-performance implementations of sparse matrix algebra kernels defined using a high-level syntax
• Optimize performance automatically on different architectures, including cache-based multicore CPUs and heterogeneous CPU/GPU systems
• Integrate autotuning into the Portable, Extensible Toolkit for Scientific Computation (PETSc)
• Efficiently search the space of possible optimizations
Impact
• Optimizations for multicore CPUs and GPUs from the same high-level input
• Improved code readability and maintainability
• Performance of autotuned kernels typically exceeds that of tuned vendor numerical library implementations
Accomplishments (FY2012)
• Developed new transformations for sparse matrix operations targeting GPUs
• Autotuning yielded improvements of 1.5x to 2x over tuned vendor libraries on different GPUs for a solid fuel ignition PETSc application
• "Autotuning stencil-based computations on GPUs." A. Mametjanov, D. Lowell, C.-C. Ma, and B. Norris. Proceedings of IEEE Cluster.
[Figure: autotuning a structured-grid PDE application (solid fuel ignition) using PETSc for GPUs; lower is better. CPU: Intel Xeon E5462, 2.8 GHz (8 cores); GPU: NVIDIA Fermi C2070.]
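To ground the slide, here is the kind of loop meant by "sparse matrix algebra kernels": a plain CSR sparse matrix-vector product in C. Orio consumes such loops, annotated with tuning directives (the annotation syntax is omitted here rather than guessed at), and generates and times many transformed variants for CPU or GPU.

#include <stdio.h>

/* y = A*x for a matrix in compressed sparse row (CSR) form.
   rowptr[i]..rowptr[i+1]-1 index the nonzeros of row i. */
void spmv_csr(int nrows, const int *rowptr, const int *colidx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colidx[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* 3x3 example: [[4,1,0],[1,4,1],[0,1,4]] times [1,1,1] */
    int rowptr[] = {0, 2, 5, 7};
    int colidx[] = {0, 1, 0, 1, 2, 1, 2};
    double val[] = {4, 1, 1, 4, 1, 1, 4};
    double x[] = {1, 1, 1}, y[3];
    spmv_csr(3, rowptr, colidx, val, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  /* [5, 6, 5] */
    return 0;
}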

Pflotran Autotuning Example: PETSc Triangular Solve Kernel (10/18/12)

Compiler   | Original | Active Harmony: Time, (u1,u2), Speedup | Exhaustive: Time, (u1,u2), Speedup
pathscale  | …        | …, (3,11), …                           | …, (3,15), 1.93
gnu        | …        | …, (5,13), …                           | …, (5,7), 1.54
pgi        | …        | …, (5,3), …                            | …, (5,3), 1.70
cray       | …        | …, (15,5), …                           | …, (15,15), 1.63

Autotuning Framework
[Diagram of the integrated autotuning toolchain: TAU (Oregon), HPCToolkit (Rice), MIAMI (ORNL), and PAPI (UTK) for measurement and triage; ROSE (LLNL), CHiLL (USC/ISI and Utah), Orio (Argonne), and OSKI (LBNL) for code transformation; Active Harmony (UMD) and GCO (UTK) for search; PerfTrack (LBNL, SDSC, RENCI) as the performance database.]

Energy Minimization
• Led by Laura Carrington, University of California, San Diego
• Develop new energy-aware APIs for users
– e.g., "I know which processor is on the critical path in my multifrontal code"
• Obtain more precise data regarding energy consumption
– Extend PAPI to sample hardware power monitors (see the sketch below)
– Build a new generation of PowerMon devices
– Extend performance models
• Transform codes to minimize energy consumption
• Inform systems so they can exploit DVFS (dynamic voltage and frequency scaling)
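The PAPI extension mentioned above corresponds to PAPI's RAPL component (also named on the Tool Integration slide later in the deck). Below is a minimal sketch of metering a code region's package energy with it; the event name is machine-specific (list with papi_native_avail), so treat it as an example rather than a portable identifier, and note the RAPL component must be built in and readable by the user.

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long nj;  /* RAPL reports energy in nanojoules */

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    PAPI_create_eventset(&evset);
    /* Package-level energy counter; the exact name varies by machine. */
    if (PAPI_add_named_event(evset, "rapl:::PACKAGE_ENERGY:PACKAGE0") != PAPI_OK)
        exit(1);

    PAPI_start(evset);
    double s = 0.0;
    for (long i = 0; i < 100000000L; i++)   /* code region to meter */
        s += (double)i * 1e-9;
    PAPI_stop(evset, &nj);

    printf("sum=%g, package energy ~ %.3f J\n", s, nj * 1e-9);
    return 0;
}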

Green Queue: Customized Large-Scale Clock Frequency Scaling (Carrington and Tiwari)
Overall Objectives
• The power wall is one of the major factors impeding the development and deployment of exascale systems.
• Solutions are needed that reduce energy from both the hardware and the software side.
• Solutions must be easy to use, fully automated, scalable, and application-aware.
Research Plan
• Develop static and runtime application analysis tools to quantify application phases that affect both system-wide power draw and performance.
• Use the properties of these application phases to develop models of the power draw.
• Combine power and performance models to create phase-specific Dynamic Voltage-Frequency Scaling (DVFS) settings.
Progress & Accomplishments
• Developed Green Queue, a framework that can make CPU clock frequency changes based on intra-node (e.g., memory-latency-bound) and inter-node (e.g., load imbalance) aspects of an application.
• Deployed Green Queue on SDSC's Gordon supercomputer using one full rack (1,024 cores) and a set of scientific applications.
• Green Queue was shown to reduce the operational energy required for HPC applications by an average of 12% (maximum 32%).
• Deployed a test system for SUPER team members to facilitate collaboration on energy-efficiency research and tool integration.
[Figure: power monitoring during application runs; energy savings in parentheses: MILC (6%), GTC (5%), PSCYEE (19%), FT (20.5%), LAMMPS (32%).]
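The slide does not show Green Queue's mechanism, but on stock Linux the underlying knob is a sysfs write, sketched below. This illustrates only the mechanism a DVFS framework builds on, not Green Queue's implementation; it assumes root access and the userspace cpufreq governor, and paths and available frequencies are machine-specific (see scaling_available_frequencies).

#include <stdio.h>

/* Set one core's frequency via the Linux cpufreq sysfs interface. */
static int set_khz(int cpu, long khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

int main(void) {
    /* e.g., drop core 0 to 1.6 GHz for a memory-bound phase */
    if (set_khz(0, 1600000) != 0)
        perror("scaling_setspeed");
    return 0;
}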

Resilient Computing
• Led by Bronis de Supinski, Lawrence Livermore National Laboratory
• Investigate a directive-based API for users
– Enables users to express their knowledge with respect to resilience
– Not all faults are fatal errors
– Those that can't be tolerated can often be ameliorated
• Automate vulnerability assessment
– Modeled on the success of PERI autotuning
– Conduct fault injection experiments
– Determine which code regions or data structures fail catastrophically
– Determine what transformations enable them to survive
• In either event, extend the ROSE compiler to implement the transformations

Fault Resilience of an Algebraic Multigrid (AMG) Solver (Casas Guix and de Supinski)
General AMG Features
• AMG solves linear systems of equations derived from the discretization of partial differential equations.
• AMG is an iterative method that operates on nested grids of varying refinement.
• Two operators (restriction and interpolation) propagate the linear system through the different grids.
Goals of the SUPER Resilience Effort
• Identify vulnerable data and code regions.
• Design and implement simple and effective resilience strategies to reduce the vulnerability of sensitive pieces of code.
• Long term: develop a general methodology to automatically improve the reliability of generic HPC codes.
Progress & Accomplishments
• Developed a methodology to automatically inject faults to assess the vulnerability of codes to soft errors.
• Performed a vulnerability study of AMG.
• Determined that AMG is most vulnerable to soft errors in pointer arithmetic, which lead to fatal segmentation faults.
• Demonstrated that triple modular redundancy in pointer calculations reduces the vulnerability of AMG to soft errors (see the sketch below).
• Presented at ICS 2012.
[Figure: in three different experiments, increasing pointer replication in AMG reduces the number of fatal segmentation faults.]
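The pointer-protection result can be illustrated in a few lines of C. This is a sketch of the triple-modular-redundancy idea named on the slide, not the instrumentation actually generated: the address is computed three times and majority-voted before the dereference, so a soft error in one replica is outvoted.

#include <stdio.h>

/* Majority vote over three replicas of an address computation. If a
   soft error corrupts one replica, the other two still agree. */
static double *vote3(double *a, double *b, double *c) {
    if (a == b || a == c) return a;
    if (b == c) return b;
    return a;  /* no majority: a real system would flag this case */
}

int main(void) {
    double row[8] = {0};
    int i = 5;
    /* Each replica would be computed independently in instrumented
       code, so a single bit flip cannot hit all three. */
    double *p1 = row + i, *p2 = row + i, *p3 = row + i;
    double *p = vote3(p1, p2, p3);
    *p = 3.14;
    printf("row[5] = %g\n", row[5]);
    return 0;
}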

Optimization
• Led by Paul Hovland, Argonne National Laboratory
• Performance, energy, and resilience are interrelated and require simultaneous optimization
– E.g., processor pairing covers soft errors but halves throughput
• The result is a stochastic, mixed-integer, nonlinear, multi-objective optimization problem
• Only a small portion of the search space can be sampled
– Requires efficient derivative-free numerical optimization algorithms
– Need to adapt algorithms from the continuous to the discrete autotuning domain
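One way to formalize the problem on this slide (the notation is mine, not SUPER's): let x range over the discrete tuning parameters and code variants, and take expectations because both measurements and fault outcomes are stochastic.

\[
\min_{x \in \mathcal{X} \subset \mathbb{Z}^{n}} \;\bigl(\,\mathbb{E}[T(x)],\; \mathbb{E}[E(x)],\; \mathbb{E}[V(x)]\,\bigr)
\]

Here T(x) is execution time, E(x) energy, and V(x) a vulnerability measure. None of the three is available in closed form; every evaluation is an empirical run, which is why derivative-free methods that tolerate noise and integer variables are required.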

Optimization: Time vs. Energy
• Conventional wisdom holds that the best strategy for minimizing energy is to minimize execution time
• In practice, however, real tradeoffs between time and energy are observed
– They depend on the computation being performed
– They depend on the problem size

Outline
• SUPER Team
• Research Directions
• Broader Engagement

SciDAC-3 Application Partnerships
• Collaboration with SciDAC Application Partnerships is expected
– Yet SUPER funding is spread very thin
• SUPER investigators are included in 12 Application Partnerships
– Our time costs money, like everybody else's
• Principles used to arrange teams:
– Technical needs of the proposal
– Familiarity of the people
– Proximity

Application Engagement
• Led by Pat Worley, Oak Ridge National Laboratory
• PERI strategy: proactively identify application collaborators
– Based on a comprehensive application survey at the beginning of SciDAC-2
– Exploited proximity and long-term relationships
• SUPER strategy: broaden our reach
– Key is partnering with staff at the ALCF, OLCF, and NERSC
– Augment TAUdb to capture data from applications at the centers
– Initial interactions between Oregon and ORNL and the LCFs
– Collaborate with other SciDAC-3 investigators
• Focused engagement as requested by DOE

SUPER Summary
• Research components
– Automatic performance tuning, with a new focus on portability
– Addressing the "known unknowns": energy minimization and resilient computing
– Optimization of the above
• Near-term impact on DOE computational science applications
– Application engagement coordinated with the ALCF, OLCF, and NERSC
– Tool integration, making research artifacts more approachable
– Participation in SciDAC-3 Application Partnerships
– Outreach and tutorials

Bonus Slides

Outreach and Tutorials
• Led by David Bailey, Lawrence Berkeley National Laboratory
• We will offer training to ALCF, OLCF, and NERSC staff
– Facilitate limited deployment of our research artifacts
• We will organize tutorials for end users of our tools
– Offer them at widely attended forums such as the SC conferences
• UTK is hosting a Web site; UTEP is maintaining it
• UTEP is coordinating quarterly newsletters

SciDAC-3 Institute Collaboration
• PERI was directed not to work with the math and CS institutes
• PERI nevertheless found itself tuning math libraries
– E.g., PETSc kernels are computational bottlenecks in PFLOTRAN
• SciDAC-3 is not SciDAC-2; collaboration is encouraged
• Initial discussions with FASTMath and QUEST
– Optimize math libraries with FASTMath
– Optimize end-to-end workflows with QUEST

Participation in SciDAC-3 Application Partnerships
• BER: Applying Computationally Efficient Schemes for BioGeochemical Cycles (ORNL)
• BER: Multiscale Methods for Accurate, Efficient, and Scale-Aware Models of the Earth System (LBNL)
• BER: Predicting Ice Sheet and Climate Evolution at Extreme Scales (LBNL & ORNL)
• BES: Developing Advanced Methods for Excited State Chemistry in the NWChem Software Suite (LBNL)
• BES: Optimizing Superconductor Transport Properties through Large-Scale Simulation (ANL)
• BES: Simulating the Generation, Evolution and Fate of Electronic Excitations in Molecular and Nanoscale Materials with First Principles Methods (LBNL)
• FES: Partnership for Edge Plasma Simulation (ORNL)
• FES: Plasma Surface Interactions (ORNL)
• HEP: Community Petascale Project for Accelerator Science and Simulation (ANL)
• NP: A MultiScale Approach to Nuclear Structure and Reactions (LBNL)
• NP: Computing Properties of Hadrons, Nuclei and Nuclear Matter from QCD (UNC)
• NP: Nuclear Computational Low Energy Initiative (ANL)

Autotuning the Central SMG2000 Kernel
Outlined code (from the ROSE outliner):

for (si = 0; si < stencil_size; si++)
  for (kk = 0; kk < hypre__mz; kk++)
    for (jj = 0; jj < hypre__my; jj++)
      for (ii = 0; ii < hypre__mx; ii++)
        rp[((ri + ii) + (jj * hypre__sy3)) + (kk * hypre__sz3)] -=
          Ap_0[((ii + (jj * hypre__sy1)) + (kk * hypre__sz1)) +
               ((A->data_indices)[i])[si]] *
          xp_0[((ii + (jj * hypre__sy2)) + (kk * hypre__sz2)) +
               (*dxp_s)[si]];

CHiLL transformation recipe:
permute([2,3,1,4])
tile(0,4,TI)
tile(0,3,TJ)
tile(0,3,TK)
unroll(0,6,US)
unroll(0,7,UI)

Constraints on search:
0 ≤ TI, TJ, TK ≤ 122
0 ≤ UI ≤ 16
0 ≤ US ≤ 10
compilers ∈ {gcc, icc}

Search space: 122 x 122 x 122 x 16 x 10 x 2 = 581,071,360 points
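To make the recipe concrete, here is what tile and unroll do to a simple loop, hand-applied in C. CHiLL performs the same rewriting mechanically on the kernel above, leaving TI and UI as search parameters; for brevity this sketch assumes N divides evenly by TI and TI by UI (real generated code adds cleanup loops), and hard-codes the unrolled body for UI=4.

#include <stdio.h>
#define N 1024

int main(void) {
    static double a[N], b[N];
    int TI = 64, UI = 4;    /* tile size and unroll factor, chosen by search */

    /* Original loop: for (i = 0; i < N; i++) a[i] += b[i]; */
    for (int it = 0; it < N; it += TI)             /* tile(..., TI)   */
        for (int i = it; i < it + TI; i += UI) {   /* unroll(..., UI) */
            a[i]     += b[i];
            a[i + 1] += b[i + 1];
            a[i + 2] += b[i + 2];
            a[i + 3] += b[i + 3];
        }
    printf("a[0] = %g\n", a[0]);
    return 0;
}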

Collaboration with NNSA
• ParaDiS models dislocation lines in a physical volume
– The volume is divided into several computational domains
– The dislocations interact and move in response to physical forces
– External stress and inter-dislocation interactions determine the forces
• Evaluating accuracy and performance tradeoffs of the physical model
– Reducing SMN execution frequency improves performance
– But it reduces the quality of the simulation

SMG2000 Autotuning Result
• Selected parameters: TI=122, TJ=106, TK=56, UI=8, US=3, compiler=gcc
• Performance gain on the residual computation: 2.37x
• Performance gain on the full application: 27.23% improvement
• The parallel search evaluates 490 points and converges in 20 steps

Tool Integration
• Led by Al Malony, University of Oregon
• TAU replaces HPCToolkit as the primary triage tool
• TAUdb replaces the PERI performance database
• New tools to enable performance portability
– CUDA-CHiLL and OpenMP-CHiLL
– PAPI components for GPUs, BG/Q, RAPL, and MIC
• Integration of the autotuning framework and TAUmon
– Enable online autotuning
– Already using online binary patching for empirical tuning experiments

Meeting Schedule
Oregon: Sept. 2011
UNC RENCI: March 29-30, 2012
ANL: Sept. 25-26, 2012
Rice: March 2013
LBNL: Sept. 2013
Utah: Feb. 2014
Maryland: Sept. 2014
UCSD SDSC: March 2015
ORNL & UTK: Sept. 2016
USC ISI: March 2017

Management Structure
• Overall management: Bob Lucas and David Bailey
• Distributed leadership of research
– Follows the PERI model, adapting as needed
• Weekly project teleconferences
– Tuesdays, 2 PM Eastern
– Wednesdays, noon Eastern
– The first Wednesday of the month covers management topics
– Technically focused otherwise
• Regular face-to-face project meetings
– Monday mornings at SC
– All hands, twice per year, with each institution taking a turn hosting, which allows students and staff to attend at least once