Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
A. Calotoiu 1), T. Hoefler 2), M. Poke 1), F. Wolf 1)
1) German Research School for Simulation Sciences  2) ETH Zurich


Felix Wolf

Analytical performance modeling

- Identify kernels: parts of the program that dominate its performance at larger scales; identified via small-scale tests and intuition
- Create models: a laborious process

Disadvantages:
- Time consuming
- Danger of overlooking unscalable code
- Still confined to a small community of skilled experts

Our approach

Generate an empirical model for each part of the program automatically:
- Run a manageable number of small-scale performance experiments
- Launch our tool
- Compare extrapolated performance to expectations

Key ideas:
- Exploit that the space of function classes underlying such models is small enough to be searched by a computer program
- Abandon model accuracy as the primary success metric and focus instead on the binary notion of scalability bugs
- Create requirements models alongside execution-time models

Usage model

Input:
- Set of performance measurements (profiles) on different processor counts {p1, …, pmax} with weak scaling
- Individual measurements broken down by program region (call path)

Output:
- List of program regions (kernels) ranked by their predicted execution time at a target scale pt > pmax
- Or ranked by growth function (pt → ∞)

Caveats:
- Not 100% accurate, but good enough to draw attention to the right kernels
- False negatives when a phenomenon at scale is not captured in the data
- False positives possible but unlikely
- Can also model parameters other than p
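To make the output concrete, here is a minimal Python sketch of the ranking step: each kernel's scaling model is evaluated at the target scale and kernels are sorted by predicted time. The kernel names echo the Sweep3D results later in the talk, but the coefficients are invented for illustration only.

```python
import math

# Hypothetical per-kernel scaling models t = f(p) in seconds.
# Coefficients are made up; only the functional shapes follow the talk.
models = {
    "sweep->MPI_Recv": lambda p: 0.03 * math.sqrt(p),
    "sweep->MPI_Send": lambda p: 0.002 * math.sqrt(p) * math.log2(p),
    "compute":         lambda p: 12.0,   # constant under weak scaling
}

def rank_kernels(models, p_target):
    """Rank program regions by predicted execution time at the target scale."""
    return sorted(models, key=lambda k: models[k](p_target), reverse=True)
```

At a target of 262,144 processes this would flag the MPI_Send kernel first, because its extra log factor eventually dominates.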

Workflow

Performance measurements → performance profiles → model generation → scaling models → performance extrapolation → ranking of kernels.
Statistical quality control closes a loop around model generation: while accuracy has not yet saturated, the models are refined (model refinement) and kernels re-examined (kernel refinement) before extrapolation.

Model generation

Performance Model Normal Form (PMNF):

f(p) = Σ_{k=1..n} c_k · p^{i_k} · log2^{j_k}(p)

- Not exhaustive, but works in most practical scenarios
- An assignment of n, i_k, and j_k is called a model hypothesis
- i_k and j_k are chosen from sets I, J ⊂ Q
- n, |I|, |J| don't have to be arbitrarily large to achieve a good fit
- Instead of deriving the model through reasoning, make reasonable choices for n, I, and J and try all assignment options one by one
- Select the winner through cross-validation
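The brute-force search over hypotheses can be sketched as follows, restricted for brevity to single-term hypotheses t ≈ c0 + c1 · p^i · log2(p)^j, fitted by closed-form least squares and scored with leave-one-out cross-validation. The exponent sets I and J below are illustrative, not the ones from the evaluation.

```python
import math
from itertools import product

I = [0, 0.5, 1, 1.5, 2, 3]  # candidate powers of p (illustrative)
J = [0, 1, 2]               # candidate powers of log2(p) (illustrative)

def fit_two_param(xs, ys):
    """Closed-form least squares for y ≈ c0 + c1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:                      # degenerate: constant predictor
        return my, 0.0
    c1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - c1 * mx, c1

def best_pmnf_term(ps, ts):
    """Try every single-term PMNF hypothesis and keep the one with the
    smallest leave-one-out cross-validation error."""
    best = None
    for i, j in product(I, J):
        terms = [p ** i * math.log2(p) ** j for p in ps]
        err = 0.0
        for k in range(len(ps)):      # leave measurement k out, predict it
            xs = terms[:k] + terms[k + 1:]
            ys = ts[:k] + ts[k + 1:]
            c0, c1 = fit_two_param(xs, ys)
            err += (c0 + c1 * terms[k] - ts[k]) ** 2
        if best is None or err < best[0]:
            best = (err, i, j)
    return best[1], best[2]
```

Given measurements that actually follow t = 2 + 0.5·√p, the search recovers the exponents (i, j) = (0.5, 0).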

Search space parameters used in our evaluation

Model refinement

- Start with a coarse approximation
- Refine to the point of statistical shrinkage, which protects against over-fitting

Loop: generate hypotheses of size n from the input data, evaluate each hypothesis via cross-validation, and compute the fit quality of the best hypothesis; while the quality still improves, increase n and repeat, otherwise emit the scaling model.
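The fit-quality symbol on the slide did not survive transcription; a plausible sketch uses the adjusted coefficient of determination, a standard guard against over-fitting that shrinks when an added term does not pay for its extra degree of freedom. Whether this matches the slide's exact metric is an assumption; the list of precomputed hypotheses is hypothetical.

```python
def r_squared(ys, preds):
    """Ordinary coefficient of determination R^2."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(ys, preds, n_params):
    """Adjusted R^2 penalizes extra model terms."""
    n = len(ys)
    return 1.0 - (1.0 - r_squared(ys, preds)) * (n - 1) / (n - n_params - 1)

def refine(ys, hypotheses):
    """hypotheses: list of (n_params, predictions) ordered by growing size.
    Returns the index of the hypothesis chosen before quality shrinks."""
    best_i = 0
    best_q = adjusted_r_squared(ys, hypotheses[0][1], hypotheses[0][0])
    for i in range(1, len(hypotheses)):
        q = adjusted_r_squared(ys, hypotheses[i][1], hypotheses[i][0])
        if q <= best_q:          # statistical shrinkage: stop refining
            break
        best_i, best_q = i, q
    return best_i
```

A larger hypothesis with the same residuals scores strictly worse, so refinement stops at the smaller one.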

Requirements modeling

Alongside time, model the requirements of each program region:
- Computation: FLOPS, loads, stores
- Communication: P2P, collective, …

Disagreement between time and requirements models may be indicative of wait states.

Sweep3D

- Solves a neutron transport problem
- 3D domain mapped onto a 2D process grid
- Parallelism achieved through pipelined wave-front processing
- LogGP model for communication developed by Hoisie et al.

Sweep3D (2)

Kernel models (t = f(p), fitted from runs with p_i ≤ 8k) and predictive error [%] at p_t = 262k:
- sweep->MPI_Recv: ~√p (error 5.10)
- sweep; global_int_sum->MPI_Allreduce: ~√p + 0.03·√p·log(p)
- sweep->MPI_Send: ~√p·log(p) (error 15.40)
- source: ~10^-5·log(p) (error 0.01)
- #bytes = const., #msg = const.

Additional columns on the slide: runtime share [%] at p_t = 262k and increase t(p=262k)/t(p=64) per kernel.

Sweep3D (3)

HOMME

- Core of the Community Atmosphere Model (CAM)
- Spectral element dynamical core on a cubed-sphere grid

Kernel models (t = f(p), p_i ≤ 15k) and predictive error [%] at p_t = 130k:
- Box_rearrange->MPI_Reduce: ~√p³
- Vlaplace_sphere_vk: ~√p
- …
- Compute_and_apply_rhs

HOMME (with input scales extended)

Kernel models (t = f(p), p_i ≤ 43k) and predictive error [%] at p_t = 130k:
- Box_rearrange->MPI_Reduce: ~3.6·√p³
- Vlaplace_sphere_vk: ~√p
- …
- Compute_and_apply_rhs

HOMME (2)

Two issues:
- The number of iterations inside a subroutine grows with p². A ceiling hides this for p up to and including 15k; developers were aware of this issue and had developed a work-around.
- The time spent in a reduce function grows with p³. Previously unknown; the function is invoked during initialization to funnel data to dedicated I/O processes. Execution time at 183k processes is ~2h, predictive error ~40%.

The G8 Research Councils Initiative on Multilateral Research Funding: Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues

Cost of first prediction

Assumptions:
- Input experiments at scales {2^0, 2^1, 2^2, …, 2^m}
- Target scale at 2^(m+k)
- Application scales perfectly

Under these assumptions, the combined cost of all input experiments stays below 2^(1-k) of a single target-scale run: less than 100% for k = 1, 50% for k = 2, 25% for k = 3, and 12.5% for k = 4.

Jitter may require more experiments per input scale, but to be conclusive, experiments at the target scale would have to be repeated as well.

A second prediction comes for free.
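The arithmetic behind these percentages can be sketched as follows, assuming perfect weak scaling (constant run time, so the core-hour cost of a run is proportional to its processor count); the helper name is invented for illustration.

```python
def relative_input_cost(m, k):
    """Total cost of input experiments at scales 2^0 .. 2^m, as a fraction
    of one experiment at the target scale 2^(m+k). Under perfect weak
    scaling, cost is proportional to processor count."""
    input_cost = sum(2 ** i for i in range(m + 1))   # = 2^(m+1) - 1
    target_cost = 2 ** (m + k)
    return input_cost / target_cost
```

For example, with inputs up to 2^10 processes and a target 16x larger (k = 4), all input runs together cost just under 12.5% of a single target-scale run.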

HOMME (3)

Conclusion

- Automated performance modeling is feasible
- Generated models are accurate enough to identify scalability bugs, or to show their absence with high probability
- The advantages of mass production apply to performance models, too:
  - Approximate models are acceptable as long as the effort to create them is low and they do not mislead the user
  - Code coverage is as important as model accuracy

Future work:
- Study the influence of further hardware parameters
- More efficient traversal of the search space (allows more model parameters)
- Integration into Scalasca

Acknowledgement

Alexandru Calotoiu, German Research School for Simulation Sciences
Torsten Hoefler, ETH Zurich
Marius Poke, German Research School for Simulation Sciences

Paper: Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. In Proc. of the ACM/IEEE Supercomputing Conference (SC13).