Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes

A. Calotoiu (1), T. Hoefler (2), M. Poke (1), F. Wolf (1)
1) German Research School for Simulation Sciences
2) ETH Zurich
Felix Wolf
Analytical performance modeling

Identify kernels
  Parts of the program that dominate its performance at larger scales
  Identified via small-scale tests and intuition
Create models
  Laborious process
  Still confined to a small community of skilled experts
Disadvantages
  Time-consuming
  Danger of overlooking unscalable code
Our approach

Generate an empirical model for each part of the program automatically
  Run a manageable number of small-scale performance experiments
  Launch our tool
  Compare extrapolated performance to expectations
Key ideas
  Exploit that the space of function classes underlying such models is small enough to be searched by a computer program
  Abandon model accuracy as the primary success metric and rather focus on the binary notion of scalability bugs
  Create requirements models alongside execution-time models
Usage model

Input
  Set of performance measurements (profiles) on different processor counts {p_1, ..., p_max} with weak scaling
  Each measurement broken down by program region (call path)
Output
  List of program regions (kernels) ranked by their predicted execution time at a target scale p_t > p_max
  Or ranked by their growth function (p_t -> infinity)

Not 100% accurate, but good enough to draw attention to the right kernels
  False negatives when a phenomenon at scale is not captured in the data
  False positives possible but unlikely
Can also model parameters other than p
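The ranking step above can be sketched in a few lines. This is a hypothetical illustration of the output stage, not the tool's actual interface; the kernel names and model functions below are invented placeholders standing in for fitted scaling models.

```python
import math

def rank_kernels(models, p_target):
    """Evaluate each kernel's scaling model at the target scale and
    return (kernel, predicted_time) pairs, largest predicted time first."""
    predictions = {name: f(p_target) for name, f in models.items()}
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder models: one constant kernel and two growing ones.
models = {
    "sweep":          lambda p: 582.0,                        # constant
    "MPI_Recv":       lambda p: 4.0 * math.sqrt(p),           # grows with sqrt(p)
    "MPI_Allreduce":  lambda p: 1.0 * math.sqrt(p) * math.log2(p),
}

ranking = rank_kernels(models, p_target=262144)
```

A kernel that is cheap at small scale but grows fast (here the collective) rises to the top of the ranking, which is exactly the "scalability bug" signal the tool is after.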
Workflow

Performance measurements -> performance profiles
Model generation -> scaling models
Statistical quality control: accuracy saturated?
  No -> model refinement -> model generation (repeat)
  Yes -> performance extrapolation -> ranking of kernels
Kernel refinement feeds back into model generation
Model generation

Performance Model Normal Form (PMNF):

  f(p) = sum_{k=1}^{n} c_k * p^(i_k) * log2(p)^(j_k)

Not exhaustive, but works in most practical scenarios
An assignment of n, i_k, and j_k is called a model hypothesis
The i_k and j_k are chosen from sets I, J, subsets of the rationals
n, |I|, |J| do not have to be arbitrarily large to achieve a good fit
Instead of deriving the model through reasoning, make reasonable choices for n, I, and J and try all assignment options one by one
Select the winner through cross-validation
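A minimal sketch of this search for hypothesis size n = 1: try every (i, j) pair from small candidate sets I and J, fit c_0 + c_1 * p^i * log2(p)^j by least squares, and keep the hypothesis with the smallest leave-one-out cross-validation error. The particular sets I and J below are assumptions for illustration, not the paper's evaluated configuration.

```python
import math
import numpy as np

I = [0.5, 1.0, 1.5, 2.0]   # candidate exponents for p (assumed)
J = [0, 1, 2]              # candidate exponents for log2(p) (assumed)

def fit(ps, ts, i, j):
    """Least-squares fit of t ~ c0 + c1 * p^i * log2(p)^j; returns (c0, c1)."""
    X = np.column_stack([np.ones(len(ps)),
                         [p**i * math.log2(p)**j for p in ps]])
    coef, *_ = np.linalg.lstsq(X, np.asarray(ts), rcond=None)
    return coef

def cv_error(ps, ts, i, j):
    """Leave-one-out cross-validation error of the (i, j) hypothesis."""
    err = 0.0
    for k in range(len(ps)):
        train = [x for x in range(len(ps)) if x != k]
        c0, c1 = fit([ps[x] for x in train], [ts[x] for x in train], i, j)
        err += (c0 + c1 * ps[k]**i * math.log2(ps[k])**j - ts[k]) ** 2
    return err

def best_hypothesis(ps, ts):
    return min(((i, j) for i in I for j in J),
               key=lambda ij: cv_error(ps, ts, *ij))

# Synthetic profiles generated from 2 + 0.5*sqrt(p); the search should
# recover the sqrt(p) hypothesis, i.e. (i, j) = (0.5, 0).
ps = [64, 128, 256, 512, 1024]
ts = [2 + 0.5 * math.sqrt(p) for p in ps]
```

Because the candidate sets are small, trying all assignments is cheap, which is the key observation that makes the search feasible for a computer program.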
Search space parameters used in our evaluation
Model refinement

Start with a coarse approximation
Refine to the point of statistical shrinkage, which protects against over-fitting

Loop: hypothesis generation (hypothesis size n) -> hypothesis evaluation via cross-validation -> computation of the adjusted R^2 for the best hypothesis -> if accuracy still improves, increase n and repeat; otherwise emit the scaling model
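The stopping rule can be sketched as follows. The statistic is assumed here to be the adjusted coefficient of determination (adjusted R^2), which penalizes extra terms: refinement grows the hypothesis size n as long as the adjusted R^2 keeps improving and stops as soon as it shrinks, guarding against over-fitting.

```python
def adjusted_r2(y, y_pred, n_params):
    """Adjusted coefficient of determination for a fit with n_params parameters."""
    n = len(y)
    mean = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)

def refine(candidates, y):
    """candidates[n] -> predictions of the best hypothesis of size n.
    Returns the n at which the adjusted R^2 stops improving."""
    best_n, best_score = None, float("-inf")
    for n, y_pred in sorted(candidates.items()):
        score = adjusted_r2(y, y_pred, n_params=n + 1)  # n terms + constant
        if score <= best_score:      # statistical shrinkage: stop refining
            break
        best_n, best_score = n, score
    return best_n

# Synthetic example: the size-2 hypothesis fits the data exactly; size 3
# adds a parameter without improving the fit, so refinement stops at n = 2.
y = [1.0, 2.1, 3.9, 8.2, 15.8, 32.1]
candidates = {
    1: [1.5, 2.5, 4.5, 8.5, 16.5, 32.5],
    2: [1.0, 2.1, 3.9, 8.2, 15.8, 32.1],
    3: [1.0, 2.1, 3.9, 8.2, 15.8, 32.1],
}
```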
Requirements modeling

Program
  Computation: FLOPS, load, store
  Communication: P2P, collective, ...
  Time

Disagreement between execution-time and requirements models may be indicative of wait states
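The idea behind that diagnostic can be illustrated with a toy check (this is not the tool's actual logic; the models and the 1.5x threshold are invented for illustration): if a kernel's execution time grows much faster than any of its requirements, the extra time is not explained by the work performed and may point to wait states.

```python
import math

def growth(model, p_small=64, p_large=65536):
    """Ratio of the model's value at a large scale to its value at a small scale."""
    return model(p_large) / model(p_small)

# Placeholder models for one kernel: time grows with sqrt(p), but the
# per-process requirements stay constant under weak scaling.
time_model  = lambda p: 4.0 * math.sqrt(p)
flops_model = lambda p: 1.0e9
msgs_model  = lambda p: 4.0

# Flag the kernel if time grows clearly faster than all requirements.
suspicious = growth(time_model) > 1.5 * max(growth(flops_model),
                                            growth(msgs_model))
```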
Sweep3D

Solves a neutron transport problem
3D domain mapped onto a 2D process grid
Parallelism achieved through a pipelined wave-front process
LogGP model for communication developed by Hoisie et al.
Sweep3D (2)

Kernel | Runtime [%] at p_t = 262k | Increase t(p=262k)/t(p=64) | Model t = f(p), p_i <= 8k | Predictive error [%] at p_t = 262k
sweep -> MPI_Recv | 65.35 | 16.54 | 4.03*sqrt(p) | 5.10
sweep | 20.87 | 0.23 | 582.19 | 0.01
global_int_sum -> MPI_Allreduce | 12.89 | 18.68 | 1.06*sqrt(p) + 0.03*sqrt(p)*log(p) | 13.60
sweep -> MPI_Send | 0.40 | 0.23 | 11.49 + 0.09*sqrt(p)*log(p) | 15.40
source | 0.25 | 0.04 | 6.86 + 9.13*10^-5*log(p) | 0.01

#bytes = const., #msg = const.
Sweep3D (3)
HOMME

Core of the Community Atmospheric Model (CAM)
Spectral element dynamical core on a cubed-sphere grid

Kernel | Model t = f(p), p_i <= 15k | Predictive error [%] at p_t = 130k
Box_rearrange -> MPI_Reduce | 0.03 + 2.5*10^-6*p^(3/2) + 1.2*10^-12*p^3 | 57.0
Vlaplace_sphere_vk | 49.5 | 99.3
...
Compute_and_apply_rhs | 48.7 | 1.7
HOMME

Kernel | Model t = f(p), p_i <= 43k | Predictive error [%] at p_t = 130k
Box_rearrange -> MPI_Reduce | 3.6*10^-6*p^(3/2) + 7.2*10^-13*p^3 | 30.3
Vlaplace_sphere_vk | 24.4 + 2.3*10^-7*p^2 | 4.3
...
Compute_and_apply_rhs | 49.1 | 0.8
HOMME (2)

Two issues
  The number of iterations inside a subroutine grows with p^2
    A ceiling was in effect for measurements up to and including 15k
    Developers were aware of this issue and had developed a work-around
  The time spent in a reduce function grows with p^3
    Previously unknown
    Function invoked during initialization to funnel data to dedicated I/O processes
    Execution time at 183k: ~2h, predictive error: ~40%

Funding: The G8 Research Councils Initiative on Multilateral Research Funding, Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues
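The 183k numbers can be sanity-checked against the refined MPI_Reduce model from the p_i <= 43k table, assuming that model underlies the prediction (an assumption on my part; the slide does not say which model was used):

```python
# Refined model from the p_i <= 43k table: t(p) = 3.6e-6*p^1.5 + 7.2e-13*p^3
p = 183_000
predicted = 3.6e-6 * p**1.5 + 7.2e-13 * p**3   # seconds; p^3 term dominates
measured = 2 * 3600.0                          # "~2h" quoted on the slide
error = abs(predicted - measured) / measured   # relative predictive error
```

The model predicts roughly 4700 s against the measured ~7200 s, an underprediction of about a third, consistent with the ~40% predictive error quoted on the slide, and the p^3 term alone accounts for almost all of the predicted time.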
Cost of first prediction

Assumptions
  Input experiments at scales {2^0, 2^1, 2^2, ..., 2^m}
  Target scale at 2^(m+k)
  Application scales perfectly
Jitter may require more experiments per input scale, but to be conclusive, experiments at the target scale would have to be repeated as well

k | Full scale [%] | Input [%]
1 | 100 | <100
2 | 100 | <50
3 | 100 | <25
4 | 100 | <12.5

The 2nd prediction comes for free
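The table follows from a geometric sum. Under the stated assumptions, the core-hour cost of a weak-scaling experiment is proportional to its processor count, so all input experiments at {2^0, ..., 2^m} together cost less than 2^(m+1), while the single full-scale run at 2^(m+k) costs 2^(m+k):

```python
def input_cost_fraction(m, k):
    """Total cost of all input experiments relative to one full-scale run,
    with cost proportional to processor count (perfect weak scaling)."""
    input_cost = sum(2**i for i in range(m + 1))   # geometric sum = 2^(m+1) - 1
    target_cost = 2 ** (m + k)
    return input_cost / target_cost
```

For any m the fraction is bounded by 2^(1-k): below 100% for k=1, 50% for k=2, 25% for k=3, and 12.5% for k=4, which reproduces the table's Input column.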
HOMME (3)
Conclusion

Automated performance modeling is feasible
Generated models are accurate enough to identify scalability bugs or show their absence with high probability
The advantages of mass production also apply to performance models
  Approximate models are acceptable as long as the effort to create them is low and they do not mislead the user
  Code coverage is as important as model accuracy
Future work
  Study the influence of further hardware parameters
  More efficient traversal of the search space (allows more model parameters)
  Integration into Scalasca
Acknowledgement

Alexandru Calotoiu, German Research School for Simulation Sciences
Torsten Hoefler, ETH Zurich
Marius Poke, German Research School for Simulation Sciences

Paper: Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. In Proceedings of Supercomputing (SC13).