
1 Debashis Ganguly and John R. Lange
The Effect of Asymmetric Performance on Asynchronous Task-Based Runtimes Debashis Ganguly and John R. Lange Hi, I am Debashis Ganguly from the University of Pittsburgh, and this is joint work with my advisor, Dr. John Lange.

2 Changing Face of HPC Environments
[Slide graphic: traditional HPC with dedicated resources (supercomputer, storage cluster, processing cluster) vs. future collocated workloads (simulation and visualization sharing a node), with task-based runtimes as a potential solution.] As you might know, HPC applications have historically enjoyed dedicated system resources: they ran on independent supercomputing nodes where they had full control over the hardware. With the exascale computing paradigm, however, data movement across nodes for post-processing has become a major challenge. This led to the emergence of a new generation of composed applications, such as coupled physics simulations and in-situ data processing. These composed applications are primarily envisioned to be collocated on multi-core nodes, either space-sharing or time-sharing the resources of the same node. Thus, the traditional HPC space is evolving and starting to look more and more like cloud infrastructure. This, however, brings cloud-like problems such as heterogeneous hardware and time-shared, variable CPU time, which manifest broadly as asymmetric performance. The prevailing software technologies and programming frameworks were designed under the assumption of dedicated computing resources, and they are clearly ill-equipped to deal with these new challenges of variable performance. Over the past decade, the systems community has therefore invested considerable money and effort in addressing these challenges. Of particular interest are many-task runtimes. The goal of this work is to evaluate many-task runtimes, in particular how effectively they can handle asymmetric performance. Goal: Can asynchronous task-based runtimes handle asymmetric performance?

3 Task-based Runtimes Experiencing a renewal of interest in the systems community Assumed to better address performance variability Adopt an (over-)decomposed task-based model Allow fine-grained scheduling decisions Able to adapt to asymmetric/variable performance But… Originally designed for application-induced load imbalances, e.g., an adaptive mesh refinement (AMR) based application Performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments For those who might not know, these task-based runtimes adopt an over-decomposed task-based model, which gives them fine-grained control over task scheduling and work sharing. This adaptive, dynamic nature helped them gain popularity, and they are assumed to better address performance variability. However, these runtimes were designed specifically to target algorithm-induced performance asymmetry. For example, applications like AMR introduce workload imbalance over the course of execution. If we step back, variable CPU time in a time-shared environment manifests itself in much the same way as algorithm-induced workload imbalance. So the question is: is our assumption about task-based runtimes correct? Are they ready to deal with performance asymmetry in the new-era HPC space?
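To make the over-decomposition idea concrete, here is a minimal, hedged C++ sketch (illustrative only; not Charm++ or HPX-5 code, and all names are made up). The work is split into many more chunks than there are cores, and idle workers dynamically claim the next chunk, so a core slowed down by a competing workload simply finishes fewer chunks instead of stalling the whole computation.

```cpp
// Minimal illustration of over-decomposition with dynamic scheduling.
// Hypothetical example; this is not the Charm++ or HPX-5 API.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_workers = 12;       // cores on the test node
    const int num_chunks  = 12 * 16;  // many more tasks than cores
    const long chunk_size = 1'000'000;

    std::atomic<int> next_chunk{0};
    std::vector<double> partial(num_workers, 0.0);

    auto worker = [&](int id) {
        // Each worker repeatedly claims the next unprocessed chunk.
        // A slow core just ends up claiming fewer chunks overall.
        for (int c; (c = next_chunk.fetch_add(1)) < num_chunks; ) {
            double sum = 0.0;
            for (long i = 0; i < chunk_size; ++i)
                sum += 1.0 / (1.0 + c * chunk_size + i);
            partial[id] += sum;
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_workers; ++id)
        threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();

    double total = 0.0;
    for (double p : partial) total += p;
    std::printf("result = %f\n", total);
    return 0;
}
```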

4 Basic Experimental Evaluation
Synthetic situation Emulate performance asymmetry in a time-shared configuration Static and predictable setting Benchmark on 12 cores, sharing one core with a background workload Vary the percentage of CPU time of the competing workload Environment: 12-core dual-socket compute node, hyperthreading disabled Used cpulimit to control the percentage of CPU time To answer this question, we created a very simple test to emulate performance asymmetry in a time-shared deployment. It is a synthetic situation, but the configuration is static and predictable. The objective is to give the task-based runtimes a constrained and controlled problem that should be easy to deal with. We run the chosen benchmarks on a 12-core node, where a competing workload either runs on a dedicated core or shares CPU time with the benchmark on a single core (space-partitioned vs. time-shared). We vary the CPU time of this background workload using the cpulimit utility.

5 Workload Configuration
[Slide diagram (Node 0 / Node 1): 11-core setting, with the benchmark on 11 cores and the competing workload on a dedicated core, vs. 12-core setting, with the benchmark on all 12 cores and one core shared with the competing workload.] Whether we run the competing workload on a dedicated core or let it share a core with the benchmark gives us two situations to compare, which going forward we will call the 11-core and 12-core settings, respectively. In the 11-core setting the benchmark has to solve the same workload with fewer CPU cores, but it benefits from reduced cross-workload interference. In the 12-core setting, the benchmark gets to utilize idle CPU cycles left over by the competing workload.

6 Experimental Setup Evaluated two different runtimes:
Charm++: LeanMD HPX-5: LULESH, HPCG, LibPXGL Competing workloads: Prime number generator: entirely CPU bound, with a minimal memory footprint Kernel compilation: stresses internal OS features such as the I/O and memory subsystems
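As a concrete illustration, here is a minimal sketch of the kind of CPU-bound, low-memory competing workload described above (a hypothetical prime counter, not the exact program used in the evaluation). The pinning/throttling command in the comment is an assumption about how such a setup could be reproduced with standard Linux tools, not a quote from the slides.

```cpp
// Hypothetical competing workload: count primes by trial division.
// Entirely CPU bound and touches almost no memory.
//
// Assumed usage (not from the original slides): pin it to one core and
// cap its CPU share, e.g. at 25%, with standard Linux tools:
//   taskset -c 11 ./prime_gen &
//   cpulimit -p $! -l 25
#include <cstdio>

static bool is_prime(unsigned long n) {
    if (n < 2) return false;
    for (unsigned long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int main() {
    unsigned long count = 0;
    // Run indefinitely; the experiment stops it when the benchmark finishes.
    for (unsigned long n = 2; ; ++n) {
        if (is_prime(n)) ++count;
        if (n % 10'000'000 == 0)
            std::printf("primes up to %lu: %lu\n", n, count);
    }
    return 0;
}
```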

7 Charm++ Iterative over-decomposed applications
Object-based programming model Tasks are implemented as C++ objects Objects can migrate across intra- and inter-node boundaries
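To give a flavor of this model, here is a schematic C++ sketch of a migratable task object: it bundles its state with pack/unpack routines so a runtime could serialize it and move it to another core or node. This is an illustration with made-up names only; it is not the actual Charm++ chare/PUP API.

```cpp
// Schematic illustration of a migratable task object (hypothetical API,
// not Charm++). The key idea: the task owns its state and knows how to
// serialize it, so the runtime can relocate it between cores or nodes.
#include <cstdio>
#include <cstring>
#include <vector>

class ParticleBlockTask {            // hypothetical name
public:
    explicit ParticleBlockTask(std::size_t n) : positions_(3 * n, 0.0) {}

    // One unit of work; the runtime would call this each iteration.
    void step(double dt) {
        for (double& x : positions_) x += dt;   // stand-in for real physics
    }

    // Pack the object's state into a flat buffer so it can be shipped
    // to another core/node and reconstructed there.
    std::vector<char> pack() const {
        std::vector<char> buf(positions_.size() * sizeof(double));
        std::memcpy(buf.data(), positions_.data(), buf.size());
        return buf;
    }

    static ParticleBlockTask unpack(const std::vector<char>& buf) {
        ParticleBlockTask t(buf.size() / sizeof(double) / 3);
        std::memcpy(t.positions_.data(), buf.data(), buf.size());
        return t;
    }

private:
    std::vector<double> positions_;  // the task's migratable state
};

int main() {
    ParticleBlockTask t(4);
    t.step(0.01);
    auto buf = t.pack();                                       // serialize on the source core
    ParticleBlockTask moved = ParticleBlockTask::unpack(buf);  // rebuild elsewhere
    moved.step(0.01);
    std::printf("migrated task carries %zu bytes of state\n", buf.size());
    return 0;
}
```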

8 Charm++ A separate centralized load balancer component
Preempts application progress Actively migrates objects based on current state Causes computation to block across the other cores

9 Choice of Load Balancer Matters
[Figure: performance of different load-balancing strategies compared with no load balancer; up to 198% divergence.] We selected RefineSwapLB for the rest of the experiments.
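For context (general Charm++ usage, not stated on the slide): the load-balancing strategy is typically chosen at launch time via the +balancer runtime option, e.g. +balancer RefineSwapLB, so switching strategies requires no change to the application code.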

10 Invocation Frequency Matters
MetaLB: invokes the load balancer less frequently, based on heuristics [Figure: load-balancing overhead of RefineSwapLB with and without MetaLB.] We enabled MetaLB for our experiments.

11 Charm++: LeanMD
[Figure: sensitivity to the percentage of CPU utilization taken by the prime-number-generator background workload; annotations mark 25% given to the background workload and a 53% divergence.] The 12-core setting is worse than the 11-core setting unless the benchmark gets at least 75% of the shared core's capacity. In other words, if the application cannot get more than 75% of the core's capacity, it is better off ignoring that core completely. In principle, the 12-core setting always has more raw compute available (11 full cores plus whatever the background workload leaves free), so the fact that it only wins once the benchmark holds at least 75% of the shared core suggests that scheduling and interference overheads eat into the extra capacity.

12 Charm++: LeanMD More variable, but consistent mean performance.
[Figure: sensitivity to the percentage of CPU utilization taken by the kernel-compilation background workload; annotation marks 25% given to the background workload.]

13 HPX-5
Parcel: contains a computational task and a reference to the data the task operates on. Follows the work-first principle of Cilk-5: every scheduling entity processes parcels from the top of its own scheduling queue.
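A parcel can be pictured roughly as a small closure: a task plus a handle to the data it should run on. The sketch below is a hedged, generic C++ illustration of that idea (hypothetical names, not the HPX-5 API).

```cpp
// Rough illustration of the parcel idea (hypothetical, not HPX-5's API):
// a parcel pairs a task with a reference to the data it operates on.
#include <cstdio>
#include <functional>
#include <vector>

struct Parcel {
    std::function<void(void*)> task;   // the computational task
    void* data;                        // reference to the data it targets
    void run() { task(data); }
};

int main() {
    std::vector<double> block(8, 1.0);
    // Build a parcel that scales its target block in place.
    Parcel p{ [](void* d) {
                  auto* v = static_cast<std::vector<double>*>(d);
                  for (double& x : *v) x *= 2.0;
              },
              &block };
    p.run();
    std::printf("block[0] = %f\n", block[0]);
    return 0;
}
```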

14 HPX-5 Implemented using Random Work Stealing
[Slide diagram: per-worker queues of parcels with random steals between them.] No centralized decision-making process The overhead of work stealing is borne by the stealer.
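To show what random work stealing looks like in practice, here is a compact, hedged C++ sketch (illustrative only, not HPX-5 code): each worker pops tasks from its own queue, and only when that queue is empty does it pay the cost of probing a randomly chosen victim's queue, so the stealer, not the victim, bears the overhead.

```cpp
// Minimal random work stealing illustration (not the HPX-5 scheduler).
#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct WorkerQueue {
    std::mutex m;
    std::deque<int> tasks;   // task ids; real parcels would carry data
};

int main() {
    const int num_workers = 4;
    const int tasks_per_worker = 1000;

    std::vector<WorkerQueue> queues(num_workers);
    for (int w = 0; w < num_workers; ++w)
        for (int t = 0; t < tasks_per_worker; ++t)
            queues[w].tasks.push_back(w * tasks_per_worker + t);

    std::atomic<long> done{0};

    auto worker = [&](int id) {
        std::mt19937 rng(id);
        std::uniform_int_distribution<int> pick(0, num_workers - 1);
        while (done.load() < num_workers * tasks_per_worker) {
            int task = -1;
            {   // First try the local queue (LIFO, work-first style).
                std::lock_guard<std::mutex> lk(queues[id].m);
                if (!queues[id].tasks.empty()) {
                    task = queues[id].tasks.back();
                    queues[id].tasks.pop_back();
                }
            }
            if (task < 0) {
                // Local queue empty: the stealer pays for finding work,
                // probing a random victim and taking from the other end.
                int victim = pick(rng);
                std::lock_guard<std::mutex> lk(queues[victim].m);
                if (!queues[victim].tasks.empty()) {
                    task = queues[victim].tasks.front();
                    queues[victim].tasks.pop_front();
                }
            }
            if (task >= 0) {
                volatile double x = 0;               // stand-in for real work
                for (int i = 0; i < 1000; ++i) x += i * 1e-6;
                done.fetch_add(1);
            }
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_workers; ++id) threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();
    std::printf("completed %ld tasks\n", done.load());
    return 0;
}
```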

15 OpenMP: LULESH Overall application performance is determined by the slowest rank. Vulnerable to asymmetries in performance. Relies on collective-based communication. [Figure: sensitivity to the percentage of CPU utilization taken by the prime-number-generator background workload; 185% divergence.]

16 HPX-5: LULESH A traditional BSP application implemented using task-based programming [Figure: sensitivity to the percentage of CPU utilization taken by the prime-number-generator background workload; 42% divergence.] No cross-over point: 12 cores are consistently worse than 11 cores.

17 HPX-5: HPCG Another BSP application implemented in the task-based model [Figure: sensitivity to the percentage of CPU utilization taken by the prime-number-generator background workload; annotations mark 10% given to the background workload and a 5% divergence.] Better than the theoretical expectation, yet 12 cores are consistently worse than 11 cores.

18 HPX-5: LibPXGL An asynchronous graph processing library A more natural fit [Figure: sensitivity to the percentage of CPU utilization taken by the prime-number-generator background workload; annotations mark 22% given to the background workload and a 5% divergence.] No cross-over point: 12 cores are consistently worse than 11 cores.

19 HPX-5: Kernel Compilation
A more immediate, rather than gradual, decline in performance. [Figures: results for LULESH, HPCG, and LibPXGL with the kernel-compilation background workload.]

20 Conclusion
Performance asymmetry is still challenging Preliminary evaluation: tightly controlled time-shared CPUs in a static and consistent configuration Task-based runtimes fare better than BSP, but on average a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%.

21 Thank You Debashis Ganguly
Ph.D. Student, Computer Science Department, University of Pittsburgh The Prognostic Lab

