The Effect of Asymmetric Performance on Asynchronous Task Based Runtimes
Debashis Ganguly and John R. Lange

Presentation transcript:

The Effect of Asymmetric Performance on Asynchronous Task Based Runtimes
Debashis Ganguly and John R. Lange
Hi, I am Debashis Ganguly from the University of Pittsburgh, and this is joint work with my advisor, Dr. John Lange.

Changing Face of HPC Environments
Traditional: dedicated resources (supercomputer plus separate storage and processing clusters)
Future: collocated workloads (simulation and visualization sharing the same nodes)
Task-based runtimes: a potential solution
Goal: Can asynchronous task-based runtimes handle asymmetric performance?

As you might know, HPC applications have historically enjoyed dedicated system resources: they ran on independent supercomputing nodes where they had full control over the hardware. With the exascale computing paradigm, however, moving data across nodes for post-processing has become a serious challenge. This has led to the emergence of a new generation of composed applications, such as coupled physics simulations and in-situ data processing. These composed applications are primarily envisioned to be collocated on multi-core nodes, either space-sharing or time-sharing the resources of the same node. The face of traditional HPC is thus evolving to resemble cloud infrastructure more and more, and it inherits cloud-like problems, such as heterogeneous hardware and time-shared, variable CPU time, which manifest broadly as asymmetric performance. The prevailing software technologies and programming frameworks were designed under the assumption of dedicated computing resources, and they are clearly ill-equipped to deal with these new challenges of variable performance. Over the past decade, the systems community has therefore invested a great deal of money and effort in addressing these challenges; of particular interest are many-task runtimes. The goal of this work is to evaluate many-task runtimes, in particular how effectively they can handle asymmetric performance.

Task-based Runtimes
Experiencing a renewal of interest in the systems community
Assumed to better address performance variability
Adopt an (over-)decomposed task-based model
Allow fine-grained scheduling decisions
Able to adapt to asymmetric/variable performance
But... originally designed for application-induced load imbalances, e.g., adaptive mesh refinement (AMR) based applications
Performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments

For those who might not know, these task-based runtimes adopt an over-decomposed task-based model that gives them fine-grained control over task scheduling and work sharing (a minimal illustration follows). This adaptive, dynamic nature helped them gain popularity, and they are assumed to better address performance variability. However, these runtimes were designed specifically to target algorithm-induced performance asymmetry; applications like AMR, for example, introduce workload imbalance over the course of execution. If we step back, variable CPU time in a time-shared environment manifests in the same way as algorithm-induced workload imbalance. So the question is: is our assumption about task-based runtimes correct? Are they ready to deal with performance asymmetry in the new HPC landscape?
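The following minimal C++ sketch (not Charm++ or HPX-5 code; all names and constants are illustrative) shows the basic idea behind over-decomposition: the work is split into many more tasks than cores, so a core that receives less CPU time simply claims fewer tasks while the faster cores absorb the rest.

```cpp
// Over-decomposition sketch: tasks are claimed dynamically from a shared counter,
// so load redistributes itself when one core runs slower than the others.
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int cores = 12;
    const int tasks = cores * 16;          // over-decomposition factor of 16 (illustrative)
    std::atomic<int> next{0};

    auto worker = [&](int id) {
        int done = 0;
        for (int t = next.fetch_add(1); t < tasks; t = next.fetch_add(1)) {
            volatile double x = 0;          // stand-in for one unit of real work
            for (int i = 0; i < 1000000; ++i) x = x + i * 0.5;
            ++done;
        }
        std::printf("worker %d finished %d tasks\n", id, done);
    };

    std::vector<std::thread> pool;
    for (int c = 0; c < cores; ++c) pool.emplace_back(worker, c);
    for (auto &t : pool) t.join();
    return 0;
}
```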

Basic Experimental Evaluation
Synthetic situation: emulate performance asymmetry in a time-shared configuration
Static and predictable setting
Benchmark on 12 cores, sharing one core with a background workload
Vary the percentage of CPU time given to the competing workload
Environment: 12-core, dual-socket compute node, hyperthreading disabled
Used cpulimit to control the percentage of CPU time

To answer this question, we created a very simple test that emulates performance asymmetry in a time-shared deployment. It is a synthetic situation, but the configuration is static and predictable: the objective is to present the task-based runtimes with a constrained and controlled problem that should be easy to deal with. We run the chosen benchmarks on a 12-core node while a competing workload either gets a dedicated core (space-partitioned) or time-shares a single core with the benchmark. We vary the CPU time of this background workload using the cpulimit utility; a sketch of how such a run can be set up follows.
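As a rough sketch of how the shared-core configuration can be set up, the following launcher pins the competing workload to one core and caps its CPU share with cpulimit. It assumes a Linux node with the cpulimit utility on the PATH and a hypothetical ./prime_gen binary; the core index and percentage are illustrative.

```cpp
// Pin the competing workload to one core and cap its share of that core,
// before the benchmark itself is launched across all 12 cores.
#include <sched.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int shared_core = 11;   // the core time-shared in the 12-core setting
    const int limit_pct  = 25;    // CPU percentage granted to the background workload

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(shared_core, &set);

    pid_t pid = fork();
    if (pid == 0) {
        // Child: restrict itself to the shared core, then run the competing
        // workload under cpulimit so it only receives limit_pct of that core.
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            _exit(1);
        }
        char pct[8];
        std::snprintf(pct, sizeof(pct), "%d", limit_pct);
        execlp("cpulimit", "cpulimit", "-l", pct, "./prime_gen", (char *)NULL);
        perror("execlp");   // only reached if exec fails
        _exit(1);
    }

    // Parent: this is where the benchmark would be started on cores 0-11.
    int status = 0;
    waitpid(pid, &status, 0);
    return 0;
}
```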

Workload Configuration
11-core setting: the benchmark runs on 11 cores and leaves the contested core to the competing workload
12-core setting: the benchmark runs on all 12 cores and time-shares one of them with the competing workload

Whether the competing workload gets the contested core to itself or shares it with the benchmark gives us two configurations to compare, which we will refer to as the 11-core and 12-core settings, respectively. In the 11-core setting the benchmark has to solve the same workload with fewer CPU cores, but it benefits from reduced cross-workload interference. In the 12-core setting, on the other hand, the benchmark gets to use the idle CPU cycles left over by the competing workload.

Experimental Setup
Evaluated two different runtimes:
Charm++: LeanMD
HPX-5: LULESH, HPCG, LibPXGL
Competing workloads:
Prime number generator: entirely CPU bound, with a minimal memory footprint
Kernel compilation: stresses internal OS features such as the I/O and memory subsystems
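For reference, the prime number generator is of the following flavor: a minimal, compute-only sketch (an illustration of the described workload, not the exact program used in the evaluation) that does trial division in a tight loop and touches essentially no memory.

```cpp
// Minimal CPU-bound competing workload: counts primes by trial division.
// Entirely compute-bound, with a negligible memory footprint.
#include <cstdint>
#include <cstdio>

static bool is_prime(uint64_t n) {
    if (n < 2) return false;
    for (uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int main() {
    uint64_t count = 0;
    for (uint64_t n = 2; ; ++n) {          // runs until killed when the benchmark ends
        if (is_prime(n)) ++count;
        if (n % 10000000 == 0)
            std::printf("primes <= %llu: %llu\n",
                        (unsigned long long)n, (unsigned long long)count);
    }
    return 0;
}
```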

Charm++
Iterative, over-decomposed applications
Object-based programming model
Tasks implemented as C++ objects
Objects can migrate across intra- and inter-node boundaries

Charm++
A separate, centralized load balancer component
Preempts application progress
Actively migrates objects based on the current load
Causes computation to block across the other cores
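To make the Charm++ model on the previous two slides concrete, here is a hedged sketch of a migratable chare-array element with its load-balancing hook. It is not code from the paper's benchmarks; the module, class, and file names (worker.ci, Worker, iterate) are illustrative, the main chare that creates the array is omitted, and the standard charmc build flow is assumed.

```cpp
// The matching worker.ci interface file would declare, roughly:
//
//   module worker {
//     array [1D] Worker {
//       entry Worker();
//       entry void iterate();
//     };
//   };
//
#include <vector>
#include "pup_stl.h"        // PUP support for STL containers such as std::vector
#include "worker.decl.h"    // generated by charmc from worker.ci

class Worker : public CBase_Worker {
  std::vector<double> state;   // per-object data; travels with the object when it migrates
  int iteration;

public:
  Worker() : state(4096, 0.0), iteration(0) {
    usesAtSync = true;         // opt in to AtSync-based load balancing
  }
  Worker(CkMigrateMessage *m) {}   // constructor the runtime calls on the destination core

  // pup() (de)serializes the object so the runtime can move it between
  // cores or nodes when the load balancer decides to migrate it.
  void pup(PUP::er &p) {
    CBase_Worker::pup(p);
    p | state;
    p | iteration;
  }

  void iterate() {
    for (std::size_t i = 0; i < state.size(); ++i)
      state[i] += 0.5;                   // stand-in for the real per-iteration work
    ++iteration;

    if (iteration % 20 == 0)
      AtSync();                          // block here: the centralized balancer runs,
                                         // measures load, and may migrate objects
    else
      thisProxy[thisIndex].iterate();    // otherwise continue with the next iteration
  }

  void ResumeFromSync() {                // invoked by the runtime once balancing is done
    thisProxy[thisIndex].iterate();
  }
};

#include "worker.def.h"     // generated definitions; included after the class
```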

Choice of Load Balancer Matters
Comparing the performance of different load-balancing strategies against running without any load balancer: 198% divergence
We selected RefineSwapLB for the rest of the experiments.

Invocation Frequency Matters
MetaLB: invokes the load balancer less frequently, based on heuristics
Load-balancing overhead of RefineSwapLB with and without MetaLB
We enabled MetaLB for our experiments.

Charm++: LeanMD
Sensitivity to the percentage of CPU utilization by the prime number generator background workload
25% to the background workload; 53% divergence
12 cores are worse than 11 cores unless the application gets at least 75% of the shared core's capacity. If the application cannot get more than 75% of the core's capacity, it is better off ignoring that core completely.

Charm++: LeanMD
Sensitivity to the percentage of CPU utilization by the kernel compilation background workload
25% to the background workload
More variable, but consistent mean performance.

HPX-5
Parcel: contains a computational task and a reference to the data the task operates on
Follows the work-first principle of Cilk-5
Every scheduling entity processes parcels from the top of its own scheduling queue.

HPX-5
Scheduling is implemented using random work stealing
No centralized decision-making process
The overhead of work stealing is assumed by the stealer.
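Since HPX-5's behavior in these experiments hinges on random work stealing, here is a generic C++ illustration of the idea; it is not HPX-5's scheduler, and the queue layout and names are simplified assumptions. Each worker drains its own queue and, only when it runs dry, picks a random victim to steal from, so the rebalancing cost is paid by the otherwise idle stealer rather than by a central component.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>
#include <cstdio>

struct WorkerQueue {
    std::deque<std::function<void()>> tasks;
    std::mutex lock;
};

int main() {
    const int nworkers = 4;
    std::vector<WorkerQueue> queues(nworkers);

    // Seed every queue with some dummy tasks.
    for (int w = 0; w < nworkers; ++w)
        for (int t = 0; t < 64; ++t)
            queues[w].tasks.push_back([] {
                volatile double x = 0;
                for (int i = 0; i < 100000; ++i) x = x + i;
            });

    auto run = [&](int self) {
        std::mt19937 rng(self);
        std::uniform_int_distribution<int> pick(0, nworkers - 1);
        int executed = 0, failed_steals = 0;
        while (failed_steals < 1000) {             // crude termination for the demo
            std::function<void()> task;
            {   // First try the top of our own queue (local work first).
                std::lock_guard<std::mutex> g(queues[self].lock);
                if (!queues[self].tasks.empty()) {
                    task = std::move(queues[self].tasks.back());
                    queues[self].tasks.pop_back();
                }
            }
            if (!task) {
                // Queue empty: pick a random victim and steal its oldest task.
                // The stealer, not the victim, pays the synchronization cost.
                int victim = pick(rng);
                std::lock_guard<std::mutex> g(queues[victim].lock);
                if (!queues[victim].tasks.empty()) {
                    task = std::move(queues[victim].tasks.front());
                    queues[victim].tasks.pop_front();
                }
            }
            if (task) { task(); ++executed; failed_steals = 0; }
            else      { ++failed_steals; std::this_thread::yield(); }
        }
        std::printf("worker %d executed %d tasks\n", self, executed);
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < nworkers; ++w) pool.emplace_back(run, w);
    for (auto &t : pool) t.join();
    return 0;
}
```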

OpenMP: LULESH
Relies on collective-based communication, so overall application performance is determined by the slowest rank
Vulnerable to asymmetries in performance
Sensitivity to the percentage of CPU utilization by the prime number generator background workload: 185% divergence
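To make the BSP sensitivity concrete, here is a tiny OpenMP sketch (illustrative, not LULESH itself): every step ends at an implicit barrier, so one thread running on a degraded, time-shared core sets the pace for all twelve.

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    const int steps = 100;
    double t0 = omp_get_wtime();
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel
        {
            // Per-step work for each thread; a thread on a time-shared core
            // takes proportionally longer to finish.
            volatile double x = 0;
            for (long i = 0; i < 5000000; ++i) x = x + i;
        }   // implicit barrier: no thread begins step s+1 until all finish step s
    }
    std::printf("elapsed: %.2f s (dictated by the slowest thread)\n",
                omp_get_wtime() - t0);
    return 0;
}
```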

HPX-5: LULESH
A traditional BSP application implemented using task-based programming
Sensitivity to the percentage of CPU utilization by the prime number generator background workload: 42% divergence
No cross-over point: 12 cores are consistently worse than 11 cores

HPX-5: HPCG
Another BSP application implemented in the task-based model
Sensitivity to the percentage of CPU utilization by the prime number generator background workload
10% to the background workload; 5% divergence
Better than the theoretical expectation
12 cores are consistently worse than 11 cores

HPX-5: LibPXGL
An asynchronous graph processing library: a more natural fit for the task-based model
Sensitivity to the percentage of CPU utilization by the prime number generator background workload
22% to the background workload; 5% divergence
No cross-over point: 12 cores are consistently worse than 11 cores

HPX-5: Kernel Compilation
Results for LULESH, HPCG, and LibPXGL with kernel compilation as the background workload
A more immediate, rather than gradual, decline in performance.

Conclusion
Preliminary evaluation: tightly controlled, time-shared CPUs in a static and consistent configuration
Better than BSP, but performance asymmetry is still challenging
On average, a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%.

Thank You
Debashis Ganguly
Ph.D. Student, Computer Science Department, University of Pittsburgh
debashis@cs.pitt.edu
https://people.cs.pitt.edu/~debashis/
The Prognostic Lab: http://www.prognosticlab.org