Better than Native: Using Virtualization to Improve Compute Node Performance
Brian Kocoloski, Jack Lange
Department of Computer Science, University of Pittsburgh

Linux is becoming the dominant supercomputing OS … Source: http://en.wikipedia.org/wiki/File:Operating_systems_used_on_top_500_supercomputers.svg

… but some applications need less overhead
Lightweight Kernels (LWKs) provide low-overhead access to hardware
Q: How do we provide LWKs to the applications that need them, but not to those that don't?
A: Virtualization
Applications running in a virtual environment can outperform the same applications running natively
We can launch LWKs inside of virtual machines (VMs) to provide isolated, low-overhead environments to applications; in doing so, we can improve on the performance of the host OS

Drawbacks of Linux
Two main technical drawbacks: memory management and OS noise; there are also non-technical issues
OS noise, the biggest problem and widely recognized as a source of overhead, is any event that interrupts a running program: kernel timers, task schedulers, and even general management processes such as mail clients
HPC applications are tightly synchronized, so timing is a big deal; overhead experienced by one node of a bulk synchronous parallel application is often inflicted on all other nodes, as all nodes execute in lockstep
Non-technical: Linux is not inherently an HPC OS, so it requires modification to avoid the overhead inherent in its design, a huge burden on OS developers who have to keep up with the ever-changing Linux code base

Disadvantages of Current Schemes
ZeptoOS (compute node Linux, consisting of modifications to improve the usability of the Linux kernel for HPC):
"Big Memory": preallocation of large contiguous regions of memory at boot time; the preallocated memory is statically sized and fixed at boot
Compatibility: the latest version is based on kernel version 2.6.19 (released in 2006)
Cray's CNL (a runtime environment based on the Linux kernel):
Appears to mimic the HugeTLBfs architecture, limiting the contiguous memory regions that can be given to applications to a maximum of 2 MB

Our Approach
(Flowchart: an HPC application that does not need an LWK runs on the Linux compute node OS directly; one that does runs on a Lightweight Kernel hosted by a VMM layer on top of the Linux compute node OS; in both cases the application completes on the same hardware)
LWKs are able to manage memory more efficiently and eliminate OS noise
VMs allow us to launch LWKs only when they are needed; their memory remains available to the host otherwise
Requires no modification to the host OS and no system reconfiguration
We are able to deliver large allocations in a way that requires no modification to the Linux kernel, in contrast to previous schemes

Palacios
OS-independent, embeddable virtual machine monitor
Strips raw resources away from the host OS, allowing low-overhead access for the guest OS
Low-noise, low-overhead memory management
Developed at Northwestern University, the University of New Mexico, and the University of Pittsburgh
Open source and freely available: http://www.v3vee.org/palacios

Kitten
Lightweight kernel from Sandia National Labs
Moves resource management as close to the application as possible
Mostly Linux-compatible user environment; modern code base with Linux-like organization, which makes it easier for application developers
Like many LWKs, emphasizes efficiency over functionality, providing only what applications need; this limits noise and makes memory management easier
Open source and freely available: http://software.sandia.gov/trac/kitten

System Architecture

System Architecture (1)
Palacios is insertable as a kernel module at runtime; thus, it requires no modification or reconfiguration of any kind to the Linux kernel

System Architecture (2)
Resource managers strip control of raw hardware from the host OS
Raw access allows us to bypass the host OS and control two main areas: (1) memory management, and (2) when and where the Linux scheduler runs
The LWK/HPC environment runs on top of Palacios
Again, the Linux-based kernel only needs to expose the module API for this approach to be feasible

Palacios' Approach
Memory management: bypass the Linux memory management strategies completely, at run time
Initially, the Linux kernel manages all of the memory in the system
The hotpluggable memory interface provided by Linux allows us to "offline" memory, which makes it appear unavailable to Linux; our approach is to offline only contiguous regions
When we take control of these offlined blocks, the resulting simplicity of guest-to-host memory address translation allows us to manage them with very little overhead
When the guest OS is finished, we return the memory to the Linux host
OS noise: control when the Linux scheduler is able to run by determining which VM exits will yield control to it
Take advantage of a tickless host kernel to avoid frequent host timer interrupts (more on this later)

Evaluation
Two-part evaluation:
Microbenchmarks (Stream, Selfish Detour): measure memory performance and the ability to control noise
Miniapplications (HPCCG, pHPCCG): measure performance with a "real" HPC application
The evaluation is preliminary: it is currently limited to a single node running a commodity Fedora 15 kernel, and the environments are not fully optimized

Environment
Two 6-core processors and 16 GB of memory in a NUMA design
The Kitten VM was configured with 1 GB of memory (due to initial limitations)
Stream and HPCCG used OpenMP for shared memory and were run 10 times

Stream
Stream is a microbenchmark that measures memory bandwidth
Palacios provides ~400 MB/s (4.74%) better memory performance on average than Linux, with a 0.34 GB/s lower standard deviation on average

Selfish Detour: Linux vs. Kitten
Selfish Detour is a microbenchmark that measures and detects interruptions of its own execution
There is a fair amount of low-level noise in both environments, but Kitten is significantly less noisy: consistent and predictable

Selfish Detour: Linux vs. Virtualized Kitten
There is overhead: low-level noise goes from roughly 7 ms to around 10 ms, but notice that the host's 100 Hz timer is gone, thanks to the tickless kernel
With few exceptions, Palacios is indeed able to isolate the noise of the Linux host, and it does not insert much noise of its own

HPCCG
Performs the conjugate gradient method to solve a system of linear equations represented by a sparse matrix; the workload is similar to that of many HPC applications
Separate experiments represent both CPU-intensive and memory-intensive workloads:
Single precision: more CPU-intensive
Double precision: more memory-intensive

HPCCG, CPU intensive
Palacios provides better performance than native Linux when running on a single NUMA node (1, 2, and 4 cores), but slightly worse performance on 8 cores
On 8 cores, we believe Palacios is worse because of NUMA: the Kitten VM has no understanding of the underlying NUMA layout, while Linux does have that information available
Palacios' performance is the most consistent, which suggests it is likely to exhibit better scalability than Linux
While not better across the board, Palacios was indeed better in two scenarios, showing that it is capable of outperforming a native Linux host

HPCCG, memory intensive
This is the only experiment in which Palacios typically performs slightly worse than native Linux; above 1 core, however, the results are nearly identical
Even so, Palacios provides lower variance than either Linux configuration, again a good sign for scaling

Future Work
Extend to actual Cray hardware with a CNL host, to show definitively whether this approach can work
Explore deploying this approach in a cloud setting to provide virtual HPC environments on commodity clouds
Previously infeasible due to contention, noise, etc.: problems we think can be solved by the same techniques used in this work

Conclusions
Palacios is capable of providing performance superior to native Linux, particularly in Stream
Palacios can provide a low-noise environment, even when running on a noisy Linux host
While the results are preliminary, they show that this approach is feasible at small scales, and the consistency demonstrated in all environments indicates it is likely to scale well

Acknowledgments
Palacios: http://www.v3vee.org/palacios
Kitten: https://software.sandia.gov/trac/kitten
Email: briankoco@cs.pitt.edu, jacklange@cs.pitt.edu