 You have exascale problems? ◦ Load Balancing? ◦ Failure? ◦ Power Management?  My system software will solve these problems.

 Coordinated checkpointing to the traditional parallel file system won’t scale  Checkpoint commit time approaches node MTBF => application efficiency drops quickly
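The efficiency collapse can be made concrete with the standard first-order checkpoint model (Young’s approximation for the optimal checkpoint interval). This is an illustrative sketch, not from the talk; the commit times below are hypothetical fractions of the system MTBF:

```python
import math

def young_interval(delta, mtbf):
    """Young's approximation for the optimal checkpoint interval:
    tau = sqrt(2 * delta * mtbf), where delta is the checkpoint
    commit time and mtbf is the system mean time between failures."""
    return math.sqrt(2.0 * delta * mtbf)

def app_efficiency(delta, mtbf):
    """First-order fraction of machine time spent on useful work:
    e ~ 1 - delta/tau (commit overhead) - tau/(2*mtbf) (expected
    lost work per failure). A rough model, good enough to show
    the trend."""
    tau = young_interval(delta, mtbf)
    return 1.0 - delta / tau - tau / (2.0 * mtbf)

# As checkpoint commit time approaches the MTBF, efficiency collapses.
for delta in (0.01, 0.1, 0.5):   # commit time as a fraction of MTBF
    print(delta, round(app_efficiency(delta, 1.0), 3))
```

At delta = 0.5 MTBF the model already predicts essentially zero useful work, which is the “drops quickly” point on the slide.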

 Each MPI process runs twice; a rank fails only if both of its processes fail  Handles full MPI semantics at scale Ferreira, et al. SC 2011.

Your machine power budget and hardware acquisition budget (*) Act now, and you’ll get twice the capacity computing functionality for FREE! (*) plus contracting and granting

 Costs and benefits are really easy to understand ◦ Large and node-scalable reduction in system mean time to interrupt (MTTI) ◦ Using it as the primary fault-tolerance technique means twice the power consumption on capability problems ◦ Buying twice the number of nodes is also quite painful  SC13 Panel: “Replication is too expensive… We [as a community] will have failed if we can’t do better than that.” – Marc Snir
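The MTTI claim can be sanity-checked with a small Monte Carlo sketch, assuming independent exponentially distributed node failures and no repair; the node count and node MTBF below are hypothetical, chosen only to show the scaling:

```python
import random
import statistics

def mtti_no_rep(n, node_mtbf, trials=1000, rng=random):
    """System MTTI without replication: the job is interrupted at
    the first failure among its n nodes."""
    return statistics.mean(
        min(rng.expovariate(1.0 / node_mtbf) for _ in range(n))
        for _ in range(trials))

def mtti_pair_rep(n, node_mtbf, trials=1000, rng=random):
    """With pairwise replication (2n nodes, n ranks): the job is
    interrupted only when BOTH replicas of some rank have failed,
    so each rank's interrupt time is the max of two failure times."""
    return statistics.mean(
        min(max(rng.expovariate(1.0 / node_mtbf),
                rng.expovariate(1.0 / node_mtbf))
            for _ in range(n))
        for _ in range(trials))

random.seed(1)
n, m = 1000, 5.0   # hypothetical: 1000 ranks, 5-year node MTBF
print(mtti_no_rep(n, m), mtti_pair_rep(n, m))  # replication stretches MTTI
```

The simulated MTTI with replication is an order of magnitude larger, at the cost of the doubled node and power budget the slide complains about.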

Department of Computer Science Patrick G. Bridges April 22, 2014

 Theorem: every individual complete system-level solution to an application exascale problem is “too expensive” for some real workload  Rationale ◦ The OS doesn’t know your application ◦ General solutions are expensive ◦ Specialized solutions have limited power or applicability

 Save us, vendors! ◦ Adding reliability on the compute and control path is potentially hardware-intensive ◦ How much to pay in transistors, power, and $$? ◦ While stepping off the commodity price/performance curve…  Burst buffers ◦ How much budget to spend on the I/O system? ◦ Memory is a scarce resource at exascale ◦ NVRAM and network bandwidth aren’t free in power ◦ Some nice recent work in this area

 Idea: each node checkpoints when most convenient, out of sync with other nodes  Benefit: gets checkpointing off the peak B/W curve and onto the sustained B/W curve  Has some obvious (low) costs and some less obvious ones
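A back-of-the-envelope sketch of the peak-vs-sustained bandwidth argument: if everyone writes at once, the I/O system must be provisioned for the burst; if writes are spread across the whole checkpoint interval, it only needs the sustained rate. All node counts, checkpoint sizes, and intervals below are hypothetical:

```python
def required_bandwidth(nodes, ckpt_bytes, write_window, interval):
    """Aggregate filesystem bandwidth needed (bytes/s) when all
    nodes checkpoint in one coordinated window vs. when writes are
    spread uniformly over the full checkpoint interval."""
    coordinated = nodes * ckpt_bytes / write_window   # everyone writes at once
    uncoordinated = nodes * ckpt_bytes / interval     # writes spread out
    return coordinated, uncoordinated

# hypothetical numbers: 10,000 nodes, 64 GB per checkpoint,
# 5-minute coordinated write window vs. a 1-hour checkpoint interval
peak, sustained = required_bandwidth(10_000, 64e9, 300, 3600)
print(peak / 1e12, sustained / 1e12)  # TB/s: peak vs. sustained provisioning
```

With these numbers the coordinated scheme needs 12x the aggregate bandwidth (the ratio of interval to window), which is exactly the provisioning the slide is trying to avoid paying for.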

[Figure: performance curves for apps and benchmarks vs. proxy applications] Ferreira, et al. In submission. Note how bimodal these performance curves are! Clustered asynchronous checkpointing may hold promise here

[Figure] Levy, et al. In submission. “Cheap and powerful is here”

 No one inexpensive technique is enough, but each solves part of the problem  System software must stop trying to “rescue” the application and instead work with the application ◦ Application/runtime can cover part of the space ◦ System software can provide “last resort” solutions when the application cannot easily recover ◦ The right solution is application- and hardware-dependent ◦ Like it is for linear solvers and load balancing  Not just a resilience issue

 Characterization of techniques at scale  Continued development of new techniques  Good decision support ◦ Yet more knobs someone needs to turn ◦ Many of the tradeoffs are non-linear, stochastic, etc. ◦ Different problem areas interact “interestingly” ◦ Complex influence on acquisition decisions, too  Clean interfaces to runtime and application ◦ “From a runtime developer’s perspective, the way that current operating systems manage resources is fundamentally broken” – Mike Bauer, Legion project

Linux (like OSF/1) will solve all your problems for you ◦ Whether you like it or not ◦ While making sure you can’t do the things you (think you) should do ◦ Which is fine, as long as you don’t need to do anything interesting

 Runtimes: “…it is the OS's job to provide mechanism and stay out of the way…”  Sandia lightweight kernels: “The QK provides mechanism, PCT encapsulates policy”  Go ahead and try – if you fall, I’ll catch you

 Applications more complex than when the LWK was originally designed ◦ Users want more complex interfaces and services ◦ Runtimes still want low-level hardware access ◦ But we still have to provide some level of isolation ◦ As well as backstop mechanisms in cooperation with hardware  Two predominant approaches: ◦ Composite OS (Fused OS, MAHOS, Argo OS/R, etc.) ◦ Virtualization (Kitten+Palacios VMM, Hobbes OS/R)

 Safe low-level hardware access for runtime systems  Supports bringing your own OS with you  Don’t have to muck with the insides of Linux  Can be very fast [Figures: HPCC FFT over virtualized 10GbE; CTH on Palacios/Kitten on Red Storm]

 Multiple virtualization architectures, not just one  Pick the point on the spectrum that provides the mechanisms your application/runtime needs  Interesting research challenges on the right mechanisms and interfaces to provide at and between each point

[Figure: spectrum of virtualization architectures, lightest to heaviest weight]
◦ LWK, custom (Catamount, HybridVM)
◦ LWK, virtual Linux environment (Kitten, CNK)
◦ Multiple native OSes (Pisces, Argo); Fused OS
◦ Para-virtual, implicit: VMM changes guest OS (Gears, Guarded Modules)
◦ Para-virtual, explicit: guest OS modified or augmented (orig. Xen, device drivers)
◦ Full HW VM: runs unmodified guest OSes, passthrough (Palacios, KVM, …)
◦ Software virt: emulate HW, binary translation, … (QEMU, VMware, emulated HW transactional memory pre-product)

 Assumption is that the runtime (and/or virtualized OS) will do this for the LWK  Is a semi-static policy plus local (HW or runtime) adaptation sufficient?  Or a global, dynamic, adaptive runtime system that sets policy and resource allocation for millions of cores? ◦ With low overhead and little application interference? ◦ “Burning a core” probably not viable at this problem size? ◦ Heuristics vs. more disciplined methods?  I want to believe, but I have yet to see it ◦ Distributed, decentralized ◦ Must be robust and efficient ◦ Can we tolerate imperfect and unfair?

 No, the application and runtime really shouldn’t expect the OS to rescue it  System software can and should provide a range of modest, inexpensive mechanisms ◦ Which can backstop the app when it can’t rescue itself ◦ Need well-quantified performance for these techniques ◦ On real legacy and next-generation workloads  Virtualization can give the runtime the low-level mechanisms it wants inexpensively

 Colleagues, collaborators and students on this work ◦ UNM: Dorian Arnold, Scott Levy, Cui Zheng ◦ Sandia: Ron Brightwell, Kurt Ferreira, Kevin Pedretti, Patrick Widener ◦ Northwestern: Peter Dinda, Lei Xia ◦ Oak Ridge: Barney Maccabe ◦ Pittsburgh: Jack Lange

This work was supported in part by: ◦ DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC , program manager Sonia Sachs ◦ Sandia National Labs including funding from the Hobbes project, which is funded by the 2013 Exascale Operating and Runtime Systems Program from the DOE Office of Science, Advanced Scientific Computing Research ◦ Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000 ◦ U.S. National Science Foundation Awards CNS and CNS