You have exascale problems?
◦ Load Balancing?
◦ Failure?
◦ Power Management?
My system software will solve these problems
Coordinated checkpointing to the traditional parallel file system won’t scale
Checkpoint commit time approaches node MTBF => application efficiency drops quickly
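A rough way to see the efficiency drop, sketched in Python using Young's first-order approximation (an assumption; the slides do not name a model). Here `commit_s` is the checkpoint commit time and `mtbf_s` is the mean time between failures seen by the job; both names and any example numbers are hypothetical. The approximation is only meaningful while commit time is well below MTBF.

```python
import math


def young_interval(commit_s, mtbf_s):
    """Young's first-order optimal compute time between checkpoints."""
    return math.sqrt(2.0 * commit_s * mtbf_s)


def efficiency(commit_s, mtbf_s, interval_s=None):
    """First-order estimate of the fraction of time spent on useful work.

    Waste has two terms: time spent writing checkpoints (commit / interval)
    and expected rework after a failure (interval / (2 * MTBF)).
    """
    if interval_s is None:
        interval_s = young_interval(commit_s, mtbf_s)
    waste = commit_s / interval_s + interval_s / (2.0 * mtbf_s)
    # The model breaks down once waste reaches 1; clamp rather than go negative.
    return max(0.0, 1.0 - waste)
```

Under this model, `efficiency(60, 86400)` stays above 0.95, while `efficiency(1800, 3600)` collapses to 0.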
Each MPI process runs twice; the job fails only if both processes in a rank fail
Handles full MPI semantics at scale
Ferreira, et al. SC 2011.
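A toy Monte Carlo sketch of why replication stretches time to interrupt: it counts how many node failures occur before some rank loses both replicas. The assumptions are mine, not the cited paper's: one replica per node, failures hit nodes uniformly at random, and failed replicas are never repaired. The function names are hypothetical. Without replication, the first failure already interrupts the job.

```python
import random


def failures_until_interrupt(num_ranks, rng):
    """Count node failures until some rank has lost both of its replicas."""
    # Node i hosts a replica of rank i // 2, so each rank owns two nodes.
    nodes = list(range(2 * num_ranks))
    rng.shuffle(nodes)  # random failure order, no repair between failures
    ranks_hit = set()
    for count, node in enumerate(nodes, start=1):
        rank = node // 2
        if rank in ranks_hit:
            return count  # second replica of this rank just failed
        ranks_hit.add(rank)
    raise AssertionError("unreachable: 2n nodes map onto only n ranks")


def mean_failures_until_interrupt(num_ranks, trials=2000, seed=0):
    """Average the count above over several independent trials."""
    rng = random.Random(seed)
    total = sum(failures_until_interrupt(num_ranks, rng) for _ in range(trials))
    return total / trials
```

In this model the mean number of tolerated failures grows with the number of ranks, which is the "node-scalable" behavior the replication argument relies on.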
Your machine power budget and hardware acquisition budget (*)
Act now, and you’ll get twice the capacity computing functionality for FREE!
(*) plus contracting and granting
Costs and benefits are really easy to understand
◦ Large and node-scalable improvement in system mean time to interrupt (MTTI)
◦ Using it as the primary fault tolerance technique means twice the power consumption on capability problems
◦ Buying twice the number of nodes is also quite painful
SC13 Panel: “Replication is too expensive…We [as a community] will have failed if we can't do better than that.” – Marc Snir
Department of Computer Science Patrick G. Bridges April 22, 2014
Theorem: Every individual complete system-level solution to an application exascale problem is “too expensive” for some real workload
Rationale
◦ OS doesn’t know your application
◦ General solutions are expensive
◦ Specialized solutions have limited power or applicability
Save us, vendors!
◦ Adding reliability on the compute and control path is potentially hardware-intensive
◦ How much to pay in transistors, power, and $$?
◦ While stepping off the commodity price/performance curve…
Burst Buffers
◦ How much budget to spend on the I/O system?
◦ Memory is a scarce resource at exascale
◦ NVRAM and network bandwidth aren’t free in power
◦ Some nice recent work in this area
Idea: Each node checkpoints when most convenient and out of sync with other nodes
Benefit: get checkpointing off the peak B/W curve onto the sustained B/W curve
Has some (low) obvious costs, some less obvious costs
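The peak versus sustained bandwidth argument can be sketched with simple arithmetic. The function and parameter names are hypothetical, and the model assumes perfectly even spreading of uncoordinated writes, which real systems will not achieve.

```python
def peak_bandwidth_gbs(nodes, ckpt_gb_per_node, commit_window_s):
    """Aggregate GB/s needed when every node writes in the same short window."""
    return nodes * ckpt_gb_per_node / commit_window_s


def sustained_bandwidth_gbs(nodes, ckpt_gb_per_node, interval_s):
    """Aggregate GB/s needed when writes are spread over the whole interval."""
    return nodes * ckpt_gb_per_node / interval_s
```

With made-up numbers (10,000 nodes, 16 GB per node), committing everything in a 60 s window needs about 2,667 GB/s, while spreading the same data over a 3,600 s interval needs about 44 GB/s.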
[Figure panels: Apps and Benchmarks; Proxy Applications] Ferreira, et al. In submission.
Note how bimodal these performance curves are!
Clustered asynchronous checkpointing may hold promise here
Levy, et al. In submission. Cheap and powerful is here
No one inexpensive technique is enough, but each solves part of the problem
System software must stop trying to “rescue” the application and work with the application
◦ Application/runtime can cover part of the space
◦ System software can provide “last resort” solutions when the application cannot easily recover
◦ Right solution is application and hardware dependent
◦ Like it is for linear solvers and load balancing
Not just a resilience issue
Characterization of techniques at scale
Continued development of new techniques
Good decision support
◦ Yet more knobs someone needs to turn
◦ Many of the tradeoffs are non-linear, stochastic, etc.
◦ Different problem areas interact “interestingly”
◦ Complex influence on acquisition decisions, too
Clean interfaces to runtime and application
◦ “From a runtime developer’s perspective, the way that current operating systems manage resources is fundamentally broken” – Mike Bauer, Legion project
Linux (like OSF/1) will solve all your problems for you
◦ Whether you like it or not
◦ While making sure you can’t do the things you (think you) should do
◦ Which is fine, as long as you don’t need to do anything interesting
Runtimes: “…it is the OS's job to provide mechanism and stay out of the way…”
Sandia lightweight kernels: “The QK provides mechanism, PCT encapsulates policy”
Go ahead and try – if you fall, I’ll catch you
Applications are more complex than when the LWK was originally designed
◦ Users want more complex interfaces and services
◦ Runtimes still want low-level hardware access
◦ But we still have to provide some level of isolation
◦ As well as backstop mechanisms in cooperation with hardware
Two predominant approaches:
◦ Composite OS (Fused OS, MAHOS, Argo OS/R, etc.)
◦ Virtualization (Kitten+Palacios VMM, Hobbes OS/R)
Safe low-level hardware access for runtime systems
Supports bringing your own OS with you
Don’t have to muck with the insides of Linux
Can be very fast
[Figures: HPCC FFT over virtualized 10GbE; CTH on Palacios/Kitten on Red Storm]
Multiple virtualization architectures, not just one
Pick the point on the spectrum that provides the mechanisms your application/runtime needs
Interesting research challenges on the right mechanisms and interfaces to provide at and between each point
Points on the spectrum, between “Lightest Weight” and “Heaviest Weight”:
◦ LWK Virtual Linux Environment (Kitten, CNK)
◦ LWK Custom (Catamount, HybridVM)
◦ Fused OS: multiple native OSes (Pisces, Argo)
◦ Para-virtual Implicit: VMM changes guest OS (Gears, Guarded Modules)
◦ Para-virtual Explicit: guest OS modified or augmented (Orig. Xen, Device Drivers)
◦ Full HW VM: runs unmodified guest OSes, passthru (Palacios, KVM, …)
◦ Software Virt: emulate HW, binary translation, … (Qemu, Vmware, Emulate HW Trans Memory pre-product)
Assumption is that the runtime (and/or virtualized OS) will do this for the LWK
Is a semi-static policy + local (HW or runtime) adaptation sufficient?
Or global dynamic adaptive runtime system that sets policy and resource allocation for millions of cores?
◦ With low overhead and application interference?
◦ “Burning a core” probably not viable at this problem size?
◦ Heuristics vs. more disciplined methods?
I want to believe but I have yet to see it
◦ Distributed, Decentralized
◦ Must be robust and efficient
◦ Can we tolerate imperfect and unfair?
No, the application and runtime really shouldn’t expect the OS to rescue them
System software can and should provide a range of modest, inexpensive mechanisms
◦ Which can backstop the app when it can’t rescue itself
◦ Need well-quantified performance for techniques
◦ On real legacy and next-generation workloads
Virtualization can give the runtime the low-level mechanisms it wants inexpensively
Colleagues, collaborators and students on this work
◦ UNM: Dorian Arnold, Scott Levy, Cui Zheng
◦ Sandia: Ron Brightwell, Kurt Ferreira, Kevin Pedretti, Patrick Widener
◦ Northwestern: Peter Dinda, Lei Xia
◦ Oak Ridge: Barney Maccabe
◦ Pittsburgh: Jack Lange
This work was supported in part by:
◦ DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC , program manager Sonia Sachs
◦ Sandia National Labs including funding from the Hobbes project, which is funded by the 2013 Exascale Operating and Runtime Systems Program from the DOE Office of Science, Advanced Scientific Computing Research
◦ Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000
◦ U.S. National Science Foundation Awards CNS and CNS