Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness? Douglas Thain, Peter Ivie, and Haiyan Meng.

Slides:



Advertisements
Similar presentations
Operating Systems Manage system resources –CPU scheduling –Process management –Memory management –Input/Output device management –Storage device management.
Advertisements

Slide 2-1 Copyright © 2004 Pearson Education, Inc. Operating Systems: A Modern Perspective, Chapter 2 Using the Operating System 2.
© Chinese University, CSE Dept. Software Engineering / Software Engineering Topic 1: Software Engineering: A Preview Your Name: ____________________.
© UC Regents 2010 Extending Rocks Clusters into Amazon EC2 Using Condor Philip Papadopoulos, Ph.D University of California, San Diego San Diego Supercomputer.
Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,
Cambodia-India Entrepreneurship Development Centre - : :.... :-:-
Course Instructor: Aisha Azeem
Copyright Arshi Khan1 System Programming Instructor Arshi Khan.
Basic Unix Dr Tim Cutts Team Leader Systems Support Group Infrastructure Management Team.
Installing software on personal computer
What is Unix Prepared by Dr. Bahjat Qazzaz. What is Unix UNIX is a computer operating system. An operating system is the program that – controls all the.
Linux Basics CS 302. Outline  What is Unix?  What is Linux?  Virtual Machine.
Virtualization Concept. Virtualization  Real: it exists, you can see it.  Transparent: it exists, you cannot see it  Virtual: it does not exist, you.
Paper on Best implemented scientific concept for E-Governance projects Virtual Machine By Nitin V. Choudhari, DIO,NIC,Akola.
Operating systems CHAPTER 7.
Windows Azure Conference 2014 Running Docker on Windows Azure.
1 port BOSS on Wenjing Wu (IHEP-CC)
UNIX and Shell Programming (06CS36)
Model a Container Runtime environment on Your Mac with VMware AppCatalyst VMworld Fabio Rapposelli
ICOM Noack Operating Systems - Administrivia Prontuario - Please time-share and ask questions Info is in my homepage amadeus/~noack/ Make bookmark.
การติดตั้งและทดสอบการทำคลัสเต อร์เสมือนบน Xen, ROCKS, และไท ยกริด Roll Implementation of Virtualization Clusters based on Xen, ROCKS, and ThaiGrid Roll.
Building Scalable Scientific Applications using Makeflow Dinesh Rajan and Douglas Thain University of Notre Dame.
Operating System What is an Operating System? A program that acts as an intermediary between a user of a computer and the computer hardware. An operating.
CS 346 – Chapter 2 OS services –OS user interface –System calls –System programs How to make an OS –Implementation –Structure –Virtual machines Commitment.
UNIX and Shell Programming
Electronic labnotes Mari Wigham COMMIT/. Information WUR  Organising, sharing, finding and reusing data  Expertise in: ● Modelling data.
CS 127 Introduction to Computer Science. What is a computer?  “A machine that stores and manipulates information under the control of a changeable program”
Breaking Barriers Exploding with Possibility Breaking Barriers Exploding with Possibility The Cloud Era Unveiled.
An operating system is the software that makes everything in the computer work together smoothly and efficiently. What is an Operating System?
Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Tools and techniques for managing virtual machine images Andreas.
| nectar.org.au NECTAR TRAINING Module 5 The Research Cloud Lifecycle.
The Open Science Framework (OSF) at Notre Dame: Connecting the Workflow and Supporting the Research Mission CNI Fall 2015 Membership Meeting Washington,
Full and Para Virtualization
Docker and Container Technology
Silberschatz, Galvin and Gagne ©2011 Operating System Concepts Essentials – 8 th Edition Chapter 2: The Linux System Part 1.
Cloud Computing – UNIT - II. VIRTUALIZATION Virtualization Hiding the reality The mantra of smart computing is to intelligently hide the reality Binary->
Predrag Buncic (CERN/PH-SFT) Software Packaging: Can Virtualization help?
CEG 2400 FALL 2012 Linux/UNIX Network Operating Systems.
An operating system (OS) is a collection of system programs that together control the operation of a computer system.
Combining Containers and Workflow Systems for Reproducible Execution Douglas Thain, Alexander Vyushkov, Haiyan Meng, Peter Ivie, and Charles Zheng University.
1 Chapter 2: Operating-System Structures Services Interface provided to users & programmers –System calls (programmer access) –User level access to system.
Transforming Science Through Data-driven Discovery Tools and Services Workshop Atmosphere Joslynn Lee – Data Science Educator Cold Spring Harbor Laboratory,
Introduction to Makeflow and Work Queue Nicholas Hazekamp and Ben Tovar University of Notre Dame XSEDE 15.
UNIX U.Y: 1435/1436 H Operating System Concept. What is an Operating System?  The operating system (OS) is the program which starts up when you turn.
EGI-InSPIRE RI EGI Compute and Data Services for Open Access in H2020 Tiziana Ferrari Technical Director, EGI.eu
1 OPERATING SYSTEMS. 2 CONTENTS 1.What is an Operating System? 2.OS Functions 3.OS Services 4.Structure of OS 5.Evolution of OS.
Computer System Structures
Mike Hildreth representing the DASPOS Team
Development Environment
Fundamentals Sunny Sharma Microsoft
Virtualisation for NA49/NA61
Dag Toppe Larsen UiB/CERN CERN,
Dag Toppe Larsen UiB/CERN CERN,
Docker Birthday #3.
Mike Hildreth representing the DASPOS Team
Unified Modeling Language
Virtualisation for NA49/NA61
Containers and Virtualisation
Lecture 1 Runtime environments.
CernVM Status Report Predrag Buncic (CERN/PH-SFT).
Hadoop Clusters Tess Fulkerson.
Introduction to Makeflow and Work Queue
Haiyan Meng and Douglas Thain
Virtualization Techniques
Weaving Abstractions into Workflows
Chapter 2: The Linux System Part 1
Partition Starter Find out what disk partitioning is, state key features, find a diagram and give an example.
Overview of Workflows: Why Use Them?
Lecture 1 Runtime environments.
Azure Container Service
Presentation transcript:

Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness? Douglas Thain, Peter Ivie, and Haiyan Meng

The Cooperative Computing Lab

We collaborate with people who have large scale computing problems in science, engineering, and other fields. We operate computer systems on the O(10,000) cores: clusters, clouds, grids. We conduct computer science research in the context of real people and problems. We release open source software for large scale distributed computing. 3

Some of Our Collaborators K. Lannon: Analyze 2PB of data produced by the LHC experiment at CERN J. Izaguirre: Simulate 10M different configurations of a complex protein. S. Emrich: Analyze DNA in thousands of genomes for similar sub-sequences. P. Flynn: Computational experiments on millions of acquired face videos.

Reproducibility in Scientific Computing is Very Poor Today Can I re-run a result from a colleague from five years ago, and obtain the same result? How about a student in my lab from last week? Today, are we preparing for our current results to be re-used by others five years from now? Multiple reasons why not: – Rapid technological change. – No archival of artifacts. – Many implicit dependencies. – Lack of backwards compatibility. – Lack of social incentives.

Our scientific collaborators see the value in reproducibility… But only if it can be done - easily - at large scale - with high performance

In principle, preserving a software execution is easy.

I want to preserve my simulation method and results so other people can try it out. data output DOI:10.XXXXDOI:10.ZZZZ DOI:10.CCCC DOI:10.YYYY … and repeat this 1M times with different –p values. mysim.exe –in data –out output –p 10

But it’s not that simple!

I want to preserve my simulation method and results so other people can try it out. data output mysim.exe –in data –out output –p 10 config calib HTTP GET Green Goat Linux B libsim ruby X86-64 CPU / 64GB RAM / 200GB Disk SIM_MODE=clever

The problem is implicit dependencies: (things you need but cannot see) How do we find them? How do we deliver them?

Two approaches: Preserve the Mess (VMs and Parrot) Encourage Cleanliness (Umbrella and Prune)

Preserve the Mess: Save a VM data output sim.exe –in data –out output –p 10 config Green Goat Linux B libsim ruby Hardware SIM_MODE=clever Virtual Hardware data output sim.exe –in data –out output –p 10 config Green Goat Linux B libsim ruby SIM_MODE=clever VMM DOI:10.XXXX calib HTTP GET

Preserve the Mess: Save a VM Not a bad place to start, but: – Captures more things than necessary. – Duplication across similar VM images. – Hard to disentangle things logically – what if you want to run the same thing with a new OS? – Doesn’t capture network interactions. – May be coupled to a specific VM technology. – VM services are not data archives.

Preserve the Mess: Trace Individual Dependencies data output sim.exe –in data –out output –p 10 config Green Goat Linux B libsim ruby Hardware SIM_MODE=clever calib HTTP GET /home/fred/data /home/fred/config /home/fred/sim.exe /usr/lib/libsim /usr/bin/ruby Resource Actually Used Original Machine A portable package that can be re-executed using Docker, Parrot, or Amazon Parrot Trace all system calls Case Study: 166TB reduced to a 21GB package. Haiyan Meng, Matthias Wolf, Peter Ivie, Anna Woodard, Michael Hildreth, Douglas Thain, A Case Study in Preserving a High Energy Physics Application with Parrot, Journal of Physics: Conference Series, December, 2015.

Preserve the Mess: Trace Individual Dependencies Solves some problems: – Captures only what is necessary. – Observes network interactions. – Portable across VM technologies. Still has some of the same problems: – Hard to disentangle things logically – what if you want to run the same thing with a new OS? – Duplication across similar VM images / packages. – VM services are not data archives.

What we really want: A structured way to compose an application with all of its dependencies. Enable preservation and sharing of data and images for efficiency.

Encourage Cleanliness: Umbrella kernel: { name: “Linux”, version: “3.2”; } opsys: { name: “Red Hat”, version: “6.1” } software: { mysim: { url:“doi://10.WW/ZZZZ” mount:“/soft/sim”, } data: { input: { url: “ mount :“/data/input”, } calib: { url:“doi://10.XX/YYYY” mount:“/data/calib”, } “umbrella run mysim.json” mysim.json RedHat 6.1 Linux mysim input calib RedHat 6.1 input mysim calib Online Data Archives

RedHat 6.1 RedHat 6.2 Mysim 3.1 Mysim 3.2 Institutional Repository input1 calib1 input2 calib2 Linux Linux RedHat 6.1 Linux Mysim 3.1 input1 Umbrella specifies a reproducible environment while avoiding duplication and enabling precise adjustments. Run the experiment Same thing, but use different input data. Same thing, but update the OS RedHat 6.1 Linux Mysim 3.1 input2 RedHat 6.2 Linux Mysim 3.1 input2 Haiyan Meng and Douglas Thain, Umbrella: A Portable Environment Creator for Reproducible Computing on Clusters, Clouds, and Grids, Virt. Tech. for Distributed Computing, June 2015.

Specification is More Important Than Mechanism Umbrella can work in a variety of ways: – Native Hardware: Just check for compatibility. – Amazon: allocate VM, copy and unpack tarballs. – Docker: create container, mount volumes. – Parrot: Download tarballs, mount at runtime. – Condor: Request compatible machine. More ways will be possible in the future as technologies come and go. Key requirement: Efficient runtime composition, rather than copying, to allow shared deps.

Encourage Cleanliness: Construct workflows from carefully specified building blocks.

Encouraging Cleanliness: PRUNE Observation: The Unix execution model is part of the problem, because it allows implicit deps. Can we improve upon the standard command- line shell interface to make it reproducible? Instead of interpreting an opaque string: mysim.exe –in data –out calib Ask the user to invoke a function instead: output = mysim( input, calib ) IN ENV mysim.json

PRUNE – Preserving Run Environment PUT “/tmp/input1.dat” AS “input1”[id 3ba8c2] PUT “/tmp/input2.dat” AS “input2”[id dab209] PUT “/tmp/calib.dat” AS “calib”[id 64c2fa] PUT “sim.function” AS “sim”[id fffda7] out1 = sim( input1, calib ) IN ENV mysim.json [out1 is bab598] out2 = sim( input2, calib ) IN ENV mysim.json [out2 is 392caf] out3 = sim( input2, calib2 ) IN ENV bettersim.json [out3 is ]

RedHat 6.1 RedHat 6.2 Mysim 3.1 Mysim 3.2 Online Data Archive input1 calib1 outpu11 myenv1 Linux Linux PRUNE connects together precisely reproducible executions and gives each item a unique identifier myenv1 input1 calib1 output 1 sim output1 = sim( input1, calib1 ) IN ENV mysim.json bab598 = fffda7 ( 3ba8c2, 64c2fa ) IN ENV c8c832 myenv1 sim2

Recapitulation Key problem: User cannot see the implicit dependencies that are critical to their code. Preserve the Mess: – VMs: Just capture everything present. – Parrot: Capture only what is actually used. Encourage Cleanliness: – Umbrella: Describe all deps in a simple specification. – PRUNE: Use a specification with every task run. Preserving the mess may work for a handful of tasks, but cleanliness is required for millions!

Many Open Problems! Naming: Tension between usability and durability: DOIs, UUIDs, HMACs,... Overhead: Tools must be close to native performance, or they won’t get used. Usability: Do users have to change behavior? Layers: Preserve program binaries, or sources + compilers, or something else? Repositories: Will they take provisional data? Compatibility: Can we plug into existing tools? Composition: Connect systems together?

For more information… 27 Haiyan Meng Peter Ivie Douglas Thain Parrot Packaging: Umbrella Technology Preview: