Umbrella: A Portable Environment Creator for Reproducible Computing on Clusters, Clouds, and Grids Haiyan Meng and Douglas Thain University of Notre Dame, Notre Dame, Indiana, USA June 2015
Reproducible Computing: Your application works perfectly today on your machine. Will your application still work next month? Will your application still work next year? Will your application still work 10 years later? Will your application still work today on another machine? 11/17/2018
Run an Application on a New Machine: copy the task and data to M2, then try to run the task on M2.
Possible Failure Reasons on M2: incompatible hardware; mismatched kernel; different operating system; missing software dependencies; wrong software version; incorrect environment variables. In short, the execution environment is incompatible.
Execution Environment Configuration [Figure labels: three hours; Machine 3, Machine 4, Machine 5, ..., Machine 1000] An application should be portable.
VM? Disk Cloning? Both are expensive if the local machine is only missing some input data, or if the problem lies only in environment variables. The overhead of constructing the execution environment should be as low as possible.
Everything is changing! The underlying execution environment changes rapidly: software may get a new version; the OS may be upgraded; the kernel may be upgraded to a new version; the hardware architecture may become obsolete. What will happen 5 years later? 10 years later? How can you guarantee that an application that works today will still work in the future? Migration? Simulation? Hardware preservation? An application should be reproducible.
Portable and Reproducible: How can an application be both portable and reproducible? (1) Specify execution environments for applications: allow users to specify the execution environment for their applications, from hardware all the way up to software and data. (2) Materialize execution environments at runtime automatically: determine the degree of matching between the specification and the execution node, and choose the minimal available mechanism to run the application. Umbrella does both.
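As a rough illustration of the matching idea on this slide, the sketch below compares a specification with an execution node layer by layer and picks a mechanism based on the lowest layer that differs. The layer order, the dictionary keys, and the mapping from mismatch to mechanism are all assumptions made for illustration; the slides do not state Umbrella's actual rules.

```python
# Hypothetical sketch: the names and rules here are illustrative assumptions,
# not Umbrella's actual implementation.

# Layers are checked from the bottom of the stack upward.
LAYERS = ["hardware", "kernel", "os", "software"]

# Assumed policy: a mismatch low in the stack needs a heavier mechanism.
MECHANISM_FOR_MISMATCH = {
    "hardware": "virtual machine",
    "kernel": "virtual machine",
    "os": "sandbox",
    "software": "sandbox",
}


def first_mismatch(spec, node):
    """Return the lowest layer where spec and node differ, or None."""
    for layer in LAYERS:
        if spec.get(layer) != node.get(layer):
            return layer
    return None


def choose_mechanism(spec, node):
    """Pick the lightest mechanism that covers the first mismatch."""
    layer = first_mismatch(spec, node)
    if layer is None:
        return "run directly"
    return MECHANISM_FOR_MISMATCH[layer]


spec = {"hardware": "x86_64", "kernel": "linux", "os": "RHEL 6.5", "software": "CMSSW"}
node = {"hardware": "x86_64", "kernel": "linux", "os": "RHEL 7.0", "software": None}
print(choose_mechanism(spec, node))
```

Exact equality per layer is a simplification; a real matcher would need version ranges and partial matches.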
Umbrella Specification, six sections: hardware, kernel, OS, software, data, environment variables.
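To make the six sections concrete, here is a minimal sketch of what such a specification might look like when written as a Python dictionary and serialized to JSON. The key names, the nesting, the file name, and the environment variable are hypothetical; only the six section themes and the example values (X86_64, kernel >= 2.6.32, RHEL 6.5, CMSSW) come from the slides.

```python
import json

# Hypothetical section names; the slides list the six themes but not the schema.
REQUIRED_SECTIONS = ("hardware", "kernel", "os", "software", "data", "environ")

spec = {
    "hardware": {"arch": "x86_64", "cores": 1, "memory": "1GB", "disk": "4GB"},
    "kernel": {"name": "linux", "version": ">=2.6.32"},
    "os": {"name": "RHEL", "version": "6.5"},
    "software": {"cmssw": {}},             # details omitted: not given in the slides
    "data": {"cms_event.lhe": {}},         # hypothetical input file name
    "environ": {"EXAMPLE_VAR": "value"},   # hypothetical environment variable
}


def missing_sections(candidate):
    """List the required sections that a specification does not define."""
    return [name for name in REQUIRED_SECTIONS if name not in candidate]


print(missing_sections(spec))
print(json.dumps(spec, indent=2))
```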
Workflow of Umbrella
Evaluation of Matching Degrees
Architecture of Umbrella
Remote Archive and Metadata Database: Metadata: description, location, fixity. Remote archive access: http, https, cvmfs. Root software: pre-built and configured. Software preservation format: pre-built and configured.
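Fixity metadata is usually a checksum that lets a client confirm a downloaded item is the one the database describes. The sketch below shows one way such a check could be done; the choice of SHA-256 and the function names are assumptions, since the slides do not say which digest Umbrella records.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_hex, algorithm="sha256"):
    """Return True when the file matches the recorded checksum."""
    return file_checksum(path, algorithm) == expected_hex.lower()
```

A cached copy that fails the check would be discarded and fetched again from its recorded location.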
Local Cache - Mounting Mechanism
Sandbox Techniques: Application-level virtualization traps the system calls of an application and replaces the file access path with the desired path (Parrot, PTU, CDE). OS-level virtualization: chroot, LXC (Linux Containers), Docker.
Grid Integration: the Umbrella specification is translated into a Condor submit file, which is then run with condor_submit and condor_wait.
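The translation step can be pictured with the sketch below, which renders a small set of requirements as the text of an HTCondor submit file. The function name, the executable name, and the log file names are hypothetical, and the mapping from specification to submit file is an assumption; the submit commands themselves (universe, requirements, request_cpus, request_memory, request_disk, queue) are standard HTCondor ones.

```python
def build_submit_file(executable, arguments, arch="X86_64", opsys="LINUX",
                      cpus=1, memory_mb=1024, disk_kb=4 * 1024 * 1024):
    """Render a minimal HTCondor submit description as text."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # Arch and OpSys are standard machine ClassAd attributes.
        f'requirements = (Arch == "{arch}") && (OpSys == "{opsys}")',
        f"request_cpus = {cpus}",
        f"request_memory = {memory_mb}",   # MB by default
        f"request_disk = {disk_kb}",       # KB by default
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "log = umbrella.log",
        "output = umbrella.out",
        "error = umbrella.err",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# run_task.sh and spec.json are hypothetical names.
print(build_submit_file("run_task.sh", "spec.json"))
```

The resulting file would be passed to condor_submit, and condor_wait would then be pointed at the job log to block until the job finishes.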
Cloud Integration: Umbrella on the local machine needs to communicate with the execution node directly (scp, ssh).
EC2 Resource Database: derive the Amazon EC2 AMI and instance type from the Umbrella specification.
CMS Application: Input: LHE file, 18 MB. Output: a ROOT file, 64 MB. Time: ~5 minutes.
Time Overhead – Local Execution Engine
-- Sandbox technique: Parrot | chroot | Docker
-- Matching evaluation: < 1s
-- Software preparation: 2m 11s
-- Sandbox construction: 1m 24s
-- Application execution: 5m 34s | 4m 33s | 4m 35s
-- Post processing: 3s
-- Total time: 7m 45s | 6m 44s | 8m 13s
-- Access authority: any user | only root | docker group users
CPUs: 4; Mem: 2 GB; Free disk: 12 GB; Network: 1 Gb/s; Arch: X86_64; Kernel: 3.10.0; OS: RHEL 7.0; Software: no CMSSW
[Figure labels: Software, OS, Kernel, Hardware]
Space Overhead – Local Execution Engine (type: description, size)
-- Input: specification (< 1 KB); CMS script
-- OS: RHEL 6.5 (1.8 GB)
-- Software: CMSSW (327 MB); Parrot (28 MB)
-- Data: CMS event (18 MB)
-- Output: ROOT file (64 MB); analysis log (2.1 MB)
Time Overhead – Cloud and Grid
Cloud (EC2) subtasks:
-- Start an EC2 instance: 6s
-- Send task to VM: 2s
-- Remote execution: 6m 40s
-- Post processing: 4s
Grid (Condor) subtasks:
-- Submit file construction: < 1s
-- Condor job submission
-- Remote execution: 6m 20s
-- Post processing
Condor: submit the Condor job and wait for the results.
EC2: babysit each step:
-- find an AMI and instance type, start an EC2 instance
-- send the task to the instance
-- start the remote umbrella command
-- wait for the results, pull back the results
-- terminate the instance
Umbrella at Scale – ND Condor Pool (attribute: description; requirement where given)
-- Machine number: 4157
-- Hardware architecture: X86_64, i386, i686; required: X86_64
-- Kernel version: 25 kernel versions (2.6.18 – 3.10.0); required: >= 2.6.32
-- OS: Linux, Mac; required: Linux
-- Linux distribution: RHEL, Debian, CentOS; required: RHEL
-- RHEL versions: 5.5, 5.9, 5.10, 5.11, 6.4, 6.5, 6.6, 7.0; required: 6.5
-- CPU number: 1, 2, 4, 8, 12, 16, 24, 32, 64; required: 1
-- Memory size: max 1 TB, min 984 MB; required: 1 GB
-- Disk size: max 1.7 TB, min 5 GB; required: 4 GB
-- Docker support: 50 out of 4157
-- CVMFS support: 2 out of 4157; required: CVMFS needed
165 machines for Parrot; 25 machines for Docker.
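Selecting eligible machines from a heterogeneous pool comes down to comparing each machine's attributes with the requirements. The sketch below shows one way to do that comparison; the machine records and field names are invented for illustration, and only the requirement values (X86_64, kernel >= 2.6.32, 1 CPU, 1 GB memory, 4 GB disk) come from the slide.

```python
def parse_version(text):
    """Turn '2.6.32' into (2, 6, 32) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_requirements(machine, req):
    """Check one machine record against the minimum requirements."""
    return (
        machine["arch"] == req["arch"]
        and parse_version(machine["kernel"]) >= parse_version(req["min_kernel"])
        and machine["cpus"] >= req["cpus"]
        and machine["memory_mb"] >= req["memory_mb"]
        and machine["disk_gb"] >= req["disk_gb"]
    )


requirements = {"arch": "X86_64", "min_kernel": "2.6.32",
                "cpus": 1, "memory_mb": 1024, "disk_gb": 4}

# Hypothetical machine records, not real pool data.
pool = [
    {"name": "m1", "arch": "X86_64", "kernel": "3.10.0", "cpus": 4, "memory_mb": 2048, "disk_gb": 12},
    {"name": "m2", "arch": "X86_64", "kernel": "2.6.18", "cpus": 8, "memory_mb": 4096, "disk_gb": 50},
    {"name": "m3", "arch": "i686", "kernel": "2.6.32", "cpus": 2, "memory_mb": 2048, "disk_gb": 20},
    {"name": "m4", "arch": "X86_64", "kernel": "2.6.32", "cpus": 1, "memory_mb": 984, "disk_gb": 5},
]

eligible = [m["name"] for m in pool if meets_requirements(m, requirements)]
print(eligible)
```

Comparing versions as integer tuples matters here: as plain strings, "3.10.0" would sort before "3.9.0".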
Umbrella at Scale – ND Condor Pool: 1000 different instances of CMS applications
[Figure legend: Parrot, Docker]
-- Type: total time | fastest | slowest | average
-- Parrot: 7158m | 4m 12s | 11m 53s | 7m 09s
-- Docker: 8589m | 4m 24s | 13m 58s | 8m 35s
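The averages in this table follow from dividing each total by the 1000 runs. The short check below reproduces that arithmetic; the function name is just for illustration.

```python
def average_per_run(total_minutes, runs):
    """Split the mean run time into whole minutes and leftover seconds."""
    average_seconds = total_minutes * 60 / runs
    minutes, seconds = divmod(average_seconds, 60)
    return int(minutes), seconds


for label, total in (("Parrot", 7158), ("Docker", 8589)):
    minutes, seconds = average_per_run(total, 1000)
    print(f"{label}: {minutes}m {seconds:.2f}s")
```

This gives about 7m 9.5s for Parrot and 8m 35.3s for Docker, in line with the 7m 09s and 8m 35s reported on the slide.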
Umbrella at Scale – ND Condor Pool (attribute: description; requirement where given)
-- Machine number: 4157
-- Hardware architecture: X86_64, i386, i686; required: X86_64
-- Kernel version: 25 kernel versions (2.6.18 – 3.10.0); required: >= 2.6.32
-- Linux distribution: RHEL, Debian, CentOS; required: RHEL
-- Docker support: 50 out of 4157
-- CVMFS support: 2 out of 4157; required: CVMFS needed
Without Umbrella: only 2 machines can be used to run the CMS app.
With Umbrella:
-- Parrot: ~1000 machines can be used; 165 machines actually used.
-- Docker: 50 machines can be used; 25 machines actually used.
Summary: Umbrella Makes Applications Portable and Reproducible
Specify the execution environment clearly
-- hardware, kernel, OS, software, data, environment variables
Materialize the execution environment at runtime automatically
-- no need to configure the environment manually
-- matching evaluation and choice of the minimal mechanism
Loosely coupled with sandbox techniques
-- Parrot, chroot, VM, Docker
Construct the sandbox through mounting mechanisms, without copying
-- multiple namespaces can be constructed concurrently
Utilize more computing resources
-- local machine, grid, cloud
DASPOS (Data and Software Preservation for Open Science): https://daspos.crc.nd.edu
Cooperative Computing Lab: http://ccl.cse.nd.edu
Our lab's GitHub: https://github.com/cooperative-computing-lab/cctools
Name: Haiyan Meng
Email: hmeng@nd.edu
Questions?
Metadata DB – with id
EC2 Resource DB – with id