Haiyan Meng and Douglas Thain

Presentation transcript:

Umbrella: A Portable Environment Creator for Reproducible Computing on Clusters, Clouds, and Grids Haiyan Meng and Douglas Thain University of Notre Dame, Notre Dame, Indiana, USA June 2015

Reproducible Computing
Your application works perfectly well today on your machine.
-- Will your application still work next month?
-- Will your application still work next year?
-- Will your application still work 10 years later?
-- Will your application still work today on another machine?
11/17/2018

Run an Application on a New Machine
-- Copy the task and data to M2
-- Try to run the task on M2

Possible Failure Reasons on M2
-- Incompatible hardware
-- Mismatched kernel
-- Different operating system
-- Missing software dependencies
-- Wrong software version
-- Incorrect environment variables
The execution environment is incompatible.

Execution Environment Configuration
[Figure: manual configuration, labelled "Three hours", repeated for Machine 3, Machine 4, Machine 5, ..., Machine 1000]
An application should be portable.

VM? Disk Cloning?
VM and disk cloning are expensive:
-- if the local machine is only missing some input data
-- if the problem lies only in the environment variables
The overhead of constructing the execution environment should be as low as possible.

Everything is changing!
Rapid changes in the underlying execution environment:
-- Software may have a new version
-- The OS may be upgraded
-- The kernel may be upgraded to a new version
-- The hardware architecture may become obsolete
What will happen 5 years later? 10 years later?
How can you guarantee that an application which works today will still work normally in the future?
Migration? Simulation? Hardware preservation?
An application should be reproducible.

Portable and Reproducible
How can an application be both portable and reproducible?
Specify execution environments for applications
-- Allow users to specify the execution environment of their applications, from hardware all the way up to software and data.
Materialize execution environments at runtime automatically
-- Determine the matching degree between the specification and the execution node, and choose the minimal available mechanism to run the application.
Umbrella

Umbrella Specification
Six sections: Hardware, Kernel, OS, Software, Data, Environment variables
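A specification with these six sections might look like the sketch below. The six top-level keys follow the slide; every nested field name, mount point, and value is an assumption made for illustration and is not taken from the talk.

```python
import json

# The six sections named on the slide, lowest layer first.
SECTIONS = ("hardware", "kernel", "os", "software", "data", "environ")

# Hypothetical specification: nested field names and values are assumptions.
spec = {
    "hardware": {"arch": "x86_64", "cores": 1, "memory": "1GB", "disk": "4GB"},
    "kernel": {"name": "linux", "version": ">=2.6.32"},
    "os": {"name": "redhat", "version": "6.5"},
    "software": {"cmssw": {"mountpoint": "/cvmfs/cms.cern.ch"}},
    "data": {"input.lhe": {"mountpoint": "/tmp/input.lhe"}},
    "environ": {"PWD": "/tmp"},
}


def missing_sections(candidate):
    """Return the names of the sections that the specification lacks."""
    return [name for name in SECTIONS if name not in candidate]


# Serialize the specification so it can travel with the application.
spec_text = json.dumps(spec, indent=2)
```

Keeping the specification as plain data lets the same file travel unchanged to a local machine, a grid node, or a cloud instance.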

Workflow of Umbrella

Evaluation of Matching Degrees
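One way to picture the matching evaluation is to compare the specification with the execution node layer by layer and pick the lightest mechanism that covers the lowest mismatching layer. The sketch below does this with exact string comparison. The layer-to-mechanism mapping is an assumption for illustration, not the rule set shown in the talk.

```python
# Layers ordered from the bottom of the stack to the top.
LAYERS = ["hardware", "kernel", "os", "software"]

# Assumed mapping from the lowest mismatching layer to a mechanism.
MECHANISM = {
    "hardware": "virtual machine or remote node",
    "kernel": "virtual machine",
    "os": "container or user-level sandbox",
    "software": "mount the missing software",
}


def lowest_mismatch(spec, node):
    """Return the lowest layer where spec and node differ, or None."""
    for layer in LAYERS:
        if spec.get(layer) != node.get(layer):
            return layer
    return None


def choose_mechanism(spec, node):
    """Pick the lightest mechanism that covers the lowest mismatch."""
    layer = lowest_mismatch(spec, node)
    return "run directly" if layer is None else MECHANISM[layer]


# Hypothetical example: the node runs a newer OS and lacks the software.
spec = {"hardware": "x86_64", "kernel": "linux", "os": "rhel-6.5", "software": "cmssw"}
node = {"hardware": "x86_64", "kernel": "linux", "os": "rhel-7.0", "software": None}
```

A real evaluation would compare version ranges rather than exact strings. The point is only that a mismatch lower in the stack calls for a heavier mechanism.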

Architecture of Umbrella

Remote Archive and Metadata Database
Metadata fields: Description, Location, Fixity
[Figure labels: http, https, cvmfs, Root]
Software preservation format: pre-built and configured
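The fixity field suggests that each archived item carries a checksum that can be verified after download. A minimal sketch, assuming a SHA-256 digest and a hypothetical metadata entry layout (neither is confirmed by the slides):

```python
import hashlib


def checksum(data, algorithm="sha256"):
    """Return the hex digest of a byte string."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_fixity(data, entry):
    """Check downloaded bytes against the fixity recorded in the metadata."""
    fixity = entry["fixity"]
    return checksum(data, fixity["algorithm"]) == fixity["value"]


# Hypothetical metadata entry; field layout and URL are assumptions.
payload = b"pre-built and configured software package"
entry = {
    "description": "example software package",
    "location": "https://example.org/archive/package.tar.gz",
    "fixity": {"algorithm": "sha256", "value": checksum(payload)},
}
```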

Local Cache - Mounting Mechanism

Sandbox Techniques
Application-level virtualization: trap the system calls of an application and replace the file access path with the desired path
-- Parrot
-- PTU
-- CDE
OS-level virtualization
-- chroot
-- LXC (Linux Containers)
-- Docker
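The path replacement behind application-level virtualization can be illustrated as a lookup in a mount table. This is only a sketch of the idea with hypothetical paths. Tools such as Parrot apply the rewrite while intercepting system calls, which this example does not attempt.

```python
import posixpath


def redirect(path, mounts):
    """Rewrite a path that falls under a mount point to its real location."""
    path = posixpath.normpath(path)
    # Try the longest mount points first so nested mount points win.
    for mountpoint in sorted(mounts, key=len, reverse=True):
        prefix = mountpoint.rstrip("/")
        if path == mountpoint or path.startswith(prefix + "/"):
            rest = path[len(prefix):]
            return posixpath.normpath(mounts[mountpoint] + rest)
    return path  # Not under any mount point: leave the path untouched.


# Hypothetical mount table: what the application sees -> local cache.
mounts = {
    "/cvmfs/cms.cern.ch": "/home/user/.cache/cmssw",
    "/tmp/input.lhe": "/home/user/.cache/data/input.lhe",
}
```

Because only the path is rewritten, the cached software and data can be shared between sandboxes without copying.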

Grid Integration
Umbrella specification -> Condor submit file -> condor_submit -> condor_wait
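The translation from a specification to a Condor submit file could look roughly like this. The submit commands are standard HTCondor ones. The wrapper script name, the log file name, and the choice of fields are assumptions, not Umbrella's actual output.

```python
def condor_submit_text(spec_file, cores, memory_mb, disk_kb,
                       wrapper="umbrella_job.sh", log="umbrella.log"):
    """Build the text of a Condor submit file for one Umbrella task."""
    lines = [
        "universe = vanilla",
        f"executable = {wrapper}",        # hypothetical wrapper script
        f"arguments = {spec_file}",
        f"transfer_input_files = {spec_file}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"request_cpus = {cores}",
        f"request_memory = {memory_mb}",  # MB by default
        f"request_disk = {disk_kb}",      # KB by default
        'requirements = (Arch == "X86_64") && (OpSys == "LINUX")',
        f"log = {log}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# 1 core, 1 GB of memory, 4 GB of disk, as in the pool requirements.
submit_text = condor_submit_text("cms.umbrella", 1, 1024, 4 * 1024 * 1024)
```

The text would be written to a file and passed to condor_submit. condor_wait is then pointed at the job log to block until the job finishes.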

Cloud Integration
Umbrella on the local machine needs to communicate with the execution node directly, using scp and ssh.
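A sketch of the direct communication: copy the task to the node, run it, and pull the results back. The commands are only built as argument lists here, not executed. The host, user, key file, paths, and the remote command are all hypothetical.

```python
import shlex


def remote_commands(host, user, key_file, local_files, remote_dir, command):
    """Build the scp/ssh argument lists for one remote task."""
    target = f"{user}@{host}"
    # Copy the task and its inputs to the execution node.
    copy_in = ["scp", "-i", key_file, *local_files, f"{target}:{remote_dir}/"]
    # Run the task inside the remote working directory.
    run = ["ssh", "-i", key_file, target,
           f"cd {shlex.quote(remote_dir)} && {command}"]
    # Pull the output directory back to the local machine.
    copy_out = ["scp", "-i", key_file, "-r", f"{target}:{remote_dir}/output", "."]
    return [copy_in, run, copy_out]


steps = remote_commands(
    host="203.0.113.10", user="ec2-user", key_file="key.pem",
    local_files=["cms.umbrella", "input.lhe"], remote_dir="/tmp/task",
    command="./run_task.sh",
)
```

Each list can be handed to `subprocess.run(step, check=True)` in order, so a failure at any step stops the sequence.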

EC2 Resource Database
Derive the Amazon EC2 AMI and instance type from the Umbrella specification.
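Deriving an AMI and instance type can be pictured as two lookups: an image that matches the OS, and the cheapest instance type that meets the hardware request. The table contents below (instance names, AMI ids, prices) are invented placeholders, and the cheapest-fit rule is an assumption.

```python
# Hypothetical resource database; none of these entries are real.
INSTANCE_TYPES = [
    {"name": "example.small", "cores": 1, "memory_gb": 2, "disk_gb": 10, "hourly_usd": 0.03},
    {"name": "example.medium", "cores": 2, "memory_gb": 4, "disk_gb": 40, "hourly_usd": 0.07},
    {"name": "example.large", "cores": 4, "memory_gb": 16, "disk_gb": 80, "hourly_usd": 0.14},
]
AMIS = {
    ("x86_64", "redhat", "6.5"): "ami-00000001",
    ("x86_64", "centos", "6.6"): "ami-00000002",
}


def select_ec2(arch, os_name, os_version, cores, memory_gb, disk_gb):
    """Return (ami, instance_type) for the request, or None if nothing fits."""
    ami = AMIS.get((arch, os_name, os_version))
    if ami is None:
        return None
    candidates = [
        t for t in INSTANCE_TYPES
        if t["cores"] >= cores and t["memory_gb"] >= memory_gb and t["disk_gb"] >= disk_gb
    ]
    if not candidates:
        return None
    cheapest = min(candidates, key=lambda t: t["hourly_usd"])
    return ami, cheapest["name"]
```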

CMS Application
Input: LHE file, 18 MB
Output: ROOT file, 64 MB
Time: ~5 minutes

Time Overhead – Local Execution Engine
Local machine: CPUs: 4; Mem: 2GB; Free Disk: 12GB; Network: 1Gb/s; Arch: X86_64; Kernel: 3.10.0; OS: RHEL 7.0; Software: no CMSSW
Sandbox techniques compared: Parrot, chroot, Docker
Matching Evaluation: < 1s
Software Preparation: 2m 11s
Sandbox Construction: 1m 24s
Application Execution: 5m 34s (Parrot), 4m 33s (chroot), 4m 35s (Docker)
Post Processing: 3s
Total Time: 7m 45s (Parrot), 6m 44s (chroot), 8m 13s (Docker)
Access Authority: any user (Parrot), only root (chroot), docker group users (Docker)
[Figure labels: Software, OS, Kernel, Hardware]

Space Overhead – Local Execution Engine
Type      Description    Size
Input     Specification  < 1 KB
          CMS script
OS        RHEL 6.5       1.8 GB
Software  CMSSW          327 MB
          Parrot         28 MB
Data      CMS event      18 MB
Output    ROOT file      64 MB
          Analysis log   2.1 MB

Time Overhead – Cloud and Grid
Subtask – Cloud (EC2)     Time
Start an EC2 Instance     6s
Send Task to VM           2s
Remote Execution          6m 40s
Post Processing           4s
Subtask – Grid (Condor)   Time
Submit File Construction  < 1s
Condor Job Submission
Remote Execution          6m 20s
Post Processing
Condor: submit the Condor job and wait for the results.
EC2: babysit each step
-- find an AMI and instance type, start an EC2 instance
-- send the task to the instance
-- start the remote umbrella command
-- wait for the results, pull back the results
-- terminate the instance

Umbrella at Scale – ND Condor Pool
Attribute              Description
Machine number         4157
Hardware Architecture  X86_64, i386, i686
Kernel version         25 kernels (2.6.18 – 3.10.0)
OS                     Linux, Mac
Linux Distribution     RHEL, Debian, CentOS
RHEL Versions          5.5, 5.9, 5.10, 5.11, 6.4, 6.5, 6.6, 7.0
CPU number             1, 2, 4, 8, 12, 16, 24, 32, 64
Memory Size            Max: 1TB, Min: 984 MB
Disk Size              Max: 1.7TB, Min: 5GB
Docker support         50 out of 4157
CVMFS support          2 out of 4157
Requirements: X86_64; kernel >= 2.6.32; Linux; RHEL 6.5; 1 CPU; 1GB memory; 4GB disk; CVMFS needed
Machines used: 165 for Parrot, 25 for Docker

Umbrella at Scale – ND Condor Pool
1000 different instances of CMS applications
Type    Total Time  Fastest  Slowest  Average
Parrot  7158m       4m 12s   11m 53s  7m 09s
Docker  8589m       4m 24s   13m 58s  8m 35s
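The averages are consistent with the totals divided by the 1000 runs, which the short check below confirms. The totals are only given to the minute, so the agreement is to the nearest second.

```python
def average_per_run(total_minutes, runs):
    """Return the mean run time as (minutes, seconds), rounded to a second."""
    seconds = round(total_minutes * 60 / runs)
    return divmod(seconds, 60)


parrot_avg = average_per_run(7158, 1000)  # 429.48 s per run
docker_avg = average_per_run(8589, 1000)  # 515.34 s per run
```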

Umbrella at Scale – ND Condor Pool
Attribute              Description
Machine number         4157
Hardware Architecture  X86_64, i386, i686
Kernel version         25 kernels (2.6.18 – 3.10.0)
Linux Distribution     RHEL, Debian, CentOS
Docker support         50 out of 4157
CVMFS support          2 out of 4157
Requirements: X86_64; kernel >= 2.6.32; RHEL; CVMFS needed
Without Umbrella: only 2 machines can be used to run the CMS app.
With Umbrella:
-- Parrot: ~1000 machines can be used; 165 machines actually used.
-- Docker: 50 machines can be used; 25 machines actually used.

Summary: Umbrella Makes Applications Portable and Reproducible
Specify the execution environment clearly
-- Hardware, Kernel, OS, Software, Data, Environment Variables
Materialize the execution environment at runtime automatically
-- No need to configure the environment manually
-- Matching evaluation and choice of the minimal mechanism
Loosely coupled with sandbox techniques
-- Parrot, chroot, VM, Docker
Construct the sandbox through mounting mechanisms, without copying
-- Multiple namespaces can be constructed concurrently
Utilize more computing resources
-- Local Machine, Grid, Cloud

DASPOS (Data and Software Preservation for Open Science): https://daspos.crc.nd.edu
Cooperative Computing Lab: http://ccl.cse.nd.edu
Our lab's GitHub: https://github.com/cooperative-computing-lab/cctools
Name: Haiyan Meng
Email: hmeng@nd.edu
Questions?

Metadata DB – with id

EC2 Resource DB – with id