Lobster: Personalized Opportunistic Computing for CMS at Large Scale Douglas Thain (on behalf of the Lobster team) University of Notre Dame CVMFS Workshop, March 2015

The Cooperative Computing Lab University of Notre Dame

The Cooperative Computing Lab We collaborate with people who have large-scale computing problems in science, engineering, and other fields. We operate computer systems at the scale of O(10,000) cores: clusters, clouds, and grids. We conduct computer science research in the context of real people and real problems. We release open source software for large-scale distributed computing.

Our Philosophy: Harness all the resources that are available: desktops, clusters, clouds, and grids. Make it easy to scale up from one desktop to national scale infrastructure. Provide familiar interfaces that make it easy to connect existing apps together. Allow portability across operating systems, storage systems, middleware… Make simple things easy, and complex things possible. No special privileges required.

Technology Tour

Makeflow = Make + Workflow
Makeflow dispatches tasks to Local, Condor, SGE, or Work Queue back ends. Provides portability across batch systems. Enables parallelism (but not too much!). Fault tolerance at multiple scales. Data and resource management.

Makeflow Applications

Work Queue Library
#include "work_queue.h"
while( not done ) {
    while (more work ready) {
        task = work_queue_task_create();
        // add some details to the task
        work_queue_submit(queue, task);
    }
    task = work_queue_wait(queue);
    // process the completed task
}
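The slide shows the skeleton only. A minimal, self-contained master built on that skeleton might look like the sketch below; the echo commands and the number of tasks are illustrative assumptions, while the work_queue_* calls and task fields follow the Work Queue C API.

    #include "work_queue.h"
    #include <stdio.h>

    int main() {
        /* Listen for workers on the default Work Queue port (9123). */
        struct work_queue *queue = work_queue_create(WORK_QUEUE_DEFAULT_PORT);
        if(!queue) {
            fprintf(stderr, "could not create work queue\n");
            return 1;
        }

        /* Submit a handful of illustrative tasks; each is a shell command run by a worker. */
        for(int i = 0; i < 10; i++) {
            char command[256];
            snprintf(command, sizeof(command), "echo hello from task %d", i);
            struct work_queue_task *task = work_queue_task_create(command);
            work_queue_submit(queue, task);
        }

        /* Wait for completed tasks and process the results. */
        while(!work_queue_empty(queue)) {
            struct work_queue_task *task = work_queue_wait(queue, 5);
            if(task) {
                printf("task %d finished with exit status %d\n", task->taskid, task->return_status);
                work_queue_task_delete(task);
            }
        }

        work_queue_delete(queue);
        return 0;
    }

Workers started on any cluster or cloud then connect back to this master and pull tasks, as the following slides show.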

Work Queue Architecture
[diagram] The application, linked against the Work Queue master library, holds local files and programs (A, B, C) and calls submit and wait: Submit Task1(A,B), Submit Task2(A,C). The master sends files and tasks to a worker process on a 4-core machine; the worker keeps A, B, and C in its cache directory and runs each 2-core task in its own sandbox (Task.1 with A and B, Task.2 with C and A).
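On the master side, this file movement and caching is driven by explicit declarations on each task. The following is a sketch of how Task.1 from the diagram might be declared: the names T, A, and B come from the diagram, the output file name is an illustrative assumption, and work_queue_task_specify_cores is assumed to be available for multi-core tasks.

    #include "work_queue.h"

    /* Sketch: declare the 2-core Task.1 from the diagram, which runs program T on inputs A and B. */
    static void submit_task1(struct work_queue *queue) {
        struct work_queue_task *task = work_queue_task_create("./T A B > task1.out");

        /* Inputs cached at the worker can be reused by later tasks without another transfer. */
        work_queue_task_specify_file(task, "T", "T", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(task, "A", "A", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(task, "B", "B", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(task, "task1.out", "task1.out", WORK_QUEUE_OUTPUT, WORK_QUEUE_NOCACHE);

        /* Declare the resource need, so a 4-core worker can run two such tasks side by side. */
        work_queue_task_specify_cores(task, 2);

        work_queue_submit(queue, task);
    }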

Run Workers Everywhere
[diagram] A single Makeflow / Work Queue master, with local files and programs, submits tasks to hundreds of workers forming a personal cloud: workers started via ssh on a private cluster, via condor_submit_workers on the campus Condor pool, via sge_submit_workers on a shared SGE cluster, and on a public cloud provider.

Work Queue Applications
[montage of application examples] Nanoreactor MD Simulations, Adaptive Weighted Ensemble, Scalable Assembler at Notre Dame, ForceBalance, and Lobster HEP (whose architecture appears later in this talk).

Parrot Virtual File System
[diagram] An unmodified Unix application runs on top of Parrot, which captures its system calls via ptrace and routes them to back ends such as Local, iRODS, Chirp, HTTP, and CVMFS. A custom namespace can remap paths, e.g. /home = /chirp/server/myhome and /software = /cvmfs/cms.cern.ch/cmssoft. Other uses: file access tracing, sandboxing, user ID mapping...
Parrot runs as an ordinary user, so no special privileges are required to install and use it. That makes it useful for harnessing opportunistic machines via a batch system.

% export HTTP_PROXY=
% parrot_run bash
% cd /cvmfs/cms.cern.ch
% ./cmsset_default.sh
% ...

% parrot_run -M /cvmfs/cms.cern.ch=/tmp/cmspackage bash
% cd /cvmfs/cms.cern.ch
% ./cmsset_default.sh
% ...

Parrot Requires Porting
New, obscure system calls get added and get used by the standard libraries, if not by the apps. Custom kernels at HPC centers often have odd bugs and features (e.g. mmap on GPFS). Some applications depend upon idiosyncratic behaviors that are not formally required (e.g. inode != 0).
Our experience: a complex HEP stack doesn't work out of the box... but after some focused collaboration with the CCL team, it works.
Time to rethink whether a less comprehensive implementation would meet the needs of CVMFS.

We Like to Repurpose Existing Interfaces
POSIX Filesystem Interface (Parrot and Chirp) – open / read / write / seek / close
Unix Process Interface (Work Queue) – fork / wait / kill
Makefile DAG structure (Makeflow) – outputs : inputs \n command
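To make the second analogy concrete, here is a local fork/wait version of the same pattern (purely illustrative, not from the slides): work_queue_submit plays the role of fork, and work_queue_wait the role of waitpid, but across a pool of remote workers.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>

    int main() {
        /* Local Unix pattern: create a child process to do the work... */
        pid_t pid = fork();                 /* analogous to work_queue_submit() */
        if(pid < 0) { perror("fork"); return 1; }
        if(pid == 0) {
            execlp("echo", "echo", "hello from the child task", (char *)NULL);
            _exit(127);                     /* exec failed */
        }

        /* ...then wait for it to finish and collect its exit status. */
        int status = 0;
        waitpid(pid, &status, 0);           /* analogous to work_queue_wait() */
        printf("child exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }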

Opportunistic Opportunities

Opportunistic Computing
Much of LHC computing is done in conventional computing centers with a fixed operating environment and professional sysadmins. But there is a large amount of computing power available to end users that is not prepared or tailored to your specific application:
– National HPC facilities not associated with HEP.
– Campus-level clusters and batch systems.
– Volunteer computing systems: Condor, BOINC, etc.
– Cloud services.
Can we effectively use these systems for HEP computing?

Opportunistic Challenges
When borrowing someone else's machines, you cannot change the OS distribution, update RPMs, patch kernels, or run as root… This often puts important technology just barely out of reach of the end user, e.g.:
– FUSE might be installed, but without the setuid binary.
– Docker might be available, but you aren't a member of the required Unix group.
The resource management policies of the hosting system may also work against you:
– Preemption due to submission by higher-priority users.
– Limitations on execution time and disk space.
– Firewalls that only allow certain kinds of network connections.

Backfilling HPC with Condor at Notre Dame

Users of Opportunistic Cycles

Superclusters by the Hour

Lobster: An Opportunistic Job Management System

Lobster Objectives
The ND CMS group has a modest Tier-3 facility of O(300) cores, but wants to harness the ND campus facility of O(10K) cores for its own analysis needs. But the CMS infrastructure is highly centralized:
– One global submission point.
– Assumes a standard operating environment.
– Assumes the unit of submission = the unit of execution.
We need a different infrastructure to harness opportunistic resources for local purposes.

Key Ideas in Lobster Give each user their own scheduler/allocation. Separate task execution unit from output unit. Rely on CVMFS to deliver software. Share caches at all levels. Most importantly: Nothing left behind after eviction!

Lobster Architecture
[diagram] The Lobster master queries a DB for tasks and submits them to a Work Queue master; the WQ pool submits jobs to HTCondor to start workers, and the WQ master sends tasks to the WQ workers. On each worker, a wrapper runs the ROOT task under Parrot, which loads software from CVMFS through a Squid proxy, reads input data (via XRootd and HDFS), and writes output data back (via Chirp).

First Try on the Worker
[diagram] On an HTCondor execution node (shared with another 4-core job), a 4-core worker process keeps a worker cache and runs each task in its own sandbox. Each task runs under its own Parrot, with its own CVMFS instances for cms.cern.ch and grid.cern.ch, and each Parrot keeps its own CVMFS cache ($$$) in /tmp.

Problems and Solutions
Parrot-CVMFS could not access multiple repositories simultaneously; switching between them was possible but slow.
– Workaround: Package up the entire contents of the (small) grid.cern.ch repository and use Parrot to mount it in place.
– Fix: The CVMFS team is working on multi-instance support in libcvmfs.
Each instance of Parrot was downloading and storing its own copy of the software, then leaving a mess in /tmp.
– Fix: The worker directs Parrot to the proper caching location.
– Fix: The CVMFS team implemented shared cache space, known as “alien” caching.

Improved Worker in Production
[diagram] Each task still runs under its own Parrot in its own sandbox, but grid.cern.ch is now packaged and mounted in place, and all the CVMFS instances for cms.cern.ch share a single cache ($$$) inside the worker cache.

Idea for an Even Better Worker
[diagram] A 6-core worker process adds a shared proxy running in its own proxy sandbox, so the per-task Parrot/CVMFS instances share both the worker-level cache ($$$) and a common local proxy.

Competitive with CSA14 Activity

Looking Ahead
Dynamic Proxy Cache Allocations
Containers: Hype and Reality
Pin Down the Container Execution Model
Portability and Reproducibility
Taking Advantage of HPC

Containers
Pro: Containers are making a big difference in that users now expect to be able to construct and control their own namespace.
Con: Now everyone will do it, greatly multiplying the effort needed for porting and debugging.
Limitation: They do not pass power down the hierarchy; e.g., containers can't contain containers, create sub-users, mount FUSE, etc.
Limitation: There is no agreement yet on how to integrate containers into a common execution model.

Where to Apply Containers?
[diagram of three options]
1. Worker in a container: the whole worker process, its cache, and all task sandboxes run inside a single container.
2. Worker starts containers: the worker process runs natively and launches each task sandbox inside its own container.
3. Task starts a container: each task itself starts a container around its own execution.

Abstract Model of Makeflow and Work Queue
[diagram: data, calib, code → output]
output: data calib code
    code -i data -p 5 > output
If the inputs exist, then: make a sandbox, connect the inputs, run the task, save the outputs, destroy the sandbox. Ergo, if it's not an input dependency, it won't be present.
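The same rule can also be expressed directly against the Work Queue C API by declaring every dependency explicitly, so that only the declared files appear in the sandbox. This is a sketch: the file names come from the rule above, while the helper function and the "./" prefix are illustrative.

    #include "work_queue.h"

    /* Sketch: the rule "output: data calib code" as an explicit Work Queue task. */
    static void submit_rule(struct work_queue *queue) {
        struct work_queue_task *task =
            work_queue_task_create("./code -i data -p 5 > output");

        /* Declare the inputs; anything not listed here will not exist in the sandbox. */
        work_queue_task_specify_file(task, "data",  "data",  WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(task, "calib", "calib", WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);
        work_queue_task_specify_file(task, "code",  "code",  WORK_QUEUE_INPUT, WORK_QUEUE_CACHE);

        /* Declare the output so the worker sends it back when the task completes. */
        work_queue_task_specify_file(task, "output", "output", WORK_QUEUE_OUTPUT, WORK_QUEUE_NOCACHE);

        work_queue_submit(queue, task);
    }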

Abstract Model of Execution in a Container
[diagram: data, calib, code → output, produced inside Green Hat Linux]
OPSYS=GreenHatLinux
output: data calib code $OPSYS
    code -i data -p 5 > output
Ergo, if it's not an input dependency, it won't be present.

Clearly Separate Three Things: Inputs/Outputs: The items of enduring interest to the end user. Code: The technical expression of the algorithms to be run. Environment: The furniture required to run the code.

Portability and Reproducibility
Observation by Kevin Lannon (ND): Portability and reproducibility are two sides of the same coin. Both require that you know all of the input dependencies and the execution environment, in order to save them or move them to the right place.
Workflow systems force us to treat tasks as proper functions with explicit inputs and outputs. CVMFS allows us to be explicit about most of the environment critical to the application. VMs and containers handle the rest of the OS.
The trick to efficiency is to identify and name common dependencies, rather than saving everything independently.

Acknowledgements
Lobster Team: Anna Woodard, Matthias Wolf, Nil Valls, Kevin Lannon, Michael Hildreth, Paul Brenner
Collaborators: Jakob Blomer (CVMFS), Dave Dykstra (Frontier), Dan Bradley (OSG/CMS), Rob Gardner (ATLAS)
CCL Team: Ben Tovar, Patrick Donnelly, Peter Ivie, Haiyan Meng, Nick Hazekamp, Peter Sempolinski, Haipeng Cai, Chao Zheng

Cooperative Computing Lab Douglas Thain