HTCondor / HEP Partnership and Activities HEPiX Fall 2014 Todd Tannenbaum Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison

University of Wisconsin Center for High Throughput Computing

HTCondor

› Open source distributed high throughput computing
› Management of resources, jobs, and workflows
› Primary objective: assist the scientific community with their high throughput computing needs
› Mature technology…

[Timeline: Enslow's DPS paper → Miron Livny's PhD thesis → Condor deployed]

Mature… but actively developed

› Last year: 96 new enhancements, commits, 8 releases, 39k regression test runs
› Open source development model
› Evolve to meet the needs of the science community in an ever-changing computing landscape

BNL (US-ATLAS T1) and FNAL (US-CMS T1) adopted Condor in the early days of PPDG (~2000). Rutherford Appleton Laboratory (UK T1) has since adopted HTCondor as well.

There is HTCondor activity underway at CERN (T0).

Why am I here?

Desire to work together with the HEP community to leverage our collective experience / effort / know-how to offer an open source solution that meets the growing needs of HEP high throughput computing in a challenging budget environment.

Please talk to one of these handsome fellas!

Current Channels

› Documentation
› Community support list (htcondor-users)
› Ticket-tracked developer support
› Bi-weekly/monthly phone conferences
  • Identify and track current problems
  • Communicate and plan future goals
  • Identify and collaborate on challenges, f2f
  • Meet w/ CMS, LIGO, IceCube, LSST, FNAL, iPlant, …
› Fully open development model
› Commercial options for 24/7 support

HTCondor Week

› Each year in early May in Madison, WI
› Perhaps other locales/focuses?

HTCondor R&D heavily influenced by HEP requirements

› Now that we see more HTCondor pools in HEP/LHC, "inter"-HTCondor functionality is more important:
  • Distributed scheduling policy
  • No reliance on shared UIDs or file servers
  • File movement scheduling, overlap of file stage-out and computation
  • Networking can traverse firewalls, NATs, and private nets via CCB ("Condor Connection Broker")
  • Federation technologies: flocking, glidein via grid universe jobs, HTCondor CE

[Diagram: your workstation runs a personal Condor with 600 Condor jobs, which can run in your local Condor Pool or flock to a Friendly Condor Pool.]

How Flocking Works

› Add a line to your condor_config:
  FLOCK_HOSTS = Pool-Foo, Pool-Bar

[Diagram: the schedd on the submit machine talks to its own central manager (collector + negotiator, CONDOR_HOST) and also to the central managers of Pool-Foo and Pool-Bar.]

HTCondor Flocking

› Used by many sites in OSG to easily federate all institutional clusters – fast and easy to set up
› Remote pools are contacted in the order specified until jobs are satisfied
› User-priority system is "flocking-aware"
  • A pool's local users can have priority over remote users "flocking" in.
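As a concrete sketch, flocking is configured on both sides; the hostnames below are hypothetical, and the knob names follow the current HTCondor manual (FLOCK_TO on the submit machine, FLOCK_FROM plus authorization on the remote pool's central manager):

  # condor_config on the submit machine (where the schedd runs):
  FLOCK_TO = cm.pool-foo.example.org, cm.pool-bar.example.org

  # condor_config on each remote pool's central manager:
  FLOCK_FROM = submit.my-pool.example.org
  ALLOW_WRITE = $(ALLOW_WRITE), submit.my-pool.example.org

No changes to users' submit files are needed; the remote negotiator applies the flocking-aware priorities described above.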

Glidein via Grid Universe

› Reliable, durable submission of a job to a remote scheduler to build a batch scheduler overlay network
› Supports many "back end" types:
  • Globus: GT2, GT5
  • NorduGrid
  • UNICORE
  • HTCondor
  • PBS
  • LSF
  • SGE
  • EC2
  • OpenStack
  • Deltacloud
  • Cream
  • SSH
  • BOINC
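For illustration only, here is a minimal grid universe submit file that forwards a glidein-style job to a remote HTCondor schedd; the hostnames and script name are made up, and other back ends (PBS, EC2, …) follow the same pattern with a different grid_resource string:

  universe      = grid
  grid_resource = condor remote-schedd.example.org remote-cm.example.org
  executable    = glidein_startup.sh
  output        = glidein.out
  error         = glidein.err
  log           = glidein.log
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  queue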

Another current collaboration: CMS "global pool" project

› Dynamically created via glideinWMS from grid, cloud, and HPC resources
› 200,000 cores, 400,000 jobs queued
› 10 submit nodes – 5 for production, 5 for analysis – using DAGMan to manage workflows
› Execute nodes will run behind firewalls and NATs, necessitating CCB use. We will also want to minimize the number and frequency of TCP connections.
› We expect at least CERN to provide worker nodes with outbound IPv6 but without outbound IPv4 networking.
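On the execute side, CCB is typically enabled with a single configuration knob pointing at a collector the workers can reach; a minimal sketch:

  # condor_config on execute nodes behind a firewall/NAT:
  CCB_ADDRESS = $(COLLECTOR_HOST)

The collector then brokers connections back to the worker, so only outbound connectivity from the worker is required.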

Scalability Work

› Test, measure, enhance, repeat!
› Enhanced protocols to lower latency and connection counts, more non-blocking I/O, statistics
› Currently at ~155,000 jobs running reliably across 8 submit machines

IPv6 Work: Mixed-mode Flocking

› Two pools: IPv4-only and IPv6-only
› A dual-stack (IPv4 & IPv6) HTCondor submit node (schedd) can participate in both pools
› The schedd appears single-protocol to each pool
  • The schedd rewrites advertised addresses to match the protocol of the active stream
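In HTCondor releases that include this mixed-mode work, dual-stack operation on the submit node is controlled by a pair of configuration knobs; a sketch, with availability and defaults depending on the release:

  # condor_config on the dual-stack submit node:
  ENABLE_IPV4 = TRUE
  ENABLE_IPV6 = TRUE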

Automated IPv6 testing

› Today:
  • Our test suite exclusively uses IPv4
› Goals:
  • Existing tests run twice, once using IPv4, once using IPv6 (excepting things we don't plan on supporting, e.g. the standard universe)
  • New tests for mixed-mode

HTCondor Linux containers support

Power to the admin! Tame those jobs!

Containers in HTCondor

› HTCondor can currently leverage Linux containers / cgroups to run jobs:
  • Limiting/monitoring CPU core usage
  • Limiting/monitoring physical RAM usage
  • Tracking all subprocesses
  • Private file namespace (each job can have its own /tmp!)
  • Private PID namespace
  • Chroot jail
  • Private network namespace (soon! each job can have its own network address)
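A minimal execute-node configuration sketch that enables several of these features (knob names as documented in the HTCondor manual; the cgroup name and limit policy are site choices, not a complete configuration):

  # condor_config on the execute node:
  BASE_CGROUP                = htcondor        # track jobs under this cgroup hierarchy
  CGROUP_MEMORY_LIMIT_POLICY = hard            # enforce, not just monitor, the job's RAM request
  USE_PID_NAMESPACES         = true            # give each job a private PID namespace
  MOUNT_UNDER_SCRATCH        = /tmp, /var/tmp  # give each job its own /tmp and /var/tmp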

More containers…

Docker and HTCondor

This is Docker

Docker manages Linux containers. Containers give Linux processes a private:
  • Root file system
  • Process space
  • NATed network

Examples

[Diagram: an "ubuntu" container running on a host OS that runs Fedora. Processes in other containers on this machine can NOT see what's going on in the "ubuntu" container.]

Command line example

$ docker run ubuntu cat /etc/debian_version

› All docker commands are bound into the "docker" executable
› The "run" command runs a process in a container
› "ubuntu" is the base filesystem for the container, an "image"
› "cat" is the Unix process from the image that we will run (followed by its arguments)

At the Command Line

$ hostname
whale
$ cat /etc/redhat-release
Fedora release 20 (Heisenbug)
$ docker run ubuntu cat /etc/debian_version
jessie/sid
$ time docker run ubuntu sleep 0
real    0m1.825s
user    0m0.017s
sys     0m0.024s

Images

› Images provide the user-level filesystem: everything but the Linux kernel
› You can make your own
› The Docker Hub provides many standard ones; Docker can pull images from the hub

Images are copy-on-write

› All changes are written to the top-level layer
› Changes can be pulled out of the container after it exits
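For example (a sketch; the container name and path are made up), Docker's own tooling can inspect and extract that top layer once the container has exited:

$ docker run --name myjob ubuntu touch /results.out   # job writes a file inside the container
$ docker diff myjob                                   # list files added/changed in the top layer
$ docker cp myjob:/results.out .                      # copy the change out of the stopped container
$ docker rm myjob                                     # discard the container and its writable layer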

Why should you care?

› Reproducibility
  • How many .so's in /usr/lib64 do you use?
  • Will a RHEL 6 app run on RHEL 9 in five years?
› Packaging
  • An image is a great way to package large software stacks
› Imagine an OSG with container support!

I Know What You Are Thinking!

Isn't this a Virtual Machine?

› Containers share the Linux kernel with the host
› The host can "ps" into a container
  • One-way mirror, not a black box
› Docker provides a namespace for images
› Docker containers do not run system daemons
  • No CUPS, cron, init, fsck, … (think about security!)
› Docker images are much smaller than VM images
  • Just a set of files, not a disk image
› Much more likely to be universally available
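A quick illustration of the "one-way mirror" (the example is hypothetical; any long-running container will do):

$ docker run -d ubuntu sleep 300    # start a container in the background
$ ps -ef | grep 'sleep 300'         # on the host, the containerized process shows up in ps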

Semantics: VM vs. Container

› VMs provide ONE operation:
  • Boot the black box
  • Run until poweroff
› Containers provide a process-like interface:
  • Start this process within the container
  • Run until that process exits
  • Much more HTCondor-like

Docker and HTCondor

› Package HTCondor as a docker image
› Add a new "docker universe"

Docker Universe

universe = docker
executable = /bin/my_executable
arguments = arg1
docker_image = deb7_and_HEP_stack
transfer_input_files = some_input
output = out
error = err
log = log
queue

Docker Universe

universe = docker
executable = /bin/my_executable

The executable comes either from the submit machine or from the image – NOT from the execute machine.

Docker Universe

universe = docker
executable = /bin/my_executable
docker_image = deb7_and_HEP_stack

docker_image is the name of the docker image stored on the execute machine.

Docker Universe

universe = docker
executable = /bin/my_executable
docker_image = deb7_and_HEP_stack
transfer_input_files = some_input

HTCondor can transfer input files from the submit machine into the container (and, in reverse, output files back out).
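Submitting and monitoring a docker universe job is then no different from any other HTCondor job (the submit file name here is hypothetical):

$ condor_submit docker_job.sub   # queue the job; HTCondor starts the container on an execute node
$ condor_q                       # monitor it like any other job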

Thank You!

Let's talk about your high throughput computing hopes, dreams, and nightmares – thank you for your input!