Taming Local Users and Remote Clouds with HTCondor at CERN
Ben Jones -- ben.dylan.jones@cern.ch
HTCondor Week
Worldwide LHC Computing Grid
- Nearly 170 sites in 40 countries, connected by 10-100 Gb links
- ~350,000 cores, 500 PB of storage, > 2 million jobs/day
- Tier-0 (CERN): data recording, reconstruction and distribution
- Tier-1: permanent storage, re-processing, analysis
- Tier-2: simulation, end-user analysis
I'm the Service Manager of the batch system at the Tier-0 in this diagram. The WLCG is obviously an extremely important use case for the batch system, often making up most, but not all, of its use.
Run 2 has only just started
- Hint of an excess at a diphoton mass of 750 GeV
- Seen by both ATLAS and CMS -- coincidence or a new signal?
- Two experiments are showing something "strange"
- 300 theory papers already being written
- Interesting, but it could also be a statistical fluctuation
There are, of course, other experiments, such as this one behind my office in building 31. The Antiproton Decelerator (AD) hosts experiments such as ALPHA, which tries to trap antihydrogen and test things like whether it falls up or down under gravity (down, they think).
For the AD, a beam of protons from the PS is fired into a block of metal; the energy from the collisions is enough to create a new proton-antiproton pair roughly once in every million collisions. There are other experiments, and indeed other batch system users, fed from the accelerator complex, such as COMPASS at the SPS. And then there are batch users that aren't part of the accelerator complex at all, like AMS, which looks for antimatter and evidence of dark matter with a module mounted on the ISS.
Batch Service at CERN
- Service used for both grid and local submissions (roughly equal shares)
- Local public share open to all CERN users
- 400-600K job submissions / day
- ~60K running jobs
- Migration from LSF to HTCondor underway
- Some grid workload already moved, local submission imminent
- Vanilla universe for now, but planning on Docker
LSF vs HTCondor
Requirements
Users need:
- Auth tokens for internal services (yes, including AFS $HOME)
- A schedd to use, and to query
- An accounting group to be charged to
- Some sensible defaults for things like job timing
We need:
- Defined units of compute: 1 core means 2 GB RAM and 20 GB scratch (see the sketch below)
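As an illustration only, and not necessarily the CERN configuration, the 2 GB RAM / 20 GB scratch per-core unit could be expressed with per-job defaults on the schedd and a partitionable slot on the startd:

# Hedged sketch: per-core resource units (values from the slide; the choice of knobs is an assumption)
# Schedd side: default resource requests scale with the number of requested cores
JOB_DEFAULT_REQUESTMEMORY = 2048 * RequestCpus                 # MiB
JOB_DEFAULT_REQUESTDISK   = 20 * 1024 * 1024 * RequestCpus     # KiB (20 GB per core)

# Startd side: one partitionable slot covering the whole machine,
# carved into dynamic slots that match the requested units
NUM_SLOTS                 = 1
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE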
Token expiry
- Standard Kerberos ticket lifetime is 1 day, renewable up to 7 days
- Our jobs can last for > 7 days
- A mechanism is required to keep authentication tokens refreshed (and, in our case, beyond the standard renewable lifetime)
Integration with Kerberos
Controlled by config vars:
- SEC_CREDENTIAL_PRODUCER
- SEC_CREDENTIAL_MONITOR
- SEC_CREDENTIAL_DIRECTORY
These give the ability to add scripts that generate initial Kerberos tickets and renew them as necessary; Condor will monitor credentials, copy them to the sandbox, and refresh them as needed (a hedged config sketch follows).
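A minimal sketch of how these knobs might be wired together; the script paths are placeholders, not the actual CERN scripts:

# Hedged sketch -- paths are hypothetical
# Submit side: script invoked by condor_submit to produce the initial credential
SEC_CREDENTIAL_PRODUCER  = /usr/local/bin/krb_producer
# Daemon script that watches the credential directory and refreshes tickets
SEC_CREDENTIAL_MONITOR   = /usr/local/bin/krb_credmon
# Where the credd/starter store per-user credentials
SEC_CREDENTIAL_DIRECTORY = /var/lib/condor/credentials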
Submit Side
- condor_submit calls the SEC_CREDENTIAL_PRODUCER, which acquires a Kerberos AS-REQ
- The AS-REQ is handed by condor_submit to the schedd's condor_credd, and is written to the schedd's SEC_CREDENTIAL_DIRECTORY
- The SEC_CREDENTIAL_MONITOR watches the directory, and acquires/renews the TGT
Execute Side
- condor_starter copies credentials into the SEC_CREDENTIAL_DIRECTORY when a user has jobs scheduled to execute, and removes them when there are no jobs left
- The SEC_CREDENTIAL_MONITOR will acquire/renew TGTs with the stored AS-REQ
- The TGT is copied to the job sandbox when the job runs, with KRB5CCNAME set (see the sketch below)
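As a quick way to see the result from the job's point of view (an illustrative test, not part of the CERN setup), a trivial job can print the ticket cache that the starter points KRB5CCNAME at:

# test_klist.sub -- hypothetical test job; klist reads the cache named by KRB5CCNAME
universe   = vanilla
executable = /usr/bin/klist
output     = klist.out
error      = klist.err
log        = klist.log
queue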
Schedds
- The number of user jobs means we need multiple schedds
- We want to make it easy & cheap to query, so the assignment needs to be static
- Currently using ZooKeeper as the k/v store
- The znode containing the schedd is /htcondor/users/$username/{current,old}
- A LOCAL_CONFIG_FILE piped script contacts ZooKeeper on submit / query to pick the schedd (sketch below)
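One way this could look, as a hedged sketch (the script name, hostname, and znode handling are assumptions): HTCondor executes any LOCAL_CONFIG_FILE entry ending in "|" and treats its stdout as configuration, so a small lookup script can emit the user's assigned schedd on the submit-side clients:

# Hedged sketch -- /usr/local/bin/schedd_lookup is a placeholder script that reads
# /htcondor/users/$USER/current from ZooKeeper and prints config such as:
#   SCHEDD_HOST = bigbird01.cern.ch
# The trailing "|" tells HTCondor to execute the script and use its output as config.
LOCAL_CONFIG_FILE = /usr/local/bin/schedd_lookup|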
Job flavours
- The current LSF service has defined queues for local submission
- Defined as "normalised" minutes/hours/weeks
- Slot limits for individual queues (i.e. limit the long queues)
- Use a mixture of startd and schedd policy to enforce this (sketch below)
- Default to a short time to ensure users think about it, and restrict the number of long slots
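A hedged sketch of one way such wall-time limits could be enforced (the +MaxRuntime attribute and the expressions are assumptions, not the actual CERN policy): the schedd removes jobs that run past their requested time, and the startd refuses to keep running them.

# Schedd side (hedged sketch): remove running jobs that exceed their flavour's wall time.
# MaxRuntime is a hypothetical job attribute set at submit time, e.g. +MaxRuntime = 1200
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && (MaxRuntime isnt undefined) && ((time() - EnteredCurrentStatus) > MaxRuntime)

# Startd side (hedged sketch): belt-and-braces preemption of over-running jobs;
# in practice this would be OR'd with the pool's existing PREEMPT policy
PREEMPT = (TARGET.MaxRuntime isnt undefined) && (TotalJobRunTime > TARGET.MaxRuntime)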
Accounting groups
- Assigning users to accounting groups and delegating control to VO admins (hopefully using BNL's solution)
- Use SUBMIT_REQUIREMENTS and PROTECTED job attributes to make the accounting group required and immutable (hedged sketch below)
- Also use SUBMIT_REQUIREMENTS to ensure users don't request > 2 GB / core
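A hedged sketch of what such schedd-side policy might look like; the requirement names, attribute choices, and messages are illustrative assumptions:

# Hedged sketch of schedd-side policy -- names and values are illustrative
# Reject jobs without an accounting group, and jobs asking for more than 2 GB per core
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) HasAcctGroup MemPerCore

SUBMIT_REQUIREMENT_HasAcctGroup        = AcctGroup isnt undefined
SUBMIT_REQUIREMENT_HasAcctGroup_REASON = "Please set an accounting group (accounting_group = ...)"

SUBMIT_REQUIREMENT_MemPerCore          = RequestMemory <= 2048 * RequestCpus
SUBMIT_REQUIREMENT_MemPerCore_REASON   = "Memory requests are limited to 2 GB per core"

# Once set, users cannot edit the accounting group attributes of their jobs
PROTECTED_JOB_ATTRS = $(PROTECTED_JOB_ATTRS) AcctGroup AccountingGroup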
Compute Growth Outlook
[Plot: projected compute needs vs. technology and budget, across LHC runs and the EYETS (extended year-end technical stop)]
- Compute: growth of more than x50 needed
- Moore's law gives only ~x16
- ... versus what we can afford (flat budget)
- ... and 400 PB/year of data in Run 4
Cloudy with a chance of jobs...
- "There is no cloud: it's just someone else's computer"
- Current public cloud investigations are about adding flat capacity (cheaper), not elasticity
- Treat additions to the HTCondor pool as just more computers, somewhere else
- We maybe trust them slightly less
HTCondor cloud details
- Most of the plant is mapped to worker-node; cloud nodes are mapped to xcloud-worker
- The difference is in how we add them in the mapfile: we use a less trusted CA (no permissions beyond joining the pool)
- For most VOs, a common submission point using routes (hedged sketch below)
- Jobs: +remote_queue = "externalcloud"
- Machines: XBatch =?= True
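As an illustration only (the CA subject, identity name, and exact expressions are assumptions), the pieces might fit together like this: cloud startds advertise XBatch and only accept tagged jobs, jobs tag themselves with +remote_queue, and the mapfile gives cloud nodes a low-privilege identity.

# Cloud worker startd config (hedged sketch): advertise that this is an external-cloud node
XBatch = True
STARTD_ATTRS = $(STARTD_ATTRS) XBatch
# Only run jobs explicitly tagged for the external cloud;
# in practice this would be AND'ed with the pool's existing START policy
START = (TARGET.remote_queue =?= "externalcloud")

# Submit file fragment (hedged sketch): tag a job for the external cloud
#   +remote_queue = "externalcloud"

# Mapfile entry (hedged sketch): the less-trusted cloud CA maps to a dedicated identity
# with no permissions beyond joining the pool, e.g. in CERTIFICATE_MAPFILE:
#   GSI "/DC=ch/DC=cern/CN=Example Cloud CA/.*" xcloud-worker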
Questions