HTCondor-CE for USATLAS
Bob Ball, AGLT2/University of Michigan
OSG AHM, March 2015

Why HTCondor-CE?
USATLAS has made it a goal for the next run to use a more highly scalable Grid access method. The GRAM protocol that has been in use can strain gatekeeper resources without reaching the high number of batch jobs needed as our clusters scale up in size. OSG has developed and provided HTCondor-CE to fill this need:
"The biggest difference you will see between an HTCondor CE and a GRAM CE is in the way that jobs are submitted to your batch system; HTCondor CE uses the built-in JobRouter daemon whereas GRAM CE uses jobmanager scripts written in Perl. Customizing your site's CE now requires editing configuration files instead of editing jobmanager scripts."
OSG has published a calendar showing when GRAM support will be dropped (see the OSG transition-plan Twiki page, …ansitionPlan).

What is Condor-CE?
Replacement for GRAM, with far better scalability
Available with OSG-CE software 3.2.x
Incoming jobs are interpreted by the JobRouter and inserted directly from there into the standard batch system queue
Works well with HTCondor and PBS/Slurm; LSF work at WT2 and SWT2 is in progress; working with SGE at BU
The correlation between condor_ce_q and the batch queue is accessed via the ClassAd attributes RoutedToJobId and RoutedFromJobId (see the example below)
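A minimal sketch of that correlation check, assuming an HTCondor batch system behind the CE; the -af:h autoformat option is standard condor_q/condor_ce_q syntax, and the attribute names are the ones listed above:

  # On the CE schedd: each incoming CE job and the batch job it was routed to
  condor_ce_q -af:h ClusterId ProcId RoutedToJobId

  # On the batch schedd: each routed job and the CE job it came from
  condor_q -af:h ClusterId ProcId RoutedFromJobId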

What is Condor-CE?
Configuration macros for the JobRouter control how jobs are inserted into the scheduler (a sketch of the relevant macros is shown below).
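A rough sketch of those macros, as they appear in the HTCondor-CE configuration when the backend is an HTCondor batch system; the host name and spool path are placeholders, not AGLT2's actual values:

  # Local batch schedd the JobRouter inserts jobs into (illustrative values)
  JOB_ROUTER_SCHEDD2_NAME  = ce.example.org
  JOB_ROUTER_SCHEDD2_POOL  = ce.example.org:9618
  JOB_ROUTER_SCHEDD2_SPOOL = /var/lib/condor/spool

  # The routes themselves are defined in JOB_ROUTER_ENTRIES
  # (see the JobRouter Sample Coding slides later in this talk)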

Transition Steps
Initially run HTCondor-CE in parallel with GRAM, at a low rate, while confirming operations (see the gateway sketch below)
Later, move to full HTCondor-CE submission while retaining GRAM functionality; SAM tests will still use GRAM for a while
Full transition: disable GRAM support once HTCondor-CE is in full use
Future (2016) OSG updates will remove GRAM: first included in the distributions but not configured, then removed entirely
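For the parallel-running step, the Gateway section of the osg-configure ini files can enable both gatekeepers at once; a minimal sketch, assuming the 3.2-era option names:

  ; /etc/osg/config.d/20-gateway.ini (illustrative)
  [Gateway]
  ; keep GRAM running during the transition
  gram_gateway_enabled = True
  ; run HTCondor-CE in parallel
  htcondor_gateway_enabled = True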

Initial Implementation
Quoting John Hover: "Panda functionality for fully property-based job submission doesn't exist yet"
Start with "queue"-based submission; queue = "prod" at AGLT2 is the legacy central Production queue
It is possible to set similar values, but this is not yet implemented in the BNL factories, due to the manual configuration involved
Values that can be passed in the Condor-G submit file (a sketch follows):
  +remote_queue = "queue name"
  +maxMemory = value in MB
  xcount = number of cores, e.g., 1
  +maxWallTime = maximum minutes of wall time
There are no IO-related parameters (for now)
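A minimal Condor-G submit-file sketch putting those values together. The CE hostname, executable, and numeric values are placeholders, and the leading '+' on xcount is an assumption (the slide lists it without one):

  universe      = grid
  # HTCondor-CE endpoint (placeholder hostname; the CE listens on port 9619)
  grid_resource = condor ce.example.org ce.example.org:9619
  executable    = pilot_wrapper.sh

  # Attributes consumed by the CE JobRouter, as listed on this slide
  +remote_queue = "prod"
  +maxMemory    = 3968
  +xcount       = 1
  +maxWallTime  = 2880

  output = pilot.out
  error  = pilot.err
  log    = pilot.log
  queue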

Proposed Standard Queues
On March 3 the following "standards" were proposed by John Hover:
  Production: queue = "prod", xcount = 1
  MCORE: queue = "prod", xcount = 8
  Analysis: queue = "analy", xcount = 1
maxWallTime and maxMemory are to be taken from AGIS for each site
Queue values may vary from site to site: THESE SHOULD BE STANDARDIZED
This full standard set is not yet easily possible; the values need to be extracted from AGIS, etc.

Current USATLAS Participation
AGLT2_SL6, AGLT2_TEST, AGLT2_MCORE
ANALY_AGLT2_SL6, ANALY_AGLT2_TIER3_TEST
ANALY_BNL_SHORT-gridgk07, ANALY_BNL_LONG-gridgk07
ANALY_MWT2_SL6-uct2-gk, MWT2_SL6-uct2-gk, MWT2_MCORE-uct2-gk
ANALY_SLAC_SHORT_1HR
OU_OSCER_ATLAS
NET2 is ready, but not yet receiving pilots
These are all either in testing or fully stable

Timetable
The goal is to have full HTCondor-CE usage by the time the run starts
The reality is that neither AGIS nor the factories are ready, not to mention the sites
Phase in as follows:
  Phase 1: queue only defined, with xcount = 8 for MCORE
  Phase 1.5: standardize queue names; possibly pass maxMemory and wall time by queue name, but also accept those specific parameters
  Phase x: full acceptance of all 4 possible parameters, with a simple set of standard queues
  Phase y: should "queue" be eliminated?

What do we need?
HTCondor-CE reporting to BDII was just added; is this working correctly?
SAM must work with HTCondor-CE
Standardize on combinations of queue, xcount, etc.
Still testing with LSF; PBS/Slurm and SGE are working well
Do we need an IO factor? If we implement one, can queue be discarded?
Understanding by site admins of how to write JobRouter routes: this is non-trivial, and some things are not very obvious; lots of hints have been gathered on Twiki pages

JobRouter Sample Coding

JOB_ROUTER_ENTRIES = \
   /* ***** Route no 1 ***** */ \
   /* ***** Analysis queue ***** */ \
   [ \
     GridResource = "condor localhost localhost"; \
     eval_set_GridResource = strcat("condor ", "$(FULL_HOSTNAME)", " $(JOB_ROUTER_SCHEDD2_POOL)"); \
     Requirements = target.queue=="analy"; \
     Name = "Analysis Queue"; \
     eval_set_RequestMemory = ifThenElse(maxMemory isnt undefined, \
                                         ifThenElse(maxMemory <= 4096, 3968, maxMemory), 3968); \
     eval_set_RequestCpus = ifThenElse(xcount isnt undefined, xcount, 1); \
   ] \
   /* ***** Route no 6 ***** */ \
   /* ***** mp8 queue ***** */ \
   [ \
     GridResource = "condor localhost localhost"; \
     eval_set_GridResource = strcat("condor ", "$(FULL_HOSTNAME)", " $(JOB_ROUTER_SCHEDD2_POOL)"); \
     Requirements = ifThenElse(target.queue is undefined, \
                               false, \
                               ifThenElse(target.xcount is undefined, \
                                          false, \
                                          target.queue=="prod" && target.xcount==8)); \

JobRouter Sample Coding (cont)

     Name = "MCORE Queue"; \
     eval_set_RequestMemory = ifThenElse(maxMemory isnt undefined, \
                                         ifThenElse(maxMemory <= 32768, 32640, maxMemory), 32640); \
     eval_set_RequestCpus = ifThenElse(xcount isnt undefined, xcount, 8); \
   ] \
   /* ***** Route no 8 ***** */ \
   /* ***** Default queue for usatlas1 user ***** */ \
   [ \
     GridResource = "condor localhost localhost"; \
     eval_set_GridResource = strcat("condor ", "$(FULL_HOSTNAME)", " $(JOB_ROUTER_SCHEDD2_POOL)"); \
     Requirements = ifThenElse(target.queue is undefined, \
                               regexp("usatlas1",target.Owner), \
                               ifThenElse(target.xcount is undefined, \
                                          regexp("usatlas1",target.Owner), \
                                          target.queue=="prod" && target.xcount==1)); \
     Name = "ATLAS Production Queue"; \
     eval_set_varTest = target.queue; \
     eval_set_RequestMemory = ifThenElse(maxMemory isnt undefined, \
                                         ifThenElse(maxMemory <= 4096, 3968, maxMemory), 3968); \
     eval_set_RequestCpus = ifThenElse(xcount isnt undefined, xcount, 1); \
   ]

Note the insertion into the ClassAd of the variable varTest in Route no 8. This is a good mechanism for checking that what you are doing evaluates correctly: just look at the ClassAd via "condor_q -long" after the job is submitted (see the example below).
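To check it, something along these lines on the batch-side schedd will do; the job ID and the expected value are purely illustrative:

  # Inspect the routed job's ClassAd and pick out the test attribute
  condor_q -long 1234.0 | grep -i varTest
  # hypothetical output: varTest = "prod"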

Some Discussion Points
Should "queue" be retained? Or should an IO factor be implemented?
Standardize queue names to a simple set: what differentiations in job types are relevant? A possible IO parameter?
What PanDA functionality is still needed, and what is impacted by that lack?
Interaction between AGIS and the pilot factories to supply the needed parameters
Non-Condor batch systems: LSF seems well in hand

Some Discussion Points (2)
Interaction with BDII (and its status)
Interaction with SAM tests (and their status)
What setup files should be sourced? There is a long thread on this topic (a sketch follows this list):
  $OSG_GRID/setup.sh
  $OSG_APP/atlas_app/atlas_rel/cctools/latest/setup.sh
  $OSG_APP/atlas_app/atlaswn/setup.sh
  $OSG_APP/atlas_app/atlas_rel/local/setup.sh
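A minimal worker-node wrapper sketch that sources those files when they exist; whether all four are wanted, and in what order, is exactly the open question raised on this slide:

  #!/bin/bash
  # Source each candidate setup file discussed above, skipping any that are absent
  for f in "$OSG_GRID/setup.sh" \
           "$OSG_APP/atlas_app/atlas_rel/cctools/latest/setup.sh" \
           "$OSG_APP/atlas_app/atlaswn/setup.sh" \
           "$OSG_APP/atlas_app/atlas_rel/local/setup.sh"; do
      [ -r "$f" ] && source "$f"
  done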

References
lease3/HTCondorCEOverview
stallHTCondorCE
43SfH9D1ERgnRp8InxBtmxEsdS4S5frVtwXn8v0/edit?pli=1#slide=id.g79f425739_030
lease3/JobRouterRecipes
lease3/TroubleshootingHTCondorCE
lease3/SubmittingHTCondorCE