A Year of HTCondor at the RAL Tier-1
Ian Collier, Andrew Lahiff
STFC Rutherford Appleton Laboratory
HEPiX Spring 2014 Workshop

Outline
Overview of HTCondor at RAL
Computing elements
Multi-core jobs
Monitoring

Introduction
RAL is a Tier-1 for all 4 LHC experiments
–In terms of Tier-1 computing requirements, RAL provides: 2% ALICE, 13% ATLAS, 8% CMS, 32% LHCb
–Also supports ~12 non-LHC experiments, including non-HEP
Computing resources
–784 worker nodes, over 14K cores
–Generally 40-60K jobs submitted per day
Torque / Maui had been used for many years
–Many issues
–Severity & number of problems increased as the size of the farm increased
–In 2012 decided it was time to start investigating a move to a new batch system

Choosing a new batch system
Considered, tested & eventually rejected the following:
–LSF, Univa Grid Engine*
  Requirement: avoid commercial products unless absolutely necessary
–Open source Grid Engines
  Competing products, not clear which has a long-term future
  Communities appear less active than HTCondor & SLURM
  Existing Tier-1s running Grid Engine use the commercial version
–Torque 4 / Maui
  Maui problematic
  Torque 4 seems less scalable than the alternatives (but better than Torque 2)
–SLURM
  Carried out extensive testing & comparison with HTCondor
  Found that for our use case it was very fragile, easy to break
  Unable to get it working reliably above 6,000 running jobs
* Only tested open source Grid Engine, not Univa Grid Engine

Choosing a new batch system
HTCondor chosen as the replacement for Torque / Maui
–Has the features we require
–Seems very stable
–Easily able to run 16,000 simultaneous jobs
  Didn't do any tuning – it "just worked"
  Have since tested > 30,000 running jobs
–More customizable than the other batch systems we considered

Migration to HTCondor
Strategy
–Start with a small test pool
–Gain experience & slowly move resources from Torque / Maui
Migration timeline
2012 Aug  Started evaluating alternatives to Torque / Maui (LSF, Grid Engine, Torque 4, HTCondor, SLURM)
2013 Jun  Began testing HTCondor with ATLAS & CMS; ~1000 cores from old WNs beyond MoU commitments
2013 Aug  Choice of HTCondor approved by management
2013 Sep  HTCondor declared a production service; moved 50% of pledged CPU resources to HTCondor
2013 Nov  Migrated remaining resources to HTCondor

Experience so far
–Very stable operation
  Generally we can just ignore the batch system & everything works fine
  Staff don't need to spend all their time fire-fighting problems
–No changes needed as the HTCondor pool increased in size from ~1000 to ~14,000 cores
–Job start rate much higher than Torque / Maui, even when throttled
  Farm utilization much better
–Very good support

Problems
A few issues found, but fixed quickly by the developers
–Job submission hung when one of a HA pair of central managers was down
  Fixed & released in a subsequent HTCondor release
–Problem affecting HTCondor-G job submission to ARC with HTCondor as the LRMS
  Fixed & released in a subsequent HTCondor release
–Jobs dying 2 hours after a network break between CEs and WNs
  Fixed & released in a subsequent HTCondor release

Computing elements
All job submission to RAL is via the Grid
–No local users
Currently have 5 CEs
–2 CREAM CEs
–3 ARC CEs
CREAM doesn't currently support HTCondor
–We developed the missing functionality ourselves
–Will feed this back so that it can be included in an official release
ARC support is better
–But it didn't originally handle partitionable slots, passing CPU/memory requirements to HTCondor, …
–We wrote lots of patches, all included in the recent release
  This will make it easier for more European sites to move to HTCondor

Computing elements
ARC CE experience
–Have run almost 9 million jobs so far across our 3 ARC CEs
–Generally ignore them and they "just work"
–VOs
  ATLAS & CMS fine from the beginning
  LHCb have added the ability to submit to ARC CEs to DIRAC
  –Seem to be ready to move entirely to ARC
  ALICE not yet able to submit to ARC
  –They have said they will work on this
  Non-LHC VOs
  –Some use DIRAC, which can now submit to ARC
  –Others use the EMI WMS, which can submit to ARC
CREAM CE status
–Plan to phase out the CREAM CEs this year

HTCondor & ARC in the UK
Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC
–RAL T2     HTCondor + ARC (in production)
–Bristol    HTCondor + ARC (in production)
–Oxford     HTCondor + ARC (small pool in production, migration in progress)
–Durham     SLURM + ARC
–Glasgow    Testing HTCondor + ARC
–Liverpool  Testing HTCondor
7 more sites are considering moving to HTCondor or SLURM
Configuration management: a community effort
–The Tier-2s using HTCondor and ARC have been sharing Puppet modules

Multi-core jobs
Current situation
–ATLAS have been running multi-core jobs at RAL since November
–CMS started submitting multi-core jobs in early May
–Interest so far is only in multi-core jobs, not whole-node jobs
  Only 8-core jobs
Our aims
–Fully dynamic
  No manual partitioning of resources
–Number of running multi-core jobs determined by fairshares

Getting multi-core jobs to work
Job submission
–Haven't set up dedicated multi-core queues
–The VO has to request how many cores it wants in its JDL, e.g. (count=8)
Worker nodes configured to use partitionable slots
–Resources of each WN (CPU, memory, …) are divided up as necessary amongst jobs
Set up multi-core groups & associated fairshares
–HTCondor configured to assign multi-core jobs to the appropriate groups
Adjusted the order in which the negotiator considers groups (see the sketch below)
–Consider multi-core groups before single-core groups
  8 free cores are "expensive" to obtain, so try not to lose them to single-core jobs too quickly
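A minimal configuration sketch along these lines; the group names, quota values and sort expression are assumptions for illustration, not the exact RAL settings (lower GROUP_SORT_EXPR values are negotiated first):

  # Worker nodes: one partitionable slot owning all CPU & memory
  NUM_SLOTS = 1
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = 100%
  SLOT_TYPE_1_PARTITIONABLE = TRUE

  # Central manager: multi-core groups & fairshares (hypothetical names/quotas)
  GROUP_NAMES = group_ATLAS, group_ATLAS_multicore, group_CMS, group_CMS_multicore
  GROUP_QUOTA_DYNAMIC_group_ATLAS = 0.4
  GROUP_QUOTA_DYNAMIC_group_ATLAS_multicore = 0.1
  GROUP_QUOTA_DYNAMIC_group_CMS = 0.4
  GROUP_QUOTA_DYNAMIC_group_CMS_multicore = 0.1
  GROUP_ACCEPT_SURPLUS = TRUE
  # Jobs are mapped onto these groups at submission time (e.g. on the CEs); not shown here

  # Negotiate multi-core groups before single-core groups
  GROUP_SORT_EXPR = ifThenElse(AccountingGroup isnt undefined && regexp("multicore", AccountingGroup), 1, 2)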

Getting multi-core jobs to work
If lots of single-core jobs are idle & running, how does a multi-core job start?
–By default it probably won't
condor_defrag daemon
–Finds WNs to drain, triggers draining & cancels draining as required
–Configuration changes from the defaults (see the sketch below):
  Drain 8 cores only, not whole WNs
  Pick WNs to drain based on how many cores they have that can be freed up
  –E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN
–Demand for multi-core jobs is not known by condor_defrag
  Set up a simple cron to adjust the number of concurrently draining WNs based on demand
  –If many multi-core jobs are idle but few are running, drain aggressively
  –Otherwise very little draining
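As an illustration, a condor_defrag configuration sketch along the lines described above; the expressions and limits are assumptions, not the exact RAL values:

  # Run the defrag daemon (e.g. on the central manager)
  DAEMON_LIST = $(DAEMON_LIST) DEFRAG

  # Only consider machines that cannot currently fit an 8-core job
  DEFRAG_REQUIREMENTS = PartitionableSlot && Cpus < 8

  # Treat a machine as "whole" once 8 cores are free, so draining stops there
  # (by default a machine is only "whole" when all of its cores are free)
  DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8

  # Prefer machines where free cores can be obtained most cheaply
  DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput

  # Limits that the demand-driven cron adjusts up or down
  DEFRAG_MAX_CONCURRENT_DRAINING = 4
  DEFRAG_DRAINING_MACHINES_PER_HOUR = 2.0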

Results
Effect of changing the way WNs to drain are selected
–No change in the number of concurrently draining machines
–Rate of increase in the number of running multi-core jobs much higher
[Plot: running multi-core jobs]

Results
Recent ATLAS activity
[Plot: running & idle multi-core jobs]
–Gaps in submission by ATLAS result in loss of multi-core slots
[Plot: number of WNs running multi-core jobs & draining WNs]
–Significantly reduced CPU wastage due to the cron
  Aggressive draining: 3% waste
  Less-aggressive draining: < 1% waste

Worker node health check
Startd cron
–Checks for problems on worker nodes
  Disk full or read-only, CVMFS, swap, …
–Prevents jobs from starting in the event of problems
  If there is a problem with ATLAS CVMFS, only ATLAS jobs are prevented from starting
–Information about problems is made available in machine ClassAds (see the sketch below)
Can easily identify WNs with problems, e.g.
# condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS
lcg0980.gridpp.rl.ac.uk Problem: CVMFS for alice.cern.ch
lcg0981.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch Problem: CVMFS for lhcb.cern.ch
lcg1069.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1070.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1197.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1675.gridpp.rl.ac.uk Problem: Swap in use, less than 25% free
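A sketch of how such a check can be wired up with startd cron; the script path, attribute names and START clause are assumptions for illustration, not the RAL script itself:

  # Run a health-check script on each WN every 10 minutes
  STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WN_HEALTH
  STARTD_CRON_WN_HEALTH_EXECUTABLE = /usr/local/bin/healthcheck_wn
  STARTD_CRON_WN_HEALTH_PERIOD = 10m
  STARTD_CRON_WN_HEALTH_MODE = Periodic

  # The script prints ClassAd attributes to stdout, e.g.
  #   NODE_STATUS = "Problem: CVMFS for cms.cern.ch"
  #   NODE_IS_HEALTHY_cms = False
  # and the startd merges them into the machine ClassAd.

  # Refuse to start jobs from an affected VO while its check is failing
  START = $(START) && ifThenElse(TARGET.x509UserProxyVOName =?= "cms", NODE_IS_HEALTHY_cms =!= False, True)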

Worker node health check
Can also feed this data into Ganglia
–RAL tests new CVMFS releases, so it's important for us to detect increases in CVMFS problems
–Generally only small numbers of WNs have issues
–Example: a user's "problematic" jobs affected CVMFS on many WNs
[Plots: number of WNs reporting CVMFS problems]

Jobs monitoring
The CASTOR team at RAL have been testing Elasticsearch
–Why not try using it with HTCondor?
The ELK stack
–Logstash: parses log files
–Elasticsearch: search & analyze data in real-time
–Kibana: data visualization
Hardware setup
–Test cluster of 13 servers (old disk servers & worker nodes)
  But 3 servers could handle 16 GB of CASTOR logs per day
Adding HTCondor
–Wrote a config file for Logstash to enable history files to be parsed
–Added Logstash to the machines running schedds
Data flow: HTCondor history files → Logstash → Elasticsearch → Kibana

Jobs monitoring
Can see full job ClassAds

Jobs monitoring
Custom plots
–E.g. completed jobs by schedd

Jobs monitoring
Custom dashboards

Jobs monitoring
Benefits
–Easy to set up
  Took less than a day to set up the initial cluster
–Seems able to handle the load from HTCondor
  For us (so far): < 1 GB, < 100K documents per day
–Arbitrary queries
  Seem faster than using native HTCondor commands (condor_history)
–Horizontally scalable
  Need more capacity? Just add more nodes

Summary
Due to scalability problems with Torque / Maui, we migrated to HTCondor last year
We are happy with the choice we made based on our requirements
–Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future
Multi-core jobs are working well
–Looking forward to ATLAS and CMS running multi-core jobs at the same time

Future plans
HTCondor
–Phase in cgroups on the WNs
Integration with the private cloud
–When the production cloud is ready, we want to be able to expand the batch system into the cloud
–Using condor_rooster for provisioning resources (see our HEPiX Fall 2013 talk, and the sketch below)
Monitoring
–Move Elasticsearch into production
–Try sending all HTCondor & ARC CE log files to Elasticsearch
  E.g. could easily find information about a particular job from any log file
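For illustration, a configuration sketch of the two HTCondor pieces mentioned above; the policy values and the provisioning script are assumptions, not a description of the eventual RAL setup:

  # Worker nodes: track & limit jobs with Linux cgroups
  BASE_CGROUP = htcondor
  CGROUP_MEMORY_LIMIT_POLICY = soft

  # Central manager: condor_rooster "wakes up" offline resources on demand;
  # here it calls a site-provided script (hypothetical) that instantiates
  # a virtual worker node in the private cloud
  DAEMON_LIST = $(DAEMON_LIST) ROOSTER
  ROOSTER_INTERVAL = 300
  ROOSTER_UNHIBERNATE = Offline =?= True
  ROOSTER_WAKEUP_CMD = /usr/local/bin/provision_vm.sh
  ROOSTER_MAX_UNHIBERNATE = 10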

Future plans
Ceph
–Have set up a 1.8 PB test Ceph storage system
–Accessible from some WNs using CephFS
–Setting up an ARC CE with a shared filesystem (Ceph)
ATLAS testing with arcControlTower
–Pulls jobs from PanDA, pushes jobs to ARC CEs
–Unlike the normal pilot concept, jobs can have more precise resource requirements specified
–Input files are pre-staged & cached by ARC on Ceph

Thank you!