Tier-1 Batch System Report
Andrew Lahiff, Alastair Dewhurst, John Kelly, Ian Collier
5 June 2013, HEP SYSMAN

2 Current batch system
Current production batch system
– 656 worker nodes, 9424 job slots
– Torque, Maui
Issues
– pbs_server and maui are sometimes unresponsive
– pbs_server sometimes needs to be restarted due to excessive memory usage
– Job start rate is sometimes not high enough to keep the farm full
– Regular job submission failures: "Connection timed out - qsub: cannot connect to server"
– Unable to schedule jobs to the whole-node queue
– Sometimes schedules SL5 jobs on SL6 worker nodes & vice versa
– DNS issues, network issues & problematic worker nodes cause it to "require attention"

3 Alternatives
Tested the following technologies
– Torque 4 + Maui
– LSF
– Grid Engine
– SLURM
– HTCondor
Identified SLURM & HTCondor as suitable for further testing at larger scales
– ✗ LSF, Univa Grid Engine, Oracle Grid Engine: avoid expensive solutions unless all open-source products prove unsuitable
– ✗ Torque 4 + Maui: would still need to use Maui
– ✗ Open-source Grid Engines: competing products, not clear which has the best long-term future; neither seems to have a community as active as SLURM's or HTCondor's

4 HTCondor vs SLURM
Larger scale tests with 110 worker nodes
HTCondor
– No problems running jobs
  Need to ensure machines with job queues (schedds) have enough memory
– No problems with large numbers of pending jobs
SLURM
– Stability problems experienced when running > ~6000 jobs
  Everything is fine when no jobs are completing or starting (!)
– Queries (sinfo, squeue, …) and job submission fail: "Socket timed out on send/recv operation"
– Using FIFO scheduling helps
  Cannot use this in production
– Some activities (restarting the SLURM controller, killing large numbers of jobs, …) trigger unresponsiveness
  It takes many hours to return to a stable situation

5 HTCondor vs SLURM
Currently, HTCondor seems like a better choice for us than SLURM
– Move on to the next stage of testing with HTCondor
  Testing with selected VOs (more info later)
– Will continue trying to resolve the problems with SLURM
SLURM
– Advertised as handling many thousands of nodes and hundreds of thousands of processors
– However, in the Grid community
  Need fairshares, "advanced" job scheduling, accounting, …
  Tier-2s using SLURM in production have fewer job slots than we tried
  – e.g. Estonia has 4580 job slots
  Others who have tested SLURM did so at smaller scales than us
– We have tried a configuration identical to that of a SLURM Tier-2, without success at large scales

6 HTCondor
Features we're using (illustrative configuration sketch below)
– High availability of the central manager
– Hierarchical fairshares
– Partitionable slots: available cores & memory are divided up as necessary for each job
– Concurrency limits, e.g. restrictions on the maximum number of running jobs for specific users
Features not tried yet
– IPv6
– Saving power: hibernation of idle worker nodes
– CPU affinity, per-job PID namespaces, chroots, per-job /tmp directories, cgroups, …
Queues
– HTCondor has no traditional queues: jobs specify their requirements, e.g. SL5 or SL6, number of cores, amount of memory, …
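As a minimal sketch of the kind of configuration behind these features (the percentages, quotas, group names and limit name are illustrative assumptions, not the RAL settings):

```
##### Worker node: one partitionable slot per machine #####
# All cores and memory go into a single slot that is carved up per job.
NUM_SLOTS                 = 1
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = cpus=100%, memory=100%
SLOT_TYPE_1_PARTITIONABLE = True

##### Central manager: hierarchical fairshares via accounting groups #####
# Hypothetical group tree and dynamic quotas, for illustration only.
GROUP_NAMES                          = group_atlas, group_atlas.prod, group_cms
GROUP_QUOTA_DYNAMIC_group_atlas      = 0.6
GROUP_QUOTA_DYNAMIC_group_atlas.prod = 0.7
GROUP_QUOTA_DYNAMIC_group_cms        = 0.4
GROUP_ACCEPT_SURPLUS                 = True

##### Central manager: concurrency limit (hypothetical name) #####
# Caps the number of concurrently running jobs that declare
# "concurrency_limits = sam_tests" in their submit files.
SAM_TESTS_LIMIT = 10
```

A job then opts into the limit with "concurrency_limits = sam_tests", and is matched against whatever fraction of a partitionable slot satisfies its request_cpus/request_memory.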

7 Compatibility with Middleware
HTCondor & EMI-3 CREAM CE
– BUpdaterCondor & the condor_*.sh scripts exist
  Very easy to get a CREAM CE working successfully with HTCondor
– The script for publishing dynamic information doesn't exist
What about ARC CEs?
– Could migrate from CREAM to ARC CEs
– Benefits of ARC CEs
  Support HTCondor, including all publishing (see the arc.conf sketch below)
  Simpler than CREAM CEs (no YAIM, no Tomcat, no MySQL, …)
– Accounting
  The ARC CE accounting publisher (JURA) can send accounting records directly to APEL using SSM
  – Not tried yet
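For illustration, pointing an ARC CE at HTCondor amounts to little more than selecting the right LRMS in a classic arc.conf; the fragment below is only an assumed sketch (the JURA/APEL publishing mentioned above lives in the [grid-manager] section and is not shown, since we have not tried it):

```
[common]
# Tell the ARC CE which batch system backend (LRMS) to drive;
# "condor" selects the HTCondor submit/scan backend scripts shipped with ARC.
lrms="condor"
```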

8 Current status
"Almost" production-quality service setup
– HTCondor with 2 central managers
– 2 ARC CEs (EMI), using LCAS/LCMAPS + Argus
– 112 8-core EMI-2 SL6 worker nodes
Testing
– Evaluation period of up to ~3 months using resources beyond WLCG pledges
– Aim to gain experience running "real" work
  Stability, reliability, functionality, dealing with problems, …
– Initial testing with ATLAS and CMS
  ATLAS: production & analysis SL6 queues
  CMS: Monte Carlo production (integration testbed)
– Next steps
  Ramp up CMS usage
  Try other VOs, including non-LHC VO submission through the WMS

9 Backup slides

10 Test
Test to reproduce problems seen with the production batch system (a sketch of such a test loop is shown below)
– High job submission rate (1000 jobs submitted as quickly as possible)
– High job query rate (1000 qstat's or equivalent)
Torque
– 90% of job submissions failed
– 70-80% of job queries failed
Torque 4.x
– < 10% failure rate
HTCondor, LSF, Grid Engine, SLURM
– 100% successful
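A minimal sketch of such a test, assuming a trivial pre-existing test job (a hypothetical sleep.sub); the actual test harness is not included in these slides, so the sequential loops below only illustrate the submission- and query-rate idea:

```bash
#!/bin/bash
# Stress the batch server: 1000 rapid submissions, then 1000 status queries,
# counting how many of each fail. Swap in qsub/qstat (Torque, Grid Engine)
# or sbatch/squeue (SLURM) as appropriate for the system under test.

sub_fail=0
for i in $(seq 1 1000); do
    condor_submit sleep.sub > /dev/null 2>&1 || sub_fail=$((sub_fail + 1))
done
echo "failed submissions: ${sub_fail}/1000"

query_fail=0
for i in $(seq 1 1000); do
    condor_q > /dev/null 2>&1 || query_fail=$((query_fail + 1))
done
echo "failed queries: ${query_fail}/1000"
```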

11 HTCondor: basic principles
Two parts
– Jobs
– Machines/Resources
Jobs state their requirements, preferences & attributes
– e.g. I require a Linux/x86 platform
– e.g. I am owned by atlas001
– e.g. I prefer a machine owned by CMS
Machines state their requirements, preferences & attributes
– e.g. I require jobs that need less than 4 GB of RAM
– e.g. I am a Linux node; I have 16 GB of memory
HTCondor matches jobs to resources
– Priorities, fairshares, etc. are also taken into account (example submit description below)
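A minimal submit description file expressing the job side of this matchmaking; the executable, memory figure and the NodeOwner machine attribute are hypothetical, chosen only to mirror the examples above:

```
universe       = vanilla
executable     = analysis.sh

# Requirement: only match 64-bit Linux machines with at least 2 GB of memory
requirements   = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 2000)

# Preference: rank machines advertising a (hypothetical) NodeOwner = "CMS"
# attribute above everything else
rank           = (NodeOwner =?= "CMS")

# Resources requested from a (partitionable) slot
request_cpus   = 1
request_memory = 2000

queue
```

The machine side is expressed the same way, e.g. a START expression such as START = (TARGET.RequestMemory <= 4000) to refuse jobs asking for more than 4 GB; the negotiator then pairs job and machine ClassAds whose requirements are mutually satisfied, weighted by user priorities and fairshares.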

12 HTCondor overview

13 Basic ganglia monitoring
Monitoring showing
– Running & idle jobs
– Jobs starting, exiting & submitted per schedd (CE); one way to pull such per-schedd numbers is sketched below
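As an aside, per-schedd totals like these can be read straight out of the schedd ClassAds; the one-liner below is only a sketch of how such numbers might be fed to ganglia, not necessarily how the plots shown here were produced:

```bash
# Print, for every schedd (CE), its name plus running/idle job totals
condor_status -schedd -autoformat Name TotalRunningJobs TotalIdleJobs
```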

14 ARC CE