“Better to avoid problems than deal with them when they occur”

Healthcheck script
“Better to avoid problems than deal with them when they occur”

History
- PBS
- Nagios
  - Lots of elaborate scripts to monitor stuff
  - Very centralized
  - Not easily re-configured

Now
- HTCondor
  - Not at all centralized – distributed
  - Can be dynamically configured
- Nagios
  - Very centralized
  - Not easily re-configured

HTCondor
- Make a node check itself
- How to do this? Condor CRON:

  STARTD_CRON_JOBLIST=$(STARTD_CRON_JOBLIST) WN_HEALTHCHECK

Condor CRON config

  STARTD_CRON_JOBLIST=$(STARTD_CRON_JOBLIST) WN_HEALTHCHECK
  STARTD_CRON_WN_HEALTHCHECK_EXECUTABLE=/usr/local/bin/healhcheck_wn_condor
  STARTD_CRON_WN_HEALTHCHECK_KILL=true
  STARTD_CRON_WN_HEALTHCHECK_MODE=periodic
  STARTD_CRON_WN_HEALTHCHECK_PERIOD=10m
  STARTD_CRON_WN_HEALTHCHECK_RECONFIG=false

Condor CRON
The healthcheck script (sketched below) returns two values:

  NODE_IS_HEALTHY = True
  NODE_STATUS = "All_OK"

These feed into the startd's START expression:

  START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)
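
The slides do not include the script itself; the following is a minimal sketch of what a startd cron healthcheck executable could look like (the read-only filesystem check is purely illustrative, not RAL's actual logic):

  #!/bin/bash
  # Minimal sketch of a startd cron healthcheck executable (illustrative,
  # not the actual RAL script). The startd parses stdout lines of the form
  # "Attribute = Value" and merges them into the machine ClassAd.

  status="All_OK"
  healthy="True"

  # Example check: flag the node if the root filesystem is mounted read-only
  if awk '$2 == "/" && $4 ~ /(^|,)ro(,|$)/ { found=1 } END { exit !found }' /proc/mounts; then
      status="Readonly_filesystem"
      healthy="False"
  fi

  echo "NODE_IS_HEALTHY = ${healthy}"
  echo "NODE_STATUS = \"${status}\""
  # A line containing only '-' marks the end of the ClassAd
  echo "-"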

Condor self monitoring 1
- Worked well
- Questions about what constitutes a worker node: 'real machines', VMs, containers, ...

Condor self monitoring 2
Many Nagios checks in the healthcheck script (a sketch of how these could be wrapped follows below):
- read-only filesystem
- NTP check
- CVMFS checks
- uptime
- swap usage
- etc.
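
As noted above, a sketch of how individual Nagios plugins could be wrapped inside the healthcheck script and mapped onto NODE_STATUS (plugin paths, thresholds and the NTP server are assumptions, not RAL's actual configuration):

  #!/bin/bash
  # Illustrative fragment: run standard Nagios plugins and map their exit
  # codes (non-zero on WARNING/CRITICAL/UNKNOWN) onto NODE_STATUS.
  PLUGINS=/usr/lib64/nagios/plugins
  status="All_OK"
  healthy="True"

  if ! $PLUGINS/check_ntp_time -H ntp.example.com -w 0.5 -c 1 >/dev/null 2>&1; then
      status="NTP_problem";  healthy="False"
  elif ! $PLUGINS/check_swap -w 20% -c 10% >/dev/null 2>&1; then
      status="Swap_problem"; healthy="False"
  fi

  echo "NODE_IS_HEALTHY = ${healthy}"
  echo "NODE_STATUS = \"${status}\""
  echo "-"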

Note that the healthcheck script runs as the condor user, so no checks requiring root can be done. Possibly not a 'good thing', but it works for us.

Extras 1
Selectively disable VOs. The START expression is unchanged:

  START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)

but instead of the usual

  NODE_IS_HEALTHY = True

the node can advertise an expression that refuses jobs from a particular VO (sketch below), e.g. blocking Owners matching "atl":

  NODE_IS_HEALTHY = True && (regexp("atl", Owner) =?= False)
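
One way this might be wired up (a sketch only; the flag file and its path are assumptions, not necessarily how RAL does it) is for the healthcheck script to decide which expression to emit:

  # In the healthcheck script: emit a VO-aware expression if an admin has
  # dropped a flag file on the node (file path is illustrative)
  if [ -f /etc/condor/disable_atlas ]; then
      echo 'NODE_IS_HEALTHY = True && (regexp("atl", Owner) =?= False)'
  else
      echo "NODE_IS_HEALTHY = True"
  fi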

Extras 2
- Report to Nagios via send_nsca (sketch below)
- Query unhealthy nodes in HTCondor:

  condor_status -constraint 'NODE_STATUS =!= "All_OK" && partitionableslot == True' -autoformat Machine NODE_STATUS
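
A sketch of how the send_nsca reporting could look at the end of the healthcheck script (the Nagios server, service description and send_nsca config path are assumptions):

  # Push the result to Nagios as a passive check result; send_nsca reads
  # tab-separated "host<TAB>service<TAB>return code<TAB>plugin output".
  if [ "$status" = "All_OK" ]; then rc=0; else rc=2; fi
  printf '%s\t%s\t%s\t%s\n' "$(hostname -f)" "WN_healthcheck" "$rc" "$status" \
      | send_nsca -H nagios.example.com -c /etc/nagios/send_nsca.cfg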

Conclusions
- healhcheck_wn_condor works well
- Uses Nagios checks
- Very RAL-specific