“Better to avoid problems than deal with them when they occur”

Slides:

Advertisements

Similar presentations

NAGIOS AND CACTI NETWORK MANAGEMENT AND MONITORING SYSTEMS.

Advertisements

Building a secure Condor ® pool in an open academic environment Bruce Beckles University of Cambridge Computing Service.

HTCondor and the European Grid Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014.

1 CHEP 2000, Roberto Barbera Roberto Barbera (*) Grid monitoring with NAGIOS WP3-INFN Meeting, Naples, (*) Work in collaboration with.

Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.

Creating Distributed Setup WEBLOGIC SERVER. Create a Distributed Setup Using Node Manager Basic idea: When the servers are in production, and all of them.

CVMFS: Software Access Anywhere Dan Bradley Any data, Any time, Anywhere Project.

Tier-1 experience with provisioning virtualised worker nodes on demand Andrew Lahiff, Ian Collier STFC Rutherford Appleton Laboratory, Harwell Oxford,

Condor Project Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.

Nagios and Mod-Gearman In a Large-Scale Environment Jason Cook 8/28/2012.

Two Years of HTCondor at the RAL Tier-1

CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.

BESIII distributed computing and VMDIRAC

HTCondor at the RAL Tier-1 Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014.

Grid Appliance – On the Design of Self-Organizing, Decentralized Grids David Wolinsky, Arjun Prakash, and Renato Figueiredo ACIS Lab at the University.

HTCondor at the RAL Tier-1 Andrew Lahiff, Alastair Dewhurst, John Kelly, Ian Collier pre-GDB on Batch Systems 11 March 2014, Bologna.

Grid job submission using HTCondor Andrew Lahiff.

AEgir Maintain your Drupal sites. The name: AEgir “In Norse mythology, AEgir was the god of the oceans and if Drupal is a drop of water, AEgir is the.

Monitoring at GRIF Frédéric SCHAER cea.fr Hepix Workshop, April

Support in setting up a non-grid Atlas Tier 3 Doug Benjamin Duke University.

Multi-core jobs at the RAL Tier-1 Andrew Lahiff, Alastair Dewhurst, John Kelly February 25 th 2014.

Privilege separation in Condor Bruce Beckles University of Cambridge Computing Service.

01/13/051 Cheap, Easy Virtual Hosts for Web-Based Services Richard L. Goerwitz III.

CERN Using the SAM framework for the CMS specific tests Andrea Sciabà System Analysis WG Meeting 15 November, 2007.

Drush: The Drupal Shell Utility Trevor Mckeown Founder & Owner Sublime Technologies

Two Years of HTCondor at the RAL Tier-1 Andrew Lahiff, Alastair Dewhurst, John Kelly, Ian Collier STFC Rutherford Appleton Laboratory 2015 WLCG Collaboration.

HTCondor & ARC CEs Andrew Lahiff, Alastair Dewhurst, John Kelly, Ian Collier GridPP March 2014, Pitlochry, Scotland.

GLIDEINWMS - PARAG MHASHILKAR Department Meeting, August 07, 2013.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Using GStat 2.0 for Information Validation.

A Year of HTCondor at the RAL Tier-1 Ian Collier, Andrew Lahiff STFC Rutherford Appleton Laboratory HEPiX Spring 2014 Workshop.

Issues on the operational cluster 1 Up to 4.4x times variation of the execution time on 169 cores Using -O2 optimization flag Using IBM MPI without efficient.

Workload management, virtualisation, clouds & multicore Andrew Lahiff.

HTCondor Private Cloud Integration Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014.

DataTAG is a project funded by the European Union International School on Grid Computing, 23 Jul 2003 – n o 1 GridICE The eyes of the grid PART I. Introduction.

DataTAG is a project funded by the European Union CERN, 8 May 2003 – n o 1 / 10 Grid Monitoring A conceptual introduction to GridICE Sergio Andreozzi

Tier 1 Experience Provisioning Virtualized Worker Nodes on Demand Ian Collier, Andrew Lahiff UK Tier 1 Centre, RAL ISGC 2014.

Experience integrating a production private cloud in a Tier 1 Grid site Ian Collier Andrew Lahiff, George Ryall STFC RAL Tier 1 ISGC 2015 – March 20 th.

Integrating HTCondor with ARC Andrew Lahiff, STFC Rutherford Appleton Laboratory HTCondor/ARC CE Workshop, Barcelona.

3 Compute Elements are manageable By hand 2 ? We need middleware – specifically a Workload Management System (and more specifically, “glideinWMS”) 3.

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi1 Condor Week 2008 Pseudo-interactive monitoring in Condor by Igor Sfiligoi.

High Availability For Nagios Mike Weber

Distributed Monitoring with Nagios: Past, Present, Future Mike Guthrie

CHTC Policy and Configuration

Site Administration Tools: Ansible

Dynamic Deployment of VO Specific Condor Scheduler using GT4

Progress on NA61/NA49 software virtualisation Dag Toppe Larsen Wrocław

HTCondor at RAL: An Update

Examples Example: UW-Madison CHTC Example: Global CMS Pool

INFNGRID Monitoring Group report

HTCondor at the RAL Tier-1

Things you may not know about HTCondor

ATLAS Cloud Operations

Do-more Technical Training

High Availability in HTCondor

Moving from CREAM CE to ARC CE

Sergio Fantinel, INFN LNL/PD

Marian Babik Andrea Sciabà

CREAM-CE/HTCondor site

OSG Connect and Connect Client

Quattor Usage at Nikhef

Accounting in HTCondor

The Scheduling Strategy and Experience of IHEP HTCondor Cluster

HTCondor Command Line Monitoring Tool

Intro to Config Management Using Salt Open Source

Accounting, Group Quotas, and User Priorities

Basic Grid Projects – Condor (Part I)

HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.

Condor: Firewall Mirroring

Basic Dynamic Analysis VMs and Sandboxes

PU. Setting up parallel universe in your pool and when (not

Presentation transcript:

“Better to avoid problems than deal with them when they occur” Healthcheck script “Better to avoid problems than deal with them when they occur”

History Pbs Nagios Lots of elaborate scripts to monitor stuff Very centralized Not easily re-configured Nagios Lots of elaborate scripts to monitor stuff

Now HtCondor Nagios Not at all centralized – distributed Can be dynamically configured Nagios Very centralized Not easily re-configured

HtCondor Make a node check itself How to do this … Condor CRON STARTD_CRON_JOBLIST=$(STARTD_CRON_JOBLIST) WN_HEALTHCHECK

Condor CRON config STARTD_CRON_JOBLIST=$(STARTD_CRON_JOBLIST) WN_HEALTHCHECK STARTD_CRON_WN_HEALTHCHECK_EXECUTABLE=/usr/local/bin/healhcheck_wn_condor STARTD_CRON_WN_HEALTHCHECK_KILL=true STARTD_CRON_WN_HEALTHCHECK_MODE=periodic STARTD_CRON_WN_HEALTHCHECK_PERIOD=10m STARTD_CRON_WN_HEALTHCHECK_RECONFIG=false

Condor CRON Returns two values. NODE_IS_HEALTHY = True NODE_STATUS = "All_OK" START=(NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)

Condor self monitoring 1 Worked well Questions about what constitutes a worker node, 'real machines', VMs, containers, ...

Condor self monitoring 2 Many nagios checks in healthcheck script read-only filesystem ntp check cvmfs checks uptime swap usage Etc …

Note that healthcheck scripts runs as user Condor, so no root checks. possibly not a 'good thing', but it works for us.

Extras 1 Selectively disable Vos START=(NODE_IS_HEALTHY =?= True) && (StartJobs =?= True) NODE_IS_HEALTHY = True NODE_IS_HEALTHY = True && (regexp("atl", Owner) =?= False)

Extras 2 Report to nagios via send_nsca Query nodes in condor condor_status -constraint 'NODE_STATUS =!= "All_OK" && partitionableslot == True' -autoformat Machine NODE_STATUS

Conclusions healhcheck_wn_condor works well. Uses nagios checks Very RAL specific