HTCondor Command Line Monitoring Tool


HTCondor Command Line Monitoring Tool Vipul Davda

Checking HTCondor Submitted Jobs

condor_q displays information about jobs in the HTCondor job queue. Useful options:

-global                   Show jobs submitted to all the schedulers.
-wide                     Do not truncate long lines.
-analyze <job_id>         Analyse a specific job and show the reason why it is in its current state.
-better-analyze <job_id>  As -analyze, but giving extended information.
-long <job_id>            Show all attributes of the job.
-run                      Show information about running jobs.
-hold                     Show only jobs in the "on hold" state.
-af <attr1> <attr2> ...   List specific attributes of jobs, using automatic formatting.

condor_q is the basic command supplied with HTCondor.
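The job states condor_q reports are stored numerically in the JobStatus attribute of each job ad. A minimal sketch of summarising jobs by state, assuming the jobs have already been fetched (e.g. with condor_q -af ClusterId JobStatus) into a list of dicts:

```python
# Sketch: summarising jobs by HTCondor's numeric JobStatus codes
# (1=Idle, 2=Running, 3=Removed, 4=Completed, 5=Held,
#  6=Transferring Output, 7=Suspended).
from collections import Counter

STATUS_NAMES = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed",
                5: "Held", 6: "Transferring Output", 7: "Suspended"}

def summarise(jobs):
    """Count jobs per human-readable status name."""
    return dict(Counter(STATUS_NAMES.get(j["JobStatus"], "Unknown")
                        for j in jobs))

# Example with made-up job ads:
jobs = [{"JobStatus": 2}, {"JobStatus": 2}, {"JobStatus": 1}, {"JobStatus": 5}]
print(summarise(jobs))  # {'Running': 2, 'Idle': 1, 'Held': 1}
```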

HTCondor Online/Offline Worker Nodes

To offline a worker node:
condor_config_val -name <worker node> -startd -set "StartJobs = False"
sleep 2
condor_reconfig -name <worker node>

To online a worker node:
condor_config_val -name <worker node> -startd -set "StartJobs = True"

The above is the standard way of doing this, but it is a bit clunky.
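A wrapper script can assemble these command sequences before running them via subprocess. A minimal sketch, mirroring the commands on the slide (the node name is a placeholder, and a condor_reconfig after the online change is added on the assumption it is needed in both directions):

```python
# Sketch: building the offline/online command sequences as argv lists,
# ready for subprocess.run(). Commands mirror the slide above.
def offline_cmds(node):
    return [
        ["condor_config_val", "-name", node, "-startd",
         "-set", "StartJobs = False"],
        ["condor_reconfig", "-name", node],
    ]

def online_cmds(node):
    # Assumption: a reconfig is also needed after setting StartJobs = True.
    return [
        ["condor_config_val", "-name", node, "-startd",
         "-set", "StartJobs = True"],
        ["condor_reconfig", "-name", node],
    ]

for cmd in offline_cmds("t2wn16"):
    print(" ".join(cmd))
```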

HTCondor List Online Worker Nodes

To list online worker nodes:
condor_status -server -wide | grep slot1@ | awk '{print $1}' | awk -F@ '{print $2}' | sort

To list offline worker nodes: ??? Is there a way to list offline nodes?
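The grep/awk pipeline above can be sketched in Python, which is handy inside a wrapper script. The sample text below is an invented stand-in for condor_status -server -wide output; the only assumption taken from the pipeline is that the first column is slotN@hostname:

```python
# Sketch: the grep slot1@ | awk pipeline, done with stdlib Python.
sample = """\
slot1@t2wn2.example.org  LINUX  X86_64  0.010  2048
slot2@t2wn2.example.org  LINUX  X86_64  0.000  2048
slot1@t2wn3.example.org  LINUX  X86_64  1.250  4096
"""

def online_nodes(status_output):
    """Return sorted hostnames, taking slot1@ as 'one entry per node'."""
    nodes = set()
    for line in status_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("slot1@"):
            nodes.add(fields[0].split("@", 1)[1])  # keep the hostname
    return sorted(nodes)

print(online_nodes(sample))  # ['t2wn2.example.org', 't2wn3.example.org']
```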

HTCondor List Jobs on a Worker Node

To list jobs on a worker node:
condor_q -constraint 'RemoteHost == "slot1@<worker node>"'
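The quoting in that constraint is fiddly to get right from a script, so it is worth building the expression programmatically. A small sketch, following the slide's quoting convention:

```python
# Sketch: building the condor_q -constraint argv list for a node.
# The inner double quotes belong to the ClassAd string literal.
def jobs_on_node_cmd(node):
    constraint = 'RemoteHost == "slot1@%s"' % node
    return ["condor_q", "-constraint", constraint]

print(jobs_on_node_cmd("t2wn2"))
```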

condor_wn

condor_wn is a Python script which gives a snapshot view of what is running on the cluster. It uses the following additional Python modules:
htcondor and classad (https://htcondor-python.readthedocs.io/en/latest/) - to look at the inventory in the pool
prettytable - to produce the pretty output
yaml - to read values from a configuration file

A home-grown script to simplify things.

condor_wn - summary tables

Displays summary tables:
CPU Summary
Number of Jobs
Jobs per VO
Idle Jobs (displayed if the jobs in the idle queue are more than 1 day old)*
Held Jobs (displayed if the jobs in the held queue are more than 1 day old)*
Long Running Jobs (displayed if the jobs in the running queue are more than 4 days old)*
* The default values can be changed in the condor_wn.yaml config file.

The condor_wn script is in /usr/local/bin and the YAML file is in /usr/local/etc. "Idle" in HTCondor terminology is equivalent to "Queued" in Torque terminology.
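The age-threshold logic behind the extra tables can be sketched as below. The threshold key names (idle_days, held_days, long_running_days) are invented stand-ins for whatever condor_wn.yaml actually uses:

```python
# Sketch: flag jobs older than the per-state threshold, as condor_wn's
# Idle/Held/Long Running tables do. Times are Unix seconds.
DEFAULTS = {"idle_days": 1, "held_days": 1, "long_running_days": 4}

def overdue(jobs, now, thresholds=DEFAULTS):
    """Return ids of jobs whose age exceeds the threshold for their state."""
    limit = {"Idle": thresholds["idle_days"],
             "Held": thresholds["held_days"],
             "Running": thresholds["long_running_days"]}
    return [j["id"] for j in jobs
            if j["state"] in limit
            and (now - j["submitted"]) / 86400.0 > limit[j["state"]]]

now = 10 * 86400
jobs = [{"id": "123.0", "state": "Idle", "submitted": 8 * 86400},     # 2 days
        {"id": "124.0", "state": "Running", "submitted": 3 * 86400},  # 7 days
        {"id": "125.0", "state": "Running", "submitted": 9 * 86400}]  # 1 day
print(overdue(jobs, now))  # ['123.0', '124.0']
```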

condor_wn - list jobs on worker nodes

condor_wn --workernode t2wn2,t2wn3 - displays jobs on the given worker node(s).
Note: memory is allocated at 1.5 times the requested value.

condor_wn - list jobs on worker nodes

condor_wn --workernode t2wn2,t2wn3 --column 6 --descending - displays jobs on the worker nodes, sorted by column 6 (AverageLoad) in descending order.
The short option versions -w, -c and -d can also be used.
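The --column/--descending behaviour reduces to a keyed sort. A minimal sketch, assuming each worker node is one row and columns are numbered from 1 as on the slide:

```python
# Sketch: sort table rows by a 1-based column number, optionally descending.
def sort_rows(rows, column=1, descending=False):
    return sorted(rows, key=lambda r: r[column - 1], reverse=descending)

rows = [["t2wn2", 0.5], ["t2wn3", 1.2], ["t2wn4", 0.9]]
print(sort_rows(rows, column=2, descending=True))
# [['t2wn3', 1.2], ['t2wn4', 0.9], ['t2wn2', 0.5]]
```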

condor_wn - online/offline worker nodes

condor_wn --offline t2wn16    offline worker node t2wn16
condor_wn --list offline      list all offline worker nodes
condor_wn --online t2wn16     online worker node t2wn16

Putting ALL worker nodes online/offline:
condor_wn --offline ALL
Are you sure that you want to put ALL workernodes offline [y/N] y
Are you REALLY sure that you want to put ALL workernodes offline [y/N] N

The command needs to be run on the collector daemon node, e.g. t2condor01 for Oxford's production cluster.
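The double confirmation shown above can be sketched as follows; the prompt function is injectable so the logic is testable without a terminal (the function name and signature are invented, not condor_wn's actual code):

```python
# Sketch: double confirmation before acting on ALL worker nodes.
def confirm_all(action, ask=input):
    q1 = ("Are you sure that you want to put ALL workernodes %s [y/N] "
          % action)
    q2 = ("Are you REALLY sure that you want to put ALL workernodes %s [y/N] "
          % action)
    # Both answers must be an explicit "y"; anything else aborts.
    return ask(q1).strip().lower() == "y" and ask(q2).strip().lower() == "y"

answers = iter(["y", "N"])
print(confirm_all("offline", ask=lambda q: next(answers)))  # False
```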

condor_wn - list

condor_wn --list offline      list all offline worker nodes
condor_wn --list online       list all online worker nodes
condor_wn --list multicore    list all multicore jobs
condor_wn --list all          list all worker nodes

condor_wn - html

condor_wn --list all --html    list all worker nodes and create an HTML output

The output can be directed into an HTML file, and the format is customisable: if you don't like the layout of the tables, the template for the output is defined in the condor_wn.yaml file.

To avoid copying the HTML file to a webserver, the webserver can be enabled to run the command locally:
Install the HTCondor RPMs on the webserver.
Copy the configuration files: pool_password, condor_config, condor_config.local, 10_security.config
Make sure SCHEDD_HOST is defined in 10_security.config:
SCHEDD_HOST = <scheduler server>
(the scheduler is t2arc01, for example)
Do not start any HTCondor daemons.
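The template-driven HTML output can be sketched with stdlib string formatting; the template string below is invented, standing in for whatever condor_wn.yaml defines:

```python
# Sketch: render a summary table to HTML from a template string.
from html import escape

TEMPLATE = "<table>\n<tr>{header}</tr>\n{rows}\n</table>"

def to_html(headers, rows):
    head = "".join("<th>%s</th>" % escape(h) for h in headers)
    body = "\n".join(
        "<tr>" + "".join("<td>%s</td>" % escape(str(c)) for c in row) + "</tr>"
        for row in rows)
    return TEMPLATE.format(header=head, rows=body)

print(to_html(["Node", "Jobs"], [["t2wn2", 8], ["t2wn3", 0]]))
```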

condor_wn - html

If the "Long Running Jobs" criterion is met, a table listing the job ids will be displayed. In this example the extra Long Running Jobs table appears because there is data. The script will be provided on the GridPP GitHub.

Questions?