Monitoring Primer. HTCondor Workshop Barcelona, 2016-03-01. Todd Tannenbaum, Center for High Throughput Computing, University of Wisconsin-Madison.

Presentation transcript:

Monitoring Primer (slide 1)
HTCondor Workshop Barcelona
Todd Tannenbaum, Center for High Throughput Computing, University of Wisconsin-Madison

Ad types in the condor_collector (slide 2)

› startd ads: an ad for each slot on each machine
   State = Unclaimed | Claimed | Drained ...
   Activity = Busy | Idle | Retiring | Vacating ...
   CPUs, Memory, Disk, GPUs, ...
› submitter ads: an ad for each unique user
   RunningJobs, IdleJobs, HeldJobs, ...
› schedd, master, negotiator, collector ads: one ad from each daemon instance
(Some example queries against these ad types follow.)
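For concreteness, a few hedged condor_status invocations for browsing each ad type; these use standard condor_status options, and the attribute names come from the list above:

  condor_status -startd                    # one line per slot ad
  condor_status -submitter -af Name RunningJobs IdleJobs HeldJobs
  condor_status -schedd -long              # dump every attribute of the schedd ads
  condor_status -master                    # one master ad per machine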

(slide 3)
Q: How many slots are running a job?
A: Count the slots where State == Claimed (and Activity != Idle). How?

Obvious solutions aren't the best (slide 4)

  % condor_status
  ...  LINUX  X86_64  Claimed    Busy   ...  :28:24
  ...  LINUX  X86_64  Claimed    Busy   ...  :14:13
  ...  LINUX  X86_64  Unclaimed  Idle   ...  :00:04
  ...
  (slot names and some numeric columns are elided in this excerpt)

› condor_status | grep Claimed | grep -v Idle | wc -l
   Output subject to change, wrong answers, slow
› condor_status -l | grep Claimed | wc -l
   Wrong answers, slow

Use constraints and projections (slide 5)

› condor_status [-startd | -schedd | -master ...] -constraint <expr> -autoformat <attrs>
  Think SQL: the ad type plays the role of FROM, -constraint is the WHERE, and -autoformat is the SELECT.

  condor_status -startd \
    -cons 'State=="Claimed" && Activity!="Idle"' \
    -af Name | wc -l

Q: Which slots are running on machines where NFS is broken? (slide 6)

› Ask the startd to run a script/program to test the health of NFS:
    STARTD_CRON_JOBLIST = tag
    STARTD_CRON_tag_EXECUTABLE = detect.sh
  The script returns a ClassAd with an attribute NFS_Broken = True | False
  (a sketch of such a script follows this slide)
› condor_status -cons 'NFS_Broken==True'
› Could specify customized output (i.e. a report) for condor_status to display the broken machines
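A minimal sketch of what detect.sh might look like; the mount point /mnt/nfs, the 10-second timeout, and the 5-minute period are illustrative assumptions, not from the slides:

  # condor_config.local (assumed period; knob form STARTD_CRON_<tag>_PERIOD)
  STARTD_CRON_tag_PERIOD = 5m

  #!/bin/sh
  # detect.sh: startd cron hooks print "Attribute = Value" lines on stdout,
  # and the startd merges them into the slot ads.
  if timeout 10 ls /mnt/nfs > /dev/null 2>&1; then
      echo "NFS_Broken = False"
  else
      echo "NFS_Broken = True"
  fi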

Q: How many CPU cores are being utilized? (slide 7)

› Sum the Cpus attribute for each slot that is Claimed and Busy:

  % condor_status -startd \
      -cons 'State=="Claimed" && Activity!="Idle"' \
      -af Cpus | less
  1
  4
  ...

  % condor_status -startd \
      -cons 'State=="Claimed" && Activity!="Idle"' \
      -af Cpus | st
  N  min  max  sum  mean  stddev
  (st: simple statistics from the command line; numeric output elided here)

Graph of CPU utilization over time (slide 8)

› Could have a cron job run every minute...

  #!/bin/sh
  echo -n "`date`, "
  condor_status \
    -cons 'State=="Claimed" && Activity!="Idle"' \
    -af Cpus | st --sum

› What if you have hundreds or thousands of metrics?
   COLLECTOR_QUERY_WORKERS = 5000?
› How about querying the collector just once per minute for all the attributes needed to compute all metrics?

Ganglia and condor_gangliad (slide 9)

› condor_gangliad queries the condor_collector once per minute
    DAEMON_LIST = MASTER, GANGLIAD, ...
› condor_gangliad has a config file to filter and aggregate attributes from the ads in the condor_collector in order to form metrics
› It forwards these metrics to Ganglia, which stores the values in a database and provides graphs over the web
(A minimal configuration sketch follows; example metric definitions are on the next slide.)
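As a rough sketch, enabling the gangliad on the central manager might look like the following; the knob names are taken from the HTCondor manual as I recall them, but the directory path and values are assumptions worth double-checking for your version:

  # condor_config.local on the central manager
  DAEMON_LIST = $(DAEMON_LIST) GANGLIAD

  # Directory of metric definition files like the ones on the next slide
  GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d

  # Publish metrics for every host in the pool, not only hosts gmond already reports
  GANGLIA_SEND_DATA_FOR_ALL_HOSTS = True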

Example metric definitions in condor_gangliad (slide 10)

  [
    Name = "CpusInUse";
    Aggregate = "SUM";
    Value = Cpus;
    Requirements = State=="Claimed" && Activity!="Idle";
    TargetType = "Machine";
  ]
  [
    Name = "CpusNotInUse";
    Aggregate = "SUM";
    Value = Cpus;
    Requirements = State!="Claimed" || Activity=="Idle";
    TargetType = "Machine";
  ]

Add a graph to a view dashboard: /var/lib/ganglia/view_Miron.json (slide 11)

  {
    "aggregate_graph":"true",
    "host_regex":[ {"regex":"cm.chtc.wisc.edu"} ],
    "metric_regex":[ {"regex":"Cpus(InUse|NotInUse)"} ],
    "graph_type":"stack",
    "vertical_label":"cores",
    "title":"CPU Core Utilization"
  },

Voila! (slide 12)

Why are cores not in use? (slide 13)

  [
    Name = "CpusNotInUse_LowMemory";
    Aggregate = "SUM";
    Value = Cpus;
    Requirements = State=="Unclaimed" && Memory < 1024;
    TargetType = "Machine";
  ]
  [
    Name = "CpusNotInUse_Draining";
    Aggregate = "SUM";
    Value = Cpus;
    Requirements = State=="Drained";
    TargetType = "Machine";
  ]

Unused Core Reasons (slide 14)

Memory Provisioned (slide 15)

  [
    Name = "MemoryInUse";
    Aggregate = "SUM";
    Value = Memory;
    Requirements = State=="Claimed" && Activity!="Idle";
    TargetType = "Machine";
  ]
  [
    Name = "MemoryNotInUse";
    Aggregate = "SUM";
    Value = Memory;
    Requirements = State!="Claimed" || Activity=="Idle";
    TargetType = "Machine";
  ]

Memory Used vs Provisioned (slide 16)

In condor_config.local:

  STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) MemoryUsage

Then define a MemoryEfficiency metric as:

  [
    Name = "MemoryEfficiency";
    Aggregate = "AVG";
    Value = real(MemoryUsage)/Memory*100;
    Requirements = MemoryUsage > 0.0;
    TargetType = "Machine";
  ]

Example: Metrics Per User (slide 17)

  [
    Name = strcat(RemoteUser,"-UserMemoryEfficiency");
    Title = strcat(RemoteUser," Memory Efficiency");
    Aggregate = "AVG";
    Value = real(MemoryUsage)/Memory*100;
    Requirements = MemoryUsage > 0.0;
    TargetType = "Machine";
  ]

Dashboard(s) of useful charts (slide 18)

New Hotness: Grafana (slide 19)

› Grafana
   Open source
   Makes pretty and interactive dashboards from popular backends including Graphite's Carbon, InfluxDB, and (very recently) ElasticSearch
   Easy for individual users to create their own custom persistent graphs and dashboards
› condor_gangliad -> Ganglia -> Graphite

  # gmetad.conf - forward metrics to Carbon via UDP
  carbon_server "mongodbtest.chtc.wisc.edu"
  carbon_port 2003
  carbon_protocol udp
  graphite_prefix "ganglia"

(slide 20)

Adding a Grafana graph (Graphite) (slide 21)

Adding a Grafana graph (InfluxDB) (slide 22)

What sort of attributes are available? (slide 23)

› Lots of good attributes in the collector by default; browse via
    condor_status -schedd -l
    condor_status -submitter -l
    condor_status -startd -l
› Lots more available via HTCondor "Statistics"
   Especially in the schedd and collector
   condor_status -direct -schedd -statistics all:2
   Send them to the collector via the knobs STATISTICS_TO_PUBLISH and STATISTICS_TO_PUBLISH_LIST
   All kinds of output, mostly aggregated
   See TJ or the manual for details
(A small configuration sketch follows.)
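As a hedged sketch of those knobs in use: the "<subsystem>:<level>" syntax below is my reading of the manual, and the attribute queried afterwards is one I would expect the schedd to publish, so verify both against your HTCondor version:

  # condor_config.local on a submit node: publish more verbose schedd statistics
  STATISTICS_TO_PUBLISH = SCHEDD:2

  # Then project an interesting statistic from anywhere in the pool
  condor_status -schedd -af Name RecentJobsCompleted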

Condor Statistics, cont.: disaggregated stats (slide 24)

  SCHEDD_COLLECT_STATS_FOR_Gthain = (Owner=="gthain")

The schedd will maintain a distinct set of statistics for the jobs matching this expression, published with the given name as a prefix:

  Gthain_JobsCompleted = 7
  Gthain_JobsStarted = 100

Even better (slide 25)

  SCHEDD_COLLECT_STATS_BY_Owner = Owner

For all owners, collect & publish stats:

  Owner_gthain_JobsStarted = 7
  Owner_tannenba_JobsStarted = 100

(A query sketch for browsing these follows.)
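A quick, hedged way to browse the resulting per-owner attributes, assuming they appear in the schedd ads at the collector (they may also depend on the STATISTICS_TO_PUBLISH knobs from slide 23):

  # List every per-owner JobsStarted counter published by the schedds
  condor_status -schedd -long | grep '_JobsStarted'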

RecentDaemonCoreDutyCycle (slide 26)

› Todd's favorite statistic for watching the health of submit points (schedds) and the central manager (collector)
› Measures the fraction of time the daemon is not idle
› If it goes above 98%, your schedd or collector is saturated
(A command-line query sketch follows.)
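A minimal sketch for keeping an eye on it from the shell, assuming the attribute is published in the schedd and collector ads (the value is a fraction, so roughly 0.98 corresponds to the 98% above):

  # Duty cycle of every schedd in the pool
  condor_status -schedd -af Name RecentDaemonCoreDutyCycle

  # Duty cycle of the collector(s)
  condor_status -collector -af Name RecentDaemonCoreDutyCycle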

Individual job monitoring (slide 27)

Job event log (log = foo.log in the submit file; a minimal submit file sketch follows this excerpt). Job IDs and several values are elided here:

  001 (...) 07/30 16:39:50 Job executing on host: <...>
  ...
  ... (...) 07/30 16:39:58 Image size of job updated: ...
          ... - MemoryUsage of job (MB)
          ... - ResidentSetSize of job (KB)
  ...
  ... (...) 07/30 16:40:02 Job was evicted.
          (0) Job was not checkpointed.
          Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
          ... - Run Bytes Sent By Job
          ... - Run Bytes Received By Job
          Partitionable Resources : Usage  Request  Allocated
             Cpus        :                 1        1
             Disk (KB)   :    ...
             Memory (MB) :    ...
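For reference, a minimal submit file that produces such a job event log; the file names are placeholders:

  # minimal.sub (illustrative)
  executable = myjob.sh
  log        = foo.log
  output     = foo.out
  error      = foo.err
  queue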

Job Monitoring, cont. (slide 28)

› Schedd event log (rotates)
   Union of all job event logs for all jobs on a schedd
   Config knob: EVENT_LOG = /some/file
› Audit log (rotates)
   Provenance of modifications to any job
   Config knob: SCHEDD_AUDIT_LOG = /file
› History file (rotates)
   Schedd history: all job ads that left the queue (HISTORY = /file)
   Startd history: all job ads that used a slot (STARTD_HISTORY = /file)
   View with condor_history (local or remote); a usage sketch follows
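A couple of hedged condor_history invocations to illustrate local and remote history queries; the host name and attribute list are illustrative:

  # Last 10 jobs that left the local queue, with their memory usage
  condor_history -limit 10 -af ClusterId Owner MemoryUsage

  # Query the history of a remote schedd (hypothetical host name)
  condor_history -name submit.example.org -limit 10 -af ClusterId Owner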

condor_pool_job_report (slide 29)

History Ads Pushed to ElasticSearch (slide 30)

› CMS pushes 1M history ads per day
   4 GB of disk/day; server: 64 GB RAM, 4 cores

Upcoming (slide 31)

› condor_gangliad -> condor_metricd
   Send aggregated metrics to Ganglia
   Write out aggregated metrics to rotating JSON files
   Callouts per metric (today) and per metric batch (tomorrow)
   Send aggregated metrics to Graphite? InfluxDB?
   More input sources: today only ads from the collector; tomorrow, history data?
› HTCondor View replacement

Thank you! (slide 32)