HTCondor stats John (TJ) Knoeller Condor Week 2016.

Slides:



Advertisements
Similar presentations
CS0007: Introduction to Computer Programming
Advertisements

1 Concepts of Condor and Condor-G Guy Warner. 2 Harvesting CPU time Teaching labs. + Researchers Often-idle processors!! Analyses constrained by CPU time!
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
Jim Basney Computer Sciences Department University of Wisconsin-Madison Managing Network Resources in.
Condor Project Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.
Process Description and Control Chapter 3. Major Requirements of an OS Interleave the execution of several processes to maximize processor utilization.
Track 1: Cluster and Grid Computing NBCR Summer Institute Session 2.2: Cluster and Grid Computing: Case studies Condor introduction August 9, 2006 Nadya.
Condor Tugba Taskaya-Temizel 6 March What is Condor Technology? Condor is a high-throughput distributed batch computing system that provides facilities.
Progress Report Barnett Chiu Glidein Code Updates and Tests (1) Major modifications to condor_glidein code are as follows: 1. Command Options:
Peter Keller Computer Sciences Department University of Wisconsin-Madison Quill Tutorial Condor Week.
Grid Computing I CONDOR.
Part 6: (Local) Condor A: What is Condor? B: Using (Local) Condor C: Laboratory: Condor.
Grid job submission using HTCondor Andrew Lahiff.
Research Computing Environment at the University of Alberta Diego Novillo Research Computing Support Group University of Alberta April 1999.
Lecture – Performance Performance management on UNIX.
Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Quill / Quill++ Tutorial.
Dan Bradley University of Wisconsin-Madison Condor and DISUN Teams Condor Administrator’s How-to.
5 1 Data Files CGI/Perl Programming By Diane Zak.
HTCondor and Workflows: An Introduction HTCondor Week 2015 Kent Wenger.
Using Map-reduce to Support MPMD Peng
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
CS0007: Introduction to Computer Programming The for Loop, Accumulator Variables, Seninel Values, and The Random Class.
TANYA LEVSHINA Monitoring, Diagnostics and Accounting.
HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison.
HTCondor Advanced Job Submission John (TJ) Knoeller Center for High Throughput Computing.
Company LOGO Security in Linux PhiHDN - VuongNQ. Contents Introduction 1 Fundamental Concepts 2 Security System Calls in Linux 3 Implementation of Security.
Monitoring Primer HTCondor Workshop Barcelona Todd Tannenbaum Center for High Throughput Computing University of Wisconsin-Madison.
Monitoring Primer HTCondor Week 2017 Todd Tannenbaum Center for High Throughput Computing University of Wisconsin-Madison.
CHTC Policy and Configuration
Monitoring Windows Server 2012
Debugging Common Problems in HTCondor
Improvements to Configuration
Condor DAGMan: Managing Job Dependencies with Condor
Scheduling Policy John (TJ) Knoeller Condor Week 2017.
More HTCondor Monday AM, Lecture 2 Ian Ross
Quick Architecture Overview INFN HTCondor Workshop Oct 2016
Intermediate HTCondor: Workflows Monday pm
Scheduling Policy John (TJ) Knoeller Condor Week 2017.
Examples Example: UW-Madison CHTC Example: Global CMS Pool
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Primer for Site Debugging
Things you may not know about HTCondor
IW2D migration to HTCondor
High Availability in HTCondor
Things you may not know about HTCondor
Moving from CREAM CE to ARC CE
Chapter 2 Assignment and Interactive Input
Adding High Availability to Condor Central Manager Tutorial
Monitoring HTCondor with Ganglia
Review If you want to display a floating-point number in a particular format use The DecimalFormat Class printf A loop is… a control structure that causes.
Condor: Job Management
The Scheduling Strategy and Experience of IHEP HTCondor Cluster
HTCondor Command Line Monitoring Tool
Troubleshooting Your Jobs
Negotiator Policy and Configuration
Accounting, Group Quotas, and User Priorities
What’s New in DAGMan HTCondor Week 2013
Chapter 5: CPU Scheduling
Basic Grid Projects – Condor (Part I)
Process Description and Control
Introduction to High Throughput Computing and HTCondor
HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.
The Condor JobRouter.
MAT 4830 Mathematical Modeling
Late Materialization has (lately) materialized
5/8/2019 3:20 AM bQuery-Tool 3.0 A new and elegant way to create queries and ad-hoc reports on your Baan/Infor ERP LN data. This Baan session is a query.
Overview Development of an advanced submit file
Troubleshooting Your Jobs
PU. Setting up parallel universe in your pool and when (not
Introduction to research computing using Condor
Presentation transcript:

HTCondor stats John (TJ) Knoeller Condor Week 2016

› Daemons track count and runtime stats › Stats are assigned verbosity levels 0 = Chatty 1 = Normal 2 = Verbose 3 = Never › Stats are published in daemon ad STATISTICS_TO_PUBLISH = : General statistics 2

› Basic control of stats to collector  Categories are DC – Commands, Timers, Signals, DNS Resolver SCHEDD – Job counters TRANSFER – Transfer Queue counters COLLECTOR – Update counters ALL – All categories DEFAULT – Any category not otherwise specified STATISTICS_TO_PUBLISH = DEFAULT:0, SCHEDD:1 STATISTICS_TO_PUBLISH 3

› Fine control of stats to collector › Set to a list of attribute names › Changes verbosity of stats that match to equal the STATISTICS_TO_PUBLISH verbosity STATISTICS_TO_PUBLISH_LIST 4

Condor_status –statistics : › Override STATISTICS_TO_PUBLISH › Use with condor_status –direct –schedd  ‘live’ DC, SCHEDD and/or TRANSFER stats › Use with condor_status –collector  ‘live’ DC and/or COLLECTOR stats Condor_status -statistics 5

› Time spent on DNS Lookups  Special counter for ‘slow’ lookups › Counter for ResourceRequestsSent › Per-user transfer stats in Submitter ads Statistics 6

› Absolute values  ShadowsRunning › Counters – count up as long as the daemon is alive  JobsCompleted › Runtime counters – count up and accumulate time  DCNameResolve / DCNameResolveRuntime › Moving averages – count up with windowed averages  FileTransferUploadBytesPerSecond_1h › “Miron” probes – count / min / max / avg / std  DCCommand_NEGOTIATERuntimeAvg Types of statistics probes 7

› Runtime counters  Foo = count, FooRuntime = runtime › Recent window stats  Prefix “Recent” is stat over RecentWindowMax › Moving averages  Foo = count, FooPerSecond_ = avg  Can be multiple windows Most stats are multiple attrs 8

› “Miron” runtime probes  Foo = count  FooRuntime = accumulated time  FooRuntimeMin  FooRuntimeMax  FooRuntimeAvg  FooRuntimeStd › See DCCommand_RESCHEDULE Most stats are multiple attrs(2) 9

› Use condor_status to get recent-ish stats  “Recent” prefix attributes are quick ‘n dirty  RecentWindowMax – span of “Recent” time  RecentStatsTickTime – last “Recent” update › Use condor_status –direct –statistics  Difference of 2 queries is most accurate view  Works with schedd, submitter ads  StatsLastUpdateTime – last stats update Querying statistics 10

› DC category stats start with DC  Mostly runtime stats DCCommand, DCTimer, DCPumpCycle, etc RecentDamonCoreDutyCycle › SCHEDD category stats start with  Jobs – job counters updated on shadow exit  SC – Schedd runtime stats › TRANSFER category stats start with  FileTransfer – byte counts, transfer rates  TransferQueue – use counts for transfer queue Taxonomy of stats names 11

› Running, Idle and Held Job counts by universe › “Jobs” prefix  Counts by exit code  Runtime & Badput time › “FileTransfer”  Bytes and rates in and out  Bytes and rates for each submitter What do we measure 12

› “Transfer” prefix  Transfer queue usage › “SC” prefix  Schedd task execution time › “Updates” prefix  Collector update rates › “DC” prefix  execution time for commands, DNS, fsync  Daemon health What do we measure (2) 13

› For ordinary users, condor_q will default to showing only jobs from that user unless  User is a queue superuser (root, condor)  -allusers arg is used  or earlier SCHEDD  CONDOR_Q_ONLY_MY_JOBS = false For tool OR for SCHEDD config Show only my jobs 14

› One line of output for each batch of jobs where 'batch' is  A DAG and all of it's children  A Cluster and all of it's procs  A Clusters that have the same Owner and Cmd attributes. (Cmd is Executable) › -batch will be the new default in 8.6  -unbatch to get pre 8.6 output condor_q -batch 15

› Accountant ads in the collector › Revamped keyboard / mouse detection  Using X screen saver api  Changed defaults to protect all slots › Condor_status & condor_q pay attention to terminal width, truncate data less. Random stuff 16