HTCondor stats John (TJ) Knoeller Condor Week 2016.

Slides:

Advertisements

Similar presentations

CS0007: Introduction to Computer Programming

Advertisements

1 Concepts of Condor and Condor-G Guy Warner. 2 Harvesting CPU time Teaching labs. + Researchers Often-idle processors!! Analyses constrained by CPU time!

Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.

Jim Basney Computer Sciences Department University of Wisconsin-Madison Managing Network Resources in.

Condor Project Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.

Process Description and Control Chapter 3. Major Requirements of an OS Interleave the execution of several processes to maximize processor utilization.

Track 1: Cluster and Grid Computing NBCR Summer Institute Session 2.2: Cluster and Grid Computing: Case studies Condor introduction August 9, 2006 Nadya.

Condor Tugba Taskaya-Temizel 6 March What is Condor Technology? Condor is a high-throughput distributed batch computing system that provides facilities.

Progress Report Barnett Chiu Glidein Code Updates and Tests (1) Major modifications to condor_glidein code are as follows: 1. Command Options:

Peter Keller Computer Sciences Department University of Wisconsin-Madison Quill Tutorial Condor Week.

Grid Computing I CONDOR.

Part 6: (Local) Condor A: What is Condor? B: Using (Local) Condor C: Laboratory: Condor.

Grid job submission using HTCondor Andrew Lahiff.

Research Computing Environment at the University of Alberta Diego Novillo Research Computing Support Group University of Alberta April 1999.

Lecture – Performance Performance management on UNIX.

Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Quill / Quill++ Tutorial.

Dan Bradley University of Wisconsin-Madison Condor and DISUN Teams Condor Administrator’s How-to.

5 1 Data Files CGI/Perl Programming By Diane Zak.

HTCondor and Workflows: An Introduction HTCondor Week 2015 Kent Wenger.

Using Map-reduce to Support MPMD Peng

Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.

CS0007: Introduction to Computer Programming The for Loop, Accumulator Variables, Seninel Values, and The Random Class.

TANYA LEVSHINA Monitoring, Diagnostics and Accounting.

HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison.

HTCondor Advanced Job Submission John (TJ) Knoeller Center for High Throughput Computing.

Company LOGO Security in Linux PhiHDN - VuongNQ. Contents Introduction 1 Fundamental Concepts 2 Security System Calls in Linux 3 Implementation of Security.

Monitoring Primer HTCondor Workshop Barcelona Todd Tannenbaum Center for High Throughput Computing University of Wisconsin-Madison.

Monitoring Primer HTCondor Week 2017 Todd Tannenbaum Center for High Throughput Computing University of Wisconsin-Madison.

CHTC Policy and Configuration

Monitoring Windows Server 2012

Debugging Common Problems in HTCondor

Improvements to Configuration

Condor DAGMan: Managing Job Dependencies with Condor

Scheduling Policy John (TJ) Knoeller Condor Week 2017.

More HTCondor Monday AM, Lecture 2 Ian Ross

Quick Architecture Overview INFN HTCondor Workshop Oct 2016

Intermediate HTCondor: Workflows Monday pm

Scheduling Policy John (TJ) Knoeller Condor Week 2017.

Examples Example: UW-Madison CHTC Example: Global CMS Pool

Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)

Primer for Site Debugging

Things you may not know about HTCondor

IW2D migration to HTCondor

High Availability in HTCondor

Things you may not know about HTCondor

Moving from CREAM CE to ARC CE

Chapter 2 Assignment and Interactive Input

Adding High Availability to Condor Central Manager Tutorial

Monitoring HTCondor with Ganglia

Review If you want to display a floating-point number in a particular format use The DecimalFormat Class printf A loop is… a control structure that causes.

Condor: Job Management

The Scheduling Strategy and Experience of IHEP HTCondor Cluster

HTCondor Command Line Monitoring Tool

Troubleshooting Your Jobs

Negotiator Policy and Configuration

Accounting, Group Quotas, and User Priorities

What’s New in DAGMan HTCondor Week 2013

Chapter 5: CPU Scheduling

Basic Grid Projects – Condor (Part I)

Process Description and Control

Introduction to High Throughput Computing and HTCondor

HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.

The Condor JobRouter.

MAT 4830 Mathematical Modeling

Late Materialization has (lately) materialized

5/8/2019 3:20 AM bQuery-Tool 3.0 A new and elegant way to create queries and ad-hoc reports on your Baan/Infor ERP LN data. This Baan session is a query.

Overview Development of an advanced submit file

Troubleshooting Your Jobs

PU. Setting up parallel universe in your pool and when (not

Introduction to research computing using Condor

Presentation transcript:

HTCondor stats John (TJ) Knoeller Condor Week 2016

› Daemons track count and runtime stats › Stats are assigned verbosity levels 0 = Chatty 1 = Normal 2 = Verbose 3 = Never › Stats are published in daemon ad STATISTICS_TO_PUBLISH = : General statistics 2

› Basic control of stats to collector  Categories are DC – Commands, Timers, Signals, DNS Resolver SCHEDD – Job counters TRANSFER – Transfer Queue counters COLLECTOR – Update counters ALL – All categories DEFAULT – Any category not otherwise specified STATISTICS_TO_PUBLISH = DEFAULT:0, SCHEDD:1 STATISTICS_TO_PUBLISH 3

› Fine control of stats to collector › Set to a list of attribute names › Changes verbosity of stats that match to equal the STATISTICS_TO_PUBLISH verbosity STATISTICS_TO_PUBLISH_LIST 4

Condor_status –statistics : › Override STATISTICS_TO_PUBLISH › Use with condor_status –direct –schedd  ‘live’ DC, SCHEDD and/or TRANSFER stats › Use with condor_status –collector  ‘live’ DC and/or COLLECTOR stats Condor_status -statistics 5

› Time spent on DNS Lookups  Special counter for ‘slow’ lookups › Counter for ResourceRequestsSent › Per-user transfer stats in Submitter ads Statistics 6

› Absolute values  ShadowsRunning › Counters – count up as long as the daemon is alive  JobsCompleted › Runtime counters – count up and accumulate time  DCNameResolve / DCNameResolveRuntime › Moving averages – count up with windowed averages  FileTransferUploadBytesPerSecond_1h › “Miron” probes – count / min / max / avg / std  DCCommand_NEGOTIATERuntimeAvg Types of statistics probes 7

› Runtime counters  Foo = count, FooRuntime = runtime › Recent window stats  Prefix “Recent” is stat over RecentWindowMax › Moving averages  Foo = count, FooPerSecond_ = avg  Can be multiple windows Most stats are multiple attrs 8

› “Miron” runtime probes  Foo = count  FooRuntime = accumulated time  FooRuntimeMin  FooRuntimeMax  FooRuntimeAvg  FooRuntimeStd › See DCCommand_RESCHEDULE Most stats are multiple attrs(2) 9

› Use condor_status to get recent-ish stats  “Recent” prefix attributes are quick ‘n dirty  RecentWindowMax – span of “Recent” time  RecentStatsTickTime – last “Recent” update › Use condor_status –direct –statistics  Difference of 2 queries is most accurate view  Works with schedd, submitter ads  StatsLastUpdateTime – last stats update Querying statistics 10

› DC category stats start with DC  Mostly runtime stats DCCommand, DCTimer, DCPumpCycle, etc RecentDamonCoreDutyCycle › SCHEDD category stats start with  Jobs – job counters updated on shadow exit  SC – Schedd runtime stats › TRANSFER category stats start with  FileTransfer – byte counts, transfer rates  TransferQueue – use counts for transfer queue Taxonomy of stats names 11

› Running, Idle and Held Job counts by universe › “Jobs” prefix  Counts by exit code  Runtime & Badput time › “FileTransfer”  Bytes and rates in and out  Bytes and rates for each submitter What do we measure 12

› “Transfer” prefix  Transfer queue usage › “SC” prefix  Schedd task execution time › “Updates” prefix  Collector update rates › “DC” prefix  execution time for commands, DNS, fsync  Daemon health What do we measure (2) 13

› For ordinary users, condor_q will default to showing only jobs from that user unless  User is a queue superuser (root, condor)  -allusers arg is used  or earlier SCHEDD  CONDOR_Q_ONLY_MY_JOBS = false For tool OR for SCHEDD config Show only my jobs 14

› One line of output for each batch of jobs where 'batch' is  A DAG and all of it's children  A Cluster and all of it's procs  A Clusters that have the same Owner and Cmd attributes. (Cmd is Executable) › -batch will be the new default in 8.6  -unbatch to get pre 8.6 output condor_q -batch 15

› Accountant ads in the collector › Revamped keyboard / mouse detection  Using X screen saver api  Changed defaults to protect all slots › Condor_status & condor_q pay attention to terminal width, truncate data less. Random stuff 16