A Year of HTCondor Monitoring. Lincoln Bryant, Suchandra Thapa. HTCondor Week 2015, May 21, 2015.


Analytics vs. Operations
●Two parallel tracks in mind:
  o Operations
  o Analytics
●Operations needs to:
  o Observe trends.
  o Be alerted and respond to problems.
●Analytics needs to:
  o Store and retrieve the full ClassAds of every job.
  o Perform deep queries over the entire history of jobs.

Motivations
●For OSG Connect, we want to:
  o track HTCondor status
  o discover the job profiles of various users to determine what resources users need
  o determine when user jobs are failing and help users correct failures
  o create dashboards to provide high-level overviews
  o open our monitoring up to end users
  o be alerted when there are problems
●Existing tools like CycleServer, Cacti, Ganglia, etc. did not adequately cover our use case

Operations

Graphite
●Real-time graphing for time-series data
●Open source, developed by Orbitz
●Three main components:
  o Whisper - data format, replacement for RRD
  o Carbon - listener daemon
  o Front-end - web interface for metrics
●Dead simple protocol
  o Open a socket, fire off metrics in the form of: path.to.metric value timestamp
Graphite home page:

Sending HTCondor stats
●To try it out, you can just parse the output of "condor_q" into the desired format
●Then, simply use netcat to send it to the Graphite server

#!/bin/bash
metric="htcondor.running"
value=$(condor_q | grep R | wc -l)
timestamp=$(date +%s)
echo "$metric $value $timestamp" | nc graphite.yourdomain.edu 2003

A simple first metric
●Run the script with cron:
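The crontab entry itself was shown as a screenshot. Assuming the script above were saved as /usr/local/bin/condor_running.sh (a hypothetical path), a once-a-minute entry would look something like:

* * * * * /usr/local/bin/condor_running.sh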

Problems with parsing
●Parsing condor_q output is a heavyweight and potentially fragile operation
  o Especially if you do it once a minute
  o Be prepared for gaps in the data if your schedd is super busy
●What to do then?
  o Ask the daemons directly with the Python bindings!

Collecting Collector stats via Python
●Here we ask the collector for slots in "Claimed" state and sum them up:

import time
import classad, htcondor

coll = htcondor.Collector("htcondor.domain.edu")
slotState = coll.query(htcondor.AdTypes.Startd, "true",
                       ['Name', 'JobId', 'State', 'RemoteOwner', 'COLLECTOR_HOST_STRING'])

slot_claimed = 0
timestamp = int(time.time())
for slot in slotState:
    if slot['State'] == "Claimed":
        slot_claimed += 1
print "condor.claimed " + str(slot_claimed) + " " + str(timestamp)
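As a variation on the netcat approach from the earlier slide, the metric can also be shipped to Carbon directly from Python over a socket; a minimal sketch using Graphite's plaintext protocol (the hostname and port are placeholders):

import socket
import time

def send_metric(path, value, host="graphite.yourdomain.edu", port=2003):
    # Graphite plaintext protocol: "metric.path value timestamp\n"
    line = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((host, port))
    sock.sendall(line)
    sock.close()

send_metric("condor.claimed", slot_claimed)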

Sample HTCondor Collector summary
●Fire off a cron & come back to a nice plot:

Grafana
●Graphite is nice, but static PNGs are so Web 1.0
●Fortunately, Graphite can export raw JSON instead
●Grafana is an overlay for Graphite that renders the JSON data using flot
  o Nice HTML / JavaScript-based graphs
  o Admins can quickly assemble dashboards, carousels, etc.
  o Saved dashboards are backed by a NoSQL database
●All stored Graphite metrics should work out of the box
Grafana home page:
flot home page:
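Graphite's render endpoint serves the same JSON that Grafana consumes, which is handy for checking what a stored metric looks like; a small sketch (hostname and metric name are placeholders):

import requests

# Ask the Graphite render API for the last hour of a metric as raw JSON
resp = requests.get("http://graphite.yourdomain.edu/render",
                    params={"target": "condor.claimed", "from": "-1h", "format": "json"})

for series in resp.json():
    print series["target"], series["datapoints"][-5:]  # recent [value, timestamp] pairs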

Sample dashboard

An example trend
●A user's jobs were rapidly flipping between idle and running. Why?
●Turns out to be a problematic job with an aggressive periodic release:
  periodic_release = ((CurrentTime - EnteredCurrentStatus) > 60)
(Credit to Mats Rynge for pointing this out)

Active monitoring with Jenkins
●Nominally a continuous integration tool for building software
●Easily configured to submit simple Condor jobs instead
●Behaves similar to a real user
  o Grabs latest functional tests via git
  o Runs "condor_submit" for a simple job
  o Checks for correct output, notifies upon failure
●Plethora of integrations for notifying sysops of problems:
  o Email, IRC, XMPP, SMS, Slack, etc.

Jenkins monitoring samples
●Dashboard gives a reassuring all-clear:
●Slack integration logs & notifies support team of problems:

Operations - Lessons learned
●Using the HTCondor Python bindings for monitoring is just as easy, if not easier, than scraping condor_{q,status}
●If you plan to have a lot of metrics, the sooner you move to SSD(s), the better
●Weird oscillatory patterns, sudden drops in running jobs, and huge spikes in idle jobs can all be indicative of problems
●Continuous active testing + alerting infrastructure is key for catching problems before end-users do

Analytics

In the beginning...
●We started with a summer project where students would be visualizing HTCondor job data
●To speed up startup, we wrote a small Python script (~50 lines) that queried HTCondor for history information and added any new records to a MongoDB server
  o Intended so that students would have an existing data source
  o Ended up being used for much longer

Initial data visualization efforts
●Had a few students working on visualizing data over last summer
  o Generated a few visualizations using MongoDB and other sources
  o Tried importing data from MongoDB to Elasticsearch using Kibana
  o Used a few visualizations for a while but eventually stopped due to the maintenance required
●Created a homebrew system using MongoDB, Python, Highcharts and CherryPy

Current setup
●Probes to collect condor history from the log file and to check the schedd every minute
●Redis server providing pub/sub channels for the probes
●Logstash to follow the Redis channels and to insert data into Elasticsearch
●Elasticsearch cluster for storage and queries
●Kibana for user and operational dashboards, RStudio/Python scripts for more complicated analytics

Indexing job history information
A Python script polls the history logs periodically for new entries and publishes them to a Redis channel.
The ClassAds published to the Redis channel are read by Logstash.
Because of the size of the ClassAds and because Elasticsearch only works on data in memory, data goes into a new index each month.
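The probe itself is not shown in the slides; a minimal sketch of the pub/sub half, assuming a hypothetical Redis channel named condor_history that Logstash subscribes to (the hostname, channel name, and use of condor_history are illustrative, not the actual OSG Connect probe):

import json
import subprocess

import classad
import redis

# Connect to the Redis server that Logstash follows (hypothetical host)
r = redis.StrictRedis(host="redis.yourdomain.edu", port=6379)

# Grab recently completed jobs in long ClassAd format; the real probe tails the
# history log and tracks an offset so that only new records are published
history_text = subprocess.check_output(["condor_history", "-l", "-match", "100"])

for ad in classad.parseAds(history_text):
    # Stringify values so nested ClassAd expressions serialize cleanly to JSON
    doc = dict((k, str(v)) for k, v in ad.items())
    r.publish("condor_history", json.dumps(doc))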

Indexing schedd data
A Python script is run every minute by a cron job and collects the ClassAds for all jobs.
The complete set of job ClassAds is put into an ES index for that week.
The script also inserts a record with the number of jobs in each state into another ES index.
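A minimal sketch of the counting half of that probe, using the Python bindings against the local schedd (the JobStatus codes are standard HTCondor; the Elasticsearch insertion is omitted and the record layout is illustrative):

import time
import htcondor

# Map numeric JobStatus codes to readable names
STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed", 5: "held", 6: "transferring"}

schedd = htcondor.Schedd()
counts = dict((name, 0) for name in STATUS.values())

# Pull a minimal projection of every job ClassAd currently in the queue
for job in schedd.query("true", ["JobStatus"]):
    name = STATUS.get(job["JobStatus"], "unknown")
    counts[name] = counts.get(name, 0) + 1

summary = {"timestamp": int(time.time())}
summary.update(counts)
print summary  # the real probe inserts this record into a separate ES index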

Querying/Visualizing information
●All of this information is good, but we need a way of querying and visualizing it
●Luckily, Elasticsearch integrates with Kibana, which provides a web interface for querying/visualization

Kibana Overview (screenshot showing the query field, hits over time, a sampling of documents found, and the time range selector)

Plotting/Dashboards with Kibana
Kibana can also generate plots of the data as well as dashboards
Plots and dashboards can be exported!

Some (potentially interesting) plots
●Queries/plots developed in Kibana, data exported as a CSV file and plotted using RStudio/ggplot

Average number of job starts over time
94% of jobs succeed on the first attempt
Note: most jobs use periodic hold and periodic release ClassAds to retry failed jobs
Thus invalid submissions may result in multiple restarts, inflating this number

Average job duration on compute nodes
Plot of the average job duration on compute nodes; the majority of jobs complete within an hour, with a few projects having outliers (mean: 1050 s, median: 657 s)

Average bytes transferred per job using HTCondor
Most jobs transfer significantly less than 1 GB per job through HTCondor (mean: 214 MB, median: 71 MB)
Note: this does not take into account data transferred by other methods (e.g. wget)

Memory usage
Most jobs use less than 2 GB of memory, with the proteomics and snoplus projects having the highest average memory utilization
Mean use: 378 MB; median use: 118 MB

Other uses
●The analytics framework is not just for dashboards and pretty plots
●It can also be used to troubleshoot issues

Troubleshooting a job
In November 2014, there was a spike in the number of restarts; we would like to investigate this and see if we can determine the cause and a fix

Troubleshooting part 2
Select the time range in question
Split out the projects with the most restarts
The algdock project seems to be responsible for most of the restarts

Troubleshooting concluded
Next, we go to the Discover tab in Kibana, search for records in the relevant time frame, and look at the ClassAds. It quickly becomes apparent that the problem is due to a missing output file that results in the job being held, combined with a PeriodicRelease for the job.
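The same search can be scripted against the cluster; a minimal sketch with the elasticsearch Python client, assuming a monthly index named condor-2014.11, a ProjectName field on the job documents, and the @timestamp field that Logstash adds (host, index, and field names are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(["es.yourdomain.edu:9200"])

# Jobs from the algdock project during the November 2014 restart spike
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"ProjectName": "algdock"}},
                {"range": {"@timestamp": {"gte": "2014-11-01", "lte": "2014-11-30"}}}
            ]
        }
    }
}

results = es.search(index="condor-2014.11", body=query, size=10)
for hit in results["hits"]["hits"]:
    print hit["_source"].get("HoldReason", "(no hold reason)")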

Future directions
●Update the schedd probe to use the LogReader API to get ClassAd updates rather than querying the schedd for all ClassAds
●More and better analytics:
  o Explore the data more and pull out analytics that are useful for operational needs
  o Update user dashboards to present information that users are interested in

Links
●GitHub repo - https://github.com/DHTC-Tools/logstash-confs/tree/master/condor

Questions? Comments?