Monitoring with Clustered Graphite & Grafana


Monitoring with Clustered Graphite & Grafana
Metrics data analytics at the RACF for HTCondor & more
William Strecker-Kellogg
HEPiX, October 2016

Overview
Goals: revamp batch monitoring
Outline:
- Current monitoring status
- Evaluation of new tools for storage, presentation, & collection of metrics
- Time-series database selection, frontend selection, & metrics-gatherer selection
- HTCondor-specific metrics collection
- Tools and examples
- Conclusions

Current Condor Monitoring
- RRDTool + scripts
  - Custom Python script does the gathering
  - What to gather is hardcoded in the scripts!
  - Remap q1…q64 → meaningful names at plot time (messy)
- Ganglia: stable, not going anywhere, OK, …

Current Condor Monitoring
- Our monitoring script
  - Uses a custom parser/query language for ClassAds
  - Configuration is hardcoded
  - Just look at how great it is!…
- Plotting is worse
- But… how we classify jobs is unique per experiment

Evaluating Time-Series DBs
- InfluxDB
  - Went "open core": clustering only in the paid product… (╯°□°)╯︵┻━┻
- OpenTSDB
  - Complex setup… requires Hadoop/HDFS/HBase
- Graphite
  - Simple design, easy deployment
  - Performant (50k+ points/sec)
  - Best query API (sketch below)
- StatsD
  - Not a TSDB, but a useful aggregator / cache / frontend to several TSDBs
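Part of the "best query API" claim is graphite-web's render endpoint, which applies functions like summarize() server-side and returns JSON. A minimal sketch, assuming a graphite-web instance at graphite.example.com and a hypothetical metric path:

    import json
    import urllib.parse
    import urllib.request

    # Ask graphite-web for the last hour of a (hypothetical) metric,
    # aggregated server-side into 10-minute averages.
    params = urllib.parse.urlencode({
        "target": "summarize(condor.atlas.running,'10min','avg')",
        "from": "-1h",
        "format": "json",
    })
    url = "http://graphite.example.com/render?" + params
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    for s in series:
        # Each series is {"target": ..., "datapoints": [[value, ts], ...]}.
        print(s["target"], s["datapoints"][:3])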

Graphite Architecture
Four components: relay, cache, web, & DB
- Cluster node:
  - 1 x relay → N x caches
  - N x caches → 1 x DB
  - 1 x web ← DB + caches
- Inter-node: any architecture of relays (consistent hashing)
- Upstream:
  - Direct clients (collectd, Ganglia, etc.; plaintext protocol sketched below)
  - StatsD
- DB (whisper): RRD-like, 1 file per metric; the directory structure doubles as a fast index
- Cache (carbon-cache): holds points in memory & throttles disk IO
  - Configurable IO load; responds to web queries
- Web (graphite-web): WSGI app, GUI and REST API (used by Grafana)
  - Merges results from across the cluster
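For the "direct clients" path, Carbon accepts a plaintext protocol: one "path value timestamp" line per metric, over TCP to a relay or cache. A minimal sketch in Python; the relay hostname and metric path are placeholders:

    import socket
    import time

    # Carbon's plaintext protocol, sent to the relay's default plaintext
    # port (2003). Host and metric path below are placeholders.
    def send_metric(path, value, host="graphite-relay.example.com", port=2003):
        line = "%s %s %d\n" % (path, value, int(time.time()))
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric("condor.atlas.running", 1234)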

Graphite Architecture (continued)
[Cluster diagram: HAProxy can sit in front; arbitrary layers of relays and replication factors in the middle; Grafana queries the web tier.]

Frontend: Grafana
Context:
- Kibana vs. Grafana is roughly Elasticsearch vs. Graphite: logs vs. metrics
- Grafana grew out of Kibana
- Main goal is dashboard creation & sharing
Features:
- Share socially
- Define alerts
- Templating on variables and sources
- Users & organizations
  - User-to-org mapping is many-to-many
  - But… no multi-org dashboards
- Data sources: proxy or passthrough, but per-org
  - InfluxDB, Prometheus, Graphite, Elasticsearch, OpenTSDB & more…
- Flexible auth
- Dashboard permissions can be read-only, copy, or read-write

Query Interface
[Screenshot: Grafana's query builder for Graphite, showing a template variable, autofill, and the option to edit the raw query string.]

Grafana Dashboards
[Screenshots of example Grafana dashboards.]

Evaluating Data Collection
- Ganglia has a Graphite component
  - Don't use TCP mode!
- Want application-aware monitoring + a customizable framework
- Collectd has many, many plugins
  - Somewhat complex to build packages
  - Widely used, large community, well supported
- Diamond has many plugins
  - Extensible with custom plugins (see the sketch below)
  - Handlers for sending data to dozens of places
  - Fairly widely used, etc.
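Diamond's extensibility comes from its collector API: a custom collector is a small Python class whose collect() method publishes values through the configured handlers. A minimal sketch, with the metric and its value as placeholders:

    import diamond.collector

    class CondorSlotCollector(diamond.collector.Collector):
        """Hypothetical collector publishing one gauge per poll."""

        def collect(self):
            # A real collector would derive this from condor_status,
            # a log file, etc.; the value here is a placeholder.
            running_jobs = 42
            # publish() sends the metric through whatever handlers are
            # configured (Graphite, StatsD, ...), under this collector's
            # configured path prefix.
            self.publish("running_jobs", running_jobs)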

Evaluating Monitoring Clients (HTCondor)
- HEPiX batch-monitoring working group
  - Gathering data from the community
  - No community standard; current solutions are generally patched together
- Central vs. distributed collection
  - Write plugins for Ganglia/collectd?
  - In HTCondor, most state lives in the collector
  - Exception: the schedds, but at O(10) schedds per experiment they are easy to poll (see the sketch below)
- What do we care about?
  - How many jobs of each type/user over time
  - Running and idle (where / how many)
  - Performance self-reporting
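Polling the O(10) schedds is straightforward with the htcondor Python bindings; a minimal sketch that locates the schedds through the collector and counts running vs. idle jobs (the central-manager hostname is a placeholder):

    import htcondor

    # Locate every schedd advertised to the pool's collector.
    coll = htcondor.Collector("condor-cm.example.com")
    schedd_ads = coll.locateAll(htcondor.DaemonTypes.Schedd)

    for ad in schedd_ads:
        jobs = htcondor.Schedd(ad).query("true", ["JobStatus", "Owner"])
        # JobStatus: 1 = idle, 2 = running.
        running = sum(1 for j in jobs if j.get("JobStatus") == 2)
        idle = sum(1 for j in jobs if j.get("JobStatus") == 1)
        print(ad["Name"], "running:", running, "idle:", idle)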

HTCondor Monitoring: Fifemon
- Project developed at Fermilab
- Monitors HTCondor by polling daemon & slot usage statistics
- Heavily modified by me to allow custom hierarchies of ClassAds
  - Only wanted some of the data
  - Certain ClassAds we need to monitor
- Reports to Graphite & InfluxDB

HTCondor Monitoring
- Collects Condor jobs from the startds:
  <experiment>.<slot-type>.<state>.<custom1…n>.<group>.<user>
- Schedd + collector + negotiator stats:
  <experiment>.<daemon>.<stat-name>
- User priorities + usage:
  <experiment>.<domain>.<group>.<user>.<usage|prio…>
- Tip: keep wildcards "above" high-multiplicity points in the hierarchy (illustrated below)
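A hypothetical illustration of the wildcard tip, using the startd hierarchy above (all names are placeholders):

    # Cheap: the wildcard sits at the low-multiplicity <slot-type> level,
    # so it expands to only a few series.
    cheap = "sumSeries(atlas.*.running.default.analysis.someuser)"
    # Expensive: a wildcard at the high-multiplicity <user> level expands
    # to one series (one whisper file) per user.
    expensive = "sumSeries(atlas.static.running.default.analysis.*)"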

HTCondor Resource Usage Monitoring (CGroups)
- Condor keeps track of some resource usage
  - CPU time & memory
  - Can use cgroup-based process tracking
- Cgroups "know" much more
  - Swap, efficiency, IO, etc.
  - Per job, per node
- I wrote a program to report HTCondor cgroup data to Graphite
  - Writes to Graphite or StatsD

HTCondor Resource Usage Monitoring (CGroups)
Why:
- Condor's view into user/sys CPU, RSS, etc. is incomplete
  - Delayed updating, incomplete picture
- cAdvisor (Google's cgroup-monitoring project) is powerful… and overkill
  - Docker container, embedded webserver, etc.
How:
- Written in C; CMake, RPM
  - Note: do not look into the depths of libcgroup!
- Reports to Graphite via TCP or UDP; also to StatsD (sketched below)
- Measurements are per-job
  - Alert on high memory / swap usage
  - Alert on low CPU efficiency
  - Can be extended to measure IO per device, etc.
- Run under cron, using Puppet's fqdn_rand() for splay
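The reporter itself is C, but the core loop is easy to sketch in Python: read a job's cgroup accounting files and ship them over Carbon's plaintext protocol. The cgroup-v1 paths, slice name, and metric names here are all assumptions for illustration; actual layouts vary by site and Condor version:

    import socket
    import time

    # Assumed cgroup-v1 layout; "htcondor/condor_job_123" stands in for
    # whatever slice condor creates for a particular job.
    CPU = "/sys/fs/cgroup/cpuacct/htcondor/condor_job_123/cpuacct.usage"
    MEM = "/sys/fs/cgroup/memory/htcondor/condor_job_123/memory.usage_in_bytes"

    def read_counter(path):
        with open(path) as f:
            return int(f.read().strip())

    now = int(time.time())
    lines = "".join(
        "condor.jobs.job_123.%s %d %d\n" % (name, read_counter(path), now)
        for name, path in (("cpu_ns", CPU), ("mem_bytes", MEM))
    )
    # Ship over carbon's plaintext protocol (relay host is a placeholder).
    with socket.create_connection(("graphite-relay.example.com", 2003), timeout=5) as s:
        s.sendall(lines.encode("ascii"))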

HTCondor Resource Usage Monitoring (Examples)
[Example dashboards of per-job resource-usage metrics.]

Conclusions (Opinions)
- Better to use well-supported community solutions
- Graphite is a stable and effective choice of TSDB
  - Good performance, the most powerful query interface
- Grafana is the most flexible dashboard
- HTCondor monitoring is kludgey, but Fifemon is a step in the right direction
  - Is condor_gangliad going to become more general-purpose?
- More work is needed on batch monitoring (HEPiX working group)
- For collection, built-in application-specific monitoring is nice…
  - …the ability to add custom monitors is nicer!
  - Ongoing evaluation / deployment of Diamond & collectd

Questions? Comments? Thank You For Your Attention!
William Strecker-Kellogg <willsk@bnl.gov>
Brookhaven National Lab, HEPiX Fall 2016 @ LBL